RAG Chatbot – MoviesGPT
What is RAG?
Retrieval-Augmented Generation (RAG) is a technique that improves the output of a Large Language Model (LLM) by supplying extra context alongside a user's input. The model combines its text-generation ability with that retrieved context to give more accurate answers to users' questions.
Why is RAG useful?
- Cost effective: grounding an existing model with retrieved context is far cheaper than fine-tuning or retraining it.
- Models have training cut-off dates, after which their knowledge is not updated; RAG can supply newer information.
- It compensates for information the model never saw during training.
What is Vector Embedding?
Vector embedding is a popular technique for representing information in a format that algorithms, especially deep learning models, can easily process. This information can be text, images, video, or audio.
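For intuition, here is a toy TypeScript sketch (the numbers are made up) showing that an embedding is just an array of floats, and that "semantic similarity" reduces to a vector operation such as the dot product:

```ts
// Toy example with made-up numbers: an embedding is an array of floats,
// and comparing meanings becomes comparing vectors.
const a: number[] = [0.12, -0.48, 0.33]; // e.g. embedding of "sci-fi movie"
const b: number[] = [0.10, -0.51, 0.30]; // e.g. embedding of "space adventure film"

const dot = (x: number[], y: number[]): number =>
  x.reduce((sum, xi, i) => sum + xi * y[i], 0);

console.log(dot(a, b)); // higher score => closer meaning
```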
Step-by-step workflow of MoviesGPT
Data Collection (Wikipedia Scraping)
- The project uses Puppeteer (via LangChain) to scrape Wikipedia pages containing lists of movies in various Indian languages for the year 2025.
- Each Wikipedia page’s content is fetched and cleaned of HTML tags.
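A minimal sketch of loading one such page with LangChain's PuppeteerWebBaseLoader (the import path varies across LangChain versions, and the URL here is illustrative):

```ts
import { PuppeteerWebBaseLoader } from "@langchain/community/document_loaders/web/puppeteer";

// Illustrative URL; the project scrapes one such list page per language.
const loader = new PuppeteerWebBaseLoader(
  "https://en.wikipedia.org/wiki/List_of_Hindi_films_of_2025",
  {
    launchOptions: { headless: true },
    gotoOptions: { waitUntil: "domcontentloaded" },
    // Return the rendered plain text, so HTML tags are already stripped.
    evaluate: async (page) => page.evaluate(() => document.body.innerText),
  }
);

const pageText = await loader.scrape(); // raw page text, free of HTML tags
```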
Text Chunking
- The scraped content is split into manageable chunks using a text splitter (RecursiveCharacterTextSplitter).
- This ensures each chunk is of optimal size for embedding and storage.
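A sketch of the splitting step, continuing from the scraped `pageText` above (the chunk sizes are illustrative, not the project's exact settings):

```ts
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// Chunk sizes are illustrative; tune them for the embedding model.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 512,
  chunkOverlap: 100,
});

const chunks: string[] = await splitter.splitText(pageText);
```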
Embedding Generation
- Each text chunk is sent to NVIDIA's embedding API (the `nvidia/nv-embedqa-e5-v5` model) to generate a high-dimensional vector representation, as sketched below.
- These embeddings capture the semantic meaning of each chunk.
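A sketch of that call, assuming NVIDIA's OpenAI-compatible `integrate.api.nvidia.com` endpoint; the `input_type` field is how NVIDIA's retrieval models distinguish indexed passages from user queries:

```ts
// Sketch of an embedding call against NVIDIA's OpenAI-compatible
// endpoint (assumption: endpoint URL and input_type field match the
// API version in use).
async function embed(
  text: string,
  inputType: "passage" | "query"
): Promise<number[]> {
  const res = await fetch("https://integrate.api.nvidia.com/v1/embeddings", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.NVIDIA_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "nvidia/nv-embedqa-e5-v5",
      input: [text],
      input_type: inputType, // "passage" when indexing, "query" at question time
    }),
  });
  const json = await res.json();
  return json.data[0].embedding; // the high-dimensional vector
}
```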
Database Storage (AstraDB)
- The vector embeddings and their corresponding text chunks are stored in AstraDB, a vector database.
- The database is set up to support efficient similarity search using the chosen metric (e.g., dot product).
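A sketch of the storage side using the `@datastax/astra-db-ts` client (collection name and vector dimension are assumptions; the dimension must match the embedding model's output):

```ts
import { DataAPIClient } from "@datastax/astra-db-ts";

const client = new DataAPIClient(process.env.ASTRA_DB_APPLICATION_TOKEN!);
const db = client.db(process.env.ASTRA_DB_API_ENDPOINT!);

// Dimension must match the embedding model's output (assumed 1024 here);
// the metric mirrors the dot-product choice mentioned above.
const collection = await db.createCollection("movies", {
  vector: { dimension: 1024, metric: "dot_product" },
});

// One document per chunk: the raw text plus its vector.
await collection.insertOne({ text: chunk, $vector: await embed(chunk, "passage") });
```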
User Interaction (Frontend)
- Users interact with a chat interface built with Next.js.
- When a user submits a question, it is sent to the backend API.
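For example, the submit handler might POST the running message list to the chat API (the `/api/chat` route name and payload shape are assumptions, not the project's confirmed interface):

```ts
// Hypothetical route name and payload shape: the chat UI sends the
// conversation so far plus the new question to the backend.
async function sendQuestion(
  history: { role: string; content: string }[],
  question: string
): Promise<Response> {
  return fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      messages: [...history, { role: "user", content: question }],
    }),
  });
}
```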
Query Embedding & Context Retrieval
- The backend generates an embedding for the user’s question using the same NVIDIA model.
- It then queries AstraDB for the most similar text chunks (context) based on vector similarity to the question embedding.
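A sketch of the retrieval query, reusing the `collection` and `embed` helpers from the earlier steps (the `limit` of 5 is an assumption):

```ts
// Embed the question the same way the passages were embedded, then
// sort by vector similarity to pull the closest chunks.
const questionEmbedding = await embed(question, "query");

const cursor = collection.find(
  {},
  { sort: { $vector: questionEmbedding }, limit: 5, projection: { text: 1 } }
);
const contextChunks = (await cursor.toArray()).map((doc) => doc.text as string);
```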
Prompt Construction
- The retrieved context is formatted and combined with the user’s question to create a system prompt.
- This prompt instructs the AI to use the provided context to answer the question, but to fall back on its own knowledge if needed.
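A minimal sketch of what that assembly might look like (the exact wording is hypothetical, not the project's actual prompt):

```ts
// Hypothetical prompt wording; the real system prompt may differ.
const systemPrompt = `You are MoviesGPT, an assistant for questions about movies.
Answer using the context below. If the context does not contain the answer,
fall back on your own knowledge.

CONTEXT:
${contextChunks.join("\n---\n")}`;
```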
AI Response Generation
- The prompt and chat history are sent to OpenRouter's chat API (using a model like `deepseek/deepseek-chat`), as sketched below.
- The AI generates a streaming response, which is sent back to the frontend in real time.
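A sketch of the streaming completion call against OpenRouter's OpenAI-compatible endpoint, continuing from the `systemPrompt` above (`chatHistory` and `question` stand in for the request payload):

```ts
// OpenRouter exposes an OpenAI-compatible chat completions endpoint;
// stream: true returns the reply incrementally as server-sent events.
const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "deepseek/deepseek-chat",
    stream: true,
    messages: [
      { role: "system", content: systemPrompt },
      ...chatHistory,
      { role: "user", content: question },
    ],
  }),
});
// res.body is a stream of "data: {...}" lines; each JSON chunk's
// choices[0].delta.content carries the next token(s) to forward.
```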
User Receives Answer
- The user sees the AI’s answer in the chat interface, formatted in markdown for readability.
Workflow Diagram (Textual)
Wikipedia Pages
↓
[Scraping & Cleaning]
↓
[Text Chunking]
↓
[Embedding Generation]
↓
[AstraDB Storage]
↓
(User asks a question)
↓
[Question Embedding]
↓
[Vector Search in AstraDB]
↓
[Relevant Context Retrieved]
↓
[Prompt Construction]
↓
[OpenRouter AI Chat Completion]
↓
[Streaming Response to User]
Summary
- Backend: Handles scraping, embedding, storage, and retrieval.
- Frontend: Provides a chat interface for users, built with Next.js and TypeScript.
- AI Models: NVIDIA for embeddings, OpenRouter for chat.
- Database: AstraDB for vector search and storage.
This workflow ensures that MoviesGPT can answer movie-related questions with up-to-date, contextually relevant information, providing a seamless and intelligent user experience.
Links