Building RAG for Data Science

Tutorials by Sheary Tan

Retrieval Augmented Generation (RAG) isn’t just about mashing two technologies together. It’s actually solving a problem that’s been bugging data scientists since large language models started getting really good: how do you make them accurate without retraining them on every new piece of information? Let me explain what I mean.

What RAG Actually Is

RAG is essentially a hybrid approach where you pair a language model with a retrieval system. Instead of relying solely on what the model learned during training, RAG lets it pull relevant information from external sources (your databases, documents, knowledge bases, whatever) and use that context to generate responses.

Here’s why data scientists are getting excited about this: LLMs are trained on fixed datasets with cutoff dates. They hallucinate. They can’t access proprietary company data unless you fine-tune them (which is expensive and time-consuming). RAG sidesteps all of this by keeping knowledge external and retrievable.

Looking for a better and faster notebook than Jupyter or Colab? Head to Livedocs. Livedocs is an agentic notebook that fetches datasets, analyses them, and generates notebooks for you in seconds.

The Architecture: Breaking Down How RAG Actually Works

The retrieval component typically uses vector embeddings. You take your documents (customer support tickets, research papers, product specs) and convert them into numerical representations that capture semantic meaning. These embeddings get stored in a vector database like Pinecone, Weaviate, or even pgvector if you're already on PostgreSQL.

When a query comes in, you embed that query using the same model, then perform a similarity search to find the most relevant chunks of information. Cosine similarity is the usual suspect here, though you’ll see dot product or Euclidean distance depending on the use case.
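
To make that concrete, here's a minimal sketch of the retrieval step using sentence-transformers and NumPy. The documents and the query are placeholders, and in a real system the document vectors would live in a vector database rather than an in-memory array.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Refunds are processed within 5 business days.",
    "Our API rate limit is 100 requests per minute.",
    "Enterprise plans include a dedicated support channel.",
]

# Embed documents once; in practice these vectors live in your vector database.
doc_vecs = model.encode(documents, normalize_embeddings=True)
query_vec = model.encode("How long do refunds take?", normalize_embeddings=True)

# With normalized vectors, the dot product equals cosine similarity.
scores = doc_vecs @ query_vec
for i in np.argsort(scores)[::-1][:2]:
    print(f"{scores[i]:.3f}  {documents[i]}")
```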

The augmentation piece is where you construct your actual prompt. You take those retrieved documents and stuff them into the context window of your LLM along with the user’s question.

Something like: “Given these documents: [retrieved_content], answer this question: [user_query].” The generation step is straightforward—your LLM processes everything and spits out an answer that’s grounded in the retrieved information rather than just its training data.
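
A rough sketch of that augmentation step, assuming `retrieved_chunks` comes out of a retrieval function like the one above; `call_llm` is a stand-in for whichever model client you actually use.

```python
def build_prompt(retrieved_chunks, user_query):
    """Stuff the retrieved chunks into the prompt so the answer stays grounded in them."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    return (
        "Answer the question using only the documents below. "
        "If the answer isn't in them, say so.\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {user_query}\nAnswer:"
    )

prompt = build_prompt(
    ["Refunds are processed within 5 business days."],
    "How long do refunds take?",
)
# answer = call_llm(prompt)  # stand-in for your OpenAI / Anthropic / local model call
print(prompt)
```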

Implementing RAG: The Practical Stuff

Let’s talk implementation because there are decisions you’ll need to make at every turn.

Choosing Your Embedding Model

Your embedding model determines how well you can match queries to relevant documents. OpenAI's text-embedding-3-small and text-embedding-3-large are solid choices if you don't mind the API costs. For open-source options, sentence-transformers models like 'all-MiniLM-L6-v2' work surprisingly well for general use cases, while domain-specific models like BioBERT or SciBERT make sense if you're working in specialized fields.

The key metric here is retrieval accuracy—you need embeddings that actually capture semantic similarity in your domain. Sometimes that means experimenting with multiple models and seeing which performs best on your data.
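
One low-effort way to run that experiment: take a small labeled set of (query, relevant document) pairs from your own data and check how often each candidate model ranks the right document first. The pairs and model names below are just examples.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Tiny labeled set: each query is paired with the document that should rank first.
pairs = [
    ("How long do refunds take?", "Refunds are processed within 5 business days."),
    ("What is the API rate limit?", "Our API rate limit is 100 requests per minute."),
]
corpus = [doc for _, doc in pairs]

for name in ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]:
    model = SentenceTransformer(name)
    doc_vecs = model.encode(corpus, normalize_embeddings=True)
    hits = 0
    for i, (query, _) in enumerate(pairs):
        query_vec = model.encode(query, normalize_embeddings=True)
        if int(np.argmax(doc_vecs @ query_vec)) == i:
            hits += 1
    print(f"{name}: hit@1 = {hits / len(pairs):.2f}")
```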

Chunking Strategy

Here’s where people often stumble. You can’t just throw entire documents at an embedding model and call it a day. You need to chunk your content intelligently.

Fixed-size chunking (say, 512 tokens) is the simplest approach, but it’s pretty crude. You might split a paragraph mid-thought. Semantic chunking, where you preserve natural boundaries like paragraphs or sections, tends to work better because retrieved chunks make sense as standalone pieces of information.

There’s also the question of overlap. Should chunks overlap by 50 tokens? 100? It helps with continuity but increases your storage requirements.
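
Here's a crude sketch of fixed-size chunking with overlap. It splits on whitespace rather than real tokens to keep things simple; a production pipeline would count tokens with the embedding model's tokenizer and prefer paragraph or section boundaries.

```python
def chunk_text(text, chunk_size=512, overlap=50):
    """Crude fixed-size chunking by words, with overlap between consecutive chunks.
    A real pipeline would count model tokens and respect paragraph/section boundaries."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(words):
            break
    return chunks

document_text = "..."  # your document text goes here
print(len(chunk_text(document_text, chunk_size=200, overlap=20)))
```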

Vector Database Selection

  • Pinecone is managed and fast, but you’re paying for that convenience.
  • Weaviate gives you more control and is fully open-source.
  • Chroma is lightweight and perfect for prototyping.
  • FAISS from Meta is blazingly fast for local development but doesn’t come with all the bells and whistles of a full database.

For most data science use cases, especially in production environments, I’d lean toward Weaviate or Pinecone. They handle scaling well.
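
For prototyping, the Chroma path looks roughly like this. The collection name, IDs, and documents are placeholders, and Chroma falls back to its built-in default embedding function unless you configure one.

```python
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to keep data
collection = client.create_collection(name="support_docs")

# Chroma embeds these with its default embedding function unless you pass your own.
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "Refunds are processed within 5 business days.",
        "Our API rate limit is 100 requests per minute.",
    ],
)

results = collection.query(query_texts=["How long do refunds take?"], n_results=1)
print(results["documents"][0])
```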

Real Data Science Use Cases

Where does RAG actually add value in data science workflows?

Customer Support Intelligence

You’ve got thousands of support tickets, product documentation, FAQ pages. Traditional chatbots are rigid and rules-based. RAG lets you build a system that retrieves relevant documentation and generates contextual, accurate responses. The model isn’t making stuff up—it’s referencing actual company knowledge.

One team I know implemented this and reduced their average response time by 40%. Not because the system was faster, but because it was more accurate the first time around, reducing back-and-forth with customers.

Research Paper Analysis

If you’re in pharma, biotech, or academia, you know the pain of keeping up with literature. RAG can ingest hundreds of papers, index them properly, and let researchers ask questions like “What methodologies have been used to study protein folding in the last two years?” The system retrieves relevant papers and synthesizes an answer. This isn’t just search—it’s comprehension and synthesis at scale.

Internal Knowledge Management

Companies accumulate institutional knowledge in Confluence pages, Google Docs, Slack threads, wikis. RAG turns all that scattered information into something queryable. New employees can ask questions and get answers grounded in actual company documentation rather than waiting for someone to dig through old Slack channels.

Data Analysis Assistants

This is where Livedocs shines: it combines notebooks with live data connections from tools like Stripe, Segment, and Google Analytics. RAG could layer on top of these environments to create truly intelligent data assistants, ones that understand not just your static documentation but also your live metrics, past analyses, and current business context. Instead of manually searching through old reports, you'd ask "How did we handle seasonality in our last revenue forecast?" and get answers grounded in actual past work, complete with references to the specific Livedocs notebooks where that analysis lives.

Try out Livedocs now.

The Challenges Nobody Talks About Enough

There are practical challenges you’ll hit too.

  • Context window limitations are still real. Even with models boasting 128K token windows, stuffing too much retrieved content into your prompt reduces the quality of generation. You need to be selective about what you retrieve and how you rank it.

  • Retrieval accuracy can make or break your system. If your retrieval component pulls irrelevant documents, your LLM will generate garbage answers—or worse, confident-sounding wrong answers. You need strong evaluation metrics here: precision@k, recall, mean reciprocal rank (there's a bare-bones sketch of two of them after this list). Monitor them.

  • Latency is another issue. Every RAG query involves embedding the question, searching the vector database, and generating a response. That’s multiple network calls and compute steps. In production, you might need caching strategies or pre-computed answers for common queries.

  • Cost management. Embedding API calls, vector database storage, LLM generation—it adds up fast when you’re processing thousands of queries daily. You’ll want to optimize chunk sizes, implement caching, and maybe even explore open-source models for parts of your pipeline.
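
On the evaluation point above, here's a bare-bones version of precision@k and reciprocal rank; the retrieved IDs and relevant sets would come from your own labeled evaluation queries, and mean reciprocal rank is just the average of reciprocal rank across queries.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are actually relevant."""
    return sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids) / k

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant document retrieved (0 if none show up)."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Toy example: the relevant document comes back in second place.
print(precision_at_k(["d3", "d7", "d1"], {"d7"}, k=3))  # ~0.33
print(reciprocal_rank(["d3", "d7", "d1"], {"d7"}))      # 0.5
```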

Don’t forget to try out Livedocs! We now have Claude 4.5 integrated, with faster agent responses on your data!

Getting Started: A Practical Roadmap

If you’re ready to implement RAG, here’s how I’d approach it:

  • Start small. Pick a single use case with clear success metrics. Maybe it’s answering questions about your data pipeline documentation or helping analysts understand past experiments. Don’t try to boil the ocean.

  • Get your data in order. Seriously—if your documents are poorly structured, inconsistent, or full of errors, RAG will amplify those problems. Clean it up, organize it, make sure it’s actually useful content.

  • Prototype fast. Use something like LangChain or LlamaIndex to get a working system quickly (see the sketch after this list). These frameworks handle a lot of the plumbing so you can focus on what matters: retrieval quality and prompt engineering.

  • Measure everything. Set up logging and evaluation from day one. You need to know what’s working and what isn’t, and gut feeling won’t cut it at scale.

  • Iterate based on real usage. The first version won’t be perfect. That’s fine. Deploy it to a small group, gather feedback, and improve. RAG systems get better with tuning and real-world data.
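
As a rough idea of how little code the prototype stage needs, here's a minimal LlamaIndex sketch. The "docs" folder and the query are placeholders, import paths shift between llama-index versions, and by default it relies on OpenAI for embeddings and generation, so it assumes an API key is set in the environment.

```python
# Minimal LlamaIndex prototype (pip install llama-index). The "docs" folder and the
# query are placeholders; by default this uses OpenAI for embeddings and generation,
# so it assumes OPENAI_API_KEY is set in the environment.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("docs").load_data()  # load every file in ./docs
index = VectorStoreIndex.from_documents(documents)     # chunk, embed, and store them

query_engine = index.as_query_engine()
response = query_engine.query("How did we handle seasonality in our last revenue forecast?")
print(response)
```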

Final Thoughts

RAG is evolving fast. We’re seeing models trained specifically to work better with retrieved context. Vector databases are getting smarter about handling structured and unstructured data together. The line between RAG and fine-tuning is blurring as techniques like RAFT (Retrieval-Augmented Fine-Tuning) emerge.

Just remember: start simple, measure everything, and don’t let perfect be the enemy of good enough. That’s advice that applies to most things in data science, but it’s especially true for RAG implementations.

The best, fastest agentic notebook of 2026? Livedocs.

  • 8x faster responses
  • Ask the agent to find datasets for you
  • Set system rules for the agent
  • Collaborate
  • And more

Get started with Livedocs and build your first live notebook in minutes.

  • 💬 If you have questions or feedback, please email us directly at a[at]livedocs[dot]com
  • 📣 Take Livedocs for a spin over at livedocs.com/start. Livedocs has a great free plan, with $5 per month of LLM usage on every plan
  • 🤝 Say hello to the team on X and LinkedIn

Stay tuned for the next tutorial!
