The Beginning: Why I Built This

It started with a simple problem: I wanted to chat with my research papers. Most AI tools either required uploading documents to third-party servers or couldn't maintain context across multiple files. I decided to build a solution that kept everything local and allowed multiple users to work with their own private document collections.

The idea evolved into a full SaaS platform where users can upload multiple documents, chat with an AI that understands their specific context, generate quizzes from the content, and receive automated daily digests of their most important conversations.

Lesson 1: Retrieval Quality Matters More Than Model Size

My first version used the largest model available and expected great results. Unfortunately, answers were inconsistent. The problem wasn't the model—it was how I retrieved the relevant context to feed into it.

The Chunking Problem: I initially used fixed-size chunking with 512 tokens. This seemed efficient, but it broke semantic units—paragraphs, code blocks, and tables were split arbitrarily, destroying meaning.

The Fix: I switched to semantic chunking that respects natural language boundaries. I also added overlap between chunks (128 tokens) to ensure context continuity. Finally, I implemented metadata tagging (document title, section headers, dates) that helped the retriever understand context better.

The improvement was dramatic—answer relevance scores went up by 40%, and user satisfaction increased significantly. The model didn't change at all; only the retrieval pipeline improved.

Lesson 2: Multi-User Isolation Is Critical

The biggest backend challenge was isolating users and their files cleanly while still maintaining fast response times. In a SaaS environment, data privacy isn't just a feature— it's a legal requirement.

Database Design: I used PostgreSQL with pgvector for the vector store. Each user has their own schema partition, and document embeddings are stored with user_id as a mandatory filter on every query. This ensures zero data leakage across sessions.

API Layer: FastAPI handles authentication via JWT tokens. Every endpoint that touches user data validates the token and extracts the user_id. This user_id is then automatically injected into all database queries—no exceptions.

Rate Limiting: I implemented per-user rate limiting to prevent one user from consuming all resources. This was crucial during testing when I accidentally triggered infinite loops in my prompts.

Lesson 3: Automation Creates Real Product Value

Beyond the core chat functionality, I added features that transformed the platform from a toy into a useful tool:

Quiz Generation: Users can request quizzes generated from their uploaded documents. The AI extracts key concepts and creates multiple-choice questions. This features helps students and professionals study from their own materials.

Daily Digests: Using n8n workflows, the system sends email summaries each morning with:

  • Conversations from the past 24 hours
  • Files that were added or modified
  • Suggested follow-up questions based on activity

This proactive approach increased user retention by 60%. People didn't have to remember to check the platform—the insights came to them.

Lesson 4: Local Models Have Surprising Benefits

I initially used OpenAI's API for the LLM, but switched to Ollama for local inference using models like Llama 3.2. This decision surprised me with its advantages:

  • Cost: Zero API costs after initial hardware investment
  • Privacy: All document processing happens locally—nothing leaves the server
  • Customization: I could fine-tune prompts and system instructions without worrying about rate limits or policy changes
  • Speed: For medium-complexity queries, local models were actually faster due to no network latency

Lesson 5: The UX Details Matter

I underestimated how much UX improvements would impact perceived quality. Simple additions made the platform feel much more polished:

Streaming Responses: Instead of waiting for the complete answer, tokens stream in real-time. This reduced perceived wait time by 70% even though actual processing time was identical.

Citations: Every answer includes links to the specific documents and chunks used to generate it. Users can click to see the exact source text.

Conversation History: Full conversation threads with the ability to branch and explore alternative questions without losing context.

What I'd Do Differently

If I started fresh, I'd focus on these areas from day one:

  1. Evaluation Framework: Build comprehensive tests for retrieval quality before writing any business logic. I spent too much time retrofitting tests.
  2. Event-Driven Architecture: Use message queues for async operations like quiz generation and email sending. Currently, some operations block the main thread.
  3. Incremental Indexing: Implement live document updates instead of requiring full re-indexing when a document changes.

Conclusion

Building a production RAG system is more about infrastructure and data engineering than it is about AI models. The model is just one component—retrieval quality, user isolation, and proactive features are what make a platform truly useful.

The next frontier for this project is adding multimodal capabilities—allowing users to upload images and diagrams alongside text documents. Watch this space for updates.