Why Run LLMs Locally?
After using OpenAI's API for months, I grew concerned about:
- Cost: API calls add up quickly, especially during development
- Privacy: Sending sensitive documents to third-party servers
- Rate Limits: Hitting quotas at the worst possible times
- Internet Dependency: Can't work offline or during connectivity issues
Ollama solves all of these. It packages everything you need to run open-source LLMs locally with a simple command-line interface.
Installation
Getting started takes less than 5 minutes. Choose your platform:
macOS
# Download from ollama.com or use Homebrew
brew install ollama
Linux
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download the installer from ollama.com. The Windows version runs via WSL for best performance.
Your First Model
Pulling a model takes a few minutes depending on size. Let's start with Llama 3.2:
ollama pull llama3.2
Once downloaded, run it interactively:
ollama run llama3.2
>>> What is RAG?
>>> Type /bye to exit
Understanding Model Sizes
Ollama offers various models with different tradeoffs:
| Model | Parameters | RAM Required | Best For |
|---|---|---|---|
| phi3 | 3.8B | 4GB | Quick tasks, testing |
| mistral | 7B | 8GB | Balanced performance |
| llama3.2 | 3B | 4GB | Efficient, good quality |
| codellama | 7B | 8GB | Code generation |
Python Integration
The real power comes when you integrate Ollama into your applications. Here's a complete example:
import requests
import json
def chat_with_ollama(prompt, model="llama3.2"):
response = requests.post("http://localhost:11434/api/generate", json={
"model": model,
"prompt": prompt,
"stream": False
})
return response.json()["response"]
# Usage
answer = chat_with_ollama("Explain RAG in simple terms")
print(answer)
Streaming Responses
For a better user experience, stream tokens as they're generated:
import requests
def stream_response(prompt, model="llama3.2"):
with requests.post("http://localhost:11434/api/generate",
json={"model": model, "prompt": prompt, "stream": True},
stream=True) as r:
for line in r.iter_lines():
if line:
data = json.loads(line)
yield data.get("response", "")
# Usage
for token in stream_response("Write a haiku about coding"):
print(token, end="", flush=True)
Advanced Configuration
Customize model behavior with parameters:
{
"model": "llama3.2",
"prompt": "Explain quantum computing",
"options": {
"temperature": 0.7, # Higher = more creative
"num_predict": 256, # Max tokens to generate
"top_p": 0.9, # Nucleus sampling threshold
"repeat_penalty": 1.2 # Penalize repetition
}
}
Modelfile: Custom Models
Create custom model configurations with Modelfiles:
FROM llama3.2
PARAMETER temperature 0
PARAMETER num_ctx 4096
SYSTEM You are a helpful AI assistant specialized in Python.
You always provide code examples with explanations.
Create and use your custom model:
ollama create python-assistant -f Modelfile
ollama run python-assistant
API Server Mode
Run Ollama as a REST API server for production applications:
ollama serve
# Server runs at http://localhost:11434
Available endpoints:
POST /api/generate- Generate text completionPOST /api/chat- Chat-style completionsGET /api/tags- List available models
Performance Tips
Get the most out of your local setup:
- GPU Acceleration: Ollama automatically uses NVIDIA CUDA if available
- Model Quantization: Use -q flag to reduce model size (q4_0, q5_1)
- Context Length: Adjust num_ctx based on your RAM
- Memory Management: Close models when not in use with /bye
Best Practices
From my experience working with local LLMs:
- Start small: Test with phi3 before committing to larger models
- Quantization is fine: q4_0 quantization barely affects quality but saves RAM
- System prompts matter: Spend time crafting effective system instructions
- Keep models organized: Use descriptive names for custom models
Troubleshooting
Common issues and solutions:
- Out of memory: Use a smaller model or quantized version
- Slow responses: Ensure GPU acceleration is enabled
- Model not found: Run
ollama pullagain to download - API connection refused: Check that Ollama server is running
Conclusion
Ollama democratizes access to large language models. Whether you're prototyping an AI product, learning about LLMs, or building privacy-sensitive applications, local inference is a valuable skill. Start small, experiment often, and don't be afraid to explore different models to find what works for your use case.