Why Run LLMs Locally?

After using OpenAI's API for months, I grew concerned about:

  • Cost: API calls add up quickly, especially during development
  • Privacy: Sending sensitive documents to third-party servers
  • Rate Limits: Hitting quotas at the worst possible times
  • Internet Dependency: Can't work offline or during connectivity issues

Ollama solves all of these. It packages everything you need to run open-source LLMs locally with a simple command-line interface.

Installation

Getting started takes less than 5 minutes. Choose your platform:

macOS

# Download from ollama.com or use Homebrew
brew install ollama

Linux

curl -fsSL https://ollama.com/install.sh | sh

Windows

Download the installer from ollama.com. The Windows version runs via WSL for best performance.

Your First Model

Pulling a model takes a few minutes depending on size. Let's start with Llama 3.2:

ollama pull llama3.2

Once downloaded, run it interactively:

ollama run llama3.2
>>> What is RAG?
>>> Type /bye to exit

Understanding Model Sizes

Ollama offers various models with different tradeoffs:

Model Parameters RAM Required Best For
phi3 3.8B 4GB Quick tasks, testing
mistral 7B 8GB Balanced performance
llama3.2 3B 4GB Efficient, good quality
codellama 7B 8GB Code generation

Python Integration

The real power comes when you integrate Ollama into your applications. Here's a complete example:

import requests
import json

def chat_with_ollama(prompt, model="llama3.2"):
    response = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": prompt,
        "stream": False
    })
    return response.json()["response"]

# Usage
answer = chat_with_ollama("Explain RAG in simple terms")
print(answer)

Streaming Responses

For a better user experience, stream tokens as they're generated:

import requests

def stream_response(prompt, model="llama3.2"):
    with requests.post("http://localhost:11434/api/generate", 
                      json={"model": model, "prompt": prompt, "stream": True},
                      stream=True) as r:
        for line in r.iter_lines():
            if line:
                data = json.loads(line)
                yield data.get("response", "")

# Usage
for token in stream_response("Write a haiku about coding"):
    print(token, end="", flush=True)

Advanced Configuration

Customize model behavior with parameters:

{
    "model": "llama3.2",
    "prompt": "Explain quantum computing",
    "options": {
        "temperature": 0.7,      # Higher = more creative
        "num_predict": 256,      # Max tokens to generate
        "top_p": 0.9,            # Nucleus sampling threshold
        "repeat_penalty": 1.2    # Penalize repetition
    }
}

Modelfile: Custom Models

Create custom model configurations with Modelfiles:

FROM llama3.2
PARAMETER temperature 0
PARAMETER num_ctx 4096

SYSTEM You are a helpful AI assistant specialized in Python.
       You always provide code examples with explanations.

Create and use your custom model:

ollama create python-assistant -f Modelfile
ollama run python-assistant

API Server Mode

Run Ollama as a REST API server for production applications:

ollama serve
# Server runs at http://localhost:11434

Available endpoints:

  • POST /api/generate - Generate text completion
  • POST /api/chat - Chat-style completions
  • GET /api/tags - List available models

Performance Tips

Get the most out of your local setup:

  • GPU Acceleration: Ollama automatically uses NVIDIA CUDA if available
  • Model Quantization: Use -q flag to reduce model size (q4_0, q5_1)
  • Context Length: Adjust num_ctx based on your RAM
  • Memory Management: Close models when not in use with /bye

Best Practices

From my experience working with local LLMs:

  1. Start small: Test with phi3 before committing to larger models
  2. Quantization is fine: q4_0 quantization barely affects quality but saves RAM
  3. System prompts matter: Spend time crafting effective system instructions
  4. Keep models organized: Use descriptive names for custom models

Troubleshooting

Common issues and solutions:

  • Out of memory: Use a smaller model or quantized version
  • Slow responses: Ensure GPU acceleration is enabled
  • Model not found: Run ollama pull again to download
  • API connection refused: Check that Ollama server is running

Conclusion

Ollama democratizes access to large language models. Whether you're prototyping an AI product, learning about LLMs, or building privacy-sensitive applications, local inference is a valuable skill. Start small, experiment often, and don't be afraid to explore different models to find what works for your use case.