Getting Started with Ollama for Local LLM Inference

A beginner-friendly guide to running large language models on your own machine using Ollama. No cloud dependencies, no API costs, full privacy.

Why Run LLMs Locally?

After using OpenAI's API for months, I grew concerned about:

Cost: API calls add up quickly, especially during development
Privacy: Sending sensitive documents to third-party servers
Rate Limits: Hitting quotas at the worst possible times
Internet Dependency: Can't work offline or during connectivity issues

Ollama solves all of these. It packages everything you need to run open-source LLMs locally with a simple command-line interface.

Installation

Getting started takes less than 5 minutes. Choose your platform:

macOS

# Download from ollama.com or use Homebrew
brew install ollama

Linux

curl -fsSL https://ollama.com/install.sh | sh

Windows

Download the installer from ollama.com. The Windows version runs via WSL for best performance.

Your First Model

Pulling a model takes a few minutes depending on size. Let's start with Llama 3.2:

ollama pull llama3.2

Once downloaded, run it interactively:

ollama run llama3.2
>>> What is RAG?
>>> Type /bye to exit

Understanding Model Sizes

Ollama offers various models with different tradeoffs:

Model	Parameters	RAM Required	Best For
phi3	3.8B	4GB	Quick tasks, testing
mistral	7B	8GB	Balanced performance
llama3.2	3B	4GB	Efficient, good quality
codellama	7B	8GB	Code generation

Python Integration

The real power comes when you integrate Ollama into your applications. Here's a complete example:

import requests
import json

def chat_with_ollama(prompt, model="llama3.2"):
    response = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": prompt,
        "stream": False
    })
    return response.json()["response"]

# Usage
answer = chat_with_ollama("Explain RAG in simple terms")
print(answer)

Streaming Responses

For a better user experience, stream tokens as they're generated:

import requests

def stream_response(prompt, model="llama3.2"):
    with requests.post("http://localhost:11434/api/generate", 
                      json={"model": model, "prompt": prompt, "stream": True},
                      stream=True) as r:
        for line in r.iter_lines():
            if line:
                data = json.loads(line)
                yield data.get("response", "")

# Usage
for token in stream_response("Write a haiku about coding"):
    print(token, end="", flush=True)

Advanced Configuration

Customize model behavior with parameters:

{
    "model": "llama3.2",
    "prompt": "Explain quantum computing",
    "options": {
        "temperature": 0.7,      # Higher = more creative
        "num_predict": 256,      # Max tokens to generate
        "top_p": 0.9,            # Nucleus sampling threshold
        "repeat_penalty": 1.2    # Penalize repetition
    }
}

Modelfile: Custom Models

Create custom model configurations with Modelfiles:

FROM llama3.2
PARAMETER temperature 0
PARAMETER num_ctx 4096

SYSTEM You are a helpful AI assistant specialized in Python.
       You always provide code examples with explanations.

Create and use your custom model:

ollama create python-assistant -f Modelfile
ollama run python-assistant

API Server Mode

Run Ollama as a REST API server for production applications:

ollama serve
# Server runs at http://localhost:11434

Available endpoints:

POST /api/generate - Generate text completion
POST /api/chat - Chat-style completions
GET /api/tags - List available models

Performance Tips

Get the most out of your local setup:

GPU Acceleration: Ollama automatically uses NVIDIA CUDA if available
Model Quantization: Use -q flag to reduce model size (q4_0, q5_1)
Context Length: Adjust num_ctx based on your RAM
Memory Management: Close models when not in use with /bye

Best Practices

From my experience working with local LLMs:

Start small: Test with phi3 before committing to larger models
Quantization is fine: q4_0 quantization barely affects quality but saves RAM
System prompts matter: Spend time crafting effective system instructions
Keep models organized: Use descriptive names for custom models

Troubleshooting

Common issues and solutions:

Out of memory: Use a smaller model or quantized version
Slow responses: Ensure GPU acceleration is enabled
Model not found: Run ollama pull again to download
API connection refused: Check that Ollama server is running

Conclusion

Ollama democratizes access to large language models. Whether you're prototyping an AI product, learning about LLMs, or building privacy-sensitive applications, local inference is a valuable skill. Start small, experiment often, and don't be afraid to explore different models to find what works for your use case.

AI Engineering Insights