From AI to Data Engineering

My transition from AI engineering to leading a data team was driven by a simple realization: every AI system is only as good as the data pipeline feeding it. As the Lead Data Engineer at Tivro Technologies, I spend my days architecting pipelines, managing Supabase databases, and ensuring our data infrastructure can handle growth without crumbling.

This article covers what I've learned building production data systems — from schema design in Supabase to orchestrating ETL workflows that process millions of records daily.

Why Supabase for Production Data

When we evaluated database platforms, Supabase stood out because it gives you PostgreSQL under the hood with a managed layer that removes operational overhead. Here's why it works for us:

  • Full PostgreSQL: No abstraction layer — you get real PostgreSQL with all its extensions and features
  • Managed infrastructure: Automatic backups, point-in-time recovery, and read replicas without DevOps overhead
  • Real-time capabilities: Built-in subscriptions for live data updates across our applications
  • Row Level Security: Multi-tenant data isolation without application-level complexity
  • Scalable storage: From gigabytes to terabytes with minimal configuration changes

The real win is the developer experience. My team can spin up a new database environment in minutes, test migrations, and promote to production — all without waiting on infrastructure provisioning.

Database Architecture at Scale

After several iterations, here's the architecture pattern that works for our multi-tenant platform:

-- Tenant isolation with RLS
CREATE TABLE tenants (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT NOT NULL,
    slug TEXT UNIQUE NOT NULL,
    config JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Enable Row Level Security
ALTER TABLE tenants ENABLE ROW LEVEL SECURITY;

-- Shared data with tenant isolation
CREATE TABLE events (
    id BIGSERIAL PRIMARY KEY,
    tenant_id UUID REFERENCES tenants(id) NOT NULL,
    event_type TEXT NOT NULL,
    payload JSONB NOT NULL,
    ingested_at TIMESTAMPTZ DEFAULT NOW()
);

-- RLS policy for multi-tenancy
CREATE POLICY tenant_isolation ON events
    USING (tenant_id = current_setting('app.tenant_id')::UUID);

-- Partition by ingestion date for query performance
CREATE TABLE events_partitioned (
    LIKE events INCLUDING ALL
) PARTITION BY RANGE (ingested_at);

CREATE TABLE events_2026_q2 PARTITION OF events_partitioned
    FOR VALUES FROM ('2026-04-01') TO ('2026-07-01');
CREATE TABLE events_2026_q3 PARTITION OF events_partitioned
    FOR VALUES FROM ('2026-07-01') TO ('2026-10-01');

Partitioning by time range is essential once you cross a few million rows. Queries that took 30 seconds on a single table now execute in under 200ms.

Building Reliable ETL Pipelines

The core of our data platform is a set of ETL pipelines that ingest data from multiple sources, transform it, and load it into our analytics warehouse. Here's the stack we use:

# Pipeline orchestration with Python + Prefect
from prefect import flow, task
from prefect.tasks import task_input_mapping
from supabase import create_client
import pandas as pd

@task(retries=3, retry_delay_seconds=60)
def extract_from_source(source_config: dict) -> pd.DataFrame:
    """Extract data from external API or database."""
    # Handle pagination, rate limits, and failures
    ...

@task
def transform(raw_data: pd.DataFrame) -> pd.DataFrame:
    """Clean, normalize, and enrich the data."""
    # Deduplication, type casting, enrichment
    ...

@task
def load_to_supabase(transformed: pd.DataFrame, table: str):
    """Bulk upsert into Supabase."""
    supabase = create_client(
        supabase_url=source_config["url"],
        supabase_key=source_config["service_key"]
    )
    # Batch upsert in chunks of 1000
    for batch in chunks(transformed, 1000):
        supabase.table(table).upsert(
            batch.to_dict("records"),
            ignore_duplicates=False
        ).execute()

@flow(name="data-ingestion-pipeline")
def pipeline():
    raw = extract_from_source(config)
    clean = transform(raw)
    load_to_supabase(clean, "events_partitioned")

# Schedule the pipeline
pipeline.serve(
    cron="0 */6 * * *",  # Run every 6 hours
    name="hourly-ingestion"
)

Monitoring and Observability

A pipeline you can't observe is a pipeline waiting to fail. We built a monitoring layer around every data flow:

  • Record counts: Every pipeline reports extracted, transformed, and loaded row counts
  • Latency tracking: End-to-end duration from source to destination
  • Data quality checks: Automated validation for nulls, duplicates, and type mismatches
  • Alerting: PagerDuty integration for pipeline failures and data anomalies
-- Data quality check query
SELECT
    COUNT(*) as total_rows,
    COUNT(DISTINCT tenant_id) as tenant_count,
    COUNT(*) FILTER (WHERE payload IS NULL) as null_payloads,
    COUNT(*) FILTER (WHERE ingested_at < NOW() - INTERVAL '1 hour') as stale_records,
    MAX(ingested_at) as latest_ingestion
FROM events_partitioned
WHERE ingested_at >= NOW() - INTERVAL '24 hours';

Handling Schema Migrations

Schema changes in a production database are nerve-wracking. Here's the process we follow to migrate without downtime:

-- 1. Add new column as nullable
ALTER TABLE events ADD COLUMN processed BOOLEAN;

-- 2. Backfill in batches
DO $$
DECLARE
    batch_size INT := 10000;
    affected INT;
BEGIN
    LOOP
        UPDATE events
        SET processed = FALSE
        WHERE processed IS NULL
        LIMIT batch_size;
        
        GET DIAGNOSTICS affected = ROW_COUNT;
        EXIT WHEN affected < batch_size;
        
        COMMIT;
        PERFORM pg_sleep(0.1); -- Throttle
    END LOOP;
END $$;

-- 3. Set NOT NULL after backfill
ALTER TABLE events ALTER COLUMN processed SET NOT NULL;

-- 4. Create index
CREATE INDEX idx_events_processed ON events(processed);

Supabase's branching feature is a game-changer here. We create a branch for every migration, test against production-like data, and only merge when all checks pass.

Leading the Data Team

Managing a team of data engineers taught me that technical skills alone aren't enough. Here are the practices that improved our team's output:

  • Data contracts: Every pipeline has a written contract specifying schema, SLAs, and ownership
  • Code review for SQL: SQL review catches more production issues than application code review
  • Runbooks: Documented procedures for common failure scenarios — reduces MTTR significantly
  • Post-mortems: Blameless incident reviews with actionable follow-ups

The biggest shift came when we started treating data pipelines as products. Each pipeline has a clearly defined owner, documented SLAs, and regular health reviews.

Cost Optimization

Supabase costs can grow fast if you're not careful. Here's how we keep them under control:

  • Data retention policies: Raw data moves to cold storage after 90 days
  • Index management: Remove unused indexes — they slow writes and consume storage
  • Connection pooling: Use PgBouncer (built into Supabase) to handle thousands of connections
  • Query optimization: Regular EXPLAIN ANALYZE reviews to catch slow queries
  • Compression: Use native PostgreSQL compression for JSONB and text columns
-- Find slow queries
SELECT
    query,
    calls,
    mean_exec_time,
    rows,
    ROUND(100.0 * shared_blks_hit /
        NULLIF(shared_blks_hit + shared_blks_read, 0), 2) AS hit_ratio
FROM pg_stat_statements
WHERE query NOT LIKE '%pg_%'
ORDER BY mean_exec_time DESC
LIMIT 10;

Lessons Learned

After months of running production pipelines on Supabase, here are the lessons that stuck with me:

  • Start simple, evolve deliberately. Don't over-engineer your first pipeline. Get it working, then optimize.
  • Observability first. You can't fix what you can't see. Invest in monitoring before scaling.
  • Test with production data. Synthetic tests never catch edge cases that real data exposes.
  • Document ruthlessly. Your future self (and your team) will thank you for clear documentation.

The data engineering landscape is evolving fast. Supabase makes PostgreSQL accessible without sacrificing power, and the right pipeline architecture keeps your data flowing reliably. Whether you're a solo founder or leading a team, these patterns will help you build data systems you can trust.