PostgreSQL Is More Than a Web App Database
Modern PostgreSQL (v14+) has everything for a production data pipeline: JSONB for schemaless ingestion, Partitioning by time, native CDC via Logical Replication, Materialized Views for heavy aggregations, and LISTEN/NOTIFY for lightweight event streaming. Under 1TB data and 50K events/second, it's the most cost-effective and operationally simple stack available — no Kafka, Spark, or Snowflake required.
ETL vs ELT: Choosing the Right Pattern
| Criterion | ETL (Python-first) | ELT (SQL-first) |
|---|
| Data size | Under 100MB/batch | GB+ |
| Transformation | Complex Python logic | SQL + dbt |
| Performance | Memory bottleneck | Push computation to DB |
| Tools | Pandas, Polars | dbt, SQLMesh |
Practical recommendation: Use ELT for analytical workloads — PostgreSQL handles JOINs and aggregations better than Pandas for large datasets. Use ETL when business logic is too complex to express in SQL.
Data Sources (API / Webhook / CSV)
↓
[Python Ingestion Layer — httpx, asyncpg, pandas]
↓
[PostgreSQL Staging — raw_events JSONB, PARTITION BY time]
↓ SQL transforms / dbt
[Fact Tables + Materialized Views]
↓
[Serving — Metabase / Grafana / Application APIs]
Schema Design Essentials
Staging table accepts raw data without strict schema validation:
CREATE TABLE raw_events (
id BIGSERIAL PRIMARY KEY,
source TEXT NOT NULL,
event_type TEXT NOT NULL,
payload JSONB NOT NULL,
received_at TIMESTAMPTZ DEFAULT NOW(),
processed BOOLEAN DEFAULT FALSE,
batch_id UUID
) PARTITION BY RANGE (received_at);
CREATE INDEX idx_unprocessed ON raw_events (received_at)
WHERE processed = FALSE;
CREATE INDEX idx_payload_gin ON raw_events USING GIN (payload);
Python Stack: 3 Core Libraries
SQLAlchemy 2.0 with QueuePool and pool_pre_ping=True. Use COPY instead of INSERT for bulk loads — 10-100x faster:
engine = create_engine(
"postgresql+psycopg2://user:pass@localhost/db",
pool_size=10, max_overflow=20,
pool_pre_ping=True,
)
def fetch_unprocessed(limit=1000):
with engine.connect() as conn:
return conn.execute(text("""
SELECT id, payload FROM raw_events
WHERE processed = FALSE
LIMIT :limit
FOR UPDATE SKIP LOCKED
"""), {"limit": limit}).fetchall()
Pandas / Polars: pd.read_sql() reads directly from PostgreSQL. Polars + ConnectorX is 5-20x faster than Pandas for large datasets. Write back using to_sql(..., method='multi', chunksize=5000).
Concurrency: 3 Patterns You Must Know
Pattern 1 — ThreadPoolExecutor for I/O-bound tasks (API calls, DB queries). max_workers=8 is the sweet spot for most pipelines.
Pattern 2 — asyncpg + asyncio for high concurrency (500+ concurrent operations):
pool = await asyncpg.create_pool("postgresql://...", min_size=5, max_size=20)
async def fetch_events(batch_size=500):
async with pool.acquire() as conn:
async with conn.transaction():
return await conn.fetch("""
SELECT id, event_type, payload FROM raw_events
WHERE processed = FALSE LIMIT $1
FOR UPDATE SKIP LOCKED
""", batch_size)
Pattern 3 — FOR UPDATE SKIP LOCKED: When Worker 1 locks 100 rows, Worker 2 automatically skips them — no deadlocks, no duplicate processing. This is the single most important pattern for parallel workers.
Orchestration: Prefect 3 vs Airflow
| Criterion | Apache Airflow | Prefect 3 |
|---|
| Setup | High (Celery/K8s) | Low (pip install) |
| Code style | Complex DAG definitions | Natural Python functions |
| Dynamic flows | Difficult, needs workarounds | Native support |
| Best for | Enterprise, 100+ DAGs | Startup, 1-50 flows |
Prefect: @flow + @task decorators, automatic retries, built-in UI. Covers 95% of startup pipeline needs out of the box.
Query Optimization: 5 Index Strategies
CREATE INDEX idx_unprocessed ON raw_events (received_at)
WHERE processed = FALSE;
CREATE INDEX idx_user_time ON fact_user_events (user_id, event_at DESC);
CREATE INDEX idx_payload ON raw_events USING GIN (payload);
CREATE INDEX idx_user_id_str ON raw_events ((payload->>'user_id'));
CREATE INDEX idx_time_brin ON raw_events USING BRIN (received_at);
Use REFRESH MATERIALIZED VIEW CONCURRENTLY for heavy aggregations — no read lock during refresh.
Always debug with EXPLAIN (ANALYZE, BUFFERS). Watch for: Seq Scan = missing index; Row estimate way off = run ANALYZE; target cache hit ratio above 95%.
Dead Letter Queue — Never Lose Records
Failed records don't crash the entire batch — push to a DLQ table with error_msg, attempt_count, last_failed_at. Alert on sudden DLQ growth. Enable manual review and replay of failed records.
When to Use PostgreSQL (and When Not To)
| Situation | Decision | Alternative |
|---|
| Under 500GB, 1-50K events/s, small team | ✅ Ideal | — |
| Need ACID across full pipeline | ✅ Ideal | — |
| Over 1TB, heavy analytics | ⚠️ Consider alternatives | DuckDB, ClickHouse |
| Over 100K events/second | ❌ Not suitable | Kafka + Flink |
| Pure time-series | ⚠️ Needs extension | TimescaleDB, InfluxDB |
Production Checklist
PostgreSQL + Python is the most pragmatic stack for startup to mid-size companies. Add ClickHouse or Kafka-scale tools only when you outgrow it — don't over-engineer from the start.
Want to add AI processing to your data pipeline? See the AI agent builder's guide.