← Back to playground

🐘ClickHouse

A senior engineer's design doc for the OLAP database β€” storage layout, vectorized execution, distributed query, replication, and ClickHouse Cloud's separated compute and storage.

TL;DR

  • ClickHouse is a columnar OLAP DBMS optimized for high-throughput SELECTs over append-mostly tables. Data lives in immutable parts that background processes merge, indexed by a sparse primary index that points at granules (8192 rows each by default).
  • Query execution is vectorized over column chunks with runtime-dispatched SIMD (SSE 4.2 β†’ AVX-512), LLVM-JIT-fused hot expressions, and 30+ specialized hash-table variants for joins and aggregations. Joins now spill via grace hash when memory exceeds 50%.
  • The Cloud product (SharedMergeTree, 2024+) separates compute and storage: parts live on S3, metadata in Keeper, and replicas are stateless β€” they no longer ship data part-to-part.

Why ClickHouse exists

Analytical workloads look almost nothing like transactional ones. A point-of-sale system writes one row per purchase and looks it up by primary key. A web analytics system writes one row per page view and asks "how many sessions per hour, by country, over the last 30 days" β€” over billions of rows. Row-oriented engines, which interleave all columns of one row contiguously on disk, are good at the first job and bad at the second: a query that touches three of forty columns still drags every byte of every row through I/O and cache. Columnar storage is the standard fix β€” store each column in its own file, in sorted order, with codec-appropriate compression β€” and was widely understood by the time ClickHouse showed up in 2016.

What ClickHouse added was an unusually opinionated combination of choices on top of "columnar." A sparse rather than dense primary index, kept in RAM. Background merges of immutable parts in place of an LSM memtable. Vectorized execution operating on chunks of columns rather than rows. A single hash-table-per-query model with 30+ template specializations. SIMD kernels dispatched at runtime by CPU feature. And a deployment story β€” single C++ binary, no JVM, no external coordinator beyond ZooKeeper/Keeper β€” that made it possible to run at high density on commodity hardware.

The system originated inside Yandex.Metrica, a web-analytics product that needed to serve interactive aggregations over hundreds of billions of clickstream events. (The name is a contraction of Clickstream Data Warehouse.) Open-sourced in 2016, formalized in the VLDB 2024 paper, ClickHouse has since become the default OLAP engine for time-series telemetry, logs, and event streams β€” workloads that look a lot like clickstreams, even when they aren't.

Framing

An analytical database, at the level of abstraction that matters here, is two engines bolted together: a compression engine that minimizes the bytes you read off disk, and a scan engine that minimizes the cycles you spend processing those bytes. ClickHouse is unusually opinionated on both. Most of the rest of this doc is a tour of how each is built.

Mental model in 60 seconds

From the bottom up, a running ClickHouse cluster looks like this:

Persistent storage Local disks (OSS) Β· object storage S3/GCS/Azure (Cloud) Β· compressed column files MergeTree storage engine parts Β· granules Β· marks Β· sparse PK Β· skipping indexes Β· background merges Execution engine vectorized processor graph Β· SIMD (SSE4.2 β†’ AVX-512) Β· LLVM JIT Β· 30+ hash tables Β· grace-hash spill Parser, analyzer, planner SQL β†’ AST β†’ query tree (new analyzer) β†’ logical plan β†’ physical pipeline Distributed router Β· replication Distributed table β†’ shard fan-out Β· two-stage aggregation Β· ReplicatedMergeTree log replay Client API surface HTTP Β· native TCP Β· gRPC Β· MySQL/Postgres wire protocols Β· 90+ file formats ClickHouse Keeper 3-node Raft quorum replication log DDL log part manifest (Cloud only)
A running ClickHouse cluster. Solid lines are data flow; dashed lines are Keeper-mediated coordination.

A few things to know up front:

  • Sharding and replication are both opt-in. A single-node ClickHouse install is a useful database. Distributed tables add horizontal partitioning across shards; ReplicatedMergeTree adds redundancy per shard. Most production deployments use both.
  • The MergeTree family is the only production-grade engine. ClickHouse has dozens of table engines (Log, Memory, Buffer, Kafka, S3, MySQL, Distributed…), but the analytical-DB story is MergeTree. The other engines are mostly facades over external sources or specialized helpers.
  • There is no memtable. Writes go directly to disk as immutable parts; a fresh part is created per INSERT (or per coalesced batch in the async-insert path). This is not LSM β€” there's no RocksDB-style in-memory buffer in front of the merge tree.
  • Coordination runs through Keeper, not through the ClickHouse servers themselves. Servers do not gossip. Every replication log entry, every DDL operation, every part announcement passes through a small Keeper quorum. The health of Keeper directly bounds the health of the cluster.

Storage: parts, granules, marks

This is the layer the rest of ClickHouse is built on. The unit is the part: an immutable directory on disk produced by exactly one INSERT (or by merging existing parts). Inside a part, each column is stored separately in its own compressed file, sorted by the table's ORDER BY clause. Background processes periodically merge smaller parts into larger ones using a k-way merge sort, up to a default cap of about 150 GB per merged part. Parts in different partitions never merge with each other β€” partitioning is purely a logical separation, typically by date, that bounds the merge cost and enables coarse-grained drop/move operations.

The on-disk shape of a part comes in two layouts. Wide format stores each column in its own pair of files (column.bin + column.mrk); Compact format bundles all columns into a single file. The threshold between them is controlled by min_bytes_for_wide_part and min_rows_for_wide_part β€” small parts get the compact layout to avoid filesystem overhead from thousands of tiny files; large parts get the wide layout so column scans only read the columns they need.

Part (immutable directory) e.g. /var/lib/clickhouse/data/db/tbl/202605_3_3_0/ user_id.bin user_id.mrk2 ts.bin ts.mrk2 event.bin event.mrk2 primary.idx (sparse: one entry per granule) skp_idx_*.idx + skp_idx_*.mrk2 (skipping) columns.txt Β· checksums.txt Β· count.txt zoom Inside user_id.bin compressed blocks (64 KB – 1 MB each) [ compressed block 0 ] [ compressed block 1 ] [ compressed block 2 ] Mark = (block_offset, in_block_offset) β†’ seeks straight to a granule's bytes without scanning the column. Granule = 8192 rows (default) ~1000 PK entries address 8.1M rows.
A part contains one compressed column file plus one mark file per column, an index file, and a small bag of metadata. Granules are the unit of indexing and pruning, not the unit of compression.

Two units inside a part deserve careful attention: the granule and the compressed block. Granules are the unit of indexing; the default is index_granularity = 8192 rows. Compressed blocks are the unit of I/O and decompression; their uncompressed sizes range roughly from 64 KB to 1 MB depending on the data. The two are not the same. A mark file bridges them: per column, per granule, it stores the offset of the granule's bytes both within the compressed-block stream and within the uncompressed block. That mapping is what lets the reader skip straight to the bytes for a needed granule without scanning the column from the start.

Default worth not changing

The 8192-row granule default is a tradeoff. Smaller granules mean finer pruning (a query that touches one row reads 8K rows, not 80K), but more entries in the sparse primary index and more mark entries per column. Larger granules compress better but waste more I/O on small lookups. The defaults are well-tuned for typical analytical workloads; reach for non-defaults only after you've measured a real workload bottleneck.

MergeTree variants

The base MergeTree engine is append-only. The variants change what happens during the merge step β€” letting you express upsert, aggregation, and time-series compaction semantics declaratively, at write time, rather than rewriting them as ad-hoc batch jobs.

EngineMerge-time semanticsUse case
MergeTreek-way sort merge; no row collapsingAppend-only events, logs, time series
ReplacingMergeTreeWithin a PK, keep the latest row (by insertion time or an optional version column)Upserts; mutable dimension tables
SummingMergeTreeSum numeric columns sharing a PKPre-aggregated rollup tables (one row per (dimension, time-bucket))
AggregatingMergeTreeMerge AggregateFunction columns via their -Merge finalizersReal-time materialized rollups for any associative aggregate (p50, p95, uniq, …)
CollapsingMergeTreePair rows with sign = +1 and sign = -1; emit only the netState-machine compaction (open/close, in/out)
VersionedCollapsingMergeTreeSame as Collapsing but resolves conflicts via a version columnSame use case under concurrent writes
GraphiteMergeTreeTime-bucketed rollups configured by retention rulesGraphite-style metrics with retention tiers

A typical MergeTree DDL looks like this:

CREATE TABLE events
(
    user_id   UInt64,
    ts        DateTime64(3),
    event     LowCardinality(String),
    payload   String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(ts)
ORDER BY (user_id, ts)
SETTINGS index_granularity = 8192;

Reading this top to bottom: rows are bucketed into monthly partitions; within each part, rows are sorted by (user_id, ts); and within each part, the sparse primary index has one entry every 8192 rows. The PARTITION BY expression is what makes a DROP PARTITION '202405' a near-instant metadata operation rather than a row scan β€” it just unlinks the relevant part directories.

Sparse primary index, skipping, and projections

This is the most-misunderstood ClickHouse mechanic, and the source of most of its scan-time advantage. The "primary index" in ClickHouse is not what a B-tree-using database means by the term. It is sparse: only the first row of each granule produces an index entry. A table with 8.1 million rows and the default granularity has roughly a thousand index entries β€” small enough to keep entirely in RAM, on every server, for every part of every table. That fact is the load-bearing one. The index never needs to be paged in or out, never spills, never grows linearly with row count.

The lookup uses two algorithms depending on which key column the query filters on. For a predicate on the leading key column, ClickHouse does a binary search over the in-memory index entries β€” O(log2 n) comparisons β€” and returns the range of granules that could contain matching rows. For a predicate on a non-leading column, it falls back to a "generic exclusion search" that walks the index using ternary logic (all-match, none-match, mixed) and only prunes granules where it can prove no row matches. The practical implication: the leftmost columns of ORDER BY are the ones that earn their keep. Common practice is to order from low to high cardinality (e.g. ORDER BY (tenant_id, customer_id, ts)) so each successive column further narrows the candidate granules of its predecessors.

Sparse primary index (in RAM) first PK value per granule Β· ~1000 entries / 8.1M rows Β· binary search on leading column [0]β†’g0 [1]β†’g1 [2]β†’g2 … [497]β†’g497 [498]β†’g498 … [999]β†’g999 Skipping indexes (per granule, per indexed column) minmax set bloom_filter text (full-text) Granules selected for read ~2 of ~30 granules survive pruning β†’ read set is <10% of part
The PK narrows the candidate range; skipping indexes prune further per granule; the executor reads only the surviving granules' compressed bytes.

Skipping indexes

A skipping index is a secondary structure stored alongside the part (skp_idx_*.idx + skp_idx_*.mrk2) that summarizes one column per granule. At query time, ClickHouse evaluates the predicate against the index summary; if the summary proves no row in the granule can match, the granule is skipped. The names are old enough that the documentation calls them data skipping indexes; the practical effect is the same as zone maps or block-level bloom filters in other systems.

KindWhat it storesGood for
minmaxMin and max value per granuleLoosely-sorted numeric or date columns
setUp to N distinct values per granule (set(N))Low-cardinality categoricals clustered in granules
bloom_filterBloom filter per granule (configurable false-positive rate)Equality / IN-set predicates on high-cardinality columns
textTokenized inverted index per granule (replaces deprecated tokenbf_v1 / ngrambf_v1)Full-text search and LIKE queries
vector_similarityHNSW graph per granule on Array(Float32) columnsApproximate-nearest-neighbor (see below)

Two gotchas to internalize. First, skipping indexes only apply to new data: existing parts are not retroactively indexed unless you run ALTER TABLE … MATERIALIZE INDEX. Second, an index that doesn't actually prune granules is pure overhead β€” both at write time (computing the summary) and at read time (evaluating the predicate). A bloom filter on a column that the planner could already exclude via the PK is a tax, not a help. Measure.

Vocabulary

ORDER BY chooses the sort within a part β€” and therefore the compression locality. PRIMARY KEY chooses what gets the in-RAM index. They are usually the same tuple, and if you specify only ORDER BY, ClickHouse derives a primary key from it. The case where they differ: PRIMARY KEY can be a prefix of ORDER BY, when you want to limit RAM use without changing the sort.

Projections

A projection is an alternative sort order β€” or a pre-aggregated rollup β€” maintained inside each part. The planner evaluates which projection (if any) would do the least work for a query and silently substitutes it. The mental model is "covering index, but at part granularity": projections share the part's lifecycle (created with the part, dropped with the part, merged when the part merges), and so they are always consistent with the base data without external rebuild. Trade: every projection costs you on write, since each insert must produce all of the part's projections in addition to the base sort.

Column statistics and vector ANN

For the new cost-based planner, ClickHouse maintains optional column statistics (HyperLogLog sketches for cardinality, t-digest for percentiles) that feed join-order and selectivity decisions. These are still being rolled out in 24.x–26.x and are not yet always-on.

Approximate vector search is exposed as a skipping-index variant: INDEX … TYPE vector_similarity('hnsw', 'cosineDistance') on an Array(Float32) column builds a per-granule HNSW graph (via the embedded HNSW implementation from USearch). It supports L2, cosine, and dot-product distances, fixed dimensions per index, and a much coarser default granularity (~100M rows) than ordinary skipping indexes. It is a useful colocation feature β€” vectors next to the rows they describe, queried with normal SQL β€” but is not designed to compete with purpose-built vector databases on QPS or at the high-billions scale.

Vectorized execution: chunks, SIMD, JIT

If the storage layer is "fewer bytes off disk," the execution layer is "fewer cycles per byte." ClickHouse approaches this with the same trick as MonetDB/X100: operators consume and produce chunks of columns (typically thousands of values at a time), not rows. A Volcano-style pull pipeline that returns one row per next() call would spend most of its cycles on dispatch overhead. A chunk-at-a-time pipeline turns those cycles into useful work β€” every operator is a tight inner loop over an array of values, the CPU prefetcher sees a predictable access pattern, and the same loops can be auto-vectorized or hand-written with SIMD intrinsics.

The pipeline itself is a graph of processors, each a small state machine in one of three states: need-chunk (waiting for input), ready (has work to do), and done. Worker threads traverse the graph, picking up ready processors and advancing them. Adjacent same-lane processors get a scheduling hint so cache lines stay warm. Some operators are pipeline breakers β€” aggregations, sorts, the build side of a hash join β€” that force the current pipeline to drain before downstream work can start; others (filters, projections, the probe side of a hash join) pass chunks straight through.

SQL SELECT p99(latency_ms) FROM events … AST query tree (new analyzer) logical plan + optimize Physical pipeline: processor graph ReadFromMergeTree (parallel lanes) Filter (vectorized) Projection (JIT-fused) Aggregate (pipeline breaker) Sink β†’ client chunks of columns (β‰ˆ8K rows Γ— N cols) flow left-to-right; arrows are chunks, not rows. Hot path β†’ SIMD kernel chosen at startup by cpuid: SSE 4.2 β†’ AVX2 β†’ AVX-512. β†’ LLVM JIT fuses adjacent expressions (a*b + c) into a single compiled function, cached.
A query's life. SQL becomes a processor graph; chunks of columns flow through it; SIMD and JIT are applied where they pay off.

Two more pieces of the execution story are worth knowing. First, ClickHouse maintains over 30 hash-table variants, templated by key shape and cardinality, for joins and GROUP BY. The runtime picks the right one β€” a two-level hash table for high-cardinality grouping, a flat lookup table for tiny key spaces, a string-specialized table with embedded hashes for short string keys. This shows up as a measurable advantage on cardinality-heavy aggregations.

Second, joins now spill. The grace hash join triggers automatically when the build-side hash table would exceed max_bytes_ratio_before_external_join (default 0.5 β€” i.e. half of available memory). It partitions the build side to disk in non-blocking shared partitions and continues the probe in streaming fashion. The result is that a join whose right side doesn't fit in RAM no longer OOMs; it merely slows down. Combined with the new analyzer's better cost estimates, joins are the area of ClickHouse that has improved most in the 24.x–26.x window.

New analyzer

The query analyzer is in the middle of a long-running rewrite β€” from a legacy AST-walking pass to a typed "query tree" representation that can run more aggressive optimizations (filter pushdown, common-subexpression elimination, monotonicity-aware index reads). It has been enabled by default since 24.4 and is gradually subsuming the legacy interpreter. Some edge cases still require allow_experimental_analyzer = 0 to fall back; the analyzer_compatibility_* settings document what's known not to round-trip.

Distributed query: shards and two-stage aggregation

ClickHouse scales horizontally by sharding β€” splitting a table across multiple servers β€” and within each shard, replicating for redundancy and read parallelism. The Distributed table engine is the abstraction the client talks to. It holds no data itself; it is a thin routing layer that knows the cluster topology (from config.xml / remote_servers) and translates a query against the Distributed table into a fan-out of subqueries against the underlying shard-local tables. The shard a row lands on is decided by a sharding expression β€” typically cityHash64(user_id) or similar β€” modulo the total shard weight.

The interesting part is what happens for queries with GROUP BY. A naive fan-out would have each shard return its rows to an initiator, which would do all the work. ClickHouse instead performs two-stage aggregation: each shard computes intermediate aggregate states, ships only those states to the initiator, and the initiator merges them. For avg, the intermediate state is the pair (sum, count), not the final average; for uniqExact, it's a hash set; for quantilesTDigest, it's a t-digest sketch. The shape is exposed in SQL via the -State / -Merge aggregate function combinators:

-- intermediate state shipped from each shard:
SELECT avgState(latency_ms) FROM events_local GROUP BY region;

-- initiator finalizes:
SELECT avgMerge(s) FROM (/* ... fanned out subquery ... */) GROUP BY region;

You rarely see this written explicitly; the planner does it for you when the query goes through a Distributed table. But understanding the shape matters when you build incremental materialized views β€” the target table of an AggregatingMergeTree stores the same intermediate states, and the same -Merge functions are how you read them back.

ClickHouse Keeper (Raft quorum) DDL log Β· replication log Β· part manifests client shard A shard B shard C replica A1 (initiator) routes + finalizes events_local replica B1 events_local replica C1 events_local replica A2 events_local replica B2 events_local replica C2 events_local Query lifecycle 1. Client sends SELECT to the initiator (any replica that owns the Distributed table). 2. Initiator sends shard-local subqueries (one per shard, one chosen replica per shard) β†’ gray arrows. 3. Each shard's replica computes partial aggregate states locally and returns them β†’ blue arrows. 4. Initiator applies the -Merge finalizer per group, returns to client. 5. INSERTs (not shown): Distributed picks a shard per row; ReplicatedMergeTree handles intra-shard replication via Keeper.
Cluster anatomy. Keeper (dashed) sequences metadata; data flows (solid) move through the initiator, fanned out to one chosen replica per shard.

A few practical details. The recommended INSERT path through a Distributed table is to set internal_replication = true, which tells the Distributed engine to write to one replica per shard and let ReplicatedMergeTree handle the within-shard fan-out via Keeper. The alternative β€” false, which has the Distributed engine itself replicate writes to every replica β€” exists for historical reasons and is rarely the right choice today. Shard topology is static: there is no automatic rebalancing on node add. Re-sharding is a manual operation (insert into a new cluster, swap atomically, drop the old).

Parallel replicas, added in the 23.x–24.x window, push this further. The work for a single query is divided at granule granularity across the replicas of one shard, with the initiator acting as a coordinator and other replicas as workers that pull granule ranges off a shared queue. Faster replicas steal slower replicas' work; the cache locality of "this part lives on this replica" is preserved via a consistent-hashed assignment. The feature is best for read-heavy workloads with large per-shard data; it explicitly does not yet help queries that use FINAL, queries that depend on projections, or queries whose coordination overhead exceeds the speedup (small queries, complex joins).

Tradeoff to know

Static topology is a real constraint. Adding a shard does not automatically move data to it β€” new writes flow there proportionally, but existing data sits where it was. For a workload that needs to scale compute elastically without re-sharding pain, ClickHouse Cloud's SharedMergeTree (next section but one) is the answer; on open-source, plan capacity carefully.

Replication and ClickHouse Keeper

Within a shard, redundancy is provided by ReplicatedMergeTree. Every state-changing operation β€” INSERT, MERGE, ALTER, MUTATION β€” is recorded as an entry in a replication log stored in ClickHouse Keeper. Each replica subscribes to the log, replays entries asynchronously, and either applies them locally (rerunning a merge, executing an ALTER) or fetches the resulting part from a peer (saving redundant CPU on heavy merges). Replicas are not "primary" and "secondary" β€” any replica can accept writes, and writes propagate via the log.

Two defaults define the consistency profile. First, an INSERT returns success after exactly one replica has the data; replication catches up asynchronously. This is good for ingest throughput and bad if you read your own writes from a different replica milliseconds later. The insert_quorum setting strengthens this: insert_quorum = 2 waits until two replicas have the data; insert_quorum = 'auto' waits for a majority. Second, recovery is automatic for small divergences: on startup, a replica validates its local parts against the Keeper part manifest and fetches anything it's missing.

The coordination service is itself worth understanding. ClickHouse Keeper is a C++ reimplementation of ZooKeeper using the embedded NuRaft Raft library. It exposes ZK's wire protocol (so existing ZK clients and tooling work) but is dramatically lighter on memory and faster on coordination-heavy workloads. The official guidance is to deploy 3 Keeper nodes on dedicated hosts in production β€” small VMs are fine; what matters is that Keeper isn't competing for resources with the ClickHouse server processes it coordinates.

OSS replication vs. Cloud replication, at a glance

PropertyOSS ReplicatedMergeTreeCloud SharedMergeTree
Where parts liveLocal disk of every replicaObject storage; compute holds only cache
How data propagatesReplica fetches part from peer (or re-merges locally)All replicas see the same S3 objects; no peer fetch
Practical replica countSmall (3–10); replication log overhead growsDesigned for hundreds per table
Default quorum1 replica acks; tunable via insert_quorumAutomatic quorum (write commits to Keeper)
Stateful compute?Yes β€” each replica owns its diskNo β€” compute is stateless, scales to zero
Mental model

Keeper is to ClickHouse what etcd is to Kubernetes: a small consensus service that the whole rest of the system depends on for correctness. Most of the time you don't think about it. When Keeper is degraded β€” slow leader election, disk full, network partition β€” the symptoms appear everywhere: stalled INSERTs, stuck merges, failing DDL. Treat Keeper sizing and monitoring as first-class operational work, not an afterthought.

Compression and ingestion

Compression

Each column is independently compressed, by a pipeline of one or more codecs that can be chained together. The default in open-source ClickHouse has historically been LZ4 (fast, modest ratio); ClickHouse Cloud's default is ZSTD(1), which is the right balance for most workloads. Heavier ZSTD levels rarely pay back the compression-time cost. The codec is set per-column in the DDL:

CREATE TABLE metrics
(
    host  LowCardinality(String)            CODEC(ZSTD(1)),
    ts    DateTime64(3)                      CODEC(Delta, ZSTD(1)),
    value Float64                            CODEC(Gorilla, LZ4)
) ENGINE = MergeTree() ORDER BY (host, ts);

Codec families fall into a few groups:

FamilyExamplesGood for
GenericLZ4, ZSTD(N)Almost everything; sane default per column
Sorted-integer / monotonicDelta, DoubleDelta, GCD, T64Timestamps, monotonic IDs, sequence numbers
Floating-point time seriesGorilla, FPC, ALPGauges, metric values, measurements
EncryptionAES_128_GCM_SIV, AES_256_GCM_SIVAt-rest column-level encryption
Dictionary (wrapper, not a codec)LowCardinality(String)Repeating high-cardinality-but-bounded strings (countries, status codes, user agents)

Chained codecs are written outer-to-inner left-to-right: CODEC(Delta, ZSTD(1)) means "compute deltas first, then ZSTD the deltas." For a typical metrics column, the delta of consecutive timestamps is near-zero, so ZSTD's dictionary can pack the result extremely tightly. The combinations matter β€” ZSTD alone on a monotonic timestamp column will typically compress 5–10Γ— worse than Delta + ZSTD.

LowCardinality(T) deserves a special note. It dictionary-encodes the column: a small per-part dictionary maps each distinct value to an integer code, and the column file stores the codes. For columns with millions of rows but only thousands of distinct values, it's a massive storage and CPU win β€” equality comparisons compare integers, not strings. The break-even cardinality is roughly 10⁢ distinct values; above that, the dictionary itself becomes expensive.

Ingestion

The standard write path is a synchronous INSERT: the client sends a batch of rows, ClickHouse writes one new part to disk, the part is announced to Keeper (for replicated tables), and the call returns. One part per INSERT is the rule, and it is why "small frequent inserts" is the canonical anti-pattern: every INSERT pays the fixed cost of allocating a new directory, computing per-column compression, writing mark and index files, and registering the part. The standard advice is to batch β€” at least 1000 rows, ideally 10,000–100,000.

When the writer is a fleet of small agents that can't batch on their own β€” telemetry sidecars, IoT devices, agent log shippers β€” the answer is asynchronous inserts (async_insert = 1). ClickHouse buffers incoming rows server-side and writes one part per coalesced batch. Buffers flush on three triggers: byte size (100 MiB by default), age (200 ms by default; 1000 ms on Cloud), or query count (450). Since 24.2 the byte and time thresholds are adaptive: when load is light, ClickHouse waits longer to coalesce more rows; under bursty load, it flushes sooner. wait_for_async_insert = 1 (default) blocks the client until the part is durably written; = 0 is fire-and-forget and trades durability for latency.

Both INSERT paths support automatic deduplication. ClickHouse hashes the contents of recent parts (~100 by default) and silently skips an incoming INSERT whose payload matches one of them. The window means a client can safely retry an INSERT after a timeout without producing a duplicate part β€” useful for at-least-once delivery from queueing systems like Kafka.

Materialized views

Two flavors, and they look very different at runtime.

An incremental materialized view is an insert-time trigger. When a row is written to the source table, the view's SELECT runs against just the new rows (not the whole table) and the result is inserted into a target table. The combination of an AggregatingMergeTree target with -State aggregate functions produces the same shape as a distributed query's intermediate state β€” and reading the rollup later uses the matching -Merge finalizers. The result is a real-time aggregate that's never wrong, never out of date, and never needs a nightly batch job.

A refreshable materialized view, added more recently, runs on a schedule (REFRESH EVERY 1 MINUTE) and replaces its target table's contents atomically. This is the right shape for joins that can't be incrementalized, for full-table rebuilds, and for periodic snapshots. It is closer to a Spark/Airflow job than to a trigger.

Mutations and what changed since 2023

ClickHouse's mutation story used to be simple and harsh: ALTER TABLE … DELETE WHERE … and ALTER TABLE … UPDATE … WHERE … are mutations that rewrite affected parts entirely. They are correct but slow β€” for a delete that touches 0.001% of a 50 TB table, you still rewrite every part containing a deleted row. For analytical workloads where deletes are rare (GDPR erasures, regulatory holds, very occasional corrections), this is fine. For workloads with more frequent mutation, it isn't.

Lightweight delete

DELETE FROM … WHERE … (without ALTER) is the modern lightweight path. Instead of rewriting parts, ClickHouse maintains a hidden _row_exists bitmap column per part; a delete just flips bits. Reads transparently filter out flipped rows. Physical removal happens lazily, during the next background merge that touches the affected part. The trade: SELECTs against tables with many lightweight deletes pay a small per-granule cost to evaluate the bitmap, and skipping indexes are not yet supported on patched columns.

Lightweight update (beta)

A 2024+ feature, still flagged as experimental on most engines. Rather than rewrite the affected parts, ClickHouse writes a patch part that contains only the changed column values for the changed rows. At read time, the patch part is overlaid on the base part β€” the reader applies patches transparently. Background merges eventually fold the patch back into a full rewritten part. The advantage over mutations is that the update cost is proportional to the rows updated, not the part size. Limitations are non-trivial: skipping indexes on patched columns don't help, the feature requires per-table enabling (enable_block_number_column + enable_block_offset_column), and frequent updates can pile up patch parts that slow down reads until they're merged.

JSON and Dynamic types

The native JSON type (different from the older Object('json')) stores semi-structured documents by splitting each detected path into its own subcolumn. Querying events.json.user_id reads only the user_id subcolumn, not the whole JSON blob. ClickHouse handles up to 1024 paths per block by default; paths beyond that fall back to a shared binary representation. The Dynamic type generalizes this further β€” a column that can hold any type, with the top max_types = 32 distinct types getting subcolumn treatment. Combined, they let you build observability or event-store schemas without committing to a fixed structure.

S3-backed MergeTree (open-source)

Even outside Cloud, MergeTree parts can live on object storage. The S3 disk type plus storage policies let you tier hot parts on local NVMe and cold parts on S3, with parts moving automatically based on age (TTL TO DISK 's3') or size. Reads do asynchronous prefetch and an optional local filesystem cache. It is slower than local NVMe but cheap, and a single-tier "all S3" deployment is the on-prem analog of the Cloud product.

Query cache

The query cache (introduced in 23.x) caches materialized result sets keyed by the query's normalized AST. A subsequent identical query within the TTL window (default 60 s) skips execution entirely. It is not invalidated on writes β€” a stale result will be served for up to query_cache_ttl seconds after the underlying table changes. This is fine for dashboarding queries where 60 s of staleness is acceptable; it is a footgun for operational queries that need to see fresh data. Non-deterministic functions (now(), rand()) cause the cache to be skipped automatically. User-scoped row policies are not shared across users.

Footgun

Treat the query cache as a read-mostly optimization for slow-changing data. If your tables receive ingest at sub-minute cadence and downstream readers expect fresh data, leave use_query_cache = 0. Do not enable it globally and assume the TTL is "short enough."

ClickHouse Cloud: separated compute and storage

The hosted ClickHouse Cloud product, announced in 2022 and generally available since 2023, replaces the open-source replication story with a different architecture. It is the same SQL surface, the same MergeTree engine family at a SQL level, but underneath, the storage and replication layers are completely different. For system-design purposes it pattern-matches to Snowflake: storage in object storage, metadata in a small consensus service, stateless compute that scales independently.

The cloud-native engine is SharedMergeTree, which replaces ReplicatedMergeTree for tables in Cloud. The key shifts:

  • Parts live exclusively in object storage (S3 / GCS / Azure Blob). There is no per-replica copy of part data. A replica reading a part fetches it from S3, optionally caching the bytes locally.
  • Metadata lives only in Keeper. The Keeper-stored part manifest is the source of truth for which parts exist, what's in them, and what's been deleted. Compute nodes hold no persistent state beyond an in-memory cache view.
  • Compute is stateless. A compute node has CPU and RAM and a local filesystem cache, but no local disks holding table data. It can be added, removed, restarted, or migrated to a different host with zero data movement.
  • Insert acknowledgement is metadata-grounded. An INSERT commits when (a) the new part has been written to object storage and (b) the part record has been replicated to a Keeper quorum. Local disk fsync is not on the durability path.
  • Quorum inserts are automatic. There is no insert_quorum knob β€” the Keeper-quorum write is the quorum.
Service A (stateless compute) 3 compute nodes Β· auto-scale 0–N node 1 node 2 node 3 Service B (stateless compute) isolated CPU/RAM Β· shared storage node 1 node 2 … Keeper metadata part manifest DDL log quorum acks Distributed cache service (consistent-hashed, network-accessible) survives node restart Β· shared across compute nodes in a zone Object storage (S3 / GCS / Azure Blob) all parts Β· all replicas see the same bytes Β· 11 nines durability Insert ack = part written to S3 βœ“ + metadata committed to Keeper βœ“ Β· no local disk fsync Β· scale to zero in seconds.
ClickHouse Cloud. Compute is stateless; durability lives in S3 + Keeper. The distributed cache sits between, so cache state survives compute restarts.

Three knock-on architectural consequences are worth knowing.

First, read latency from S3 would be a problem β€” a 50 ms RTT to object storage is fine for a batch query but unacceptable for a sub-second OLAP query. ClickHouse Cloud addresses this with a distributed cache service: a network-accessible, consistent-hashed cache that sits between compute and S3. Cached blocks survive compute restarts (because they live in the cache service, not the compute pod), and the consistent hash means that adding or removing compute nodes doesn't invalidate the cache.

Second, compute groups can share storage. A "warehouse" in Cloud terminology is a group of compute services that share the same object-storage folder. You can have a read-write service for ingestion sized for write throughput, plus a read-only service sized for analytical queries, plus a dev/staging service β€” all over the same underlying data, with independent CPU/RAM budgets. The one operational subtlety is that multiple read-write services in the same warehouse can contend on background merges; that's the documented constraint, and the standard pattern is one read-write service plus N read-only services.

Third, elasticity is real. Because compute is stateless, scaling a service down to zero between queries is a single-digit-second operation; scaling up does not require warm-up or data sync; adding more compute does not require re-sharding. This is what makes the Cloud product the closest analog in the OLAP world to serverless data warehouses.

Contrast with OSS

In open-source ClickHouse, a replica that wants to serve queries first needs the parts β€” physically β€” on its local disk. Add a replica, ship it the parts. In Cloud, a compute node just needs network access to S3 and Keeper; the parts are already there. This is what lets Cloud scale to hundreds of replicas per table where ReplicatedMergeTree would saturate the replication log.

vs Snowflake, BigQuery, Druid / Pinot, DuckDB

These systems get compared often and miscompared often. They're not all the same shape β€” a hosted warehouse, an embedded analytical library, and a real-time slice-and-dice engine solve different problems, even if all of them are "columnar OLAP." A side-by-side helps:

ClickHouseSnowflakeBigQueryDruid / PinotDuckDB
DeploymentOSS self-host or managed CloudClosed-source SaaS onlyClosed-source SaaS onlyOSS self-host (managed via Imply / StarTree)Embedded library (in-process)
Storage architectureOSS: shared-nothing local parts + replication. Cloud: parts on S3 + Keeper.Compute/storage separated; FDN micro-partitions on cloud blobCompute/storage separated; Capacitor columnar on ColossusSegments on deep storage + local cache; historical / middle-manager tiersSingle process; reads native files, Parquet, CSV
Query engineVectorized + SIMD + LLVM JIT, processor graph, 30+ hash tables, grace-hash spillVectorized; micro-partition pruning; mature optimizerVectorized Dremel; tree of mixers / leavesDruid: real-time + historical tiers. Pinot: column store + star-tree indexVectorized push-based pipeline
IngestBulk + async; one part per INSERT; 20K+ rows/batch recommendedBulk via stage tables; Snowpipe for streamingStreaming inserts + batch loadsStream-native, Kafka-tightAppend via SQL; not designed for streaming
Real-time ingestStrong (async inserts, MVs, Kafka engine)Snowpipe; not the sweet spotStreaming inserts (Storage Write API)Built for it β€” sub-second freshnessNot designed for it
Update semanticsAppend-mostly; lightweight DELETE (bitmap), lightweight UPDATE (patch parts, beta); Replacing / Aggregating for upsertsFull ANSI UPDATE / DELETE / MERGEFull DML; rate-limited on some tablesAppend-only; reingest to updateFull DML
Concurrency modelMany concurrent queries; resource profiles + queuesWarehouse abstraction; auto-suspend; serverless readerSlot-based; serverlessTiered: historical + broker fan-outSingle-process
PricingOSS free; Cloud = compute-second + storage-GBWarehouse-second credits + storageOn-demand bytes scanned, or flat slots + storageSelf-host TCO; managed variesFree OSS
Sweet spotWide-table analytics, real-time observability, time-series, agent telemetryHeterogeneous warehousing, SQL-first BIAd-hoc analytics over GCP data lakes; serverless scaleSlice-and-dice OLAP cubes with sub-second p99Local analytics, notebooks, prototyping
Where it faltersComplex joins (improving), workload isolation, transactional updatesCost at scale; cold starts; vendor lock-inPer-query cost predictability; egress costLimited SQL (Pinot improving); operational complexityNot distributed; not a server
Open source?Apache 2.0NoNoApache 2.0MIT

vs Snowflake

The Cloud product's storage/compute separation lines up closely with Snowflake's. The differences are in the surrounding choices: Snowflake is closed-source SaaS with full ANSI DML and a more polished concurrency story (warehouses, auto-suspend, serverless reader); ClickHouse trades that for raw scan throughput, real-time ingest as a first-class feature, and the optionality of running the open-source build wherever you want. For BI workloads where queries are heterogeneous and the data is heavily relational, Snowflake's optimizer remains more forgiving; for high-volume repetitive aggregations against append-mostly tables, ClickHouse is typically faster and cheaper.

vs BigQuery

BigQuery is serverless-by-design with Google-scale storage underneath (Colossus). You don't manage clusters or warehouses; you pay per query. ClickHouse gives you per-cluster control and predictable cost β€” important when the workload is high-volume rather than ad-hoc. If you have a handful of expensive queries that run thousands of times a day, ClickHouse's amortized cost is dramatically lower; if you have an analyst team that occasionally explores a 100-TB data lake, BigQuery's "no infrastructure" model is hard to beat.

vs Druid / Pinot

These are the closest kin to ClickHouse β€” built for the same shape of workload (high-cardinality time-series, real-time ingest, low-latency slice-and-dice). Druid and Pinot include richer per-segment indexing (star-tree, inverted indexes) and were designed around tightly-coupled streaming ingest from Kafka. ClickHouse arrives at similar latencies with a more general SQL surface, simpler operations (no separate broker / historical / coordinator tiers), and a less opinionated stance on the streaming pipeline. The tradeoff is in the indexes β€” Druid's slice-and-dice cube semantics outperform ClickHouse on certain shapes of low-latency dashboard queries.

vs DuckDB

Not really competitors; both are columnar, both are vectorized, both borrow ideas from each other and from MonetDB. DuckDB is an embedded library β€” it runs in your Python process, your notebook kernel, your dbt job β€” and excels at single-machine analytics over local files (Parquet, CSV, Arrow). ClickHouse is a server. A common pattern is to use DuckDB for ad-hoc exploration and prototyping and ClickHouse for the production data plane that serves the same queries to many users.

Honorable mentions: StarRocks (MPP fork heritage), Apache Doris (similar lineage), Tinybird (managed ClickHouse with a developer-experience layer), MotherDuck (managed DuckDB).

Where ClickHouse fits in AI infrastructure

This section is the one most relevant to system-design interviews at AI-first companies. The short version: most of the data an AI system generates as a byproduct of being an AI system β€” agent traces, prompt/response logs, eval scores, model-serving metrics β€” is shaped exactly like the workload ClickHouse was designed for. None of the rest of the doc has been AI-specific; this section connects the pieces.

1. Why an analytical DB matters for AI workloads

The data produced by AI systems is almost entirely append-only events with mixed-cardinality dimensions: every LLM call generates a row with a model name (low cardinality), a tenant ID (medium), a request ID (high), a timestamp (sortable), and a payload of latency / tokens / cost / outcome metrics. Queries against this data are overwhelmingly aggregations and time-bucketed rollups β€” "p95 latency by model by hour," "token spend by tenant this month," "error rate by tool in the last 5 minutes." This is the shape that columnar OLAP engines win on by an order of magnitude over row-oriented databases, and ClickHouse in particular wins because the workload is also high-cardinality and high-ingest.

2. Agent and LLM observability

The single biggest production use case in the 2024–2026 window. Companies storing every span of an agent's execution β€” LLM call, tool call, retry, error, latency, cost β€” need a backend that ingests fast, compresses tightly, and answers high-cardinality slice queries in milliseconds. Primary-source case studies on the ClickHouse engineering blog: Langfuse (open-source LLM observability, now part of ClickHouse) and Helicone (LLM proxy and analytics) both describe migrating from Postgres to ClickHouse to handle real-time observability over fast-growing trace volumes. The architecture they share is essentially: agent SDK β†’ Kafka or async-insert β†’ ClickHouse β†’ dashboards over materialized views.

3. Real-time eval rollups via incremental materialized views

Eval pipelines produce per-completion judgments (faithfulness, exact-match, toxicity, tool-use correctness). Stored raw in a MergeTree table β€” one row per (eval_id, completion_id, judge, score, timestamp) β€” they get heavy. The leverage is to define an incremental materialized view whose target is a SummingMergeTree or AggregatingMergeTree table keyed on (eval_id, model, hour). Each new completion's row triggers the view, which pushes summary stats β€” mean score, p50 / p95 via t-digest -State, count by tag β€” into the rollup. The result is a real-time eval dashboard that never goes stale and never needs a periodic recompute. Reading the rollup uses the -Merge finalizers from the distributed-query section.

4. Feature store and event log

For ML training, the offline feature store is mostly a wide table of (entity_id, timestamp, feature_value) tuples. PARTITION BY toYYYYMM(timestamp) ORDER BY (entity_id, timestamp) plus skipping indexes on entity_id makes point-in-time training-set construction efficient β€” given a list of (entity, prediction_time) pairs, find the most recent feature value per entity at or before the prediction time. ClickHouse is not a replacement for Feast or Tecton (which handle online serving, feature versioning, and dual-write consistency), but it's a frequently-used ground-truth event log underneath those systems.

5. Embeddings and vector search

ClickHouse's vector_similarity index (HNSW via the embedded USearch library) is a pragmatic choice when you want to colocate embeddings with the metadata they describe and do hybrid SQL+vector queries β€” "give me the 10 nearest embeddings to this query vector, filtered to documents owned by tenant X, created in the last 7 days." It is not designed to outscale dedicated vector databases (Pinecone, Qdrant, Weaviate) on pure-vector QPS, and the default granularity for the index is intentionally coarse (~100M rows). The right question to ask in an interview is "where would you put the vector index" β€” colocated with the metadata in the same table is often the answer for hybrid workloads; dedicated vector store for pure similarity at scale.

6. Cost characteristics on repetitive log data

The single most underrated property for AI workloads. Agent telemetry is repetitive β€” model names, tool names, tenant IDs, status codes are all small dictionaries. With LowCardinality(String) on those columns and ZSTD(1) on free-text fields (prompts, responses, error messages), 5–20Γ— compression ratios are routine. Combined with object-storage-backed MergeTree (open-source S3 disk, or ClickHouse Cloud's SharedMergeTree), the marginal cost of storing months of complete agent traces becomes small enough that "store everything by default" stops being an extravagance and becomes the default. That changes how teams think about observability budgets, debugging, and retraining data.

The framing from the introduction holds: an analytical database is a compression engine plus a scan engine. AI workloads β€” high volume, repetitive structure, append-mostly, queried analytically β€” happen to fit the shape of both engines extremely well.

Reading the source

To verify any claim in this doc against the real implementation, here's where to look in ClickHouse/ClickHouse:

  • src/Storages/MergeTree/MergeTreeData.{h,cpp} β€” the main parts manager: lifecycle, attach, detach, mutation queue.
  • src/Storages/MergeTree/IMergeTreeDataPart.h β€” the abstract part interface. Concrete subclasses for wide vs compact format.
  • src/Storages/MergeTree/Compaction/MergeSelectors/SimpleMergeSelector.h β€” the default size-based merge selector. TTL and "all" variants are siblings.
  • src/Storages/MergeTree/KeyCondition.{h,cpp} β€” primary-key condition analysis and granule pruning. Worth reading carefully; this is where the index magic happens.
  • src/Storages/MergeTree/MergeTreeIndices.h β€” the registry and factory for skipping-index kinds. One file per kind: MergeTreeIndexMinMax.h, MergeTreeIndexBloomFilter.h, etc.
  • src/Storages/MergeTree/MergeTreeIndexVectorSimilarity.h β€” the vector ANN index. Thin wrapper over USearch.
  • src/Processors/IProcessor.h β€” the base class for the chunk-at-a-time processor graph. Every operator inherits from this.
  • src/QueryPipeline/QueryPipeline.h + QueryPipelineBuilder.h β€” how a logical plan turns into a runnable processor graph.
  • src/Storages/StorageDistributed.h + src/QueryPipeline/RemoteQueryExecutor.h β€” distributed query: routing, fan-out, intermediate-state collection.
  • src/Storages/StorageReplicatedMergeTree.h β€” the replication state machine. Reads the Keeper log, applies entries, fetches parts.
  • src/Coordination/KeeperServer.h + KeeperStateMachine.h β€” the Keeper service itself. Uses NuRaft from contrib/NuRaft/.
  • src/Compression/ICompressionCodec.h + per-codec CompressionCodec*.cpp β€” codec implementations. Short and focused, one file per codec.
  • src/Interpreters/AsynchronousInsertQueue.{h,cpp} β€” server-side coalescing of async inserts.
  • src/Analyzer/ + src/Planner/ β€” the new query analyzer (query-tree) and planner. src/Planner/README.md is the design doc.
  • CHANGELOG.md at the repo root β€” the most reliable source of "what shipped when" for the post-2023 features described above.

Further reading

  • ClickHouse β€” Lightning Fast Analytics for Everyone. Schulze, Schreiber, Yatsishin, Dahimene, and Milovidov. VLDB 2024. PDF. The authoritative description of the architecture, written by the engineers who built it. The honest place to read about merge thresholds, hash-table counts, ClickBench numbers.
  • clickhouse.com/docs β€” the official documentation. Engine semantics, SQL reference, settings catalog. Where to verify defaults.
  • clickhouse.com/blog β€” the ClickHouse engineering blog. Filter for posts by ClickHouse engineers (SharedMergeTree, distributed cache, warehouses, parallel replicas posts are the canonical references for those features).
  • CHANGELOG.md in the repo β€” the only consistently up-to-date source for "what changed in release N." Skim the last 12 months when you suspect a feature has moved.
  • docs/development/architecture.md β€” the in-tree architecture overview. Short, opinionated, and useful as a sanity check.