We Switched From ClickHouse to Postgres

About 6 months ago, Kanes Rakeshan and I started building Vannamayil, an e-commerce platform for a clothing store in Nelliady, Jaffna.

Analytics mattered from the start. The questions were simple:

“How many units of product X sold last month?”
“What’s the revenue trend across our top 10 products?”

We were expecting at least 100,000 events per month. We looked at Google Analytics, PostHog, Umami. Self-hosting seemed cheaper. ClickHouse felt like the right call. We’d never used it before, though.

ClickHouse is a columnar OLAP database built for analytical queries. Billion-row aggregations in milliseconds. I’d seen the benchmarks. I was sold.

# Setting Up ClickHouse

We modeled everything around it: products, inventory levels, order line items. Wide, denormalized schema, the way ClickHouse wants it. MergeTree tables, good sorting keys. We felt pretty good.

# The Memory Problem

Development took about 4 months. We deployed on AWS, on a t3.micro with 1GB of memory.

It hit memory limits almost immediately. The SSH daemon crashed, locking us out of the box. After a few failed attempts we got proper memory limits set. Once we did, it was obvious ClickHouse was the problem: it was sitting at 80% CPU and 100% memory.

We scaled up: 2GB, then 4GB, 8GB, finally 16GB. Even at 16GB, ClickHouse consumed all of it with zero traffic. Our initial 50GB EBS volume had 6GB gone with no users at all.

# Trying to Fix It

I went through every ClickHouse config option I could find. Memory limits, concurrent job limits, CPU limits. I tweaked everything. Nothing worked.

The initial setup wrote analytics events synchronously. I switched to async batched writes using an in-memory buffer. Memory dropped a bit. Not enough to matter.
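A minimal sketch of that kind of buffered writer, assuming a hypothetical event shape and a caller-supplied `writeBatch` callback that performs the actual bulk insert. The names and the default batch size are illustrative only:

```typescript
// Hypothetical event shape; the real schema is not shown in this post.
interface AnalyticsEvent {
  name: string;
  productId: string;
  occurredAt: Date;
}

// Whatever performs the actual bulk insert (assumed to be supplied by the caller).
type BatchWriter = (batch: AnalyticsEvent[]) => Promise<void>;

class BufferedEventWriter {
  private buffer: AnalyticsEvent[] = [];
  private readonly writeBatch: BatchWriter;
  private readonly maxBatchSize: number;

  constructor(writeBatch: BatchWriter, maxBatchSize: number = 500) {
    this.writeBatch = writeBatch;
    this.maxBatchSize = maxBatchSize;
  }

  // Number of events waiting to be written.
  get pending(): number {
    return this.buffer.length;
  }

  // Queue an event in memory; write the whole batch once the buffer is full.
  async track(event: AnalyticsEvent): Promise<void> {
    this.buffer.push(event);
    if (this.buffer.length >= this.maxBatchSize) {
      await this.flush();
    }
  }

  // Write everything queued so far as a single batch.
  async flush(): Promise<void> {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    try {
      await this.writeBatch(batch);
    } catch (err) {
      // Put the failed batch back in front so a later flush can retry it.
      this.buffer = batch.concat(this.buffer);
      throw err;
    }
  }
}
```

A real service would also call `flush()` on a timer and on shutdown, so a quiet period does not leave events sitting in memory.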

I tried different table engines too. ReplacingMergeTree looked good for inventory updates but introduced consistency problems that needed careful query patterns to work around. SummingMergeTree helped with some aggregations but made ad-hoc queries worse.

I added monthly partitioning to speed up time-range queries. Small improvement. Not enough.

I rewrote aggregation queries to use pre-computed materialized views. That helped specific dashboards but added more things to maintain and more places for bugs.

After each round I’d measure again. Numbers improved slightly. But the complexity kept growing with each fix.

# The Actual Problem

At some point I stopped tuning and started reading.

ClickHouse is what Cloudflare uses to query trillions of DNS records. Uber runs it to analyze billions of trips. ByteDance built their entire analytics stack on it. It is designed for enormous data volumes, and for queries that need to scan most of that data every time.

We had maybe 100,000 events per month. That’s not even a rounding error for what ClickHouse is built for.
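To put that number in perspective, here is the back-of-the-envelope arithmetic, assuming a 30-day month and evenly spread traffic (real traffic is burstier, so treat this as an average only):

```typescript
// Assumption: 30-day month, events spread evenly over time.
const eventsPerMonth = 100000;
const secondsPerMonth = 30 * 24 * 60 * 60; // 2,592,000 seconds

const eventsPerSecond = eventsPerMonth / secondsPerMonth;
const secondsBetweenEvents = secondsPerMonth / eventsPerMonth;

console.log(eventsPerSecond.toFixed(3)); // "0.039"
console.log(secondsBetweenEvents.toFixed(2)); // "25.92"
```

On average that is roughly one event every 26 seconds.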

On top of that, most of our queries were plain lookups: fetch a product by ID, update an inventory count, insert an order, pull up a customer’s recent orders. The analytical ones, such as revenue by date range or units sold per product, were maybe 15% of total volume. And they weren’t complicated. Postgres handles that fine with a couple of indexes.

We had picked a database built for a problem ten thousand times larger than ours.

# The Migration

So we decided to replace ClickHouse with Postgres. The site was already deployed at this point, and after deployment any migration or refactor has to be planned carefully. We couldn’t just do it in a single commit in a single night.

We first switched to dual writes while keeping the read path on ClickHouse. All analytics events were written to both ClickHouse and Postgres, and analytics queries were still served from ClickHouse. This let us validate that the Postgres write path was correct.
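A rough sketch of what a dual-write step can look like. The `EventSink` interface, the event shape, and the error-handling policy (shadow failures are reported but never fail the request) are assumptions made for illustration, not a description of the production code:

```typescript
// Hypothetical event and sink types, for illustration only.
interface AnalyticsEvent {
  name: string;
  productId: string;
  occurredAt: Date;
}

interface EventSink {
  write(event: AnalyticsEvent): Promise<void>;
}

type ShadowErrorHandler = (error: unknown, event: AnalyticsEvent) => void;

// Write each event to the primary store (ClickHouse, still serving reads)
// and to the shadow store (Postgres, still being validated).
// Assumes both sinks report failures by rejecting, not by throwing synchronously.
async function dualWrite(
  primary: EventSink,
  shadow: EventSink,
  event: AnalyticsEvent,
  onShadowError: ShadowErrorHandler,
): Promise<void> {
  // Start both writes before awaiting either one.
  const primaryWrite = primary.write(event);
  // Assumed policy: a shadow failure is reported, not propagated.
  const shadowWrite = shadow
    .write(event)
    .catch((error: unknown) => onShadowError(error, event));

  await primaryWrite; // a primary failure still propagates to the caller
  await shadowWrite;
}
```

Keeping the shadow store's failures away from the caller means a bug in the new write path cannot take down the path that is still serving users.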

Then we added dual reads, which is less common. In the admin dashboard, we added a selector to switch between ClickHouse and Postgres, so we could check whether both returned the same analytics results. At first, they didn’t. We found minor bugs with date ranges and numeric precision. Once those were fixed, the results matched.
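The comparison itself was done by eye in the dashboard, but the same check can be automated. Below is a sketch under assumed names: `ProductStatsRow`, `compareStats`, and the half-cent revenue tolerance are all hypothetical, chosen to show how a numeric-precision difference can be separated from a real mismatch:

```typescript
// Hypothetical per-product result row returned by either database.
interface ProductStatsRow {
  productId: string;
  unitsSold: number;
  revenue: number;
}

interface Mismatch {
  productId: string;
  reason: string;
}

// Compare two result sets keyed by productId. Unit counts must match exactly;
// revenue may differ by at most `revenueTolerance` (assumed: half a cent).
function compareStats(
  fromClickHouse: ProductStatsRow[],
  fromPostgres: ProductStatsRow[],
  revenueTolerance: number = 0.005,
): Mismatch[] {
  const mismatches: Mismatch[] = [];
  const pgByProduct = new Map<string, ProductStatsRow>();
  for (const row of fromPostgres) {
    pgByProduct.set(row.productId, row);
  }

  for (const ch of fromClickHouse) {
    const pg = pgByProduct.get(ch.productId);
    if (pg === undefined) {
      mismatches.push({ productId: ch.productId, reason: "missing in Postgres" });
      continue;
    }
    pgByProduct.delete(ch.productId);
    if (ch.unitsSold !== pg.unitsSold) {
      mismatches.push({
        productId: ch.productId,
        reason: `unitsSold: ${ch.unitsSold} vs ${pg.unitsSold}`,
      });
    }
    if (Math.abs(ch.revenue - pg.revenue) > revenueTolerance) {
      mismatches.push({
        productId: ch.productId,
        reason: `revenue: ${ch.revenue} vs ${pg.revenue}`,
      });
    }
  }

  // Anything left over exists only in the Postgres result.
  pgByProduct.forEach((_row, productId) => {
    mismatches.push({ productId, reason: "missing in ClickHouse" });
  });
  return mismatches;
}
```

An empty array means the two databases agree for that query; anything else points at the product and field to investigate.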

Finally we removed the read and write paths for ClickHouse. Its data was kept for 2 more days. After that, ClickHouse was completely removed from the setup. The migration took about 5 days.

# Payoff

CPU dropped to 3%. Memory dropped to 100 MB. Storage dropped to 50 MB. It became clear that 8 vCPU and 16GB RAM was overkill for our setup.

As a side effect, error messages became legible. Debugging became easy. With Drizzle, we reached 99% type safety for database queries.

# Closing Thoughts

Don’t pick a database based on what it can do at its ceiling. Pick it based on what your queries look like today. ClickHouse is fast for analytical workloads at scale. But fast-at-scale isn’t the same as right-for-you.

And the benchmarks aren’t wrong. ClickHouse really is faster, for the workloads it’s designed for. But benchmarks show the best case, not your case.

Start boring. Add specialized infrastructure when your query patterns give you a concrete reason to, not when a use case sounds plausible. My case was “we might want analytics someday.” That’s a guess.

Two months tuning the wrong database. I’d like those back.
