Ali Ihsan Gunes

Apr 8, 2025 • 4 min read

What is Notion Data Lake? A Deep Dive into Scalable Data Infrastructure

Explore how Notion handles billions of blocks daily with a custom-built data lake and distributed system architecture.

Why Did Notion Need a Data Lake?

Initially, Notion relied on a single PostgreSQL database. This setup worked well until the product reached 100 million users and more than 20 billion blocks. At that point, the monolithic design started causing performance issues such as slow queries and overloaded servers.

Moving from Vertical to Horizontal Scaling with Sharding

To address the bottlenecks, Notion adopted horizontal scaling with sharding: data was partitioned by workspace ID and spread across many physical PostgreSQL instances, each hosting a set of logical shards.
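
To make the routing concrete, here is a minimal Python sketch of workspace-based shard routing; the shard and host counts are illustrative assumptions, not Notion's actual configuration.

```python
import hashlib

NUM_LOGICAL_SHARDS = 480   # illustrative assumption
NUM_PHYSICAL_HOSTS = 32    # illustrative assumption

def logical_shard_for(workspace_id: str) -> int:
    """Hash the workspace ID to a stable logical shard number."""
    digest = hashlib.md5(workspace_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_LOGICAL_SHARDS

def physical_host_for(logical_shard: int) -> int:
    """Map a logical shard onto one of the physical hosts."""
    return logical_shard % NUM_PHYSICAL_HOSTS

workspace_id = "8f9a1c2e-0000-4000-8000-000000000000"
shard = logical_shard_for(workspace_id)
print(f"workspace {workspace_id} -> shard {shard} on host {physical_host_for(shard)}")
```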

A double-write strategy kept the old database and the new shards consistent during the migration, while powerful 96-core machines reloaded historical data within three days.
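
As an illustration of the double-write pattern, here is a minimal sketch; `old_db` and `new_shards` are hypothetical interfaces, and a real migration adds verification and backfill jobs on top.

```python
import logging

logger = logging.getLogger("migration")

def save_block(block: dict, old_db, new_shards) -> None:
    """Double-write a block: the old database remains the source of truth,
    while the new sharded cluster receives the same write for later verification."""
    old_db.write(block)                      # primary write; must succeed
    try:
        shard = new_shards.route(block["workspace_id"])
        shard.write(block)                   # shadow write to the new system
    except Exception:
        # A failed shadow write must never break the user-facing path;
        # log it so a reconciliation job can repair the gap later.
        logger.exception("shadow write failed for block %s", block.get("id"))
```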

Transition Challenges: Analytics and Machine Learning Workloads

Post-migration, PostgreSQL still handled both transactional and analytical queries. This caused new performance challenges, especially for machine learning and data analytics workloads.

Enter the Data Lake Architecture

To offload analytical queries, Notion engineers created a data lake pipeline: Change Data Capture (CDC) streamed PostgreSQL changes through Apache Kafka into Snowflake. However, block data is extremely update-heavy, and Snowflake struggled to merge such a high rate of updates, so this setup became slow and costly.
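
To give a feel for the CDC side, here is a minimal sketch of consuming Postgres change events from Kafka with the kafka-python client; the topic name and brokers are illustrative, and it assumes Debezium-style JSON payloads without schema envelopes.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "postgres.public.block",                    # hypothetical CDC topic
    bootstrap_servers=["localhost:9092"],       # illustrative brokers
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    op = event.get("op")                        # Debezium op: c=create, u=update, d=delete
    row = event.get("after") or event.get("before") or {}
    print(f"{op}: block {row.get('id')} in workspace {row.get('workspace_id')}")
```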

Building a Custom Data Lake

Notion then built a custom data lake: Apache Kafka streams CDC events out of PostgreSQL, Apache Spark processes them at scale, Amazon S3 serves as the storage layer, and Apache Hudi applies the update-heavy change stream to S3 efficiently.

This architecture greatly reduced load on core systems and improved analytics performance.
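
The Hudi piece is what makes the update-heavy workload manageable; below is a minimal PySpark sketch of upserting change records into a Hudi table on S3. The table name, keys, and S3 path are illustrative, not Notion's actual configuration, and the Hudi Spark bundle must be on the classpath.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cdc-to-hudi")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# In a real pipeline this DataFrame would come from the Kafka CDC stream.
changes = spark.createDataFrame(
    [("block-1", "ws-42", "updated text", "2025-04-08T10:00:00Z")],
    ["id", "workspace_id", "content", "updated_at"],
)

hudi_options = {
    "hoodie.table.name": "blocks",                              # illustrative
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "workspace_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",              # apply updates in place
}

(changes.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-bucket/lake/blocks"))                   # hypothetical path
```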

Scaling Again: Sharding Limitations Revisited

As growth continued, shard utilization exceeded 90% and PgBouncer's connection pooling became a bottleneck. The fixes were to reshard the data onto a larger fleet of physical PostgreSQL instances and to scale out the PgBouncer connection-pooling layer.
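
One way to picture resharding, sketched below, is to keep the workspace-to-logical-shard hash fixed and only remap logical shards onto a larger pool of physical hosts. The shard and host counts are illustrative assumptions, not Notion's exact figures.

```python
NUM_LOGICAL_SHARDS = 480   # stays fixed across resharding (illustrative)

def build_shard_map(num_hosts: int) -> dict:
    """Assign each logical shard to a physical host round-robin."""
    return {shard: shard % num_hosts for shard in range(NUM_LOGICAL_SHARDS)}

before = build_shard_map(32)   # original fleet size (illustrative)
after = build_shard_map(96)    # expanded fleet size (illustrative)

# Only shards whose host assignment changed need to be physically moved;
# workspace IDs keep hashing to the same logical shard throughout.
moved = sum(1 for s in range(NUM_LOGICAL_SHARDS) if before[s] != after[s])
print(f"{moved} of {NUM_LOGICAL_SHARDS} logical shards move to a new host")
```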

Minimizing Downtime with "Dark Reads"

To validate the migration without user-facing downtime, Notion used "dark reads": the old and new systems served the same queries in parallel, and their outputs were compared. Once the results consistently matched, traffic was cut over without disrupting users.
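
In practice a dark read can be as simple as issuing the query to both systems, serving the old system's answer, and logging any mismatch. Here is a minimal sketch with hypothetical `query_old` and `query_new` helpers.

```python
import logging

logger = logging.getLogger("dark-reads")

def fetch_blocks(workspace_id: str, query_old, query_new):
    """Serve results from the old system while silently comparing against the new one."""
    old_result = query_old(workspace_id)       # user-facing answer
    try:
        new_result = query_new(workspace_id)   # "dark" read, never shown to users
        if new_result != old_result:
            logger.warning("dark-read mismatch for workspace %s", workspace_id)
    except Exception:
        # The dark path must never affect user traffic.
        logger.exception("dark read failed for workspace %s", workspace_id)
    return old_result
```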

Final Thoughts

Notion’s infrastructure evolution showcases the power of modern data tooling: PostgreSQL, Kafka, Spark, and Hudi, built on S3 storage, enabled the platform to scale reliably. It’s a strong case study in how engineering teams can adapt their architecture to meet explosive growth and evolving data needs.