Explore how Notion handles billions of blocks daily with a custom-built data lake and distributed system architecture.
Initially, Notion relied on a single PostgreSQL database. This setup worked well until the user base hit 100 million and the block count exceeded 20 billion. At this point, the monolithic design started causing performance issues, such as slow queries and overloaded servers.
To address the bottlenecks, Notion adopted horizontal scaling with sharding. Data was partitioned by workspace ID and distributed across many logical shards, each mapped to one of a fleet of physical PostgreSQL instances.
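As a rough illustration of workspace-based partitioning, routing can be as simple as hashing the workspace ID to a stable logical shard and mapping that shard to a physical host. The shard counts and host names below are made up for the sketch and are not Notion's actual topology.

```python
import hashlib

# Illustrative shard counts only -- not Notion's real configuration.
NUM_LOGICAL_SHARDS = 480
NUM_PHYSICAL_HOSTS = 32

def logical_shard_for(workspace_id: str) -> int:
    """Hash the workspace ID to a stable logical shard number."""
    digest = hashlib.md5(workspace_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_LOGICAL_SHARDS

def physical_host_for(logical_shard: int) -> str:
    """Map a logical shard to the physical Postgres host that owns it."""
    host_index = logical_shard % NUM_PHYSICAL_HOSTS
    return f"pg-shard-{host_index:02d}.internal"  # hypothetical hostname scheme

workspace_id = "b1a7c9e2-0f3d-4c6a-9e1b-2d5f8a7c4e10"
shard = logical_shard_for(workspace_id)
print(f"workspace {workspace_id} -> shard {shard} on {physical_host_for(shard)}")
```

Because the hash is deterministic, every service that follows the same scheme routes a given workspace's blocks to the same shard without any central lookup.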
A double-write strategy kept the old and new databases consistent during the migration, and powerful 96-core machines reloaded historical data into the new shards within three days.
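A minimal sketch of the double-write idea, assuming hypothetical `legacy_db` and `sharded_db` clients that expose an `insert` method: every write lands in the old monolith first and is mirrored to the new shards, with mirror failures logged for later backfill rather than surfaced to users.

```python
import logging

logger = logging.getLogger("migration")

def save_block(block: dict, legacy_db, sharded_db) -> None:
    """Double-write: the legacy monolith stays the source of truth,
    while the same write is mirrored to the new sharded cluster."""
    legacy_db.insert("blocks", block)          # authoritative write
    try:
        sharded_db.insert("blocks", block)     # mirrored write to the new shards
    except Exception:
        # A failed mirror write must not break the user-facing request;
        # record it so a reconciliation/backfill job can repair the row later.
        logger.exception("mirror write failed for block %s", block.get("id"))
```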
Post-migration, PostgreSQL still handled both transactional and analytical queries. This caused new performance challenges, especially for machine learning and data analytics workloads.
To offload analytical queries, Notion engineers first routed data into a warehouse: Change Data Capture (CDC) streamed PostgreSQL changes through Apache Kafka into Snowflake. However, block data is update-heavy, and these high-frequency updates proved too costly for Snowflake to ingest, so the setup wasn't sufficient.
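For illustration, the consuming side of that Kafka pipeline might look like the sketch below. It assumes a Debezium-style change-event envelope and made-up topic and broker names; the article itself only says that CDC events flow through Kafka.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Topic name, broker address, and group id are assumptions for illustration.
consumer = KafkaConsumer(
    "postgres.public.blocks",
    bootstrap_servers=["kafka-1:9092"],
    group_id="block-cdc-sink",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    payload = event.get("payload", {})
    op = payload.get("op")                      # "c" create, "u" update, "d" delete
    row = payload.get("after") or payload.get("before")
    # Downstream, these change events are batched and merged into analytics storage.
    print(op, row.get("id") if row else None)
```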
Notion instead built a custom data lake on open-source tools and cloud storage:
- Apache Kafka to transport the CDC stream of block changes
- Apache Hudi to merge those high-frequency updates efficiently into the lake
- Amazon S3 as the underlying storage layer
- Apache Spark to process and transform the data for analytics and machine learning
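A hedged sketch of what the write path into such a lake can look like, using PySpark with Hudi's upsert support. The bucket names, record key, and precombine field are assumptions for illustration, not Notion's actual schema.

```python
from pyspark.sql import SparkSession

# Assumes Hudi's Spark bundle is on the classpath so format("hudi") resolves.
spark = (
    SparkSession.builder
    .appName("blocks-cdc-to-hudi")
    .getOrCreate()
)

# Hypothetical staging location where CDC batches from Kafka have landed.
changes = spark.read.json("s3a://example-cdc-staging/blocks/")

hudi_options = {
    "hoodie.table.name": "blocks",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "last_edited_time",
    "hoodie.datasource.write.operation": "upsert",  # merge updates instead of appending duplicates
}

(changes.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3a://example-data-lake/blocks"))
```

The key design choice is the upsert operation: Hudi rewrites only the affected file groups, which is what makes an update-heavy stream of block edits manageable on top of S3.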
This architecture greatly reduced load on core systems and improved analytics performance.
As growth continued, shard utilization exceeded 90% and the PgBouncer connection-pooling layer became a bottleneck. The fix was another round of horizontal scaling: data was resharded across a larger fleet of PostgreSQL machines, and the single PgBouncer tier was split into multiple smaller clusters to spread connection load, as sketched below.
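One way to picture the split pooling tier is to route each logical shard's traffic through one of several PgBouncer clusters instead of a single shared one. The endpoints, database names, and shard-to-cluster mapping below are illustrative assumptions.

```python
import psycopg2  # pip install psycopg2-binary

# Hypothetical PgBouncer cluster endpoints; the real topology is not public.
PGBOUNCER_CLUSTERS = [
    "pgbouncer-a.internal",
    "pgbouncer-b.internal",
    "pgbouncer-c.internal",
]

def connect_for_shard(logical_shard: int):
    """Open a connection through the PgBouncer cluster that owns this shard's range."""
    endpoint = PGBOUNCER_CLUSTERS[logical_shard % len(PGBOUNCER_CLUSTERS)]
    return psycopg2.connect(
        host=endpoint,
        port=6432,                              # PgBouncer's conventional listen port
        dbname=f"shard_{logical_shard:03d}",    # hypothetical per-shard database name
        user="app",
    )
```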
To ensure zero downtime, Notion used "dark reads." Both old and new systems ran in parallel, and query outputs were compared. Once validated, migrations occurred without user disruption.
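A dark read can be sketched as a shadow query: serve the response from the old system, quietly issue the same query against the new one, and log any mismatch instead of showing it to users. The `old_db`/`new_db` clients and their `get_blocks` method here are hypothetical stand-ins.

```python
import logging

logger = logging.getLogger("dark-reads")

def fetch_blocks(workspace_id: str, old_db, new_db) -> list:
    """Serve reads from the old system while shadow-reading the new one."""
    primary = old_db.get_blocks(workspace_id)        # user-visible result
    try:
        shadow = new_db.get_blocks(workspace_id)     # "dark" read, never returned to users
        if shadow != primary:
            logger.warning("dark-read mismatch for workspace %s", workspace_id)
    except Exception:
        # The new system must never affect user-facing latency or correctness.
        logger.exception("dark read failed for workspace %s", workspace_id)
    return primary
```

Once the mismatch rate stays at zero for long enough, traffic can be cut over to the new system with confidence.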
Notion’s infrastructure evolution showcases the power of modern open-source data tooling. PostgreSQL, Kafka, Spark, S3, and Hudi enabled the platform to scale reliably. It’s a strong case study in how engineering teams can adapt architectures to meet explosive growth and evolving data needs.
Ali Gunes