Explore how Notion handles billions of blocks daily with a custom-built data lake and distributed system architecture.
Initially, Notion relied on a single PostgreSQL database. This setup worked well until the user base hit 100 million and the block count exceeded 20 billion. At this point, the monolithic design started causing performance issues, such as slow queries and overloaded servers.
To address the bottlenecks, Notion adopted horizontal scaling with sharding. Data was partitioned by workspace ID and distributed across many logical shards, each mapped to one of a fleet of physical PostgreSQL instances.
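As a rough illustration of workspace-based partitioning, routing can be as simple as hashing the workspace ID to a stable logical shard and mapping that shard to a physical host. The shard counts and host names below are made up for the sketch and are not Notion's actual topology.

```python
import hashlib

# Illustrative shard counts only -- not Notion's real configuration.
NUM_LOGICAL_SHARDS = 480
NUM_PHYSICAL_HOSTS = 32

def logical_shard_for(workspace_id: str) -> int:
    """Hash the workspace ID to a stable logical shard number."""
    digest = hashlib.md5(workspace_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_LOGICAL_SHARDS

def physical_host_for(logical_shard: int) -> str:
    """Map a logical shard to the physical Postgres host that owns it."""
    host_index = logical_shard % NUM_PHYSICAL_HOSTS
    return f"pg-shard-{host_index:02d}.internal"  # hypothetical hostname scheme

workspace_id = "b1a7c9e2-0f3d-4c6a-9e1b-2d5f8a7c4e10"
shard = logical_shard_for(workspace_id)
print(f"workspace {workspace_id} -> shard {shard} on {physical_host_for(shard)}")
```

Because the hash is deterministic, every service that follows the same scheme routes a given workspace's blocks to the same shard without any central lookup.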
A double-write strategy kept the old and new databases consistent during the migration, and powerful 96-core machines reloaded historical data into the new shards within three days.
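A minimal sketch of the double-write idea, assuming hypothetical `legacy_db` and `sharded_db` clients that expose an `insert` method: every write lands in the old monolith first and is mirrored to the new shards, with mirror failures logged for later backfill rather than surfaced to users.

```python
import logging

logger = logging.getLogger("migration")

def save_block(block: dict, legacy_db, sharded_db) -> None:
    """Double-write: the legacy monolith stays the source of truth,
    while the same write is mirrored to the new sharded cluster."""
    legacy_db.insert("blocks", block)          # authoritative write
    try:
        sharded_db.insert("blocks", block)     # mirrored write to the new shards
    except Exception:
        # A failed mirror write must not break the user-facing request;
        # record it so a reconciliation/backfill job can repair the row later.
        logger.exception("mirror write failed for block %s", block.get("id"))
```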
Post-migration, PostgreSQL still handled both transactional and analytical queries. This caused new performance challenges, especially for machine learning and data analytics workloads.
To offload analytical queries, Notion engineers first routed data into a warehouse: Change Data Capture (CDC) streamed PostgreSQL changes through Apache Kafka into Snowflake. However, block data is update-heavy, and these high-frequency updates proved too costly for Snowflake to ingest, so the setup wasn't sufficient.
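For illustration, the consuming side of that Kafka pipeline might look like the sketch below. It assumes a Debezium-style change-event envelope and made-up topic and broker names; the article itself only says that CDC events flow through Kafka.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Topic name, broker address, and group id are assumptions for illustration.
consumer = KafkaConsumer(
    "postgres.public.blocks",
    bootstrap_servers=["kafka-1:9092"],
    group_id="block-cdc-sink",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    payload = event.get("payload", {})
    op = payload.get("op")                      # "c" create, "u" update, "d" delete
    row = payload.get("after") or payload.get("before")
    # Downstream, these change events are batched and merged into analytics storage.
    print(op, row.get("id") if row else None)
```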
Notion instead built a custom data lake on open-source tools and cloud storage:
- Apache Kafka to transport the CDC stream of block changes
- Apache Hudi to merge those high-frequency updates efficiently into the lake
- Amazon S3 as the underlying storage layer
- Apache Spark to process and transform the data for analytics and machine learning
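A hedged sketch of what the write path into such a lake can look like, using PySpark with Hudi's upsert support. The bucket names, record key, and precombine field are assumptions for illustration, not Notion's actual schema.

```python
from pyspark.sql import SparkSession

# Assumes Hudi's Spark bundle is on the classpath so format("hudi") resolves.
spark = (
    SparkSession.builder
    .appName("blocks-cdc-to-hudi")
    .getOrCreate()
)

# Hypothetical staging location where CDC batches from Kafka have landed.
changes = spark.read.json("s3a://example-cdc-staging/blocks/")

hudi_options = {
    "hoodie.table.name": "blocks",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "last_edited_time",
    "hoodie.datasource.write.operation": "upsert",  # merge updates instead of appending duplicates
}

(changes.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3a://example-data-lake/blocks"))
```

The key design choice is the upsert operation: Hudi rewrites only the affected file groups, which is what makes an update-heavy stream of block edits manageable on top of S3.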
This architecture greatly reduced load on core systems and improved analytics performance.
As growth continued, shard utilization exceeded 90% and the PgBouncer connection-pooling layer became a bottleneck. The fix was another round of horizontal scaling: data was resharded across a larger fleet of PostgreSQL machines, and the single PgBouncer tier was split into multiple smaller clusters to spread connection load, as sketched below.
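One way to picture the split pooling tier is to route each logical shard's traffic through one of several PgBouncer clusters instead of a single shared one. The endpoints, database names, and shard-to-cluster mapping below are illustrative assumptions.

```python
import psycopg2  # pip install psycopg2-binary

# Hypothetical PgBouncer cluster endpoints; the real topology is not public.
PGBOUNCER_CLUSTERS = [
    "pgbouncer-a.internal",
    "pgbouncer-b.internal",
    "pgbouncer-c.internal",
]

def connect_for_shard(logical_shard: int):
    """Open a connection through the PgBouncer cluster that owns this shard's range."""
    endpoint = PGBOUNCER_CLUSTERS[logical_shard % len(PGBOUNCER_CLUSTERS)]
    return psycopg2.connect(
        host=endpoint,
        port=6432,                              # PgBouncer's conventional listen port
        dbname=f"shard_{logical_shard:03d}",    # hypothetical per-shard database name
        user="app",
    )
```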
To ensure zero downtime, Notion used "dark reads." Both old and new systems ran in parallel, and query outputs were compared. Once validated, migrations occurred without user disruption.
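A dark read can be sketched as a shadow query: serve the response from the old system, quietly issue the same query against the new one, and log any mismatch instead of showing it to users. The `old_db`/`new_db` clients and their `get_blocks` method here are hypothetical stand-ins.

```python
import logging

logger = logging.getLogger("dark-reads")

def fetch_blocks(workspace_id: str, old_db, new_db) -> list:
    """Serve reads from the old system while shadow-reading the new one."""
    primary = old_db.get_blocks(workspace_id)        # user-visible result
    try:
        shadow = new_db.get_blocks(workspace_id)     # "dark" read, never returned to users
        if shadow != primary:
            logger.warning("dark-read mismatch for workspace %s", workspace_id)
    except Exception:
        # The new system must never affect user-facing latency or correctness.
        logger.exception("dark read failed for workspace %s", workspace_id)
    return primary
```

Once the mismatch rate stays at zero for long enough, traffic can be cut over to the new system with confidence.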
Notion’s infrastructure evolution showcases the power of modern open-source data tooling. PostgreSQL, Kafka, Spark, S3, and Hudi enabled the platform to scale reliably. It’s a strong case study in how engineering teams can adapt architectures to meet explosive growth and evolving data needs.
Ali Gunes