Scale By the Bay

The Incremental ETL Architecture


John O'Dwyer
Databricks, Developer Advocate

John O’Dwyer is a Developer Advocate at Databricks, where he helps empower the Databricks, Spark, Delta Lake, and MLflow communities. He is a hands-on distributed systems and data science engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems. He has an MS from the University of Colorado and a BS from Ohio University. His current technical focuses include distributed systems, Apache Spark, Delta Lake, and machine learning.

Incremental ETL in a conventional Data Warehouse has been possible for some time, but scale, cost, the need to account for state, and the lack of access for machine learning make it less than ideal. Until now, Incremental ETL in a Data Lake has not been possible, because of the difficulty of updating data and of identifying changed data in a big data table. Incremental ETL also makes the medallion table architecture possible and efficient, so that all consumers of data can have the correct curated data sets for their needs. We will discuss the advances in big data technology that make Incremental ETL possible, as well as the architecture as a whole.