Scale By the Bay

Friday, October 29, 2021

Multi-Dimensional Clustering with Z-Ordering
Russell Spitzer
Apple, Software Engineer

Sort columns are great for file skipping when your predicates are selective on a single dimension, but what do you do when you have multiple dimensions to filter on? Adding more hierarchical sort columns yields diminishing returns, so what can we do? Enter Z-Ordering. Instead of using a single column to sort our data, we order it by a value constructed from a combination of multiple columns. This combined z-value is constructed such that rows with similar z-values have similar values in each of those columns. This lets us write data files that are selective over any of the columns the z-value is constructed from. In this presentation we’ll go over the basics of the math behind the computation and how we are implementing it in Apache Iceberg.
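
As a rough illustration of the idea in the abstract, a z-value can be built by interleaving the bits of the columns being clustered. The sketch below is not taken from the talk or from Apache Iceberg: the function name `interleave_bits`, the `bits` parameter, and the toy 4x4 grid are all hypothetical, and it assumes small non-negative integer columns. A real implementation would also need to map signed numbers, floats, and strings to order-preserving bit patterns first.

```python
def interleave_bits(values, bits=8):
    """Interleave the low `bits` bits of each non-negative integer into one z-value.

    Hypothetical helper for illustration only; not Iceberg's implementation.
    """
    z = 0
    n = len(values)
    for bit in range(bits):
        for i, v in enumerate(values):
            # Bit `bit` of column i lands at position bit * n + (n - 1 - i),
            # so within each group of n bits the first column is most significant.
            z |= ((v >> bit) & 1) << (bit * n + (n - 1 - i))
    return z


# Toy table: every (x, y) pair on a 4x4 grid, two bits per column.
rows = [(x, y) for x in range(4) for y in range(4)]

# Sort by the combined z-value instead of by x and then y.
z_sorted = sorted(rows, key=lambda r: interleave_bits(r, bits=2))

# Pretend each run of four consecutive rows is written to one data file,
# and record the min/max of each column per file, as file-level stats would.
files = [z_sorted[i:i + 4] for i in range(0, len(z_sorted), 4)]
stats = [
    {
        "x": (min(r[0] for r in f), max(r[0] for r in f)),
        "y": (min(r[1] for r in f), max(r[1] for r in f)),
    }
    for f in files
]

for s in stats:
    print(s)
```

In this toy layout each file covers only two distinct values of x and two of y, so a filter on either column alone can skip half of the files. Sorting the same rows by x and then y would give each file a single x value but the full range of y, which helps only predicates on x.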