Managing batch ML jobs is a central competency for Data Science (DS) teams in the ad tech space. According to PwC research, digital ad spend in the US increased by 16.9% to $57.9 billion in the first half of 2019, and worldwide digital ad spend is expected to exceed $375 billion by 2021. To keep pace with this growth, DS teams need flexible tools.
We present our k8s-workqueue system, a pluggable scheduling mechanism for ML workloads on Kubernetes. Tens of thousands of models are built every day on our platform. A focus on simplicity led us to a design that combines familiar features of traditional cron jobs and containers with the power of the Kubernetes API.
We share the lessons learned from our k8s-workqueue system, which has been managing ML batch jobs on our Kubernetes clusters for the past 2 years. These lessons cover building, operating, and maintaining hundreds of product-impacting jobs, all of them ML-centric, data-heavy production workloads.