DeveloperWeek New York 2020

K8s-Workqueue: Simplified Kubernetes ML Batch Jobs

Session Stage

Dr. Moussa Taifi
Xandr-AT&T, Senior Data Science Platform Engineer II - Team Lead

Moussa Taifi is currently a Senior Data Science Platform Engineer and Team Lead at Xandr-AT&T. He holds a PhD in Computer and Information Science from Temple University. He is a machine learning and big data systems engineer focused on data science productivity, reliability, performance, and cost. He is interested in designing and implementing large-scale AI products through data collection, analysis, and warehousing.

Chinmay Nerurkar
Xandr, Senior Software Engineer II

Chinmay Nerurkar is a Senior Software Engineer II at Xandr Inc. He holds a Master's degree in Electrical Engineering from New York University and has over 11 years of diverse experience in embedded software, digital video processing, finance, and the ad tech industry. He is currently focused on building impactful products for Xandr that harness the power of big data and machine learning. He is interested in behavioural finance, economics, and contextual data analysis using NLP and artificial intelligence.

Managing batch ML jobs is a central competency for Data Science (DS) teams in the ad tech space. According to PwC research, digital ad spend in the US increased by 16.9% to $57.9 billion in the first half of 2019. Worldwide digital ad spend is expected to reach over $375 billion by 2021. To keep up with this growth, DS teams need flexible tools.

We present our k8s-workqueue system, a pluggable scheduling mechanism for ML workloads on Kubernetes, where tens of thousands of models are built every day on our platform. A focus on simplicity led us to a design that combines familiar features of traditional cron jobs and containers with the power of the Kubernetes API.

We share the lessons learned from our k8s-workqueue system, which has been managing ML batch jobs on our Kubernetes clusters for the past 2 years. These lessons cover building, operating, and maintaining hundreds of product-impacting jobs, all of them ML-centric, data-heavy production workloads.