Scale By the Bay

Friday, October 29, 2021

Rethinking scalable machine learning with Spark ecosystem
Adi Polak
Microsoft, Sr. Software Engineer and Developer Advocate

Today we understand that to create good products that leverage AI, we need to run machine learning algorithms on massive amounts of data. To do so, we can leverage existing distributed machine learning frameworks, such as Spark MLlib, which simplify the development of large-scale machine learning training and serving. The typical machine learning workflow is:

* Loading data (data ingestion) 

* Preparing data (data cleanup) 

* Extracting features (feature extraction) 

* Fitting model (model training) 

* Serving the model 

* Scoring the model (prediction / inference in production) 

With Apache Spark libraries, we can cover the entire basic machine learning workflow. As software and data engineers, it is important to understand this flow, leverage what already exists, and build more capable products. As tech leads and architects, understanding the workflow and the options available helps us design better architecture and software.

Join this session to learn more about how you can use the Apache Spark ecosystem to develop your end-to-end machine learning pipeline.