Scale By the Bay


Thursday, October 28, 2021

- PDT
Grand Welcome & Opening Remarks
Alexy Khrabrov
IBM Accelerated Discovery, Technical Ecosystem Development Lead
- PDT
Advanced machine learning time series forecasting methods
Paweł Skrzypek
7bulls.com Sp. z o.o., Chief Multi Cloud Architect
Anna Warno
7bulls.com Sp. z o.o.

A review of the latest and most advanced time series forecasting methods, such as ES-Hybrid, N-BEATS, the Tsetlin machine, and more, plus tips and tricks for forecasting difficult, noisy, and nonstationary time series that can significantly improve the accuracy and performance of these methods.

- PDT
Scaling your Kafka streaming pipeline can be a pain, but it doesn’t have to be!
Opher Dubrovsky
Nielsen, Big Data Dev Lead
Ido Nadler
Nielsen, Big Data Team Lead

Kafka data pipeline maintenance can be painful. It usually comes with complicated and lengthy recovery processes, scaling difficulties, traffic ‘moodiness’, and latency issues after downtimes and outages.

It doesn’t have to be that way!

We’ll examine one of our multi-petabyte-scale Kafka pipelines and go over some of the pitfalls we’ve encountered. We’ll offer solutions that alleviate those problems and compare the before and after. We’ll then explain why some common-sense solutions do not work well, and offer an improved, scalable, and resilient way of processing your stream.

We’ll cover:
* Costs of processing in stream compared to in batch
* Scaling out for bursts and reprocessing
* Making the tradeoff between wait times and costs
* Recovering from outages
* And much more…
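
To make the scaling discussion concrete, here is a minimal sketch (not from the talk itself) of the consumer-group mechanism Kafka pipelines lean on to scale out: every instance started with the same group.id splits the topic’s partitions with the rest of the group, so absorbing a burst or a reprocessing backlog can be as simple as launching more instances. The broker address, topic name, and group name are illustrative.

```scala
import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object BurstScalingConsumer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    // All instances sharing this id divide the topic's partitions between them,
    // so adding instances (up to the partition count) scales processing out.
    props.put("group.id", "events-processors")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(List("events").asJava)
    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      records.asScala.foreach(r => println(s"${r.partition}/${r.offset}: ${r.value}"))
    }
  }
}
```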

- PDT
Building an ML Platform from Scratch
Alon Gubkin
Aporia, VP R&D

In this workshop you’ll learn how to set up an ML platform using open-source tools like Cookiecutter, DVC, MLFlow, KFServing, Pulumi, GitHub Actions and more.

We'll explain each tool in an intuitive way, along with the problem it solves, in order to build a useful platform that combines all of them. All code will be available on GitHub after the workshop, so you'll be able to easily integrate it into your existing work environment. There’s no “one size fits all” ML platform: each organization has its own needs and requires a customizable and flexible solution. In this workshop you’ll learn how to create the right solution for your own organization.

The workshop is intended for data scientists and ML engineers from all industries – from small startups to large corporations and academic institutions.

- PDT
Tuning Hyperparameters with DVC Experiments
Milecia McGregor
Iterative.ai, Developer Advocate

When you start exploring multiple model architectures with different hyperparameter values, you need a way to iterate quickly. There are many ways to handle this, but all of them take time, and you might not be able to go back to a particular point to resume or restart training.

In this talk, you will learn how you can use the open-source tool, DVC, to compare training metrics using two methods for tuning hyperparameters: grid search and random search. You'll learn how you can save and track the changes in your data, code, and metrics without adding a lot of commits to your Git history. This approach will scale with your data and projects and make sure that your team can reproduce results easily.
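
To make the two strategies concrete, here is a hedged sketch of grid search versus random search on their own, independent of DVC: grid search exhaustively evaluates a fixed lattice of values, while random search spends the same trial budget on points sampled from continuous ranges. The objective function and parameter ranges below are invented for illustration.

```scala
import scala.util.Random

object HyperparamSearch {
  // Toy objective: stands in for "train a model, return its validation score".
  // Peaks near learning rate 0.01 and depth 6.
  def evaluate(lr: Double, depth: Int): Double =
    -(math.pow(math.log10(lr) + 2, 2) + 0.1 * math.pow(depth - 6, 2))

  def main(args: Array[String]): Unit = {
    val lrs    = Seq(0.001, 0.01, 0.1)
    val depths = Seq(2, 4, 6, 8)

    // Grid search: every combination of the predeclared values.
    val gridBest =
      (for { lr <- lrs; d <- depths } yield ((lr, d), evaluate(lr, d))).maxBy(_._2)

    // Random search: the same budget of trials, sampled from continuous ranges.
    val rng = new Random(42)
    val randomBest = (1 to lrs.size * depths.size).map { _ =>
      val lr = math.pow(10, -3 + 2 * rng.nextDouble()) // 1e-3 .. 1e-1
      val d  = 2 + rng.nextInt(7)                      // 2 .. 8
      ((lr, d), evaluate(lr, d))
    }.maxBy(_._2)

    println(s"grid best:   $gridBest")
    println(s"random best: $randomBest")
  }
}
```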

- PDT
Casting the spell: Druid in practice
Itai Yaffe
Databricks, Senior Solutions Architect
Yakir Buskilla
cocohub.ai, Co-founder and CEO

We’ve been using Apache Druid for over 5 years to provide customers with real-time analytics tools for various use cases, including in-flight analytics, reporting, and building target audiences. The common challenge of these use cases is counting distinct elements in real time at scale, and we will show why Druid is a great tool for that.

In this talk, we will also share some of the best practices and tips we’ve gathered over the years. We will cover the following topics:
* Data modeling
* Ingestion
* Retention and deletion
* Query optimization

- PDT
Apache Spark Performance Tuning Session with Delight
Jean-Yves Stephan
Data Mechanics, Co-Founder & CEO

Delight is an open-source monitoring dashboard for Apache Spark displaying system metrics (Memory, CPU, ...) aligned on the same timeline as your Spark jobs and stages. It's a great complement to the Spark UI to help you troubleshoot the performance and stability of your applications. Delight works for free on top of all commercial and open-source Spark platforms.

In this talk, JY will use Delight and the Spark UI to walk through real-world performance tuning sessions for Apache Spark. Parallelism, shuffle, memory, instance types... he will show you how Delight can help make your applications stable and cost-effective at scale.
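
For orientation, these are the kinds of knobs such a tuning session typically turns. A minimal sketch: the configuration keys are standard Spark settings, but the values are purely illustrative and depend on your data volume and cluster (Delight itself is a monitoring dashboard, not a configuration API).

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only; size them from what the metrics actually show.
val spark = SparkSession.builder()
  .appName("tuning-session")
  // Parallelism: shuffle partitions sized so each task handles a modest chunk.
  .config("spark.sql.shuffle.partitions", "200")
  // Memory: executor heap, plus headroom for off-heap and overhead.
  .config("spark.executor.memory", "8g")
  .config("spark.executor.memoryOverhead", "2g")
  // Instance type/shape: cores per executor trade parallelism against GC pressure.
  .config("spark.executor.cores", "4")
  // Adaptive query execution (Spark 3+) rebalances shuffle partitions at runtime.
  .config("spark.sql.adaptive.enabled", "true")
  .getOrCreate()
```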

- PDT
Scaling the Disney Streaming Offer Management Platform within Growth Engineering
Anshuman Nayak
The Walt Disney Company, Principal Engineer

In this talk, we will go over the challenges of managing subscriber scale across multiple products (Disney+, ESPN+, Star+, and Hulu) as seen by Disney Streaming. We will discuss Growth Lifecycle Engineering at Disney Streaming at a high level, then do a deep dive specifically on offer and product catalog management and highlight the various architectural patterns leveraged.

- PDT
Panel: Hardware vs. Software
Anna Goldie
Google Brain / Stanford, Staff Research Scientist
Bryan Cantrill
Oxide Computer, co-founder and CTO
Vitaly Gordon
Faros AI, Co-founder and CEO
David Kanter
MLCommons™, Founder and Executive Director
Alexander O'Connor
Autodesk, Senior Manager, Data Science & Machine Learning

This panel brings together builders of hardware and software that drive the post-cloud and AI landscape of tomorrow. How will hardware changes affect software development? What drives what? Join us for a famous debate format that is an SBTB tradition!

- PDT
Keynote: Why and how to care about ethics in ML
Clement Delangue
Hugging Face, Co-founder and CEO

Hugging Face CEO and co-founder Clement Delangue will discuss how Hugging Face is advancing and democratizing responsible AI through open source and open science. Clement will outline how human bias creeps into the machine learning development pipeline and the steps Hugging Face and the community can take in order to operationalize ethical AI.

- PDT
Convergence of AI, Simulations and HPC
Anima Anandkumar
Caltech and NVIDIA, Professor, Director of AI

Many scientific applications heavily rely on the use of brute-force numerical methods performed on high-performance computing (HPC) infrastructure. Can artificial intelligence (AI) methods augment or even entirely replace these brute-force calculations to obtain significant speed-ups? Can we make groundbreaking new discoveries because of such speed-ups? I will present exciting recent advances that build new foundations in AI that are applicable to a wide range of problems such as fluid dynamics and quantum chemistry. On the other side of the coin, the use of simulations to train AI models can be very effective in applications such as robotics and autonomous driving. Thus, we see a growing convergence of AI, Simulations and HPC.

- PDT
Transformers End-to-End: Experience training and deploying large language models in production
Alexander O'Connor
Autodesk, Senior Manager, Data Science & Machine Learning

The Transformers revolution has reached near-ubiquity in research, and continues to grow its influence in industry. In this talk I will discuss our team's experience of leveraging large language models for the kinds of problems we, and many other data science teams, face in customer support, ecommerce, and more.

- PDT
Pure-Scala Approach for Building Frontend and Backend Applications
Taro Saito
Treasure Data, Principal Software Engineer

Scala is a versatile programming language that can be used for building both frontend and backend applications. To further leverage this advantage, we built Airframe RPC, a framework that uses Scala as a unified RPC interface between servers and clients. With Airframe RPC, you can build HTTP/1 (Finagle) and HTTP/2 (gRPC) services just by defining Scala traits and case classes. It simplifies web application design, as you only need to care about Scala interfaces, without using existing standards like REST, Protocol Buffers, OpenAPI, etc. The Scala.js support of Airframe RPC also enables building interactive web applications that can dynamically render DOM elements while talking to Scala-based RPC servers. With Airframe RPC, Scala developers can deliver unprecedented value in both frontend and backend development.
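
As a taste of the “traits and case classes” style the abstract describes, here is a minimal sketch of an Airframe RPC interface. It follows the Airframe documentation as best I recall it, so the exact annotation and package names may differ between versions.

```scala
import wvlet.airframe.http.RPC

// Shared between the server and the Scala.js client: messages are plain case classes.
case class Greeting(message: String)

// The RPC interface is just a Scala trait; Airframe generates the HTTP/1 (Finagle)
// or HTTP/2 (gRPC) plumbing from it, so no REST routes or .proto files are needed.
@RPC
trait GreeterApi {
  def hello(name: String): Greeting
}

// The server-side implementation is an ordinary class.
class GreeterApiImpl extends GreeterApi {
  override def hello(name: String): Greeting = Greeting(s"Hello, $name!")
}
```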

- PDT
Chip Floorplanning with Deep Reinforcement Learning
Anna Goldie
Google Brain / Stanford, Staff Research Scientist

In this talk, I will describe a reinforcement learning (RL) method for chip floorplanning, the engineering problem of designing the physical layout of a computer chip. Chip floorplanning ordinarily requires weeks or months of effort by physical design engineers to produce manufacturable layouts. Our method generates floorplans in under six hours that are superior or comparable to those produced by humans in all key metrics, including power consumption, performance, and chip area. To achieve this, we pose chip floorplanning as a reinforcement learning problem and develop a novel edge-based graph convolutional neural network architecture capable of learning rich and transferable representations of the chip. Our method was used in the design of the next generation of Google’s artificial intelligence (AI) accelerators (TPU).

- PDT
Elasticity across the Facebook Cloud
Ariel Rao
Facebook, Software Engineer

Facebook operates an internal cloud to support its family of products. The strategy for scaling has included an investment in elasticity of capacity management. Elasticity means several things. At the physical infrastructure level, we mobilize buffer capacity to mitigate dynamic unavailability and coordinate maintenance. At the workload management level, we AutoScale capacity allocations based on predictive and real-time models of workload demand. At the global resource management level, we time-shift flexible workloads based on time series models of supply availability. And at the global efficiency level, we leverage spare capacity for opportunistic workloads. During this talk, we will dive into each dimension and how they fit together. We will show how elasticity across the stack allows us to meet a high bar on reliability and availability, while making efficient use of all capacity deployed.

- PDT
From Python and SQL to Scala, Presto to Spark, elbow grease to automation
Mohit Jaggi
Uber, Tech Lead Manager

The lifecycle of data- and ML-based business innovation is complicated, from quick-and-dirty explorations to solid operation at Uber scale and everything in between. Naturally, different tools and technologies are suitable for different stages in the lifecycle. From Python and SQL to Scala, Presto to Spark, elbow grease to automation... I will share what I have learned so far.

- PDT
The SAME Project: A New Project to Address Reproducible Machine Learning
David Aronchick
Microsoft, Product Manager

We live in a time of both feast and famine in machine learning. Large orgs are publishing state-of-the-art models at an ever-increasing rate, but data scientists face daunting challenges reproducing the results themselves. Even in the best cases, where newly forked code runs without syntax errors, this only solves part of the problem, as the pipelines used to run the models are often completely excluded. The Self-Assembling Machine Learning Environment (SAME) project is a new project and community organized around a common goal: creating tooling that allows for quick ramp-up, seamless collaboration, and efficient scaling. This talk will discuss our initial public release, done in collaboration with data scientists from across the spectrum, where we are going next, and how people can use our learnings in their own practices.

- PDT
Designing and Building Complex Machine Learning Engineering Projects and Workflows: Serverless x Containers
Joshua Arvin Lat
NuWorks Interactive Labs, Chief Technology Officer

Over the past couple of years, many professionals and teams have started to utilize serverless and container concepts and techniques to design and build scalable, low-cost, and maintainable applications. Understanding the concepts alone will not guarantee success, especially when dealing with modern, complex requirements involving Machine Learning and Data Engineering. In this session, we will discuss how to use different tools and services to perform machine learning experiments, ranging from fully abstracted to fully customized solutions. These include performing automated hyperparameter optimization and bias detection when dealing with intermediate requirements and objectives. We will also show how these are done with different ML libraries and frameworks such as scikit-learn, PyTorch, TensorFlow, Keras, MXNet, and more. In addition, I will share some of the risks and common mistakes Machine Learning Engineers must avoid to help bridge the gap between reality and expectations. While discussing these topics, we will show how containerization and serverless engineering help solve our technical requirements.

Friday, October 29, 2021

- PDT
Alice and Mad Hatter: Predict or not to predict
Roksolana Diachuk
Captify, Big Data Developer
Marianna Diachuk
Restream, Data Scientist

Alice and the Mad Hatter must solve a mystical riddle that will help them find their way back home. Alice is a big data engineer and the Mad Hatter is a skilled data scientist. To solve the riddle, they need to build a prediction using a machine learning model. Will knowledge of Scala help Alice find a solution? And will she be able to collaborate with the Mad Hatter? You will find out in this talk.

- PDT
Introducing Spark Cyclone: Accelerating Spark with the hidden supercomputer device plugin in Hadoop
Eduardo Gonzalez
Xpress AI, Founder and CEO

The Spark Cyclone project is an open-source plug-in for Apache Spark that accelerates Spark SQL with the NEC SX-Aurora TSUBASA accelerator, which is supported in Hadoop 3.3. Dubbed the “Vector Engine,” the card has lots of onboard RAM (48 GB), lots of memory bandwidth (1.5 TB/s), and lots of computing power (6 TFLOPs). This talk covers why the Vector Engine is uniquely suited for analytics (compared to the alternatives), how we execute Catalyst queries on the Vector Engine, and how the performance compares to Spark RAPIDS running on a V100.

- PDT
Continuous integration of ML products and UX design
Adarsa Sivaprasad
Independent, Senior Data Scientist

As ML and deep learning algorithms increasingly come out of research and get integrated into products, conventional product design is learning to adapt to them. Data-centric product design in the last decade has concentrated on data warehousing, scalability of pipelines for big data, and interpretation through KPIs and dashboards. Integrating ML into products and involving artificially intelligent learning systems additionally demands an emphasis on data quality, data exploration, and model explainability. Concerns about data privacy, data democratization, and bias among businesses and users may also need to be built into the design. In this context, the talk explores these aspects through two common scenarios: a typical telemetry summarizer (a core data product) and a recommendation engine (a user-facing product with reinforcement learning). The talk aims at exploring the following aspects:
- How might we use UX and design to make ML models more explainable and interpretable
- How to use UX for ensuring data quality and identifying bias in the data collection pipeline
- How might we use data science techniques to inform and drive UX design decisions

- PDT
Relational Databases: Don't call it a comeback!
Rob Hedgpeth
MariaDB, Director, Developer Relations

The data revolution is upon us, and, well, has been for several years. It comes as no surprise that as application technology has evolved to keep up with the ever-increasing expectations of users, data platforms and solutions have had to as well. A decade or so ago we thought all our problems had been solved with a new player in the game, NoSQL. But, spoiler alert, they weren't.

In this session we're going to dive into a brief history of data. We'll examine its humble beginnings, where we stand today, and what the landscape will look like in the future. Throughout the journey you'll gain an understanding of how SQL and relational databases have adapted to pave the way for a truly bright, scalable, and performant future!

- PDT
Rethinking scalable machine learning with the Spark ecosystem
Adi Polak
Microsoft, Sr. Software Engineer and Developer Advocate

Today we understand that to create good products that leverage AI, we need to run machine learning algorithms on massive amounts of data. To do so, we can leverage existing distributed machine learning frameworks, such as Spark MLlib, which help simplify the development and usage of large-scale machine learning training and serving. The typical machine learning workflow is:

* Loading data (data ingestion) 

* Preparing data (data cleanup) 

* Extracting features (feature extraction) 

* Fitting model (model training) 

* Serving the model 

* Scoring (or predictionizing) / using in production 

With Apache Spark libraries we can cover the entire basic machine learning workflow. As software and data engineers, it is important to understand the flow, how we can leverage what already exists, and create more enhanced products. As tech leads and architects, understanding the workflow and options available will help us create better architecture and software.  

Join this session to learn more about how you can use the Apache Spark ecosystem to develop your machine learning end-to-end pipeline.
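
As a concrete illustration of that workflow, here is a minimal Spark MLlib pipeline in Scala that covers the steps above, from ingestion to scoring. The toy text-classification data is invented; the API calls are standard spark.ml.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mllib-workflow").getOrCreate()

// 1. Load data (data ingestion): a toy labeled dataset.
val training = spark.createDataFrame(Seq(
  ("spark is great", 1.0),
  ("hadoop was slow today", 0.0),
  ("i love structured streaming", 1.0),
  ("the job failed again", 0.0)
)).toDF("text", "label")

// 2-3. Prepare data and extract features: tokenize, then hash tokens into vectors.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")

// 4. Fit the model: the Pipeline chains the stages and trains them in order.
val lr    = new LogisticRegression().setMaxIter(10)
val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)

// 5-6. Serve and score: the fitted PipelineModel transforms new, unlabeled data.
model.transform(training.select("text")).select("text", "prediction").show()
```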

- PDT
Responsible AI: Challenges and Recommendations
Ricardo Baeza-Yates
Institute for Experiential AI @ Northeastern University, Research Director

In the first part we cover five current specific challenges through examples: (1) discrimination (e.g., facial recognition, justice, sharing economy, language models); (2) phrenology (e.g., biometric-based predictions); (3) unfair digital commerce (e.g., exposure and popularity bias); (4) stupid models (e.g., Signal, minimal adversarial AI); and (5) indiscriminate use of computing resources (e.g., large language models). These examples do have a personal bias but set the context for the second part, where we address four generic challenges: (1) too many principles (e.g., principles vs. techniques); (2) cultural differences (e.g., Christian vs. Muslim); (3) regulation (e.g., privacy, antitrust); and (4) our cognitive biases. We finish with some recommendations to tackle these challenges and build responsible and ethical AI.

- PDT
Activity schema: data modeling using a single table
Cedric Dussud
Narrator.ai, Co-founder

Cedric will present a new data modeling approach called the activity schema. It can answer any data question using a single time series table (only 11 columns and no JSON). Instead of facts and dimensions, data is modeled as a customer doing an activity over time. This approach works for any business data used for BI. 

This approach has some fundamental benefits over dimensional modeling.

  1. Single modeling layer. All aggregations, metrics, materialized views for BI, etc., are built directly from the single activity schema table. This means the only dependency is the raw source data.
  2. No more foreign key joins. Queries use relationships in time to relate activities together. This means that any data in the warehouse can be directly combined with any other data, without having to create foreign keys between them.
  3. Open source analyses. The activity schema specifies a specific table structure. This means that the data is structurally the same, no matter who models it. This allows analyses or queries to be directly shared between companies.
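
To illustrate point 2 above, here is a toy sketch, in plain Scala rather than SQL and not the actual activity schema specification, of relating two activities through time instead of through a foreign key: answering “first purchase after starting a trial” needs only customer, activity, and timestamp columns.

```scala
case class Activity(customer: String, activity: String, ts: Long)

// A toy activity stream: one table, one customer action per row.
val stream = Seq(
  Activity("ann", "visited_site", 1L),
  Activity("ann", "started_trial", 3L),
  Activity("bob", "visited_site", 2L),
  Activity("ann", "purchased", 7L)
)

// No foreign keys: the two activities are related by ordering them on the
// shared time axis, per customer.
val firstPurchaseAfterTrial =
  for {
    trial <- stream.filter(_.activity == "started_trial")
    purchase <- stream
      .filter(a => a.customer == trial.customer && a.activity == "purchased" && a.ts > trial.ts)
      .sortBy(_.ts)
      .headOption
      .toSeq
  } yield (trial.customer, purchase.ts)

println(firstPurchaseAfterTrial) // List((ann,7))
```
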
- PDT
Panel: Startups and the ML landscape
Stephen Merity
Independent AI researcher
James Cham
Bloomberg Beta, Partner
Lisha Li
Rosebud AI, CEO
Alexy Khrabrov
IBM Accelerated Discovery, Technical Ecosystem Development Lead

The panel brings together AI founders and VCs to review what makes deep tech so exciting, challenging, and potentially transformative. How might the machine learning landscape evolve and how will that impact all of us?

- PDT
Grand Welcome & Opening Remarks
Alexy Khrabrov
IBM Accelerated Discovery, Technical Ecosystem Development Lead
- PDT
Keynote: How to start a startup for AI Engineers
Alyona Medelyan
Thematic, CEO

What's an AI startup? Why should you (or shouldn't you) start one? How do you make sure your AI startup succeeds? In this talk, I will share what I learned over the past 12 years of working at AI startups and starting my own. Heads up that technology itself is just one of the 3 core pillars of successful startups. Join in to learn how to think about the other two and hear my journey.

- PDT
Introducing Hasktorch
Austin Huang
Fidelity Investments, Vice President of AI and Machine Learning

Hasktorch is a library for implementing tensor math and neural networks using typed functional programming in Haskell. It binds the C++ backend of PyTorch, making all of its functions available in Haskell, including GPU support and automatic differentiation.
Hasktorch has two complementary goals. First, to advance research at the intersection of machine learning and programming languages. Second, to enable developers to leverage the strengths of typed functional programming for building reliable, scalable machine-learning systems.
In this talk, Austin will introduce the library, highlight example projects, and help new users get started with Hasktorch.

- PDT
Making the Transition to Scala 3
Seth Tisue
Lightbend, Software Engineer

Scala 3.0 was released in May 2021, capping nearly a decade of work. Much in Scala 3 is new and exciting, but also, existing Scala programmers will find that most of their existing knowledge and experience still applies. Let's explore the following questions:

• why upgrade at all?

• when is the right time to upgrade?

• how does upgrading work?
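
For readers weighing those questions, here is a small, hedged taste of what Scala 3.0 code looks like in practice (enums, optional braces, and @main entry points); the example is illustrative and not drawn from the talk.

```scala
// Scala 3: enums replace the sealed-trait-plus-case-objects pattern,
// definitions can be top-level, and braces are optional.
enum Color:
  case Red, Green, Blue

def describe(c: Color): String = c match
  case Color.Red   => "warm"
  case Color.Green => "natural"
  case Color.Blue  => "cool"

@main def demo(): Unit =
  Color.values.foreach(c => println(s"$c is ${describe(c)}"))
```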

- PDT
Live Coding: Building Stateful and Scalable Reactive Streams with Akka
Nolan Grace
M1 Finance, Senior Software Engineer

Akka is an amazing toolkit for building scalable and resilient applications. In this session we will dive into the hidden gems of Akka. Nolan will start the session with a quick introduction to Akka Streams and Actors. From there he will dive into the live coding part of the session. Nolan will build a fully backpressured, stateful, and scalable application from scratch. By the end of this session you will understand the Akka modules best suited for your next Reactive Application and the pitfalls to avoid.
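
As a warm-up for the live coding, here is a minimal, self-contained sketch of a backpressured, stateful Akka Stream; the stream shape and names are invented for illustration rather than taken from Nolan's session.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.duration._

object BackpressuredStream extends App {
  implicit val system: ActorSystem = ActorSystem("demo")

  // The source would emit as fast as it can, but the throttle stage pushes
  // backpressure upstream, so nothing is dropped and memory stays bounded.
  Source(1 to 1000)
    .throttle(10, 1.second)      // cap throughput at 10 elements per second
    .statefulMapConcat { () =>   // a small piece of per-stream state
      var count = 0
      elem => {
        count += 1
        List(s"event #$count: $elem")
      }
    }
    .runWith(Sink.foreach(println))
}
```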

- PDT
Multi-Dimensional Clustering with Z-Ordering
Russell Spitzer
Apple, Software Engineer

Sort columns are great for file skipping when you have predicates that are selective on a single dimension, but what do you do when you have multiple dimensions to filter on? Adding additional hierarchical sort columns has diminishing returns, so what can we do? Enter Z-ordering. Instead of using a single column to sort our data, we order our data based on a value constructed from a combination of multiple columns. This combined Z-value is constructed such that rows with similar Z-values have similar column values. This lets us write data files that will be selective over any of the columns the Z-value is constructed from. In this presentation we’ll go over the basics of the math behind the computation and how we are implementing it in Apache Iceberg.
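
As a minimal sketch of the bit-interleaving idea behind Z-values (the production implementation in Apache Iceberg is more involved), interleaving the bits of two columns yields a single sort key that clusters rows along both dimensions at once:

```scala
// Interleave the bits of two 16-bit column values into one 32-bit Z-value:
// column x occupies the even bit positions and column y the odd ones, so
// rows that are close on both columns land close together in Z-value order.
def zValue(x: Int, y: Int): Long = {
  var z = 0L
  for (i <- 0 until 16) {
    z |= ((x >> i) & 1L) << (2 * i)     // even bits: column x
    z |= ((y >> i) & 1L) << (2 * i + 1) // odd bits: column y
  }
  z
}

// Sorting by zValue clusters the data along both dimensions, so files cut
// from this ordering are selective for predicates on either column.
val rows = for { x <- 0 to 3; y <- 0 to 3 } yield (x, y)
rows.sortBy { case (x, y) => zValue(x, y) }.foreach(println)
```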

- PDT
Location-Based Data Engineering for Good with PySpark/Graphs
Evan Chan
UrbanLogiq, Inc., Senior Data Engineer

Cell phones are ubiquitous, and a huge amount of location-based data is generated by apps, advertiser networks, libraries, mobile software and hardware providers, and cell networks. This is a talk about working with location-based data for good: to understand mobility patterns and help improve urban, traffic, and economic planning. PySpark is used to wrangle the large amount of data, sessionize and calculate potential trips, and ingest into different form factors such as graphs to understand mobility patterns. I also discuss data quality engineering and experimentation.
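
As one concrete piece of such a pipeline, here is a hedged sketch of gap-based sessionization in Spark, shown in Scala rather than PySpark, with invented column names and a 30-minute gap threshold; it is the generic pattern, not UrbanLogiq's actual code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("sessionize").getOrCreate()
import spark.implicits._

// Toy pings: (device, unix timestamp). Real inputs would also carry location.
val pings = Seq(
  ("d1", 100L), ("d1", 160L), ("d1", 4000L), ("d2", 50L), ("d2", 90L)
).toDF("device", "ts")

val byDevice = Window.partitionBy("device").orderBy("ts")

// Start a new session whenever the gap since the previous ping exceeds
// 30 minutes; a running sum of the new-session flags yields a session id.
val sessions = pings
  .withColumn("gap", $"ts" - lag("ts", 1).over(byDevice))
  .withColumn("newSession", when($"gap".isNull || $"gap" > 1800, 1).otherwise(0))
  .withColumn("sessionId", sum("newSession").over(byDevice))

sessions.show()
```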

- PDT
Semantic Search and Neural Information Retrieval
Amin Ahmad
ZIR AI, Co-founder and CEO

In 2017, the introduction of the transformer architecture enabled computers to achieve human-level performance on a range of language tasks. In this talk, we'll describe some of the theory behind these systems and explain how they can be applied to search and information retrieval problems.
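
To ground the idea, here is a toy sketch of the retrieval step in semantic search: documents and the query are embedded as vectors (hand-made stand-ins here for transformer encoder outputs) and ranked by cosine similarity rather than by keyword overlap.

```scala
// Toy "embeddings": in a real system these come from a transformer encoder.
val docs = Map(
  "how to reset my password"  -> Array(0.9, 0.1, 0.0),
  "pricing for the team plan" -> Array(0.1, 0.8, 0.3),
  "forgot login credentials"  -> Array(0.8, 0.2, 0.1)
)

def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot  = a.zip(b).map { case (x, y) => x * y }.sum
  val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
  dot / norm
}

// Rank documents by vector similarity to the query embedding. Note that
// "can't sign in" shares no keywords with the top hit, yet still matches.
val query = Array(0.85, 0.15, 0.05) // embedding of "can't sign in"
docs.toSeq.sortBy { case (_, v) => -cosine(query, v) }.foreach {
  case (text, v) => println(f"${cosine(query, v)}%.3f  $text")
}
```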

- PDT
Deploying and serving hardware-optimized ML pipelines using GraalVM
Adam Gibson
Konduit, CTO

This talk will cover an approach for using GraalVM combined with the Eclipse Deeplearning4j framework to create and deploy ML pipelines, combining Python scripts and hardware-optimized ML pipelines into a single binary.