
Thursday, October 28, 2021
Advanced machine learning time series forecasting methods
A review of the latest and most advanced time series forecasting methods, such as ES-Hybrid, N-Beats, the Tsetlin machine, and more, plus tips and tricks for forecasting difficult, noisy, and nonstationary time series that can significantly improve the accuracy and performance of these methods.
Scaling your Kafka streaming pipeline can be a pain - but it doesn’t have to be!
Kafka data pipeline maintenance can be painful. It usually comes with complicated and lengthy recovery processes, scaling difficulties, traffic ‘moodiness’, and latency issues after downtimes and outages.
It doesn’t have to be that way!
We’ll examine one of our multi-petabyte scale Kafka pipelines and go over some of the pitfalls we’ve encountered. We’ll offer solutions that alleviate those problems and compare the before and after. We’ll then explain why some common-sense solutions do not work well and offer an improved, scalable, and resilient way of processing your stream.
We’ll cover:
* Costs of processing in stream compared to in batch
* Scaling out for bursts and reprocessing
* Making the tradeoff between wait times and costs
* Recovering from outages
And much more…
Building an ML Platform from Scratch
In this workshop you’ll learn how to set up an ML platform using open-source tools like Cookiecutter, DVC, MLFlow, KFServing, Pulumi, GitHub Actions and more.
We'll explain each tool and the problem it solves in an intuitive way, in order to build a useful platform that combines all of them. All code will be available on GitHub after the workshop, so you'll be able to easily integrate it into your existing work environment. There’s no “one size fits all” ML Platform. Each organization has its own needs and requires a customizable and flexible solution. In this workshop you’ll learn how to create the right solution for your own organization.
The workshop is intended for data scientists and ML engineers from all industries – from small startups to large corporations and academic institutions.
Tuning Hyperparameters with DVC Experiments
When you start exploring multiple model architectures with different hyperparameter values, you need a way to quickly iterate. There are a lot of ways to handle this, but all of them require time and you might not be able to go back to a particular point to resume or restart training.
In this talk, you will learn how you can use the open-source tool, DVC, to compare training metrics using two methods for tuning hyperparameters: grid search and random search. You'll learn how you can save and track the changes in your data, code, and metrics without adding a lot of commits to your Git history. This approach will scale with your data and projects and make sure that your team can reproduce results easily.
Casting the spell: Druid in practice
We’ve been using Apache Druid for over 5 years, to provide customers with real-time analytics tools for various use-cases, including in-flight analytics, reporting and building target audiences. The common challenge of these use-cases is counting distinct elements in real-time at scale, and we will show why Druid is a great tool for that.
In this talk, we will also share some of the best practices and tips we’ve gathered over the years. We will cover the following topics:
* Data modeling
* Ingestion
* Retention and deletion
* Query optimization
Apache Spark Performance Tuning Session with Delight
Delight is an open-source monitoring dashboard for Apache Spark displaying system metrics (Memory, CPU, ...) aligned on the same timeline as your Spark jobs and stages. It's a great complement to the Spark UI to help you troubleshoot the performance and stability of your applications. Delight works for free on top of all commercial and open-source Spark platforms.
In this talk, JY will use Delight and the Spark UI to go through real-world performance tuning sessions for Apache Spark. Parallelism, shuffle, memory, instance types... he will show you how Delight can help make your applications stable and cost-effective at scale.
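As a rough illustration of the kind of knobs such a tuning session revisits (not the talk's own material), here is a minimal Spark configuration sketch in Scala; the specific values are assumptions and would normally be chosen based on what the metrics show:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: typical settings touched during a performance tuning session.
// The values below are illustrative assumptions, not recommendations.
val spark = SparkSession.builder()
  .appName("tuning-session-sketch")
  .config("spark.sql.shuffle.partitions", "400") // shuffle parallelism
  .config("spark.executor.memory", "8g")         // memory per executor
  .config("spark.executor.cores", "4")           // parallelism per executor
  .getOrCreate()
```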
Scaling the Disney Streaming Offer Management Platform within Growth Engineering
In this talk, we will go over the challenges of managing the scale of subscribers for multiple products (Disney+, ESPN+, Star+ and Hulu) as seen by Disney Streaming. We will discuss Growth Lifecycle Engineering at Disney Streaming at a high level, do a deep dive specifically on Offer and Product catalog management, and highlight the various architectural patterns leveraged.
Panel: Hardware vs Software
This panel brings together builders of hardware and software that drive the post-cloud and AI landscape of tomorrow. How will the hardware changes affect software development? What drives what? Join us for a famous debate format that is an SBTB tradition!
Keynote: Why and how to care about ethics in ML
Hugging Face CEO and co-founder Clement Delangue will discuss how Hugging Face is advancing and democratizing responsible AI through open source and open science. Clement will outline how human bias creeps into the machine learning development pipeline and the steps Hugging Face and the community can take in order to operationalize ethical AI.
Convergence of AI, Simulations and HPC
Many scientific applications heavily rely on the use of brute-force numerical methods performed on high-performance computing (HPC) infrastructure. Can artificial intelligence (AI) methods augment or even entirely replace these brute-force calculations to obtain significant speed-ups? Can we make groundbreaking new discoveries because of such speed-ups? I will present exciting recent advances that build new foundations in AI that are applicable to a wide range of problems such as fluid dynamics and quantum chemistry. On the other side of the coin, the use of simulations to train AI models can be very effective in applications such as robotics and autonomous driving. Thus, we see a growing convergence of AI, Simulations and HPC.
Transformers End-to-End: Experience training and deploying large language models in production
The Transformers revolution has reached near-ubiquity in research, and continues to grow its influence in industry. In this talk I will discuss our team's experience of leveraging large language models for the kinds of problems we, and many other data science teams, face in customer support, ecommerce, and more.
Pure-Scala Approach for Building Frontend and Backend Applications
Scala is a versatile programming language that can be used for building both frontend and backend applications. To further leverage this advantage, we built Airframe RPC, a framework that uses Scala as a unified RPC interface between servers and clients. With Airframe RPC, you can build HTTP/1 (Finagle) and HTTP/2 (gRPC) services just by defining Scala traits and case classes. It simplifies web application design because you only need to care about Scala interfaces, without using existing standards like REST, ProtocolBuffers, OpenAPI, etc. Airframe RPC's Scala.js support also enables building interactive web applications that can dynamically render DOM elements while talking to Scala-based RPC servers. With Airframe RPC, the value of Scala developers will be higher than ever for both frontend and backend development.
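A minimal sketch of what such an interface definition looks like, assuming Airframe's @RPC annotation from wvlet.airframe.http; the service and case class names here are hypothetical:

```scala
import wvlet.airframe.http.RPC

// Hypothetical domain model shared between the server and a Scala.js client.
case class User(id: Int, name: String)

// Annotating a plain Scala trait is the interface definition: Airframe RPC can
// derive HTTP/1 (Finagle) or HTTP/2 (gRPC) endpoints and matching clients from it.
@RPC
trait UserService {
  def getUser(id: Int): User
}
```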
Chip Floorplanning with Deep Reinforcement Learning
In this talk, I will describe a reinforcement learning (RL) method for chip floorplanning, the engineering problem of designing the physical layout of a computer chip. Chip floorplanning ordinarily requires weeks or months of effort by physical design engineers to produce manufacturable layouts. Our method generates floorplans in under six hours that are superior or comparable to those produced by humans in all key metrics, including power consumption, performance, and chip area. To achieve this, we pose chip floorplanning as a reinforcement learning problem, and develop a novel edge-based graph convolutional neural network architecture capable of learning rich and transferrable representations of the chip. Our method was used in the design of the next generation of Google’s artificial intelligence (AI) accelerators (TPU).
Elasticity across the Facebook Cloud
Facebook operates an internal cloud to support its family of products. The strategy for scaling has included an investment in elasticity of capacity management. Elasticity means several things. At the physical infrastructure level, we mobilize buffer capacity to mitigate dynamic unavailability and coordinate maintenances. At the workload management level, we AutoScale capacity allocations based on predictive and real-time models of workload demand. At the global resource management level, we time-shift flexible workloads based on time series models of supply availability. And at the global efficiency level, we leverage spare capacity for opportunistic workloads. During this talk, we will dive into each dimension and how they fit together. We will show how elasticity across the stack allows us to meet a high bar on reliability and availability, while making efficient use of all capacity deployed.
From Python and SQL to Scala, Presto to Spark, elbow grease to automation.
The lifecycle of data- and ML-based business innovation is complicated. From quick and dirty explorations to solid operation at Uber scale and everything in between. Naturally, different tools and technologies are suitable for different stages in the lifecycle. From Python and SQL to Scala, Presto to Spark, elbow grease to automation... I will share what I have learned so far.
The SAME Project: A New Project to Address Reproducible Machine Learning
We live in a time of both feast and famine in machine learning. Large orgs are publishing state-of-the-art models at an ever-increasing rate, but data scientists face daunting challenges reproducing those results themselves. Even in the best cases, where newly forked code runs without syntax errors, this only solves part of the problem, as the pipelines used to run the models are often completely excluded. The Self-Assembling Machine Learning Environment (SAME) project is a new project and community around a common goal: creating tooling that allows for quick ramp-up, seamless collaboration and efficient scaling. This talk will discuss our initial public release, done in collaboration with data scientists from across the spectrum, where we are going next, and how people can use our learnings in their own practices.
Designing and Building Complex Machine Learning Engineering Projects and Workflows: Serverless x Containers
Over the past couple of years, several professionals and teams have started to utilize serverless and container concepts and techniques to design and build scalable, low-cost, and maintainable applications. Understanding the concepts alone will not guarantee success, especially when dealing with modern complex requirements involving Machine Learning and Data Engineering. In this talk, we will discuss how to use different tools and services to perform machine learning experiments, ranging from fully abstracted to fully customized solutions. These include performing automated hyperparameter optimization and bias detection when dealing with intermediate requirements and objectives. We will also show how these are done with different ML libraries and frameworks such as scikit-learn, PyTorch, TensorFlow, Keras, MXNet, and more. In addition, I will share some of the risks and common mistakes Machine Learning Engineers must avoid to help bridge the gap between reality and expectations. While discussing these topics, we will show how containerization and serverless engineering help solve our technical requirements.
Friday, October 29, 2021
Alice and Mad Hatter: Predict or not to predict
Alice and the Mad Hatter must solve a mystical riddle that will help them find their way back home. Alice is a big data engineer and the Mad Hatter is a skilled data scientist. In order to solve the riddle they need to build a prediction using a machine learning model. Will knowledge of Scala help Alice find a solution? And will she be able to collaborate with the Mad Hatter? You will find out in this talk.
Introducing Spark Cyclone: Accelerating Spark with the hidden supercomputer device plugin in Hadoop
The Spark Cyclone project is an open-source plug-in for Apache Spark that accelerates Spark SQL with the NEC SX-Aurora TSUBASA accelerator, which is supported in Hadoop 3.3. Dubbed the “Vector Engine”, the card has lots of onboard RAM (48GB), lots of memory bandwidth (1.5 TB/s), and lots of computing power (6 TFLOPs). This talk covers why the Vector Engine is uniquely suited for analytics (compared to the alternatives), how we execute Catalyst queries on the Vector Engine, and how the performance compares to Spark Rapids running on a V100.
Continuous integration of ML products and UX design
As ML and deep learning algorithms increasingly move out of research and into products, conventional product design is also learning to adapt to them. Data-centric product design in the last decade has concentrated on data warehousing, scalability of pipelines for big data, and interpretation through KPIs and dashboards. Integrating ML into products and involving artificially intelligent learning systems additionally demands an emphasis on data quality, data exploration, and model explainability. Concerns of data privacy, data democratization, and bias within business and users also need to be built into the design where possible. In this context, the talk explores these aspects by considering two common scenarios: a typical telemetry summarizer (a core data product) and a recommendation engine (a user-facing product with reinforcement learning). The talk aims at exploring the following aspects:
- How might we use UX and design to make ML models more explainable and interpretable
- How to use UX for ensuring data quality and identifying bias in the data collection pipeline
- How might we use data science techniques to inform and drive UX design decisions
Relational Databases: Don't call it a comeback!
The data revolution is upon us, and, well, has been for several years. It comes as no surprise that as application technology has evolved to keep up with the ever increasing expectations of users, the data platforms and solutions have had to as well. A decade or so ago we thought all our problems had been solved with a new player in the game, NoSQL. But, spoiler alert, they weren't.
In this session we're going to dive into a brief history of data. We'll examine its humble beginnings, where we stand today, and what the landscape will look like in the future. Throughout the journey you'll gain an understanding of how SQL and relational databases have adapted to pave the road for a truly bright, scalable, and performant future!
Rethinking scalable machine learning with Spark ecosystem
Today we understand that to create good products that leverage AI, we need to run machine learning algorithms on massive amounts of data. In order to do so, we can leverage existing distributed machine learning frameworks, such as Spark MLlib, which helps us simplify the development and usage of large-scale machine learning training and serving. The typical machine learning workflow is:
* Loading data (data ingestion)
* Preparing data (data cleanup)
* Extracting features (feature extraction)
* Fitting model (model training)
* Serving the model
* Scoring (or predictionizing) / using in production
With Apache Spark libraries we can cover the entire basic machine learning workflow. As software and data engineers, it is important to understand the flow, how we can leverage what already exists, and create more enhanced products. As tech leads and architects, understanding the workflow and options available will help us create better architecture and software.
Join this session to learn more about how you can use Apache Spark ecosystem to develop your machine learning end-to-end pipeline.
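As a rough sketch of how that basic workflow maps onto the spark.ml API, here is a minimal pipeline in Scala covering feature extraction, training, and scoring. The input paths and the column names ("text", "label") are assumptions about the data, not part of the talk:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mllib-workflow-sketch").getOrCreate()

// Data ingestion (hypothetical path; assumes a "text" column and a numeric "label").
val training = spark.read.parquet("/path/to/training")

// Feature extraction: tokenize the text and hash tokens into a feature vector.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")

// Model training.
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model    = pipeline.fit(training)

// Scoring / using in production.
val scored = model.transform(spark.read.parquet("/path/to/new_data"))
```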
Responsible AI: Challenges and Recommendations
In the first part, we cover five current specific challenges through examples: (1) discrimination (e.g., facial recognition, justice, sharing economy, language models); (2) phrenology (e.g., biometric-based predictions); (3) unfair digital commerce (e.g., exposure and popularity bias); (4) stupid models (e.g., Signal, minimal adversarial AI); and (5) indiscriminate use of computing resources (e.g., large language models). These examples do have a personal bias but set the context for the second part, where we address four generic challenges: (1) too many principles (e.g., principles vs. techniques); (2) cultural differences (e.g., Christian vs. Muslim); (3) regulation (e.g., privacy, antitrust); and (4) our cognitive biases. We finish with some recommendations to tackle these challenges and build responsible and ethical AI.
Activity schema: data modeling using a single table
Cedric will present a new data modeling approach called the activity schema. It can answer any data question using a single time series table (only 11 columns and no JSON). Instead of facts and dimensions, data is modeled as a customer doing an activity over time. This approach works for any business data used for BI.
This approach has some fundamental benefits over dimensional modeling.
- Single modeling layer. All aggregations, metrics, materialized views for BI, etc, are built directly from the single activity schema table. This means the only dependency is the raw source data.
- No more foreign key joins. Queries use relationships in time to relate activities together. This means that any data in the warehouse can be directly combined with any other data, without having to create foreign keys between them.
- Open source analyses. The activity schema specifies a specific table structure. This means that the data is structurally the same, no matter who models it. This allows analyses or queries to be directly shared between companies.
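To make the "customer doing an activity over time" idea concrete, here is an illustrative approximation of a single activity-stream row as a Scala case class. The actual activity schema fixes its own 11-column layout; the column names and types below are assumptions for illustration only:

```scala
import java.time.Instant

// Illustrative approximation only: one row = one customer performing one
// activity at a point in time, plus a few generic feature columns.
// The real activity schema specification defines the exact 11 columns.
case class ActivityRow(
  activityId: String,
  ts: Instant,
  customer: String,
  activity: String,            // e.g. "completed_order", "opened_email"
  feature1: Option[String],
  feature2: Option[String],
  feature3: Option[String],
  revenueImpact: Option[Double],
  link: Option[String]
)
```

Because every activity lands in this one structure, downstream queries relate activities by comparing timestamps for the same customer rather than joining on foreign keys.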
Panel: Startups and the ML landscape
The panel brings together AI founders and VCs to review what makes deep tech so exciting, challenging, and potentially transformative. How might the machine learning landscape evolve and how will that impact all of us?
Keynote: How to start a startup for AI Engineers
What's an AI startup? Why should you (or shouldn't you) start one? How do you make sure your AI startup succeeds? In this talk, I will share what I learned over the past 12 years of working at AI startups and starting my own. Heads up: technology itself is just one of the 3 core pillars of a successful startup. Join in to learn how to think about the other two and hear my journey.
Introducing Hasktorch
Hasktorch is a library for implementing tensor math and neural networks using typed functional programming in Haskell. It binds the C++ backend of PyTorch, making all functions available in Haskell including GPU support and automatic differentiation.
Hasktorch has two complementary goals. First, to advance research at the intersection of machine learning and programming languages. Second, to enable developers to leverage the strengths of typed functional programming for building reliable, scalable machine-learning systems.
In this talk, Austin will introduce the library, highlight example projects, and help new users get started with Hasktorch.
Making the Transition to Scala 3
Scala 3.0 was released in May 2021, capping nearly a decade of work. Much in Scala 3 is new and exciting, but also, existing Scala programmers will find that most of their existing knowledge and experience still applies. Let's explore the following questions:
• why upgrade at all?
• when is the right time to upgrade?
• how does upgrading work?
Live Coding: Building Stateful and Scalable Reactive Streams with Akka
Akka is an amazing toolkit for building scalable and resilient applications. In this session we will dive into the hidden gems of Akka. Nolan will start the session with a quick introduction to Akka Streams and Actors. From there he will dive into the live coding part of the session. Nolan will build a fully backpressured, stateful, and scalable application from scratch. By the end of this session you will understand the Akka modules best suited for your next Reactive Application and the pitfalls to avoid.
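For readers new to the toolkit, here is a minimal sketch of the kind of backpressured Akka Stream the session starts from: the source only emits as fast as the downstream sink signals demand. The object name and numbers are illustrative, not from the talk:

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

// Minimal backpressured stream: Source -> processing stage -> Sink.
object StreamSketch extends App {
  implicit val system: ActorSystem = ActorSystem("stream-sketch")

  Source(1 to 1000)
    .map(_ * 2)                              // a simple processing stage
    .runWith(Sink.foreach[Int](n => println(n)))
    .onComplete(_ => system.terminate())(system.dispatcher)
}
```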
Multi-Dimensional Clustering with Z-Ordering
Sort columns are great for file skipping when you have predicates that are selective on a single dimension, but what do you do when you have multiple dimensions to filter on? Adding additional hierarchical sort columns has diminishing returns, so what can we do? Enter Z-Ordering. Instead of using a single column to sort our data, we order our data based on a value constructed from a combination of multiple columns. This combined Z-value is constructed such that rows with similar Z-values share column values that are similar. This lets us write data files that will be selective over any of the columns that the Z-value is constructed from. In this presentation we’ll go over the basics of the math behind the computation and how we are implementing it in Apache Iceberg.
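As a rough sketch of the underlying idea (bit interleaving, often called a Morton code), here is a simplified two-column Z-value in Scala. This is an illustration of the technique, not Apache Iceberg's implementation:

```scala
// Simplified sketch: interleave the bits of two 16-bit column values so that
// rows close in both columns end up with close Z-values.
def zValue(x: Int, y: Int): Long = {
  // Spread the low 16 bits of v so they occupy every other bit position.
  def spreadBits(v: Int): Long =
    (0 until 16).foldLeft(0L) { (acc, i) =>
      acc | (((v >> i) & 1L) << (2 * i))
    }
  spreadBits(x) | (spreadBits(y) << 1)
}

// Sorting rows by zValue(colA, colB) clusters rows that are similar in either
// column into the same data files, enabling file skipping on both dimensions.
```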
Location-Based Data Engineering for Good with PySpark/Graphs
Cell phones are ubiquitous, and a huge amount of location-based data is generated by apps, advertiser networks, libraries, mobile software and hardware providers, and cell networks. This is a talk about working with location-based data for good: to understand mobility patterns and help improve urban, traffic, and economic planning. PySpark is used to wrangle the large amount of data, sessionize and calculate potential trips, and ingest into different form factors such as graphs to understand mobility patterns. I also discuss data quality engineering and experimentation.
Semantic Search and Neural Information Retrieval
In 2017, the introduction of the transformer architecture enabled computers to achieve human-level performance on a range of language tasks. In this talk, we'll describe some of the theory behind these systems and explain how they can be applied to search and information retrieval problems.
Deploying and serving hardware-optimized ML pipelines using GraalVM
This talk will cover an approach for using GraalVM combined with the Eclipse Deeplearning4j framework to create and deploy ML pipelines, combining Python scripts and hardware-optimized ML pipelines into one binary.