OPEN TALK: Low Latency and High Throughput Chat Moderation on a CPU

Transformer-based models have dominated the NLP landscape thanks to their state-of-the-art performance on a wide variety of benchmarks and tasks. However, deploying such large models at scale can be difficult and costly. Learn about the techniques we've used at Stream to overcome these challenges and moderate real-time chat messages efficiently on relatively inexpensive hardware. While this talk focuses on BERT and its offshoots, many of these techniques also apply to other models.
Neha Rao is a Data Scientist at Stream, where she works on personalized feeds and chat moderation. Her recent interests lie in explainable AI and AI ethics. She is based in Boulder, CO, and when not behind a screen can be found fermenting things and watering her many plants.