[#VIRTUAL] PRO Workshop (AI): Sparsity without Sacrifice – How to Accelerate AI Models Without Losing Accuracy

Lucas Souza
Numenta, Senior Researcher

Lucas is a research scientist at Numenta, a machine intelligence company focused on applying neocortical theory to dramatically increase the performance and unlock new capabilities of AI. Previously, Lucas worked as a data scientist for the Brazilian anti-corruption agency, helping to identify and prevent corruption schemes through machine learning, and founded Derivada, a non-profit organization that promotes AI through open-source projects, research, and education. His current research focuses on accelerating deep neural networks in large attention-based models through a wide array of techniques inspired by the study of the neocortex and its main driving principles.

Lawrence Spracklen
Numenta, Director of Machine Learning Architecture

Dr. Lawrence Spracklen is an experienced leader with over two decades of experience developing and delivering cutting-edge solutions. At Numenta, Lawrence leads the machine learning architecture team, focused on the intersection of AI and hardware. Prior to joining Numenta, Lawrence led research and development teams at several other AI startups: RSquared, SupportLogic, Alpine Data, and Ayasdi. Before that, Lawrence spent over a decade at Sun Microsystems, Nvidia, and VMware, where he led teams focused on hardware architecture, software performance, and scalability. Lawrence holds a Ph.D. in Electronics Engineering from the University of Aberdeen and a B.Sc. in Computational Physics from the University of York, and has been issued over 65 US patents.

Most companies with AI models in production today are grappling with stringent latency requirements and escalating energy costs. One way to reduce these burdens is by pruning such models to create sparse, lightweight networks. Pruning involves the iterative removal of weights from a pre-trained dense network to obtain a network with fewer parameters, trading off against model accuracy. Determining which weights to remove in order to minimize the impact on the network's accuracy is critical. For real-world networks with millions of parameters, however, analytical determination is often computationally infeasible; heuristic techniques are a compelling alternative.

In this presentation, we discuss how to implement commonly used heuristics such as gradual magnitude pruning (GMP) in production, along with their associated accuracy-speed trade-offs, using the BERT family of language models as an example.

Next, we cover ways of accelerating such lightweight networks to achieve peak computational efficiency and reduce energy consumption. We walk through how our acceleration algorithms optimize hardware efficiency, unlocking order-of-magnitude speedups and energy savings.

Finally, we present best practices for combining these techniques to achieve multiplicative reductions in energy costs and runtime latencies without sacrificing model accuracy.
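To make the GMP idea concrete, here is a minimal NumPy sketch of the core loop: a sparsity schedule ramps toward a target, and at each step the smallest-magnitude weights are zeroed. The cubic schedule shape and the 90% target are illustrative assumptions (they follow the commonly cited Zhu & Gupta formulation), not the speakers' exact recipe; a production setup would prune a trained network layer by layer, with fine-tuning between pruning steps.

```python
import numpy as np

def gmp_sparsity(step, total_steps, final_sparsity=0.9, initial_sparsity=0.0):
    # Cubic sparsity schedule: ramps from initial_sparsity to
    # final_sparsity over total_steps pruning steps.
    frac = min(step / total_steps, 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - frac) ** 3

def magnitude_prune(weights, sparsity):
    # Zero out the fraction of weights with the smallest magnitudes.
    k = int(sparsity * weights.size)
    if k == 0:
        return weights
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return weights * (np.abs(weights) > threshold)

# Toy example: gradually prune one random weight matrix to ~90% sparsity.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
for step in range(1, 11):
    w = magnitude_prune(w, gmp_sparsity(step, 10))
    # ...in practice, fine-tune the network here to recover accuracy.

final_sparsity = float((w == 0).mean())
print(f"final sparsity: {final_sparsity:.3f}")
```

Because already-pruned weights have magnitude zero, they stay below each new threshold, so sparsity increases monotonically across steps without an explicit mask being carried along.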