Over the last decade, the development of modern, horizontally scalable open-source Big Data technologies such as Apache Cassandra (for data storage) and Apache Kafka (for data streaming) has enabled cost-effective, highly scalable, reliable, low-latency applications and made these technologies increasingly ubiquitous. To enable reliable horizontal scalability, both Cassandra and Kafka use partitioning (for concurrency) and replication (for reliability and availability) across clustered servers. But building scalable applications isn’t as easy as just throwing more servers at the clusters, and unexpected speed humps are common. Consequently, you also need to understand the performance impact of partitions, replication, and clusters; monitor the correct metrics to get an end-to-end view of applications and clusters; benchmark carefully; and scale and tune iteratively to take performance insights and optimizations into account. In this presentation, I will explore some of the performance goals, challenges, solutions, and results I discovered over the last 5 years of building multiple realistic demonstration applications. The examples will include trade-offs with elastic Cassandra auto-scaling, scaling a Cassandra and Kafka anomaly detection application to 19 billion checks per day, and building low-latency streaming data pipelines using Kafka Connect for multiple heterogeneous source and sink systems.
OPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/Hard
Paul has extensive R&D and consulting experience in distributed systems, technology innovation, software architecture and engineering, software performance and scalability, grid and cloud computing, and data analytics and machine learning. He currently serves as the Technology Evangelist at Instaclustr.