Adam Zegelin is Instaclustr's founding software engineer. This presentation will investigate how using micro-batching for submitting writes to Cassandra can improve throughput and reduce client application CPU load. Micro-batching combines writes for the same partition key into a single network request and ensures they hit the “fast path” for writes on a Cassandra node.
Co-Founder and CTO of Instaclustr, Ben Bromhead's presentation at the Cassandra Summit 2016, in San Jose.
This presentation will show how create truly elastic Cassandra deployments on AWS allowing you to scale and shrink your large Cassandra deployments multiple times a day. Leveraging a combination of EBS backed disks, JBOD, token pinning and our previous work on bootstrapping from backups you will be able to dramatically reduce costs per cluster by scaling to match your daily workloads.
This presentation will investigate how using micro-batching for submitting writes to Cassandra can improve throughput and reduce client application CPU load.
Micro-batching combines writes for the same partition key into a single network request and ensures they hit the "fast path" for writes on a Cassandra node.
About the Speaker
Adam Zegelin Technical Co-founder, Instaclustr
As Instaclustrs founding software engineer, Adam provides the foundation knowledge of our capability and engineering environment. He delivers business-focused value to our code-base and overall capability architecture. Adam is also focused on providing Instaclustr's contribution to the broader open source community on which our products and services rely, including Apache Cassandra, Apache Spark and other technologies such as CoreOS and Docker.
What We Learned About Cassandra While Building go90 (Christopher Webster & Th...DataStax
Go90 is a mobile entertainment platform offering access to live and on demand videos. We built the web services platform and social features like activity feed for go90 by making heavy use of Cassandra and Scala, and would like to share what we learned during development and while operating go90. In this presentation, we cover our data model evolution from the initial prototypes to the current production version and the significant performance gain by using a better data model. We will explain how we apply time series data modeling and the benefits of using expiring columns with DateTieredCompactionStrategy. We will also talk about interesting experiences related to table modifications, tombstones and table pagination. On the operations side, we will discuss our findings on java driver usage, performance, monitoring, cluster maintenance, version upgrade, 2-way ssl and many more. We hope you can learn from our mistakes instead of making them yourself!
About the Speakers
Christopher Webster Software Engineer, AOL
Christopher Webster works on the web services platform for the go90 AOL project. Previously he was a Computer Scientist for the Mission Control Technologies project at NASA Ames Center. Chris worked as a senior staff engineer at Sun Microsystems for Project zembly, the cloud development and deployment environment as well as technical lead in many NetBeans projects. Chris is an author of the NetBeans Field Guide and Assemble the Social Web With Zembly.
Thomas Ng Software Engineer, AOL
Thomas Ng is a software engineer at AOL, building web services for the go90 mobile entertainment platform using Cassandra, Scala and Kafka.
Adam Zegelin is Instaclustr's founding software engineer. This presentation will investigate how using micro-batching for submitting writes to Cassandra can improve throughput and reduce client application CPU load. Micro-batching combines writes for the same partition key into a single network request and ensures they hit the “fast path” for writes on a Cassandra node.
Co-Founder and CTO of Instaclustr, Ben Bromhead's presentation at the Cassandra Summit 2016, in San Jose.
This presentation will show how create truly elastic Cassandra deployments on AWS allowing you to scale and shrink your large Cassandra deployments multiple times a day. Leveraging a combination of EBS backed disks, JBOD, token pinning and our previous work on bootstrapping from backups you will be able to dramatically reduce costs per cluster by scaling to match your daily workloads.
This presentation will investigate how using micro-batching for submitting writes to Cassandra can improve throughput and reduce client application CPU load.
Micro-batching combines writes for the same partition key into a single network request and ensures they hit the "fast path" for writes on a Cassandra node.
About the Speaker
Adam Zegelin Technical Co-founder, Instaclustr
As Instaclustrs founding software engineer, Adam provides the foundation knowledge of our capability and engineering environment. He delivers business-focused value to our code-base and overall capability architecture. Adam is also focused on providing Instaclustr's contribution to the broader open source community on which our products and services rely, including Apache Cassandra, Apache Spark and other technologies such as CoreOS and Docker.
What We Learned About Cassandra While Building go90 (Christopher Webster & Th...DataStax
Go90 is a mobile entertainment platform offering access to live and on demand videos. We built the web services platform and social features like activity feed for go90 by making heavy use of Cassandra and Scala, and would like to share what we learned during development and while operating go90. In this presentation, we cover our data model evolution from the initial prototypes to the current production version and the significant performance gain by using a better data model. We will explain how we apply time series data modeling and the benefits of using expiring columns with DateTieredCompactionStrategy. We will also talk about interesting experiences related to table modifications, tombstones and table pagination. On the operations side, we will discuss our findings on java driver usage, performance, monitoring, cluster maintenance, version upgrade, 2-way ssl and many more. We hope you can learn from our mistakes instead of making them yourself!
About the Speakers
Christopher Webster Software Engineer, AOL
Christopher Webster works on the web services platform for the go90 AOL project. Previously he was a Computer Scientist for the Mission Control Technologies project at NASA Ames Center. Chris worked as a senior staff engineer at Sun Microsystems for Project zembly, the cloud development and deployment environment as well as technical lead in many NetBeans projects. Chris is an author of the NetBeans Field Guide and Assemble the Social Web With Zembly.
Thomas Ng Software Engineer, AOL
Thomas Ng is a software engineer at AOL, building web services for the go90 mobile entertainment platform using Cassandra, Scala and Kafka.
Hello Cronies,
Here are the slides of our recent meetup. .
Title: It's about Time: Deep dive into event store using Apache Cassandra
Big data At-A-Glance
· What is Big data?
· What we have seen so far in AJM Bigdata series?
· Refresher/Overview of basic terminology
· Where it is? Am I using it?
Introduction to Apache Cassandra
· What, When and Why of Apache Cassandra
· Protocol, Queries, Architecture and everything else
· Who is using Apache Cassandra
· Interesting use cases of Apache Cassandra ( Twitter/ Disqus/ etc.)
· Demo application walk-through
Over the past year data at GumGum has quadrupled. Now a days we process 20 TB of new data every day. New data/reporting requirements pour in every week. Usage of the real time data is growing with the daily data. In this talk we are tying to answer the following questions: How do we serve real time data with daily batched data to our consumers together? What is Lambda Architecture and how does it help? What role does Cassandra play in Lambda Architecture at GumGum? How did we solve few bottlenecks in the architecture using Cassandra? How Cassandra can help you avoid microbatching and give you a true realtime data?
About the Speaker
Vaibhav Puranik VP of Engineering, Big Data & Platform, GumGum
Vaibhav has 15 years of experience in Software. He began his career with Johnson Space Center in Houston and has a masters degree in computer science. For past 5 years Vaibhav has been responsible for architecting multiple big data systems at GumGum. He manages Data Science, Data Engineering and DevOps teams at GumGum.
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...DataStax
With the addition of vnodes (Virtual Nodes), Cassandra users were able to gain a few benefits as a result of streaming when it came to bootstrapping and decommissioning nodes. On the flip side, having to route requests on larger clusters became a lot more intensive of a workload for all nodes that were then forced to act coordinator nodes. By setting up a tier of proxy nodes, we were able to have our cluster of 50 nodes perform with a 300% improvement on average in a mixed workload environment. This is an explanation of what we did, how we did it, and why it works.
About the Speaker
Eric Lubow CTO, SimpleReach
Eric Lubow is CTO of SimpleReach, where he builds highly-scalable distributed systems for processing analytics data. Eric is also a DataStax MVP for Cassandra, and co-author of Practical Cassandra. In his spare time, Eric is a skydiver, motorcycle rider, mixed martial artist, and dog dad.
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016DataStax
A deep learning startup has a requirement for a robust and scalable data architecture. Training a Deep Neural Network requires 10s-100s of millions of examples consisting of data and metadata. In addition to training it is necessary to support test/validation, data exploration and more traditional data science analytics workloads. As a startup we have minimal resources and an engineering team of 1.
Cassandra, Spark and Kafka running on Mesos in AWS is a scalable architecture that is fast and easy to set up and maintain to deliver a data architecture for Deep Learning.
About the Speaker
Andrew Jefferson VP Engineering, Tractable
A software engineer specialising in realtime data systems. I've worked at companies from Startups to Apple on applications ranging from Ticketing to Genetics. Currently building data systems for training and exploiting Deep Neural Networks.
This session will re-evaluate Cassandra’s relationship with runtime and build systems, pointing out ways that the existing systems fall down, and identifying avenues for improvement. Over the past few years, a number of platforms have emerged for running user code. Container runtimes like Docker, container orchestrators such as Kubernetes, and metrics collections agents like Prometheus and Spectator have all gained popularity and mind-share. Cassandra functionality such as metrics, bootstrapping, and monitoring integrates with the newer paradigms, but in an ad-hoc and improvised fashion. By taking a purposeful approach to integrating with these new methods of deployment, the Cassandra community can more fully benefit from their advertised strengths. The Cassandra build system based on Ant+Ivy dates to the early 2000’s, and reflects legacy complexity that could be avoided with modern build systems. Cassandra’s system package builds are not much better and often fail to integrate with industry standards such as systemd. Iterating on the existing systems is difficult, but this technical debt slows innovation in our build systems. In this talk, we propose solutions to make building, deploying and monitoring Cassandra easy and low overhead, while taking advantage of cloud advancements wherever possible.
We run multiple DataStax Enterprise clusters in Azure each holding 300 TB+ data to deeply understand Office 365 users. In this talk, we will deep dive into some of the key challenges and takeaways faced in running these clusters reliably over a year. To name a few: process crashes, ephemeral SSDs contributing to data loss, slow streaming between nodes, mutation drops, compaction strategy choices, schema updates when nodes are down and backup/restore. We will briefly talk about our contributions back to Cassandra, and our path forward using network attached disks offered via Azure premium storage.
About the Speaker
Anubhav Kale Sr. Software Engineer, Microsoft
Anubhav is a senior software engineer at Microsoft. His team is responsible for building big data platform using Cassandra, Spark and Azure to generate per-user insights of Office 365 users.
Cassandra is the dominant data store used at Netflix and it's health is critical to many of its services. In this talk we will share details of the recent redesign of our health monitoring system and how we leveraged a reactive stream processing system to give us a real-time view our entire fleet while dramatically improving accuracy and reducing false alarms in our alerting.
About the Speaker
Jason Cacciatore Senior Software Engineer, Netflix
Jason Cacciatore is a Senior Software Engineer at Netflix, where he's been working for the past several years. He's interested in stateful distributed systems and has a diverse background in technology. In his spare time he enjoys spending time with his wife and two sons, reading non-fiction, and watching Netflix documentaries.
Cassandra Summit 2014: Apache Cassandra Best Practices at EbayDataStax Academy
Presenter: Feng Qu, Principal DBA at eBay
Cassandra has been adopted widely at eBay in recent years and used by many end-user facing applications. I will introduce best practices we have built over the time around system design, capacity planning, deployment automation, monitoring integration, performance analysis and troubleshooting. I will also share our experience working with DataStax support to provide a highly available, highly scalable data store fitting into eBay infrastructure.
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Data Con LA
After a brief technical introduction to Apache Cassandra we'll then go into the exciting world of Apache Spark integration, and learn how you can turn your transactional datastore into an analytics platform. Apache Spark has taken the Hadoop world by storm (no pun intended!), and is widely seen as the replacement to Hadoop Map Reduce. Apache Spark coupled with Cassandra are perfect allies, Cassandra does the distributed data storage, Spark does the distributed computation.
Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra...DataStax
Cassandra is getting more and more buzz and that means two things, more development and more issues. Some issues are unavoidable, but some of them are, just by understanding how our tooling works.
In this talk I'd like to review the core concepts on which Cassandra is built and how they impose the way we should work with it using some examples that will hopefully give you both a 'Quick Reference' and a 'Checklist' to go through every time you want to build scalable data models.
About the Speaker
Carlos Alonso Software Engineer, Job and Talent
Carlos received his Masters CS at Salamanca University, Spain. He worked a few years there in a digital agency, gaining expertise on a very wide range of technologies before moving to London where he narrowed down the focus on to the backend and data engineering disciplines. The latest step in his professional career was to move back to Madrid to work for Job and Talent where he currently helps on building the best candidate-job opening matching technology. Aside from work he likes sharing as much as he can by public speaking, mentoring or getting involved in OSS or OpenData initiatives.
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...DataStax
We use Apache Cassandra at BlackRock to help power our Aladdin investment management platform. Like most users, we love Cassandra’s scalability and fault tolerance. One challenge we’ve faced is keeping data consistent between data centers. Cassandra is great at replicating data to multiple data centers, and many users take advantage of this feature to achieve eventual consistency in multi-region clusters. At BlackRock, we have several use cases where eventual consistency is not good enough; sometimes we need to guarantee that the most recent data is available from all locations. Cassandra’s tunable consistency makes it possible to achieve this extreme level of resiliency. In this talk we’ll discuss our experience from the past several years using Cassandra for cross-WAN consistency, some of the novel ways we’ve dealt with the performance implications, and our ideas for improving support for this usage model in future versions of Cassandra.
About the Speaker
Randy Fradin Vice President, BlackRock
Randy Fradin is part of BlackRock’s Aladdin Product Group. His team is responsible for developing the core software infrastructure in BlackRock’s Aladdin platform, including scalable storage, compute, and messaging services. Previously he spent time developing the market data, risk reporting, and core trading functions in Aladdin. He has been an enthusiastic Cassandra user since 2011.
The Cassandra architecture shines at ensuring a very high availability of data even while nodes are failing or are overloaded. On the other hand, query latency will often rise during these events, especially on the higher percentiles. Many improvements have been made to reduce this effect over the past years. This talk will focus on one in particular: Speculative Retries. Introduced in Cassandra 2.0 on the server side and in the Java Driver 3.0 on the client side, this strategy remains complex to fully understand and to finely tune. This talk will deep dive into theoretical and practical aspects of Speculative Retries, showing the effect of tuning strategies with ad-hoc benchmarks.
About the Speakers
Michael Figuiere Cloud Platform Engineer, Netflix
Michael is a senior software engineer at Netflix where he works on improving the cloud storage infrastructure. He previously worked at Apple and DataStax where he worked for several years on creating Drivers and Developer Tools for Cassandra. At ease with both enterprise applications and lower level technologies, he specializes in distributed architectures and topics such as databases, search engines, and cloud.
Minh Do Senior Distributed Engineer, Netflix
Minh Do has been working at Netflix for the last several years to run, patch, and troubleshoot Cassandra on both server and client sides, and is also a co-creator of Dynomite project. Prior to Netflix, at Tango, he spearheaded its Big Data pipeline system from the ground using Spark/Hadoop. Before that, at Qualys, he built a distributed queue system that bridges traffics between all major components. He has passion in distributed system, machine learning/deep learning, and data storages.
These are the slides from my talk at Hulu in March 2015 discussing Apache Spark & Cassandra. I cover the evolution of data from a single machine to RDBMS (MySQL is the primary example) to big data systems.
On the Spark side, I covered batch jobs, streaming, Apache Kafka, an introduction to machine learning, clustering, logistic regression and recommendations systems (collaborative filtering).
The talk was recorded and is available on youtube: https://www.youtube.com/watch?v=_gFgU3phogQ
This sessions covers diagnosing and solving common problems encountered in production, using performance profiling tools. We’ll also give a crash course to basic JVM garbage collection tuning. Attendees will leave with a better understanding of what they should look for when they encounter problems with their in-production Cassandra cluster. This talk is intended for people with a general understanding of Cassandra, but it not required to have experience running it in production.
Hello Cronies,
Here are the slides of our recent meetup. .
Title: It's about Time: Deep dive into event store using Apache Cassandra
Big data At-A-Glance
· What is Big data?
· What we have seen so far in AJM Bigdata series?
· Refresher/Overview of basic terminology
· Where it is? Am I using it?
Introduction to Apache Cassandra
· What, When and Why of Apache Cassandra
· Protocol, Queries, Architecture and everything else
· Who is using Apache Cassandra
· Interesting use cases of Apache Cassandra ( Twitter/ Disqus/ etc.)
· Demo application walk-through
Over the past year data at GumGum has quadrupled. Now a days we process 20 TB of new data every day. New data/reporting requirements pour in every week. Usage of the real time data is growing with the daily data. In this talk we are tying to answer the following questions: How do we serve real time data with daily batched data to our consumers together? What is Lambda Architecture and how does it help? What role does Cassandra play in Lambda Architecture at GumGum? How did we solve few bottlenecks in the architecture using Cassandra? How Cassandra can help you avoid microbatching and give you a true realtime data?
About the Speaker
Vaibhav Puranik VP of Engineering, Big Data & Platform, GumGum
Vaibhav has 15 years of experience in Software. He began his career with Johnson Space Center in Houston and has a masters degree in computer science. For past 5 years Vaibhav has been responsible for architecting multiple big data systems at GumGum. He manages Data Science, Data Engineering and DevOps teams at GumGum.
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...DataStax
With the addition of vnodes (Virtual Nodes), Cassandra users were able to gain a few benefits as a result of streaming when it came to bootstrapping and decommissioning nodes. On the flip side, having to route requests on larger clusters became a lot more intensive of a workload for all nodes that were then forced to act coordinator nodes. By setting up a tier of proxy nodes, we were able to have our cluster of 50 nodes perform with a 300% improvement on average in a mixed workload environment. This is an explanation of what we did, how we did it, and why it works.
About the Speaker
Eric Lubow CTO, SimpleReach
Eric Lubow is CTO of SimpleReach, where he builds highly-scalable distributed systems for processing analytics data. Eric is also a DataStax MVP for Cassandra, and co-author of Practical Cassandra. In his spare time, Eric is a skydiver, motorcycle rider, mixed martial artist, and dog dad.
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016DataStax
A deep learning startup has a requirement for a robust and scalable data architecture. Training a Deep Neural Network requires 10s-100s of millions of examples consisting of data and metadata. In addition to training it is necessary to support test/validation, data exploration and more traditional data science analytics workloads. As a startup we have minimal resources and an engineering team of 1.
Cassandra, Spark and Kafka running on Mesos in AWS is a scalable architecture that is fast and easy to set up and maintain to deliver a data architecture for Deep Learning.
About the Speaker
Andrew Jefferson VP Engineering, Tractable
A software engineer specialising in realtime data systems. I've worked at companies from Startups to Apple on applications ranging from Ticketing to Genetics. Currently building data systems for training and exploiting Deep Neural Networks.
This session will re-evaluate Cassandra’s relationship with runtime and build systems, pointing out ways that the existing systems fall down, and identifying avenues for improvement. Over the past few years, a number of platforms have emerged for running user code. Container runtimes like Docker, container orchestrators such as Kubernetes, and metrics collections agents like Prometheus and Spectator have all gained popularity and mind-share. Cassandra functionality such as metrics, bootstrapping, and monitoring integrates with the newer paradigms, but in an ad-hoc and improvised fashion. By taking a purposeful approach to integrating with these new methods of deployment, the Cassandra community can more fully benefit from their advertised strengths. The Cassandra build system based on Ant+Ivy dates to the early 2000’s, and reflects legacy complexity that could be avoided with modern build systems. Cassandra’s system package builds are not much better and often fail to integrate with industry standards such as systemd. Iterating on the existing systems is difficult, but this technical debt slows innovation in our build systems. In this talk, we propose solutions to make building, deploying and monitoring Cassandra easy and low overhead, while taking advantage of cloud advancements wherever possible.
We run multiple DataStax Enterprise clusters in Azure each holding 300 TB+ data to deeply understand Office 365 users. In this talk, we will deep dive into some of the key challenges and takeaways faced in running these clusters reliably over a year. To name a few: process crashes, ephemeral SSDs contributing to data loss, slow streaming between nodes, mutation drops, compaction strategy choices, schema updates when nodes are down and backup/restore. We will briefly talk about our contributions back to Cassandra, and our path forward using network attached disks offered via Azure premium storage.
About the Speaker
Anubhav Kale Sr. Software Engineer, Microsoft
Anubhav is a senior software engineer at Microsoft. His team is responsible for building big data platform using Cassandra, Spark and Azure to generate per-user insights of Office 365 users.
Cassandra is the dominant data store used at Netflix and it's health is critical to many of its services. In this talk we will share details of the recent redesign of our health monitoring system and how we leveraged a reactive stream processing system to give us a real-time view our entire fleet while dramatically improving accuracy and reducing false alarms in our alerting.
About the Speaker
Jason Cacciatore Senior Software Engineer, Netflix
Jason Cacciatore is a Senior Software Engineer at Netflix, where he's been working for the past several years. He's interested in stateful distributed systems and has a diverse background in technology. In his spare time he enjoys spending time with his wife and two sons, reading non-fiction, and watching Netflix documentaries.
Cassandra Summit 2014: Apache Cassandra Best Practices at EbayDataStax Academy
Presenter: Feng Qu, Principal DBA at eBay
Cassandra has been adopted widely at eBay in recent years and used by many end-user facing applications. I will introduce best practices we have built over the time around system design, capacity planning, deployment automation, monitoring integration, performance analysis and troubleshooting. I will also share our experience working with DataStax support to provide a highly available, highly scalable data store fitting into eBay infrastructure.
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Data Con LA
After a brief technical introduction to Apache Cassandra we'll then go into the exciting world of Apache Spark integration, and learn how you can turn your transactional datastore into an analytics platform. Apache Spark has taken the Hadoop world by storm (no pun intended!), and is widely seen as the replacement to Hadoop Map Reduce. Apache Spark coupled with Cassandra are perfect allies, Cassandra does the distributed data storage, Spark does the distributed computation.
Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra...DataStax
Cassandra is getting more and more buzz and that means two things, more development and more issues. Some issues are unavoidable, but some of them are, just by understanding how our tooling works.
In this talk I'd like to review the core concepts on which Cassandra is built and how they impose the way we should work with it using some examples that will hopefully give you both a 'Quick Reference' and a 'Checklist' to go through every time you want to build scalable data models.
About the Speaker
Carlos Alonso Software Engineer, Job and Talent
Carlos received his Masters CS at Salamanca University, Spain. He worked a few years there in a digital agency, gaining expertise on a very wide range of technologies before moving to London where he narrowed down the focus on to the backend and data engineering disciplines. The latest step in his professional career was to move back to Madrid to work for Job and Talent where he currently helps on building the best candidate-job opening matching technology. Aside from work he likes sharing as much as he can by public speaking, mentoring or getting involved in OSS or OpenData initiatives.
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...DataStax
We use Apache Cassandra at BlackRock to help power our Aladdin investment management platform. Like most users, we love Cassandra’s scalability and fault tolerance. One challenge we’ve faced is keeping data consistent between data centers. Cassandra is great at replicating data to multiple data centers, and many users take advantage of this feature to achieve eventual consistency in multi-region clusters. At BlackRock, we have several use cases where eventual consistency is not good enough; sometimes we need to guarantee that the most recent data is available from all locations. Cassandra’s tunable consistency makes it possible to achieve this extreme level of resiliency. In this talk we’ll discuss our experience from the past several years using Cassandra for cross-WAN consistency, some of the novel ways we’ve dealt with the performance implications, and our ideas for improving support for this usage model in future versions of Cassandra.
About the Speaker
Randy Fradin Vice President, BlackRock
Randy Fradin is part of BlackRock’s Aladdin Product Group. His team is responsible for developing the core software infrastructure in BlackRock’s Aladdin platform, including scalable storage, compute, and messaging services. Previously he spent time developing the market data, risk reporting, and core trading functions in Aladdin. He has been an enthusiastic Cassandra user since 2011.
The Cassandra architecture shines at ensuring a very high availability of data even while nodes are failing or are overloaded. On the other hand, query latency will often rise during these events, especially on the higher percentiles. Many improvements have been made to reduce this effect over the past years. This talk will focus on one in particular: Speculative Retries. Introduced in Cassandra 2.0 on the server side and in the Java Driver 3.0 on the client side, this strategy remains complex to fully understand and to finely tune. This talk will deep dive into theoretical and practical aspects of Speculative Retries, showing the effect of tuning strategies with ad-hoc benchmarks.
About the Speakers
Michael Figuiere Cloud Platform Engineer, Netflix
Michael is a senior software engineer at Netflix where he works on improving the cloud storage infrastructure. He previously worked at Apple and DataStax where he worked for several years on creating Drivers and Developer Tools for Cassandra. At ease with both enterprise applications and lower level technologies, he specializes in distributed architectures and topics such as databases, search engines, and cloud.
Minh Do Senior Distributed Engineer, Netflix
Minh Do has been working at Netflix for the last several years to run, patch, and troubleshoot Cassandra on both server and client sides, and is also a co-creator of Dynomite project. Prior to Netflix, at Tango, he spearheaded its Big Data pipeline system from the ground using Spark/Hadoop. Before that, at Qualys, he built a distributed queue system that bridges traffics between all major components. He has passion in distributed system, machine learning/deep learning, and data storages.
These are the slides from my talk at Hulu in March 2015 discussing Apache Spark & Cassandra. I cover the evolution of data from a single machine to RDBMS (MySQL is the primary example) to big data systems.
On the Spark side, I covered batch jobs, streaming, Apache Kafka, an introduction to machine learning, clustering, logistic regression and recommendations systems (collaborative filtering).
The talk was recorded and is available on youtube: https://www.youtube.com/watch?v=_gFgU3phogQ
This sessions covers diagnosing and solving common problems encountered in production, using performance profiling tools. We’ll also give a crash course to basic JVM garbage collection tuning. Attendees will leave with a better understanding of what they should look for when they encounter problems with their in-production Cassandra cluster. This talk is intended for people with a general understanding of Cassandra, but it not required to have experience running it in production.
In depth overview of the Flex data binding code generation. Provides info on accomplish data binding through actionscript as well as limitations of the process.
Processing 50,000 events per second with Cassandra and SparkInstaclustr
Chief Product Officer at Instaclustr, Ben Slater's presentation from the Cassandra Summit 2016, in San Jose.
Over the last 12 months, Instaclustr has developed a centralised system to capture and analysis monitoring metrics from its managed services fleet of well over 500 Cassandra and Spark nodes. While we entered into the exercise with plenty of experience provisioning and managing Cassandra and Spark, this was our first experience building a significant application. We’ll walk through some of the missteps we made and lessons learned along the way as well as sharing our current solution architecture.
Ben Slater, Chief Product Officer at Instaclustr, presentation from the Cassandra Summit 2016, in San Jose.
This presentation walks through some of the key considerations for planning and running load test to ensure your Cassandra application will meet you expected scaling requirements. It also goes through some examples of using the cassandra-stress tool to construct load test for real-life application scenarios.
Teads is #1 in Video Ads. Read how Teads handles up to ~1 million requests/s with Apache Cassandra. How do we tuned Cassandra servers and clients. What issues we faced during the last year. How do we provision our clusters. Which tools are used: Datadog for monitoring and alerting, Cassandra reaper, Rundeck, Sumologic, cassandra_snapshotter. Why do we need a fork.
The world has changed and having one huge server won’t do the job anymore, when you’re talking about vast amounts of data, growing all the time the ability to Scale Out would be your savior. Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
This lecture will be about the basics of Apache Spark and distributed computing and the development tools needed to have a functional environment.
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...DataStax
Customizing JVM settings for the needs of an application can be a tricky business, especially when running externally developed software such as Cassandra. In this talk I will share our experiences and the procedure that we have used to test and validate changes with Java tuning. We'll explore with two recent experiences: changes and monitoring of G1 garbage collection, and moving buffer objects off the heap.
For the talk, I'll discuss our tuning process at Knewton. I will share some of the challenges that we faced while identifying what we expected to learn. I'll discuss how we isolated and minimized variables across tests, the importance of the duration of these tests, and how we try to separate correlation from causation. I will demonstrate how to use and interpret the results of the custom scripts that we were driven to develop to gain visibility into our G1GC processes; these scripts will be open sourced.
About the Speaker
Carlos Monroy Senior Software Engineer, Knewton
Carlos Monroy is a senior engineer on the database team at Knewton, an education company that created an adaptive learning platform. Carlos has been developing software professionally since 1998. His experience holding multiple roles on the software lifecycle provides him a wholistic approach. Having used over a half dozen relational database engines, he has recently come over to the NoSQL side, first working with HBase and for the last three years Cassandra.
Processing 50,000 Events Per Second with Cassandra and Spark (Ben Slater, Ins...DataStax
Over the last 12 months, Instaclustr has developed a centralised system to capture and analysis monitoring metrics from its managed services fleet of well over 500 Cassandra and Spark nodes. While we entered into the exercise with plenty of experience provisioning and managing Cassandra and Spark, this was our first experience building a significant application. We'll walk through some of the missteps we made and lessons learned along the way as well as sharing our current solution architecture.
About the Speaker
Ben Slater Chief Product Officer, Instaclustr
Instaclustr provides Cassandra and Spark as a managed service in the cloud. As Chief Product Officer, Ben is charged with steering Instaclustr's development roadmap, managing product engineering and overseeing the production support and consulting teams. Ben has over 20 years experience in systems development including previously as lead architect for the product that is now Oracle Policy Automation and over 10 years as a solution architect and project manager for Accenture.
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016DataStax
Most web applications start out with a Postgres database and it serves the application very well for an extended period of time. Based on type of application, the data model of the app will have a table that tracks some kind of state for either objects in the system or the users of the application. Names for this table include logs, messages or events. The growth in the number of rows in this table is not linear as the traffic to the app increases, it's typically exponential.
Over time, the state table will increasingly become the bulk of the data volume in Postgres, think terabytes, and become increasingly hard to query. This use case can be characterized as the one-big-table problem. In this situation, it makes sense to move that table out of Postgres and into Cassandra. This talk will walk through the conceptual differences between the two systems, a bit of data modeling, as well as advice on making the conversion.
About the Speaker
Rimas Silkaitis Product Manager, Heroku
Rimas currently runs Product for Heroku Postgres and Heroku Redis but the common thread throughout his career is data. From data analysis, building data warehouses and ultimately building data products, he's held various positions that have allowed him to see the challenges of working with data at all levels of an organization. This experience spans the smallest of startups to the biggest enterprises.
Extending Spark Streaming to Support Complex Event ProcessingOh Chan Kwon
In this talk, we introduce the extensions of Spark Streaming to support (1) SQL-based query processing and (2) elastic-seamless resource allocation. First, we explain the methods of supporting window queries and query chains. As we know, last year, Grace Huang and Jerry Shao introduced the concept of “StreamSQL” that can process streaming data with SQL-like queries by adapting SparkSQL to Spark Streaming. However, we made advances in supporting complex event processing (CEP) based on their efforts. In detail, we implemented the sliding window concept to support a time-based streaming data processing at the SQL level. Here, to reduce the aggregation time of large windows, we generate an efficient query plan that computes the partial results by evaluating only the data entering or leaving the window and then gets the current result by merging the previous one and the partial ones. Next, to support query chains, we made the result of a query over streaming data be a table by adding the “insert into” query. That is, it allows us to apply stream queries to the results of other ones. Second, we explain the methods of allocating resources to streaming applications dynamically, which enable the applications to meet a given deadline. As the rate of incoming events varies over time, resources allocated to applications need to be adjusted for high resource utilization. However, the current Spark's resource allocation features are not suitable for streaming applications. That is, the resources allocated will not be freed when new data are arriving continuously to the streaming applications even though the quantity of the new ones is very small. In order to resolve the problem, we consider their resource utilization. If the utilization is low, we choose victim nodes to be killed. Then, we do not feed new data into the victims to prevent a useless recovery issuing when they are killed. Accordingly, we can scale-in/-out the resources seamlessly.
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe several techniques available in R to speed up workloads like these, by running multiple iterations simultaneously, in parallel.
Many of these techniques require the use of a cluster of machines running R, and I'll provide examples of using cloud-based services to provision clusters for parallel computations. In particular, I will describe how you can use the SparklyR package to distribute data manipulations using the dplyr syntax, on a cluster of servers provisioned in the Azure cloud.
Presented by David Smith at Data Day Texas in Austin, January 27 2018.
A presentation I made for Apache Spark and Apache Cassandra Integration.
First I present what are some of the differences between RDBMS and NoSQL, then I proceed with the Cassandra infrastructure and usual errors when creating a Cassandra Data Model.
Finally, I provide the Spark underlying main concepts and some settings for proper configuration.
Migrating on premises workload to azure sql databasePARIKSHIT SAVJANI
Azure SQL Database is a fully managed cloud database service with built-in intelligence, elastic scale, performance, reliability, and data protection that enables enterprises and ISVs to reduce their total cost of ownership and operational cost and overheads. In this session, I will share real-world experience of successfully migrated existing SaaS application and on-premises workload for some our tier 1 customers and ISV partners to Azure SQL Database service. The session walks through planning, assessment, migration tools and best practices from the proven experiences and practices of migrating real world applications to Azure SQL Database service.
Microservices, Continuous Delivery, and Elasticsearch at Capital OneNoriaki Tatsumi
This presentation focuses on the implementation of Continuous Delivery and Microservices principles in Capital One’s
cybersecurity data platform – which ingests ~6 TB of data every day, and where Elasticsearch is a core component.
This talk will walk through the journey of Cassandra at Netflix. It will go into 3-4 specific use cases where Cassandra stands out than the rest of the data-stores and is being used in Netflix, bringing great viewing experience to all customers globally. Roopa will go into the specifics of the data model being used and where Cassandra stands out with its strengths and which places where they learnt the hard way. Roopa will then share some of the best practices and self service platform being used for Cassandra to cater to their developer needs.
Avoiding the Pit of Despair - Event Sourcing with Akka and CassandraLuke Tillman
With Akka you take a complicated system and break it down into lots of smaller units (actors) that communicate by passing messages. A single actor system can easily scale to millions or tens of millions of actors running on many machines. As actors process messages, they build up internal state, and many times we want that state persisted somewhere. In this talk, we'll dive into the event sourcing API used to persist actor state in Akka and talk about how we build a data model to support it in Cassandra. At first, the data model seems pretty straightforward, but the more we dig in, the more we see that a couple of classic Cassandra anti-patterns are pushing us close to the Pit of Despair. We'll come up with a way to avoid these problems so we can go on building distributed systems happily ever after with Akka and Cassandra.
The Developer Data Scientist – Creating New Analytics Driven Applications usi...Microsoft Tech Community
The developer world is changing as we create and generate new data patterns and handling processes within our applications. Additionally, with the massive interest in machine learning and advanced analytics how can we as developers build intelligence directly into our applications that can integrate with the data and data paths we are creating? The answer is Azure Databricks and by attending this session you will be able to confidently develop smarter and more intelligent applications and solutions which can be continuously built upon and that can scale with the growing demands of a modern application estate.
A quick overview of Redshift and common use-cases. Followed by tools and links to performance tuning. How Redshift fits in the AWS data services. A list of key new features since last meetup in September 2016, including Redshift Spectrum that allows one to run SQL directly on your data sitting on Amazon S3. It also includes Redshift echosystem with data integration, bi, consultancy and data modelling partners.
Similar to Instaclustr Webinar 50,000 Transactions Per Second with Apache Spark on Apache Cassandra (20)
Presentation at the Next Generation Cassandra Conference (NGCC) 2017 delivered by Ben Bromhead on the health of the Apache Cassandra open source community.
Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...Instaclustr
See how our engineering team have implemented an open source Apache Spark on Apache Cassandra solution to capture metrics of the various nodes that we monitor for our customers.
Being able to rapidly iterate on, build and test your code is key to being a productive developer. Without local automation, working with the numerous platforms and technologies in your stack can become very frustrating.
In this on-demand webinar, Ben Bromhead CTO of Instaclustr explores best practices to easily integrate Apache Cassandra into your development workflow, so you spend more time writing good code and less time fighting your environment.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
3. Introduction
• Problem background
• Solution overview
• Implementation approach
• Writing data
• Rolling up data
• Presenting data
• Optimization
• What’s next?
3
4. Problem background
• How to efficiently monitor >600 servers all running
Cassandra
• Need to develop a metric history over time for tuning
alerting & automated response systems
• Off the shelf systems are available, however:
• Probably don’t give us the flexibility we want to be able to
optimize for our environment
• We wanted a meaty problem to tackle ourselves to dog-food our
own offering and build our internal skills and understanding
4
6. Implementation approach
1. Writing Data
2. Rolling Up Data
3. Presenting Data
6
~ 9(!) months
(with quite a few detours
and distractions)
7. Writing Data
• Aligning Data Model with DTCS
• Initial design did not have time value in partition key
• Settled on bucketing by 5 mins
• Enables DTCS to work
• Works really well for extracting data for roll-up
• Adds complexity for retrieving data
• When running with STCS needed unchecked_compactions=true to avoid build up of TTL’d data
• Batching of writes
• Found batching of 200 rows per insert to provide optimal throughput and client load
• See Adam’s C* summit talk for all the detail
• Controlling data volumes from column family metrics
• Limited, rotating set of CFs per check-in
• Managing back pressure is important
7
8. Rolling Up Data
• Developing functional solution was easy,
• Getting to acceptable performance was hard and time consuming
• But all seemed easy once we’d solved it
• Keys to performance?
• Align raw data partition bucketing with roll-up timeframe (5 mins)
• Use joinWithCassandra table to extract the required data – 2-3x performance
improvement over alternate approaches
val RDDJoin = sc.cassandraTable[(String, String)]("instametrics" , "service_per_host")
.filter(a => broadcastListEventAll.value.map(r => a._2.matches(r)).foldLeft(false)(_ || _))
.map(a => (a._1, dateBucket, a._2))
.repartitionByCassandraReplica("instametrics", "events_raw_5m", 100)
.joinWithCassandraTable("instametrics", "events_raw_5m").cache()
• Write limiting
• cassandra.output.throughput_mb_per_sec not necessary as writes << reads
8
9. Presenting Data
• Generally just worked!
• Main challenge
• Dealing with how to find latest data in buckets when not all data is
reported in each data set
9
10. Optimization
• Upgraded to Cassandra 3.7 and change code to use Cassandra aggregates:
val RDDJoin = sc.cassandraTable[(String, String)]("instametrics" ,
"service_per_host")
.filter(a => broadcastListEventAll.value.map(r =>
a._2.matches(r)).foldLeft(false)(_ || _))
.map(a => (a._1, dateBucket, a._2))
.repartitionByCassandraReplica("instametrics", "events_raw_5m",
100)
.joinWithCassandraTable("instametrics", "events_raw_5m",
SomeColumns("time", "state", FunctionCallRef("avg",
Seq(Right("metric")), Some("avg")), FunctionCallRef("max",
Seq(Right("metric")), Some("max")), FunctionCallRef("min",
Seq(Right("metric")), Some("min")))).cache()
• 50% reduction in roll-up job runtime (from 5-6 mins to 2.5-3mins) with reduced CPU usage
10
11. What’s next
• Investigate:
• Use Spark Streaming for 5 min roll-ups rather than save and
extract
• Scale-out by adding nodes is working as expected
• Continue to add additional metrics to roll-ups as we add
functionality
• Plan to introduce more complex analytics & feed historic
values back to Reimann for use in alerting
11