This talk was presented at O'Reilly's Velocity conference in Santa Clara, May 28 2015.
Abstract: http://velocityconf.com/devops-web-performance-2015/public/schedule/detail/42284
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016 | cdmaxime
What we do:
- We build a system for the operation of modern data centers
- Triage and diagnostics, exploration, trends, advanced analytics of complex systems
- Our data: logs, metrics, human activity, anything that occurs in the data center
- Enterprise Software (i.e. we build for others.)
Today's presentation: how we built what we built on top of Apache Hadoop
We hear a lot about lambda architectures and how Cassandra and Spark can help us crunch our data both in batch and real-time. After a year in the trenches, I'll share how we at The Weather Company built a general purpose, weather-scale event processing pipeline to make sense of billions of events each day. If you want to avoid much of the pain of learning how to get it right, this talk is for you.
It is widely understood that our software needs to become reactive; we need to consider responsiveness, maintainability, elasticity and scalability from the outset. Not all systems need to implement all these to the same degree, as specific project requirements will determine where effort is most wisely spent. But, in the vast majority of cases, the need to go reactive will demand that we design our applications differently.
In this presentation Dr. Roland Kuhn will explore several architecture elements that are commonly found in reactive systems, like the circuit breaker, various replication techniques, and flow control protocols. These patterns are language agnostic and also independent of the abundant choice of reactive programming frameworks and libraries. They are well-specified starting points for exploring the design space of a concrete problem: thinking is strictly required!
This webinar is based on Dr. Kuhn’s session, Reactive Design Sessions, presented at WJAX and Code Mesh.
codecentric AG: CQRS and Event Sourcing Applications with Cassandra | DataStax Academy
CQRS (Command Query Responsibility Segregation) is a pattern that separates the processes of querying and updating data. A query only returns data without any side effects, whereas a command is designed to change data. CQRS is often combined with Event Sourcing, an architecture in which all changes to an application's state are stored as a sequence of events.
Because of its great capability to store time series data, Cassandra is the perfect fit for implementing the event store. But there are still a lot of open questions: What about the data modeling? What techniques will be used to process and store data in the Cassandra database? How can the current state of the application be accessed without replaying every event? And what about failure handling?
In this talk, I will give a brief introduction to CQRS and the Event Sourcing pattern and will then answer the questions above using a real life example of a data store for customer data.
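The question of reading current state without replaying every event can be illustrated with a small sketch. The Python below is not taken from the talk: the `Customer` fields, the event kinds, and the in-memory list standing in for a Cassandra-backed event store are all assumptions made for illustration. It derives state by folding events in order and shows how a snapshot lets a reader replay only the events recorded after it.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    kind: str   # hypothetical event kinds: "NameChanged", "EmailChanged"
    value: str


@dataclass(frozen=True)
class Customer:
    name: str = ""
    email: str = ""
    version: int = 0  # number of events folded into this state


def apply_event(state: Customer, event: Event) -> Customer:
    """Return the new state produced by one event; the old state is untouched."""
    if event.kind == "NameChanged":
        return replace(state, name=event.value, version=state.version + 1)
    if event.kind == "EmailChanged":
        return replace(state, email=event.value, version=state.version + 1)
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events: Iterable[Event], snapshot: Customer = Customer()) -> Customer:
    """Fold events onto a starting state (an empty customer by default)."""
    state = snapshot
    for event in events:
        state = apply_event(state, event)
    return state


# An append-only journal; a plain list stands in for the event store here.
journal: List[Event] = [
    Event("NameChanged", "Ada"),
    Event("EmailChanged", "ada@example.com"),
    Event("EmailChanged", "ada@example.org"),
]

full = replay(journal)                      # replay everything
snapshot = replay(journal[:2])              # state persisted after two events
from_snapshot = replay(journal[snapshot.version:], snapshot)  # replay the tail only
```

Because the snapshot records how many events it covers, a reader can resume from that offset and reach the same state as a full replay.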
In this presentation, Akka Team Lead and author Roland Kuhn presents the freshly released final specification for Reactive Streams on the JVM. This work was done in collaboration with engineers representing Netflix, Red Hat, Pivotal, Oracle, Typesafe and others to define a standard for passing streams of data between threads in an asynchronous and non-blocking fashion. This is a common need in Reactive systems, which handle streams of "live" data whose volume is not predetermined.
The most prominent issue facing the industry today is that resource consumption needs to be controlled such that a fast data source does not overwhelm the stream destination. Asynchrony is needed in order to enable the parallel use of computing resources, on collaborating network hosts or multiple CPU cores within a single machine.
Here we'll review the mechanisms employed by Reactive Streams, discuss the applicability of this technology to a variety of problems encountered in day to day work on the JVM, and give an overview of the tooling ecosystem that is emerging around this young standard.
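The idea of keeping a fast source from overwhelming its destination can be sketched with demand signalling, where the receiver asks for a number of items and the source sends no more than that. The Python below is only an analogy, not the JVM API: the method names mirror the `request(n)`, `onNext` and `onComplete` signals of the Reactive Streams interfaces, while the class names are hypothetical and everything runs synchronously on one thread for simplicity.

```python
from collections import deque


class CollectingSubscriber:
    """Hypothetical subscriber that records what it is sent."""

    def __init__(self):
        self.received = []
        self.completed = False

    def on_next(self, item):
        self.received.append(item)

    def on_complete(self):
        self.completed = True


class Subscription:
    """Emits items only in response to explicit demand from the subscriber."""

    def __init__(self, source, subscriber):
        self._pending = deque(source)
        self._subscriber = subscriber
        self._done = False

    def request(self, n):
        if n <= 0:
            raise ValueError("demand must be positive")
        for _ in range(n):
            if self._done:
                return
            if not self._pending:
                # Source exhausted: signal completion exactly once.
                self._done = True
                self._subscriber.on_complete()
                return
            self._subscriber.on_next(self._pending.popleft())


subscriber = CollectingSubscriber()
subscription = Subscription(range(5), subscriber)

subscription.request(2)                      # subscriber is ready for two items
after_first_request = list(subscriber.received)
subscription.request(2)                      # two more
subscription.request(2)                      # one item left, then completion
```

The source never pushes more than the outstanding demand, so the pace of delivery is set by the receiver rather than the sender.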
Azure + DataStax Enterprise Powers Office 365 Per User Store | DataStax Academy
We will present our O365 use case scenarios, explain why we chose Cassandra + Spark, and walk through the architecture we chose for running DataStax Enterprise on Azure.
Cassandra as event sourced journal for big data analytics | Anirvan Chakraborty
Avoiding destructive updates and keeping a history of data using event sourcing approaches has large advantages for data analytics. This talk describes how Cassandra can be used as an event journal in a CQRS/Lambda Architecture built on event sourcing, and how that journal can be further used for data mining and machine learning in a big data pipeline.
All the principles are demonstrated on an application called Muvr that we built. It uses data from wearable devices, such as an accelerometer in a watch or a heartbeat monitor, to classify a user's exercises in near real time. It uses mobile devices and the clustered Akka actor framework to distribute computation, and then stores events as immutable facts in a journal backed by Cassandra. The data are then read by Apache Spark and used for more expensive analytics and machine learning tasks, such as suggesting improvements to a user's exercise routine or improving machine learning models for better real-time exercise classification that can be used immediately. The talk mentions some of the internals of Spark when working with Cassandra and focuses on its machine learning capabilities enabled by Cassandra. Much of the analytics is done for each user individually, so the whole pipeline must handle a potentially large number of concurrent users and a lot of raw data, and we need to ensure attributes such as responsiveness, elasticity and resilience.
Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark | Todd Fritz
In this session, we will discuss:
* reactive architecture tenets
* distributed “fast data” streams
* application and analytics focused Data Lake
Enterprise level concerns and the importance of holistic governance, operational management, and a Metadata Lake will be conceptually investigated. The next level of detail will be to explore what a prospective architecture looks like at scale with Terabytes of ingestion per day, how scale puts pressure on an architecture, and how to be successful without losing data in a mission critical system via resilient, self-healing, scalable technologies. DevOps and application architecture concerns will be first-class themes throughout.
Reactive principles and technology will be the second act of this talk. Kafka. Akka. Spark. Various streaming technologies (Kafka Streams, Akka Streams, Spark Streaming) will be reviewed to identify what they are best suited for. The fast data pipeline discussion will center around Kafka, Akka, and Apache Flink (Lightbend Fast Data platform). We’ll also walk through an exciting addition to the Akka family, Alpakka, which is a Camel equivalent for Enterprise Integration Patterns.
The final act will be to dive into the Data Lake, from both an analytics and application development perspective. Technologies used to explain concepts will include Amazon and Hadoop. A Data Lake may service multiple analytics consumers with various “views” (and access levels) of data. It may also be a participant of various applications, perhaps by acting as a centralized source for reference data or common middleware (in turn feeding the analytics aspect). The concept of the Metadata Lake to apply structure, meaning and purpose will be an over-arching success factor for a Data Lake. The difference between the Data Lake and Metadata Lake is conceptually similar to a Halocline… Various technologies (Iglu/Snowplow and more) will be discussed from a feature standpoint to flesh out the technology capabilities needed for Data Lake governance.
Speaker: Neil Avery, Technologist, Office of the CTO, Confluent
Stream processing is now at the forefront of many company strategies. Over the last couple of years we have seen streaming use cases explode and now proliferate across the landscape of any modern business.
Use cases including digital transformation, IoT, real-time risk, payments microservices and machine learning are all built on the fundamental premise that they need fast data and they need it at scale.
Apache Kafka® has long been the streaming platform of choice; its origins as dumb pipes for big data have long since been left behind, and it is now the go-to streaming platform.
Stream processing beckons as being the vehicle for driving those streams, and along with it brings a world of real-time semantics surrounding windowing, joining, correctness, elasticity, and accessibility. The ‘current state of stream processing’ walks through the origins of stream processing, applicable use cases and then dives into the challenges currently facing the world of stream processing as it drives the next data revolution.
Neil is a Technologist in the Office of the CTO at Confluent, the company founded by the creators of Apache Kafka. He has over 20 years of experience working on distributed computing, messaging and stream processing. He has built or redesigned commercial messaging platforms and distributed caching products, and has developed large scale bespoke systems for tier-1 banks. After a period at ThoughtWorks, he went on to build some of the first distributed risk engines in financial services. In 2008 he launched a startup that specialised in distributed data analytics and visualization. Prior to joining Confluent he was the CTO at a fintech consultancy.
Watch the recording: https://videos.confluent.io/watch/rmU6GHrd4EKFaZrRhdTE3s?.
Always On: Building Highly Available Applications on Cassandra | Robbie Strickland
Cassandra was built from the ground up to enable linearly scalable, always-on applications. But the path to high availability has many land mines that can mean failure for the inexperienced user. In this talk, I will offer practical advice on how to achieve 100% uptime on millions of transactions per second. I'll address all aspects of the topic, including deployment, configuration, application design, and operations.
Using Riak for Events storage and analysis at Booking.com | Damien Krotkine
At Booking.com, we have a constant flow of events coming from various applications and internal subsystems. This critical data needs to be stored for real-time, medium and long term analysis. Events are schema-less, making it difficult to use standard analysis tools. This presentation will explain how we built a storage and analysis solution based on Riak. The talk will cover: data aggregation and serialization, Riak configuration, solutions for lowering the network usage, and finally, how Riak's advanced features are used to perform real-time data crunching on the cluster nodes.
Webinar: Diagnosing Apache Cassandra Problems in Production | DataStax Academy
This session covers diagnosing and solving common problems encountered in production, using performance profiling tools. We’ll also give a crash course in basic JVM garbage collection tuning. Viewers will leave with a better understanding of what they should look for when they encounter problems with their in-production Cassandra cluster.
Proofpoint: Fraud Detection and Security on Social Media | DataStax Academy
Social media has become the new frontier for cyber-attackers. The explosive growth of this new communications platform, combined with the potential to reach millions of people through a single post, has provided a low barrier for exploitation. In this talk, we will focus on how Cassandra is used to enable our fight against bad actors on social media. In particular, we will discuss how we use Cassandra for anomaly detection, social mob alerting, trending topics, and fraudulent classification. We will also speak about our Cassandra data models, integration with Spark Streaming, and how we use KairosDB for our time series data. Watch us don our superhero-Cassandra capes as we fight against the bad guys!
The state of analytics has changed dramatically over the last few years. Hadoop is now commonplace, and the ecosystem has evolved to include new tools such as Spark, Shark, and Drill, that live alongside the old MapReduce-based standards. It can be difficult to keep up with the pace of change, and newcomers are left with a dizzying variety of seemingly similar choices. This is compounded by the number of possible deployment permutations, which can cause all but the most determined to simply stick with the tried and true. In this talk I will introduce you to a powerhouse combination of Cassandra and Spark, which provides a high-speed platform for both real-time and batch analysis.
In this talk Josep draws on his experience of building a data platform based on Cassandra and Spark to service the UK's foremost player in the connected homes market. The work covered bringing streams of data online, productionising data science algorithms on Spark, and delivering outputs via APIs or Kafka messages.
Josep will explore the ups and the downs of bringing all this together and share what he's learned from 12 months of Cassandra and Spark development and operations.
Using the SDACK Architecture to Build a Big Data Product | Evans Ye
You have probably heard about the SMACK architecture, which stands for Spark, Mesos, Akka, Cassandra, and Kafka. It’s especially suitable for building a lambda architecture system. But what is SDACK? It’s very similar to SMACK, except that the “D” stands for Docker. While SMACK is an enterprise-scale solution with multi-tenant support, the SDACK architecture is particularly suitable for building a data product. In this talk, I’ll cover the advantages of the SDACK architecture and how TrendMicro uses it to build an anomaly detection data product. The talk will cover:
1) The architecture we designed based on SDACK to support both batch and streaming workload.
2) The data pipeline built based on Akka Stream which is flexible, scalable, and able to do self-healing.
3) The Cassandra data model designed to support time series data writes and reads.
This presentation recounts the story of Macys.com and Bloomingdales.com's migration from legacy RDBMS to NoSQL Cassandra in partnership with DataStax.
One thing that differentiates this talk from others on Cassandra is Macy's philosophy of "doing more with less." You will see why we emphasize the performance tuning aspects of iterative development when you see how much processing we can support on relatively small configurations.
This session will cover:
1) The process that led to our decision to use Cassandra
2) The approach we used for migrating from DB2 & Coherence to Cassandra without disrupting the production environment
3) The various schema options that we tried and how we settled on the current one. We'll show you a selection of some of our extensive performance tuning benchmarks, as well as how these performance results figured into our final schema designs.
4) Our lessons learned and next steps
Cassandra Day Atlanta 2015: Diagnosing Problems in Production | DataStax Academy
This session covers diagnosing and solving common problems encountered in production, using performance profiling tools. We’ll also give a crash course in basic JVM garbage collection tuning. Attendees will leave with a better understanding of what they should look for when they encounter problems with their in-production Cassandra cluster. This talk is intended for people with a general understanding of Cassandra, but experience running it in production is not required.
Building a system for machine and event-oriented data with Rocana | Treasure Data, Inc.
In this session, we’ll follow the flow of data through an end-to-end system built to handle tens of terabytes an hour of event-oriented data, providing real-time streaming, in-memory, SQL, and batch access to this data. We’ll go into detail on how open source systems such as Hadoop, Kafka, Solr, and Impala/Hive can be stitched together to form the base platform; describe how and where to perform data transformation and aggregation; provide a simple and pragmatic way of managing event metadata; and talk about how applications built on top of this platform get access to data and extend its functionality. Finally, a brief demo of Rocana Ops, an application for large scale data center operations, will be given, along with an explanation about how it uses the underlying platform.
Building an Event-oriented Data Platform with Kafka, Eric Sammer | confluent
While we frequently talk about how to build interesting products on top of machine and event data, the reality is that collecting, organizing, providing access to, and managing this data is where most people get stuck. Many organizations understand the use cases around their data – fraud detection, quality of service and technical operations, user behavior analysis, for example – but are not necessarily data infrastructure experts. In this session, we’ll follow the flow of data through an end to end system built to handle tens of terabytes an hour of event-oriented data, providing real time streaming, in-memory, SQL, and batch access to this data. We’ll go into detail on how open source systems such as Hadoop, Kafka, Solr, and Impala/Hive are actually stitched together; describe how and where to perform data transformation and aggregation; provide a simple and pragmatic way of managing event metadata; and talk about how applications built on top of this platform get access to data and extend its functionality.
Attendees will leave this session knowing not just which open source projects go into a system such as this, but how they work together, what tradeoffs and decisions need to be addressed, and how to present a single general purpose data platform to multiple applications. This session should be attended by data infrastructure engineers and architects planning, building, or maintaining similar systems.
Cassandra as event sourced journal for big data analyticsAnirvan Chakraborty
Avoiding destructive updates and keeping history of data using event sourcing approaches has large advantages for data analytics. This talk describes how Cassandra can be used as event journal as part of CQRS/Lambda Architecture using event sourcing and further used for data mining and machine learning purposes in a big data pipeline.
All the principles are demonstrated on an application called Muvr that we built. It uses data from wearable devices such as accelerometer in a watch or heartbeat monitor to classify user's exercises in near real time. It uses mobile devices and clustered Akka actor framework to distribute computation and then stores events as immutable facts in journal backed by Cassandra. The data are then read by Apache Spark and used for more expensive analytics and machine learning tasks such as suggests improvements to user's exercise routine or improves machine learning models for better real time exercise classification that can be used immediately. The talk mentions some of the internals of Spark when working with Cassandra and focuses on its machine learning capabilities enabled by Cassandra. A lot of the analytics are done for each user individually so the whole pipeline must handle potentially large amount of concurrent users and a lot of raw data so we need to ensure attributes such as responsiveness, elasticity and resilience.
Building Reactive Fast Data & the Data Lake with Akka, Kafka, SparkTodd Fritz
In this session, we will discuss:
* reactive architecture tenets
* distributed “fast data” streams
* application and analytics focused Data Lake
Enterprise level concerns and the importance of holistic governance, operational management, and a Metadata Lake will be conceptually investigated. The next level of detail will be to explore what a prospective architecture looks like at scale with Terabytes of ingestion per day, how scale puts pressure on an architecture, and how to be successful without losing data in a mission critical system via resilient, self-healing, scalable technologies. DevOps and application architecture concerns will be first-class themes throughout.
Reactive principles and technology will be the second act of this talk. Kafka. Akka. Spark. Various streaming technologies (Kafka Streams, Akka Streams, Spark Streaming) will be reviewed to identify what they are best suited for. The fast data pipeline discussion will center around Kafka, Akka, and Apache Flink (Lightbend Fast Data platform). We’ll also walk through an exciting addition to the Akka family, Alpakka, which is a Camel equivalent for Enterprise Integration Patterns.
The final act will be to dive into the Data Lake, from both an analytics and application development perspective. Technologies used to explain concepts will include Amazon and Hadoop. A Data Lake may service multiple analytics consumers with various “views” (and access levels) of data. It may also be a participant of various applications, perhaps by acting as a centralized source for reference data or common middleware (in turn feeding the analytics aspect). The concept of the Metadata Lake to apply structure, meaning and purpose will be an over-arching success factor for a Data Lake. The difference between the Data Lake and Metadata Lake is conceptually similar to a Halocline… Various technologies (Iglu/Snowplow and more) will be discussed from a feature standpoint to flesh out the technology capabilities needed for Data Lake governance.
Speaker: Neil Avery, Technologist, Office of the CTO, Confluent
Stream processing is now at the forefront of many company strategies. Over the last couple of years we have seen streaming use cases explode and now proliferate the landscape of any modern business.
Use cases including digital transformation, IoT, real-time risk, payments microservices and machine learning are all built on the fundamental that they need fast data and they need it at scale.
Apache Kafka® has long been the streaming platform of choice, its origins of being dumb pipes for big data have long since been left behind and now it is the goto-streaming platform of choice.
Stream processing beckons as being the vehicle for driving those streams, and along with it brings a world of real-time semantics surrounding windowing, joining, correctness, elasticity, and accessibility. The ‘current state of stream processing’ walks through the origins of stream processing, applicable use cases and then dives into the challenges currently facing the world of stream processing as it drives the next data revolution.
Neil is a Technologist in the Office of the CTO at Confluent, the company founded by the creators of Apache Kafka. He has over 20 years of expertise of working on distributed computing, messaging and stream processing. He has built or redesigned commercial messaging platforms, distributed caching products as well as developed large scale bespoke systems for tier-1 banks. After a period at ThoughtWorks, he went on to build some of the first distributed risk engines in financial services. In 2008 he launched a startup that specialised in distributed data analytics and visualization. Prior to joining Confluent he was the CTO at a fintech consultancy.
Watch the recording: https://videos.confluent.io/watch/rmU6GHrd4EKFaZrRhdTE3s?.
Always On: Building Highly Available Applications on CassandraRobbie Strickland
Cassandra was built from the ground up to enable linearly scalable, always-on applications. But the path to high availability has many land mines that can mean failure for the inexperienced user. In this talk, I will offer practical advice on how to achieve 100% uptime on millions of transactions per second. I'll address all aspects of the topic, including deployment, configuration, application design, and operations.
Using Riak for Events storage and analysis at Booking.comDamien Krotkine
At Booking.com, we have a constant flow of events coming from various applications and internal subsystems. This critical data needs to be stored for real-time, medium and long term analysis. Events are schema-less, making it difficult to use standard analysis tools.This presentation will explain how we built a storage and analysis solution based on Riak. The talk will cover: data aggregation and serialization, Riak configuration, solutions for lowering the network usage, and finally, how Riak's advanced features are used to perform real-time data crunching on the cluster nodes.
Webinar: Diagnosing Apache Cassandra Problems in ProductionDataStax Academy
This session covers diagnosing and solving common problems encountered in production, using performance profiling tools. We’ll also give a crash course to basic JVM garbage collection tuning. Viewers will leave with a better understanding of what they should look for when they encounter problems with their in-production Cassandra cluster.
Proofpoint: Fraud Detection and Security on Social MediaDataStax Academy
Social media has become the new frontier for cyber-attackers. The explosive growth of this new communications platform, combined with the potential to reach millions of people through a single post, has provided a low barrier for exploitation. In this talk, we will focus on how Cassandra is used to enable our fight against bad actors on social media. In particular, we will discuss how we use Cassandra for anomaly detection, social mob alerting, trending topics, and fraudulent classification. We will also speak about our Cassandra data models, integration with Spark Streaming, and how we use KairosDB for our time series data. Watch us don our superhero-Cassandra capes as we fight against the bad guys!
The state of analytics has changed dramatically over the last few years. Hadoop is now commonplace, and the ecosystem has evolved to include new tools such as Spark, Shark, and Drill, that live alongside the old MapReduce-based standards. It can be difficult to keep up with the pace of change, and newcomers are left with a dizzying variety of seemingly similar choices. This is compounded by the number of possible deployment permutations, which can cause all but the most determined to simply stick with the tried and true. In this talk I will introduce you to a powerhouse combination of Cassandra and Spark, which provides a high-speed platform for both real-time and batch analysis.
In this talk Josep draws on his experience of building a data platform based on Cassandra and Spark to service the UK's foremost player in the connected homes market. Bringing streams of data online; productionising data science algorithms on spark; and delivering outputs via API's or Kafka messages.
Josep will explore the ups and the downs of bringing all this together and share what he's learned from 12 months of Cassandra and Spark development and operations.
Using the SDACK Architecture to Build a Big Data ProductEvans Ye
You definitely have heard about the SMACK architecture, which stands for Spark, Mesos, Akka, Cassandra, and Kafka. It’s especially suitable for building a lambda architecture system. But what is SDACK? Apparently it’s very much similar to SMACK except the “D" stands for Docker. While SMACK is an enterprise scale, multi-tanent supported solution, the SDACK architecture is particularly suitable for building a data product. In this talk, I’ll talk about the advantages of the SDACK architecture, and how TrendMicro uses the SDACK architecture to build an anomaly detection data product. The talk will cover:
1) The architecture we designed based on SDACK to support both batch and streaming workload.
2) The data pipeline built based on Akka Stream which is flexible, scalable, and able to do self-healing.
3) The Cassandra data model designed to support time series data writes and reads.
This presentation recounts the story of Macys.com and Bloomingdales.com's migration from legacy RDBMS to NoSQL Cassandra in partnership with DataStax.
One thing that differentiates this talk from others on Cassandra is Macy's philosophy of "doing more with less." You will see why we emphasize the performance tuning aspects of iterative development when you see how much processing we can support on relatively small configurations.
This session will cover:
1) The process that led to our decision to use Cassandra
2) The approach we used for migrating from DB2 & Coherence to Cassandra without disrupting the production environment
3) The various schema options that we tried and how we settled on the current one. We'll show you a selection of some of our extensive performance tuning benchmarks, as well as how these performance results figured into our final schema designs.
4) Our lessons learned and next steps
Cassandra Day Atlanta 2015: Diagnosing Problems in Production - DataStax Academy
This session covers diagnosing and solving common problems encountered in production, using performance profiling tools. We’ll also give a crash course in basic JVM garbage collection tuning. Attendees will leave with a better understanding of what they should look for when they encounter problems with their in-production Cassandra cluster. This talk is intended for people with a general understanding of Cassandra, but experience running it in production is not required.
Building a system for machine and event-oriented data with Rocana - Treasure Data, Inc.
In this session, we’ll follow the flow of data through an end-to-end system built to handle tens of terabytes an hour of event-oriented data, providing real-time streaming, in-memory, SQL, and batch access to this data. We’ll go into detail on how open source systems such as Hadoop, Kafka, Solr, and Impala/Hive can be stitched together to form the base platform; describe how and where to perform data transformation and aggregation; provide a simple and pragmatic way of managing event metadata; and talk about how applications built on top of this platform get access to data and extend its functionality. Finally, a brief demo of Rocana Ops, an application for large scale data center operations, will be given, along with an explanation about how it uses the underlying platform.
Building an Event-oriented Data Platform with Kafka, Eric Sammer - confluent
While we frequently talk about how to build interesting products on top of machine and event data, the reality is that collecting, organizing, providing access to, and managing this data is where most people get stuck. Many organizations understand the use cases around their data – fraud detection, quality of service and technical operations, user behavior analysis, for example – but are not necessarily data infrastructure experts. In this session, we’ll follow the flow of data through an end to end system built to handle tens of terabytes an hour of event-oriented data, providing real time streaming, in-memory, SQL, and batch access to this data. We’ll go into detail on how open source systems such as Hadoop, Kafka, Solr, and Impala/Hive are actually stitched together; describe how and where to perform data transformation and aggregation; provide a simple and pragmatic way of managing event metadata; and talk about how applications built on top of this platform get access to data and extend its functionality.
Attendees will leave this session knowing not just which open source projects go into a system such as this, but how they work together, what tradeoffs and decisions need to be addressed, and how to present a single general purpose data platform to multiple applications. This session should be attended by data infrastructure engineers and architects planning, building, or maintaining similar systems.
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented... - Data Con LA
While we frequently talk about how to build interesting products on top of machine and event data, the reality is that collecting, organizing, providing access to, and managing this data is where most people get stuck. In this session, we’ll follow the flow of data through an end to end system built to handle tens of terabytes per day of event-oriented data, providing real time streaming, in-memory, SQL, and batch access to this data. We’ll go into detail on how open source systems such as Hadoop, Kafka, Solr, and Impala/Hive are actually stitched together; describe how and where to perform data transformation and aggregation; provide a simple and pragmatic way of managing event metadata; and talk about how applications built on top of this platform get access to data and extend its functionality. This session is especially recommended for data infrastructure engineers and architects planning, building, or maintaining similar systems.
OSDC 2018 | From Monolith to Microservices by Paul Puschmann - NETWAYS
Scaling up from two developer teams supporting a monolith to more than 20 developer teams powering a micro-service landscape is not only a matter of technical excellence but also a matter of culture and collaboration. This talk will show the positive aspects of our evolution as well as the things we learned to improve on.
Volta: Logging, Metrics, and Monitoring as a Service - LN Renganarayana
Our Logging, Metrics and Monitoring as a Service, Volta, is aimed at providing a scalable logging and metrics service for applications and services across the stack: starting from low level networks and core openstack services to platform services to Symantec products. Volta integrates with Keystone to provide secure authentication and multi-tenancy which is used to limit the visibility of logs/metrics to specific users/tenants or to specific services (e.g., only nova or only swift). Volta also provides features for setting up Alerts on log and metric events.
In this session, we will share with you how we have built Volta using battle tested open source / OpenStack components such as Keystone, Kafka, Storm, ElasticSearch, InfluxDB, Logstash, Kibana, and Grafana. We will also present our Keystone based authentication and multi-tenancy model and its implementation for limiting the visibility of logs and metrics for queries and alerts.
The Art of The Event Streaming Application: Streams, Stream Processors and Sc... - confluent
Have you ever imagined what it would be like to build a massively scalable streaming application on Kafka, the challenges, the patterns and the thought process involved? How much of the application can be reused? What patterns will you discover? How does it all fit together? Depending upon your use case and business, this can mean many things. Starting out with a data pipeline is one thing, but evolving into a company-wide real-time application that is business critical and entirely dependent upon a streaming platform is a giant leap. Large-scale streaming applications are also called event streaming applications. They are classically different from other data systems; event streaming applications are viewed as a series of interconnected streams that are topologically defined using stream processors; they hold state that models your use case as events. Almost like a deconstructed real-time database.
In this talk, I step through the origins of event streaming systems, understanding how they are developed from raw events to evolve into something that can be adopted at an organizational scale. I start with event-first thinking, Domain Driven Design to build data models that work with the fundamentals of Streams, Kafka Streams, KSQL and Serverless (FaaS).
Building upon this, I explain how to build common business functionality by stepping through the patterns for:
* Scalable payment processing
* Run it on rails: Instrumentation and monitoring
* Control flow patterns
Finally, all of these concepts are combined in a solution architecture that can be used at an enterprise scale. I will introduce enterprise patterns such as events-as-a-backbone, events as APIs and methods for governance and self-service. You will leave the talk with an understanding of how to model events with event-first thinking, how to work towards reusable streaming patterns and, most importantly, how it all fits together at scale.
Kafka summit London 2019 - the art of the event-streaming app - Neil Avery
Have you ever imagined what it would be like to build a massively scalable streaming application on Kafka, the challenges, the patterns and the thought process involved? How much of the application can be reused? What patterns will you discover? How does it all fit together? Depending upon your use case and business, this can mean many things. Starting out with a data pipeline is one thing, but evolving into a company-wide real-time application that is business critical and entirely dependent upon a streaming platform is a giant leap. Large-scale streaming applications are also called event streaming applications. They are classically different from other data systems; event streaming applications are viewed as a series of interconnected streams that are topologically defined using stream processors; they hold state that models your use case as events. Almost like a deconstructed real-time database.
In this talk, I step through the origins of event streaming systems, understanding how they are developed from raw events to evolve into something that can be adopted at an organizational scale. I start with event-first thinking, Domain Driven Design to build data models that work with the fundamentals of Streams, Kafka Streams, KSQL and Serverless (FaaS).
Building upon this, I explain how to build common business functionality by stepping through the patterns for:
* Scalable payment processing
* Run it on rails: Instrumentation and monitoring
* Control flow patterns
Finally, all of these concepts are combined in a solution architecture that can be used at an enterprise scale. I will introduce enterprise patterns such as events-as-a-backbone, events as APIs and methods for governance and self-service. You will leave the talk with an understanding of how to model events with event-first thinking, how to work towards reusable streaming patterns and, most importantly, how it all fits together at scale.
Kafka summit SF 2019 - the art of the event-streaming app - Neil Avery
Have you ever imagined what it would be like to build a massively scalable streaming application on Kafka, the challenges, the patterns and the thought process involved? How much of the application can be reused? What patterns will you discover? How does it all fit together? Depending upon your use case and business, this can mean many things. Starting out with a data pipeline is one thing, but evolving into a company-wide real-time application that is business critical and entirely dependent upon a streaming platform is a giant leap. Large-scale streaming applications are also called event streaming applications. They are classically different from other data systems; event streaming applications are viewed as a series of interconnected streams that are topologically defined using stream processors; they hold state that models your use case as events. Almost like a deconstructed realtime database.
In this talk, I step through the origins of event streaming systems, understanding how they are developed from raw events to evolve into something that can be adopted at an organizational scale. I start with event-first thinking, Domain Driven Design to build data models that work with the fundamentals of Streams, Kafka Streams, KSQL and Serverless (FaaS). Building upon this, I explain how to build common business functionality by stepping through patterns for:
* Scalable payment processing
* Run it on rails: Instrumentation and monitoring
* Control flow patterns (start, stop, pause)
Finally, all of these concepts are combined in a solution architecture that can be used at enterprise scale. I will introduce enterprise patterns such as events-as-a-backbone, events as APIs and methods for governance and self-service. You will leave the talk with an understanding of how to model events with event-first thinking, how to work towards reusable streaming patterns and, most importantly, how it all fits together at scale.
The art of the event streaming application: streams, stream processors and sc... - confluent
Have you ever imagined what it would be like to build a massively scalable streaming application on Kafka, the challenges, the patterns and the thought process involved? How much of the application can be reused? What patterns will you discover? How does it all fit together? Depending upon your use case and business, this can mean many things. Starting out with a data pipeline is one thing, but evolving into a company-wide real-time application that is business critical and entirely dependent upon a streaming platform is a giant leap. Large-scale streaming applications are also called event streaming applications. They are classically different from other data systems; event streaming applications are viewed as a series of interconnected streams that are topologically defined using stream processors; they hold state that models your use case as events. Almost like a deconstructed realtime database.
In this talk I step through the origins of event streaming systems, understanding how they are developed from raw events to evolve into something that can be adopted at an organizational scale. I start with event-first thinking, Domain Driven Design to build data models that work with the fundamentals of Streams, Kafka Streams, KSQL and Serverless (FaaS). Building upon this, I explain how to build common business functionality by stepping through patterns for:
* Scalable payment processing
* Run it on rails: Instrumentation and monitoring
* Control flow patterns (start, stop, pause)
Finally, all of these concepts are combined in a solution architecture that can be used at enterprise scale. I will introduce enterprise patterns such as events-as-a-backbone, events as APIs and methods for governance and self-service. You will leave the talk with an understanding of how to model events with event-first thinking, how to work towards reusable streaming patterns and, most importantly, how it all fits together at scale.
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale - Sriram Krishnan
The Data Platform at Twitter supports engineers and data scientists running batch jobs on Hadoop clusters of several thousand nodes, and real-time jobs on top of systems such as Storm. In this presentation, I discuss the overall Data Platform stack at Twitter. In particular, I talk about enabling real-time and batch analytics at scale with the help of Scalding, a Scala DSL for batch jobs using MapReduce; Summingbird, a framework for combined real-time and batch processing; and Tsar, a framework for real-time time-series aggregations.
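One way to picture combined batch and real-time processing is to run the same aggregation over both data sets and merge the partial results with an associative operation. The sketch below is plain Python, not the Summingbird or Scalding API; the event shape and the `count_hashtags` helper are assumptions made for illustration.

```python
from collections import Counter


def count_hashtags(events):
    """Aggregate hashtag counts; the same logic serves batch and real-time input."""
    return Counter(tag for event in events for tag in event["hashtags"])


# Hypothetical inputs: older events processed in batch, recent ones in real time.
batch_events = [
    {"hashtags": ["hadoop", "storm"]},
    {"hashtags": ["hadoop"]},
]
realtime_events = [
    {"hashtags": ["storm"]},
    {"hashtags": ["scalding", "hadoop"]},
]

batch_view = count_hashtags(batch_events)
realtime_view = count_hashtags(realtime_events)

# Counter addition is associative and commutative, so partial results can be
# merged in any order and still match a single pass over all the events.
merged_view = batch_view + realtime_view
```

Because the merge is associative, the result does not depend on where the boundary between batch and real-time data falls.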
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r... - Amazon Web Services
Minimizing customer impact is a key feature in successfully rolling out frequent code updates. Learn how to leverage the AWS cloud so you can minimize bug impacts, test your services in isolation with canary data, and easily roll back changes. Learn to love deployments, not fear them, with a blue/green architecture model. This talk walks you through the reasons it works for us and how we set up our AWS infrastructure, including package repositories, Elastic Load Balancing load balancers, Auto Scaling groups, internal tools, and more to help orchestrate the process. Learn to view thousands of servers as resources at your command to help improve your engineering environment, take bigger risks, and not spend weekends firefighting bad deployments.
Complex event processing platform handling millions of users - Krzysztof Zarz... - GetInData
If you want to learn more about it, check out our webinar here: https://www.youtube.com/watch?v=EfGPY_NyYQ8&t=77s
The webinar was organized by GetInData in 2020. During the webinar, we shared the lessons we learned from building and running a stream processing platform in production for over 2 years.
Watch more here: https://www.youtube.com/watch?v=EfGPY_NyYQ8
Author: Krzysztof Zarzycki
Linkedin: https://www.linkedin.com/in/kzarzycki/
___
GetInData is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of the best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets.
Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries.
https://getindata.com
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time... - Zhenzhong Xu
Netflix is obsessed with customer joy: we relentlessly focus on product experience and high-quality content. In recent years, we have been investing heavily in the tech-driven studio and content production. As a result, many unique challenges have arisen in the real-time data infrastructure space. For example, in a microservices architecture, domain entities are spread across different applications and persistent stores, which makes low-latency, consistent operational reporting and entity search especially challenging.
In this talk, we’ll cover some interesting use cases, the various challenges rooted in the fundamentals of distributed systems, and how we solved them. We will also discuss what we learned, things we could’ve done differently, and the new vision of an open, self-serve Data Mesh platform that empowers our partners and users to build flexible real-time data pipelines.
How bol.com makes sense of its logs, using the Elastic technology stack. - Renzo Tomà
Presentation given by Renzo Tomà as "Tech and Use Case Deep Dive", during the Elastic{ON}Tour 2015 event in Amsterdam on October 29th.
Explanation of how bol.com is using the Elastic ELK stack to power a logsearch platform. Lots of details on the types of sources and number of feeds. Some history and reasoning why the current set of in-process JSON based logshippers are used. Links to the bol.com github account for the logshipper projects. The presentation ends with two special sauces: fun things you can do with lots of data in Elasticsearch. The 1st sauce is 'the call stack' - tagging each request with a unique ID, passing that ID along to all service calls and making sure this ID ends up in all access logging, enables you to group all calls together and get a call stack. The 2nd sauce is a way of generating a service map using access logging and some logstash magic.
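The "call stack" idea can be sketched in a few lines: if every access-log entry carries the request ID, grouping by that ID and ordering by time reconstructs the chain of service calls. The field names (`request_id`, `service`, `timestamp`) and the sample entries below are assumptions for illustration, not bol.com's actual log schema.

```python
from collections import defaultdict


def group_by_request_id(log_entries):
    """Group access-log entries by request ID, ordered by time within each group."""
    calls = defaultdict(list)
    for entry in log_entries:
        calls[entry["request_id"]].append(entry)
    for entries in calls.values():
        entries.sort(key=lambda e: e["timestamp"])
    return dict(calls)


# Hypothetical access-log entries collected from several services.
access_logs = [
    {"request_id": "r-1", "service": "basket", "timestamp": 2},
    {"request_id": "r-2", "service": "webshop", "timestamp": 1},
    {"request_id": "r-1", "service": "webshop", "timestamp": 1},
    {"request_id": "r-1", "service": "pricing", "timestamp": 3},
]

stacks = group_by_request_id(access_logs)
call_path = [entry["service"] for entry in stacks["r-1"]]
```

In a search backend the same grouping would typically be a query or aggregation on the ID field rather than an in-memory dictionary.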
I love questions and feedback. My mail address can be found in the presentation.
Independent of the source of data, the integration of event streams into an Enterprise Architecture is becoming more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analyzed, often with many consumers or systems interested in all or part of the events. Storing such huge event streams in HDFS or a NoSQL datastore is feasible and no longer much of a challenge. But if you want to be able to react fast, with minimal latency, you cannot afford to store the data first and do the analysis later. You have to be able to include part of your analytics right after you consume the data streams. Products for event processing, such as Oracle Event Processing or Esper, have been available for quite a long time and used to be called Complex Event Processing (CEP). In the past few years, another family of products has appeared, mostly out of the Big Data technology space, called Stream Processing or Streaming Analytics. These are mostly open source products and frameworks such as Apache Storm, Spark Streaming, Flink and Kafka Streams, as well as supporting infrastructure such as Apache Kafka. In this talk I will present the theoretical foundations of Stream Processing, discuss the core properties a Stream Processing platform should provide and highlight the differences you might find between the more traditional CEP and the more modern Stream Processing solutions.
Stream processing for the practitioner: Blueprints for common stream processi... - Aljoscha Krettek
Aljoscha Krettek offers an overview of the modern stream processing space, details the challenges posed by stateful and event-time-aware stream processing, and shares core archetypes (“application blueprints”) for stream processing drawn from real-world use cases with Apache Flink.
Topics include:
* Aggregating IoT event data, in which event-time-aware processing, handling of late data, and state are important
* Data enrichment, in which a stream of real-time events is “enriched” with data from a slowly changing database of supplemental data points
* Dynamic stream processing, in which a stream of control messages and dynamically updated user logic is used to process a stream of events for use cases such as alerting and fraud detection
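The first blueprint above, event-time aggregation with late data, can be sketched in plain Python. This is not the Apache Flink API: the `TumblingWindowCounter` class, the integer timestamps, and the simplistic watermark (the largest event time seen so far) are assumptions chosen to keep the example small.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per fixed-size event-time window, tolerating bounded lateness."""

    def __init__(self, window_size, allowed_lateness):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.watermark = 0  # simplistic watermark: max event time seen so far
        self.counts = defaultdict(int)  # window start -> event count (the state)
        self.dropped = 0

    def on_event(self, event_time):
        start = event_time - (event_time % self.window_size)
        window_end = start + self.window_size
        # An event is too late once the watermark has passed the end of its
        # window by at least the allowed lateness.
        if window_end + self.allowed_lateness <= self.watermark:
            self.dropped += 1
            return
        self.counts[start] += 1
        self.watermark = max(self.watermark, event_time)


counter = TumblingWindowCounter(window_size=10, allowed_lateness=5)
for event_time in [1, 4, 12, 3, 27, 8]:  # 3 arrives late but in time, 8 too late
    counter.on_event(event_time)
```

The event at time 3 arrives after the one at time 12 but is still counted in the window starting at 0, because the watermark has not yet passed that window's end plus the allowed lateness. The event at time 8 arrives after the watermark reaches 27 and is dropped.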
[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scale - DataScienceConferenc1
Rivian makes adventurous electric vehicles with a mission of a sustainable planet and keeping the world adventurous forever. Rivian's vehicles are born in the cloud and embody tenets of a software-defined vehicle, where not only the user-accessible features such as infotainment are software driven and updated, but also internal aspects such as vehicle dynamics. Real-time instrumentation and telemetry are the key underpinnings that make all this possible. Rivian has built a cutting-edge real-time stack using a combination of open-source technologies like Kafka, Flink and Druid and in-house services. This talk will go into how these are combined and leveraged to deliver real-time analytics.
Integrate Solr with real-time stream processing applications - thelabdude
Storm is a real-time distributed computation system used to process massive streams of data. Many organizations are turning to technologies like Storm to complement batch-oriented big data technologies, such as Hadoop, to deliver time-sensitive analytics at scale. This talk introduces an emerging architectural pattern of integrating Solr and Storm to process big data in real time. There are a number of natural integration points between Solr and Storm, such as populating a Solr index or supplying data to Storm using Solr’s real-time get support. In this session, Timothy will cover the basic concepts of Storm, such as spouts and bolts. He’ll then provide examples of how to integrate Solr into Storm to perform large-scale indexing in near real-time. In addition, we'll see how to embed Solr in a Storm bolt to match incoming tuples against pre-configured queries, commonly known as a percolator. Attendees will come away from this presentation with a good introduction to stream processing technologies and several real-world use cases of how to integrate Solr with Storm.
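The percolator idea inverts a normal search: the queries are stored up front and each incoming document is tested against them. The sketch below shows only that inversion in plain Python. It uses neither Solr nor Storm, and the `make_query` and `percolate` helpers and the exact-match query form are assumptions for illustration.

```python
def make_query(**required_fields):
    """Build a hypothetical query that matches documents with these exact field values."""
    def matches(document):
        return all(document.get(key) == value for key, value in required_fields.items())
    return matches


# Queries are registered ahead of time, before any document arrives.
registered_queries = {
    "errors-in-checkout": make_query(level="ERROR", service="checkout"),
    "any-error": make_query(level="ERROR"),
}


def percolate(document, queries):
    """Return the names of all registered queries that the document matches."""
    return sorted(name for name, matches in queries.items() if matches(document))


hits = percolate({"level": "ERROR", "service": "checkout"}, registered_queries)
misses = percolate({"level": "INFO", "service": "checkout"}, registered_queries)
```

In a streaming topology, each incoming tuple would play the role of `document`, and the matching query names could drive alerts or routing.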
Similar to Building a system for machine and event-oriented data - Velocity, Santa Clara 2015 (20)
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
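The first technique above, skipping vertices that have already converged, can be sketched as follows. This is a simplified illustration, not the STICD algorithm: it assumes a graph with no dangling nodes, and the `patience` parameter (how many consecutive stable iterations count as converged) is an assumption added to reduce premature skipping. Skipping is a heuristic, so the result is in general an approximation of the exact ranks.

```python
def pagerank_skip_converged(out_links, damping=0.85, tol=1e-10,
                            patience=2, max_iter=100):
    """PageRank that stops recomputing vertices whose rank has stabilised.

    Assumes every vertex has at least one out-link (no dangling nodes).
    """
    vertices = list(out_links)
    n = len(vertices)
    in_links = {v: [] for v in vertices}
    for u, targets in out_links.items():
        for v in targets:
            in_links[v].append(u)

    ranks = {v: 1.0 / n for v in vertices}
    stable = {v: 0 for v in vertices}  # consecutive stable iterations per vertex
    for _ in range(max_iter):
        new_ranks = dict(ranks)
        for v in vertices:
            if stable[v] >= patience:
                continue  # converged vertex: skip its computation
            incoming = sum(ranks[u] / len(out_links[u]) for u in in_links[v])
            new_ranks[v] = (1 - damping) / n + damping * incoming
            if abs(new_ranks[v] - ranks[v]) < tol:
                stable[v] += 1
            else:
                stable[v] = 0
        ranks = new_ranks
        if all(count >= patience for count in stable.values()):
            break
    return ranks


# A three-vertex cycle: by symmetry every vertex has rank 1/3.
cycle_ranks = pagerank_skip_converged({"a": ["b"], "b": ["c"], "c": ["a"]})
```

A vertex marked as converged is never revisited here, so later changes in its in-neighbours are ignored. That is the trade-off between per-iteration work and accuracy that the paragraph describes.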
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group (“MCG”) expects demand to rise and supply to keep evolving, facilitated by institutional investment rotating out of offices and into work from home (“WFH”), while the need for data storage keeps growing as global internet usage expands; experts predict 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas