Scala Days, Amsterdam, 2015: Lambda Architecture - Batch and Streaming with Spark, Cassandra, Kafka, Akka and Scala; Fault Tolerance, Data Pipelines, Data Flows, Data Locality, Akka Actors, Spark, Spark Cassandra Connector, Big Data, Asynchronous Data Flows, Time Series Data, KillrWeather, Scalable Infrastructure, Partition For Scale, Replicate For Resiliency, Parallelism, Isolation, Data Locality, Location Transparency
Delta Lake OSS: Create a reliable and performant Data Lake, by Quentin Ambard (Paris Data Engineers!)
Delta Lake is an open source framework living on top of Parquet in your data lake to provide reliability and performance. It was open-sourced by Databricks this year and is gaining traction to become the de facto data lake format.
We’ll see all the good Delta Lake can do for your data: ACID transactions, DDL operations, schema enforcement, batch and stream support, and more.
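To make those features concrete, here is a minimal hedged sketch in Scala (not from the talk; paths, app name, and data are placeholders, and it assumes the delta-core library is on the classpath):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("delta-sketch").getOrCreate()
import spark.implicits._

// Batch write: an ACID transaction over Parquet files plus a transaction log
Seq((1, "a"), (2, "b")).toDF("id", "value")
  .write.format("delta").mode("overwrite").save("/tmp/events")

// Schema enforcement: appending a mismatched schema fails loudly instead of corrupting the table
// Seq(("oops", 1)).toDF("other", "columns").write.format("delta").mode("append").save("/tmp/events")

// Stream support: the same table the batch job wrote can be read as a stream
val stream = spark.readStream.format("delta").load("/tmp/events")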
Azure Synapse Analytics is Azure SQL Data Warehouse evolved: a limitless analytics service that brings together enterprise data warehousing and Big Data analytics into a single service. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. This is a huge deck with lots of screenshots so you can see exactly how it works.
Hyperspace is a recently open-sourced (https://github.com/microsoft/hyperspace) indexing sub-system from Microsoft. The key idea behind Hyperspace is simple: Users specify the indexes they want to build. Hyperspace builds these indexes using Apache Spark, and maintains metadata in its write-ahead log that is stored in the data lake. At runtime, Hyperspace automatically selects the best index to use for a given query without requiring users to rewrite their queries. Since Hyperspace was introduced, one of the most popular asks from the Spark community was indexing support for Delta Lake. In this talk, we present our experiences in designing and implementing Hyperspace support for Delta Lake and how it can be used for accelerating queries over Delta tables. We will cover the necessary foundations behind Delta Lake’s transaction log design and how Hyperspace enables indexing support that seamlessly works with the former’s time travel queries.
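A hedged sketch of the user-facing flow described above, based on Hyperspace’s published Scala API as I recall it (index, column, and path names are made up; assumes an existing SparkSession named spark):
import com.microsoft.hyperspace._
import com.microsoft.hyperspace.index.IndexConfig

val df = spark.read.format("delta").load("/data/events")
val hs = new Hyperspace(spark)

// The user specifies the index; Hyperspace builds it with Spark and logs metadata in the lake
hs.createIndex(df, IndexConfig("eventTypeIdx", indexedColumns = Seq("eventType"), includedColumns = Seq("ts")))

// The optimizer then selects indexes automatically; queries are not rewritten by hand
spark.enableHyperspace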
Delta Lake delivers reliability, security and performance to data lakes. Join this session to learn how customers have achieved 48x faster data processing, leading to 50% faster time to insight after implementing Delta Lake. You’ll also learn how Delta Lake provides the perfect foundation for a cost-effective, highly scalable lakehouse architecture.
Data Quality With or Without Apache Spark and Its Ecosystem (Databricks)
A few solutions exist in the open-source community, either in the form of libraries or complete stand-alone platforms, which can be used to assure a certain level of data quality, especially when continuous imports happen. Organisations may consider picking up one of the available options: Apache Griffin, Deequ, DDQ and Great Expectations. In this presentation we’ll compare these different open-source products across different dimensions, like maturity, documentation, extensibility, and features like data profiling and anomaly detection.
Large Scale Lakehouse Implementation Using Structured Streaming (Databricks)
Business leads, executives, analysts, and data scientists rely on up-to-date information to make business decisions, adjust to the market, meet the needs of their customers, and run effective supply chain operations.
Come hear how Asurion used Delta, Structured Streaming, AutoLoader and SQL Analytics to improve production data latency from day-minus-one to near real time. Asurion’s technical team will share battle-tested tips and tricks you only get at a certain scale. Asurion’s data lake executes 4,000+ streaming jobs and hosts over 4,000 tables in its production data lake on AWS.
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats, such as Delta Lake, Hudi, and Iceberg, has sprung up. Along with the Hive Metastore, these table formats are trying to solve problems that have stood in traditional data lakes for a long time, with declared features like ACID, schema evolution, upsert, time travel, and incremental consumption.
A 30 day plan to start ending your data struggle with Snowflake (Snowflake Computing)
Organizations everywhere are struggling to load, integrate, analyze and collaborate with data. This is largely due to their antiquated data platforms, designed in a time when few people had the desire or need to interact with the database. Snowflake, the data warehouse built for the cloud, can help.
Master the Multi-Clustered Data Warehouse - Snowflake (Matillion)
Snowflake is one of the most powerful, efficient data warehouses on the market today—and we joined forces with the Snowflake team to show you how it works!
In this webinar:
- Learn how to optimize Snowflake
- Hear insider tips and tricks on how to improve performance
- Get expert insights from Craig Collier, Technical Architect from Snowflake, and Kalyan Arangam, Solution Architect from Matillion
- Find out how leading brands like Converse, Duo Security, and Pets at Home use Snowflake and Matillion ETL to make data-driven decisions
- Discover how Matillion ETL and Snowflake work together to modernize your data world
- Learn how to utilize the impressive scalability of Snowflake and Matillion
Splunk Data Onboarding Overview - Splunk Data Collection Architecture (Splunk)
Splunk's Naman Joshi and Jon Harris presented the Splunk Data Onboarding overview at SplunkLive! Sydney. This presentation covers:
1. Splunk Data Collection Architecture
2. Apps and Technology Add-ons
3. Demos / Examples
4. Best Practices
5. Resources and Q&A
More and more organizations are moving their ETL workloads to a Hadoop-based ELT grid architecture. Hadoop’s inherent capabilities, especially its ability to do late binding, address some of the key challenges with traditional ETL platforms. In this presentation, attendees will learn the key factors, considerations and lessons around ETL for Hadoop: pros and cons of different extract and load strategies, best ways to batch data, buffering and compression considerations, leveraging HCatalog, data transformation, integration with existing data transformations, advantages of different ways of exchanging data, and leveraging Hadoop as a data integration layer. This is an extremely popular presentation around ETL and Hadoop.
Building Lakehouses on Delta Lake with SQL Analytics Primer (Databricks)
You’ve heard the marketing buzz, maybe you have been to a workshop and worked with some Spark, Delta, SQL, Python, or R, but you still need some help putting all the pieces together? Join us as we review some common techniques to build a lakehouse using Delta Lake, use SQL Analytics to perform exploratory analysis, and build connectivity for BI applications.
"Structured Streaming was a new streaming API introduced to Spark over 2 years ago in Spark 2.0, and was announced GA as of Spark 2.2. Databricks customers have processed over a hundred trillion rows in production using Structured Streaming. We received dozens of questions on how to best develop, monitor, test, deploy and upgrade these jobs. In this talk, we aim to share best practices around what has worked and what hasn't across our customer base.
We will tackle questions around how to plan ahead, what kind of code changes are safe for structured streaming jobs, how to architect streaming pipelines which can give you the most flexibility without sacrificing performance by using tools like Databricks Delta, how to best monitor your streaming jobs and alert if your streams are falling behind or are actually failing, as well as how to best test your code."
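To ground those questions, a minimal hedged sketch of the kind of job under discussion (broker address, topic, and paths are placeholders): a Kafka source, a Delta sink, and a checkpoint location so the stream can be stopped, upgraded, and resumed:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

val query = events.writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/checkpoints/events") // recovery and upgrade point
  .start("/tmp/tables/events")

query.awaitTermination()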
This session covers how to work with the PySpark interface to develop Spark applications, from loading and ingesting data to applying transformations. The session covers working with different data sources, applying transformations, and Python best practices for developing Spark apps. The demo covers integrating Apache Spark apps, in-memory processing capabilities, working with notebooks, and integrating analytics tools into Spark applications.
Running Apache NiFi with Apache Spark: Integration Options (Timothy Spann)
A walk-through of various options for integrating Apache Spark and Apache NiFi in one smooth dataflow. There are now several options for interfacing between Apache NiFi and Apache Spark with Apache Kafka and Apache Livy.
Building the Enterprise Data Lake - Important Considerations Before You Jump In (SnapLogic)
In this webinar, learn from industry analyst and big data thought leader Mark Madsen about the future of big data and the importance of the new Enterprise Data Lake reference architecture.
This webinar also covers what’s important when building a modern, multi-use data infrastructure, the difference between a Hadoop application and a Data Lake infrastructure, and an enterprise data lake reference architecture to get you started.
To learn more, visit: www.snaplogic.com/big-data
Apache Iceberg: An Architectural Look Under the Covers (ScyllaDB)
Data Lakes have been built with a desire to democratize data - to allow more and more people, tools, and applications to make use of data. A key capability needed to achieve this is hiding the complexity of underlying data structures and physical data storage from users. The de facto standard, the Hive table format, addresses some of these problems but falls short at data, user, and application scale. So what is the answer? Apache Iceberg.
The Apache Iceberg table format is now in use at, and contributed to by, many leading tech companies like Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS.
Watch Alex Merced, Developer Advocate at Dremio, as he describes the open architecture and performance-oriented capabilities of Apache Iceberg.
You will learn:
• The issues that arise when using the Hive table format at scale, and why we need a new table format
• How a straightforward, elegant change in table format structure has enormous positive effects
• The underlying architecture of an Apache Iceberg table, how a query against an Iceberg table works, and how the table’s underlying structure changes as CRUD operations are done on it
• The resulting benefits of this architectural design (a short usage sketch follows this list)
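As a hedged taste of the API (assuming the iceberg-spark runtime jar and a Spark catalog configured as "demo"; table names and the snapshot id are placeholders):
// Create and populate an Iceberg table through Spark SQL
spark.sql("CREATE TABLE demo.db.logs (id BIGINT, level STRING, ts TIMESTAMP) USING iceberg")
spark.sql("INSERT INTO demo.db.logs VALUES (1, 'INFO', current_timestamp())")

// Time travel: read the table as of an earlier snapshot recorded in its metadata tree
val asOf = spark.read
  .option("snapshot-id", 1234567890L) // placeholder snapshot id
  .format("iceberg")
  .load("demo.db.logs")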
Delta from a Data Engineer’s Perspective (Databricks)
Take a walk through the daily struggles of a data engineer in this presentation as we cover what is truly needed to create robust end-to-end Big Data solutions.
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S... (Helena Edelson)
Regardless of the meaning we are searching for in our vast amounts of data, whether we are in science, finance, technology, energy, or health care, we all share the same problems that must be solved: How do we achieve that? What technologies best support the requirements? This talk is about how to leverage fast access to historical data together with real-time streaming data for predictive modeling, in a lambda architecture with Spark Streaming, Kafka, Cassandra, Akka and Scala. Efficient stream computation, composable data pipelines, data locality, the Cassandra data model and low latency, Kafka producers and HTTP endpoints as Akka actors...
This talk will address new architectures emerging for large scale streaming analytics, some based on Spark, Mesos, Akka, Cassandra and Kafka (SMACK) and other newer streaming analytics platforms and frameworks using Apache Flink or GearPump. Popular architectures like Lambda separate layers of computation and delivery and require many technologies which have overlapping functionality. Some of this results in duplicated code, untyped processes, or high operational overhead, let alone the cost (e.g. ETL).
I will discuss the problem domain and what is needed in terms of strategies, architecture and application design and code to begin leveraging simpler data flows. We will cover how the particular set of technologies addresses common requirements and how collaboratively they work together to enrich and reinforce each other.
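A hedged sketch of the simpler data flow being argued for, with era-appropriate APIs (a Kafka receiver stream into Spark Streaming, written straight to Cassandra via the connector; topic, keyspace, table and host names are placeholders):
import com.datastax.spark.connector.streaming._
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("smack-sketch")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val ssc = new StreamingContext(conf, Seconds(5))

KafkaUtils.createStream(ssc, "zookeeper:2181", "group", Map("raw_weather" -> 1))
  .map { case (_, line) => line.split(",") }
  .map(cols => (cols(0), cols(1).toDouble))   // (station id, temperature)
  .saveToCassandra("weather_ks", "raw_data")  // no separate batch/speed ETL layer

ssc.start()
ssc.awaitTermination()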
Since 2014, Typesafe has been actively contributing to the Apache Spark project, and has become a certified development support partner of Databricks, the company started by the creators of Spark. Typesafe and Mesosphere have forged a partnership in which Typesafe is the official commercial support provider of Spark on Apache Mesos, along with Mesosphere’s Datacenter Operating System (DCOS).
In this webinar with Iulian Dragos, Spark team lead at Typesafe Inc., we reveal how Typesafe supports running Spark in various deployment modes, along with the improvements we made to Spark to help integrate backpressure signals into the underlying technologies, making it a better fit for Reactive Streams. He also shows the functionality at work, and how simple it is to deploy Spark on Mesos with Typesafe.
We will introduce:
Various deployment modes for Spark: Standalone, Spark on Mesos, and Spark with Mesosphere DCOS
Overview of Mesos and how it relates to Mesosphere DCOS
Deeper look at how Spark runs on Mesos
How to manage coarse-grained and fine-grained scheduling modes on Mesos
What to know about a client vs. cluster deployment
A demo running Spark on Mesos
Reactive dashboards using Apache Spark (Rahul Kumar)
An Apache Spark tutorial talk: how to start working with Apache Spark, the features of Apache Spark, and how to compose a data platform with Spark. The talk also covers the reactive platform and tools and frameworks like Play and Akka.
Alpine academy apache spark series #1 introduction to cluster computing wit... (Holden Karau)
Alpine Academy Apache Spark series #1: introduction to cluster computing with Python and a wee bit of Scala. This is the first in the series and is aimed at the intro level; the next one will cover MLlib and ML.
Reactive app using actor model & apache spark (Rahul Kumar)
Developing applications with Big Data is really challenging work; scaling, fault tolerance and responsiveness are some of the biggest challenges. A real-time big data application with self-healing features is the dream these days. Apache Spark is a fast in-memory data processing system that makes a good backend for real-time applications. In this talk I will show how to use the reactive platform, the actor model and the Apache Spark stack to develop a system that is responsive, resilient, fault tolerant and message driven.
NOTE: This was converted to PowerPoint from Keynote. SlideShare does not play the embedded videos. You can download the PowerPoint from SlideShare and import it into Keynote; the videos should work in Keynote.
Abstract:
In this presentation, we will describe the "Spark Kernel" which enables applications, such as end-user facing and interactive applications, to interface with Spark clusters. It provides a gateway to define and run Spark tasks and to collect results from a cluster without the friction associated with shipping jars and reading results from peripheral systems. Using the Spark Kernel as a proxy, applications can be hosted remotely from Spark.
This presentation includes a comprehensive introduction to Apache Spark, from an explanation of its rapid ascent to its performance and developer advantages over MapReduce. We also explore its built-in functionality for application types involving streaming, machine learning, and Extract, Transform and Load (ETL).
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra (Natalino Busa)
We present a solution for streaming anomaly detection, named “Coral”, based on Spark, Akka and Cassandra. In the system presented, we use Spark to run the data analytics pipeline for anomaly detection. By running Spark on the latest events and data, we make sure that the model is always up-to-date and that the number of false positives is kept low, even under changing trends and conditions. Our machine learning pipeline uses Spark decision tree ensembles and k-means clustering. Once the model is trained by Spark, the model’s parameters are pushed to the streaming event processing layer, implemented in Akka. The Akka layer then scores thousands of events per second against the last model provided by Spark. Spark and Akka communicate with each other using Cassandra as a low-latency data store. By doing so, we make sure that every element of this solution is resilient and distributed. Spark performs micro-batches to keep the model up-to-date while Akka detects the new anomalies by using the latest Spark-generated data model. The project is currently hosted on GitHub. Have a look at: http://coral-streaming.github.io
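A hedged sketch of the Spark training side of such a pipeline (file path, feature parsing, k, and iteration count are placeholders; assumes a SparkContext named sc):
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Train on recent events so the model tracks changing trends and conditions
val recent = sc.textFile("/data/recent-events")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

val model = KMeans.train(recent, 8, 20) // k = 8 clusters, 20 iterations

// The centers are what gets pushed (via Cassandra, in Coral's design) to the Akka
// scoring layer; an event far from every center is a candidate anomaly.
val centers = model.clusterCenters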
Streaming Analytics with Spark, Kafka, Cassandra and Akka (Helena Edelson)
This talk will address how a new architecture is emerging for analytics, based on Spark, Mesos, Akka, Cassandra and Kafka (SMACK). Popular architectures like Lambda separate layers of computation and delivery and require many technologies which have overlapping functionality. Some of this results in duplicated code, untyped processes, or high operational overhead, let alone the cost (i.e. ETL). I will discuss the problem domain and what is needed in terms of strategies, architecture and application design and code to begin leveraging simpler data flows. We will cover how the particular set of technologies addresses common requirements and how collaboratively they work together to enrich and reinforce each other.
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization (Patrick Di Loreto)
The gambling industry has arguably been one of the most comprehensively affected by the internet revolution; if an organization such as William Hill hadn’t adapted successfully, it would have disappeared. We call this “Going Reactive.”
The company's latest innovations are very cutting edge platforms for personalization, recommendation, and big data, which are based on Akka, Scala, Play Framework, Kafka, Cassandra, Spark, and Mesos.
Spark Streaming makes it easy to build scalable fault-tolerant streaming applications. In this webinar, developers will learn:
*How Spark Streaming works - a quick review.
*Features in Spark Streaming that help prevent potential data loss.
*Complementary tools in a streaming pipeline - Kafka and Akka.
*Design and tuning tips for Reactive Spark Streaming applications (a minimal resilience sketch follows this list).
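A minimal hedged sketch of two of those loss-prevention features, checkpoint recovery and the receiver write-ahead log (paths and batch interval are placeholders):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("resilient-stream")
    .set("spark.streaming.receiver.writeAheadLog.enable", "true") // guard received data against loss
  val ssc = new StreamingContext(conf, Seconds(2))
  ssc.checkpoint("/tmp/checkpoints") // driver recovery point
  // ... define DStreams here ...
  ssc
}

// Recover from an existing checkpoint, or build a fresh context on first run
val ssc = StreamingContext.getOrCreate("/tmp/checkpoints", createContext _)
ssc.start()
ssc.awaitTermination()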
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an... (Anton Kirillov)
This talk is about architecture designs for data processing platforms based on the SMACK stack, which stands for Spark, Mesos, Akka, Cassandra and Kafka. The main topics of the talk are:
- SMACK stack overview
- storage layer layout
- fixing NoSQL limitations (joins and group by)
- cluster resource management and dynamic allocation
- reliable scheduling and execution at scale
- different options for getting the data into your system
- preparing for failures with proper backup and patching strategies
Everyone in the Scala world is using or looking into using Akka for low-latency, scalable, distributed or concurrent systems. I'd like to share my story of developing and productionizing multiple Akka apps, including low-latency ingestion and real-time processing systems, and Spark-based applications.
When does one use actors vs futures?
Can we use Akka with, or in place of, Storm?
How did we set up instrumentation and monitoring in production?
How does one use VisualVM to debug Akka apps in production?
What happens if the mailbox gets full?
What is our Akka stack like?
I will share best practices for building Akka and Scala apps, pitfalls and things we'd like to avoid, and a vision of where we would like to go for ideal Akka monitoring, instrumentation, and debugging facilities. Plus backpressure and at-least-once processing.
Video: https://youtu.be/C_u4_l84ED8
Karl Isenberg reviews the history of distributed computing, clarifies terminology for layers in the container stack, and does a head to head comparison of several tools in the space, including Kubernetes, Marathon, and Docker Swarm. Learn which features and qualities are critical for container orchestration and how you can apply this knowledge when evaluating platforms.
Linux 4.x Tracing Tools: Using BPF Superpowers (Brendan Gregg)
Talk for USENIX LISA 2016 by Brendan Gregg.
"Linux 4.x Tracing Tools: Using BPF Superpowers
The Linux 4.x series heralds a new era of Linux performance analysis, with the long-awaited integration of a programmable tracer: Enhanced BPF (eBPF). Formerly the Berkeley Packet Filter, BPF has been enhanced in Linux to provide system tracing capabilities, and integrates with dynamic tracing (kprobes and uprobes) and static tracing (tracepoints and USDT). This has allowed dozens of new observability tools to be developed so far: for example, measuring latency distributions for file system I/O and run queue latency, printing details of storage device I/O and TCP retransmits, investigating blocked stack traces and memory leaks, and a whole lot more. These lead to performance wins large and small, especially when instrumenting areas that previously had zero visibility. Tracing superpowers have finally arrived.
In this talk I'll show you how to use BPF in the Linux 4.x series, and I'll summarize the different tools and front ends available, with a focus on iovisor bcc. bcc is an open source project to provide a Python front end for BPF, and comes with dozens of new observability tools (many of which I developed). These tools include new BPF versions of old classics, and many new tools, including: execsnoop, opensnoop, funccount, trace, biosnoop, bitesize, ext4slower, ext4dist, tcpconnect, tcpretrans, runqlat, offcputime, offwaketime, and many more. I'll also summarize use cases and some long-standing issues that can now be solved, and how we are using these capabilities at Netflix."
Using the SDACK Architecture to Build a Big Data Product (Evans Ye)
You definitely have heard about the SMACK architecture, which stands for Spark, Mesos, Akka, Cassandra, and Kafka. It’s especially suitable for building a lambda architecture system. But what is SDACK? It’s very similar to SMACK, except the “D” stands for Docker. While SMACK is an enterprise-scale, multi-tenant-supporting solution, the SDACK architecture is particularly suitable for building a data product. In this talk, I’ll discuss the advantages of the SDACK architecture and how TrendMicro uses it to build an anomaly detection data product. The talk will cover:
1) The architecture we designed based on SDACK to support both batch and streaming workloads.
2) The data pipeline built on Akka Streams, which is flexible, scalable, and capable of self-healing.
3) The Cassandra data model designed to support time series data writes and reads.
Jump Start with Apache Spark 2.0 on Databricks (Databricks)
Apache Spark 2.0 has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
What’s new in Spark 2.0
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa... (Helena Edelson)
O’Reilly webcast with myself and Evan Chan on the new SNACK stack (a play on SMACK) with FiloDB: Scala, Spark Streaming, Akka, Cassandra, FiloDB and Kafka.
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A... (Helena Edelson)
Streaming Big Data: delivering meaning in near-real time, at high velocity, at massive scale with Apache Spark, Apache Kafka, Apache Cassandra, Akka and the Spark Cassandra Connector. Why this pairing of technologies, and how easy it is to implement. Example application: https://github.com/killrweather/killrweather
5 Ways to Use Spark to Enrich your Cassandra Environment (Jim Hatcher)
Apache Cassandra is a powerful system for supporting large-scale, low-latency data systems, but it has some tradeoffs. Apache Spark can help fill those gaps, and this presentation will show you how.
Lesfurest.com invited me to talk about the KAPPA architecture style during a BBL.
Kappa architecture is a style for real-time processing of large volumes of data, combining the stream processing, storage, and serving layers into a single pipeline. It differs from the Lambda architecture, which uses separate batch and stream processing pipelines.
This introductory workshop is aimed at data analysts and data engineers new to Apache Spark and shows them how to analyze big data with Spark SQL and DataFrames.
In these partly instructor-led and partly self-paced labs, we will cover Spark concepts and you’ll do labs for Spark SQL and DataFrames in Databricks Community Edition.
Toward the end, you’ll get a glimpse into the newly minted Databricks Developer Certification for Apache Spark: what to expect and how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data... (DataStax Academy)
Typesafe did a survey of Spark usage last year and found that a large percentage of Spark users combine it with Cassandra and Kafka. This talk focuses on streaming data scenarios that demonstrate how these three tools complement each other for building robust, scalable, and flexible data applications. Cassandra provides resilient and scalable storage, with flexible data format and query options. Kafka provides durable, scalable collection of streaming data with message-queue semantics. Spark provides very flexible analytics, everything from classic SQL queries to machine learning and graph algorithms, running in a streaming model based on "mini-batches", offline batch jobs, or interactive queries. We'll consider best practices and areas where improvements are needed.
Apache Cassandra Lunch #72: Databricks and Cassandra (Anant Corporation)
In Cassandra Lunch #72, we will discuss how we can use Databricks with Cassandra.
Accompanying Blog: https://blog.anant.us/apache-cassandra-lunch-72-databricks-and-cassandra
Accompanying YouTube: https://youtu.be/5zCN27KHADo
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis (Helena Edelson)
Slides from my talk with Evan Chan at Strata San Jose: NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis. Streaming analytics architecture in big data for fast streaming, ad hoc and batch, with Kafka, Spark Streaming, Akka, Mesos, Cassandra and FiloDB. Simplifying to a unified architecture.
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark (DataStax Academy)
Presenter: Evan Chan, Principal Software Engineer at Socrata Inc.
How do you rapidly derive complex insights on top of really big data sets in Cassandra? This session draws upon Evan's experience building a distributed, interactive, columnar query engine on top of Cassandra and Spark. We will start by surveying the existing query landscape of Cassandra and discuss ways to integrate Cassandra and Spark. We will dive into the design and architecture of a fast, column-oriented query architecture for Spark, and why columnar stores are so advantageous for OLAP workloads. I will present a schema for Parquet-like storage of analytical datasets on Cassandra. Find out why Cassandra and Spark are the perfect match for enabling fast, scalable, complex querying and storage of big analytical data.
All Day DevOps - FLiP Stack for Cloud Data Lakes (Timothy Spann)
https://www.alldaydevops.com/addo-speakers/timothy-spann
Timothy Spann, StreamNative
Session Name: FLiP Stack for Cloud Data Lakes
Using an all-Apache stack for rapid data lake population and querying, with Apache Flink, Apache Pulsar, and Apache NiFi. We can quickly stream data to and from any data lake, lakehouse, database, or data mart, regardless of cloud or size. FLiP allows Java and Python developers to build scalable solutions that span messaging and streaming in cloud-native fashion, with full monitoring.
Speaker Bio:
Tim Spann is a Developer Advocate @ StreamNative where he works with Apache Pulsar, Apache Flink, Apache NiFi, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData, and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, Pulsar, NiFi, the blockchain, and Spark.
Indexing 3-dimensional trajectories: Apache Spark and Cassandra integration (Cesare Cugnasco)
Data visualization can be a tricky problem, even more so if the dataset is made of several billion 3-dimensional particles moving through time. The talk will focus on some simple indexing and data thinning techniques, and how (and how not) to implement them with Cassandra and Spark.
Solution Brief: Real-Time Pipeline Accelerator (BlueData, Inc.)
Get started with Spark Streaming, Kafka, and Cassandra for real-time data analytics.
BlueData makes it easy to deploy Spark infrastructure and applications on-premises. The BlueData EPIC software platform is purpose-built to simplify and accelerate the deployment of Spark, Hadoop, and other tools for Big Data analytics, leveraging Docker containers and virtualized infrastructure.
Our new Real-Time Pipeline Accelerator solution provides the software and professional services you need for building data pipelines in a multi-tenant environment for Spark Streaming, Kafka, and Cassandra. With help from the BlueData team, you’ll also have two end-to-end real-time data pipelines as a starting point.
Learn more about BlueData at www.bluedata.com
Similar to Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala (20)
Data stream processing platforms and microservices platform infrastructure and strategies are converging. As we edge towards larger, more complex and decoupled systems, combined with the continual growth of the global information graph, our frontiers of unsolved challenges grow equally as fast. Central challenges for distributed systems include persistence strategies across DCs, zones or regions, network partitions, data optimization, system stability in all phases.
How does leveraging CRDTs and Event Sourcing address several core distributed systems challenges? What are useful strategies and patterns involved in the design, deployment, and running of stateful and stateless applications for the cloud, for example with Kubernetes? Combined with code samples, we will see how Akka Cluster, Multi-DC Persistence, Split Brain, Sharding and Distributed Data can help solve these problems.
Humans have a tendency to invent new problems rather than solve old ones. As we build larger, more complex systems, we unearth global challenges around networks, compute resources and data. Have we neglected to see more elegant examples which existed all along?
It is possible for even the most complex systems to be organized and simplified in ways that may not occur to us. In situations where we still search for the right algorithms, by turning to complex natural systems around us we can find the problem was solved long ago. What we think is a new protocol may in fact be one that has been tested and evolving over hundreds or millions of years. One invented for the early internet is incredibly similar to a strategy evolved by desert ants millions of years ago. And this is why it works.
This talk will address these questions with examples of self-organization, decentralization and diversification from emergent phenomena found in nature.
Disorder And Tolerance In Distributed Systems At Scale (Helena Edelson)
Rethinking intelligent resilient systems. Re-framing problems changes how we see and solve them. The intersection of scientific thought and principles parallels much of what we solve as engineers of information (e.g. uncertainty, time, distribution) and need. This talk is an interdisciplinary look at complex adaptive systems and how they innately solve things like resource distribution, growth and rebalancing. From the context of intelligence and systems, this talk will look at ideas around entropy and time, ensemble forecasting, self-organization theory, the butterfly effect, virus-human co-evolution and adaption, natural feedback loops, self-balancing, and adaptation.
Can we leverage these principles, behaviors and strategies to design intelligent systems at scale?
Can seeing things in an interdisciplinary way benefit solving common problems and speed innovation?
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac... (Helena Edelson)
Building self-healing, intelligent platforms: systems that learn, multi-datacenter, removing human intervention with ML. Reactive Summit 2016, @helenaedelson.
2. Who Is This Person?
• Spark Cassandra Connector committer
• Akka contributor - 2 new features in Akka Cluster
• Big Data & Scala conference speaker
• Currently Sr Software Engineer, Analytics @ DataStax
• Sr Cloud Engineer: VMware, CrowdStrike, SpringSource…
• Prev Spring committer - Spring AMQP, Spring Integration
3. Talk Roadmap
What: Lambda Architecture & Delivering Meaning
Why: Spark, Kafka, Cassandra & Akka integration
How: Composable Pipelines - Code
helena.edelson@datastax.com
4. "I need fast access to historical data on the fly for predictive modeling with real time data from the stream."
5. Lambda Architecture
A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods.
• Spark is one of the few data processing frameworks that lets you seamlessly integrate batch and stream processing
• Of petabytes of data
• In the same application
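A minimal sketch of what "in the same application" means in practice (names and the 500 ms interval are illustrative, not from the talk): one SparkContext serves batch queries over history, while a StreamingContext built on that same context handles the live stream.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

val conf = new SparkConf().setAppName("lambda-app")
val sc   = new SparkContext(conf)                       // batch layer: full scans over history
val ssc  = new StreamingContext(sc, Milliseconds(500))  // speed layer: micro batches

// The same transformation logic can be applied to an RDD (batch view)
// and to each micro batch of a DStream (real-time view).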
11. Strategies
• Scalable Infrastructure
• Partition For Scale
• Replicate For Resiliency
• Share Nothing
• Asynchronous Message Passing
• Parallelism
• Isolation
• Data Locality
• Location Transparency
12. Strategy: Technologies (My Nerdy Chart)
Scalable Infrastructure / Elastic: Spark, Cassandra, Kafka
Partition For Scale, Network Topology Aware: Cassandra, Spark, Kafka, Akka Cluster
Replicate For Resiliency: Spark, Cassandra, Akka Cluster (all hash the node ring)
Share Nothing, Masterless: Cassandra, Akka Cluster (both Dynamo style)
Fault Tolerance / No Single Point of Failure: Spark, Cassandra, Kafka
Replay From Any Point Of Failure: Spark, Cassandra, Kafka, Akka + Akka Persistence
Failure Detection: Cassandra, Spark, Akka, Kafka
Consensus & Gossip: Cassandra & Akka Cluster
Parallelism: Spark, Cassandra, Kafka, Akka
Asynchronous Data Passing: Kafka, Akka, Spark
Fast, Low Latency, Data Locality: Cassandra, Spark, Kafka
Location Transparency: Akka, Spark, Cassandra, Kafka
13. Apache Spark
• Fast, distributed, scalable and fault tolerant cluster compute system
• Enables low-latency with complex analytics
• Developed in 2009 at UC Berkeley AMPLab, open sourced in 2010
• Became an Apache project in February 2014
14. Apache Kafka
• High Throughput Distributed Messaging
• Decouples Data Pipelines
• Handles Massive Data Load
• Supports a Massive Number of Consumers
• Distribution & partitioning across cluster nodes
• Automatic recovery from broker failures
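As a concrete illustration of the decoupling: a producer only needs broker addresses and a topic name; it never knows who consumes. A hedged sketch using the Kafka Java client from Scala - the topic name and message values are invented for the example:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")  // assumed local broker
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// Publish and forget; any number of consumer groups can read this independently.
producer.send(new ProducerRecord("raw_weather_data", "10010:99999", "2005,12,1,7,-5.6"))
producer.close()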
21. • Stream data from Kafka to Cassandra
• Stream data from Kafka to Spark and write to Cassandra
• Stream from Cassandra to Spark - coming soon!
• Read data from Spark/Spark Streaming Source and write to C*
• Read data from Cassandra to Spark
22. HADOOP
• Distributed Analytics Platform
• Easy Abstraction for Datasets
• Support for several languages
• Streaming
• Machine Learning
• Graph
• Integrated SQL Queries
• Has Generalized DAG execution
All in one package
And it uses Akka
25. Apache Spark - Easy to Use API
Returns the top (k) highest temps for any location in the year:
def topK(aggregate: Seq[Double]): Seq[Double] =
  sc.parallelize(aggregate).top(k).toSeq   // top(k) is an action returning an Array

Returns the top (k) highest temps … in a Future:
def topK(aggregate: Seq[Double]): Future[Seq[Double]] =
  sc.parallelize(aggregate).collectAsync()
    .map(_.sorted.reverse.take(k))  // async actions return the whole collection; take the top k client-side
26. Spark Shell
Use the Spark Shell to quickly try out code samples.
Available in Scala (spark-shell) and Python (pyspark).
27. Collection To RDD
scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val distributedData = sc.parallelize(data)
distributedData: spark.RDD[Int] =
spark.ParallelCollection@10d13e3e
32. Setting up C* and Spark
DSE > 4.5.0: just start your nodes with `dse cassandra -k`
Apache Cassandra: follow the excellent guide by Al Tobey
http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
35. Your Data Is Like Candy
Delicious: you want it now
Batch Analytics: analysis after data has accumulated; the data carries less weight by the time it is processed
Streaming Analytics: analytics as data arrives; the data won't be stale and neither will our analytics
Both in same app = Lambda
36. Spark Streaming
• I want results continuously in the event stream
• I want to run computations in my event-driven async apps
• Exactly once message guarantees
37. DStream (Discretized Stream)
A continuous stream of micro batches: RDD (time 0 to 1), RDD (time 1 to 2), RDD (time 2 to 3), …
A transformation on a DStream = transformations on its RDDs
• Complex processing models with minimal effort
• Streaming computations on small time intervals
38. Basic Streaming: FileInputDStream
val conf = new SparkConf().setMaster(SparkMaster).setAppName(AppName)
val ssc = new StreamingContext(conf, Milliseconds(500))  // the batch streaming interval

ssc.textFileStream("s3n://raw_data_bucket/")
  .flatMap(_.split("\\s+"))
  .map(_.toLowerCase)
  .countByValue()
  .saveToCassandra(keyspace, table)

ssc.checkpoint(checkpointDir)
ssc.start()             // starts the streaming application,
ssc.awaitTermination()  // piping raw incoming data to the sink
39. ReceiverInputDStreams
DStreams - the stream of raw data received from streaming sources:
• Basic Source - in the StreamingContext API
• Advanced Source - in external modules and separate Spark artifacts
Receivers
• Reliable Receivers - for data sources supporting acks (like Kafka)
• Unreliable Receivers - for data sources not supporting acks
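For intuition on the receiver side, here is a minimal sketch of an unreliable receiver (the class is invented for illustration, not from the talk): store() hands records to Spark without acknowledging the source, so records in flight can be lost if the receiver dies.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SocketLinesReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = new Thread("socket-receiver") {
    override def run(): Unit = {
      val socket = new java.net.Socket(host, port)
      val reader = new java.io.BufferedReader(
        new java.io.InputStreamReader(socket.getInputStream))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line)          // no ack back to the source: "unreliable"
        line = reader.readLine()
      }
      socket.close()
    }
  }.start()

  def onStop(): Unit = ()    // the read loop exits once isStopped is true
}

// Usage: ssc.receiverStream(new SocketLinesReceiver("localhost", 9999))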
41. Streaming Window Operations
kvStream
  .map { case (k, v) => (k, v.value) }
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
  .saveToCassandra(keyspace, table)
Window Length: duration of each window = 30 s
Sliding Interval: interval at which the window operation is performed = every 10 s
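A variant worth knowing (not on the slide): reduceByKeyAndWindow also accepts an inverse function, letting Spark update each window incrementally - add the new 10 s slice, subtract the expired one - instead of re-reducing the full 30 s of data; this form requires checkpointing to be enabled.

kvStream
  .map { case (k, v) => (k, v.value) }
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, (a: Int, b: Int) => a - b,
    Seconds(30), Seconds(10))
  .saveToCassandra(keyspace, table)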
46. Fault Tolerance & Availability with Apache Cassandra
• Automatic Replication
• Multi Datacenter
• Decentralized - no single point of failure
• Survive regional outages
• New nodes automatically add themselves to the cluster
• DataStax drivers automatically discover new nodes
52. C* At CERN: Large Hadron Collider
• ATLAS - largest of several detectors along the Large Hadron Collider
• Measures particle production when protons collide at a very high center of mass energy
• Bursty traffic
• Volume of data from sensors requires a very large trigger and data acquisition system
• 30,000 applications on 2,000 nodes
56. CQL - Easy
CREATE TABLE users (
  username varchar,
  firstname varchar,
  lastname varchar,
  email list<varchar>,
  password varchar,
  created_date timestamp,
  PRIMARY KEY (username)
);

INSERT INTO users (username, firstname, lastname, email, password, created_date)
VALUES ('hedelson', 'Helena', 'Edelson',
        ['helena.edelson@datastax.com'], 'ba27e03fd95e507daf2937c937d499ab', '2014-11-15 13:50:00')
IF NOT EXISTS;

• Familiar syntax
• Many Tools & Drivers
• Many Languages
• Friendly to programmers
• Paxos for locking
57. Timeseries Data
CREATE TABLE weather.raw_data (
  wsid text,
  year int,
  month int,
  day int,
  hour int,
  temperature double,
  dewpoint double,
  pressure double,
  wind_direction int,
  wind_speed double,
  one_hour_precip double,
  PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

C* Clustering Columns: writes store the most recent first, and reads return the most recent first.
Cassandra will automatically sort by most recent for both write and read.
58. val multipleStreams = (1 to numDstreams).map { i =>
streamingContext.receiverStream[HttpRequest](new HttpReceiver(port))
}
streamingContext.union(multipleStreams)
.map { httpRequest => TimelineRequestEvent(httpRequest)}
.saveToCassandra("requests_ks", "timeline")
A record of every event, in order in which it happened, per URL:
CREATE TABLE IF NOT EXISTS requests_ks.timeline (
timesegment bigint, url text, t_uuid timeuuid, method text, headers map <text, text>, body text,
PRIMARY KEY ((url, timesegment) , t_uuid)
);
timeuuid protects from simultaneous events over-writing one another.
timesegment protects from writing unbounded partitions.
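A sketch of how those two columns might be produced on write (the helper names are invented for illustration): timesegment is a coarse time bucket - here one per day - and the timeuuid comes from the DataStax Java driver's time-based UUID generator.

import java.util.UUID
import com.datastax.driver.core.utils.UUIDs   // assumption: Java driver on the classpath

// One partition per (url, day): bounds partition size while keeping event order.
def timesegment(epochMillis: Long): Long = epochMillis / (24L * 60 * 60 * 1000)

val tUuid: UUID = UUIDs.timeBased()  // unique even for same-millisecond events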
60. Spark Cassandra Connector
•NOSQL JOINS!
•Write & Read data between Spark and Cassandra
•Compatible with Spark 1.3
•Handles Data Locality for Speed
•Implicit type conversions
•Server-Side Filtering - SELECT, WHERE, etc.
•Natural Timeseries Integration
https://github.com/datastax/spark-cassandra-connector
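Since "NOSQL JOINS!" deserves a concrete shape: the connector can join an RDD of keys against a Cassandra table, fetching only the matching partitions instead of scanning the whole table. A hedged sketch reusing the weather.raw_data schema shown later in this deck; the wsid value is illustrative.

import com.datastax.spark.connector._

val wsids  = sc.parallelize(Seq("10010:99999")).map(Tuple1(_))  // partition-key tuples
val joined = wsids.joinWithCassandraTable("weather", "raw_data")
// joined: RDD of (key tuple, CassandraRow) pairs, fetched by partition key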
63. Write from Spark to Cassandra
sc.parallelize(Seq(0, 1, 2)).saveToCassandra("keyspace", "raw_data")
// SparkContext ^                 keyspace ^    table ^

Spark RDD JOIN with NOSQL!
predictionsRdd.join(music).saveToCassandra("music", "predictions")
64. Read From C* to Spark
val rdd = sc.cassandraTable("github", "commits")
  .select("user", "count", "year", "month")
  .where("commits >= ? and year = ?", 1000, 2015)
// sc: SparkContext; "github" = keyspace, "commits" = table
// select and where push column and row filtering to the server side
// rdd: CassandraRDD[CassandraRow]
65. Rows: Custom Objects
val rdd = ssc.cassandraTable[MonthlyCommits]("github", "commits_aggregate")
  .where("user = ? and project_name = ? and year = ?",
    "helena", "spark-cassandra-connector", 2015)
// ssc: StreamingContext; "github" = keyspace, "commits_aggregate" = table
// each CassandraRow is mapped onto the MonthlyCommits case class
66. Rows
val tuplesRdd = sc.cassandraTable[(Int, Date, String)](db, tweetsTable)
  .select("cluster_id", "time", "cluster_name")
  .where("time > ? and time < ?", "2014-07-12 20:00:01", "2014-07-12 20:00:03")

val rdd = ssc.cassandraTable[MyDataType]("stats", "clustering_time")
  .where("key = 1").limit(10).collect

val rdd = ssc.cassandraTable[(Int, DateTime, String)]("stats", "clustering_time")
  .where("key = 1").withDescOrder.collect
67. Cassandra User Defined Types
CREATE TYPE address (
street text,
city text,
zip_code int,
country text,
cross_streets set<text>
);
UDT = Your Custom Field Type In Cassandra
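A brief sketch of reading that UDT back through the connector (the table name is invented for the example): a UDT column surfaces as a UDTValue with typed getters.

import com.datastax.spark.connector._

sc.cassandraTable("ks", "users_with_address")   // hypothetical table with an address column
  .foreach { row =>
    val addr = row.getUDTValue("address")
    println(addr.getString("city") + " / " + addr.getInt("zip_code"))
  }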
69. Data Locality
● Spark asks an RDD for a list of its partitions (splits)
● Each split consists of one or more token-ranges
● For every partition
● Spark asks RDD for a list of preferred nodes to process on
● Spark creates a task and sends it to one of the nodes for execution
Every Spark task uses a CQL-like query to fetch data for the given token range:
SELECT "key", "value"
FROM "test"."kv"
WHERE token("key") > 595597420921139321
  AND token("key") <= 595597431194200132
ALLOW FILTERING
70. Cassandra Locates a Row Based on Partition Key and Token Range
All of the rows in a Cassandra Cluster are stored based on their location in the Token Range.
71. Cassandra Locates a Row Based on Partition Key and Token Range
Each of the Nodes in a Cassandra Cluster is primarily responsible for one set of Tokens.
(Diagram: a token ring labeled 0-999/500, with nodes New York City/Manhattan: Helena; Warsaw: Piotr & Jacek; San Francisco: Brian, Russell & Alex; St. Petersburg: Artem)
72. Cassandra Locates a Row Based on Partition Key and Token Range
Each of the Nodes in a Cassandra Cluster is primarily responsible for one set of Tokens.
(Diagram: the same ring divided into token ranges 750-99, 350-749, and 100-349 across the New York City, Warsaw, San Francisco, and St. Petersburg nodes)
73. Cassandra Locates a Row Based on Partition Key and Token Range
The CQL Schema designates at least one column to be the Partition Key.
(Diagram: example row Jacek | 514 | Red entering the ring)
74. Cassandra Locates a Row Based on Partition Key and Token Range
The hash of the Partition Key tells us where a row should be stored.
(Diagram: example row Helena | 514 | Red hashed onto the ring)
75. The Spark Executor Uses the Connector to Pull Rows from the Local Cassandra Instance
The C* Driver pages spark.cassandra.input.page.row.size CQL rows at a time:
SELECT * FROM keyspace.table WHERE pk =
(Diagram: a Spark Executor co-located with a Cassandra node, Amsterdam)
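The page size named above is an ordinary Spark property, so it can be tuned where the conf is built; a small sketch (the host and value are illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .set("spark.cassandra.input.page.row.size", "1000")  // CQL rows fetched per page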
78. Spark SQL with Cassandra
import org.apache.spark.sql.cassandra.CassandraSQLContext

val cc = new CassandraSQLContext(sparkContext)
cc.setKeyspace(keyspaceName)
cc.sql("""
  SELECT table1.a, table1.b, table1.c, table2.a
  FROM table1 AS table1
  JOIN table2 AS table2 ON table1.a = table2.a
    AND table1.b = table2.b
    AND table1.c = table2.c
""")
  .map(Data(_))
  .saveToCassandra(keyspace1, table3)
79. Spark SQL with Cassandra & JSON
val sql = new SQLContext(sparkContext)
val json = Seq(
  """{"user":"helena","commits":98, "month":3, "year":2015}""",
  """{"user":"jacek-lewandowski", "commits":72, "month":3, "year":2015}""",
  """{"user":"pkolaczk", "commits":42, "month":3, "year":2015}""")

// write (jsonRDD takes an RDD[String], so parallelize the Seq first)
sql.jsonRDD(sc.parallelize(json))
  .map(CommitStats(_))
  .flatMap(compute)
  .saveToCassandra("stats","monthly_commits")

// read
val rdd = sc.cassandraTable[MonthlyCommits]("stats","monthly_commits")

cqlsh> CREATE TABLE github_stats.commits_aggr (user VARCHAR PRIMARY KEY, commits INT…);
87. Design Data Model to Support Queries
Queries I need:
• Store raw data per ID
• Store time series data in order: most recent to oldest
• Compute and store aggregate data in the stream
• Set TTLs on historic data
• Get data by ID
• Get data for a single date and time
• Get data for a window of time
• Compute, store and retrieve daily, monthly, annual aggregations
88. Data Model
• Weather Station Id and Time are unique
• Store as many as needed
CREATE TABLE daily_temperature (
  weather_station text,
  year int,
  month int,
  day int,
  hour int,
  temperature double,
  PRIMARY KEY (weather_station, year, month, day, hour)
);

INSERT INTO daily_temperature (weather_station, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 7, -5.6);
INSERT INTO daily_temperature (weather_station, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 8, -5.1);
INSERT INTO daily_temperature (weather_station, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 9, -4.9);
INSERT INTO daily_temperature (weather_station, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 10, -5.3);
89.
90. Load-Balanced Data Ingestion
class HttpNodeGuardian extends ClusterAwareNodeGuardianActor {
  cluster.joinSeedNodes(Vector(..))

  context.actorOf(BalancingPool(PoolSize).props(Props(
    new KafkaPublisherActor(KafkaHosts, KafkaBatchSendSize))))

  Cluster(context.system) registerOnMemberUp {
    context.actorOf(BalancingPool(PoolSize).props(Props(
      new HttpReceiverActor(KafkaHosts, KafkaBatchSendSize))))
  }

  def initialized: Actor.Receive = { … }
}
91. Client: HTTP Receiver Akka Actor
class HttpDataIngestActor(kafka: ActorRef) extends Actor with ActorLogging {
  implicit val system = context.system
  implicit val askTimeout: Timeout = settings.timeout
  implicit val materializer = ActorFlowMaterializer(
    ActorFlowMaterializerSettings(system))

  val requestHandler: HttpRequest => HttpResponse = {
    case HttpRequest(HttpMethods.POST, Uri.Path("/weather/data"), headers, entity, _) =>
      headers.toSource collect { case s: Source =>
        kafka ! KafkaMessageEnvelope[String, String](topic, group, s.data: _*)
        HttpResponse(200, entity = HttpEntity(MediaTypes.`text/html`))
      } getOrElse HttpResponse(404, entity = "Unsupported request")
    case _: HttpRequest =>
      HttpResponse(400, entity = "Unsupported request")
  }

  Http(system).bind(HttpHost, HttpPort).map { case connection =>
    log.info("Accepted new connection from " + connection.remoteAddress)
    connection.handleWithSyncHandler(requestHandler)
  }

  def receive: Actor.Receive = {
    case e =>
  }
}
92. Client: Kafka Producer Akka Actor
class KafkaProducerActor[K, V](config: ProducerConfig) extends Actor {

  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
      case _: ActorInitializationException   => Stop
      case _: FailedToSendMessageException   => Restart
      case _: ProducerClosedException        => Restart
      case _: NoBrokersForPartitionException => Escalate
      case _: KafkaException                 => Escalate
      case _: Exception                      => Escalate
    }

  /** Created from the config passed to the actor. */
  private val producer = new KafkaProducer[K, V](config)

  override def postStop(): Unit = producer.close()

  def receive = {
    case e: KafkaMessageEnvelope[K, V] => producer.send(e)
  }
}
94. Store Raw Data From the Kafka Stream To C*
val kafkaStream = KafkaUtils.createStream[K, V, KDecoder, VDecoder](
  ssc, kafkaParams, topicMap, StorageLevel.DISK_ONLY_2)
  .map(transform)
  .map(RawWeatherData(_))

/** Saves the raw data to Cassandra. */
kafkaStream.saveToCassandra(keyspace, raw_ws_data)

/** Now proceed with computations from the same stream.. */
kafkaStream…

Now we can replay on failure, for later computation, etc.
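Replay, sketched (the wsid value is illustrative): because every raw record was persisted before any computation, a repaired or brand-new job can rebuild its state from Cassandra rather than from the possibly-expired Kafka log.

import com.datastax.spark.connector.streaming._

val history = ssc.cassandraTable[RawWeatherData](keyspace, raw_ws_data)
  .where("wsid = ?", "10010:99999")
// ...then feed `history` through the same aggregation logic as the live stream.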
95. Let's See Our Data Model Again
CREATE TABLE weather.raw_data (
  wsid text,
  year int,
  month int,
  day int,
  hour int,
  temperature double,
  dewpoint double,
  pressure double,
  wind_direction int,
  wind_speed double,
  one_hour_precip double,
  PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

CREATE TABLE daily_aggregate_precip (
  wsid text,
  year int,
  month int,
  day int,
  precipitation counter,
  PRIMARY KEY ((wsid), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);
96. Efficient Stream Computation
Gets the partition key for data locality: the Spark C* Connector feeds this to Spark.
precipitation is a Cassandra counter column in our schema, so no expensive `reduceByKey` is needed - simply let C* do the aggregation, cheaply and fast.

class KafkaStreamingActor(kafkaParams: Map[String, String], ssc: StreamingContext, settings: WeatherSettings)
  extends AggregationActor {
  import settings._

  val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
    .map(_._2.split(","))
    .map(RawWeatherData(_))

  kafkaStream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

  /** RawWeatherData: wsid, year, month, day, oneHourPrecip */
  kafkaStream.map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
    .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)

  /** Now the [[StreamingContext]] can be started. */
  context.parent ! OutputStreamInitialized

  def receive: Actor.Receive = {…}
}
97. /** For a given weather station, calculates annual cumulative precip - or year to date. */
class PrecipitationActor(ssc: StreamingContext, settings: WeatherSettings) extends AggregationActor {

  def receive: Actor.Receive = {
    case GetPrecipitation(wsid, year)        => cumulative(wsid, year, sender)
    case GetTopKPrecipitation(wsid, year, k) => topK(wsid, year, k, sender)
  }

  /** Computes annual aggregation. Precipitation values are 1 hour deltas from the previous. */
  def cumulative(wsid: String, year: Int, requester: ActorRef): Unit =
    ssc.cassandraTable[Double](keyspace, dailytable)
      .select("precipitation")
      .where("wsid = ? AND year = ?", wsid, year)
      .collectAsync()
      .map(AnnualPrecipitation(_, wsid, year)) pipeTo requester

  /** Returns the k highest precipitation values for the station in the `year`. */
  def topK(wsid: String, year: Int, k: Int, requester: ActorRef): Unit = {
    val toTopK = (aggregate: Seq[Double]) => TopKPrecipitation(wsid, year,
      ssc.sparkContext.parallelize(aggregate).top(k).toSeq)
    ssc.cassandraTable[Double](keyspace, dailytable)
      .select("precipitation")
      .where("wsid = ? AND year = ?", wsid, year)
      .collectAsync().map(toTopK) pipeTo requester
  }
}
98. Efficient Batch Analytics
class TemperatureActor(sc: SparkContext, settings: WeatherSettings)
  extends AggregationActor {
  import akka.pattern.pipe

  def receive: Actor.Receive = {
    case e: GetMonthlyHiLowTemperature => highLow(e, sender)
  }

  def highLow(e: GetMonthlyHiLowTemperature, requester: ActorRef): Unit =
    sc.cassandraTable[DailyTemperature](keyspace, daily_temperature_aggr)
      .where("wsid = ? AND year = ? AND month = ?", e.wsid, e.year, e.month)
      .collectAsync()
      .map(MonthlyTemperature(_, e.wsid, e.year, e.month)) pipeTo requester
}

C* data is automatically sorted by most recent (due to our data model), so no additional Spark or collection sort is needed.
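A hedged usage sketch to close the loop (the actor reference wiring is invented for illustration): callers interrogate these aggregation actors with the ask pattern and receive the piped result as a Future.

import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.duration._

implicit val timeout = Timeout(5.seconds)
// temperatureActor is an ActorRef to a TemperatureActor created elsewhere.
val result = temperatureActor ? GetMonthlyHiLowTemperature("10010:99999", 2005, 12)
// completes with the MonthlyTemperature the actor pipes back to the requester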