Spark and Cassandra with the Datastax Spark Cassandra Connector
How it works and how to use it!
Missed Spark Summit but Still want to see some slides?
This slide deck is for you!
Apache Cassandra and Spark: You Got the Lighter, Let's Start the Fire - Patrick McFadin
An introduction to analyzing Apache Cassandra data using Apache Spark, covering data models, operations topics, and the internals of how Spark interfaces with Cassandra.
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, DataStax) - DataStax
Worried that you aren't taking full advantage of your Spark and Cassandra integration? Well worry no more! In this talk we'll take a deep dive into all of the available configuration options and see how they affect Cassandra and Spark performance. Concerned about throughput? Learn to adjust batching parameters and gain a boost in speed. Always running out of memory? We'll take a look at the various causes of OOM errors and how we can circumvent them. Want to take advantage of Cassandra's natural partitioning in Spark? Find out about the recent developments that let you perform shuffle-less joins on Cassandra-partitioned data! Come with your questions and problems and leave with answers and solutions!
About the Speaker
Russell Spitzer Software Engineer, DataStax
Russell Spitzer received a Ph.D. in bioinformatics before finding his deep passion for distributed software. He found the perfect outlet for this passion at DataStax, where he began on the Automation and Test Engineering team. He recently moved from finding bugs to making bugs as part of the Analytics team, where he works on integration between Cassandra and Spark as well as other tools.
Escape From Hadoop: Spark One-Liners for C* Ops - Russell Spitzer
Apache Cassandra and Spark, when combined, can give powerful OLTP and OLAP functionality for your data. We'll walk through the basics of both of these platforms before diving into applications combining the two. Joins, changing a partition key, or importing data are usually difficult in Cassandra, but we'll see how to do these and other operations in a set of simple Spark shell one-liners!
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...DataStax
We built an application based on the principles of CQRS and Event Sourcing using Cassandra and Spark. During the project we encountered a number of challenges and problems with Cassandra and the Spark Connector.
In this talk we want to outline a few of those problems and our actions to solve them. While some problems are specific to CQRS and Event Sourcing applications most of them are use case independent.
About the Speakers
Matthias Niehoff IT-Consultant, codecentric AG
Matthias works as an IT consultant at codecentric AG in Germany. His focus is on big data and streaming applications with Apache Cassandra and Apache Spark, yet he does not lose track of other tools in the big data space. Matthias shares his experiences at conferences, meetups and user groups.
Stephan Kepser Senior IT Consultant and Data Architect, codecentric AG
Dr. Stephan Kepser is an expert on cloud computing and big data. He wrote a couple of journal articles and blog posts on subjects of both fields. His interests reach from legal questions to questions of architecture and design of cloud computing and big data systems to technical details of NoSQL databases.
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016 - StampedeCon
Have you ever wanted to analyze sensor data that arrives every second from across the world? Or maybe you want to analyze intra-day trading prices of millions of financial instruments? Or take all the page views from Wikipedia and compare the hourly statistics? To do this or any other similar analysis, you will need to analyze large sequences of measurements over time. And what better way to do this than with Apache Spark? In this session we will dig into how to consume data, analyze it with Spark, and then store the results in Apache Cassandra.
A very short set of slides to describe an RDD data structure.
Extracted from my 3-day course: www.sparkInternals.com
There is also a video of this on YouTube: http://youtu.be/odcEg515Ne8
Owning Time Series with Team Apache (Strata San Jose 2015) - Patrick McFadin
Break out your laptops: this hands-on tutorial is geared around understanding the basics of how Apache Cassandra stores and accesses time series data. We'll start with an overview of how Cassandra works and how that can be a perfect fit for time series. Then we will add in Apache Spark as a perfect analytics companion. There will be coding as part of the hands-on tutorial. The goal will be to take an example application and code through the different aspects of working with this unique data pattern. The final section will cover building an end-to-end data pipeline to ingest, process and store high-speed time series data.
An Introduction to Time Series with Team Apache - Patrick McFadin
We as an industry are collecting more data every year. IoT, web, and mobile applications send torrents of bits to our data centers that have to be processed and stored, even as users expect an always-on experience—leaving little room for error. Patrick McFadin explores how successful companies do this every day using the powerful Team Apache: Apache Kafka, Spark, and Cassandra.
Patrick walks you through organizing a stream of data into an efficient queue using Apache Kafka, processing the data in flight using Apache Spark Streaming, storing the data in a highly scaling and fault-tolerant database using Apache Cassandra, and transforming and finding insights in volumes of stored data using Apache Spark.
Topics include:
- Understanding the right use case
- Considerations when deploying Apache Kafka
- Processing streams with Apache Spark Streaming
- A deep dive into how Apache Cassandra stores data
- Integration between Cassandra and Spark
- Data models for time series
- Postprocessing without ETL using Apache Spark on Cassandra
These slides give an overview of the different parts of Apache Spark.
We analyze the Spark shell in both Scala and Python, then cover Spark SQL with an introduction to the DataFrame API, and finally describe Spark Streaming and walk through some code examples.
Topics: spark-shell, pyspark, HDFS, how to copy files to HDFS, Spark transformations, Spark actions, Spark SQL (Shark), Spark Streaming, stateless vs. stateful streaming transformations, sliding windows, examples.
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
A technical introduction to Apache Spark, the Swiss Army knife of big data analytics tools.
The talk was held at the Big Data User Group Mannheim, Germany, on 24 November 2014.
Apache Spark for Library Developers with Erik Erlandson and William Benton - Databricks
As a developer, data engineer, or data scientist, you’ve seen how Apache Spark is expressive enough to let you solve problems elegantly and efficient enough to let you scale out to handle more data. However, if you’re solving the same problems again and again, you probably want to capture and distribute your solutions so that you can focus on new problems and so other people can reuse and remix them: you want to develop a library that extends Spark.
You faced a learning curve when you first started using Spark, and you’ll face a different learning curve as you start to develop reusable abstractions atop Spark. In this talk, two experienced Spark library developers will give you the background and context you’ll need to turn your code into a library that you can share with the world. We’ll cover: Issues to consider when developing parallel algorithms with Spark, Designing generic, robust functions that operate on data frames and datasets, Extending data frames with user-defined functions (UDFs) and user-defined aggregates (UDAFs), Best practices around caching and broadcasting, and why these are especially important for library developers, Integrating with ML pipelines, Exposing key functionality in both Python and Scala, and How to test, build, and publish your library for the community.
We’ll back up our advice with concrete examples from real packages built atop Spark. You’ll leave this talk informed and inspired to take your Spark proficiency to the next level and develop and publish an awesome library of your own.
Using Spark 1.2 with Java 8 and Cassandra - Denis Dus
A brief introduction to the Spark data processing model and a comparison of Java 7 and Java 8 usage with Spark, with examples of loading and processing data with the Spark Cassandra Loader.
Reactive App Using the Actor Model and Apache Spark - Rahul Kumar
Developing big data applications is really challenging work: scaling, fault tolerance and responsiveness are some of the biggest challenges. A real-time big data application with self-healing features is a dream these days. Apache Spark is a fast in-memory data processing system that provides a good backend for real-time applications. In this talk I will show how to use a reactive platform, the actor model and the Apache Spark stack to develop a system that is responsive, resilient, fault tolerant and message driven.
Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ
Zero to Streaming: Spark and Cassandra
1. From 0 to Streaming
Cassandra and Spark Streaming
Russell Spitzer
2. Who am I?
• Bioinformatics Ph.D from UCSF
• Works on the integration of Cassandra (C*) with Hadoop, Solr, and SPARK!
• Spends a lot of time spinning up clusters on EC2, GCE, Azure, … http://www.datastax.com/dev/blog/testing-cassandra-1000-nodes-at-a-time
• Writing FAQ's for Spark Troubleshooting: http://www.datastax.com/dev/blog/common-spark-troubleshooting
3. From 0 to Streaming
Spark
How does it work?
What are the main Components?
Cluster Layout
Spark Submit
4. From 0 to Streaming
Connecting Cassandra To Spark
Spark Cassandra Connector
Spark SQL
RDD Basics
Spark
How does it work?
What are the main Components?
Cluster Layout
Spark Submit
5. From 0 to Streaming
Connecting Cassandra To Spark
Spark Cassandra Connector
Spark SQL
RDD Basics
Spark Streaming
Streaming Basics
Writing Streaming Applications
Custom Receivers
Spark
How does it work?
What are the main Components?
Cluster Layout
Spark Submit
7. Spark is a Distributed Analytics Platform
HADOOP
• Has generalized DAG execution
• Integrated SQL queries
• Streaming
• Easy abstraction for datasets
• Support in lots of languages
All in one package!
8. Spark Provides a Simple and Efficient Framework for Distributed Computations
Node roles: 2
In-memory caching: Yes!
Generic DAG execution: Yes!
Great abstraction for datasets? RDD!
[Diagram: a Spark Master coordinating Spark Workers, whose Spark Executors hold the Spark Partitions of a Resilient Distributed Dataset]
9.–11. Spark Provides a Simple and Efficient Framework for Distributed Computations
Spark Master: Assigns cluster resources to applications
Spark Worker: Manages executors running on a machine
Spark Executor: Started by the Worker; the workhorse of the Spark application
[Diagram, repeated on each of these slides: a Spark Master coordinating Spark Workers, whose Spark Executors hold the Spark Partitions of a Resilient Distributed Dataset]
12.–13. RDDs Can Be Generated from a Variety of Sources
• Text files
• Parallelized collections
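A minimal Spark shell sketch of both source types (the file path is made up for illustration):

val fromFile = sc.textFile("/tmp/numbers.txt")   // one element per line of the file
val fromCollection = sc.parallelize(1 to 100)    // distribute a local Scala collection
fromCollection.count                             // actions like count trigger evaluation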
14.–20. Transformations and Actions
RDDs are immutable
New RDDs are created with transformations
Only when we call an action are the transformations applied

val rdd  = sc.textFile("num.txt")
val rdd2 = rdd.map( x => x.toInt * 2 )
val rdd3 = rdd2.filter( _ > 4 )
rdd3.collect

[Diagram build across these slides: textFile creates rdd; map transforms it into rdd2; filter transforms that into rdd3; only the collect ACTION causes the chain of transformations to be evaluated]
21.–26. Application of Transformations Is Done One Partition per Executor
[Diagram build: an RDD with partitions 1–9 is spread across two Executors; each Executor applies the transformation to its own partitions (1 -> 1', 2 -> 2', …) until all partitions of RDD' are produced]
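A quick way to see this per-partition behaviour from the shell; a small sketch with made-up data:

val nums = sc.parallelize(1 to 9, 3)                 // an RDD with 3 partitions
nums.mapPartitionsWithIndex { (part, it) =>
  it.map(x => s"partition $part -> ${x * 2}")        // each partition is transformed independently
}.collect.foreach(println)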
27.–29. Failed Transformations Can Be Redone by Reapplying the Transformation to the Old Partition
[Diagram build: a node failure loses partition 5' of RDD'; the transformation is reapplied to partition 5 of the original RDD to recreate 5']
Because the operations on any partition can be traced backwards, we can recover from a failure without recomputing the entire RDD.
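One way to peek at the lineage Spark keeps for this recovery is toDebugString; a sketch using the RDDs from the earlier example:

rdd3.toDebugString   // prints the chain of parent RDDs: filter <- map <- textFile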
30. Use the Spark Shell to Quickly Try Out Code Samples
Available in both the Spark Shell and PySpark
31. The SparkContext Is the Core API for All Communication with Spark
val conf = new SparkConf()
.setAppName(appName)
.setMaster(master)
.set("spark.cassandra.auth.username", "cassandra")
.set("spark.cassandra.auth.password", "cassandra")
new SparkContext(conf)
Almost all options can also be set as environment
variables or on the command line during spark-submit!
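For example, the same Cassandra settings can be passed on the command line at submit time instead of in code; a sketch in which the host address and jar name are made up:

spark-submit --conf spark.cassandra.connection.host=10.0.0.1 \
             --conf spark.cassandra.auth.username=cassandra \
             --conf spark.cassandra.auth.password=cassandra \
             --class MainClass MyApp.jar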
32.–33. Deploy Compiled Jars Using Spark Submit
https://spark.apache.org/docs/1.1.0/submitting-applications.html
Some of the commonly used options are:
--class: The entry point for your application
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
--conf: Arbitrary Spark configuration property in key=value format.

spark-submit --class MainClass JarYouWantDistributedToExecutor.jar

[Diagram: spark-submit sends the jar to the Spark Master, which distributes it to the Spark Workers]
34. Co-locate Spark and C* for Best Performance
[Diagram: the Spark Master and Spark Workers running on the same nodes as the C* ring]
Running Spark Workers on the same nodes as your C* cluster will save network hops when reading and writing
35. Use a Separate Datacenter for Your Analytics Workloads
[Diagram: an OLTP C* datacenter replicating to an OLAP C* datacenter that runs the Spark Master and Workers]
37. DataStax OSS Connector: Spark to Cassandra
https://github.com/datastax/spark-cassandra-connector
[Diagram: Cassandra keyspaces and tables map to Spark RDD[CassandraRow] and RDD[Tuples]]
Bundled and supported with DSE > 4.5!
38. The Spark Cassandra Connector Uses the DataStax Java Driver to Read from and Write to C*
Each Executor maintains a connection to the C* cluster through the DataStax Java Driver.
[Diagram: the full token range is divided into splits (tokens 1–1000, tokens 1001–2000, …); RDDs are read as different splits based on these sets of tokens]
39. Setting up C* and Spark
DSE > 4.5.0
Just start your nodes with
dse cassandra -k
Apache Cassandra
Follow the excellent guide by Al Tobey
http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
40. Several Easy Ways To Use the
Spark Cassandra Connector
• SparkSQL
• Scala
• Java
• RDD Manipulation
• Scala
• Java
• Python
41. Requirements for the Following Code Examples
The following examples are targeted at:
Spark 1.1.X
Cassandra 2.0.X
or if you are using DataStax Enterprise
DSE 4.6.x
42.–44. Basics: Getting a Table and Counting

CREATE KEYSPACE candy WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
use candy;
CREATE TABLE inventory ( brand text, name text, amount int, PRIMARY KEY (brand, name) );
CREATE TABLE requests ( user text, name text, amount int, PRIMARY KEY (user, name) );
INSERT INTO inventory (brand, name, amount) VALUES ( 'Wonka', 'Gobstopper', 10 );
INSERT INTO inventory (brand, name, amount) VALUES ( 'Wonka', 'WonkaBar', 3 );
INSERT INTO inventory (brand, name, amount) VALUES ( 'CandyTown', 'SugarMountain', 2 );
INSERT INTO inventory (brand, name, amount) VALUES ( 'CandyTown', 'ChocoIsland', 5 );
INSERT INTO requests (user, name, amount) VALUES ( 'Russ', 'WonkaBar', 2 );
INSERT INTO requests (user, name, amount) VALUES ( 'Russ', 'ChocoIsland', 1 );

scala> val rdd = sc.cassandraTable("candy","inventory")
scala> rdd.count
res13: Long = 4
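The connector can also project just the columns you need so that less data is pulled out of C*; a small sketch against the same table:

scala> sc.cassandraTable("candy","inventory").select("name", "amount").collect
// each CassandraRow now contains only the name and amount columns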
49.–56. Getting Values from Cassandra Rows

scala> sc.cassandraTable("candy","inventory").take(1)(0).get[Int]("amount")
res5: Int = 10

[Diagram: cassandraTable -> take(1) returns an Array of CassandraRows -> get[Int]("amount") extracts the value 10 from the row (Wonka, Gobstopper, 10)]

The same row can instead be read into a case class, so columns are accessed as fields:

scala> case class invRow(brand: String, name: String, amount: Integer)
scala> sc.cassandraTable[invRow]("candy","inventory").take(1)(0).amount

[Diagram: cassandraTable[invRow] -> take(1) returns an Array of invRows -> .amount extracts 10]

Supported type mappings: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkSupportedTypes.html
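Rows can also be read directly as Scala tuples when only a couple of columns are needed; a small sketch against the same candy.inventory table:

scala> sc.cassandraTable[(String, Int)]("candy","inventory").select("name", "amount").take(1)
// each row is returned as a (name, amount) pair instead of a CassandraRow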
57.–61. Saving Back to Cassandra

CREATE TABLE low ( brand text, name text, amount int, PRIMARY KEY ( brand, name ));

sc.cassandraTable[invRow]("candy","inventory")
  .filter( _.amount < 5 )
  .saveToCassandra("candy","low")

[Diagram: cassandraTable -> filter on amount < 5 (using the anonymous parameter _) -> saveToCassandra writes the matching rows back to the C* cluster]
Under the hood this is done via the Cassandra Java Driver
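RDDs that did not come from Cassandra can be written the same way; a sketch that saves a plain Scala collection into the inventory table (the new row values are made up):

import com.datastax.spark.connector._
sc.parallelize(Seq(("Wonka", "Runts", 8)))
  .saveToCassandra("candy", "inventory", SomeColumns("brand", "name", "amount"))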
62. Several Easy Ways To Use the
Spark Cassandra Connector
• SparkSQL
• Scala
• Java
• RDD Manipulation
• Scala
• Java
• Python
63. Spark SQL Provides a Fast SQL-Like Syntax for Cassandra!
[Diagram: HQL/SQL -> Catalyst -> Query Plan (grab data, filter, group, return results) -> SchemaRDD]
SQL in, RDDs out
64. Building a Context Object for Interacting with Spark SQL
In the DSE Spark Shell both the HiveContext and the CassandraSQLContext are created automatically on startup.

Scala:
import org.apache.spark.sql.cassandra.CassandraSQLContext
val sc: SparkContext = ...
val csc = new CassandraSQLContext(sc)

Java:
JavaSparkContext jsc = new JavaSparkContext(conf);
// create a Cassandra Spark SQL context
CassandraSQLContext csc = new CassandraSQLContext(jsc.sc());

Since the HiveContext requires the Hive driver to access C* directly, the HiveContext is only available in DSE.
Workaround: get SchemaRDDs with the CassandraSQLContext, then register them with the HiveContext.
65.–67. Reading Data from Cassandra with SQL Syntax

scala> csc.sql("SELECT * FROM candy.inventory").collect
Array[org.apache.spark.sql.Row] = Array(
  [Wonka,Gobstopper,10],
  [Wonka,WonkaBar,3],
  [CandyTown,ChocoIsland,5],
  [CandyTown,SugarMountain,2]
)

[Diagram: the SQL text becomes a QueryPlan, which executes and returns a SchemaRDD]
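Ordinary SQL predicates work as well; a short sketch filtering the same table:

scala> csc.sql("SELECT name, amount FROM candy.inventory WHERE amount > 3").collect
// with the sample data this returns only Gobstopper (10) and ChocoIsland (5)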
68.–69. Counting Data from Cassandra with SQL Syntax

scala> csc.sql("SELECT COUNT(*) FROM candy.inventory").collect
res5: Array[org.apache.spark.sql.Row] = Array([4])
70.–71. Joining Data from Cassandra with SQL Syntax

scala> csc.sql("
  SELECT * FROM candy.inventory as inventory
  JOIN candy.requests as requests
  WHERE inventory.name = requests.name").collect
res12: Array[org.apache.spark.sql.Row] = Array(
  [Wonka,WonkaBar,3,Russ,WonkaBar,2],
  [CandyTown,ChocoIsland,5,Russ,ChocoIsland,1]
)
72.–73. Insert into Another Cassandra Table

csc.sql("
  INSERT INTO candy.low
  SELECT * FROM candy.inventory as inv
  WHERE inv.amount < 5
").collect
75.–78. Streaming Is Cool
(and if you like streaming you will be cool too)
Your data is delicious, like candy. You want it right now!
Batch analytics: Waiting to do analysis after data has accumulated means data may be out of date or unimportant by the time we process it.
Streaming analytics: We do our analytics on the data as it arrives. The data won't be stale and neither will our analytics.
79.–80. DStreams: The Basic Unit of Spark Streaming
[Diagram: a Receiver turns incoming events into a DStream, which is divided into batches of RDDs]
Streaming involves a receiver or set of receivers, each of which publishes a DStream.
The DStream is discretized into batches, whose timing is set in the Spark Streaming Context. Each batch is made up of RDDs.
83. Demo Streaming Application: Analyze HttpRequests with Spark Streaming
[Diagram: HTTP server traffic is received by Spark Executors and written to Cassandra]
Source included in DSE 4.6.0
84. Spark Receivers Only Really Need to Describe How to Publish to a DStream

case class HttpRequest(
    timeuuid: UUID,
    method: String,
    headers: Map[String, List[String]],
    uri: URI,
    body: String)

First we need to define a case class to make moving HttpRequest information around easier. This type will be used to specify what type of DStream we are creating.
85. Spark Receivers Only Really Need to Describe How to Publish to a DStream

class HttpReceiver(port: Int)
  extends Receiver[HttpRequest](StorageLevel.MEMORY_AND_DISK_2)
  with Logging {

  def onStart(): Unit = {}
  def onStop(): Unit = {}
}

Now we just need to write the code for a receiver to actually publish these HttpRequest objects.
86. Spark Receivers Only Really Need to Describe How to Publish to a DStream

import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}

def onStart(): Unit = {
  val s = HttpServer.create(new InetSocketAddress(p), 0)
  s.createContext("/", new StreamHandler())
  s.start()
  server = Some(s)
}

def onStop(): Unit = server map (_.stop(0))

This will start up our server and direct all HTTP traffic to be handled by StreamHandler.
[Diagram: the Receiver[HttpRequest] now contains an HttpServer]
87. Spark Receivers Only Really Need to Describe How to Publish to a DStream

class StreamHandler extends HttpHandler {
  override def handle(transaction: HttpExchange): Unit = {
    val dataReader = new BufferedReader(new InputStreamReader(transaction.getRequestBody))
    val data = Stream.continually(dataReader.readLine).takeWhile(_ != null).mkString("\n")
    val headers: Map[String, List[String]] =
      transaction.getRequestHeaders.toMap.map { case (k, v) => (k, v.toList) }
    store(HttpRequest(
      UUIDs.timeBased(),
      transaction.getRequestMethod,
      headers,
      transaction.getRequestURI,
      data))
    transaction.sendResponseHeaders(200, 0)
    val response = transaction.getResponseBody
    response.close()      // Empty response body
    transaction.close()   // Finish transaction
  }
}

StreamHandler actually does the work of publishing events to the DStream.
[Diagram: the Receiver[HttpRequest] contains an HttpServer and a StreamHandler]
88.–90. The Streaming Context Sets the Batch Timing; Create One Receiver per Node and Merge the Separate DStreams into One

val ssc = new StreamingContext(conf, Seconds(5))
val multipleStreams = (1 to config.numDstreams).map { i =>
  ssc.receiverStream[HttpRequest](new HttpReceiver(config.port))
}
val requests = ssc.union(multipleStreams)

[Diagram: one Receiver[HttpRequest] (HttpServer + StreamHandler) per node; ssc.union merges them into the single requests DStream]
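A quick way to sanity-check the merged stream while developing is to count each batch; a minimal sketch:

requests.count().print()   // prints the number of HttpRequests received in each 5-second batch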
91.–93. Cassandra Tables to Store HttpEvents

Persist every event that comes into the system:
CREATE TABLE IF NOT EXISTS timeline (
  timesegment bigint,
  url text,
  t_uuid timeuuid,
  method text,
  headers map<text, text>,
  body text,
  PRIMARY KEY ((url, timesegment), t_uuid)
);

Table for counting the number of accesses to each url over time:
CREATE TABLE IF NOT EXISTS method_agg (
  url text,
  method text,
  time timestamp,
  count bigint,
  PRIMARY KEY ((url, method), time)
);

Table for finding the most popular url in each batch:
CREATE TABLE IF NOT EXISTS sorted_urls (
  url text,
  time timestamp,
  count bigint,
  PRIMARY KEY (time, count)
);
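The save code on the following slides maps each event into row case classes (timelineRow, methodAggRow, sortedUrlRow) that are not shown in the transcript; a plausible sketch, assuming the fields simply mirror the table columns above and that batch time is passed as epoch milliseconds:

import java.util.UUID
case class timelineRow(timesegment: Long, url: String, t_uuid: UUID, method: String,
                       headers: Map[String, String], body: String)
case class methodAggRow(url: String, method: String, time: Long, count: Long)
case class sortedUrlRow(url: String, time: Long, count: Long)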
94.–96. Persist the Events Without Doing Any Manipulation

requests.map { request =>
  timelineRow(
    timesegment = UUIDs.unixTimestamp(request.timeuuid) / 10000L,
    url         = request.uri.toString,
    t_uuid      = request.timeuuid,
    method      = request.method,
    headers     = request.headers.map { case (k, v) => (k, v.mkString("#")) },
    body        = request.body)
}.saveToCassandra("requests_ks", "timeline")

[Diagram: each HttpRequest becomes a timelineRow (timesegment, url, t_uuid, method, headers, body), which saveToCassandra writes to the C* cluster]
97.–101. Aggregate Requests by URI and Method

requests.map(request => (request.method, request.uri.toString))
  .countByValue()
  .transform((rdd, time) => rdd.map { case ((m, u), c) => ((m, u), c, time.milliseconds) })
  .map { case ((m, u), c, t) => methodAggRow(time = t, url = u, method = m, count = c) }
  .saveToCassandra("requests_ks", "method_agg")

[Diagram: (method, uri) pairs -> countByValue produces counts -> transform attaches the batch time -> methodAggRow rows are written to C* with saveToCassandra]
102.–105. Sort Aggregates by Batch

requests.map(request => (request.uri.toString))
  .countByValue()
  .transform((rdd, time) => rdd.map { case (u, c) => (u, c, time.milliseconds) })
  .map { case (u, c, t) => sortedUrlRow(time = t, url = u, count = c) }
  .saveToCassandra("requests_ks", "sorted_urls")

[Diagram: uri -> countByValue produces (uri, count) -> transform attaches the batch time -> sortedUrlRow rows are written to C* with saveToCassandra]
Let Cassandra do the sorting! PRIMARY KEY (time, count)
106. Start the application!
ssc.start()
ssc.awaitTermination()
This will start the streaming application
piping all incoming data to Cassandra!
107. Live Demo
Demo run Script
#Start Streaming Application
echo "Starting Streaming Receiver(s): Logging to http_receiver.log"
cd HttpSparkStream
dse spark-submit --class com.datastax.HttpSparkStream target/HttpSparkStream.jar -d $NUM_SPARK_NODES > ../http_receiver.log 2>&1 &
cd ..
echo "Waiting for 60 Seconds for streaming to come online"
sleep 60
#Start Http Requester
echo "Starting to send requests against streaming receivers: Logging to http_requester.log"
cd HttpRequestGenerator
./sbt/sbt "run -i $SPARK_NODE_IPS " > ../http_requester.log 2>&1 &
cd ..
#Monitor Results Via Cqlsh
watch -n 5 './monitor_queries.sh'
109. I hope this gives you some
exciting ideas for your
applications!
Questions?
110. Thanks for coming to the meetup!!
DataStax Academy offers free online Cassandra training!
Planet Cassandra has resources for learning the basics, from 'Try Cassandra' tutorials to in-depth language and migration pages!
Find a way to contribute back to the community: talk at a meetup, or share your story on PlanetCassandra.org!
Need help? Get questions answered with Planet Cassandra’s free virtual office hours running weekly!
Email us: Community@DataStax.com!
Getting started with Cassandra?!
In production?!
Tweet us: @PlanetCassandra!