Streaming Big Data: Delivering Meaning In Near-Real Time At High Velocity At Massive Scale with Apache Spark, Apache Kafka, Apache Cassandra, Akka and the Spark Cassandra Connector. Why this pairing of technologies, and how easy it is to implement. Example application: https://github.com/killrweather/killrweather
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark - Evan Chan
Take your analytics to the next level by using Apache Spark to accelerate complex interactive analytics using your Apache Cassandra data. Includes an introduction to Spark as well as how to read Cassandra tables in Spark.
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ... - DataStax Academy
This session covers our experience with using the Spark and Shark frameworks for running real-time queries on top of Cassandra data. We will start by surveying the current Cassandra analytics landscape, including Hadoop and Hive, and touch on the use of custom input formats to extract data from Cassandra. We will then dive into Spark and Shark, two memory-based cluster computing frameworks, and how they enable often dramatic improvements in query speed and productivity over today's standard solutions.
How do you rapidly derive complex insights on top of really big data sets in Cassandra? This session draws upon Evan's experience building a distributed, interactive, columnar query engine on top of Cassandra and Spark. We will start by surveying the existing query landscape of Cassandra and discuss ways to integrate Cassandra and Spark. We will dive into the design and architecture of a fast, column-oriented query architecture for Spark, and why columnar stores are so advantageous for OLAP workloads. I will present a schema for Parquet-like storage of analytical datasets on Cassandra. Find out why Cassandra and Spark are the perfect match for enabling fast, scalable, complex querying and storage of big analytical data.
Apache Spark has emerged over the past year as the imminent successor to Hadoop MapReduce. Spark can process data in memory at very high speed, while still being able to spill to disk if required. Spark’s powerful yet flexible API allows users to write complex applications very easily without worrying about the internal workings and how the data gets processed on the cluster.
Spark comes with an extremely powerful Streaming API to process data as it is ingested. Spark Streaming integrates with popular data ingest systems like Apache Flume, Apache Kafka, Amazon Kinesis, etc., allowing users to process data as it comes in.
In this talk, Hari will discuss the basics of Spark Streaming, its API and its integration with Flume, Kafka and Kinesis. Hari will also discuss a real-world example of a Spark Streaming application, and how code can be shared between a Spark application and a Spark Streaming application. Each stage of the application execution will be presented, which can help the audience understand best practices for writing such an application. Hari will finally discuss how to write a custom application and a custom receiver to receive data from other systems.
Everyone in the Scala world is using or looking into using Akka for low-latency, scalable, distributed or concurrent systems. I'd like to share my story of developing and productionizing multiple Akka apps, including low-latency ingestion and real-time processing systems, and Spark-based applications.
When does one use actors vs futures?
Can we use Akka with, or in place of, Storm?
How did we set up instrumentation and monitoring in production?
How does one use VisualVM to debug Akka apps in production?
What happens if the mailbox gets full?
What is our Akka stack like?
I will share best practices for building Akka and Scala apps, pitfalls and things we'd like to avoid, and a vision of where we would like to go for ideal Akka monitoring, instrumentation, and debugging facilities. Plus backpressure and at-least-once processing.
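The mailbox question above is easiest to answer with a toy. Below is a minimal Python sketch (standing in for Akka's Scala API; all names here are illustrative, not Akka's) of an actor with a bounded mailbox that drops new messages on overflow, one of several overflow strategies real Akka offers alongside blocking and dead letters.

```python
import queue
import threading

class Actor:
    """Toy actor with a bounded mailbox, loosely mirroring an Akka actor."""

    def __init__(self, handler, mailbox_size=2):
        self.mailbox = queue.Queue(maxsize=mailbox_size)
        self.handler = handler
        self.dropped = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, msg):
        """Enqueue a message; on overflow, drop it (a DropNew-style strategy)."""
        try:
            self.mailbox.put_nowait(msg)
        except queue.Full:
            self.dropped += 1  # without backpressure, overload means lost messages

    def _run(self):
        while True:
            msg = self.mailbox.get()
            if msg is None:  # poison pill ends the actor
                break
            self.handler(msg)

    def stop(self):
        self.mailbox.put(None)  # blocks until there is room, then shuts down
        self._thread.join()

processed = []
actor = Actor(processed.append, mailbox_size=2)
for i in range(100):
    actor.tell(i)
actor.stop()
# Every message was either handled or counted as dropped; none vanish silently.
assert len(processed) + actor.dropped == 100
```

This is exactly why backpressure matters: a bounded mailbox turns overload into an explicit choice (drop, block, or signal the sender) instead of unbounded memory growth.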
Analyzing Time Series Data with Apache Spark and Cassandra - Patrick McFadin
You have collected a lot of time series data, so now what? It's not going to be useful unless you can analyze what you have. Apache Spark has become the heir apparent to MapReduce, but did you know you don't need Hadoop? Apache Cassandra is a great data source for Spark jobs! Let me show you how it works, how to get useful information and, the best part, storing analyzed data back into Cassandra. That's right: kiss your ETL jobs goodbye and let's get to analyzing. This is going to be an action-packed hour of theory, code and examples, so caffeine up and let's go.
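The read-analyze-write-back loop the talk promises can be sketched in miniature. This is a hedged Python stand-in where plain lists and dicts play the roles of the raw and rollup Cassandra tables; the table shapes and sensor names are hypothetical:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical stand-ins for two Cassandra tables:
# raw_readings(sensor, epoch_seconds, value) and hourly_rollups((sensor, hour) -> avg)
raw_readings = [
    ("sensor-1", 3600 * 0 + 100, 20.0),
    ("sensor-1", 3600 * 0 + 900, 22.0),
    ("sensor-1", 3600 * 1 + 10, 30.0),
]

# The "Spark job": group raw points into hourly buckets per sensor...
buckets = defaultdict(list)
for sensor, ts, value in raw_readings:
    buckets[(sensor, ts // 3600)].append(value)

# ...then write the aggregates back, as the talk does with the connector's save.
hourly_rollups = {key: mean(vals) for key, vals in buckets.items()}

print(hourly_rollups[("sensor-1", 0)])  # → 21.0, the first hour's average
```

The point of "kiss your ETL jobs goodbye" is that both the read and the write-back happen against the same store, with no intermediate export step.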
Alpine academy apache spark series #1 introduction to cluster computing wit... - Holden Karau
Alpine Academy Apache Spark series #1: introduction to cluster computing with Python and a wee bit of Scala. This is the first in the series and is aimed at the intro level; the next one will cover MLlib & ML.
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S... - Helena Edelson
Regardless of the meaning we are searching for in our vast amounts of data, whether we are in science, finance, technology, energy or health care, we all share the same problems that must be solved: How do we derive that meaning? What technologies best support the requirements? This talk is about how to leverage fast access to historical data together with real-time streaming data for predictive modeling in a lambda architecture with Spark Streaming, Kafka, Cassandra, Akka and Scala: efficient stream computation, composable data pipelines, data locality, the Cassandra data model and low latency, Kafka producers and HTTP endpoints as Akka actors...
Real time Analytics with Apache Kafka and Apache Spark - Rahul Jain
A presentation-cum-workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications very quickly and easily. It supports both Java and Scala. In this workshop we are going to explore Apache Kafka, ZooKeeper and Spark with a web clickstream example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
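The clickstream example can be sketched without any of the actual infrastructure. In this minimal Python stand-in, a list of (timestamp, element) pairs plays the role of the Kafka topic, and tumbling windows mimic DStream batch intervals; the paths and window size are made up for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical click events as (epoch_seconds, element_clicked), standing in
# for messages consumed from a Kafka topic.
clicks = [(0, "/home"), (3, "/home"), (7, "/cart"), (12, "/home"), (15, "/cart")]

WINDOW = 10  # tumbling 10-second windows, like a DStream batch interval

windows = defaultdict(Counter)
for ts, path in clicks:
    windows[ts // WINDOW][path] += 1

# Window 0 covers t in [0, 10): two /home clicks and one /cart click.
print(dict(windows[0]))  # → {'/home': 2, '/cart': 1}
```

The real pipeline distributes exactly this computation: Kafka partitions the event feed, and Spark Streaming applies the per-window counting in parallel.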
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa... - Helena Edelson
O'Reilly webcast with Evan Chan and myself on the new SNACK stack (a play on SMACK) with FiloDB: Scala, Spark Streaming, Akka, Cassandra, FiloDB and Kafka.
Reactive app using actor model & Apache Spark - Rahul Kumar
Developing applications with big data is really challenging work; scaling, fault tolerance and responsiveness are some of the biggest challenges. A real-time big data application with self-healing features is a dream these days. Apache Spark is a fast in-memory data processing system that makes a good backend for real-time applications. In this talk I will show how to use a reactive platform, the actor model and the Apache Spark stack to develop a system that is responsive, resilient, fault-tolerant and message-driven.
Breakthrough OLAP performance with Cassandra and Spark - Evan Chan
Find out about breakthrough architectures for fast OLAP performance querying Cassandra data with Apache Spark, including a new open source project, FiloDB.
Muvr is a real-time personal trainer system. It must be highly available, resilient and responsive, and so it relies heavily on Spark, Mesos, Akka, Cassandra, and Kafka—the quintuple also known as the SMACK stack. In this talk, we are going to explore the architecture of the entire muvr system, exploring, in particular, the challenges of ingesting very large volumes of data, applying trained models to the data to provide real-time advice to our users, and training and evaluating new models using the collected data. We will specifically emphasize how we have used Cassandra to consume lots of fast incoming biometric data from devices and sensors, and how to securely access the big data sets from Cassandra in Spark to compute the models.
We will finish by showing the mechanics of deploying such a distributed application. You will get a clear understanding of how Mesos and Marathon, in conjunction with Docker, are used to build an immutable infrastructure that allows us to provide reliable service to our users and a great environment for our engineers.
Spark Streaming makes it easy to build scalable fault-tolerant streaming applications. In this webinar, developers will learn:
* How Spark Streaming works - a quick review.
* Features in Spark Streaming that help prevent potential data loss.
* Complementary tools in a streaming pipeline - Kafka and Akka.
* Design and tuning tips for Reactive Spark Streaming applications.
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori... - CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2L6bZbn
This CloudxLab Introduction to Spark Streaming & Apache Kafka tutorial helps you to understand Spark Streaming and Kafka in detail. Below are the topics covered in this tutorial:
1) Spark Streaming - Workflow
2) Use Cases - E-commerce, Real-time Sentiment Analysis & Real-time Fraud Detection
3) Spark Streaming - DStream
4) Word Count Hands-on using Spark Streaming
5) Spark Streaming - Running Locally Vs Running on Cluster
6) Introduction to Apache Kafka
7) Apache Kafka Hands-on on CloudxLab
8) Integrating Spark Streaming & Kafka
9) Spark Streaming & Kafka Hands-on
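Item 4 in the list above, the word-count hands-on, boils down to applying the same counting logic to each micro-batch while carrying running state forward, in the spirit of Spark Streaming's stateful counting. A minimal Python sketch of that idea (not actual Spark code; the batch contents are made up):

```python
from collections import Counter

# Three "micro-batches" of lines, as Spark Streaming would deliver per interval.
batches = [
    ["spark streaming reads kafka"],
    ["kafka feeds spark"],
    ["spark again"],
]

# Running totals across batches, the role updateStateByKey plays in the hands-on.
totals = Counter()
for batch in batches:
    for line in batch:
        totals.update(line.split())

print(totals["spark"])  # → 3
```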
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark - Evan Chan
You want to ingest event, time-series and streaming data easily, yet have flexible, fast ad-hoc queries. Is this even possible? Yes! Find out how in this talk on combining Apache Cassandra and Apache Spark using a new open-source database, FiloDB.
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark - DataStax Academy
Presenter: Evan Chan, Principal Software Engineer at Socrata Inc.
How do you rapidly derive complex insights on top of really big data sets in Cassandra? This session draws upon Evan's experience building a distributed, interactive, columnar query engine on top of Cassandra and Spark. We will start by surveying the existing query landscape of Cassandra and discuss ways to integrate Cassandra and Spark. We will dive into the design and architecture of a fast, column-oriented query architecture for Spark, and why columnar stores are so advantageous for OLAP workloads. I will present a schema for Parquet-like storage of analytical datasets on Cassandra. Find out why Cassandra and Spark are the perfect match for enabling fast, scalable, complex querying and storage of big analytical data.
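The columnar argument in this abstract is easy to see in miniature. The Python sketch below contrasts a row-oriented layout with a column-oriented (Parquet-like) one for a simple OLAP aggregate; the schema and values are made up for illustration:

```python
# Row-oriented: each record stored together; an aggregate over one column
# still has to walk every whole record.
rows = [
    {"user": "a", "amount": 10, "country": "US"},
    {"user": "b", "amount": 25, "country": "DE"},
    {"user": "c", "amount": 5,  "country": "US"},
]
row_total = sum(r["amount"] for r in rows)

# Column-oriented: each column is its own contiguous array, so an aggregate
# reads only the bytes it needs, and uniform columns compress far better.
columns = {
    "user": ["a", "b", "c"],
    "amount": [10, 25, 5],
    "country": ["US", "DE", "US"],
}
col_total = sum(columns["amount"])

assert row_total == col_total == 40  # same answer, very different I/O profile
```

On disk the difference is dramatic: an OLAP scan of one column out of hundreds touches a tiny fraction of the data in the columnar layout, which is the advantage the talk builds on.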
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac... - Helena Edelson
Building self-healing, intelligent platforms: systems that learn, multi-datacenter, removing human intervention with ML. Reactive Summit 2016, @helenaedelson.
Streaming Data Analysis and Online Learning by John Myles White - Hakka Labs
John Myles White surveys some basic methods for analyzing data in a streaming manner with a focus on using stochastic gradient descent (SGD) to fit models to data sets that arrive in small chunks. Some basic implementation issues are shown, demonstrating the effectiveness of SGD for problems like linear and logistic regression as well as matrix factorization.
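The approach White describes, fitting a model with SGD as data arrives in chunks, can be sketched in a few lines of plain Python. The data, learning rate and single-pass schedule here are illustrative assumptions, not from the talk:

```python
import math
import random

def sgd_logistic(chunks, lr=0.1):
    """Fit logistic regression with SGD, consuming data one chunk at a time."""
    w, b = [0.0, 0.0], 0.0
    for chunk in chunks:              # chunks arrive one at a time, as in a stream
        for (x1, x2), y in chunk:     # one gradient step per example
            p = 1.0 / (1.0 + math.exp(-(w[0] * x1 + w[1] * x2 + b)))
            err = p - y               # gradient of log-loss w.r.t. the logit
            w[0] -= lr * err * x1
            w[1] -= lr * err * x2
            b -= lr * err
    return w, b

# Synthetic, linearly separable data, delivered in chunks of 50 examples.
random.seed(0)
data = [((x1, x2), 1 if x1 + x2 > 1 else 0)
        for x1, x2 in ((random.random(), random.random()) for _ in range(600))]
chunks = [data[i:i + 50] for i in range(0, len(data), 50)]

w, b = sgd_logistic(chunks)
acc = sum((w[0] * x1 + w[1] * x2 + b > 0) == (y == 1)
          for (x1, x2), y in data) / len(data)
# One pass over the chunks already separates this toy data well above chance.
```

Because each update uses only the current example, the model never needs the whole data set in memory, which is what makes SGD a natural fit for streaming linear and logistic regression, and (with per-factor updates) matrix factorization.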
Relational Databases are Evolving To Support New Data Capabilities - EDB
A commissioned study conducted by Forrester Consulting on behalf of EnterpriseDB, published in January 2015, presents a case study for the evolution of relational database management systems. The study, Relational Databases are Evolving to Support New Data Capabilities, found that the majority—78%—of database decision makers wanted one solution that could handle relational and NoSQL data types.
The study finds that relational databases are evolving to address the needs of end users seeking to link unstructured and structured data types and that decision makers should look to invest in these solutions. EDB’s Postgres Plus Advanced Server, for example, addresses these needs with such capabilities as support for unstructured data types, non-durable tables, tools for large-scale data loads, and integration technologies that connect standalone NoSQL solutions with Postgres.
Watch the full video at: https://skillsmatter.com/skillscasts/6100-scala-abide-a-lint-tool-for-scala
Recently there's been a flurry of compiler plugins aimed at finding potential errors, or forbidding certain patterns, in Scala: Linter and its forks, Wart Remover, ScalaStyle. [Abide](https://github.com/scala/scala-abide) aims at providing a common frame for all such efforts.
Abide integrates with sbt and IDEs (via compiler plugins), and soon with Maven. Users can add project-specific rules, and additional rule libraries can be imported from any Ivy or Maven repository. Rules have access to the fully type-checked tree and may use quasiquotes for easy AST pattern matching.
Have you heard that all in-memory databases are equally fast but unreliable, inconsistent and expensive? This session highlights in-memory technology that busts all those myths.
Redis, the fastest database on the planet, is not simply an in-memory key-value data store, but rather a rich in-memory data-structure engine that serves the world’s most popular apps. Redis Labs’ unique clustering technology enables Redis to be highly reliable, keeping every data byte intact despite hundreds of cloud instance failures and dozens of complete data-center outages. It delivers full CP system characteristics at high performance. And with the latest Redis on Flash technology, Redis Labs achieves close to in-memory performance at 70% lower operational cost. Learn about the best uses of in-memory computing to accelerate everyday applications such as high-volume transactions, real-time analytics, IoT data ingestion and more.
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise - Patrick McFadin
Wait! Back away from the Cassandra secondary index. It’s ok for some use cases, but it’s not an easy button. "But I need to search through a bunch of columns to look for the data and I want to do some regression analysis… and I can’t model that in C*, even after watching all of Patrick McFadin's videos. What do I do?" The answer, dear developer, is in DSE Search and Analytics. With its easy Solr API and Spark integration, you can search and analyze data stored in your Cassandra database to your heart’s content. Take our hand. We will show you how.
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming - Paco Nathan
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro-batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
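The D-Stream model described above, chop the stream into half-second-or-larger intervals and run an ordinary batch computation on each one, can be sketched in plain Python. The log lines and interval here are made up, and this is not Spark code; it only illustrates the discretization and the "unified engine" point that the same batch function works on live or historical data:

```python
# A D-Stream chops a continuous stream into small batches and runs the same
# batch computation on each. The function below could equally be applied to a
# historical dataset -- that reuse is the unified-engine argument.
def count_errors(batch):
    return sum(1 for line in batch if "ERROR" in line)

stream = [
    (0.1, "INFO start"), (0.4, "ERROR disk"),   # arrives during [0.0, 0.5)
    (0.6, "ERROR net"), (0.9, "INFO ok"),       # arrives during [0.5, 1.0)
]
INTERVAL = 0.5  # seconds per micro-batch (Spark Streaming allows 500 ms and up)

batches = {}
for ts, line in stream:
    batches.setdefault(int(ts / INTERVAL), []).append(line)

per_batch = [count_errors(batches[i]) for i in sorted(batches)]
print(per_batch)  # → [1, 1]
```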
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala - Helena Edelson
Scala Days, Amsterdam, 2015: Lambda Architecture - Batch and Streaming with Spark, Cassandra, Kafka, Akka and Scala; Fault Tolerance, Data Pipelines, Data Flows, Data Locality, Akka Actors, Spark, Spark Cassandra Connector, Big Data, Asynchronous Data Flows. Time Series Data, KillrWeather, Scalable Infrastructure, Partition For Scale, Replicate For Resiliency, Parallelism, Isolation, Data Locality, Location Transparency.
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru... - Databricks
Did you know almost every feature of the Spark Cassandra Connector can be accessed without even a single monad? In this talk I’ll demonstrate how you can take advantage of Spark on Cassandra using only the SQL you already know! Learn how to register tables, ETL data, and analyze query plans, all from the comfort of your very own JDBC client. Find out how you can access Cassandra with ease from the BI tool of your choice and take your analysis to the next level. Discover the tricks of debugging and analyzing predicate pushdowns using the Spark SQL Thrift Server. Preview the latest developments in the Spark Cassandra Connector.
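The register-a-table-then-stay-in-SQL workflow this talk advocates can be illustrated with Python's stdlib sqlite3 standing in for the Spark SQL Thrift Server over a Cassandra table; the `purchases` table and its columns are hypothetical:

```python
import sqlite3

# Stand-in for "register a table, then do everything in SQL": sqlite3 here
# plays the role a JDBC client plays against the Spark SQL Thrift Server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id TEXT, amount REAL)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)",
                 [("a", 10.0), ("a", 15.0), ("b", 7.5)])

# ETL and analysis expressed entirely in SQL, no map/flatMap in sight.
rows = conn.execute(
    "SELECT user_id, SUM(amount) FROM purchases GROUP BY user_id ORDER BY user_id"
).fetchall()
print(rows)  # → [('a', 25.0), ('b', 7.5)]
```

The design point is that once the table is registered, any SQL-speaking BI tool can drive the analysis; the engine underneath (here sqlite, in the talk Spark over Cassandra) is interchangeable from the client's perspective.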
Owning time series with team apache Strata San Jose 2015 - Patrick McFadin
Break out your laptops: this hands-on tutorial is geared toward understanding the basics of how Apache Cassandra stores and accesses time series data. We'll start with an overview of how Cassandra works and why it can be a perfect fit for time series. Then we will add in Apache Spark as a perfect analytics companion. There will be coding as part of the hands-on tutorial. The goal will be to take an example application and code through the different aspects of working with this unique data pattern. The final section will cover building an end-to-end data pipeline to ingest, process and store high-speed time series data.
Jumpstart on Apache Spark 2.2 on Databricks - Databricks
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands On Labs
You will use Databricks Community Edition, which will give you unlimited free access to a ~6 GB Spark 2.x local mode cluster. And in the process, you will learn how to create a cluster, navigate in Databricks, explore a couple of datasets, perform transformations and ETL, save your data as tables and parquet files, read from these sources, and analyze datasets using DataFrames/Datasets API and Spark SQL.
Level: Beginner to intermediate, not for advanced Spark users.
Prerequisite: You will need a laptop with the Chrome or Firefox browser installed and at least 8 GB of RAM. Introductory or basic knowledge of Scala or Python is required, since the Notebooks will be in Scala; Python is optional.
Bio:
Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, LoudCloud/Opsware, VeriSign, Scalix, and ProQuest, building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.
Jump Start on Apache® Spark™ 2.x with Databricks - Databricks
Apache Spark 2.0 and subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra - DataStax Academy
This session covers our experience with using the Spark and Shark frameworks for running real-time queries on top of Cassandra data. We will start by surveying the current Cassandra analytics landscape, including Hadoop and Hive, and touch on the use of custom input formats to extract data from Cassandra. We will then dive into Spark and Shark, two memory-based cluster computing frameworks, and how they enable often dramatic improvements in query speed and productivity over today's standard solutions.
This introductory workshop is aimed at data analysts and data engineers new to Apache Spark, and shows them how to analyze big data with Spark SQL and DataFrames.
In these partly instructor-led, partly self-paced labs, we will cover Spark concepts, and you'll do labs for Spark SQL and DataFrames in Databricks Community Edition.
Toward the end, you’ll get a glimpse into newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
Sink Your Teeth into Streaming at Any Scale — Timothy Spann
https://www.scylladb.com/2023/01/26/scylladb-summit-for-the-scylladb-curious-serious-sea-monsters/
Timothy Spann & David Kjerrumgaard, StreamNative
How to build a low-latency scalable platform for today’s massively data-intensive real-time streaming applications using ScyllaDB, Pulsar, and Flink.
Sink Your Teeth into Streaming at Any Scale — ScyllaDB
Using the low-latency Apache Pulsar we can build up millions of streams of concurrent data and join them in real time with Apache Flink. We need an ultra-low-latency database that can support these workloads to build next-generation IoT, financial, and instant analytical transit applications.
By sinking data into ScyllaDB we enable amazingly fast applications that can grow to any size and join with existing data sources.
The next generation of apps is being built now; you must choose the right low-latency, scalable platform for these massively data-intensive applications. ScyllaDB + Pulsar + Flink is that platform. Choose Open, Choose Fast, and make the right choice.
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ... — Simplilearn
This presentation about Apache Spark covers all the basics that a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what is Spark, the difference between Hadoop and Spark. You will learn the different components in Spark, and how Spark works with the help of architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what is Apache Spark.
Below topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark use case
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Spark Shell scripting
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark Streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
Operational Tips For Deploying Apache Spark — Databricks
Spark is providing a way to make big data applications easier to work with, but understanding how to actually deploy the platform can be quite confusing. This talk will present operational tips and best practices based on supporting our (Databricks) customers with Spark in production. We will discuss how your choice of storage and overall pipeline design influence performance. We will review Spark’s configuration subsystem and discuss which configuration properties are relevant to you. We’ll also review common misconfigurations that prevent users from getting the most of their Spark deployment. Finally, I’ll discuss frequently encountered issues working with customer environments and present debugging techniques to get to the root cause. This talk should help answer the following questions: How should I deploy my Spark application (cluster size, storage format, etc)? How can I improve the performance of my Spark application? What’s causing my Spark application to crash?
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE... — Michael Rys
This presentation shows how you can build solutions that follow the modern data warehouse architecture and introduces the .NET for Apache Spark support (https://dot.net/spark, https://github.com/dotnet/spark)
Using the SDACK Architecture to Build a Big Data Product — Evans Ye
You have probably heard of the SMACK architecture, which stands for Spark, Mesos, Akka, Cassandra, and Kafka. It’s especially suitable for building a lambda architecture system. But what is SDACK? It’s very similar to SMACK, except the “D” stands for Docker. While SMACK is an enterprise-scale, multi-tenant solution, the SDACK architecture is particularly suitable for building a data product. In this talk, I’ll talk about the advantages of the SDACK architecture, and how TrendMicro uses the SDACK architecture to build an anomaly detection data product. The talk will cover:
1) The architecture we designed based on SDACK to support both batch and streaming workload.
2) The data pipeline, built on Akka Streams, which is flexible, scalable, and able to self-heal.
3) The Cassandra data model designed to support time series data writes and reads.
Similar to Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with Apache: Spark, Kafka, Cassandra and Akka
Data stream processing platforms and microservices platform infrastructure and strategies are converging. As we edge towards larger, more complex and decoupled systems, combined with the continual growth of the global information graph, our frontiers of unsolved challenges grow equally fast. Central challenges for distributed systems include persistence strategies across DCs, zones or regions, network partitions, data optimization, and system stability in all phases.
How does leveraging CRDTs and Event Sourcing address several core distributed systems challenges? What are useful strategies and patterns involved in the design, deployment, and running of stateful and stateless applications for the cloud, for example with Kubernetes. Combined with code samples, we will see how Akka Cluster, Multi-DC Persistence, Split Brain, Sharding and Distributed Data can help solve these problems.
Humans have a tendency to invent new problems rather than solve old ones. As we build larger, more complex systems, we unearth global challenges around networks, compute resources and data. Have we neglected to see more elegant examples which existed all along?
It is possible for even the most complex systems to be organized and simplified in ways that may not occur to us. In situations where we still search for the right algorithms, by turning to complex natural systems around us we can find the problem was solved long ago. What we think is a new protocol may in fact be one that has been tested and evolving over hundreds or millions of years. One invented for the early internet is incredibly similar to a strategy evolved by desert ants millions of years ago. And this is why it works.
This talk will address these questions with examples of self-organization, decentralization and diversification from emergent phenomena found in nature.
Disorder And Tolerance In Distributed Systems At Scale — Helena Edelson
Rethinking intelligent resilient systems. Re-framing problems changes how we see and solve them. The intersection of scientific thought and principles parallels much of what we solve as engineers of information (e.g. uncertainty, time, distribution). This talk is an interdisciplinary look at complex adaptive systems and how they innately solve things like resource distribution, growth and rebalancing. From the context of intelligence and systems, this talk will look at ideas around entropy and time, ensemble forecasting, self-organization theory, the butterfly effect, virus-human co-evolution and adaptation, natural feedback loops, and self-balancing.
Can we leverage these principles, behaviors and strategies to design intelligent systems at scale?
Can seeing things in an interdisciplinary way benefit solving common problems and speed innovation?
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis — Helena Edelson
Slides from my talk with Evan Chan at Strata San Jose: NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis. Streaming analytics architecture in big data for fast streaming, ad hoc and batch, with Kafka, Spark Streaming, Akka, Mesos, Cassandra and FiloDB. Simplifying to a unified architecture.
This talk will address new architectures emerging for large-scale streaming analytics. Some are based on Spark, Mesos, Akka, Cassandra and Kafka (SMACK), and others on newer streaming analytics platforms and frameworks such as Apache Flink or GearPump. Popular architectures like Lambda separate layers of computation and delivery and require many technologies which have overlapping functionality. Some of this results in duplicated code, untyped processes, or high operational overhead, not to mention the cost (e.g. ETL).
I will discuss the problem domain and what is needed in terms of strategies, architecture and application design and code to begin leveraging simpler data flows. We will cover how the particular set of technologies addresses common requirements and how collaboratively they work together to enrich and reinforce each other.
Streaming Analytics with Spark, Kafka, Cassandra and Akka — Helena Edelson
This talk will address how a new architecture is emerging for analytics, based on Spark, Mesos, Akka, Cassandra and Kafka (SMACK). Popular architectures like Lambda separate layers of computation and delivery and require many technologies which have overlapping functionality. Some of this results in duplicated code, untyped processes, or high operational overhead, not to mention the cost (i.e. ETL). I will discuss the problem domain and what is needed in terms of strategies, architecture and application design and code to begin leveraging simpler data flows. We will cover how the particular set of technologies addresses common requirements and how collaboratively they work together to enrich and reinforce each other.
3. Who Is This Person?
• Using Scala in production since 2010
• Spark Cassandra Connector Committer
• Akka (Cluster) Contributor
• Scala Driver for Cassandra Committer
• @helenaedelson
• https://github.com/helena
• Senior Software Engineer, Analytics Team, DataStax
• Spent the last few years as a Senior Cloud Engineer
13. What Is Apache Spark
[Diagram: Analytic and Search workloads]
• Fast, general cluster compute system
• Originally developed in 2009 in UC Berkeley’s AMPLab
• Fully open sourced in 2010 – now at the Apache Software Foundation
• Distributed, Scalable, Fault Tolerant
14. Apache Spark - Easy to Use & Fast
• 10x faster on disk, 100x faster in memory than Hadoop MR
• Works out of the box on EMR
• Fault Tolerant Distributed Datasets
• Batch, iterative and streaming analysis
• In Memory Storage and Disk
• Integrates with Most File and Storage Options
[Diagram: up to 100× faster (2-10× on disk), 2-5× less code, for analytic and search workloads]
16. Why Scala?
• Functional
• On the JVM
• Capture functions and ship them across the network
• Static typing - easier to control performance
• Leverage the Scala REPL for the Spark REPL
http://apache-spark-user-list.1001560.n3.nabble.com/Why-Scala-tp6536p6538.html
20. Common Use Cases
Sources: applications, sensors, web, mobile phones
• intrusion detection
• malfunction detection
• site analytics
• network metrics analysis
• fraud detection
• dynamic process optimisation
• recommendations
• location based ads
• log processing
• supply chain planning
• sentiment analysis
• …
21. DStream - Micro Batches
• Continuous sequence of micro batches
• More complex processing models are possible with less effort
• Streaming computations as a series of deterministic batch computations on small time intervals
DStream = [μBatch (ordinary RDD) | μBatch (ordinary RDD) | μBatch (ordinary RDD) | …]
Processing of DStream = Processing of μBatches, RDDs
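The micro-batch model above can be sketched in plain Scala (an illustrative model, not the Spark API; `Event`, `microBatches` and `wordCounts` are hypothetical names): the stream is cut into fixed time intervals, and each interval is processed as an ordinary collection.

```scala
// Hypothetical model of a DStream as a sequence of micro-batches.
case class Event(timestampMs: Long, word: String)

// Split a stream of timestamped events into micro-batches of `intervalMs` duration.
def microBatches(events: Seq[Event], intervalMs: Long): Seq[Seq[Event]] =
  events.groupBy(_.timestampMs / intervalMs).toSeq.sortBy(_._1).map(_._2)

// Processing the DStream = deterministic batch processing of each micro-batch.
def wordCounts(batch: Seq[Event]): Map[String, Int] =
  batch.groupBy(_.word).map { case (w, es) => (w, es.size) }

val events  = Seq(Event(100, "a"), Event(900, "b"), Event(1200, "a"))
val batches = microBatches(events, intervalMs = 1000)
// Two micro-batches: [a, b] in [0, 1000) and [a] in [1000, 2000)
```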
22. InputDStreams
DStreams representing the stream of raw data received from streaming sources.
• Basic sources: sources directly available in the StreamingContext API
• Advanced sources: sources available through external modules, available in Spark as artifacts
23. Spark Streaming Modules
GroupId            ArtifactId                         Latest Version
org.apache.spark   spark-streaming-kinesis-asl_2.10   1.1.0
org.apache.spark   spark-streaming-mqtt_2.10          1.1.0
org.apache.spark   spark-streaming-zeromq_2.10        1.1.0
org.apache.spark   spark-streaming-flume_2.10         1.1.0
org.apache.spark   spark-streaming-flume-sink_2.10    1.1.0
org.apache.spark   spark-streaming-kafka_2.10         1.1.0
org.apache.spark   spark-streaming-twitter_2.10       1.1.0
24. Streaming Window Operations
// where pairs are (word, count)
pairsStream
  .map { case (k, v) => (k, v.value) }
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b,
    Seconds(30), Seconds(10))
  .saveToCassandra(keyspace, table)
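As a model of what `reduceByKeyAndWindow` computes, here is a plain-Scala sketch (not Spark code; `reduceByKeyAndWindowSim` is a hypothetical name): with a 10-second slide over a 30-second window, each output is the reduction over the last three micro-batches.

```scala
// Hypothetical simulation of windowed reduction over micro-batches.
// Each element of `batches` is one slide-interval's (key, value) pairs;
// `windowBatches` is windowDuration / slideDuration (e.g. 30s / 10s = 3).
def reduceByKeyAndWindowSim(
    batches: Vector[Seq[(String, Int)]],
    windowBatches: Int)(reduce: (Int, Int) => Int): Vector[Map[String, Int]] =
  batches.indices.toVector.map { i =>
    // The window at slide i covers the last `windowBatches` micro-batches.
    val window = batches.slice((i - windowBatches + 1).max(0), i + 1).flatten
    window.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(reduce) }
  }

val batches = Vector(Seq("a" -> 1), Seq("a" -> 2, "b" -> 1), Seq("b" -> 3))
val windows = reduceByKeyAndWindowSim(batches, windowBatches = 3)(_ + _)
// At the third slide the window holds all three batches: a -> 3, b -> 4
```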
27. Quick intro to Cassandra
• Shared nothing
• Masterless peer-to-peer
• Based on Dynamo
• Highly Scalable
• Spans DataCenters
28. Apache Cassandra
• Elasticity - scale to as many nodes as you need, when you need
• Always On - no single point of failure, continuous availability
• Masterless peer to peer architecture
• Designed for Replication
• Flexible Data Storage
• Read and write to any node; syncs across the cluster
• Operational simplicity - with all nodes in a cluster being the same, there is no complex configuration to manage
29. Easy to use
• CQL - familiar syntax
• Friendly to programmers
• Paxos for locking

CREATE TABLE users (
  username varchar,
  firstname varchar,
  lastname varchar,
  email list<varchar>,
  password varchar,
  created_date timestamp,
  PRIMARY KEY (username)
);

INSERT INTO users (username, firstname, lastname, email, password, created_date)
VALUES ('hedelson', 'Helena', 'Edelson',
  ['helena.edelson@datastax.com'], 'ba27e03fd95e507daf2937c937d499ab', '2014-11-15 13:50:00');

INSERT INTO users (username, firstname, lastname, email, password, created_date)
VALUES ('pmcfadin', 'Patrick', 'McFadin',
  ['patrick@datastax.com'],
  'ba27e03fd95e507daf2937c937d499ab',
  '2011-06-20 13:50:00')
IF NOT EXISTS;
31. Spark On Cassandra
• Server-Side filters (where clauses)
• Cross-table operations (JOIN, UNION, etc.)
• Data locality-aware (speed)
• Data transformation, aggregation, etc.
• Natural Time Series Integration
32. Spark Cassandra Connector
• Loads data from Cassandra to Spark
• Writes data from Spark to Cassandra
• Implicit Type Conversions and Object Mapping
• Implemented in Scala (offers a Java API)
• Open Source
• Exposes Cassandra Tables as Spark RDDs + Spark DStreams
34. Use Cases: KillrWeather
• Get data by weather station
• Get data for a single date and time
• Get data for a range of dates and times
• Compute, store and quickly retrieve daily, monthly and annual aggregations of data
Data Model to support queries
• Store raw data per weather station
• Store time series in order: most recent to oldest
• Compute and store aggregate data in the stream
• Set TTLs on historic data
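The "compute and store aggregate data in the stream" idea can be sketched in plain Scala (hypothetical types, not KillrWeather code): fold each raw observation into a running per-(station, day) aggregate as it arrives, instead of re-scanning the raw data later.

```scala
// Hypothetical sketch: incremental daily aggregation per weather station.
case class RawWeather(stationId: String, day: String, temperature: Double)
case class DailyAgg(stationId: String, day: String, count: Int, mean: Double)

// Fold one incoming observation into the running (station, day) aggregate,
// updating the mean incrementally (no raw-data rescan needed).
def update(aggs: Map[(String, String), DailyAgg],
           r: RawWeather): Map[(String, String), DailyAgg] = {
  val key = (r.stationId, r.day)
  val next = aggs.get(key) match {
    case Some(a) =>
      val n = a.count + 1
      a.copy(count = n, mean = a.mean + (r.temperature - a.mean) / n)
    case None => DailyAgg(r.stationId, r.day, 1, r.temperature)
  }
  aggs.updated(key, next)
}

val stream = Seq(
  RawWeather("010010", "2014-11-15", 10.0),
  RawWeather("010010", "2014-11-15", 14.0))
val aggs = stream.foldLeft(Map.empty[(String, String), DailyAgg])(update)
// aggs(("010010", "2014-11-15")) holds count = 2, mean = 12.0
```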
37. Spark Cassandra Example
// Initialization
val conf = new SparkConf(loadDefaults = true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .setMaster("spark://127.0.0.1:7077")

val sc = new SparkContext(conf)

// CassandraRDD
val table: CassandraRDD[CassandraRow] = sc.cassandraTable("keyspace", "tweets")

// Stream initialization
val ssc = new StreamingContext(sc, Seconds(30))
val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafka.kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY)

// Transformations and action
stream.map(_._2).countByValue().saveToCassandra("demo", "wordcount")
ssc.start()
ssc.awaitTermination()
38. Spark Cassandra Example
val sc = new SparkContext(..)
val ssc = new StreamingContext(sc, Seconds(5))

val stream = TwitterUtils.createStream(ssc, auth, filters, StorageLevel.MEMORY_ONLY_SER_2)
val transform = (cruft: String) =>
  Pattern.findAllIn(cruft).flatMap(_.stripPrefix("#"))

/** Note that Cassandra is doing the sorting for you here. */
stream.flatMap(_.getText.toLowerCase.split("""\s+"""))
  .map(transform)
  .countByValueAndWindow(Seconds(5), Seconds(5))
  .transform((rdd, time) => rdd.map { case (term, count) => (term, count, now(time)) })
  .saveToCassandra(keyspace, suspicious, SomeColumns("suspicious", "count", "timestamp"))
39. Reading: From C* To Spark
val table = sc
  .cassandraTable[CassandraRow]("keyspace", "tweets") // row representation, keyspace, table
  .select("user_name", "message")                     // server-side column selection
  .where("user_name = ?", "ewa")                      // server-side row selection
40. CassandraRDD
class CassandraRDD[R](..., keyspace: String, table: String, ...)
extends RDD[R](...) {
// Splits the table into multiple Spark partitions,
// each processed by single Spark Task
override def getPartitions: Array[Partition]
// Returns names of hosts storing given partition (for data locality!)
override def getPreferredLocations(split: Partition): Seq[String]
// Returns iterator over Cassandra rows in the given partition
override def compute(split: Partition, context: TaskContext): Iterator[R]
}
42. Paging Reads with .cassandraTable
• Page size is configurable
• Controls how many CQL rows to fetch at a time, when fetching a single partition
• Connector returns an iterator for rows to Spark
• Spark iterates over this, lazily
• Handled by the Java driver as well as Spark
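A minimal sketch of this paging behavior in plain Scala (illustrative only, not the driver or connector API; `PagedIterator` and `fetchPage` are hypothetical): rows are exposed to the caller as a lazy iterator that pulls one page at a time, requesting the next page as the current one is exhausted.

```scala
// Hypothetical lazy paging iterator: `fetchPage(n)` returns page n's rows.
class PagedIterator[A](fetchPage: Int => Seq[A], pageSize: Int) extends Iterator[A] {
  private var pageNo = 0
  private var page: Seq[A] = fetchPage(0)
  private var i = 0
  def hasNext: Boolean = i < page.length
  def next(): A = {
    val a = page(i); i += 1
    if (i == page.length && page.length == pageSize) { // full page exhausted:
      pageNo += 1; page = fetchPage(pageNo); i = 0     // request the next page
    }
    a
  }
}

val all     = (1 to 7).toVector
val fetched = scala.collection.mutable.ArrayBuffer.empty[Int] // pages requested
def fetch(n: Int): Seq[Int] = { fetched += n; all.slice(n * 3, n * 3 + 3) }
val it = new PagedIterator(fetch, pageSize = 3)
// Only page 0 is fetched up front; later pages are pulled as iteration proceeds.
```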
43. ResultSet Paging and Pre-Fetching
[Sequence diagram, shown for Node 1 and Node 2: the client requests a page from the Cassandra node, receives the data, and processes it while the next page is being requested.]
44. Co-locate Spark and C* for Best Performance
[Diagram: a Spark Master and Spark Workers running on the same nodes as the C* cluster]
Running Spark Workers on the same nodes as your C* cluster will save network hops when reading and writing.
45. The Key To Speed - Data Locality
• LocalNodeFirstLoadBalancingPolicy
• Decides what node will become the coordinator for the given mutation/read
• Selects the local node first and then nodes in the local DC in random order
• Once that node receives the request it will be distributed
• Proximal node sort defined by the C* snitch
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/DynamicEndpointSnitch.java#L155-L190
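The local-node-first idea can be modeled in a few lines of plain Scala (hypothetical `Node` and `queryPlan`, not the connector's policy class; appending remote-DC nodes last as a fallback is an assumption of this sketch):

```scala
// Hypothetical model of a local-node-first query plan.
case class Node(host: String, dc: String)

// Local node first, then local-DC nodes in random order, then remote DCs.
def queryPlan(nodes: Seq[Node], localHost: String, localDc: String,
              rnd: scala.util.Random): Seq[Node] = {
  val (local, others)   = nodes.partition(_.host == localHost)
  val (sameDc, remote)  = others.partition(_.dc == localDc)
  local ++ rnd.shuffle(sameDc) ++ remote
}

val nodes = Seq(Node("10.0.0.1", "dc1"), Node("10.0.0.2", "dc1"), Node("10.0.1.1", "dc2"))
val plan  = queryPlan(nodes, localHost = "10.0.0.2", localDc = "dc1", new scala.util.Random(7))
// plan.head is always the local node (10.0.0.2); remote-DC nodes come last
```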
46. Column Type Conversions
trait TypeConverter[T] extends Serializable {
  def targetTypeTag: TypeTag[T]
  def convertPF: PartialFunction[Any, T]
}

class CassandraRow(...) {
  def get[T](index: Int)(implicit c: TypeConverter[T]): T = …
  def get[T](name: String)(implicit c: TypeConverter[T]): T = …
  …
}

implicit object IntConverter extends TypeConverter[Int] {
  def targetTypeTag = implicitly[TypeTag[Int]]
  def convertPF = {
    case x: Number => x.intValue
    case x: String => x.toInt
  }
}
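The converter pattern above can be demonstrated self-contained (a sketch: `SimpleConverter` and `Row` are hypothetical names, and the `TypeTag` member is dropped for brevity; this is not the connector's trait):

```scala
// Simplified converter: a PartialFunction from Any to the target type.
trait SimpleConverter[T] { def convertPF: PartialFunction[Any, T] }

implicit object IntConv extends SimpleConverter[Int] {
  val convertPF: PartialFunction[Any, Int] = {
    case x: Number => x.intValue // any numeric column value
    case x: String => x.toInt    // string columns holding numbers
  }
}

// A row stores untyped column values; get converts on access via the
// implicitly resolved converter, mirroring CassandraRow.get[T].
class Row(columns: Map[String, Any]) {
  def get[T](name: String)(implicit c: SimpleConverter[T]): T =
    c.convertPF(columns(name))
}

val row = new Row(Map("count" -> "42", "age" -> java.lang.Integer.valueOf(7)))
// row.get[Int]("count") converts the String "42" through convertPF
```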
47. Rows to Case Class Instances
val row: CassandraRow = table.first

case class Tweet(
  userName: String,
  tweetId: UUID,
  publicationDate: Date,
  message: String) extends Serializable

val tweets = sc.cassandraTable[Tweet]("db", "tweets")

Recommended convention:
Scala      Cassandra
message    message
column1    column_1
userName   user_name

Identical names also work:
Scala      Cassandra
Message    Message
column1    column1
userName   userName
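The recommended convention can be sketched as a helper function (hypothetical code, not the connector's mapper; it implements the camelCase-to-snake_case column of the table above, including the digit boundary in `column1` → `column_1`):

```scala
// Hypothetical sketch: map a Scala property name to its Cassandra column name
// by inserting "_" before each uppercase-letter or digit-run boundary.
def columnNameOf(property: String): String =
  property.toList.zipWithIndex.map { case (c, i) =>
    val boundary = i > 0 && ((c.isUpper && !property(i - 1).isUpper) ||
                             (c.isDigit && !property(i - 1).isDigit))
    (if (boundary) "_" else "") + c.toLower
  }.mkString

// columnNameOf("userName") == "user_name"
// columnNameOf("column1")  == "column_1"
// columnNameOf("message")  == "message"
```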
48. Convert Rows To Tuples
val tweets = sc
  .cassandraTable[(String, String)]("db", "tweets")
  .select("userName", "message")

When returning tuples, always use select to specify the column order.
49. Handling Unsupported Types
case class Tweet(
  userName: String,
  tweetId: my.private.implementation.of.UUID,
  publicationDate: Date,
  message: String)

val tweets: CassandraRDD[Tweet] = sc.cassandraTable[Tweet]("db", "tweets")

Current behavior: runtime error
Future: compile-time error (macros!)
Macros could even parse the Cassandra schema!
50. Connector Code and Docs
https://github.com/datastax/spark-cassandra-connector

Add It To Your Project:

val connector = "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta2"
51. What’s New In Spark?
• Petabyte sort record
• Myth busting:
  • “Spark is in-memory; it doesn’t work with big data”
  • “It’s too expensive to buy a cluster with enough memory to fit our data”
• Application Integration: Tableau, Trifacta, Talend, ElasticSearch, Cassandra
• Ongoing development for Spark 1.2: Python streaming, new MLlib API, YARN scaling…
52. Recent / Pending Spark Updates
• Usability, Stability & Performance Improvements
  • Improved lightning-fast shuffle
  • Monitoring the performance of long-running or complex jobs
• Public types API to allow users to create SchemaRDDs
• Support for registering Python, Scala, and Java lambda functions as UDFs
• Dynamic bytecode generation, significantly speeding up execution for queries that perform complex expression evaluation
53. Recent Spark Streaming Updates
• Apache Flume: a new pull-based mode (simplifying deployment and providing high availability)
• The first of a set of streaming machine learning algorithms is introduced with streaming linear regression
• Rate limiting has been added for streaming inputs
54. Recent MLlib Updates
• New library of statistical packages which provides exploratory analytic functions (stratified sampling, correlations, chi-squared tests, creating random datasets…)
• Utilities for feature extraction (Word2Vec and TF-IDF) and feature transformation (normalization and standard scaling)
• Support for nonnegative matrix factorization and SVD via Lanczos
• Decision tree algorithm has been added in Python and Java
• Tree aggregation primitive
• Performance improves across the board, with improvements of around …
55. Recent/Pending Connector Updates
• Spark SQL (came in 1.1)
• Python API
• Official Scala Driver for Cassandra
• Removing last traces of Thrift!
• Performance Improvements
  • Token-aware data repartitioning
  • Token-aware saving
  • Wide-row support - no costly groupBy call