A brief introduction to the Spark data processing ideology, a comparison of Java 7 and Java 8 usage with Spark, and examples of loading and processing data with the Spark Cassandra Loader.
Spark-HandsOn
In this Hands-On, we are going to show how you can use Apache Spark and some components of its ecosystem for data processing. This workshop is split into four parts. We will use a dataset that consists of tweets containing just a few fields, like id, user, text, country and place.
In the first part, you will play with the Spark API for basic operations like counting, filtering and aggregating.
After that, you will get to know Spark SQL to query structured data (here, JSON) using SQL.
In the third part, you will use Spark Streaming and the Twitter streaming API to analyse a live stream of tweets.
To finish, we will build a simple model to identify the language of a text. For that you will use MLlib.
Let's go and have fun!
Prerequisites
Java 6 or higher (Java 8 is better, so you can use lambdas)
IDE
Some links
Apache Spark https://spark.apache.org
https://speakerdeck.com/nivdul/lightning-fast-machine-learning-with-spark
https://speakerdeck.com/samklr/scalable-machine-learning-with-spark
Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
Created at the University of California, Berkeley, Apache Spark combines a distributed computing system running across computer clusters with a simple and elegant way of writing programs. Spark is considered the first open source software that makes distributed programming really accessible to data scientists. Here you can find an introduction and basic concepts.
Apache Spark in Depth: Core Concepts, Architecture & Internals, by Anton Kirillov
The slides cover core Spark concepts such as RDDs, the DAG, the execution workflow, the forming of stages of tasks, and the shuffle implementation, and also describe the architecture and main components of the Spark driver. The workshop part covers Spark execution modes and provides a link to a GitHub repo which contains example Spark applications and a dockerized Hadoop environment to experiment with.
In this second part, we'll continue the review of Spark and introduce Spark SQL, which allows you to use DataFrames in Python, Java, and Scala; read and write data in a variety of structured formats; and query big data with SQL.
Deep Dive: Spark DataFrames, SQL and the Catalyst Optimizer, by Sachin Aggarwal
RDD recap
Spark SQL library
Architecture of Spark SQL
Comparison with Pig and Hive Pipeline
DataFrames
Definition of a DataFrames API
DataFrames Operations
DataFrames features
Data cleansing
Diagram for logical plan container
Plan Optimization & Execution
Catalyst Analyzer
Catalyst Optimizer
Generating Physical Plan
Code Generation
Extensions
Apache Spark Introduction and Resilient Distributed Dataset Basics and Deep Dive, by Sachin Aggarwal
We will give a detailed introduction to Apache Spark, and why and how Spark can change the analytics world. Apache Spark's memory abstraction is the RDD (Resilient Distributed Dataset). One of the key reasons why Apache Spark is so different is the introduction of the RDD. You cannot do anything in Apache Spark without knowing about RDDs. We will give a high-level introduction to RDDs, and in the second half we will have a deep dive into RDDs.
Most people hear "Spark" and think "analytics". But the ability of Spark to efficiently distribute and manage a full-table traversal while functionally transforming the data makes it perfectly suited to executing "Big Data" maintenance jobs.
My presentation at Java User Group BD Meetup #5.0 (JUGBD#5.0)
Apache Spark™ is a fast and general engine for large-scale data processing. Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Video of the talk: https://www.youtube.com/watch?v=gd4Jqtyo7mM
Apache Spark is a next-generation engine for large-scale data processing built with Scala. This talk will first show how Spark takes advantage of Scala's functional idioms to produce an expressive and intuitive API. You will learn about the design of Spark RDDs and how the abstraction enables the Spark execution engine to be extended to support a wide variety of use cases (Spark SQL, Spark Streaming, MLlib and GraphX). The Spark source will be referenced to illustrate how these concepts are implemented with Scala.
http://www.meetup.com/Scala-Bay/events/209740892/
Unsupervised learning refers to a branch of algorithms that try to find structure in unlabeled data. Clustering algorithms, for example, try to partition elements of a dataset into related groups. Dimensionality reduction algorithms search for a simpler representation of a dataset. Spark's MLlib module contains implementations of several unsupervised learning algorithms that scale to huge datasets. In this talk, we'll dive into uses and implementations of Spark's K-means clustering and Singular Value Decomposition (SVD).
Bio:
Sandy Ryza is an engineer on the data science team at Cloudera. He is a committer on Apache Hadoop and recently led Cloudera's Apache Spark development.
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori..., by CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2sf2z6i
This CloudxLab Introduction to Spark SQL & DataFrames tutorial helps you to understand Spark SQL & DataFrames in detail. Below are the topics covered in these slides:
1) Introduction to DataFrames
2) Creating DataFrames from JSON
3) DataFrame Operations
4) Running SQL Queries Programmatically
5) Datasets
6) Inferring the Schema Using Reflection
7) Programmatically Specifying the Schema
Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office, by DataStax Academy
This case study concerns moving large amounts of patent data from Cassandra to Solr: how we approached the problem, the introduction of Spark as a solution, and how to optimize Spark jobs. I will cover:
* Understanding the parts of a Spark job: which components run where, and common issues.
* Adding metrics to show where pain points are in your code.
* Comparing various methods in the API to achieve more performant code.
* How we saved time and made a repeatable process with Spark.
Using Spark to Load Oracle Data into Cassandra, by Jim Hatcher
This presentation describes how you can use Spark as an ETL tool to get data from a relational database into Cassandra. I go through the concept in general and then talk about some specific issues you might run into and how to fix them.
Spark Streaming Programming Techniques You Should Know, with Gerard Maas (Spark Summit)
At its heart, Spark Streaming is a scheduling framework, able to efficiently collect and deliver data to Spark for further processing. While the DStream abstraction provides high-level functions to process streams, several operations also grant us access to deeper levels of the API, where we can directly operate on RDDs, transform them to Datasets to make use of that abstraction or store the data for later processing. Between these API layers lie many hooks that we can manipulate to enrich our Spark Streaming jobs. In this presentation we will demonstrate how to tap into the Spark Streaming scheduler to run arbitrary data workloads, we will show practical uses of the forgotten ‘ConstantInputDStream’ and will explain how to combine Spark Streaming with probabilistic data structures to optimize the use of memory in order to improve the resource usage of long-running streaming jobs. Attendees of this session will come out with a richer toolbox of techniques to widen the use of Spark Streaming and improve the robustness of new or existing jobs.
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an..., by Anton Kirillov
This talk is about architecture designs for data processing platforms based on the SMACK stack, which stands for Spark, Mesos, Akka, Cassandra and Kafka. The main topics of the talk are:
- SMACK stack overview
- storage layer layout
- fixing NoSQL limitations (joins and group by)
- cluster resource management and dynamic allocation
- reliable scheduling and execution at scale
- different options for getting the data into your system
- preparing for failures with proper backup and patching strategies
A Tale of Two APIs: Using Spark Streaming in Production, by Lightbend
Fast Data architectures are the answer to the increasing need for the enterprise to process and analyze continuous streams of data to accelerate decision making and become reactive to the particular characteristics of their market.
Apache Spark is a popular framework for data analytics. Its capabilities include SQL-based analytics, dataflow processing, graph analytics and a rich library of built-in machine learning algorithms. These libraries can be combined to address a wide range of requirements for large-scale data analytics.
To address Fast Data flows, Spark offers two APIs: the mature Spark Streaming and its younger sibling, Structured Streaming. In this talk, we are going to introduce both APIs. Using practical examples, you will get a taste of each one and obtain guidance on how to choose the right one for your application.
Lightning Fast Analytics with Cassandra and Spark, by Tim Vincent
A presentation on the integration of Apache Cassandra with Apache Spark to deliver near real-time analytics against operational data in your Cassandra distributed database.
Patrick Wendell, a founding committer of Spark, gave this talk about Apache Spark at Strata London 2015.
These slides provide an introduction to Spark and delve into future developments, including DataFrames, the Data Sources API, the Catalyst logical optimizer, and Project Tungsten.
2. Spark
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
3. Components
1. Driver program
Our main program, which connects to the Spark cluster through a SparkContext object and submits transformations and actions on RDDs.
2. Cluster manager
Allocates resources across applications (e.g. the standalone manager, Mesos, YARN).
3. Worker node
Executor: a process launched for an application on a worker node, which runs tasks and keeps data in memory or disk storage across them.
Task: a unit of work that will be sent to one executor.
4. Spark RDD
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
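As a quick illustration, here is a minimal, self-contained sketch of both creation paths using the Java API (the class name and the HDFS path are placeholders, not part of the original deck):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddCreationExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RddCreation").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // 1. Parallelize an existing collection in the driver program.
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
        System.out.println("sum = " + numbers.reduce((a, b) -> a + b));
        // 2. Reference a dataset in an external storage system
        //    (placeholder path; HDFS, S3 or a local file all work).
        JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/data/tweets.txt");
        System.out.println("lines = " + lines.count());
        sc.close();
    }
}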
7. Shared variables in Spark
Spark provides two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
• Broadcast Variables
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
• Accumulators
Accumulators are variables that are only "added" to through an associative operation and can therefore be efficiently supported in parallel. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If accumulators are created with a name, they will be displayed in Spark's UI. This can be useful for understanding the progress of running stages.
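A minimal sketch of both shared-variable types with the Java 8 API; the stop-word set and the accumulator name are illustrative, not part of the original deck:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.spark.Accumulator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class SharedVariablesExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SharedVariables").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Broadcast: ship a read-only lookup set to each machine once.
        Broadcast<Set<String>> stopWords =
                sc.broadcast(new HashSet<>(Arrays.asList("a", "an", "the")));
        // Named accumulator: shows up in Spark's UI as "dropped words".
        Accumulator<Integer> dropped = sc.accumulator(0, "dropped words");
        JavaRDD<String> words = sc.parallelize(Arrays.asList("a", "quick", "fox", "the", "dog"));
        JavaRDD<String> kept = words.filter(w -> {
            if (stopWords.value().contains(w)) {
                dropped.add(1); // the value is only dependable after an action has run
                return false;
            }
            return true;
        });
        System.out.println(kept.count() + " kept, " + dropped.value() + " dropped");
        sc.close();
    }
}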
9. Building a simple Spark application
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

SparkConf sparkConf = new SparkConf().setAppName("SparkApplication").setMaster("local[*]");
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
JavaRDD<String> file = sparkContext.textFile("hdfs://...");
// Split each line into words (Java 7 anonymous inner classes).
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) {
        return Arrays.asList(s.split(" "));
    }
});
// Pair each word with an initial count of 1.
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
    }
});
// Sum the counts per word; Function2 takes two Integers and returns one.
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) {
        return a + b;
    }
});
counts.saveAsTextFile("hdfs://...");
sparkContext.close();
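Since this deck compares Java 7 and Java 8 usage with Spark, here is the same word count rewritten with Java 8 lambdas; a minimal sketch reusing the sparkContext above and assuming the same Spark 1.2 Java API:

JavaRDD<String> textLines = sparkContext.textFile("hdfs://...");
JavaPairRDD<String, Integer> wordCounts = textLines
        .flatMap(line -> Arrays.asList(line.split(" "))) // FlatMapFunction as a lambda
        .mapToPair(word -> new Tuple2<>(word, 1))        // PairFunction as a lambda
        .reduceByKey((a, b) -> a + b);                   // Function2 as a lambda
wordCounts.saveAsTextFile("hdfs://...");

The three anonymous classes collapse into one short pipeline, which is the main readability win Java 8 brings to Spark code.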
10. Java 8 + Spark 1.2 + Cassandra for BI:
Driver program skeleton
SparkConf sparkConf = new SparkConf()
.setAppName("SparkCassandraTest")
.setMaster("local[*]")
.set("spark.cassandra.connection.host", "127.0.0.1");
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
CassandraLoader<UserEvent> cassandraLoader = new CassandraLoader<>(sparkContext,
"dataanalytics", "user_events", UserEvent.class);
JavaRDD<UserEvent> rdd = cassandraLoader.fetchAndUnion(venueIds, startDate, endDate);
… Events processing here …
sparkContext.close();
11. Java 8 + Spark 1.2 + Cassandra for BI:
Load events from Cassandra
public class CassandraLoader<T> {
private JavaSparkContext sparkContext;
private String keySpace;
private String tableName;
private Class<T> clazz;
…
private CassandraJavaRDD<T> fetchForVenueAndDateShard(String venueId, String dateShard) {
RowReaderFactory<T> mapper = CassandraJavaUtil.mapRowTo(clazz);
return CassandraJavaUtil.
javaFunctions(sparkContext). // SparkContextJavaFunctions appears here
cassandraTable(keySpace, tableName, mapper). // CassandraJavaRDD appears here
where("venue_id=? AND date_shard=?", venueId, dateShard);
}
…
}
CassandraJavaUtil
The main entry point to the Spark Cassandra Connector Java API. It builds useful wrappers around SparkContext, StreamingContext and RDDs.
SparkContextJavaFunctions -> CassandraJavaRDD<T> cassandraTable(String keyspace, String table, RowReaderFactory<T> rrf)
Returns a view of a Cassandra table. With this method, each row is converted to an object of type T by the specified row reader factory.
CassandraJavaUtil -> RowReaderFactory<T> mapRowTo(Class<T> targetClass, Pair<String, String>... columnMappings)
Constructs a row reader factory which maps an entire row to an object of the specified type (JavaBean-style convention). The default mapping of attributes to column names can be changed by providing a custom map of attribute-column mappings for the pairs which do not follow the general convention.
CassandraJavaRDD
CassandraJavaRDD<R> select(String... columnNames)
CassandraJavaRDD<R> where(String cqlWhereClause, Object... args)
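As a hedged sketch of the custom-mapping variant described above (the attribute and column names are hypothetical, and Pair is assumed to be the commons-lang3 tuple used by the connector's Java API):

import org.apache.commons.lang3.tuple.Pair;
import com.datastax.spark.connector.japi.CassandraJavaUtil;
import com.datastax.spark.connector.japi.rdd.CassandraJavaRDD;
import com.datastax.spark.connector.rdd.reader.RowReaderFactory;

// Override one attribute-column pair that does not follow the JavaBean convention
// (hypothetical attribute "eventKind" mapped to column "event_type").
RowReaderFactory<UserEvent> mapper = CassandraJavaUtil.mapRowTo(
        UserEvent.class, Pair.of("eventKind", "event_type"));
CassandraJavaRDD<UserEvent> events = CassandraJavaUtil
        .javaFunctions(sparkContext)
        .cassandraTable("dataanalytics", "user_events", mapper)
        .select("user_id", "event_type", "date_shard"); // fetch only the needed columns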
12. Java 8 + Spark 1.2 + Cassandra for BI:
Load events from Cassandra
public Map<String, JavaRDD<T>> fetchByVenue(List<String> venueIds, Date startDate, Date endDate) {
Map<String, JavaRDD<T>> result = new HashMap<>();
List<String> dateShards = ShardingUtils.generateDailyShards(startDate, endDate);
List<CassandraJavaRDD<T>> dailyRddList = new LinkedList<>();
venueIds.stream().forEach(venueId -> {
dailyRddList.clear();
dateShards.stream().forEach(dateShard -> {
CassandraJavaRDD<T> rdd = fetchForVenueAndDateShard(venueId, dateShard);
dailyRddList.add(rdd);
});
result.put(venueId, unionRddCollection(dailyRddList));
});
return result;
}
private JavaRDD<T> unionRddCollection(Collection<? extends JavaRDD<T>> rddCollection) {
JavaRDD<T> result = null;
for (JavaRDD<T> rdd : rddCollection) {
result = (result == null) ? rdd : result.union(rdd);
}
return result;
}
public JavaRDD<T> fetchAndUnion(List<String> venueIds, Date startDate, Date endDate) {
Map<String, JavaRDD<T>> data = fetchByVenue(venueIds, startDate, endDate);
return unionRddCollection(data.values());
}
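One caveat on unionRddCollection: chaining result.union(rdd) pairwise nests n-1 UnionRDDs in the lineage. JavaSparkContext can union the whole collection in a single step; a sketch with a hypothetical unionAll helper, assuming Spark 1.x's union(first, rest) signature:

private JavaRDD<T> unionAll(List<JavaRDD<T>> rdds) {
    if (rdds.isEmpty()) {
        return null; // mirrors the original unionRddCollection contract
    }
    // One n-way union instead of n-1 nested pairwise unions.
    return sparkContext.union(rdds.get(0), rdds.subList(1, rdds.size()));
}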
13. Java 8 + Spark 1.2 + Cassandra for BI:
Some processing
JavaPairRDD<String, Iterable<UserEvent>> groupedRdd = rdd.filter(event -> {
boolean result = false;
boolean isSessionEvent = TYPE_SESSION.equals(event.getEvent_type());
if (isSessionEvent) {
Map<String, String> payload = event.getPayload();
String action = payload.get(PAYLOAD_ACTION_KEY);
if (StringUtils.isNotEmpty(action)) {
result = ACTION_SESSION_START.equals(action) || ACTION_SESSION_STOP.equals(action);
}
}
return result;
}).groupBy(event -> event.getUser_id());
14. Java 8 + Spark 1.2 + Cassandra for BI:
Some processing
JavaRDD<SessionReport> reportsRdd = groupedRdd.map(pair -> {
String sessionId = pair._1();
Iterable<UserEvent> events = pair._2();
Date sessionStart = null;
Date sessionEnd = null;
for (UserEvent event : events) {
Date eventDate = event.getDate();
if (eventDate != null) {
String action = event.getPayload().get(PAYLOAD_ACTION_KEY);
if (ACTION_SESSION_START.equals(action)) {
if (sessionStart == null || eventDate.before(sessionStart))
sessionStart = eventDate;
}
if (ACTION_SESSION_STOP.equals(action)) {
if (sessionEnd == null || eventDate.after(sessionEnd))
sessionEnd = eventDate;
}
}
}
String sessionType = ((sessionStart != null) && (sessionEnd != null)) ? SessionReport.TYPE_CLOSED : SessionReport.TYPE_ACTIVE;
return new SessionReport(sessionId, sessionType, sessionStart, sessionEnd);
});
15. Java 8 + Spark 1.2 + Cassandra for BI:
Get result to Driver Program
List<SessionReport> reportsList = reportsRdd.collect(); // Returns RDD as a List to driver program, be aware of OOM
reportsList.forEach(Main::printReport);
….
SessionReport{sessionId='36a39b8e-27b9-4560-a1c5-9bfa77679930', sessionType='closed', sessionStart=2014-08-13 21:37:38, sessionEnd=2014-08-13 21:39:12}
SessionReport{sessionId='aee19a86-e060-42fb-b34f-76cd698e483e', sessionType='closed', sessionStart=2014-07-28 17:17:21, sessionEnd=2014-07-28 19:58:12}
SessionReport{sessionId='cecc03eb-f2fb-4ed4-9354-76ec8a965d8d', sessionType='closed', sessionStart=2014-09-04 19:46:51, sessionEnd=2014-09-04 21:12:43}
SessionReport{sessionId='1bd85e46-3fe2-4d46-acc5-2fe69735c453', sessionType='closed', sessionStart=2014-08-24 15:56:54, sessionEnd=2014-08-24 15:57:55}
SessionReport{sessionId='0d4e4b9f-fbd0-4eaf-a815-4f46693dbb2b', sessionType='closed', sessionStart=2014-09-09 13:39:39, sessionEnd=2014-09-09 13:46:08}
SessionReport{sessionId='32e822a6-5835-4001-bd95-ede38746e3bd', sessionType='closed', sessionStart=2014-08-27 21:24:03, sessionEnd=2014-08-28 01:21:11}
SessionReport{sessionId='cd35f911-29f4-496a-92f0-a9f5b51b0298', sessionType='closed', sessionStart=2014-09-09 20:14:49, sessionEnd=2014-09-10 01:07:17}
SessionReport{sessionId='8941e14f-9278-4a42-b000-1a228244cbc9', sessionType='active', sessionStart=2014-09-15 16:58:39, sessionEnd=UNKNOWN}
SessionReport{sessionId='c5bf123a-2e34-4c85-a25f-a705a2d408fa', sessionType='closed', sessionStart=2014-09-10 21:20:15, sessionEnd=2014-09-10 23:58:42}
SessionReport{sessionId='4252c7fd-90c0-4a34-8ddb-8db47d68c5a6', sessionType='closed', sessionStart=2014-07-09 08:32:35, sessionEnd=2014-07-09 08:34:23}
SessionReport{sessionId='f6441966-8d6d-4f1c-801c-29201fa75fe6', sessionType='active', sessionStart=2014-08-05 20:47:14, sessionEnd=UNKNOWN}
….
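If the report set is large, collecting to the driver risks exactly the OOM noted above. An alternative sketch writes the RDD straight back to Cassandra; the session_reports table is hypothetical, and the writer calls are assumed from the Spark Cassandra Connector's japi package:

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

// Each SessionReport becomes one row; nothing is materialized on the driver.
javaFunctions(reportsRdd)
        .writerBuilder("dataanalytics", "session_reports", mapToRow(SessionReport.class))
        .saveToCassandra();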