This document provides an overview of Apache Spark, a distributed computing framework. It discusses why distributed computing is necessary for solving large problems and compares disk-based frameworks such as Hadoop with memory-based frameworks such as Spark. The core of Spark is the resilient distributed dataset (RDD) abstraction, which allows transformations and actions to be performed on large datasets in parallel. Spark applications run as independent sets of processes on a cluster, coordinated by a driver program. The document gives examples of Spark usage and discusses related projects in the Spark ecosystem.
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It extends Hadoop's MapReduce model to efficiently support more types of computation, including interactive queries and stream processing. This slide deck shares some basic knowledge about Apache Spark.
The critical thing to remember about Spark and Hadoop is that they are not mutually exclusive: they work well together, and the combination is strong enough for many big data applications.
Big Data Processing with Apache Spark (2014), by mahchiev
Apache Spark™ is a fast and general engine for large-scale data processing. It has gained enormous popularity thanks to its speed and ease of use, and it is currently replacing traditional Hadoop MapReduce. We'll talk about:
1. What is Big Data?
2. The MapReduce paradigm
3. What does Apache Spark do?
4. Finally, a quick demo
In this era of ever-growing data, the need to analyze it for meaningful business insights becomes more and more significant. There are different Big Data processing alternatives such as Hadoop, Spark, and Storm. Spark, however, is unique in providing both batch and streaming capabilities, making it a preferred choice for lightning-fast Big Data analysis platforms.
Transitioning Compute Models: Hadoop MapReduce to Spark, by Slim Baltagi
This presentation is an analysis of the observed trends in the transition from the Hadoop ecosystem to the Spark ecosystem. The related talk took place at the Chicago Hadoop User Group (CHUG) meetup held on February 12, 2015.
Apache Spark has quickly become a major tool in the problem space of crunching big data. This presentation tells the history of Spark, when and why to use it, and ends with an example of how easy it is to get started!
Apache Spark is rapidly emerging as the prime platform for advanced analytics in Hadoop. This briefing is updated to reflect news and announcements as of July 2014.
Workshop
December 9, 2015
LBS College of Engineering
www.sarithdivakar.info | www.csegyan.org
http://sarithdivakar.info/2015/12/09/wordcount-program-in-python-using-apache-spark-for-data-stored-in-hadoop-hdfs/
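For reference, a minimal PySpark WordCount along the lines of what the linked post describes might look like the sketch below. The HDFS paths are placeholders invented for illustration, not the ones from the post.

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")
counts = (sc.textFile("hdfs://namenode:9000/input/data.txt")    # placeholder input path
            .flatMap(lambda line: line.split())                 # line -> words
            .map(lambda word: (word, 1))                        # word -> (word, 1)
            .reduceByKey(lambda a, b: a + b))                   # sum the counts per word
counts.saveAsTextFile("hdfs://namenode:9000/output/wordcount")  # placeholder output path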
Have you been in the situation where you’re about to start a new project and ask yourself, what’s the right tool for the job here? I’ve been in that situation many times and thought it might be useful to share a recent project we did and why we selected Spark, Python, and Parquet. My plan is to take you through a use case that involves loading, transforming, aggregating, and persisting a dataset. We’ll use an open dataset consisting of full fund holdings graciously provided by Morningstar. My goals in presenting this use case are to show the audience how these technologies can be applied to a real-world problem and to inspire members of the audience to start learning these technologies and applying them to their own projects.
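As a hedged sketch of that load/transform/aggregate/persist flow, using the DataFrame API: the file locations and column names below are invented for illustration, not Morningstar's actual schema.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("FundHoldings").getOrCreate()
# Load: read the raw holdings (hypothetical path and columns)
holdings = spark.read.csv("/data/fund_holdings.csv", header=True, inferSchema=True)
# Transform and aggregate: total market value per fund
totals = (holdings
          .filter(F.col("market_value") > 0)
          .groupBy("fund_id")
          .agg(F.sum("market_value").alias("total_value")))
# Persist: write the result as Parquet
totals.write.mode("overwrite").parquet("/data/fund_totals.parquet")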
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming, by Paco Nathan
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro-batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine: the same business logic can be used across multiple use cases, not only streaming but also interactive, iterative, machine learning, and more.
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
Introduction to Apache Spark Workshop at Lambda World 2015 on October 23rd and 24th, 2015, held in Cádiz. Speakers: @fperezp and @juanpedromoreno
Github Repo: https://github.com/47deg/spark-workshop
How Apache Spark fits into the Big Data landscape, by Paco Nathan
Boulder/Denver Spark Meetup, 2014-10-02 @ Datalogix
http://www.meetup.com/Boulder-Denver-Spark-Meetup/events/207581832/
Apache Spark is intended as a general purpose engine that supports combinations of Batch, Streaming, SQL, ML, Graph, etc., for apps written in Scala, Java, Python, Clojure, R, etc.
This talk provides an introduction to Spark — how it provides so much better performance, and why — and then explores how Spark fits into the Big Data landscape — e.g., other systems with which Spark pairs nicely — and why Spark is needed for the work ahead.
Infrastructure for Deep Learning in Apache Spark, by Databricks
In machine learning projects, the preparation of large datasets is a key phase that can be complex and expensive. It was traditionally done by data engineers before the handover to data scientists or ML engineers, who operated in different environments due to the differences in the tools, frameworks, and runtimes required in each phase. Spark's support for different types of workloads brought data engineering closer to the downstream activities, such as machine learning, that depended on the data. Unifying data acquisition, preprocessing, model training, and batch inference on a single platform enabled by Spark not only provided a seamless experience between phases and helped accelerate the end-to-end ML lifecycle, but also lowered the TCO of building and managing the infrastructure. With that, the needs of a shared infrastructure expanded to include specialized hardware such as GPUs and to support deep learning workloads as well. Spark can make effective use of such infrastructure, as it integrates with popular deep learning frameworks and supports accelerating deep learning jobs with GPUs. In this talk, we share learnings and experiences in supporting different types of workloads in shared clusters equipped for deep learning as well as data engineering. We will cover the following topics:
* Considerations for sharing the infrastructure for big data and deep learning in Spark
* Deep learning in Spark in clusters with and without GPUs
* Differences between distributed data processing and distributed machine learning
* Multitenancy and isolation in shared infrastructure
Volodymyr Lyubinets, "Introduction to big data processing with Apache Spark", IT Event
In this talk we’ll explore Apache Spark — the most popular cluster computing framework right now. We’ll look at the improvements that Spark brought over Hadoop MapReduce and what makes Spark so fast; explore Spark programming model and RDDs; and look at some sample use cases for Spark and big data in general.
This talk will be interesting for people who have little or no experience with Spark and would like to learn more about it. It will also be interesting to a general engineering audience as we’ll go over the Spark programming model and some engineering tricks that make Spark fast.
In the session, we discussed the end-to-end working of Apache Spark, focusing on the why, what, and how. We discussed RDDs and the high-level APIs, Dataframe and Dataset. The session also covered Spark's internals.
In this one day workshop, we will introduce Spark at a high level context. Spark is fundamentally different than writing MapReduce jobs so no prior Hadoop experience is needed. You will learn how to interact with Spark on the command line and conduct rapid in-memory data analyses. We will then work on writing Spark applications to perform large cluster-based analyses including SQL-like aggregations, machine learning applications, and graph algorithms. The course will be conducted in Python using PySpark.
2. Outline
I. Distributed Computing at a High Level
II. Disk versus Memory based Systems
III. Spark Core
I. Brief background
II. Benchmarks and Comparisons
III. What is an RDD
IV. RDD Actions and Transformations
V. Caching and Serialization
VI. Anatomy of a Program
VII. The Spark Family
3. Why Distributed Computing?
Divide and Conquer
Problem: A single machine cannot complete the computation at hand.
Solution: Parallelize the job and distribute the work among a network of machines.
4. Issues Arise in Distributed Computing
View the world from the eyes of a single worker:
• How do I distribute an algorithm?
• How do I partition my dataset?
• How do I maintain a single consistent view of a shared state?
• How do I recover from machine failures?
• How do I allocate cluster resources?
• …
8. Disk-Based vs Memory-Based Frameworks
• Disk-based frameworks
‒ Persist intermediate results to disk
‒ Reload data from disk with every query
‒ Easy failure recovery
‒ Best for ETL-like workloads
‒ Examples: Hadoop, Dryad
[Diagram: acyclic data flow from Input through parallel Map tasks into Reduce tasks and on to Output]
Image courtesy of Matei Zaharia, Introduction to Spark
9. Disk-Based vs Memory-Based Frameworks
• Memory-based frameworks
‒ Circumvent the heavy cost of I/O by keeping intermediate results in memory
‒ Sensitive to the availability of memory
‒ Remember the operations applied to the dataset
‒ Best for iterative workloads
‒ Examples: Spark, Flink
[Diagram: the working data set is loaded once from Input and reused in memory across iterations (iter. 1, iter. 2, …)]
Image courtesy of Matei Zaharia, Introduction to Spark
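A small, self-contained PySpark sketch of the memory-reuse idea; the dataset and loop are illustrative, not from the slides.

from pyspark import SparkContext

sc = SparkContext(appName="IterativeDemo")
nums = sc.parallelize(range(1, 1000001)).map(lambda x: x * 2).cache()  # keep the working set in memory
for i in range(5):
    # every pass after the first reads the cached partitions from memory, not from disk
    print(nums.filter(lambda x: x % (i + 2) == 0).count())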
10. The rest of the talk
I. Spark Core
I. Brief background
II. Benchmarks and Comparisons
III. What is an RDD
IV. RDD Actions and Transformations
V. Spark Cluster
VI. Anatomy of a Program
VII. The Spark Family
11. Spark Background
• AMPLab, UC Berkeley
• Project lead: Dr. Matei Zaharia
• The first paper on RDDs was published in 2012
• Open sourced from day one, with a growing number of contributors
• Released version 1.0 in May 2014; currently at 1.2.1
• The Databricks company was established to support Spark and all its related technologies; Matei currently serves as its CTO
• Used by Amazon, Alibaba, Baidu, eBay, Groupon, Ooyala, OpenTable, Box, Shopify, TechBase, Yahoo!
Arose from an academic setting
14. Resilient Distributed Datasets (RDDs)
• The main object in Spark's universe
• Think of an RDD as representing the data at that stage in the operation
• Allows coarse-grained transformations (e.g. map, group-by, join)
• Allows efficient fault recovery using lineage:
‒ Log one operation to apply to many elements
‒ Recompute lost partitions of the dataset on failure
‒ No cost if nothing fails
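The lineage an RDD would replay after a failure can be inspected directly; a minimal PySpark sketch with made-up data:

from pyspark import SparkContext

sc = SparkContext(appName="LineageDemo")
rdd = (sc.parallelize(range(100))
         .map(lambda x: x * 2)
         .filter(lambda x: x % 3 == 0))
# Prints the chain of parent RDDs Spark would recompute for a lost partition
print(rdd.toDebugString())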
15. RDD Actions and Transformations
• Transformations
‒ Lazy operations applied on an RDD
‒ Create a new RDD from an existing RDD
‒ Allow Spark to perform optimizations
‒ e.g. map, filter, flatMap, union, intersection, distinct, reduceByKey, groupByKey
• Actions
‒ Return a value to the driver program after computation
‒ e.g. reduce, collect, count, first, take, saveAsTextFile
Transformations are realized only when an action is called
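A minimal PySpark illustration of that last point, assuming nothing beyond a local SparkContext; the data is made up:

from pyspark import SparkContext

sc = SparkContext(appName="LazyDemo")
words = sc.parallelize(["spark", "hadoop", "spark", "flink"])
pairs = words.map(lambda w: (w, 1))              # transformation: nothing runs yet
counts = pairs.reduceByKey(lambda a, b: a + b)   # still nothing runs
print(counts.collect())                          # action: the whole pipeline executes now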
16. RDD Representation
• Simple common interface:
‒ Set of partitions
‒ Preferred locations for each partition
‒ List of parent RDDs
‒ Function to compute a partition given its parents
‒ Optional partitioning info
• Allows capturing a wide range of transformations
Slide courtesy of Matei Zaharia, Introduction to Spark
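Parts of this interface show through the public PySpark API; a hedged sketch with made-up data:

from pyspark import SparkContext

sc = SparkContext(appName="PartitionDemo")
rdd = sc.parallelize(range(100), 4)      # request 4 partitions
print(rdd.getNumPartitions())            # -> 4
pairs = rdd.map(lambda x: (x % 10, x)).reduceByKey(lambda a, b: a + b)
print(pairs.partitioner)                 # the partitioner that reduceByKey attached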
17. Spark Cluster
Driver
• Entry point of a Spark application
• The main Spark application runs here
• Results of "reduce" operations are aggregated here
20. Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
messages.persist()
[Diagram: the driver ships tasks to three workers, each reading one HDFS block (Blocks 1-3)]
messages.filter(_.contains("foo")).count
messages.filter(_.contains("bar")).count
[Diagram: workers return results to the driver and cache the filtered messages in memory (Msgs. 1-3); labels mark the base RDD, the transformed RDD, and the action]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
Slide courtesy of Matei Zaharia, Introduction to Spark
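For readers following the Python track, the same flow in PySpark might look like the sketch below; the HDFS path stays the slide's placeholder, and the tab-separated log format is the slide's assumption:

from pyspark import SparkContext

sc = SparkContext(appName="LogMining")
lines = sc.textFile("hdfs://...")                             # placeholder path from the slide
errors = lines.filter(lambda line: line.startswith("ERROR"))
messages = errors.map(lambda line: line.split("\t")[2])
messages.persist()                                            # keep in memory across queries
print(messages.filter(lambda m: "foo" in m).count())
print(messages.filter(lambda m: "bar" in m).count())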
21. The Spark Family
Cheaper by the dozen
• Aside from its performance and API, the diverse tool set available in Spark is the reason for its wide adoption:
1. Spark Streaming
2. Spark SQL
3. MLlib
4. GraphX
How do I distribute an algorithm? An example: finding the max.
<slide>
Let’s say your dataset is huge and can’t fit on one machine.
Suggestions for how to find the max?
<slide>
Enter the MapReduce paradigm
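One way to phrase the max example in that style, as a PySpark sketch (the numbers and partition count are made up):

from pyspark import SparkContext

sc = SparkContext(appName="MaxDemo")
nums = sc.parallelize([7, 42, 3, 99, 15], 3)           # 3 partitions as a stand-in for 3 machines
local_max = nums.mapPartitions(lambda it: [max(it)])   # "map" side: one local max per partition
print(local_max.reduce(lambda a, b: max(a, b)))        # "reduce" side: global max -> 99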
Perspective on where Spark lies in the ecosystem of compute frameworks.
Distributed systems have many tradeoffs:
Consistency versus latency ("eventual consistency")
Exactly-once processing versus latency
<slide>
DEMO
Optimization: Spark sees that the map is immediately followed by a reduce, so it doesn't store the intermediate values in memory; it performs the reduction directly as the mapped values are produced.
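A tiny illustrative sketch of that point in PySpark:

from pyspark import SparkContext

sc = SparkContext(appName="PipelineDemo")
# map and reduce run in one stage: each squared value streams straight into the running sum
total = sc.parallelize(range(1, 101)).map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)   # 338350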
** Depending on time, pull up the actions/transformations page
DEMO
Key idea: add "variables" to the "functions" in functional programming (Spark calls these shared variables)
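In Spark those shared variables are broadcast variables and accumulators; a minimal sketch of each (the names and data are illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="SharedVarsDemo")
lookup = sc.broadcast({"a": 1, "b": 2})   # read-only copy shipped once to each worker
hits = sc.accumulator(0)                  # tasks can only add to it; the driver reads it

def score(word):
    if word in lookup.value:
        hits.add(1)                       # note: updates in transformations may re-run on retries
    return lookup.value.get(word, 0)

print(sc.parallelize(["a", "b", "c"]).map(score).sum())  # -> 3
print(hits.value)                                        # -> 2, once the action has run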
DEMO