Using Apache Spark
PRESENTER: SNEHA CHALLA
VENUE: GOOGLE, PITTSBURGH
DATE: AUGUST 25TH, 2015
What is Apache Spark?
Spark is an open-source computation engine that runs on top of the popular Hadoop
Distributed File System (HDFS): a fast, general cluster computing engine for large-scale
data processing.
Efficient
• In-memory computing
• DAG execution engine
• Up to 10× faster than Hadoop MapReduce on disk, 100× faster in memory
Usable
• 2-5× less code
• Rich APIs in Java, Scala, Python
• Interactive shell
Why Spark? Runs Everywhere
• Standalone Mode (private cluster)
• Apache Mesos – cluster manager
• Hadoop YARN
• Amazon EC2 (prepared deployment)
Spark Community
Spark was initially developed at UC Berkeley's AMPLab and is now used and developed by a wide
variety of companies.
 One of the most active open source projects in big data
 More than 150 contributors in the past year
 25+ companies contributing, and growing.
Spark was designed both to make traditional MapReduce programming easier and to support
new types of applications, with one of the earliest focus areas being machine learning. Spark can
be used to build fast end-to-end machine learning workflows.
SPARK COMMUNITY
Elephant in the room – MapReduce
Why Apache Spark?
Spark vs. MapReduce
 Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
 Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
Why Spark?
Spark Installation
http://spark.apache.org/downloads.html
Spark Installation
• Extract the compressed folder spark-1.3.0-bin-hadoop2.4
• From a terminal, go to spark-1.3.0-bin-hadoop2.4/bin
• Run pyspark
• Run rdd = sc.parallelize([0, 1, 2]);
rdd.map(lambda x: x*x).collect()
• Get the result [0, 1, 4]
• It's that easy!
 Windows users might need to download and run an additional winutils.exe for smooth
running of applications
 Download winutils here:
http://www.srccodes.com/p/article/39/error-util-shell-failed-locate-winutils-binary-hadoop-binary-path
and add it to $HADOOP_HOME/bin
• Download a bigger zip (1.9 GB) from http://bit.ly/1FpZAXH
Interactive Shell & Spark Context
Interactive Shell:
The fastest way to learn Spark
Available in Python and Scala
Runs as an application on an existing Spark cluster, or can run locally
Spark Context:
Main entry point to Spark functionality
Available in the shell as the variable sc
In standalone programs, you make your own by constructing a SparkContext object
(see the sketch below)
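A minimal sketch of constructing a SparkContext yourself in a standalone PySpark script (the app name and master URL below are placeholders):
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("MyApp").setMaster("local[2]")   # illustrative values
sc = SparkContext(conf=conf)
print(sc.parallelize([1, 2, 3]).count())   # quick sanity check: prints 3
sc.stop()                                  # release the context when the program finishes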
Key Concepts – RDD Distributional Model
RDD – Resilient Distributed Dataset
Programs are written in terms of transformations on these distributed datasets
• RDD = Resilient Distributed Dataset
• Transformations convert one RDD into another
• No actual calculation happens when a transformation is issued
• Actions force calculation of the result
• Lazy evaluation
RDDs – Motivation
RDDs are motivated by two types of applications that current data flow systems handle
inefficiently:
1) Iterative algorithms – common in graph applications and machine learning
2) Interactive data mining tools
To achieve fault tolerance efficiently, RDDs provide a highly restricted form of shared memory:
they are read-only datasets that can only be constructed through bulk operations on other
RDDs.
However, RDDs are expressive enough to capture a wide class of computations, including
MapReduce and specialized computations.
Spark Architecture
Basic Abstraction: RDDs
Immutability
Laziness
Transformations
Programming Model
Two types of operations on an RDD:
• transformations
• actions
Transformations are lazily evaluated – they are not executed when you issue the
command.
RDDs are computed (and recomputed, if needed) only when an action is executed.
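A small illustration of this laziness (values below chosen just for the example):
rdd = sc.parallelize(range(5))
squares = rdd.map(lambda x: x * x)   # transformation only – nothing runs yet
squares.collect()                    # the action triggers the actual computation
# [0, 1, 4, 9, 16]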
Data distribution/partitioning
RDD - read only collection of objects that are partitioned across a set of machines.
RDD Computational Model
• Operators on RDDs form a directed acyclic graph.
• If any partition is lost (for example, on a dead worker), it can be recomputed by retracing the operator DAG.
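One way to peek at the lineage Spark keeps for recovery is toDebugString(); a quick sketch (the chain of operations is made up for illustration):
rdd = sc.parallelize(range(10), 2)
result = rdd.map(lambda x: x + 1).filter(lambda x: x % 2 == 0)
print(result.toDebugString())   # prints the chain of parent RDDs (the operator DAG)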
FIRST STEP IN DATA ANALYSIS:
Create an RDD
Read data from a text file on the local machine, S3 or HDFS into an RDD.
Give life to an RDD using the Spark Context
# Turn a Python collection into an RDD
>sc.parallelize([7, 8, 9])
# Load a text file from local FS, HDFS, or S3
>sc.textFile("textfile.txt")
>sc.textFile("directory/*.txt")
>sc.textFile("hdfs://namenode:9000/path/file")
Transformations – RDD: map
Pass each element of an RDD through a function
>>>rdd = sc.parallelize(range(1,8))
>>>result_rdd = rdd.map(lambda x: x%3)
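Collecting the result of the map example above (the output follows from range(1,8) = [1..7]):
>>>result_rdd.collect()
[1, 2, 0, 1, 2, 0, 1]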
RDD: reduce (note: reduce is an action, not a transformation)
>>>from operator import add
>>>rdd = sc.parallelize(range(1,8))
>>>result = rdd.reduce(add)   # 28
RDD Transformations: reduceByKey
>>>from operator import add
>>>rdd = sc.parallelize([('Alice',23), ('Bob',17), ('Alice',27)])
>>>result_rdd = rdd.reduceByKey(add)
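Collecting the reduceByKey result from the example above (the order of the pairs may vary):
>>>result_rdd.collect()
[('Alice', 50), ('Bob', 17)]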
Some more RDD Transformations
rdd.flatMap(f): Return a new RDD by first applying a function to all elements of this RDD, and
then flattening the results.
>>> rdd = sc.parallelize([2,3,4])
>>> sorted(rdd.flatMap(lambda x: range(1,x)).collect())   # collect is the action applied to the transformation
[1, 1, 1, 2, 2, 3]
 rdd.filter(f): Return a new RDD containing only the elements that satisfy a predicate.
>>> rdd = sc.parallelize([1,2,3,4,5])
>>> rdd.filter(lambda x: x % 2 == 0).collect()
[2, 4]
Some more RDD Transformations
sortBy(self, keyfunc, ascending=True, numPartitions=None)
Sorts this RDD by the given keyfunc
>>> data = [('a',1), ('b',2), ('1',3), ('d',4), ('2',5)]
>>> rdd = sc.parallelize(data).sortBy(lambda x: x[0])
 rdd.cache(): Cache the RDD in memory for repeated use
countByKey(self):
>>> rdd = sc.parallelize([("a",1), ("b",1), ("a",1)])
>>> rdd.countByKey().items()
 join(self, other, numPartitions=None): Return an RDD containing all pairs of elements with
matching keys in self and other (see the join sketch below).
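A small join sketch, with sample pairs made up for illustration (keys without a match on both sides are dropped, as in an inner join):
>>> x = sc.parallelize([('a', 1), ('b', 4)])
>>> y = sc.parallelize([('a', 2), ('a', 3)])
>>> sorted(x.join(y).collect())
[('a', (1, 2)), ('a', (1, 3))]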
Setting the level of parallelism
All the pair RDD operations take an optional second parameter for number of tasks.
> rdd.reduceByKey(lambda x, y: x + y, 5)
>rdd.groupByKey(5)
>rdd.join(pageViews, 5)
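You can check the resulting level of parallelism with getNumPartitions(); a quick sketch (the counts are illustrative):
>>> rdd = sc.parallelize(range(100), 8)
>>> rdd.getNumPartitions()
8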
Some RDD Actions
RDD transformations are lazily evaluated. Actions kick off the computation on transformations.
E.g.: collect(), glom(), etc.
 rdd.collect(): Return the RDD content as a list
rdd = sc.parallelize([1,2,3], 3)
rdd2 = rdd.map(lambda x: x*x)
rdd2.collect(): [1, 4, 9]
 rdd.glom().collect(): Return a list of lists, one list per partition
rdd = sc.parallelize([0,1,2], 3)
rdd2 = rdd.map(lambda x: x*x)
rdd2.glom().collect(): [[0], [1], [4]]
 saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a
given directory in the local filesystem, HDFS or any other Hadoop-supported file system.
 take(n): Return an array with the first n elements of the dataset.
 first(): Return the first element of the dataset. Similar to take(1).
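take() and first() on a small example RDD (values chosen for illustration):
>>> rdd = sc.parallelize([5, 6, 7, 8])
>>> rdd.take(2)
[5, 6]
>>> rdd.first()
5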
In MapReduce you get only two operators – map and reduce.
Whereas Spark offers 80+ operations!
Automatic parallelization of workflows on Spark: a whole series of
individual tasks is expressed as a single program flow that is lazily evaluated, so
that the system has a complete picture of the execution graph.
Word Count
from pyspark import SparkContext
logFile = "hdfs://localhost:9000/user/bigdatavm/input"
sc = SparkContext("spark://bigdata-vm:7077", "WordCount")
textFile = sc.textFile(logFile)
wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
wordCounts.saveAsTextFile("hdfs://localhost:9000/user/bigdatavm/output")
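To try the same pipeline without a cluster, it can be run in local mode; a sketch with placeholder file names:
from pyspark import SparkContext
sc = SparkContext("local[2]", "WordCountLocal")   # local mode, 2 worker threads
counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("wordcount_output")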
Fault tolerant – Persistent
RDDs track lineage information that can be used to efficiently recompute lost data
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])
Spark will persist or cache RDD slices in memory on each node during operations.
You can mark an RDD to be persisted with the cache() method, or with persist() along with a storage level (see the sketch below).
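A minimal sketch of the two ways to mark an RDD for persistence (the storage level shown is just one option):
from pyspark import StorageLevel
msgs.cache()                                    # shorthand for MEMORY_ONLY
# or, with an explicit storage level instead of cache():
# msgs.persist(StorageLevel.MEMORY_AND_DISK)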
Spark Libraries
MLlib
 Spark subproject providing machine learning primitives. Initial contribution from the AMPLab @ UC Berkeley.
Shipped with Spark since version 0.8
35 contributors
Highlights include:
 Basic statistics – summary statistics, correlation and stratified sampling,
hypothesis testing, random data generation
 Linear models (logistic and linear regression, SVMs)
 Naive Bayes and decision tree classifiers, ensembles of trees
 Collaborative filtering with ALS
 K-Means clustering and Gaussian mixtures
 Stochastic gradient descent
 SVD (singular value decomposition) and PCA
MLlib is Spark's scalable machine learning library, consisting of common learning
algorithms and utilities, including classification, regression, clustering, collaborative
filtering and dimensionality reduction, as well as the underlying optimization primitives.
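A small sketch of the basic-statistics API mentioned above (the sample vectors are made up for illustration):
from numpy import array
from pyspark.mllib.stat import Statistics
mat = sc.parallelize([array([1.0, 10.0]), array([2.0, 20.0]), array([3.0, 30.0])])
summary = Statistics.colStats(mat)
print(summary.mean())       # per-column means
print(summary.variance())   # per-column variances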
Running a Spark Application
Command: spark-submit <python_file_path>
Let's see the implementation of
1) K-Means
2) Logistic Regression
K-Means
# Import the required pyspark functions
from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt
from pyspark import SparkContext
sc = SparkContext()
data = sc.textFile("C:/Users/snehachalla/Downloads/spark-1.4.1-bin-hadoop2.4/spark-1.4.1-bin-hadoop2.4/bin/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')])).cache()
K-Means (cont.)
# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=1, initializationMode="k-means||")
# Evaluate clustering by computing the sum of squared errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))
cost = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Sum of squared error = " + str(cost))
Logistic Regression
from pyspark.mllib.classification import LogisticRegressionWithSGD
from numpy import array
# Load and parse the data
data = sc.textFile("mllib/data/sample_svm_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
# Build the model
model = LogisticRegressionWithSGD.train(parsedData)
# Evaluate the model on training data
labelsAndPreds = parsedData.map(lambda point: (int(point.item(0)),
    model.predict(point.take(range(1, point.size)))))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))
Spark UI
Run Spark in local mode (pyspark).
The Spark UI is at http://localhost:4040
You will be able to see the RDD sizes and identify slow-running tasks.
MOVING TO A CLUSTER – EC2
Setting up an EMR Cluster
If your data is too large to compute on your local machine, then you're in the right
place. An easy way to get Spark running is with EC2.
Create an account on aws.amazon.com
Get a keypair from the AWS Console: this is the security credential for your instance
https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#KeyPairs:sort=keyName
 Create an EMR instance and configure the nodes:
https://console.aws.amazon.com/console/home?region=us-east-1#
 Launch the EMR instance
https://console.aws.amazon.com/ec2/v2/home?region=us-east-1
EMR Cluster
https://console.aws.amazon.com/s3/home?region=us-east-1
Data can be uploaded to a bucket on Amazon S3.
For more info on Spark
 Website: http://spark.apache.org
 Tutorials: http://ampcamp.berkeley.edu
 Spark Summit: http://spark-summit.org
 Github: https://github.com/apache/spark
 Mailing lists: user@spark.apache.org, dev@spark.apache.org
 Python API documentation: http://spark.apache.org/docs/latest/api/python/
Questions?
THANK YOU
References
http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.pdf
http://www.andrew.cmu.edu/user/amaurya/docs/spark_talk/presentation.pdf
http://www.slideshare.net/BenjaminBengfort/fast-data-analytics-with-spark-and-python
Editor's Notes

  1. MapReduce has been around as the major framework for distributed computing for 10 years – that is pretty old in technology time! Well-known limitations include: (1) Programmability – requires multiple chained MR steps, and specialized systems for applications; (2) Performance – writes to disk between each computational step, and is expensive for apps to "reuse" data (iterative algorithms, interactive analysis). Most machine learning algorithms are iterative.
  2. Spark provides an efficient way of solving iterative algorithms by keeping the intermediate data in memory. This avoids the overhead of reading/writing the intermediate data to and from disk, as happens with MR. Also, when running the same operation again and again, data can be cached in and fetched from memory without performing the same operation again. MR is stateless: if a program/application in MR is executed 10 times, the whole data set has to be scanned 10 times.
  3. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.pdf
  4. Comprehensive list of actions: http://spark.apache.org/docs/latest/programming-guide.html#actions
  5. When constructing a complex pipeline of MapReduce jobs, the task of correctly parallelizing the sequence of jobs is left to you. Thus, a scheduler tool such as Apache Oozie is often required to carefully construct this sequence. With Spark, a whole series of individual tasks is expressed as a single program flow that is lazily evaluated so that the system has a complete picture of the execution graph. This approach allows the core scheduler to correctly map the dependencies across different stages in the application, and automatically parallelize the flow of operators without user intervention.
  6. Spark allows you to access these operators in the context of a full programming language: you can use control statements, functions, and classes as you would in a typical programming environment. This capability also has the property of enabling certain optimizations to the engine while reducing the burden on the application developer. Win, and win again!
  7. Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3. You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. It can access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source. The RDD data model and cached memory computing allow Spark to quickly and easily solve similar workflows and use cases to those that are part of Hadoop. Spark has a series of high-level tools at its disposal that are added as component libraries, not integrated into the general computing framework.
  8. Know more here: http://spark.apache.org/docs/latest/mllib-guide.html