Machine Learning (on Spark)
BDS Session
Session objective
Introduction Quick Recap Spark
Machine Learning (on Spark)
Spark: Introduction, Architecture and Features
Spark: Resilient Distributed Datasets, Transformation, Examples
Regression, Classification, Collaborative Filtering, & Clustering.
SUMMARY
Shortcomings of MapReduce
 Forces your data processing into Map and Reduce
 Other workflows are missing, e.g. join, filter, flatMap, groupByKey, union, intersection, …
 Based on “Acyclic Data Flow” from disk to disk (HDFS)
 Reads and writes to disk before and after Map and Reduce (stateless machine)
 Not efficient for iterative tasks, e.g. machine learning
 Only Java is natively supported
 Support for other languages is needed
 Only for batch processing
 No support for interactivity or streaming data
Introduction
One Solution is Apache Spark
 A new general framework that solves many of the shortcomings of MapReduce
 It is capable of leveraging the Hadoop ecosystem, e.g. HDFS, YARN, HBase, S3, …
 Has many other workflows, e.g. join, filter, flatMap, distinct, groupByKey, reduceByKey, sortByKey, collect, count, first, …
 (around 30 efficient distributed operations)
 In-memory caching of data (for iterative, graph, and machine
learning algorithms, etc.)
 Native Scala, Java, Python, and R support
 Supports interactive shells for exploratory data analysis
 Spark API is extremely simple to use
 Developed at UC Berkeley’s AMPLab, now an Apache Software Foundation project backed by Databricks
Introduction Quick Recap Spark
 expressive computing system, not limited to the map-reduce model
 makes use of system memory
 avoids saving intermediate results to disk
 caches data for repetitive queries (e.g. for machine learning)
 compatible with Hadoop
Spark Ideas
Spark Uses Memory instead of Disk
[Figure: two iterations (Iteration 1, Iteration 2) under each system]
Hadoop: Use Disk for Data Sharing (HDFS read and HDFS write around every iteration)
Spark: In-Memory Data Sharing (one HDFS read, then data passed between iterations in memory)
Spark: Introduction
Sort competition
                          Hadoop MR Record (2013)         Spark Record (2014)
Data Size                 102.5 TB                        100 TB
Elapsed Time              72 mins                         23 mins
# Nodes                   2100                            206
# Cores                   50400 physical                  6592 virtualized
Cluster disk throughput   3150 GB/s (est.)                618 GB/s
Network                   dedicated data center, 10Gbps   virtualized (EC2) 10Gbps network
Sort rate                 1.42 TB/min                     4.27 TB/min
Sort rate/node            0.67 GB/min                     20.7 GB/min
Sort benchmark, Daytona Gray: sort of 100 TB of data (1 trillion records), http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Spark, 3x faster with 1/10 the nodes
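The headline claim of the sort competition can be checked with a little arithmetic. The sketch below is plain Python (no Spark needed), uses only the figures quoted for the two records, and assumes decimal units (1 TB = 1000 GB); the variable and function names are illustrative.

```python
# Figures quoted for the two sort-benchmark records.
hadoop = {"minutes": 72, "nodes": 2100, "tb_per_min": 1.42}
spark = {"minutes": 23, "nodes": 206, "tb_per_min": 4.27}

GB_PER_TB = 1000  # assumption: decimal terabytes


def per_node_rate(entry):
    """Sort rate per node in GB/min, from the cluster-wide TB/min rate."""
    return entry["tb_per_min"] * GB_PER_TB / entry["nodes"]


speedup = hadoop["minutes"] / spark["minutes"]   # roughly 3x faster
node_ratio = spark["nodes"] / hadoop["nodes"]    # roughly 1/10 of the nodes

print(f"speedup: {speedup:.2f}x, node ratio: {node_ratio:.3f}")
print(f"Hadoop per node: {per_node_rate(hadoop):.2f} GB/min")
print(f"Spark per node: {per_node_rate(spark):.2f} GB/min")
```

The per-node rates come out close to the quoted 0.67 and 20.7 GB/min; small differences are due to rounding in the quoted figures.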
Spark: Introduction
Apache Spark supports data analysis, machine learning, graphs, streaming data, etc. It can read/write from a range of data sources and allows development in multiple languages.
Components: Spark Core, Spark Streaming, MLlib, GraphX, ML Pipelines, Spark SQL, DataFrames
Languages: Scala, Java, Python, R, SQL
Data Sources: Hadoop HDFS, HBase, Hive, Amazon S3, Streaming, JSON, MySQL, and HPC-style (GlusterFS, Lustre)
Spark: Introduction
Spark: Architecture
Spark: Features
 Streaming Data
 Machine Learning
 Interactive Analysis
 Data Warehousing
 Batch Processing
 Exploratory Data Analysis
 Graph Data Analysis
 Spatial (GIS) Data Analysis
 And many more
Spark’s Main Use Cases
 Fingerprint Matching
 Developed Spark-based code for fingerprint minutia detection and fingerprint matching
 Twitter Sentiment Analysis
 Developed Spark-based sentiment analysis code for a Twitter dataset
Spark’s Main Use Cases
 Uber – the online taxi company gathers terabytes of event data from its mobile users every day.
 Uses Kafka, Spark Streaming, and HDFS to build a continuous ETL pipeline
 Converts raw unstructured event data into structured data as it is collected
 Uses it further for more complex analytics and optimization of operations
 Pinterest – uses a Spark ETL pipeline
 Leverages Spark Streaming to gain immediate insight into how users all over the world are engaging with Pins in real time
 Can make more relevant recommendations as people navigate the site
 Recommends related Pins
 Helps users determine which products to buy or destinations to visit
Spark’s Main Use Cases
Here are a few other real-world use cases:
 Conviva – 4 million video feeds per month
 This streaming video company is second only to YouTube.
 Uses Spark to reduce customer churn by optimizing video streams and managing live video traffic
 Maintains a consistently smooth, high-quality viewing experience.
 Capital One – uses Spark and data science algorithms to understand customers better.
 Developing the next generation of financial products and services
 Finding attributes and patterns that indicate an increased probability of fraud
 Netflix – leverages Spark for insights into user viewing habits and then recommends movies to users.
 User data is also used for content creation
Spark’s Main Use Cases
 Even though Spark is versatile, that doesn’t mean Spark’s in-memory capabilities are the best fit for all use cases:
 For many simple use cases, Hadoop MapReduce and Hive might be a more appropriate choice
 Spark was not designed as a multi-user environment
 Spark users are required to know whether the memory they have is sufficient for a dataset
 Adding more users adds complications, since the users will have to coordinate memory usage to run code
Spark: when not to use
 Clouds and supercomputers are collections of
computers networked together in a
datacenter
 Clouds have different networking, I/O, CPU
and cost trade-offs than supercomputers
 Cloud workloads are data oriented vs.
computation oriented and are less closely
coupled than supercomputers
 Principles of parallel computing same on both
 Apache Hadoop and Spark vs. Open MPI
HPC and Big Data Convergence
 RDDs (Resilient Distributed Datasets) are data containers
 All the different processing components in Spark share the same abstraction, called the RDD
 As applications share the RDD abstraction, you can mix different kinds of transformations to create new RDDs
 Created by parallelizing a collection or reading a file
 Fault tolerant
Resilient Distributed Datasets (RDDs)
RDD abstraction
 Resilient Distributed Datasets
 partitioned collection of records
 spread across the cluster
 read-only
 caching dataset in memory
 different storage levels available
 fallback to disk possible
RDD operations
 transformations to build RDDs through deterministic
operations on other RDDs
 transformations include map, filter, join
 lazy operation
 actions to return value or export data
 actions include count, collect, save
 triggers execution
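The split between lazy transformations and actions that trigger execution can be illustrated without a cluster. The following is a toy, single-machine model in plain Python, not the Spark API: the class `ToyRDD` and the helper `double` are hypothetical names invented for this sketch.

```python
class ToyRDD:
    """A toy, single-machine stand-in for an RDD (not the Spark API)."""

    def __init__(self, source, ops=()):
        self._source = source      # the original records
        self._ops = tuple(ops)     # recorded transformations (the lineage)

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, func):
        return ToyRDD(self._source, self._ops + (("map", func),))

    def filter(self, pred):
        return ToyRDD(self._source, self._ops + (("filter", pred),))

    def _compute(self):
        records = iter(self._source)
        for kind, func in self._ops:
            if kind == "map":
                records = map(func, records)
            else:
                records = filter(func, records)
        return records

    # Actions: run the recorded pipeline and return a value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []  # records every time the map function actually runs


def double(x):
    calls.append(x)
    return 2 * x


rdd = ToyRDD([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
print(len(calls))     # 0: nothing has executed yet
print(rdd.collect())  # [6, 8]: the action triggers execution
```

Because each transformation only records what to do, the same recorded pipeline can be re-run from the source, which is the idea behind lineage-based recovery.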
Job example
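// Load the log file as an RDD of lines (lazy: nothing is read yet)
val log = sc.textFile("hdfs://...")
// Keep only the lines that contain "ERROR"
val errors = log.filter(_.contains("ERROR"))
// Mark the filtered RDD to be kept in memory once it has been computed
errors.cache()
// Actions: the first count computes and caches errors, the second reuses the cache
errors.filter(_.contains("I/O")).count()
errors.filter(_.contains("timeout")).count()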
[Figure: a driver and three workers; each worker holds one block of the file (Block1, Block2, Block3) and a cache; the action triggers execution]
RDD partition-level view
[Figure: dataset-level view and partition-level view of the job]
log: HadoopRDD, path = hdfs://...
errors: FilteredRDD, func = _.contains(…), shouldCache = true
Partition-level view: one task per partition (Task 1, Task 2, ...)
source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
Job scheduling
rdd1.join(rdd2)
.groupBy(…)
.filter(…)
 RDD Objects: build operator DAG
 DAGScheduler: split graph into stages of tasks; submit each stage as ready
 TaskScheduler: launch tasks via cluster manager; retry failed or straggling tasks
 Worker: execute tasks; store and serve blocks (threads, block manager)
Flow: RDD Objects → (DAG) → DAGScheduler → (TaskSet) → TaskScheduler → (Task) → Worker
source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
Available APIs
 You can write in Java, Scala or Python
 interactive interpreter: Scala & Python only
 standalone applications: any
 performance: Java & Scala are faster thanks to static
typing
Summary
 concept not limited to single pass map-reduce
 avoid storing intermediate results on disk or HDFS
 speed up computations when reusing datasets
 DataFrames (DFs) are another kind of distributed dataset, organized into named columns
 Similar to a table in a relational database, a Python pandas DataFrame, or R’s DataTables
 Immutable once constructed
 Track lineage
 Enable distributed computations
 How to construct DataFrames
 Read from file(s)
 Transform an existing DF (Spark or pandas)
 Parallelize a Python collection (list)
 Apply transformations and actions
DataFrames & SparkSQL
DataFrame example
# Create a new DataFrame that contains "students"
students = users.filter(users.age < 21)
# Alternatively, using pandas-like syntax
students = users[users.age < 21]
# Count the number of students by gender
students.groupBy("gender").count()
# Join young students with another DataFrame called logs
students.join(logs, logs.userId == users.userId, "left_outer")
Resilient Distributed Datasets (RDDs)
 RDDs provide a low level interface into Spark
 DataFrames have a schema
 DataFrames are cached and optimized by Spark
 DataFrames are built on top of the RDDs and the
core Spark API
Example: performance
RDDs vs. DataFrames
Transformations (create a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, intersection, flatMap, union, join, cogroup, cross, mapValues
Actions (return results to driver program): collect, first, reduce, take, count, takeOrdered, takeSample, countByKey, save, lookupKey, foreach
Spark Operations
[Figure: DAG with nodes labelled A, B, S, C, E, D, F]
DAGs track dependencies (also known as lineage)
 nodes are RDDs
 arrows are transformations
[Figure: a narrow transformation (map) next to a wide transformation (groupByKey), which gathers (A,1) and (A,2) into (A,[1,2])]
Narrow Vs. Wide Transformation
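The difference between the two kinds of transformation can be sketched in plain Python. This is a toy model rather than Spark code: partitions are ordinary lists, and the partitioner (sum of character codes modulo the partition count) is an assumption made so the result is deterministic.

```python
from collections import defaultdict

# Two partitions of (key, value) records; the layout is illustrative.
partitions = [[("A", 1), ("B", 5)], [("A", 2), ("B", 7)]]


def narrow_map(parts, func):
    """Narrow: each output partition depends on exactly one input partition."""
    return [[func(record) for record in part] for part in parts]


def wide_group_by_key(parts, num_partitions=2):
    """Wide: records with the same key are shuffled into one partition."""
    shuffled = [defaultdict(list) for _ in range(num_partitions)]
    for part in parts:
        for key, value in part:
            # Deterministic toy partitioner (an assumption of this sketch).
            target = sum(map(ord, key)) % num_partitions
            shuffled[target][key].append(value)
    return [sorted(group.items()) for group in shuffled]


doubled = narrow_map(partitions, lambda kv: (kv[0], kv[1] * 2))
grouped = wide_group_by_key(partitions)
print(doubled)
print(grouped)
```

In the narrow case no record leaves its partition; in the wide case every output partition may need records from every input partition, which is why wide transformations require a shuffle.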
Spark Workflow
[Figure: the driver program holds the SparkContext; the workflow applies flatMap, map, and groupByKey, and collect returns the result to the driver program]
 What is an action
 The final stage of the workflow
 Triggers the execution of the DAG
 Returns the results to the driver
 Or writes the data to HDFS or to a file
Actions
 Word count
text_file = sc.textFile("hdfs://usr/godil/text/book.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://usr/godil/output/wordCount.txt")
Examples from http://spark.apache.org/
RDD API Examples
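To see what each step of the word count produces, the same three stages can be traced on a tiny input in plain Python. This is a local sketch, not Spark code: the two sample lines stand in for the text file, and ordinary lists and a dictionary stand in for RDDs.

```python
from collections import defaultdict

lines = ["to be or not to be", "to do is to be"]  # stand-in for the text file

# flatMap: one line becomes many words.
words = [word for line in lines for word in line.split(" ")]

# map: each word becomes a (word, 1) pair.
pairs = [(word, 1) for word in words]

# reduceByKey: combine the values of each key with a + b.
counts = defaultdict(int)
for word, one in pairs:
    counts[word] = counts[word] + one

print(sorted(counts.items()))
```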
Logistic Regression
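from pyspark.ml.classification import LogisticRegression
# `sqlContext` and `data` are assumed to exist already.
# Every record of this DataFrame contains the label and
# features represented by a vector.
df = sqlContext.createDataFrame(data, ["label", "features"])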
# Set parameters for the algorithm.
# Here, we limit the number of iterations to 10.
lr = LogisticRegression(maxIter=10)
# Fit the model to the data.
model = lr.fit(df)
# Given a dataset, predict each point's label, and show the results.
model.transform(df).show()
 RDD Persistence
 RDD.persist()
 Storage level:
 MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, DISK_ONLY, …
 RDD Removal
 RDD.unpersist()
RDD Persistence and Removal
Broadcast Variables and Accumulators
(Shared Variables )
 Broadcast variables allow the programmer to keep a read-only
variable cached on each node, rather than sending a copy of it
with tasks
>>> broadcastV1 = sc.broadcast([1, 2, 3, 4, 5, 6])
>>> broadcastV1.value
[1, 2, 3, 4, 5, 6]
 Accumulators are variables that are only “added” to through an associative and commutative operation and can therefore be efficiently supported in parallel
>>> accum = sc.accumulator(0)
>>> sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
>>> accum.value
10
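Why accumulators can be supported efficiently in parallel can be shown with a small local sketch. It is plain Python rather than Spark code, and the three-way partitioning is hypothetical: each task adds into its own partial result, and the partial results are merged afterwards in an arbitrary order.

```python
from functools import reduce
from itertools import permutations

# Values split across three hypothetical partitions.
partitions = [[1, 2], [3], [4]]

# Each task adds locally to its own copy, starting from the zero value.
partial_sums = [sum(part, 0) for part in partitions]

# The driver merges the partial results; because addition is associative
# and commutative, every merge order gives the same total.
totals = {
    reduce(lambda a, b: a + b, order, 0)
    for order in permutations(partial_sums)
}
print(partial_sums, totals)
```

If the merge operation depended on order, different task completion orders could give different answers, which is why only "add"-style updates are allowed.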
Regression, Classification, Collaborative Filtering, &
Clustering.
Regression
Logistic regression is a popular supervised machine learning algorithm that can be used to predict a categorical response. It is used to solve classification problems.
Classification involves looking at data and assigning a class (or a label) to it. Usually there is more than one class; when there are two classes (0 or 1), the task is called binary classification.
Here, the logistic regression algorithm is used to build a machine learning model with Apache Spark.
The model is a binary classifier that predicts a student’s exam pass/fail result based on past exam scores.
π = Proportion of “Success”
In ordinary regression the model predicts the mean Y for any
combination of predictors.
What’s the “mean” of a 0/1 indicator variable?
$$\bar{y} = \frac{\sum y_i}{n} = \frac{\#\text{ of 1's}}{\#\text{ of trials}} = \text{Proportion of ``success''}$$
Goal of logistic regression: Predict the “true” proportion of success, π, at
any value of the predictor.
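A quick numeric check makes the point about the mean of a 0/1 indicator concrete. The outcome list below is made-up illustrative data, not taken from any dataset in this session.

```python
# Hypothetical pass/fail outcomes: 1 = success, 0 = failure.
outcomes = [1, 0, 1, 1, 0, 1, 0, 1]

mean_y = sum(outcomes) / len(outcomes)          # ordinary mean of y
proportion = outcomes.count(1) / len(outcomes)  # (# of 1's) / (# of trials)

print(mean_y, proportion)
```

Both expressions give the same number, so predicting the mean of a binary response is the same as predicting the proportion of successes.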
Regression
Binary Logistic Regression Model
Y = Binary response X = Quantitative predictor
π = proportion of 1’s (yes,success) at any X
Equivalent forms of the logistic regression model:
Logit form:
$$\log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 X$$
Probability form:
$$\pi = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
N.B.: This is natural log (aka “ln”)
What does this look like?
Regression
[Figure: function plot of $y = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}$; x axis from -10 to 12, y axis from 0.2 to 1.0]
Logit Function
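The shape of the curve and the link between the two forms of the model can be explored numerically. The sketch below is plain Python; the coefficients `b0` and `b1` are arbitrary illustrative values, not fitted ones, and the function names are chosen for this example.

```python
import math


def logistic(x, b0, b1):
    """Probability form: pi = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))


def logit(p):
    """Logit form: natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


b0, b1 = -1.0, 0.5  # illustrative coefficients, not fitted values

for x in (-10, 0, 2, 10):
    p = logistic(x, b0, b1)
    print(f"x={x:>3}  pi={p:.4f}  logit(pi)={logit(p):.4f}")
```

The printed probabilities stay between 0 and 1 and rise with x, while applying the logit to each probability recovers the linear predictor b0 + b1*x.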
Regression
Collaborative Filtering, & Clustering.
THANK YOU
Image Source
searchenterpriseai.techtarget.com
wikipedia

Runway Orientation Based on the Wind Rose Diagram.pptx
SupreethSP4
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
Divya Somashekar
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
gdsczhcet
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
FluxPrime1
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
BrazilAccount1
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
AJAYKUMARPUND1
 

Recently uploaded (20)

Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
 
ethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.pptethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.ppt
 
AP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specificAP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specific
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
 
ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
space technology lecture notes on satellite
space technology lecture notes on satellitespace technology lecture notes on satellite
space technology lecture notes on satellite
 
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwood
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
Runway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptxRunway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptx
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
 

Bds session 13 14

  • 1. Machine Learning (on Spark) BDS Session
  • 2. Session objective Introduction Quick Recap Spark Machine Learning (on Spark) Spark: Introduction, Architecture and Features Spark: Resilient Distributed Datasets, Transformation, Examples Regression, Classification, Collaborative Filtering, & Clustering. SUMMARY
  • 3. Shortcomings of MapReduce  Forces your data processing into Map and Reduce steps  Other operations are missing, including join, filter, flatMap, groupByKey, union, intersection, …  Based on an “acyclic data flow” from disk to disk (HDFS)  Reads from and writes to disk before and after Map and Reduce (stateless machine)  Not efficient for iterative tasks, e.g. machine learning  Only Java is natively supported  Support for other languages is needed  Only for batch processing  Interactivity and streaming data are not supported Introduction
  • 4. One Solution is Apache Spark  A new general framework that addresses many of the shortcomings of MapReduce  It is capable of leveraging the Hadoop ecosystem, e.g. HDFS, YARN, HBase, S3, …  Has many other operations, e.g. join, filter, flatMap, distinct, groupByKey, reduceByKey, sortByKey, collect, count, first, …  (around 30 efficient distributed operations)  In-memory caching of data (for iterative, graph, and machine learning algorithms, etc.)  Native Scala, Java, Python, and R support  Supports interactive shells for exploratory data analysis  Spark API is extremely simple to use  Developed at AMPLab, UC Berkeley, now by Databricks.com Introduction Quick Recap Spark
  • 5.  expressive computing system, not limited to the map-reduce model  makes use of system memory  avoids saving intermediate results to disk  caches data for repetitive queries (e.g. for machine learning)  compatible with Hadoop Spark Ideas
  • 6. Spark Uses Memory instead of Disk [Figure: Hadoop uses disk for data sharing (HDFS read, Iteration 1, HDFS write, HDFS read, Iteration 2, HDFS write); Spark uses in-memory data sharing (HDFS read, Iteration 1, Iteration 2).] Spark: Introduction
  • 7. Sort competition (Sort benchmark, Daytona Gray: sort of 100 TB of data, 1 trillion records)
                                Hadoop MR Record (2013)          Spark Record (2014)
      Data Size                 102.5 TB                         100 TB
      Elapsed Time              72 mins                          23 mins
      # Nodes                   2100                             206
      # Cores                   50400 physical                   6592 virtualized
      Cluster disk throughput   3150 GB/s (est.)                 618 GB/s
      Network                   dedicated data center, 10Gbps    virtualized (EC2) 10Gbps network
      Sort rate                 1.42 TB/min                      4.27 TB/min
      Sort rate/node            0.67 GB/min                      20.7 GB/min
    Source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
    Spark, 3x faster with 1/10 the nodes Spark: Introduction
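The headline claim can be checked against the benchmark figures above. The short Python sketch below recomputes the speedup, the node ratio, and the per-node sort rate. It assumes 1 TB = 1000 GB, and the variable and function names are illustrative only.

```python
# Figures taken from the sort benchmark slide above.
hadoop = {"minutes": 72, "nodes": 2100, "tb_per_min": 1.42}
spark = {"minutes": 23, "nodes": 206, "tb_per_min": 4.27}

# "3x faster": ratio of elapsed times.
speedup = hadoop["minutes"] / spark["minutes"]

# "1/10 the nodes": ratio of cluster sizes.
node_ratio = hadoop["nodes"] / spark["nodes"]


def gb_per_min_per_node(run):
    """Per-node sort rate, assuming 1 TB = 1000 GB."""
    return run["tb_per_min"] * 1000 / run["nodes"]


print(f"speedup: {speedup:.2f}x, node ratio: {node_ratio:.1f}x")
print(f"Hadoop per node: {gb_per_min_per_node(hadoop):.2f} GB/min")
print(f"Spark per node: {gb_per_min_per_node(spark):.2f} GB/min")
```

The per-node rates come out close to the values quoted on the slide; small differences are due to rounding in the published figures.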
  • 8. Apache Spark supports data analysis, machine learning, graphs, streaming data, etc. It can read from and write to a range of data sources and allows development in multiple languages. Spark Core Spark Streaming MLlib GraphX ML Pipelines Spark SQL DataFrames Data Sources Scala, Java, Python, R, SQL Hadoop HDFS, HBase, Hive, Amazon S3, Streaming, JSON, MySQL, and HPC-style (GlusterFS, Lustre) Spark: Introduction
  • 12.  Streaming Data  Machine Learning  Interactive Analysis  Data Warehousing  Batch Processing  Exploratory Data Analysis  Graph Data Analysis  Spatial (GIS) Data Analysis  And many more Spark’s Main Use Cases
  • 13.  Fingerprint Matching  Developed a Spark based fingerprint minutia detection and fingerprint matching code  Twitter Sentiment Analysis  Developed a Spark based Sentiment Analysis code for a Twitter dataset Spark’s Main Use Cases
  • 14.  Uber – the online taxi company gathers terabytes of event data from its mobile users every day.  Uses Kafka, Spark Streaming, and HDFS to build a continuous ETL pipeline  Converts raw unstructured event data into structured data as it is collected  Uses it further for more complex analytics and optimization of operations  Pinterest – uses a Spark ETL pipeline  Leverages Spark Streaming to gain immediate insight into how users all over the world are engaging with Pins, in real time.  Can make more relevant recommendations as people navigate the site  Recommends related Pins  Helps users determine which products to buy, or destinations to visit Spark’s Main Use Cases
  • 15. Here are a few other real-world use cases:  Conviva – 4 million video feeds per month  This streaming video company is second only to YouTube.  Uses Spark to reduce customer churn by optimizing video streams and managing live video traffic  Maintains a consistently smooth, high-quality viewing experience.  Capital One – uses Spark and data science algorithms to understand its customers better.  Developing the next generation of financial products and services  Finds attributes and patterns associated with an increased probability of fraud  Netflix – uses Spark to gain insight into user viewing habits and then recommends movies to them.  User data is also used for content creation Spark’s Main Use Cases
  • 16.  Even though Spark is versatile, its in-memory capabilities are not the best fit for all use cases:  For many simple use cases, Hadoop MapReduce and Hive might be a more appropriate choice  Spark was not designed as a multi-user environment  Spark users need to know whether the memory they have is sufficient for a dataset  Adding more users adds complications, since the users have to coordinate memory usage to run their code Spark: when not to use
  • 17.  Clouds and supercomputers are collections of computers networked together in a datacenter  Clouds have different networking, I/O, CPU and cost trade-offs than supercomputers  Cloud workloads are data oriented vs. computation oriented and are less closely coupled than supercomputers  Principles of parallel computing same on both  Apache Hadoop and Spark vs. Open MPI HPC and Big Data Convergence
  • 18.  RDDs (Resilient Distributed Datasets) are data containers  All the different processing components in Spark share the same abstraction, called the RDD  As applications share the RDD abstraction, you can mix different kinds of transformations to create new RDDs  Created by parallelizing a collection or reading a file  Fault tolerant Resilient Distributed Datasets (RDDs)
  • 19. RDD abstraction  Resilient Distributed Datasets  partitioned collection of records  spread across the cluster  read-only  caching dataset in memory  different storage levels available  fallback to disk possible Introduction Quick Recap Spark
  • 20. RDD operations  transformations to build RDDs through deterministic operations on other RDDs  transformations include map, filter, join  lazy operation  actions to return value or export data  actions include count, collect, save  triggers execution Introduction Quick Recap Spark
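The distinction between lazy transformations and actions that trigger execution can be illustrated without a cluster. The following is a minimal plain-Python sketch, not the Spark API: `LazyDataset` is a hypothetical toy class that only records transformations (its lineage) and runs them when an action such as `count()` or `collect()` is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, ops=()):
        self._source = source      # the original, read-only records
        self._ops = tuple(ops)     # lineage: the recorded transformations

    # Transformations: return a new dataset, compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def _run(self):
        # Replay the recorded lineage over every source record.
        for record in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                yield record

    # Actions: trigger execution and return a value to the caller.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records when the mapped function actually runs


def parse(line):
    calls.append(line)
    return line.upper()


lines = LazyDataset(["error: disk", "info: ok", "error: net"])
errors = lines.map(parse).filter(lambda s: s.startswith("ERROR"))
calls_before_action = len(calls)   # nothing has executed yet
n_errors = errors.count()          # the action runs the recorded lineage
```

Because each action replays the lineage from the source, calling a second action recomputes everything; this is the cost that caching a reused dataset is meant to avoid.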
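  • 21. Job example
      // Load the log file from HDFS (lazy: nothing is read yet)
      val log = sc.textFile("hdfs://...")
      // Keep only the lines that contain "ERROR"
      val errors = log.filter(_.contains("ERROR"))
      // Mark the filtered data for in-memory caching
      errors.cache()
      // Actions: each count() triggers execution; the second reuses the cache
      errors.filter(_.contains("I/O")).count()
      errors.filter(_.contains("timeout")).count()
    [Diagram: a Driver and three Workers; the workers hold Block1, Block2, Block3 and their caches; the count() call is the action.] Introduction Quick Recap Spark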
  • 22. RDD partition-level view HadoopRDD path = hdfs://... FilteredRDD func = _.contains(…) shouldCache = true log: errors: Partition-level view:Dataset-level view: Task 1Task 2 ... source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals Introduction Quick Recap Spark
  • 23. Job scheduling rdd1.join(rdd2) .groupBy(…) .filter(…) RDD Objects build operator DAG DAGScheduler split graph into stages of tasks submit each stage as ready DAG TaskScheduler TaskSet launch tasks via cluster manager retry failed or straggling tasks Cluster manager Worker execute tasks store and serve blocks Block manager Threads Task source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals Introduction Quick Recap Spark
  • 24. Available APIs  You can write in Java, Scala or Python  interactive interpreter: Scala & Python only  standalone applications: any  performance: Java & Scala are faster thanks to static typing Introduction Quick Recap Spark
  • 25. Summary  concept not limited to single-pass map-reduce  avoids storing intermediate results on disk or HDFS  speeds up computations when reusing datasets Introduction Quick Recap Spark
  • 27.  DataFrames (DFs) are another kind of distributed dataset, organized in named columns  Similar to a relational database table, a Python Pandas DataFrame, or R’s DataTables  Immutable once constructed  Track lineage  Enable distributed computations  How to construct DataFrames  Read from file(s)  Transform an existing DF (Spark or Pandas)  Parallelize a Python collection (list)  Apply transformations and actions DataFrames & SparkSQL
  • 28. DataFrame example
      # Create a new DataFrame that contains "students"
      students = users.filter(users.age < 21)
      # Alternatively, using Pandas-like syntax
      students = users[users.age < 21]
      # Count the number of students by gender
      students.groupBy("gender").count()
      # Join young students with another DataFrame called logs
      students.join(logs, logs.userId == users.userId, "left_outer")
    Resilient Distributed Datasets (RDDs)
  • 29.  RDDs provide a low level interface into Spark  DataFrames have a schema  DataFrames are cached and optimized by Spark  DataFrames are built on top of the RDDs and the core Spark API Example: performance RDDs vs. DataFrames
  • 30. Transformations (create a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, intersection, flatMap, union, join, cogroup, cartesian, mapValues. Actions (return results to the driver program): collect, first, reduce, take, count, takeOrdered, takeSample, countByKey, saveAsTextFile, lookup, foreach. Spark Operations
  • 31. A B S C E D F DAGs track dependencies (also known as Lineage )  nodes are RDDs  arrows are Transformations Transformation
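  • 32. Narrow Vs. Wide Transformation [Figure: map is a narrow transformation (each output partition depends on a single input partition); groupByKey is a wide transformation (records with the same key are brought together, e.g. A,1 and A,2 become A,[1,2]).]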
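The narrow/wide distinction can be sketched in plain Python with lists standing in for partitions. This is an illustration only, not the Spark API: `narrow_map` and `wide_group_by_key` are hypothetical helper names, and the extra ("B", 5) record is added just to show a second key.

```python
from collections import defaultdict

# Two partitions of (key, value) records, following the slide's A,1 and A,2 example.
partitions = [[("A", 1), ("B", 5)], [("A", 2)]]


def narrow_map(parts, fn):
    """Narrow: each output partition is computed from exactly one input partition."""
    return [[fn(rec) for rec in part] for part in parts]


def wide_group_by_key(parts, num_out=2):
    """Wide: records are shuffled so that all values for a key meet in one partition."""
    out = [defaultdict(list) for _ in range(num_out)]
    for part in parts:
        for key, value in part:
            out[hash(key) % num_out][key].append(value)
    return [sorted(d.items()) for d in out]


doubled = narrow_map(partitions, lambda kv: (kv[0], kv[1] * 2))
grouped = wide_group_by_key(partitions)

# Flatten the output partitions into one dict for display.
merged = dict(pair for part in grouped for pair in part)
print(doubled)
print(merged)
```

In the wide case every input partition may contribute to every output partition, which is why such transformations require data movement between workers.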
  • 33. Spark Workflow FlatMap Map groupByKey Spark Context Driver Program Collect Narrow Vs. Wide Transformation
  • 34.  What is an action  The final stage of the workflow  Triggers the execution of the DAG  Returns the results to the driver  Or writes the data to HDFS or to a file Actions
  • 35. RDD API Examples (examples from http://spark.apache.org/)
     Word count
      text_file = sc.textFile("hdfs://usr/godil/text/book.txt")
      # Split lines into words, pair each word with 1, then sum the counts per word.
      counts = (text_file.flatMap(lambda line: line.split(" "))
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
      counts.saveAsTextFile("hdfs://usr/godil/output/wordCount.txt")
     Logistic Regression
      from pyspark.ml.classification import LogisticRegression
      # Every record of this DataFrame contains the label and
      # features represented by a vector.
      df = sqlContext.createDataFrame(data, ["label", "features"])
      # Set parameters for the algorithm.
      # Here, we limit the number of iterations to 10.
      lr = LogisticRegression(maxIter=10)
      # Fit the model to the data.
      model = lr.fit(df)
      # Given a dataset, predict each point's label, and show the results.
      model.transform(df).show()
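To see what each step of the word count computes, the same flatMap, map, and reduceByKey sequence can be mimicked on a local list with the Python standard library. This is a sketch for intuition only: it runs on a single machine, the input lines are hypothetical, and `reduce_by_key` is an illustrative helper rather than a Spark function.

```python
from functools import reduce
from itertools import chain

lines = ["to be or not to be", "to do is to be"]  # hypothetical input text

# flatMap: split every line into words and flatten into one sequence.
words = chain.from_iterable(line.split(" ") for line in lines)

# map: pair each word with an initial count of 1.
pairs = ((word, 1) for word in words)


# reduceByKey: combine the counts of equal keys with a + b.
def reduce_by_key(acc, pair):
    word, n = pair
    acc[word] = acc.get(word, 0) + n
    return acc


counts = reduce(reduce_by_key, pairs, {})
print(counts)
```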
  • 36.  RDD Persistence  RDD.persist()  Storage level:  MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, DISK_ONLY,…….  RDD Removal  RDD.unpersist() RDD Persistence and Removal
  • 37. Broadcast Variables and Accumulators (Shared Variables)
     Broadcast variables allow the programmer to keep a read-only variable cached on each node, rather than sending a copy of it with tasks
      >>> broadcastV1 = sc.broadcast([1, 2, 3, 4, 5, 6])
      >>> broadcastV1.value
      [1, 2, 3, 4, 5, 6]
     Accumulators are variables that are only “added” to through an associative operation and can be efficiently supported in parallel
      accum = sc.accumulator(0)
      accum.add(x)
      accum.value
    RDD Persistence and Removal
  • 38. Regression, Classification, Collaborative Filtering, & Clustering.
  • 39. Regression Logistic regression is a popular supervised machine learning algorithm that can be used to predict a categorical response. It is used to solve classification problems. Classification involves looking at data and assigning a class (or a label) to it. Usually there is more than one class; when there are two classes (0 or 1), the task is called binary classification. Here, the logistic regression algorithm is used to build a machine learning model with Apache Spark: a binary classifier that predicts a student’s exam pass/fail result based on past exam scores.
  • 41. π = Proportion of “Success” In ordinary regression the model predicts the mean Y for any combination of predictors. What’s the “mean” of a 0/1 indicator variable? $\bar{y} = \frac{\sum y_i}{n} = \frac{\#\text{ of 1's}}{\#\text{ of trials}} = \text{proportion of ``success''}$ Goal of logistic regression: Predict the “true” proportion of success, π, at any value of the predictor. Regression
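The point that the mean of a 0/1 indicator equals the proportion of successes can be checked in a few lines of Python. The outcome list below is hypothetical and serves only as an illustration.

```python
from statistics import mean

# Hypothetical pass/fail outcomes: 1 = success, 0 = failure.
y = [1, 0, 1, 1, 0, 1, 0, 1]

y_bar = mean(y)                     # ordinary mean of the 0/1 values
proportion = y.count(1) / len(y)    # number of 1's / number of trials

print(y_bar, proportion)
```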
  • 42. Binary Logistic Regression Model Y = Binary response X = Quantitative predictor π = proportion of 1’s (yes, success) at any X Equivalent forms of the logistic regression model: Logit form: $\log\left(\frac{\pi}{1-\pi}\right) = b_0 + b_1 X$ Probability form: $\pi = \frac{e^{b_0 + b_1 X}}{1 + e^{b_0 + b_1 X}}$ N.B.: This is natural log (aka “ln”) What does this look like? Regression
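That the logit form and the probability form are equivalent can be verified numerically: applying the logit to the probability recovers the linear predictor. The sketch below uses only the Python standard library; the coefficient values are hypothetical and chosen purely for illustration.

```python
import math


def probability(x, b0, b1):
    """Probability form: pi = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))


def logit(pi):
    """Logit form: natural log of the odds pi / (1 - pi)."""
    return math.log(pi / (1.0 - pi))


b0, b1 = -3.0, 0.5   # hypothetical coefficients, for illustration only
for x in (0.0, 6.0, 10.0):
    pi = probability(x, b0, b1)
    # Applying the logit to pi recovers the linear predictor b0 + b1*x.
    assert abs(logit(pi) - (b0 + b1 * x)) < 1e-9
    print(x, round(pi, 4))
```

With these coefficients the linear predictor is zero at x = 6, where the predicted proportion is exactly one half.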
  • 43. Logit Function [Figure: function plot of y = exp(b0 + b1·x) / (1 + exp(b0 + b1·x)), with x from -10 to 12 and y from 0 to 1.] Regression