2. Who am I?
● Engineer at Tapad
● HackNY 2014 Fellow
● Things I work on:
○ Scala
○ Distributed systems
○ Hadoop/Spark
3. Overview
● Apache Spark
○ Dataflow model
○ Spark vs Hadoop MapReduce
○ Programming with Spark
● Machine Learning with Spark
○ MLlib overview
○ Gradient descent example
○ Distributed implementation on Apache Spark
○ Lessons learned
4. Apache Spark
● Distributed data-processing
framework built on top of HDFS
● Use cases:
○ Interactive analytics
○ Graph processing
○ Stream processing
○ Scalable ML
5. Why Spark?
● Up to 100x faster than
Hadoop MapReduce for
in-memory workloads
● Built on top of Akka
● Expressive APIs in Scala,
Java, and Python
● Active open-source
community
6. Spark vs Hadoop MapReduce
● In-memory data flow model
optimized for multi-stage
jobs
● Novel approach to fault
tolerance
● Similar programming style
to Scalding/Cascading
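A rough single-machine analogy for the in-memory dataflow model, using Scala's lazy collection views (plain Scala, not Spark's API): transformations only build a plan, and an action forces execution, which is what lets Spark pipeline multi-stage jobs without writing intermediate results to disk.

```scala
// Transformations are lazy: they describe stages, they don't run them.
var evaluations = 0
val stage1 = (1 to 5).view.map { x => evaluations += 1; x * 2 } // lazy "map" stage
val stage2 = stage1.filter(_ > 4)                               // lazy "filter" stage

assert(evaluations == 0)          // nothing has run yet; only a plan exists

val result = stage2.toList        // the "action" forces both stages at once
// result == List(6, 8, 10), and only now evaluations == 5

// rdd.cache() is loosely analogous to materializing a stage once for reuse:
val cached = stage1.toList        // compute once, reuse without re-running map
```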
8. Wordcount Example
val sc = new SparkContext()
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
// counts.cache()        // keep counts in memory for reuse
// sc.broadcast(counts)  // ship a read-only copy to each worker
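The same flatMap/map/reduce-by-key shape works on plain Scala collections, which is part of why Spark's API feels familiar. A local sketch with hypothetical input, using `groupBy` as a stand-in for the shuffle that `reduceByKey` performs across the cluster:

```scala
// Local word count with the same transformation shape as the Spark version.
val lines = Seq("to be or not to be", "to think")
val counts = lines
  .flatMap(line => line.split(" "))   // one record per word
  .map(word => (word, 1))             // pair each word with a count of 1
  .groupBy(_._1)                      // local stand-in for the shuffle
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // like reduceByKey(_ + _)
// counts("to") == 3, counts("be") == 2
```

On Spark, `cache()` keeps the counts in executor memory for reuse, and `sc.broadcast` distributes a read-only copy to every worker.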
9. Machine Learning in Spark
Algorithms:
- classification: logistic regression, linear SVM, naive Bayes, random forests
- regression: generalized linear models, regression tree
- collaborative filtering: alternating least squares (ALS), non-negative
matrix factorization (NMF)
- clustering: k-means
- decomposition: singular value decomposition (SVD), principal component analysis (PCA)
10. K-Means Clustering
val data = sc.textFile("hdfs://...")
val parsedData = data.map(_.split(' ').map(_.toDouble)).cache()
// Cluster the data into two classes, 20 iterations
val clusters = KMeans.train(parsedData, 2, 20)
// Compute the sum of squared errors
val cost = clusters.computeCost(parsedData)
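What `KMeans.train` does per iteration can be sketched in plain Scala as Lloyd's algorithm on 1-D points (the data and starting centroids here are hypothetical; MLlib distributes this work and adds smarter initialization):

```scala
// One pass per iteration: assign each point to its nearest centroid,
// then move each centroid to the mean of the points assigned to it.
def kmeans(points: Seq[Double], centroids: Seq[Double], iterations: Int): Seq[Double] =
  if (iterations == 0) centroids
  else {
    val assigned = points.groupBy(p => centroids.minBy(c => math.abs(c - p)))
    val moved = centroids.map(c =>
      assigned.get(c).map(ps => ps.sum / ps.size).getOrElse(c))
    kmeans(points, moved, iterations - 1)
  }

val data = Seq(1.0, 1.2, 0.8, 9.0, 9.3, 8.7)
val clusters = kmeans(data, Seq(0.0, 10.0), iterations = 20)
// centroids converge near 1.0 and 9.0
```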
11. Gradient Descent Example
val file = sc.textFile("hdfs://...")
val points = file.map(parsePoint).cache()
var w = Vector.zeros(d)  // d = number of features
for (i <- 1 to numIterations) {
  val gradient = points.map { p =>
    (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
  }.reduce(_ + _)
  w -= alpha * gradient
}
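The RDD map/reduce above mirrors plain Scala collections, so the same logistic-gradient update can be run locally. A minimal sketch on a tiny, trivially separable 1-D dataset (the data, `alpha`, and iteration count are hand-picked for illustration):

```scala
import scala.math.exp

case class Point(x: Double, y: Double)  // y is the label, +1 or -1

// Positive x implies label +1, negative x implies label -1.
val points = Seq(Point(2.0, 1.0), Point(1.5, 1.0),
                 Point(-2.0, -1.0), Point(-1.0, -1.0))
val alpha = 0.1
var w = 0.0                             // single weight for 1-D features

for (i <- 1 to 100) {
  // Same per-point logistic gradient as the Spark version, reduced locally
  val gradient = points.map { p =>
    (1 / (1 + exp(-p.y * w * p.x)) - 1) * p.y * p.x
  }.reduce(_ + _)
  w -= alpha * gradient
}
// w ends up positive, so sign(w * x) recovers each label
```

Each iteration reuses `points`, which is exactly why the Spark version calls `cache()`: without it, the input would be re-read from HDFS on every pass.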