Talk on Apache Spark I gave at Hyderabad Software Architects meetup on 20-Jan-2018.
Source code and commands are at
http://www.mediafire.com/file/tzmzahftxnabs0g/HSA-Spark-20-Jan-2018.zip
2. Spark Overview
● General-purpose cluster computing system
● High-level APIs in Java, Scala, Python and R
● Supports general execution graphs
● Spark SQL (structured data processing)
● MLlib (machine learning)
● GraphX (graph processing)
● Spark Streaming (stream processing)
3. Deployment Modes
● Amazon EC2:
○ scripts that let you launch a cluster on EC2 in about 5 minutes
● Standalone Deploy Mode:
○ launch a standalone cluster quickly without a third-party cluster manager
● Mesos:
○ deploy a private cluster using Apache Mesos
● YARN:
○ deploy Spark on top of Hadoop NextGen (YARN)
4. FAQ
Hadoop vs Spark
Hadoop | Spark
File-system based | Memory based
Map-Reduce paradigm | Any distributed computing workload
Details:
https://docs.google.com/document/d/1hcv3JOc009AVer6bVFEeGnHrt_Jb0v3v3g2qGPgRy1Y
7. K-Means Clustering
First, decide the number of clusters k.
Then:
1. Initialize the centers of the clusters
2. Assign each data point to the closest cluster
3. Set the position of each cluster center to the mean of all data points belonging to that cluster
4. Repeat steps 2-3 until convergence
Create k points for starting centroids (often randomly)
While any point has changed cluster assignment:
    for every point in the dataset:
        for every centroid:
            calculate the distance between the centroid and the point
        assign the point to the cluster with the lowest distance
    for every cluster:
        calculate the mean of the points in that cluster
        assign the centroid to the mean
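The loop above can be sketched in plain, single-machine Python (an illustrative toy only; the meetup examples use kMeans.py from Machine Learning in Action and Spark's MLlib, not this code):

```python
import math
import random

def kmeans(points, k, seed=0):
    """Toy k-means on a list of coordinate tuples, per the pseudocode above."""
    rng = random.Random(seed)
    # 1. Initialize centroids by sampling k distinct data points
    centroids = rng.sample(points, k)
    assignments = [None] * len(points)
    changed = True
    while changed:  # loop until no point changes cluster assignment
        changed = False
        # 2. Assign each point to the cluster with the lowest distance
        for i, p in enumerate(points):
            dists = [math.dist(p, c) for c in centroids]
            best = dists.index(min(dists))
            if assignments[i] != best:
                assignments[i] = best
                changed = True
        # 3. Move each centroid to the mean of the points in its cluster
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, assignments

# Two obvious blobs of x-y points, around (0, 0) and (10, 10)
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(data, k=2)
```

On this tiny dataset the loop converges in a few iterations and separates the two blobs; MLlib's KMeans does the same work, but distributes the assignment step across the cluster.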
8. Example 1
Applying the k-Means algorithm using Python
Program: kMeans.py*
Data: x-y coordinates in file testSet.txt
* Source: Chapter 10, Machine Learning in Action
9. Example 2
Applying the k-Means algorithm using Spark's Java API
Program: Spark_KMeans.java
Data: Random numbers in marks.txt
10. Example 3
https://mapr.com/ebooks/spark/08-machine-learning-mllib-spark-use-case.html
Techniques applied for NLP
● Hashing
● Stemming
● TF-IDF
● Naive Bayes
● Random Forest Regression
● Gradient Boosted Trees Regression
● Code repository: https://github.com/joebluems/Mockingbird/issues
Is Harper Lee's Go Set a Watchman a discarded rough draft that was to become the universally beloved classic To Kill a Mockingbird, or was it a truly separate work?
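Of the techniques above, TF-IDF (term frequency-inverse document frequency) can be sketched in plain Python. This is an illustrative toy, not the MLlib HashingTF/IDF pipeline the ebook's Spark code actually uses:

```python
import math

def tf_idf(docs):
    """docs: list of token lists -> list of {term: tf-idf weight} dicts."""
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        counts = {}
        for term in doc:
            counts[term] = counts.get(term, 0) + 1
        # tf = term count / document length; idf = log(n / df)
        weights.append({
            t: (c / len(doc)) * math.log(n / df[t])
            for t, c in counts.items()
        })
    return weights

docs = [["spark", "streaming"], ["spark", "sql"], ["hadoop", "mapreduce"]]
w = tf_idf(docs)
```

Terms that appear in fewer documents get a higher weight; a term appearing in every document scores zero. MLlib's HashingTF replaces the explicit term dictionary with feature hashing so the computation scales across a cluster.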
11. Who is using Spark?
Source:
https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c
12. Spark @AirBnB
Spark is also an important tool for Airbnb. The team actually built something called Airstream, which is a computational framework that sits on top of Spark Streaming and Spark SQL, allowing engineers and the data team to get quick insights. Ultimately, for an organization that depends on predictions and machine learning, something like Spark - alongside other open source machine learning libraries - is crucial in the Airbnb stack.
Source:
https://www.packtpub.com/books/content/what-software-stack-does-airbnb-use