Talk on Apache Spark I gave at Hyderabad Software Architects meetup on 20-Jan-2018.
Source code and commands are at
http://www.mediafire.com/file/tzmzahftxnabs0g/HSA-Spark-20-Jan-2018.zip
2. Spark Overview
● General-purpose cluster computing system
● High-level APIs in Java, Scala, Python and R
● Supports general execution graphs
● Spark SQL (structured data processing)
● MLlib (machine learning)
● GraphX (graph processing)
● Spark Streaming (stream processing)
3. Deployment Modes
● Amazon EC2:
○ scripts that let you launch a cluster on EC2 in about 5 minutes
● Standalone Deploy Mode:
○ launch a standalone cluster quickly without a third-party cluster manager
● Mesos:
○ deploy a private cluster using Apache Mesos
● YARN:
○ deploy Spark on top of Hadoop NextGen (YARN)
4. FAQ
Hadoop vs Spark
Hadoop | Spark
File-system based | Memory based
Map-Reduce paradigm | Any distributed computing workload
Details:
https://docs.google.com/document/d/1hcv3JOc009AVer6bVFEeGnHrt_Jb0v3v3g2qGPgRy1Y
7. K-Means Clustering
First, decide the number of clusters k.
Then:
1. Initialize the centers of the clusters
2. Assign each data point to the closest cluster
3. Set the position of each cluster center to the mean of all data points belonging to that cluster
4. Repeat steps 2-3 until convergence
Create k points for starting centroids (often randomly)
While any point has changed cluster assignment:
    for every point in the dataset:
        for every centroid:
            calculate the distance between the centroid and the point
        assign the point to the cluster with the lowest distance
    for every cluster:
        calculate the mean of the points in that cluster
        assign the centroid to the mean
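The loop above can be sketched in plain, single-machine Python (an illustrative toy only; the meetup examples use kMeans.py from Machine Learning in Action and Spark's MLlib, not this code):

```python
import math
import random

def kmeans(points, k, seed=0):
    """Toy k-means on a list of coordinate tuples, per the pseudocode above."""
    rng = random.Random(seed)
    # 1. Initialize centroids by sampling k distinct data points
    centroids = rng.sample(points, k)
    assignments = [None] * len(points)
    changed = True
    while changed:  # loop until no point changes cluster assignment
        changed = False
        # 2. Assign each point to the cluster with the lowest distance
        for i, p in enumerate(points):
            dists = [math.dist(p, c) for c in centroids]
            best = dists.index(min(dists))
            if assignments[i] != best:
                assignments[i] = best
                changed = True
        # 3. Move each centroid to the mean of the points in its cluster
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, assignments

# Two obvious blobs of x-y points, around (0, 0) and (10, 10)
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(data, k=2)
```

On this tiny dataset the loop converges in a few iterations and separates the two blobs; MLlib's KMeans does the same work, but distributes the assignment step across the cluster.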
8. Example 1
Applying the k-Means algorithm using Python
Program: kMeans.py*
Data: x-y coordinates in file testSet.txt
* Source: Chapter 10, Machine Learning in Action
9. Example 2
Applying the k-Means algorithm using Spark's Java API
Program: Spark_KMeans.java
Data: Random numbers in marks.txt
10. Example 3
https://mapr.com/ebooks/spark/08-machine-learning-mllib-spark-use-case.html
Techniques applied for NLP
● Hashing
● Stemming
● TF-IDF
● Naive Bayes
● Random Forest Regression
● Gradient Boosted Trees Regression
● Code repository: https://github.com/joebluems/Mockingbird/issues
Is Harper Lee's Go Set a Watchman a discarded rough draft that was to become the universally beloved classic To Kill a Mockingbird, or was it a truly separate work?
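Of the techniques above, TF-IDF (term frequency-inverse document frequency) can be sketched in plain Python. This is an illustrative toy, not the MLlib HashingTF/IDF pipeline the ebook's Spark code actually uses:

```python
import math

def tf_idf(docs):
    """docs: list of token lists -> list of {term: tf-idf weight} dicts."""
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        counts = {}
        for term in doc:
            counts[term] = counts.get(term, 0) + 1
        # tf = term count / document length; idf = log(n / df)
        weights.append({
            t: (c / len(doc)) * math.log(n / df[t])
            for t, c in counts.items()
        })
    return weights

docs = [["spark", "streaming"], ["spark", "sql"], ["hadoop", "mapreduce"]]
w = tf_idf(docs)
```

Terms that appear in fewer documents get a higher weight; a term appearing in every document scores zero. MLlib's HashingTF replaces the explicit term dictionary with feature hashing so the computation scales across a cluster.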
11. Who is using Spark?
Source:
https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c
12. Spark @AirBnB
Spark is also an important tool for Airbnb. The team actually built something called Airstream, which is a computational framework that sits on top of Spark Streaming and Spark SQL, allowing engineers and the data team to get quick insights. Ultimately, for an organization that depends on predictions and machine learning, something like Spark - alongside other open source machine learning libraries - is crucial in the Airbnb stack.
Source:
https://www.packtpub.com/books/content/what-software-stack-does-airbnb-use