Introduction to Apache Spark
www.mammothdata.com | @mammothdataco
The Leader in Big Data Consulting
● BI/Data Strategy
○ Development of a business intelligence / data architecture strategy.
● Installation
○ Installation of Hadoop or relevant technology.
● Data Consolidation
○ Load data from diverse sources into a single scalable repository.
● Streaming
○ Mammoth will write ingestion and/or analytics that operate on the data as it arrives, and will design dashboards, feeds, or computer-driven decision-making processes to derive insights and make decisions.
● Visualization Tools
○ Mammoth will set up a visualization tool (e.g. Tableau, Pentaho), create initial reports, and provide training to the employees who will analyze the data.
Mammoth Data, based in downtown Durham (right above Toast)
Me!
● Lead Consultant on all things DevOps and Spark
● @carsondial
What Is Apache Spark?!
● Apache Spark™ is a fast and general engine for large-scale data processing
What Is Apache Spark?! No, But Really…
● Framework for massively parallel (cluster) computing
● Harnesses the power of cheap memory
● Directed Acyclic Graph (DAG) computing engine
● It goes very fast!
● Apache project (spark.apache.org)
History
● Began at UC Berkeley in 2009
● Apache project since 2013
● Top-level Apache project since 2014
● Its creators went on to found Databricks (databricks.com)
Why Spark?
● Performance
● Developer productivity
Performance!
● Graysort benchmark (100TB)
● Hadoop: 72 minutes / 2,100 nodes / physical datacentre
● Spark: 23 minutes / 206 nodes / AWS
● HDFS (disk) versus memory
Developers!
● First-class support for Scala, Java, Python, and R!
● Data Science friendly
Word Count: Hadoop
(the original slide showed the famously long Java MapReduce word-count listing, for contrast with the next slide)
Word Count: Spark

from pyspark import SparkContext

logFile = "hdfs:///input"
sc = SparkContext("spark://spark-m:7077", "WordCount")
textFile = sc.textFile(logFile)                             # RDD of lines
wordCounts = (textFile.flatMap(lambda line: line.split())   # one record per word
                      .map(lambda word: (word, 1))          # pair each word with 1
                      .reduceByKey(lambda a, b: a + b))     # sum the 1s per word
wordCounts.saveAsTextFile("hdfs:///output")
Spark: Batteries Included
● Spark Streaming
● GraphX (graph algorithms)
● MLlib (machine learning)
● Dataframes (data access)
Applications
● Analytics (batch / streaming)
● Machine Learning
● ETL (Extract - Transform - Load)
● …and many more!
RDDs – The Building Block
● RDD = Resilient Distributed Dataset
● Immutable and fault-tolerant
● Operated on in parallel
● Can be created manually or from external sources (see the sketch below)
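As a minimal sketch (not from the original deck), here are both creation routes in PySpark, assuming an existing SparkContext named sc:

# manually, from a local collection, distributed across the cluster:
rdd = sc.parallelize([1, 2, 3, 4, 5])
# or from an external source, e.g. a file on HDFS:
lines = sc.textFile("hdfs:///input")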
RDDs – The Building Block
● Transformations
● Actions
● Transformations are lazy
● An action evaluates the pipelined transformations as well as performing the action itself (see the sketch below)
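A small hedged illustration of laziness, again assuming a SparkContext sc; the values are made up:

nums = sc.parallelize(range(1000))
squares = nums.map(lambda x: x * x)           # transformation: returns immediately, nothing runs
evens = squares.filter(lambda x: x % 2 == 0)  # still lazy
print(evens.count())                          # action: the whole pipeline executes now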
RDDs – Example Transformations
● map()
● filter()
● pipe()
● sample()
● …and more!
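A hedged sketch of these transformations on a text RDD (the path, predicate, and shell command are made up):

lines = sc.textFile("hdfs:///input")
upper = lines.map(lambda line: line.upper())   # transform every record
matches = lines.filter(lambda l: "spark" in l) # keep a subset
grepped = lines.pipe("grep -i spark")          # stream each partition through a shell command
sampled = lines.sample(False, 0.1)             # ~10% sample, without replacement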
RDDs – Example Actions
● reduce()
● count()
● take()
● saveAsTextFile()
● …and yes, more
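And a matching sketch of the actions, with made-up values; each of these triggers actual computation:

nums = sc.parallelize([5, 3, 8, 1])
total = nums.reduce(lambda a, b: a + b)   # 17: combine elements pairwise
howMany = nums.count()                    # 4
firstTwo = nums.take(2)                   # [5, 3], returned to the driver
nums.saveAsTextFile("hdfs:///nums-out")   # write one file per partition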
Word Count: Spark (revisited)

from pyspark import SparkContext

logFile = "hdfs:///input"
sc = SparkContext("spark://spark-m:7077", "WordCount")
textFile = sc.textFile(logFile)                             # RDD of lines
wordCounts = (textFile.flatMap(lambda line: line.split())   # one record per word
                      .map(lambda word: (word, 1))          # pair each word with 1
                      .reduceByKey(lambda a, b: a + b))     # sum the 1s per word
wordCounts.saveAsTextFile("hdfs:///output")
RDDs – cache()
● cache() / persist()
● When an action is performed for the first time, keep the result in memory
● Different levels of persistence available (see the sketch below)
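A hedged sketch of caching; the second count() is served from memory rather than recomputed (path made up):

from pyspark import StorageLevel

words = sc.textFile("hdfs:///input").flatMap(lambda line: line.split())
words.cache()     # shorthand for persist(StorageLevel.MEMORY_ONLY)
words.count()     # first action: computes the RDD and keeps it in memory
words.count()     # second action: served from the cache
# other persistence levels spill to disk when memory is tight, e.g.:
# words.persist(StorageLevel.MEMORY_AND_DISK)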
Streaming
● Micro-batches (DStreams of RDDs)
● Access to other parts of Spark (MLlib, GraphX, Dataframes)
● Fault-tolerant
● Connectors for Kafka, Flume, Kinesis, ZeroMQ
● (we’ll come back to this)
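A minimal Spark Streaming sketch (not from the deck): word counts over one-second micro-batches read from a socket; the host and port are placeholders:

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)                     # one-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)   # DStream of lines
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                   # print a sample of each batch
ssc.start()
ssc.awaitTermination()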
Dataframes
● Spark SQL
● Support for JSON, Cassandra, SQL databases, etc.
● Easier syntax than RDDs
● Dataframes ‘borrowed’ from Python/R
● Catalyst query planner
Dataframes: Example

val sc = new SparkContext()
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("people.json")   // infer the schema from JSON
df.show()                                      // print the first rows
df.filter(df("age") >= 35).show()              // SQL-style filtering
df.groupBy("age").count().show()               // aggregation
Dataframes: Catalyst
● Optimizing query planner for Spark
● Takes Dataframe operations and ‘compiles’ them down to RDD operations (see the sketch below)
● Often faster than writing RDD code manually
● Use Dataframes whenever possible (v1.4+)
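One hedged way to see Catalyst at work is DataFrame.explain(), here in PySpark (the slide above is Scala), reusing the same people.json and assuming a Python sqlContext:

df = sqlContext.read.json("people.json")
df.filter(df["age"] >= 35).explain(True)   # prints Catalyst's logical and physical plans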
MLlib
● Machine Learning
● Includes algorithm implementations for Bayes, k-means clustering, ALS, word2vec, random forests, etc.
● Matrix operations (dense / sparse), dimensionality reduction, etc.
● And basic stats too!
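A minimal MLlib sketch (not from the deck): k-means over four made-up 2-D points, assuming a SparkContext sc:

from pyspark.mllib.clustering import KMeans

points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)        # the two learned centres
print(model.predict([0.5, 0.5]))   # cluster index for a new point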
MLlib – Pipelines
● Common interface between different ML solutions
● (still in progress, but production-ready as of 1.5)
● Pipelines are to MLlib what Dataframes are to RDDs (see the sketch below)
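A hedged sketch of a spark.ml Pipeline; the tiny training DataFrame and its 'text'/'label' columns are invented for illustration, and sqlContext is assumed to exist:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

training = sqlContext.createDataFrame(
    [("spark is fast", 1.0), ("hadoop on disk", 0.0)], ["text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])   # one reusable workflow
model = pipeline.fit(training)                           # fits all stages in order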
GraphX
● Graph processing algorithms
● Operations on vertices and edges
● Includes the PageRank algorithm
● Can be combined with Streaming/Dataframes/MLlib
Deploying Spark
● Standalone
● YARN (Hadoop ecosystem)
● Mesos (Hipster ecosystem)
Using Spark
● Traditional (write code, submit to cluster)
● REPL / shell (write code interactively, backed by the cluster)
● Interactive notebooks (IPython/Zeppelin)
Interactive Notebooks
● Log / diary approach to data science
● Type code into a web page
● Visualizations built in
Interactive Notebooks
● IPython / Jupyter – most popular
● Zeppelin – built for Spark
Links
● spark.apache.org
● databricks.com
● zeppelin.incubator.apache.org
● mammothdata.com/white-papers/spark-a-modern-tool-for-big-data-applications
Questions?
