Introduction to Apache Spark
www.mammothdata.com | @mammothdataco
The Leader in Big Data Consulting
● BI/Data Strategy
○ Development of a business intelligence / data architecture strategy.
● Installation
○ Installation of Hadoop or relevant technology.
● Data Consolidation
○ Load data from diverse sources into a single scalable repository.
● Streaming
○ Mammoth will write ingestion and/or analytics that operate on the data as it arrives, and will design dashboards, feeds, or computer-driven decision-making processes to derive insights and make decisions.
● Visualization Tools
○ Mammoth will set up a visualization tool (e.g. Tableau, Pentaho), create initial reports, and provide training to the employees who will analyze the data.
Mammoth Data, based in downtown Durham (right above Toast)
Me!
● Lead Consultant on all things DevOps and Spark
● @carsondial
What Is Apache Spark?!
● Apache Spark™ is a fast and general engine for large-scale data processing
What Is Apache Spark?! No, But Really…
● Framework for massively parallel (cluster) computing
● Harnesses the power of cheap memory
● Directed Acyclic Graph (DAG) computing engine
● It goes very fast!
● Apache project (spark.apache.org)
History
● Began at UC Berkeley in 2009
● Apache project since 2013
● Top-level Apache project since 2014
● Its creators went on to found Databricks (databricks.com)
Why Spark?
● Performance
● Developer productivity
Performance!
● Graysort benchmark (100TB)
● Hadoop: 72 minutes / 2,100 nodes / physical datacentre
● Spark: 23 minutes / 206 nodes / AWS
● HDFS (disk) versus memory
Developers!
● First-class support for Scala, Java, Python, and R!
● Data Science friendly
Word Count: Hadoop
(the original slide showed the famously long Java MapReduce word-count listing, for contrast with the next slide)
Word Count: Spark

from pyspark import SparkContext

logFile = "hdfs:///input"
sc = SparkContext("spark://spark-m:7077", "WordCount")
textFile = sc.textFile(logFile)                             # RDD of lines
wordCounts = (textFile.flatMap(lambda line: line.split())   # one record per word
                      .map(lambda word: (word, 1))          # pair each word with 1
                      .reduceByKey(lambda a, b: a + b))     # sum the 1s per word
wordCounts.saveAsTextFile("hdfs:///output")
Spark: Batteries Included
● Spark Streaming
● GraphX (graph algorithms)
● MLlib (machine learning)
● Dataframes (data access)
Applications
● Analytics (batch / streaming)
● Machine Learning
● ETL (Extract - Transform - Load)
● …and many more!
RDDs – The Building Block
● RDD = Resilient Distributed Dataset
● Immutable and fault-tolerant
● Operated on in parallel
● Can be created manually or from external sources (see the sketch below)
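As a minimal sketch (not from the original deck), here are both creation routes in PySpark, assuming an existing SparkContext named sc:

# manually, from a local collection, distributed across the cluster:
rdd = sc.parallelize([1, 2, 3, 4, 5])
# or from an external source, e.g. a file on HDFS:
lines = sc.textFile("hdfs:///input")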
RDDs – The Building Block
● Transformations
● Actions
● Transformations are lazy
● An action evaluates the pipelined transformations as well as performing the action itself (see the sketch below)
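A small hedged illustration of laziness, again assuming a SparkContext sc; the values are made up:

nums = sc.parallelize(range(1000))
squares = nums.map(lambda x: x * x)           # transformation: returns immediately, nothing runs
evens = squares.filter(lambda x: x % 2 == 0)  # still lazy
print(evens.count())                          # action: the whole pipeline executes now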
RDDs – Example Transformations
● map()
● filter()
● pipe()
● sample()
● …and more!
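A hedged sketch of these transformations on a text RDD (the path, predicate, and shell command are made up):

lines = sc.textFile("hdfs:///input")
upper = lines.map(lambda line: line.upper())   # transform every record
matches = lines.filter(lambda l: "spark" in l) # keep a subset
grepped = lines.pipe("grep -i spark")          # stream each partition through a shell command
sampled = lines.sample(False, 0.1)             # ~10% sample, without replacement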
RDDs – Example Actions
● reduce()
● count()
● take()
● saveAsTextFile()
● …and yes, more
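And a matching sketch of the actions, with made-up values; each of these triggers actual computation:

nums = sc.parallelize([5, 3, 8, 1])
total = nums.reduce(lambda a, b: a + b)   # 17: combine elements pairwise
howMany = nums.count()                    # 4
firstTwo = nums.take(2)                   # [5, 3], returned to the driver
nums.saveAsTextFile("hdfs:///nums-out")   # write one file per partition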
Word Count: Spark (revisited)

from pyspark import SparkContext

logFile = "hdfs:///input"
sc = SparkContext("spark://spark-m:7077", "WordCount")
textFile = sc.textFile(logFile)                             # RDD of lines
wordCounts = (textFile.flatMap(lambda line: line.split())   # one record per word
                      .map(lambda word: (word, 1))          # pair each word with 1
                      .reduceByKey(lambda a, b: a + b))     # sum the 1s per word
wordCounts.saveAsTextFile("hdfs:///output")
RDDs – cache()
● cache() / persist()
● When an action is performed for the first time, keep the result in memory
● Different levels of persistence available (see the sketch below)
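A hedged sketch of caching; the second count() is served from memory rather than recomputed (path made up):

from pyspark import StorageLevel

words = sc.textFile("hdfs:///input").flatMap(lambda line: line.split())
words.cache()     # shorthand for persist(StorageLevel.MEMORY_ONLY)
words.count()     # first action: computes the RDD and keeps it in memory
words.count()     # second action: served from the cache
# other persistence levels spill to disk when memory is tight, e.g.:
# words.persist(StorageLevel.MEMORY_AND_DISK)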
Streaming
● Micro-batches (DStreams of RDDs)
● Access to other parts of Spark (MLlib, GraphX, Dataframes)
● Fault-tolerant
● Connectors for Kafka, Flume, Kinesis, ZeroMQ
● (we’ll come back to this)
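A minimal Spark Streaming sketch (not from the deck): word counts over one-second micro-batches read from a socket; the host and port are placeholders:

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)                     # one-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)   # DStream of lines
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                   # print a sample of each batch
ssc.start()
ssc.awaitTermination()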
Dataframes
● Spark SQL
● Support for JSON, Cassandra, SQL databases, etc.
● Easier syntax than RDDs
● Dataframes ‘borrowed’ from Python/R
● Catalyst query planner
Dataframes: Example

val sc = new SparkContext()
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("people.json")   // infer the schema from JSON
df.show()                                      // print the first rows
df.filter(df("age") >= 35).show()              // SQL-style filtering
df.groupBy("age").count().show()               // aggregation
Dataframes: Catalyst
● Optimizing query planner for Spark
● Takes Dataframe operations and ‘compiles’ them down to RDD operations (see the sketch below)
● Often faster than writing RDD code manually
● Use Dataframes whenever possible (v1.4+)
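One hedged way to see Catalyst at work is DataFrame.explain(), here in PySpark (the slide above is Scala), reusing the same people.json and assuming a Python sqlContext:

df = sqlContext.read.json("people.json")
df.filter(df["age"] >= 35).explain(True)   # prints Catalyst's logical and physical plans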
MLlib
● Machine Learning
● Includes algorithm implementations for Bayes, k-means clustering, ALS, word2vec, random forests, etc.
● Matrix operations (dense / sparse), dimensionality reduction, etc.
● And basic stats too!
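A minimal MLlib sketch (not from the deck): k-means over four made-up 2-D points, assuming a SparkContext sc:

from pyspark.mllib.clustering import KMeans

points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)        # the two learned centres
print(model.predict([0.5, 0.5]))   # cluster index for a new point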
MLlib – Pipelines
● Common interface between different ML solutions
● (still in progress, but production-ready as of 1.5)
● Pipelines are to MLlib what Dataframes are to RDDs (see the sketch below)
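A hedged sketch of a spark.ml Pipeline; the tiny training DataFrame and its 'text'/'label' columns are invented for illustration, and sqlContext is assumed to exist:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

training = sqlContext.createDataFrame(
    [("spark is fast", 1.0), ("hadoop on disk", 0.0)], ["text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])   # one reusable workflow
model = pipeline.fit(training)                           # fits all stages in order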
GraphX
● Graph processing algorithms
● Operations on vertices and edges
● Includes the PageRank algorithm
● Can be combined with Streaming/Dataframes/MLlib
Deploying Spark
● Standalone
● YARN (Hadoop ecosystem)
● Mesos (Hipster ecosystem)
Using Spark
● Traditional (write code, submit to cluster)
● REPL / shell (write code interactively, backed by the cluster)
● Interactive notebooks (IPython/Zeppelin)
Interactive Notebooks
● Log / diary approach to data science
● Type code into a web page
● Visualizations built in
Interactive Notebooks
● IPython / Jupyter – most popular
● Zeppelin – built for Spark
Links
● spark.apache.org
● databricks.com
● zeppelin.incubator.apache.org
● mammothdata.com/white-papers/spark-a-modern-tool-for-big-data-applications
Questions?
