Introduction To Spark - Durham LUG 20150916

www.mammothdata.com | @mammothdataco
The Leader in Big Data Consulting
● BI/Data Strategy
○ Development of a business intelligence/ data architecture strategy.
● Installation
○ Installation of Hadoop or relevant technology.
● Data Consolidation
○ Load data from diverse sources into a single scalable repository.
● Streaming
○ Mammoth will write ingestion and/or analytics code that operates on data as it arrives, and design dashboards, feeds, or computer-driven decision-making processes to derive insights and make decisions.
● Visualization Tools
○ Mammoth will set up a visualization tool (e.g., Tableau or Pentaho), create initial reports, and train the employees who will analyze the data.
Mammoth Data, based in downtown Durham (right above Toast)

● Lead Consultant on all things DevOps and Spark
● @carsondial
Me!

● Apache Spark™ is a fast and general engine for large-scale data
processing
● Not all that helpful, is it?
What Is Apache Spark?!

● Framework for massive parallel computing (cluster)
● Harnessing power of cheap memory
● Directed Acyclic Graph (DAG) computing engine
● It goes very fast!
● Apache Project (spark.apache.org)
What Is Apache Spark?! No, But Really…
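
To make the DAG bullet concrete: a job can be seen as steps that depend on earlier steps, with no cycles, so the steps can be ordered and independent ones can run side by side. The following is a minimal sketch using Python's standard `graphlib` module (Python 3.9+). The step names are hypothetical, and this illustrates the general idea only, not Spark's actual scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical job: each step maps to the set of steps it depends on.
job = {
    "read_logs": set(),
    "read_users": set(),
    "parse": {"read_logs"},
    "join": {"parse", "read_users"},
    "save": {"join"},
}

def execution_waves(dag):
    """Group steps into waves; steps in the same wave have no
    unmet dependencies and could run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = execution_waves(job)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Both read steps land in the first wave because neither depends on the other; every later step waits for its inputs.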

● Performance
● Developer productivity
Why Spark?

● Graysort benchmark (100TB)
● Hadoop - 72 minutes / 2100 nodes / datacentre
● Spark - 23 minutes / 206 nodes / AWS
● HDFS versus Memory
Performance!
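
The benchmark figures above can be turned into ratios with a few lines of arithmetic. This is a rough sketch only: the two runs used different hardware (a datacentre versus AWS), so node-minutes is a crude, assumed cost measure rather than a like-for-like comparison.

```python
# Figures from the slide above (GraySort, 100 TB)
hadoop_minutes, hadoop_nodes = 72, 2100
spark_minutes, spark_nodes = 23, 206

speedup = hadoop_minutes / spark_minutes   # wall-clock ratio
node_ratio = hadoop_nodes / spark_nodes    # cluster-size ratio
# Node-minutes ignores differences in per-node hardware
node_minutes_ratio = (hadoop_minutes * hadoop_nodes) / (spark_minutes * spark_nodes)

print(f"{speedup:.1f}x faster on {node_ratio:.1f}x fewer nodes")
print(f"roughly {node_minutes_ratio:.0f}x fewer node-minutes")
```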

● First class support for Scala, Java, Python, and R!
● Data Science friendly
Developers!

Word Count: Hadoop

from pyspark import SparkContext

logFile = "hdfs:///input"
# Connect to the standalone cluster master and name the application
sc = SparkContext("spark://spark-m:7077", "WordCount")
textFile = sc.textFile(logFile)
# Split lines into words, pair each word with 1, then sum the counts per word
wordCounts = (textFile.flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
wordCounts.saveAsTextFile("hdfs:///output")
Word Count: Spark
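
To see what each stage of the word count produces, here is the same pipeline written in plain Python on a tiny in-memory input. It is a single-machine sketch of the logic, not Spark code; the input lines are hypothetical.

```python
from collections import defaultdict
from itertools import chain

lines = ["to be or not to be", "that is the question"]  # hypothetical input

# flatMap: one line in, many words out, flattened into one sequence
words = list(chain.from_iterable(line.split() for line in lines))
# map: pair every word with a count of 1
pairs = [(word, 1) for word in words]
# reduceByKey: combine the values of equal keys with a + b
counts = defaultdict(int)
for word, n in pairs:
    counts[word] = counts[word] + n

print(dict(counts))
```

In Spark the same three steps run in parallel across the partitions of the input, with equal keys brought together before the final combine.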

● Spark Streaming
● GraphX (graph algorithms)
● MLlib (machine learning)
● Dataframes (data access)
Spark: Batteries Included

● Analytics (batch / streaming)
● Machine Learning
● ETL (Extract - Transform - Load)
● …and many more!
Applications

● RDD = Resilient Distributed Dataset
● Immutable, Fault-tolerant
● Operated on in parallel
● Can be created manually or from external sources
RDDs – The Building Block

● Transformations
● Actions
● Transformations are lazy: they only describe a new RDD
● Actions trigger evaluation of the pending transformations in the pipeline, then perform the action itself
RDDs – The Building Block
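
The lazy/eager split can be sketched with a toy class. `LazyDataset` below is a hypothetical stand-in written for this example, not part of Spark; its `map` and `filter` only record steps, while `collect` and `count` actually run them.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records a pipeline, runs it only on an action."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = steps  # tuple of (kind, function)

    # Transformations: return a new dataset, do no work
    def map(self, fn):
        return LazyDataset(self.source, self.steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self.source, self.steps + (("filter", fn),))

    def _run(self):
        items = iter(self.source)
        for kind, fn in self.steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: evaluate the whole pipeline and return a result
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

calls = []

def double(x):
    calls.append(x)  # record that real work happened
    return x * 2

pipeline = LazyDataset(range(5)).map(double).filter(lambda x: x > 2)
calls_before_action = len(calls)   # nothing has run yet
result = pipeline.collect()        # the action triggers evaluation
print(calls_before_action, result, len(calls))
```

Building the pipeline calls `double` zero times; only `collect()` causes the five calls.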

● map()
● filter()
● pipe()
● sample()
● …and more!
RDDs – Example Transformations

● reduce()
● count()
● take()
● saveAsTextFile()
● …and yes, more
RDDs – Example Actions
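
One practical consequence of running `reduce()` in parallel is that the function passed to it should be commutative and associative, because each partition is reduced separately before the partial results are combined. A plain-Python sketch of that two-stage reduce follows; `partitioned_reduce` and the partition splits are hypothetical, written only to illustrate the point.

```python
from functools import reduce

def partitioned_reduce(partitions, fn):
    """Reduce each partition on its own, then reduce the partial results."""
    partials = [reduce(fn, part) for part in partitions if part]
    return reduce(fn, partials)

data = [1, 2, 3, 4, 5, 6]
two_parts = [data[:3], data[3:]]               # hypothetical split, 2 workers
three_parts = [data[:2], data[2:4], data[4:]]  # hypothetical split, 3 workers

def add(a, b):
    return a + b

def sub(a, b):
    return a - b

# Addition gives the same answer however the data is split
add_results = (partitioned_reduce(two_parts, add), partitioned_reduce(three_parts, add))
# Subtraction does not, so it is unsafe to use with a parallel reduce
sub_results = (partitioned_reduce(two_parts, sub), partitioned_reduce(three_parts, sub))
print(add_results, sub_results)
```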

● cache() / persist()
● The first time an action computes the RDD, its contents are kept in memory for reuse by later actions
● Different levels of persistence available
RDDs – cache()
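
The effect of caching can be shown by counting how often a pipeline is recomputed. `CountingDataset` is a hypothetical toy class for this illustration, not Spark's implementation: without `cache()` every action recomputes the data, with it only the first action does.

```python
class CountingDataset:
    """Toy dataset that counts how often its pipeline is recomputed."""

    def __init__(self, source, fn):
        self.source = source
        self.fn = fn
        self.computations = 0
        self._use_cache = False
        self._cached = None

    def cache(self):
        # Only marks the dataset; nothing is computed here
        self._use_cache = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        self.computations += 1
        values = [self.fn(x) for x in self.source]
        if self._use_cache:
            self._cached = values
        return values

    # Two actions that both need the computed values
    def count(self):
        return len(self._compute())

    def total(self):
        return sum(self._compute())

uncached = CountingDataset(range(4), lambda x: x * x)
uncached.count()
uncached.total()

cached = CountingDataset(range(4), lambda x: x * x).cache()
cached.count()
cached.total()

print(uncached.computations, cached.computations)
```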

● Micro-batches (DStreams of RDDs)
● Access to other parts of Spark (MLlib, GraphX, Dataframes)
● Fault-tolerant
● Connectors to Kafka, Flume, Kinesis, ZeroMQ
● (we’ll come back to this)
Streaming
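
Micro-batching means chopping a continuous stream into small time slices and processing each slice as an ordinary dataset. The sketch below groups timestamped events into fixed intervals in plain Python. The `micro_batches` function and the event data are hypothetical, and real Spark Streaming batches data as it is received rather than from a finished list.

```python
from collections import defaultdict

def micro_batches(events, interval):
    """Group (timestamp_seconds, value) events into fixed-length batches.
    Returns a list of (batch_start, [values]) in time order."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_start = (timestamp // interval) * interval
        batches[batch_start].append(value)
    return sorted(batches.items())

# Hypothetical event stream: (arrival time in seconds, payload)
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]

batches = micro_batches(events, interval=1)
for batch_start, values in batches:
    # Each batch is processed as one small, ordinary dataset
    print(batch_start, len(values), values)
```

The batch interval sets a floor on latency: an event cannot be processed before its batch closes.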

● Spark SQL
● Support for JSON, Cassandra, SQL databases, etc.
● Easier syntax than RDDs
● Dataframes ‘borrowed’ from Python/R
● Catalyst query planner
Dataframes

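import org.apache.spark.SparkContext

// Master and app name are taken from the launch configuration (e.g. spark-submit)
val sc = new SparkContext()
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Load a JSON file (one object per line) into a DataFrame
val df = sqlContext.read.json("people.json")
df.show()
// People aged 35 or older
df.filter(df("age") >= 35).show()
// Number of people at each age
df.groupBy("age").count().show()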
Dataframes: Example
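
For readers new to the DataFrame operations above, this plain-Python sketch shows what the filter and the group-and-count compute on a handful of rows. The records are hypothetical, since the contents of `people.json` are not shown in the talk.

```python
from collections import Counter

# Hypothetical records; the real people.json is not shown in the talk
people = [
    {"name": "Ann", "age": 35},
    {"name": "Bob", "age": 28},
    {"name": "Cid", "age": 35},
    {"name": "Dee", "age": 41},
]

# df.filter(df("age") >= 35): keep rows matching the predicate
aged_35_plus = [p for p in people if p["age"] >= 35]
# df.groupBy("age").count(): number of rows per distinct age
count_by_age = Counter(p["age"] for p in people)

print([p["name"] for p in aged_35_plus])
print(sorted(count_by_age.items()))
```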

● Optimizing query planning for Spark
● Takes Dataframe operations and ‘compiles’ them down to RDD
operations
● Often faster than writing RDD code manually
● Use Dataframes whenever possible (v1.4+)
Dataframes: Catalyst
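
The idea of a query planner can be sketched as follows: the operations are held as data (a plan), and rewrite rules simplify that plan before anything runs. The toy rule below merges adjacent filters into one. The plan format, `combine_filters`, and `execute` are all hypothetical names for this illustration and bear no relation to Catalyst's real internals.

```python
def combine_filters(plan):
    """Rewrite rule: merge adjacent filter steps into a single filter."""
    optimized = []
    for kind, fn in plan:
        if kind == "filter" and optimized and optimized[-1][0] == "filter":
            previous = optimized.pop()[1]
            # Bind both predicates as defaults to avoid late-binding surprises
            merged = lambda row, f=previous, g=fn: f(row) and g(row)
            optimized.append(("filter", merged))
        else:
            optimized.append((kind, fn))
    return optimized

def execute(plan, rows):
    """Run a plan step by step over a list of rows."""
    for kind, fn in plan:
        if kind == "map":
            rows = [fn(r) for r in rows]
        else:
            rows = [r for r in rows if fn(r)]
    return rows

# Hypothetical logical plan: two filters followed by a projection
plan = [
    ("filter", lambda r: r["age"] >= 35),
    ("filter", lambda r: r["age"] < 65),
    ("map", lambda r: r["name"]),
]
rows = [
    {"name": "Ann", "age": 35},
    {"name": "Bob", "age": 28},
    {"name": "Eve", "age": 70},
]

optimized = combine_filters(plan)
print(len(plan), len(optimized))
print(execute(plan, rows) == execute(optimized, rows))
```

The optimized plan has fewer steps but returns the same rows, which is the property any such rewrite has to preserve.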

● Standalone
● YARN (Hadoop ecosystem)
● Mesos (Hipster ecosystem)
Deploying Spark

● Spark-Shell
● Zeppelin
Demos

● Spark Streaming is not ‘pure’ streaming
● For low-latency requirements, use Storm
● Still immature in some ways
● Come to my All Things Open talk to learn more!
Spark for Everything?

● http://www.meetup.com/Triangle-Apache-Spark-Meetup/
● Next meeting likely to be in late October
Triangle Apache Spark Meetup Group

● spark.apache.org
● databricks.com
● zeppelin.incubator.apache.org
● mammothdata.com/white-papers/spark-a-modern-tool-for-big-data-applications
Links

● Questions for you! (for a $15 Digital Ocean voucher)
1. What is an RDD?
2. What’s the difference between a transformation and an action?
3. When wouldn’t you use Spark Streaming?
Questions?
