Spark Camp @ Strata CA

An Intro to Apache Spark with Hands-on Tutorials
Wed Feb 18, 2015 9:00am–5:00pm
strataconf.com/big-data-conference-ca-2015/
Spark Camp @ Strata + Hadoop World
A day-long, hands-on introduction to the Spark platform, including
Spark Core, the Spark Shell, Spark Streaming, Spark SQL, MLlib,
GraphX, and more…

• an overview of use cases and a demonstration of writing simple Spark applications
• coverage of each of the main components of the Spark stack
• a series of technical talks targeted at developers who are new to Spark
• periods of hands-on lab work intermixed with the talks
Spark Camp @ Strata + Hadoop World
Strata NY @ NYC

2014-10-15

~450 people
Strata EU @ Barcelona

2014-11-19

~250 people
Spark Camp: Ask Us Anything
Fri, Feb 20 2:20pm-3:00pm

strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/40701
Join the Spark team for an informal question-and-answer
session. Several of the Spark committers, trainers, etc.,
from Databricks will be on hand to field a wide range of
detailed questions.

Even if you don’t have a specific question, join in to hear
what others are asking!
Apache Spark Advanced Training
Feb 17-19 9:00am-5:00pm

strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/39399
Sameer Farooqui leads this new 3-day training program
offered by Databricks and O’Reilly Media at
Strata + Hadoop World events worldwide.

Participants will also receive limited free-tier
accounts on Databricks Cloud.

Note: this sold out early, so if you want to
attend it at Strata EU, sign up quickly!
Spark Developer Certification

Fri Feb 20, 2015 10:40am-12:40pm
• http://oreilly.com/go/sparkcert
• defined by Spark experts @Databricks
• assessed by O’Reilly Media
• establishes the bar for Spark expertise
• 40 multiple-choice questions, 90 minutes
• mostly structured as choices among code blocks
• expect some Python, Java, Scala, SQL
• understand theory of operation
• identify best practices
• recognize code that is more parallel, less memory-constrained
Overall, you need to write Spark apps in practice.
Developer Certification: Overview
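As an illustration of that last point, here is a sketch (not an actual exam question; the RDD name `pairs` is assumed to hold (word, 1) tuples) contrasting two ways to sum counts per key:

```scala
// Assume pairs: RDD[(String, Int)], e.g. each word mapped to 1.

// More parallel, less memory-constrained: values are combined
// within each partition before being shuffled across the network.
val counts = pairs.reduceByKey(_ + _)

// Memory-hungry alternative: every value for a key is shuffled
// and buffered on one node before the sum is computed.
val counts2 = pairs.groupByKey().mapValues(_.sum)
```

Both produce the same result; the first does map-side aggregation, which is exactly the kind of distinction the exam probes.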
Even More Apache Spark!

Feb 17-20, 2015
Keynote: New Directions for Spark in 2015
Fri Feb 20 9:15am-9:25am

strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/39547
As the Apache Spark user base grows, the developer community is working
to adapt it for ever-wider use cases. 2014 saw fast adoption of Spark in the
enterprise and major improvements in its performance, scalability, and
standard libraries. In 2015, we want to make Spark accessible to a wider
set of users through new high-level APIs for data science: machine learning
pipelines, data frames, and R language bindings. In addition, we are defining
extension points to let Spark grow as a platform, making it easy to plug in
data sources, algorithms, and external packages. Like all work on Spark,
these APIs are designed to plug seamlessly into Spark applications, giving
users a unified platform for streaming, batch, and interactive data processing.
Matei Zaharia – started the Spark project
at UC Berkeley; currently CTO of Databricks,
Spark VP at Apache, and an assistant professor
at MIT
Databricks Spark Talks @ Strata + Hadoop World
Thu Feb 19 10:40am-11:20am

strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/38237
Lessons from Running Large-Scale Spark Workloads

Reynold Xin, Matei Zaharia
Thu Feb 19 4:00pm–4:40pm

strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/38518
Spark Streaming – The State of the Union, and Beyond

Tathagata Das
Databricks Spark Talks @ Strata + Hadoop World
Fri Feb 20 11:30am-12:10pm

strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/38237
Tuning and Debugging in Apache Spark

Patrick Wendell
Fri Feb 20 4:00pm–4:40pm

strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/38391
Everyday I’m Shuffling – Tips for Writing Better Spark Programs

Vida Ha, Holden Karau
A Brief History
A Brief History: Functional Programming for Big Data
circa late 1990s: explosive growth in e-commerce and
machine data implied that workloads could no longer fit
on a single computer…

Notable firms led the shift to horizontal scale-out on
clusters of commodity hardware, especially for machine
learning use cases at scale.
A Brief History: Functional Programming for Big Data
2002: MapReduce @ Google
2004: MapReduce paper
2006: Hadoop @ Yahoo!
2008: Hadoop Summit
2010: Spark paper
2014: Apache Spark becomes a top-level Apache project
circa 2002: mitigate risk of large distributed workloads
lost due to disk failures on commodity hardware…

Google File System
Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung
research.google.com/archive/gfs.html

MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean, Sanjay Ghemawat
research.google.com/archive/mapreduce.html
A Brief History: MapReduce
MR doesn’t compose well for large applications,
and so specialized systems emerged as workarounds.

General batch processing: MapReduce
Specialized systems (iterative, interactive, streaming, graph, etc.):
Pregel, Giraph, Dremel, Drill, Tez, Impala, GraphLab, Storm, S4, F1, MillWheel
A Brief History: MapReduce
Developed in 2009 at UC Berkeley AMPLab, then
open sourced in 2010, Spark has since become
one of the largest OSS communities in big data,
with over 200 contributors in 50+ organizations.
spark.apache.org
“Organizations that are looking at big data challenges –
including collection, ETL, storage, exploration and analytics –
should consider Spark for its in-memory performance and
the breadth of its model. It supports advanced analytics
solutions on Hadoop clusters, including the iterative model
required for machine learning and graph analysis.”
Gartner, Advanced Analytics and Data Science (2014)
A Brief History: Spark
A Brief History: Spark
Spark is one of the most active Apache projects
ohloh.net/orgs/apache
TL;DR: Sustained Exponential Growth
databricks.com/blog/2015/01/27/big-data-projects-are-
hungry-for-simpler-and-more-powerful-tools-survey-
validates-apache-spark-is-gaining-developer-traction.html
TL;DR: Spark Survey 2015 by Databricks + Typesafe
databricks.com/blog/2014/11/05/spark-officially-
sets-a-new-record-in-large-scale-sorting.html
TL;DR: Smashing the Previous Petabyte Sort Record
oreilly.com/data/free/2014-data-science-
salary-survey.csp
TL;DR: Spark Expertise Tops Median Salaries within Big Data
Unifying the Pieces
WordCount in 3 lines of Spark
WordCount in 50+ lines of Java MR
Simple Spark Apps: WordCount
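The three-line Spark version looks roughly like this (a sketch assuming a running Spark shell, where `sc` is the SparkContext and the input/output paths are placeholders):

```scala
val lines = sc.textFile("hdfs://input.txt")  // placeholder path
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://wordcount-output")  // placeholder path
```

The equivalent Java MapReduce program needs a Mapper class, a Reducer class, and a driver with job configuration, which is where the 50+ lines go.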
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support all the
// normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
Data Workflows: Spark SQL
// http://spark.apache.org/docs/latest/streaming-programming-guide.html

import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

// create a StreamingContext with a SparkConf configuration
val ssc = new StreamingContext(sparkConf, Seconds(10))

// create a DStream that will connect to serverIP:serverPort
val lines = ssc.socketTextStream(serverIP, serverPort)

// split each line into words
val words = lines.flatMap(_.split(" "))

// count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// print a few of the counts to the console
wordCounts.print()

ssc.start()             // start the computation
ssc.awaitTermination()  // wait for the computation to terminate
Data Workflows: Spark Streaming
spark.apache.org/docs/latest/mllib-guide.html

Key Points:
• framework vs. library
• scale, parallelism, sparsity
• building blocks for a long-term approach

MLI: An API for Distributed Machine Learning
Evan Sparks, Ameet Talwalkar, et al.
International Conference on Data Mining (2013)
http://arxiv.org/abs/1310.5426
Data Workflows: MLlib
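As a taste of the library, here is a minimal MLlib clustering sketch, adapted from the k-means example in the MLlib guide linked above (assumes a Spark shell with `sc` in scope; the input path is a placeholder pointing at a whitespace-delimited numeric file):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse each line into a dense vector of doubles.
val data = sc.textFile("data/mllib/kmeans_data.txt")  // placeholder path
val parsed = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Cluster the data into two groups, with up to 20 iterations.
val clusters = KMeans.train(parsed, 2, 20)

// Evaluate the clustering by within-set sum of squared errors.
println("WSSSE = " + clusters.computeCost(parsed))
```

Note the “framework vs. library” point above: the algorithm runs as ordinary RDD operations, so it parallelizes across the cluster with no extra plumbing.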
Community

Resources
community:
spark.apache.org/community.html
events worldwide: goo.gl/2YqJZK
video + presentation archives: spark-summit.org
resources: databricks.com/spark-training-resources
workshops: databricks.com/spark-training
books:
Fast Data Processing with Spark
Holden Karau
Packt (2013)
shop.oreilly.com/product/9781782167068.do

Spark in Action
Chris Fregly
Manning (2015*)
sparkinaction.com/

Learning Spark
Holden Karau, Andy Konwinski, Matei Zaharia
O’Reilly (2015*)
shop.oreilly.com/product/0636920028512.do
spark-summit.org
20% discount code: SSEDBFRIEND20
