Intro to Spark and Spark SQL

"Intro to Spark and Spark SQL" talk by Michael Armbrust of Databricks at AMP Camp 5

1. Intro to Spark and Spark SQL
   AMP Camp 2014
   Michael Armbrust - @michaelarmbrust

2. What is Apache Spark?
   Fast and general cluster computing system, interoperable with Hadoop and included in all major distros.
   Improves efficiency through:
   > In-memory computing primitives
   > General computation graphs
   Improves usability through:
   > Rich APIs in Scala, Java, Python
   > Interactive shell
   Up to 100× faster (2-10× on disk); 2-5× less code.

3. Spark Model
   Write programs in terms of transformations on distributed datasets.
   Resilient Distributed Datasets (RDDs)
   > Collections of objects that can be stored in memory or on disk across a cluster
   > Parallel functional transformations (map, filter, …)
   > Automatically rebuilt on failure

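For illustration, a minimal Scala sketch of this model, assuming the spark-shell's built-in SparkContext sc; the data here is hypothetical and not part of the original deck:

      // Parallelize a local collection into an RDD partitioned across the cluster.
      val numbers = sc.parallelize(1 to 1000000)

      // Transformations (filter, map) are lazy: they only record lineage.
      val evens = numbers.filter(_ % 2 == 0)
      val squares = evens.map(n => n.toLong * n)

      // Actions (count, reduce) trigger the actual distributed computation,
      // and lineage lets lost partitions be rebuilt automatically on failure.
      println(squares.count())
      println(squares.reduce(_ + _))
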
4. More than Map & Reduce
   map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin,
   reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip,
   sample, take, first, partitionBy, mapWith, pipe, save, ...

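A small sketch of a couple of operators beyond map and reduce (reduceByKey and join), again assuming sc from the shell; the datasets and keys are made up for illustration:

      // Needed outside the shell for the pair-RDD operations below (Spark 1.1).
      import org.apache.spark.SparkContext._

      val pageViews = sc.parallelize(Seq(("index.html", 1), ("about.html", 1), ("index.html", 1)))
      val pageOwners = sc.parallelize(Seq(("index.html", "alice"), ("about.html", "bob")))

      // reduceByKey: aggregate view counts per page across partitions.
      val viewCounts = pageViews.reduceByKey(_ + _)

      // join: combine the two datasets on their keys -> (page, (count, owner)).
      viewCounts.join(pageOwners).collect().foreach(println)
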
5. Example: Log Mining
   Load error messages from a log into memory, then interactively search for various patterns:

      val lines = spark.textFile("hdfs://...")
      val errors = lines.filter(_.startsWith("ERROR"))
      val messages = errors.map(_.split("\t")(2))
      messages.cache()

      messages.filter(_.contains("foo")).count()
      messages.filter(_.contains("bar")).count()
      . . .

   [Diagram: the driver ships tasks to workers; each worker reads a block of the base lines RDD, builds the transformed messages RDD, caches it, and returns results for each action.]
   Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).

6. A General Stack
   [Diagram: the Spark core underneath Spark Streaming (real-time), Spark SQL, GraphX (graph), MLlib (machine learning), …]

7. Powerful Stack – Agile Development
   [Bar chart: non-test, non-example source lines of code in Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark; y-axis 0–140,000 lines.]

8. Powerful Stack – Agile Development
   [Same chart, highlighting Spark Streaming.]

9. Powerful Stack – Agile Development
   [Same chart, adding Spark SQL.]

10. Powerful Stack – Agile Development
    [Same chart, adding GraphX.]

11. Powerful Stack – Agile Development
    [Same chart, adding "Your App?".]

12. Community Growth
    [Bar chart of community growth across Spark releases 0.6.0, 0.7.0, 0.8.0, 0.9.0, 1.0.0, and 1.1.0; y-axis 0–200.]

13. Not Just for In-Memory Data

                   Hadoop Record          Spark 100TB    Spark 1PB
    Data Size      102.5TB                100TB          1000TB
    Time           72 min                 23 min         234 min
    # Cores        50,400                 6,592          6,080
    Rate           1.42 TB/min            4.27 TB/min    4.27 TB/min
    Environment    Dedicated data center  Cloud (EC2)    Cloud (EC2)

14. SQL Overview
    • Newest component of Spark, initially contributed by Databricks (< 1 year old)
    • Tightly integrated way to work with structured data (tables with rows/columns)
    • Transform RDDs using SQL
    • Data source integration: Hive, Parquet, JSON, and more

15. Relationship to Shark
    Shark modified the Hive backend to run over Spark, but had two challenges:
    > Limited integration with Spark programs
    > Hive optimizer not designed for Spark
    Spark SQL reuses the best parts of Shark:
    Borrows
    • Hive data loading
    • In-memory column store
    Adds
    • RDD-aware optimizer
    • Rich language interfaces

16. Adding Schema to RDDs
    Spark + RDDs: functional transformations on partitioned collections of opaque objects.
    SQL + SchemaRDDs: declarative transformations on partitioned collections of tuples.
    [Diagram: opaque User objects on the RDD side vs. rows of Name/Age/Height columns on the SchemaRDD side.]

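A minimal sketch contrasting the two styles, assuming the spark-shell (sc available) and a Spark 1.1-era SQLContext; the Person case class and the sample rows are hypothetical:

      import org.apache.spark.sql.SQLContext

      case class Person(name: String, age: Int, height: Double)

      val sqlContext = new SQLContext(sc)
      import sqlContext._  // implicit conversion from RDD[Person] to SchemaRDD

      val people = sc.parallelize(Seq(Person("Michael", 30, 1.8), Person("Andy", 31, 1.7)))

      // RDD style: functional transformation over opaque objects.
      val adultsRdd = people.filter(p => p.age >= 18)

      // SchemaRDD style: declarative transformation over rows with a known schema.
      people.registerAsTable("people")
      val adultsSql = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
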
17. SchemaRDDs: More than SQL
    [Diagram: the SchemaRDD as a unified interface for structured data, connecting SQL and HiveQL, {JSON}, Parquet, MLlib, and JDBC/ODBC.]
    Image credit: http://barrymieny.deviantart.com/

18. Getting Started: Spark SQL
    SQLContext/HiveContext
    • Entry point for all SQL functionality
    • Wraps/extends existing SparkContext

      from pyspark.sql import SQLContext
      sqlCtx = SQLContext(sc)

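For reference, a hedged Scala equivalent (Spark 1.1-era class names; HiveContext additionally requires a Spark build with Hive support):

      import org.apache.spark.sql.SQLContext
      import org.apache.spark.sql.hive.HiveContext

      val sqlContext = new SQLContext(sc)    // plain Spark SQL functionality
      val hiveContext = new HiveContext(sc)  // superset: HiveQL, Hive UDFs, Hive tables
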
19. Example Dataset
    A text file filled with people's names and ages:

      Michael, 30
      Andy, 31
      …

20. RDDs as Relations (Python)

      # Load a text file and convert each line to a dictionary.
      lines = sc.textFile("examples/…/people.txt")
      parts = lines.map(lambda l: l.split(","))
      people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

      # Infer the schema, and register the SchemaRDD as a table.
      peopleTable = sqlCtx.inferSchema(people)
      peopleTable.registerAsTable("people")

21. RDDs as Relations (Scala)

      val sqlContext = new org.apache.spark.sql.SQLContext(sc)
      import sqlContext._

      // Define the schema using a case class.
      case class Person(name: String, age: Int)

      // Create an RDD of Person objects and register it as a table.
      val people = sc.textFile("examples/src/main/resources/people.txt")
        .map(_.split(","))
        .map(p => Person(p(0), p(1).trim.toInt))
      people.registerAsTable("people")

22. RDDs as Relations (Java)

      public class Person implements Serializable {
        private String _name;
        private int _age;

        public String getName() { return _name; }
        public void setName(String name) { _name = name; }
        public int getAge() { return _age; }
        public void setAge(int age) { _age = age; }
      }

      JavaSQLContext sqlCtx = new org.apache.spark.sql.api.java.JavaSQLContext(sc);
      JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").map(
        new Function<String, Person>() {
          public Person call(String line) throws Exception {
            String[] parts = line.split(",");
            Person person = new Person();
            person.setName(parts[0]);
            person.setAge(Integer.parseInt(parts[1].trim()));
            return person;
          }
        });
      JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);

23. Querying Using SQL

      # SQL can be run over SchemaRDDs that have been registered
      # as a table.
      teenagers = sqlCtx.sql("""
        SELECT name FROM people WHERE age >= 13 AND age <= 19""")

      # The results of SQL queries are RDDs and support all the normal
      # RDD operations.
      teenNames = teenagers.map(lambda p: "Name: " + p.name)

24. Existing Tools, New Data Sources
    Spark SQL includes a server that exposes its data using JDBC/ODBC:
    • Query data from HDFS/S3
    • Including formats like Hive/Parquet/JSON*
    • Support for caching data in-memory
    * Coming in Spark 1.2

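A hypothetical client-side sketch (not from the deck): connecting to the Spark SQL Thrift server over plain JDBC. It assumes the server was started with sbin/start-thriftserver.sh, the Hive JDBC driver is on the classpath, and a people table has been registered; host, port, and column names are placeholders.

      import java.sql.DriverManager

      Class.forName("org.apache.hive.jdbc.HiveDriver")
      val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
      val stmt = conn.createStatement()

      // Any table Spark SQL knows about can be queried through the server.
      val rs = stmt.executeQuery("SELECT name, age FROM people WHERE age >= 13")
      while (rs.next()) {
        println(rs.getString(1) + ": " + rs.getInt(2))
      }
      conn.close()
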
25. Caching Tables In-Memory
    Spark SQL can cache tables using an in-memory columnar format:
    • Scan only required columns
    • Fewer allocated objects (less GC)
    • Automatically selects best compression

      cacheTable("people")
      schemaRDD.cache()   (*Requires Spark 1.2)

26. Caching Comparison
    [Diagram: Spark MEMORY_ONLY caching stores each row as a full User object with boxed fields such as java.lang.String, while SchemaRDD columnar caching packs each column (Name, Age, Height) into contiguous ByteBuffers.]

27. Language Integrated UDFs

      registerFunction("countMatches",
        lambda (pattern, text): re.subn(pattern, '', text)[1])

      sql("SELECT countMatches('a', text)…")

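A rough Scala counterpart, assuming a Spark 1.1-era SQLContext named sqlContext; the UDF body and the documents table/column are illustrative only:

      // Register a two-argument UDF and use it from SQL.
      sqlContext.registerFunction("countMatches",
        (pattern: String, text: String) => pattern.r.findAllIn(text).length)

      sqlContext.sql("SELECT countMatches('a', text) FROM documents")
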
28. SQL and Machine Learning

      training_data_table = sql("""
        SELECT e.action, u.age, u.latitude, u.longitude
        FROM Users u
        JOIN Events e ON u.userId = e.userId""")

      def featurize(u):
          return LabeledPoint(u.action, [u.age, u.latitude, u.longitude])

      # SQL results are RDDs so can be used directly in MLlib.
      training_data = training_data_table.map(featurize)
      model = LogisticRegressionWithSGD.train(training_data)

29. Machine Learning Pipelines

      // training: {eventId: String, features: Vector, label: Int}
      val training = parquetFile("/path/to/training")
      val lr = new LogisticRegression().fit(training)

      // event: {eventId: String, features: Vector}
      val event = parquetFile("/path/to/event")
      val prediction = lr.transform(event).select('eventId, 'prediction)
      prediction.saveAsParquetFile("/path/to/prediction")

30. Reading Data Stored in Hive

      from pyspark.sql import HiveContext
      hiveCtx = HiveContext(sc)

      hiveCtx.hql("""
        CREATE TABLE IF NOT EXISTS src (key INT, value STRING)""")
      hiveCtx.hql("""
        LOAD DATA LOCAL INPATH 'examples/…/kv1.txt' INTO TABLE src""")

      # Queries can be expressed in HiveQL.
      results = hiveCtx.hql("FROM src SELECT key, value").collect()

31. Parquet Compatibility
    Native support for reading data in Parquet:
    • Columnar storage avoids reading unneeded data.
    • RDDs can be written to Parquet files, preserving the schema.
    • Convert other slower formats into Parquet for repeated querying.

32. Using Parquet

      # SchemaRDDs can be saved as Parquet files, maintaining the
      # schema information.
      peopleTable.saveAsParquetFile("people.parquet")

      # Read in the Parquet file created above. Parquet files are
      # self-describing so the schema is preserved. The result of
      # loading a parquet file is also a SchemaRDD.
      parquetFile = sqlCtx.parquetFile("people.parquet")

      # Parquet files can be registered as tables used in SQL.
      parquetFile.registerAsTable("parquetFile")
      teenagers = sqlCtx.sql("""
        SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19""")

33. {JSON} Support
    • Use jsonFile or jsonRDD to convert a collection of JSON objects into a SchemaRDD
    • Infers and unions the schema of each record
    • Maintains nested structures and arrays

34. {JSON} Example

      # Create a SchemaRDD from the file(s) pointed to by path.
      people = sqlContext.jsonFile(path)

      # Visualize the inferred schema with printSchema().
      people.printSchema()
      # root
      #  |-- age: integer
      #  |-- name: string

      # Register this SchemaRDD as a table.
      people.registerTempTable("people")

35. Data Sources API
    Allow easy integration with new sources of structured data:

      CREATE TEMPORARY TABLE episodes
      USING com.databricks.spark.avro
      OPTIONS (
        path "./episodes.avro"
      )

    https://github.com/databricks/spark-avro

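As a hedged usage sketch (Spark 1.2-era data sources API, assuming a SQLContext named sqlContext, the spark-avro package on the classpath, and an existing ./episodes.avro file; the title column is hypothetical), the DDL can be issued through the context and the resulting table queried like any other:

      sqlContext.sql("""
        CREATE TEMPORARY TABLE episodes
        USING com.databricks.spark.avro
        OPTIONS (path "./episodes.avro")""")

      // Once registered, the external source behaves like any other table.
      sqlContext.sql("SELECT title FROM episodes").collect().foreach(println)
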
36. Efficient Expression Evaluation
    Interpreting expressions (e.g., 'a + b') can be very expensive on the JVM:
    • Virtual function calls
    • Branches based on expression type
    • Object creation due to primitive boxing
    • Memory consumption by boxed primitive objects

37. Interpreting "a+b"
    1. Virtual call to Add.eval()
    2. Virtual call to a.eval()
    3. Return boxed Int
    4. Virtual call to b.eval()
    5. Return boxed Int
    6. Integer addition
    7. Return boxed result
    [Expression tree: Add with children Attribute a and Attribute b]

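A simplified, self-contained Scala stand-in for this kind of interpreted evaluator (not Catalyst's actual classes) makes the virtual calls and boxing visible:

      trait Expression { def eval(inputRow: Array[Any]): Any }

      case class Attribute(ordinal: Int) extends Expression {
        // Virtual call per row; the returned value is a boxed object.
        def eval(inputRow: Array[Any]): Any = inputRow(ordinal)
      }

      case class Add(left: Expression, right: Expression) extends Expression {
        // Unbox both children, add, and re-box the result on every row.
        def eval(inputRow: Array[Any]): Any =
          left.eval(inputRow).asInstanceOf[Int] + right.eval(inputRow).asInstanceOf[Int]
      }

      // "a + b" as a tree, evaluated via virtual dispatch for each input row.
      val aPlusB = Add(Attribute(0), Attribute(1))
      println(aPlusB.eval(Array[Any](1, 2)))  // 3
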
38. Using Runtime Reflection

      def generateCode(e: Expression): Tree = e match {
        case Attribute(ordinal) =>
          q"inputRow.getInt($ordinal)"
        case Add(left, right) =>
          q"""
          {
            val leftResult = ${generateCode(left)}
            val rightResult = ${generateCode(right)}
            leftResult + rightResult
          }
          """
      }

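For context, one way such a quasiquote Tree can be compiled and executed at runtime is with a reflection ToolBox; this is a hypothetical sketch (it needs scala-reflect and scala-compiler on the classpath and is not Catalyst's actual code generator):

      import scala.reflect.runtime.universe._
      import scala.reflect.runtime.currentMirror
      import scala.tools.reflect.ToolBox

      val tb = currentMirror.mkToolBox()

      // A tree like the one generateCode would produce for "a + b",
      // wrapped in a function so it can be applied to an input row.
      val tree = q"(inputRow: Array[Int]) => inputRow(0) + inputRow(1)"

      val compiled = tb.eval(tree).asInstanceOf[Array[Int] => Int]
      println(compiled(Array(1, 2)))  // 3
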
39. Code Generating "a + b"

      val left: Int = inputRow.getInt(0)
      val right: Int = inputRow.getInt(1)
      val result: Int = left + right
      resultRow.setInt(0, result)

    • Fewer function calls
    • No boxing of primitives

40. Performance Comparison
    [Bar chart: time in milliseconds to evaluate 'a+a+a' one billion times, comparing interpreted evaluation, hand-written code, and code generated with Scala reflection; y-axis 0–40.]

41. Performance vs. Shark
    [Bar chart: Shark vs. Spark SQL runtimes on TPC-DS queries (34, 46, 59, 79, 19, 42, 52, 55, 63, 68, 73, 98, 27, 3, 43, 53, 7, 89), grouped into Deep Analytics and Interactive Reporting workloads; y-axis 0–450.]

  42. 42. What’s Coming in Spark 1.2? • MLlib pipeline support for SchemaRDDs • New APIs for developers to add external data sources • Full support for Decimal and Date types. • Statistics for in-memory data • Lots of bug fixes and improvements to Hive compatibility
43. Questions?
