Not Your Father’s Database:
How to Use Apache Spark Properly in Your Big Data Architecture
Spark Summit East 2016
About Me
2005 Mobile Web & Voice Search
2012 Reporting & Analytics
2014 Solutions Engineering
Is this your Spark infrastructure?
This system talks like a SQL database…
[Diagram: a SQL interface in front of Spark over HDFS]
But the performance is very different…
[Diagram: the same SQL interface, Spark, and HDFS stack]
Just in Time Data Warehouse w/ Spark
[Diagram: Spark querying data in HDFS, other data stores, and more…]
Today’s Goal
Know when to use other data stores besides file systems.
Good: General Purpose Processing
Types of Data Sets to Store in File Systems:
• Archival Data
• Unstructured Data
• Social Media and other web datasets
• Backup copies of data stores
Types of workloads
• Batch Workloads
• Ad Hoc Analysis
– Best Practice: Use in-memory caching
• Multi-step Pipelines
• Iterative Workloads
Benefits:
• Inexpensive Storage
• Incredibly flexible processing
• Speed and Scale
Bad: Random Access
sqlContext.sql(
  "select * from my_large_table where id = 2134823")
Will this command run in Spark?
Yes, but it’s not very efficient: Spark may have to go through all your files to find your row.
Bad: Random Access
Solution: If you frequently access your data randomly, use a database.
• For traditional SQL databases, create an index on your key column.
• Key-value NoSQL stores retrieve the value for a key efficiently out of the box.
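A minimal sketch of the trade-off above, in plain Python rather than Spark (the `rows` data and lookup helpers are illustrative stand-ins): a point lookup over raw files is a full scan, while a database index is a direct jump to the row.

```python
# Sketch (not Spark): why an index beats a full scan for point lookups.
# A file-backed table forces a scan of every row; a hash/B-tree index
# (or a key-value store) locates the row directly.

rows = [{"id": i, "name": f"user_{i}"} for i in range(10_000)]

def lookup_by_scan(rows, key):
    # O(n): what Spark must do over plain files.
    for row in rows:
        if row["id"] == key:
            return row
    return None

# O(1) average: what a database index provides. Built once, reused.
index = {row["id"]: row for row in rows}

def lookup_by_index(index, key):
    return index.get(key)
```

Both lookups return the same row; only the cost differs, and the gap grows with table size.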
Bad: Frequent Inserts
sqlContext.sql("insert into TABLE myTable
  select fields from my2ndTable")
Each insert creates a new file:
• Inserts are reasonably fast.
• But querying will be slow…
Bad: Frequent Inserts
Solution:
• Option 1: Use a database to support the inserts.
• Option 2: Routinely compact your Spark SQL table files.
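Option 2 above can be sketched with plain files (file names and the directory layout here are hypothetical, not Spark's actual on-disk format): many small per-insert files get rewritten as one consolidated file, so later queries open one file instead of hundreds.

```python
# Sketch of compaction: merge many small insert files into one.
import glob
import os
import tempfile

workdir = tempfile.mkdtemp()

# Simulate frequent inserts: each insert leaves one tiny file behind.
for i in range(10):
    with open(os.path.join(workdir, f"part-{i:05d}"), "w") as f:
        f.write(f"row-{i}\n")

# Compact: stream every part file into a single consolidated file,
# then delete the small parts.
parts = sorted(glob.glob(os.path.join(workdir, "part-*")))
with open(os.path.join(workdir, "compacted"), "w") as out:
    for path in parts:
        with open(path) as f:
            out.write(f.read())
for path in parts:
    os.remove(path)
```

In a real pipeline this runs as a scheduled job; the same rows survive, but query-time file listing and opening become cheap again.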
Good: Data Transformation/ETL
Use Spark to slice and dice your data files any way you like.
File storage is cheap: it’s not an anti-pattern to store duplicate copies of your data.
Bad: Frequent/Incremental Updates
Update statements are not supported yet. Why not?
• Random Access: locate the row(s) in the files.
• Delete & Insert: delete the old row and insert a new one.
• Update: file formats aren’t optimized for updating rows.
Solution: Many databases support efficient update operations.
Use Case: Up-to-date, live views of your SQL tables.
Tip: Use CLUSTER BY for fast joins, or bucketing with Spark 2.0.
[Diagram: live view = database snapshot + incremental SQL query]
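The snapshot-plus-incremental pattern can be sketched in SQL, using sqlite here as a stand-in (the table names `snapshot`, `incremental`, and `live_table` are hypothetical): the up-to-date view is simply the periodic snapshot unioned with the rows that arrived after it.

```python
# Sketch: a "live" view built from a snapshot plus incremental rows.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snapshot (id INTEGER, amount INTEGER)")
conn.execute("CREATE TABLE incremental (id INTEGER, amount INTEGER)")
conn.execute("INSERT INTO snapshot VALUES (1, 100), (2, 200)")
conn.execute("INSERT INTO incremental VALUES (3, 300)")  # arrived later

# The live view unions both sources at query time; no row in the
# snapshot ever needs an in-place update.
conn.execute("""
    CREATE VIEW live_table AS
    SELECT id, amount FROM snapshot
    UNION ALL
    SELECT id, amount FROM incremental
""")

live_rows = conn.execute(
    "SELECT id, amount FROM live_table ORDER BY id").fetchall()
```

Periodically the incremental rows are folded into a fresh snapshot and the cycle restarts.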
Good: Connecting BI Tools
Tip: Cache your tables for optimal performance.
[Diagram: BI tools querying Spark tables backed by HDFS]
Bad: External Reporting w/ load
Too many concurrent requests will overload Spark.
[Diagram: many report clients hitting Spark/HDFS directly]
Solution: Write out to a DB to handle load.
Bad: External Reporting w/ load
25
HDFS
DB
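The hand-off above can be sketched in plain Python with sqlite standing in for the serving database (the `monthly_report` table and its rows are invented for illustration): the expensive aggregation happens once in Spark, and the many concurrent report readers only ever touch the database.

```python
# Sketch: persist precomputed report results to a DB so concurrent
# readers never hit the Spark cluster.
import sqlite3

# Pretend these aggregates came out of a (hypothetical) Spark job.
report_rows = [("2016-01", 1200), ("2016-02", 1350)]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE monthly_report (month TEXT PRIMARY KEY, total INTEGER)")
db.executemany("INSERT INTO monthly_report VALUES (?, ?)", report_rows)
db.commit()

# The reporting tool issues cheap indexed reads against the DB.
total = db.execute(
    "SELECT total FROM monthly_report WHERE month = '2016-02'").fetchone()[0]
```

In PySpark the write side of this pattern is typically a JDBC write of the result DataFrame; the sketch only shows the serving side.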
Good: Machine Learning & Data Science
Use MLlib, GraphX, and Spark packages for machine learning and data science.
Benefits:
• Built-in distributed algorithms.
• In-memory capabilities for iterative workloads.
• Data cleansing, featurization, training, testing, etc.
Bad: Searching Content w/ load
sqlContext.sql("select * from mytable
  where name like '%xyz%'")
Spark will go through each row to find results.
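A small plain-Python sketch of why that LIKE query scans everything, and what a search engine does instead (the `docs` data and the word-level index are illustrative; real search engines tokenize and index far more richly): a substring match must test every row, while an inverted index answers the same word query from a precomputed token map.

```python
# Sketch: full scan vs. inverted index for content search.
docs = {
    1: "spark summit east",
    2: "hadoop hdfs storage",
    3: "spark sql engine",
}

# Full scan: test every row, as Spark must for LIKE '%...%'.
scan_hits = {doc_id for doc_id, text in docs.items() if "spark" in text}

# Inverted index: token -> set of doc ids, built once up front.
index = {}
for doc_id, text in docs.items():
    for token in text.split():
        index.setdefault(token, set()).add(doc_id)

index_hits = index.get("spark", set())
```

Under search load, the index answers in time proportional to the result set, not the table size, which is why dedicated search stores handle this workload better than Spark over files.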
Thank you

Not Your Father's Database by Vida Ha
