Not Your Father’s Database:
How to Use Apache Spark Properly in Your Big Data Architecture
About Me
• 2005 Mobile Web & Voice Search
• 2012 Reporting & Analytics
• 2014 Solutions Architect
Is this your Spark infrastructure?
This system talks like a SQL database…
…but the performance is very different.
[Diagram: a SQL interface backed by Spark over HDFS]
Just in Time Data Warehouse w/ Spark
[Diagram: Spark reading directly from HDFS, and more…]
Separate Compute vs. Storage
Benefits:
• No need to import your data into Spark to begin processing.
• Dynamically scale Spark clusters to match compute vs. storage needs.
• Choose the best data storage with different performance characteristics for your use case.
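A minimal sketch of processing data in place, assuming a SQLContext and a Parquet dataset on S3 (the bucket and path names are hypothetical):

# Read the data where it lives; no import step into Spark
df = sqlContext.read.parquet("s3a://my-bucket/events/")
df.registerTempTable("events")  # query it like a table
sqlContext.sql("select count(*) from events").show()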
Today’s Goal
Know when to use other data stores besides file systems.

Use Case: Data Warehousing
Good: General Purpose Processing
Types of Data Sets to Store in File Systems:
• Archival Data
• Unstructured Data
• Social Media and other web datasets
• Backup copies of data stores
Good: General Purpose Processing
Types of workloads:
• Batch Workloads
• Ad Hoc Analysis
  – Best Practice: Use in-memory caching (see the sketch below)
• Multi-step Pipelines
• Iterative Workloads
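A minimal caching sketch for ad hoc analysis, assuming a table named "events" is already registered:

sqlContext.cacheTable("events")  # materialize the table in memory
sqlContext.sql("select count(*) from events").show()  # first run fills the cache
sqlContext.sql("select count(*) from events").show()  # later runs read from memory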
Good: General Purpose Processing
Benefits:
• Inexpensive Storage
• Incredibly flexible processing
• Speed and Scale
Bad: Random Access
sqlContext.sql(
  "select * from my_large_table where id=2134823")
Will this command run in Spark?
Yes, but it’s not very efficient: Spark may have to go through all your files to find your row.
Bad: Random Access
Solution: If you frequently randomly access your data, use a database.
• For traditional SQL databases, create an index on your key column.
• Key-value NoSQL stores retrieve the value of a key efficiently out of the box.
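A minimal sketch of pushing point lookups to a database over JDBC instead of scanning files (the connection details and table name are hypothetical):

props = {"user": "spark", "password": "secret"}
df = sqlContext.read.jdbc(
    "jdbc:mysql://dbhost:3306/mydb", "my_large_table", properties=props)
df.where(df.id == 2134823).show()  # the filter is pushed down to the database index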
Bad: Frequent Inserts
sqlContext.sql("insert into TABLE myTable
  select fields from my2ndTable")
Each insert creates a new file:
• Inserts are reasonably fast.
• But querying will be slow…
Bad: Frequent Inserts
Solution:
• Option 1: Use a database to support the inserts.
• Option 2: Routinely compact your Spark SQL table files (see the sketch below).
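A minimal compaction sketch that rewrites a table’s many small files as a few larger ones (the table and output path are hypothetical):

df = sqlContext.table("myTable")
(df.coalesce(8)  # collapse to a small number of partitions, hence files
   .write.mode("overwrite")
   .parquet("/warehouse/myTable_compacted"))  # write to a new location, then swap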
Good: Data Transformation/ETL
Use Spark to slice and dice your data files any way you want.
File storage is cheap: it is not an “anti-pattern” to store duplicate copies of your data.
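A minimal ETL sketch: read raw JSON, reshape it, and write a second, query-friendly copy as Parquet (the paths and field names are hypothetical):

raw = sqlContext.read.json("/data/raw/events/")
cleaned = (raw.filter(raw.userId.isNotNull())  # drop incomplete records
              .select("userId", "eventType", "ts"))  # keep only the fields we query
cleaned.write.mode("overwrite").parquet("/data/cleaned/events/")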
Bad: Frequent/Incremental Updates
Update statements are not supported yet.
Why not?
• Random Access: locate the row(s) in the files.
• Delete & Insert: delete the old row and insert a new one.
• Update: file formats aren’t optimized for updating rows.
Solution: Many databases support efficient update operations.
Bad: Frequent/Incremental Updates
Use Case: Up-to-date, live views of your SQL tables.
Tip: Use CLUSTER BY for fast joins, or bucketing with Spark 2.0.
[Diagram: a database snapshot plus an incremental SQL query yields a live view]
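A minimal sketch of the snapshot-plus-incremental pattern, assuming two registered tables (the names are hypothetical):

snapshot = sqlContext.table("orders_snapshot")  # bulk copy, refreshed periodically
incremental = sqlContext.table("orders_incremental")  # rows arrived since the snapshot
live = snapshot.unionAll(incremental)  # up-to-date view of the source table
live.registerTempTable("orders_live")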
Good: Connecting BI Tools
Tip: Cache your tables for optimal performance.
[Diagram: BI tools querying Spark tables backed by HDFS]
Bad: External Reporting w/ load
Too many concurrent requests will start to queue up.
[Diagram: many report requests hitting Spark/HDFS directly]
Bad: External Reporting w/ load
Solution: Write out to a DB as a cache to handle the load.
[Diagram: Spark computes results from HDFS and writes them to a DB, which serves the reports]
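A minimal sketch of materializing a report into a database that then absorbs the concurrent read traffic (the query and connection details are hypothetical):

report = sqlContext.sql(
    "select region, sum(amount) as total from sales group by region")
report.write.jdbc(
    "jdbc:mysql://dbhost:3306/reports", "sales_by_region",
    mode="overwrite", properties={"user": "spark", "password": "secret"})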
Use Case: Advanced Analytics and Data Science
Good: Machine Learning & Data Science
Use MLlib, GraphX, and Spark packages for machine learning and data science.
Benefits:
• Built-in distributed algorithms.
• In-memory capabilities for iterative workloads.
• All-in-one solution: data cleansing, featurization, training, testing, serving, etc.
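A minimal MLlib pipeline sketch covering featurization and training in one pass, assuming a DataFrame df with "text" and "label" columns (the names are hypothetical):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)
model = Pipeline(stages=[tokenizer, tf, lr]).fit(df)  # cleanse, featurize, train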
Bad: Searching Content w/ load
sqlContext.sql("select * from mytable
  where name like '%xyz%'")
Spark will go through each row to find results.
Clarification: Serving a live ML model
Spark Streaming isn’t necessarily required to serve a live ML model.
A simple web server that calls into Spark with the trained model may suffice.
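A minimal sketch of that idea using Flask in front of a model trained offline in Spark (Flask, the model path, and the request format are all assumptions; sc is an existing SparkContext):

from flask import Flask, jsonify, request
from pyspark.mllib.classification import LogisticRegressionModel

app = Flask(__name__)
model = LogisticRegressionModel.load(sc, "/models/lr")  # trained offline in Spark

@app.route("/predict")
def predict():
    # e.g. /predict?features=0.1,0.5,0.9
    features = [float(x) for x in request.args["features"].split(",")]
    return jsonify(prediction=int(model.predict(features)))

app.run(port=8080)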
Use Case: Streaming and Realtime Analytics
Good: Periodic Scheduled Jobs
Schedule your workloads to run on a regular basis:
• Launch a dedicated cluster for important workloads.
• Output your results as reports, or store them to files or a database.
• Poor Man’s Streaming: Spark is fast, so push the interval to be frequent (see the sketch below).
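A minimal "poor man’s streaming" sketch: re-run a small batch job on a short interval (the paths and interval are hypothetical):

import time

while True:
    df = sqlContext.read.json("/data/incoming/")  # pick up whatever has landed
    df.groupBy("eventType").count() \
      .write.mode("overwrite").parquet("/reports/event_counts")
    time.sleep(300)  # re-run every five minutes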
Bad: Low Latency Stream Processing
Files can be used as an input source for Spark Streaming, but data is not available immediately.
Solution: Send data to message queues, not files, for low-latency stream processing.
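A minimal sketch of consuming from a message queue, here Kafka via the direct stream API (the broker and topic names are hypothetical; sc is an existing SparkContext):

from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, batchDuration=1)  # one-second micro-batches
stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "broker:9092"})
stream.map(lambda kv: kv[1]).count().pprint()  # values only; count per batch
ssc.start()
ssc.awaitTermination()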
Clarification: Instantaneous vs. Streaming
With data cached in memory, Spark can return results quickly, even over live data.
Spark Streaming is needed when you have a continuous stream of input data to process.
Thank you
