Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data Architecture
Spark Summit East 2016
Not Your Father's Database by Vida Ha
Spark Summit East Talk
Published in: Data & Analytics
  1. Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data Architecture (Spark Summit East 2016)
  3–5. About Me: 2005 Mobile Web & Voice Search; 2012 Reporting & Analytics; 2014 Solutions Engineering
  6. Is this your Spark infrastructure? This system talks like a SQL database… [diagram labels: SQL, HDFS]
  7. Is this your Spark infrastructure? …but the performance is very different. [diagram labels: SQL, HDFS]
  8–10. Just in Time Data Warehouse w/ Spark … and more [diagram label: HDFS]
  11. Today's Goal: Know when to use other data stores besides file systems.
  12. Good: General Purpose Processing
      Types of Data Sets to Store in File Systems:
      • Archival Data
      • Unstructured Data
      • Social Media and other web datasets
      • Backup copies of data stores
  13. Good: General Purpose Processing
      Types of workloads:
      • Batch Workloads
      • Ad Hoc Analysis
        – Best Practice: Use in-memory caching
      • Multi-step Pipelines
      • Iterative Workloads
  14. Good: General Purpose Processing
      Benefits:
      • Inexpensive Storage
      • Incredibly flexible processing
      • Speed and Scale
  15. Bad: Random Access
      sqlContext.sql(
        "select * from my_large_table where id=2134823")
      Will this command run in Spark?
  16. Bad: Random Access
      sqlContext.sql(
        "select * from my_large_table where id=2134823")
      Will this command run in Spark?
      Yes, but it's not very efficient: Spark may have to go through all your files to find your row.
  17. Bad: Random Access
      Solution: If you frequently randomly access your data, use a database.
      • For traditional SQL databases, create an index on your key column.
      • Key-value NoSQL stores retrieve the value of a key efficiently out of the box.
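The contrast in slides 15–17 (a point lookup forces Spark to scan files, while a database index jumps straight to the row) can be sketched in plain Python. This is a conceptual illustration, not Spark or a real database; all names and data are made up:

```python
# Sketch: why point lookups are slow on raw files but fast with an index.
# (Plain Python illustration; a dict stands in for a database's key index.)

rows = [{"id": i, "name": f"user{i}"} for i in range(100_000)]

def full_scan(rows, target_id):
    """What Spark effectively does on un-indexed files: check every row."""
    comparisons = 0
    for row in rows:
        comparisons += 1
        if row["id"] == target_id:
            return row, comparisons
    return None, comparisons

# A key index is, conceptually, a precomputed key -> row lookup structure.
index = {row["id"]: row for row in rows}

def indexed_lookup(index, target_id):
    """One hash probe instead of scanning everything."""
    return index.get(target_id)

row, comparisons = full_scan(rows, 93_482)
assert row == indexed_lookup(index, 93_482)
print(f"full scan touched {comparisons} rows; the index touched 1")
```

The same data comes back either way; the difference is the work done per query, which is what makes frequent random access a poor fit for file scans.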
  18. Bad: Frequent Inserts
      sqlContext.sql("insert into TABLE myTable
        select fields from my2ndTable")
      Each insert creates a new file:
      • Inserts are reasonably fast.
      • But querying will be slow…
  19. Bad: Frequent Inserts
      Solution:
      • Option 1: Use a database to support the inserts.
      • Option 2: Routinely compact your Spark SQL table files.
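The problem in slides 18–19 (each insert writes a new file, so queries must open ever more files) and the compaction fix can be sketched in plain Python. In-memory lists stand in for files here; this is not actual Spark table management:

```python
# Sketch: why many small insert files slow queries, and what compaction does.
# (Plain Python; each "file" is just a list of rows.)

files = []  # each INSERT appends a new small "file"

def insert(rows):
    files.append(list(rows))  # fast: just write one new file

def query_all():
    # every query must open and read every file; cost grows with file count
    return [row for f in files for row in f]

def compact():
    # merge all small files into one big file, so queries touch one file again
    merged = query_all()
    files.clear()
    files.append(merged)

for i in range(100):          # 100 small inserts -> 100 files
    insert([{"id": i}])
assert len(files) == 100

before = query_all()
compact()
assert len(files) == 1        # one file after compaction
assert query_all() == before  # same data, far fewer files to read
```

Compaction trades a periodic rewrite for cheaper reads, which is why the slide suggests running it routinely rather than after every insert.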
  20. Good: Data Transformation/ETL
      Use Spark to slice and dice your data files any way you need.
      File storage is cheap: it is not an "anti-pattern" to store duplicate copies of your data.
  21. Bad: Frequent/Incremental Updates
      Update statements are not supported yet. Why not?
      • Random Access: locate the row(s) in the files.
      • Delete & Insert: delete the old row and insert a new one.
      • Update: file formats aren't optimized for updating rows.
      Solution: Many databases support efficient update operations.
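The cost slide 21 describes (locate the row, then delete and re-insert by rewriting the file) can be made concrete with a small plain-Python sketch of updating one row in an immutable line-delimited file. The file name and row layout are invented for illustration:

```python
# Sketch: why UPDATE against immutable files is expensive — the whole file
# must be rewritten to change a single row. (Plain Python, not Spark.)

import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "part-00000.json")

# write an immutable "data file": one JSON row per line
with open(path, "w") as f:
    for i in range(1000):
        f.write(json.dumps({"id": i, "status": "old"}) + "\n")

def update_row(path, target_id, new_status):
    """Emulate UPDATE: scan every row, change the match, rewrite the file."""
    with open(path) as f:
        rows = [json.loads(line) for line in f]     # random access: full scan
    for row in rows:
        if row["id"] == target_id:
            row["status"] = new_status              # the one real change
    with open(path, "w") as f:                      # delete & insert: rewrite all
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return len(rows)                                # rows rewritten for 1 change

rewritten = update_row(path, 42, "new")
print(f"rewrote {rewritten} rows to update 1")
```

A database with in-place or log-structured updates avoids this full-file rewrite, which is the slide's point about where frequent updates belong.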
  22. Bad: Frequent/Incremental Updates
      Use Case: Up-to-date, live views of your SQL tables. [diagram: Database Snapshot + Incremental SQL Query]
      Tip: Use ClusterBy for fast joins, or Bucketing with Spark 2.0.
  23. Good: Connecting BI Tools
      Tip: Cache your tables for optimal performance. [diagram label: HDFS]
  24. Bad: External Reporting w/ load
      Too many concurrent requests will overload Spark. [diagram label: HDFS]
  25. Bad: External Reporting w/ load
      Solution: Write out to a DB to handle the load. [diagram labels: HDFS, DB]
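The pattern in slides 24–25 (compute the report once in Spark, then serve concurrent reporting traffic from a database) can be sketched in plain Python, with sqlite3 standing in for the external DB and a hard-coded list standing in for a Spark job's aggregated output. Table and column names are made up:

```python
# Sketch: offload reporting reads to a DB so concurrent dashboard traffic
# hits the database, not the Spark cluster. (sqlite3 stands in for the DB;
# report_rows stands in for the output of a Spark aggregation job.)

import sqlite3

# pretend this came out of a Spark batch aggregation
report_rows = [("2016-01", 120), ("2016-02", 95), ("2016-03", 143)]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE monthly_report (month TEXT PRIMARY KEY, total INTEGER)")
db.executemany("INSERT INTO monthly_report VALUES (?, ?)", report_rows)
db.commit()

# many concurrent dashboard queries now run against the DB, cheaply
total = db.execute(
    "SELECT total FROM monthly_report WHERE month = ?", ("2016-02",)
).fetchone()[0]
print(total)
```

The design choice is the same as in the slide's diagram: Spark does the heavy batch work once, and a store built for concurrent point reads serves the load.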
  26. Good: Machine Learning & Data Science
      Use MLlib, GraphX, and Spark packages for machine learning and data science.
      Benefits:
      • Built-in distributed algorithms.
      • In-memory capabilities for iterative workloads.
      • Data cleansing, featurization, training, testing, etc.
  27. Bad: Searching Content w/ load
      sqlContext.sql("select * from mytable
        where name like '%xyz%'")
      Spark will go through each row to find results.
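Slide 27's point — a `LIKE '%xyz%'` predicate forces Spark to test every row, whereas a search system answers from an inverted index — can be sketched in plain Python. This toy index is only illustrative; a real deployment would use a dedicated search engine rather than anything like this:

```python
# Sketch: LIKE '%needle%' scans every row; an inverted index answers token
# queries with one lookup. (Plain Python toy, not Spark or a search engine.)

docs = {1: "spark summit east", 2: "apache spark sql", 3: "hdfs storage"}

def like_scan(docs, needle):
    """What Spark does for LIKE '%needle%': test every row's text."""
    return sorted(doc_id for doc_id, text in docs.items() if needle in text)

# inverted index: token -> set of doc ids, built once up front
inverted = {}
for doc_id, text in docs.items():
    for token in text.split():
        inverted.setdefault(token, set()).add(doc_id)

def index_search(inverted, token):
    """One dictionary lookup instead of scanning all rows."""
    return sorted(inverted.get(token, set()))

assert like_scan(docs, "spark") == index_search(inverted, "spark") == [1, 2]
```

Note the trade-off: the index only matches whole tokens (`like_scan(docs, "park")` still finds "spark" rows, `index_search` does not), which is part of why search engines layer analyzers and n-gram tokenization on top of the basic inverted index.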
  28. Thank you