Not Your Father’s Database: How to Use Apache® Spark™ Properly in Your Big Data Architecture

This session covers a series of use cases where you can store your data cheaply in files and analyze it with Apache Spark, as well as use cases where you want to store your data in a different data source and access it with Spark DataFrames. Here’s an example outline of some of the topics covered in the talk:

Use cases where you store data in file systems and process it with Apache Spark:
- Analyzing a large set of data files.
- Doing ETL of a large amount of data.
- Applying Machine Learning & Data Science to a large dataset.
- Connecting BI/Visualization tools to Apache Spark to analyze large datasets internally.

About Me
• 2005 Mobile Web & Voice Search
• 2012 Reporting & Analytics
• 2014 Solutions Engineering
Is this your Spark infrastructure?
This system talks like a SQL database, but the performance is very different.
[Diagram: applications querying Spark over data in HDFS]
Just in Time Data Warehouse w/ Spark
[Diagrams: Spark serving queries directly over data in HDFS, and more…]
Separate Compute vs. Storage
Benefits:
• No need to import your data into Spark to begin processing.
• Dynamically scale Spark clusters to match compute vs. storage needs.
• Choose the data storage with the performance characteristics that best fit your use case.
Today’s Goal
Know when to use other data stores besides file systems.
Use Case: Data Warehousing
Good: General Purpose Processing
Types of data sets to store in file systems:
• Archival data
• Unstructured data
• Social media and other web datasets
• Backup copies of data stores
Types of workloads:
• Batch workloads
• Ad hoc analysis
  – Best practice: use in-memory caching
• Multi-step pipelines
• Iterative workloads
Benefits:
• Inexpensive storage
• Incredibly flexible processing
• Speed and scale
Bad: Random Access

  sqlContext.sql(
    "select * from my_large_table where id = 2134823")

Will this command run in Spark? Yes, but it’s not very efficient: Spark may have to go through all of your files to find your row.
Solution: If you frequently access rows at random, use a database (see the sketch below).
• For traditional SQL databases, create an index on your key column.
• Key-value NoSQL stores retrieve the value of a key efficiently out of the box.
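A minimal sketch of the database route, using Spark’s JDBC data source so the point lookup is pushed down to the database and answered from its index. The connection URL, table name, and credentials are placeholders:

  // Read the table through the JDBC data source; the filter on the
  // indexed key column is pushed down to the database instead of
  // being evaluated by a full Spark scan over files.
  val table = sqlContext.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb") // placeholder
    .option("dbtable", "my_large_table")
    .option("user", "spark_reader")
    .option("password", "changeme")
    .load()

  table.filter("id = 2134823").show()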
Bad: Frequent Inserts

  sqlContext.sql("insert into TABLE myTable
    select fields from my2ndTable")

Each insert creates a new file:
• Inserts are reasonably fast.
• But querying will be slow…
Solution:
• Option 1: Use a database to support the inserts.
• Option 2: Routinely compact your Spark SQL table files (see the sketch below).
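A sketch of one way to compact, assuming the table lives in Parquet files at a hypothetical path; the target partition count depends on your data volume:

  // Rewrite many small files into a few larger ones, then swap the
  // compacted copy in for the original.
  val df = sqlContext.read.parquet("/data/myTable")

  df.coalesce(16) // aim for a handful of large output files
    .write.mode("overwrite")
    .parquet("/data/myTable_compacted")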
Good: Data Transformation / ETL
Use Spark to splice and dice your data files any way you need.
File storage is cheap: it is not an “anti-pattern” to store the same data more than once.
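For example, a minimal file-to-file ETL pass; the input path, field names, and output layout are hypothetical:

  // Read raw JSON, keep the good records and only the fields we
  // need, and write the result back out as Parquet.
  val events = sqlContext.read.json("/raw/events")

  events.filter("status = 'ok'")
    .select("user_id", "event_time", "payload")
    .write.mode("overwrite")
    .parquet("/warehouse/events_clean")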
Bad: Frequent/Incremental Updates
Update statements are not supported yet. Why not?
• Random access: locate the row(s) in the files.
• Delete & insert: delete the old row and insert a new one.
• Update: file formats aren’t optimized for updating rows.
Solution: Many databases support efficient update operations.
Use case: up-to-date, live views of your SQL tables.
Tip: Use CLUSTER BY for fast joins, or bucketing with Spark 2.0 (see the sketch below).
[Diagram: database snapshot + incremental SQL query combined into a live view]
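A sketch of the bucketing tip using the Spark 2.0 DataFrameWriter API; snapshotDf stands for the DataFrame holding your table snapshot, and the bucket count and column names are illustrative:

  // Write a bucketed, sorted table so later joins on user_id can
  // avoid a full shuffle (Spark 2.0+; bucketing requires saveAsTable).
  snapshotDf.write
    .bucketBy(64, "user_id")
    .sortBy("user_id")
    .saveAsTable("snapshot_bucketed")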
Good: Connecting BI Tools
Tip: Cache your tables for optimal performance (see the sketch below).
[Diagram: BI tools querying Spark over data in HDFS]
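A sketch of the caching tip; "my_table" is a placeholder name:

  // Mark the table for in-memory caching (lazy), then run a query
  // to materialize the cache so BI queries stop re-reading files.
  sqlContext.cacheTable("my_table")
  sqlContext.sql("select count(*) from my_table").show()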
Bad: External Reporting w/ Load
Too many concurrent requests will start to queue up.
[Diagram: many report consumers querying Spark/HDFS directly]
Solution: Write out to a DB as a cache to handle the load.
[Diagram: Spark reads from HDFS and writes results to a DB, which serves the reports]
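A sketch of pushing a result table into a serving database over JDBC; the query, URL, and credentials are placeholders:

  import java.util.Properties

  // Aggregate once in Spark, then let report traffic hit the database.
  val summary = sqlContext.sql(
    "select day, count(*) as views from events group by day")

  val props = new Properties()
  props.setProperty("user", "report_writer")
  props.setProperty("password", "changeme")

  summary.write.mode("overwrite")
    .jdbc("jdbc:postgresql://dbhost:5432/reports", "daily_summary", props)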
Use Case: Advanced Analytics and Data Science
Good: Machine Learning & Data Science
Use MLlib, GraphX, and Spark Packages for machine learning and data science.
Benefits:
• Built-in distributed algorithms.
• In-memory capabilities for iterative workloads.
• All-in-one solution: data cleansing, featurization, training, testing, serving, etc.
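A small sketch of an iterative MLlib workload that benefits from in-memory caching; the input path and parameters are illustrative:

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  // Parse numeric feature vectors and cache them: k-means makes
  // repeated passes over the same data.
  val points = sc.textFile("/data/features.csv")
    .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
    .cache()

  val model = KMeans.train(points, 10, 20) // k = 10, 20 iterations
  model.clusterCenters.foreach(println)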
Bad: Searching Content w/ Load

  sqlContext.sql("select * from mytable
    where name like '%xyz%'")

Spark will go through each row to find results.
Use Case: Streaming and Realtime Analytics
Good: Periodic Scheduled Jobs
Schedule your workloads to run on a regular basis:
• Launch a dedicated cluster for important workloads.
• Output your results as reports, or store them to files or a database.
• Poor man’s streaming: Spark is fast, so push the interval to be frequent.
Bad: Low-Latency Stream Processing
Spark Streaming can detect new files dropped into a folder and process them, but there is a delay while a whole file’s worth of data builds up.
Solution: Send data to message queues, not files (see the sketch below).
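A sketch of the message-queue route using the Spark 1.x Kafka direct stream; the broker address, topic, and batch interval are placeholders:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils
  import kafka.serializer.StringDecoder

  val ssc = new StreamingContext(new SparkConf().setAppName("events"), Seconds(5))

  // Consume records as they arrive instead of waiting for whole files.
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc,
    Map("metadata.broker.list" -> "broker:9092"),
    Set("events"))

  stream.map(_._2).count().print() // records per 5-second batch
  ssc.start()
  ssc.awaitTermination()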
Thank you
Spark Summit East 2016
