Point Queries on Compressed RDD
Succinct Spark
Rachit Agarwal
Postdoc, AMPLab
ragarwal@berkeley.edu
Twitter: @_ragarwal_
Succinct Spark

by Rachit Agarwal

Published in: Data & Analytics

  1. Point Queries on Compressed RDD: Succinct Spark
      Rachit Agarwal, Postdoc, AMPLab
      ragarwal@berkeley.edu, Twitter: @_ragarwal_
  2. Why Point queries?
  3. Example: Search( , file)
      Offset lists: 0, 10, 14, 16, 19, 26, 29; 1, 4, 5, 8, 20, 22, 24; 2, 15, 17, 27; 3, 6, 7, 9, 12, 13, 18, 23 ..; 11, 21
      Data scans: low storage, high latency.
      Indexes: high storage, low latency.
  4. Understanding bottlenecks
      [Plot: query latency vs. input size, for data scans and indexes. Labelled regions: scans in faster storage, scans in slower storage, indexes in faster storage, indexes in slower storage. Annotation: executing queries off slower storage.]
  5. Succinct: Search( , file)
      Succinct: low storage, low latency. Queries executed directly on the compressed representation.
      What makes Succinct unique:
      No additional indexes: query responses embedded within the compressed representation.
      No data scans: functionality of indexes.
      No decompression: queries directly on the compressed representation (except for data access queries).
  6. Qualitative comparison
      [Plot: query latency vs. input size, for data scans, indexes, and Succinct. Annotations: avoiding data scans; avoiding queries off slower storage.]
  7. Data Model and Functionality
      Input: flat (unstructured) files. Diagram labels: Original Input, Succinct.
      Search: returns offsets of arbitrary strings in uncompressed file. Search( ) = {0, 10, 14, 16, 19, 26, 29}
      Extract: returns data at arbitrary offsets in uncompressed file. Extract(0, 5) = { , , , , }
      Count: returns count of arbitrary strings in uncompressed file. Count( ) = 7
      Also listed: Append( , , , , ); range queries.
  8. Many powerful abstractions on top
      • Unstructured data
      • Key-value store [Dynamo, MICA]
      • Document store [ElasticSearch, MongoDB, CouchDB]
      • Tables [Cassandra, BigTable]
      Search(Column1, ), Search( )
      Simplicity: Unified “Flat file” Interface
  9. Succinct Spark: if you are already using Spark
      New functionalities: document store, key-value store; search on documents, values.
      Faster operations into RDDs: random access, filters; avoid scans.
      More in-memory: compressed RDDs; no decompression overheads.
  10. SuccinctRDD (for unstructured data)
      // Import classes
      import edu.berkeley.cs.succinct._
      // Create an RDD
      val rdd = ctx.textFile(...).map(_.getBytes)
      // Compress using Succinct
      val succinctRDD = rdd.succinct
      // Extract 100 bytes from offset 50
      val bytes = succinctRDD.extract(50, 100)
      // Count #occurrences of “Berkeley”
      val count = succinctRDD.count("Berkeley")
      // Find all occurrences of “Berkeley”
      val offsets = succinctRDD.search("Berkeley")
  11. SuccinctKVRDD
      // Import classes
      import edu.berkeley.cs.succinct.kv._
      // Load data
      val kvRDD = rdd.zipWithIndex.map(t => (t._2, t._1.getBytes))
      // Compress using Succinct
      val succinctKVRDD = kvRDD.succinctKV
      // Get value for key 0
      val value = succinctKVRDD.get(0)
      // Extract 100 bytes at offset 50 in the value for key 0
      val valueData = succinctKVRDD.extract(0, 50, 100)
      // Find all keys for values that contain “Berkeley”
      val keys = succinctKVRDD.search("Berkeley")
  12. Evaluation
      Datasets: Wikipedia dataset, ~40GB data
      Cluster: Amazon EC2, 5 machines, 30GB
      Workload: Search queries; 1-10000 occurrences
      Systems: Spark, ElasticSearch
      Caveat: Absolute numbers are dataset dependent
  13. Succinct Spark (search)
      Take-away: Succinct Spark 2.75x faster than ElasticSearch while being 2.5x more space efficient (data fits in memory for all systems)
  14. What is Next?
      New functionalities: Support for Regular Expressions
      New data types: Support for JSON
      New abstractions: Support for batch updates
      Done
  15. Support for Regular Expressions
      Applications: Data cleaning, Information Extraction, BioInformatics, Document stores
      Operators: OR, AND, Wildcard, Repeat
      Example: .* (Berkeley | Stanford).edu
  16. RegEx performance
      Take-away: Succinct significantly speeds up RegEx queries even when all the data fits in memory for all systems
  17. SuccinctRDD (for unstructured data)
      // SuccinctRDD: find all matches for the RegEx “William.*Clinton”
      val matches = succinctRDD.regexSearch("William.*Clinton")
      // SuccinctKVRDD: find all keys for values that contain matches for the RegEx “William.*Clinton”
      val matchKeys = succinctKVRDD.regexSearch("William.*Clinton")
  18. Support for JSON
      // Get JSON document with id 0
      val jsonDoc = succinctJsonRDD.get(0)
      // Filter JSON documents where the “city” attribute has value “Berkeley”
      val ids1 = succinctJsonRDD.filter("city", "Berkeley")
      // Search for JSON documents containing “AMPLab”
      val ids2 = succinctJsonRDD.search("AMPLab")
  19. What is really next? A lot of exciting projects
      Improvements: Preprocessing, Better support for scans; 2GB/hr/core, 13MBps/thread
      Integration with Spark ecosystem: Dataframes, Applications
      Testing and benchmarking: Regular Expressions, JSON, Document stores; Help us with workloads
      New functionalities: Support for Encryption, memory-latency tradeoff
      And a few surprises
  20. AND MANY MORE! succinct.cs.berkeley.edu
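
The slides describe Search, Count and Extract only by what they return, so a small reference model may help pin down those semantics. The sketch below is not Succinct's implementation and not its API: it uses plain scans over uncompressed strings, which is exactly what Succinct is designed to avoid. The object name `PointQueryModel` and every method in it are hypothetical, and it assumes `String` data (the slides work on byte arrays, so offsets would differ for non-ASCII text) and an (offset, length) argument order for extract, as the slide annotations suggest.

```scala
// Reference model only: naive scans over uncompressed data.
// All names here are hypothetical; none of this is Succinct's API.
object PointQueryModel {

  /** Offsets of every (possibly overlapping) occurrence of `query` in `data`. */
  def search(data: String, query: String): Seq[Int] =
    if (query.isEmpty) Seq.empty
    else (0 to data.length - query.length).filter(i => data.startsWith(query, i))

  /** Number of occurrences of `query` in `data`. */
  def count(data: String, query: String): Int =
    search(data, query).size

  /** Up to `length` characters starting at `offset`, clamped to the data bounds. */
  def extract(data: String, offset: Int, length: Int): String =
    data.slice(offset, offset + length)

  /** Key-value variant: the value stored under `key`, if any. */
  def kvGet(kv: Map[Long, String], key: Long): Option[String] =
    kv.get(key)

  /** Key-value variant: extract within the value stored under `key`. */
  def kvExtract(kv: Map[Long, String], key: Long, offset: Int, length: Int): Option[String] =
    kv.get(key).map(value => extract(value, offset, length))

  /** Key-value variant: keys whose values contain `query`. */
  def kvSearch(kv: Map[Long, String], query: String): Set[Long] =
    kv.collect { case (key, value) if value.contains(query) => key }.toSet

  /** Key-value variant: keys whose values contain a match for the regex `pattern`. */
  def kvRegexSearch(kv: Map[Long, String], pattern: String): Set[Long] = {
    val regex = pattern.r
    kv.collect { case (key, value) if regex.findFirstIn(value).isDefined => key }.toSet
  }
}
```

For example, on the string "abracadabra", `search(data, "abra")` yields offsets 0 and 7, `count(data, "a")` is 5, and `extract(data, 3, 4)` is "acad". Each of these calls costs time linear in the input size; the point of the deck is that Succinct answers the same queries on the compressed representation without such scans.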
