
Succinct Spark


by Rachit Agarwal

Published in: Data & Analytics


  1. Point Queries on Compressed RDD: Succinct Spark
     Rachit Agarwal, Postdoc, AMPLab
     ragarwal@berkeley.edu, Twitter: @_ragarwal_
  2. Why point queries?
  3. Example: Search( , file)
     Index (offsets stored per symbol):
       0, 10, 14, 16, 19, 26, 29
       1, 4, 5, 8, 20, 22, 24
       2, 15, 17, 27
       3, 6, 7, 9, 12, 13, 18, 23
       ..
       11, 21
     Data scans: low storage, high latency
     Indexes: high storage, low latency
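The scan-versus-index tradeoff on slide 3 can be illustrated with a small sketch. This is not Succinct code: the object and method names below are hypothetical, and the "index" is a plain in-memory map from each symbol to its offsets, chosen only to mirror the per-symbol offset lists on the slide.

```scala
object ScanVsIndex {
  // Data scan: needs no storage beyond the data itself,
  // but every query reads the whole input.
  def scanSearch(data: String, symbol: Char): Vector[Int] =
    data.indices.filter(i => data(i) == symbol).toVector

  // Index: built once up front, stores one offset per input symbol
  // (extra storage), after which a query is a single map lookup.
  def buildIndex(data: String): Map[Char, Vector[Int]] =
    data.zipWithIndex
      .groupBy(_._1)
      .map { case (symbol, pairs) => symbol -> pairs.map(_._2).toVector }

  def indexSearch(index: Map[Char, Vector[Int]], symbol: Char): Vector[Int] =
    index.getOrElse(symbol, Vector.empty)
}
```

Both functions return the same offsets; the difference is where the cost goes. The scan pays on every query, while the index pays in memory, which is the tension the following slides build on.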
  4. Understanding bottlenecks: executing queries off slower storage
     [Plot: query latency vs. input size for data scans and indexes. Regions: scans in faster storage, scans in slower storage, indexes in faster storage, indexes in slower storage.]
  5. Succinct: Search( , file)
     Succinct: low storage, low latency. Queries are executed directly on the compressed representation.
     What makes Succinct unique:
       • No additional indexes: query responses are embedded within the compressed representation
       • No data scans: functionality of indexes
       • No decompression: queries run directly on the compressed representation (except for data access queries)
  6. Qualitative comparison
     [Plot: query latency vs. input size for data scans, indexes, and Succinct. Annotations: avoiding data scans, avoiding queries off slower storage.]
  7. Data model and functionality
     Input: flat (unstructured) files. Succinct is built from the original input.
       • Search: returns offsets of arbitrary strings in the uncompressed file. Search( ) = {0, 10, 14, 16, 19, 26, 29}
       • Extract: returns data at arbitrary offsets in the uncompressed file. Extract(0, 5) = { , , , , }
       • Count: returns the count of arbitrary strings in the uncompressed file. Count( ) = 7
       • Append( , , , , )
       • Range queries
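To make the semantics of slide 7 concrete, here is a reference sketch of Search, Count, and Extract on an uncompressed string. It describes what the operations return, not how Succinct computes them: the object name is hypothetical, the argument order of `extract` (offset, then length) follows the slide's Extract(0, 5) example, and counting overlapping occurrences is an assumption.

```scala
object FlatFileOps {
  // Search: offsets of every occurrence of `query` in `data`
  // (overlapping occurrences included, by assumption).
  def search(data: String, query: String): Vector[Int] =
    if (query.isEmpty) Vector.empty
    else (0 to data.length - query.length)
      .filter(i => data.startsWith(query, i))
      .toVector

  // Count: number of occurrences of `query` in `data`.
  def count(data: String, query: String): Int =
    search(data, query).length

  // Extract: `length` characters starting at `offset`,
  // clamped to the bounds of the input.
  def extract(data: String, offset: Int, length: Int): String = {
    val start = math.max(0, math.min(offset, data.length))
    val end = math.min(data.length, start + math.max(0, length))
    data.substring(start, end)
  }
}
```

A naive implementation like this scans the input on every call. The point of the talk is that the same answers can be produced from the compressed representation without scanning.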
  8. Many powerful abstractions on top
     Simplicity: unified "flat file" interface
       • Unstructured data
       • Key-value store [Dynamo, MICA]
       • Document store [ElasticSearch, MongoDB, CouchDB]
       • Tables [Cassandra, BigTable]
     Example queries: Search(Column1, ), Search( )
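One way to picture how a key-value store can sit on top of the flat-file interface from slide 8 is sketched below. This layout is an assumption made for illustration, not a description of Succinct's internals: values are concatenated with a newline delimiter (so values and queries are assumed not to contain newlines), a side array records where each value starts, and a flat-file search result is mapped back to the key whose value contains that offset. All names are hypothetical.

```scala
object FlatFileKV {
  private val Delimiter = '\n'

  // flat: all values concatenated, each followed by the delimiter.
  // keys(i) owns the value that starts at offset starts(i) in `flat`.
  final case class KVFile(flat: String, keys: Vector[Long], starts: Vector[Int])

  def build(records: Seq[(Long, String)]): KVFile = {
    // Running offsets: each value occupies its length plus one delimiter.
    val offsets = records.scanLeft(0) { case (off, (_, v)) => off + v.length + 1 }
    val flat = records.map { case (_, v) => v + Delimiter }.mkString
    KVFile(flat, records.map(_._1).toVector, offsets.init.toVector)
  }

  // Flat-file search: offsets of every occurrence of `query` (naive scan).
  private def flatSearch(flat: String, query: String): Vector[Int] =
    if (query.isEmpty) Vector.empty
    else (0 to flat.length - query.length)
      .filter(i => flat.startsWith(query, i))
      .toVector

  // get: slice the value for `key` out of the flat file.
  def get(file: KVFile, key: Long): Option[String] = {
    val i = file.keys.indexOf(key)
    if (i < 0) None
    else {
      val end =
        if (i + 1 < file.starts.length) file.starts(i + 1) - 1
        else file.flat.length - 1
      Some(file.flat.substring(file.starts(i), end))
    }
  }

  // search: keys whose values contain `query`, found by mapping each
  // flat-file offset to the record that starts at or before it.
  def search(file: KVFile, query: String): Vector[Long] =
    flatSearch(file.flat, query)
      .map(off => file.keys(file.starts.lastIndexWhere(_ <= off)))
      .distinct
}
```

The store itself only adds bookkeeping (keys and start offsets); every query is answered by the same search and extract operations the flat file already provides, which is the sense in which the interface is "unified".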
  9. Succinct Spark: if you are already using Spark
       • New functionalities: document store, key-value store; search on documents, values
       • Faster operations into RDDs: random access, filters; avoid scans
       • More in-memory: compressed RDDs; no decompression overheads
  10. SuccinctRDD (for unstructured data)
      import edu.berkeley.cs.succinct._                  // Import classes
      val rdd = ctx.textFile(...).map(_.getBytes)        // Create an RDD
      val succinctRDD = rdd.succinct                     // Compress using Succinct
      val bytes = succinctRDD.extract(50, 100)           // Extract 100 bytes from offset 50
      val count = succinctRDD.count("Berkeley")          // Count #occurrences of "Berkeley"
      val offsets = succinctRDD.search("Berkeley")       // Find all occurrences of "Berkeley"
  11. SuccinctKVRDD
      import edu.berkeley.cs.succinct.kv._                               // Import classes
      val kvRDD = rdd.zipWithIndex.map(t => (t._2, t._1.getBytes))       // Load data
      val succinctKVRDD = kvRDD.succinctKV                               // Compress using Succinct
      val value = succinctKVRDD.get(0)                                   // Get value for key 0
      val valueData = succinctKVRDD.extract(0, 50, 100)                  // Extract 100 bytes at offset 50 in the value for key 0
      val keys = succinctKVRDD.search("Berkeley")                        // Find all keys for values that contain "Berkeley"
  12. Evaluation
      Datasets: Wikipedia dataset, ~40GB data
      Cluster: Amazon EC2, 5 machines, 30GB
      Workload: search queries; 1-10000 occurrences
      Systems: Spark, ElasticSearch
      Caveat: absolute numbers are dataset dependent
  13. Succinct Spark (search)
      Take-away: Succinct Spark is 2.75x faster than ElasticSearch while being 2.5x more space efficient (data fits in memory for all systems).
  14. What is next?
        • New functionalities: support for regular expressions
        • New data types: support for JSON
        • New abstractions: support for batch updates
      Done
  15. Support for regular expressions
      Applications: data cleaning, information extraction, bioinformatics, document stores
      Operators: OR, AND, Wildcard, Repeat
      Example: .* (Berkeley | Stanford).edu
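For readers less familiar with the operators on slide 15, the sketch below shows them in standard regex syntax using `scala.util.matching.Regex` on uncompressed text. It is a baseline for the expected matching behaviour, not Succinct's regex engine. The mapping of the slide's operator names to syntax is an assumption (OR as `|`, AND as concatenation, Wildcard as `.`, Repeat as `+` or `*`), the spaces in the slide's example are treated as layout, and the dot before `edu` is escaped here so that it matches a literal period.

```scala
import scala.util.matching.Regex

object RegexOperators {
  // Start offsets of all non-overlapping matches of `pattern` in `text`.
  def matchOffsets(pattern: Regex, text: String): Vector[Int] =
    pattern.findAllMatchIn(text).map(_.start).toVector

  // OR combined with concatenation: either university name, then ".edu".
  val orPattern: Regex = "(Berkeley|Stanford)\\.edu".r

  // Wildcard: '.' stands for any single character.
  val wildcardPattern: Regex = "S.an".r

  // Repeat: one or more consecutive 'e' characters.
  val repeatPattern: Regex = "e+".r
}
```

For example, `RegexOperators.matchOffsets(RegexOperators.orPattern, text)` returns the offsets at which either domain name begins in `text`.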
  16. RegEx performance
      Take-away: Succinct significantly speeds up RegEx queries, even when all the data fits in memory for all systems.
  17. SuccinctRDD (for unstructured data)
      SuccinctRDD:
        val matches = succinctRDD.regexSearch("William.*Clinton")          // Find all matches for the RegEx "William.*Clinton"
      SuccinctKVRDD:
        val matchKeys = succinctKVRDD.regexSearch("William.*Clinton")      // Find all keys for values that contain matches for the RegEx "William.*Clinton"
  18. Support for JSON
      val jsonDoc = succinctJsonRDD.get(0)                       // Get JSON document with id 0
      val ids1 = succinctJsonRDD.filter("city", "Berkeley")      // Filter JSON documents where the "city" attribute has value "Berkeley"
      val ids2 = succinctJsonRDD.search("AMPLab")                // Search for JSON documents containing "AMPLab"
  19. What is really next? A lot of exciting projects
        • Improvements: preprocessing, better support for scans; 2GB/hr/core, 13MBps/thread
        • Integration with Spark ecosystem: Dataframes, applications
        • Testing and benchmarking: regular expressions, JSON, document stores. Help us with workloads
        • New functionalities: support for encryption, memory-latency tradeoff
        • And a few surprises
  20. AND MANY MORE! succinct.cs.berkeley.edu
