
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal


Spark Summit East talk

Published in: Data & Analytics


  1. 1. Succinct Spark: Interactive Queries on Compressed RDD. Rachit Agarwal, AMPLab, ragarwal@berkeley.edu, Twitter: @_ragarwal_
  2. 2. No secondary indexes, no data scans, no data decompression A distributed compressed data store Succinct Point queries • search • random access • range queries • regular expressions Unified Interface • Unstructured data • Key-value store • Document store • Tables
  3. 3. Interactive point queries Random access Search Range Queries Regular Expressions Aggregate queries Updates Graph queries
  4. 4. Existing systems, e.g., Search( ) • Data scans: low storage, high latency • Indexes: high storage, low latency [Figure: example index entries 0, 10, 14, 16, 19, 26, 29; 1, 4, 5, 8, 20, 22, 24; 2, 15, 17, 27; 3, 6, 7, 9, 12, 13, 18, 23; ..; 11, 21]
  5. 5. Existing systems “at scale” (qualitatively) [Plot: query latency vs. input size for data scans and indexes; regions: scans in faster storage, scans in slower storage, indexes in faster storage, indexes in slower storage; annotation: executing queries off slower storage]
  6. 6. What makes Succinct unique. Succinct: low storage, low latency; queries executed directly on the compressed representation • No additional indexes: query responses embedded within the compressed representation • No data scans: functionality of indexes • No decompression: queries directly on the compressed representation (except for data access queries)
  7. 7. Succinct tradeoffs [Plot: query latency vs. input size for data scans, indexes, and Succinct; annotations: avoiding data scans, avoiding queries off slower storage]
  8. 8. Succinct Data model and Functionality. Input: flat (unstructured) files • Search: returns offsets of arbitrary strings in uncompressed file, e.g., Search( ) = {0, 10, 14, 16, 19, 26, 29} • Count: returns count of arbitrary strings in uncompressed file, e.g., Count( ) = 7 • Extract: returns data at arbitrary offsets in uncompressed file, e.g., Extract(0, 5) = { , , , , } • Append( , , , , ) • Range queries [Figure: original input and its Succinct representation]
  9. 9. Supported, but traded-off in favor of point queries on compressed data • Preprocessing time • CPU (data access) • Sequential scan throughput • “In-place” updates What do we lose? Succinct tradeoffs
  10. 10. No secondary indexes, no data scans, no data decompression A distributed compressed data store Succinct Point queries • search • random access • range queries • regular expressions Unified Interface • Unstructured data • Key-value store • Document store • Tables
  11. 11. With all the powerful queries on values, documents, columns • Unstructured data • Key-value stores (Voldemort, Dynamo) • Document store (Elasticsearch, MongoDB) • Tables (Cassandra, BigTable) • And many more …. Unified Interface Succinct Data Model: Flat File Interface
  12. 12. Search(Column1, )Search( ) Succinct Flat File Interface: Unification
  13. 13. Succinct: A distributed compressed data store. Where are we? • Succinct • Succinct Spark Where are we going? • Industry collaboration • Succinct++
  14. 14. • System (prototyped & tested) • As a library • C++, Java, Scala • for ease of integration • All functionalities supported Succinct Succinct: Where are we?
  15. 15. • A Spark package • Enables new functionalities • Document stores • Point queries • Faster filters • Compressed RDDs: More in-memory • Dataframes API not so mature Queries on compressed RDDs Succinct Spark Succinct: Where are we?
  16. 16. Succinct Spark: If you are already using Spark • New functionalities: document store, key-value store; search on documents, values • Faster operations into RDDs: random access, filters; avoid scans • More in-memory: compressed RDDs; no decompression overheads
  17. 17. Succinct Spark: SuccinctRDD (unstructured data)
      import edu.berkeley.cs.succinct._              // Import classes
      val rdd = ctx.textFile(...).map(_.getBytes)    // Create an RDD
      val succinctRDD = rdd.succinct                 // Compress using Succinct
      val bytes = succinctRDD.extract(50, 100)       // Extract 100 bytes from offset 50
      val count = succinctRDD.count("Berkeley")      // Count #occurrences of “Berkeley”
      val offsets = succinctRDD.search("Berkeley")   // Find all occurrences of “Berkeley”
  18. 18. Succinct Spark: SuccinctKVRDD (document store)
      import edu.berkeley.cs.succinct.kv._                           // Import classes
      val kvRDD = rdd.zipWithIndex.map(t => (t._2, t._1.getBytes))   // Load data
      val succinctKVRDD = kvRDD.succinctKV                           // Compress using Succinct
      val value = succinctKVRDD.get(0)                               // Get value for key 0
      val valueData = succinctKVRDD.extract(0, 50, 100)              // Extract 100 bytes at offset 50 in the value for key 0
      val keys = succinctKVRDD.search("Berkeley")                    // Find all keys for values that contain “Berkeley”
  19. 19. • 5x Amazon EC2 servers, 30GB RAM each • Wikipedia dataset, 40GB • Spark, Elasticsearch • search queries • #occurrences 1-10k Succinct Evaluation
  20. 20. Succinct Spark Evaluation (search latency). Take-away: Succinct Spark is 2.75x faster than Elasticsearch while being 2.5x more space efficient (data fits in memory for all systems)
  21. 21. Succinct Spark now supports Regular Expressions!
      // SuccinctRDD: find all matches for the RegEx “William.*Clinton”
      val matches = succinctRDD.regexSearch("William.*Clinton")
      // SuccinctKVRDD: find all keys for values that contain matches for the RegEx “William.*Clinton”
      val matchKeys = succinctKVRDD.regexSearch("William.*Clinton")
  22. 22. Succinct Spark Evaluation (RegEx latency). Take-away: Succinct significantly speeds up RegEx queries even when all the data fits in memory for all systems
  23. 23. Succinct Spark now supports JSON documents!
      val jsonDoc = succinctJsonRDD.get(0)                     // Get JSON document with id 0
      val ids1 = succinctJsonRDD.filter("city", "Berkeley")    // Filter JSON documents where “city = Berkeley”
      val ids2 = succinctJsonRDD.search("AMPLab")              // Search for JSON documents containing “AMPLab”
  24. 24. • More testing, benchmarking • Succinct Spark Dataframes • New functionalities Where are we going?
  25. 25. New functionalities • BlowFish • Succinct Encryption • Succinct Graphs. Annotations: queries on compressed and encrypted data; queries on compressed graphs [Plot: query latency vs. storage for Succinct, BlowFish, and indexes]
  26. 26. AND MANY MORE! succinct.cs.berkeley.edu
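
The Search, Count, and Extract operations that slide 8 defines over a flat file can be illustrated with a small sketch. This is not Succinct's implementation: it uses a plain, uncompressed suffix array, whereas Succinct answers the same queries directly on a compressed representation. The class name `FlatFileIndex` and all of its members are hypothetical, chosen only for this illustration.

```scala
// Illustrative only: a plain (uncompressed) suffix array, NOT Succinct's
// compressed representation. All names here are hypothetical.
final class FlatFileIndex(data: String) {
  // Start offsets of all suffixes, sorted by the suffix each one begins.
  // (Quadratic-space sort; acceptable for a small demonstration input.)
  private val suffixes: Array[Int] =
    data.indices.toArray.sortBy(i => data.substring(i))

  // Compare the suffix at `offset`, truncated to the query length, with the query.
  private def compareAt(offset: Int, query: String): Int =
    data.substring(offset, math.min(data.length, offset + query.length)).compareTo(query)

  // Binary search: first position in `suffixes` where `pred` becomes true.
  // `pred` must be false for a prefix of the array and true for the rest.
  private def firstWhere(pred: Int => Boolean): Int = {
    var lo = 0
    var hi = suffixes.length
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (pred(suffixes(mid))) hi = mid else lo = mid + 1
    }
    lo
  }

  // Half-open range of sorted suffixes that start with `query`.
  private def range(query: String): (Int, Int) =
    (firstWhere(compareAt(_, query) >= 0), firstWhere(compareAt(_, query) > 0))

  // Search: offsets of all occurrences of `query`, in increasing order.
  def search(query: String): Seq[Int] = {
    val (lo, hi) = range(query)
    suffixes.slice(lo, hi).sorted.toSeq
  }

  // Count: number of occurrences of `query`, without listing them.
  def count(query: String): Int = {
    val (lo, hi) = range(query)
    hi - lo
  }

  // Extract: `length` characters starting at `offset` (clamped at end of input).
  def extract(offset: Int, length: Int): String =
    data.substring(offset, math.min(data.length, offset + length))
}
```

For the input "abracadabra", `search("abra")` returns offsets 0 and 7, `count("a")` returns 5, and `extract(0, 5)` returns "abrac". No scan of the input is needed at query time, which is the index-like behaviour the slides describe; the storage cost of this naive sketch is what Succinct's compressed representation is meant to avoid.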
