Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Interactive Queries on Compressed RDDs
Succinct Spark
Rachit Agarwal
AMPLab
ragarwal@berkeley.edu
Twitter: @_ragarwal_

No secondary indexes, no data scans,
no data decompression
A distributed compressed data store
Succinct
Point queries
• search
• random access
• range queries
• regular expressions
Unified Interface
• Unstructured data
• Key-value store
• Document store
• Tables

Interactive point queries
Random access
Search
Range Queries
Regular Expressions
Aggregate queries
Updates
Graph queries

0, 10, 14, 16, 19, 26, 29
1, 4, 5, 8, 20, 22, 24
2, 15, 17, 27
3, 6, 7, 9, 12, 13, 18, 23 ..
11, 21
Data scans: low storage, high latency
Indexes: high storage, low latency
Existing systems, e.g., search( )
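The trade-off on this slide can be made concrete with a small, Spark-free sketch. Everything here is illustrative: `ScanVsIndex`, `scanSearch`, `buildIndex`, and `indexSearch` are hypothetical names, and the token index is only one simple kind of secondary index, not the design of any particular system.

```scala
object ScanVsIndex {
  // Data scan: no extra storage, but every query reads the whole input.
  def scanSearch(data: String, query: String): Seq[Int] =
    if (query.isEmpty) Seq.empty
    else Iterator
      .iterate(data.indexOf(query))(prev => data.indexOf(query, prev + 1))
      .takeWhile(_ >= 0)
      .toVector

  // Index: maps each whitespace-separated token to its start offsets.
  // Lookups avoid the scan, but the index must be stored next to the data.
  def buildIndex(data: String): Map[String, Vector[Int]] =
    "\\S+".r.findAllMatchIn(data).toVector
      .groupBy(_.matched)
      .map { case (token, matches) => token -> matches.map(_.start) }

  // Index lookup: answers only whole-token queries.
  def indexSearch(index: Map[String, Vector[Int]], token: String): Seq[Int] =
    index.getOrElse(token, Vector.empty)
}
```

The scan answers arbitrary substring queries at the cost of reading all the data; the token index answers quickly but only for whole tokens, and it adds storage on top of the input.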

[Figure: query latency vs. input size for data scans and indexes. Annotations: scans in faster storage; scans in slower storage; indexes in faster storage; indexes in slower storage; executing queries off slower storage.]
Existing systems “at scale” (qualitatively)

Succinct
Low storage
Low Latency
Queries executed directly on the compressed representation
What makes Succinct unique
No additional indexes: query responses embedded within the compressed representation
No data scans: functionality of indexes
No decompression: queries directly on the compressed representation (except for data access queries)
Succinct

[Figure: query latency vs. input size for data scans, indexes, and Succinct. Annotations: avoiding data scans; avoiding queries off slower storage.]
Succinct tradeoffs

Original Input
Extract: returns data at arbitrary offsets in uncompressed file
Count: returns count of arbitrary strings in uncompressed file
Succinct
Search( ) = {0, 10, 14, 16, 19, 26, 29}
Extract(0, 5) = { , , , , }
Count( ) = 7
Search: returns offsets of arbitrary strings in uncompressed file
Input: flat (unstructured) files
Append( , , , , )
Range queries
Succinct Data model and Functionality
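To pin down what these three queries return, here is a minimal sketch that answers them with binary search over a plain, uncompressed suffix array instead of a scan. The suffix array is a stand-in chosen for illustration: it is not Succinct's compressed representation and uses far more memory than the input. `SuffixArrayFile` and its members are hypothetical names, and characters are used in place of bytes for readability.

```scala
final class SuffixArrayFile(data: String) {
  // Start offsets of all suffixes, sorted by suffix text (naive build, fine for small inputs).
  private val suffixes: Array[Int] =
    data.indices.toArray.sortWith((a, b) => data.substring(a).compareTo(data.substring(b)) < 0)

  // Binary search for the first suffix whose prefix (truncated to the query length)
  // is >= query, or > query when `strict` is set.
  private def bound(query: String, strict: Boolean): Int = {
    var lo = 0
    var hi = suffixes.length
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      val start = suffixes(mid)
      val prefix = data.substring(start, math.min(data.length, start + query.length))
      val cmp = prefix.compareTo(query)
      if (cmp < 0 || (strict && cmp == 0)) lo = mid + 1 else hi = mid
    }
    lo
  }

  // Search: offsets of every occurrence of `query` in the original input.
  def search(query: String): Seq[Int] =
    suffixes.slice(bound(query, strict = false), bound(query, strict = true)).sorted.toVector

  // Count: number of occurrences, computed without materializing the offsets.
  def count(query: String): Int =
    bound(query, strict = true) - bound(query, strict = false)

  // Extract: `length` characters starting at `offset` in the original input.
  def extract(offset: Int, length: Int): String =
    data.slice(offset, offset + length)
}
```

For example, on the input `"berkeley lab at berkeley"`, `search("berkeley")` yields offsets 0 and 16, and `extract(9, 3)` yields `"lab"`.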

Supported, but traded-off in favor of
point queries on compressed data
• Preprocessing time
• CPU (data access)
• Sequential scan throughput
• “In-place” updates
What do
we lose?
Succinct tradeoffs

With all the powerful queries on
values, documents, columns
• Unstructured data
• Key-value stores (Voldemort, Dynamo)
• Document store (Elasticsearch, MongoDB)
• Tables (Cassandra, BigTable)
• And many more ….
Unified
Interface
Succinct Data Model: Flat File Interface

Search(Column1, )Search( )
Succinct Flat File Interface: Unification
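One way to picture how a key-value or document store can sit on top of a flat file interface: concatenate the values into a single flat string, remember where each value starts, and translate flat-file search offsets back into keys. This is only a sketch under that assumption; the slides do not describe the actual layout, and `FlatFileKV`, `keyAt`, and the newline delimiter are hypothetical.

```scala
final class FlatFileKV(records: Seq[(Long, String)], delimiter: Char = '\n') {
  private val keys: Vector[Long] = records.map(_._1).toVector
  private val lengths: Vector[Int] = records.map(_._2.length).toVector
  // All values concatenated into one flat "file", each followed by the delimiter.
  private val flat: String = records.map(_._2 + delimiter).mkString
  // Start offset of each value inside `flat`.
  private val starts: Vector[Int] = lengths.map(_ + 1).scanLeft(0)(_ + _).init
  private val position: Map[Long, Int] = keys.zipWithIndex.toMap

  // Key-value get: a random access into the flat file.
  def get(key: Long): Option[String] =
    position.get(key).map(i => flat.substring(starts(i), starts(i) + lengths(i)))

  // Record that contains a flat-file offset (linear here; a binary search would also work).
  private def keyAt(offset: Int): Long = keys(starts.lastIndexWhere(_ <= offset))

  // Search on values: a flat-file search whose offsets are mapped back to keys.
  def search(query: String): Seq[Long] = {
    require(query.nonEmpty && !query.contains(delimiter), "query must be non-empty and delimiter-free")
    Iterator
      .iterate(flat.indexOf(query))(prev => flat.indexOf(query, prev + 1))
      .takeWhile(_ >= 0)
      .map(keyAt)
      .toVector
      .distinct
  }
}
```

The same idea extends to tables by treating each column as its own set of values, which is roughly what a call like `Search(Column1, …)` suggests.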

Where are we?
• Succinct
• Succinct Spark
Where are we going?
• Industry collaboration
• Succinct++
A distributed compressed data store
Succinct

• System (prototyped & tested)
• As a library
• C++, Java, Scala
• for ease of integration
• All functionalities supported
Succinct
Succinct: Where are we?

• A Spark package
• Enables new functionalities
• Document stores
• Point queries
• Faster filters
• Compressed RDDs: More in-memory
• DataFrames API not yet mature
Queries on
compressed
RDDs
Succinct Spark
Succinct: Where are we?

If you are already using Spark
New functionalities: Document store, Key-Value store; search on documents, values
Faster operations into RDDs: random access, filters; avoid scans
More in-memory: Compressed RDDs; no decompression overheads
Succinct Spark

// Import classes
import edu.berkeley.cs.succinct._

// Create an RDD
val rdd = ctx.textFile(...).map(_.getBytes)
// Compress using Succinct
val succinctRDD = rdd.succinct
// Extract 100 bytes from offset 50
val bytes = succinctRDD.extract(50, 100)
// Count #occurrences of “Berkeley”
val count = succinctRDD.count("Berkeley")
// Find all occurrences of “Berkeley”
val offsets = succinctRDD.search("Berkeley")
Succinct Spark: SuccinctRDD (unstructured data)

// Import classes
import edu.berkeley.cs.succinct.kv._

// Load data
val kvRDD = rdd.zipWithIndex.map(t => (t._2, t._1.getBytes))
// Compress using Succinct
val succinctKVRDD = kvRDD.succinctKV
// Get value for key 0
val value = succinctKVRDD.get(0)
// Extract 100 bytes at offset 50 in the value for key 0
val valueData = succinctKVRDD.extract(0, 50, 100)
// Find all keys for values that contain “Berkeley”
val keys = succinctKVRDD.search("Berkeley")
Succinct Spark: SuccinctKVRDD (document store)

• 5x Amazon EC2 servers, 30GB RAM each
• Wikipedia dataset, 40GB
• Spark, Elasticsearch
• search queries
• #occurrences 1-10k
Succinct
Evaluation

Take-away: Succinct Spark is 2.75x faster than Elasticsearch while being
2.5x more space efficient
(data fits in memory for all systems)
Succinct Spark Evaluation (search latency)

Succinct Spark now supports Regular Expressions!
SuccinctRDD
// Find all matches for the RegEx “William.*Clinton”
val matches = succinctRDD.regexSearch("William.*Clinton")
SuccinctKVRDD
// Find all keys for values that contain matches for the RegEx “William.*Clinton”
val matchKeys = succinctKVRDD.regexSearch("William.*Clinton")
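As a point of reference for what these two calls are described as returning, here is a plain-Scala sketch over uncompressed data using `scala.util.matching.Regex`. The `(offset, length)` result shape and the names `RegexReference` and `regexSearchKeys` are assumptions for illustration; the slides do not give the actual return types, and Succinct's matching semantics may differ from `java.util.regex`.

```scala
object RegexReference {
  // Flat-file view: (offset, length) of every non-overlapping match in the input.
  def regexSearch(data: String, pattern: String): Seq[(Int, Int)] =
    pattern.r.findAllMatchIn(data).map(m => (m.start, m.end - m.start)).toVector

  // Key-value view: keys whose value contains at least one match.
  def regexSearchKeys(records: Seq[(Long, String)], pattern: String): Seq[Long] = {
    val regex = pattern.r
    records.collect { case (key, value) if regex.findFirstIn(value).isDefined => key }
  }
}
```

Note that `.*` is greedy here and does not cross line breaks, so a single match can span from the first “William” to the last “Clinton” on a line.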

Take-away: Succinct significantly speeds up RegEx queries even when
all the data fits in memory for all systems
Succinct Spark Evaluation (RegEx latency)

// Get JSON document with id 0
val jsonDoc = succinctJsonRDD.get(0)
// Filter JSON documents where “city = Berkeley”
val ids1 = succinctJsonRDD.filter("city", "Berkeley")
// Search for JSON documents containing “AMPLab”
val ids2 = succinctJsonRDD.search("AMPLab")
Succinct Spark now supports JSON documents!

• More testing, benchmarking
• Succinct Spark Dataframes
• New functionalities
Where are
we going?

Queries on compressed and encrypted data
• BlowFish
• Succinct Encryption
• Succinct Graphs
New functionalities
Queries on compressed graphs
[Figure: storage vs. query latency, comparing Succinct, BlowFish, and indexes.]

AND MANY MORE!
succinct.cs.berkeley.edu
