Many queries in Spark workloads execute over unstructured or text-based data formats, such as JSON or CSV files. Unfortunately, parsing these formats into queryable DataFrames or Datasets is often the slowest stage of these workloads, especially for interactive, ad-hoc analytics. In many instances, this bottleneck can be eliminated by taking filters expressed in the high-level query (e.g., a SQL query in Spark SQL) and pushing them into the parsing stage, thus reducing the total number of records that need to be parsed.
In this talk, we present Sparser, a new parsing library in Spark for JSON, CSV, and Avro files. By aggressively filtering records before parsing them, Sparser achieves up to 9x end-to-end runtime improvement on several real-world Spark SQL workloads. Using Spark’s Data Source API, Sparser extracts the filtering expressions specified by a Spark SQL query; these expressions are then compiled into fast, SIMD-accelerated “pre-filters” which can discard data an order of magnitude faster than the JSON and CSV parsers currently available in Spark.
These pre-filters are approximate and may produce false positives; thus, Sparser intelligently selects the set of pre-filters that minimizes the overall parsing runtime for any given query. We show that, for Spark SQL queries with low selectivity (i.e., very selective filters), Sparser routinely outperforms the standard parsers in Spark by at least 3x. Sparser can be used as a drop-in replacement parser for any Spark SQL query; our code is open-source, and our Spark package will be made public soon.
4. Today: Spark Data Sources API
[Diagram: the Spark Core Engine pushes part of the query into the data source.]
5. Today: Spark Data Sources API
+ E.g., column pruning directly in Parquet data loader
- Little support for unstructured formats (e.g., can’t avoid JSON parsing)
[Diagram: the Spark Core Engine pushes part of the query through the Data Source API into the data source.]
6. Parsing: A Computational Bottleneck
Example: existing state-of-the-art JSON parsers are 100x slower than scanning a string!
[Chart: cycles per 5KB record: RapidJSON parsing 113,836; Mison index construction 56,796; string scan 360.]
*Similar results on binary formats like Avro and Parquet.
Parsing seems to be a necessary evil:
how do we get around doing it?
7. Key Opportunity: High Selectivity
High selectivity is especially common in exploratory analytics.
[Chart: CDF of query selectivity for Databricks and Censys workloads.]
40% of customer Spark queries at Databricks select < 20% of the data.
99% of queries in Censys select < 0.001% of the data.
The Data Source API provides access to the query’s filters at the data source!
8. How can we exploit high selectivity to accelerate parsing?
9. Sparser: Filter Before You Parse
Today: parse the full input → slow!
Sparser: filter before parsing, using fast filtering functions with false positives but no false negatives.
[Diagram: today’s pipeline parses every raw record; Sparser’s pipeline runs cheap filters over the raw data first and parses only the records that survive.]
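The filter-before-parse idea can be sketched in plain Python. This is a minimal model, not Sparser's C implementation: a hypothetical `raw_filter` helper stands in for the SIMD pre-filters, and `json.loads` stands in for the full parser.

```python
import json

def raw_filter(raw: bytes, needle: bytes) -> bool:
    """Cheap pre-filter over raw bytes: false positives are possible,
    false negatives are not (a true match always contains the needle)."""
    return needle in raw

def filter_then_parse(raw_records, needle, predicate):
    """Discard records with the cheap raw filter first, run the expensive
    parser only on survivors, then apply the exact predicate to remove
    any false positives."""
    for raw in raw_records:
        if not raw_filter(raw, needle):
            continue                    # discarded without parsing
        record = json.loads(raw)        # the expensive step
        if predicate(record):
            yield record

records = [
    b'{"text": "I just met Mr. Trump!!!"}',
    b'{"text": "nothing to see here"}',
    b'{"name": "Trumbull"}',            # passes the filter, fails the predicate
]
hits = list(filter_then_parse(records, b"Trum",
                              lambda r: "Trump" in r.get("text", "")))
```

Only the first record survives: the third passes the raw filter (a false positive) but is rejected by the exact predicate after parsing.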
14. Sparser Overview
Raw Filter (“RF”): filtering function on a bytestream with false
positives but no false negatives.
Use an optimizer to combine RFs into a cascade.
Example: Tweets WHERE text contains “Trump”
[Diagram: an RF scans the raw bytes (0x54 0x72 0x75 0x6d …) for the substring “TRUM”; a record containing “Trumb” passes, flows through the other RFs, and reaches the full parser, which checks: is the match in the “text” field? Is it the full word “Trump”?]
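A minimal sketch of RFs and a cascade, assuming simple substring filters; the function names are illustrative, not Sparser's C API:

```python
def substring_rf(needle: bytes):
    """Build an RF that passes any record whose raw bytes contain
    `needle` (false positives possible, no false negatives)."""
    return lambda raw: needle in raw

def run_cascade(rfs, raw: bytes) -> bool:
    """A record reaches the full parser only if every RF passes."""
    return all(rf(raw) for rf in rfs)

# For: Tweets WHERE text contains "Trump", cascade two cheap checks.
rfs = [substring_rf(b'"text"'), substring_rf(b"Trum")]

assert run_cascade(rfs, b'{"text": "Mr. Trumb says hi"}')   # false positive still passes
assert not run_cascade(rfs, b'{"name": "Trump"}')           # no "text" key: discarded
```

Cascading the `"text"` key check with the substring check discards more records cheaply than either filter alone.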
15. Key Challenges in Using RFs
1. How do we implement the RFs efficiently?
2. How do we efficiently choose the RFs to maximize parsing
throughput for a given query and dataset?
Rest of this Talk
• Sparser’s API
• Sparser’s Raw Filter Designs (Challenge 1)
• Sparser’s Optimizer: Choosing a Cascade (Challenge 2)
• Performance Results on various file formats (e.g., JSON, Avro)
18. Sparser API
Filter | Example
Exact String Match | WHERE user.name = “Trump”; WHERE likes = 5000
Contains String | WHERE text contains “Trum”
Contains Key | WHERE user.url != NULL
Conjunctions | WHERE user.name = “Trump” AND user.verified = true
Disjunctions | WHERE user.name = “Trump” OR user.name = “Obama”
Currently does not support numerical range-based predicates.
23. Raw Filters in Sparser
SIMD-based filtering functions that pass or discard a record by inspecting the raw bytestream.
[Chart: cycles per 5KB record: RapidJSON parsing 113,836; Mison index construction 56,796; raw filter 360.]
24. Example RF: Substring Search
Search for a small (e.g., 2-, 4-, or 8-byte) substring of a query predicate in parallel using SIMD.
On modern CPUs: compare 32 characters in parallel in ~4 cycles (2ns).
[Diagram: the length-4 substring “Trum” is packed repeatedly into a SIMD register and compared against the input (“text”: “I just met Mr. Trumb!!!”) at four shift offsets, covering every alignment.]
Example query: text contains “Trump”
False positives (found “Trumb” by accident), but no false negatives (no “Trum” ⇒ no “Trump”).
Other RFs also possible! Sparser selects them agnostic of implementation.
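A scalar Python sketch of the shift-and-compare pattern above; in the real implementation each inner comparison is a single SIMD vector compare over 32 bytes rather than a Python slice test:

```python
def simd_substring_sketch(data: bytes, needle: bytes) -> bool:
    """Scalar model of the SIMD substring RF: for a k-byte needle,
    k shifted passes of k-aligned window comparisons together cover
    every possible alignment of the needle in the input."""
    k = len(needle)
    for shift in range(k):                        # 4 shifts for a 4-byte needle
        # compare the needle against every k-aligned window at this shift
        for start in range(shift, len(data) - k + 1, k):
            if data[start:start + k] == needle:
                return True
    return False

assert simd_substring_sketch(b'"text": "I just met Mr. Trumb!!!"', b"Trum")
assert not simd_substring_sketch(b"no match here", b"Trum")
```

The union of the k shifted passes hits every start offset, which is why no alignment (and hence no true match) can be missed.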
25. Key-Value Search RF
Searches for the key and, if the key is found, searches for the value up to some stopping point. Both searches use SIMD.
Only applicable for exact matches.
Useful for queries with common substrings (e.g., favorited=true).
Key: name; Value: Trump; Delimiter: ,
Ex 1: “name”: “Trump”, (Pass)
Ex 2: “name”: “Actually, Trump” (Fail)
Ex 3: “name”: “My name is Trump” (Pass, false positive)
The second example would be a false negative if substring matches were allowed, which is why this RF applies only to exact-match predicates.
Other RFs also possible! Sparser selects them agnostic of implementation.
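The pass/fail behavior of the three examples can be sketched as follows; this is a simplified byte-level model (the real RF performs both searches with SIMD):

```python
def key_value_rf(raw: bytes, key: bytes, value: bytes, delim: bytes) -> bool:
    """Sketch of the key-value RF: locate `key`, then search for `value`
    only up to the next `delim` (the stopping point). Only safe for
    exact-match predicates: a `contains` predicate could be wrongly
    rejected when the delimiter appears before the value."""
    i = raw.find(key)
    if i < 0:
        return False                       # key absent: discard
    rest = raw[i + len(key):]
    stop = rest.find(delim)
    region = rest if stop < 0 else rest[:stop]
    return value in region

assert key_value_rf(b'"name": "Trump",', b"name", b"Trump", b",")              # Ex 1: pass
assert not key_value_rf(b'"name": "Actually, Trump"', b"name", b"Trump", b",") # Ex 2: fail
assert key_value_rf(b'"name": "My name is Trump"', b"name", b"Trump", b",")    # Ex 3: pass (false positive)
```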
27. Choosing RFs
To decrease false positive rate, combine RFs into a cascade
Sparser uses an optimizer to choose a cascade
[Diagram: several alternative cascades over the raw data: different choices and orderings of filters ahead of the parse step, compared against one another.]
28. Sparser’s Optimizer
Step 1: Compile possible RFs from the query predicates. For (name = "Trump" AND text contains "Putin"): RF1: "Trump", RF2: "Trum", RF3: "rump", RF4: "Tr", RF5: "Putin", …
Step 2: Measure parameters on a sample of records, recording a bitmap per RF over the sampled records (1 = passed, 0 = failed):
      S1 S2 S3
RF 1:  0  1  1
RF 2:  1  0  1
Step 3: Score the candidate cascades and choose one, e.g., C(RF1) = 4, C(RF1→RF2) = 1, C(RF2→RF3) = 6, C(RF1→RF3) = 9, …
Step 4: Apply the chosen cascade to the raw bytestream (e.g., 0010101000110101): records that fail any RF are discarded, and only the filtered bytes are sent to the full parser.
29. Sparser’s Optimizer: Configuring RF Cascades
Three high-level steps:
1. Convert a query predicate to a set of RFs.
2. Measure the passthrough rate and runtime of each RF, and the runtime of the parser, on a sample of records.
3. Minimize a cost function to find the min-cost cascade.
30. 1. Converting Predicates to RFs
Running Example
(name = “Trump” AND text contains “Putin”)
OR name = “Obama”
Possible RFs*
Substring Search: Trum, Tr, ru, …, Puti, utin, Pu, ut, …, Obam, Ob, …
* Translation from predicate to RF is format-dependent:
examples are for JSON data
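The translation step can be sketched as window enumeration over each predicate string; the lengths 2, 4, and 8 match the substring RF sizes above, and the function name is illustrative:

```python
def candidate_substrings(token: str, lengths=(2, 4, 8)):
    """Enumerate every 2-, 4-, and 8-byte window of a predicate string
    as a candidate substring RF (each one passes a superset of the
    records the full predicate would match)."""
    out = []
    for k in lengths:
        out += [token[i:i + k] for i in range(len(token) - k + 1)]
    return out

cands = candidate_substrings("Trump")   # Tr, ru, um, mp, Trum, rump
```

Shorter windows are cheaper but pass more false positives; the optimizer trades these off when it scores cascades.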
31. 2. Measure RFs and Parser
Evaluate RFs and parser on a sample of records.
*Passthrough rates of RFs are not independent! But it is too expensive to measure the rate of every cascade (combinatorial search space).
[Diagram: each candidate RF and the full parser are run on the sample to measure passthrough rates* and runtimes.]
32. 2. Measure RFs and Parser
Passthrough rates of RFs are not independent! But it is too expensive to measure the rate of each cascade (combinatorial search space).
Solution: Store rates as bitmaps.
Pr[a] ∝ 1s in Bitmap of RF a: 0011010101
Pr[b] ∝ 1s in Bitmap of RF b: 0001101101
Pr[a,b] ∝ 1s in Bitwise-& of a, b: 0001000101
foreach sampled record i:
    foreach RF j:
        BitMap[i][j] = 1 if RF j passes on i
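A sketch of the bitmap trick in Python, using integers as bitmaps; the sample records and RFs here are made up for illustration:

```python
# Measure RF passthrough on a sample and keep the results as bitmaps,
# so the joint (non-independent) rate of any RF combination comes from
# a cheap bitwise AND instead of re-running the cascade.
sample = [b'{"name":"Trump","text":"Putin"}',
          b'{"name":"Obama"}',
          b'{"name":"Trumbull"}',
          b'{"text":"nothing"}']
rfs = {"Trum": b"Trum", "utin": b"utin"}

bitmap = {name: 0 for name in rfs}
for i, record in enumerate(sample):
    for name, needle in rfs.items():
        if needle in record:
            bitmap[name] |= 1 << i      # bit i = 1 if this RF passes record i

n = len(sample)
pr_trum  = bin(bitmap["Trum"]).count("1") / n                   # Pr[a]   = 0.5
pr_joint = bin(bitmap["Trum"] & bitmap["utin"]).count("1") / n  # Pr[a,b] = 0.25
```

With R candidate RFs, one pass over the sample yields R bitmaps, and any cascade's joint rate is then a few AND and popcount operations.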
33. 3. Choosing Best Cascade
Running Example
(name = “Trump” AND text contains “Putin”)
OR name = “Obama”
[Diagram: candidate cascades drawn as binary trees (right branch = RF passes, left branch = RF fails; leaves: P = parse, D = discard). Cascades 1 and 2 are valid; cascade 3 is invalid because utin must consider Obam when it fails. The cost of a cascade is computed by traversing the tree from root to leaf, using the runtime and passthrough rate of each RF measured in the previous step.]
34. 3. Choosing Best Cascade
Cost of a cascade with RFs 0, 1, …, R:
C_R = Σ_{i ∈ R} Pr[execute_i] × c_i + Pr[execute_parse] × c_parse
(the summation term gives the probability and cost of executing RF i; the last term, the probability and cost of executing the full parser)
[Diagram: the first valid cascade from the previous slide: Trum at the root; on pass, utin (pass → Parse, fail → Obam); on fail, Obam (pass → Parse, fail → Discard).]
Pr[execute_Trum] = 1
Pr[execute_utin] = Pr[Trum]
Pr[execute_Obam] = Pr[¬Trum] + Pr[Trum, ¬utin]
Pr[execute_parse] = Pr[¬Trum, Obam] + Pr[Trum, utin] + Pr[Trum, ¬utin, Obam]
Choose the cascade with minimum cost: Sparser considers up to min(32, # ORs) RFs and cascades up to depth 4.
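The cost formula can be sketched as a traversal of a cascade tree, using the structure from the running example. The tree encoding and the passthrough rates below are illustrative, and for brevity this sketch assumes independent rates; Sparser instead measures joint rates from the sampled bitmaps.

```python
def cascade_cost(node, prob, c_rf, c_parse, p=1.0):
    """Walk the cascade tree from the root, accumulating
    Pr[execute_i] * c_i at each RF node and Pr[execute_parse] * c_parse
    at parse leaves. `p` is the probability of reaching `node`.
    Assumed node encoding: ("rf", name, pass_child, fail_child),
    "parse", or "discard"."""
    if node == "discard":
        return 0.0
    if node == "parse":
        return p * c_parse
    _, name, pass_child, fail_child = node
    pr = prob[name]                     # passthrough rate of this RF
    return (p * c_rf
            + cascade_cost(pass_child, prob, c_rf, c_parse, p * pr)
            + cascade_cost(fail_child, prob, c_rf, c_parse, p * (1 - pr)))

# The valid cascade from the slides: Trum -> (pass: utin, fail: Obam).
tree = ("rf", "Trum",
        ("rf", "utin", "parse", ("rf", "Obam", "parse", "discard")),
        ("rf", "Obam", "parse", "discard"))
cost = cascade_cost(tree, {"Trum": 0.1, "utin": 0.5, "Obam": 0.05},
                    c_rf=360, c_parse=113836)
```

With these illustrative rates the cascade is far cheaper than always running the full parser (113,836 cycles per record); the optimizer picks the tree minimizing this cost.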
36. Sparser in Spark SQL
[Diagram: the Sparser Filtering Engine plugs in beneath the Spark Core Engine via the Data Source API.]
37. Sparser in Spark SQL
[Diagram: the SQL query is compiled to candidate raw filters; a native HDFS file reader feeds raw data to the Sparser optimizer + filtering engine, which returns a filtered DataFrame to the Spark Core Engine through the Data Source API.]
39. Evaluation Setup
Datasets
Censys - Internet port scan data used in network security
Tweets - collected using Twitter Stream API
Distributed experiments: 20-node cluster, each node with 4 Broadwell vCPUs, 26GB of memory, and locally attached SSDs (Google Compute Engine).
Single-node experiments: Intel Xeon E5-2690 v4 with 512GB of memory.
40. Queries
Twitter queries are drawn from other academic work; Censys queries are sampled randomly from the top 3,000 queries; PCAP and Bro queries come from online tutorials and blogs.
41. Results: Accelerating End-to-End Spark Jobs
[Chart: Censys queries Q1–Q9 on 652GB of JSON data, runtime in seconds for Spark, Spark + Sparser, and Query Only (plus disk baseline): up to 4x speedup by using Sparser.]
[Chart: Twitter queries Q1–Q4 on 68GB of JSON data, same configurations: up to 9x speedup by using Sparser.]
42. Results: Accelerating Existing JSON Parsers
[Chart: runtime (seconds, log scale) of Censys queries 1–9 for Sparser + RapidJSON, Sparser + Mison, RapidJSON, and Mison.]
Censys queries compared against two state-of-the-art parsers (Mison is SIMD-based): Sparser accelerates them by up to 22x.
43. Results: Accelerating Binary Format Parsing
[Charts: runtime in seconds for Q1–Q4: avro-c vs. Sparser + avro-c (above), and parquet-cpp vs. Sparser + parquet-cpp (below).]
Sparser accelerates Avro-based (above) and Parquet-based (below) queries by up to 5x and 4.3x, respectively.
45. Results: Sparser’s Sensitivity to Selectivity
[Chart: runtime (seconds) vs. selectivity for RapidJSON, Sparser + RapidJSON, Mison, and Sparser + Mison.]
Sparser’s performance degrades gracefully as the selectivity of the query decreases.
46. Results: Nuts and Bolts
[Chart: runtime (seconds, log scale) of every candidate cascade, with the one picked by Sparser highlighted.]
Sparser picks a cascade within 5% of the optimal one using its sampling-based measurement and optimizer.
[Chart: parse throughput (MB/s, log scale) vs. file offset (100s of MB), with and without resampling.]
Resampling to recalibrate the cascade can improve end-to-end parsing time by up to 20x.
47. Conclusion
Sparser:
• Uses raw filters to filter before parsing
• Selects a cascade of raw filters using an efficient optimizer
• Delivers up to 22x speedups over existing parsers
• Will be open source soon!
shoumik@cs.stanford.edu
fabuzaid@cs.stanford.edu
Questions or Comments? Contact Us!