Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Zain Asgar & Seshadri Mahalingam of Trifacta

Domain Specific Languages for Data Wrangling
Zain Asgar & Seshadri Mahalingam

Data Analysis: Where does the time go?
Storage & Processing
Analytics

The Data Wrangling Problem
Sources: Business System Data, Machine Generated Data, Log Data, ...
DATA WRANGLING
Uses: Data Visualization, Fraud Detection, Recommendations, ...

Isn’t this just a programming problem?
Excerpt from an apache log parser:
...
Source: https://github.com/rory/apache-log-parser

• Tend to be one-offs
• Tied to a specific platform
• Collaboration across IT and LOB difficult
• Burden of specificity
Data ↔ Script ↔ Implementation

Redefining scripts
What’s good about scripts?
• Easy to experiment with & iterate
• Can capture a trail
Capture the good, constrain the bad

Domain Specific Languages
• High-level notation for domain semantics
• Separate the “what” from the “how”
• Switch out implementations
Objectives*:
• Productivity
• Portability
• Performance
* http://web.stanford.edu/class/cs442/lectures_unrestricted/cs442-dsldesign.pdf

Wrangle, a DSL for Data Transformation
• Operators
• Parameters, Expressions
countpattern col: text on: `{digit}+`
derive value: mean(price) group: make, model
keep row: (make == 'honda') && (price > 50000)
merge col: make, model with: ','
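The operator-plus-named-parameters shape of these statements can be sketched in Python. This is a minimal illustration, not Trifacta's parser: the function name `parse_statement` and the splitting rule (a word immediately followed by a colon starts a parameter) are assumptions, and the rule would misfire on values that themselves contain `word:`, such as URLs.

```python
import re

# Assumption: a parameter starts at a word immediately followed by a colon.
PARAM_NAME = re.compile(r"(\w+):\s*")


def parse_statement(statement):
    """Split a Wrangle-style statement into (operator, {parameter: value})."""
    operator, _, rest = statement.strip().partition(" ")
    # re.split with one capture group yields ['', name1, value1, name2, value2, ...]
    parts = PARAM_NAME.split(rest)
    names = parts[1::2]
    values = [value.strip() for value in parts[2::2]]
    return operator, dict(zip(names, values))


if __name__ == "__main__":
    print(parse_statement("derive value: mean(price) group: make, model"))
    print(parse_statement("keep row: (make == 'honda') && (price > 50000)"))
```

Keeping the operator separate from its parameters is what lets a later stage map each operator onto a different implementation without re-reading the statement text.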

Text Matching Constructs
{delim} -> [:,\s|/\-.]
{delim-ws} -> (?:\s+[:,|/\-.]\s*|\s*[:,|/\-.]\s+)
{number} -> (?:[-+]?(?:[0-9]+|[0-9]{1,3}(?:,[0-9]{3})*)(?:\.[0-9]+)? ...
{hex} -> [a-fA-F0-9]*(?:[a-fA-F][0-9]|[0-9][a-fA-F])[a-fA-F0-9]*
{ip-address} -> (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3} ...
{email} -> [a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)* ...
{url} -> ^(?:(https?|ftp)://)?(?:(?:([a-zA-Z0-9-._?,/+&%#!=~]+):([a-zA- ...
{time} -> (?:0?[0-9]|1[0-9]|2[0-3]):(?:0?[0-9]|[1-5][0-9])(?:\.[0-9]{3}|...
{bool} -> (?:[tfTFyYnN01]|True|False| …
extract col: text after: `{delim}` on: `{ip-address}`
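One way to picture how named constructs such as `{digit}` could turn into ordinary regular expressions is macro expansion. The sketch below is an assumption about the mechanism, not the Wrangle implementation; the `MACROS` table is hypothetical and deliberately simplified, since the definitions on the slide are longer and partly truncated.

```python
import re

# Hypothetical, simplified macro table; not the definitions from the slide.
MACROS = {
    "digit": r"[0-9]",
    "delim": r"[:,\s|/.-]",
}

# A macro reference is {name}; numeric quantifiers like {2} do not match.
MACRO_REF = re.compile(r"\{([a-z][a-z-]*)\}")


def expand(pattern):
    """Replace each {name} with its regex, wrapped in a non-capturing group."""
    return MACRO_REF.sub(lambda m: "(?:" + MACROS[m.group(1)] + ")", pattern)


def countpattern(text, pattern):
    """Count non-overlapping matches of an expanded pattern in text."""
    return len(re.findall(expand(pattern), text))


if __name__ == "__main__":
    print(expand("{digit}{2}"))
    print(countpattern("room 12, floor 345", "{digit}+"))
```

Wrapping each expansion in a non-capturing group keeps a following quantifier, as in `{digit}+` or `{digit}{2}`, applied to the whole construct.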

Easier to programmatically synthesize
• Search space is prunable:
➔ Space of available verbs is finite
➔ Expression search space is large, but easier to prune with context
• Possible to build high-level interfaces
• Combination of user actions and data help infer DSL statements
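A toy version of pruning with context: given values a user has highlighted, keep only the candidate patterns that match every example and emit the corresponding statements. This is a sketch under stated assumptions, not how Trifacta infers statements; the candidate list (including the `{alpha}` and `{alnum}` names) and the function `suggest_extracts` are hypothetical.

```python
import re

# Hypothetical candidate patterns, ordered from most to least specific.
CANDIDATES = [
    ("{digit}+", r"[0-9]+"),
    ("{alpha}+", r"[A-Za-z]+"),
    ("{alnum}+", r"[0-9A-Za-z]+"),
]


def suggest_extracts(column, examples):
    """Return `extract` statements whose pattern fully matches every example."""
    return [
        "extract col: {0} on: `{1}`".format(column, name)
        for name, regex in CANDIDATES
        if all(re.fullmatch(regex, example) for example in examples)
    ]


if __name__ == "__main__":
    for statement in suggest_extracts("text", ["42", "2015"]):
        print(statement)
```

Because the verb set and the pattern vocabulary are both finite, each extra example can only shrink the candidate list, which is what makes the search tractable.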

DSL “Gotchas”
• Restrictive: only captures what language designers included
➔ Language designers need domain expertise
➔ ...and system expertise
• Requires scaffolding and infrastructure
• It’s still code

Alternative: a data-centric view
The Code:

from collections import defaultdict
import fileinput
import re

# The username is the text between '<' and '>' (capture group 2).
EXTRACT_USERNAME_REGEX = re.compile(r'(<)((?:(?!<).)*?)(>)',
                                    re.DOTALL | re.UNICODE)
# The hour is the first two digits of the time field (capture group 2).
HOUR_EXTRACTION_REGEX = re.compile(r'()([0-9]{2})()', re.DOTALL | re.UNICODE)


def main():
    username_hour_counts = defaultdict(int)
    for line in fileinput.input():
        # Split into at most four fields; the fourth keeps the rest of the line.
        after_split = line.split(' ', 3)
        matches = re.match(EXTRACT_USERNAME_REGEX, after_split[3])
        username = matches.group(2)
        hour_matches = re.match(HOUR_EXTRACTION_REGEX, after_split[2])
        hour = hour_matches.group(2)
        username_hour_counts[(username, hour)] += 1
    # Emit one CSV row per (username, hour) pair.
    for (username, hour), count in username_hour_counts.items():
        print('{0},{1},{2}'.format(username, hour, count))


if __name__ == '__main__':
    main()

register /Users/seshadri/trifacta/pig-udfs/build/libs/pig-udfs-bundle.jar;
register /Users/seshadri/trifacta/services/python-udf-service/PyUDFService/udfs/libs/udfs.jar;
original_93_1 = LOAD 'hdfs://hdfs-namenode.dockerdomain:8020/trifacta/uploads/1/38d52b37-e7bc-49c8-85d4-3f5993b9b0d3/2009-09-01.txt'
    USING TrifactaStorage('\n', '--maxRecordLength 1048576')
    AS column1:chararray;
DEFINE SplitUDF1 SplitUDF('()( )()', '3', 'false', '', '2');
cleaned_table = FOREACH original_93_1 GENERATE
    flatten(SplitUDF1($0)) AS (column2:chararray, column3:chararray, column4:chararray, column5:chararray);
DEFINE ExtractUDF1 ExtractUDF('(<)((?:(?!<).)*?)(>)', '1', 'false', '', '2');
cleaned_table_1 = FOREACH cleaned_table GENERATE
    $0 AS column2:chararray, $1 AS column3:chararray, $2 AS column4:chararray, $3 AS column5:chararray,
    flatten(ExtractUDF1($3)) AS column1:chararray;
DEFINE ExtractUDF2 ExtractUDF('()([0-9]{2})()', '1', 'false', '', '2');
cleaned_table_2 = FOREACH cleaned_table_1 GENERATE
    $0 AS column2:chararray, $1 AS column3:chararray, $2 AS column4:chararray,
    flatten(ExtractUDF2($2)) AS column6:chararray, $3 AS column5:chararray, $4 AS column1:chararray;
cleaned_table_3_1 = GROUP cleaned_table_2 BY ($3, $5);
DEFINE VectorizedAggregateUDF1
    VectorizedAggregateUDF('[{"count":1,"agg":"COUNT_STAR","names":["row_count"],"args":[]}]');
cleaned_table_3_1 = FOREACH cleaned_table_3_1 GENERATE
    flatten(group) AS (column6:chararray, column1:chararray),
    flatten(VectorizedAggregateUDF1(cleaned_table_2));
DEFINE TypeCastUDF1
    TypeCastUDF('{"testCase":{"regexes":["^-?\d{1,16}$"]},"name":"Long","fullPath":["Integer"],"stripWhitespace":false}');
DEFINE TypeCastUDF2
    TypeCastUDF('{"testCase":{},"name":"String","fullPath":["String"],"stripWhitespace":false}');
cleaned_table_3 = FOREACH cleaned_table_3_1 GENERATE
    TypeCastUDF1($0) AS column6:long, TypeCastUDF2($1) AS column1:chararray, TypeCastUDF1($2) AS row_count:long;
STORE cleaned_table_3 INTO 'hdfs://hdfs-namenode.dockerdomain:8020/trifacta/queryResults/seshadri/Web_Chat_Log/36/cleaned_table_3.json'
    USING TrifactaJsonStorage();

splitrows col: column1 on: '\n'
split col: column1 on: ' ' limit: 3
extract col: column5 after: `<` before: `>`
extract col: column4 on: `{digit}{2}`
aggregate value: count() group: column6,column1

Performance & Portability from a DSL
• Build compilers to exploit separation of intent from implementation
• Target-independent representation
Wrangle DSL → Abstract Machine Model → Targets (multiple)
Intent → Implementation

Abstract Machine Model
• Traditional compilers have a “generalized processor” worldview
• We need a data-parallel system (as opposed to a control-flow system)
➔ Similar to a database
• Operators:
➔ Map (ForEach), Aggregate, Window, Join, Sort, Load, Store, …
countpattern col: text on: `{digit}+` → Map
derive value: mean(price) group: make, model → FlatAggregate(Map + Join)
keep row: (make == 'honda') && (price > 50000) → Filter
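To make the three operator mappings concrete, here is a small in-memory sketch over lists of dicts. It is an illustration of the data-parallel operators named on the slide, not the actual abstract machine: the function names, the sample rows, and the output column names `count` and `mean_price` are all assumptions.

```python
import re
from collections import defaultdict
from statistics import mean


def op_map(rows, fn):
    """Map: add the columns computed by fn to every row."""
    return [{**row, **fn(row)} for row in rows]


def op_filter(rows, predicate):
    """Filter: keep only the rows for which predicate is true."""
    return [row for row in rows if predicate(row)]


def op_flat_aggregate(rows, group_cols, value_col, out_col, agg):
    """FlatAggregate: aggregate per group, then join the result onto each row."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in group_cols)].append(row[value_col])
    per_group = {key: agg(values) for key, values in groups.items()}
    return [
        {**row, out_col: per_group[tuple(row[c] for c in group_cols)]}
        for row in rows
    ]


def run_example():
    # Hypothetical sample data.
    rows = [
        {"make": "honda", "model": "nsx", "price": 60000, "text": "a1b22"},
        {"make": "honda", "model": "nsx", "price": 80000, "text": "none"},
        {"make": "honda", "model": "fit", "price": 15000, "text": "7"},
    ]
    # countpattern col: text on: `{digit}+`  ->  Map
    rows = op_map(rows, lambda r: {"count": len(re.findall(r"[0-9]+", r["text"]))})
    # derive value: mean(price) group: make, model  ->  FlatAggregate
    rows = op_flat_aggregate(rows, ["make", "model"], "price", "mean_price", mean)
    # keep row: (make == 'honda') && (price > 50000)  ->  Filter
    return op_filter(rows, lambda r: r["make"] == "honda" and r["price"] > 50000)


if __name__ == "__main__":
    for row in run_example():
        print(row)
```

The flat aggregate keeps the row count unchanged, which is why it decomposes into an aggregation followed by a join back onto the input rather than a plain group-by.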

Abstract → Concrete
• Map a set of abstract model operators to a concrete physical plan
➔ “target lowering” in compiler terms
• Template rewriting
➔ Load + Splitrows → Hadoop RecordReader
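Template rewriting can be sketched as a pattern match over a linear plan: when a run of abstract operators matches a template, it is replaced by one target-specific operator. This is a minimal illustration assuming a plan is just a list of operator names; the name `HadoopRecordReader` as a plan node and the function `rewrite` are hypothetical.

```python
def rewrite(plan, rules):
    """Replace each run of operators matching a rule with its concrete operator."""
    lowered = []
    i = 0
    while i < len(plan):
        for pattern, replacement in rules:
            if tuple(plan[i:i + len(pattern)]) == pattern:
                lowered.append(replacement)
                i += len(pattern)
                break
        else:
            # No template matched here; keep the operator unchanged.
            lowered.append(plan[i])
            i += 1
    return lowered


# Hypothetical rule set for a Hadoop target, following the slide's example.
HADOOP_RULES = [
    (("Load", "Splitrows"), "HadoopRecordReader"),
]

if __name__ == "__main__":
    print(rewrite(["Load", "Splitrows", "Map", "Aggregate", "Store"], HADOOP_RULES))
```

A different target would supply a different rule list against the same abstract plan, which is where the portability comes from.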

Compiling to LLVM as a target
Wrangle DSL → Abstract Machine Model → Code Generation (Operators, Expressions) → LLVM IR → Executable
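As a rough picture of the expression half of code generation, the sketch below emits textual LLVM IR for a row predicate over 64-bit integer columns. It is an assumption-laden toy, not Trifacta's code generator: the tuple-based expression tree, the function `compile_predicate`, and the restriction to integer comparisons joined by `&&` are all choices made for this example. String comparisons such as `make == 'honda'` are out of scope.

```python
import itertools

# Wrangle-style comparison operators mapped to LLVM signed-integer predicates.
ICMP = {">": "sgt", "<": "slt", "==": "eq"}


def compile_predicate(name, columns, expr):
    """Emit LLVM IR text for a predicate over 64-bit integer columns."""
    lines = []
    temps = itertools.count()

    def operand(value):
        # Column names become function arguments; ints become literals.
        return "%" + value if isinstance(value, str) else str(value)

    def gen(node):
        op, left, right = node
        if op in ICMP:
            result = "%t{0}".format(next(temps))
            lines.append("  {0} = icmp {1} i64 {2}, {3}".format(
                result, ICMP[op], operand(left), operand(right)))
            return result
        if op == "&&":
            lhs, rhs = gen(left), gen(right)
            result = "%t{0}".format(next(temps))
            lines.append("  {0} = and i1 {1}, {2}".format(result, lhs, rhs))
            return result
        raise ValueError("unsupported operator: " + op)

    result = gen(expr)
    args = ", ".join("i64 %" + column for column in columns)
    header = "define i1 @{0}({1}) {{".format(name, args)
    return "\n".join([header, "entry:"] + lines + ["  ret i1 " + result, "}"])


if __name__ == "__main__":
    print(compile_predicate("keep", ["price"], (">", "price", 50000)))
```

The `&&` case uses a plain `and` on both results instead of short-circuit branches, which is acceptable here only because the comparisons have no side effects.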

Summing up
Scripts
• Easy to experiment with & iterate
• Can capture a trail
DSL
• High-level notation for domain semantics
• Separate the “what” from the “how”
• Switch out implementations
• Constrain search space
Performance & Portability
• Pick a good abstract machine model
• Target independence is important
• Target-specific optimizations are important
