Data transformation has traditionally required expertise in specialized data platforms and has typically been restricted to IT. A domain-specific language (DSL) separates the user’s intent from a specific implementation while maintaining expressivity. A user interface can produce these expressions as suggestions, so the user does not have to write code manually. This higher-level interaction, aided by transformation previews and suggestion ranking, allows domain experts such as data scientists and business analysts to wrangle data while leveraging the optimal processing framework for the data at hand.
3. The Data Wrangling Problem
[Figure: Business System Data, Machine Generated Data, Log Data, ... → DATA WRANGLING → Data Visualization, Fraud Detection, Recommendations, ...]
4. Isn’t this just a programming problem?
Excerpt from an Apache log parser:
...
Source: https://github.com/rory/apache-log-parser
5. • Tend to be one-offs
• Tied to a specific platform
• Collaboration across IT and LOB difficult
• Burden of specificity
Data ↔ Script ↔ Implementation
6. What’s good about scripts?
• Easy to experiment with & iterate
• Can capture a trail
Capture the good, constrain the bad
Redefining scripts
7. • High-level notation for domain semantics
• Separate the “what” from the “how”
• Switch out implementations
Objectives*:
• Productivity
• Portability
• Performance
Domain Specific Languages
* http://web.stanford.edu/class/cs442/lectures_unrestricted/cs442-dsldesign.pdf
8. Wrangle, a DSL for Data Transformation
• Operators
• Parameters, Expressions
countpattern col: text on: `{digit}+`
derive value: mean(price) group: make, model
keep row: (make == 'honda') && (price > 50000)
merge col: make, model with: ','
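The statements above share one shape: an operator followed by named parameters. A minimal sketch of splitting such a statement into its parts is shown here. The function name `parse_statement` and the tokenizing rule are assumptions made for illustration, not the actual Wrangle grammar, and parameter values are kept as raw strings rather than parsed into expressions.

```python
import re

# A parameter name is a lowercase word followed by a colon ("col:", "group:").
# Quoted and backticked spans are matched first so colons inside them are skipped.
_TOKEN = re.compile(r"'[^']*'|`[^`]*`|(?P<key>\b[a-z]+):")


def parse_statement(text):
    """Split a Wrangle-style statement into (operator, {param: raw_value})."""
    operator, _, rest = text.strip().partition(" ")
    keys = [m for m in _TOKEN.finditer(rest) if m.group("key")]
    params = {}
    for i, match in enumerate(keys):
        # A value runs from the end of its key to the start of the next key.
        end = keys[i + 1].start() if i + 1 < len(keys) else len(rest)
        params[match.group("key")] = rest[match.end():end].strip()
    return operator, params


if __name__ == "__main__":
    for line in [
        "countpattern col: text on: `{digit}+`",
        "derive value: mean(price) group: make, model",
        "keep row: (make == 'honda') && (price > 50000)",
        "merge col: make, model with: ','",
    ]:
        print(parse_statement(line))
```

A real implementation would go on to parse each raw value into an expression tree; this sketch stops at the operator and parameter boundary.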
10. • Search space is prunable:
➔ Space of available verbs is finite
➔ Expression search space is large, but easier to prune with context
• Possible to build high-level interfaces
• Combination of user actions and data help infer DSL statements
Easier to programmatically synthesize
11. • Restrictive: only captures what language designers included
➔ Language designers need domain expertise
➔ ...and system expertise
• Requires scaffolding and infrastructure
• It’s still code
DSL “Gotchas”
17. • Build compilers to exploit separation of intent from implementation
• Target-independent representation
Performance & Portability from a DSL
[Figure: Wrangle DSL (intent) → Abstract Machine Model → multiple Targets (implementation)]
18. • Traditional compilers have a “generalized processor” worldview
• We need a data-parallel system (as opposed to a control-flow system)
➔ Similar to a database
• Operators:
• Map (ForEach), Aggregate, Window, Join, Sort, Load, Store, …
Abstract Machine Model
countpattern col: text on: `{digit}+` → Map
derive value: mean(price) group: make, model → FlatAggregate(Map + Join)
keep row: (make == 'honda') && (price > 50000) → Filter
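As a rough illustration of this slide, the three statements can be expressed with data-parallel operators over rows held as dictionaries. Everything here is an assumption for the sketch: the helper names, the output column names, the sample rows (made up), and the reading of `{digit}+` as the regular expression `\d+`. It is not the actual abstract machine model.

```python
import re
from collections import defaultdict


def map_rows(rows, fn):
    """Map (ForEach): apply fn to each row independently."""
    return [fn(dict(row)) for row in rows]


def filter_rows(rows, pred):
    """Filter: keep only the rows for which pred is true."""
    return [row for row in rows if pred(row)]


def flat_aggregate(rows, keys, value_col, out_col):
    """Aggregate a per-group mean, then join it back onto every row."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row[value_col])
    means = {group: sum(vals) / len(vals) for group, vals in groups.items()}
    return [{**row, out_col: means[tuple(row[k] for k in keys)]} for row in rows]


# Made-up sample data, for illustration only.
rows = [
    {"make": "honda", "model": "nsx", "price": 60000, "text": "vin 12 lot 7"},
    {"make": "honda", "model": "nsx", "price": 80000, "text": "no digits"},
    {"make": "honda", "model": "civic", "price": 20000, "text": "a1b22c333"},
]

# countpattern col: text on: `{digit}+`   (Map)
rows = map_rows(rows, lambda r: {**r, "count_text": len(re.findall(r"\d+", r["text"]))})
# derive value: mean(price) group: make, model   (FlatAggregate = aggregate + join)
rows = flat_aggregate(rows, ["make", "model"], "price", "mean_price")
# keep row: (make == 'honda') && (price > 50000)   (Filter)
result = filter_rows(rows, lambda r: r["make"] == "honda" and r["price"] > 50000)
```

Because each operator takes rows in and returns rows out, the same chain could in principle be handed to a different backend without changing the statements themselves.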
19. • Map a set of abstract model operators to a concrete physical plan
➔ “target lowering” in compiler terms
• Template rewriting
➔ Load + Splitrows → Hadoop RecordReader
Abstract → Concrete
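Template rewriting can be sketched as pattern matching over a sequence of abstract operator names. The template table, the `lower` function, and the concrete operator label below are hypothetical and only mirror the single example on the slide; a real lowering pass would operate on a plan graph with operator arguments, not a flat list of names.

```python
# Hypothetical rewrite templates: a run of abstract operators on the left
# is replaced by one concrete, target-specific operator on the right.
TEMPLATES = [
    (("Load", "Splitrows"), "HadoopRecordReader"),
]


def lower(plan, templates=TEMPLATES):
    """Rewrite an abstract plan left to right, applying the first matching template."""
    out, i = [], 0
    while i < len(plan):
        for pattern, concrete in templates:
            if tuple(plan[i:i + len(pattern)]) == pattern:
                out.append(concrete)
                i += len(pattern)
                break
        else:
            # No template matched here: keep the abstract operator unchanged.
            out.append(plan[i])
            i += 1
    return out


abstract_plan = ["Load", "Splitrows", "Map", "Filter", "Store"]
physical_plan = lower(abstract_plan)
```

Keeping the templates in a per-target table is one way to leave the abstract plan untouched when switching targets.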
20. Compiling to LLVM as a target
[Figure: Wrangle DSL → Abstract Machine Model → Code Generation (Operators, Expressions) → LLVM IR → Executable]
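The difference between interpreting an expression for every row and compiling it once can be sketched without LLVM. In the example below, Python closures stand in for generated native code. The tuple-based expression format and the function names are assumptions for illustration and do not reflect the actual code generator.

```python
import operator

OPS = {"==": operator.eq, ">": operator.gt, "&&": lambda a, b: a and b}


def interpret(expr, row):
    """Interpreter: walk the whole expression tree again for every row."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    return OPS[kind](interpret(expr[1], row), interpret(expr[2], row))


def compile_expr(expr):
    """Compiler: walk the tree once and return a closure to call per row."""
    kind = expr[0]
    if kind == "col":
        name = expr[1]
        return lambda row: row[name]
    if kind == "lit":
        value = expr[1]
        return lambda row: value
    fn, left, right = OPS[kind], compile_expr(expr[1]), compile_expr(expr[2])
    return lambda row: fn(left(row), right(row))


# (make == 'honda') && (price > 50000)
expr = ("&&",
        ("==", ("col", "make"), ("lit", "honda")),
        (">", ("col", "price"), ("lit", 50000)))

predicate = compile_expr(expr)
```

Both paths give the same answer for a row; the compiled form moves the tree walk and operator lookup out of the per-row loop, which is the same motivation behind emitting LLVM IR.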
21. Summing up
Performance & Portability
• Pick a good abstract machine model
• Target independence is important
• Target-specific optimizations are important
DSL
• High-level notation for domain semantics
• Separate the “what” from the “how”
• Switch out implementations
• Constrain search space
Scripts
• Easy to experiment with & iterate
• Can capture a trail
Editor's Notes
So many possible inputs. So many possible outputs. There is no single algorithm that does this – the idea doesn’t even make sense given the need for human domain knowledge.
So, at the end of the day, data transformation is a programming problem. If we don’t tame this problem, the flow of data is bottlenecked at the experts.
At the end of the day it’s all about the user. Their task is wrangling data, not constructing a script.
We’d like to help them by generating our language constructs, when they indicate what they want to do.
Using the DSL as the target of our program synthesis, ...
predictive interaction
Scaffolding & Infrastructure:
We expose, display and accept the language in the interface as strings, so we need to build a parser & ASTs.
Backing implementations have a common support footprint
Now that we’ve looked at the productivity aspect
Going to hand it to Zain to talk
Portability -> Why is it important? History of Hadoop slide; we’ve always been talking about a new thing
We understand the advantages of using a DSL. How do we go about implementing one?
No surprise that our operators resemble relational operators
Compile queries into native code instead of interpreting the query at each stage