Domain Specific Languages for Data Wrangling
Zain Asgar & Seshadri Mahalingam
Data Analysis: Where does the time go?
Storage & Processing
Analytics
The Data Wrangling Problem
[Diagram: business system data, machine-generated data, and log data pass through DATA WRANGLING before reaching data visualization, fraud detection, recommendations, ...]
Isn’t this just a programming problem?
Excerpt from an Apache log parser:
...
Source: https://github.com/rory/apache-log-parser
• Tend to be one-offs
• Tied to a specific platform
• Collaboration across IT and line of business (LOB) is difficult
• Burden of specificity
Data ↔ Script ↔ Implementation
What’s good about scripts?
• Easy to experiment with & iterate
• Can capture a trail
Capture the good, constrain the bad
Redefining scripts
• High-level notation for domain semantics
• Separate the “what” from the “how”
• Switch out implementations
Objectives*:
• Productivity
• Portability
• Performance
Domain Specific Languages
* http://web.stanford.edu/class/cs442/lectures_unrestricted/cs442-dsldesign.pdf
Wrangle, a DSL for Data Transformation
• Operators
• Parameters, Expressions
countpattern col: text on: `{digit}+`
derive value: mean(price) group: make, model
keep row: (make == 'honda') && (price > 50000)
merge col: make, model with: ','
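The statements above share one shape: a verb followed by named parameters. A minimal sketch of splitting such a statement into its parts is shown here. The grammar (lowercase `name: ` markers separated by whitespace) and the function name `parse_statement` are assumptions made for illustration, not Trifacta's actual parser, and a value that itself contains ` word: ` would be split incorrectly.

```python
import re

# Assumed grammar: a parameter starts with a lowercase name, a colon and a space.
_PARAM = re.compile(r"(?:^|\s)([a-z]+):\s")


def parse_statement(text):
    """Split a Wrangle-style statement into (verb, {param: raw_value})."""
    verb, _, rest = text.strip().partition(" ")
    params = {}
    matches = list(_PARAM.finditer(rest))
    for i, m in enumerate(matches):
        # A value runs until the next parameter marker or the end of the line.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(rest)
        params[m.group(1)] = rest[m.end():end].strip()
    return verb, params


if __name__ == "__main__":
    print(parse_statement("derive value: mean(price) group: make, model"))
    print(parse_statement("keep row: (make == 'honda') && (price > 50000)"))
```

Keeping the values as raw strings leaves expression parsing (for example `mean(price)`) to a later stage.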
{delim} -> [:,\s|/\-.]
{delim-ws} -> (?:\s+[:,|/\-.]\s*|\s*[:,|/\-.]\s+)
{number} -> (?:[-+]?(?:[0-9]+|[0-9]{1,3}(?:,[0-9]{3})*)(?:\.[0-9]+)? ...
{hex} -> [a-fA-F0-9]*(?:[a-fA-F][0-9]|[0-9][a-fA-F])[a-fA-F0-9]*
{ip-address} -> (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3} ...
{email} -> [a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)* ...
{url} -> ^(?:(https?|ftp)://)?(?:(?:([a-zA-Z0-9-._?,/+&%#!=~]+):([a-zA- ...
{time} -> (?:0?[0-9]|1[0-9]|2[0-3]):(?:0?[0-9]|[1-5][0-9])(?:.[0-9]{3}|...
{bool} -> (?:[tfTFyYnN01]|True|False| …
Text Matching Constructs
extract col: text after: `{delim}` on: `{ip-address}`
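One way to picture the text matching constructs is as macros that expand into ordinary regular expressions before matching. The sketch below is illustrative only: the three definitions in `NAMED_PATTERNS` are simplified assumptions (the `{digit}` definition is not shown on the slide), and treating `after:` as "immediately preceded by" is also an assumption about the semantics.

```python
import re

# Assumed, simplified definitions; the real pattern library is richer.
NAMED_PATTERNS = {
    "digit": r"[0-9]",
    "delim": r"[:,\s|/\-.]",
    "ip-address": r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}",
}

# Matches {name} but not quantifiers such as {2}.
_NAME = re.compile(r"\{([a-z][a-z-]*)\}")


def expand(pattern):
    """Replace each {name} with its regex, wrapped in a non-capturing group."""
    return _NAME.sub(lambda m: "(?:" + NAMED_PATTERNS[m.group(1)] + ")", pattern)


def extract(text, on, after=None):
    """Return the first match of `on`, optionally only when preceded by `after`."""
    regex = "(" + expand(on) + ")"
    if after is not None:
        regex = "(?:" + expand(after) + ")" + regex
    m = re.search(regex, text)
    return m.group(1) if m else None


if __name__ == "__main__":
    print(expand("{digit}{2}"))
    print(extract("client 10.0.0.7 connected", on="{ip-address}", after="{delim}"))
```

Wrapping each expansion in a non-capturing group keeps a following quantifier, as in `{digit}{2}`, applied to the whole named pattern.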
• Search space is prunable:
➔ Space of available verbs is finite
➔ Expression search space is large, but easier to prune with context
• Possible to build high-level interfaces
• Combination of user actions and data help infer DSL statements
Easier to programmatically synthesize
• Restrictive: only captures what language designers included
➔ Language designers need domain expertise
➔ ...and system expertise
• Requires scaffolding and infrastructure
• It’s still code
DSL “Gotchas”
Alternative: a data-centric view
The Code:
Demo
from collections import defaultdict
import fileinput
import re

# Username: the text between '<' and '>' at the start of the fourth field.
EXTRACT_USERNAME_REGEX = re.compile(r'(<)((?:(?!<).)*?)(>)',
                                    re.DOTALL | re.UNICODE)
# Hour: the two leading digits of the third field.
HOUR_EXTRACTION_REGEX = re.compile(r'()([0-9]{2})()', re.DOTALL | re.UNICODE)


def main():
    username_hour_counts = defaultdict(int)
    for line in fileinput.input():
        after_split = line.split(' ', 3)
        if len(after_split) < 4:
            continue  # skip lines with too few fields
        matches = EXTRACT_USERNAME_REGEX.match(after_split[3])
        hour_matches = HOUR_EXTRACTION_REGEX.match(after_split[2])
        if matches is None or hour_matches is None:
            continue  # skip lines without a username or an hour
        username = matches.group(2)
        hour = hour_matches.group(2)
        username_hour_counts[(username, hour)] += 1
    # Emit one CSV row per (username, hour) pair.
    for (username, hour), count in username_hour_counts.items():
        print('{0},{1},{2}'.format(username, hour, count))


if __name__ == '__main__':
    main()
register /Users/seshadri/trifacta/pig-udfs/build/libs/pig-udfs-bundle.jar;
register /Users/seshadri/trifacta/services/python-udf-service/PyUDFService/udfs/libs/udfs.jar;
original_93_1 = LOAD 'hdfs://hdfs-namenode.dockerdomain:8020/trifacta/uploads/1/38d52b37-e7bc-49c8-85d4-3f5993b9b0d3/2009-09-01.txt'
    USING TrifactaStorage('\n', '--maxRecordLength 1048576')
    AS column1:chararray;
DEFINE SplitUDF1 SplitUDF('()( )()', '3', 'false', '', '2');
cleaned_table = FOREACH original_93_1 GENERATE
    flatten(SplitUDF1($0)) AS (column2:chararray, column3:chararray, column4:chararray, column5:chararray);
DEFINE ExtractUDF1 ExtractUDF('(<)((?:(?!<).)*?)(>)', '1', 'false', '', '2');
cleaned_table_1 = FOREACH cleaned_table GENERATE
    $0 AS column2:chararray, $1 AS column3:chararray, $2 AS column4:chararray, $3 AS column5:chararray,
    flatten(ExtractUDF1($3)) AS column1:chararray;
DEFINE ExtractUDF2 ExtractUDF('()([0-9]{2})()', '1', 'false', '', '2');
cleaned_table_2 = FOREACH cleaned_table_1 GENERATE
    $0 AS column2:chararray, $1 AS column3:chararray, $2 AS column4:chararray,
    flatten(ExtractUDF2($2)) AS column6:chararray, $3 AS column5:chararray, $4 AS column1:chararray;
cleaned_table_3_1 = GROUP cleaned_table_2 BY ($3, $5);
DEFINE VectorizedAggregateUDF1 VectorizedAggregateUDF('[{"count":1,"agg":"COUNT_STAR","names":["row_count"],"args":[]}]');
cleaned_table_3_1 = FOREACH cleaned_table_3_1 GENERATE
    flatten(group) AS (column6:chararray, column1:chararray),
    flatten(VectorizedAggregateUDF1(cleaned_table_2));
DEFINE TypeCastUDF1 TypeCastUDF('{"testCase":{"regexes":["^-?\d{1,16}$"]},"name":"Long","fullPath":["Integer"],"stripWhitespace":false}');
DEFINE TypeCastUDF2 TypeCastUDF('{"testCase":{},"name":"String","fullPath":["String"],"stripWhitespace":false}');
cleaned_table_3 = FOREACH cleaned_table_3_1 GENERATE
    TypeCastUDF1($0) AS column6:long, TypeCastUDF2($1) AS column1:chararray, TypeCastUDF1($2) AS row_count:long;
STORE cleaned_table_3 INTO 'hdfs://hdfs-namenode.dockerdomain:8020/trifacta/queryResults/seshadri/Web_Chat_Log/36/cleaned_table_3.json'
    USING TrifactaJsonStorage();
splitrows col: column1 on: '\n'
split col: column1 on: ' ' limit: 3
extract col: column5 after: `<` before: `>`
extract col: column4 on: `{digit}{2}`
aggregate value: count() group: column6,column1
• Build compilers to exploit separation of intent from implementation
• Target-independent representation
Performance & Portability from a DSL
[Diagram: Wrangle DSL (intent) → Abstract Machine Model → multiple Targets (implementation)]
• Traditional compilers have a “generalized processor” worldview
• We need a data-parallel system (as opposed to a control-flow system)
➔ Similar to a database
• Operators:
• Map (ForEach), Aggregate, Window, Join, Sort, Load, Store, …
Abstract Machine Model
countpattern col: text on: `{digit}+`
derive value: mean(price) group: make, model
keep row: (make == 'honda') && (price > 50000)
Map
FlatAggregate(Map + Join)
Filter
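The mapping above (countpattern → Map, grouped derive → FlatAggregate, keep → Filter) can be sketched with a few small operator classes over lists of dict rows. This is a toy model under stated assumptions: the class fields, the `run` method, and `run_plan` are hypothetical, and reading "FlatAggregate (Map + Join)" as "aggregate per group, then join the result back onto every row" is an interpretation of the slide, not a documented definition.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class Map:
    fn: Callable[[dict], dict]  # produces one output row per input row

    def run(self, rows):
        return [self.fn(r) for r in rows]


@dataclass
class Filter:
    pred: Callable[[dict], bool]  # keeps rows for which pred is true

    def run(self, rows):
        return [r for r in rows if self.pred(r)]


@dataclass
class FlatAggregate:
    keys: tuple   # grouping columns
    source: str   # column to average
    target: str   # name of the derived column

    def run(self, rows):
        # Aggregate step: mean of `source` per group.
        groups = defaultdict(list)
        for r in rows:
            groups[tuple(r[k] for k in self.keys)].append(r[self.source])
        means = {k: sum(v) / len(v) for k, v in groups.items()}
        # Join step: attach the group's mean to every original row.
        return [{**r, self.target: means[tuple(r[k] for k in self.keys)]}
                for r in rows]


def run_plan(plan, rows):
    """Apply each operator in order to the whole row set."""
    for op in plan:
        rows = op.run(rows)
    return rows


if __name__ == "__main__":
    rows = [
        {"make": "honda", "model": "nsx", "price": 60000},
        {"make": "honda", "model": "nsx", "price": 80000},
        {"make": "honda", "model": "fit", "price": 15000},
    ]
    plan = [
        FlatAggregate(keys=("make", "model"), source="price", target="mean_price"),
        Filter(lambda r: r["make"] == "honda" and r["price"] > 50000),
    ]
    print(run_plan(plan, rows))
```

Because every operator consumes and produces a whole row set, the plan describes what to compute without fixing how a given target executes it.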
• Map a set of abstract model operators to a concrete physical plan
➔ “target lowering” in compiler terms
• Template rewriting
➔ Load + Splitrows → Hadoop RecordReader
Abstract → Concrete
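Template rewriting can be pictured as a pattern match over the sequence of abstract operators. The sketch below assumes a plan is a flat list of operator names, which is a simplification, and `HadoopRecordReader` is a placeholder label for the target-side operator. Only the Load + Splitrows rule comes from the slide.

```python
# Each template maps a run of abstract operators to one concrete operator.
# Only the Load + Splitrows rule is taken from the slide; the names are labels.
REWRITE_TEMPLATES = [
    (("Load", "Splitrows"), "HadoopRecordReader"),
]


def lower(plan, templates=REWRITE_TEMPLATES):
    """Replace each run of operators that matches a template; keep the rest."""
    out, i = [], 0
    while i < len(plan):
        for pattern, replacement in templates:
            if tuple(plan[i:i + len(pattern)]) == pattern:
                out.append(replacement)
                i += len(pattern)
                break
        else:
            # No template matched at this position: keep the operator as-is.
            out.append(plan[i])
            i += 1
    return out


if __name__ == "__main__":
    print(lower(["Load", "Splitrows", "Map", "Filter", "Store"]))
```

A different target would supply its own template list, leaving the abstract plan unchanged.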
Compiling to LLVM as a target
[Diagram: Wrangle DSL → Abstract Machine Model → Code Generation (Operators, Expressions) → LLVM IR → Executable]
Summing up
Scripts
• Easy to experiment with & iterate
• Can capture a trail
DSL
• High-level notation for domain semantics
• Separate the “what” from the “how”
• Switch out implementations
• Constrain search space
Performance & Portability
• Pick a good abstract machine model
• Target independence is important
• Target-specific optimizations are important
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Zain Asgar & Seshadri Mahalingam of Trifacta

Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Enhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZEnhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZ
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 

Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Zain Asgar & Seshadri Mahalingam of Trifacta

  • 1. Domain Specific Languages for Data Wrangling Zain Asgar & Seshadri Mahalingam
  • 2. Data Analysis: Where does the time go? Storage & Processing Analytics
  • 3. The Data Wrangling Problem Business System Data Machine Generated Data Log Data Data Visualization Fraud Detection Recommendations ... ... DATA WRANGLING
  • 4. Isn’t this just a programming problem? Source: https://github.com/rory/apache-log-parser ... Excerpt from an apache log parser:
  • 5.
      • Tend to be one-offs
      • Tied to a specific platform
      • Collaboration across IT and LOB difficult
      • Burden of specificity
      Data ↔ Script ↔ Implementation
  • 6. What’s good about scripts?
      • Easy to experiment with & iterate
      • Can capture a trail
      Capture the good, constrain the bad
      Redefining scripts
  • 7. Domain Specific Languages
      • High-level notation for domain semantics
      • Separate the “what” from the “how”
      • Switch out implementations
      Objectives*:
      • Productivity
      • Portability
      • Performance
      * http://web.stanford.edu/class/cs442/lectures_unrestricted/cs442-dsldesign.pdf
  • 8. Wrangle, a DSL for Data Transformation
      • Operators
      • Parameters, Expressions
      countpattern col: text on: `{digit}+`
      derive value: mean(price) group: make, model
      keep row: (make == 'honda') && (price > 50000)
      merge col: make, model with: ','
  • 9. Text Matching Constructs
      {delim} -> [:,s|/-.]
      {delim-ws} -> (?:s+[:,|/-.]s*|s*[:,|/-.]s+)
      {number} -> (?:[-+]?(?:[0-9]+|[0-9]{1,3}(?:,[0-9]{3})*)(?:.[0-9]+)? ...
      {hex} -> [a-fA-F0-9]*(?:[a-fA-F][0-9]|[0-9][a-fA-F])[a-fA-F0-9]*
      {ip-address} -> (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]).){3} ...
      {email} -> [a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)* ...
      {url} -> ^(?:(https?|ftp)://)?(?:(?:([a-zA-Z0-9-._?,/+&%#!=~]+):([a-zA- ...
      {time} -> (?:0?[0-9]|1[0-9]|2[0-3]):(?:0?[0-9]|[1-5][0-9])(?:.[0-9]{3}|...
      {bool} -> (?:[tfTFyYnN01]|True|False| …
      extract col: text after: `{delim}` on: `{ip-address}`
  • 10. Easier to programmatically synthesize
      • Search space is prunable:
        ➔ Space of available verbs is finite
        ➔ Expression search space is large, but easier to prune with context
      • Possible to build high-level interfaces
      • Combination of user actions and data helps infer DSL statements
  • 11. DSL “Gotchas”
      • Restrictive: only captures what language designers included
        ➔ Language designers need domain expertise
        ➔ ...and system expertise
      • Requires scaffolding and infrastructure
      • It’s still code
  • 12. Alternative: a data-centric view The Code:
  • 13. Demo
  • 14.
      from collections import defaultdict
      import fileinput
      import re
      EXTRACT_USERNAME_REGEX = re.compile(r'(<)((?:(?!<).)*?)(>)', re.DOTALL | re.UNICODE)
      HOUR_EXTRACTION_REGEX = re.compile(r'()([0-9]{2})()', re.DOTALL | re.UNICODE)
      def main():
          username_hour_counts = defaultdict(lambda: 0)
          for line in fileinput.input():
              after_split = line.split(' ', 3)
              matches = re.match(EXTRACT_USERNAME_REGEX, after_split[3])
              username = matches.group(2)
              hour_matches = re.match(HOUR_EXTRACTION_REGEX, after_split[2])
              hour = hour_matches.group(2)
              username_hour_counts[(username, hour)] += 1
          for (username, hour), count in username_hour_counts.iteritems():
              print '{0},{1},{2}'.format(username, hour, count)
      if __name__ == '__main__':
          main()
  • 15.
      register /Users/seshadri/trifacta/pig-udfs/build/libs/pig-udfs-bundle.jar;
      register /Users/seshadri/trifacta/services/python-udf-service/PyUDFService/udfs/libs/udfs.jar;
      original_93_1 = LOAD 'hdfs://hdfs-namenode.dockerdomain:8020/trifacta/uploads/1/38d52b37-e7bc-49c8-85d4-3f5993b9b0d3/2009-09-01.txt' USING TrifactaStorage('n', '--maxRecordLength 1048576') AS column1:chararray;
      DEFINE SplitUDF1 SplitUDF('()( )()', '3', 'false', '', '2');
      cleaned_table = FOREACH original_93_1 GENERATE flatten(SplitUDF1($0)) AS (column2:chararray, column3:chararray, column4:chararray, column5:chararray);
      DEFINE ExtractUDF1 ExtractUDF('(<)((?:(?!<).)*?)(>)', '1', 'false', '', '2');
      cleaned_table_1 = FOREACH cleaned_table GENERATE $0 AS column2:chararray, $1 AS column3:chararray, $2 AS column4:chararray, $3 AS column5:chararray, flatten(ExtractUDF1($3)) AS column1:chararray;
      DEFINE ExtractUDF2 ExtractUDF('()([0-9]{2})()', '1', 'false', '', '2');
      cleaned_table_2 = FOREACH cleaned_table_1 GENERATE $0 AS column2:chararray, $1 AS column3:chararray, $2 AS column4:chararray, flatten(ExtractUDF2($2)) AS column6:chararray, $3 AS column5:chararray, $4 AS column1:chararray;
      cleaned_table_3_1 = GROUP cleaned_table_2 BY ($3, $5);
      DEFINE VectorizedAggregateUDF1 VectorizedAggregateUDF('[{"count":1,"agg":"COUNT_STAR","names":["row_count"],"args":[]}]');
      cleaned_table_3_1 = FOREACH cleaned_table_3_1 GENERATE flatten(group) AS (column6:chararray, column1:chararray), flatten(VectorizedAggregateUDF1(cleaned_table_2));
      DEFINE TypeCastUDF1 TypeCastUDF('{"testCase":{"regexes":["^-?d{1,16}$"]},"name":"Long","fullPath":["Integer"],"stripWhitespace":false}');
      DEFINE TypeCastUDF2 TypeCastUDF('{"testCase":{},"name":"String","fullPath":["String"],"stripWhitespace":false}');
      cleaned_table_3 = FOREACH cleaned_table_3_1 GENERATE TypeCastUDF1($0) AS column6:long, TypeCastUDF2($1) AS column1:chararray, TypeCastUDF1($2) AS row_count:long;
      STORE cleaned_table_3 INTO 'hdfs://hdfs-namenode.dockerdomain:8020/trifacta/queryResults/seshadri/Web_Chat_Log/36/cleaned_table_3.json' USING TrifactaJsonStorage();
  • 16.
      splitrows col: column1 on: '\n'
      split col: column1 on: ' ' limit: 3
      extract col: column5 after: `<` before: `>`
      extract col: column4 on: `{digit}{2}`
      aggregate value: count() group: column6,column1
  • 17. Performance & Portability from a DSL
      • Build compilers to exploit separation of intent from implementation
      • Target-independent representation
      [Diagram: Wrangle DSL → Abstract Machine Model → Targets; labels: Intent, Implementation]
  • 18. Abstract Machine Model
      • Traditional compilers have a “generalized processor” worldview
      • We need a data-parallel system (as opposed to a control-flow system)
        ➔ Similar to a database
      • Operators:
        • Map (ForEach), Aggregate, Window, Join, Sort, Load, Store, …
      countpattern col: text on: `{digit}+` → Map
      derive value: mean(price) group: make, model → FlatAggregate (Map + Join)
      keep row: (make == 'honda') && (price > 50000) → Filter
  • 19. Abstract → Concrete
      • Map a set of abstract model operators to a concrete physical plan
        ➔ “target lowering” in compiler terms
      • Template rewriting
        ➔ Load + Splitrows → Hadoop RecordReader
  • 20. Compiling to LLVM as a target
      [Diagram: Wrangle DSL → Abstract Machine Model → Code Generation (Operators, Expressions) → LLVM IR → Executable]
  • 21. Summing up
      Performance & Portability
        • Pick a good abstract machine model
        • Target independence is important
        • Target-specific optimizations are important
      DSL
        • High-level notation for domain semantics
        • Separate the “what” from the “how”
        • Switch out implementations
        • Constrain search space
      Scripts
        • Easy to experiment with & iterate
        • Can capture a trail
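
The lowering idea from slides 18 and 19 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Trifacta's compiler: the `Op` class, the `VERB_TO_KIND` table, the implicit leading `Load` operator, and the `HadoopRecordReader` kind name are all hypothetical. Only the verb-to-operator pairings (countpattern → Map, derive with a group → FlatAggregate, keep → Filter) and the rewrite "Load + Splitrows → Hadoop RecordReader" come from the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Op:
    """One operator in the (hypothetical) abstract machine model."""
    kind: str          # e.g. "Map", "Filter", "Load"
    source: str = ""   # the Wrangle statement this operator came from


# Assumed verb -> abstract operator table, following the pairings on slide 18.
VERB_TO_KIND = {
    "splitrows": "Splitrows",
    "countpattern": "Map",
    "derive": "FlatAggregate",   # slide 18: FlatAggregate (Map + Join)
    "keep": "Filter",
}

# Template rewrites for one target: a sequence of abstract kinds -> concrete kind.
REWRITE_TEMPLATES = [
    (("Load", "Splitrows"), "HadoopRecordReader"),
]


def to_abstract(script: List[str]) -> List[Op]:
    """Translate Wrangle statements into abstract operators.

    Assumption: every script starts from an implicit Load of the input.
    """
    ops = [Op("Load", "<input>")]
    for statement in script:
        verb = statement.split(None, 1)[0]
        ops.append(Op(VERB_TO_KIND[verb], statement))
    return ops


def lower(ops: List[Op]) -> List[Op]:
    """Apply template rewrites left to right ("target lowering")."""
    lowered: List[Op] = []
    i = 0
    while i < len(ops):
        for pattern, target_kind in REWRITE_TEMPLATES:
            window = ops[i:i + len(pattern)]
            if tuple(op.kind for op in window) == pattern:
                merged_source = " + ".join(op.source for op in window)
                lowered.append(Op(target_kind, merged_source))
                i += len(pattern)
                break
        else:
            # No template matched here; keep the abstract operator as-is.
            lowered.append(ops[i])
            i += 1
    return lowered


if __name__ == "__main__":
    script = [
        "splitrows col: column1 on: '\\n'",
        "countpattern col: text on: `{digit}+`",
        "keep row: (make == 'honda') && (price > 50000)",
    ]
    for op in lower(to_abstract(script)):
        print(op.kind, "<-", op.source)
```

Keeping the rewrite templates in a per-target table is one way to keep the abstract plan target-independent: a different backend would swap in a different `REWRITE_TEMPLATES` list without touching `to_abstract`.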

Editor's Notes

  1. So many possible inputs. So many possible outputs. There is no single algorithm that does this – the idea doesn’t even make sense given the need for human domain knowledge. So – at the end of the day, Data Transformation is a programming problem. If we don’t tame this problem, the flow of data is bottlenecked at the experts.
  2. Productivity ←→ Simplification SQL, VBA, Lex/Yacc, LaTeX, HTML, OpenGL, Bash, Datalog, …
  3. At the end of the day it’s all about the user. Their task is wrangling data, not constructing a script. We’d like to help them by generating our language constructs, when they indicate what they want to do. Using the DSL as the target of our program synthesis, ... predictive interaction
  4. Scaffolding & Infrastructure: We expose, display and accept the language in the interface as strings, so we need to build a parser & ASTs. Backing implementations have a common support footprint
  5. Now that we’ve looked at the productivity aspect, going to hand it to Zain to talk portability. Why is it important? History of Hadoop slide; we’ve always been talking about a new thing.
  6. We understand the advantages of using a DSL. How do we go about implementing one?
  7. No surprise that our operators resemble relational operators
  8. Compile queries into native code instead of interpreting the query at each stage
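
As a rough illustration of the scaffolding mentioned in note 4 (accepting the language as strings and building a parser and AST), here is a minimal Python sketch. It assumes a simplified grammar inferred from the slide examples, a verb followed by `name: value` parameters, and is not the actual Wrangle parser. The `Statement` class and `parse_statement` function are hypothetical names, and parameter values are kept as raw strings rather than parsed into expression trees.

```python
import re
from dataclasses import dataclass
from typing import Dict


@dataclass
class Statement:
    """A very small AST node: one verb plus its named parameters."""
    verb: str
    params: Dict[str, str]


# A parameter name is a lowercase word followed by a colon, at the start of
# the parameter text or after whitespace. Simplification: a value that itself
# contains " word: " would be split incorrectly.
_PARAM_NAME = re.compile(r"(?:^|\s)([a-z]+):\s*")


def parse_statement(text: str) -> Statement:
    """Parse one statement such as "merge col: make, model with: ','"."""
    verb, _, rest = text.strip().partition(" ")
    matches = list(_PARAM_NAME.finditer(rest))
    params: Dict[str, str] = {}
    for i, match in enumerate(matches):
        # A value runs from the end of its name to the start of the next name.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(rest)
        params[match.group(1)] = rest[match.end():end].strip()
    return Statement(verb, params)


if __name__ == "__main__":
    for line in [
        "derive value: mean(price) group: make, model",
        "keep row: (make == 'honda') && (price > 50000)",
        "merge col: make, model with: ','",
    ]:
        print(parse_statement(line))
```

A real implementation would go further and parse expressions such as `mean(price)` or the `keep` predicate into their own tree nodes, so that each backend can generate code from the same AST.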