Domain Specific Languages for Data Wrangling
Zain Asgar & Seshadri Mahalingam
Data Analysis: Where does the time go?
Storage & Processing
Analytics
The Data Wrangling Problem
[Diagram: data sources (Business System Data, Machine Generated Data, Log Data, ...) feed DATA WRANGLING, which feeds downstream uses (Data Visualization, Fraud Detection, Recommendations, ...)]
Isn’t this just a programming problem?
Source: https://github.com/rory/apache-log-parser
...
Excerpt from an apache log parser:
• Tend to be one-offs
• Tied to a specific platform
• Collaboration across IT and LOB difficult
• Burden of specificity
Data ↔ Script ↔ Implementation
What’s good about scripts?
• Easy to experiment with & iterate
• Can capture a trail
Capture the good, constrain the bad
Redefining scripts
• High-level notation for domain semantics
• Separate the “what” from the “how”
• Switch out implementations
Objectives*:
• Productivity
• Portability
• Performance
Domain Specific Languages
* http://web.stanford.edu/class/cs442/lectures_unrestricted/cs442-dsldesign.pdf
Wrangle, a DSL for Data Transformation
• Operators
• Parameters, Expressions
countpattern col: text on: `{digit}+`
derive value: mean(price) group: make, model
keep row: (make == 'honda') && (price > 50000)
merge col: make, model with: ','
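The operator-plus-parameters shape of the statements above can be illustrated with a small parser. This is only a sketch in Python, not Trifacta's actual grammar: the function name `parse_statement` and the rule that a parameter key is a lowercase word followed by a colon and whitespace are assumptions made for illustration.

```python
import re

# Assumption: a parameter key is a lowercase word followed by ':' and whitespace.
PARAM_KEY = re.compile(r"(?<!\S)([a-z]+):(?=\s)")


def parse_statement(statement):
    """Split a Wrangle-style statement into (operator, {parameter: expression})."""
    verb, _, rest = statement.strip().partition(" ")
    keys = list(PARAM_KEY.finditer(rest))
    params = {}
    for i, key in enumerate(keys):
        # A parameter's expression runs until the next key (or end of statement).
        end = keys[i + 1].start() if i + 1 < len(keys) else len(rest)
        params[key.group(1)] = rest[key.end():end].strip()
    return verb, params


print(parse_statement("derive value: mean(price) group: make, model"))
# ('derive', {'value': 'mean(price)', 'group': 'make, model'})
```

Expressions are kept as raw strings here; a real front end would parse them further.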
{delim} -> [:,s|/-.]
{delim-ws} -> (?:s+[:,|/-.]s*|s*[:,|/-.]s+)
{number} -> (?:[-+]?(?:[0-9]+|[0-9]{1,3}(?:,[0-9]{3})*)(?:.[0-9]+)? ...
{hex} -> [a-fA-F0-9]*(?:[a-fA-F][0-9]|[0-9][a-fA-F])[a-fA-F0-9]*
{ip-address} -> (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]).){3} ...
{email} -> [a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)* ...
{url} -> ^(?:(https?|ftp)://)?(?:(?:([a-zA-Z0-9-._?,/+&%#!=~]+):([a-zA- ...
{time} -> (?:0?[0-9]|1[0-9]|2[0-3]):(?:0?[0-9]|[1-5][0-9])(?:.[0-9]{3}|...
{bool} -> (?:[tfTFyYnN01]|True|False| …
Text Matching Constructs
extract col: text after: `{delim}` on: `{ip-address}`
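One way to read the named constructs above is as macros that expand into ordinary regular expressions. The Python sketch below assumes that reading. The `{hex}` definition is copied from the slide; the `{digit}` definition and the helper names `expand` and `count_pattern` are assumptions, since the slides do not show them.

```python
import re

NAMED_PATTERNS = {
    "digit": r"[0-9]",  # assumption: {digit} is not defined on the slide
    "hex": r"[a-fA-F0-9]*(?:[a-fA-F][0-9]|[0-9][a-fA-F])[a-fA-F0-9]*",  # from the slide
}


def expand(pattern):
    """Replace each {name} with its regex, wrapped in a non-capturing group."""
    def substitute(match):
        return "(?:" + NAMED_PATTERNS[match.group(1)] + ")"
    # Names are lowercase letters and hyphens, so quantifiers like {2} are left alone.
    return re.sub(r"\{([a-z-]+)\}", substitute, pattern)


def count_pattern(text, pattern):
    """Rough analogue of: countpattern col: text on: `pattern`."""
    return len(re.findall(expand(pattern), text))


print(expand("{digit}{2}"))                      # (?:[0-9]){2}
print(count_pattern("a1 b22 c333", "{digit}+"))  # 3
```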
• Search space is prunable:
➔ Space of available verbs is finite
➔ Expression search space is large, but easier to prune with context
• Possible to build high-level interfaces
• Combination of user actions and data help infer DSL statements
Easier to programmatically synthesize
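A toy version of this pruning, in Python, might look as follows. It is a sketch under stated assumptions, not the product's inference engine: the candidate list, the `{digit}` definition, the function name `suggest_extracts`, and the coverage-based ranking are all hypothetical. The user's highlighted text prunes candidates, and the column's data ranks the survivors.

```python
import re

# Hypothetical finite candidate set; {hex} is taken from the slides, {digit} is assumed.
CANDIDATE_PATTERNS = {
    "{digit}+": r"[0-9]+",
    "{hex}": r"[a-fA-F0-9]*(?:[a-fA-F][0-9]|[0-9][a-fA-F])[a-fA-F0-9]*",
}


def suggest_extracts(column, values, selected_text):
    """Suggest extract statements for text the user highlighted in a column."""
    suggestions = []
    for name, regex in CANDIDATE_PATTERNS.items():
        # Prune with the user action: the pattern must describe the selection.
        if not re.fullmatch(regex, selected_text):
            continue
        # Rank with the data: fraction of rows in which the pattern occurs.
        hits = sum(1 for value in values if re.search(regex, value))
        coverage = hits / max(len(values), 1)
        suggestions.append((coverage, "extract col: {0} on: `{1}`".format(column, name)))
    ranked = sorted(suggestions, key=lambda pair: -pair[0])
    return [statement for _, statement in ranked]


print(suggest_extracts("text", ["id 42", "id 7", "none"], "42"))
# ['extract col: text on: `{digit}+`']
```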
• Restrictive: only captures what language designers included
➔ Language designers need domain expertise
➔ ...and system expertise
• Requires scaffolding and infrastructure
• It’s still code
DSL “Gotchas”
Alternative: a data-centric view
The Code:
Demo
from collections import defaultdict
import fileinput
import re

# Username is the text between '<' and '>' at the start of the message field.
EXTRACT_USERNAME_REGEX = re.compile(r'(<)((?:(?!<).)*?)(>)',
                                    re.DOTALL | re.UNICODE)
# Hour is the first two digits of the time field.
HOUR_EXTRACTION_REGEX = re.compile(r'()([0-9]{2})()', re.DOTALL | re.UNICODE)


def main():
    # Number of lines seen per (username, hour) pair.
    username_hour_counts = defaultdict(int)
    for line in fileinput.input():
        # Split into four fields: the third is the time, the fourth the message.
        after_split = line.split(' ', 3)
        matches = EXTRACT_USERNAME_REGEX.match(after_split[3])
        username = matches.group(2)
        hour_matches = HOUR_EXTRACTION_REGEX.match(after_split[2])
        hour = hour_matches.group(2)
        username_hour_counts[(username, hour)] += 1
    for (username, hour), count in username_hour_counts.items():
        print('{0},{1},{2}'.format(username, hour, count))


if __name__ == '__main__':
    main()
register /Users/seshadri/trifacta/pig-udfs/build/libs/pig-udfs-bundle.jar;
register /Users/seshadri/trifacta/services/python-udf-service/PyUDFService/udfs/libs/udfs.jar;
original_93_1 = LOAD 'hdfs://hdfs-namenode.dockerdomain:8020/trifacta/uploads/1/38d52b37-e7bc-49c8-85d4-3f5993b9b0d3/2009-09-01.txt'
    USING TrifactaStorage('n', '--maxRecordLength 1048576')
    AS column1:chararray;
DEFINE SplitUDF1 SplitUDF('()( )()', '3', 'false', '', '2');
cleaned_table = FOREACH original_93_1 GENERATE
    flatten(SplitUDF1($0)) AS (column2:chararray, column3:chararray, column4:chararray, column5:chararray);
DEFINE ExtractUDF1 ExtractUDF('(<)((?:(?!<).)*?)(>)', '1', 'false', '', '2');
cleaned_table_1 = FOREACH cleaned_table GENERATE
    $0 AS column2:chararray, $1 AS column3:chararray, $2 AS column4:chararray, $3 AS column5:chararray,
    flatten(ExtractUDF1($3)) AS column1:chararray;
DEFINE ExtractUDF2 ExtractUDF('()([0-9]{2})()', '1', 'false', '', '2');
cleaned_table_2 = FOREACH cleaned_table_1 GENERATE
    $0 AS column2:chararray, $1 AS column3:chararray, $2 AS column4:chararray,
    flatten(ExtractUDF2($2)) AS column6:chararray, $3 AS column5:chararray, $4 AS column1:chararray;
cleaned_table_3_1 = GROUP cleaned_table_2 BY ($3, $5);
DEFINE VectorizedAggregateUDF1 VectorizedAggregateUDF('[{"count":1,"agg":"COUNT_STAR","names":["row_count"],"args":[]}]');
cleaned_table_3_1 = FOREACH cleaned_table_3_1 GENERATE
    flatten(group) AS (column6:chararray, column1:chararray),
    flatten(VectorizedAggregateUDF1(cleaned_table_2));
DEFINE TypeCastUDF1 TypeCastUDF('{"testCase":{"regexes":["^-?d{1,16}$"]},"name":"Long","fullPath":["Integer"],"stripWhitespace":false}');
DEFINE TypeCastUDF2 TypeCastUDF('{"testCase":{},"name":"String","fullPath":["String"],"stripWhitespace":false}');
cleaned_table_3 = FOREACH cleaned_table_3_1 GENERATE
    TypeCastUDF1($0) AS column6:long, TypeCastUDF2($1) AS column1:chararray, TypeCastUDF1($2) AS row_count:long;
STORE cleaned_table_3 INTO 'hdfs://hdfs-namenode.dockerdomain:8020/trifacta/queryResults/seshadri/Web_Chat_Log/36/cleaned_table_3.json'
    USING TrifactaJsonStorage();
splitrows col: column1 on: '\n'
split col: column1 on: ' ' limit: 3
extract col: column5 after: `<` before: `>`
extract col: column4 on: `{digit}{2}`
aggregate value: count() group: column6,column1
• Build compilers to exploit separation of intent from implementation
• Target-independent representation
Performance & Portability from a DSL
[Diagram: Wrangle DSL → Abstract Machine Model → multiple Targets, along an axis labeled Intent → Implementation]
• Traditional compilers have a “generalized processor” worldview
• We need a data-parallel system (as opposed to a control-flow system)
➔ Similar to a database
• Operators:
• Map (ForEach), Aggregate, Window, Join, Sort, Load, Store, …
Abstract Machine Model
countpattern col: text on: `{digit}+` → Map
derive value: mean(price) group: make, model → FlatAggregate(Map + Join)
keep row: (make == 'honda') && (price > 50000) → Filter
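The three abstract operators named on this slide can be sketched in Python over rows held as dictionaries. This is a minimal illustration, not the compiler's actual operator set. The function names, the sample rows, and the reading of FlatAggregate as "aggregate per group, then join the result back onto every row" are assumptions.

```python
import re
from collections import defaultdict


def map_op(rows, column, fn):
    """Map: add one derived column to every row independently."""
    return [{**row, column: fn(row)} for row in rows]


def filter_op(rows, predicate):
    """Filter: keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def flat_aggregate_mean(rows, column, value_of, group_by):
    """FlatAggregate (assumed meaning): per-group mean joined back onto each row."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[g] for g in group_by)].append(value_of(row))
    means = {key: sum(vals) / len(vals) for key, vals in groups.items()}
    return [{**row, column: means[tuple(row[g] for g in group_by)]} for row in rows]


# Hypothetical sample data.
cars = [
    {"make": "honda", "model": "nsx", "price": 60000, "text": "a1 b22"},
    {"make": "honda", "model": "nsx", "price": 80000, "text": "none"},
    {"make": "honda", "model": "civic", "price": 20000, "text": "7"},
]
counted = map_op(cars, "count", lambda r: len(re.findall(r"[0-9]+", r["text"])))
with_mean = flat_aggregate_mean(cars, "mean_price", lambda r: r["price"], ["make", "model"])
kept = filter_op(cars, lambda r: r["make"] == "honda" and r["price"] > 50000)
```

Each operator takes rows and returns rows, which is what lets a back end run them in parallel over partitions of the data.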
• Map a set of abstract model operators to a concrete physical plan
➔ “target lowering” in compiler terms
• Template rewriting
➔ Load + Splitrows → Hadoop RecordReader
Abstract → Concrete
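Template rewriting of this kind can be sketched as a pass that replaces a matching run of abstract operators with a target-specific one. The Python below is a simplified illustration: representing a plan as a flat list of operator names, and the names `REWRITE_RULES` and `lower`, are assumptions. The only rule shown is the Load + Splitrows example from the slide.

```python
# Each rule maps a sequence of abstract operators to its concrete replacement.
REWRITE_RULES = [
    (("Load", "Splitrows"), ("RecordReader",)),
]


def lower(plan, rules=REWRITE_RULES):
    """Rewrite an abstract plan (list of operator names) for one target."""
    lowered = []
    i = 0
    while i < len(plan):
        for pattern, replacement in rules:
            if tuple(plan[i:i + len(pattern)]) == pattern:
                lowered.extend(replacement)
                i += len(pattern)
                break
        else:
            # No template matched at this position: keep the operator as-is.
            lowered.append(plan[i])
            i += 1
    return lowered


print(lower(["Load", "Splitrows", "Map", "Aggregate", "Store"]))
# ['RecordReader', 'Map', 'Aggregate', 'Store']
```

A different target would supply its own rule list while the abstract plan stays the same.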
Compiling to LLVM as a target
[Diagram: Wrangle DSL → Abstract Machine Model → Code Generation (Operators, Expressions) → LLVM IR → Executable]
Summing up
Scripts
• Easy to experiment with & iterate
• Can capture a trail
DSL
• High-level notation for domain semantics
• Separate the “what” from the “how”
• Switch out implementations
• Constrain search space
Performance & Portability
• Pick a good abstract machine model
• Target independence is important
• Target-specific optimizations are important
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Zain Asgar & Seshadri Mahalingam of Trifacta

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Zain Asgar & Seshadri Mahalingam of Trifacta

  • 1. Domain Specific Languages for Data Wrangling Zain Asgar & Seshadri Mahalingam
  • 2. Data Analysis: Where does the time go? Storage & Processing Analytics
  • 3. The Data Wrangling Problem Business System Data Machine Generated Data Log Data Data Visualization Fraud Detection Recommendations ... ... DATA WRANGLING
  • 4. Isn’t this just a programming problem? Source: https://github.com/rory/apache-log-parser ... Excerpt from an apache log parser:
  • 5. • Tend to be one-offs
        • Tied to a specific platform
        • Collaboration across IT and LOB difficult
        • Burden of specificity
        Data ↔ Script ↔ Implementation
  • 6. Redefining scripts
        What’s good about scripts?
        • Easy to experiment with & iterate
        • Can capture a trail
        Capture the good, constrain the bad
  • 7. Domain Specific Languages
        • High-level notation for domain semantics
        • Separate the “what” from the “how”
        • Switch out implementations
        Objectives*:
        • Productivity
        • Portability
        • Performance
        * http://web.stanford.edu/class/cs442/lectures_unrestricted/cs442-dsldesign.pdf
  • 8. Wrangle, a DSL for Data Transformation
        • Operators
        • Parameters, Expressions
        countpattern col: text on: `{digit}+`
        derive value: mean(price) group: make, model
        keep row: (make == 'honda') && (price > 50000)
        merge col: make, model with: ','
  • 9. Text Matching Constructs
        {delim} -> [:,\s|/-.]
        {delim-ws} -> (?:\s+[:,|/-.]\s*|\s*[:,|/-.]\s+)
        {number} -> (?:[-+]?(?:[0-9]+|[0-9]{1,3}(?:,[0-9]{3})*)(?:.[0-9]+)? ...
        {hex} -> [a-fA-F0-9]*(?:[a-fA-F][0-9]|[0-9][a-fA-F])[a-fA-F0-9]*
        {ip-address} -> (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]).){3} ...
        {email} -> [a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)* ...
        {url} -> ^(?:(https?|ftp)://)?(?:(?:([a-zA-Z0-9-._?,/+&%#!=~]+):([a-zA- ...
        {time} -> (?:0?[0-9]|1[0-9]|2[0-3]):(?:0?[0-9]|[1-5][0-9])(?:.[0-9]{3}|...
        {bool} -> (?:[tfTFyYnN01]|True|False| …
        extract col: text after: `{delim}` on: `{ip-address}`
  • 10. Easier to programmatically synthesize
        • Search space is prunable:
          ➔ Space of available verbs is finite
          ➔ Expression search space is large, but easier to prune with context
        • Possible to build high-level interfaces
        • Combination of user actions and data help infer DSL statements
  • 11. DSL “Gotchas”
        • Restrictive: only captures what language designers included
          ➔ Language designers need domain expertise
          ➔ ...and system expertise
        • Requires scaffolding and infrastructure
        • It’s still code
  • 12. Alternative: a data-centric view The Code:
  • 13. Demo
  • 14. from collections import defaultdict
        import fileinput
        import re

        EXTRACT_USERNAME_REGEX = re.compile(r'(<)((?:(?!<).)*?)(>)', re.DOTALL | re.UNICODE)
        HOUR_EXTRACTION_REGEX = re.compile(r'()([0-9]{2})()', re.DOTALL | re.UNICODE)

        def main():
            username_hour_counts = defaultdict(lambda: 0)
            for line in fileinput.input():
                after_split = line.split(' ', 3)
                matches = re.match(EXTRACT_USERNAME_REGEX, after_split[3])
                username = matches.group(2)
                hour_matches = re.match(HOUR_EXTRACTION_REGEX, after_split[2])
                hour = hour_matches.group(2)
                username_hour_counts[(username, hour)] += 1
            for (username, hour), count in username_hour_counts.iteritems():
                print '{0},{1},{2}'.format(username, hour, count)

        if __name__ == '__main__':
            main()
  • 15. register /Users/seshadri/trifacta/pig-udfs/build/libs/pig-udfs-bundle.jar;
        register /Users/seshadri/trifacta/services/python-udf-service/PyUDFService/udfs/libs/udfs.jar;
        original_93_1 = LOAD 'hdfs://hdfs-namenode.dockerdomain:8020/trifacta/uploads/1/38d52b37-e7bc-49c8-85d4-3f5993b9b0d3/2009-09-01.txt'
            USING TrifactaStorage('\n', '--maxRecordLength 1048576') AS column1:chararray;
        DEFINE SplitUDF1 SplitUDF('()( )()', '3', 'false', '', '2');
        cleaned_table = FOREACH original_93_1 GENERATE flatten(SplitUDF1($0))
            AS (column2:chararray, column3:chararray, column4:chararray, column5:chararray);
        DEFINE ExtractUDF1 ExtractUDF('(<)((?:(?!<).)*?)(>)', '1', 'false', '', '2');
        cleaned_table_1 = FOREACH cleaned_table GENERATE $0 AS column2:chararray, $1 AS column3:chararray,
            $2 AS column4:chararray, $3 AS column5:chararray, flatten(ExtractUDF1($3)) AS column1:chararray;
        DEFINE ExtractUDF2 ExtractUDF('()([0-9]{2})()', '1', 'false', '', '2');
        cleaned_table_2 = FOREACH cleaned_table_1 GENERATE $0 AS column2:chararray, $1 AS column3:chararray,
            $2 AS column4:chararray, flatten(ExtractUDF2($2)) AS column6:chararray, $3 AS column5:chararray,
            $4 AS column1:chararray;
        cleaned_table_3_1 = GROUP cleaned_table_2 BY ($3, $5);
        DEFINE VectorizedAggregateUDF1 VectorizedAggregateUDF('[{"count":1,"agg":"COUNT_STAR","names":["row_count"],"args":[]}]');
        cleaned_table_3_1 = FOREACH cleaned_table_3_1 GENERATE flatten(group) AS (column6:chararray, column1:chararray),
            flatten(VectorizedAggregateUDF1(cleaned_table_2));
        DEFINE TypeCastUDF1 TypeCastUDF('{"testCase":{"regexes":["^-?\d{1,16}$"]},"name":"Long","fullPath":["Integer"],"stripWhitespace":false}');
        DEFINE TypeCastUDF2 TypeCastUDF('{"testCase":{},"name":"String","fullPath":["String"],"stripWhitespace":false}');
        cleaned_table_3 = FOREACH cleaned_table_3_1 GENERATE TypeCastUDF1($0) AS column6:long,
            TypeCastUDF2($1) AS column1:chararray, TypeCastUDF1($2) AS row_count:long;
        STORE cleaned_table_3 INTO 'hdfs://hdfs-namenode.dockerdomain:8020/trifacta/queryResults/seshadri/Web_Chat_Log/36/cleaned_table_3.json'
            USING TrifactaJsonStorage();
  • 16. splitrows col: column1 on: '\n'
        split col: column1 on: ' ' limit: 3
        extract col: column5 after: `<` before: `>`
        extract col: column4 on: `{digit}{2}`
        aggregate value: count() group: column6,column1
  • 17. Performance & Portability from a DSL
        • Build compilers to exploit separation of intent from implementation
        • Target-independent representation
        Diagram: Wrangle DSL → Abstract Machine Model → Targets (several), spanning Intent to Implementation
  • 18. Abstract Machine Model
        • Traditional compilers have a “generalized processor” worldview
        • We need a data-parallel system (as opposed to a control-flow system)
          ➔ Similar to a database
        • Operators: Map (ForEach), Aggregate, Window, Join, Sort, Load, Store, …
        countpattern col: text on: `{digit}+` → Map
        derive value: mean(price) group: make, model → FlatAggregate(Map + Join)
        keep row: (make == 'honda') && (price > 50000) → Filter
  • 19. Abstract → Concrete
        • Map a set of abstract model operators to a concrete physical plan
          ➔ “target lowering” in compiler terms
        • Template rewriting
          ➔ Load + Splitrows → Hadoop RecordReader
  • 20. Compiling to LLVM as a target
        Diagram: Wrangle DSL → Abstract Machine Model → Code Generation (Operators, Expressions) → LLVM IR → Executable
  • 21. Summing up
        Scripts
        • Easy to experiment with & iterate
        • Can capture a trail
        DSL
        • High-level notation for domain semantics
        • Separate the “what” from the “how”
        • Switch out implementations
        • Constrain search space
        Performance & Portability
        • Pick a good abstract machine model
        • Target independence is important
        • Target-specific optimizations are important
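
To make the path from slide 8 to slide 18 concrete, here is a minimal Python 3 sketch of how a Wrangle-style statement might be split into a verb and named parameters, then mapped to an abstract operator. This is not Trifacta's parser: the names `Statement`, `parse_statement`, and `abstract_operator` are hypothetical, the "lowercase word followed by a colon" rule for parameter names is an assumption made for illustration, and the verb-to-operator table only covers the three pairings that slide 18 appears to show.

```python
import re
from typing import Dict, NamedTuple


class Statement(NamedTuple):
    verb: str               # e.g. "keep"
    params: Dict[str, str]  # parameter name -> raw, unparsed text


# Assumption (not from the talk): a parameter name is a lowercase word
# immediately followed by ": ".
_PARAM_NAME = re.compile(r"\b([a-z]+):\s")

# Verb -> abstract operator, as read off slide 18. Only these three
# pairings appear there; the "derive" entry is for the grouped form shown.
ABSTRACT_OPERATOR = {
    "countpattern": "Map",
    "derive": "FlatAggregate(Map + Join)",
    "keep": "Filter",
}


def parse_statement(line: str) -> Statement:
    """Split one statement into its verb and a name -> text parameter map."""
    verb, _, rest = line.strip().partition(" ")
    # With a capture group, re.split returns
    # [prefix, name1, value1, name2, value2, ...].
    pieces = _PARAM_NAME.split(rest)
    names, values = pieces[1::2], pieces[2::2]
    return Statement(verb, {n: v.strip() for n, v in zip(names, values)})


def abstract_operator(stmt: Statement) -> str:
    """Look up the abstract operator for a verb; reject unknown verbs."""
    try:
        return ABSTRACT_OPERATOR[stmt.verb]
    except KeyError:
        raise ValueError(f"no abstract operator known for verb {stmt.verb!r}")


if __name__ == "__main__":
    script = [
        "countpattern col: text on: `{digit}+`",
        "derive value: mean(price) group: make, model",
        "keep row: (make == 'honda') && (price > 50000)",
    ]
    for line in script:
        stmt = parse_statement(line)
        print(stmt.verb, stmt.params, "->", abstract_operator(stmt))
```

A real front end would parse the expression text (for example `mean(price)`) into an AST instead of keeping it as a string; the sketch stops at the statement level to show that the set of verbs is small and finite, which is the property slide 10 relies on.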

Editor's Notes

  1. So many possible inputs. So many possible outputs. There is no single algorithm that does this – the idea doesn’t even make sense given the need for human domain knowledge. So – at the end of the day, Data Transformation is a programming problem. If we don’t tame this problem, the flow of data is bottlenecked at the experts.
  2. Productivity ←→ Simplification SQL, VBA, Lex/Yacc, LaTeX, HTML, OpenGL, Bash, Datalog, …
  3. At the end of the day it’s all about the user. Their task is wrangling data, not constructing a script. We’d like to help them by generating our language constructs, when they indicate what they want to do. Using the DSL as the target of our program synthesis, ... predictive interaction
  4. Scaffolding & Infrastructure: We expose, display and accept the language in the interface as strings, so we need to build a parser & ASTs. Backing implementations have a common support footprint
  5. Now that we’ve looked at the productivity aspect, going to hand it to Zain to talk about portability -> Why is it important? History of Hadoop slide; we’ve always been talking about a new thing
  6. We understand the advantages of using a DSL. How do we go about implementing one?
  7. No surprise that our operators resemble relational operators
  8. Compile queries into native code instead of interpreting the query at each stage
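
The difference between interpreting and compiling in note 8 can be sketched in a few lines of Python 3. This is only an analogy: the talk compiles to LLVM IR, whereas the sketch below "compiles" an expression tree into nested Python closures. The tuple-based expression encoding and the names `interpret` and `compile_expr` are hypothetical. The point it illustrates is that the interpreter re-inspects the tree for every row, while the compiled form walks the tree once and then only runs the resulting function.

```python
import operator
from typing import Any, Callable, Dict, Tuple

Row = Dict[str, Any]
Expr = Tuple[Any, ...]  # ("col", name) | ("lit", value) | (op, left, right)

# Binary operators used by the example. "&&" here evaluates both sides
# (no short-circuiting), which is enough for a sketch.
OPS: Dict[str, Callable[[Any, Any], Any]] = {
    "==": operator.eq,
    ">": operator.gt,
    "&&": lambda a, b: bool(a) and bool(b),
}


def interpret(expr: Expr, row: Row) -> Any:
    """Walk the expression tree again for every row."""
    tag = expr[0]
    if tag == "col":
        return row[expr[1]]
    if tag == "lit":
        return expr[1]
    return OPS[tag](interpret(expr[1], row), interpret(expr[2], row))


def compile_expr(expr: Expr) -> Callable[[Row], Any]:
    """Walk the tree once and return a function specialized to it."""
    tag = expr[0]
    if tag == "col":
        name = expr[1]
        return lambda row: row[name]
    if tag == "lit":
        value = expr[1]
        return lambda row: value
    fn = OPS[tag]
    left, right = compile_expr(expr[1]), compile_expr(expr[2])
    return lambda row: fn(left(row), right(row))


# keep row: (make == 'honda') && (price > 50000), hand-translated to a tree.
KEEP: Expr = (
    "&&",
    ("==", ("col", "make"), ("lit", "honda")),
    (">", ("col", "price"), ("lit", 50000)),
)

ROWS = [
    {"make": "honda", "price": 62000},
    {"make": "honda", "price": 18000},
    {"make": "bmw", "price": 70000},
]

keep = compile_expr(KEEP)             # tree is analyzed once, up front
kept = [r for r in ROWS if keep(r)]   # per-row work is just the call
```

Both paths produce the same rows; only where the dispatch cost is paid differs. A native-code backend takes the same idea further by emitting machine instructions for the specialized function.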