SlideShare a Scribd company logo
Sim Simeonov, Founder/CTO, Swoop
Goal-Based Data Production
@simeons #EUent1
Hi, I’m @simeons.
I like software magic.
http://bit.ly/spark-records
a data-powered culture supported by
petabyte-scale data engineering & ML/AI
@simeons #EUent1 4
data-poweredculture
AHA ML AI
Ad-Hoc
Analytics
data-poweredculture
AHA ML AI
new data
new queries
biz users
data scientists
82%
agile data production
data-poweredculture
AHA ML AI
new data
new queries
biz users
data scientists
9@simeons #EUent1
Data-Powered
Culture
Spark
Goal-Based Data Production
SparkDBMS
BI Tools
Agile Data Production
SQL/DSL
agile data production request
top 10 campaigns by gross revenue
running on health sites,
weekly for the past 2 complete weeks,
with cost-per-click and
the click-through rate
@simeons #EUent1 10
@simeons #EUent1 11
with
origins as (select origin_id, site_vertical from dimension_origins where site_vertical = 'health'),
t1 as (
select campaign_id, origin_id, clicks, views, billing,
date(to_utc_timestamp(date_sub(to_utc_timestamp(from_unixtime(uts), 'America/New_York'),
cast(date_format(to_utc_timestamp(from_unixtime(uts), 'America/New_York'), 'u') as int) - 1), 'America/New_York')) as ts
from rep_campaigns_daily
where from_unixtime(uts) >= to_utc_timestamp(date_sub(to_utc_timestamp(current_date(), 'America/New_York'),
cast(date_format(to_utc_timestamp(current_date(), 'America/New_York'), 'u') as int) - 1) - INTERVAL 2 WEEK,
'America/New_York')
and from_unixtime(uts) < to_utc_timestamp(date_sub(to_utc_timestamp(current_date(), 'America/New_York'),
cast(date_format(to_utc_timestamp(current_date(), 'America/New_York'), 'u') as int) - 1), 'America/New_York')
),
t2 as (
select ts, campaign_id, sum(views) as views, sum(clicks) as clicks, sum(billing) / 1000000.0 as gross_revenue
from t1 lhs join origins rhs on lhs.origin_id = rhs.origin_id
group by campaign_id, ts
),
t3 as (select *, rank() over (partition by ts order by gross_revenue desc) as rank from t2),
t4 as (select * from t3 where rank <= 10)
select
ts, rank, campaign_short_name as campaign,
bround(gross_revenue) as gross_revenue,
format_number(if(clicks = 0, 0, gross_revenue / clicks), 3) as cpc,
format_number(if(views = 0, 0, 100 * clicks / views), 3) as pct_ctr
from t4 lhs join dimension_campaigns rhs on lhs.campaign_id = rhs.campaign_id
order by ts, rank
@simeons #EUent1 12
Spark DSL is equally complex
spark.table("rep_campaigns_daily")
.where("""
from_unixtime(uts) >= to_utc_timestamp(date_sub(to_utc_timestamp(current_date(), 'America/New_York'),
cast(date_format(to_utc_timestamp(current_date(), 'America/New_York'), 'u') as int) - 1) - INTERVAL 2 WEEK, 'America/New_York')
and from_unixtime(uts) < to_utc_timestamp(date_sub(to_utc_timestamp(current_date(), 'America/New_York'),
cast(date_format(to_utc_timestamp(current_date(), 'America/New_York'), 'u') as int) - 1), 'America/New_York')""")
.join(spark.table("dimension_origins").where('site_vertical === "health").select('origin_id), "origin_id")
.withColumn("ts", expr("date(to_utc_timestamp(date_sub(to_utc_timestamp(from_unixtime(uts), 'America/New_York'),
cast(date_format(to_utc_timestamp(from_unixtime(uts), 'America/New_York'), 'u') as int) - 1), 'America/New_York'))"))
.groupBy('campaign_id, 'ts)
.agg(
sum('views).as("views"),
sum('clicks).as("clicks"),
(sum('billing) / 1000000).as("gross_revenue")
)
.withColumn("rank", rank.over(Window.partitionBy('ts).orderBy('gross_revenue.desc)))
.where('rank <= 10)
.join(spark.table("dimension_campaigns").select('campaign_id, 'campaign_short_name.as("campaign")), "campaign_id")
.withColumn("gross_revenue", expr("bround(gross_revenue)"))
.withColumn("cpc", format_number(when('clicks === 0, 0).otherwise('gross_revenue / 'clicks), 3))
.withColumn("pct_ctr", format_number(when('views === 0, 0).otherwise(lit(100) * 'clicks / 'views), 3))
.select('ts, 'rank, 'campaign, 'gross_revenue, 'cpc, 'pct_ctr)
.orderBy('ts, 'rank)
@simeons #EUent1 13
Humans Spark
40+ year old abstraction
• powerful… but
• complex
• fragile
• inflexible
agile data production
SQL
traditional abstractions fail
BI tools
slow to add new
schema/data
SQL views
build only upon
existing schema
UDFs
very limited
Humans Spark
what I want
(implicit context)
what you want
how to produce it
(explicit context)
the curse of generality: verbosity
what (5%) vs. how (95%)
@simeons #EUent1 18
the curse of generality:
repetition
-- calculate click-through rate, %
format_number(if(views = 0, 0, 100 * clicks / views), 3) as pct_ctr
-- join to get campaign names from campaign IDs
SELECT campaign_short_name AS campaign, ...
FROM ... lhs
JOIN dimension_campaigns rhs
ON lhs.campaign_id = rhs.campaign_id
@simeons #EUent1 19
the curse of generality: inflexibility
-- code depends on time column datatype, format & timezone
date(to_utc_timestamp(
date_sub(
to_utc_timestamp(from_unixtime(uts), 'America/New_York'),
cast(
date_format(
to_utc_timestamp(from_unixtime(uts), 'America/New_York'),
'u')
as int) - 1),
'America/New_York'))
@simeons #EUent1 20
over time, generality creates
massive technical debt
Humans Spark
what I want
data
just-in-time
generative
context
goal-based data production
DataProductionRequest()
.select('campaign.top(10).by('gross_revenue), 'cpc, 'ctr)
.where("health sites")
.weekly.time("past 2 complete weeks")
.humanize
@simeons #EUent1 23
the DSL is not as important as the
processing & configuration models
@simeons #EUent1 24
Data Production
Request
Data Production
Goal
Data Production
Rule
@simeons #EUent1 25
Request Goal
Rule
dpr.addGoal(
ResultColumn("cpc")
)
@simeons #EUent1 26
Request Goal
Rule
// .time("past 2 complete weeks")
dpr.addProductionRule(
TimeFilter(CompleteWeeks(ref = Now, beg = -2)),
at = SourceFiltering // production stage
)
@simeons #EUent1 27
Request Goal
Rule
// .time("past 2 complete weeks")
dpr.addProductionRule(
TimeFilter(CompleteWeeks(ref = Now, beg = -2)),
at = SourceFiltering // production stage
)
Requirement
SourceColumn(
role = TimeDimension()
)
@simeons #EUent1 28
Request Goal
Rule
// .time("past 2 complete weeks")
dpr.addProductionRule(
TimeFilter(CompleteWeeks(ref = Now, beg = -2)),
at = SourceFiltering // production stage
)
Requirement
Prop("time.week.start")
Prop("time.tz.source")
Prop("time.tz.reporting")
@simeons #EUent1 29
Rule
Requirement Context
Resolution
satisfying
partial resolution rewrites
@simeons #EUent1 30
Rule
Requirement Context
Resolution Dataset
satisfying
partial resolution rewrites
full resolution builds
transformation DAG
@simeons #EUent1 31
modern approach to metadata
• Minimal, just-in-time context
• Generative rules
@simeons #EUent1 32
context is a metadata query
smartDataWarehouse.context
.release("3.1.2")
.merge(_.tagged("sim.ml.experiments"))
@simeons #EUent1 33
mix production context with personal context
automatic joins by column name
{
"table": "dimension_campaigns",
"primary_key": ["campaign_id"]
"joins": [ {"fields": ["campaign_id"]} ]
}
@simeons #EUent1 34
join any table with a campaign_id column
automatic semantic joins
{
"table": "dimension_campaigns",
"primary_key": ["campaign_id"]
"joins": [ {"fields": ["ref:swoop.campaign.id"]} ]
}
join any table with a Swoop campaign ID
@simeons #EUent1 35
calculated columns
{
"field": "ctr",
"description": "click-through rate",
"expr": "if(nvl(views,0)=0,0,nvl(clicks,0)/nvl(views,0))"
...
}
automatically available to matching tables
@simeons #EUent1 36
humanization
{
"field": "ctr",
"humanize": {
"expr": "format_number(100 * value, 3) as pct_ctr"
}
}
change column name as units change
@simeons #EUent1 37
optimization hints
{
"field": "campaign",
"unique": true
...
}
allows join after groupBy as opposed to before
@simeons #EUent1 38
automatic source & join selection
• 14+ Swoop tables can satisfy the request
– 100+x faster if one of two good choices are picked
• Execution statistics => smarter heuristics
– Natural partner to Databricks Delta
@simeons #EUent1 39
in case of insufficient context,
add more in real-time
10x easier agile data production
val df = DataProductionRequest()
.select('campaign.top(10).by('gross_revenue), 'cpc, 'ctr)
.where("health sites")
.weekly.time("past 2 complete weeks")
.humanize
.toDF // result is a normal dataframe/dataset
@simeons #EUent1 41
revolutionary benefits
• Self-service for data scientists & many biz users
• Increased productivity & flexibility
• Improved performance & lower costs
• Easy collaboration (even across companies!)
@simeons #EUent1 42
At Swoop, we are committed to making
goal-based data production
the best free & open-source
interface for Spark agile data production
Interested in contributing?
Email spark@swoop.com
@simeons #EUent1 43
Join our open-source work.
Email spark@swoop.com
More Spark magic: http://bit.ly/spark-records

More Related Content

Similar to Goal Based Data Production with Sim Simeonov

R (Shiny Package) - UI Side Script for Decision Support System
R (Shiny Package) - UI Side Script for Decision Support SystemR (Shiny Package) - UI Side Script for Decision Support System
R (Shiny Package) - UI Side Script for Decision Support SystemMaithreya Chakravarthula
 
Machine Learning: je m'y mets demain!
Machine Learning: je m'y mets demain!Machine Learning: je m'y mets demain!
Machine Learning: je m'y mets demain!
Louis Dorard
 
PyCon SG x Jublia - Building a simple-to-use Database Management tool
PyCon SG x Jublia - Building a simple-to-use Database Management toolPyCon SG x Jublia - Building a simple-to-use Database Management tool
PyCon SG x Jublia - Building a simple-to-use Database Management tool
Crea Very
 
Company segmentation - an approach with R
Company segmentation - an approach with RCompany segmentation - an approach with R
Company segmentation - an approach with R
Casper Crause
 
U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)
Michael Rys
 
Personalized news recommendation engine
Personalized news recommendation enginePersonalized news recommendation engine
Personalized news recommendation engine
Prateek Sachdev
 
Data Warehousing
Data WarehousingData Warehousing
Final project kijtorntham n
Final project kijtorntham nFinal project kijtorntham n
Final project kijtorntham n
NatsarankornKijtornt
 
OSA Con 2022 - Building Event Collection SDKs and Data Models - Paul Boocock ...
OSA Con 2022 - Building Event Collection SDKs and Data Models - Paul Boocock ...OSA Con 2022 - Building Event Collection SDKs and Data Models - Paul Boocock ...
OSA Con 2022 - Building Event Collection SDKs and Data Models - Paul Boocock ...
Altinity Ltd
 
Text mining and social network analysis of twitter data part 1
Text mining and social network analysis of twitter data part 1Text mining and social network analysis of twitter data part 1
Text mining and social network analysis of twitter data part 1
Johan Blomme
 
Data Modeling for IoT and Big Data
Data Modeling for IoT and Big DataData Modeling for IoT and Big Data
Data Modeling for IoT and Big Data
Jayesh Thakrar
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
Andrew Lamb
 
Historical Finance Data
Historical Finance DataHistorical Finance Data
Historical Finance Data
JEE HYUN PARK
 
Java y Ciencia de Datos - Java Day Ecuador
Java y Ciencia de Datos - Java Day EcuadorJava y Ciencia de Datos - Java Day Ecuador
Java y Ciencia de Datos - Java Day Ecuador
jorgaf
 
Youth Tobacco Survey Analysis
Youth Tobacco Survey AnalysisYouth Tobacco Survey Analysis
Youth Tobacco Survey Analysis
Roshik Ganesan
 
GeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony FoxGeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony Fox
Databricks
 
Data visualization in python/Django
Data visualization in python/DjangoData visualization in python/Django
Data visualization in python/Django
kenluck2001
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with R
Yanchang Zhao
 
Building analytics applications with streaming expressions in apache solr
Building analytics applications with streaming expressions in apache solrBuilding analytics applications with streaming expressions in apache solr
Building analytics applications with streaming expressions in apache solr
Amrit Sarkar
 
Data infrastructure for the other 90% of companies
Data infrastructure for the other 90% of companiesData infrastructure for the other 90% of companies
Data infrastructure for the other 90% of companies
Martin Loetzsch
 

Similar to Goal Based Data Production with Sim Simeonov (20)

R (Shiny Package) - UI Side Script for Decision Support System
R (Shiny Package) - UI Side Script for Decision Support SystemR (Shiny Package) - UI Side Script for Decision Support System
R (Shiny Package) - UI Side Script for Decision Support System
 
Machine Learning: je m'y mets demain!
Machine Learning: je m'y mets demain!Machine Learning: je m'y mets demain!
Machine Learning: je m'y mets demain!
 
PyCon SG x Jublia - Building a simple-to-use Database Management tool
PyCon SG x Jublia - Building a simple-to-use Database Management toolPyCon SG x Jublia - Building a simple-to-use Database Management tool
PyCon SG x Jublia - Building a simple-to-use Database Management tool
 
Company segmentation - an approach with R
Company segmentation - an approach with RCompany segmentation - an approach with R
Company segmentation - an approach with R
 
U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)
 
Personalized news recommendation engine
Personalized news recommendation enginePersonalized news recommendation engine
Personalized news recommendation engine
 
Data Warehousing
Data WarehousingData Warehousing
Data Warehousing
 
Final project kijtorntham n
Final project kijtorntham nFinal project kijtorntham n
Final project kijtorntham n
 
OSA Con 2022 - Building Event Collection SDKs and Data Models - Paul Boocock ...
OSA Con 2022 - Building Event Collection SDKs and Data Models - Paul Boocock ...OSA Con 2022 - Building Event Collection SDKs and Data Models - Paul Boocock ...
OSA Con 2022 - Building Event Collection SDKs and Data Models - Paul Boocock ...
 
Text mining and social network analysis of twitter data part 1
Text mining and social network analysis of twitter data part 1Text mining and social network analysis of twitter data part 1
Text mining and social network analysis of twitter data part 1
 
Data Modeling for IoT and Big Data
Data Modeling for IoT and Big DataData Modeling for IoT and Big Data
Data Modeling for IoT and Big Data
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
 
Historical Finance Data
Historical Finance DataHistorical Finance Data
Historical Finance Data
 
Java y Ciencia de Datos - Java Day Ecuador
Java y Ciencia de Datos - Java Day EcuadorJava y Ciencia de Datos - Java Day Ecuador
Java y Ciencia de Datos - Java Day Ecuador
 
Youth Tobacco Survey Analysis
Youth Tobacco Survey AnalysisYouth Tobacco Survey Analysis
Youth Tobacco Survey Analysis
 
GeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony FoxGeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony Fox
 
Data visualization in python/Django
Data visualization in python/DjangoData visualization in python/Django
Data visualization in python/Django
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with R
 
Building analytics applications with streaming expressions in apache solr
Building analytics applications with streaming expressions in apache solrBuilding analytics applications with streaming expressions in apache solr
Building analytics applications with streaming expressions in apache solr
 
Data infrastructure for the other 90% of companies
Data infrastructure for the other 90% of companiesData infrastructure for the other 90% of companies
Data infrastructure for the other 90% of companies
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
 

Recently uploaded

Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
theahmadsaood
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
AlejandraGmez176757
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
alex933524
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 

Recently uploaded (20)

Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 

Goal Based Data Production with Sim Simeonov

  • 1. Sim Simeonov, Founder/CTO, Swoop Goal-Based Data Production @simeons #EUent1
  • 2. Hi, I’m @simeons. I like software magic.
  • 4. a data-powered culture supported by petabyte-scale data engineering & ML/AI @simeons #EUent1 4
  • 6. data-poweredculture AHA ML AI new data new queries biz users data scientists
  • 7. 82%
  • 8. agile data production data-poweredculture AHA ML AI new data new queries biz users data scientists
  • 9. 9@simeons #EUent1 Data-Powered Culture Spark Goal-Based Data Production SparkDBMS BI Tools Agile Data Production SQL/DSL
  • 10. agile data production request top 10 campaigns by gross revenue running on health sites, weekly for the past 2 complete weeks, with cost-per-click and the click-through rate @simeons #EUent1 10
  • 12. with origins as (select origin_id, site_vertical from dimension_origins where site_vertical = 'health'), t1 as ( select campaign_id, origin_id, clicks, views, billing, date(to_utc_timestamp(date_sub(to_utc_timestamp(from_unixtime(uts), 'America/New_York'), cast(date_format(to_utc_timestamp(from_unixtime(uts), 'America/New_York'), 'u') as int) - 1), 'America/New_York')) as ts from rep_campaigns_daily where from_unixtime(uts) >= to_utc_timestamp(date_sub(to_utc_timestamp(current_date(), 'America/New_York'), cast(date_format(to_utc_timestamp(current_date(), 'America/New_York'), 'u') as int) - 1) - INTERVAL 2 WEEK, 'America/New_York') and from_unixtime(uts) < to_utc_timestamp(date_sub(to_utc_timestamp(current_date(), 'America/New_York'), cast(date_format(to_utc_timestamp(current_date(), 'America/New_York'), 'u') as int) - 1), 'America/New_York') ), t2 as ( select ts, campaign_id, sum(views) as views, sum(clicks) as clicks, sum(billing) / 1000000.0 as gross_revenue from t1 lhs join origins rhs on lhs.origin_id = rhs.origin_id group by campaign_id, ts ), t3 as (select *, rank() over (partition by ts order by gross_revenue desc) as rank from t2), t4 as (select * from t3 where rank <= 10) select ts, rank, campaign_short_name as campaign, bround(gross_revenue) as gross_revenue, format_number(if(clicks = 0, 0, gross_revenue / clicks), 3) as cpc, format_number(if(views = 0, 0, 100 * clicks / views), 3) as pct_ctr from t4 lhs join dimension_campaigns rhs on lhs.campaign_id = rhs.campaign_id order by ts, rank @simeons #EUent1 12
  • 13. Spark DSL is equally complex spark.table("rep_campaigns_daily") .where(""" from_unixtime(uts) >= to_utc_timestamp(date_sub(to_utc_timestamp(current_date(), 'America/New_York'), cast(date_format(to_utc_timestamp(current_date(), 'America/New_York'), 'u') as int) - 1) - INTERVAL 2 WEEK, 'America/New_York') and from_unixtime(uts) < to_utc_timestamp(date_sub(to_utc_timestamp(current_date(), 'America/New_York'), cast(date_format(to_utc_timestamp(current_date(), 'America/New_York'), 'u') as int) - 1), 'America/New_York')""") .join(spark.table("dimension_origins").where('site_vertical === "health").select('origin_id), "origin_id") .withColumn("ts", expr("date(to_utc_timestamp(date_sub(to_utc_timestamp(from_unixtime(uts), 'America/New_York'), cast(date_format(to_utc_timestamp(from_unixtime(uts), 'America/New_York'), 'u') as int) - 1), 'America/New_York'))")) .groupBy('campaign_id, 'ts) .agg( sum('views).as("views"), sum('clicks).as("clicks"), (sum('billing) / 1000000).as("gross_revenue") ) .withColumn("rank", rank.over(Window.partitionBy('ts).orderBy('gross_revenue.desc))) .where('rank <= 10) .join(spark.table("dimension_campaigns").select('campaign_id, 'campaign_short_name.as("campaign")), "campaign_id") .withColumn("gross_revenue", expr("bround(gross_revenue)")) .withColumn("cpc", format_number(when('clicks === 0, 0).otherwise('gross_revenue / 'clicks), 3)) .withColumn("pct_ctr", format_number(when('views === 0, 0).otherwise(lit(100) * 'clicks / 'views), 3)) .select('ts, 'rank, 'campaign, 'gross_revenue, 'cpc, 'pct_ctr) .orderBy('ts, 'rank) @simeons #EUent1 13
  • 15. 40+ year old abstraction • powerful… but • complex • fragile • inflexible agile data production SQL
  • 16. traditional abstractions fail BI tools slow to add new schema/data SQL views build only upon existing schema UDFs very limited
  • 17. Humans Spark what I want (implicit context) what you want how to produce it (explicit context)
  • 18. the curse of generality: verbosity what (5%) vs. how (95%) @simeons #EUent1 18
  • 19. the curse of generality: repetition -- calculate click-through rate, % format_number(if(views = 0, 0, 100 * clicks / views), 3) as pct_ctr -- join to get campaign names from campaign IDs SELECT campaign_short_name AS campaign, ... FROM ... lhs JOIN dimension_campaigns rhs ON lhs.campaign_id = rhs.campaign_id @simeons #EUent1 19
  • 20. the curse of generality: inflexibility -- code depends on time column datatype, format & timezone date(to_utc_timestamp( date_sub( to_utc_timestamp(from_unixtime(uts), 'America/New_York'), cast( date_format( to_utc_timestamp(from_unixtime(uts), 'America/New_York'), 'u') as int) - 1), 'America/New_York')) @simeons #EUent1 20
  • 21. over time, generality creates massive technical debt
  • 22. Humans Spark what I want data just-in-time generative context
  • 23. goal-based data production DataProductionRequest() .select('campaign.top(10).by('gross_revenue), 'cpc, 'ctr) .where("health sites") .weekly.time("past 2 complete weeks") .humanize @simeons #EUent1 23
  • 24. the DSL is not as important as the processing & configuration models @simeons #EUent1 24
  • 25. Data Production Request Data Production Goal Data Production Rule @simeons #EUent1 25
  • 27. Request Goal Rule // .time("past 2 complete weeks") dpr.addProductionRule( TimeFilter(CompleteWeeks(ref = Now, beg = -2)), at = SourceFiltering // production stage ) @simeons #EUent1 27
  • 28. Request Goal Rule // .time("past 2 complete weeks") dpr.addProductionRule( TimeFilter(CompleteWeeks(ref = Now, beg = -2)), at = SourceFiltering // production stage ) Requirement SourceColumn( role = TimeDimension() ) @simeons #EUent1 28
  • 29. Request Goal Rule // .time("past 2 complete weeks") dpr.addProductionRule( TimeFilter(CompleteWeeks(ref = Now, beg = -2)), at = SourceFiltering // production stage ) Requirement Prop("time.week.start") Prop("time.tz.source") Prop("time.tz.reporting") @simeons #EUent1 29
  • 31. Rule Requirement Context Resolution Dataset satisfying partial resolution rewrites full resolution builds transformation DAG @simeons #EUent1 31
  • 32. modern approach to metadata • Minimal, just-in-time context • Generative rules @simeons #EUent1 32
  • 33. context is a metadata query smartDataWarehouse.context .release("3.1.2") .merge(_.tagged("sim.ml.experiments")) @simeons #EUent1 33 mix production context with personal context
  • 34. automatic joins by column name { "table": "dimension_campaigns", "primary_key": ["campaign_id"] "joins": [ {"fields": ["campaign_id"]} ] } @simeons #EUent1 34 join any table with a campaign_id column
  • 35. automatic semantic joins { "table": "dimension_campaigns", "primary_key": ["campaign_id"] "joins": [ {"fields": ["ref:swoop.campaign.id"]} ] } join any table with a Swoop campaign ID @simeons #EUent1 35
  • 36. calculated columns { "field": "ctr", "description": "click-through rate", "expr": "if(nvl(views,0)=0,0,nvl(clicks,0)/nvl(views,0))" ... } automatically available to matching tables @simeons #EUent1 36
  • 37. humanization { "field": "ctr", "humanize": { "expr": "format_number(100 * value, 3) as pct_ctr" } } change column name as units change @simeons #EUent1 37
  • 38. optimization hints { "field": "campaign", "unique": true ... } allows join after groupBy as opposed to before @simeons #EUent1 38
  • 39. automatic source & join selection • 14+ Swoop tables can satisfy the request – 100+x faster if one of two good choices are picked • Execution statistics => smarter heuristics – Natural partner to Databricks Delta @simeons #EUent1 39
  • 40. in case of insufficient context, add more in real-time
  • 41. 10x easier agile data production val df = DataProductionRequest() .select('campaign.top(10).by('gross_revenue), 'cpc, 'ctr) .where("health sites") .weekly.time("past 2 complete weeks") .humanize .toDF // result is a normal dataframe/dataset @simeons #EUent1 41
  • 42. revolutionary benefits • Self-service for data scientists & many biz users • Increased productivity & flexibility • Improved performance & lower costs • Easy collaboration (even across companies!) @simeons #EUent1 42
  • 43. At Swoop, we are committed to making goal-based data production the best free & open-source interface for Spark agile data production Interested in contributing? Email spark@swoop.com @simeons #EUent1 43
  • 44. Join our open-source work. Email spark@swoop.com More Spark magic: http://bit.ly/spark-records