Spark-Powered Smart Data Warehouse
Sim Simeonov, Founder & CTO, Swoop
Goal-Based Data Production
petabyte-scale data engineering & ML
run our business
What is a data-powered culture?
answer questions 10x easier with
goal-based data production
simple data production request
top 10 campaigns by gross revenue
running on health sites,
weekly for the past 2 complete weeks,
with cost-per-click and
the click-through rate
with
origins as (select origin_id, site_vertical from dimension_origins where site_vertical = 'health'),
t1 as (
select campaign_id, origin_id, clicks, views, billing,
date(to_utc_timestamp(date_sub(to_utc_timestamp(from_unixtime(uts), 'America/New_York'),
cast(date_format(to_utc_timestamp(from_unixtime(uts), 'America/New_York'), 'u') as int) - 1), 'America/New_York')) as ts
from rep_campaigns_daily
where from_unixtime(uts) >= to_utc_timestamp(date_sub(to_utc_timestamp(current_date(), 'America/New_York'),
cast(date_format(to_utc_timestamp(current_date(), 'America/New_York'), 'u') as int) - 1) - INTERVAL 2 WEEK,
'America/New_York')
and from_unixtime(uts) < to_utc_timestamp(date_sub(to_utc_timestamp(current_date(), 'America/New_York'),
cast(date_format(to_utc_timestamp(current_date(), 'America/New_York'), 'u') as int) - 1), 'America/New_York')
),
t2 as (
select ts, campaign_id, sum(views) as views, sum(clicks) as clicks, sum(billing) / 1000000.0 as gross_revenue
from t1 lhs join origins rhs on lhs.origin_id = rhs.origin_id
group by campaign_id, ts
),
t3 as (select *, rank() over (partition by ts order by gross_revenue desc) as rank from t2),
t4 as (select * from t3 where rank <= 10)
select
ts, rank, campaign_short_name as campaign,
bround(gross_revenue) as gross_revenue,
format_number(if(clicks = 0, 0, gross_revenue / clicks), 3) as cpc,
format_number(if(views = 0, 0, 100 * clicks / views), 3) as pct_ctr
from t4 lhs join dimension_campaigns rhs on lhs.campaign_id = rhs.campaign_id
order by ts, rank
DSL is equally complex
spark.table("rep_campaigns_daily")
.where("""
from_unixtime(uts) >= to_utc_timestamp(date_sub(to_utc_timestamp(current_date(), 'America/New_York'),
cast(date_format(to_utc_timestamp(current_date(), 'America/New_York'), 'u') as int) - 1) - INTERVAL 2 WEEK, 'America/New_York')
and from_unixtime(uts) < to_utc_timestamp(date_sub(to_utc_timestamp(current_date(), 'America/New_York'),
cast(date_format(to_utc_timestamp(current_date(), 'America/New_York'), 'u') as int) - 1), 'America/New_York')""")
.join(spark.table("dimension_origins").where('site_vertical === "health").select('origin_id), "origin_id")
.withColumn("ts", expr("date(to_utc_timestamp(date_sub(to_utc_timestamp(from_unixtime(uts), 'America/New_York'),
cast(date_format(to_utc_timestamp(from_unixtime(uts), 'America/New_York'), 'u') as int) - 1), 'America/New_York'))"))
.groupBy('campaign_id, 'ts)
.agg(
sum('views).as("views"),
sum('clicks).as("clicks"),
(sum('billing) / 1000000).as("gross_revenue")
)
.withColumn("rank", rank.over(Window.partitionBy('ts).orderBy('gross_revenue.desc)))
.where('rank <= 10)
.join(spark.table("dimension_campaigns").select('campaign_id, 'campaign_short_name.as("campaign")), "campaign_id")
.withColumn("gross_revenue", expr("bround(gross_revenue)"))
.withColumn("cpc", format_number(when('clicks === 0, 0).otherwise('gross_revenue / 'clicks), 3))
.withColumn("pct_ctr", format_number(when('views === 0, 0).otherwise(lit(100) * 'clicks / 'views), 3))
.select('ts, 'rank, 'campaign, 'gross_revenue, 'cpc, 'pct_ctr)
.orderBy('ts, 'rank)
Spark needs to know
what you want and
how to produce it
General data processing requires
detailed instructions every single time
the curse of generality: verbosity
what (5%) vs. how (95%)
the curse of generality: duplication
-- calculate click-through rate, %
format_number(if(views = 0, 0, 100 * clicks / views), 3) as pct_ctr
-- join to get campaign names from campaign IDs
SELECT campaign_short_name AS campaign, ...
FROM ... lhs
JOIN dimension_campaigns rhs
ON lhs.campaign_id = rhs.campaign_id
the curse of generality: complexity
-- weekly
date(to_utc_timestamp(
date_sub(
to_utc_timestamp(from_unixtime(uts), 'America/New_York'),
cast(
date_format(
to_utc_timestamp(from_unixtime(uts), 'America/New_York'),
'u')
as int) - 1),
'America/New_York'))
the curse of generality: inflexibility
-- code depends on time column datatype, format & timezone
date(to_utc_timestamp(
date_sub(
to_utc_timestamp(from_unixtime(uts), 'America/New_York'),
cast(
date_format(
to_utc_timestamp(from_unixtime(uts), 'America/New_York'),
'u')
as int) - 1),
'America/New_York'))
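To make the cost concrete, the weekly truncation above could at least be captured once as a helper in the Spark Scala DSL; even then the caller must still know the time column's datatype, format, and time zone. A minimal sketch (the helper name and signature are hypothetical):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.expr

// Hypothetical helper, not part of any shipped library: the weekly truncation
// logic is written once, for a Unix-seconds column and a reporting time zone.
def beginningOfWeek(unixSecondsCol: String, tz: String = "America/New_York"): Column =
  expr(
    s"""date(to_utc_timestamp(
       |  date_sub(
       |    to_utc_timestamp(from_unixtime($unixSecondsCol), '$tz'),
       |    cast(date_format(to_utc_timestamp(from_unixtime($unixSecondsCol), '$tz'), 'u') as int) - 1),
       |  '$tz'))""".stripMargin)

// e.g. df.withColumn("ts", beginningOfWeek("uts"))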
Can we keep what and toss how?
Can how become implicit context?
• Data sources, schema, join relationships
• Presentation & formatting
• Week start (Monday)
• Reporting time zone (East Coast)
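For instance, that context could be captured once as configuration instead of being restated in every query. A minimal sketch, with hypothetical names and shape:
// Hypothetical sketch of implicit production context; illustrative only,
// not the actual configuration model from the talk.
case class ProductionContext(
  weekStart:         String = "Monday",             // time.week.start
  sourceTimeZone:    String = "UTC",                // time.tz.source
  reportingTimeZone: String = "America/New_York",   // time.tz.reporting
  metadataPaths:     Seq[String] = Seq("metadata/") // where sources, schemas & join rules are described
)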
goal-based data production
DataProductionRequest()
.select('campaign.top(10).by('gross_revenue), 'cpc, 'ctr)
.where("health sites")
.weekly.time("past 2 complete weeks")
.humanize
let’s go behind the curtain
the DSL is not as important as the
processing & configuration models
Data Production Request
Data Production Goal
Data Production Rule
Request · Goal · Rule
dpr.addGoal(
ResultColumn("cpc")
)
Request · Goal · Rule
// .time("past 2 complete weeks")
dpr.addProductionRule(
TimeFilter(CompleteWeeks(ref = Now, beg = -2)),
at = SourceFiltering // production stage
)
Request · Goal · Rule
// .time("past 2 complete weeks")
dpr.addProductionRule(
TimeFilter(CompleteWeeks(ref = Now, beg = -2)),
at = SourceFiltering // production stage
)
Requirement
SourceColumn(
role = TimeDimension()
)
Request · Goal · Rule
// .time("past 2 complete weeks")
dpr.addProductionRule(
TimeFilter(CompleteWeeks(ref = Now, beg = -2)),
at = SourceFiltering // production stage
)
Requirement
Prop("time.week.start")
Prop("time.tz.source")
Prop("time.tz.reporting")
goal + rule + requirement
// .weekly
val timeCol = SourceColumn(
role = TimeDimension(maxGranularity = Weekly)
)
val ts = ResultColumn("ts",
production = Array(WithColumn("ts", BeginningOf("week", timeCol)))
)
dpr.addGoals(ts).addGroupBy(ts)
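One way to picture how these pieces compose, as a sketch only; the type names and signatures below are illustrative, not the actual library:
// Illustrative-only model: a request collects goals (what to produce) and
// production rules (how, declaratively); rules carry requirements that are
// satisfied later from context.
sealed trait Requirement
case class SourceColumnRequirement(role: String) extends Requirement
case class PropRequirement(key: String)          extends Requirement

trait ProductionRule { def requirements: Seq[Requirement] }
case class Goal(name: String, rules: Seq[ProductionRule] = Nil)

class DataProductionRequest {
  private var goals   = Vector.empty[Goal]
  private var groupBy = Vector.empty[Goal]
  private var rules   = Vector.empty[ProductionRule]
  def addGoals(gs: Goal*): this.type                  = { goals ++= gs; this }
  def addGroupBy(gs: Goal*): this.type                = { groupBy ++= gs; this }
  def addProductionRule(r: ProductionRule): this.type = { rules :+= r; this }
}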
Rule · Requirement · Context
satisfying:
• Source table
• Time dimension column
• Business week start
• Reporting time zone
Rule · Requirement · Context · Resolution (satisfying)
partial resolution rewrites
Rule · Requirement · Context · Resolution · Dataset (satisfying)
partial resolution rewrites
full resolution builds transformations
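A minimal sketch of the resolution idea, with hypothetical types: the context satisfies requirements, partial resolution rewrites the request, and full resolution builds the dataset transformations:
import org.apache.spark.sql.DataFrame

// Hypothetical sketch, not the actual engine.
case class Prop(key: String)
case class Context(props: Map[String, String]) {
  def satisfies(p: Prop): Boolean = props.contains(p.key)
}
case class PendingRequest(requirements: Set[Prop], build: Context => DataFrame) {
  def unresolved(ctx: Context): Set[Prop] = requirements.filterNot(ctx.satisfies)
}

def resolve(request: PendingRequest, ctx: Context): Either[Set[Prop], DataFrame] = {
  val missing = request.unresolved(ctx)
  if (missing.isEmpty) Right(request.build(ctx)) // full resolution builds transformations
  else Left(missing)                             // caller can merge more context and retry
}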
Spark makes this quite easy
• Open & rewritable processing model
• Column = Expression + Metadata
• LogicalPlan
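These hooks are all public Spark APIs; a brief illustration (the sample data and the semantic tag are made up):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

val spark = SparkSession.builder.master("local[*]").appName("open-model-demo").getOrCreate()
import spark.implicits._

val df = Seq((1L, 100L, 3L)).toDF("campaign_id", "views", "clicks")

// Column = Expression: every Column wraps a Catalyst expression you can inspect.
println(col("clicks").cast("double").expr)

// ... + Metadata: columns carry metadata through the schema, e.g. a semantic tag.
val meta   = new MetadataBuilder().putString("ref", "swoop.campaign.id").build()
val tagged = df.withColumn("campaign_id", col("campaign_id").as("campaign_id", meta))
println(tagged.schema("campaign_id").metadata)

// LogicalPlan: the plan is open for inspection and, with care, rewriting.
println(tagged.select(($"clicks" / $"views").as("ctr")).queryExecution.logical)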
not your typical enterprise/BI
death-by-configuration system
context is a metadata query
smartDataWarehouse.context
.release("3.1.2")
.merge(_.tagged("sim.ml.experiments"))
Minimal, just-in-time context is sufficient.
No need for a complete, unified model.
automatic joins by column name
{
"table": "dimension_campaigns",
"primary_key": ["campaign_id"]
"joins": [ {"fields": ["campaign_id"]} ]
}
join any table with campaign_id column
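A sketch of the mechanism this metadata could drive, with hypothetical types; any dataframe that carries all the declared join fields gets the dimension joined in automatically:
import org.apache.spark.sql.DataFrame

case class JoinRule(table: String, fields: Seq[String])

// Join the dimension only when every declared join field is present.
def autoJoin(df: DataFrame, rule: JoinRule)(load: String => DataFrame): DataFrame =
  if (rule.fields.forall(df.columns.contains)) df.join(load(rule.table), rule.fields)
  else df

// e.g. autoJoin(report, JoinRule("dimension_campaigns", Seq("campaign_id")))(spark.table)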
automatic semantic joins
{
"table": "dimension_campaigns",
"primary_key": ["campaign_id"]
"joins": [ {"fields": ["ref:swoop.campaign.id"]} ]
}
join any table with a Swoop campaign ID
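A sketch of how semantic matching might work, assuming the tag is carried in column metadata rather than in the column name:
import org.apache.spark.sql.DataFrame

// Find columns tagged with a semantic reference (e.g. "swoop.campaign.id"),
// regardless of what the column is actually called.
def columnsWithRef(df: DataFrame, ref: String): Seq[String] =
  df.schema.fields.collect {
    case f if f.metadata.contains("ref") && f.metadata.getString("ref") == ref => f.name
  }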
calculated columns
{
"field": "ctr",
"description": "click-through rate",
"expr": "if(nvl(views,0)=0,0,nvl(clicks,0)/nvl(views,0))"
...
}
automatically available to matching tables
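A sketch of how a calculated column could attach itself to matching tables; the types are hypothetical, the expression is the one above:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

case class CalculatedColumn(field: String, sqlExpr: String, needs: Seq[String])

val ctr = CalculatedColumn(
  "ctr", "if(nvl(views,0)=0,0,nvl(clicks,0)/nvl(views,0))", needs = Seq("clicks", "views"))

// Attach the calculated column to any table whose schema has what it needs.
def withCalculated(df: DataFrame, c: CalculatedColumn): DataFrame =
  if (c.needs.forall(df.columns.contains)) df.withColumn(c.field, expr(c.sqlExpr)) else df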
humanization
{
"field": "ctr",
"humanize": {
"expr": "format_number(100 * value, 3) as pct_ctr"
}
}
change column name as units change
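A sketch of applying the humanization rule, assuming `value` stands for the raw column and the rename reflects the change of units:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

// Apply the stored humanize expression and rename the column
// (ctr is a ratio; pct_ctr is a formatted percentage).
def humanize(df: DataFrame, field: String, template: String, newName: String): DataFrame =
  df.withColumn(newName, expr(template.replace("value", field))).drop(field)

// e.g. humanize(report, "ctr", "format_number(100 * value, 3)", "pct_ctr")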
humanization
ResultColumn("rank_campaign_by_gross_revenue",
production = Array(...),
shortName = "rank", // simplified name
defaultOrdering = Asc, // row ordering
before = "campaign") // column ordering
optimization hints
{
"field": "campaign",
// hint: allows join after groupBy as opposed to before
"unique": true
...
}
a general optimizer can never do this
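For illustration, this is the plan shape the uniqueness hint permits; the helper below is a sketch that assumes the fact dataframe already carries the weekly ts column derived earlier:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, sum}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Because campaign_id is unique in the dimension, the name lookup can be
// joined after the groupBy (10 rows per week) rather than before it
// (every fact row).
def topCampaigns(facts: DataFrame, campaigns: DataFrame): DataFrame = facts
  .groupBy('campaign_id, 'ts)
  .agg((sum('billing) / 1000000).as("gross_revenue"))
  .withColumn("rank", rank().over(Window.partitionBy('ts).orderBy('gross_revenue.desc)))
  .where('rank <= 10)
  .join(campaigns.select('campaign_id, 'campaign_short_name.as("campaign")), "campaign_id")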
automatic source & join selection
• 14+ Swoop tables can satisfy the request
• Only 2 are good choices based on cost
– 100x+ faster execution
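A sketch of what cost-based source selection might look like, with hypothetical types and a made-up cost measure:
// Among the tables whose metadata says they can satisfy the requested goals,
// choose the cheapest candidate.
case class SourceCandidate(table: String, satisfies: Set[String], estimatedCost: Double)

def pickSource(candidates: Seq[SourceCandidate], goals: Set[String]): Option[SourceCandidate] =
  candidates.filter(c => goals.subsetOf(c.satisfies)).sortBy(_.estimatedCost).headOption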
10x easier data production
val df = DataProductionRequest()
.select('campaign.top(10).by('gross_revenue), 'cpc, 'ctr)
.where("health sites")
.weekly.time("past 2 complete weeks")
.humanize
.toDF // result is a normal dataframe/dataset
revolutionary benefits
• Increased productivity & flexibility
• Improved performance & lower costs
• Easy collaboration (even across companies!)
• Business users can query Spark
At Swoop, we are committed to making
goal-based data production
the best free & open-source
way to produce data with Spark
Interested in contributing?
Email spark@swoop.com
Join our open-source work.
Email spark@swoop.com
More Spark magic: http://bit.ly/spark-records
