Goal Based Data Production with Sim Simeonov

Since the invention of SQL and relational databases, data production has been about specifying how data is transformed through queries. While Apache Spark can certainly be used as a general distributed query engine, the power and granularity of Spark’s APIs enable a revolutionary increase in data engineering productivity: goal-based data production. Goal-based data production concerns itself with specifying WHAT the desired result is, leaving the details of HOW the result is achieved to a smart data warehouse running on top of Spark. That not only substantially increases productivity, but also significantly expands the audience that can work directly with Spark: from developers and data scientists to technical business users. With specific data and architecture patterns spanning the range from ETL to machine learning data prep and with live demos, this session will demonstrate how Spark users can gain the benefits of goal-based data production.

1. Goal-Based Data Production. Sim Simeonov, Founder/CTO, Swoop
2. Hi, I’m @simeons. I like software magic.
3. http://bit.ly/spark-records
4. a data-powered culture supported by petabyte-scale data engineering & ML/AI
5. data-powered culture: AHA (Ad-Hoc Analytics), ML, AI
6. data-powered culture: AHA, ML & AI, fed by new data and new queries from biz users and data scientists
7. 82%
8. agile data production supports the data-powered culture: AHA, ML & AI, new data and new queries from biz users and data scientists
9. the stack: Data-Powered Culture → Agile Data Production (BI Tools, SQL/DSL) → Goal-Based Data Production → Spark / DBMS
10. agile data production request: top 10 campaigns by gross revenue running on health sites, weekly for the past 2 complete weeks, with cost-per-click and the click-through rate
12. the SQL for this request:

    with origins as (
      select origin_id, site_vertical
      from dimension_origins
      where site_vertical = 'health'),
    t1 as (
      select campaign_id, origin_id, clicks, views, billing,
        date(to_utc_timestamp(
          date_sub(
            to_utc_timestamp(from_unixtime(uts), 'America/New_York'),
            cast(date_format(to_utc_timestamp(from_unixtime(uts), 'America/New_York'), 'u') as int) - 1),
          'America/New_York')) as ts
      from rep_campaigns_daily
      where from_unixtime(uts) >= to_utc_timestamp(
          date_sub(
            to_utc_timestamp(current_date(), 'America/New_York'),
            cast(date_format(to_utc_timestamp(current_date(), 'America/New_York'), 'u') as int) - 1)
          - INTERVAL 2 WEEK, 'America/New_York')
        and from_unixtime(uts) < to_utc_timestamp(
          date_sub(
            to_utc_timestamp(current_date(), 'America/New_York'),
            cast(date_format(to_utc_timestamp(current_date(), 'America/New_York'), 'u') as int) - 1),
          'America/New_York')),
    t2 as (
      select ts, campaign_id,
        sum(views) as views, sum(clicks) as clicks,
        sum(billing) / 1000000.0 as gross_revenue
      from t1 lhs join origins rhs on lhs.origin_id = rhs.origin_id
      group by campaign_id, ts),
    t3 as (
      select *, rank() over (partition by ts order by gross_revenue desc) as rank
      from t2),
    t4 as (select * from t3 where rank <= 10)
    select ts, rank, campaign_short_name as campaign,
      bround(gross_revenue) as gross_revenue,
      format_number(if(clicks = 0, 0, gross_revenue / clicks), 3) as cpc,
      format_number(if(views = 0, 0, 100 * clicks / views), 3) as pct_ctr
    from t4 lhs join dimension_campaigns rhs on lhs.campaign_id = rhs.campaign_id
    order by ts, rank

13. the Spark DSL is equally complex:

    spark.table("rep_campaigns_daily")
      .where("""
        from_unixtime(uts) >= to_utc_timestamp(
          date_sub(
            to_utc_timestamp(current_date(), 'America/New_York'),
            cast(date_format(to_utc_timestamp(current_date(), 'America/New_York'), 'u') as int) - 1)
          - INTERVAL 2 WEEK, 'America/New_York')
        and from_unixtime(uts) < to_utc_timestamp(
          date_sub(
            to_utc_timestamp(current_date(), 'America/New_York'),
            cast(date_format(to_utc_timestamp(current_date(), 'America/New_York'), 'u') as int) - 1),
          'America/New_York')""")
      .join(spark.table("dimension_origins").where('site_vertical === "health").select('origin_id), "origin_id")
      .withColumn("ts", expr(
        """date(to_utc_timestamp(
             date_sub(
               to_utc_timestamp(from_unixtime(uts), 'America/New_York'),
               cast(date_format(to_utc_timestamp(from_unixtime(uts), 'America/New_York'), 'u') as int) - 1),
             'America/New_York'))"""))
      .groupBy('campaign_id, 'ts)
      .agg(
        sum('views).as("views"),
        sum('clicks).as("clicks"),
        (sum('billing) / 1000000).as("gross_revenue"))
      .withColumn("rank", rank.over(Window.partitionBy('ts).orderBy('gross_revenue.desc)))
      .where('rank <= 10)
      .join(spark.table("dimension_campaigns").select('campaign_id, 'campaign_short_name.as("campaign")), "campaign_id")
      .withColumn("gross_revenue", expr("bround(gross_revenue)"))
      .withColumn("cpc", format_number(when('clicks === 0, 0).otherwise('gross_revenue / 'clicks), 3))
      .withColumn("pct_ctr", format_number(when('views === 0, 0).otherwise(lit(100) * 'clicks / 'views), 3))
      .select('ts, 'rank, 'campaign, 'gross_revenue, 'cpc, 'pct_ctr)
      .orderBy('ts, 'rank)

14. Humans ↔ Spark
15. SQL, the 40+ year old abstraction behind agile data production: powerful… but complex, fragile, and inflexible
16. traditional abstractions fail: BI tools are slow to add new schema/data; SQL views build only upon existing schema; UDFs are very limited
17. Humans ↔ Spark: humans start from what they want (implicit context); Spark must be told both what you want and how to produce it (explicit context)
18. the curse of generality: verbosity. what (5%) vs. how (95%)
19. the curse of generality: repetition

    -- calculate click-through rate, %
    format_number(if(views = 0, 0, 100 * clicks / views), 3) as pct_ctr

    -- join to get campaign names from campaign IDs
    SELECT campaign_short_name AS campaign, ...
    FROM ... lhs JOIN dimension_campaigns rhs
      ON lhs.campaign_id = rhs.campaign_id

20. the curse of generality: inflexibility

    -- code depends on time column datatype, format & timezone
    date(to_utc_timestamp(
      date_sub(
        to_utc_timestamp(from_unixtime(uts), 'America/New_York'),
        cast(
          date_format(
            to_utc_timestamp(from_unixtime(uts), 'America/New_York'),
            'u') as int) - 1),
      'America/New_York'))

21. over time, generality creates massive technical debt
22. Humans ↔ Spark: humans say what they want and get data back; just-in-time generative context supplies the how
23. goal-based data production

    DataProductionRequest()
      .select('campaign.top(10).by('gross_revenue), 'cpc, 'ctr)
      .where("health sites")
      .weekly.time("past 2 complete weeks")
      .humanize

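The deck never shows how the builder itself is put together. As a rough illustration only, here is a minimal Scala sketch of an immutable fluent builder in this style; every type, field, and method body below is a hypothetical stand-in, not Swoop's implementation.

    // Hypothetical sketch only: an immutable builder where each call returns a
    // new request; nothing executes until the request is resolved against context.
    case class ColumnGoal(name: String, topN: Option[Int] = None, rankBy: Option[String] = None) {
      def top(n: Int): ColumnGoal = copy(topN = Some(n))
      def by(col: String): ColumnGoal = copy(rankBy = Some(col))
    }

    case class DataProductionRequest(
        goals: Vector[ColumnGoal] = Vector.empty,
        filters: Vector[String] = Vector.empty,
        grain: Option[String] = None,
        timeSpec: Option[String] = None,
        humanized: Boolean = false) {
      def select(cols: ColumnGoal*): DataProductionRequest = copy(goals = goals ++ cols)
      def where(filter: String): DataProductionRequest = copy(filters = filters :+ filter)
      def weekly: DataProductionRequest = copy(grain = Some("week"))
      def time(spec: String): DataProductionRequest = copy(timeSpec = Some(spec))
      def humanize: DataProductionRequest = copy(humanized = true)
    }

    // Usage mirrors the slide, with string column names instead of 'symbols:
    val request = DataProductionRequest()
      .select(ColumnGoal("campaign").top(10).by("gross_revenue"),
              ColumnGoal("cpc"), ColumnGoal("ctr"))
      .where("health sites")
      .weekly.time("past 2 complete weeks")
      .humanize
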
24. the DSL is not as important as the processing & configuration models
25. the core abstractions: Data Production Request, Data Production Goal, Data Production Rule
26. Request / Goal / Rule: adding a goal

    dpr.addGoal(
      ResultColumn("cpc")
    )

27. Request / Goal / Rule: adding a production rule

    // .time("past 2 complete weeks")
    dpr.addProductionRule(
      TimeFilter(CompleteWeeks(ref = Now, beg = -2)),
      at = SourceFiltering // production stage
    )

28. Request / Goal / Rule: the rule from the previous slide declares a requirement

    SourceColumn(
      role = TimeDimension()
    )

29. Request / Goal / Rule: the same rule also requires configuration properties

    Prop("time.week.start")
    Prop("time.tz.source")
    Prop("time.tz.reporting")

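Slides 26-29 only name the moving parts. A minimal sketch of how rules might declare their requirements, assuming hypothetical Scala types for the slide vocabulary (the string role and the week offset are stand-ins):

    // Assumed stand-ins for the slide vocabulary, not Swoop's actual types.
    sealed trait Requirement
    case class SourceColumn(role: String) extends Requirement // e.g., the time dimension
    case class Prop(key: String) extends Requirement          // e.g., "time.week.start"

    sealed trait ProductionStage
    case object SourceFiltering extends ProductionStage

    // A production rule runs at a given stage and declares what it needs from
    // context before it can rewrite the request.
    trait ProductionRule {
      def stage: ProductionStage
      def requirements: Seq[Requirement]
    }

    case class TimeFilter(begWeeks: Int) extends ProductionRule {
      val stage = SourceFiltering
      val requirements = Seq(
        SourceColumn(role = "TimeDimension"),
        Prop("time.week.start"),
        Prop("time.tz.source"),
        Prop("time.tz.reporting"))
    }
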
30. Rule → Requirement → Context → Resolution: each requirement satisfied against the context yields a partial resolution that rewrites the request
31. Rule → Requirement → Context → Resolution → Dataset: partial resolutions rewrite the request until full resolution builds the transformation DAG that yields the dataset
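To make the resolution flow on slides 30-31 concrete, here is a self-contained sketch under assumed types: satisfying a requirement yields a partial resolution that rewrites the request, and once nothing is left unsatisfied, full resolution builds the plan. The string-based plan is a stand-in for the real transformation DAG.

    case class Requirement(key: String)
    case class Request(unsatisfied: List[Requirement], steps: List[String]) {
      def buildPlan: String = steps.mkString(" -> ") // stand-in for the Spark DAG
    }
    case class Resolution(step: String) {
      // A partial resolution rewrites the request: it pops the satisfied
      // requirement and appends the production step it generated.
      def rewrite(r: Request): Request = Request(r.unsatisfied.tail, r.steps :+ step)
    }
    class Context(props: Map[String, String]) {
      def satisfy(req: Requirement): Option[Resolution] =
        props.get(req.key).map(v => Resolution(s"${req.key} := $v"))
    }

    def resolve(request: Request, context: Context): String = {
      var current = request
      while (current.unsatisfied.nonEmpty) {
        val req = current.unsatisfied.head
        context.satisfy(req) match {
          case Some(res) => current = res.rewrite(current) // partial resolution
          case None      => sys.error(s"insufficient context for $req")
        }
      }
      current.buildPlan // full resolution
    }
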
32. modern approach to metadata
   • minimal, just-in-time context
   • generative rules
33. context is a metadata query: mix production context with personal context

    smartDataWarehouse.context
      .release("3.1.2")
      .merge(_.tagged("sim.ml.experiments"))

34. automatic joins by column name: join any table with a campaign_id column

    {
      "table": "dimension_campaigns",
      "primary_key": ["campaign_id"],
      "joins": [
        {"fields": ["campaign_id"]}
      ]
    }

35. automatic semantic joins: join any table with a Swoop campaign ID

    {
      "table": "dimension_campaigns",
      "primary_key": ["campaign_id"],
      "joins": [
        {"fields": ["ref:swoop.campaign.id"]}
      ]
    }

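What the engine does with this join metadata is not shown; a plausible sketch, assuming a hypothetical helper named autoJoin (the Spark calls themselves are standard):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Hypothetical helper: any DataFrame that carries the declared join field
    // gets the dimension table joined in automatically; otherwise pass through.
    def autoJoin(spark: SparkSession, source: DataFrame,
                 dimTable: String = "dimension_campaigns",
                 joinField: String = "campaign_id"): DataFrame =
      if (source.columns.contains(joinField))
        source.join(spark.table(dimTable), Seq(joinField), "left")
      else
        source
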
36. calculated columns: automatically available to matching tables

    {
      "field": "ctr",
      "description": "click-through rate",
      "expr": "if(nvl(views,0)=0,0,nvl(clicks,0)/nvl(views,0))",
      ...
    }

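Again as a sketch, one way the engine might apply such a definition, assuming a hypothetical helper named withCtr; the expression string is taken verbatim from the slide:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.expr

    // Hypothetical helper: add ctr to any DataFrame that has every column the
    // expression depends on; leave other DataFrames untouched.
    def withCtr(df: DataFrame): DataFrame =
      if (Seq("clicks", "views").forall(df.columns.contains))
        df.withColumn("ctr", expr("if(nvl(views,0)=0,0,nvl(clicks,0)/nvl(views,0))"))
      else df
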
37. humanization: change the column name as units change

    {
      "field": "ctr",
      "humanize": {
        "expr": "format_number(100 * value, 3) as pct_ctr"
      }
    }

38. optimization hints: a uniqueness hint allows the join to run after groupBy as opposed to before

    {
      "field": "campaign",
      "unique": true,
      ...
    }

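Why the hint matters: if campaign is unique per campaign_id, the dimension join can run on the aggregated output instead of on every fact row. A sketch of the rewritten shape, using table and column names from the deck (the function itself is illustrative, not the engine's code):

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.sum

    // Aggregate first, then join once per campaign; without the uniqueness
    // hint, the safe default is to join before the groupBy, touching far more rows.
    def grossRevenueByCampaign(spark: SparkSession, facts: DataFrame): DataFrame =
      facts
        .groupBy("campaign_id")
        .agg((sum("billing") / 1000000.0).as("gross_revenue"))
        .join(spark.table("dimension_campaigns"), "campaign_id")
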
39. automatic source & join selection
   • 14+ Swoop tables can satisfy the request: 100+x faster if one of the two good choices is picked
   • execution statistics enable smarter heuristics, a natural partner to Databricks Delta
40. in case of insufficient context, add more in real time
41. 10x easier agile data production

    val df = DataProductionRequest()
      .select('campaign.top(10).by('gross_revenue), 'cpc, 'ctr)
      .where("health sites")
      .weekly.time("past 2 complete weeks")
      .humanize
      .toDF // result is a normal dataframe/dataset

42. revolutionary benefits
   • self-service for data scientists & many biz users
   • increased productivity & flexibility
   • improved performance & lower costs
   • easy collaboration (even across companies!)
43. At Swoop, we are committed to making goal-based data production the best free & open-source interface for agile data production on Spark. Interested in contributing? Email spark@swoop.com
44. Join our open-source work: email spark@swoop.com. More Spark magic: http://bit.ly/spark-records
