Apache Spark Side of Funnels


Last year we decided to build an in-house solution for funnel analysis that would be accessible to our business users through our BI tool. The backend had to run on Apache Spark, and since the BI tool can only run SQL queries, the solution is a pure Spark SQL implementation of funnel analysis. In this talk we cover the Spark SQL features we used to optimize query performance and to implement the filters that enable end users to get actionable insights.

Key takeaways:
- a single-query approach to funnel analysis (applicable to any funnel-like problem)
- using window functions to ensure the ordering of events in the funnel
- examples of higher-order functions to calculate funnel metrics


  2. Stipanicev Zoran, GetYourGuide: Spark Side of the Funnels #UnifiedDataAnalytics #SparkAISummit
  3. About me: Senior BI Engineer, Data Platform. Software engineer for the past 13 years; started working with data 12 years ago with Oracle and moved to reporting and BI over the years. For the last 4 years, enabling business users at GetYourGuide to make better decisions with data.
  4. Agenda
     1. Intro to GetYourGuide
     2. Introduction to Funnels
     3. Deep Dive
     4. Further Possibilities
     5. End Result
  5. Intro to GetYourGuide
  6. We make it simple to book and enjoy incredible experiences
  7. Europe’s largest marketplace for travel experiences
     - 50k+ products in 150+ countries
     - 25M+ tickets sold
     - $650M+ in VC funding
     - 600+ strong global team
     - 150+ traveler nationalities
  8. Introduction to Funnels
  9. Requirements
     1. Looker as frontend and Spark as backend
     2. Respect the order of events
     3. Each step can consist of multiple events
     4. Anything can happen between two steps of the funnel
     5. Support for funnel-wide and step-specific filters
     6. Sessions based on touch points (for some use cases)
     7. Performance: 4 weeks of data in under 60 seconds, ideally under 30 seconds
     8. Option to ignore the order of events :)
  10. Internal vs external solution
     1. Performance
        - External is faster because it is custom-built from the bottom up
     2. Where is the data?
        - With an internal solution we are not sending data to third parties
     3. Flexibility
        - With an internal solution we can join it to all of our internal data to bring more insights to our stakeholders
        - We can add it to internal dashboards
     4. Cost
        - Differs for each company :)
  11. Touchpoints explained [Diagram; recoverable labels: Events, Events, Direct, Events, Booking; TOUCH POINTS; Funnel Session 1, Funnel Session 2, Funnel Session 3]
  12. Funnel filtering explained
     If you define that you want to see visitors in the funnel A - B - C - D - E, what can the actual event stream look like?
     1) xyzDBz A CyxD B ABAB C BC DE
     2) xyzlmnop A CCCC
     How many steps do we match in these cases?
     1) A B C D E
     2) A B A B E
     3) A B C E D
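The slide leaves the three cases open. As a minimal sketch, assuming a step only counts when every earlier step has already been matched in order (the "respect the order of events" requirement), the number of matched steps can be computed like this. The function name `matched_steps` is hypothetical and not part of the talk.

```python
def matched_steps(events: str, funnel: str = "ABCDE") -> int:
    """Count how many funnel steps occur in order within the event string.

    Assumption: a step is matched only if all previous steps were matched
    earlier in the stream; unrelated events in between are ignored.
    """
    position = 0
    matched = 0
    for step in funnel:
        found = events.find(step, position)
        if found == -1:
            break  # this step never occurs after the previous one
        matched += 1
        position = found + 1
    return matched


for stream in ("ABCDE", "ABABE", "ABCED", "xyzlmnopACCCC"):
    print(stream, matched_steps(stream))
```

Under that assumption the three cases match 5, 2 and 4 steps, and the second sample stream (`xyzlmnop A CCCC`) matches only step A.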
  13. Deep Dive
  14. How do we build a funnel?
     1) We filter for only the selected events
     2) We concatenate all the events in a single session into a string
        a) We use an alias for each step (A, B, C, …)
           i) It streamlines the rest of the query
           ii) Nothing had to change when we added step-specific filters
     3) We compare the generated string with the funnel specified by the filters in our BI tool
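The three steps above can be sketched outside Spark in plain Python. This is an illustration of the idea only, not the production query: the sample rows, the `AddToCart` step and the helper names `step_alias` and `funnel_strings` are assumptions made for the example.

```python
import re
from collections import defaultdict
from typing import Optional


def step_alias(row: dict) -> Optional[str]:
    """Map an event to its funnel step alias, or None if it is not a funnel event."""
    if row["event"] in ("LandingPage", "HomePage"):
        return "A"
    if row["event"] == "ProductPage" and row.get("product_id") == "123":
        return "B"
    if row["event"] == "AddToCart":  # hypothetical third step, not taken from the slides
        return "C"
    return None


def funnel_strings(event_log: list) -> dict:
    """Steps 1 and 2: keep funnel events only, then concatenate aliases per session in time order."""
    per_session = defaultdict(list)
    for row in sorted(event_log, key=lambda r: r["timestamp"]):
        alias = step_alias(row)
        if alias is not None:
            per_session[row["session"]].append(alias)
    return {session: "".join(aliases) for session, aliases in per_session.items()}


# Made-up sample data for illustration.
event_log = [
    {"session": "s1", "timestamp": 1, "event": "HomePage"},
    {"session": "s1", "timestamp": 2, "event": "ProductPage", "product_id": "123"},
    {"session": "s1", "timestamp": 3, "event": "AboutView"},
    {"session": "s1", "timestamp": 4, "event": "AddToCart"},
    {"session": "s2", "timestamp": 1, "event": "ProductPage", "product_id": "123"},
    {"session": "s2", "timestamp": 2, "event": "LandingPage"},
]

funnels = funnel_strings(event_log)
# Step 3: compare each string with the requested funnel A -> B -> C.
matching = {s for s, f in funnels.items() if re.search("A.*B.*C", f)}
print(funnels, matching)
```

Because every step is reduced to a single character, adding a step-specific filter only changes `step_alias`; the comparison in step 3 stays the same.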
  15. Pseudo SQL of the implementation
     SELECT CASE … WHEN
     FROM (
       SELECT concat_ws('', collect_list(
                CASE WHEN event IN ('LandingPage', 'HomePage') THEN 'A'
                     WHEN event = 'ProductPage' AND product_id = '123' THEN 'B'
                     ...
                END) OVER (PARTITION BY session ORDER BY timestamp ROWS …)) AS funnel  -- funnel => "ABABBBCDE..."
       FROM event_log
       WHERE ((event IN ('LandingPage', 'HomePage'))
              OR (event = 'ProductPage' AND product_id = '123') …)
         AND date = ...
     ) t
     WHERE t.funnel RLIKE '...'
  16. Pseudo SQL - Inner WHERE
     SELECT CASE … WHEN
     FROM (
       SELECT concat_ws('', collect_list(
                CASE WHEN event IN ('LandingPage', 'HomePage') THEN 'A'
                     WHEN event = 'ProductPage' AND product_id = '123' THEN 'B'
                     ...
                END) OVER (PARTITION BY session ORDER BY timestamp ROWS …)) AS funnel
       FROM event_log
       WHERE 1=1
         AND ((event IN ('LandingPage', 'HomePage'))
              OR (event = 'ProductPage' AND product_id = '123') …)
         AND date BETWEEN ...
     ) t
     WHERE t.funnel RLIKE '...'
  17. Explain plan before the fix
     +- Filter (((event = LandingPage) || (event = HomePage)) || ((event = ProductPage) && (product_id = …
        +- FileScan parquet … Location: PrunedInMemoryFileIndex[dbfs:…/event=AboutView ... PartitionCount: 1502, PartitionFilters: [isnotnull(date), (date >= 18135), (date < 18142)], …
     What do we see in the explain plan?
     ● The event partition listed is not one of the filtered events
     ● The partition count is really high (the date range is 7 days)
     ● Date partition pruning is applied
  18. Pseudo SQL - Inner WHERE, fixed
     SELECT CASE … WHEN
     FROM (
       SELECT concat_ws('', collect_list(
                CASE WHEN event IN ('LandingPage', 'HomePage') THEN 'A' …
       FROM event_log
       WHERE 1=1
         AND date BETWEEN ...
         -- filter on the event partition column only
         AND ( (event IN ('LandingPage', 'HomePage'))
               OR (event = 'ProductPage')
               OR (event …) )
         -- original step filters, including non-partition columns
         AND ( (event IN ('LandingPage', 'HomePage'))
               OR (event = 'ProductPage' AND product_id = '123') …)
     ) t
     WHERE t.funnel RLIKE '...'
  19. Explain plan after the fix
     +- Filter (((event = LandingPage) || (event = HomePage)) || ((event = ProductPage) && (product_id = …
        +- FileScan parquet … Location: PrunedInMemoryFileIndex[dbfs:…/event=LandingPage ... PartitionCount: 21, PartitionFilters: [isnotnull(date), (date >= 18135), (date < 18142)], (event_name = Landing…
     What do we see in the explain plan?
     ● Event name partitions are pruned
     ● The partition count is a lot lower (7 days x 3 events)
     ● Date partition pruning is still applied
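The fix adds a redundant predicate that mentions only the event partition column, next to the original step filters. A minimal sketch of how such a WHERE fragment could be generated from the step definitions follows. The helper names `build_where` and `expected_partitions` and the tuple layout of `steps` are assumptions for illustration; the slides do not show how the BI tool assembles the SQL.

```python
def _in_list(names):
    """Render a SQL IN list for the given event names."""
    return "event IN (" + ", ".join(f"'{n}'" for n in names) + ")"


def build_where(steps):
    """Return a WHERE fragment: a partition-only event filter AND the original step filters.

    `steps` is a list of (event_names, extra_condition_or_None) tuples.
    """
    all_events = sorted({name for names, _ in steps for name in names})
    partition_filter = _in_list(all_events)  # touches the partition column only
    step_filters = []
    for names, extra in steps:
        condition = _in_list(names)
        if extra:
            condition = f"({condition} AND {extra})"
        step_filters.append(condition)
    return f"{partition_filter} AND ({' OR '.join(step_filters)})"


def expected_partitions(days, steps):
    """Partitions scanned once both date and event pruning apply: days x distinct events."""
    return days * len({name for names, _ in steps for name in names})


steps = [
    (("LandingPage", "HomePage"), None),
    (("ProductPage",), "product_id = '123'"),
]
print(build_where(steps))
print(expected_partitions(7, steps))
```

For the three events on the slide and a 7-day range this gives 21 partitions, which agrees with the `PartitionCount: 21` in the plan after the fix.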
  20. Sample of rows from inner to outer query
     Event Alias | Funnel      | Visitor ID
     A           | BBABABABCBC | A1B1
     B           | BBABABABCBC | A1B1
     C           | BBABABABCBC | A1B1
     A           | BAAA        | C9D9
     B           | BAAA        | C9D9
  21. Pseudo SQL - Outer WHERE
     SELECT CASE … WHEN
     FROM (
       SELECT concat_ws('', collect_list(
                CASE WHEN event IN ('LandingPage', 'HomePage') THEN 'A' …
       FROM event_log
       WHERE 1=1 AND date BETWEEN ...
     ) t
     WHERE t.funnel RLIKE 'A.*B.*C.*D.*E'
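The pattern `A.*B.*C.*D.*E` encodes "these steps in this order, with anything in between". As a rough illustration, the pattern for a full funnel and for each of its prefixes can be derived from the alias string. Python's `re.search` is used here as a stand-in for Spark's `RLIKE`, on the assumption that both report a match anywhere in the string; `funnel_pattern` is a hypothetical helper.

```python
import re


def funnel_pattern(aliases: str) -> str:
    """Build an 'in order, anything in between' regex from step aliases, e.g. 'ABC' -> 'A.*B.*C'."""
    return ".*".join(re.escape(a) for a in aliases)


full = funnel_pattern("ABCDE")
prefixes = [funnel_pattern("ABCDE"[:n]) for n in range(1, 6)]
print(full)
print(prefixes)

# A session string with repeated and out-of-order events still matches the full funnel
# as long as one in-order path exists.
print(bool(re.search(full, "ABABBBCDE")))
print(bool(re.search(full, "ABCED")))
```

The prefix patterns (`A`, `A.*B`, `A.*B.*C`, …) are the same expressions the outer SELECT uses to decide how far a session got.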
  22. Pseudo SQL - Outer WHERE
     SELECT CASE … WHEN
     FROM (
       SELECT concat_ws('', collect_list(
                CASE WHEN event IN ('LandingPage', 'HomePage') THEN 'A' …
       FROM event_log
       WHERE 1=1 AND date BETWEEN ...
     ) t
     WHERE locate('A', funnel) > 0
  23. Pseudo SQL - Outer SELECT
     SELECT CASE WHEN alias = 'A' THEN 1
                 WHEN alias = 'B' AND funnel RLIKE 'A.*B' THEN 2
                 WHEN alias = 'C' AND funnel RLIKE 'A.*B.*C' THEN 3
                 ELSE -1
            END AS step
     FROM (
       SELECT concat_ws('', collect_list(
                CASE WHEN event IN ('LandingPage', 'HomePage') THEN 'A' …
       FROM event_log
       WHERE 1=1 AND date BETWEEN ...
     ) t
     WHERE locate('A', funnel) > 0
  24. Pseudo SQL - Outer SELECT
     SELECT CASE WHEN alias = 'A' THEN 1
                 WHEN alias = 'B' AND locate('B', funnel, locate('A', funnel)) > 0 THEN 2
                 WHEN alias = 'C' AND locate('C', funnel, locate('B', funnel, locate('A', funnel))) > 0 THEN 3
                 ELSE -1
            END AS step
     FROM (
       SELECT concat_ws('', collect_list(
                CASE WHEN event IN ('LandingPage', 'HomePage') THEN 'A' …
       FROM event_log
       WHERE 1=1 AND date BETWEEN ...
     ) t
     WHERE locate('A', funnel) > 0
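The nested `locate` calls replace the regular expressions with plain substring searches, each starting where the previous step was found. The sketch below emulates that logic in Python. The `locate` function here is a stand-in written for this example, assuming Spark's behaviour of 1-based positions, 0 for "not found", and 0 when the start position is below 1; `step_for` is a hypothetical helper that generalises the CASE expression to any number of steps.

```python
def locate(substr: str, s: str, pos: int = 1) -> int:
    """Emulate locate(substr, str, pos): 1-based index of the first match at or after pos, else 0."""
    if pos < 1:
        return 0
    return s.find(substr, pos - 1) + 1  # str.find returns -1 when missing, which becomes 0


def step_for(alias: str, funnel: str, steps: str = "ABC") -> int:
    """Return the 1-based step number if `alias` is reachable in order, else -1."""
    if alias not in steps:
        return -1
    position = 1
    for step in steps[: steps.index(alias) + 1]:
        position = locate(step, funnel, position)
        if position == 0:
            return -1  # an earlier step (or this one) is missing in order
    return steps.index(alias) + 1


for alias, funnel in [("A", "BBABABABCBC"), ("B", "BBABABABCBC"),
                      ("C", "BBABABABCBC"), ("A", "BAAA"), ("B", "BAAA")]:
    print(alias, funnel, step_for(alias, funnel))
```

For visitor C9D9 the only B occurs before the first A, so the row with alias B gets step -1 while the A row still counts as step 1.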
  25. Sample of rows from outer to BI tool query
     Event Alias | Funnel      | Visitor ID | Step
     A           | BBABABABCBC | A1B1       | 1
     B           | BBABABABCBC | A1B1       | 2
     C           | BBABABABCBC | A1B1       | 3
     A           | BAAA        | C9D9       | 1
     B           | BAAA        | C9D9       | -1
  26. Further Possibilities
  27. Slicing the funnel [Diagram; recoverable labels, in order: LandingPg, ProductPg ID = 1, AddToCart, LandingPg, Checkout, Funnel, ProductPg ID = 2, LandingPg, ProductPg ID = 3, AddToCart, Checkout, Funnel]
  28. Slicing the funnel
     ● First we get all the values that satisfy the filters
     ● Then we collect them into an array with a window function, so that they apply to every step of the funnel
     ● The last step is to explode the array using LATERAL VIEW, which also supports multiple dimensions
     ● Now we can expose it as a dimension to users
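The collect-then-explode idea can be illustrated in plain Python. This is a sketch with made-up rows: `collect_set_per_session` stands in for `collect_set(...) OVER (PARTITION BY session …)` and `explode_outer` for `LATERAL VIEW OUTER explode(...)`; both helper names and the sample data are assumptions for the example.

```python
from collections import defaultdict


def collect_set_per_session(rows, allowed_ids):
    """Attach to every row the sorted distinct product IDs seen in its session (filtered to allowed_ids)."""
    seen = defaultdict(set)
    for row in rows:
        if row.get("product_id") in allowed_ids:
            seen[row["session"]].add(row["product_id"])
    return [dict(row, product_id_array=sorted(seen[row["session"]])) for row in rows]


def explode_outer(rows, array_key, out_key):
    """Emit one row per array element; rows with an empty array are kept with None (OUTER semantics)."""
    exploded = []
    for row in rows:
        values = row[array_key] or [None]
        for value in values:
            new_row = {k: v for k, v in row.items() if k != array_key}
            new_row[out_key] = value
            exploded.append(new_row)
    return exploded


# Made-up sample: one alias row per visitor plus the product pages they viewed.
rows = [
    {"session": "A1B1", "alias": "A"},
    {"session": "A1B1", "alias": "B", "product_id": 1},
    {"session": "A1B1", "alias": "B", "product_id": 2},
    {"session": "C9D9", "alias": "A"},
    {"session": "C9D9", "alias": "B", "product_id": 3},
]

with_arrays = collect_set_per_session(rows, allowed_ids={1, 2, 3})
step_a = [{"session": r["session"], "alias": r["alias"], "product_id_array": r["product_id_array"]}
          for r in with_arrays if r["alias"] == "A"]
print(explode_outer(step_a, "product_id_array", "product_id"))
```

Each step-A row is repeated once per product in its session, so a single funnel query can be sliced by product in the BI tool.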
  29. Slicing - Inner SELECT
     SELECT CASE … WHEN
     FROM (
       SELECT …
              collect_set(CASE WHEN product_id IN (1, 2, 3) THEN product_id END)
                OVER (PARTITION BY session ORDER BY timestamp ROWS …) AS product_id_array  -- distinct values
              …
       FROM event_log
       WHERE ((event IN ('LandingPage', 'HomePage'))
              OR (event = 'ProductPage' AND product_id = '123') …)
         AND date = ...
     ) t
     WHERE t.funnel RLIKE '...'
  30. Sample of rows from inner to outer query
     Event Alias | Funnel      | Visitor ID | Product Array
     A           | BBABABABCBC | A1B1       | [1,2]
     A           | BAAA        | C9D9       | [3]
  31. Slicing - Outer query
     SELECT visitor_id
            …
          , product_id
          , CASE WHEN … …
     FROM ( /* INNER QUERY */ )
     LATERAL VIEW OUTER explode(product_id_arr) products AS product_id
     LATERAL VIEW OUTER explode(category_id_arr) categories AS category_id
     …
     ● With LATERAL VIEW we can explode multiple arrays
     ● This multiplies the number of rows passed to the outer query generated by the BI tool
     ● Benefit for end users: they don’t have to run multiple funnel analyses to get the same data
  32. Sample of rows from outer to BI tool query
     Event Alias | Funnel      | Visitor ID | Product ID
     A           | BBABABABCBC | A1B1       | 1
     A           | BBABABABCBC | A1B1       | 2
     A           | BAAA        | C9D9       | 3
  33. Further optimisations
     We have optimised the query in its current form, but one part still allows further optimisation:
     FROM event_log
     ● We are reading data directly from the partitioned table
     ● We can consider a partitioned table as a union of tables where each partition is a table
     ● Could we optimise the query by replacing the table with unions? What would that look like?
  34. Further optimisations
     SELECT …
     FROM (
       SELECT *
       FROM event_log
       WHERE (event IN ('LandingPage', 'HomePage'))
         AND date BETWEEN ...
       UNION ALL
       SELECT e.*
       FROM event_log e
       INNER JOIN (
         SELECT visitor_id, min(timestamp) AS timestamp
         FROM event_log
         WHERE (event IN ('LandingPage', 'HomePage'))
           AND date BETWEEN ...
         GROUP BY 1
       ) s1
         ON e.visitor_id = s1.visitor_id AND e.timestamp >= s1.timestamp
       …
       WHERE event = 'ProductPage' AND product_id = '123' AND date BETWEEN ...
     ) t
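The slides pose the union rewrite as an open question rather than a finished result. As a sketch of what the query above appears to express, the following keeps every step-1 event, and keeps step-2 events only for visitors who have a step-1 event and only from that visitor's first step-1 timestamp onward. The function name `prefilter` and the sample rows are assumptions for illustration.

```python
STEP_1_EVENTS = ("LandingPage", "HomePage")


def prefilter(event_log):
    """Union of all step-1 events and the step-2 events at or after each visitor's first step-1 event."""
    # Inner subquery s1: min(timestamp) of step-1 events per visitor.
    first_step_1 = {}
    for row in event_log:
        if row["event"] in STEP_1_EVENTS:
            visitor = row["visitor_id"]
            first_step_1[visitor] = min(row["timestamp"], first_step_1.get(visitor, row["timestamp"]))

    # First branch of the UNION ALL: all step-1 events.
    kept = [row for row in event_log if row["event"] in STEP_1_EVENTS]
    # Second branch: step-2 events joined to s1 on visitor and timestamp >= first step-1 timestamp.
    kept += [
        row for row in event_log
        if row["event"] == "ProductPage"
        and row.get("product_id") == "123"
        and row["visitor_id"] in first_step_1
        and row["timestamp"] >= first_step_1[row["visitor_id"]]
    ]
    return kept


# Made-up sample data.
event_log = [
    {"visitor_id": "v1", "timestamp": 1, "event": "ProductPage", "product_id": "123"},
    {"visitor_id": "v1", "timestamp": 2, "event": "HomePage"},
    {"visitor_id": "v1", "timestamp": 3, "event": "ProductPage", "product_id": "123"},
    {"visitor_id": "v2", "timestamp": 1, "event": "ProductPage", "product_id": "123"},
]
print(prefilter(event_log))
```

Rows that can never contribute to a funnel, such as a product view before the first landing page or by a visitor who never entered the funnel, are dropped before the window function runs.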
  35. End result
  36. End result: screenshots
  37. Screenshots
  38. End result: screenshots
  39. How to ensure performance in the BI tool
     1. We are using columnar storage for our data
     2. Therefore we use a feature to modify the generated SQL
        a. To select only the needed fields
        b. And to include only the joins needed for those fields
     For simpler use cases, BI tools provide this out of the box. Since our query uses subqueries, we had to use an additional feature that allows us to modify custom queries.
  40. Thank you