
London measure camp12-konstantinos-papadopoulos

Examples of using Apache Beam on Google Dataflow to perform ETL on Adobe Analytics data feeds

Published in: Internet


  1. Apache Beam: The road to visits re-assessment and ETL victory
     Konstantinos Papadopoulos, Analytics Implementation (among others) @ Debenhams
     3/17/18, MeasureCamp #12
  2. Sounds familiar?
     • Is network connectivity affecting the way we assess visits for commuters?
     • How do different cut-off periods other than 30 minutes of inactivity affect Marketing Channels?
     • How many visits do our kiosks have and what is their conversion rate?
       • In busy stores, customers do not wait 30 minutes before using them
     • Can we do ETL at scale without worrying about VMs, servers, clusters etc.?
  3. Beam Programming model
     • Powerful framework for ETL and data transformation pipelines
     • Batch and Stream processing with the same code
     • Java and Python SDKs
     • Executed on multiple engines (Apache Spark, Apache Flink, Apache Hadoop MapReduce etc.)
     • Most importantly, on Google Dataflow: a fully managed service, a.k.a. no DevOps involved.
     Material in the presentation is taken from either https://beam.apache.org or https://cloud.google.com/dataflow/
  4. Session-based Windowing
     • Group by key (i.e. user ID) AND window
     • Define custom Minimum Gap Duration
     • “Subdivides a collection of objects according to the timestamps of those objects and groups them by a unique key”
  5. In other words: build your own visits!

     Input hits:
       Timestamp (in seconds)   Key (i.e. Client ID)
       1                        A
       2                        A
       3                        A
       2                        B
       3                        B
       10                       B

     30 seconds of inactivity:
       Key   Timestamp
       A     [1,2,3]
       B     [2,3,10]

     5 seconds of inactivity:
       Key   Timestamp
       A     [1,2,3]
       B     [2,3]
       B     [10]
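The slide's example can be reproduced without Beam to make the grouping rule explicit. The sketch below is plain Python, not the speaker's implementation; `sessionize` is a hypothetical helper name. It starts a new visit whenever the gap to the previous hit reaches the cut-off, which mirrors how Beam's half-open session windows merge.

```python
from collections import defaultdict


def sessionize(events, gap_seconds):
    """Group (key, timestamp) events into per-key visits split on inactivity."""
    by_key = defaultdict(list)
    for key, ts in events:
        by_key[key].append(ts)

    sessions = {}
    for key, stamps in by_key.items():
        stamps.sort()
        groups = [[stamps[0]]]
        for ts in stamps[1:]:
            # Same visit only if the hit arrives before the inactivity gap elapses.
            if ts - groups[-1][-1] < gap_seconds:
                groups[-1].append(ts)
            else:
                groups.append([ts])
        sessions[key] = groups
    return sessions


# The six hits from the slide.
EVENTS = [("A", 1), ("A", 2), ("A", 3), ("B", 2), ("B", 3), ("B", 10)]

print(sessionize(EVENTS, 30))  # {'A': [[1, 2, 3]], 'B': [[2, 3, 10]]}
print(sessionize(EVENTS, 5))   # {'A': [[1, 2, 3]], 'B': [[2, 3], [10]]}
```

With a 5-second cut-off, key B splits because 7 seconds pass between the hits at 3 and 10.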
  6. Executing Beam on Dataflow
     > python main.py --runner=Dataflow
     • Initial lines processed: 115 Million
     • Initial data size: 50 GB
     • Total processing time: 75 minutes
  7. Less than 250 lines of code. Windowing magic!
  8. Straight into BigQuery

       SELECT visit_key, visitor_id, start_of_visit, end_of_visit
       FROM [visits_analysis.visits]
       WHERE visit_key = '1000040030_1566002138_1511211346'

       SELECT visit_key, timestamp, page
       FROM [visits_analysis.hits]
       WHERE visit_key = '1000040030_1566002138_1511211346'
  9. Impact on visits – Volume of visits (5% drop)
  10. Impact on visits – Avg. Visit Duration (~12% increase)
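To see why a different cut-off moves both metrics at once, here is a small worked example in plain Python. The hit timestamps and the two cut-offs (30 and 60 minutes) are made up for illustration and are not the data or settings behind the figures on these slides; `split_visits`, `visit_metrics` and `percent_change` are hypothetical helper names.

```python
def split_visits(timestamps, gap_seconds):
    """Split one visitor's hit timestamps into visits on inactivity gaps."""
    visits = []
    for ts in sorted(timestamps):
        if visits and ts - visits[-1][-1] < gap_seconds:
            visits[-1].append(ts)
        else:
            visits.append([ts])
    return visits


def visit_metrics(timestamps, gap_seconds):
    """Return (number of visits, average visit duration in seconds)."""
    visits = split_visits(timestamps, gap_seconds)
    if not visits:
        return 0, 0.0
    durations = [v[-1] - v[0] for v in visits]
    return len(visits), sum(durations) / len(visits)


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Hypothetical hits for one visitor, in seconds since the first hit.
HITS = [0, 600, 1200, 3300, 3900]

count_30, avg_30 = visit_metrics(HITS, 30 * 60)  # 2 visits, 900.0 s on average
count_60, avg_60 = visit_metrics(HITS, 60 * 60)  # 1 visit, 3900.0 s on average
print(percent_change(count_30, count_60))        # -50.0
```

A longer cut-off merges visits that a shorter one keeps apart, so the visit count falls while the average duration rises, which is the direction of change reported on the two slides.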
  11. Cool but why?
     • Adobe’s Virtual Report Suites support custom cut-offs only in Workspace → NOT practical
     • What about data outside Adobe Analytics?
       • Server / Network logs
       • Kiosks / POS logs
     • Powerful ETL framework - Alternative to Spark
       • “There is no other system in existence which provides this degree of flexibility and power, period” …according to Google*.
       • No integrated Machine Learning library like Spark’s MLlib, however… you have TensorFlow/Google Cloud ML or can write separate ML applications in Spark
     *: https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison
  12. Thank you!
     Feedback time:
     • https://www.linkedin.com/in/papadopoulosk/
     • https://measure.slack.com/ - konstantinos.pap
     Reading time:
     • Dataflow model: https://research.google.com/pubs/pub43864.html
     • https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
     • Web app to build ETL jobs: https://cloud.google.com/dataprep/
     • Python Project: https://bitbucket.org/digitalanalytics/visits-analysis
