In Four Simple Steps, ETL Clickstream to Data Product APIs (no Engineer needed!)

Many organizations use Clickstream data for reporting but struggle to turn it into valuable Data Products and production-ready ML models. Thanks to Apache Spark and Cloud Computing, it’s not as daunting a task as you may think.

I’m going to cover in detail a recipe for taking your Clickstream data, exploring it, running ETL into a cloud platform, and then creating and publishing data products behind a REST API for consumption. The approach is designed to be nimble and iterative, with no back-end engineering needed.

Created by Josh Janzen, Senior Data Scientist

Published in: Data & Analytics

  1. GM/DM 1:1. IN FOUR SIMPLE STEPS, ETL CLICKSTREAM TO DATA PRODUCTS (NO ENGINEER NEEDED!) SENIOR DATA SCIENTIST | JOSH JANZEN
  2. JOSH JANZEN, SENIOR DATA SCIENTIST. Degrees from: (images, not captured in transcript). Data Science Tools: (images, not captured in transcript). About: Life Time champions a healthy and happy life for its members across 138 destinations in 38 major markets in the U.S. and Canada.
  3. 1. DATA FEED: FTP to S3 w/bucket credentials. 2. EXPLORE: Sample data and explore; find columns of interest. 3. ETL/ML: ETL columns of interest; apply ML algorithms. 4. DEPLOY: Create web APIs with Azure ML; interactive web apps.
  4. STEP 1. DATA FEED: FTP to S3 w/bucket credentials; start off as batch (nightly).
  5. STEP 2. EXPLORE: Sample data and explore; find columns of interest.
  6. STEP 2. EXPLORE
     Func RemoveNullColumns:
         for column in dataframe:
             if column is null:
                 remove column
     Int threshold = 2
     Func RemoveLowVariationColumns:
         for column in dataframe:
             if count(distinct values) in column < threshold:
                 remove column
  7. STEP 2. EXPLORE
  8. STEP 3. ETL/ML: ETL columns of interest; apply ML algorithms.
  9. STEP 3. ETL/ML: Auto-scaling of cluster size.
  10. STEP 3. ETL/ML
      event_date_time   user_id  action      page_name      os
      11/15/18 7:25AM   u_345    Menu_click  Home           Android
      11/15/18 7:26AM   u_345    NULL        ScheduleClass  Android

      Array files_etl_complete = ['raw_clicks_12_01_18', 'raw_clicks_12_02_18' …]
      Func DetectNewDataTMS:
          for file in raw_clicks_bucket:
              if file NOT EXISTS in files_etl_complete:
                  PerformETL(file)
                  files_etl_complete.append(file)
  11. STEP 3. ETL/ML. Image source: https://johnolamendy.wordpress.com/2015/10/14/collaborative-filtering-in-apache-spark/ (images may be subject to copyright)
  12. STEP 4. DEPLOY: Create web APIs with Azure ML; interactive web apps.
  13. STEP 4. DEPLOY. Image source: https://wikiazure.com/artificial-intelligence/predict-temperature-using-azure-machine-learning/ (images may be subject to copyright)
  14. STEP 4. DEPLOY
  15. TIPS/TRICKS. Image source: https://gifer.com/en/7kRO (images may be subject to copyright)
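
The two column-pruning routines on slide 6 can be sketched in plain Python. This is only an illustration, not the code used in the talk: the column-oriented `Table` layout and the sample rows are made up here, "column is null" is assumed to mean that every value in the column is null, and nulls are assumed not to count as a distinct value. On a real Spark cluster the same logic would be expressed with DataFrame operations instead.

```python
from typing import Any, Dict, List

# Hypothetical layout: column name -> list of column values.
Table = Dict[str, List[Any]]


def remove_null_columns(table: Table) -> Table:
    """Drop columns in which every value is None (assumed meaning of 'null column')."""
    return {
        name: values
        for name, values in table.items()
        if any(value is not None for value in values)
    }


def remove_low_variation_columns(table: Table, threshold: int = 2) -> Table:
    """Drop columns with fewer than `threshold` distinct non-null values."""
    return {
        name: values
        for name, values in table.items()
        if len({value for value in values if value is not None}) >= threshold
    }


# Made-up sample rows, loosely modeled on the columns shown on slide 10.
sample: Table = {
    "user_id": ["u_345", "u_345", "u_901"],
    "action": ["Menu_click", None, "Page_view"],
    "os": ["Android", "Android", "Android"],
    "referrer": [None, None, None],
}

pruned = remove_low_variation_columns(remove_null_columns(sample))
print(list(pruned))
```

With the slide's threshold of 2, a column that holds a single repeated value (here `os`) is dropped along with the all-null column, since neither can help distinguish one row from another.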

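The incremental-load check on slide 10 might look like the following in Python. It is a minimal sketch under stated assumptions: the bucket is represented as a plain list of file names rather than a real S3 listing, `perform_etl` is a stand-in callback, and the third file name is hypothetical (the slide only shows the first two).

```python
from typing import Callable, Iterable, List


def detect_new_data(
    bucket_files: Iterable[str],
    files_etl_complete: List[str],
    perform_etl: Callable[[str], None],
) -> List[str]:
    """Run ETL on files that are not yet recorded as complete, then record them."""
    already_done = set(files_etl_complete)  # set membership avoids rescanning the list
    newly_processed: List[str] = []
    for file_name in bucket_files:
        if file_name not in already_done:
            perform_etl(file_name)
            files_etl_complete.append(file_name)
            already_done.add(file_name)
            newly_processed.append(file_name)
    return newly_processed


# File names follow the pattern on the slide; the third one is hypothetical.
files_etl_complete = ["raw_clicks_12_01_18", "raw_clicks_12_02_18"]
raw_clicks_bucket = [
    "raw_clicks_12_01_18",
    "raw_clicks_12_02_18",
    "raw_clicks_12_03_18",
]

etl_calls: List[str] = []  # stand-in for real ETL work: just record which files were seen
new_files = detect_new_data(raw_clicks_bucket, files_etl_complete, etl_calls.append)
print(new_files)
```

Because processed files are appended to `files_etl_complete`, running the same check again on an unchanged bucket does no work, which is what makes a nightly batch safe to rerun.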
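
For the deploy step on slide 12, a consumer of the published data product would typically send a JSON request to the REST endpoint. The sketch below only builds such a request with the Python standard library and does not send it. Everything specific in it is an assumption made for illustration: the endpoint URL, the API key, the bearer-token header, and the `{"features": ...}` payload shape are placeholders, not the actual Azure ML request format, which should be taken from the service's own documentation.

```python
import json
import urllib.request
from typing import Any, Dict


def build_scoring_request(
    url: str, api_key: str, features: Dict[str, Any]
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a scoring endpoint."""
    body = json.dumps({"features": features}).encode("utf-8")  # hypothetical payload shape
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,  # assumed auth scheme
        },
        method="POST",
    )


request = build_scoring_request(
    "https://example.invalid/score",  # hypothetical endpoint
    "PLACEHOLDER_KEY",  # hypothetical key
    {"user_id": "u_345", "page_name": "Home", "os": "Android"},
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the payload easy to inspect and test before any real endpoint exists.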