Getting healthy with Hadoop

Technical deep-dive on MyFitnessPal/Under Armour Connected Fitness's use of big data technologies like Hive, Presto, and Spark to derive the insights needed to keep their users fit and healthy. You'll learn how we leverage data and big data technologies to serve the world's largest health & fitness community.

  1. Healthy with Hadoop. Karthik Subramaniam, Poojit Sharma
  2. Under Armour Connected Fitness: Agenda
     • Quick intro to MyFitnessPal & Under Armour Connected Fitness
     • Data assets overview
     • Data platform overview
     • What are we trying to get to?
     • What are we trying to solve for?
     • Use cases we are enabling and a walkthrough of the workflow
  3. MyFitnessPal
     • Massive engaged community
     • Largest food database
     • Simple & effective health/fitness tracking tool
     • Top 5 Health & Fitness app in 70+ countries
     • 90 million+ registered users
     • 7 million+ searchable food items
     • 19 billion+ logged food entries
  4. MyFitnessPal joins Under Armour
     • 140+ million registered users across UA Connected Fitness
  5. More about Under Armour
     • http://www.fool.com/investing/general/2015/06/07/how-under-armour-is-becoming-a-tech-company.aspx
  6. We have lots of data!
     • Food / Nutrition
       • 19B+ logged food entries
       • ~7M food items
       • 38M+ recipes
       • From 90M+ MyFitnessPal users
     • Workout
       • User base tracks over 600 types of indoor and outdoor fitness activities
       • Time-series GPS data
       • Music data with workout
       • Run/Ride/Walk
       • Activity throughout the day
       • Sleep pattern
       • 390B+ calories burned (MFP)
       • 1.2T+ minutes of exercise (MFP)
     • Retail / E-Commerce
       • Product purchase transactions
       • User preferences on clothes / shoes / wearable devices
  7. Characteristics of Health & Fitness Data
     • Time-series data
       • Wearables, sensors, GPS: still growing
     • Volume is crazy
       • 20M+ foods logged per day
     • Bursts of data are something we need to support
       • Marathons/races: logging of runs & workouts increases by city/region
       • 3x burst in the first week of the new year for MyFitnessPal (predictable)
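To put the volume and burst figures above in throughput terms, here is a small worked calculation. It is only a back-of-the-envelope sketch: it assumes the 20M+ daily food logs are spread evenly over the day and that the 3x new-year burst applies uniformly to that rate, neither of which the slide states. Real traffic will peak well above the daily average around meal times.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate_per_second(events_per_day: int) -> float:
    """Average event rate, assuming events are spread evenly over the day."""
    return events_per_day / SECONDS_PER_DAY


def burst_rate_per_second(events_per_day: int, burst_factor: float) -> float:
    """Average rate during a burst that multiplies daily volume by burst_factor."""
    return average_rate_per_second(events_per_day) * burst_factor


# Figures from the slide: 20M+ foods logged per day, 3x burst in early January.
baseline = average_rate_per_second(20_000_000)
new_year = burst_rate_per_second(20_000_000, 3.0)
print(f"baseline: {baseline:.0f} events/sec, new-year burst: {new_year:.0f} events/sec")
```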
  8. What are we trying to get to ideally?
     [Architecture diagram. Labels shown:]
     • Apps, Business Services, Foundational Services, RDBMS, NoSQL
     • Kafka: Stream Data Platform
     • Stream Processing: Samza
     • Metrics
     • S3: Central Repository
     • Hive, Presto, MapReduce, Spark, Redshift, RDBMS
     • BI tools and visualization tools: Tableau, others
     • Latency tiers: Synchronous Req/Response (0-100s ms), Near Real-Time (> 100s ms), Offline Batch (> 1 hr)
     • Use case examples: Notifications, Weekly E-Mails
  9. A closer look
     [Data flow diagram. Labels shown:]
     • RDBMS: Foods, Workouts, Users, etc. data stored in MySQL, MongoDB & DynamoDB
     • Apps/Service Events: data stream
     • 3rd-party data: Marketo, Silverpop, iTunes, Google, Zuora, Stripe, …
     • Kafka, S3
     • Load modes: snapshots and batch uploads, incremental
     • [Hive, Presto], [Spark], [Business Intelligence]
  10. What are we trying to solve for?
      • Automation: Move away from the manual data replication process & the custom tools around it
      • Standardization: Data specs & directory structure across CF data
      • Consistency: Develop a common workflow to enable health & fitness use cases
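One way to picture the "directory structure" part of standardization is a single helper that every job uses to build object keys. The sketch below is purely illustrative: the layer names echo the Raw / Derived / Source of Truth stages from the deck, but the key layout, the `dt=` date partition, and the `s3_key` function are assumptions, not the team's actual convention.

```python
from datetime import date

# Hypothetical layer names, loosely following the deck's Raw / Derived / SoT stages.
LAYERS = {"raw", "derived", "sot"}


def s3_key(layer: str, dataset: str, day: date, filename: str) -> str:
    """Build a key of the assumed form <layer>/<dataset>/dt=<YYYY-MM-DD>/<filename>."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"{layer}/{dataset}/dt={day.isoformat()}/{filename}"


key = s3_key("raw", "food_entry_deleted", date(2015, 8, 5), "part-0000.json")
print(key)
```

Routing every writer through one function like this is what makes the layout enforceable: a typo in a layer name fails fast instead of creating a stray directory.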
  11. Common workflow to enable multiple use cases
      [Workflow diagram. Labels shown:]
      • Stages: Raw (JSON) -> Derived (ORC) -> Source of Truth (ORC; Insert, Update, Delete), with Hive between stages
      • Hive CSV exports to the Analytic Data Warehouse (Redshift, SAP HANA)
      • Use cases:
        • Analytics (BI, user insights across apps)
        • Data Products (Search Improvements): CSV, Redshift -> MySQL (move to NoSQL solution)
        • Data Science (Personalization, Recommendation): Databricks, Spark
        • Data Exploration: Presto, interactive ad-hoc queries
  12. A closer look
      [Data flow diagram. Labels shown:]
      • RDBMS: Foods, Workouts, Users, etc. data stored in MySQL, MongoDB & DynamoDB
      • Apps/Service Events: data stream
      • 3rd-party data: Marketo, Silverpop, iTunes, Google, Zuora, Stripe, …
      • Kafka, S3
      • [Spark], [Business Intelligence], [Hive, Presto]
  13. Getting data to S3 & storing in S3
      [Diagram: Kafka -> S3]
  14. Getting data to S3 & storing in S3
      [Diagram: Kafka -> S3]
      1) 20M+ food entries per day
      2) 2M+ workouts logged per day
      3) 2M+ unique user steps activity per day
  15. A closer look
      [Data flow diagram. Labels shown:]
      • RDBMS: Foods, Workouts, Users, etc. data stored in MySQL, MongoDB & DynamoDB
      • Apps/Service Events: data stream
      • 3rd-party data: Marketo, Silverpop, iTunes, Google, Zuora, Stripe, …
      • Kafka, S3
      • [Spark], [Business Intelligence], [Hive, Presto]
  16. How are we accessing raw data in S3?
      [Diagram: S3 -> Hive]
  17. External views for easy querying & transformation
      [Diagram: Raw (JSON) -> Derived (ORC)]
  18. Populating derived tables
      [Diagram: Raw (JSON) -> Derived (ORC)]
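The deck performs the raw-to-derived step in Hive, and the slide's query did not survive in this transcript. As a rough illustration of the logic only, the Python sketch below flattens one nested raw JSON event into a flat, typed row of the kind a columnar derived table would hold. The event shape and every field name (`user_id`, `timestamp`, `payload`, `food_id`, `calories`) are hypothetical.

```python
import json
from typing import NamedTuple


class FoodEntryRow(NamedTuple):
    """Flat, typed row for a hypothetical derived food-entry table."""
    user_id: int
    food_id: int
    calories: float
    logged_at: str


def to_derived_row(raw_line: str) -> FoodEntryRow:
    """Parse one raw JSON event and coerce its fields to the derived schema."""
    event = json.loads(raw_line)
    payload = event["payload"]
    return FoodEntryRow(
        user_id=int(event["user_id"]),
        food_id=int(payload["food_id"]),
        calories=float(payload["calories"]),
        logged_at=str(event["timestamp"]),
    )


# Hypothetical raw event: loosely typed values, nested payload.
raw = (
    '{"user_id": "42", "timestamp": "2015-08-05T12:00:00Z", '
    '"payload": {"food_id": 7, "calories": "95"}}'
)
row = to_derived_row(raw)
print(row)
```

Doing the type coercion once at this boundary means downstream queries can rely on the derived schema instead of re-parsing JSON.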
  19. Creating the Source of Truth
      [Diagram: Derived (ORC) -> SoT (ORC)]
      • Derived is the Source of Truth for non-transactional events
        • Food Search, Food Logged, Session Start events
      • Source of Truth for INSERTS/UPDATES/DELETES
        • We have the same workflow for food_entry_deleted
        • How do we combine the inserts, updates & deletes?
        • How do we backfill historical data?
  20. Source of Truth
      [Diagram: Derived (ORC) -> SoT (ORC)]
      • Relatively small data set:
      • Large data set:
        • We export to an ACID database, e.g. Redshift
      • http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
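The question of how to combine inserts, updates and deletes can be sketched as a reconcile pass: for each key keep only the most recent change, then drop keys whose latest change is a delete. The Python below is a minimal in-memory illustration of that idea, not the team's Hive workflow; the `Change` record, its field names, and the integer `modified_at` ordering are assumptions.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class Change(NamedTuple):
    """One change event for a food entry (hypothetical shape)."""
    entry_id: int
    op: str                    # "insert", "update", or "delete"
    modified_at: int           # assumed to increase with each later change
    calories: Optional[float]  # None for deletes


def reconcile(changes: Iterable[Change]) -> Dict[int, Change]:
    """Return the current state: latest change per entry_id, minus deleted entries."""
    latest: Dict[int, Change] = {}
    for change in changes:
        current = latest.get(change.entry_id)
        # Keep whichever change is newer; ties go to the one seen later.
        if current is None or change.modified_at >= current.modified_at:
            latest[change.entry_id] = change
    return {key: c for key, c in latest.items() if c.op != "delete"}


changes = [
    Change(1, "insert", 100, 250.0),
    Change(2, "insert", 101, 90.0),
    Change(1, "update", 105, 300.0),
    Change(2, "delete", 110, None),
]
truth = reconcile(changes)
print(truth)
```

Because the result depends only on the latest change per key, the same pass also covers backfills: historical records can be fed in alongside new ones in any order.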
  21. Food logged by day in a month
      [Chart: food entries per day over one month, with weekends marked]
      Food Entries table: 19B+ rows
  22. Hive? Presto? When to use what?
      • Hive
        • Batch processing
        • User-defined functions (UDFs)
        • Large aggregations
      • Presto
        • Interactive and ad-hoc queries
        • Data exploration
  23. Query performance on Workout data
      Workout table: 518M+ rows, 42 columns

      Query                                                          | Hive on ORC | Presto on ORC | Hive on CSV
      # of users that last logged a workout by day (after July 1st)  | 75 secs     | 7 secs        | 408 secs
      Select all columns by workout id                               | 91.8 secs   | 17 secs       | 366 secs

      [Chart: # of users that last logged a workout by day (after July 1st); x-axis 7/1 to 8/5, y-axis 0 to 600,000]
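The timings above are easier to compare as ratios. This short calculation uses only the numbers reported on the slide; the dictionary keys and the `speedup` helper are just labels chosen for this example.

```python
# Timings in seconds, as reported on the slide.
timings = {
    "users by last workout day": {"hive_orc": 75.0, "presto_orc": 7.0, "hive_csv": 408.0},
    "select all columns by workout id": {"hive_orc": 91.8, "presto_orc": 17.0, "hive_csv": 366.0},
}


def speedup(slow_secs: float, fast_secs: float) -> float:
    """How many times faster the second timing is than the first."""
    return slow_secs / fast_secs


speedups = {}
for query, t in timings.items():
    speedups[query] = {
        "orc_vs_csv_in_hive": speedup(t["hive_csv"], t["hive_orc"]),
        "presto_vs_hive_on_orc": speedup(t["hive_orc"], t["presto_orc"]),
    }
    print(
        f"{query}: ORC vs CSV in Hive {speedups[query]['orc_vs_csv_in_hive']:.1f}x, "
        f"Presto vs Hive on ORC {speedups[query]['presto_vs_hive_on_orc']:.1f}x"
    )
```

On these two queries, moving from CSV to ORC within Hive is worth roughly 4x to 5x, and running the same ORC data through Presto adds roughly another 5x to 11x.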
  24. ORC Format
      • Columnar structure
      • Type-specific encoding
      • Single SerDe for all ORC files
      • Type-safe vectorization
      • Works well with Presto
      • Hive ACID support
      Source: Apache Hive Wiki
  25. What next?
      1) Updating to Hive 1.2: ACID support (Source of Truth in Hive)
      2) Workflow manager & scheduler
      3) Self-serve data platform
  26. Summary
      • World’s largest health & fitness community
      • Unique collection of data (Food, Nutrition, Workout, Activities, Sleep, etc.)
      • Unique characteristics of data relating to Health & Fitness
      • Building data infrastructure to support current & future needs (BI, Data Products, Data Science, and Data Exploration)
      • Just getting started.
  27. Thanks
      Thank you. Questions?
