Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Yahoo's Next Generation User Profile Platform

1,376 views

Published on

Yahoo's Next Generation User Profile Platform

Published in: Technology
  • Be the first to comment

Yahoo's Next Generation User Profile Platform

  1. 1. 1 Yahoo’s Next Generation User Profile Platform Kai Liu, Lu Niu Yahoo Inc.
  2. 2. 2 Agenda - What is User Profile - Architecture Evolution - Schema Design - Optimization - Future Work
  3. 3. 3 Agenda - What is User Profile - Definition - Use Cases - Logical View - User ID Type - Architecture Evolution - Schema Design - Optimization - Future Work
  4. 4. 4 What is User Profile A User Profile is a visual display of personal data associated with a specific user. (Wikipedia)
  5. 5. 5 Use Cases
  6. 6. 6 Logical View
  7. 7. 7 User ID Type - Desktop - BID: for anonymous users - SID: for registered users - Mobile - IDFA: for iOS devices - GPSAID: for Android devices
  8. 8. 8 Agenda - What is User Profile - Architecture Evolution - Old architecture - Problems - New architecture - Schema Design - Optimization - Future Work
  9. 9. 9 Classic Architecture of Data System Data Preparation (ETL) Computation (Hadoop) Deep Storage (HDFS)
  10. 10. 10 Old Architecture HDFS (full) Hive AggregationETL Batch Data (daily, hourly, minutely) Ad Serving HDFS (incre) 1 day 1 day Modeling Insights
  11. 11. 11 Problems - Aggregation is very expensive - HDFS follows Write Once Read Many approach. - Actually only ~30% of users get updates every day. - Impossible to support multiple update frequencies - Lack of capability to process event stream
  12. 12. 12 - Spark - Fast - Consistent stack (batch/streaming) - HBase - Random read/write capabilities - Flexible schema - Hive - Large scale ad-hoc query engine - SQL like interface New Architecture Components
  13. 13. 13 New Architecture HBase Hive HDFS Kafka Batch Data Stream Data 10 mins - 1 day 1 - 10 secs Ad Serving Spark Batch Spark Streaming Modeling Insights
  14. 14. 14 How problems get solved - Incremental updates avoid full data load. - Multiple Spark jobs with different frequencies running concurrently. - Spark streaming for event stream processing.
  15. 15. 15 Agenda - What is User Profile - Architecture Evolution - Schema Design - Understand the data - Table design - Optimization - Future Work
  16. 16. 16 Understand the data * (1) Ad serving; (2) User Modeling; (3) Audience Insights Split user profile into multiple HBase tables. Data Type Update Pattern Use Cases Properties K/V pairs Overwrite (1)(2)(3) Events Time Series Append only (3) Segments List of K/V pairs Read-Modify-Write (1)(3) Features Hybrid Overwrite + Read-Modify-Write (1)(2)
  17. 17. 17 HBase Data Model
  18. 18. 18 Table Design - Properties Row Key Column Family: Properties c: age c: gender c: device1 c: device2 …... 0_284386766 1_1877933007 id_type + user_id val 1 val 2 val 3
  19. 19. 19 Table Design - Events Row Key Column Family: Events c: event 0_284386766_1463848639 0_284386766_1463935039 id_type + user_id + event_type + timestamp value Rows are sorted by timestamp
  20. 20. 20 Table Design - Segments Row Key Column Family: Segments c: type1 c: type2 c: type3 …... 0_284386766 1_1877933007 id_type + user_id * Different segments in different column to avoid atomic operation value
  21. 21. 21 Features Events Query “Get age, gender of user A” “Get events of user A from 05/21/2016 to 05/22/2016” Write Pattern ❏ Write only ❏ Keep multiple versions ❏ Append only ❏ Use TTL to auto-remove records Rollback ❏ Set TIMERANGE to fetch last version in application layer ❏ Filtered out bad records in application layer ❏ Deletion based on timestamp if necessary Different Access Patterns
  22. 22. 22 Agenda - What is User Profile - Architecture Evolution - Schema Design - Optimization - Pre-split tables - Pre-aggregation in Spark - Lazy aggregation for inactive users - Sequential read on Hive - Future Work
  23. 23. 23 Pre-Split Tables
  24. 24. 24 Pre-Split Tables - Data Skew: User data is not evenly distributed across different id types - Pre-split tables based on data distribution {SPLITS => ["x00x00x00x01x50", "x00x00x00x01xA0", "x00x00x00x02x00", "x00x00x00x02x40", "x00x00x00x02x80", "x00x00x00x02xC0", , "x00x00x00x03x00", "x00x00x00x04x00"] }
  25. 25. 25 - 1 Billion native ads events per day on 0.1 Billion users - Group by (user id, time interval) - Reduce the writes by 10X Pre-Aggregate events in Spark
  26. 26. 26 Pre-Aggregate features in Spark - 5 Billion app activities per day on 0.5 Billion devices - 1 Billion search keywords per day on 0.06 Billion devices - Aggregate on user id for both features. One Spark job instead of two.
  27. 27. 27 Lazy aggregation for inactive users - Problem: read-modify-write is expensive - Facts: - A large portion of the users might not be accessed frequently - Update jobs are not evenly distributed over time - Solution: Lazy aggregation for inactive users
  28. 28. 28 - Maintain a set of users as active users - Active users - read-modify-write - Inactive users - Append updates only - Merging updates: - Batch job - Upon request Lazy aggregation for inactive users Spark r-m-w w HBase r-m-w Active Users Inactive Users update1 update2
  29. 29. 29 Sequential read on Hive - HBase to Hive - Sync data to Hive using HBase snapshots without impact Region Servers. - Hive access the data using HBaseStorageHandler. - Move sequential reads to Hive - User modeling - Audience insights
  30. 30. 30 Agenda - What is User Profile - Architecture Evolution - Schema Design - Optimization - Future Work
  31. 31. 31 Future Work - Explore Impala/Presto for better query performance. - Expose API for incremental modeling capability;
  32. 32. 32 Questions?
  33. 33. 33 Appendix
  34. 34. 34 More optimization - Less column family as possible - Turn off autoflush - Throttling writes if necessary - Compress data before sending to Hbase - Kryo for serialization

×