1
Yahoo’s Next Generation
User Profile Platform
Kai Liu, Lu Niu
Yahoo Inc.
2
Agenda
- What is User Profile
- Architecture Evolution
- Schema Design
- Optimization
- Future Work
3
Agenda
- What is User Profile
  - Definition
  - Use Cases
  - Logical View
  - User ID Type
- Architecture Evolution
- Schema Design
- Optimization
- Future Work
4
What is User Profile
A User Profile is a visual display of personal data associated with a specific user.
(Wikipedia)
5
Use Cases
6
Logical View
7
User ID Type
- Desktop
  - BID: for anonymous users
  - SID: for registered users
- Mobile
  - IDFA: for iOS devices
  - GPSAID: for Android devices
8
Agenda
- What is User Profile
- Architecture Evolution
  - Old architecture
  - Problems
  - New architecture
- Schema Design
- Optimization
- Future Work
9
Classic Architecture of Data System
- Data Preparation (ETL)
- Computation (Hadoop)
- Deep Storage (HDFS)
10
Old Architecture
- Input: Batch Data (daily, hourly, minutely)
- Processing: ETL, Aggregation
- Storage: HDFS (incremental), HDFS (full), Hive
- Consumers: Ad Serving, Modeling, Insights
- Latency labels: 1 day, 1 day
11
Problems
- Aggregation is very expensive
  - HDFS follows a write-once-read-many approach.
  - Only ~30% of users get updates on a given day.
- Impossible to support multiple update frequencies
- No capability to process event streams
12
New Architecture Components
- Spark
  - Fast
  - Consistent stack (batch/streaming)
- HBase
  - Random read/write capabilities
  - Flexible schema
- Hive
  - Large-scale ad-hoc query engine
  - SQL-like interface
13
New Architecture
- Input: Batch Data (10 mins - 1 day), Stream Data (1 - 10 secs)
- Ingestion and processing: Kafka, Spark Batch, Spark Streaming
- Storage: HBase, HDFS, Hive
- Consumers: Ad Serving, Modeling, Insights
14
How problems get solved
- Incremental updates avoid full data load.
- Multiple Spark jobs with different frequencies running concurrently.
- Spark Streaming for event stream processing.
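The first point can be illustrated with a toy comparison. This is a minimal sketch, not the platform's code: plain Python dictionaries stand in for the profile store, and the function names `full_rebuild` and `incremental_update` are hypothetical. It only shows that an incremental update writes as many records as there are updated users, while a full rebuild rewrites every profile.

```python
import copy


def full_rebuild(profiles, daily_updates):
    """Old approach (sketch): rewrite every profile, changed or not."""
    rebuilt = {}
    for user_id, profile in profiles.items():
        merged = dict(profile)
        merged.update(daily_updates.get(user_id, {}))
        rebuilt[user_id] = merged
    # Users that appear for the first time in today's updates.
    for user_id, update in daily_updates.items():
        rebuilt.setdefault(user_id, dict(update))
    return rebuilt, len(rebuilt)  # second value: records written


def incremental_update(profiles, daily_updates):
    """New approach (sketch): touch only the users that have updates."""
    for user_id, update in daily_updates.items():
        profiles.setdefault(user_id, {}).update(update)
    return profiles, len(daily_updates)  # second value: records written


# Ten users, three of which (30%) receive an update.
profiles = {f"user{i}": {"age": 20 + i} for i in range(10)}
updates = {
    "user1": {"gender": "f"},
    "user4": {"age": 99},
    "user7": {"device1": "ios"},
}

rebuilt, full_writes = full_rebuild(profiles, updates)
updated, incr_writes = incremental_update(copy.deepcopy(profiles), updates)
```

Both paths end with the same profiles; only the number of records written differs.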
15
Agenda
- What is User Profile
- Architecture Evolution
- Schema Design
  - Understand the data
  - Table design
- Optimization
- Future Work
16
Understand the data
Split user profile into multiple HBase tables.
| | Data Type | Update Pattern | Use Cases* |
| --- | --- | --- | --- |
| Properties | K/V pairs | Overwrite | (1)(2)(3) |
| Events | Time series | Append only | (3) |
| Segments | List of K/V pairs | Read-modify-write | (1)(3) |
| Features | Hybrid | Overwrite + read-modify-write | (1)(2) |
* (1) Ad serving; (2) User Modeling; (3) Audience Insights
17
HBase Data Model
18
Table Design - Properties
- Row key: id_type + user_id (e.g. 0_284386766, 1_1877933007)
- Column family: Properties
- Columns: age, gender, device1, device2, ...
- Cells: val 1, val 2, val 3
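The Properties layout can be mimicked in memory to show the overwrite pattern. This is a sketch under assumptions: a dictionary of dictionaries stands in for the HBase table, and the class `PropertiesTable` and helper `properties_row_key` are hypothetical names. The numeric id_type prefix is taken as given because the slides do not say which ID type maps to which number.

```python
from collections import defaultdict


def properties_row_key(id_type, user_id):
    """Build a row key as id_type + user_id, e.g. '0_284386766'."""
    return f"{id_type}_{user_id}"


class PropertiesTable:
    """In-memory stand-in for a single-column-family table."""

    def __init__(self):
        self.rows = defaultdict(dict)  # row key -> {column: value}

    def put(self, id_type, user_id, column, value):
        # Overwrite pattern: the latest value simply replaces the old one.
        self.rows[properties_row_key(id_type, user_id)][column] = value

    def get(self, id_type, user_id, columns):
        row = self.rows.get(properties_row_key(id_type, user_id), {})
        return {c: row[c] for c in columns if c in row}


table = PropertiesTable()
table.put(0, 284386766, "age", 30)
table.put(0, 284386766, "gender", "m")
table.put(0, 284386766, "age", 31)  # overwrites the earlier age
result = table.get(0, 284386766, ["age", "gender"])
```

A lookup such as "get age, gender of user A" then touches exactly one row.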
19
Table Design - Events
- Row key: id_type + user_id + event_type + timestamp (e.g. 0_284386766_1463848639, 0_284386766_1463935039)
- Column family: Events
- Column: event
- Cell: value
- Rows are sorted by timestamp
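Because rows are stored in key order, a time-range query for one user becomes a range scan between two row keys. The sketch below imitates that with a sorted list and `bisect`. It assumes the three-part key layout of the example keys (id_type, user_id, timestamp) and a fixed-width timestamp so that string order matches time order. `EventsTable` and `event_row_key` are hypothetical names.

```python
import bisect


def event_row_key(id_type, user_id, timestamp):
    # Zero-padding to 10 digits keeps lexicographic order equal to time order.
    return f"{id_type}_{user_id}_{timestamp:010d}"


class EventsTable:
    """In-memory stand-in for an append-only table sorted by row key."""

    def __init__(self):
        self.keys = []    # row keys, kept sorted
        self.values = {}  # row key -> event

    def append(self, id_type, user_id, timestamp, event):
        key = event_row_key(id_type, user_id, timestamp)
        if key not in self.values:
            bisect.insort(self.keys, key)
        self.values[key] = event

    def scan(self, id_type, user_id, start_ts, stop_ts):
        """Return events with start_ts <= timestamp < stop_ts for one user."""
        start = event_row_key(id_type, user_id, start_ts)
        stop = event_row_key(id_type, user_id, stop_ts)
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        return [(k, self.values[k]) for k in self.keys[lo:hi]]


events = EventsTable()
events.append(0, 284386766, 1463848639, "click")
events.append(0, 284386766, 1463935039, "view")
events.append(0, 284386767, 1463848700, "click")  # a different user

# 1463788800 is 2016-05-21 00:00:00 UTC; 1463875200 is one day later.
one_day = events.scan(0, 284386766, 1463788800, 1463875200)
```

Only the first event of user 284386766 falls inside the scanned day; the other user's row is skipped because its key lies outside the scanned range.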
20
Table Design - Segments
- Row key: id_type + user_id (e.g. 0_284386766, 1_1877933007)
- Column family: Segments
- Columns: type1, type2, type3, ...
- Cell: value
* Different segment types are stored in different columns to avoid atomic operations.
21
Different Access Patterns
Features
- Query: "Get age, gender of user A"
- Write pattern: write only; keep multiple versions
- Rollback: set TIMERANGE to fetch the last version in the application layer
Events
- Query: "Get events of user A from 05/21/2016 to 05/22/2016"
- Write pattern: append only; use TTL to auto-remove records
- Rollback: filter out bad records in the application layer; delete based on timestamp if necessary
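The two rollback styles can be sketched without an HBase cluster. The code below is an in-memory imitation under assumptions: `VersionedCell` and `drop_expired` are hypothetical names, and the half-open interval [min, max) is used for the time range. It shows how keeping several versions lets a reader skip a bad write by narrowing the time range, and how a TTL ages out old events.

```python
class VersionedCell:
    """In-memory stand-in for a cell that keeps several timestamped versions."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = []  # (timestamp, value), oldest first

    def put(self, timestamp, value):
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda version: version[0])
        # Keep only the newest max_versions entries.
        self.versions = self.versions[-self.max_versions:]

    def get(self, time_range=None):
        """Return the newest value with min_ts <= timestamp < max_ts."""
        lo, hi = time_range if time_range else (0, float("inf"))
        candidates = [value for ts, value in self.versions if lo <= ts < hi]
        return candidates[-1] if candidates else None


def drop_expired(events, now, ttl_seconds):
    """Keep only (timestamp, event) pairs that are younger than the TTL."""
    return [(ts, event) for ts, event in events if now - ts < ttl_seconds]


age = VersionedCell()
age.put(100, 30)
age.put(200, 31)
age.put(300, -1)  # a bad write from a faulty job

latest = age.get()                # sees the bad value
rolled_back = age.get((0, 300))   # time range excludes the bad write

kept = drop_expired([(10, "a"), (90, "b")], now=100, ttl_seconds=50)
```

The rollback here is purely a read-side decision: nothing is deleted, the reader just asks for the last version before the bad timestamp.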
22
Agenda
- What is User Profile
- Architecture Evolution
- Schema Design
- Optimization
  - Pre-split tables
  - Pre-aggregation in Spark
  - Lazy aggregation for inactive users
  - Sequential read on Hive
- Future Work
23
Pre-Split Tables
24
Pre-Split Tables
- Data Skew: User data is not evenly distributed across different id types
- Pre-split tables based on data distribution
{SPLITS =>
["\x00\x00\x00\x01\x50",
"\x00\x00\x00\x01\xA0",
"\x00\x00\x00\x02\x00",
"\x00\x00\x00\x02\x40",
"\x00\x00\x00\x02\x80",
"\x00\x00\x00\x02\xC0",
"\x00\x00\x00\x03\x00",
"\x00\x00\x00\x04\x00"]
}
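One way to derive split points from the data distribution is to take evenly spaced quantiles of a sorted key sample. This is a sketch of that idea, not the procedure used for the split keys above. `choose_split_points` and `region_for` are hypothetical helpers, and readable string keys replace the binary keys for clarity.

```python
import bisect
from collections import Counter


def choose_split_points(sample_keys, num_regions):
    """Pick split keys so each region holds about the same share of the sample."""
    keys = sorted(sample_keys)
    step = len(keys) / num_regions
    points = []
    for i in range(1, num_regions):
        point = keys[int(i * step)]
        if not points or point != points[-1]:  # skip duplicate split keys
            points.append(point)
    return points


def region_for(key, split_points):
    """Index of the region holding key; a split key starts the next region."""
    return bisect.bisect_right(split_points, key)


# Skewed sample: 80% of the keys belong to id type 0, 20% to id type 1.
sample = [f"0_{i:03d}" for i in range(80)] + [f"1_{i:03d}" for i in range(20)]
splits = choose_split_points(sample, num_regions=4)
load = Counter(region_for(key, splits) for key in sample)
```

Splitting evenly by id type would put 80% of this sample into one region. The quantile-based split points instead place three boundaries inside id type 0.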
25
Pre-Aggregate events in Spark
- 1 Billion native ads events per day on 0.1 Billion users
- Group by (user id, time interval)
- Reduce the writes by 10X
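Grouping by (user id, time interval) can be sketched in plain Python. In a Spark job the same grouping would be expressed with a key-based aggregation, but this standalone version only illustrates the idea. The function name `pre_aggregate` and the one-hour interval are assumptions for illustration.

```python
from collections import defaultdict


def pre_aggregate(events, interval_seconds):
    """Group (user_id, timestamp, payload) events by (user_id, interval start)."""
    buckets = defaultdict(list)
    for user_id, timestamp, payload in events:
        bucket_start = timestamp - timestamp % interval_seconds
        buckets[(user_id, bucket_start)].append(payload)
    return dict(buckets)


raw_events = [
    ("u1", 1463848639, "ad_a"),
    ("u1", 1463848700, "ad_b"),
    ("u1", 1463849000, "ad_c"),
    ("u2", 1463848650, "ad_a"),
    ("u2", 1463852300, "ad_d"),
    ("u1", 1463935039, "ad_e"),
]

# One write per (user, hour) bucket instead of one write per event.
hourly = pre_aggregate(raw_events, interval_seconds=3600)
```

The write reduction depends on how many events share a bucket. With roughly ten events per user per day, as in the figures above, coarse buckets collapse most of them into a single write.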
26
Pre-Aggregate features in Spark
- 5 Billion app activities per day on 0.5 Billion devices
- 1 Billion search keywords per day on 0.06 Billion devices
- Aggregate on user id for both features. One Spark job instead of two.
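Combining both features in one pass means each device produces a single merged record rather than one record per feature. The sketch below uses plain Python with a hypothetical `aggregate_features` function. The field names `apps` and `keywords` are illustrative, not taken from the platform.

```python
from collections import defaultdict


def aggregate_features(app_activities, search_keywords):
    """Merge two (device_id, value) feature streams into one record per device."""
    profiles = defaultdict(lambda: {"apps": [], "keywords": []})
    for device_id, app in app_activities:
        profiles[device_id]["apps"].append(app)
    for device_id, keyword in search_keywords:
        profiles[device_id]["keywords"].append(keyword)
    return dict(profiles)


apps = [("d1", "mail"), ("d1", "news"), ("d2", "sports")]
keywords = [("d1", "weather"), ("d3", "flights")]

# Three devices in total, so three writes instead of up to five.
merged = aggregate_features(apps, keywords)
```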
27
Lazy aggregation for inactive users
- Problem: read-modify-write is expensive
- Facts:
  - A large portion of the users might not be accessed frequently
  - Update jobs are not evenly distributed over time
- Solution: Lazy aggregation for inactive users
28
Lazy aggregation for inactive users
- Maintain a set of users as active users
- Active users
  - read-modify-write
- Inactive users
  - Append updates only
- Merging updates:
  - Batch job
  - Upon request
Diagram labels: Spark, HBase, r-m-w (Active Users), w (Inactive Users: update1, update2)
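The scheme can be sketched as a small in-memory store. This is an illustration under assumptions, not the production design: `LazyProfileStore` is a hypothetical class, updates are modeled as counters to add, and `reads` counts how many read-modify-write cycles were performed.

```python
from collections import defaultdict


class LazyProfileStore:
    """Read-modify-write for active users, append-only for inactive users."""

    def __init__(self, active_users):
        self.active_users = set(active_users)
        self.profiles = {}                # merged counters per user
        self.pending = defaultdict(list)  # appended, not yet merged updates
        self.reads = 0                    # read-modify-write cycles performed

    def update(self, user_id, delta):
        if user_id in self.active_users:
            self._merge(user_id, [delta])        # read-modify-write now
        else:
            self.pending[user_id].append(delta)  # cheap append, merge later

    def _merge(self, user_id, deltas):
        self.reads += 1
        profile = dict(self.profiles.get(user_id, {}))
        for delta in deltas:
            for key, count in delta.items():
                profile[key] = profile.get(key, 0) + count
        self.profiles[user_id] = profile

    def get(self, user_id):
        """Merge upon request: fold pending updates in before returning."""
        if self.pending.get(user_id):
            self._merge(user_id, self.pending.pop(user_id))
        return self.profiles.get(user_id, {})

    def merge_all(self):
        """Batch job: fold in every pending update."""
        for user_id in list(self.pending):
            self._merge(user_id, self.pending.pop(user_id))


store = LazyProfileStore(active_users={"a"})
store.update("a", {"sports": 1})
store.update("a", {"sports": 2})
store.update("b", {"news": 1})
store.update("b", {"news": 1})
reads_before_request = store.reads
profile_b = store.get("b")
```

The inactive user costs one read-modify-write for two updates, paid only when the profile is requested or the batch merge runs.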
29
Sequential read on Hive
- HBase to Hive
  - Sync data to Hive using HBase snapshots, without impacting Region Servers.
  - Hive accesses the data using HBaseStorageHandler.
- Move sequential reads to Hive
  - User modeling
  - Audience insights
30
Agenda
- What is User Profile
- Architecture Evolution
- Schema Design
- Optimization
- Future Work
31
Future Work
- Explore Impala/Presto for better query performance.
- Expose an API for incremental modeling.
32
Questions?
33
Appendix
34
More optimization
- Use as few column families as possible
- Turn off autoflush
- Throttle writes if necessary
- Compress data before sending to HBase
- Use Kryo for serialization
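Two of these points, buffering writes on the client and compressing values before sending, can be sketched together. The code below is an illustration only. `BufferedWriter` is a hypothetical class, `send_batch` stands in for the call that ships a batch to the store, and JSON plus zlib are substitutes chosen for a self-contained example (Kryo itself is a JVM library).

```python
import json
import zlib


class BufferedWriter:
    """Collects puts and sends them in batches, like a client with autoflush off."""

    def __init__(self, send_batch, batch_size=3):
        self.send_batch = send_batch  # callable that receives a list of puts
        self.batch_size = batch_size
        self.buffer = []

    def put(self, row_key, value):
        # Serialize and compress on the client before the value leaves it.
        payload = zlib.compress(json.dumps(value).encode("utf-8"))
        self.buffer.append((row_key, payload))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send_batch(list(self.buffer))
            self.buffer.clear()


def decode(payload):
    """Reverse of the encoding used in BufferedWriter.put."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))


sent_batches = []
writer = BufferedWriter(sent_batches.append, batch_size=3)
for i in range(7):
    writer.put(f"0_{i}", {"age": 20 + i})
writer.flush()  # send whatever is left in the buffer
```

Seven puts result in three round trips here. The final explicit `flush` matters, because buffered writes are otherwise lost when the job ends.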

More Related Content

What's hot

Implementing a canonical IoT backend in Azure with Azure Stream Analytics
Implementing a canonical IoT backend in Azure with Azure Stream AnalyticsImplementing a canonical IoT backend in Azure with Azure Stream Analytics
Implementing a canonical IoT backend in Azure with Azure Stream AnalyticsMarco Parenzan
 
The Microsoft BigData Story
The Microsoft BigData StoryThe Microsoft BigData Story
The Microsoft BigData StoryLynn Langit
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in MotionRuhani Arora
 
Big Data on azure
Big Data on azureBig Data on azure
Big Data on azureDavid Giard
 
How to Build Modern Data Architectures Both On Premises and in the Cloud
How to Build Modern Data Architectures Both On Premises and in the CloudHow to Build Modern Data Architectures Both On Premises and in the Cloud
How to Build Modern Data Architectures Both On Premises and in the CloudVMware Tanzu
 
A lap around Azure Data Factory
A lap around Azure Data FactoryA lap around Azure Data Factory
A lap around Azure Data FactoryBizTalk360
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream DataDataWorks Summit
 
Cloud Big Data Architectures
Cloud Big Data ArchitecturesCloud Big Data Architectures
Cloud Big Data ArchitecturesLynn Langit
 
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | QuboleEbooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | QuboleVasu S
 
Personalization Journey: From Single Node to Cloud Streaming
Personalization Journey: From Single Node to Cloud StreamingPersonalization Journey: From Single Node to Cloud Streaming
Personalization Journey: From Single Node to Cloud StreamingDatabricks
 
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...Databricks
 
HBaseCon 2015: Running ML Infrastructure on HBase
HBaseCon 2015: Running ML Infrastructure on HBaseHBaseCon 2015: Running ML Infrastructure on HBase
HBaseCon 2015: Running ML Infrastructure on HBaseHBaseCon
 
From SQL to NoSQL - StampedeCon 2015
From SQL to NoSQL  - StampedeCon 2015From SQL to NoSQL  - StampedeCon 2015
From SQL to NoSQL - StampedeCon 2015StampedeCon
 
AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics
AWS July Webinar Series: Amazon Redshift Reporting and Advanced AnalyticsAWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics
AWS July Webinar Series: Amazon Redshift Reporting and Advanced AnalyticsAmazon Web Services
 
Analyzing StackExchange data with Azure Data Lake
Analyzing StackExchange data with Azure Data LakeAnalyzing StackExchange data with Azure Data Lake
Analyzing StackExchange data with Azure Data LakeBizTalk360
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data PipelineJesus Rodriguez
 
A developer's introduction to big data processing with Azure Databricks
A developer's introduction to big data processing with Azure DatabricksA developer's introduction to big data processing with Azure Databricks
A developer's introduction to big data processing with Azure DatabricksMicrosoft Tech Community
 
Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of Amazon
Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of AmazonBig Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of Amazon
Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of AmazonData Con LA
 
Add Historical Analysis of Operational Data with Easy Configurations in Fivet...
Add Historical Analysis of Operational Data with Easy Configurations in Fivet...Add Historical Analysis of Operational Data with Easy Configurations in Fivet...
Add Historical Analysis of Operational Data with Easy Configurations in Fivet...Databricks
 
Data Modeling IoT and Time Series data in NoSQL
Data Modeling IoT and Time Series data in NoSQLData Modeling IoT and Time Series data in NoSQL
Data Modeling IoT and Time Series data in NoSQLBasho Technologies
 

What's hot (20)

Implementing a canonical IoT backend in Azure with Azure Stream Analytics
Implementing a canonical IoT backend in Azure with Azure Stream AnalyticsImplementing a canonical IoT backend in Azure with Azure Stream Analytics
Implementing a canonical IoT backend in Azure with Azure Stream Analytics
 
The Microsoft BigData Story
The Microsoft BigData StoryThe Microsoft BigData Story
The Microsoft BigData Story
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in Motion
 
Big Data on azure
Big Data on azureBig Data on azure
Big Data on azure
 
How to Build Modern Data Architectures Both On Premises and in the Cloud
How to Build Modern Data Architectures Both On Premises and in the CloudHow to Build Modern Data Architectures Both On Premises and in the Cloud
How to Build Modern Data Architectures Both On Premises and in the Cloud
 
A lap around Azure Data Factory
A lap around Azure Data FactoryA lap around Azure Data Factory
A lap around Azure Data Factory
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
 
Cloud Big Data Architectures
Cloud Big Data ArchitecturesCloud Big Data Architectures
Cloud Big Data Architectures
 
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | QuboleEbooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
 
Personalization Journey: From Single Node to Cloud Streaming
Personalization Journey: From Single Node to Cloud StreamingPersonalization Journey: From Single Node to Cloud Streaming
Personalization Journey: From Single Node to Cloud Streaming
 
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
 
HBaseCon 2015: Running ML Infrastructure on HBase
HBaseCon 2015: Running ML Infrastructure on HBaseHBaseCon 2015: Running ML Infrastructure on HBase
HBaseCon 2015: Running ML Infrastructure on HBase
 
From SQL to NoSQL - StampedeCon 2015
From SQL to NoSQL  - StampedeCon 2015From SQL to NoSQL  - StampedeCon 2015
From SQL to NoSQL - StampedeCon 2015
 
AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics
AWS July Webinar Series: Amazon Redshift Reporting and Advanced AnalyticsAWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics
AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics
 
Analyzing StackExchange data with Azure Data Lake
Analyzing StackExchange data with Azure Data LakeAnalyzing StackExchange data with Azure Data Lake
Analyzing StackExchange data with Azure Data Lake
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data Pipeline
 
A developer's introduction to big data processing with Azure Databricks
A developer's introduction to big data processing with Azure DatabricksA developer's introduction to big data processing with Azure Databricks
A developer's introduction to big data processing with Azure Databricks
 
Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of Amazon
Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of AmazonBig Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of Amazon
Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of Amazon
 
Add Historical Analysis of Operational Data with Easy Configurations in Fivet...
Add Historical Analysis of Operational Data with Easy Configurations in Fivet...Add Historical Analysis of Operational Data with Easy Configurations in Fivet...
Add Historical Analysis of Operational Data with Easy Configurations in Fivet...
 
Data Modeling IoT and Time Series data in NoSQL
Data Modeling IoT and Time Series data in NoSQLData Modeling IoT and Time Series data in NoSQL
Data Modeling IoT and Time Series data in NoSQL
 

Viewers also liked

User needs and legally ruled collaboration in the VirtualLife virtual world p...
User needs and legally ruled collaboration in the VirtualLife virtual world p...User needs and legally ruled collaboration in the VirtualLife virtual world p...
User needs and legally ruled collaboration in the VirtualLife virtual world p...Vytautas Čyras
 
Technical rules and legal rules in online virtual worlds
Technical rules and legal rules in online virtual worldsTechnical rules and legal rules in online virtual worlds
Technical rules and legal rules in online virtual worldsVytautas Čyras
 
Team Virtual Technology Presentation
Team Virtual Technology Presentation Team Virtual Technology Presentation
Team Virtual Technology Presentation William Allen
 
Building Strong Virtual Teams
Building Strong Virtual TeamsBuilding Strong Virtual Teams
Building Strong Virtual TeamsOlivier Serrat
 
Virtual technology
Virtual technologyVirtual technology
Virtual technologyStudent
 
презентация Ae mind+id_entity_ivory_e_24.02.16
презентация Ae mind+id_entity_ivory_e_24.02.16презентация Ae mind+id_entity_ivory_e_24.02.16
презентация Ae mind+id_entity_ivory_e_24.02.16Arseniy Tretyakov
 
The Web Portal Platform - Enabling the Smart Campus of the Future
The Web Portal Platform - Enabling the Smart Campus of the FutureThe Web Portal Platform - Enabling the Smart Campus of the Future
The Web Portal Platform - Enabling the Smart Campus of the FutureEduserv
 
Visualization of Hajime Yoshino’s Logical Jurisprudence. IRIS 2017
Visualization of Hajime Yoshino’s Logical Jurisprudence. IRIS 2017Visualization of Hajime Yoshino’s Logical Jurisprudence. IRIS 2017
Visualization of Hajime Yoshino’s Logical Jurisprudence. IRIS 2017Vytautas Čyras
 
RTB Update 2: Richard Foster, Krux
RTB Update 2: Richard Foster, KruxRTB Update 2: Richard Foster, Krux
RTB Update 2: Richard Foster, KruxHusetMarkedsforing
 
Qcon SF 2013 - Machine Learning & Recommender Systems @ Netflix Scale
Qcon SF 2013 - Machine Learning & Recommender Systems @ Netflix ScaleQcon SF 2013 - Machine Learning & Recommender Systems @ Netflix Scale
Qcon SF 2013 - Machine Learning & Recommender Systems @ Netflix ScaleXavier Amatriain
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupBlake Irvine
 
Scalable And Incremental Data Profiling With Spark
Scalable And Incremental Data Profiling With SparkScalable And Incremental Data Profiling With Spark
Scalable And Incremental Data Profiling With SparkJen Aman
 
Recommendation at Netflix Scale
Recommendation at Netflix ScaleRecommendation at Netflix Scale
Recommendation at Netflix ScaleJustin Basilico
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and SparkLucidworks
 
Mobile operating system..
Mobile operating system..Mobile operating system..
Mobile operating system..Aashish Uppal
 
Netflix Recommendations - Beyond the 5 Stars
Netflix Recommendations - Beyond the 5 StarsNetflix Recommendations - Beyond the 5 Stars
Netflix Recommendations - Beyond the 5 StarsXavier Amatriain
 
What Is That DMP Good For, Anyway?
What Is That DMP Good For, Anyway?What Is That DMP Good For, Anyway?
What Is That DMP Good For, Anyway?MediaPost
 
Solr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchSolr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchCloudera, Inc.
 

Viewers also liked (19)

User needs and legally ruled collaboration in the VirtualLife virtual world p...
User needs and legally ruled collaboration in the VirtualLife virtual world p...User needs and legally ruled collaboration in the VirtualLife virtual world p...
User needs and legally ruled collaboration in the VirtualLife virtual world p...
 
Technical rules and legal rules in online virtual worlds
Technical rules and legal rules in online virtual worldsTechnical rules and legal rules in online virtual worlds
Technical rules and legal rules in online virtual worlds
 
Team Virtual Technology Presentation
Team Virtual Technology Presentation Team Virtual Technology Presentation
Team Virtual Technology Presentation
 
Building Strong Virtual Teams
Building Strong Virtual TeamsBuilding Strong Virtual Teams
Building Strong Virtual Teams
 
Virtual technology
Virtual technologyVirtual technology
Virtual technology
 
презентация Ae mind+id_entity_ivory_e_24.02.16
презентация Ae mind+id_entity_ivory_e_24.02.16презентация Ae mind+id_entity_ivory_e_24.02.16
презентация Ae mind+id_entity_ivory_e_24.02.16
 
The Web Portal Platform - Enabling the Smart Campus of the Future
The Web Portal Platform - Enabling the Smart Campus of the FutureThe Web Portal Platform - Enabling the Smart Campus of the Future
The Web Portal Platform - Enabling the Smart Campus of the Future
 
Taming Text
Taming TextTaming Text
Taming Text
 
Visualization of Hajime Yoshino’s Logical Jurisprudence. IRIS 2017
Visualization of Hajime Yoshino’s Logical Jurisprudence. IRIS 2017Visualization of Hajime Yoshino’s Logical Jurisprudence. IRIS 2017
Visualization of Hajime Yoshino’s Logical Jurisprudence. IRIS 2017
 
RTB Update 2: Richard Foster, Krux
RTB Update 2: Richard Foster, KruxRTB Update 2: Richard Foster, Krux
RTB Update 2: Richard Foster, Krux
 
Qcon SF 2013 - Machine Learning & Recommender Systems @ Netflix Scale
Qcon SF 2013 - Machine Learning & Recommender Systems @ Netflix ScaleQcon SF 2013 - Machine Learning & Recommender Systems @ Netflix Scale
Qcon SF 2013 - Machine Learning & Recommender Systems @ Netflix Scale
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering Meetup
 
Scalable And Incremental Data Profiling With Spark
Scalable And Incremental Data Profiling With SparkScalable And Incremental Data Profiling With Spark
Scalable And Incremental Data Profiling With Spark
 
Recommendation at Netflix Scale
Recommendation at Netflix ScaleRecommendation at Netflix Scale
Recommendation at Netflix Scale
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
 
Mobile operating system..
Mobile operating system..Mobile operating system..
Mobile operating system..
 
Netflix Recommendations - Beyond the 5 Stars
Netflix Recommendations - Beyond the 5 StarsNetflix Recommendations - Beyond the 5 Stars
Netflix Recommendations - Beyond the 5 Stars
 
What Is That DMP Good For, Anyway?
What Is That DMP Good For, Anyway?What Is That DMP Good For, Anyway?
What Is That DMP Good For, Anyway?
 
Solr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchSolr+Hadoop = Big Data Search
Solr+Hadoop = Big Data Search
 

Similar to Yahoo's Next Generation User Profile Platform

Real-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studyReal-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studydeep.bi
 
Black friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchBlack friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchSylvain Wallez
 
Architecting for change: LinkedIn's new data ecosystem
Architecting for change: LinkedIn's new data ecosystemArchitecting for change: LinkedIn's new data ecosystem
Architecting for change: LinkedIn's new data ecosystemYael Garten
 
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystemStrata 2016 - Architecting for Change: LinkedIn's new data ecosystem
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystemShirshanka Das
 
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.GeeksLab Odessa
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapePaco Nathan
 
Data Modeling for IoT and Big Data
Data Modeling for IoT and Big DataData Modeling for IoT and Big Data
Data Modeling for IoT and Big DataJayesh Thakrar
 
Accelerating open science and AI with automated, portable, customizable and r...
Accelerating open science and AI with automated, portable, customizable and r...Accelerating open science and AI with automated, portable, customizable and r...
Accelerating open science and AI with automated, portable, customizable and r...Grigori Fursin
 
Redis Streams plus Spark Structured Streaming
Redis Streams plus Spark Structured StreamingRedis Streams plus Spark Structured Streaming
Redis Streams plus Spark Structured StreamingDave Nielsen
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaGoDataDriven
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopZheng Shao
 
Managing your Black Friday Logs NDC Oslo
Managing your  Black Friday Logs NDC OsloManaging your  Black Friday Logs NDC Oslo
Managing your Black Friday Logs NDC OsloDavid Pilato
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...confluent
 
Mastering MapReduce: MapReduce for Big Data Management and Analysis
Mastering MapReduce: MapReduce for Big Data Management and AnalysisMastering MapReduce: MapReduce for Big Data Management and Analysis
Mastering MapReduce: MapReduce for Big Data Management and AnalysisTeradata Aster
 
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)Sascha Dittmann
 
Managing your black friday logs - Code Europe
Managing your black friday logs - Code EuropeManaging your black friday logs - Code Europe
Managing your black friday logs - Code EuropeDavid Pilato
 
Fast NoSQL from HDDs?
Fast NoSQL from HDDs? Fast NoSQL from HDDs?
Fast NoSQL from HDDs? ScyllaDB
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADtab0ris_1
 
2018-10-18 J2 1D - Dive into the power of the Microsoft Graph - Toni Pohl
2018-10-18 J2 1D - Dive into the power of the Microsoft Graph - Toni Pohl2018-10-18 J2 1D - Dive into the power of the Microsoft Graph - Toni Pohl
2018-10-18 J2 1D - Dive into the power of the Microsoft Graph - Toni PohlModern Workplace Conference Paris
 

Similar to Yahoo's Next Generation User Profile Platform (20)

Real-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studyReal-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case study
 
Black friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchBlack friday logs - Scaling Elasticsearch
Black friday logs - Scaling Elasticsearch
 
Architecting for change: LinkedIn's new data ecosystem
Architecting for change: LinkedIn's new data ecosystemArchitecting for change: LinkedIn's new data ecosystem
Architecting for change: LinkedIn's new data ecosystem
 
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystemStrata 2016 - Architecting for Change: LinkedIn's new data ecosystem
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem
 
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscape
 
Data Modeling for IoT and Big Data
Data Modeling for IoT and Big DataData Modeling for IoT and Big Data
Data Modeling for IoT and Big Data
 
Accelerating open science and AI with automated, portable, customizable and r...
Accelerating open science and AI with automated, portable, customizable and r...Accelerating open science and AI with automated, portable, customizable and r...
Accelerating open science and AI with automated, portable, customizable and r...
 
Redis Streams plus Spark Structured Streaming
Redis Streams plus Spark Structured StreamingRedis Streams plus Spark Structured Streaming
Redis Streams plus Spark Structured Streaming
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Managing your Black Friday Logs NDC Oslo
Managing your  Black Friday Logs NDC OsloManaging your  Black Friday Logs NDC Oslo
Managing your Black Friday Logs NDC Oslo
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
 
Mastering MapReduce: MapReduce for Big Data Management and Analysis
Mastering MapReduce: MapReduce for Big Data Management and AnalysisMastering MapReduce: MapReduce for Big Data Management and Analysis
Mastering MapReduce: MapReduce for Big Data Management and Analysis
 
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
 
Managing your black friday logs - Code Europe
Managing your black friday logs - Code EuropeManaging your black friday logs - Code Europe
Managing your black friday logs - Code Europe
 
Fast NoSQL from HDDs?
Fast NoSQL from HDDs? Fast NoSQL from HDDs?
Fast NoSQL from HDDs?
 
Mihai_Nuta
Mihai_NutaMihai_Nuta
Mihai_Nuta
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADta
 
2018-10-18 J2 1D - Dive into the power of the Microsoft Graph - Toni Pohl
2018-10-18 J2 1D - Dive into the power of the Microsoft Graph - Toni Pohl2018-10-18 J2 1D - Dive into the power of the Microsoft Graph - Toni Pohl
2018-10-18 J2 1D - Dive into the power of the Microsoft Graph - Toni Pohl
 

More from DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Yahoo's Next Generation User Profile Platform

  • 1. Yahoo’s Next Generation User Profile Platform. Kai Liu, Lu Niu, Yahoo Inc.
  • 2. Agenda - What is User Profile - Architecture Evolution - Schema Design - Optimization - Future Work
  • 3. Agenda - What is User Profile (Definition; Use Cases; Logical View; User ID Type) - Architecture Evolution - Schema Design - Optimization - Future Work
  • 4. What is User Profile: A User Profile is a visual display of personal data associated with a specific user. (Wikipedia)
  • 7. User ID Type - Desktop: BID for anonymous users; SID for registered users - Mobile: IDFA for iOS devices; GPSAID for Android devices
  • 8. Agenda - What is User Profile - Architecture Evolution (Old architecture; Problems; New architecture) - Schema Design - Optimization - Future Work
  • 9. Classic Architecture of Data System (diagram labels): Data Preparation (ETL); Computation (Hadoop); Deep Storage (HDFS)
  • 10. Old Architecture (diagram labels): Batch Data (daily, hourly, minutely); ETL; HDFS (incre); Aggregation; HDFS (full); Hive; Ad Serving; Modeling; Insights; 1 day; 1 day
  • 11. Problems - Aggregation is very expensive: HDFS follows a write-once, read-many approach; only ~30% of users actually get updates every day - Impossible to support multiple update frequencies - Lack of capability to process event streams
  • 12. New Architecture Components - Spark: fast; consistent stack (batch/streaming) - HBase: random read/write capabilities; flexible schema - Hive: large-scale ad-hoc query engine; SQL-like interface
  • 13. New Architecture (diagram labels): Batch Data; Stream Data; Kafka; Spark Batch (10 mins - 1 day); Spark Streaming (1 - 10 secs); HBase; Hive; HDFS; Ad Serving; Modeling; Insights
  • 14. How problems get solved - Incremental updates avoid full data load - Multiple Spark jobs with different frequencies running concurrently - Spark Streaming for event stream processing
  • 15. Agenda - What is User Profile - Architecture Evolution - Schema Design (Understand the data; Table design) - Optimization - Future Work
  • 16. Understand the data. Split user profile into multiple HBase tables.

        Table       Data Type          Update Pattern                  Use Cases*
        Properties  K/V pairs          Overwrite                       (1)(2)(3)
        Events      Time series        Append only                     (3)
        Segments    List of K/V pairs  Read-Modify-Write               (1)(3)
        Features    Hybrid             Overwrite + Read-Modify-Write   (1)(2)

        * (1) Ad serving; (2) User modeling; (3) Audience insights
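
The three basic update patterns named on this slide can be illustrated with a minimal Python sketch. It uses plain dictionaries as hypothetical stand-ins for the HBase tables; the function names and the data layout are assumptions made for illustration, not the platform's actual API.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the HBase tables (assumption).
properties = {}                 # user -> {name: value}
events = defaultdict(list)      # user -> [(timestamp, value), ...]
segments = defaultdict(dict)    # user -> {segment_type: {key: value}}


def overwrite_property(user, name, value):
    """Overwrite: the newest value simply replaces the old one."""
    properties.setdefault(user, {})[name] = value


def append_event(user, timestamp, value):
    """Append only: existing records are never read or changed."""
    events[user].append((timestamp, value))


def update_segment(user, segment_type, key, value):
    """Read-modify-write: read the current K/V list, change it, write it back."""
    current = dict(segments[user].get(segment_type, {}))  # read
    current[key] = value                                   # modify
    segments[user][segment_type] = current                 # write


overwrite_property("0_284386766", "age", "25-34")
overwrite_property("0_284386766", "age", "35-44")   # replaces the first value
append_event("0_284386766", 1463848639, "click")
append_event("0_284386766", 1463935039, "view")     # both events are kept
update_segment("0_284386766", "type1", "sports", 1)
update_segment("0_284386766", "type1", "travel", 1)  # merged into the same list
```

Only the read-modify-write path needs to see the existing record before writing, which is why the later optimization slides treat it as the expensive case.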
  • 18. Table Design - Properties. Row key: id_type + user_id (e.g. 0_284386766, 1_1877933007). Column family: Properties, with columns c: age, c: gender, c: device1, c: device2, … holding the values (val 1, val 2, val 3).
  • 19. Table Design - Events. Row key: id_type + user_id + event_type + timestamp (e.g. 0_284386766_1463848639, 0_284386766_1463935039). Column family: Events, with column c: event holding the value. Rows are sorted by timestamp.
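
A minimal Python sketch of why a timestamp suffix in the row key makes time-range reads cheap: because rows are stored in key order, all events of one user in a time window form one contiguous range. The helper names are hypothetical. The key layout follows the example keys on the slide (id_type, user_id, Unix seconds) and leaves out event_type, and the zero-padding of the timestamp is an assumption so that string order matches time order.

```python
from bisect import bisect_left


def properties_key(id_type, user_id):
    """Row key for the Properties and Segments tables: id_type + user_id."""
    return f"{id_type}_{user_id}"


def event_key(id_type, user_id, timestamp):
    """Row key for the Events table, following the slide's example keys.

    Zero-padding to 10 digits is an assumption so that lexicographic
    order of keys equals chronological order.
    """
    return f"{id_type}_{user_id}_{timestamp:010d}"


def scan(sorted_keys, start_key, stop_key):
    """Return keys in [start_key, stop_key), like a scan with start/stop rows."""
    lo = bisect_left(sorted_keys, start_key)
    hi = bisect_left(sorted_keys, stop_key)
    return sorted_keys[lo:hi]


keys = sorted([
    event_key(0, 284386766, 1463848639),
    event_key(0, 284386766, 1463935039),
    event_key(1, 1877933007, 1463848639),
])

# All events of user 0_284386766 before timestamp 1463900000.
window = scan(keys,
              event_key(0, 284386766, 0),
              event_key(0, 284386766, 1463900000))
```

The scan touches only the rows between the two bounds, so a query such as "events of user A between two dates" does not need to read the user's full history.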
  • 20. Table Design - Segments. Row key: id_type + user_id (e.g. 0_284386766, 1_1877933007). Column family: Segments, with columns c: type1, c: type2, c: type3, … holding the values. * Different segments go in different columns to avoid atomic operations.
  • 21. Different Access Patterns

        Features
          Query: “Get age, gender of user A”
          Write pattern: write only; keep multiple versions
          Rollback: set TIMERANGE to fetch the last version in the application layer
        Events
          Query: “Get events of user A from 05/21/2016 to 05/22/2016”
          Write pattern: append only; use TTL to auto-remove records
          Rollback: filter out bad records in the application layer; deletion based on timestamp if necessary
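
The rollback idea for multi-version data can be sketched in Python as a toy model, not as HBase client code. The class and function names are hypothetical. The sketch assumes a cell keeps a bounded number of timestamped versions and that a time range is inclusive at the lower bound and exclusive at the upper bound, as in HBase's time-range reads.

```python
class VersionedCell:
    """Toy model of one cell that keeps several timestamped versions."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = []  # (timestamp, value) pairs, newest first

    def put(self, timestamp, value):
        """Write only: add a new version, drop the oldest beyond the limit."""
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda tv: tv[0], reverse=True)
        del self.versions[self.max_versions:]

    def get(self, min_ts=0, max_ts=float("inf")):
        """Newest value with min_ts <= timestamp < max_ts, or None."""
        for ts, value in self.versions:
            if min_ts <= ts < max_ts:
                return value
        return None


def live_events(events, now, ttl_seconds):
    """Events still inside the TTL window; older ones would be auto-removed."""
    return [(ts, v) for ts, v in events if now - ts < ttl_seconds]


age = VersionedCell()
age.put(100, "25-34")
age.put(200, "bad-value")       # a faulty update written at t=200

latest = age.get()              # sees the faulty version
rolled_back = age.get(max_ts=200)  # time range excludes the faulty write
```

Rolling back here means reading with an upper bound just before the bad write, so nothing has to be rewritten in storage.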
  • 22. Agenda - What is User Profile - Architecture Evolution - Schema Design - Optimization (Pre-split tables; Pre-aggregation in Spark; Lazy aggregation for inactive users; Sequential read on Hive) - Future Work
  • 24. Pre-Split Tables - Data skew: user data is not evenly distributed across different id types - Pre-split tables based on data distribution: {SPLITS => ["\x00\x00\x00\x01\x50", "\x00\x00\x00\x01\xA0", "\x00\x00\x00\x02\x00", "\x00\x00\x00\x02\x40", "\x00\x00\x00\x02\x80", "\x00\x00\x00\x02\xC0", "\x00\x00\x00\x03\x00", "\x00\x00\x00\x04\x00"]}
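
How a list of split points divides the key space can be sketched in Python. This assumes the split points on the slide are binary row-key prefixes written with \x hex escapes, and that each region covers the keys from its start key (inclusive) up to the next split point (exclusive). The function name is hypothetical.

```python
from bisect import bisect_right

# Split points from the slide, read as raw bytes (assumption).
SPLITS = [
    b"\x00\x00\x00\x01\x50",
    b"\x00\x00\x00\x01\xA0",
    b"\x00\x00\x00\x02\x00",
    b"\x00\x00\x00\x02\x40",
    b"\x00\x00\x00\x02\x80",
    b"\x00\x00\x00\x02\xC0",
    b"\x00\x00\x00\x03\x00",
    b"\x00\x00\x00\x04\x00",
]


def region_index(row_key: bytes) -> int:
    """Index of the region holding row_key.

    N split points create N + 1 regions. A key equal to a split point
    belongs to the region that starts at that split point.
    """
    return bisect_right(SPLITS, row_key)


first = region_index(b"\x00\x00\x00\x00\xFF")    # before the first split
boundary = region_index(b"\x00\x00\x00\x01\x50")  # exactly on the first split
last = region_index(b"\x00\x00\x00\x05")          # after the last split
```

The split points on the slide are not evenly spaced: four of the eight fall in the `\x00\x00\x00\x02` prefix range, which matches the stated goal of splitting according to the observed data distribution rather than uniformly.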
  • 25. Pre-Aggregate events in Spark - 1 billion native ads events per day on 0.1 billion users - Group by (user id, time interval) - Reduce the writes by 10X
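
With 1 billion events spread over 0.1 billion users, each user produces about 10 events per day on average, so collapsing events per user and time bucket before writing can cut the number of writes substantially. The grouping step can be sketched in plain Python (not Spark); the function name, the event tuple layout, and the 10-minute interval are assumptions for illustration.

```python
from collections import defaultdict


def pre_aggregate(events, interval_seconds=600):
    """Collapse raw (user_id, timestamp, count) events into one record
    per (user_id, start of time bucket)."""
    buckets = defaultdict(int)
    for user_id, timestamp, count in events:
        bucket_start = timestamp - timestamp % interval_seconds
        buckets[(user_id, bucket_start)] += count
    return dict(buckets)


raw = [
    ("u1", 1000, 1),
    ("u1", 1100, 1),   # same 600-second bucket as the first event
    ("u1", 1300, 1),   # falls into the next bucket
    ("u2", 1000, 1),
]
aggregated = pre_aggregate(raw)
```

In this example four raw events become three writes; the saving grows with the number of events that share a user and a bucket.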
  • 26. Pre-Aggregate features in Spark - 5 billion app activities per day on 0.5 billion devices - 1 billion search keywords per day on 0.06 billion devices - Aggregate on user id for both features: one Spark job instead of two
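
Aggregating two feature sources on the same user id can be sketched in plain Python as a single grouping pass that yields one combined record per user. The function name and the record layout (`apps`, `keywords`) are hypothetical, chosen only to illustrate the idea of one job producing one write per user.

```python
from collections import defaultdict


def merge_features(app_activities, search_keywords):
    """Group both inputs by user id so each user is written once."""
    merged = defaultdict(lambda: {"apps": [], "keywords": []})
    for user_id, app in app_activities:
        merged[user_id]["apps"].append(app)
    for user_id, keyword in search_keywords:
        merged[user_id]["keywords"].append(keyword)
    return dict(merged)


apps = [("d1", "mail"), ("d1", "news"), ("d2", "weather")]
searches = [("d1", "flights"), ("d3", "hotels")]
profiles = merge_features(apps, searches)
```

A user who appears in both sources, such as `d1` above, gets one combined record instead of two separate updates.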
  • 27. Lazy aggregation for inactive users - Problem: read-modify-write is expensive - Facts: a large portion of the users might not be accessed frequently; update jobs are not evenly distributed over time - Solution: lazy aggregation for inactive users
  • 28. Lazy aggregation for inactive users - Maintain a set of users as active users - Active users: read-modify-write - Inactive users: append updates only - Merging updates: batch job; upon request (Diagram labels: Spark; r-m-w; w; HBase; Active Users; Inactive Users; update1; update2)
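
The lazy aggregation scheme can be sketched in Python as follows. The class is a hypothetical in-memory model, and a numeric running total stands in for whatever aggregate the real profile holds; both are assumptions for illustration.

```python
class LazyProfileStore:
    """Toy model: eager read-modify-write for active users,
    append-only updates for inactive users, merged later."""

    def __init__(self, active_users):
        self.active_users = set(active_users)
        self.aggregated = {}  # user -> merged value
        self.pending = {}     # user -> appended, not yet merged updates

    def update(self, user, delta):
        if user in self.active_users:
            # Active user: read-modify-write immediately.
            self.aggregated[user] = self.aggregated.get(user, 0) + delta
        else:
            # Inactive user: cheap append, no read needed.
            self.pending.setdefault(user, []).append(delta)

    def merge(self, user):
        """Fold a user's pending updates into the aggregated value."""
        for delta in self.pending.pop(user, []):
            self.aggregated[user] = self.aggregated.get(user, 0) + delta

    def read(self, user):
        """Merge upon request, then return the aggregated value."""
        self.merge(user)
        return self.aggregated.get(user, 0)

    def merge_all(self):
        """Batch job: merge pending updates for every inactive user."""
        for user in list(self.pending):
            self.merge(user)


store = LazyProfileStore(active_users={"a"})
store.update("a", 1)
store.update("b", 2)
store.update("b", 3)
pending_before = dict(store.pending)
value_b = store.read("b")
```

The cost of the read-modify-write is deferred until an inactive user is actually requested or the batch merge runs, so updates for users who are never read stay as cheap appends.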
  • 29. Sequential read on Hive - HBase to Hive: sync data to Hive using HBase snapshots without impacting Region Servers; Hive accesses the data using HBaseStorageHandler - Move sequential reads to Hive: user modeling; audience insights
  • 30. Agenda - What is User Profile - Architecture Evolution - Schema Design - Optimization - Future Work
  • 31. Future Work - Explore Impala/Presto for better query performance - Expose an API for incremental modeling capability
  • 34. More optimization - As few column families as possible - Turn off autoflush - Throttle writes if necessary - Compress data before sending to HBase - Kryo for serialization
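
Two of these points, client-side buffering (the effect of turning off autoflush) and compressing values before sending, can be sketched together in Python. The `BufferedWriter` class and its `send` callback are hypothetical, and JSON is used for serialization only because Kryo is a JVM library; this is a sketch of the idea, not of the HBase client.

```python
import json
import zlib


class BufferedWriter:
    """Collect puts in a local buffer and ship them in batches."""

    def __init__(self, send, max_buffered=100):
        self.send = send                  # hypothetical callable taking one batch
        self.max_buffered = max_buffered
        self.buffer = []

    def put(self, row_key, value):
        # Serialize and compress the value before it leaves the client.
        payload = zlib.compress(json.dumps(value).encode("utf-8"))
        self.buffer.append((row_key, payload))
        if len(self.buffer) >= self.max_buffered:
            self.flush()

    def flush(self):
        """Send everything buffered so far as one batch."""
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []


batches = []
writer = BufferedWriter(batches.append, max_buffered=2)
writer.put("0_284386766", {"age": "25-34"})
writer.put("1_1877933007", {"age": "35-44"})  # buffer is full: first batch sent
writer.put("0_284386767", {"age": "18-24"})
writer.flush()                                 # ship the remainder explicitly
```

Batching trades a small delay for fewer round trips, so the caller has to flush explicitly before shutting down to avoid losing buffered writes.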