This is part 1 of the tutorial Xavier and Deepak gave at RecSys 2016. You can find the second part at http://www.slideshare.net/xamat/recsys-2016-tutorial-lessons-learned-from-building-reallife-recommender-systems
Recommender Systems from A to Z – Real-Time Deployment (Crossing Minds)
This fourth meetup will present good practices and tips about deploying a recommender system in production. We will cover a wide range of the day-to-day work of machine learning engineers and devops: from test-driven development to continuous integration and cloud architecture design. We will see how machine learning, and recommender systems in particular, differ from traditional software development, how this impacts deployment pipelines, and what tools you can use to solve these problems.
How to Build a Recommendation Engine on Spark (Caserta)
How to Build a Recommendation Engine on Spark was a presentation given by Joe Caserta, CEO and founder of Caserta Concepts, at @AnalyticsWeek in Boston.
Boston's Data AnalyticsStreet Conference is a packed two-day event with thought-provoking keynotes, knowledge-filled sessions, intense workshops, insightful panels, and real-world case studies, engaging the analytics community with the latest methodologies and trends. The conference offers one of the largest speaker-to-attendee ratios for an unmatched networking and learning opportunity.
For more information on the services and solutions Caserta Concepts offers, visit our website at http://casertaconcepts.com/.
How Artificial Intelligence & Machine Learning Are Transforming Modern Marketing (CleverTap)
Join Almitra Karnik, Head of Marketing at CleverTap, and Jessie Paul, CEO of Paul Writer, as they share their insights on how AI and ML are fundamentally changing the way we approach marketing and how we can harness these changes to further our businesses.
How to create a cutting-edge recommender that is fast and scalable, can use almost any applicable data, and is extremely flexible for use in many different contexts. It uses Spark, Mahout, and a search engine.
This presentation covers how all aspects of marketing have evolved over the years, how AI will shape the marketing landscape in the years to come, and why marketers need AI to assist them in their jobs. The future lies in working toward a better customer experience, and customer retention in particular seems to be the key.
Marketers have to stay on the lookout throughout: they need to keep learning and keep a continuous tab on the customer’s pulse in order to deliver the best.
Distributed Representation-based Recommender Systems in E-commerce (Rakuten Group, Inc.)
The Intelligence Domain Group at the Rakuten Institute of Technology develops various kinds of solutions that utilize Rakuten data in order to assist Rakuten services.
In this presentation, we propose a novel item recommender algorithm based on distributed representations. We confirmed that the proposed algorithm outperformed conventional recommender algorithms such as collaborative filtering and matrix factorization.
Summary: Graphs are structures commonly used in computer science that model the interactions among entities. I will start by introducing the basic formulations of graph-based machine learning, which has been a popular topic of research in the past decade and has led to a powerful set of techniques. In particular, I will show examples of how it acts as a generic data mining and predictive analytics tool. In the second part, I will discuss applications of such learning techniques in media analytics: (1) image analysis, where visually coherent objects are isolated from images; (2) social analysis of videos, where actors' social properties are predicted from videos. Materials in this part are based on our recent publications in highly selective venues (papers on https://sites.google.com/site/leiding2010/ ).
Bio: Lei Ding is a researcher making sense of large amounts of data in all media types. He currently works at Intent Media as a scientist, focusing on data analytics and applied machine learning in online advertising. Previously, he worked at several research institutions, including Columbia University, UIUC, and IBM Research, on digital/social media analysis and understanding. He received a Ph.D. in Computer Science and Engineering from The Ohio State University, where he was a Distinguished University Fellow.
This tutorial gives an overview of how search engines and machine learning techniques can be tightly coupled to address the need for building scalable recommender systems or other prediction-based systems. Typically, such systems architect retrieval and prediction in two phases. In Phase I, a search engine returns the top-k results based on constraints expressed as a query. In Phase II, the top-k results are re-ranked in another system according to an optimization function that uses a supervised trained model. However, this approach presents several issues, such as the possibility of returning sub-optimal results due to the top-k limit at query time, as well as the presence of inefficiencies in the system due to the decoupling of retrieval and ranking.
To address these issues the authors created ML-Scoring, an open-source framework that tightly integrates machine learning models into Elasticsearch, a popular search engine. ML-Scoring replaces the default information-retrieval ranking function with a custom supervised model, trained through Spark, Weka, or R, that is loaded as a plugin in Elasticsearch. This tutorial will not only review basic methods in information retrieval and machine learning, but will also walk through practical examples, from loading a dataset into Elasticsearch, to training a model in Spark, Weka, or R, to creating the ML-Scoring plugin for Elasticsearch. No prior experience is required in any system listed (Elasticsearch, Spark, Weka, R), though some programming experience is recommended.
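To make the two-phase architecture concrete, here is a minimal, self-contained sketch (not ML-Scoring itself; the corpus, features, and model weights are invented for illustration). Phase I retrieves a top-k candidate set by query match, and Phase II re-ranks it with a model score. Note how an item pruned in Phase I can never be recovered in Phase II, which is exactly the sub-optimality described above.

```python
def phase1_retrieve(corpus, query_terms, k):
    """Phase I: score documents by simple term overlap and keep the top-k."""
    scored = [(sum(t in doc["text"] for t in query_terms), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def phase2_rerank(candidates, weights):
    """Phase II: re-rank the candidates with a (toy) linear model."""
    def model_score(doc):
        return sum(weights.get(f, 0.0) * v for f, v in doc["features"].items())
    return sorted(candidates, key=model_score, reverse=True)

corpus = [
    {"id": 1, "text": "spark mllib recommender", "features": {"ctr": 0.9}},
    {"id": 2, "text": "spark tutorial",          "features": {"ctr": 0.2}},
    {"id": 3, "text": "gardening tips",          "features": {"ctr": 0.8}},
]
top_k = phase1_retrieve(corpus, ["spark"], k=2)
ranked = phase2_rerank(top_k, weights={"ctr": 1.0})
```

Document 3 has a high model score, but because it never matches the query it is cut in Phase I and the re-ranker never sees it; a tightly coupled system scores everything in one pass instead.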
In this presentation I will talk about the design of scalable recommender systems and their similarity to advertising systems. The problem of generating and delivering recommendations of content/products to appropriate audiences, and ultimately to individual users at scale, is largely similar to the matching problem in computational advertising, especially in the context of dealing with self- and cross-promotional content. In this analogy with online advertising, a display opportunity triggers a recommendation. The actors are the publisher (website/medium/app owner) and the advertiser (content owner or promoter), whereas the ads or creatives represent the items being recommended, which compete for the display opportunity and may have different monetary value to the actors. To effectively control what is recommended to whom, targeting constraints need to be defined over an attribute space, typically grouped by type (Audience, Content, Context, etc.), where some associated values are not known until decisioning time. In addition to constraints, there are business objectives (e.g. delivery quotas) defined by the actors. Both constraints and objectives can be encapsulated into and expressed as campaigns. Finally, there is the concept of relevance, directly related to users' response prediction, which is computed using the same attribute space as signals.
As in advertising, recommendation systems require a serving platform where decisioning happens in real time (a few milliseconds), typically selecting an optimal set of items to display to the user from hundreds, sometimes thousands or millions, of items. User actions are then taken as feedback and used to learn models that dynamically adjust in order to meet business objectives.
This is a radical departure from the traditional item-based and user-based collaborative filtering approach to recommender systems, which fails to factor in context, such as time of day, geo-location, or the category of the surrounding content, to generate more accurate recommendations. Traditional approaches also fail to recognize that recommendations don't happen in a vacuum and as such may require the evaluation of business constraints and objectives. All of this should be considered when designing and developing true commercial recommender/advertising systems.
Speaker Bio
Joaquin A. Delgado is currently Director of Advertising Technology at Intel Media (a wholly owned subsidiary of Intel Corp.), working on disruptive technologies in the Internet TV space. Prior to that he held CTO positions at AdBrite, Lending Club, and TripleHop Technologies (acquired by Oracle). He was also Director of Engineering and Sr. Principal Architect at Yahoo! His expertise lies in distributed systems, advertising technology, machine learning, recommender systems, and search. He holds a Ph.D. in computer science and artificial intelligence from the Nagoya Institute of Technology, Japan.
Recommender Systems Tutorial (Part 3) -- Online Components (Bee-Chung Chen)
This is a tutorial given at the International Conference on Machine Learning. The slides consist of four parts. Please look for Part 1, Part 2, and Part 4 to get a complete picture of this technology.
Recommender Systems from A to Z – The Right Dataset (Crossing Minds)
In recent years a lot of improvements have been made in the field of machine learning and in the tools that support the community of developers. But implementing a recommender system is still very hard.
That is why at Crossing Minds, we decided to create a series of 4 meetups to discuss how to implement a recommender system end-to-end:
Part 1 – The Right Dataset
Part 2 – Model Training
Part 3 – Model Evaluation
Part 4 – Real-Time Deployment
This first meetup will be about building the right dataset and doing all the preprocessing needed to create different models. We will talk about explicit vs. implicit feedback, dataset analysis, likes/dislikes vs. ratings, user and item features, normalization, and similarities.
PredictionIO - Building Applications That Predict User Behavior Through Big Data (predictionio)
Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies
Presented by PredictionIO at Big Data TechCon (Oct 17, 2013)
1. Statistical Models for Massive Web Data
Deepak Agarwal, LinkedIn, USA
Director, Applied Relevance Science (ARS)
CATS Big Data Panel, October 11, 2012
Hosted by National Academy of Sciences
Washington D.C., USA
2. Disclaimer
The opinions expressed here are mine and in no way represent the official position of LinkedIn.
The case studies presented today were work done while I was at Yahoo!
NRC BIG DATA PANEL, AGARWAL, 2012
3. Big Data Applications in Business
Big Data: competitive advantage, innovation, reduces uncertainty in decision making
High-frequency data
– Large number of heterogeneous transactions per unit time: web visits, financial trading, credit card transactions, telephone calls, packet flows in an IP network, ...
I will focus on statistical modeling for one such data source
– User visits to websites
4. Example 1: Yahoo! front page Today module
Recommend content links F1 F2 F3 F4 (out of 30-40, editorially programmed)
4 slots exposed; F1 has maximum exposure
Routes traffic to other Y! properties
7. Data Generation
[diagram: an http request carrying user information reaches the ranking service, which selects news and ads to serve; logged responses feed back as model updates]
8. Data
Context: select item j with item covariates x_j (keywords, content categories, ...)
User i visits: (user, context) covariates x_it (profile information, device id, first-degree connections, browse information, ...)
Response for (i, j): y_ij (click/no-click)
9. Statistical Problem
Rank items (from an admissible pool) for user visits in some context to maximize a utility of interest
Examples of utility functions
– Click-rates (CTR)
– Share-rates (CTR * P[Share | Click])
– Revenue per page-view = CTR * bid (more complex due to the second-price auction)
CTR is a fundamental measure that opens the door to a more principled approach to ranking items
Converge rapidly to maximum-utility items
– Sequential decision-making process
– Models: help cope with data sparseness (curse of dimensionality)
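The utility functions above can be sketched as a tiny scoring-and-ranking routine. This is a hedged illustration only: the items, rates, and bids are invented, and auction effects on revenue are ignored.

```python
def utility(item, objective):
    """Score an item under one of the utility functions on this slide."""
    if objective == "clicks":
        return item["ctr"]                                # click-rate (CTR)
    if objective == "shares":
        return item["ctr"] * item["p_share_given_click"]  # CTR * P[Share | Click]
    if objective == "revenue":
        return item["ctr"] * item["bid"]                  # CTR * bid (ignoring auction effects)
    raise ValueError(objective)

def rank(items, objective):
    """Rank the admissible pool by the chosen utility, best first."""
    return sorted(items, key=lambda it: utility(it, objective), reverse=True)

items = [
    {"id": "a", "ctr": 0.10, "p_share_given_click": 0.2, "bid": 0.2},
    {"id": "b", "ctr": 0.05, "p_share_given_click": 0.9, "bid": 1.0},
]
```

The same pool can rank differently under different utilities: here item "a" wins on raw CTR, while item "b" wins on shares and revenue.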
10. Illustration with the Y! front page application
Simplify: maximize CTR on the first slot (F1)
Article pool
– Editorially selected for high quality and brand image
– Few articles in the pool, but the pool is dynamic
We want to provide personalized recommendations
– Users with many prior visits see recommendations “tailored” to their taste; others see the best for the “group” they belong to
11. Types of user covariates
Demographics, geo:
– Not useful in the front-page application
Browse behavior: activity on the Y! network (x_it)
– Previous visits to properties, searches, ad views, clicks, ...
– This is useful for the front-page application
Latent user factors based on previous clicks on the module (u_it)
– Useful for active module users; obtained via factor models
12. Approach: Online + Offline
Offline computation
– Intensive computations done infrequently (once a day/week) to update parameters that are less time-sensitive
Online computation
– Lightweight computations done frequently (once every 5-10 minutes) to update parameters that are time-sensitive
– Adaptive experiments (explore-exploit) are also done online
13. Online computation: per-item online logistic regression
For item j, the state-space model is
y_ijt ~ Ber(p_ijt)
logit(p_ijt) = u_i' v_jt + x_it' b_jt
(v_{j,t+1}, b_{j,t+1}) = (v_{j,t}, b_{j,t}) + delta_{j,t+1}, delta_{j,t+1} ~ N(0, tau^2)
(v_{j,0}, b_{j,0}) = (D x_j, 0) + e_{j,0}, e_{j,0} ~ N(0, sigma^2)
Item coefficients are updated online via a Kalman filter (discounting approach of West and Harrison)
– Item covariates are used to initialize the coefficients at epoch zero
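A minimal sketch of the flavor of this update, under simplifying assumptions that are mine, not the deck's: a single scalar coefficient per item, a Gaussian posterior, and a one-step Laplace-style correction standing in for the full Kalman-filter recursion. All numbers are invented.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def online_update(mean, var, x, y, drift_var=0.01):
    """One click/no-click observation (feature x, label y in {0,1}) for a scalar coefficient."""
    var = var + drift_var            # random-walk state evolution (the discounting step)
    p = sigmoid(mean * x)            # predicted click probability at the current mean
    grad = (y - p) * x               # gradient of the Bernoulli log-likelihood
    info = p * (1 - p) * x * x       # observed Fisher information of this observation
    var = 1.0 / (1.0 / var + info)   # posterior variance shrinks with information
    mean = mean + var * grad         # posterior mean moves toward the observation
    return mean, var

mean, var = 0.0, 1.0                 # prior, playing the role of the epoch-zero init D x_j
for x, y in [(1.0, 1), (1.0, 1), (1.0, 0), (1.0, 1)]:   # three clicks, one skip
    mean, var = online_update(mean, var, x, y)
```

After mostly positive feedback the coefficient's mean is pulled above zero while its variance contracts below the prior, which is the qualitative behavior the slide's recursion delivers at scale.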
14. Closer look at the online model
Different components of logit(p_ijt) = u_i' v_jt + x_it' b_jt:
– u_i (r x 1): user latent factors, useful for heavy users
– x_it' b_jt: residual item affinity to user covariates (old items)
15. Online Adaptive Experimentation (Explore/Exploit)
Three schemes (all work reasonably well for the front-page application)
– epsilon-greedy: show the article with maximum posterior mean, except with a small probability epsilon choose an article at random
– Upper confidence bound (UCB): show the article with maximum score, where score = posterior mean + k * posterior std
– Thompson sampling: draw a sample of the coefficients from the posterior, compute each article's CTR, and show the article with maximum CTR
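The three schemes can be sketched as follows, assuming each article's CTR posterior is summarized by a (mean, std) pair; the article pool and the constants are invented for illustration.

```python
import random

# item -> (posterior mean, posterior std) of its CTR
pool = {"art1": (0.05, 0.01), "art2": (0.04, 0.03)}

def epsilon_greedy(pool, eps=0.1, rng=random):
    """Exploit the max posterior mean; explore uniformly with probability eps."""
    if rng.random() < eps:
        return rng.choice(list(pool))
    return max(pool, key=lambda a: pool[a][0])

def ucb(pool, k=2.0):
    """Optimism in the face of uncertainty: mean + k * std."""
    return max(pool, key=lambda a: pool[a][0] + k * pool[a][1])

def thompson(pool, rng=random):
    """Draw one sample per article from its (Gaussian-approximated) posterior."""
    return max(pool, key=lambda a: rng.gauss(*pool[a]))
```

Note how the schemes can disagree: art1 has the higher mean, so greedy exploitation picks it, but art2's wider posterior gives it the higher UCB score and a real chance under Thompson sampling.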
16. Offline computation
Computing user latent factors and the item coefficient prior
– Computed offline once a day using retrospective (user, item) interaction data for the last X days (X = 30 in our case)
– Computations are done on Hadoop
17. Offline: Regression-based Latent Factor Model
y_ij ~ Ber(p_ij) (the number of observations per user has wide variation)
logit(p_ij) = sum_k u_ik v_jk = u_i' v_j (need shrinkage on factors)
u_i = G x_i + e_i^u, e_i^u ~ N(0, diag(sigma_1^2, sigma_2^2, ..., sigma_r^2))
v_j = D x_j + e_j^v, e_j^v ~ N(0, I), with v_jk >= 0
G and D are regression weight matrices; the e terms are user/item-specific correction terms (learnt from data)
18. Role of shrinkage (consider the Gaussian case for simplicity)
For a new user/article, factor estimates are based on covariates alone:
u_new = G x_new, v_new = D x_new
For an old user:
E(u_i | Rest) = (lambda I + sum_{j in N_i} v_j v_j')^{-1} (lambda G x_i + sum_{j in N_i} y_ij v_j)
– A linear combination of the prior regression function and the user's feedback on items
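A small numpy check of this shrinkage formula (all inputs invented for illustration): with no observed items the posterior mean falls back to the prior regression G x_i, and observed feedback pulls the estimate toward the responses.

```python
import numpy as np

def posterior_user_factor(G, x_i, V, y, lam):
    """E(u_i | Rest) = (lam*I + sum_j v_j v_j')^-1 (lam*G x_i + sum_j y_ij v_j)."""
    r = G.shape[0]
    A = lam * np.eye(r) + V.T @ V     # V: one row per rated item j in N_i, holding v_j
    b = lam * (G @ x_i) + V.T @ y     # y: the responses y_ij
    return np.linalg.solve(A, b)

G = np.array([[0.5, 0.0], [0.0, 0.5]])   # prior regression weights
x_i = np.array([1.0, 1.0])               # user covariates
V = np.array([[1.0, 0.0], [0.0, 1.0]])   # factors of two rated items
y = np.array([1.0, 0.0])                 # one positive, one negative response
u_hat = posterior_user_factor(G, x_i, V, y, lam=1.0)

# A brand-new user (empty N_i) should recover the prior G x_i exactly.
cold_start = posterior_user_factor(G, x_i, np.zeros((0, 2)), np.zeros(0), lam=1.0)
```

With these numbers the prior puts the factor at (0.5, 0.5); the click on item 1 and skip of item 2 shrink the estimate to (0.75, 0.25), the precision-weighted blend the slide describes.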
19. Estimating the regression function via EM
Maximize prod_{ij} int f(u_i, v_j, Data) g(u_i, G) g(v_j, D) du_i dv_j
The integral cannot be computed in closed form; it is approximated by Monte Carlo using Gibbs sampling.
For the logistic model, we use adaptive rejection sampling (ARS; Gilks and Wild) to sample the latent factors within the Gibbs sampler.
20. Scaling to large data on Hadoop
Randomly partition by users in the Map step
Run a separate model on each partition
– Care is taken to initialize each partition's model with the same values; constraints on the factors ensure identifiability within each partition
Create ensembles by using different user partitions; average across ensembles to obtain estimates of user factors and regression functions
– Estimates of user factors across ensembles are uncorrelated, so averaging reduces variance
21. Data Example
1B events, 8M users, 6K articles
Offline training produced the user factors u_i
Baseline: online logistic regression without u_i
– Covariate-only online logistic model: logit(p_ijt) = x_it' b_jt
Overall click lift: 9.7%
Heavy users (> 10 clicks last month): 26%
Cold users (not seen in the past): 3%
22. Click-lift for heavy users
[figure: CTR lift relative to the covariate-only logistic model]
23. Computational Advertising: Matching ads to opportunities
[diagram: a user's visit to a publisher's page creates a display opportunity; the ad network picks the best ads from the advertisers. Examples of networks: Yahoo, Google, MSN, ad exchanges (networks of "networks"), ...]
24. Ad exchange (RightMedia) [Agarwal et al., KDD 2010]
Advertisers participate in different ways
– CPM (pay per ad-view)
– CPC (pay per click)
– CPA (pay per conversion)
To conduct an auction, normalize across pricing types by computing eCPM (expected CPM)
– Click-based: eCPM = click-rate * CPC
– Conversion-based: eCPM = conv-rate * CPA
Similar strategy of computing offline and online components
– Process 90B records for each model fit
– The model has hundreds of millions of parameters
– The model is fully deployed on RightMedia today
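The eCPM normalization can be sketched as follows. This is a hedged illustration, not RightMedia's code: the per-1000-impressions scaling and all rates and prices are my assumptions, added so the three pricing types land on one comparable scale.

```python
def ecpm(ad):
    """Expected revenue per 1000 impressions, normalized across pricing types."""
    if ad["pricing"] == "CPM":
        return ad["price"]                            # already priced per 1000 views
    if ad["pricing"] == "CPC":
        return 1000 * ad["click_rate"] * ad["price"]  # eCPM = click-rate * CPC
    if ad["pricing"] == "CPA":
        return 1000 * ad["conv_rate"] * ad["price"]   # eCPM = conv-rate * CPA
    raise ValueError(ad["pricing"])

ads = [
    {"id": "m", "pricing": "CPM", "price": 2.0},
    {"id": "c", "pricing": "CPC", "price": 0.5, "click_rate": 0.01},
    {"id": "a", "pricing": "CPA", "price": 20.0, "conv_rate": 0.0002},
]
winner = max(ads, key=ecpm)   # the auction compares all ads on the eCPM scale
```

Once every bid is an eCPM, a single auction can compare a $2 CPM ad against a $0.50 CPC ad and a $20 CPA ad; the predicted click-rate and conversion-rate are exactly where the response-prediction models of the earlier slides plug in.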
25. Summary
Estimating interactions in high-dimensional sparse data is important in web applications
Scaling such models to Big Data is a challenging statistical problem
Combining offline + online modeling with explore/exploit is a good practical strategy
26. Some Challenges
Very high-dimensional modeling with very large and noisy data
– A few categorical variables with large numbers of levels interact with each other to produce the response
– Scalability
Designing sequential experiments
– Multi-armed bandits are back in a big way
Data fusion
– From multiple and disparate sources
Availability of data, and the ability to run experiments, to researchers
Editor's Notes
Important module: hundreds of millions of user visits.