How to eliminate ideas as soon as possible

Retail Rocket helps web shoppers make better shopping decisions by providing personalized real-time recommendations through multiple channels, with over 100MM unique monthly users and 1000+ retail partners. Rapid improvement of the product is essential to win in the highly competitive market of real-time personalization platforms.
The need to constantly innovate and improve recommendation algorithms requires the right tools and a process for rapid hypothesis testing. It is no secret that nine out of ten hypotheses do not improve performance at all. Our task was stated as follows: how do we detect and eliminate an idea that brings no improvement as early as possible, spending a minimum of resources in the process?

In the report we will talk about:

How we made our hypothesis-testing process faster.
One programming language for R&D.
Enmity and friendship of offline and online metrics.
Why it is difficult to predict the impact of changing an algorithm's diversity.
The benefits of AA/BB online tests.
Bayesian statistics for the evaluation of online tests.
ABOUT THE SPEAKER

Roman Zykov is the Chief Data Scientist at Retail Rocket, where he is responsible for the algorithms behind personalized and non-personalized recommendations. Prior to Retail Rocket, Roman was the Head of Analytics at some of the biggest e-commerce companies for almost ten years. He received an M.Sc. in applied mathematics and physics from MIPhT in 2004.


How to eliminate ideas as soon as possible

  1. Hypothesis Testing: How to Eliminate Ideas as Soon as Possible. Roman Zykov, Retail Rocket. Boston, RecSys 2016
  2. Context • Intro • Offline vs Online testing • Make offline testing shorter • Artificial diversity metric • Online tests
  3. Retail Rocket • Personalised real-time recommendations • E-commerce only • Multiple channels (site, email, …) • Founded in 2012 • Offices: Amsterdam, Barcelona, Milan, Moscow • 1000+ retail partners • 100+ million daily events
  4. Why is testing important? • Highly competitive market • It's not hard to create your own recommendations • Constant changes in the product and algorithms • Fast and reliable decisions
  5. Offline vs Online testing
     Offline testing forecasts online testing results:
     • Relatively fast; testing minor changes takes hours
     • Few resources: data, computational resources, code, one developer
     • Hard to forecast online metrics in some cases
     • The influence of an algorithm on users' behaviour is ignored
     • Bad values of offline metrics prevent online implementation
     The online test is the final decision point:
     • Requires much time: at least two cycles of decision making
     • Requires many resources: design, onsite production, etc.
  6. Testing facts
     • Nine out of ten ideas do not improve anything
     • Most ideas have minor impact:
       o add new data extracted from text, images, etc.
       o adjust parameters of an algorithm
  7. Offline testing
  8. Offline predicts Online
     Major changes or a new algorithm:
     • Always check with an online experiment
     • Find an appropriate offline metric afterwards
     • Try different definitions of users' sessions
     • Try different event sequences
     Minor changes:
     • Use offline tests if you have a proven offline metric
  9. Make offline testing shorter at Retail Rocket
     What we did:
     • Functional programming on Scala/Spark; four languages (Python, Java, Pig, Hive) had previously been used
     • Research in Scala/Spark notebooks with added R integration for graphics
     • An offline evaluation framework with metric calculations for all of our tasks; the most complicated project in Retail Rocket
     What we got:
     • It takes hours to prove or disprove any simple idea, whereas previously it could have taken days
     • Research is limited by the power of our cluster and the number of data scientists
  10. Scala/Spark notebook with R
  11. Offline framework
      • Scala on Spark
      • Deals with existing web logs
      • Implicit feedback
      • Major metrics: Recall, Diversity, Recall with NN, Empty Recs
      • Minor metrics: Serendipity, Novelty, Coverage
      • Different types of event sequences
      • Different definitions of users' sessions
      • Personalised / non-personalised recommendations
      • Adjustable TOP of viewable recommendations
      • Test panel of sites from different domains
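To make the "major metrics" above concrete, here is a minimal sketch of a plain top-K Recall computed over sessions. The Session case class, its fields and recallAtK are illustrative names for this sketch, not the actual Retail Rocket framework API.

    // Minimal sketch of top-K Recall over sessions: per session, the fraction of
    // actually interacted items that appear in the top-K recommendations,
    // averaged over all sessions. Names are illustrative, not the framework API.
    case class Session(recommended: Seq[Long], actual: Set[Long])

    def recallAtK(sessions: Seq[Session], k: Int): Double = {
      val perSession = sessions.collect {
        case s if s.actual.nonEmpty =>
          (s.recommended.take(k).toSet intersect s.actual).size.toDouble / s.actual.size
      }
      if (perSession.isEmpty) 0.0 else perSession.sum / perSession.size
    }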
  12. Offline event sequences
      Example session: view1, view2, view3, cart1, cart2, view4, view5, view6, purchase1
      • View2View: view1 -> view2, view2 -> view3, view3 -> view4, view4 -> view5, view5 -> view6
      • View2Cart: view1 -> cart1, view2 -> cart1, view3 -> cart1, view4 -> cart1, view5 -> cart2, view6 -> cart2
      • View2Purchase: view1 -> purchase1, view2 -> purchase1, view3 -> purchase1, view4 -> purchase1, view5 -> purchase1, view6 -> purchase1
      • Cart2Purchase: cart1 -> purchase1, cart2 -> purchase1
      • Cart2Cart: cart1 -> cart2
      * Events: product view, add to cart, purchase, main page view, search, catalog page, …
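As a rough illustration of how such typed pairs can be extracted from one session's event log: the event types and the two pairing rules below (consecutive views, and every view paired with each purchase) are simplified assumptions in the spirit of the sequences above, not the exact rules used in the framework.

    // Sketch: turning one session's event log into typed pairs such as
    // View2View and View2Purchase. Event model and rules are assumptions.
    sealed trait Event { def item: String }
    case class View(item: String) extends Event
    case class AddToCart(item: String) extends Event
    case class Purchase(item: String) extends Event

    // View2View: consecutive product views, e.g. view1 -> view2, view2 -> view3, ...
    def view2View(session: Seq[Event]): Seq[(String, String)] = {
      val views = session.collect { case View(i) => i }
      views.zip(views.drop(1))
    }

    // View2Purchase: every viewed item paired with each purchased item in the session.
    def view2Purchase(session: Seq[Event]): Seq[(String, String)] = {
      val views     = session.collect { case View(i) => i }
      val purchases = session.collect { case Purchase(i) => i }
      for (v <- views; p <- purchases) yield (v, p)
    }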
  13. Offline metric examples
      Example session: view1, view2, view3, cart1, cart2, view4, view5, view6, purchase1
      • "What Customers Buy After Viewing This Item": View2Cart, View2Purchase, …
      • "Customers Who Bought This Item Also Bought": Cart2Cart, Cart2Purchase, View2Cart, …
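The block-to-metric mapping named on this slide can be expressed as simple configuration; the strings below are illustrative placeholders, not the framework's real identifiers.

    // Which event-pair metrics evaluate which recommendation block (from the slide).
    val blockMetrics: Map[String, Seq[String]] = Map(
      "What Customers Buy After Viewing This Item" -> Seq("View2Cart", "View2Purchase"),
      "Customers Who Bought This Item Also Bought" -> Seq("Cart2Cart", "Cart2Purchase", "View2Cart")
    )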
  14. Case: Artificial diversification
  15. Artificial diversification
      [Figure: recommendations before (original) and after artificial diversification]
      Problem: it is not possible to use Recall for evaluation
  16. Recall with Nearest Neighbours (NN)
      [Figure: top-4 recommendations with content-based similarity scores (nearest neighbours) to the real item]
      • Direct hit: 1.0
      • Indirect hit (via a nearest neighbour): 0.5
      • No hit: 0.0
      Metric = average over all sessions
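A minimal sketch of the scoring rule described on this slide: 1.0 for a direct hit, 0.5 when the real item is only a content-based nearest neighbour of a recommended item, 0.0 otherwise, averaged over sessions. Only the 1.0/0.5/0.0 scores come from the slide; the similarity function, threshold and names are assumptions.

    // Sketch of "Recall with Nearest Neighbours": direct hit = 1.0,
    // indirect hit via content-based similarity = 0.5, no hit = 0.0,
    // averaged over all sessions. similarity and nnThreshold are assumed.
    case class NnSession(topRecs: Seq[Long], realItem: Long)

    def recallWithNN(sessions: Seq[NnSession],
                     similarity: (Long, Long) => Double,
                     nnThreshold: Double): Double = {
      val scores = sessions.map { s =>
        if (s.topRecs.contains(s.realItem)) 1.0
        else if (s.topRecs.exists(r => similarity(r, s.realItem) >= nnThreshold)) 0.5
        else 0.0
      }
      if (scores.isEmpty) 0.0 else scores.sum / scores.size
    }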
  17. Online A/B testing
  18. AA/BB tests
      [Diagram: traffic is split into two A groups (control) and two B groups (test)]
  19. AA/BB tests
      [Diagram: two A/A/B/B splits, one labelled "Ideal" and one labelled "Dirty"]
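One reading of the "Ideal" vs. "Dirty" distinction: users are deterministically assigned to four buckets, and if the two A buckets (or the two B buckets) already differ noticeably, the split is dirty and the A vs. B comparison cannot be trusted. The hashing scheme and salt below are illustrative assumptions, not the production assignment logic.

    // Sketch of a deterministic AA/BB bucket assignment; any stable hash of the
    // user id works the same way. The salt is an illustrative placeholder.
    def bucket(userId: String, salt: String = "ab-salt"): String = {
      val h = ((userId + salt).hashCode % 4 + 4) % 4   // stable value in 0..3
      Vector("A1", "A2", "B1", "B2")(h)
    }
    // If metrics for "A1" vs "A2" (or "B1" vs "B2") differ significantly,
    // the split itself is dirty and the A vs B result is not reliable.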
  20. Bayesian approach
      • Conversion rates: Beta distribution with normal priors
      • Average Order Values: normal distribution (after log) with normal priors
      • Priors from historical data before the experiment
      Anything may be done with the posteriors, e.g.: "There is a 95% chance that A has a 1% lift over B"
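A Monte Carlo sketch of the kind of posterior statement quoted above, for conversion rates only. It uses conjugate Beta priors and posteriors, which is a simplification of the slide's setup (the slide mentions normal priors estimated from historical data); all names, counts and parameters below are made up for illustration.

    import scala.util.Random

    // Estimates P(conversion rate of A >= conversion rate of B * (1 + minLift))
    // by sampling from Beta posteriors. A simplified, self-contained sketch.
    object BayesianAb {
      private val rng = new Random(42)

      // Marsaglia-Tsang sampler for Gamma(shape, 1).
      private def sampleGamma(shape: Double): Double =
        if (shape < 1.0) {
          sampleGamma(shape + 1.0) * math.pow(rng.nextDouble(), 1.0 / shape)
        } else {
          val d = shape - 1.0 / 3.0
          val c = 1.0 / math.sqrt(9.0 * d)
          var out = -1.0
          while (out < 0) {
            val x = rng.nextGaussian()
            val v = math.pow(1.0 + c * x, 3)
            if (v > 0 && math.log(rng.nextDouble()) < 0.5 * x * x + d - d * v + d * math.log(v))
              out = d * v
          }
          out
        }

      // Beta(a, b) sample via two Gamma draws.
      private def sampleBeta(a: Double, b: Double): Double = {
        val x = sampleGamma(a)
        val y = sampleGamma(b)
        x / (x + y)
      }

      def probLift(convA: Int, visitsA: Int, convB: Int, visitsB: Int,
                   priorAlpha: Double = 1.0, priorBeta: Double = 1.0,
                   minLift: Double = 0.01, draws: Int = 100000): Double = {
        val wins = (1 to draws).count { _ =>
          val rateA = sampleBeta(priorAlpha + convA, priorBeta + visitsA - convA)
          val rateB = sampleBeta(priorAlpha + convB, priorBeta + visitsB - convB)
          rateA >= rateB * (1.0 + minLift)
        }
        wins.toDouble / draws
      }
    }

For example, BayesianAb.probLift(150, 10000, 120, 10000) estimates the probability that A's conversion rate exceeds B's by at least a 1% relative lift; a value near 0.95 would correspond to the statement quoted on the slide.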
  21. Conclusion
      • Offline testing can predict online results
      • One programming language for R&D reduces the test time
      • The Scala language is a good alternative for ML tasks
      • Different event sequences for offline metrics
      • Recall with Nearest Neighbours (NN) metric
  22. Thank you! Roman Zykov, Retail Rocket. rzykov@retailrocket.net, https://github.com/RetailRocket/SparkMultiTool
