BIG DATA FOR EBOOKS AND EREADERS 
INMAR GIVONI, PHD 
VP, BIG DATA, KOBO 
IGIVONI@KOBO.COM
A BIT ABOUT KOBO
A LOT OF BITS OF KOBO DATA
TECHNOLOGY/TOOLS USED BY KOBO BIG DATA 
•Processing/Streaming: Hadoop, Storm, Flume 
•Storage/streaming: Hive, SQL, Redis, Couchbase 
•Search: Solr+plugins 
•Languages: Python, Java, C++, R
ABOUT KOBO’S BIG DATA TEAM
BIG DATA IS LIKE TEENAGE SEX: EVERYONE TALKS ABOUT IT, NOBODY REALLY KNOWS HOW TO DO IT, EVERYONE THINKS EVERYONE ELSE IS DOING IT, SO EVERYONE CLAIMS THEY ARE DOING IT... 
DAN ARIELY
SEARCH
RECOMMENDATIONS
RELATED ITEMS THROUGH CONTENT ANALYSIS
SIMILARITY ANALYSIS
ADULT CONTENT CLASSIFICATION
BEYOND THE BOOK
•FIND INTERESTING THINGS IN A BOOK 
•RELATE THEM TO INTERESTING CONTENT FROM THE INTERNET
1. Key-term detection 
2. Disambiguation 
3. Filtering
KEY-TERM EXTRACTION 
•Approach: TF-IDF top terms 
•Restricted to terms that have a Wikipedia article 
•Greedy up-to-5-gram matching
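The three bullets above can be sketched roughly as follows. This is a minimal illustration, not Kobo's implementation: the function names, the lowercase whitespace tokenization, the smoothed IDF formula, and the toy title set are all assumptions.

```python
import math
from collections import Counter

def match_terms(tokens, wiki_titles, max_n=5):
    """Greedy longest-match: at each position try the longest n-gram first."""
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in wiki_titles:
                found.append(candidate)
                i += n  # consume the matched n-gram
                break
        else:
            i += 1  # no Wikipedia title starts at this token
    return found

def top_terms(book_tokens, corpus, wiki_titles, k=3):
    """Rank a book's matched terms by TF-IDF against a corpus of token lists."""
    tf = Counter(match_terms(book_tokens, wiki_titles))
    doc_terms = [set(match_terms(doc, wiki_titles)) for doc in corpus]
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(term in terms for terms in doc_terms)
        # Smoothed IDF (an assumption; the slides do not give the variant).
        scores[term] = count * math.log((1 + n_docs) / (1 + df))
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Toy data, purely illustrative.
wiki_titles = {"bilbo baggins", "gandalf", "hobbit", "ring"}
book = "bilbo baggins the hobbit met gandalf and bilbo baggins left".split()
corpus = [book, "the ring and gandalf".split(), "a hobbit hole".split()]
print(top_terms(book, corpus, wiki_titles, k=1))
```

Trying the longest n-gram first is what makes "bilbo baggins" match as one term rather than as two unrelated tokens.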
DISAMBIGUATING KEY-TERMS 
•Problem: a key-term can map to more than one Wikipedia article 
•Solution: choose the Wikipedia articles that make sense collectively [1] 
[1] Local and Global Algorithms for Disambiguation to Wikipedia, ACL 2011
Gandalf 
Hobbit 
Bilbo Baggins
WIKIPEDIA TERM SIMILARITY 
•Using Google Similarity Distance
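For reference, the Google Similarity Distance (also called the Normalized Google Distance) can be computed from occurrence counts as in the sketch below. What the counts measure for Wikipedia articles (for example, pages that mention or link to each article) is not stated in the slides, so the argument names here are assumptions.

```python
import math

def normalized_google_distance(fx, fy, fxy, n):
    """NGD from counts: fx, fy (each term alone), fxy (both together), n (total).

    0 means the terms always co-occur; larger values mean less related.
    """
    if fxy == 0:
        return float("inf")  # the terms never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Toy counts, purely illustrative.
print(normalized_google_distance(100, 100, 10, 10000))
```

Because this is a distance, a similarity score for ranking would need a conversion such as `1 / (1 + distance)`; that conversion is also an assumption, not something the slides specify.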
IN MORE TECHNICAL TERMS… 
•This is a max-weight k-clique problem in a k-partite graph. 
•We used a greedy approach
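One possible greedy scheme is sketched below: each key-term is one partition of the graph, its candidate articles are the nodes, and the goal is to pick one node per partition with high total pairwise similarity. The slides do not say which greedy strategy was used, so the ordering rule, the function name, and the toy candidates and similarity are all assumptions.

```python
def greedy_disambiguate(candidates, sim):
    """Pick one article per term, greedily favoring high pairwise similarity.

    candidates: dict mapping term -> list of candidate article names
    sim: function (article_a, article_b) -> non-negative similarity
    """
    # Handle the least ambiguous terms first so early picks are more reliable.
    order = sorted(candidates, key=lambda t: (len(candidates[t]), t))
    chosen = {}
    for term in order:
        # Score each candidate by its similarity to everything chosen so far.
        chosen[term] = max(
            candidates[term],
            key=lambda art: sum(sim(art, other) for other in chosen.values()),
        )
    return chosen

# Hypothetical candidates and a toy similarity, purely illustrative.
candidates = {
    "Gandalf": ["Gandalf (Middle-earth)"],
    "Hobbit": ["Hobbit (computer)", "Hobbit (Middle-earth)"],
    "Bilbo Baggins": ["Bilbao (city)", "Bilbo Baggins (Middle-earth)"],
}

def toy_sim(a, b):
    return 1.0 if "Middle-earth" in a and "Middle-earth" in b else 0.1

print(greedy_disambiguate(candidates, toy_sim))
```

A greedy pass like this is not guaranteed to find the maximum-weight clique; it trades optimality for speed.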
SOME SCREEN SHOTS
PAGE LAYOUT OPTIMIZATION
•MANY WAYS OF SHOWING CONTENT ON THE KOBO WEBPAGE 
•FIND BEST LAYOUT
MAIN PROBLEMS 
•Automatic generation of layouts 
•Finding optimal layout
LAYOUT GENERATION 
•Local search 
•Start from a handful of expert-generated layouts 
•Use widget statistics to make informed swap/exchange steps
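A minimal sketch of this kind of local search is shown below. The objective (slot weight times per-widget click-through rate), the widget names, and the numbers are all hypothetical; the slides do not describe how a layout was scored, and a real system would have to estimate that from traffic.

```python
def layout_score(layout, widget_ctr, slot_weight):
    """Toy objective: each slot's weight times the CTR of the widget in it."""
    return sum(w * widget_ctr[widget] for w, widget in zip(slot_weight, layout))

def neighbors(layout, all_widgets):
    """Layouts one step away: swap two slots, or exchange in an unused widget."""
    for i in range(len(layout)):
        for j in range(i + 1, len(layout)):
            swapped = list(layout)
            swapped[i], swapped[j] = swapped[j], swapped[i]
            yield swapped
        for widget in all_widgets:
            if widget not in layout:
                exchanged = list(layout)
                exchanged[i] = widget
                yield exchanged

def local_search(seed, widget_ctr, slot_weight):
    """Hill-climb from a seed layout until no neighbor scores higher."""
    def score(layout):
        return layout_score(layout, widget_ctr, slot_weight)

    current = list(seed)
    while True:
        best = max(neighbors(current, sorted(widget_ctr)), key=score, default=None)
        if best is None or score(best) <= score(current):
            return current  # local optimum reached
        current = best

# Hypothetical widget statistics and a two-slot page.
widget_ctr = {"bestsellers": 0.05, "new_releases": 0.03,
              "recommended": 0.08, "deals": 0.01}
slot_weight = [1.0, 0.5]  # the top slot is assumed to get more attention
print(local_search(["deals", "new_releases"], widget_ctr, slot_weight))
```

This sketch evaluates every neighbor exhaustively; using widget statistics to propose only promising swap/exchange steps, as the slide describes, would prune that neighborhood.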
OPTIMIZATION FRAMEWORK – EXPLORE/EXPLOIT 
•An agent can take one of several actions 
•The agent takes an action and observes the payoff
•Pick a policy for choosing actions that balances exploring and exploiting 
•If you always take the best action so far, you may miss better options 
•If you always explore new things, your losses accumulate 
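The trade-off can be illustrated with epsilon-greedy, one of the simplest explore/exploit policies. It is used here only as an illustration and is not claimed to be the policy Kobo deployed; the payoff rates, step count, and epsilon are made-up values.

```python
import random

def epsilon_greedy(true_rates, steps=20000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy on 0/1-payoff actions; return pulls per action."""
    rng = random.Random(seed)
    pulls = [0] * len(true_rates)
    wins = [0] * len(true_rates)

    def observed_mean(arm):
        # Untried actions look infinitely good, so each is tried at least once.
        return wins[arm] / pulls[arm] if pulls[arm] else float("inf")

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_rates))                   # explore
        else:
            arm = max(range(len(true_rates)), key=observed_mean)   # exploit
        pulls[arm] += 1
        wins[arm] += rng.random() < true_rates[arm]  # payoff is 0 or 1
    return pulls

# Hypothetical payoff rates for three actions.
print(epsilon_greedy([0.1, 0.3, 0.6]))
```

Setting `epsilon=0` gives the "always take the best one so far" extreme, and `epsilon=1` gives the "always explore" extreme.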
MULTI-ARMED BANDIT MODEL 
•Maximize total expected reward 
•Bayesian framework (Agarwal et al. [1]) 
•Context-sensitive variant (Li et al. [2]) 
[1] Explore/Exploit Schemes for Web Content Optimization, Yahoo Research, ICDM ’09 
[2] A Contextual-Bandit Approach to Personalized News Article Recommendation, Yahoo Research, WWW ’10
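To make the Bayesian view concrete, here is a sketch of Thompson sampling with Beta posteriors over click probabilities. It is a standard Bayesian bandit technique chosen for brevity; it is not necessarily the scheme from [1], and it ignores the context features used in [2]. The click rates are made up.

```python
import random

def thompson_sampling(true_rates, steps=5000, seed=0):
    """Beta-Bernoulli Thompson sampling; returns (pulls per arm, total reward)."""
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    total_reward = 0
    for _ in range(steps):
        # Draw one plausible click rate per arm from its Beta(1+s, 1+f) posterior.
        samples = [rng.betavariate(1 + s, 1 + f)
                   for s, f in zip(successes, failures)]
        arm = samples.index(max(samples))  # play the arm that looks best
        reward = 1 if rng.random() < true_rates[arm] else 0
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
        total_reward += reward
    pulls = [s + f for s, f in zip(successes, failures)]
    return pulls, total_reward

# Hypothetical click rates for three layouts.
pulls, total = thompson_sampling([0.1, 0.3, 0.6])
print(pulls, total)
```

Arms with few observations have wide posteriors and so still get sampled occasionally, which is how exploration happens without an explicit exploration rate.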
FEATURE VECTOR CONSTRUCTION 
•User information: 
•Purchase history, by genre, by price sensitivity, freshness 
•Browsing history 
•Geo 
•Extensions: 
•Optimize for profit (not CTR) 
•Account for different widget categories
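A context vector for the contextual bandit could be assembled from the user information listed above roughly as follows. Every field name, genre, region code, and encoding choice here is hypothetical; the slides name the signal types but not how they were encoded, and browsing history could be encoded the same way as purchases.

```python
GENRES = ["fiction", "romance", "mystery"]  # hypothetical genre vocabulary
REGIONS = ["CA", "US", "JP"]                # hypothetical geo buckets

def build_features(user):
    """Turn a user record into a fixed-length numeric feature vector.

    user["purchases"] is a list of (genre, price, days_ago) tuples and
    user["region"] is a region code; both are assumed shapes.
    """
    purchases = user["purchases"]
    n = len(purchases)
    # Share of purchases per genre.
    genre_share = [
        sum(g == genre for g, _, _ in purchases) / n if n else 0.0
        for genre in GENRES
    ]
    # Average price paid, a crude proxy for price sensitivity.
    avg_price = sum(p for _, p, _ in purchases) / n if n else 0.0
    # Freshness: 1.0 for a purchase today, decaying toward 0 with age.
    freshness = 1.0 / (1 + min(d for _, _, d in purchases)) if n else 0.0
    # One-hot encoding of the user's region.
    geo = [1.0 if user["region"] == r else 0.0 for r in REGIONS]
    return [1.0] + genre_share + [avg_price, freshness] + geo  # 1.0 is a bias term

user = {"purchases": [("fiction", 10.0, 3), ("romance", 6.0, 30)], "region": "CA"}
print(build_features(user))
```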
UPCOMING PROJECTS
THANK YOU! 
QUESTIONS?

[Rakuten TechConf2014] [C-2] Big Data for eBooks and eReaders