QCon SP - recommended for you

Building a recommendation platform
with Hadoop and ElasticSearch
Tatiana Al-Chueyr Martins
@tati_alchueyr
tatiana.alchueyr@gmail.com
QCon Rio
24th September 2014

tati.__doc__
● Programmer since 2002
● Pythonist since 2003
● Open-source enthusiast
● Computer Engineer from UNICAMP
● Currently Senior Software Engineer at Globo.com
#oss #freesoftware #python #linux #android #bigdata #cloud
tati_alchueyr tatiana alchueyr

free services
"If you're not paying for it, you're not the
customer. You're the product being sold."
- SomeOne (in the past, apparently online)
If you are paying, it’s no better.
Interesting comparison between paid and unpaid Yahoo!, Google and Facebook services:
http://revdancatt.com/2013/02/06/the-problem-with-if-youre-not-paying-youre-the-product/

online services survival
what is one of the most popular ways for an
online service to make money?

online services survival
for each click above, the author of these slides will gain US$ 0.01

online services advertisement
How do you choose which website to
advertise on?
How do you measure how successful an
advertisement is?

online services advertisement
common metrics:
● conversion: # goal achievements / visits
● page-views: number of times a page is
viewed
Learn more about conversion in marketing:
http://en.wikipedia.org/wiki/Conversion_marketing
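
The conversion metric above is a simple ratio. A minimal sketch, assuming hypothetical campaign numbers (the function name and the figures are illustrative, not from the talk):

```python
def conversion_rate(goal_achievements: int, visits: int) -> float:
    """Conversion = number of goal achievements / number of visits."""
    if visits == 0:
        return 0.0  # no visits means nothing could convert
    return goal_achievements / visits


# Hypothetical numbers for one advertisement campaign.
visits = 2000
purchases = 50
rate = conversion_rate(purchases, visits)
print(f"conversion: {rate:.1%}")  # prints "conversion: 2.5%"
```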

globo
● 2nd largest TV network in annual revenue worldwide
● produces 2,400 hours of entertainment per year
● produces 3,000 hours of journalism per year
● covers 98.6% of the Brazilian territory
● covers 99.5% of the Brazilian population
● US$ 7.2 billion annual revenue (2013)
More about Rede Globo:
http://en.wikipedia.org/wiki/Rede_Globo

globo.com
responsible for serving Globo's content online

globo.com
distributed publishing platform for providing:
● articles
● videos
● photo galleries
● games (e.g. Cartola FC)
on several topics, including:
- news, sports, entertainment, technology, music

globo.com daily bread
my salary is paid by keeping you browsing
our websites
well, it is not as bad
as it may seem...

let’s look from a different perspective

globo.com daily social work
we help people find the information they want
after all, we want to make people happy
so they keep surfing our websites…

real-time user events
Example of globo.com Apache Kafka (publish-subscribe messaging) in realtime

what you are at globo.com
you are data to us
example of user representation (*) in JSON:
{
  "(1, 12359058102)": {
    "soccer team": "São Paulo",
    "city": "Rio de Janeiro",
    "likes": ["Python", "QCon", "Caelum"]
  }
}
* simplified user representation for explanation-only

what you tell us directly
● e.g. Firefox or Chrome plugin

what you tell us indirectly
● news you read
● videos you watch
● things you search
● quizzes you answer
● things you like
● things you comment

news @globo.com
http://sportv.globo.com/site/combate/noticia/2014/05/adriano-martins-volta-ao-octogono-contra-juan-puig-no-tuf-19-finale.html

recommendation platform
- users’ history
- users’ profiles
- globo.com news
- globo.com videos
interesting news
for a returning user

http://www.techtudo.com.br/noticias/noticia/2014/08/hp-faz-recall-de-cabos-de-6-milhoes-de-pcs-saiba-como-proceder.html

Several strategies, including:
● analysing your previous history
● finding users with similar interests

Several approaches, including:
● pre-processing
● real-time processing
at globo.com, we use both

pre-computed platform
1. user performs an action
2. action data is logged into the HDFS (Apache Hadoop)
3. from time to time, Pig scripts run over a range of the Hadoop
data, creating pre-computed recommendations for all users
4. recommendations are stored in Apache HBase
next time a user accesses globo.com:
- the recommendation is ready and retrieved from HBase
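
The steps above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: a list stands in for the HDFS log, a dict stands in for the HBase table, and "recommend items from the user's most frequent topic" is a placeholder for whatever the real Pig scripts compute. All names are hypothetical.

```python
from collections import Counter, defaultdict

# In-memory stand-ins (assumptions): a list plays the role of the HDFS log,
# a dict plays the role of the HBase table.
action_log = []            # steps 1-2: every user action is appended here
recommendation_table = {}  # step 4: user id -> pre-computed recommendations


def log_action(user_id, topic):
    """Steps 1-2: record that a user consumed content about a topic."""
    action_log.append((user_id, topic))


def run_batch_job(catalog, top_n=2):
    """Step 3: periodically recompute recommendations for all users."""
    topics_by_user = defaultdict(Counter)
    for user_id, topic in action_log:
        topics_by_user[user_id][topic] += 1
    for user_id, counts in topics_by_user.items():
        favourite, _ = counts.most_common(1)[0]
        recommendation_table[user_id] = catalog.get(favourite, [])[:top_n]


def recommend(user_id):
    """Next access: a simple key lookup, no computation at request time."""
    return recommendation_table.get(user_id, [])


catalog = {"sports": ["match report", "transfer news"], "tech": ["phone review"]}
log_action("u1", "sports")
log_action("u1", "sports")
log_action("u1", "tech")
before = recommend("u1")  # nothing yet: the batch job has not run
run_batch_job(catalog)
after = recommend("u1")   # now served straight from the table
```

The trade-off is visible in the sketch: reads are cheap, but a user's newest actions have no effect until the next batch run.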

pre-computed platform
(1) pre-compute recommendations from time to time, based on users' history
(2) on the next access after pre-computation, the user sees new recommendations
pre-computed recommendations for all users + simple query

real-time platform
1. user performs an action (event)
2. data is logged into the HDFS (Apache Hadoop)
3. every minute, event data is sent to Apache Kafka
4. an internal process (Horizon) reads events from Kafka and
writes them into HBase
next time a user accesses globo.com:
- a process (Lex) identifies the user's history and checks their
interests
- based on the user's topics of interest, a query is run on
ElasticSearch
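
The request-time part of this flow can be sketched as follows. The talk does not show how Lex derives topics or what the ElasticSearch query looks like, so everything here is an assumption: the event shape, the `topics` field name, and the choice of a `bool`/`should` query are illustrative only. The sketch only builds the query body as a plain dict; it does not contact a cluster.

```python
from collections import Counter


def topics_of_interest(history, top_n=2):
    """Pick the user's most frequent topics from their event history."""
    counts = Counter(event["topic"] for event in history)
    return [topic for topic, _ in counts.most_common(top_n)]


def build_query(topics, size=5):
    """Build a search body matching documents about any of the topics.

    "topics" is a hypothetical field name in the content index.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                "should": [{"match": {"topics": topic}} for topic in topics],
                "minimum_should_match": 1,
            }
        },
    }


# Hypothetical raw events, as they might be read back for one user.
history = [
    {"url": "/a", "topic": "soccer"},
    {"url": "/b", "topic": "python"},
    {"url": "/c", "topic": "soccer"},
]
query = build_query(topics_of_interest(history))
```

Unlike the pre-computed approach, the work happens on every access, so the newest events influence the result immediately.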

real-time platform
(1) store all raw data
(2) on the next access, a new query is run against the database
raw data for all users + complex query

Which do you prefer?
[ ] pre-processing
[ ] real-time processing

# of users x # pagevisits by user

scaling technologies
● Apache Hadoop - filesystem
● Apache Kafka - event bus
● Apache HBase - “big table” opensource
● ElasticSearch - distributed search engine
● Horizon - kafka subscriber, among others
● Lex (python) - REST API for semantic recommendation
● Tsuru: platform as a service
Today 3:55 PM - Andrews Medina (room 4)

motivation
Not only words
São Paulo

São Paulo?

São Paulo state

São Paulo city

São Paulo saint

São Paulo soccer team

What is "São Paulo" in this news...?
a. City São Paulo
b. State São Paulo (SP)
c. Saint São Paulo
d. São Paulo Futebol Clube
e. Other: _____________

What is "São Paulo" in this news...?
a. http://dbpedia.org/resource/S%C3%A3o_Paulo
b. http://dbpedia.org/resource/S%C3%A3o_Paulo_(state)
c. http://dbpedia.org/resource/Paul_the_Apostle
d. http://dbpedia.org/resource/S%C3%A3o_Paulo_FC
e. Other: _____________
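
Picking one of these URIs for a given news item is an entity disambiguation problem. A toy sketch, assuming a hand-written set of cue words per candidate and a simple word-overlap score (neither comes from the talk; real systems use much richer context):

```python
# Candidate entities for one ambiguous label. The URIs are the ones listed
# above; the cue words are made up for this example.
CANDIDATES = {
    "São Paulo": {
        "http://dbpedia.org/resource/S%C3%A3o_Paulo":
            {"city", "mayor", "traffic", "subway"},
        "http://dbpedia.org/resource/S%C3%A3o_Paulo_(state)":
            {"state", "governor", "countryside"},
        "http://dbpedia.org/resource/Paul_the_Apostle":
            {"saint", "apostle", "church"},
        "http://dbpedia.org/resource/S%C3%A3o_Paulo_FC":
            {"soccer", "goal", "match", "championship"},
    }
}


def disambiguate(label, text):
    """Return the candidate URI whose cue words overlap the text the most."""
    words = set(text.lower().split())
    scores = {uri: len(cues & words) for uri, cues in CANDIDATES[label].items()}
    best_uri = max(scores, key=scores.get)
    return best_uri if scores[best_uri] > 0 else None  # None = "e. Other"


entity = disambiguate("São Paulo", "São Paulo scored a late goal and won the match")
```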

motivation
Multiple words for the same thing
Female
f
F
female
woman
...

motivation
Multiple words for the same thing
http://data.globo.com/female
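
In code, this amounts to mapping every surface form to a single entity URI. A minimal sketch, assuming a hand-maintained lookup table (the table and function name are illustrative):

```python
FEMALE = "http://data.globo.com/female"

# Lower-cased surface forms that all denote the same entity.
LABEL_TO_ENTITY = {
    "female": FEMALE,
    "f": FEMALE,
    "woman": FEMALE,
}


def to_entity(label):
    """Normalize a free-text label and look up its entity URI, if known."""
    return LABEL_TO_ENTITY.get(label.strip().lower())


entities = {to_entity(label) for label in ["Female", "f", "F", "female", "woman"]}
```

After normalization, the five labels from the previous slide collapse into one entity.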

motivation
Cross-link content from different web products
● Soccer player
● Politician
● Celebrity

motivation
Semantic markup in web pages
Vocabularies: RDF, FOAF, GEO, Dublin Core, SKOS
Example news page (Juliana Cardilli, G1 SP), topic "Isabella Nardoni case":
Isabella's grave becomes a visiting site in SP; the Nardoni couple is in prison.
Isabella Nardoni was killed on 29 March 2008
in the North Zone of São Paulo (Photo: Reproduction)
Isabella de Oliveira Nardoni, aged 5, was killed on the night of 29 March
2008. The forensic investigation concluded that the girl was thrown from the
sixth floor of the building where her father, Alexandre Nardoni, her
stepmother, Anna Carolina Jatobá, and the couple's two young children lived,
in Vila Isolina Mazzei, in the north zone of São Paulo.

motivation
Recommend annotations to the information producer

motivation
Suggest related content to the information consumer

changes
● Replacement of words by entities
http://data.globo.com/Person/JoseDaSilva

changes
● Replacement of labels by qualified relationships

changes
● Organize data from tables to graphs
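
One way to picture the move from tables to graphs is to turn each table row into subject-predicate-object triples. A small sketch, assuming the `http://data.globo.com/` URI pattern shown on the earlier slide; the row contents, the `Team` URI, and the helper name are hypothetical:

```python
BASE = "http://data.globo.com/"


def row_to_triples(entity_type, row, key="id"):
    """Turn one table row into (subject, predicate, object) triples.

    The key column becomes the subject URI; every other column becomes
    an edge from that subject to the column's value.
    """
    subject = f"{BASE}{entity_type}/{row[key]}"
    return [(subject, column, value) for column, value in row.items() if column != key]


# Hypothetical row from a "Person" table.
row = {
    "id": "JoseDaSilva",
    "name": "José da Silva",
    "team": "http://data.globo.com/Team/SaoPauloFC",
}
triples = row_to_triples("Person", row)
```

When a value is itself an entity URI, as with `team` here, the triple links two nodes, which is what makes the result a graph rather than a wider table.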

outcomes
● Replacing words with entities improved:
○ Finding
○ Linking
○ Reconciling
○ Organizing
multiple layers of information

outcomes
● Flexible ways to organize content
● Ease of finding related issues
● Explicit relations derived from annotated content
● Up-to-date topic pages with little editorial effort
● Linking content across different web products
● Seamless navigation leading to flow state

video time
Metaweb video:
https://www.youtube.com/watch?v=TJfrNo3Z-DU

Lex: semantic recommendation API
REST API - not opensource - yet
http://lex.cloud.globo.com/

GET /recommendation/?userEmail=marcelo.soares@corp.globo.com
GET /recommendation/?userId=2326676&userProvider=cadun
Lex: customized recommendation
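
A client could assemble these request URLs as sketched below. Assumptions: the endpoint lives under the base URL shown on the previous slide, and query parameters are percent-encoded in the standard way. The slides do not document the response format, so the sketch only builds URLs and sends nothing.

```python
from urllib.parse import urlencode

LEX_BASE = "http://lex.cloud.globo.com"  # base URL shown on the slides


def recommendation_url(**params):
    """Build a GET /recommendation/ URL from keyword parameters."""
    return f"{LEX_BASE}/recommendation/?{urlencode(params)}"


# The two ways of identifying a user shown above.
by_email = recommendation_url(userEmail="marcelo.soares@corp.globo.com")
by_id = recommendation_url(userId=2326676, userProvider="cadun")
```

Note that `urlencode` escapes the `@` in the e-mail address as `%40`.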

GET /user/history/?
Lex: find a user history

GET /news/descriptor/?url=http://g1.globo.com/mundo/noticia/2014/05/deslizamento-de-terra-no-afeganistao-deixa-centenas-desaparecidos.html
Lex: find a news descriptor

GET /user/descriptor/?
Lex: find a user descriptor

benefits of semantics in recsys
● more precision
● more diversity than purely statistical
algorithms such as TF-IDF
● possibility of inferring categories and topics
that are not explicit in the text
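
For contrast, here is a bare-bones version of the TF-IDF baseline mentioned above. It is one common variant (term frequency times log of inverse document frequency), written from scratch for illustration and not the implementation used at globo.com; the sample headlines are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (tf * log(N / df))."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "são paulo wins the derby",
    "são paulo mayor opens subway line",
    "the derby ends in a draw",
]
weights = tf_idf(docs)
```

TF-IDF only sees surface words: "são" and "paulo" get the same weight whether the headline is about the soccer team or the city, which is the gap the entity-based approach is meant to close.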

Video developed by Rodrigo Senra and me at
Globo.com, related to our initial studies in this field
https://www.youtube.com/watch?v=6UW3frySEnc

Thanks! :)
Tatiana Al-Chueyr Martins
@tati_alchueyr
tatiana.alchueyr@gmail.com
QCon Rio
24th September 2014
