Data at Tumblr

                            Adam Laiacano
                        NYC Data Science Meetup

                           @adamlaiacano
                        adamlaiacano.tumblr.com

Monday, April 8, 13
What I Needed to Learn
                      When I Started My Job




About Me


                           Electrical Engineering background
                      Worked at CBS to learn more about stats / data

                              Joined Tumblr in August 2011
                              40th employee, now over 160




About Tumblr
                      blogging platform / social network
                             100,000,000 blogs!

                              unique signals:
                       asynchronous following graph
                           reblogs, likes, replies




About You

                Country   Month   Value
                  USA     March   10000
                  USA     April   12000
                  USA      May    14000
                Canada    March    7000
                Canada    April    6500
                Canada     May     5000
                 France   March    1200
                 France   April    1400
                 France    May     2000

                                      Pivot Table!

                Country   March   April    May
                  USA     10000   12000   14000
                Canada     7000    6500    5000
                France     1200    1400    2000

         In R (reshape package), each direction is one function call:

         pivoted <- cast(melted, country~month)
         melted  <- melt.data.frame(pivoted, id.vars='country')

                                     Who Cares?
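The same reshape works in pandas; a minimal sketch, not from the slides (column names are assumed to match the table above):

```python
import pandas as pd

# The long-format data from the slide
long_df = pd.DataFrame({
    'country': ['USA'] * 3 + ['Canada'] * 3 + ['France'] * 3,
    'month':   ['March', 'April', 'May'] * 3,
    'value':   [10000, 12000, 14000, 7000, 6500, 5000, 1200, 1400, 2000],
})

# Long -> wide ("cast" / pivot table)
wide = long_df.pivot(index='country', columns='month', values='value')

# Wide -> long ("melt"), recovering one row per (country, month)
back = wide.reset_index().melt(id_vars='country', value_name='value')
```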
One more question:




Hadoop




What tools we use

                      What we do with those tools




Plumbing




                             John D. Cook "The plumber programmer"
                             November 2011 http://bit.ly/XfcXrt

Pipes

                      1. Record events / actions
                      2. Store / archive everything
                      3. Extract information
                        a. Reports / BI
                        b. Back to Tumblr application




Step 1: Log Events
                      GiantOctopus: in-house event logging system.

                 Built-in variables:
                 • timestamp
                 • referring page
                 • user identifier
                 • action identifier
                 • location (city)
                 • language setting

                     GiantOctopus::log(
                         'posts',
                         array(
                             'send_to_fb'      => 1,
                             'send_to_twitter' => 0,
                         )
                     );
Scribe
                      Web Servers  →  Scribe Servers  →  HDFS

          Web servers write continuously to the Scribe servers;
          a daily cron job copies the collected logs into HDFS.
Step 2: Store in Hadoop
                              One huge computer:
                               300TB hard drive
                                 7.8TB of RAM
                       85 x 2 = 170 hex-core processors


                               One huge PITA:
                         awful docs (search-hadoop.com helps)
                               java everywhere
                           fragmented community
Hadoop

                         hive

                         pig

                      map/reduce




Hive

           "Basically SQL"

           Compiles to Java map/reduce

           About 100 hive tables

           Each "table" is really a directory of flat files

           Example — 10 most liked posts:

               SELECT
                   root_post_id,
                   count(*) AS likes
               FROM posts
               WHERE action = 'like'
               GROUP BY root_post_id
               ORDER BY likes DESC
               LIMIT 10;
Hive Partitions
                        File location in HDFS         Hive partition value
                      /posts/2013/03/26/*.lzo          dt='2013-03-26'
                      /posts/2013/03/27/*.lzo          dt='2013-03-27'
                      /posts/2013/03/28/*.lzo          dt='2013-03-28'


        Filtering on the partition column prunes the scan (204 mappers):

            SELECT action, COUNT(*) AS views
                FROM pageviews
                WHERE dt = "2012-03-05"
                GROUP BY action

        Filtering on a raw timestamp scans everything (22,895 mappers):

            SELECT action, COUNT(*) AS views
                FROM pageviews
                WHERE ts > 1330927200
                    AND ts < 1331013600
                GROUP BY action
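Epoch bounds for a timestamp filter like the one above can be derived from a calendar date; a minimal sketch assuming UTC day boundaries (the slide's own constants may have been computed in a different timezone):

```python
import calendar
import datetime

def day_bounds(d):
    """Return [start, end) epoch seconds covering calendar date d, in UTC."""
    start = calendar.timegm(d.timetuple())
    return start, start + 86400

lo, hi = day_bounds(datetime.date(2012, 3, 5))
print(lo, hi)  # 1330905600 1330992000
```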

Extending Hive: Streaming
                      • Add all .py files you’ll need to the query
                      • Sends each record to the Python script via stdin
                      • Can be used as a subquery in a “normal” Hive query

             add file helpers.py;

             FROM users
             SELECT
               TRANSFORM(id, email)
               USING 'helpers.py'
               AS (id_with_gmail);

                                          #!/usr/bin/python
                                          ## helpers.py
                                          import sys, re

                                          gmail = re.compile(r'.+@gmail.com')

                                          for row in sys.stdin:
                                              id, email = row.split('\t')
                                              if gmail.match(email):
                                                  print id
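The streaming filter can be sanity-checked locally before submitting the query; a sketch (Python 3 here, and `filter_gmail` is a hypothetical wrapper that mirrors the slide's helpers.py logic):

```python
import io
import re

gmail = re.compile(r'.+@gmail\.com')

def filter_gmail(stream):
    """Return ids whose email matches the gmail pattern, one per input row."""
    matched = []
    for row in stream:
        id_, email = row.rstrip('\n').split('\t')
        if gmail.match(email):
            matched.append(id_)
    return matched

# Simulate Hive piping tab-separated records to stdin
rows = io.StringIO("1\ta@gmail.com\n2\tb@yahoo.com\n")
print(filter_gmail(rows))  # ['1']
```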



Pig

            "Basically SQL" if you had to explain it piece by piece.

            "DataBag" == "DataFrame"

                posts = LOAD 'posts.tsv' AS (
                    root_post_id:int,
                    action:chararray
                );

                likes = FILTER posts BY action == 'like';
                grouped = GROUP likes BY root_post_id;

                counted = FOREACH grouped GENERATE
                    group AS root_post_id,
                    COUNT(likes.root_post_id) AS likes;

                sorted = ORDER counted BY likes DESC;

                top10 = LIMIT sorted 10;

                STORE top10 INTO 'top10.csv';
Extending Pig: Python UDFs
                                Extract word prefixes for type-ahead tag search.
                                The @outputSchema decorator tells Pig the type of
                                each value the UDF returns:

                                @outputSchema("t:(prefix:chararray)")
                                def prefixes(input, max_len=3):
                                    nchar = min(len(input), max_len) + 1
                                    return [input[:i] for i in range(1,nchar)]


                                 >>> prefixes('museum', max_len=6)
                                 ['m', 'mu', 'mus', 'muse', 'museu', 'museum']
Extending Pig: Java UDFs
                package com.tumblr.swine;

                import java.util.ArrayList;
                import java.util.List;

                public class Prefixes {

                      private int maxTermLen;

                      public Prefixes() {
                          this.maxTermLen = Integer.MAX_VALUE;
                      }

                      public Prefixes(int maxTermLen) {
                          this.maxTermLen = maxTermLen;
                      }

                      public List<String> get(String s) {
                          int size = s.length() < maxTermLen ? s.length() : maxTermLen;
                          ArrayList<String> results = new ArrayList<String>();
                          for (int i=1; i < size + 1; i++) {
                              results.add(s.substring(0,i));
                          }
                          return results;
                      }
                }




Extending Pig: Java UDFs

The Pig wrapper around the Prefixes class above:

    package com.tumblr.swine.pig;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.FuncSpec;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.DataType;
    import org.apache.pig.data.DefaultBagFactory;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;
    import org.apache.pig.impl.logicalLayer.FrontendException;
    import org.apache.pig.impl.logicalLayer.schema.Schema;

    public class Prefixes extends EvalFunc<DataBag> {

        public DataBag exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0)
                return null;
            try {
                DataBag output = DefaultBagFactory.getInstance().newDefaultBag();
                String word = (String) input.get(0);
                int max = Integer.MAX_VALUE;
                if (input.size() == 2) {
                    max = (Integer) input.get(1);
                }
                com.tumblr.swine.Prefixes prefixes = new com.tumblr.swine.Prefixes(max);
                for (String prefix : prefixes.get(word)) {
                    Tuple t = TupleFactory.getInstance().newTuple(1);
                    t.set(0, prefix);
                    output.add(t);
                }
                return output;
            } catch (Exception e) {
                System.err.println("Prefixes: failed to process input; error - " + e.getMessage());
                return null;
            }
        }

        @Override
        public Schema outputSchema(Schema input) {
            Schema bagSchema = new Schema();
            bagSchema.add(new Schema.FieldSchema("prefix", DataType.CHARARRAY));
            try {
                return new Schema(new Schema.FieldSchema(
                    getSchemaName(this.getClass().getName().toLowerCase(), input),
                    bagSchema, DataType.BAG));
            } catch (FrontendException e) {
                return null;
            }
        }

        @Override
        public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
            List<FuncSpec> funcSpecs = new ArrayList<FuncSpec>(2);
            Schema s = new Schema();
            s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
            funcSpecs.add(new FuncSpec(this.getClass().getName(), s));
            // Allow specifying optional max length of prefix
            s = new Schema();
            s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
            s.add(new Schema.FieldSchema(null, DataType.INTEGER));
            funcSpecs.add(new FuncSpec(this.getClass().getName(), s));
            return funcSpecs;
        }
    }
HUE


       Keeps query history

       Preview tables / results

       Save queries & templates




What tools we use

                      What we do with those tools




Spam


                      Classic example of supervised learning

                      Don't get too clever

                      Build good tooling!




Spam: Vowpal Wabbit
                 Online (continuously learning) system

                 Updates parameters with every new piece of information

                 Parallelizable, can run as service, very fast.

                 Loss functions:
                 •squared
                 •logistic
                 •hinge
                 •quantile
Spam: Vowpal Wabbit
                      Post:
                          blog:           'adamlaiacano',
                          tags:           ['free ipad', 'warez'],
                          location:       'US~NY-New York',
                          is_suspended:   0 or 1


                      Model:   is_suspended ~ free_ipad + warez + US~NY-New_York + .....




                      Squared loss function
                      Very high dimension: L1 regularization to avoid overfitting
                      Great precision, decent recall
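Records like the one above reach VW as plain-text lines. A hypothetical encoder sketch — the `<label> | features` layout is VW's standard input format, but `to_vw_line` and the field names are illustrative, not Tumblr's actual schema:

```python
def to_vw_line(post):
    """Encode a post dict as a Vowpal Wabbit line: '<label> | f1 f2 ...'."""
    # VW uses +1/-1 labels for binary classification
    label = 1 if post['is_suspended'] else -1
    # Feature names must not contain spaces
    feats = [t.replace(' ', '_') for t in post['tags']]
    feats.append(post['location'].replace(' ', '_'))
    return '%d | %s' % (label, ' '.join(feats))

post = {
    'blog': 'adamlaiacano',
    'tags': ['free ipad', 'warez'],
    'location': 'US~NY-New York',
    'is_suspended': 1,
}
print(to_vw_line(post))  # 1 | free_ipad warez US~NY-New_York
```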


Type-Ahead Search

                      Most popular tags for any letter combination

                      Store daily results in distributed Redis cluster

                 m:            [me, model, mine]
                 mu:           [muscle, muscles, music video]
                 mus:          [muscle, muscles, music video]
                 muse:         [muse, museum, nine muses]
                 museu:        [museum, metropolitan museum of art,
                                natural history museum]

Type-Ahead Search

                      Only keep popular prefixes: tag must occur 10 times

                      Only update keys that have changed.


                 - muse:        [muse, museum, nine muses]
                 + muse:        [muse, museum, arizona muse]
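Writing only changed keys amounts to a dict diff between two daily runs; a sketch with a hypothetical `changed_keys` helper (the Redis write itself is omitted):

```python
def changed_keys(old, new):
    """Return the prefix -> tag-list entries that differ from the previous run."""
    return {k: v for k, v in new.items() if old.get(k) != v}

old = {'muse':  ['muse', 'museum', 'nine muses'],
       'museu': ['museum', 'metropolitan museum of art']}
new = {'muse':  ['muse', 'museum', 'arizona muse'],
       'museu': ['museum', 'metropolitan museum of art']}

# Only the 'muse' key needs a Redis update
print(changed_keys(old, new))  # {'muse': ['muse', 'museum', 'arizona muse']}
```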



Questions?



                             @adamlaiacano

                      http://adamlaiacano.tumblr.com




