Data Science at Tumblr
 

    Presentation Transcript

    • Data at Tumblr. Adam Laiacano, NYC Data Science Meetup. @adamlaiacano, adamlaiacano.tumblr.com. Monday, April 8, 13
    • What I Needed to Learn When I Started My Job
    • About Me
      Electrical Engineering background
      Worked at CBS to learn more about stats / data
      Joined Tumblr in August 2011
      40th employee, now over 160
    • About Tumblr
      blogging platform / social network
      100,000,000 blogs!
      unique signals: asynchronous following graph; reblogs, likes, replies
    • About You
      Long format:
        Country  Month  Value
        USA      March  10000
        USA      April  12000
        USA      May    14000
        Canada   March   7000
        Canada   April   6500
        Canada   May     5000
        France   March   1200
        France   April   1400
        France   May     2000
      Wide format:
        Country  March  April  May
        USA      10000  12000  14000
        Canada    7000   6500   5000
        France    1200   1400   2000
    • About You: Pivot Table!
    • About You: the same conversion in R
        pivoted <- cast(melted, country ~ month)
        melted <- melt.data.frame(pivoted, id.vars = "country")
    • About You: Who Cares?
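For readers who do not use R, the long-to-wide conversion on these slides can be sketched in plain Python. This is a minimal illustration using only the standard library; the function names `cast` and `melt` mirror the R calls but are defined here, and the nested-dict representation of the wide table is an assumption made for brevity.

```python
from collections import defaultdict

# Long ("melted") format: one row per (country, month) observation.
melted = [
    ("USA", "March", 10000), ("USA", "April", 12000), ("USA", "May", 14000),
    ("Canada", "March", 7000), ("Canada", "April", 6500), ("Canada", "May", 5000),
    ("France", "March", 1200), ("France", "April", 1400), ("France", "May", 2000),
]


def cast(rows):
    """Pivot long rows into a wide {country: {month: value}} mapping."""
    wide = defaultdict(dict)
    for country, month, value in rows:
        wide[country][month] = value
    return dict(wide)


def melt(wide):
    """Unpivot the wide mapping back into (country, month, value) rows."""
    return [
        (country, month, value)
        for country, months in wide.items()
        for month, value in months.items()
    ]


pivoted = cast(melted)
print(pivoted["Canada"])  # one wide row: values keyed by month
```

Round-tripping with `melt(cast(rows))` returns the same set of rows, which is the property the two R one-liners rely on.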
    • One more question:
    • Hadoop
    • What tools we use
      What we do with those tools
    • Plumbing
      John D. Cook, "The plumber programmer", November 2011
      http://bit.ly/XfcXrt
    • Pipes
      1. Record events / actions
      2. Store / archive everything
      3. Extract information
         a. Reports / BI
         b. Back to Tumblr application
    • Step 1: Log Events
      GiantOctopus: in-house event logging system.
      Built-in Variables
      • timestamp
      • referring page
      • user identifier
      • action identifier
      • location (city)
      • language setting
      Example call:
        GiantOctopus::log(
            'posts',
            array('send_to_fb' => 1,
                  'send_to_twitter' => 0)
        );
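The idea of attaching built-in variables to every logged event can be sketched as follows. GiantOctopus is an in-house system whose implementation is not shown in the slides, so everything here is hypothetical: the function names, the field names, and the choice of JSON lines as the wire format are assumptions for illustration only.

```python
import json
import time

# Hypothetical names for the built-in variables listed on the slide.
BUILT_IN_FIELDS = ("referring_page", "user_id", "action", "city", "language")


def build_event(category, payload, context, now=None):
    """Merge built-in context fields with the event-specific payload."""
    record = {
        "category": category,
        "timestamp": int(now if now is not None else time.time()),
    }
    # Built-in variables are attached to every event; missing ones become None.
    for key in BUILT_IN_FIELDS:
        record[key] = context.get(key)
    record["payload"] = dict(payload)
    return record


def serialize(record):
    """Encode one event as a single line, suitable for an append-only log."""
    return json.dumps(record, sort_keys=True)


event = build_event(
    "posts",
    {"send_to_fb": 1, "send_to_twitter": 0},
    {"user_id": 42, "action": "create_post", "language": "en"},
    now=1365379200,
)
line = serialize(event)
```

Keeping the caller's payload separate from the built-in fields means an application field can never overwrite a built-in one.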
    • Scribe
      Web Servers -> Scribe Servers (continuously writing) -> HDFS (daily cron)
    • Step 2: Store in Hadoop
      One huge computer:
        300TB hard drive
        7.8TB of RAM
        85 x 2 = 170 hex-core processors
      One huge PITA:
        awful docs (search-hadoop.com helps)
        java everywhere
        fragmented community
    • Hadoop
      hive
      pig
      map/reduce
    • Hive
      "Basically SQL"
      Compiles to Java map/reduce
      About 100 hive tables
      Each "table" is really a directory of flat files
      10 most liked posts:
        SELECT root_post_id, count(*) AS likes
        FROM posts
        WHERE action = 'like'
        GROUP BY root_post_id
        ORDER BY likes DESC
        LIMIT 10;
    • Hive Partitions
        File location in HDFS      Hive partition value
        /posts/2013/03/26/*.lzo    dt=2013-03-26
        /posts/2013/03/27/*.lzo    dt=2013-03-27
        /posts/2013/03/28/*.lzo    dt=2013-03-28
      Filtering on the raw timestamp scans every partition (22,895 mappers):
        SELECT action, COUNT(*) AS views
        FROM pageviews
        WHERE ts > 1330927200
          AND ts < 1331013600
        GROUP BY action
      Filtering on the partition column reads only that day's files (204 mappers):
        SELECT action, COUNT(*) AS views
        FROM pageviews
        WHERE dt = "2012-03-05"
        GROUP BY action
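The mapping between a record's timestamp, its `dt` partition value, and its HDFS directory can be sketched in Python. This is an illustration, not Tumblr's code: the function names are hypothetical, and the use of UTC day boundaries is an assumption (the slide does not say which time zone defines a day).

```python
from datetime import datetime, timezone


def partition_for(ts):
    """Return (dt value, HDFS glob) for a Unix timestamp, assuming UTC days."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc)
    return day.strftime("%Y-%m-%d"), day.strftime("/posts/%Y/%m/%d/*.lzo")


def prune(partitions, dt):
    """Keep only the directories whose partition value matches dt."""
    return [path for value, path in partitions if value == dt]


# Three consecutive days starting at 2013-03-26 00:00:00 UTC.
start = 1364256000
partitions = [partition_for(start + day * 86400) for day in range(3)]
selected = prune(partitions, "2013-03-27")
```

A filter on `dt` lets the planner discard directories by name before reading any file, whereas a filter on `ts` can only be evaluated after each record has been read.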
    • Extending Hive: Streaming
      • Add all .py files you'll need to the query
      • Sends each record to python script via stdin
      • Can be used as a subquery in a "normal" hive query
      Hive query:
        add file helpers.py;
        FROM users
        SELECT TRANSFORM(id, email)
        USING 'helpers.py'
        AS (id_with_gmail)
      helpers.py:
        #!/usr/bin/python
        ## helpers.py
        import sys, re
        # Match any address at gmail.com (the dot is escaped so it is literal).
        gmail = re.compile(r'.+@gmail\.com')
        for row in sys.stdin:
            # Hive streams each record as one tab-separated line.
            id, email = row.rstrip('\n').split('\t')
            if gmail.match(email):
                print(id)
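A streaming script is easier to test locally if the row logic is separated from stdin and stdout. The sketch below restructures the slide's filter that way; the function names and the decision to skip malformed rows are assumptions, not part of the original script.

```python
import io
import re
import sys

GMAIL = re.compile(r".+@gmail\.com$")


def gmail_ids(lines):
    """Yield the id of each tab-separated (id, email) row with a Gmail address."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed rows rather than failing the whole job
        user_id, email = fields
        if GMAIL.match(email):
            yield user_id


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point for Hive: read rows from stdin, write matching ids to stdout."""
    for user_id in gmail_ids(stdin):
        stdout.write(user_id + "\n")


# Local dry run with in-memory streams instead of a Hive job.
sample = io.StringIO("1\ta@gmail.com\n2\tb@example.com\n3\tc@gmail.com\n")
out = io.StringIO()
main(sample, out)
```

When deployed, the file would end with a call to `main()` so that Hive's TRANSFORM clause drives it through the real standard streams.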
    • Pig
      "Basically SQL" if you had to explain it piece by piece.
      "DataBag" == "DataFrame"
        posts = LOAD 'posts.tsv' AS (
            root_post_id:int,
            action:chararray
        );
        likes = FILTER posts BY action == 'like';
        grouped = GROUP likes BY root_post_id;
        counted = FOREACH grouped GENERATE
            group AS root_post_id,
            COUNT(likes.root_post_id) AS likes;
        sorted = ORDER counted BY likes DESC;
        top10 = LIMIT sorted 10;
        STORE top10 INTO 'top10.csv';
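To make the piece-by-piece reading concrete, here is the same filter, group, count, order, limit pipeline as a small in-memory Python sketch. The function name and the sample rows are made up for illustration; the real script runs over files in HDFS, not a Python list.

```python
from collections import Counter


def top_liked(rows, n=10):
    """Return the n most liked posts as (root_post_id, likes) pairs."""
    # FILTER ... BY action == 'like', then GROUP and COUNT in one pass.
    likes = Counter(post_id for post_id, action in rows if action == "like")
    # ORDER ... DESC and LIMIT n.
    return likes.most_common(n)


# Hypothetical sample of (root_post_id, action) rows.
rows = [(1, "like"), (2, "like"), (1, "like"), (3, "reblog"), (1, "like"), (2, "like")]
top2 = top_liked(rows, n=2)
```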
    • Extending Pig: Python UDFs
      Extract word prefixes for type-ahead tag search
        @outputSchema("t:(prefix:chararray)")
        def prefixes(input, max_len=3):
            # Emit the leading substrings of input, up to max_len characters.
            nchar = min(len(input), max_len) + 1
            return [input[:i] for i in range(1, nchar)]

        >>> prefixes('museum', max_len=6)
        ['m', 'mu', 'mus', 'muse', 'museu', 'museum']
    • Extending Pig: Java UDFs
        package com.tumblr.swine;

        import java.util.ArrayList;
        import java.util.List;

        public class Prefixes {
            private int maxTermLen;

            public Prefixes() {
                this.maxTermLen = Integer.MAX_VALUE;
            }

            public Prefixes(int maxTermLen) {
                this.maxTermLen = maxTermLen;
            }

            public List<String> get(String s) {
                int size = s.length() < maxTermLen ? s.length() : maxTermLen;
                ArrayList<String> results = new ArrayList<String>();
                for (int i = 1; i < size + 1; i++) {
                    results.add(s.substring(0, i));
                }
                return results;
            }
        }
    • Extending Pig: Java UDFs
        package com.tumblr.swine.pig;

        import java.io.IOException;
        import java.util.ArrayList;
        import java.util.List;
        import org.apache.pig.EvalFunc;
        import org.apache.pig.FuncSpec;
        import org.apache.pig.data.DataBag;
        import org.apache.pig.data.DataType;
        import org.apache.pig.data.DefaultBagFactory;
        import org.apache.pig.data.Tuple;
        import org.apache.pig.data.TupleFactory;
        import org.apache.pig.impl.logicalLayer.FrontendException;
        import org.apache.pig.impl.logicalLayer.schema.Schema;

        public class Prefixes extends EvalFunc<DataBag> {
            public DataBag exec(Tuple input) throws IOException {
                if (input == null || input.size() == 0)
                    return null;
                try {
                    DataBag output = DefaultBagFactory.getInstance().newDefaultBag();
                    String word = (String) input.get(0);
                    int max = Integer.MAX_VALUE;
                    if (input.size() == 2) {
                        max = (Integer) input.get(1);
                    }
                    com.tumblr.swine.Prefixes prefixes = new com.tumblr.swine.Prefixes(max);
                    for (String prefix : prefixes.get(word)) {
                        Tuple t = TupleFactory.getInstance().newTuple(1);
                        t.set(0, prefix);
                        output.add(t);
                    }
                    return output;
                } catch (Exception e) {
                    System.err.println("Prefixes: failed to process input; error - " + e.getMessage());
                    return null;
                }
            }

            @Override
            public Schema outputSchema(Schema input) {
                Schema bagSchema = new Schema();
                bagSchema.add(new Schema.FieldSchema("prefix", DataType.CHARARRAY));
                try {
                    return new Schema(new Schema.FieldSchema(
                        getSchemaName(this.getClass().getName().toLowerCase(), input),
                        bagSchema, DataType.BAG));
                } catch (FrontendException e) {
                    return null;
                }
            }

            @Override
            public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
                List<FuncSpec> funcSpecs = new ArrayList<FuncSpec>(2);
                Schema s = new Schema();
                s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
                funcSpecs.add(new FuncSpec(this.getClass().getName(), s));
                // Allow specifying optional max length of prefix
                s = new Schema();
                s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
                s.add(new Schema.FieldSchema(null, DataType.INTEGER));
                funcSpecs.add(new FuncSpec(this.getClass().getName(), s));
                return funcSpecs;
            }
        }
    • HUE
      Keeps query history
      Preview tables / results
      Save queries & templates
    • What tools we use
      What we do with those tools
    • Spam
      Classic example of supervised learning
      Don't get too clever
      Build good tooling!
    • Spam: Vowpal Wabbit
      Online (continuously learning) system
      Updates parameters with every new piece of information
      Parallelizable, can run as service, very fast.
      Loss functions:
      • squared
      • logistic
      • hinge
      • quantile
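What "updates parameters with every new piece of information" means can be shown with a toy online learner. This is not Vowpal Wabbit's algorithm or API: it is a plain stochastic gradient step on squared loss over sparse binary features, and the class name and learning rate are assumptions chosen for illustration.

```python
from collections import defaultdict


class OnlineLinearModel:
    """Toy linear model trained one example at a time with squared loss."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = defaultdict(float)

    def predict(self, features):
        """Sum the weights of the binary features that are present."""
        return sum(self.weights.get(name, 0.0) for name in features)

    def update(self, features, label):
        """Take one gradient step on (prediction - label)^2 / 2."""
        error = self.predict(features) - label
        for name in features:
            # Each present feature has value 1, so its gradient is the error.
            self.weights[name] -= self.learning_rate * error
        return error


model = OnlineLinearModel(learning_rate=0.5)
model.update(["free_ipad", "warez"], 1)   # a suspended blog
model.update(["museum"], 0)               # a legitimate blog
```

Because each update touches only the features present in one example, the model never needs the full dataset in memory, which is what makes this style of learning suitable for a continuously running service.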
    • Spam: Vowpal Wabbit
      blog: adamlaiacano
      Post:
        tags: [free ipad, warez]
        location: US~NY-New York
        is_suspended: 0 or 1
      Model: is_suspended ~ free_ipad + warez + US~NY-New_York + .....
      Square loss function
      Very high dimension: L1 regularization to avoid overfitting
      Great precision, decent recall
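The model terms on this slide (`free_ipad`, `US~NY-New_York`) suggest that tags and locations are turned into single-token features. A sketch of that step, emitting Vowpal Wabbit's plain-text input format (`label | feature feature ...`), might look like this. The helper names are hypothetical, and the actual feature pipeline used at Tumblr is not shown in the slides.

```python
import re


def vw_token(value):
    """Make a string usable as one VW feature name: no whitespace, ':' or '|'."""
    return re.sub(r"[\s:|]+", "_", value.strip())


def to_vw_line(label, tags, location):
    """Format one post as a VW example in the default namespace."""
    features = [vw_token(tag) for tag in tags] + [vw_token(location)]
    return "{} | {}".format(label, " ".join(features))


line = to_vw_line(1, ["free ipad", "warez"], "US~NY-New York")
print(line)
```

Whitespace, colons, and pipes are replaced because VW uses them as separators between features, values, and namespaces.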
    • Type-Ahead search
      Most popular tags for any letter combination
      Store daily results in distributed Redis cluster
        m: [me, model, mine]
        mu: [muscle, muscles, music video]
        mus: [muscle, muscles, music video]
        muse: [muse, museum, nine muses]
        museu: [museum, metropolitan museum of art, natural history museum]
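A prefix-to-popular-tags index like the one on this slide can be sketched in a few lines. The tag counts below are invented for illustration, and indexing every word of a tag (so that "nine muses" appears under "muse") is an assumption inferred from the slide's example rather than a documented design.

```python
from collections import Counter, defaultdict


def prefixes(word, max_len=5):
    """Leading substrings of word, up to max_len characters."""
    return [word[:i] for i in range(1, min(len(word), max_len) + 1)]


def build_index(tag_counts, top_n=3, max_len=5):
    """Map each prefix to its top_n most used tags."""
    by_prefix = defaultdict(Counter)
    for tag, count in tag_counts.items():
        # Use a set so a tag is counted once per prefix even if two of its
        # words share that prefix.
        tag_prefixes = {p for word in tag.split() for p in prefixes(word, max_len)}
        for prefix in tag_prefixes:
            by_prefix[prefix][tag] += count
    return {
        prefix: [tag for tag, _ in counts.most_common(top_n)]
        for prefix, counts in by_prefix.items()
    }


# Hypothetical daily tag counts.
tag_counts = {"muse": 50, "museum": 40, "nine muses": 30, "music video": 20}
index = build_index(tag_counts)
```

Each key of `index` would then be written as one entry in the key-value store, so a lookup for what the user has typed so far is a single read.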
    • Type-Ahead search
      Only keep popular prefixes: tag must occur 10 times
      Only update keys that have changed.
        - muse: [muse, museum, nine muses]
        + muse: [muse, museum, arizona muse]
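The two rules on this slide, a minimum tag count and writing only changed keys, can be sketched as a filter plus a diff between two daily snapshots. The function names and the snapshot contents are hypothetical, and treating "must occur 10 times" as "at least 10" is an assumption.

```python
def popular(tag_counts, min_count=10):
    """Drop tags that occur fewer than min_count times."""
    return {tag: n for tag, n in tag_counts.items() if n >= min_count}


def changed_keys(previous, current):
    """Return (to_write, to_delete) between two prefix -> tag-list snapshots."""
    to_write = {k: v for k, v in current.items() if previous.get(k) != v}
    to_delete = [k for k in previous if k not in current]
    return to_write, to_delete


# Hypothetical snapshots for two consecutive days.
yesterday = {
    "mu": ["muscle", "muscles", "music video"],
    "muse": ["muse", "museum", "nine muses"],
    "zzz": ["zzz"],
}
today = {
    "mu": ["muscle", "muscles", "music video"],
    "muse": ["muse", "museum", "arizona muse"],
}
to_write, to_delete = changed_keys(yesterday, today)
```

Here only `muse` would be rewritten and `zzz` removed; the unchanged `mu` entry generates no write at all.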
    • Questions?
      @adamlaiacano
      http://adamlaiacano.tumblr.com