Data Science at Tumblr

  1. Data at Tumblr
     Adam Laiacano, NYC Data Science Meetup
     @adamlaiacano, adamlaiacano.tumblr.com
     Monday, April 8, 13
  2. What I Needed to Learn When I Started My Job
  3. About Me
     Electrical Engineering background
     Worked at CBS to learn more about stats / data
     Joined Tumblr in August 2011
     40th employee, now over 160
  4. About Tumblr
     blogging platform / social network
     100,000,000 blogs!
     unique signals: asynchronous following graph; reblogs, likes, replies
  5. About You
     Long format:
         Country  Month  Value
         USA      March  10000
         USA      April  12000
         USA      May    14000
         Canada   March  7000
         Canada   April  6500
         Canada   May    5000
         France   March  1200
         France   April  1400
         France   May    2000
     Wide format:
         Country  March  Apr    May
         USA      10000  12000  14000
         Canada   7000   6500   5000
         France   1200   1400   2000
  6. About You (same two tables): Pivot Table!
  7. About You (same two tables)
  8. About You (same two tables), with the R calls that convert between them:
         pivoted <- cast(melted, country ~ month)
         melted <- melt.data.frame(pivoted, id.vars = "country")
  9. About You (same two tables)
  10. About You (same two tables): Who Cares?
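The long-to-wide conversion on these slides can be sketched in plain Python. This is only an illustration of the idea behind `cast` and `melt`, not the R implementation; the helper names `pivot` and `melt` below are chosen for this sketch.

```python
# Long-format rows as on the slide: (country, month, value).
long_rows = [
    ("USA", "March", 10000), ("USA", "April", 12000), ("USA", "May", 14000),
    ("Canada", "March", 7000), ("Canada", "April", 6500), ("Canada", "May", 5000),
    ("France", "March", 1200), ("France", "April", 1400), ("France", "May", 2000),
]


def pivot(rows):
    """Cast long rows into a wide mapping: {country: {month: value}}."""
    wide = {}
    for country, month, value in rows:
        wide.setdefault(country, {})[month] = value
    return wide


def melt(wide):
    """Melt the wide mapping back into (country, month, value) rows."""
    return [
        (country, month, value)
        for country, months in wide.items()
        for month, value in months.items()
    ]


wide = pivot(long_rows)
round_trip = melt(wide)
```

Here `wide["Canada"]` holds one row of the wide table, and `melt(pivot(rows))` returns the same set of rows it started from.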
  11. One more question:
  12. [image slide, no text]
  13. Hadoop
  14. What tools we use
      What we do with those tools
  15. Plumbing
      John D. Cook, "The plumber programmer", November 2011
      http://bit.ly/XfcXrt
  16. Pipes
      1. Record events / actions
      2. Store / archive everything
      3. Extract information
         a. Reports / BI
         b. Back to Tumblr application
  17. Step 1: Log Events
      GiantOctopus: in-house event logging system.
      Built-in Variables
      •timestamp
      •referring page
      •user identifier
      •action identifier
      •location (city)
      •language setting
      Example call:
          GiantOctopus::log(
              'posts',
              array(
                  'send_to_fb' => 1,
                  'send_to_twitter' => 0
              )
          );
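As a rough sketch of what such a log call might produce, the snippet below merges the built-in variables with the caller's custom fields into one record. GiantOctopus is an in-house system whose internals are not shown in the deck, so the function `build_event`, the field names, and the JSON serialization are all assumptions made for illustration.

```python
import json
import time


def build_event(category, fields, context):
    """Merge built-in variables from the request context with custom fields.

    Hypothetical helper: the field names are guesses based on the slide's list.
    """
    event = {
        "category": category,
        "timestamp": int(time.time()),
        "referrer": context.get("referrer"),
        "user_id": context.get("user_id"),
        "action": context.get("action"),
        "location": context.get("location"),
        "language": context.get("language"),
    }
    # Custom fields are added last, so they would win on a name clash.
    event.update(fields)
    return event


# Made-up request context for the example.
context = {
    "referrer": "/dashboard",
    "user_id": 42,
    "action": "create_post",
    "location": "New York",
    "language": "en",
}
line = json.dumps(
    build_event("posts", {"send_to_fb": 1, "send_to_twitter": 0}, context),
    sort_keys=True,
)
```

One serialized line per event is convenient for a log shipper to append and for later jobs to parse.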
  18. Scribe
      Web Servers -> (continuously writing) -> Scribe Servers -> (daily cron) -> HDFS
  19. Step 2: Store in Hadoop
      One huge computer:
      300TB hard drive
      7.8TB of RAM
      85 x 2 = 170 hex-core processors
  20. Step 2: Store in Hadoop
      One huge computer:
      300TB hard drive
      7.8TB of RAM
      85 x 2 = 170 hex-core processors
      One huge PITA:
      awful docs (search-hadoop.com helps)
      java everywhere
      fragmented community
  21. Hadoop
      hive
      pig
      map/reduce
  22. Hive
      "Basically SQL"
      Compiles to Java map/reduce
      About 100 hive tables
      Each "table" is really a directory of flat files
      10 most liked posts:
          SELECT
              root_post_id,
              count(*) AS likes
          FROM posts
          WHERE action = 'like'
          GROUP BY root_post_id
          ORDER BY likes DESC
          LIMIT 10;
  23. Hive Partitions
      File location in HDFS      Hive partition value
      /posts/2013/03/26/*.lzo    dt=2013-03-26
      /posts/2013/03/27/*.lzo    dt=2013-03-27
      /posts/2013/03/28/*.lzo    dt=2013-03-28
      Filtering on the timestamp column scans every partition (22,895 mappers):
          SELECT action, COUNT(*) AS views
          FROM pageviews
          WHERE ts > 1330927200
            AND ts < 1331013600
          GROUP BY action
      Filtering on the partition column reads one day of files (204 mappers):
          SELECT action, COUNT(*) AS views
          FROM pageviews
          WHERE dt = "2012-03-05"
          GROUP BY action
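The reason the partition filter needs fewer mappers can be sketched as a path lookup: the `dt` value is encoded in the directory, so files outside the requested day are never opened. This is a simplified illustration, not Hive's planner; the file names and the helper names `partition_value` and `prune` are made up for the example.

```python
import re

# Assumed layout, taken from the slide: /posts/YYYY/MM/DD/*.lzo
PARTITION_PATH = re.compile(r"^/posts/(\d{4})/(\d{2})/(\d{2})/")


def partition_value(path):
    """Return the dt partition value for an HDFS path, or None if it does not match."""
    match = PARTITION_PATH.match(path)
    if match is None:
        return None
    return "-".join(match.groups())


def prune(paths, dt):
    """Keep only the files that belong to the requested dt partition."""
    return [p for p in paths if partition_value(p) == dt]


# Hypothetical file listing.
paths = [
    "/posts/2013/03/26/part-00000.lzo",
    "/posts/2013/03/27/part-00000.lzo",
    "/posts/2013/03/27/part-00001.lzo",
    "/posts/2013/03/28/part-00000.lzo",
]
selected = prune(paths, "2013-03-27")
```

A filter on `ts` offers no such shortcut, because the timestamp is only known after a file has been read.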
  24. Extending Hive: Streaming
      •Add all .py files you'll need to the query
      •Sends each record to python script via stdin
      •Can be used as a subquery in a "normal" hive query
      Hive query:
          add file helpers.py;
          FROM users
          SELECT TRANSFORM(id, email)
          USING 'helpers.py'
          AS (id_with_gmail)
      Python script:
          #!/usr/bin/python
          ## helpers.py
          import re
          import sys

          # Match addresses that end in @gmail.com (the dot is escaped).
          gmail = re.compile(r'.+@gmail\.com$')
          for row in sys.stdin:
              # Hive sends one tab-separated record per line.
              id, email = row.rstrip('\n').split('\t')
              if gmail.match(email):
                  print(id)
  25. Pig
      "Basically SQL" if you had to explain it piece by piece.
      "DataBag" == "DataFrame"
          posts = LOAD 'posts.tsv' AS (
              root_post_id:int,
              action:chararray
          );
          likes = FILTER posts BY action == 'like';
          grouped = GROUP likes BY root_post_id;
          counted = FOREACH grouped GENERATE
              group AS root_post_id,
              COUNT(likes.root_post_id) AS likes;
          sorted = ORDER counted BY likes DESC;
          top10 = LIMIT sorted 10;
          STORE top10 INTO 'top10.csv';
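For readers who think in Python rather than Pig, the same piece-by-piece pipeline can be sketched with the standard library. The sample rows are invented for illustration and stand in for `posts.tsv`; this runs in memory and says nothing about how Pig executes the script.

```python
from collections import Counter

# Hypothetical sample of (root_post_id, action) rows standing in for posts.tsv.
posts = [
    (1, "like"), (2, "like"), (1, "reblog"), (1, "like"),
    (3, "like"), (2, "like"), (1, "like"),
]

# FILTER posts BY action == 'like'
likes = [row for row in posts if row[1] == "like"]

# GROUP likes BY root_post_id, then COUNT per group
counted = Counter(root_post_id for root_post_id, _ in likes)

# ORDER counted BY likes DESC, then LIMIT 10
top10 = counted.most_common(10)
```

Each Python statement lines up with one Pig relation, which is the "explain it piece by piece" style the slide describes.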
  26. Extending Pig: Python UDFs
      Extract word prefixes for type-ahead tag search
          def prefixes(input, max_len=3):
              nchar = min(len(input), max_len) + 1
              return [input[:i] for i in range(1, nchar)]

          >>> prefixes('museum', 6)
          ['m', 'mu', 'mus', 'muse', 'museu', 'museum']
  27. Extending Pig: Python UDFs
      Extract word prefixes for type-ahead tag search
          @outputSchema("t:(prefix:chararray)")
          def prefixes(input, max_len=3):
              nchar = min(len(input), max_len) + 1
              return [input[:i] for i in range(1, nchar)]

          >>> prefixes('museum', 6)
          ['m', 'mu', 'mus', 'muse', 'museu', 'museum']
  28. Extending Pig: Java UDFs
          package com.tumblr.swine;

          import java.util.ArrayList;
          import java.util.List;

          public class Prefixes {
              private int maxTermLen;

              public Prefixes() {
                  this.maxTermLen = Integer.MAX_VALUE;
              }

              public Prefixes(int maxTermLen) {
                  this.maxTermLen = maxTermLen;
              }

              public List<String> get(String s) {
                  int size = s.length() < maxTermLen ? s.length() : maxTermLen;
                  ArrayList<String> results = new ArrayList<String>();
                  for (int i = 1; i < size + 1; i++) {
                      results.add(s.substring(0, i));
                  }
                  return results;
              }
          }
  29. Extending Pig: Java UDFs
      (The slide overlays the Pig wrapper below on the plain Prefixes class from slide 28.)
          package com.tumblr.swine.pig;

          import java.io.IOException;
          import java.util.ArrayList;
          import java.util.List;

          import org.apache.pig.EvalFunc;
          import org.apache.pig.FuncSpec;
          import org.apache.pig.data.DataBag;
          import org.apache.pig.data.DataType;
          import org.apache.pig.data.DefaultBagFactory;
          import org.apache.pig.data.Tuple;
          import org.apache.pig.data.TupleFactory;
          import org.apache.pig.impl.logicalLayer.FrontendException;
          import org.apache.pig.impl.logicalLayer.schema.Schema;

          public class Prefixes extends EvalFunc<DataBag> {
              public DataBag exec(Tuple input) throws IOException {
                  if (input == null || input.size() == 0)
                      return null;
                  try {
                      DataBag output = DefaultBagFactory.getInstance().newDefaultBag();
                      String word = (String) input.get(0);
                      int max = Integer.MAX_VALUE;
                      if (input.size() == 2) {
                          max = (Integer) input.get(1);
                      }
                      com.tumblr.swine.Prefixes prefixes = new com.tumblr.swine.Prefixes(max);
                      for (String prefix : prefixes.get(word)) {
                          Tuple t = TupleFactory.getInstance().newTuple(1);
                          t.set(0, prefix);
                          output.add(t);
                      }
                      return output;
                  } catch (Exception e) {
                      System.err.println("Prefixes: failed to process input; error - " + e.getMessage());
                      return null;
                  }
              }

              @Override
              public Schema outputSchema(Schema input) {
                  Schema bagSchema = new Schema();
                  bagSchema.add(new Schema.FieldSchema("prefix", DataType.CHARARRAY));
                  try {
                      return new Schema(new Schema.FieldSchema(
                          getSchemaName(this.getClass().getName().toLowerCase(), input),
                          bagSchema, DataType.BAG));
                  } catch (FrontendException e) {
                      return null;
                  }
              }

              @Override
              public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
                  List<FuncSpec> funcSpecs = new ArrayList<FuncSpec>(2);
                  Schema s = new Schema();
                  s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
                  funcSpecs.add(new FuncSpec(this.getClass().getName(), s));
                  // Allow specifying optional max length of prefix
                  s = new Schema();
                  s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
                  s.add(new Schema.FieldSchema(null, DataType.INTEGER));
                  funcSpecs.add(new FuncSpec(this.getClass().getName(), s));
                  return funcSpecs;
              }
          }
  30. HUE
      Keeps query history
      Preview tables / results
      Save queries & templates
  31. What tools we use
      What we do with those tools
  32. Spam
      Classic example of supervised learning
      Don't get too clever
      Build good tooling!
  33. Spam: Vowpal Wabbit
      Online (continuously learning) system
      Updates parameters with every new piece of information
      Parallelizable, can run as service, very fast.
      Loss functions:
      •squared
      •logistic
      •hinge
      •quantile
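"Updates parameters with every new piece of information" can be illustrated with a few lines of online gradient descent on the squared loss. This is a teaching sketch, not Vowpal Wabbit's implementation; the feature names `cats` and `photography`, the learning rate, and the tiny repeated stream are assumptions made for the example.

```python
def predict(weights, features):
    """Linear prediction over a sparse {feature_name: value} mapping."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def update(weights, features, label, learning_rate=0.1):
    """One online step on the squared loss 0.5 * (prediction - label) ** 2."""
    error = predict(weights, features) - label
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) - learning_rate * error * value
    return error


weights = {}
# Hypothetical stream: one spam-like post and one ordinary post, repeated.
stream = [
    ({"free_ipad": 1.0, "warez": 1.0}, 1.0),
    ({"cats": 1.0, "photography": 1.0}, 0.0),
] * 50

# The model is adjusted after every single example; no batch is ever stored.
for features, label in stream:
    update(weights, features, label)
```

After the stream has been consumed, `predict` scores the spam-like features close to 1 and the ordinary ones close to 0.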
  34. Spam: Vowpal Wabbit
      Post:
          blog: adamlaiacano,
          tags: [free ipad, warez],
          location: US~NY-New York,
          is_suspended: 0 or 1
      Model: is_suspended ~ free_ipad + warez + US~NY-New_York + .....
      Square loss function
      Very high dimension: L1 regularization to avoid overfitting
      Great precision, decent recall
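One way a post like this could be turned into a training example is a single line in Vowpal Wabbit's text input format, `label | feature feature ...`. The helper `to_vw_line` is hypothetical, as is the choice to use only tags and location as features; the deck does not show how Tumblr built its examples.

```python
def to_vw_line(post):
    """Format one post as a Vowpal Wabbit text example: '<label> | <features>'."""
    def clean(token):
        # Spaces, ':' and '|' have special meaning in the VW text format.
        return token.replace(" ", "_").replace(":", "_").replace("|", "_")

    features = [clean(tag) for tag in post["tags"]]
    features.append(clean(post["location"]))
    return "{} | {}".format(post["is_suspended"], " ".join(features))


post = {
    "blog": "adamlaiacano",
    "tags": ["free ipad", "warez"],
    "location": "US~NY-New York",
    "is_suspended": 1,  # hypothetical label for the example
}
line = to_vw_line(post)
```

Replacing spaces with underscores gives the same feature names that appear in the model formula on the slide, such as `free_ipad` and `US~NY-New_York`.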
  35. Type-ahead search
      Most popular tags for any letter combination
      Store daily results in distributed Redis cluster
          m: [me, model, mine]
          mu: [muscle, muscles, music video]
          mus: [muscle, muscles, music video]
          muse: [muse, museum, nine muses]
          museu: [museum, metropolitan museum of art, natural history museum]
  36. Type-ahead search
      Only keep popular prefixes: tag must occur 10 times
      Only update keys that have changed.
          - muse: [muse, museum, nine muses]
          + muse: [muse, museum, arizona muse]
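The daily index build and the "only update changed keys" step might look roughly like the sketch below. Plain dictionaries stand in for the Redis cluster, and the tag counts, the top-3 cutoff, and the matching of prefixes against every word in a tag are assumptions inferred from the examples on the slides.

```python
from collections import defaultdict


def prefixes(word, max_len=6):
    """All prefixes of a word, up to max_len characters."""
    return [word[:i] for i in range(1, min(len(word), max_len) + 1)]


def build_index(tag_counts, min_count=10, top_n=3):
    """Map each prefix to its top_n most popular tags, ignoring rare tags."""
    by_prefix = defaultdict(list)
    for tag, count in tag_counts.items():
        if count < min_count:
            continue  # tag must occur at least min_count times
        # Collect prefixes of every word so "nine muses" matches "muse".
        tag_prefixes = {p for word in tag.split() for p in prefixes(word)}
        for prefix in tag_prefixes:
            by_prefix[prefix].append((count, tag))
    return {
        prefix: [tag for _, tag in sorted(entries, key=lambda e: (-e[0], e[1]))[:top_n]]
        for prefix, entries in by_prefix.items()
    }


def changed_keys(old_index, new_index):
    """Return only the entries that would need to be rewritten in the store."""
    return {k: v for k, v in new_index.items() if old_index.get(k) != v}


# Made-up daily counts.
yesterday = {"muse": 50, "museum": 40, "nine muses": 30, "arizona muse": 5}
today = {"muse": 50, "museum": 40, "nine muses": 30, "arizona muse": 35}

old_index = build_index(yesterday)
new_index = build_index(today)
changed = changed_keys(old_index, new_index)
```

With these counts the `muse` entry changes exactly as in the diff on slide 36, while `museu` is untouched and would not be written again.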
  37. Questions?
      @adamlaiacano
      http://adamlaiacano.tumblr.com
