Scaling Big Data Mining Infrastructure: The Twitter Experience


The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In this talk, we’ll discuss the evolution of our infrastructure and the development of capabilities for data mining on “big data”. We’ll share our experiences as a case study, but make recommendations for best practices and point out opportunities for future work.



  1. Scaling Big Data Mining Infrastructure. Jimmy Lin (@lintool) and Dmitriy Ryaboy (@squarecog). Hadoop Summit Europe, Thursday, March 21, 2013.
  2. From the Ivory Tower… Source: Wikipedia (All Souls College, Oxford)
  3. … to building sh*t that works. Source: Wikipedia (Factory)
  4. IMHO: Represents personal opinion. Not the official position of Twitter. Management not responsible for misuse. Void where prohibited. YMMV. (If someone asks, I probably wasn’t here.)
  5. “Yesterday”: ~150 people total; ~60 Hadoop nodes; ~6 people using the analytics stack daily.
  6. “Today”: ~1400 people total; 10s of Ks of Hadoop nodes across multiple DCs; 10s of PBs of total Hadoop DW capacity; ~100 TB ingested daily; dozens of teams using Hadoop daily; 10s of Ks of Hadoop jobs daily.
  7. Why? Data becomes actionable insights and data products.
  8. Big data mining. Cool, you get to work on new algorithms! No, not really…
  9. 9. Big data mining isn’t mainly about data mining per se!
  10. It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data. – DJ Patil, “Data Jujitsu”. Source: Wikipedia (Jujitsu)
  11. Reality: Your boss says something vague. You think very hard about how to move the needle. Where’s the data? What’s in this dataset? What’s all the f#$!* crap in the data? Clean the data. Run some off-the-shelf data mining algorithm. … Productionize, act on the insight. Rinse, repeat.
  12. Data science is less glamorous than you think! How do we make data scientists’ lives a bit easier?
  13. Gathering. Source: Wikipedia (Logging)
  14. Moving. Source: Wikipedia (Timber rafting)
  15. Organizing. Source: Wikipedia (Logging)
  16. Log directly into a database!
  17. create table `my_audit_log` (
        `id` int(11) NOT NULL AUTO_INCREMENT,
        `created_at` datetime,
        `user_id` int(11),
        `action` varchar(256),
        ...
      ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
      Don’t do this! Workload mismatch; scaling challenges; overkill; schema changes.
  18. [Architecture diagram] Scribe daemons on production hosts send logs to per-datacenter Scribe aggregators, which land the data in HDFS on a staging Hadoop cluster; the staging clusters feed the main Hadoop DW in the main datacenter. Use Scribe. Or Flume. Or Kafka.
  19. Scribe solves log transport only… the messages themselves are still just System.out.println output.
  20. An actual Java regular expression used to parse log messages at Twitter, circa 2010:

        ^(\w+\s+\d+\s+\d+:\d+:\d+)\s+([^@]+?)@(\S+)\s+(\S+):\s+(\S+)\s+(\S+)\s+((?:\S+?,\s+)*(?:\S+?))\s+(\S+)\s+(\S+)\s+\[([^\]]+)\]\s+"(\w+)\s+([^"]*(?:\\.[^"]*)*)\s+(\S+)"\s+(\S+)\s+(\S+)\s+"([^"]*(?:\\.[^"]*)*)"\s+"([^"]*(?:\\.[^"]*)*)"\s*(\d*-[\d-]*)?\s*(\d+)?\s*(\d*\.[\d.]*)?(\s+[-\w]+)?.*$

      Plain-text log messages suck. Don’t do this!
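      To see why, here is a minimal, self-contained illustration (the pattern below is a deliberately tiny, made-up one, not the monster above) of how regex-based parsing of free-form log lines silently breaks when the format drifts:

        import java.util.regex.Pattern;

        public class LogParseExample {
            // Toy log format: "LEVEL pid: action"
            private static final Pattern LINE =
                Pattern.compile("^(\\w+)\\s+(\\d+):\\s+(\\S+)$");

            public static void main(String[] args) {
                System.out.println(LINE.matcher("INFO 42: login").matches()); // true
                // Someone appends one field to the log format, and every
                // consumer of this regex silently stops matching:
                System.out.println(LINE.matcher("INFO 42: login dc=smf1").matches()); // false
            }
        }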
  21. userid, CamelCase, smallCamelCase, user_id, snake_case, camel_Snake, dunder__snake… Naming things is hard!
  22. JSON to the Rescue!
  23. {
        "token": 945842,
        "feature_enabled": "super_special",   <- this should really be a list…
        "userid": 229922,                     <- remember the camelSnake! Is this really an integer?
        "page": "null",                       <- is this really null?
        "info": { "email": "" }               <- what keys? what values?
      }
      This does not scale.
  24. enum Feature {
        super_special,
        less_special
      }

      struct MessageInfo {
        1: optional string name
        2: optional string email
      }

      struct LogMessage {
        1: required i64 token
        2: required string user_id
        3: optional list<Feature> enabled_features
        4: optional i64 page = 0
        5: optional MessageInfo info
      }

      + DDL provides type safety
      + Auto codegen
      + Efficient serialization
      + Sane schema migration
      + Separates logical from physical
      Use Thrift. Or Protobufs. Or Avro.
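      As a sketch of what the generated code buys you, assuming the IDL above has been compiled with `thrift --gen java` (accessor names follow the Thrift Java generator's conventions and are not from the original slides):

        import org.apache.thrift.TException;
        import org.apache.thrift.TSerializer;
        import org.apache.thrift.TDeserializer;
        import org.apache.thrift.protocol.TBinaryProtocol;

        public class LogMessageExample {
            public static void main(String[] args) throws TException {
                LogMessage msg = new LogMessage();
                msg.setToken(945842L);                            // i64, checked at compile time
                msg.setUser_id("229922");                         // one canonical name, no camelSnake
                msg.addToEnabled_features(Feature.super_special); // a real list, not a string

                // Compact binary serialization instead of fragile plain text:
                byte[] bytes = new TSerializer(new TBinaryProtocol.Factory()).serialize(msg);

                // A consumer compiled against the same (or a compatibly migrated)
                // schema gets typed fields back:
                LogMessage decoded = new LogMessage();
                new TDeserializer(new TBinaryProtocol.Factory()).deserialize(decoded, bytes);
                System.out.println(decoded.getUser_id());
            }
        }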
  25. Schemas aren’t enough! We need a data discovery service! Where’s the data? How do I read it? Who produces it? Who consumes it? When was it last generated? …
  26. Where to find data? The old way: hard-coded partitioning scheme, path, and format, with a custom loader:

        A = LOAD '/tables/statuses/2011/01/{05,06,07}/*.lzo'
            USING LzoStatusProtobufBlockPigLoader();

      How do people know? 1) Ask around. 2) Cargo-cult. The new way: a nice UI for browsing, and the same loader every time:

        A = LOAD 'tables.statuses' USING TwadoopLoader();
        B = FILTER A BY year == '2011' AND month == '11' AND day == '01'
            AND hour >= '05' AND hour <= '07';

      Filters are pushed into the loader; no need to understand the partitioning scheme.
  27. Data Access Layer (DAL). “All problems in computer science can be solved by another level of indirection... except for the problem of too many layers of indirection.” – David Wheeler. All data accesses go through the DAL, a thin layer on top of HCatalog.
  28. Data Access Layer (DAL): who wrote what data, and when? #win. Automatically construct the data/job dependency graph, automatically figure out ownership, and hook into alerting, auditing, repos, deploy systems, etc.
  29. Plumbing. Jimmy Lin and Alek Kolcz. Large-Scale Machine Learning at Twitter. SIGMOD 2012. Source: Wikipedia (Plumbing)
  30. Classification. Source: Wikipedia (Sorting)
  31. [Diagram] Training: a learner induces a classifier from (label, feature vector) pairs. Predicting: the classifier maps a new feature vector to a predicted label.
  32. Stone-age machine learning… Source: Wikipedia (Stonehenge)
  33. [Workflow diagram] Data munging (joining multiple datasets, feature extraction, …), then download, down-sample, train, test, predict, and upload the results.
  34. What doesn’t work: 1) Down-sampling to train on a single processor defeats the whole point of big data! 2) Ad hoc productionizing is disconnected from the rest of the production workflow.
  35. Production considerations: dependency management, scheduling, resource allocation, monitoring, error reporting, alerting, … We need…
  36. Seamless scaling. Source: Wikipedia (Galaxy)
  37. Integration with production workflows. Source: Wikipedia (Oil refinery)
  38. Stochastic gradient descent: conceptually, classifier training is like a user-defined aggregate function! AVG: initialize sum = 0 and count = 0; update: add to sum, increment count; terminate: return sum / count. SGD: initialize weights; update: update weights; terminate: return weights.
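      To make the analogy concrete, here is a minimal sketch (not Twitter’s implementation) of SGD for logistic regression, structured exactly like the aggregate above: initialize, one update per training example, terminate.

        // One pass of stochastic gradient descent for logistic regression.
        public class SgdTrainer {
            private final double[] weights;     // "initialize: weights"
            private final double learningRate;

            public SgdTrainer(int numFeatures, double learningRate) {
                this.weights = new double[numFeatures];
                this.learningRate = learningRate;
            }

            // "update: update weights" -- one gradient step per example.
            public void update(int label, double[] features) { // label in {0, 1}
                double z = 0.0;
                for (int i = 0; i < weights.length; i++) {
                    z += weights[i] * features[i];
                }
                double p = 1.0 / (1.0 + Math.exp(-z)); // predicted probability
                double error = label - p;
                for (int i = 0; i < weights.length; i++) {
                    weights[i] += learningRate * error * features[i];
                }
            }

            // "terminate: return weights" -- the trained model.
            public double[] terminate() {
                return weights;
            }
        }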
  39. [Dataflow diagram] Training: a previous Pig dataflow emits (label, feature vector) pairs; classifier training runs inside the reducers, and the resulting model is written out by a Pig storage function. Making predictions: the model is loaded into a Pig UDF and applied to each feature vector, emitting predictions. Just like any other parallel Pig dataflow.
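      The prediction half might look like the following hypothetical Pig UDF (the class name and the weight-loading scheme are illustrative, not Twitter’s actual code; the real UDFs are described in the SIGMOD 2012 paper cited above):

        import java.io.IOException;
        import org.apache.pig.EvalFunc;
        import org.apache.pig.data.Tuple;

        // Applies a trained weight vector to a feature vector, so that scoring
        // is just another function call in a Pig dataflow.
        public class ClassifyWithModel extends EvalFunc<Double> {
            private final double[] weights;

            // For simplicity the model arrives as a comma-separated constructor
            // argument; a real implementation would read it from HDFS, where
            // the training job's storage function wrote it.
            public ClassifyWithModel(String model) {
                String[] parts = model.split(",");
                weights = new double[parts.length];
                for (int i = 0; i < parts.length; i++) {
                    weights[i] = Double.parseDouble(parts[i]);
                }
            }

            @Override
            public Double exec(Tuple featureVector) throws IOException {
                double z = 0.0;
                for (int i = 0; i < weights.length; i++) {
                    z += weights[i] * ((Number) featureVector.get(i)).doubleValue();
                }
                return 1.0 / (1.0 + Math.exp(-z)); // probability of the positive class
            }
        }

      In a script this would be used like any other UDF, e.g. DEFINE Classify ClassifyWithModel('0.4,-1.2,0.7'); followed by a FOREACH … GENERATE Classify(features); over the data.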
  40. It’s just Pig! For “free”: dependency management, scheduling, resource allocation, monitoring, error reporting, alerting, …
  41. 41. Source: Wikipedia (Road surface)
  42. Takeaway messages: How do we make data scientists’ lives a bit easier? Adding a bit of structure goes a long way. Getting the plumbing right makes all the difference. “In theory, there is no difference between theory and practice. But, in practice, there is.” – Jan L.A. van de Snepscheut. Questions?