Scaling Big Data Mining Infrastructure: The Twitter Experience
 


The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In this talk, we’ll discuss the evolution of our infrastructure and the development of capabilities for data mining on “big data”. We’ll share our experiences as a case study, but make recommendations for best practices and point out opportunities for future work.


Presentation Transcript

    • Scaling Big Data Mining Infrastructure. Jimmy Lin (@lintool) and Dmitriy Ryaboy (@squarecog). Hadoop Summit Europe, Thursday, March 21, 2013.
    • From the Ivory Tower… Source: Wikipedia (All Souls College, Oxford)
    • … to building sh*t that works. Source: Wikipedia (Factory)
    • IMHO: Represents personal opinion. Not official position of Twitter. Management not responsible for misuse. Void where prohibited. YMMV. (If someone asks, I probably wasn’t here.)
    • “Yesterday”: ~150 people total, ~60 Hadoop nodes, ~6 people use the analytics stack daily
    • “Today”: ~1400 people total, 10s of Ks of Hadoop nodes across multiple DCs, 10s of PBs total Hadoop DW capacity, ~100 TB ingest daily, dozens of teams use Hadoop daily, 10s of Ks of Hadoop jobs daily
    • Why? Data → actionable insights → data products
    • Big data mining: Cool, you get to work on new algorithms! No, not really…
    • Big data mining isn’t mainly about data mining per se!
    • “It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data.” – DJ Patil, “Data Jujitsu”. Source: Wikipedia (Jujitsu)
    • Reality: Your boss says something vague. You think very hard about how to move the needle. Where’s the data? What’s in this dataset? What’s all the f#$!* crap in the data? Clean the data. Run some off-the-shelf data mining algorithm. … Productionize, act on the insight. Rinse, repeat.
    • Data science is less glamorous than you think! How do we make data scientists’ lives a bit easier?
    • Gathering. Source: Wikipedia (Logging)
    • Moving. Source: Wikipedia (Timber rafting)
    • Organizing. Source: Wikipedia (Logging)
    • Log directly into a database! Source: http://www.flickr.com/photos/snukkel/3206405352/
    • create table `my_audit_log` ( `id` int(11) NOT NULL AUTO_INCREMENT, `created_at` datetime, `user_id` int(11), `action` varchar(256), ... ) ENGINE=InnoDB DEFAULT CHARSET=utf8; Don’t do this! Workload mismatch, scaling challenges, overkill, schema changes.
    • [Diagram: Scribe daemons on production hosts in each datacenter send logs to Scribe aggregators, which write to HDFS on per-datacenter staging Hadoop clusters; the staging clusters feed the main Hadoop DW in the main datacenter.] Use Scribe. Or Flume. Or Kafka.
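      The recommendation here is a dedicated log-transport system (Scribe, Flume, or Kafka) rather than production hosts writing into a database. As a hedged illustration only, not the setup described in the talk (Twitter used Scribe), here is a minimal sketch of shipping one event with the Kafka Java producer client; the broker address and topic name are made-up placeholders.

        import java.util.Properties;
        import org.apache.kafka.clients.producer.KafkaProducer;
        import org.apache.kafka.clients.producer.ProducerRecord;

        public class LogShipper {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put("bootstrap.servers", "broker1:9092");  // hypothetical broker address
                props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

                // The producer batches and ships events; production hosts never touch the DW directly.
                try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                    // In practice the value would be a serialized, schema'd record
                    // (see the Thrift slide below), not free-form text.
                    producer.send(new ProducerRecord<>("client_events", "user_id=229922 action=login"));
                }
            }
        }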
    • Scribe solves log transport only… what actually gets logged (System.out.println, LOG.info) is still unstructured.
    • ^(\w+\s+\d+\s+\d+:\d+:\d+)\s+([^@]+?)@(\S+)\s+(\S+):\s+(\S+)\s+(\S+)\s+((?:\S+?,\s+)*(?:\S+?))\s+(\S+)\s+(\S+)\s+\[([^\]]+)\]\s+"(\w+)\s+([^"]*(?:\\.[^"]*)*)\s+(\S+)"\s+(\S+)\s+(\S+)\s+"([^"]*(?:\\.[^"]*)*)"\s+"([^"]*(?:\\.[^"]*)*)"\s*(\d*-[\d-]*)?\s*(\d+)?\s*(\d*\.[\d.]*)?(\s+[-\w]+)?.*$ An actual Java regular expression used to parse log messages at Twitter circa 2010. Plain-text log messages suck. Don’t do this!
    • userid, CamelCase, smallCamelCase, user_id, snake_case, camel_Snake, dunder__snake… Naming things is hard!
    • JSON to the Rescue! Source: http://www.flickr.com/photos/snukkel/3206405352/
    • { "token": 945842, "feature_enabled": "super_special" (this should really be a list, and remember the camelSnake!), "userid": 229922 (is this really an integer?), "page": "null" (is this really null?), "info": { "email": "my@place.com" } } What keys? What values? This does not scale.
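      A minimal sketch of what consuming such a message tends to look like on the other end (this uses the Jackson library purely for illustration; the talk does not name a JSON parser): every type and null convention has to be re-checked defensively, because nothing enforces a schema.

        import com.fasterxml.jackson.databind.JsonNode;
        import com.fasterxml.jackson.databind.ObjectMapper;

        public class FragileJsonReader {
            public static void main(String[] args) throws Exception {
                String json = "{\"token\": 945842, \"feature_enabled\": \"super_special\", "
                        + "\"userid\": 229922, \"page\": \"null\", \"info\": {\"email\": \"my@place.com\"}}";

                JsonNode msg = new ObjectMapper().readTree(json);

                // Is userid really an integer? Some producers might log it as a string.
                JsonNode userid = msg.get("userid");
                long userId = userid.isNumber() ? userid.asLong() : Long.parseLong(userid.asText());

                // "page" is the *string* "null", not a JSON null; the reader has to guess what was meant.
                JsonNode page = msg.get("page");
                boolean pageMissing = page == null || page.isNull() || "null".equals(page.asText());

                System.out.println("userid=" + userId + ", pageMissing=" + pageMissing);
            }
        }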
    • struct MessageInfo { 1: optional string name 2: optional string email } enum Feature { super_special, less_special } struct LogMessage { 1: required i64 token 2: required string user_id 3: optional list<Feature> enabled_features 4: optional i64 page = 0 5: optional MessageInfo info } + DDL provides type safety + Auto codegen + Efficient serialization + Sane schema migration + Separate logical from physical. Use Thrift. Or Protobufs. Or Avro.
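      For contrast, a hedged sketch of producing the same event once the Thrift compiler has generated Java classes from the IDL above. The generated API varies by Thrift version; this assumes the conventional generated public fields and the libthrift TSerializer, and is illustrative rather than Twitter’s actual logging code.

        import java.util.Arrays;
        import org.apache.thrift.TException;
        import org.apache.thrift.TSerializer;
        import org.apache.thrift.protocol.TBinaryProtocol;

        public class LogMessageProducer {
            public static void main(String[] args) throws TException {
                // Field names and types come from the IDL, so a typo or a wrong type
                // is a compile error instead of a downstream surprise.
                LogMessage msg = new LogMessage();
                msg.token = 945842L;
                msg.user_id = "229922";
                msg.enabled_features = Arrays.asList(Feature.super_special);

                MessageInfo info = new MessageInfo();
                info.email = "my@place.com";
                msg.info = info;

                // Compact binary serialization; the schema lives in the IDL, not in every message.
                byte[] payload = new TSerializer(new TBinaryProtocol.Factory()).serialize(msg);
                System.out.println("serialized " + payload.length + " bytes");
            }
        }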
    • Schemas aren’t enough! We need a data discovery service! Where’s the data? How do I read it? Who produces it? Who consumes it? When was it last generated? …
    • Where to find data? Old way: hard-coded partitioning scheme, path, format, and a custom loader: A = LOAD '/tables/statuses/2011/01/{05,06,07}/*.lzo' USING LzoStatusProtobufBlockPigLoader(); How do people know? 1) Ask around. 2) Cargo-cult. New way: a nice UI for browsing, and the same loader each time: A = LOAD 'tables.statuses' USING TwadoopLoader(); B = FILTER A BY year == '2011' AND month == '11' AND day == '01' AND hour >= '05' AND hour <= '07'; Filters are pushed into the loader. No need to understand the partitioning scheme.
    • Data Access Layer (DAL). “All problems in computer science can be solved by another level of indirection... except for the problem of too many layers of indirection.” – David Wheeler. All data accesses go through DAL. Thin layer on top of HCatalog.
    • Data Access Layer (DAL): Who wrote what data, when? #win Automatically construct data/job dependency graph. Automatically figure out ownership. Hooks into alerting, auditing, repos, deploy systems, etc.
    • Plumbing. Jimmy Lin and Alek Kolcz. Large-Scale Machine Learning at Twitter. SIGMOD 2012. Source: Wikipedia (Plumbing)
    • Classification. Source: Wikipedia (Sorting)
    • [Diagram: supervised classification. Training: given feature vectors with labels, a learner induces a classifier. Predicting: the classifier maps an unlabeled feature vector to a predicted label.]
    • Stone age machine learning… Source: Wikipedia (Stonehenge)
    • [Diagram: the old workflow. Data munging on the cluster (joining multiple datasets, feature extraction, …), then download and down-sample, train locally, predict on test data, and upload the results.]
    • What doesn’t work… 1. Down-sampling for training on a single processor: defeats the whole point of big data! 2. Ad hoc productionizing: disconnected from the rest of the production workflow.
    • Production considerations: dependency management, scheduling, resource allocation, monitoring, error reporting, alerting, … We need…
    • Seamless scaling. Source: Wikipedia (Galaxy)
    • Integration with production workflows. Source: Wikipedia (Oil refinery)
    • Stochastic Gradient Descent. Conceptually, classifier training is like a user-defined aggregate function! AVG: initialize sum = 0 and count = 0; update: add to sum, increment count; terminate: return sum / count. SGD: initialize weights; update weights; terminate: return weights.
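      A minimal, self-contained sketch of the initialize / update / terminate pattern the slide describes, with online logistic regression via SGD playing the role of the aggregation state. This is plain Java for illustration, not Twitter’s actual Pig UDF code:

        /** Classifier training expressed as an aggregate: initialize, update per example, terminate. */
        public class SgdAggregate {
            private final double[] weights;      // aggregation state, like "sum" and "count" in AVG
            private final double learningRate;

            public SgdAggregate(int numFeatures, double learningRate) {   // initialize
                this.weights = new double[numFeatures];
                this.learningRate = learningRate;
            }

            /** update: one gradient step for a single (features, label) example, label in {0, 1}. */
            public void update(double[] features, int label) {
                double z = 0.0;
                for (int i = 0; i < weights.length; i++) {
                    z += weights[i] * features[i];
                }
                double prediction = 1.0 / (1.0 + Math.exp(-z));   // logistic function
                double error = label - prediction;
                for (int i = 0; i < weights.length; i++) {
                    weights[i] += learningRate * error * features[i];
                }
            }

            /** terminate: the trained model is just the final weight vector. */
            public double[] terminate() {
                return weights.clone();
            }

            public static void main(String[] args) {
                // Tiny made-up dataset: learn y = 1 when the second feature is large.
                SgdAggregate trainer = new SgdAggregate(2, 0.1);
                double[][] xs = { {1, 0.0}, {1, 0.2}, {1, 0.9}, {1, 1.0} };
                int[] ys = { 0, 0, 1, 1 };
                for (int epoch = 0; epoch < 200; epoch++) {
                    for (int i = 0; i < xs.length; i++) {
                        trainer.update(xs[i], ys[i]);
                    }
                }
                System.out.println(java.util.Arrays.toString(trainer.terminate()));
            }
        }

      In the pipeline described in Lin and Kolcz (SIGMOD 2012), this same initialize/update/terminate structure is what lets training run inside an ordinary Pig dataflow, as the next slides show.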
    • [Diagram: training is just another Pig dataflow. A previous Pig dataflow emits (label, feature vector) pairs; classifier training runs in the reducers behind a Pig storage function, which writes out the models. For predictions, the model is wrapped in a Pig UDF that maps feature vectors to predictions.] Just like any other parallel Pig dataflow.
    • It’s just Pig! For “free”: dependency management, scheduling, resource allocation, monitoring, error reporting, alerting, …
    • Source: Wikipedia (Road surface)
    • Takeaway messages: How do we make data scientists’ lives a bit easier? Adding a bit of structure goes a long way. Getting the plumbing right makes all the difference. “In theory, there is no difference between theory and practice. But, in practice, there is.” – Jan L.A. van de Snepscheut. Questions?