Hadoop at Meebo: Lessons in the Real World
 

    Presentation Transcript

    • Hadoop at Meebo: Lessons learned in the real world
      Vikram Oberoi
      August, 2010
      Hadoop Day, Seattle
    • About me
      SDE Intern at Amazon, ’07
      R&D on item-to-item similarities
      Data Engineer Intern at Meebo, ’08
      Built an A/B testing system
      CS at Stanford, ’09
      Senior project: Ext3 and XFS under Hadoop MapReduce workloads
      Data Engineer at Meebo, ’09—present
      Data infrastructure, analytics
    • About Meebo
      Products
      Browser-based IM client (www.meebo.com)
      Mobile chat clients
      Social widgets (the Meebo Bar)
      Company
      Founded 2005
      Over 100 employees, 30 engineers
      Engineering
      Strong engineering culture
      Contributions to CouchDB, Lounge, Hadoop components
    • The Problem
      Hadoop is powerful technology
      Meets today’s demand for big data
      But it’s still a young platform
      Evolving components and best practices
      With many challenges in real-world usage
      Day-to-day operational headaches
      Missing ecosystem features (e.g. recurring jobs)
      Lots of re-inventing the wheel to solve these
    • Purpose of this talk
      Discuss some real problems we’ve seen
      Explain our solutions
      Propose best practices so you can avoid these problems
    • What will I talk about?
      Background:
      Meebo’s data processing needs
      Meebo’s pre- and post-Hadoop data pipelines
      Lessons:
      Better workflow management
      Scheduling, reporting, monitoring, etc.
      A look at Azkaban
      Get wiser about data serialization
      Protocol Buffers (or Avro, or Thrift)
    • Meebo’s Data Processing Needs
    • What do we use Hadoop for?
      ETL
      Analytics
      Behavioral targeting
      Ad hoc data analysis, research
      Data produced helps power:
      internal/external dashboards
      our ad server
    • What kind of data do we have?
      Log data from all our products
      The Meebo Bar
      Meebo Messenger (www.meebo.com)
      Android/iPhone/Mobile Web clients
      Rooms
      Meebo Me
      Meebo Notifier
      Firefox extension
    • How much data?
      150MM uniques/month from the Meebo Bar
      Around 200 GB of uncompressed daily logs
      We process a subset of our logs
    • Meebo’s Data Pipeline
      Pre- and Post-Hadoop
    • A data pipeline in general
      1. Data Collection
      2. Data Processing
      3. Data Storage
      4. Workflow Management
    • Our data pipeline, pre-Hadoop
      Servers
      Python/shell scripts pull log data
      Python/shell scripts process data
      MySQL, CouchDB, flat files
      Cron, wrapper shell scripts glue everything together
    • Our data pipeline, post-Hadoop
      Servers
      Push logs to HDFS
      Pig scripts process data
      MySQL, CouchDB, flat files
      Azkaban, a workflow management system, glues everything together
    • Our transition to using Hadoop
      Deployed early ’09
      Motivation: processing data took aaaages!
      Catalyst: Hadoop Summit
      Turbulent, time consuming
      New tools, new paradigms, pitfalls
      Totally worth it
      24 hours to process a day’s logs → under an hour
      Leap in ability to analyze our data
      Basis for new core product features
    • Workflow Management
    • What is workflow management?
    • What is workflow management?
      It’s the glue that binds your data pipeline together: scheduling, monitoring, reporting, etc.
      Most people use scripts and cron
      But end up spending too much time managing
      We need a better way
    • Workflow management consists of:
      Executes jobs with arbitrarily complex dependency chains
    • Split up your jobs into discrete chunks with dependencies
      • Minimize impact when chunks fail
      • Allow engineers to work on chunks separately
      • Monolithic scripts are no fun
    • Clean up data from log A
      Process data from log B
      Join data, train a classifier
      Post-processing
      Archive output
      Export to DB somewhere
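      To make the idea concrete, here is a minimal Python sketch (not from the deck; the step names and the runner itself are illustrative) of running discrete chunks in dependency order, mirroring the example flow above:

      # Minimal illustrative sketch: run discrete job "chunks" in dependency
      # order instead of one monolithic script. Step names mirror the flow above.

      def step(name):
          """Stand-in for a real chunk of work (a Pig script, a shell command, ...)."""
          def run_step():
              print("running:", name)
          return run_step

      # Each chunk lists the chunks it depends on.
      JOBS = {
          "clean_log_a":    ([], step("clean up data from log A")),
          "process_log_b":  ([], step("process data from log B")),
          "join_and_train": (["clean_log_a", "process_log_b"], step("join data, train a classifier")),
          "post_process":   (["join_and_train"], step("post-processing")),
          "archive_output": (["post_process"], step("archive output")),
          "export_to_db":   (["post_process"], step("export to DB somewhere")),
      }

      def run(job, done):
          """Run a chunk after all of its dependencies have run."""
          if job in done:
              return
          deps, work = JOBS[job]
          for dep in deps:
              run(dep, done)
          work()
          done.add(job)

      if __name__ == "__main__":
          finished = set()
          for name in JOBS:
              run(name, finished)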
    • Workflow management consists of:
      Executes jobs with arbitrarily complex dependency chains
      Schedules recurring jobs to run at a given time
      Monitors job progress
      Reports when jobs fail and how long jobs take
      Logs job execution and exposes logs so that engineers can deal with failures swiftly
      Provides resource management capabilities
    • Diagram: five jobs each end with “Export to DB somewhere” and all hit the same database at once. Don’t DoS yourself.
    • Diagram: the same exports go through a Permit Manager that limits how many may hit the database at a time; the rest wait for a permit (see the sketch below).
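      A minimal Python sketch of the permit-manager idea (the permit count, job names, and the export body are made up for illustration): only a fixed number of exports may hit the database at once, and the rest block until a permit frees up.

      # Illustrative permit manager: cap concurrent DB exports so simultaneous
      # jobs don't DoS the database. Permit count and export body are made up.
      import threading
      import time

      DB_PERMITS = threading.Semaphore(2)  # at most 2 exports at a time

      def export_to_db(job_name):
          with DB_PERMITS:        # blocks until a permit is available
              print(job_name, "exporting to DB")
              time.sleep(1)       # stand-in for the real export
          print(job_name, "done")

      if __name__ == "__main__":
          threads = [threading.Thread(target=export_to_db, args=("job-%d" % i,))
                     for i in range(5)]
          for t in threads:
              t.start()
          for t in threads:
              t.join()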
    • Don’t roll your own scheduler!
      Building a good scheduling framework is hard
      Myriad of small requirements, precise bookkeeping with many edge cases
      Many roll their own
      It’s usually inadequate
      So much repeated effort!
      Mold an existing framework to your requirements and contribute
    • Two emerging frameworks
      Oozie
      Built at Yahoo
      Open-sourced at Hadoop Summit ’10
      Used in production for [don’t know]
      Packaged by Cloudera
      Azkaban
      Built at LinkedIn
      Open-sourced in March ‘10
      Used in production for over nine months as of March ’10
      Now in use at Meebo
    • Azkaban
    • Azkaban jobs are bundles of configuration and code
    • Configuring a job
      process_log_data.job
      type=command
      command=python process_logs.py
      failure.emails=datateam@whereiwork.com
      process_logs.py
      import os
      import sys
      # Do useful things

    • Deploying a job
      Step 1: Shove your config and code into a zip archive.
      process_log_data.zip
      .job
      .py
    • Deploying a job
      Step 2: Upload to Azkaban
      process_log_data.zip
      .job
      .py
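      Step 1 is easy to script; here is a minimal sketch using Python's zipfile (file names taken from the slides, paths assumed). The upload in Step 2 goes through the Azkaban front-end and is not shown.

      # Minimal sketch of Step 1: bundle the job config and code into a zip.
      import zipfile

      with zipfile.ZipFile("process_log_data.zip", "w") as bundle:
          bundle.write("process_log_data.job")
          bundle.write("process_logs.py")
      # Step 2: upload process_log_data.zip via the Azkaban front-end.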
    • Scheduling a job
      The Azkaban front-end:
    • What about dependencies?
    • get_users_widgets
      process_widgets.job
      process_users.job
      join_users_widgets.job
      export_to_db.job
    • get_users_widgets
      process_widgets.job
      type=command
      command=python process_widgets.py
      failure.emails=datateam@whereiwork.com
      process_users.job
      type=command
      command=python process_users.py
      failure.emails=datateam@whereiwork.com
    • get_users_widgets
      join_users_widgets.job
      type=command
      command=python join_users_widgets.py
      failure.emails=datateam@whereiwork.com
      dependencies=process_widgets,process_users
      export_to_db.job
      type=command
      command=python export_to_db.py
      failure.emails=datateam@whereiwork.com
      dependencies=join_users_widgets
    • get_users_widgets
      get_users_widgets.zip
      .job
      .job
      .job
      .job
      .py
      .py
      .py
      .py
    • You deploy and schedule a job flow as you would a single job.
    • Hierarchical configuration
      process_widgets.job
      type=command
      command=python process_widgets.py
      failure.emails=datateam@whereiwork.com
      This is silly. Can’t I specify failure.emails globally?
      process_users.job
      type=command
      command=python process_users.py
      failure.emails=datateam@whereiwork.com
    • azkaban-job-dir/
      system.properties
      get_users_widgets/
      process_widgets.job
      process_users.job
      join_users_widgets.job
      export_to_db.job
      some-other-job/

    • Hierarchical configuration
      system.properties
      failure.emails=datateam@whereiwork.com
      db.url=foo.whereiwork.com
      archive.dir=/var/whereiwork/archive
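      Azkaban picks up these shared properties for the jobs underneath (that is the hierarchical configuration above). If a command-type script also wants to read the same file directly, a tiny parser for the key=value format is enough; this helper is illustrative and not part of Azkaban.

      # Illustrative helper (not part of Azkaban): read key=value properties
      # from a file like system.properties so scripts can share settings too.
      def load_properties(path):
          props = {}
          with open(path) as f:
              for line in f:
                  line = line.strip()
                  if not line or line.startswith("#"):
                      continue
                  key, _, value = line.partition("=")
                  props[key.strip()] = value.strip()
          return props

      # e.g. load_properties("system.properties")["db.url"]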
    • What is type=command?
      Azkaban supports a few ways to execute jobs
      command
      Unix command in a separate process
      javaprocess
      Wrapper to kick off Java programs
      java
      Wrapper to kick off Runnable Java classes
      Can hook into Azkaban in useful ways
      Pig
      Wrapper to run Pig scripts through Grunt
    • What’s missing?
      Scheduling and executing multiple instances of the same job at the same time.
    • Diagram: FOO runs hourly; the 3:00 PM run takes longer than expected and is still going when the 4:00 PM run is due.
    • Diagram: FOO runs hourly; the 3:00 PM run fails and is restarted at 4:25 PM, so it overlaps the 4:00 PM run.
    • What’s missing?
      Scheduling and executing multiple instances of the same job at the same time.
      AZK-49, AZK-47
      Stay tuned for complete, reviewed patch branches: www.github.com/voberoi/azkaban
      Passing arguments between jobs.
      Write a library used by your jobs (see the sketch below)
      Put your arguments anywhere you want
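      A sketch of the “write a library used by your jobs” approach: a hypothetical helper module that stages key/value arguments in a shared JSON file, so an upstream job can hand values to its downstream jobs. The module and the staging path are made up for illustration.

      # Hypothetical helper for passing arguments between jobs: upstream jobs
      # write key/value pairs to a shared JSON file, downstream jobs read them.
      import json
      import os

      STAGING_FILE = "/var/whereiwork/job_args.json"  # made-up location

      def put_args(**kwargs):
          args = get_args()
          args.update(kwargs)
          with open(STAGING_FILE, "w") as f:
              json.dump(args, f)

      def get_args():
          if not os.path.exists(STAGING_FILE):
              return {}
          with open(STAGING_FILE) as f:
              return json.load(f)

      # Upstream job:   put_args(run_date="2010-08-01")
      # Downstream job: get_args()["run_date"]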
    • What did we get out of it?
      No more monolithic wrapper scripts
      Massively reduced job setup time
      It’s configuration, not code!
      More code reuse, less hair pulling
      Still porting over jobs
      It’s time consuming
    • Data Serialization
    • What’s the problem?
      Serializing data in simple formats is convenient
      CSV, XML, etc.
      Problems arise when data changes
      Needs backwards-compatibility
      Does this really matter? Let’s discuss.
    • v1
      clickabutton.com
      Username:
      Password:
      Go!
    • “Click a Button” Analytics PRD
      We want to know the number of unique users who clicked on the button.
      Over an arbitrary range of time.
      Broken down by whether they’re logged in or not.
      With hour granularity.
    • “I KNOW!”
      Every hour, process logs and dump lines that look like this to HDFS with Pig:
      unique_id,logged_in,clicked
    • “I KNOW!”
      -- 'clicked' and 'logged_in' are either 0 or 1
      data = LOAD '$IN' USING PigStorage(',') AS (
        unique_id:chararray,
        logged_in:int,
        clicked:int
      );
      -- Munge data according to the PRD

    • v2
      clickabutton.com
      Username:
      Password:
      Go!
    • “Click a Button” Analytics PRD
      Break users down by which button they clicked, too.
    • “I KNOW!”
      Every hour, process logs and dump lines that look like this to HDFS with Pig:
      unique_id,logged_in,red_click,green_click
    • “I KNOW!”
      -- 'red_clicked', 'green_clicked' and 'logged_in' are either 0 or 1
      data = LOAD '$IN' USING PigStorage(',') AS (
        unique_id:chararray,
        logged_in:int,
        red_clicked:int,
        green_clicked:int
      );
      -- Munge data according to the PRD

    • v3
      clickabutton.com
      Username:
      Password:
      Go!
    • “Hmm.”
    • Bad Solution 1
      Remove red_click
      unique_id,logged_in,red_click,green_click
      unique_id,logged_in,green_click
    • Why it’s bad
      Your script thinks green clicks are red clicks.
      data = LOAD '$IN' USING PigStorage(',') AS (
        unique_id:chararray,
        logged_in:int,
        red_clicked:int,
        green_clicked:int
      );
      -- Munge data according to the PRD

    • Why it’s bad
      Now your script won’t work for all the data you’ve collected so far.
      data = LOAD '$IN' USING PigStorage(',') AS (
        unique_id:chararray,
        logged_in:int,
        green_clicked:int
      );
      -- Munge data according to the PRD

    • “I’ll keep multiple scripts lying around”
    • data = LOAD '$IN' USING PigStorage(',') AS (
        unique_id:chararray,
        logged_in:int,
        green_clicked:int
      );
      My data has three fields. Which one do I use?
      data = LOAD '$IN' USING PigStorage(',') AS (
        unique_id:chararray,
        logged_in:int,
        orange_clicked:int
      );
    • Bad Solution 2
      Assign a sentinel value to red_click when it should be ignored, e.g. -1.
      unique_id,logged_in,red_click,green_click
    • Why it’s bad
      It’s a waste of space.
    • Why it’s bad
      Sticking logic in your data is iffy.
    • The Preferable Solution
      Serialize your data using backwards-compatible data structures!
      Protocol Buffers and Elephant Bird
    • Protocol Buffers
      Serialization system
      Similar systems: Avro, Thrift
      Compiles interfaces to language modules
      Construct a data structure
      Access it (in a backwards-compatible way)
      Ser/deser the data structure in a standard, compact, binary format
    • uniqueuser.proto
      message UniqueUser {
      optional string id = 1;
      optional int32 logged_in = 2;
      optional int32 red_clicked = 3;
      }
      protoc generates: .h/.cc, .java, .py
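      From Python, the module protoc generates from uniqueuser.proto (assumed importable as uniqueuser_pb2) is used roughly like this; a reader built against an older version of the message simply ignores fields it does not know about, and unset optional fields fall back to their defaults.

      # Minimal sketch using the Python module protoc generates from
      # uniqueuser.proto (assumed importable as uniqueuser_pb2).
      import uniqueuser_pb2

      # Construct and populate a record.
      user = uniqueuser_pb2.UniqueUser()
      user.id = "bak49jsn"
      user.logged_in = 0
      user.red_clicked = 1

      # Serialize to the compact binary format...
      blob = user.SerializeToString()

      # ...and deserialize it elsewhere.
      same_user = uniqueuser_pb2.UniqueUser()
      same_user.ParseFromString(blob)
      print(same_user.id, same_user.logged_in, same_user.red_clicked)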
    • Elephant Bird
      Generate protobuf-based Pig load/store functions + lots more
      Developed at Twitter
      Blog post
      http://engineering.twitter.com/2010/04/hadoop-at-twitter.html
      Available at:
      http://www.github.com/kevinweil/elephant-bird
    • uniqueuser.proto
      message UniqueUser {
      optional string id = 1;
      optional int32 logged_in = 2;
      optional int32 red_clicked = 3;
      }
      *.pig.load.UniqueUserLzoProtobufB64LinePigLoader
      *.pig.store.UniqueUserLzoProtobufB64LinePigStorage
    • LzoProtobufB64?
    • LzoProtobufB64 serialization
      (bak49jsn, 0, 1) → Protobuf binary blob → Base64-encoded Protobuf binary blob → LZO-compressed, Base64-encoded Protobuf binary blob
    • LzoProtobufB64 deserialization (the reverse; see the sketch below)
      LZO-compressed, Base64-encoded Protobuf binary blob → Base64-encoded Protobuf binary blob → Protobuf binary blob → (bak49jsn, 0, 1)
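      The ProtobufB64Line part boils down to one serialized message per line, base64-encoded so the binary blob survives line-oriented storage; the LZO stage is handled by Hadoop's LZO codec. A rough Python sketch of the two middle steps, again assuming the generated uniqueuser_pb2 module:

      # Rough sketch of the ProtobufB64Line encoding: one base64-encoded
      # protobuf per line. The LZO compression stage is omitted here.
      import base64
      import uniqueuser_pb2  # assumed protoc-generated module

      def encode_line(user):
          """UniqueUser message -> base64 line."""
          return base64.b64encode(user.SerializeToString())

      def decode_line(line):
          """base64 line -> UniqueUser message."""
          user = uniqueuser_pb2.UniqueUser()
          user.ParseFromString(base64.b64decode(line))
          return user

      if __name__ == "__main__":
          u = uniqueuser_pb2.UniqueUser(id="bak49jsn", logged_in=0, red_clicked=1)
          print(decode_line(encode_line(u)).id)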
    • Setting it up
      Prereqs
      Protocol Buffers 2.3+
      LZO codec for Hadoop
      Check out docs
      http://www.github.com/kevinweil/elephant-bird
    • Time to revisit
    • v1
      clickabutton.com
      Username:
      Password:
      Go!
    • Every hour, process logs and dump lines to HDFS that use this protobuf interface:
      uniqueuser.proto
      message UniqueUser {
      optional string id = 1;
      optional int32 logged_in = 2;
      optional int32 red_clicked = 3;
      }
    • -- 'red_clicked' and 'logged_in' are either 0 or 1
      data = LOAD '$IN' USING myudfs.pig.load.UniqueUserLzoProtobufB64LinePigLoader AS (
        unique_id:chararray,
        logged_in:int,
        red_clicked:int
      );
      -- Munge data according to the PRD

    • v2
      clickabutton.com
      Username:
      Password:
      Go!
    • Every hour, process logs and dump lines to HDFS that use this protobuf interface:
      uniqueuser.proto
      message UniqueUser {
      optional string id = 1;
      optional int32 logged_in = 2;
      optional int32 red_clicked = 3;
      optional int32 green_clicked = 4;
      }
      -- 'logged_in' and the *_clicked fields are either 0 or 1
      data = LOAD '$IN' USING myudfs.pig.load.UniqueUserLzoProtobufB64LinePigLoader AS (
        unique_id:chararray,
        logged_in:int,
        red_clicked:int,
        green_clicked:int
      );
      -- Munge data according to the PRD

    • v3
      clickabutton.com
      Username:
      Password:
      Go!
    • No need to change your scripts.
      They’ll work on old and new data!
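      Why it works, in one small sketch: a record written before green_clicked existed still parses under the newer schema, with the new optional field simply unset (and, conversely, a proto2 reader built against the old schema parses newer records and ignores the field it does not know). Assumes the generated uniqueuser_pb2 module.

      # Sketch of backwards compatibility. The "old" blob approximates data
      # written by the v1 pipeline, i.e. without green_clicked set.
      import uniqueuser_pb2  # assumed protoc-generated module

      old_blob = uniqueuser_pb2.UniqueUser(
          id="bak49jsn", logged_in=1, red_clicked=1).SerializeToString()

      # A v2-era script parses it without any special-casing.
      user = uniqueuser_pb2.UniqueUser()
      user.ParseFromString(old_blob)
      print(user.HasField("green_clicked"))  # False: the new field is unset
      print(user.green_clicked)              # 0: proto2 default for optional int32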
    • Bonus!
      http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter
    • Conclusion
      Workflow management
      Use Azkaban, Oozie, or another framework.
      Don’t use shell scripts and cron.
      Do this from day one! Transitioning is expensive.
      Data serialization
      Use Protocol Buffers, Avro, Thrift, or something else!
      Do this from day one before it bites you.
    • Questions?
      voberoi@gmail.com
      www.vikramoberoi.com
      @voberoi on Twitter
      We’re hiring!