Hadoop at Meebo: Lessons in the Real World
Hadoop at Meebo: Lessons in the Real World Presentation Transcript

  • 1. Hadoop at Meebo: Lessons learned in the real world
    Vikram Oberoi
    August, 2010
    Hadoop Day, Seattle
  • 2. About me
    SDE Intern at Amazon, ’07
    R&D on item-to-item similarities
    Data Engineer Intern at Meebo, ’08
    Built an A/B testing system
    CS at Stanford, ’09
    Senior project: Ext3 and XFS under Hadoop MapReduce workloads
    Data Engineer at Meebo, ’09—present
    Data infrastructure, analytics
  • 3. About Meebo
    Products
    Browser-based IM client (www.meebo.com)
    Mobile chat clients
    Social widgets (the Meebo Bar)
    Company
    Founded 2005
    Over 100 employees, 30 engineers
    Engineering
    Strong engineering culture
    Contributions to CouchDB, Lounge, Hadoop components
  • 4. The Problem
    Hadoop is powerful technology
    Meets today’s demand for big data
    But it’s still a young platform
    Evolving components and best practices
    With many challenges in real-world usage
    Day-to-day operational headaches
    Missing ecosystem features (e.g., recurring jobs)
    Lots of re-inventing the wheel to solve these
  • 5. Purpose of this talk
    Discuss some real problems we’ve seen
    Explain our solutions
    Propose best practices so you can avoid these problems
  • 6. What will I talk about?
    Background:
    Meebo’s data processing needs
    Meebo’s pre- and post-Hadoop data pipelines
    Lessons:
    Better workflow management
    Scheduling, reporting, monitoring, etc.
    A look at Azkaban
    Get wiser about data serialization
    Protocol Buffers (or Avro, or Thrift)
  • 7. Meebo’s Data Processing Needs
  • 8. What do we use Hadoop for?
    ETL
    Analytics
    Behavioral targeting
    Ad hoc data analysis, research
    Data produced helps power:
    internal/external dashboards
    our ad server
  • 9. What kind of data do we have?
    Log data from all our products
    The Meebo Bar
    Meebo Messenger (www.meebo.com)
    Android/iPhone/Mobile Web clients
    Rooms
    Meebo Me
    Meebonotifier
    Firefox extension
  • 10. How much data?
    150MM uniques/month from the Meebo Bar
    Around 200 GB of uncompressed daily logs
    We process a subset of our logs
  • 11. Meebo’s Data Pipeline
    Pre and Post Hadoop
  • 12. A data pipeline in general
    1. Data Collection
    2. Data Processing
    3. Data Storage
    4. Workflow Management
  • 13. Our data pipeline, pre-Hadoop
    Servers
    Python/shell scripts pull log data
    Python/shell scripts process data
    MySQL, CouchDB, flat files
    Cron, wrapper shell scripts glue everything together
  • 14. Our data pipeline, post-Hadoop
    Servers
    Push logs to HDFS
    Pig scripts process data
    MySQL, CouchDB, flat files
    Azkaban, a workflow management system, glues everything together
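    As a rough illustration of the "push logs to HDFS" step above, here is a minimal sketch (not Meebo's actual script, and the paths are hypothetical) that ships an hourly log file into HDFS with the standard hadoop fs -put command:
    import subprocess

    # Hypothetical locations: one hourly log file, one dated HDFS directory.
    local_log = "/var/log/meebobar/2010-08-01-13.log"
    hdfs_dir = "/logs/meebobar/2010/08/01/"

    # Copy the local file into HDFS; raises CalledProcessError if the copy fails.
    subprocess.check_call(["hadoop", "fs", "-put", local_log, hdfs_dir])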
  • 15. Our transition to using Hadoop
    Deployed early ’09
    Motivation: processing data took aaaages!
    Catalyst: Hadoop Summit
    Turbulent, time consuming
    New tools, new paradigms, pitfalls
    Totally worth it
    24 hours to process a day’s logs → under an hour
    Leap in ability to analyze our data
    Basis for new core product features
  • 16. Workflow Management
  • 17. What is workflow management?
  • 18. What is workflow management?
    It’s the glue that binds your data pipeline together: scheduling, monitoring, reporting etc.
    Most people use scripts and cron
    But they end up spending too much time managing those scripts
    We need a better way
  • 19. A workflow management system:
    Executes jobs with arbitrarily complex dependency chains
  • 20. Split up your jobs into discrete chunks with dependencies
    Minimize impact when chunks fail
  • 21. Allow engineers to work on chunks separately
  • 22. Monolithic scripts are no fun
    Clean up data from log A
    Process data from log B
    Join data, train a classifier
    Post-processing
    Archive output
    Export to DB somewhere
  • 23. A workflow management system:
    Executes jobs with arbitrarily complex dependency chains
    Schedules recurring jobs to run at a given time
  • 24. A workflow management system:
    Executes jobs with arbitrarily complex dependency chains
    Schedules recurring jobs to run at a given time
    Monitors job progress
  • 25. A workflow management system:
    Executes jobs with arbitrarily complex dependency chains
    Schedules recurring jobs to run at a given time
    Monitors job progress
    Reports when jobs fail and how long they take
  • 26. A workflow management system:
    Executes jobs with arbitrarily complex dependency chains
    Schedules recurring jobs to run at a given time
    Monitors job progress
    Reports when jobs fail and how long they take
    Logs job execution and exposes logs so that engineers can deal with failures swiftly
  • 27. A workflow management system:
    Executes jobs with arbitrarily complex dependency chains
    Schedules recurring jobs to run at a given time
    Monitors job progress
    Reports when jobs fail and how long they take
    Logs job execution and exposes logs so that engineers can deal with failures swiftly
    Provides resource management capabilities
  • 28. Don’t DoS yourself
    [Diagram: five jobs all running “Export to DB somewhere” against the same database at once]
  • 29. [Diagram: the same five export jobs now request permits from a Permit Manager before touching the database, so only a limited number hit it at once; a sketch of the idea follows]
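    The permit manager idea can be sketched in a few lines. The following is a hypothetical illustration (not Azkaban's or Meebo's actual implementation, and the paths and numbers are made up): each export job grabs one of a fixed number of lock files before touching the database, so at most MAX_PERMITS exports run concurrently.
    # permits.py: hypothetical sketch of a permit manager for DB exports.
    import fcntl
    import os
    import time

    PERMIT_DIR = "/var/whereiwork/permits/db_somewhere"  # hypothetical path
    MAX_PERMITS = 2  # allow at most 2 concurrent exports against this DB

    def acquire_permit(timeout=3600, poll=30):
        """Block until one of MAX_PERMITS lock files can be locked, or time out."""
        if not os.path.isdir(PERMIT_DIR):
            os.makedirs(PERMIT_DIR)
        deadline = time.time() + timeout
        while time.time() < deadline:
            for i in range(MAX_PERMITS):
                fd = os.open(os.path.join(PERMIT_DIR, "permit-%d" % i),
                             os.O_CREAT | os.O_RDWR)
                try:
                    fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                    return fd  # holding this descriptor holds the permit
                except (IOError, OSError):
                    os.close(fd)  # this permit is taken; try the next one
            time.sleep(poll)
        raise RuntimeError("timed out waiting for a DB permit")

    def release_permit(fd):
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
    An export job would wrap its database writes in acquire_permit() and release_permit().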
  • 30. Don’t roll your own scheduler!
    Building a good scheduling framework is hard
    Myriad of small requirements, precise bookkeeping with many edge cases
    Many roll their own
    It’s usually inadequate
    So much repeated effort!
    Mold an existing framework to your requirements and contribute
  • 31. Two emerging frameworks
    Oozie
    Built at Yahoo
    Open-sourced at Hadoop Summit ’10
    Used in production for [don’t know]
    Packaged by Cloudera
    Azkaban
    Built at LinkedIn
    Open-sourced in March ‘10
    Used in production for over nine months as of March ’10
    Now in use at Meebo
  • 32. Azkaban
  • 33.
  • 34.
  • 35.
  • 36. Azkaban jobs are bundles of configuration and code
  • 37. Configuring a job
    process_log_data.job
    type=command
    command=python process_logs.py
    failure.emails=datateam@whereiwork.com
    process_logs.py
    import os
    import sys
    # Do useful things

  • 38. Deploying a job
    Step 1: Shove your config and code into a zip archive.
    process_log_data.zip (the .job file plus the .py script)
  • 39. Deploying a job
    Step 2: Upload to Azkaban
    process_log_data.zip (the .job file plus the .py script)
  • 40. Scheduling a job
    The Azkaban front-end:
  • 41. What about dependencies?
  • 42. get_users_widgets
    process_widgets.job
    process_users.job
    join_users_widgets.job
    export_to_db.job
  • 43. get_users_widgets
    process_widgets.job
    type=command
    command=python process_widgets.py
    failure.emails=datateam@whereiwork.com
    process_users.job
    type=command
    command=python process_users.py
    failure.emails=datateam@whereiwork.com
  • 44. get_users_widgets
    join_users_widgets.job
    type=command
    command=python join_users_widgets.py
    failure.emails=datateam@whereiwork.com
    dependencies=process_widgets,process_users
    export_to_db.job
    type=command
    command=python export_to_db.py
    failure.emails=datateam@whereiwork.com
    dependencies=join_users_widgets
  • 45. get_users_widgets
    get_users_widgets.zip (the four .job files plus their four .py scripts)
  • 46. You deploy and schedule a job flow as you would a single job.
  • 47.
  • 48. Hierarchical configuration
    process_widgets.job
    type=command
    command=python process_widgets.py
    failure.emails=datateam@whereiwork.com
    This is silly. Can’t I specify failure.emails globally?
    process_users.job
    type=command
    command=python process_users.py
    failure.emails=datateam@whereiwork.com
  • 49. azkaban-job-dir/
    system.properties
    get_users_widgets/
        process_widgets.job
        process_users.job
        join_users_widgets.job
        export_to_db.job
    some-other-job/

  • 50. Hierarchical configuration
    system.properties
    failure.emails=datateam@whereiwork.com
    db.url=foo.whereiwork.com
    archive.dir=/var/whereiwork/archive
  • 51. What is type=command?
    Azkaban supports a few ways to execute jobs:
    command: a Unix command run in a separate process
    javaprocess: a wrapper to kick off Java programs
    java: a wrapper to kick off Runnable Java classes; these can hook into Azkaban in useful ways
    pig: a wrapper to run Pig scripts through Grunt
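    For instance, a job that runs a Pig script can simply use the command type rather than the pig type; a sketch (the file names are hypothetical):
    run_widget_report.job
    type=command
    command=pig -f widget_report.pig
    failure.emails=datateam@whereiwork.com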
  • 52. What’s missing?
    Scheduling and executing multiple instances of the same job at the same time.
  • 53. [Timeline: job FOO runs hourly; the 3:00 PM run kicks off on schedule]
  • 54. [Timeline: the 3:00 PM run took longer than expected, so the 4:00 PM run of FOO has to start while it is still running]
  • 55. [Timeline: job FOO runs hourly; the 3:00 PM run kicks off on schedule]
  • 56. [Timeline: the 3:00 PM run failed and was restarted at 4:25 PM, so it overlaps the 4:00 PM and 5:00 PM runs of FOO]
  • 57. What’s missing?
    Scheduling and executing multiple instances of the same job at the same time.
    AZK-49, AZK-47
    Stay tuned for complete, reviewed patch branches: www.github.com/voberoi/azkaban
  • 58. What’s missing?
    Scheduling and executing multiple instances of the same job at the same time.
    AZK-49, AZK-47
    Stay tuned for complete, reviewed patch branches: www.github.com/voberoi/azkaban
    Passing arguments between jobs.
    Write a library used by your jobs
    Put your arguments anywhere you want
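    One way to do that, purely as a sketch (this is not something Azkaban provides), is a tiny module shared by every job in a flow, say jobargs.py: upstream jobs record values in a JSON file at a known location, and downstream jobs read them back.
    # jobargs.py: hypothetical helper shared by all jobs in a flow.
    import json
    import os

    ARGS_DIR = "/var/whereiwork/jobargs"  # hypothetical shared location

    def put(flow, key, value):
        """Record an argument produced by an upstream job."""
        if not os.path.isdir(ARGS_DIR):
            os.makedirs(ARGS_DIR)
        path = os.path.join(ARGS_DIR, "%s.json" % flow)
        args = {}
        if os.path.exists(path):
            with open(path) as f:
                args = json.load(f)
        args[key] = value
        with open(path, "w") as f:
            json.dump(args, f)

    def get(flow, key):
        """Read back an argument written by an upstream job."""
        with open(os.path.join(ARGS_DIR, "%s.json" % flow)) as f:
            return json.load(f)[key]
    For example, process_widgets.py could call jobargs.put("get_users_widgets", "widgets_path", ...) and join_users_widgets.py could later call jobargs.get("get_users_widgets", "widgets_path").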
  • 59. What did we get out of it?
    No more monolithic wrapper scripts
    Massively reduced job setup time
    It’s configuration, not code!
    More code reuse, less hair pulling
    Still porting over jobs
    It’s time consuming
  • 60. Data Serialization
  • 61. What’s the problem?
    Serializing data in simple formats is convenient
    CSV, XML etc.
    Problems arise when the data changes
    and needs to stay backwards-compatible
    Does this really matter? Let’s discuss.
  • 62. v1
    [Mockup of the clickabutton.com page: username/password fields and a Go! button]
  • 63. “Click a Button” Analytics PRD
    We want to know the number of unique users who clicked on the button.
    Over an arbitrary range of time.
    Broken down by whether they’re logged in or not.
    With hourly granularity.
  • 64. “I KNOW!”
    Every hour, process logs and dump lines that look like this to HDFS with Pig:
    unique_id,logged_in,clicked
  • 65. “I KNOW!”
    -- 'clicked' and 'logged_in' are either 0 or 1
    data = LOAD '$IN' USING PigStorage(',') AS (
        unique_id:chararray,
        logged_in:int,
        clicked:int
    );
    -- Munge data according to the PRD

  • 66. v2
    [Mockup of the clickabutton.com page: username/password fields and a Go! button]
  • 67. “Click a Button” Analytics PRD
    Break users down by which button they clicked, too.
  • 68. “I KNOW!”
    Every hour, process logs and dump lines that look like this to HDFS with Pig:
    unique_id,logged_in,red_click,green_click
  • 69. “I KNOW!”
    -- 'logged_in' and the click fields are either 0 or 1
    data = LOAD '$IN' USING PigStorage(',') AS (
        unique_id:chararray,
        logged_in:int,
        red_clicked:int,
        green_clicked:int
    );
    -- Munge data according to the PRD

  • 70. v3
    [Mockup of the clickabutton.com page: username/password fields and a Go! button]
  • 71. “Hmm.”
  • 72. Bad Solution 1
    Remove red_click
    unique_id,logged_in,red_click,green_click
    unique_id,logged_in,green_click
  • 73. Why it’s bad
    Your script thinks green clicks are red clicks.
    data = LOAD '$IN' USING PigStorage(',') AS (
        unique_id:chararray,
        logged_in:int,
        red_clicked:int,
        green_clicked:int
    );
    -- Munge data according to the PRD

  • 74. Why it’s bad
    Now your script won’t work for all the data you’ve collected so far.
    data = LOAD '$IN' USING PigStorage(',') AS (
        unique_id:chararray,
        logged_in:int,
        green_clicked:int
    );
    -- Munge data according to the PRD

  • 75. “I’ll keep multiple scripts lying around”
  • 76. data = LOAD '$IN' USING PigStorage(',') AS (
        unique_id:chararray,
        logged_in:int,
        green_clicked:int
    );
    My data has three fields. Which one do I use?
    data = LOAD '$IN' USING PigStorage(',') AS (
        unique_id:chararray,
        logged_in:int,
        orange_clicked:int
    );
  • 77. Bad Solution 2
    Assign a sentinel value to red_click when it should be ignored, e.g., -1.
    unique_id,logged_in,red_click,green_click
  • 78. Why it’s bad
    It’s a waste of space.
  • 79. Why it’s bad
    Sticking logic in your data is iffy.
  • 80. The Preferable Solution
    Serialize your data using backwards-compatible data structures!
    Protocol Buffers and Elephant Bird
  • 81. Protocol Buffers
    A serialization system (alternatives: Avro, Thrift)
    Compiles interfaces to language modules
    Construct a data structure
    Access it (in a backwards-compatible way)
    Ser/deser the data structure in a standard, compact, binary format
  • 82. uniqueuser.proto
    message UniqueUser {
    optional string id = 1;
    optional int32 logged_in = 2;
    optional int32 red_clicked = 3;
    }
    protoc compiles this into .h/.cc, .java, and .py modules
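    For reference, using the Python module that protoc generates from uniqueuser.proto (protoc --python_out produces uniqueuser_pb2.py) looks roughly like this; the values are made up:
    import uniqueuser_pb2

    # Construct and populate the data structure.
    user = uniqueuser_pb2.UniqueUser()
    user.id = "bak49jsn"
    user.logged_in = 0
    user.red_clicked = 1

    # Serialize to the compact binary format...
    blob = user.SerializeToString()

    # ...and deserialize it back.
    same_user = uniqueuser_pb2.UniqueUser()
    same_user.ParseFromString(blob)
    assert same_user.red_clicked == 1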
  • 83. Elephant Bird
    Generate protobuf-based Pig load/store functions + lots more
    Developed at Twitter
    Blog post
    http://engineering.twitter.com/2010/04/hadoop-at-twitter.html
    Available at:
    http://www.github.com/kevinweil/elephant-bird
  • 84. uniqueuser.proto
    message UniqueUser {
    optional string id = 1;
    optional int32 logged_in = 2;
    optional int32 red_clicked = 3;
    }
    *.pig.load.UniqueUserLzoProtobufB64LinePigLoader
    *.pig.store.UniqueUserLzoProtobufB64LinePigStorage
  • 85. LzoProtobufB64?
  • 86. LzoProtobufB64 Serialization
    (bak49jsn, 0, 1) → Protobuf binary blob → Base64-encoded Protobuf binary blob → LZO-compressed, Base64-encoded Protobuf binary blob
  • 87. LzoProtobufB64 Deserialization
    LZO-compressed, Base64-encoded Protobuf binary blob → Base64-encoded Protobuf binary blob → Protobuf binary blob → (bak49jsn, 0, 1)
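    In code, the per-record half of that chain is just "serialize, then Base64-encode". A sketch follows, assuming (as Elephant Bird's writers do) that the LZO compression is applied to the whole file by Hadoop's LZO codec rather than to each line:
    import base64
    import uniqueuser_pb2

    user = uniqueuser_pb2.UniqueUser(id="bak49jsn", logged_in=0, red_clicked=1)

    # One log line: a Base64-encoded protobuf binary blob.
    line = base64.b64encode(user.SerializeToString())

    # Going the other way: decode the Base64, then parse the protobuf.
    parsed = uniqueuser_pb2.UniqueUser()
    parsed.ParseFromString(base64.b64decode(line))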
  • 88. Setting it up
    Prereqs
    Protocol Buffers 2.3+
    LZO codec for Hadoop
    Check out docs
    http://www.github.com/kevinweil/elephant-bird
  • 89. Time to revisit
  • 90. v1
    [Mockup of the clickabutton.com page: username/password fields and a Go! button]
  • 91. Every hour, process logs and dump lines to HDFS that use this protobuf interface:
    uniqueuser.proto
    message UniqueUser {
    optional string id = 1;
    optional int32 logged_in = 2;
    optional int32 red_clicked = 3;
    }
  • 92. -- 'red_clicked' and 'logged_in' are either 0 or 1
    data = LOAD '$IN' USING myudfs.pig.load.UniqueUserLzoProtobufB64LinePigLoader() AS (
        unique_id:chararray,
        logged_in:int,
        red_clicked:int
    );
    -- Munge data according to the PRD

  • 93. v2
    [Mockup of the clickabutton.com page: username/password fields and a Go! button]
  • 94. Every hour, process logs and dump lines to HDFS that use this protobuf interface:
    uniqueuser.proto
    message UniqueUser {
    optional string id = 1;
    optional int32 logged_in = 2;
    optional int32 red_clicked = 3;
    optional int32 green_clicked = 4;
    }
  • 95. -- 'logged_in' and the click fields are either 0 or 1
    data = LOAD '$IN' USING myudfs.pig.load.UniqueUserLzoProtobufB64LinePigLoader() AS (
        unique_id:chararray,
        logged_in:int,
        red_clicked:int,
        green_clicked:int
    );
    -- Munge data according to the PRD

  • 96. v3
    [Mockup of the clickabutton.com page: username/password fields and a Go! button]
  • 97. No need to change your scripts.
    They’ll work on old and new data!
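    A quick way to see why, sketched with the generated Python module: a record written before green_clicked existed still parses cleanly under the new definition, and the missing field simply reads back as unset.
    import uniqueuser_pb2

    # Simulate a v1-era record: green_clicked was never set, so it is
    # simply absent from the serialized bytes.
    old_blob = uniqueuser_pb2.UniqueUser(
        id="bak49jsn", logged_in=1, red_clicked=1).SerializeToString()

    user = uniqueuser_pb2.UniqueUser()
    user.ParseFromString(old_blob)
    print(user.HasField("green_clicked"))  # False: not present in the old data
    print(user.green_clicked)              # 0: the default for an int32 field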
  • 98. Bonus!
    http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter
  • 99. Conclusion
    Workflow management
    Use Azkaban, Oozie, or another framework.
    Don’t use shell scripts and cron.
    Do this from day one! Transitioning later is expensive.
    Data serialization
    Use Protocol Buffers, Avro, Thrift, or something else!
    Do this from day one before it bites you.
  • 100. Questions?
    voberoi@gmail.com
    www.vikramoberoi.com
    @voberoi on Twitter
    We’re hiring!