1. Hadoop at Meebo: Lessons learned in the real world. Vikram Oberoi, August 2010, Hadoop Day, Seattle
2. About me SDE Intern at Amazon, ’07 R&D on item-to-item similarities Data Engineer Intern at Meebo, ’08 Built an A/B testing system CS at Stanford, ’09 Senior project: Ext3 and XFS under Hadoop MapReduce workloads Data Engineer at Meebo, ’09—present Data infrastructure, analytics
3. About Meebo Products Browser-based IM client (www.meebo.com) Mobile chat clients Social widgets (the Meebo Bar) Company Founded 2005 Over 100 employees, 30 engineers Engineering Strong engineering culture Contributions to CouchDB, Lounge, Hadoop components
4. The Problem Hadoop is powerful technology Meets today’s demand for big data But it’s still a young platform Evolving components and best practices With many challenges in real-world usage Day-to-day operational headaches Missing ecosystem features (e.g. recurring jobs?) Lots of re-inventing the wheel to solve these
5. Purpose of this talk Discuss some real problems we’ve seen Explain our solutions Propose best practices so you can avoid these problems
6. What will I talk about? Background: Meebo’s data processing needs Meebo’s pre- and post-Hadoop data pipelines Lessons: Better workflow management Scheduling, reporting, monitoring, etc. A look at Azkaban Get wiser about data serialization Protocol Buffers (or Avro, or Thrift)
8. What do we use Hadoop for? ETL Analytics Behavioral targeting Ad hoc data analysis, research Data produced helps power internal/external dashboards and our ad server
9. What kind of data do we have? Log data from all our products The Meebo Bar Meebo Messenger (www.meebo.com) Android/iPhone/Mobile Web clients Rooms Meebo Me Meebo Notifier Firefox extension
10. How much data? 150MM uniques/month from the Meebo Bar Around 200 GB of uncompressed daily logs We process a subset of our logs
12. A data pipeline in general 1. Data Collection 2. Data Processing 3. Data Storage 4. Workflow Management
13. Our data pipeline, pre-Hadoop Servers Python/shell scripts pull log data Python/shell scripts process data MySQL, CouchDB, flat files Cron, wrapper shell scripts glue everything together
14. Our data pipeline, post-Hadoop Servers push logs to HDFS Pig scripts process data MySQL, CouchDB, flat files Azkaban, a workflow management system, glues everything together
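The collection step here is just a copy into HDFS. One common way to do it, shown only as a sketch (the paths are illustrative and not necessarily Meebo's actual layout):

hadoop fs -put /var/log/meebo_bar/2010-08-01.03.log /logs/meebo_bar/2010-08-01/
hadoop fs -ls /logs/meebo_bar/2010-08-01/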
15. Our transition to using Hadoop Deployed early ’09 Motivation: processing data took aaaages! Catalyst: Hadoop Summit Turbulent, time consuming New tools, new paradigms, pitfalls Totally worth it From 24 hours to process a day’s logs to under an hour Leap in ability to analyze our data Basis for new core product features
18. What is workflow management? It’s the glue that binds your data pipeline together: scheduling, monitoring, reporting, etc. Most people use scripts and cron, but end up spending too much time managing them. We need a better way.
23-27. Workflow management consists of: executing jobs with arbitrarily complex dependency chains; scheduling recurring jobs to run at a given time; monitoring job progress; reporting when jobs fail and how long they take; logging job execution and exposing those logs so that engineers can deal with failures swiftly; and providing resource management capabilities.
28-29. [Diagram: several jobs all exporting to the same DB at once. Don’t DoS yourself: a permit manager limits how many export jobs may hit the DB concurrently.]
30. Don’t roll your own scheduler! Building a good scheduling framework is hard Myriad of small requirements, precise bookkeeping with many edge cases Many roll their own It’s usually inadequate So much repeated effort! Mold an existing framework to your requirements and contribute
31. Two emerging frameworks Oozie Built at Yahoo Open-sourced at Hadoop Summit ’10 Used in production for [don’t know] Packaged by Cloudera Azkaban Built at LinkedIn Open-sourced in March ‘10 Used in production for over nine months as of March ’10 Now in use at Meebo
51. What is type=command? Azkaban supports a few ways to execute jobs command Unix command in a separate process javaprocess Wrapper to kick off Java programs java Wrapper to kick off Runnable Java classes Can hook into Azkaban in useful ways Pig Wrapper to run Pig scripts through Grunt
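An Azkaban job is just a small properties file. A minimal sketch of two type=command jobs chained by a dependency; the file names, commands, and Pig script are made up for illustration:

# process-bar-logs.job
type=command
command=pig -param IN=/logs/meebo_bar/latest process_bar_logs.pig

# export-to-db.job (runs only after process-bar-logs succeeds)
type=command
command=python export_to_db.py
dependencies=process-bar-logs

This is the sense in which job setup becomes configuration, not code: the dependency chain lives in these files rather than in a wrapper script.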
57-58. What’s missing? Scheduling and executing multiple jobs at the same time (AZK-49, AZK-47); stay tuned for complete, reviewed patch branches: www.github.com/voberoi/azkaban. Passing arguments between jobs: write a library used by your jobs and put your arguments anywhere you want.
59. What did we get out of it? No more monolithic wrapper scripts Massively reduced job setup time It’s configuration, not code! More code reuse, less hair pulling Still porting over jobs It’s time consuming
61. What’s the problem? Serializing data in simple formats (CSV, XML, etc.) is convenient. Problems arise when the data changes and needs to stay backwards-compatible. Does this really matter? Let’s discuss.
63. “Click a Button” Analytics PRD We want to know the number of unique users who clicked on the button. Over an arbitrary range of time. Broken down by whether they’re logged in or not. With hour granularity.
64. “I KNOW!” Every hour, process logs and dump lines that look like this to HDFS with Pig: unique_id,logged_in,clicked
65. “I KNOW!”
-- 'clicked' and 'logged_in' are either 0 or 1
raw = LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    clicked:int
);
-- Munge data according to the PRD …
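The munge step is elided on the slide. A minimal sketch of what the PRD asks for, counting unique users who clicked, split by logged_in (the hourly run itself supplies the hour granularity; relation names here are illustrative, not from the deck):

clicks   = FILTER raw BY clicked == 1;
by_login = GROUP clicks BY logged_in;
uniques  = FOREACH by_login {
    ids = DISTINCT clicks.unique_id;
    GENERATE group AS logged_in, COUNT(ids) AS unique_users;
};
STORE uniques INTO '$OUT';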
67. “Click a Button” Analytics PRD Break users down by which button they clicked, too.
68. “I KNOW!” Every hour, process logs and dump lines that look like this to HDFS with Pig: unique_id,logged_in,red_click,green_click
69. “I KNOW!”
-- 'logged_in', 'red_clicked', and 'green_clicked' are either 0 or 1
raw = LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    red_clicked:int,
    green_clicked:int
);
-- Munge data according to the PRD …
72. Bad Solution 1: Remove red_click. Before: unique_id,logged_in,red_click,green_click After: unique_id,logged_in,green_click
73. Why it’s bad: your existing script thinks green clicks are red clicks.
raw = LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    red_clicked:int,
    green_clicked:int
);
-- Munge data according to the PRD …
74. Why it’s bad: now your script won’t work for all the data you’ve collected so far.
raw = LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    green_clicked:int
);
-- Munge data according to the PRD …
76.
raw = LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    green_clicked:int
);
My data has three fields. Which one do I use?
raw = LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    orange_clicked:int
);
77. Bad Solution 2: Assign a sentinel to red_click when it should be ignored, e.g. -1. unique_id,logged_in,red_click,green_click
79. Why it’s bad Sticking logic in your data is iffy.
80. The Preferable Solution Serialize your data using backwards-compatible data structures! Protocol Buffers and Elephant Bird
81. Protocol Buffers Serialization system (alternatives: Avro, Thrift) Compiles interfaces to language modules Construct a data structure Access it (in a backwards-compatible way) Ser/deser the data structure in a standard, compact, binary format
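As a concrete example of compiling an interface to a language module: the uniqueuser.proto shown a few slides later would be compiled with protoc, here targeting Java (the choice of target is illustrative):

protoc --java_out=. uniqueuser.proto

The generated UniqueUser class carries typed accessors plus the standard serialize/parse methods, which is what the Pig loaders build on.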
83. Elephant Bird Generate protobuf-based Pig load/store functions + lots more Developed at Twitter Blog post http://engineering.twitter.com/2010/04/hadoop-at-twitter.html Available at: http://www.github.com/kevinweil/elephant-bird
91. Every hour, process logs and dump lines to HDFS that use this protobuf interface:
uniqueuser.proto
message UniqueUser {
  optional string id = 1;
  optional int32 logged_in = 2;
  optional int32 red_clicked = 3;
}
92.
-- 'logged_in' and 'red_clicked' are either 0 or 1
raw = LOAD '$IN' USING myudfs.pig.load.UniqueUserLzoProtobufB64LinePigLoader() AS (
    unique_id:chararray,
    logged_in:int,
    red_clicked:int
);
-- Munge data according to the PRD …
94. Every hour, process logs and dump lines to HDFS that use this protobuf interface:
uniqueuser.proto
message UniqueUser {
  optional string id = 1;
  optional int32 logged_in = 2;
  optional int32 red_clicked = 3;
  optional int32 green_clicked = 4;
}
95.
-- 'logged_in', 'red_clicked', and 'green_clicked' are either 0 or 1
raw = LOAD '$IN' USING myudfs.pig.load.UniqueUserLzoProtobufB64LinePigLoader() AS (
    unique_id:chararray,
    logged_in:int,
    red_clicked:int,
    green_clicked:int
);
-- Munge data according to the PRD …
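One practical wrinkle: records written before green_clicked existed may come back with that field null or set to the protobuf default, depending on the loader; that behavior is an assumption here, not something the deck states. A cheap defensive guard in Pig:

safe = FOREACH raw GENERATE
    unique_id,
    logged_in,
    red_clicked,
    (green_clicked IS NULL ? 0 : green_clicked) AS green_clicked;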
99. Conclusion Workflow management: use Azkaban, Oozie, or another framework, not shell scripts and cron. Do this from day one! Transitioning later is expensive. Data serialization: use Protocol Buffers, Avro, Thrift, or something else. Do this from day one, before it bites you.