Hadoop at Meebo: Lessons in the Real World

  1. Hadoop at Meebo: Lessons Learned in the Real World
     Vikram Oberoi
     August 2010
     Hadoop Day, Seattle
  2. About me
     - SDE Intern at Amazon, ’07: R&D on item-to-item similarities
     - Data Engineer Intern at Meebo, ’08: built an A/B testing system
     - CS at Stanford, ’09: senior project on Ext3 and XFS under Hadoop MapReduce workloads
     - Data Engineer at Meebo, ’09–present: data infrastructure, analytics
  3. About Meebo
     Products
     - Browser-based IM client (www.meebo.com)
     - Mobile chat clients
     - Social widgets (the Meebo Bar)
     Company
     - Founded 2005
     - Over 100 employees, 30 engineers
     Engineering
     - Strong engineering culture
     - Contributions to CouchDB, Lounge, Hadoop components
  4. The Problem
     - Hadoop is powerful technology that meets today’s demand for big data
     - But it’s still a young platform: evolving components and best practices
     - Many challenges in real-world usage: day-to-day operational headaches, missing ecosystem features (e.g., recurring jobs)
     - Lots of reinventing the wheel to solve these
  5. Purpose of this talk
     - Discuss some real problems we’ve seen
     - Explain our solutions
     - Propose best practices so you can avoid these problems
  6. What will I talk about?
     Background:
     - Meebo’s data processing needs
     - Meebo’s pre- and post-Hadoop data pipelines
     Lessons:
     - Better workflow management: scheduling, reporting, monitoring, etc.; a look at Azkaban
     - Get wiser about data serialization: Protocol Buffers (or Avro, or Thrift)
  7. Meebo’s Data Processing Needs
  8. What do we use Hadoop for?
     - ETL
     - Analytics
     - Behavioral targeting
     - Ad hoc data analysis, research
     - Data produced helps power internal/external dashboards and our ad server
  9. What kind of data do we have?
     Log data from all our products:
     - The Meebo Bar
     - Meebo Messenger (www.meebo.com)
     - Android/iPhone/Mobile Web clients
     - Rooms
     - Meebo Me
     - Meebo Notifier
     - Firefox extension
 10. How much data?
     - 150MM uniques/month from the Meebo Bar
     - Around 200 GB of uncompressed daily logs
     - We process a subset of our logs
 11. Meebo’s Data Pipeline: Pre- and Post-Hadoop
 12. A data pipeline in general
     1. Data Collection
     2. Data Processing
     3. Data Storage
     4. Workflow Management
 13. Our data pipeline, pre-Hadoop
     - Servers: Python/shell scripts pull log data
     - Python/shell scripts process data
     - MySQL, CouchDB, flat files
     - Cron and wrapper shell scripts glue everything together
 14. Our data pipeline, post-Hadoop
     - Servers push logs to HDFS
     - Pig scripts process data
     - MySQL, CouchDB, flat files
     - Azkaban, a workflow management system, glues everything together
 15. Our transition to using Hadoop
     - Deployed early ’09
     - Motivation: processing data took aaaages!
     - Catalyst: Hadoop Summit
     - Turbulent, time consuming: new tools, new paradigms, pitfalls
     - Totally worth it: 24 hours to process a day’s logs → under an hour
     - Leap in ability to analyze our data
     - Basis for new core product features
 16. Workflow Management

 17. What is workflow management?
 18. What is workflow management?
     It’s the glue that binds your data pipeline together: scheduling, monitoring, reporting, etc.
     - Most people use scripts and cron
     - But they end up spending too much time managing
     - We need a better way
 19. A workflow management system:
     - Executes jobs with arbitrarily complex dependency chains
 20. Split up your jobs into discrete chunks with dependencies
     - Minimize impact when chunks fail
     - Allow engineers to work on chunks separately
     - Monolithic scripts are no fun

 21-22. [Diagram: an example dependency chain. Clean up data from log A and process data from log B both feed a join/classifier-training step, followed by post-processing, archiving the output, and exporting to a DB somewhere.]
 23. A workflow management system:
     - Executes jobs with arbitrarily complex dependency chains
     - Schedules recurring jobs to run at a given time

 24. A workflow management system:
     - Executes jobs with arbitrarily complex dependency chains
     - Schedules recurring jobs to run at a given time
     - Monitors job progress

 25. A workflow management system:
     - Executes jobs with arbitrarily complex dependency chains
     - Schedules recurring jobs to run at a given time
     - Monitors job progress
     - Reports when jobs fail and how long jobs take

 26. A workflow management system:
     - Executes jobs with arbitrarily complex dependency chains
     - Schedules recurring jobs to run at a given time
     - Monitors job progress
     - Reports when jobs fail and how long jobs take
     - Logs job execution and exposes logs so that engineers can deal with failures swiftly

 27. A workflow management system:
     - Executes jobs with arbitrarily complex dependency chains
     - Schedules recurring jobs to run at a given time
     - Monitors job progress
     - Reports when jobs fail and how long jobs take
     - Logs job execution and exposes logs so that engineers can deal with failures swiftly
     - Provides resource management capabilities
 28. [Diagram: five "Export to DB somewhere" jobs all hitting one DB at once]
     Don’t DoS yourself

 29. [Diagram: the same five export jobs now request permits from a Permit Manager before touching the DB, so only a couple run at a time]
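     A permit manager like the one in the diagram is essentially a counting semaphore around the scarce resource. Here is a minimal sketch in Python, assuming for simplicity that the jobs run inside one process; all names here are hypothetical, not Azkaban’s API:

         import threading

         class PermitManager:
             # Hypothetical permit gate: at most max_permits jobs
             # may talk to the DB at once.
             def __init__(self, max_permits=2):
                 self._sem = threading.Semaphore(max_permits)

             def acquire(self):
                 self._sem.acquire()  # blocks until a permit frees up

             def release(self):
                 self._sem.release()

         permits = PermitManager(max_permits=2)

         def export_to_db(job_name):
             permits.acquire()
             try:
                 print(job_name + ": exporting to DB")
                 # ... actual export work would go here ...
             finally:
                 permits.release()

     In practice the jobs are separate processes, so a real permit manager needs a shared service or store to hold the permit count, but the bookkeeping is the same.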
 30. Don’t roll your own scheduler!
     - Building a good scheduling framework is hard: a myriad of small requirements and precise bookkeeping, with many edge cases
     - Many roll their own; it’s usually inadequate
     - So much repeated effort!
     - Mold an existing framework to your requirements and contribute
 31. Two emerging frameworks
     Oozie
     - Built at Yahoo
     - Open-sourced at Hadoop Summit ’10
     - Used in production for [don’t know]
     - Packaged by Cloudera
     Azkaban
     - Built at LinkedIn
     - Open-sourced in March ’10
     - Used in production for over nine months as of March ’10
     - Now in use at Meebo
 32. Azkaban

 33. Azkaban jobs are bundles of configuration and code
 34. Configuring a job
     process_log_data.job:
       type=command
       command=python process_logs.py
       failure.emails=datateam@whereiwork.com
     process_logs.py:
       import os
       import sys
       # Do useful things
       …
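     The slide elides the script body; purely as a hypothetical illustration of the kind of work a command job might do (the directory layout and output are invented):

         import os
         import sys

         def main(log_dir):
             # Toy example: report line counts for each log file
             # before shipping summaries off for processing.
             for name in sorted(os.listdir(log_dir)):
                 path = os.path.join(log_dir, name)
                 with open(path) as f:
                     count = sum(1 for _ in f)
                 print("%s: %d lines" % (name, count))

         if __name__ == "__main__":
             main(sys.argv[1])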
 35. Deploying a job
     Step 1: Shove your config and code into a zip archive.
     process_log_data.zip: the .job file and the .py script

 36. Deploying a job
     Step 2: Upload process_log_data.zip to Azkaban.
 37. Scheduling a job
     The Azkaban front-end: [screenshot]

 38. What about dependencies?

 39. get_users_widgets
     Four jobs: process_widgets.job, process_users.job, join_users_widgets.job, export_to_db.job
 40. get_users_widgets
     process_widgets.job:
       type=command
       command=python process_widgets.py
       failure.emails=datateam@whereiwork.com
     process_users.job:
       type=command
       command=python process_users.py
       failure.emails=datateam@whereiwork.com

 41. get_users_widgets
     join_users_widgets.job:
       type=command
       command=python join_users_widgets.py
       failure.emails=datateam@whereiwork.com
       dependencies=process_widgets,process_users
     export_to_db.job:
       type=command
       command=python export_to_db.py
       failure.emails=datateam@whereiwork.com
       dependencies=join_users_widgets
 42. get_users_widgets
     get_users_widgets.zip: the four .job files plus the four .py scripts they run
 43. You deploy and schedule a job flow as you would a single job.
 44. Hierarchical configuration
     process_widgets.job:
       type=command
       command=python process_widgets.py
       failure.emails=datateam@whereiwork.com
     process_users.job:
       type=command
       command=python process_users.py
       failure.emails=datateam@whereiwork.com
     This is silly. Can’t I specify failure.emails globally?
 45. azkaban-job-dir/
       system.properties
       get_users_widgets/
         process_widgets.job
         process_users.job
         join_users_widgets.job
         export_to_db.job
       some-other-job/
         …
 46. Hierarchical configuration
     system.properties:
       failure.emails=datateam@whereiwork.com
       db.url=foo.whereiwork.com
       archive.dir=/var/whereiwork/archive
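     With the shared settings pulled up into system.properties, each job file only declares what is unique to it. A sketch of what process_widgets.job could shrink to, assuming Azkaban’s hierarchical property inheritance works as the slides imply:

         process_widgets.job:
           type=command
           command=python process_widgets.py
           # failure.emails is inherited from system.properties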
 47. What is type=command?
     Azkaban supports a few ways to execute jobs:
     - command: Unix command in a separate process
     - javaprocess: wrapper to kick off Java programs
     - java: wrapper to kick off Runnable Java classes; can hook into Azkaban in useful ways
     - Pig: wrapper to run Pig scripts through Grunt
 48. What’s missing?
     Scheduling and executing multiple instances of the same job at the same time.
 49-50. [Diagram: FOO runs hourly; the 3:00 PM run took longer than expected and was still running when the 4:00 PM run was due]

 51-52. [Diagram: FOO runs hourly; the 3:00 PM run failed and was restarted at 4:25 PM, leaving two instances running into the 5:00 PM slot]
 53. What’s missing?
     Scheduling and executing multiple instances of the same job at the same time.
     AZK-49, AZK-47
     Stay tuned for complete, reviewed patch branches: www.github.com/voberoi/azkaban
 54. What’s missing?
     Scheduling and executing multiple instances of the same job at the same time.
     AZK-49, AZK-47
     Stay tuned for complete, reviewed patch branches: www.github.com/voberoi/azkaban
     Passing arguments between jobs:
     - Write a library used by your jobs (see the sketch below)
     - Put your arguments anywhere you want
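     That "library" can be as simple as a shared scratch file that jobs in a flow read and write. A minimal hypothetical sketch; the path and function names are invented:

         # job_args.py: tiny helper shared by all jobs in a flow
         import json
         import os

         ARGS_PATH = "/var/whereiwork/job_args.json"  # assumed scratch location

         def put_args(job_name, args):
             # Merge this job's output arguments into the shared file.
             data = {}
             if os.path.exists(ARGS_PATH):
                 with open(ARGS_PATH) as f:
                     data = json.load(f)
             data[job_name] = args
             with open(ARGS_PATH, "w") as f:
                 json.dump(data, f)

         def get_args(job_name):
             # Read the arguments an upstream job left behind.
             with open(ARGS_PATH) as f:
                 return json.load(f).get(job_name, {})

     An upstream job calls put_args("process_widgets", {"date": "2010-08-01"}); the downstream join job reads them back with get_args("process_widgets").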
 55. What did we get out of it?
     - No more monolithic wrapper scripts
     - Massively reduced job setup time: it’s configuration, not code!
     - More code reuse, less hair pulling
     - Still porting over jobs; it’s time consuming
 56. Data Serialization

 57. What’s the problem?
     - Serializing data in simple formats is convenient: CSV, XML, etc.
     - Problems arise when data changes: it needs backwards compatibility
     - Does this really matter? Let’s discuss.
 58. v1 [Mockup: clickabutton.com, with Username and Password fields and a single Go! button]
 59. “Click a Button” Analytics PRD
     We want to know the number of unique users who clicked on the button:
     - over an arbitrary range of time
     - broken down by whether they’re logged in or not
     - with hour granularity
 60. “I KNOW!”
     Every hour, process logs and dump lines that look like this to HDFS with Pig:
     unique_id,logged_in,clicked
 61. “I KNOW!”
     -- 'clicked' and 'logged_in' are either 0 or 1
     LOAD '$IN' USING PigStorage(',') AS (
       unique_id:chararray,
       logged_in:int,
       clicked:int
     );
     -- Munge data according to the PRD
     …
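     The munging itself is elided on the slide. Just to make the PRD concrete, here is a rough Python equivalent of what one hourly run computes; the real job ran in Pig, so this is only an illustration:

         import csv
         import sys
         from collections import defaultdict

         # One hourly run: distinct users who clicked, split by logged_in.
         clickers = defaultdict(set)
         for unique_id, logged_in, clicked in csv.reader(sys.stdin):
             if clicked == "1":
                 clickers[logged_in].add(unique_id)

         for logged_in, users in sorted(clickers.items()):
             print("logged_in=%s uniques=%d" % (logged_in, len(users)))

     Hour granularity falls out of running once per hour; arbitrary time ranges come from aggregating the hourly outputs.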
 62. v2 [Mockup: clickabutton.com, now with two buttons, red and green]
 63. “Click a Button” Analytics PRD
     Break users down by which button they clicked, too.

 64. “I KNOW!”
     Every hour, process logs and dump lines that look like this to HDFS with Pig:
     unique_id,logged_in,red_click,green_click
 65. “I KNOW!”
     -- the click and logged_in fields are either 0 or 1
     LOAD '$IN' USING PigStorage(',') AS (
       unique_id:chararray,
       logged_in:int,
       red_clicked:int,
       green_clicked:int
     );
     -- Munge data according to the PRD
     …
 66. v3 [Mockup: clickabutton.com, with the red button removed]

 67. “Hmm.”
 68. Bad Solution 1
     Remove red_click:
     unique_id,logged_in,red_click,green_click → unique_id,logged_in,green_click
 69. Why it’s bad
     Your script thinks green clicks are red clicks.
     LOAD '$IN' USING PigStorage(',') AS (
       unique_id:chararray,
       logged_in:int,
       red_clicked:int,
       green_clicked:int
     );
     -- Munge data according to the PRD
     …

 70. Why it’s bad
     Now your script won’t work for all the data you’ve collected so far.
     LOAD '$IN' USING PigStorage(',') AS (
       unique_id:chararray,
       logged_in:int,
       green_clicked:int
     );
     -- Munge data according to the PRD
     …
 71. “I’ll keep multiple scripts lying around”

 72. LOAD '$IN' USING PigStorage(',') AS (
       unique_id:chararray,
       logged_in:int,
       green_clicked:int
     );
     My data has three fields. Which one do I use?
     LOAD '$IN' USING PigStorage(',') AS (
       unique_id:chararray,
       logged_in:int,
       orange_clicked:int
     );
 73. Bad Solution 2
     Assign a sentinel to red_click when it should be ignored, e.g. -1.
     unique_id,logged_in,red_click,green_click
 74. Why it’s bad
     It’s a waste of space.

 75. Why it’s bad
     Sticking logic in your data is iffy.
 76. The Preferable Solution
     Serialize your data using backwards-compatible data structures!
     Protocol Buffers and Elephant Bird
 77. Protocol Buffers
     - A serialization system (see also Avro, Thrift)
     - Compiles interfaces to language modules
     - Construct a data structure; access it (in a backwards-compatible way)
     - Ser/deser the data structure in a standard, compact, binary format
 78. uniqueuser.proto
     message UniqueUser {
       optional string id = 1;
       optional int32 logged_in = 2;
       optional int32 red_clicked = 3;
     }
     Compiles to .h/.cc, .java, or .py modules.
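     To sketch what the generated module buys you: compiling with protoc --python_out=. uniqueuser.proto yields uniqueuser_pb2.py, which you use roughly like this:

         import uniqueuser_pb2

         # Construct a data structure.
         user = uniqueuser_pb2.UniqueUser()
         user.id = "bak49jsn"
         user.logged_in = 0
         user.red_clicked = 1

         # Ser/deser in a standard, compact binary format.
         blob = user.SerializeToString()
         same = uniqueuser_pb2.UniqueUser()
         same.ParseFromString(blob)

     Because the fields are optional and tagged, a reader compiled against a newer .proto that has gained fields still parses old blobs; that is the backwards compatibility the rest of this section leans on.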
 79. Elephant Bird
     - Generates protobuf-based Pig load/store functions + lots more
     - Developed at Twitter
     - Blog post: http://engineering.twitter.com/2010/04/hadoop-at-twitter.html
     - Available at: http://www.github.com/kevinweil/elephant-bird
 80. uniqueuser.proto
     message UniqueUser {
       optional string id = 1;
       optional int32 logged_in = 2;
       optional int32 red_clicked = 3;
     }
     Elephant Bird generates:
     *.pig.load.UniqueUserLzoProtobufB64LinePigLoader
     *.pig.store.UniqueUserLzoProtobufB64LinePigStorage
 81. LzoProtobufB64?
 82. LzoProtobufB64 serialization
     (bak49jsn, 0, 1)
     → protobuf binary blob
     → Base64-encoded protobuf binary blob
     → LZO-compressed, Base64-encoded protobuf binary blob

 83. LzoProtobufB64 deserialization
     LZO-compressed, Base64-encoded protobuf binary blob
     → Base64-encoded protobuf binary blob
     → protobuf binary blob
     → (bak49jsn, 0, 1)
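     The B64Line piece means one Base64-encoded protobuf per line, so records survive line-oriented tools; LZO compression is then layered on top by Hadoop. A hypothetical round-trip sketch, with uniqueuser_pb2 as above and the LZO step omitted:

         import base64
         import uniqueuser_pb2

         def encode_line(user):
             # protobuf binary blob -> Base64 -> one text line
             return base64.b64encode(user.SerializeToString()).decode("ascii") + "\n"

         def decode_line(line):
             user = uniqueuser_pb2.UniqueUser()
             user.ParseFromString(base64.b64decode(line.strip()))
             return user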
 84. Setting it up
     Prereqs:
     - Protocol Buffers 2.3+
     - LZO codec for Hadoop
     Check out the docs: http://www.github.com/kevinweil/elephant-bird
 85. Time to revisit

 86. v1 [Mockup: clickabutton.com with a single button]
 87. Every hour, process logs and dump lines to HDFS that use this protobuf interface:
     uniqueuser.proto
     message UniqueUser {
       optional string id = 1;
       optional int32 logged_in = 2;
       optional int32 red_clicked = 3;
     }
 88. -- 'red_clicked' and 'logged_in' are either 0 or 1
     LOAD '$IN' USING myudfs.pig.load.UniqueUserLzoProtobufB64LinePigLoader AS (
       unique_id:chararray,
       logged_in:int,
       red_clicked:int
     );
     -- Munge data according to the PRD
     …
 89. v2 [Mockup: clickabutton.com with two buttons, red and green]
 90. Every hour, process logs and dump lines to HDFS that use this protobuf interface:
     uniqueuser.proto
     message UniqueUser {
       optional string id = 1;
       optional int32 logged_in = 2;
       optional int32 red_clicked = 3;
       optional int32 green_clicked = 4;
     }
 91. -- the click and logged_in fields are either 0 or 1
     LOAD '$IN' USING myudfs.pig.load.UniqueUserLzoProtobufB64LinePigLoader AS (
       unique_id:chararray,
       logged_in:int,
       red_clicked:int,
       green_clicked:int
     );
     -- Munge data according to the PRD
     …
 92. v3 [Mockup: clickabutton.com with the red button removed]
 93. No need to change your scripts.
     They’ll work on old and new data!
 94. Bonus!
     http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter
 95. Conclusion
     Workflow management:
     - Use Azkaban, Oozie, or another framework
     - Don’t use shell scripts and cron
     - Do this from day one! Transitioning is expensive.
     Data serialization:
     - Use Protocol Buffers, Avro, Thrift, or something else
     - Do this from day one, before it bites you
 96. Questions?
     voberoi@gmail.com
     www.vikramoberoi.com
     @voberoi on Twitter
     We’re hiring!