Hadoop at Meebo: Lessons learned in the real world Vikram Oberoi August 2010, Hadoop Day, Seattle
About me SDE Intern at Amazon, ’07 R&D on item-to-item similarities Data Engineer Intern at Meebo, ’08 Built an A/B testing system CS at Stanford, ’09 Senior project: Ext3 and XFS under Hadoop MapReduce workloads Data Engineer at Meebo, ’09 to present Data infrastructure, analytics
About Meebo Products Browser-based IM client (www.meebo.com) Mobile chat clients Social widgets (the Meebo Bar) Company Founded 2005 Over 100 employees, 30 engineers Engineering Strong engineering culture Contributions to CouchDB, Lounge, Hadoop components
The Problem Hadoop is powerful technology Meets today’s demand for big data But it’s still a young platform Evolving components and best practices With many challenges in real-world usage Day-to-day operational headaches Missing ecosystem features (e.g. recurring job scheduling) Lots of re-inventing the wheel to solve these
Purpose of this talk Discuss some real problems we’ve seen Explain our solutions Propose best practices so you can avoid these problems
What will I talk about? Background: Meebo’s data processing needs Meebo’s pre- and post-Hadoop data pipelines Lessons: Better workflow management Scheduling, reporting, monitoring, etc. A look at Azkaban Get wiser about data serialization Protocol Buffers (or Avro, or Thrift)
Meebo’s Data Processing Needs
What do we use Hadoop for? ETL Analytics Behavioral targeting Ad hoc data analysis, research Data produced helps power: internal/external dashboards our ad server
What kind of data do we have? Log data from all our products The Meebo Bar Meebo Messenger (www.meebo.com) Android/iPhone/Mobile Web clients Rooms Meebo Me Meebonotifier Firefox extension
How much data? 150MM uniques/month from the Meebo Bar Around 200 GB of uncompressed daily logs We process a subset of our logs
Meebo’s Data Pipeline, Pre- and Post-Hadoop
A data pipeline in general 1. Data Collection 2. Data Processing 3. Data Storage 4. Workflow Management
Our data pipeline, pre-Hadoop Servers Python/shell scripts pull log data Python/shell scripts process data MySQL, CouchDB, flat files Cron, wrapper shell scripts glue everything together
Our data pipeline, post-Hadoop Servers push logs to HDFS Pig scripts process data MySQL, CouchDB, flat files Azkaban, a workflow management system, glues everything together
Our transition to using Hadoop Deployed early ’09 Motivation: processing data took aaaages! Catalyst: Hadoop Summit Turbulent, time consuming New tools, new paradigms, pitfalls Totally worth it 24 hours to process a day’s logs went down to under an hour Leap in ability to analyze our data Basis for new core product features
Workflow Management
What is workflow management?
What is workflow management? It’s the glue that binds your data pipeline together: scheduling, monitoring, reporting, etc. Most people use scripts and cron, but end up spending too much time managing it all. We need a better way.
Workflow management consists of: executing jobs with arbitrarily complex dependency chains
Split up your jobs into discrete chunks with dependencies
Allow engineers to work on chunks separately
Monolithic scripts are no fun
Workflow management consists of: executing jobs with arbitrarily complex dependency chains; scheduling recurring jobs to run at a given time
Workflow management consists of: executing jobs with arbitrarily complex dependency chains; scheduling recurring jobs to run at a given time; monitoring job progress
Workflow management consists of: executing jobs with arbitrarily complex dependency chains; scheduling recurring jobs to run at a given time; monitoring job progress; reporting when jobs fail and how long they take
Workflow management consists of: executing jobs with arbitrarily complex dependency chains; scheduling recurring jobs to run at a given time; monitoring job progress; reporting when jobs fail and how long they take; logging job execution and exposing logs so that engineers can deal with failures swiftly
Workflow management consists of: executing jobs with arbitrarily complex dependency chains; scheduling recurring jobs to run at a given time; monitoring job progress; reporting when jobs fail and how long they take; logging job execution and exposing logs so that engineers can deal with failures swiftly; providing resource management capabilities
[Diagram: five jobs, each running "Export to DB somewhere," all hitting the same DB at once. Don’t DoS yourself.]
[Diagram: the same five export jobs now go through a permit manager that caps concurrent exports (permits remaining: 2, 1, 0, 0, 0) before they reach the DB.]
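To make the permit-manager idea concrete, here is a minimal sketch, not Meebo's or Azkaban's actual implementation, of how several export scripts could throttle themselves against a shared pool of database permits. The module name, file path, and permit count are all assumptions for illustration.

# permits.py -- hypothetical helper; every export job calls acquire() before
# touching the DB and release() afterwards, so at most MAX_PERMITS exports
# run against the database at any one time.
import fcntl
import json
import time

PERMIT_FILE = "/var/whereiwork/db_permits"   # assumed shared, job-writable path
MAX_PERMITS = 2

def _adjust(delta):
    """Atomically adjust the in-use counter; return None if no permit is free."""
    with open(PERMIT_FILE, "a+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)         # serialize access across processes
        f.seek(0)
        raw = f.read()
        in_use = json.loads(raw) if raw else 0
        if delta > 0 and in_use >= MAX_PERMITS:
            return None                       # denied: too many exports running
        in_use += delta
        f.seek(0)
        f.truncate()
        f.write(json.dumps(in_use))
        return in_use                         # lock released when the file closes

def acquire(poll_seconds=30):
    while _adjust(+1) is None:                # wait until another job releases
        time.sleep(poll_seconds)

def release():
    _adjust(-1)

Each export script would wrap its DB work in acquire()/release(). A real permit manager inside the workflow system can do better bookkeeping (for example reclaiming permits from crashed jobs), which is exactly why this belongs in the framework rather than in every script.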
Don’t roll your own scheduler! Building a good scheduling framework is hard Myriad of small requirements, precise bookkeeping with many edge cases Many roll their own It’s usually inadequate So much repeated effort! Mold an existing framework to your requirements and contribute
Two emerging frameworks Oozie Built at Yahoo Open-sourced at Hadoop Summit ’10 Used in production for [don’t know] Packaged by Cloudera Azkaban Built at LinkedIn Open-sourced in March ‘10 Used in production for over nine months as of March ’10 Now in use at Meebo
Azkaban
Azkaban jobs are bundles of configuration and code
Configuring a job process_log_data.job type=command command=python process_logs.py failure.emails=datateam@whereiwork.com process_logs.py import os import sys # Do useful things …
Deploying a job Step 1: Shove your config and code into a zip archive. process_log_data.zip .job .py
Deploying a job Step 2: Upload to Azkaban process_log_data.zip .job .py
Scheduling a job The Azkaban front-end:
What about dependencies?
get_users_widgets process_widgets.job process_users.job join_users_widgets.job export_to_db.job
get_users_widgets process_widgets.job type=command command=python process_widgets.py failure.emails=datateam@whereiwork.com process_users.job type=command command=python process_users.py failure.emails=datateam@whereiwork.com
get_users_widgets join_users_widgets.job type=command command=python join_users_widgets.py failure.emails=datateam@whereiwork.com dependencies=process_widgets,process_users export_to_db.job type=command command=python export_to_db.py failure.emails=datateam@whereiwork.com dependencies=join_users_widgets
get_users_widgets get_users_widgets.zip .job .job .job .job .py .py .py .py
You deploy and schedule a job flow as you would a single job.
Hierarchical configuration process_widgets.job type=command command=python process_widgets.py failure.emails=datateam@whereiwork.com This is silly. Can't I specify failure.emails globally? process_users.job type=command command=python process_users.py failure.emails=datateam@whereiwork.com
azkaban-job-dir/ system.properties get_users_widgets/ process_widgets.job process_users.job join_users_widgets.job export_to_db.job some-other-job/ …
Hierarchical configuration system.properties failure.emails=datateam@whereiwork.com db.url=foo.whereiwork.com archive.dir=/var/whereiwork/archive
What is type=command? Azkaban supports a few ways to execute jobs command Unix command in a separate process javaprocess Wrapper to kick off Java programs java Wrapper to kick off Runnable Java classes Can hook into Azkaban in useful ways Pig Wrapper to run Pig scripts through Grunt
What’s missing? Scheduling and executing multiple instances of the same job at the same time.
[Timeline: job FOO scheduled hourly. The 3:00 PM run takes longer than expected, so the 4:00 PM run of FOO has to start while the 3:00 PM run is still going.]
[Timeline: the 3:00 PM run of FOO fails and is restarted at 4:25 PM, overlapping the 4:00 PM and 5:00 PM runs.]
What’s missing? Scheduling and executing multiple instances of the same job at the same time. AZK-49, AZK-47 Stay tuned for complete, reviewed patch branches: www.github.com/voberoi/azkaban
What’s missing? Scheduling and executing multiple instances of the same job at the same time. AZK-49, AZK-47 Stay tuned for complete, reviewed patch branches: www.github.com/voberoi/azkaban Passing arguments between jobs. Write a library used by your jobs and put your arguments anywhere you want (see the sketch below).
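For the argument-passing workaround above, here is a minimal sketch of such a library, with an invented module name (flowargs) and an assumed shared directory; jobs in the same flow simply agree on where to stash values:

# flowargs.py -- hypothetical shared library, imported by every job in a flow.
# Upstream jobs call put(); downstream jobs call get(). Nothing here is part
# of Azkaban; it only agrees on one JSON file per flow in a shared directory.
import json
import os

ARG_DIR = "/var/whereiwork/flow_args"   # assumed path writable by all jobs

def _path(flow_name):
    return os.path.join(ARG_DIR, flow_name + ".json")

def put(flow_name, key, value):
    if not os.path.isdir(ARG_DIR):
        os.makedirs(ARG_DIR)
    args = {}
    if os.path.exists(_path(flow_name)):
        with open(_path(flow_name)) as f:
            args = json.load(f)
    args[key] = value
    with open(_path(flow_name), "w") as f:
        json.dump(args, f)

def get(flow_name, key, default=None):
    try:
        with open(_path(flow_name)) as f:
            return json.load(f).get(key, default)
    except IOError:
        return default

In the get_users_widgets flow, for example, join_users_widgets.py could call flowargs.put('get_users_widgets', 'output_path', path) and export_to_db.py could read it back with flowargs.get('get_users_widgets', 'output_path').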
What did we get out of it? No more monolithic wrapper scripts Massively reduced job setup time It’s configuration, not code! More code reuse, less hair pulling Still porting over jobs It’s time consuming
Data Serialization
What’s the problem? Serializing data in simple formats (CSV, XML, etc.) is convenient. Problems arise when the data changes and needs backwards compatibility. Does this really matter? Let’s discuss.
v1 clickabutton.com Username: Password: Go!
“Click a Button” Analytics PRD We want to know the number of unique users who clicked on the button. Over an arbitrary range of time. Broken down by whether they’re logged in or not. With hour granularity.
“I KNOW!” Every hour, process logs and dump lines that look like this to HDFS with Pig: unique_id,logged_in,clicked
“I KNOW!” -- 'clicked' and 'logged_in' are either 0 or 1 LOAD '$IN' USING PigStorage(',') AS ( unique_id:chararray, logged_in:int, clicked:int ); -- Munge data according to the PRD …
v2 clickabutton.com Username: Password: Go!
“Click a Button” Analytics PRD Break users down by which button they clicked, too.
“I KNOW!” Every hour, process logs and dump lines that look like this to HDFS with Pig: unique_id,logged_in,red_click,green_click
“I KNOW!” -- 'red_clicked', 'green_clicked', and 'logged_in' are either 0 or 1 LOAD '$IN' USING PigStorage(',') AS ( unique_id:chararray, logged_in:int, red_clicked:int, green_clicked:int ); -- Munge data according to the PRD …
v3 clickabutton.com Username: Password: Go!
“Hmm.”
Bad Solution 1 Remove red_click unique_id,logged_in,red_click,green_click unique_id,logged_in,green_click
Why it’s bad Your script thinks green clicks are red clicks. LOAD '$IN' USING PigStorage(',') AS ( unique_id:chararray, logged_in:int, red_clicked:int, green_clicked:int ); -- Munge data according to the PRD …
Why it’s bad Now your script won’t work for all the data you’ve collected so far. LOAD '$IN' USING PigStorage(',') AS ( unique_id:chararray, logged_in:int, green_clicked:int ); -- Munge data according to the PRD …
“I’ll keep multiple scripts lying around”
LOAD '$IN' USING PigStorage(',') AS ( unique_id:chararray, logged_in:int, green_clicked:int ); My data has three fields. Which one do I use? LOAD '$IN' USING PigStorage(',') AS ( unique_id:chararray, logged_in:int, orange_clicked:int );
Bad Solution 2 Assign a sentinel to red_click when it should be ignored, e.g. -1. unique_id,logged_in,red_click,green_click
Why it’s bad It’s a waste of space.
Why it’s bad Sticking logic in your data is iffy.
The Preferable Solution Serialize your data using backwards-compatible data structures! Protocol Buffers and Elephant Bird
Protocol Buffers A serialization system (alternatives: Avro, Thrift) Compiles interface definitions to language modules that let you: Construct a data structure Access it (in a backwards-compatible way) Ser/deser the data structure in a standard, compact, binary format
uniqueuser.proto message UniqueUser { optional string id = 1; optional int32 logged_in = 2; optional int32 red_clicked = 3; } .h, .cc .java .py
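For illustration, after compiling the interface above with protoc (protoc --python_out=. uniqueuser.proto produces a uniqueuser_pb2 module; the other output flags emit the .h/.cc and .java equivalents), constructing, serializing, and parsing a record looks roughly like this. The field values are made up:

# Assumes uniqueuser_pb2.py was generated next to this script by protoc.
from uniqueuser_pb2 import UniqueUser

user = UniqueUser()
user.id = "bak49jsn"
user.logged_in = 0
user.red_clicked = 1

blob = user.SerializeToString()    # standard, compact binary encoding

same_user = UniqueUser()
same_user.ParseFromString(blob)    # round-trips, and keeps working as long as
                                   # field numbers are never reused or changed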
Elephant Bird Generate protobuf-based Pig load/store functions + lots more Developed at Twitter Blog post http://engineering.twitter.com/2010/04/hadoop-at-twitter.html Available at: http://www.github.com/kevinweil/elephant-bird
uniqueuser.proto message UniqueUser { optional string id = 1; optional int32 logged_in = 2; optional int32 red_clicked = 3; } *.pig.load.UniqueUserLzoProtobufB64LinePigLoader *.pig.store.UniqueUserLzoProtobufB64LinePigStorage
LzoProtobufB64?
LzoProtobufB64 serialization: (bak49jsn, 0, 1) → Protobuf binary blob → Base64-encoded protobuf binary blob → LZO-compressed Base64-encoded protobuf binary blob
LzoProtobufB64 deserialization: LZO-compressed Base64-encoded protobuf binary blob → Base64-encoded protobuf binary blob → Protobuf binary blob → (bak49jsn, 0, 1)
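Per record, the loader and storage classes are doing roughly the following. This is a sketch: the LZO step is applied by Hadoop's LZO codec over the whole file rather than per line, so it is left out here, and uniqueuser_pb2 is the protoc-generated module from the previous example.

import base64
from uniqueuser_pb2 import UniqueUser

def to_b64_line(user):
    # protobuf binary blob -> one Base64-encoded line, safe for line-oriented logs
    return base64.b64encode(user.SerializeToString())

def from_b64_line(line):
    # Base64-encoded line -> protobuf binary blob -> tuple Pig can work with
    user = UniqueUser()
    user.ParseFromString(base64.b64decode(line))
    return (user.id, user.logged_in, user.red_clicked)   # e.g. ('bak49jsn', 0, 1)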
Setting it up Prereqs Protocol Buffers 2.3+ LZO codec for Hadoop Check out docs http://www.github.com/kevinweil/elephant-bird
Time to revisit
v1 clickabutton.com Username: Password: Go!
Every hour, process logs and dump lines to HDFS that use this protobuf interface: uniqueuser.proto message UniqueUser { optional string id = 1; optional int32 logged_in = 2; optional int32 red_clicked = 3; }
-- 'red_clicked' and 'logged_in' are either 0 or 1 LOAD '$IN' USING myudfs.pig.load.UniqueUserLzoProtobufB64LinePigLoader AS ( unique_id:chararray, logged_in:int, red_clicked:int ); -- Munge data according to the PRD …
v2 clickabutton.com Username: Password: Go!
Every hour, process logs and dump lines to HDFS that use this protobuf interface: uniqueuser.proto message UniqueUser { optional string id = 1; optional int32 logged_in = 2; optional int32 red_clicked = 3; optional int32 green_clicked = 4; }
-- 'red_clicked', 'green_clicked', and 'logged_in' are either 0 or 1 LOAD '$IN' USING myudfs.pig.load.UniqueUserLzoProtobufB64LinePigLoader AS ( unique_id:chararray, logged_in:int, red_clicked:int, green_clicked:int ); -- Munge data according to the PRD …
v3 clickabutton.com Username: Password: Go!
No need to change your scripts. They’ll work on old and new data!
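A quick way to convince yourself of that, again as a sketch using the protoc-generated uniqueuser_pb2 module: bytes that only carry the v1 fields parse cleanly under the v3 definition, and the field a v1 writer never set simply reports its default.

from uniqueuser_pb2 import UniqueUser   # compiled from the v3 .proto (includes green_clicked)

# A v1 writer only knew about id, logged_in, and red_clicked; setting just those
# fields produces the same bytes a v1 binary would have logged.
old = UniqueUser()
old.id = "bak49jsn"
old.logged_in = 0
old.red_clicked = 1
old_blob = old.SerializeToString()

user = UniqueUser()
user.ParseFromString(old_blob)
print(user.red_clicked)                # 1, as logged
print(user.green_clicked)              # 0 -- an unset optional int32 falls back to its default
print(user.HasField("green_clicked"))  # False -- you can even tell it was never written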
Conclusion Workflow management: Use Azkaban, Oozie, or another framework. Don’t use shell scripts and cron. Do this from day one! Transitioning is expensive. Data serialization: Use Protocol Buffers, Avro, Thrift, or something else! Do this from day one, before it bites you.
Questions? voberoi@gmail.com www.vikramoberoi.com @voberoi on Twitter We’re hiring!