From Batch to Realtime with Hadoop
Berlin Buzzwords, June 2012
Lars George
lars@cloudera.com
About Me

• Solutions Architect @ Cloudera
• Apache HBase & Whirr Committer
• Working with Hadoop & HBase since
  2007
• Author of O’Reilly’s “HBase - The
  Definitive Guide”
The Application Stack
• Solve Business Goals
• Rely on Proven Building Blocks
• Rapid Prototyping
 ‣ Templates, MVC, Reference
   Implementations
• Evolutionary Innovation Cycles
           “Let there be light!”
LAMP
L   Linux
A   Apache
M   MySQL
P   PHP/Perl

LAM(M)P
L   Linux
A   Apache
M   MySQL
M   Memcache
P   PHP/Perl
The Dawn of Big Data
•   Industry verticals produce a staggering amount of data
•   Not only web properties, but also “brick and mortar”
    businesses
    ‣   Smart Grid, Bioinformatics, Financial, Telco
•   Scalable computation frameworks allow analysis of all the data
    ‣   No sampling anymore
•   Suitable algorithms derive even more data
    ‣   Machine learning
•   “The Unreasonable Effectiveness of Data”
    ‣   More data is better than smart algorithms
Hadoop

• HDFS + MapReduce
• Based on Google Papers
• Distributed Storage and Computation
  Framework
• Affordable Hardware, Free Software
• Significant Adoption
HDFS

• Reliably store petabytes of replicated data
  across thousands of nodes
• Master/Slave Architecture
• Built on “commodity” hardware
MapReduce
   • Distributed programming model to reliably
     process petabytes of data
   • Locality of data to processing is vital
     ‣ Run code where data resides
   • Inspired by map and reduce functions in
     functional programming

Input ➜ Map() ➜ Copy/Sort ➜ Reduce() ➜ Output
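
The Input ➜ Map() ➜ Copy/Sort ➜ Reduce() ➜ Output flow can be sketched on a single machine. This is only a toy word count in plain Python, not the Hadoop API; the names `map_fn`, `reduce_fn`, and `run_job` are made up for illustration, and the real framework distributes each phase across the cluster.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all counts collected for one key."""
    return word, sum(counts)


def run_job(lines):
    # Map phase: every input record is processed independently.
    mapped = [pair for line in lines for pair in map_fn(line)]
    # Copy/Sort phase: sorting brings equal keys next to each other.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: one call per distinct key.
    return dict(
        reduce_fn(key, (count for _, count in group))
        for key, group in groupby(mapped, key=itemgetter(0))
    )


if __name__ == "__main__":
    print(run_job(["batch to realtime", "realtime with hadoop"]))
```

Because each map call sees only one record and each reduce call sees only one key, both phases can run in parallel on the nodes that hold the data.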
From Short to Long Term
 Internet


 LAM(M)P
            • Serves the Client
            • Stores Intermediate Data

 Hadoop
            • Background Batch Processing
            • Stores Long-Term Data
Batch Processing
•   Scale is Unlimited
    ‣ Bound only by Hardware
•   Harness the Power of the Cluster
    ‣ CPUs, Disks, Memory

•   Disks extend Memory
    ‣ Spills represent Swapping

•   Trade Size Limitations for Time
    ‣ Jobs run from a few minutes to hours or days
From Batch to Realtime
•   “Time is Money”
•   Bridging the gap between batch and “now”
•   Realtime often means “faster than batch”
•   80/20 Rule
    ‣ Hadoop solves the 80% easily
    ‣ The remaining 20% takes 80% of the
      effort
•   Go as close as possible, don’t overdo it!
Stop Gap Solutions
•   In Memory
    ‣   Memcached
    ‣   MemBase
    ‣   GigaSpaces
•   Relational Databases
    ‣   MySQL
    ‣   PostgreSQL
•   NoSQL
    ‣   Cassandra
    ‣   HBase
Complemental Design #1
• Keep Backup in HDFS
• MapReduce over HDFS
• Synchronize HBase
  ‣ Batch Puts
  ‣ Bulk Import
[Diagram components: Internet, LAM(M)P, Hadoop, HBase]
Complemental Design #2
• Add Log Support
• Synchronize HBase
  ‣ Batch Puts
  ‣ Bulk Import
[Diagram components: Internet, LAM(M)P, Flume, Hadoop, HBase]
Mitigation Planning
• Reliable storage has top priority
• Disaster Recovery
• HBase Backups
  ‣ Export - but what if HBase is “down”?
  ‣ CopyTable - same issue
  ‣ Snapshots - not available (yet)
Complemental Design #3
• Add Log Processing
• Remove Direct Connection
• Synchronize HBase
  ‣ Batch Puts
  ‣ Bulk Import
[Diagram components: Internet, LAM(M)P, Flume, Log Proc, Hadoop, HBase]
Facebook Insights

• > 20B Events per Day
• 1M Counter Updates per Second
  ‣ 100-Node Cluster
  ‣ 10K OPS per Node


Web ➜ Scribe ➜ Ptail ➜ Puma ➜ HBase
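
The figures above are consistent with each other, which a few lines of arithmetic make explicit. The variable names are illustrative, and the last ratio assumes both rates are daily averages, which the slide does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 20_000_000_000      # "> 20B Events per Day" (lower bound)
updates_per_second = 1_000_000       # "1M Counter Updates per Second"
nodes = 100                          # "100 Nodes Cluster"

# Average event rate implied by the daily volume.
avg_events_per_second = events_per_day / SECONDS_PER_DAY

# Load per node if updates spread evenly across the cluster.
ops_per_node = updates_per_second // nodes

# Assumption: treating both rates as averages, each event touches
# several counters (for example total, gender, country).
updates_per_event = updates_per_second / avg_events_per_second

if __name__ == "__main__":
    print(f"{avg_events_per_second:,.0f} events/s on average")
    print(f"{ops_per_node:,} ops/s per node")
    print(f"{updates_per_event:.1f} counter updates per event")
```

The per-node result matches the 10K OPS per Node bullet.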
Collection Layer

• “Like” button triggers AJAX request
• Event written to log file using Scribe
  ‣ Handles aggregation, delivery, file
    rollover, etc.
  ‣ Uses HDFS to store files
✓ Use Flume or Scribe
Filter Layer
• Ptail “follows” logs written by Scribe
• Aggregates from multiple logs
• Separates into event types
  ‣ Sharding for future growth
• Facebook internal tool
✓ Use Flume
Batching Layer
• Puma batches updates
  ‣ 1 sec, staggered
• Flush batch when the previous one is done
• Duration limited by key distribution
• Facebook internal tool
✓ Use Coprocessors (0.92.0)
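
The batching idea can be sketched as follows: many increments for the same counter are collapsed in memory and written as one update per interval. This is not Puma (a Facebook internal tool) and not an HBase coprocessor; `CounterBatcher` and its `flush` callback are hypothetical names, and the clock is passed in explicitly to keep the sketch deterministic.

```python
from collections import Counter


class CounterBatcher:
    """Hypothetical helper: collapse many increments into one update per counter."""

    def __init__(self, flush, interval=1.0):
        self.flush = flush          # callable receiving {(row, column): delta}
        self.interval = interval    # seconds between flushes (1 sec on the slide)
        self.pending = Counter()
        self.last_flush = 0.0

    def add(self, row, column, now, delta=1):
        """Record one increment; flush if the interval has elapsed."""
        self.pending[(row, column)] += delta
        if now - self.last_flush >= self.interval:
            self.flush_now(now)

    def flush_now(self, now):
        """Hand the accumulated deltas to the sink and start a new batch."""
        if self.pending:
            self.flush(dict(self.pending))
            self.pending.clear()
        self.last_flush = now


if __name__ == "__main__":
    batcher = CounterBatcher(print, interval=1.0)
    for t in (0.1, 0.4, 0.9, 1.2):
        batcher.add("com.cloudera.www", "likes", now=t)
```

Since `flush` is called synchronously here, the next batch cannot be flushed before the previous one has finished. A hot key still produces only one write per interval, which fits the note that duration is limited by key distribution.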
Counters
•   Store counters per Domain and per URL
    ‣ Leverage HBase increment (atomic read-modify-
      write) feature
•   Each row is one specific Domain or URL
•   The columns are the counters for specific metrics
•   Column families are used to group counters by time
    range
    ‣ Set time-to-live on CF level to auto-expire counters
      by age to save space, e.g., 2 weeks on “Daily
      Counters” family
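
A minimal in-memory model of this layout is shown below. It is a stand-in, not the HBase client API: `CounterTable` is a hypothetical class, a lock imitates the atomic read-modify-write, and expiry is simplified to "time since last write exceeds the family's time-to-live".

```python
import threading


class CounterTable:
    """In-memory stand-in for a counter table with per-family time-to-live."""

    def __init__(self, family_ttl):
        self.family_ttl = family_ttl    # family name -> TTL in seconds, or None
        self.cells = {}                 # (row, family, qualifier) -> (value, written_at)
        self.lock = threading.Lock()

    def _expired(self, family, written_at, now):
        ttl = self.family_ttl.get(family)
        return ttl is not None and now - written_at > ttl

    def increment(self, row, family, qualifier, amount, now):
        """Atomically add `amount` to one counter and return the new value."""
        key = (row, family, qualifier)
        with self.lock:
            value, written_at = self.cells.get(key, (0, now))
            if self._expired(family, written_at, now):
                value = 0
            value += amount
            self.cells[key] = (value, now)
            return value

    def get(self, row, family, qualifier, now):
        """Return the counter value, or None if it is missing or expired."""
        key = (row, family, qualifier)
        with self.lock:
            if key not in self.cells:
                return None
            value, written_at = self.cells[key]
            return None if self._expired(family, written_at, now) else value


if __name__ == "__main__":
    two_weeks = 14 * 24 * 60 * 60
    table = CounterTable({"daily": two_weeks, "lifetime": None})
    table.increment("com.cloudera.www", "daily", "1/1 Total", 1, now=0)
    table.increment("com.cloudera.www", "lifetime", "Total", 1, now=0)
    print(table.get("com.cloudera.www", "daily", "1/1 Total", now=two_weeks + 1))
    print(table.get("com.cloudera.www", "lifetime", "Total", now=two_weeks + 1))
```

The daily counter disappears on its own once it is older than two weeks, while the lifetime counter is kept.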
Key Design
•   Reversed Domains, e.g. “com.cloudera.www”, “com.cloudera.blog”
    ‣   Helps keep pages per site close together, as HBase efficiently scans
        blocks of sorted keys
•   Domain Row Key =
    MD5(Reversed Domain) + Reversed Domain
    ‣   Leading MD5 hash spreads keys randomly across all regions for
        load balancing reasons
    ‣   Only hashing the domain groups per site (and per subdomain if
        needed)
•   URL Row Key =
    MD5(Reversed Domain) + Reversed Domain + URL ID
    ‣   Unique ID per URL already available, make use of it
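
The two key formulas can be written out as a short sketch. The byte layout is an assumption, since the slide does not specify it: the raw 16-byte MD5 digest is used as the prefix and the URL ID is encoded as an 8-byte big-endian integer. The function names are illustrative.

```python
import hashlib


def reverse_domain(domain):
    """'www.cloudera.com' -> 'com.cloudera.www'"""
    return ".".join(reversed(domain.split(".")))


def domain_row_key(domain):
    """MD5(Reversed Domain) + Reversed Domain"""
    reversed_bytes = reverse_domain(domain).encode("utf-8")
    return hashlib.md5(reversed_bytes).digest() + reversed_bytes


def url_row_key(domain, url_id):
    """MD5(Reversed Domain) + Reversed Domain + URL ID"""
    # Assumption: the URL ID is a non-negative integer that fits in 8 bytes.
    return domain_row_key(domain) + url_id.to_bytes(8, "big")


if __name__ == "__main__":
    print(reverse_domain("www.cloudera.com"))
    print(domain_row_key("www.cloudera.com").hex())
    print(url_row_key("www.cloudera.com", 42).hex())
```

Because only the domain is hashed, every URL key of a domain shares the domain key as its prefix and the rows stay adjacent in sorted order, while different domains are spread across regions by the hash.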
Insights Schema
Row Key: Domain Row Key
Columns:
   Hourly Counters CF     6pm Total   6pm Male   6pm US   7pm ...
                          100         50         92       45
   Daily Counters CF      1/1 Total   1/1 Male   1/1 US   2/1 ...
                          1000        320        670      990
   Lifetime Counters CF   Total       Male       Female   US     ...
                          10000       6780       3220     9900

Row Key: URL Row Key
Columns:
   Hourly Counters CF     6pm Total   6pm Male   6pm US   7pm ...
                          10          5          9        4
   Daily Counters CF      1/1 Total   1/1 Male   1/1 US   2/1 ...
                          100         20         70       99
   Lifetime Counters CF   Total       Male       Female   US     ...
                          100         8          92       100
Complemental Design #4
• Add Stream Processing
  ‣ In-Memory
  ‣ Fault Tolerant
  ‣ Aggregations
• Bridges minutes/hours vs. months/years
[Diagram components: Internet, LAM(M)P, Storm, Flume, Hadoop, HBase]
Batch + Stream
•   Currently moves complexity into app layer
    ‣ Reads need to merge batch and stream results
•   Stream results can be dropped once data is
    persisted in batch layer
•   Stream might not be 100% correct, but good
    enough in most cases
    ‣ Eventual Accuracy
•   Latency vs. Throughput - best of both worlds
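
The read-side merge mentioned above might look like the sketch below. Both views are plain dictionaries and the function names are hypothetical. It assumes a batch run covers every event the stream layer has counted for the keys it reports, which a real system would need to track, for example by timestamp.

```python
def merged_count(batch_view, stream_view, key):
    """Read path: batch total plus whatever the stream layer has seen since."""
    return batch_view.get(key, 0) + stream_view.get(key, 0)


def persist_batch(batch_view, stream_view, recomputed):
    """After a batch run, adopt its totals and drop the covered stream deltas."""
    batch_view.update(recomputed)
    for key in recomputed:
        stream_view.pop(key, None)


if __name__ == "__main__":
    batch = {"com.cloudera.www": 100}
    stream = {"com.cloudera.www": 7}     # fast, possibly slightly off
    print(merged_count(batch, stream, "com.cloudera.www"))

    # The batch layer later recomputes the exact total from the logs.
    persist_batch(batch, stream, {"com.cloudera.www": 106})
    print(merged_count(batch, stream, "com.cloudera.www"))
```

The second read shows eventual accuracy: the approximate stream delta is replaced by the exact batch total once the data is persisted.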
Questions?




lars@cloudera.com
http://cloudera.com


