Realtime Analytics using
                    Hadoop & HBase
                                Lars George,
                      Solutions Architect @ Cloudera
                            lars@cloudera.com



Monday, July 25, 11
About Me

                      • Solutions Architect @ Cloudera
                      • Apache HBase & Whirr Committer
                      • Working with Hadoop & HBase since 2007
                      • Author of O’Reilly’s “HBase: The Definitive
                        Guide”



The Application Stack
                      • Solve Business Goals
                      • Rely on Proven Building Blocks
                      • Rapid Prototyping
                       ‣ Templates, MVC, Reference
                         Implementations
                      • Evolutionary Innovation Cycles
                                 “Let there be light!”
LAMP



                      L   Linux


                      A   Apache


                      M   MySQL


                      P   PHP/Perl




                      L   Linux


                      A   Apache


                      M   MySQL


                      M   Memcache


                      P   PHP/Perl


The Dawn of Big Data
                      •   Industry verticals produce a staggering amount of data
                      •   Not only web properties, but also “brick and mortar”
                          businesses
                          ‣   Smart Grid, Bioinformatics, Financial, Telco
                      •   Scalable computation frameworks allow analysis of all the data
                          ‣   No sampling anymore
                      •   Suitable algorithms derive even more data
                          ‣   Machine learning
                      •   “The Unreasonable Effectiveness of Data”
                          ‣   More data is better than smart algorithms

Hadoop

                      • HDFS + MapReduce
                      • Based on Google Papers
                      • Distributed Storage and Computation
                        Framework
                      • Affordable Hardware, Free Software
                      • Significant Adoption
HDFS
                      •   Reliably store petabytes of replicated data across
                          thousands of nodes
                          ‣ Data divided into 64MB blocks, each block replicated
                            three times
                      •   Master/Slave Architecture
                          ‣ Master NameNode contains metadata
                          ‣ Slave DataNodes manage blocks on the local file system
                      •   Built on “commodity” hardware
                          ‣ No 15k RPM disks or RAID required (nor wanted!)
                          ‣ Commodity Server Hardware
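
The block size and replication factor above translate into simple capacity arithmetic. A minimal sketch in Python, assuming the 64MB blocks and three replicas stated on this slide; the function name and the 1 TB example file are illustrative and not part of HDFS:

```python
import math

BLOCK_SIZE = 64 * 1024**2   # 64MB block size, as stated on the slide
REPLICATION = 3             # each block is stored three times


def hdfs_footprint(file_size_bytes):
    """Return (block count, raw bytes stored across the cluster) for one file."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    # The last block only occupies as much disk as it actually holds,
    # so raw usage is the file size times the replication factor.
    raw_bytes = file_size_bytes * REPLICATION
    return blocks, raw_bytes


blocks, raw = hdfs_footprint(1024**4)  # hypothetical 1 TB file
print(blocks, raw // 1024**4)          # block count, raw usage in TB
```

The block count matters mostly for the NameNode, which tracks every block as metadata, while the raw byte count is what the DataNode disks have to provide.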

MapReduce
                      • Distributed programming model to reliably
                        process petabytes of data
                      • Locality of data to processing is vital
                        ‣ Run code where data resides
                      • Inspired by map and reduce functions in
                        functional programming

          Input ➜ Map() ➜ Copy/Sort ➜ Reduce() ➜ Output
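
The pipeline above can be illustrated with a single-process word count. This is only a sketch of the programming model in plain Python, not the Hadoop API; all function names are made up for the example, and a real job would run the map and reduce steps in parallel on the nodes that hold the data:

```python
from collections import defaultdict


def map_phase(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Copy/Sort: group all values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce(): collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Input -> Map() -> Copy/Sort -> Reduce() -> Output, in one process."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return [reduce_phase(key, values) for key, values in shuffle(mapped)]


print(run_job(["hadoop stores data", "hbase serves data"]))
# [('data', 2), ('hadoop', 1), ('hbase', 1), ('serves', 1), ('stores', 1)]
```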
From Short to Long Term
                      Internet


                      LAM(M)P
                                 • Serves the Client
                                 • Stores Intermediate Data

                      Hadoop
                                 • Background Batch Processing
                                 • Stores Long-Term Data

Batch Processing
                      •   Scale is Unlimited
                          ‣ Bound only by Hardware
                      •   Harness the Power of the Cluster
                          ‣ CPUs, Disks, Memory

                      •   Disks extend Memory
                          ‣ Spills represent Swapping

                      •   Trade Size Limitations for Time
                          ‣ Jobs run from a few minutes to hours or days
From Batch to Realtime
                      •   “Time is Money”
                      •   Bridging the gap between batch and “now”
                      •   Realtime often means “faster than batch”
                      •   80/20 Rule
                          ‣ Hadoop solves the 80% easily
                          ‣ The remaining 20% takes 80% of the
                            effort
                      •   Get as close as possible, but don’t overdo it!

Stop Gap Solutions
                      •   In Memory
                          ‣   Memcached
                          ‣   Membase
                          ‣   GigaSpaces
                      •   Relational Databases
                          ‣   MySQL
                          ‣   PostgreSQL
                      •   NoSQL
                          ‣   Cassandra
                          ‣   HBase

HBase Architecture




Client Access




Auto Sharding




Distribution




HBase Key Design




Key Cardinality




Fold, Store, and Shift




Complemental Design #1
                      • Keep Backup in HDFS
                      • MapReduce over HDFS
                      • Synchronize HBase
                        ‣ Batch Puts
                        ‣ Bulk Import

                      [Diagram: Internet, LAM(M)P, Hadoop, HBase]



Complemental Design #2
                      • Add Log Support
                      • Synchronize HBase
                        ‣ Batch Puts
                        ‣ Bulk Import

                      [Diagram: Internet, LAM(M)P, Flume, Hadoop, HBase]


Mitigation Planning
                      • Reliable storage has top priority
                      • Disaster Recovery
                      • HBase Backups
                        ‣ Export - but what if HBase is “down”?
                        ‣ CopyTable - same issue
                        ‣ Snapshots - not available

Complemental Design #3
                      • Add Log Processing
                      • Remove Direct Connection
                      • Synchronize HBase
                        ‣ Batch Puts
                        ‣ Bulk Import

                      [Diagram: Internet, LAM(M)P, Flume, Hadoop, Log Proc, HBase]


Facebook Insights

                      • > 20B Events per Day
                      • 1M Counter Updates per Second
                        ‣ 100-Node Cluster
                        ‣ 10K OPS per Node


                      Web ➜ Scribe ➜ Ptail ➜ Puma ➜ HBase
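
The throughput figures above are consistent with each other, which a few lines of arithmetic confirm. A small sketch in Python using only the numbers from this slide; the event rate is an average over a full day and treats "> 20B" as a lower bound:

```python
EVENTS_PER_DAY = 20_000_000_000       # "> 20B events per day" (lower bound)
COUNTER_UPDATES_PER_SEC = 1_000_000   # 1M counter updates per second
NODES = 100                           # cluster size

SECONDS_PER_DAY = 24 * 60 * 60

# Average incoming event rate, assuming events are spread evenly over the day.
events_per_sec = EVENTS_PER_DAY / SECONDS_PER_DAY

# Counter updates each node has to absorb if load is spread evenly.
updates_per_node = COUNTER_UPDATES_PER_SEC / NODES

print(round(events_per_sec))   # roughly 231,000 events per second
print(updates_per_node)        # 10000.0, matching "10K OPS per Node"
```

The per-node figure only holds if row keys distribute the load evenly across regions.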

Collection Layer

                      • “Like” button triggers AJAX request
                      • Event written to log file using Scribe
                        ‣ Handles aggregation, delivery, file rollover, etc.
                        ‣ Uses HDFS to store files
                      ✓ Use Flume or Scribe

Filter Layer
                      • Ptail “follows” logs written by Scribe
                      • Aggregates from multiple logs
                      • Separates into event types
                        ‣ Sharding for future growth
                      • Facebook internal tool
                      ✓ Use Flume

Batching Layer
                      • Puma batches updates
                        ‣ 1 sec, staggered
                      • Flush batch when the previous one is done
                      • Duration limited by key distribution
                      • Facebook internal tool
                      ✓ Use Coprocessors (0.92.0)
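
The idea behind batching can be sketched as follows: increments for the same counter are summed in memory and written as one combined update per counter when the batch is flushed. This is a hypothetical illustration in Python, not Puma's implementation (Puma is a Facebook internal tool); `CounterBatcher`, `sink`, and `apply_increment` are made-up names, and the one-second timer that would trigger `flush()` is left out:

```python
from collections import Counter


class CounterBatcher:
    """Collapse many single increments into one update per counter.

    Hypothetical sketch: `sink` stands in for whatever applies the
    combined increments to the store (e.g. an HBase increment call).
    """

    def __init__(self, sink):
        self.sink = sink
        self.pending = Counter()

    def add(self, row, column, amount=1):
        # Updates to the same counter within a batch are summed in memory.
        self.pending[(row, column)] += amount

    def flush(self):
        # Hand the batch over and start a new one; returns updates sent.
        batch, self.pending = self.pending, Counter()
        for (row, column), amount in batch.items():
            self.sink(row, column, amount)
        return len(batch)


store = Counter()  # stand-in for the counter table


def apply_increment(row, column, amount):
    store[(row, column)] += amount


batcher = CounterBatcher(apply_increment)
for _ in range(1000):
    batcher.add("com.cloudera.www", "daily:total")
batcher.add("com.cloudera.www", "daily:us", 5)

sent = batcher.flush()
print(sent, store[("com.cloudera.www", "daily:total")])  # 2 1000
```

Here 1001 incoming events result in only two writes. The more events hit the same key within a batch, the greater the saving, which is one way to read "duration limited by key distribution".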

Counters
                      •   Store counters per Domain and per URL
                          ‣ Leverage the HBase increment (atomic
                            read-modify-write) feature
                      •   Each row is one specific Domain or URL
                      •   The columns are the counters for specific metrics
                      •   Column families are used to group counters by time
                          range
                          ‣ Set time-to-live on CF level to auto-expire counters
                            by age to save space, e.g., 2 weeks on “Daily
                            Counters” family
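
The counter layout above can be modelled in a few lines. This is a toy in-memory sketch in Python, not the HBase client API: `CounterTable` and its methods are hypothetical, a plain dictionary replaces the region server, and expiry is checked at read time based on the time of the last write to a cell:

```python
import time

TWO_WEEKS = 14 * 24 * 3600  # TTL of the "Daily Counters" family, in seconds


class CounterTable:
    """Toy in-memory model of a counter table; not the HBase client API."""

    def __init__(self, family_ttls):
        # family name -> time-to-live in seconds (None = keep forever)
        self.family_ttls = family_ttls
        self.cells = {}  # (row, family, column) -> (value, last write time)

    def get(self, row, family, column, now=None):
        now = time.time() if now is None else now
        cell = self.cells.get((row, family, column))
        if cell is None:
            return None
        value, written = cell
        ttl = self.family_ttls[family]
        if ttl is not None and now - written > ttl:
            return None  # cell has aged out of its column family
        return value

    def increment(self, row, family, column, amount=1, now=None):
        now = time.time() if now is None else now
        # Read-modify-write; an expired or missing counter starts at zero.
        current = self.get(row, family, column, now) or 0
        self.cells[(row, family, column)] = (current + amount, now)
        return current + amount


table = CounterTable({"daily": TWO_WEEKS, "lifetime": None})
table.increment("com.cloudera.www", "daily", "1/1 Total", now=0)
table.increment("com.cloudera.www", "lifetime", "Total", now=0)

later = TWO_WEEKS + 1
print(table.get("com.cloudera.www", "daily", "1/1 Total", now=later))  # None
print(table.get("com.cloudera.www", "lifetime", "Total", now=later))   # 1
```

Because each daily column is tied to one date, it stops receiving writes once the day is over and ages out two weeks later, while the lifetime family is kept indefinitely.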

Key Design
           • Reversed Domains, e.g., “com.cloudera.www”, “com.cloudera.blog”
             ‣ Helps keep pages per site close together, as HBase efficiently scans blocks
               of sorted keys
           • Domain Row Key =
             MD5(Reversed Domain) + Reversed Domain
             ‣ Leading MD5 hash spreads keys randomly across all regions for
               load balancing
             ‣ Hashing only the domain keeps keys grouped per site (and per subdomain if
               needed)
           • URL Row Key =
             MD5(Reversed Domain) + Reversed Domain + URL ID
             ‣ Unique ID per URL already available, make use of it
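
The two row key formulas can be written out as a short sketch. The Python below follows the formulas on this slide, but the byte-level details are assumptions: the raw 16-byte MD5 digest is used as the prefix (a hex string would work as well), and the URL ID is encoded as an 8-byte big-endian integer. The function names are illustrative:

```python
import hashlib


def reverse_domain(domain):
    """Turn 'www.cloudera.com' into 'com.cloudera.www'."""
    return ".".join(reversed(domain.split(".")))


def domain_row_key(domain):
    """MD5(Reversed Domain) + Reversed Domain."""
    reversed_domain = reverse_domain(domain).encode("utf-8")
    # Assumption: the raw 16-byte digest serves as the key prefix.
    return hashlib.md5(reversed_domain).digest() + reversed_domain


def url_row_key(domain, url_id):
    """MD5(Reversed Domain) + Reversed Domain + URL ID."""
    # Assumption: the URL ID is a non-negative integer that fits in 8 bytes.
    return domain_row_key(domain) + url_id.to_bytes(8, "big")


site = domain_row_key("www.cloudera.com")
page_a = url_row_key("www.cloudera.com", 1)
page_b = url_row_key("www.cloudera.com", 2)

# All URL rows of a site share the domain key as a prefix, so they sort
# next to each other and can be read with a single prefix scan.
print(page_a.startswith(site), page_b.startswith(site))  # True True
print(site < page_a < page_b)                            # True
```

Different domains get unrelated hash prefixes and therefore land in different parts of the key space, which is what spreads the load across regions.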

Insights Schema
  Row Key: Domain Row Key
  Columns:
    Hourly Counters CF         6pm Total    6pm Male      6pm US     7pm ...   ...
                                     100          50          92          45
    Daily Counters CF          1/1 Total    1/1 Male      1/1 US     2/1 ...   ...
                                    1000         320         670         990
    Lifetime Counters CF           Total        Male      Female          US   ...
                                   10000        6780        3220        9900

  Row Key: URL Row Key
  Columns:
    Hourly Counters CF         6pm Total    6pm Male      6pm US     7pm ...   ...
                                      10           5           9           4
    Daily Counters CF          1/1 Total    1/1 Male      1/1 US     2/1 ...   ...
                                     100          20          70          99
    Lifetime Counters CF           Total        Male      Female          US   ...
                                     100           8          92         100




Summary
                      • Design for Use-Case
                       ‣ Read, Write, or Both?
                      • Avoid Hotspotting
                       ‣ Region and Table
                      • Manage Automation at Scale
                       ‣ For now!

Questions?


                      lars@cloudera.com
                      http://cloudera.com





More Related Content

What's hot

2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
Adam Muise
 
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBaseHBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon
 
Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks
Hortonworks
 
Geo-based content processing using hbase
Geo-based content processing using hbaseGeo-based content processing using hbase
Geo-based content processing using hbase
Ravi Veeramachaneni
 
NoSQL Needs SomeSQL
NoSQL Needs SomeSQLNoSQL Needs SomeSQL
NoSQL Needs SomeSQL
DataWorks Summit
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production SuccessAllen Day, PhD
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
Vitthal Gogate
 
Hadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataHadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big Data
WANdisco Plc
 
A Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesA Survey of HBase Application Archetypes
A Survey of HBase Application Archetypes
HBaseCon
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
huguk
 
Content Identification using HBase
Content Identification using HBaseContent Identification using HBase
Content Identification using HBase
HBaseCon
 
Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo ScaleManaging Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
DataWorks Summit/Hadoop Summit
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenches
DataWorks Summit
 
Scaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedInScaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedIn
DataWorks Summit
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
Swiss Big Data User Group
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101
EMC
 
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarWhy Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarCloudera, Inc.
 
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
DataWorks Summit/Hadoop Summit
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
Cloudera, Inc.
 

What's hot (19)

2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
 
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBaseHBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
 
Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks
 
Geo-based content processing using hbase
Geo-based content processing using hbaseGeo-based content processing using hbase
Geo-based content processing using hbase
 
NoSQL Needs SomeSQL
NoSQL Needs SomeSQLNoSQL Needs SomeSQL
NoSQL Needs SomeSQL
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 
Hadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataHadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big Data
 
A Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesA Survey of HBase Application Archetypes
A Survey of HBase Application Archetypes
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Content Identification using HBase
Content Identification using HBaseContent Identification using HBase
Content Identification using HBase
 
Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo ScaleManaging Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenches
 
Scaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedInScaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedIn
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101
 
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarWhy Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
 
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
 

Viewers also liked

Kafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtimeKafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtime
Guido Schmutz
 
Hadoop Summit 2012 | HBase Consistency and Performance Improvements
Hadoop Summit 2012 | HBase Consistency and Performance ImprovementsHadoop Summit 2012 | HBase Consistency and Performance Improvements
Hadoop Summit 2012 | HBase Consistency and Performance Improvements
Cloudera, Inc.
 
HBaseCon 2012 | Real-time Analytics with HBase - Sematext
HBaseCon 2012 | Real-time Analytics with HBase - SematextHBaseCon 2012 | Real-time Analytics with HBase - Sematext
HBaseCon 2012 | Real-time Analytics with HBase - Sematext
Cloudera, Inc.
 
HBaseCon 2013: Deal Personalization Engine with HBase @ Groupon
HBaseCon 2013: Deal Personalization Engine with HBase @ GrouponHBaseCon 2013: Deal Personalization Engine with HBase @ Groupon
HBaseCon 2013: Deal Personalization Engine with HBase @ Groupon
Cloudera, Inc.
 
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second...
HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second...
Cloudera, Inc.
 
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, SalesforceHBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
Cloudera, Inc.
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopDataWorks Summit
 
Exploring BigData with Google BigQuery
Exploring BigData with Google BigQueryExploring BigData with Google BigQuery
Exploring BigData with Google BigQuery
Dharmesh Vaya
 

Viewers also liked (8)

Kafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtimeKafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtime
 
Hadoop Summit 2012 | HBase Consistency and Performance Improvements
Hadoop Summit 2012 | HBase Consistency and Performance ImprovementsHadoop Summit 2012 | HBase Consistency and Performance Improvements
Hadoop Summit 2012 | HBase Consistency and Performance Improvements
 
HBaseCon 2012 | Real-time Analytics with HBase - Sematext
HBaseCon 2012 | Real-time Analytics with HBase - SematextHBaseCon 2012 | Real-time Analytics with HBase - Sematext
HBaseCon 2012 | Real-time Analytics with HBase - Sematext
 
HBaseCon 2013: Deal Personalization Engine with HBase @ Groupon
HBaseCon 2013: Deal Personalization Engine with HBase @ GrouponHBaseCon 2013: Deal Personalization Engine with HBase @ Groupon
HBaseCon 2013: Deal Personalization Engine with HBase @ Groupon
 
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second...
HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second...
 
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, SalesforceHBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
 
Exploring BigData with Google BigQuery
Exploring BigData with Google BigQueryExploring BigData with Google BigQuery
Exploring BigData with Google BigQuery
 

Similar to Realtime Analytics with Hadoop and HBase

Firebird meets NoSQL
Firebird meets NoSQLFirebird meets NoSQL
Firebird meets NoSQL
Mind The Firebird
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Esther Kundin
 
001 hbase introduction
001 hbase introduction001 hbase introduction
001 hbase introductionScott Miao
 
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...Qian Lin
 
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Mark Kerzner
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoop
inside-BigData.com
 
Make Life Suck Less (Building Scalable Systems)
Make Life Suck Less (Building Scalable Systems)Make Life Suck Less (Building Scalable Systems)
Make Life Suck Less (Building Scalable Systems)
guest0f8e278
 
Lviv EDGE 2 - NoSQL
Lviv EDGE 2 - NoSQLLviv EDGE 2 - NoSQL
Lviv EDGE 2 - NoSQL
zenyk
 
Hive at booking
Hive at bookingHive at booking
Hive at booking
David Morel
 
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
Cloudera, Inc.
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
cdmaxime
 
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Krishnan Parasuraman
 
Big Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemBig Data and Hadoop Ecosystem
Big Data and Hadoop Ecosystem
Rajkumar Singh
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
Adaryl "Bob" Wakefield, MBA
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Andrew Brust
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft PlatformJesus Rodriguez
 
Intro to NoSQL and MongoDB
Intro to NoSQL and MongoDBIntro to NoSQL and MongoDB
Intro to NoSQL and MongoDBDATAVERSITY
 
Fb talk arch_summit
Fb talk arch_summitFb talk arch_summit
Fb talk arch_summitdrewz lin
 
MAD Skills: New Analysis Practices for Big Data
MAD Skills: New Analysis Practices for Big DataMAD Skills: New Analysis Practices for Big Data
MAD Skills: New Analysis Practices for Big Data
Christan Grant
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
Zohar Elkayam
 

Similar to Realtime Analytics with Hadoop and HBase (20)

Firebird meets NoSQL
Firebird meets NoSQLFirebird meets NoSQL
Firebird meets NoSQL
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
 
001 hbase introduction
001 hbase introduction001 hbase introduction
001 hbase introduction
 
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
 
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoop
 
Make Life Suck Less (Building Scalable Systems)
Make Life Suck Less (Building Scalable Systems)Make Life Suck Less (Building Scalable Systems)
Make Life Suck Less (Building Scalable Systems)
 
Lviv EDGE 2 - NoSQL
Lviv EDGE 2 - NoSQLLviv EDGE 2 - NoSQL
Lviv EDGE 2 - NoSQL
 
Hive at booking
Hive at bookingHive at booking
Hive at booking
 
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
 
Big Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemBig Data and Hadoop Ecosystem
Big Data and Hadoop Ecosystem
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 
Intro to NoSQL and MongoDB
Intro to NoSQL and MongoDBIntro to NoSQL and MongoDB
Intro to NoSQL and MongoDB
 
Fb talk arch_summit
Fb talk arch_summitFb talk arch_summit
Fb talk arch_summit
 
MAD Skills: New Analysis Practices for Big Data
MAD Skills: New Analysis Practices for Big DataMAD Skills: New Analysis Practices for Big Data
MAD Skills: New Analysis Practices for Big Data
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 

More from larsgeorge

HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practice
larsgeorge
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
larsgeorge
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Realtime Analytics with Hadoop and HBase

  • 1. Realtime Analytics using Hadoop & HBase
      Lars George, Solutions Architect @ Cloudera
      lars@cloudera.com
      Monday, July 25, 11
  • 2. About Me
      • Solutions Architect @ Cloudera
      • Apache HBase & Whirr Committer
      • Working with Hadoop & HBase since 2007
      • Author of O’Reilly’s “HBase - The Definitive Guide”
  • 3. The Application Stack
      • Solve Business Goals
      • Rely on Proven Building Blocks
      • Rapid Prototyping
        ‣ Templates, MVC, Reference Implementations
      • Evolutionary Innovation Cycles
      “Let there be light!”
  • 5. LAMP
      L   Linux
      A   Apache
      M   MySQL
      P   PHP/Perl
  • 6. LAMMP
      L   Linux
      A   Apache
      M   MySQL
      M   Memcache
      P   PHP/Perl
  • 7. The Dawn of Big Data
      • Industry verticals produce a staggering amount of data
      • Not only web properties, but also “brick and mortar” businesses
        ‣ Smart Grid, Bio Informatics, Financial, Telco
      • Scalable computation frameworks allow analysis of all the data
        ‣ No sampling anymore
      • Suitable algorithms derive even more data
        ‣ Machine learning
      • “The Unreasonable Effectiveness of Data”
        ‣ More data is better than smart algorithms
  • 8. Hadoop
      • HDFS + MapReduce
      • Based on Google Papers
      • Distributed Storage and Computation Framework
      • Affordable Hardware, Free Software
      • Significant Adoption
  • 9. HDFS
      • Reliably store petabytes of replicated data across thousands of nodes
        ‣ Data divided into 64MB blocks, each block replicated three times
      • Master/Slave Architecture
        ‣ Master NameNode contains meta data
        ‣ Slave DataNode manages block on local file system
      • Built on “commodity” hardware
        ‣ No 15k RPM disks or RAID required (nor wanted!)
        ‣ Commodity Server Hardware
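The block size and replication factor on the HDFS slide translate directly into a storage footprint. The following is a back-of-envelope sketch in Python, not HDFS code: the function name `hdfs_footprint` and the 1000 MB example file are made up for illustration, and only the 64 MB block size and three-way replication come from the slide.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted on the slide
REPLICATION = 3      # replicas per block quoted on the slide


def hdfs_footprint(file_size_mb):
    """Return (block count, total block replicas, raw MB consumed) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION
    # A partially filled last block only occupies as much disk as it holds,
    # so raw usage is the file size times the replication factor.
    raw_mb = file_size_mb * REPLICATION
    return blocks, replicas, raw_mb


# Hypothetical 1000 MB file: 16 blocks, 48 block replicas, 3000 MB of raw disk.
print(hdfs_footprint(1000))
```

The takeaway for sizing is that usable capacity is roughly one third of raw disk capacity at this replication factor.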
  • 10. MapReduce
      • Distributed programming model to reliably process petabytes of data
      • Locality of data to processing is vital
        ‣ Run code where data resides
      • Inspired by map and reduce functions in functional programming
      Input ➜ Map() ➜ Copy/Sort ➜ Reduce() ➜ Output
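The Input ➜ Map() ➜ Copy/Sort ➜ Reduce() ➜ Output flow can be imitated in a single process to show what each stage contributes. This is a toy word count in plain Python, not the Hadoop API; the function names and the two input lines are invented for the example, and nothing here is distributed or data-local.

```python
from collections import defaultdict


def map_phase(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Copy/Sort: group all values by key and hand keys over in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce(): collapse all values seen for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return [reduce_phase(key, values) for key, values in shuffle(mapped)]


result = run_job(["hbase stores data", "hadoop stores data"])
print(result)  # [('data', 2), ('hadoop', 1), ('hbase', 1), ('stores', 2)]
```

In a real cluster the map calls run on the nodes that hold the input blocks, and the copy/sort step is the only stage that moves data across the network.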
  • 11. From Short to Long Term
      Internet
      LAM(M)P
        • Serves the Client
        • Stores Intermediate Data
      Hadoop
        • Background Batch Processing
        • Stores Long-Term Data
  • 12. Batch Processing
      • Scale is Unlimited
        ‣ Bound only by Hardware
      • Harness the Power of the Cluster
        ‣ CPUs, Disks, Memory
      • Disks extend Memory
        ‣ Spills represent Swapping
      • Trade Size Limitations with Time
        ‣ Jobs run for a few minutes to hours, days
  • 13. From Batch to Realtime
      • “Time is Money”
      • Bridging the gap between batch and “now”
      • Realtime often means “faster than batch”
      • 80/20 Rule
        ‣ Hadoop solves the 80% easily
        ‣ The remaining 20% is taking 80% of the effort
      • Go as close as possible, don’t overdo it!
  • 14. Stop Gap Solutions
      • In Memory
        ‣ Memcached
        ‣ MemBase
        ‣ GigaSpaces
      • Relational Databases
        ‣ MySQL
        ‣ PostgreSQL
      • NoSQL
        ‣ Cassandra
        ‣ HBase
  • 21. Fold, Store, and Shift
  • 22. Complemental Design #1
      • Keep Backup in HDFS
      • MapReduce over HDFS
      • Synchronize HBase
        ‣ Batch Puts
        ‣ Bulk Import
      [Diagram: Internet, LAM(M)P, Hadoop, HBase]
  • 23. Complemental Design #2
      • Add Log Support
      • Synchronize HBase
        ‣ Batch Puts
        ‣ Bulk Import
      [Diagram: Internet, LAM(M)P, Flume, Hadoop, HBase]
  • 24. Mitigation Planning
      • Reliable storage has top priority
      • Disaster Recovery
      • HBase Backups
        ‣ Export - but what if HBase is “down”
        ‣ CopyTable - same issue
        ‣ Snapshots - not available
  • 25. Complemental Design #3
      • Add Log Processing
      • Remove Direct Connection
      • Synchronize HBase
        ‣ Batch Puts
        ‣ Bulk Import
      [Diagram: Internet, LAM(M)P, Flume, Log Proc, Hadoop, HBase]
  • 26. Facebook Insights
      • > 20B Events per Day
      • 1M Counter Updates per Second
        ‣ 100 Nodes Cluster
        ‣ 10K OPS per Node
      Web ➜ Scribe ➜ Ptail ➜ Puma ➜ HBase
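The throughput figures on the Facebook Insights slide are consistent with each other, which a few lines of arithmetic confirm. The sketch below uses only the numbers quoted on the slide; treating "> 20B" as exactly 20 billion gives a lower bound, and the daily average says nothing about peak load.

```python
EVENTS_PER_DAY = 20_000_000_000   # "> 20B events per day", used as a lower bound
NODES = 100                       # cluster size from the slide
OPS_PER_NODE = 10_000             # "10K OPS per node"
SECONDS_PER_DAY = 24 * 60 * 60

# 100 nodes at 10K operations each account for the quoted 1M updates per second.
cluster_ops_per_sec = NODES * OPS_PER_NODE

# Average event arrival rate if the daily volume were spread evenly.
avg_events_per_sec = EVENTS_PER_DAY / SECONDS_PER_DAY

print(cluster_ops_per_sec)        # 1000000
print(round(avg_events_per_sec))  # 231481
```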
  • 27. Collection Layer
      • “Like” button triggers AJAX request
      • Event written to log file using Scribe
        ‣ Handles aggregation, delivery, file roll over, etc.
        ‣ Uses HDFS to store files
      ✓ Use Flume or Scribe
  • 28. Filter Layer
      • Ptail “follows” logs written by Scribe
      • Aggregates from multiple logs
      • Separates into event types
        ‣ Sharding for future growth
      • Facebook internal tool
      ✓ Use Flume
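Ptail is described as a Facebook internal tool, so its implementation is not available here. The sketch below only illustrates the idea from the Filter Layer slide: merge several log streams, separate events by type, and assign each event to a shard. The tab-separated line format, the shard count, and the names `shard_for` and `route` are all assumptions made for this example.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count


def shard_for(key, num_shards=NUM_SHARDS):
    """Pick a shard with a hash that is stable across runs and processes."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def route(log_streams):
    """Merge several log streams and bucket events by (event type, shard)."""
    buckets = defaultdict(list)
    for stream in log_streams:
        for line in stream:
            # Assumed line format: "<event type>\t<url>"
            event_type, url = line.strip().split("\t")
            buckets[(event_type, shard_for(url))].append(url)
    return buckets


streams = [
    ["like\thttp://example.com/a", "share\thttp://example.com/b"],
    ["like\thttp://example.com/a", "like\thttp://example.com/c"],
]
for (event_type, shard), urls in sorted(route(streams).items()):
    print(event_type, shard, urls)
```

Because the shard depends only on the URL, every event for one URL lands in the same bucket no matter which log it arrived in, which is what lets a downstream batching step aggregate it.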
  • 29. Batching Layer
      • Puma batches updates
        ‣ 1 sec, staggered
      • Flush batch, when last is done
      • Duration limited by key distribution
      • Facebook internal tool
      ✓ Use Coprocessors (0.92.0)
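Puma is also an internal tool, so the following is only a minimal sketch of the batching idea on the slide: fold increments for the same key in memory and write them out as one update per key roughly once per second. The class name `CounterBatcher`, the `flush_fn` callback, and the explicit `now` argument (used instead of a wall clock so the example is deterministic) are assumptions; staggering of flushes across batchers is not modeled.

```python
from collections import Counter


class CounterBatcher:
    """Fold increments in memory and flush them as one update per key."""

    def __init__(self, flush_fn, interval_sec=1.0):
        self.flush_fn = flush_fn          # called with {key: summed amount}
        self.interval_sec = interval_sec  # batch window, 1 second on the slide
        self.pending = Counter()
        self.window_start = None

    def add(self, key, amount, now):
        if self.window_start is None:
            self.window_start = now
        self.pending[key] += amount
        if now - self.window_start >= self.interval_sec:
            self.flush()

    def flush(self):
        # The flush is synchronous, so the next window only starts
        # once the previous batch has been handed off.
        if self.pending:
            self.flush_fn(dict(self.pending))
        self.pending = Counter()
        self.window_start = None


flushed = []
batcher = CounterBatcher(flushed.append)
batcher.add("com.cloudera.www", 1, now=0.0)
batcher.add("com.cloudera.www", 1, now=0.4)
batcher.add("com.cloudera.blog", 1, now=0.7)
batcher.add("com.cloudera.www", 1, now=1.1)  # window elapsed, triggers the flush
print(flushed)
```

Four incoming events become two counter updates here. The more skewed the key distribution, the more a batch collapses, which is one reading of "duration limited by key distribution".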
  • 30. Counters
      • Store counters per Domain and per URL
        ‣ Leverage HBase increment (atomic read-modify-write) feature
      • Each row is one specific Domain or URL
      • The columns are the counters for specific metrics
      • Column families are used to group counters by time range
        ‣ Set time-to-live on CF level to auto-expire counters by age to save space, e.g., 2 weeks on “Daily Counters” family
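The counter layout can be modeled in memory to show how rows, column families, columns, and per-family time-to-live fit together. This is a single-threaded Python stand-in, not the HBase client API: in HBase the increment is applied atomically on the server, which this model does not demonstrate. The family names, the column qualifier, and every TTL except the two-week daily one are assumptions.

```python
from collections import defaultdict

DAY = 24 * 3600

# TTL per column family, in seconds. Only the 2-week daily TTL is from the slide.
TTL_SECONDS = {
    "hourly": 2 * DAY,    # assumption
    "daily": 14 * DAY,    # "2 weeks on Daily Counters family"
    "lifetime": None,     # never expires
}


class CounterTable:
    def __init__(self):
        # row key -> (family, qualifier) -> (value, last write time)
        self.rows = defaultdict(dict)

    def get(self, row, family, qualifier, now):
        cell = self.rows[row].get((family, qualifier))
        if cell is None:
            return None
        value, written = cell
        ttl = TTL_SECONDS[family]
        if ttl is not None and now - written > ttl:
            return None  # cell has aged out of its column family
        return value

    def increment(self, row, family, qualifier, amount, now):
        # Read-modify-write in one call; an expired cell restarts from zero.
        current = self.get(row, family, qualifier, now) or 0
        self.rows[row][(family, qualifier)] = (current + amount, now)
        return current + amount


table = CounterTable()
row = "com.cloudera.www"  # row key simplified; the hash prefix is omitted here
table.increment(row, "daily", "1/1:total", 1, now=0)
table.increment(row, "daily", "1/1:total", 5, now=10)
table.increment(row, "lifetime", "total", 6, now=10)

print(table.get(row, "daily", "1/1:total", now=20))           # 6
print(table.get(row, "daily", "1/1:total", now=15 * DAY))     # None, older than 2 weeks
print(table.get(row, "lifetime", "total", now=15 * DAY))      # 6
```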
  • 31. Key Design
      • Reversed Domains, e.g., “com.cloudera.www”, “com.cloudera.blog”
        ‣ Helps keep pages per site close, as HBase efficiently scans blocks of sorted keys
      • Domain Row Key = MD5(Reversed Domain) + Reversed Domain
        ‣ Leading MD5 hash spreads keys randomly across all regions for load balancing reasons
        ‣ Hashing only the domain keeps rows grouped per site (and per subdomain if needed)
      • URL Row Key = MD5(Reversed Domain) + Reversed Domain + URL ID
        ‣ Unique ID per URL already available, make use of it
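The two row key formulas can be written out with Python's standard `hashlib`. This is a sketch under assumptions the slide does not settle: the hash is used as the raw 16-byte digest rather than hex text, domains are UTF-8 encoded, and the URL ID is a non-negative integer stored as 8 big-endian bytes. The function names are invented for the example.

```python
import hashlib


def reverse_domain(domain):
    """'www.cloudera.com' -> 'com.cloudera.www'"""
    return ".".join(reversed(domain.split(".")))


def domain_row_key(domain):
    """MD5(Reversed Domain) + Reversed Domain"""
    reversed_domain = reverse_domain(domain).encode("utf-8")
    return hashlib.md5(reversed_domain).digest() + reversed_domain


def url_row_key(domain, url_id):
    """MD5(Reversed Domain) + Reversed Domain + URL ID"""
    # Fixed-width big-endian bytes keep numeric order under byte-wise sorting.
    return domain_row_key(domain) + url_id.to_bytes(8, "big")


www = domain_row_key("www.cloudera.com")
url_keys = [url_row_key("www.cloudera.com", n) for n in (2, 1)]

print(reverse_domain("blog.cloudera.com"))            # com.cloudera.blog
print(len(www))                                       # 32: 16 hash bytes + 16 domain bytes
print(all(key.startswith(www) for key in url_keys))   # True
```

Because every URL key of a site begins with that site's domain key, a scan over the shared prefix returns the site's URL rows together, while the hash prefix decides where in the overall key space the site lands.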
  • 32. Insights Schema
      Row Key: Domain Row Key
        Hourly Counters CF      6pm Total   6pm Male   6pm US   7pm ...
                                100         50         92       45
        Daily Counters CF       1/1 Total   1/1 Male   1/1 US   2/1 ...
                                1000        320        670      990
        Lifetime Counters CF    Total       Male       Female   US
                                10000       6780       3220     9900
      Row Key: URL Row Key
        Hourly Counters CF      6pm Total   6pm Male   6pm US   7pm ...
                                10          5          9        4
        Daily Counters CF       1/1 Total   1/1 Male   1/1 US   2/1 ...
                                100         20         70       99
        Lifetime Counters CF    Total       Male       Female   US
                                100         8          92       100
  • 33. Summary
      • Design for Use-Case
        ‣ Read, Write, or Both?
      • Avoid Hotspotting
        ‣ Region and Table
      • Manage Automatism at Scale
        ‣ For now!
  • 36. Questions?
      lars@cloudera.com
      http://cloudera.com