SlideShare a Scribd company logo
1 of 24
Analyzing the internet
using Hadoop and Hbase

               Friso van Vollenhoven
               fvanvollenhoven@xebia.com
+


Friso van Vollenhoven
fvanvollenhoven@xebia.com
+



             =      One of five Regional Internet
                    Registrars




  Allocate IP address ranges to organizations
      (e.g. ISPs, network operators, etc.)

   Provide near real time network
information based on measurements
WEB 5.0 SERVICE

            PowerBook G4




   THE
INTERNET
WEB 5.0 SERVICE
                 AS

              T-Mobile                                                        PowerBook G4




                            BG
 AS                             P
                                                AS
KPN
                                              AMS-IX

                 BGP
       BG
         P




                                                            BG
       BG




         P                                                       P
                            B   GP


                 AS                                                      AS

               Level3                                                XS4All

                                 P
        BGP

                                BG
  AS

 DT                                  BG
                                          P
                   BGP




                                                       AS

                       AS                            BT
                 AT&T
Our Data
  BGP4MP|980099497|A|193.148.15.68|3333|192.37.0.0/16|3333 5378 286 1836|IGP|193.148.15.140|0|0||NAG||




                                  I can take traffic for anything in the
                                    range 192.37.0.0 - 192.37.255.255
                                   and deliver it through the network
                                    route 3333 -> 5378 -> 286 -> 1836


                                                                      Got
                              ROUTER
                                                                      it!

                AS3333                                               ROUTER
                                                                              PEER
                                           Thanks!
                                                                               AS
                                               RIPE NCC
                                               ROUTER
                                                              PEER
                                                               AS
(not really a router!)
Our Data

• 15 active data collection points (the not
  really routers), collecting updates from
  ~600 peers
• BGP updates every 5 minutes, ~80M / day
• Routing table dumps every 8 hours, ~10M
  entries / dump (160M index updates)
FAQ
•   What are all updates that are somehow related to
    AS3333 for the past two weeks?
•   How many updates per second were announcing
    prefixes originated by AS286 between 11:00 and
    12:00 on February 2nd in 2005?
•   Which AS adjacencies have existed for AS3356 in
    the past year?
•   What was the size of the routing table of the peer
    router at 193.148.15.140 two years ago?
Not so frequently asked:
What happened in Egypt?




     http://stat.ripe.net/egypt
THE INTERNET


                                                          DATA COLLECTOR
                         DATA COLLECTOR




                                 DATA COLLECTOR                 DATA COLLECTOR                ONLINE QUERY SYSTEM
                                                                                         -+                             RIPE.net                                              +
                                                                                                                                                      Home | Login | Register |
                                                                                                                                                                      Contact

                                                                                                RIS DATA
                                                                                              “RDX Wall Art: The
                                                                                                                     - “RDX Wall Art: The Making
                                                                                                                     Of” iand new short
                                                                                              Making Of” is a new
                                                                                                                     documentary iand new short isa
                                                                                              short documentary
                                                                                              Blog
                                                                                              highlighting
                                                                                                                     Forun
                                                                                                                     new short
                                                                                                                     - isa new short documentary
push to central server                                                                                               - highlighting iand new sho
                                                                                                                     documentary
                                                                                              some of the pioneers
                                                                                                                     - some of the pioneers
                                                                                              highlighting
                                                                                                                     highlighting iand new sho
                                                                                              more ...
                                                                                                                     more ...



                 raw data
                repository
               (file system                                                       query
                  based)
                                                                                                           Pages 1 2 3 4 5 6 7 . . .
                                                                                                                 120 121 122

  copy to HDFS



                                          CLUSTER
                  source data
                   on HDFS



                             <<MapReduce>>
                              create datasets        <<MapReduce>>
                                                     insert into HBase
                                            intermediate
                                              intermediate
                                               data on
                                                 intermediate
                                                HDFS on
                                                  data
                                                   HDFS on
                                                    data
                                                     HDFS
Import
                                           <<mapreduce>>
                                           HBASE INSERT
                                               JOB




 UPDATES                   <<mapreduce>>
 SOURCE     COPY TO HDFS      UPDATE
  FILES                     IMPORT JOB




                                           <<mapreduce>>
                                                           <<mapreduce>>
                                           ADJACENCIES
                                                           HBASE INSERT
                                            DERIVED SET
                                                               JOB
                                            CREATE JOB




TABLEDUMP                  <<mapreduce>>   <<mapreduce>>                   HBase
  SOURCE    COPY TO HDFS    TABLEDUMP      HBASE INSERT
   FILES                    IMPORT JOB         JOB




                           <<mapreduce>>
                                           <<mapreduce>>
                              PEERS
                                           HBASE INSERT
                            DERIVED SET
                                               JOB
                            CREATE JOB
Import

• Hadoop MapReduce for parallelization and
  fault tolerance
• Chain jobs to create derived datasets
• Database store operations are idempotent
BGP4MP|980099497|A|193.148.15.68|3333|192.37.0.0/16|3333 5378 286 1836|IGP|193.148.15.140|0|0||NAG||
       980099497 A|193.148.15.68|3333|192.37.0.0/16|3333          1836|IGP|193.148.15.140




                      RECORD
                prefix: 193.92.140/21                            TIMELINE
              origin IP: 193.148.15.68                T1, T2, T3, T4, T5, T6, T7, etc.
            AS path: 3333 5378 286 1836




    Inserting new data points means read-modify-write:
    • find record
    • fetch timeline
    • merge timeline with new data
    • write merged timeline
Schema and
      representation
 ROW KEY           CF: DATA           CF: META

                [record]:[timeline]
                [record]:[timeline]
[index bytes]   [record]:[timeline]
                                      exists = Y
                [record]:[timeline]
                [record]:[timeline]
                [record]:[timeline]
[index bytes]   [record]:[timeline]
                                      exists = Y
                [record]:[timeline]
Schema and
            representation
•   Multiple indexes; data is duplicated for each index
•   Protobuf encoded records and timelines
•   Timelines are timestamps or pairs of timestamps denoting validity
    intervals
•   Data is partitioned over time segments (index contains time part)
•   Indexes have an additional discriminator byte to avoid extremely
    wide rows
•   Keep an in-memory representation of the key space for
    backwards traversal
•   Keep data hot spot (most recent two to three weeks) in HBase
    block cache
Schema and
          representation
  Strategy:
• Make sure data hotspot fits in HBase block
  cache
• Scale out seek operations for (higher
  latency) access to old data
  Result:
• 10x 10K RPM disk per worker node
• 16GB heap space for region servers
{ "select": ["META_DATA", "BLOB", "TIMELINE"],
                                                           HTTP POST

                                                                        Query
    "data_class": "RIS_UPDATE",
    "where": {
        "composite": {
            "primary": {


                                                                        Server
                "time_index": {
                    "from": "08:00:00.000 05-02-2008",
                    "to": "18:00:00.000 05-05-2008"}
            },
            "secondary": {
                "ipv4": {
                    "match": "ALL_LEVELS_MORE_SPECIFIC",               PROTOBUF
                    "exact_match": "INCLUDE_EXACT",
                    "resource": "193.0.0.0/8"                              or
                }                                                        JSON
            }
        }
    },


                                                                         Client
    "combine_timeline": true
}




• In-memory index of all resources (IPs and AS numbers)
• Internally generates lists ranges to scan that match query predicates
• Combines time partitions into a single timeline before returning to
    client
•   Runs on every RS behind a load balancer
•   Knows which data exists
Hadoop and HBase:
 some experiences
Blocking writes because of memstore
flushing and compactions.
Tuned these (amongst others):
hbase.regionserver.global.memstore.upperLimit
hbase.regionserver.global.memstore.lowerLimit
hbase.hregion.memstore.flush.size
hbase.hregion.max.filesize
hbase.hstore.compactionThreshold
hbase.hstore.blockingStoreFiles
hbase.hstore.compaction.max
Hadoop and HBase:
       some experiences
      Dev machines: dual quad core, 32GB RAM,
      4x 750GB 5.4K RPM data disks, 150GB OS
      disk
      Prod machines: dual quad core, 64GB RAM,
      10x 600GB 10K RPM data disks,
      150GB OS disk

See a previously IO bound job become CPU bound
because of expanded IO capacity.
Hadoop and HBase:
        some experiences
       Garbage collecting a 16GB heap can take
       quite a long time.


Using G1*. Full collections still take seconds, but no
missed heartbeats.


                                 * New HBase versions have a fix / feature that
                                 works with the CMS collector. RIPE NCC is
                                 looking into upgrading.
Hadoop and HBase:
         some experiences
       Estimating required capacity and machine
       profiles is hard.


Pick a starting point. Monitor. Tune. Scale. In that order.
Hadoop and HBase:
         some experiences
        Other useful tools

•   Ganglia
•   CFEngine / Puppet
•   Splunk
•   Bash
Hadoop and HBase:
  some experiences

 Awesome
community!
(mailing lists are a big help when in trouble)
Q&A?
  Friso van Vollenhoven
  fvanvollenhoven@xebia.com

More Related Content

Similar to Analyzing internet data using Hadoop and HBase

Remote Monitoring and Surveillance as a Value Added Service - Proposed By 3 J...
Remote Monitoring and Surveillance as a Value Added Service - Proposed By 3 J...Remote Monitoring and Surveillance as a Value Added Service - Proposed By 3 J...
Remote Monitoring and Surveillance as a Value Added Service - Proposed By 3 J...Kapil Pendse
 
産総研におけるプライベートクラウドへの取り組み
産総研におけるプライベートクラウドへの取り組み産総研におけるプライベートクラウドへの取り組み
産総研におけるプライベートクラウドへの取り組みRyousei Takano
 
Ip Networking Over Satelite Course Sampler
Ip Networking Over Satelite Course SamplerIp Networking Over Satelite Course Sampler
Ip Networking Over Satelite Course SamplerJim Jenkins
 
BGP Scanner - Isolario BGP-MRT Data Reader C Library and Tool
BGP Scanner - Isolario BGP-MRT Data Reader C Library and ToolBGP Scanner - Isolario BGP-MRT Data Reader C Library and Tool
BGP Scanner - Isolario BGP-MRT Data Reader C Library and ToolAPNIC
 
RPKI and Me
RPKI and MeRPKI and Me
RPKI and MeMyNOG
 
evolution towards NGN
evolution towards NGNevolution towards NGN
evolution towards NGNAJAL A J
 

Similar to Analyzing internet data using Hadoop and HBase (9)

IPTV lecture
IPTV lectureIPTV lecture
IPTV lecture
 
Remote Monitoring and Surveillance as a Value Added Service - Proposed By 3 J...
Remote Monitoring and Surveillance as a Value Added Service - Proposed By 3 J...Remote Monitoring and Surveillance as a Value Added Service - Proposed By 3 J...
Remote Monitoring and Surveillance as a Value Added Service - Proposed By 3 J...
 
産総研におけるプライベートクラウドへの取り組み
産総研におけるプライベートクラウドへの取り組み産総研におけるプライベートクラウドへの取り組み
産総研におけるプライベートクラウドへの取り組み
 
Ip Networking Over Satelite Course Sampler
Ip Networking Over Satelite Course SamplerIp Networking Over Satelite Course Sampler
Ip Networking Over Satelite Course Sampler
 
BGP Scanner - Isolario BGP-MRT Data Reader C Library and Tool
BGP Scanner - Isolario BGP-MRT Data Reader C Library and ToolBGP Scanner - Isolario BGP-MRT Data Reader C Library and Tool
BGP Scanner - Isolario BGP-MRT Data Reader C Library and Tool
 
RPKI and Me
RPKI and MeRPKI and Me
RPKI and Me
 
Observer ts
Observer tsObserver ts
Observer ts
 
Observer ts
Observer tsObserver ts
Observer ts
 
evolution towards NGN
evolution towards NGNevolution towards NGN
evolution towards NGN
 

More from fvanvollenhoven

Xebicon 2015 - Go Data Driven NOW!
Xebicon 2015 - Go Data Driven NOW!Xebicon 2015 - Go Data Driven NOW!
Xebicon 2015 - Go Data Driven NOW!fvanvollenhoven
 
Prototyping online ML with Divolte Collector
Prototyping online ML with Divolte CollectorPrototyping online ML with Divolte Collector
Prototyping online ML with Divolte Collectorfvanvollenhoven
 
Divolte Collector - meetup presentation
Divolte Collector - meetup presentationDivolte Collector - meetup presentation
Divolte Collector - meetup presentationfvanvollenhoven
 
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup group
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup groupApache Spark talk @ The Amsterdam Applied Machine Learning meetup group
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup groupfvanvollenhoven
 
Network analysis with Hadoop and Neo4j
Network analysis with Hadoop and Neo4jNetwork analysis with Hadoop and Neo4j
Network analysis with Hadoop and Neo4jfvanvollenhoven
 
NoSQL War Stories preso: Hadoop and Neo4j for networks
NoSQL War Stories preso: Hadoop and Neo4j for networksNoSQL War Stories preso: Hadoop and Neo4j for networks
NoSQL War Stories preso: Hadoop and Neo4j for networksfvanvollenhoven
 
JFall 2011 no sql workshop
JFall 2011 no sql workshopJFall 2011 no sql workshop
JFall 2011 no sql workshopfvanvollenhoven
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReducefvanvollenhoven
 

More from fvanvollenhoven (9)

Xebicon 2015 - Go Data Driven NOW!
Xebicon 2015 - Go Data Driven NOW!Xebicon 2015 - Go Data Driven NOW!
Xebicon 2015 - Go Data Driven NOW!
 
Prototyping online ML with Divolte Collector
Prototyping online ML with Divolte CollectorPrototyping online ML with Divolte Collector
Prototyping online ML with Divolte Collector
 
Divolte Collector - meetup presentation
Divolte Collector - meetup presentationDivolte Collector - meetup presentation
Divolte Collector - meetup presentation
 
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup group
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup groupApache Spark talk @ The Amsterdam Applied Machine Learning meetup group
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup group
 
RuG Guest Lecture
RuG Guest LectureRuG Guest Lecture
RuG Guest Lecture
 
Network analysis with Hadoop and Neo4j
Network analysis with Hadoop and Neo4jNetwork analysis with Hadoop and Neo4j
Network analysis with Hadoop and Neo4j
 
NoSQL War Stories preso: Hadoop and Neo4j for networks
NoSQL War Stories preso: Hadoop and Neo4j for networksNoSQL War Stories preso: Hadoop and Neo4j for networks
NoSQL War Stories preso: Hadoop and Neo4j for networks
 
JFall 2011 no sql workshop
JFall 2011 no sql workshopJFall 2011 no sql workshop
JFall 2011 no sql workshop
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduce
 

Recently uploaded

SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 

Recently uploaded (20)

SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 

Analyzing internet data using Hadoop and HBase

  • 1. Analyzing the internet using Hadoop and Hbase Friso van Vollenhoven fvanvollenhoven@xebia.com
  • 3. + = One of five Regional Internet Registrars Allocate IP address ranges to organizations (e.g. ISPs, network operators, etc.) Provide near real time network information based on measurements
  • 4. WEB 5.0 SERVICE PowerBook G4 THE INTERNET
  • 5. WEB 5.0 SERVICE AS T-Mobile PowerBook G4 BG AS P AS KPN AMS-IX BGP BG P BG BG P P B GP AS AS Level3 XS4All P BGP BG AS DT BG P BGP AS AS BT AT&T
  • 6. Our Data BGP4MP|980099497|A|193.148.15.68|3333|192.37.0.0/16|3333 5378 286 1836|IGP|193.148.15.140|0|0||NAG|| I can take traffic for anything in the range 192.37.0.0 - 192.37.255.255 and deliver it through the network route 3333 -> 5378 -> 286 -> 1836 Got ROUTER it! AS3333 ROUTER PEER Thanks! AS RIPE NCC ROUTER PEER AS (not really a router!)
  • 7. Our Data • 15 active data collection points (the not really routers), collecting updates from ~600 peers • BGP updates every 5 minutes, ~80M / day • Routing table dumps every 8 hours, ~10M entries / dump (160M index updates)
  • 8. FAQ • What are all updates that are somehow related to AS3333 for the past two weeks? • How many updates per second were announcing prefixes originated by AS286 between 11:00 and 12:00 on February 2nd in 2005? • Which AS adjacencies have existed for AS3356 in the past year? • What was the size of the routing table of the peer router at 193.148.15.140 two years ago?
  • 9. Not so frequently asked: What happened in Egypt? http://stat.ripe.net/egypt
  • 10. THE INTERNET DATA COLLECTOR DATA COLLECTOR DATA COLLECTOR DATA COLLECTOR ONLINE QUERY SYSTEM -+ RIPE.net + Home | Login | Register | Contact RIS DATA “RDX Wall Art: The - “RDX Wall Art: The Making Of” iand new short Making Of” is a new documentary iand new short isa short documentary Blog highlighting Forun new short - isa new short documentary push to central server - highlighting iand new sho documentary some of the pioneers - some of the pioneers highlighting highlighting iand new sho more ... more ... raw data repository (file system query based) Pages 1 2 3 4 5 6 7 . . . 120 121 122 copy to HDFS CLUSTER source data on HDFS <<MapReduce>> create datasets <<MapReduce>> insert into HBase intermediate intermediate data on intermediate HDFS on data HDFS on data HDFS
  • 11. Import <<mapreduce>> HBASE INSERT JOB UPDATES <<mapreduce>> SOURCE COPY TO HDFS UPDATE FILES IMPORT JOB <<mapreduce>> <<mapreduce>> ADJACENCIES HBASE INSERT DERIVED SET JOB CREATE JOB TABLEDUMP <<mapreduce>> <<mapreduce>> HBase SOURCE COPY TO HDFS TABLEDUMP HBASE INSERT FILES IMPORT JOB JOB <<mapreduce>> <<mapreduce>> PEERS HBASE INSERT DERIVED SET JOB CREATE JOB
  • 12. Import • Hadoop MapReduce for parallelization and fault tolerance • Chain jobs to create derived datasets • Database store operations are idempotent
  • 13. BGP4MP|980099497|A|193.148.15.68|3333|192.37.0.0/16|3333 5378 286 1836|IGP|193.148.15.140|0|0||NAG|| 980099497 A|193.148.15.68|3333|192.37.0.0/16|3333 1836|IGP|193.148.15.140 RECORD prefix: 193.92.140/21 TIMELINE origin IP: 193.148.15.68 T1, T2, T3, T4, T5, T6, T7, etc. AS path: 3333 5378 286 1836 Inserting new data points means read-modify-write: • find record • fetch timeline • merge timeline with new data • write merged timeline
  • 14. Schema and representation ROW KEY CF: DATA CF: META [record]:[timeline] [record]:[timeline] [index bytes] [record]:[timeline] exists = Y [record]:[timeline] [record]:[timeline] [record]:[timeline] [index bytes] [record]:[timeline] exists = Y [record]:[timeline]
  • 15. Schema and representation • Multiple indexes; data is duplicated for each index • Protobuf encoded records and timelines • Timelines are timestamps or pairs of timestamps denoting validity intervals • Data is partitioned over time segments (index contains time part) • Indexes have an additional discriminator byte to avoid extremely wide rows • Keep an in-memory representation of the key space for backwards traversal • Keep data hot spot (most recent two to three weeks) in HBase block cache
  • 16. Schema and representation Strategy: • Make sure data hotspot fits in HBase block cache • Scale out seek operations for (higher latency) access to old data Result: • 10x 10K RPM disk per worker node • 16GB heap space for region servers
  • 17. { "select": ["META_DATA", "BLOB", "TIMELINE"], HTTP POST Query "data_class": "RIS_UPDATE", "where": { "composite": { "primary": { Server "time_index": { "from": "08:00:00.000 05-02-2008", "to": "18:00:00.000 05-05-2008"} }, "secondary": { "ipv4": { "match": "ALL_LEVELS_MORE_SPECIFIC", PROTOBUF "exact_match": "INCLUDE_EXACT", "resource": "193.0.0.0/8" or } JSON } } }, Client "combine_timeline": true } • In-memory index of all resources (IPs and AS numbers) • Internally generates lists ranges to scan that match query predicates • Combines time partitions into a single timeline before returning to client • Runs on every RS behind a load balancer • Knows which data exists
  • 18. Hadoop and HBase: some experiences Blocking writes because of memstore flushing and compactions. Tuned these (amongst others): hbase.regionserver.global.memstore.upperLimit hbase.regionserver.global.memstore.lowerLimit hbase.hregion.memstore.flush.size hbase.hregion.max.filesize hbase.hstore.compactionThreshold hbase.hstore.blockingStoreFiles hbase.hstore.compaction.max
  • 19. Hadoop and HBase: some experiences Dev machines: dual quad core, 32GB RAM, 4x 750GB 5.4K RPM data disks, 150GB OS disk Prod machines: dual quad core, 64GB RAM, 10x 600GB 10K RPM data disks, 150GB OS disk See a previously IO bound job become CPU bound because of expanded IO capacity.
  • 20. Hadoop and HBase: some experiences Garbage collecting a 16GB heap can take quite a long time. Using G1*. Full collections still take seconds, but no missed heartbeats. * New HBase versions have a fix / feature that works with the CMS collector. RIPE NCC is looking into upgrading.
  • 21. Hadoop and HBase: some experiences Estimating required capacity and machine profiles is hard. Pick a starting point. Monitor. Tune. Scale. In that order.
  • 22. Hadoop and HBase: some experiences Other useful tools • Ganglia • CFEngine / Puppet • Splunk • Bash
  • 23. Hadoop and HBase: some experiences Awesome community! (mailing lists are a big help when in trouble)
  • 24. Q&A? Friso van Vollenhoven fvanvollenhoven@xebia.com

Editor's Notes

  1. \n
  2. \n
  3. \n
  4. \n
  5. \n
  6. \n
  7. \n
  8. \n
  9. \n
  10. \n
  11. \n
  12. \n
  13. \n
  14. \n
  15. \n
  16. \n
  17. \n
  18. \n
  19. \n
  20. \n
  21. \n
  22. \n
  23. \n
  24. \n