Using Hadoop to Process a
Trillion+ Events




Michael Brown, CTO | March 2012




                   © comScore, Inc.   Proprietary.
comScore is a leading internet technology company that
provides Analytics for a Digital World™



       NASDAQ                         SCOR

       Clients                        2,100+ Worldwide

       Employees                      1,000+

       Headquarters                   Reston, Virginia, USA

       Global Coverage                Measurement from 172 Countries; 44 Markets Reported

       Local Presence                 32 Locations in 23 Countries

       Big Data                       Over 1.5 Trillion Digital Interactions Captured Monthly




Some of our Clients

 Media   Agencies   Telecom/Mobile           Financial   Retail   Travel   CPG   Pharma   Technology




The Trusted Source for Digital Intelligence Across Vertical Markets



       9 out of the top 10 INVESTMENT BANKS
       4 out of the top 4 WIRELESS CARRIERS
       47 out of the top 50 ONLINE PROPERTIES
       45 out of the top 50 ADVERTISING AGENCIES
       9 out of the top 10 MAJOR MEDIA COMPANIES
       9 out of the top 10 AUTO INSURERS
       11 out of the top 12 INTERNET SERVICE PROVIDERS
       14 out of the top 15 PHARMACEUTICAL COMPANIES
       11 out of the top 12 CONSUMER FINANCE COMPANIES
       8 out of the top 10 CPG COMPANIES


Vocabulary for Measuring Information
If a Grain of Sand were One Byte of Information . . .
                 1 Megabyte = 1 million bytes: a tablespoon of sand
                 1 Gigabyte = 1 billion bytes: a patch of sand, 9” square, 1’ deep
                 1 Terabyte = 1 trillion bytes: a sandbox, 24’ square, 1’ deep
                 1 Petabyte = 1,000 terabytes: a mile-long beach, 100’ wide, 1’ deep
                 1 Exabyte = 1,000 petabytes: the same beach, from Maine to North Carolina
                 1 Zettabyte = 1,000 exabytes: the same beach, along the entire US coast
                 1 Yottabyte = 1,000 zettabytes (24 zeroes): enough info to bury the entire US under 296 feet of sand
Worldwide Tags per Month

               [Chart: worldwide tags per month, 2009 to 2013. Vertical axis: # of records, from 0 to 1,600,000,000,000. Series: Panel Records, Beacon Records.]
Beacon Heat Map




Our Event Volume in Perspective


                    [Chart: Top 65 WW Properties – Cumulative Page Views. Vertical axis from 0 to 1,600,000.]




 Source: comScore MediaMetrix Worldwide December 2012


Daily Records Collection Trend


                      [Chart: daily records collected, Jul 2009 to Jan 2014. Left axis: # of census records, from -10,000,000,000 to 50,000,000,000. Right axis: # of panel records, from 0 to 5,000,000,000. Series: Beacon Records and Panel Records, each with a linear trend line (R² = 0.940 and R² = 0.822).]
The Project:
vCE – Validated Campaign Essentials




comScore - vCE




The Problem Statement


Calculate the number of events and unique cookies for each reportable
campaign element
Key takeaways
  Input data is aggregated daily
  All data for 3 months must be processed
  Values must be calculated for every day in the 92-day period, across all
   reportable campaign elements




Structure of the Required Output



   Client   Campaign     Population              Location   Cookie Ct    Period

   1234     160873284        840                      1        863,185     1
   1234     160873284        840                      1      1,719,738     2
   1234     160873284        840                      1      2,631,624     3
   1234     160873284        840                      1      3,572,163     4
   1234     160873284        840                      1      4,445,508     5
   1234     160873284        840                      1      5,308,532     6
   1234     160873284        840                      1      6,032,073     7
   1234     160873284        840                      1      6,710,645     8
   1234     160873284        840                      1      7,421,258     9
   1234     160873284        840                      1      8,154,543    10
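
The Cookie Ct column grows with Period, which suggests that each row reports the unique cookies seen from day 1 through that day. The slides do not state this, so treat it as an assumption. Under that reading, a minimal Python sketch of the calculation records the first day each cookie appears and then takes a running total. The function name and the sample records are hypothetical.

```python
from collections import Counter
from itertools import accumulate

def cumulative_uniques(records, num_periods):
    """records: iterable of (day, cookie) pairs with day in 1..num_periods.
    Returns a list whose entry p-1 is the number of distinct cookies
    seen on days 1..p (assumed meaning of Cookie Ct)."""
    first_seen = {}
    for day, cookie in records:
        # Keep the earliest day on which each cookie appeared.
        if cookie not in first_seen or day < first_seen[cookie]:
            first_seen[cookie] = day
    # Number of cookies that are new on each day.
    new_per_day = Counter(first_seen.values())
    return list(accumulate(new_per_day.get(p, 0) for p in range(1, num_periods + 1)))

# Hypothetical sample: (day, cookie)
sample = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "B"), (3, "D"), (3, "E")]
print(cumulative_uniques(sample, 4))  # [2, 3, 5, 5]
```

This version holds every cookie in memory, which is only workable for small inputs.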




Counting Uniques from a Time Ordered Log File




          Time-ordered input:  A  D  B  C  B  A  A

          Major Downsides:
            Need to keep all key elements in memory.
            Constrained to one machine for final aggregation.
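
A minimal Python sketch of counting uniques when keys arrive in time order, using the key sequence from this slide. The function name is hypothetical. The set must hold every distinct key until the end of the input, which is the memory downside named above.

```python
def count_uniques_time_ordered(keys):
    """Count distinct keys in an unsorted (time-ordered) stream."""
    seen = set()  # grows with the number of distinct keys
    for key in keys:
        seen.add(key)
    return len(seen)

events = ["A", "D", "B", "C", "B", "A", "A"]
print(count_uniques_time_ordered(events))  # 4
```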



First Version


Java Map-Reduce application that processes pre-aggregated data from 92 days
Map reads the data and emits each cookie as the key of the key-value pair
All 170B records go through the shuffle
Each Reducer gets all the data for a particular campaign, sorted by cookie
Reducer aggregates the data by grouping key (Client / Campaign / Population) and calculates
unique cookies for periods 1-92
Volume grew rapidly, to the point that the daily processing took more than a day
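
The flow above can be sketched as a small in-process simulation in Python. This is not the Java application itself. It assumes a composite key of (grouping key, cookie) so that each reducer sees one campaign's records sorted by cookie, and it assumes cumulative unique counts per period; neither detail is given in the slides. All names and the sample records are hypothetical.

```python
from itertools import accumulate, groupby
from operator import itemgetter

def map_phase(records):
    """Emit ((grouping key, cookie), day) for every input record."""
    for client, campaign, population, day, cookie in records:
        yield ((client, campaign, population), cookie), day

def shuffle(pairs):
    """Every emitted pair crosses the shuffle and is sorted by key."""
    return sorted(pairs, key=itemgetter(0))

def reduce_phase(sorted_pairs, num_periods):
    """For each grouping key, count cookies by first day seen, then accumulate."""
    results = {}
    for group, group_pairs in groupby(sorted_pairs, key=lambda kv: kv[0][0]):
        new_per_day = [0] * num_periods
        for _cookie, cookie_pairs in groupby(group_pairs, key=lambda kv: kv[0][1]):
            first_day = min(day for _key, day in cookie_pairs)
            new_per_day[first_day - 1] += 1
        results[group] = list(accumulate(new_per_day))
    return results

# Hypothetical sample: (client, campaign, population, day, cookie)
records = [
    (1234, 160873284, 840, 1, "A"),
    (1234, 160873284, 840, 1, "B"),
    (1234, 160873284, 840, 2, "A"),
    (1234, 160873284, 840, 2, "C"),
    (1234, 160873284, 840, 3, "B"),
]
print(reduce_phase(shuffle(map_phase(records)), 3))
# {(1234, 160873284, 840): [2, 3, 3]}
```

Note that `shuffle` receives one pair per input record, which mirrors the point that all 170B records go through the shuffle.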




M/R Data Flow



       B        C                                     A       B       C       A




      Mapper                                          Mapper          Mapper
        Map                                             Map             Map


        A       A                                         B       B       C       C


        Reduce                                            Reduce          Reduce


            A                                                 B               C




Scaling Issue


As our volume has grown we have the following stats:
  Over 500 billion events per month
  Daily Aggregate 1.5 billion (and growing)
  170 billion aggregate records for 92 days
  70K Campaigns
  Over 50 countries
  We see 15 billion distinct cookies in a month
  We only need to output 25 million rows
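
A quick back-of-the-envelope check in Python, using only the figures on this slide, shows how lopsided the job is: the input is thousands of times larger than the output. The variable names are illustrative.

```python
aggregate_records = 170_000_000_000  # aggregate records for 92 days
output_rows = 25_000_000             # rows in the final output
days = 92

# Average aggregate records per day implied by the 92-day total.
avg_per_day = aggregate_records / days
print(f"{avg_per_day / 1e9:.2f} billion records per day on average")  # 1.85

# Input records consumed for each output row.
records_per_row = aggregate_records // output_rows
print(records_per_row)  # 6800
```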




Basic Approach Retrospective


Processing speed is not scaling to our needs on a sample of the input data
Diagnosis
  Most aggregations could not take significant advantage of combiners.
  Large shuffles caused poor job performance. In some cases large aggregations ran slower on the
   Hadoop cluster due to shuffle and skew in data for keys.


Conclusion
  A new approach is required to reduce the shuffle




Counting Uniques from a Key Ordered Log File




          Key-ordered input:  A  A  A  B  B  C  D

          Major Downsides:
            Need to sort data in advance.
            The sort time increases as volume grows.
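
A minimal Python sketch of counting uniques when the input is already sorted by key, using the key sequence from this slide. The function name is hypothetical. Only the previous key is kept, so memory use stays constant no matter how many distinct keys there are.

```python
def count_uniques_key_ordered(sorted_keys):
    """Count distinct keys in a stream that is already sorted by key."""
    count = 0
    previous = object()  # sentinel that compares unequal to every real key
    for key in sorted_keys:
        if key != previous:
            count += 1
            previous = key
    return count

print(count_uniques_key_ordered(["A", "A", "A", "B", "B", "C", "D"]))  # 4
```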



Counting Uniques from Sharded Key Ordered Log Files




Solution to reduce the shuffle


The Problem:
  Aggregations cannot take advantage of combiners, leading to large shuffles and job performance issues

The Idea:
  Partition and sort the data by cookie on a daily basis
  Create a custom InputFormat to merge daily partitions for monthly aggregations
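
The merge step can be illustrated with a short Python sketch. It is not the custom InputFormat itself, only the idea behind it: if every daily file for a partition is already sorted by cookie, the files can be merged lazily and distinct cookies counted in one pass. The function name and sample data are hypothetical.

```python
import heapq
from itertools import groupby

def count_uniques_merged(daily_partitions):
    """daily_partitions: one list of cookies per day, each sorted by cookie.
    heapq.merge yields a single sorted stream without loading all days at once;
    groupby then collapses each run of equal cookies into one group."""
    merged = heapq.merge(*daily_partitions)
    return sum(1 for _cookie, _run in groupby(merged))

day1 = ["A", "B", "D"]
day2 = ["A", "C"]
day3 = ["B", "D", "E"]
print(count_uniques_merged([day1, day2, day3]))  # 5
```

Because the sort is paid for once per day, the monthly job only pays for the merge.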




Custom Input Format with Map Side Aggregation



       Input:  B  C   |   A  B  C  A

       A -> Mapper (Map)        B -> Mapper (Map)        C -> Mapper (Map)
            Combiner                 Combiner                 Combiner
               A                        B                        C
             Reduce                   Reduce                   Reduce
               A                        B                        C


Risks for Partitioning


Data locality
  Custom InputFormat requires reading blocks of the partitioned data over the network
  This was solved using a feature of the MapR file system: we created volumes and set the chunk size to
   zero, which guarantees that the data written to a volume stays on one node



Map failures might result in long run times
  Size of the map inputs is no longer set by block size
  This was solved by creating a large number (10K) of volumes to limit the size of data processed by each
   mapper
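
For partitioning to work, a given cookie has to land in the same volume every day. The slides do not say how cookies are assigned to volumes; the Python sketch below assumes a stable hash (CRC32) taken modulo the volume count, and the function name is hypothetical. Python's built-in `hash` is avoided because it is salted per process for strings.

```python
import zlib

NUM_VOLUMES = 10_000  # volume count from the slide

def volume_for(cookie, num_volumes=NUM_VOLUMES):
    """Map a cookie to a volume index that is the same on every run and every day."""
    return zlib.crc32(cookie.encode("utf-8")) % num_volumes

# The same cookie always maps to the same volume.
print(volume_for("cookie-123") == volume_for("cookie-123"))  # True

# Average aggregate records per volume, assuming an even spread of the 170B records.
records_per_volume = 170_000_000_000 // NUM_VOLUMES
print(records_per_volume)  # 17000000
```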




Partitioning Summary


Benefits:
  A large portion of the aggregation can be completed in the map phase
  Applications can now take advantage of combiners
  Shuffle sizes are minimal

Results:
  Took a job from 35 hours to 3 hours with no hardware changes
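
Why partitioning shrinks the shuffle can be shown with a small Python simulation. It is a sketch under stated assumptions, not the production job: the record layout, the CRC32 partitioning, and all names are hypothetical. Without partitioning, a cookie can appear in any split, so cookie-level pairs must be shuffled. With cookie partitioning, each cookie lives in exactly one partition, so each mapper can emit a finished partial count per campaign.

```python
import zlib
from collections import defaultdict

def unpartitioned_job(splits):
    """A combiner can only drop duplicates inside its own split,
    so every distinct (campaign, cookie) pair per split is shuffled."""
    shuffled = [pair for split in splits for pair in set(split)]
    cookies_by_campaign = defaultdict(set)
    for campaign, cookie in shuffled:
        cookies_by_campaign[campaign].add(cookie)
    return len(shuffled), {c: len(s) for c, s in cookies_by_campaign.items()}

def partitioned_job(records, num_partitions):
    """Each cookie lives in one partition, so the map side can emit
    (campaign, local unique count) and the reducer only sums."""
    partitions = defaultdict(set)
    for campaign, cookie in records:
        index = zlib.crc32(cookie.encode("utf-8")) % num_partitions
        partitions[index].add((campaign, cookie))
    shuffled = []
    for pairs in partitions.values():
        local_counts = defaultdict(int)
        for campaign, _cookie in pairs:
            local_counts[campaign] += 1
        shuffled.extend(local_counts.items())
    totals = defaultdict(int)
    for campaign, count in shuffled:
        totals[campaign] += count
    return len(shuffled), dict(totals)

# Hypothetical data: campaign c1 has 50 distinct cookies, c2 has 30.
records = [("c1", f"cookie{i % 50}") for i in range(200)] + \
          [("c2", f"cookie{i % 30}") for i in range(90)]
splits = [records[i::4] for i in range(4)]

size_a, counts_a = unpartitioned_job(splits)
size_b, counts_b = partitioned_job(records, 4)
print(counts_a == counts_b)  # True: both approaches agree on the unique counts
print(size_a, size_b)        # shuffled record counts; the partitioned one is far smaller
```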




Our Cluster


Production Hadoop Cluster
  120 nodes: Mix of Dell 720xd, R710 and R510 servers
  Each R510 has (12x2TB drives; 64GB RAM; 24 cores)
  3000+ total CPUs
  6.0TB total memory
  2PB total disk space
  Our distro is MapR M5 2.1.0




Useful Factoids

     Colorful, bite-sized graphical representations of the best discoveries we unearth.




       Visit www.comscoredatamine.com or follow @datagems for the latest gems.


Thank You!


 Michael Brown
 CTO
 comScore, Inc.


 mbrown@comscore.com





More Related Content

Viewers also liked

lect5_Stick_diagram_layout_rules
lect5_Stick_diagram_layout_ruleslect5_Stick_diagram_layout_rules
lect5_Stick_diagram_layout_rulesvein
 
Layout or Makeup Journalism
Layout or Makeup JournalismLayout or Makeup Journalism
Layout or Makeup JournalismDeb Homillano
 
Cmos design
Cmos designCmos design
Cmos designMahi
 
Pass Transistor Logic
Pass Transistor LogicPass Transistor Logic
Pass Transistor LogicDiwaker Pant
 
Tablet manufacturing process created by Asadulla Mulla
Tablet manufacturing process created by Asadulla MullaTablet manufacturing process created by Asadulla Mulla
Tablet manufacturing process created by Asadulla MullaAsad Mulla
 
Tablet processing problems
Tablet processing problemsTablet processing problems
Tablet processing problemsSanjay Yadav
 
Pharmaceutical industry and unit process
Pharmaceutical industry and unit processPharmaceutical industry and unit process
Pharmaceutical industry and unit processibtihal osman
 
Facility layout ppt
Facility layout pptFacility layout ppt
Facility layout pptAnju Rana
 
Plant layout ppt by me
Plant layout ppt by mePlant layout ppt by me
Plant layout ppt by meAnkit Walia
 
Using Business Architecture to enable customer experience and digital strategy
Using Business Architecture to enable customer experience and digital strategyUsing Business Architecture to enable customer experience and digital strategy
Using Business Architecture to enable customer experience and digital strategyCraig Martin
 
All about Tablets (Pharma)
All about Tablets  (Pharma)All about Tablets  (Pharma)
All about Tablets (Pharma)Sathish Vemula
 
Basic layout principles
Basic layout principlesBasic layout principles
Basic layout principlesSherwin Manual
 
Pharmaceutical Factory Layout
Pharmaceutical Factory LayoutPharmaceutical Factory Layout
Pharmaceutical Factory LayoutZil Shah
 
Recovery: Job Growth and Education Requirements Through 2020
Recovery: Job Growth and Education Requirements Through 2020Recovery: Job Growth and Education Requirements Through 2020
Recovery: Job Growth and Education Requirements Through 2020CEW Georgetown
 
3 hard facts shaping higher education thinking and behavior
3 hard facts shaping higher education thinking and behavior3 hard facts shaping higher education thinking and behavior
3 hard facts shaping higher education thinking and behaviorGrant Thornton LLP
 
Jornal CTB 2016 01-20-n6-ano9
Jornal CTB 2016 01-20-n6-ano9Jornal CTB 2016 01-20-n6-ano9
Jornal CTB 2016 01-20-n6-ano9Carlos Eduardo
 
What's Trending in Talent and Learning for 2016?
What's Trending in Talent and Learning for 2016?What's Trending in Talent and Learning for 2016?
What's Trending in Talent and Learning for 2016?Skillsoft
 
Shall we play a game?
Shall we play a game?Shall we play a game?
Shall we play a game?Maciej Lasyk
 
Can We Assess Creativity?
Can We Assess Creativity?Can We Assess Creativity?
Can We Assess Creativity?John Spencer
 

Viewers also liked (20)

lect5_Stick_diagram_layout_rules
lect5_Stick_diagram_layout_ruleslect5_Stick_diagram_layout_rules
lect5_Stick_diagram_layout_rules
 
Layout or Makeup Journalism
Layout or Makeup JournalismLayout or Makeup Journalism
Layout or Makeup Journalism
 
Cmos design
Cmos designCmos design
Cmos design
 
Pass Transistor Logic
Pass Transistor LogicPass Transistor Logic
Pass Transistor Logic
 
Tablet
TabletTablet
Tablet
 
Tablet manufacturing process created by Asadulla Mulla
Tablet manufacturing process created by Asadulla MullaTablet manufacturing process created by Asadulla Mulla
Tablet manufacturing process created by Asadulla Mulla
 
Tablet processing problems
Tablet processing problemsTablet processing problems
Tablet processing problems
 
Pharmaceutical industry and unit process
Pharmaceutical industry and unit processPharmaceutical industry and unit process
Pharmaceutical industry and unit process
 
Facility layout ppt
Facility layout pptFacility layout ppt
Facility layout ppt
 
Plant layout ppt by me
Plant layout ppt by mePlant layout ppt by me
Plant layout ppt by me
 
Using Business Architecture to enable customer experience and digital strategy
Using Business Architecture to enable customer experience and digital strategyUsing Business Architecture to enable customer experience and digital strategy
Using Business Architecture to enable customer experience and digital strategy
 
All about Tablets (Pharma)
All about Tablets  (Pharma)All about Tablets  (Pharma)
All about Tablets (Pharma)
 
Basic layout principles
Basic layout principlesBasic layout principles
Basic layout principles
 
Pharmaceutical Factory Layout
Pharmaceutical Factory LayoutPharmaceutical Factory Layout
Pharmaceutical Factory Layout
 
Recovery: Job Growth and Education Requirements Through 2020
Recovery: Job Growth and Education Requirements Through 2020Recovery: Job Growth and Education Requirements Through 2020
Recovery: Job Growth and Education Requirements Through 2020
 
3 hard facts shaping higher education thinking and behavior
3 hard facts shaping higher education thinking and behavior3 hard facts shaping higher education thinking and behavior
3 hard facts shaping higher education thinking and behavior
 
Jornal CTB 2016 01-20-n6-ano9
Jornal CTB 2016 01-20-n6-ano9Jornal CTB 2016 01-20-n6-ano9
Jornal CTB 2016 01-20-n6-ano9
 
What's Trending in Talent and Learning for 2016?
What's Trending in Talent and Learning for 2016?What's Trending in Talent and Learning for 2016?
What's Trending in Talent and Learning for 2016?
 
Shall we play a game?
Shall we play a game?Shall we play a game?
Shall we play a game?
 
Can We Assess Creativity?
Can We Assess Creativity?Can We Assess Creativity?
Can We Assess Creativity?
 





Analyzing 1.4 trillion events with Hadoop

  • 1. Using Hadoop to Process a Trillion+ Events Michael Brown, CTO | March 2012 © comScore, Inc. Proprietary.
  • 2. comScore is a leading internet technology company that provides Analytics for a Digital World™
      • NASDAQ: SCOR
      • Clients: 2,100+ Worldwide
      • Employees: 1,000+
      • Headquarters: Reston, Virginia, USA
      • Global Coverage: Measurement from 172 Countries; 44 Markets Reported
      • Local Presence: 32 Locations in 23 Countries
      • Big Data: Over 1.5 Trillion Digital Interactions Captured Monthly
  • 3. Some of our Clients: Media, Agencies, Telecom/Mobile, Financial, Retail, Travel, CPG, Pharma, Technology
  • 4. The Trusted Source for Digital Intelligence Across Vertical Markets
      • 9 out of the top 10 investment banks
      • 4 out of the top 4 wireless carriers
      • 47 out of the top 50 online properties
      • 45 out of the top 50 advertising agencies
      • 9 out of the top 10 major media companies
      • 9 out of the top 10 auto insurers
      • 11 out of the top 12 internet service providers
      • 14 out of the top 15 pharmaceutical companies
      • 11 out of the top 12 consumer finance companies
      • 8 out of the top 10 CPG companies
  • 5. Vocabulary for Measuring Information: If a Grain of Sand were One Byte of Information . . .
      • 1 Megabyte = 1 million bytes: a tablespoon of sand
      • 1 Gigabyte = 1 billion bytes: a patch of sand, 9” square, 1’ deep
      • 1 Terabyte = 1 trillion bytes: a sandbox, 24’ square, 1’ deep
      • 1 Petabyte = 1,000 terabytes: a mile-long beach, 100’ wide, 1’ deep
      • 1 Exabyte = 1,000 petabytes: the same beach, from Maine to North Carolina
      • 1 Zettabyte = 1,000 exabytes: the same beach, along the entire US coast
      • 1 Yottabyte = 1,000 zettabytes (24 zeroes): enough info to bury the entire US under 296 feet of sand
  • 6. Worldwide Tags per Month [Chart: number of records per month, 2009 to 2013, on an axis from 0 to 1,600,000,000,000; series: Panel Records, Beacon Records]
  • 7. Beacon Heat Map
  • 8. Our Event Volume in Perspective [Chart: Top 65 WW Properties, Cumulative Page Views, on an axis from 0 to 1,600,000. Source: comScore MediaMetrix Worldwide December 2012]
  • 9. Daily Records Collection Trend [Chart: daily number of census (beacon) records and panel records, Jul 2009 to Jan 2014, each with a linear trend line; reported fits are R² = 0.940 and R² = 0.822]
  • 10. The Project: vCE – Validated Campaign Essentials
  • 11. comScore - vCE
  • 12. The Problem Statement: Calculate the number of events and unique cookies for each reportable campaign element. Key takeaways:
      • Data on input will be aggregated daily
      • Need to process all data for 3 months
      • Need to calculate values for every day in the 92-day period spanning all reportable campaign elements
  • 13. Structure of the Required Output
        Client  Campaign   Population  Location  Cookie Ct  Period
        1234    160873284  840         1           863,185       1
        1234    160873284  840         1         1,719,738       2
        1234    160873284  840         1         2,631,624       3
        1234    160873284  840         1         3,572,163       4
        1234    160873284  840         1         4,445,508       5
        1234    160873284  840         1         5,308,532       6
        1234    160873284  840         1         6,032,073       7
        1234    160873284  840         1         6,710,645       8
        1234    160873284  840         1         7,421,258       9
        1234    160873284  840         1         8,154,543      10
  • 14. Counting Uniques from a Time Ordered Log File. Example log, in arrival order: A, D, B, C, B, A, A. Major downsides:
      • Need to keep all key elements in memory.
      • Constrained to one machine for final aggregation.
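
The time-ordered case on slide 14 can be sketched in a few lines of Python. This is only an illustration: the function name is made up here, and the sample log is the sequence of keys read off the slide's diagram. It shows why memory grows with the number of distinct keys, since every key has to be remembered until the end of the file.

```python
def count_uniques_time_ordered(log):
    """Count distinct keys in a log that arrives in time order."""
    seen = set()  # grows with the number of distinct keys in the log
    for key in log:
        seen.add(key)
    return len(seen)


log = ["A", "D", "B", "C", "B", "A", "A"]  # keys as drawn on the slide
print(count_uniques_time_ordered(log))  # 4
```

With billions of distinct cookies, the `seen` set no longer fits comfortably on a single machine, which is the downside the slide calls out.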
  • 15. First Version: a Java Map-Reduce application which processes pre-aggregated data from 92 days.
      • Map reads the data and emits each cookie as the key of the key-value pair
      • All 170B records go through the shuffle
      • Each Reducer will get all the data for a particular campaign sorted by cookie
      • Reducer aggregates the data by grouping key (Client / Campaign / Population) and calculates unique cookies for periods 1-92
      • Volume grew rapidly, to the point that the daily processing took more than a day
  • 16. M/R Data Flow [Diagram: unsorted input records (B, C, A, ...) are read by three mappers; the map output is regrouped by key (A, B, C) and each reducer produces the result for one key]
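
The first version on slide 15 was a Java Map-Reduce job; the sketch below is a small in-memory Python imitation of its data flow, not the original code. It assumes that "unique cookies for period N" means the cumulative number of distinct cookies seen on days 1 through N (which matches the growing Cookie Ct column on slide 13), and it collapses Client / Campaign / Population into a single hypothetical `group` value.

```python
from collections import defaultdict
from itertools import groupby

NUM_DAYS = 3  # the talk uses 92 days; kept small for the example


def map_phase(records):
    """Emit ((group, cookie), day) so that the cookie is part of the key."""
    for day, group, cookie in records:
        yield (group, cookie), day


def shuffle(pairs):
    """Every pair crosses the shuffle; sorting brings each group's cookies together."""
    return sorted(pairs)


def reduce_phase(sorted_pairs, num_days=NUM_DAYS):
    """Return {group: [cumulative unique cookies for period 1..num_days]}."""
    first_seen = defaultdict(lambda: [0] * (num_days + 1))
    for (group, _cookie), pairs in groupby(sorted_pairs, key=lambda kv: kv[0]):
        first_day = min(day for _, day in pairs)
        first_seen[group][first_day] += 1  # cookie is new to the group on this day
    result = {}
    for group, per_day in first_seen.items():
        running, cumulative = 0, []
        for day in range(1, num_days + 1):
            running += per_day[day]
            cumulative.append(running)
        result[group] = cumulative
    return result


records = [
    (1, "c1", "A"), (1, "c1", "B"), (2, "c1", "A"),
    (2, "c1", "C"), (3, "c1", "B"), (2, "c2", "A"),
]
print(reduce_phase(shuffle(map_phase(records))))
# {'c1': [2, 3, 3], 'c2': [0, 1, 1]}
```

Note that `shuffle` handles one pair per input record: nothing is reduced before the sort, which is the cost the later slides set out to remove.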
  • 17. Scaling Issue: As our volume has grown we have the following stats:
      • Over 500 billion events per month
      • Daily Aggregate 1.5 billion (and growing)
      • 170 billion aggregate records for 92 days
      • 70K Campaigns
      • Over 50 countries
      • We see 15 billion distinct cookies in a month
      • We only need to output 25 million rows
  • 18. Basic Approach Retrospective: Processing speed is not scaling to our needs on a sample of the input data.
      Diagnosis:
      • Most aggregations could not take significant advantage of combiners.
      • Large shuffles caused poor job performance. In some cases large aggregations ran slower on the Hadoop cluster due to shuffle and skew in data for keys.
      Diagnosis:
      • A new approach is required to reduce the shuffle
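
To put the slide 17 figures side by side, a quick calculation (using only the two numbers quoted on the slide; the variable names are illustrative) shows how many input records stand behind each output row:

```python
input_records = 170_000_000_000  # aggregate records for 92 days (slide 17)
output_rows = 25_000_000         # rows needed in the final output (slide 17)

records_per_output_row = input_records // output_rows
print(records_per_output_row)  # 6800
```

In the first version, all of those input records travel through the shuffle even though the output is thousands of times smaller.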
  • 19. Counting Uniques from a Key Ordered Log File. Example log, sorted by key: A, A, A, B, B, C, D. Major downsides:
      • Need to sort data in advance.
      • The sort time increases as volume grows.
  • 20. Counting Uniques from a Key Ordered Log File
  • 21. Counting Uniques from Sharded Key Ordered Log Files
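
For the key-ordered case on slide 19, a minimal Python sketch (function name chosen for illustration, sample log taken from the slide's diagram) shows why sorting helps: equal keys are adjacent, so only the previous key has to be remembered.

```python
def count_uniques_key_ordered(sorted_log):
    """Count distinct keys in a log already sorted by key, with constant extra memory."""
    count = 0
    previous = object()  # sentinel that compares unequal to every real key
    for key in sorted_log:
        if key != previous:  # a new run of equal keys starts here
            count += 1
            previous = key
    return count


log = ["A", "A", "A", "B", "B", "C", "D"]  # keys as drawn on the slide
print(count_uniques_key_ordered(log))  # 4
```

The function accepts any iterable, so it can stream a file far larger than memory; the cost has moved into producing the sorted input.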
  • 22. Solution to reduce the shuffle
      The Problem:
      • Aggregations cannot take advantage of combiners, leading to large shuffles and job performance issues
      The Idea:
      • Partition and sort the data by cookie on a daily basis
      • Create a custom InputFormat to merge daily partitions for monthly aggregations
  • 23. Custom Input Format with Map Side Aggregation [Diagram: each mapper reads the records for a single key (A, B or C), already grouped by the custom InputFormat; a combiner runs after each map, and each reducer produces the result for one key]
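
The idea on slide 22 can be imitated in memory with plain Python. This is a sketch under stated assumptions, not the custom Hadoop InputFormat from the talk: the partition count, the CRC32 hash and all function names are hypothetical, and lists stand in for files. Each day's records are split by cookie and sorted once; the monthly job then only has to merge already-sorted daily partitions.

```python
import heapq
import zlib

NUM_PARTITIONS = 4  # hypothetical; chosen small for the example


def partition_of(cookie, num_partitions=NUM_PARTITIONS):
    """Stable hash, so a cookie lands in the same partition every day."""
    return zlib.crc32(cookie.encode("utf-8")) % num_partitions


def write_daily_partitions(day_records, num_partitions=NUM_PARTITIONS):
    """Split one day's (cookie, campaign) records by cookie and sort each partition."""
    partitions = [[] for _ in range(num_partitions)]
    for cookie, campaign in day_records:
        partitions[partition_of(cookie, num_partitions)].append((cookie, campaign))
    return [sorted(p) for p in partitions]


def merge_partition(daily_partitions, index):
    """Merge the same partition from every day into one cookie-ordered stream."""
    return heapq.merge(*(day[index] for day in daily_partitions))


day1 = [("c3", "x"), ("c1", "x"), ("c2", "y")]
day2 = [("c1", "y"), ("c2", "x"), ("c4", "x")]
daily = [write_daily_partitions(day1), write_daily_partitions(day2)]
for i in range(NUM_PARTITIONS):
    print(i, list(merge_partition(daily, i)))
```

`heapq.merge` reads its inputs lazily, so the merged stream never requires a full re-sort of the combined data.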
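
Map-side aggregation as on slide 23 can be sketched as follows. The code is a simplified Python illustration rather than the talk's implementation: it ignores the period dimension, the function names are invented, and it assumes each mapper receives a stream of (cookie, campaign) pairs sorted by cookie, with every cookie confined to one partition.

```python
from collections import Counter


def map_with_aggregation(sorted_stream):
    """Count unique cookies per campaign within one cookie-ordered partition."""
    partial = Counter()
    previous = None
    for cookie, campaign in sorted_stream:
        if (cookie, campaign) != previous:  # duplicates are adjacent after sorting
            partial[campaign] += 1
            previous = (cookie, campaign)
    return partial


def reduce_counts(partials):
    """Sum per-partition counts; valid because partitions hold disjoint cookies."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


partition_0 = [("c1", "x"), ("c1", "x"), ("c1", "y"), ("c2", "x")]
partition_1 = [("c3", "x"), ("c3", "x")]
partials = [map_with_aggregation(p) for p in (partition_0, partition_1)]
print(dict(reduce_counts(partials)))  # {'x': 3, 'y': 1}
```

Each mapper emits one small counter per campaign instead of one record per event, so the data crossing the shuffle shrinks to roughly the size of the output.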
  • 24. Risks for Partitioning
      • Data locality
          • Custom InputFormat requires reading blocks of the partitioned data over the network
          • This was solved using a feature of the MapR file system. We created volumes and set the chunk size to zero, which guarantees that the data written to a volume will stay on one node
      • Map failures might result in long run times
          • Size of the map inputs is no longer set by block size
          • This was solved by creating a large number (10K) of volumes to limit the size of data processed by each mapper
  • 25. Partitioning Summary
      Benefits:
      • A large portion of the aggregation can be completed in the map phase
      • Applications can now take advantage of combiners
      • Shuffle sizes are minimal
      Results:
      • Took a job from 35 hours to 3 hours with no hardware changes
  • 26. Our Cluster: Production Hadoop Cluster
      • 120 nodes: mix of Dell 720xd, R710 and R510 servers
      • Each R510 has 12x2TB drives, 64GB RAM and 24 cores
      • 3000+ total CPUs
      • 6.0TB total memory
      • 2PB total disk space
      • Our distro is MapR M5 2.1.0
  • 27. Useful Factoids: Colorful, bite-sized graphical representations of the best discoveries we unearth. Visit www.comscoredatamine.com or follow @datagems for the latest gems.
  • 28. Thank You! Michael Brown, CTO, comScore, Inc. mbrown@comscore.com
  • 29. Diagram

Editor's Notes

  1. Key Message: comScore is a global internet technology company providing customers with Analytics for a Digital World. Supporting Talking Points: Founded in 1999, comScore is best known as the gold standard for measuring digital activity, including website visitation, search, video, social and digital advertising. comScore’s data and technologies are well-established, crucial components in measuring and analyzing the rapidly evolving digital world, and are widely deployed at a broad range of publishers, advertising agencies, advertisers, retailers and telecom operators, both in the US and internationally.
  2. In 2011, 400 exabytes of storage were shipped by drive manufacturers
  3. April 2012 Data