Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
Improving MySQL Performance with
Hadoop
Sagar Jauhari, Manish Kumar
India: May 03 – May 04, 2012
San Francisco: September 30 – October 4, 2012




Program Agenda

●   Introduction
●   Inside Hadoop!
●   Integration with MySQL
●   Facebook's usage of MySQL & Hadoop
●   Twitter's usage of MySQL & Hadoop




Introduction
MySQL
   ●          12 million product installations
   ●          65,000 downloads each day
   ●          Part of the rapidly growing open source LAMP stack
   ●          MySQL Commercial Editions Available




Introduction
Hadoop
   ●          Highly scalable Distributed Framework
                 ○          Yahoo! has a 4000 node cluster!
   ●          Extremely powerful in terms of computation
                 ○          Sorts a TB of random integers in 62 seconds!




Introduction
Hadoop is ..
   ●          A scalable system for data storage and processing.
   ●          Fault tolerant
   ●          Parallelizes data processing across many nodes
   ●          Leverages its distributed file system (HDFS)* to
              cheaply and reliably replicate chunks of data.




Introduction
Who uses Hadoop?
 ● Yahoo:
                          ■         Ad Systems and Web Search.
 ● Facebook:
                          ■         Reporting/analytics and machine learning.
 ● Twitter:
                          ■         Data warehousing, data analysis.
 ● Netflix:
                          ■         Movie recommendation algorithm uses Hive (which uses
                                    Hadoop, HDFS & MapReduce underneath)


Introduction
MySQL Vs Hadoop
                   MySQL                         Hadoop
Data capacity      TB+ (may require sharding)    PB+
Data per query     GB?                           PB+
Read/write         Random read/write             Sequential scans, append-only
Query language     SQL                           Java MapReduce, scripting languages, HiveQL
Transactions       Yes                           No
Indexes            Yes                           No
Latency            Sub-second (hopefully)        Minutes to hours
Data structure     Structured                    Structured or unstructured

Courtesy: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010

Inside Hadoop


                                                                       A shallow Deep Dive


Inside Hadoop
HDFS
    ●         A distributed, scalable, and portable file system
              written in Java
    ●         A Hadoop instance typically has a single name-node;
              a cluster of data-nodes forms the HDFS cluster.

[Diagram: Name Node, HDFS, Map / Reduce Workers]



Inside Hadoop
HDFS
    ●         Uses the TCP/IP layer for communication
    ●         Stores large files across multiple machines
    ●         Single name node stores metadata in-memory.

[Diagram: Name Node, HDFS, Map / Reduce Workers]
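
The storage idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not HDFS code: the block size, replication factor, data-node names, file path, and round-robin placement are all hypothetical (real HDFS blocks are far larger and placement is rack-aware).

```python
BLOCK_SIZE = 4       # bytes per block; tiny on purpose (assumption)
REPLICATION = 3      # copies kept of each block (assumption)
DATA_NODES = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical data-node names


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's bytes into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes=DATA_NODES, replication=REPLICATION):
    """Simplified round-robin placement: each block goes to `replication` distinct nodes."""
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


# The name node keeps this mapping in memory: file -> block id -> data nodes.
blocks = split_into_blocks(b"hello hadoop")
namenode_metadata = {"/logs/a.txt": place_blocks(len(blocks))}
```

Losing any single data node in this sketch still leaves two copies of every block, which is the point of replicating chunks across machines.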



Inside Hadoop
HDFS




Inside Hadoop
Map Reduce
    ●         Design Goals
                  ○         Scalability
                  ○         Cost Efficiency
    ●         Implementation
                  ○         User jobs are executed as 'map' and 'reduce' functions
                  ○         Work distribution and fault tolerance are managed by the framework


            Input                                         Map          Shuffle and sort   Reduce   Output




Inside Hadoop
Map Reduce
    ●         Map
                  ○         A Map Reduce job splits input data into independent chunks
                  ○         Each chunk is processed by a map task, in parallel
                  ○         Generic key-value computation




            Input                                         Map          Shuffle and sort   Reduce   Output
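
A minimal Python sketch of the map phase, assuming threads stand in for cluster nodes. The names `split_input` and `map_task`, the chunk size, and the choice of emitting each word with its length are illustrative assumptions, not Hadoop APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def split_input(records, chunk_size):
    """Split the input records into independent chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_task(chunk):
    """Generic key-value computation: emit one (key, value) pair per record."""
    return [(record, len(record)) for record in chunk]


records = ["Hadoop", "MySQL", "Hive", "Sqoop", "Pig", "Hadoop"]
chunks = split_input(records, chunk_size=2)

# Each chunk is handled by its own map task; no task needs another's data.
with ThreadPoolExecutor(max_workers=3) as pool:
    map_outputs = list(pool.map(map_task, chunks))
```

Because the chunks are independent, adding more workers (or nodes) increases throughput without changing `map_task`.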




Inside Hadoop
Map Reduce
    ●         Reduce
                  ○         Data from data nodes is merge sorted so that the key-value
                            pairs for a given key are contiguous
                  ○         The merged data is read sequentially and the values are
                            passed to the reduce method with an iterator reading the
                            input file until the next key value is encountered



            Input                                         Map          Shuffle and sort   Reduce   Output
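
The merge-and-iterate behaviour described on this slide can be approximated in Python with `heapq.merge` and `itertools.groupby`. This is a sketch under assumptions: the three sorted runs are hypothetical map outputs, and `reduce_fn` is an example reduce method that sums values.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Hypothetical outputs of three map tasks, each already sorted by key.
run_a = [("hadoop", 1), ("mysql", 1)]
run_b = [("hive", 1), ("sqoop", 1)]
run_c = [("hadoop", 1), ("pig", 1)]

# Merge the sorted runs so that pairs with the same key become contiguous.
merged = list(heapq.merge(run_a, run_b, run_c, key=itemgetter(0)))


def reduce_fn(key, values):
    """Example reduce method: sum the values seen for one key."""
    return key, sum(values)


# Read the merged data sequentially; each group ends when the next key
# is encountered, so reduce_fn only ever sees one key's values.
results = [
    reduce_fn(key, (value for _, value in group))
    for key, group in groupby(merged, key=itemgetter(0))
]
```

The reduce method receives an iterator rather than a list, so the values for a key never have to be held in memory all at once.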




Inside Hadoop
Map Reduce
          Input          Map          Shuffle and sort          Reduce          Output

[Diagram: three Map tasks feed two Reduce tasks]

     Input (Word):  Hadoop, MySQL, Hive, Sqoop, Pig, Hadoop

     Output:
     Word           Count
     Hadoop         2
     MySQL          1
     Hive           1
     Sqoop          1
     Pig            1


Inside Hadoop
How does Hadoop use Map-Reduce?
    ●         Framework consists of a single master JobTracker
              and one slave TaskTracker per cluster-node.
    ●         Master
                  ○         Schedules the jobs' component tasks on the slaves
                  ○         Monitors the jobs
                  ○         Re-executes the failed tasks
    ●         Slave
                  ○         Executes the tasks as directed by the master.
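
The master/slave division of work can be simulated in Python. This is a toy model under stated assumptions, not the JobTracker implementation: `run_job`, `execute`, the worker names, and the retry policy (move to the next worker, at most three attempts) are all hypothetical.

```python
def run_job(tasks, workers, execute, max_attempts=3):
    """Master loop: schedule each task, monitor the outcome, retry failures elsewhere."""
    results, log = {}, []
    for i, task in enumerate(tasks):
        for attempt in range(max_attempts):
            worker = workers[(i + attempt) % len(workers)]  # next worker on retry
            try:
                results[task] = execute(worker, task)
                log.append((task, worker, "ok"))
                break
            except RuntimeError:
                log.append((task, worker, "failed"))
        else:
            raise RuntimeError(f"task {task!r} failed {max_attempts} times")
    return results, log


def execute(worker, task):
    """Slave side: run the task as directed; worker 'tt2' is broken in this simulation."""
    if worker == "tt2":
        raise RuntimeError("task tracker unavailable")
    return f"{task} done on {worker}"


results, log = run_job(["map-0", "map-1", "reduce-0"], ["tt1", "tt2", "tt3"], execute)
```

In this run `map-1` fails on `tt2` and is re-executed on `tt3`; the caller only sees the final results, not the failure.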



Inside Hadoop
Why Map Reduce ?
    ●         Language support
                  ○         Java, PHP, Hive, Pig, Python, Wukong (Ruby), Rhipe (R).
    ●         Scales Horizontally
    ●         Programmer is isolated from individual failed tasks
                  ○         Tasks are restarted on another node




Inside Hadoop
Map Reduce Limitations
    ●         Not a good fit for problems that exhibit task-driven
              parallelism.
    ●         Requires a particular form of input - a set of (key,
              value) pairs.
    ●         A lot of MapReduce applications end up sharing data
              one way or another.



Integration with MySQL

                                                                           Leveraging Hadoop to Improve MySQL performance


Integration with MySQL

●     The benefits of MySQL to developers are the speed,
      reliability, data integrity and scalability it provides.
●     It can successfully process large amounts of data (in
      petabytes).
●     But applications that require massively parallel
      processing may need the benefits of a parallel
      processing system, such as Hadoop.



Integration with MySQL




Image Source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010



Integration with MySQL
  Problem Statement
Word Count Problem
 ● In a large set of
   documents, find the
   number of occurrences
   of each word.




Integration with MySQL
Word count problem
          Input          Map          Shuffle and sort          Reduce          Output

[Diagram: three Map tasks feed two Reduce tasks]

     Input (Word):  Hadoop, MySQL, Hive, Sqoop, Pig, Hadoop

     Output:
     Word           Count
     Hadoop         2
     MySQL          1
     Hive           1
     Sqoop          1
     Pig            1


Integration with MySQL
Mapping

Key and Value represent a row of data:
key is the byte offset, value is a line.

Map(key, value)
    foreach (word in the value)
        output(word, 1)

Intermediate Output
    <word1>, 1
    <word2>, 1
    <word3>, 1
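
The map step for word count could look like this in Python. It is a sketch, not Hadoop's Java API: `map_fn` and `map_file` are hypothetical names, and the two-line sample text is made up from the words used in this deck.

```python
def map_fn(key, value):
    """key: byte offset of the line in the file; value: the text of the line."""
    for word in value.split():
        yield word, 1


def map_file(text):
    """Feed each line to map_fn, using the line's byte offset as the key."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield from map_fn(offset, line)
        offset += len(line.encode("utf-8"))


intermediate = list(map_file("Hadoop MySQL Hive\nSqoop Pig Hadoop\n"))
```

The mapper ignores the key entirely; the byte offset only identifies which line a record came from.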

Integration with MySQL
Reducing
Hadoop aggregates the keys and calls reduce for each unique key:
    <word1>, (1,1,1,1,1,1…1)
    <word2>, (1,1,1)
    <word3>, (1,1,1,1,1,1)

Reduce(key, list)
    sum the list
    output(key, sum)

Final result:
    <word1>, 45823
    <word2>, 1204
    <word3>, 2693
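
A Python sketch of the aggregation and reduce step, assuming the intermediate (word, 1) pairs are already available in memory. The `shuffle` helper is a hypothetical stand-in for the grouping Hadoop performs between map and reduce, and the input pairs reuse the six words from the word count diagram.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group intermediate (word, 1) pairs by key, as happens before reduce is called."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(key, values):
    """Sum the list of counts for one unique key and output (key, sum)."""
    return key, sum(values)


pairs = [("Hadoop", 1), ("MySQL", 1), ("Hive", 1),
         ("Sqoop", 1), ("Pig", 1), ("Hadoop", 1)]
final = dict(reduce_fn(key, values) for key, values in shuffle(pairs).items())
```

`final` maps each word to its count, with Hadoop counted twice and every other word once, matching the output table of the word count diagram.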



Integration with MySQL

                                                                                       Demo




Integration with MySQL
Video




Facebook's usage of MySQL & Hadoop

● Facebook collects terabytes of data every day from around 800 million
  users.
● MySQL handles pretty much every user interaction: likes,
  shares, status updates, alerts, requests, etc.
● Hadoop/Hive Warehouse
  – 4800 cores, 2 PetaBytes (July 2009)
  – 4800 cores, 12 PetaBytes (Sept 2009)
● Hadoop Archival Store
  – 200 TB



Facebook's usage of MySQL & Hadoop
Hive
    ●         Data warehouse system for Hadoop.
    ●         Facilitates easy data summarization.
    ●         Hive translates HiveQL to MapReduce code.
    ●         Querying
                  ○         Provides a mechanism to project structure onto this data
                  ○         Allows querying the data using a SQL-like language called HiveQL




Facebook's usage of MySQL & Hadoop




Image Source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010


Hive Vs SQL

                       RDBMS                            HIVE
Language               SQL-92 standard (maybe)          Subset of SQL-92 plus Hive-specific extensions
Update capabilities    INSERT, UPDATE and DELETE        INSERT but not UPDATE or DELETE
Transactions           Yes                              No
Latency                Sub-second                       Minutes or more
Indexes                Any number of indexes, very      No indexes, data is always scanned (in parallel)
                       important for performance
Data size              TBs                              PBs
Data per query         GBs                              PBs

Image Source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010


Hadoop Implementation
At Twitter
    ●         > 12 terabytes of new data per day!
    ●         Most stored data is LZO compressed
    ●         Uses Scribe to write logs to Hadoop
                  ○         Scribe: a log collection framework created and
                            open-sourced by Facebook.
    ●         Hadoop used for data warehousing, data analysis.




References

    ●         Leveraging Hadoop to Augment MySQL Deployments - Sarah
              Sproehnle, Cloudera
    ●         http://engineering.twitter.com/2010/04/hadoop-at-twitter.html
    ●         http://semanticvoid.com
    ●         http://michael-noll.com
    ●         http://hadoop.apache.org/




Legal Disclaimer

    ●         All other products, company names, brand names,
              trademarks and logos are the property of their
              respective owners.




Thank You


Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
Β 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
Β 

Recently uploaded (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
Β 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Β 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
Β 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
Β 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
Β 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
Β 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
Β 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
Β 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
Β 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
Β 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
Β 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
Β 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
Β 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
Β 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
Β 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Β 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
Β 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Β 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Β 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Β 

Improving MySQL performance with Hadoop

  • 1. Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
  • 2. Improving MySQL Performance with Hadoop. Sagar Jauhari, Manish Kumar
  • 3. India: May 03 – May 04, 2012. San Francisco: September 30 – October 4, 2012
  • 4. Program Agenda ● Introduction ● Inside Hadoop! ● Integration with MySQL ● Facebook's usage of MySQL & Hadoop ● Twitter's usage of MySQL & Hadoop
  • 5. Introduction: MySQL ● 12 million product installations ● 65,000 downloads each day ● Part of the rapidly growing open source LAMP stack ● MySQL Commercial Editions available
  • 6. Introduction: Hadoop ● Highly scalable distributed framework ○ Yahoo! has a 4000-node cluster! ● Extremely powerful in terms of computation ○ Sorts a TB of random integers in 62 seconds!
  • 7. Introduction: Hadoop is ... ● A scalable system for data storage and processing ● Fault tolerant ● Parallelizes data processing across many nodes ● Leverages its distributed file system (HDFS) to cheaply and reliably replicate chunks of data
  • 8. Introduction: Who uses Hadoop? ● Yahoo: ■ Ad systems and web search ● Facebook: ■ Reporting/analytics and machine learning ● Twitter: ■ Data warehousing, data analysis ● Netflix: ■ Movie recommendation algorithm uses Hive (which uses Hadoop, HDFS & MapReduce underneath)
  • 9. Introduction: MySQL vs Hadoop
      Feature          MySQL                         Hadoop
      Data capacity    TB+ (may require sharding)    PB+
      Data per query   GB?                           PB+
      Read/write       Random read/write             Sequential scans, append-only
      Query language   SQL                           Java MapReduce, scripting languages, HiveQL
      Transactions     Yes                           No
      Indexes          Yes                           No
      Latency          Sub-second (hopefully)        Minutes to hours
      Data structure   Structured                    Structured or unstructured
      Courtesy: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010
  • 10. Inside Hadoop: A Shallow Deep Dive
  • 11. Inside Hadoop: HDFS ● A distributed, scalable, and portable file system written in Java ● An HDFS instance typically has a single name node; a cluster of data nodes forms the HDFS cluster. [Diagram: Name Node, HDFS, Map/Reduce Workers]
  • 12. Inside Hadoop: HDFS ● Uses the TCP/IP layer for communication ● Stores large files across multiple machines ● A single name node stores metadata in memory. [Diagram: Name Node, HDFS, Map/Reduce Workers]
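To make "stores large files across multiple machines" concrete, here is a minimal Python sketch of the kind of block-to-node map a name node might hold in memory. The 64 MB block size, the replication factor of 3, the node names, and the round-robin placement are all assumptions chosen for illustration; real HDFS placement is more involved (it is rack-aware, for instance), and `plan_blocks` is a hypothetical helper, not an HDFS API.

```python
# Illustrative values only; both are assumptions, not taken from the slides.
BLOCK_SIZE = 64 * 1024 * 1024   # bytes per block
REPLICATION = 3                 # copies kept of each block


def plan_blocks(file_size, data_nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Split a file of file_size bytes into blocks and pick nodes for each replica.

    Returns a list of (block_index, block_length, [node, ...]) tuples,
    a toy stand-in for the metadata a name node keeps in memory.
    """
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    plan = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        # Simplified round-robin placement: replicas land on consecutive,
        # distinct nodes, starting one node further along for each block.
        nodes = [data_nodes[(index + r) % len(data_nodes)] for r in range(replication)]
        plan.append((index, length, nodes))
        offset += length
        index += 1
    return plan


# A 150 MB file on a hypothetical four-node cluster.
plan = plan_blocks(150 * 1024 * 1024, ["dn0", "dn1", "dn2", "dn3"])
for block_index, length, nodes in plan:
    print(block_index, length, nodes)
```

Because every block lives on several nodes, losing one data node in this model still leaves copies of each block elsewhere, which is the property the slides describe as cheap, reliable replication.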
  • 13. Inside Hadoop: HDFS
  • 14. Inside Hadoop: Map Reduce ● Design goals ○ Scalability ○ Cost efficiency ● Implementation ○ User jobs are executed as 'map' and 'reduce' functions ○ Work distribution and fault tolerance are managed by the framework. [Diagram: Input → Map → Shuffle and sort → Reduce → Output]
  • 15. Inside Hadoop: Map Reduce ● Map ○ A MapReduce job splits the input data into independent chunks ○ Each chunk is processed by a map task, in parallel with the others ○ Generic key-value computation. [Diagram: Input → Map → Shuffle and sort → Reduce → Output]
  • 16. Inside Hadoop: Map Reduce ● Reduce ○ Data from the data nodes is merge-sorted so that the key-value pairs for a given key are contiguous ○ The merged data is read sequentially, and the values are passed to the reduce method through an iterator that reads the input until the next key is encountered. [Diagram: Input → Map → Shuffle and sort → Reduce → Output]
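The reduce step described above (sorted pairs, one reduce call per run of equal keys, values handed over as an iterator) can be sketched in a few lines of Python. This is an in-memory illustration only, not Hadoop's implementation; `run_reduce`, `sum_values`, and the sample pairs are hypothetical names and data.

```python
from itertools import groupby
from operator import itemgetter


def run_reduce(sorted_pairs, reduce_fn):
    """Call reduce_fn(key, values) once per run of equal keys.

    sorted_pairs must already be sorted by key, so all pairs for a
    given key are contiguous (the job of the merge sort).
    """
    results = []
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        # values is a lazy iterator; it stops when the next key begins.
        values = (value for _, value in group)
        results.append(reduce_fn(key, values))
    return results


def sum_values(key, values):
    """Example reduce function: add up every value seen for the key."""
    return (key, sum(values))


# Unsorted map outputs, as they might arrive from several map tasks.
map_outputs = [("hive", 1), ("hadoop", 1), ("mysql", 1), ("hadoop", 1)]
merged = sorted(map_outputs, key=itemgetter(0))   # stand-in for the merge sort
reduced = run_reduce(merged, sum_values)
print(reduced)
```

The sort is what lets the reducer stream through its input once: it never needs to hold more than one key's values at a time.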
  • 17. Inside Hadoop: Map Reduce [Diagram: Input → Map → Shuffle and sort → Reduce → Output. Input words: Hadoop, MySQL, Hive, Sqoop, Pig, Hadoop. Output (word, count): Hadoop 2, MySQL 1, Hive 1, Sqoop 1, Pig 1]
  • 18. Inside Hadoop: How does Hadoop use MapReduce? ● The framework consists of a single master JobTracker and one slave TaskTracker per cluster node. ● Master ○ Schedules the jobs' component tasks on the slaves ○ Monitors the jobs ○ Re-executes the failed tasks ● Slave ○ Executes the tasks as directed by the master
  • 19. Inside Hadoop: Why MapReduce? ● Language support ○ Java, PHP, Hive, Pig, Python, Wukong (Ruby), Rhipe (R) ● Scales horizontally ● The programmer is isolated from individual failed tasks ○ Tasks are restarted on another node
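The idea that a failed task is simply restarted on another node can be sketched as a small retry loop. This is a toy model under stated assumptions, not the JobTracker's actual scheduling logic: `run_with_retries`, `flaky_runner`, the node names, the task id, and the limit of four attempts are all hypothetical.

```python
def run_with_retries(task, nodes, run_on_node, max_attempts=4):
    """Try the task on successive nodes until one attempt succeeds.

    Returns (result, attempted_nodes). Raises RuntimeError if every
    attempt fails. max_attempts is an assumed limit for illustration.
    """
    attempts = []
    for node in nodes[:max_attempts]:
        attempts.append(node)
        try:
            return run_on_node(node, task), attempts
        except RuntimeError:
            # The caller never sees this failure; move on to the next node.
            continue
    raise RuntimeError(f"task {task!r} failed on all attempted nodes: {attempts}")


def flaky_runner(node, task):
    """Pretend executor in which node-1 is down."""
    if node == "node-1":
        raise RuntimeError("node-1 is down")
    return f"{task} done on {node}"


result, attempts = run_with_retries(
    "map-0007", ["node-1", "node-2", "node-3"], flaky_runner
)
print(result, attempts)
```

The job author writes only the task itself; the retry policy lives in the scheduler, which is what "isolated from individual failed tasks" means in practice.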
  • 20. Inside Hadoop: MapReduce Limitations ● Not a good fit for problems that exhibit task-driven parallelism ● Requires a particular form of input: a set of (key, value) pairs ● A lot of MapReduce applications end up sharing data one way or another
  • 21. Integration with MySQL: Leveraging Hadoop to Improve MySQL Performance
  • 22. Integration with MySQL ● The benefits of MySQL to developers are the speed, reliability, data integrity and scalability it provides. ● It can successfully process large amounts of data (in petabytes). ● But applications that require massively parallel processing may need the benefits of a parallel processing system such as Hadoop.
  • 23. Integration with MySQL. Image source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010
  • 24. Integration with MySQL: Problem Statement. Word count problem ● In a large set of documents, find the number of occurrences of each word.
  • 25. Integration with MySQL: Word Count Problem [Diagram: Input → Map → Shuffle and sort → Reduce → Output. Input words: Hadoop, MySQL, Hive, Sqoop, Pig, Hadoop. Output (word, count): Hadoop 2, MySQL 1, Hive 1, Sqoop 1, Pig 1]
  • 26. Integration with MySQL: Mapping ● Key and value represent a row of data: the key is the byte offset, the value is a line. ● Map (key, value): foreach (word in the value) output (word, 1) ● Intermediate output: <word1>, 1; <word2>, 1; <word3>, 1
  • 27. Integration with MySQL: Reducing ● Hadoop aggregates the keys and calls reduce for each unique key: <word1>, (1,1,1,1,1,1…1); <word2>, (1,1,1); <word3>, (1,1,1,1,1,1) ● Reduce (key, list): sum the list, output (key, sum) ● Final result: <word1>, 45823; <word2>, 1204; <word3>, 2693
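Putting the mapping and reducing slides together, the word count job can be sketched end to end in plain Python. This runs in a single process and only mimics the data flow; it is not Hadoop code, and `map_fn`, `shuffle`, `reduce_fn`, and `word_count` are hypothetical names. The input reuses the six words from the word count diagram, split over two assumed lines.

```python
from collections import defaultdict


def map_fn(offset, line):
    """Map: key is the byte offset (unused here), value is one line of text."""
    for word in line.split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle and sort: collect all values for each key, ordered by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_fn(key, values):
    """Reduce: sum the list of ones for a single word."""
    return (key, sum(values))


def word_count(lines):
    """Run map, shuffle, and reduce over a list of lines."""
    intermediate = []
    offset = 0
    for line in lines:
        intermediate.extend(map_fn(offset, line))
        offset += len(line.encode("utf-8")) + 1   # +1 for the newline
    return dict(reduce_fn(key, values) for key, values in shuffle(intermediate).items())


counts = word_count(["Hadoop MySQL Hive", "Sqoop Pig Hadoop"])
print(counts)
```

On a real cluster the map calls run on many nodes at once and the shuffle moves data over the network, but the three functions keep the same shape.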
  • 28. Integration with MySQL: Demo
  • 29. Integration with MySQL: Video
  • 30. Facebook's usage of MySQL & Hadoop ● Facebook collects terabytes of data every day from around 800 million users. ● MySQL handles pretty much every user interaction: likes, shares, status updates, alerts, requests, etc. ● Hadoop/Hive warehouse: 4800 cores, 2 petabytes (July 2009); 4800 cores, 12 petabytes (Sept 2009) ● Hadoop archival store: 200 TB
  • 31. Facebook's usage of MySQL & Hadoop: Hive ● Data warehouse system for Hadoop ● Facilitates easy data summarization ● Hive translates HiveQL to MapReduce code ● Querying ○ Provides a mechanism to project structure onto this data ○ Allows querying the data using a SQL-like language called HiveQL
  • 32. Facebook's usage of MySQL & Hadoop. Image source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010
  • 33. Hive vs SQL
      Feature               RDBMS                                   Hive
      Language              SQL-92 standard (maybe)                 Subset of SQL-92 plus Hive-specific extensions
      Update capabilities   INSERT, UPDATE and DELETE               INSERT but not UPDATE or DELETE
      Transactions          Yes                                     No
      Latency               Sub-second                              Minutes or more
      Indexes               Any number of indexes, very             No indexes, data is always scanned
                            important for performance               (in parallel)
      Data size             TBs                                     PBs
      Data per query        GBs                                     PBs
      Image source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010
  • 34. Hadoop Implementation at Twitter ● > 12 terabytes of new data per day! ● Most stored data is LZO compressed ● Uses Scribe to write logs to Hadoop ○ Scribe: a log collection framework created and open-sourced by Facebook ● Hadoop is used for data warehousing and data analysis
  • 35. References ● Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010 ● http://engineering.twitter.com/2010/04/hadoop-at-twitter.html ● http://semanticvoid.com ● http://michael-noll.com ● http://hadoop.apache.org/
  • 36. Legal Disclaimer ● All other products, company names, brand names, trademarks and logos are the property of their respective owners.
  • 38. Thank You