Improving MySQL performance with Hadoop
 

Presented at JavaOne & Oracle Develop 2012.

    Improving MySQL performance with Hadoop: Presentation Transcript

    • Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    • Improving MySQL Performance with Hadoop. Sagar Jauhari, Manish Kumar.
    • India: May 03 – May 04, 2012. San Francisco: September 30 – October 4, 2012.
    • Program Agenda
      ● Introduction
      ● Inside Hadoop!
      ● Integration with MySQL
      ● Facebook's usage of MySQL & Hadoop
      ● Twitter's usage of MySQL & Hadoop
    • Introduction: MySQL
      ● 12 million product installations
      ● 65,000 downloads each day
      ● Part of the rapidly growing open source LAMP stack
      ● MySQL Commercial Editions available
    • Introduction: Hadoop
      ● Highly scalable distributed framework
        ○ Yahoo! has a 4,000-node cluster!
      ● Extremely powerful in terms of computation
        ○ Sorts a TB of random integers in 62 seconds!
    • Introduction: Hadoop is...
      ● A scalable system for data storage and processing
      ● Fault tolerant
      ● Parallelizes data processing across many nodes
      ● Leverages its distributed file system (HDFS) to cheaply and reliably replicate chunks of data
    • Introduction: Who uses Hadoop?
      ● Yahoo!: ad systems and web search
      ● Facebook: reporting/analytics and machine learning
      ● Twitter: data warehousing, data analysis
      ● Netflix: movie recommendation algorithm uses Hive (which uses Hadoop, HDFS & MapReduce underneath)
    • Introduction: MySQL vs Hadoop

                          MySQL                         Hadoop
      Data capacity       TB+ (may require sharding)    PB+
      Data per query      GB?                           PB+
      Read/write          Random read/write             Sequential scans, append-only
      Query language      SQL                           Java MapReduce, scripting languages, HiveQL
      Transactions        Yes                           No
      Indexes             Yes                           No
      Latency             Sub-second (hopefully)        Minutes to hours
      Data structure      Structured                    Structured or unstructured

      Courtesy: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010
    • Inside Hadoop: A Shallow Deep Dive
    • Inside Hadoop: HDFS
      ● A distributed, scalable, and portable file system written in Java
      ● A Hadoop HDFS instance typically has a single name node; a cluster of data nodes forms the HDFS cluster
      (Diagram: name node, HDFS, map/reduce workers)
    • Inside Hadoop: HDFS
      ● Uses the TCP/IP layer for communication
      ● Stores large files across multiple machines
      ● A single name node stores metadata in memory
      (Diagram: name node, HDFS, map/reduce workers)
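      In practice an application talks to HDFS through the org.apache.hadoop.fs.FileSystem API (or the hadoop fs shell). A minimal sketch of copying a local file into HDFS and listing the result, assuming a reachable cluster and a hypothetical /user/demo directory:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsCopyExample {
          public static void main(String[] args) throws Exception {
            // Picks up the default file system from core-site.xml on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Copy a local file into HDFS; both paths are placeholders.
            fs.copyFromLocalFile(new Path("/tmp/docs.txt"), new Path("/user/demo/docs.txt"));

            // List the directory to confirm the (replicated) file is there.
            for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
              System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
            fs.close();
          }
        }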
    • Inside Hadoop: HDFS (diagram)
    • Inside Hadoop: MapReduce
      ● Design goals
        ○ Scalability
        ○ Cost efficiency
      ● Implementation
        ○ User jobs are executed as map and reduce functions
        ○ Work distribution and fault tolerance are managed by the framework
      (Diagram: Input → Map → Shuffle and sort → Reduce → Output)
    • Inside Hadoop: MapReduce
      ● Map
        ○ A MapReduce job splits input data into independent chunks
        ○ Each chunk is processed by the map task in a parallel manner
        ○ Generic key-value computation
      (Diagram: Input → Map → Shuffle and sort → Reduce → Output)
    • Inside Hadoop: MapReduce
      ● Reduce
        ○ Data from data nodes is merge-sorted so that the key-value pairs for a given key are contiguous
        ○ The merged data is read sequentially and the values are passed to the reduce method with an iterator reading the input file until the next key value is encountered
      (Diagram: Input → Map → Shuffle and sort → Reduce → Output)
    • Inside Hadoop: MapReduce
      (Diagram: word-count example. Input words Hadoop, MySQL, Hive, Sqoop, Pig, Hadoop flow through Input → Map → Shuffle and sort → Reduce → Output, producing the counts Hadoop 2, MySQL 1, Hive 1, Sqoop 1, Pig 1)
    • Inside Hadoop: How does Hadoop use MapReduce?
      ● The framework consists of a single master JobTracker and one slave TaskTracker per cluster node
      ● Master
        ○ Schedules the jobs' component tasks on the slaves
        ○ Monitors the jobs
        ○ Re-executes failed tasks
      ● Slave
        ○ Executes the tasks as directed by the master
    • Inside Hadoop: Why MapReduce?
      ● Language support
        ○ Java, PHP, Hive, Pig, Python, Wukong (Ruby), Rhipe (R)
      ● Scales horizontally
      ● The programmer is isolated from individual failed tasks
        ○ Tasks are restarted on another node
    • Inside Hadoop: MapReduce limitations
      ● Not a good fit for problems that exhibit task-driven parallelism
      ● Requires a particular form of input: a set of (key, value) pairs
      ● A lot of MapReduce applications end up sharing data one way or another
    • Integration with MySQL: Leveraging Hadoop to Improve MySQL Performance
    • Integration with MySQL
      ● The benefits of MySQL to developers are the speed, reliability, data integrity and scalability it provides.
      ● It can successfully process large amounts of data (in petabytes).
      ● But applications that require massively parallel processing may need the benefits of a parallel processing system such as Hadoop.
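      One common integration path (besides tools like Apache Sqoop) is to read MySQL rows directly into a MapReduce job through Hadoop's JDBC-backed input format. A minimal sketch using the Hadoop 2.x-style Job API, with a hypothetical page_views table and placeholder connection details:

        import java.io.DataInput;
        import java.io.DataOutput;
        import java.io.IOException;
        import java.sql.PreparedStatement;
        import java.sql.ResultSet;
        import java.sql.SQLException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.io.Writable;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
        import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
        import org.apache.hadoop.mapreduce.lib.db.DBWritable;

        // One row of a hypothetical MySQL table page_views(url, view_count).
        public class PageViewRecord implements Writable, DBWritable {
          private String url;
          private long viewCount;

          public void readFields(ResultSet rs) throws SQLException {    // read from MySQL
            url = rs.getString(1);
            viewCount = rs.getLong(2);
          }
          public void write(PreparedStatement ps) throws SQLException { // write back to MySQL (unused here)
            ps.setString(1, url);
            ps.setLong(2, viewCount);
          }
          public void readFields(DataInput in) throws IOException {     // Hadoop serialization
            url = in.readUTF();
            viewCount = in.readLong();
          }
          public void write(DataOutput out) throws IOException {
            out.writeUTF(url);
            out.writeLong(viewCount);
          }

          // Driver fragment: point a MapReduce job's input at the MySQL table.
          public static Job newJob() throws IOException {
            Configuration conf = new Configuration();
            DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://dbhost/analytics", "user", "password");   // placeholder credentials
            Job job = Job.getInstance(conf, "mysql-to-hadoop");
            job.setInputFormatClass(DBInputFormat.class);
            DBInputFormat.setInput(job, PageViewRecord.class,
                "page_views", null /* conditions */, "url" /* orderBy */, "url", "view_count");
            return job;
          }
        }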
    • Integration with MySQL (image source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010)
    • Integration with MySQL: Problem statement
      ● Word count problem: in a large set of documents, find the number of occurrences of each word.
    • Integration with MySQL: Word count problem
      (Diagram: word-count example. Input words Hadoop, MySQL, Hive, Sqoop, Pig, Hadoop flow through Input → Map → Shuffle and sort → Reduce → Output, producing the counts Hadoop 2, MySQL 1, Hive 1, Sqoop 1, Pig 1)
    • Integration with MySQL: Mapping
      ● Key and value represent a row of data: the key is the byte offset, the value is a line.

        Map(key, value):
          for each word in the value:
            output (word, 1)

        Intermediate output:
          <word1>, 1
          <word2>, 1
          <word3>, 1
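      The same mapping step in Java, written against the standard Hadoop MapReduce API (class and field names here are illustrative, not from the talk):

        import java.io.IOException;
        import java.util.StringTokenizer;

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // Input key = byte offset of the line, input value = the line itself;
        // emits (word, 1) for every word in the line.
        public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
              word.set(tokens.nextToken());
              context.write(word, ONE);
            }
          }
        }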
    • Integration with MySQL: Reducing
      ● Hadoop aggregates the keys and calls reduce for each unique key: sum the list.

        Reduce(key, list):
          sum the list
          output (key, sum)

        Intermediate input:
          <word1>, (1,1,1,1,1,1…1)
          <word2>, (1,1,1)
          <word3>, (1,1,1,1,1,1)

        Final result:
          <word1>, 45823
          <word2>, 1204
          <word3>, 2693
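      And the reducing step in Java, again as an illustrative sketch:

        import java.io.IOException;

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Reducer;

        // For each unique word, Hadoop hands the reducer an iterator over all the 1s
        // emitted by the mappers; the reducer sums them and writes the final count.
        public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          private final IntWritable result = new IntWritable();

          @Override
          protected void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
              sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
          }
        }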
    • Integration with MySQL: Demo
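      A demo of this kind boils down to a small driver that wires a mapper and reducer like the ones sketched above into a job and submits it to the cluster; a minimal sketch with placeholder HDFS paths:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCountDriver {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class);  // safe here because summing is associative
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // placeholder paths
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }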
    • Integration with MySQL: Video
    • Facebook's usage of MySQL & Hadoop
      ● Facebook collects terabytes of data every day from around 800 million users.
      ● MySQL handles pretty much every user interaction: likes, shares, status updates, alerts, requests, etc.
      ● Hadoop/Hive warehouse
        ○ 4,800 cores, 2 petabytes (July 2009)
        ○ 4,800 cores, 12 petabytes (Sept 2009)
      ● Hadoop archival store: 200 TB
    • Facebook's usage of MySQL & Hadoop: Hive
      ● Data warehouse system for Hadoop
      ● Facilitates easy data summarization
      ● Hive translates HiveQL to MapReduce code
      ● Querying
        ○ Provides a mechanism to project structure onto this data
        ○ Allows querying the data using a SQL-like language called HiveQL
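      Because HiveQL is SQL-like, querying Hive from Java looks much like querying MySQL over JDBC. A sketch, assuming a HiveServer2 endpoint, the Hive JDBC driver on the classpath, and a hypothetical page_views table:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class HiveQueryExample {
          public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; host, port, database and table are placeholders.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hivehost:10000/default", "user", "");
                 Statement stmt = conn.createStatement();
                 // Hive compiles this HiveQL into MapReduce jobs behind the scenes.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT url, COUNT(*) AS views FROM page_views GROUP BY url")) {
              while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
              }
            }
          }
        }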
    • Facebook's usage of MySQL & Hadoop (image source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010)
    • Hive vs SQL

                              RDBMS                              Hive
      Language                SQL-92 standard (maybe)            Subset of SQL-92 plus Hive-specific extensions
      Update capabilities     INSERT, UPDATE and DELETE          INSERT but not UPDATE or DELETE
      Transactions            Yes                                No
      Latency                 Sub-second                         Minutes or more
      Indexes                 Any number of indexes, very        No indexes, data is always scanned (in parallel)
                              important for performance
      Data size               TBs                                PBs
      Data per query          GBs                                PBs

      Image source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010
    • Hadoop implementation at Twitter
      ● > 12 terabytes of new data per day!
      ● Most stored data is LZO compressed
      ● Uses Scribe to write logs to Hadoop
        ○ Scribe: a log collection framework created and open-sourced by Facebook
      ● Hadoop used for data warehousing, data analysis
    • References
      ● Leveraging Hadoop to Augment MySQL Deployments - Sarah Sproehnle, Cloudera
      ● http://engineering.twitter.com/2010/04/hadoop-at-twitter.html
      ● http://semanticvoid.com
      ● http://michael-noll.com
      ● http://hadoop.apache.org/
    • Legal Disclaimer
      ● All other products, company names, brand names, trademarks and logos are the property of their respective owners.
    • Thank You