Hadoop and MySQL by Chris Schneider

Transcript

  • 1. MySQL and Hadoop, MySQL SF Meetup 2012, Chris Schneider
  • 2. About Me Chris Schneider, Data Architect @ Ning.com (a Glam Media Company) Spent the last ~2 years working with Hadoop (CDH) Spent the last 10 years building MySQL architecture for multiple companies chriss@glam.com
  • 3. What we'll cover: Hadoop, CDH, use cases for Hadoop, MapReduce, Sqoop, Hive, Impala
  • 4. What is Hadoop? An open-source framework for storing and processing data on a cluster of servers. Based on Google's whitepapers on the Google File System (GFS) and MapReduce. Scales linearly. Designed for batch processing. Optimized for streaming reads.
  • 5. The Hadoop Distribution: Cloudera, the only distribution for Apache Hadoop. What Cloudera does: Cloudera Manager; enterprise training (Hadoop admin, Hadoop development, HBase, Hive and Pig); enterprise support
  • 6. Why Hadoop Volume  Use Hadoop when you cannot or should not use traditional RDBMS Velocity  Can ingest terabytes of data per day Variety  You can have structured or unstructured data
  • 7. Use cases for Hadoop. Recommendation engines: Netflix recommends movies. Ad targeting, log processing, search optimization: eBay, Orbitz. Machine learning and classification: Yahoo Mail's spam detection; financial identity theft and credit risk. Social graph: Facebook, LinkedIn, and eHarmony connections. Predicting the outcome of an election before the election: 50 out of 50 states correct, thanks to Nate Silver!
  • 8. Some Details about Hadoop Two Main Pieces of Hadoop Hadoop Distributed File System (HDFS)  Distributed and redundant data storage using many nodes  Hardware will inevitably fail Read and process data with MapReduce  Processing is sent to the data  Many “map” tasks each work on a slice of the data  Failed tasks are automatically restarted on another node or replica
  • 9. MapReduce word count: the key and value together represent a row of data, where the key is the byte offset and the value is the line. map(key, value): foreach (word in value) output (word, 1)
  • 10. Map is used for searching. Input row: (64, "big data is totally cool and big"). Foreach word, MAP emits intermediate output (on local disk): big, 1; data, 1; is, 1; totally, 1; cool, 1; and, 1; big, 1
  • 11. Reduce is used to aggregate. Hadoop groups the intermediate keys and calls a reduce for each unique key, e.g. GROUP BY, ORDER BY. reduce(key, list) receives: big, (1,1); data, (1); is, (1); totally, (1); cool, (1); and, (1). Reduce sums each list and outputs (key, sum): big, 2; data, 1; is, 1; totally, 1; cool, 1; and, 1
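The map and reduce slides above can be condensed into a minimal local simulation: map emits (word, 1) pairs, a shuffle phase groups them by key, and reduce sums each group. This is only an illustrative sketch of the dataflow, not Hadoop API code; the function names are made up.

```python
# Minimal local simulation of the word-count MapReduce job described in the slides.
from collections import defaultdict

def map_fn(key, value):
    # key is the byte offset of the line; value is the line itself.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # values is the list of 1s collected for one word; sum them.
    return (key, sum(values))

def run_job(lines):
    # Shuffle phase: group intermediate (word, 1) pairs by word.
    grouped = defaultdict(list)
    for offset, line in lines:
        for word, one in map_fn(offset, line):
            grouped[word].append(one)
    # Reduce phase: one call per unique key.
    return dict(reduce_fn(w, ones) for w, ones in grouped.items())

counts = run_job([(64, "big data is totally cool and big")])
# counts == {"big": 2, "data": 1, "is": 1, "totally": 1, "cool": 1, "and": 1}
```

In a real cluster the shuffle happens across the network between many map and reduce tasks, but the per-key contract is exactly this.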
  • 12. Where does Hadoop fit in? Think of Hadoop as an augmentation of your traditional RDBMS: you want to store years of data; you need to aggregate all of the data over many years' time; you want/need ALL your data stored and accessible, not forgotten or deleted; and you need this to be free software running on commodity hardware.
  • 13. Where does Hadoop fit in? [Architecture diagram: HTTP traffic feeds a tier of MySQL servers; Sqoop or custom ETL moves data into a Hadoop (CDH4) cluster of NameNode, NameNode2, Secondary NameNode, JobTracker, and DataNodes; Hive and Pig sit on top, feeding Tableau business analytics; Sqoop exports results back to MySQL.]
  • 14. Data Flow MySQL is used for OLTP data processing ETL process moves data from MySQL to Hadoop  Cron job – Sqoop OR  Cron job – Custom ETL Use MapReduce to transform data, run batch analysis, join data, etc… Export transformed results to OLAP or back to OLTP, for example, a dashboard of aggregated data or report
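The "cron job – Sqoop" step above can be sketched as a crontab entry; the hostname, schema, table, and paths below are illustrative assumptions, not from the slides:

```shell
# Nightly at 2am: pull the OLTP orders table from MySQL into HDFS.
# Host, database, table, and directories are hypothetical examples.
0 2 * * * /usr/bin/sqoop import \
  --connect jdbc:mysql://oltp.example.com/shop \
  --table orders \
  --target-dir /data/raw/orders/$(date +\%Y-\%m-\%d) \
  >> /var/log/sqoop_orders.log 2>&1
```

Note that `%` is special in crontab lines and must be escaped as `\%`, hence the date format above.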
  • 15. MySQL vs. Hadoop:
    Data capacity — MySQL: depends, (TB)+; Hadoop: PB+
    Data per query/MR — MySQL: depends (MB -> GB); Hadoop: PB+
    Read/write — MySQL: random read/write; Hadoop: sequential scans, append-only
    Query language — MySQL: SQL; Hadoop: MapReduce, scripted streaming, HiveQL, Pig Latin
    Transactions — MySQL: yes; Hadoop: no
    Indexes — MySQL: yes; Hadoop: no
    Latency — MySQL: sub-second; Hadoop: minutes to hours
    Data structure — MySQL: relational; Hadoop: both structured and unstructured
    Enterprise and community support — MySQL: yes; Hadoop: yes
  • 16. About Sqoop: open source; the name stands for SQL-to-Hadoop. Parallel import and export between Hadoop and various RDBMSs. Default implementation is JDBC. Optimized for MySQL, but not for performance. Integrated with connectors for Oracle, Netezza, Teradata (not open source).
  • 17. Sqoop data into Hadoop: $ sqoop import --connect jdbc:mysql://example.com/world --table City --fields-terminated-by '\t' --lines-terminated-by '\n' This command will submit a Hadoop job that queries your MySQL server and reads all the rows from world.City. The resulting TSV file(s) will be stored in HDFS.
  • 18. Sqoop features: you can restrict an import to specific rows with the --where flag (and to specific columns with --columns). Controlled parallelism: parallel mappers/connections (--num-mappers); specify the column to split on (--split-by). Incremental loads. Integration with Hive and HBase.
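A single import combining the knobs listed above might look like the sketch below; the flags (`--where`, `--num-mappers`, `--split-by`, `--incremental`) are standard Sqoop options, while the predicate and last-value threshold are made up for illustration:

```shell
# Import only large cities, using 4 parallel mappers split on the primary key,
# and fetch only rows appended since the last run (ID > 4000 is hypothetical).
sqoop import \
  --connect jdbc:mysql://example.com/world \
  --table City \
  --where "Population > 100000" \
  --num-mappers 4 \
  --split-by ID \
  --incremental append \
  --check-column ID \
  --last-value 4000
```

For repeated incremental runs, a saved `sqoop job` can track the last-value state automatically instead of passing it by hand.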
  • 19. Sqoop export: $ sqoop export --connect jdbc:mysql://example.com/world --table City --export-dir /hdfs_path/City_data The City table needs to exist. CSV-formatted by default. Can use a staging table (--staging-table).
  • 20. About Hive: offers a way around the complexities of MapReduce/Java. Hive is an open-source project managed by the Apache Software Foundation. Facebook uses Hadoop and wanted non-Java employees to be able to access data: a language based on SQL, easy to learn and use, so data is available to many more people. Hive is a SQL SELECT statement to MapReduce translator.
  • 21. More About Hive Hive is NOT a replacement for RDBMS  Not all SQL works Hive is only an interpreter that converts HiveQL to MapReduce HiveQL queries can take many seconds or minutes to produce a result set
  • 22. RDBMS vs. Hive:
    Language — RDBMS: SQL; Hive: a subset of SQL along with Hive extensions
    Transactions — RDBMS: yes; Hive: no
    ACID — RDBMS: yes; Hive: no
    Latency — RDBMS: sub-second (indexed data); Hive: many seconds to minutes (non-indexed data)
    Updates — RDBMS: INSERT, UPDATE, DELETE, REPLACE; Hive: INSERT OVERWRITE [IGNORE]
  • 23. Sqoop and Hive: $ sqoop import --connect jdbc:mysql://example.com/world --table City --hive-import Alternatively, you can create table(s) within the Hive CLI and run an "fs -put" with an exported CSV file on the local file system.
  • 24. Impala It‟s new, it‟s fast Allows real time analytics on very large data sets Runs on top of HIVE Based off of Google‟s Dremel  http://research.google.com/pubs/pub36632.html Cloudera VM for Impala  https://ccp.cloudera.com/display/SUPPORT/Downlo ads
  • 25. Thanks, everyone. Questions? Good references: Cloudera.com; http://infolab.stanford.edu/~ragho/hive-icde2010.pdf VM downloads: https://ccp.cloudera.com/display/SUPPORT/Cloudera%27s+Hadoop+Demo+VM+for+CDH4