MySQL and Hadoop
MySQL SF Meetup 2012
Chris Schneider
About Me
• Chris Schneider, Data Architect @ Ning.com (a Glam Media Company)
• Spent the last ~2 years working with Hadoop (CDH)
• Spent the last 10 years building MySQL architecture for multiple companies
• chriss@glam.com
What we'll cover
• Hadoop
• CDH
• Use cases for Hadoop
• MapReduce
• Sqoop
• Hive
• Impala
What is Hadoop?
• An open-source framework for storing and processing data on a cluster of servers
• Based on Google's papers on the Google File System (GFS) and MapReduce
• Scales out roughly linearly as servers are added
• Designed for batch processing
• Optimized for streaming reads
The Hadoop Distribution
• Cloudera
  • Cloudera's Distribution Including Apache Hadoop (CDH), one of several Apache Hadoop distributions
• What Cloudera Does
  • Cloudera Manager
  • Enterprise Training
    • Hadoop Admin
    • Hadoop Development
    • HBase
    • Hive and Pig
  • Enterprise Support
Why Hadoop
• Volume
  • Use Hadoop when you cannot or should not use a traditional RDBMS
• Velocity
  • Can ingest terabytes of data per day
• Variety
  • You can have structured or unstructured data
Use cases for Hadoop
• Recommendation engine
  • Netflix recommends movies
• Ad targeting, log processing, search optimization
  • eBay, Orbitz
• Machine learning and classification
  • Yahoo Mail's spam detection
  • Financial: identity theft and credit risk
• Social graph
  • Facebook, LinkedIn and eHarmony connections
• Predicting the outcome of an election before the election: 50 out of 50 states called correctly, thanks to Nate Silver!
Some Details about Hadoop
• Two main pieces of Hadoop
• Hadoop Distributed File System (HDFS)
  • Distributed and redundant data storage using many nodes
  • Hardware will inevitably fail
• Read and process data with MapReduce
  • Processing is sent to the data
  • Many "map" tasks each work on a slice of the data
  • Failed tasks are automatically restarted on another node or replica
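
The restart behaviour described above can be illustrated with a small simulation. This is only a sketch in plain Python, not Hadoop's scheduler: the names `run_with_retry` and `count_lines` and the node names are hypothetical, and "node-a" is hard-coded to fail so that the retry path is exercised.

```python
from typing import Callable, List, Sequence


def run_with_retry(task: Callable[[str, Sequence[str]], int],
                   data_slice: Sequence[str],
                   nodes: List[str]) -> int:
    """Try the task on each node in turn until one succeeds."""
    last_error = None
    for node in nodes:
        try:
            return task(node, data_slice)
        except RuntimeError as err:
            last_error = err  # remember the failure and move to the next node
    raise RuntimeError(f"task failed on all nodes: {last_error}")


def count_lines(node: str, data_slice: Sequence[str]) -> int:
    """Toy map task; 'node-a' always fails to simulate bad hardware."""
    if node == "node-a":
        raise RuntimeError("simulated hardware failure on node-a")
    return len(data_slice)


lines = ["l1", "l2", "l3", "l4", "l5"]
# Each task works on its own slice of the data (two lines per slice here).
slices = [lines[i:i + 2] for i in range(0, len(lines), 2)]
results = [run_with_retry(count_lines, s, ["node-a", "node-b"]) for s in slices]
print(results)
```

Every slice is first attempted on the failing node and then succeeds on the second one, so the job as a whole still completes.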
MapReduce Word Count
• The key and value together represent a row of data, where the key is the byte offset and the value is the line

map (key, value)
  foreach (word in value)
    output (word, 1)
Map is used for Searching

Input record (key = byte offset, value = line):
  64, big data is totally cool and big
  …

MAP: foreach word in the line, output (word, 1)

Intermediate Output (on local disk):
  big, 1
  data, 1
  is, 1
  totally, 1
  cool, 1
  and, 1
  big, 1
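
A minimal Python sketch of the map step shown on this slide. The function name `map_words` is hypothetical, and plain Python stands in for the Hadoop API; the sketch only illustrates how one input line turns into (word, 1) pairs.

```python
from typing import Iterator, Tuple


def map_words(key: int, value: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every word in one input line.

    key is the byte offset of the line (unused here), value is the line text.
    """
    for word in value.split():
        yield (word, 1)


# The example record from the slide: byte offset 64 and its line.
intermediate = list(map_words(64, "big data is totally cool and big"))
print(intermediate)
```

Note that "big" is emitted twice; the map step does no aggregation of its own.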
Reduce is used to aggregate
Hadoop groups the intermediate values by key and calls reduce once for each unique key, comparable to GROUP BY and ORDER BY in SQL.

reduce (key, list)
  sum the list
  output (key, sum)

Input to Reduce:
  big, (1,1)
  data, (1)
  is, (1)
  totally, (1)
  cool, (1)
  and, (1)

Output:
  big, 2
  data, 1
  is, 1
  totally, 1
  cool, 1
  and, 1
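
The grouping and reduce steps can be sketched in Python as follows. The helper names `shuffle` and `reduce_counts` are hypothetical, and the in-memory dictionary only stands in for the work the framework does between map and reduce.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values by key, as happens between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the list of counts for one unique key."""
    return (key, sum(values))


# Intermediate output of the map step for the example line.
pairs = [("big", 1), ("data", 1), ("is", 1), ("totally", 1),
         ("cool", 1), ("and", 1), ("big", 1)]
grouped = shuffle(pairs)
# Keys are visited in sorted order, one reduce call per unique key.
totals = [reduce_counts(key, grouped[key]) for key in sorted(grouped)]
print(totals)
```

Only "big" ends up with a list of two values, which is why it is the only word with a count of 2.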
Where does Hadoop fit in?
• Think of Hadoop as an augmentation of your traditional RDBMS
• You want to store years of data
• You need to aggregate all of the data over many years
• You want/need ALL your data stored and accessible, not forgotten or deleted
• You need this to be free software running on commodity hardware
Where does Hadoop fit in?
[Architecture diagram: three HTTP front ends on top of six MySQL servers; "Sqoop or ETL" and "Sqoop" links between MySQL and a Hadoop (CDH4) cluster consisting of NameNode, NameNode2, Secondary NameNode, JobTracker and eight DataNodes; Hive, Pig and Tableau (business analytics) on the Hadoop side]
Data Flow
• MySQL is used for OLTP data processing
• An ETL process moves data from MySQL to Hadoop
  • Cron job: Sqoop
    OR
  • Cron job: custom ETL
• Use MapReduce to transform data, run batch analysis, join data, etc.
• Export transformed results to OLAP or back to OLTP, for example a dashboard of aggregated data or a report
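
As a rough illustration of the "cron job plus Sqoop" step, a scheduled Python script could assemble the import command like this. The function name `build_sqoop_import`, the target directory and the default of four mappers are assumptions made for the example; only the Sqoop flags themselves are real.

```python
import shlex
from typing import List


def build_sqoop_import(jdbc_url: str, table: str, target_dir: str,
                       num_mappers: int = 4) -> List[str]:
    """Build the argv for a scheduled `sqoop import` of one MySQL table."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
    ]


# The HDFS target path below is a placeholder, not taken from the slides.
argv = build_sqoop_import("jdbc:mysql://example.com/world", "City",
                          "/etl/world/City")
print(shlex.join(argv))
```

A cron-driven script would then hand `argv` to something like `subprocess.run(argv, check=True)`; the sketch stops at building the command so that it runs without a Hadoop cluster.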
                                   MySQL                 Hadoop
Data Capacity                      Depends, (TB)+        PB+
Data per query/MR                  Depends, (MB -> GB)   PB+
Read/Write                         Random read/write     Sequential scans, append-only
Query Language                     SQL                   MapReduce, scripted streaming, HiveQL, Pig Latin
Transactions                       Yes                   No
Indexes                            Yes                   No
Latency                            Sub-second            Minutes to hours
Data structure                     Relational            Both structured and unstructured
Enterprise and Community Support   Yes                   Yes
About Sqoop
• Open source; the name stands for SQL-to-Hadoop
• Parallel import and export between Hadoop and various RDBMSs
• Default implementation is JDBC
• Works with MySQL, but the default JDBC path is not optimized for performance
• Connectors are available for Oracle, Netezza and Teradata (not open source)
Sqoop Data Into Hadoop
  $ sqoop import --connect jdbc:mysql://example.com/world \
      --table City \
      --fields-terminated-by '\t' \
      --lines-terminated-by '\n'

• This command submits a Hadoop job that queries your MySQL server and reads all the rows from world.City
• The resulting TSV file(s) are stored in HDFS
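
To make the output format concrete, here is a small Python sketch that parses text in the layout this import would produce: tabs between fields and newlines between records. The two rows are illustrative sample values, and the column list assumes the usual layout of the `City` table in MySQL's `world` sample database.

```python
import csv
import io

# Two illustrative rows: tab between fields, newline between records.
tsv_text = "1\tKabul\tAFG\tKabol\t1780000\n2\tQandahar\tAFG\tQandahar\t237500\n"

# Assumed column order of world.City.
columns = ["ID", "Name", "CountryCode", "District", "Population"]
reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
rows = [dict(zip(columns, fields)) for fields in reader]

for row in rows:
    print(row["Name"], int(row["Population"]))
```

All fields arrive as strings, so numeric columns such as `Population` need an explicit conversion.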
Sqoop Features
• You can choose specific tables (--table), columns (--columns) or rows (--where) to import
• Controlled parallelism
  • Parallel mappers/connections (--num-mappers)
  • Specify the column to split on (--split-by)
• Incremental loads (--incremental)
• Integration with Hive and HBase
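
The idea behind --split-by and --num-mappers can be sketched as dividing the value range of the split column into one contiguous chunk per mapper. The function below is a simplified illustration, not Sqoop's actual splitter; the name `split_ranges` and the ID range 1 to 4079 are assumptions made for the example.

```python
from typing import List, Tuple


def split_ranges(min_id: int, max_id: int, num_mappers: int) -> List[Tuple[int, int]]:
    """Divide the inclusive range [min_id, max_id] into contiguous chunks,
    one per mapper, sized as evenly as possible.

    Assumes min_id <= max_id and num_mappers >= 1.
    """
    total = max_id - min_id + 1
    mappers = min(num_mappers, total)  # never create empty ranges
    base, extra = divmod(total, mappers)
    ranges = []
    start = min_id
    for i in range(mappers):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Each mapper would then read only its own slice of the table.
for lo, hi in split_ranges(1, 4079, 4):
    print(f"SELECT * FROM City WHERE ID >= {lo} AND ID <= {hi}")
```

An evenly distributed split column keeps the mappers balanced; a skewed one leaves some mappers with far more rows than others.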
Sqoop Export
  $ sqoop export --connect jdbc:mysql://example.com/world \
      --table City \
      --export-dir /hdfs_path/City_data

• The City table needs to exist in MySQL before the export
• Input files are expected to be comma-separated (CSV) by default
• Can use a staging table (--staging-table)
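
The staging-table pattern can be illustrated with a short Python sketch. SQLite stands in for MySQL here so the example is self-contained, and the table name `City_staging` and the sample rows are assumptions; the point is only that rows land in a staging table first and are moved to the destination in a single transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE City (ID INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("CREATE TABLE City_staging (ID INTEGER PRIMARY KEY, Name TEXT)")

exported_rows = [(1, "Kabul"), (2, "Qandahar")]  # sample data

# Step 1: load the staging table; a failure here leaves City untouched.
conn.executemany("INSERT INTO City_staging VALUES (?, ?)", exported_rows)
conn.commit()

# Step 2: move everything to the destination in one transaction.
with conn:
    conn.execute("INSERT INTO City SELECT * FROM City_staging")
    conn.execute("DELETE FROM City_staging")

count = conn.execute("SELECT COUNT(*) FROM City").fetchone()[0]
staged = conn.execute("SELECT COUNT(*) FROM City_staging").fetchone()[0]
print(count, staged)
```

If the second step fails, the transaction rolls back and the destination table never shows a partial export.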
About Hive
• Offers a way around the complexities of MapReduce/Java
• Hive is an open-source project managed by the Apache Software Foundation
• Facebook uses Hadoop and wanted employees who do not write Java to be able to access data
  • Language based on SQL
  • Easy to learn and use
  • Data is available to many more people
• Hive translates SQL SELECT statements into MapReduce jobs
More About Hive
• Hive is NOT a replacement for an RDBMS
  • Not all SQL works
• Hive is only a translator that converts HiveQL to MapReduce
• HiveQL queries can take many seconds or
  minutes to produce a result set
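
To show what "converting HiveQL to MapReduce" means conceptually, consider a query such as `SELECT CountryCode, COUNT(*) FROM City GROUP BY CountryCode`. The Python sketch below expresses it as a map step and a reduce step. It is not Hive's actual query plan; the function names and sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Row = Dict[str, str]


def map_row(row: Row) -> Iterator[Tuple[str, int]]:
    """Map step: emit the GROUP BY column as the key and 1 as the value."""
    yield (row["CountryCode"], 1)


def reduce_group(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: COUNT(*) is the sum of the 1s emitted for this key."""
    return (key, sum(values))


rows = [  # sample rows
    {"Name": "Kabul", "CountryCode": "AFG"},
    {"Name": "Qandahar", "CountryCode": "AFG"},
    {"Name": "Amsterdam", "CountryCode": "NLD"},
]

# Group the mapped values by key, then reduce each group.
grouped: Dict[str, List[int]] = defaultdict(list)
for row in rows:
    for key, value in map_row(row):
        grouped[key].append(value)

result = [reduce_group(key, grouped[key]) for key in sorted(grouped)]
print(result)
```

Because every query becomes one or more batch jobs of this shape, even a simple aggregate pays the job start-up cost, which is where the seconds-to-minutes latency comes from.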
RDBMS vs Hive
               RDBMS                                             Hive
Language       SQL                                               Subset of SQL along with Hive extensions
Transactions   Yes                                               No
ACID           Yes                                               No
Latency        Sub-second (indexed data)                         Many seconds to minutes (non-indexed data)
Updates?       Yes: INSERT [IGNORE], UPDATE, DELETE, REPLACE     INSERT OVERWRITE
Sqoop and Hive
  $ sqoop import --connect jdbc:mysql://example.com/world \
      --table City \
      --hive-import

• Alternatively, you can create the table(s) within the Hive CLI and run "hadoop fs -put" to copy an exported CSV file from the local file system into HDFS
Impala
• It's new, it's fast
• Allows real-time analytics on very large data sets
• Runs on top of Hive (it shares the Hive metastore)
• Based on Google's Dremel
  • http://research.google.com/pubs/pub36632.html
• Cloudera VM for Impala
  • https://ccp.cloudera.com/display/SUPPORT/Downloads
Thanks Everyone
• Questions?
• Good references
  • Cloudera.com
  • http://infolab.stanford.edu/~ragho/hive-icde2010.pdf
• VM downloads
  • https://ccp.cloudera.com/display/SUPPORT/Cloudera%27s+Hadoop+Demo+VM+for+CDH4
