This document discusses MySQL and Hadoop. It provides an overview of Hadoop, Cloudera Distribution of Hadoop (CDH), MapReduce, Hive, Impala, and how MySQL can interact with Hadoop using Sqoop. Key use cases for Hadoop include recommendation engines, log processing, and machine learning. The document also compares MySQL and Hadoop in terms of data capacity, query languages, and support.
2. About Me
Chris Schneider, Data Architect @ Ning.com (a Glam Media Company)
Spent the last ~2 years working with Hadoop (CDH)
Spent the last 10 years building MySQL architecture for multiple companies
chriss@glam.com
3. What we'll cover
Hadoop
CDH
Use cases for Hadoop
MapReduce
Sqoop
Hive
Impala
4. What is Hadoop?
An open-source framework for storing and processing data on a cluster of servers
Based on Google's whitepapers on the Google File System (GFS) and MapReduce
Scales linearly
Designed for batch processing
Optimized for streaming reads
5. The Hadoop Distribution
Cloudera
A widely used distribution of Apache Hadoop
What Cloudera does:
Cloudera Manager
Enterprise training
Hadoop Admin
Hadoop Development
HBase
Hive and Pig
Enterprise support
6. Why Hadoop
Volume
Use Hadoop when you cannot or should not use a traditional RDBMS
Velocity
Can ingest terabytes of data per day
Variety
You can have structured or unstructured data
7. Use cases for Hadoop
Recommendation engines
Netflix recommends movies
Ad targeting, log processing, search optimization
eBay, Orbitz
Machine learning and classification
Yahoo Mail's spam detection
Financial: identity theft and credit risk
Social graph
Facebook, LinkedIn and eHarmony connections
Predicting the outcome of an election before the election: 50 out of 50 states correct, thanks to Nate Silver!
8. Some Details about Hadoop
Two main pieces of Hadoop:
Hadoop Distributed File System (HDFS)
Distributed and redundant data storage using many nodes
Hardware will inevitably fail
Read and process data with MapReduce
Processing is sent to the data
Many "map" tasks each work on a slice of the data
Failed tasks are automatically restarted on another node or replica
9.
10. MapReduce Word Count
The key and value together represent a row of data, where the key is the byte offset and the value is the line:
map (key, value):
    foreach (word in value):
        output (word, 1)
11. Map is used for searching
Input: 64, "big data is totally cool and big"
MAP: foreach word, output (word, 1)
Intermediate output (on local disk):
big, 1
data, 1
is, 1
totally, 1
cool, 1
and, 1
big, 1
12. Reduce is used to aggregate
Hadoop aggregates the keys and calls a reduce for each unique key, e.g. GROUP BY, ORDER BY:
reduce (key, list):
    sum the list
    output (key, sum)
REDUCE input:
big, (1,1)
data, (1)
is, (1)
totally, (1)
cool, (1)
and, (1)
Output:
big, 2
data, 1
is, 1
totally, 1
cool, 1
and, 1
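The map and reduce steps above can be sketched as a single-process Python simulation (illustrative only; the helper names are made up for this sketch, and real Hadoop distributes the map tasks and the shuffle across the cluster):

```python
from collections import defaultdict

def map_phase(offset, line):
    # Slide 10: emit (word, 1) for every word in the input line.
    # The offset plays the role of the byte-offset key.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Between map and reduce, Hadoop groups intermediate pairs by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Slide 12: sum the list of 1s for each unique word.
    return (key, sum(values))

line = "big data is totally cool and big"
intermediate = list(map_phase(64, line))          # [('big', 1), ('data', 1), ...]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'big': 2, 'data': 1, 'is': 1, 'totally': 1, 'cool': 1, 'and': 1}
```

In a real cluster many map tasks each process one slice of the file and only the shuffle moves data between nodes; the logic per task is exactly this simple.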
13. Where does Hadoop fit in?
Think of Hadoop as an augmentation of your traditional RDBMS system:
You want to store years of data
You need to aggregate all of the data over many years' time
You want/need ALL your data stored and accessible, not forgotten or deleted
You need this to be free software running on commodity hardware
14. Where does Hadoop fit in?
(Architecture diagram: http servers write to MySQL; Sqoop or ETL moves the data into a Hadoop (CDH4) cluster of a NameNode, Secondary NameNode, JobTracker and DataNodes; Hive and Pig on top of the cluster feed business analytics tools such as Tableau, and Sqoop exports results back to MySQL.)
15. Data Flow
MySQL is used for OLTP data processing
An ETL process moves data from MySQL to Hadoop:
Cron job running Sqoop, OR
Cron job running custom ETL
Use MapReduce to transform data, run batch analysis, join data, etc.
Export transformed results to OLAP or back to OLTP, for example a dashboard of aggregated data or a report
16. MySQL vs. Hadoop
                                  MySQL               Hadoop
Data capacity                     Depends, (TB)+      PB+
Data per query/MR                 Depends (MB -> GB)  PB+
Read/write                        Random read/write   Sequential scans, append-only
Query language                    SQL                 MapReduce, scripted streaming, HiveQL, Pig Latin
Transactions                      Yes                 No
Indexes                           Yes                 No
Latency                           Sub-second          Minutes to hours
Data structure                    Relational          Both structured and unstructured
Enterprise and community support  Yes                 Yes
17. About Sqoop
Open source; stands for SQL-to-Hadoop
Parallel import and export between Hadoop and various RDBMSs
Default implementation is JDBC
Optimized for MySQL, but not for raw performance
Integrated with connectors for Oracle, Netezza, Teradata (not open source)
18. Sqoop Data Into Hadoop
$ sqoop import --connect jdbc:mysql://example.com/world \
    --table City \
    --fields-terminated-by '\t' \
    --lines-terminated-by '\n'
This command submits a Hadoop job that queries your MySQL server and reads all the rows from world.City
The resulting TSV file(s) will be stored in HDFS
19. Sqoop Features
You can choose specific tables or columns to import (--columns), and filter rows with the --where flag
Controlled parallelism:
Parallel mappers/connections (--num-mappers)
Specify the column to split on (--split-by)
Incremental loads
Integration with Hive and HBase
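Combining these flags, an incremental parallel import might look like the following sketch (not from the deck; it assumes world.City has a numeric primary key ID and a CountryCode column, and that the highest ID already imported was 4000):

```shell
# Pull only rows added since the last run, split across 4 mappers.
sqoop import \
  --connect jdbc:mysql://example.com/world \
  --table City \
  --split-by ID \
  --num-mappers 4 \
  --where "CountryCode = 'USA'" \
  --incremental append \
  --check-column ID \
  --last-value 4000
```

Run from cron, a job like this keeps HDFS in sync with MySQL without re-copying the whole table each night.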
20. Sqoop Export
$ sqoop export --connect jdbc:mysql://example.com/world \
    --table City \
    --export-dir /hdfs_path/City_data
The City table needs to exist
CSV-formatted by default
Can use a staging table (--staging-table)
21. About Hive
Offers a way around the complexities of MapReduce/Java
Hive is an open-source project managed by the Apache Software Foundation
Facebook uses Hadoop and wanted non-Java employees to be able to access data
Language based on SQL
Easy to learn and use
Data is available to many more people
Hive is a SQL SELECT statement to MapReduce translator
22. More About Hive
Hive is NOT a replacement for an RDBMS
Not all SQL works
Hive is only an interpreter that converts HiveQL to MapReduce
HiveQL queries can take many seconds or minutes to produce a result set
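As an illustration of that translation, the word count from slides 10-12 collapses into a single HiveQL statement (a sketch, assuming a hypothetical Hive table docs with a single STRING column line):

```sql
-- Hive compiles this SELECT into the same map (split and emit)
-- and reduce (group and count) stages shown on the earlier slides.
SELECT word, COUNT(*) AS cnt
FROM docs
LATERAL VIEW explode(split(line, ' ')) t AS word
GROUP BY word;
```

This is why HiveQL queries take seconds to minutes: each statement still launches a full MapReduce job behind the scenes.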
23. RDBMS vs Hive
              RDBMS                                          Hive
Language      SQL                                            Subset of SQL along with Hive extensions
Transactions  Yes                                            No
ACID          Yes                                            No
Latency       Sub-second (indexed data)                      Many seconds to minutes (non-indexed data)
Updates?      Yes: INSERT [IGNORE], UPDATE, DELETE, REPLACE  INSERT OVERWRITE
24. Sqoop and Hive
$ sqoop import --connect jdbc:mysql://example.com/world \
    --table City \
    --hive-import
Alternatively, you can create table(s) within the Hive CLI and run an "fs -put" with an exported CSV file on the local file system
25. Impala
It's new, it's fast
Allows real-time analytics on very large data sets
Works alongside Hive, sharing Hive's metadata
Based on Google's Dremel:
http://research.google.com/pubs/pub36632.html
Cloudera VM for Impala:
https://ccp.cloudera.com/display/SUPPORT/Downloads
26. Thanks Everyone
Questions?
Good references:
Cloudera.com
http://infolab.stanford.edu/~ragho/hive-icde2010.pdf
VM downloads:
https://ccp.cloudera.com/display/SUPPORT/Cloudera%27s+Hadoop+Demo+VM+for+CDH4