If you are interested in Hadoop and its capabilities but are not sure where to begin, this is the session for you. Learn the basics of Hadoop, see how to spin up a development cluster in the cloud or on-premises, and start exploring ETL processing with SQL and other familiar tools.
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
1. From Zero to Hadoop
March 19, 2013
2. Agenda
• Hadoop Ecosystem Overview
• Hadoop Core Technical Overview
  • HDFS
  • MapReduce
• Hadoop in the Enterprise
  • Cluster Planning
  • Cluster Management with Cloudera Manager
4. Hadoop Ecosystem
[Architecture diagram: the CDH stack across the INGEST, STORE, EXPLORE, PROCESS, ANALYZE, and SERVE stages. Core storage and compute (required): HDFS, HBase, MapReduce, MapReduce2, YARN, ZooKeeper. Integration and access: Sqoop, Flume, Hue, Oozie, Whirr, Hive, Pig, Mahout, DataFu, Impala, FUSE-DFS, WebHDFS / HttpFS, ODBC / JDBC. Management software and connectors: Cloudera Manager and Cloudera Navigator (v1.0, with audit and lineage), plus technical support and subscription options for BI, ETL, and RDBMS integration.]
5. Sqoop
Performs bidirectional data transfers between Hadoop and almost any SQL database with a JDBC driver.
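For example, a single table can be pulled into HDFS from the command line. A minimal sketch, assuming a MySQL source; the connection string, credentials, table, and target directory are all illustrative:

    sqoop import \
      --connect jdbc:mysql://db.example.com/sales \
      --username etl_user -P \
      --table orders \
      --target-dir /user/etl/orders \
      --num-mappers 4

Sqoop runs the transfer as parallel map tasks (four here), each copying a slice of the table into HDFS.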
6. FlumeNG
A streaming data collection and aggregation system for massive volumes of data, such as RPC services, Log4J, Syslog, etc.
[Diagram: multiple Clients sending events to Flume Agents, which aggregate the streams.]
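Flume NG agents are wired together in a properties file. A minimal sketch, assuming a hypothetical agent a1 that listens for Syslog events over TCP and lands them in HDFS (the port and paths are illustrative):

    # components of agent a1
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1
    # source: receive syslog messages over TCP
    a1.sources.r1.type = syslogtcp
    a1.sources.r1.host = 0.0.0.0
    a1.sources.r1.port = 5140
    a1.sources.r1.channels = c1
    # channel: buffer events in memory between source and sink
    a1.channels.c1.type = memory
    # sink: write events into an HDFS directory
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /flume/syslog
    a1.sinks.k1.channel = c1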
7. HBase
• A low-latency, distributed, non-SQL database built on HDFS
• A “Columnar Database”
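Tables are created with one or more column families, and cells are addressed by row key, column, and timestamp. A minimal sketch in the HBase shell; the table and column names are illustrative:

    create 'users', 'info'                      # table with one column family
    put 'users', 'row1', 'info:name', 'Alice'   # write one cell
    get 'users', 'row1'                         # read back the row
    scan 'users'                                # scan the whole table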
8. Hive
• Relational database abstraction using a SQL-like dialect called HiveQL
• Statements are executed as one or more MapReduce jobs

SELECT s.word, s.freq, k.freq
FROM shakespeare s
JOIN kjv k ON (s.word = k.word)
WHERE s.freq >= 5;
9. Pig
• High-level scripting language for executing one or more MapReduce jobs
• Created to simplify authoring of MapReduce jobs
• Can be extended with user-defined functions

emps = LOAD 'people.txt' AS (id, name, salary);
rich = FILTER emps BY salary > 200000;
sorted_rich = ORDER rich BY salary DESC;
STORE sorted_rich INTO 'rich_people.txt';
10. Oozie
A workflow engine and scheduler built specifically for large-scale job orchestration on a Hadoop cluster
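Workflows are defined as XML documents that chain actions together. A minimal sketch of a one-action workflow running a Pig script; the workflow name, node names, schema version, and script are illustrative:

    <workflow-app name="etl-wf" xmlns="uri:oozie:workflow:0.4">
      <start to="pig-node"/>
      <action name="pig-node">
        <pig>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <script>filter_rich.pig</script>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>Pig action failed</message>
      </kill>
      <end name="end"/>
    </workflow-app>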
12. Mahout
A machine learning library with algorithms for:
• Recommendation, based on users' behavior
• Clustering, which groups related documents
• Classification, which learns from existing categorized documents
• Frequent item-set mining (e.g., shopping cart contents)
13. Hadoop Security
• Authentication is secured by MIT Kerberos v5 and integrated with LDAP
• Provides identity, authentication, and authorization
• Useful for multi-tenant or secure environments
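On a secured cluster, a user authenticates with Kerberos before running Hadoop commands. A minimal sketch; the principal and path are illustrative:

    kinit alice@EXAMPLE.COM      # obtain a Kerberos ticket for the user
    hadoop fs -ls /user/alice    # Hadoop commands now authenticate with that ticket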
15. Components of HDFS
• NameNode: holds all metadata for HDFS
  • Needs to be a highly reliable machine
  • RAID drives, typically RAID 10
  • Dual power supplies
  • Dual network cards, bonded
  • The more memory the better; typically 36GB to 64GB
• Secondary NameNode: provides checkpointing for the NameNode. The same hardware as the NameNode should be used
16. Components of HDFS – Contd.
• DataNodes: hardware will depend on the specific needs of the cluster
  • No RAID needed; JBOD (just a bunch of disks) is used
  • Typical ratio (see the sizing example below) is:
    • 1 hard drive
    • 2 cores
    • 4GB of RAM
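Scaling that ratio up: a DataNode with 12 data drives would be balanced by roughly 24 cores and 48GB of RAM. Treat this as a rule of thumb rather than a hard requirement; I/O-heavy or compute-heavy workloads shift the balance.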
19. MapReduce – Map
• Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line).
• map() produces one or more intermediate values along with an output key from the input.
[Diagram: a Map Task emits (key, values) pairs; the Shuffle Phase groups the intermediate values by key; a Reduce Task turns each group into final (key, values) output.]
20. MapReduce – Reduce
• After the map phase is over, all the intermediate values for a given output key are combined together into a list
• reduce() combines those intermediate values into one or more final values for that same output key
[Diagram: the same Map / Shuffle / Reduce flow as on slide 19.]
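To make the two phases concrete, here is a minimal sketch of the classic word-count example against the org.apache.hadoop.mapreduce API (the class names are illustrative): map() emits (word, 1) for every token, and reduce() sums the counts for each word.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

      // map(): one input record (byte offset, line) in,
      // one intermediate (word, 1) pair out per token
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);   // intermediate (key, value)
            }
          }
        }
      }

      // reduce(): all intermediate values for one key arrive together; sum them
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));   // final (key, value)
        }
      }
    }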
22. Hadoop In the Enterprise
How It Works In The Real World
23. Networking
• One of the most important things to consider when setting up a Hadoop cluster
• Typically each rack has a top-of-rack switch, uplinked to a core switch
• Be careful about oversubscribing the switch backplane!
24. Hadoop Typical Data Pipeline
[Diagram: data sources feed Hadoop through Flume and Sqoop; original source data lands in HDFS; MapReduce, Pig, and Hive jobs, orchestrated by Oozie, produce result or calculated data; Sqoop exports the results to the data warehouse and data marts.]
25. Hadoop Use Cases
Advanced Analytics Use Case     Industry         Data Processing Use Case
Social Network Analysis         Web              Clickstream Sessionization
Content Optimization            Media            Clickstream Sessionization
Network Analytics               Telco            Mediation
Loyalty & Promotions Analysis   Retail           Data Factory
Fraud Analysis                  Financial        Trade Reconciliation
Entity Analysis                 Federal          SIGINT
Sequencing Analysis             Bioinformatics   Genome Mapping
26. Hadoop in the Enterprise
[Diagram: operators, engineers, analysts, and business users reach the platform through management tools, IDEs, BI / analytics tools, and enterprise reporting; Hadoop sits alongside the enterprise data warehouse and web applications, pulling in logs, files, web data, and relational databases, and ultimately serving customers.]
27. Cloudera Manager
End-to-End Administration for CDH
1. Manage: easily deploy, configure & optimize clusters
2. Monitor: maintain a central view of all activity
3. Diagnose: easily identify and resolve issues
4. Integrate: use Cloudera Manager with existing tools
28. Install A Cluster In 3 Simple Steps
Cloudera Manager Key Features
1. Find Nodes: enter the names of the hosts which will be included in the Hadoop cluster. Click Continue.
2. Install Components: Cloudera Manager automatically installs the CDH components on the hosts you specified.
3. Assign Roles: verify the roles of the nodes within your cluster. Make changes as necessary.
Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. The example here shows what happens with a replication factor of 3: each data block is present on at least 3 separate DataNodes. A typical Hadoop node is eight cores with 16GB of RAM and four 1TB SATA disks. The default block size is 64MB, though most folks now set it to 128MB.
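Both settings live in hdfs-site.xml. A minimal sketch, assuming Hadoop 2.x property names:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>         <!-- each block stored on 3 DataNodes -->
      </property>
      <property>
        <name>dfs.blocksize</name>
        <value>134217728</value> <!-- 128MB instead of the 64MB default -->
      </property>
    </configuration>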
Apache Hadoop is a new solution in your existing infrastructure. It does not replace any major existing investment. Hadoop brings data that you're already generating into context and integrates it with your business. You get access to key information about how your business is operating by pulling together web and application logs, unstructured files, web data, and relational data. Hadoop is used by your team to analyze this data and deliver it to business users directly and via existing data management technologies.