Intro to Hadoop

Haden Pereira
Data Engineer, Applications Work Group @ EMC
5+ Years Experience in the Big Data Space

Quick Survey
How many Programmers/Developers?

Quick Survey
How many SQL Developers?

Quick Survey
How many Application Developers
(Java, C#, etc.)?

Quick Survey
How many System Administrators
(Database, Tomcat, etc.)?

Quick Survey
How many of you have heard of Hadoop?

Quick Survey
How many of you have hands-on experience
with Hadoop?

Quick Survey
How many of you have worked with any of
the NoSQL tools
(Cassandra, MongoDB, Elasticsearch)?

What is Hadoop?
Hadoop is an open-source framework for
large-scale data storage and processing.

Why Hadoop?
• Traditionally, data processing was done on a single large system.
• Every time better performance was needed, the old computer was
replaced with a more powerful one.
• Scaling up was expensive.
• Scaling was also limited to the maximum resources available in a
single system.

How does Hadoop Scale?
• “Scale out” rather than “scale up”.
• If the data set or the processing requirement grows, add more
servers.
• Eliminates the strategy of growing computing capacity by throwing
more expensive hardware at the problem.

Core Components of Hadoop
Hadoop v1 - HDFS & MapReduce
Hadoop v2 - HDFS & YARN (MapReduce runs on top of YARN)

HDFS
Distributed: Data volumes are growing faster than the capacity of a single
storage disk, hence a cluster of disks distributed over a network is
necessary.
Scalable: Extends to handle growing data requirements.
Fault-Tolerant: Uses replication to protect against the increased probability
of failure that comes with a large number of disks.
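
The fault-tolerance point can be made concrete with a little arithmetic. The sketch below assumes that disks fail independently and with the same probability p over some period. The value p = 0.02 and the function names are illustrative assumptions, not measured figures. It also treats a block as lost only when every disk holding one of its replicas fails, and ignores re-replication.

```python
def p_any_disk_fails(p, n_disks):
    """Probability that at least one of n_disks fails (independent failures)."""
    return 1.0 - (1.0 - p) ** n_disks


def p_block_lost(p, replicas):
    """Probability that every disk holding a replica of one block fails."""
    return p ** replicas


if __name__ == "__main__":
    p = 0.02  # assumed per-disk failure probability, for illustration only
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} disks: P(at least one failure) = {p_any_disk_fails(p, n):.3f}")
    print(f"P(block lost, 1 replica)  = {p_block_lost(p, 1):.6f}")
    print(f"P(block lost, 3 replicas) = {p_block_lost(p, 3):.6f}")
```

Under these assumptions, a failure somewhere in the cluster becomes close to certain as the number of disks grows, while the chance of losing all three replicas of a given block stays very small.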

HDFS
Server 1 to Server 6, 1 TB each
Total Capacity 6 TB

HDFS
File.txt (300 MB) is stored on the cluster.

HDFS
File.txt is split into three blocks:
F1 F2 F3
100MB 100MB 100MB

HDFS
Each block is stored as three replicas, added one set at a time across the servers:
F1-R1 F2-R1 F3-R1
F1-R2 F2-R2 F3-R2
F1-R3 F2-R3 F3-R3

HDFS
The final build of the diagram shows F3-R2 a second time.
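
The block-and-replica walkthrough can be sketched in a few lines of Python. This is a toy model using the numbers from the slides (100 MB blocks, three replicas, six servers). The round-robin placement and all function names are assumptions made for illustration. Real HDFS uses a rack-aware placement policy and, in Hadoop 2, a default block size of 128 MB.

```python
from itertools import cycle

BLOCK_SIZE_MB = 100  # block size used on the slides (illustrative)
REPLICATION = 3      # replicas per block, as on the slides


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block; the last block may be smaller."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(n_blocks, servers, replication=REPLICATION):
    """Round-robin placement (an assumption, not the real HDFS policy).

    Consecutive picks from the ring are distinct as long as there are at
    least `replication` servers, so replicas of one block never share a server.
    """
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    placement = {server: [] for server in servers}
    ring = cycle(servers)
    for block in range(1, n_blocks + 1):
        for replica in range(1, replication + 1):
            placement[next(ring)].append(f"F{block}-R{replica}")
    return placement


def usable_capacity_tb(raw_tb, replication=REPLICATION):
    """Raw capacity divided by the number of copies kept of every block."""
    return raw_tb / replication


if __name__ == "__main__":
    servers = [f"Server {i}" for i in range(1, 7)]
    blocks = split_into_blocks(300)
    print("Block sizes (MB):", blocks)
    for server, replicas in place_replicas(len(blocks), servers).items():
        print(server, replicas)
    print("Usable capacity (TB):", usable_capacity_tb(6))
```

One consequence worth noting: with three copies of every block, the 6 TB of raw disk holds roughly 2 TB of user data.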

Map Reduce
Framework for writing applications that process large amounts of
structured and unstructured data in parallel, across a cluster of
thousands of machines, in a reliable and fault-tolerant manner.

Map Reduce
File.txt
300 MB
….. , ….. , ….. , ….. , ….. , ….. , ….. , 654 , INR
….. , ….. , ….. , ….. , ….. , ….. , ….. , 432 , AED
….. , ….. , ….. , ….. , ….. , ….. , ….. , 573 , USD
….. , ….. , ….. , ….. , ….. , ….. , ….. , 948 , EUR
….. , ….. , ….. , ….. , ….. , ….. , ….. , 392 , GBP
CSV file with around 1 million lines
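
The slides do not say what processing is applied to this file. As a minimal sketch, assume the last two columns are an amount and a currency code, and that the job totals the amounts per currency. The function names, the placeholder leading column `x`, and the extra USD row are made up for illustration.

```python
from collections import defaultdict


def map_line(line):
    """Map step: emit a (currency, amount) pair from one CSV line."""
    fields = [field.strip() for field in line.split(",")]
    return fields[-1], int(fields[-2])


def reduce_pairs(pairs):
    """Reduce step: sum the amounts for each currency."""
    totals = defaultdict(int)
    for currency, amount in pairs:
        totals[currency] += amount
    return dict(totals)


if __name__ == "__main__":
    lines = [
        "x , 654 , INR",
        "x , 432 , AED",
        "x , 573 , USD",
        "x , 948 , EUR",
        "x , 392 , GBP",
        "x , 100 , USD",  # made-up row so that one key has two values
    ]
    print(reduce_pairs(map_line(line) for line in lines))
```

Because the map step looks at one line at a time, it can run on any part of the file independently; only the reduce step needs to see all pairs for a given currency.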

Map Reduce
File.txt (300 MB)
1 hour to process the 300 MB file

Map Reduce
File.txt split into 2 parts of 150 MB each
1/2 hour to process each 150 MB part

Map Reduce
File.txt split into 4 parts of 75 MB each
1/4 hour to process each 75 MB part
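
The timing on these slides follows from simple arithmetic, sketched below. It assumes that each part runs on its own machine at the same rate and that splitting is free. The `overhead_hours` parameter is a hypothetical knob for the scheduling and merge cost that a real cluster would add.

```python
RATE_MB_PER_HOUR = 300.0  # from the slides: 300 MB takes 1 hour on one machine


def split_size_mb(total_mb, n_splits):
    """Size of each part when the file is divided evenly."""
    return total_mb / n_splits


def wall_clock_hours(total_mb, n_splits, overhead_hours=0.0):
    """Time until all parts finish when each part runs on its own machine."""
    return split_size_mb(total_mb, n_splits) / RATE_MB_PER_HOUR + overhead_hours


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(f"{n} part(s) of {split_size_mb(300, n):g} MB: "
              f"{wall_clock_hours(300, n):g} hour(s)")
```

With a non-zero overhead the gain from each additional split shrinks, which is why the linear speedup on the slides is best read as an upper bound.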

Map Reduce
Server 1 to Server 6, 1 TB each, holding the block replicas:
F1-R1 F2-R1 F3-R1
F1-R2 F2-R2 F3-R2
F1-R3 F2-R3 F3-R3

Map Reduce
P1-R1, P2-R1 and P3-R1 are added to the diagram alongside the block replicas.

Map Reduce
• Handles tasks in case of server failures
• Distributes tasks evenly
• Tries to run tasks on the same server where the data block resides
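
These three behaviours can be illustrated with a toy scheduler. This is a sketch under stated assumptions, not how Hadoop's scheduler is implemented: the function name, the least-loaded tie-breaking rule and the block locations in the usage example are all hypothetical.

```python
def assign_tasks(block_locations, live_servers):
    """Assign one task per block, preferring a live server that stores the block.

    block_locations: dict mapping block name -> list of servers holding a replica.
    live_servers: list of servers that are currently available.
    Returns a dict mapping block name -> (server, is_data_local).
    """
    if not live_servers:
        raise ValueError("no live servers")
    load = {server: 0 for server in live_servers}
    assignment = {}
    for block, holders in block_locations.items():
        # Data locality: only live servers that hold a replica of this block.
        local = [server for server in holders if server in load]
        # Fall back to any live server if no replica holder is available.
        candidates = local or list(load)
        # Even distribution: pick the least-loaded candidate.
        server = min(candidates, key=lambda s: load[s])
        load[server] += 1
        assignment[block] = (server, bool(local))
    return assignment


if __name__ == "__main__":
    servers = [f"Server {i}" for i in range(1, 7)]
    locations = {  # hypothetical replica placement
        "F1": ["Server 1", "Server 2", "Server 3"],
        "F2": ["Server 4", "Server 5", "Server 6"],
        "F3": ["Server 1", "Server 2", "Server 3"],
    }
    print("All servers up:", assign_tasks(locations, servers))
    print("Server 1 down: ", assign_tasks(locations, servers[1:]))
```

When Server 1 is removed from the live list, the task for F1 moves to another server that still holds a replica, so the work stays data-local despite the failure.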

YARN
Multi-tenancy - YARN allows multiple access engines (either open-source or
proprietary) to use Hadoop as the common standard for batch, interactive and
real-time engines that can simultaneously access the same data set.
Cluster utilization - YARN’s dynamic allocation of cluster resources improves utilization
over the more static MapReduce rules used in early versions of Hadoop.
Scalability - Data center processing power continues to expand rapidly. YARN’s
ResourceManager focuses exclusively on scheduling and keeps pace as clusters
expand to thousands of nodes managing petabytes of data.
Compatibility - Existing MapReduce applications developed for Hadoop 1 can run
on YARN without any disruption to existing processes that already work.

Hadoop Ecosystem
Pig (scripting): Platform for analyzing large data sets. It consists of a
high-level language (Pig Latin) that is translated to MapReduce jobs, which cuts
down the amount of code to write. Ideal for extract-transform-load (ETL) data
pipelines, research on raw data, and iterative processing of data.
Hive (SQL): Provides data warehouse infrastructure, enabling data
summarization, ad hoc query and analysis of large data sets. The query
language, HiveQL (HQL), is similar to SQL.
HCatalog (SQL): Table and storage management layer that provides users of
Pig, MapReduce and Hive with a relational view of data in HDFS. Provides REST
APIs so that external systems can access these tables' metadata.

Hadoop Ecosystem
Ambari: Provides an open operational framework for provisioning, managing
and monitoring Hadoop clusters.
ZooKeeper: Provides a distributed configuration service, a synchronization service
and a naming registry for distributed systems.
Oozie: Enables Hadoop administrators to build complex data transformations out
of multiple component tasks, enabling greater control over complex jobs and also
making it easier to schedule repetitions of those jobs.

Hadoop Ecosystem
Tez: Leverages the MapReduce paradigm to enable the creation and execution of
more complex Directed Acyclic Graphs (DAG) of tasks. Tez eliminates unnecessary
tasks, synchronization barriers and reads from and writes to HDFS, speeding up
data processing across both small-scale/low-latency and
large-scale/high-throughput workloads.
Spark: Fast, general-purpose in-memory processing engine that can be deployed
on YARN and can read and write data in HDFS.

Hadoop Ecosystem
Sqoop: Tool designed to transfer data between Hadoop and relational database
servers.
HBase (NoSQL): Non-relational database that provides random real-time access
to data in very large tables. HBase provides transactional capabilities to Hadoop,
allowing users to conduct updates, inserts and deletes.
Flume: Distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of streaming data into HDFS.
