Overview, HDFS and
Hadoop Workshop 1
- SW Engineer who has worked as designer / developer on
NoSQL (Mongo, Cassandra, Hadoop)
- Consultant – HBO, ACS, ACXIOM (GM), AT&T
- Specialize in SW Development, architecture and training
- Currently working with Cassandra, Storm, Kafka, Node.js and
• Intros, Agenda and software
• Hadoop Overview, Ecosystem and Use Cases
• HDFS – Hadoop Distributed File System
• Map/Reduce – how to create and use
• Hadoop Streaming and Pig
• A little bit about the supporting cast, the Hadoop Eco-
What are your experience and
interests with Big Data,
NoSQL, Hadoop and Software
• Cloudera Quickstart VM
We are using Cloudera; there are also Hortonworks, MapR,
Apache Hadoop and a multitude of other players.
• The reason for products like Hadoop is the emergence of Big Data.
• Big Data is not new, but it is more common now because we are more
capable of collecting it.
• Big Data is not just about size. It is also frequency of delivery
and the unstructured nature.
• RDBMS was about structure and definition. In many cases it fails to
handle these data requirements.
• Not Only SQL. It doesn't mean “No” SQL.
• Category of products developed to handle big data
• Built to handle distributed architecture
• Take advantage of large clusters of commodity hardware
• There are trade-offs; you live with the CAP theorem
• Hadoop itself is not a NoSQL product
• HBase, which is part of the Hadoop eco-system, is a NoSQL
• Hadoop is a computing framework (a lot more to come)
and typically runs on a cluster of machines
• Examples of NoSQL products: Cassandra, MongoDB, Riak,
CouchDB and HBase.
Where does the name come from?
The name my kid gave a stuffed yellow elephant.
Short, relatively easy to spell and
pronounce, meaningless, and not used elsewhere: those
are my naming criteria. Kids are good at generating such.
Googol is a kid’s term.
• This is from Doug Cutting, inventor of Hadoop (and also of
Lucene, the search engine; Nutch, a web crawler built with Mike
Cafarella; and various other projects).
Hadoop was created to process data created by Nutch.
• Hadoop is an open source framework for processing
large amounts of data in batch.
• Designed from the ground up to scale out as you add
• Hadoop Core has three main components:
• The first is called Common and contains (as the name
implies) common functionality.
• In addition, Hadoop has its own file system, HDFS,
which is built to be fault tolerant; that is the
cornerstone that lets Hadoop run on commodity hardware.
• The final component making up the Hadoop system is
called MapReduce and it implements the model that
allows processing the data in a parallel manner.
• Hadoop Distributed File System (aka HDFS, or Hadoop
• Runs on top of regular OS file system, typically Linux
• Fixed-size blocks (64 MB by default in Hadoop 1, 128 MB in Hadoop 2) that are replicated
• Write once, read many; optimized for streaming in and
• Responsible for running a job in parallel on many servers
• Handles re-trying a task that fails
• Validates complete results
• Jobs consist of special “map” and “reduce” operations
• Has one “master” server - high quality, beefy box.
• NameNode process - manages file system
• JobTracker process - manages tasks
• Has multiple “slave” servers - commodity hardware.
• DataNode process - manages file system blocks on local drives
• TaskTracker process - runs tasks on server
• Uses high speed network between all servers
• Hadoop is a lot more than HDFS and MapReduce.
• There is a large (and ever growing) Eco-System built up
• These projects make MapReduce less complex, provide
monitoring and job control, add persistence systems
and add real-time capabilities.
• Avro: A serialization system for efficient, cross-language
RPC and persistent data storage.
• Mahout: Machine learning system
• Pig: High-level query language (Pig Latin) that creates map
• Hive: A distributed data warehouse. Hive manages data
stored in HDFS and provides a query language based on
SQL (which the runtime engine translates to
MapReduce jobs) for querying the data.
• HBase: A distributed, column-oriented database. HBase uses
HDFS for its underlying storage, and supports both batch-style
computations using MapReduce and point queries (random
• ZooKeeper: A distributed, highly available coordination service.
ZooKeeper provides primitives such as distributed locks that can
be used for building distributed applications.
• Sqoop: A tool for efficient bulk transfer of data between
structured data stores (such as relational databases) and HDFS.
• Oozie: A service for running and scheduling workflows of
Hadoop jobs (including MapReduce, Pig, Hive, and Sqoop
• A new MapReduce runtime, called MapReduce 2,
implemented on a new system called YARN (Yet Another
Resource Negotiator), which is a general resource
management system for running distributed applications.
MR2 replaces the “classic” runtime in previous releases.
• HDFS federation, which partitions the HDFS namespace
across multiple namenodes to support clusters with very
large numbers of files.
• HDFS high-availability, which removes the namenode
as a single point of failure by supporting standby
namenodes for failover.
• YARN was added in Hadoop 2
• It is a resource-management platform responsible for
managing compute resources in clusters and using them for
scheduling of users' applications
• YARN can run applications that do not follow the MapReduce
model, unlike the original MapReduce runtime (MR1)
• YARN provides the daemons and APIs necessary to develop
generic distributed applications of any kind, handles and
schedules resource requests (such as memory and CPU) from
such applications, and supervises their execution
Any questions on the Hadoop Overview?
• Hadoop Distributed File System
• Single NameNode and a cluster of DataNodes
• Stores large files (a gigabyte or more) across nodes
• Reliability comes from replication; the default replication factor is 3.
• The NameNode is the master and, in Hadoop 1, a single point of failure (SPOF).
• Hadoop 2 added failover to remove the SPOF
• Directory Tree of all the files
• Keeps track of where all the data resides across the cluster
• For really large clusters, tracking these files becomes an issue;
Hadoop 2 added NameNode Federation to address that.
• Can view information on namenode, blocks and view
• Files stored on HDFS are chunked and stored as blocks
• DataNodes manage the storage attached to the nodes that they run on.
• File data never flows through the NameNode, only through DataNodes
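To make the split between metadata and data concrete, here is a minimal in-memory sketch in plain Java. It is a toy model, not the Hadoop API: the ToyHdfs class, its fields and the tiny block size are all hypothetical, and replication is left out. What it tries to show is that the NameNode side only records where blocks live, while the block contents sit with the DataNodes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: every name here is hypothetical and none of it is Hadoop code.
final class ToyHdfs {
    /** Where one block of a file lives. */
    static final class BlockLocation {
        final int blockId;
        final int dataNode;
        BlockLocation(int blockId, int dataNode) {
            this.blockId = blockId;
            this.dataNode = dataNode;
        }
    }

    // "NameNode": metadata only (path -> ordered block locations).
    final Map<String, List<BlockLocation>> nameNode = new HashMap<>();
    // "DataNodes": each one maps block ID -> block contents.
    final List<Map<Integer, String>> dataNodes = new ArrayList<>();
    private final int blockSize;
    private int nextBlockId = 0;

    ToyHdfs(int dataNodeCount, int blockSize) {
        if (dataNodeCount <= 0 || blockSize <= 0) {
            throw new IllegalArgumentException("need at least one DataNode and a positive block size");
        }
        this.blockSize = blockSize;
        for (int i = 0; i < dataNodeCount; i++) {
            dataNodes.add(new HashMap<>());
        }
    }

    /** Chunk the content into blocks and spread them over the DataNodes round-robin. */
    void write(String path, String content) {
        List<BlockLocation> locations = new ArrayList<>();
        for (int start = 0; start < content.length(); start += blockSize) {
            String chunk = content.substring(start, Math.min(content.length(), start + blockSize));
            int id = nextBlockId++;
            int node = id % dataNodes.size();
            dataNodes.get(node).put(id, chunk);         // the bytes go to a DataNode
            locations.add(new BlockLocation(id, node)); // the NameNode only learns the location
        }
        nameNode.put(path, locations);
    }

    /** Ask the NameNode where the blocks are, then fetch each block from its DataNode. */
    String read(String path) {
        List<BlockLocation> locations = nameNode.get(path);
        if (locations == null) {
            throw new IllegalArgumentException("no such file: " + path);
        }
        StringBuilder content = new StringBuilder();
        for (BlockLocation loc : locations) {
            content.append(dataNodes.get(loc.dataNode).get(loc.blockId));
        }
        return content.toString();
    }

    public static void main(String[] args) {
        ToyHdfs fs = new ToyHdfs(3, 4);
        fs.write("/tmp/example.txt", "hello hadoop");
        System.out.println(fs.read("/tmp/example.txt"));
    }
}
```

In this sketch the 12-character file is cut into three 4-character blocks, one per toy DataNode, and the NameNode map never holds any file content.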
• Blocks in a typical disk file system are a few kilobytes.
• HDFS blocks are much larger, typically 64 MB or 128 MB.
• The large size minimizes the cost of seeks.
• The blocks of a file do not need to be on the same node, so large
files in HDFS often span multiple nodes
• List the blocks of each file with: hadoop fsck / -files -blocks
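The effect of block size and replication on a file can be worked out with simple arithmetic. The sketch below is plain Java, not part of Hadoop; the class and method names are made up for illustration, and it assumes the 64 MB and 128 MB block sizes and the replication factor of 3 mentioned in these slides.

```java
// Illustrative arithmetic only: HdfsBlockMath is a hypothetical helper, not a Hadoop class.
final class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    /** Number of blocks needed for a file: the file size divided by the block size, rounded up. */
    public static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes < 0 || blockSizeBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative, block size positive");
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    /** Raw bytes stored across the cluster once every block has been replicated. */
    public static long rawStorageBytes(long fileSizeBytes, int replicationFactor) {
        return fileSizeBytes * replicationFactor;
    }

    public static void main(String[] args) {
        long fileSize = 1024 * MB; // a 1 GB file
        System.out.println(blockCount(fileSize, 64 * MB));    // prints 16
        System.out.println(blockCount(fileSize, 128 * MB));   // prints 8
        System.out.println(rawStorageBytes(fileSize, 3) / MB); // prints 3072
    }
}
```

Larger blocks mean fewer entries for the NameNode to track per file, which is one reason a lot of small files is a poor fit for HDFS.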
• Let's rename them
hadoop fs -mv /tmp/gutenberg/pg20417.txt
hadoop fs -mv /tmp/gutenberg/pg4300.txt
hadoop fs -mv /tmp/gutenberg/pg5000.txt
hadoop fs -copyToLocal /tmp/gutenberg/*.bak /tmp/gutenberg
• And again
• Quite simple, like normal Java File I/O
• Generally, write against the FileSystem abstract class; can use
• Simple way to read and write to file system is using
java.net.URL. For example:
InputStream in = new
• Listing Files
• Getting Status
• Copying to HDFS
• Copying from HDFS
• Splits Processing Into Steps
• These are by nature easily parallelized and can take
advantage of available hardware.
• Key Value Pair: two pieces of data, exchanged between
the Map and Reduce phases. Also sometimes called a
• Map: The user-defined ‘map’ function in the MapReduce algorithm
converts each input Key Value Pair to 0...n
output Key Value Pairs
• Reduce: The user-defined ‘reduce’ function in the MapReduce
algorithm converts each input Key + all its
Values to 0...n output Key Value Pairs
• Group: A built-in operation between Map
and Reduce that ensures each Key passed to Reduce
includes all its Values
• Map translates input keys and values to new keys and values
• The system groups each unique key with all its values
• Reduce translates the values of each unique key to new
keys and values
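The Map, Group and Reduce steps can be sketched in memory with plain Java collections. This is an illustration of the model only, assuming word count as the job; it does not use the Hadoop Mapper or Reducer classes, and the MiniMapReduce class and its method names are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the model; hypothetical names, no Hadoop dependency.
final class MiniMapReduce {
    /** Map: one input pair (line offset, line text) becomes 0..n (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    /** Group: collect every value that was emitted for the same key. */
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce: one key plus all its values becomes a single (word, total) pair. */
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return new SimpleEntry<>(key, sum);
    }

    /** Run the three steps in sequence over a list of input lines. */
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            mapped.addAll(map(offset, line));
            offset += line.length() + 1;
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : group(mapped).entrySet()) {
            Map.Entry<String, Integer> reduced = reduce(entry.getKey(), entry.getValue());
            result.put(reduced.getKey(), reduced.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
        System.out.println(run(Arrays.asList("the quick brown fox", "the lazy dog")));
    }
}
```

In a real cluster the map calls run in parallel on different nodes and the grouping is done by the framework; here everything runs in one loop so the data flow is easy to follow.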
• Let's look at combining English to foreign language
• Is a single point of failure
• Determines # Mapper Tasks from file splits via
• Uses predefined value for # Reducer Tasks
• Client applications use JobClient to submit jobs and
• From the command line, use hadoop job <commands>
• For the web status console, use http://jobtracker-server:50030/
• Spawns each Task as a new child JVM
• Max # mapper and reducer tasks set independently
• Can pass child JVM opts via mapred.child.java.opts
• Can re-use JVM to avoid overhead of task initialization
• MRUnit's MapDriver and ReduceDriver are the key classes
• Configure the mapper under test
• The input value
• The expected output key
• The expected output value
• Testing A Mapper
• Testing A Reducer
• Not for quick jobs
• Scripts / R programs are often better
• Doesn't work well with a lot of small files in HDFS
• Doesn't have a real-time aspect (see Apache Storm)
• Don't let people tell you Hadoop is the answer for
• Hive and Pig are good alternatives (especially for SQL
speakers)
• Contact Me