The document discusses the evolution of Hadoop from version 1.0 to 2.0. Key limitations of Hadoop 1.0 included a lack of horizontal scalability, single points of failure, and tight coupling between components. Hadoop 2.0 addressed these issues by introducing YARN, which decouples resource management and scheduling from the MapReduce programming model and enables job types beyond MapReduce. Other improvements in 2.0 included high availability, resource sharing between jobs, and support for non-Java frameworks.
Hadoop 1.0 vs 2.0: A comparison of core features and limitations
1. • The classic tool for processing line-oriented data is awk (that is, without Hadoop)
• Although Hadoop is best known for MapReduce and its distributed filesystem (HDFS,
renamed from NDFS), the term is also used for a family of related projects for
distributed computing and large-scale data processing
• In earlier versions of Hadoop, there was a single site configuration file
for the Common, HDFS, and MapReduce components, called hadoop-site.xml; later
releases split this into core-site.xml, hdfs-site.xml, and mapred-site.xml, one
per component.
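For illustration, a minimal core-site.xml in the split layout might look like the
following (the fs.defaultFS value is only an example):

    <?xml version="1.0"?>
    <!-- core-site.xml: settings shared across the Hadoop components -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode-host:8020</value> <!-- illustrative host/port -->
      </property>
    </configuration>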
2. 1.0 Limitations
• No horizontal scalability (single NameNode)
• JobTracker is a single point of failure
• Performs only MapReduce jobs
• Tight coupling between resource management and MapReduce
• Problems with resource utilization
• Hive can execute a SQL query as a series of MapReduce jobs, so query
performance is an issue
• No sharing between jobs
• Not good for online, low-latency jobs
• Can't run non-Java frameworks
• HDFS 1.0
2.0 Features
• Supports horizontal scalability (multiple NameNodes via HDFS federation)
• Highly available (NameNode HA)
• Performs multiple job types (other sub-projects beyond MapReduce)
• YARN decouples resource management from MapReduce (see the configuration
sketch after this list)
• Better resource utilization
• Good query performance
• Backward compatible
• HDFS 2.0
• Beyond Java (non-Java frameworks can run on YARN)
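As a sketch of the decoupling in practice, a 2.0 cluster selects YARN as the MapReduce
execution framework with a single property in mapred-site.xml; other frameworks submit
to the same ResourceManager, which is what enables sharing between jobs:

    <!-- mapred-site.xml: run MapReduce as one application type on YARN -->
    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>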
3. HDFS
Persistent Data Structures
The namespaceID is a unique identifier for the filesystem namespace, which is created
when the namenode is first formatted. The clusterID is a unique identifier for the
HDFS cluster as a whole; this is important for HDFS federation,
where a cluster is made up of multiple namespaces and each namespace
is managed by one namenode. The blockpoolID is a unique identifier for the
block pool containing all the files in the namespace managed by this namenode.
The in_use.lock file is a lock file that the namenode uses to lock the storage directory.
This prevents another namenode instance from running at the same time with (and
possibly corrupting) the same storage directory.
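The namespaceID, clusterID, and blockpoolID are recorded alongside in_use.lock in the
VERSION file of the namenode's storage directory (current/VERSION). A purely
illustrative listing, with made-up values (the remaining fields record the creation
time, the storage directory type, and the HDFS layout version):

    #Tue Nov 05 10:12:33 UTC 2013
    namespaceID=1342387246
    clusterID=CID-01b5c398-959c-4ea8-aae6-1e0d9bd8b142
    cTime=0
    storageType=NAME_NODE
    blockpoolID=BP-526805057-127.0.0.1-1411980876842
    layoutVersion=-57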
5. Speculative Execution (Reasons to Turn It Off)
1.0
On a busy cluster, speculative execution can reduce overall throughput, since redundant
tasks are being executed in an attempt to bring down the execution time for a
single job. For this reason, some cluster administrators prefer to turn it off on the cluster
and have users explicitly turn it on for individual jobs. This was especially relevant
in older versions of Hadoop, when speculative execution could be overly aggressive in
scheduling speculative tasks.
2.0
There is a good case for turning off speculative execution for reduce tasks, since any
duplicate reduce tasks have to fetch the same map outputs as the original task, and
this can significantly increase network traffic on the cluster.
Another reason for turning off speculative execution is for nonidempotent tasks (an
idempotent task has no additional effect if it is called more than once with the same
input parameters).
However, in many cases it is possible to write tasks to be idempotent and use an
OutputCommitter to promote the output to its final location when the task succeeds.
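As a minimal sketch of the switches involved: speculative execution can be left on for
map tasks but turned off for reduce tasks in mapred-site.xml. The mapreduce.* names
below are the 2.0-era properties; 1.0 used mapred.map.tasks.speculative.execution and
mapred.reduce.tasks.speculative.execution instead.

    <!-- mapred-site.xml: per-cluster defaults; individual jobs may override -->
    <configuration>
      <property>
        <name>mapreduce.map.speculative</name>
        <value>true</value>  <!-- duplicate maps only reread their input split -->
      </property>
      <property>
        <name>mapreduce.reduce.speculative</name>
        <value>false</value> <!-- avoids duplicate shuffle traffic -->
      </property>
    </configuration>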
8. Security
1.0
One area that hasn’t yet been addressed in the security work is encryption: neither RPC
nor block transfers are encrypted. HDFS blocks are not stored in an encrypted form.
2.0
Various parts of Hadoop can be configured to encrypt network data, including RPC
(hadoop.rpc.protection), HDFS block transfers (dfs.encrypt.data.transfer), the
MapReduce shuffle (mapreduce.shuffle.ssl.enabled), and the web UIs
(hadoop.ssl.enabled). Work is ongoing to encrypt data “at rest,” too, so that HDFS
blocks can be stored in encrypted form, for example.
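A sketch of the corresponding settings, placed in their usual site files (values
illustrative; "privacy" is the hadoop.rpc.protection level that adds encryption on top
of authentication and integrity checking):

    <!-- core-site.xml -->
    <property>
      <name>hadoop.rpc.protection</name>
      <value>privacy</value>
    </property>
    <property>
      <name>hadoop.ssl.enabled</name>
      <value>true</value> <!-- HTTPS for the web UIs -->
    </property>
    <!-- hdfs-site.xml -->
    <property>
      <name>dfs.encrypt.data.transfer</name>
      <value>true</value>
    </property>
    <!-- mapred-site.xml -->
    <property>
      <name>mapreduce.shuffle.ssl.enabled</name>
      <value>true</value>
    </property>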
9. Upgrades (compatibility)
1.0
All pre-1.0 Hadoop components have very rigid version compatibility requirements.
Only components from the same release are guaranteed to be compatible with each
other, which means the whole system—from daemons to clients—has to be upgraded
simultaneously, in lockstep. This necessitates a period of cluster downtime.
Version 1.0 of Hadoop promises to loosen these requirements so that, for example,
older clients can talk to newer servers (within the same major release number). In later
releases, rolling upgrades may be supported, which would allow cluster daemons to be
upgraded in phases, so that the cluster would still be available to clients during the
upgrade.
Minor releases (e.g., from 1.0.x to 1.1.0) and point releases (e.g., from 1.0.1
to 1.0.2) should not break compatibility.
2.0
API compatibility, data compatibility, and wire compatibility
1 API compatibility:
API compatibility concerns the contract between user code and the published Hadoop APIs, such as the Java
MapReduce APIs. Major releases are allowed to break API compatibility, so user programs
may need to be modified and recompiled.
2 Data compatibility:
Data compatibility concerns persistent data and metadata formats, such as the format in which the HDFS namenode
stores its persistent data. The formats can change across minor or major releases, but the change is transparent to
users because the upgrade will automatically migrate the data. There may be some restrictions about upgrade paths,
and these are covered in the release notes. For example, it may be necessary to upgrade via an intermediate release
rather than upgrading directly to the later final release in one step.
3 Wire compatibility:
Wire compatibility concerns the interoperability between clients and servers via wire protocols such as RPC and
HTTP. The rule for wire compatibility is that the client must have the same major release number as the
server, but may differ in its minor or point release number (e.g., client version 2.0.2 will work with server 2.0.1 or
2.1.0, but not necessarily with server 3.0.0).
10. 2.0 Features
• Apache Flume is a system for high-volume ingestion of streaming event data into HDFS.
• Apache Crunch is a higher-level API for writing MapReduce pipelines.
• Apache Spark is a cluster computing framework for large-scale data processing.
• Apache Parquet is a columnar storage format that can efficiently store nested data.