Hadoop & HDFS
version 1.0
File & Content Solutions
What is Hadoop?
§ Built and distributed as a project of the Apache Software Foundation.
§ Hadoop ecosystem:
  § Common – a set of components and interfaces for a distributed file system (DFS) and general I/O.
  § Avro – a serialization system for efficient, cross-language RPC and persistent data storage.
  § MapReduce – a distributed data processing model and execution environment that runs on large clusters of commodity machines.
  § HDFS – a distributed file system that runs on large clusters of commodity hardware.
Common Terms in Hadoop HDFS
§ Name node – manages the file system namespace. It maintains the file system tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log.
§ Data node – the workhorse of the file system. Data nodes store and retrieve blocks when they are told to (by clients or the name node), and they report back to the name node periodically with lists of the blocks they are storing.
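The division of labour above can be sketched as a toy in-memory model: the name node only tracks which data nodes hold which blocks, while the blocks themselves live on the data nodes. This is an illustration of the concept, not Hadoop's actual implementation:

```python
# Toy model of the name node / data node relationship
# (conceptual sketch only, not Hadoop's real code).

class NameNode:
    def __init__(self):
        # file path -> list of (block_id, [data node ids holding a replica])
        self.block_map = {}

    def register_block(self, path, block_id, datanodes):
        # Data nodes report the blocks they store; the name node
        # records only this metadata, never the block contents.
        self.block_map.setdefault(path, []).append((block_id, list(datanodes)))

    def locate(self, path):
        # A client asks the name node where a file's blocks live,
        # then reads the blocks directly from the data nodes.
        return self.block_map[path]

nn = NameNode()
nn.register_block("/logs/part-0", "blk_001", ["dn1", "dn2", "dn3"])
nn.register_block("/logs/part-0", "blk_002", ["dn2", "dn3", "dn4"])

print(nn.locate("/logs/part-0"))
```

Note that the client never streams file data through the name node; it only consults it for block locations, which is why name node metadata stays small relative to the data stored.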
Common Terms in Hadoop HDFS
§ Secondary name node – its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary name node usually runs on a separate physical machine.
Hadoop Distributed File System – HDFS
§ HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
§ HDFS has a permissions model for files and directories that is much like POSIX (the Portable Operating System Interface).
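Because the permissions model mirrors POSIX, the familiar per-owner/group/other read, write and execute bits apply to HDFS files and directories too. A quick standalone illustration of decoding an octal mode into those bits (plain Python, no Hadoop required):

```python
import stat

def mode_to_string(mode):
    # Decode a POSIX-style octal mode such as 0o755 into the familiar
    # rwxr-xr-x string; HDFS permissions use the same r/w/x bits for
    # owner, group, and others.
    return stat.filemode(stat.S_IFREG | mode)[1:]  # drop the file-type char

print(mode_to_string(0o755))  # rwxr-xr-x
print(mode_to_string(0o644))  # rw-r--r--
```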
Writing data into Hadoop
Reading data from HDFS
MapReduce
§ "Map" step: the master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node.
§ "Reduce" step: the master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
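The two steps above can be illustrated with the classic word-count example, here as a single-process Python sketch. A real Hadoop job would implement Mapper and Reducer classes (or use Hadoop Streaming) and the framework would distribute the work, but the data flow is the same:

```python
from collections import defaultdict

def map_step(document):
    # Map: emit a (word, 1) pair for every word in an input split.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_step(word, counts):
    # Reduce: combine all values seen for one key into a single result.
    return (word, sum(counts))

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: group the mapped pairs by key, as the framework does
# between the map and reduce phases.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_step(doc):
        grouped[word].append(count)

result = dict(reduce_step(w, c) for w, c in grouped.items())
print(result["the"])  # 3
print(result["fox"])  # 2
```

The grouping ("shuffle") between the two phases is what lets each reduce call see every value for its key, regardless of which worker produced it.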
HDFS Storage Solution
§ The DataLogix Hadoop storage solution contains:
  § An enterprise scale-out storage solution for Hadoop workflows.
  § Native connectivity for Hadoop and ecosystem components:
    § Hive
    § HBase
    § Pig
    § Mahout
  § No single point of failure at the name node.
  § No 3x mirroring; native N+M protection is used instead.
  § Snapshot, sync, and NDMP backup are supported.
Writing into Hadoop with the DataLogix solution
§ The storage system acts as both the name node and the data node.
§ Provides scalability and protection of the data.
§ The Hadoop cluster no longer has a single point of failure and no longer writes multiple 64-128 MB chunks of data to the data nodes.
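For context on what this replaces: stock HDFS chops each file into fixed-size blocks (64 MB by default in early Hadoop, commonly raised to 128 MB) and replicates each block, typically three times, across data nodes. A toy sketch of that default write path, using a tiny block size and naive round-robin placement purely for illustration (real HDFS placement is rack-aware):

```python
BLOCK_SIZE = 8           # toy value; HDFS commonly uses 64-128 MB
REPLICATION = 3          # HDFS default replication factor
DATANODES = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # A file is chopped into fixed-size blocks; the last may be shorter.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, datanodes=DATANODES, replication=REPLICATION):
    # Naive round-robin replica placement, for illustration only.
    placement = []
    for i, block in enumerate(blocks):
        targets = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        placement.append((block, targets))
    return placement

blocks = split_into_blocks(b"0123456789abcdefghij")
print(len(blocks))                    # 3 blocks: 8 + 8 + 4 bytes
print(place_replicas(blocks)[0][1])   # ['dn1', 'dn2', 'dn3']
```

Each block landing on three data nodes is the "3x mirroring" that the DataLogix solution avoids by using N+M protection inside the storage system instead.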
Reading Hadoop Data
§ Data is read off the cluster back to the compute nodes.
§ The data nodes are now compute nodes and are independent of the data in the Hadoop cluster:
  § The benefit is that Hadoop hardware can be upgraded without the need to migrate data.
More information?
§ For more information about the Hadoop storage solutions, please contact us:
  DataLogix
  Phone: +31(0)30-7440710
  E-mail: firstname.lastname@example.org
  www.datalogix.nl