HDFS ARCHITECTURE
[Diagram: a Client and a Name Node / Job Tracker coordinating commodity compute & storage nodes, each running a Task Tracker, a Data Store, and a MapReduce task]
There are megatrends transforming our industry that are predicated on a platform of trust.
According to leading industry analysts, four major trends are shaping IT and the business:
Mobility
Cloud
Big Data and
Social
These trends are forming what is being called the third platform - a platform architected for these trends and built to support billions of users and millions of applications.
Looking back, the first platform was the mainframe, with thousands of applications, millions of users, and proprietary terminals as the end-user device of choice. The second platform was the internet and client/server computing, with the PC as the end-user device. This platform continues to support tens of thousands of applications and hundreds of millions of users. However, current architectures are being pushed to their limits, and scaling this type of environment can be costly and ineffective.
The third platform is architected with web-scale in mind, supporting millions of applications and billions of users and is built on the technology pillars of mobility, cloud services, big data and analytics, and social networking.
When we talk about the third platform in an enterprise setting, we’re really talking about the convergence of these forces and their powerful combination to serve as a foundational architecture for IT organizations. Beyond the individual trends, the seamless “combination” of these trends is becoming critical since it collectively represents an agile new IT fabric for applications, data centers and, most importantly, the user experience. According to IDC, the third platform will serve as the primary growth driver of the IT industry over the next decade, responsible for 75% of new growth as worldwide IT spending moves from $3.7 trillion in 2013 to more than $5 trillion in 2020.
Unstructured data is no longer just files from office productivity applications. The real growth, and the real storage management problem, is coming from:
New media such as videos and podcasts
Machine-generated data from devices such as sensors – telemetry data – in fact a transatlantic flight from NYC to London can generate 20-30 TB of telemetry data!
Communities – social interactions
Mobile Devices – pictures, music, etc.
Imaging Equipment – imaging, imaging studies, health records
The intelligent economy produces a constant stream of data that is being monitored and analyzed. IDC estimates that the digital universe will be 40 ZB by 2020. That’s a 40 followed by 21 zeroes. Social interactions, mobile devices, facilities, equipment, R&D, simulations, and physical infrastructure all contribute to the flow of information. In aggregate, this is what is called Big Data. The Big Data economy is characterized by:
More Sources of data
Communities
Mobile Devices
Sensors
Imaging Equipment
Richer Content
Pictures
Videos
Data Streams
Longer utility
Durable value – information and information about information (metadata) has value for a long time after its creation. All this data can have business value.
Regulatory burdens – always a contributor to the need to retain data for longer and longer periods of time, often indefinitely.
Data has value well-beyond the context of the application that created it. Information-based applications and services will have tremendous financial impact across many market segments. Evolving to the 3rd platform and exploiting information will have quantifiable impact on profit margins, revenues, productivity metrics and operating costs.
The potential is obvious and has been validated by early adopters. Big Web companies, Oil & Gas, Pharmaceutical firms, large retailers and many more have used Big Data analytics for deep business insights that target and retain customers and build competitive advantage. The early/late majority, however, are moving more cautiously. Enterprise customers are not starting with a blank canvas, and while they want all the benefits that the 3rd platform offers, they have invested millions if not billions of dollars into an infrastructure that they must continue to maintain and grow. The cost, risk and value of moving to a 3rd platform is still uncertain. They have questions about how they gain the value of the 3rd platform while leveraging their current IT infrastructure.
Big Data and HDFS are disruptive. According to 451 Research, the market for Hadoop/NoSQL software and services will be $3.5 billion by 2017 (45% CAGR). It’s more than analytics, though that’s a huge part of it. The disruptive change is that data has value beyond its initial application. Information about the information provides insights that are critical to understanding and predicting the business.
Everyone sees the potential but adoption has still been somewhat cautious. Hadoop represents a 3rd platform infrastructure that co-locates compute and storage. But, for most enterprises, a Hadoop cluster only contains a fraction of their enterprise data. Customers need the confidence to move from the lab to production. Can they leverage their existing infrastructure and data? Which Hadoop distribution should they use? There are also concerns about HDFS not being enterprise grade. The namenode still represents a single point of failure, which can be a non-starter for some data and uses.
Customers are still calculating their risk. A dedicated cluster can be very cheap (free) to get started but requires significant investment as it scales. It’s also hard to calculate ROI when it’s unclear which data has value. Other costs that need to be factored in are the bandwidth and network costs of moving data to the cluster and back to primary storage.
Customers see the potential and the necessity of Big Data and 3rd platform applications and services. But their 2nd platform infrastructure is not built for this new model. Yet, existing infrastructure, data and applications are not going away. Organizations need a way to “mind the gap” – leveraging their existing infrastructure and data today while building a platform for the future.
The era of Big Data places new demands on data storage. Storage must contend with varying data types, all of which need to be stored securely for long periods of time and be available for analysis.
Data Unification: There is an increasing focus on data unification, meaning that the storage infrastructure for Big Data has to cater to structured, semi-structured, and unstructured data types.
In-Place Analytics: There is a growing emphasis on in-place analytics, in which compute workloads such as Hadoop MapReduce operations run right where the data lives.
Data Compliance: This market is fraught with challenges stemming from regulatory and compliance requirements. As the platform that hosts data the instant it is created, storage is not immune to these challenges, which also shape how data gets stored in the long term.
ViPR aggregates multi-vendor heterogeneous storage into a unified storage platform that, in turn, can be leveraged as a logical scale-out layer serving as the underlying infrastructure for hosting a range of data services to support collecting, managing, and utilizing unstructured content at massive scale. ViPR Data Services are implemented in software and feature a simple, lightweight, low-touch, scale-out design.
Data services are storage abstractions that reflect the combination of a data type (file, object, or block of data), access protocols (iSCSI, NFS, REST, etc.), and durability, availability, and security characteristics (snapshots, replication, etc.). In ViPR, block, file, object, and HDFS are all data services, though ViPR is not in the data path for file and block (these can be thought of as “control services”).
Object and HDFS are available with more to follow. Data services can be used to provide different semantic views of the same data. You can manipulate a file as a file or as an object without having to move the data to a different platform that features that semantic.
The immediate benefit of ViPR is its ability to automate storage management and provisioning and make storage available as a self-service, consumable resource within a software-defined data center (SDDC). But ViPR also transforms how enterprises deliver data services. With storage arrays and storage services defined in software and managed by policy, ViPR enables organizations to deploy unique Data Services that cloud-enable existing infrastructure and extend the use cases for their data and the value of their storage investments.
ViPR aggregates multi-vendor heterogeneous storage into a unified storage platform that can be leveraged as a logical scale-out layer which can serve as the underlying infrastructure for hosting a range of data services to support collecting, managing and utilizing unstructured content at massive scale
This depicts the architecture for ViPR and highlights the data services functionality. At the bottom are the physical arrays that ViPR can manage.
Above the arrays is the ViPR controller which has features that enable a distributed infrastructure (Cassandra, a distributed DB and Zookeeper to manage status of different nodes in the system) and device drivers to hook into APIs of arrays so the Controller can automate provisioning, management, etc.
On top of that are ViPR data services. The Object Data Service was released at the same time as ViPR Controller in October 2013. HDFS was released in December 2013. HDFS uses the same unstructured storage engine as the Object data service.
The ViPR HDFS data service is the second data service to be released by EMC. It will be available by the end of 2013. The HDFS service gives organizations the ability to run analytics using well-known industry Hadoop distributions on existing data stored across heterogeneous systems such as VNX, Isilon, and NetApp arrays.
Hadoop has become a de-facto standard for companies that are investigating novel strategies for addressing their Big Data challenges. HDFS is the core distributed file system used by Hadoop. Many organizations have an HDFS project in their labs. However, many of these companies have found Hadoop to be difficult to deploy and manage at scale. The ViPR approach to HDFS takes advantage of proven storage hardware to overcome this challenge. Instead of building a discrete analytics silo with dedicated infrastructure, the ViPR HDFS data service leverages the existing ViPR virtualized storage environment and the backend storage platforms it utilizes.
HDFS is becoming increasingly popular as a file system layer for distributed applications, and this goes beyond Hadoop. The ViPR HDFS data service is a Hadoop-compatible file system and supports any Hadoop 2.0 implementation including existing distros such as Cloudera and PivotalHD.
HDFS supports high aggregate-throughput access to data, e.g. for MapReduce. In some cases, it provides low-latency access. However, enterprise concerns include scale, durability, cost, and management.
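MapReduce itself is a simple programming model built around the throughput-oriented access pattern HDFS serves. A minimal word-count sketch in Python (an illustration of the model only, not Hadoop's actual Java API) shows the three phases:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each input split."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values for each key."""
    return {word: sum(values) for word, values in groups.items()}

docs = ["big data needs big storage", "data has value"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"], counts["data"])  # 2 2
```

In a real cluster, each map and reduce runs as a task on a Task Tracker slot, and the shuffle moves intermediate data between nodes; the high aggregate throughput comes from running many such tasks against local data in parallel.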
Task trackers are processes on data/slave nodes that accept tasks from a Job Tracker. The tasks are map, reduce, and shuffle operations. Task trackers monitor the tasks running on a node and communicate with the Job Tracker.
Every task tracker has a specified number of slots that correspond to how many tasks it can accept.
During scheduling of a task, the Job tracker looks for an empty task slot on the same node as where the data block resides – thus achieving data locality. Next, it looks for a node with an empty slot on the same rack.
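The locality preference described above amounts to a tiered search over trackers with free slots. A simplified sketch (hypothetical data structures; the real Job Tracker scheduler is considerably more involved) looks like this:

```python
def pick_task_tracker(trackers, replica_nodes, replica_racks):
    """trackers: list of dicts with 'node', 'rack', 'free_slots'.
    replica_nodes / replica_racks: where the data block's copies live."""
    # Pass 1: data-local - a free slot on a node that holds a replica.
    for t in trackers:
        if t["free_slots"] > 0 and t["node"] in replica_nodes:
            return t["node"], "data-local"
    # Pass 2: rack-local - a free slot on a rack that holds a replica.
    for t in trackers:
        if t["free_slots"] > 0 and t["rack"] in replica_racks:
            return t["node"], "rack-local"
    # Fallback: any free slot, even off-rack.
    for t in trackers:
        if t["free_slots"] > 0:
            return t["node"], "off-rack"
    return None, "queued"

trackers = [
    {"node": "n1", "rack": "r1", "free_slots": 0},
    {"node": "n2", "rack": "r1", "free_slots": 2},
    {"node": "n3", "rack": "r2", "free_slots": 1},
]
# Block replicas live on n1 (rack r1) and n4 (rack r2); n1 is busy,
# so the scheduler settles for a rack-local slot on n2.
node, locality = pick_task_tracker(trackers, {"n1", "n4"}, {"r1", "r2"})
print(node, locality)  # n2 rack-local
```

The point of the tiering is to minimize network traffic: a data-local task reads from local disk, a rack-local task stays within a top-of-rack switch, and only the fallback crosses the cluster backbone.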
ViPR HDFS provides an HDFS-compatible file system. In this way, the compute portion of an existing Hadoop cluster communicates with ViPR HDFS. Existing storage arrays managed by ViPR can now be made accessible via HDFS.
The HDFS data service uses the same unstructured storage engine as the ViPR Object data service. ViPR data services create a unified pool (bucket) of data. Similar to the Object data service, users create buckets which can span file shares that can grow and shrink on demand. The data is distributed across the arrays according to how the virtual storage pool is configured. The bucket provides an HDFS interface or, optionally, an Object (S3) and HDFS interface. In this way, the compute portion of an existing Hadoop cluster communicates with ViPR HDFS, which uses existing data (added to the HDFS bucket) as the target for Big Data applications and queries.
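The "same data, two interfaces" idea can be illustrated with a toy bucket that resolves both an S3-style object key and an HDFS-style path to the same stored bytes. This is a conceptual sketch only, not the ViPR implementation; the class and method names are invented for illustration:

```python
class Bucket:
    """Toy unified bucket: one copy of the data, addressable either as
    an object (S3-style key) or as a file (HDFS-style path)."""

    def __init__(self, name):
        self.name = name
        self.blobs = {}          # key -> bytes: the single source of truth

    def put_object(self, key, data):
        self.blobs[key] = data   # write via the object interface

    def get_object(self, key):
        return self.blobs[key]   # read via the object interface

    def open_hdfs(self, path):
        # An HDFS-style path like viprfs://bucket/logs/day1 maps to the
        # object key 'logs/day1' - no copy or migration is needed.
        prefix = "viprfs://" + self.name + "/"
        return self.blobs[path[len(prefix):]]

b = Bucket("analytics")
b.put_object("logs/day1", b"click,click,buy")
print(b.open_hdfs("viprfs://analytics/logs/day1"))  # b'click,click,buy'
```

The design point is that the bucket stores one authoritative copy; each data service is just a different addressing scheme over the same unstructured storage engine, so an object written by an application is immediately visible to a Hadoop job.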
The above diagram illustrates the system architecture of how a ViPR customer can expose their existing data in a ViPR managed array to their Hadoop cluster and run MapReduce jobs on this data.
The object data service and the HDFS data service run on the same set of ViPR Data Service VMs. These VMs can be scaled as the capacity of storage is increased.
ViPR 1.1 will make available a client library (ViPR-HDFS Client) that needs to be installed on all the nodes that run MR jobs on the customer’s Hadoop cluster.
When a task running on a node needs to read a file, the request goes to the ViPR-HDFS client (since the customer points to viprfs:// as their data source), and the ViPR client communicates with the HDFS head on the ViPR data node. The ViPR client passes in an authN token that identifies the user to the HDFS head.
The HDFS head in the ViPR data node receives requests from the ViPR-HDFS client. It verifies the user’s identity by authenticating against the KDC, then talks to the blob engine and the controller process running on the node to fetch the requested data once authN and authZ succeed.
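The read path above, authenticate the token, authorize the request, then fetch from the blob engine, can be sketched as follows. All names here are hypothetical stand-ins; the real flow involves the KDC and controller process, not in-memory tables:

```python
class AuthError(Exception):
    pass

# Stand-ins for the KDC and blob engine (assumed, simplified interfaces).
VALID_TOKENS = {"tok-alice": "alice"}   # authN token -> principal
ACL = {"logs/day1": {"alice"}}          # path -> users allowed to read
BLOBS = {"logs/day1": b"telemetry..."}  # what the blob engine would return

def hdfs_head_read(token, path):
    """Sketch of the HDFS head's read path: authN, then authZ, then fetch."""
    user = VALID_TOKENS.get(token)          # 1. authenticate the token
    if user is None:
        raise AuthError("authentication failed")
    if user not in ACL.get(path, set()):    # 2. authorize the request
        raise AuthError("authorization failed")
    return BLOBS[path]                      # 3. fetch from the blob engine

print(hdfs_head_read("tok-alice", "logs/day1"))  # b'telemetry...'
```

The ordering matters: no data is touched until both authN and authZ succeed, which is what lets the HDFS head sit safely in front of shared enterprise storage.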
In addition to physical segregation, buckets provide logical segregation within the object store. Just like in S3, a user can create buckets which logically segregate applications or sets of data. These buckets can grow and shrink on demand. The actual data objects are distributed and intermingled across the physical devices that comprise the virtual storage array.
Use Case:
Customer sets up ViPR across multiple Isilon and VNX arrays and ingests data into ViPR
ViPR data services create a unified pool (bucket) of data across file shares and provide users with an HDFS interface
Customer installs ViPR HDFS client on an existing PivotalHD cluster
Customer starts writing Hive queries referencing ViPR HDFS as the data source
Use Case:
Customer has an existing PivotalHD cluster with data stored in HDFS within the cluster and has also installed ViPR HDFS client on this PivotalHD cluster
Customer also sets up ViPR across multiple Isilon and VNX arrays and ingests data into ViPR
Customer starts writing MapReduce jobs that reference data in HDFS within the PivotalHD cluster as well as data in ViPR HDFS thereby opening up new analytics scenarios.
The spanning use case is meant to explain that ViPR HDFS and HDFS can coexist. ViPR HDFS will not entirely replace HDFS.
Use Case:
An environment with Cloudera infrastructure installs the ViPR HDFS client
Customer sets up ViPR across multiple Isilon and VNX arrays
Customer starts writing Hive queries referencing ViPR HDFS as the data source and is able to use the existing environment to point at ViPR HDFS
Use Case:
An environment with multiple VNX and Isilon, installs ViPR data services
ViPR data services create a unified pool (bucket) of data across file shares and provide users with access to either an S3 or HDFS interface
Object-based applications as well as analytics workloads are able to use the same set of data without having to move it around.