Dell | Hadoop White Paper Series: Hadoop Business Cases
Hadoop brings new capabilities
Growing data volumes and interconnected systems create a need for a tool capable of building the next generation of
analytics and data management solutions. Hadoop provides a framework for your company to analyze and manage
growing volumes of data while storing data longer than previously possible at a competitive price point. By extending the
life of data and not discarding it, you can enable staff to review historic data in new ways and analyze it as new methods
emerge.
The Hadoop taxonomy is outlined in Figure 1, showing the components common to all Hadoop environments. These
components are part of the core Apache Hadoop project. The Hadoop architecture is very pluggable, allowing any
component to be replaced with one optimized for a specific workload, while allowing a large variety of data presentation
layers to utilize the data stored in Hadoop. The vertical bars on the right designate components not included as part of a
default Hadoop distribution; these components are commonly provided by IT providers to enhance their Hadoop
offerings.
Figure 1. Core components of a Hadoop deployment
In addition to the core Hadoop components shown in Figure 1, a variety of projects have developed as part of the
Hadoop ecosystem to provide specific solutions for using data within Hadoop in common ways. Many projects have
evolved for storing and processing specific types of data within Hadoop, allowing many industries to create specific
solutions built on a common storage and compute engine within Hadoop.
Figure 2. The core Hadoop ecosystem with additional tools for data presentation
Components of the Hadoop ecosystem can be built on one or more of the three primary Hadoop use cases:

Compute: Hadoop is commonly used as a distributed compute platform for analyzing or processing large amounts of data. The Hadoop ecosystem provides APIs necessary to distribute and track workloads as they are run on large numbers of distributed machines.

Storage: One primary component of the Hadoop ecosystem is the Hadoop Distributed File System (HDFS). The HDFS allows users to have a single addressable namespace, spread across many hundreds or thousands of servers, creating a single large file system. Hadoop manages the replication of the data within this file system to ensure hardware failures do not lead to data loss. Many organizations will use this scalable file system as a place to store large amounts of data that is then accessed by jobs run within Hadoop or by external systems.

Database: The Hadoop ecosystem contains components that allow the data within the HDFS to be presented in a SQL interface. This allows the use of standard functions, including INSERT, SELECT, and UPDATE of data within the Hadoop environment, with minimal code changes to existing applications. These components allow developers to quickly access the data stored within a Hadoop environment with tools they are experienced at using.
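The storage role is what applications see through the HDFS API. The following is a minimal sketch, in Java, of writing and reading a small file through the Hadoop FileSystem API; it assumes the cluster address is supplied by a core-site.xml on the classpath, and the file path used is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml; the cluster address
        // is assumed to be configured there rather than hard-coded here.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; HDFS presents one namespace across the cluster.
        Path path = new Path("/data/example/events.txt");

        // Write a small file; HDFS handles block placement and replication.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("first event record\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back through the same namespace.
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}

The same namespace written here is what compute jobs and SQL-style components read from, which is what lets all three use cases share a single copy of the data.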
Hadoop provides a consistent, scalable base of tools within an organization for storing, managing, and analyzing data,
without being tied to any specific department or framework. Hadoop enables your organization to use a single set of data
for all departments’ reporting, analysis, and research needs. This single source enables better quality results and eliminates
the cost and complexity of managing multiple islands of data.
Business is changing quickly; the goals of any individual tool today may not be the same tomorrow. The same goes for
organizations and their areas of focus within a large corporation. Deciding what data to discard amid the flood of data
most companies are experiencing is a difficult challenge. Hadoop enables your company to store more data, with less
overhead, than ever before. This enables your staff to later ask questions of that data and analyze it in ways not even
thought of today.
Management of the data fire hose
The evolving community around “big data” (the industry term for environments containing large volumes of related but
unstructured data) finds new ways for analyzing and managing growing volumes of data. We are also exploring the
creation of new ways to make sense of large piles of previously misunderstood data sets.

On any given day, most companies do not know what questions to ask of certain data. When this has occurred in the
past, companies would purge that data because of the cost of storing data with indeterminate value. Today, companies
exploit tools like Hadoop for storing that data for much longer periods of time, often until such time that staff find new
ways to understand how the data can be used and what questions can be asked of it.

Today, data is as valuable as any software written by a company or any product it designs. The data is the component
that drives next-generation products and enables maximum revenue attainment from existing products. Hadoop provides
a low-barrier-to-entry solution for storing the additional data being created by today’s companies.

Figure 3. Questions bring out the value of data.
Hadoop enters the enterprise
Hadoop is rarely initially deployed as a company-wide data analytics solution; more often, Hadoop is deployed by a single
department or organization that sees it as a solution to certain challenges. Hadoop inevitably is then used by more and
more departments, becoming a more critical piece of the corporation’s storage and analytics solutions.
Hadoop deployments commonly start with a smaller deployment within a virtual environment; this could be virtual
machines hosted on premises or in a public cloud environment. This method enables your IT staff to learn about managing
Hadoop and enables your developers to begin testing ideas they have about uses of Hadoop. This use of virtual
infrastructure usually stops as soon as real workloads are tested, signaling a move to physical hardware
dedicated to Hadoop. This change is primarily driven by data volumes and performance needs. At a certain inflection
point, moving data to a public cloud becomes too time-consuming, so companies look to internally hosted and managed
Hadoop solutions.
It is important to understand the evolution of Hadoop in your environment to ensure that you adequately plan each stage
of the evolution. Hadoop can rapidly become a large, complex component of your information technology (IT)
department. By understanding how Hadoop commonly evolves, you can better manage that evolution in your
environment and ensure Hadoop meets your company’s needs, without causing an undue operations burden.
Analytics
Analytics is becoming a more critical component in all business environments. Analytics is being used to provide near
real-time reporting on the state of a business, allowing leaders to make rapid decisions to correct the course of an
organization or to capitalize on the needs of the market. The emerging market of tools for analytics allows companies to
manipulate the raw data they get from a variety of sources and make intelligent decisions about the state of the business.
Many marketing and sales-focused organizations are now using Hadoop as the core of their analytics programs. Hadoop
is used to store a central copy of customer data and product usage information, allowing those developing pricing and
sales models to refine the data in new ways. These analytics let analysts look for relationships that were not previously
discoverable in traditional, separate, relational database-driven data warehouse environments.
Another example of using analytics to minimize operational expenses is in IT. By leveraging the hyperscale compute and
storage capabilities of Hadoop, your IT personnel can optimize system reporting, analyze system performance versus
operational expenses, detect potential cases for system failures, and minimize system downtime. Your CIO and IT
managers can analyze the most optimal operational models, determine operational inflection points, and plan the next
budget cycle.
Risk modeling
Many financial services firms are beginning to use Hadoop for risk modeling. Hadoop provides a base for storing and
processing large amounts of data, enabling firms to focus on algorithm development and optimization. Hadoop enables
companies to avoid the difficulty of massively parallel programming, while exploiting the capabilities provided by
commodity hardware and software.
By using Hadoop to enable your company’s risk modeling projects, data from many different sources can be pulled into a
single location and modeled by a single set of algorithms. Traditionally, a large company required risk modeling to occur at
the business unit or departmental level. This modeling was commonly done in different ways by the different financial
analyst teams. Hadoop enables a single, companywide team to model a company’s exposure to risk and understand what
dynamics are at play against that risk position.
Figure 4. Hadoop enables a single, companywide team to model exposure to risk.
Hadoop ecosystem
The Hadoop ecosystem is a rapidly growing and evolving set of tools for Hadoop operations and tools specific to verticals
and uses for Hadoop. The Hadoop ecosystem contains many tools specific to operational use cases and the manipulation
of specific types of data. This large ecosystem makes Hadoop a strong platform for companies as they evaluate and grow
their analytics or business intelligence environments. Some of the most common tools within the Hadoop ecosystem for
supporting scale-out environments include Flume, Sqoop, and Zookeeper.
Flume is a commonly used tool within the Hadoop ecosystem for handling streaming data. Flume provides a framework
for agents on one or many servers to collect events and store them in a single HDFS namespace. Flume also provides the
necessary frameworks for developing work streams for processing those events, reporting on them, and taking action on
them.
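As a brief sketch of how an application hands events to such an agent, the example below uses Flume's Java client SDK to send a single event; the agent host, port, and event body are hypothetical, and the agent is assumed to be configured with an Avro source feeding an HDFS sink.

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

import java.nio.charset.StandardCharsets;

public class FlumeEventSender {
    public static void main(String[] args) throws EventDeliveryException {
        // Hypothetical agent address; the agent's Avro source listens on this port.
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent.example.com", 41414);
        try {
            // One event; the agent's HDFS sink writes it into the shared namespace.
            Event event = EventBuilder.withBody("sample log line", StandardCharsets.UTF_8);
            client.append(event);
        } finally {
            client.close();
        }
    }
}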
Sqoop is a component within the Hadoop ecosystem for enabling connectivity between Hadoop environments and
traditional SQL environments, including relational databases and data warehouses. Sqoop enables automated processes
to be developed for moving data between Hadoop and data warehouses, enabling data warehouses to have access to
large amounts of data traditionally stored in other environments or not available at all to business intelligence developers
and analysts.
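A minimal sketch of that kind of automated movement is shown below, invoking Sqoop's import tool from Java; the availability of the Sqoop.runTool entry point on the classpath is assumed, and the JDBC URL, credentials, table name, and target directory are all hypothetical placeholders.

import org.apache.sqoop.Sqoop;

public class NightlyCustomerImport {
    public static void main(String[] args) {
        // Hypothetical connection details; in practice these would come from a
        // scheduler or configuration store, not hard-coded values.
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:mysql://warehouse-db.example.com/sales",
            "--username", "etl_user",
            "--password", "changeme",
            "--table", "customers",
            "--target-dir", "/data/warehouse/customers"
        };
        // runTool parses the arguments and runs the import as a MapReduce job.
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.exit(exitCode);
    }
}

A job like this can be scheduled nightly so the warehouse and Hadoop always share a current copy of the table.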
Zookeeper is a component commonly used by applications that exploit data stored in the HDFS. Zookeeper provides a
framework for managing distributed applications and the locks between them for consistent data access, providing
naming services, and providing synchronization between separate servers and processes that are part of a single, larger
application.
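As a hedged illustration of that coordination role, the sketch below uses the ZooKeeper Java client to take a simple exclusive lock by creating an ephemeral znode; the ensemble address and lock path are hypothetical, and a production application would normally add watches and retry handling, or use a recipe library such as Apache Curator.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SimpleZkLock {
    public static void main(String[] args) throws Exception {
        // Hypothetical ensemble address; 30-second session timeout, no-op watcher.
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30000, event -> { });
        try {
            // An ephemeral node disappears if this process or its session dies,
            // so a crashed holder cannot leave the lock orphaned.
            zk.create("/daily-report-lock",
                      new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.EPHEMERAL);
            System.out.println("Lock acquired; safe to update shared data.");
            // ... do the protected work here ...
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("Another process holds the lock.");
        } finally {
            zk.close();
        }
    }
}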
Hadoop futures
Most organizations have used specialized teams for business intelligence development and exploitation of a company’s
data. Hadoop enables that functionality to be pushed to a larger group of staff within the organization. Hadoop provides a
single unified interface and data store for many staff across all departments to use when analyzing company statistics and
developing new methods for success in a market.
Hadoop empowers all your employees to think of new ways to improve the bottom line and allows them access to the
necessary information to test their theories, develop strategies, and report on changes in the business.
Hadoop provides the base software and associated ecosystem to manage growing amounts of data. Hadoop enables
your company to store more data than ever before and provide it to a larger portion of the staff for analysis both today
and tomorrow. Hadoop can be used to enable near real-time decision making by your company leadership and allow
your staff to test new ideas and analyze data in new ways.
About the author
Joey Jablonski is a principal solution architect with Dell’s Data Center Solutions team. Joey works to define and
implement Dell’s solutions for Big Data, including solutions based on Apache Hadoop. Joey has spent more than 10 years
working in high performance computing, with an emphasis on interconnects, including Infiniband and parallel file
systems. Joey has led technical solution design and implementation at Sun Microsystems and Hewlett-Packard, as well as
consulted for customers, including Sandia National Laboratories, BP, ExxonMobil, E*Trade, Juelich Supercomputing
Centre, and Clumeq.
Special thanks
The author extends special thanks to:
Rob Hirschfeld, Principal Cloud Solutions Architect, Dell
Aurelian Dumitru, Principal Cloud Solutions Architect, Dell
John Igoe, Executive Director, Next Generation Computing Solutions, Dell
About Dell Next Generation Computing Solutions
When cloud computing is the core of your business and its efficiency and vitality underpin your success, the Dell Next
Generation Computing Solutions are Dell’s response to your unique needs. We understand your challenges—from
compute and power density to global scaling and environmental impact. Dell has the knowledge and expertise to tune
your company’s “factory” for maximum performance and efficiency.
Dell’s Next Generation Computing Solutions provide operational models backed by unique product solutions to meet the
needs of companies at all stages of their lifecycle. Solutions are designed to meet the needs of small startups while
allowing scalability as your company grows.
Deployment and support are tailored to your unique operational requirements. Dell’s Cloud Computing Solutions can
help you minimize the tangible operating costs that have hyper-scale impact on your business results.