1. The document provides an overview of Hadoop and big data technologies, use cases, common components, challenges, and considerations for implementing a big data initiative.
2. Financial and IT analytics are currently the top planned use cases for big data technologies according to Forrester Research. Hadoop is an open source software framework for distributed storage and processing of large datasets across clusters of computers.
3. Organizations face challenges in implementing big data initiatives including skills gaps, data management issues, and high costs of hardware, personnel, and supporting new technologies. Careful planning is required to realize value from big data.
Acknowledgment
This document draws extensively on, and focuses on, the work and viewpoints of industry participants, including:
Diversity Limited
Forrester
IBM
Intel
ITG
MIT Sloan Management Review
TDWI Research
Planet Cassandra
Christopher Bienko @ IBM
Dirk deRoos @ IBM
Marc Andrews @ IBM
Paul Zikopoulos @ IBM
Rick Buglio @ IBM
Introduction
With many organisations considering getting on the Hadoop bandwagon, this document provides an overview
of the planned use cases for Hadoop, an illustration of some of the common technology components,
suggestions on when Hadoop is worth considering, some of the challenges organisations are experiencing, cost
considerations and, finally, how an organisation should position itself for a Big Data initiative. Any organisation
considering a Big Data initiative with Hadoop should thoroughly consider each of these areas before embarking
on a course of action.
Planned use cases for Big Data technologies
According to a 2015 Forrester Research report, financial and IT analytics are currently amongst the top planned
use cases for Big Data technologies.
Figure 1: Planned use cases for Big Data technologies
Figure 1. Financial And IT Analytics Are Top Planned Use Cases For Big Data Technologies. Copyright 2015 by Forrester. Reprinted with
permission.
While there is much debate on which planned use cases are best suited to kick-start a Big Data initiative, most
recommendations start with the proverbial “low-hanging fruit”. While such an approach may deliver short-term
gains and increase organisational support for a broader program, the downside of this strategy is that the
analytical gains made are likely to be easily replicated by competitors.
As articulated by MIT Sloan Management Review (2015), companies that have sustained an advantage from
analytics have often taken a more strategic approach of focusing on large, critical problems.
Big Data technology: Enter Hadoop
In their book, Big Data Beyond the Hype, Zikopoulos, deRoos, Bienko, Buglio and Andrews (2014) classify Hadoop
as an ecosystem of software packages that provides a computing framework.
This includes:
- MapReduce, which leverages a K/V (key/value) processing framework (don’t confuse that with a K/V
database);
- a file system (HDFS);
- and many other software packages that support everything from importing and exporting data (like
Sqoop) to storing transactional data (like HBase), orchestration (Avro, ZooKeeper, YARN), and more.
When you hear that someone is running a Hadoop cluster, it’s likely to mean MapReduce (or some other
framework like Spark) running on HDFS.
On the other hand, NoSQL refers to non-RDBMS database solutions such as HBase, Cassandra, MongoDB,
Riak, and CouchDB, among others.
(Zikopoulos, deRoos, Bienko, Buglio, Andrews, 2014, pg. 38)
Figure 2 provides an example illustration of some of the components of the Hadoop ecosystem. The Forrester
(2014) illustration shows a mix of Apache, other open source, and vendor products.
Figure 2: Example of the Hadoop Ecosystem
Figure 2. The Hadoop Ecosystem Is A Mix Of Apache, Other Open Source, And Vendor Products. Copyright 2014 by Forrester. Reprinted with
permission.
Hadoop
Hadoop is an open source software stack that runs on a cluster of machines. Hadoop provides distributed
storage and distributed processing for very large data sets.
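To make the distributed-storage half of that statement concrete, the short sketch below stages a local file in HDFS using the standard hdfs dfs command-line client driven from Python. This is an illustrative, assumption-laden example, not from the cited sources: it presumes a configured Hadoop client on the PATH, and the file name and /data/raw directory are invented for illustration.

    # Minimal sketch: staging data in HDFS via the standard `hdfs dfs` client.
    # Assumes a configured Hadoop client on the PATH; file and path names are illustrative.
    import subprocess

    def hdfs(*args):
        # Thin wrapper around the `hdfs dfs` command-line client.
        subprocess.run(["hdfs", "dfs", *args], check=True)

    hdfs("-mkdir", "-p", "/data/raw")          # create a directory in the cluster file system
    hdfs("-put", "weblogs.csv", "/data/raw/")  # copy a local file into HDFS; blocks are replicated across nodes
    hdfs("-ls", "/data/raw")                   # list what the cluster now stores

Once data sits in HDFS it is available to any processing framework running on the cluster, which is the property the following sections build on.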
MapReduce
MapReduce is a system for parallel processing of large data sets.
According to IBM (2015), as an analogy, you can think of map and reduce tasks as the way a census was
conducted in Roman times, where the census bureau would dispatch its people to each city in the empire.
Each census taker in each city would be tasked to count the number of people in that city and then return
their results to the capital city. At the capital, the results from each city would be reduced to a single count
(sum of all cities) to determine the overall population of the empire. This mapping of people to cities, in
parallel, and then combining the results (reducing) is much more efficient than sending a single person to
count every person in the empire in a serial fashion. (IBM, 2015)
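To ground the analogy, the sketch below shows a minimal word-count mapper and reducer in the style of Hadoop Streaming, which allows non-Java executables to serve as the map and reduce tasks. The script is an illustrative assumption (its name and invocation are not from the cited sources): each mapper acts as a census taker over its own input split, and each reducer sums the counts for the keys routed to it.

    #!/usr/bin/env python3
    # Word-count mapper and reducer for Hadoop Streaming (illustrative sketch).
    import sys

    def mapper():
        # Map phase: emit <word, 1> for every word in this mapper's input split.
        for line in sys.stdin:
            for word in line.strip().split():
                print(f"{word}\t1")

    def reducer():
        # Reduce phase: the framework sorts map output by key, so counts for a
        # given word arrive contiguously and can be summed in one pass.
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t")
            if word != current:
                if current is not None:
                    print(f"{current}\t{total}")
                current, total = word, 0
            total += int(count)
        if current is not None:
            print(f"{current}\t{total}")

    if __name__ == "__main__":
        # Run as: word_count.py map   (mapper)   or   word_count.py reduce   (reducer)
        mapper() if sys.argv[1] == "map" else reducer()

Such a script would typically be submitted with the hadoop-streaming JAR, passing it as both the -mapper and -reducer executables along with HDFS input and output paths.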
NoSQL
NoSQL is a database environment. Using the definition from Planet Cassandra (2015), a NoSQL database
environment is, simply put, a non-relational and largely distributed database system that enables rapid,
ad hoc organization and analysis of extremely high-volume, disparate data types. NoSQL databases were
developed in response to the sheer volume of data being generated, stored and analyzed by modern users
(user-generated data) and their applications (machine-generated data). (Planet Cassandra, 2015)
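As a small, hedged illustration of what non-relational, ad hoc organisation looks like in practice, the sketch below stores differently shaped documents in MongoDB (one of the NoSQL stores named earlier) using the pymongo driver. The connection string, database, collection, and field names are assumptions made purely for illustration.

    # Minimal sketch of schema-less storage in a document store (MongoDB via pymongo).
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumed local instance
    events = client["analytics"]["events"]             # illustrative database and collection names

    # Documents in the same collection need not share a schema.
    events.insert_one({"user": "u1", "action": "login", "device": "mobile"})
    events.insert_one({"user": "u2", "action": "purchase", "amount": 19.99, "items": ["sku-42"]})

    # Ad hoc query with no upfront table design or schema migration.
    for doc in events.find({"action": "purchase"}):
        print(doc["user"], doc.get("amount"))

No schema is declared before the inserts; each document simply carries whatever fields it has, which is what makes this style of store a natural landing zone for disparate machine- and user-generated data.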
Why Hadoop
According to IBM (2015), Hadoop changes the economics and dynamics of large-scale computing by enabling a
solution that is:
- Scalable: Add new nodes as needed without changing data formats, how data is loaded, how jobs are
written or the applications on top.
- Cost-effective: Hadoop brings massively parallel computing to commodity servers. The result is a
significant decrease in the cost per terabyte of storage, which in turn makes it affordable to model all
your data.
- Flexible: Hadoop is schema-less, and can absorb any type of data, structured or not, from a number of
sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper
analyses than any one system can provide by itself.
- Fault-tolerant: When you lose a node, the system redirects work to another location of the data and
continues processing without missing a beat.
(IBM, 2015, pg. 2)
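The "flexible" point is the easiest to see with an example. The sketch below, using Spark on HDFS rather than raw MapReduce, joins a structured CSV export from an RDBMS with semi-structured JSON clickstream data and aggregates across them. The paths, column names, and session settings are illustrative assumptions, not taken from the cited sources.

    # Sketch: joining structured and semi-structured sources staged on HDFS with Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("flexible-join").getOrCreate()

    # Structured customer records exported from an RDBMS, staged as CSV on HDFS.
    customers = spark.read.option("header", True).csv("hdfs:///data/raw/customers.csv")

    # Semi-structured clickstream events landed as JSON; the schema is inferred on read.
    clicks = spark.read.json("hdfs:///data/raw/clickstream/")

    # Join and aggregate across sources that were never designed to live in one system.
    activity = (customers.join(clicks, on="customer_id", how="left")
                         .groupBy("segment")
                         .count())
    activity.show()

Neither source had to be reshaped into a predefined warehouse schema before it could be combined with the other, which is the sense in which Hadoop is described as schema-less.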
Hadoop has its challenges though
While Hadoop may change the economics and dynamics of large-scale computing, Hadoop is not without its
own set of challenges. According to IBM (2014), there are four key areas of Hadoop that need to mature in order
to drive wider adoption. These include:
1) Performance
2) The reduction of required skills
3) Data governance
4) Deep integration with existing technologies
(IBM, 2014)
Along similar lines, TDWI Research (2015), in a recent survey, found respondents struggling with the following
barriers to Hadoop implementation:
- Skills gap
- Weak business support
- Security concerns
- Data management hurdles
- Tool deficiencies
- Containing costs
(TDWI Research, 2015)
With data consisting of more sensitive personal details than ever before, as well as the increasing volume of
data, governance, risk and compliance concerns are particularly important. In addition, with organisations
looking to improve decision making by moving towards the internal democratization of data, such efforts
necessitate building trust into the Big Data platform. Building trust into a Big Data platform requires a
governance framework covering lineage, ownership etc. As articulated by Zikopoulos, deRoos, Bienko, Buglio
and Andrews (2014), Big Data technologies like Hadoop (and much of the NoSQL world) require more work to
have all of the enterprise hardening from a security perspective.
According to a study by the International Technology Group (2013), organisations need to be particularly mindful
of the highly skilled programming requirements demanded of most Hadoop environments, noting that:
Although the field of players has since expanded to include hundreds of venture capital-funded start-ups, along
with established systems and services vendors and large end users, social media businesses continue to control
Hadoop. Most of the more than one billion lines of code – more than 90 percent, according to some estimates –
in the Apache Hadoop stack has to date been contributed by these.
The priorities of this group have inevitably influenced Hadoop evolution. There tends to be an assumption that
Hadoop developers are highly skilled, capable of working with “raw” open source code and configuring software
components on a case-by-case basis as needs change. Manual coding is the norm.
Decades of experience have shown that, regardless of which technologies are employed, manual coding offers
lower developer productivity and greater potential for errors than more sophisticated techniques.
(ITG, 2013, pg. 2)
Hadoop is good for…
According to Forrester (2015), organisations can justify Hadoop investments by reducing budgets allocated to
proprietary systems. In the report, Forrester indicates that some of its clients pay for their Hadoop infrastructure
by scaling down expensive proprietary database hardware and software. Using the savings, they then take
advantage of less expensive Hadoop infrastructure based on open source software and commodity hardware,
via the following approaches:
1) Migrate most of the batch processes like data integration, cleansing, and aggregation to Hadoop,
freeing up database resources to do what they do best: query processing. Next, further reduce
dependency on expensive proprietary systems by moving some of the non-mission-critical data sets
and queries (such as social media analytics or data exploration) to Hadoop.
2) Turn Hadoop Data Hub into a sandbox for business analysis and data scientists. Not all enterprise data
needs to be tightly controlled and integrated. Simply by copying all enterprise data into a Hadoop-based
Data Hub or Data Lake, organizations can achieve immediate benefits.
Use case:
Not every business use case calls for structured analysis based on SQL; there are plenty of use cases
(e.g., data exploration, experimentation, creating what-if scenarios, looking for previously uncovered
patterns) where simply searching through data can uncover a wealth of hidden information. Most
enterprise applications are based on relational databases, which are not designed for search, plus
one would have to go to each application separately to conduct such a search. Data staged in a
Hadoop Data Hub or Data Lake becomes instantly searchable and available for exploration and
discovery.
3) Extend Agile Business Intelligence to Big Data with Hadoop on-demand data marts.
Use case:
While columnar SQL databases provide extra agility and flexibility, they still require a certain level of
investment in proprietary systems and professional resources. As a result, it may not make financial
sense to build a data mart based on a platform that may only be used for a one-off application or a
process that only runs for a few weeks (for example, analysing post-merger customer integration
implications). Apache Hive, Apache Drill, and other Hadoop relational structures are perfect venues
for such one-off use cases.
(Forrester, 2015, pg. 3 – 4)
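A hedged sketch of the data lake exploration and on-demand data mart ideas in points 2 and 3 is given below, using Spark SQL over files already staged in HDFS; Apache Hive or Drill could serve the same one-off role. The paths, view name, and columns are assumptions made purely for illustration (the post-merger customer scenario mirrors the example above).

    # Sketch: an on-demand "data mart" over files already staged in a Hadoop data lake.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("on-demand-mart").getOrCreate()

    # Expose raw post-merger customer extracts (already in the lake) as a queryable view.
    spark.read.parquet("hdfs:///lake/merger/customers/").createOrReplaceTempView("merged_customers")

    # One-off exploratory analysis; when the exercise ends there is nothing to
    # decommission beyond dropping the view.
    overlap = spark.sql("""
        SELECT region,
               COUNT(*) AS customers,
               SUM(CASE WHEN legacy_system = 'both' THEN 1 ELSE 0 END) AS duplicates
        FROM merged_customers
        GROUP BY region
        ORDER BY duplicates DESC
    """)
    overlap.show()

The point is not the specific query but that no proprietary data mart had to be provisioned for an analysis that may only run for a few weeks.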
Costs associated with typical Big Data implementations
Although a Big Data environment such as that illustrated in Figure 2 can be constructed from open source
software, such as Hadoop and a NoSQL database such as MongoDB, there are still substantial costs involved.
These include:
1) Hardware costs
2) IT and operational costs in setting up a machine cluster and supporting it
3) Cost of personnel to work on the ecosystem
These costs are NOT trivial for the following reasons:
- Dealing with cutting edge technology and finding people who know the technology is challenging
- The technology introduces a different programming paradigm, frequently requiring additional training
of existing engineering teams
- These technologies are new and still evolving and are not yet mature in the enterprise ecosystem
- The hardware is server grade and large clusters require resources including network administration,
security administration, system administration etc., as well as data centre operational costs including
electricity, cooling etc.
Although implementing a Big Data platform is not without its challenges, the availability of IaaS platforms for
Big Data reduces some of the initial risks that would traditionally be associated with such projects.
Infrastructure as a Service (IaaS)
One consideration that can mitigate the cost implications of hardware and support personnel is the use of a
cloud offering. As pointed out by Intel (2015), clouds are already deployed on pools of server, storage, and
networking resources and can scale up or down as needed. Cloud computing offers a cost-effective way to
support Big Data technologies and the advanced analytics applications that can drive business value.
Diversity Limited (2010) defines Infrastructure as a Service (IaaS) as “a way of delivering Cloud Computing
infrastructure – servers, storage, network and operating systems – as an on-demand service. Rather than
purchasing servers, software, datacenter space or network equipment, organisations instead buy those
resources as a fully outsourced service on demand.”
The Big Data approach
According to Forrester (2015) organisations should organise for Big Data using the following approach:
- Use small, agile insights teams versus centralized delivery services.
GE successfully deployed such a strategy in the development of its Data Lake with what GE's Vince
Campisi calls "a two-pizza team", meaning a team no bigger than the number of people you could feed
with two pizzas. As Forrester points out, a decentralized approach requires a careful balance with
appropriate data platform management, centers of excellence (CoEs), and communities of practice.
- Centralize the data pipeline and data platform teams.
People often think that because the data is there, it is ready to be used – but that is seldom the case.
Organisations failing to undertake the critical effort of making the data accessible are left with an
ineffective process of trying to figure out where it exists, breaking down silos, and organizing data in
order for someone to actually go after the outcome they seek. The harsh reality is that Big Data is messy
data and there is no quick and easy way around it. To paraphrase the Ancient Mariner, organisations
can easily find themselves in the situation of: "Data, data, every where, Nor any drop to drink."
- Put data science everywhere but technology management.
The rationale for this is simple: data and analytics is a business requirement, driven by business, used by
business to solve business challenges and to drive business opportunities. Certainly IT has a seat at the
table, and is critical to enablement, but ultimately IT should not own the strategy.
- Embrace communities of practice over CoEs for insights execution.
Return on investment in analytics needs to be seen as a process rather than as a control metric.
Organisations need to focus on “value in use” – which measures value in terms of how a given asset
provides benefits to a specific owner under a specific use. Consequently, only when an analytical model
or application is actually used can its real benefits (and costs) be identified. Experienced analytical
leaders therefore focus on execution and the value derived throughout the project’s life cycle, not just
at the end of the project.
- Govern analytics, not just data.
An organisation embarking on a Big Data initiative will face many first time governance challenges.
Addressing these challenges and doing so consistently is a critical consideration for organisations
looking to exploit opportunities in data and analytics – where the difference between those that
succeed and those that fail could well rest on the strength or weakness of the organisation's data
governance foundation.
Summary
Big Data is having a substantive impact on organisations across industries. Organisations are combining Big Data
and analytics to overcome many of the challenges presented by more traditional technologies, and to support
new capabilities. While Big Data has been enabled by technologies like Hadoop, the challenges are acute on two
fronts. Firstly, organisations face challenges finding people skilled in this environment. Secondly, data governance
challenges are increasing in number and evolving in complexity. While these challenges are not trivial, those
organisations that successfully navigate them will be rewarded with opportunities yet to be discovered.
References
Diversity Limited. (2010). Moving your infrastructure to the cloud. [pdf]. Retrieved from http://diversity.net.nz/wp-content/uploads/2011/01/Moving-to-the-Clouds.pdf
Forrester. (2014). The Hadoop ecosystem is a mix of Apache, other open source, and vendor products. [Diagram]. In Forrester. (2014). Hadoop ecosystem overview, Q4 2014. [pdf]. Retrieved from https://www.forrester.com/home/
Forrester. (2015). Brief: Reasons to (or not to) modernize BI platforms with Hadoop. [pdf]. Retrieved from https://www.forrester.com/home/
Forrester. (2015). Q&A: Forrester's top five questions about big data. [pdf]. Retrieved from https://www.forrester.com/home/
Forrester. (2015). Financial and IT analytics are top planned use cases for big data technologies. [Diagram]. In Forrester. (2015). Q&A: Forrester's top five questions about big data. [pdf]. Retrieved from https://www.forrester.com/home/
IBM. (2014). Analytics: The speed advantage. [pdf]. Retrieved from http://www-935.ibm.com/services/us/gbs/thoughtleadership/2014analytics/
IBM. (2015). Analytics: What is MapReduce. [web page]. Retrieved from http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/
IBM. (2015). Making the case for Hadoop and big data in the enterprise. [pdf]. Retrieved from http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=PM&subtype=BK&htmlfid=IMM14161USEN#loaded
Intel. (2015). Big data cloud technology. [pdf]. Retrieved from http://www.intel.co.za/content/dam/www/public/us/en/documents/product-briefs/big-data-cloud-technologies-brief.pdf
ITG. (2013). Business case for enterprise big data deployments. [pdf]. Retrieved from http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=IME14028USEN&appname=skmwww
MIT Sloan Management Review. (2015). Sustaining an analytics advantage. [pdf]. Retrieved from http://mitsmr.com/1CqUqMB
Planet Cassandra. (2015). NoSQL databases defined and explained. [web page]. Retrieved from http://www.planetcassandra.org/what-is-nosql/
TDWI Research. (2015). Hadoop for the enterprise. [pdf]. Retrieved from http://www-01.ibm.com/software/data/infosphere/hadoop/resources.html
Zikopoulos, P., deRoos, D., Bienko, C., Buglio, R., & Andrews, M. (2014). Big data beyond the hype. [pdf]. Retrieved from https://www.ibm.com/developerworks/community/blogs/SusanVisser/entry/big_data_beyond_the_hype_a_guide_to_conversations_for_today_s_data_center?lang=en