SlideShare a Scribd company logo
1
Hadoop Overview
Gregg Barrett
2
Acknowledgment
This document draws extensively, and focuses on, the work and viewpoints from industry participants including:
Diversity Limited
Forrester
IBM
Intel
ITG
MIT Sloan Management Review
TDWI Research
Planet Cassandra
Christopher Bienko @ IBM
Dirk deRoos @ IBM
Marc Andrews @ IBM
Paul Zikopoulos @ IBM
Rick Buglio @ IBM
3
Introduction
With many organisations considering getting on the Hadoop bandwagon, this document provides an overview
of the planned use cases for Hadoop, an illustration of some of the common technology components,
suggestions on when Hadoop is worth considering, some the challenges organisations are experiencing, cost
considerations and finally, how an organisation should position for a Big Data initiative. Any organisation
considering a Big Data initiative with Hadoop should thoroughly consider each of these areas before embarking
on a course of action.
Planned use cases for Big Data technologies
According to a 2015 Forrester Research report, financial and IT analytics are currently amongst the top planned
use cases for Big Data technologies.
Figure 1: Planned use cases for Big Data technologies
Figure 1. Financial And IT Analytics Are Top Planned Use Cases For Big Data Technologies. Copyright 2015 by Forrester. Reprinted with
permission.
While there is much debate on which planned use cases are best suited to kick-start a Big Data initiative, most
recommendations start with the proverbial “low-hanging fruit”. While such an approach may deliver short-term
gains and increase organisational support for a broader program, the downside of this strategy is that the
analytical gains made are likely to be easily replicated by competitors.
As articulated by MIT Sloan Management Review (2015) companies that have sustained an advantage from
analytics have often taken a more strategic approach of focusing on large critical problems.
Big Data technology: Enter Hadoop
In their book, Big Data Beyond the Hype, Zikopoulos, deRoos, Bienko, Buglio and Andrews (2014) classify Hadoop
as an ecosystem of software packages that provides a computing framework.
4
This includes:
- MapReduce, which leverages a K/V (key/value) processing framework (don’t confuse that with a K/V
database);
- a file system (HDFS);
- and many other software packages that support everything from importing and exporting data (like
Sqoop) to storing transactional data (like HBase), orchestration (Avro, ZooKeeper, YARN), and more.
When you hear that someone is running a Hadoop cluster, it’s likely to mean MapReduce (or some other
framework like Spark) running on HDFS.
On the other hand, NoSQL refers to non-RDBMS SQL database solutions such as HBase, Cassandra, MongoDB,
Riak, and CouchDB, among others.
(Zikopoulos, deRoos, Bienko, Buglio, Andrews, 2014, pg. 38)
Figure 2 provides an example illustration of some of the components of the Hadoop ecosystem. The Forrester
(2014) illustration shows mix of Apache, other open source, and vendor products.
Figure 2: Example of the Hadoop Ecosystem
Figure 2. The Hadoop Ecosystem Is A Mix Of Apache, Other Open Source, And Vendor Products. Copyright 2014 by Forrester. Reprinted with
permission.
5
Hadoop
Hadoop is an open source software stack that runs on a cluster of machines. Hadoop provides distributed
storage and distributed processing for very large data sets.
MapReduce
MapReduce is a system for parallel processing of large data sets.
According to IBM (2015) as an analogy, you can think of map and reduce tasks as the way a census was
conducted in Roman times, where the census bureau would dispatch its people to each city in the empire.
Each census taker in each city would be tasked to count the number of people in that city and then return
their results to the capital city. At the capital, the results from each city would be reduced to a single count
(sum of all cities) to determine the overall population of the empire. This mapping of people to cities, in
parallel, and then combining the results (reducing) is much more efficient than sending a single person to
count every person in the empire in a serial fashion. (IBM, 2015)
NoSQL
NoSQL is a database environment. Using the definition from Planet Cassandra (2015), a NoSQL database
environment is, simply put, a non-relational and largely distributed database system that enables rapid, ad-
hoc organization and analysis of extremely high-volume, disparate data types. NoSQL databases were
developed in response to the sheer volume of data being generated, stored and analyzed by modern users
(user-generated data) and their applications (machine-generated data). (Planet Cassandra, 2015)
Why Hadoop
According to IBM (2015), Hadoop changes the economics and dynamics of large-scale computing by enabling a
solution that is:
- Scalable: Add new nodes as needed without changing data formats, how data is loaded, how jobs are
written or the applications on top.
- Cost-effective: Hadoop brings massively parallel computing to commodity servers. The result is a
significant decrease in the cost per terabyte of storage, which in turn makes it affordable to model all
your data.
- Flexible: Hadoop is schema-less, and can absorb any type of data, structured or not, from a number of
sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper
analyses than any one system can provide by itself.
- Fault-tolerant: When you lose a node, the system redirects work to another location of the data and
continues processing without missing a beat.
(IBM, 2015, pg. 2)
Hadoop has its challenges though
While Hadoop may change the economics and dynamics of large-scale computing, Hadoop is not without its
own set of challenges. According to IBM (2014), there are four key areas of Hadoop that need to mature in order
to drive wider adoption, these include:
1) Performance
2) The reduction of skills
3) Data governance
4) Deep integration with existing technologies
(IBM, 2014)
Along similar lines TDWI Research (2015) in a recent survey found respondents struggling with the following
barriers to Hadoop implementation:
6
- Skills gap
- Weak business support
- Security concerns
- Data management hurdles
- Tool deficiencies
- Containing costs
(TDWI Research, 2015)
With data consisting of more sensitive personal details than ever before, as well as the increasing volume of
data, governance, risk and compliance concerns are particularly important. In addition, with organisations
looking to improve decision making by moving towards the internal democratization of data, such efforts
necessitate building trust into the Big Data platform. Building trust into a Big Data platform requires a
governance framework covering lineage, ownership etc. As articulated by Zikopoulos, deRoos, Bienko, Buglio
and Andrews (2014), Big Data technologies like Hadoop (and much of the NoSQL world) require more work to
have all of the enterprise hardening from a security perspective.
According to a study by the International Technology Group (2013), organisations need to be particularly mindful
in the highly skilled programming requirements demanded of most Hadoop environments, noting that:
Although the field of players has since expanded to include hundreds of venture capital-funded start-ups, along
with established systems and services vendors and large end users, social media businesses continue to control
Hadoop. Most of the more than one billion lines of code – more than 90 percent, according to some estimates –
in the Apache Hadoop stack has to date been contributed by these.
The priorities of this group have inevitably influenced Hadoop evolution. There tends to be an assumption that
Hadoop developers are highly skilled, capable of working with “raw” open source code and configuring software
components on a case-by-case basis as needs change. Manual coding is the norm.
Decades of experience have shown that, regardless of which technologies are employed, manual coding offers
lower developer productivity and greater potential for errors than more sophisticated techniques.
(ITG, 2013, pg. 2)
Hadoop is good for…
According to Forrester (2015) organizations can justify Hadoop investments by reducing budgets allocated to
proprietary systems. In the report Forrester indicates that some of its clients pay for their Hadoop infrastructure
by scaling down expensive proprietary database hardware and software. Using the savings they then take
advantage of the less expensive Hadoop infrastructure based on open source software and commodity hardware
using the following approaches:
1) Migrate most of the batch processes like data integration, cleansing, and aggregation to Hadoop,
freeing up database resources to do what they do best: query processing. Next, further reduce
dependency on expensive proprietary systems by moving some of the non-mission-critical data sets
and queries (such as social media analytics or data exploration) to Hadoop.
2) Turn Hadoop Data Hub into a sandbox for business analysis and data scientists. Not all enterprise data
needs to be tightly controlled and integrated. Simply by copying all enterprise data into a Hadoop-based
Data Hub or Data Lake, organizations can achieve immediate benefits.
7
Use case:
Not every business use case calls for structured analysis based on SQL; there are plenty of use cases
(e.g., data exploration, experimentation, creating what-if scenarios, looking for previously uncovered
patterns) where simply searching through data can uncover a wealth of hidden information. Most
enterprise applications are based on relational databases, which are not designed for search, plus
one would have to go to each application separately to conduct such a search. Data staged in a
Hadoop Data Hub or Data Lake becomes instantly searchable and available for exploration and
discovery.
3) Extend Agile Business Intelligence to Big Data with Hadoop on-demand data marts.
Use case:
While columnar SQL databases provide extra agility and flexibility, they still require a certain level of
investment in proprietary systems and professional resources. As a result, it may not make financial
sense to build a data mart based on a platform that may only be used for a one-off application or a
process that only runs for a few weeks (for example, analysing post-merger customer integration
implications). Apache Hive, Apache Drill, and other Hadoop relational structures are perfect venues
for such one-off use cases.
(Forrester, 2015, pg. 3 – 4)
Costs associated with typical Big Data implementations
Although a Big Data environment such as that illustrated in Figure 2 can be constructed from open source
software, such as Hadoop and a NoSQL database such as MongoDB, there are still substantial costs involved.
These include:
1) Hardware costs
2) IT and operational costs in setting up a machine cluster and supporting it
3) Cost of personnel to work on the ecosystem
These costs are NOT trivial for the following reasons:
- Dealing with cutting edge technology and finding people who know the technology is challenging
- The technology introduces a different programming paradigm, frequently requiring additional training
of existing engineering teams
- These technologies are new and still evolving and are not yet mature in the enterprise ecosystem
- The hardware is server grade and large clusters require resources including network administration,
security administration, system administration etc., as well as data centre operational costs including
electricity, cooling etc.
Although implementing a Big Data platform is not without its’ challenges, the availability of IaaS platforms for
Big Data, reduce some of the initial risks that would traditionally be associated with such projects
Infrastructure as a Service (IaaS)
One consideration that can mitigate the cost implications of hardware and support personnel is the use of a
cloud offering. As pointed out by Intel (2015) clouds are already deployed on pools of server, storage, and
networking resources and can scale up or down as needed. Cloud computing offers a cost-effective way to
support Big Data technologies and the advanced analytics applications that can drive business value.
Diversity Limited (2010) defines Infrastructure as a Service (IaaS) as “a way of delivering Cloud Computing
infrastructure – servers, storage, network and operating systems – as an on-demand service. Rather than
8
purchasing servers, software, datacenter space or network equipment, organisations instead buy those
resources as a fully outsourced service on demand.”
The Big Data approach
According to Forrester (2015) organisations should organise for Big Data using the following approach:
- Use small, agile insights teams versus centralized delivery services.
GE successfully deployed such a strategy in the development of its Data Lake with what GE’s Vince
Campisi calls “a two-pizza team,” meaning, a team no bigger than the number of people you could feed
off of two pizzas. As Forrester points out a decentralized approach requires a careful balance with
appropriate data platform management, centers of excellence (CoEs), and communities of practice.
- Centralize the data pipeline and data platform teams.
People often think that because the data is there, it is ready to be used – but that is seldom the case.
Organisations failing to undertake the critical effort of making the data accessible are left with an
ineffective process of trying to figure out where it exists, breaking down silos, and organizing data in
order for someone to actually go after the outcome they seek. The harsh reality is that Big Data is messy
data and there is no quick and easy way around it. To paraphrase the Ancient Mariner, organisations
can easily find themselves in the situation of; Data, data, every where, Nor any drop to drink.
- Put data science everywhere but technology management.
The rational on this is simple: data and analytics is a business requirement, driven by business, used by
business to solve business challenges and to drive business opportunities. Certainly IT has a seat at the
table, and is critical to enablement but ultimately IT should not own the strategy.
- Embrace communities of practice over CoEs for insights execution.
Return on investment in analytics needs to be seen as a process rather than as a control metric.
Organisations need to focus on “value in use” – which measures value in terms of how a given asset
provides benefits to a specific owner under a specific use. Consequently, only when an analytical model
of application is actually used can its real benefits (and costs) be identified. Experienced analytical
leaders therefore focus on execution and the value derived throughout the project’s life cycle, not just
at the end of the project.
- Govern analytics, not just data.
An organisation embarking on a Big Data initiative will face many first time governance challenges.
Addressing these challenges and doing so consistently is a critical consideration for organisations
looking to exploit opportunities in data and analytics – where the difference between those that
succeed and those that fail could well rest on the strength or weakness of the organisations data
governance foundation.
Summary
Big Data is having a substantive impact on organisations across industries. Organisations are combining Big Data
and analytics to overcome many of the challenges presented by more traditional technologies, and to support
new capabilities. While Big Data has been enabled by technologies like Hadoop, the challenges are acute on two
fronts. Firstly organisations face challenges finding people skilled in this environment. Secondly data governance
challenges are increasing in number and evolving in complexity. While these challenges are not trivial, those
organisations that successfully navigate them will be rewarded with opportunities yet to be discovered.
9
Reference
Diversity Limited. (2010). Moving your infrastructure to the cloud. [pdf]. Retrieved from
http://diversity.net.nz/wp-content/uploads/2011/01/Moving-to-the-Clouds.pdf
Forrester, (2014). The hadoop ecosystem is a mix of apache, other open source, and vendor products. [Diagram].
Retrieved from Forrester, (2014). Hadoop ecosystem overview, Q4 2014. [pdf]. Retrieved from
https://www.forrester.com/home/
Forrester. (2015). Brief: reasons to (or not to) modernize bi platforms with hadoop. [pdf]. Retrieved from
https://www.forrester.com/home/
Forrester. (2015). Q&A: forrester’s top five questions about big data. [pdf]. Retrieved from
https://www.forrester.com/home/
Forrester. (2015). Financial and IT analytics are top planned use cases for big data technologies. [Diagram].
Retrieved from Forrester, (2015). Q&A: forrester’s top five questions about big data. [pdf]. Retrieved from
https://www.forrester.com/home/
IBM. (2014). Analytics: The speed advantage. [pdf].
Retrieved from http://www-935.ibm.com/services/us/gbs/thoughtleadership/2014analytics/
IBM. (2015). Analytics: What is mapreduce. [web page].
Retrieved from http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/
IBM. (2015). Making the case for hadoop and big data in the enterprise. [pdf].
Retrieved from http://www-01.ibm.com/common/ssi/cgi-
bin/ssialias?infotype=PM&subtype=BK&htmlfid=IMM14161USEN#loaded
Intel. (2015). Big data cloud technology. [pdf]. Retrieved from
http://www.intel.co.za/content/dam/www/public/us/en/documents/product-briefs/big-data- cloud-
technologies-brief.pdf
ITG. (2013). Business case for enterprise big data deployments. [pdf].
Retrieved from http://www-01.ibm.com/common/ssi/cgi-
bin/ssialias?htmlfid=IME14028USEN&appname=skmwww
MIT Sloan Management Review. (2015) Sustaining an analytics advantage. [pdf]. Retrieved from
http://mitsmr.com/1CqUqMB
Planet Cassandra. (2015). Nosql databases defined and explained. [web page].
Retrieved from http://www.planetcassandra.org/what-is-nosql/
TDWI Research. (2015). Hadoop for the enterprise. [pdf].
Retrieved from http://www-01.ibm.com/software/data/infosphere/hadoop/resources.html
Zikopoulos, P., deRoos, D., Bienko, C., Buglio, R., Andrews, M. (2014). Big data beyond the hype. [pdf].
Retrieved from
https://www.ibm.com/developerworks/community/blogs/SusanVisser/entry/big_data_beyond_the_hype_a_gui
de_to_conversations_for_today_s_data_center?lang=en

More Related Content

What's hot

The Big Data Talent Gap
The Big Data Talent GapThe Big Data Talent Gap
The Big Data Talent Gap
Kip Michael Kelly
 
Addressing Storage Challenges to Support Business Analytics and Big Data Work...
Addressing Storage Challenges to Support Business Analytics and Big Data Work...Addressing Storage Challenges to Support Business Analytics and Big Data Work...
Addressing Storage Challenges to Support Business Analytics and Big Data Work...
IBM India Smarter Computing
 
Idc big data whitepaper_final
Idc big data whitepaper_finalIdc big data whitepaper_final
Idc big data whitepaper_final
Osman Circi
 
Big Data Trends and Challenges Report - Whitepaper
Big Data Trends and Challenges Report - WhitepaperBig Data Trends and Challenges Report - Whitepaper
Big Data Trends and Challenges Report - Whitepaper
Vasu S
 
Data Management
Data ManagementData Management
Data Management
Ismail Muhammad
 
LEVERAGING CLOUD BASED BIG DATA ANALYTICS IN KNOWLEDGE MANAGEMENT FOR ENHANCE...
LEVERAGING CLOUD BASED BIG DATA ANALYTICS IN KNOWLEDGE MANAGEMENT FOR ENHANCE...LEVERAGING CLOUD BASED BIG DATA ANALYTICS IN KNOWLEDGE MANAGEMENT FOR ENHANCE...
LEVERAGING CLOUD BASED BIG DATA ANALYTICS IN KNOWLEDGE MANAGEMENT FOR ENHANCE...
ijdpsjournal
 
The Comparison of Big Data Strategies in Corporate Environment
The Comparison of Big Data Strategies in Corporate EnvironmentThe Comparison of Big Data Strategies in Corporate Environment
The Comparison of Big Data Strategies in Corporate Environment
IRJET Journal
 
Big data trendsdirections nimführ.ppt
Big data trendsdirections nimführ.pptBig data trendsdirections nimführ.ppt
Big data trendsdirections nimführ.ppt
Aravindharamanan S
 
Global Data Management: Governance, Security and Usefulness in a Hybrid World
Global Data Management: Governance, Security and Usefulness in a Hybrid WorldGlobal Data Management: Governance, Security and Usefulness in a Hybrid World
Global Data Management: Governance, Security and Usefulness in a Hybrid World
Neil Raden
 
Big data
Big dataBig data
How Insurers Can Tame Data to Drive Innovation
How Insurers Can Tame Data to Drive InnovationHow Insurers Can Tame Data to Drive Innovation
How Insurers Can Tame Data to Drive Innovation
Cognizant
 
Applications of Big Data Analytics in Businesses
Applications of Big Data Analytics in BusinessesApplications of Big Data Analytics in Businesses
Applications of Big Data Analytics in Businesses
T.S. Lim
 
Ibm 1129-the big data zoo
Ibm 1129-the big data zooIbm 1129-the big data zoo
Ibm 1129-the big data zooAccenture
 
Practical analytics john enoch white paper
Practical analytics john enoch white paperPractical analytics john enoch white paper
Practical analytics john enoch white paperJohn Enoch
 
An Analysis of Big Data Computing for Efficiency of Business Operations Among...
An Analysis of Big Data Computing for Efficiency of Business Operations Among...An Analysis of Big Data Computing for Efficiency of Business Operations Among...
An Analysis of Big Data Computing for Efficiency of Business Operations Among...
AnthonyOtuonye
 
Cutting Edge Predictive Analytics with Eric Siegel
Cutting Edge Predictive Analytics with Eric Siegel   Cutting Edge Predictive Analytics with Eric Siegel
Cutting Edge Predictive Analytics with Eric Siegel
Databricks
 
BRIDGING DATA SILOS USING BIG DATA INTEGRATION
BRIDGING DATA SILOS USING BIG DATA INTEGRATIONBRIDGING DATA SILOS USING BIG DATA INTEGRATION
BRIDGING DATA SILOS USING BIG DATA INTEGRATION
ijmnct
 

What's hot (18)

The Big Data Talent Gap
The Big Data Talent GapThe Big Data Talent Gap
The Big Data Talent Gap
 
Taming the data beast
Taming the data beastTaming the data beast
Taming the data beast
 
Addressing Storage Challenges to Support Business Analytics and Big Data Work...
Addressing Storage Challenges to Support Business Analytics and Big Data Work...Addressing Storage Challenges to Support Business Analytics and Big Data Work...
Addressing Storage Challenges to Support Business Analytics and Big Data Work...
 
Idc big data whitepaper_final
Idc big data whitepaper_finalIdc big data whitepaper_final
Idc big data whitepaper_final
 
Big Data Trends and Challenges Report - Whitepaper
Big Data Trends and Challenges Report - WhitepaperBig Data Trends and Challenges Report - Whitepaper
Big Data Trends and Challenges Report - Whitepaper
 
Data Management
Data ManagementData Management
Data Management
 
LEVERAGING CLOUD BASED BIG DATA ANALYTICS IN KNOWLEDGE MANAGEMENT FOR ENHANCE...
LEVERAGING CLOUD BASED BIG DATA ANALYTICS IN KNOWLEDGE MANAGEMENT FOR ENHANCE...LEVERAGING CLOUD BASED BIG DATA ANALYTICS IN KNOWLEDGE MANAGEMENT FOR ENHANCE...
LEVERAGING CLOUD BASED BIG DATA ANALYTICS IN KNOWLEDGE MANAGEMENT FOR ENHANCE...
 
The Comparison of Big Data Strategies in Corporate Environment
The Comparison of Big Data Strategies in Corporate EnvironmentThe Comparison of Big Data Strategies in Corporate Environment
The Comparison of Big Data Strategies in Corporate Environment
 
Big data trendsdirections nimführ.ppt
Big data trendsdirections nimführ.pptBig data trendsdirections nimführ.ppt
Big data trendsdirections nimführ.ppt
 
Global Data Management: Governance, Security and Usefulness in a Hybrid World
Global Data Management: Governance, Security and Usefulness in a Hybrid WorldGlobal Data Management: Governance, Security and Usefulness in a Hybrid World
Global Data Management: Governance, Security and Usefulness in a Hybrid World
 
Big data
Big dataBig data
Big data
 
How Insurers Can Tame Data to Drive Innovation
How Insurers Can Tame Data to Drive InnovationHow Insurers Can Tame Data to Drive Innovation
How Insurers Can Tame Data to Drive Innovation
 
Applications of Big Data Analytics in Businesses
Applications of Big Data Analytics in BusinessesApplications of Big Data Analytics in Businesses
Applications of Big Data Analytics in Businesses
 
Ibm 1129-the big data zoo
Ibm 1129-the big data zooIbm 1129-the big data zoo
Ibm 1129-the big data zoo
 
Practical analytics john enoch white paper
Practical analytics john enoch white paperPractical analytics john enoch white paper
Practical analytics john enoch white paper
 
An Analysis of Big Data Computing for Efficiency of Business Operations Among...
An Analysis of Big Data Computing for Efficiency of Business Operations Among...An Analysis of Big Data Computing for Efficiency of Business Operations Among...
An Analysis of Big Data Computing for Efficiency of Business Operations Among...
 
Cutting Edge Predictive Analytics with Eric Siegel
Cutting Edge Predictive Analytics with Eric Siegel   Cutting Edge Predictive Analytics with Eric Siegel
Cutting Edge Predictive Analytics with Eric Siegel
 
BRIDGING DATA SILOS USING BIG DATA INTEGRATION
BRIDGING DATA SILOS USING BIG DATA INTEGRATIONBRIDGING DATA SILOS USING BIG DATA INTEGRATION
BRIDGING DATA SILOS USING BIG DATA INTEGRATION
 

Viewers also liked

Acta v
Acta vActa v
Acta v
satelite1
 
Церебро Таргет - как делать праздничный таргетинг. Часть 2.
Церебро Таргет - как делать праздничный таргетинг. Часть 2. Церебро Таргет - как делать праздничный таргетинг. Часть 2.
Церебро Таргет - как делать праздничный таргетинг. Часть 2.
Церебро Таргет
 
proyecto etnomatematica
proyecto etnomatematica proyecto etnomatematica
proyecto etnomatematica
Giovanny Gamboa
 
II-SDV Andrew Hinton - Text mining - as normal as data mining?
II-SDV Andrew Hinton - Text mining - as normal as data mining?II-SDV Andrew Hinton - Text mining - as normal as data mining?
II-SDV Andrew Hinton - Text mining - as normal as data mining?
Dr. Haxel Consult
 
Обучающий вебинар для рекламодателей ВКонтакте
Обучающий вебинар для рекламодателей ВКонтактеОбучающий вебинар для рекламодателей ВКонтакте
Обучающий вебинар для рекламодателей ВКонтакте
Церебро Таргет
 
Annual result day 2016
Annual result day 2016Annual result day 2016
Annual result day 2016
oxford2015
 
Company yufansun
Company yufansunCompany yufansun
Company yufansun
Yufan Sun
 

Viewers also liked (9)

We Are One
We Are OneWe Are One
We Are One
 
Pagina 14 en 15
Pagina 14 en 15Pagina 14 en 15
Pagina 14 en 15
 
Acta v
Acta vActa v
Acta v
 
Церебро Таргет - как делать праздничный таргетинг. Часть 2.
Церебро Таргет - как делать праздничный таргетинг. Часть 2. Церебро Таргет - как делать праздничный таргетинг. Часть 2.
Церебро Таргет - как делать праздничный таргетинг. Часть 2.
 
proyecto etnomatematica
proyecto etnomatematica proyecto etnomatematica
proyecto etnomatematica
 
II-SDV Andrew Hinton - Text mining - as normal as data mining?
II-SDV Andrew Hinton - Text mining - as normal as data mining?II-SDV Andrew Hinton - Text mining - as normal as data mining?
II-SDV Andrew Hinton - Text mining - as normal as data mining?
 
Обучающий вебинар для рекламодателей ВКонтакте
Обучающий вебинар для рекламодателей ВКонтактеОбучающий вебинар для рекламодателей ВКонтакте
Обучающий вебинар для рекламодателей ВКонтакте
 
Annual result day 2016
Annual result day 2016Annual result day 2016
Annual result day 2016
 
Company yufansun
Company yufansunCompany yufansun
Company yufansun
 

Similar to Hadoop Overview

Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
Gregg Barrett
 
Big Data
Big DataBig Data
Big Data
Kirubaburi R
 
Big data and apache hadoop adoption
Big data and apache hadoop adoptionBig data and apache hadoop adoption
Big data and apache hadoop adoption
faizrashid1995
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical Workloads
Cognizant
 
Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoop
Anusha sweety
 
Big Data/Hadoop Option Analysis
Big Data/Hadoop Option AnalysisBig Data/Hadoop Option Analysis
Big Data/Hadoop Option Analysis
zafarali1981
 
Rajesh Angadi Brochure
Rajesh Angadi Brochure Rajesh Angadi Brochure
Rajesh Angadi Brochure Rajesh Angadi
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
Editor IJCATR
 
Future of big data nick kabra speaker compendium march 2013
Future of big data nick kabra speaker compendium march 2013Future of big data nick kabra speaker compendium march 2013
Future of big data nick kabra speaker compendium march 2013
nkabra
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overview
Nitesh Ghosh
 
Learn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant ResourceLearn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant Resource
Assignment Help
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paper
Supratim Ray
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop
EMC
 
G017143640
G017143640G017143640
G017143640
IOSR Journals
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – Hadoop
IOSR Journals
 
How Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessHow Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help business
Ajay Ohri
 
Hd insight overview
Hd insight overviewHd insight overview
Hd insight overview
vhrocca
 

Similar to Hadoop Overview (20)

Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
 
Big Data
Big DataBig Data
Big Data
 
Hadoop
HadoopHadoop
Hadoop
 
Big data and apache hadoop adoption
Big data and apache hadoop adoptionBig data and apache hadoop adoption
Big data and apache hadoop adoption
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical Workloads
 
Big Data & Hadoop
Big Data & HadoopBig Data & Hadoop
Big Data & Hadoop
 
Combining hadoop with big data analytics
Combining hadoop with big data analyticsCombining hadoop with big data analytics
Combining hadoop with big data analytics
 
Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoop
 
Big Data/Hadoop Option Analysis
Big Data/Hadoop Option AnalysisBig Data/Hadoop Option Analysis
Big Data/Hadoop Option Analysis
 
Rajesh Angadi Brochure
Rajesh Angadi Brochure Rajesh Angadi Brochure
Rajesh Angadi Brochure
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
 
Future of big data nick kabra speaker compendium march 2013
Future of big data nick kabra speaker compendium march 2013Future of big data nick kabra speaker compendium march 2013
Future of big data nick kabra speaker compendium march 2013
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overview
 
Learn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant ResourceLearn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant Resource
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paper
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop
 
G017143640
G017143640G017143640
G017143640
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – Hadoop
 
How Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessHow Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help business
 
Hd insight overview
Hd insight overviewHd insight overview
Hd insight overview
 

More from Gregg Barrett

Cirrus: Africa's AI initiative, Proposal 2018
Cirrus: Africa's AI initiative, Proposal 2018Cirrus: Africa's AI initiative, Proposal 2018
Cirrus: Africa's AI initiative, Proposal 2018
Gregg Barrett
 
Cirrus: Africa's AI initiative
Cirrus: Africa's AI initiativeCirrus: Africa's AI initiative
Cirrus: Africa's AI initiative
Gregg Barrett
 
Applied machine learning: Insurance
Applied machine learning: InsuranceApplied machine learning: Insurance
Applied machine learning: Insurance
Gregg Barrett
 
Road and Track Vehicle - Project Document
Road and Track Vehicle - Project DocumentRoad and Track Vehicle - Project Document
Road and Track Vehicle - Project Document
Gregg Barrett
 
Modelling the expected loss of bodily injury claims using gradient boosting
Modelling the expected loss of bodily injury claims using gradient boostingModelling the expected loss of bodily injury claims using gradient boosting
Modelling the expected loss of bodily injury claims using gradient boosting
Gregg Barrett
 
Data Science Introduction - Data Science: What Art Thou?
Data Science Introduction - Data Science: What Art Thou?Data Science Introduction - Data Science: What Art Thou?
Data Science Introduction - Data Science: What Art Thou?
Gregg Barrett
 
Revenue Generation Ideas for Tesla Motors
Revenue Generation Ideas for Tesla MotorsRevenue Generation Ideas for Tesla Motors
Revenue Generation Ideas for Tesla Motors
Gregg Barrett
 
Data science unit introduction
Data science unit introductionData science unit introduction
Data science unit introduction
Gregg Barrett
 
Social networking brings power
Social networking brings powerSocial networking brings power
Social networking brings power
Gregg Barrett
 
Procurement can be exciting
Procurement can be excitingProcurement can be exciting
Procurement can be exciting
Gregg Barrett
 
Machine Learning Approaches to Brewing Beer
Machine Learning Approaches to Brewing BeerMachine Learning Approaches to Brewing Beer
Machine Learning Approaches to Brewing Beer
Gregg Barrett
 
A note to Data Science and Machine Learning managers
A note to Data Science and Machine Learning managersA note to Data Science and Machine Learning managers
A note to Data Science and Machine Learning managers
Gregg Barrett
 
Quick Introduction: To run a SQL query on the Chicago Employee Data, using Cl...
Quick Introduction: To run a SQL query on the Chicago Employee Data, using Cl...Quick Introduction: To run a SQL query on the Chicago Employee Data, using Cl...
Quick Introduction: To run a SQL query on the Chicago Employee Data, using Cl...
Gregg Barrett
 
Efficient equity portfolios using mean variance optimisation in R
Efficient equity portfolios using mean variance optimisation in REfficient equity portfolios using mean variance optimisation in R
Efficient equity portfolios using mean variance optimisation in R
Gregg Barrett
 
Variable selection for classification and regression using R
Variable selection for classification and regression using RVariable selection for classification and regression using R
Variable selection for classification and regression using R
Gregg Barrett
 
Diabetes data - model assessment using R
Diabetes data - model assessment using RDiabetes data - model assessment using R
Diabetes data - model assessment using R
Gregg Barrett
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R Services
Gregg Barrett
 
Insurance metrics overview
Insurance metrics overviewInsurance metrics overview
Insurance metrics overview
Gregg Barrett
 
Review of mit sloan management review case study on analytics at Intermountain
Review of mit sloan management review case study on analytics at IntermountainReview of mit sloan management review case study on analytics at Intermountain
Review of mit sloan management review case study on analytics at Intermountain
Gregg Barrett
 
Example: movielens data with mahout
Example: movielens data with mahoutExample: movielens data with mahout
Example: movielens data with mahout
Gregg Barrett
 

More from Gregg Barrett (20)

Cirrus: Africa's AI initiative, Proposal 2018
Cirrus: Africa's AI initiative, Proposal 2018Cirrus: Africa's AI initiative, Proposal 2018
Cirrus: Africa's AI initiative, Proposal 2018
 
Cirrus: Africa's AI initiative
Cirrus: Africa's AI initiativeCirrus: Africa's AI initiative
Cirrus: Africa's AI initiative
 
Applied machine learning: Insurance
Applied machine learning: InsuranceApplied machine learning: Insurance
Applied machine learning: Insurance
 
Road and Track Vehicle - Project Document
Road and Track Vehicle - Project DocumentRoad and Track Vehicle - Project Document
Road and Track Vehicle - Project Document
 
Modelling the expected loss of bodily injury claims using gradient boosting
Modelling the expected loss of bodily injury claims using gradient boostingModelling the expected loss of bodily injury claims using gradient boosting
Modelling the expected loss of bodily injury claims using gradient boosting
 
Data Science Introduction - Data Science: What Art Thou?
Data Science Introduction - Data Science: What Art Thou?Data Science Introduction - Data Science: What Art Thou?
Data Science Introduction - Data Science: What Art Thou?
 
Revenue Generation Ideas for Tesla Motors
Revenue Generation Ideas for Tesla MotorsRevenue Generation Ideas for Tesla Motors
Revenue Generation Ideas for Tesla Motors
 
Data science unit introduction
Data science unit introductionData science unit introduction
Data science unit introduction
 
Social networking brings power
Social networking brings powerSocial networking brings power
Social networking brings power
 
Procurement can be exciting
Procurement can be excitingProcurement can be exciting
Procurement can be exciting
 
Machine Learning Approaches to Brewing Beer
Machine Learning Approaches to Brewing BeerMachine Learning Approaches to Brewing Beer
Machine Learning Approaches to Brewing Beer
 
A note to Data Science and Machine Learning managers
A note to Data Science and Machine Learning managersA note to Data Science and Machine Learning managers
A note to Data Science and Machine Learning managers
 
Quick Introduction: To run a SQL query on the Chicago Employee Data, using Cl...
Quick Introduction: To run a SQL query on the Chicago Employee Data, using Cl...Quick Introduction: To run a SQL query on the Chicago Employee Data, using Cl...
Quick Introduction: To run a SQL query on the Chicago Employee Data, using Cl...
 
Efficient equity portfolios using mean variance optimisation in R
Efficient equity portfolios using mean variance optimisation in REfficient equity portfolios using mean variance optimisation in R
Efficient equity portfolios using mean variance optimisation in R
 
Variable selection for classification and regression using R
Variable selection for classification and regression using RVariable selection for classification and regression using R
Variable selection for classification and regression using R
 
Diabetes data - model assessment using R
Diabetes data - model assessment using RDiabetes data - model assessment using R
Diabetes data - model assessment using R
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R Services
 
Insurance metrics overview
Insurance metrics overviewInsurance metrics overview
Insurance metrics overview
 
Review of mit sloan management review case study on analytics at Intermountain
Review of mit sloan management review case study on analytics at IntermountainReview of mit sloan management review case study on analytics at Intermountain
Review of mit sloan management review case study on analytics at Intermountain
 
Example: movielens data with mahout
Example: movielens data with mahoutExample: movielens data with mahout
Example: movielens data with mahout
 

Recently uploaded

Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
AlejandraGmez176757
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
alex933524
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
correoyaya
 

Recently uploaded (20)

Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 

Hadoop Overview

  • 2. 2 Acknowledgment This document draws extensively, and focuses on, the work and viewpoints from industry participants including: Diversity Limited Forrester IBM Intel ITG MIT Sloan Management Review TDWI Research Planet Cassandra Christopher Bienko @ IBM Dirk deRoos @ IBM Marc Andrews @ IBM Paul Zikopoulos @ IBM Rick Buglio @ IBM
  • 3. 3 Introduction With many organisations considering getting on the Hadoop bandwagon, this document provides an overview of the planned use cases for Hadoop, an illustration of some of the common technology components, suggestions on when Hadoop is worth considering, some the challenges organisations are experiencing, cost considerations and finally, how an organisation should position for a Big Data initiative. Any organisation considering a Big Data initiative with Hadoop should thoroughly consider each of these areas before embarking on a course of action. Planned use cases for Big Data technologies According to a 2015 Forrester Research report, financial and IT analytics are currently amongst the top planned use cases for Big Data technologies. Figure 1: Planned use cases for Big Data technologies Figure 1. Financial And IT Analytics Are Top Planned Use Cases For Big Data Technologies. Copyright 2015 by Forrester. Reprinted with permission. While there is much debate on which planned use cases are best suited to kick-start a Big Data initiative, most recommendations start with the proverbial “low-hanging fruit”. While such an approach may deliver short-term gains and increase organisational support for a broader program, the downside of this strategy is that the analytical gains made are likely to be easily replicated by competitors. As articulated by MIT Sloan Management Review (2015) companies that have sustained an advantage from analytics have often taken a more strategic approach of focusing on large critical problems. Big Data technology: Enter Hadoop In their book, Big Data Beyond the Hype, Zikopoulos, deRoos, Bienko, Buglio and Andrews (2014) classify Hadoop as an ecosystem of software packages that provides a computing framework.
  • 4. 4 This includes: - MapReduce, which leverages a K/V (key/value) processing framework (don’t confuse that with a K/V database); - a file system (HDFS); - and many other software packages that support everything from importing and exporting data (like Sqoop) to storing transactional data (like HBase), orchestration (Avro, ZooKeeper, YARN), and more. When you hear that someone is running a Hadoop cluster, it’s likely to mean MapReduce (or some other framework like Spark) running on HDFS. On the other hand, NoSQL refers to non-RDBMS SQL database solutions such as HBase, Cassandra, MongoDB, Riak, and CouchDB, among others. (Zikopoulos, deRoos, Bienko, Buglio, Andrews, 2014, pg. 38) Figure 2 provides an example illustration of some of the components of the Hadoop ecosystem. The Forrester (2014) illustration shows mix of Apache, other open source, and vendor products. Figure 2: Example of the Hadoop Ecosystem Figure 2. The Hadoop Ecosystem Is A Mix Of Apache, Other Open Source, And Vendor Products. Copyright 2014 by Forrester. Reprinted with permission.
  • 5. 5 Hadoop Hadoop is an open source software stack that runs on a cluster of machines. Hadoop provides distributed storage and distributed processing for very large data sets. MapReduce MapReduce is a system for parallel processing of large data sets. According to IBM (2015) as an analogy, you can think of map and reduce tasks as the way a census was conducted in Roman times, where the census bureau would dispatch its people to each city in the empire. Each census taker in each city would be tasked to count the number of people in that city and then return their results to the capital city. At the capital, the results from each city would be reduced to a single count (sum of all cities) to determine the overall population of the empire. This mapping of people to cities, in parallel, and then combining the results (reducing) is much more efficient than sending a single person to count every person in the empire in a serial fashion. (IBM, 2015) NoSQL NoSQL is a database environment. Using the definition from Planet Cassandra (2015), a NoSQL database environment is, simply put, a non-relational and largely distributed database system that enables rapid, ad- hoc organization and analysis of extremely high-volume, disparate data types. NoSQL databases were developed in response to the sheer volume of data being generated, stored and analyzed by modern users (user-generated data) and their applications (machine-generated data). (Planet Cassandra, 2015) Why Hadoop According to IBM (2015), Hadoop changes the economics and dynamics of large-scale computing by enabling a solution that is: - Scalable: Add new nodes as needed without changing data formats, how data is loaded, how jobs are written or the applications on top. - Cost-effective: Hadoop brings massively parallel computing to commodity servers. The result is a significant decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data. - Flexible: Hadoop is schema-less, and can absorb any type of data, structured or not, from a number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide by itself. - Fault-tolerant: When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat. (IBM, 2015, pg. 2) Hadoop has its challenges though While Hadoop may change the economics and dynamics of large-scale computing, Hadoop is not without its own set of challenges. According to IBM (2014), there are four key areas of Hadoop that need to mature in order to drive wider adoption, these include: 1) Performance 2) The reduction of skills 3) Data governance 4) Deep integration with existing technologies (IBM, 2014) Along similar lines TDWI Research (2015) in a recent survey found respondents struggling with the following barriers to Hadoop implementation:
  • 6. 6 - Skills gap - Weak business support - Security concerns - Data management hurdles - Tool deficiencies - Containing costs (TDWI Research, 2015) With data consisting of more sensitive personal details than ever before, as well as the increasing volume of data, governance, risk and compliance concerns are particularly important. In addition, with organisations looking to improve decision making by moving towards the internal democratization of data, such efforts necessitate building trust into the Big Data platform. Building trust into a Big Data platform requires a governance framework covering lineage, ownership etc. As articulated by Zikopoulos, deRoos, Bienko, Buglio and Andrews (2014), Big Data technologies like Hadoop (and much of the NoSQL world) require more work to have all of the enterprise hardening from a security perspective. According to a study by the International Technology Group (2013), organisations need to be particularly mindful in the highly skilled programming requirements demanded of most Hadoop environments, noting that: Although the field of players has since expanded to include hundreds of venture capital-funded start-ups, along with established systems and services vendors and large end users, social media businesses continue to control Hadoop. Most of the more than one billion lines of code – more than 90 percent, according to some estimates – in the Apache Hadoop stack has to date been contributed by these. The priorities of this group have inevitably influenced Hadoop evolution. There tends to be an assumption that Hadoop developers are highly skilled, capable of working with “raw” open source code and configuring software components on a case-by-case basis as needs change. Manual coding is the norm. Decades of experience have shown that, regardless of which technologies are employed, manual coding offers lower developer productivity and greater potential for errors than more sophisticated techniques. (ITG, 2013, pg. 2) Hadoop is good for… According to Forrester (2015) organizations can justify Hadoop investments by reducing budgets allocated to proprietary systems. In the report Forrester indicates that some of its clients pay for their Hadoop infrastructure by scaling down expensive proprietary database hardware and software. Using the savings they then take advantage of the less expensive Hadoop infrastructure based on open source software and commodity hardware using the following approaches: 1) Migrate most of the batch processes like data integration, cleansing, and aggregation to Hadoop, freeing up database resources to do what they do best: query processing. Next, further reduce dependency on expensive proprietary systems by moving some of the non-mission-critical data sets and queries (such as social media analytics or data exploration) to Hadoop. 2) Turn Hadoop Data Hub into a sandbox for business analysis and data scientists. Not all enterprise data needs to be tightly controlled and integrated. Simply by copying all enterprise data into a Hadoop-based Data Hub or Data Lake, organizations can achieve immediate benefits.
  • 7. 7 Use case: Not every business use case calls for structured analysis based on SQL; there are plenty of use cases (e.g., data exploration, experimentation, creating what-if scenarios, looking for previously uncovered patterns) where simply searching through data can uncover a wealth of hidden information. Most enterprise applications are based on relational databases, which are not designed for search, plus one would have to go to each application separately to conduct such a search. Data staged in a Hadoop Data Hub or Data Lake becomes instantly searchable and available for exploration and discovery. 3) Extend Agile Business Intelligence to Big Data with Hadoop on-demand data marts. Use case: While columnar SQL databases provide extra agility and flexibility, they still require a certain level of investment in proprietary systems and professional resources. As a result, it may not make financial sense to build a data mart based on a platform that may only be used for a one-off application or a process that only runs for a few weeks (for example, analysing post-merger customer integration implications). Apache Hive, Apache Drill, and other Hadoop relational structures are perfect venues for such one-off use cases. (Forrester, 2015, pg. 3 – 4) Costs associated with typical Big Data implementations Although a Big Data environment such as that illustrated in Figure 2 can be constructed from open source software, such as Hadoop and a NoSQL database such as MongoDB, there are still substantial costs involved. These include: 1) Hardware costs 2) IT and operational costs in setting up a machine cluster and supporting it 3) Cost of personnel to work on the ecosystem These costs are NOT trivial for the following reasons: - Dealing with cutting edge technology and finding people who know the technology is challenging - The technology introduces a different programming paradigm, frequently requiring additional training of existing engineering teams - These technologies are new and still evolving and are not yet mature in the enterprise ecosystem - The hardware is server grade and large clusters require resources including network administration, security administration, system administration etc., as well as data centre operational costs including electricity, cooling etc. Although implementing a Big Data platform is not without its’ challenges, the availability of IaaS platforms for Big Data, reduce some of the initial risks that would traditionally be associated with such projects Infrastructure as a Service (IaaS) One consideration that can mitigate the cost implications of hardware and support personnel is the use of a cloud offering. As pointed out by Intel (2015) clouds are already deployed on pools of server, storage, and networking resources and can scale up or down as needed. Cloud computing offers a cost-effective way to support Big Data technologies and the advanced analytics applications that can drive business value. Diversity Limited (2010) defines Infrastructure as a Service (IaaS) as “a way of delivering Cloud Computing infrastructure – servers, storage, network and operating systems – as an on-demand service. Rather than
  • 8. 8 purchasing servers, software, datacenter space or network equipment, organisations instead buy those resources as a fully outsourced service on demand.” The Big Data approach According to Forrester (2015) organisations should organise for Big Data using the following approach: - Use small, agile insights teams versus centralized delivery services. GE successfully deployed such a strategy in the development of its Data Lake with what GE’s Vince Campisi calls “a two-pizza team,” meaning, a team no bigger than the number of people you could feed off of two pizzas. As Forrester points out a decentralized approach requires a careful balance with appropriate data platform management, centers of excellence (CoEs), and communities of practice. - Centralize the data pipeline and data platform teams. People often think that because the data is there, it is ready to be used – but that is seldom the case. Organisations failing to undertake the critical effort of making the data accessible are left with an ineffective process of trying to figure out where it exists, breaking down silos, and organizing data in order for someone to actually go after the outcome they seek. The harsh reality is that Big Data is messy data and there is no quick and easy way around it. To paraphrase the Ancient Mariner, organisations can easily find themselves in the situation of; Data, data, every where, Nor any drop to drink. - Put data science everywhere but technology management. The rational on this is simple: data and analytics is a business requirement, driven by business, used by business to solve business challenges and to drive business opportunities. Certainly IT has a seat at the table, and is critical to enablement but ultimately IT should not own the strategy. - Embrace communities of practice over CoEs for insights execution. Return on investment in analytics needs to be seen as a process rather than as a control metric. Organisations need to focus on “value in use” – which measures value in terms of how a given asset provides benefits to a specific owner under a specific use. Consequently, only when an analytical model of application is actually used can its real benefits (and costs) be identified. Experienced analytical leaders therefore focus on execution and the value derived throughout the project’s life cycle, not just at the end of the project. - Govern analytics, not just data. An organisation embarking on a Big Data initiative will face many first time governance challenges. Addressing these challenges and doing so consistently is a critical consideration for organisations looking to exploit opportunities in data and analytics – where the difference between those that succeed and those that fail could well rest on the strength or weakness of the organisations data governance foundation. Summary Big Data is having a substantive impact on organisations across industries. Organisations are combining Big Data and analytics to overcome many of the challenges presented by more traditional technologies, and to support new capabilities. While Big Data has been enabled by technologies like Hadoop, the challenges are acute on two fronts. Firstly organisations face challenges finding people skilled in this environment. Secondly data governance challenges are increasing in number and evolving in complexity. While these challenges are not trivial, those organisations that successfully navigate them will be rewarded with opportunities yet to be discovered.
  • 9. 9 Reference Diversity Limited. (2010). Moving your infrastructure to the cloud. [pdf]. Retrieved from http://diversity.net.nz/wp-content/uploads/2011/01/Moving-to-the-Clouds.pdf Forrester, (2014). The hadoop ecosystem is a mix of apache, other open source, and vendor products. [Diagram]. Retrieved from Forrester, (2014). Hadoop ecosystem overview, Q4 2014. [pdf]. Retrieved from https://www.forrester.com/home/ Forrester. (2015). Brief: reasons to (or not to) modernize bi platforms with hadoop. [pdf]. Retrieved from https://www.forrester.com/home/ Forrester. (2015). Q&A: forrester’s top five questions about big data. [pdf]. Retrieved from https://www.forrester.com/home/ Forrester. (2015). Financial and IT analytics are top planned use cases for big data technologies. [Diagram]. Retrieved from Forrester, (2015). Q&A: forrester’s top five questions about big data. [pdf]. Retrieved from https://www.forrester.com/home/ IBM. (2014). Analytics: The speed advantage. [pdf]. Retrieved from http://www-935.ibm.com/services/us/gbs/thoughtleadership/2014analytics/ IBM. (2015). Analytics: What is mapreduce. [web page]. Retrieved from http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/ IBM. (2015). Making the case for hadoop and big data in the enterprise. [pdf]. Retrieved from http://www-01.ibm.com/common/ssi/cgi- bin/ssialias?infotype=PM&subtype=BK&htmlfid=IMM14161USEN#loaded Intel. (2015). Big data cloud technology. [pdf]. Retrieved from http://www.intel.co.za/content/dam/www/public/us/en/documents/product-briefs/big-data- cloud- technologies-brief.pdf ITG. (2013). Business case for enterprise big data deployments. [pdf]. Retrieved from http://www-01.ibm.com/common/ssi/cgi- bin/ssialias?htmlfid=IME14028USEN&appname=skmwww MIT Sloan Management Review. (2015) Sustaining an analytics advantage. [pdf]. Retrieved from http://mitsmr.com/1CqUqMB Planet Cassandra. (2015). Nosql databases defined and explained. [web page]. Retrieved from http://www.planetcassandra.org/what-is-nosql/ TDWI Research. (2015). Hadoop for the enterprise. [pdf]. Retrieved from http://www-01.ibm.com/software/data/infosphere/hadoop/resources.html Zikopoulos, P., deRoos, D., Bienko, C., Buglio, R., Andrews, M. (2014). Big data beyond the hype. [pdf]. Retrieved from https://www.ibm.com/developerworks/community/blogs/SusanVisser/entry/big_data_beyond_the_hype_a_gui de_to_conversations_for_today_s_data_center?lang=en