1. The document provides an overview of Hadoop and big data technologies, use cases, common components, challenges, and considerations for implementing a big data initiative.
2. Financial and IT analytics are currently the top planned use cases for big data technologies according to Forrester Research. Hadoop is an open source software framework for distributed storage and processing of large datasets across clusters of computers.
3. Organizations face challenges in implementing big data initiatives including skills gaps, data management issues, and high costs of hardware, personnel, and supporting new technologies. Careful planning is required to realize value from big data.
Acknowledgment
This document draws extensively on, and focuses on, the work and viewpoints of industry participants, including:
Diversity Limited
Forrester
IBM
Intel
ITG
MIT Sloan Management Review
TDWI Research
Planet Cassandra
Christopher Bienko @ IBM
Dirk deRoos @ IBM
Marc Andrews @ IBM
Paul Zikopoulos @ IBM
Rick Buglio @ IBM
Introduction
With many organisations considering getting on the Hadoop bandwagon, this document provides an overview
of the planned use cases for Hadoop, an illustration of some of the common technology components,
suggestions on when Hadoop is worth considering, some of the challenges organisations are experiencing, cost
considerations and, finally, how an organisation should position itself for a Big Data initiative. Any organisation
considering a Big Data initiative with Hadoop should thoroughly consider each of these areas before embarking
on a course of action.
Planned use cases for Big Data technologies
According to a 2015 Forrester Research report, financial and IT analytics are currently amongst the top planned
use cases for Big Data technologies.
Figure 1: Planned use cases for Big Data technologies
Figure 1. Financial And IT Analytics Are Top Planned Use Cases For Big Data Technologies. Copyright 2015 by Forrester. Reprinted with
permission.
While there is much debate on which planned use cases are best suited to kick-start a Big Data initiative, most
recommendations start with the proverbial “low-hanging fruit”. While such an approach may deliver short-term
gains and increase organisational support for a broader program, the downside of this strategy is that the
analytical gains made are likely to be easily replicated by competitors.
As articulated by MIT Sloan Management Review (2015), companies that have sustained an advantage from
analytics have often taken a more strategic approach of focusing on large, critical problems.
Big Data technology: Enter Hadoop
In their book, Big Data Beyond the Hype, Zikopoulos, deRoos, Bienko, Buglio and Andrews (2014) classify Hadoop
as an ecosystem of software packages that provides a computing framework.
This includes:
- MapReduce, which leverages a K/V (key/value) processing framework (don’t confuse that with a K/V
database);
- a file system (HDFS);
- and many other software packages that support everything from importing and exporting data (like
Sqoop) to storing transactional data (like HBase), orchestration (Avro, ZooKeeper, YARN), and more.
When you hear that someone is running a Hadoop cluster, it’s likely to mean MapReduce (or some other
framework like Spark) running on HDFS.
On the other hand, NoSQL refers to non-RDBMS database solutions such as HBase, Cassandra, MongoDB,
Riak, and CouchDB, among others.
(Zikopoulos, deRoos, Bienko, Buglio, Andrews, 2014, pg. 38)
Figure 2 provides an example illustration of some of the components of the Hadoop ecosystem. The Forrester
(2014) illustration shows a mix of Apache, other open source, and vendor products.
Figure 2: Example of the Hadoop Ecosystem
Figure 2. The Hadoop Ecosystem Is A Mix Of Apache, Other Open Source, And Vendor Products. Copyright 2014 by Forrester. Reprinted with
permission.
Hadoop
Hadoop is an open source software stack that runs on a cluster of machines. Hadoop provides distributed
storage and distributed processing for very large data sets.
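To make the distributed-storage half of that statement concrete, the short sketch below stages a local file in HDFS using the standard hdfs dfs command-line client driven from Python. This is an illustrative, assumption-laden example, not from the cited sources: it presumes a configured Hadoop client on the PATH, and the file name and /data/raw directory are invented for illustration.

    # Minimal sketch: staging data in HDFS via the standard `hdfs dfs` client.
    # Assumes a configured Hadoop client on the PATH; file and path names are illustrative.
    import subprocess

    def hdfs(*args):
        # Thin wrapper around the `hdfs dfs` command-line client.
        subprocess.run(["hdfs", "dfs", *args], check=True)

    hdfs("-mkdir", "-p", "/data/raw")          # create a directory in the cluster file system
    hdfs("-put", "weblogs.csv", "/data/raw/")  # copy a local file into HDFS; blocks are replicated across nodes
    hdfs("-ls", "/data/raw")                   # list what the cluster now stores

Once data sits in HDFS it is available to any processing framework running on the cluster, which is the property the following sections build on.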
MapReduce
MapReduce is a system for parallel processing of large data sets.
According to IBM (2015), as an analogy, you can think of map and reduce tasks as the way a census was
conducted in Roman times, where the census bureau would dispatch its people to each city in the empire.
Each census taker in each city would be tasked to count the number of people in that city and then return
their results to the capital city. At the capital, the results from each city would be reduced to a single count
(sum of all cities) to determine the overall population of the empire. This mapping of people to cities, in
parallel, and then combining the results (reducing) is much more efficient than sending a single person to
count every person in the empire in a serial fashion. (IBM, 2015)
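To ground the analogy, the sketch below shows a minimal word-count mapper and reducer in the style of Hadoop Streaming, which allows non-Java executables to serve as the map and reduce tasks. The script is an illustrative assumption (its name and invocation are not from the cited sources): each mapper acts as a census taker over its own input split, and each reducer sums the counts for the keys routed to it.

    #!/usr/bin/env python3
    # Word-count mapper and reducer for Hadoop Streaming (illustrative sketch).
    import sys

    def mapper():
        # Map phase: emit <word, 1> for every word in this mapper's input split.
        for line in sys.stdin:
            for word in line.strip().split():
                print(f"{word}\t1")

    def reducer():
        # Reduce phase: the framework sorts map output by key, so counts for a
        # given word arrive contiguously and can be summed in one pass.
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t")
            if word != current:
                if current is not None:
                    print(f"{current}\t{total}")
                current, total = word, 0
            total += int(count)
        if current is not None:
            print(f"{current}\t{total}")

    if __name__ == "__main__":
        # Run as: word_count.py map   (mapper)   or   word_count.py reduce   (reducer)
        mapper() if sys.argv[1] == "map" else reducer()

Such a script would typically be submitted with the hadoop-streaming JAR, passing it as both the -mapper and -reducer executables along with HDFS input and output paths.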
NoSQL
NoSQL is a database environment. Using the definition from Planet Cassandra (2015), a NoSQL database
environment is, simply put, a non-relational and largely distributed database system that enables rapid,
ad hoc organization and analysis of extremely high-volume, disparate data types. NoSQL databases were
developed in response to the sheer volume of data being generated, stored and analyzed by modern users
(user-generated data) and their applications (machine-generated data). (Planet Cassandra, 2015)
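As a small, hedged illustration of what non-relational, ad hoc organisation looks like in practice, the sketch below stores differently shaped documents in MongoDB (one of the NoSQL stores named earlier) using the pymongo driver. The connection string, database, collection, and field names are assumptions made purely for illustration.

    # Minimal sketch of schema-less storage in a document store (MongoDB via pymongo).
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumed local instance
    events = client["analytics"]["events"]             # illustrative database and collection names

    # Documents in the same collection need not share a schema.
    events.insert_one({"user": "u1", "action": "login", "device": "mobile"})
    events.insert_one({"user": "u2", "action": "purchase", "amount": 19.99, "items": ["sku-42"]})

    # Ad hoc query with no upfront table design or schema migration.
    for doc in events.find({"action": "purchase"}):
        print(doc["user"], doc.get("amount"))

No schema is declared before the inserts; each document simply carries whatever fields it has, which is what makes this style of store a natural landing zone for disparate machine- and user-generated data.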
Why Hadoop
According to IBM (2015), Hadoop changes the economics and dynamics of large-scale computing by enabling a
solution that is:
- Scalable: Add new nodes as needed without changing data formats, how data is loaded, how jobs are
written or the applications on top.
- Cost-effective: Hadoop brings massively parallel computing to commodity servers. The result is a
significant decrease in the cost per terabyte of storage, which in turn makes it affordable to model all
your data.
- Flexible: Hadoop is schema-less, and can absorb any type of data, structured or not, from a number of
sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper
analyses than any one system can provide by itself.
- Fault-tolerant: When you lose a node, the system redirects work to another location of the data and
continues processing without missing a beat.
(IBM, 2015, pg. 2)
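The "flexible" point is the easiest to see with an example. The sketch below, using Spark on HDFS rather than raw MapReduce, joins a structured CSV export from an RDBMS with semi-structured JSON clickstream data and aggregates across them. The paths, column names, and session settings are illustrative assumptions, not taken from the cited sources.

    # Sketch: joining structured and semi-structured sources staged on HDFS with Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("flexible-join").getOrCreate()

    # Structured customer records exported from an RDBMS, staged as CSV on HDFS.
    customers = spark.read.option("header", True).csv("hdfs:///data/raw/customers.csv")

    # Semi-structured clickstream events landed as JSON; the schema is inferred on read.
    clicks = spark.read.json("hdfs:///data/raw/clickstream/")

    # Join and aggregate across sources that were never designed to live in one system.
    activity = (customers.join(clicks, on="customer_id", how="left")
                         .groupBy("segment")
                         .count())
    activity.show()

Neither source had to be reshaped into a predefined warehouse schema before it could be combined with the other, which is the sense in which Hadoop is described as schema-less.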
Hadoop has its challenges though
While Hadoop may change the economics and dynamics of large-scale computing, Hadoop is not without its
own set of challenges. According to IBM (2014), there are four key areas of Hadoop that need to mature in order
to drive wider adoption. These include:
1) Performance
2) The reduction of required skills
3) Data governance
4) Deep integration with existing technologies
(IBM, 2014)
Along similar lines, TDWI Research (2015), in a recent survey, found respondents struggling with the following
barriers to Hadoop implementation:
- Skills gap
- Weak business support
- Security concerns
- Data management hurdles
- Tool deficiencies
- Containing costs
(TDWI Research, 2015)
With data consisting of more sensitive personal details than ever before, as well as the increasing volume of
data, governance, risk and compliance concerns are particularly important. In addition, with organisations
looking to improve decision making by moving towards the internal democratization of data, such efforts
necessitate building trust into the Big Data platform. Building trust into a Big Data platform requires a
governance framework covering lineage, ownership etc. As articulated by Zikopoulos, deRoos, Bienko, Buglio
and Andrews (2014), Big Data technologies like Hadoop (and much of the NoSQL world) require more work to
have all of the enterprise hardening from a security perspective.
According to a study by the International Technology Group (2013), organisations need to be particularly mindful
of the highly skilled programming requirements demanded of most Hadoop environments, noting that:
Although the field of players has since expanded to include hundreds of venture capital-funded start-ups, along
with established systems and services vendors and large end users, social media businesses continue to control
Hadoop. Most of the more than one billion lines of code – more than 90 percent, according to some estimates –
in the Apache Hadoop stack has to date been contributed by these.
The priorities of this group have inevitably influenced Hadoop evolution. There tends to be an assumption that
Hadoop developers are highly skilled, capable of working with “raw” open source code and configuring software
components on a case-by-case basis as needs change. Manual coding is the norm.
Decades of experience have shown that, regardless of which technologies are employed, manual coding offers
lower developer productivity and greater potential for errors than more sophisticated techniques.
(ITG, 2013, pg. 2)
Hadoop is good for…
According to Forrester (2015), organisations can justify Hadoop investments by reducing budgets allocated to
proprietary systems. In the report, Forrester indicates that some of its clients pay for their Hadoop infrastructure
by scaling down expensive proprietary database hardware and software. Using the savings, they then take
advantage of less expensive Hadoop infrastructure based on open source software and commodity hardware,
via the following approaches:
1) Migrate most of the batch processes like data integration, cleansing, and aggregation to Hadoop,
freeing up database resources to do what they do best: query processing. Next, further reduce
dependency on expensive proprietary systems by moving some of the non-mission-critical data sets
and queries (such as social media analytics or data exploration) to Hadoop.
2) Turn Hadoop Data Hub into a sandbox for business analysis and data scientists. Not all enterprise data
needs to be tightly controlled and integrated. Simply by copying all enterprise data into a Hadoop-based
Data Hub or Data Lake, organizations can achieve immediate benefits.
Use case:
Not every business use case calls for structured analysis based on SQL; there are plenty of use cases
(e.g., data exploration, experimentation, creating what-if scenarios, looking for previously uncovered
patterns) where simply searching through data can uncover a wealth of hidden information. Most
enterprise applications are based on relational databases, which are not designed for search, plus
one would have to go to each application separately to conduct such a search. Data staged in a
Hadoop Data Hub or Data Lake becomes instantly searchable and available for exploration and
discovery.
3) Extend Agile Business Intelligence to Big Data with Hadoop on-demand data marts.
Use case:
While columnar SQL databases provide extra agility and flexibility, they still require a certain level of
investment in proprietary systems and professional resources. As a result, it may not make financial
sense to build a data mart based on a platform that may only be used for a one-off application or a
process that only runs for a few weeks (for example, analysing post-merger customer integration
implications). Apache Hive, Apache Drill, and other Hadoop relational structures are perfect venues
for such one-off use cases.
(Forrester, 2015, pg. 3 – 4)
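A hedged sketch of the data lake exploration and on-demand data mart ideas in points 2 and 3 is given below, using Spark SQL over files already staged in HDFS; Apache Hive or Drill could serve the same one-off role. The paths, view name, and columns are assumptions made purely for illustration (the post-merger customer scenario mirrors the example above).

    # Sketch: an on-demand "data mart" over files already staged in a Hadoop data lake.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("on-demand-mart").getOrCreate()

    # Expose raw post-merger customer extracts (already in the lake) as a queryable view.
    spark.read.parquet("hdfs:///lake/merger/customers/").createOrReplaceTempView("merged_customers")

    # One-off exploratory analysis; when the exercise ends there is nothing to
    # decommission beyond dropping the view.
    overlap = spark.sql("""
        SELECT region,
               COUNT(*) AS customers,
               SUM(CASE WHEN legacy_system = 'both' THEN 1 ELSE 0 END) AS duplicates
        FROM merged_customers
        GROUP BY region
        ORDER BY duplicates DESC
    """)
    overlap.show()

The point is not the specific query but that no proprietary data mart had to be provisioned for an analysis that may only run for a few weeks.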
Costs associated with typical Big Data implementations
Although a Big Data environment such as that illustrated in Figure 2 can be constructed from open source
software, such as Hadoop and a NoSQL database such as MongoDB, there are still substantial costs involved.
These include:
1) Hardware costs
2) IT and operational costs in setting up a machine cluster and supporting it
3) Cost of personnel to work on the ecosystem
These costs are NOT trivial for the following reasons:
- Dealing with cutting edge technology and finding people who know the technology is challenging
- The technology introduces a different programming paradigm, frequently requiring additional training
of existing engineering teams
- These technologies are new and still evolving and are not yet mature in the enterprise ecosystem
- The hardware is server grade and large clusters require resources including network administration,
security administration, system administration etc., as well as data centre operational costs including
electricity, cooling etc.
Although implementing a Big Data platform is not without its challenges, the availability of IaaS platforms for
Big Data reduces some of the initial risks that would traditionally be associated with such projects.
Infrastructure as a Service (IaaS)
One consideration that can mitigate the cost implications of hardware and support personnel is the use of a
cloud offering. As pointed out by Intel (2015), clouds are already deployed on pools of server, storage, and
networking resources and can scale up or down as needed. Cloud computing offers a cost-effective way to
support Big Data technologies and the advanced analytics applications that can drive business value.
Diversity Limited (2010) defines Infrastructure as a Service (IaaS) as “a way of delivering Cloud Computing
infrastructure – servers, storage, network and operating systems – as an on-demand service. Rather than
purchasing servers, software, datacenter space or network equipment, organisations instead buy those
resources as a fully outsourced service on demand.”
The Big Data approach
According to Forrester (2015) organisations should organise for Big Data using the following approach:
- Use small, agile insights teams versus centralized delivery services.
GE successfully deployed such a strategy in the development of its Data Lake with what GE's Vince
Campisi calls "a two-pizza team", meaning a team no bigger than the number of people you could feed
with two pizzas. As Forrester points out, a decentralized approach requires a careful balance with
appropriate data platform management, centers of excellence (CoEs), and communities of practice.
- Centralize the data pipeline and data platform teams.
People often think that because the data is there, it is ready to be used – but that is seldom the case.
Organisations failing to undertake the critical effort of making the data accessible are left with an
ineffective process of trying to figure out where it exists, breaking down silos, and organizing data in
order for someone to actually go after the outcome they seek. The harsh reality is that Big Data is messy
data and there is no quick and easy way around it. To paraphrase the Ancient Mariner, organisations
can easily find themselves in the situation of: "Data, data, every where, Nor any drop to drink."
- Put data science everywhere but technology management.
The rationale for this is simple: data and analytics is a business requirement, driven by business, used by
business to solve business challenges and to drive business opportunities. Certainly IT has a seat at the
table, and is critical to enablement, but ultimately IT should not own the strategy.
- Embrace communities of practice over CoEs for insights execution.
Return on investment in analytics needs to be seen as a process rather than as a control metric.
Organisations need to focus on “value in use” – which measures value in terms of how a given asset
provides benefits to a specific owner under a specific use. Consequently, only when an analytical model
or application is actually used can its real benefits (and costs) be identified. Experienced analytical
leaders therefore focus on execution and the value derived throughout the project’s life cycle, not just
at the end of the project.
- Govern analytics, not just data.
An organisation embarking on a Big Data initiative will face many first time governance challenges.
Addressing these challenges and doing so consistently is a critical consideration for organisations
looking to exploit opportunities in data and analytics – where the difference between those that
succeed and those that fail could well rest on the strength or weakness of the organisation's data
governance foundation.
Summary
Big Data is having a substantive impact on organisations across industries. Organisations are combining Big Data
and analytics to overcome many of the challenges presented by more traditional technologies, and to support
new capabilities. While Big Data has been enabled by technologies like Hadoop, the challenges are acute on two
fronts. Firstly, organisations face challenges finding people skilled in this environment. Secondly, data governance
challenges are increasing in number and evolving in complexity. While these challenges are not trivial, those
organisations that successfully navigate them will be rewarded with opportunities yet to be discovered.
References
Diversity Limited. (2010). Moving your infrastructure to the cloud. [pdf]. Retrieved from http://diversity.net.nz/wp-content/uploads/2011/01/Moving-to-the-Clouds.pdf
Forrester. (2014). The Hadoop ecosystem is a mix of Apache, other open source, and vendor products. [Diagram]. In Forrester. (2014). Hadoop ecosystem overview, Q4 2014. [pdf]. Retrieved from https://www.forrester.com/home/
Forrester. (2015). Brief: Reasons to (or not to) modernize BI platforms with Hadoop. [pdf]. Retrieved from https://www.forrester.com/home/
Forrester. (2015). Q&A: Forrester's top five questions about big data. [pdf]. Retrieved from https://www.forrester.com/home/
Forrester. (2015). Financial and IT analytics are top planned use cases for big data technologies. [Diagram]. In Forrester. (2015). Q&A: Forrester's top five questions about big data. [pdf]. Retrieved from https://www.forrester.com/home/
IBM. (2014). Analytics: The speed advantage. [pdf]. Retrieved from http://www-935.ibm.com/services/us/gbs/thoughtleadership/2014analytics/
IBM. (2015). Analytics: What is MapReduce. [web page]. Retrieved from http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/
IBM. (2015). Making the case for Hadoop and big data in the enterprise. [pdf]. Retrieved from http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=PM&subtype=BK&htmlfid=IMM14161USEN#loaded
Intel. (2015). Big data cloud technology. [pdf]. Retrieved from http://www.intel.co.za/content/dam/www/public/us/en/documents/product-briefs/big-data-cloud-technologies-brief.pdf
ITG. (2013). Business case for enterprise big data deployments. [pdf]. Retrieved from http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=IME14028USEN&appname=skmwww
MIT Sloan Management Review. (2015). Sustaining an analytics advantage. [pdf]. Retrieved from http://mitsmr.com/1CqUqMB
Planet Cassandra. (2015). NoSQL databases defined and explained. [web page]. Retrieved from http://www.planetcassandra.org/what-is-nosql/
TDWI Research. (2015). Hadoop for the enterprise. [pdf]. Retrieved from http://www-01.ibm.com/software/data/infosphere/hadoop/resources.html
Zikopoulos, P., deRoos, D., Bienko, C., Buglio, R., & Andrews, M. (2014). Big data beyond the hype. [pdf]. Retrieved from https://www.ibm.com/developerworks/community/blogs/SusanVisser/entry/big_data_beyond_the_hype_a_guide_to_conversations_for_today_s_data_center?lang=en