What, Why, Where, When and How
BI/SQL/Data Visualization Evangelist
This paper discusses what big data is, its growing prevalence, the opportunities and challenges it brings, and an
architectural framework that will facilitate the delivery of those opportunities while addressing the challenges.
The architecture for ‘Big Data Management’ will be demonstrated through Hadoop technology with the
Map-Reduce framework and its open source ecosystem.
SENTHIL SUNDARESAN 1
I am Senthil, a BI/SQL/data visualization evangelist.
I have donned many roles during my short career of 13+ years, such as Analyst, Developer, Lead, Project Manager, Principal
Data and Visualization Architect, Consultant, DB Administrator and Unix Administrator, to name a few.
My BI and Visualization skills are SAP BO/BODS, TABLEAU, QLIKVIEW, MSBI, ESSBASE, R, OMNISCOPE, SQLSERVER, SYBASE
IQ, SYBASE, TERADATA (again) to name a few.
Having been in this industry, and especially in BI, for so many years, it is imperative for me to understand the nuances and
intricacies of the Big Data tech stack. That was the trigger for me to write this paper, and while doing so I have started
exploring big data.
This paper should be a stepping stone for those who wonder whether it is possible or plausible.
Thanks for reading!
“Big data” is a big buzz phrase in the IT and business world right now, and there is a dizzying array of opinions on just
what these two simple words really mean. Technology vendors in the legacy database or data warehouse spaces say “big
data” simply refers to a traditional data warehousing scenario involving data volumes in either the single or multi-terabyte
range. Others disagree with this by saying that “big data” isn’t limited to traditional data warehouse situations, but includes
real-time or operational data stores used as the primary data foundation for online applications that power key external or
internal business systems.
In 2011, people created 1.8 zettabytes of data, and this is increasing exponentially every year. This ever-increasing data
contains information that could give rise to many business opportunities. A few of the business drivers of big data are:
Finance: Better and deeper understanding of risk to avoid credit crisis – Basel III
Telecommunications: More reliable network where we can predict and prevent failure
Media: More content that is lined up with your personal preferences
Life science: Better targeted medicines with fewer complications and side effects
Retail: A personal experience with products and offers that are just what you need
Government: Government services that are based on hard data, not just gut.
Big Data is here. Analysts and research organizations have made it clear that mining machine generated data is essential to
future success. Embracing new technologies and techniques is always challenging, but as architects, you are expected to
provide a fast, reliable path to business adoption.
Big Data Characteristics, Architecture Capabilities, Technologies, Market vendors, and Sample implementation are explained
in the subsequent sections.
2. WHAT IS BIG DATA?
Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast,
or doesn’t fit the strictures of your database architectures. To gain value from this data, an alternative way has to be chosen
to process it.
2.1 Characteristics of Big Data
Big data has the following characteristics.
It consists of very large, distributed aggregations of loosely structured data that are often incomplete and inaccessible:
Petabytes/exabytes of data
Billions/Trillions of records
Loosely-structured and often distributed data
Flat schemas with few complex interrelationships
Often involving time-stamped events
Often made up of incomplete data
Often including connections between data elements that must be probabilistically inferred
Applications that involve big data can be:
Transactional (e.g., Facebook, Photobox)
Analytic (e.g., ClickFox, Merced Applications)
Fig 1: Big Data Evolution
According to a new global report from IBM and the Saïd Business School at the University of Oxford, fewer than half of the
organizations engaged in active Big Data initiatives are currently analyzing external sources of data, like social media.
2.2 Key Metrics: The Three V’s
As a catch-all term, “big data” can be pretty nebulous, in the same way that the term “cloud” covers diverse technologies.
Input data to big data systems could be chatter from social networks, web server logs, traffic flow sensors, satellite imagery,
broadcast audio streams, banking transactions, MP3s of rock music, the content of web pages, scans of government
documents, GPS trails, telemetry from automobiles, financial market data, the list goes on. Are these all really the same thing?
To clarify matters, the three Vs of volume, velocity, and variety are commonly used to characterize different aspects of big
data. They are a helpful lens through which to view and understand the nature of the data and the software platforms that
are available to exploit them.
Volume: Terabytes, records, transactions, tables, files.
A Boeing jet engine produces 10 TB of operational data for every 30 minutes it runs. Hence a four-engine jumbo jet can
create 640 TB on one Atlantic crossing. Multiply that by the 25,000 flights flown each day and you get the picture.
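The arithmetic behind these figures can be checked with a short sketch. The 8-hour crossing time is an assumption: it is the duration implied by the 640 TB figure, not a number stated in the text. The daily total further assumes, purely for illustration, that every one of the 25,000 flights produces as much data as that crossing, so it should be read as an upper bound.

```python
# Back-of-the-envelope check of the jet-engine data volumes quoted above.
TB_PER_ENGINE_PER_HALF_HOUR = 10   # from the text
ENGINES = 4                        # four-engine jumbo jet, from the text
CROSSING_HOURS = 8                 # assumption: implied by the 640 TB figure
FLIGHTS_PER_DAY = 25_000           # from the text

half_hours = CROSSING_HOURS * 2
tb_per_crossing = TB_PER_ENGINE_PER_HALF_HOUR * ENGINES * half_hours

# Illustrative upper bound: assumes every flight matches the crossing above.
tb_per_day = tb_per_crossing * FLIGHTS_PER_DAY
eb_per_day = tb_per_day / 1_000_000   # 1 EB = 1,000,000 TB (decimal units)

print(f"{tb_per_crossing} TB per crossing")
print(f"{eb_per_day:.0f} EB per day across all flights")
```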
Velocity: Batch, near-time, real-time, streams.
Today’s online ad serving requires a decision within 40 ms. Financial services need close to 1 ms to calculate customer
scoring probabilities. Streamed data, such as movies, needs to travel at high speed for proper rendering.
Variety: Structured, Semi-structured, Unstructured; Text, image, audio, video, record and all the above in a mix.
WalMart processes 1M customer transactions per hour and feeds information to a database estimated at 2.5PB (Petabytes).
There are old and new data sources like RFID, sensors, mobile payments, in-vehicle tracking, etc.
Fig 2: Volume, Velocity and Variety
3. BIG DATA PROCESSING
Before big data, traditional analysis involved crunching data in a traditional database. This was based on the relational
database model, where data and the relationship between the data were stored in tables. The data was processed and stored
Databases have progressed over the years, however, and are now using massively parallel processing (MPP) to break data up
into smaller lots and process it on multiple machines simultaneously, enabling faster processing. Instead of storing the data
in rows, the databases can also employ columnar architectures, which enable the processing of only the columns that have
the data needed to answer the query and enable the storage of unstructured data.
Fig 3: Big Data Architecture
MapReduce is the combination of two functions to better process data. First, the map function distributes data over multiple
nodes, where it is processed in parallel. The reduce function then combines the results of the calculations into a set of
Google used MapReduce to index the web, and has been granted a patent for its MapReduce framework. However, the
MapReduce method has now become commonly used, with the most famous implementation being in an open-source
project called Hadoop.
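The map and reduce steps described above can be illustrated with a small in-process sketch. This is not the Hadoop API: the function names are hypothetical, threads stand in for cluster nodes, and the word-count task is chosen only because it is the customary introductory example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_chunks):
    """Group the emitted values by key so each key reaches one reduce call."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce step: combine all values seen for one key."""
    return key, sum(values)

def word_count(chunks, workers=3):
    # Threads stand in for cluster nodes; each one maps a single chunk.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_words, chunks))
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())

chunks = ["big data is big", "data moves fast", "fast data is big data"]
counts = word_count(chunks)
print(counts)
```

Each chunk is mapped independently, which is what allows the work to be spread over many machines; only the grouping step needs to see the output of every mapper.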
Bridging the Gap: The Key-Value Pair
The key-value pair is the data model underlying MapReduce (and thus Hadoop), and it is the fundamental driver of
performance. A file of key-value pairs has exactly two columns. One, the key, is structured. The other, the value, is
unstructured, at least as far as the system is concerned. The mapper then allows you to move (or split) the data between
the structured and unstructured sections at will. The reducer then allows data to be collated and aggregated provided it has
an identical key.
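A minimal sketch of this idea follows, assuming a made-up record layout (a date key and a free-text value holding a store name and an amount). The mapper moves the store name out of the opaque value and into the key; the reducer then aggregates everything that shares that key. All names and data are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Each record is a (key, value) pair. The value is an opaque string as far
# as the framework is concerned; only the mapper knows how to read it.
records = [
    ("2013-01-01", "store=north amount=20"),
    ("2013-01-01", "store=south amount=30"),
    ("2013-01-02", "store=north amount=15"),
]

def mapper(key, value):
    """Move the store name out of the unstructured value and into the key."""
    fields = dict(part.split("=") for part in value.split())
    return fields["store"], int(fields["amount"])

def reducer(key, values):
    """Collate every value that arrived with an identical key."""
    return key, sum(values)

# Sort so that equal keys are adjacent, as groupby requires.
mapped = sorted(mapper(k, v) for k, v in records)
totals = dict(
    reducer(key, [value for _, value in group])
    for key, group in groupby(mapped, key=itemgetter(0))
)
print(totals)
```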
Massively parallel processing (MPP)
Like MapReduce, MPP processes data by distributing it across a number of nodes, which each process an allocation of data
in parallel. The output is then assembled to create a result.
However, MPP products are queried with SQL, while MapReduce is natively controlled via Java code. MPP is also generally
used on expensive specialized hardware (sometimes referred to as big-data appliances), while MapReduce is deployed on
4. BIG DATA ARCHITECTURE
In this section, we will take a closer look at the overall architecture for big data.
Traditional Information Architecture Capabilities
To understand the high-level architecture aspects of Big Data,
let’s first review a well formed logical information architecture for structured data. In the illustration, you see two data
sources that use integration (ELT/ETL/Change Data Capture) techniques to transfer data into a DBMS data warehouse or
operational data store, and then offer a wide variety of analytical capabilities to reveal the data. Some of these analytic
capabilities include: dashboards, reporting, EPM/BI applications, summary and statistical query, semantic interpretations for
textual data, and visualization tools for high-density data. In addition, some organizations have applied oversight and
standardization across projects, and perhaps have matured the information architecture capability through managing it at
the enterprise level.
Fig 4: Traditional Capabilities – Courtesy Oracle
The key information architecture principles include treating data as an asset through a value, cost, and risk lens, and ensuring
timeliness, quality, and accuracy of data. The EA oversight responsibility is to establish and maintain a balanced
governance approach, including the use of a center of excellence for standards management and training.
Adding Big Data Capabilities
The defining processing capabilities for big data architecture are to meet the volume, velocity, variety, and value
requirements. Unique distributed (multi-node) parallel processing architectures have been created to parse these large data
sets. There are differing technology strategies for real-time and batch processing requirements. For real-time requirements,
key-value data stores, such as NoSQL databases, allow for high-performance, index-based retrieval. For batch processing, a
technique known as “MapReduce” filters data according to a specific data discovery strategy. After the filtered data is
discovered, it can be analyzed directly, loaded into other unstructured databases, sent to mobile devices, or merged into a
traditional data warehousing environment and correlated with structured data.
Fig 5: Big Data Capabilities – Courtesy Oracle
In addition to new unstructured data realms, there are two key differences for big data. First, due to the size of the data sets,
we don’t move the raw data directly to a data warehouse. However, after MapReduce processing we may integrate the
“reduction result” into the data warehouse environment so that we can
leverage conventional BI reporting, statistical, semantic, and correlation capabilities. It is ideal to have analytic capabilities
that combine a conventional BI platform along with big data visualization and query capabilities. And second, to facilitate
analysis in the Hadoop environment, sandbox environments can be created.
For many use cases, big data needs to capture data that is continuously changing and unpredictable. And to analyze that data,
a new architecture is needed. In retail, a good example is capturing real time foot traffic with the intent of delivering in-store
promotion. To track the effectiveness of floor displays and promotions, customer movement and behavior must be
interactively explored with visualization or query tools.
In other use cases, the analysis cannot be complete until you correlate it with other enterprise data - structured data. In the
example of consumer sentiment analysis, capturing a positive or negative social media comment has some value, but
associating it with your most or least profitable customer makes it far more valuable. So, the needed capability with Big Data
BI is context and understanding. Using powerful statistical and semantic tools allows you to find the needle in the haystack,
and will help you predict the future.
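The sentiment example can be sketched as a simple join between the two kinds of data. The records, field names and the `prioritise` function below are all hypothetical; the sketch assumes that sentiment scores have already been extracted from the social media comments and that profit per customer is available from the warehouse.

```python
# Hypothetical data: sentiment scores extracted from social media comments
# (the big data side) and customer profit from the warehouse (structured side).
sentiment = [
    {"customer_id": 1, "score": -0.8},
    {"customer_id": 2, "score": 0.6},
    {"customer_id": 3, "score": -0.5},
]
annual_profit = {1: 12_000, 2: 300, 3: 150}

def prioritise(sentiment_rows, profit_by_customer):
    """Join each comment to its customer and rank negative ones by profit at risk."""
    joined = [
        {**row, "profit": profit_by_customer[row["customer_id"]]}
        for row in sentiment_rows
        if row["customer_id"] in profit_by_customer
    ]
    unhappy = [row for row in joined if row["score"] < 0]
    return sorted(unhappy, key=lambda row: row["profit"], reverse=True)

ranked = prioritise(sentiment, annual_profit)
for row in ranked:
    print(row)
```

On its own, each negative comment looks equally important; once joined to the structured data, the comment from the most profitable customer rises to the top.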
In summary, the Big Data architecture challenge is to meet the rapid use and rapid data interpretation requirements while at
the same time correlating it with other data.
5. STEPS TO BIG DATA
Before you go down the path of big data, it's important to be prepared and approach an implementation in an organized
manner, following these steps.
What do you wish you knew?
This is where you decide what you expect from big data that you can't get from your current systems.
If the answer is nothing, then perhaps big data isn't the right thing to use.
What are the current data assets?
Can the data be cross referenced to produce insights?
Is it possible to build new data products on top of the current assets?
If not, what needs to be implemented to do so?
Once the above are understood, it's time to prioritize. Select the potentially most valuable opportunity for using big-data
techniques and technology, and prepare a business case for a proof of concept, keeping in mind the skill sets you'll need to
do it. You will need to talk to the owners of the data assets to get the full picture.
Another example of applying architecture principles differently is data governance. The quality and accuracy requirements of
big data can vary tremendously. Using strict data precision rules on user sentiment data might filter out too much useful
information, whereas data standards and common definitions are still going to be critical for fraud detection scenarios.
Start the proof of concept, and make sure that there's a clear end point, so that you can evaluate what the proof of concept
has achieved. This might be the time to ask the owner of the data assets to take responsibility for the project.
Once your proof of concept has been completed, evaluate whether it worked. Are you getting real insights delivered? Is the
work that went into the concept bearing fruit? Could it be extended to other parts of the organization? Is there other data
that could be included? This will help you to discover whether to expand your implementation or revamp it.
Once the evaluation is done and the need for big data is inevitable, then it’s imperative to choose the vendors and
5.1 Architecture Decisions
Information Architecture is perhaps the most complex area of IT. It is the ultimate investment payoff. Today’s economic
environment demands that business be driven by useful, accurate, and timely information. And, the world of Big Data adds
another dimension to the problem. However, there are always business and IT tradeoffs in getting to data and information in
the most cost-effective way.
Key Drivers to Consider
Here is a summary of various business and IT drivers you need to consider when making these architecture choices.
Fig 6: Key Drivers
Planning a Big Data architecture is not just about understanding what is different. It’s also about how to integrate what’s new
with what you already have, from database and BI infrastructure to IT tools and end-user applications.
To derive real business value from big data, you need the right tools to capture and organize a wide variety of data types from
different sources, and to be able to easily analyze it within the context of all your enterprise data.
Here is a brief outline of Big Data capabilities and their primary technologies:
Derived from MapReduce technology, Hadoop is an open-source framework to process large amounts of data over multiple
nodes in parallel, running on inexpensive hardware.
Data is split into sections and loaded into a file store — for example, the Hadoop Distributed File System (HDFS), which is
made up of multiple redundant nodes on cheap storage. A name node keeps track of which data is on which nodes. The data
is replicated over more than one node, so that even if a node fails, there's still a copy of the data.
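The splitting, placement and replication described above can be sketched as follows. This is a toy model, not HDFS code: the replication factor of 3, the round-robin placement and the node names are assumptions made for the illustration, and a real cluster places replicas with more sophisticated rules.

```python
import itertools

def place_blocks(data, block_size, nodes, replicas=3):
    """Split data into blocks and record which nodes hold each copy.

    The returned dict plays the role of the name node's block map.
    Round-robin placement is a simplification chosen for this sketch.
    """
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    ring = itertools.cycle(nodes)
    block_map = {}
    for block_id, _ in enumerate(blocks):
        block_map[block_id] = [next(ring) for _ in range(replicas)]
    return blocks, block_map

def readable_after_failure(block_map, failed_node):
    """True if every block still has at least one copy on a surviving node."""
    return all(
        any(node != failed_node for node in holders)
        for holders in block_map.values()
    )

nodes = ["node-a", "node-b", "node-c", "node-d"]
blocks, block_map = place_blocks(b"x" * 1000, block_size=256, nodes=nodes)
print(block_map)
print(readable_after_failure(block_map, "node-b"))
```

Because every block lives on several distinct nodes, losing any single node leaves at least one copy of each block available.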
The data can then be analyzed using MapReduce, which discovers from the name node where the data needed for calculations
resides. Processing is then done at the node in parallel. The results are aggregated to determine the answer to the query and
then loaded onto a node, which can be further analyzed using other tools. Alternatively, the data can be loaded into traditional
data warehouses for use with transactional processing.
Apache Hadoop is considered to be the most noteworthy Hadoop distribution.
Fig 7: Hadoop in the Enterprise
RDBMS and Hadoop
Here is a comparison of the overall differences between the RDBMS and MapReduce-based systems such as Hadoop:
Fig 8: RDBMS vs. Hadoop
Data stores like Hadoop's file store make ad hoc query and analysis difficult, because programming the required map/reduce
functions can be hard. Realizing this when working with Hadoop, Facebook created Hive, which converts SQL queries
to map/reduce jobs to be executed using Hadoop.
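The gap between the two styles can be seen by answering the same question twice. In the sketch below, SQLite from the Python standard library stands in for a SQL engine; it is not Hive, the query is not HiveQL, and the sketch says nothing about how Hive actually compiles queries. The table and data are hypothetical.

```python
import sqlite3
from collections import defaultdict

page_views = [("home", 3), ("search", 5), ("home", 2), ("cart", 1), ("search", 4)]

# Declarative version: the kind of one-line query an analyst would write.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, hits INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)", page_views)
sql_result = dict(
    conn.execute("SELECT page, SUM(hits) FROM page_views GROUP BY page")
)
conn.close()

# Hand-written version: the map and reduce steps the query stands for.
grouped = defaultdict(list)
for page, hits in page_views:          # map: emit (page, hits)
    grouped[page].append(hits)
mapreduce_result = {page: sum(hits) for page, hits in grouped.items()}  # reduce

print(sql_result == mapreduce_result)
```

Both versions produce the same totals, but the first states only what is wanted while the second spells out how to compute it.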
Pig is a procedural data processing language designed for Hadoop, in which you specify a series of steps to perform on the data.
It’s often described as “the duct tape of Big Data” for its usefulness there, and it is often combined with custom streaming
code written in a scripting language for more general operations.
5.2.4 Social Network and Hadoop
Twitter uses Hadoop to store and process tweets, log files, and many other types of data generated across Twitter. It
uses Cloudera's CDH2 distribution of Hadoop, and stores all data as compressed LZO files.
It uses both Scala and Java to access Hadoop's MapReduce APIs.
It uses Pig heavily for both scheduled and ad hoc jobs, due to its ability to accomplish a lot with few statements.
It employs committers on Pig, Avro, Hive, and Cassandra, and contributes much of its internal Hadoop work to open source.
Facebook uses Hadoop to store copies of internal log and dimension data sources and uses it as a source for reporting/analytics
and machine learning.
Currently Facebook has two major clusters:
An 1100-machine cluster with 8800 cores and about 12 PB of raw storage.
A 300-machine cluster with 2400 cores and about 3 PB of raw storage.
Each (commodity) node has 8 cores and 12 TB of storage.
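The cluster figures can be cross-checked against the per-node specification. A minimal sketch, assuming decimal units (1 PB = 1,000 TB) and that every machine matches the quoted per-node specification:

```python
CORES_PER_NODE = 8      # from the text
TB_PER_NODE = 12        # from the text

clusters = {"large": 1100, "small": 300}   # machines per cluster, from the text

# For each cluster: (total cores, raw storage in PB).
summary = {
    name: (machines * CORES_PER_NODE, machines * TB_PER_NODE / 1000)
    for name, machines in clusters.items()
}

for name, (cores, raw_pb) in summary.items():
    print(f"{name}: {cores} cores, {raw_pb:.1f} PB raw storage")
```

The core counts match the quoted 8800 and 2400 exactly. The storage totals come out slightly above the quoted "about 12 PB" and "about 3 PB", so those figures are best read as rough numbers.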
Facebook is a heavy user of both streaming and the Java APIs. It has built a higher-level data warehousing
framework, called Hive, using these features. It has also developed a FUSE implementation over HDFS.
NoSQL database-management systems are unlike relational database-management systems, in that they do not use SQL as
their query language. The idea behind these systems is that they are better for handling data that doesn't fit easily into
tables. They dispense with the overhead of indexing, schema and ACID transactional properties to create large, replicated
data stores for running analytics on inexpensive hardware, which is useful for dealing with unstructured data.
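The schema-free idea can be illustrated with a toy in-memory store. The `KeyValueStore` class and its data are hypothetical and do not model any particular NoSQL product; in particular, the sketch leaves out replication and distribution entirely.

```python
class KeyValueStore:
    """Toy in-memory key-value store: no schema, no joins, no transactions."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Values are stored as given; records need not share any fields.
        self._data[key] = value

    def get(self, key, default=None):
        # Retrieval is a direct lookup by key rather than a SQL query.
        return self._data.get(key, default)

store = KeyValueStore()
store.put("user:1", {"name": "Ana", "tags": ["retail", "mobile"]})
store.put("tweet:42", {"text": "Loving the new store layout", "geo": [51.5, -0.1]})
store.put("log:7", "GET /index.html 200")

print(store.get("tweet:42")["text"])
print(store.get("missing"))
```

The three records have nothing in common structurally, which is exactly the kind of data that does not fit easily into tables.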
Types of NoSQL Databases
The following are the types of NoSQL Databases
Column oriented database
Fig 9: NoSQL Types
Cassandra is a NoSQL database alternative to Hadoop's HDFS.
5.3 Sample Implementation
Big-data projects have a number of different layers of abstraction from abstraction of the data through to running analytics
against the abstracted data. Figure 1 shows the common components of analytical Big-data and their relationship to each
other. The higher-level components help make big data projects easier and more productive. Hadoop (an Apache project
written in Java, built and used by a global community of contributors) is often at the center of big-data projects, but
it is not a prerequisite.
Packaging and support of Hadoop by organizations such as Cloudera, including MapReduce, which is essentially the compute
layer of big data.
File-Systems such as the Hadoop Distributed File System (HDFS), which manages the retrieval and storing of data
and metadata required for computation. Other file systems or databases such as HBase (a NoSQL tabular store) or
Cassandra (a NoSQL eventually consistent key-value store) can also be used.
Instead of writing in Java, higher-level languages such as Pig (part of Hadoop) can be used, simplifying the writing of
Hive is a Data Warehouse layer built on top of Hadoop, developed by Facebook programmers.
Cascading is a thin Java library that sits on top of Hadoop that allows suites of MapReduce jobs to be run and
managed as a unit. It is widely used to develop special tools.
Semi-automated modeling tools such as CR-X allow models to develop interactively at great speed, and can help set
up the database that will run the analytics.
Specialized scale-out analytic databases such as Greenplum or Netezza, with very fast loading, load and reload the data
for the analytic models.
ISV big data analytical packages such as ClickFox and Merced run against the database to help address the business
issues (e.g., the customer satisfaction issues mentioned in the introduction).
Transactional big-data projects cannot use Hadoop, as it is not real-time. For transactional systems that do not need
a database with ACID guarantees, NoSQL databases can be used, though there are constraints such as weak
consistency guarantees (e.g., eventual consistency) or restricting transactions to a single data item. For big-data
transactional SQL databases that need the ACID guarantees, the choices are limited. Traditional scale-up databases
are usually too costly for very large-scale deployment, and don't scale out very well. Most social media databases
have had to hand-craft solutions. Recently a new breed of scale-out SQL databases has emerged, such as Clustrix, with
architectures that move the processing next to the data (in the same way as Hadoop). These allow greater scale
Fig 10: Sample Implementation
This area is extremely fast growing, with many new entrants into the market expected over the next few years.
There is scarcely a vendor that doesn't have a big-data plan in train, with many companies combining their proprietary
database products with the open-source Hadoop technology as their strategy to tackle velocity, variety and volume.
Many of the early big-data technologies came out of open source, posing a threat to traditional IT vendors that have packaged
their software and kept their intellectual property close to their chests. However, the open-source nature of the trend has
also provided an opportunity for traditional IT vendors, because enterprise and government often find open-source tools off-
Therefore, traditional vendors have welcomed Hadoop with open arms, packaging it into their own proprietary systems so
they can sell the result to enterprise as more comfortable and familiar packaged solutions.
Below are the plans of some of the larger vendors.
Cloudera was founded in 2008 by employees who worked on Hadoop at Yahoo and Facebook. It contributes to the Hadoop
open-source project, offering its own distribution of the software for free. It also sells a subscription-based, Hadoop-based
distribution for the enterprise, which includes production support and tools to make it easier to run Hadoop.
Cloudera rival Hortonworks was founded by key architects from the Yahoo Hadoop software engineering team. In June 2012,
the company launched a high-availability version of Apache Hadoop, the Hortonworks Data Platform, on which it collaborated
with VMware, as the goal was to target companies deploying Hadoop on VMware's vSphere.
Teradata has also partnered with Hortonworks to create products that "help customers solve business problems in new and
Teradata made its move out of the "old-world" data-warehouse space by buying Aster Data Systems and Aprimo in 2011.
Teradata wanted Aster's ability to manage "a variety of diverse data that is not structured", such as web applications, sensor
networks, social networks, genomics, video and photographs.
Teradata has now gone to market with the Aster Data nCluster, a database using MPP and MapReduce. Visualization and
analysis are enabled through the Aster Data visual-development environment and suite of analytic modules. The Hadoop
connector, enabled by its agreement with Cloudera, allows for a transfer of information between nCluster and Hadoop.
Oracle made its big-data appliance available earlier this year: a full rack of 18 Oracle Sun servers with 864GB of main
memory; 216 CPU cores; 648TB of raw disk storage; 40Gbps InfiniBand connectivity between nodes and engineered systems;
and 10Gbps Ethernet connectivity.
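Dividing the rack totals by the number of servers gives a feel for the size of each node. A minimal sketch, assuming the resources are spread evenly across the 18 servers (the text gives only the rack totals):

```python
SERVERS = 18                # from the text
TOTAL_MEMORY_GB = 864       # from the text
TOTAL_CORES = 216           # from the text
TOTAL_RAW_DISK_TB = 648     # from the text

# Assumption: every server in the rack is configured identically.
per_server = {
    "memory_gb": TOTAL_MEMORY_GB / SERVERS,
    "cores": TOTAL_CORES / SERVERS,
    "raw_disk_tb": TOTAL_RAW_DISK_TB / SERVERS,
}
print(per_server)
```

All three totals divide evenly by 18, which is consistent with a rack of identically configured servers.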
The system includes Cloudera's Apache Hadoop distribution and manager software, as well as an Oracle NoSQL database and
a distribution of R (an open-source statistical computing and graphics environment).
It integrates with Oracle's 11g database, with the idea being that customers can use Hadoop MapReduce to create optimized
datasets to load and analyze in the database.
IBM combined Hadoop and its own patents to create IBM InfoSphere BigInsights and IBM InfoSphere Streams as the core
technologies for its big-data push.
The BigInsights product, which enables the analysis of large-scale structured and unstructured data, "enhances" Hadoop to
"withstand the demands of your enterprise", according to IBM. It adds administrative, workflow, provisioning and security
features into the open-source distribution. Meanwhile, streams analysis has a more complex event-processing focus, allowing
the continuous analysis of streaming data so that companies can respond to events.
IBM has partnered with Cloudera to integrate its Hadoop distribution and Cloudera Manager with IBM BigInsights. Like Oracle's
big-data product, IBM's BigInsights links to IBM DB2, its Netezza data warehouse, its InfoSphere Warehouse, and its Smart
At the core of SAP's big-data strategy sits a high-performance analytic appliance (HANA) data-warehouse appliance,
unleashed in 2011. It exploits in-memory computing, processing large amounts of data in the main memory of a server to
provide real-time results for analysis and transactions. Business applications, like SAP's Business Objects, can sit on the HANA
platform to receive a real-time boost.
SAP has integrated HANA with Hadoop, enabling customers to move data between Hive and Hadoop's Distributed File System
and SAP HANA or SAP Sybase IQ server. It has also set up a "big-data" partner council, which will work to provide products
that make use of HANA and Hadoop. One of the key partners is Cloudera. SAP wants it to be easy to connect to data, whether
it's in SAP software or software from another vendor.
Microsoft is integrating Hadoop into its current products. It has been working with Hortonworks to make Hadoop available
on its cloud platform Azure, and on Windows Servers. The former is available in developer preview. It already has connectors
between Hadoop, SQL Server and SQL Server Parallel Data Warehouse, as well as the ability for customers to move data from
Hive into Excel and Microsoft BI tools, such as PowerPivot.
EMC has centered its big-data offering on technology that it acquired when it bought Greenplum in 2010. It offers a unified
analytics platform that deals with web, social, document, mobile, machine and multimedia data using Hadoop's MapReduce
and HDFS, while ERP, CRM and POS data is put into SQL stores. Data mining, neural nets and statistical analysis are carried
out using data from both sets, and the results are fed into dashboards.
6. VALUE TO AN ORGANIZATION
Value of Big Data falls into two categories:
1. Analytical use
2. Enabling new markets/products
Big data analytics can reveal insights previously hidden by data that was too costly to process, such as peer influence among
customers, revealed by analyzing shoppers’ transactions and social and geographical data.
The past decade’s successful web startups are prime examples of big data used as an enabler of new products and services.
For example, by combining a large number of signals from a user’s actions and those of their friends, Facebook has been able
to craft a highly personalized user experience and create a new kind of advertising business. It’s no coincidence that the lion’s
share of ideas and tools underpinning big data has emerged from Google, Yahoo, Amazon and Facebook.
7. FIRMS AND BIG DATA
Now that there are products that make use of big data, what are companies' plans in the space? We've outlined some of
Ford is experimenting with Hadoop to better understand how its cars operate and how consumers use the vehicles, to feed
that information back into its design process and help optimize the user's experience in the future, and to gain value
out of the data it generates from its business operations, vehicle research and even its customers' cars.
HCF has adopted IBM's big-data analytics solution, including the Netezza big-data appliance, to better analyze claims as they
are made in real time. This helps to more easily detect fraud and provide ailing members with information they might need
to stay fit and healthy.
Klout's job is to create insights from the vast amounts of data coming in from the 100 million social-network users indexed
by the company, and to provide those insights to customers. For example, Klout might provide information on how certain
people's influence on social networks (or Klout score) might affect word-of-mouth advertising, or provide information on
changes in demand. To deliver the analysis on a shoestring, Klout built custom infrastructure on Apache Hadoop, with a
separate data silo for each social network.
7.4 Mitsui Knowledge Industry
Mitsui analyses genomes for cancer research. Using HANA, R and Hadoop to pre-process DNA sequences, the company was
able to shorten genome-analysis time from several days to 20 minutes.
Nokia is using Apache Hadoop and Cloudera's CDH to pull the unstructured data (generated by its phones around the world)
into a structured environment to create 3D maps that show traffic, inclusive of speed categories, elevation, current events
WalMart uses a product it bought, called Muppet, as well as Hadoop to analyze social-media data from Twitter, Facebook,
Foursquare and other sources. Among other things, this allows WalMart to analyze in real time which stores will have the
biggest crowds, based on Foursquare check-ins.
8. BIG DATA – CHANGING WORLD
Computers are leaner, meaner and cheaper than ever before. With computing power no longer at a premium, we're
swimming in numbers that describe everything from how a small town in Minnesota behaves during rush hour to the
probability of a successful drone strike in Yemen.
The advent of so-called "big data" means that companies, governments and organizations can collect, interpret and wield
huge stores of data to an amazing breadth of ends. From shoe shopping to privacy concerns, here's a look at five ways "big
data" is changing the world:
8.1 Data as a deadly weapon
The traditional battlefield has dissolved into thin air. In the big data era, information is the deadliest weapon and leveraging
massive amounts of it is this era's arms race. But current military tech is buckling under the sheer weight of data collected
from satellites, unmanned aircraft, and more traditional means.
As part of the Obama administration's "Big Data Initiative," the Department of Defense launched XDATA, a program that
intends to invest $25 million toward systems that analyze massive data sets in record time. With more efficient number
crunching, the U.S. military can funnel petabytes of data toward cutting edge advances, like making unmanned drones
smarter and more deadly than ever.
8.2 Saving the Earth
Beyond powering predator drones and increasing retail revenue, big data can do a literal world of good. Take Google Earth
Engine, an open source big data platform that allowed researchers to create the first high-resolution map of Mexico's forests.
The map would have taken a traditional computer over three years to construct, but using Google Earth Engine's massive
data cloud it was completed in the course of a day.
Massive sets of data like this can help us understand environmental threats on a systemic level. The more data we have about
the changing face of the earth's ecosystems and weather patterns, the better we can model future environmental shifts --
and how to stop them while we still can.
8.3 Watching you shop
Big data can mean big profits. By understanding what you want to buy today, companies large and small can figure out what
you'll want to buy tomorrow -- maybe even before you do. Online retailers like Amazon scoop up information about our
shopping and e-window shopping habits on a huge scale, but even brick and mortar retailers are starting to catch on. A clever
company called RetailNext helps companies like Brookstone and American Apparel record video of shoppers as they browse
and buy. By transforming a single shopper's path into as many as 10,000 data points, companies can see how they move
through a store, where they pause and how that tracks with sales.
8.4 Scientific research in overdrive
Data has long been the cornerstone of scientific discovery, and with big data -- and the big computing power necessary to
process it -- research can move at an exponentially fast clip.
Take the Human Genome Project, widely considered to be one of the landmark scientific accomplishments in human history.
Over the course of the $3 billion project, researchers analyzed and sequenced the roughly 25,000 genes that make up the
human genome in 13 years. With today's modern methods of data collection and analysis, the same process can be completed
in hours -- all by a device the size of a USB memory stick and for less than $1,000.
8.5 Big data, bigger privacy concerns
You might just be a number in the grand scheme of things, but that adage isn't as reassuring as it used to be. It's true that big
data is about breadth, but it's about depth, too.
Web mega-companies like Facebook and Google not only scoop up data on a huge number of users -- 955 million, in
Facebook's case -- but they collect an incredible depth of data as well. From what you search and where you click to who you
know (and who they know, and who they know), the web's biggest players own data stockpiles so robust that they border on
Where technological power, cultural advancement and profit intersect, one thing's clear: with big data comes even bigger
9. DEPLOYMENT CONSIDERATIONS
We have explored the nature of big data, and surveyed the landscape of big data from a high level. As usual, when it comes
to deployment there are dimensions to consider over and above tool selection.
9.1 Cloud or In-house
The majority of big data solutions are now provided in three forms: software-only, appliance or cloud-based.
The decision about which route to take will depend, among other things, on issues of data locality, privacy and regulation,
human resources and project requirements. Many organizations opt for a hybrid solution: using on-demand cloud resources
to supplement in-house deployments.
9.1.1 Cloud Computing and Big Data
Experts across the IT industry, including those working in cloud computing and big data, agree that a flexible and fast IT
infrastructure is needed to support Big Data. The cloud removes the infrastructure challenges, provides the necessary speed
and adds scalability. However, four areas must still be investigated more deeply: storage and processing, stewardship,
sense-making and security.
9.1.2 Significant Change in Cloud Computing
Traditionally, cloud computing operates in three primary layers: Software as a Service, Platform as a Service and Infrastructure
as a Service. However, the architecture of Big Data adds another layer into the stack, which is concerned with analyzing and
managing Big Data. It includes different binding concepts like lineage, pedigree and provenance. Big Data is complex and
comes with daunting challenges. Phenomenal corporate balance is required for success. For organizations to harness Big Data
effectively, they must change their business processes, implement multiple technologies and give their workforce relevant
9.2 Skills shortages
Even if a company decides to go down the big‐data path, it may be difficult to hire the right people. The data scientist requires
a unique blend of skills, including a strong statistical and mathematical background, a good command of statistical tools such
as SAS, SPSS or the open‐source R and an ability to detect patterns in data (like a data‐mining specialist), all backed by the
domain knowledge and communications skills to understand what to look for and how to deliver it.
Tracking individuals' data in order to be able to sell to them better will be attractive to a company, but not necessarily to the
consumer who is being sold the products. Not everyone wants to have an analysis carried out on their lives, and depending
on how privacy regulations develop, which is likely to vary from country to country, companies will need to be careful with
how invasive they are with big-data efforts, including how they collect data. Regulations could lead to fines for invasive
policies, but perhaps the greater risk is loss of trust.
Individuals trust companies to keep their data safe. However, because big data is such a new area, products haven't been
built with security in mind, despite the fact that the large volumes of data stored mean that there is more at stake than ever
before if data goes missing.
9.5 Big Data is messy
It’s not all about infrastructure. Big data practitioners consistently report that 80% of the effort involved in dealing with data
is cleaning it up in the first place.
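As a small illustration of what that clean-up effort typically involves, the sketch below normalises names, parses amounts, drops unusable rows and removes duplicates. The sample rows, field layout and the `clean_records` helper are assumptions made for this example, not part of any particular product.

```python
def clean_records(raw_rows):
    """Normalise, validate and de-duplicate rows of (customer, amount) strings."""
    seen = set()
    cleaned = []
    for customer, amount in raw_rows:
        # Trim, collapse repeated whitespace and unify capitalisation.
        name = " ".join(customer.split()).title()
        if not name:
            continue  # drop rows with no customer name
        try:
            # Strip thousands separators and currency symbols before parsing.
            value = float(amount.replace(",", "").replace("$", ""))
        except ValueError:
            continue  # drop rows whose amount cannot be parsed
        record = (name, value)
        if record in seen:
            continue  # drop exact duplicates after normalisation
        seen.add(record)
        cleaned.append(record)
    return cleaned


# Hypothetical messy input, as it might arrive from several source systems.
RAW = [
    ("  alice  smith ", "1,200.50"),
    ("ALICE SMITH", "$1200.50"),
    ("bob jones", "n/a"),
    ("", "15"),
    ("Carol  Lee", "15"),
]

if __name__ == "__main__":
    for row in clean_records(RAW):
        print(row)
```

Of the five raw rows only two survive, which hints at why cleaning dominates the effort once the same rules have to be applied across billions of records.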
9.6 Big Data is big
It is a fundamental fact that data that is too big to process conventionally is also too big to transport anywhere. Even if the
data isn’t too big to move, locality can still be an issue, especially with rapidly updating data.
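A back-of-the-envelope calculation shows why transport becomes the bottleneck. The sketch below assumes a dataset of one petabyte (10^15 bytes) and a dedicated 1 Gbit/s link; both figures are illustrative assumptions, not measurements from this paper.

```python
SECONDS_PER_DAY = 86_400
PETABYTE = 10**15  # bytes (decimal petabyte, assumed dataset size)
GIGABIT = 10**9    # bits per second (assumed link speed)


def transfer_days(size_bytes, link_bits_per_second, efficiency=1.0):
    """Days needed to move size_bytes over a link.

    efficiency is the fraction of nominal bandwidth actually achieved
    (1.0 means the link is fully and exclusively used).
    """
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{transfer_days(PETABYTE, GIGABIT):.1f} days at full line rate")
    print(f"{transfer_days(PETABYTE, GIGABIT, efficiency=0.5):.1f} days at 50% efficiency")
```

Under these assumptions a single petabyte needs roughly 93 days at full line rate, which is why it is usually cheaper to move the computation to the data than the data to the computation.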
The phenomenon of big data is closely tied to the emergence of data science, a discipline that combines math, programming
and scientific instinct. Benefiting from big data means investing in teams with this skill set, and surrounding them with an
organizational willingness to understand and use data for advantage.
9.8.1 Do you know where your data is?
It's no use setting up a big-data product for analysis only to realize that critical data is spread across the organization in
inaccessible and possibly unknown locations.
9.8.2 A lack of direction
"Collecting and analyzing the data is not enough; it must be presented in a timely fashion, so that decisions are made as a
direct consequence that has a material impact on the productivity, profitability or efficiency of the organization. Most
organizations are ill prepared to address both the technical and management challenges posed by big data; as a direct result,
few will be able to effectively exploit this trend for competitive advantage."
Unless firms know what questions they want to answer and what business objectives they hope to achieve, big-data projects
just won't bear fruit.
Finally, remember that big data is no panacea. You can find patterns and clues in your data, but first decide what
problem you want to solve.
If you pick a real business problem, such as how you can change your advertising strategy to increase spend per customer, it
will guide your implementation. While big data work benefits from an enterprising spirit, it also benefits strongly from a
concrete goal head
As you explore the ‘what’s new’ across the spectrum of Big Data capabilities, we suggest that you think about their integration
into your existing infrastructure and BI investments. As examples, align new operational and management capabilities with
standard IT, build for enterprise scale and resilience, unify your database and development paradigms as you embrace Open
Source, and share metadata wherever possible for both integration and analytics.
Last but not least, expand IT governance to include a Big Data center of excellence that ensures business alignment, grows
your skills, manages Open Source tools and technologies, shares knowledge, establishes standards and manages best practices.
Fig 11: McKinsey Survey
Corporates vs. Big Data
“Experience Certainty” - big data is imperative for Corporates to face the future.
Scale-Out Storage Systems - Hadoop Technology Stack and Services
Corporates need to have strong partnerships with storage vendors and to be involved in the architecture of large Data Centers
with Big Data storage requirements. Most Scale-Out storage solutions today include Hadoop as part of the stack.
BI, Advanced and Predictive Analytics
Corporates need to have strong capabilities in Business Intelligence, Data Warehousing and Advanced Analytics. This
experience should cover industry-leading products as well as advanced and predictive analytics solutions, as in the case of
“Listening Platform for Social Media” and “Supply Chain Predictive Analytics”.
Vertical Domain Experience
Corporates need to have deep knowledge of Business Imperatives of Semiconductor, Computer Platforms, Consumer
Electronics and Software Product Companies. This knowledge in turn helps setting the right patterns for Advanced Analytics
and also for defining the correct rules for Big Data analytics.
What can be done?
The scarcity of Big Data and Hadoop knowledge creates a gap between requirements and resource availability. This gap
can be closed by selecting interested associates and training them properly, creating a larger pool of associates
with big data expertise for the future.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice.
To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a