1.1 About BigData:
Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured
and unstructured data that is so large that it is difficult to process using traditional database and
software techniques. In most enterprise scenarios the data is too big, moves too fast, or
exceeds current processing capacity.
While the term may seem to reference the volume of data, that isn’t always the case. The term
Big data, especially when used by vendors, may refer to the technology (which includes tools
and processes) that an organization requires to maintain large amounts of data and storage.
The term Big Data is believed to have originated with the web search companies who had to
query very large distributed aggregations of loosely structured data. Big data has become viable
as cost-effective approaches have emerged to tame the volume, velocity and variability of
massive data. Within this data lie valuable patterns and information, previously hidden
because of the amount of work required to extract them. To leading corporations, such as
Walmart or Google, this power has been in reach for some time, but at fantastic cost. Today’s
commodity hardware, cloud architectures and open source software bring big data processing
into the reach of the less well-resourced. Big data processing is eminently feasible for even the
small garage startups, who can cheaply rent server time in the cloud.
1.2 An Example of Big Data:
An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of
data consisting of billions to trillions of records of millions of people—all from different sources
(e.g. Web, sales, customer contact center, social media, mobile data and so on). The data is
typically loosely structured data that is often incomplete and inaccessible.
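The unit sizes quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch that assumes the binary (1,024x) convention used in the text; the names UNITS, to_bytes and convert are illustrative only.

```python
# Storage units in ascending order; each step is a factor of 1,024,
# matching the convention used in the text (1 PB = 1,024 TB).
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB"]

def to_bytes(value, unit):
    """Convert a quantity expressed in `unit` to bytes."""
    return value * 1024 ** UNITS.index(unit)

def convert(value, src, dst):
    """Convert a quantity from unit `src` to unit `dst`."""
    return to_bytes(value, src) / to_bytes(1, dst)

print(convert(1, "PB", "TB"))  # 1024.0
print(convert(1, "EB", "PB"))  # 1024.0
print(convert(1, "EB", "TB"))  # 1048576.0
```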
When dealing with larger datasets, organizations face difficulties in being able to create,
manipulate, and manage big data. Big data is particularly a problem in business analytics because
standard tools and procedures are not designed to search and analyze massive datasets.
1.3 SIZE OF BIGDATA:
Social networking sites need to process huge volumes of data on a
daily basis. An example of such big data processed on a daily basis is given as follows:
1.4 CHARACTERISTICS OF BIG DATA
The characteristics of big data are as follows:
Volume: A typical PC might have had 10 gigabytes of storage in 2000. Today, Facebook ingests
500 terabytes of new data every day; a Boeing 737 will generate 240 terabytes of flight
data during a single flight across the US; the proliferation of smart phones, the data they
create and consume; sensors embedded into everyday objects will soon result in billions
of new, constantly-updated data feeds containing environmental, location, and other
information, including video.
Velocity: Click streams and ad impressions capture user behavior at millions of events per second;
high-frequency stock trading algorithms reflect market changes within microseconds;
machine to machine processes exchange data between billions of devices; infrastructure
and sensors generate massive log data in real-time; on-line gaming systems support
millions of concurrent users, each producing multiple inputs per second.
Variety: Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data,
audio and video, and unstructured text, including log files and social media. Traditional
database systems were designed to address smaller volumes of structured data, fewer
updates or a predictable, consistent data structure. Traditional database systems are also
designed to operate on a single server, making increased capacity expensive and finite.
As applications have evolved to serve large volumes of users, and as application
development practices have become agile, the traditional use of the relational database
has become a liability for many companies rather than an enabling factor in their
business. Big Data databases, such as MongoDB, solve these problems and provide
companies with the means to create tremendous business value.
2. BIGDATA ANALYTICS
Big data analytics is the process of examining large amounts of data of a variety of types (big
data) to uncover hidden patterns, unknown correlations and other useful information. Such
information can provide competitive advantages over rival organizations and result in business
benefits, such as more effective marketing and increased revenue.
2.1 GOAL OF BIGDATA ANALYTICS:
The primary goal of big data analytics is to help companies make better business decisions by
enabling data scientists and other users to analyze huge volumes of transaction data as well as
other data sources that may be left untapped by conventional business intelligence (BI) programs.
These other data sources may include Web server logs, social media activity reports, mobile-phone
call detail records and information captured by sensors. Some people exclusively associate
big data and big data analytics with unstructured data of that sort, but consulting firms like Gartner
Inc. and Forrester Research Inc. also consider transactions and other structured data to be valid
forms of big data.
2.2 TECHNOLOGIES ASSOCIATED WITH BIGDATA:
Big data analytics can be done with the software tools commonly used as part of advanced
analytics disciplines such as predictive analytics and data mining. But the unstructured data
sources used for big data analytics may not fit in traditional data warehouses. Furthermore,
traditional data warehouses may not be able to handle the processing demands posed by big data.
As a result, a new class of big data technology has emerged and is being used in many big data
analytics environments. The technologies associated with big data analytics include
NoSQL databases, Hadoop and MapReduce. These technologies form the core of an open
source software framework that supports the processing of large data sets across clustered
systems.
2.3 Challenges in Big Data Analysis
Heterogeneity and Incompleteness: When humans consume information, a great deal of
heterogeneity is comfortably tolerated. In fact, the nuance and richness of natural language can
provide valuable depth. However, machine analysis algorithms expect homogeneous data, and
cannot understand nuance. In consequence, data must be carefully structured as a first step in (or
prior to) data analysis.
Even after data cleaning and error correction, some incompleteness and some errors in data are
likely to remain. This incompleteness and these errors must be managed during data analysis.
Doing this correctly is a challenge.
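As an illustration of structuring heterogeneous input and managing the incompleteness that remains, the sketch below coerces raw records to one fixed schema and records which fields are missing or invalid instead of silently dropping them. The schema, the field names and the normalize function are assumptions made for this example only.

```python
from typing import Any

# Hypothetical target schema: field name -> type every record is coerced to.
SCHEMA = {"user_id": int, "country": str, "amount": float}

def normalize(raw: dict) -> tuple:
    """Coerce one raw record to SCHEMA.

    Returns (record, problems): fields that are absent, empty or not
    convertible are set to None and listed in `problems`, so that later
    analysis steps can decide how to treat them.
    """
    record: dict = {}
    problems: list = []
    for field, cast in SCHEMA.items():
        value: Any = raw.get(field)
        try:
            if value is None or value == "":
                raise ValueError("missing value")
            record[field] = cast(value)
        except (TypeError, ValueError):
            record[field] = None
            problems.append(field)
    return record, problems

raw_records = [
    {"user_id": "17", "country": "NO", "amount": "12.50"},  # strings, complete
    {"user_id": 18, "amount": "n/a"},                       # country absent, amount invalid
    {"user_id": "", "country": "US", "amount": 3},          # empty identifier
]

for raw in raw_records:
    print(normalize(raw))
```

Keeping the list of problem fields next to each record is one possible design choice: it lets the analysis exclude, impute or flag incomplete values explicitly.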
Scale: Of course, the first thing anyone thinks of with Big Data is its size. After all, the word
“big” is there in the very name. Managing large and rapidly increasing volumes of data has been
a challenging issue for many decades.
Timeliness: The larger the data set to be processed, the longer it will take to analyze. The design
of a system that effectively deals with size is likely also to result in a system that can process a
given size of data set faster.
Privacy: The privacy of data is another huge concern, and one that increases in the context of
Big Data. For electronic health records, there are strict laws governing what can and cannot be
done. However, there is great public fear regarding the inappropriate use of personal data,
particularly through linking of data from multiple sources. Managing privacy is effectively both
a technical and a sociological problem, which must be addressed jointly from both perspectives
to realize the promise of big data.
2.4 USES OF BIGDATA ANALYTICS:
Enable data analysts to rapidly produce insights with no IT involvement
Connect to any data source of any data type using a simple, guided interface
Utilize 220+ built-in functions to quickly and easily analyze your data
Create scenarios such as market compensation comparison, global payroll cost analysis,
retention risk and impact analysis, competitive benchmark, revenue pulse, and more
3. BIG DATA TECHNOLOGY
3.1 Selecting a Big Data Technology: Operational vs. Analytical
The Big Data landscape is dominated by two classes of technology: systems that provide
operational capabilities for real-time, interactive workloads where data is primarily captured and
stored; and systems that provide analytical capabilities for retrospective, complex analysis that
may touch most or all of the data. These classes of technology are complementary and frequently
deployed together.
Operational and analytical workloads for Big Data present opposing requirements and systems
have evolved to address their particular demands separately and in very different ways. Each has
driven the creation of new technology architectures. Operational systems, such as the NoSQL
databases, focus on servicing highly concurrent requests while exhibiting low latency for
responses operating on highly selective access criteria. Analytical systems, on the other hand,
tend to focus on high throughput; queries can be very complex and touch most if not all of the
data in the system at any time. Both systems tend to operate over many servers operating in a
cluster, managing tens or hundreds of terabytes of data across billions of records.
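The contrast between the two access patterns can be sketched on a single machine. In the example below, which is only an illustration with made-up data and hypothetical names, the operational path answers a highly selective request through an index, while the analytical path scans every record to compute an aggregate.

```python
import random

random.seed(0)

# Hypothetical data set: one record per order.
orders = [
    {"order_id": i, "user": f"u{i % 1000}", "amount": random.randint(1, 100)}
    for i in range(100_000)
]

# Operational access: build an index once, then serve selective lookups
# that each touch a single record.
orders_by_id = {order["order_id"]: order for order in orders}

def get_order(order_id):
    """Return the order with the given id, or None if it does not exist."""
    return orders_by_id.get(order_id)

# Analytical access: a retrospective query that touches all of the data.
def revenue_per_user(records):
    """Sum the order amounts for every user by scanning all records."""
    totals = {}
    for record in records:
        totals[record["user"]] = totals.get(record["user"], 0) + record["amount"]
    return totals

print(get_order(42))
print(len(revenue_per_user(orders)), "users aggregated")
```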
OPERATIONAL BIG DATA
For operational Big Data workloads, NoSQL Big Data systems such as document databases have
emerged to address a broad set of applications, and other architectures, such as key-value stores,
column family stores, and graph databases are optimized for more specific applications. NoSQL
technologies, which were developed to address the shortcomings of relational databases in the
modern computing environment, are faster and scale much more quickly and inexpensively than
relational databases.
Critically, NoSQL Big Data systems are designed to take advantage of new cloud computing
architectures that have emerged over the past decade to allow massive computations to be run
inexpensively and efficiently. This makes operational Big Data workloads much easier to
manage, and cheaper and faster to implement.
In addition to user interactions with data, most operational systems need to provide some degree
of real-time intelligence about the active data in the system. For example in a multi-user game or
financial application, aggregates for user activities or instrument performance are displayed to
users to inform their next actions. Some NoSQL systems can provide insights into patterns and
trends based on real-time data with minimal coding and without the need for data scientists and
ANALYTICAL BIG DATA
Analytical Big Data workloads, on the other hand, tend to be addressed by MPP database
systems and MapReduce. These technologies are also a reaction to the limitations of traditional
relational databases and their lack of ability to scale beyond the resources of a single server.
Furthermore, MapReduce provides a new method of analyzing data that is complementary to the
capabilities provided by SQL.
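The MapReduce programming model can be illustrated with the classic word count. The sketch below runs in a single Python process and only mimics the three stages (map, shuffle, reduce); it does not use Hadoop, and the function names are chosen for this example.

```python
from collections import defaultdict
from functools import reduce

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)

documents = ["big data is big", "data about data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

In a real cluster the map and reduce calls would run in parallel on different servers, which is what allows the model to scale beyond the resources of one machine.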
As applications gain traction and their users generate increasing volumes of data, there are a
number of retrospective analytical workloads that provide real value to the business. Where these
workloads involve algorithms that are more sophisticated than simple aggregation, MapReduce
has emerged as the first choice for Big Data analytics. Some NoSQL systems provide native
MapReduce functionality that allows for analytics to be performed on operational data in place.
Alternatively, data can be copied from NoSQL systems into analytical systems such as Hadoop for
analysis.
OVERVIEW OF OPERATIONAL VS. ANALYTICAL SYSTEMS
                 Operational         Analytical
Latency          1 ms - 100 ms       1 min - 100 min
Concurrency      1000 - 100,000      1 - 10
Access Pattern   Writes and Reads
Technology                           MapReduce, MPP Database
Combining Operational and Analytical Technologies; Using Hadoop
New technologies like NoSQL, MPP databases, and Hadoop have emerged to address Big Data
challenges and to enable new types of products and services to be delivered by the business. One
of the most common ways companies are leveraging the capabilities of both systems is by
integrating a NoSQL database such as MongoDB with Hadoop. The connection is easily made
by existing APIs and allows analysts and data scientists to perform complex, retroactive queries
for Big Data analysis and insights while maintaining the efficiency and ease-of-use of a NoSQL
database.
NoSQL, MPP databases and Hadoop are complementary: NoSQL systems should be used to
capture Big Data and provide operational intelligence to users, and MPP databases and Hadoop
should be used to provide analytical insight for analysts and data scientists. Together, NoSQL,
MPP databases and Hadoop enable businesses to capitalize on Big Data.
3.2 Considerations for Decision Makers
While many Big Data technologies are mature enough to be used for mission-critical, production
use cases, the field is still nascent in some regards. Accordingly, the way forward is not always clear.
As organizations develop Big Data strategies, there are a number of dimensions to consider when
selecting technology partners, including:
1. Online vs. Offline Big Data
2. Software Licensing Models
3. Community
4. Developer Appeal
5. Agility
6. General Purpose vs. Niche Solutions
1. ONLINE VS. OFFLINE BIG DATA
Big Data can take both online and offline forms. Online Big Data refers to data that is created,
ingested, transformed, managed and/or analyzed in real-time to support operational applications
and their users. Big Data is born online. Latency for these applications must be very low and
availability must be high in order to meet SLAs and user expectations for modern application
performance. This includes a vast array of applications, from social networking news feeds, to
analytics to real-time ad servers to complex CRM applications. Examples of online Big Data
databases include MongoDB and other NoSQL databases.
Offline Big Data encompasses applications that ingest, transform, manage and/or analyze Big
Data in a batch context. They typically do not create new data. For these applications, response
time can be slow (up to hours or days), which is often acceptable for this type of use case. Since
they usually produce a static (vs. operational) output, such as a report or dashboard, they can
even go offline temporarily without impacting the overall goal or end product. Examples of
offline Big Data applications include Hadoop-based workloads; modern data warehouses;
extract, transform, load (ETL) applications; and business intelligence tools.
Organizations evaluating which Big Data technologies to adopt should consider how they intend
to use their data. For those looking to build applications that support real-time, operational use
cases, they will need an operational data store like MongoDB. For those that need a place to
conduct long-running analysis offline, perhaps to inform decision-making processes, offline
solutions like Hadoop can be an effective tool. Organizations pursuing both use cases can do so
in tandem, and they will sometimes find integrations between online and offline Big Data
technologies. For instance, MongoDB provides integration with Hadoop.
2. SOFTWARE LICENSE MODEL
There are three general types of licenses for Big Data software technologies:
Proprietary. The software product is owned and controlled by a software company. The
source code is not available to licensees. Customers typically license the product through
a perpetual license that entitles them to indefinite use, with annual maintenance fees for
support and software upgrades. Examples of this model include databases from Oracle,
IBM and Teradata.
Open-Source. The software product and source code are freely available to use.
Companies monetize the software product by selling subscriptions and adjacent products
with value-added components, such as management tools and support services. Examples
of this model include MongoDB (by MongoDB, Inc.) and Hadoop (by Cloudera and
others).
Cloud Service. The service is hosted in a cloud-based environment outside of
customers’ data centers and delivered over the public Internet. The predominant business
model is metered (i.e., pay-per-use) or subscription-based. Examples of this model
include Google App Engine and Amazon Elastic MapReduce.
For many Fortune 1000 companies, regulations and internal policies around data privacy limit
their ability to leverage cloud-based solutions. As a result, most Big Data initiatives are driven
with technologies deployed on-premise. Most of the Big Data pioneers are web companies that
developed powerful software and hardware, which they open-sourced to the larger community.
Accordingly, most of the software used for Big Data projects is open-source.
3. COMMUNITY
In these early days of Big Data, there is an opportunity to learn from others. Organizations
should consider how many other initiatives are being pursued using the same technologies and
with similar objectives. To understand a given technology’s adoption, organizations should
consider the following:
The number of users
The prevalence of local, community-organized events
The health and activity of online forums such as Google Groups and StackOverflow
The availability of conferences, how frequently they occur and whether they are well attended
4. DEVELOPER APPEAL
The market for Big Data talent is tight. The nation’s top engineers and data scientists often flock
to companies like Google and Facebook, which are known havens for the brightest minds and
places where one will be exposed to leading edge technology. If enterprises want to compete for
this talent, they have to offer more than money.
By offering developers the opportunity to work on tough problems, and by using a technology
that has strong developer interest, a vibrant community, and an auspicious long-term future,
organizations can attract the brightest minds. They can also increase the pool of candidates by
choosing technologies that are easy to learn and use — which are often the ones that appeal most
to developers. Furthermore, technologies that have strong developer appeal tend to make for
more productive teams who feel they are empowered by their tools rather than encumbered by
poorly-designed, legacy technology. Productive developer teams reduce time to market for new
initiatives and reduce development costs, as well.
5. AGILITY
Organizations should use Big Data products that enable them to be agile. They will benefit from
technologies that get out of the way and allow teams to focus on what they can do with their
data, rather than how to deploy new applications and infrastructure. This will make it easy to
explore a variety of paths and hypotheses for extracting value from the data and to iterate quickly
in response to changing business needs.
In this context, agility comprises three primary components:
Ease of Use. A technology that is easy for developers to learn and understand -- either
because of the way it’s architected, the availability of tools and information, or both --
will enable teams to get Big Data projects started and to realize value quickly.
Technologies with steep learning curves and fewer resources to support education will
make for a longer road to project execution.
Technological Flexibility. The product should make it relatively easy to change
requirements on the fly—such as how data is modeled, which data is used, where data is
pulled from and how it gets processed as teams develop new findings and adapt to
internal and external needs. Dynamic data models (also known as schemas) and
scalability are capabilities to seek out.
Licensing Freedom. Open-source products are typically easier to adopt, as teams can get
started quickly with free community versions of the software. They are also usually easier
to scale from a licensing standpoint, as teams can buy more licenses as requirements
increase. By contrast, in many cases proprietary software vendors require large, upfront
license purchases, which make it harder for teams to get moving quickly and to scale in
MongoDB’s ease of use, dynamic data model and open-source licensing model make it the most
agile Big Data solution available.
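The idea of a dynamic data model mentioned under Technological Flexibility can be illustrated without any database. In the sketch below a plain Python list stands in for a document collection; this is an assumption made to keep the example self-contained, and it is not the MongoDB API. Documents stored at different times carry different fields, and queries tolerate the difference.

```python
# A plain list of dictionaries stands in for a document collection.
products = []

# First version of the application: flat documents.
products.append({"sku": "A1", "name": "Kettle", "price": 25.0})

# A later requirement adds nested attributes. Older documents stay
# untouched; no schema migration is needed.
products.append({
    "sku": "B2",
    "name": "Lamp",
    "price": 40.0,
    "specs": {"watts": 9, "smart": True},
})

def find(collection, **criteria):
    """Return documents whose top-level fields equal all given criteria.

    A document that lacks one of the fields simply does not match.
    """
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

def smart_products(collection):
    """Return the SKUs of documents whose optional specs mark them as smart."""
    return [doc["sku"] for doc in collection if doc.get("specs", {}).get("smart")]

print(find(products, sku="A1"))
print(smart_products(products))  # ['B2']
```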
6. GENERAL PURPOSE VS. NICHE SOLUTIONS
Organizations are constantly trying to standardize on fewer technologies to reduce complexity, to
improve their competency in the selected tools and to make their vendor relationships more
productive. Organizations should consider whether adopting a Big Data technology helps them
address a single initiative or many initiatives. If the technology is general purpose, the expertise,
infrastructure, skills, integrations and other investments of the initial project can be amortized
across many projects. Organizations may find that a niche technology may be a better fit for a
single project, but that a more general purpose tool is the better option for the organization as a
whole.
4. ADVANTAGES OF BIGDATA
The practical advantages of bigdata are as follows:
Dialogue with consumers
Today’s consumers are a tough nut to crack. They look around a lot before they buy, talk to their
entire social network about their purchases, demand to be treated as unique and want to be
sincerely thanked for buying your products. Big Data allows you to profile these increasingly
vocal and fickle little ‘tyrants’ in a far-reaching manner so that you can engage in an almost
one-on-one, real-time conversation with them. This is not actually a luxury. If you don’t treat them
the way they want to be treated, they will leave you in the blink of an eye.
Just a small example: when any customer enters a bank, Big Data tools allow the clerk to check
his/her profile in real-time and learn which relevant products or services (s)he might advise. Big
Data will also have a key role to play in uniting the digital and physical shopping spheres: a
retailer could suggest an offer on a mobile carrier, on the basis of a consumer indicating a certain
need in the social media.
Re-develop your products
Big Data can also help you understand how others perceive your products so that you can adapt
them, or your marketing, if need be. Analysis of unstructured social media text allows you to
uncover the sentiments of your customers and even segment those in different geographical
locations or among different demographic groups.
On top of that, Big Data lets you test thousands of different variations of computer-aided designs
in the blink of an eye so that you can check how minor changes in, for instance, material affect
costs, lead times and performance. You can then raise the efficiency of the production process
Perform risk analysis
Success not only depends on how you run your company. Social and economic factors are
crucial for your accomplishments as well. Predictive analytics, fueled by Big Data, allows you to
scan and analyze newspaper reports or social media feeds so that you permanently keep up to
speed on the latest developments in your industry and its environment. Detailed health-tests on
your suppliers and customers are another goodie that comes with Big Data. This will allow you
to take action when one of them is at risk of defaulting.
Keeping your data safe
You can map the entire data landscape across your company with Big Data tools, thus allowing
you to analyze the threats that you face internally. You will be able to detect potentially sensitive
information that is not protected in an appropriate manner and make sure it is stored according to
regulatory requirements. With real-time Big Data analytics you can, for example, flag up any
situation where 16 digit numbers – potentially credit card data - are stored or emailed out and
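A check of this kind can be sketched with a regular expression. The example below flags 16-digit numbers in free text and, as an additional step that the text above does not mention, applies the Luhn checksum to reduce false positives such as order numbers. The function names and the sample text are assumptions made for illustration.

```python
import re

# 16 digits, optionally written in groups of four separated by a space or hyphen.
CARD_PATTERN = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")

def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

def find_card_candidates(text):
    """Return 16-digit sequences in `text` that also pass the Luhn check."""
    return [
        match.group()
        for match in CARD_PATTERN.finditer(text)
        if luhn_valid(match.group())
    ]

sample = "Order 1234-5678-9012-3456 shipped; card 4111 1111 1111 1111 on file."
print(find_card_candidates(sample))  # ['4111 1111 1111 1111']
```

The number 4111 1111 1111 1111 is a commonly used test value, not a real card. A match only means that the text deserves a closer look; it does not prove that card data is present.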
Create new revenue streams
The insights that you gain from analyzing your market and its consumers with Big Data are not
just valuable to you. You could sell them as non-personalized trend data to large industry players
operating in the same segment as you and create a whole new revenue stream.
One of the more impressive examples comes from Shazam, the song identification application. It
helps record labels find out where music sub-cultures are arising by monitoring the use of its
service, including the location data that mobile devices so conveniently provide. The record
labels can then find and sign up promising new artists or remarket their existing ones
Customize your website in real time
Big Data analytics allows you to personalize the content or look and feel of your website in real
time to suit each consumer entering your website, depending on, for instance, their sex,
nationality or how they arrived at your site. The best-known example is probably
offering tailored recommendations: Amazon’s use of real-time, item-based, collaborative
filtering (IBCF) to fuel its ‘Frequently bought together’ and ‘Customers who bought this item
also bought’ features or LinkedIn suggesting ‘People you may know’ or ‘Companies you may
want to follow’. And the approach works: Amazon generates about 20% more revenue via this
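The intuition behind recommendations of the "customers who bought this item also bought" kind can be shown with a simple co-occurrence count. The sketch below is a deliberately simplified stand-in, not Amazon's actual algorithm: it counts how often two items appear in the same basket and ranks the partners of an item by that count. The baskets and names are made up for this example.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical purchase history: one set of items per basket.
baskets = [
    {"camera", "sd_card", "tripod"},
    {"camera", "sd_card"},
    {"camera", "bag"},
    {"tripod", "bag"},
]

# co_counts[a][b] = number of baskets that contain both a and b.
co_counts = defaultdict(Counter)
for basket in baskets:
    for item_a, item_b in permutations(basket, 2):
        co_counts[item_a][item_b] += 1

def also_bought(item, k=2):
    """Return up to k items most often bought together with `item`.

    Ties are broken alphabetically so that the result is deterministic.
    """
    ranked = sorted(co_counts[item].items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]

print(also_bought("camera"))  # ['sd_card', 'bag']
```

Item-based collaborative filtering as used in production systems typically normalizes such counts into a similarity score, so that very popular items do not dominate every recommendation.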
Reducing maintenance costs
Traditionally, factories estimate that a certain type of equipment is likely to wear out after so
many years. Consequently, they replace every piece of that technology within that many years,
even devices that have much more useful life left in them. Big Data tools do away with such
impractical and costly averages. The massive amounts of data that they access and use and their
unequalled speed can spot failing grid devices and predict when they will give out. The result: a
much more cost-effective replacement strategy for the utility and less downtime, as faulty
devices are tracked a lot faster.
Offering tailored healthcare
We are living in a hyper-personalized world, but healthcare seems to be one of the last sectors
still using generalized approaches. When someone is diagnosed with cancer they usually undergo
one therapy, and if that doesn’t work, the doctors try another, etc. But what if a cancer patient
could receive medication that is tailored to their individual genes? This would result in a better
outcome, less cost, less frustration and less fear.
With human genome mapping and Big Data tools, it will soon be commonplace for everyone to
have their genes mapped as part of their medical record. This brings medicine closer than ever to
finding the genetic determinants that cause a disease and developing drugs expressly tailored to
treat those causes — in other words, personalized medicine.
Offering enterprise-wide insights
Previously, if business users needed to analyze large amounts of varied data, they had to ask their
IT colleagues for help as they themselves lacked the technical skills for doing so. Often, by the
time they received the requested information, it was no longer useful or even correct. With Big
Data tools, the technical teams can do the groundwork and then build repeatability into
algorithms for faster searches. In other words, they can develop systems and install interactive
and dynamic visualization tools that allow business users to analyze, view and benefit from the
Making our cities smarter
To help them deal with the consequences of their fast expansion, an increasing number of smart
cities are indeed leveraging Big Data tools for the benefit of their citizens and the environment.
The city of Oslo in Norway, for instance, reduced street lighting energy consumption by 62%
with a smart solution. Since the Memphis Police Department started using predictive software in
2006, it has been able to reduce serious crime by 30%. The city of Portland, Oregon, used
technology to optimize the timing of its traffic signals and was able to eliminate more than
157,000 metric tons of CO2 emissions in just six years – the equivalent of taking 30,000
passenger vehicles off the roads for an entire year.
These are a few practical examples of Big Data.
5. RISKS OF BIGDATA
Big data has gotten a lot of press recently – and rightly so. With the vast amounts of data now
available, we can do more than could have been imagined in previous decades. But there is
another face to big data … and that is, companies now have to manage some very big risks.
It’s hard to visualize the amount of data we’re talking about. But as one article put it, “In 2011
alone, 1.8 zettabytes (or 1.8 trillion gigabytes) of data will be created, the equivalent to every
U.S. citizen writing 3 tweets per minute for 26,976 years.” And this number is anticipated to
grow 50-fold by the year 2020.
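The growth figure quoted above implies a steep yearly rate. As a rough worked example, assuming steady compound growth between 2011 and 2020 (an assumption the quoted article does not state), the projected volume and the implied annual growth rate can be computed as follows.

```python
BASE_YEAR, TARGET_YEAR = 2011, 2020
base_zettabytes = 1.8   # data created in 2011, as quoted in the text
growth_factor = 50      # anticipated growth by 2020, as quoted in the text

# Projected yearly volume in 2020.
projected_zettabytes = base_zettabytes * growth_factor

# Constant yearly rate r that satisfies (1 + r) ** years == growth_factor.
years = TARGET_YEAR - BASE_YEAR
annual_rate = growth_factor ** (1 / years) - 1

print(f"Projected volume in {TARGET_YEAR}: {projected_zettabytes:.0f} zettabytes")
print(f"Implied annual growth: {annual_rate:.1%}")  # roughly 54% per year
```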
Risk #1: Loss of agility
In a typical large-scale organization, data is housed on multiple platforms. There is transactional
data, email data, analytics data, etc. Management wants people to be able to locate, analyze, and
make decisions based on this data quickly. It is a necessity in today’s marketplace where
conditions can change in an instant. But if the data isn’t evaluated, organized, and stored
properly, critical information can be either difficult or impossible to find – slowing a business
down at the exact moment when speed is essential.
Risk #2: Loss of compliance
Laws are getting more and more complex with regard to how long companies need to retain
data, how they need to retain it, and where they need to retain it. There are both general
regulations in place as well as state- or industry-specific regulations that may apply. It is not
uncommon for regulators to perform random audits to examine a company’s policies regarding
data and their actual management of that data. A compliance failure can result in significant fines
or reputational damage.
Risk #3: Loss of security
With more data located in and moving between more places than ever before, there are also a
vastly increased number of ways to hack into that data. A security breach can result in theft,
fraud, fines … and, of course, reputational loss. No company wants to be featured on the front
page of the Wall Street Journal because they’ve been hacked.
Risk #4: Loss of money
As the amount of data grows, it is all too tempting to simply throw more servers at the problem.
After all, storage is cheap, isn’t it? But consider this: I once worked with a client who said they
needed an entire new data center to house their data. SunGard Availability Services did studies
and found that not only did they not need a new data center; they actually needed only half their
current storage because they simply weren’t managing their data well. A server may seem
inexpensive at first glance – but never assume that storage is cheap.
Big data is a good thing. No question about it. But big risky data is a bad thing. Companies today
need to manage their data to minimize their risk. This involves having policies that are in
compliance with regulatory standards, processes that cover all contingencies, retention schedules
that are up to date, and a consistent self-evaluation to determine what data is necessary for the
proper functioning of the company.
The more efficiently companies store, manage, and host their data, the more agile, compliant,
secure, and cost-effective they will be.
And that will take the big risk out of big data.
We have entered an era of Big Data. Through better analysis of the large volumes of data that are
becoming available, there is the potential for making faster advances in many scientific
disciplines and improving the profitability and success of many enterprises. However, many
technical challenges described in this paper must be addressed before this potential can be
realized fully. The challenges include not just the obvious issues of scale, but also heterogeneity,
lack of structure, error-handling, privacy, timeliness, provenance, and visualization, at all stages
of the analysis pipeline from data acquisition to result interpretation. These technical challenges
are common across a large variety of application domains, and therefore not cost-effective to
address in the context of one domain alone. Furthermore, these challenges will require
transformative solutions, and will not be addressed naturally by the next generation of industrial
products. We must support and encourage fundamental research towards addressing these
technical challenges if we are to achieve the promised benefits of BigData.