“Big Data” is data whose scale, diversity, and
complexity require new architectures, techniques,
algorithms, and analytics to manage it and to extract
value and hidden knowledge from it.
Most analysts and practitioners currently refer to data
sets from 30-50 terabytes (1000 gigabytes per terabyte)
to multiple petabytes (1000 terabytes per petabyte) as
big data.
Progress and innovation are no longer hindered by the ability to collect data,
but by the ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely manner and in a scalable fashion.
Social media and networks
(all of us are generating data)
(collecting all sorts of data)
(tracking all objects all the time)
Sensor technology and
(measuring all kinds of data)
Volume: The massive scale and growth of
unstructured data outstrips traditional storage and
Velocity: Data is generated in real time, with
demands for usable information to be served up
Variety: Data is generated in many forms, including
relational data, text data, semi-structured data, and graph data.
There were 5 billion mobile phones in use in 2010.
There is a 40% projected growth in global data
generated per year vs. 5% growth in global IT spending.
There were 235 terabytes of data collected by the US
Library of Congress in April 2011.
15 out of 17 major business sectors in the United States
have more data stored per company than the US Library
of Congress.
The complex nature of big data is primarily driven by the
unstructured nature of much of the data that is generated
by modern technologies, such as web logs, radio-frequency
identification (RFID), sensors embedded in devices,
machinery, vehicles, Internet searches, social networks such
as Facebook, portable computers, smart phones and other
cell phones, GPS devices, and call center records.
In most cases, in order to effectively utilize big data, it must
be combined with structured data (typically from a
relational database) from a more conventional business
application, such as Enterprise Resource Planning (ERP) or
Customer Relationship Management (CRM).
Global market for big data
Industry size:
Today every organisation across the globe is faced with unprecedented growth in
data.
The digital universe of data was expected to expand to 2.7 zettabytes (ZB) by the end of
2012. It is then predicted to double every two years, reaching 8 ZB by 2015. It's
hard to conceptualize this quantity of information.
The US Library of Congress holds 462 terabytes (TB) of digital data. By that measure, 8 ZB is
equivalent to almost 18 million Libraries of Congress.
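The comparison above is a simple unit conversion, and it can be checked in a few lines. The sketch below assumes decimal (SI) units, in line with the 1000-per-step convention used earlier in this deck; the variable names are illustrative only. The ratio comes out at about 17.3 million, which the slide rounds up.

```python
# Decimal (SI) units: each step up is a factor of 1000.
TB = 10**12  # bytes in one terabyte
ZB = 10**21  # bytes in one zettabyte

library_of_congress_bytes = 462 * TB  # digital holdings quoted on the slide
digital_universe_bytes = 8 * ZB       # 2015 projection quoted on the slide

# How many Library-of-Congress-sized collections fit into 8 ZB?
libraries = digital_universe_bytes / library_of_congress_bytes
print(f"{libraries / 1e6:.1f} million Libraries of Congress")
```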
That translates to a ten-fold increase over the last five years and an astounding 29-fold
increase over the next ten years.
This year, the world’s digital information is expected to grow by 57%. Within that,
internet traffic is growing by 35%, and mobile data traffic at 110%, according to Cisco.
The big data industry is worth somewhere between $30bn and $200bn.
Smartphones, tablets, sensors, social networks, online
games, video streams and mobile payments will all drive
big data for many years to come.
Amazon, Apple, Facebook, Google, Microsoft
The big Internet companies control where the data comes
from and where it goes.
Amazon, Baidu, Facebook and Google may one day make a
lucrative side business from selling their proprietary
distributed database technologies, competing with IBM
Data storage, networking and hardware companies:
ARM, BROCADE, CISCO, DELL, EMC, HP, INTEL, LENOVO,
Many hardware makers like Cisco, Dell, Lenovo and HP are
investing heavily in big data appliances
Data storage companies are likely to continue to beat
earnings expectations as the data deluge goes into
Enterprise software companies:
Adobe, Citrix System, IBM, Fujitsu, Informatica, Oracle,
Red Hat, SAP, Salesforce.com
Hadoop is fast becoming the industry standard enterprise
Cloud database services are likely to be the fastest growth
sector this year within the enterprise software space
A wide variety of techniques and technologies has been developed and adapted to
aggregate, manipulate, analyze, and visualize big data.
BIG DATA TECHNIQUES:
A/B testing: A technique in which a control group is compared with a variety of test
groups in order to determine what treatments (i.e., changes) will improve a given
objective variable, e.g., marketing response rate.
This technique is also known as split testing or bucket testing. An example application is
determining what copy text, layouts, images, or colors will improve conversion rates on
an e-commerce Web site.
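One common way to decide whether a test group really outperforms the control is a two-proportion z-test on the conversion rates. The slide does not prescribe a particular statistical test, so this is only a minimal sketch under that assumption; the function name and the visitor and conversion counts are made up for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a  # control conversion rate
    p_b = conv_b / n_b  # test-group conversion rate
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: 10,000 visitors per group.
z, p = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=250, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference in response rate is unlikely to be chance alone; with several test groups, each comparison against the control would normally also need a multiple-comparison correction.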
Association rule learning: A set of techniques for discovering interesting
relationships, i.e., “association rules,” among variables in large databases.
These techniques consist of a variety of algorithms to generate and test possible rules.
One application is market basket analysis, in which a retailer can determine which
products are frequently bought together and use this information for marketing (a
commonly cited example is the discovery that many supermarket shoppers who buy
diapers also tend to buy beer). Used for data mining.
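As a rough illustration of market basket analysis, the sketch below counts how often pairs of items appear together and derives the three usual rule metrics: support, confidence, and lift. It is restricted to item pairs and uses a brute-force count rather than a production algorithm such as Apriori; the baskets and the `pair_rules` helper are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.3):
    """Return rules lhs -> rhs over item pairs as (lhs, rhs, support, confidence, lift)."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # de-duplicate and fix the pair ordering
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]     # P(rhs | lhs)
            lift = confidence / (item_counts[rhs] / n)  # confidence relative to P(rhs)
            rules.append((lhs, rhs, support, confidence, lift))
    return rules

# Made-up shopping baskets.
baskets = [
    ["diapers", "beer", "milk"],
    ["diapers", "beer"],
    ["diapers", "bread"],
    ["milk", "bread"],
    ["beer", "diapers", "bread"],
]
for lhs, rhs, s, c, l in pair_rules(baskets):
    print(f"{lhs} -> {rhs}: support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

A lift above 1 indicates that the two items appear together more often than they would if purchases were independent.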
Cluster analysis: A statistical method for classifying objects
that splits a diverse group into smaller groups of similar
objects, whose characteristics of similarity are not known in
advance.
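k-means is one widely used clustering algorithm, chosen here only as an example since the slide does not name a specific method. The sketch below runs it on a handful of made-up 2-D points. Seeding the centroids with the first k points is an assumption that keeps the example deterministic; it is not a robust initialisation strategy.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means on 2-D points; returns (centroids, labels)."""
    # Assumption: seed centroids with the first k points (deterministic, not robust).
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, labels

# Made-up data: two visually obvious groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.9, 1.1), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The group labels are not supplied up front; they emerge from the distances between the points.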
Crowdsourcing: A technique for collecting data submitted by
a large group of people or community through an open call,
usually through networked media such as the Web.
Statistics: The science of the collection, organization, and
interpretation of data, including the design of surveys and
experiments.
BIG DATA TECHNOLOGIES
There is a growing number of technologies used to
aggregate, manipulate, manage, and analyze big data.
Big Table: Proprietary distributed database system
built on the Google File System. Tables are
split into multiple tablets. When the size of the data grows
beyond limits, tablets are compressed using
algorithms such as Snappy.
Business intelligence (BI): A type of application
software designed to report, analyze, and present
data. BI tools are often used to read data that have
been previously stored in a data warehouse or data
mart. BI tools can also be used to create standard
reports that are generated on a periodic basis, or
to display information on real-time management
dashboards, i.e., integrated displays of metrics
that measure the performance of a system.
Data warehouse: Specialized database optimized for reporting,
often used for storing large amounts of structured data. Data is
uploaded using ETL (extract, transform, and load) tools from
operational data stores, and reports are often generated using
business intelligence tools.
Extract, transform, and load (ETL): Software tools used to extract
data from outside sources, transform them to fit operational
needs, and load them into a database or data warehouse.
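The three ETL stages can be sketched with nothing but the Python standard library. Everything specific here is an assumption for illustration: the inline CSV stands in for an outside source, the `sales` table and its columns are hypothetical, and an in-memory SQLite database plays the role of the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from an outside source.
RAW_CSV = """customer,amount,currency
 alice ,10.50,usd
BOB,3,USD
carol,not-a-number,usd
"""

def extract(text):
    """Extract: parse the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalise names and currency codes, drop unparseable amounts."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        clean.append((row["customer"].strip().lower(), amount, row["currency"].upper()))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
print(conn.execute("SELECT customer, amount, currency FROM sales").fetchall())
```

Dedicated ETL tools add scheduling, error handling, and connectors on top of this basic extract, transform, load flow.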
Hadoop: An open source (free) software framework for
processing huge datasets for certain kinds of problems on a
distributed system. Its development was inspired by Google's
MapReduce and Google File System.
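The MapReduce programming model behind Hadoop can be illustrated with a word count that runs in a single process. This is not the Hadoop API; the function names are hypothetical and only mimic the map, shuffle, and reduce phases that a real cluster would spread across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

documents = ["big data is big", "data about data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each map call sees only one document and each reduce call sees only one key, both phases can in principle run in parallel on different nodes.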
HBase: An open source (free), distributed, non-relational
database modeled on Google's Big Table. It provides a fault-tolerant
way of storing large quantities of data.
Data intent and capacity
• The data revolution
• Intent in an age of growing volatility
Social Science and Policy Applications
• Access and Sharing
• Defining and Detecting Anomalies in Human Ecosystems
• HP’s Big Data strategy and Vertica
• CSC Buys Infochimps for Big Data, Analytics Expertise
• Market Intelligence Provider FirstRain Unveils New Big Data Tool,
Whilst big data industry revenues are certain to grow, investors face
Today, internet bandwidth prices are capped, effectively making internet
bandwidth a free resource for big data companies. But, without
substantial investment by the world’s mobile operators, big data is likely
to grow far faster than the ability of the network to carry it.
As networks get overloaded, network latency rises, reducing the speed
and efficiency of analytical engines, especially those powered through
the cloud. The coming mobile bandwidth shortage will shift competitive
advantage from technology companies to telecom operators.
Open source risk
With the source code free, barriers to entry remain low. In the longer term, this may depress
the database industry’s margins.
Ever since Apple took on the mobile phone industry – and won – with barely a handful of
mobile patents to its name, a patent war has erupted across the technology sector. Were a
patent war to break out in the big data space, technological progress could be slowed down.
Whilst regulators are unlikely to allow any hoarding of patents on anti-competitive grounds,
the risk remains. Oracle, a leader in big data, is
well known for filing multi-billion dollar patent infringement lawsuits against its competitors.
Last month Global Payments, a credit card transaction processor, admitted that hackers had
stolen the details of 1.5m North American card holders. This is the latest in a string of
security breaches that have hit companies dealing in big data. Apple, EMC, Google, Oracle and
Sony are all recent hacking victims. As the level of cyber-crime rises, so does the risk of
dealing with big data. Just as the Fukushima incident dampened prospects for the nuclear
sector, so a large cyber-attack could adversely impact big data industry profits.
Often misunderstood and ill-applied
The question is not “how big is your data?”, it is “what are
you doing with your data?”
It fails to supply its customers with products that solve
Companies searching for data solutions are often confused
by all the big data marketing hype and sometimes end up