What is Big Data?
As per Wikipedia, big data is the term for a collection of data sets so
large and complex that it becomes difficult to process them using on-hand
database management tools or traditional data processing applications.
The challenges include capture, curation, storage, search, sharing,
transfer, analysis, and visualization.
As per Gartner, "Big data is high-volume, high-velocity and high-variety
information assets that demand cost-effective, innovative forms of
information processing for enhanced insight and decision making."
Any data that cannot be processed in some manner using current relational
database technologies can be considered Big Data.
Big Data Age
We are living in the age of big data.
Data is collected from everywhere and stored in warehouses.
Social networks – Facebook, Twitter, Flickr, Google, Amazon, etc.
Bank / credit card transactions – used, for example, for sending promotional offers.
E-commerce – eBay, Amazon.
1 - Data Volume
The Large Hadron Collider (LHC) at CERN generates approximately
one petabyte of data per second. CERN stores 25 petabytes of
data per year.
Large Synoptic Survey Telescope (LSST), http://lsst.org/lsst/ – over 30
thousand gigabytes (30 TB) of images will be generated every
night during the decade-long LSST sky survey.
eBay has a 90-petabyte data warehouse.
Petabyte data sets are common these days, and exabyte-scale data is not far off.
Data volume is increasing exponentially.
2 - Data Velocity
Initially, companies analyzed data using a batch process: take
a chunk of data, submit a job to the server, and wait for the results.
This works only when the incoming data rate is slower than the batch
processing rate and the data is still useful despite the delay.
With the new sources of data such as social and mobile
applications, the batch process breaks down.
The data is now streaming into the server in real time, in a
continuous fashion, and the result is only useful if the delay is very
short.
Twitter users send over 400 million tweets per day.
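The contrast between batch and streaming processing can be sketched in a few lines of Python. The event data and the sliding-window size below are invented purely for illustration:

```python
from collections import deque

# Hypothetical event stream: (timestamp, value) pairs arriving continuously.
events = [(t, t % 7) for t in range(20)]

# Batch style: collect everything first, then process it once at the end.
# The answer is only available after all the data has arrived.
def batch_average(all_events):
    values = [v for _, v in all_events]
    return sum(values) / len(values)

# Streaming style: update the answer as each event arrives, keeping
# only a small sliding window of recent events in memory.
def stream_averages(event_iter, window=5):
    recent = deque(maxlen=window)
    for _, v in event_iter:
        recent.append(v)
        yield sum(recent) / len(recent)  # a fresh result per event

print(batch_average(events))              # one answer, after all data is in
print(list(stream_averages(events))[-1])  # latest windowed answer
```

The streaming version never holds the full data set, which is what makes it viable when the input rate outpaces any batch job.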
3 - Data Variety
Previously, with relational databases, data was stored in tables and
spreadsheets.
Today's formats include pure text, photos, audio, video, web pages, GPS
data, sensor data, relational databases, documents, SMS, PDF, Flash,
social media data, etc.
One no longer has control over the input data format. Structure can
no longer be imposed, as in the past, in order to keep control over
the data.
On Facebook we post images, audio files, comments (text), etc.
Google uses smartphones as sensors to determine traffic conditions,
combining their GPS data and internet connectivity with sources such as
traffic cameras and RFID tags in electronic payment systems.
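One practical consequence of data variety is that a pipeline must normalize heterogeneous inputs into a common shape before analysis. The sample records and field names below are made up for illustration:

```python
import json

# Hypothetical mixed-format records, as an uncontrolled feed might deliver them:
raw_records = [
    '{"type": "comment", "user": "alice", "text": "nice photo"}',  # JSON
    '{"type": "gps", "lat": 51.5, "lon": -0.12}',                  # sensor data
    'SMS:bob:running late',                                        # ad-hoc text
]

def normalize(record):
    """Coerce each record into a common dict shape, whatever its source format."""
    if record.startswith('{'):
        return json.loads(record)
    if record.startswith('SMS:'):
        _, user, text = record.split(':', 2)
        return {"type": "sms", "user": user, "text": text}
    return {"type": "unknown", "raw": record}  # keep unrecognized data, don't drop it

for rec in map(normalize, raw_records):
    print(rec["type"])  # comment, gps, sms
```

Since the input format cannot be dictated upstream, the fallback branch matters: unknown formats are preserved rather than rejected.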
Importance of Big Data
Amazon.com handles millions of back-end operations every day, as
well as queries from more than half a million third-party sellers. The core
technology that keeps Amazon running is Linux-based and as of 2005
they had the world’s three largest Linux databases, with capacities of
7.8 TB, 18.5 TB, and 24.7 TB.
Facebook handles 50 billion photos from its user base.
In 2012, the Obama administration announced the Big Data Research
and Development Initiative, which explored how big data could be
used to address important problems faced by the government. The
initiative was composed of 84 different big data programs spread
across six departments.
At the LHC, even with less than 0.001% of the sensor stream data kept, the
data flow from all four LHC experiments represents a 25-petabyte annual
rate before replication (as of 2012). This becomes nearly 200 petabytes
after replication.
Tools used in Big Data Scenarios
NoSQL databases – MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase,
Hypertable, Voldemort, Riak, ZooKeeper
MapReduce and data processing – Hadoop, Hive, Pig, Cascading, Cascalog,
MapR, Flume, Kafka, Azkaban, Oozie, Greenplum
Storage – S3, Hadoop Distributed File System (HDFS)
Servers and platforms – EC2, Google App Engine, Elastic Beanstalk, Heroku
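Several of the tools above (Hadoop, Hive, Pig) build on the MapReduce model: map each input piece to key-value pairs independently, shuffle by key, then reduce each group. A single-machine word-count sketch of that model, with an invented toy corpus, might look like this:

```python
from collections import defaultdict
from itertools import chain

# A toy corpus standing in for files stored in HDFS.
documents = ["big data is big", "data is everywhere"]

# Map phase: each document independently emits (word, 1) pairs,
# so in a real cluster this step runs in parallel across machines.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle phase: group all emitted pairs by key (the word).
def shuffle(pairs):
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

# Reduce phase: combine each word's counts into a single total.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(chain.from_iterable(map(map_phase, documents))))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

The point of the model is that the map and reduce steps are embarrassingly parallel; only the shuffle requires moving data between machines.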