The Term Bigdata stems from Characterisized
Volume: Large Volume of data
Velocity: amount of data per seconds
Variability: level of unintentional
modification affecting data Quality
throughout lifecycle of data.
Value: Value derived from data.
Variety: large range of data which is received
from video , audio, text, image.
Sources Example by 5V.
Volume: Youtube, large volume of video feeds received
and maintained at many video sites like youtube,
Variety: Large variety of data text, audio, video,
images, received in sites like facebook, twitter, other
social media platforms.
Velocity: Speed at which data is received in sites like
twitter, facebook (1 billion people all feeding there
data on one site)
Batch Processing Vs Real Time processing
Batch Jobs run at particular time of day like Nightly jobs
or morning jobs which depends on slack time When
server has less load.
But people now want to see the Status like in
transportation when bus is arriving on particular stand
in real time. Or in Retail as soon they update there status
the require real time advertisements. This is shaping
move towards Big data.
Problems differentiated by 5V.
Velocity: With large volume of data received and quick turn
around latency required to reflect the data fed at facebook then
Can it be managed by regular DBMS?
DBMS- maintains ACID properties & have lots of constraints like
primary, foreign keys, check constraints etc.. with quick
turnaround or short latency required these constraints add up
processing time and volume required for storage. So all of these
sites have there own File based storage DBMS like systems with
does not have these constraints. All data is maintained in files, id
assigned to files are indexed and regularly moved (these are
publically know open sourced databases like Cassandra developed
by facebook, BigTable by Google, etc…)
Most of this databases are popularly Categorized as NoSQL
Technology Company Open Sourced On
Cassandra DataStax Apache Cassandra
used by Facebook , Linkedin ,
BigTable Google Google BigTable
Apache HBase Apache HBase ( used by many
companies most popular)
MongoDB MongoDB Inc. Apache (written on C++,Erlang,C)
Couchbase CouchBase Inc Apache (written on Erlang)
Category No SQL database
Accumulo, Cassandra, Hbase.
Document Clusterpoint,Couchdb, Couchbase, MarkLogic, MongoDB
Key-Value Dynamo, FoundationDB, MemcacheDB, Redis, Riak, FairCom c-
Graph Allegro, Neo4J, OrientDB, Virtuoso, Stardog
- Column Oriented DB store database store Values in Column By Column rather
in other RDBMS row by row.
- It leads to better Compression Of data and hence less space required to store
- There are Still higher Compression can be achieved when used Probabilistic
- Similarly Document oriented Store and arrange data in form of documents.
- Key-Value store Data in form of collection of Key-value pairs. Allowing add,
insert, delete to key-value pairs.
- Graph Databases: Every Element is direct pointer to its adjacent hence no-
Go through the link below:
As we know now Bigdata is solving problems of 5V like
the huge (V)olume of storage required for video sites
like youtube. Etc.
It’s changing how We perceive and Visualize or
Analyze data like HBase used for data storage, Mahout
of used to run analytics and find patterns. These
databases have variety of data which require different
kind of processing cannot be achieved by traditional
RDBMS based products. Example link below:
Map-Reduce Algorithm was starting point of All we see
in BigData created by Google researcher.
Mapper divides work into multiple parallel task, sorts
within queue and filters into queue of say 1 queue for
Reducer Component Aggregates data or summarizes
from multiple units.
So Since data is mostly unstructured the best way to
analyze unstructured data is using Analytics here
Comes New Career Called Data Scientist.
Skill Set Required for Data Scientist:
Mathematics (mostly statistics), Computer Science,
Domain like Sociology (like Social Media Analysis),
One application of Bigdata has been to gather
feedback about product from social media.
Here is Sample project Report below How and what
tools can be used to Analyze social media.
Hadoop allows to distribute load among many
There can be Database clusters, OS clusters,
Application Web server level clustering But here we
are dealing with OS like Distributed File System(DFS).
Hadoop DFS (HDFS) File system developed by yahoo
Competes with BigTable of Google providing quick
storage and retrieval of data in form of files used by
many social media platforms.
‘R’ was open source Statistical Analysis language
having Statistical Constructs available used for
Analysis of data.
Java data mining API, .Net data mining API , python
libraries are used to mine and understand trends in
PIG is another Apache Hadoop based system used
provide High level language for analyzing large data
Data Science http://thedatascience.wordpress.com/
Big Data :http://thebigdatatrends.wordpress.com
Data Science Blog2:
Retail generates huge amount of data for product
positioned on different shelf at store, replenishment level,
reorder level, merchandising, assortment planning all this
data most of it usually structured Since lots of system is
Automated but there are lots of forms, customer feedback,
planning data analysis of mails other chat platforms.
Large Warehouses of Retail store needs plan positioning
and containers in Aisle.
Analyze trends from social media to find customer
preferences for products and offers.
Retail Innovation read:
Retail uses lots of Sensors for tracking items with
warehouse and inside Store. The Huge real time data
(video , text and other forms) generated every milli-
second from Sensors embedded across every store and
warehouse Cannot be analyzed by any other medium
better than in Hadoop or Bigdata based System.
Finance being Game of numbers huge data from Book
of accounts, P&L, Balance sheets of etc accumulates of
different business over a period of time But most
books are Structured and hence the data. But Hadoop
offers huge scalable clusters to quickly analyze
structured data as well.
Lots of social media data about interest for share or
any instrument does get reflected in numbers.
Spreadsheets are popular medium of analysis and
other textual forms can be better analyzed if available
over Hadoop like clusters for a kind of semi-structured