Technology     Company          Open-source project / notes
Cassandra      DataStax         Apache Cassandra (used by Facebook, LinkedIn, ...)
BigTable       Google           Google BigTable
Apache HBase   Apache           HBase (used by many companies; most popular)
MongoDB        MongoDB Inc.     MongoDB (written in C++)
Couchbase      Couchbase Inc.   Apache (written in C++, Erlang, C)
Category       NoSQL databases
Column         Accumulo, Cassandra, HBase
Document       Clusterpoint, CouchDB, Couchbase, MarkLogic, MongoDB
Key-Value      Dynamo, FoundationDB, MemcacheDB, Redis, Riak, FairCom c-
Graph          Allegro, Neo4j, OrientDB, Virtuoso, Stardog
- A column-oriented database stores values column by column,
rather than row by row as in an RDBMS.
- This leads to better compression of the data, and hence less space is required to store it.
- Still higher compression can be achieved when used with probabilistic
- Similarly, a document-oriented store arranges data in the form of documents.
- A key-value store holds data as a collection of key-value pairs, allowing
pairs to be added, inserted, and deleted.
- Graph databases: every element holds a direct pointer to its adjacent elements, hence no-
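Why a column-by-column layout compresses well can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample rows, the function names, and the choice of run-length encoding are all hypothetical, and real column stores such as HBase or Cassandra use their own storage formats.

```python
from itertools import groupby

# Hypothetical sample rows: (order_id, country, status).
rows = [
    (1, "IN", "shipped"),
    (2, "IN", "shipped"),
    (3, "IN", "pending"),
    (4, "US", "pending"),
    (5, "US", "pending"),
]


def to_columns(rows):
    """Turn a row-by-row layout into a column-by-column layout."""
    return [list(column) for column in zip(*rows)]


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


order_ids, countries, statuses = to_columns(rows)

# Values of one column sit next to each other, so repeats form long runs.
encoded_countries = run_length_encode(countries)
encoded_statuses = run_length_encode(statuses)

print(encoded_countries)  # [('IN', 3), ('US', 2)]
print(encoded_statuses)   # [('shipped', 2), ('pending', 3)]
```

In the row layout the repeated country and status values are separated by other fields, so the same trick would find no runs to collapse.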
Go through the link below:
The term Big Data stems from data characterized by the 5 Vs:
Volume: a large volume of data.
Velocity: the amount of data arriving per second.
Variability: the level of unintentional modification
affecting data quality throughout the data lifecycle.
Value: the value derived from the data.
Variety: the large range of data types received,
such as video, audio, text, and images.
Example sources by 5V.
Volume: the large volume of video feeds
received and maintained at video
sites such as YouTube and Vimeo.
Variety: the large variety of data (text, audio,
video, images) received at sites such as
Facebook, Twitter, and other social media.
Velocity: the speed at which data is received at
sites such as Twitter and Facebook (1 billion people
all feeding their data into one site).
Batch Processing vs. Real-Time Processing
Batch jobs run at a particular time of day, such as
nightly or morning jobs, chosen according to the
slack time when the server has less load.
But people now want to see status in real time, as in
transportation, where they want to know when a bus is
arriving at a particular stand. Or in retail, where
real-time advertisements are required as soon as
customers update their status. This is shaping the move
towards Big Data.
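The difference between the two styles can be sketched as follows. This is a minimal illustration, not how any particular platform does it: the bus delay figures and the names `batch_average` and `RunningAverage` are hypothetical.

```python
# Hypothetical bus-arrival delays in minutes, in the order they are reported.
delays = [4, 7, 1, 6, 2]


def batch_average(stored_delays):
    """Batch style: run once, later, over everything collected so far."""
    return sum(stored_delays) / len(stored_delays)


class RunningAverage:
    """Real-time style: refresh the answer as each event arrives."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count


# The nightly job sees the whole day at once.
nightly_result = batch_average(delays)

# The live view has a fresh answer after every single report.
live = RunningAverage()
live_results = [live.update(delay) for delay in delays]

print(nightly_result)  # 4.0
print(live_results)    # [4.0, 5.5, 4.0, 4.5, 4.0]
```

Both end at the same number; the real-time version simply has an answer available after every event instead of once per run.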
Problems differentiated by 5V.
Velocity: with the large volume of data received at Facebook and the
quick turnaround required to reflect it, can the data be managed by
a regular DBMS?
A DBMS maintains ACID properties and has many constraints, such as
primary keys, foreign keys, and check constraints. When a quick
turnaround or short latency is required, these constraints add
processing time and storage volume. So all of
these sites have their own file-based, DBMS-like storage
systems that do not have these constraints. All data is
maintained in files, and the IDs assigned to the files are indexed and
regularly moved (these are publicly known
databases such as Cassandra, developed at Facebook and later
open sourced, and BigTable by Google).
Most of these databases are popularly categorized as NoSQL.
As we now know, Big Data addresses 5V problems such as
the huge (V)olume of storage required by video sites
like YouTube.
It is changing how we perceive, visualize, and analyze
data: HBase is used for data storage, and Mahout is used
to run analytics and find patterns. These databases
hold a variety of data that requires different kinds of
processing, which cannot be achieved with traditional
RDBMS-based products. Example link below:
The MapReduce algorithm, created by Google, was the
starting point of all we see in Big Data.
The mapper divides the work into multiple parallel
tasks, then sorts and filters the records into
queues, say one queue for each name.
The reducer component aggregates or
summarizes the data from the multiple units.
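The mapper and reducer roles can be sketched in plain Python. This is a single-process illustration under assumptions: the input chunks and the function names `mapper`, `shuffle`, and `reducer` are hypothetical, and a real Hadoop job would run the map tasks in parallel on different machines.

```python
from collections import defaultdict

# Hypothetical input: each string stands for one chunk of a larger file.
chunks = ["anna bob anna", "bob carl", "anna"]


def mapper(chunk):
    """Emit a (name, 1) pair for every name found in one chunk."""
    return [(name, 1) for name in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key: one queue per name, keys in sorted order."""
    queues = defaultdict(list)
    for key, value in pairs:
        queues[key].append(value)
    return dict(sorted(queues.items()))


def reducer(key, values):
    """Summarize one queue into a single count for that name."""
    return key, sum(values)


# Each chunk is mapped independently, which is what makes the step parallelizable.
mapped = [pair for chunk in chunks for pair in mapper(chunk)]
queues = shuffle(mapped)
counts = dict(reducer(key, values) for key, values in queues.items())

print(queues)  # {'anna': [1, 1, 1], 'bob': [1, 1], 'carl': [1]}
print(counts)  # {'anna': 3, 'bob': 2, 'carl': 1}
```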
Since the data is mostly unstructured, and the best
way to analyze unstructured data is with
analytics, a new career has emerged: the Data Scientist.
Skill set required for a Data Scientist:
mathematics (mostly statistics), computer
science, and a domain such as sociology (like social
One application of Big Data has been to
gather feedback about products from social media.
Here is a sample project report below on how
and what tools can be used to analyze social media.
Hadoop allows load to be distributed among many machines.
There can be database clusters, OS clusters, and
application or web server level clustering, but
here we are dealing with an OS-like Distributed
File System (DFS). The Hadoop DFS (HDFS) file
system, developed largely at Yahoo, is the open-source
counterpart of the Google File System that underlies
BigTable at Google. It provides quick storage
and retrieval of data in the form of files and is used by
many social media platforms.
R is an open-source statistical analysis
language with built-in statistical constructs,
used for the analysis of data.
Java data mining APIs, .NET data mining APIs, and
Python libraries are used to mine and
understand trends in data.
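What "understanding a trend" means can be sketched with a least-squares line. This is an illustration only: the weekly sales figures and the function name `linear_trend` are hypothetical, and plain Python is used here in place of any specific data mining library.

```python
# Hypothetical weekly sales figures for one product.
weekly_sales = [120, 135, 150, 160, 185]


def linear_trend(values):
    """Return (slope, intercept) of the least-squares line through the values.

    The x axis is simply the position of each value: 0, 1, 2, ...
    """
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


slope, intercept = linear_trend(weekly_sales)

# A positive slope means sales are trending upward week over week.
print(slope, intercept)  # 15.5 119.0
```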
Pig is another Apache Hadoop-based system; it
provides a high-level language for
analyzing large data sets.
Data Science Blog2:
Retail generates a huge amount of data: products
positioned on different shelves in the store,
replenishment levels, reorder levels,
merchandising, and assortment planning. Most of this
data is structured, since many of the systems
are automated, but there are also lots of forms,
customer feedback, planning data, and analysis of
mails and other chat platforms.
Large warehouses of retail stores need to plan
positioning and containers in the aisles.
Analyze trends from social media to find
customer preferences for products and offers.
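A toy version of such an analysis can be sketched as below. Everything in it is an assumption made for illustration: the posts, the product names, the word lists, and the function `score_posts` are hypothetical, and a real project would use a proper sentiment model over data stored on a cluster.

```python
from collections import Counter

# Hypothetical posts, products, and sentiment word lists.
posts = [
    "love the new blue sneakers",
    "the blue sneakers are great",
    "red jacket offer is bad",
    "love the red jacket",
]
products = ["blue sneakers", "red jacket"]
positive_words = {"love", "great"}
negative_words = {"bad"}


def score_posts(posts, products):
    """Add up a crude sentiment score for every product mentioned in a post."""
    scores = Counter()
    for post in posts:
        words = set(post.split())
        sentiment = len(words & positive_words) - len(words & negative_words)
        for product in products:
            if product in post:
                scores[product] += sentiment
    return scores


scores = score_posts(posts, products)

print(scores["blue sneakers"])  # 2
print(scores["red jacket"])     # 0
```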
Retail Innovation read:
Retail uses lots of sensors for tracking items
within the warehouse and inside the store. The huge
volume of real-time data (video, text, and other forms)
generated every millisecond by sensors
embedded across every store and warehouse
cannot be analyzed by any other medium
better than by Hadoop or Big Data based systems.
Finance being a game of numbers, huge amounts of data
from books of accounts, P&L statements, balance sheets,
etc. accumulate for different businesses over
a period of time. Most books are
structured, and hence so is the data, but Hadoop
offers huge scalable clusters to quickly
analyze structured data as well.
Lots of social media data about interest in a
share or any other instrument does get reflected in
Spreadsheets are a popular medium of analysis;
they and other textual forms can be better
analyzed if available over Hadoop-like
clusters, as a kind of semi-structured data.