RELATED POPULAR OPEN-SOURCE DISTRIBUTED DB TECHNOLOGIES
Technology     Company         Open Sourced On
Cassandra      DataStax        Apache Cassandra; used by Facebook, LinkedIn,
BigTable       Google          Google BigTable
Apache HBase   Apache          HBase (used by many companies; most popular)
MongoDB        MongoDB Inc.    Apache (written in C++, Erlang, C)
Couchbase      Couchbase Inc.  Apache (written in Erlang)
CLASSIFICATION OF NOSQL DATABASES
Category    NoSQL database
Column      Accumulo, Cassandra, HBase
Document    Clusterpoint, CouchDB, Couchbase, MarkLogic, MongoDB
Key-Value   Dynamo, FoundationDB, MemcacheDB, Redis, Riak, FairCom
Graph       Allegro, Neo4j, OrientDB, Virtuoso, Stardog
- A column-oriented database stores values column by column,
rather than row by row as in a typical RDBMS.
- This leads to better compression of the data, and hence less storage space is required.
- Still higher compression can be achieved when used
- Similarly, document-oriented databases store and arrange data in the form of documents.
- Key-value stores hold data as a collection of key-value pairs, supporting add,
insert, and delete operations on the pairs.
- Graph databases: every element holds a direct pointer to its adjacent elements, hence no-
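The four storage models listed above can be illustrated with a small Python sketch. It is only a toy built on plain lists and dictionaries, not how any of the named databases is implemented; the sample records, field names, and helper functions (`to_columns`, `run_length_encode`) are assumptions made for illustration. Run-length encoding is used here as one simple compression scheme that benefits from column layout.

```python
from itertools import groupby

# Hypothetical sample records: (name, country, year).
ROWS = [
    ("alice", "IN", 2021),
    ("bob", "IN", 2021),
    ("carol", "IN", 2022),
    ("dave", "US", 2022),
]
FIELDS = ("name", "country", "year")


def to_columns(rows, fields):
    """Pivot row tuples into one list of values per column."""
    return {field: list(values) for field, values in zip(fields, zip(*rows))}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Column-oriented: values of one column sit together, so repeats compress well.
columns = to_columns(ROWS, FIELDS)
country_runs = run_length_encode(columns["country"])

# Document-oriented: one self-contained, possibly nested record.
document = {"name": "alice", "address": {"country": "IN"}, "joined": 2021}

# Key-value: add, overwrite, and delete entries by key.
store = {}
store["user:1"] = "alice"
store["user:2"] = "bob"
del store["user:2"]

# Graph: each node keeps direct references to its adjacent nodes.
graph = {"alice": ["bob", "carol"], "bob": ["carol"], "carol": []}
neighbours_of_alice = graph["alice"]

print(country_runs)
```

Stored row by row, the country values are interleaved with names and years; stored as a column, the four values collapse into two runs.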
RELATION OF CLOUD COMPUTING AND BI
Go through the link below:
BIGDATA – 5V
The term Big Data is characterized by five Vs:
Volume: large volume of data.
Velocity: amount of data received per second.
Variability: level of unintentional modification
affecting data quality throughout its lifecycle.
Value: value derived from the data.
Variety: wide range of data received,
such as video, audio, text, and images.
SOURCES OF BIGDATA WHAT NOSQL SOLVES?
Example sources, by the 5 Vs:
Volume: large volumes of video feeds are
received and maintained at video sites
such as YouTube and Vimeo.
Variety: a large variety of data (text, audio, video,
images) is received at sites like Facebook, Twitter,
and other social media platforms.
Velocity: the speed at which data is received at
sites like Twitter and Facebook (1 billion people all
feeding their data into one site).
TRENDS SHAPING INTEREST TOWARDS BIGDATA
Batch Processing vs Real-Time Processing
Batch jobs run at a particular time of day, such as
nightly or morning jobs, and depend on
slack time when the server has less load.
But people now want to see status in real time, for
example in transportation, when a bus is arriving at a
particular stand. Or in retail: as soon as customers
update their status, real-time advertisements
are required. This is shaping the move towards Big Data.
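The difference between the two processing styles can be sketched in Python. This is a minimal illustration, not a real scheduler or streaming engine; the event list and the names `batch_counts` and `RealTimeCounts` are hypothetical.

```python
from collections import Counter

# Hypothetical event stream: one bus-stand name per arrival event.
EVENTS = ["stand_a", "stand_b", "stand_a", "stand_c", "stand_a"]


def batch_counts(stored_events):
    """Batch: process everything collected so far in one scheduled run."""
    return Counter(stored_events)


class RealTimeCounts:
    """Real time: update the result as each event arrives."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event] += 1
        # The up-to-date count is available immediately, not after the nightly run.
        return self.counts[event]


live = RealTimeCounts()
for event in EVENTS:
    live.on_event(event)

nightly = batch_counts(EVENTS)
```

Both approaches end with the same totals; what differs is when the answer becomes available.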
PROBLEMS OF BIGDATA
Problems differentiated by the 5 Vs.
Velocity: with a large volume of data received, and the quick
turnaround (low latency) required to reflect data fed in at Facebook,
can it be managed by a regular DBMS?
A DBMS maintains ACID properties and has lots of constraints such as
primary keys, foreign keys, and check constraints. When quick
turnaround or short latency is required, these constraints add to
processing time and to the volume required for storage. So all of these
sites have their own file-based, DBMS-like storage systems
which do not have these constraints. All data is maintained in
files; the ids assigned to files are indexed and regularly moved (these
are publicly known databases such as Cassandra,
developed at Facebook and since open-sourced, and BigTable by Google).
Most of these databases are popularly categorized as NoSQL
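The idea of keeping data in files and indexing it by id, without relational constraints, can be sketched as follows. This is a toy illustration under stated assumptions, not the storage engine of Cassandra, BigTable, or any other system; the class name `FileStore` and its methods are hypothetical.

```python
import os
import tempfile


class FileStore:
    """Append-only data file plus an in-memory index from id to location."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # record id -> (offset, length) in the data file
        open(path, "ab").close()  # create the file if it does not exist

    def put(self, record_id, payload):
        """Append the bytes; no keys or constraints are checked."""
        with open(self.path, "ab") as data_file:
            data_file.seek(0, os.SEEK_END)
            offset = data_file.tell()
            data_file.write(payload)
        # A later put for the same id simply repoints the index.
        self.index[record_id] = (offset, len(payload))

    def get(self, record_id):
        """Look up the location in the index, then read only that slice."""
        offset, length = self.index[record_id]
        with open(self.path, "rb") as data_file:
            data_file.seek(offset)
            return data_file.read(length)


demo_path = os.path.join(tempfile.mkdtemp(), "demo.data")
demo = FileStore(demo_path)
demo.put("post:1", b"hello")
demo.put("post:2", b"world")
```

Writes are cheap because nothing is validated or rewritten in place; the trade-off is that consistency checks a DBMS would perform are left to the application.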
BIGDATA AND ANALYTICS
As we now know, Big Data is solving the problems of
the 5 Vs, such as the huge (V)olume of storage required for
video sites like YouTube.
It is changing how we perceive, visualize, and
analyze data: for example, HBase is used for data storage and
Mahout is used to run analytics and find patterns.
These databases hold a variety of data that
requires different kinds of processing, which cannot be
achieved by traditional RDBMS-based products.
Example link below:
BIGDATA AND MAP-REDUCE
The Map-Reduce algorithm, created by Google, was the
starting point of all we see in Big Data.
The Mapper divides work into multiple parallel
tasks, then sorts and filters the data into queues,
say one queue for each name.
The Reducer component aggregates or
summarizes the data from multiple units.
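The mapper and reducer roles described above can be sketched in a few lines of Python. This is a single-process illustration of the idea only, not Google's or Hadoop's implementation; the sample records and the function names (`mapper`, `shuffle`, `reducer`, `map_reduce`) are assumptions.

```python
from collections import defaultdict

# Hypothetical input: one name per record.
RECORDS = ["alice", "bob", "alice", "carol", "bob", "alice"]


def mapper(record):
    """Emit a (key, value) pair for each record; each call could run in parallel."""
    yield (record, 1)


def shuffle(pairs):
    """Sort and group the pairs into one queue per key (here, per name)."""
    queues = defaultdict(list)
    for key, value in pairs:
        queues[key].append(value)
    return dict(sorted(queues.items()))


def reducer(key, values):
    """Aggregate all values collected for one key."""
    return key, sum(values)


def map_reduce(records):
    pairs = [pair for record in records for pair in mapper(record)]
    queues = shuffle(pairs)
    return dict(reducer(key, values) for key, values in queues.items())


counts = map_reduce(RECORDS)
```

In a real cluster the mapper calls and the per-key reducer calls would run on different machines; here they run one after another so the data flow is easy to follow.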
DATA SCIENCE AND BIG DATA
Since the data is mostly unstructured, the best
way to analyze it is by using
analytics. Here comes a new career called Data Scientist.
Skill set required for a Data Scientist:
Mathematics (mostly statistics), Computer
Science, and a domain such as Sociology (like Social
ANOTHER VIEW FROM WIKI: SKILLS REQUIRED FOR DATA SCIENTIST
SOCIAL MEDIA ANALYTICS
One application of Big Data has been to
gather feedback about products from social media.
Below is a sample project report on how, and with
what tools, social media can be analyzed.
BIGDATA AND HADOOP
Hadoop allows load to be distributed among many
There can be database clusters, OS clusters, and
application or web server level clustering, but here
we are dealing with an OS-like Distributed File
System (DFS). The Hadoop DFS (HDFS) file system,
developed largely at Yahoo, competes with Google's storage
systems such as BigTable, providing quick storage and retrieval of
data in the form of files, and is used by many social media
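How a distributed file system spreads one file across a cluster can be sketched as below. This is a simplified toy: the round-robin placement, the node names, and the helpers `split_into_blocks` and `place_blocks` are assumptions for illustration and do not reproduce HDFS's actual block placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut the file contents into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + replica) % len(nodes)]
            for replica in range(replication)
        ]
    return placement


# Hypothetical four-node cluster and a tiny 10-byte "file".
NODES = ["node1", "node2", "node3", "node4"]
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_blocks(len(blocks), NODES, replication=2)
```

Because every block lives on more than one node, reads can be served by whichever copy is least loaded, and the loss of one node does not lose the file.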
BIGDATA OTHER TECHNOLOGY USAGE
‘R’ is an open-source statistical analysis
language with built-in statistical constructs,
used for the analysis of data.
The Java data mining API, the .NET data mining API, and
Python libraries are used to mine and
understand trends in data.
Pig is another Apache Hadoop-based
system, used to provide a high-level language for
analyzing large data sets.
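As a small illustration of understanding a trend in data with Python, the sketch below fits a least-squares line through a series and reports its slope. It uses only the standard library; the function name `trend_slope` and the sales figures are hypothetical.

```python
from statistics import mean


def trend_slope(values):
    """Least-squares slope of values against their positions 0, 1, 2, ...

    Needs at least two values; a positive slope means an upward trend.
    """
    xs = range(len(values))
    x_mean = mean(xs)
    y_mean = mean(values)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    return numerator / denominator


# Hypothetical weekly sales figures.
WEEKLY_SALES = [10, 12, 14, 16]
slope = trend_slope(WEEKLY_SALES)
```

Dedicated statistics packages offer far richer models; the point here is only that a trend is a number that can be computed from the raw series.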
HERE SOME LINKS
Data Science Blog2:
USE CASE: RETAIL
Retail generates a huge amount of data: products
positioned on different shelves in the store, replenishment levels,
reorder levels, merchandising, and assortment planning. Most
of this data is usually structured, since lots of systems are
automated, but there are also lots of forms, customer
feedback, planning data, and analysis of mails and other chat
Large warehouses of retail stores need to plan positioning
and containers in aisles.
Analyze trends from social media to find customer
preferences for products and offers.
Retail Innovation read:
USE CASE: RETAIL-2
Retail uses lots of sensors for tracking items
within the warehouse and inside the store. The huge volume of
real-time data (video, text, and other forms)
generated every millisecond by sensors
embedded across every store and
warehouse cannot be analyzed by any other
medium better than in Hadoop or Bigdata
USE CASE: FINANCE
Finance being a game of numbers, huge amounts of data from
books of accounts, P&L statements, balance sheets, etc.
accumulate across different businesses over a period of
time. Most books are structured, and hence so is the
data, but Hadoop offers huge scalable clusters to
quickly analyze structured data as well.
Lots of social media data about interest in a share or
any instrument does get reflected in the numbers.
Spreadsheets are a popular medium of analysis; they and
other textual forms can be better analyzed if available
over Hadoop-like clusters, as a kind of semi-
structured data analysis.