4. Big Data Definition
No single standard definition…
“Big Data” is data whose scale, diversity, and complexity
require new architecture, techniques, algorithms, and
analytics to manage it and extract value and hidden
knowledge from it…
8. Hadoop
Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
Hadoop implements Google’s MapReduce, using HDFS
MapReduce divides applications into many small blocks of work.
HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster
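To make "many small blocks of work" concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps for a word count. It does not use Hadoop itself; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample blocks are illustrative assumptions, and on a real cluster each block would be processed on a different node.

```python
from collections import defaultdict


def map_phase(block):
    """Map step: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Group intermediate values by key, between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values collected for one key."""
    return key, sum(values)


def word_count(blocks):
    intermediate = []
    for block in blocks:  # each block is an independent unit of work
        intermediate.extend(map_phase(block))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, already split into two blocks.
blocks = ["big data big clusters", "data blocks on clusters"]
counts = word_count(blocks)
print(counts)
```

Because each call to map_phase only looks at its own block, the map work can be spread across machines without coordination; only the shuffle step needs to bring values for the same key together.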
11. So many NoSQL options
More than just the Elephant in the room
More than 120 types of NoSQL databases
12. Extend the Scope of RDBMS
Caching
Master/Slave
Table Partitioning
Federated Tables
Sharding
NoSQL
Relational database (RDBMS) technology
Has not fundamentally changed in over 40 years
Default choice for holding data behind many web apps
Handling more users means adding a bigger server
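Sharding, one of the techniques listed on this slide, spreads rows across several servers instead of moving to a bigger one. The following is a minimal sketch assuming hash-based routing on the key; the ShardedStore class is hypothetical, and each "shard" is just an in-memory dict standing in for a separate database server.

```python
import hashlib


class ShardedStore:
    """Toy key/value store that routes each key to one of several shards."""

    def __init__(self, num_shards):
        # Each dict stands in for an independent database server.
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash keeps the same key on the same shard across runs.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)


store = ShardedStore(num_shards=4)
for user_id in range(100):
    store.put(f"user:{user_id}", {"id": user_id})

print(store.get("user:42"))
print([len(shard) for shard in store.shards])
```

A plain modulo scheme like this one forces most keys to move when the number of shards changes, which is one reason production systems often use other routing schemes.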
13. NoSQL Movement
RDBMS with Extended Functionality
vs.
Systems Built from Scratch with Scalability in Mind
14. CAP Theorem
“Of three properties of shared-data systems – data Consistency, system
Availability and tolerance to network Partition – only two can be achieved at
any given moment in time.”
15. CAP Theorem
CA: Highly-available consistency
CP: Enforced consistency
AP: Eventual consistency
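The difference between enforced consistency (CP) and eventual consistency (AP) shows up only during a network partition. The toy model below is a sketch under stated assumptions, not how any particular database works: the TwoNodeStore class is hypothetical, it holds a single value on two replicas, it uses one shared counter as a version clock, and it reconciles with a last-write-wins rule after the partition heals.

```python
class Replica:
    def __init__(self):
        self.value = None
        self.version = 0


class TwoNodeStore:
    """Two replicas of one value; mode is "CP" or "AP"."""

    def __init__(self, mode):
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False
        self.clock = 0  # simplification: a single shared version counter

    def write(self, node, value):
        if self.partitioned and self.mode == "CP":
            # CP: refuse the write rather than let the replicas diverge.
            raise RuntimeError("unavailable: cannot reach the other replica")
        self.clock += 1
        # AP during a partition: only the locally reachable replica is updated.
        targets = [self.replicas[node]] if self.partitioned else self.replicas
        for replica in targets:
            replica.value, replica.version = value, self.clock

    def read(self, node):
        return self.replicas[node].value

    def heal(self):
        # Assumed reconciliation rule: the highest version wins.
        self.partitioned = False
        newest = max(self.replicas, key=lambda r: r.version)
        for replica in self.replicas:
            replica.value, replica.version = newest.value, newest.version


ap = TwoNodeStore("AP")
ap.write(0, "a")
ap.partitioned = True
ap.write(0, "b")           # accepted: the system stays available
stale = ap.read(1)         # the other replica still holds the old value
ap.heal()
converged = ap.read(1)     # replicas agree again after the partition heals

cp = TwoNodeStore("CP")
cp.write(0, "a")
cp.partitioned = True
try:
    cp.write(0, "b")
    cp_rejected = False
except RuntimeError:
    cp_rejected = True     # consistency kept, availability given up
```

In the AP run the second replica serves a stale value until heal() is called; in the CP run the write is rejected, so no reader ever sees two different values.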
19. Document Database
Use for data that is document-oriented (a collection of JSON documents) and semi-structured
Encodings include XML, YAML, JSON, and BSON, as well as binary forms (PDF, Microsoft Office documents such as Word, Excel…)
Examples: MongoDB, CouchDB
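As a rough illustration of what "semi-structured" means here, the sketch below keeps a collection of JSON documents that do not share a fixed schema and queries them by field. It uses only the Python standard library rather than MongoDB or CouchDB; the sample documents and the find and find_with_tag helpers are hypothetical.

```python
import json

# Hypothetical collection: every document has a name, other fields vary.
raw = """[
  {"_id": 1, "name": "Ada", "email": "ada@example.com"},
  {"_id": 2, "name": "Lin", "tags": ["admin", "ops"]},
  {"_id": 3, "name": "Sam", "address": {"city": "Oslo"}, "tags": ["ops"]}
]"""
collection = json.loads(raw)


def find(collection, **criteria):
    """Return documents whose top-level fields equal every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


def find_with_tag(collection, tag):
    """Return documents whose optional "tags" list contains the tag."""
    return [doc for doc in collection if tag in doc.get("tags", [])]


lin = find(collection, name="Lin")
ops_ids = [doc["_id"] for doc in find_with_tag(collection, "ops")]
print(lin)
print(ops_ids)
```

Documents that lack a field are simply skipped by the query instead of causing a schema error, which is the flexibility a document store trades against the fixed columns of a relational table.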
20. Graph Database
Use for data with a lot of many-to-many relationships
Use when your primary objective is quickly finding connections, patterns, and relationships between objects in large amounts of data
Examples: Neo4j, Freebase (Google)
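"Finding connections" between objects can be sketched as a shortest-path search over an adjacency structure. This is a plain-Python illustration using breadth-first search, not the query engine of any graph database; the sample edges and the build_graph and shortest_path helpers are assumptions made for the example.

```python
from collections import deque

# Hypothetical "knows" relationships between people (undirected).
edges = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "dave"),
    ("alice", "erin"),
]


def build_graph(edges):
    """Build an adjacency map: node -> set of directly connected nodes."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list of a shortest path or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


graph = build_graph(edges)
path = shortest_path(graph, "alice", "dave")
print(path)
```

Each hop follows a stored relationship directly, whereas the same question in a relational schema would typically need one self-join per hop.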
21. So which type of NoSQL? Back to CAP…
CP = NoSQL/column
Hadoop
Big Table
HBase
MemcacheDB
AP = NoSQL/document or key/value
DynamoDB
CouchDB
Cassandra
Voldemort
CA = SQL/RDBMS
SQL Server / SQL Azure
Oracle
MySQL