Document Stores
Think of document databases as key-value stores where the value is examinable
Rich query language + indexes
MongoDB, CouchDB, Terrastore, OrientDB, RavenDB
http://www.thoughtworks.com/insights/blog/nosql-databases-overview
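To make "the value is examinable" concrete, here is a minimal pure-Python sketch (all data invented) contrasting an opaque key-value blob with a queryable document:

    # In a pure key-value store the value is an opaque blob: the store
    # cannot look inside it, so it cannot query or index its fields.
    kv_store = {"user:42": b"\x93\xa5alice\xa4city"}

    # In a document store the value has structure the engine can examine,
    # which is what a rich query language and indexes operate on.
    doc_store = {
        "user:42": {"name": "alice", "city": "Berlin", "age": 31},
        "user:43": {"name": "bob", "city": "Berlin", "age": 27},
    }
    berliners = [d for d in doc_store.values() if d["city"] == "Berlin"]
    print(berliners)  # both documents match the field-level query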
Column Family Stores
Each column family can be compared to a container of rows in an RDBMS table, where the key identifies the row
and the row consists of multiple columns. The difference is that rows do not all have to have the same columns,
and a column can be added to any row at any time without having to add it to the other rows.
Cassandra, HBase, Hypertable
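As a sketch, the column-family model can be pictured as nested dictionaries (all names invented): rows in the same family need not share columns, and adding a column to one row leaves the others untouched:

    # one column family, keyed by row key
    users = {
        "row-1": {"name": "alice", "email": "a@example.com"},
        "row-2": {"name": "bob"},          # no 'email' column here: allowed
    }

    # a column can be added to any row at any time, for that row only
    users["row-2"]["last_login"] = "2014-03-01"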
MongoDB
Document-oriented system
Large Data Sets
Records similar to JSON
Automatic sharding and MapReduce
Queries are written in JavaScript
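For illustration, a minimal session with the pymongo driver (database and collection names are invented; assumes a mongod running locally):

    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    books = client.bookstore.books   # database 'bookstore', collection 'books'

    # records are JSON-like documents
    books.insert_one({"title": "NoSQL Distilled", "tags": ["nosql", "databases"]})

    # rich queries over fields, backed by an index
    books.create_index("tags")
    for doc in books.find({"tags": "nosql"}):
        print(doc["title"])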
CouchDB
Document-oriented system
JavaScript Interface
Multi-version concurrency control approach
The client side needs to handle conflicts on writes
No good built-in method for horizontal scalability
(but there are external solutions like BigCouch, Lounge, and Pillow)
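A sketch of what "clients handle conflicts" looks like against CouchDB's HTTP API, using the requests library (database and document names invented; assumes CouchDB on its default port):

    import requests

    base = "http://localhost:5984/mydb"
    requests.put(base)                                   # create the database
    rev = requests.put(base + "/doc1", json={"n": 1}).json()["rev"]

    # a write carrying the current _rev succeeds and yields a new revision
    requests.put(base + "/doc1", json={"n": 2, "_rev": rev})

    # replaying the stale revision is rejected; the client must fetch the
    # latest _rev and resolve the conflict itself
    stale = requests.put(base + "/doc1", json={"n": 3, "_rev": rev})
    print(stale.status_code)                             # 409 Conflict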
Cassandra
Originally an internal Facebook project
Keyspaces and column families
Similar to the data model used by BigTable
Data is sharded and balanced automatically
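As an illustration, a keyspace and table (the CQL face of a column family) created with the DataStax cassandra-driver (names invented; assumes a local node):

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute("CREATE TABLE IF NOT EXISTS demo.users (id text PRIMARY KEY, name text)")
    session.execute("INSERT INTO demo.users (id, name) VALUES ('row-1', 'alice')")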
Redis
Keeps the entire database in RAM
Values can be complex data structures (lists, sets, hashes)
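A minimal redis-py sketch of those complex value types (key names invented; assumes a local Redis server):

    import redis

    r = redis.Redis(host="localhost", port=6379)
    r.lpush("recent:logins", "alice", "bob")   # list value
    r.sadd("tags:nosql", "redis", "riak")      # set value
    r.hset("user:42", "name", "alice")         # hash value
    print(r.lrange("recent:logins", 0, -1))    # data lives in RAM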
BigTable
Structure: tables, row keys, column families, column names, timestamps, and cell values
Designed to handle very large data loads by running on big clusters of commodity hardware
Uses GFS (the Google File System) as its underlying storage
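The structure above can be sketched in pure Python as a map from (row key, family:column, timestamp) to a cell value (all names invented):

    table = {}

    def put(row_key, family, column, timestamp, value):
        cells = table.setdefault(row_key, {}).setdefault(family + ":" + column, {})
        cells[timestamp] = value   # multiple timestamped versions per cell

    put("com.example/", "contents", "html", 1396000000, "<html>...</html>")
    put("com.example/", "anchor", "cnn.com", 1396000050, "anchor text")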
HBase
Open source clone of BigTable
Same structure as BigTable
Uses HDFS instead of GFS
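A hedged sketch with the happybase client (table and column names invented; assumes a local HBase Thrift server):

    import happybase

    conn = happybase.Connection("localhost")
    conn.create_table("pages", {"contents": dict()})   # one column family
    pages = conn.table("pages")
    pages.put(b"com.example/", {b"contents:html": b"<html>...</html>"})
    print(pages.row(b"com.example/"))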
Riak
Inspired by Amazon's Dynamo paper
Open source and commercial versions
Key-value system
Large distributed clusters
Queries in Erlang or JavaScript
Consistent hashing and a gossip protocol avoid a centralized index server
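A minimal sketch of the consistent-hashing idea (not Riak's actual code): each node owns a point on a hash ring and a key belongs to the first node clockwise from the key's hash, so any node can locate data without a central index:

    import bisect, hashlib

    def h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    ring = sorted((h(n), n) for n in ["node-a", "node-b", "node-c"])
    points = [p for p, _ in ring]

    def owner(key):
        i = bisect.bisect(points, h(key)) % len(ring)
        return ring[i][1]

    print(owner("user:42"), owner("user:9000"))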
Hadoop MapReduce
Takes care of running your code across a cluster of machines (see the word-count sketch after this list):
- chunking up the input data
- sending it to each machine
- running your code on each chunk
- checking that the code ran
- passing any results to the next stage
- sorting between stages
- sending each chunk of that sorted data to the right machine
- writing debugging information on each job’s progress
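The word-count sketch promised above: the map and reduce functions are all you write, while the framework supplies the chunking, shuffling, and sorting (pure Python, invented data):

    from itertools import groupby
    from operator import itemgetter

    def map_fn(line):                    # runs on each chunk of input
        for word in line.split():
            yield (word, 1)

    def reduce_fn(word, counts):         # runs after the sort stage
        yield (word, sum(counts))

    lines = ["the quick brown fox", "the lazy dog"]
    mapped = sorted(kv for line in lines for kv in map_fn(line))  # shuffle + sort
    for word, group in groupby(mapped, key=itemgetter(0)):
        for result in reduce_fn(word, (c for _, c in group)):
            print(result)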
Apache Hive
With Hive, you can program Hadoop jobs using HiveQL, a SQL-like language.
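For example, a word count as a single HiveQL statement, submitted here through the hive command-line client (table name invented; assumes Hive is installed):

    import subprocess

    query = "SELECT word, COUNT(*) FROM words GROUP BY word"
    subprocess.run(["hive", "-e", query], check=True)  # Hive compiles this to Hadoop jobs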
Apache Pig
A procedural data processing language designed for Hadoop
Provides a set of functions that help with common data processing problems
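As a sketch, the same word count in Pig Latin, run through the pig CLI in local mode (paths invented; assumes Pig is installed):

    import subprocess

    script = """
    lines  = LOAD 'input.txt' AS (line:chararray);
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    counts = FOREACH (GROUP words BY word) GENERATE group, COUNT(words);
    DUMP counts;
    """
    with open("wordcount.pig", "w") as f:
        f.write(script)
    subprocess.run(["pig", "-x", "local", "wordcount.pig"], check=True)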
PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.
Hadoop for Logs
The Flume project is designed to make the data gathering process easy and scalable, by running agents on the source machines that pass the data updates to collectors, which then aggregate them into large chunks that can be efficiently written as HDFS files.
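A hedged sketch of a minimal Flume NG agent definition (properties format; all names and paths invented), written out from Python and then launched with the flume-ng command:

    config = """
    a1.sources  = tail1
    a1.channels = ch1
    a1.sinks    = hdfs1

    a1.sources.tail1.type = exec
    a1.sources.tail1.command = tail -F /var/log/app.log
    a1.sources.tail1.channels = ch1

    a1.channels.ch1.type = memory

    a1.sinks.hdfs1.type = hdfs
    a1.sinks.hdfs1.hdfs.path = /flume/events
    a1.sinks.hdfs1.channel = ch1
    """
    with open("flume.conf", "w") as f:
        f.write(config)
    # then: flume-ng agent --name a1 --conf-file flume.conf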
The R project is both a specialized language and a toolkit of modules aimed at anyone working with statistics.
Lucene is a Java library that handles indexing and searching large collections of documents, and Solr is an application that uses the library to build a search engine server.
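For instance, querying a running Solr server's HTTP search API with requests (core and field names invented):

    import requests

    resp = requests.get(
        "http://localhost:8983/solr/articles/select",
        params={"q": "title:nosql", "wt": "json"},
    )
    for doc in resp.json()["response"]["docs"]:
        print(doc.get("title"))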
Mahout is an open source framework that can run common machine learning algorithms on massive datasets. The framework makes it easy to use analysis techniques to implement features such as a “People who bought this also bought” recommendation engine on your own site.
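The idea behind such a feature is item co-occurrence counting, sketched here in pure Python (data invented; Mahout performs this kind of computation at scale on Hadoop):

    from collections import Counter
    from itertools import combinations

    baskets = [{"book", "lamp"}, {"book", "pen"}, {"book", "lamp", "pen"}]

    co = Counter()                      # how often two items are bought together
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            co[(a, b)] += 1
            co[(b, a)] += 1

    def also_bought(item, n=2):
        scores = Counter({b: c for (a, b), c in co.items() if a == item})
        return [b for b, _ in scores.most_common(n)]

    print(also_bought("book"))          # e.g. ['lamp', 'pen']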