Which Freaking Database Should I Use?
1. Which Freaking Database Should I Use?
Andrew C. Oliver
@acoliver
{Great Wide Open | Atlanta}
{Open Software Integrators} { www.osintegrators.com} {@osintegrators}
2. Andrew C. Oliver
• Programming since I was about 8
• Java since ~1997
• Founded POI project (currently hosted at Apache) with Marc Johnson ~2000
o Former member Jakarta PMC
o Emeritus member of Apache Software Foundation
• Joined JBoss ~2002
• Former Board Member/current helper/lifetime member: Open Source Initiative (http://opensource.org)
• Column in InfoWorld: http://www.infoworld.com/author-bios/andrew-oliver
o I make fanboys cry.
3. Open Software Integrators
• Founded Nov 2007 by Andrew C. Oliver (me)
o in Durham, NC
• Revenue and staff have at least doubled every year since 2009.
• New office (2012) in Chicago, IL
o we're hiring mid to senior level as well as UI Developers (jQuery, JavaScript, HTML, CSS)
o up to 25% travel, salary + bonus, 401k, health, etc.
o preferred: Java, Tomcat, JBoss, Hibernate, Spring, RDBMS, jQuery
o nice to have: Hadoop, Neo4j, Couchbase, Ruby, at least one Cloud platform
4. Overview
• Why not just use the RDBMS for everything?
• Operational vs Analytical
• Key Value
• Column Family
• Document
• Graph
• Hadoop?
• Convergence of "clustered filesystems" and "databases"
• Conclusions
5. Why Not "Just Use" RDBMS for Everything?
{2014 Great Wide Open | Atlanta}
6. Before we begin...
• Let's handle the elephant, or rather the teddy bears, in the room:
http://highscalability.com/blog/2010/9/5/hilarious-video-relational-database-vs-nosql-fanbois.html/
7. The CAP theorem
8. RDBMS CAP characteristics
• Great at consistency
• Okay at availability
• Not so great at partition tolerance...
9. Single process model
• Lots of servers with many connections to few servers.
10. Multiprocess Model
[Diagram: four nodes, each running a Data Manager and a Cluster Manager]
11. Historical Scalability
• 10 MB disks were "big"
• Scalability meant more disks, controllers and possibly CPUs
• CPUs went from 4.77 MHz to 3.4 GHz
• Disks went from 64 kps @ 70 ms to 6 Gb/s
• Network speeds went from under 4 Mb/s to gigabit to bonded gigabit and beyond.
• Disk speeds for a long time didn't keep up with CPU...
12. The Mathematical model
• RDBMS is based on "Relational Algebra", which is just an extension of basic "set theory"
• Not every problem is a set problem: "direct path" or "which thing contains this other thing which has this other thing" (FOAF, friend of a friend)
• Sometimes relationships are as important as the data
• Sometimes data is even simpler than the relational model but needs higher levels of availability, etc.
• One size never really did fit all
13. Data Complexity
14. Datarrhea
• Yes I've already registered that ;-)
• The cheapness of storing data has yielded more
demand
o economics predicted this
• Moore's law ended while you slept
o Intel says next year (but when did CPU speeds last
double?)
• Massive parallelization is the most feasible way to get at
it (counter trended with an explosion in disk speeds)
15. ...but
• If
o your data is tabular;
o it fits cleanly in a relational model;
o you aren't having scalability issues; and
o you don't have a large dataset or a dataset/problem that lends itself to massive parallelization...
• you can probably stick with your RDBMS for now
o ...and probably aren't at this conference anyhow.
16. JPA/RDBMS Tables Example

PersonID | Firstname | Lastname | CompanyID
2 | Andy | Oliver | 3

CompanyID | Name | City | State
3 | Open Software Integrators | Durham | NC

PhoneNumber | Type | PersonID
919.627.1236 | google | 2
919.321.0119 | work | 2
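
To make the relational layout above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names are assumptions adapted from the slide (lowercased, with underscores), and SQLite stands in for whatever RDBMS you actually run. Reassembling one person takes two joins.

```python
import sqlite3

# In-memory database; names are adapted from the slide and are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company (company_id INTEGER PRIMARY KEY, name TEXT, city TEXT, state TEXT);
CREATE TABLE person (person_id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT,
                     company_id INTEGER REFERENCES company(company_id));
CREATE TABLE phone (phone_number TEXT PRIMARY KEY, type TEXT,
                    person_id INTEGER REFERENCES person(person_id));
INSERT INTO company VALUES (3, 'Open Software Integrators', 'Durham', 'NC');
INSERT INTO person VALUES (2, 'Andy', 'Oliver', 3);
INSERT INTO phone VALUES ('919.627.1236', 'google', 2);
INSERT INTO phone VALUES ('919.321.0119', 'work', 2);
""")

# Two joins are needed to put the person, company and phones back together.
rows = conn.execute("""
SELECT p.firstname, p.lastname, c.name, ph.phone_number, ph.type
FROM person p
JOIN company c ON c.company_id = p.company_id
JOIN phone ph ON ph.person_id = p.person_id
ORDER BY ph.type
""").fetchall()

for row in rows:
    print(row)
```

The same person reappears in the HBase and Neo4j examples later in the deck, which makes the three models easy to compare.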
17. Operational vs Analytical
• One DB type is unlikely to be well suited for all of your
problems.
• The system doing "short and sweet" "lightweight"
transactions is your operational system.
• The system doing long running reports and generating
charts and graphs and statistics is your analytical
system.
• There is also search. There are recommendation
engines, etc.
18. Other Types of Databases
19. Key-Value Stores
• Examples: Couchbase 1.8, Cassandra
o also: Gemfire, Infinispan (distributed caches)
• Constant Time O(1) - Lookup by key
• Good enough for "right now" stock quotes
• Usually combined with an index for search, but the structure isn't inherently indexed.
• Generally works well with Map Reduce.
• Extremely scalable, easy to partition
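
A rough sketch of why key-value stores are easy to partition: the key alone decides where a value lives. The class below is a hypothetical toy, not the API of Couchbase, Cassandra or any other product; each "partition" is just a Python dict, and the MD5-modulo placement is an assumption chosen for brevity.

```python
import hashlib

class PartitionedKV:
    """Toy key-value store; each partition is a plain dict (hypothetical design)."""

    def __init__(self, partitions=4):
        self.partitions = [{} for _ in range(partitions)]

    def _partition_for(self, key):
        # A stable hash of the key picks the partition, so no lookup table is needed.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % len(self.partitions)

    def put(self, key, value):
        self.partitions[self._partition_for(key)][key] = value

    def get(self, key, default=None):
        # Average-case constant time: one hash, then one dict lookup.
        return self.partitions[self._partition_for(key)].get(key, default)

store = PartitionedKV()
store.put("quote:ACME", 101.25)
store.put("quote:INIT", 42.0)
print(store.get("quote:ACME"))
```

Note what is missing: there is no way to ask "which keys have a value above 100" without scanning every partition, which is why these stores are usually paired with a separate index for search.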
20. Column Family / Big Table
• Many key-value stores support "column families"
o Cassandra
• Some were designed this way
o HBase
• Keys and values become composite
• Essentially a hashmap with a multi-dimensional array
o each column is a row of data
• Map-reduce friendly
• Example: stock quotes with time ranges
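
The "stock quote with time ranges" case can be sketched as a map from row key to timestamp-sorted columns. This is an illustrative in-memory model only; the QuoteTable class and its methods are hypothetical and do not mirror the HBase or Cassandra client APIs.

```python
import bisect
from collections import defaultdict

class QuoteTable:
    """Row key -> timestamp-sorted columns, a toy model of a column family."""

    def __init__(self):
        self._times = defaultdict(list)   # symbol -> sorted timestamps
        self._prices = defaultdict(list)  # symbol -> prices, parallel to _times

    def put(self, symbol, timestamp, price):
        # Keep columns ordered by timestamp so range scans stay cheap.
        i = bisect.bisect_left(self._times[symbol], timestamp)
        self._times[symbol].insert(i, timestamp)
        self._prices[symbol].insert(i, price)

    def scan(self, symbol, start, end):
        """Return (timestamp, price) pairs with start <= timestamp < end."""
        times = self._times.get(symbol, [])
        prices = self._prices.get(symbol, [])
        lo = bisect.bisect_left(times, start)
        hi = bisect.bisect_left(times, end)
        return list(zip(times[lo:hi], prices[lo:hi]))

table = QuoteTable()
table.put("ACME", 1000, 10.0)
table.put("ACME", 1060, 10.5)
table.put("ACME", 1030, 10.2)  # arrives out of order, still stored sorted
print(table.scan("ACME", 1000, 1060))
```

The lookup is still by key (the symbol), but the composite key plus sorted columns turns "all quotes between two times" into a contiguous slice rather than a search.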
21. HBase Example

Row key | First name | Last name | Company | City | State | Phone number | Phone type
5bfbd4a0-d02a-11e1-9b23-0800200c9a66 | Andy | Oliver | Open Software Integrators | Durham | NC | 919-627-1236 | google
7b2435c0-d02a-11e1-9b23-0800200c9a66 | Andy | Oliver | Open Software Integrators | Durham | NC | 919-321-0119 | work
22. Document databases
• Many developers think these are the "holy grail" since they fit nicely with object-oriented programming.
• Couchbase 2.0, CouchDB, MongoDB
• JSON documents
• One way to think of this is a Key-Value store that understands the values.
• Not as map-reduce friendly; larger datasets require indexes.
• A clear fit for REST services and as an operational store
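
As a small illustration of "a Key-Value store that understands the values", the sketch below stores the person from the earlier table example as one JSON document and queries inside it. The key format and field names are assumptions made for this example, not the schema of any particular product.

```python
import json

# One document holds what the relational example split across three tables.
documents = {
    "person::2": json.dumps({
        "firstname": "Andy",
        "lastname": "Oliver",
        "company": {"name": "Open Software Integrators", "city": "Durham", "state": "NC"},
        "phones": [
            {"number": "919.627.1236", "type": "google"},
            {"number": "919.321.0119", "type": "work"},
        ],
    }),
}

def find_by_phone_type(docs, phone_type):
    """Full scan over parsed documents; a real store would use an index here."""
    matches = []
    for key, raw in docs.items():
        doc = json.loads(raw)
        if any(p["type"] == phone_type for p in doc.get("phones", [])):
            matches.append(key)
    return matches

print(find_by_phone_type(documents, "work"))
```

The scan in find_by_phone_type is the reason larger datasets require indexes: without one, every query on a field touches every document.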
24. Graph Databases
• Based on Graph Theory
• Less about volume of the data and more about complexity
• Many are transactional
o often the transactions are "more correct" than those offered by a relational database.
• FOAF, direct path operations are easy
o very complicated/inefficient in RDBMS
• Usually paired with an index for search
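
To show why FOAF and direct-path questions are natural on a graph, here is a minimal sketch over plain adjacency lists. The names and the "knows" relationships are made up for illustration, and this is ordinary Python rather than a graph database query language such as Cypher.

```python
from collections import deque

# Hypothetical "knows" relationships, stored as adjacency lists.
KNOWS = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
}

def friends_of_friends(graph, person):
    """People two hops away who are not already direct friends."""
    direct = set(graph.get(person, []))
    result = set()
    for friend in direct:
        for fof in graph.get(friend, []):
            if fof != person and fof not in direct:
                result.add(fof)
    return result

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(friends_of_friends(KNOWS, "alice"))
print(shortest_path(KNOWS, "alice", "erin"))
```

In a relational schema each extra hop is typically another self-join on the relationship table, which is where the "very complicated/inefficient in RDBMS" point comes from.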
25. Design: RDBMS vs Graph
26. Neo4j Graph Example
[Diagram: a property graph. Nodes: Person (Firstname: Andrew, Lastname: Oliver); Company (Open Software Integrators); Phone Number 919.627.1236 (Type: googlevoice); Phone Number 919.321.0119 (Type: work); City Durham (State: NC); City Chicago (State: IL). Relationship labels: HAS, WORKS FOR, FOUNDED, LOCATED, RESIDES.]
Note the extra relationships and details here - graph databases are just fun and easy to understand.
27. Where does Hadoop fit?
• NoSQL
• Software Framework (lots of pieces/lots of choices):
o Pig - scripting language used to quickly write MapReduce code to handle unstructured sources
o Hive - facilitates structure for the data
o HCatalog - provides interoperability between these internal systems
o HBase - Bigtable-type database
o HDFS - Hadoop file system
• Excellent choice for data processing and data analysis
• MapReduce
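
For readers who have not seen MapReduce, the sketch below walks through the three phases (map, shuffle, reduce) as a single-process word count. It is plain Python for illustration only; it does not use the Hadoop APIs, and the function names are made up for this example.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["which database", "which freaking database", "graph database"])
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across many machines, which is the property Hadoop exploits.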
28. Convergence of Filesystems and Databases
• Hadoop HDFS is...a distributed filesystem
• So are Gluster, Ceph, GFS, etc.
• Hadoop can use Ceph or Gluster in place of HDFS
29. Other Derivatives
• Triplestores
o Apache Jena
• OODBMS / ORDBMS
o Caché
30. Things you may consider
• Persistence
o Asynch / Synch
• Replication
• Availability
• Transactions / Consistency
• "Locality"
• Language
• Resources
o http://en.wikipedia.org/wiki/Comparison_of_structured_storage_software
o http://sevenweeks.org/
31. Conclusions
• RDBMS may not scale to your needs
• Your data may not map efficiently to tables
• Key Value Store - data by key, fast, scalable, can't handle complex data
• Column Family/Big Table - fast, scalable, denormalized, map reduce, good for series, not efficient for complex data
• Document - a good operational system, not your analytical, moderately scalable, matches OO
• Graph - great for complex data, transactional, less scalable
• Filesystems and "databases" are converging
32. Thank you for attending!