This presentation traces the evolution of data analysis and argues that the Couchbase database can make it exciting again. In 1992, analysis programs took two to three days to write and results only arrived after overnight processing; the wait, and the payoff of a good report, was what made the work exciting. With Couchbase Server 2.0, a query can be built in seconds and its results retrieved in seconds, even for huge datasets, using Map/Reduce-based views. Its database clusters and JavaScript interface make it possible to slice the data in many ways with little effort.

2. In the year 1992…
• Freetext Database = Document/NoSQL Database
• Massive Datasets
  – 19,043 records!!!
  – Approx. 8k per record

3. The Drug
• Data Analysis was ‘Exciting’
• 2-3 days to write the analysis program
• Processing would occur overnight
• Statistics required ‘whole set’ processing

4. The Hit
• Mornings were ‘the hit’
• The joy of real data analysis is the output of a good report
• Get good stats
  – I know how many teachers teach Geography in Scotland!
  – I know 400 people have purchased our History software!
• The wait and the results kept us working

5. In the year 2002
• Grid computing was the drug
• Building 200-2000 node grid systems
• Analysis could happen the same day
• Datasets could be huge
  – They just took more hours
• Still working on entire datasets
  – Statistics still required whole set process
• Jobs became monotonous
• More about construction and technology than stats

6. In the year 2012
• Need info and statistics quicker than ever
• Database clusters provide the backbone
  – Grids without the headache
• Build a query in seconds; Get the result in seconds
• Need statistics in different ways:
  – Live
  – Online (and sometimes user visible)
  – Whole of set and partial set, but based on Big Data
• Slice and dice in more ways without effort

7. Couchbase Background Stats
• Couchbase 1.8 already hits interesting numbers
• Draw Something (OMGPOP), within 6 weeks:
  – 15 million daily active users
  – 3000 drawings generated every two seconds
  – Over two billion stored drawings
  – 90 nodes
  – 3 clusters
  – No stops!

8. The new drug
• Couchbase Server 2.0
• Cluster-based database
• Fast, Scalable, Predictable
• Map/Reduce based querying
• JavaScript/Web-based interface
  – Type in your query, get your results
• Instant Gratification!

9. The Data End
• Store data however you want
• The Map will sort it out for us
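
A map function in a Couchbase Server 2.0 view receives each stored document and decides what to emit, which is how differently shaped records can end up in one index. The sketch below is a minimal illustration under stated assumptions: the sample documents, the field names `subject` and `subjects`, and the `runMap` helper are all invented here so the function can run outside the server. On the server `emit` is a global; here it is passed in as a third argument, and the sort is a simplified stand-in for the server's key ordering.

```javascript
// Hypothetical documents stored without a fixed schema.
const sampleDocs = [
  { id: "t1", doc: { type: "teacher", subject: "Geography", region: "Scotland" } },
  { id: "t2", doc: { type: "teacher", subjects: ["History", "Geography"], region: "Wales" } },
  { id: "s1", doc: { type: "sale", product: "History software" } },
];

// View-style map function: (doc, meta) in, zero or more emit() calls out.
// emit is passed explicitly so the sketch runs in plain Node.
function mapTeachersBySubject(doc, meta, emit) {
  if (doc.type !== "teacher") return; // ignore unrelated documents
  // Accept either a single subject or a list of subjects.
  const subjects = doc.subjects || (doc.subject ? [doc.subject] : []);
  for (const subject of subjects) {
    emit([subject, doc.region], null);
  }
}

// Stand-in for the server: run the map over every document, then sort by key.
function runMap(mapFn, documents) {
  const rows = [];
  for (const { id, doc } of documents) {
    mapFn(doc, { id }, (key, value) => rows.push({ id, key, value }));
  }
  return rows.sort((a, b) => {
    const ka = JSON.stringify(a.key);
    const kb = JSON.stringify(b.key);
    return ka < kb ? -1 : ka > kb ? 1 : 0;
  });
}

const teacherRows = runMap(mapTeachersBySubject, sampleDocs);
console.log(teacherRows);
```

The two teacher documents have different shapes, yet both contribute rows keyed by subject and region, while the sale document is simply skipped.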

11. Map/Reduce Creates Indexes
• Not Hadoop
• Map/Reduce creates an index
• Map *AND* Reduce output are stored
• Index is used for queries
• Makes queries faster (obviously!)
• Index is ‘materialized’ at query time
  – Updated, not recreated
• Incremental map/reduce
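
The idea of an index that is updated rather than recreated can be illustrated outside the server. What follows is a conceptual sketch only, not a description of how Couchbase implements its indexes: the hypothetical `ViewIndex` class keeps the rows each document emitted, so a changed document replaces only its own rows and the map function is not re-run for anything else.

```javascript
// Conceptual sketch: an in-memory "view index" that is updated per document.
class ViewIndex {
  constructor(mapFn) {
    this.mapFn = mapFn;
    this.rowsByDoc = new Map(); // document id -> rows emitted by that document
    this.mapCalls = 0;          // how many times the map function has run
  }

  // Re-run the map for one document only and replace its previous rows.
  upsert(id, doc) {
    const rows = [];
    this.mapFn(doc, { id }, (key, value) => rows.push({ id, key, value }));
    this.mapCalls += 1;
    this.rowsByDoc.set(id, rows);
  }

  // Query: return the stored rows whose key equals the requested key.
  query(key) {
    const matches = [];
    for (const rows of this.rowsByDoc.values()) {
      for (const row of rows) {
        if (row.key === key) matches.push(row);
      }
    }
    return matches;
  }
}

const bySubject = new ViewIndex((doc, meta, emit) => {
  if (doc.subject) emit(doc.subject, 1);
});

bySubject.upsert("t1", { subject: "Geography" });
bySubject.upsert("t2", { subject: "History" });
bySubject.upsert("t2", { subject: "Geography" }); // only t2 is mapped again

const geographyCount = bySubject.query("Geography").length;
console.log(geographyCount, bySubject.mapCalls);
```

After the third call the index holds two Geography rows and no History rows, and the map function has run three times in total rather than once per document per change.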

15. Incremental Reduce
• Required at two levels
  – During cluster-based queries
  – During index updates
• Incremental reduce requires preparation
• Reduce functions must be able to consume their own output
• Roll-your-own only
  – No external libraries
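
To see why a reduce function must be able to consume its own output, consider a simple count. This is a minimal sketch, assuming the map output has been split across two hypothetical nodes (`nodeA`, `nodeB`); the function is called directly here rather than by the server.

```javascript
// View-style reduce: (key, values, rereduce).
// First pass: values are map outputs, so count them.
// Rereduce pass: values are earlier counts, so add them up.
function countReduce(key, values, rereduce) {
  if (rereduce) {
    return values.reduce((total, partial) => total + partial, 0);
  }
  return values.length;
}

// Hypothetical map output held on two cluster nodes.
const nodeA = ["x", "x", "x"];
const nodeB = ["x", "x"];

const partialA = countReduce(null, nodeA, false);
const partialB = countReduce(null, nodeB, false);

// Correct: combine the partial counts with rereduce = true.
const combinedCount = countReduce(null, [partialA, partialB], true);

// Incorrect: treating partial results as raw values just counts the partials.
const naiveCount = countReduce(null, [partialA, partialB], false);

console.log(combinedCount, naiveCount);
```

The combined count covers all five values, while the naive call reports only the number of partial results, which is the error the `rereduce` flag exists to prevent.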

16. Tips for incremental
• Use simple values when possible
• Use complex (JSON) structures
  – Allows for more incremental structure
  – Store the ‘current’ result
  – Store the information needed for the incremental result
• Identify rereduce:
  – function(key, value, rereduce) {}
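
The tips above can be sketched with an average, which cannot be merged from averages alone. This is an illustrative example under assumptions, not code from the presentation: the reduce returns a JSON structure holding the ‘current’ result (`average`) together with the information needed to update it (`sum` and `count`), and the field names and sample numbers are invented here.

```javascript
// Reduce that returns a JSON structure so partial results can be merged.
function averageReduce(key, values, rereduce) {
  const acc = { sum: 0, count: 0, average: null };
  for (const v of values) {
    if (rereduce) {
      // v is an earlier result of this same function.
      acc.sum += v.sum;
      acc.count += v.count;
    } else {
      // v is a raw number emitted by the map function.
      acc.sum += v;
      acc.count += 1;
    }
  }
  if (acc.count > 0) acc.average = acc.sum / acc.count;
  return acc;
}

const firstBatch = averageReduce(null, [10, 20, 30], false);
const secondBatch = averageReduce(null, [40], false);
const mergedStats = averageReduce(null, [firstBatch, secondBatch], true);

console.log(mergedStats);
```

Averaging the two batch averages (20 and 40) would give 30, whereas merging the stored sums and counts gives the correct overall average of 25.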

21. Why is the excitement back?
• Data in is easy; no schema, no formatting, no updates
• Data out is about the stats
  – Not how we are going to produce them
• Queries are live
• Tweaks and updates and extensions are live
• Multiple views, multiple queries
• Reduce is optional (raw data)
• Massive datasets are not a problem