This document discusses building a social analytics tool using MongoDB from a developer's perspective. It covers using MongoDB for its schema-less data and ability to handle fast read-write operations. Key topics include using aggregation queries to gain insights from data by chaining queries together and filtering/manipulating results at each stage. JavaScript capabilities in MongoDB allow applying business logic directly to data. Examples demonstrate removing garbage data and stopwords. Indexes, current progress, and tips/tricks learned around cloning collections and removing vs dropping are also covered, with a demo planned.
8. Why MongoDB for us?
• Schema-less data. Typical data sources
• Adding new social platforms in future
• Needed fast read-write operations
9. Aggregation Queries – Getting Insights
• Combination of queries chained together
• At every stage, we can filter/chain/massage data
Image credit: https://www.openshift.com/blogs/an-overview-of-whats-new-in-mongodb-22
10. Our use-case (esp. for graphs)
• Sentiment Analysis
• Demographic Analysis
• Article Analysis
• Plan
  • Creation of Intelligence tables in advance
• Reality
  • On-the-fly analysis using Aggregation queries
11. How to go about it?
• Operates on a single collection
• Think about the data you have and the insights you want
• Focus on reducing data size early on
  • $match
  • $project
  • $sort
  • $limit, $skip
• Example
db.collectionName.aggregate([
  { "$match" : { fieldName : matchingValue } },
  { "$project" : { oldOrNewField : fieldValue } },
  // $group requires its grouping key to be named _id
  { "$group" : { _id : "$oldOrNewField", "sum" : { "$sum" : 1 } } },
  { "$sort" : { "sum" : -1 } },
  { "$limit" : 20 }
])
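The example pipeline above can be made concrete with a small helper. This is a minimal sketch under assumptions: the helper name buildTopValuesPipeline, the collection name "posts", and the fields "platform" and "sentiment" are hypothetical and do not come from the talk.

```javascript
// Hypothetical helper: builds a "top N values" aggregation pipeline.
// All field names passed in are illustrative assumptions.
function buildTopValuesPipeline(matchField, matchValue, groupField, limit) {
  var match = {};
  match[matchField] = matchValue;

  var project = {};
  project[groupField] = 1;

  return [
    { "$match": match },       // filter first to shrink the data early
    { "$project": project },   // keep the grouping field (plus _id, retained by default)
    { "$group": { "_id": "$" + groupField, "sum": { "$sum": 1 } } },
    { "$sort": { "sum": -1 } },  // most frequent values first
    { "$limit": limit }
  ];
}

// In the mongo shell (assumed collection name "posts"):
//   db.posts.aggregate(buildTopValuesPipeline("platform", "twitter", "sentiment", 20))
var pipeline = buildTopValuesPipeline("platform", "twitter", "sentiment", 20);
```

Keeping $match as the first stage follows the advice above about reducing data size early: later stages only see documents that passed the filter.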
12. Javascript Capabilities
• All the programming capabilities of the JavaScript language at your disposal
• Taking business logic / processing to your data-store
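As one illustration of business logic written in plain JavaScript, stopword removal might look roughly like the sketch below. The stopword list, the function name removeStopwords, and the collection and field names in the comments are assumptions made for illustration, not the code used in the talk.

```javascript
// Hypothetical stopword list; a real list would be much longer.
var STOPWORDS = ["a", "an", "the", "is", "of", "to", "and"];

// Returns the words of `text` in lower case, with stopwords and empty tokens removed.
// Splitting on non-ASCII-alphanumerics is a simplification; other scripts need a
// different tokenizer.
function removeStopwords(text, stopwords) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(function (word) {
      return word.length > 0 && stopwords.indexOf(word) === -1;
    });
}

// In the mongo shell this could be applied per document, for example
// (assumed collection "posts" with a "text" field):
//   db.posts.find().forEach(function (doc) {
//     db.posts.update({ _id: doc._id },
//                     { $set: { words: removeStopwords(doc.text, STOPWORDS) } });
//   });
var words = removeStopwords("The launch of the new phone is a hit!", STOPWORDS);
```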
14. Javascript – Caveats
• Takes up read-write locks on the entire database
• Can be run with the {'nolock' : true} option
db.runCommand({
  eval: <function>,
  args: <args>,
  nolock: <true/false>
})
• Can be replaced by mapreduce in most cases
• Treat it as a one-off case
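To illustrate the map-reduce alternative mentioned above, here is a minimal sketch that counts documents per value of a field. The field name "sentiment", the collection name "posts", and the runLocally helper are hypothetical; runLocally only imitates the server so the two functions can be tried without MongoDB.

```javascript
// `emit` is provided by MongoDB on the server; declared here so the sketch also runs locally.
var emit;

// Map function: called once per document, with `this` bound to the document.
function mapSentiment() {
  emit(this.sentiment, 1);
}

// Reduce function: combines all values emitted for one key.
function reduceCount(key, values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}

// In the mongo shell (assumed collection name "posts"):
//   db.posts.mapReduce(mapSentiment, reduceCount, { out: { inline: 1 } })

// Tiny local stand-in for the server: groups emitted values by key, then reduces each group.
function runLocally(docs, mapFn, reduceFn) {
  var groups = {};
  emit = function (key, value) {
    (groups[key] = groups[key] || []).push(value);
  };
  docs.forEach(function (doc) { mapFn.call(doc); });

  var result = {};
  Object.keys(groups).forEach(function (key) {
    result[key] = reduceFn(key, groups[key]);
  });
  return result;
}

var counts = runLocally(
  [{ sentiment: "positive" }, { sentiment: "negative" }, { sentiment: "positive" }],
  mapSentiment,
  reduceCount
);
```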
16. Our current state
• Faster write operations
  • Under high data load from different sources
• Faster read operations
  • Graph rendering up to 10x quicker
• Ease of scalability
  • Though yet to reach there
17. Work In Progress
• Full-text search implementation
  • can be created only on strings or arrays of strings
  • db.collectionName.ensureIndex( { fieldName : "text" } )
• Capped Collections
  • Widgets for last-run jobs / event log tables
  • Very fast writes possible
  • db.createCollection("cName", { capped : true, size : 5242880, max : 5000 } )
  • size argument is always required
18. Tips / Tricks – Things we learnt
• cloneCollection
  • No more ssh/scp to remote systems
  • db.runCommand({cloneCollection: <nsCollection>, from: <remote>, query: {}})
  • db.cloneCollection(from, collectionName, query)
• db.collectionName.copyTo
  • does not copy indexes
19. Tips / Tricks – Things we learnt
• remove() vs drop()
  • Can't use remove for capped collections
  • remove keeps indexes while drop() clears them
  • To remove all the documents in a collection, use drop()
  • To remove the better part of a large collection, use JavaScript
• pretty() find by default
  • DBQuery.prototype._prettyShell = true (inside your ~/.mongorc.js)