2. The Science in Data Science
• Collect data
• Explore the data, use visualization
• Use math
• Make predictions
• Test predictions
– Collect even more data
• Repeat...
5. MongoDB and Enterprise IT Stack
• Applications
– CRM, ERP, Collaboration, Mobile, BI
• Data Management
– Online Data: RDBMS
– Offline Data: Hadoop, EDW
• Infrastructure
– OS & Virtualization, Compute, Storage, Network
• Security & Auditing
• Management & Monitoring
10. Volume Velocity Variety
• Upserts avoid unnecessary reads
• Asynchronous writes
• Spread writes over multiple shards
• Writes buffered in RAM and flushed to disk in bulk
[Diagram: four data sources]
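The first point can be illustrated without a database. What follows is a minimal in-memory sketch, assuming a hypothetical `upsert` helper and a `Map` standing in for a collection. It is not MongoDB's API; it only shows the idea that one combined "update or insert" call replaces a separate existence check followed by a write.

```javascript
// Hypothetical in-memory stand-in for a collection, keyed by _id.
const counters = new Map();

// Hypothetical helper: increment a counter, creating the document if it is
// missing. The caller issues one call and never checks for existence first.
function upsert(collection, id, increment) {
  const current = collection.get(id) || { _id: id, hits: 0 };
  current.hits += increment;
  collection.set(id, current);
  return current;
}

upsert(counters, "/index.html", 1); // creates the document
upsert(counters, "/index.html", 1); // updates the same document
upsert(counters, "/about.html", 1); // creates a second document
```

In MongoDB itself this corresponds to an update issued with the `upsert: true` option.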
11. Volume Velocity Variety
MongoDB
RDBMS
{
  _id : ObjectId("4c4ba5e5e8aabf3"),
  employee_name : "Dunham, Justin",
  department : "Marketing",
  title : "Product Manager, Web",
  report_up : "Neray, Graham",
  pay_band : "C",
  benefits : [
    { type : "Health", plan : "PPO Plus" },
    { type : "Dental", plan : "Standard" }
  ]
}
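Because the benefits are embedded in the employee document, reading them needs no join. A small plain-JavaScript sketch of that access pattern follows; the `_id` field is left out because `ObjectId` only exists inside the mongo shell, and the variable names are illustrative.

```javascript
// The employee document from the slide, minus _id.
const employee = {
  employee_name: "Dunham, Justin",
  department: "Marketing",
  title: "Product Manager, Web",
  report_up: "Neray, Graham",
  pay_band: "C",
  benefits: [
    { type: "Health", plan: "PPO Plus" },
    { type: "Dental", plan: "Standard" },
  ],
};

// Look up one embedded benefit directly on the document.
const dental = employee.benefits.find((benefit) => benefit.type === "Dental");

// List every benefit type without touching another table.
const benefitTypes = employee.benefits.map((benefit) => benefit.type);
```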
20. Dynamic Queries
Find all logs for a URL
db.logs.find( { 'path' : '/index.html' } )
Find all logs for a time range
db.logs.find( {
  'time' : {
    '$gte' : new Date(2013, 0),
    '$lt' : new Date(2013, 1) }
} )
Find all logs for a host over a range of dates
db.logs.find( {
  'host' : '127.0.0.1',
  'time' : {
    '$gte' : new Date(2013, 0),
    '$lt' : new Date(2013, 1) }
} )
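The range queries above select a half-open interval: `$gte` includes the start and `$lt` excludes the end. The sketch below reproduces that logic in plain JavaScript so it can be traced without a server. The sample documents and the `inRange` helper are hypothetical and not part of the original slides.

```javascript
// Hypothetical sample log documents.
const sampleLogs = [
  { host: "127.0.0.1", path: "/index.html", time: new Date(2013, 0, 15) },
  { host: "127.0.0.1", path: "/about.html", time: new Date(2013, 1, 1) },
  { host: "10.0.0.7", path: "/index.html", time: new Date(2013, 0, 31) },
];

// Plain-JavaScript equivalent of { time: { $gte: start, $lt: end } }.
function inRange(doc, start, end) {
  return doc.time >= start && doc.time < end;
}

const rangeStart = new Date(2013, 0); // 1 January 2013 (months are zero-based)
const rangeEnd = new Date(2013, 1);   // 1 February 2013, excluded by $lt

// All logs in January 2013.
const januaryLogs = sampleLogs.filter((doc) => inRange(doc, rangeStart, rangeEnd));

// January logs for one host, as in the third query.
const januaryLogsForHost = januaryLogs.filter((doc) => doc.host === "127.0.0.1");
```

The document dated exactly 1 February falls outside the range, which is why consecutive monthly ranges built this way never overlap.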
23. Aggregation Framework Benefits
• Real-time
• Simple yet powerful interface
• Scale-out
• Declared in JSON, executes in C++
• Runs inside MongoDB on local data
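To make "declared in JSON" concrete, the sketch below writes a two-stage pipeline as plain data, using the real `$match` and `$group` stages with a `$sum` accumulator, and then spells out what those stages compute in ordinary JavaScript. The sample documents and the `countHitsByPath` function are hypothetical; inside MongoDB the pipeline would be passed to `aggregate()` and evaluated by the server instead.

```javascript
// The pipeline is just data: filter one host, then count documents per path.
const pipeline = [
  { $match: { host: "127.0.0.1" } },
  { $group: { _id: "$path", hits: { $sum: 1 } } },
];

// Hypothetical sample documents.
const accessLogs = [
  { host: "127.0.0.1", path: "/index.html" },
  { host: "127.0.0.1", path: "/index.html" },
  { host: "127.0.0.1", path: "/about.html" },
  { host: "10.0.0.7", path: "/index.html" },
];

// Hypothetical hand-written equivalent of the two stages above.
function countHitsByPath(docs, host) {
  const hits = {};
  for (const doc of docs) {
    if (doc.host !== host) continue;            // $match
    hits[doc.path] = (hits[doc.path] || 0) + 1; // $group with $sum: 1
  }
  return hits;
}

const hitsByPath = countHitsByPath(accessLogs, "127.0.0.1");
```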
29. MongoDB Map/Reduce Benefits
• Runs inside MongoDB
• Sharding supported
• JavaScript
– Pro: functionality, expressiveness
– Con: overhead
• Input can be a collection or query!
• Output directly to document or collection
• Easy when you don’t want the overhead of Hadoop
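The JavaScript functions used by MongoDB map/reduce have a particular shape: the map function sees the current document as `this` and calls `emit(key, value)`, and the reduce function receives a key with an array of values. The sketch below shows that shape together with a tiny local harness, `runMapReduce`, which is hypothetical and only exists so the functions can be run outside the database.

```javascript
// Map: `this` is the current document; emit one (path, 1) pair per log entry.
function mapFn() {
  emit(this.path, 1);
}

// Reduce: combine all values emitted for one key.
function reduceFn(key, values) {
  return values.reduce((total, value) => total + value, 0);
}

// Hypothetical local harness standing in for the server.
function runMapReduce(docs, map, reduce) {
  const grouped = new Map();
  // Provide a global emit() while the map function runs.
  globalThis.emit = (key, value) => {
    if (!grouped.has(key)) grouped.set(key, []);
    grouped.get(key).push(value);
  };
  for (const doc of docs) map.call(doc);
  delete globalThis.emit;

  const output = [];
  for (const [key, values] of grouped) {
    // As in MongoDB, reduce is skipped for keys with a single value.
    const value = values.length === 1 ? values[0] : reduce(key, values);
    output.push({ _id: key, value });
  }
  return output;
}

const pageCounts = runMapReduce(
  [{ path: "/index.html" }, { path: "/about.html" }, { path: "/index.html" }],
  mapFn,
  reduceFn
);
```

Because reduce may be skipped or applied repeatedly to partial results, it should return a value of the same form as the values it receives, as the sum here does.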
34. How it works
• Adapter examines MongoDB input collection and calculates a set of splits from data
• Each split is assigned to a Hadoop node
• In parallel, Hadoop pulls data from splits on MongoDB (or BSON) and starts processing locally
• Hadoop merges results and streams output back to MongoDB (or BSON) output collection
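The four steps above can be traced with a toy model. This is only a sketch under stated assumptions: the function names are hypothetical, splits are fixed-size slices of an array instead of ranges computed by the connector, and the splits are processed one after another here where a real cluster would work on them in parallel.

```javascript
// Step 1 (hypothetical): cut the input into fixed-size splits.
function calculateSplits(docs, splitSize) {
  const splits = [];
  for (let i = 0; i < docs.length; i += splitSize) {
    splits.push(docs.slice(i, i + splitSize));
  }
  return splits;
}

// Steps 2-3 (hypothetical): the work one node does on its own split,
// here counting documents per host.
function processSplit(split) {
  const partial = {};
  for (const doc of split) {
    partial[doc.host] = (partial[doc.host] || 0) + 1;
  }
  return partial;
}

// Step 4 (hypothetical): merge the partial results into one output.
function mergeResults(partials) {
  const merged = {};
  for (const partial of partials) {
    for (const [host, count] of Object.entries(partial)) {
      merged[host] = (merged[host] || 0) + count;
    }
  }
  return merged;
}

const inputDocs = [
  { host: "a" }, { host: "b" }, { host: "a" }, { host: "a" }, { host: "b" },
];
const inputSplits = calculateSplits(inputDocs, 2);
const mergedCounts = mergeResults(inputSplits.map(processSplit));
```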
40. Map Phase – each document goes through the mapper function
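@Override
public void map(NullWritable key, BSONObject val, final Context context)
        throws IOException, InterruptedException {
    // Each input document carries its mail headers in a "headers" sub-document.
    BSONObject headers = (BSONObject) val.get("headers");
    if (headers.containsKey("From") && headers.containsKey("To")) {
        String from = (String) headers.get("From");
        String to = (String) headers.get("To");
        // The "To" header may list several comma-separated recipients.
        String[] recips = to.split(",");
        for (int i = 0; i < recips.length; i++) {
            String recip = recips[i].trim();
            // Emit one (sender, recipient) pair with a count of 1.
            context.write(new MailPair(from, recip), new IntWritable(1));
        }
    }
}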
41. Reduce Phase – map outputs are grouped by key and passed to the reducer
@Override
public void reduce(final MailPair pKey, final Iterable<IntWritable> pValues,
                   final Context pContext) throws IOException, InterruptedException {
    // Sum the 1s emitted by the mapper for this (sender, recipient) pair.
    int sum = 0;
    for (final IntWritable value : pValues) {
        sum += value.get();
    }
    // Build the output key document: { f: <from>, t: <to> }.
    BSONObject outDoc = BasicDBObjectBuilder.start()
            .add("f", pKey.from)
            .add("t", pKey.to)
            .get();
    BSONWritable pkeyOut = new BSONWritable(outDoc);
    pContext.write(pkeyOut, new IntWritable(sum));
}
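To see what the two phases produce together, the same logic can be traced on a few documents in plain JavaScript. This is a sketch for illustration only: the sample emails, `mapEmail`, and `countPairs` are hypothetical, and nothing here runs on Hadoop.

```javascript
// Hypothetical sample documents shaped like the mapper's input.
const emails = [
  { headers: { From: "alice@example.com", To: "bob@example.com, carol@example.com" } },
  { headers: { From: "alice@example.com", To: "bob@example.com" } },
  { headers: { Subject: "no sender or recipient" } },
];

// Map step: one { f, t } pair per recipient; documents missing From or To
// produce nothing, as in the Java mapper.
function mapEmail(doc) {
  const headers = doc.headers || {};
  if (!("From" in headers) || !("To" in headers)) return [];
  return headers.To.split(",").map((to) => ({ f: headers.From, t: to.trim() }));
}

// Group by pair and sum, which is what the shuffle plus the reducer achieve.
function countPairs(docs) {
  const counts = new Map();
  for (const doc of docs) {
    for (const pair of mapEmail(doc)) {
      const key = JSON.stringify(pair);
      counts.set(key, (counts.get(key) || 0) + 1);
    }
  }
  return [...counts].map(([key, count]) => ({ ...JSON.parse(key), count }));
}

const pairCounts = countPairs(emails);
```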
43. Hadoop Connector Benefits
• Full multi-core parallelism to process MongoDB data
• mongo.input.query
• Full integration w/ Hadoop and JVM ecosystem
• Mahout, et al.
• Can be used on Amazon Elastic MapReduce
• Read and write backup files to local, HDFS and S3
• Vanilla Java MapReduce, Hadoop Streaming, Pig, Hive
45. A/B testing
• Hey, it looks like teenage girls clicked a lot on that ad
with a pink background...
• Hypothesis: Given otherwise the same ad, teenage
girls are more likely to click on ads with pink
backgrounds than white
• Test 50-50 pink vs white ads
• Collect click stream stats in MongoDB or Hadoop
• Analyze results
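One common way to carry out the "analyze results" step is a two-proportion z-test on the click-through rates of the two variants. The slides do not name a specific test, so this is a sketch of one reasonable choice; the function name and the click counts are hypothetical.

```javascript
// Two-proportion z-test: compares the click-through rates of variants A and B
// using the pooled rate to estimate the standard error.
function twoProportionZ(clicksA, viewsA, clicksB, viewsB) {
  const rateA = clicksA / viewsA;
  const rateB = clicksB / viewsB;
  const pooled = (clicksA + clicksB) / (viewsA + viewsB);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / viewsA + 1 / viewsB)
  );
  return { rateA, rateB, z: (rateA - rateB) / standardError };
}

// Hypothetical counts: pink ad (A) vs white ad (B), 1000 views each.
const abResult = twoProportionZ(120, 1000, 90, 1000);

// |z| above 1.96 is the usual two-sided 5% significance threshold.
const isSignificant = Math.abs(abResult.z) > 1.96;
```

With a 50-50 split both variants see similar traffic, which keeps the standard error of the comparison as small as the total sample allows.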
46. Recommendations – social filtering
• “Customers who bought this book also bought”
• Computed offline / nightly
• As easy as it sounds!
Google it: Amazon item-to-item algorithm
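A toy version of an "also bought" computation can be written as cosine similarity between items, where each item is represented by the set of customers who bought it. This is a simplified sketch in the spirit of item-to-item filtering, not Amazon's implementation; the purchase data and function names are hypothetical.

```javascript
// Hypothetical purchase history: customer -> items bought.
const purchases = {
  ann: ["book-a", "book-b"],
  ben: ["book-a", "book-b", "book-c"],
  cat: ["book-a", "book-c"],
  dan: ["book-d"],
};

// Invert the history: item -> set of customers who bought it.
function buyersByItem(history) {
  const buyers = new Map();
  for (const [customer, items] of Object.entries(history)) {
    for (const item of items) {
      if (!buyers.has(item)) buyers.set(item, new Set());
      buyers.get(item).add(customer);
    }
  }
  return buyers;
}

// Cosine similarity of two binary buyer vectors:
// shared buyers divided by the geometric mean of the two buyer counts.
function similarity(buyersA, buyersB) {
  let shared = 0;
  for (const customer of buyersA) if (buyersB.has(customer)) shared += 1;
  return shared / Math.sqrt(buyersA.size * buyersB.size);
}

// Rank the other items by similarity to the given one.
function alsoBought(history, item, limit) {
  const buyers = buyersByItem(history);
  const target = buyers.get(item);
  if (!target) return [];
  return [...buyers]
    .filter(([other]) => other !== item)
    .map(([other, set]) => ({ item: other, score: similarity(target, set) }))
    .filter((entry) => entry.score > 0)
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}

const recommendations = alsoBought(purchases, "book-b", 2);
```

The full item-to-item table depends only on purchase history, so it can be rebuilt offline or nightly and then served with a simple lookup.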
47. Personalization
• “Even if you are a teenage girl, you seem to be 60%
more likely to click on blue ads than pink.”
• User-specific recommendations are a hybrid of offline &
online recommendations
• User profile in MongoDB
• May even be updated in real time