Big Data: Guidelines and Examples for the Enterprise Decision Maker
1. Big Data: Examples and Guidelines for the Enterprise Decision Maker
Buzz Moschetti
Solutions Architect, MongoDB
buzz.moschetti@mongodb.com
#MongoDB
2. Who is your Presenter?
• Yes, I use “Buzz” on my business cards
• Former Investment Bank Chief Architect at
JPMorganChase and Bear Stearns before that
• Over 25 years of designing and building systems
• Big and small
• Super-specialized to broadly useful in any vertical
• “Traditional” to completely disruptive
• Advocate of language leverage and strong factoring
• Still programming – using emacs, of course
3. Agenda
• (Occasionally) Brutal Truths about Big Data
• Review of Directed Content Business Architecture
• A Simple Technical Implementation
4. Truths
• Clear definition of Big Data still maturing
• Efficiently operationalizing Big Data is non-trivial
• Developing, debugging, understanding MapReduce
• Cluster monitoring & management, job scheduling/recovery
• If you thought regular ETL Hell was bad….
• Big Data is not about math/set accuracy
• The last 25,000 items in a 25,497,612-item set “don’t matter”
• Big Data questions are best asked periodically
• “Are we there yet?”
• Realtime means … realtime
5. It’s About The Functions, not the Terms
DON’T ASK:
• Is this an operations or an analytics problem?
• Is this online or offline?
• What query language should we use?
• What is my integration strategy across tools?
ASK INSTEAD:
• Am I incrementally addressing data (esp. writes)?
• Am I computing a precise answer or a trend?
• Do I need to operate on this data in realtime?
• What is my holistic architecture?
6. What We’re Going to “Build” Today
Realtime Directed Content System
• Based on what users click, “recommended” content is returned in addition to the target
• The example is sector-neutral (manufacturing, financial services, retail)
• System dynamically updates behavior in response to user activity
7. The Participants and Their Roles
The Directed Content System brings together:
• Customers
• Content Creators: generate and tag content from a known domain of tags
• Management/Strategy: make decisions based on trends and other summarized data
• Analysts/Data Scientists: operate on data to identify trends and develop tag domains
• Developers/ProdOps: bring it all together: apps, SDLC, integration, etc.
8. Priority #1: Maximizing User Value
Considerations/Requirements
Maximize realtime user value and experience
Provide management reporting and trend analysis
Engineer for Day 2 agility on recommendation engine
Provide scrubbed click history for customer
Permit low-cost horizontal scaling
Minimize technical integration
Minimize technical footprint
Use conventional and/or approved tools
Provide a RESTful service layer
…..
10. Complementary Strengths
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
mongoDB:
• Standard design paradigm (objects, tools, 3rd-party products, IDEs, test drivers, skill pool, etc. etc.)
• Language flexibility (Java, C#, C++, Python, Scala, …)
• Webscale deployment model: appservers, DMZ, monitoring
• High-performance rich-shape CRUD
Hadoop:
• MapReduce design paradigm
• Node deployment model
• Very large set operations
• Computationally intensive, longer duration
• Read-dominated workload
11. “Legacy” Approach: Somewhat Unidirectional
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
• Extract data from mongoDB and other sources nightly (or weekly)
• Run analytics
• Generate reports for people to read
• Where’s the feedback?
12. Somewhat Better Approach
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
• Extract data from mongoDB and other sources nightly (or weekly)
• Run analytics
• Generate reports for people to read
• Move important summary data back to mongoDB for consumption by apps
13. …but the overall problem remains:
• How do you integrate and operate in realtime on both periodically generated data and current realtime data?
• Lackluster integration between OLTP and Hadoop
• It’s not just about the database: you need a realtime profile and profile update function
14. The legacy problem in pseudocode
onContentClick() {
    String[] tags = content.getTags();
    Resource[] r = f1(database, tags);
}
• Realtime intraday state not well-handled
• Baselining is a different problem than click handling
15. The Right Approach
• Users have a specific Profile entity
• The Profile captures trend analytics as baselining information
• The Profile has per-tag “counters” that are updated with each interaction / click
• Counters plus baselining are passed to the fetch function
• The fetch function itself could be dynamic!
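The slides describe per-tag counters and baselines but never show the Profile's shape. Here is a minimal sketch in Python; the field names ("tags", "hist", "baseline", "lastUpdate") are assumptions for illustration, not taken from the deck:

```python
from datetime import datetime, timezone

# Hypothetical Profile document shape; field names are illustrative
# assumptions, not taken from the slides.
def new_profile(user):
    return {
        "user": user,
        "lastUpdate": None,
        "tags": {},  # per-tag counters plus baseline from the nightly run
    }

def record_click(profile, tag, when=None):
    """Update the per-tag counter and click history on each interaction."""
    when = when or datetime.now(timezone.utc)
    entry = profile["tags"].setdefault(
        tag, {"count": 0, "hist": [], "baseline": 0.0})
    entry["count"] += 1          # the per-tag "counter"
    entry["hist"].append(when)   # click history, trimmed later by maintenance
    profile["lastUpdate"] = when
    return profile

p = new_profile("bob")
record_click(p, "PETS")
record_click(p, "PETS")
print(p["tags"]["PETS"]["count"])  # 2
```

The baseline field stays untouched during the day; only the nightly Hadoop run rewrites it, which is what keeps click handling and baselining separate concerns.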
16. 24 hours in the life of The System
• Assume some content has been created and tagged
• Two systemetized tags: Pets & PowerTools
17. Monday, 1:30AM EST
• Fetch all user Profiles from mongoDB; load into Hadoop
• Or skip if using the mongoDB-Hadoop connector!
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
18. mongoDB-Hadoop MapReduce Example
public class ProfileMapper
    extends Mapper<Object, BSONObject, IntWritable, IntWritable>
{
    @Override
    public void map(final Object pKey,
                    final BSONObject pValue,
                    final Context pContext)
        throws IOException, InterruptedException {

        String user = (String) pValue.get("user");
        Date d1 = (Date) pValue.get("lastUpdate");

        // "tags" is a subdocument keyed by tag name; each tag
        // carries a "hist" list of click timestamps
        BSONObject tags = (BSONObject) pValue.get("tags");
        Set<String> keys = tags.keySet();
        int count = 0;
        for (String tag : keys) {
            BSONObject t = (BSONObject) tags.get(tag);
            count += ((List<?>) t.get("hist")).size();
        }
        int avg = count / keys.size();
        pContext.write(new IntWritable(count), new IntWritable(avg));
    }
}
19. Monday, 1:45AM EST
• Grind through all content data and user Profile data to produce:
• Tags based on feature extraction (vs. creator-applied tags)
• Trend baseline per user for tags Pets and PowerTools
• Load Profiles with new baseline back into mongoDB
• Or skip if using the mongoDB-Hadoop connector!
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
20. Monday, 8AM EST
• User Bob logs in and his Profile is retrieved from mongoDB
• Bob clicks on Content X, which is already tagged as “Pets”
• Bob has clicked on Pets-tagged content many times
• Adjust Profile for tag “Pets” and save back to mongoDB
• Analysis = f(Profile)
• Analysis can be “anything”; it is simply a result. It could trigger an ad, a compliance alert, etc.
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
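The "adjust Profile and save back to mongoDB" step can be expressed as a single update document. A hedged sketch that only builds that document ($inc, $push, and $set are standard MongoDB update operators; the collection and field names are assumptions):

```python
# Build the update spec for one click. Field paths like "tags.PETS.count"
# are illustrative assumptions about the Profile shape.
def click_update_spec(tag, ts):
    return {
        "$inc": {f"tags.{tag}.count": 1},   # bump the per-tag counter
        "$push": {f"tags.{tag}.hist": ts},  # append to the click history
        "$set": {"lastUpdate": ts},
    }

spec = click_update_spec("PETS", "2014-06-02T08:00:00Z")
# With a driver such as pymongo this would be applied roughly as:
#   db.profiles.update_one({"user": "bob"}, spec, upsert=True)
print(spec["$inc"])  # {'tags.PETS.count': 1}
```

Doing this as one update rather than read-modify-write keeps the realtime path to a single round trip per click.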
21. Monday, 8:02AM EST
• Bob clicks on Content Y, which is already tagged as “Spices”
• Spices is a new tag type for Bob
• Adjust Profile for tag “Spices” and save back to mongoDB
• Analysis = f(Profile)
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
23. Tag-based algorithm detail
getRecommendedContent(profile, [“PETS”, other]) {
    if algo for a tag available {
        filter = algo(profile, tag);
    }
    fetch N recommendations(filter);
}

A4(profile, tag) {
    weight = get tag (“PETS”) global weighting;
    adjustForPersonalBaseline(weight, “PETS” baseline);
    if “PETS” clicked more than 2 times in past 10 mins
        then weight += 10;
    if “PETS” clicked more than 10 times in past 2 days
        then weight += 3;
    return new filter({“PETS”, weight}, globals);
}
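The A4 pseudocode can be made executable. A Python sketch, in which the global weight table, the additive baseline adjustment, and the time thresholds are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone

# Executable sketch of the A4 algorithm. Combines nearterm click history
# ("hist"), the per-user baseline from the nightly run, and system-wide
# global weights; all numbers are illustrative assumptions.
def a4_weight(profile, tag, global_weights, now=None):
    now = now or datetime.now(timezone.utc)
    entry = profile["tags"].get(tag, {"hist": [], "baseline": 0.0})
    weight = global_weights.get(tag, 1.0)   # system-wide global weighting
    weight += entry["baseline"]             # stand-in for adjustForPersonalBaseline()
    hist = entry["hist"]
    recent = [t for t in hist if now - t <= timedelta(minutes=10)]
    daily = [t for t in hist if now - t <= timedelta(days=2)]
    if len(recent) > 2:   # clicked more than 2 times in past 10 mins
        weight += 10
    if len(daily) > 10:   # clicked more than 10 times in past 2 days
        weight += 3
    return weight

now = datetime.now(timezone.utc)
profile = {"tags": {"PETS": {"hist": [now] * 3, "baseline": 0.5}}}
print(a4_weight(profile, "PETS", {"PETS": 2.0}, now))  # 12.5
```

Note how the three inputs line up with the speaker note on this slide: hist in the Profile (nearterm), baseline (nightly aggregate), and the globals table (system-wide).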
24. Tuesday, 1AM EST
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
• Fetch all user Profiles from mongoDB; load into Hadoop
• Or skip if using the mongoDB-Hadoop connector!
25. Tuesday, 1:30AM EST
• Grind through all content data and user Profile data to produce:
• Tags based on feature extraction (vs. creator-applied tags)
• Trend baseline for Pets and PowerTools and Spices
• Data can be specific to an individual or to a group
• Load baseline back into mongoDB
• Or skip if using the mongoDB-Hadoop connector!
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
27. Tuesday, 1:35AM EST
• Perform maintenance on user Profiles
• Click history trimming (variety of algorithms)
• “Dead tag” removal
• Update of auxiliary reference data
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
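One possible shape for the slide 27 maintenance pass. The 30-day window and the dead-tag rule are assumptions; the deck only says a "variety of algorithms" exist:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical maintenance pass: trim click history and drop "dead" tags
# (no recent clicks and no baseline). Window and rules are assumptions.
def maintain(profile, now=None, keep=timedelta(days=30)):
    now = now or datetime.now(timezone.utc)
    for tag in list(profile["tags"]):
        entry = profile["tags"][tag]
        # click history trimming: keep only entries inside the window
        entry["hist"] = [t for t in entry["hist"] if now - t <= keep]
        # "dead tag" removal: no history left and no baseline signal
        if not entry["hist"] and entry.get("baseline", 0) == 0:
            del profile["tags"][tag]
    return profile

now = datetime.now(timezone.utc)
p = {"tags": {
    "PETS":   {"hist": [now - timedelta(days=60), now], "baseline": 1.2},
    "SPICES": {"hist": [now - timedelta(days=90)], "baseline": 0},
}}
maintain(p, now)
print(sorted(p["tags"]))  # ['PETS']
```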
29. Feel free to run the baselining more frequently
… but avoid “Are We There Yet?”
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
30. Nearterm / Realtime Questions & Actions
With respect to the Customer:
• What has Bob done over the past 24 hours?
• Given an input, make a logic decision in 100ms or less
With respect to the Provider:
• What are all current users doing or looking at?
• Can we correlate single events to shifts in behavior in the near term?
31. Longterm / Not Realtime Questions & Actions
With respect to the Customer:
• Any way to explain historic performance / actions?
• What are recommendations for the future?
With respect to the Provider:
• Can we correlate multiple events from multiple sources over a long period of time to identify trends?
• What is my entire customer base doing over 2 years?
• Show me a time vs. aggregate tag hit chart
• Slice and dice and aggregate tags vs. XYZ
• What tags are trending up or down?
32. The Key To Success: It is One System
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
Hello, this is Buzz Moschetti; welcome to the webinar entitled “Big Data…”. If your travel plans do not include Big Data, please exit the aircraft and see a customer agent.
Today we are going to explore using mongoDB and Hadoop in a well integrated way to solve a familiar but chronically thorny problem in the directed content space.
We’ll cover the agenda in just a sec but first some logistics:
The presentation audio & slides will be recorded and made available to you in about 24 hours.
We have an hour set up but I’ll use about 40 minutes of that for the presentation with some time for questions.
You can use the webex Q&A box to ask questions at any time but I will wait until the end of the presentation to address them.
If you have technical issues, please send a webex message to the participant ID’d as mongoDB webinar team; otherwise keep your Qs focused on the content.
I am a fan of presentations that are useful after the presentation so you’ll see lots of text, code, etc.
Clear definition: lots of terms (online, analytical). We speak of the Three V’s and that’s good – but then what?
Also: Big Data platform MAY need to perform known, tuned operations on data AND also provide a sandbox for analysis and experimentation.
For this, tech and performance/flexibility tradeoffs, not to mention SDLC controls, will likely be different.
Operationalizing: One-off experiments do not equal a day to day production environment
Math/set accuracy: if you care about today’s EOD close for EMEA, that’s not big data. If you care about the latest price adjustment for widget X, that’s not big data.
Realtime means millisecond response. Not 10 seconds. Not 2 seconds.
DON’T ASK
Terms like online / offline are vague
Looking at integration strategy puts you on the path to creating islands of tech. Unfortunate part is the tech might actually be OK but the overall solution is impaired by second-class bridging schemes.
Agenda item #2: The example: A realtime directed content system using mongoDB and Hadoop.
Analysts may also develop machine learning and other approaches to auto tag content.
Pretty simple, eh? Let’s start with the basics.
High perf CRUD includes index-optimized queries, aggregation, etc.
Still very batchy
Traditional approaches tend to treat the OLTP and Hadoop sides of the house separately
Chicken and egg
Pets & PowerTools : Both together a little scary (ha!) but it really doesn’t matter what the tags are. Taxonomy / ontology is related but independent of the Big Data machinery in play here. That is the output of the Analysts/Data Scientists
By systemetized we mean where we know how we want to optimize behavior. Other tags can exist that are not systemetized – and we’ll see an example of that.
It’s Sunday night – we’re going to do the weekly trend baselining
The mongoDB Hadoop connector speaks MapReduce on one side and mongoDB driver API on the other.
If you’re NOT running a 1000-node Hadoop cluster, chances are you can significantly benefit from the connector to:
Eliminate ETL
Manage a single “data lake”
Serving suggestion – but the important part is that the MapReduce job gets a rich BSONObject to work with, not a String[] cracked from a CSV!
Useful for development
VITAL for day 2 agility because rich types can flow and be addressed by name, not position.
IMPORTANT: We are saving the profile back to mongoDB! This is the realtime update component.
Even in a collection 1,000,000 entries bigger than this, the find takes about 1ms.
Updates run at thousands per second.
personalData is there to assist algos. ActivityProfile may or may not be co-mingled with AccountProfile.
PETS tag has been hit a lot
SPICE is new so no algo yet…
Pseudocode!
The point is that the A4 algo can flexibly deal with nearterm data stored in the Profile histlist PLUS per-user aggregated baseline PLUS system-wide globals
SPICE is not yet systemetized by analysts so no special algo assigned; default algo / weighting will be used.
Later, analysts can change the nightly grind run.
Chopped off 4 entries on baseline, trimmed up hist. Changed zip.
In our example we ran baselining nightly.
Running hourly or by minute does not add significant information value across large data sets.
Maybe weekly is better? Less burden? More time to observe effects of changes to algos?
The point is the actions you take in realtime, nearterm interaction with the system are different than those computed over huge sets of data over long periods of time.