Big Data: Guidelines and Examples for the Enterprise Decision Maker
1. Big Data: Examples and Guidelines for the Enterprise Decision Maker
Buzz Moschetti
Solutions Architect, MongoDB
buzz.moschetti@mongodb.com
#MongoDB
2. Who is your Presenter?
• Yes, I use “Buzz” on my business cards
• Former Investment Bank Chief Architect at
JPMorganChase and Bear Stearns before that
• Over 25 years of designing and building systems
• Big and small
• Super-specialized to broadly useful in any vertical
• “Traditional” to completely disruptive
• Advocate of language leverage and strong factoring
• Still programming – using emacs, of course
3. Agenda
• (Occasionally) Brutal Truths about Big Data
• Review of Directed Content Business Architecture
• A Simple Technical Implementation
4. Truths
• Clear definition of Big Data still maturing
• Efficiently operationalizing Big Data is non-trivial
• Developing, debugging, understanding MapReduce
• Cluster monitoring & management, job scheduling/recovery
• If you thought regular ETL Hell was bad….
• Big Data is not about math/set accuracy
• The last 25,000 items in a 25,497,612-item set “don’t matter”
• Big Data questions are best asked periodically
• “Are we there yet?”
• Realtime means … realtime
5. It’s About The Functions, not the Terms
DON’T ASK:
• Is this an operations or an analytics problem?
• Is this online or offline?
• What query language should we use?
• What is my integration strategy across tools?
ASK INSTEAD:
• Am I incrementally addressing data (esp. writes)?
• Am I computing a precise answer or a trend?
• Do I need to operate on this data in realtime?
• What is my holistic architecture?
6. What We’re Going to “Build” Today
Realtime Directed Content System
• Based on what users click, “recommended” content is returned in addition to the target
• The example is sector-neutral (manufacturing, financial services, retail)
• System dynamically updates behavior in response to user activity
7. The Participants and Their Roles
The Directed Content System brings together:
• Customers
• Content Creators: generate and tag content from a known domain of tags
• Management/Strategy: make decisions based on trends and other summarized data
• Analysts/Data Scientists: operate on data to identify trends and develop tag domains
• Developers/ProdOps: bring it all together: apps, SDLC, integration, etc.
8. Priority #1: Maximizing User Value
Considerations/Requirements
Maximize realtime user value and experience
Provide management reporting and trend analysis
Engineer for Day 2 agility on recommendation engine
Provide scrubbed click history for customer
Permit low-cost horizontal scaling
Minimize technical integration
Minimize technical footprint
Use conventional and/or approved tools
Provide a RESTful service layer
…..
10. Complementary Strengths
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
mongoDB:
• Standard design paradigm (objects, tools, 3rd-party products, IDEs, test drivers, skill pool, etc. etc.)
• Language flexibility (Java, C#, C++, Python, Scala, …)
• Webscale deployment model: appservers, DMZ, monitoring
• High-performance rich-shape CRUD
Hadoop:
• MapReduce design paradigm
• Node deployment model
• Very large set operations
• Computationally intensive, longer duration
• Read-dominated workload
11. “Legacy” Approach: Somewhat Unidirectional
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
• Extract data from mongoDB and other sources nightly (or weekly)
• Run analytics
• Generate reports for people to read
• Where’s the feedback?
12. Somewhat Better Approach
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
• Extract data from mongoDB and other sources nightly (or weekly)
• Run analytics
• Generate reports for people to read
• Move important summary data back to mongoDB for consumption by apps
13. …but the overall problem remains:
• How do you integrate and operate in realtime on both periodically generated data and current realtime data?
• Lackluster integration between OLTP and Hadoop
• It’s not just about the database: you need a realtime profile and profile update function
14. The legacy problem in pseudocode
onContentClick() {
    String[] tags = content.getTags();
    Resource[] r = f1(database, tags);
}
• Realtime intraday state not well-handled
• Baselining is a different problem than click handling
15. The Right Approach
• Users have a specific Profile entity
• The Profile captures trend analytics as baselining information
• The Profile has per-tag “counters” that are updated with each interaction / click
• Counters plus baselining are passed to the fetch function
• The fetch function itself could be dynamic!
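The slides describe per-tag counters and baselines but never show the Profile's shape. Here is a minimal sketch in Python; the field names ("tags", "hist", "baseline", "lastUpdate") are assumptions for illustration, not taken from the deck:

```python
from datetime import datetime, timezone

# Hypothetical Profile document shape; field names are illustrative
# assumptions, not taken from the slides.
def new_profile(user):
    return {
        "user": user,
        "lastUpdate": None,
        "tags": {},  # per-tag counters plus baseline from the nightly run
    }

def record_click(profile, tag, when=None):
    """Update the per-tag counter and click history on each interaction."""
    when = when or datetime.now(timezone.utc)
    entry = profile["tags"].setdefault(
        tag, {"count": 0, "hist": [], "baseline": 0.0})
    entry["count"] += 1          # the per-tag "counter"
    entry["hist"].append(when)   # click history, trimmed later by maintenance
    profile["lastUpdate"] = when
    return profile

p = new_profile("bob")
record_click(p, "PETS")
record_click(p, "PETS")
print(p["tags"]["PETS"]["count"])  # 2
```

The baseline field stays untouched during the day; only the nightly Hadoop run rewrites it, which is what keeps click handling and baselining separate concerns.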
16. 24 hours in the life of The System
• Assume some content has been created and tagged
• Two systemetized tags: Pets & PowerTools
17. Monday, 1:30AM EST
• Fetch all user Profiles from mongoDB; load into Hadoop
• Or skip if using the mongoDB-Hadoop connector!
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
18. mongoDB-Hadoop MapReduce Example
public class ProfileMapper
    extends Mapper<Object, BSONObject, IntWritable, IntWritable>
{
    @Override
    public void map(final Object pKey,
                    final BSONObject pValue,
                    final Context pContext)
        throws IOException, InterruptedException {

        String user = (String) pValue.get("user");
        Date d1 = (Date) pValue.get("lastUpdate");

        // "tags" is a subdocument keyed by tag name; each tag
        // carries a "hist" list of click timestamps
        BSONObject tags = (BSONObject) pValue.get("tags");
        Set<String> keys = tags.keySet();
        int count = 0;
        for (String tag : keys) {
            BSONObject t = (BSONObject) tags.get(tag);
            count += ((List<?>) t.get("hist")).size();
        }
        int avg = count / keys.size();
        pContext.write(new IntWritable(count), new IntWritable(avg));
    }
}
19. Monday, 1:45AM EST
• Grind through all content data and user Profile data to produce:
• Tags based on feature extraction (vs. creator-applied tags)
• Trend baseline per user for tags Pets and PowerTools
• Load Profiles with new baseline back into mongoDB
• Or skip if using the mongoDB-Hadoop connector!
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
20. Monday, 8AM EST
• User Bob logs in and his Profile is retrieved from mongoDB
• Bob clicks on Content X, which is already tagged as “Pets”
• Bob has clicked on Pets-tagged content many times
• Adjust Profile for tag “Pets” and save back to mongoDB
• Analysis = f(Profile)
• Analysis can be “anything”; it is simply a result. It could trigger an ad, a compliance alert, etc.
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
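The "adjust Profile and save back to mongoDB" step can be expressed as a single update document. A hedged sketch that only builds that document ($inc, $push, and $set are standard MongoDB update operators; the collection and field names are assumptions):

```python
# Build the update spec for one click. Field paths like "tags.PETS.count"
# are illustrative assumptions about the Profile shape.
def click_update_spec(tag, ts):
    return {
        "$inc": {f"tags.{tag}.count": 1},   # bump the per-tag counter
        "$push": {f"tags.{tag}.hist": ts},  # append to the click history
        "$set": {"lastUpdate": ts},
    }

spec = click_update_spec("PETS", "2014-06-02T08:00:00Z")
# With a driver such as pymongo this would be applied roughly as:
#   db.profiles.update_one({"user": "bob"}, spec, upsert=True)
print(spec["$inc"])  # {'tags.PETS.count': 1}
```

Doing this as one update rather than read-modify-write keeps the realtime path to a single round trip per click.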
21. Monday, 8:02AM EST
• Bob clicks on Content Y, which is already tagged as “Spices”
• Spices is a new tag type for Bob
• Adjust Profile for tag “Spices” and save back to mongoDB
• Analysis = f(Profile)
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
23. Tag-based algorithm detail
getRecommendedContent(profile, [“PETS”, other]) {
    if algo for a tag available {
        filter = algo(profile, tag);
    }
    fetch N recommendations(filter);
}

A4(profile, tag) {
    weight = get tag (“PETS”) global weighting;
    adjustForPersonalBaseline(weight, “PETS” baseline);
    if “PETS” clicked more than 2 times in past 10 mins
        then weight += 10;
    if “PETS” clicked more than 10 times in past 2 days
        then weight += 3;
    return new filter({“PETS”, weight}, globals);
}
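The A4 pseudocode can be made executable. A Python sketch, in which the global weight table, the additive baseline adjustment, and the time thresholds are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone

# Executable sketch of the A4 algorithm. Combines nearterm click history
# ("hist"), the per-user baseline from the nightly run, and system-wide
# global weights; all numbers are illustrative assumptions.
def a4_weight(profile, tag, global_weights, now=None):
    now = now or datetime.now(timezone.utc)
    entry = profile["tags"].get(tag, {"hist": [], "baseline": 0.0})
    weight = global_weights.get(tag, 1.0)   # system-wide global weighting
    weight += entry["baseline"]             # stand-in for adjustForPersonalBaseline()
    hist = entry["hist"]
    recent = [t for t in hist if now - t <= timedelta(minutes=10)]
    daily = [t for t in hist if now - t <= timedelta(days=2)]
    if len(recent) > 2:   # clicked more than 2 times in past 10 mins
        weight += 10
    if len(daily) > 10:   # clicked more than 10 times in past 2 days
        weight += 3
    return weight

now = datetime.now(timezone.utc)
profile = {"tags": {"PETS": {"hist": [now] * 3, "baseline": 0.5}}}
print(a4_weight(profile, "PETS", {"PETS": 2.0}, now))  # 12.5
```

Note how the three inputs line up with the speaker note on this slide: hist in the Profile (nearterm), baseline (nightly aggregate), and the globals table (system-wide).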
24. Tuesday, 1AM EST
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
• Fetch all user Profiles from mongoDB; load into Hadoop
• Or skip if using the mongoDB-Hadoop connector!
25. Tuesday, 1:30AM EST
• Grind through all content data and user Profile data to produce:
• Tags based on feature extraction (vs. creator-applied tags)
• Trend baseline for Pets and PowerTools and Spices
• Data can be specific to an individual or to a group
• Load baseline back into mongoDB
• Or skip if using the mongoDB-Hadoop connector!
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
27. Tuesday, 1:35AM EST
• Perform maintenance on user Profiles
• Click history trimming (variety of algorithms)
• “Dead tag” removal
• Update of auxiliary reference data
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
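One possible shape for the slide 27 maintenance pass. The 30-day window and the dead-tag rule are assumptions; the deck only says a "variety of algorithms" exist:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical maintenance pass: trim click history and drop "dead" tags
# (no recent clicks and no baseline). Window and rules are assumptions.
def maintain(profile, now=None, keep=timedelta(days=30)):
    now = now or datetime.now(timezone.utc)
    for tag in list(profile["tags"]):
        entry = profile["tags"][tag]
        # click history trimming: keep only entries inside the window
        entry["hist"] = [t for t in entry["hist"] if now - t <= keep]
        # "dead tag" removal: no history left and no baseline signal
        if not entry["hist"] and entry.get("baseline", 0) == 0:
            del profile["tags"][tag]
    return profile

now = datetime.now(timezone.utc)
p = {"tags": {
    "PETS":   {"hist": [now - timedelta(days=60), now], "baseline": 1.2},
    "SPICES": {"hist": [now - timedelta(days=90)], "baseline": 0},
}}
maintain(p, now)
print(sorted(p["tags"]))  # ['PETS']
```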
29. Feel free to run the baselining more frequently
… but avoid “Are We There Yet?”
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
30. Nearterm / Realtime Questions & Actions
With respect to the Customer:
• What has Bob done over the past 24 hours?
• Given an input, make a logic decision in 100ms or less
With respect to the Provider:
• What are all current users doing or looking at?
• Can we correlate single events to shifts in behavior in the near term?
31. Longterm / Not Realtime Questions & Actions
With respect to the Customer:
• Any way to explain historic performance / actions?
• What are recommendations for the future?
With respect to the Provider:
• Can we correlate multiple events from multiple sources over a long period of time to identify trends?
• What is my entire customer base doing over 2 years?
• Show me a time vs. aggregate tag hit chart
• Slice and dice and aggregate tags vs. XYZ
• What tags are trending up or down?
32. The Key To Success: It is One System
[Diagram: mongoDB ↔ App(s) ↔ Hadoop MapReduce]
Hello, this is Buzz Moschetti; welcome to the webinar entitled “Big Data…”. If your travel plans do not include Big Data, please exit the aircraft and see a customer agent.
Today we are going to explore using mongoDB and Hadoop in a well integrated way to solve a familiar but chronically thorny problem in the directed content space.
We’ll cover the agenda in just a sec but first some logistics:
The presentation audio & slides will be recorded and made available to you in about 24 hours.
We have an hour set up but I’ll use about 40 minutes of that for the presentation with some time for questions.
You can use the webex Q&A box to ask questions at any time but I will wait until the end of the presentation to address them.
If you have technical issues, please send a webex message to the participant ID’d as mongoDB webinar team; otherwise keep your Qs focused on the content.
I am a fan of presentations that are useful after the presentation so you’ll see lots of text, code, etc.
Clear definition: lots of terms (online, analytical). We speak of the Three V’s and that’s good – but then what?
Also: Big Data platform MAY need to perform known, tuned operations on data AND also provide a sandbox for analysis and experimentation.
For this, tech and performance/flexibility tradeoffs, not to mention SDLC controls, will likely be different.
Operationalizing: One-off experiments do not equal a day to day production environment
Math/set accuracy: if you care about today’s EOD close for EMEA, that’s not big data. If you care about the latest price adjustment for widget X, that’s not big data.
Realtime means millisecond response. Not 10 seconds. Not 2 seconds.
DON’T ASK
Terms like online / offline are vague
Looking at integration strategy puts you on the path to creating islands of tech. Unfortunate part is the tech might actually be OK but the overall solution is impaired by second-class bridging schemes.
Agenda item #2: The example: A realtime directed content system using mongoDB and Hadoop.
Analysts may also develop machine learning and other approaches to auto tag content.
Pretty simple, eh? Let’s start with the basics.
High perf CRUD includes index-optimized queries, aggregation, etc.
Still very batchy
Traditional approaches tend to treat the OLTP and Hadoop sides of the house separately
Chicken and egg
Pets & PowerTools : Both together a little scary (ha!) but it really doesn’t matter what the tags are. Taxonomy / ontology is related but independent of the Big Data machinery in play here. That is the output of the Analysts/Data Scientists
By systemetized we mean where we know how we want to optimize behavior. Other tags can exist that are not systemetized – and we’ll see an example of that.
It’s Sunday night – we’re going to do the weekly trend baselining
The mongoDB Hadoop connector speaks MapReduce on one side and mongoDB driver API on the other.
If you’re NOT running a 1000-node Hadoop cluster, chances are you can significantly benefit from the connector to:
Eliminate ETL
Manage a single “data lake”
Serving suggestion – but the important part is that the MapReduce job gets a rich BSONObject to work with, not a String[] cracked from a CSV!
Useful for development
VITAL for day 2 agility because rich types can flow and be addressed by name, not position.
IMPORTANT: We are saving the profile back to mongoDB! This is the realtime update component.
Even in a collection 1,000,000 entries bigger than this, the find takes about 1ms.
Updates run at thousands per second.
personalData is there to assist algos. ActivityProfile may or may not be co-mingled with AccountProfile.
PETS tag has been hit a lot
SPICE is new so no algo yet…
Pseudocode!
The point is that the A4 algo can flexibly deal with nearterm data stored in the Profile histlist PLUS per-user aggregated baseline PLUS system-wide globals
SPICE is not yet systemetized by analysts so no special algo assigned; default algo / weighting will be used.
Later, analysts can change the nightly grind run.
Chopped off 4 entries on baseline, trimmed up hist. Changed zip.
In our example we ran baselining nightly.
Running hourly or by minute does not add significant information value across large data sets.
Maybe weekly is better? Less burden? More time to observe effects of changes to algos?
The point is the actions you take in realtime, nearterm interaction with the system are different than those computed over huge sets of data over long periods of time.