Big Data: Guidelines and Examples for the Enterprise Decision Maker
 

This presentation covers how to use MongoDB with Hadoop to leverage big data within your company.


Speaker Notes
  • Hello, this is Buzz Moschetti; welcome to the webinar entitled "Big Data…". If your travel plans do not include Big Data, please exit the aircraft and see a customer agent. Today we are going to explore using mongoDB and Hadoop in a well-integrated way to solve a familiar but chronically thorny problem in the directed content space. We'll cover the agenda in just a second, but first some logistics: the presentation audio and slides will be recorded and made available to you in about 24 hours. We have an hour set aside, but I'll use about 40 minutes of that for the presentation, with some time for questions. You can use the WebEx Q&A box to ask questions at any time, but I will wait until the end of the presentation to address them. If you have technical issues, please send a WebEx message to the participant identified as "mongoDB webinar team"; otherwise keep your questions focused on the content.
  • I am a fan of presentations that are useful after the presentation so you’ll see lots of text, code, etc.
  • Clear definition: lots of terms are in play (online, analytical); we speak of the three Vs, and that's good, but then what? Also, a Big Data platform may need to perform known, tuned operations on data and also provide a sandbox for analysis and experimentation. For those two purposes the technology and performance/flexibility tradeoffs, not to mention the SDLC controls, will likely be different. Operationalizing: one-off experiments do not equal a day-to-day production environment. Math/set accuracy: if you care about today's EOD close for EMEA, that's not Big Data. If you care about the latest price adjustment for widget X, that's not Big Data. Realtime means millisecond response; not 10 seconds, not 2 seconds.
  • DON'T ASK: terms like online/offline are vague. Starting from an integration strategy puts you on the path to creating islands of technology. The unfortunate part is that each technology might actually be fine on its own, but the overall solution is impaired by second-class bridging schemes.
  • Agenda item #2: The example: A realtime directed content system using mongoDB and Hadoop.
  • Analysts may also develop machine learning and other approaches to auto-tag content.
  • Pretty simple, eh? Let’s start with the basics.
  • High perf CRUD includes index-optimized queries, aggregation, etc.
  • Still very batchy
  • Traditional approaches tend to treat the OLTP and Hadoop sides of the house separately.
  • Chicken and egg. Pets & PowerTools: the two together are a little scary (ha!), but it really doesn't matter what the tags are. Taxonomy/ontology is related to but independent of the Big Data machinery in play here; it is the output of the Analysts/Data Scientists. By systematized we mean tags for which we know how we want to optimize behavior. Other tags can exist that are not systematized, and we'll see an example of that.
  • It's Sunday night and we're going to do the weekly trend baselining. The mongoDB-Hadoop connector speaks MapReduce on one side and the mongoDB driver API on the other. If you're NOT running a 1000-node Hadoop cluster, chances are you can significantly benefit from the connector to eliminate ETL and manage a single "data lake".
  • Serving suggestion, but the important part is that the MapReduce job gets a rich BSONObject to work with, not a String[] cracked from a CSV! That is useful for development and VITAL for Day 2 agility, because rich types can flow and be addressed by name, not position.
  • IMPORTANT: We are saving the profile back to mongoDB! This is the realtime update component.
  • With 1,000,000 entries bigger than this one in the collection, a find takes about 1 ms. Updates run at thousands per second. personalData is there to assist the algorithms; ActivityProfile may or may not be co-mingled with AccountProfile. The PETS tag has been hit a lot. SPICE is new, so there is no algorithm for it yet…
  • Pseudocode! The point is that the A4 algorithm can flexibly deal with nearterm data stored in the Profile hist list PLUS the per-user aggregated baseline PLUS system-wide globals.
  • SPICE is not yet systematized by analysts, so no special algorithm is assigned; the default algorithm/weighting will be used. Later, analysts can change the nightly grind run.
  • Chopped off 4 entries on the baseline, trimmed up hist, and changed the zip code.
  • In our example we ran baselining nightly. Running it hourly or by the minute does not add significant information value across large data sets. Maybe weekly is better? Less burden? More time to observe the effects of changes to the algorithms? The point is that the actions you take in realtime, nearterm interaction with the system are different from those computed over huge data sets over long periods of time.

Big Data: Guidelines and Examples for the Enterprise Decision Maker Presentation Transcript

  • 1. Big Data: Examples and Guidelines for the Enterprise Decision Maker. Buzz Moschetti, Solutions Architect, MongoDB. buzz.moschetti@mongodb.com #MongoDB
  • 2. Who is your Presenter? • Yes, I use “Buzz” on my business cards • Former Investment Bank Chief Architect at JPMorganChase and Bear Stearns before that • Over 25 years of designing and building systems • Big and small • Super-specialized to broadly useful in any vertical • “Traditional” to completely disruptive • Advocate of language leverage and strong factoring • Still programming – using emacs, of course
  • 3. Agenda • (Occasionally) Brutal Truths about Big Data • Review of Directed Content Business Architecture • A Simple Technical Implementation
  • 4. Truths • Clear definition of Big Data still maturing • Efficiently operationalizing Big Data is non-trivial • Developing, debugging, understanding MapReduce • Cluster monitoring & management, job scheduling/recovery • If you thought regular ETL Hell was bad…. • Big Data is not about math/set accuracy • The last 25000 items in a 25,497,612 set “don’t matter” • Big Data questions are best asked periodically • “Are we there yet?” • Realtime means … realtime
  • 5. It’s About The Functions, not the Terms DON’T ASK: • Is this an operations or an analytics problem? • Is this online or offline? • What query language should we use? • What is my integration strategy across tools? ASK INSTEAD: • Am I incrementally addressing data (esp. writes)? • Am I computing a precise answer or a trend? • Do I need to operate on this data in realtime? • What is my holistic architecture?
  • 6. What We’re Going to “Build” today Realtime Directed Content System • Based on what users click, “recommended” content is returned in addition to the target • The example is sector (manufacturing, financial services, retail) neutral • System dynamically updates behavior in response to user activity
  • 7. The Participants and Their Roles in the Directed Content System • Customers • Content Creators: generate and tag content from a known domain of tags • Management/Strategy: make decisions based on trends and other summarized data • Analysts/Data Scientists: operate on data to identify trends and develop tag domains • Developers/ProdOps: bring it all together: apps, SDLC, integration, etc.
  • 8. Priority #1: Maximizing User Value. Considerations/Requirements: • Maximize realtime user value and experience • Provide management reporting and trend analysis • Engineer for Day 2 agility on recommendation engine • Provide scrubbed click history for customer • Permit low-cost horizontal scaling • Minimize technical integration • Minimize technical footprint • Use conventional and/or approved tools • Provide a RESTful service layer • …
  • 9. The Architecture [diagram: App(s), mongoDB, Hadoop with MapReduce]
  • 10. Complementary Strengths. mongoDB side: • Standard design paradigm (objects, tools, 3rd party products, IDEs, test drivers, skill pool, etc. etc.) • Language flexibility (Java, C#, C++, python, Scala, …) • Webscale deployment model (appservers, DMZ, monitoring) • High performance rich shape CRUD. Hadoop side: • MapReduce design paradigm • Node deployment model • Very large set operations • Computationally intensive, longer duration • Read-dominated workload
  • 11. "Legacy" Approach: Somewhat Unidirectional • Extract data from mongoDB and other sources nightly (or weekly) • Run analytics • Generate reports for people to read • Where's the feedback?
  • 12. Somewhat Better Approach • Extract data from mongoDB and other sources nightly (or weekly) • Run analytics • Generate reports for people to read • Move important summary data back to mongoDB for consumption by apps
  • 13. …but the overall problem remains: • How do you integrate and operate, in realtime, upon both periodically generated data and current realtime data? • Integration between the OLTP side and Hadoop stays lackluster • It's not just about the database: you need a realtime profile and a profile update function
  • 14. The legacy problem in pseudocode onContentClick() { String[] tags = content.getTags(); Resource[] r = f1(database, tags); } • Realtime intraday state not well-handled • Baselining is a different problem than click handling
  • 15. The Right Approach • Users have a specific Profile entity • The Profile captures trend analytics as baselining information • The Profile has per-tag “counters” that are updated with each interaction / click • Counters plus baselining are passed to fetch function • The fetch function itself could be dynamic!
  • 16. 24 hours in the life of The System • Assume some content has been created and tagged • Two systematized tags: Pets & PowerTools
  • 17. Monday, 1:30AM EST • Fetch all user Profiles from mongoDB; load into Hadoop • Or skip if using the mongoDB-Hadoop connector!
  • 18. mongoDB-Hadoop MapReduce Example

    import java.io.IOException;
    import java.util.Date;
    import java.util.List;
    import java.util.Set;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.bson.BSONObject;

    public class ProfileMapper
            extends Mapper<Object, BSONObject, IntWritable, IntWritable> {

        @Override
        public void map(final Object pKey, final BSONObject pValue, final Context pContext)
                throws IOException, InterruptedException {
            String user = (String) pValue.get("user");
            Date d1 = (Date) pValue.get("lastUpdate");

            // The connector hands the Mapper a rich BSONObject, not a String[] from a CSV:
            // count clicks across every tag's history in this Profile.
            BSONObject tags = (BSONObject) pValue.get("tags");
            Set<String> keys = tags.keySet();
            int count = 0;
            for (String tag : keys) {
                BSONObject tagInfo = (BSONObject) tags.get(tag);
                count += ((List) tagInfo.get("hist")).size();
            }
            int avg = count / keys.size();
            pContext.write(new IntWritable(count), new IntWritable(avg));
        }
    }
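For reference, a baselining job that feeds ProfileMapper directly from mongoDB might be wired up roughly as below. This is a hedged sketch, not the presenter's code: the connector class names (MongoInputFormat, MongoOutputFormat, MongoConfigUtil) follow the mongoDB-Hadoop connector's documented API as best I know it, and the URIs, database, and collection names are placeholders.

    // Hedged sketch: run ProfileMapper (slide 18) over the profiles collection via
    // the mongoDB-Hadoop connector, writing results straight back to mongoDB.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Job;
    import com.mongodb.hadoop.MongoInputFormat;
    import com.mongodb.hadoop.MongoOutputFormat;
    import com.mongodb.hadoop.util.MongoConfigUtil;

    public class ProfileBaselineJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Read Profiles directly from mongoDB and write results back: no ETL step in between.
            MongoConfigUtil.setInputURI(conf, "mongodb://localhost:27017/cms.profiles");
            MongoConfigUtil.setOutputURI(conf, "mongodb://localhost:27017/cms.profileStats");

            Job job = Job.getInstance(conf, "profile-baseline");
            job.setJarByClass(ProfileBaselineJob.class);
            job.setMapperClass(ProfileMapper.class);           // from slide 18
            job.setInputFormatClass(MongoInputFormat.class);   // Mapper sees BSONObject values
            job.setOutputFormatClass(MongoOutputFormat.class);
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(IntWritable.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }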
  • 19. Monday, 1:45AM EST • Grind through all content data and user Profile data to produce: • Tags based on feature extraction (vs. creator-applied tags) • Trend baseline per user for tags Pets and PowerTools • Load Profiles with new baseline back into mongoDB • Or skip if using the mongoDB-Hadoop connector!
  • 20. Monday, 8AM EST • User Bob logs in and Profile retrieved from mongoDB • Bob clicks on Content X which is already tagged as "Pets" • Bob has clicked on Pets tagged content many times • Adjust Profile for tag "Pets" and save back to mongoDB • Analysis = f(Profile) • Analysis can be "anything"; it is simply a result. It could trigger an ad, a compliance alert, etc.
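To make the Monday-morning click path concrete, here is a minimal sketch using the classic mongoDB Java driver (DBCollection API). The database and collection names are assumptions, and the analysis step is left abstract, as it is on the slide.

    // Hedged sketch of the slide-20 click path: push the click into Bob's per-tag
    // history and hand the updated Profile to whatever Analysis = f(Profile) means
    // for your application. Names here are illustrative, not the presenter's code.
    import java.util.Date;
    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;
    import com.mongodb.MongoClient;

    public class ProfileClickService {

        private final DBCollection profiles;

        public ProfileClickService(MongoClient client) {
            this.profiles = client.getDB("cms").getCollection("profiles");
        }

        /** Record a click on content carrying one tag and return the refreshed Profile. */
        public DBObject recordClick(String user, String tag, String contentUrl) {
            BasicDBObject query = new BasicDBObject("user", user);

            // Append to tags.<TAG>.hist; the nightly Hadoop job owns tags.<TAG>.baseline.
            BasicDBObject click = new BasicDBObject("ts", new Date()).append("url", contentUrl);
            BasicDBObject update = new BasicDBObject("$push",
                    new BasicDBObject("tags." + tag + ".hist", click));

            profiles.update(query, update);    // realtime update back into mongoDB
            return profiles.findOne(query);    // counters + baseline feed the fetch function
        }
    }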
  • 21. Monday, 8:02AM EST • Bob clicks on Content Y which is already tagged as "Spices" • Spice is a new tag type for Bob • Adjust Profile for tag "Spices" and save back to mongoDB • Analysis = f(Profile)
  • 22. Profile in Detail

    {
      user: "Bob",
      personalData: { zip: "10024", gender: "M" },
      tags: {
        PETS: {
          algo: "A4",
          baseline: [0,0,10,4,1322,44,23, … ],
          hist: [
            { ts: datetime1, url: url1 },
            { ts: datetime2, url: url2 }
            // 100 more
          ]
        },
        SPICE: {
          hist: [ { ts: datetime3, url: url3 } ]
        }
      }
    }
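The speaker notes mention that finding one Profile among a million-plus documents takes about a millisecond; that presumes the lookup field is indexed. A small, hedged sketch follows (host, database, and collection names are placeholders):

    // Hedged sketch: index the user field once so per-click Profile reads stay
    // index-optimized, then fetch Bob's Profile. Names are illustrative only.
    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;
    import com.mongodb.MongoClient;

    public class ProfileIndexSetup {
        public static void main(String[] args) {
            DBCollection profiles = new MongoClient("localhost")
                    .getDB("cms").getCollection("profiles");

            // One-time setup: ascending index on the lookup key.
            profiles.createIndex(new BasicDBObject("user", 1));

            DBObject bob = profiles.findOne(new BasicDBObject("user", "Bob"));
            System.out.println(bob);
        }
    }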
  • 23. Tag-based algorithm detail

    getRecommendedContent(profile, ["PETS", other]) {
      if algo for a tag available {
        filter = algo(profile, tag);
      }
      fetch N recommendations(filter);
    }

    A4(profile, tag) {
      weight = get tag ("PETS") global weighting;
      adjustForPersonalBaseline(weight, "PETS" baseline);
      if "PETS" clicked more than 2 times in past 10 mins then weight += 10;
      if "PETS" clicked more than 10 times in past 2 days then weight += 3;
      return new filter({"PETS", weight}, globals)
    }
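The A4 logic above is pseudocode; a rough Java rendering is below. The time-window rules and weight bumps come from the slide, while the Filter type, the global-weight parameter, and the baseline adjustment are simplified placeholders rather than the presenter's actual implementation.

    // Hedged Java sketch of the A4 pseudocode on slide 23.
    import java.util.Date;
    import java.util.List;
    import java.util.Map;

    public class A4Algorithm {

        /** Minimal stand-in for the filter handed to the content fetch. */
        public static class Filter {
            public final String tag;
            public final double weight;
            public Filter(String tag, double weight) { this.tag = tag; this.weight = weight; }
        }

        public Filter apply(Map<String, Object> profileTagEntry, String tag, double globalWeight) {
            double weight = globalWeight;

            // Personal baseline nudges the weight (placeholder adjustment).
            List<Number> baseline = (List<Number>) profileTagEntry.get("baseline");
            if (baseline != null && !baseline.isEmpty()) {
                weight += Math.log1p(baseline.get(baseline.size() - 1).doubleValue());
            }

            // Nearterm history: the same click-frequency rules as the slide.
            List<Map<String, Object>> hist = (List<Map<String, Object>>) profileTagEntry.get("hist");
            long now = System.currentTimeMillis();
            long tenMinutes = 10L * 60 * 1000;
            long twoDays = 2L * 24 * 60 * 60 * 1000;

            int last10Min = 0, last2Days = 0;
            if (hist != null) {
                for (Map<String, Object> click : hist) {
                    long ts = ((Date) click.get("ts")).getTime();
                    if (now - ts <= tenMinutes) last10Min++;
                    if (now - ts <= twoDays) last2Days++;
                }
            }
            if (last10Min > 2) weight += 10;
            if (last2Days > 10) weight += 3;

            return new Filter(tag, weight);
        }
    }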
  • 24. Tuesday, 1AM EST • Fetch all user Profiles from mongoDB; load into Hadoop • Or skip if using the mongoDB-Hadoop connector!
  • 25. Tuesday, 1:30AM EST • Grind through all content data and user profile data to produce: • Tags based on feature extraction (vs. creator-applied tags) • Trend baseline for Pets and PowerTools and Spice • Data can be specific to individual or by group • Load baseline back into mongoDB • Or skip if using the mongoDB-Hadoop connector!
  • 26. New Profile in Detail

    {
      user: "Bob",
      personalData: { zip: "10024", gender: "M" },
      tags: {
        PETS: {
          algo: "A4",
          baseline: [0,0,10,4,1322,44,23, … ],
          hist: [
            { ts: datetime1, url: url1 },
            { ts: datetime2, url: url2 }
            // 100 more
          ]
        },
        SPICE: {
          baseline: [0],
          hist: [ { ts: datetime3, url: url3 } ]
        }
      }
    }
  • 27. Tuesday, 1:35AM EST • Perform maintenance on user Profiles • Click history trimming (variety of algorithms) • "Dead tag" removal • Update of auxiliary reference data
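A hedged sketch of what the trimming part of this maintenance pass could look like with the classic Java driver, capping each tag's click history at the most recent N entries. The collection name and cap value are assumptions; dead-tag removal and reference-data updates are omitted.

    // Hedged sketch of slide-27 maintenance: trim tags.<TAG>.hist to the newest
    // MAX_HIST entries (assumes hist is appended in time order). Illustrative names only.
    import java.util.List;
    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;
    import com.mongodb.MongoClient;

    public class ProfileMaintenance {

        private static final int MAX_HIST = 50;   // matches the "50 more" profile on slide 28

        public static void main(String[] args) {
            DBCollection profiles = new MongoClient("localhost")
                    .getDB("cms").getCollection("profiles");

            for (DBObject profile : profiles.find()) {
                DBObject tags = (DBObject) profile.get("tags");
                if (tags == null) continue;

                for (String tag : tags.keySet()) {
                    DBObject tagInfo = (DBObject) tags.get(tag);
                    List hist = (List) tagInfo.get("hist");
                    if (hist == null || hist.size() <= MAX_HIST) continue;

                    // Keep only the newest MAX_HIST clicks for this tag.
                    List trimmed = hist.subList(hist.size() - MAX_HIST, hist.size());
                    profiles.update(
                            new BasicDBObject("user", profile.get("user")),
                            new BasicDBObject("$set",
                                    new BasicDBObject("tags." + tag + ".hist", trimmed)));
                }
            }
        }
    }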
  • 28. New Profile in Detail

    {
      user: "Bob",
      personalData: { zip: "10022", gender: "M" },
      tags: {
        PETS: {
          algo: "A4",
          baseline: [ 1322,44,23, … ],
          hist: [
            { ts: datetime1, url: url1 }
            // 50 more
          ]
        },
        SPICE: {
          algo: "Z1",
          baseline: [0],
          hist: [ { ts: datetime3, url: url3 } ]
        }
      }
    }
  • 29. Feel free to run the baselining more frequently … but avoid "Are We There Yet?"
  • 30. Nearterm / Realtime Questions & Actions With respect to the Customer: • What has Bob done over the past 24 hours? • Given an input, make a logic decision in 100ms or less With respect to the Provider: • What are all current users doing or looking at? • Can we nearterm correlate single events to shifts in behavior?
  • 31. Longterm/ Not Realtime Questions & Actions With respect to the Customer: • Any way to explain historic performance / actions? • What are recommendations for the future? With respect to the Provider: • Can we correlate multiple events from multiple sources over a long period of time to identify trends? • What is my entire customer base doing over 2 years? • Show me a time vs. aggregate tag hit chart • Slice and dice and aggregate tags vs. XYZ • What tags are trending up or down?
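As one illustration of the long-horizon side (for example, a time vs. aggregate tag-hit chart), the sketch below counts tag hits per (tag, day) across all Profiles with a conventional MapReduce pass, fed, as before, by the mongoDB-Hadoop connector. It is not from the presentation; field names follow the slide-22 Profile and everything else is illustrative.

    // Hedged sketch: emit one count per (tag, day) from every Profile's click history,
    // then sum per key. The output feeds a trend chart or "tags trending up/down" report.
    import java.io.IOException;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.List;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.bson.BSONObject;

    public class TagTrend {

        /** Emit ("<TAG> yyyy-MM-dd", 1) for every click in every Profile. */
        public static class TagDayMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);

            @Override
            public void map(Object key, BSONObject profile, Context ctx)
                    throws IOException, InterruptedException {
                SimpleDateFormat day = new SimpleDateFormat("yyyy-MM-dd");
                BSONObject tags = (BSONObject) profile.get("tags");
                if (tags == null) return;
                for (String tag : tags.keySet()) {
                    BSONObject tagInfo = (BSONObject) tags.get(tag);
                    List hist = (List) tagInfo.get("hist");
                    if (hist == null) continue;
                    for (Object click : hist) {
                        Date ts = (Date) ((BSONObject) click).get("ts");
                        ctx.write(new Text(tag + " " + day.format(ts)), ONE);
                    }
                }
            }
        }

        /** Sum the per-(tag, day) hits. */
        public static class TagDayReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
    }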
  • 32. The Key To Success: It is One System [diagram: App(s) + mongoDB + Hadoop/MapReduce operating as one system]
  • 33. Webex Q&A
  • 34. Thank You Buzz Moschetti buzz.moschetti@mongodb.com #MongoDB