Creating a Single View: Data Design and Loading Strategies
 


Learn how to design a single view application and load your data into the application.


  • And why are we doing this at all? Federation? Managed QoS? Because traditional RDBMS dynamics make it difficult to serve a number of access patterns well. The single most important part of what will make you successful is also the simplest, and it is part of the mongoDB data environment.
  • In a typical ETL fabric, data fidelity drops to the lowest common denominator. CSV still carries the day because it is easy to produce and to parse mechanically, but it is difficult to change or to express rich structure in. XML/XSD is "too hard" to produce and consume, and harder still to standardize consistent list/array conventions for. Anecdote about getting burned by the arrow: the arrow is disingenuous! This is loss of fidelity.
  • Most people use an ORM to get from the database to good objects, and mongoDB has a story around that too! But for the moment, assume we use one.
  • XML was supposed to be The Thing.
  • XML/XSD is "too hard" to produce and consume, and it is harder still to create consistent list/array conventions. No one runs schema validation in production because of performance, and the schemas became too complicated anyway. JAXB and JAXP are compile-time bound.
  • XML set us back about 10 years. It leads to this: "Can you please just send me a CSV again?"
  • Changes to data in a source system imply a database schema upgrade in the data hub; with X source systems this starts to become unscalable, and hub data storage must scale as well. In summary: traditionally, common data hubs are harder to manage than the sum of their source systems, which themselves are not so easy to manage! Remember this formula; we'll see how we improve upon it in just a bit.
  • Data entitlement is implicit in system access. Fast-moving businesses cannot be held up by naturally slower-moving ones. (Andreas will cover this in greater detail later.)
  • Knowing the legacy problems and experience, here are the three things that work. Don't think about transferring tables; think about transferring products, logs, trades, and customers. Cross-reference at the single point of view: especially as the number of feeders grows large, you'll want to concentrate and control enrichment instead of having potentially dozens of scripts and utilities involved in the flow. This also vastly simplifies a necessary evil: reconciliation.
  • A zillion APIs. This does not necessarily mean real time: we can do real time with "microbatching", and we can do end-of-day batch with a file-free API. It's all about how producer and consumer agree to capture the data; we'll see more about this context later in the presentation. Our most successful customers connect directly or use microbatching. If a direct connection isn't your bag, feel free to create a web service, but pass JSON to that web service.
  • JSON is the new leader in highly interoperable, ASCII-structured data formats. ASCII interoperability is critical, so GPB, Avro, and other binary formats are out. It is better than XML because strings, numbers, maps, and arrays are natively supported, the data model is simpler (no attributes or unnested content), and it is easier to construct programmatically. It is (much!) better than CSV because rich detail is preserved and content can be expanded later without struggling with "comma hell". Warning: JSON does NOT have Date or binary (BLOB) types! We'll come back to a strategy for that. As for actually creating JSON, there are all sorts of options, including frameworks that use annotations on your POJOs. But my recommendation observes software engineering 101: have the feeder program build a Map, then use any one of the JSON parser/generators, like Jackson, to serialize it.
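A minimal sketch of that "build a Map, then serialize" pattern, shown in Python with the json module rather than Java/Jackson; the field names and the feeder function are hypothetical, and it uses the $date convention discussed later so the date's true type is not lost:

    import json
    from datetime import datetime, timezone

    def product_to_doc(name, features, intro_date, unit_bundles):
        # Build a plain dict (the "Map") that mirrors the source entity...
        return {
            "name": name,
            "features": [{"text": text, "type": ftype} for (text, ftype) in features],
            # ...and hint the real type of the date using the {"$date": ...} convention
            "introDate": {"$date": intro_date.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "-0000"},
            "unitBundles": unit_bundles,
        }

    # Emit one JSON document per line (CR-delimited), the same format mongoimport reads
    with open("products.json", "w") as out:
        doc = product_to_doc("widget1", [("good texture", "A")],
                             datetime(2014, 2, 4, tzinfo=timezone.utc), [1, 3, 7, 9])
        out.write(json.dumps(doc) + "\n")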
  • The basic rules: let the feeder systems drive the data design, and do not dilute, format, or otherwise mess with the data. Schema design: an entire session could be devoted to schema design. In general, always embed 1:1 relationships; embed "co-located" 1:n relationships (vectors of bespoke results, contact and phone numbers); use foreign keys to link 1:n where the n side is shared by others; and use foreign keys for n:n.
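As a concrete illustration of those rules (the field names and values here are hypothetical, not from the talk), a contact document embeds its own address (1:1) and phone numbers (co-located 1:n) but references a shared account by key:

    # Hypothetical documents illustrating the embedding rules above
    contact = {
        "did": "D159308",
        "name": "Buzz",
        "address": {"city": "New York", "state": "NY"},    # 1:1 -> embed
        "phones": [                                         # co-located 1:n -> embed
            {"type": "mobile", "number": "1-666-444-3333", "dnc": False},
            {"type": "land", "number": "1-999-444-3333"},
        ],
        "accountId": "A-7712",                              # 1:n shared with others -> reference
    }

    account = {   # referenced by many contacts, so it lives in its own collection
        "_id": "A-7712",
        "accountName": "Retail Brokerage",
    }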
  • Just add it. We are not talking about doubles turning into lists of dates, but there's a hint coming up around versioning that could help there too. If you do this even halfway right, it may be the last feed infrastructure you need to create for this consolidated view.
  • It is much easier to update a JSON feed handler for new data: essentially constant time to ingest new or changed data! There is no silver bullet or magic about processing the data, but you are no longer wrestling with the database.
  • Build the rich structure! You have to do this anyway to produce a JSON file, so if you can, go the extra distance and directly insert the content. Don't worry about transactions; you should be using a batchID, which we'll get to in a moment.
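A minimal sketch of that direct-insert approach with PyMongo (the connection string, database, collection, and field names are hypothetical): build the same dict you would have written to a JSON file and insert it, tagging it with the load's batchID:

    from datetime import datetime, timezone
    from pymongo import MongoClient

    coll = MongoClient("mongodb://localhost:27017")["singleview"]["products"]

    doc = {
        "name": "widget1",
        "features": [{"text": "good texture", "type": "A"}],
        # a real BSON date; no {"$date": ...} wrapper is needed when inserting via the driver
        "introDate": datetime(2014, 2, 4, tzinfo=timezone.utc),
        "batchID": "B213W",   # tag every document with the batch that loaded it
    }
    coll.insert_one(doc)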
  • mongoDB does not extend JSON per se. Rather, within the JSON spec, we have a structural naming convention that allows us to clearly hint at the true intended type of the string value. These are natively understood by mongoimport, by the way.
  • By CR-delimited we mean no pretty-printing of the JSON. The computer doesn't care whether it's pretty or not, and packing each document onto one line means it is easy to read with a BufferedReader or fread, easy to grep (standard Unix utilities work nicely too), the same format that mongoimport and mongoexport use, and it does not force a large memory footprint on the loader. And you can use jq!
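A minimal sketch (file name and field are hypothetical) of why the one-document-per-line layout keeps the loader's memory footprint small: the file is streamed a line at a time instead of being parsed as one giant array:

    import json

    def stream_documents(path):
        # Only one line (one document) is ever held in memory at a time
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    yield json.loads(line)

    for doc in stream_documents("products.json"):
        print(doc["name"])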
  • We have 100,000 items. Goal: how many mobile phones are explicitly marked as do-not-call? Challenge: there is a single person per "greppable" line, and phones is an array; in these two lines alone there are five phones. Also, phones.type is not the same as .type, so grepping for "mobile" leads to peril and very often wrong results.
  • .phones selects the phones element from the document, but we still have it as an array; [] "flattens" the array into a set of documents (just like $unwind in the mongoDB aggregation framework). jq operations are very rich: you can redact or replace fields, add brand-new fields to the output, and so on. The -c option produces CR-delimited JSON. JSON compresses very well (to roughly one fifth the space), so go ahead and gzip -9 the JSON and decompress it on the fly into jq!
  • Don't be afraid to make mistakes, for the same reason we explored on slide 21.
  • Context is an identifier for a set of data, e.g. ABC123. Dates are dangerous: for global systems, two (or more!) local dates are possible, and the system processing date can be misleading. Context has additional benefits: it is easy to associate other information with the context ID, such as a functional ID.
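A minimal sketch (PyMongo; the collection, batch ID, and documents are hypothetical) of how a batch/context identifier makes reloads safe: if a load fails or turns out to be wrong, delete everything tagged with that batchID and run it again, rather than reasoning about processing dates:

    from pymongo import MongoClient

    coll = MongoClient("mongodb://localhost:27017")["singleview"]["contacts"]

    def reload_batch(batch_id, docs):
        # Throw away any partial or bad results from a previous attempt...
        coll.delete_many({"batchID": batch_id})
        # ...then load the batch again, tagging every document with the context ID
        for doc in docs:
            doc["batchID"] = batch_id
            coll.insert_one(doc)

    reload_batch("B213W", [{"name": "buzz", "locale": "NY"},
                           {"name": "steve", "locale": "UK"}])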
  • A single view of the customer does not mean a single technical visualization of the customer through a GUI!
  • Examples: financial services firms that use this stack, and how.

Presentation Transcript

  • Creating a Single View, Part 2: Data Design & Loading Strategies. Buzz Moschetti, Enterprise Architect, MongoDB (buzz.moschetti@mongodb.com) #ConferenceHashTag
  • Who Is Talking To You? • Yes, I use “Buzz” on my business cards • Former Investment Bank Chief Architect at JPMorganChase and Bear Stearns before that • Over 27 years of designing and building systems • Big and small • Super-specialized to broadly useful in any vertical • “Traditional” to completely disruptive • Advocate of language leverage and strong factoring • Inventor of perl DBI/DBD • Still programming – using emacs, of course
  • What Is He Going To Talk About? Historic Challenges; New Strategy for Success; Technical examples and tips. Creating A Single View: Part 1, Overview & Data Analysis; Part 2, Data Design & Loading Strategies; Part 3, Securing Your Deployment.
  • Historic Challenges
  • It's 2014: Why is this still hard to do? • Business / Technical / Information Challenges • Missteps in evolution of data transfer technology
  • We wish this "just worked": query objects from A with great performance; query objects from B with great performance; query objects from merged A and B (in X) with great performance.
  • …but Beware The Blue Arrow! • Extracting many tables into many files • Some tables require more than one file to capture representation • Encoding/formatting clever tricks • Reconciliation • Different extracts for different consumers • Different extracts for different versions of data to same consumer
  • Loss of fidelity exposed. The rich object:
    class Product { String productName; List<Features> ff; Date introDate; List<Date> versDates; int[] unitBundles; //… }
    ...comes out of the ORM as CSV rows like:
    widget1,,3,,good texture,retains value,,,20142304,102.3,201401
    widget2,XS,6,,,,not fragile,,,20132304,73,87653
    widget3,XT,,,4,,dense,shiny,mysterious,,,19990304,73,87653,,
    widget4,,,3,4,,,,,,20040101,,999999,,
  • What happened to XML? The same class Product { String productName; List<Features> ff; Date introDate; List<Date> versDates; int[] unitBundles; //… } becomes:
    <product> <name>widget1</name> <features> <feature> <text>good texture</text> <type>A</type> </feature> </features> <introDate>20140204</introDate> <versDates> <versDate>20100103</versDate> <versDate>20100601</versDate> </versDates> <unitBundles>1,3,9</unitBun…
  • XML: Created More Issues Than Solved <product> <name>widget1</name> <features> <feature> <text>good texture</text> <type>A</type> </feature> </features> <introDate>20140204</introDate> <versDates> <versDate>20100103</versDate> <versDate>20100601</versDate> </versDates> <unitBundles>1,3,9</unitBun… • No native handling of arrays • Attribute vs. nested tag rules/conventions widely variable • Generic parsing (DOM) yields a tree of Nodes of Strings – not very friendly • SAX is fast but too low level
  • … and it eventually became this: <p name="widget1" ftxt1="good texture" ftyp1="A" idt="20140203" … <p name="widget2" ftxt1="not fragile" ftyp1="A" idt="20110117" … <p name="widget3" ftxt1="dense" idt="20140203" … <p name="widget4" idt="20140203" versD="20130403,20130104,20100605" … • Short, cryptic, conflated tag names • Everything is a string attribute • Mix of flattened arrays and delimited strings • Irony: org.xml.sax.Attributes easier to deal with than rest of DOM
  • Schema Change Challenges: Multiplied & Concentrated! Each source (A, B, C, …) must alter table(s) and extract more data (LOE = x1, x2, x3, … xn), and the hub X must alter table(s) and split() more data for each of them. Total LOE = Σ(i=1..n) xi + f(n), where f() is nonlinear with respect to n.
  • SLAs & Security: Tough to Combine. User 1 entitled to see X and User 2 entitled to see Y on system A; User 1 entitled to see Z and User 2 entitled to see V on system B. Entitlements managed per-system/per-application here… are lost in the low-fidelity transfer of data… and have to be reconstituted at the hub… somehow…
  • Solving The Problem with mongoDB
  • What We Are Building Today
  • Overall Strategy For Success • Let the source systems' entities drive the data design, not the physical database • Capture data in full fidelity • Perform cross-ref and additional logic at the single point of view, not in transit
  • Don't forget the power of the API class Product { String productName; List<Features> ff; Date introDate; List<Date> versDates; int[] unitBundles; //… } If you can, avoid files altogether!
  • But if you are creating files: emit JSON class Product { String productName; List<Features> ff; Date introDate; List<Date> versDates; int[] unitBundles; //… } { "name": "widget1", "features": [ { "text": "good texture", "type": "A" } ], "introDate": "20140204", "versDates": [ "20100103", "20100601" ], "unitBundles": [1,3,7,9] // … }
  • Let The Feeding System Express Itself. A: { "name": "widget1", "features": [ { "text": "good texture", "type": "A" } ] } B: { "myColors": ["red","blue"], "myFloats": [ 3.14159, 2.71828 ], "nest": { "as": { "deep": true }}} C: { "myBlob": { "$binary": "aGVsbG8K"}, "myDate": { "$date": "20130405" } }
  • What if you forgot something? Before: { "name": "widget1", "features": [ { "text": "good texture", "type": "A" } ], "introDate": "20140204", "versDates": [ "20100103", "20100601" ], "versMinorNum": [1,3,7,9] // … } After, with the forgotten field simply added: { "name": "widget1", "features": [ { "text": "good texture", "type": "A" } ], "coverage": [ "NY", "NJ" ], "introDate": "20140204", "versDates": [ "20100103", "20100601" ], "versMinorNum": [1,3,7,9] // … }
  • The Joy (and value) of mongoDB: each source (A, B, C) alters table(s) and extracts more data with LOE = .25x1, .25x2, .25x3, … while the hub's LOE = O(1).
  • Helpful Hints
  • Helpful Hint: Use the APIs
    lastDID = None
    contact = None
    curs.execute("select A.did, A.fullname, B.number from contact A "
                 "left outer join phones B on A.did = B.did order by A.did")
    for q in curs.fetchall():
        if q[0] != lastDID:
            if lastDID is not None:
                coll.insert(contact)
            contact = {"did": q[0], "name": q[1]}
            lastDID = q[0]
        if q[2] is not None:
            if 'phones' not in contact:
                contact['phones'] = []
            contact['phones'].append({"number": q[2]})
    if lastDID is not None:
        coll.insert(contact)
    Resulting document:
    { "did": "D159308", "name": "Buzz", "phones": [ {"number": "1-666-444-3333"}, {"number": "1-999-444-3333"}, {"number": "1-999-444-9999"} ] }
  • Helpful Hint: Declare Types Use mongoDB conventions for dates and binary data: {"dateA": {"$date": "2014-05-16T09:42:57.112-0000"}} {"dateB": {"$date": 1400617865438}} {"someBlob": {"$binary": "YmxhIGJsYSBibGE=", "$type": "00"}}
  • Helpful Hint: Keep the file flexible Use CR-delimited JSON: { "name": "buzz", "locale": "NY"} { "name": "steve", "locale": "UK"} { "name": "john", "locale": "NY"} …instead of a giant array: records = [ { "name": "buzz", "locale": "NY"}, { "name": "steve", "locale": "UK"}, { "name": "john", "locale": "NY"} ]
  • Helpful Hint: A quick sidebar on jq
    $ cat myData
    { "name": "dave", "type": "mobile", "phones": [ { "type": "mobile", "number": "2123455634", "dnc": false }, { "type": "mobile", "number": "6173455634" }, { "type": "land", "number": "2023455634" } ] }
    { "name": "bob", "type": "WFH", "phones": [ { "type": "land", "number": "70812342342", "dnc": false }, { "type": "land", "number": "7083455634" } ] }
    (another 99,998 rows)
  • Helpful Hint: jq is JSON awk/sed/grep
    $ jq -c '.phones[] | select(.dnc == false and .type == "mobile")' myData
    {"dnc":false,"number":"2123455634","type":"mobile"}
    {"dnc":false,"number":"70812342342","type":"mobile"}
    …
    $ jq [expression above] myData | wc -l
    32433
    $ gzip -c -d myData.gz | jq [expression above] | wc -l
    32433
    http://stedolan.github.io/jq/
  • Helpful Hint: Don't be afraid of metadata Use a version number in each document: { "v": 1, "name": "buzz", "locale": "NY"} { "v": 1, "name": "steve", "locale": "UK"} { "v": 2, "name": "john", "region": "NY"} …or get fancier and use a header record: { "vers": 1, "creator": "ID", "createDate": …} { "name": "buzz", "locale": "NY"} { "name": "steve", "locale": "UK"} { "name": "john", "locale": "NY"}
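A minimal sketch (field names are hypothetical, mirroring the example above) of how a consumer can branch on that per-document version number when old and new feed formats coexist in one file:

    import json

    def normalize(doc):
        # v1 documents used "locale"; v2 renamed it to "region" (as in the example above)
        if doc.get("v", 1) == 1 and "locale" in doc:
            doc["region"] = doc.pop("locale")
        return doc

    with open("contacts.json") as f:
        docs = [normalize(json.loads(line)) for line in f if line.strip()]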
  • Helpful Hints: Use batch ID { "vers": 1, "batchID": "B213W", "createDate": …} { "name": "buzz", "locale": "NY"} { "name": "steve", "locale": "UK"} { "name": "john", "locale": "NY"}
  • Now that we have the data… You’re well on your way to a single view consolidation…but first: – Data Work • Cross-reference important keys • Potential scrubbing/cleansing – Software Stack Work
  • You’ve Built a Great Data Asset; leverage it!
  • DON’T Build This! Giant Glom Of GUI-biased code http://yourcompany/yourapp
  • Build THIS! http://yourcompany/yourapp Data Access Layer Object Construction Layer Basic Functional Layer Portal Functional Layer GUI adapter Layer Web Service Layer Other Regular Performance Applications Higher Performance Applications Special Generic Applications
  • What Is Happening Next? Part 3, Securing Your Deployment: Access Control, Data Protection, Auditing. (Creating A Single View: Part 1, Overview & Data Analysis; Part 2, Data Design & Loading Strategies; Part 3, Securing Your Deployment.)
  • Q&A. Buzz Moschetti, Enterprise Architect, MongoDB (buzz.moschetti@mongodb.com) #ConferenceHashTag