Creating a Single View: Data Design and Loading Strategies


Learn how to design a single view application and load your data into the application.


Published in Technology
  • Carb coma

  • AND WHY ARE WE DOING IT AT ALL! Federation? Managed QoS? Because traditional RDBMS dynamics make it difficult to serve a number of access patterns well.

    The single most important part of this that will make you successful is the simplest – and is part of the mongoDB data environment

  • Data fidelity across the ETL fabric is typically the lowest common denominator (LCD)
    CSV still carries the day because it is easy to make and technically easy to parse (but difficult to change or to express things in)
    XML / XSD “too hard” to technically make, parse/consume, and harder still to create consistent list/array conventions

    Anecdote about getting skewered by the arrow
    The arrow is disingenuous!

  • Most people use an ORM to get from DB to good objects – and mongoDB has a story around that too!
    But for the moment, assume we use it.

  • XML was supposed to be The Thing.
  • XML / XSD “too hard” to technically make, parse/consume, and harder still to create consistent list/array conventions

    No one runs schema validation in production because of performance
    Schemas became too complicated anyway…..

    JAXB, JAXP are compile-time bound
  • XML set us back about 10 years

    Leads to this: Can you please just send me a CSV again?

  • Changes to data in source system imply DB schema upgrade in data hub – with X source systems, this starts to become unscalable
    Hub Data storage scalability
    In summary: traditionally, common data hubs are harder to manage than the sum of their source systems – which themselves are not so easy to manage!

    Remember this formula; we’ll see how we improve upon this in just a bit.
  • Data entitlement implicit to system access
    Fast-moving businesses cannot be held up by naturally slower-moving ones

    (Andreas will cover this in greater detail later)

  • Knowing legacy problems and experience, here are the 3 things that work.

    Don’t think about transferring tables; think about transferring products, logs, trades, customers

    Cross ref at the SPOV. Especially as the number of feeders grows large, you’ll want to concentrate and control enrichment instead of having potentially dozens of scripts and utils getting involved in the flow. This also vastly simplifies a necessary evil: reconciliation.

    ----- Meeting Notes (5/19/14 13:31) -----
  • A zillion APIs.
    This does not necessarily mean REALTIME. We can do realtime with “microbatching”. We can do EOD batch with a filefree API. It’s all about how producer and consumer agree to capture the data – we’ll see more about this context later in the presentation.
    Our most successful customers do this
    or use microbatching.

    If direct connect isn’t your bag, feel free to create a web service: but pass JSON to that web service.

  • JSON is the new leader among highly interoperable, ASCII structured data formats
    ASCII interop is critical, so GPB, Avro, and other binary formats are out.
    Better than XML because
    Strings, numbers, maps, and arrays natively supported
    Simpler data model (no attributes or unnested content)
    Easier to programmatically construct

    (Much!) better than CSV because
    Rich detail is preserved
    Content can be expanded later without struggling with “comma hell”

    Warning: JSON does NOT have Date or binary (BLOB) types! We’ll come back to a strategy on that….

    With respect to actually creating JSON, there are all sorts of options, including frameworks that use annotations on your POJOs
    BUT: my recommendation is to observe software engineering 101: have the feeder program build a Map, then use any one of the JSON parser/generators like Jackson to
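A minimal Python sketch of the build-a-map-then-serialize approach recommended in the note above. The note names Jackson (Java); here the standard-library `json` module is assumed to play the same role, and the field names are borrowed from the Product example in the slides.

```python
import json


def product_to_json(product: dict) -> str:
    """Serialize one feeder-side map to a single line of JSON."""
    # Compact separators drop optional whitespace; json.dumps emits no newlines here.
    return json.dumps(product, separators=(",", ":"))


# The feeder builds a plain map first; no annotations or frameworks are involved.
product = {
    "name": "widget1",
    "features": [{"text": "good texture", "type": "A"}],
    "introDate": "20140204",
    "versDates": ["20100103", "20100601"],
    "unitBundles": [1, 3, 7, 9],
}

line = product_to_json(product)
print(line)
```

Keeping the map as the intermediate form means new fields only need a new key in the feeder; the serializer call does not change.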
  • The Basic Rules:
    Let feeder systems drive the data design
    Do not dilute, format, or otherwise mess with the data

    Schema Design: An entire session could be devoted to schema design. In general,
    always embed 1:1
    embed “co-lo” 1:n (vectors of bespoke results, contact and phone numbers)
    use foreign keys to link 1:n where n is shared by others
    use foreign keys for n:n
    Not talking about doubles turning into lists of dates, but there’s a hint coming up around versioning that could help there too.

    If you do this even halfway right, it may be the last feed infrastructure you need to create for this consolidated view.
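The embedding rules above can be sketched with plain Python dicts. Every name here (`customer`, `branchId`, `productIds`, `branches`, `resolve_branch`) is hypothetical and chosen only to illustrate the four cases; none of it comes from the talk.

```python
# Hypothetical customer document illustrating the four rules.
customer = {
    "_id": "C1001",
    # 1:1 -> always embed
    "name": {"first": "Dave", "last": "Smith"},
    # "co-lo" 1:n (owned only by this customer) -> embed as an array
    "phones": [
        {"type": "mobile", "number": "2123455634"},
        {"type": "land", "number": "2023455634"},
    ],
    # 1:n where n is shared by other customers -> foreign key
    "branchId": "BR-NY-01",
    # n:n -> list of foreign keys
    "productIds": ["P-1", "P-7"],
}

# Shared entities live in their own collection, keyed by id.
branches = {"BR-NY-01": {"city": "New York"}}


def resolve_branch(cust: dict, branch_index: dict) -> dict:
    """Follow the foreign key from a customer to its shared branch record."""
    return branch_index[cust["branchId"]]


print(resolve_branch(customer, branches)["city"])
```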
  • MUCH easier to update JSON feed handler for new data
    Essentially constant time to ingest new or changed data!

    No silver bullet or magic about processing the data – but you are no longer wrestling with the database!
  • Build the rich structure!
    You have to do this anyway to produce a JSON file so if you can, go the extra distance and just directly insert the content.
    Don’t worry about transactions; you should be using batchID which we’ll get to in a moment.
  • mongoDB does not extend JSON per se. Rather, within the JSON spec, we have a structural-naming convention that allows us to clearly hint at the true intended type of the string value.

    These are natively grok’d by mongoimport, BTW.
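A small sketch of producing these type hints from Python, assuming the `$date` (epoch milliseconds) and `$binary`/`$type` shapes shown on slide 24. The helper names `ext_date` and `ext_binary` are made up for this example.

```python
import base64
import json
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ext_date(dt: datetime) -> dict:
    """Wrap a timezone-aware datetime as {"$date": <epoch milliseconds>}."""
    # Integer division of timedeltas avoids floating-point rounding.
    return {"$date": (dt - EPOCH) // timedelta(milliseconds=1)}


def ext_binary(data: bytes) -> dict:
    """Wrap raw bytes as base64 text with the generic binary subtype "00"."""
    return {"$binary": base64.b64encode(data).decode("ascii"), "$type": "00"}


doc = {
    "dateB": ext_date(datetime(2014, 5, 20, 20, 31, 5, 438000, tzinfo=timezone.utc)),
    "someBlob": ext_binary(b"bla bla bla"),
}
print(json.dumps(doc))
```

The result is still ordinary JSON; only a consumer that knows the convention turns the wrappers back into real date and binary values.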
  • By CR-delimited we mean no pretty-printing of the JSON: one document per line.
    The computer doesn’t care whether it’s pretty, and packing everything on one line has these benefits:
    Easy to read with a BufferedReader / fread
    Easy to grep, and standard Unix utilities work nicely too
    Same format as mongoimport and mongoexport
    Does not force a large memory footprint on the loader

    and you can use jq!
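A minimal sketch of a loader for this one-document-per-line format, using only the Python standard library. The function name `read_records` is an assumption for the example; the sample records are the ones from slide 25.

```python
import io
import json
from typing import Iterator, TextIO


def read_records(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed document per non-empty line."""
    # Only one line is held in memory at a time.
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


sample = io.StringIO(
    '{"name": "buzz", "locale": "NY"}\n'
    '{"name": "steve", "locale": "UK"}\n'
    '{"name": "john", "locale": "NY"}\n'
)

ny = [rec["name"] for rec in read_records(sample) if rec["locale"] == "NY"]
print(ny)
```

With a giant top-level array, the parser would instead have to read the whole file before yielding the first record.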
  • We have 100,000 items.
    Goal: How many mobile phones are explicitly marked as do-not-call?

    Challenge: single person per “greppable” line and phones is an array.
    In these 2 lines, there are 5 phones.
    Also phones.type is not the same as .type SO grepping for “mobile” leads to peril and very often wrong results

  • .phones select phones element from doc
    But we still have it as an ARRAY
    [] “flattens” out the array to be a set of documents! (just like $unwind in the mongoDB agg framework)

    jq operations are very rich. You can redact/replace fields, add brand new fields to output, etc.

    The -c option produces CR-delimited JSON

    JSON compresses very well (like one FIFTH the space) so go ahead and gzip -9 the JSON and decompress on the fly into jq!
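For readers without jq at hand, the same flatten-and-filter step can be sketched in Python, decompressing on the fly. This mirrors the filter from slide 27 (`.dnc == false and .type == "mobile"`); the function name `matching_phones` is an assumption, and the two sample lines are the documents from slide 26.

```python
import gzip
import io
import json


def matching_phones(stream):
    """Yield phones with dnc == false and type == "mobile" from CR-delimited JSON."""
    for line in stream:
        if not line.strip():
            continue
        doc = json.loads(line)
        # Equivalent of jq's .phones[]: one result per array element.
        for phone in doc.get("phones", []):
            # "is False" skips phones that have no dnc field at all.
            if phone.get("dnc") is False and phone.get("type") == "mobile":
                yield phone


RAW = (
    '{"name": "dave", "type": "mobile", "phones": ['
    '{"type": "mobile", "number": "2123455634", "dnc": false}, '
    '{"type": "mobile", "number": "6173455634"}, '
    '{"type": "land", "number": "2023455634"}]}\n'
    '{"name": "bob", "type": "WFH", "phones": ['
    '{"type": "land", "number": "70812342342", "dnc": false}, '
    '{"type": "land", "number": "7083455634"}]}\n'
)

# Compress in memory, then read it back line by line without a temp file.
compressed = gzip.compress(RAW.encode("utf-8"))
with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as stream:
    hits = list(matching_phones(stream))

print(len(hits))
```

Because the filter looks at `phones[].type` rather than the top-level `type`, dave's top-level `"type": "mobile"` does not produce a false match the way a plain grep would.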
  • Don’t be afraid to make mistakes – for the same reason we explored on slide 21.
  • Context is an identifier for a set of data: ABC123
    Dates are dangerous
    For global systems, two (or more!) local dates possible.
    System processing date can be misleading

    Context has additional benefits
    Easy to associate other information with context ID like functional ID
  • Single View of Customer does not mean Single Technical visualization of Customer thru GUI!!
  • Examples: Fin svc who uses this stack and how.


  • 1. Enterprise Architect, MongoDB Buzz Moschetti buzz.moschetti@mongodb.com #ConferenceHashTag Creating a Single View Part 2: Data Design & Loading Strategies
  • 2. Who Is Talking To You? • Yes, I use “Buzz” on my business cards • Former Investment Bank Chief Architect at JPMorganChase and Bear Stearns before that • Over 27 years of designing and building systems • Big and small • Super-specialized to broadly useful in any vertical • “Traditional” to completely disruptive • Advocate of language leverage and strong factoring • Inventor of perl DBI/DBD • Still programming – using emacs, of course
  • 3. What Is He Going To Talk About? Creating A Single View • Part 1: Overview & Data Analysis • Part 2: Data Design & Loading Strategies (Historic Challenges, New Strategy for Success, Technical examples and tips) • Part 3: Securing Your Deployment
  • 4. Historic Challenges
  • 5. It’s 2014: Why is this still hard to do? • Business / Technical / Information Challenges • Missteps in evolution of data transfer technology
  • 6. We wish this “just worked” • Query objects from A with great performance • Query objects from B with great performance • Query objects from merged A and B (X) with great performance
  • 7. …but Beware The Blue Arrow! • Extracting many tables into many files • Some tables require more than one file to capture representation • Encoding/formatting clever tricks • Reconciliation • Different extracts for different consumers • Different extracts for different versions of data to same consumer
  • 8. Loss of fidelity exposed
    class Product {
        String productName;
        List<Features> ff;
        Date introDate;
        List<Date> versDates;
        int[] unitBundles;
        //…
    }
    widget1,,3,,good texture,retains value,,,20142304,102.3,201401
    widget2,XS,6,,,,not fragile,,,20132304,73,87653
    widget3,XT,,,4,,dense,shiny,mysterious,,,19990304,73,87653,,
    widget4,,,3,4,,,,,,20040101,,999999,,
    [Diagram labels: A, ORM]
  • 9. What happened to XML?
    class Product {
        String productName;
        List<Features> ff;
        Date introDate;
        List<Date> versDates;
        int[] unitBundles;
        //…
    }
    <product>
      <name>widget1</name>
      <features>
        <feature>
          <text>good texture</text>
          <type>A</type>
        </feature>
      </features>
      <introDate>20140204</introDate>
      <versDates>
        <versDate>20100103</versDate>
        <versDate>20100601</versDate>
      </versDates>
      <unitBundles>1,3,9</unitBun…
  • 10. XML: Created More Issues Than Solved
    <product>
      <name>widget1</name>
      <features>
        <feature>
          <text>good texture</text>
          <type>A</type>
        </feature>
      </features>
      <introDate>20140204</introDate>
      <versDates>
        <versDate>20100103</versDate>
        <versDate>20100601</versDate>
      </versDates>
      <unitBundles>1,3,9</unitBun…
    • No native handling of arrays
    • Attribute vs. nested tag rules/conventions widely variable
    • Generic parsing (DOM) yields a tree of Nodes of Strings – not very friendly
    • SAX is fast but too low level
  • 11. … and it eventually became this
    <p name="widget1" ftxt1="good texture" ftyp1="A" idt="20140203" …
    <p name="widget2" ftxt1="not fragile" ftyp1="A" idt="20110117" …
    <p name="widget3" ftxt1="dense" idt="20140203" …
    <p name="widget4" idt="20140203" versD="20130403,20130104,20100605" …
    • Short, cryptic, conflated tag names
    • Everything is a string attribute
    • Mix of flattened arrays and delimited strings
    • Irony: org.xml.sax.Attributes easier to deal with than rest of DOM
  • 12. Schema Change Challenges: Multiplied & Concentrated!
    A: Alter table(s), extract more data; LOE = x_1
    B: Alter table(s), extract more data; LOE = x_2
    C: Alter table(s), extract more data; LOE = x_3
    X: for each feed, alter table(s) and split() more data
    $\mathrm{LOE} = \sum_{i=1}^{n} x_i + f(n)$, where f() is nonlinear wrt n
  • 13. SLAs & Security: Tough to Combine
    Source systems A and B: User 1 entitled to see X • User 2 entitled to see Y • User 1 entitled to see Z • User 2 entitled to see V
    Entitlements managed per-system/per-application here…. …are lost in the low-fidelity transfer of data…. …and have to be reconstituted here (at X) …somehow…
  • 14. Solving The Problem with mongoDB
  • 15. What We Are Building Today
  • 16. Overall Strategy For Success • Let the source systems entities drive the data design, not the physical database • Capture data in full fidelity • Perform cross-ref and additional logic at the single point of view, not in transit
  • 17. Don’t forget the power of the API
    class Product {
        String productName;
        List<Features> ff;
        Date introDate;
        List<Date> versDates;
        int[] unitBundles;
        //…
    }
    If you can, avoid files altogether!
    [Diagram label: Haskell]
  • 18. But if you are creating files: emit JSON
    class Product {
        String productName;
        List<Features> ff;
        Date introDate;
        List<Date> versDates;
        int[] unitBundles;
        //…
    }
    {
      "name": "widget1",
      "features": [ { "text": "good texture", "type": "A" } ],
      "introDate": "20140204",
      "versDates": [ "20100103", "20100601" ],
      "unitBundles": [1,3,7,9]
      // …
    }
  • 19. Let The Feeding System Express itself
    A: { "name": "widget1", "features": [ { "text": "good texture", "type": "A" } ] }
    B: { "myColors": ["red","blue"], "myFloats": [ 3.14159, 2.71828 ], "nest": { "as": { "deep": true }}}
    C: { "myBlob": { "$binary": "aGVsbG8K"}, "myDate": { "$date": "20130405" } }
  • 20. What if you forgot something?
    Before:
    {
      "name": "widget1",
      "features": [ { "text": "good texture", "type": "A" } ],
      "introDate": "20140204",
      "versDates": [ "20100103", "20100601" ],
      "versMinorNum": [1,3,7,9]
      // …
    }
    After (adds "coverage"):
    {
      "name": "widget1",
      "features": [ { "text": "good texture", "type": "A" } ],
      "coverage": [ "NY", "NJ" ],
      "introDate": "20140204",
      "versDates": [ "20100103", "20100601" ],
      "versMinorNum": [1,3,7,9]
      // …
    }
  • 21. The Joy (and value) of mongoDB
    A: Alter table(s), extract more data; LOE = .25 x_1
    B: Alter table(s), extract more data; LOE = .25 x_2
    C: Alter table(s), extract more data; LOE = .25 x_3
    LOE = O(1)
  • 22. Helpful Hints
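  • 23. Helpful Hint: Use the APIs
    # curs is a DB-API cursor on the source RDBMS; coll is a pymongo collection
    curs.execute("select A.did, A.fullname, B.number from contact A "
                 "left outer join phones B on A.did = B.did order by A.did")
    lastDID = None   # must be initialized before the first comparison
    contact = None
    for q in curs.fetchall():
        if q[0] != lastDID:
            # New contact: flush the previous one, then start a fresh document
            if lastDID is not None:
                coll.insert_one(contact)   # the slide used the older coll.insert()
            contact = {"did": q[0], "name": q[1]}
            lastDID = q[0]
        if q[2] is not None:
            if 'phones' not in contact:
                contact['phones'] = []
            contact['phones'].append({"number": q[2]})
    # Flush the final contact
    if lastDID is not None:
        coll.insert_one(contact)

    Resulting document:
    {
      "did" : "D159308",
      "phones" : [
        {"number": "1-666-444-3333"},
        {"number": "1-999-444-3333"},
        {"number": "1-999-444-9999"}
      ],
      "name" : "Buzz"
    }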
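A self-contained variant of the slide's join-to-document loop that can be run without a database server. The assumptions are loud ones: an in-memory `sqlite3` database stands in for the source RDBMS, a plain list stands in for the MongoDB collection, the second contact row is invented test data, and `itertools.groupby` replaces the hand-rolled `lastDID` bookkeeping.

```python
import sqlite3
from itertools import groupby

# Stand-in source system (assumption: the real feed would use its own driver).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table contact (did text primary key, fullname text);
    create table phones (did text, number text);
    insert into contact values ('D159308', 'Buzz');
    insert into contact values ('D159309', 'Steve');
    insert into phones values ('D159308', '1-666-444-3333');
    insert into phones values ('D159308', '1-999-444-3333');
""")

rows = conn.execute("""
    select A.did, A.fullname, B.number
    from contact A left outer join phones B on A.did = B.did
    order by A.did
""")

docs = []  # stand-in for the target collection
# groupby relies on the "order by A.did" above to keep each contact's rows adjacent.
for (did, name), group in groupby(rows, key=lambda r: (r[0], r[1])):
    doc = {"did": did, "name": name}
    numbers = [r[2] for r in group if r[2] is not None]
    if numbers:
        # Only add the array when the outer join actually found phones.
        doc["phones"] = [{"number": n} for n in numbers]
    docs.append(doc)

for doc in docs:
    print(doc)
```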
  • 24. Helpful Hint: Declare Types
    Use mongoDB conventions for dates and binary data:
    {"dateA": {"$date": "2014-05-16T09:42:57.112-0000"}}
    {"dateB": {"$date": 1400617865438}}
    {"someBlob": { "$binary" : "YmxhIGJsYSBibGE=", "$type" : "00" }}
  • 25. Helpful Hint: Keep the file flexible
    Use CR-delimited JSON:
    { "name": "buzz", "locale": "NY"}
    { "name": "steve", "locale": "UK"}
    { "name": "john", "locale": "NY"}
    …instead of a giant array:
    records = [
      { "name": "buzz", "locale": "NY"},
      { "name": "steve", "locale": "UK"},
      { "name": "john", "locale": "NY"}
    ]
  • 26. Helpful Hint: A quick sidebar on jq
    $ cat myData
    { "name": "dave", "type": "mobile", "phones": [ { "type": "mobile", "number": "2123455634", "dnc": false }, { "type": "mobile", "number": "6173455634" }, { "type": "land", "number": "2023455634" } ] }
    { "name": "bob", "type": "WFH", "phones": [ { "type": "land", "number": "70812342342", "dnc": false }, { "type": "land", "number": "7083455634" } ] }
    (another 99,998 rows)
  • 27. Helpful Hint: jq is JSON awk/sed/grep
    $ jq -c '.phones[] | select(.dnc == false and .type == "mobile")' myData
    {"dnc":false,"number":"2123455634","type":"mobile"}
    {"dnc":false,"number":"70812342342","type":"mobile"}
    …
    $ jq [expression above] | wc -l
    32433
    $ gzip -c -d myData.gz | jq [expression above] | wc -l
    32433
    http://stedolan.github.io/jq/
  • 28. Helpful Hint: Don’t be afraid of metadata
    Use a version number in each document:
    { "v": 1, "name": "buzz", "locale": "NY"}
    { "v": 1, "name": "steve", "locale": "UK"}
    { "v": 2, "name": "john", "region": "NY"}
    …or get fancier and use a header record:
    { "vers": 1, "creator": "ID", "createDate": …}
    { "name": "buzz", "locale": "NY"}
    { "name": "steve", "locale": "UK"}
    { "name": "john", "locale": "NY"}
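One way a consumer might use the per-document version number, sketched in Python. The sketch assumes that version 2 simply renamed `locale` to `region`, which is how the slide's sample reads but is not stated in the talk; the function name `normalize` is hypothetical.

```python
import json


def normalize(doc: dict) -> dict:
    """Map any known document version onto one internal shape."""
    version = doc.get("v", 1)  # assumption: unversioned documents are treated as v1
    if version == 1:
        return {"name": doc["name"], "region": doc["locale"]}
    if version == 2:
        return {"name": doc["name"], "region": doc["region"]}
    raise ValueError(f"unsupported document version: {version}")


lines = [
    '{"v": 1, "name": "buzz", "locale": "NY"}',
    '{"v": 1, "name": "steve", "locale": "UK"}',
    '{"v": 2, "name": "john", "region": "NY"}',
]

records = [normalize(json.loads(line)) for line in lines]
print(records)
```

Old and new documents can then sit in the same file or collection while feeders migrate at their own pace.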
  • 29. Helpful Hints: Use batch ID
    { "vers": 1, "batchID": "B213W", "createDate": …}
    { "name": "buzz", "locale": "NY"}
    { "name": "steve", "locale": "UK"}
    { "name": "john", "locale": "NY"}
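A sketch of how a loader might use the header's batch ID in place of a transaction: stamp every document with it, and if the load fails partway, remove everything carrying that ID. A plain Python list stands in for the target collection, and `load_batch` is a hypothetical name; against a real database the cleanup would be a delete filtered on `batchID`.

```python
import json


def load_batch(lines, store: list) -> str:
    """Load a header record plus documents; back out the batch on a parse error."""
    it = iter(lines)
    header = json.loads(next(it))
    batch_id = header["batchID"]
    try:
        for line in it:
            doc = json.loads(line)
            doc["batchID"] = batch_id  # stamp each document with its batch
            store.append(doc)
    except ValueError:
        # json.JSONDecodeError is a ValueError: undo this batch only, then re-raise.
        store[:] = [d for d in store if d.get("batchID") != batch_id]
        raise
    return batch_id


store = []
good = [
    '{"vers": 1, "batchID": "B213W"}',
    '{"name": "buzz", "locale": "NY"}',
    '{"name": "steve", "locale": "UK"}',
    '{"name": "john", "locale": "NY"}',
]
print(load_batch(good, store), len(store))
```

Because cleanup is keyed on the batch ID, a failed or duplicated feed can be backed out without touching documents from other batches.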
  • 30. Now that we have the data…
    You’re well on your way to a single view consolidation…but first:
    – Data Work
      • Cross-reference important keys
      • Potential scrubbing/cleansing
    – Software Stack Work
  • 31. You’ve Built a Great Data Asset; leverage it!
  • 32. DON’T Build This! Giant Glom Of GUI-biased code http://yourcompany/yourapp
  • 33. Build THIS! http://yourcompany/yourapp • Data Access Layer • Object Construction Layer • Basic Functional Layer • Portal Functional Layer • GUI adapter Layer • Web Service Layer • Other Regular Performance Applications • Higher Performance Applications • Special Generic Applications
  • 34. What Is Happening Next? Creating A Single View • Part 1: Overview & Data Analysis • Part 2: Data Design & Loading Strategies • Part 3: Securing Your Deployment (Access Control, Data Protection, Auditing)
  • 35. Enterprise Architect, MongoDB Buzz Moschetti buzz.moschetti@mongodb.com #ConferenceHashTag Q&A