Creating a Single View Part 2: Loading Disparate Source Data and Creating a Single Enterprise-Wide View

Published in: Technology
  • And why are we doing it at all? Federation? Managed QoS? Because traditional RDBMS dynamics make it difficult to serve a number of access patterns well.

    The single most important part of this that will make you successful is the simplest – and is part of the mongoDB data environment

  • Fidelity of data in the ETL fabric is typically the lowest common denominator (LCD)
    CSV still carries the day because it is easy to make and technically easy to parse (but difficult to change or to express structure in)
    XML / XSD “too hard” to technically make, parse/consume, and harder still to create consistent list/array conventions

    Anecdote about getting skewered by the arrow
    The arrow is disingenuous!
    This is LOSS OF FIDELITY


  • Most people use an ORM to get from DB to good objects – and mongoDB has a story around that too!
    But for the moment, assume we use it.



  • XML was supposed to be The Thing.
  • XML / XSD “too hard” to technically make, parse/consume, and harder still to create consistent list/array conventions

    No one runs schema validation in production because of performance
    Schemas became too complicated anyway…..

    JAXB, JAXP are compile-time bound
  • XML set us back about 10 years

    Leads to this: Can you please just send me a CSV again?


  • Changes to data in source system imply DB schema upgrade in data hub – with X source systems, this starts to become unscalable
    Hub Data storage scalability
    In summary: traditionally, common data hubs are harder to manage than the sum of their source systems – which themselves are not so easy to manage!

    Remember this formula; we’ll see how we improve upon this in just a bit.
  • Data entitlement implicit to system access
    Fast-moving businesses cannot be held up by naturally slower-moving ones

    (Andreas will cover this in greater detail later)
     

  • How did we get here, examples from past? Anecdotal reinforcement. Knowing legacy problems and experience, here are the 3 things that work.

    Don’t think about transferring tables; think about transferring products, logs, trades, customers
    ----- Meeting Notes (5/19/14 13:31) -----
  • A zillion APIs.
    This does not necessarily mean real time. We can do real time with “microbatching”. We can do EOD batch with a file-free API. It’s all about how producer and consumer agree to capture the data – we’ll see more about this context later in the presentation.
    ----- Meeting Notes (5/19/14 13:31) -----
    Our most successful customers do this
    or use microbatching.

    The Green Arrow
  • JSON is the new leader among highly interoperable, ASCII structured data formats
    ASCII interop is critical, so binary formats such as GPB (Google Protocol Buffers) and Avro are out.
    Better than XML because
    Strings, numbers, maps, and arrays natively supported
    Simpler data model (no attributes or unnested content)
    Easier to programmatically construct

    (Much!) better than CSV because
    Rich detail is preserved
    Content can be expanded later without struggling with “comma hell”

    Warning: JSON does NOT have Date or binary (BLOB) types! We’ll come back to a strategy on that….
  • The Basic Rules:
    Let feeder systems drive the data design
    Do not dilute, format, or otherwise mess with the data
  • JUST ADD IT.
    Not talking about doubles turning into lists of dates – but there’s a hint coming up that could help there too.
  • MUCH easier to update JSON feed handler for new data
    Essentially constant time to ingest new or changed data!
  • Build the rich structure!
    You have to do this anyway to produce a JSON file so if you can, go the extra distance and just directly insert the content.
    Don’t worry about transactions; you should be using batchID which we’ll get to in a moment.
  • mongoDB does not extend JSON per se. Rather, within the JSON spec, we have a structural-naming convention that allows us to clearly hint at the true intended type of the string value.
  • Easy to grep and use jq too
    Std unix utils work nicely too:
    Same format as mongoimport and mongoexport
    Does not force large memory footprint on loader

  • Don’t be afraid to make mistakes – for the same reason we explored on slide 21.
  • Context is an identifier for a set of data: ABC123
    Dates are dangerous
    For global systems, two (or more!) local dates possible.
    System processing date can be misleading

    Context has additional benefits
    Easy to associate other information with context ID like functional ID
  • Single View of Customer does not mean Single Technical visualization of Customer thru GUI!!
  • Examples: Fin svc who uses this stack and how.
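The notes above on line-delimited JSON files, header records, and batch IDs can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's loader: the function name `read_batch` and the sample data are hypothetical, and the sketch assumes the first non-blank line of the file is a header record carrying `batchID`, as on the batch ID slide. Because it reads one line at a time, it never needs to hold the whole file in memory.

```python
import io
import json
from typing import Iterator, TextIO


def read_batch(stream: TextIO) -> Iterator[dict]:
    """Yield one document per line, stamped with the header's batchID."""
    header = None
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if header is None:
            header = record  # assumption: first non-blank line is the header
            continue
        # Tag each document so the whole batch can be located later.
        record["batchID"] = header["batchID"]
        yield record


# Hypothetical sample file, modelled on the header-record slide.
sample = io.StringIO(
    '{"vers": 1, "batchID": "B213W"}\n'
    '{"name": "buzz", "locale": "NY"}\n'
    '{"name": "steve", "locale": "UK"}\n'
)
docs = list(read_batch(sample))
```

Each yielded dictionary could be handed straight to a collection insert. Stamping every document with the batch ID is one way to read the note about not worrying about transactions: a partially loaded batch can be found by its ID and dealt with as a unit.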
  • Creating a Single View Part 2: Loading Disparate Source Data and Creating a Single Enterprise-Wide View

    1. Creating a Single View Part 2: Data Design & Loading Strategies • Buzz Moschetti, Enterprise Architect, MongoDB • buzz.moschetti@mongodb.com • #ConferenceHashTag
    2. Who Is Talking To You? • Yes, I use “Buzz” on my business cards • Former Investment Bank Chief Architect at JPMorganChase and Bear Stearns before that • Over 27 years of designing and building systems • Big and small • Super-specialized to broadly useful in any vertical • “Traditional” to completely disruptive • Advocate of language leverage and strong factoring • Inventor of perl DBI/DBD • Still programming – using emacs, of course
    3. What Is He Going To Talk About? • Historic Challenges • New Strategy for Success • Technical examples and tips • Creating A Single View: Part 1: Overview & Data Analysis; Part 2: Data Design & Loading Strategies; Part 3: Securing Your Deployment
    4. Historic Challenges
    5. It’s 2014: Why is this still hard to do? • Business / Technical / Information Challenges • Missteps in evolution of data transfer technology [Diagram: A, X]
    6. We wish this “just worked” [Diagram: A, B, X] • Query objects from A with great performance • Query objects from B with great performance • Query objects from merged A and B with great performance
    7. …but Beware The Blue Arrow! [Diagram: A, X] • Extracting many tables into many files • Some tables require more than one file to capture representation • Encoding/formatting clever tricks • Reconciliation • Different extracts for different consumers • Different extracts for different versions of data to same consumer
    8. Loss of fidelity exposed [Diagram: A, ORM]
         class Product {
             String productName;
             List<Features> ff;
             Date introDate;
             List<Date> versDates;
             int[] unitBundles;
             //…
         }
         widget1,,3,,good texture,retains value,,,20142304,102.3,201401
         widget2,XS,6,,,,not fragile,,,20132304,73,87653
         widget3,XT,,,4,,dense,shiny,mysterious,,,19990304,73,87653,,
         widget4,,,3,4,,,,,,20040101,,999999,,
    9. What happened to XML?
         class Product {
             String productName;
             List<Features> ff;
             Date introDate;
             List<Date> versDates;
             int[] unitBundles;
             //…
         }
         <product>
           <name>widget1</name>
           <features>
             <feature>
               <text>good texture</text>
               <type>A</type>
             </feature>
           </features>
           <introDate>20140204</introDate>
           <versDates>
             <versDate>20100103</versDate>
             <versDate>20100601</versDate>
           </versDates>
           <unitBundles>1,3,9</unitBun…
    10. XML: Created More Issues Than Solved
         <product>
           <name>widget1</name>
           <features>
             <feature>
               <text>good texture</text>
               <type>A</type>
             </feature>
           </features>
           <introDate>20140204</introDate>
           <versDates>
             <versDate>20100103</versDate>
             <versDate>20100601</versDate>
           </versDates>
           <unitBundles>1,3,9</unitBun…
         • No native handling of arrays
         • Attribute vs. nested tag rules/conventions widely variable
         • Generic parsing (DOM) yields a tree of Nodes of Strings – not very friendly
         • SAX is fast but too low level
    11. … and it eventually became this
         <p name="widget1" ftxt1="good texture" ftyp1="A" idt="20140203" …
         <p name="widget2" ftxt1="not fragile" ftyp1="A" idt="20110117" …
         <p name="widget3" ftxt1="dense" idt="20140203" …
         <p name="widget4" idt="20140203" versD="20130403,20130104,20100605" …
         • Short, cryptic, conflated tag names
         • Everything is a string attribute
         • Mix of flattened arrays and delimited strings
         • Irony: org.xml.sax.Attributes easier to deal with than rest of DOM
    12. Schema Change Challenges: Multiplied & Concentrated!
         A: Alter table(s), extract more data; LOE = $x_1$
         B: Alter table(s), extract more data; LOE = $x_2$
         C: Alter table(s), extract more data; LOE = $x_3$
         X: Alter table(s), split() more data (shown three times, once per source)
         Total: $\mathrm{LOE} = \sum_{i=1}^{n} x_i + f(n)$, where $f()$ is nonlinear with respect to $n$
    13. SLAs & Security: Tough to Combine [Diagram: A, B, X] • User 1 entitled to see X • User 2 entitled to see Y • User 1 entitled to see Z • User 2 entitled to see V • Entitlements managed per-system/per-application here… are lost in the low-fidelity transfer of data… and have to be reconstituted here… somehow…
    14. Solving The Problem with mongoDB
    15. What We Are Building Today
    16. Overall Strategy For Success • Let the source systems’ entities drive the data design, not the physical database • Capture data in full fidelity • Perform cross-ref and additional logic at the single point of view
    17. Don’t forget the power of the API
         class Product {
             String productName;
             List<Features> ff;
             Date introDate;
             List<Date> versDates;
             int[] unitBundles;
             //…
         }
         If you can, avoid files altogether! [Diagram label: Haskell]
    18. But if you are creating files: emit JSON
         class Product {
             String productName;
             List<Features> ff;
             Date introDate;
             List<Date> versDates;
             int[] unitBundles;
             //…
         }
         {
           "name": "widget1",
           "features": [ { "text": "good texture", "type": "A" } ],
           "introDate": "20140204",
           "versDates": [ "20100103", "20100601" ],
           "unitBundles": [1,3,7,9]
           // …
         }
    19. Let The Feeding System Express itself
         A: { "name": "widget1", "features": [ { "text": "good texture", "type": "A" } ] }
         B: { "myColors": ["red","blue"], "myFloats": [ 3.14159, 2.71828 ], "nest": { "as": { "deep": true }}}
         C: { "myBlob": { "$binary": "aGVsbG8K"}, "myDate": { "$date": "20130405" } }
    20. What if you forgot something?
         {
           "name": "widget1",
           "features": [ { "text": "good texture", "type": "A" } ],
           "introDate": "20140204",
           "versDates": [ "20100103", "20100601" ],
           "versMinorNum": [1,3,7,9]
           // …
         }
         {
           "name": "widget1",
           "features": [ { "text": "good texture", "type": "A" } ],
           "coverage": [ "NY", "NJ" ],
           "introDate": "20140204",
           "versDates": [ "20100103", "20100601" ],
           "versMinorNum": [1,3,7,9]
           // …
         }
    21. The Joy (and value) of mongoDB
         A: Alter table(s), extract more data; LOE = $0.25x_1$
         B: Alter table(s), extract more data; LOE = $0.25x_2$
         C: Alter table(s), extract more data; LOE = $0.25x_3$
         Total: LOE = $O(1)$
    22. Helpful Hint: Use the APIs
         # Assumes curs is an open DB-API cursor and coll is a pymongo collection.
         curs.execute("select A.did, A.fullname, B.number from contact A "
                      "left outer join phones B on A.did = B.did order by A.did")
         lastDID = None
         contact = None
         for q in curs.fetchall():
             if q[0] != lastDID:
                 # New contact: flush the previous one before starting the next.
                 if lastDID is not None:
                     coll.insert_one(contact)
                 contact = {"did": q[0], "name": q[1]}
                 lastDID = q[0]
             if q[2] is not None:
                 # Left outer join: number is NULL for contacts without phones.
                 if 'phones' not in contact:
                     contact['phones'] = []
                 contact['phones'].append({"number": q[2]})
         if lastDID is not None:
             coll.insert_one(contact)  # flush the final contact
         Resulting document:
         {
           "did": "D159308",
           "phones": [
             {"number": "1-666-444-3333"},
             {"number": "1-999-444-3333"},
             {"number": "1-999-444-9999"}
           ],
           "name": "Buzz"
         }
    23. Helpful Hint: Declare Types
         Use mongoDB conventions for dates and binary data:
         {"dateA": {"$date": "2014-05-16T09:42:57.112-0000"}}
         {"dateB": {"$date": 1400617865438}}
         {"someBlob": {"$binary": "YmxhIGJsYSBibGE=", "$type": "00"}}
    24. Helpful Hint: Keep the file flexible
         Use CR-delimited JSON:
         { "name": "buzz", "locale": "NY"}
         { "name": "steve", "locale": "UK"}
         { "name": "john", "locale": "NY"}
         …instead of a giant array:
         records = [
           { "name": "buzz", "locale": "NY"},
           { "name": "steve", "locale": "UK"},
           { "name": "john", "locale": "NY"},
         ]
    25. Helpful Hint: Don’t be afraid of metadata
         Use a version number in each document:
         { "v": 1, "name": "buzz", "locale": "NY"}
         { "v": 1, "name": "steve", "locale": "UK"}
         { "v": 2, "name": "john", "region": "NY"}
         …or get fancier and use a header record:
         { "vers": 1, "creator": "ID", "createDate": …}
         { "name": "buzz", "locale": "NY"}
         { "name": "steve", "locale": "UK"}
         { "name": "john", "locale": "NY"}
    26. Helpful Hints: Use batch ID
         { "vers": 1, "batchID": "B213W", "createDate": …}
         { "name": "buzz", "locale": "NY"}
         { "name": "steve", "locale": "UK"}
         { "name": "john", "locale": "NY"}
    27. Now that we have the data… You’re well on your way to a single view consolidation…but first: – Data Work • Cross-reference important keys • Potential scrubbing/cleansing – Software Stack Work
    28. You’ve Built a Great Data Asset; leverage it!
    29. DON’T Build This! Giant Glom Of GUI-biased code • http://yourcompany/yourapp
    30. Build THIS! http://yourcompany/yourapp • Data Access Layer • Object Construction Layer • Basic Functional Layer • Portal Functional Layer • GUI adapter Layer • Web Service Layer • Other Regular Performance Applications • Higher Performance Applications • Special Generic Applications
    31. What Is Happening Next? Creating A Single View: Part 1: Overview & Data Analysis; Part 2: Data Design & Loading Strategies; Part 3: Securing Your Deployment (Access Control, Data Protection, Auditing)
    32. Thank You • Buzz Moschetti, Enterprise Architect, MongoDB • buzz.moschetti@mongodb.com • #ConferenceHashTag
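The "Declare Types" hint on slide 23 can be illustrated from the consumer's side. The sketch below uses only the Python standard library to turn the `$date` and `$binary` wrappers back into native values while parsing. The function name `decode_hints` is hypothetical, and the handling covers only the three shapes shown on the slide; it is not a full implementation of the mongoDB convention. In practice a driver utility such as PyMongo's `bson.json_util` module would normally do this work.

```python
import base64
import json
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_hints(obj: dict):
    """json object_hook: replace type-hint wrappers with native values."""
    if set(obj) == {"$date"}:
        value = obj["$date"]
        if isinstance(value, int):
            # Integer form: milliseconds since the Unix epoch.
            return EPOCH + timedelta(milliseconds=value)
        # String form as shown on the slide, e.g. 2014-05-16T09:42:57.112-0000
        return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f%z")
    if "$binary" in obj:
        # Base64 payload; the optional "$type" subtype is ignored in this sketch.
        return base64.b64decode(obj["$binary"])
    return obj


# The three example lines from the slide, one JSON document per line.
lines = [
    '{"dateA": {"$date": "2014-05-16T09:42:57.112-0000"}}',
    '{"dateB": {"$date": 1400617865438}}',
    '{"someBlob": {"$binary": "YmxhIGJsYSBibGE=", "$type": "00"}}',
]
parsed = [json.loads(line, object_hook=decode_hints) for line in lines]
```

Because the hints are ordinary JSON objects, tools that know nothing about the convention (grep, jq, a plain JSON parser) still read the file, and only the loader needs to interpret the wrappers.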
