
Neo4p dcbpw-2015

Director, Data Mgt and Interoperability
Apr. 14, 2015

  1. (Perl)-[:speaks]->(Neo4j) • Mark A. Jensen • https://github.com/majensen/rest-neo4p.git
  2. • Perler since 2000 • CPAN contributor (MAJENSEN) since 2009 • BioPerl Core Developer • Director, Genomic Data Programs, Leidos Biomedical Research Inc (FNLCR) • @thinkinator, LinkedIn
  3. Not my sponsor, but could be yours! • http://www.perlfoundation.org/how_to_write_a_proposal
  4. Motivation • Cancer Genomics: Biospecimen, Clinical, Analysis Data – complex – growing – evolving technologies – evolving policies – need for precise accounting • Graph models are well-suited to this world
  5. [Figure: simplified graph data model. Nodes: Patient, Tumor Sample, Normal Sample, Clinical, Extract, Data File. Relationships: derived_from, analysis_of. Properties: age, diagnosis, stage, date shipped.]
  6. Graph vs RDBMS • [Figure: graph of the nodes foo, bar, baz, spam, eggs, squirrel, goob]
  7. SQL: select bar.name from bar, bar_baz, baz, baz_goob, goob, goob_squirrel, squirrel, squirrel_spam, spam, spam_eggs, eggs, eggs_foo, foo where bar.id = bar_baz.bar_id and bar_baz.baz_id = baz.id and baz.id = baz_goob.baz_id and baz_goob.goob_id = goob.id and goob.id = goob_squirrel.goob_id and goob_squirrel.squirrel_id = squirrel.id and squirrel.id = squirrel_spam.squirrel_id and squirrel_spam.spam_id = spam.id and spam.id = spam_eggs.spam_id and spam_eggs.eggs_id = eggs.id and eggs_foo.eggs_id = eggs.id and eggs_foo.foo_id = foo.id and foo.name = 'zloty'; • Cypher: match (f:foo)-[*5..8]-(b:bar) where f.name = 'zloty' return b.name
  8. Neo4j • “Native” graph DB engine (currently in v2.2) – Written in Java, but – Very complete REST API – Custom query language: Cypher – Free community edition – Lots of community support, including many “language drivers” • Not the only one out there, but probably the most widely used (certainly the best marketed)
  9. Neo4p
  10. Neo4p
  11. Neo4p • Create Node • Label Node • Create Unique Node • Add a Prop • Link Nodes • Load/Use Index
  12. Neo4p
  13. Design Goals • "OGM" – Perl 5 objects backed by the graph • User should never have to deal with a REST endpoint* (*unless she wants to) • User should never/only have to deal with Cypher queries† (†unless he wants/doesn’t want to) • Robust enough for production code – System should approach complete coverage of the REST service – System should be robust to REST API changes and backward-compatible with older servers (or at least version-aware) • Take advantage of the self-describing features of the API
  14. REST::Neo4p core objects • Are Node, Relationship, Index – Index objects represent legacy (v1.0) indexes – v2.0 “background” indexes are handled in Schema • Are blessed scalar refs ("inside-out object" pattern) – the scalar value is the item ID (or index name) – for any object $obj, $$obj (the ID) is exactly what you need for constructing the API calls • Are subclasses of Entity – Entity does the object table handling, JSON-to-object conversion and HTTP agent calls – Entity isolates most of the kludges necessary to handle the few API inconsistencies that exist(ed)
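
The blessed-scalar-ref idea on slide 14 can be illustrated with a minimal, self-contained sketch. This is not the REST::Neo4p source: the package names `Sketch::Entity` and `Sketch::Node`, the `%TABLE` layout, and the `url` and `_path` helpers are all hypothetical, chosen only to show why `$$obj` is enough to build an API call.

```perl
use strict;
use warnings;

package Sketch::Entity;    # hypothetical base class, NOT REST::Neo4p::Entity

my %TABLE;                 # object table: class -> id -> cached properties

sub new {
    my ($class, $id, %props) = @_;
    $TABLE{$class}{$id} = {%props};    # the data lives outside the object
    return bless \$id, $class;         # the object itself is only a scalar ref
}

sub id { my $self = shift; return $$self }

sub get_property {
    my ($self, $key) = @_;
    return $TABLE{ ref $self }{$$self}{$key};
}

# Hypothetical URL builder: the dereferenced ID is the only state it needs.
sub url {
    my ($self, $base) = @_;
    return join '/', $base, $self->_path, $$self;
}

package Sketch::Node;
our @ISA = ('Sketch::Entity');
sub _path { return 'node' }

package main;

my $node = Sketch::Node->new(42, name => 'zloty');
print $node->url('http://localhost:7474/db/data'), "\n";
print $node->get_property('name'), "\n";
```

Because the object carries nothing but its ID, two objects for the same node always agree on their properties: both look them up in the shared table.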
  15. Batch Calls • In some situations (e.g., database loading) it makes sense to batch: do many things in one API call rather than many single calls • REST API provides this functionality • How to make it "natural" in the context of working with objects? – Use Perl prototyping sugar to create a "batch block"
  16. Example: Rather than call the server for every line, you can mix in REST::Neo4p::Batch, and then use a batch {} block. Calls within the block are collected and deferred.
  17. You can execute more complex logic within the batch block, and keep the objects beyond it.
  18. But miracles are not yet implemented: within the block, an object doesn't really exist yet…
  19. How does that work? • Agent module isolates all bona fide calls – very few kludges to core object modules req'd • batch() puts the agent into “batch mode” and executes wrapped code – agent stores incoming calls as JSON in a queue • After wrapped code is executed, batch() switches agent back to normal mode and has it call the batch endpoint with the queue contents • Batch processes the response and creates objects if requested
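
The mechanism on slide 19 can be sketched in plain Perl with no server involved. Everything here is an assumption made for illustration: `Sketch::Agent`, its `request` and `flush` methods, and the in-memory `sent` log stand in for the real agent and the batch endpoint, and this `batch` takes only a code block, without the object-handling flag that REST::Neo4p::Batch accepts.

```perl
use strict;
use warnings;

package Sketch::Agent;    # hypothetical stand-in for the HTTP agent

sub new {
    my $class = shift;
    return bless { batch_mode => 0, queue => [], sent => [] }, $class;
}

# Every "call" goes through here, so batch mode needs only one switch.
sub request {
    my ($self, $method, $path, $body) = @_;
    my $job = { method => $method, to => $path, body => $body };
    if ($self->{batch_mode}) {
        push @{ $self->{queue} }, $job;    # defer the call
        return $#{ $self->{queue} };       # placeholder job id
    }
    push @{ $self->{sent} }, [$job];       # immediate: one call, one job
    return;
}

# Send everything queued as a single call, then empty the queue.
sub flush {
    my $self = shift;
    my @jobs = splice @{ $self->{queue} };
    push @{ $self->{sent} }, \@jobs if @jobs;
    return scalar @jobs;
}

package main;

my $agent = Sketch::Agent->new;

# The (&) prototype lets callers write  batch { ... };  without 'sub'.
sub batch (&) {
    my ($code) = @_;
    $agent->{batch_mode} = 1;
    my $ok  = eval { $code->(); 1 };
    my $err = $@;
    $agent->{batch_mode} = 0;              # always leave batch mode
    die $err unless $ok;
    return $agent->flush;
}

my $count = batch {
    $agent->request(POST => '/node', { name => "n$_" }) for 1 .. 3;
};
print "jobs sent in one call: $count\n";
print "calls made: ", scalar @{ $agent->{sent} }, "\n";
```

The placeholder job id returned in batch mode hints at why, as slide 18 says, an object created inside the block does not really exist until the queue has been sent.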
  20. HTTP Agent
  21. Agent • Is transparent – But you can always see it with REST::Neo4p->agent – Agent module alone meant to be useful and independent • Elicits and uses the API self-discovery feature on connect() • Isolates all HTTP requests and responses • Captures and distinguishes API and HTTP errors – emits REST::Neo4p::Exceptions objects • [Instance] Is a subclass of a "real" user agent: – LWP::UserAgent – Mojo::UserAgent, or – HTTP::Thin
  22. Working within API Self-Description • Get the list of actions with – $agent->available_actions • And AUTOLOAD will provide (see pod for args): – $agent->get_<action>() – $agent->put_<action>() – $agent->post_<action>() – $agent->delete_<action>() • Other accessors, e.g. node(), return the appropriate URL for your server
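
How AUTOLOAD can turn a discovered action list into `get_<action>`-style methods is sketched below. This is an illustration under assumptions, not the REST::Neo4p::Agent implementation: the class `Sketch::DiscoveringAgent` is hypothetical, the action map is passed in by hand instead of being fetched from a server, and the methods return a description of the request rather than performing it.

```perl
use strict;
use warnings;

package Sketch::DiscoveringAgent;    # hypothetical, NOT REST::Neo4p::Agent

our $AUTOLOAD;

sub new {
    my ($class, %actions) = @_;
    # In a real agent this map would come from the server's self-description.
    return bless { actions => {%actions} }, $class;
}

sub available_actions {
    my $self  = shift;
    my @names = sort keys %{ $self->{actions} };
    return @names;
}

sub AUTOLOAD {
    my ($self, @args) = @_;
    my $name = $AUTOLOAD;
    $name =~ s/.*:://;                       # strip the package prefix
    return if $name eq 'DESTROY';            # never autoload the destructor
    my ($verb, $action) = $name =~ /^(get|put|post|delete)_(\w+)$/
        or die "No such method: $name\n";
    my $url = $self->{actions}{$action}
        or die "Server does not advertise action '$action'\n";
    # A real agent would issue the HTTP request; the sketch only describes it.
    return join ' ', uc $verb, join('/', $url, @args);
}

package main;

my $agent = Sketch::DiscoveringAgent->new(
    node => 'http://localhost:7474/db/data/node',
);
print join(',', $agent->available_actions), "\n";
print $agent->get_node(42), "\n";
print eval { $agent->get_widget(1); 1 } ? "ok\n" : "error: $@";
```

Because methods are resolved against whatever the server advertises, an action that a given server version lacks fails with a clear error instead of a request to a guessed URL.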
  23. Schemas - Use Case • You start out with a set of well-categorized things that have some well-defined relationships. Each thing will be represented as a node; that's fine. But you want to guarantee (to your client, for example) that 1. you can classify every node you add or read unambiguously into a well-defined group (you know everything that’s in there); 2. you never relate two nodes belonging to particular groups in a way that doesn't make sense according to your well-defined relationships (you can find everything that’s in there).
  24. Schema Helps • REST::Neo4p::Schema – Access the (limited) schema functionality of Neo4j server – Create indexes – Maintain uniqueness of nodes within Label classes • REST::Neo4p::Constrain - An add-in for constraining (or validating) – property values – connections (relationships) based on node properties – relationship types according to flexible specifications
  25. App-level Constraints
  26. 26
  27. Will throw at Record 5
  28. Constrain/Constraint • Multiple modes: – Automatic (throws exception if constraint violated) – Manual (validation function returns false if constraint violated) – Suspended (lift constraint processing when desired) • Freeze/Thaw (in JSON) constraint specifications for reuse
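
The three modes on slide 28 differ only in what happens when validation fails, which a small sketch can make concrete. This is not the REST::Neo4p::Constrain API: the class `Sketch::Validator`, its `mode`, `validate` and `check` methods, and the pattern-per-property specification are hypothetical stand-ins.

```perl
use strict;
use warnings;

package Sketch::Validator;    # hypothetical, NOT the REST::Neo4p::Constrain API

sub new {
    my ($class, %required) = @_;
    return bless { mode => 'auto', required => {%required} }, $class;
}

sub mode {
    my ($self, $m) = @_;
    $self->{mode} = $m if defined $m;
    return $self->{mode};
}

# True when every required property is present and matches its pattern.
sub validate {
    my ($self, $props) = @_;
    for my $key (keys %{ $self->{required} }) {
        my $re = $self->{required}{$key};
        return 0 unless defined $props->{$key} && $props->{$key} =~ $re;
    }
    return 1;
}

# Gatekeeper that a node constructor would call before talking to the server.
sub check {
    my ($self, $props) = @_;
    return 1 if $self->{mode} eq 'suspended';    # constraint processing lifted
    my $ok = $self->validate($props);
    die "constraint violated\n" if !$ok && $self->{mode} eq 'auto';
    return $ok;                                  # manual: the caller decides
}

package main;

my $v   = Sketch::Validator->new(name => qr/^\w+$/);
my $bad = { name => '' };

$v->mode('manual');
print "manual: ", ($v->check($bad) ? "accepted" : "rejected"), "\n";
$v->mode('suspended');
print "suspended: ", ($v->check($bad) ? "accepted" : "rejected"), "\n";
$v->mode('auto');
print "auto: ", (eval { $v->check($bad); 1 } ? "accepted\n" : "threw: $@");
```

Automatic mode suits loaders that should stop at the first bad record, while manual mode leaves the decision to the calling code.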
  29. Cypher Queries • REST::Neo4p::Query takes a familiar, DBI-like approach – Prepare, execute, fetch – "rows" returned are arrays containing scalars, Node objects, and/or Relationship objects • Simple Perl data structures can be requested instead if desired – If a query returns a path, a Path object (a simple container) is returned
  30. 30
  31. Cypher Queries • Prepare and execute with parameter substitutions • Do This! Not This!
  32. Cypher Queries • Transactions are supported when you have v2.0.1 server or greater – started with REST::Neo4p->begin_work() – committed with REST::Neo4p->commit() – canceled with REST::Neo4p->rollback() (here, the class looks like the database handle in DBI, in fact…)
  33. DBI – DBD::Neo4p • Yes, you can really do this:
  34. DBI – DBD::Neo4p • Row returns: choice of full objects or simple Perl structures
  35. Future Directions/Contribution Ideas • Test on v2.2 server and fix any issues • Make Neo4p closer to an ORM (require explicit push/pull from backend server) • Sunset v1.0 support – Completely touch-free testing within transactions – Integrate node labels better • Make batch response parsing more efficient – e.g., don't stream if response is not huge • Add traversal functionality • Beautify and deodorize
  36. Thanks! • https://github.com/majensen/rest-neo4p.git

Editor's Notes

  1. Simplified data model for a cancer genomics study
  2. Real (high level) model for a cancer genomics study. Think of every node as representing a table, and every edge as a foreign key (and potentially a linking table). Imagine the join you would have to write to find records on the far left that are related to records on the far lower right. Because of that complexity, you would probably not build the structure you see here in an RDBMS. But then your model is serving the technology at the expense of representing the real-world relationships and items.
  3. The batch {} sugar indicates that the calls that would have been made immediately are deferred and kept in a queue, to be emitted after the code inside is done. 'discard_objs' means don't preserve the object information in memory.
  4. From the connect() method of REST::Neo4p::Agent.
  5. This is a little against the NoSQL and graph grain. But the fact is that data stewardship requirements may put you in the position of explaining how your apps will positively maintain the integrity and connectivity of the data. Schemas are not dead. If your team has more than one member, there needs to be some externally established and consultable way to know what is in your datastore. If you have a client who wants to make sure you have stored everything she wants stored, you have to be able to report, validate and verify that.
  6. Parameterized queries are the right thing to do in principle. They also avoid creating thousands of query objects, and the server tends to bork on thousands of new queries. SomaFM DefCon radio snip: “Devs won’t write parameterized queries unless there’s a gun to their head. We know, we hold the gun.”
  7. The transaction REST endpoint is different from the cypher REST endpoint. Neo4p pays attention to whether you're in a transaction or not, and informs the Agent which endpoint to use. While adding transaction support to Neo4p, I identified a bug in server v2.0.0. I submitted a ticket and it got fixed! Now it is my responsibility not to lead users down the rosy path: Neo4p checks the server version and throws if the user has a server older than v2.0.1.