Palantir XML Formats


  1. 1. Palantir XML Formats: PalantirXML (pXML) & PalantirDocXML (DocXML). Ari Gordon-Schlosberg, Senior Software Engineer. © 2008 Palantir Technologies Inc. All rights reserved.
  2. 2. Palantir XML Formats
     • Written in XML Schema Definition (XSD) language
       – W3C standard
       – Widely accepted
     • Allows developers to leverage existing XML tools
       – Editing
       – Verification
       – Transformation (XSLT) friendly
     • Designed to be simple & human-readable
       – Follows Palantir design principles
       – Meant to make it easier for developers to code, debug, and learn
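As a minimal sketch of the "existing XML tools" point, Python's standard library can already check that a document is well-formed XML. The sample strings below are made up for illustration, and this checks syntax only; validating against the XSD would need a schema-aware validator, which the standard library does not provide.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML.

    This checks syntax only; it does not validate against an XSD.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical snippets: the element names appear on slide 14,
# the "id" attribute is invented for this example.
good = '<object id="obj1"><property/></object>'
bad = '<object id="obj1"><property></object>'  # <property> is never closed

print(is_well_formed(good), is_well_formed(bad))
```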
  3. 3. PalantirXML: An Introduction
     • A rendering of a Palantir object graph into XML
       – Encodes nearly all features in our lowest-level data model
       – “Close to the metal”
     • Used as open import format
       – Makes Palantir integration-friendly and a truly open platform
       – Federated Search on-the-fly import uses it internally
       – Highly efficient storage format
     • Used for export/interchange
       – Allows an organization to pull knowledge out of Palantir
       – Can be transformed using XSLT to other XML formats
  4. 4. PalantirDocXML: An Introduction
     • Container for textual docs and entity extraction output
       – raw text
       – source document
       – entity extraction results
       – textual references to those entities
       – document metadata
     • Authored by Palantir, but it’s an open format
       – Not inherently tied to Palantir
       – Contains some optional features to ease integration with Palantir
       – Not tied to a single extractor; multiple vendors already support it
     • Designed to be a simple-to-author import format
       – XSLT friendly
       – Used existing entity extractor formats as design guides
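The container idea above can be sketched with Python's standard library. Every element and attribute name in this example (`docSketch`, `metadata`, `field`, `rawText`, `entities`, `entity`, `reference`, `offset`, `length`) and the extractor name are assumptions chosen to mirror the bullet list; the real names are defined by the DocXML XSD, which this transcript does not reproduce.

```python
import xml.etree.ElementTree as ET

raw_text = (
    "Eric Poirier called us and told us that the presentation "
    "at Cornell went very well."
)

# All names below are hypothetical and only mirror the slide's bullet list.
doc = ET.Element("docSketch")

# Document metadata.
metadata = ET.SubElement(doc, "metadata")
ET.SubElement(metadata, "field", {"name": "contributors"}).text = (
    "Ari Gordon-Schlosberg, Kevin Simler, John Carrino"
)

# Raw text of the document.
ET.SubElement(doc, "rawText").text = raw_text

# Entity extraction results, each with a textual reference into the raw text.
entities = ET.SubElement(doc, "entities", {"extractor": "exampleExtractor"})
for entity_type, mention in (("PERSON", "Eric Poirier"), ("ORGANIZATION", "Cornell")):
    entity = ET.SubElement(entities, "entity", {"type": entity_type})
    ET.SubElement(
        entity,
        "reference",
        {"offset": str(raw_text.index(mention)), "length": str(len(mention))},
    )

docxml_text = ET.tostring(doc, encoding="unicode")
print(docxml_text)
```

Storing references as offsets into the raw text, rather than copying the matched strings, keeps the extraction output tied to the exact place it came from.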
  5. 5. Object-Model Refresher
  6. 6. Example Text Document Contributors: Ari Gordon-Schlosberg, Kevin Simler, John Carrino We're currently stuck in Atlanta, waiting for our flight to IL. We learned that our display case is 83 lineal inches, 3 inches longer than we're supposed to be able to fly with, but let us go this time. (I wonder if this is just a Delta thing?) Eric Poirier called us and told us that the presentation at Cornell went very well, which gives us high hopes for tomorrow's presentation at UIUC. John and I are excited to get back home for a visit and I've been contacting professors to look for students that we should target for recruiting. Things are going well. Sincerely, Your field team: Kevin, John, and Ari.
  7. 7. Imported Into Palantir
  8. 8. A Simple Example
  9. 9. A Simple Example
  10. 10. Keep In Mind…
      • We’ll be covering:
        – Details of these two formats
        – Explanations of where to use them
        – Some simple examples
      • Examples have been edited for brevity and clarity
        – Covering important features
        – Reference manuals and XSDs are the full references
        – Some elements are abbreviated as <element/> where details are not relevant; more detail may be required there
  11. 11. PALANTIR XML pXML
  12. 12. pXML: Where To Use It
      • To import structured data that doesn’t import easily
        – Data from a database where objects span tables
        – Objects assembled from multiple DataSources
        – Other “exotic” data sources
      • To export data from Palantir
        – Other analytic tools
        – Other data platforms
        – Other Palantir instances
  13. 13. pXML And The Object Model
      • pXML is strongly coupled to the object model
        – Data sources
        – Objects
        – Properties
        – Notes
        – Media
        – Links
        – Data source records
  14. 14. pXML And The Object Model
      • pXML elements come directly from the object model
        – Data sources: <dataSource/>
        – Objects: <object/>
        – Properties: <property/>
        – Notes: <note/>
        – Media: <media/>
        – Links: <link/>
        – Data source records: <dataSourceRecord/>
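The element names listed on this slide can be assembled into a small tree with Python's standard library. This is only a sketch: the element names come from the slide, but the root element (`pxmlSketch`), the nesting, and every attribute (`id`, `name`, `type`, `dataSource`) are assumptions made for illustration. The real structure is defined by the pXML XSD.

```python
import xml.etree.ElementTree as ET

# Element names come from slide 14; the root name, nesting, and all
# attributes are hypothetical and are NOT taken from the pXML schema.
root = ET.Element("pxmlSketch")

# A data source standing in for the example text document.
ET.SubElement(root, "dataSource", {"id": "ds1", "name": "field-report.txt"})

# One object with a property and a note.
person = ET.SubElement(root, "object", {"id": "obj1", "type": "Person"})
name = ET.SubElement(person, "property", {"type": "Name"})
name.text = "Eric Poirier"

# A data source record tying the property back to its source.
ET.SubElement(name, "dataSourceRecord", {"dataSource": "ds1"})

note = ET.SubElement(person, "note")
note.text = "Called the field team about the Cornell presentation."

pxml_text = ET.tostring(root, encoding="unicode")
print(pxml_text)
```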
  15. 15. pXML Document Structure
  16. 16. Document/Data Source Duality
      • Data sources represent real-world sources of data
        – Do not contain data
        – A collection of references
      • Palantir document objects contain real-world data
      • Primary object connects a data source to the object holding its data
      • Used by data sources representing unstructured data
        – Documents
        – Emails
        – Other sources of unstructured text
  17. 17. Data Sources
  18. 18. Object
  19. 19. Property
  20. 20. Property Values
      Three types of property values are supported in pXML:
      • Simple
        – Used for single, unparsed values
        – e.g. Nationality, Organization Name
      • Composite
        – Used for values composed of discrete, semantic units
        – e.g. Name (first & last), Address (city, state, zip, etc.)
      • Raw
        – Convenience format
        – Keeps pXML simple and allows the parsers to do the work
        – Allows ontology to change around existing pXML generators
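The three kinds of value can be contrasted in a short Python sketch. The element names (`simpleValue`, `compositeValue`, `component`, `rawValue`) and the `parse_raw_name` helper are hypothetical, invented here to show the idea; they are not taken from the pXML schema. The toy parser stands in for the import-side parsers that a raw value defers to.

```python
import xml.etree.ElementTree as ET


def simple_value(text: str) -> ET.Element:
    """A single, unparsed value such as an organization name."""
    value = ET.Element("simpleValue")  # hypothetical element name
    value.text = text
    return value


def composite_value(components: dict[str, str]) -> ET.Element:
    """A value made of discrete, named parts such as first and last name."""
    value = ET.Element("compositeValue")  # hypothetical element name
    for name, text in components.items():
        ET.SubElement(value, "component", {"name": name}).text = text
    return value


def raw_value(text: str) -> ET.Element:
    """An unparsed string that the importing side is expected to parse."""
    value = ET.Element("rawValue")  # hypothetical element name
    value.text = text
    return value


def parse_raw_name(raw: ET.Element) -> ET.Element:
    """Toy stand-in for an import-side parser: split on the first space."""
    first, _, last = raw.text.partition(" ")
    return composite_value({"first": first, "last": last})


organization = simple_value("Cornell")
name = composite_value({"first": "Eric", "last": "Poirier"})
parsed_name = parse_raw_name(raw_value("Eric Poirier"))

print(ET.tostring(name, encoding="unicode"))
print(ET.tostring(parsed_name, encoding="unicode"))
```

With the raw form, the generator only emits a string, so a change in how the ontology splits a name affects the parser and not the code that produced the pXML.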
  21. 21. Simple Property Value
  22. 22. Composite Property Value
  23. 23. Raw Property Value
  24. 24. Media
  25. 25. Notes
  26. 26. DataSourceRecords
      • Data source records (DSRs) tie data to their source
      • Apply to all pieces of data
        – Properties
        – Notes
        – Media
        – Links
      • Have two modes
        – Import keys tie data to a record in a structured data source, e.g. by line number, primary key, or index
        – String position locators mark references in unstructured text using character offsets and lengths
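The two DSR modes can be modeled as two small record types. This is a sketch under stated assumptions: the class names, field names, and file names are invented, and offsets here count Python string positions (Unicode code points), which may not match the unit the pXML schema uses.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class ImportKey:
    """Ties a piece of data to a record in a structured source."""
    data_source: str
    key: str  # e.g. a line number or a primary key


@dataclass(frozen=True)
class StringPosition:
    """Marks a reference in unstructured text by offset and length."""
    data_source: str
    offset: int
    length: int


# A data source record is one of the two modes.
DataSourceRecord = Union[ImportKey, StringPosition]


def locate(data_source: str, text: str, mention: str) -> list[StringPosition]:
    """Return one StringPosition per non-overlapping occurrence of mention."""
    found = []
    start = text.find(mention)
    while start != -1:
        found.append(StringPosition(data_source, start, len(mention)))
        start = text.find(mention, start + len(mention))
    return found


report = "John and I are excited to get back home. Your field team: Kevin, John, and Ari."
john_records = locate("field-report.txt", report, "John")
row_record = ImportKey("employees.csv", "line 42")

for record in john_records:
    print(record, repr(report[record.offset:record.offset + record.length]))
```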
  27. 27. DataSourceRecords
  28. 28. Links
      • Links represent a connection between two objects
      • All links are directed in Palantir
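Because every link is directed, swapping its two ends gives a different link. A minimal Python sketch of that property follows; the `Link` class, its field names, and the ids and link type are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """A directed link between two objects, identified by hypothetical ids."""
    source_id: str
    target_id: str
    link_type: str

    def flipped(self) -> "Link":
        """Return the link pointing the other way."""
        return Link(self.target_id, self.source_id, self.link_type)


employee_of = Link("person-1", "org-1", "EmployeeOf")

# Direction matters: the flipped link is not equal to the original.
print(employee_of == employee_of.flipped())
```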
  29. 29. PALANTIR DOCUMENT XML DocXML
  30. 30. PalantirDocXML: An Introduction
      • Authored by Palantir, but it’s an open format
        – Not inherently tied to Palantir
        – Contains some optional features to ease integration with Palantir
      • Support for multiple entity extractors per document
        – Object data is designed to be an easy transform target from popular extractors
        – Can hold the original output from the entity extractors
      • Allows ontologies to change over time
        – Architected to use pluggable type-mappings
        – Compatible with multiple Palantir instances
        – Never need to rebuild a DocXML document
  31. 31. Advanced Features
      • Advanced character set handling
        – Stores document originals in original character set
        – Careful UTF-8 encapsulation supports all human languages
      • Support for flexible document metadata
        – Captures arbitrary organizational or handling metadata
      • Easy to understand and transform into other formats
        – XSLT friendly by design
        – Can hold extractor configuration as well as output
        – Cross-data-platform format for extracted documents
        – Intermediate format for multi-step extraction
        – Single interface for ingestion of extracted documents
        – Completely Palantir agnostic
  32. 32. DocXML Document Structure
  33. 33. Document Metadata
  34. 34. Document Metadata Example
  35. 35. Object Data
  36. 36. Extraction Metadata
  37. 37. Extraction Metadata Example
  38. 38. Object
  39. 39. Example Object
  40. 40. Relationship
  41. 41. Type Mapping
      • DocXML documents are not tied to an ontology
        – Single document can be ingested into different ontologies
        – Changes in an ontology do not require re-extraction or changes to the extractor, just an edit of the type mapping
        – Each document can use multiple mappings
      • Mappings map extractor types and document properties
        – Separate mapping for each supported extractor
        – Document properties map into properties on the Palantir Document object
      • Centrally-managed resource for each enterprise
        – Analysts don’t write type mappings, architects do
        – Imports seamlessly “just work”
        – Everyone uses a consistent mapping
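The decoupling described above can be sketched as a lookup table kept apart from the extracted data. Everything in this example is an assumption for illustration: the extractor names, the type labels, the ontology type names, and the dictionary layout are invented and do not reflect the actual type-mapping format.

```python
# Hypothetical extractor output: extractor names, type labels, and field
# names are invented for this sketch.
extracted = [
    {"extractor": "extractorA", "type": "PER", "text": "Eric Poirier"},
    {"extractor": "extractorA", "type": "ORG", "text": "Cornell"},
    {"extractor": "extractorB", "type": "city", "text": "Atlanta"},
]

# One mapping per supported extractor: extractor type -> ontology object type.
mapping_v1 = {
    "extractorA": {"PER": "Person", "ORG": "Organization"},
    "extractorB": {"city": "Location"},
}

# After a (hypothetical) ontology change, only the mapping is edited;
# the extracted data above is left untouched.
mapping_v2 = {
    "extractorA": {"PER": "Person", "ORG": "University"},
    "extractorB": {"city": "Location"},
}


def apply_mapping(entities, mapping):
    """Return (ontology_type, text) pairs; unmapped extractor types are skipped."""
    mapped = []
    for entity in entities:
        ontology_type = mapping.get(entity["extractor"], {}).get(entity["type"])
        if ontology_type is not None:
            mapped.append((ontology_type, entity["text"]))
    return mapped


print(apply_mapping(extracted, mapping_v1))
print(apply_mapping(extracted, mapping_v2))
```

The same `extracted` list is ingested under both mappings, which is the point of the slide: an ontology change means editing the mapping, not re-running the extractor.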
  42. 42. Type Mapping Overview
  43. 43. Document Properties
  44. 44. Extractor Type Mappings
  45. 45. Extractor Type Mappings Example
  46. 46. Final Thoughts
      • This presentation is an overview
        – Both pXML and DocXML have features not covered here
      • The XSD files are the canonical reference
        – Full syntax and rules are covered there
        – Consult the reference manual for usage and in-depth explanations
      • Living standards
        – Backwards compatible
        – May add new features to support customer needs
      • See our blog for tips and techniques on XML processing
        – http://blog.palantirtech.com/
