Palantir XML Formats
PalantirXML (pXML) & PalantirDocXML (DocXML)

Ari Gordon-Schlosberg
Senior Software Engineer


© 2008...
Palantir XML Formats

 Written in XML Schema Definition (XSD) language
   – W3C standard
   – Widely accepted
 Allows...
PalantirXML: An Introduction

 A rendering of a Palantir object graph into XML
   – Encodes nearly all features in our lo...
PalantirDocXML: An Introduction

 Container for textual docs and entity extraction output
   – raw text
   – source docum...
Object-Model Refresher
Example Text Document
Contributors: Ari Gordon-Schlosberg, Kevin Simler, John Carrino

We're currently stuck in Atlanta, w...
Imported Into Palantir
A Simple Example
A Simple Example
Keep In Mind…

 We’ll be covering:
  – Details of these two formats
  – Explanations of where to use them
  – Some simple...
PALANTIR XML
pXML
pXML: Where To Use It

 To import structured data that doesn’t import easily
   – Data from a database where objects span...
pXML And The Object Model

 pXML is strongly coupled to the object model
   – Data sources
   – Objects
   – Properties
 ...
pXML And The Object Model

 pXML elements come directly from the object model
   – Data sources <dataSource/>
   – Object...
pXML Document Structure
Document/Data Source Duality

 Data sources represent real-world sources of data
   – do not contain data
   – a collecti...
Data Sources
Object
Property
Property Values

Three types of property values are supported in pXML:
    – Simple
        • Used for single, unparsed va...
Simple Property Value
Composite Property Value
Raw Property Value
Media
Notes
DataSourceRecords

 Data source records (DSRs) tie data to their source
 Apply to all pieces of data
   – Properties
   ...
DataSourceRecords
Links

 Links represent a link between two objects
 All links are directed in Palantir
PALANTIR DOCUMENT XML
DocXML
PalantirDocXML: An Introduction

 Authored by Palantir, but it’s an open format
   – Not inherently tied to Palantir
   – C...
Advanced Features

 Advanced character set handling
   – Stores document originals in original character set
   – Careful...
DocXML Document Structure
Document Metadata
Document Metadata Example
Object Data
Extraction Metadata
Extraction Metadata Example
Object
Example Object
Relationship
Type Mapping

 DocXML documents are not tied to an ontology
   – Single document can be ingested into different ontologie...
Type Mapping Overview
Document Properties
Extractor Type Mappings
Extractor Type Mappings Example
Final Thoughts


 This presentation is an overview
   – Both pXML and DocXML have features not covered here
 The XSD fil...
Palantir XML Formats
PalantirXML (pXML) & PalantirDocXML (DocXML)

Ari Gordon-Schlosberg
Senior Software Engineer


© 2008...
  • Drop the ACLs; rewrite with more trivia; better transitions; less humble; “the other 10%”; live examples
  • Insert a “Why XSD?”
  • Go to external application after this.
  • Put in a transition slide before this
  • Use as jumping off to examples; explain not to explain ACLs; explain that the palantir element allows for easy embedding; note that graph is for object graph, not visual graph
  • Mention that titles are actually properties, convenience feature
  • Drop custom keyword
  • Insert media overview slide; includes biometric data
  • This slide sucks.
  • Explain IDREF or rewrite it
  • Put in a summary slide to end the previous
  • Switch last point to lead with sub-points
  • Use it without Palantir (final point); document properties; character set handling; flexible type mapping
  • … contents of the objectSet element
  • Same as attribute used in objectSetMetaData (slide …)
  • Need example slide
  • Need example slide
  • Takeaways: pXML open, scalable, etc.
  • Drop the ACLs; rewrite with more trivia; better transitions; less humble; “the other 10%”; live examples
  • Palantir XML Formats

    1. Palantir XML Formats
       PalantirXML (pXML) & PalantirDocXML (DocXML)
       Ari Gordon-Schlosberg, Senior Software Engineer
       © 2008 Palantir Technologies Inc. All rights reserved.
    2. Palantir XML Formats
        Written in XML Schema Definition (XSD) language
          – W3C standard
          – Widely accepted
        Allows developers to leverage existing XML tools
          – Editing
          – Verification
          – Transformation (XSLT) friendly
        Designed to be simple & human-readable
          – Follows Palantir design principles
          – Meant to make life easier for developers to code, debug, learn
    3. PalantirXML: An Introduction
        A rendering of a Palantir object graph into XML
          – Encodes nearly all features in our lowest-level data model
          – “Close to the metal”
        Used as open import format
          – Makes Palantir integration-friendly and a truly open platform
          – Federated Search on-the-fly import uses it internally
          – Super efficient storage format
        Used for export/interchange
          – Allows organizations to pull knowledge out of Palantir
          – Can be transformed using XSLT to other XML formats
    4. PalantirDocXML: An Introduction
        Container for textual docs and entity extraction output
          – raw text
          – source document
          – entity extraction results
          – textual references to those entities
          – document metadata
        Authored by Palantir, but it’s an open format
          – Not inherently tied to Palantir
          – Contains some optional features to ease integration with Palantir
          – Not tied to a single extractor, multiple vendors already support it
        Designed to be simple-to-author import format
          – XSLT friendly
          – Used existing entity extractor formats as design guides
    5. Object-Model Refresher
    6. Example Text Document
       Contributors: Ari Gordon-Schlosberg, Kevin Simler, John Carrino
       We're currently stuck in Atlanta, waiting for our flight to IL. We learned that our display case is 83 lineal inches, 3 inches longer than we're supposed to be able to fly with, but let us go this time. (I wonder if this is just a Delta thing?) Eric Poirier called us and told us that the presentation at Cornell went very well, which gives us high hopes for tomorrow's presentation at UIUC. John and I are excited to get back home for a visit and I've been contacting professors to look for students that we should target for recruiting. Things are going well.
       Sincerely, Your field team: Kevin, John, and Ari.
    7. Imported Into Palantir
    8. A Simple Example
    9. A Simple Example
    10. Keep In Mind…
         We’ll be covering:
           – Details of these two formats
           – Explanations of where to use them
           – Some simple examples
         Examples have been edited for brevity and clarity
           – Covering important features
           – Reference manuals and XSDs are the full references
           – Some elements are abbreviated as <element/> where details are not relevant; more detail may be required there
    11. PALANTIR XML
        pXML
    12. pXML: Where To Use It
         To import structured data that doesn’t import easily
           – Data from a database where objects span tables
           – Objects assembled from multiple DataSources
           – Other “exotic” data sources
         To export data from Palantir
           – Other analytic tools
           – Other data platforms
           – Other Palantir instances
    13. pXML And The Object Model
         pXML is strongly coupled to the object model
           – Data sources
           – Objects
           – Properties
           – Notes
           – Media
           – Links
           – Data source records
    14. pXML And The Object Model
         pXML elements come directly from the object model
           – Data sources <dataSource/>
           – Objects <object/>
           – Properties <property/>
           – Notes <note/>
           – Media <media/>
           – Links <link/>
           – Data source records <dataSourceRecord/>
    15. pXML Document Structure
    16. Document/Data Source Duality
         Data sources represent real-world sources of data
           – do not contain data
           – a collection of references
         Palantir document objects contain real-world data
         Primary object connects a data source to the object holding its data
         Used by data sources representing unstructured data
           – Documents
           – Emails
           – Other sources of unstructured text
    17. Data Sources
    18. Object
    19. Property
    20. Property Values
        Three types of property values are supported in pXML:
            – Simple
                • Used for single, unparsed values
                • e.g. Nationality, Organization Name
            – Composite
                • Used for values composed of discrete, semantic units
                • e.g. Name (first & last), Address (city, state, zip, etc.)
            – Raw
                • Convenience format
                • Keeps pXML simple and allows the parsers to do the work
                • Allows ontology to change around existing pXML generators
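
The three kinds of property value on the Property Values slide can be pictured with a small Python sketch. Everything named here (the three classes, the `components` field, and the naive `parse_raw_name` parser) is an illustrative assumption and not a pXML element name or a Palantir API; the XSD remains the reference for the real syntax.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SimpleValue:
    """A single, unparsed value, e.g. a nationality."""
    text: str


@dataclass
class CompositeValue:
    """A value made of discrete, named semantic units, e.g. a name."""
    components: Dict[str, str]


@dataclass
class RawValue:
    """An unparsed string that the importing side is expected to parse."""
    text: str


def parse_raw_name(raw: RawValue) -> CompositeValue:
    # Hypothetical parser: split "First Last" on the final space.
    first, _, last = raw.text.strip().rpartition(" ")
    return CompositeValue({"first": first, "last": last})


nationality = SimpleValue("Canadian")
name = parse_raw_name(RawValue("Ari Gordon-Schlosberg"))
print(nationality.text, name.components)
```

The point of the raw form in this sketch is that the generator only emits a string; how that string is split into components can change on the parsing side without touching the generator.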
    21. Simple Property Value
    22. Composite Property Value
    23. Raw Property Value
    24. Media
    25. Notes
    26. DataSourceRecords
         Data source records (DSRs) tie data to their source
         Apply to all pieces of data
           – Properties
           – Notes
           – Media
           – Links
         Have two modes
           – Import keys tie data to a record's primary key or index in a structured data source, e.g. a line number or primary key
           – String position locators mark references in unstructured text using character offsets and lengths
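
The string position locator mode (a character offset plus a length into unstructured text) can be sketched in Python as follows. The `StringPositionLocator` class and the `locate` helper are hypothetical names used for illustration only; they do not reflect the actual pXML representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringPositionLocator:
    """Hypothetical locator: a character offset and length into raw text."""
    offset: int
    length: int

    def resolve(self, text: str) -> str:
        # Recover the referenced span from the source text.
        return text[self.offset:self.offset + self.length]


def locate(text: str, mention: str) -> StringPositionLocator:
    """Build a locator for the first occurrence of `mention` in `text`."""
    offset = text.index(mention)  # raises ValueError if the mention is absent
    return StringPositionLocator(offset, len(mention))


document = (
    "Eric Poirier called us and told us that the presentation "
    "at Cornell went very well."
)
locator = locate(document, "Cornell")
print(locator, locator.resolve(document))
```

Storing an offset and length rather than the matched string keeps the reference tied to one exact position in the source, which matters when the same name appears more than once in a document.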
    27. DataSourceRecords
    28. Links
         Links represent a link between two objects
         All links are directed in Palantir
    29. PALANTIR DOCUMENT XML
        DocXML
    30. PalantirDocXML: An Introduction
         Authored by Palantir, but it’s an open format
           – Not inherently tied to Palantir
           – Contains some optional features to ease integration with Palantir
         Support for multiple entity extractors per document
           – Object data is designed to be an easy transform target from popular extractors
           – Can hold the original output from the entity extractors
         Allows ontologies to change over time
           – Architected to use pluggable type-mappings
           – Compatible with multiple Palantir instances
           – Never need to rebuild a DocXML document
    31. Advanced Features
         Advanced character set handling
           – Stores document originals in original character set
           – Careful UTF-8 encapsulation supports all human languages
         Support for flexible document metadata
           – Captures arbitrary organizational or handling metadata
         Easy to understand and transform into other formats
           – XSLT friendly by design
           – Can hold extractor configuration as well as output
           – Cross-data-platform format for extracted documents
           – Intermediate format for multi-step extraction
           – Single interface for ingestion of extracted documents
           – Completely Palantir agnostic
    32. DocXML Document Structure
    33. Document Metadata
    34. Document Metadata Example
    35. Object Data
    36. Extraction Metadata
    37. Extraction Metadata Example
    38. Object
    39. Example Object
    40. Relationship
    41. Type Mapping
         DocXML documents are not tied to an ontology
           – Single document can be ingested into different ontologies
           – A change in an ontology does not require re-extraction or changes to the extractor, just an edit of the type mapping
           – Each document can use multiple mappings
         Mappings map extractor types and document properties
           – Separate mapping for each supported extractor
           – Document properties map into properties on the Palantir Document object
         Centrally-managed resource for each enterprise
           – Analysts don’t write type mappings, architects do
           – Imports seamlessly “just work”
           – Everyone uses a consistent mapping
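
The idea of a separate type mapping per extractor can be sketched as a plain lookup table in Python. The extractor names, extractor type labels, and ontology type names below are all made up for illustration; the real mapping format is defined by the DocXML XSD and reference manual.

```python
from typing import Dict, Optional

# Hypothetical mappings: extractor name -> {extractor type -> ontology type}.
TYPE_MAPPINGS: Dict[str, Dict[str, str]] = {
    "extractor_a": {
        "PERSON": "com.example.Person",
        "ORG": "com.example.Organization",
    },
    "extractor_b": {
        "per": "com.example.Person",
        "loc": "com.example.Location",
    },
}


def map_type(extractor: str, extractor_type: str) -> Optional[str]:
    """Return the ontology type for an extractor's type, or None if unmapped."""
    return TYPE_MAPPINGS.get(extractor, {}).get(extractor_type)


print(map_type("extractor_a", "PERSON"))
print(map_type("extractor_b", "org"))
```

Because the extracted document only carries the extractor's own type labels, an ontology change in this sketch means editing `TYPE_MAPPINGS`; the document and the extractor stay untouched.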
    42. Type Mapping Overview
    43. Document Properties
    44. Extractor Type Mappings
    45. Extractor Type Mappings Example
    46. Final Thoughts
         This presentation is an overview
           – Both pXML and DocXML have features not covered here
         The XSD files are the canonical reference
           – Full syntax and rules are covered there
           – Consult reference manual for usage and in-depth explanations
         Living Standards
           – Backwards compatible
           – May add new features to support customer needs
         See our blog for tips and techniques on XML processing
           – http://blog.palantirtech.com/
    47. Palantir XML Formats
        PalantirXML (pXML) & PalantirDocXML (DocXML)
        Ari Gordon-Schlosberg, Senior Software Engineer
        © 2008 Palantir Technologies Inc. All rights reserved.