Let it Flow:
Government Documents
PDF (or PDF + XML) to reflowable EPUB
Open Book Hackathon, NYPL
January 11-13, 2014
Proposed Tasks: Legal ePUB
What would users want?
● Remove hard line breaks for continuous, reflowable display in multiple formats/modes.
● Parse metadata from header information, pagination and other semantic content.
● Parse citation of previous opinions or laws and link to cited legislation.
● Create stylesheet and page markup for display on devices.
It is important to:
● Ensure that data is formatted so that it flows naturally regardless of device used.
● Maintain information about pagination for citation and discovery purposes.
We only had time for one thing.
Supreme Court Slip Opinions
● Supreme Court slip opinion documents are
available in PDF format.
● Dockets may contain multiple document
types:
○ Syllabus
○ Opinion of the Court
○ Per Curiam
○ Dissent
Metadata: SCOTUS Slip Opinions
Citation
Document Type
Notice
Presiding Court
Docket Number
Case Names/Parties to the Case
Date Decided
Justice
Opinion
Front matter for each document contains several centered blocks of content. At first glance, the
documents contain minimal structural information.
XML Format
<text top="171" left="676" width="15" height="16" font="0">1 </text>
<text top="174" left="234" width="71" height="13" font="1">(Slip Opinion) </text>
<text top="171" left="380" width="166" height="16" font="0">OCTOBER TERM, 2013 </text>
<text top="206" left="432" width="57" height="16" font="0">Syllabus </text>
<text top="237" left="284" width="362" height="13" font="1">NOTE: Where it is feasible, a syllabus (headnote) will be released, as is</text>
<text top="247" left="272" width="374" height="13" font="1">being done in connection with this case, at the time the opinion is issued.</text>
<text top="258" left="272" width="374" height="13" font="1">The syllabus constitutes no part of the opinion of the Court but has been</text>
<text top="269" left="272" width="380" height="13" font="1">prepared by the Reporter of Decisions for the convenience of the reader. </text>
<text top="280" left="272" width="343" height="13" font="1">See <i>United States</i> v. <i>Detroit Timber &amp; Lumber Co.,</i> 200 U. S. 321, 337. </text>
<text top="307" left="235" width="453" height="20" font="4"><b>SUPREME COURT OF THE UNITED STATES </b></text>
<text top="354" left="432" width="57" height="16" font="0">Syllabus </text>
<text top="390" left="370" width="183" height="20" font="3">KANSAS <i>v</i>. CHEEVER </text>
<text top="426" left="279" width="363" height="16" font="0">CERTIORARI TO THE SUPREME COURT OF KANSAS </text>
<text top="455" left="245" width="432" height="16" font="0">No. 12–609. Argued October 16, 2013—Decided December 11, 2013 </text>
We used Poppler to convert Supreme Court slip opinion documents (PDF) to XML. This
revealed position and size attributes for lines of text, but no structured semantic information.
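A minimal sketch of reading this XML, assuming it is loaded as a string. It uses Python's standard xml.etree.ElementTree; the function name read_lines and the dictionary keys are illustrative choices, not part of the hackathon work. The sample is a two-line excerpt of the output shown above.

```python
import xml.etree.ElementTree as ET

# Two-line excerpt of the Poppler XML shown on the slide.
SAMPLE = """<pdf2xml producer="poppler" version="0.24.4">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="174" left="234" width="71" height="13" font="1">(Slip Opinion) </text>
<text top="390" left="370" width="183" height="20" font="3">KANSAS <i>v</i>. CHEEVER </text>
</page>
</pdf2xml>"""


def read_lines(xml_string):
    """Return one dict per <text> element with its position attributes."""
    root = ET.fromstring(xml_string)
    lines = []
    for page in root.iter("page"):
        for text in page.iter("text"):
            lines.append({
                "page": int(page.get("number")),
                "top": int(text.get("top")),
                "left": int(text.get("left")),
                "height": int(text.get("height")),
                "font": text.get("font"),
                # itertext() flattens inline <i>/<b> markup into plain text
                "content": "".join(text.itertext()).strip(),
            })
    return lines


lines = read_lines(SAMPLE)
for line in lines:
    print(line["left"], line["height"], line["content"])
```

Each line keeps only geometry and a font ID, which is why the later slides look for patterns in those attributes.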
XML Format: Font ID
<text top="171" left="676" width="15" height="16" font="0">1 </text>
<text top="174" left="234" width="71" height="13" font="1">(Slip Opinion) </text>
<text top="171" left="380" width="166" height="16" font="0">OCTOBER TERM, 2013 </text>
<text top="206" left="432" width="57" height="16" font="0">Syllabus </text>
<text top="237" left="284" width="362" height="13" font="1">NOTE: Where it is feasible, a syllabus (headnote) will be released, as is</text>
<text top="247" left="272" width="374" height="13" font="1">being done in connection with this case, at the time the opinion is issued.</text>
<text top="258" left="272" width="374" height="13" font="1">The syllabus constitutes no part of the opinion of the Court but has been</text>
<text top="269" left="272" width="380" height="13" font="1">prepared by the Reporter of Decisions for the convenience of the reader. </text>
<text top="280" left="272" width="343" height="13" font="1">See <i>United States</i> v. <i>Detroit Timber &amp; Lumber Co.,</i> 200 U. S. 321, 337. </text>
<text top="307" left="235" width="453" height="20" font="4"><b>SUPREME COURT OF THE UNITED STATES </b></text>
<text top="354" left="432" width="57" height="16" font="0">Syllabus </text>
<text top="390" left="370" width="183" height="20" font="3">KANSAS <i>v</i>. CHEEVER </text>
<text top="426" left="279" width="363" height="16" font="0">CERTIORARI TO THE SUPREME COURT OF KANSAS </text>
<text top="455" left="245" width="432" height="16" font="0">No. 12–609. Argued October 16, 2013—Decided December 11, 2013 </text>
We may be able to use position and size attributes to identify content parts of the document. For
example, different parts of the document have varying positions, font sizes and line heights.
In this document, the font attribute is 1 for the Note section, 3 for the names of the parties to the
case and 4 for the name of the presiding court.
XML Format: <page> attributes
Font sizes are defined by <fontspec> elements inside the <page> elements of the XML document. In this example, the title
line "Supreme Court of the United States" has height="20" and font="4", which corresponds to
fontspec id="4", defined as Times, size 20, black (#000000).
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="0.24.4">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="11" family="Times" color="#000000"/>
<fontspec id="1" size="8" family="Times" color="#000000"/>
<fontspec id="2" size="8" family="Times" color="#000000"/>
<fontspec id="3" size="14" family="Times" color="#000000"/>
<fontspec id="4" size="20" family="Times" color="#000000"/>
<fontspec id="5" size="14" family="Times" color="#000000"/>
<fontspec id="6" size="11" family="Times" color="#000000"/>
...
<text top="307" left="235" width="453" height="20" font="4"><b>SUPREME COURT OF THE UNITED STATES </b></text>
Font ID 4 is Times 20pt
XML Format: <page> attributes
Font specifications are defined within the page element. If a subsequent page requires new font
IDs, they are added after the opening <page> tag of that page.
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="11" family="Times" color="#000000"/>
<fontspec id="1" size="8" family="Times" color="#000000"/>
<fontspec id="2" size="8" family="Times" color="#000000"/>
<fontspec id="3" size="14" family="Times" color="#000000"/>
<fontspec id="4" size="20" family="Times" color="#000000"/>
<fontspec id="5" size="14" family="Times" color="#000000"/>
<fontspec id="6" size="11" family="Times" color="#000000"/>
<page number="2" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="7" size="11" family="Times" color="#ff0000"/>
<page number="3" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="8" size="7" family="Times" color="#000000"/>
<page number="5" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="9" size="4" family="Times" color="#000000"/>
...
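Because font IDs are declared on whichever page first uses them, a lookup table has to be collected across the whole document before a font attribute can be resolved. The sketch below shows one way to do that. The function name collect_fontspecs is hypothetical, and the page-2 text line in the sample is an assumed illustration, not copied from a real opinion.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<pdf2xml producer="poppler" version="0.24.4">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="11" family="Times" color="#000000"/>
<fontspec id="4" size="20" family="Times" color="#000000"/>
<text top="307" left="235" width="453" height="20" font="4"><b>SUPREME COURT OF THE UNITED STATES </b></text>
</page>
<page number="2" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="7" size="11" family="Times" color="#ff0000"/>
<!-- assumed line: page 2 reuses font 0, which was declared on page 1 -->
<text top="171" left="380" width="166" height="16" font="0">OCTOBER TERM, 2013 </text>
</page>
</pdf2xml>"""


def collect_fontspecs(root):
    """Map every fontspec id in the document to its size, family and color."""
    fonts = {}
    for spec in root.iter("fontspec"):
        fonts[spec.get("id")] = {
            "size": int(spec.get("size")),
            "family": spec.get("family"),
            "color": spec.get("color"),
        }
    return fonts


root = ET.fromstring(SAMPLE)
fonts = collect_fontspecs(root)

# Resolve the font of the court-name line on page 1.
court = root.find("./page/text[@font='4']")
court_font = fonts[court.get("font")]
print(court_font["family"], court_font["size"])

# A page-2 line can refer to a font declared on page 1.
page2_line = root.findall("./page")[1].find("text")
page2_font = fonts[page2_line.get("font")]
```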
XML Format: Inconsistencies
Font sizes for specific id attributes are not consistent across all documents, BUT...
Docket 12-609
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="11" family="Times" color="#000000"/>
<fontspec id="1" size="8" family="Times" color="#000000"/>
<fontspec id="2" size="8" family="Times" color="#000000"/>
<fontspec id="3" size="14" family="Times" color="#000000"/>
<fontspec id="4" size="20" family="Times" color="#000000"/>
<fontspec id="5" size="14" family="Times" color="#000000"/>
<fontspec id="6" size="11" family="Times" color="#000000"/>
In this document, "Supreme Court of the United States" has font id 4.
Docket 12-729
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="11" family="Times" color="#000000"/>
<fontspec id="1" size="8" family="Times" color="#000000"/>
<fontspec id="2" size="8" family="Times" color="#000000"/>
<fontspec id="3" size="14" family="Times" color="#000000"/>
<fontspec id="4" size="11" family="Times" color="#000000"/>
<fontspec id="5" size="20" family="Times" color="#000000"/>
<fontspec id="6" size="14" family="Times" color="#000000"/>
<fontspec id="7" size="14" family="Times" color="#000000"/>
In this document, "Supreme Court of the United States" has font id 5.
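One way around the shifting IDs, sketched here as an assumption rather than as what the team built, is to look fonts up by size instead of by ID. In both dockets the court name is set in the only 20pt font, so the size finds it even though the ID differs. The dictionaries below copy the id and size pairs from the two fontspec lists above; font_ids_with_size is a hypothetical helper.

```python
# Font id -> size, copied from the fontspec lists for the two dockets.
DOCKET_12_609 = {"0": 11, "1": 8, "2": 8, "3": 14, "4": 20, "5": 14, "6": 11}
DOCKET_12_729 = {"0": 11, "1": 8, "2": 8, "3": 14, "4": 11, "5": 20, "6": 14, "7": 14}


def font_ids_with_size(font_sizes, size):
    """Return every font id in one document that has the given point size."""
    return [font_id for font_id, font_size in font_sizes.items() if font_size == size]


court_ids_609 = font_ids_with_size(DOCKET_12_609, 20)
court_ids_729 = font_ids_with_size(DOCKET_12_729, 20)
print(court_ids_609, court_ids_729)
```

Size alone is ambiguous for body text (several IDs share 11pt and 14pt), so it would need to be combined with position attributes.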
XML Format: Commonalities
...we noticed that the position attributes for many content parts, such as left margin and line
height, were consistent across all documents. These content parts are listed below.
Content Parts with consistent text attribute values:
● Page elements have height="1188" and width="918" on all documents.
● Decision date has the same attributes, left="395" and height="16", on all documents.
● “Cite as” is always top="171" left="366" width="189" height="16" font="0"
● Type of Opinion, e.g. “(Slip Opinion)”, is always top="174" left="234" font="1"
● “Opinion of the Court” is always top="171" left="366" width="189" height="16" font="0"
● “Note” and “Notices” are always left="272" height="13" font="1"
○ Exception: sometimes the first line is indented at left="284"
● The last line of a Note or Notice is always a reference to another court case, which always has top="280" left="272" width="343" height="13" font="1"
● The name of the case (usually [partyA] v. [partyB]) is always height="20" font="3".
XML Format: left & height attributes
<text top="171" left="676" width="15" height="16" font="0">1 </text>
<text top="174" left="234" width="71" height="13" font="1">(Slip Opinion) </text>
<text top="171" left="380" width="166" height="16" font="0">OCTOBER TERM, 2013 </text>
<text top="206" left="432" width="57" height="16" font="0">Syllabus </text>
<text top="237" left="284" width="362" height="13" font="1">NOTE: Where it is feasible, a syllabus (headnote) will be released, as is</text>
<text top="247" left="272" width="374" height="13" font="1">being done in connection with this case, at the time the opinion is issued.</text>
<text top="258" left="272" width="374" height="13" font="1">The syllabus constitutes no part of the opinion of the Court but has been</text>
<text top="269" left="272" width="380" height="13" font="1">prepared by the Reporter of Decisions for the convenience of the reader. </text>
<text top="280" left="272" width="343" height="13" font="1">See <i>United States</i> v. <i>Detroit Timber &amp; Lumber Co.,</i> 200 U. S. 321, 337. </text>
<text top="307" left="235" width="453" height="20" font="4"><b>SUPREME COURT OF THE UNITED STATES </b></text>
<text top="354" left="432" width="57" height="16" font="0">Syllabus </text>
<text top="390" left="370" width="183" height="20" font="3">KANSAS <i>v</i>. CHEEVER </text>
<text top="426" left="279" width="363" height="16" font="0">CERTIORARI TO THE SUPREME COURT OF KANSAS </text>
<text top="455" left="245" width="432" height="16" font="0">No. 12–609. Argued October 16, 2013—Decided December 11, 2013 </text>
Position of the left margin and line height may indicate a single text block. In this example, most
of the lines of the paragraph are indented to 272, while the first line of the paragraph is indented
to 284. Still, each line has the same line height.
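The grouping idea can be sketched as follows: consecutive lines join the same block when they share a line height, their left margins differ by no more than a first-line indent, and they sit close together vertically. The tolerance values (15 units each) and the function names are assumptions chosen to fit this one example, not values measured across the corpus. Joining a block's lines with spaces removes the hard line breaks.

```python
# Attributes and text of the NOTE paragraph plus the court name, from the XML above.
LINES = [
    {"top": 237, "left": 284, "height": 13,
     "content": "NOTE: Where it is feasible, a syllabus (headnote) will be released, as is"},
    {"top": 247, "left": 272, "height": 13,
     "content": "being done in connection with this case, at the time the opinion is issued."},
    {"top": 258, "left": 272, "height": 13,
     "content": "The syllabus constitutes no part of the opinion of the Court but has been"},
    {"top": 269, "left": 272, "height": 13,
     "content": "prepared by the Reporter of Decisions for the convenience of the reader."},
    {"top": 307, "left": 235, "height": 20,
     "content": "SUPREME COURT OF THE UNITED STATES"},
]


def group_blocks(lines, indent_tolerance=15, max_gap=15):
    """Group lines into blocks by line height, left margin and vertical gap."""
    blocks = []
    for line in sorted(lines, key=lambda item: item["top"]):
        if blocks:
            prev = blocks[-1][-1]
            same_block = (
                line["height"] == prev["height"]
                and abs(line["left"] - prev["left"]) <= indent_tolerance
                and line["top"] - prev["top"] <= max_gap
            )
            if same_block:
                blocks[-1].append(line)
                continue
        blocks.append([line])
    return blocks


def reflow(block):
    """Join the lines of one block into a single reflowable paragraph."""
    return " ".join(line["content"].strip() for line in block)


blocks = group_blocks(LINES)
paragraphs = [reflow(block) for block in blocks]
for paragraph in paragraphs:
    print(paragraph)
```

This sketch does not rejoin words hyphenated across line ends; that would need a separate rule.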
Other Standard Format Elements
Page Headings Footnotes
Page header content alternates on even and odd pages, but the position and text attributes for
the included content are the same. Footnotes are always marked in superscript within the text and
displayed after a rule at the bottom of the page.
Tools for Text Mining
● Text Encoding Initiative
● Court Listener: https://www.courtlistener.com/
● Semantic Parsing for Legal Texts (conference proceedings): http://www.lrec-conf.org/proceedings/lrec2012/workshops/27.LREC%202012%20Workshop%20Proceedings%20SPLeT.pdf
● Search and Replace
We looked at a number of pre-processing tools to add semantic information to the documents,
such as identifying docket number, case names, decision dates, presiding court, Justice name,
etc., as well as cited legislation and prior court cases.
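For the plain "search and replace" end of that spectrum, a regular expression can already pull the docket number and dates out of the docket line shown in the XML samples. This is a sketch under the assumption that the line always follows the "No. … Argued … Decided …" pattern of the Kansas v. Cheever example; the pattern and the function name parse_docket_line are illustrative.

```python
import re
from datetime import datetime

# Docket line from the XML sample; \u2013 is an en dash, \u2014 an em dash.
DOCKET_LINE = "No. 12\u2013609. Argued October 16, 2013\u2014Decided December 11, 2013"

PATTERN = re.compile(
    r"No\.\s*(?P<docket>\d+[\u2013-]\d+)\.\s*"
    r"Argued\s+(?P<argued>[A-Z][a-z]+ \d{1,2}, \d{4})"
    r"\s*[\u2014-]\s*"
    r"Decided\s+(?P<decided>[A-Z][a-z]+ \d{1,2}, \d{4})"
)


def parse_docket_line(line):
    """Return docket number and argued/decided dates, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        # normalize the typographic en dash to a plain hyphen
        "docket": match.group("docket").replace("\u2013", "-"),
        "argued": datetime.strptime(match.group("argued"), "%B %d, %Y").date(),
        "decided": datetime.strptime(match.group("decided"), "%B %d, %Y").date(),
    }


meta = parse_docket_line(DOCKET_LINE)
print(meta["docket"], meta["argued"], meta["decided"])
```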
Text Encoding Initiative
Text Encoding Initiative allows content producers to tag parts of a document with semantic information.
● footnotes glossing or commenting on any passage could be added;
● pointers linking parts of this text to others could be added;
● proper names of various kinds could be distinguished from the surrounding text;
● detailed bibliographic information about the text's provenance and context could be prefixed to it;
● a linguistic analysis of the passage into sentences, clauses, words, etc., could be provided, each unit being
associated with appropriate category codes;
● the text could be segmented into narrative or discourse units;
● systematic analysis or interpretation of the text could be included in the encoding, with potentially complex
alignment or linkage between the text and the analysis, or between the text and one or more translations of it;
● passages in the text could be linked to images or sound held on other media.
Source: TEI Lite: Encoding for Interchange: an introduction to the TEI <http://www.tei-c.org/release/doc/tei-p5-exemplars/html/tei_lite.doc.html>
Parsing Metadata
Court Name in XML:
<text top="307" left="235" width="453" height="20" font="4"><b>SUPREME COURT OF THE UNITED STATES </b></text>
Could be rendered as any of the following, because the court name always has attributes left="235" and height="20":
HTML: <p class="Supreme Court of the United States"><b>SUPREME COURT OF THE UNITED STATES </b></p>
XML: <name id="Supreme Court of the United States"><b>SUPREME COURT OF THE UNITED STATES </b></name>
TEI: <rs type="organization" key="SCOTUS"><b>SUPREME COURT OF THE UNITED STATES</b></rs>
As we discovered, certain semantically related blocks of text have common text attributes, like
position, line height and font id. We can add semantic markup to identify these blocks.
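A small sketch of that markup step, assuming the left="235" and height="20" rule holds: a matching <text> line is rewritten as the TEI-style <rs> element shown above. The function name tag_court_name is hypothetical, and for simplicity the sketch drops the inner <b> tag and keeps only the text.

```python
import xml.etree.ElementTree as ET


def tag_court_name(text_el):
    """Return an <rs> element for a court-name line, or None for other lines."""
    if text_el.get("left") == "235" and text_el.get("height") == "20":
        rs = ET.Element("rs", {"type": "organization", "key": "SCOTUS"})
        rs.text = "".join(text_el.itertext()).strip()
        return rs
    return None


court_line = ET.fromstring(
    '<text top="307" left="235" width="453" height="20" font="4">'
    "<b>SUPREME COURT OF THE UNITED STATES </b></text>"
)
other_line = ET.fromstring(
    '<text top="206" left="432" width="57" height="16" font="0">Syllabus </text>'
)

rs = tag_court_name(court_line)
print(ET.tostring(rs, encoding="unicode"))
```

The same pattern would extend to other blocks with stable attributes, with one rule per content part.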
Parsing Citations
● Parse citation of previous opinions or laws and link to cited legislation.
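As a starting point for this task, which was not completed at the hackathon, a regular expression can recognize citations to the United States Reports such as the "200 U. S. 321, 337" in the Note above. The pattern below is an assumption covering only that one citation form; resolving a match to a link target is left open.

```python
import re

# Matches "<volume> U. S. <page>" with an optional pinpoint page after a comma.
# \s+ tolerates the line break that falls inside the citation in the XML.
US_REPORTS = re.compile(
    r"(?P<volume>\d+)\s+U\.\s?S\.\s+(?P<page>\d+)(?:,\s+(?P<pin>\d+))?"
)


def find_citations(text):
    """Return volume, first page and optional pinpoint page for each citation."""
    citations = []
    for match in US_REPORTS.finditer(text):
        pin = match.group("pin")
        citations.append({
            "volume": int(match.group("volume")),
            "page": int(match.group("page")),
            "pin": int(pin) if pin else None,
        })
    return citations


NOTE_TEXT = "See United States v. Detroit Timber & Lumber Co., 200 U. S.\n321, 337."
citations = find_citations(NOTE_TEXT)
print(citations)
```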
Creating the UI
● Create stylesheet and page markup for display on devices.

VIP Call Girls Service Bhagyanagar Hyderabad Call +91-8250192130
 

Let It Flow: Government e pubs (NYPL Open Book Hack 2014)

  • 1. Let it Flow: Government Documents PDF (or PDF + XML) to reflowable EPUB Open Book Hackathon, NYPL January 11-13, 2014
  • 2. Proposed Tasks: Legal ePUB What would users want? ● Remove hard line breaks for continuous, reflowable display in multiple formats/modes. ● Parse metadata from header information, pagination and other semantic content. ● Parse citation of previous opinions or laws and link to cited legislation. ● Create stylesheet and page markup for display on devices. It is important to: ● Ensure that data is formatted so that it flows naturally regardless of device used. ● Maintain information about pagination for citation and discovery purposes.
  • 3. Proposed Tasks: Legal ePUB What would users want? ● Remove hard line breaks for continuous, reflowable display in multiple formats/modes. ● Parse metadata from header information, pagination and other semantic content. ● Parse citation of previous opinions or laws and link to cited legislation. ● Create stylesheet and page markup for display on devices. It is important to: ● Ensure that data is formatted so that it flows naturally regardless of device used. ● Maintain information about pagination for citation and discovery purposes. We only had time for one thing.
  • 5. Supreme Court Slip Opinions ● Supreme Court slip opinion documents are available in PDF format. ● Dockets may contain multiple document types: ○ Syllabus ○ Opinion of the Court ○ Per Curiam ○ Dissent
  • 6. Metadata: SCOTUS Slip Opinions ● Citation ● Document Type ● Notice ● Presiding Court ● Docket Number ● Case Names/Parties to the Case ● Date Decided ● Justice ● Opinion Front matter for each document contains several centered blocks of content. At first glance, the documents contain minimal structural information.
  • 7. XML Format
         <text top="171" left="676" width="15" height="16" font="0">1 </text>
         <text top="174" left="234" width="71" height="13" font="1">(Slip Opinion) </text>
         <text top="171" left="380" width="166" height="16" font="0">OCTOBER TERM, 2013 </text>
         <text top="206" left="432" width="57" height="16" font="0">Syllabus </text>
         <text top="237" left="284" width="362" height="13" font="1">NOTE: Where it is feasible, a syllabus (headnote) will be released, as is</text>
         <text top="247" left="272" width="374" height="13" font="1">being done in connection with this case, at the time the opinion is issued.</text>
         <text top="258" left="272" width="374" height="13" font="1">The syllabus constitutes no part of the opinion of the Court but has been</text>
         <text top="269" left="272" width="380" height="13" font="1">prepared by the Reporter of Decisions for the convenience of the reader. </text>
         <text top="280" left="272" width="343" height="13" font="1">See <i>United States</i> v. <i>Detroit Timber &amp; Lumber Co.,</i> 200 U. S. 321, 337. </text>
         <text top="307" left="235" width="453" height="20" font="4"><b>SUPREME COURT OF THE UNITED STATES </b></text>
         <text top="354" left="432" width="57" height="16" font="0">Syllabus </text>
         <text top="390" left="370" width="183" height="20" font="3">KANSAS <i>v</i>. CHEEVER </text>
         <text top="426" left="279" width="363" height="16" font="0">CERTIORARI TO THE SUPREME COURT OF KANSAS </text>
         <text top="455" left="245" width="432" height="16" font="0">No. 12–609. Argued October 16, 2013—Decided December 11, 2013 </text>
     We used Poppler to convert Supreme Court slip opinion documents (PDF) to XML. This revealed position and size attributes for lines of text, but no structured semantic information.
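The conversion step on slide 7 could be scripted roughly as follows. The slides do not say which Poppler tool or options were used; `pdftohtml -xml` is an assumption here, chosen because it produces a `<pdf2xml producer="poppler">` document like the one excerpted on slide 10. The file names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def pdftohtml_xml_command(pdf_path: str, out_stem: str) -> list[str]:
    """Build the argument list for Poppler's pdftohtml in XML mode."""
    # -xml : write <page>, <fontspec> and <text> elements instead of HTML
    # -i   : ignore images, since only the text layer is of interest here
    return ["pdftohtml", "-xml", "-i", pdf_path, out_stem]


def convert(pdf_path: str, out_stem: str) -> Path:
    """Run the conversion and return the path of the XML file it writes."""
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml (poppler-utils) is not installed")
    subprocess.run(pdftohtml_xml_command(pdf_path, out_stem), check=True)
    return Path(out_stem + ".xml")


# Hypothetical file names, for illustration only.
print(" ".join(pdftohtml_xml_command("12-609.pdf", "12-609")))
```

Keeping the command builder separate from the function that runs it makes the options easy to inspect or adjust without Poppler installed.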
  • 8. XML Format: Font ID
         <text top="171" left="676" width="15" height="16" font="0">1 </text>
         <text top="174" left="234" width="71" height="13" font="1">(Slip Opinion) </text>
         <text top="171" left="380" width="166" height="16" font="0">OCTOBER TERM, 2013 </text>
         <text top="206" left="432" width="57" height="16" font="0">Syllabus </text>
         <text top="237" left="284" width="362" height="13" font="1">NOTE: Where it is feasible, a syllabus (headnote) will be released, as is</text>
         <text top="247" left="272" width="374" height="13" font="1">being done in connection with this case, at the time the opinion is issued.</text>
         <text top="258" left="272" width="374" height="13" font="1">The syllabus constitutes no part of the opinion of the Court but has been</text>
         <text top="269" left="272" width="380" height="13" font="1">prepared by the Reporter of Decisions for the convenience of the reader. </text>
         <text top="280" left="272" width="343" height="13" font="1">See <i>United States</i> v. <i>Detroit Timber &amp; Lumber Co.,</i> 200 U. S. 321, 337. </text>
         <text top="307" left="235" width="453" height="20" font="4"><b>SUPREME COURT OF THE UNITED STATES </b></text>
         <text top="354" left="432" width="57" height="16" font="0">Syllabus </text>
         <text top="390" left="370" width="183" height="20" font="3">KANSAS <i>v</i>. CHEEVER </text>
         <text top="426" left="279" width="363" height="16" font="0">CERTIORARI TO THE SUPREME COURT OF KANSAS </text>
         <text top="455" left="245" width="432" height="16" font="0">No. 12–609. Argued October 16, 2013—Decided December 11, 2013 </text>
     We may be able to use position and size attributes to identify content parts of the document. For example, different parts of the document have varying positions, font sizes and line heights.
  • 9. XML Format: Font ID
         <text top="171" left="676" width="15" height="16" font="0">1 </text>
         <text top="174" left="234" width="71" height="13" font="1">(Slip Opinion) </text>
         <text top="171" left="380" width="166" height="16" font="0">OCTOBER TERM, 2013 </text>
         <text top="206" left="432" width="57" height="16" font="0">Syllabus </text>
         <text top="237" left="284" width="362" height="13" font="1">NOTE: Where it is feasible, a syllabus (headnote) will be released, as is</text>
         <text top="247" left="272" width="374" height="13" font="1">being done in connection with this case, at the time the opinion is issued.</text>
         <text top="258" left="272" width="374" height="13" font="1">The syllabus constitutes no part of the opinion of the Court but has been</text>
         <text top="269" left="272" width="380" height="13" font="1">prepared by the Reporter of Decisions for the convenience of the reader. </text>
         <text top="280" left="272" width="343" height="13" font="1">See <i>United States</i> v. <i>Detroit Timber &amp; Lumber Co.,</i> 200 U. S. 321, 337. </text>
         <text top="307" left="235" width="453" height="20" font="4"><b>SUPREME COURT OF THE UNITED STATES </b></text>
         <text top="354" left="432" width="57" height="16" font="0">Syllabus </text>
         <text top="390" left="370" width="183" height="20" font="3">KANSAS <i>v</i>. CHEEVER </text>
         <text top="426" left="279" width="363" height="16" font="0">CERTIORARI TO THE SUPREME COURT OF KANSAS </text>
         <text top="455" left="245" width="432" height="16" font="0">No. 12–609. Argued October 16, 2013—Decided December 11, 2013 </text>
     In this document, the font attribute is 1 for the Note section, 3 for the names of the parties to the case and 4 for the name of the presiding court.
  • 10. XML Format: <page> attributes
     Font sizes are declared in <fontspec> elements that follow the opening <page> tag in the XML document. In this example, the line "SUPREME COURT OF THE UNITED STATES" has height="20" and font="4", which corresponds to fontspec id="4", defined as Times, 20pt, black.
         <?xml version="1.0" encoding="UTF-8"?>
         <!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
         <pdf2xml producer="poppler" version="0.24.4">
         <page number="1" position="absolute" top="0" left="0" height="1188" width="918">
         <fontspec id="0" size="11" family="Times" color="#000000"/>
         <fontspec id="1" size="8" family="Times" color="#000000"/>
         <fontspec id="2" size="8" family="Times" color="#000000"/>
         <fontspec id="3" size="14" family="Times" color="#000000"/>
         <fontspec id="4" size="20" family="Times" color="#000000"/>
         <fontspec id="5" size="14" family="Times" color="#000000"/>
         <fontspec id="6" size="11" family="Times" color="#000000"/>
         ...
         <text top="307" left="235" width="453" height="20" font="4"><b>SUPREME COURT OF THE UNITED STATES </b></text>
     Font ID 4 is Times 20pt
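As a minimal sketch of the lookup described on slide 10, the snippet below resolves each `<text>` element's `font` attribute to the matching `<fontspec>`. It uses Python's standard `xml.etree.ElementTree`, which is an assumption (the slides do not name a parser), and the sample is trimmed to two fontspecs and two lines from the slide.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<pdf2xml producer="poppler" version="0.24.4">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="3" size="14" family="Times" color="#000000"/>
<fontspec id="4" size="20" family="Times" color="#000000"/>
<text top="307" left="235" width="453" height="20" font="4"><b>SUPREME COURT OF THE UNITED STATES </b></text>
<text top="390" left="370" width="183" height="20" font="3">KANSAS <i>v</i>. CHEEVER </text>
</page>
</pdf2xml>"""


def font_table(root):
    """Map every fontspec id in the document to (family, size)."""
    return {
        spec.get("id"): (spec.get("family"), int(spec.get("size")))
        for spec in root.iter("fontspec")
    }


def lines_with_fonts(root):
    """Yield (text, family, size) for every <text> element."""
    fonts = font_table(root)
    for el in root.iter("text"):
        family, size = fonts[el.get("font")]
        # itertext() flattens inline <b> and <i> children into plain text
        yield "".join(el.itertext()).strip(), family, size


for text, family, size in lines_with_fonts(ET.fromstring(SAMPLE)):
    print(f"{family} {size}: {text}")
```

Because `font_table` scans the whole document, fontspecs that first appear on later pages are picked up as well.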
  • 11. XML Format: <page> attributes
     Font specifications are defined within the <page> element. If a later page needs new font IDs, their <fontspec> elements are added directly after that page's opening <page> tag.
         <page number="1" position="absolute" top="0" left="0" height="1188" width="918">
         <fontspec id="0" size="11" family="Times" color="#000000"/>
         <fontspec id="1" size="8" family="Times" color="#000000"/>
         <fontspec id="2" size="8" family="Times" color="#000000"/>
         <fontspec id="3" size="14" family="Times" color="#000000"/>
         <fontspec id="4" size="20" family="Times" color="#000000"/>
         <fontspec id="5" size="14" family="Times" color="#000000"/>
         <fontspec id="6" size="11" family="Times" color="#000000"/>
         <page number="2" position="absolute" top="0" left="0" height="1188" width="918">
         <fontspec id="7" size="11" family="Times" color="#ff0000"/>
         <page number="3" position="absolute" top="0" left="0" height="1188" width="918">
         <fontspec id="8" size="7" family="Times" color="#000000"/>
         <page number="5" position="absolute" top="0" left="0" height="1188" width="918">
         <fontspec id="9" size="4" family="Times" color="#000000"/>
         ...
  • 12. XML Format: Inconsistencies
     Font sizes for specific id attributes are not consistent across all documents, BUT...
     Docket 12-609
         <page number="1" position="absolute" top="0" left="0" height="1188" width="918">
         <fontspec id="0" size="11" family="Times" color="#000000"/>
         <fontspec id="1" size="8" family="Times" color="#000000"/>
         <fontspec id="2" size="8" family="Times" color="#000000"/>
         <fontspec id="3" size="14" family="Times" color="#000000"/>
         <fontspec id="4" size="20" family="Times" color="#000000"/>
         <fontspec id="5" size="14" family="Times" color="#000000"/>
         <fontspec id="6" size="11" family="Times" color="#000000"/>
     In this document, the line "SUPREME COURT OF THE UNITED STATES" has font id 4.
     Docket 12-729
         <page number="1" position="absolute" top="0" left="0" height="1188" width="918">
         <fontspec id="0" size="11" family="Times" color="#000000"/>
         <fontspec id="1" size="8" family="Times" color="#000000"/>
         <fontspec id="2" size="8" family="Times" color="#000000"/>
         <fontspec id="3" size="14" family="Times" color="#000000"/>
         <fontspec id="4" size="11" family="Times" color="#000000"/>
         <fontspec id="5" size="20" family="Times" color="#000000"/>
         <fontspec id="6" size="14" family="Times" color="#000000"/>
         <fontspec id="7" size="14" family="Times" color="#000000"/>
     In this document, the line "SUPREME COURT OF THE UNITED STATES" has font id 5.
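One way to sidestep the shifting font ids on slide 12 is to select lines by resolved font size instead of by id. The sketch below assumes the court name is the only line on page 1 set in the largest font, which matches the two fontspec tables shown but has not been checked against other dockets. The `<text>` line for docket 12-729 is reconstructed from the slide's statement that it uses font id 5; its position attributes are assumed to match docket 12-609.

```python
import xml.etree.ElementTree as ET

DOCKET_12_609 = """<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="3" size="14" family="Times" color="#000000"/>
<fontspec id="4" size="20" family="Times" color="#000000"/>
<text top="307" left="235" width="453" height="20" font="4"><b>SUPREME COURT OF THE UNITED STATES </b></text>
<text top="390" left="370" width="183" height="20" font="3">KANSAS <i>v</i>. CHEEVER </text>
</page>"""

# Reconstructed line: the slide only gives the fontspec table for this docket
# and says the court name uses font id 5.
DOCKET_12_729 = """<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="4" size="11" family="Times" color="#000000"/>
<fontspec id="5" size="20" family="Times" color="#000000"/>
<text top="307" left="235" width="453" height="20" font="5"><b>SUPREME COURT OF THE UNITED STATES </b></text>
</page>"""


def lines_in_largest_font(page_xml):
    """Return the text of every line set in the page's largest declared font."""
    root = ET.fromstring(page_xml)
    sizes = {s.get("id"): int(s.get("size")) for s in root.iter("fontspec")}
    largest = max(sizes.values())
    return [
        "".join(el.itertext()).strip()
        for el in root.iter("text")
        if sizes[el.get("font")] == largest
    ]


for page in (DOCKET_12_609, DOCKET_12_729):
    print(lines_in_largest_font(page))
```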
  • 13. XML Format: Commonalities
     ...we noticed that the position attributes for many content parts, such as left margin and line height, were consistent across all documents. These content parts are noted below.
     Content parts with consistent text attribute values:
     ● Page elements have height="1188" and width="918" on all documents.
     ● Decision date has the same attributes, left="395" and height="16", on all documents.
     ● "Cite as" is always top="171" left="366" width="189" height="16" font="0"
     ● Type of Opinion, e.g. "(Slip Opinion)", is always top="174" left="234" font="1"
     ● "Opinion of the Court" is always top="171" left="366" width="189" height="16" font="0"
     ● "Note" and "Notices" are always left="272" height="13" font="1"
       ○ Exception: sometimes the first line is indented at left="284"
     ● The last line of a Note or Notice is always a reference to another court case, which always has top="280" left="272" width="343" height="13" font="1"
     ● The name of the case (usually [partyA] v. [partyB]) is always height="20" font="3".
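The attribute patterns listed on slide 13 could be turned into a small rule table, sketched below. The labels (`opinion_type`, `running_head`, `note`, `case_name`) are hypothetical names introduced for illustration, and the rules are only as reliable as the slide's observations; in particular, rules that depend on `font` may not hold where font ids differ between documents.

```python
import xml.etree.ElementTree as ET

# (label, attribute values that must all match), transcribed from slide 13.
RULES = [
    ("opinion_type", {"top": "174", "left": "234", "font": "1"}),
    ("running_head", {"top": "171", "left": "366", "width": "189",
                      "height": "16", "font": "0"}),
    ("note", {"left": "272", "height": "13", "font": "1"}),
    ("note", {"left": "284", "height": "13", "font": "1"}),  # indented first line
    ("case_name", {"height": "20", "font": "3"}),
]


def classify(text_el):
    """Return the first rule label whose attributes all match, else None."""
    for label, required in RULES:
        if all(text_el.get(name) == value for name, value in required.items()):
            return label
    return None


SAMPLE = """<page>
<text top="174" left="234" width="71" height="13" font="1">(Slip Opinion) </text>
<text top="237" left="284" width="362" height="13" font="1">NOTE: Where it is feasible, a syllabus (headnote) will be released, as is</text>
<text top="247" left="272" width="374" height="13" font="1">being done in connection with this case, at the time the opinion is issued.</text>
<text top="390" left="370" width="183" height="20" font="3">KANSAS <i>v</i>. CHEEVER </text>
<text top="426" left="279" width="363" height="16" font="0">CERTIORARI TO THE SUPREME COURT OF KANSAS </text>
</page>"""

labels = [classify(el) for el in ET.fromstring(SAMPLE).iter("text")]
print(labels)
```

Lines that match no rule come back as `None`, so unclassified content stays visible instead of being mislabeled.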
  • 14. XML Format: left & height attributes
         <text top="171" left="676" width="15" height="16" font="0">1 </text>
         <text top="174" left="234" width="71" height="13" font="1">(Slip Opinion) </text>
         <text top="171" left="380" width="166" height="16" font="0">OCTOBER TERM, 2013 </text>
         <text top="206" left="432" width="57" height="16" font="0">Syllabus </text>
         <text top="237" left="284" width="362" height="13" font="1">NOTE: Where it is feasible, a syllabus (headnote) will be released, as is</text>
         <text top="247" left="272" width="374" height="13" font="1">being done in connection with this case, at the time the opinion is issued.</text>
         <text top="258" left="272" width="374" height="13" font="1">The syllabus constitutes no part of the opinion of the Court but has been</text>
         <text top="269" left="272" width="380" height="13" font="1">prepared by the Reporter of Decisions for the convenience of the reader. </text>
         <text top="280" left="272" width="343" height="13" font="1">See <i>United States</i> v. <i>Detroit Timber &amp; Lumber Co.,</i> 200 U. S. 321, 337. </text>
         <text top="307" left="235" width="453" height="20" font="4"><b>SUPREME COURT OF THE UNITED STATES </b></text>
         <text top="354" left="432" width="57" height="16" font="0">Syllabus </text>
         <text top="390" left="370" width="183" height="20" font="3">KANSAS <i>v</i>. CHEEVER </text>
         <text top="426" left="279" width="363" height="16" font="0">CERTIORARI TO THE SUPREME COURT OF KANSAS </text>
         <text top="455" left="245" width="432" height="16" font="0">No. 12–609. Argued October 16, 2013—Decided December 11, 2013 </text>
     Position of the left margin and line height may indicate a single text block. In this example, most of the lines of the paragraph are indented to 272, while the first line of the paragraph is indented to 284. Still, each line has the same line height.
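The grouping idea on slide 14 can be sketched as a simple reflow pass: a line continues the current block when it has the same `height` as the previous line and sits at the body margin, and anything else starts a new block. The margin value `272` comes from the Note example and is passed as a parameter because other sections may use a different one; this is an illustration under those assumptions, not the hackathon team's code.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<page>
<text top="206" left="432" width="57" height="16" font="0">Syllabus </text>
<text top="237" left="284" width="362" height="13" font="1">NOTE: Where it is feasible, a syllabus (headnote) will be released, as is</text>
<text top="247" left="272" width="374" height="13" font="1">being done in connection with this case, at the time the opinion is issued.</text>
<text top="258" left="272" width="374" height="13" font="1">The syllabus constitutes no part of the opinion of the Court but has been</text>
<text top="269" left="272" width="380" height="13" font="1">prepared by the Reporter of Decisions for the convenience of the reader. </text>
<text top="280" left="272" width="343" height="13" font="1">See <i>United States</i> v. <i>Detroit Timber &amp; Lumber Co.,</i> 200 U. S. 321, 337. </text>
<text top="307" left="235" width="453" height="20" font="4"><b>SUPREME COURT OF THE UNITED STATES </b></text>
</page>"""


def reflow(root, body_left="272"):
    """Merge consecutive lines into blocks, dropping hard line breaks."""
    blocks = []         # each block is a list of line strings
    prev_height = None
    for el in root.iter("text"):
        # Flatten inline markup and collapse runs of whitespace.
        line = " ".join("".join(el.itertext()).split())
        continues = (
            len(blocks) > 0
            and el.get("height") == prev_height
            and el.get("left") == body_left
        )
        if continues:
            blocks[-1].append(line)
        else:
            # A first-line indent (left=284 here) or a centered line starts a block.
            blocks.append([line])
        prev_height = el.get("height")
    return [" ".join(lines) for lines in blocks]


for block in reflow(ET.fromstring(SAMPLE)):
    print(block)
```

Words hyphenated across a line break are not rejoined by this sketch and would need separate handling.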
  • 15. Other Standard Format Elements: Page Headings, Footnotes. Page header content alternates on even and odd pages, but the position and text attributes of that content are the same. Footnotes are always marked in superscript within the text and displayed after a rule at the bottom of the page.
  • 16. Tools for Text Mining ● Text Encoding Initiative ● Court Listener: https://www.courtlistener.com/ ● Semantic Parsing for Legal Texts (conference proceedings): http://www.lrec-conf.org/proceedings/lrec2012/workshops/27.LREC%202012%20Workshop%20Proceedings%20SPLeT.pdf ● Search and Replace We looked at a number of pre-processing tools for adding semantic information to the documents, such as identifying the docket number, case names, decision dates, presiding court and Justice name, as well as cited legislation and prior court cases.
  • 17. Text Encoding Initiative
     The Text Encoding Initiative (TEI) allows content producers to tag parts of a document with semantic information.
     ● footnotes glossing or commenting on any passage could be added;
     ● pointers linking parts of this text to others could be added;
     ● proper names of various kinds could be distinguished from the surrounding text;
     ● detailed bibliographic information about the text's provenance and context could be prefixed to it;
     ● a linguistic analysis of the passage into sentences, clauses, words, etc., could be provided, each unit being associated with appropriate category codes;
     ● the text could be segmented into narrative or discourse units;
     ● systematic analysis or interpretation of the text could be included in the encoding, with potentially complex alignment or linkage between the text and the analysis, or between the text and one or more translations of it;
     ● passages in the text could be linked to images or sound held on other media.
     Source: TEI Lite: Encoding for Interchange: an introduction to the TEI <http://www.tei-c.org/release/doc/tei-p5-exemplars/html/tei_lite.doc.html>
  • 18. Parsing Metadata
     Court name in XML:
         <text top="307" left="235" width="453" height="20" font="4"><b>SUPREME COURT OF THE UNITED STATES </b></text>
     Because the court name always has the attributes left="235" and height="20", it could be rendered as any of the following:
     HTML: <p class="Supreme Court of the United States"><b>SUPREME COURT OF THE UNITED STATES </b></p>
     XML: <name id="Supreme Court of the United States"><b>SUPREME COURT OF THE UNITED STATES </b></name>
     TEI: <rs type="organization" key="SCOTUS"><b>SUPREME COURT OF THE UNITED STATES</b></rs>
     As we discovered, certain semantically related blocks of text have common text attributes, like position, line height and font id. We can add semantic markup to identify these blocks.
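A minimal sketch of the tagging step on slide 18: if a line has `left="235"` and `height="20"`, wrap its text in the TEI `<rs>` element shown in the slide. The function name `tag_court_name` is hypothetical, and for simplicity the sketch drops the inline `<b>` markup that the slide's example keeps.

```python
import xml.etree.ElementTree as ET


def tag_court_name(text_el):
    """Return a TEI <rs> element for a court-name line, or None otherwise."""
    if text_el.get("left") != "235" or text_el.get("height") != "20":
        return None
    rs = ET.Element("rs", {"type": "organization", "key": "SCOTUS"})
    # Flatten the inline <b> child and trim the trailing space.
    rs.text = "".join(text_el.itertext()).strip()
    return rs


court_line = ET.fromstring(
    '<text top="307" left="235" width="453" height="20" font="4">'
    "<b>SUPREME COURT OF THE UNITED STATES </b></text>"
)
case_line = ET.fromstring(
    '<text top="390" left="370" width="183" height="20" font="3">'
    "KANSAS <i>v</i>. CHEEVER </text>"
)

tagged = tag_court_name(court_line)
print(ET.tostring(tagged, encoding="unicode"))
print(tag_court_name(case_line))  # the case name line is not a court name
```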
  • 19. Parsing Citations ● Parse citation of previous opinions or laws and link to cited legislation.
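The deck lists citation parsing as a proposed task without describing an approach. As one possible starting point, the sketch below pulls United States Reports citations of the form seen in the sample Note ("200 U. S. 321, 337") out of a line of text. The regular expression is an assumption that covers only this one citation style; other reporters, statutes and the linking step itself are out of scope.

```python
import re

# volume, "U. S." (with or without the inner space), first page, optional pin cite
US_REPORTS = re.compile(
    r"(?P<volume>\d+)\s+U\.\s?S\.\s+(?P<page>\d+)(?:,\s*(?P<pin>\d+))?"
)


def find_us_citations(text):
    """Return a list of dicts describing each U.S. Reports citation found."""
    return [
        {
            "volume": int(m["volume"]),
            "page": int(m["page"]),
            "pin": int(m["pin"]) if m["pin"] else None,
        }
        for m in US_REPORTS.finditer(text)
    ]


line = "See United States v. Detroit Timber & Lumber Co., 200 U. S. 321, 337."
print(find_us_citations(line))
```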
  • 20. Creating the UI ● Create stylesheet and page markup for display on devices.