SlideShare a Scribd company logo
1 of 7
TXM import process
Serge Heiden
(ICAR Laboratory, France)
textometrie@ens-lyon.fr
TXM workshop, DARIAH-DE 2014, Würzburg
TXM import

Import
Search

Statistics
Edition
TXM import modules
corpus input formats



Various proprietary formats : Hyperbase, Alceste, CNR (Cordial)
Calibre – open ebook digital library (ePub)



Copy/Paste
TXT Unicode+CSV (metadata) : raw texts directory



XML/w+CSV : XML texts directory



XML-TEI P5 BFM : TEI standard compatible XML
XML-TEI P5 BVH
XML-TEI P5 FRANTEXT texts
XML-TEI P5 FRANTEXT search results
XML-TEI-TXM : TEI compatible XML+NLP (pivot)











XML-Transcriber+CSV – audio aligned transcriptions
XML-TMX – multilingual aligned corpora
XML-PPS-Factiva – press portal
TXM import environment
Basic import & analysis workflow

TXT texts directory
 XML texts + metadata
 NLP Tagged texts
 XML-TXM TEI texts
 Contrasts : sub-corpus & partition
 Structures
 Lexical facets


TreeTagger

TXM
TXM corpus data model


Text Units (book, article…)
Metadata (author, date, domain, genre…)
 Internal Structure (sentence, paragraph, sections…)
•
Properties (number, title, type...)
 Textual Planes











Comments, Direct Speech, Speaker Turns
Main language (french…), Secondary language (latin…)
Out-of-Text (comments…)

Lexical Units
– Properties (graphical form, lemma, part of speech…)
NLP tools involved (taggers…)
Edition
Pagination (page breaks)
 Rendering properties (styles)





Bibliographic references
Alignment (aligned corpora)

TXM import charter
TXM import levels charter
TXT

XML/w

XML-TEI

Text units

files

files

files

Metadata

CSV

CSV

teiHeader

Words

raw

<w>?

<w>?

Structures

-

any

specific

More Related Content

Similar to TXM import process

Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-endgagravarr
 
Multimedia in Higher Education
Multimedia in Higher EducationMultimedia in Higher Education
Multimedia in Higher Educationlearning20
 
Whats new in Alchemy Catalyst 8.0
Whats new in Alchemy Catalyst 8.0Whats new in Alchemy Catalyst 8.0
Whats new in Alchemy Catalyst 8.0Shamusd
 
Editing Correspondence. The I in TEI.
Editing Correspondence. The I in TEI.Editing Correspondence. The I in TEI.
Editing Correspondence. The I in TEI.Bert Van Raemdonck
 
Hidden Markov Model Toolkit (HTK) www.redicals.com
Hidden Markov Model Toolkit (HTK) www.redicals.comHidden Markov Model Toolkit (HTK) www.redicals.com
Hidden Markov Model Toolkit (HTK) www.redicals.comGoa App
 
Poio API: a CLARIN-D curation project for language documentation and language...
Poio API: a CLARIN-D curation project for language documentation and language...Poio API: a CLARIN-D curation project for language documentation and language...
Poio API: a CLARIN-D curation project for language documentation and language...Peter Bouda
 
Zerfass trends in translation technologies
Zerfass trends in translation technologiesZerfass trends in translation technologies
Zerfass trends in translation technologiesascetlan
 
Phpconf taiwan-2012
Phpconf taiwan-2012Phpconf taiwan-2012
Phpconf taiwan-2012Hash Lin
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content TransformationAlfresco Software
 
Architecture of ContentMine Components contentmine.org
Architecture of ContentMine Components contentmine.orgArchitecture of ContentMine Components contentmine.org
Architecture of ContentMine Components contentmine.orgpetermurrayrust
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Chris Fregly
 
Visual Studio 2010 and .NET 4.0 Overview
Visual Studio 2010 and .NET 4.0 OverviewVisual Studio 2010 and .NET 4.0 Overview
Visual Studio 2010 and .NET 4.0 Overviewbwullems
 
Europeana Cloud - Ingestion and Aggregation Workshop
Europeana Cloud - Ingestion and Aggregation WorkshopEuropeana Cloud - Ingestion and Aggregation Workshop
Europeana Cloud - Ingestion and Aggregation WorkshopEuropeana
 

Similar to TXM import process (20)

Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
 
Multimedia in Higher Education
Multimedia in Higher EducationMultimedia in Higher Education
Multimedia in Higher Education
 
The Python Programming Language and HDF5: H5Py
The Python Programming Language and HDF5: H5PyThe Python Programming Language and HDF5: H5Py
The Python Programming Language and HDF5: H5Py
 
Apache Tika
Apache TikaApache Tika
Apache Tika
 
Whats new in Alchemy Catalyst 8.0
Whats new in Alchemy Catalyst 8.0Whats new in Alchemy Catalyst 8.0
Whats new in Alchemy Catalyst 8.0
 
AINL 2016: Bugaychenko
AINL 2016: BugaychenkoAINL 2016: Bugaychenko
AINL 2016: Bugaychenko
 
Editing Correspondence. The I in TEI.
Editing Correspondence. The I in TEI.Editing Correspondence. The I in TEI.
Editing Correspondence. The I in TEI.
 
Product Designer Hub - Taking HPD to the Web
Product Designer Hub - Taking HPD to the WebProduct Designer Hub - Taking HPD to the Web
Product Designer Hub - Taking HPD to the Web
 
The CLAM Framework
The CLAM FrameworkThe CLAM Framework
The CLAM Framework
 
Hidden Markov Model Toolkit (HTK) www.redicals.com
Hidden Markov Model Toolkit (HTK) www.redicals.comHidden Markov Model Toolkit (HTK) www.redicals.com
Hidden Markov Model Toolkit (HTK) www.redicals.com
 
A few words about Search
A few words about SearchA few words about Search
A few words about Search
 
Poio API: a CLARIN-D curation project for language documentation and language...
Poio API: a CLARIN-D curation project for language documentation and language...Poio API: a CLARIN-D curation project for language documentation and language...
Poio API: a CLARIN-D curation project for language documentation and language...
 
Zerfass trends in translation technologies
Zerfass trends in translation technologiesZerfass trends in translation technologies
Zerfass trends in translation technologies
 
Phpconf taiwan-2012
Phpconf taiwan-2012Phpconf taiwan-2012
Phpconf taiwan-2012
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content Transformation
 
Learning XSLT
Learning XSLTLearning XSLT
Learning XSLT
 
Architecture of ContentMine Components contentmine.org
Architecture of ContentMine Components contentmine.orgArchitecture of ContentMine Components contentmine.org
Architecture of ContentMine Components contentmine.org
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
 
Visual Studio 2010 and .NET 4.0 Overview
Visual Studio 2010 and .NET 4.0 OverviewVisual Studio 2010 and .NET 4.0 Overview
Visual Studio 2010 and .NET 4.0 Overview
 
Europeana Cloud - Ingestion and Aggregation Workshop
Europeana Cloud - Ingestion and Aggregation WorkshopEuropeana Cloud - Ingestion and Aggregation Workshop
Europeana Cloud - Ingestion and Aggregation Workshop
 

Recently uploaded

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

TXM import process

  • 1. TXM import process Serge Heiden (ICAR Laboratory, France) textometrie@ens-lyon.fr TXM workshop, DARIAH-DE 2014, Würzburg
  • 3. TXM import modules corpus input formats   Various proprietary formats : Hyperbase, Alceste, CNR (Cordial) Calibre – open ebook digital library (ePub)  Copy/Paste TXT Unicode+CSV (metadata) : raw texts directory  XML/w+CSV : XML texts directory  XML-TEI P5 BFM : TEI standard compatible XML XML-TEI P5 BVH XML-TEI P5 FRANTEXT texts XML-TEI P5 FRANTEXT search results XML-TEI-TXM : TEI compatible XML+NLP (pivot)         XML-Transcriber+CSV – audio aligned transcriptions XML-TMX – multilingual aligned corpora XML-PPS-Factiva – press portal
  • 5. Basic import & analysis workflow TXT texts directory  XML texts + metadata  NLP Tagged texts  XML-TXM TEI texts  Contrasts : sub-corpus & partition  Structures  Lexical facets  TreeTagger TXM
  • 6. TXM corpus data model  Text Units (book, article…) Metadata (author, date, domain, genre…)  Internal Structure (sentence, paragraph, sections…) • Properties (number, title, type...)  Textual Planes        Comments, Direct Speech, Speaker Turns Main language (french…), Secondary language (latin…) Out-of-Text (comments…) Lexical Units – Properties (graphical form, lemma, part of speech…) NLP tools involved (taggers…) Edition Pagination (page breaks)  Rendering properties (styles)    Bibliographic references Alignment (aligned corpora) TXM import charter
  • 7. TXM import levels charter TXT XML/w XML-TEI Text units files files files Metadata CSV CSV teiHeader Words raw <w>? <w>? Structures - any specific