- BibleGrapevine is a website developed by Global Bible Initiative to make linguistic data from Biblical texts and translations available for research
- It displays syntactic trees, alignments between source texts and translations, and links translations to allow comparison across languages
- Current features include basic views, interlinear views, tree views, and translation memory views, with plans to add search for similar linguistic units and word sense exploration
The main users would likely be biblical scholars, linguists, translators, and students interested in in-depth linguistic analysis of biblical texts and translations. Views showing syntactic relationships and alignments between source texts and translations would be especially useful to these users.
This document analyzes the fidelity and readability of 13 English Bible translations using quantitative linguistic methods. It measures fidelity based on the syntactic transfer rate and consistency of word choices between the original texts and translations. It measures readability based on the rate of common vocabulary words and syntactic fluency compared to a sample of contemporary English. The analysis ranks the translations on fidelity and readability and explores whether a translation can achieve both high fidelity and readability. The results show some translations are ranked highly in both dimensions.
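A minimal sketch of one metric of this kind, assuming the common-vocabulary rate is the share of a passage's words that appear in a reference list of frequent contemporary English words; the word list and verse are illustrative, not the study's data:

```python
# Hypothetical common-vocabulary rate: fraction of tokens found in a
# reference list of frequent contemporary English words (illustrative data).
def common_vocabulary_rate(text, common_words):
    tokens = [w.strip('.,;:"').lower() for w in text.split()]
    tokens = [t for t in tokens if t]
    return sum(t in common_words for t in tokens) / len(tokens) if tokens else 0.0

common = {"in", "the", "beginning", "was", "word", "and", "with", "god"}
print(common_vocabulary_rate(
    "In the beginning was the Word, and the Word was with God.", common))
```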
This document discusses key topics related to understanding and interpreting the Bible:
1. It defines the Bible as God-breathed and authoritative. All Scripture is considered the word of God.
2. It states that the overarching theme of the Bible is exile and redemption, beginning with Adam and Eve's exile from Eden.
3. In discussing Bible translations, it acknowledges challenges in translation and outlines different theoretical models from formal to functional equivalence. Cultural issues and inclusive language are also addressed.
Engineering Intelligent NLP Applications Using Deep Learning – Part 1 (Saurabh Kaushik)
This document discusses natural language processing (NLP) and language modeling. It covers the basics of NLP including what NLP is, its common applications, and basic NLP processing steps like parsing. It also discusses word and sentence modeling in NLP, including word representations using techniques like bag-of-words, word embeddings, and language modeling approaches like n-grams, statistical modeling, and neural networks. The document focuses on introducing fundamental NLP concepts.
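As a concrete illustration of the n-gram approach mentioned above, here is a minimal bigram language model with add-one smoothing over a toy corpus; all data are illustrative:

```python
from collections import Counter

corpus = "the word was with god and the word was god".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigrams)

def bigram_prob(prev, word):
    # P(word | prev) with Laplace (add-one) smoothing
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

print(bigram_prob("the", "word"))  # frequent bigram: relatively high
print(bigram_prob("the", "with"))  # unseen bigram: small smoothed mass
```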
This document provides an overview of the resources available on BibleGateway.com for studying the Bible, including parallel Bible versions, commentaries, concordances, and original Greek and Hebrew texts. It explains how to use the parallel version and original language tools to compare translations and Greek manuscripts. The document also discusses translation theory, noting the differences between "word-for-word" and "thought-for-thought" translations, and how to "get behind the translation" by analyzing multiple versions of the same passage.
This document provides an overview of Bible translations. It discusses that the Bible was originally written in Hebrew, Greek, and Aramaic and needs to be translated into modern languages. There are two main translation philosophies: word-for-word, which focuses on structure, and thought-for-thought, which focuses on clarity. No translation is perfect, so it's important to understand the philosophies and compare translations to find the best fit for individual needs. Key points are summarized at the end.
This document provides an introduction to text mining, including defining key concepts such as structured vs. unstructured data, why text mining is useful, and some common challenges. It also outlines important text mining techniques like pre-processing text through normalization, tokenization, stemming, and removing stop words to prepare text for analysis. Text mining methods can be used for applications such as sentiment analysis, predicting markets or customer churn.
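A minimal sketch of the preprocessing steps named above: normalization, tokenization, stop-word removal, and a crude suffix-stripping stand-in for real stemming; the stop-word list and suffix rules are illustrative:

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}  # illustrative

def preprocess(text):
    text = text.lower()                                    # normalization
    tokens = re.findall(r"[a-z]+", text)                   # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop-word removal
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]  # naive "stemming"

print(preprocess("The translators examined the meanings of disputed words."))
```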
The aim of this presentation is to provide practical suggestions to help colleagues use online dictionaries effectively. We begin by exploring the ways in which dictionaries on the Internet have overcome the constraints of traditional dictionaries. We evaluate the advantages that online dictionaries offer, while also considering some potential disadvantages.
The first major advantage is that we have access to a wide variety of dictionaries, and nearly all of them are free. Another major benefit is the way information is accessed and displayed; online dictionaries are easy to search, and make use of multimedia capabilities to include sound, pictures and even video.
The presentation distinguishes four different ways of accessing and using these resources. The first of these concerns dictionaries accessed through a dedicated website. These have the advantage of reliability, but some of them are subscription services. The second category is dictionaries integrated into other websites – usually bilingual dictionaries to help speakers of other languages to understand the predominantly English content of the Internet. Then, we look at an example of how a dictionary can be integrated into your web browser, so that it is available to use with every site you visit. Finally, there is the dictionary that you can integrate into your word processor, invaluable for writing and vocabulary activities.
We examine various learner’s dictionaries, assessing what is available and emphasising the importance of choosing an appropriate dictionary according to the level and the needs of the learners. We also look at the additional facilities that learner’s dictionary sites offer for language development.
Finally, we consider ways to train learners to use dictionaries more effectively. In particular, we emphasize the importance of training learners to select the correct meaning of a word according to the context, and we look at ways in which the dictionaries can guide learners in this process.
This document outlines a lecture on lexical semantics and multilingualism. It discusses the differences between linguistic ontologies and application ontologies. It introduces key concepts in lexical semantics like the semantic triangle, polysemy, and synonymy. The semantic triangle illustrates the indirect relationship between symbols, concepts, and things in the real world. Polysemy refers to a word having multiple meanings, while synonymy refers to different words having the same meaning. The document also discusses how concepts are shared to varying degrees across languages, with more interaction between language communities leading to more shared concepts.
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks (MLconf)
Deep Learning Architectures for Semantic Relation Detection Tasks
Recognizing and distinguishing specific semantic relations from other types of semantic relations is an essential part of language understanding systems. Identifying expressions with similar and contrasting meanings is valuable for NLP systems which go beyond recognizing semantic relatedness and need to identify specific semantic relations. In this talk, I will first present novel techniques for creating labelled datasets required for training deep learning models for classifying semantic relations between phrases. I will then present various neural network architectures that integrate morphological features into integrated path-based and distributional relation detection algorithms, and demonstrate that this model outperforms state-of-the-art models in distinguishing semantic relations and is capable of efficiently handling multi-word expressions.
This document discusses different ways of visualizing textual data through word alignment. It notes that word alignment increases access to source texts for readers of target texts, and is commonly used for statistical machine translation and by scholars for analyzing literary dependence, translation techniques, reception history, textual criticism, and more. The document provides examples of past visualizations of aligned texts from Logos Bible Software, BibleWorks, and esvbible.org, as well as visualizations using lines, matrices, colors, and mouseovers developed by computational linguists.
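As a toy version of the matrix style of alignment visualization mentioned above, the sketch below prints source words as rows and target words as columns, marking aligned pairs; the transliterated Greek and the link set are illustrative:

```python
source = ["en", "archē", "ēn", "ho", "logos"]
target = ["in", "the", "beginning", "was", "the", "word"]
links = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 4), (4, 5)}  # (source_i, target_j)

print(" " * 10 + " ".join(f"{w:>9}" for w in target))
for i, s in enumerate(source):
    cells = " ".join("#".rjust(9) if (i, j) in links else ".".rjust(9)
                     for j in range(len(target)))
    print(s.rjust(10) + cells)
```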
Lecture 1: Introduction to Discourse Analysis (Google)
Introduction to discourse analysis
What is discourse?
What is discourse Analysis?
Paradigms in linguistics
Cohesion and coherence
Types of written discourse
Types of spoken discourse
Text and discourse
Scope of discourse analysis
This document discusses the key elements of a good paragraph. It defines a paragraph as a group of related sentences that develop a single main idea. A good paragraph demonstrates unity with a single topic, support for the main idea with details, coherence through logical connections between ideas, and good language with grammatical accuracy. The document also provides examples of different types of linkers or connecting words that can be used to establish relationships between ideas in a paragraph.
This document discusses corpus linguistics and quantitative research design. It defines a corpus as a large collection of texts used for linguistic analysis. Corpus linguistics allows researchers to empirically test hypotheses about language patterns and features based on large amounts of real-world data. Quantitative analysis of corpus data shows how frequently certain words, constructions, and patterns are used. Specialized corpora can focus on particular text types, languages, or learner language. Various software tools are used to analyze corpora through frequency lists, keyword lists, collocation analysis, and other methods.
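A minimal sketch of two such measures, a frequency list and a pointwise-mutual-information score for adjacent collocations, over a toy corpus; all data are illustrative:

```python
import math
from collections import Counter

tokens = "the light shines in the darkness and the darkness has not overcome it".split()
freq = Counter(tokens)
pairs = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

def pmi(w1, w2):
    """Pointwise mutual information of an adjacent word pair."""
    p_pair = pairs[(w1, w2)] / (n - 1)
    if p_pair == 0:
        return float("-inf")
    return math.log2(p_pair / ((freq[w1] / n) * (freq[w2] / n)))

print(freq.most_common(3))     # frequency list
print(pmi("the", "darkness"))  # collocation strength
```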
The central issue in test translations and adaptations is producing instruments that adequately measure target constructs across cultures. There are two main perspectives on equivalence - linguistic equivalence focuses on similarity of linguistic features, while psychological equivalence focuses on similarity of meaning and scores. A good translation combines high levels of construct, cultural, linguistic, and measurement equivalence. There is no single best approach, as the optimal method depends on the specific case. Multiple procedures can be used together to evaluate translation accuracy.
Introduction to natural language processing (NLP) (Alia Hamwi)
The document provides an introduction to natural language processing (NLP). It defines NLP as a field of artificial intelligence devoted to creating computers that can use natural language as input and output. Some key NLP applications mentioned include data analysis of user-generated content, conversational agents, translation, classification, information retrieval, and summarization. The document also discusses various linguistic levels of analysis like phonology, morphology, syntax, and semantics that involve ambiguity challenges. Common NLP tasks like part-of-speech tagging, named entity recognition, parsing, and information extraction are described. Finally, the document outlines the typical steps in an NLP pipeline including data collection, text cleaning, preprocessing, feature engineering, modeling and evaluation.
Deploying Semantic Technologies for Digital Publishing: A Case Study from Logos (sboisen)
Presented May 24, 2007 at the Semantic Technology Conference
This talk describes an effort at Logos Research Systems to build a semantic knowledgebase encompassing general background information about entities and relationships from the Bible (one of the world's most popular collections of information). The scope includes people, places, belief systems, ethnic attributes, social roles, as well as family and other inter-personal relationships, places visited, etc. This Bible Knowledgebase (BK) will be used to support knowledge discovery and visualization in both desktop and web-server configurations for Logos' products. It will also provide an integration framework for Logos' substantial digital library (more than 7000 titles from over 100 different publishers). The project is a good example of what it takes to move a real-world, knowledge-intensive application into a Semantic Web framework.
Discourse analysis is the study of language in use and context. It examines both spoken and written language beyond the sentence level to understand how language functions in real world situations. Discourse analysis focuses on elements like the relationship between participants, speech acts, discourse structures, and how language varies based on context and the social activity. Both spoken and written discourse have conventions like openings, closings, turn-taking, and cohesive devices that link ideas together to aid interpretation. Analyzing these elements can provide insights into how language constructs meaning based on its form and the context in which it is used.
Lexical Semantics, Semantic Similarity and Relevance for SEO (Koray Tugberk GUBUR)
There are three main components of information retrieval systems: query understanding, document-query relevance understanding, and document clustering and ranking. The path from a search query to a search document involves several steps like query parsing, processing, augmenting, scoring, ranking, and clustering. Query understanding is where search engine optimization (SEO) begins, while document creation and ranking are other areas where SEO is applied. Cranfield experiments in the late 1950s helped develop the concept of a "search query language" which is different from the language used in documents. Formal semantics and components like tense, aspect, and mood can help machines better understand human language for information retrieval tasks.
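As a small illustration of document-query relevance scoring of the kind discussed above, here is a minimal TF-IDF scorer; the documents, query, and weighting are illustrative, not any search engine's actual formula:

```python
import math
from collections import Counter

docs = [
    "parable of the sower and the seed".split(),
    "the good shepherd lays down his life".split(),
    "a sower went out to sow his seed".split(),
]

def tfidf_score(query, doc):
    tf = Counter(doc)
    score = 0.0
    for term in query.split():
        df = sum(1 for d in docs if term in d)  # document frequency
        if df:
            score += tf[term] * math.log(len(docs) / df)
    return score

for doc in docs:
    print(round(tfidf_score("sower seed", doc), 3), " ".join(doc))
```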
The document discusses several key concepts related to formal semantics and information retrieval, including:
1) Formal semantics studies the meaning of natural language through theoretical approaches like compositionality and truth conditions. It helps machines process human language by understanding lexical relations and semantic scope.
2) Cranfield experiments in the late 1950s first identified differences between query language used by searchers and document language, inventing the concept of a "search language" to bridge this gap.
3) Lexical semantics analyzes relationships between words like synonyms, antonyms and semantic networks to help search engines understand query semantics rather than just document content.
Elementary explanation of the difficulties of combining indexes for web pages and books, and means by which book index data can optimize general web searches at scale.
The document discusses various issues relating to the concept of equivalence in translation. It addresses:
1) Definitions of equivalence, including being equal, comparable, or having the same meaning or function.
2) The relationship between the source text and target text, and how to define and measure their level of equivalence.
3) Theories of equivalence from scholars like Roman Jakobson, Vladimir Nabokov, and Eugene Nida that focus on different types and levels of equivalence.
This document discusses using a data-mining approach to perform word sense detection and disambiguation in biblical texts. It aims to identify the different senses of words in the Bible and disambiguate which sense each instance refers to. The approach uses multiple Bible translations linked to the original texts and groups instances based on translation word similarities through a progressive merging technique. This allows automatic identification of word senses using translation data in an efficient and objective manner to build sense dictionaries and enable refined Bible search and translation tools.
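A minimal sketch of the progressive-merging idea, under the assumption that instances of a source word are grouped by their translation words and groups are merged whenever their translation sets overlap; the lemma and translation data are illustrative:

```python
# Each instance of the source word carries the set of words used to
# translate it across versions (invented data).
instances = [
    ("logos", {"word"}), ("logos", {"word", "saying"}),
    ("logos", {"message"}), ("logos", {"saying"}),
    ("logos", {"account"}), ("logos", {"account", "reckoning"}),
]

clusters = [set(translations) for _, translations in instances]
merged = True
while merged:
    merged = False
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if clusters[i] & clusters[j]:       # shared translation word
                clusters[i] |= clusters.pop(j)  # merge into one sense group
                merged = True
                break
        if merged:
            break

print(clusters)  # [{'word', 'saying'}, {'message'}, {'account', 'reckoning'}]
```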
The document discusses various issues relating to equivalence in translation. It addresses:
- Definitions of equivalence as being commensurate, comparable, equal, or having the same meaning or function.
- The relationship between the source text and target text, and how to define and measure equivalence.
- Models of equivalence proposed by scholars such as Roman Jakobson, Vladimir Nabokov, J.C. Catford, Vinay and Darbelnet, and Eugene Nida that focus on various levels of correspondence between texts.
- Translation strategies and techniques such as borrowing, calques, literal translation, and modulation that affect equivalence.
- A text-based approach to translation proposed by Andrew Neubert focusing on the text, rather than the word or sentence, as the unit of translation.
This document discusses strategies for teaching students to comprehend informational texts, as required by the Common Core State Standards. It emphasizes increasing students' exposure to informational texts and teaching them text structures, such as compare/contrast, and elements, like author's purpose and main ideas. High impact methods are recommended, like explicit instruction, building vocabulary, and having students summarize within and between texts. Graphic organizers can help students learn about content topics. The overall goal is to help students develop familiarity and skill with informational texts.
The MSR-NLP Chinese word segmentation system is part of a full sentence analyzer. It uses a dictionary and rules for basic segmentation, morphology, and named entity recognition to build a word lattice. The system proposes new words, prunes the lattice, and uses a parser to produce the final segmentation. It participated in four segmentation bakeoff tracks, ranking highly in each. An analysis found that parameter tuning, morphology/NER, and lattice pruning contributed most to performance, while the parser helped less. Problems included inconsistent annotations and differences in defining new words.
This paper presents a method for automatically detecting and correcting erroneous characters in Chinese text. The method treats typo correction as an integral part of syntactic analysis. It considers both the original character and possible replacement characters from a list of confusable pairs during sentence parsing. The character that results in the best parse is identified as correct. The approach achieves substantially higher recall and precision than existing Chinese proofreaders, which do not perform a full syntactic analysis. An evaluation on 50 character pairs found an overall precision of 86.9% and recall of 96.3%. Cases involving characters that can only form words together tended to have perfect scores, while characters that can stand alone were more difficult to correct.
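A sketch of the parse-and-compare control flow described above; the confusable pair, the lexicon, and the stand-in scoring function are illustrative, since the real system scores full syntactic parses:

```python
CONFUSABLE = {"他": ["她"], "她": ["他"]}  # illustrative confusable pair

def correct(sentence, parse_score):
    """Try each confusable replacement; keep the variant the parser prefers."""
    best, best_score = sentence, parse_score(sentence)
    for i, ch in enumerate(sentence):
        for alt in CONFUSABLE.get(ch, ()):
            variant = sentence[:i] + alt + sentence[i + 1:]
            score = parse_score(variant)
            if score > best_score:
                best, best_score = variant, score
    return best

# Toy stand-in for the parser's score: count known multi-character words.
LEXICON = {"她是", "老师"}
toy_score = lambda s: sum(w in s for w in LEXICON)
print(correct("他是老师", toy_score))  # -> '她是老师' under this toy scorer
```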
Statistically-Enhanced New Word Identification (Andi Wu)
This document discusses a method for identifying new words in Chinese text using a combination of rule-based and statistical approaches. Candidate character strings are selected as potential new words based on their independent word probability being below a threshold. Parts of speech are then assigned to candidate strings by examining the part of speech patterns of their component characters and comparing them to existing words in a dictionary to determine the most likely part of speech based on word formation patterns in Chinese. This hybrid approach avoids the overgeneration of rule-based systems and data sparsity issues of purely statistical approaches.
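A minimal sketch of the candidate-selection step, assuming "independent word probability" is the fraction of a string's occurrences in which it stands alone as a word; the counts and threshold are invented for illustration:

```python
def independent_word_prob(standalone, total):
    """Fraction of a string's occurrences where it stands alone as a word."""
    return standalone / total if total else 0.0

# string -> (occurrences as an independent word, total occurrences);
# counts invented for illustration
counts = {"网络": (2, 40), "电脑": (35, 40), "博客": (1, 30)}
THRESHOLD = 0.2

candidates = [s for s, (standalone, total) in counts.items()
              if independent_word_prob(standalone, total) < THRESHOLD]
print(candidates)  # rarely independent strings become new-word candidates
```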
Learning Verb-Noun Relations to Improve Parsing (Andi Wu)
This document describes a learning procedure to automatically acquire knowledge about verb-noun relations in Chinese. It uses an existing parser, a large corpus, and statistical methods to learn which verb-noun pairs typically occur in a verb-object relation versus a modifier-head relation. The learned knowledge is then used to disambiguate parses, improving the accuracy of the original parser. An evaluation on 500 sentences showed the parser's accuracy improved significantly, with the correct analysis found for 350 sentences when using the acquired knowledge.
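A sketch of the statistical core of this idea: tally how often a verb-noun pair is parsed as verb-object versus modifier-head, then prefer the majority relation in ambiguous cases; the observations are invented, and English words stand in for the Chinese data:

```python
from collections import Counter

# (verb, noun, relation) tuples harvested from an automatically parsed
# corpus; "VO" = verb-object, "MH" = modifier-head. Data invented.
observations = [
    ("eat", "apple", "VO"), ("eat", "apple", "VO"),
    ("eat", "apple", "MH"), ("fry", "pan", "MH"),
]
stats = Counter(observations)

def preferred_relation(verb, noun):
    vo, mh = stats[(verb, noun, "VO")], stats[(verb, noun, "MH")]
    if vo == mh:
        return None            # no statistical evidence either way
    return "VO" if vo > mh else "MH"

print(preferred_relation("eat", "apple"))  # 'VO': use it to break parse ties
```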
Dynamic Lexical Acquisition in Chinese Sentence Analysis (Andi Wu)
This document discusses a method for dynamically acquiring lexical information during sentence analysis in order to improve the coverage of a parser without requiring manual dictionary editing. New words and attributes are proposed based on contextual templates and accepted or rejected based on whether they are needed to parse sentences successfully. Accepted proposals are stored in auxiliary lexicons which can then be combined with the main lexicon to improve parsing of future sentences, especially in domain-specific texts. Evaluation on a technical manual corpus showed the method significantly improved parsing accuracy by recognizing new words and attributes.
This paper presents a model for Chinese word segmentation that integrates it as part of sentence analysis using a parser. The model achieves high accuracy by resolving most ambiguities at the lexical level using dictionary information, but handles cases requiring syntactic context in the parsing process. The complexity usually associated with parsing is reduced by pruning implausible segmentations prior to parsing. The approach is implemented in a natural language understanding system developed at Microsoft Research.
This document discusses customizable segmentation of morphologically derived words in Chinese. It presents a system that can segment words in different ways to meet various user-defined standards. The system represents all morphologically derived words as word trees, where the root nodes are maximal words and leaf nodes are minimal words. Each non-terminal node has a resolution parameter that determines if its daughters are displayed as a single word or separate words. Different segmentations can then be obtained by specifying different combinations of these resolution parameters. This allows a single system to be customized for different segmentation needs.
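A minimal sketch of the word-tree data structure described above, where each non-terminal node carries a resolution flag and flattening the tree honors those flags; the example word is illustrative:

```python
class WordNode:
    def __init__(self, text, children=None, split=False):
        self.text = text               # surface string of this (sub)word
        self.children = children or []
        self.split = split             # True: show daughters as separate words

    def segment(self):
        if self.children and self.split:
            return [w for child in self.children for w in child.segment()]
        return [self.text]

# A derived word whose parts can be shown together or apart.
tree = WordNode("图书馆", [WordNode("图书"), WordNode("馆")], split=False)
print(tree.segment())  # ['图书馆']      (maximal word)
tree.split = True
print(tree.segment())  # ['图书', '馆']  (minimal words)
```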
The document discusses developing an intelligent search system for biblical texts that goes beyond traditional concordance searches based solely on identical word forms and word orders. It aims to enable searches based on similar meanings by accounting for syntax and semantics. An example is given of a traditional concordance search for an identical phrase across passages. The system seeks to improve on this by allowing searches for passages containing phrases that are not identical but have similar meanings.
This document provides information about how manuscripts submitted to UMI are reproduced on microfilm. It explains that the quality of reproduction depends on the quality of the original submitted. It details how oversize materials like maps are reproduced, and how photographs are reproduced either on the microfilm or available separately for an additional charge.
2. Global Bible Initiative
• Formerly “Asia Bible Society”
Our mission:
“Go therefore and make disciples of all nations, baptizing them in the name of the Father and of the Son and of the Holy Spirit, teaching them to observe all that I have commanded you. And behold, I am with you always, to the end of the age.”
Matthew 28:19-20 (ESV)
“Go into all the world and proclaim the gospel to the whole creation.”
Mark 16:15 (ESV)
3. Global Bible Initiative
Our work:
• Bible translation: Chinese, Cambodian (Khmer), Burmese, Hindi, Punjabi, Japanese
• Biblical Research with particular focus on Linguistic Analysis of source language data
• Linguistic Research with particular focus on the relations between source languages and target languages
• Software Development Platform to support the translation process
• The BibleGrapevine website
4. BibleGrapevine
Making the data available for linguistic research in the Bible
A website where Biblical scholars and students can explore the linguistic features and structures of the original Biblical texts and their translations.
5. Linguistic data deployed so far
• Trees – representation of syntactic structures
o Hebrew OT trees
o Greek NT trees
o Chinese trees of Bible translations
o English trees of Bible translations
• Tree alignment
o Alignment between source language trees and translation trees to represent correspondences at all levels
o Alignment between translations through the source languages
6. Trees
Representation of syntactic structures
• Layers of word groups
o Words
o Phrases
o Clauses
o Sentences
• Syntactic relations between words
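A minimal sketch of how the layered structure above might be modeled as data; the level names, relation labels, and example are assumptions for illustration, not the site's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    level: str        # "sentence" | "clause" | "phrase" | "word"
    relation: str     # syntactic relation to the parent group
    text: str = ""
    children: list = field(default_factory=list)

# John 1:1a, heavily simplified for illustration
sentence = Node("sentence", "root", children=[
    Node("clause", "main", children=[
        Node("phrase", "subject", children=[
            Node("word", "determiner", "ho"),
            Node("word", "head", "logos"),
        ]),
        Node("word", "predicate", "ēn"),
    ]),
])
```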
12. Tree Alignment
Correspondence at all levels of syntactic structures
• Words
• Phrases
• Clauses
• Sentences
Types of Correspondence
• One-to-one
• One-to-many
• Many-to-one
• Many-to-many
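One way to model links that support all four correspondence types is to let each link join a set of source units to a set of target units; the sketch below assumes illustrative unit IDs, not the site's actual identifiers:

```python
# Each link joins a set of source units to a set of target units, so
# one-to-many, many-to-one, and many-to-many need no special cases.
links = [
    ({"grc:1.3"}, {"eng:1.4"}),              # one-to-one
    ({"grc:1.5"}, {"eng:1.6", "eng:1.7"}),   # one-to-many
    ({"grc:1.8", "grc:1.9"}, {"eng:1.10"}),  # many-to-one
]

def targets_of(unit_id):
    """All target units linked, directly or jointly, to a source unit."""
    return set().union(*(tgt for src, tgt in links if unit_id in src))

print(targets_of("grc:1.5"))  # {'eng:1.6', 'eng:1.7'}
```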
16. BibleGrapevine
What distinguishes BibleGrapevine from other Bible study websites?
• Focus on linguistic aspects of the Bible
• Linguistic Units in different granularities
• Detailed analysis of translations
• Refined links between translations and original texts
• Data visualization
• Intelligent search
17. BibleGrapevine
Features developed so far:
• Basic Views
• Interlinear Views
• Reference Views
• Tree Views
• Translation Memory Views
• Concordance Views
28. Interlinear View
Traditional interlinear: word-based
o Not easy to represent one-to-many and many-to-many correspondences
o Hard to represent correspondences between discontinuous units
o Not easy to see correspondences between bigger chunks of text
Dynamic interlinear: variable linguistic units
o Easy to see correspondences between any units
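A toy rendering of the dynamic-interlinear idea: correspondences are displayed between whole chunks of variable size rather than word by word; the chunk pairs are illustrative:

```python
# Correspondences between variable-size chunks stay readable even when a
# single source phrase maps to several target words.
pairs = [
    (["en", "archē"], ["in", "the", "beginning"]),  # phrase-to-phrase
    (["ēn"], ["was"]),
    (["ho", "logos"], ["the", "Word"]),
]
for src, tgt in pairs:
    print(f"{' '.join(src):<12}| {' '.join(tgt)}")
```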
52. Next?
Search for similar words, phrases, clauses and sentences
• Beyond identical texts: similar words in similar syntactic relations
• Probabilistic: results ranked by similarity scores
• Search item: type in the search box or select any chunk of text from any view
Explore Word Senses
• Translations as reflecting different senses of words
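A minimal sketch of ranked similarity search as outlined above; the similarity function here (token overlap) is a stand-in for the planned meaning-based scoring, which this sketch does not attempt:

```python
def similarity(query, chunk):
    """Jaccard token overlap; a stand-in for meaning-based scoring."""
    q, c = set(query), set(chunk)
    return len(q & c) / len(q | c) if q | c else 0.0

chunks = [
    ("Jn 1:1", "in the beginning was the word".split()),
    ("Gen 1:1", "in the beginning god created".split()),
    ("Jn 1:14", "the word became flesh".split()),
]
query = "the word was god".split()
ranked = sorted(chunks, key=lambda item: -similarity(query, item[1]))
for ref, tokens in ranked:
    print(round(similarity(query, tokens), 2), ref)
```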
53. Questions
• Who do you think will be the main users of this website?
• Which views are more useful to the users?
• What additional views/functionalities are desirable?
Editor's Notes
When I presented at the last BibleTech two years ago, the organization I work for was called Asia Bible Society. This name can be confusing and may make people think that we were affiliated with other Bible societies, such as American Bible Society and German Bible Society. Therefore we came up with a new name: Global Bible Initiative. Our mission is the Great Commission.
In the last 15 years or so, we have spent most of our time on Bible translation. We’ve been working on new translations in Chinese, Cambodian, Burmese, Hindi, Punjabi, and Japanese.
While working on Bible translation, we did a lot of Biblical Research with particular focus on Linguistic Analysis of source language data. We also did a lot of linguistic research with particular focus on the relations between source languages and target languages. All the translations are carefully linked to the source language data, resulting in a huge knowledge base of Biblical language data.
Computer technology has been used extensively in the creation of this knowledge base, and a software platform has been developed to support all the translation work.
More recently, we started developing a website called BibleGrapevine, where we try to use this knowledge base to create applications for Biblical scholars and students to explore the linguistic features and structures of the original Biblical texts and their translations. You can tell why we chose this name if you know our connection with GrapeCity.
The website is still under development. The linguistic data that has been deployed so far consist of trees and tree alignment only.
Trees are representations of syntactic structures. They show the word groups in different layers: words, phrases, clauses, and sentences. They also show the syntactic relations between the words.
Tree alignment represents the relationships between source languages and translations. Once the translations are all linked to the source languages, alignments between different translations can be established through the source languages, so no direct alignment between the translations is necessary.
Tree alignment identifies correspondences at all levels of syntactic structure: words, phrases, clauses, and sentences. The correspondence can be one-to-one, one-to-many, many-to-one, or many-to-many.
Demo
Demo
Demo
So, what distinguishes BibleGrapevine from other Bible study websites?
Focus on linguistic aspects of the Bible. It doesn’t have commentaries, at least for now, but the linguistic analysis is deeper than ever before.
Linguistic Units in different granularities. All the features operate not only on the word level and verse level, but on all levels, from words and phrases to sentences.
Detailed analysis of translations. For example, all the translation texts are syntactically analyzed.
Refined links between translations and original texts. Everything is accounted for.
Data visualization, as you will see in the demo.
Intelligent search: meaning-based search rather than string-based search.
In the remaining time of this presentation, I will give you a demo of the system.
The system is still under development. Here are the views that have already been developed. I’ll show them one by one in the demo.
Here are the features that are yet to be developed.
The site is not public yet, so you won’t find it on Google.
It is a testing site and it uses a small dataset in order to speed up the development cycle and save space. The test set contains the books of Matthew and John.
A bird’s eye view of the whole Bible, with the size of each book proportional to the number of words in the book.
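A minimal sketch of that sizing rule, with invented word counts:

```python
# Display area per book proportional to its word count (counts invented).
word_counts = {"Matthew": 18300, "Mark": 11300, "John": 15600}
total = sum(word_counts.values())
canvas_area = 100_000  # total drawing area in square pixels

for book, n in word_counts.items():
    print(f"{book:>8}: {canvas_area * n / total:,.0f} px²")
```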