Segmentation and Transcription
Guidelines
Feda Negesse (PhD)
Language Technology Group
Department of Linguistics
Addis Ababa University
February 26-27, Adama
1. Segmentation
1.1. Utterances should be segmented into words following the orthographic conventions of the language.
E.g.
 Keessaatti > keessa + isaatti
 Irrakeessatti > irra + keessa + isaatti
 Gamana > gama + kana (not needed)
 Obboleessakeetti > obboleessa + keetti
 Oldeeme > ol + deeme
 Olkeesuu > ol + kaasuu
 Gabbaye > gadi + baye
 Gaqqabi > gadi + qabi
 Deemeera > deemee + jira (as it is)
 Deemuuf > (as it is)
 TolaafiBantii > Tolaa + fi + Bantii
 Ofjaalata > Of + jaalata
1.2. No utterance can be longer than 15 seconds.
1.3. Independent words should be separated.
1.4. More than one dependent word should not be affixed.
1.5. Meaningless lexical elements should not stand alone.
1.6. Whenever possible, choose a segmentation that maintains the phrase structure of the conversation.
1.7. Treat a stretch of silence that has small-amplitude noises embedded in it as a silence-only utterance.
1.8. Do not mark these noises, and do not segment them into separate utterances.
1.9. However, if a noise has a particularly high amplitude, segment it into its own utterance.
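Rules 1.2 and 1.7 to 1.9 can be checked mechanically once segment boundaries and amplitudes are available. The following is a minimal sketch in Python, not part of the guidelines: the `Segment` fields, the normalized peak amplitude, and the `LOUD_NOISE_PEAK` threshold are all assumptions, since the guidelines give no numeric definition of "particularly high amplitude".

```python
from dataclasses import dataclass

MAX_UTTERANCE_SECONDS = 15.0  # rule 1.2
LOUD_NOISE_PEAK = 0.5         # assumed threshold on a 0..1 scale (not from the guidelines)


@dataclass
class Segment:
    start: float       # start time in seconds
    end: float         # end time in seconds
    has_speech: bool   # True if the annotator heard speech in the segment
    peak: float        # peak amplitude, normalized to 0..1 (assumed representation)

    @property
    def duration(self) -> float:
        return self.end - self.start


def too_long(segments):
    """Return the segments that exceed the 15-second limit (rule 1.2)."""
    return [s for s in segments if s.duration > MAX_UTTERANCE_SECONDS]


def classify_non_speech(segment):
    """Label a segment following rules 1.7 to 1.9."""
    if segment.has_speech:
        return "speech"
    if segment.peak >= LOUD_NOISE_PEAK:
        return "noise"    # rule 1.9: a loud noise becomes its own utterance
    return "silence"      # rules 1.7 and 1.8: quiet noises stay inside the silence
```

The threshold would need to be tuned by listening to the recordings; the sketch only shows where such a decision would sit.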
2. Transcription
2.1. Transcribe verbatim, without correcting grammatical errors, e.g. “Inni deemte”.
2.2. Do not change an individual's pronunciation into the ‘standard pronunciation’, e.g. do not change “keeysa” to “keessa”.
2.3. Follow the dictionary on hyphenating compounds in clear-cut cases, but “when in doubt, leave them out.” E.g. bu’aa-ba’ii
2.4. Compound words: All compound words should be transcribed as one word when such a word exists in the dictionary, unless there is an acoustic pause between the two words, e.g. “gamana”, “garana”, “ofirra”, etc.
2.5. Avoid abbreviations and symbols: write “dhibbentaa”, not “%”.
2.6. Contractions are allowed, e.g. “deemeera”, “kale”.
2.7. Capitalization: Use normal capitalization on proper nouns. Do not capitalize the beginning of a sentence.
2.8. No punctuation should be used in the transcriptions.
2.9. Watch for common spelling confusions such as: its and it’s; they’re, there and their; by and bye; to and too; etc.
2.10. Numbers: Spell out all number sequences except in cases such as “123” or “101”, where the numbers have a specific meaning.
2.11. Transcribe years like 1983 as spoken: “nineteen eighty three.”
2.12. Do not use hyphens in spelled-out numbers (“twenty eight”, not “twenty-eight”).
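The number rules above can be illustrated with a small helper. This is only a sketch for the English examples on the slide: the function names are hypothetical, it covers two-digit numbers and years from 1100 to 1999 read as two pairs, and the transcriber's ear remains the authority on what was actually spoken.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out 0..99 without hyphens (rule 2.12)."""
    if not 0 <= n <= 99:
        raise ValueError("only 0..99 is supported in this sketch")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def spell_year(year: int) -> str:
    """Spell a year as two pairs, e.g. 1983 (rule 2.11)."""
    if not 1100 <= year <= 1999:
        raise ValueError("only years 1100..1999 are supported in this sketch")
    high, low = divmod(year, 100)
    if low == 0:
        return f"{two_digits(high)} hundred"
    if low < 10:
        # Spoken forms vary here ("oh five", "hundred and five"), so do not guess.
        raise ValueError("ambiguous spoken form; transcribe by ear")
    return f"{two_digits(high)} {two_digits(low)}"
```

For example, `spell_year(1983)` yields `nineteen eighty three` and `two_digits(28)` yields `twenty eight`.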
2.12. Letter sequences: Spell out letter sequences: DFW, USA, FBI, NASA, ROM.
2.13. Possessives: Use standard grammar rules to denote possession: the US’s policy, Sally’s book, the drivers’ cars, the CEO’s decision, the dancers’ shoes.
2.13. If a speaker does not completely pronounce a word and the word is not a standard reduction, spell out as much of the word as is pronounced and, inside brackets, spell out the part of the word that was not pronounced.
2.14. To flag that a partial word was spoken, use a single dash after the brackets if the last part of the word was not pronounced, and a single dash before the brackets if the first part was not pronounced.
2.15. Use context to determine which word was intended. If no reasonable intended word can be determined from context, mark it as [vocalized-noise].
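The bracket-and-dash notation for partial words can be sketched as a small formatter. The function name `mark_partial` and the English test words are illustrative assumptions; the slide itself gives no worked example for this rule.

```python
def mark_partial(intended: str, pronounced: str) -> str:
    """Format a partially pronounced word with brackets and a dash.

    The unpronounced part goes inside brackets; the dash follows the
    brackets when the end of the word is missing and precedes them when
    the beginning is missing.
    """
    if not pronounced or pronounced == intended:
        raise ValueError("expected a proper, non-empty part of the word")
    if intended.startswith(pronounced):
        missing = intended[len(pronounced):]
        return f"{pronounced}[{missing}]-"
    if intended.endswith(pronounced):
        missing = intended[:-len(pronounced)]
        return f"-[{missing}]{pronounced}"
    raise ValueError("pronounced part is neither the start nor the end of the word")
```

Under these assumptions, "ab" spoken for "about" becomes `ab[out]-`, and "cause" spoken for "because" becomes `-[be]cause`.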
2.16. Restarts of “i”: If a speaker restarts when saying the word “i”, transcribe it as “i-”.
2.17. Mispronunciations: If a speaker mispronounces a word and the mispronunciation is not an actual word, transcribe the word as it is spoken, followed by the word that was intended.
2.18. Divide these two words by a forward slash and enclose both words in brackets.
E.g. i wasn’t sure that they were blaming that [splace/space] space disaster on one company
2.19. If a speaker uses and gives meaning to a word that is not an actual word, spell the word out as it sounds and enclose it in braces.
E.g. How are things for you {weatherwise}
2.20. If one of the speakers involved in the conversation talks to someone in the background and the words can be understood, transcribe it as an aside enclosed in the markers <b_aside> and <e_aside>.
E.g. “yeah i know what you <b_aside> honey i can’t play with you right now i’m on the phone <e_aside> sorry you know kids always want mommy all to themselves”
2.21. This only applies if one of the conversation speakers is involved in the background conversation.
2.22. If only background speakers can be heard, treat it either as noise or as background noise, depending on the energy level of the background speakers compared to the foreground speakers.
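Because the aside markers <b_aside> and <e_aside> come in pairs, a transcript can be checked for unclosed or stray markers. The sketch below is one possible check; the function name is hypothetical, and treating nested asides as an error is an assumption, since the guidelines do not say whether asides may nest.

```python
import re

ASIDE_TOKEN = re.compile(r"<(b|e)_aside>")


def asides_balanced(transcript: str) -> bool:
    """Return True if every <b_aside> is closed by a later <e_aside>.

    Nested asides are rejected (an assumption of this sketch).
    """
    inside = False
    for match in ASIDE_TOKEN.finditer(transcript):
        opening = match.group(1) == "b"
        # Opening while already inside, or closing while outside, is an error.
        if opening == inside:
            return False
        inside = opening
    return not inside
```

A check like this catches only marker errors; whether a stretch of speech really is an aside is still the annotator's judgment.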
2.23. Hesitation sounds: Use “uh” or “ah” for hesitations consisting of a vowel sound, and “um” or “hm” for hesitations with a nasal sound, depending upon which transcription the actual sound is closest to.
2.24. Use “huh” for the aspirated version of the hesitation, as in: "huh? <other speaker responds> um ok, I see your point."
2.25. Yes/no sounds: Use “uh-huh” or “um-hum” (yes) and “huh-uh” or “hum-um” (no) for anything remotely resembling these sounds of assent or denial.
2.26. You may use “yeah,” “yep,” and “nope” if that is what the words sound like.
2.27. Non-speech sounds during conversations: transcribe these using only the following expressions in brackets: [laughter], [noise], [vocalized-noise]. Pick the closest description ([noise] will be adequate in most cases).
2.28. Laughter during speech: If laughter occurs directly before a word, place the [laughter] tag before the spoken word.
2.30. If laughter occurs after a spoken word, place the [laughter] tag after the word.
2.31. If the speaker laughs while saying the word, but the word is still understood, transcribe this as [laughter-word], where "word" is the word spoken during the laughter.
2.32. If the speech is obliterated by the laughter, transcribe it strictly as [laughter].
2.33. If a speaker laughs while saying several words and the words are understood, transcribe each word in the phrase as [laughter-word].
2.34. For example, laughter throughout the phrase “you don’t say” would be transcribed as: [laughter-you] [laughter-don’t] [laughter-say].
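Tagging every word of a phrase spoken through laughter is mechanical, so it can be sketched in one line of Python. The helper name `laugh_through` is hypothetical.

```python
def laugh_through(phrase: str) -> str:
    """Wrap each whitespace-separated word as [laughter-word]."""
    return " ".join(f"[laughter-{word}]" for word in phrase.split())
```

Applied to the phrase "you don't say", this gives `[laughter-you] [laughter-don't] [laughter-say]`, matching the example above.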
2.35. Pronunciation variants: words should be transcribed as they are said.
E.g.
 about _ b aw t
 because _ k ah z
 depends _ p eh n d z
 them _ eh m
2.37. Consider continuous background noise as part of the channel.
2.38. For example, if a baby cries at a consistent energy level throughout the conversation, treat it as background noise.
2.39. Only consider it as noise if it grows much louder than the normal level.
2.40. The baby screaming would warrant considering it as noise. In this case, mark it as [noise].
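One way to make "much louder than the normal level" concrete is to compare each frame's energy with a baseline for the whole recording. The sketch below assumes per-frame energy values are already available and uses the median as the baseline; both the function name and the factor of 4 are assumptions, as the guidelines give no number.

```python
from statistics import median

LOUDNESS_FACTOR = 4.0  # assumed; the guidelines only say "much louder"


def noise_frames(energies, factor=LOUDNESS_FACTOR):
    """Return indices of frames whose energy exceeds factor x the median energy.

    Frames at or near the median are treated as part of the channel;
    only the outliers would be candidates for a [noise] mark.
    """
    baseline = median(energies)
    return [i for i, e in enumerate(energies) if e > factor * baseline]
```

The flagged frames are only candidates: the annotator still decides by listening whether a [noise] tag is warranted.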
2.41. In general, abbreviations should be avoided and words should be transcribed exactly as spoken.
2.42. The exception is that when abbreviations are used as part of a personal title, they remain as abbreviations, as in standard writing:
E.g.
 Mr. Brown
 Mrs. Jones
 Dr. Spock
2.43. Acronyms that are normally written as a single word but pronounced as a sequence of individual letters should be written in all caps, with each individual letter surrounded by spaces.
2.44. Similarly, individual letters that are pronounced as such should be written in caps:
E.g.
 I got an A on the test.
 How ’bout if his name was spelled M U H R?
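The spaced, all-caps form for letter-by-letter acronyms can be produced with a short helper. The name `spell_letters` is hypothetical, and the helper should only be applied to acronyms actually pronounced as individual letters, which the transcriber has to judge by ear.

```python
def spell_letters(acronym: str) -> str:
    """Write an acronym in caps with a space between the letters."""
    if not acronym.isalpha():
        raise ValueError("expected letters only")
    return " ".join(acronym.upper())
```

For instance, `spell_letters("FBI")` gives `F B I`.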
2.45. When a speaker breaks off in the middle of a word, annotators transcribe as much of the word as can be made out.
2.46. A single dash without a preceding space is used to indicate the point at which the word was broken off.
2.47. If transcribers can make a reasonable guess at which word was intended by the speaker, they should include the full form of the word immediately after the truncated form, preceded by a plus sign + (without separating spaces).
E.g.
 Yes, absolu- +absolutely absolutely.
 Well, I gue- +guess -- I would think this is what they intended.
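The truncation notation can be sketched as a formatter that appends the dash and, when a guess is available, the plus-prefixed full form. The function name `mark_truncation` is an assumption for illustration.

```python
from typing import Optional


def mark_truncation(fragment: str, guess: Optional[str] = None) -> str:
    """Format a broken-off word, optionally followed by the guessed full form."""
    marked = f"{fragment}-"        # dash directly after the fragment, no space
    if guess:
        marked += f" +{guess}"     # plus sign attached to the guessed word
    return marked
```

With the examples above, `mark_truncation("absolu", "absolutely")` gives `absolu- +absolutely`, and `mark_truncation("gue")` gives `gue-` when no guess is made.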
2.48. Speaker restarts are indicated with a double dash -- surrounded by spaces.
2.49. Annotators use this convention for cases where a speaker stops short, cutting him/herself off before continuing with or rephrasing the utterance.
E.g.
 Did people uh -- did fights ever break out uh over hockey?
 Since she -- when she died we moved from across the street.
2.50. An asterisk * is used for obviously mispronounced words (not regional or nonstandard dialect pronunciation), or for words that are made up on the spot by the speaker or idiosyncratic to that speaker’s usage.
2.51. Annotators should transcribe using the standard spelling and should not try to represent the pronunciation.
E.g.
 They have as much *knowledgement about things as we’ve got.
 He insisted that we ((*teak)) -- talk to him in Italian
2.52. Sometimes an audio file will contain a section of speech that is difficult or impossible to understand. In these cases, annotators use double parentheses (( )) to mark the region of difficulty.
2.53. If it is possible to make a guess about the speaker’s words, annotators should transcribe what they think they hear and surround the stretch of uncertain transcription with double parentheses:
E.g.
 And she told me that ((I should just leave.))
2.54. In addition to the transcription conventions outlined above, the following symbols are used for the transcription of other kinds of noises made by either the main speaker or one of the other participants in the interviews:
 {BR} breath (The speaker takes an audible breath.)
 {CG} cough (The speaker coughs, or clears his/her throat.)
 {LS} lip smack (The speaker smacks his/her lips.)
 {LG} laughter (The speaker laughs.)
 {NS} noise (Loud background noise, e.g. a door slamming, cars honking etc.)
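Since the noise symbols form a closed set, a transcript can be scanned for tags outside it. The sketch below is a possible check, with a hypothetical function name; it only looks at upper-case tags in braces, so lower-case coined words in braces are ignored.

```python
import re

# The five symbols listed in rule 2.54.
NOISE_TAGS = {
    "BR": "breath",
    "CG": "cough",
    "LS": "lip smack",
    "LG": "laughter",
    "NS": "noise",
}
TAG_PATTERN = re.compile(r"\{([A-Z]+)\}")


def unknown_noise_tags(transcript: str):
    """Return the upper-case brace tags that are not in the allowed set."""
    found = TAG_PATTERN.findall(transcript)
    return sorted({tag for tag in found if tag not in NOISE_TAGS})
```

An empty result means every upper-case tag in the transcript is one of the five listed symbols.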
Saving
 The file name should be the same as the tier name.
 GLOSSA can easily find the file if the file name and the tier name match.
 Add this information to the metadata.
 Thank You!
 Questions

More Related Content

What's hot

What's hot (20)

morphological analysis of arabic and english language
morphological analysis of arabic and english languagemorphological analysis of arabic and english language
morphological analysis of arabic and english language
 
The locality principle- kiran
The locality principle- kiranThe locality principle- kiran
The locality principle- kiran
 
Prosodic phonology ms ferrer
Prosodic phonology   ms ferrerProsodic phonology   ms ferrer
Prosodic phonology ms ferrer
 
Pragmatics: Deixis And Distance By Dr.Shadia.Pptx
Pragmatics:  Deixis And Distance By Dr.Shadia.PptxPragmatics:  Deixis And Distance By Dr.Shadia.Pptx
Pragmatics: Deixis And Distance By Dr.Shadia.Pptx
 
Syllable and syllabification
Syllable and syllabificationSyllable and syllabification
Syllable and syllabification
 
Language planning
Language planningLanguage planning
Language planning
 
A critical view of ELT history
A critical view of ELT history A critical view of ELT history
A critical view of ELT history
 
Language teaching approaches
Language teaching approaches Language teaching approaches
Language teaching approaches
 
Communicative language teaching
Communicative language teachingCommunicative language teaching
Communicative language teaching
 
Pragmatics: Deixis
Pragmatics: DeixisPragmatics: Deixis
Pragmatics: Deixis
 
Sociolinguistics : Language Change
Sociolinguistics : Language ChangeSociolinguistics : Language Change
Sociolinguistics : Language Change
 
Language shift maintenance death
Language shift maintenance deathLanguage shift maintenance death
Language shift maintenance death
 
Predicates in Semantic
Predicates in SemanticPredicates in Semantic
Predicates in Semantic
 
Mutual intelligibility
Mutual intelligibilityMutual intelligibility
Mutual intelligibility
 
Subtitution dan Ellips
Subtitution dan EllipsSubtitution dan Ellips
Subtitution dan Ellips
 
Ling101 phonological rules
Ling101 phonological rulesLing101 phonological rules
Ling101 phonological rules
 
Diglossia
DiglossiaDiglossia
Diglossia
 
Misunderstanding bilingualism
Misunderstanding bilingualismMisunderstanding bilingualism
Misunderstanding bilingualism
 
Sociolinguistics
SociolinguisticsSociolinguistics
Sociolinguistics
 
Pidgin and creole
Pidgin and creole Pidgin and creole
Pidgin and creole
 

Similar to Segmentation and transcription

Suprasegmental or prosodic properties
Suprasegmental or prosodic propertiesSuprasegmental or prosodic properties
Suprasegmental or prosodic propertiesDewi Atin Surya
 
Grammar elements and their effect on writing
Grammar elements and their effect on writingGrammar elements and their effect on writing
Grammar elements and their effect on writingShobitash Jamwal
 
Study Of English Stress And Intonation
Study Of English Stress And IntonationStudy Of English Stress And Intonation
Study Of English Stress And Intonationsatheesh hendhino
 
Online Meeting Annotation Guideline.pdf.
Online Meeting Annotation Guideline.pdf.Online Meeting Annotation Guideline.pdf.
Online Meeting Annotation Guideline.pdf.barainipa873
 
Stylistic Classification of English Vocabulary
Stylistic Classification of English VocabularyStylistic Classification of English Vocabulary
Stylistic Classification of English VocabularyIrina K
 
English module for intermediate students
English module for intermediate studentsEnglish module for intermediate students
English module for intermediate studentsAkbar Fauzan
 
The features of the connected speech final
The features of the connected speech finalThe features of the connected speech final
The features of the connected speech finalHina Honey
 
Sk rpt bahasa inggeris tahun 2 (edited)
Sk rpt bahasa inggeris tahun 2 (edited)Sk rpt bahasa inggeris tahun 2 (edited)
Sk rpt bahasa inggeris tahun 2 (edited)bujalsksrd
 
ELE 11 LESSON 2.pptx
ELE 11 LESSON 2.pptxELE 11 LESSON 2.pptx
ELE 11 LESSON 2.pptxJrJackSauping
 
Teaching vocabulary
Teaching vocabularyTeaching vocabulary
Teaching vocabularycoolsimo
 

Similar to Segmentation and transcription (20)

Suprasegmental or prosodic properties
Suprasegmental or prosodic propertiesSuprasegmental or prosodic properties
Suprasegmental or prosodic properties
 
English Pholonogy (II Bimestre)
English Pholonogy (II Bimestre)English Pholonogy (II Bimestre)
English Pholonogy (II Bimestre)
 
Grammar elements and their effect on writing
Grammar elements and their effect on writingGrammar elements and their effect on writing
Grammar elements and their effect on writing
 
Prominence
ProminenceProminence
Prominence
 
Study Of English Stress And Intonation
Study Of English Stress And IntonationStudy Of English Stress And Intonation
Study Of English Stress And Intonation
 
Online Meeting Annotation Guideline.pdf.
Online Meeting Annotation Guideline.pdf.Online Meeting Annotation Guideline.pdf.
Online Meeting Annotation Guideline.pdf.
 
Stylistic Classification of English Vocabulary
Stylistic Classification of English VocabularyStylistic Classification of English Vocabulary
Stylistic Classification of English Vocabulary
 
Rabbi(final)
Rabbi(final)Rabbi(final)
Rabbi(final)
 
Phonetic7
Phonetic7Phonetic7
Phonetic7
 
5. Enunciation-Edited.docx
5. Enunciation-Edited.docx5. Enunciation-Edited.docx
5. Enunciation-Edited.docx
 
Prominence and intonation
Prominence and intonationProminence and intonation
Prominence and intonation
 
Prominence and intonation
Prominence and intonationProminence and intonation
Prominence and intonation
 
English module for intermediate students
English module for intermediate studentsEnglish module for intermediate students
English module for intermediate students
 
Punctuation
PunctuationPunctuation
Punctuation
 
The features of the connected speech final
The features of the connected speech finalThe features of the connected speech final
The features of the connected speech final
 
SPEAKING SKILLS
SPEAKING SKILLSSPEAKING SKILLS
SPEAKING SKILLS
 
Sk rpt bahasa inggeris tahun 2 (edited)
Sk rpt bahasa inggeris tahun 2 (edited)Sk rpt bahasa inggeris tahun 2 (edited)
Sk rpt bahasa inggeris tahun 2 (edited)
 
Tania ppt
Tania pptTania ppt
Tania ppt
 
ELE 11 LESSON 2.pptx
ELE 11 LESSON 2.pptxELE 11 LESSON 2.pptx
ELE 11 LESSON 2.pptx
 
Teaching vocabulary
Teaching vocabularyTeaching vocabulary
Teaching vocabulary
 

More from Qobboo

E-learning
E-learningE-learning
E-learningQobboo
 
Cognitive control
Cognitive controlCognitive control
Cognitive controlQobboo
 
Effective listening
Effective listeningEffective listening
Effective listeningQobboo
 
Video recording
Video recordingVideo recording
Video recordingQobboo
 
Tutorial for Wavepad
Tutorial for WavepadTutorial for Wavepad
Tutorial for WavepadQobboo
 
Audio recordings
Audio recordingsAudio recordings
Audio recordingsQobboo
 

More from Qobboo (6)

E-learning
E-learningE-learning
E-learning
 
Cognitive control
Cognitive controlCognitive control
Cognitive control
 
Effective listening
Effective listeningEffective listening
Effective listening
 
Video recording
Video recordingVideo recording
Video recording
 
Tutorial for Wavepad
Tutorial for WavepadTutorial for Wavepad
Tutorial for Wavepad
 
Audio recordings
Audio recordingsAudio recordings
Audio recordings
 

Recently uploaded

NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...Amil baba
 
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjSCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjadimosmejiaslendon
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证acoha1
 
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...ssuserf63bd7
 
Bios of leading Astrologers & Researchers
Bios of leading Astrologers & ResearchersBios of leading Astrologers & Researchers
Bios of leading Astrologers & Researchersdarmandersingh4580
 
原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证pwgnohujw
 
Digital Marketing Demystified: Expert Tips from Samantha Rae Coolbeth
Digital Marketing Demystified: Expert Tips from Samantha Rae CoolbethDigital Marketing Demystified: Expert Tips from Samantha Rae Coolbeth
Digital Marketing Demystified: Expert Tips from Samantha Rae CoolbethSamantha Rae Coolbeth
 
如何办理加州大学伯克利分校毕业证(UCB毕业证)成绩单留信学历认证
如何办理加州大学伯克利分校毕业证(UCB毕业证)成绩单留信学历认证如何办理加州大学伯克利分校毕业证(UCB毕业证)成绩单留信学历认证
如何办理加州大学伯克利分校毕业证(UCB毕业证)成绩单留信学历认证a8om7o51
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Valters Lauzums
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives23050636
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证zifhagzkk
 
Seven tools of quality control.slideshare
Seven tools of quality control.slideshareSeven tools of quality control.slideshare
Seven tools of quality control.slideshareraiaryan448
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsBrainSell Technologies
 
Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"John Sobanski
 
What is Insertion Sort. Its basic information
What is Insertion Sort. Its basic informationWhat is Insertion Sort. Its basic information
What is Insertion Sort. Its basic informationmuqadasqasim10
 
Formulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdfFormulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdfRobertoOcampo24
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证acoha1
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样jk0tkvfv
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证pwgnohujw
 

Recently uploaded (20)

NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
 
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjSCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
 
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
 
Bios of leading Astrologers & Researchers
Bios of leading Astrologers & ResearchersBios of leading Astrologers & Researchers
Bios of leading Astrologers & Researchers
 
原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证
 
Digital Marketing Demystified: Expert Tips from Samantha Rae Coolbeth
Digital Marketing Demystified: Expert Tips from Samantha Rae CoolbethDigital Marketing Demystified: Expert Tips from Samantha Rae Coolbeth
Digital Marketing Demystified: Expert Tips from Samantha Rae Coolbeth
 
如何办理加州大学伯克利分校毕业证(UCB毕业证)成绩单留信学历认证
如何办理加州大学伯克利分校毕业证(UCB毕业证)成绩单留信学历认证如何办理加州大学伯克利分校毕业证(UCB毕业证)成绩单留信学历认证
如何办理加州大学伯克利分校毕业证(UCB毕业证)成绩单留信学历认证
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
 
Seven tools of quality control.slideshare
Seven tools of quality control.slideshareSeven tools of quality control.slideshare
Seven tools of quality control.slideshare
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data Analytics
 
Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"
 
What is Insertion Sort. Its basic information
What is Insertion Sort. Its basic informationWhat is Insertion Sort. Its basic information
What is Insertion Sort. Its basic information
 
Formulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdfFormulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdf
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
 

Segmentation and transcription

  • 1. Segmentation and Transcription Guidelines FedaNegesse(PhD) LanguageTechnology Group Departmentof Linguistics AddisAbaba University February 26 -27, Adama
  • 2. 1. Segmentation 1.1. Utterance should be segmented into words following orthographic conventions of the language. E.g.  Keessaattti > keessa + isaatti  Irrakeessatti > irra + keessa + isaatti  Gamana > gama + kana ( not needed)  Obboleessakeetti > obboleessa + keetti  Oldeeme > ol + deeme  Olkeesuu > ol + kaasuu
  • 3. Segmentation…  Gabbaye> gadi + baye  Gaqqabi > gadi + qabi  Deemeera> deemee + jira ( as it is )  Deemuuf > ( as it is )  TolaafiBantii > Tolaa + fi + Bantii  Ofjaalata > Of + jaalata
  • 4. Segmentation… 1.2. No utterance can be longer than 15 seconds. 1.3. Independent words should be separated. 1.4. More than one dependent word should not be affixed. 1.5. Meaningless lexical elements should not stand alone. 1.6.Whenever possible choose a segmentation that maintains the phrase structure of the conversation.
  • 5. Segmentation… 1.7. Consider a stretch of silence which has small amplitude noises embedded in it as a silence only utterance. 1.8. Do not mark the noise and do not segment the noises into separate utterances. 1.9. However, if a noise has a particularly high amplitude, then segment it into its own utterance.
  • 6. Transcription 2.1. Transcribe “verbatim,” without correcting grammatical errors: “Inni deemte,” 2.2. Do not change an individual pronunciation into the ‘standard pronunciation’: “ Keeysa > keessa”. 2.3. Follow the dictionary on hyphenating compounds in clear-cut cases. But “when in doubt, leave them out.” e.g. bu’aa-ba’ii 2.4. Compound words: All compound words should be transcribed as one word when such a word exists in the dictionary unless there is an acoustical pause between the two words. e.g. “gamana”, “garana”, “ofirra”, etc
  • 7. Transcription … 2.5. Try to avoid word abbreviations: dhibbentaa , not % . 2.6. Contractions are allowed. e.g. “deemeera”, “kale” 2.7. Capitalization: Use normal capitalization on proper nouns. Do not capitalize the beginning of the sentence. 2.8. No punctuation should be used in the transcriptions 2.9. Remember to watch for common spelling confusions like: its and it’s, they’re, there and their, by and bye, to and too, etc.
  • 8. Transcription … 2.10. Numbers: Spell out all number sequences except in cases such as “123” or “101” where the numbers have a specific meaning. 2.11. Transcribe years like 1983 as spoken — “nineteen eighty three.” 2.12. Do not use hyphens (“twenty eight”, not “twenty-eight”)
  • 9. Transcription … 2.12. Letter sequences: Spell out letter sequences: DFW, USA, FBI, NASA, ROM 2.13. Possessives: Use standard grammar rules to denote possession: the US’s policy, Sally’s book, the drivers’ cars, the CEO’s decision, the dancers’ shoes. 2.13. If a speaker does not completely pronounce a word and the word is not a standard reduction then spell out as much of the word as is pronounced, and inside brackets spell out the part of the word that was not pronounced.
  • 10. Transcription … 2.14. Use a single dash after the brackets if the last part of the word was not pronounced and a single dash before the brackets if the first part of the word was not pronounced to flag that a partial word was spoken. 2.15. Context should be used to determine what word was intended to be spoken. If, from context, a reasonable intended word can not be determined, mark it as [vocalized-noise]
  • 11. Transcription … 2.16. Restarts of “i”: If a speaker restarts when saying the word “i”, it should be transcribed as “i-”. 2.17. Mispronunciations: If a speaker mispronounces a word and the mispronunciation is not an actual word, transcribe the word as it is spoken followed by the word that was intended. 2.18. Divide these two words by a forward slash and enclose both words in brackets. E.g. i wasn’t sure that they were blaming that [splace/space] space disaster on one company
  • 12. Transcription … 2.19. If a speaker uses and gives meaning to a word that is not an actual word, spell the word out as it sounds and enclose it in braces. E.g. How are things for you {weatherwise} 2.20. If one of the speakers involved in the conversation talks to someone in the background and the words can be understood, then transcribe it as an aside enclosed in the markers, <b_aside> and <e_aside>. 2.21. This only applies if one of the conversation speakers is involved in the background conversation.
  • 13. Transcription … 2.22. If just background speakers can be heard then this can be thought of either as noise or background noise depending energy level of the background speakers. compared to the foreground speakers. E.g. “yeah i know what you <b_aside> honey i can’t play with you right now i’m on the phone <e_aside> sorry you know kids always want mommy all to themselves”
  • 14. Transcription … 2.23. Hesitation sounds: Use “uh” or “ah” for hesitations consisting of a vowel sound, and “um” or “hm” for hesitations with a nasal sound, depending upon which transcription the actual sound is closest to. 2.24. Use “huh” for the aspirated version of the hesitation as in: "huh? <other speaker responds> um ok, I see your point." 2.25. Yes/no sounds: Use “uh-huh” or “um-hum” (yes) and “huh- uh” or “hum-um” (no) for anything remotely resembling these sounds of assent or denial; 2.26. You may use “yeah,” “yep,” and “nope” if that is what the words sound like.
  • 15. Transcription … 2.27. Non-speech sounds during conversations: transcribe these using only the following list of expressions in brackets: E.g. [laughter] [noise] [vocalized-noise] Pick the closest description ([noise] will be adequate in most cases) 2.28. Laughter during speech: If laughter occurs directly before a word, place the [laughter] tag before the spoken word. 2.30. If laughter occurs after a spoken word, place the [laughter] tag after the word. 2.31. If the speaker laughs while saying the word, but the word is still understood, transcribe this as [laughter-word], where "word" is the word spoken during the laughter.
  • 16. Transcription … 2.32. If the speech is obliterated by the laughter, transcribe it strictly as [laughter]. 2.33. If a speaker laughs while saying several words and the words are understood, transcribe each word in the phrase as [laughter-word]. 2.34. Laughter throughout the phrase, “you don’t say,” would be transcribed as: [laughter-you] [laughter-don’t] [laughter-say].
  • 17. Transcription … 2.35. Pronunciation variants are handled by transcribing words as they are said. E.g. about: b aw t; because: k ah z; depends: p eh n d z; them: eh m
  • 18. Transcription … 2.37. Consider continuous background noise as part of the channel. 2.38. For example, if a baby cries at a consistent energy level throughout the conversation, then treat it as background noise. 2.39. Only consider it as noise if it grows much louder than the normal level. 2.40. The baby screaming would warrant considering it as noise. In this case mark it as [noise].
  • 19. Transcription … 2.41. In general, abbreviations should be avoided and words should be transcribed exactly as spoken. 2.42. The exception is that when abbreviations are used as part of a personal title, they remain as abbreviations, as in standard writing: E.g. Mr. Brown, Mrs. Jones, Dr. Spock
  • 20. Transcription … 2.43. Acronyms that are normally written as a single word but pronounced as a sequence of individual letters should be written in all caps, with each individual letter surrounded by spaces. 2.44. Similarly, individual letters that are pronounced as such should be written in caps: e.g. I got an A on the test. How ’bout if his name was spelled M U H R?
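The letter-by-letter convention in 2.43 and 2.44 can be sketched as a small helper. The name `spell_out` is hypothetical, and dropping non-letter characters such as periods is an assumption that the slides do not address.

```python
def spell_out(letters: str) -> str:
    """Write a letter-by-letter pronunciation in caps, one space between letters."""
    # Non-letters (e.g. periods in dotted acronyms) are dropped; this is
    # an assumption, since the guidelines do not cover that case.
    return " ".join(ch.upper() for ch in letters if ch.isalpha())


print(spell_out("muhr"))
# M U H R
```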
  • 21. Transcription … 2.45. When a speaker breaks off in the middle of a word, annotators transcribe as much of the word as can be made out. 2.46. A single dash without a preceding space is used to indicate the point at which the word was broken off. 2.47. If transcribers can make a reasonable guess at which word was intended by the speaker, they should include the full form of the word immediately after the truncated form, preceded by a plus sign + (without separating spaces). E.g. Yes, absolu- +absolutely absolutely. Well, I gue- +guess -- I would think this is what they intended.
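The truncation markup in 2.45 to 2.47 follows a fixed pattern, sketched below. The function name `mark_truncation` and its parameters are hypothetical and serve only to illustrate the format.

```python
from typing import Optional


def mark_truncation(fragment: str, intended: Optional[str] = None) -> str:
    """Format a broken-off word, optionally followed by the guessed full form."""
    # The dash attaches directly to the fragment, with no preceding space.
    marked = fragment + "-"
    if intended:
        # The guess follows after one space, with "+" attached to the word.
        marked += " +" + intended
    return marked


print(mark_truncation("absolu", "absolutely"))
# absolu- +absolutely
print(mark_truncation("gue"))
# gue-
```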
  • 22. Transcription … 2.48. Speaker restarts are indicated with a double dash (--) surrounded by spaces. 2.49. Annotators use this convention for cases where a speaker stops short, cutting him/herself off before continuing with or rephrasing the utterance. E.g. Did people uh -- did fights ever break out uh over hockey? Since she -- when she died we moved from across the street.
  • 23. Transcription … 2.50. An asterisk * is used for obviously mispronounced words (not regional or nonstandard dialect pronunciation), or for words that are made up on the spot by the speaker or idiosyncratic to that speaker’s usage. 2.51. Annotators should transcribe using the standard spelling and should not try to represent the pronunciation. E.g. They have as much *knowledgement about things as we’ve got. He insisted that we ((*teak)) -- talk to him in Italian.
  • 24. Transcription … 2.52. Sometimes an audio file will contain a section of speech that is difficult or impossible to understand. In these cases, annotators use double parentheses (( )) to mark the region of difficulty. 2.53. If it is possible to make a guess about the speaker’s words, annotators should transcribe what they think they hear and surround the stretch of uncertain transcription with double parentheses: E.g. And she told me that ((I should just leave.))
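Because uncertain stretches are delimited by double parentheses, they can be pulled out of a transcript for later review. The sketch below is illustrative only; `uncertain_spans` is a hypothetical name, and it assumes uncertain regions do not contain nested double parentheses.

```python
import re
from typing import List

# Non-greedy match of anything between "((" and the next "))".
UNCERTAIN = re.compile(r"\(\((.*?)\)\)")


def uncertain_spans(line: str) -> List[str]:
    """Return the text of each ((...)) region; empty strings mark unintelligible speech."""
    return [span.strip() for span in UNCERTAIN.findall(line)]


print(uncertain_spans("And she told me that ((I should just leave.))"))
# ['I should just leave.']
```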
  • 25. Transcription … 2.54. In addition to the transcription conventions outlined above, the following symbols are used for the transcription of other kinds of noises made by either the main speaker or one of the other participants in the interviews: {BR} breath (The speaker takes an audible breath.) {CG} cough (The speaker coughs, or clears his/her throat.) {LS} lip smack (The speaker smacks his/her lips.) {LG} laughter (The speaker laughs.) {NS} noise (Loud background noise, e.g. a door slamming, cars honking etc.)
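The five symbols in 2.54 form a closed set, so a transcript can be scanned for brace tags outside it. This is a minimal sketch: the names `NOISE_SYMBOLS` and `unknown_noise_tags` are hypothetical, and treating any two uppercase letters in braces as an intended noise symbol is an assumption. Lowercase brace content, such as the made-up word {weatherwise} from 2.19, is deliberately ignored.

```python
import re
from typing import List

# The noise symbols listed in rule 2.54.
NOISE_SYMBOLS = {
    "BR": "breath",
    "CG": "cough",
    "LS": "lip smack",
    "LG": "laughter",
    "NS": "noise",
}

# Assumption: a noise symbol is exactly two uppercase letters in braces.
BRACE_TAG = re.compile(r"\{([A-Z]{2})\}")


def unknown_noise_tags(line: str) -> List[str]:
    """Return two-letter brace tags that are not in the 2.54 symbol list."""
    return [tag for tag in BRACE_TAG.findall(line) if tag not in NOISE_SYMBOLS]


print(unknown_noise_tags("{BR} well {XX} i think {weatherwise}"))
# ['XX']
```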
  • 26. Saving: A file name should be the same as the tier name. GLOSSA can easily find the file if the file name and tier name are synchronized. Add this information to the metadata.
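The naming rule can be checked before files are handed over. The sketch below is only an illustration: `name_matches_tier` is a hypothetical helper, the example path and tier name are invented, and comparing the file name without its extension is an assumption, since the slides do not say whether the extension counts.

```python
from pathlib import Path


def name_matches_tier(file_path: str, tier_name: str) -> bool:
    """Return True if the file name, minus its extension, equals the tier name."""
    # Path.stem drops the directory part and the final extension.
    return Path(file_path).stem == tier_name


print(name_matches_tier("recordings/speaker01.eaf", "speaker01"))
# True
```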
  • 27. Thank You! Questions