Czech-Slovak Parallel Corpus for Translation between Closely Related Languages
1. Petra Galuščáková and Ondřej Bojar
{galuscakova,bojar}@ufal.mff.cuni.cz
Charles University in Prague
Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
2. 20. 10. 2011 2
Presentation outline
● Building the corpus: procedure and tools used
● Possible sources for a parallel corpus
● Applications of the corpus
3. Introduction
● More resources are available for Czech
● Czech and Slovak are very closely related
● Czech as a pivot language
● Czech-Slovak parallel corpus
● Training machine translation
● Evaluating machine translation
[Figure: pivot-language diagram with nodes CS, SK, EN, PL, ...]
4. Tools
● Segmentation: trainable tokenizer trained on Czech and Slovak
● Alignment: Hunalign

Czech input text:
Příběh, který hodláte číst, není ani román, ani novela. Ty mají svá pravidla, své zákony. Své začátky a své konce. Tento příběh – řekl bych – je přeslechnut.

Czech text after segmentation:
Příběh, který hodláte číst, není ani román, ani novela.
Ty mají svá pravidla, své zákony.
Své začátky a své konce.
Tento příběh – řekl bych – je přeslechnut.

Slovak input text:
Príbeh, ktorý hodláte čítať, nie je ani román, ani novela. Tie majú svoje pravidlá, svoje zákony. Začiatky a konce. Tento príbeh — povedal by som — je prepočutý.

Slovak text after segmentation:
Príbeh, ktorý hodláte čítať, nie je ani román, ani novela.
Tie majú svoje pravidlá, svoje zákony.
Začiatky a konce.
Tento príbeh — povedal by som — je prepočutý.

Alignment output:
Alignment | Score | Czech sentence | Slovak sentence
1-1 | 2.28889 | Příběh, který hodláte číst, není ani román, ani novela. | Príbeh, ktorý hodláte čítať, nie je ani román, ani novela.
1-1 | 2.475 | Ty mají svá pravidla, své zákony. | Tie majú svoje pravidlá, svoje zákony.
1-1 | 2.08125 | Své začátky a své konce. | Začiatky a konce.
1-1 | 2.87805 | Tento příběh – řekl bych – je přeslechnut. | Tento príbeh — povedal by som — je prepočutý.
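The aligner output shown on this slide can be post-processed with a few lines of code. The following is a minimal sketch, assuming each output line carries four tab-separated fields (alignment type, score, Czech sentence, Slovak sentence) as in the sample above. The tab separator, the `min_score` threshold, and the names `AlignedPair`, `parse_alignment`, and `keep_one_to_one` are assumptions made for illustration; they are not Hunalign's own interface.

```python
from typing import Iterable, Iterator, List, NamedTuple


class AlignedPair(NamedTuple):
    kind: str     # alignment type such as "1-1" or "2-1"
    score: float  # aligner confidence score
    cs: str       # Czech side
    sk: str       # Slovak side


def parse_alignment(lines: Iterable[str]) -> Iterator[AlignedPair]:
    """Parse lines of the assumed form: type<TAB>score<TAB>Czech<TAB>Slovak."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        kind, score, cs, sk = line.split("\t")
        yield AlignedPair(kind, float(score), cs, sk)


def keep_one_to_one(pairs: Iterable[AlignedPair], min_score: float = 0.0) -> List[AlignedPair]:
    """Keep only 1-1 alignments whose score reaches the (hypothetical) threshold."""
    return [p for p in pairs if p.kind == "1-1" and p.score >= min_score]


sample = [
    "1-1\t2.475\tTy mají svá pravidla, své zákony.\tTie majú svoje pravidlá, svoje zákony.",
    "1-1\t2.08125\tSvé začátky a své konce.\tZačiatky a konce.",
    # Illustrative 2-1 row with an invented score, not taken from the corpus.
    "2-1\t0.5\tPryč ode mne! Co vám udělaly?\tPreč odo mňa! čo vám urobili?",
]
pairs = keep_one_to_one(parse_alignment(sample), min_score=1.0)
for p in pairs:
    print(p.score, p.cs, "=>", p.sk)
```

Restricting the data to 1-1 pairs is only one possible policy; 2-1 and 1-2 rows could instead be merged into single training pairs.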
5. Problems
● Segmentation is essential for alignment
● Problems arise when the Czech segmentation behaves differently from the Slovak one

Alignment | Czech sentence | Slovak sentence
2 - 1 | "Pryč ode mne, vy zloto! <s> Co vám udělaly ty kačátka? | „Preč odo mňa, vy lotri! čo vám urobili tie kačičky?
2 - 1 | — <s> Viktor nevnímal hovor a zmatek ve vagónu. | Viktor nevnímal vravu a zmätok vo vagóne.
1 - 2 | Stáří 23 let. Zoolingvistka. | Vek dvadsaťtri rokov. <s> Zoolingvistka.
1 - 2 | II/ MODLITBA | II <s> MODLITBA

<s> marks a sentence split
6. Corpus sources
● Books
● Acquis JRC
● Official Journal of the European Union
● European Commission website

Source | Words CS | Words SK | Tokens CS | Tokens SK | Sentences
Books | 6.6 M | 6.6 M | 8.1 M | 8.1 M | 550.6 k
Acquis | 20.4 M | 20.6 M | 24.3 M | 24.4 M | 926.1 k
Journal | 45.5 M | 45.5 M | 56.4 M | 56.3 M | 2.9 M
Ec-Europa | 0.4 M | 0.4 M | 0.4 M | 0.4 M | 24.2 k
Total | 72.9 M | 73.1 M | 89.2 M | 89.2 M | 4.4 M
7. Corpus sources I - books
● Prepared by SAV (Slovak Academy of Sciences)
● A very good data source for MT, although alignment can be problematic (loosely structured texts)
● 118 books (cs->sk, sk->cs, and en->cs,sk), aligned by the authors
● Such a source is hard to obtain, and its use is restricted
8. Corpus sources II - Acquis
● Freely available parallel multilingual corpus of EU documents
● Official alignment
● The Czech and Slovak texts were produced as translations from a third language, mostly English
● A large volume of text, but the vocabulary is limited and many sentences are repeated, so it has to be combined with other sources

Source | Sentences total | Unique sentences | %
Acquis CZ | 926082 | 608086 | 65.66
Acquis SK | 926082 | 632916 | 68.34
Books CZ | 153478 | 148705 | 96.89
Books SK | 153478 | 149152 | 97.18
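The share of unique sentences in the table above is simply the number of distinct sentences divided by the total. A minimal sketch follows, assuming exact string matching decides whether two sentences are the same; the function name `unique_share` and the toy sentences are made up for illustration. The second part recomputes the percentage column from the counts reported on the slide.

```python
def unique_share(sentences):
    """Return (total, unique, percent unique) for a list of sentences."""
    total = len(sentences)
    unique = len(set(sentences))
    percent = 100.0 * unique / total if total else 0.0
    return total, unique, round(percent, 2)


# Toy example (invented sentences): one sentence is repeated.
toy = ["Článek 1", "Článek 2", "Článek 1", "Tento příběh je přeslechnut."]
print(unique_share(toy))

# Recompute the percentage column from the (total, unique) counts in the table.
table = {
    "Acquis CZ": (926082, 608086),
    "Acquis SK": (926082, 632916),
    "Books CZ": (153478, 148705),
    "Books SK": (153478, 149152),
}
percentages = {name: round(100.0 * u / t, 2) for name, (t, u) in table.items()}
print(percentages)
```

Whether near-duplicates (differing only in case, whitespace, or numbers) should also count as repeats is a separate design choice that this sketch does not address.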
9. Corpus sources III - Official Journal
● Again EU documents, in 23 languages
● Data similar to Acquis, with similar problems
● Official alignment, also at the sentence level
10. Corpus sources IV - European Commission website
● Different language versions of the same page, distinguished by a suffix in the URL
● The Slovak and Czech texts were mostly produced as translations from English
● Many paragraphs in the Czech and Slovak pages are left untranslated
● A dedicated web crawler was implemented to download the pages
● The downloaded pages were then stripped of HTML code and deduplicated
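The crawling steps listed above (deriving language variants from a URL suffix, stripping HTML, deduplicating) can be sketched with the Python standard library. This is not the crawler used by the authors: the `_xx.htm` suffix pattern, the example URL path, and the helper names `language_variant`, `strip_html`, and `deduplicate` are assumptions for illustration, and no network access is performed.

```python
import re
from html.parser import HTMLParser


def language_variant(url, lang):
    """Swap a two-letter language suffix, e.g. index_en.htm -> index_cs.htm.

    The `_xx.htm` pattern is an assumption made for this sketch.
    """
    return re.sub(r"_[a-z]{2}(\.html?)$", rf"_{lang}\1", url)


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def strip_html(html):
    """Return the text chunks of an HTML page in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.parts


def deduplicate(paragraphs):
    """Drop exact repeats while keeping first-seen order."""
    seen = set()
    unique = []
    for p in paragraphs:
        if p not in seen:
            seen.add(p)
            unique.append(p)
    return unique


# Hypothetical URL path; only the host appears in the slides.
sk_url = language_variant("http://ec.europa.eu/index_en.htm", "sk")
print(sk_url)

page = "<html><body><p>Vitajte</p><script>var x=1;</script><p>Vitajte</p><p>Kontakt</p></body></html>"
paragraphs = deduplicate(strip_html(page))
print(paragraphs)
```

Because many paragraphs on the Czech and Slovak pages stay untranslated, a real pipeline would probably also need a language-identification filter before alignment; that step is left out here.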
12. Machine translation
● Acquis and the books were used for training, tuning, and testing the Moses machine translation toolkit
● Six configurations in total, written as training data / tuning data (Acquis/Acquis, Acquis/Books, Books/Acquis, Books/Books, Acquis+Books/Acquis, Acquis+Books/Books)
● Test set: 3860 randomly selected lines from the books
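The six configurations are the Cartesian product of three training sets and two tuning sets. A short sketch that enumerates them, with variable names chosen only for illustration:

```python
from itertools import product

training_sets = ["Acquis", "Books", "Acquis+Books"]
tuning_sets = ["Acquis", "Books"]

# Each configuration is named "training data/tuning data".
configurations = [f"{train}/{tune}" for train, tune in product(training_sets, tuning_sets)]
print(configurations)
```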
13. Machine translation - results
● Translation quality was measured with the BLEU metric

Training / tuning data | Training sentences | Tuning sentences | BLEU
Acquis / Acquis | 708406 | 3148 | 0.1808
Acquis / Books | 708406 | 3802 | 0.2071
Books / Acquis | 137027 | 3148 | 0.4661
Books / Books | 137027 | 3802 | 0.4701
Acquis + Books / Acquis | 845433 | 3148 | 0.4781
Acquis + Books / Books | 845433 | 3802 | 0.4887
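For readers unfamiliar with BLEU, the following is a minimal sketch of the metric: the geometric mean of clipped n-gram precisions (n = 1 to 4) multiplied by a brevity penalty. It assumes a single reference per sentence, whitespace tokenization, and no smoothing. It is not the scoring script used for the results above, and the function names and toy sentences are illustrative only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis (lists of token lists)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clipping: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any zero n-gram precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Toy data: one reference and a hypothesis that differs in the last word.
reference = "dali sme im tú radu lacno".split()
hypothesis = "dali sme im tú radu lacino".split()
perfect = corpus_bleu([reference], [reference])
partial = corpus_bleu([hypothesis], [reference])
print(perfect, partial)
```

In the toy case the n-gram precisions are 5/6, 4/5, 3/4, and 2/3, so the score is the fourth root of their product, 1/3.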
14. Machine translation - sample output

Configuration | Output
Original | "Tak vidějí, vašnosti, dali jsme jim tu radu lacino," řekli pes s kočičkou, "zrovna jsme si něco takového na zub přáli.
Acquis/Acquis | "tak vidějí, vašnosti, dali jsme im tu radu lacino," řekli pes s kočičkou, "zrovna jsme si něco takéhoto na zub přáli.
Acquis/Books | "tak vidějí, vašnosti, dali jsme im tu radu lacino," řekli predvedenie identifikácie psa s kočičkou, "zrovna si jsme inak takéhoto na zub přáli.
Books/Acquis | „ nuž vidíte, pán veľkomožný, dali sme im tú radu lacno, “ povedali pes s mačičkou, „ akurát sme si niečo takého na zub želali.
Books/Books | „ nuž vidíte, pán veľkomožný, dali sme im tú radu lacno, “ povedali pes s mačičkou, „ práve sme si také čosi na zub želali.
Acquis+Books/Acquis | „nuž vidíte, pán veľkomožný, dali sme im tú radu lacno,“ povedali pes a mačička, „akurát sme si také čosi na zub želali.
Acquis+Books/Books | „nuž vidíte, pán veľkomožný, dali sme im tú radu lacno,“ povedali pes s mačičkou, „práve sme si také čosi na zub želali.
15. Machine translation - discussion
● Training on the books yields a substantially higher score than training on Acquis, even though the training data is about 5x smaller
● Vocabulary from the test set, which was drawn from the books, may also occur in the training data
● Unlike Acquis, most of the books were produced as direct cs->sk and sk->cs translations
● The books improve the results even when they are used only as the tuning set
● Combining the books with Acquis brings no marked improvement over the results achieved by training on the books alone
16. Conclusion
● A Czech-Slovak parallel corpus was built from several sources
● The corpus was used for machine translation
● The source of the training data plays an important role in translation quality
● A smaller amount of more diverse data is sufficient
17. References
● Acquis JRC
http://optima.jrc.it/Acquis
● European Commission website
http://ec.europa.eu
● Official Journal
http://eurlex.europa.eu/JOIndex.do
● Trainable tokenizer
Klyueva N., Bojar O. (2008). UMC 0.1: Czech-Russian-English Multilingual Corpus. In Proceedings of International Conference Corpus Linguistics, pages 188–195.
● Hunalign
http://mokk.bme.hu/resources/hunalign
● Moses
http://www.statmt.org/moses