How do you get a good transcript and then supplement it with information about the background sounds? The common thread will be Azure Video Indexer, a cognitive service that enables transcription, and its integration with custom code written in Azure Databricks. During the talk I will show how the two technologies integrate and present the machine learning approach: I will review the available models for audio transcription, as well as for image classification, which is especially useful for the visual representation of audio, the spectrogram.
2. About me
10 years in data analytics
Data Science Domain Lead at SoftwareOne
Data Science Trainer at Sages
Lecturer at postgraduate studies at Warsaw University of Technology
https://www.linkedin.com/in/pawe%C5%82-ekk-cierniakowski/
5. Types of captions
• Subtitles – language may differ from the audio (translations); can be turned off; no background sounds or speaker information
• Closed Captions – language consistent with the audio; can be turned off; include background sounds and information about a change of speaker
• Open Captions – language consistent with the audio; burned into the video, so they cannot be turned off; include background sounds and speaker information
8. Audio representation using spectrogram
• In theory it is possible to pass a raw audio signal to machine learning models
• However, such a representation does not explicitly capture various useful features of the audio
• For that reason, the spectrogram representation is commonly used
• A spectrogram is a visual representation of frequency content over time
• In Python, one can use the librosa library to compute a spectrogram
12. Processing of spectrograms with neural networks
• Spectrograms can be processed with various neural networks, but most commonly with convolutional neural networks (CNNs), both 1D and 2D
• Such architectures aim at finding characteristic fragments of an image, like edges, borders, etc.
• Both custom networks and pre-defined ones (from the Keras library) were used
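A minimal sketch of such a custom 2D CNN in Keras. The input shape (128 mel bands × 44 time frames × 1 channel) and the count of 10 sound classes are placeholder assumptions, not the talk's actual configuration:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(128, 44, 1)),        # spectrogram as a 1-channel image
    layers.Conv2D(16, 3, activation="relu"),  # local pattern (edge) detectors
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax"),   # one output unit per sound class
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
print(model.output_shape)  # (None, 10)
```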
13. Processing of spectrograms with neural networks
• Keras provides various ready-made neural network architectures (keras.io/api/applications/) with descriptions of their accuracy, size, depth, and estimated computation and inference time
• Experiments were mostly performed with the ResNet network family
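Using a pre-defined ResNet from keras.applications could look as follows. This is a sketch under assumptions: `weights=None` (to avoid downloading ImageNet weights), a 128×128 spectrogram tiled to 3 channels, and a placeholder count of 10 classes:

```python
from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.ResNet50(
    weights=None,               # or "imagenet" to start from pretrained weights
    include_top=False,          # drop the 1000-class ImageNet head
    input_shape=(128, 128, 3),  # spectrogram repeated across 3 channels
)
model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax"),  # custom classification head
])
print(model.output_shape)  # (None, 10)
```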
14. Enhancing capabilities of Video Indexer
Diagram: Newspaper Feed (RSS) → loading articles → extracting keywords → txt file → Video Indexer
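The "loading articles" step can be sketched by parsing article titles out of an RSS feed. The feed XML here is an inline hypothetical sample; in practice it would be downloaded from a newspaper's RSS URL:

```python
import xml.etree.ElementTree as ET

rss = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Newspaper</title>
  <item><title>Azure Databricks in practice</title></item>
  <item><title>Transcribing podcasts at scale</title></item>
</channel></rss>"""

root = ET.fromstring(rss)
# collect the title of every <item> (article) in the feed
articles = [item.findtext("title") for item in root.iter("item")]
print(articles)
```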
15. Keywords extraction
Simplified approach:
1. Non-English words
2. Taking words that appeared more than once

Statistical approach:
1. Computation of the following metrics:
• term frequency
• TF-IDF
2. Selection of words with high values of these metrics

Deep learning approach:
1. Usage of NER (Named Entity Recognition) models
2. Selection of words marked as Location or Person
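The statistical approach can be sketched in pure Python: score each word by TF-IDF over a tiny hypothetical corpus and keep the highest-scoring ones. Words that occur in every document get an IDF of zero and are filtered out automatically:

```python
import math

docs = [
    "azure video indexer produces a transcript",
    "the transcript mentions azure and databricks",
    "keywords are extracted from the transcript",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf_idf(word, doc_tokens):
    tf = doc_tokens.count(word) / len(doc_tokens)   # term frequency
    df = sum(word in d for d in tokenized)          # document frequency
    idf = math.log(n_docs / df)                     # inverse document frequency
    return tf * idf

# keywords for the first document: words with the highest TF-IDF
scores = {w: tf_idf(w, tokenized[0]) for w in set(tokenized[0])}
top = sorted(scores, key=scores.get, reverse=True)[:3]
print(top)
```

Note how "transcript", which appears in every document, scores 0 and never makes the keyword list.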
16. Hugging Face
• Multimodal
• Computer Vision
• Natural Language Processing
• Audio
• Tabular
• Reinforcement learning
17. Hugging Face – NLP
• Text Classification
• Token Classification
• Table Question Answering
• Zero-Shot Classification
• Translation
• Summarization
• Conversational
• Text Generation
• Text2Text Generation
• Fill-Mask
• Sentence Similarity
Named Entity Recognition:
- Location (LOC)
- Person (PER)
- Organization (ORG)
- Miscellaneous (MISC)
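The deep learning keyword approach then reduces to filtering NER output down to Locations and Persons. The `entities` list below is hypothetical example data that mimics the output format of a Hugging Face transformers NER pipeline (with an aggregation strategy applied):

```python
# hypothetical NER pipeline output: word + entity group + confidence
entities = [
    {"word": "Warsaw", "entity_group": "LOC", "score": 0.99},
    {"word": "Pawel", "entity_group": "PER", "score": 0.98},
    {"word": "SoftwareOne", "entity_group": "ORG", "score": 0.97},
    {"word": "podcast", "entity_group": "MISC", "score": 0.61},
]

KEEP = {"LOC", "PER"}  # keep only Locations and Persons as keywords
keywords = [e["word"] for e in entities if e["entity_group"] in KEEP]
print(keywords)  # ['Warsaw', 'Pawel']
```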
18. Various colors of accuracy
S – substitution
D – deletion
I – insertion
N – count of words in the reference

Accuracy:
• General
• Nouns
• Punctuation
• Numbers
• Special words

WER = (S + D + I) / N
19. Various colors of accuracy
S – substitution
D – deletion
I – insertion
N – count of words in the reference

Accuracy:
• General 97%
• Nouns 94%
• Punctuation 87%
• Numbers 92%
• Special words 81%

WER = (S + D + I) / N
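The formula WER = (S + D + I) / N can be computed with a word-level Levenshtein alignment, which counts the minimal mix of substitutions, deletions, and insertions needed to turn the hypothesis into the reference. A sketch with made-up sentences:

```python
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimal edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i              # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)       # (S + D + I) / N

# one substitution ("sat" -> "sit") and one deletion ("the"):
print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 / 6
```

The per-category accuracies above follow the same idea, restricted to nouns, punctuation marks, numbers, or special words.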