How do you get a good transcript and then supplement it with information about the background sounds? The common thread will be Azure Video Indexer, a cognitive service that enables transcription, and its integration with custom code written in Azure Databricks. During the talk I will show how the two technologies integrate and present the machine learning approach: I will review the available models for audio transcription, as well as for image classification, which is especially useful for the visual representation of audio, the spectrogram.
2. About me
10 years in data analytics
Data Science Domain Lead at SoftwareOne
Data Science Trainer at Sages
Lecturer at postgraduate studies at Warsaw University of Technology
https://www.linkedin.com/in/pawe%C5%82-ekk-cierniakowski/
5. Types of captions
• Subtitles – language may differ from the audio (translations); can be turned off; no background sounds or speaker information
• Closed Captions – language consistent with the audio; can be turned off; include background sounds and information about a change of speaker
• Open Captions – language consistent with the audio; burned into the video, so they cannot be turned off; include background sounds and speaker information
8. Audio representation using spectrogram
• In theory it is possible to pass a raw audio signal to machine learning models
• However, such a representation does not explicitly capture various useful features of the audio
• For that reason, the spectrogram representation is commonly used
• A spectrogram is a visual representation of frequency content over time
• In Python, one can use the librosa library to compute a spectrogram
12. Processing of spectrograms with neural networks
• Spectrograms can be processed with various neural networks, but most commonly with convolutional neural networks (CNNs), both 1D and 2D
• Such architectures aim at finding characteristic fragments of an image, like edges, borders, etc.
• Both custom networks and pre-defined ones (from the Keras library) were used
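A minimal sketch of such a custom 2D CNN in Keras. The input shape (128 mel bands × 44 time frames × 1 channel) and the count of 10 sound classes are placeholder assumptions, not the talk's actual configuration:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(128, 44, 1)),        # spectrogram as a 1-channel image
    layers.Conv2D(16, 3, activation="relu"),  # local pattern (edge) detectors
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax"),   # one output unit per sound class
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
print(model.output_shape)  # (None, 10)
```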
13. Processing of spectrograms with neural networks
• Keras provides various ready-made neural network architectures (keras.io/api/applications/) with descriptions of their accuracy, size, depth, and estimated computation and inference time
• Experiments were mostly performed with the ResNet network family
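Using a pre-defined ResNet from keras.applications could look as follows. This is a sketch under assumptions: `weights=None` (to avoid downloading ImageNet weights), a 128×128 spectrogram tiled to 3 channels, and a placeholder count of 10 classes:

```python
from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.ResNet50(
    weights=None,               # or "imagenet" to start from pretrained weights
    include_top=False,          # drop the 1000-class ImageNet head
    input_shape=(128, 128, 3),  # spectrogram repeated across 3 channels
)
model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax"),  # custom classification head
])
print(model.output_shape)  # (None, 10)
```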
14. Enhancing capabilities of Video Indexer
Diagram: Newspaper Feed (RSS) → loading articles → extracting keywords → txt file → Video Indexer
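The "loading articles" step can be sketched by parsing article titles out of an RSS feed. The feed XML here is an inline hypothetical sample; in practice it would be downloaded from a newspaper's RSS URL:

```python
import xml.etree.ElementTree as ET

rss = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Newspaper</title>
  <item><title>Azure Databricks in practice</title></item>
  <item><title>Transcribing podcasts at scale</title></item>
</channel></rss>"""

root = ET.fromstring(rss)
# collect the title of every <item> (article) in the feed
articles = [item.findtext("title") for item in root.iter("item")]
print(articles)
```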
15. Keywords extraction
Simplified approach:
1. Non-English words
2. Taking words that appeared more than once

Statistical approach:
1. Computation of the following metrics:
• term frequency
• TF-IDF
2. Selection of words with high values of these metrics

Deep learning approach:
1. Usage of NER (Named Entity Recognition) models
2. Selection of words marked as Location or Person
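The statistical approach can be sketched in pure Python: score each word by TF-IDF over a tiny hypothetical corpus and keep the highest-scoring ones. Words that occur in every document get an IDF of zero and are filtered out automatically:

```python
import math

docs = [
    "azure video indexer produces a transcript",
    "the transcript mentions azure and databricks",
    "keywords are extracted from the transcript",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf_idf(word, doc_tokens):
    tf = doc_tokens.count(word) / len(doc_tokens)   # term frequency
    df = sum(word in d for d in tokenized)          # document frequency
    idf = math.log(n_docs / df)                     # inverse document frequency
    return tf * idf

# keywords for the first document: words with the highest TF-IDF
scores = {w: tf_idf(w, tokenized[0]) for w in set(tokenized[0])}
top = sorted(scores, key=scores.get, reverse=True)[:3]
print(top)
```

Note how "transcript", which appears in every document, scores 0 and never makes the keyword list.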
16. Hugging Face
• Multimodal
• Computer Vision
• Natural Language Processing
• Audio
• Tabular
• Reinforcement learning
17. Hugging Face – NLP
• Text Classification
• Token Classification
• Table Question Answering
• Zero-Shot Classification
• Translation
• Summarization
• Conversational
• Text Generation
• Text2Text Generation
• Fill-Mask
• Sentence Similarity
Named Entity Recognition:
- Location (LOC)
- Person (PER)
- Organization (ORG)
- Miscellaneous (MISC)
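The deep learning keyword approach then reduces to filtering NER output down to Locations and Persons. The `entities` list below is hypothetical example data that mimics the output format of a Hugging Face transformers NER pipeline (with an aggregation strategy applied):

```python
# hypothetical NER pipeline output: word + entity group + confidence
entities = [
    {"word": "Warsaw", "entity_group": "LOC", "score": 0.99},
    {"word": "Pawel", "entity_group": "PER", "score": 0.98},
    {"word": "SoftwareOne", "entity_group": "ORG", "score": 0.97},
    {"word": "podcast", "entity_group": "MISC", "score": 0.61},
]

KEEP = {"LOC", "PER"}  # keep only Locations and Persons as keywords
keywords = [e["word"] for e in entities if e["entity_group"] in KEEP]
print(keywords)  # ['Warsaw', 'Pawel']
```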
18. Various colors of accuracy
S – substitution
D – deletion
I – insertion
N – count of words in the reference

Accuracy:
• General
• Nouns
• Punctuation
• Numbers
• Special words

WER = (S + D + I) / N
19. Various colors of accuracy
S – substitution
D – deletion
I – insertion
N – count of words in the reference

Accuracy:
• General 97%
• Nouns 94%
• Punctuation 87%
• Numbers 92%
• Special words 81%

WER = (S + D + I) / N
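The formula WER = (S + D + I) / N can be computed with a word-level Levenshtein alignment, which counts the minimal mix of substitutions, deletions, and insertions needed to turn the hypothesis into the reference. A sketch with made-up sentences:

```python
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimal edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i              # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)       # (S + D + I) / N

# one substitution ("sat" -> "sit") and one deletion ("the"):
print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 / 6
```

The per-category accuracies above follow the same idea, restricted to nouns, punctuation marks, numbers, or special words.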