Bhashini (NLTM) Tools

Bhashini Tools
BhashaDaan & ULCA

Agenda
● Bhashini Mission
● Need for Digital Infrastructure
● NLTM Architecture
● Datasets & Models
● BhashaDaan
● ULCA
● Contributing Datasets & Models to ULCA
● Roadmap

Bhashini Mission Statement
Create a knowledge-based society by transcending language barriers, providing content and services to citizens in their own language.

Digital Infrastructure
● Educational eBooks
● Digital Web Content
● Tele Services
● Communication Services
● Knowledge Base
● Search
● Datasets
● AI/ML Models

In simple words…
● Contributors of Datasets
● Development of AI Models
● Development of End-user Applications

Datasets & Contributors
Contributors
● Public: crowdsourced, free
● Dedicated Teams: language experts, specific tasks, paid
Datasets
● Parallel
● Monolingual
● ASR
● TTS
● OCR & more…

Data Collection
Help to build an open repository of data to digitally enrich your language
ASR Datasets
TTS Datasets
Parallel Datasets
OCR Datasets

AI Models
Task Types
● Translation
● ASR
● TTS
● Transliteration
● OCR
● and more…
Contributors
● EkStep
● AI4Bharat
● IITs
● IIITs
● CDAC
● and more…
Models
● IndicTrans
● Vakyansh
● IndicXlit
● IndicTTS
● Anuvaad
● and more…

ULCA
ULCA stands for Universal Language Contribution APIs.
ULCA is a standard API and an open, scalable data platform (supporting various types of datasets) for Indian language datasets and models.
World's largest Indic language data and models platform for open AI innovation

ULCA - Components
Open and scalable data platform
● Parallel text corpus in two or more languages
● Monolingual text corpus
● Automatic Speech Recognition (ASR) corpus
● Text to Speech (TTS) corpus
● Optical Character Recognition (OCR) corpus
● Natural Language Understanding (NLU) datasets
Inclusive Indian language models
● Machine Translation (MT)
● Automatic Speech Recognition (ASR)
● Text to Speech (TTS)
● Optical Character Recognition (OCR)
● Transliteration
Automated transparent benchmarking
● Large, diverse and task-specific benchmarks
● Research-community-approved metric system
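As a rough illustration of the dataset types listed above, the following sketch models them as a Python enum and shows a parallel-corpus record with a basic sanity check. The class names, field names, and validation rules are hypothetical and chosen only for this example; they are not ULCA's actual schema or API.

```python
from dataclasses import dataclass
from enum import Enum


class DatasetType(Enum):
    """Dataset types named on the slide (the string values are illustrative)."""
    PARALLEL = "parallel"
    MONOLINGUAL = "monolingual"
    ASR = "asr"
    TTS = "tts"
    OCR = "ocr"
    NLU = "nlu"


@dataclass(frozen=True)
class ParallelRecord:
    """One sentence pair of a parallel corpus (hypothetical field names)."""
    source_lang: str
    target_lang: str
    source_text: str
    target_text: str

    def is_valid(self) -> bool:
        # A pair is usable only if the languages differ and both sides have text.
        return (
            self.source_lang != self.target_lang
            and bool(self.source_text.strip())
            and bool(self.target_text.strip())
        )


def keep_valid(records):
    """Drop invalid pairs and exact duplicates, preserving input order."""
    seen = set()
    cleaned = []
    for record in records:
        if record.is_valid() and record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned
```

A contributor-side script could run a check like `keep_valid` before submitting a corpus, so that empty or duplicated pairs are filtered out early.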

ULCA - Current Status
Datasets
● 215 million parallel sentences in 13 languages
● 14,000 hours of audio recordings in 14 languages
● 2.5 million images for OCR in 12 languages
● 10 million transliteration pairs in 19 languages
Models
● 240 state-of-the-art models in 21 Indian languages across translation, speech (ASR/TTS), OCR and transliteration
Benchmarks
● 135 open benchmarks across translation, ASR and transliteration in 20 Indian languages
World's largest Indic language data and models platform for open AI innovation
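The slides do not say which metrics the benchmarks use. As an example of the kind of metric commonly applied to ASR output, the sketch below computes word error rate (WER): the word-level edit distance between a reference transcript and a model's hypothesis, divided by the number of reference words. The function name is illustrative, and this is a minimal sketch, not ULCA's benchmarking code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Lower is better: 0.0 means the hypothesis matches the reference word for word, and values above 1.0 are possible when the hypothesis contains many extra words.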

ULCA - Actions
Datasets
● Submission
● My Contribution
● Search & Download
● My Searches
Models
● Submission
● My Contribution
● Explore Models
● Try Model
Benchmarking
● Metrics
● Benchmark Dataset
● Explore Models
● Try Model
● Model Feedback
● Model Leaderboard

ULCA - Language AI Models Demo

ULCA - Roadmap
Datasets
POS, NER
Multi-lingual Multi-speaker
Mobile APK
Models
POS, NER
Benchmark
OCR Benchmark dataset
User Analytics
Ex: En-Hi Legal
Readymade Datasets
Realtime Inference for Models

ULCA - Roadmap (Contd.)
Automated ingestion of verified content from external sources into ULCA
