Computing Support for Pakistani
Languages – Challenges and Practice

Unlocking Information for Human Development

www.CLE.org.pk

Sarmad Hussain
Center for Language Engineering
Al-Khawarizmi Institute of Computer Science
University of Engineering and Technology
Lahore
sarmad@cantab.net
www.cle.org.pk

Need
ICTs promise significant socio-economic impact
Impact dependent on size of population which can use ICTs
180 Million citizens need access
66+ languages
10% understand English
58% literate
11% have access to computers
70% have access to mobile phones
ITU IDI: Pakistan ranked 127 of 155 nations
Human Language Technology necessary to bridge the gap

Languages of Pakistan

Percent Population of Pakistan by Mother Tongue

         Urdu   Punjabi  Sindhi  Pushto  Balochi  Saraiki  Others (60)
Total    7.57   44.15    14.1    15.42   3.57     10.53    4.66
Rural    1.48   42.51    16.46   18.06   3.99     12.97    4.53
Urban    20.22  47.56    9.20    9.94    2.69     5.46     4.93

Languages of Pakistan
[Annotations on the mother-tongue table: Sociocultural, Economic]

Languages of Pakistan in Danger (UNESCO)
[Map legend: vulnerable, definitely endangered, severely endangered]

How?
Human Language Technology
Linguistic Research
Standards
Applications

Materials
Training

Adoption

USE

Relevant Content Access
Relevant Content Generation


Human Language Technology – Bridging Barriers
• Interfacing
• Assisting
• Enabling
• Empowering


Interfacing

Language
– Character Set
  • Input Methods
  • Writing
  • Collation
– Terminology Translation

Standards
– National
– International
  • ISO 639
  • ISO 3166
  • ISO 10646/Unicode

Technology
– Applications
  • Fonts
  • Keyboards, Keypads and Other Input Methods
  • Collation Methods
  • Localized Platform
– Platforms: Computers and Phones
  • Linux/Unix and Symbian
  • Microsoft Windows and Phone
  • iOS – iPad, iPhone, MacBook, …
  • Google – Gmail, Docs, …, Android
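To make the standards and collation items above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the Center's implementation. The helper names (`LANGUAGE_CODES`, `locale_tag`, `urdu_sort_key`) are hypothetical, the language codes are the commonly used ISO 639 identifiers, and the alphabet is only the first eight Urdu letters. The point it shows is that sorting by Unicode code point does not give Urdu dictionary order, because letters such as pe (U+067E) are encoded after the basic Arabic letters.

```python
# Illustrative sketch only: helper names are hypothetical and the alphabet
# below is a partial subset of the Urdu alphabet.

# ISO 639 language codes (two-letter where one exists, else three-letter).
LANGUAGE_CODES = {
    "Urdu": "ur",
    "Punjabi": "pa",
    "Sindhi": "sd",
    "Pushto": "ps",
    "Balochi": "bal",
    "Saraiki": "skr",
}
REGION = "PK"  # ISO 3166-1 alpha-2 code for Pakistan


def locale_tag(language):
    """Combine an ISO 639 language code with the ISO 3166 region code."""
    return f"{LANGUAGE_CODES[language]}-{REGION}"


# First eight letters of the Urdu alphabet in dictionary order:
# alef, be, pe, te, tte, se, jeem, che
URDU_ORDER = "\u0627\u0628\u067e\u062a\u0679\u062b\u062c\u0686"
RANK = {ch: i for i, ch in enumerate(URDU_ORDER)}


def urdu_sort_key(word):
    """Rank each character by alphabet position; unknown characters go last."""
    return [(RANK.get(ch, len(RANK)), ch) for ch in word]


letters = ["\u062a", "\u067e", "\u0628"]  # te, pe, be
by_code_point = sorted(letters)                    # be, te, pe
by_alphabet = sorted(letters, key=urdu_sort_key)   # be, pe, te
```

A production system would normally rely on a full collation standard and a complete alphabet table rather than a hand-written rank map like this one.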

Software Localization
SeaMonkey Navigator

OpenOffice.org Writer
Terminology and Content


Assisting
• Text
– Assistive input/auto-complete methods
– Thesaurus, Spelling and Grammar Checking
– Machine Translation, Language Identification, Text Summarization, …

• Speech
– Speech Recognition
– Text to Speech
– Emotion Detection, …

• Image
– Optical Character Recognition – www.UrduOCR.net
– Handwriting Recognition
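
As a rough illustration of the text-processing items above, the sketch below performs script detection, which is only a first step toward language identification. It is an assumption-laden example: the function names and the 0.5 threshold are hypothetical, and because Urdu, Punjabi, Sindhi, Pushto, Balochi and Saraiki are all written in Arabic-based scripts in Pakistan, a check like this cannot tell them apart. It can only separate them from Latin-script text such as English.

```python
def arabic_script_ratio(text):
    """Fraction of alphabetic characters in the Arabic Unicode block (U+0600 to U+06FF)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if 0x0600 <= ord(ch) <= 0x06FF)
    return arabic / len(letters)


def guess_script(text, threshold=0.5):
    """Label text as Arabic script when at least `threshold` of its letters are Arabic."""
    return "Arabic script" if arabic_script_ratio(text) >= threshold else "other"


urdu_word = "\u0627\u0631\u062f\u0648"  # the word "Urdu" written in Urdu
print(guess_script(urdu_word))
print(guess_script("Urdu"))
```

Distinguishing the individual languages would need further evidence, for example language-specific letters or word statistics, which this sketch does not attempt.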

Enabling
• Hybrid
– Online Content Sharing Tools – CMS, Social Networks
– Screen Readers
– Book Readers
– Text based Search Engines
– Dialogue Systems
– Speech to Speech Translation
– Multi-modal Search Engines

Dialogue System


Empowering
• ICT for ICT - Focused on infrastructure
• ICT for Development - Focused on content and applications
• ICT for Human Development - Focused on participatory process


LANGUAGE AND ICT TRAINING
[Bar charts: percent of teachers preferring Urdu vs. English for software and for training material, before and after training]

LANGUAGE AND ICT TRAINING
Icon Identification by Students

Icons                               F     M     SubTotal  Percent
Urdu                                691   656   1347      64%
English                             132   198   330       16%
English Transliterated into Urdu    49    40    89        4%
Didn't Recognize                    150   183   333       16%
Total                                           2099

ACCESSING INFO ONLINE

Language Preference for Searching on the Internet

Students   Language Used
           Urdu   English   Total
Female     44     2         46
Male       45     2         47
Total      89     4         93

Preferred Language for Setting a Homepage

Participant   English   Urdu
Students      0         138
Teachers      5         13
Total         5         151


LANGUAGE IN ONLINE COMMUNICATION
[Pie chart: language used in 1467 emails and 363 chats; legend: Urdu, English, Punjabi, Others; slices: 89%, 9%, 2%, 1%]

LANGUAGE FOR CONTENT DEVELOPMENT

Language of Website

Website Competition Category                             Urdu   English   Total
School Website (by 10 School Teacher Teams)              9      1         10
Local Village Website (by 10 School Student Teams) [1]   8      0         8
Open Category (Individual Students)                      38     0         38
Total                                                    55     1         56

[1] One school did not participate, and one school website was disqualified as the team took significant external assistance.
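
The counts on the survey slides can be turned into percentages with a few lines of Python. This is only a worked-arithmetic sketch. The dictionary labels and the `urdu_share` helper are hypothetical names, and the homepage total of 156 is derived here by adding the 5 English and 151 Urdu responses rather than being stated on the slide.

```python
# (Urdu count, total count) pairs taken from the survey slides.
observations = {
    "internet search (students)": (89, 93),
    "homepage language (students and teachers)": (151, 5 + 151),
    "competition websites": (55, 56),
}


def urdu_share(urdu, total):
    """Percentage of responses in Urdu, rounded to one decimal place."""
    return round(100 * urdu / total, 1)


shares = {name: urdu_share(u, t) for name, (u, t) in observations.items()}
for name, share in shares.items():
    print(f"{name}: {share}% Urdu")
```

Under these counts, the Urdu share comes out above 95% in all three settings.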

CONTENT


Development Process of Human Language Technology
• Select Language
• Linguistic Data Collection
• Core Linguistic Analysis and Definition
• Publishing Language Computing Standards
• Development of Localization Utilities
• Detailed Linguistic Analysis
• Publishing Data Annotations Schema
• Annotation of Linguistic Data
• Development of Linguistic Utilities
• Publishing Annotated Linguistic Resources
• Localization of Existing Applications
• Development of Advanced HLT Application
• Extension of Localization Applications

Status of Human Language Technology
[Diagrams for URDU, SINDHI, PUSHTO, PUNJABI, BALOCHI, SARAIKI and OTHERS: each repeats the stages of the development process, marked as Reasonable Support, Some Support or Minimal Support]