Indian Institute of Information
Technology, Allahabad
Graph Theory Project Report
Countries and Official Languages
Naimish Agarwal
irm2013013@iiita.ac.in
Signature
Dr Rishi Ranjan Singh
Assistant Professor
Contents
1 Introduction
2 Methodology
2.1 Graph Representation
2.2 Data Collection
2.3 Countries and their Official Languages Graph Construction
2.4 Sister Languages Network Construction
2.5 Sister Countries Network Construction
3 Graph Visualization
4 Graph Analysis
4.1 Graph of Countries and their Official Languages
4.2 Sister Languages Network
4.3 Sister Countries Network
5 Technology Deployed
5.1 Python 3.5
5.2 Gephi
5.3 Neo4j
5.4 Bash Script
6 Conclusion
7 Future Scope
A Graph Terminology
A.1 Network Diameter
A.2 Average Path Length
A.3 Network Component
A.4 Network Density
A.5 Closeness Centrality
A.6 Node Eccentricity
References
List of Figures
1 Country Node Visualization
2 Language Node Visualization
3 Full Graph View
4 Russia and its official languages
5 Countries which speak English as their Official Language
6 Language Cloud based on In-degree
7 Country Cloud based on Out-degree
8 In-degree Distribution
9 Out-degree Distribution
10 Degree Distribution
11 Sister Languages Network
12 Closeness Centrality in Sister Languages Network
13 Sister Countries Network
1 Introduction
We live in a multilingual world, surrounded by people who speak different
languages. Many of us dream of travelling the world, meeting new people,
learning new languages, and immersing ourselves in a new culture. In the job
market, multinational firms look for professional translators who can help
them achieve their business objectives.
Many languages are spoken in the world. A language learner must decide which
language to learn as a second, third, or fourth language, based on factors such
as ease of learning, the countries the learner wishes to visit, and the other
languages spoken in the learner's own country.
Studies have ranked languages by their number of speakers [1] [2]. They report
that Mandarin, Spanish, English, Hindi, Arabic, Portuguese, Bengali, Russian,
Japanese and Punjabi are the ten most spoken languages in the world. However,
such a ranking may not be useful to a language learner: because countries such
as China and India have the largest populations, it is biased toward the
languages spoken in those nations.
In this project, we address the challenge faced by a language learner by
modelling the problem as a graph of countries and their official languages.
2 Methodology
In section 2.1, we describe how we represented our graph. In section 2.2, we
describe our data source. In section 2.3, we describe our approach to constructing
the directed graph of countries and their official languages. In section 2.4,
we describe the procedure to construct the sister languages network. In section
2.5, we describe the procedure for constructing the sister countries network.
2.1 Graph Representation
A graph G = (V, E) consists of a set of nodes V and a set of edges E.
The set of nodes V consists of countries and official languages. For each
country, we store the attributes Country ID, Name, and Description. For each
official language, we store the attributes Language ID, Name, and Description.
The set of edges E consists of directed relations from countries to their
official languages.
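As a minimal sketch of this representation, the node set and the directed edge
set can be held in plain Python containers. The variable names, the helper
official_languages, and the three sample entries are illustrative assumptions,
not the data structures of the original scripts.

```python
# Node set V: countries and official languages, each keyed by an ID and
# carrying Name and Description attributes. Sample data for illustration only.
countries = {
    "Q668": {"name": "India", "description": "country in South Asia"},
}
languages = {
    "Q1568": {"name": "Hindi", "description": "Indo-Aryan language"},
    "Q1860": {"name": "English", "description": "West Germanic language"},
}

# Edge set E: directed pairs (country ID, language ID).
edges = {("Q668", "Q1568"), ("Q668", "Q1860")}


def official_languages(country_id):
    """Return the sorted names of the languages a country points to."""
    return sorted(languages[lang]["name"] for c, lang in edges if c == country_id)


nodes = set(countries) | set(languages)  # V = countries plus languages
print(len(nodes), len(edges))
print(official_languages("Q668"))
```

A graph library such as NetworkX (section 5.1) would replace these containers
with a directed graph object, but the node and edge sets stay the same.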
2.2 Data Collection
We collected the data about countries and their official languages by scanning
the JSON dumps of Wikidata [3] in March 2017. The dump is a compressed file of
metadata about Wiki projects in which each Wiki entity is represented as a JSON
string. The JSON string of each entity starts on a new line.
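Because each entity sits on its own line, the dump can be streamed line by line
instead of being loaded as a whole. The sketch below assumes the dump is one
large JSON array whose first and last lines are the brackets and whose entity
lines end with a comma; the function name iter_entities and the sample lines
are hypothetical.

```python
import json


def iter_entities(lines):
    """Yield one dict per entity from the lines of a one-entity-per-line dump."""
    for line in lines:
        line = line.strip().rstrip(",")  # drop the separator after each entity
        if line in ("", "[", "]"):       # skip the array brackets and blanks
            continue
        yield json.loads(line)


# Tiny in-memory stand-in for the real dump file.
sample = [
    "[",
    '{"id": "Q668", "type": "item"},',
    '{"id": "Q1860", "type": "item"}',
    "]",
]
ids = [entity["id"] for entity in iter_entities(sample)]
print(ids)
```

For the real file, the lines could come from the standard library call
bz2.open(path, "rt", encoding="utf-8"), which decompresses while iterating.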
2.3 Countries and their Official Languages Graph Construction
In this section, we discuss the steps to construct the directed graph of countries
and their official languages. The following points list the steps taken to construct
the graph:
1. Collect the Wikidata IDs of countries manually. This was feasible because
we considered only 202 countries.
2. In the first pass over the JSON dumps, extract the JSON string of coun-
tries using the list of Wikidata IDs and store in separate JSON files.
3. Scan each JSON file of the countries, and make a global list of Wikidata
IDs of their official languages. We need to do this, since the JSON file
only lists the IDs of their official languages and not their names.
4. Find all the distinct Wikidata IDs of the official languages.
5. In the second pass over the JSON dumps, get the JSON string for the
official languages and store them in separate files.
6. Scan the JSON file of each official language and extract its name.
7. Create nodes of countries.
8. Create nodes of official languages.
9. For each country, create a directed relation from country to its official
language(s) node(s).
This network is analyzed in section 4.1.
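Steps 3 to 9 can be sketched as follows. The sketch assumes the usual Wikidata
entity layout (claims keyed by property ID, English labels under labels.en) and
assumes that the official languages property is P37; the report itself does
not name the property, and the function names and the trimmed sample entity are
hypothetical.

```python
OFFICIAL_LANGUAGE = "P37"  # assumed Wikidata property ID for "official language"


def official_language_ids(entity):
    """Collect the language IDs listed under the official language property."""
    ids = []
    for claim in entity.get("claims", {}).get(OFFICIAL_LANGUAGE, []):
        datavalue = claim.get("mainsnak", {}).get("datavalue")
        if datavalue:  # claims without a concrete value carry no datavalue
            ids.append(datavalue["value"]["id"])
    return ids


def english_label(entity):
    """Return the English name of an entity, falling back to its ID."""
    return entity.get("labels", {}).get("en", {}).get("value", entity["id"])


def build_edges(country_entities):
    """Create one directed (country ID, language ID) edge per official language."""
    return [(c["id"], lang) for c in country_entities
            for lang in official_language_ids(c)]


# Heavily trimmed sample entity, for illustration only.
india = {
    "id": "Q668",
    "labels": {"en": {"language": "en", "value": "India"}},
    "claims": {
        "P37": [
            {"mainsnak": {"datavalue": {"value": {"id": "Q1568"}}}},
            {"mainsnak": {"datavalue": {"value": {"id": "Q1860"}}}},
        ]
    },
}
edges = build_edges([india])
distinct_language_ids = sorted({lang for _, lang in edges})  # step 4
print(english_label(india), edges, distinct_language_ids)
```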
2.4 Sister Languages Network Construction
The nodes in the sister languages network are languages, and two languages are
connected by an undirected edge if they are official languages of a common
country. The construction is simple and is outlined below:
1. For each country in the graph of country and its official languages, get the
list of its official languages.
2. For each distinct pair of languages in the list, join them by an undirected
edge.
This network is analyzed in section 4.2.
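The two steps above can be sketched with itertools.combinations. The sketch
keeps a count per language pair, i.e. the number of countries that have both
languages as official languages; the function name and the three-country
sample are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def sister_languages(official):
    """Map each unordered language pair to the number of countries sharing it.

    official: dict mapping a country to the list of its official languages.
    """
    pair_counts = Counter()
    for langs in official.values():
        # Sorting gives every unordered pair a single canonical key.
        for a, b in combinations(sorted(set(langs)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


official = {
    "India": ["Hindi", "English"],
    "Canada": ["English", "French"],
    "Cameroon": ["French", "English"],
}
network = sister_languages(official)
print(dict(network))
```

A country with a single official language contributes no edge, so its language
can remain an isolated node in the network.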
2.5 Sister Countries Network Construction
The nodes in the sister countries network are countries, and two countries are
connected by an undirected edge if they have a common official language. The
construction is a bit more involved than that of section 2.4 and is outlined
below:
1. In the graph of countries and their official languages, reverse the direction
of edges.
2. For each official language, do the following:
(a) Find the countries which share the language as an official language.
This is easy because such countries are now the neighbors of the
language.
(b) For each distinct pair of such countries, connect them by an undirected
edge.
This network is analyzed in section 4.3.
Figure 1: It shows India, as a node, along with its attributes at the bottom.
The <id> attribute is the node ID as used by Neo4j [4], while the id attribute
is the Wikidata ID of India. At the top, we can see the Cypher code which
generated the visualization in Neo4j.
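The two steps of section 2.5 (reverse the edges, then pair up the countries of
each language) can be sketched as follows. The function name, the edge list,
and the choice to record the shared languages per pair are illustrative
assumptions.

```python
from collections import defaultdict
from itertools import combinations


def sister_countries(edges):
    """Map each unordered country pair to the set of languages they share."""
    # Step 1: reverse the edges, so that each language points to its countries.
    speakers = defaultdict(set)
    for country, language in edges:
        speakers[language].add(country)
    # Step 2: connect every distinct pair of countries under each language.
    pairs = defaultdict(set)
    for language, countries in speakers.items():
        for a, b in combinations(sorted(countries), 2):
            pairs[(a, b)].add(language)
    return dict(pairs)


edges = [
    ("India", "Hindi"), ("India", "English"),
    ("Canada", "English"), ("Canada", "French"),
    ("France", "French"),
]
network = sister_countries(edges)
print(network)
```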
3 Graph Visualization
In figure 1, we visualize a typical country node, here India. In figure 2, we
visualize a typical official language node, here English. In figure 3, we showcase
the full view of the directed graph. In figure 4, we address the special case of
Russia. In figure 5, we show the countries with English as their official language.
4 Graph Analysis
In this section, we analyze the graph using various exploratory data analysis
techniques.
4.1 Graph of Countries and their Official Languages
We want to rank the languages in descending order of their importance. For
this we use the in-degree of a node, i.e. the number of edges pointing into it.
A language with a higher in-degree is more worth learning for a language
learner, since it is an official language of a larger number of countries. The
Language Cloud in figure 6 visualizes this ranking.
Figure 2: It shows the official languages of India. At the bottom, one can see
the attribute values of the English language node. The <id> attribute is the
node ID as used by Neo4j, while the id attribute is the Wikidata ID of English.
At the top, one can see the Cypher code used to generate the figure in Neo4j.
English and Hindi are considered children of India in the graph, since directed
edges run from India to Hindi and from India to English.
Figure 3: It shows the full view of the graph of countries and their official
languages, constructed by the procedure described in section 2.3. The languages
are shown in a reddish color and the countries in grey. At the top, we can see
the Cypher code which produced this visualization in Neo4j. Below the code, we
can see some statistics about the graph: there are 202 countries, 173 official
languages, and a total of 366 directed connections.
Figure 4: It shows the 36 official languages of Russia listed on Wikidata, 35
of which are regional official languages. Only Russian is the official language
of the country as a whole. Our script extracted every language listed under
the official languages property of Russia in the Wikidata JSON dumps. It is up
to the user to keep or reject the 35 regional languages. In our analysis we
have kept them, so some results may be affected. The interested reader should
keep this in mind. At the top, we also show the Cypher code which generated the
visualization in Neo4j.
Figure 5: It shows the 66 countries which have English as an official language.
At the top, we can see the Cypher code which produced the visualization in
Neo4j.
Figure 6: It shows the importance of languages based on their in-degree in the
graph of countries and their official languages. Languages which are official
in a larger number of countries have a larger font size. The figure shows that
English, French, Arabic, Spanish, Portuguese, etc. are among the most important
languages [5].
We also rank the countries by the number of official languages they have, i.e.
by their out-degree. Figure 7 shows this ranking.
Figures 8, 9, and 10 show the in-degree, out-degree, and total degree
distributions of the nodes in the countries and their official languages graph.
If we join the dots with a line, the trend appears to decrease exponentially.
The average degree of nodes in the graph is 0.976, which corresponds to the
number of edges per node (366/375).
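The degree-based quantities of this section can be sketched with
collections.Counter. The small edge list is illustrative, and reading the
reported average degree as edges per node is an assumption that happens to
reproduce the value 0.976 from the counts given in figure 3 (202 countries,
173 languages, 366 edges).

```python
from collections import Counter

# Illustrative directed edges (country, official language).
edges = [
    ("India", "Hindi"), ("India", "English"),
    ("Canada", "English"), ("Canada", "French"),
    ("France", "French"), ("Nigeria", "English"),
]

in_degree = Counter(language for _, language in edges)   # importance of languages
out_degree = Counter(country for country, _ in edges)    # languages per country

ranking = in_degree.most_common()          # languages in descending in-degree
distribution = Counter(out_degree.values())  # out-degree -> number of countries

# Edges per node for the full graph, using the counts reported in figure 3.
average_degree = 366 / (202 + 173)

print(ranking)
print(dict(distribution))
print(round(average_degree, 3))
```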
Figure 7: It compares the out-degree of countries, i.e. the number of official
languages they have. This figure is biased by the factor highlighted in figure 4.
Countries with a larger out-degree have a larger font size in the figure.
Russia, Zimbabwe, South Africa, etc. take the lead here.
4.2 Sister Languages Network
A language learner wants to know which language to learn as a second language.
One possible criterion is to learn a sister language of one's native language,
i.e. a language which is official in at least one country where the native
language is also official. A possible reason for such a criterion is that the
learner finds other people in the same country speaking that language. We
construct the sister languages network as described in section 2.4 and
visualize it in figure 11.
The network has 173 nodes, 906 edges, and an average degree of 10.474. It has
a network diameter (A.1) of 5 and an average path length (A.2) of 2.2. It has
41 components (A.3) and a network density (A.4) of 0.061.
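As a quick arithmetic check, the average degree and density of an undirected
network follow directly from its node and edge counts. The helper names are
hypothetical; the density formula is the one given in A.4.

```python
def average_degree(n, m):
    """Average degree of an undirected graph: each edge touches two nodes."""
    return 2 * m / n


def density(n, m):
    """Network density d = 2m / (n(n - 1)), as defined in A.4."""
    return 2 * m / (n * (n - 1))


# Sister languages network: 173 nodes and 906 edges.
print(round(average_degree(173, 906), 3))
print(round(density(173, 906), 3))
```

The same two functions reproduce the values reported for the sister countries
network in section 4.3 from its 202 nodes and 3172 edges.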
Figure 8: It shows the in-degree distribution of nodes in the countries and their
official languages graph as visualized in Gephi.
Figure 9: It shows the out-degree distribution of nodes in the countries and
their official languages graph as visualized in Gephi.
Figure 10: It shows the degree distribution of nodes in the countries and their
official languages graph as visualized in Gephi.
If one needs to spread some information through the network, the most central
node seems to be the apt choice from which to start. In a network, we can find
such a node by computing the closeness centrality (A.5), which is further
illustrated in figure 12.
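A minimal sketch of this computation follows, using breadth-first search and
the sum-of-reciprocal-distances definition given in A.5. The function names and
the three-language toy network are assumptions, and the values a tool such as
Gephi reports may be normalized differently.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest hop counts from source to every node in its component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closeness(adj, x):
    """Sum of 1/d(y, x) over all other nodes y in the component of x (A.5)."""
    return sum(1 / d for node, d in bfs_distances(adj, x).items() if node != x)


# Toy undirected network given as adjacency lists: Hindi - English - French.
adj = {
    "Hindi": ["English"],
    "English": ["Hindi", "French"],
    "French": ["English"],
}
scores = {node: closeness(adj, node) for node in adj}
most_central = max(scores, key=scores.get)
print(scores, most_central)
```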
4.3 Sister Countries Network
Some people may not wish to learn new languages but may wish to travel to
foreign countries. They may choose to travel to countries which share their
native language, or any other language they know, as an official language. To
address this problem, we constructed the Sister Countries Network as described
in section 2.5. It is further illustrated in figure 13.
This network has 202 nodes and 3172 edges, with an average degree of 31.4 per
node. It has a network diameter of 4, an average path length of 1.91, a network
density of 0.156, and a total of 41 components.
Another application of the Sister Countries Network is for businesses which
employ translators. Suppose country A is connected to country B, and country B
is connected to country C. If a business E in country A has to deal with a
business F in country C, but A and C have no official language in common, then
E is likely to hire translators from B, since they are likely to know
languages of both A and C.
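The translator scenario amounts to finding the common neighbors of A and C in
the sister countries network. The sketch below uses a hypothetical helper
intermediaries and a three-country sample in which India and France share no
official language but both share one with Canada.

```python
def intermediaries(adj, a, c):
    """Countries B connected to both a and c, when a and c are not connected."""
    if c in adj[a]:
        return set()  # a direct link means no intermediary is needed
    return adj[a] & adj[c]


# Sister countries network as adjacency sets (illustrative sample).
adj = {
    "India": {"Canada"},             # shared language: English
    "Canada": {"India", "France"},
    "France": {"Canada"},            # shared language: French
}
print(intermediaries(adj, "India", "France"))
```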
5 Technology Deployed
5.1 Python 3.5
Figure 11: It shows the Sister Languages Network. The number of edges
connecting two languages represents the number of countries which have both
languages as their official languages. The large cluster of connections on the
right consists of the languages of Russia. They stand out for the reasons
highlighted in figure 4.
It was deployed for the following tasks:
• Scan the JSON dumps of Wikidata, which were over 7 GB in size.
• Construct and manipulate the graph using the NetworkX [6] library.
• Export the constructed graph into GraphML [7].
• Automatically generate Cypher language code for use in Neo4j.
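The last task can be sketched as plain string generation. The node labels
Country and Language, the relationship type OFFICIAL_LANGUAGE, and the function
names are hypothetical choices for this illustration; the report does not show
the generated code itself.

```python
def quote(text):
    """Render text as a single-quoted Cypher string literal."""
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def cypher_statements(countries, languages, edges):
    """Return CREATE statements for the nodes, then for the directed edges."""
    statements = []
    for node_id, name in sorted(countries.items()):
        statements.append(
            "CREATE (:Country {id: %s, name: %s});" % (quote(node_id), quote(name)))
    for node_id, name in sorted(languages.items()):
        statements.append(
            "CREATE (:Language {id: %s, name: %s});" % (quote(node_id), quote(name)))
    for country_id, language_id in sorted(edges):
        statements.append(
            "MATCH (c:Country {id: %s}), (l:Language {id: %s}) "
            "CREATE (c)-[:OFFICIAL_LANGUAGE]->(l);"
            % (quote(country_id), quote(language_id)))
    return statements


# Sample data for illustration only.
countries = {"Q668": "India"}
languages = {"Q1568": "Hindi", "Q1860": "English"}
edges = [("Q668", "Q1568"), ("Q668", "Q1860")]
statements = cypher_statements(countries, languages, edges)
print("\n".join(statements))
```

Escaping quotes in quote matters for names that contain an apostrophe.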
5.2 Gephi [5]
It was mainly used for exploratory data analysis, which included computing
graph statistics and visualizing the various graphs using the ForceAtlas
layout.
Figure 12: It shows the closeness centrality as computed on the sister languages
network. The darker the node, the better it is for spreading information in the
network.
Figure 13: It shows the sister countries network. In the graph, the number
of edges connecting two countries represents the number of official languages
common to both. The color coding also shows the eccentricity (A.6) of the nodes.
5.3 Neo4j [4]
It was mainly used to visualize the original directed graph to get an overall look
and feel of the graph of countries and their official languages.
5.4 Bash Script
It was mainly used for managing the Neo4j server, which involved tasks like
starting and stopping Neo4j, deleting the existing graph database, and creating
a new one.
6 Conclusion
The results of our analysis show that English, French, Arabic, and Spanish are
the official languages of a large number of nations. The sister languages
network gives language learners a systematic way to choose which languages to
learn based on their background. The sister countries network shows tourists
which destination countries they can visit without spending months learning
new languages.
7 Future Scope
The analysis in this project can be used to build a Language Recommendation
System (LRS) for language learners. It can suggest a language based on the
past and current nationalities of the person, the countries the person wishes
to visit, the languages already known, etc.
The recommendations can be further enhanced by incorporating information about
the number of people speaking each language around the world. A language with
a larger number of speakers would rank higher in the recommendations.
A Graph Terminology
In this section, we describe some of the graph-related terms used in this
report.
A.1 Network Diameter
It is the longest graph distance between two nodes in a network. We do not
consider pairs of nodes which are disconnected, i.e. which have no path from
one node to the other.
A.2 Average Path Length
It is the average number of steps along the shortest paths for all pairs of network
nodes.
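Both quantities can be sketched with one breadth-first search per node,
skipping disconnected pairs as stated in A.1. The function names and the
five-node toy graph are illustrative assumptions.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def diameter_and_average_path_length(adj):
    """Longest and mean shortest-path length over all connected node pairs."""
    lengths = []
    for source in adj:
        for target, d in shortest_path_lengths(adj, source).items():
            if target != source:
                lengths.append(d)
    if not lengths:
        return 0, 0.0
    return max(lengths), sum(lengths) / len(lengths)


# Two components: the path a - b - c and the single edge d - e.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": ["e"], "e": ["d"]}
diameter, average = diameter_and_average_path_length(adj)
print(diameter, average)
```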
A.3 Network Component
A component is a maximal set of nodes in which there exists a path between
each pair of nodes.
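Counting components can be sketched with a simple graph traversal. The function
name and the toy graph are illustrative; note that under this definition an
isolated node forms a component of its own.

```python
def connected_components(adj):
    """Return the components of an undirected graph as a list of node sets."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:  # depth-first traversal from the start node
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node])
        seen |= component
        components.append(component)
    return components


# The path a - b - c, the edge d - e, and the isolated node f.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": ["e"], "e": ["d"], "f": []}
components = connected_components(adj)
print(len(components), components)
```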
A.4 Network Density
Let n be the number of nodes and m be the number of edges. Then network
density is defined as $d = \frac{2m}{n(n-1)}$. A density value of 1 means a
complete graph and a value of 0 means a graph with no edges.
A.5 Closeness Centrality
Closeness centrality measures how close a node is to all other nodes in its
component. Let x and y be nodes in the same component and let d(y, x) be the
shortest distance between them. Then the closeness centrality H of x is
defined as $H(x) = \sum_{y \neq x} \frac{1}{d(y, x)}$.
A.6 Node Eccentricity
In a component, the maximum distance which the node can have from any other
node is its eccentricity.
References
[1] “List of languages by number of native speakers.” https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers.
[2] “The 10 most spoken languages in the world.” https://www.babbel.com/en/magazine/the-10-most-spoken-languages-in-the-world.
[3] “Wikidata json dumps.” https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2.
[4] Neo4j, “Neo4j - the world’s leading graph database,” 2012.
[5] M. Bastian, S. Heymann, and M. Jacomy, “Gephi: An open source software
for exploring and manipulating networks,” 2009.
[6] A. A. Hagberg, D. A. Schult, and P. J. Swart, “Exploring network structure,
dynamics, and function using NetworkX,” in Proceedings of the 7th Python
in Science Conference (SciPy2008), (Pasadena, CA USA), pp. 11–15, Aug.
2008.
[7] G. Team, “The graphml file format,” 2002.
16

More Related Content

Similar to Countries and their Official Languages

2. C# Guide - To Print
2. C# Guide - To Print2. C# Guide - To Print
2. C# Guide - To Print
Chinthaka Fernando
 
Ashwin_Thesis
Ashwin_ThesisAshwin_Thesis
Ashwin_Thesis
Ashwin Ramesh
 
A Survey on Domain-Specific Languages for Machine.pdfA Sur.docx
A Survey on Domain-Specific Languages for Machine.pdfA Sur.docxA Survey on Domain-Specific Languages for Machine.pdfA Sur.docx
A Survey on Domain-Specific Languages for Machine.pdfA Sur.docx
bartholomeocoombs
 
2005_matzon
2005_matzon2005_matzon
FYP_12130648_FriendNav
FYP_12130648_FriendNavFYP_12130648_FriendNav
FYP_12130648_FriendNav
Daniel O'Neill
 
Thesis
ThesisThesis
Master's Thesis Alessandro Calmanovici
Master's Thesis Alessandro CalmanoviciMaster's Thesis Alessandro Calmanovici
Master's Thesis Alessandro Calmanovici
Alessandro Calmanovici
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
LoCloud - D5.4: Analysis and Recommendations
LoCloud - D5.4: Analysis and RecommendationsLoCloud - D5.4: Analysis and Recommendations
LoCloud - D5.4: Analysis and Recommendations
locloud
 
Your Guide to be a Software Engineer
Your Guide to be a Software EngineerYour Guide to be a Software Engineer
Your Guide to be a Software Engineer
Ahmed Mater
 
Software design of library circulation system
Software design of  library circulation systemSoftware design of  library circulation system
Software design of library circulation system
Md. Shafiuzzaman Hira
 
Automated Voice Based Braille Script Teaching Aid Using
Automated Voice Based Braille Script Teaching Aid UsingAutomated Voice Based Braille Script Teaching Aid Using
Automated Voice Based Braille Script Teaching Aid Using
Daphne Smith
 
Brian.suda.thesis
Brian.suda.thesisBrian.suda.thesis
Brian.suda.thesis
Aravindharamanan S
 
MicroFSharp
MicroFSharpMicroFSharp
MicroFSharp
Joachim Hasseldam
 
TCC-MSCR
TCC-MSCRTCC-MSCR
Language Identifier for Languages of Pakistan Including Arabic and Persian
Language Identifier for Languages of Pakistan Including Arabic and PersianLanguage Identifier for Languages of Pakistan Including Arabic and Persian
Language Identifier for Languages of Pakistan Including Arabic and Persian
Waqas Tariq
 
Wise Document Translator Report
Wise Document Translator ReportWise Document Translator Report
Wise Document Translator Report
Raouf KESKES
 
DEVNAGARI DOCUMENT SEGMENTATION USING HISTOGRAM APPROACH
DEVNAGARI DOCUMENT SEGMENTATION USING HISTOGRAM APPROACHDEVNAGARI DOCUMENT SEGMENTATION USING HISTOGRAM APPROACH
DEVNAGARI DOCUMENT SEGMENTATION USING HISTOGRAM APPROACH
ijcseit
 
C-sharping.docx
C-sharping.docxC-sharping.docx
C-sharping.docx
LenchoMamudeBaro
 
Poly_introduction_R.pdf
Poly_introduction_R.pdfPoly_introduction_R.pdf
Poly_introduction_R.pdf
BenjaminTheodorNicai
 

Similar to Countries and their Official Languages (20)

2. C# Guide - To Print
2. C# Guide - To Print2. C# Guide - To Print
2. C# Guide - To Print
 
Ashwin_Thesis
Ashwin_ThesisAshwin_Thesis
Ashwin_Thesis
 
A Survey on Domain-Specific Languages for Machine.pdfA Sur.docx
A Survey on Domain-Specific Languages for Machine.pdfA Sur.docxA Survey on Domain-Specific Languages for Machine.pdfA Sur.docx
A Survey on Domain-Specific Languages for Machine.pdfA Sur.docx
 
2005_matzon
2005_matzon2005_matzon
2005_matzon
 
FYP_12130648_FriendNav
FYP_12130648_FriendNavFYP_12130648_FriendNav
FYP_12130648_FriendNav
 
Thesis
ThesisThesis
Thesis
 
Master's Thesis Alessandro Calmanovici
Master's Thesis Alessandro CalmanoviciMaster's Thesis Alessandro Calmanovici
Master's Thesis Alessandro Calmanovici
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
LoCloud - D5.4: Analysis and Recommendations
LoCloud - D5.4: Analysis and RecommendationsLoCloud - D5.4: Analysis and Recommendations
LoCloud - D5.4: Analysis and Recommendations
 
Your Guide to be a Software Engineer
Your Guide to be a Software EngineerYour Guide to be a Software Engineer
Your Guide to be a Software Engineer
 
Software design of library circulation system
Software design of  library circulation systemSoftware design of  library circulation system
Software design of library circulation system
 
Automated Voice Based Braille Script Teaching Aid Using
Automated Voice Based Braille Script Teaching Aid UsingAutomated Voice Based Braille Script Teaching Aid Using
Automated Voice Based Braille Script Teaching Aid Using
 
Brian.suda.thesis
Brian.suda.thesisBrian.suda.thesis
Brian.suda.thesis
 
MicroFSharp
MicroFSharpMicroFSharp
MicroFSharp
 
TCC-MSCR
TCC-MSCRTCC-MSCR
TCC-MSCR
 
Language Identifier for Languages of Pakistan Including Arabic and Persian
Language Identifier for Languages of Pakistan Including Arabic and PersianLanguage Identifier for Languages of Pakistan Including Arabic and Persian
Language Identifier for Languages of Pakistan Including Arabic and Persian
 
Wise Document Translator Report
Wise Document Translator ReportWise Document Translator Report
Wise Document Translator Report
 
DEVNAGARI DOCUMENT SEGMENTATION USING HISTOGRAM APPROACH
DEVNAGARI DOCUMENT SEGMENTATION USING HISTOGRAM APPROACHDEVNAGARI DOCUMENT SEGMENTATION USING HISTOGRAM APPROACH
DEVNAGARI DOCUMENT SEGMENTATION USING HISTOGRAM APPROACH
 
C-sharping.docx
C-sharping.docxC-sharping.docx
C-sharping.docx
 
Poly_introduction_R.pdf
Poly_introduction_R.pdfPoly_introduction_R.pdf
Poly_introduction_R.pdf
 

Recently uploaded

A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
dataschool1
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
nyvan3
 
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
nhutnguyen355078
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
eudsoh
 
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
1tyxnjpia
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
mkkikqvo
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
z6osjkqvd
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
actyx
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
oaxefes
 
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
yuvarajkumar334
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
NABLAS株式会社
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
Vineet
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
eoxhsaa
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
9gr6pty
 
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
osoyvvf
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
Vietnam Cotton & Spinning Association
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
Vietnam Cotton & Spinning Association
 

Recently uploaded (20)

A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
 
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
 
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
 
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
 
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
 

Countries and their Official Languages

  • 1. Indian Institute of Information Technology, Allahabad Graph Theory Project Report Countries and Official Languages Naimish Agarwal irm2013013@iiita.ac.in Signature Dr Rishi Ranjan Singh Assistant Professor
  • 2. Contents 1 Introduction 3 2 Methodology 3 2.1 Graph Representation . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.3 Countries and their Official Languages Graph Construction . . . 4 2.4 Sister Languages Network Construction . . . . . . . . . . . . . . 4 2.5 Sister Countries Network Construction . . . . . . . . . . . . . . . 4 3 Graph Visualization 5 4 Graph Analysis 5 4.1 Graph of Countries and their Official Languages . . . . . . . . . 5 4.2 Sister Languages Network . . . . . . . . . . . . . . . . . . . . . . 9 4.3 Sister Countries Network . . . . . . . . . . . . . . . . . . . . . . 11 5 Technology Deployed 11 5.1 Python 3.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 5.2 Gephi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 5.3 Neo4j . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 5.4 Bash Script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 6 Conclusion 14 7 Future Scope 14 A Graph Terminology 15 A.1 Network Diameter . . . . . . . . . . . . . . . . . . . . . . . . . . 15 A.2 Average Path Length . . . . . . . . . . . . . . . . . . . . . . . . . 15 A.3 Network Component . . . . . . . . . . . . . . . . . . . . . . . . . 15 A.4 Network Density . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 A.5 Closeness Centrality . . . . . . . . . . . . . . . . . . . . . . . . . 15 A.6 Node Eccentricity . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 References 16 1
List of Figures
1 Country Node Visualization
2 Language Node Visualization
3 Full Graph View
4 Russia and its official languages
5 Countries which speak English as their Official Language
6 Language Cloud based on In-degree
7 Country Cloud based on Out-degree
8 In-degree Distribution
9 Out-degree Distribution
10 Degree Distribution
11 Sister Languages Network
12 Closeness Centrality in Sister Languages Network
13 Sister Countries Network
1 Introduction

We live in a multilingual environment, surrounded by people who speak different languages. Some of us dream of travelling around the world, meeting new people, learning new languages and mingling with a new culture. In the job market, multinational firms look for professional translators who can help them achieve their business objectives.

Many languages are spoken in the world. A language learner faces the challenge of deciding which language to learn as a second, third or fourth language, based on factors such as ease of learning, the countries the learner wishes to visit, and the other languages spoken in the learner's own country. Studies have ranked languages by their number of speakers [1] [2]. They report Mandarin, Spanish, English, Hindi, Arabic, Portuguese, Bengali, Russian, Japanese and Punjabi as the ten most spoken languages in the world. However, such a ranking may not be useful to a language learner: China and India have the largest populations, which biases the results towards the languages spoken in these nations.

In this project, we address the challenge faced by a language learner by representing the problem as a graph of countries and their official languages.

2 Methodology

In section 2.1, we describe how we represent our graph. In section 2.2, we describe our data source. In section 2.3, we describe our approach to constructing the directed graph between countries and their official languages. In section 2.4, we describe the procedure for constructing the sister languages network. In section 2.5, we describe the procedure for constructing the sister countries network.

2.1 Graph Representation

A graph G = (V, E) consists of a set of nodes V and a set of edges E. The set of nodes V consists of countries and official languages. For each country, we consider Country ID, Name and Description as attributes. Likewise, for each official language, we consider Language ID, Name and Description as attributes. The edges in E are directed relations from countries to their official languages.

2.2 Data Collection

We collected the data about countries and their official languages by scanning the JSON dumps of Wikidata [3] in March 2017. The dump is a compressed file of metadata about Wiki projects, in which each Wiki entity is represented as a JSON string. The JSON string of each entity starts on a new line.
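Because every entity occupies its own line, the dump can be scanned as a stream without loading it into memory. The following is a minimal sketch of such a scan, not the script used in this project. It assumes that the dump is one large JSON array whose first and last lines are the brackets and whose entity lines end with a comma; the function names and the trimmed sample entities are hypothetical.

```python
import bz2
import json


def iter_entities(lines):
    """Yield one dict per entity from an iterable of dump lines."""
    for line in lines:
        line = line.strip().rstrip(",")  # entity lines end with a comma
        if line in ("", "[", "]"):       # skip the enclosing array brackets
            continue
        yield json.loads(line)


def extract_entities(lines, wanted_ids):
    """Return {entity ID: entity} for the IDs listed in wanted_ids."""
    wanted = set(wanted_ids)
    return {e["id"]: e for e in iter_entities(lines) if e["id"] in wanted}


def scan_dump(path, wanted_ids):
    """Stream a bz2-compressed dump from disk, one line at a time."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return extract_entities(handle, wanted_ids)


# Tiny in-memory stand-in for the dump (heavily trimmed, illustrative entities).
sample = [
    "[",
    '{"id": "Q668", "labels": {"en": {"value": "India"}}},',
    '{"id": "Q1860", "labels": {"en": {"value": "English"}}},',
    '{"id": "Q42", "labels": {"en": {"value": "Douglas Adams"}}}',
    "]",
]
found = extract_entities(sample, ["Q668", "Q1860"])
print(sorted(found))
```

Calling `scan_dump` once with the country IDs and once with the language IDs would correspond to the two passes over the dump described in section 2.3.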
2.3 Countries and their Official Languages Graph Construction

In this section, we list the steps taken to construct the directed graph of countries and their official languages:

1. Collect the Wikidata IDs of the countries manually. This was feasible because we considered only 202 countries.
2. In the first pass over the JSON dumps, extract the JSON strings of the countries using the list of Wikidata IDs, and store them in separate JSON files.
3. Scan the JSON file of each country and build a global list of the Wikidata IDs of their official languages. This step is needed because a country's JSON file lists only the IDs of its official languages, not their names.
4. Find all the distinct Wikidata IDs of the official languages.
5. In the second pass over the JSON dumps, extract the JSON strings of the official languages and store them in separate files.
6. Scan the JSON file of each official language and extract its name.
7. Create the nodes of the countries.
8. Create the nodes of the official languages.
9. For each country, create a directed relation from the country node to the node of each of its official languages.

This network is analyzed in section 4.1.

2.4 Sister Languages Network Construction

The nodes in the sister languages network are languages, and two languages are connected by an undirected edge if they are official languages of a common country. The construction is simple and is outlined below:

1. For each country in the graph of countries and their official languages, get the list of its official languages.
2. For each distinct pair of languages in the list, join them by an undirected edge.

This network is analyzed in section 4.2.

2.5 Sister Countries Network Construction

The nodes in the sister countries network are countries, and two countries are connected by an undirected edge if they have a common official language. The construction is a bit more involved than that of section 2.4 and is outlined below:

1. In the graph of countries and their official languages, reverse the direction of the edges.
2. For each official language, do the following:
(a) Find the countries which share the language as an official language. This is easy because such countries are now the neighbors of the language.
(b) For each distinct pair of such countries, connect them by an undirected edge.

This network is analyzed in section 4.3.

3 Graph Visualization

In figure 1, we visualize a typical country node, here India. In figure 2, we visualize a typical official language node, here English. In figure 3, we show the full view of the directed graph. In figure 4, we address the special case of Russia. In figure 5, we show the countries with English as their official language.

Figure 1: It shows India, as a node, along with its attributes at the bottom. The <id> attribute is the node ID as used by Neo4j [4], while the id attribute is the Wikidata ID of India. At the top, we can see the Cypher code which generated the visualization in Neo4j.

Figure 2: It shows the official languages of India. At the bottom, one can see the attribute values of the English language. The <id> attribute is the node ID as used by Neo4j, while the id attribute is the Wikidata ID of English. At the top, one can see the Cypher code used to generate the visualization in Neo4j. English and Hindi are considered children of India in the graph, since directed edges are present from India to Hindi and from India to English.

Figure 3: It shows the full view of the graph of countries and their official languages, constructed by the procedure described in section 2.3. The languages are shown in a reddish color, while the countries are shown in grey. At the top, we can see the Cypher code which generated this visualization in Neo4j. Below the code, we can see some statistics about the graph: there are 202 countries, 173 official languages, and a total of 366 directed connections.

Figure 4: It shows the 36 languages listed as official languages of Russia. Wikidata lists 36 languages, of which 35 are regional official languages; only Russian is the official language of Russia. Our script extracted all the languages listed under the official languages property of Russia in the Wikidata JSON dumps. It is up to the user to reject or keep the 35 regional languages. In our analysis, we have kept them, so some results may be affected. The interested reader should keep this in mind. At the top, we also show the Cypher code which generated the visualization in Neo4j.

Figure 5: It shows the 66 countries which have English as their official language. At the top, we can see the Cypher code which generated the visualization in Neo4j.

4 Graph Analysis

In this section, we analyze the graph using various exploratory data analysis techniques.

4.1 Graph of Countries and their Official Languages

We want to know the importance of languages and rank them in descending order of importance. For this, we use the in-degree of a node, i.e. the number of its incoming edges. A language with a higher in-degree is more important for a language learner, since it is an official language in a larger number of countries. To address this problem, we visualize the Language Cloud in figure 6.

Figure 6: It shows the importance of languages based on their in-degree in the graph of countries and their official languages. The languages which are official languages of a larger number of countries have a larger font size. The figure shows that English, French, Arabic, Spanish, Portuguese, etc. are among the most important languages [5].

We are also interested in ranking the countries by the number of official languages they have, i.e. by their out-degree. We address this problem in figure 7.

Finally, we plot the degree distributions of the nodes in the graph of countries and their official languages in figures 8, 9 and 10. If we join the dots with a line, we find an exponentially decreasing trend. The average degree of the nodes in the graph is 0.976, i.e. 366 edges divided by 375 nodes.
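The construction steps of sections 2.3 to 2.5 and the degree-based ranking above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the NetworkX code used in the project. The toy `official` mapping is hypothetical, and parallel edges between the same pair of nodes are collapsed into one for simplicity.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: country -> list of its official languages.
official = {
    "India": ["Hindi", "English"],
    "Canada": ["English", "French"],
    "France": ["French"],
    "Mexico": ["Spanish"],
}

# Section 2.3, step 9: one directed edge per (country, language) pair.
edges = [(country, lang) for country, langs in official.items() for lang in langs]

# Section 4.1: in-degree of languages, out-degree of countries.
in_degree = Counter(lang for _, lang in edges)
out_degree = Counter(country for country, _ in edges)
nodes = set(official) | set(in_degree)
average_degree = len(edges) / len(nodes)  # edges per node, as in section 4.1
language_ranking = sorted(in_degree.items(), key=lambda kv: (-kv[1], kv[0]))

# Section 2.4: join every pair of languages that share a country.
sister_languages = set()
for langs in official.values():
    for a, b in combinations(sorted(set(langs)), 2):
        sister_languages.add((a, b))

# Section 2.5: reverse the edges, then join every pair of countries
# that are neighbors of the same language.
countries_of = {}
for country, lang in edges:
    countries_of.setdefault(lang, []).append(country)
sister_countries = set()
for countries in countries_of.values():
    for a, b in combinations(sorted(countries), 2):
        sister_countries.add((a, b))

print(language_ranking)
print(sorted(sister_languages))
print(sorted(sister_countries))
```

Sorting each pair before adding it to the set means that an undirected edge is stored only once, whichever country or language is visited first.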
Figure 7: It compares the out-degree of countries, i.e. the number of official languages they have. This figure is biased by the factor highlighted in figure 4. The countries which have a larger out-degree have a larger font size in the figure. It is evident that Russia, Zimbabwe, South Africa, etc. take the lead here.

Figure 8: It shows the in-degree distribution of nodes in the countries and their official languages graph, as visualized in Gephi.

Figure 9: It shows the out-degree distribution of nodes in the countries and their official languages graph, as visualized in Gephi.

Figure 10: It shows the degree distribution of nodes in the countries and their official languages graph, as visualized in Gephi.

4.2 Sister Languages Network

A language learner wants to know which language to learn as a second language. One possible criterion is to learn a sister language, i.e. a language which shares a country with the learner's native language. A possible reason for setting such a criterion is that the learner finds other people from the same country speaking that language.

We construct the sister languages network as described in section 2.4 and visualize it in figure 11. The network has 173 nodes, 906 edges, and an average degree of 10.474. It has a network diameter (A.1) of 5 and an average path length (A.2) of 2.2. It has 41 components (A.3) and a network density (A.4) of 0.061.

If one needs to diffuse or spread some information in the network, the most central node seems to be the apt starting point. We can find such a node by computing the closeness centrality (A.5), which is illustrated in figure 12.

Figure 11: It shows the Sister Languages Network. The number of edges connecting two languages represents the number of countries which have both languages as official languages. The large cluster of connections on the right consists of the languages of Russia. They stand out for the reasons highlighted in figure 4.

4.3 Sister Countries Network

Some people may not wish to learn new languages but may wish to travel to foreign countries. They may choose to travel to countries which speak their native language or another language they know. To address this problem, we constructed the Sister Countries Network as described in section 2.5; it is illustrated in figure 13. This network has 202 nodes and 3172 edges, with an average degree of 31.4 per node. It has a network diameter of 4, an average path length of 1.91, a network density of 0.156, and a total of 41 components.

Another application of the Sister Countries Network is for businesses which employ translators. Consider the scenario in which country A is connected to country B, and country B is connected to country C. If a business E in country A has to enter into business terms with a business F in country C, but A and C have no official language in common, then E is likely to hire translators from B, since they may know the languages of both A and C.

5 Technology Deployed

5.1 Python 3.5

It was deployed for the following tasks:
• Scrape the JSON dumps of Wikidata, which were over 7 GB in size.
• Construct and manipulate the graph using the NetworkX [6] library.
• Export the constructed graph into GraphML [7].
• Automatically generate Cypher code for use in Neo4j.

5.2 Gephi [5]

It was mainly used for exploratory data analysis, which included computing graph statistics and visualizing the various graphs using the Force Atlas layout.
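For the last Python task listed in section 5.1, generating Cypher code, a small sketch may help. It is an illustration under stated assumptions, not the generator used in this project: the node labels `Country` and `Language`, the relationship type `OFFICIAL_LANGUAGE` and the helper names are hypothetical, while the `id` and `name` properties follow the node attributes described in section 2.1.

```python
def cypher_literal(value):
    """Quote a Python string as a single-quoted Cypher string literal."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def node_statement(label, wikidata_id, name):
    """Build a CREATE statement for one country or language node."""
    return "CREATE (:{} {{id: {}, name: {}}});".format(
        label, cypher_literal(wikidata_id), cypher_literal(name))


def edge_statement(country_id, language_id):
    """Build a statement linking a country to one official language."""
    return (
        "MATCH (c:Country {{id: {}}}), (l:Language {{id: {}}}) "
        "CREATE (c)-[:OFFICIAL_LANGUAGE]->(l);"
    ).format(cypher_literal(country_id), cypher_literal(language_id))


# Illustrative input: Wikidata ID -> name, plus (country, language) pairs.
countries = {"Q668": "India"}
languages = {"Q1568": "Hindi", "Q1860": "English"}
relations = [("Q668", "Q1568"), ("Q668", "Q1860")]

script = [node_statement("Country", i, n) for i, n in sorted(countries.items())]
script += [node_statement("Language", i, n) for i, n in sorted(languages.items())]
script += [edge_statement(c, l) for c, l in relations]
print("\n".join(script))
```

Escaping quotes in `cypher_literal` matters for names that contain an apostrophe, such as Côte d'Ivoire.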
Figure 12: It shows the closeness centrality as computed on the sister languages network. The darker the node, the better it is for spreading information in the network.

Figure 13: It shows the sister countries network. In the graph, the number of edges connecting two countries represents the number of official languages common to both. The color coding also shows the eccentricity (A.6) of the nodes.
5.3 Neo4j [4]

It was mainly used to visualize the original directed graph, to get an overall look and feel of the graph of countries and their official languages.

5.4 Bash Script

It was mainly used for managing the Neo4j server, which involved tasks like starting and stopping Neo4j, deleting the existing graph database and creating a new one.

6 Conclusion

The results of our analysis show that English, French, Arabic and Spanish are the official languages of a large number of nations. The analysis addresses the long-standing need of language learners for a systematic learning path for languages based on their background. It also shows tourists which destination countries they can visit without having to spend months learning new languages.

7 Future Scope

The analysis in this project can be used to build a Language Recommendation System (LRS) for a language learner. It can suggest a language based on the person's past and current nationalities, the countries the person wishes to visit, the languages already known, etc.

The recommendations can be further enhanced by incorporating information about the number of people speaking a language around the world. A language with a higher number of speakers would receive a higher rank in the recommendation.
A Graph Terminology

In this section, we describe the graph-related terms used in this report.

A.1 Network Diameter

It is the longest graph distance between any two nodes in a network. We do not consider pairs of nodes which are disconnected, i.e. which have no path from one node to the other.

A.2 Average Path Length

It is the average number of steps along the shortest paths for all pairs of network nodes.

A.3 Network Component

In a component, there exists a path between each pair of nodes.

A.4 Network Density

Let n be the number of nodes and m be the number of edges. Then the network density is defined as $d = \frac{2m}{n(n-1)}$. A density of 1 means a complete graph, and a density of 0 means a graph with no edges.

A.5 Closeness Centrality

Closeness centrality measures how close a node is to all other nodes in its component. Let x and y be nodes in the same component, and let d(y, x) be the shortest distance between y and x. Then the closeness centrality H of x is defined as $H(x) = \sum_{y \neq x} \frac{1}{d(y, x)}$.

A.6 Node Eccentricity

In a component, the eccentricity of a node is the maximum distance it can have from any other node.
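The definitions above can be checked on a small example. The sketch below computes them with breadth-first search on a toy undirected graph. It is an illustration written directly from the definitions in this appendix, not the Gephi implementation; the toy adjacency data and the function names are hypothetical, and the average path length is taken over connected pairs only, which is an assumption.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def density(n, m):
    """A.4: d = 2m / (n(n-1)) for an undirected graph."""
    return 2 * m / (n * (n - 1))


def summarize(adj):
    """Compute the appendix quantities for a graph with at least one edge."""
    all_dist = {v: bfs_distances(adj, v) for v in adj}
    # Disconnected pairs never appear in all_dist, so they are ignored (A.1).
    pair_d = [d for v in adj for d in all_dist[v].values() if d > 0]
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return {
        "density": density(len(adj), m),
        "diameter": max(pair_d),
        "average_path_length": sum(pair_d) / len(pair_d),
        "components": len({frozenset(all_dist[v]) for v in adj}),
        "closeness": {v: sum(1 / d for d in all_dist[v].values() if d > 0)
                      for v in adj},
        "eccentricity": {v: max(all_dist[v].values()) for v in adj},
    }


# Toy sister languages network: a path of three languages and one isolated node.
adj = {
    "Hindi": {"English"},
    "English": {"Hindi", "French"},
    "French": {"English"},
    "Spanish": set(),
}
stats = summarize(adj)
print(stats["diameter"], stats["components"], stats["closeness"]["English"])
print(round(density(173, 906), 3), round(density(202, 3172), 3))
```

The last line applies the density formula to the node and edge counts reported in sections 4.2 and 4.3.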
References

[1] “List of languages by number of native speakers.” https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers.
[2] “The 10 most spoken languages in the world.” https://www.babbel.com/en/magazine/the-10-most-spoken-languages-in-the-world.
[3] “Wikidata json dumps.” https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2.
[4] Neo4j, “Neo4j - the world’s leading graph database,” 2012.
[5] M. Bastian, S. Heymann, and M. Jacomy, “Gephi: An open source software for exploring and manipulating networks,” 2009.
[6] A. A. Hagberg, D. A. Schult, and P. J. Swart, “Exploring network structure, dynamics, and function using NetworkX,” in Proceedings of the 7th Python in Science Conference (SciPy2008), (Pasadena, CA USA), pp. 11–15, Aug. 2008.
[7] G. Team, “The graphml file format,” 2002.