Countries and their Official Languages

Indian Institute of Information
Technology, Allahabad
Graph Theory Project Report
Countries and Official Languages
Naimish Agarwal
irm2013013@iiita.ac.in
Signature
Dr Rishi Ranjan Singh
Assistant Professor

Contents
1 Introduction
2 Methodology
2.1 Graph Representation
2.2 Data Collection
2.3 Countries and their Official Languages Graph Construction
2.4 Sister Languages Network Construction
2.5 Sister Countries Network Construction
3 Graph Visualization
4 Graph Analysis
4.1 Graph of Countries and their Official Languages
4.2 Sister Languages Network
4.3 Sister Countries Network
5 Technology Deployed
5.1 Python 3.5
5.2 Gephi
5.3 Neo4j
5.4 Bash Script
6 Conclusion
7 Future Scope
A Graph Terminology
A.1 Network Diameter
A.2 Average Path Length
A.3 Network Component
A.4 Network Density
A.5 Closeness Centrality
A.6 Node Eccentricity
References

List of Figures
1 Country Node Visualization
2 Language Node Visualization
3 Full Graph View
4 Russia and its official languages
5 Countries which speak English as their Official Language
6 Language Cloud based on In-degree
7 Country Cloud based on Out-degree
8 In-degree Distribution
9 Out-degree distribution
10 Degree distribution
11 Sister Languages Network
12 Closeness Centrality in Sister Languages Network
13 Sister Countries Network

1 Introduction
We live in a multilingual environment, surrounded by people who speak different
languages. Some of us dream of travelling around the world, meeting new people,
learning new languages and mingling with a new culture. In the job market,
multinational firms look for professional translators who can help them achieve
their business objectives.
There are many languages spoken in the world. A language learner faces the
challenge of deciding which language to learn as a second, third or fourth
language, based on factors such as ease of learning, the countries he wishes to
travel to, and the other languages spoken in his country.
Studies have ranked languages by their number of speakers [1] [2]. They report
Mandarin, Spanish, English, Hindi, Arabic, Portuguese, Bengali, Russian,
Japanese and Punjabi as the ten most spoken languages in the world. However,
such a ranking may not be useful for a language learner: China and India have
the largest populations, which biases the results towards the languages spoken
in these nations.
In this project, we address the challenge faced by a language learner by
representing the problem as a graph of countries and their official languages.
2 Methodology
In section 2.1, we describe how we represented our graph. In section 2.2, we
describe our data source. In section 2.3, we describe our approach to construct
the directed graph between countries and their official languages. In section 2.4,
we describe the procedure to construct the sister languages network. In section
2.5, we describe the procedure for constructing the sister countries network.
2.1 Graph Representation
A graph G = (V, E) consists of a set of nodes V and a set of edges E.
The set of nodes V consists of countries and official languages. For each
country, we store Country ID, Name and Description as attributes. Likewise, for
each official language, we store Language ID, Name and Description.
The edges in E are directed relations from countries to their official
languages.
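
The representation can be illustrated with a small library-free sketch. The report itself used NetworkX; the dictionary layout, the `kind` field, the helper `official_languages` and the short descriptions below are assumptions made for illustration, and only three nodes are shown.

```python
# Library-free sketch of G = (V, E); the report itself used NetworkX.
# Node attributes follow section 2.1: ID (the dictionary key), name, description.
# The descriptions are illustrative placeholders, not data from the report.
nodes = {
    "Q668": {"kind": "country", "name": "India",
             "description": "country in South Asia"},
    "Q1568": {"kind": "language", "name": "Hindi",
              "description": "Indo-Aryan language"},
    "Q1860": {"kind": "language", "name": "English",
              "description": "West Germanic language"},
}

# Directed edges point from a country to each of its official languages.
edges = {("Q668", "Q1568"), ("Q668", "Q1860")}


def official_languages(country_id):
    """Return the sorted names of the languages a country points to."""
    return sorted(nodes[dst]["name"] for src, dst in edges if src == country_id)


print(official_languages("Q668"))
```

Keeping the Wikidata ID as the node key means that a country and a language can never collide, even if they happen to share a name.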
2.2 Data Collection
We collected the data about countries and their official languages by scanning
the JSON dumps of Wikidata [3] in March 2017. A dump is a compressed file of
structured data about Wikidata entities, in which each entity is represented as
a JSON string. The JSON string of each entity starts on a new line.
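
Because each entity sits on its own line, the dump can be streamed without loading it into memory. The following is a minimal sketch under the assumption that the file is one large JSON array whose entity lines end with a comma; the function names `iter_entities` and `iter_dump` are hypothetical and are not taken from the project's scripts.

```python
import bz2
import json


def iter_entities(lines):
    """Yield one parsed entity per line of a Wikidata JSON dump."""
    for line in lines:
        line = line.strip().rstrip(",")   # entity lines end with a comma
        if line in ("", "[", "]"):        # assumed: the dump is one JSON array
            continue
        yield json.loads(line)


def iter_dump(path):
    """Stream entities from a bz2-compressed dump, one line at a time."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        yield from iter_entities(handle)


# Tiny in-memory stand-in for the dump, for illustration only.
sample = ["[", '{"id": "Q668", "type": "item"},', '{"id": "Q1860", "type": "item"}', "]"]
print([entity["id"] for entity in iter_entities(sample)])
```

Separating `iter_entities` from the file handling makes the parsing logic testable on a handful of lines before a full pass over the dump is attempted.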

2.3 Countries and their Official Languages Graph Construction
In this section, we list the steps taken to construct the directed graph of
countries and their official languages:
1. Collect the Wikidata IDs of countries manually. This was feasible because
we considered only 202 countries.
2. In the first pass over the JSON dumps, extract the JSON strings of the
countries using the list of Wikidata IDs and store them in separate JSON files.
3. Scan the JSON file of each country, and make a global list of the Wikidata
IDs of their official languages. This is needed because the JSON file of a
country lists only the IDs of its official languages and not their names.
4. Find all the distinct Wikidata IDs of the official languages.
5. In the second pass over the JSON dumps, get the JSON strings of the
official languages and store them in separate files.
6. Scan the JSON file of each official language and extract its name.
7. Create nodes of countries.
8. Create nodes of official languages.
9. For each country, create a directed relation from the country node to each
of its official language nodes.
This network is analyzed in section 4.1.
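
Steps 3 and 6 to 9 can be sketched as follows. This is an illustration rather than the project's actual script: it assumes the usual Wikidata entity layout (`claims`, `mainsnak`, `datavalue`, `labels`) and that the official language property is P37; the function names and the abbreviated sample entities are hypothetical.

```python
OFFICIAL_LANGUAGE = "P37"  # assumed Wikidata property ID for "official language"


def official_language_ids(entity):
    """Step 3: return the item IDs listed under a country's P37 claims."""
    ids = []
    for statement in entity.get("claims", {}).get(OFFICIAL_LANGUAGE, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":   # skip "novalue" / "somevalue"
            continue
        value = snak["datavalue"]["value"]
        # Assumption: the value carries "id"; otherwise fall back to "numeric-id".
        ids.append(value.get("id") or "Q%d" % value["numeric-id"])
    return ids


def english_label(entity):
    """Step 6: return the English label of an entity, or its ID if missing."""
    return entity.get("labels", {}).get("en", {}).get("value", entity["id"])


def build_edges(countries, languages):
    """Steps 7 to 9: directed (country name, language name) pairs."""
    edges = []
    for country in countries.values():
        for language_id in official_language_ids(country):
            if language_id in languages:
                edges.append((english_label(country),
                              english_label(languages[language_id])))
    return edges


def claim(item_id):
    """Build a minimal P37 statement for the sample data below."""
    return {"mainsnak": {"snaktype": "value",
                         "datavalue": {"value": {"id": item_id}}}}


# Heavily abbreviated sample entities, for illustration only.
countries = {"Q668": {"id": "Q668", "labels": {"en": {"value": "India"}},
                      "claims": {"P37": [claim("Q1568"), claim("Q1860")]}}}
languages = {"Q1568": {"id": "Q1568", "labels": {"en": {"value": "Hindi"}}},
             "Q1860": {"id": "Q1860", "labels": {"en": {"value": "English"}}}}
print(build_edges(countries, languages))
```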
2.4 Sister Languages Network Construction
The nodes in the sister languages network are languages, and two languages are
connected by an undirected edge if they are official languages of a common
country. The construction is simple and is outlined below:
1. For each country in the graph of country and its official languages, get the
list of its official languages.
2. For each distinct pair of languages in the list, join them by an undirected
edge.
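
The two steps above can be sketched in a few lines of plain Python. The function name `sister_languages` and the three-country sample are illustrative assumptions; edge multiplicity is kept as a count, in line with the reading that several countries may connect the same pair of languages.

```python
from collections import Counter
from itertools import combinations


def sister_languages(official):
    """Map each unordered language pair to the number of countries sharing it.

    `official` maps a country name to the list of its official languages.
    """
    weights = Counter()
    for languages in official.values():
        # Step 2: every distinct pair within one country becomes an edge.
        for first, second in combinations(sorted(set(languages)), 2):
            weights[(first, second)] += 1
    return weights


# Small illustrative subset, not the full data set of the report.
official = {
    "India": ["Hindi", "English"],
    "Canada": ["English", "French"],
    "Cameroon": ["English", "French"],
}
print(sister_languages(official))
```

Sorting each language list before pairing ensures that the pair (English, French) and the pair (French, English) are counted as the same undirected edge.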
Figure 1: It shows India, as a node, along with its attributes at the bottom.
The <id> attribute is the node ID as used by Neo4j [4], while the id attribute
is the Wikidata ID of India. At the top, we can see the Cypher code which
generated the visualization in Neo4j.
2.5 Sister Countries Network Construction
The nodes in the sister countries network are countries, and two countries are
connected by an undirected edge if they have a common official language. The
construction is a bit more involved than that of section 2.4 and is outlined
below:
1. In the graph of countries and their official languages, reverse the direction
of edges.
2. For each official language, do the following:
(a) Find the countries which share the language as their official language.
This is easy because such countries are now the neighbors of the
language.
(b) For each distinct pair of such countries, connect them by an undirected
edge.
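
The procedure of section 2.5 can be sketched as follows, again without any graph library. Reversing the edges is modelled by grouping countries under each language; the function name `sister_countries` and the sample edges are assumptions for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def sister_countries(edges):
    """Map each unordered country pair to the number of shared languages.

    `edges` is an iterable of directed (country, language) pairs.
    """
    # Step 1: reverse the edges, so each language lists its countries.
    speakers = defaultdict(set)
    for country, language in edges:
        speakers[language].add(country)

    # Step 2: connect every distinct pair of countries under one language.
    weights = Counter()
    for countries in speakers.values():
        for first, second in combinations(sorted(countries), 2):
            weights[(first, second)] += 1
    return weights


# Small illustrative subset, not the full data set of the report.
edges = [
    ("India", "Hindi"), ("India", "English"),
    ("Canada", "English"), ("Canada", "French"),
    ("Cameroon", "English"), ("Cameroon", "French"),
    ("France", "French"),
]
print(sister_countries(edges))
```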
3 Graph Visualization
In figure 1, we visualize a typical country node, here India. In figure 2, we
visualize a typical official language node, here English. In figure 3, we showcase
the full view of the directed graph. In figure 4, we address the special case of
Russia. In figure 5, we show the countries with English as their official language.
Figure 2: It shows the official languages of India. At the bottom, one can see
the attribute values of English language. The <id> attribute is the node ID as
used by Neo4j, while the id attribute is the Wikidata ID of English. At the
top, one can see the Cypher code used to visualize the figure in Neo4j. English
and Hindi are considered children of India in the graph, since directed edges are
present from India to Hindi and India to English.
Figure 3: It shows the full view of the graph of countries and their official
languages. It is constructed based on the procedure described in section 2.3.
The languages are shown in reddish color while the countries are shown in grey
color. At the top we can see the Cypher code which resulted in this visualization
in Neo4j. Below the code, we can see some statistics about the graph: there
are 202 countries, 173 official languages, and a total of 366 directed
connections in the graph.
Figure 4: It shows 36 official languages of Russia. On Wikidata, 36 languages
are mentioned, which include 35 regional official languages. Of these, only
Russian is the official language of Russia as a whole. Our script has extracted
all the languages which were mentioned under the official languages property of
Russia in the Wikidata JSON dumps. It is up to the user to reject or keep the 35
regional languages. In our analysis, we have kept them, so some results may be
affected by this. The interested reader should keep this in mind. At the top,
we also show the Cypher code which generated the visualization in Neo4j.
Figure 5: It shows the 66 countries which have English as their official language.
At the top, we can see the Cypher code which resulted in the shown visualization
in Neo4j.
4 Graph Analysis
In this section, we analyze the graph using various exploratory data analysis
techniques.
4.1 Graph of Countries and their Official Languages
We want to know the importance of languages and to rank them in descending
order of importance. We can use the in-degree of a node, i.e. the number of
edges entering it. Languages with a higher in-degree are more important for a
language learner to learn, since they are spoken in a larger number of
countries. To address this problem, we have visualized the Language Cloud as
shown in figure 6.
Figure 6: It shows the importance of languages based on their In-Degree in the
graph of countries and their official languages. The languages which are
official languages in a larger number of countries have a larger font size in
the graph. The graph shows that English, French, Arabic, Spanish, Portuguese,
etc are among the most important languages [5].
We are interested in ranking the countries by the number of official languages
they have. In other words, we are interested in ranking the countries based on
their out-degree. We address this problem in figure 7.
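
Both rankings reduce to counting edge endpoints. The sketch below assumes the graph is available as a list of directed (country, language) pairs; the function name `degree_rankings` and the sample edges are illustrative and do not reproduce the report's data.

```python
from collections import Counter


def degree_rankings(edges):
    """Rank languages by in-degree and countries by out-degree."""
    in_degree = Counter(language for _, language in edges)
    out_degree = Counter(country for country, _ in edges)
    # most_common() sorts by count, largest first.
    return in_degree.most_common(), out_degree.most_common()


# Small illustrative subset, not the full data set of the report.
edges = [
    ("India", "Hindi"), ("India", "English"),
    ("Canada", "English"), ("Canada", "French"),
    ("Cameroon", "English"), ("Cameroon", "French"),
]
languages, countries = degree_rankings(edges)
print(languages)
print(countries)
```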
We are interested in plotting the degree distribution plot of the nodes in
countries and their official languages graph. We address this problem in figures
8, 9, 10. If we join the dots with a line, we find an exponentially decreasing
trend. The average degree of nodes in the graph is 0.976.
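
The reported average degree is numerically consistent with dividing the 366 directed edges by the 375 nodes (202 countries plus 173 languages) given in the caption of figure 3. This is an inference from the numbers, not something the report states. The sketch below checks that arithmetic and also shows one assumed way of tabulating a degree distribution; the helper `degree_distribution` is hypothetical.

```python
from collections import Counter


def degree_distribution(edges):
    """Map each total degree to the number of nodes that have it.

    Nodes without any edge do not appear in `edges` and are not counted.
    """
    degree = Counter()
    for country, language in edges:
        degree[("country", country)] += 1      # out-degree contribution
        degree[("language", language)] += 1    # in-degree contribution
    return Counter(degree.values())


# Counts taken from the caption of figure 3.
node_count = 202 + 173
edge_count = 366
print(round(edge_count / node_count, 3))

# Illustrative edges only.
print(degree_distribution([("India", "Hindi"), ("India", "English"),
                           ("Canada", "English")]))
```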
Figure 7: It compares the out-degree of countries, i.e. the number of official
languages they have. This figure is biased by the factor highlighted in figure 4.
The countries which have a larger out-degree have a larger font size in the
figure. It is evident that Russia, Zimbabwe, South Africa, etc take the lead
here.
4.2 Sister Languages Network
A language learner is interested in knowing which language he should learn
as a second language. One possible criterion is to learn a sister language,
i.e. a language which shares a common country with a language he already
knows. In particular, he may choose a language which is a sister language of
his native language. A possible reason to set such a criterion is that the
learner finds other people from his country speaking that language. We
construct the sister languages network as described in section 2.4. In figure
11, we visualize the network.
The network has 173 nodes, 906 edges, and an average degree of 10.474. It has
a network diameter (A.1) of 5 and an average path length (A.2) of 2.2. It has
41 components (A.3) and a network density (A.4) of 0.061.
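
The average degree and the density follow directly from the node and edge counts of an undirected network. The short check below uses the standard formulas 2m/n and 2m/(n(n-1)); the function names are illustrative.

```python
def average_degree(n, m):
    """Average degree of an undirected graph with n nodes and m edges."""
    return 2 * m / n


def density(n, m):
    """Fraction of the n(n-1)/2 possible undirected edges that are present."""
    return 2 * m / (n * (n - 1))


# Node and edge counts reported for the sister languages network.
n, m = 173, 906
print(round(average_degree(n, m), 3))
print(round(density(n, m), 3))
```

Both values agree with the figures quoted in the paragraph above when rounded to three decimal places.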
If one needs to diffuse or spread some information in the network, the most
central node seems to be the apt choice from which to spread it. In a network,
we can find such a node by computing the closeness centrality (A.5), which is
further illustrated in figure 12.
Figure 8: It shows the in-degree distribution of nodes in the countries and their
official languages graph as visualized in Gephi.
Figure 9: It shows the out-degree distribution of nodes in the countries and
their official languages graph as visualized in Gephi.
Figure 10: It shows the degree distribution of nodes in the countries and their
official languages graph as visualized in Gephi.
4.3 Sister Countries Network
Some people may not wish to learn new languages but may wish to travel to
foreign countries. They may choose to travel to those countries which speak
their native language or any other language which they know. To address this
problem, we have constructed the Sister Countries Network as described in
section 2.5; it is illustrated in figure 13.
This network has 202 nodes and 3172 edges, with an average degree of 31.4 per
node. It has a network diameter of 4, an average path length of 1.91, a network
density of 0.156, and a total of 41 components.
Another application of the Sister Countries Network is for businesses which
employ translators. Consider the scenario that country A is connected to country
B, and country B is connected to country C. If some business E in country A
has to enter into business terms with some business F in country C but A and
C do not have any official language in common, then it is likely that E will hire
translators from B, since they are likely to know languages of both A and C.
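
The translator scenario amounts to finding countries that are neighbors of both A and C in the Sister Countries Network. A minimal sketch, assuming the official languages are available as one set per country; the function name `intermediaries` and the sample data are illustrative.

```python
def intermediaries(official, a, c):
    """Return countries B that share a language with both A and C.

    `official` maps a country name to the set of its official languages.
    An empty list is returned when A and C already share a language.
    """
    if official[a] & official[c]:
        return []  # direct common language, no intermediary needed
    return sorted(
        b for b, languages in official.items()
        if b not in (a, c)
        and languages & official[a]    # B is connected to A
        and languages & official[c]    # B is connected to C
    )


# Small illustrative subset, not the full data set of the report.
official = {
    "India": {"Hindi", "English"},
    "France": {"French"},
    "Canada": {"English", "French"},
    "Cameroon": {"English", "French"},
    "Brazil": {"Portuguese"},
}
print(intermediaries(official, "India", "France"))
```

Note that B may be connected to A and to C through two different languages, which is exactly the case in which its translators are useful.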
Figure 11: It shows the Sister Languages Network. The number of edges
connecting two languages represents the number of countries which have both the
languages as their official languages. The big blob of connections on the right
consists of the languages of Russia. They stand out because of the reasons
highlighted in figure 4.
5 Technology Deployed
5.1 Python 3.5
It was deployed for the following tasks:
• Scan the JSON dumps of Wikidata, which were over 7 GB in size.
• Construct and manipulate the graph using the NetworkX [6] library.
• Export the constructed graph into GraphML [7].
• Automatically generate Cypher language code for use in Neo4j.
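
As an illustration of the last task, Cypher statements can be produced with plain string formatting. This is a sketch, not the project's generator: the node labels `Country` and `Language`, the relationship type `OFFICIAL_LANGUAGE` and the function names are assumptions, while the `id` and name properties mirror the node attributes described in section 2.1.

```python
def quote(text):
    """Escape backslashes and single quotes for a Cypher string literal."""
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def cypher_statements(countries, languages, edges):
    """Build CREATE statements for nodes, then for the directed relations.

    `countries` and `languages` map Wikidata IDs to names; `edges` holds
    (country ID, language ID) pairs.
    """
    statements = []
    for label, table in (("Country", countries), ("Language", languages)):
        for wikidata_id, name in sorted(table.items()):
            statements.append("CREATE (:%s {id: %s, name: %s});"
                              % (label, quote(wikidata_id), quote(name)))
    for country_id, language_id in sorted(edges):
        statements.append(
            "MATCH (c:Country {id: %s}), (l:Language {id: %s}) "
            "CREATE (c)-[:OFFICIAL_LANGUAGE]->(l);"
            % (quote(country_id), quote(language_id)))
    return statements


# Illustrative input only.
countries = {"Q668": "India"}
languages = {"Q1568": "Hindi", "Q1860": "English"}
edges = [("Q668", "Q1568"), ("Q668", "Q1860")]
print("\n".join(cypher_statements(countries, languages, edges)))
```

Quoting matters in practice because some country names contain an apostrophe.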
5.2 Gephi [5]
It was mainly used for exploratory data analysis activities, which include
computing graph statistics and visualizing the various graphs using the
ForceAtlas layout.
Figure 12: It shows the closeness centrality as computed on the sister languages
network. The darker the node, the better it is for spreading information in the
network.
Figure 13: It shows the sister countries network. In the graph, the number
of edges connecting two countries represents the number of official languages
common to both. In color codes, it also shows the eccentricity (A.6) of the
nodes.

5.3 Neo4j [4]
It was mainly used to visualize the original directed graph to get an overall look
and feel of the graph of countries and their official languages.
5.4 Bash Script
It was mainly used for managing the Neo4j server which involved tasks like
starting and stopping Neo4j, deleting existing graph database and creating a
new one.
6 Conclusion
The results of our analysis show that English, French, Arabic and Spanish are
the official languages of a large number of nations. The analysis addresses the
long-standing need of language learners for a systematic learning path for
languages based on their background. It also shows tourists which destination
countries they can visit without having to spend months learning new languages.
7 Future Scope
The analysis in the project can be used to build a Language Recommendation
System (LRS) for a language learner. It can suggest a language based on
the past nationalities / current nationality of the person, the countries he
wishes to travel to, the languages already known, etc.
The recommendations can be further enhanced by incorporating information about
the number of people speaking a language around the world. A language which has
a higher number of speakers will receive a higher rank in the language
recommendation.

A Graph Terminology
In this section, we describe some of the graph-related terms which we used in
our report.
A.1 Network Diameter
It is the longest graph distance between two nodes in a network. We do not
consider pairs of nodes which are disconnected or have no path from one node
to other.
A.2 Average Path Length
It is the average number of steps along the shortest paths for all pairs of network
nodes.
A.3 Network Component
In a component, there exists a path between each pair of nodes.
A.4 Network Density
Let n be the number of nodes and m be the number of edges. Then the network
density is defined as $d = \frac{2m}{n(n-1)}$. A density of 1 means a complete
graph and a density of 0 means a graph with no edges.
A.5 Closeness Centrality
Closeness centrality gives a measure of how close a node is to all other
nodes in its component. Let x and y be nodes in the same component and let
d(y, x) be the shortest distance between them; then the closeness centrality H
of x is defined as $H(x) = \sum_{y \neq x} \frac{1}{d(y,x)}$.
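
The sum of reciprocal distances can be computed with one breadth-first search per node. The sketch below is a library-free illustration on a toy graph; it assumes an unweighted, undirected network stored as an adjacency dictionary, and the function names are hypothetical.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Shortest hop counts from `source` to every node reachable from it."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


def closeness(adjacency, x):
    """Sum of 1/d(y, x) over every other node y in the component of x."""
    distance = bfs_distances(adjacency, x)
    return sum(1.0 / d for node, d in distance.items() if node != x)


# Toy undirected path graph a - b - c - d (illustrative data only).
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
scores = {node: closeness(path, node) for node in sorted(path)}
print(scores)
```

On the toy path, the two inner nodes score higher than the two end nodes, which matches the intuition that central nodes are better starting points for spreading information.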
A.6 Node Eccentricity
In a component, the maximum distance which the node can have from any other
node is its eccentricity.
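
The definitions in A.1, A.2, A.3 and A.6 can all be derived from breadth-first search distances. The following library-free sketch assumes an unweighted, undirected network stored as an adjacency dictionary and, as in A.1, ignores pairs of nodes with no path between them; the function names and the toy graph are illustrative.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Shortest hop counts from `source` to every node reachable from it."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


def components(adjacency):
    """A.3: list the node sets of the connected components."""
    seen, result = set(), []
    for node in adjacency:
        if node not in seen:
            members = set(bfs_distances(adjacency, node))
            seen |= members
            result.append(members)
    return result


def eccentricity(adjacency, node):
    """A.6: largest distance from `node` to any node in its component."""
    return max(bfs_distances(adjacency, node).values())


def diameter(adjacency):
    """A.1: largest eccentricity over all nodes."""
    return max(eccentricity(adjacency, node) for node in adjacency)


def average_path_length(adjacency):
    """A.2: mean shortest-path length over all connected ordered pairs."""
    total = pairs = 0
    for node in adjacency:
        for other, d in bfs_distances(adjacency, node).items():
            if other != node:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


# Toy graph: a path a - b - c and a separate pair x - y (illustrative only).
toy = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "x": {"y"}, "y": {"x"}}
print(len(components(toy)), diameter(toy), average_path_length(toy))
```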

References
[1] “List of languages by number of native speakers.”
https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers.
[2] “The 10 most spoken languages in the world.”
https://www.babbel.com/en/magazine/the-10-most-spoken-languages-in-the-world.
[3] “Wikidata JSON dumps.”
https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2.
[4] Neo4j, “Neo4j - the world’s leading graph database,” 2012.
[5] M. Bastian, S. Heymann, and M. Jacomy, “Gephi: An open source software
for exploring and manipulating networks,” 2009.
[6] A. A. Hagberg, D. A. Schult, and P. J. Swart, “Exploring network structure,
dynamics, and function using NetworkX,” in Proceedings of the 7th Python
in Science Conference (SciPy2008), (Pasadena, CA USA), pp. 11–15, Aug. 2008.
[7] G. Team, “The GraphML file format,” 2002.