Graph Theory Project Presentation on Countries and their Official Languages for the course of Graph Theory at Indian Institute of Information Technology, Allahabad
This document provides instructions for a semester project in computer systems and programming. It outlines 4 potential project ideas: 1) an encoding/decoding program using an expanding square code, 2) a Battleship game, 3) a program to convert text to Morse code and vice versa, and 4) a word finding program in a 2D letter grid. It specifies requirements such as commented and indented code and that every team member understands the project. The deadline is February 3rd; projects will be demoed and evaluated on various criteria on that date, or earlier if completed.
Review of research on devnagari character recognition (Vikas Dongre)
This document summarizes research on Devnagari character recognition. It begins with an abstract discussing the progress of English character recognition and the need for further research on Indian languages like Devnagari. The document then reviews the stages of Devnagari optical character recognition systems, including pre-processing, segmentation, feature extraction, recognition, and post-processing. It discusses challenges in Devnagari recognition due to features of the script like connected characters. The document also reviews common techniques used at each stage of recognition systems and provides directions for future research.
The document describes an optical character recognition project submitted for a bachelor's degree. It details the image segmentation and character recognition processes used to recognize characters in Indian languages like Hindi. The image segmentation section explains preprocessing steps like digitization, text block identification, and line, word and character segmentation. It also describes a method for segmenting fused characters using properties like pen width and continuity of projections. Four character recognition techniques are evaluated on the segmented images: k-nearest neighbors, logistic regression, multilayer perceptron, and support vector machine.
Project Proposal: Bengali Braille to Text Translation (Minhas Kamal)
Software Project Proposal: Bengali Braille to Text Translation
Presented in 4th year of Bachelor of Science in Software Engineering (BSSE) course at Institute of Information Technology, University of Dhaka (IIT, DU).
The document proposes a standard for writing sign languages using SignWriting. It describes SignWriting, which represents sign languages visually with symbols in a 2D signing space. It then details Formal SignWriting, which defines sign languages formally using strings. Symbols are assigned ASCII names and placed in a signing box with coordinates. Signs can be styled and queried using this formal language. The standard aims to document SignWriting for internet use.
1. The document describes a hybrid approach called OUT OF STEP for detecting clones across programming languages for mobile app development.
2. It discusses the need to abstract syntax trees to compare code snippets across different languages while retaining language references.
3. The approach uses universal nodes to represent common concepts, enriched syntax trees to provide hierarchy, and stop nodes to fragment trees for comparison across locations.
Braille to text and speech for cecity persons (eSAT Journals)
This document describes a system that converts Braille input to text and speech output. The system uses a Braille keypad for input, which is interfaced with an FPGA (field programmable gate array). The FPGA decodes the Braille input and converts it to English text, which is then displayed on an LCD. The English text is also converted to speech output through an integrated circuit. The system is designed to help visually impaired people access and communicate information by converting Braille to readable and audible formats.
This document is a booklet for third grade primary students from El-Zahraa Language School. It covers various topics related to computer programming and coding, organized into chapters and accompanied by worksheets. The chapters discuss introduction to programming, flowcharting, pseudo code, object oriented programming, integrated development environments, and controls. Each chapter provides explanations of key concepts and terms, along with examples and exercises for students. The booklet aims to teach students basic computer programming logic and skills.
The document provides an introduction to programming with C# and the Visual Studio environment. It discusses that C# is an object-oriented language created by Microsoft to build a variety of applications that run on the .NET framework. It also describes the .NET framework, which includes the common language runtime that compiles C# code into intermediate language code and executes it. Finally, it introduces Visual Studio as an integrated development environment for creating C# applications and its key components like solutions and projects.
The document describes a project report submitted by R Ashwin for the award of a Bachelor of Technology degree. It discusses the distributed implementation of the graph database system DGraph. The key steps in the distributed version include sharding the data, assigning unique IDs, loading the data into different servers, and enabling communication between servers through network calls. Performance evaluation on the Freebase film dataset showed that the distributed version had higher throughput and lower latency than the centralized version, especially as the load and computational power increased.
A Survey on Domain-Specific Languages for Machine.pdfA Sur.docx (bartholomeocoombs)
This document discusses guidelines for designing domain-specific languages (DSLs). It begins with an introduction explaining that while tools exist to define new languages, they lack support for enforcing good design principles. The document then presents guidelines the authors have developed based on their experience creating DSLs. These guidelines aim to help DSL designers achieve better quality and usability. The guidelines cover general topics like syntax, semantics, notation, and quality assurance. They are intended to make DSL design a more systematic process and less ad-hoc. The guidelines should be weighed depending on the language's purpose, complexity, and intended users.
This thesis describes the implementation of a speech-driven automatic receptionist system for Voxway AB. The receptionist was programmed in VoiceXML and ColdFusion to answer calls for smaller Swedish companies, direct calls to employees based on speech input, and handle errors. A database and website were also developed to allow customization and view call statistics.
FriendNav is a social media Android application created by Daniel O'Neill for his final year project. The goal of FriendNav was to create an app that facilitates more real-world, in-person social interactions between friends compared to other social media apps. It uses a client-server model where the Android app communicates with a backend server. Key features included in the app are a main menu, login screen, alerts, friend list, settings, events, chat, and maps functionality. The app was tested for performance and potential improvements are discussed such as result buffering, additional features, optimization, and security enhancements.
This document summarizes a thesis about automatically deriving semantic properties from source code. It introduces the Compose .NET project, which uses aspect-oriented programming to add features to .NET languages. The thesis aims to enhance Compose by extracting more semantic information from code. It presents the Semantic Analyzer, which parses code into a metamodel representing semantic actions. This metamodel can then be queried to provide semantic properties for tasks like pointcut matching and program analysis.
The document is a master's thesis that explores neural rap lyrics generation using the GPT-2 language model. It begins with an introduction to natural language processing and language models. It then discusses using GPT-2 as the baseline model for generating rap lyrics, and proposes a novel sampling strategy that biases word probabilities to improve rhyme density in the generated lyrics. The thesis reports on experiments comparing the baseline GPT-2 model to the rhyme density biased sampling approach, analyzing metrics like rhyme density, repetition rate, and model perplexity.
Welcome to International Journal of Engineering Research and Development (IJERD) (IJERD Editor)
This document proposes a new dictionary-based compression technique for Gujarati text and compares it to Huffman coding. The proposed technique is a lossless, Unicode-based method that assigns unique index numbers to characters in a Gujarati dictionary and concatenates the numbers for adjacent characters, compressing Gujarati text to about 60%. An evaluation on sample text files reports an average compression ratio of 60.61% for the proposed technique versus 53.20% for Huffman coding, a difference of 7.41 percentage points.
LoCloud - D5.4: Analysis and Recommendations (locloud)
The document analyzes and makes recommendations for local content in Europeana's cloud based on the LoCloud project. It describes 8 use cases for typical local collection holders to understand their needs. It identifies 7 common issues faced by small institutions that the LoCloud services aim to address, such as lack of technical expertise and standards. The services developed in LoCloud are then described and evaluated based on the use cases. Overall, the analysis finds that while the services try to help small institutions, their needs may not be fully met due to limited resources, requiring good documentation and support.
This document provides guidance on how to become a good software engineer. It discusses what programming is, different types of computer languages, popular programming majors, and how to learn software programming. The document recommends starting with online courses to learn programming basics and the integrated development environment. It also advises supplementing courses with books for a more comprehensive understanding and to develop an open mind. Community websites are recommended for discussions and problem solving help. The overall guidance is that both courses and books are important for learning, but courses are best to start as a beginner.
The purpose of the Library Circulation System (LCS) is to provide a convenient, easy-to-use, Internet-based application for librarians to track and manage the circulation of resources at a university, including books, magazines, journals, compact discs (CDs), videocassettes, and digital video discs (DVDs). LCS also provides a convenient, Internet-based way for students and faculty of a university to search for items in the library's circulation, renew items they have checked out, and reserve items. This report provides the software architectural design, component-level design, and user interface design needed to develop the system.
Automated Voice Based Braille Script Teaching Aid Using (Daphne Smith)
This document describes a Raspberry Pi-based hardware implementation of a Braille teaching aid. The system uses IR sensors embedded in a Braille slate to detect the positioning of marbles representing Braille letters. The sensor readings are processed by the Raspberry Pi, which outputs the corresponding audio letter pronunciation through speakers. The system aims to make Braille learning easier for visually impaired students by automating the detection and audio feedback of letter representations. Software was developed using Simulink to define the logic for identifying letters as capital, small, or numbers based on the marble patterns detected by the IR sensors. Evaluation of the system achieved high accuracy in recognizing different Braille configurations for capital letters, small letters, and numbers.
This document discusses SOAP web services. It begins with an introduction to web services, XML, and SOAP. SOAP is an XML-based protocol that allows for machine-readable documents to be passed over multiple connection protocols to create a distributed system. The document then discusses alternative distributed systems like CORBA, Java RMI, and XML-RPC. It analyzes the advantages and disadvantages of the SOAP protocol. It also covers service description using WSDL, service discovery including UDDI, and describes an MSc project that implements a SOAP web service for a BibTeX database.
This document is a master's thesis from September 2015 that describes compiling a functional programming language called Micro-F# to the Common Intermediate Language (CIL). It introduces Micro-F#, which has features like higher-order functions and inductive data types. It then proposes a technique for translating Micro-F# to an intermediate object-oriented language, which is then compiled to CIL. The technique represents functions as classes and uses methods to simulate function calls. It shows how higher-order functions and data types like lists and trees can be translated. While the approach works for many programs, it has limitations for functions with large recursive inputs.
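The idea of representing functions as classes can be illustrated with a minimal sketch. This is not the thesis's actual translation scheme or its intermediate language; the class names `Closure`, `Add`, and `Twice` and the method name `invoke` are hypothetical, chosen only to show how a higher-order function might become an object whose captured variables are fields and whose call is a method.

```python
class Closure:
    """Hypothetical base class: every translated function is an object."""

    def invoke(self, arg):
        raise NotImplementedError


class Add(Closure):
    """Stands for `fun x -> x + n`; the captured variable n becomes a field."""

    def __init__(self, n):
        self.n = n

    def invoke(self, x):
        return x + self.n


class Twice(Closure):
    """Stands for `fun x -> f (f x)`; the function argument f is itself an object."""

    def __init__(self, f):
        self.f = f

    def invoke(self, x):
        # A function call is simulated by a method call on the closure object.
        return self.f.invoke(self.f.invoke(x))


result = Twice(Add(3)).invoke(10)
print(result)
```

In this encoding every call is a method call, so deeply recursive programs consume one stack frame per call, which is one plausible reading of the limitation on large recursive inputs mentioned above.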
This document describes a graduation project that conducted two empirical studies on the adoption of the Swift programming language. The first study analyzed over 59,000 Swift-related questions on StackOverflow to identify common problems faced by Swift developers. The second study interviewed 12 Swift developers to validate and expand on the initial findings. The project aims to understand the benefits, drawbacks and challenges of adopting Swift in its early stages as it is positioned to become widely used. Key areas examined include issues with optionals, error handling, integration with Objective-C and problems with tools like Xcode. The findings provide insights into Swift adoption from the perspective of early developers.
Language Identifier for Languages of Pakistan Including Arabic and Persian (Waqas Tariq)
A language recognizer (also called an identifier or guesser) is a basic application used to identify the language of a text document. It takes a file as input and, after processing its text, decides the language of the document using LIJ-I, LIJ-II and LIJ-III. LIJ-I alone gives poor accuracy; LIJ-II improves on it, and LIJ-III raises accuracy further. The system also calculates digram probabilities and average accuracy percentages. LIJ-I considers the complete character set of each language, while LIJ-II considers only the differences between them. The paper presents a Java-based language recognizer in detail.
This document summarizes the author's master's project on developing a document translation system. The system uses a multi-step pipeline including text detection with the CRAFT model, text recognition with STR, text merging, image inpainting with DeepFillV2, and translation via Google Translate API. Details are provided on the models used, data processing, and approach for each step of the pipeline to translate documents while preserving layout and design elements.
DEVNAGARI DOCUMENT SEGMENTATION USING HISTOGRAM APPROACH (ijcseit)
This document summarizes a research paper on Devnagari document segmentation using a histogram approach. It discusses challenges in segmenting the Devnagari script used for several Indian languages. A simple algorithm is proposed using horizontal and vertical histograms to segment documents into lines, words and characters. The algorithm achieves near 100% accuracy for line segmentation but lower accuracy for word and character segmentation due to complexities in the Devnagari script. Future work is needed to improve character segmentation handling connected and modified characters.
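A horizontal histogram (projection profile) for line segmentation can be sketched as follows. This is an illustrative sketch rather than the paper's algorithm: it assumes a binarized page given as rows of 0/1 values (1 meaning ink), and the function name `segment_lines` is hypothetical.

```python
def segment_lines(image):
    """Split a binary image into text lines using a horizontal histogram.

    image: list of rows, each a list of 0/1 pixels (1 = ink). Assumed input format.
    Returns (start_row, end_row) pairs, with end_row exclusive.
    """
    # Horizontal histogram: number of ink pixels in each row.
    profile = [sum(row) for row in image]

    lines = []
    start = None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                  # first inked row of a new line
        elif count == 0 and start is not None:
            lines.append((start, y))   # a blank row closes the current line
            start = None
    if start is not None:
        lines.append((start, len(profile)))  # line runs to the bottom edge
    return lines


page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
print(segment_lines(page))
```

Applying the same scan to column sums within one line gives a vertical histogram for word and character boundaries. Because Devnagari characters in a word are joined by a header line, empty columns are rare inside words, which is consistent with the lower character-level accuracy the summary reports.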
C# is a strongly typed, object-oriented programming language that is open source, simple, modern, flexible and versatile. It was developed by Microsoft in 2001 to be easy to learn and support modern functionality. C# supports features like generics, lambda expressions, and asynchronous programming. It is cross-platform and can be used to develop various applications including web, mobile, desktop, games and more. C# is an evolving language with new features added in each version. Key data types in C# include strings, which are represented by the System.String class, and arrays, which allow storing collections of objects or values.
This document provides an introduction to using the statistical software R. It discusses using both the command line interface and RStudio integrated development environment. It covers getting help, setting the working directory, loading libraries, and the basic data structures in R including vectors, matrices, arrays, data frames and lists. Vectors are the most basic data structure and can contain elements of a single type such as numeric, character, logical, etc. The document provides examples of creating, assigning, and testing vector objects.
Similar to Countries and their Official Languages (20)
Build applications with generative AI on Google CloudMárton Kodok
We will explore Vertex AI - Model Garden powered experiences, we are going to learn more about the integration of these generative AI APIs. We are going to see in action what the Gemini family of generative models are for developers to build and deploy AI-driven applications. Vertex AI includes a suite of foundation models, these are referred to as the PaLM and Gemini family of generative ai models, and they come in different versions. We are going to cover how to use via API to: - execute prompts in text and chat - cover multimodal use cases with image prompts. - finetune and distill to improve knowledge domains - run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps using the generative ai industry trends.
We are pleased to share with you the latest VCOSA statistical report on the cotton and yarn industry for the month of March 2024.
Starting from January 2024, the full weekly and monthly reports will only be available for free to VCOSA members. To access the complete weekly report with figures, charts, and detailed analysis of the cotton fiber market in the past week, interested parties are kindly requested to contact VCOSA to subscribe to the newsletter.
We are pleased to share with you the latest VCOSA statistical report on the cotton and yarn industry for the month of May 2024.
Starting from January 2024, the full weekly and monthly reports will only be available for free to VCOSA members. To access the complete weekly report with figures, charts, and detailed analysis of the cotton fiber market in the past week, interested parties are kindly requested to contact VCOSA to subscribe to the newsletter.
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
Countries and their Official Languages
Indian Institute of Information Technology, Allahabad
Graph Theory Project Report
Countries and Official Languages
Naimish Agarwal
irm2013013@iiita.ac.in
Signature
Dr Rishi Ranjan Singh
Assistant Professor
1 Introduction
We live in a multilingual environment with people speaking different languages
around us. Some of us dream to travel around the world, meet new people, learn
new languages and mingle in a new culture. In the job market, multinational
firms look for professional translators who can help them achieve their business
objectives.
Many languages are spoken in the world. A language learner must decide which
language to learn as a second, third or fourth language, based on factors such
as ease of learning, the countries he wishes to visit and the other languages
spoken in his country.
Studies have ranked languages by number of speakers [1] [2]. They find that
Mandarin, Spanish, English, Hindi, Arabic, Portuguese, Bengali, Russian,
Japanese and Punjabi are among the 10 most spoken languages in the world.
However, such a ranking may not be useful for a language learner: China and
India have the largest populations, which biases the results towards the
languages spoken in these nations.
In this project, we address the challenge faced by a language learner by
representing countries and their official languages as a graph.
2 Methodology
In section 2.1, we describe how we represented our graph. In section 2.2, we
describe our data source. In section 2.3, we describe our approach to construct
the directed graph between countries and their official languages. In section 2.4,
we describe the procedure to construct the sister languages network. In section
2.5, we describe the procedure for constructing the sister countries network.
2.1 Graph Representation
A graph G = (V, E) consists of a set of nodes V and a set of edges E.
The set of nodes V consists of countries and official languages. Each country
has the attributes Country ID, Name and Description. Likewise, each official
language has the attributes Language ID, Name and Description.
The edges in E are directed relations from countries to their official
languages.
2.2 Data Collection
We collected the data about countries and their official languages by scanning
the JSON dumps of Wikidata [3] in March 2017. The dump is compressed metadata
about Wiki projects in which each Wiki entity is represented as a JSON string.
The JSON string of each entity starts on a new line.
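As a minimal sketch of reading such a dump, assume the usual layout of the Wikidata JSON dumps: the file is one large JSON array with an opening bracket line, one entity per line followed by a comma, and a closing bracket line. Under that assumption the entities can be decoded one line at a time. The function name iter_entities and the two sample entities are illustrative and are not taken from the project's code.

```python
import json


def iter_entities(lines):
    """Yield one decoded entity per line of a Wikidata-style JSON dump."""
    for line in lines:
        # Drop surrounding whitespace and the comma that separates entities.
        line = line.strip().rstrip(",")
        # Skip the array brackets and any blank lines.
        if line in ("[", "]", ""):
            continue
        yield json.loads(line)


# Tiny stand-in for the dump (illustrative data only).
sample = [
    "[",
    '{"id": "Q668", "labels": {"en": {"value": "India"}}},',
    '{"id": "Q1860", "labels": {"en": {"value": "English"}}}',
    "]",
]
entities = list(iter_entities(sample))
names = {e["id"]: e["labels"]["en"]["value"] for e in entities}
print(names)
```

For the real dump, a file object opened with bz2.open(path, "rt", encoding="utf-8") could be passed as lines, so that the compressed file is read as a stream instead of being loaded at once.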
2.3 Countries and their Official Languages Graph Construction
In this section, we list the steps taken to construct the directed graph of
countries and their official languages:
1. Collect the Wikidata IDs of countries manually. This was feasible because
we considered only 202 countries.
2. In the first pass over the JSON dumps, extract the JSON strings of the
countries using the list of Wikidata IDs and store them in separate JSON files.
3. Scan the JSON file of each country and make a global list of the Wikidata
IDs of their official languages. This is needed because a country's JSON file
lists only the IDs of its official languages, not their names.
4. Find all the distinct Wikidata IDs of the official languages.
5. In the second pass over the JSON dumps, get the JSON strings of the
official languages and store them in separate files.
6. Scan the JSON file of each official language and extract its name.
7. Create nodes for the countries.
8. Create nodes for the official languages.
9. For each country, create a directed relation from the country node to each
of its official language nodes.
This network is analyzed in section 4.1.
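Steps 3, 4 and 9 above can be sketched as follows. This is an illustration under stated assumptions, not the project's script: it assumes that the official languages of a country are stored under the Wikidata property P37, and that each claim keeps the language ID at mainsnak, datavalue, value, id. The helper names and the trimmed sample entity are hypothetical.

```python
OFFICIAL_LANGUAGE = "P37"  # assumed Wikidata property ID for "official language"


def official_language_ids(entity):
    """Return the Wikidata IDs listed under the official language property."""
    ids = []
    for claim in entity.get("claims", {}).get(OFFICIAL_LANGUAGE, []):
        datavalue = claim.get("mainsnak", {}).get("datavalue")
        if datavalue:  # claims without a concrete value carry no datavalue
            ids.append(datavalue["value"]["id"])
    return ids


def build_edges(country_entities):
    """Return sorted, de-duplicated (country ID, language ID) directed edges."""
    edges = set()
    for country in country_entities:
        for language_id in official_language_ids(country):
            edges.add((country["id"], language_id))
    return sorted(edges)


# Heavily trimmed, illustrative entity for India (Q668).
india = {
    "id": "Q668",
    "claims": {
        "P37": [
            {"mainsnak": {"datavalue": {"value": {"id": "Q1568"}}}},  # Hindi
            {"mainsnak": {"datavalue": {"value": {"id": "Q1860"}}}},  # English
        ]
    },
}
edges = build_edges([india])
language_ids = sorted({language for _, language in edges})  # step 4
print(edges)
```

Collecting the edges as plain ID pairs keeps the sketch free of dependencies; in the project the same pairs would be added to a NetworkX directed graph once the language names have been looked up.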
2.4 Sister Languages Network Construction
The nodes in the sister languages network are languages, and two languages are
connected by an undirected edge if they are official languages of a common
country. The construction is simple and is outlined below:
1. For each country in the graph of country and its official languages, get the
list of its official languages.
2. For each distinct pair of languages in the list, join them by an undirected
edge.
This network is analyzed in section 4.2.
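The two steps of section 2.4 can be sketched with a counter of language pairs. The mapping official below is toy data chosen for illustration, and the function name is hypothetical. The count stored for each pair is the number of countries in which both languages are official, which plays the role of parallel edges between the two languages.

```python
from collections import Counter
from itertools import combinations


def sister_languages(official):
    """Count, for each pair of languages, the countries where both are official."""
    pairs = Counter()
    for languages in official.values():
        # Sorting gives every pair one canonical (a, b) orientation.
        for a, b in combinations(sorted(set(languages)), 2):
            pairs[(a, b)] += 1
    return pairs


# Toy input: country -> official languages (illustrative subset only).
official = {
    "India": ["Hindi", "English"],
    "Canada": ["English", "French"],
    "Cameroon": ["English", "French"],
    "France": ["French"],
}
pairs = sister_languages(official)
print(pairs)
```

A country with a single official language, such as France in the toy data, contributes no edge.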
Figure 1: It shows India, as a node, along with its attributes at the bottom.
The <id> attribute is the node ID as used by Neo4j [4], while the id attribute
is the Wikidata ID of India. At the top, we can see the Cypher code which
generated the visualization in Neo4j.
2.5 Sister Countries Network Construction
The nodes in the sister countries network are countries, and two countries are
connected by an undirected edge if they have a common official language. The
construction is a bit more involved than in section 2.4 and is outlined below:
1. In the graph of countries and their official languages, reverse the direction
of edges.
2. For each official language, do the following:
(a) Find the countries that share the language as their official language.
This is easy because such countries are now the neighbors of the
language.
(b) For each distinct pair of such countries, connect them by an undirected
edge.
This network is analyzed in section 4.3.
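The procedure of section 2.5 can be sketched in the same style. Reversing the edges amounts to building a mapping from each language to the countries that use it; the pairs of countries are then generated per language. The input is toy data and the function name is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def sister_countries(official):
    """Count, for each pair of countries, the official languages they share."""
    # Step 1: reverse the edges, so each language points to its countries.
    countries_of = defaultdict(set)
    for country, languages in official.items():
        for language in languages:
            countries_of[language].add(country)
    # Step 2: join every distinct pair of countries that share a language.
    shared = Counter()
    for countries in countries_of.values():
        for a, b in combinations(sorted(countries), 2):
            shared[(a, b)] += 1
    return shared


# Toy input: country -> official languages (illustrative subset only).
official = {
    "India": ["Hindi", "English"],
    "Canada": ["English", "French"],
    "Cameroon": ["English", "French"],
    "France": ["French"],
}
shared = sister_countries(official)
print(shared)
```

In the toy data, Cameroon and Canada share two languages and are therefore joined twice, while France and India share none and stay unconnected.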
3 Graph Visualization
In figure 1, we visualize a typical country node, here India. In figure 2, we
visualize a typical official language node, here English. In figure 3, we showcase
the full view of the directed graph. In figure 4, we address the special case of
Russia. In figure 5, we show the countries with English as their official language.
Figure 2: It shows the official languages of India. At the bottom, one can see
the attribute values of the English language node. The <id> attribute is the
node ID as used by Neo4j, while the id attribute is the Wikidata ID of English.
At the top, one can see the Cypher code used to visualize the figure in Neo4j.
English and Hindi are considered children of India in the graph, since directed
edges are present from India to Hindi and from India to English.
Figure 3: It shows the full view of the graph of countries and their official
languages. It is constructed based on the procedure described in section 2.3.
The languages are shown in a reddish color while the countries are shown in
grey. At the top, we can see the Cypher code which resulted in this
visualization in Neo4j. Below the code, we can see some statistics about the
graph: there are 202 countries, 173 official languages and a total of 366
directed connections in the graph.
Figure 4: It shows the 36 official languages of Russia. On Wikidata, 36
languages are mentioned, of which 35 are regional official languages. Only
Russian is the official language of Russia as a whole. Our script extracted all
the languages mentioned under the official languages property of Russia in the
Wikidata JSON dumps. It is up to the user to reject or keep the 35 regional
languages. In our analysis we have kept them, so some results may be affected.
The interested reader should keep this in mind. At the top, we also show the
Cypher code which generated the visualization in Neo4j.
Figure 5: It shows the 66 countries which have English as their official
language. At the top, we can see the Cypher code which resulted in the shown
visualization in Neo4j.
4 Graph Analysis
In this section, we analyze the graph using various exploratory data analysis
techniques.
4.1 Graph of Countries and their Official Languages
We want to rank languages in descending order of their importance. For this we
use the in-degree of a node, which is the number of edges coming into it. A
language with a higher in-degree is more important for a language learner to
learn, since it is an official language in a larger number of countries. To
visualize this ranking, we built the Language Cloud shown in figure 6.
Figure 6: It shows the importance of languages based on their in-degree in the
graph of countries and their official languages. The languages which are
official languages in a larger number of countries have a larger font size in
the graph. The graph shows that English, French, Arabic, Spanish, Portuguese,
etc. are among the most important languages [5].
We are interested in ranking the countries by the number of official languages
they have. In other words, we are interested in ranking the countries based on
their out-degree. We address this problem in figure 7.
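Both rankings in this section reduce to counting edge endpoints. The sketch below uses a handful of illustrative edges rather than the project's data, and breaks ties alphabetically so that the result is deterministic; the tie-breaking rule is an assumption of the sketch.

```python
from collections import Counter

# Toy directed edges: (country, official language). Illustrative subset only.
edges = [
    ("India", "Hindi"), ("India", "English"),
    ("Canada", "English"), ("Canada", "French"),
    ("Cameroon", "English"), ("Cameroon", "French"),
    ("France", "French"),
]

# In-degree of a language: number of countries pointing at it.
in_degree = Counter(language for _, language in edges)
# Out-degree of a country: number of official languages it points at.
out_degree = Counter(country for country, _ in edges)


def ranked(degree):
    """Sort by descending degree, breaking ties alphabetically."""
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))


language_ranking = ranked(in_degree)
country_ranking = ranked(out_degree)
print(language_ranking)
print(country_ranking)
```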
We also plot the degree distribution of the nodes in the graph of countries
and their official languages, as shown in figures 8, 9 and 10. If we join the
dots with a line, we find an exponentially decreasing trend. The average degree
of nodes in the graph is 0.976.
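The reported average degree can be checked against the counts given in figure 3. The sketch assumes that the average degree of a directed graph is taken as the number of edges divided by the number of nodes, since every edge adds one in-degree and one out-degree. The list of in-degrees fed to the distribution helper is hypothetical.

```python
from collections import Counter


def degree_distribution(degrees):
    """Map each degree value to the number of nodes that have it."""
    return dict(sorted(Counter(degrees).items()))


# Hypothetical in-degrees of six language nodes.
distribution = degree_distribution([3, 3, 1, 1, 1, 1])
print(distribution)  # {1: 4, 3: 2}

# Counts reported in figure 3: 202 countries, 173 languages, 366 edges.
nodes = 202 + 173
edges = 366
average_degree = edges / nodes
print(round(average_degree, 3))  # 0.976
```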
Figure 7: It compares the out-degree of countries, i.e. the number of official
languages they have. This figure is biased by the factor highlighted in figure 4.
The countries which have a larger out-degree have a larger font size in the
figure. It is evident that Russia, Zimbabwe, South Africa, etc. take the lead
here.
4.2 Sister Languages Network
A language learner wants to know which language he should learn as a second
language. One criterion he may set is to learn a sister language, i.e. a
language which is official in the same country as a language he already knows.
In particular, he may choose a language which is a sister language of his
native language. A possible reason for such a criterion is that the learner
finds other people from his country speaking that language. We construct the
sister languages network as described in section 2.4. In figure 11, we
visualize the network.
The network has 173 nodes, 906 edges and an average degree of 10.474. It has
a network diameter (A.1) of 5 and an average path length (A.2) of 2.2. It has 41
components (A.3) and a network density (A.4) of 0.061.
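The average degree and the density follow directly from the node and edge counts. As a quick check, assuming the average degree of an undirected network is 2m/n and the density is the formula given in A.4:

```python
def average_degree(n, m):
    """Average degree of an undirected network with n nodes and m edges."""
    return 2 * m / n


def density(n, m):
    """Network density as defined in A.4: 2m / (n(n - 1))."""
    return 2 * m / (n * (n - 1))


# Node and edge counts reported for the sister languages network.
n, m = 173, 906
print(round(average_degree(n, m), 3))  # 10.474
print(round(density(n, m), 3))         # 0.061
```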
Figure 8: It shows the in-degree distribution of nodes in the countries and their
official languages graph as visualized in Gephi.
Figure 9: It shows the out-degree distribution of nodes in the countries and
their official languages graph as visualized in Gephi.
Figure 10: It shows the degree distribution of nodes in the countries and their
official languages graph as visualized in Gephi.
If one needs to diffuse or spread some information in the network, the most
central node seems to be the apt choice from which to spread it. In a network,
we can find such a node by computing the closeness centrality (A.5), which is
further illustrated in figure 12.
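The measure defined in A.5, the sum of reciprocal shortest distances from a node to the other nodes of its component, can be computed with a breadth-first search. The sketch below is a plain-Python illustration on a toy network and is not the Gephi computation used in the report; the adjacency data and the function name are hypothetical.

```python
from collections import deque


def closeness(adjacency, source):
    """Sum of 1/d over all nodes reachable from source (formula in A.5)."""
    distances = {source: 0}
    queue = deque([source])
    total = 0.0
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distances:
                # First visit in a breadth-first search gives the shortest distance.
                distances[neighbour] = distances[node] + 1
                total += 1 / distances[neighbour]
                queue.append(neighbour)
    return total


# Toy sister languages network (illustrative only).
adjacency = {
    "Hindi": {"English"},
    "English": {"Hindi", "French"},
    "French": {"English"},
    "Japanese": set(),
}
scores = {language: closeness(adjacency, language) for language in adjacency}
most_central = max(scores, key=scores.get)
print(scores, most_central)
```

In the toy network English scores highest because it is one step away from both Hindi and French, while the isolated node scores zero.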
4.3 Sister Countries Network
Some people may not wish to learn new languages but may wish to travel to
foreign countries. They may choose to travel to countries which speak their
native language or any other language which they know. To address this problem,
we have constructed the Sister Countries Network as described in section 2.5;
it is illustrated in figure 13.
This network has 202 nodes and 3172 edges, with an average degree of 31.4 per
node. It has a network diameter of 4, an average path length of 1.91, a network
density of 0.156 and a total of 41 components.
Another application of the Sister Countries Network is for businesses which
employ translators. Consider the scenario where country A is connected to
country B, and country B is connected to country C. If a business E in country
A wants to do business with a business F in country C, but A and C do not have
any official language in common, then E is likely to hire translators from B,
since they are likely to know languages of both A and C.
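In graph terms, the candidate countries B in this scenario are the common neighbors of A and C in the sister countries network. A minimal sketch, with a hypothetical function name and toy adjacency data:

```python
def bridging_countries(adjacency, a, c):
    """Countries adjacent to both a and c; empty if a and c are already linked."""
    if c in adjacency[a]:
        return set()  # a shared official language exists, no bridge is needed
    return (adjacency[a] & adjacency[c]) - {a, c}


# Toy sister countries network (illustrative only): India and Canada share
# English, Canada and France share French, India and France share nothing.
adjacency = {
    "India": {"Canada"},
    "Canada": {"India", "France"},
    "France": {"Canada"},
}
bridges = bridging_countries(adjacency, "India", "France")
print(bridges)
```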
Figure 11: It shows the Sister Languages Network. The number of edges
connecting two languages represents the number of countries which have both
languages as their official languages. The big blob of connections on the right
consists of the languages of Russia. They stand out because of the reasons
highlighted in figure 4.
5 Technology Deployed
5.1 Python 3.5
It was deployed for the following tasks:
• Scrape the JSON dumps of Wikidata, which were over 7 GB in size.
• Construct and manipulate the graph using the NetworkX [6] library.
• Export the constructed graph into GraphML [7].
• Automatically generate Cypher language code for use in Neo4j.
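The last task, generating Cypher code, can be sketched as plain string construction. The node labels Country and Language, the relationship type OFFICIAL_LANGUAGE and the function names are assumptions made for this illustration; the report does not state which names its generated code used.

```python
def quote(value):
    """Render a Python string as a single-quoted Cypher string literal."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def cypher_statements(countries, languages, edges):
    """Build CREATE statements for the nodes, then for the directed relations."""
    statements = []
    for label, nodes in (("Country", countries), ("Language", languages)):
        for node_id, name in sorted(nodes.items()):
            statements.append(
                "CREATE (:%s {id: %s, name: %s});" % (label, quote(node_id), quote(name))
            )
    for country_id, language_id in sorted(edges):
        statements.append(
            "MATCH (c:Country {id: %s}), (l:Language {id: %s}) "
            "CREATE (c)-[:OFFICIAL_LANGUAGE]->(l);" % (quote(country_id), quote(language_id))
        )
    return statements


# Illustrative input keyed by Wikidata ID.
countries = {"Q668": "India"}
languages = {"Q1568": "Hindi", "Q1860": "English"}
edges = [("Q668", "Q1568"), ("Q668", "Q1860")]
statements = cypher_statements(countries, languages, edges)
print("\n".join(statements))
```

Escaping quotes in names matters for entries such as Côte d'Ivoire, which would otherwise end the string literal early.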
5.2 Gephi [5]
It was mainly used for exploratory data analysis activities, which include
computing graph statistics and visualizing various graphs using the ForceAtlas
layout.
Figure 12: It shows the closeness centrality as computed on the sister languages
network. The darker the node, the better it is for spreading information in the
network.
Figure 13: It shows the sister countries network. In the graph, the number
of edges connecting two countries represents the number of official languages
common to both. The color codes also show the eccentricity (A.6) of the nodes.
5.3 Neo4j [4]
It was mainly used to visualize the original directed graph to get an overall look
and feel of the graph of countries and their official languages.
5.4 Bash Script
It was mainly used for managing the Neo4j server which involved tasks like
starting and stopping Neo4j, deleting existing graph database and creating a
new one.
6 Conclusion
The results of our analysis show that English, French, Arabic and Spanish are
the official languages of a large number of nations. The analysis addresses the
long-standing need of language learners for a systematic learning path for
languages based on their background. It also shows tourists the destination
countries they can visit without having to spend months learning new languages.
7 Future Scope
The analysis in the project can be used to build a Language Recommendation
System (LRS) for a language learner. It can suggest a language based on the
past and current nationalities of the person, the countries he wishes to visit,
the languages he already knows, etc.
The recommendations can be further enhanced by incorporating information about
the number of people speaking a language around the world. A language which has
a higher number of speakers would get a higher rank in the language
recommendation.
A Graph Terminology
In this section, we describe some of the graph-related terms which we
used in our report.
A.1 Network Diameter
It is the longest graph distance between two nodes in a network. We do not
consider pairs of nodes which are disconnected or have no path from one node
to other.
A.2 Average Path Length
It is the average number of steps along the shortest paths for all pairs of network
nodes.
A.3 Network Component
In a component, there exists a path between each pair of nodes.
A.4 Network Density
Let n be the number of nodes and m be the number of edges. Then network
density is defined as $d = \frac{2m}{n(n-1)}$. A density value of 1 means a
complete graph and a value of 0 means a graph with no edges.
A.5 Closeness Centrality
Closeness centrality gives a measure of how close the node is to all other
nodes in the component. Let x, y represent nodes in the same component and
d(y, x) be the shortest distance between x and y; then the closeness centrality
H of x is defined as $H(x) = \sum_{y \neq x} \frac{1}{d(y,x)}$.
A.6 Node Eccentricity
In a component, the maximum distance which the node can have from any other
node is its eccentricity.
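The terms A.1, A.2, A.3 and A.6 can all be computed from breadth-first search distances. The following sketch works on a small hypothetical network (a path A-B-C-D and a separate pair E-F) and, as in the definitions above, only considers pairs of nodes that are connected by a path.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Shortest hop counts from source to every node in its component."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


def components(adjacency):
    """A.3: list of node sets, one per connected component."""
    seen, result = set(), []
    for node in adjacency:
        if node not in seen:
            component = set(bfs_distances(adjacency, node))
            seen |= component
            result.append(component)
    return result


def eccentricity(adjacency, node):
    """A.6: largest distance from node to any node of its component."""
    return max(bfs_distances(adjacency, node).values())


def diameter(adjacency):
    """A.1: longest shortest path over all connected pairs."""
    return max(eccentricity(adjacency, node) for node in adjacency)


def average_path_length(adjacency):
    """A.2: mean shortest-path length over all connected pairs."""
    lengths = [
        distance
        for source in adjacency
        for target, distance in bfs_distances(adjacency, source).items()
        if target != source
    ]
    return sum(lengths) / len(lengths)


# Hypothetical network: path A-B-C-D plus a separate pair E-F.
adjacency = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"},
    "E": {"F"}, "F": {"E"},
}
print(len(components(adjacency)), diameter(adjacency), average_path_length(adjacency))
```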
References
[1] “List of languages by number of native speakers.” https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers.
[2] “The 10 most spoken languages in the world.” https://www.babbel.com/en/magazine/the-10-most-spoken-languages-in-the-world.
[3] “Wikidata json dumps.” https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2.
[4] Neo4j, “Neo4j - the world’s leading graph database,” 2012.
[5] M. Bastian, S. Heymann, and M. Jacomy, “Gephi: An open source software
for exploring and manipulating networks,” 2009.
[6] A. A. Hagberg, D. A. Schult, and P. J. Swart, “Exploring network structure,
dynamics, and function using NetworkX,” in Proceedings of the 7th Python
in Science Conference (SciPy2008), (Pasadena, CA USA), pp. 11–15, Aug.
2008.
[7] G. Team, “The graphml file format,” 2002.