International Journal on Natural Language Computing (IJNLC), Vol. 9, No. 6, December 2020
DOI: 10.5121/ijnlc.2020.9602
DUTCH NAMED ENTITY RECOGNITION AND DE-IDENTIFICATION METHODS FOR THE HUMAN RESOURCE DOMAIN
Chaïm van Toledo, Friso van Dijk, and Marco Spruit
Utrecht University, Utrecht, the Netherlands
ABSTRACT
The human resource (HR) domain contains various types of privacy-sensitive textual data, such as e-mail correspondence and performance appraisals. Doing research on these documents brings several challenges, one of which is anonymisation. In this paper, we evaluate the current Dutch text de-identification methods for the HR domain in four steps. First, by updating one of these methods with the latest named entity recognition (NER) models. The result is that the NER model based on the CoNLL 2002 corpus in combination with the BERTje transformer gives the best combination for suppressing persons (recall 0.94) and locations (recall 0.82). For suppressing gender, DEDUCE performs best (recall 0.53). Second, the NER evaluation is based on strict de-identification of entities (a person must be suppressed as a person) and, third, on a loose sense of de-identification (no matter how a person is suppressed, as long as it is suppressed). In the fourth and last step, a new kind of NER dataset is tested for recognising job titles in texts.
KEYWORDS
Named Entity Recognition, Dutch, NER, BERT, evaluation, de-identification, job title recognition
1. INTRODUCTION
De-identification of texts has become a common task within the medical domain, for researching
health care records and many other medical documents. The HR field also contains a lot of
textual data, but in contrast to the medical domain we did not find any HR text de-identification
tools. De-identification of texts provides benefits for organisations. First, the General Data
Protection Regulation (GDPR) asks for limiting personal identifiers (PIDs) in processing data.
Second, data scientists can work more safely with pseudonymised texts, because the PIDs are
suppressed; de-identification methods also reduce the impact of data breaches. And third,
organisations can allay employee privacy concerns by removing individual characteristics.
Therefore, to realise these benefits, we investigate text de-identification in Dutch governmental
HR e-mail correspondence.
This paper transfers existing Dutch medical text de-identification methods to the HR domain. The
case organisation is a Dutch governmental HR organisation. The organisation provides employee
payments and HR management. More than 100,000 civil servants rely on its systems. Civil
servants can contact the organisation's contact centre by phone, e-mail, and chat. The contact
centre gets questions like: "how to get tax reduction by buying a bike for commuting?"
The correspondence system has stored more than 300,000 e-mails. Over 2,000 e-mails were manually
annotated on eleven characteristics. Next, we trained NER models to recognise names, locations,
and organisations. Then, different NER methods and de-identification methods are combined and
compared with each other.
This paper also offers a new NER dataset, for recognising job titles. In an unsupervised way, a
new Dutch NER dataset is created from vacancies. With the dataset, it is possible to train a model
for recognising job titles in texts.
The research question of this paper is: To what extent can current de-identification methods
anonymise Dutch HR related texts? Our main contributions are bringing de-identification to the
HR domain, benchmarking current de-identification methods, updating NER models, and
benchmarking NER methods.
The paper is structured as follows: The Background section elaborates on NER, de-identification
and policy and PIDs. From this background, the expectation is that updating NER models with
the so-called transformers should enhance de-identification results. Regarding PIDs, none of the
de-identification methods is specialised in suppressing gender, job titles and regular titles. Because
these classes are not recognised, we expect that performance on them will be modest. The
Methods section explains how we update the NER methods and how we combine
this with de-identification methods. The Methods section also explains how the Dutch job title
recognition dataset is constructed.
The Results section shows the performances of the NER and de-identification methods. The
results are divided in four evaluations. The first one evaluates the performance of the current and
the updated NER models. The second evaluation shows how the state-of-the-art de-identification
methods suppress PIDs correctly. The third evaluation shows to what extent the PIDs
are suppressed, no matter how they are labelled by the methods. The fourth evaluation shows the
results of the job title recognition model.
Afterwards, we discuss the shortcomings of the current de-identification methods and what features
future de-identification methods should have. We conclude that current Dutch NER methods can
be improved through transformers and that with the TKS method and a state-of-the-art NER
method the best results can be achieved in de-identifying HR texts.
2. BACKGROUND
This section overviews three elements of de-identification. The background starts with NER,
which recognises privacy-sensitive elements in texts. The second part elaborates on the de-
identification methods and the usage of NER systems within them. The third and last
part explains the policies around de-identification.
2.1. NER For Dutch
A definition of NER is: “the task of automatically identifying names in text and classifying them
into a pre-defined set of categories” [1, p. 1]. Automated text de-identification and NER share
the same goal: recognise entities in texts [2]. Three main approaches can be distinguished [3].
The first approach is rule-based; such methods contain name lists, regular expressions and other
predefined handcrafted rules. The second is an unsupervised learning approach: with clustering,
word groups can be constructed based on context similarity. The third approach is supervised
learning, with predefined labels on words in a corpus. This approach yields good results, but the
construction of corpora for NER is a time-consuming process.
For supervised learning NER, the conditional random fields (CRF) algorithm plays a major role.
CRF takes neighbouring tokens into account and so context plays a role in labelling names in
texts. Long short-term memory (LSTM) networks also play a major role in NER. LSTM is “capable of
remembering information over long time periods during the processing of a sequence” [4, p. 1].
Recent developments show the rise of transformers, like BERT (Bidirectional Encoder
Representations from Transformers), introduced by Google [5]. Test results show improvements in
NER due to BERT.
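
As a concrete illustration of how such a transformer-based NER model is typically applied (a minimal sketch using the Hugging Face transformers library; the model name is a placeholder for any Dutch token-classification checkpoint, not the exact model used in this paper):

    # Minimal NER sketch with the `transformers` pipeline API.
    # "some-org/dutch-ner-model" is a hypothetical checkpoint name;
    # substitute any Dutch model fine-tuned on e.g. CoNLL-2002.
    from transformers import pipeline

    ner = pipeline(
        "ner",
        model="some-org/dutch-ner-model",
        aggregation_strategy="simple",  # merge B-/I- word pieces into entity spans
    )
    print(ner("Jan de Vries werkt bij de Belastingdienst in Apeldoorn."))
    # Expected shape: [{'entity_group': 'PER', 'word': 'Jan de Vries', ...}, ...]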
Two known hand-annotated corpora are publicly available to create Dutch NER systems, namely
CoNLL-2002 (Conll) and SoNaR-1 (Sonar). Conll has four labels for entity recognition: persons,
locations, organisations and miscellaneous. The training corpus contains 218,737 lines of words
[6]. The Sonar corpus has two extra labels: products and events. Sonar contains 1 million words.
Where the Conll corpus uses news data, Sonar uses news items, manuals, autocues, fiction and
reports and ‘new’ media like blogs, forums, chat and SMS [1].
Both the Conll and the Sonar dataset tag their entities in the IOB2 format. Words or tokens are
represented vertically, with first the word and second the entity category. O is for outside the
scope (the non-categorised tokens). B-PER (B for begin) marks the start of a person name and is
extended with I-PER if the name contains more than one token [7].
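
A constructed fragment in this format could look as follows (our own illustrative example, not taken from either corpus):

    Jan      B-PER
    de       I-PER
    Vries    I-PER
    werkt    O
    in       O
    Utrecht  B-LOC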
For Dutch, there is a well-known NER system, namely FROG [8]. FROG detects persons,
organisations, locations, products, events, and miscellaneous entities in texts. Another Dutch
NER system (with English, Spanish and German) is created by Lample et al. [9], but
unfortunately, the Dutch model is not available online. Another attempt is Polyglot [10], which is
based on Wikipedia and Freebase data and provides models for 40 languages.
NER are mostly tested with the Conll corpora. Table 1 lists the NER systems with support for
Dutch.
Table 1: NER systems with support for Dutch

Approach                                       Test set    Language  Rates
Classifying Wikipedia data [11]                CoNLL-2003  English   F1-score 85.2
                                               CoNLL-2002  Dutch     F1-score 78.6
                                               CoNLL-2003  German    F1-score 66.5
                                               CoNLL-2002  Spanish   F1-score 79.6
Single CRF classifier [1], [8]                 SoNaR       Dutch     F1-score 84.91
Discriminative learning, word embeddings [10]  CoNLL-2003  English   F1-score 71.3
                                               CoNLL-2002  Dutch     F1-score 59.6
                                               CoNLL-2002  Spanish   F1-score 63.0
LSTM-based [12]                                CoNLL-2003  English   F1-score 84.57
                                               CoNLL-2002  Dutch     F1-score 78.08
                                               CoNLL-2003  German    F1-score 72.08
                                               CoNLL-2002  Spanish   F1-score 81.83
LSTM-CRF [9]                                   CoNLL-2003  English   F1-score 90.94
                                               CoNLL-2002  Dutch     F1-score 81.74
                                               CoNLL-2003  German    F1-score 78.76
                                               CoNLL-2002  Spanish   F1-score 85.75
2.2. Text De-Identification Methods
Because of the Health Insurance Portability and Accountability Act (HIPAA) regulations, text de-
identification methods originate from the medical domain. These methods use a combination of
rule-based and/or statistical approaches. Both approaches are used in the method of Tjong Kim
Sang (TKS) et al. [13]. In this method, the user can add names to a list and can also find names
with FROG NER. There are also non-statistical approaches like DEDUCE [14]. This rule-based
method employs user-added lists of names and other sensitive entities. Table 2 gives an overview
of the two different approaches in Dutch.
DEDUCE is created for de-identifying psychiatric nursing notes. The method requires an
organisational context for defining important entities. The method ships with the most popular
names in the Netherlands, but it is recommended to expand this list with person names from the
organisational systems. The same routine applies to institutions or organisations and locations.
TKS uses NER systems, like Frog, for tokenising the text document. The output is directly
annotated in long lists of tokens with labels (entity or O). As in the case of DEDUCE, extra
organisational context extends the method. The tokens are either already annotated by the NER
system or are looked up in the context lists. Finally, the tokens are analysed with regular
expressions for dates, (phone) numbers and e-mails.
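
A minimal sketch of this rule-based stage (our own illustration of the procedure described above, not the TKS reference implementation; the name list and patterns are assumptions):

    import re

    NAME_LIST = {"Jansen", "Pietersen"}   # organisational context lists (assumed)
    EMAIL_RE  = re.compile(r"\S+@\S+\.\S+")
    DATE_RE   = re.compile(r"\d{1,2}-\d{1,2}-\d{2,4}")
    NUMBER_RE = re.compile(r"\d{2,}")

    def label_token(token: str, ner_label: str = "O") -> str:
        """Prefer the NER label; otherwise fall back to lists and regexes."""
        if ner_label != "O":
            return ner_label
        if token in NAME_LIST:
            return "PER"
        if EMAIL_RE.fullmatch(token):
            return "MAIL"
        if DATE_RE.fullmatch(token):
            return "DATE"
        if NUMBER_RE.fullmatch(token):
            return "NUM"
        return "O"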
Table 2: Overview of previous Dutch text anonymisation research.

Name         Approach         Corpus                             Identifiers                                Rates
DEDUCE [14]  Rule-based       Psychiatric nursing notes (2,000)  Person, url, institution or organisation,  F1-score 0.826
                                                                 location, phone number, age, date
TKS [13]     Rule-based, CRF  Data from therapeutic sessions     Frog: persons, locations, events, misc.,   F1-score 0.84 (unlabelled);
                                                                 products, organisations. Method: numbers,  micro average 0.55
                                                                 months, days, mails, phone numbers
Table 3: HIPAA elements [17]

Personal identifiable elements:
- Names
- All geographical subdivisions smaller than a State
- All elements of dates (except year) for dates that are directly related to an individual
- Telephone and fax numbers
- Online identifiers (such as e-mail)
- Identification numbers / social security numbers
- Medical record numbers
- Health plan beneficiary numbers
- Account numbers
- Certificate/license numbers
- Vehicle identifiers and serial numbers
- Device identifiers and serial numbers
- Web Universal Resource Locators
- Internet Protocol (IP) addresses
- Biometric identifiers, including finger and voice prints
- Full face photographs and any comparable images
- Any other unique identifying number, characteristic, or code
2.3. Policy And PIDs
De-identified data contains no identifiable elements of individuals. Whether an element is
identifiable to an individual differs from situation to situation, and this is also referred to as the
anonymity versus utility dilemma [15]. The medical world introduced many de-identifiers. It is
helpful to see what criteria there are, because the GDPR does not have an exact specification
about what PIDs are. The law describes personal data as “any information relating to an
identified or identifiable natural person” [16, p. 2]. HIPAA provides eighteen possible
identifiers, displayed in Table 3 [17].
De-identified data asks for policy. It would be dangerous to give an anonymised dataset to the
public. For example, Narayanan and Shmatikov [18] demonstrate a de-anonymisation attack on
the Netflix user dataset. Netflix de-identified user data for a recommender system competition,
but the public release of this dataset backfired on the corporation. Hence, three important policies
should be considered when working with de-identified data [19]. First, control the spread of the
data. Second, prevent identification attempts. Third, provide security measures.
3. METHODS
This section explains how the evaluation is done. The first two parts explain how the NER
models are constructed and how the NER and de-identification methods are evaluated. The third
part gives an insight into how the job title recognition dataset is constructed. The fourth part
elaborates on the evaluation metrics. The fifth part explains how the data is annotated and how
agreement between the annotators is scored.
3.1. NER Construction and Evaluation
As explained in Background, two Dutch NER applications are publicly available, Polyglot and
FROG. We construct four NER models based on the Conll and Sonar corpora. The first two
models on both the corpora are trained on the BERT Multilingual Cased (ML), where Dutch is
included. The third and fourth models are trained on the two corpora with BERTje [20]. BERTje
has specifically been created for the Dutch language. From both Conll and Sonar, we only test
persons, locations and organisations. Both contain miscellaneous, but this entity is not annotated
in the test dataset. Sonar also contains products and events, but these are likewise excluded from
the test dataset. Software for constructing NER transformers based on BERT is derived from
Raj’s GitHub repository [21].
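
The training setup can be sketched as follows (assumptions: IOB2 data already tokenised and aligned into train_ds and dev_ds; this illustrates the general Hugging Face recipe, not the exact script from [21]):

    from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                              Trainer, TrainingArguments)

    LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
              "B-LOC", "I-LOC", "B-MISC", "I-MISC"]  # CoNLL-2002 label set

    # BERTje checkpoint; for BERT ML use "bert-base-multilingual-cased".
    checkpoint = "wietsedv/bert-base-dutch-cased"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint, num_labels=len(LABELS))

    args = TrainingArguments(output_dir="ner-bertje",
                             num_train_epochs=3,
                             per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds,   # assumed preprocessed datasets
                      eval_dataset=dev_ds)
    trainer.train()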
3.2. De-Identification Evaluation
Based on the literature, this paper evaluates two de-identification methods, DEDUCE and the
method of TKS. The method of TKS uses Frog, but other NERs can also be attached to this
method. Therefore, the created NER models based on BERT are also attached to the method of
TKS.
Due to a lack of comparability between DEDUCE and TKS, this research chose not to evaluate
DEDUCE’s url, phone number and age retrieval performance. DEDUCE can detect somebody’s
age, or phone number, but there are also other forms of numbers. DEDUCE classifies e-mail and
websites as URL, therefore it is difficult to distinguish them from each other. With the method of
TKS, this research did not evaluate phone numbers, mails, events, miscellaneous and products,
for the same reasons as with DEDUCE. Table 4 maps our annotation classification labels to the
DEDUCE and TKS labels.
Table 4: Comparison of entity classification identifiers.
Our classification labels  DEDUCE       TKS
Person                     Person       Person
Organisations              Institution  Organisation
Location                   Location     Location
Num                        ...          Num
Date                       Date         Month, date, days numbers
3.3. Job Title Recognition Dataset
A new dataset for Dutch NER is created for recognising job titles in texts. This is done in four
steps. The first step is collecting Dutch governmental vacancies. The second step is creating a list
of job titles from these vacancies. In more detail, when creating the job title list, job titles with
senior in front are copied without the term senior, and all job titles without senior in front are also
copied with senior added. The list was then extended with extra job titles, like minister or judge
(these job titles are not in the structured job title list, because applications in this category do not
regularly run via the website). In total, a list of 39,338 job titles is extracted, ordered by a
descending number of characters.
The third step is finding the job titles in the vacancy texts using this list. In an unsupervised
way, the job titles are matched in the vacancy texts. The next sub-step tokenises the vacancy texts
and gives every token an O, B-JOBTITLE or I-JOBTITLE tag, as sketched below.
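
The matching sub-step can be sketched as follows (our illustration of the described procedure, not the authors' exact code; matching titles longest-first corresponds to the descending character ordering of the list):

    def tag_job_titles(tokens, titles):
        """titles: job-title strings sorted by descending length."""
        tags = ["O"] * len(tokens)
        for title in titles:
            t_toks = title.lower().split()
            for i in range(len(tokens) - len(t_toks) + 1):
                window = [w.lower() for w in tokens[i:i + len(t_toks)]]
                if window == t_toks and tags[i] == "O":
                    tags[i] = "B-JOBTITLE"
                    for j in range(i + 1, i + len(t_toks)):
                        tags[j] = "I-JOBTITLE"
        return tags

    tokens = "Wij zoeken een senior beleidsmedewerker in Utrecht".split()
    print(tag_job_titles(tokens, ["senior beleidsmedewerker", "beleidsmedewerker"]))
    # ['O', 'O', 'O', 'B-JOBTITLE', 'I-JOBTITLE', 'O', 'O']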
The fourth and last step is training a NER model for recognising the job titles in texts. The BERT
Base Multilingual Cased model was the base for training the model. The resulting model achieved
an F1-score of 0.91. The dataset is available for download.
3.4. Evaluation Metrics
The test performance evaluation metrics are precision, recall and f1-score. Precision is measured
by (Σ true positive) / (Σ true positive + Σ false positive). Recall is measured by (Σ true positive) /
(Σ true positive + Σ false negative). F1-score is then measured by
2*((precision*recall)/(precision+recall)). This paper considers recall the most important metric,
because recall accounts for the false negatives, and a missed PID (false negative) causes more
harm than a wrongly suppressed token (false positive). F1-score is the second-most important
metric, because it combines false negatives and false positives in one measurement.
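
Written out as code, the three metrics are (a direct restatement of the formulas above; the example counts are invented):

    def precision_recall_f1(tp: int, fp: int, fn: int):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    # e.g. 90 correctly suppressed PIDs, 10 wrongly suppressed tokens, 6 missed PIDs
    print(precision_recall_f1(90, 10, 6))  # (0.9, 0.9375, 0.918...)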
Table 5: Generated examples of e-mail questions. Signatures are parsed out of the context.

Goodmorning, Please handle the following. [SIGNATURE] Dear colleague, from the 9th of October
2016, I am seconded to the Apeldoorn office of the Tax and Customs Administration, but in the P-
Portal my old Ministry of Finance e-mail (j.doe@minfin.nl) is still connected to my account. I would
like my new e-mail (john.doe@belastingdienst.nl) to be connected to P-Direct. My employee number
is 98706540. Thanks in advantage. Greetings John. [SIGNATURE]

Hello, with permission from the board, I want to change my commuting fee. During long road
construction, my commuting distance to the PI Amsterdam increased from 35 km to 41 km. This
situation is happening since the 5th of November past year, so I’d like, retrospectively, to get a higher
commuting fee since the 5th of November. Sincerely, [SIGNATURE]
3.5. Annotated Data
E-mail data is used from a Dutch government organisation. This e-mail data concerns all kinds of
HR related issues. Most of these correspondences relate to HR administration or to modification
to personnel files. The dataset contains 2,017 e-mails, with a mean of 133 tokens per e-mail.
Table 5 shows two examples of messages. In total, 13,496 annotations were made in these
e-mails, using AnnotatorJS with a SQL database to store the annotations. There is a mean of 6.69
annotations per e-mail, with a standard deviation of 11.21.
Table 6 contains the different PIDs with their occurrences in the dataset. Date is everything that
points to a date, i.e. month, year, or day. Weekdays, like Sunday and Monday, are excluded; these
turned out to be too generic. Numbers are defined as a sequence of two or more digits, meaning
that 1 or 2 are excluded, but 10 and 222 are not. One digit turned out to be too generic.
Persons (per) include names, surnames, and initials. This research also includes Gender, because
gender-related words can be seen as a binary identifier; combined with other identifiers, an
attacker can easily deduce an individual. Gender words include sir and madam (in Dutch: de heer,
mevrouw), but also he, she, son, his. We define Organisation (org) as groups of people larger than
one person. The E-mail (mail) class was difficult, because some e-mail addresses contain
whitespace; this is strictly impossible and probably a writing mistake. Location (loc) is everything
that refers to a physical place, such as a street, postal code, municipality, country, area, region and
so forth. Job title refers to somebody’s job, like policy officer or contact center agent. Title
includes degrees such as MSc, Dr or Duke. Code is anything that refers to an account, like IBAN
numbers, usernames and passwords. Websites can also reveal the organisation a person works at,
so this research annotates this information as well.
Table 6: Counted PIDs in the dataset

PID    Date   Num    Per    Gender  Org    Mail  Loc  Job title  Code  Title  Website
Count  3,628  3,338  3,086  1,567   1,011  279   225  120        116   96     28
The first annotator labelled the entire sample of 2,017 e-mails. A second annotator labelled 283 e-
mails, 14% of the sample. For measuring the correctness of the labelled representations, interrater
reliability is measured with a kappa statistic. According to [22], a kappa score of at least
0.80 is sufficient, a kappa below 0.60 indicates a non-agreement between annotators. A kappa
statistic is measured as follows: κ = (Pr(a) − Pr(e)) / (1 − Pr(e))
Pr(a) is the observed agreement between the annotators and Pr(e) is the expected chance
agreement. The observed agreement is 0.99 and the expected agreement is 0.82. The kappa score
is 0.92 and we conclude that there is high agreement between the annotators.
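
For reference, the computation written out (note that with the rounded agreements reported here, (0.99 - 0.82) / (1 - 0.82) evaluates to roughly 0.94; the reported 0.92 presumably stems from the unrounded agreement values):

    def cohens_kappa(pr_a: float, pr_e: float) -> float:
        """pr_a: observed agreement, pr_e: expected chance agreement."""
        return (pr_a - pr_e) / (1 - pr_e)

    print(round(cohens_kappa(0.99, 0.82), 2))  # 0.94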
4. RESULTS
The results are divided into four evaluations. First, how well do the NER models perform?
Second, how do the de-identification methods perform in a “strict” sense? A strict sense means
that the suppression both identifies and classifies the entity correctly. The third evaluation shows
how the de-identification methods perform in a “loose” sense. A loose sense only takes the
identification of the suppression into account and not the classification. The fourth and last
evaluation covers job title recognition.
The first three evaluations are equally important, because a good NER will recognise entities
without organisational knowledge. A high strict de-identification score ensures that the utility will
not drop too much, because only the sensitive entities are suppressed and the false positives are
kept as low as possible. The loose sense de-identification evaluation is important because
suppression is the key to anonymisation.
4.1. NER Evaluation
Table 7 shows the results of two existing NER models, namely Polyglot and FROG, against the
trained BERT-based models. A remarkable outcome is the fact that the Conll corpus gives the
highest results with respect to recall. As the creators of Sonar explained, Sonar is a more diverse
corpus than Conll. Because of this diversity, the expectation was that Sonar could recognise
entities better than Conll. Sonar BERT ML gives the best precision in almost all cases.
Table 7: NER results (underline is highest result in row)

     Metrics    Polyglot  Frog  Conll BERT ML  Sonar BERT ML  Sonar BERTje  ConllBERTje
Per  Precision  0.69      0.68  0.69           0.81           0.77          0.72
     Recall     0.35      0.80  0.86           0.86           0.83          0.88
     F1         0.46      0.73  0.77           0.83           0.80          0.80
Org  Precision  0.66      0.38  0.47           0.71           0.63          0.59
     Recall     0.16      0.46  0.66           0.65           0.54          0.69
     F1         0.26      0.42  0.55           0.68           0.58          0.64
Loc  Precision  0.10      0.09  0.26           0.40           0.34          0.41
     Recall     0.36      0.49  0.64           0.60           0.57          0.65
     F1         0.16      0.15  0.37           0.48           0.42          0.50
4.2. Strict De-Identification
Table 8 shows the results for strict de-identification. This means that if, for example, a person is
identified as an organisation, the person identifier gets a false negative. The BERT-based
transformers perform better in almost all NER-related fields. In line with expectations, the
method of TKS performs best regarding recall in combination with the ConllBERTje model; the
NER results section showed that this model also performs best on its own.

The precision can be lower compared to the NER results. The reason for this decrease is the
added organisational data. This added data has a positive influence on the recall, because more
entities are recognised, but the downside is overfitting, which influences the precision. In other
words: sometimes names can also be regular words.
Table 8: Strict de-identification results (underline is highest result in row)

      Metrics    DEDUCE  TKS + Frog  TKS + Sonar  TKS + Conll  TKS + Sonar  TKS +
                                     BERT ML      BERT ML      BERTje       ConllBERTje
Per   Precision  0.42    0.66        0.64         0.57         0.61         0.59
      Recall     0.77    0.86        0.91         0.87         0.88         0.92
      F1         0.55    0.74        0.75         0.69         0.72         0.72
Org   Precision  0.06    0.28        0.56         0.25         0.51         0.47
      Recall     0.00    0.45        0.68         0.49         0.59         0.68
      F1         0.00    0.35        0.61         0.33         0.55         0.56
Loc   Precision  0.70    0.16        0.23         0.20         0.21         0.26
      Recall     0.26    0.38        0.32         0.47         0.31         0.40
      F1         0.38    0.22        0.26         0.28         0.25         0.31
Date  Precision  0.71    0.97        0.98         0.98         0.97         0.97
      Recall     0.65    0.90        0.94         0.90         0.94         0.94
      F1         0.68    0.96        0.96         0.94         0.96         0.96
Num   Precision  ...     0.62        0.62         0.62         0.62         0.62
      Recall     ...     0.92        0.93         0.91         0.93         0.93
      F1         ...     0.74        0.74         0.74         0.74         0.74
Table 9: De-identification results (underline is highest result in row)

Entities   Metrics  DEDUCE  TKS + Frog  TKS + Sonar  TKS + Conll  TKS + Sonar  TKS +
                                        BERT ML      BERT ML      BERTje       ConllBERTje
Person     Recall   0.79    0.90        0.93         0.89         0.93         0.94
Org        Recall   0.00    0.75        0.81         0.76         0.77         0.78
Location   Recall   0.38    0.76        0.75         0.77         0.76         0.82
Date       Recall   0.61    0.92        0.96         0.92         0.97         0.97
Number     Recall   0.30    0.88        0.89         0.87         0.88         0.89
Gender     Recall   0.49    0.11        0.10         0.16         0.10         0.16
E-mail     Recall   0.53    0.44        0.40         0.40         0.49         0.48
Code       Recall   0.19    0.48        0.40         0.48         0.49         0.60
Title      Recall   0.83    0.13        0.14         0.13         0.15         0.18
Job title  Recall   0.11    0.52        0.49         0.45         0.49         0.51
Website    Recall   0.08    0.08        0.08         0.08         0.00         0.00
4.3. Loose Sense De-Identification
The focus in this section is only on whether something is suppressed or not. In loose sense
de-identification, a suppressed identifier counts as a true positive and an unsuppressed identifier
counts as a false negative. False positives are not considered, because not all methods recognise
gender, codes, titles and job titles, and a false positive can also be a true positive in another
identifier class.
The highest results in Table 9 are achieved by TKS with the Conll corpus and BERTje for
training. This combination performs best on persons and locations. For persons, the BERT-based
methods perform slightly better than Frog; the same holds for locations and organisations.
DEDUCE is interesting, as it accounts for gender and title more often. This happens because
DEDUCE looks for so-called prefixes before names, such as sir, prof, dr and so forth. DEDUCE
is not built for recognising words such as her/his or son/daughter, so the results remain at a recall
of 0.49. However, for recognising titles it does a good job, with a recall of 0.83. The prefix list
was incomplete and should be enlarged with more titles (like nobiliary and accountancy titles).
E-mails are not well recognised; this is due to the tokenisation of the NER methods. The example
sentence “My e-mail is johndoe@example.com” will be tokenised as follows: “My”, “e-mail”,
“is”, “johndoe”, “@”, “example.com”. The TKS method expects a full e-mail address as one
token and simply selects the e-mail based on its @-sign. The DEDUCE method also contains
errors, particularly for e-mails, for example when somebody writes their e-mail fully or partly in
capitals.
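
A more robust alternative (our own sketch, not part of either method) is to match e-mail addresses on the raw text before tokenisation, case-insensitively and tolerating a stray space around the @-sign:

    import re

    EMAIL_RE = re.compile(r"[A-Z0-9._%+-]+\s?@\s?[A-Z0-9.-]+\.[A-Z]{2,}",
                          re.IGNORECASE)

    text = "My e-mail is JohnDoe@Example.com, or johndoe @example.com."
    print(EMAIL_RE.findall(text))
    # ['JohnDoe@Example.com', 'johndoe @example.com']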
4.4. Job Title Recognition
For the job title recognition task, only one model is trained: job title recognition in combination
with BERT Multilingual. Table 10 gives an overview of the outcome with the new model tested
on the annotated dataset. The recall is excellent, but the precision is poor. This phenomenon is
explainable: the top five most common predicted entities cover 78% of the prediction output,
with words like employee, manager, adviser and human resource supporter. These words are
indeed job titles; however, we did not mark them as privacy sensitive, because the words were
too generic and did not point to a person directly.
Table 10: Job title recognition results

Metric     Job title + BERT ML
Precision  0.08
Recall     1.00
F1         0.16
5. DISCUSSION
De-identification is a difficult task, as which entities reveal an individual and which do not
differs per situation. This research aimed to be as strict as possible in annotating identifiers.
Unlike other studies, we also took job titles, titles and gender into account. Of course, it is
arguable that not every identifier is as important as the other. Revealing a person’s name holds
more privacy risk than revealing a person’s gender. But a combination of generic identifiers can
reach a specificity high enough to allow for the identification of individuals.
None of the methods had a hundred percent score compared to the annotated set, which means
that using these methods will not de-identify everybody. Hence, when using a selected method, a
researcher should always check the data for false negatives. Besides the de-identification
techniques themselves, research organisations should arguably embrace text de-identification
workflows, with guidance on which PIDs are important to suppress.
The BERT transformers had in almost all cases a positive influence on enhancing the
identification of the person, organisation, and location entities. A particularly interesting potential
future improvement in NER method performance could be to focus more on recognising gender
in texts. NER models must in the future also handle lower-cased names, because texts are and
will always be noisy, and we cannot rely on everyone writing correctly. We do note, however,
that updating and expanding NER corpora is a time-consuming process.
We gathered organisational data, like name and organisation lists, for feeding DEDUCE and TKS;
however, in both methods this turned out to be insufficient to reach near-perfect performance. For
organisational entities the score of DEDUCE was too low. It is arguable that we didn’t gather
enough organisational data to fill the organisation list. On the other hand, organisational data is
never enough. To illustrate, a person can mention his/her partner’s name in an e-mail, but a
partner’s name does not have to be stored in the organisation’s systems. So, even when requesting
all the existing names from the organisation for the names list, a partner’s name in an e-mail may
not be matched and thus a false negative can occur. Still, in our results the NER methods could
de-identify at least 75 percent, so applying NER seems at least a good starting point for
de-identifying texts.
The major feature of TKS is that it is modular concerning NER applications. We could easily
connect BERT-based NER methods to the de-identification method. We recommend that future
text de-identification methods follow this path and allow easy integration of other entity
recognition methods in a pipeline. It is also recommendable to let users add their own
de-identification methods to the overall method, since the text de-identification task differs per
situation.
The job title recognition model gives a good demonstration of suppressing all the job titles linked
to a person. However, generic terms like manager, adviser, etc. were also suppressed. This
behaviour can harm a researcher’s dataset; for example, when analysing performance appraisals,
the role of the manager can be interesting. On the other hand, with the model, new information
can be extracted for information retrieval.
6. CONCLUSIONS
There are no strict rules for what should be left out of a text and what should not. Every word in a
text could lead to revealing a natural person. We tried to be as strict as possible with our
annotations and annotated everything that could lead to identifying a person. However, none of
the methods aims to include every PID we identified, like gender or job titles.
The loose sense de-identification section shows that the rule-based approaches can detect a
person’s gender-based salutation but have more trouble identifying a personal name when it is not
in the list. The statistical NER approaches perform far better when a person’s name is not in the
provided list.
This paper evaluated Dutch NER models and de-identification methods. None of the methods
achieve a near-perfect performance. Updating NER models with transformers had a positive
influence in all de-identification approaches. This paper showed that the Dutch Conll corpus
gives the best results regarding recall performance in combination with the Dutch pretrained
transformer BERTje for de-identifying Dutch HR text data. This paper also contributes a new
NER dataset for recognising job titles. The results look promising for detecting job titles in texts.
REFERENCES
[1] B. Desmet and V. Hoste, ‘Fine-grained Dutch named entity recognition’, Lang. Resour. Eval., vol.
48, no. 2, pp. 307–343, Jun. 2014, doi: 10.1007/s10579-013-9255-y.
[2] B. Wellner et al., ‘Rapidly Retargetable Approaches to De-identification in Medical Records’, J. Am.
Med. Inform. Assoc., vol. 14, no. 5, pp. 564–573, Sep. 2007, doi: 10.1197/jamia.M2435.
[3] J. Li, A. Sun, J. Han, and C. Li, ‘A Survey on Deep Learning for Named Entity Recognition’, IEEE
Trans. Knowl. Data Eng., Mar. 2020, [Online]. Available: http://arxiv.org/abs/1812.09449.
[4] J. Hammerton, ‘Named entity recognition with long short-term memory’, in Proceedings of the
seventh conference on Natural language learning at HLT-NAACL 2003, Edmonton, Canada, 2003,
pp. 172–175, doi: 10.3115/1119176.1119202.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, ‘BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding’, in Proceedings of the 2019 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers), Minneapolis, Minnesota, Jun. 2019, pp. 4171–4186, doi:
10.18653/v1/N19-1423.
[6] E. F. Tjong Kim Sang, ‘Introduction to the CoNLL-2002 Shared Task: Language-Independent
Named Entity Recognition’, 2002, [Online]. Available: https://www.aclweb.org/anthology/W02-2024.
[7] E. F. Tjong Kim Sang and J. Veenstra, ‘Representing text chunks’, in Proceedings of the ninth
conference on European chapter of the Association for Computational Linguistics -, Bergen,
Norway, 1999, p. 173, doi: 10.3115/977035.977059.
[8] A. van den Bosch, G. J. Busser, W. Daelemans, and S. Canisius, ‘An efficient memory-based
morphosyntactic tagger and parser for Dutch’, Sel. Pap. 17th Comput. Linguist. Neth. Meet. Leuven
Belg., pp. 99–114, 2007.
[9] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, ‘Neural Architectures for
Named Entity Recognition’, in Proceedings of the 2016 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, San Diego,
California, Jun. 2016, pp. 260–270, doi: 10.18653/v1/N16-1030.
[10] R. Al-Rfou, V. Kulkarni, B. Perozzi, and S. Skiena, ‘Polyglot-NER: Massive multilingual named
entity recognition’, in Proceedings of the 2015 SIAM International Conference on Data Mining,
2015, pp. 586–594, Accessed: Apr. 09, 2019. [Online].
[11] J. Nothman, N. Ringland, W. Radford, T. Murphy, and J. R. Curran, ‘Learning multilingual named
entity recognition from Wikipedia’, Artif. Intell., vol. 194, pp. 151–175, Jan. 2013, doi:
10.1016/j.artint.2012.03.006.
[12] D. Gillick, C. Brunk, O. Vinyals, and A. Subramanya, ‘Multilingual Language Processing From
Bytes’, in Proceedings of the 2016 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, San Diego, California, Jun. 2016, pp.
1296–1306, doi: 10.18653/v1/N16-1155.
[13] E. F. Tjong Kim Sang, B. de Vries, W. Smink, B. Veldkamp, G. Westerhof, and A. Sools, ‘De-
identification of Dutch Medical Text’, in 2nd Healthcare Text Analytics Conference, Cardiff, Wales,
UK, 2019, p. 4.
[14] V. Menger, F. Scheepers, L. M. van Wijk, and M. Spruit, ‘DEDUCE: A pattern matching method for
automatic de-identification of Dutch medical text’, Telemat. Inform., vol. 35, no. 4, pp. 727–736,
2018, doi: 10.1016/j.tele.2017.08.002.
[15] C. van Toledo and M. Spruit, ‘Adopting Privacy Regulations in a Data Warehouse A Case of the
Anonimity versus Utility Dilemma’, in Conference: 8th International Joint Conference on Knowledge
Discovery, Knowledge Engineering and Knowledge Management (IC3K 2016), Porto, Portugal, Nov.
2016, pp. 67–72.
[16] M. Hintze and K. El Emam, ‘Comparing the benefits of pseudonymisation and anonymisation under
the GDPR’, J. Data Prot. Priv., vol. 2, no. 2, pp. 145–158, 2018.
[17] UC Berkeley, ‘UC Berkeley Committee for Protection of Human Subjects’, HIPAA PHI: List of 18
Identifiers and Definition of PHI, 2018. https://cphs.berkeley.edu/hipaa/hipaa18.html (accessed Oct.
09, 2018).
[18] A. Narayanan and V. Shmatikov, ‘Robust De-anonymization of Large Sparse Datasets’, in 2008
IEEE Symposium on Security and Privacy (sp 2008), Oakland, CA, USA, May 2008, pp. 111–125,
doi: 10.1109/SP.2008.33.
[19] M. Hintze, ‘Viewing the GDPR through a De-Identification Lens: A Tool for Clarification and
Compliance’, Int. Data Prot. Law, vol. 8, no. 1, pp. 86–101, 2018, doi: 10.2139/ssrn.2909121.
[20] W. de Vries, A. van Cranenburgh, A. Bisazza, T. Caselli, G. van Noord, and M. Nissim, ‘BERTje: A
Dutch BERT Model’, ArXiv191209582 Cs, Dec. 2019, Accessed: Mar. 16, 2020. [Online].
Available: http://arxiv.org/abs/1912.09582.
[21] K. Raj, kamalkraj/BERT-NER. 2019.
[22] M. L. McHugh, ‘Interrater reliability: the kappa statistic’, Biochem. Medica, vol. 22, no. 3, pp. 276–
282, 2012, doi: 10.11613/BM.2012.031.
AUTHORS
Chaïm van Toledo is a PhD candidate working on applied question answering systems
at the Department of Information and Computing Sciences at Utrecht University. As a
member of the Applied Data Science Lab, he focuses on transforming organisation data
into practical datasets for data science and intelligent systems. His main research
objective is to enhance and broaden question answering systems in the Dutch
language.
Friso van Dijk is a PhD researcher in Privacy Governance at the Department of
Information and Computing Sciences at Utrecht University. As a member of the
Applied Data Science Lab he works with advanced data science techniques to enhance
traditional organizational research methods. His primary research interests are in the
development of practical tools for navigating privacy decision-making in the
organizational context.
Dr. Marco Spruit is an Associate Professor in the Natural Language Processing
research group of the Department of Information and Computing Sciences at Utrecht
University. As principal investigator in the department's Applied Data Science Lab, his
research team primarily works on Self-Service Data Science. Marco's research objective
for the coming years is to establish and lead an authoritative national infrastructure for
Dutch natural language processing and machine learning to facilitate and popularise self-
service data science.