This project was completed during the Lviv Data Science Summer School 2016 (http://cs.ucu.edu.ua/en/summerschool). The project supervisor was Jordi Carrera Ventura.
The project goal was to create a state-of-the-art automatic spellchecking system using the most recent advances in the industry (word embeddings, automatic word sense disambiguation through neural nets) as well as traditional techniques (collocation extraction, n-gram models, shallow syntactic parsing). The system should be capable of using linguistic information and semantic context both to correct mistakes and to improve users’ word choice by suggesting better keywords whenever less specific ones are used.
2. Input data
Datasets for training: 800,000 paragraphs (2 million words)
Datasets for testing: from the CoNLL corpus
3. What we want to achieve - Part 1
1. A spellchecker focused on detecting and correcting Mec errors
4. Target error
Mec error:
● description: spelling, punctuation, capitalization
● examples:
○ This knowledge maybe relevant to them.
○ To tell his or her ralatives…
○ ...their altitudes will be easily changed.
5. What we want to achieve - Part 2
● Implement several language models and choose the best one for Mec mistakes
● Practice Hidden Markov Models and Python
● Have fun working as a team :)
6. Methods & Solutions Used
● Unigram model
○ gives a better understanding of the errors and of the sensitivity of each parameter
○ both the previous and the posterior context of a word are important
○ part-of-speech tags could help
● Custom N-gram model (a sketch follows below)
○ N-gram models based on both the previous and the posterior context of a word
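Below is a minimal Python sketch of the custom N-gram idea: every word is scored by smoothed forward and backward bigram probabilities, and words whose two-sided score is very low are flagged as likely Mec errors. This only illustrates the approach described above, not the team's actual code; the function names, the smoothing constant and the threshold are placeholder assumptions.

    from collections import Counter

    def train(sentences):
        """Count unigrams and bigrams over pre-tokenized sentences."""
        uni, big = Counter(), Counter()
        for tokens in sentences:
            padded = ["<s>"] + tokens + ["</s>"]
            uni.update(padded)
            big.update(zip(padded, padded[1:]))
        return uni, big

    def context_score(word, left, right, uni, big, alpha=0.1):
        """Product of add-alpha smoothed P(word | left) and P(word | right)."""
        v = len(uni)
        p_fwd = (big[(left, word)] + alpha) / (uni[left] + alpha * v)    # forward bigram
        p_bwd = (big[(word, right)] + alpha) / (uni[right] + alpha * v)  # backward bigram
        return p_fwd * p_bwd

    def flag_suspects(tokens, uni, big, threshold=1e-6):
        """Return (word, is_suspect) pairs; a low two-sided score suggests a Mec error."""
        padded = ["<s>"] + tokens + ["</s>"]
        return [(padded[i],
                 context_score(padded[i], padded[i - 1], padded[i + 1], uni, big) < threshold)
                for i in range(1, len(padded) - 1)]

Using both bigram directions is what distinguishes this scorer from a plain left-to-right language model: a misspelling tends to break the fit with its right neighbour as well as its left one.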
7. Tested Methods
● Custom N-gram model with part-of-speech tagging
○ no improvement in the results from part-of-speech tagging
● Conditional Random Fields
○ too slow to train
● Unigram model (a sketch follows below)
○ performs well and is really simple
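The unigram model can be sketched in a few lines in the spirit of Norvig's classic corrector: treat words unseen in the training corpus as likely Mec errors and propose the most frequent known word within one edit. This is an illustration under those assumptions, not necessarily the system's exact candidate generation.

    import string
    from collections import Counter

    def edits1(word):
        """All strings one edit (delete, transpose, replace, insert) away."""
        letters = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [l + r[1:] for l, r in splits if r]
        transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
        replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
        inserts = [l + c + r for l, r in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def correct(word, uni):
        """Keep known words; otherwise return the most frequent 1-edit neighbour."""
        if uni[word] > 0:
            return word
        candidates = [c for c in edits1(word) if uni[c] > 0]
        return max(candidates, key=uni.get) if candidates else word

    uni = Counter(["to", "tell", "his", "or", "her", "relatives"])
    print(correct("ralatives", uni))   # -> "relatives"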
8. Results
● Custom N-gram model
○ Precision: 30%
○ Recall: 41%
TYPE      AMU    CAMB   CUUI   IITB   IPN    NARA   NTHU   PKU    POST   RAC    SJTU   UFC    UMC
Vt        11.61  20.00  5.79   1.90   0.98   16.18  12.90  14.16  3.31   29.17  4.59   0.00   17.60
ArtOrDet  18.75  54.74  67.38  1.81   0.36   54.42  37.96  9.65   59.41  0.66   14.63  0.00   33.42
Mec       31.56  30.67  17.47  1.13   4.79   37.28  7.17   31.69  37.88  45.82  1.10   0.00   22.31
Recall (in %) for each error type, with alternative answers, indicating how well each team performs on a particular error type.
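For reference, the precision and recall figures reported above can be computed from counts of proposed and gold-standard corrections, as in the sketch below. The counts used here are made-up numbers chosen only to reproduce the reported 30% / 41%, not the project's actual evaluation counts.

    def precision_recall(true_pos, false_pos, false_neg):
        precision = true_pos / (true_pos + false_pos)  # fraction of proposed corrections that are right
        recall = true_pos / (true_pos + false_neg)     # fraction of gold corrections that were found
        return precision, recall

    # e.g. 41 correct corrections out of 137 proposed, against 100 gold corrections:
    p, r = precision_recall(41, 137 - 41, 100 - 41)
    print(f"Precision: {p:.0%}, Recall: {r:.0%}")   # Precision: 30%, Recall: 41%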