The document discusses two Spark algorithms: outlier detection on categorical data and KNN join. It describes how the algorithms work, including mapping attributes to scores for outlier detection and using z-order curves to map points to a single dimension for KNN joins. It also provides performance results and best practices for implementing the algorithms in Spark and discusses applications in graph algorithms.
Scalable frequent itemset mining using heterogeneous computing par apriori a... (ijdpsjournal)
Association Rule mining is one of the dominant tasks of data mining; it concerns finding frequent itemsets in large volumes of data in order to produce summarized models of mined rules. These models are extended to generate association rules in various applications such as e-commerce, bio-informatics, associations between image contents and non-image features, analysis of sales effectiveness, the retail industry, etc. For ever-growing databases, the major challenge is mining frequent itemsets in a very short period of time: as the data grows, the time taken to process it should remain almost constant. Because high-performance computing offers many processors and many cores, consistent runtime performance for association rule mining on such very large databases can be achieved; we must therefore rely on high-performance parallel and/or distributed computing. In our literature survey, we studied sequential Apriori algorithms and identified the fundamental problems in both the sequential and the parallel environment. We propose ParApriori, a parallel algorithm for GPGPUs, and analyze the results of our GPU-parallel algorithm. We find that the proposed algorithm improves computing time and maintains consistent performance as the load increases. The empirical analysis also verifies the algorithm's efficiency and scalability over a series of datasets on a many-core GPU platform.
A web search engine is mainly concerned with three things: 1) a query, which expresses the user's information need; 2) a ranked list of relevant responses; and, most importantly, 3) the magic that we call a retrieval system, which churns through the content and returns the relevant responses. The beauty of such a retrieval system is hidden in its retrieval model. We start from a bare-bones vector space model and keep converting our intuitions and observations into mathematical models. We discuss the variants of the IDF and TF-IDF models. As math meets Information Retrieval (IR) in this fast-paced and fun-filled talk, I hope to leave you with ideas on quickly converting these ideas into tools for your own area of research. I will also use this as a case study to explain one "Ideating - Elaborating - Applying - Publishing" model of research. Some of my work that is currently under submission is masked in this presentation; please wait until it is published before approaching me for a full copy.
The document describes several C++ header and source code files that implement different types of integer lists: IntegerListArray (array-based), IntegerListVector (vector-based), IntegerListLinked (linked list-based), and IntegerListSorted (sorted list-based). Each file pair (header and source) defines a class to represent the list type, with member functions for common list operations like getting elements, size, adding/removing elements, and iterating. The files were generated by Doxygen from comments in the code.
Automatic detection of click fraud in online advertisements (Trieu Nguyen)
This thesis presents an approach to automatically detecting click fraud in online advertising.
The approach uses a mathematical theory of evidence to estimate the likelihood that a click is fraudulent or genuine, using web log data of a user's activities on the advertiser's website. One advantage of the proposed approach is that the likelihood can be computed for each incoming click, giving an online computation of belief that fits well with the dynamic behavior of users.
This document provides a version 3.3 implementation guide for the CDISC Study Data Tabulation Model (SDTM) for use in human clinical trials. It describes the fundamentals and assumptions of the SDTM standard, including how to represent observations, variables, datasets and domains. It also provides models and specifications for over 30 standard SDTM domains grouped according to the general observation classes of Interventions, Events and Findings.
The document describes a homework assignment to analyze the performance of a search engine on two datasets: Cranfield and Time. It involves building inverted indexes on the datasets using three different stemmers, running queries using three different scoring functions, and evaluating the results by calculating precision at different ranks. Python scripts are used to automatically create the collections and indexes, run the queries to obtain results files, and evaluate the results files against ground truths to analyze the performance of different configurations.
About decision tree induction which helps in learning (GReshma10)
This document discusses decision tree induction and the concept of entropy. It begins with an overview of decision trees, how they are used for classification tasks, and the basic algorithm for building a decision tree from a training dataset. It then covers node splitting for different attribute types in more detail. Examples are provided to illustrate decision tree building. The document also discusses the concept of entropy from information theory and how it is used as a measure of uncertainty in a training dataset to select the best attributes during decision tree construction.
This thesis examines machine learning approaches using Hadoop in the cloud. It implements a distributed machine learning infrastructure in the cloud without dependence on distributed file systems or shared memory. This infrastructure learns and configures a distributed network of learners. The results are then filtered, fused and visualized. The thesis also develops a machine learning infrastructure using Python and compares the two approaches. It uses real-world immigration and GDP datasets from a government database to test the frameworks. The cloud-based approach is able to scale to petabytes of data with minimal configuration.
This document summarizes Kiran Lakhotia's PhD thesis on search-based testing. The thesis aims to advance automated test data generation using search-based techniques by addressing challenges like pointers, dynamic data structures, and loop-assigned flags. It presents a tool called AUSTIN for generating test data for C programs. AUSTIN combines a hill climber search algorithm with a custom constraint solver for pointers. The thesis also introduces a testability transformation to handle loop-assigned flags and a multi-objective approach to optimize for branch coverage and memory usage. Empirical studies compare AUSTIN to other tools on large open source programs and embedded software.
This document summarizes the author's masters project on developing a document translation system. The system uses a multi-step pipeline including text detection with the CRAFT model, text recognition with STR, text merging, image inpainting with DeepFillV2, and translation via Google Translate API. Details are provided on the models used, data processing, and approach for each step of the pipeline to translate documents while preserving layout and design elements.
A Comparative Study Of Generalized Arc-Consistency Algorithms (Sandra Long)
This document describes a thesis that compares several algorithms for enforcing Generalized Arc Consistency (GAC) on Constraint Satisfaction Problems (CSPs). It studies GAC2001, algorithms for table constraints (STR1, STR2, STR3, eSTR1, eSTR2), and an algorithm for constraints represented as multi-valued decision diagrams (mddc). The author empirically evaluates and compares the performance of these algorithms on randomly generated CSPs and benchmark problems. A new hybrid algorithm (STR-h) is also proposed that uses a selection criterion to combine STR1 and STR-Ni.
The document discusses modeling systems at the end of Dennard scaling and approaches to modeling in a post-Dennard era. It covers the end of consistent CPU performance improvements and the rise of specialized computing such as GPUs and deep-learning drivers. It also discusses using fewer bits in calculations, exploring uncertainties, and generating low-dimensional representations from complex models to help address the challenges of increased computing needs. Learning algorithms may help build emulators and surrogates of Earth system models to enable fit-for-purpose simulations.
This document discusses applying machine learning techniques including text retrieval, association rule mining, and decision tree learning using R. It introduces the movie review dataset and preprocessing steps like removing stopwords and stemming. Text retrieval is performed to create a document-term matrix from the reviews. Association rules are generated from a sample of negative reviews using the Apriori algorithm. Decision trees are built on the combined document-term matrix and sentiment labels to classify review sentiment.
The document provides an introduction to data cleaning techniques in R. It discusses reading raw data into R, performing type conversions, handling character strings, and addressing encoding issues to transform raw data into a technically correct format. It then covers methods for detecting and correcting errors and inconsistencies to make the data consistent, including handling missing values, outliers, obvious inconsistencies, and using simple rules, deduction, and imputation. The goal is to prepare the data so it can be analyzed and to produce reliable statistical statements. Data cleaning is presented as an important part of the overall statistical analysis process.
This document provides an introduction and preface to a lecture on econometric analysis of financial market data. It discusses using the R programming language to model financial data and complete assignments. It encourages downloading R and additional packages to facilitate statistical analysis and simulations. The document outlines installing and using R, and provides references for further reading on empirical finance techniques.
The document outlines a proposed curriculum for data science classes. It covers topics in Python, basic mathematics, feature engineering, supervised learning, unsupervised learning, and data mining. The Python section includes the history of Python, basic structures like variables and data types, operators and control flow, functions, and data structures. The basic mathematics section covers linear algebra, probability and statistics, and optimization. Later sections discuss feature engineering, supervised learning techniques from basic to advanced, unsupervised learning, and data mining.
This document provides an overview of the first five evaluation campaigns conducted by the SEALS project. It summarizes the results of evaluating ontology engineering tools, storage and reasoning systems, ontology matching tools, and semantic search tools. Key findings include varying levels of support for standards compliance among ontology tools, differences in performance between reasoners on classification and satisfiability tests, and usability issues identified for semantic search tools based on user questionnaires. The document also discusses lessons learned from executing the campaigns and drawing comparative analyses of tool results.
This document provides an overview and outline for a 10-week course on C programming with a focus on data structures and algorithms. The course introduces elementary C programming concepts and covers topics like doubly linked lists, sorting, modules, abstract data types, stacks, priority queues, binary trees, and n-ary trees. Students will complete weekly readings, quizzes, and 5 programming projects over the course of the 10 weeks. The goal is for students to learn how to properly design and implement data structures and algorithms in C.
The Role of Selfies in Creating the Next Generation Computer Vision Infused O... (hanumayamma)
Selfies are popular. They embrace and represent the social and emotional pulse of the user. We offer, nevertheless, a groundbreaking and radical view of Selfies, especially those taken for medical imaging purposes. In our view, Selfies taken for medical imaging purposes are valuable outpatient healthcare data assets that could provide new clinical insights. Additionally, they could be used as diagnostic markers that offer a prognosis of a potentially masked disease and prompt actions to avert an emergency incident, thereby saving billions of dollars. We strongly believe that interweaving such Selfies with outpatient Electronic Health Records (EHR) could breed new data-driven diagnoses and clinical pathways, potentially streamlining the decision-making process in healthcare services for greater efficiency and saving the valuable time and attention of healthcare professionals who are already operating under severe time constraints and a shortage of skilled human resources. Put simply, Selfies could offer new diagnostic and clinical insights with the potential to improve the overall health outcomes of people around the globe in a cost-effective manner, epitomizing the confluence of popularity with curiosity and sharing with accountability.
In this research paper, we propose computer vision (CV) based machine learning (ML) / artificial intelligence (AI) algorithms to classify and stratify Selfies captured for medical imaging purposes. Finally, the paper presents a CV-ML/AI prototyping solution as well as its application and selected experimental results.
This document is a degree project from KTH Royal Institute of Technology that examines using social media analysis to predict stock prices. Specifically, it collected Twitter data related to Microsoft, Netflix, and Walmart and used machine learning algorithms like artificial neural networks to analyze the relationship between sentiment in tweets and future stock movement. The best model achieved 80% accuracy in predicting the direction of price changes for one of the companies based on Twitter sentiment alone.
Word Vector Compositionality based Relevance Feedback using Kernel Density Es... (Dwaipayan Roy)
The document discusses using word embeddings and kernel density estimation for relevance feedback in information retrieval. Word embeddings represent words as vectors in a semantic space, where similar words have similar vectors. Kernel density estimation is used to model the probability density function of relevant terms based on an initial query. The densities of candidate expansion terms can then be estimated to select additional helpful terms for feedback. This combines the advantages of word embeddings capturing semantics and relevance feedback capturing query-document co-occurrence information. The approach is evaluated on several test collections to demonstrate its effectiveness.
This document is the December 2010 issue of The R Journal, a peer-reviewed publication of the R Foundation for Statistical Computing. It contains 10 peer-reviewed research articles on various statistical modeling and data analysis techniques in R, including solving differential equations, hierarchical generalized linear models, data cloning, string processing, Bayesian time series analysis, and more. It also includes sections on recent changes and updates to R and contributed R packages. Peter Dalgaard steps down as Editor-in-Chief and is succeeded by Heather Turner.
Applying the GNST to the Engie IT RAL-final (Geert Haerens)
This document applies the Generalised Normalised Systems Theory (GNST) to the Engie IT Reference Architecture Library (RAL). It demonstrates the application of a GNST artefact to three infrastructure services in the RAL - Housing, Hosting, and Proxy. For each service, the artefact describes the concerns, states, versions, and instances. It shows how the artefact captures the evolution of each service across multiple versions. The document also evaluates the value of the artefact for documenting the RAL, aiding architects, and advancing the GNST. In appendices, it includes detailed examples of applying the artefact to different versions of the Housing, Hosting, and Proxy services.
This document is a project report that proposes developing a web application to securely store files on a cloud server using hybrid cryptography. It aims to address data security and privacy issues for cloud storage. The application would use a hybrid cryptography technique combining symmetric and asymmetric encryption to encrypt files before uploading them to the cloud. Only authorized users with decryption keys would be able to access and download encrypted files from the cloud server. The report outlines the problem statement, objectives, methodology, design, and implementation of the proposed application to provide secure file storage on the cloud.
This study examines the relationship between corporate social responsibility (CSR) and financial performance of top US tech firms from 2017-2019. The authors find:
1) Tech firms that spend more on CSR initiatives experience increased revenue, supporting the hypothesis that CSR spending improves financial performance.
2) CSR spending has an insignificant relationship with Tobin's Q, a measure of firm value.
3) Effective corporate governance may be a channel through which CSR enhances financial performance, by allowing firms to consider stakeholder interests.
The study provides evidence that CSR spending can benefit tech firms financially and that governance impacts the CSR-performance relationship.
This document describes an EE660 project for Walmart to classify customer shopping trip types using machine learning models. The goals are to help Walmart improve their segmentation of customer visits by predicting trip types based only on purchase history data. Five machine learning classification algorithms will be tested: Naive Bayes, K-nearest neighbors, support vector machines, random forests, and adaptive boosting. The best performing model, random forests, will be used as the final trip type classification system. Feature selection and extraction methods will be explored to reduce the high dimensionality of the feature space and improve predictive accuracy.
An Optical Character Recognition Engine For Graphical Processing Units (Kelly Lipiec)
This dissertation investigates building an optical character recognition (OCR) engine for graphical processing units (GPUs). It describes Jeremy Reed's doctoral dissertation from the University of Kentucky in 2016. The dissertation introduces basic OCR and GPU concepts. It then describes in detail the SegRec algorithm developed by the author for segmenting and recognizing characters on a GPU. Evaluation results comparing SegRec to other OCR engines are provided. The dissertation concludes by discussing limitations and opportunities for improving SegRec and developing it into a full-featured OCR system.
Similar to Improving Information Retrieval Performance using Word Embedding (20)
Electric vehicle and photovoltaic advanced roles in enhancing the financial p... (IJECEIAES)
Climate change's impact on the planet has forced the United Nations and governments to promote green energies and electric transportation. The deployment of photovoltaic (PV) and electric vehicle (EV) systems has gained strong momentum due to their numerous advantages over fossil fuels. The advantages go beyond sustainability to include financial support and stability. The work in this paper introduces a hybrid PV-EV system to support industrial and commercial plants. The paper covers the theoretical framework of the proposed hybrid system, including the equations required to complete a cost analysis when PV and EV are present. In addition, the proposed design diagram, which sets the priorities and requirements of the system, is presented. The proposed approach allows plants to improve their power stability, especially during power outages. The information presented supports researchers and plant owners in completing the necessary analysis while promoting the deployment of clean energy. The results of a case study representing a dairy farmer support the theoretical work and highlight its benefits for existing plants. The short return on investment of the proposed approach underlines the paper's novel take on sustainable electrical systems. In addition, the proposed system allows for an isolated power setup without the need for a transmission line, which enhances the safety of the electrical network.
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024 (Sinan KOZAK)
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build cache optimizations, covering the journey of solving complex build-cache problems that affect Gradle builds. By understanding the challenges and solutions found along the way, the talk demonstrates what is possible for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
Advanced control scheme of doubly fed induction generator for wind turbine us... (IJECEIAES)
This paper describes a speed control device for generating electrical energy on an electricity network, based on the doubly fed induction generator (DFIG) used in wind power conversion systems. First, a doubly fed induction generator model is constructed. A control law is then formulated to govern the flow of energy between the stator of the DFIG and the energy network using three types of controllers: proportional-integral (PI), sliding mode controller (SMC), and second-order sliding mode controller (SOSMC). Their results are compared in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against changes in machine parameters. MATLAB/Simulink was used to conduct the simulations for this study. The simulations show very satisfactory results and demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
Embedded machine learning-based road conditions and driving behavior monitoring (IJECEIAES)
Car accident rates have increased in recent years, resulting in losses of human lives, property, and other financial costs. An embedded machine learning-based system is developed to address this critical issue. The system can monitor road conditions, detect driving patterns, and identify aggressive driving behaviors. It is based on neural networks trained on a comprehensive dataset of driving events, driving styles, and road conditions, and it effectively detects potential risks and helps mitigate the frequency and impact of accidents. The primary goal is to ensure the safety of drivers and vehicles. Data collection involved gathering information on three key road events (normal street and normal driving, speed bumps, and circular yellow speed bumps) and three aggressive driving actions (sudden start, sudden stop, and sudden entry). The gathered data is processed and analyzed using a machine learning system designed for devices with limited power and memory. The developed system achieved 91.9% accuracy, 93.6% precision, and 92% recall. The inference time on an Arduino Nano 33 BLE Sense with a 32-bit CPU running at 64 MHz is 34 ms, requiring 2.6 kB of peak RAM and 139.9 kB of program flash memory, making it suitable for resource-constrained embedded systems.
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL (gerogepatton)
As digital technology becomes more deeply embedded in power systems, protecting the communication networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3) is a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionality. Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because the interconnection of these networks makes them vulnerable to a variety of cyberattacks. To address this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion detection in smart grids. The proposed approach combines a Convolutional Neural Network (CNN) with the Long Short-Term Memory (LSTM) algorithm. We employed a recent intrusion detection dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to train and test our model. Our experiments show that the CNN-LSTM method detects smart grid intrusions substantially better than other deep learning classification algorithms. In addition, the proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection accuracy rate of 99.50%.
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw... (IJECEIAES)
Medical image analysis has witnessed significant advancements with deep learning techniques. In the domain of brain tumor segmentation, the ability to precisely delineate tumor boundaries from magnetic resonance imaging (MRI) scans holds profound implications for diagnosis. This study presents an ensemble convolutional neural network (CNN) with transfer learning, integrating the state-of-the-art Deeplabv3+ architecture with the ResNet18 backbone. The model is rigorously trained and evaluated, exhibiting remarkable performance metrics, including a global accuracy of 99.286%, a class accuracy of 82.191%, a mean intersection over union (IoU) of 79.900%, a weighted IoU of 98.620%, and a Boundary F1 (BF) score of 83.303%. Notably, a detailed comparative analysis with existing methods showcases the superiority of the proposed model. These findings underscore the model's competence in precise brain tumor localization and its potential to advance medical image analysis and healthcare outcomes. This research paves the way for future exploration and optimization of advanced CNN models in medical imaging, with emphasis on addressing false positives and resource efficiency.
26.
Improving Baseline Retrieval Model
Term Transformation Events
[Figure: noisy-channel diagram relating the document d, the collection C, and terms t and t′, with sampling probabilities λ, α, β, and 1 − λ − α − β]
Figure: A Generalized Language Model (GLM). GLM degenerates to LM when α = β = 0.
λ: standard Jelinek-Mercer weighting parameter
α: probability of sampling term t via a transformation through a term t′ sampled from the document d
β: probability of sampling term t via a transformation through a term t′ sampled from the collection C
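Read off the figure, the GLM smooths the document model with a four-way mixture. A plausible way to write it in full, sketched from the symbol definitions above (not necessarily the exact formulation of the underlying paper), is:

```latex
P(t \mid d) = \lambda\,\hat{P}(t \mid d)
  + \alpha \sum_{t'} P(t \mid t')\,\hat{P}(t' \mid d)
  + \beta \sum_{t'} P(t \mid t')\,\hat{P}(t' \mid C)
  + (1 - \lambda - \alpha - \beta)\,\hat{P}(t \mid C)
```

Here \hat{P} denotes maximum-likelihood estimates and P(t | t′) is the term-transformation probability; setting α = β = 0 recovers ordinary Jelinek-Mercer smoothing, matching the caption.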
33.
Improving Baseline Retrieval Model
Vocabulary Mismatch
Vehicle, Smoke, Pollution
Doc1: Passenger vehicles & heavy-duty trucks are a major source of air pollution, which includes ozone, particulate matter, and other smog-forming emissions in the form of smoke.
Doc2: The exhaust gas of cars powered by fossil fuels is a major source of toxic materials in the air that cause severe damage to the eco-system.
39.
Improving Relevance Feedback based Query Expansion
Pseudo-Relevance Feedback based Query Expansion
Given a query Q:
- Perform initial retrieval using some retrieval model.
- Consider the top K documents as (pseudo-)relevant.
- Select terms from those documents as candidate expansion terms.
- Re-retrieve with the expanded query.
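A minimal sketch of this loop, assuming a generic `search(query, k)` helper that returns the top k documents as token lists; every name here is hypothetical:

```python
from collections import Counter

def prf_expand(query_terms, search, num_fb_docs=10, num_exp_terms=5):
    """Pseudo-relevance feedback: expand the query with frequent terms
    from the top-ranked documents of an initial retrieval, then re-retrieve."""
    # Initial retrieval; the top K hits are treated as (pseudo-)relevant.
    fb_docs = search(query_terms, k=num_fb_docs)

    # Count candidate expansion terms over the feedback documents.
    counts = Counter()
    for doc in fb_docs:
        counts.update(t for t in doc if t not in query_terms)

    # Add the most frequent candidates to the query and re-retrieve.
    expansion = [t for t, _ in counts.most_common(num_exp_terms)]
    return search(query_terms + expansion, k=1000)
```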
44.
Improving Relevance Feedback based Query Expansion
Kernel Density Estimation (KDE)
[Figure: three-panel illustration of kernel density estimation]
Estimate a distribution that generates the given data points.
Place a kernel function (e.g. a Gaussian) centered at each observed data point.
Combine the Gaussians to get a function peaked at the observed data points.
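For reference, the textbook form of the estimate (consistent with the sketch above, not taken from the slides): given n samples x_1, …, x_n and bandwidth h,

```latex
\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),
\qquad
K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2} \;\; \text{(Gaussian kernel)}
```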
49.
Improving Relevance Feedback based Query Expansion
Comparing the Estimations
Relevance Model Estimation
Given the set of terms from the (pseudo-)relevant documents, estimate the density function that generates R.
Kernel Density Estimation
Given a set of n data samples, estimate the density function that generates the observed data.
60.
Improving Relevance Feedback based Query Expansion
Two dimensional KDE
The Points: Two Dimensional KDE
Observed data points x_ij = (q_i, D_j): each encapsulates the word vector of the i-th query term q_i and its normalized term frequency in feedback document D_j, i.e. P(q_i|D_j).
Points at which the probability density is to be estimated, x = (w, D_j): each encapsulates the word vector of a candidate expansion term w and its normalized term frequency in feedback document D_j, i.e. P(w|D_j).
62.
Improving Relevance Feedback based Query Expansion
Two dimensional KDE
f(x, \alpha) = \sum_{i=1}^{n} \sum_{j=1}^{M} \frac{P(w \mid D_j)\, P(q_i \mid D_j)}{2\pi\sigma^2} \exp\!\left( -\frac{\|w - q_i\|^2 + \big(P(w \mid D_j) - P(q_i \mid D_j)\big)^2}{2\sigma^2 h^2} \right)
A candidate term w in document D_j, denoted x, will get a high value of the density function if:
(i) w is semantically close to the query terms q_i;
(ii) w frequently co-occurs with the query terms in each top-ranked document D_j.
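A direct transcription of this density into code, as a sketch (the helper name and argument layout are hypothetical; the embeddings and the P(·|D_j) estimates are assumed to be precomputed):

```python
import numpy as np

def kde_score(w_vec, p_w, q_vecs, p_q, sigma=1.0, h=1.0):
    """Two-dimensional KDE value for a candidate expansion term w.

    w_vec  : embedding of w
    p_w    : length-M array of P(w|D_j) over the feedback documents
    q_vecs : embeddings of the n query terms
    p_q    : n x M array of P(q_i|D_j)
    """
    score = 0.0
    for i, q_vec in enumerate(q_vecs):
        sem = np.sum((w_vec - q_vec) ** 2)        # squared embedding distance
        for j in range(len(p_w)):
            tf = (p_w[j] - p_q[i, j]) ** 2        # squared term-frequency gap
            weight = p_w[j] * p_q[i, j] / (2 * np.pi * sigma ** 2)
            score += weight * np.exp(-(sem + tf) / (2 * sigma ** 2 * h ** 2))
    return score
```

Both factors behave as the slide says: `sem` is small for semantically close terms, `tf` is small for terms that co-occur with the query terms in D_j, and either drives the exponential toward 1.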
66.
Improving Relevance Feedback based Query Expansion
Summary
Given:
- Corpus: a set of documents.
- Query Q: q1, . . . , qn.
- Vector embeddings of each of the terms.
First-level retrieval is performed using any LM.
ET = all terms of the top-ranked M documents, considered as potential expansion terms.
Calculate KDE(w) for each w ∈ ET.
Terms with a high estimated value are taken as expansion terms.
Linearly interpolate the derived density function with the underlying query model.
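Putting the steps together, a hedged end-to-end sketch (reusing the `kde_score` helper sketched earlier; `search`, `embed`, and `doc_lm` are assumed, hypothetical callables):

```python
import numpy as np

def kde_expand(query_terms, search, embed, doc_lm, M=10, E=5, mix=0.6):
    """KDE-based query expansion following the summary above (a sketch)."""
    fb_docs = search(query_terms, k=M)                  # first-level retrieval
    candidates = {t for d in fb_docs for t in d}        # ET
    q_vecs = np.array([embed(q) for q in query_terms])
    p_q = np.array([[doc_lm(q, d) for d in fb_docs] for q in query_terms])
    scored = {
        w: kde_score(embed(w),
                     np.array([doc_lm(w, d) for d in fb_docs]),
                     q_vecs, p_q)
        for w in candidates
    }
    top = sorted(scored, key=scored.get, reverse=True)[:E]
    # Linear interpolation of the expansion terms with the original query model.
    weights = {q: mix / len(query_terms) for q in query_terms}
    weights.update({w: (1 - mix) / E for w in top})
    return weights
```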
76.
Query Performance Prediction using Word Embedding
Query Performance Prediction (QPP)
Definition
To quantify the quality of search results when no relevance feedback is given.
For an ambiguous query, it may be difficult for a search engine to return satisfactory results, which can lead to poor IR effectiveness.
83.
Query Performance Prediction using Word Embedding
Characteristics of Difficult Query
d(Q, C) - distance between the query and the collection; reflects the likelihood of generating Q from C.
d(R, C) - distance between the retrieved documents and the collection; reflects the likelihood of generating the documents of R from C.
d(Q, Q) - diversity of the query; reflects the ambiguity of Q: the larger the diameter, the more ambiguous the query.
d(R, R) - distance among the retrieved documents; a manifestation of the ambiguity of the retrieved documents.
d(Q, R) - distance between the query and the retrieved documents R; equivalent to the distribution of retrieval scores of R.
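A hedged sketch of how these distances might be computed in embedding space, with queries and documents represented by term-vector centroids (all helper names are hypothetical; the slides do not prescribe a specific distance function):

```python
import numpy as np

def _diameter(vecs):
    """Maximum pairwise Euclidean distance within a set of vectors."""
    return max((np.linalg.norm(a - b)
                for k, a in enumerate(vecs) for b in vecs[k + 1:]),
               default=0.0)

def qpp_features(q_vecs, r_vecs, c_centroid):
    """Embedding-space difficulty features for one query.
    q_vecs: list of query term vectors; r_vecs: list of retrieved-document
    centroids; c_centroid: centroid of the whole collection."""
    q_cent = np.mean(q_vecs, axis=0)
    r_cent = np.mean(r_vecs, axis=0)
    return {
        "d(Q,C)": np.linalg.norm(q_cent - c_centroid),
        "d(R,C)": np.linalg.norm(r_cent - c_centroid),
        "d(Q,Q)": _diameter(list(q_vecs)),   # query diameter / ambiguity
        "d(R,R)": _diameter(list(r_vecs)),   # spread of retrieved documents
        "d(Q,R)": float(np.mean([np.linalg.norm(q_cent - r) for r in r_vecs])),
    }
```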