This document summarizes research that implemented the same transitive closure algorithm for entity resolution on three different Apache Hadoop distributions: a local HDFS cluster, Cloudera Enterprise, and the Talend Big Data Sandbox. The algorithm was run on a synthetic dataset to discover entity clusters. While the local HDFS cluster produced consistent results matching the baseline, the Cloudera and Talend platforms produced inconsistent results due to differences in configuration requirements, load balancing, and blocking behavior across nodes. The experiments highlight the scalability issues that platform-implementation differences introduce for entity resolution in distributed environments.
Column Store Decision Tree Classification of Unseen Attribute Sets (ijma)
A decision tree can be used for clustering frequently used attributes to improve tuple reconstruction time in column-store databases. Due to the ad-hoc nature of queries, strongly correlated attributes are grouped together using a decision tree so that they share a common minimum-support probability distribution. At the same time, the decision tree can work as a classifier to predict the cluster for an unseen attribute set. In this paper we propose classification and clustering of unseen attribute sets using a decision tree to improve tuple reconstruction time.
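The attribute-grouping idea in this abstract can be sketched in Python. The helpers below (`group_columns`, `classify`, and the sample workload are hypothetical names, not from the paper) group columns that co-occur in queries above a minimum support, then assign an unseen attribute set to the closest group by Jaccard overlap. The paper's actual method is a decision tree; this sketch substitutes a simpler co-occurrence grouping purely for illustration.

```python
from collections import Counter
from itertools import combinations

def group_columns(workload, min_support=2):
    """Group columns that frequently appear together in queries.

    `workload` is a list of attribute sets, one per query. Pairs of
    columns co-accessed at least `min_support` times are merged into
    the same group (simple transitive grouping via union-find).
    """
    pair_counts = Counter()
    for attrs in workload:
        for a, b in combinations(sorted(attrs), 2):
            pair_counts[(a, b)] += 1
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for (a, b), c in pair_counts.items():
        if c >= min_support:
            parent[find(a)] = find(b)
    groups = {}
    for attrs in workload:
        for a in attrs:
            groups.setdefault(find(a), set()).add(a)
    return list(groups.values())

def classify(unseen, groups):
    """Assign an unseen attribute set to the group with highest Jaccard overlap."""
    def jaccard(s, g):
        return len(s & g) / len(s | g)
    return max(groups, key=lambda g: jaccard(set(unseen), g))
```

For example, a workload where `a,b` and `c,d` are repeatedly queried together yields two groups, and an unseen set `{a, b, x}` is routed to the `{a, b}` group.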
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A Novel Multi-Viewpoint Based Similarity Measure for Document Clustering (IJMER)
International Journal of Modern Engineering Research (IJMER) is a peer-reviewed online journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, Assessment, and many more.
Ontology Based Document Clustering Using MapReduce (ijdms)
Nowadays, document clustering is considered a data-intensive task due to the dramatic, fast increase in the number of available documents. Nevertheless, the feature sets that represent those documents are also very large. The most common method for representing documents is the vector space model, which represents document features as a bag of words and does not represent semantic relations between words. In this paper we introduce a distributed implementation of bisecting k-means using the MapReduce programming model. The aim behind our proposed implementation is to solve the problem of clustering data-intensive document collections. In addition, we propose integrating the WordNet ontology with bisecting k-means in order to utilize the semantic relations between words to enhance document clustering results. Our experimental results show that using lexical categories for nouns only enhances internal evaluation measures of document clustering and decreases document features from thousands to tens of features. Our experiments were conducted using Amazon Elastic MapReduce to deploy the bisecting k-means algorithm.
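The bisecting strategy the abstract describes can be illustrated with a minimal in-memory sketch (the MapReduce distribution and WordNet integration are omitted; function names are illustrative, not from the paper): repeatedly split the largest cluster with plain 2-means until k clusters remain.

```python
import math
import random

def kmeans2(points, iters=20, seed=0):
    """Plain 2-means: split one cluster of 2-D points into two."""
    rng = random.Random(seed)
    centers = rng.sample(points, 2)
    for _ in range(iters):
        parts = ([], [])
        for p in points:
            d = [math.dist(p, c) for c in centers]
            parts[d.index(min(d))].append(p)
        # recompute each center as the mean of its assigned points
        centers = [
            tuple(sum(x) / len(part) for x in zip(*part)) if part else c
            for part, c in zip(parts, centers)
        ]
    return parts

def bisecting_kmeans(points, k, seed=0):
    """Repeatedly bisect the largest cluster until k clusters remain."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        biggest = clusters.pop()          # split the largest cluster
        clusters.extend(kmeans2(biggest, seed=seed))
    return clusters
```

Each bisection is an independent, local 2-means run, which is what makes the scheme natural to distribute across MapReduce workers.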
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT... (IJDKP)
Many applications of automatic document classification require learning accurately with little training data. Semi-supervised classification uses both labeled and unlabeled data for training. This technique has been shown to be effective in some cases; however, the use of unlabeled data is not always beneficial.
On the other hand, the emergence of web technologies has given rise to the collaborative development of ontologies. In this paper, we propose the use of ontologies to improve the accuracy and efficiency of semi-supervised document classification.
We used support vector machines, one of the most effective algorithms studied for text. Our algorithm enhances the performance of transductive support vector machines through the use of ontologies. We report experimental results applying our algorithm to three different datasets. Our experiments show an accuracy increase of 4% on average, and up to 20%, in comparison with the traditional semi-supervised model.
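The semi-supervised idea can be illustrated with a toy self-training loop. Note this is not the paper's method: the stand-in classifier below is a nearest-centroid rule on 1-D data rather than an SVM, and all names are hypothetical. It only shows how unlabeled points can be absorbed in confidence order.

```python
def centroids(labeled):
    """Per-class mean of labeled 1-D points given as (value, label) pairs."""
    sums = {}
    for x, y in labeled:
        s, n = sums.get(y, (0.0, 0))
        sums[y] = (s + x, n + 1)
    return {y: s / n for y, (s, n) in sums.items()}

def self_train(labeled, unlabeled, batch=1):
    """Iteratively label the most confident unlabeled points."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    while unlabeled:
        cents = centroids(labeled)
        # confidence: margin between the two nearest class centroids
        def margin(x):
            d = sorted(abs(x - c) for c in cents.values())
            return d[1] - d[0] if len(d) > 1 else d[0]
        unlabeled.sort(key=margin, reverse=True)
        for _ in range(min(batch, len(unlabeled))):
            x = unlabeled.pop(0)
            y = min(cents, key=lambda k: abs(x - cents[k]))
            labeled.append((x, y))
    return labeled

def predict(x, labeled):
    """Classify x by its nearest class centroid."""
    cents = centroids(labeled)
    return min(cents, key=lambda k: abs(x - cents[k]))
```

A transductive SVM instead chooses labels for the unlabeled points jointly, by maximizing the margin over all of them at once; the greedy loop above is only the simplest relative of that idea.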
Experimental Study of Data Clustering Using k-Means and Modified Algorithms (IJDKP)
The k-Means clustering algorithm is an old algorithm that has been intensely researched owing to its ease and simplicity of implementation. Clustering algorithms have broad appeal and usefulness in exploratory data analysis. This paper presents results of an experimental study of different approaches to k-Means clustering, comparing results on different datasets using the original k-Means and other modified algorithms implemented in MATLAB R2009b. The results are evaluated on performance measures such as number of iterations, number of points misclassified, accuracy, Silhouette validity index and execution time.
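The performance measures mentioned (for example the Silhouette validity index) can be computed by hand on a toy 1-D dataset. A minimal sketch, not tied to the paper's MATLAB implementation; function names are illustrative:

```python
def kmeans_1d(xs, k, iters=30):
    """Basic 1-D k-means with evenly spread initial centers."""
    centers = sorted(xs)[:: max(1, len(xs) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in xs:
            d = [abs(x - c) for c in centers]
            clusters[d.index(min(d))].append(x)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

def silhouette(clusters):
    """Mean silhouette coefficient: s = (b - a) / max(a, b) per point."""
    scores = []
    for i, ci in enumerate(clusters):
        for x in ci:
            # a: mean distance to own cluster (self-distance is 0)
            a = (sum(abs(x - y) for y in ci) / (len(ci) - 1)
                 if len(ci) > 1 else 0.0)
            # b: mean distance to the nearest other cluster
            b = min(sum(abs(x - y) for y in cj) / len(cj)
                    for j, cj in enumerate(clusters) if j != i and cj)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

Well-separated data yields a silhouette close to 1, which is the kind of comparison the study runs across k-means variants.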
In this playlist
https://youtube.com/playlist?list=PLT...
I illustrate an algorithms and data structures course and implement the data structures using the Java programming language.
The playlist language is Arabic.
The Topics:
--------------------
1- Arrays
2- Linear and Binary search
3- Linked List
4- Recursion
5- Algorithm analysis
6- Stack
7- Queue
8- Binary search tree
9- Selection sort
10- Insertion sort
11- Bubble sort
12- Merge sort
13- Quick sort
14- Graphs
15- Hash table
16- Binary Heaps
Reference: Object-Oriented Data Structures Using Java, Third Edition, by Nell Dale, Daniel T. Joyce and Chip Weems.
The slides are owned by the College of Computing & Information Technology,
King Abdulaziz University, so thanks a lot for these great materials.
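As a quick companion to topic 2 above (the playlist itself uses Java; this sketch is in Python for brevity), a minimal iterative binary search:

```python
def binary_search(items, target):
    """Return the index of `target` in sorted `items`, or -1 if absent."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1      # target is in the upper half
        else:
            hi = mid - 1      # target is in the lower half
    return -1
```

Each comparison halves the search range, giving O(log n) time versus O(n) for linear search.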
International Journal of Engineering Research and Applications (IJERA) is a team of researchers, not a publication service or private publisher running journals for monetary benefit; we are an association of scientists and academics who focus only on supporting authors who want to publish their work. The articles published in our journal can be accessed online, and all articles are archived for real-time access.
Our journal system primarily aims to bring out the research talent and the work done by scientists, academics, engineers, practitioners, scholars, and postgraduate students of engineering and science. This journal aims to cover scientific research in a broad sense rather than publishing a niche area of research, facilitating researchers from various verticals to publish their papers. It also aims to provide a platform for researchers to publish in a shorter time, enabling them to continue their work. All articles published are freely available to scientific researchers in government agencies, educators and the general public. We are making serious efforts to promote our journal across the globe in various ways, and we are sure that our journal will act as a scientific platform for all researchers to publish their work online.
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
K-means Clustering Method for the Analysis of Log Data (idescitation)
Cluster analysis is one of the main analytical methods in data mining; the choice of clustering algorithm directly influences the clustering results. This paper discusses the standard k-means clustering algorithm and analyzes its shortcomings. The paper also focuses on web usage mining, analyzing log data for pattern recognition: with the help of the k-means algorithm, patterns are identified.
A PSO-Based Subtractive Data Clustering Algorithm (IJORCS)
There is a tremendous proliferation in the amount of information available on the largest shared information source, the World Wide Web. Fast and high-quality clustering algorithms play an important role in helping users effectively navigate, summarize, and organize this information. Recent studies have shown that partitional clustering algorithms such as k-means are the most popular algorithms for clustering large datasets. The major problem with partitional clustering algorithms is that they are sensitive to the selection of the initial partitions and are prone to premature convergence to local optima. Subtractive clustering is a fast, one-pass algorithm for estimating the number of clusters and the cluster centers for any given set of data. The cluster estimates can be used to initialize iterative optimization-based clustering methods and model identification methods. In this paper, we present a hybrid Subtractive + PSO (Particle Swarm Optimization) clustering algorithm that performs fast clustering. For comparison purposes, we applied the Subtractive + PSO clustering algorithm, PSO, and Subtractive clustering on three different datasets. The results illustrate that the Subtractive + PSO clustering algorithm generates the most compact clustering results compared to the other algorithms.
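The subtractive-clustering step the abstract builds on can be sketched compactly: assign each point a potential proportional to how many neighbors it has, repeatedly pick the highest-potential point as a center, and subtract its influence. The constants follow the commonly used radii heuristic (rb = 1.5 * ra), the stopping rule is simplified to a single ratio, and the PSO hybridization is omitted; all names are illustrative.

```python
import math

def subtractive_centers(points, ra=2.0, stop_ratio=0.15):
    """Estimate cluster centers of n-D tuples by subtractive clustering."""
    alpha = 4.0 / ra ** 2
    beta = 4.0 / (1.5 * ra) ** 2
    # potential of each point: density of neighbors within ~ra
    pot = [sum(math.exp(-alpha * math.dist(p, q) ** 2) for q in points)
           for p in points]
    centers = []
    first = max(pot)
    while True:
        i = pot.index(max(pot))
        if pot[i] < stop_ratio * first:
            break
        c = points[i]
        centers.append(c)
        # suppress potential near the chosen center
        pot = [pv - pot[i] * math.exp(-beta * math.dist(p, c) ** 2)
               for p, pv in zip(points, pot)]
    return centers
```

The returned centers (and their count) can then seed k-means or, as in the paper, a PSO search, sidestepping the sensitivity to random initial partitions.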
MR – RANDOM FOREST ALGORITHM FOR DISTRIBUTED ACTION RULES DISCOVERY (IJDKP)
Action rules, which are modified versions of classification rules, are one of the modern approaches for discovering knowledge in databases. Action rules allow us to discover actionable knowledge from large datasets. Classification rules are tailored to predict an object's class, whereas action rules extracted from an information system produce knowledge in the form of suggestions for how an object can change from one class to another, more desirable class. Over the years, computer storage has become larger and the internet has become faster. Hence digital data is widely spread around the world and is growing in size in such a way that collecting and analyzing it requires more time and space than a single computer can handle. Producing action rules from distributed massive data requires a distributed action-rules processing algorithm that can process the datasets on all systems in one or more clusters simultaneously and combine them efficiently to induce a single set of action rules. There has been little research on action rules discovery in distributed environments, which presents a challenge. In this paper, we propose a new algorithm called the MR – Random Forest Algorithm to extract action rules in a distributed processing environment.
A H-K Clustering Algorithm for High Dimensional Data Using Ensemble Learning (ijitcs)
Advances made to traditional clustering algorithms solve various problems such as the curse of dimensionality and sparsity of data over multiple attributes. The traditional H-K clustering algorithm can resolve the randomness and a-priori choice of the initial centers in the k-means clustering algorithm, but when applied to high-dimensional data it suffers a dimensional-disaster problem due to high computational complexity. Advanced clustering algorithms such as subspace and ensemble clustering improve the performance of clustering high-dimensional datasets from different aspects and to different extents, yet each improves performance from only a single perspective. The objective of the proposed model is to improve the performance of traditional H-K clustering and overcome its limitations, such as high computational complexity and poor accuracy on high-dimensional data, by combining three different clustering approaches: subspace clustering, ensemble clustering, and H-K clustering.
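The core H-K idea of seeding k-means with a hierarchical pass can be sketched on 1-D data (function names are illustrative, and the subspace/ensemble components of the proposed model are omitted): an agglomerative merge down to k groups supplies deterministic initial centers, removing the randomness of k-means initialization.

```python
def hierarchical_seeds(xs, k):
    """Agglomerative (single-link, 1-D) merge down to k groups; return their means."""
    groups = [[x] for x in sorted(xs)]
    while len(groups) > k:
        # merge the pair of adjacent groups with the smallest gap
        gaps = [groups[i + 1][0] - groups[i][-1] for i in range(len(groups) - 1)]
        i = gaps.index(min(gaps))
        groups[i] += groups.pop(i + 1)
    return [sum(g) / len(g) for g in groups]

def kmeans(xs, centers, iters=20):
    """Standard k-means refinement from the given initial centers."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in xs:
            d = [abs(x - c) for c in centers]
            clusters[d.index(min(d))].append(x)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters
```

The hierarchical pass is the expensive part on high-dimensional data, which is exactly the cost the paper attacks with subspace and ensemble techniques.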
GPCODON ALIGNMENT: A GLOBAL PAIRWISE CODON BASED SEQUENCE ALIGNMENT APPROACH (ijdms)
The alignment of two DNA sequences is a basic step in the analysis of biological data, and aligning long DNA sequences is one of the most interesting problems in bioinformatics. Several techniques have been developed to solve this sequence alignment problem, such as dynamic programming and heuristic algorithms. In this paper, we introduce GPCodon alignment, a pairwise DNA-DNA method for global sequence alignment that improves the accuracy of pairwise sequence alignment. We use a new scoring matrix to produce the final alignment, the empirical codon substitution matrix. Using this matrix in our technique enabled the discovery of new relationships between sequences that could not be discovered using traditional matrices. In addition, we present experimental results that show the performance of the proposed technique over eleven datasets of average length 2967 bps. We compared the efficiency and accuracy of our technique against a comparable tool called “Pairwise Align Codons” [1].
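The dynamic-programming core of global pairwise alignment (Needleman-Wunsch) can be sketched as follows. This uses a simple match/mismatch/gap scheme rather than the paper's empirical codon substitution matrix, and the parameter values are illustrative:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # S[i][j] = best score aligning a[:i] with b[:j]
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
    # traceback from the bottom-right corner
    out_a, out_b, i, j = [], [], n, m
    while i or j:
        if i and j and S[i][j] == S[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i, j = i - 1, j - 1
        elif i and S[i][j] == S[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return S[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

A codon-based aligner like the one described would score triplets against a 64x64 substitution matrix instead of single characters, but the DP recurrence and traceback are the same shape.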
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field of Engineering, Science and Technology, new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Publications are selected through double peer review to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORI ALGORITHM FOR HANDLING VOLUMIN... (acijjournal)
Apriori is one of the key algorithms for generating frequent itemsets. Analysing frequent itemsets is a crucial step in analysing structured data and in finding association relationships between items. This stands as an elementary foundation for supervised learning, which encompasses classifier and feature extraction methods. Applying this algorithm is crucial to understanding the behaviour of structured data. Most structured data in scientific domains is voluminous, and processing such data requires state-of-the-art computing machines. Setting up such an infrastructure is expensive, hence a distributed environment such as a clustered setup is employed for tackling such scenarios. The Apache Hadoop distribution is one of the cluster frameworks for distributed environments that helps by distributing voluminous data across a number of nodes in the framework. This paper focuses on the map/reduce design and implementation of the Apriori algorithm for structured data analysis.
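A minimal single-machine Apriori sketch (the map/reduce distribution described in the paper is omitted; names are illustrative): generate candidate k-itemsets from frequent (k-1)-itemsets, prune by the downward-closure property, and count support per transaction.

```python
from collections import Counter
from itertools import combinations

def apriori(transactions, min_support):
    """Return all itemsets appearing in at least `min_support` transactions."""
    transactions = [frozenset(t) for t in transactions]
    # frequent 1-itemsets
    counts = Counter(i for t in transactions for i in t)
    frequent = {frozenset([i]): c for i, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # candidate generation with subset pruning (downward closure)
        items = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))]
        counts = Counter(c for c in candidates for t in transactions if c <= t)
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result
```

In a map/reduce setting, the support-counting step is the natural map phase (each node counts candidates over its partition) and the reduce phase sums the partial counts, which is the design the paper elaborates.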
Particle Swarm Optimization Based K-Prototype Clustering Algorithm (IOSR-JCE)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
In the present day a huge amount of data is generated every minute and transferred frequently. Although the data is sometimes static, most commonly it is dynamic and transactional, and newly generated data is constantly added to the old/existing data. To discover knowledge from this incremental data, one approach is to run the algorithm repeatedly on the modified datasets, which is time consuming. Again, to analyze the datasets properly, construction of an efficient classifier model is necessary; the objective of developing such a classifier is to classify unlabeled datasets into appropriate classes. The paper proposes a dimension reduction algorithm that can be applied in a dynamic environment to generate a reduced attribute set as a dynamic reduct, and an optimization algorithm which uses the reduct to build the corresponding classification system. The method analyzes the new dataset when it becomes available, modifies the reduct accordingly to fit the entire dataset, and generates interesting optimal classification rule sets from the entire dataset. The concepts of discernibility relation, attribute dependency and attribute significance from Rough Set Theory are integrated for the generation of the dynamic reduct set, and optimal classification rules are selected using the PSO method, which not only reduces complexity but also helps achieve higher accuracy of the decision system. The proposed method has been applied to some benchmark datasets collected from the UCI repository; the dynamic reduct is computed, and optimal classification rules are generated from the reduct. Experimental results show the efficiency of the proposed method.
The premise of this paper is to discover frequent patterns through the use of data grids in the WEKA 3.8 environment. Workload imbalance occurs due to the dynamic nature of grid computing, hence data grids are used for the creation and validation of data. Association rules are used to extract useful information from large databases. In this paper the researcher generates the best rules using WEKA 3.8 for better performance; WEKA 3.8 is used to accomplish the best rules and to implement various algorithms.
The D-basis Algorithm for Association Rules of High Confidence (ITIIIndustries)
We develop a new approach for distributed computation of association rules of high confidence on the attributes/columns of a binary table. It is derived from the D-basis algorithm developed by K. Adaricheva and J. B. Nation (Theoretical Computer Science, 2017), which runs multiple times on sub-tables of a given binary table obtained by removing one or more rows. The sets of rules retrieved in these runs are then aggregated. This allows us to obtain a basis of association rules of high confidence, which can be used for ranking all attributes of the table with respect to a given fixed attribute. This paper focuses on some algorithmic details and the technical implementation of the new algorithm. Results are given for tests performed on random, synthetic and real data.
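The confidence computation underlying such rule ranking can be illustrated directly on a binary table. This sketch shows only conf(A -> b) = support(A with b) / support(A), not the D-basis construction or its sub-table aggregation; the row encoding is an assumption for illustration.

```python
def rule_confidence(rows, antecedent, consequent):
    """Confidence of the rule (all columns in `antecedent`) -> `consequent`.

    `rows` is a list of dicts mapping column name -> 0/1 over a binary table.
    Returns support(antecedent + consequent) / support(antecedent).
    """
    has_a = [r for r in rows if all(r[c] for c in antecedent)]
    if not has_a:
        return 0.0
    return sum(1 for r in has_a if r[consequent]) / len(has_a)
```

Running this over many sub-tables (rows removed) and aggregating the surviving high-confidence rules is, at a high level, how the distributed scheme ranks attributes against a fixed target column.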
Cosmetic shop management system project report.pdf (Kamal Acharya)
Buying new cosmetic products is difficult. It can even be scary for those who have sensitive skin and are prone to skin trouble. The information needed to alleviate this problem is on the back of each product, but it's tough to interpret those ingredient lists unless you have a background in chemistry.
Instead of buying and hoping for the best, we can use data science to help us predict which products may be good fits for us. The system includes various function programs to perform the tasks mentioned above, and data file handling has been used effectively in the program.
The automated cosmetic shop management system should deal with the automation of the general workflow and administration process of the shop. The main processes of the system focus on customer requests, where the system is able to search for the most appropriate products and deliver them to the customers. It should help the employees to quickly identify the list of cosmetic products that have reached the minimum quantity and also keep track of the expiry date for each cosmetic product. It should help the employees find the rack number in which a product is placed. It is also a faster and more efficient way of working.
Student information management system project report ii.pdfKamal Acharya
Our project explains about the student management. This project mainly explains the various actions related to student details. This project shows some ease in adding, editing and deleting the student details. It also provides a less time consuming process for viewing, adding, editing and deleting the marks of the students.
Overview of the fundamental roles in Hydropower generation and the components involved in wider Electrical Engineering.
This paper presents the design and construction of hydroelectric dams from the hydrologist’s survey of the valley before construction, all aspects and involved disciplines, fluid dynamics, structural engineering, generation and mains frequency regulation to the very transmission of power through the network in the United Kingdom.
Author: Robbie Edward Sayers
Collaborators and co editors: Charlie Sims and Connor Healey.
(C) 2024 Robbie E. Sayers
Welcome to WIPAC Monthly the magazine brought to you by the LinkedIn Group Water Industry Process Automation & Control.
In this month's edition, along with this month's industry news to celebrate the 13 years since the group was created we have articles including
A case study of the used of Advanced Process Control at the Wastewater Treatment works at Lleida in Spain
A look back on an article on smart wastewater networks in order to see how the industry has measured up in the interim around the adoption of Digital Transformation in the Water Industry.
About
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
Technical Specifications
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
Key Features
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface
• Compatible with MAFI CCR system
• Copatiable with IDM8000 CCR
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
Application
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...Amil Baba Dawood bangali
Contact with Dawood Bhai Just call on +92322-6382012 and we'll help you. We'll solve all your problems within 12 to 24 hours and with 101% guarantee and with astrology systematic. If you want to take any personal or professional advice then also you can call us on +92322-6382012 , ONLINE LOVE PROBLEM & Other all types of Daily Life Problem's.Then CALL or WHATSAPP us on +92322-6382012 and Get all these problems solutions here by Amil Baba DAWOOD BANGALI
#vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore#blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #blackmagicforlove #blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #Amilbabainuk #amilbabainspain #amilbabaindubai #Amilbabainnorway #amilbabainkrachi #amilbabainlahore #amilbabaingujranwalan #amilbabainislamabad
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdffxintegritypublishin
Advancements in technology unveil a myriad of electrical and electronic breakthroughs geared towards efficiently harnessing limited resources to meet human energy demands. The optimization of hybrid solar PV panels and pumped hydro energy supply systems plays a pivotal role in utilizing natural resources effectively. This initiative not only benefits humanity but also fosters environmental sustainability. The study investigated the design optimization of these hybrid systems, focusing on understanding solar radiation patterns, identifying geographical influences on solar radiation, formulating a mathematical model for system optimization, and determining the optimal configuration of PV panels and pumped hydro storage. Through a comparative analysis approach and eight weeks of data collection, the study addressed key research questions related to solar radiation patterns and optimal system design. The findings highlighted regions with heightened solar radiation levels, showcasing substantial potential for power generation and emphasizing the system's efficiency. Optimizing system design significantly boosted power generation, promoted renewable energy utilization, and enhanced energy storage capacity. The study underscored the benefits of optimizing hybrid solar PV panels and pumped hydro energy supply systems for sustainable energy usage. Optimizing the design of solar PV panels and pumped hydro energy supply systems as examined across diverse climatic conditions in a developing country, not only enhances power generation but also improves the integration of renewable energy sources and boosts energy storage capacities, particularly beneficial for less economically prosperous regions. Additionally, the study provides valuable insights for advancing energy research in economically viable areas. 
Recommendations included conducting site-specific assessments, utilizing advanced modeling tools, implementing regular maintenance protocols, and enhancing communication among system components.
VARIATIONS IN OUTCOME FOR THE SAME MAP REDUCE TRANSITIVE CLOSURE ALGORITHM IMPLEMENTED ON DIFFERENT HADOOP PLATFORMS
International Journal of Computer Science & Information Technology (IJCSIT) Vol 12, No 4, August 2020
DOI: 10.5121/ijcsit.2020.12403
VARIATIONS IN OUTCOME FOR THE SAME MAP REDUCE TRANSITIVE CLOSURE ALGORITHM IMPLEMENTED ON DIFFERENT HADOOP PLATFORMS
Purvi Parmar, MaryEtta Morris, John R. Talburt and Huzaifa F. Syed
Center for Advanced Research in Entity Resolution and Information Quality
University of Arkansas at Little Rock, Little Rock, Arkansas, USA
ABSTRACT
This paper describes the outcome of an attempt to implement the same transitive closure (TC) algorithm
for Apache MapReduce running on different Apache Hadoop distributions. Apache MapReduce is a
software framework used with Apache Hadoop, which has become the de facto standard platform for
processing and storing large amounts of data in a distributed computing environment. The research
presented here focuses on the variations observed among the results of an efficient iterative transitive
closure algorithm when run against different distributed environments. The results from these comparisons
were validated against the benchmark results from OYSTER, an open source Entity Resolution system. The
experiment results highlighted the inconsistencies that can occur when using the same codebase with
different implementations of MapReduce.
KEYWORDS
Entity Resolution; Hadoop; MapReduce; Transitive Closure; HDFS; Cloudera; Talend
1. INTRODUCTION
1.1. Entity Resolution
Entity Resolution (ER) is the process of determining whether two references to real world objects
in an information system refer to the same object or to different objects [1]. Real world objects
can be identified by their attributes and by their relationships with other entities [2]. The
equivalence of any two references is determined by an ER system based upon the degree to
which the values of the attributes of the two references are similar. In order to determine these
similarities, ER systems apply a set of Boolean rules to produce a True (link) or False (no link)
decision [3]. Once pairs have been discovered, the next step is to generate clusters of all
references to the same object.
1.2. Blocking
To reduce the number of references compared against one another, one or more blocking
strategies may be applied [1]. Blocking is the process of dividing records into groups with their
most likely matches [4,19]. Match keys are first generated by encoding or transforming a given
attribute. Records whose attributes share the same match key value are placed together within a
block. Comparisons can then be made among records within the same block. One common
method used to generate match keys is Soundex, which encodes strings based on their English
pronunciation [5]. Using Soundex, match keys for the names “Stephen” and “Steven” would both
have a value of S315. Records containing these strings in the appropriate attribute would have the
same match key value, and would thus be placed into the same block for further comparison,
along with records with slight differences such as “Stephn” and “Stevn”. Blocks are often
created by using a combination of multiple match rules. Records can be placed in more than one
block, which can sometimes lead to many redundant comparisons [25].
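To make the blocking step concrete, here is a minimal Python sketch (our own illustration, not the paper's code): a simplified Soundex encoder that omits some edge cases of the full standard, plus a helper that groups records into blocks by the match key of one attribute. The function and attribute names are assumptions for illustration.

```python
import re
from collections import defaultdict

def soundex(name: str) -> str:
    """Simplified Soundex: first letter followed by up to three digit codes."""
    name = re.sub(r"[^A-Za-z]", "", name).upper()
    mapping = {c: d for d, letters in
               {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                "4": "L", "5": "MN", "6": "R"}.items()
               for c in letters}
    code = name[0]
    prev = mapping.get(name[0], "")
    for c in name[1:]:
        digit = mapping.get(c, "")
        if digit and digit != prev:     # skip repeats of the same code
            code += digit
        if c not in "HW":               # H and W do not separate equal codes
            prev = digit
    return (code + "000")[:4]           # pad or truncate to four characters

def block_records(records, key_attr):
    """Group records into blocks keyed by the Soundex of one attribute."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[soundex(rec[key_attr])].append(rec)
    return blocks
```

With this sketch, "Stephen", "Steven", "Stephn", and "Stevn" all encode to S315 and therefore land in the same block, as described above.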
1.3. Transitive Closure
Transitive closure is the process used to discover clusters of references from among matched
pairs. The transitive relationship determines that if reference A is equivalent to reference B,
reference B is equivalent to reference C, then, by the property of transitivity reference A is
equivalent to reference C [2]. This study utilized CC-MR, a transitive closure algorithm
developed for use in MapReduce by Seidl, et al. [18] and enhanced in [8].
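On a single machine, the clusters produced by transitive closure can be computed directly with a union-find structure. The following sketch (ours, not the CC-MR implementation) shows the end result the MapReduce algorithm is designed to reach at scale:

```python
from collections import defaultdict

def clusters_from_pairs(pairs):
    """Group record ids into the clusters implied by matched pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving for efficiency
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)           # merge the two clusters

    clusters = defaultdict(set)
    for x in list(parent):
        clusters[find(x)].add(x)
    return sorted(map(sorted, clusters.values()))
```

For example, the pairs (A,B), (B,C), (D,E) yield the clusters {A,B,C} and {D,E}: A and C are linked transitively even though they were never compared directly.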
1.4. Oyster
The ER processes discussed in this paper were performed with OYSTER (Open System for
Entity Resolution) Version 3.6.7 [3]. OYSTER is an open source ER system developed by the
Center for Advanced Research in Entity Resolution and Information Quality (ERIQ) at the
University of Arkansas at Little Rock. OYSTER’s source code and documentation is freely
available on Bitbucket [7]. The system has proven useful in several research and industry
applications [6]. Notably, OYSTER was used in previous studies using large-scale education
data in [3] and [5] and electronic medical records in [22].
1.5. Hadoop and MapReduce
Apache Hadoop is an open-source system designed to store and process large amounts of data in
a distributed environment [23]. Hadoop is highly scalable, capable of employing thousands of
computers to execute tasks in an efficient manner. Each computer, or node, utilizes the Hadoop
Distributed File System (HDFS) to store data [9].
Apache MapReduce (MR) is a software framework with the capability to create, schedule, and
monitor tasks developed to run on the Hadoop platform [16]. As noted in [10], MR is a good fit
for ER operations, as each pairwise comparison is independent of another, and thus can be run in
parallel [19,20].
2. BACKGROUND
This research began in an effort to migrate OYSTER’s operations to a distributed environment in
order to accommodate faster and more efficient processing of larger datasets.
In the Hadoop distributed environment, the preprocessors lack access to a large shared memory
space, which makes the standard ER blocking approach impossible [11].
This study employs the BlockSplit strategy introduced in [12] to accompany the transitive closure
algorithm design introduced in [8]. The algorithm generates clusters by detecting the maximally
connected subsets of an undirected graph. BlockSplit is a load balancing mechanism that divides
the records in large blocks into multiple reduce tasks for comparison [6]. Three different
platforms for distributions of Hadoop were tested to investigate the stability and efficiency of the
same TC algorithm.
The expected outcome of the study was that the results from the transitive closure process run in
each environment would agree (in terms of F-Measure) with those from the OYSTER baseline
run.
3. RESEARCH METHODOLOGY AND EXPERIMENTS
3.1. Transitive Closure Logic
Consistent with the Map-Reduce paradigm, the steps of the algorithm are divided into a Map and
Reduce phase. In the Map phase, pairs are generated and identifiers are assigned to each pair.
Both the original pair (e.g., (A,B)) and its reverse (e.g., (B,A)) are generated. Pairs are
sorted by the first node and then by the second, and pairs that share the same first element
(recid) are placed into a group. Each group is then processed as described in the Reduce phase below.
Figure 1. Transitive closure logic.
Transitive Closure Logic (Reduce phase): apply the group processing rules.
(X, Y) represents a generic pair. Set processComplete = True.
For each group:
    Examine the first pair in the group, (X, Y)
    If X <= Y: pass the entire group to the output (R1)
    Else if groupSize = 1: ignore the group (do not pass to output) (R2)
    Else:
        Set processComplete = False
        For each pair (A, B) following the first pair (X, Y) in the group:
            Create new pairs (Y, B) and (B, Y)
            Move (Y, B) to the output (R3)
            Move (B, Y) to the output (R4)
        Examine the last pair of the group, (Z, W)
        If X < W: move (X, Y) to the output (R5)
After all groups are processed:
    If processComplete = False: make the output the new input, sort and group
    it, and repeat the process
    Else: the process is complete
Final output: join the result back to the original full set of record keys to
recover singleton clusters, and validate against the connected records.
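The iterative character of these rules can be simulated on one machine as repeated rounds of "group pairs by their first node, then re-point every group member at the smallest id seen." This is a sketch of the underlying idea only, not a faithful implementation of rules R1–R5:

```python
from collections import defaultdict

def iterative_closure(pairs):
    """Repeat rounds of grouping by first node and linking every member of a
    group to the smallest id in it, until a round changes nothing."""
    edges = set()
    for a, b in pairs:                      # Map: emit each pair both ways
        edges.add((a, b))
        edges.add((b, a))
    while True:
        groups = defaultdict(set)           # sort/group by first node
        for x, y in edges:
            groups[x].add(y)
        new_edges = set()
        for x, ys in groups.items():        # Reduce: re-point at the minimum
            members = ys | {x}
            m = min(members)
            for y in members - {m}:
                new_edges.add((m, y))
                new_edges.add((y, m))
        if new_edges == edges:              # processComplete: fixpoint reached
            break
        edges = new_edges
    clusters = defaultdict(set)             # at the fixpoint, every member is
    for m, y in edges:                      # paired with its cluster minimum
        if m < y:
            clusters[m].add(y)
    return sorted(sorted(vs | {m}) for m, vs in clusters.items())
```

On the chain A-B-C-D, successive rounds pull C and then D toward A, converging on the single cluster {A,B,C,D}; this mirrors how the Reduce rules iterate until processComplete is True.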
3.2. Dataset
To simulate the variety of data quality challenges present in real-world data, a synthetic dataset
was used. The dataset contains 151,868 name and address records with 53,467 matching pairs,
and 111,181 total clusters. The field names for the dataset used are: recid, fname, lname, address,
city, state, zip, ssn, and homephone.
While distributed systems such as Hadoop are designed to handle much larger amounts of data,
the dataset size was chosen to accommodate the baseline OYSTER calculations performed on a
single local computer.
3.3. Hadoop Distribution Platforms
The details of each Hadoop distribution used for this study are as follows:
Local HDFS: a stand-alone, single-node cluster running Hadoop version 2.8.4.
Cloudera Enterprise: a multi-node cluster running Cloudera Enterprise version 5.15
along with Apache Hadoop 2.8.4 [13].
Talend Big Data Sandbox: a single-node cluster running Apache Hadoop 2.0 hosted on
Amazon Web Services (AWS) [14, 15].
3.4. Boolean Rules
OYSTER (Version 3.6.7) was used to prepare the input data by applying Boolean match rules.
The output of this step was a set of indices containing the blocks generated by the Boolean rules.
The algorithms used were:
SCAN: remove all non-alphanumeric characters and convert all letters to uppercase.
Soundex: encode each string based on its English pronunciation.
Table 1 details the rules used in this study. This component of the experiment design was
previously used in [5, 21].
Table 1. Boolean Rules Used for Each Index.

Index 1: fname: SCAN, lname: SCAN
Index 2: lname: SCAN, ssn: SCAN
Index 3: fname: Soundex, lname: Soundex, ssn: SCAN
Index 4: fname: Soundex, lname: SCAN, address: SCAN
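The SCAN transformation and the SCAN-only index keys of Table 1 (Indices 1 and 2) can be sketched as follows; the function names are our own:

```python
import re

def scan(value: str) -> str:
    """SCAN: remove all non-alphanumeric characters and uppercase the rest."""
    return re.sub(r"[^A-Za-z0-9]", "", value).upper()

def index1_key(record: dict) -> tuple:
    """Match key for Index 1 of Table 1: fname and lname, both under SCAN."""
    return (scan(record["fname"]), scan(record["lname"]))

def index2_key(record: dict) -> tuple:
    """Match key for Index 2 of Table 1: lname and ssn, both under SCAN."""
    return (scan(record["lname"]), scan(record["ssn"]))
```

Records whose keys agree under any index are placed into the same block, so, for example, "Mary-Ann" and "mary ann" both normalize to MARYANN and compare within one block.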
3.5. Baseline Run
An initial full ER run was conducted using OYSTER to establish a baseline for the dataset. This
process included generating a link file containing the pairs considered to be a match, and
performing transitive closure to discover the clusters. The expectation was that each MR
implementation of TC would produce the same results as the ER calculation in terms of clusters
of records. The OYSTER ER Metrics utility [7] was used for evaluation.
3.6. Experiment Steps
The MapReduce transitive closure experiment was conducted in three steps as defined below.
The first two steps of the study helped to create benchmarks for the expected results of the TC
process, setting the stage for the comparisons made in the final step.
In Step 1, the transitive closure algorithm for MapReduce was run on the Local HDFS cluster,
using the pairwise link file that was generated in OYSTER by applying all Boolean match rules.
This first run was the benchmark for all further experiments.
In Step 2, separate pairwise link files were generated in OYSTER for each Boolean match rule
individually. The files were combined and run through the TC process on the Local HDFS
cluster. This step was repeated on the Cloudera platform for validation.
In Step 3, the original source input data was used, all Boolean rules from Table 1 were applied, and
transitive closure was performed in MR [17]. This step was repeated on all three platforms.
3.7. Evaluation
The output from the TC process on each platform in Step 3 was compared against the initial
Step 1 benchmark that was conducted on the Local HDFS cluster with the full match link file.
OYSTER’s ER Metrics utility [7] was used to compare the results based on the number of true
matches found vs. the number of predicted matches. The primary metric used was the F-Measure,
which is the harmonic mean of precision and recall [2]. F-Measure is reported with a value from
0 to 1, with 1 meaning a 100% match of the expected results.
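Assuming the pair counts are available, precision, recall, and the F-Measure can be computed as follows (parameter names are ours):

```python
def f_measure(true_matches: int, predicted_matches: int, correct_predictions: int) -> float:
    """Harmonic mean of precision and recall over matched pairs."""
    precision = correct_predictions / predicted_matches   # fraction of predictions that are true
    recall = correct_predictions / true_matches           # fraction of true matches recovered
    return 2 * precision * recall / (precision + recall)
```

A run that predicts exactly the true match set scores 1.0; missing or spurious pairs pull the score down through recall or precision respectively.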
4. RESULTS AND DISCUSSION
The Local HDFS cluster executed all steps without issue, and was able to match the benchmark
F-Measure value successfully.
The Cloudera Enterprise platform was the most inconsistent among the environments tested.
Cloudera was used for Steps 2 and 3. Multiple attempts were required due to
compatibility and configuration challenges. The first few runs on the Cloudera platform failed
because of a compatibility issue with the custom Soundex algorithm. Subsequent runs were
processed with a reduced-size dataset (75,934 records), which resulted in an F-Measure of 0.68,
well below the full benchmark baseline. After reconfiguration, a final run was done with
the full dataset, and the subsequent F-Measure was 0.97.
The Talend Big Data Sandbox Hadoop platform was used in Step 3. The initial run’s F-Measure
was 0.57. After a review of the results, it was determined that the input file format, CSV, caused
inconsistencies during processing. This was corrected by converting to a Fixed Width input
format as required by the platform. The subsequent run yielded 0.77, over a 10% improvement.
Table 2. Outcomes by platform for Step 3.

Platform      Attempts   Successes   Average Outcome¹   Best Outcome¹
Local HDFS    3          2           0.99               0.99
Cloudera      5          2           0.98               0.97
Talend        3          2           0.67               0.77

¹ By F-Measure.
Table 2 summarizes the results of Step 3 from all three platforms. "Attempts" is the number of
attempts taken, including failed runs. "Successes" is the count of attempts that ran to
completion. The Average and Best outcomes of these attempts, by F-Measure, are also listed.
While the same input data and TC algorithm were used, a few issues led to failed
attempts and inconsistent results. Working with different platforms required adapting the process
to fit the configuration constraints of each system, as with the file format issue mentioned above.
In addition, the underlying load balancing mechanisms of each platform seemed to vary in the
manner in which the pairwise comparison tasks were spread across the nodes. The platforms
used different internal thresholds to determine how the pairs were spread across data processing
nodes, which led to inconsistencies among the results, even between iterations on the same
platform. This affected the ability of the algorithm to return all matching pairs, which in turn
affected the ability to discover the correct number of clusters during the Reduce phase. As
previously mentioned, blocking is in itself an ongoing challenge in ER, and distributed
environments add to the complexity.
5. CONCLUSION
Although the original intent was to improve the scalability and consistency of the TC algorithm,
the experiment results instead showed inconsistent iterative TC behavior across the different
platforms.
The expected result of generating the same matches from using the TC algorithm on MR as the
baseline run was impacted by the differences in configuration requirements and blocking
behavior on the different Hadoop platforms.
A compromise needs to be made between load balancing and preserving the generated blocks of
matched pairs during the Map phase. If matched pairs that should be in the same block are spread
across multiple nodes, the Reduce phase cannot correctly and consistently complete the transitive
closure process.
These experiments highlight some of the underlying scalability issues for Entity Resolution
processes in distributed environments. Future experiments include exploring additional blocking
strategies, as well as testing additional platforms and distributed computing frameworks. We also
plan to perform this experiment on different datasets to further validate our methods.
ACKNOWLEDGEMENTS
The authors thank the Pilog Group for providing access to their Cloudera environment and
Shahnawaz Kapadia for his assistance with the initial research. The authors thank Pradeep
Parmar for providing financial support.
REFERENCES
[1] Talburt, J. R., & Zhou, Y. (2013). A practical guide to entity resolution with OYSTER. In Handbook
of Data Quality (pp. 235-270). Springer, Berlin, Heidelberg
[2] Zhong, B., & Talburt, J. (2018, December). Using Iterative Computation of Connected Graph
Components for Post-Entity Resolution Transitive Closure. In 2018 International Conference on
Computational Science and Computational Intelligence (CSCI) (pp. 164-168). IEEE.
[3] Nelson, E. D., & Talburt, J. R. (2011). Entity resolution for longitudinal studies in education using
OYSTER. In Proceedings of the International Conference on Information and Knowledge
Engineering (IKE) (p. 1). The Steering Committee of The World Congress in Computer Science,
Computer Engineering and Applied Computing (WorldComp).
[4] Christen, P. (2012). "A Survey of Indexing Techniques for Scalable Record Linkage and
Deduplication," in IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 9, pp. 1537-
1555, Sept. 2012, doi: 10.1109/TKDE.2011.127.
[5] Wang, P., Pullen, D., Talburt, J., & Wu, N. (2015). Applying Phonetic Hash Functions to Improve
Record Linking in Student Enrollment Data. In Proceedings of the International Conference on
Information and Knowledge Engineering (IKE) (p. 187). The Steering Committee of The World
Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp).
[6] Osesina, O. I., & Talburt, J. (2012). A Data-Intensive Approach to Named Entity Recognition
Combining Contextual and Intrinsic Indicators. International Journal of Business Intelligence
Research (IJBIR), 3(1), 55-71. doi:10.4018/jbir.2012010104
[7] OYSTER Open Source Project, https://bitbucket.org/oysterer/oyster/
[8] Kolb, L., Sehili, Z., & Rahm, E. (2014). Iterative computation of connected graph components with
MapReduce. Datenbank-Spektrum, 14(2), 107-117.
[9] Manikandan, S. G., & Ravi, S. (2014, October). Big data analysis using Apache Hadoop. In 2014
International Conference on IT Convergence and Security (ICITCS) (pp. 1-4). IEEE.
[10] Kolb, L., Thor, A., & Rahm, E. (2012). Dedoop: Efficient deduplication with hadoop. Proceedings of
the VLDB Endowment, 5(12), 1878-1881.
[11] Chen, X, Schallehn, E, Saake, G. (2018). Cloud-Scale Entity Resolution:Current State and Open
Challenges, Open Journal of Big Data (OJBD) Volume 4, (Issue 1), (Available at
http://www.ronpub.com/ojbd/ ;ISSN 2365-029X).
[12] Kolb, L., Thor, A., & Rahm, E. (2012, April). Load balancing for mapreduce-based entity resolution.
In 2012 IEEE 28th international conference on data engineering (pp. 618-629). IEEE.
[13] Cloudera. (2019). Cloudera Enterprise 5.15.x documentation, (available at
https://docs.cloudera.com/documentation/enterprise/5-15-x.html); retrieved December 11, 2019
[14] Talend Big Data Sandbox, https://www.talend.com/products/big-data/real-time-big-data/, retrieved
December 10, 2019.
[15] Amazon Web Services. (2019)., (available at https://en.wikipedia.org/wiki/Amazon_Web_Services);
retrieved November 1,2019.
[16] Salinas, S. O., & Lemus, A. C. (2017). Data warehouse and big data integration. Int. Journal of
Comp. Sci. and Inf. Tech, 9(2), 1-17.
[17] Chen, C., Pullen, D., Petty, R. H., & Talburt, J. R. (2015, November). Methodology for Large-Scale
Entity Resolution without Pairwise Matching. In 2015 IEEE International Conference on Data
Mining Workshop (ICDMW) (pp. 204-210). IEEE.
[18] Thomas Seidl, Brigitte Boden, and Sergej Fries. (2012). CC-MR - finding connected components in
huge graphs with MapReduce. In Proceedings of the 2012th European Conference on Machine
Learning and Knowledge Discovery in Databases - Volume Part I (ECMLPKDD’12). Springer-
Verlag, Berlin, Heidelberg, 458–473.
[19] Hsueh, S. C., Lin, M. Y., & Chiu, Y. C. (2014, January). A load-balanced mapreduce algorithm for
blocking-based entity-resolution with multiple keys. In Proceedings of the Twelfth Australasian
Symposium on Parallel and Distributed Computing-Volume 152 (pp. 3-9).
[20] Elsayed, T., Lin, J., & Oard, D. W. (2008, June). Pairwise document similarity in large collections
with MapReduce. In Proceedings of ACL-08: HLT, Short Papers (pp. 265-268).
[21] Syed, H., Wang, Talburt, J.R., Liu, F., Pullen, D., Wu,N. (2012). Developing and refining matching
rules for entity resolution, in Proceedings of the International Conference on Information and
knowledge Engineering (IKE), Las Vegas, NV
[22] Gupta T., Deshpande V. (2020) Entity Resolution for Maintaining Electronic Medical Record Using
OYSTER. In: Haldorai A., Ramu A., Mohanram S., Onn C. (eds) EAI International Conference on
Big Data Innovation for Sustainable Cognitive Computing. EAI/Springer Innovations in
Communication and Computing. Springer, Cham.
[23] Muniswamaiah, M., Agerwala, T., and Tappert, C. (2019). Big data in cloud computing review and
opportunities. Int. Journal of Comp Sci and Inf. Tech, 11(4), 43-57.
[24] Zhou, Y., & Talburt, J. R. (2014). Strategies for Large-Scale Entity Resolution Based on Inverted
Index Data Partitioning. In Yeoh, W., Talburt, J. R., & Zhou, Y. (Ed.), Information Quality and
Governance for Business Intelligence (pp. 329-351). IGI Global. http://doi:10.4018/978-1-4666-
4892-0.ch017
[25] Efthymiou, Vasilis & Papadakis, George & Papastefanatos, George & Stefanidis, Kostas & Palpanas,
Themis. (2017). Parallel Meta-blocking for Scaling Entity Resolution over Big Heterogeneous Data.
Information Systems. 65. 137-157.