With the rapid growth of the Internet and the World Wide Web, information retrieval (IR) has
attracted much attention in recent years. Quick, accurate and high-quality information mining is the
core concern of successful search companies. Likewise, spammers try to manipulate IR systems
to fulfil their stealthy needs. Spamdexing (also known as web spamming) is one of the
spamming techniques of adversarial IR, allowing spammers to manipulate the ranking of specific documents
in the search engine result page (SERP). Spammers take advantage of different features of the web
indexing system for malicious motives. Suitable machine learning approaches can be useful in
analysing spam patterns and automating spam detection. This paper examines content-based
features of web documents and discusses the potential of feature selection (FS) in upcoming
studies to combat web spam. The objective of feature selection is to select the salient features to
improve prediction performance and to understand the underlying data-generating process.
A publicly available web data set, namely WEBSPAM-UK2007, is used for all evaluations.
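A minimal sketch of the pipeline the abstract describes, assuming a hypothetical matrix of content-based features (e.g., word counts, title length, compression ratio) extracted per host from a corpus such as WEBSPAM-UK2007; the mutual-information selector and random forest classifier below are illustrative choices, not the paper's exact configuration:

```python
# Hedged sketch: content-feature based web spam classification with feature selection.
# X is assumed to be an (n_hosts x n_features) array of content features and
# y the spam/non-spam labels (corpus loading is not shown; placeholders are used).
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.random((1000, 96))          # placeholder for 96 content-based features
y = rng.integers(0, 2, size=1000)   # placeholder spam (1) / non-spam (0) labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Feature selection: keep the k features most informative about the label.
selector = SelectKBest(mutual_info_classif, k=20).fit(X_tr, y_tr)
X_tr_sel, X_te_sel = selector.transform(X_tr), selector.transform(X_te)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr_sel, y_tr)
print("F1 on held-out hosts:", f1_score(y_te, clf.predict(X_te_sel)))
```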
Entity linking with a knowledge base: issues, techniques, and solutions (Shakas Technologies)
The large number of potential applications that bridge Web data with knowledge bases has led to an increase in entity linking research. Entity linking is the task of linking entity mentions in text with their corresponding entities in a knowledge base.
WEB SEARCH ENGINE BASED SEMANTIC SIMILARITY MEASURE BETWEEN WORDS USING PATTE... (cscpconf)
Semantic similarity measures play an important role in information retrieval, natural language processing and various tasks on the web such as relation extraction, community mining, document clustering, and automatic meta-data extraction. In this paper, we propose a Pattern Retrieval Algorithm (PRA) to compute the semantic similarity measure between words by combining both the page count method and the web snippets method. Four association measures are used to find semantic similarity between words in the page count method using web search engines. We use a Sequential Minimal Optimization (SMO) support vector machine (SVM) to find the optimal combination of page-count-based similarity scores and top-ranking patterns from the web snippets method. The SVM is trained to classify synonymous word pairs and non-synonymous word pairs. The proposed approach aims to improve the correlation values, precision, recall, and F-measures compared to existing methods. The proposed algorithm outperforms existing methods, achieving a correlation value of 89.8%.
This document discusses methods for collecting reliable data on user behavior from online social networks. It reviews previous research that has collected social network data through OSN APIs/crawling, surveys/questionnaires, and custom application deployment. The authors argue that these methods can produce biased data as they often only collect publicly shared content or rely on self-reported information. The authors propose combining existing methods, specifically the Experience Sampling Method, to collect more reliable data by capturing both shared and non-shared user behaviors and contexts through in situ experiments.
Entity linking with a knowledge base: issues, techniques and solutions (CloudTechnologies)
The document discusses entity linking, which is the task of linking entity mentions in text to corresponding entries in a knowledge base. It presents challenges like name variations and ambiguity. The paper then surveys main approaches to entity linking and discusses applications like knowledge base population and question answering. It also covers evaluation of entity linking systems and directions for future work.
Computing semantic similarity measure between words using web search engine (csandit)
Semantic similarity measures between words play an important role in information retrieval,
natural language processing and various tasks on the web. In this paper, we propose a
Modified Pattern Extraction Algorithm to compute the supervised semantic similarity measure
between words by combining both the page count method and the web snippets method. Four
association measures are used to find semantic similarity between words in the page count method
using web search engines. We use a Sequential Minimal Optimization (SMO) support vector
machine (SVM) to find the optimal combination of page-count-based similarity scores and
top-ranking patterns from the web snippets method. The SVM is trained to classify synonymous
word pairs and non-synonymous word pairs. The proposed Modified Pattern Extraction
Algorithm outperforms existing methods, achieving a correlation value of 89.8 percent.
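Both semantic-similarity entries above combine four page-count association measures with an SMO-trained SVM. The specific measures are not named in these abstracts, so the sketch below assumes the commonly used WebJaccard, WebOverlap, WebDice and WebPMI variants computed from hypothetical search-engine hit counts:

```python
# Hedged sketch: page-count based association measures between two words P and Q.
# hits() is a stand-in for a real search-engine API returning the page count of a query;
# the measure definitions follow the usual WebJaccard/WebOverlap/WebDice/WebPMI forms.
import math

N = 10**10  # assumed number of indexed pages

def hits(query: str) -> int:
    # Placeholder: a real implementation would call a search-engine API here.
    fake_counts = {"car": 5_000_000, "automobile": 2_000_000, "car AND automobile": 800_000}
    return fake_counts.get(query, 1)

def association_scores(p: str, q: str) -> list[float]:
    hp, hq, hpq = hits(p), hits(q), hits(f"{p} AND {q}")
    jaccard = hpq / (hp + hq - hpq)
    overlap = hpq / min(hp, hq)
    dice = 2 * hpq / (hp + hq)
    pmi = math.log2((hpq / N) / ((hp / N) * (hq / N))) if hpq else 0.0
    return [jaccard, overlap, dice, pmi]

# These four scores, concatenated with snippet-pattern features, would form the
# feature vector of a word pair; an SVM (e.g. sklearn.svm.SVC) trained on labelled
# pairs then separates synonymous from non-synonymous pairs.
print(association_scores("car", "automobile"))
```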
The document describes an entity linking system with three main modules: 1) Entity linking to map entity mentions in text to entities in a knowledge base, which is challenging due to name variations and ambiguity. 2) A knowledge base containing entities. 3) Candidate entity ranking to rank potential entities for a mention using evidence like supervised ranking methods. The system aims to address issues in entity linking like predicting unlinkable mentions.
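The three-module description above (mention detection, knowledge base, candidate ranking) can be illustrated with a toy sketch; the name dictionary, prior-popularity scores, context-overlap ranking and NIL threshold below are illustrative assumptions, not the surveyed systems' actual components:

```python
# Hedged sketch: dictionary-based candidate generation plus a simple ranking
# that mixes entity prior popularity with context word overlap.
# The tiny knowledge base and the NIL threshold are made up for illustration.
KB = {
    "Michael Jordan (basketball)": {"prior": 0.7, "context": {"nba", "bulls", "basketball"}},
    "Michael Jordan (scientist)":  {"prior": 0.3, "context": {"machine", "learning", "berkeley"}},
}
NAME_DICT = {"michael jordan": list(KB)}

def link(mention: str, context_words: set, nil_threshold: float = 0.2):
    candidates = NAME_DICT.get(mention.lower(), [])
    best, best_score = None, 0.0
    for entity in candidates:
        overlap = len(context_words & KB[entity]["context"]) / (len(context_words) or 1)
        score = 0.5 * KB[entity]["prior"] + 0.5 * overlap   # illustrative weighting
        if score > best_score:
            best, best_score = entity, score
    # Predict an unlinkable (NIL) mention when no candidate scores high enough.
    return best if best_score >= nil_threshold else "NIL"

print(link("Michael Jordan", {"bulls", "nba", "finals"}))
```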
This document summarizes a research paper that develops an ant colony optimization (ACO) approach for filtering spam emails. It begins by noting the negative impacts of spam and how machine learning techniques have improved spam filtering over traditional rule-based methods. It then provides an overview of how ACO has been applied to data mining problems. The document proposes an ACO-based spam filtering model called AntSFilter and evaluates its performance on a public email dataset compared to other classifiers like Naive Bayes and Ripper. The preliminary results found AntSFilter yielded better accuracy with smaller rule sets, highlighting important features for identifying email categories.
Detecting Gender-bias from Energy Modeling Jobs (capeyungahhh)
Despite the many strides made by women in the workplace, workplace inequality persists, whether it stems from regional cultures or from workplace attrition. Unconscious (or implicit) bias can creep
into job listings and can actively or passively discourage certain applicant pools. The goal of this study is to understand the role of gender bias in the energy modeling job market.
Job listings and their word choices can significantly affect the application pool. This study investigates the word choices of job postings for the energy modeling sector within the United States building construction industry, using data web-scraped from an online job site. The study explores the potential usage of gender-biased terminology through frequency analysis and a natural language processing toolkit. The study also employs two machine learning methods (one supervised and one unsupervised) to test their application in this
context and evaluate their utility.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele... (Editor IJCATR)
The Bayesian classifier works efficiently in some domains and poorly in others; its performance suffers in domains that involve correlated features. Feature selection is beneficial for reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility. However, the recent increase in data dimensionality poses a hard challenge to many existing feature selection methods with respect to efficiency and effectiveness. In this paper, a Bayesian classifier with correlation based feature selection is introduced, which can identify relevant features as well as redundancy among relevant features without pairwise correlation analysis. The efficiency and effectiveness of the method are demonstrated through a broad evaluation.
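Correlation-based feature selection is usually scored with a merit function that favors features highly correlated with the class but uncorrelated with each other; a minimal sketch of that standard heuristic, assuming correlations are estimated with Pearson's r on a toy dataset (the paper's exact redundancy criterion may differ):

```python
# Hedged sketch: the standard CFS merit of a feature subset,
# merit = k * mean(feature-class corr) / sqrt(k + k*(k-1) * mean(feature-feature corr)).
import numpy as np

def cfs_merit(X: np.ndarray, y: np.ndarray, subset: list) -> float:
    k = len(subset)
    rcf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return rcf
    rff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                   for i, a in enumerate(subset) for b in subset[i + 1:]])
    return (k * rcf) / np.sqrt(k + k * (k - 1) * rff)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(float)
# A subset adding an irrelevant, uncorrelated feature scores lower merit.
print(cfs_merit(X, y, [0, 1]), cfs_merit(X, y, [0, 1, 4]))
```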
A Trinity Construction for Web Extraction Using Efficient Algorithm (IOSR Journals)
This document describes a proposed system for web extraction using a trinity construction and efficient algorithms. It begins with an abstract discussing how trinity characteristics can be used to automatically extract content from websites in a sequential tree structure. It then discusses the existing system which uses trinity tree and prefix/suffix sorting but has limitations. The proposed system introduces fuzzy logic for multi-perspective crawling across multiple websites. A genetic algorithm is used to load extracted content into the trinity structure and remove unwanted data. Finally, an ant colony optimization algorithm is used to obtain an effective structure and suggest optimized solutions.
The document introduces an API-based webometrics tool called WeboNaver that was created for the Korean search engine Naver. WeboNaver allows researchers to automatically collect large amounts of web data and distinguish different types of information. It provides an interface that allows users to select search queries, run them, and output the results. The tool provides a way to systematically analyze web presence and trends using data from the popular Korean search engine Naver.
1. The document proposes techniques to improve search performance by matching schemas between structured and unstructured data sources.
2. It involves constructing schema mappings using named entities and schema structures. It also uses strategies to narrow the search space to relevant documents.
3. The techniques were shown to improve search accuracy and reduce time/space complexity compared to existing methods.
Prior empirical and theoretical work has discussed the role dominant search engines play in information gatekeeping on the Web, and there are reports of the high ranking of the Wikipedia website among search engine result pages (SERPs). However, little research has been conducted on non-Google search engines and non-English versions of user-generated encyclopedias. This paper proposes a method to quantify the “display” gatekeeping differences of the SERP ranking and presents findings based on Chinese SERP data. Based on 2,500 mainly-Chinese-language search queries, the data set includes the SERP outcomes of four Chinese-speaking regions (mainland China, Singapore, Hong Kong and Taiwan) provided by three major search engines (Baidu, Google and Yahoo), covering over 97% of the search engine market in each region. The findings, analysed and visualized using network analysis techniques, demonstrate the following: major user-generated encyclopedias are among the most visible, and localization factors matter (certain search engine variants, especially mainland Chinese ones, produce the most divergent outcomes). The strong effects of “network gatekeeping” by search engines also suggest similar dynamics inside user-generated encyclopedias.
This document proposes a new method to re-rank web documents retrieved by search engines based on their relevance to a user's query using ontology concepts. It involves building an ontology of concepts for a given domain (electronic commerce), extracting concepts from retrieved documents, and re-ranking documents based on the frequency of ontology concepts within them. An evaluation showed the approach reduced average ranking error compared to search engines alone. The method was tested on the first 30 documents retrieved for the query "e-commerce" from search engines.
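A minimal sketch of the re-ranking idea described above, assuming a hand-made concept list for the e-commerce domain and plain term counting (the paper's ontology construction and scoring details are not reproduced here):

```python
# Hedged sketch: re-rank retrieved documents by the frequency of domain-ontology
# concepts they contain. The concept list and documents are illustrative only.
import re

ECOMMERCE_CONCEPTS = {"payment", "cart", "checkout", "shipping", "invoice", "retail"}

def concept_score(text: str) -> int:
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for t in tokens if t in ECOMMERCE_CONCEPTS)

retrieved = [
    ("doc1", "Online retail checkout flows and payment gateways"),
    ("doc2", "A history of medieval trade routes"),
    ("doc3", "Shopping cart abandonment and shipping costs in e-commerce"),
]

# Stable sort: documents with more ontology concepts move up, ties keep engine order.
reranked = sorted(retrieved, key=lambda d: concept_score(d[1]), reverse=True)
print([doc_id for doc_id, _ in reranked])
```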
DETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNING (ijcsit)
With the advent of the Internet and social media, while hundreds of people have benefitted from the vast sources of information available, there has been an enormous increase in cyber-crime, particularly targeted towards women. According to a 2019 report in the Economic Times [4], India witnessed a 457% rise in cybercrime in the five-year span between 2011 and 2016. Most speculate that this is due to the impact of social media such as Facebook, Instagram and Twitter on our daily lives. While these definitely help in creating a sound social network, creating a user account on these sites usually needs just an email id. A real-life person can create multiple fake IDs, and hence impostors can easily be made. Unlike the real-world scenario, where multiple rules and regulations are imposed to identify oneself in a unique manner (for example, while issuing one’s passport or driver’s license), in the virtual world of social media admission does not require any such checks. In this paper, we study different Instagram accounts in particular and try to assess an account as fake or real using machine learning techniques, namely Logistic Regression and the Random Forest algorithm.
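A minimal sketch of the classification setup named above, assuming a hypothetical tabular dataset of profile features (follower count, following count, bio length, presence of a profile picture); the features, labelling rule and data are placeholders, not the paper's dataset:

```python
# Hedged sketch: fake-vs-real account classification with Logistic Regression
# and Random Forest on made-up profile features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n = 500
followers = rng.integers(0, 5000, n)
following = rng.integers(0, 5000, n)
bio_len = rng.integers(0, 150, n)
has_pic = rng.integers(0, 2, n)
X = np.column_stack([followers, following, bio_len, has_pic])
# Toy labelling rule just to make the example runnable.
y = ((following > 3 * (followers + 1)) & (has_pic == 0)).astype(int)

for name, model in [("LogReg", LogisticRegression(max_iter=1000)),
                    ("RandomForest", RandomForestClassifier(n_estimators=100, random_state=0))]:
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```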
Graph mining analyzes structured data like social networks and the web through graph search algorithms. It aims to find frequent subgraphs using Apriori-based or pattern growth approaches. Social networks exhibit characteristics like densification and heavy-tailed degree distributions. Link mining analyzes heterogeneous, multi-relational social network data through tasks like link prediction and group detection, facing challenges of logical vs statistical dependencies and collective classification. Multi-relational data mining searches for patterns across multiple database tables, including multi-relational clustering that utilizes information across relations.
Liao and Petzold, OpenSym Berlin: Wikipedia geolinguistic normalization (Hanteng Liao)
This paper proposes a method of geo-linguistic normalization to advance the existing comparative analysis of open collaborative communities, with multilingual Wikipedia projects as the example. Such normalization requires data regarding the potential users and/or resources of a geolinguistic unit.
Advance Clustering Technique Based on Markov Chain for Predicting Next User M... (idescitation)
According to surveys, India is one of the leading countries in the world for technical and management education, and the number of students is increasing day by day at a growth rate of 45% per annum. Advancement in technology has a particular effect on the education system and helps in upgrading higher education. Some universities and colleges are using these technologies; the weblog is one of them. The main aim of this paper is to represent web logs using a clustering technique for predicting the next user movement and for user behavior analysis. The paper revolves around a web log clustering technique based on Markov chain results, and presents an approach to web clustering (clustering web site users) and to predicting their behavior for the next visit. Methodology: For generating effective results, web usage data from approximately 14 engineering colleges is used, and an advance clustering approach is presented after optimizing other clustering approaches. Results: User behavior is predicted with the help of the advance clustering approach based on FPCM and k-means, and the proposed algorithm is used to mine and predict users' preferred paths. Existing approaches have been used to predict user behavior, but they are not sufficient because of their sensitivity to noise; with the help of ACM, noise is reduced, providing a more accurate result for predicting user behavior. Implementation: The algorithm was implemented in MATLAB, DTRG and Java. The experimental results show that this method is effective in predicting user behavior and validate its effectiveness in comparison with some previous studies.
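A minimal sketch of the Markov-chain idea underlying the entry above: estimate page-to-page transition probabilities from web log sessions and predict the most likely next page. The toy sessions and the first-order model are illustrative assumptions; the paper's FPCM/k-means clustering step is not reproduced here:

```python
# Hedged sketch: first-order Markov chain over page visits from web log sessions.
from collections import Counter, defaultdict

sessions = [
    ["home", "courses", "admissions", "contact"],
    ["home", "courses", "fees"],
    ["home", "admissions", "contact"],
]

transitions = defaultdict(Counter)
for s in sessions:
    for cur, nxt in zip(s, s[1:]):
        transitions[cur][nxt] += 1

def predict_next(page):
    counts = transitions.get(page)
    if not counts:
        return None
    total = sum(counts.values())
    # Return the most probable next page according to the transition counts.
    probs = {p: c / total for p, c in counts.items()}
    return max(probs, key=probs.get)

print(predict_next("home"))     # most likely page after "home"
print(predict_next("courses"))
```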
Investigating crimes using text mining and network analysis (ZhongLI28)
IDENTIFYING IMPORTANT FEATURES OF USERS TO IMPROVE PAGE RANKING ALGORITHMS (IJwest)
The Web is a wide, varied and dynamic environment in which different users publish their documents. Web mining is one of the applications of data mining in which web patterns are explored. Studies on web mining can be categorized into three classes: application mining, content mining and structure mining. Today, the Internet has found increasing significance, and search engines are considered an important tool for responding to users' interactions. Among the algorithms used to find pages desired by users is the PageRank algorithm, which ranks pages based on users' interests. As the most widely used algorithm by search engines including Google, it has proved its eligibility compared to similar algorithms, but considering the growth speed of the Internet and the increase in the use of this technology, improving the performance of this algorithm is considered one of the necessities of web mining. The current study focuses on the Ant Colony algorithm and marks the most visited links based on higher amounts of pheromone. Results of the proposed algorithm indicate the high accuracy of this method compared to previous methods. The Ant Colony algorithm, as one of the swarm intelligence algorithms inspired by the social behavior of ants, can be effective in modeling the social behavior of web users. In addition, application mining and structure mining techniques can be used simultaneously to improve page ranking performance.
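The entry above modifies link-based ranking with pheromone trails; as a baseline reference, here is a minimal power-iteration sketch of the standard PageRank it builds on (the toy link graph is invented, and the ant-colony pheromone update is not reproduced):

```python
# Hedged sketch: standard PageRank by power iteration on a toy link graph.
import numpy as np

links = {  # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = sorted(links)
idx = {p: i for i, p in enumerate(pages)}
n = len(pages)

# Column-stochastic transition matrix: M[j, i] = 1/outdeg(i) if i links to j.
M = np.zeros((n, n))
for src, outs in links.items():
    for dst in outs:
        M[idx[dst], idx[src]] = 1.0 / len(outs)

d = 0.85                       # damping factor
r = np.full(n, 1.0 / n)        # initial uniform rank
for _ in range(50):
    r = (1 - d) / n + d * M @ r

print(dict(zip(pages, np.round(r / r.sum(), 3))))
```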
A Cross Layer Based Scalable Channel Slot Re-Utilization Technique for Wirele... (csandit)
The tremendous growth of wireless-based application services is increasing the demand
for wireless communication techniques that use bandwidth more effectively. Channel slot re-utilization
in multi-radio wireless mesh networks (WMNs) is a very challenging problem. WMNs have
been adopted as a backhaul to connect various networks, such as Wi-Fi (802.11) and WiMAX
(802.16e), to the Internet. The slot re-utilization techniques proposed so far suffer from high
collision rates due to channel slot usage approximation errors. To overcome this, the
authors propose a cross-layer optimization technique by designing a device-classification-based
channel slot re-utilization routing strategy, which considers channel slot and node
information from various layers and uses some of these parameters to approximate the risk
involved in channel slot re-utilization in order to improve the QoS of the network. The simulation
and analytical results show the effectiveness of the proposed approach in terms of channel slot
re-utilization efficiency, which helps in reducing latency for data transmission and reduces
channel slot collisions.
Geometric Correction for Braille Document Images (csandit)
COQUEL: A CONCEPTUAL QUERY LANGUAGE BASED ON THE ENTITY-RELATIONSHIP MODEL (csandit)
As more and more collections of data become available on the Internet, end users who are not experts in
Computer Science demand easy solutions for retrieving data from these collections. A good
solution for these users is conceptual query languages, which facilitate the composition of
queries by means of a graphical interface. In this paper, we present (1) CoQueL, a conceptual
query language specified on E/R models, and (2) a translation architecture for translating
CoQueL queries into languages such as XQuery or SQL.
FILESHADER: ENTRUSTED DATA INTEGRATION USING HASH SERVER (csandit)
This document summarizes a research paper that proposes FileShader, a system using hash values to ensure file integrity during transfers between clients and servers. FileShader works by having the file provider calculate and send the hash value of a file to a trusted hash server. When clients download a file, FileShader calculates the hash and compares it to the value stored on the hash server to detect any changes. The researchers implemented a prototype of FileShader and found it could accurately detect file changes with little performance overhead. They conclude FileShader is a practical solution that can increase security for internet users by verifying file integrity during transfers.
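A minimal sketch of the integrity check FileShader is described as performing: compare the hash of a downloaded file against a value obtained from a trusted hash server. The hash-server lookup is mocked with a dictionary, and SHA-256 is an assumed choice of hash function (the summary does not specify one):

```python
# Hedged sketch: verify a downloaded file against a hash stored on a trusted server.
import hashlib
from pathlib import Path

# Stand-in for querying the trusted hash server; a real client would use its API.
TRUSTED_HASHES = {}

def file_sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: Path) -> bool:
    expected = TRUSTED_HASHES.get(path.name)
    return expected is not None and file_sha256(path) == expected

# Demo: provider registers the hash, client verifies after "download".
p = Path("report.pdf")
p.write_bytes(b"original contents")
TRUSTED_HASHES["report.pdf"] = file_sha256(p)   # provider side
print(verify(p))                                # True: file unchanged
p.write_bytes(b"tampered contents")
print(verify(p))                                # False: change detected
```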
Associative Regressive Decision Rule Mining for Predicting Customer Satisfact... (csandit)
Opinion mining, also known as sentiment analysis, involves customer satisfaction patterns,
sentiments and attitudes toward entities, products, services and their attributes. With the rapid
development of the Internet, potential customers provide a large volume of product/service
reviews. High volumes of customer reviews have been processed through taxonomy-aware
processing, but it remained difficult to identify the best reviews. In this paper, an Associative
Regression Decision Rule Mining (ARDRM) technique is developed to predict patterns for the
service provider and to improve customer satisfaction based on review comments. Associative
Regression based Decision Rule Mining performs two steps to improve the customer satisfaction
level. Initially, a Machine Learning Bayes Sentiment Classifier (MLBSC) is used to assign class
labels to each service review. After that, the regressive factor of the opinion words and the class
labels are checked for association between the words using various probabilistic rules. Based on
these probabilistic rules, the effect of opinions and sentiments on customer reviews is analyzed
to arrive at the specific set of services preferred by the customers, together with their review
comments. The Associative Regressive Decision Rule helps the service provider to take decisions
on improving the customer satisfaction level. The experimental results reveal that the ARDRM
technique improves performance in terms of true positive rate, associative regression factor,
regressive decision rule generation time, and review detection accuracy for similar patterns.
Segmentation and Labelling of Human Spine MR Images Using Fuzzy Clustering (csandit)
Computerized medical image segmentation is a challenging area because of poor resolution
and weak contrast. The predominantly used conventional clustering techniques and
thresholding methods suffer from limitations owing to their heavy dependence on user
interactions, and the uncertainties prevalent in an image cannot be captured by these techniques.
Performance deteriorates further when the images are corrupted by noise, outliers and other
artifacts. The objective of this paper is to develop an effective, robust fuzzy C-means clustering
for segmenting vertebral bodies from magnetic resonance images. The motivation for this work is
that spine appearance, shape and geometry measurements are necessary for abnormality
detection, and thus proper localisation and labelling will enhance the diagnostic output of a
physician. The method is compared with Otsu thresholding and K-means clustering to illustrate
its robustness. The reference standard for validation was the annotated images from the
radiologist, and the Dice coefficient and Hausdorff distance measures were used to evaluate the
segmentation.
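The evaluation measures named above can be sketched as follows; binary masks are assumed as NumPy arrays, and the directed Hausdorff distance comes from SciPy (the symmetric variant is the maximum of the two directions):

```python
# Hedged sketch: Dice coefficient and (symmetric) Hausdorff distance between
# a predicted binary segmentation mask and a reference mask.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice(pred: np.ndarray, ref: np.ndarray) -> float:
    inter = np.logical_and(pred, ref).sum()
    return 2 * inter / (pred.sum() + ref.sum())

def hausdorff(pred: np.ndarray, ref: np.ndarray) -> float:
    p = np.argwhere(pred)   # coordinates of foreground pixels
    r = np.argwhere(ref)
    return max(directed_hausdorff(p, r)[0], directed_hausdorff(r, p)[0])

ref = np.zeros((64, 64), dtype=bool);  ref[20:40, 20:40] = True
pred = np.zeros((64, 64), dtype=bool); pred[22:42, 21:41] = True
print(f"Dice = {dice(pred, ref):.3f}, Hausdorff = {hausdorff(pred, ref):.2f} px")
```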
Selection of Best Alternative in Manufacturing and Service Sector Using Multi... (csandit)
Modern manufacturing organizations face versatile challenges due to globalization,
modern lifestyle trends and rapid market requirements from both locally and globally placed
competitors. Organizations face high stress from a dual perspective, namely advancement in
science and technology and the development of modern strategies. In such an instance,
organizations need an effective decision-making tool that chooses the optimal alternative
while reducing time and complexity and remaining highly simple. This paper explores the usage
of a multi-criteria decision-making tool known as MOORA for selecting the best alternatives by
examining various case studies. The study was carried out in a two-fold manner: comparing
MOORA with other MCDM and MADM approaches to identify its advantages for selecting the
optimal alternative, followed by highlighting the scope and gaps of using the MOORA approach.
Examination of various case studies reveals huge scope for using MOORA in
numerous manufacturing and service applications.
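MOORA's ratio system normalizes each criterion column by the square root of its sum of squares and scores each alternative as the sum of normalized benefit criteria minus the sum of normalized cost criteria; a minimal sketch with an invented decision matrix (the case-study data and criteria are not the paper's):

```python
# Hedged sketch: MOORA ratio system on a toy decision matrix.
# Rows = alternatives, columns = criteria; `benefit` marks criteria to maximize.
import numpy as np

X = np.array([
    [60.0, 0.40, 2.5],   # alternative A1: cost, quality index, delivery time
    [55.0, 0.35, 3.0],   # A2
    [70.0, 0.50, 2.0],   # A3
])
benefit = np.array([False, True, False])  # only the quality index is a benefit criterion

# Vector normalization per criterion column.
norm = X / np.sqrt((X ** 2).sum(axis=0))

# Assessment value: sum of benefit criteria minus sum of cost criteria.
scores = norm[:, benefit].sum(axis=1) - norm[:, ~benefit].sum(axis=1)
ranking = np.argsort(-scores)
print("scores:", np.round(scores, 3), "best alternative index:", ranking[0])
```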
OCR-THE 3 LAYERED APPROACH FOR CLASSIFICATION AND IDENTIFICATION OF TELUGU HA... (csandit)
Optical character recognition (OCR) is the method of digitizing handwritten, typewritten or
printed text into machine-encoded form, and it appears in a plethora of applications in everyday
life: OCR has been used successfully in finance, legal, banking, health care and home appliances.
India is a multicultural country with rich literary and scripted traditions. Telugu, a southern
Indian language, is a syllabic language: each script symbol represents a complete syllable and is
formed with conjunct mixed consonants. Recognition of mixed conjunct consonants is more
critical than recognition of normal consonants because of the variation in written strokes and the
mixing of conjuncts with preceding and following consonants. This paper proposes a layered
approach to recognize characters, conjunct consonants and mixed conjunct consonants, and
presents an efficient classification of handwritten and printed conjunct consonants. The paper
implements an advanced fuzzy logic controller that takes written or printed text, collects the text
images from scanned files or a digital camera, processes the image by examining its intensity
against a quality ratio, extracts the characters depending on the quality, and then checks the
character orientation, alignment, thickness, base and print ratio. The input image characters are
classified in two ways: the first represents normal consonants and the second represents conjunct
consonants. The digitized image text is divided into three layers: the middle layer represents
normal consonants, and the top and bottom layers represent mixed conjunct consonants.
Recognition starts from the middle layer and then continues to check the top and bottom layers;
a character is treated as a conjunct consonant when symbolic characters are detected in the top
or bottom layer of the present base character, and otherwise it is treated as a normal consonant.
A post-processing technique is applied to all three layers of characters; post-processing
concentrates on the readability and compatibility of the image text, and if the readability check
fails, the process is repeated. The recognition process includes slant correction, thinning,
normalization, segmentation, feature extraction and classification. In developing the algorithm,
the pre-processing, segmentation, character recognition and post-processing modules are
discussed. The main objectives of this paper are to develop classification and identification
prototypes for written and printed consonants, conjunct consonants and symbols based on the
3-layered approach with different measurable areas using fuzzy logic, and to determine suitable
features for handwritten character recognition.
GAUSSIAN KERNEL BASED FUZZY C-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION (csandit)
Image processing is an important research area in computer vision. Clustering is an unsupervised
learning technique and can also be used for image segmentation. Many methods for image
segmentation exist; image segmentation plays an important role in image analysis and is one of
the first and most important tasks in image analysis and computer vision. The proposed system
presents a variation of the fuzzy c-means algorithm for image clustering. The kernel fuzzy
c-means clustering algorithm (KFCM) is derived from the fuzzy c-means clustering algorithm
(FCM); KFCM provides image clustering and improves accuracy significantly compared with
the classical fuzzy c-means algorithm. The new algorithm is called the Gaussian kernel based
fuzzy c-means clustering algorithm (GKFCM). The major characteristic of GKFCM is the use
of a fuzzy clustering approach that aims to guarantee noise insensitivity and image detail
preservation. The objective of the work is to cluster the low-intensity inhomogeneity areas of
noisy images using the clustering method and to segment that portion separately using a content
level set approach. The purpose of designing this system is to produce better segmentation
results for images corrupted by noise, so that it can be useful in various fields such as medical
image analysis, for example tumor detection, study of anatomical structure, and treatment planning.
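A minimal sketch of kernel fuzzy c-means with a Gaussian kernel, following the usual kernelized-metric update rules; the kernel width, fuzzifier and toy intensity data are illustrative assumptions, not the paper's exact GKFCM parameterization:

```python
# Hedged sketch: kernel fuzzy c-means with a Gaussian kernel on pixel intensities.
import numpy as np

def gaussian_kernel(x, v, sigma):
    return np.exp(-((x[:, None] - v[None, :]) ** 2) / (sigma ** 2))

def kernel_fcm(pixels, c=2, m=2.0, sigma=50.0, iters=30, eps=1e-9):
    x = pixels.astype(float).ravel()
    v = np.linspace(x.min(), x.max(), c)          # initial cluster centers
    for _ in range(iters):
        K = gaussian_kernel(x, v, sigma)          # shape (n_pixels, c)
        dist = np.maximum(1.0 - K, eps)           # kernel-induced distance term
        w = dist ** (-1.0 / (m - 1.0))
        u = w / w.sum(axis=1, keepdims=True)      # fuzzy memberships
        num = (u ** m * K * x[:, None]).sum(axis=0)
        den = (u ** m * K).sum(axis=0)
        v = num / np.maximum(den, eps)            # center update
    return u, v

# Toy "image": two intensity populations plus noise.
rng = np.random.default_rng(3)
img = np.concatenate([rng.normal(60, 10, 500), rng.normal(180, 10, 500)])
u, centers = kernel_fcm(img)
print("estimated centers:", np.round(centers, 1))
```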
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Editor IJCATR
Bayesian classifier works efficiently on some fields, and badly on some. The performance of Bayesian Classifier suffers in fields that involve correlated features. Feature selection is beneficial in reducing dimensionality, removing irrelevant data, incrementing learning accuracy, and improving result comprehensibility. But, the recent increase of dimensionality of data place a hard challenge to many existing feature selection methods with respect to efficiency and effectiveness. In this paper, Bayesian Classifier with Correlation Based Feature Selection is introduced which can key out relevant features as well as redundancy among relevant features without pair wise correlation analysis. The efficiency and effectiveness of our method is presented through broad.
A Trinity Construction for Web Extraction Using Efficient AlgorithmIOSR Journals
This document describes a proposed system for web extraction using a trinity construction and efficient algorithms. It begins with an abstract discussing how trinity characteristics can be used to automatically extract content from websites in a sequential tree structure. It then discusses the existing system which uses trinity tree and prefix/suffix sorting but has limitations. The proposed system introduces fuzzy logic for multi-perspective crawling across multiple websites. A genetic algorithm is used to load extracted content into the trinity structure and remove unwanted data. Finally, an ant colony optimization algorithm is used to obtain an effective structure and suggest optimized solutions.
The document introduces an API-based webometrics tool called WeboNaver that was created for the Korean search engine Naver. WeboNaver allows researchers to automatically collect large amounts of web data and distinguish different types of information. It provides an interface that allows users to select search queries, run them, and output the results. The tool provides a way to systematically analyze web presence and trends using data from the popular Korean search engine Naver.
1. The document proposes techniques to improve search performance by matching schemas between structured and unstructured data sources.
2. It involves constructing schema mappings using named entities and schema structures. It also uses strategies to narrow the search space to relevant documents.
3. The techniques were shown to improve search accuracy and reduce time/space complexity compared to existing methods.
Prior empirical and theoretical work has discussed the role of dominant search engine plays in the function of information gatekeeping on the Web, and there are reports on the high ranking of Wikipedia website among the search engine result pages (SERP). However, little research has been conducted on non-Google search engines and non-English versions of user-generated encyclopedias. This paper proposes a method to quantify the “display” gatekeeping differences of the SERP ranking and presents findings based on the Chinese SERP data. Based on 2,500 mainly-Chinese-language search queries, the data set includes the SERP outcome of four Chinese-speaking regions (mainland China, Singapore, Hong Kong and Taiwan) provided by three major search engines (Baidu, and Google and Yahoo), covering over 97% of the search engine market in each region. The findings, analysed and visualized using network analysis techniques, demonstrate the followings: major user-generated encyclopedias are among the most visible; localization factors matter (certain search engine variants produce the most divergent outcomes, especially mainland Chinese ones). The indicated strong effects of “network gatekeeping” by search engines also suggest similar dynamics inside user-generated encyclopedias.
This document proposes a new method to re-rank web documents retrieved by search engines based on their relevance to a user's query using ontology concepts. It involves building an ontology of concepts for a given domain (electronic commerce), extracting concepts from retrieved documents, and re-ranking documents based on the frequency of ontology concepts within them. An evaluation showed the approach reduced average ranking error compared to search engines alone. The method was tested on the first 30 documents retrieved for the query "e-commerce" from search engines.
DETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNINGijcsit
With the advent of the Internet and social media, while hundreds of people have benefitted from the vast sources of information available, there has been an enormous increase in the rise of cyber-crimes, particularly targeted towards women. According to a 2019 report in the [4] Economics Times, India has witnessed a 457% rise in cybercrime in the five year span between 2011 and 2016. Most speculate that this is due to impact of social media such as Facebook, Instagram and Twitter on our daily lives. While these definitely help in creating a sound social network, creation of user accounts in these sites usually needs just an email-id. A real life person can create multiple fake IDs and hence impostors can easily be made. Unlike the real world scenario where multiple rules and regulations are imposed to identify oneself in a unique manner (for example while issuing one’s passport or driver’s license), in the virtual world of social media, admission does not require any such checks. In this paper, we study the different accounts of Instagram, in particular and try to assess an account as fake or real using Machine Learning techniques namely Logistic Regression and Random Forest Algorithm.
Graph mining analyzes structured data like social networks and the web through graph search algorithms. It aims to find frequent subgraphs using Apriori-based or pattern growth approaches. Social networks exhibit characteristics like densification and heavy-tailed degree distributions. Link mining analyzes heterogeneous, multi-relational social network data through tasks like link prediction and group detection, facing challenges of logical vs statistical dependencies and collective classification. Multi-relational data mining searches for patterns across multiple database tables, including multi-relational clustering that utilizes information across relations.
Liao and petzold opensym berlin wikipedia geolinguistic normalizationHanteng Liao
This paper proposes a method of geo-linguistic normalization to advance the existing comparative analysis of open collaborative communities, with multilingual Wikipedia projects as the example. Such normalization requires data regarding the potential users and/or resources of a geolinguistic unit.
Advance Clustering Technique Based on Markov Chain for Predicting Next User M...idescitation
According to the survey India is one of the
leading countries in the word for technical education and
management education. Numbers of students are increasing
day by day by the growth rate of 45% per annum. Advancement
in technology puts special effect on education system. This
helps in upgrading higher education. Some universities and
colleges are using these technologies. Weblog is one of them.
Main aim of this paper is to represent web logs using clustering
technique for predicting next user movement and user
behavior analysis. This paper moves around the web log
clustering technique based on Markov chain results .In this
paper we present an ideal approach to web clustering
(clustering web site users) and predicting their behavior for
next visit. Methodology: For generating effective result approx
14 engineering college web usage data is used and an advance
clustering approach is presenting after optimizing the other
clustering approach.Results: The user behavior is predicted
with the help of the advance clustering approach based on the
FPCM and k-mean. Proposed algorithm is used to mined and
predict user’s preferred paths. To predict the user behavior
existing approaches have been used. But the existing
approaches are not enough because of its reaction towards
noise. Thus with the help of ACM, noise is reduced, provides
more accurate result for predicting the user behavior. Approach
Implementation:The algorithm was implemented in MAT
LAB, DTRG and in Java .The experiment result proves that
this method is very effective in predicting user behavior. The
experimental results have validated the method’s effectiveness
in comparison with some previous studies.
Investigating crimes using text mining and network analysisZhongLI28
For the PDF part, copyright belongs to the original author. If there is any infringement, please contact us immediately, we will deal with it promptly.
IDENTIFYING IMPORTANT FEATURES OF USERS TO IMPROVE PAGE RANKING ALGORITHMSIJwest
Web is a wide, various and dynamic environment in which different users publish their documents. Webmining is one of data mining applications in which web patterns are explored. Studies on web mining can be categorized into three classes: application mining, content mining and structure mining. Today, internet has found an increasing significance. Search engines are considered as an important tool to respond users’ interactions. Among algorithms which is used to find pages desired by users is page rank algorithm which ranks pages based on users’ interests. However, as being the most widely used algorithm by search engines including Google, this algorithm has proved its eligibility compared to similar algorithm, but considering growth speed of Internet and increase in using this technology, improving performance of this algorithm is considered as one of the web mining necessities. Current study emphasizes on Ant Colony algorithm and marks most visited links based on higher amount of pheromone. Results of the proposed algorithm indicate high accuracy of this method compared to previous methods. Ant Colony Algorithm as one of the swarm intelligence algorithms inspired by social behavior of ants can be effective in modeling social behavior of web users. In addition, application mining and structure mining techniques can be used simultaneously to improve page ranking performance.
A Cross Layer Based Scalable Channel Slot Re-Utilization Technique for Wirele...csandit
Due to tremendous growth of the wireless based application services are increasing the demand
for wireless communication techniques that use bandwidth more effectively. Channel slot reutilization
in multi-radio wireless mesh networks is a very challenging problem. WMNs have
been adopted as back haul to connect various networks such as Wi-Fi (802.11), WI-MAX
(802.16e) etc. to the internet. The slot re-utilization technique proposed so far suffer due to high
collision due to improper channel slot usage approximation error. To overcome this here the
author propose the cross layer optimization technique by designing a device classification
based channel slot re-utilization routing strategy which considers the channel slot and node
information from various layers and use some of these parameters to approximate the risk
involve in channel slot re-utilization in order to improve the QoS of the network. The simulation
and analytical results show the effectiveness of our proposed approach in term of channel slot
re-utilization efficiency and thus helps in reducing latency for data transmission and reduce
channel slot collision.
Geometric Correction for Braille Document Images csandit
Image processing is an important research area in computer vision. clustering is an unsupervised
study. clustering can also be used for image segmentation. there exist so many methods for image
segmentation. image segmentation plays an important role in image analysis.it is one of the first
and the most important tasks in image analysis and computer vision. this proposed system
presents a variation of fuzzy c-means algorithm that provides image clustering. the kernel fuzzy
c-means clustering algorithm (kfcm) is derived from the fuzzy c-means clustering
algorithm(fcm).the kfcm algorithm that provides image clustering and improves accuracy
significantly compared with classical fuzzy c-means algorithm. the new algorithm is called
gaussian kernel based fuzzy c-means clustering algorithm (gkfcm)the major characteristic of
gkfcm is the use of a fuzzy clustering approach ,aiming to guarantee noise insensitiveness and
image detail preservation.. the objective of the work is to cluster the low intensity in homogeneity
area from the noisy images, using the clustering method, segmenting that portion separately using
content level set approach. the purpose of designing this system is to produce better segmentation
results for images corrupted by noise, so that it can be useful in various fields like medical image
analysis, such as tumor detection, study of anatomical structure, and treatment planning.
COQUEL: A CONCEPTUAL QUERY LANGUAGE BASED ON THE ENTITYRELATIONSHIP MODELcsandit
As more and more collections of data are available on the Internet, end users but not experts in
Computer Science demand easy solutions for retrieving data from these collections. A good
solution for these users is the conceptual query languages, which facilitate the composition of
queries by means of a graphical interface. In this paper, we present (1) CoQueL, a conceptual
query language specified on E/R models and (2) a translation architecture for translating
CoQueL queries into languages such as XQuery or SQL..
FILESHADER: ENTRUSTED DATA INTEGRATION USING HASH SERVER csandit
This document summarizes a research paper that proposes FileShader, a system using hash values to ensure file integrity during transfers between clients and servers. FileShader works by having the file provider calculate and send the hash value of a file to a trusted hash server. When clients download a file, FileShader calculates the hash and compares it to the value stored on the hash server to detect any changes. The researchers implemented a prototype of FileShader and found it could accurately detect file changes with little performance overhead. They conclude FileShader is a practical solution that can increase security for internet users by verifying file integrity during transfers.
Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...csandit
Opinion mining also known as sentiment analysis, involves customer satisfactory patterns,
sentiments and attitudes toward entities, products, services and their attributes. With the rapid
development in the field of Internet, potential customer’s provides a satisfactory level of
product/service reviews. The high volume of customer reviews were developed for
product/review through taxonomy-aware processing but, it was difficult to identify the best
reviews. In this paper, an Associative Regression Decision Rule Mining (ARDRM) technique is
developed to predict the pattern for service provider and to improve customer satisfaction based
on the review comments. Associative Regression based Decision Rule Mining performs twosteps
for improving the customer satisfactory level. Initially, the Machine Learning Bayes
Sentiment Classifier (MLBSC) is used to classify the class labels for each service reviews. After
that, Regressive factor of the opinion words and Class labels were checked for Association
between the words by using various probabilistic rules. Based on the probabilistic rules, the
opinion and sentiments effect on customer reviews, are analyzed to arrive at specific set of
service preferred by the customers with their review comments. The Associative Regressive
Decision Rule helps the service provider to take decision on improving the customer satisfactory
level. The experimental results reveal that the Associative Regression Decision Rule Mining
(ARDRM) technique improved the performance in terms of true positive rate, Associative
Regression factor, Regressive Decision Rule Generation time and Review Detection Accuracy of
similar pattern.
Segmentation and Labelling of Human Spine MR Images Using Fuzzy Clustering csandit
Computerized medical image segmentation is a challenging area because of poor resolution
and weak contrast. The predominantly used conventional clustering techniques and the
thresholding methods suffer from limitations owing to their heavy dependence on user
interactions. Uncertainties prevalent in an image cannot be captured by these techniques. The
performance further deteriorates when the images are corrupted by noise, outliers and other
artifacts. The objective of this paper is to develop an effective robust fuzzy C- means clustering
for segmenting vertebral body from magnetic resonance images. The motivation for this work is
that spine appearance, shape and geometry measurements are necessary for abnormality
detection and thus proper localisation and labelling will enhance the diagnostic output of a
physician. The method is compared with Otsu thresholding and K-means clustering to illustrate
the robustness. The reference standard for validation was the annotated images from the
radiologist, and the Dice coefficient and Hausdorff distance measures were used to evaluate the
segmentation.
Selection of Best Alternative in Manufacturing and Service Sector Using Multi...csandit
Modern manufacturing organizations tend to face versatile challenges due to globalization,
modern lifestyle trends and rapid market requirements from both locally and globally placed
competitors. The organizations faces high stress from dual perspective namely enhancement in
science and technology and development of modern strategies. In such an instance,
organizations were in a need of using an effective decision making tool that chooses out optimal
alternative that reduces time, complexity and highly simplified. This paper explores a usage of
new multi criteria decision making tool known as MOORA for selecting the best alternatives by
examining various case study. The study was covered up in two fold manner by comparing
MOORA with other MCDM and MADM approaches to identify its advantage for selecting
optimal alternative, followed by highlighting the scope and gap of using MOORA approach.
Examination on various case study reveals an existence of huge scope in using MOORA for
numerous manufacturing and service applications.
OCR-THE 3 LAYERED APPROACH FOR CLASSIFICATION AND IDENTIFICATION OF TELUGU HA...csandit
Optical Character Recognition (OCR) is the method of digitising handwritten, typewritten or printed text into machine-encoded form and has a wide range of applications in everyday life. OCR is used successfully in finance, legal, banking, health care and home appliances. India is a multi-cultural country with a rich literary and scripted tradition. Telugu is a southern Indian language; it is a syllabic language in which each script symbol represents a complete syllable and may be formed with conjunct (mixed) consonants. Recognition of mixed conjunct consonants is more critical than that of normal consonants because of variation in written strokes and conjunct mixing with pre- and post-level consonants. This paper proposes a layered approach to recognise characters, conjunct consonants and mixed conjunct consonants, and presents efficient classification of handwritten and printed conjunct consonants. The paper implements an advanced fuzzy logic controller that takes written or printed text collected as images from scanned files or a digital camera, processes the image by examining image intensity based on a quality ratio, extracts the image characters depending on quality, and then checks the character orientation, alignment, thickness, and base and print ratio. The input image characters are classified in two ways: the first represents normal consonants and the second represents conjunct consonants. The digitised image text is divided into three layers: the middle layer represents normal consonants and the top and bottom layers represent mixed conjunct consonants. The recognition process starts from the middle layer and then continues to check the top and bottom layers. A character is treated as a conjunct consonant when any symbolic characters are detected in the top and bottom layers of the current base character; otherwise it is treated as a normal consonant. A post-processing technique is applied to all three layers of characters; post-processing concentrates on text readability and compatibility, and if readability is not achieved, the process is repeated. The recognition process includes slant correction, thinning, normalisation, segmentation, feature extraction and classification. In developing the algorithm, the pre-processing, segmentation, character recognition and post-processing modules are discussed. The main objectives of this paper are to develop the classification and identification of different prototypes for written and printed consonants, conjunct consonants and symbols based on a three-layered approach with different measurable areas using fuzzy logic, and to determine suitable features for handwritten character recognition.
GAUSSIAN KERNEL BASED FUZZY C-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATIONcsandit
Image processing is an important research area in computer vision. Clustering is an unsupervised learning method and can also be used for image segmentation. Many methods exist for image segmentation, which plays an important role in image analysis; it is one of the first and most important tasks in image analysis and computer vision. The proposed system presents a variation of the fuzzy C-means algorithm that provides image clustering. The kernel fuzzy C-means clustering algorithm (KFCM) is derived from the fuzzy C-means clustering algorithm (FCM). The KFCM algorithm provides image clustering and improves accuracy significantly compared with the classical fuzzy C-means algorithm. The new algorithm is called the Gaussian kernel based fuzzy C-means clustering algorithm (GKFCM). The major characteristic of GKFCM is the use of a fuzzy clustering approach aiming to guarantee noise insensitiveness and image detail preservation. The objective of the work is to cluster the low-intensity inhomogeneity areas from noisy images using the clustering method and to segment that portion separately using a content level set approach. The purpose of designing this system is to produce better segmentation results for images corrupted by noise, so that it can be useful in various fields such as medical image analysis, e.g., tumour detection, study of anatomical structure, and treatment planning.
Mining Fuzzy Association Rules from Web Usage Quantitative Data csandit
Web usage mining is the method of extracting interesting patterns from Web usage log files. Web usage mining, a subfield of data mining, uses various data mining techniques to produce association rules. Data mining techniques are typically used to generate association rules from transaction data. Most of the time these transactions are Boolean, whereas Web usage data consists of quantitative values. To handle such real-world quantitative data, we use a fuzzy data mining algorithm for the extraction of association rules from a quantitative Web log file. To generate fuzzy association rules, we first design a membership function, which is used to transform quantitative values into fuzzy terms. Experiments are carried out with different support and confidence values. Experimental results show the performance of the algorithm with varied support and confidence.
GEOMETRIC CORRECTION FOR BRAILLE DOCUMENT IMAGEScsandit
The Braille system has been used by visually impaired people for reading. The shortage of Braille books has caused a need for the conversion of Braille to text. This paper addresses the geometric correction of Braille document images. Due to the standard measurements of Braille cells, identification of Braille characters can be achieved by a simple cell-overlapping procedure. However, the standard measurements vary in a scaled document, and fitting the cells becomes difficult if the document is tilted. This paper proposes a line fitting algorithm for identifying the tilt (skew) angle. The horizontal and vertical scale factors are identified based on the ratio of the distance between characters to the distance between dots. These are used in a geometric transformation matrix for correction. Rotation correction is done prior to scale correction, which aids in increased accuracy. The results for various Braille documents are tabulated.
A Routing Protocol Orphan-Leach to Join Orphan Nodes in Wireless Sensor Netwo...csandit
The hierarchical routing protocol LEACH (Low Energy Adaptive Clustering Hierarchy) is referred to as the basic algorithm of distributed clustering protocols. LEACH allows the formation of clusters; each cluster has a leader called the Cluster Head (CH). The selection of CHs is made with a probabilistic calculation, and it is assumed that each non-CH node joins a cluster and becomes a cluster member. Nevertheless, some CHs can be concentrated in a specific part of the network, so several sensor nodes cannot reach any CH. As a result, the remaining part of the controlled field is not covered; some sensor nodes are left outside the network. To solve this problem, we propose O-LEACH (Orphan Low Energy Adaptive Clustering Hierarchy), a routing protocol that takes orphan nodes into account. Indeed, a cluster member can play the role of a gateway, which allows orphan nodes to join. If a gateway node has to connect a large number of orphan nodes, a sub-cluster is created and the gateway node is considered a CH' for the connected orphans. As a result, orphan nodes become able to send their data messages to the gateway, which in turn performs data aggregation and sends the aggregated data message to the CH. The WSN application receives data from the entire network, including orphan nodes.
The simulation results show that O-LEACH performs better than LEACH in terms of connectivity rate, energy, scalability and coverage.
REVIEW PAPER ON NEW TECHNOLOGY BASED NANOSCALE TRANSISTORmsejjournal
Owing to the fact that MOSFETs can be effortlessly integrated into ICs, they have become the heart of the growing semiconductor industry. The need for low power dissipation, high operating speed and small size requires scaling down these devices, fully in line with Moore's Law. But scaling down comes with its own drawbacks, which can be summarised as the Short Channel Effect (SCE): the working of the device deteriorates owing to SCE. In this paper, the problems of device downsizing, as well as how the use of SED-based devices proves to be a better solution to device downsizing, are presented. A study of short channel effects and the issues associated with a nanoMOSFET is provided. The properties of several quantum dot materials are studied, and the best material is chosen based on the observation of a clear Coulomb blockade. Specifically, a study of a graphene single electron transistor is reviewed. A theoretical explanation of a model designed to tune the movement of electrons with the help of a quantum wire is also presented.
MODIFICATION OF DOPANT CONCENTRATION PROFILE IN A FIELD-EFFECT HETEROTRANSIST...msejjournal
This document describes an approach to modify the energy band diagram and decrease the dimensions of field-effect heterotransistors. The approach involves manufacturing a heterostructure with a substrate and epitaxial layer with four doped sections - two channel sections separated by source and drain sections. Additional doping of the channel sections allows for modification of the energy band diagram. Analytical models are developed to optimize the dopant concentration profiles through solving diffusion equations considering temperature-dependent diffusion coefficients. This approach could enable more compact transistor designs with tunable energy band structures.
A HYBRID METHOD FOR AUTOMATIC COUNTING OF MICROORGANISMS IN MICROSCOPIC IMAGESacijjournal
Microscopic image analysis is an essential process for enabling the automatic enumeration and quantitative analysis of microbial images. Several systems are available for enumerating microbial growth, but some existing methods may be inefficient at accurately counting overlapped microorganisms. Therefore, in this paper we propose an efficient method for the automatic segmentation and counting of microorganisms in microscopic images. The method uses a hybrid approach based on morphological operations, an active contour model and counting by a region labelling process. The colony count obtained by the proposed method is compared with the manual count and with the count obtained from an existing method.
Malicious-URL Detection using Logistic Regression TechniqueDr. Amarjeet Singh
Over the last few years, the Web has seen massive growth in the number and kinds of web services. Web facilities such as online banking, gaming, and social networking have evolved rapidly, as has people's reliance upon them to perform daily tasks. As a result, a large amount of information is uploaded to the Web daily. As these web services create new opportunities for people to interact, they also create new opportunities for criminals. URLs are launch pads for web attacks: a malicious user can steal the identity of a legitimate person by sending a malicious URL. Malicious URLs are a keystone of illegitimate Internet activities, and the dangers of these sites have created a demand for defences that protect end-users from visiting them. The proposed approach classifies URLs automatically using a machine learning algorithm called logistic regression, which is used for binary classification. The classifier achieves 97% accuracy by learning phishing URLs.
A COMPARATIVE ANALYSIS OF DIFFERENT FEATURE SET ON THE PERFORMANCE OF DIFFERE...ijaia
Reducing the risk posed by phishers and other cybercriminals in cyberspace requires a robust and automatic means of detecting phishing websites, since the culprits constantly come up with new techniques to achieve their goals, almost on a daily basis. Phishers are continually evolving the methods they use to lure users into revealing their sensitive information. Many methods have been proposed in the past for phishing detection, but the quest for a better solution is still on. This research covers the development of phishing website models based on different algorithms with different sets of features, in order to investigate the most significant features in the dataset.
A Hybrid Approach For Phishing Website Detection Using Machine Learning.vivatechijri
In this technical age there are many ways an attacker can illegitimately get access to people's sensitive information. One of these is phishing: the activity of misleading people into giving their sensitive information on fraudulent websites that look like the real website. The phishers' aim is to steal personal information, bank details, etc. Day by day it is getting riskier to enter personal information on websites, for fear that a phishing attack could steal it. That is why phishing website detection is necessary, to alert the user and block the website. Automated detection of phishing attacks is necessary, and machine learning is one of the efficient techniques for detecting phishing attacks as it removes the drawbacks of existing approaches. An efficient machine learning model with a content-based approach proves very effective at detecting phishing websites.
Our proposed system uses a hybrid approach which combines a machine-learning-based method and a content-based method. URL-based features are extracted and passed to the machine learning model, and in the content-based approach, the TF-IDF algorithm detects a phishing website by using the top keywords of a web page. This hybrid approach is used to achieve a highly efficient result. Finally, our system notifies and alerts the user whether the website is phishing or legitimate.
MAPREDUCE IMPLEMENTATION FOR MALICIOUS WEBSITES CLASSIFICATIONIJNSA Journal
Due to the rapid growth of the internet, malicious websites [1] have become the cornerstone of internet crime activities. There are many existing approaches to detecting benign and malicious websites, some of them giving near 99% accuracy. Detection of malicious websites thus now seems reasonable in terms of accuracy, but in terms of processing speed it is still considered an enormous and costly task because of their qualities and complexities. In this project, we implement a classifier that detects benign and malicious websites using the network and application features available in a data set from Kaggle, and we do so using MapReduce to make classification faster than traditional approaches [2].
MAPREDUCE IMPLEMENTATION FOR MALICIOUS WEBSITES CLASSIFICATIONIJNSA Journal
This document summarizes research on using MapReduce to classify websites as malicious or benign more efficiently than traditional approaches. The researchers used a dataset from Kaggle containing network and application features of websites to train and evaluate machine learning models. They preprocessed the data to handle missing values and encode categorical features. Models tested included neural networks, random forests, and decision trees. Random forests achieved 100% accuracy when trained on 30% of the data and were much faster than other models. Processing time was improved when using Apache Spark on two nodes compared to a single node or traditional programming, and further speedups could be gained with more nodes.
An In-Depth Benchmarking And Evaluation Of Phishing Detection Research For Se...Anita Miller
This document introduces a benchmarking framework called PhishBench for evaluating phishing detection research. PhishBench enables systematic comparison of existing phishing detection features and techniques under identical experimental conditions. The framework incorporates over 200 phishing features and machine learning algorithms. Using PhishBench, the document conducts an in-depth benchmarking study to compare prior research, analyze feature importance, and test robustness on new datasets. Key findings include that imbalanced datasets affect performance and retraining alone is not enough to defeat evolving attacks.
IRJET - An Automated System for Detection of Social Engineering Phishing Atta...IRJET Journal
1) The document presents a machine learning approach to detect phishing URLs using logistic regression. It trains a logistic regression model on a dataset of 420,467 URLs that have been classified as either phishing or legitimate.
2) It preprocesses the URLs using tokenization before training the logistic regression model. The trained model is able to classify new URLs with 96% accuracy as either phishing or legitimate based on the URL features.
3) The proposed approach provides an automated way to detect phishing URLs in real-time and help prevent phishing attacks. Future work could involve developing a browser extension using this approach and increasing the dataset size for higher accuracy.
A Deep Learning Technique for Web Phishing Detection Combined URL Features an...IJCNCJournal
The most popular way to deceive online users nowadays is phishing. Consequently, to increase cybersecurity, more efficient web page phishing detection mechanisms are needed. In this paper, we propose an approach that relies on website images and URLs to treat the issue of phishing website recognition as a classification challenge. Our model uses webpage URLs and images to detect a phishing attack, employing convolutional neural networks (CNNs) to extract the most important features of website images and URLs and then classify them into benign and phishing pages. The accuracy achieved in the experiments was 99.67%, proving the effectiveness of the proposed model in detecting web phishing attacks.
This document outlines a presentation on detecting phishing websites using machine learning. The goals of the project are to develop a machine learning model to identify phishing URLs and integrate it into a web application. It will discuss collecting and preprocessing data, selecting machine learning algorithms, developing the web app, and addressing challenges with existing phishing detection techniques. The project aims to help reduce online theft and educate users on phishing risks and countermeasures.
Phishing Websites Detection Using Back Propagation Algorithm: A Reviewtheijes
Phishing is an illicit modus operandi employing both social engineering and technological subterfuge to steal clients' private identity data and financial account credentials. The influence of phishing is quite radical, as it involves the menace of identity theft and financial losses. This paper elucidates the back propagation paradigm for training a neural network for phishing prediction. We perform a root-cause analysis of phishing and of the incentives for phishing. This analysis is intended to show developers the effectiveness of neural networks in data mining and provides grounds for using neural networks in phishing detection.
Detecting Phishing Websites Using Machine LearningIRJET Journal
1. The document proposes a system to detect phishing websites in real-time using machine learning. It trains a random forest classifier on a dataset of URLs and associated features to classify websites as legitimate or phishing.
2. The system collects URL and page content features from a user's browser and sends them to a cloud-based random forest model. The model was trained on over 40,000 URLs and achieved 97.36% accuracy at detecting phishing sites.
3. The proposed system provides advantages like real-time detection, using a large dataset for training, detecting new phishing sites, and independence from third-party services. It aims to protect users by identifying phishing sites before sensitive information is entered.
IRJET- Detecting Phishing Websites using Machine LearningIRJET Journal
This document describes a research project that aims to implement machine learning techniques to detect phishing websites. The researchers plan to test algorithms like logistic regression, SVM, decision trees and neural networks on a dataset of phishing links. They will evaluate the performance of these algorithms and develop a browser plugin using the best model. This plugin will detect malicious URLs and protect users from phishing attacks. The document provides background on phishing and outlines the proposed approach, dataset, algorithms to be tested, planned Chrome extension implementation, and expected results sections of the project.
A Comparative Analysis of Different Feature Set on the Performance of Differe...gerogepatton
This document summarizes a research paper that analyzed the performance of different machine learning algorithms (Random Forest, Probabilistic Neural Network, XGBoost) using different feature sets for detecting phishing websites. The researchers conducted experiments on a benchmark phishing website dataset using 6 categories of features: address bar-based, abnormal-based, HTML/JavaScript-based, domain-based, feature selection, and full dataset. The results showed that XGBoost achieved the best performance (>97% accuracy) when using the full dataset of features, outperforming the other algorithms and feature sets. Therefore, the combination of all features provided the most important information for accurately detecting phishing websites.
USING BLACK-LIST AND WHITE-LIST TECHNIQUE TO DETECT MALICIOUS URLSAM Publications,India
Malicious URLs are harmful to every kind of computer user, so detecting malicious URLs is very important. Current techniques for detecting malicious web pages include black-list and white-list methodologies and machine learning classification algorithms. However, black-list and white-list technology is useless if a particular URL is not in the list. In this paper, we propose a multi-layer model for detecting malicious URLs. A layer's filter can directly classify a URL when its trained threshold is reached; otherwise, the filter passes the URL to the next layer. We also use an example to verify that the model can improve the accuracy of URL detection.
GOOGLE SEARCH ALGORITHM UPDATES AGAINST WEB SPAMieijjournal
Google has released many algorithm updates to combat web spam over the years:
- Updates like Panda and Penguin specifically target low-quality sites, thin content sites, and sites engaging in unethical SEO practices like link spamming.
- Other updates like Caffeine and Social Signals aim to incorporate more authoritative signals like social media mentions and site freshness to improve results.
- Google's goal is to balance economic incentives for spammers by keeping costs high for manipulation, while continually adapting their algorithms to mitigate new spam tactics.
This document summarizes a research paper that proposes a machine learning approach for detecting phishing websites. It discusses using heuristic features from CANTINA to train machine learning models. A new domain top-page similarity feature is introduced to improve accuracy. Various modules are described, including site training, site capturing, a phishing dictionary, and image correlation to measure similarity. Experimental results show the approach achieves up to 92.5% f-measure and a 7.5% error rate for phishing detection.
This document discusses the development of a machine learning model to accurately detect phishing websites in real-time. It begins with an introduction to the problem of phishing and the need for reliable phishing detection systems. It then discusses using supervised machine learning, specifically logistic regression, to classify websites as legitimate or phishing based on discriminatory features. The goal is to determine an optimal feature combination to train a classifier with high performance at detecting phishing sites. Previous literature on phishing detection is reviewed, including techniques using fuzzy rough set theory and a three-tiered approach using web crawler traffic data, content analysis, and URL analysis.
This document discusses web spam detection using machine learning techniques. Specifically, it proposes an improved Naive Bayes classifier that incorporates user feedback and domain-specific features to better detect spam pages. The key points are:
1) Web spam has become a serious problem as internet usage has increased, threatening search engines and users. Spam pages aim to deceive search engines' ranking algorithms.
2) Existing spam detection techniques like content analysis are still lacking and Naive Bayes classifiers are commonly used but have limitations like treating all terms equally.
3) The paper proposes an improved Naive Bayes classifier that assigns different weights to terms based on user feedback and considers domain-specific features to reduce false positives and negatives and improve accuracy
This document discusses web spam detection using machine learning techniques. Specifically, it proposes an improved Naive Bayes classifier that incorporates user feedback and domain-specific features to better detect spam pages. The key points are:
1) Web spam has become a serious problem as internet usage has increased, threatening users and wasting search engine resources. Current techniques like content analysis are still lacking.
2) An improved Naive Bayes classifier is proposed that assigns weights to user feedback and considers domain-specific features to reduce false positives and negatives compared to the traditional classifier.
3) Results show the improved classifier outperforms the traditional one in terms of accuracy for web spam detection.
The ranking systems of SEs involve various content-based and graph-based measures. Spammers exploit these parameters to artificially inflate the ranking of web documents. Spam techniques range from stuffing a page with a large number of authority references or popular query keywords, thereby causing the page to rank higher for those queries, to setting up a network of pages that mutually reinforce their page value in order to increase the score of some target pages or of the overall group.
Recently [3; 4], all major SEs, such as Google and Yahoo, have identified web spam as a tangible issue in the IR process. It not only deteriorates search quality but also wastes the computational and storage resources of the SE provider. Spam caused an estimated financial loss of $50 billion in 2005 [5]; for 2009, the estimate rose to $130 billion [6]. Further, it weakens users' trust and may deprive legitimate websites of visibility and revenue. Therefore, identifying and combating spam is a top priority for SE providers.
According to the web spam taxonomy presented in the work of Spirin and Han [7], web spam is broadly classified into four categories, namely content spam [8], link spam [9; 10], cloaking and redirection [11; 12], and click spam [13]. This research work primarily focuses on the detection of content spam, which is the most common and frequently occurring form of spam [14].
IR systems examine the content of the pages in the corpus to retrieve the documents most relevant to a specific search query. "Term Frequency-Inverse Document Frequency" (TF-IDF) or a similar approach is used to assess the "most similar" (relevant) documents for the search query. In TF-IDF, "the relevance of the search terms to the documents in the corpus is proportional to the number of times the term appears in the document and inversely proportional to the number of documents containing the term." Spammers exploit Term Frequency (TF) scoring by overstuffing the content fields (title, body, anchor text, URL, etc.) of a page with a number of popular search terms so as to boost its relevancy score for any search query. The relevancy score can be measured as:

\mathrm{TFIDF}(q, p) = \sum_{t \in q \cap p} \mathrm{TF}(t, p) \cdot \mathrm{IDF}(t)

where q refers to the query, p denotes a web page in the corpus, and t denotes a term.
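As a concrete illustration, the following minimal sketch shows how stuffing a page with a popular term inflates its TF-based score. The toy two-page corpus, the whitespace tokenisation and the exact weighting are assumptions for illustration only, not the formulation used by any particular search engine.

```python
import math
from collections import Counter

def tfidf_score(query, page, corpus):
    """Relevance of `page` to `query`: sum of TF * IDF over terms shared by query and page."""
    tokens = page.lower().split()
    tf = Counter(tokens)
    n_docs = len(corpus)
    score = 0.0
    for term in set(query.lower().split()):
        if term not in tf:
            continue                      # term must occur in both query and page
        df = sum(1 for doc in corpus if term in doc.lower().split())
        idf = math.log(n_docs / max(df, 1))
        score += (tf[term] / len(tokens)) * idf
    return score

corpus = [
    "cheap flights cheap hotels cheap cheap cheap deals",   # keyword-stuffed page
    "compare flights and hotel prices for your next trip",  # ordinary page
]
for page in corpus:
    print(round(tfidf_score("cheap flights", page, corpus), 4), page)
```

The stuffed page scores higher simply because the repeated term dominates its term-frequency component, which is exactly the behaviour spammers exploit.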
Machine learning is a field of study that deals with the automated learning of patterns within data belonging to different classes or groups, with the aim of differentiating between those classes or groups. An effective machine learning algorithm is expected to make accurate predictions about the categorization of unseen data based on the learnt patterns. Specifically, supervised machine learning involves predicting the class of an unseen (new) data sample based on a decision model learnt from the existing (training) data. Therefore, knowledge of machine learning may be appropriately utilized for web spam detection.
Several machine learning methods to combat content spam have been introduced in past research on adversarial IR and web spam. Egele et al. [15] examined the importance of different text and link metrics in web ranking systems and utilized the C4.5 supervised learning algorithm to remove spam links. They deployed their own mechanism to generate the data used in their experiment.
Ntoulas et al. [16] presented an approach for detecting spam based on content analysis. They extracted several content features and provided a comprehensive study of the influence of these features in the web spam domain.
Prieto et al. [17] suggested a number of heuristics to identify all possible kinds of web spam and developed a system called SAAD (Spam Analyzer and Detector) for efficient web spam detection. A beneficial trait is that the system was tested on different data sets and proved effective for more accurate classification of spam pages.
Araujo and Romo [18] proposed an interesting approach to spam detection by comparing the language models (LMs) [19] of anchor texts and of the pages linked from those anchor texts. KL-divergence [20] was used to measure the discrepancy between the two LMs.
However, spammers are continuously adapting themselves to circumvent these barriers. This research work presents a content-based spam detection approach using supervised learning. The aim of this study is to draw a clear understanding of the underlying process of web spamming by examining already-extracted features. Moreover, the work aims at selecting salient features from the existing ones to stimulate further studies in the domain of "adversarial information retrieval [21]". A filter-based feature selection technique is employed to uncover important patterns in order to classify websites as spam or ham (non-spam). The proposed method is observed to be efficient in terms of both computational complexity and classification accuracy.
The rest of the paper is organized as follows. A general methodology for applying feature selection to adversarial classification is presented in Section 2. Experimental results are shown in Section 3, and Section 4 concludes the paper.
2. METHODOLOGY
2.1 Experimental Data Set
The widely known web spam data set WEBSPAM-UK2007 [22] is used to carry out the experiments. This data set was released by Yahoo specifically for the Search Engine Spam project and was also used for the "Web Spam Challenge 2008". It is the biggest known web spam data set, with more than 100 M web pages from 114,529 hosts. However, only 6,479 hosts were manually labelled as spam, non-spam or undecided. Among these, approximately 6% of the hosts are spam, i.e., the data is imbalanced in nature. The data set consists of separate training and testing sets, but we combined the two to evaluate our model since the percentage of spam hosts was small in both. Further, we discard the hosts labelled "undecided" and conduct our experiment on the remaining group of 5,797 hosts. The training set was released with a pre-extracted content feature set, which is examined in this study to select salient and optimal features.
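A data preparation step along these lines could be sketched as follows. The file names and column layout are assumptions for illustration and do not reflect the official distribution format of WEBSPAM-UK2007.

```python
import pandas as pd

def load_hosts(train_labels_path, test_labels_path):
    """Combine the two label files and drop 'undecided' hosts (illustrative sketch)."""
    def read_labels(path):
        df = pd.read_csv(path, sep=r"\s+", header=None)
        df = df.iloc[:, :2]                     # keep host id and label columns only
        df.columns = ["hostid", "label"]
        return df

    hosts = pd.concat([read_labels(train_labels_path), read_labels(test_labels_path)],
                      ignore_index=True)
    hosts = hosts[hosts["label"] != "undecided"]          # discard undecided hosts
    hosts["y"] = (hosts["label"] == "spam").astype(int)   # 1 = spam, 0 = non-spam
    return hosts

# hosts = load_hosts("set1-labels.txt", "set2-labels.txt")   # hypothetical file names
# print(len(hosts), hosts["y"].mean())   # expect roughly 5,797 hosts, about 6% spam
```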
Existing content-based heuristics for detecting web spam
The content feature set proposed by Ntoulas et al. [16] comprises 98 features based on the following heuristics (a brief illustrative sketch of a few of these measures is given after the list):
• Number of words in the page: "Term stuffing" is a common spamming technique used to increase the visibility of a web document for typical queries or search terms; sometimes the proportion of common terms in a page is very high. Therefore, the authors suggest counting the number of words per page. A very large value of this heuristic indicates traces of spam in the page.
• Number of words in the title: A page title is often stuffed with unrelated keywords and terms because of its high weightage in search engine text metrics. As a spam detection measure, the authors propose counting the number of words in the title.
• Average word length of the document: In order to combat composite-keyword spamming, the authors propose measuring the average word length.
• Fraction of anchor text in the page: Because "anchors" are used to describe the association between two linked pages, spammers misuse them to create false associations.
• Ratio of visible text: The authors propose this heuristic to detect "hidden content" in a page.
• Compression ratio: The proposed feature helps in determining the level of redundancy in the page.
• Independent and dependent likelihood (LH): These techniques utilise independent and dependent n-gram probabilities to detect spam. More precisely, the content of each page is divided into n-grams of n consecutive words, and the probability of the document is calculated from the individual n-gram probabilities. However, these features are computationally expensive.
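The following is a minimal sketch of how a few of the measures listed above might be computed. The tokenisation and HTML handling are simplified assumptions rather than the exact feature extraction used in the original studies.

```python
import zlib
from html.parser import HTMLParser

# Rough illustration of a few content heuristics (word counts, anchor fraction,
# visible-text ratio, compression ratio). Real extraction pipelines parse full
# HTML, handle scripts and styles, etc.; this is a simplified sketch.
class _TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.visible, self.title, self.anchor = [], [], []
        self._in_title = self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        words = data.split()
        self.visible += words
        if self._in_title:
            self.title += words
        if self._in_anchor:
            self.anchor += words

def content_features(html):
    parser = _TextExtractor()
    parser.feed(html)
    words = parser.visible
    raw = html.encode("utf-8")
    return {
        "num_words": len(words),
        "num_title_words": len(parser.title),
        "avg_word_length": sum(map(len, words)) / max(len(words), 1),
        "anchor_text_fraction": len(parser.anchor) / max(len(words), 1),
        "visible_text_ratio": len(" ".join(words)) / max(len(html), 1),
        "compression_ratio": len(raw) / max(len(zlib.compress(raw)), 1),
    }

print(content_features(
    "<html><title>buy cheap</title><body>cheap cheap "
    "<a href='#'>cheap deals</a></body></html>"))
```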
The performance of these features, and its comparison with the performance obtained after applying feature selection techniques, is presented in Section 3.
2.2 Proposed Methodology
This research work focuses on the contribution of feature selection in adversarial IR applications. The common issues in the spam domain are small sample sizes, an unbalanced data set, and large input dimensionality due to the several pages contained in a single web document. To deal with such problems, a variety of feature selection methods have been designed by researchers in machine learning and pattern recognition. This work employs univariate filter feature selection to improve the prediction performance of the decision model. The existing heuristics on which feature selection is performed were discussed in the previous subsection. The rationale for applying feature selection to the existing features is two-fold: these features were utilized in many previous studies [16; 17; 18; 23] for effective spam detection, and the heuristics are recognized as a baseline for further studies in the underlying domain.
For these reasons, it can be expected that feature selection techniques will be of practical use for spam document classification in later research on information retrieval and web spam.
2.2.1 Methods for Web Spam Detection
This section describes the algorithms and methods used for evaluation of this work.
Classification Algorithm
For appropriate prediction, we tried various classification methods, namely k-nearest neighbour, linear discriminant analysis, and the support vector machine (SVM). In our experiments, SVM was observed to achieve better results for binary classification than the other classifiers. Therefore, in this research work SVM is utilised to learn the decision model.
The SVM [24] classifies objects by mapping them to a higher-dimensional space. In this space, the SVM is trained to find the optimal hyperplane, i.e., the one with the maximum margin from the support vectors (nearest patterns).
Consider training vectors x_i, i = 1, \ldots, N; we define a label vector y such that y_i \in \{+1, -1\}. The decision function can be defined as:

f(x) = \mathrm{sign}(w \cdot x + b)
The weight vector w is defined as:

w = \sum_{i=1}^{N} \alpha_i y_i x_i

where \alpha_i is the Lagrange multiplier used to find the hyperplane with the maximum distance from the nearest patterns. Patterns for which \alpha_i > 0 are support vectors.

A possible choice of decision function when the data points are not linearly separable can be expressed as:

f(x) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b \right)

where 0 \le \alpha_i \le C, K(\cdot, \cdot) is a kernel function, and C is the penalty parameter. A value of C = 10 is used for the experimental evaluation.
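A minimal sketch of this classifier set-up is given below; the penalty parameter C = 10 follows the text above, while the RBF kernel and the feature scaling step are assumptions added for illustration.

```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def build_svm(C=10.0):
    """SVM with the penalty parameter used in this study; kernel/scaling are assumptions."""
    return make_pipeline(StandardScaler(), SVC(C=C, kernel="rbf"))

# clf = build_svm()
# clf.fit(X_train, y_train)    # X_train: content features, y_train: spam / non-spam labels
# y_pred = clf.predict(X_test)
```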
Performance Evaluation
In order to evaluate the model, the 10-fold cross-validation technique is used. The results are reported in terms of classification accuracy, sensitivity, and specificity, whose values are obtained by analysing the confusion matrix [25].
Sensitivity is defined as the proportion of actual positive instances (spam) that are correctly predicted as positive by the classifier. Conversely, specificity is the proportion of actual negatives (non-spam hosts) that are correctly classified. Accuracy is defined as the proportion of all instances that are correctly predicted by the classifier.
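A minimal sketch of this evaluation protocol is shown below, assuming a classifier clf, a feature matrix X and labels y (1 = spam, 0 = non-spam). The stratified fold assignment is an assumption; the text above only states that 10-fold cross-validation is used.

```python
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import confusion_matrix

def evaluate(clf, X, y):
    """10-fold cross-validation followed by accuracy, sensitivity and specificity."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    y_pred = cross_val_predict(clf, X, y, cv=cv)
    tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate on spam hosts
        "specificity": tn / (tn + fp),   # true negative rate on non-spam hosts
    }
```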
Univariate Filter Feature Selection: Simple yet efficient
In order to improve the performance of the SVM, feature selection is implemented. Univariate filter-based feature selection has been utilised because it is computationally simple, fast, efficient, and inexpensive. In filter-based feature selection, "features are selected independently of the induction algorithm" [26; 27; 28]. Mutual Information Maximization (MIM) [29] is chosen as the feature selection measure. It is a simple method for ranking features on the basis of mutual information, a relevancy measure that determines how relevant a feature is for the corresponding target. The criterion can be expressed as follows:

J_{\mathrm{MIM}}(X_n) = I(X_n; c)

where X_n is the nth feature of the training matrix X, c is the class label, and I(X_n; c) refers to the mutual information between the two. The mutual information between two random variables p and q can be determined as:

I(p; q) = \sum_{p} \sum_{q} P(p, q) \log \frac{P(p, q)}{P(p)\,P(q)}
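A minimal sketch of MIM-style univariate filtering is shown below: each feature is scored by its mutual information with the class label and the top-k features are retained. scikit-learn's nearest-neighbour mutual information estimator is an assumption; the estimator actually used in this study may differ (e.g. a histogram-based estimate over discretised features).

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def select_top_k(X, y, k=70):
    """Rank features by mutual information with the class label and keep the top k."""
    selector = SelectKBest(score_func=mutual_info_classif, k=k)
    X_reduced = selector.fit_transform(X, y)
    ranked = selector.scores_.argsort()[::-1]   # feature indices, best first
    return X_reduced, ranked[:k]

# X70, top_features = select_top_k(X, y, k=70)   # 70 features gave the best results (Table 2)
```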
3. EXPERIMENTAL RESULTS AND DISCUSSION
Table 1 shows the performance of the SVM on the existing feature set, whereas Table 2 shows the prediction performance after feature selection. It is clearly visible that feature selection on the precompiled measures outperforms the complete feature set. The results show a significant gain in the classifier's performance in terms of both evaluation measures (i.e., specificity and sensitivity). Approximately a 3% increase in specificity and a 2% increase in accuracy and sensitivity are reported.
Table 1: Performance of the content-based feature set using SVM

Feature Set              Accuracy (%)   Sensitivity (%)   Specificity (%)
Content (98 features)    79.9           61.8              79.2
Table 2: Performance of content-based feature sets after filter-based feature selection using SVM
(columns give the number of top-ranked features retained)

Performance Measure (%)   Top 10   Top 20   Top 30   Top 40   Top 50   Top 60   Top 70   Top 80   Top 90
Accuracy                  76.7     78.6     81.6     81.6     81.4     81.1     82.1     80.8     80.7
Sensitivity               46.6     51.2     51.3     55.3     55.3     57.1     63.1     63.1     62.7
Specificity               81.5     81.6     81.6     82.6     82.7     82.7     82.9     80.8     80.7
4. CONCLUSIONS AND FUTURE PERSPECTIVES
In this study, we take into account existing heuristics for detecting spam by means of content analysis. The experiment compares the performance of the pre-determined features with the performance achieved after feature selection. The experimental results demonstrate that classifier performance increases with a reduced set of salient features (reduced from 98 features to 70 features). Furthermore, we believe that feature selection reduces the inherent risk of imprecision and over-fitting caused by the unbalanced nature of the data set. However, a robust and optimal feature selection model still remains to be found.
Multivariate feature selection and wrapper-based feature selection can be addressed as prominent future studies in the web spam community. A second line of future research will be the extension of heuristics extracted using both content analysis and web graph mining. Other interesting opportunities are oriented towards different machine learning approaches such as fuzzy logic and neural networks. Since there is no clear separation between spam and ham pages, i.e., the definition of spam may vary from one person to another, the use of fuzzy logic can be seen as a promising line of future work in the detection of web spam.
REFERENCES
[1] Wood, "Today, the Internet -- tomorrow, the Internet of Things?," Computerworld, 2009. [Online]. Available: http://www.computerworld.com/article/2498542/internet/today--the-internet----tomorrow--the-internet-of-things-.html.
[2] Z. Jia, W. Li and H. Zhang, "Content-based spam web page detection in search engine," Computer
Application and Software, vol. 26, no. 11, pp. 165-167, 2009.
[3] M. Cutts, "Google search and search engine spam," Google Official Blog, 2011. [Online]. Available:
https://googleblog.blogspot.in/2011/01/google-search-and-search-engine-spam.html.
[4] M. McGee, "BusinessWeek Dives Deep Into Google's Search Quality," Search Engine Land, 2009. [Online]. Available: http://searchengineland.com/businessweak-dives-deep-into-googles-search-quality-27317.
[5] D. Ferris, R. Jennings and C. WIlliams, "The Global Economic Impact of Spam," Ferris Research,
2005.
[6] R. Jennings, "Cost of Spam is Flattening - Our 2009 Predictions," 2009. [Online]. Available: http://ferris.com/2009/01/28/cost-of-spam-is-flattening-our-2009-predictions/.
[7] N. Spirin and J. Han, "Survey on web spam detection: principles and algorithms," SIGKDD Explor.
Newsl., vol. 13, no. 2, p. 50-64, 2012.
[8] C. Castillo, K. Chellapilla and B. Davison, "Adversarial Information Retrieval on the Web,"
Foundations and trends in Information Retrieval, vol. 4, no. 5, pp. 377-486, 2011.
[9] S. Chakrabarti, Mining the Web. San Francisco, CA: Morgan Kaufmann Publishers, 2003.
[10] C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi, M. Santini and S. Vigna, "A reference
collection for web spam," ACM SIGIR Forum, vol. 40, no. 2, pp. 11-24, 2006.
[11] J. Lin, "Detection of cloaked web spam by using tag-based methods," Expert Systems with
Applications, vol. 36, no. 4, pp. 7493-7499, 2009.
[12] A. Andrew, "Spam and JavaScript, future of the web," Kybernetes, vol. 37, no. 910, pp. 1463-1465,
2008.
[13] C. Wei, Y. Liu, M. Zhang, S. Ma, L. Ru, and K. Zhang, "Fighting against web spam: a novel
propagation method based on click-through data," In Proceedings of the 35th international ACM
SIGIR conference on Research and development in information retrieval, pp. 395-404, 2012.
[14] H. Ji and H. Zhang, "Analysis on the content features and their correlation of web pages for spam
detection", China Communications, vol. 12, no. 3, pp. 84-94, 2015.
[15] M. Egele, C. Kolbitsch and C. Platzer, "Removing web spam links from search engine results,"
Journal in Computer Virology, vol. 7, no. 1, pp. 51-62, 2011.
[16] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly, "Detecting spam web pages through content
analysis," In Proceedings of the 15th international conference on World Wide Web, ACM, pp. 83-92,
2006.
[17] V. Prieto, M. Álvarez and F. Cacheda, "SAAD, a content based Web Spam Analyzer and Detector,"
Journal of Systems and Software, vol. 86, no. 11, pp. 2906-2918, 2013.
[18] L. Araujo and J. M. Romo, "Web spam detection: new classification features based on qualified link
analysis and language models," Information Forensics and Security, IEEE Transactions, vol. 5, no. 3,
pp. 581-590, 2010.
[19] J. M.Ponte, and W. B. Croft, "A language modeling approach to information retrieval," In
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in
information retrieval, pp. 275-281, 1998.
[20] T. Cover and J. Thomas, Elements of information theory. New York: Wiley, 1991.
[21] D. Fetterly, "Adversarial information retrieval: The manipulation of web content," ACM Computing
Reviews, 2007.
8. 34 Computer Science & Information Technology (CS & IT)
[22] Web Spam Collections. http://chato.cl/webspam/datasets/ Crawled by the Laboratory of Web
Algorithmics, University of Milan, http://law.di.unimi.it/.
[23] C. Dong and B. Zhou, "Effectively detecting content spam on the web using topical diversity
measures," In Proceedings of the The 2012 IEEE/WIC/ACM International Joint Conferences on Web
Intelligence and Intelligent Agent Technology, vol. 01, pp. 266-273, IEEE Computer Society, 2012.
[24] M. A. Hearst, S. T. Dumais, E. Osman, J. Platt, and B. Scholkopf. "Support vector machines,"
Intelligent Systems and their Applications, vol.13, no. 4, pp:18-28, 1998.
[25] “Confusion matrix,” Wikipedia, the free Encyclopedia,
Available: https : //en.wikipedia.org/wiki/Confusionmatrix.
[26] C.Bishop, Pattern recognition and machine learning, springer, 2006.
[27] Guyon and A. Elisseeff, "An introduction to variable and feature selection," The Journal of Machine
Learning Research, vol. 3, pp: 1157-1182, 2003.
[28] H. Peng, F. Long and C. Ding, "Feature selection based on mutual information criteria of max-
dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 27, pp: 1226-1238, 2005.
[29] G. Brown, A. Pocock, M.J. Zhao and M. Lujan, "Conditional likelihood maximization: A unifying
framework for information theoretic feature selection,” Journal of Machine Learning Research, vol.
13, pp: 27-66, 2012.
AUTHORS
Ms. Shipra is currently pursuing her Masters in Analytics (Computer Science and
Technology) at the National Institute of Technology (NIT), New Delhi, India. Prior to
this she received her B.Tech degree in Computer Science and Engineering and has
more than one year of work experience in digital marketing. Her current research
interests are machine learning and pattern recognition, applied to the domains of
adversarial information retrieval and search engine spam.
Ms. Akanksha Juneja is currently working as an Assistant Professor in the Department
of Computer Science and Engineering, National Institute of Technology Delhi. She is
also a PhD scholar at the School of Computer & Systems Sciences (SC&SS), Jawaharlal
Nehru University (JNU), New Delhi, India. Prior to this she received her M.Tech degree
(Computer Science & Technology) from SC&SS, JNU, New Delhi. Her current research
interests are machine learning and pattern recognition, applied to the domains of image
processing and security.