Data mining model for the data retrieval from central server configuration

ABSTRACT

A server that has to keep track of heavy document traffic is unable to filter the documents that are most relevant and most recently updated for continuous text search queries. This paper focuses on handling continuous text extraction while sustaining high document traffic. The main objective is to retrieve the most recently updated documents that are most relevant to the query, by applying a sliding window technique. Our solution indexes the streamed documents in main memory with a structure based on the principles of the inverted file, and processes document arrival and expiration events with an incremental threshold-based method. It also ensures elimination of duplicate document retrieval using unsupervised duplicate detection. The documents are ranked based on user feedback, and higher-ranked documents are given higher priority for retrieval.
1. INTRODUCTION

The MapReduce paradigm proceeds in two steps: a) map step: the master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level structure. The worker node processes the smaller problem and passes the answer back to its master node. b) reduce step: the master node then collects the answers to all the sub-problems and combines them in some way to form the output, the answer to the problem it was originally trying to solve.

Unsupervised duplicate detection [3]: the problem of identifying objects in databases that refer to the same real-world entity is known, among others, as duplicate detection or record linkage. Here this method is used to identify documents that are alike and to prevent them from appearing in the result set. Our paper also focuses on ranking the documents based on user feedback. The user is allowed to give feedback for each document that has been retrieved. This feedback is used to rank the document and hence increases the probability of the document appearing in the sliding window.
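For illustration, here is a minimal sketch of the two-step MapReduce flow in Python (ours, not code from the paper; the term-counting job and all names are illustrative):

from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_step(doc_chunk):
    # Worker node: solve the smaller sub-problem (term counts for its chunk).
    counts = Counter()
    for doc in doc_chunk:
        counts.update(doc.lower().split())
    return counts

def reduce_step(partial_answers):
    # Master node: combine the sub-answers into the final output.
    total = Counter()
    for partial in partial_answers:
        total.update(partial)
    return total

docs = ["red rose", "red sun", "blue rose"]
chunks = [docs[:2], docs[2:]]                  # master splits the input
with ThreadPoolExecutor() as pool:             # and distributes the chunks
    partials = list(pool.map(map_step, chunks))
print(reduce_step(partials))                   # e.g. Counter({'red': 2, 'rose': 2, ...})

The same split-and-merge pattern is what the proposed system applies to query processing in Section 3.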
Visual Web Spider is a fully automated, multi-threaded web crawler that allows us to index and collect specific web pages on the Internet. Once installed, it enables us to browse the Web in an automated manner, indexing pages that contain specific keywords and phrases and exporting the indexed data to a database on our local computer in the format of our choice. We want to collect website links to build our own specialized web directory. Visual Web Spider can be configured automatically: the program's friendly, wizard-driven interface lets us customize the search in a step-by-step manner. To index relevant web pages, we follow a simple sequence of steps. After opening the wizard, we enter the starting web page URL or let the program generate URL links based on specific keywords or phrases. We then set the crawling rules and depth according to our search strategy and, finally, specify the data to index and the project filename. Clicking on 'Start' sets the crawler to work. Crawling is fast, thanks to multi-threading that allows up to 50 simultaneous threads. Another nice touch is that Visual Web Spider can index the content of any HTML tag, such as: page title (TITLE tag), page text (BODY tag), HTML code (HTML tag), header text (H1-H6 tags), bold text (B tags), anchor text (A tags), alt text (IMG tag, ALT attribute), keywords and description (META tags), and others. The program can also list each page's size and last-modified date.

Once the web pages have been indexed, Visual Web Spider can export the indexed data to any of the following formats: Microsoft Access, Excel (CSV), TXT, HTML, and MySQL script.
1.1 Key Features
A personal, customizable web crawler with configurable crawling rules.
Multi-threaded technology (up to 50 simultaneous threads).
Support for the robots exclusion protocol/standard (robots.txt file and Robots META tags).
Indexing of the contents of any HTML tag, with configurable indexing rules.
Export of the indexed data into a Microsoft Access database, TEXT file, Excel file (CSV), HTML file, or MySQL script file.
Crawling from a list of URLs specified by the user, or from keywords and phrases.
Storage of web pages and media files on the local disk.
Automatic resolution of redirected links.
Detection of broken links.
Filtering of the indexed data.
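Visual Web Spider itself is closed-source; as a rough illustration of the crawling loop described above (our own sketch, not the product's code; the choice of indexed tags and the CSV export are assumptions), consider:

import csv, sys
import urllib.request
from html.parser import HTMLParser

class TagIndexer(HTMLParser):
    # Collects the page title and anchor targets, mimicking per-tag indexing.
    def __init__(self):
        super().__init__()
        self.title, self.links, self._in_title = "", [], False
    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self.links += [v for k, v in attrs if k == "href" and v]
    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
    def handle_data(self, data):
        if self._in_title:
            self.title += data

def crawl(start_url, max_depth, writer):
    # Breadth-first crawl to a fixed depth; broken links are detected and skipped.
    frontier, seen = [(start_url, 0)], set()
    while frontier:
        url, depth = frontier.pop(0)
        if url in seen or depth > max_depth:
            continue
        seen.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue                           # broken link
        parser = TagIndexer()
        parser.feed(html)
        writer.writerow([url, parser.title.strip()])   # export indexed data (CSV)
        frontier += [(l, depth + 1) for l in parser.links if l.startswith("http")]

crawl("http://example.com", 1, csv.writer(sys.stdout))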
2. EXISTING SYSTEM
Drawbacks of the existing servers that tend to handle heavy document traffic are the following. They cannot efficiently monitor a data stream that has highly dynamic document traffic. The server alone does the processing, hence a large amount of time is consumed. In the case of continuous text search queries and extraction, the entire document set has to be scanned every time in order to find the relevant documents. There is no confirmation that duplicate documents are not retrieved for the given query. Also, a large number of documents cannot be stored in the main memory, as this involves a large CPU cost.

Naive solution: The most straightforward approach to evaluate the continuous queries defined above is to scan the entire window contents D after every update or in fixed time intervals, compute all the document scores, and report the top-k documents. This method incurs high processing costs due to the need for frequent re-computations from scratch.
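For concreteness, a minimal sketch of this naive evaluation (our illustration; the Doc record and the inner-product score are assumptions consistent with the notation introduced in Section 3):

from collections import namedtuple

# Illustrative document record: identifier plus a term -> frequency map
# (the "composition list" introduced in Section 3).
Doc = namedtuple("Doc", ["doc_id", "tf"])

def naive_topk(window, query_weights, k):
    # Rescan the entire window D after every update: every valid document
    # is re-scored from scratch, which is what makes this baseline costly.
    scores = [(sum(w * d.tf.get(t, 0) for t, w in query_weights.items()), d.doc_id)
              for d in window]
    return sorted(scores, reverse=True)[:k]

window = [Doc(1, {"red": 2, "rose": 1}), Doc(2, {"red": 1})]
print(naive_topk(window, {"red": 1.0, "rose": 0.5}, k=2))   # [(2.5, 1), (1.0, 2)]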
3. PROPOSED SYSTEM

3.1 Problem Formulation

In our model, a stream of documents flows into a central server. The user registers text queries at the server, which is then responsible for continuously monitoring/reporting their results. As in most stream processing systems, we store all the data in main memory in order to cope with frequent updates, and design our methods with the primary goal of minimizing the CPU cost. Moreover, it is necessary to reduce the workload of the monitoring server.

3.2 Proposed Solution

In our solution we use the MapReduce technique in order to reduce the workload of the central server: the server acts as the master node, which splits up the processing task among several worker nodes. The number of worker nodes that are assigned the processing task depends on the nature of the query that has been put up by the user. The master node, upon receiving a query from the user, assigns the workers to find the relevant result set and return their solutions to the master node. The master node, after receiving the partial solutions from the workers, integrates the results to produce the final result set for the given query. This can be viewed schematically in Fig. 1. Each worker/slave node is responsible for computing the result set of the k most relevant and recent documents for the given query, using the incremental threshold algorithm. The overall system architecture can be viewed in Fig. 2.

Fig. 1. System architecture for the proposed Relevant Updated Architecture Model.
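A rough sketch of this master/worker flow (ours; it reuses the illustrative Doc record from the sketch in Section 2 and assumes each worker holds one partition of the stream):

import heapq

def worker_topk(partition, query_weights, k):
    # Worker node: compute a partial (local) top-k over its own partition.
    scored = ((sum(w * d.tf.get(t, 0) for t, w in query_weights.items()), d.doc_id)
              for d in partition)
    return heapq.nlargest(k, scored)

def master_topk(partitions, query_weights, k):
    # Master node: scatter the query to the workers, then integrate their
    # partial solutions into the final result set for the given query.
    partials = [worker_topk(p, query_weights, k) for p in partitions]
    return heapq.nlargest(k, (entry for part in partials for entry in part))

Because each local top-k contains every candidate its partition can contribute, merging the partial lists at the master yields the same result as a centralized scan.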
Computer Science & Information Technology (CS & IT)
ts. There is no confirmation that duplicate documents are not retrieved for the
given query. A large amount of documents cannot be stored in the main memory as it involves
The most straightforward approach to evaluate the continuous queries defined
above is to scan the entire window contents D after every update or in fixed time
mpute all the document scores, and report the top-k documents. This method
igh processing costs due to the need for frequent re computations from scratch.
YSTEM
3.1 Problem Formulation
In our model, a stream of documents flows into a central server. The user registers text queries at
responsible for continuously monitoring/reporting their results. As in
most stream processing systems, we store all the data in main memory in order to cope with
frequent updates, and design our methods with the primary goal of minimizing the CPU cost.
eover it is necessary to reduce the work load of the monitoring server.
In our solution we use the MapReduce technique in order to reduce the work load of the central
server, where the server acts as the master node, which splits up the processing task to several
worker nodes. The number of worker nodes, which have been assigned the processing task,
depends on the nature of query that has been put up by the user. Here the master node, upon
receiving a query from the user, assigns the workers to find the relevant result query set and
return the solution to the master node. The master node, after receiving the partial solutions from
the workers, integrates the results to produce the final result set for the given query. This can be
viewed schematically in the following Fig.1. Each worker/slave node is responsible uses the
incremental threshold algorithm for computing the result set of k relevant and recent documents
for the given query. The overall system architecture can be viewed as in the following
System Architecture for the proposed Relevant Updated Architecture Model.
23
ts. There is no confirmation that duplicate documents are not retrieved for the
given query. A large amount of documents cannot be stored in the main memory as it involves
tinuous queries defined
contents D after every update or in fixed time
documents. This method
to the need for frequent re computations from scratch.
In our model, a stream of documents flows into a central server. The user registers text queries at
responsible for continuously monitoring/reporting their results. As in
most stream processing systems, we store all the data in main memory in order to cope with
frequent updates, and design our methods with the primary goal of minimizing the CPU cost.
In our solution we use the MapReduce technique in order to reduce the work load of the central
server, where the server acts as the master node, which splits up the processing task to several
ed the processing task,
depends on the nature of query that has been put up by the user. Here the master node, upon
receiving a query from the user, assigns the workers to find the relevant result query set and
master node, after receiving the partial solutions from
the workers, integrates the results to produce the final result set for the given query. This can be
Each worker/slave node is responsible uses the
for computing the result set of k relevant and recent documents
for the given query. The overall system architecture can be viewed as in the following Fig.2
levant Updated Architecture Model.
Fig. 2. A data retrieval system for the continuous data extraction technique using MapReduce.

Each element of the input stream comprises a document d, a unique document identifier, the document arrival time, and a composition list. The composition list contains one (t, wdt) pair for each term t belonging to T in the document, where wdt is the frequency of the term t in the document d. The notations used in this model are listed in Fig. 3.

Fig. 3. A detailed list of the notations used in the paper for the proposed system.

The worker node maintains an inverted index for each term t in the document. With the inverted index, a query Q is processed as follows: the inverted lists for the terms t belonging to Q are scanned, and the partial wdt scores of each encountered document d are accumulated to produce S(d/Q). The documents with the highest scores at the end are returned as the result.
3.3 Incremental Threshold Algorithm
Fig. 4 represents the data structures that are used in this system. The valid documents D are stored in a single list, shown at the bottom of the figure. Each element of the list holds the stream information of a document (identifier, text content, composition list, arrival time). D contains the most recent documents for both count-based and time-based windows. Since documents expire in first-in-first-out manner, D is maintained efficiently by inserting arriving documents at the end of the list and deleting expiring ones from its head. On top of the list of valid documents we build an inverted index. The structure at the top of the figure is the dictionary of search terms. It is an array that contains an entry for each term t belonging to T. The dictionary entry for t stores a pointer to the corresponding inverted list Lt. Lt holds an impact entry for each document d that contains t, together with a pointer to d's full information in the document list. When a document d arrives, an impact entry (d, wdt) (derived from d's composition list) is inserted into the inverted list of each term t that appears in d. Likewise, the impact entries of an expiring document are removed from the respective inverted lists. The inverted lists are kept sorted on wdt while supporting fast (logarithmic) insertions and deletions.
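A compact sketch of this maintenance (ours; sorted Python lists stand in for the tree-based lists, so deletion here is linear rather than logarithmic):

import bisect
from collections import deque

documents = deque()                            # D: valid documents, FIFO order

def on_arrival(doc, index):
    # Append the arriving document at the tail of D, then insert one impact
    # entry per term into the corresponding inverted list Lt.
    documents.append(doc)
    for t, wdt in doc.tf.items():
        bisect.insort(index[t], (wdt, doc.doc_id))   # list stays ordered on wdt

def on_expiry(index):
    # Documents expire first-in-first-out: remove from the head of D and
    # delete the expiring impact entries from the respective lists.
    doc = documents.popleft()
    for t, wdt in doc.tf.items():
        index[t].remove((wdt, doc.doc_id))
    return doc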
Initial Top-k Search: When a query is first submitted to the system, its top-k result is computed using the initial search module. The process is an adaptation of the threshold algorithm. Here, the inverted lists Lt of the query terms play the role of the sorted attribute lists. Unlike the original threshold algorithm, however, we do not probe the lists in a round-robin fashion. Since the similarity function associates a different weight wQt with each query term, inspired by [4] we probe the list Lt with the highest ct = wQt.wdnxt,t value, where dnxt is the next document in Lt. The global threshold gt, a notion used identically to the original algorithm, is the sum of the ct values for all the terms in Q. Consider query Q1 with search string "red rose" and k = 2. Let term t20 = "red" and t11 = "rose". First the server identifies the inverted lists L11 and L20 (using the dictionary hash table), and computes the values c11 = wQ1t11.wd7t11 and c20 = wQ1t20.wd6t20. In iteration 1, since c20 is larger, the first entry of L20 is popped; the similarity score of the corresponding document, d6, is computed by accessing its composition list in D, and d6 is inserted into the tentative result set R. c20 is then updated to the next impact entry in L20. A document whose score is above the local threshold but has not yet been verified is still included in R as an unverified entry.
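A sketch of this initial search (ours; it builds on the index sketches above, where each list is kept ordered on wdt, and docs_by_id stands for the document list D):

import heapq

def initial_topk(index, docs_by_id, query_weights, k):
    # Threshold-style search: always probe the list Lt with the highest
    # ct = wQt * wdnxt,t, instead of probing lists round-robin.
    pos = {t: len(index[t]) - 1 for t in query_weights if index.get(t)}
    results, seen = [], set()
    while pos:
        t = max(pos, key=lambda u: query_weights[u] * index[u][pos[u]][0])
        wdt, doc_id = index[t][pos[t]]         # next (highest-wdt) entry of Lt
        pos[t] -= 1
        if pos[t] < 0:
            del pos[t]
        if doc_id not in seen:
            seen.add(doc_id)
            d = docs_by_id[doc_id]             # fetch full composition list from D
            score = sum(w * d.tf.get(q, 0) for q, w in query_weights.items())
            results = heapq.nlargest(k, results + [(score, doc_id)])
        # Global threshold gt: the best score any unseen document can reach.
        gt = sum(query_weights[u] * index[u][pos[u]][0] for u in pos)
        if len(results) == k and results[-1][0] >= gt:
            break                              # top-k result is final
    return results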
The algorithm is as follows:

Algorithm: Incremental Threshold with Duplicate Detection (arriving d_ins, expiring d_del)
1:  Insert document d_ins into D (the system document list)
2:  for all terms t in the composition list of d_ins do
3:    for all documents in L_t do
4:      for all terms t in d_ins do
5:        Compute unique(d_ins)
6:        w_(d_ins,t) != w_(d_nxt,t)
7:    Insert the impact entry of d_ins into L_t
8:    Probe the threshold tree of L_t
9:    for all queries Q where w_(d_ins,t) >= localThreshold do
10:     if Q has not been considered for d_ins in another L_t then
11:       Compute S(d_ins/Q)
12:       Insert d_ins into R
13:       if S(d_ins/Q) >= old S_k then
14:         Update S_k (since d_ins enters the top-k result)
15:         Keep rolling up local thresholds while r <= S_k
16:         Set new τ as influence threshold for Q
17:         Update local thresholds of Q
18: Delete document d_del from D (the system document list)
19: for all terms t in the composition list of d_del do
20:   Delete the impact entry of d_del from L_t
21:   Probe the threshold tree of L_t
22:   for all queries Q where w_(d_del,t) >= localThreshold do
23:     if Q has not been considered for d_del in another L_t then
24:       Delete d_del from R
25:       if S(d_del/Q) >= old S_k then
26:         Resume the top-k search from the local thresholds
27:         Set new τ as influence threshold for Q
28:         Update local thresholds of Q
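The arrival half of the algorithm can be pictured with the following sketch (ours; the Query object with its per-term local thresholds, kth_score and result bookkeeping is hypothetical and only mirrors lines 1-14 above):

def handle_arrival(d_ins, index, queries):
    # Lines 1-7: register the document and its impact entries.
    on_arrival(d_ins, index)                   # from the maintenance sketch above
    for Q in queries:
        # Lines 8-10: only queries whose local threshold is reached by some
        # impact entry of d_ins need to be re-examined, and only once.
        if not any(d_ins.tf.get(t, 0) >= Q.local_threshold.get(t, float("inf"))
                   for t in Q.weights):
            continue
        # Lines 11-14: compute S(d_ins/Q) and update the top-k result.
        score = sum(w * d_ins.tf.get(t, 0) for t, w in Q.weights.items())
        if score >= Q.kth_score():
            Q.insert_result(score, d_ins.doc_id)   # d_ins enters the top-k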
After constructing the initial result set R using the above algorithm, only the documents that have a score higher than or equal to the influence threshold τ (tau) are verified. The key point is that no duplicate documents form part of the result set R. This is ensured using unsupervised duplicate detection. The idea of unsupervised learning for duplicate detection has its roots in the probabilistic model proposed by Fellegi and Sunter. When there is no training data to compute the probability estimates, it is possible to use variations of the Expectation Maximization algorithm to identify appropriate clusters in the data.
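As a stand-in for the full probabilistic model, the following sketch (ours) filters near-duplicates from R with a plain cosine-similarity threshold; the Fellegi-Sunter/EM machinery would replace the fixed threshold with learned match/non-match probabilities:

import math

def cosine(tf_a, tf_b):
    # Cosine similarity between two composition lists (term -> frequency).
    dot = sum(w * tf_b.get(t, 0) for t, w in tf_a.items())
    norm_a = math.sqrt(sum(w * w for w in tf_a.values()))
    norm_b = math.sqrt(sum(w * w for w in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def drop_duplicates(result_docs, threshold=0.95):
    # Keep a document only if it is not near-identical to one already kept.
    kept = []
    for d in result_docs:
        if all(cosine(d.tf, other.tf) < threshold for other in kept):
            kept.append(d)
    return kept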
Fig. 4. Data structures used for the incremental threshold algorithm.
4. CONCLUSION
In this paper, we study the processing of continuous text queries over document streams. These
queries define a set of search terms, and request continual monitoring of a ranked list of recent
documents that are most similar to those terms. The problem arises in a variety of text monitoring
applications, e.g., email and news tracking. To the best of our knowledge, this is the first attempt
to address this important problem. Currently, our study focuses on plain text documents. A
challenging direction for future work is to extend our methodology to documents tagged with
metadata and documents with a hyperlink structure, as well as to specialized scoring mechanisms
that may apply in these settings.
REFERENCES
[1] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, 2002, "Models and Issues in Data Stream Systems," PODS '02, pp. 1-16.
[2] J. Zobel and A. Moffat, July 2006, "Inverted Files for Text Search Engines," ACM Computing Surveys, vol. 38, no. 2, pp. 1-55.
[3] V. N. Anh and A. Moffat, 2002, "Impact Transformation: Effective and Efficient Web Retrieval," Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval.
[4] V. N. Anh, O. de Kretser, and A. Moffat, 2001, "Vector-Space Ranking with Effective Early Termination," ACM SIGIR Conf. on Research and Development in Information Retrieval.
[5] Y. Zhang and J. Callan, 2001, "Maximum Likelihood Estimation for Filtering Thresholds," ACM SIGIR Conf. on Research and Development in Information Retrieval.
[6] M. Persin, J. Zobel, and R. Sacks-Davis, 1996, "Filtered Document Retrieval with Frequency-Sorted Indexes," J. Am. Soc. for Information Science, vol. 47, no. 10.