These slides come from the seminar "Machine Learning for Data Mining", held at Daffodil International University. The chief guest, Dr. Dewan Md. Farid, prepared them to introduce us to data mining. He also shared his research experience, which was fascinating, and his talk was full of surprises. He is a well-known researcher. I hope you will enjoy these slides. Details about Dr. Dewan Md. Farid are available at the link below:
https://ai.vub.ac.be/members/dewan-md-farid
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Deep learning is a family of machine learning algorithms that use multiple layers to progressively extract higher-level features from raw data. For example, in image processing lower layers may recognize edges, whereas higher layers may identify concepts meaningful to humans, such as digits, letters or faces. In this paper we survey the literature to assess how useful deep learning is and how it can be used to define other areas of artificial intelligence. Anirban Chakraborty "A Study of Deep Learning Applications" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4 | Issue-4, June 2020, URL: https://www.ijtsrd.com/papers/ijtsrd31629.pdf Paper URL: https://www.ijtsrd.com/computer-science/artificial-intelligence/31629/a-study-of-deep-learning-applications/anirban-chakraborty
Drug Repurposing using Deep Learning on Knowledge Graphs - Databricks
Discovering new drugs is a lengthy and expensive process. This means that finding new uses for existing drugs can help create new treatments in less time and at lower cost. The difficulty is in finding these potential new uses.
How do we find these undiscovered uses for existing drugs?
We can unify the available structured and unstructured data sets into a knowledge graph. This is done by fusing the structured data sets, and performing named entity extraction on the unstructured data sets. Once this is done, we can use deep learning techniques to predict latent relationships.
In this talk we will cover:
Building the knowledge graph
Predicting latent relationships
Using the latent relationships to repurpose existing drugs
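The three steps above can be illustrated with a very small sketch. It is not the talk's method: the triples are invented, and a simple neighbour-overlap (Jaccard) score stands in for the deep learning model that would normally predict latent relationships. All entity names and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical triples; in the talk's setting these would come from fused
# structured data sets and named entity extraction over unstructured text.
TRIPLES = [
    ("drug_a", "targets", "protein_x"),
    ("drug_b", "targets", "protein_x"),
    ("drug_b", "targets", "protein_y"),
    ("protein_x", "associated_with", "disease_1"),
    ("protein_y", "associated_with", "disease_2"),
    ("drug_a", "treats", "disease_1"),
]


def build_graph(triples):
    """Build an undirected adjacency map, ignoring relation types."""
    adjacency = defaultdict(set)
    for head, _relation, tail in triples:
        adjacency[head].add(tail)
        adjacency[tail].add(head)
    return adjacency


def rank_candidates(adjacency, drugs, diseases):
    """Rank unlinked (drug, disease) pairs by Jaccard overlap of neighbours.

    This heuristic is an assumption used for illustration only; it is a
    stand-in for a learned link-prediction model.
    """
    scores = {}
    for drug in drugs:
        for disease in diseases:
            if disease in adjacency[drug]:
                continue  # already a known relationship
            shared = adjacency[drug] & adjacency[disease]
            union = adjacency[drug] | adjacency[disease]
            scores[(drug, disease)] = len(shared) / len(union) if union else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


graph = build_graph(TRIPLES)
ranked = rank_candidates(graph, ["drug_a", "drug_b"], ["disease_1", "disease_2"])
for pair, score in ranked:
    print(pair, round(score, 3))
```

The highest-ranked pairs are the repurposing candidates a domain expert would then review; a real system would replace the overlap score with embeddings or a graph neural network trained on the knowledge graph.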
National Resource for Networks Biology's TR&D Theme 3: Although networks have been very useful for representing molecular interactions and mechanisms, network diagrams do not visually resemble the contents of cells. Rather, the cell involves a multi-scale hierarchy of components – proteins are subunits of protein complexes which, in turn, are parts of pathways, biological processes, organelles, cells, tissues, and so on. In this technology research project, we will pursue methods that move Network Biology towards such hierarchical, multi-scale views of cell structure and function.
Delroy Cameron's Dissertation Defense: A Context-Driven Subgraph Model for L... - Amit Sheth
Literature-Based Discovery (LBD) refers to the process of uncovering hidden connections that are implicit in scientific literature. Numerous hypotheses have been generated from scientific literature, which have influenced innovations in diagnosis, treatment, prevention and overall public health. However, much of the existing research on discovering hidden connections among concepts has used distributional statistics and graph-theoretic measures to capture implicit associations. Such metrics do not explicitly capture the semantics of hidden connections. ...
While effective in some situations, the practice of relying on domain expertise, structured background knowledge and heuristics to complement distributional and graph-theoretic approaches, has serious limitations. ..
This dissertation proposes an innovative context-driven, automatic subgraph creation method for finding hidden and complex associations among concepts, along multiple thematic dimensions. It outlines definitions for context and shared context, based on implicit and explicit (or formal) semantics, which compensate for deficiencies in statistical and graph-based metrics. It also eliminates the need for heuristics a priori. An evidence-based evaluation of the proposed framework showed that 8 out of 9 existing scientific discoveries could be recovered using this approach. Additionally, insights into the meaning of associations could be obtained using provenance provided by the system. In a statistical evaluation to determine the interestingness of the generated subgraphs, it was observed that an arbitrary association is mentioned in only approximately 4 articles in MEDLINE, on average. These results suggest that leveraging implicit and explicit context, as defined in this dissertation, is an advancement of the state-of-the-art in LBD research.
Ph.D. Committee: Drs. Amit Sheth (Advisor), TK Prasad, Michael Raymer,
Ramakanth Kavuluru (UKY), Thomas C. Rindflesch (NLM) and Varun Bhagwan (Yahoo! Labs)
Relevant Publications (more at: http://knoesis.wright.edu/students/delroy/)
D. Cameron, R. Kavuluru, T. C. Rindflesch, O. Bodenreider, A. P. Sheth, K. Thirunarayan. Leveraging Distributional Semantics for Domain Agnostic Literature-Based Discovery (under preparation)
D. Cameron, O. Bodenreider, H. Yalamanchili, T. Danh, S. Vallabhaneni, K. Thirunarayan, A. P. Sheth, T. C. Rindflesch. A Graph-based Recovery and Decomposition of Swanson’s Hypothesis using Semantic Predications. Journal of Biomedical Informatics (JBI13), 46(2): 238–251, 2013
D. Cameron, R. Kavuluru, O. Bodenreider, P. N. Mendes, A. P. Sheth, K. Thirunarayan. Semantic Predications for Complex Information Needs in Biomedical Literature International Bioinformatics and Biomedical Conference (BIBM11), pp. 512–519, 2011 (acceptance rate=19.4%)
D. Cameron, P. N. Mendes, A. P. Sheth, V. Chan. Semantics-empowered Text Exploration for Knowledge Discovery. ACM Southeast Conference (ACMSE10), 14, 2010
IAO-Intel: An Ontology of Information Artifacts in the Intelligence Domain - Barry Smith
We describe on-going work on IAO-Intel, an information artifact ontology developed as part of a suite of ontologies designed to support the needs of the intelligence community. IAO-Intel provides a controlled, structured vocabulary for the consistent formulation of metadata about documents, images, emails and other carriers of information. It will provide a resource for uniform explication of the terms used in multiple existing military dictionaries, thesauri and metadata registries, thereby enhancing the degree to which the content formulated with their aid will be available to computational reasoning.
Presented at the 2013 STIDS (Semantic Technology for Intelligence, Defense and Security) conference: http://stids.c4i.gmu.edu/
Natural language processing through the subtractive mountain clustering algor... - ijnlc
In this work, the subtractive mountain clustering algorithm is adapted to natural language processing with the aim of building a chatbot that answers questions posed by the user. The implemented version of the algorithm associates a set of words into clusters. After the centre of each cluster (its most relevant word) is found, all other words are aggregated according to a defined metric adapted to language processing. All the relevant stored information (needed to answer the questions) is processed by the algorithm, as are the questions themselves. Correct processing of the text enables the chatbot to produce answers that relate to the posed queries. Since the intended chatbot is meant to help elderly people with medication, we validate the method using the package insert of a drug as the available information and formulate associated questions. Errors in medication intake among elderly people are very common. One of the main causes is their reduced ability to retain information. The large amount of medication required at an advanced age is another limiting factor. Hence, there is demand for an interactive aid system, preferably using natural language, to help the older population with medication. A chatbot based on a subtractive clustering algorithm is the chosen solution.
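For readers unfamiliar with subtractive clustering, the following is a minimal sketch of the general technique on numeric points. It assumes Euclidean distance and the commonly used radius parameters; the word-level metric described in the abstract is not given, so none is reproduced here, and the names `r_a`, `r_b` and `stop_ratio` are illustrative choices.

```python
import math


def subtractive_clustering(points, r_a=1.0, r_b=1.5, stop_ratio=0.15):
    """Return cluster centres chosen among the input points.

    Each point gets a potential that grows with the number of close
    neighbours. The point with the highest potential becomes a centre, and
    potential near that centre is then subtracted so the next centre is
    found elsewhere.
    """
    if not points:
        return []
    alpha = 4.0 / r_a ** 2  # neighbourhood width for the initial potential
    beta = 4.0 / r_b ** 2   # wider neighbourhood used when subtracting

    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    potential = [
        sum(math.exp(-alpha * squared_distance(p, q)) for q in points)
        for p in points
    ]
    first_peak = max(potential)
    centres = []
    while True:
        best = max(range(len(points)), key=potential.__getitem__)
        peak = potential[best]
        if peak < stop_ratio * first_peak:
            break  # remaining potential is too low to justify another centre
        centres.append(points[best])
        for i, p in enumerate(points):
            potential[i] -= peak * math.exp(-beta * squared_distance(p, points[best]))
    return centres


# Hypothetical data: two well-separated groups of 2-D points.
data = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
]
centres = subtractive_clustering(data)
print(centres)
```

To cluster words rather than points, `squared_distance` would be replaced by a language-specific dissimilarity, which is the adaptation the abstract refers to.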
Prediction APIs are democratizing Machine Learning. They make it easier for developers to build smart features in their apps by abstracting away some of the complexities of building and deploying predictive models. In this talk we’ll look at the possibilities and limitations of ML, how to use Prediction APIs, how to prepare data to send to them, and how to assess performance.
Item generation using rule based randomization algorithms in RPG games - Rejosh Samuel
This thesis compares the available randomization algorithms with a rule-based algorithm. Based on this research, an item generation mechanic was developed in a prototype game, which was tested repeatedly to generate 200 player-simulated test cases. A conclusion was drawn as to how the mechanic can affect gameplay and player satisfaction and, in turn, prolong the shelf life of the game. This research also sheds light on its usefulness in post-production.
Note: As I am not an artist, some of the graphical assets have been borrowed from public domain sources and games, including Diablo, and I have given credit to the owners in my thesis.
Socializing Big Data: Collaborative Opportunities in Computer Science, the So... - Sheryl Grant
Harnessing the “data deluge” is promoting new conversations between disciplines. Prof. Marciano and his collaborators have been pursuing research in a number of areas including: big cultural data, access to big heterogeneous data, records in the cloud, federated grid/cloud storage, visual interfaces to large collections, policy-based frameworks to automate content management, and distributed cyberinfrastructure to enable data sharing. But more importantly, innovative technical approaches require the convergence of creative insights across computer science, the social sciences, and the humanities. This talk touches on these topics and highlights a new collaboration with partners at Duke.
Richard Marciano is a professor in the School of Information and Library Science at the University of North Carolina at Chapel Hill, Director of the Sustainable Archives and Leveraging Technologies (SALT) lab, and co-director of the Digital Innovation Lab (DIL). He leads development of "big data" projects funded by Mellon, NSF, NARA, NHPRC, IMLS, DHS, NIEHS, and UNC. Recent 2012 grants include a JISC Digging into Data award with UC Berkeley and the U. of Liverpool, called "Integrating Data Mining and Data Management Technologies for Scholarly Inquiry," a Mellon / UNC award called "Carolina Digital Humanities Initiative," which involves the translating of big data challenges into curricular opportunities, and an NSF award on big heterogeneous data integration.
He holds a B.S. in Avionics and Electrical Engineering, and an M.S. and Ph.D. in Computer Science, and has worked as a postdoc in Computational Geography. He conducted interdisciplinary research at the San Diego Supercomputer at UC San Diego, working with teams of scholars in sciences, social sciences, and humanities.
Buy Embedded Systems Projects, B.Tech Final Year Projects Online - Technogroovy
Get In Touch:
Technogroovy Systems India Pvt. Ltd.
www.technogroovy.com
http://www.technogroovy.com/index.php/student-zone/final-year-project
Email Id: technogroovy@gmail.com
Connect with us On Facebook:
https://www.facebook.com/Technogroovyindia
How to plan and conduct hypothesis-based science projects for an A/L school project.
The project can be presented at the National Science and Engineering Fair or the Google Science Fair.
Data Mining: What is Data Mining?
History
How data mining works?
Data Mining Techniques.
Data Mining Process.
(The Cross-Industry Standard Process)
Data Mining: Applications.
Advantages and Disadvantages of Data Mining.
Conclusion.
The Art and Power of Data-Driven Modeling: Statistical and Machine Learning A... - WithTheBest
This presentation illustrates distinct statistical and machine learning approaches to automated recognition of major brain tissues in 3D brain MRI.
Nataliya Portman, Postdoctoral Fellow Faculty of Science, UOIT, Oshawa, ON Canada
PhD in Applied Mathematics, University of Waterloo | Postdoctoral Research on Brain MRI Segmentation, Neuro | Current: Applied Machine Learning in Materials Science, University of Ontario Institute of Technology
Definition of data mining, the role of data mining in a decision-support chain, applications, working methods, the KDD process (ECD, Extraction de connaissances à partir de Données), the SEMMA method from SAS, the CRISP-DM method, etc.
This presentation is based on "Statistical Modeling: The Two Cultures" by Leo Breiman. It compares the data modeling culture (statistics) and the algorithmic modeling culture (machine learning).
Branch: An interactive, web-based tool for building decision tree classifiers - Benjamin Good
A crucial task in modern biology is the prediction of complex phenotypes, such as breast cancer prognosis, from genome-wide measurements. Machine learning algorithms can sometimes infer predictive patterns, but there is rarely enough data to train and test them effectively and the patterns that they identify are often expressed in forms (e.g. support vector machines, neural networks, random forests composed of 10s of thousands of trees) that are highly difficult to understand. In addition, it is generally unclear how to include prior knowledge in the course of their construction.
Decision trees provide an intuitive visual form that can capture complex interactions between multiple variables. Effective methods exist for inferring decision trees automatically but it has been shown that these techniques can be improved upon via the manual interventions of experts. Here, we introduce Branch, a new Web-based tool for the interactive construction of decision trees from genomic datasets. Branch offers the ability to: (1) upload and share datasets intended for classification tasks (in progress), (2) construct decision trees by manually selecting features such as genes for a gene expression dataset, (3) collaboratively edit decision trees, (4) create feature functions that aggregate content from multiple independent features into single decision nodes (e.g. pathways) and (5) evaluate decision tree classifiers in terms of precision and recall. The tool is optimized for genomic use cases through the inclusion of gene and pathway-based search functions.
Branch enables expert biologists to easily engage directly with high-throughput datasets without the need for a team of bioinformaticians. The tree building process allows researchers to rapidly test hypotheses about interactions between biological variables and phenotypes in ways that would otherwise require extensive computational sophistication. In so doing, this tool can both inform biological research and help to produce more accurate, more meaningful classifiers.
A prototype of Branch is available at http://biobranch.org/
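As an illustration of evaluating a manually built decision tree in terms of precision and recall, here is a minimal sketch. The tree layout, gene names, thresholds and samples are all hypothetical and do not reflect Branch's actual data model.

```python
def classify(tree, sample):
    """Walk a nested-dict decision tree until a leaf label (a string) is reached."""
    while isinstance(tree, dict):
        branch = "high" if sample[tree["feature"]] > tree["threshold"] else "low"
        tree = tree[branch]
    return tree


def precision_recall(tree, samples, labels, positive):
    """Compute precision and recall of the tree for the given positive class."""
    tp = fp = fn = 0
    for sample, label in zip(samples, labels):
        predicted = classify(tree, sample)
        if predicted == positive and label == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif label == positive:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical hand-built tree over two gene expression features.
tree = {
    "feature": "gene_a", "threshold": 0.5,
    "low": "good",
    "high": {"feature": "gene_b", "threshold": 0.3, "low": "good", "high": "poor"},
}
samples = [
    {"gene_a": 0.9, "gene_b": 0.8},
    {"gene_a": 0.9, "gene_b": 0.1},
    {"gene_a": 0.2, "gene_b": 0.9},
    {"gene_a": 0.7, "gene_b": 0.6},
    {"gene_a": 0.8, "gene_b": 0.5},
]
labels = ["poor", "poor", "good", "good", "poor"]
precision, recall = precision_recall(tree, samples, labels, positive="poor")
print(round(precision, 3), round(recall, 3))
```

Editing a threshold or swapping a feature in `tree` and re-running the evaluation mirrors, in a very reduced form, the hypothesis-testing loop the abstract describes.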
With the development of the internet, cyber-attacks are changing rapidly and the cybersecurity situation is not optimistic. This paper surveys Machine Learning (ML) and Deep Learning (DL) methods for network-based intrusion detection and gives a brief tutorial description of each ML/DL method. Representative papers for each method were indexed, read, and summarized based on their temporal or thermal correlations. Because data is critical to ML/DL methods, the paper describes some of the commonly used public data sets, discusses the challenges of using ML/DL for cybersecurity and offers suggestions for research directions. The KDD data set is a widely recognized benchmark in intrusion detection research. Much work is under way to improve intrusion detection strategies, but research on the data used to train and test the detection model is equally important, because better data quality can improve offline intrusion detection. This paper analyzes the KDD data set with respect to four classes (Basic, Content, Traffic and Host) into which all data attributes can be categorized, using a Modified Random Forest (MRF). The analysis uses two prominent evaluation metrics for an Intrusion Detection System (IDS): Detection Rate (DR) and False Alarm Rate (FAR). This empirical analysis shows the contribution of each of the four attribute classes to DR and FAR, which can help assess the suitability of the data set.
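The two metrics abbreviated DR and FAR are read here as detection rate and false alarm rate in their usual intrusion-detection sense. A minimal sketch of how they are computed from labelled predictions follows; the label encoding (1 for attack, 0 for normal) and the sample data are assumptions for illustration.

```python
def detection_metrics(y_true, y_pred):
    """Return (detection rate, false alarm rate) for binary labels.

    Labels are assumed to be 1 for attack traffic and 0 for normal traffic.
    DR  = TP / (TP + FN): share of real attacks that were flagged.
    FAR = FP / (FP + TN): share of normal records wrongly flagged.
    """
    tp = fp = tn = fn = 0
    for truth, predicted in zip(y_true, y_pred):
        if truth == 1 and predicted == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    dr = tp / (tp + fn) if tp + fn else 0.0
    far = fp / (fp + tn) if fp + tn else 0.0
    return dr, far


# Hypothetical ground truth and classifier output for eight records.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
dr, far = detection_metrics(y_true, y_pred)
print(dr, far)
```

Computing the pair separately for models trained on each attribute class is one way to compare how much each class contributes to DR and FAR.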
MICROARRAY GENE EXPRESSION ANALYSIS USING TYPE 2 FUZZY LOGIC (MGA-FL) - IJCSEA Journal
Data mining is defined as the process of extracting or mining knowledge from large databases. Data mining is an interdisciplinary field that brings together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the issue of information extraction from large databases. Bioinformatics is defined as the science of organizing and analyzing biological data. Microarray technology helps biologists monitor the expression of thousands of genes in a single experiment on a small chip. A microarray, also called a DNA chip, gene chip, or biochip, is used to analyze gene expression profiles. Fuzzy logic is defined as a multivalued logic that allows intermediate values to be defined between conventional evaluations like true or false, yes or no, high or low, etc. In this paper, a type 2 fuzzy logic approach is applied to microarray gene expression data to convert the numerical values into fuzzy terms. After fuzzification, the fuzzy association patterns are discovered. A framework is proposed to cluster microarray gene data based on fuzzy association patterns. The proposed type 2 fuzzy approach is then compared with traditional clustering algorithms.
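The fuzzification step can be sketched as follows. This is a generic interval type-2 example, not the paper's implementation: the triangular shapes, the three linguistic terms, the normalized expression range and the `blur` width are all assumptions. Each value maps to a (lower, upper) membership interval instead of a single degree.

```python
def triangle(x, left, peak, right):
    """Triangular membership function with support (left, right)."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def fuzzify_interval_type2(x, terms, blur=0.1):
    """Return {term: (lower, upper)} membership intervals for value x.

    The upper function widens each triangle by `blur` and the lower one
    narrows it; the gap between them models uncertainty about where the
    term boundaries lie.
    """
    memberships = {}
    for name, (left, peak, right) in terms.items():
        upper = triangle(x, left - blur, peak, right + blur)
        lower = triangle(x, left + blur, peak, right - blur)
        memberships[name] = (lower, upper)
    return memberships


# Hypothetical linguistic terms over a normalized expression level in [0, 1].
TERMS = {
    "low": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.5),
}
memberships = fuzzify_interval_type2(0.7, TERMS)
for term, (lower, upper) in memberships.items():
    print(term, round(lower, 3), round(upper, 3))
```

A simple way to pick a single fuzzy term per gene would be to choose the term with the largest interval midpoint, though the paper may use a different rule.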
Machine Learning Based Approaches for Cancer Classification Using Gene Expres... - mlaij
The classification of different types of tumor is of great importance in cancer diagnosis and drug discovery. Earlier studies on cancer classification have limited diagnostic ability. The recent development of DNA microarray technology has made it possible to monitor the expression of thousands of genes simultaneously. Using this abundance of gene expression data, researchers are exploring the possibilities of cancer classification. A number of methods have been proposed with good results, but many issues still need to be addressed. This paper presents an overview of various cancer classification methods and evaluates them based on their classification accuracy, computational time and ability to reveal gene information. We also introduce and evaluate various proposed gene selection methods. Several issues related to cancer classification are also discussed.
http://imatge-upc.github.io/telecombcn-2016-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
Slides contain information about why bioinformatics appeared,
who bioinformaticians are, what they do, what kind of cool applications and challenges in bioinformatics there are.
Slides were prepared for the Bioinformatics seminar 2016, Institute of Computer Science, University of Tartu.
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field Engineering Science and Technology, new teaching methods, assessment, validation and the impact of new technologies and it will continue to provide information on the latest trends and developments in this ever-expanding subject. The publications of papers are selected through double peer reviewed to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
Adversarial Multi-Scale Features Learning for Person Re-Identification - ijtsrd
Person re-identification (Re-ID) is the task of matching a target person across different cameras. It has drawn extensive attention in computer vision and has become an essential component of video surveillance systems. Re-ID can be considered a problem of image retrieval. Existing person re-identification methods depend mostly on single-scale appearance information. To address this issue, this work proposes a deep model for Re-ID with Multi-scale Feature Representation Learning (MFRL) using Convolutional Neural Networks (CNN) and a Random Batch Feature Mask (RBFM). The RBFM is inspired by the DropBlock and Batch DropBlock (BDB) dropout-based approaches. The Re-ID task nevertheless faces great challenges. First, across scenarios the appearance of the same pedestrian changes dramatically because of frequent body misalignment, background clutter, large variations in camera view, and occlusion. Second, in a public space, different pedestrians may wear the same or similar clothes, so the distinctions between different pedestrian images are subtle. The proposed methods are applied only in the training phase and discarded in the testing phase, which enhances the effectiveness of the model. Our model achieves state-of-the-art results on popular benchmark datasets including Market-1501, DukeMTMC-reID and CUHK03. In addition, we conduct a set of ablation experiments to verify the effectiveness of the proposed methods. Mrs. D. Radhika | D. Harini | N. Kirujha | Dr. M. Duraipandiyan | M. Kavya "Adversarial Multi-Scale Features Learning for Person Re-Identification" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5 | Issue-4, June 2021, URL: https://www.ijtsrd.com/papers/ijtsrd42562.pdf Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/42562/adversarial-multiscale-features-learning-for-person-reidentification/mrs-d-radhika
Prognosis of Cardiac Disease using Data Mining Techniques: A Comprehensive Survey - ijtsrd
In healthcare, clinical diagnosis is generally based on the doctor's knowledge and practice. Computer-aided decision support systems play a major role in the medical field. Data mining provides the methodology and technology to transform large volumes of data into valuable information for decision making. With data mining techniques, diseases can be predicted in less time and with more accuracy. As research on coronary disease prediction systems expands, it has become important to classify the research results and give readers an overview of the current coronary disease prediction strategies. Data mining tools can answer questions that traditionally took too much time to resolve. In this paper we study different papers in which at least one data mining algorithm is used for the prediction of coronary diseases. From the study it is observed that the Naïve Bayes technique increases the accuracy of the coronary disease prediction system. The commonly used techniques for heart disease prediction and their complexities are outlined in this paper. D. Haripriya | Dr. M. Lovelin Ponn Felciah "Prognosis of Cardiac Disease using Data Mining Techniques: A Comprehensive Survey" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-5, August 2019, URL: https://www.ijtsrd.com/papers/ijtsrd26605.pdf Paper URL: https://www.ijtsrd.com/computer-science/data-miining/26605/prognosis-of-cardiac-disease-using-data-mining-techniques-a-comprehensive-survey/d-haripriya
GASCAN: A Novel Database for Gastric Cancer Genes and Primers - ijdmtaiir
GasCan is a specialized database of gastric cancer protein-encoding genes expressed in human and mouse. The features that make GasCan unique are the availability of gene information and of primers for each gene, together with their features and the conditions that are useful in PCR amplification, especially in cloning experiments. In addition, a built-in sequence analysis facility analyses gene sequences within the database itself; the resulting sequence analysis information can be valuable for researchers in different experiments. Furthermore, a DNA sequence analysis tool is provided that can be accessed freely. GasCan will expand in future to other species and genes and will cover more useful information. Flexible database design, expandability and easy access to information for all users are the main features of the database. The database is publicly available at http://www.gastric-cancer.site40.net.
Comprehensive Survey of Data Classification & Prediction Techniques - ijsrd.com
In this paper, we present a literature survey of modern data classification and prediction algorithms. These algorithms are very important in real-world applications such as heart disease prediction and cancer prediction. Classification of data is a very popular and computationally expensive task. The fundamentals of data classification are also discussed in brief.
This slide was made for my university presentation in the "Database Management System" course. Here I try to discuss the basics of machine learning.
This slide was made for my university presentation in the "Statistics and Probability" course. In this slide, you will find a short statistical calculation of the distance between present residence and Daffodil International University (finding the possible distances and taking the shortest value). I think it will help you by giving information about statistical calculations in statistics and probability.
This is our Object Oriented Programming course presentation slide, which was made completely by me. I think it will help others clarify their concepts about this topic.
This slide was made for my university presentation in the "Bangladesh Studies" course. In this slide, you will find information about Bangladesh from the ancient period until now. I think it will help you by giving an overview of Bangladeshi political history.
This slide was made for my university presentation.
This slide covers the basics of trees. I hope you will get the most basic information from it.
This slide presents basic information on solar power.
It was made for my university presentation.
I hope that after watching this slide you will have some basic information about solar power.
This slide briefly discusses many important points in an easy way. I hope that after watching it you will have some analytical information on alternating current (AC). This slide was made for my university presentation.
Li-Fi will be a great advanced technology in the upcoming high-speed wireless world. This is a basic informational slide about Li-Fi. I hope that after reading it you will have basic knowledge of Li-Fi. This slide was made for my university presentation. Information was taken from Google.
This slide covers many basic ideas about graphs and heaps, which are part of data structures. It was our university group presentation slide, which was made completely by me with the help of some information from Google. I hope it will help in understanding graphs and heaps easily.
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... Subhajit Sahu
Abstract: Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by the submission of a large number of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Data Mining (Predict The Future)
1. Outline Big Data Project Rule-based Classifier Class Imbalanced Problem Active Learning Ensemble Clustering Hybrid Classifier
Machine Learning for Data Mining
Dr. Dewan Md. Farid
Department of Computer Science & Engineering,
United International University, Bangladesh
December 01, 2016
Dr. Dewan Md. Farid: Department of Computer Science & Engineering, United International University, Bangladesh
Machine Learning for Data Mining
Outline
Big Data Project
Rule-based Classifier
Class Imbalanced Problem
Active Learning
Ensemble Clustering
Hybrid Classifier
Data Mining: What is Data Mining?
Data mining (DM) is also known as Knowledge Discovery from
Data, or KDD for short, which turns a large collection of data into
knowledge. DM is a multidisciplinary field including machine learning,
artificial intelligence, pattern recognition, knowledge-based systems,
high-performance computing, database technology and data visualisation.
1. Data mining is the process of analysing data from different
perspectives and summarising it into useful information.
2. Data mining is the process of finding hidden information and
patterns in a huge database.
3. Data mining is the extraction of implicit, previously unknown, and
potentially useful information from data.
Machine Learning
Machine learning (ML), which concerns the construction and study of systems that can learn from data, provides the technical basis of data mining.
1. Supervised learning/ Classification - the supervision in the learning comes from the labelled instances.
2. Unsupervised learning/ Clustering - the learning process is unsupervised since the instances are not class-labelled.
3. Semi-supervised learning - uses both labelled and unlabelled instances when learning a model.
4. Active learning - lets users play an active role in the learning process. It asks a user (e.g., a domain expert) to label an instance, which may be from a set of unlabelled instances.
Learning Algorithms
Decision Tree (DT) Induction
Naïve Bayes (NB) Classifier
NBTree Classifier
RainForest and BOAT Classifier
k Nearest Neighbour (kNN) Classifier
Random Forest, Bagging and Boosting (AdaBoost)
Support Vector Machines (SVM)
k Means Clustering
Similarity based Clustering
Mining Big Data
Mining big data is the process of extracting knowledge by uncovering hidden information in massive amounts of complex data or databases. Big data is defined by the three V's:
Volume - the quantity of data.
Variety - the category of data.
Velocity - the speed of data in and out.
One might suggest adding a few more V's to the mix:
Vision - having a purpose/plan.
Verification - ensuring that the data conforms to a set of specifications.
Validation - checking that its purpose is fulfilled.
Big Data Project
1. BRiDGEIris - Brussels Big Data Platform for Sharing and Discovery in Clinical Genomics.
Hosted by IB2 (Interuniversity Institute of Bioinformatics in Brussels).
Funded by INNOVIRIS (Brussels Institute for Research and Innovation).
2. FWO research project G004414N “Machine Learning for Data Mining Applications in Cancer Genomics”.
BRiDGEIris Project
The Brussels big data platform for sharing and discovery in clinical genomics (BRiDGEIris) project aims to address its research challenges through:
1. Design and creation of a multi-site clinical/phenomic and genomic data warehouse.
2. Development of automated tools for extracting relevant information from genetic data.
3. Use of the designed tools to extract new knowledge and transfer it to the medical setting.
VUB AI Lab (CoMo)
The lab focuses on designing and developing strategies for information discovery in genomic and clinical big data by employing an optimal ensemble method. The goal is to evaluate ensemble predictive modelling techniques for:
1. Improving the prediction accuracy of variant identification/ genomic variant classification.
2. Pathology classification tasks.
A further goal is developing new methods/ algorithms to deal with the following issues:
Multi-class classification
High-dimensional data
Class imbalanced data
Big data
Brugada syndrome
Brugada syndrome (BrS), also known as sudden adult death syndrome (SADS), is a genetic disease. It increases the risk of sudden cardiac death (SCD) at a young age. It is named after the Spanish cardiologists Pedro Brugada and Josep Brugada.
BrS is detected by abnormal electrocardiogram (ECG) findings called a type 1 Brugada ECG pattern, which is much more common in men.
BrS is a heart rhythm disorder.
Sudden cardiac death (SCD) is caused when the heart doesn’t pump effectively and not enough blood travels to the rest of the body.
The exome datasets of 148 patients have been analysed for Brugada syndrome at UZ Brussels (Universitair Ziekenhuis Brussel) (www.uzbrussel.be/).
Knowledge Discovery from Genomic Data
Figure: The process of extracting knowledge from genomic data in data mining. (Flowchart boxes: Genomic Data Sets (Exome 1, Exome 2, ..., Exome 148), Data Preprocessing, Formatted Data, Gene Panel, Feature Selection, Mining Algorithm, Knowledge Discovery from Genomic Data.)
Genomic Data of BrS
Table: Classification of DNA variants for Brugada syndrome.
Class | Label
Class I | Nonpathogenic
Class II | VUS1 - Unlikely pathogenic
Class III | VUS2 - Unclear
Class IV | VUS3 - Likely pathogenic
Class V | Pathogenic
Gene Panel of BrS
Table: Gene panel of Brugada syndrome.
Chromosome | Name of Gene
Chr 1 | KCND3
Chr 3 | SCN5A, GPD1L, SLMAP, CAV3, SCN10A
Chr 4 | ANK2
Chr 7 | CACNA2D1, AKAP9, KCNH2
Chr 10 | CACNB2
Chr 11 | KCNE3, SCN3B, SCN2B, KCNJ5, KCNQ1, SCN4B
Chr 12 | CACNA1C, KCNJ8
Chr 15 | HCN4
Chr 17 | RANGRF, KCNJ2
Chr 19 | SCN1B, TRPM4
Chr 20 | SNTA1
Chr 21 | KCNE1, KCNE2
Chr X | KCNE1L
Chromosomes
Figure: Chromosomes in 148 Exome Datasets. (Bar chart; x-axis: Chromosomes 1, 11, 12, 15, 17, 19, 21, 3, 4, 7, X; y-axis: No. of Variants, 0 to 500.)
Genomic Data
Figure: Genomic Data: 148 Exome Datasets. (x-axis: Exome Data Sets; y-axis: No. of Variants, 0 to 280; plot labels: Annotated vcf File, Gene Panel, BrS Variants.)
Rule-based Classifier
A rule-based classifier makes it easy to deal with complex classification problems.
It has various advantages:
As highly expressive as a decision tree (DT)
Easy to interpret
Easy to generate
Can classify new instances rapidly
Performance comparable to DT
New rules can be added to existing rules without disturbing ones already in there
Rules can be executed in any order
Adaptive Rule-based Classifier
It combines the random subspace and boosting approaches with an ensemble of decision trees to construct a set of classification rules for multi-class classification of biological big data.
Random subspace method (or attribute bagging) to avoid overfitting
Boosting approach for classifying noisy instances
Ensemble of decision trees to deal with class-imbalanced data
It uses two popular classification techniques: decision tree (DT) and k-nearest-neighbour (kNN) classifiers.
DTs are used for deriving classification rules from the training data.
kNN is used for analysing the misclassified instances and resolving ambiguity between contradictory rules.
Random Subspace & Boosting Method
Random subspace is an ensemble classifier. It consists of several
classifiers each operating in a subspace of the original feature space, and
outputs the class based on the outputs of these individual classifiers.
It has been used for decision trees (random decision forests).
It is an attractive choice for high dimensional data.
Boosting is designed specifically for classification.
It converts weak classifiers to strong ones.
It is an iterative process.
It uses voting for classification to combine the output of individual
classifiers.
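The random subspace idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in the talk: the base learner is a 1-nearest-neighbour classifier (chosen to keep the sketch short, where the slides use decision trees), and the function names and toy data are hypothetical.

```python
import random
from collections import Counter

def nearest_label(train_rows, train_labels, row, features):
    """1-nearest-neighbour base learner restricted to the given feature indices."""
    def sq_dist(a, b):
        return sum((a[f] - b[f]) ** 2 for f in features)
    best = min(range(len(train_rows)), key=lambda i: sq_dist(train_rows[i], row))
    return train_labels[best]

def draw_subspaces(n_features, n_members, subspace_size, seed=0):
    """Draw one random feature subset (subspace) per ensemble member."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), subspace_size))
            for _ in range(n_members)]

def predict(train_rows, train_labels, subspaces, row):
    """Majority vote over the members, each seeing only its own subspace."""
    votes = Counter(nearest_label(train_rows, train_labels, row, fs)
                    for fs in subspaces)
    return votes.most_common(1)[0][0]

# Hypothetical toy data with four numeric features and two classes.
train_rows = [(0.0, 0.0, 0.0, 0.0), (0.1, 0.2, 0.1, 0.0),
              (1.0, 1.0, 1.0, 1.0), (0.9, 1.1, 1.0, 0.8)]
train_labels = ["a", "a", "b", "b"]
subspaces = draw_subspaces(n_features=4, n_members=5, subspace_size=2)
prediction = predict(train_rows, train_labels, subspaces, (0.95, 1.0, 0.9, 1.0))
print(prediction)
```

Each member only ever measures distance on its own two features, which is what makes the approach attractive for high-dimensional data.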
Ensemble Classifier
Figure: An example of an ensemble classifier.
Decision Tree Induction
Decision tree (DT) induction is a top-down, recursive, divide-and-conquer algorithm for multi-class classification tasks. The goal of DT induction is to recursively partition the data into smaller subsets until every subset belongs to a single class. A DT is easy to interpret and explain, and it requires little prior knowledge.
Information Gain: ID3 (Iterative Dichotomiser) algorithm
Gain Ratio: C4.5 algorithm
Gini Index: CART algorithm
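The three splitting measures listed above can be computed as in the following sketch. It assumes categorical attribute values and a multiway split on each value; CART itself uses binary splits, so `gini_split` here is a simplification. All function names and the toy data are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def partition(values, labels):
    """Group the labels by the attribute value of each instance."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())

def information_gain(values, labels):
    """ID3 criterion: entropy before the split minus weighted entropy after it."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in partition(values, labels))
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """C4.5 criterion: information gain normalised by the split information."""
    split_info = entropy(values)  # entropy of the attribute's own value distribution
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0

def gini_split(values, labels):
    """Weighted Gini impurity after a multiway split (lower means purer subsets)."""
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in partition(values, labels))

# Hypothetical toy attribute that separates the two classes perfectly.
outlook = ["sunny", "sunny", "rain", "rain"]
play = ["no", "no", "yes", "yes"]
print(information_gain(outlook, play), gain_ratio(outlook, play), gini_split(outlook, play))
```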
Algorithm 1 Decision Tree Induction
Input: D = {x_1, ..., x_i, ..., x_N}
Output: A decision tree, DT.
Method:
DT = ∅;
find the root node with the best splitting attribute, A_j ∈ D;
DT = create the root node;
DT = add arc to root node for each split predicate and label;
for each arc do
    D_j = created by applying the splitting predicate to D;
    if stopping point reached for this path, then
        DT' = create a leaf node and label it with c_l;
    else
        DT' = DTBuild(D_j);
    end if
    DT = add DT' to arc;
end for
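A runnable counterpart to Algorithm 1 might look like the sketch below. It assumes categorical attributes, uses information gain as the splitting criterion, and stops when a subset is pure or no attributes remain; the nested-dictionary tree layout, the function names and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Pick the attribute with the highest information gain."""
    def gain(a):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[a], []).append(y)
        return entropy(labels) - sum(len(g) / len(labels) * entropy(g)
                                     for g in groups.values())
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    # Stopping point: pure subset or no attributes left -> leaf with majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)
    node = {"attribute": a, "branches": {}}
    remaining = [x for x in attributes if x != a]
    for value in sorted({row[a] for row in rows}):   # one arc per attribute value
        subset = [(r, y) for r, y in zip(rows, labels) if r[a] == value]
        node["branches"][value] = build_tree([r for r, _ in subset],
                                             [y for _, y in subset], remaining)
    return node

def classify(tree, row, default=None):
    """Follow arcs from the root until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attribute"]], default)
    return tree

# Hypothetical toy data.
rows = [{"outlook": "sunny", "windy": "no"}, {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rain", "windy": "no"}, {"outlook": "rain", "windy": "yes"},
        {"outlook": "rain", "windy": "yes"}, {"outlook": "sunny", "windy": "yes"}]
labels = ["play", "play", "play", "stay", "stay", "play"]
tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
```

The recursion in `build_tree` plays the role of the DTBuild call in the pseudocode: each arc receives either a leaf or a recursively built subtree.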
K-Nearest-Neighbour (kNN) Classifier
The k-nearest-neighbour (kNN) classifier is a simple classifier. It uses distance measurement techniques that are widely used in pattern recognition. kNN finds the k instances, X = {x_1, x_2, ..., x_k} ⊆ D_training, that are closest to the test instance, x_test, and assigns to x_test the most frequent class label, c_l, among X. When a classification is to be made for a new instance, x_new, its distance to each instance x_i ∈ D_training must be determined. Only the k closest instances, X ⊆ D_training, are considered further. Closeness is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two points, x_1 = (x_{11}, x_{12}, ..., x_{1n}) and x_2 = (x_{21}, x_{22}, ..., x_{2n}), is shown in Eq. 1:
\[ \mathrm{dist}(x_1, x_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2} \tag{1} \]
Algorithm 2 k-Nearest-Neighbour classifier
Input: D = {x_1, ..., x_i, ..., x_n}
Output: kNN classifier, kNN.
Method:
1: find X ⊆ D, the k nearest neighbours of x_test, regardless of class label, c_l.
2: out of these instances, X = {x_1, x_2, ..., x_k}, identify the number of instances, k_l, that belong to class c_l, l = 1, 2, ..., M. Obviously, Σ_l k_l = k.
3: assign x_test to the class c_l with the maximum number k_l of instances.
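Eq. 1 and the three steps of Algorithm 2 translate almost directly into Python. The following is a minimal sketch that assumes numeric feature vectors and a small in-memory training set; the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def euclidean(x1, x2):
    """Euclidean distance between two equal-length numeric vectors (Eq. 1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(train, x_test, k):
    """train is a list of (feature_vector, class_label) pairs."""
    # Step 1: find the k nearest neighbours, regardless of class label.
    neighbours = sorted(train, key=lambda pair: euclidean(pair[0], x_test))[:k]
    # Step 2: count how many of the k neighbours belong to each class.
    counts = Counter(label for _, label in neighbours)
    # Step 3: assign the class with the largest count.
    return counts.most_common(1)[0][0]

# Hypothetical toy training set with two classes.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B")]
print(knn_classify(train, (1.1, 1.0), k=3))
print(knn_classify(train, (5.1, 5.0), k=3))
```

Sorting the whole training set is the simplest way to find the neighbours; for large data a spatial index or a partial selection would be the usual refinement.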
Constructing Classification Rules
Extracting classification rules from DTs is an easy and well-known process.
Rules are as highly expressive as a DT, so the performance of a rule-based classifier is comparable to that of a DT.
One rule is generated for each leaf of the DT.
Each path in the DT from the root node to a leaf node corresponds to a rule.
The tree corresponds exactly to the set of classification rules.
DT vs. Rules
New rules can be added to an existing rule set without disturbing ones already there, whereas adding to a tree structure may require reshaping the whole tree. Rules can be executed in any order.
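The path-to-rule correspondence can be illustrated with a short sketch. The tree is assumed to be stored as nested dictionaries with "attribute" and "branches" keys and plain labels as leaves; that layout, the function names and the example tree are assumptions for illustration only.

```python
def extract_rules(tree, conditions=()):
    """Return one (conditions, class_label) rule per leaf of the tree."""
    if not isinstance(tree, dict):      # leaf: the path walked so far is the rule body
        return [(conditions, tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        test = (tree["attribute"], value)
        rules.extend(extract_rules(subtree, conditions + (test,)))
    return rules

def apply_rules(rules, instance):
    """Return the label of the first rule whose conditions all hold, else None."""
    for conditions, label in rules:
        if all(instance.get(a) == v for a, v in conditions):
            return label
    return None

# Hypothetical tree: split on outlook first, then on windy for rainy days.
tree = {"attribute": "outlook",
        "branches": {"sunny": "play",
                     "rain": {"attribute": "windy",
                              "branches": {"no": "play", "yes": "stay"}}}}
rules = extract_rules(tree)
for conditions, label in rules:
    body = " AND ".join(f"{a} = {v}" for a, v in conditions)
    print(f"IF {body} THEN {label}")
```

Because rules extracted from a single tree are mutually exclusive, reversing the rule list gives the same prediction for any instance, which is one way to read "rules can be executed in any order".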
Algorithm: Adaptive rule-based (ARB) classifier
It considers a series of k iterations.
Initially, an equal weight, 1/N, is assigned to each training instance.
The weights of training instances are adjusted according to how they are classified in each iteration.
In each iteration, a sub-dataset D_j is created from the original training dataset D and the previous sub-dataset D_{j−1} using the maximum-weighted instances. In the first iteration, the sub-dataset D_1 is created from the original training data D using only sampling with replacement.
A tree DT_j is built from the sub-dataset D_j with randomly selected features in each iteration.
One rule is generated for each leaf node of DT_j.
Each path in DT_j from the root to a leaf corresponds to a rule.
Algorithm 3 Adaptive rule-based classifier.
Input:
D = {x1, · · · , xi , · · · , xN }, training dataset;
k, number of iterations;
DT learning scheme;
Output: rule-set; // A set of classification rules.
Method:
1: rule-set = ∅;
2: for i = 1 to N do
3: xi = 1
N ; // initialising weights of each xi ∈ D.
4: end for
5: for j = 1 to k do
6: if j==1 then
7: create Dj , by sampling D with replacement;
8: else
9: create Dj , by Dj−1 and D with maximum weighted X;
10: end if
11: build a tree, DTj ← Dj by randomly selected features;
12: compute error(DTj ); // the error rate of DTj .
13: if error(DTj ) ≥ threshold-value then
14: go back to step 6 and try again;
15: else
16: rules ← DTj ; // extracting the rules from DTj .
17: end if
18: for each xi ∈ Dj that was correctly classified do
19: multiply the weight of xi by error(DTj ) / (1 − error(DTj )); // update weights.
20: end for
21: normalise the weight of each xi ∈ Dj ;
22: rule-set = rule-set ∪ rules;
23: end for
24: return rule-set;
25: create sub-dataset, Dmisclassified with misclassified instances from Dj ;
26: analyse Dmisclassified employing algorithm 4.
Error Rate Calculation
The error rate of DTj is calculated as the sum of the weights of the
misclassified instances, as shown in Eq. 2, where err(xi ) is the
misclassification error of an instance xi . If an instance xi is
misclassified, then err(xi ) is one. Otherwise, err(xi ) is zero
(correctly classified).
$$\mathrm{error}(DT_j) = \sum_{i=1}^{n} w_i \times \mathrm{err}(x_i) \tag{2}$$
If the error rate of DTj is less than the threshold-value, then rules are
extracted from DTj .
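The weighted error of Eq. 2 and the weight update of Algorithm 3 (steps 18 to 21) can be sketched in a few lines of Python. This is only an illustration: the function names, the guard for degenerate error values and the toy data are assumptions, not part of the slides.

```python
def weighted_error(weights, misclassified):
    """Eq. 2: sum of the weights of the misclassified instances."""
    return sum(w for w, bad in zip(weights, misclassified) if bad)


def update_weights(weights, misclassified):
    """Scale down correctly classified instances, then normalise to sum to 1."""
    err = weighted_error(weights, misclassified)
    if not 0.0 < err < 1.0:
        # Assumption: the factor err / (1 - err) is degenerate here,
        # so the weights are returned unchanged.
        return list(weights)
    factor = err / (1.0 - err)
    scaled = [w if bad else w * factor for w, bad in zip(weights, misclassified)]
    total = sum(scaled)
    return [w / total for w in scaled]


# Toy example: N = 4 instances with initial weight 1/N; the last one is misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
misclassified = [False, False, False, True]
error = weighted_error(weights, misclassified)
new_weights = update_weights(weights, misclassified)
```

After the update, the misclassified instance carries more weight than the others, so it is more likely to be selected for the next sub-dataset.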
Mining Big Data with Rules
Big data is so big (millions of instances) that we cannot process all
the instances together at the same time.
It is not possible to store all the data in the main memory at a time.
We can create several smaller samples (subsets) of the big data, each
of which fits in main memory.
Each subset of data is used to construct a set of rules, resulting in
several sets of rules.
The rule sets are then examined and merged to construct the final set
of classification rules for the big data.
This works because new rules can be added to existing rules and
rules can be executed in any order.
Mining Big Data with Rules (cont.)
Figure: Mining big data using adaptive rule-based classifier. Components shown: big data; sub-data 1, 2, ..., N; one adaptive rule-based classifier per sub-data; integrating rules; final classification rules.
Reduced-Error Pruning
Split the original data into two parts: (a) a growing set, and (b) a
pruning set.
Rules are generated using the growing set only. So, important rules
might be missed because some key instances had been assigned to the
pruning set.
A rule generated from the growing set is truncated (a condition is
deleted), and the effect is evaluated by trying out the truncated rule
on the pruning set and seeing whether it performs better than the
original rule.
If the new truncated rule performs better, then this new rule is added
to the rule set.
This process continues for each rule and for each class.
The overall best rules are established by evaluating the rules on the
pruning set.
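As a rough sketch of the pruning step, the code below deletes one condition at a time from a rule and keeps the truncated rule whenever it scores better on the pruning set. The rule representation (a list of `(feature_index, value)` equality tests), the accuracy measure (fraction of covered instances with the rule's class) and the toy data are all assumptions made for illustration.

```python
def covers(conditions, x):
    """True if instance x satisfies every (feature_index, value) test."""
    return all(x[f] == v for f, v in conditions)


def rule_accuracy(conditions, label, data):
    """Fraction of covered instances whose class equals the rule's class."""
    covered = [y for x, y in data if covers(conditions, x)]
    if not covered:
        return 0.0
    return sum(1 for y in covered if y == label) / len(covered)


def prune_rule(conditions, label, pruning_set):
    """Greedily drop conditions while accuracy on the pruning set improves."""
    best = list(conditions)
    best_acc = rule_accuracy(best, label, pruning_set)
    improved = True
    while improved and len(best) > 1:
        improved = False
        for i in range(len(best)):
            truncated = best[:i] + best[i + 1:]
            acc = rule_accuracy(truncated, label, pruning_set)
            if acc > best_acc:
                best, best_acc, improved = truncated, acc, True
                break
    return best, best_acc


# Toy rule: IF outlook == "sunny" AND windy == "no" THEN "play".
rule = [(0, "sunny"), (1, "no")]
pruning_set = [
    (("sunny", "no"), "stay"),
    (("sunny", "yes"), "play"),
    (("sunny", "yes"), "play"),
    (("rain", "no"), "stay"),
    (("sunny", "no"), "play"),
]
pruned, pruned_acc = prune_rule(rule, "play", pruning_set)
```

On this toy pruning set the original rule is right on 1 of 2 covered instances, while the rule truncated to `outlook == "sunny"` is right on 3 of 4, so the truncated rule is kept.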
Algorithm: Analysing Misclassified Instances
To check the classes of misclassified instances we used the kNN classifier
with feature selection and weighting approach.
We applied DT induction for feature selection and weighting
approach.
We build a tree from the misclassified instances.
Each feature that is tested in the tree, Aj ∈ Dmisclassified , is assigned
a weight 1/d, where d is the depth of the tree.
We do not consider the features that are not tested in the tree for
similarity measure of kNN classifier.
We apply kNN classifier to classify each misclassified instance based
on the weighted features.
We update the class label of misclassified instances.
We check for contradictory rules, if there are any.
Algorithm 4 Analysing misclassified instances
Input: D, original training data;
Dmisclassified , dataset with misclassified instances;
Output: A set of instances, X with right class labels.
Method:
1: build a tree, DT using Dmisclassified ;
2: for each Aj ∈ Dmisclassified do
3: if Aj is tested in DT then
4: assign weight 1/d to Aj , where d is the depth of DT;
5: else
6: do not consider Aj for the similarity measure;
7: end if
8: end for
9: for each xi ∈ Dmisclassified do
10: find X ∈ D, with the similarity of weighted A =
{A1, · · · , Aj , · · · , An};
11: find the most frequent class, cl , in X;
12: assign xi ← cl ;
13: end for
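The relabelling loop of Algorithm 4 (steps 9 to 13) might look as follows in Python. The slides do not fix the distance function or the number of neighbours, so the weighted Euclidean distance, `k = 3`, the function names and the toy data are assumptions; features that are not tested in the tree are simply absent from `feature_weights`.

```python
import math
from collections import Counter


def weighted_distance(a, b, feature_weights):
    """Weighted Euclidean distance over the features tested in the tree."""
    return math.sqrt(sum(w * (a[j] - b[j]) ** 2 for j, w in feature_weights.items()))


def relabel(misclassified, training, feature_weights, k=3):
    """Assign each misclassified instance the majority class of its k nearest neighbours."""
    new_labels = []
    for x in misclassified:
        neighbours = sorted(
            training,
            key=lambda item: weighted_distance(x, item[0], feature_weights),
        )[:k]
        votes = Counter(label for _, label in neighbours)
        new_labels.append(votes.most_common(1)[0][0])
    return new_labels


# Toy setup: a tree of depth d = 2 tests features 0 and 2, so each gets weight 1/2.
# Feature 1 is not tested in the tree and is ignored by the distance.
feature_weights = {0: 0.5, 2: 0.5}
training = [
    ((1.0, 9.0, 1.0), "a"),
    ((1.2, 0.0, 0.9), "a"),
    ((5.0, 1.0, 5.0), "b"),
    ((5.1, 1.0, 4.8), "b"),
    ((0.9, 5.0, 1.1), "a"),
]
misclassified = [(1.1, 1.0, 1.0), (4.9, 9.0, 5.2)]
labels = relabel(misclassified, training, feature_weights, k=3)
```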
Performance Measurement
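The classification accuracy:
$$\mathrm{accuracy} = \frac{\sum_{i=1}^{|X|} \mathrm{assess}(x_i)}{|X|}, \quad x_i \in X \tag{3}$$
If xi is correctly classified then assess(xi ) = 1; if xi is misclassified then
assess(xi ) = 0.
$$\mathrm{precision} = \frac{TP}{TP + FP} \tag{4}$$
$$\mathrm{recall} = \frac{TP}{TP + FN} \tag{5}$$
$$F\text{-score} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{6}$$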
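Eqs. 3 to 6 translate directly into code. The sketch below computes them for a single positive class; the function names and the toy labels are illustrative assumptions, and the weighted averaging used in the result tables is not shown.

```python
def accuracy(y_true, y_pred):
    """Eq. 3: fraction of correctly classified instances."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def precision_recall_f(y_true, y_pred, positive):
    """Eqs. 4 to 6 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Toy labels: 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
acc = accuracy(y_true, y_pred)
precision, recall, f_score = precision_recall_f(y_true, y_pred, positive=1)
```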
Experiments on Exome datasets
The performance of the proposed ARB classifier is compared against the
RainForest, NB and kNN classifiers on 148 Exome datasets. The ARB
classifier correctly classifies 91% of gene variants for BrS using training
data. We have considered five iterations for the proposed ARB classifier
on each Exome dataset.
Table: The accuracy, precision, recall and F-score of RainForest, NB, kNN and
the proposed ARB classifier using training data. Precision, recall and F-score are weighted averages.
Algorithm        Accuracy (%)   Precision   Recall   F-score
RainForest       83.33          0.76        0.83     0.79
NB               83.33          0.79        0.83     0.78
kNN              75             0.56        0.75     0.64
ARB classifier   91.66          0.95        0.91     0.92
Experiments on Exome datasets (cont.)
The performance of the proposed ARB classifier is compared against the
RainForest, NB and kNN classifiers using 10-fold cross-validation on 148
Exome datasets.
Table: The accuracy, precision, recall and F-score of RainForest, NB, kNN and
the proposed ARB classifier using 10-fold cross-validation. Precision, recall and F-score are weighted averages.
Algorithm        Accuracy (%)   Precision   Recall   F-score
RainForest       58.33          0.46        0.58     0.51
NB               58.33          0.63        0.58     0.6
kNN              50             0.33        0.5      0.4
ARB classifier   75             0.73        0.75     0.68
Experiments on Exome datasets (cont.)
The performance of the proposed ARB classifier is compared against the
RainForest, NB and kNN classifiers using unseen test variants of 45 Exome
datasets; 103 Exome datasets were used for training the models.
Table: The accuracy, precision, recall and F-score of RainForest, NB, kNN and
the proposed ARB classifier using testing data. Precision, recall and F-score are weighted averages.
Algorithm        Accuracy (%)   Precision   Recall   F-score
RainForest       50             0.33        0.5      0.4
NB               50             0.25        0.5      0.62
kNN              50             0.25        0.5      0.33
ARB classifier   66.66          0.44        0.66     0.53
Benchmark Life Sciences Datasets
Table: 10 real benchmark life sciences datasets from UCI (University of
California, Irvine) machine learning repository.
No. Datasets Instances No of Att. Att. Types Classes
1 Appendicitis 106 7 Numeric 2
2 Breast cancer 286 9 Nominal 2
3 Contraceptive 1473 9 Numeric 3
4 Ecoli 336 7 Numeric 8
5 Heart 270 13 Numeric 2
6 Pima diabetes 768 8 Numeric 2
7 Iris 150 4 Numeric 3
8 Soybean 683 35 Nominal 19
9 Thyroid 215 5 Numeric 2
10 Yeast 1484 8 Numeric 10
Classification Accuracy
Table: The classification accuracy (%) of C4.5, kNN, naïve Bayes (NB) and
the proposed adaptive rule-based classifier with 10-fold cross validation.
Datasets C4.5 kNN NB Proposed classifier
Appendicitis 85.84 86.79 85.84 87.73
Breast cancer 75.52 73.42 71.67 75.52
Contraceptive 50.98 49.76 48.13 50.1
Ecoli 79.76 83.03 78.86 83.92
Heart 77.40 78.88 83.7 83.7
Pima diabetes 73.82 73.17 76.3 75.65
Iris 96 95.33 96 95.33
Soybean 91.50 90.19 92.97 91.94
Thyroid 98.13 97.2 98.13 98.13
Yeast 56.73 56.94 57.88 61.99
Classification Accuracy (cont.)
Figure: Classification accuracy (y-axis, 45 to 100) of C4.5, kNN, NB and the adaptive rule-based classifier on the UCI benchmark life sciences data sets (x-axis: Appendicitis, Breast cancer, Contraceptive, Ecoli, Heart, Pima diabetes, Iris, Soybean, Thyroid, Yeast).
Accuracy with 20% noisy instances
Figure: Classification accuracy (y-axis, 40 to 85) of C4.5, kNN, NB and the adaptive rule-based classifier on the UCI benchmark life sciences data sets (x-axis: Appendicitis, Breast cancer, Contraceptive, Ecoli, Heart, Pima diabetes, Iris, Soybean, Thyroid, Yeast) with 20% noisy instances.
Data Balancing Methods
Classification of multi-class imbalanced data is a difficult task, as real
data sets are noisy and high dimensional with small sample sizes, which
results in overfitting and overlapping of classes.
Traditional machine learning algorithms are much more successful at
classifying majority class instances than minority class instances.
The conventional data balancing methods alter the original data
distribution, so they might suffer from overfitting or drop some
potentially useful information.
We proposed a new method for dealing with multi-class imbalanced data
based on clustering and selecting the most informative instances from the
majority classes.
Classifying Imbalanced Data
Machine learning algorithms successfully classify majority class instances,
but misclassify the minority class instances in many high-dimensional
data sets.
The following methods are used for class imbalance problems:
1. Sampling methods
   Under-sampling
   Over-sampling
2. Cost-sensitive learning methods (it is difficult to get the accurate
   misclassification cost)
3. Ensemble methods
   Bagging
   Boosting
Proposed Data Balancing Method
Initially, we cluster the majority class instances into several clusters.
Find the most informative instances in each cluster. The informative
instances are those close to the center and to the border of the cluster.
Then several data sets are created by combining the most informative
instances of these clusters with the instances of the minority classes.
Every data set should have an almost equal number of minority-class
and majority-class instances.
Finally, multiple classifiers are trained using these data sets. The
voting technique is used to classify the existing/new instances.
Proposed Data Balancing Method (cont.)
Figure: Proposed data balancing method. Components shown: imbalanced data; majority classes instances; minority classes instances; clusters 1, 2, ..., N; find informative instances (per cluster); balanced data 1, 2, ..., N; classifiers 1, 2, ..., N; combine votes; new data instances; prediction.
Performance of Data Balancing Methods
The performance of data balancing methods is measured using the area
under the ROC (Receiver Operating Characteristic) curve (AUC) on 2143
variants of Brugada syndrome (BrS) from 148 Exome data sets.
Table: Average AUC values of 148 imbalanced Exome data sets for different
imbalanced data handling methods.
Algorithm                Average AUC value
Random Under-Sampling    0.8923
Random Over-Sampling     0.8673
Bagging                  0.8915
Boosting                 0.9136
Proposed Method          0.9317
Active Learning
It can achieve high accuracy with far fewer labeled instances than the
number required in typical supervised learning.
It interactively queries a user/expert for the class labels of unlabeled
instances.
The objective is to train a classifier using as few labeled instances as
possible by selecting the most informative instances.
Let the data, D, contain both a set of labeled data, DL, and a set of
unlabeled data, DU . Initially, a model, M∗, is trained using DL. Then a
querying function is used to select unlabeled instances, XU ∈ DU , and
a user is asked to label them, XU → XL. XL is then added to DL and M∗
is trained again. The process repeats until the user is satisfied.
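The query-and-retrain loop can be sketched generically. Everything in this example is a hypothetical stand-in for illustration: the model is a one-dimensional threshold, the querying function picks the unlabeled value closest to the threshold, the oracle is a fixed rule, and a fixed number of rounds replaces "until the user is satisfied".

```python
def active_learning(labeled, unlabeled, train, query, oracle, rounds):
    """Train on DL, query the oracle for XU, add XL to DL, and retrain."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    model = train(labeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        picked = query(model, unlabeled)
        for x in picked:
            unlabeled.remove(x)
            labeled.append((x, oracle(x)))  # XU -> XL
        model = train(labeled)
    return model, labeled, unlabeled


def train_threshold(labeled):
    """Toy model: midpoint between the largest class-0 and smallest class-1 value."""
    highest_zero = max(x for x, y in labeled if y == 0)
    lowest_one = min(x for x, y in labeled if y == 1)
    return (highest_zero + lowest_one) / 2


def query_closest(threshold, unlabeled):
    """Toy querying function: the single instance nearest the decision threshold."""
    return [min(unlabeled, key=lambda x: abs(x - threshold))]


def oracle(x):
    """Stand-in for the user/expert: the true concept is x >= 5."""
    return 1 if x >= 5 else 0


model, labeled, unlabeled = active_learning(
    labeled=[(0.0, 0), (10.0, 1)],
    unlabeled=[1.0, 4.0, 6.0, 9.0],
    train=train_threshold,
    query=query_closest,
    oracle=oracle,
    rounds=2,
)
```

In this toy run only the two instances nearest the boundary (4.0 and 6.0) are sent to the oracle; the instances at 1.0 and 9.0 never need a label.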
Active Learning (cont.)
Figure: Active learning process. Components shown: data, D; labeled data, DL; unlabeled data, DU; unlabeled instances, XU; user/oracle; labeled instances, XL; DL + XL; ensemble model, M*.
Proposed Method
The naïve Bayes (NB) classifier and clustering are used to find the most
informative instances for labeling as part of active learning. The unlabeled
instances are selected for labeling using the following two strategies:
Instances close to the centers of clusters and the borders of clusters.
Instances whose posterior probabilities are equal or very close.
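The second strategy can be illustrated with a small margin test on the class posteriors. The function name, the `margin` parameter and the example probabilities are assumptions for illustration; the slides do not state how close "very close" is.

```python
def select_uncertain(posteriors, margin=0.1):
    """Return indices of instances whose two largest posteriors differ by at most margin."""
    selected = []
    for idx, probs in enumerate(posteriors):
        top = sorted(probs, reverse=True)
        if len(top) >= 2 and top[0] - top[1] <= margin:
            selected.append(idx)
    return selected


# Each row holds hypothetical class posteriors P(C | x) for one unlabeled instance.
posteriors = [
    [0.90, 0.10],  # confident
    [0.52, 0.48],  # very close
    [0.50, 0.50],  # equal
    [0.30, 0.70],  # confident
]
queries = select_uncertain(posteriors, margin=0.1)
```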
Performance of Ensemble Methods
Adaptive boosting (the AdaBoost algorithm) is used with the NB classifier
as the base classifier.
Table: The accuracy and F-score (weighted average) of ensemble methods on
2143 DNA variants of Brugada syndrome.
Algorithm                     Accuracy (%)   F-score
Random Forest                 92.3           0.93
Bagging                       87.5           0.83
Boosting                      91.66          0.9
AdaBoost with NB classifier   94.73          0.93
Clustering of high-dimensional big data
An ensemble clustering method with feature selection and grouping
approach.
K-means clustering.
Similarity-based clustering.
Biclustering (applied to each cluster generated by ensemble clustering
to find the sub-matrices).
Unlabelled genomic data of Brugada syndrome (148 Exome
datasets).
The proposed method selects the most relevant features in the dataset
and groups them into subsets of features to overcome the problems
associated with the traditional clustering methods.
Clustering
It is the process of grouping a set of instances into clusters (subsets or
groups) so that instances within a cluster have high similarity in
comparison to one another, but are very dissimilar to instances in other
clusters.
Let X be the unlabelled data set, that is,
$$X = \{x_1, x_2, \cdots, x_N\} \tag{7}$$
X is partitioned into k clusters, C1, · · · , Ck , so that the following
conditions are met:
$$C_i \neq \emptyset, \quad i = 1, \cdots, k \tag{8}$$
$$\bigcup_{i=1}^{k} C_i = X \tag{9}$$
$$C_i \cap C_j = \emptyset, \quad i \neq j, \quad i, j = 1, \cdots, k \tag{10}$$
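The three partition conditions (Eqs. 8 to 10) can be checked mechanically. The helper below is a small illustration with an assumed name, and it assumes the clusters are given as Python sets of hashable instances.

```python
def is_partition(clusters, data):
    """Check Eqs. 8 to 10 for a list of sets `clusters` over the instances in `data`."""
    # Eq. 8: no cluster is empty.
    if any(len(c) == 0 for c in clusters):
        return False
    # Eq. 9: the union of all clusters equals X.
    union = set().union(*clusters)
    if union != set(data):
        return False
    # Eq. 10: clusters are pairwise disjoint, which holds exactly when
    # the cluster sizes add up to the size of the union.
    return sum(len(c) for c in clusters) == len(union)


valid = is_partition([{1, 2}, {3}], [1, 2, 3])
overlapping = is_partition([{1, 2}, {2, 3}], [1, 2, 3])
```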
Challenges
Extracting patterns from genomic big data.
Genomic data is often too big and too messy.
Genomic data is also high-dimensional, so traditional distance
measures may be dominated by the noise in many dimensions.
In genomic data, we need to find not only the clusters of instances
(genes), but for each cluster a set of features (conditions).
k-Means
It defines the center of a cluster Ci as the mean value of its instances
{xi1, xi2, · · · , xiN } ∈ Ci .
It randomly selects k instances, {xk1, xk2, · · · , xkN } ∈ X, each of
which initially represents a cluster center.
Each remaining instance xi ∈ X is assigned to the cluster to which it is
most similar.
Similarity is measured by the Euclidean distance between xi and the
center of Ci .
It iteratively improves the within-cluster variation.
A high degree of similarity among instances within clusters and a high
degree of dissimilarity among instances in different clusters are
achieved simultaneously. The cluster mean of Ci = {xi1, xi2, · · · , xiN } is
defined in Eq. 11.
$$\mathrm{Mean} = \overline{C_i} = \frac{\sum_{j=1}^{N} x_{ij}}{N} \tag{11}$$
Algorithm 5 k-Means Clustering
Input: X = {x1, x2, · · · , xN } // A set of unlabelled instances.
k // the number of clusters
Output: A set of k clusters.
Method:
1: arbitrarily choose k instances, {xk1, xk2, · · · , xkN } ∈ X, as
the initial k cluster centers;
2: repeat
3: (re)assign each xi ∈ X to the cluster to which xi is the most similar,
based on the mean value of the instances in that cluster;
4: update the k means, that is, calculate the mean value of the instances
for each cluster;
5: until no change
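Algorithm 5 can be written compactly in plain Python. This is a minimal sketch rather than a reference implementation: the optional `initial_centers` argument, the `max_iter` safeguard and the toy points are assumptions added so that the example is deterministic.

```python
import math
import random


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples (Eq. 11)."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def k_means(data, k, initial_centers=None, seed=0, max_iter=100):
    # Step 1: arbitrarily choose k instances as the initial cluster centers.
    centers = list(initial_centers) if initial_centers else random.Random(seed).sample(data, k)
    assignment = None
    for _ in range(max_iter):
        # Step 3: (re)assign each instance to the nearest center (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(x, centers[c])) for x in data
        ]
        # Step 5: stop when no assignment changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: update each center to the mean of its cluster.
        for c in range(k):
            members = [x for x, a in zip(data, assignment) if a == c]
            if members:
                centers[c] = mean(members)
    return assignment, centers


data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
assignment, centers = k_means(data, k=2, initial_centers=[data[0], data[3]])
```

With these starting centers the loop converges after one update: the first three points form one cluster and the last three the other.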
Similarity-Based Clustering (SCM)
It is robust to the initialisation of the number of clusters.
It detects clusters of different volumes.
Let sim(xi , xl ) be the similarity measure between instance xi
and the lth cluster center xl . The goal is to find xl to maximise the total
similarity measure shown in Eq. 12.
$$J_s(C) = \sum_{l=1}^{k} \sum_{i=1}^{N} f(\mathrm{sim}(x_i, x_l)) \tag{12}$$
Where, f (sim(xi , xl )) is a reasonable similarity measure and
C = {C1, · · · , Ck }. In general, SCM uses feature values to check the
similarity between instances. However, any suitable distance measure can
be used to check the similarity between the instances.
Algorithm 6 Similarity-based Clustering
Input: X = {x1, x2, · · · , xN } // A set of unlabelled instances.
Output: A set of clusters, C = {C1, C2, · · · , Ck }.
Method:
1: C = ∅;
2: k = 1;
3: Ck = {x1};
4: C = C ∪ Ck ;
5: for i = 2 to N do
6: for l = 1 to k do
7: find the lth cluster center xl ∈ Cl to maximize the similarity
measure, sim(xi , xl );
8: end for
9: if sim(xi , xl ) ≥ threshold value then
10: Cl = Cl ∪ xi
11: else
12: k = k + 1;
13: Ck = {xi };
14: C = C ∪ Ck ;
15: end if
16: end for
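A direct transcription of Algorithm 6 into Python is sketched below. The slides leave the similarity measure and the cluster center open, so the choices here are assumptions: similarity is 1 / (1 + Euclidean distance) and the center of a cluster is the mean of its members. The threshold and the points are toy values.

```python
import math


def similarity(a, b):
    """Assumed similarity measure: 1 for identical points, approaching 0 with distance."""
    return 1.0 / (1.0 + math.dist(a, b))


def center(cluster):
    """Assumed cluster center: component-wise mean of the members."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))


def similarity_clustering(data, threshold):
    clusters = [[data[0]]]  # steps 1 to 4: the first instance starts the first cluster
    for x in data[1:]:      # step 5
        # steps 6 to 8: find the cluster whose center is most similar to x
        best = max(clusters, key=lambda c: similarity(x, center(c)))
        if similarity(x, center(best)) >= threshold:
            best.append(x)          # step 10: join the most similar cluster
        else:
            clusters.append([x])    # steps 12 to 14: open a new cluster
    return clusters


data = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.0, 10.5), (0.0, 0.5)]
clusters = similarity_clustering(data, threshold=0.4)
```

Unlike k-means, the number of clusters is not given in advance; it follows from the threshold and the order in which instances are visited.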
Ensemble Clustering
Ensemble clustering is a process of integrating multiple clustering
algorithms to form a single strong clustering approach that usually
provides better clustering results. It generates a set of clusters from a
given unlabelled data set and then combines the clusters into final
clusters to improve the quality of individual clustering.
No single cluster analysis method is optimal.
Different clustering methods may produce different clusters, because
they impose different structure on the data set.
Ensemble clustering performs more effectively in high dimensional
complex data.
It’s a good alternative when facing cluster analysis problems.
Ensemble clustering (cont.)
Generally three strategies are applied in ensemble clustering:
1. Using different clustering algorithms on the same data set to create
heterogeneous clusters.
2. Using different samples/ subsets of the data with different clustering
algorithms to cluster them to produce component clusters.
3. Running the same clustering algorithm many times on same data set
with different parameters or initialisations to create homogeneous
clusters.
The main goal of the ensemble clustering is to integrate component
clustering into one final clustering with a higher accuracy.
Ensemble clustering on genomic/ biological data
Figure: Pattern extraction from genomic data applying ensemble clustering. Components shown: big biological data; data preprocessing; feature selection; feature grouping; ensemble clustering; biclustering; hidden patterns in data.
Data Pre-processing
It transforms raw data into an understandable format, which includes
several techniques:
Data cleaning is the process of dealing with missing values.
Data integration merges data from multiple sources into a
coherent data store, such as a data warehouse, or integrates metadata.
Data transformation includes the following: (a) normalisation, (b)
aggregation, (c) generalisation, and (d) feature construction.
Data reduction obtains a reduced representation of the data set
(eliminating redundant features/instances).
Data discretisation reduces the number of values of a continuous
feature by dividing the range of the feature into intervals.
Feature Selection
It is the process of selecting a subset of relevant features from the full
set of original features in the data.
Feature selection is used mainly for the following three reasons:
Simplification of models
Shorter training times
Reducing overfitting
In biological data, features may contain false correlations and the
information they add is contained in other features. In this work, we have
applied an unsupervised feature selection approach based on measuring
similarities between features by maximum information compression index.
We have quantified the information loss in feature selection with entropy
measure technique. After selecting the subset of features from the data,
we have grouped them into two groups: nominal and numeric features.
Subspace Clustering
The subspace clustering finds subspace clusters in high-dimensional data.
It can be classified into three groups:
1. Subspace search methods.
2. Correlation-based clustering methods
3. Biclustering methods.
A subspace search method searches various subspaces for clusters (set
of instances that are similar to each other in a subspace) in the full
space. It uses two kinds of strategies:
Bottom-up approach - start from low-dimensional subspace and
search higher-dimensional subspaces.
Top-down approach - start with full space and search smaller
subspaces recursively.
Algorithm 7 δ-Biclustering
Input: E, a data matrix, and δ ≥ 0, the maximum acceptable mean squared
residue score.
Output: EIJ , a δ-bicluster that is a submatrix of E with row set I and
column set J, with a score no larger than δ.
Initialization: I and J are initialized to the instance and feature sets in
the data and EIJ = E.
Deletion phase:
1: compute eiJ for all i ∈ I, eIj for all j ∈ J, eIJ , and H(I, J);
2: if H(I, J) ≤ δ then
3: return EIJ ;
4: end if
5: find the rows i ∈ I with $d(i) = \frac{1}{|J|} \sum_{j \in J} (e_{ij} - e_{iJ} - e_{Ij} + e_{IJ})^2$;
6: find the columns j ∈ J with $d(j) = \frac{1}{|I|} \sum_{i \in I} (e_{ij} - e_{iJ} - e_{Ij} + e_{IJ})^2$;
7: remove rows i ∈ I and columns j ∈ J with larger d;
Addition phase:
1: compute eiJ for all i, eIj for all j, eIJ , and H(I, J);
2: add the columns j ∉ J with $\frac{1}{|I|} \sum_{i \in I} (e_{ij} - e_{iJ} - e_{Ij} + e_{IJ})^2 \leq H(I, J)$;
3: recompute eiJ , eIJ and H(I, J);
4: add the rows i ∉ I with $\frac{1}{|J|} \sum_{j \in J} (e_{ij} - e_{iJ} - e_{Ij} + e_{IJ})^2 \leq H(I, J)$;
5: for each row i ∉ I do
6: if $\frac{1}{|J|} \sum_{j \in J} (e_{ij} - e_{iJ} - e_{Ij} + e_{IJ})^2 \leq H(I, J)$ then
7: add inverse of i;
8: end if
9: end for
10: return EIJ ;
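The quantities used in both phases of Algorithm 7 can be computed as sketched below. The code assumes the usual definition of the mean squared residue H(I, J) as the average of the squared residues over the whole submatrix, which the slides use but do not spell out; the function name and the toy matrix are illustrative assumptions.

```python
def mean_squared_residue(E, rows, cols):
    """Return H(I, J) plus the per-row d(i) and per-column d(j) scores."""
    row_mean = {i: sum(E[i][j] for j in cols) / len(cols) for i in rows}   # e_iJ
    col_mean = {j: sum(E[i][j] for i in rows) / len(rows) for j in cols}   # e_Ij
    all_mean = sum(E[i][j] for i in rows for j in cols) / (len(rows) * len(cols))  # e_IJ

    def residue_sq(i, j):
        return (E[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2

    H = sum(residue_sq(i, j) for i in rows for j in cols) / (len(rows) * len(cols))
    d_row = {i: sum(residue_sq(i, j) for j in cols) / len(cols) for i in rows}
    d_col = {j: sum(residue_sq(i, j) for i in rows) / len(rows) for j in cols}
    return H, d_row, d_col


# Rows 0 to 2 follow a perfectly additive pattern; row 3 does not.
E = [
    [1, 2, 3],
    [2, 3, 4],
    [3, 4, 5],
    [9, 1, 7],
]
H_full, d_row, d_col = mean_squared_residue(E, rows=[0, 1, 2, 3], cols=[0, 1, 2])
worst_row = max(d_row, key=d_row.get)
H_sub, _, _ = mean_squared_residue(E, rows=[0, 1, 2], cols=[0, 1, 2])
```

Row 3 has the largest d(i), so the deletion phase would remove it first; the remaining submatrix has a residue of zero and therefore satisfies any δ ≥ 0.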
Clustering of BrS variants
Distribution of BrS variants in clusters using proposed ensemble
clustering.
Experimental Method
To test the performance of clustering algorithms we have used an
unsupervised evaluation method that computes the Compactness (CP) of
clusters, as shown in Eq. 13.
$$CP = \frac{1}{n} \sum_{l=1}^{k} n_l \left( \frac{\sum_{x_i, x_j \in C_l} d(x_i, x_j)}{n_l (n_l - 1)/2} \right) \tag{13}$$
Where d(xi , xj ) is the distance between two instances in cluster Cl and nl
is the number of instances in Cl . The smaller the CP for a clustering
result, the more compact and better the clustering result.
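Eq. 13 can be evaluated with a few lines of Python. The sketch assumes that n is the total number of instances over all clusters, that d is the Euclidean distance, and that a cluster with a single instance contributes zero; none of these choices is stated on the slide.

```python
import math
from itertools import combinations


def compactness(clusters):
    """Eq. 13: size-weighted average of the mean pairwise distance within each cluster."""
    n = sum(len(c) for c in clusters)  # assumption: n is the total number of instances
    total = 0.0
    for c in clusters:
        n_l = len(c)
        if n_l < 2:
            continue  # assumption: a singleton cluster has no pairs and adds nothing
        pair_sum = sum(math.dist(a, b) for a, b in combinations(c, 2))
        total += n_l * pair_sum / (n_l * (n_l - 1) / 2)
    return total / n


clusters = [
    [(0.0, 0.0), (0.0, 2.0)],                # one pair at distance 2
    [(5.0, 5.0), (5.0, 8.0), (9.0, 5.0)],    # pairs at distances 3, 4 and 5
]
cp = compactness(clusters)
```

Here the first cluster has a mean pairwise distance of 2 and the second of 4, giving CP = (2 × 2 + 3 × 4) / 5 = 3.2.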
Results
The proposed ensemble clustering is compared with the following clustering
algorithms:
SimpleKMeans (clustering using the k-means method)
XMeans (extension of k-means)
DBScan (nearest-neighbor-based that automatically determines the
number of clusters)
MakeDensityBasedCluster (wrap a clusterer to make it return
distribution and density)
Table: Comparison of clustering results on 148 Exome data sets of BrS.
Clustering Method Compactness (CP)
SimpleKMeans 9.401
XMeans 8.297
MakeDensityBasedCluster 7.483
DBScan 6.351
Ensemble Clustering 5.647
Hybrid Decision Tree & Naïve Bayes Classifiers
The presence of noisy, contradictory instances in the training data causes
learning models to overfit and decreases classification accuracy.
Hybrid Decision Tree (DT) classifier - A naïve Bayes (NB)
classifier is used to remove the noisy, troublesome instances from the
training data before DT induction.
Hybrid Naïve Bayes (NB) classifier - A DT is used to select a
comparatively more important subset of features for applying the
naïve assumption of class conditional independence. It is
extremely computationally expensive for a naïve Bayes classifier to
estimate class conditional probabilities under this assumption for
high-dimensional data sets.
68. Outline Big Data Project Rule-based Classifier Class Imbalanced Problem Active Learning Ensemble Clustering Hybrid Classifier
Algorithm 8 Decision Tree Induction
Input: D = {x1, x2, ..., xn} // Training dataset D, which contains a set
of training instances and their associated class labels.
Output: T, the decision tree.
Method:
for each class, Ci ∈ D, do
    Find the prior probability, P(Ci).
end for
for each attribute value, Aij ∈ D, do
    Find the class conditional probabilities, P(Aij | Ci).
end for
for each training instance, xi ∈ D, do
    Find the posterior probability, P(Ci | xi);
    if xi is misclassified, then
        Remove xi from D;
    end if
end for
T = ∅;
Determine the best splitting attribute;
T = Create the root node and label it with the splitting attribute;
T = Add an arc to the root node for each split predicate and label;
for each arc do
    D = Dataset created by applying the splitting predicate to D;
    if stopping point reached for this path, then
        T' = Create a leaf node and label it with an appropriate class;
    else
        T' = DTBuild(D);
    end if
    T = Add T' to arc;
end for
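The noise-filtering stage of Algorithm 8 (priors, class conditional probabilities, posteriors, then removal of misclassified instances) can be sketched roughly as follows. This is a minimal illustration for categorical attributes only, written without smoothing; the function names `nb_fit`, `nb_predict`, `remove_misclassified` and the toy weather-style data are hypothetical, and the subsequent tree induction on the cleaned data is not shown.

```python
from collections import Counter, defaultdict


def nb_fit(rows, labels):
    """Estimate P(Ci) and the counts needed for P(Aij | Ci)."""
    class_counts = Counter(labels)
    priors = {c: k / len(labels) for c, k in class_counts.items()}
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[(j, c)][v] += 1
    return priors, value_counts, class_counts


def nb_predict(row, priors, value_counts, class_counts):
    """Return the class with the highest unnormalised posterior."""
    def score(c):
        p = priors[c]
        for j, v in enumerate(row):
            p *= value_counts[(j, c)][v] / class_counts[c]  # no smoothing (simplification)
        return p
    return max(priors, key=score)


def remove_misclassified(rows, labels):
    """Drop training instances that the naive Bayes model misclassifies."""
    model = nb_fit(rows, labels)
    kept = [(r, c) for r, c in zip(rows, labels) if nb_predict(r, *model) == c]
    return [r for r, _ in kept], [c for _, c in kept]


# Toy data (hypothetical): the last instance contradicts the first two.
rows = [("sunny", "hot"), ("sunny", "hot"), ("sunny", "mild"),
        ("rainy", "cool"), ("rainy", "cool"), ("rainy", "mild"),
        ("sunny", "hot")]
labels = ["no", "no", "no", "yes", "yes", "yes", "yes"]
clean_rows, clean_labels = remove_misclassified(rows, labels)
```

In this toy run the contradictory ("sunny", "hot") instance labelled "yes" is the only one removed; the cleaned set would then be passed to whatever tree learner is in use.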
69. Outline Big Data Project Rule-based Classifier Class Imbalanced Problem Active Learning Ensemble Clustering Hybrid Classifier
Algorithm 9 Naïve Bayes classifier
Input: D = {x1, x2, ..., xn} // Training data.
Output: A classification model.
Method:
T = ∅;
Determine the best splitting attribute;
T = Create the root node and label it with the splitting attribute;
T = Add an arc to the root node for each split predicate and label;
for each arc do
    D = Dataset created by applying the splitting predicate to D;
    if stopping point reached for this path, then
        T' = Create a leaf node and label it with an appropriate class;
    else
        T' = DTBuild(D);
    end if
    T = Add T' to arc;
end for
for each attribute, Ai ∈ D, do
    if Ai is not tested in T, then
        Wi = 0;
    else
        Let d be the minimum depth of Ai ∈ T, and set Wi = 1/√d;
    end if
end for
for each class, Ci ∈ D, do
    Find the prior probability, P(Ci).
end for
for each attribute, Ai ∈ D with Wi ≠ 0, do
    for each attribute value, Aij ∈ Ai, do
        Find the class conditional probabilities, P(Aij | Ci)^Wi.
    end for
end for
for each instance, xi ∈ D, do
    Find the posterior probability, P(Ci | xi);
end for
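The weighting stage of Algorithm 9 can be sketched as below, assuming the tree has already been built and summarised as a mapping from attribute index to its minimum depth. That mapping (`min_depth`), the convention that the root is at depth 1, the function names, and the toy data are all assumptions made for illustration; the sketch handles categorical attributes only and uses no smoothing.

```python
from collections import Counter
from math import sqrt


def depth_weights(n_attributes, min_depth):
    """Wi = 1/sqrt(d) if attribute i is tested in the tree at minimum depth d, else 0."""
    return [1 / sqrt(min_depth[i]) if i in min_depth else 0.0
            for i in range(n_attributes)]


def weighted_nb_posteriors(row, rows, labels, weights):
    """Posteriors proportional to P(Ci) * prod_i P(Aij | Ci) ** Wi over attributes with Wi != 0."""
    class_counts = Counter(labels)
    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / len(labels)  # prior P(Ci)
        for i, w in enumerate(weights):
            if w == 0:
                continue  # attribute not tested in the tree: ignored
            matches = sum(1 for r, y in zip(rows, labels) if y == c and r[i] == row[i])
            p *= (matches / n_c) ** w
        scores[c] = p
    total = sum(scores.values())
    if total == 0:
        return scores  # no class explains the instance; leave scores unnormalised
    return {c: s / total for c, s in scores.items()}


# Toy data (hypothetical): attribute 0 is tested at the root, attribute 1 is never tested.
rows = [("sunny", "hot"), ("sunny", "hot"), ("sunny", "mild"),
        ("rainy", "cool"), ("rainy", "cool"), ("rainy", "mild"),
        ("sunny", "hot")]
labels = ["no", "no", "no", "yes", "yes", "yes", "yes"]
weights = depth_weights(2, {0: 1})
posteriors = weighted_nb_posteriors(("sunny", "cool"), rows, labels, weights)
```

Because attributes deeper in the tree get smaller exponents, their class conditional probabilities are pulled towards 1 and so influence the posterior less; attributes the tree never tests are dropped entirely.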
70. Outline Big Data Project Rule-based Classifier Class Imbalanced Problem Active Learning Ensemble Clustering Hybrid Classifier
Accuracy on Benchmark Datasets
Figure: Classification accuracy on 10 datasets with 10-fold cross validation.
71. Outline Big Data Project Rule-based Classifier Class Imbalanced Problem Active Learning Ensemble Clustering Hybrid Classifier
Novel Class Instances
Figure: Instances with a fixed number of class labels (left) and instances of a
novel class arriving in the data stream (right).
72. Outline Big Data Project Rule-based Classifier Class Imbalanced Problem Active Learning Ensemble Clustering Hybrid Classifier
Novel Class Instances (cont.)
Figure: Flow chart of classification and novel class detection.
73. Outline Big Data Project Rule-based Classifier Class Imbalanced Problem Active Learning Ensemble Clustering Hybrid Classifier
Novel Class Instances (cont.)
74. Outline Big Data Project Rule-based Classifier Class Imbalanced Problem Active Learning Ensemble Clustering Hybrid Classifier
Novel Class Instances (cont.)
75. Outline Big Data Project Rule-based Classifier Class Imbalanced Problem Active Learning Ensemble Clustering Hybrid Classifier
*** THANK YOU ***