This document discusses supervised learning algorithms. It defines supervised learning as using labeled datasets to train algorithms to classify data or predict outcomes accurately. Some commonly used supervised learning algorithms are discussed, including neural networks, naive Bayes, linear regression, logistic regression, support vector machines, k-nearest neighbors, and random forests. These algorithms are used to build models that can generate predictions for new data based on patterns learned from training data.
Discover the evolving technology of artificial intelligence and text analysis. Learn about the importance, types, applications and challenges of the industry. Visit https://www.bytesview.com/ for more information.
The document discusses text classification, which is the process of assigning predefined categories or tags to text. It provides examples of text classification like sentiment analysis and topic detection. Text classification is important because it allows large amounts of unstructured text data to be automatically analyzed and organized, enabling companies to save time, automate processes, and make data-driven decisions. The document outlines some key algorithms used for automatic text classification, including decision trees and Naive Bayes classifiers.
Machine Learning With Logistic Regression - Knoldus Inc.
Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed. Logistic regression is a classification algorithm that builds on linear regression: the linear output is passed through a sigmoid function, and the model parameters are fitted to minimize the classification error.
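That relationship between linear and logistic regression can be sketched in a few lines: compute the linear score w*x + b, squash it through a sigmoid, and fit w and b by gradient descent on the log-loss. The data and hyperparameters below are illustrative, not taken from the slides:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, epochs=2000):
    # stochastic gradient descent on the log-loss
    w = b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)       # predicted probability of class 1
            w -= lr * (p - y) * x        # gradient of log-loss w.r.t. w
            b -= lr * (p - y)            # gradient of log-loss w.r.t. b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # one toy feature
ys = [0, 0, 0, 1, 1, 1]               # class labels
w, b = train_logistic(xs, ys)

def predict(x):
    return 1 if sigmoid(w * x + b) >= 0.5 else 0
```

After training, the learned decision boundary sits between the two groups, so small inputs map to class 0 and large inputs to class 1.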
Unit-3 Professional Ethics in Engineering - Nandakumar P
This document discusses safety and risk assessment in engineering. It defines safety and risk, and examines factors that influence risk perception such as voluntarism, control, and information. It also discusses techniques for assessing and reducing risk, including fault tree analysis, failure mode and effects analysis, and scenario analysis. The document concludes with case studies on the Three Mile Island and Chernobyl nuclear accidents and emphasizes the importance of disaster planning, training, and ensuring safe exits in product design.
This document discusses unsupervised learning approaches including clustering, blind signal separation, and self-organizing maps (SOM). Clustering groups unlabeled data points together based on similarities. Blind signal separation separates mixed signals into their underlying source signals without information about the mixing process. SOM is an algorithm that maps higher-dimensional data onto lower-dimensional displays to visualize relationships in the data.
The document discusses different types of machine learning including supervised learning, unsupervised learning, and reinforcement learning. It provides examples of each type, such as using labeled data to classify emails as spam or not spam for supervised learning, grouping fruits by color without labels for unsupervised learning, and using rewards to guide an agent through a maze for reinforcement learning. The document also covers applications of machine learning across different domains like banking, biomedical, computer, and environment.
Vector space model or term vector model is an algebraic model for representing text documents as vectors of identifiers, such as, for example, index terms. It is used in information filtering, information retrieval, indexing and relevancy rankings. Its first use was in the SMART Information Retrieval System
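The model's core operation is easy to sketch: represent each document as a vector of term counts and compare vectors with cosine similarity. The documents below are invented for illustration:

```python
import math
from collections import Counter

def term_vector(text):
    # bag-of-words term vector: term -> count
    return Counter(text.lower().split())

def cosine(a, b):
    # cosine similarity between two sparse term vectors
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

d1 = term_vector("information retrieval ranks documents")
d2 = term_vector("retrieval of relevant documents")
```

Here d1 and d2 share two of four terms each, so their cosine similarity is 0.5; identical documents score 1.0.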
Lecture #1: Introduction to machine learning (ML) - butest
1. Machine learning (ML) is a subfield of artificial intelligence concerned with building computer programs that learn from data and improve their abilities to perform tasks.
2. ML programs build models from example data to predict future examples or describe relationships in the data. For example, an ML program given patient cases could predict diseases in new patients or describe relationships between diseases and symptoms.
3. There are different types of learning including supervised learning (classification, regression), unsupervised learning (clustering), and reinforcement learning (sequential decision making). The goal is to learn patterns in data and generalize to new examples.
Deep learning uses neural networks, which are systems inspired by the human brain. Neural networks learn patterns from large amounts of data through forward and backpropagation. They are constructed of layers including an input layer, hidden layers, and an output layer. Deep learning can learn very complex patterns and has various applications including image classification, machine translation, and more. Recurrent neural networks are useful for sequential data like text and audio. Convolutional neural networks are widely used in computer vision tasks.
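A minimal forward pass through such layers can be sketched in plain Python; the weights and biases here are illustrative placeholders, not trained values:

```python
import math

def forward(x, layers):
    # one forward pass through fully connected layers with sigmoid activations
    for weights, biases in layers:
        x = [1 / (1 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
             for row, b in zip(weights, biases)]
    return x

# network shape: 2 inputs -> 2 hidden units -> 1 output
layers = [
    ([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1]),   # hidden layer
    ([[1.0, -1.0]], [0.0]),                     # output layer
]
out = forward([1.0, 2.0], layers)
```

Backpropagation would then adjust these weights against a loss; that step is omitted here to keep the sketch short.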
Anomaly detection (or outlier analysis) is the identification of items, events or observations which do not conform to an expected pattern or to other items in a dataset. It is used in applications such as intrusion detection, fraud detection, fault detection and process monitoring in various domains including energy, healthcare and finance.
In this workshop, we will discuss the core techniques in anomaly detection and discuss advances in Deep Learning in this field.
Through case studies, we will discuss how anomaly detection techniques can be applied to various business problems. We will also demonstrate examples using R, Python, Keras and TensorFlow to help reinforce concepts in anomaly detection and best practices in analyzing and reviewing results.
What you will learn:
Anomaly Detection: An introduction
Graphical and Exploratory analysis techniques
Statistical techniques in Anomaly Detection
Machine learning methods for Outlier analysis
Evaluating performance in Anomaly detection techniques
Detecting anomalies in time series data
Case study 1: Anomalies in Freddie Mac mortgage data
Case study 2: Autoencoder-based Anomaly Detection for Credit risk with Keras and TensorFlow
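As a flavor of the statistical techniques listed above, here is a minimal rolling z-score detector for time series in plain Python; the window, threshold, and data are illustrative, not the workshop's code:

```python
import statistics

def zscore_anomalies(series, window=5, threshold=3.0):
    # flag points more than `threshold` standard deviations away from
    # the mean of the preceding `window` observations
    flagged = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu = statistics.mean(hist)
        sd = statistics.pstdev(hist)
        if sd > 0 and abs(series[i] - mu) / sd > threshold:
            flagged.append(i)
    return flagged

data = [10, 11, 10, 12, 11, 10, 11, 50, 10, 11]
```

On this toy series the spike at index 7 (value 50) is flagged, while the normal fluctuation around 10-12 is not.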
1. Machine learning is a branch of artificial intelligence concerned with algorithms that allow computers to learn from data without being explicitly programmed.
2. A major focus is automatically learning patterns from training data to make intelligent decisions on new data. This is challenging since the set of all possible behaviors given all inputs is too large to observe completely.
3. Machine learning is applied in areas like search engines, medical diagnosis, stock market analysis, and game playing by developing algorithms that improve automatically through experience. Decision trees, Bayesian networks, and neural networks are common algorithms.
The document discusses the vector space model for representing text documents and queries in information retrieval systems. It describes how documents and queries are represented as vectors of term weights, with each term being assigned a weight based on its frequency in the document or query. The vector space model allows documents and queries to be compared by calculating the similarity between their vector representations. Terms that are more frequent in a document and less frequent overall are given higher weights through techniques like TF-IDF weighting. This vector representation enables efficient retrieval of documents ranked by similarity to the query.
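The TF-IDF weighting described above can be written out directly. This uses one common smoothed-idf variant among several, and the tiny corpus is invented for illustration:

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

def tf_idf(term, doc, corpus):
    # term frequency: share of the document's tokens that are `term`
    tokens = doc.split()
    tf = tokens.count(term) / len(tokens)
    # inverse document frequency with add-one smoothing on document count
    df = sum(1 for d in corpus if term in d.split())
    idf = math.log(len(corpus) / (1 + df)) + 1
    return tf * idf
```

With equal in-document frequency, the rarer term ("cat", in one document) receives a higher weight than the common term ("on", in two documents), matching the intuition stated above.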
This document discusses naïve text classification. It begins with an introduction to naïve Bayes classifiers and how they are based on Bayes' theorem. It then discusses how text data is converted into numerical feature vectors to be used in machine learning algorithms. Two examples are provided to illustrate how to use a naïve Bayes classifier to predict class labels. The document concludes with discussing some advantages and disadvantages of naïve Bayes classifiers.
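A minimal multinomial naïve Bayes text classifier with add-one smoothing can be sketched as follows; the training sentences are invented, and this illustrates the technique rather than reproducing the document's own examples:

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    # samples: list of (text, label) pairs
    class_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab, len(samples)

def predict_nb(model, text):
    class_counts, word_counts, vocab, n = model
    best, best_lp = None, -math.inf
    for label, cc in class_counts.items():
        lp = math.log(cc / n)                   # log prior P(class)
        total = sum(word_counts[label].values())
        for w in text.lower().split():
            # add-one smoothed log likelihood P(word | class)
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

train_docs = [("great movie loved it", "pos"),
              ("wonderful great acting", "pos"),
              ("terrible boring film", "neg"),
              ("awful waste boring", "neg")]
model = train_nb(train_docs)
```

New texts are assigned the class with the highest posterior log-probability.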
An introduction to machine learning and probabilistic ... - butest
This document provides an overview and introduction to machine learning and probabilistic graphical models. It discusses key topics such as supervised learning, unsupervised learning, graphical models, inference, and structure learning. The document covers techniques like decision trees, neural networks, clustering, dimensionality reduction, Bayesian networks, and learning the structure of probabilistic graphical models.
This document discusses association rule mining. Association rule mining finds frequent patterns, associations, correlations, or causal structures among items in transaction databases. The Apriori algorithm is commonly used to find frequent itemsets and generate association rules. It works by iteratively joining frequent itemsets from the previous pass to generate candidates, and then pruning the candidates that have infrequent subsets. Various techniques can improve the efficiency of Apriori, such as hashing to count itemsets and pruning transactions that don't contain frequent itemsets. Alternative approaches like FP-growth compress the database into a tree structure to avoid costly scans and candidate generation. The document also discusses mining multilevel, multidimensional, and quantitative association rules.
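The Apriori join-and-prune loop described above can be sketched compactly; the transactions are invented for illustration:

```python
from itertools import combinations

def apriori(transactions, min_support=2):
    # level-wise search: count candidates, keep frequent ones,
    # join survivors into larger candidates, prune by subsets
    transactions = [frozenset(t) for t in transactions]
    k_sets = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    while k_sets:
        counts = {c: sum(1 for t in transactions if c <= t) for c in k_sets}
        survivors = {c for c, n in counts.items() if n >= min_support}
        frequent.update({c: counts[c] for c in survivors})
        # join step: union pairs of surviving k-itemsets into (k+1)-candidates
        k_sets = {a | b for a in survivors for b in survivors
                  if len(a | b) == len(a) + 1}
        # prune step: drop candidates that have an infrequent subset
        k_sets = {c for c in k_sets
                  if all(frozenset(s) in frequent
                         for s in combinations(c, len(c) - 1))}
    return frequent

tx = [["bread", "milk"], ["bread", "butter"],
      ["bread", "milk", "butter"], ["milk"]]
freq = apriori(tx, min_support=2)
```

The prune step is what makes Apriori efficient: {milk, butter} appears only once, so it never reaches the counting phase for larger itemsets.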
This document discusses various machine learning techniques for classification and prediction. It covers decision tree induction, tree pruning, Bayesian classification, Bayesian belief networks, backpropagation, association rule mining, and ensemble methods like bagging and boosting. Classification involves predicting categorical labels while prediction predicts continuous values. Key steps for preparing data include cleaning, transformation, and comparing different methods based on accuracy, speed, robustness, scalability, and interpretability.
Ensemble Learning is a technique that creates multiple models and then combines them to produce improved results.
Ensemble learning usually produces more accurate solutions than a single model would.
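One simple way to combine models is majority voting; the predictions below stand in for the outputs of three hypothetical classifiers:

```python
from collections import Counter

def majority_vote(predictions):
    # return the label predicted by the most models for one input
    return Counter(predictions).most_common(1)[0][0]

# per-input predictions from three hypothetical models
model_a = ["spam", "ham", "spam", "ham", "spam"]
model_b = ["spam", "spam", "spam", "ham", "ham"]
model_c = ["ham", "ham", "spam", "ham", "spam"]
ensemble = [majority_vote(p) for p in zip(model_a, model_b, model_c)]
```

Where any single model errs, the other two usually outvote it, which is the intuition behind the accuracy gain claimed above; bagging and boosting refine this idea with resampling and weighting.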
B.E. / B.TECH. DEGREE
SEMESTER - VIII
PROFESSIONAL ELECTIVE - V
CS8080 INFORMATION RETRIEVAL TECHNIQUES
UNIT - III - TEXT CLASSIFICATION AND CLUSTERING
This document summarizes a seminar presentation on machine learning. It defines machine learning as applications of artificial intelligence that allow computers to learn automatically from data without being explicitly programmed. It discusses three main approaches to machine learning: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning uses labelled training data, unsupervised learning finds patterns in unlabelled data, and reinforcement learning involves learning through rewards and punishments. Example applications discussed include data mining, natural language processing, image recognition, and expert systems.
Smart Data Slides: Machine Learning - Case Studies - DATAVERSITY
The state of the art and practice for machine learning (ML) has matured rapidly in the past 3 years, making it an ideal time to take a look at what works and what doesn’t.
In this webinar, we will review case studies from 3 industries:
-Insurance
-Healthcare
-Pharma
Participants will learn to look for characteristics of business processes and of data that make them well- or ill-suited to augmentation or automation with ML.
This document discusses machine learning concepts including supervised vs. unsupervised learning, clustering algorithms, and specific clustering methods like k-means and k-nearest neighbors. It provides examples of how clustering can be used for applications such as market segmentation and astronomical data analysis. Key clustering algorithms covered are hierarchy methods, partitioning methods, k-means which groups data by assigning objects to the closest cluster center, and k-nearest neighbors which classifies new data based on its closest training examples.
The document discusses feature selection and dimensionality reduction techniques for text classification. It describes how these techniques aim to minimize the number of features in a dataset by selecting only the most important ones, to reduce overfitting and improve model performance. Various feature selection methods are covered, including filter methods that score features based on statistical tests, wrapper methods that evaluate feature subsets with a predictive model, and embedded methods that perform feature selection during model training.
This document discusses evaluation metrics for text classification. It introduces confusion matrices, which contain true positives, false positives, true negatives, and false negatives based on comparing predicted and known labels. Accuracy measures are calculated using these counts from the confusion matrix, allowing evaluation of a classifier's performance. Common measures include precision, recall, and F1 score. The document provides examples of using confusion matrices and contingency tables to evaluate predictive models in fields like bioinformatics.
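The measures named above follow directly from the confusion-matrix counts; a quick sketch with illustrative counts:

```python
def prf(tp, fp, fn):
    # precision, recall, and F1 from confusion-matrix counts
    precision = tp / (tp + fp)            # of predicted positives, how many are right
    recall = tp / (tp + fn)               # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

p, r, f = prf(tp=8, fp=2, fn=4)
```

With 8 true positives, 2 false positives, and 4 false negatives, precision is 0.8, recall is 2/3, and F1 (their harmonic mean) is 8/11.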
This document discusses accuracy and error in text classification. It defines accuracy as the proportion of correct predictions and discusses different types of errors. It also describes several metrics for evaluating classification models, including mean squared error, mean absolute error, mean absolute percent error, and metrics derived from a confusion matrix like recall and precision. Cross-validation and bootstrapping techniques for estimating error rates in classification models are also covered.
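The error metrics mentioned can be written out in a few lines; the values below are illustrative:

```python
def mse(y, yhat):
    # mean squared error
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def mae(y, yhat):
    # mean absolute error
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def mape(y, yhat):
    # mean absolute percent error (undefined when a true value is zero)
    return 100 * sum(abs((a - b) / a) for a, b in zip(y, yhat)) / len(y)

y, yhat = [100, 200, 400], [110, 190, 400]
```

MSE penalizes large errors more heavily than MAE, while MAPE expresses error relative to the true values, which is why the choice of metric matters when comparing models.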
This document discusses unsupervised learning algorithms, specifically clustering. It defines clustering as a machine learning technique that groups unlabeled datasets into clusters of similar data points without supervision. Popular clustering algorithms mentioned include K-means clustering and hierarchical clustering. The key advantages of unsupervised learning are that it can handle more complex tasks since the data is unlabeled, and unlabeled data is easier to obtain than labeled data. However, the results may be less accurate since the algorithms do not know the exact outputs.
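K-means is easy to sketch on one-dimensional data: alternately assign each point to its nearest center, then move each center to the mean of its assigned points (Lloyd's algorithm). The points and initial centers below are illustrative:

```python
def kmeans_1d(points, centers, iters=10):
    # Lloyd's algorithm on 1-D data
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # move each center to the mean of its cluster (keep it if empty)
        centers = [sum(ps) / len(ps) if ps else c for c, ps in clusters.items()]
    return sorted(centers)

pts = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
```

Starting from centers at 0.0 and 5.0, the algorithm converges to centers near 1.0 and 9.0, one per natural group.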
The document discusses decision trees, which are a classification technique where an internal node represents a test on an attribute, tree branches represent outcomes of the test, and leaf nodes represent class labels. It describes how decision trees are constructed in a top-down manner by selecting the optimal attribute to split the data on at each node, recursively partitioning the data until reaching leaf nodes of single class labels. The document provides examples of decision tree construction and classification using a weather dataset.
The document discusses support vector machines (SVM) classifiers. It begins with an introduction to SVM, explaining that it is a supervised machine learning model that finds a hyperplane to classify data. It then covers SVM history and applications, the general philosophy of maximizing margins between classes, and how SVM handles both linearly separable and non-linearly separable data. Finally, it provides an example of how SVM works for text classification by finding the optimal hyperplane to separate documents into classes.
The document discusses multi-dimensional indexing and searching. It is part of a course on information retrieval techniques covering text classification, clustering, naive classification, supervised algorithms like decision trees and SVMs, and dimensionality reduction. Multi-dimensional indexing allows indexing and searching based on multiple fields to support queries with criteria on different fields.
This document provides an overview of data mining and the CRISP-DM methodology. It discusses key terminology, potential applications, and a Venn diagram comparing data mining, knowledge discovery, big data analytics, statistics, and data science. The CRISP-DM methodology is explained in six steps: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Various data exploration, cleaning, transformation, and dimensionality reduction techniques are covered. Common machine learning algorithms, model selection factors, and assessment metrics are also summarized.
The document discusses the topic of indexing and searching in a course on information retrieval techniques. It covers various techniques for text classification including supervised algorithms like decision trees, k-NN classifiers, and SVM classifiers. It also discusses feature selection, evaluation metrics, accuracy, organizing classes, inverted indexes, sequential searching, and multi-dimensional indexing. The document appears to be from a course at Aalim Muhammed Salgh College of Engineering on professional elective CS8080.
The document discusses the K-nearest neighbors (K-NN) classification algorithm. It explains that K-NN is a simple supervised machine learning algorithm that stores all available training data and classifies new data based on similarity. It finds the K closest training examples to a new data point and assigns the most common class among those K examples to the new data point. The document provides examples of how K-NN works and discusses factors like choosing K, distance measures, and normalization.
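The K-NN procedure described can be sketched directly: compute distances from the query to every stored training point, take the K closest, and vote on the label. The coordinates are illustrative:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of ((x, y), label) pairs; Euclidean distance, majority label
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
```

Note the "lazy" character mentioned above: there is no training step at all, so all the cost falls on prediction, and feature scales should be normalized before distances are compared.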
Recent trends discussed include digital transformation, COVID-19 impact, remote working, and disruptive technologies like quantum physics and driverless vehicles. Machine learning techniques can help analyze large, complex datasets and make predictions. Unsupervised machine learning models can find hidden patterns in unlabeled data and group objects based on similarities. Supervised learning predicts target variables using labeled examples to train algorithms like decision trees and random forests. The machine learning process involves data preparation, algorithm selection, model training, prediction, and evaluation.
The document discusses organizing classes for text classification. It covers taxonomies for organizing classes in a hierarchical structure and relationships between them. The document is about a course on information retrieval techniques, specifically discussing organizing classification classes through taxonomies.
The document discusses organizing classes for text classification. It covers taxonomies for organizing classes in a hierarchical structure and relationships between them. The document is about a course on information retrieval techniques, including topics like text classification, clustering, supervised and unsupervised algorithms, evaluation metrics and indexing methods.
Module Overview Careers in Analytics In this module, we .docx - audeleypearl
Module Overview | Careers in Analytics

In this module, we will evaluate the various quantitative data collection and analysis methods in standard industry practice. These methods are what will be used throughout this program, so you should become familiar with the terminology.

The second part of this module presents a variety of career paths for data analysts and an overview of how several industries are currently using data analytics. Pay special attention to the intersection of skills necessary for a data analyst to possess, and think of the steps you can take to gain or improve on these in your own skill set. This may give you an idea of the career path and industry you would like to pursue, or enhance your understanding of a career path and industry you have already chosen.
Industry Practice
Learning Objectives
Explain the technical elements and steps associated with analytics practices and processes
Explore industry practice of data analytics
Typical Quantitative Techniques Used in Advanced Analytics
Several quantitative techniques apply to analytics projects, including:
Type: Description
• Simulation: randomized repetitions of a set of discrete events in order to model real-world systems and phenomena (e.g., queues)
• Optimization: an algorithm selects the best possible outcome, subject to satisfying constraints
• Matrix Algebra: calculations involving matrices solve multidimensional problems
• Fitting Functions to Data: also called “curve fitting”; uses numerical methods to interpolate data
• Survival Analysis: originally used by life scientists, but adopted by marketers and actuaries
• Time Series: used when data are “auto-correlated,” such as time-dependent data (also called “Box-Jenkins”)

Predictive Analytics and Machine Learning
• Classical Statistics: descriptive statistics calculate metrics to characterize the distribution of values in the data (mean, standard deviation, range, etc.); predictive statistics estimate parameters from historical data and make predictions of future outcomes (multivariate regression, generalized linear regression, etc.)
• Learning: unsupervised learning characterizes the data to establish classes without using explicit metrics (e.g., k-means clustering); supervised learning classifies and describes the data with pre-defined ‘labels’ (e.g., decision trees)
• Bayesian: used to augment classical analysis when there is prior knowledge about how the data was generated
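The techniques above give k-means clustering as the canonical unsupervised method. A minimal one-dimensional sketch (illustrative only; the data and seed are invented, and real implementations handle multiple dimensions and convergence checks):

```python
import random

def kmeans_1d(points, k=2, iters=20, seed=0):
    """Minimal 1-D k-means: alternate cluster assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            # Assign each point to its nearest centroid
            i = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[i].append(x)
        # Move each centroid to the mean of its cluster (keep it if empty)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
print(kmeans_1d(data))  # two centroids, one near each cluster
```

No labels are involved: the algorithm discovers the two groups purely from the distances between the values.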
Typical Challenges and Pitfalls in an Analytics Project
1. Poorly defined problem
• Unclear goal of problem-solving
• Scope is unclear, e.g., how many SKUs to analyze
• Mixed objectives, e.g., an economic analysis of a product category promotion that mixes the retailer and CPG perspectives
2. Limited IT resources
• Cloud data can’t be acquired off-line within a reasonable time
• Can’t run the complete model due to computation limitation
• Too slow to generate results in real time
• Can’t share.
This document provides an introduction to machine learning concepts including definitions of machine learning, training and test data, and different machine learning techniques. It defines machine learning as a field that allows machines to learn from data without being explicitly programmed. It describes how training data is used to teach a machine and test data is used to evaluate how well a machine has learned. The document outlines common machine learning techniques including supervised learning techniques like classification and regression as well as unsupervised learning techniques like clustering. It provides examples of different algorithms for each technique.
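The split between training and test data mentioned here is easy to illustrate. A hedged sketch (the ratio and seed are arbitrary choices, not from the document):

```python
import random

def train_test_split(data, test_ratio=0.25, seed=42):
    """Shuffle labeled examples and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

# Toy labeled examples: (value, label) pairs
data = [(i, "even" if i % 2 == 0 else "odd") for i in range(12)]
train, test = train_test_split(data)
print(len(train), len(test))  # → 9 3
```

The model is fit only on `train`; `test` is held back so the evaluation measures how well the machine generalizes rather than how well it memorized.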
machine learning workflow with data input.pptx (jasontseng19)
This document provides an overview of machine learning, including definitions of key concepts like tasks, experience, and performance in machine learning. It also discusses common machine learning workflows like data loading, understanding data through statistics and visualization, preparing data through techniques like scaling and normalization, selecting features, and discussing applications and types of machine learning. It provides examples of challenges in machine learning and techniques for data preparation and feature selection.
This document discusses machine learning techniques and concepts. It introduces topics like supervised learning, unsupervised learning, reinforcement learning, and neural networks. It defines key machine learning terms and describes applications. The objectives are to understand fundamental machine learning concepts and the need for machine learning to solve various problems. Examples of motivating problems discussed include handwritten character recognition, fingerprint recognition, and face recognition.
The document discusses the topic of sequential searching as part of a course on information retrieval techniques. It covers text classification, clustering algorithms, naive classification, supervised algorithms like decision trees and SVMs, feature selection, evaluation metrics, and indexing and searching techniques including inverted indexes and sequential searching. The document appears to be from a college course that focuses on classification, clustering, and searching methods for information retrieval.
CS8080_IRT_UNIT - III T1 A CHARACTERIZATION OF TEXT CLASSIFICATION.pdf
1. P1WU
UNIT – III: CLASSIFICATION
Topic 1: A CHARACTERIZATION OF TEXT
CLASSIFICATION
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
2. UNIT III
1.A Characterization of
Text Classification
2. Unsupervised
Algorithms: Clustering
3. Naïve Text Classification
4. Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or
Dimensionality Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional
Indexing
3. INTRODUCTION TO CLASSIFICATION
4. INTRODUCTION TO CLASSIFICATION
• Scientists became very serious about addressing the question:
• “Can we build a model that learns from available data and
automatically makes the right decisions and predictions?”
• Answer can be found in numerous applications that are emerging
from the fields of
1. pattern classification,
2. machine learning, and
3. artificial intelligence.
5. INTRODUCTION TO CLASSIFICATION
• Data from various sensing devices, combined with powerful
learning algorithms and domain knowledge, led to
• many great inventions that we now take for granted in our
everyday life:
• Internet queries via search engines like Google,
• text recognition at the post office,
• barcode scanners at the supermarket, the diagnosis of diseases,
• speech recognition by Siri or
• Google Now on our mobile phone, just to name a few.
6. INTRODUCTION TO CLASSIFICATION
• Classification is:
• the data mining process of
• finding a model (or function) that
• describes and distinguishes data classes or concepts,
• for the purpose of being able to use the model to predict the class of objects
whose class label is unknown.
• That is, predicts categorical class labels (discrete or nominal).
• Classifies the data (constructs a model) based on the training set.
• It predicts group membership for data instances.
7. INTRODUCTION TO CLASSIFICATION
What is CLASSIFICATION?
• Classification and prediction are:
• two forms of data analysis that can be used to extract models describing
important data classes or to predict future data trends.
• Together they help us gain a better understanding of large data sets.
• Classification predicts categorical (discrete, unordered) labels.
• Prediction models continuous valued functions.
8. INTRODUCTION TO CLASSIFICATION
• How can we classify?
• The trick here is machine learning, which makes classifications based on past
observations (the learning part).
• We give the machine a set of texts tagged with labels and let the model learn
from all this data; the trained model can later tell us the category of new text
input we feed it.
9. Applications of Classification
• Classification of (potential) customers for:
• Credit approval, risk prediction, selective marketing
• Performance prediction based on
• selected indicators
• Medical diagnosis based on symptoms or reactions to Therapy
• Application areas:
• Credit approval
• Target marketing
• Medical diagnosis
• Treatment effectiveness analysis
• Performance prediction
10. When is classification needed?
• Scenarios:
• In each of these examples, the data analysis task is classification,
• where a model or classifier is constructed to predict categorical labels, such as
• “safe” or “risky” for the loan application data;
• “yes” or “no” for the marketing data; or
• “treatment A,” “treatment B,” or “treatment C” for the medical data.
• These categories can be represented by discrete values, where the ordering among values
has no meaning.
• For example,
• the values 1, 2, and 3 may be used to represent treatments A, B, and C,
• where there is no ordering implied among this group of treatment regimes.
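Mapping such unordered categories to integer codes (label encoding) can be sketched as follows; the codes are arbitrary and imply no ordering:

```python
# Toy label data; dict.fromkeys keeps the first-seen order of distinct labels
treatments = ["treatment A", "treatment B", "treatment C", "treatment A"]
codes = {label: i + 1 for i, label in enumerate(dict.fromkeys(treatments))}
encoded = [codes[t] for t in treatments]
print(codes)    # → {'treatment A': 1, 'treatment B': 2, 'treatment C': 3}
print(encoded)  # → [1, 2, 3, 1]
```

As the slide notes, 1 < 2 < 3 here is meaningless: the integers are identifiers, not ranks.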
11. INTRODUCTION TO CLASSIFICATION
Aim: predict categorical class labels
for new tuples/samples
Input: a training set of tuples/samples,
each with a class label
Output: a model (a classifier) based on
the training set and the class labels
12. Why Classification?
• A classical problem extensively studied by
• statisticians and machine learning researchers
• Predicts categorical class labels.
• Produces a model (classifier).
13. Typical Applications of Classification
• Example:
• {credit history, salary} → credit approval (Yes/No)
• {Temp, Humidity} → Rain (Yes/No)
• A set of documents → sports, technology, etc.
• Another Example:
• If x >= 90 then grade = A.
• If 80 <= x < 90 then grade = B.
• If 70 <= x < 80 then grade = C.
• If 60 <= x < 70 then grade = D.
• If x < 60 then grade = F.
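Expressed as code, these threshold rules form a small hand-written classifier (treating every score below 60 as F):

```python
def grade(x):
    """Assign a letter grade from a numeric score using threshold rules."""
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    return "F"

print([grade(s) for s in (95, 85, 72, 65, 40)])  # → ['A', 'B', 'C', 'D', 'F']
```

This is classification without learning: the decision boundaries are fixed by hand rather than induced from training data.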
14. WHAT ARE TEXT CLASSIFICATION?
• Text classification is a machine
learning technique that assigns a
set of predefined categories
to open-ended text.
• Text classifiers can be used to
organize, structure, and categorize
pretty much any kind of text:
documents, medical studies, files,
and content from all over the web.
15. What is meant by text classification?
• Text classification or Text Categorization
is the activity of labeling natural
language texts with relevant categories
from a predefined set.
• In layman's terms, text classification is
the process of extracting generic tags
from unstructured text.
• These generic tags come from a set of
pre-defined categories.
16. What is meant by text classification or document classification?
• Document classification or document categorization is
• a problem in library science, information science and
computer science.
• The task is to assign a document to one or more classes or
categories.
• This may be done "manually" or algorithmically.
•Wikipedia
17. What is meant by text classification?
• Text classification also known as text tagging or text
categorization is the process of categorizing text into
organized groups.
• By using Natural Language Processing (NLP), text
classifiers can automatically analyze text and then
assign a set of pre-defined tags or categories based on
its content.
18. Text Classification Examples
• Text classification is becoming an increasingly important part of
business, as it allows companies to easily get insights from data
and automate business processes.
• Some of the most common examples and use cases for
automatic text classification include the following:
a) Sentiment Analysis
b) Topic Detection
c) Language Detection
19. Text Classification Examples
a) Sentiment Analysis: the process of determining whether a given text
talks positively or negatively about a given subject
(e.g. for brand monitoring purposes).
b) Topic Detection: the task of identifying the theme or topic of a piece
of text
(e.g. know if a product review is about Ease of Use, Customer Support,
or Pricing when analyzing customer feedback).
c) Language Detection: the procedure of detecting the language of a
given text
(e.g. know if an incoming support ticket is written in English or Spanish for
automatically routing tickets to the appropriate team).
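The language-detection example above can be sketched with a toy stop-word-overlap heuristic; real detectors typically use character n-gram statistics, and the word lists below are small illustrative assumptions:

```python
# Toy language detector: pick the language whose stop-word list
# overlaps the text the most. Word lists are illustrative, not exhaustive.
STOPWORDS = {
    "english": {"the", "is", "and", "to", "of", "in", "my", "not"},
    "spanish": {"el", "es", "y", "de", "en", "la", "mi", "no"},
}

def detect_language(text: str) -> str:
    """Return the language with the largest stop-word overlap."""
    words = set(text.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(detect_language("the product is great and easy to use"))  # english
print(detect_language("el producto es bueno y fácil de usar"))  # spanish
```

A support desk could run such a check on each incoming ticket and route it to the English- or Spanish-speaking team accordingly.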
20. A Characterization of Text Classification
• Text classification is
• one of the fundamental tasks in natural language processing, with broad applications such
as sentiment analysis, topic labeling, spam detection, and intent detection.
• For example,
• news articles can be organized by topic;
• support tickets can be organized by urgency;
• chat conversations can be organized by language;
• brand mentions can be organized by sentiment; and so on.
• Here’s an example of how it works:
• “The user interface is quite straightforward and easy to use.”
• A text classifier can take this phrase as an input, analyze its content, and then automatically
assign relevant tags, such as UI and Easy To Use.
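The tag-assignment step above can be mimicked with a minimal rule-based tagger. The keyword-to-tag lexicon below is invented for illustration; a trained classifier would learn these associations from labeled data rather than from hand-written rules:

```python
# Minimal rule-based tagger for the slide's example phrase.
# The lexicon below is a made-up illustration, not a real taxonomy.
TAG_KEYWORDS = {
    "UI": ["user interface", "interface", "screen"],
    "Easy To Use": ["easy to use", "straightforward", "intuitive"],
}

def assign_tags(text: str) -> list:
    """Return every tag whose keywords appear in the text."""
    text = text.lower()
    return [tag for tag, kws in TAG_KEYWORDS.items()
            if any(kw in text for kw in kws)]

phrase = "The user interface is quite straightforward and easy to use."
print(assign_tags(phrase))  # ['UI', 'Easy To Use']
```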
21. A Characterization of Text Classification
• A first tactic for categorizing documents is to assign a
label to each document,
• but this solves the problem only when users know the
labels of the documents they are looking for.
• This tactic does not solve the more generic problem of
finding documents on a specific topic or subject.
22. A Characterization of Text Classification
• In that case, a better solution is to
• group documents by common generic topics and label each group
with a meaningful name.
• Each labeled group is called a category or class.
• Document classification is
• the process of categorizing documents under a given cluster or
category using a fully supervised learning process.
23. Why is Text Classification Important?
• It’s estimated that around 80% of all information is unstructured, with text
being one of the most common types of unstructured data.
• Because of the messy nature of text,
• analyzing, understanding, organizing, and sorting through text data is hard and time-consuming, so
most companies fail to use it to its full potential.
• This is where text classification with machine learning comes in.
• Using text classifiers, companies can automatically structure all manner of
relevant text, from
• legal documents, social media, chatbots, surveys, and more, in a fast and cost-effective way.
• This allows companies to
• save time analyzing text data, automate business processes, and make data-driven business
decisions.
24. Reasons for: Text Classification Important
a) Scalability
• Manually analyzing and organizing text is slow and much less accurate.
• Machine learning can automatically analyze millions of surveys, comments, emails,
etc., at a fraction of the cost, often in just a few minutes.
• Text classification tools are scalable to any business needs, large or small.
b) Real-time analysis
• There are critical situations that companies need to identify as soon as possible and
take immediate action (e.g., PR crises on social media).
• Machine learning text classification can follow your brand mentions constantly and in
real time, so you'll identify critical information and be able to take action right away.
25. Reasons for: Text Classification Important
c) Consistent criteria
• Human annotators make mistakes when classifying text data due to
distractions, fatigue, and boredom, and human subjectivity creates inconsistent
criteria.
• Machine learning, on the other hand, applies the same lens and criteria to all
data and results.
• Once a text classification model is properly trained, it classifies
new data with consistent, repeatable accuracy.
26. A Characterization of Text Classification
• Classification can be performed
1. manually by domain experts, or
2. automatically using well-known and
• widely used classification algorithms such as decision trees and
Naïve Bayes.
• Documents are classified according to
• their subjects, or according to other attributes (e.g. author, document type,
publishing year, etc.).
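As a sketch of the Naïve Bayes approach mentioned above, here is a tiny multinomial Naïve Bayes classifier with Laplace smoothing, trained on a made-up spam/ham corpus. In practice one would use a library implementation (e.g. scikit-learn's MultinomialNB) on a real dataset:

```python
import math
from collections import Counter, defaultdict

# Made-up two-class training corpus for illustration only.
train = [
    ("win cash prize now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

class_docs = defaultdict(list)
for text, label in train:
    class_docs[label].extend(text.split())

vocab = {w for words in class_docs.values() for w in words}
priors = {c: math.log(sum(1 for _, l in train if l == c) / len(train))
          for c in class_docs}
counts = {c: Counter(words) for c, words in class_docs.items()}

def classify(text: str) -> str:
    """Pick the class maximizing log P(c) + sum over words of log P(w|c)."""
    scores = {}
    for c in class_docs:
        total = sum(counts[c].values()) + len(vocab)
        scores[c] = priors[c] + sum(
            math.log((counts[c][w] + 1) / total)  # Laplace smoothing
            for w in text.split() if w in vocab)
    return max(scores, key=scores.get)

print(classify("claim your free prize"))   # spam
print(classify("agenda for the meeting"))  # ham
```

Working in log-space avoids floating-point underflow when many word probabilities are multiplied.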
27. A Characterization of Text Classification
• There are two main kinds of subject classification of documents:
1. the content-based approach and
2. the request-based approach.
• In content-based classification,
• the weight given to subjects in a document decides the class to which the document is assigned.
• For example, it is a rule in some library classification schemes that at least 15% of the content of a
book should be about the class to which the book is assigned.
• In automatic classification, the number of times given words appear in a document determines the
class.
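The word-frequency idea can be sketched as follows: the class whose indicator terms account for the largest share of a document's words wins, subject to a minimum share loosely mirroring the 15% library rule. The term lists below are invented for illustration:

```python
# Content-based sketch: classify by the share of class indicator terms.
# Term lists are invented; real systems learn term weights from data.
CLASS_TERMS = {
    "sports": {"match", "goal", "team", "score"},
    "finance": {"stock", "market", "profit", "shares"},
}

def classify_by_content(text: str, min_share: float = 0.15) -> str:
    """Assign the class with the largest indicator-term share,
    or 'unclassified' if no class reaches the minimum share."""
    words = text.lower().split()
    shares = {c: sum(w in terms for w in words) / len(words)
              for c, terms in CLASS_TERMS.items()}
    best = max(shares, key=shares.get)
    return best if shares[best] >= min_share else "unclassified"

doc = "the team scored a late goal to win the match"
print(classify_by_content(doc))  # sports
```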
28. A Characterization of Text Classification
• In request-oriented
classification, the anticipated
requests from users influence
how documents are
classified.
• The classifier asks:
• "Under which description should this
entity be found?" and
• "think of all the possible queries and
decide for which ones the entity at
hand is relevant".
29. Text Classification Applications
• With the help of text classification, businesses can make sense of large
amounts of data using techniques like
• aspect-based sentiment analysis to understand what people are talking about
and how they’re talking about each aspect.
• Text classification can help support teams provide a stellar experience
by
• automating tasks that are better left to computers, saving precious time that
can be spent on more important things.
30. Text Classification Applications
• Text classification models can help you analyze survey results to discover patterns
and insights like:
• What do people like about our product or service?
• What should we improve?
• What do we need to change?
• By combining both quantitative results and qualitative analyses,
• teams can make more informed decisions without having to spend hours
manually analyzing every single open-ended response.
31. Text Classification Applications
• Text classification has thousands of use cases and is applied to a wide range
of tasks.
• In some cases, data classification tools work behind the scenes to enhance
app features we interact with on a daily basis (like email spam filtering).
• In some other cases, classifiers are used by marketers, product managers,
engineers, and salespeople to automate business processes and save
hundreds of hours of manual data processing.
• Some of the top applications and use cases of text classification include:
1. Detecting urgent issues
2. Automating customer support processes
3. Listening to the Voice of the Customer (VoC)
32. A Characterization of Text Classification
• Automatic document classification tasks can be divided into three
types
1. Unsupervised document classification (document clustering): the
classification must be done entirely without reference to external information.
2. Semi-supervised document classification: only part of the documents are
labeled by an external method.
3. Supervised document classification: some external method (such as
human feedback) provides information on the correct classification for
documents.
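The semi-supervised case can be illustrated with a crude self-training sketch: each unlabeled document receives the label of the most word-overlapping labeled document, then joins the training set. A real system would use a probabilistic classifier with confidence thresholds; the corpus here is made up:

```python
# Crude self-training sketch: pseudo-label unlabeled documents using
# word overlap with labeled ones, then grow the labeled set.
labeled = [("goal scored in the match", "sports"),
           ("stock market rises", "finance")]
unlabeled = ["late goal wins the match", "market shares fall"]

def overlap(a: str, b: str) -> int:
    """Number of words the two texts share."""
    return len(set(a.split()) & set(b.split()))

for doc in unlabeled:
    # Copy the label of the most similar labeled document.
    _, label = max(labeled, key=lambda lt: overlap(doc, lt[0]))
    labeled.append((doc, label))

print(labeled[-2:])  # the two pseudo-labeled documents
```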
33. Computational Supervised Learning
• Computational supervised learning, also called classification, aims to:
• Learn from past experience, and
• use the learned knowledge to classify new data
• The knowledge is learned by intelligent algorithms
• Examples:
• Clinical diagnosis for patients
• Cell type classification
34. Overall Picture of Supervised Learning
[Figure: documents from domains such as Biomedical, Financial, Government, and Scientific feed into classifiers ("M-Doctors") such as decision trees, emerging patterns, SVM, and neural networks.]
35. Unsupervised Learning
• Unsupervised learning is a machine learning technique in which
models are not supervised using a labeled training dataset. Instead, the
model itself finds hidden patterns and insights in the given data. It can
be compared to the learning that takes place in the human brain while
learning new things. It can be defined as:
• "Unsupervised learning is a type of machine learning in which models
are trained using an unlabeled dataset and are allowed to act on that
data without any supervision".
36. Unsupervised Learning
Unsupervised learning cannot be directly applied to a regression or
classification problem because, unlike supervised learning, we have the
input data but no corresponding output data.
The goal of unsupervised learning is to
find the underlying structure of the dataset, group the data according to
similarities, and represent the dataset in a compressed format.
37. Unsupervised Learning
Example: Suppose the unsupervised learning algorithm is given an input
dataset containing images of different types of cats and dogs.
The algorithm is never trained on the given dataset, which means it
has no prior knowledge of the dataset's features.
The task of the unsupervised learning algorithm is to identify the image
features on their own.
38. Unsupervised Learning
• An unsupervised learning algorithm will
• perform this task by clustering the image dataset into groups according to
similarities between images.
• Simply put,
• no labeled training data is provided. Examples:
• neural network models
• independent component analysis
• clustering
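The clustering idea above can be sketched with a bare-bones k-means on one-dimensional points; a real pipeline would cluster multi-dimensional feature vectors (e.g. with scikit-learn's KMeans), and the data below is made up:

```python
import random

# Bare-bones k-means: alternate between assigning points to their
# nearest center and recomputing each center as its cluster's mean.
def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to the nearest center
            clusters[min(range(k), key=lambda i: abs(p - centers[i]))].append(p)
        # recompute centers; keep the old one if a cluster went empty
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))  # two well-separated cluster means
```

No label ever tells the algorithm which group a point belongs to; the two clusters emerge purely from the similarity structure of the data.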
39. Supervised vs. Unsupervised Learning
Classification vs. Clustering
• Supervised learning (classification)
• Supervision: the training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations.
• New data is classified based on the training set.
• Unsupervised learning (clustering)
• The class labels of the training data are unknown.
• Given a set of measurements, observations, etc., the aim is to establish
the existence of classes or clusters in the data.
40. Any Questions?