Introduction To Data Mining and Data Mining Techniques.pptx

1
Data, Text & Web Mining
Data Mining : Basic Concepts
Prof. Kirti Samrit

2
Data
The quantities, characters, or symbols on which operations are performed by a computer,
which may be stored and transmitted in the form of electrical signals and recorded on
magnetic, optical, or mechanical recording media.
Big Data
• Big Data is a collection of data that is huge in volume and keeps growing exponentially
with time.
• Its size and complexity are so great that traditional data management tools cannot
store or process it efficiently.
• Day-to-day work usually involves data measured in megabytes (Word documents, Excel
sheets) or at most gigabytes (movies, code bases); Big Data is commonly measured in
petabytes (10^15 bytes) and beyond.
• It is stated that almost 90% of today's data has been generated in the past 3 years.

3
Where does Big Data come from?
[Figure: sources of Big Data. Labels: Enterprise “Dark Data”; Partner, Employee; Customer, Supplier;
Public; Commercial; Social Media; Transactions; Monitoring; Population; Economic; Sentiment; Email;
Contracts; Network; Industry; Credit; Weather]

5
Sources of Big Data
• Social networking sites: Facebook, Google, and LinkedIn generate huge amounts of data
every day because they have billions of users worldwide.
• E-commerce sites: Amazon, Flipkart, and Alibaba generate huge volumes of logs from
which users' buying trends can be traced.
• Weather stations: Weather stations and satellites produce very large volumes of data,
which are stored and processed to forecast the weather.
• Telecom companies: Telecom giants such as Airtel and Vodafone study user trends and
design their plans accordingly; to do this, they store the data of millions of users.
• Share market: Stock exchanges across the world generate huge amounts of data through
their daily transactions.

7
Types Of Big Data
Following are the types of Big Data:
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed, and processed in a fixed format is termed ‘structured’ data.
Unstructured
Any data whose form or structure is unknown is classified as unstructured data. A typical example is a
heterogeneous data source containing a combination of plain text files, images, videos, etc.
Semi-structured
Semi-structured data can contain both forms of data.
It looks structured, but its structure is not defined by a fixed schema such as a table definition in a
relational DBMS.
An example of semi-structured data is data represented in an XML file.
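To make the XML example concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The customer records and field names are hypothetical, invented only for illustration. Notice that the two records share tags but not the same set of fields, which is what separates semi-structured data from a fixed table.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML snippet: records share tags but not a fixed schema.
XML_TEXT = """
<customers>
  <customer id="1"><name>Asha</name><email>asha@example.com</email></customer>
  <customer id="2"><name>Ravi</name><phone>555-0101</phone></customer>
</customers>
"""

def parse_customers(xml_text):
    """Turn each <customer> element into a dict of whatever fields it carries."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for node in root.findall("customer"):
        record = {"id": node.get("id")}
        for child in node:
            record[child.tag] = child.text
        records.append(record)
    return records

records = parse_customers(XML_TEXT)
for record in records:
    print(record)
```

Loading these records into a relational table would first require deciding what to do with the fields that some records lack.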

9
1. Cost Savings
Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving benefits to businesses when they have to
store large amounts of data.
These tools help organizations in identifying more effective ways of doing business.
2. Time-Saving
Real-time, in-memory analytics helps companies collect data from various sources. Tools like Hadoop help
them analyze the data quickly, which supports fast decisions based on what they learn.
3. Understand the market conditions
Big Data analysis helps businesses to get a better understanding of market situations.
For example, analyzing customer purchasing behavior helps companies identify their best-selling products
and produce those products accordingly. This helps companies get ahead of their competitors.
4. Social Media Listening
Companies can perform sentiment analysis using Big Data tools.
This lets them gather feedback about their company, that is, who is saying what about it.
Companies can use Big data tools to improve their online presence.

10
5. Boost Customer Acquisition and Retention
• Big data analytics helps businesses to identify customer related trends and patterns.
• Customer behavior analysis leads to a profitable business.
6. Solve Advertisers' Problems and Offer Marketing Insights
• Big data analytics shapes all business operations.
• It enables companies to fulfill customer expectations.
• Big data analytics helps in changing the company’s product line.
• It ensures powerful marketing campaigns.
7. Driver of Innovation and Product Development
Big data enables companies to innovate and redevelop their products.

11
Data Mining
Data mining, as the name suggests, “mines” data.
It is defined as finding hidden insights (information) in a database by
extracting patterns from the data.
Advantages of Data Mining:
• Data mining can uncover hidden patterns and relationships in data that are not
immediately apparent.
• Data mining can be used to identify new opportunities and improve business
processes.
• Data mining can help organizations predict future outcomes based on historical data.

12
Business Intelligence
Business Intelligence (BI) is a set of processes, architectures, and technologies
that convert raw data into meaningful information that drives profitable
business actions.
Advantages of Business Intelligence (BI):
• BI helps businesses improve decision-making by providing insights into historical
data.
• BI allows organizations to monitor their key performance indicators (KPIs) and track
progress toward their goals.
• BI tools are user-friendly and do not require specialized technical knowledge to use.

13
Data mining process
1. Data Pre-processing: data cleaning, integration, selection, and transformation take place.
2. Data Extraction: the actual data mining occurs.
3. Data Evaluation and Presentation: the results are analyzed and presented.
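The three stages can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the sales records are hypothetical, and the "mining" step is reduced to totaling units per product, which is far simpler than a real mining algorithm.

```python
from collections import Counter

# Hypothetical raw sales records; None marks a missing value.
RAW = [
    {"product": "tea ", "qty": 2},
    {"product": "Coffee", "qty": None},
    {"product": "tea", "qty": 3},
    {"product": "coffee", "qty": 1},
    {"product": "Tea", "qty": 2},
]

def preprocess(rows):
    """Cleaning and transformation: drop incomplete rows, normalize names."""
    return [
        {"product": r["product"].strip().lower(), "qty": r["qty"]}
        for r in rows
        if r["qty"] is not None
    ]

def extract(rows):
    """The mining step, here simply the total quantity sold per product."""
    totals = Counter()
    for r in rows:
        totals[r["product"]] += r["qty"]
    return totals

def present(totals):
    """Evaluation and presentation: rank products by units sold."""
    return [f"{name}: {qty}" for name, qty in totals.most_common()]

report = present(extract(preprocess(RAW)))
print("\n".join(report))
```

Keeping each stage as its own function makes it easy to swap in a more serious mining step without touching the cleaning or reporting code.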

18
Data Mining Techniques
1. Clustering
2. Association
3. Data Cleaning
4. Data Visualization
5. Classification
6. Machine Learning
7. Neural Networks
8. Outlier Detection
9. Prediction
10. Data Warehousing

19
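1. Clustering
Clustering groups data points so that items in the same cluster are more
similar to each other than to items in other clusters. The resulting groups
are often shown visually, such as in graphs of buying trends or sales
demographics for a particular product.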
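As an illustration, here is a minimal sketch of k-means, one common clustering algorithm (the slides do not name a specific one). The spend figures and the starting centroids are hypothetical, and the data is one-dimensional to keep the code short.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on numbers: assign to nearest centroid, then re-average."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical monthly spend per customer.
spend = [12, 15, 14, 90, 95, 88]
centers, groups = kmeans_1d(spend, centroids=[0, 100])
print(groups)
```

No labels are supplied anywhere: the low-spend and high-spend groups emerge from the data alone.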

20
2. Association
Association rules are used to find correlations, or associations, between items in a data set.
What Is Association in Data Mining?
Data miners use association to discover unique or interesting relationships between variables in databases.
Association is often employed to inform a company's market research and marketing strategy.
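Two standard measures for an association rule are support (how often the items appear together) and confidence (how often the rule holds when its left-hand side is present). The sketch below computes both for the rule "bread implies butter" over a hypothetical set of shopping baskets.

```python
# Hypothetical shopping baskets.
BASKETS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
    {"bread", "butter", "jam"},
]

def support(itemset, baskets):
    """Fraction of baskets containing every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets with the antecedent, the share that also hold the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"bread", "butter"}, BASKETS)
rule_confidence = confidence({"bread"}, {"butter"}, BASKETS)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Algorithms such as Apriori search for all rules above chosen support and confidence thresholds; this sketch only scores one rule.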

21
3. Data Cleaning
Data cleaning is the process of preparing data to be mined.
What Is Data Cleaning in Data Mining?
Data cleaning involves organizing data, eliminating duplicate or corrupted data, and filling in any null
values. When this process is complete, the most useful information can be harvested for analysis.
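A minimal sketch of these cleaning steps follows. The customer rows are hypothetical, and filling null ages with the mean is only one possible strategy, chosen here for brevity.

```python
# Hypothetical customer rows; None marks a null value.
ROWS = [
    {"id": 1, "city": " Pune", "age": 30},
    {"id": 2, "city": "Mumbai", "age": None},
    {"id": 1, "city": " Pune", "age": 30},  # duplicate of the first row
    {"id": 3, "city": "delhi ", "age": 40},
]

def clean(rows):
    """Drop duplicate ids, tidy the text, and fill null ages with the mean age."""
    seen, unique = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(dict(row, city=row["city"].strip().title()))
    known = [r["age"] for r in unique if r["age"] is not None]
    mean_age = sum(known) / len(known)
    return [dict(r, age=mean_age if r["age"] is None else r["age"]) for r in unique]

cleaned = clean(ROWS)
for row in cleaned:
    print(row)
```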

22
4. Data Visualization
Data visualization is the translation of data into graphic form to illustrate its meaning to business
stakeholders.
What Is Data Visualization in Data Mining?
Data can be presented in visual ways through charts, graphs, maps, diagrams, and more. This is a primary
way in which data scientists display their findings.

23
7. Neural Networks
Computers process large amounts of data much faster than human brains but don’t yet have the capacity to
apply common sense and imagination in working with the data. Neural networks are one way to help
computers reason more like humans.
What Are Neural Networks in Data Mining?
Artificial neural networks attempt to digitally mimic the way the human brain operates. Neural networks
combine many simple processing units (artificial neurons, loosely modeled on the brain's neurons) to
process data, make decisions, and learn from examples, as closely to the way a human would as possible.
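The smallest building block of such a network is a single artificial neuron. The sketch below trains one neuron (a perceptron) to reproduce the logical AND function. It is a toy illustration with assumed settings (zero starting weights, 20 passes over the data), not a full neural network.

```python
# Truth table for logical AND: ((input1, input2), expected output).
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def predict(weights, bias, inputs):
    """Fire (1) when the weighted sum of inputs plus bias is positive."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, epochs=20):
    """Perceptron rule: nudge the weights by the error on each sample."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + error * x for w, x in zip(weights, inputs)]
            bias += error
    return weights, bias

weights, bias = train_perceptron(AND_SAMPLES)
for inputs, target in AND_SAMPLES:
    print(inputs, "->", predict(weights, bias, inputs), "expected", target)
```

Real networks stack many such units in layers and use more refined update rules, but the idea of learning weights from errors is the same.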

26
5. Classification
• Classification is a fundamental technique in data mining and can be applied
to nearly every industry.
• It is a process in which data points from large data sets are assigned to
predefined categories based on their attributes.
What Is Classification in Data Mining?
• In data mining, classification is related to clustering, but the categories
are defined in advance and learned from labeled examples. It is useful for
extracting comparable points of data for comparative analysis.
• Classification is also used to designate broad groups within a demographic,
target audience, or user base through which businesses can gain stronger
insights.
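A minimal sketch of classification is the nearest-neighbor rule: a new data point receives the label of the most similar labeled example. The customer segments and feature values below are hypothetical.

```python
# Hypothetical labeled customers: (monthly visits, average basket value) -> segment.
TRAINING = [
    ((2, 20), "occasional"),
    ((3, 25), "occasional"),
    ((12, 80), "loyal"),
    ((15, 95), "loyal"),
]

def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(point, training):
    """Return the label of the closest labeled example (1-nearest neighbor)."""
    features, label = min(training, key=lambda ex: squared_distance(point, ex[0]))
    return label

print(classify((4, 30), TRAINING))
print(classify((13, 85), TRAINING))
```

Unlike the clustering technique, the categories here come from labeled training data rather than being discovered.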

27
6. Machine Learning
Machine learning is the process by which computers use algorithms to learn
on their own.
An increasingly relevant part of modern technology, machine learning makes
computers “smarter” by teaching them how to perform tasks based on the data
they have gathered.
What Is Machine Learning in Data Mining?
In data mining, machine learning’s applications are vast.
Machine learning and data mining fall under the umbrella of data science but
aren’t interchangeable terms.
For instance, data mining often applies machine learning algorithms to find
patterns in data.

28
8. Outlier Detection
Outlier detection is a key component of maintaining safe databases. Companies use it to
test for fraudulent transactions, such as abnormal credit card usage that might suggest theft.
What Is Outlier Detection in Data Mining?
• While other data mining methods seek to identify patterns and trends, outlier detection
looks for the unique: the data point or points that differ from the rest or diverge from the
overall sample.
• Outlier detection finds errors, such as data that was input incorrectly or extracted from
the wrong sample.
• Natural data deviations can be instructive as well.
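One simple way to flag such points is the z-score: how many standard deviations a value lies from the mean. The sketch below uses hypothetical card-spend amounts and an assumed threshold of 2.0. Real systems tune the threshold and often prefer more robust statistics.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return values whose z-score (distance from the mean in std devs) exceeds threshold."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical daily card spend for one account.
CARD_SPEND = [42, 38, 45, 40, 39, 41, 43, 900]
suspicious = find_outliers(CARD_SPEND)
print(suspicious)
```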

29
9. Prediction
Predictive modeling seeks to turn data into a projection of future action or
behavior.
These models examine data sets to find patterns and trends, then calculate
the probabilities of a future outcome.
What Is Prediction in Data Mining?
Predictive modeling is among the most common uses of data mining and
works best with large data sets that represent a broad sample size.
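A minimal sketch of a predictive model is a least-squares line fitted to past observations and then extended to a new input. The advertising and sales figures are hypothetical, and a straight line is an assumed model form.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

# Hypothetical history: advertising spend (in thousands) and units sold.
ad_spend = [1, 2, 3, 4, 5]
units_sold = [11, 14, 16, 19, 20]

slope, intercept = fit_line(ad_spend, units_sold)
forecast = intercept + slope * 6  # projected units at a spend of 6
print(f"slope={slope:.2f} intercept={intercept:.2f} forecast={forecast:.1f}")
```

With only five points the projection is fragile, which is why predictive modeling works best on large, representative data sets.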

33
10. Data Warehousing
Data warehousing is the process by which data is collected and stored before it is
evaluated.
What Is Data Warehousing in Data Mining?
• Data miners collect data from multiple sources into a common archive before it can
be used in business analysis.
• This process, called data warehousing, typically occurs before the data mining
process.
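The idea of a common archive can be sketched with Python's built-in sqlite3 module. The two source systems, their field names, and the table layout are all hypothetical. A production warehouse would use a dedicated platform, but the consolidation step looks similar.

```python
import sqlite3

# Hypothetical source systems that describe sales with different field names.
STORE_SALES = [("2024-01-05", "tea", 3), ("2024-01-05", "coffee", 1)]
WEB_ORDERS = [{"day": "2024-01-05", "item": "tea", "units": 2}]

def build_warehouse(store_sales, web_orders):
    """Load both sources into one common table in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (day TEXT, product TEXT, qty INTEGER, source TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, 'store')", store_sales)
    conn.executemany(
        "INSERT INTO sales VALUES (:day, :item, :units, 'web')", web_orders
    )
    conn.commit()
    return conn

conn = build_warehouse(STORE_SALES, WEB_ORDERS)
totals = conn.execute(
    "SELECT product, SUM(qty) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(totals)
```

Once the data sits in one table, later mining steps can query it without knowing which source each row came from.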

34
Data Mining Tools

35
1. Orange Data Mining:
• Orange is an open-source machine learning and data mining software suite.
• It supports data visualization.
• It is component-based software written in the Python programming language.
• It is developed at the bioinformatics laboratory of the Faculty of Computer and
Information Science, University of Ljubljana, Slovenia.
• Because it is component-based, the components of Orange are called
"widgets."
2. SAS Data Mining:
• SAS stands for Statistical Analysis System.
• It is a product of the SAS Institute created for analytics and data management.
• SAS can mine data, change it, manage information from various sources, and analyze
statistics.
• It offers a graphical UI for non-technical users.
• SAS data mining tools allow users to analyze big data and provide accurate insights for
timely decision-making.
• It is suitable for data mining, optimization, and text mining purposes.

36
3. Rattle:
• Rattle is a GUI-based data mining tool.
• It uses the R statistical programming language.
• Rattle exposes the statistical power of R by offering significant data mining features.
• It has an integrated Log tab that records the equivalent R code for every GUI operation.
• The data set produced by Rattle can be viewed and edited.
• Rattle also lets users review the code, reuse it for other purposes, and extend it
without restriction.
4. RapidMiner:
• RapidMiner is one of the most popular predictive analysis systems, created by the
company of the same name.
• It is written in the Java programming language.
• It offers an integrated environment for text mining, deep learning, machine learning,
and predictive analysis.
• The tool can be used for a wide range of applications, including business and
commercial applications, research, education, training, application development,
and machine learning.

37
5. Weka:
• Weka is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java
code.
• Weka contains tools for data pre-processing, classification, regression, clustering,
association rules, and visualization.

38
Business Analytics/Business Intelligence
Business Analytics/Business Intelligence (BI) is a broad category of applications,
technologies, and processes for:
• gathering,
• storing,
• accessing, and
• analyzing data
to help business users make better decisions.

39
Analytics Models
[Figure: analytics types plotted by VALUE against DIFFICULTY]
• Descriptive Analytics: What happened?
• Diagnostic Analytics: Why did it happen?
• Predictive Analytics: What will happen?
• Prescriptive Analytics

40
Descriptive Analytics
What has occurred?
• Descriptive analytics, such as data visualization, is important in helping users
interpret the output from predictive and prescriptive analytics.
• Descriptive analytics, such as reporting/OLAP, dashboards, and data visualization, have
been widely used for some time.
• They are the core of traditional BI.
41
Predictive Analytics
What will occur?
• Algorithms for predictive analytics, such as regression analysis, machine learning,
and neural networks, have also been around for some time.
• Marketing is the target for many predictive analytics applications.
• Descriptive analytics, such as data visualization, is important in helping
users interpret the output from predictive and prescriptive analytics.

42
Prescriptive Analytics
What should occur?
• Prescriptive analytics are often referred to as advanced analytics.
• They are often used for the allocation of scarce resources.
• Optimization is a typical technique.
Example: prescriptive analytics can benefit healthcare strategic planning. It combines
operational and usage data with data on external factors such as economic data, population
demographic trends, and population health trends. This helps planners plan future capital
investments, such as new facilities and equipment utilization, more accurately. It also
helps them understand the trade-offs between adding beds, expanding an existing facility,
and building a new one.

43
Most department stores clear seasonal inventory by
reducing prices.
Key question: When to reduce the price and by how much to
maximize revenue?
Potential applications of analytics:
• Descriptive analytics: examine historical data for similar products (prices, units
sold, advertising, …)
• Predictive analytics: predict sales based on price
• Prescriptive analytics: find the best combination of pricing and advertising to maximize
sales revenue
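The three kinds of analytics in this example can be sketched end to end. Everything below is an illustrative assumption: the price history is hypothetical, demand is assumed to fall linearly with price, and only price (not advertising) is optimized.

```python
# Hypothetical history for similar products: (price, units sold).
HISTORY = [(100, 20), (80, 40), (60, 60), (40, 80)]

def describe(history):
    """Descriptive: summarize what happened (average price and units)."""
    n = len(history)
    return sum(p for p, _ in history) / n, sum(u for _, u in history) / n

def fit_demand(history):
    """Predictive: least-squares line units = intercept + slope * price."""
    mean_p, mean_u = describe(history)
    slope = sum((p - mean_p) * (u - mean_u) for p, u in history) / sum(
        (p - mean_p) ** 2 for p, _ in history
    )
    return slope, mean_u - slope * mean_p

def best_price(history, candidates):
    """Prescriptive: pick the candidate price with the highest predicted revenue."""
    slope, intercept = fit_demand(history)
    return max(candidates, key=lambda p: p * (intercept + slope * p))

slope, intercept = fit_demand(HISTORY)
price = best_price(HISTORY, candidates=range(40, 101, 10))
revenue = price * (intercept + slope * price)
print(f"recommended price={price} predicted revenue={revenue:.0f}")
```

The prescriptive step is only as good as the predictive model behind it, so a poor demand forecast leads directly to a poor price recommendation.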

44
FRAUD
Definition:
The Association of Certified Fraud Examiners (ACFE) defines fraud as "the
use of one's occupation for personal enrichment through the deliberate
misuse or misapplication of the employing organization's resources or assets."
• Fraud detection has been implemented using a number of methods, such as
data mining, statistics, and artificial intelligence.
• Fraud is discovered from anomalies in data and patterns.

47
How does data mining help with fraud detection?
• Fraudulent activity often leaves a trail of digital breadcrumbs behind it, whether in financial
records, social media posts, or web-based user behaviour patterns.
• Data mining can help identify these patterns and alert us when something looks suspicious
or out of place.
• This allows us to take action before it becomes too late.
What are some use cases of Data Mining In Fraud Detection?
The most common use cases for fraud detection via data mining involve detecting fraudulent
transactions by looking for abnormal activity patterns such as spending habits and account
activity among multiple sources like credit card statements, POS terminals, online purchases,
etc.
What are the sophisticated algorithms used for fraud detection in data mining?
Sophisticated algorithms used for fraud detection in data mining include:
1. Logistic regression
2. Neural networks
3. Decision trees
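To give a feel for the decision-tree idea, the sketch below learns the simplest possible tree, a single split (often called a decision stump), from labeled transactions. The amounts and fraud labels are hypothetical, and real fraud models use many features and much deeper trees.

```python
# Hypothetical labeled transactions: (amount, is_fraud).
LABELED = [(20, 0), (35, 0), (50, 0), (65, 0), (400, 1), (520, 1), (700, 1)]

def fit_stump(examples):
    """Pick the amount threshold that misclassifies the fewest examples."""
    amounts = sorted(amount for amount, _ in examples)
    # Candidate thresholds: midpoints between neighboring amounts.
    candidates = [(lo + hi) / 2 for lo, hi in zip(amounts, amounts[1:])]

    def errors(threshold):
        return sum((amount > threshold) != bool(label) for amount, label in examples)

    return min(candidates, key=errors)

def is_suspicious(amount, threshold):
    """Flag a transaction whose amount exceeds the learned threshold."""
    return amount > threshold

threshold = fit_stump(LABELED)
print(threshold, is_suspicious(300, threshold), is_suspicious(45, threshold))
```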

48
Credit Card Fraud
Credit card fraud is divided into two types: Offline fraud and Online fraud.
Online fraud is where a fraudster commits the fraud via the phone or the Internet with the
card details.
Offline fraud is committed when a stolen card is used physically to pay for goods or
services.
Computer Intrusion
Unauthorized attempt to access information, manipulate information, or render a system
unreliable or unusable.

49
Telecommunication Fraud
It is classified into two categories: subscription fraud and superimposed fraud.
Subscription fraud occurs when someone obtains a subscription to a service, often with
false identity details, with no intention of paying.
Superimposed fraud occurs when someone uses a service without the necessary authority;
it is usually detected by the appearance of unknown calls on a bill.
This fraud takes several forms, for example mobile phone cloning, ghosting (technology
that tricks the network in order to obtain free calls), insider fraud, and tumbling
(rolling fake serial numbers on cloned handsets so that successive calls are attributed
to different legitimate phones).
