Presenter
Supervisor
Data Mining Role in
Telecommunications
February
2012
1. Preface
2. Data Mining Definition
3. Data Mining Activities
4. Data Mining Tasks
5. Data Mining Process
6. Data Mining Roots
7. Why Data Mining?
8. Indicators Of Data Quality
9. Data Warehouse Concept
10. Data Warehouse Development
11. Data Warehouse & Data Mining
1. Data Mining Applications
2. Telecom. & Data Mining
3. Types Of Telecom. Data
4. Telecom. Issues & Data Mining
1) Fraud Detection
2) Marketing/Customer Profiling
A. Customer Churn
B. Customer Insolvency
3) Network Fault Isolation
5. Conclusions
6. References
4/8/2019 Hunan University 2 - 58
• In many domains the underlying first principles are
unknown, or the systems under study are too complex
to be mathematically formalized. With the growing use
of computers, there is a great amount of data being
generated by such systems.
• In the absence of first-principle models, such readily
available data can be used to derive models by
estimating useful relationships between a system's
variables (i.e., unknown input-output dependencies).
• Thus there is currently a paradigm shift from classical
modeling and analyses based on first principles to
developing models and the corresponding analyses
directly from data.
PREFACE
Need for Analyses of Large, Complex, Information-Rich Data Sets
• Data Mining (DM) is an iterative process within which
progress is defined by discovery, through either
automatic or manual methods.
• DM is most useful in an exploratory analysis scenario in
which there are no predetermined notions about what
will constitute an "interesting" outcome.
• DM is the search for new, valuable, and nontrivial
information in large volumes of data.
• DM is a cooperative effort of humans and computers.
(Best results are achieved by balancing the
knowledge of human experts in describing problems
and goals with the search capabilities of computers.)
DATA MINING DEFINITIONS
1. Predictive DM: the goal is to produce a model,
expressed as executable code, which can be
used to perform classification, prediction, estimation,
or other similar tasks.
2. Descriptive DM: the goal is to gain an understanding
of the analyzed system by uncovering patterns and
relationships in large data sets.
• So:
a. The relative importance of Prediction and
Description for particular DM applications can
vary considerably.
b. The goals of Prediction and Description are
achieved by using DM techniques.
DATA MINING ACTIVITIES
1. Classification : discovery of a predictive learning function
that classifies a data item into one of several predefined
classes.
2. Regression : discovery of a predictive learning function,
which maps a data item to a real-valued prediction variable.
3. Clustering : a common descriptive task in which one seeks to
identify a finite set of categories or clusters to describe the
data.
4. Summarization : an additional descriptive task that involves
methods for finding a compact description for a set (or
subset) of data.
5. Dependency Modeling : finding a local model that
describes significant dependencies between variables or
between the values of a feature in a data set or in a part of
a data set.
6. Change and Deviation Detection : discovering the most
significant changes in the data set.
DATA MINING TASKS
Recognize the Iterative Character of A Data Mining Process and
Specify Its Basic Steps
1. State the problem and formulate the hypothesis
2. Collect the data
3. Preprocess the data
a) Outlier detection (and removal), using one of two strategies:
Detect and eventually remove outliers as a
part of the preprocessing phase.
Or develop robust modeling methods that are
insensitive to outliers.
b) Scaling, encoding, and selecting features.
4. Estimate the model
5. Interpret the model and draw conclusions
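The preprocessing step can be sketched as follows. This is only an illustration for a single numeric feature: the 1.5 x IQR fence (Tukey's rule), the min-max scaling, and the sample call durations are assumptions chosen for the example, not methods prescribed by these slides.

```python
import statistics

def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed default."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]

def min_max_scale(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical call durations in seconds; the last one is an obvious outlier.
durations = [62, 75, 58, 80, 71, 66, 69, 3600]
mask = iqr_outlier_mask(durations)
cleaned = [v for v, is_outlier in zip(durations, mask) if not is_outlier]
scaled = min_max_scale(cleaned)
```

Removing the outlier before scaling matters here: if 3600 were kept, min-max scaling would squeeze all ordinary durations into a tiny range near 0.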
DATA MINING PROCESS
• The root of the problem is that the data size and
dimensionality are too large for manual analysis and
interpretation, or even for some semiautomatic
computer-based analyses.
• A scientist or a business manager can work effectively
with a few hundred or thousand records.
• Effectively mining millions of data points, each
described with tens or hundreds of characteristics, is
another matter.
• Imagine the analysis of:
- terabytes (TB) of sky-image data with thousands of
photographic high-resolution images (23,040 x
23,040 pixels per image),
- or human genome databases with billions of
components.
DATA MINING ROOTS
[Figure: Growth Of Internet Hosts. Y-axis: host count, 0 to 1,000,000,000; x-axis: date, Jul-1993 to Jan-2012.]
Source: Internet Systems Consortium - www.isc.org/solutions/survey
• In theory, "BIG DATA" can lead to much stronger
conclusions, but in practice many difficulties arise.
• The business community is well aware of today's
information overload, and one analysis shows that:
Perceptions and Opinions of Managers | Believe | Don't Believe
Information overload is present in their own workplace. | 61% | 39%
The situation will get worse. | 80% | 20%
Ignore data in current decision-making processes because of the information overload. | > 50% | < 50%
Store this information for the future; it is not used for current analysis. | 84% | 16%
The cost of gathering information outweighs its value. | 60% | 40%
WHY DATA MINING?
• What are the solutions?
1. Work harder.
(But how long can you keep up, because the
limits are very close).
2. Employ an assistant.
(Maybe, if you can afford it).
3. Ignore the data.
(But then you are not competitive in the
market).
• The Only Real Solution:
Replace the classical data analysis and
interpretation methodologies (both Manual &
Computer-Based) with a new DM Technology.
1. Data should be Accurate.
2. Data should be Stored according to Data Type.
3. Data should have Integrity.
4. Data should be Consistent.
5. Data should not be Redundant.
6. Data should be Timely.
7. Data should be Well Understood.
8. Data set should be Complete.
INDICATORS OF DATA QUALITY
• Definition:
The Data Warehouse (DW) is a collection of
integrated, Subject-Oriented Databases
designed to support the Decision-Support
Functions (DSF), where each unit of data is
relevant to some moment in time.
Types of Elementary or Derived Data
1. Old detail data.
2. New detail data.
3. Lightly summarized data.
4. Highly summarized data.
5. Metadata (the data directory or guide)
DATA WAREHOUSE CONCEPT
• To prepare these five types of elementary or
derived data in a DW, the fundamental types
of Data Transformation (DT) are standardized.
• There are four main types of Transformations:
1. Simple Transformations.
2. Cleansing and Scrubbing.
3. Integration.
4. Aggregation and Summarization.
A Three-Stage Data-Warehousing Development Process
1. Modeling:
In simple terms, to take the time to understand business
processes, the information requirements of these processes,
and the decisions that are currently made within processes.
2. Building:
a. To establish requirements for tools that suit the types of
decision support necessary for the targeted business
process.
b. To create a data model that helps further define
information requirements.
c. To decompose problems into data specifications and
the actual data store, which will, in its final form,
represent either a data mart or a more comprehensive
DW.
DATA WAREHOUSE DEVELOPMENT
3. Deploying:
a. To implement, relatively early in the overall
process, the nature of the data to be
warehoused and the various business intelligence
tools to be employed.
b. To begin by training users. The deploy stage
explicitly contains a time during which users
explore both the repository (to understand data
that are and should be available) and early
versions of the actual DW. This can lead to an
evolution of the DW, which involves adding more
data, extending historical periods, or returning to
the build stage to expand the scope of the DW
through a data model.
How is DM different from SQL & OLAP, which are also applied to DW?
• SQL is a standard relational database language that
is good for queries that impose some kind of
constraints on data in the database in order to
extract an answer.
• In contrast, DM methods are good for queries that are
exploratory in nature, trying to extract hidden, not so
obvious information.
• SQL is useful when we know exactly what we are
looking for and we can describe it formally.
• DM methods will be used when we know only vaguely
what we are looking for.
• Therefore, these two classes of DW applications are
complementary.
DATA WAREHOUSE & DATA MINING
• OLAP tools and methods have become very popular in
recent years as they let users analyze data in a
warehouse by providing multiple views of the data,
supported by advanced graphical representations. In
these views, different dimensions of data correspond
to different business characteristics.
• OLAP tools make it very easy to look at dimensional
data from any angle or to slice-and-dice it.
• Although OLAP tools, like DM tools, provide answers
that are derived from data, the similarity between
them ends here.
• OLAP tools do not learn from data, nor do they
create new knowledge. They are usually
special-purpose visualization tools that can
help end-users draw their own conclusions and
decisions, based on graphically condensed
data.
• OLAP tools are very useful for the DM process;
they can be a part of it but they are not a
substitute.
• DM is a young discipline with wide and
diverse applications.
• There is still a nontrivial gap between
general principles of DM and domain-specific,
effective DM tools for particular
applications.
• Some application domains:
a. Biomedical and DNA Data Analysis
b. Financial Data Analysis
c. Retail Industry
d. Telecommunication Industry
DATA MINING APPLICATIONS
• A rapidly expanding and highly competitive
industry with a great demand for DM:
- Understand the Business Involved
- Identify Telecommunication Patterns
- Catch Fraudulent Activities
- Make Better Use of Resources
- Improve The Quality Of Service (QoS)
• Multidimensional analysis of telecom. data:
- Intrinsically multidimensional:
calling time, duration, location of caller, type of call,
etc.
TELECOMMUNICATIONS & DATA MINING
• Telecommunication companies generate and deal
with a huge amount of data.
• These data include:
1. Call Detail Data: which describes the calls that traverse the
telecommunication networks.
2. Network Data: which describes the state of the hardware and
software components in the network.
3. Customer Data: which describes the telecommunication
customers.
• Several DM applications in the telecommunication
industry can be used to:
1. Identify telecommunication fraud.
2. Improve marketing effectiveness.
3. Identify network faults.
TYPES OF TELECOMMUNICATIONS DATA
1. Call Detail Data
• Every time a call is placed on
a telecommunications
network, descriptive
information about the call is
saved as a call detail record.
• The number of call detail
records that are generated
and stored is huge. For
example, AT&T long distance
customers alone generate
over 300 million call detail
records per day (Cortes &
Pregibon, 2001).
• Given that several months of
call detail data is typically
kept online, this means that
tens of billions of call detail
records will need to be stored
at any time.
• Below is a list of features that one might
use when generating a summary
description of a customer based on the
calls they originate and receive over
some time period P:
1. Average Call Duration
2. % No-answer Calls
3. % Calls to/from a Different Area Code
4. % Of Weekday Calls (Monday – Friday)
5. % Of Daytime Calls (9am – 5pm)
6. Average # Calls Received Per Day
7. Average # Calls Originated Per Day
8. # Unique Area Codes Called During P
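As an illustration, the eight summary features could be computed from raw call detail records roughly as follows. The `Call` record layout, the use of the first three digits as the area code, and the sample numbers are all assumptions made for this sketch; real call detail records carry many more fields.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Call:
    # Hypothetical, simplified call detail record.
    caller: str
    callee: str
    start: datetime
    duration_s: int
    answered: bool

def area_code(number):
    # Assumption: the first three digits of the number are the area code.
    return number[:3]

def summarize(customer, calls, days_in_period):
    """Summarize the calls a customer originated or received during period P."""
    mine = [c for c in calls if customer in (c.caller, c.callee)]
    if not mine:
        return None
    n = len(mine)
    originated = [c for c in mine if c.caller == customer]
    received = [c for c in mine if c.callee == customer]

    def pct(count):
        return 100.0 * count / n

    return {
        "avg_call_duration_s": sum(c.duration_s for c in mine) / n,
        "pct_no_answer": pct(sum(not c.answered for c in mine)),
        "pct_other_area_code": pct(
            sum(area_code(c.caller) != area_code(c.callee) for c in mine)),
        "pct_weekday": pct(sum(c.start.weekday() < 5 for c in mine)),
        "pct_daytime": pct(sum(9 <= c.start.hour < 17 for c in mine)),
        "avg_received_per_day": len(received) / days_in_period,
        "avg_originated_per_day": len(originated) / days_in_period,
        "unique_area_codes_called": len({area_code(c.callee) for c in originated}),
    }

calls = [
    Call("2125550001", "2125550002", datetime(2012, 2, 6, 10, 0), 120, True),
    Call("2125550001", "4155550003", datetime(2012, 2, 11, 20, 0), 0, False),
    Call("6175550004", "2125550001", datetime(2012, 2, 7, 14, 30), 60, True),
    Call("3035550005", "3035550006", datetime(2012, 2, 8, 9, 0), 30, True),
]
features = summarize("2125550001", calls, days_in_period=7)
```

Each customer is reduced to one fixed-length record, which is the form most classification and clustering methods expect.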
During the period from 9am to 4pm, businesses
place roughly 1.5 times as large a share of their
total weekday calls as residences do.
Note that at 5pm the ratio is close to
1, indicating that calls during this
timeframe are not very useful for
distinguishing between a Business
and a Residence.
Calls in the evening
timeframe (6pm – 1am) are
also useful in distinguishing
between the two types of
customers.
• Telecommunication networks are extremely complex
configurations of equipment, comprised of thousands of
interconnected components.
• Each network element is capable of generating error and
status messages, which leads to a tremendous amount of
network data.
• This data must be stored and analyzed in order to support
network management functions, such as fault isolation.
• As was the case with the call detail data, network data is
also generated in real-time as a data stream and must often
be summarized in order to be useful for data mining.
• This is sometimes accomplished by applying a time window
to the data. For example, such a summary might indicate
that a hardware component experienced twelve instances
of a power fluctuation in a 10-minute period.
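A time-window summary of this kind might be sketched as below. The event tuple layout, the component name `PSU-7`, and the choice of fixed, non-overlapping windows are assumptions for illustration; a production system could equally use sliding windows.

```python
from collections import Counter
from datetime import datetime, timedelta

def window_counts(events, window=timedelta(minutes=10)):
    """Count (component, message) occurrences per fixed, non-overlapping window."""
    epoch = datetime(1970, 1, 1)
    counts = Counter()
    for timestamp, component, message in events:
        # Integer index of the window that contains this timestamp.
        index = (timestamp - epoch) // window
        window_start = epoch + index * window
        counts[(window_start, component, message)] += 1
    return counts

# Hypothetical status messages: twelve power fluctuations within one window,
# followed by one more in the next window.
start = datetime(2012, 2, 1, 12, 0)
events = [(start + timedelta(seconds=45 * i), "PSU-7", "power_fluctuation")
          for i in range(12)]
events.append((start + timedelta(minutes=10), "PSU-7", "power_fluctuation"))

summary = window_counts(events)
```

The raw message stream collapses into one count per component, message type, and window, which is small enough to feed into a mining algorithm.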
2. Network Data
• Telecommunication companies, like other large businesses,
may have millions of customers, and this means maintaining
a database of information on these customers.
• This information will include name and address information
and may include other information such as service plan and
contract information, credit score, family income and
payment history.
• This information may be supplemented with data from external
sources, such as credit reporting agencies. The
customer data maintained by telecommunication companies
does not substantially differ from that maintained in most other
industries; however, customer data is often used in conjunction
with other data in order to improve results.
• For example, customer data is typically used to supplement
call detail data when trying to identify phone fraud.
3. Customer Data
1. Fraud Detection
2. Marketing/Customer Profiling
A. Customer Churn
B. Customer Insolvency
3. Network Fault Isolation
TELECOMMUNICATIONS ISSUES & DATA MINING
1. Fraud Detection
• Fraud is a serious problem for telecommunication companies,
leading to billions of dollars in lost revenue each year.
• Fraud can be divided into two categories:
1. Subscription fraud: occurs when a customer opens an
account with the intention of never paying for the
account charges.
2. Superimposition fraud: involves a legitimate account with
some legitimate activity, but also includes some
“superimposed” illegitimate activity by a person other than
the account holder.
• The most common method for identifying fraud is to build a
profile of each customer’s calling behavior and compare recent
activity against this profile.
• Thus, this Data Mining application relies on deviation
detection. The calling behavior is captured by summarizing
the call detail records for a customer.
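A minimal sketch of profile-based deviation detection, assuming the profile is a history of one daily summary value (here, hypothetical international minutes per day) and that a deviation of more than three standard deviations counts as suspicious. Both the statistic and the threshold are illustrative choices, not the method of any particular fraud system.

```python
import statistics

def deviation_score(history, recent):
    """Number of standard deviations between recent activity and the profile."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return 0.0 if recent == mean else float("inf")
    return abs(recent - mean) / stdev

def is_suspicious(history, recent, threshold=3.0):
    # threshold=3.0 is an assumed cutoff, to be tuned on real data.
    return deviation_score(history, recent) > threshold

# Hypothetical daily international minutes for one customer.
profile = [4, 6, 5, 7, 3, 5, 6, 4]
flag_spike = is_suspicious(profile, 45)
flag_normal = is_suspicious(profile, 6)
```

A flagged deviation is only a candidate: as the slides note later, new behavior does not necessarily imply fraud.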
• Fraudulent pattern analysis and the identification of unusual
patterns:
- Identify potentially fraudulent users and their atypical
usage patterns.
- Detect attempts to gain fraudulent entry to customer
accounts.
- Discover unusual patterns which may need special
attention.
• Multidimensional association and sequential pattern analysis:
- Find usage patterns for a set of communication services
by customer group, by month, etc.
- Promote the sales of specific services.
- Improve the availability of particular services in a region.
• Use of visualization tools in telecommunication data analysis.
• If the call detail summaries are updated in real-time,
fraud can be identified soon after it occurs.
• Because new behavior does not necessarily imply fraud,
one fraud-detection system augments this basic
approach by comparing the new calling behavior to
profiles of generic fraud—and only signals fraud if the
behavior matches one of these profiles (Cortes &
Pregibon, 2001).
• For example, one sample rule that combines call detail
and customer-level data for detecting cellular fraud is:
“People who have a price plan that makes international
calls expensive and who display a sharp rise in
international calls are likely the victim of cloning fraud.”
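The sample rule can be written as a simple predicate. The numeric thresholds below (what counts as an "expensive" plan and as a "sharp rise") are hypothetical; the rule as quoted does not give them.

```python
# All three thresholds are assumptions for illustration only.
EXPENSIVE_INTL_RATE = 1.00   # price per international minute
SHARP_RISE_FACTOR = 3.0      # recent usage vs. baseline usage
MIN_RECENT_MINUTES = 30      # ignore tiny absolute volumes

def likely_cloning_victim(intl_rate_per_min, baseline_intl_minutes,
                          recent_intl_minutes):
    """Combine customer-level data (price plan) with call detail data (usage)."""
    expensive_plan = intl_rate_per_min >= EXPENSIVE_INTL_RATE
    sharp_rise = (
        recent_intl_minutes >= MIN_RECENT_MINUTES
        and recent_intl_minutes >= SHARP_RISE_FACTOR * max(baseline_intl_minutes, 1)
    )
    return expensive_plan and sharp_rise

case_flagged = likely_cloning_victim(1.50, 5, 120)
case_cheap_plan = likely_cloning_victim(0.10, 5, 120)
case_heavy_user = likely_cloning_victim(1.50, 100, 120)
```

The plan price comes from customer data and the usage figures from call detail summaries, which is why the rule needs both sources.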
• Because fraud is relatively rare and the number of verified fraudulent
calls is relatively low, the fraud application involves predicting
a relatively rare event where the underlying class distribution is
highly skewed.
• DM algorithms often have great difficulty dealing with highly
skewed class distributions and predicting rare events.
• The fraud problems reviewed here share several common
characteristics, which are summarized below:
1. Significant loss of revenue.
2. Unpredictability of human behavior.
3. Information can be retrieved only after processing
huge amounts of data.
4. Fraudulent cases are rare compared to the
legitimate ones.
• Similar fraud problems appear in the mobile communications,
conventional communications and credit card sectors:
1. Fraud In Mobile Communications.
2. Fraud In Conventional Communications.
3. Fraud In Credit Card Usage.
Example (1): if fraud makes up only 0.2% of all calls,
many data mining systems will not generate any rules for
finding fraud, since a default rule, which never predicts
fraud, would be 99.8% accurate. To deal with this issue,
the training data is often selected to increase the
proportion of fraudulent cases.
Example (2): Ezawa & Norton (1995) increased the
percentage of fraudulent calls in the training data from 1-2% to 9-12%.
However, the use of a non-representative training set
can be problematic because it does not provide the DM
method with accurate information about the true class
distribution (Weiss and Provost, 2003).
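The arithmetic behind the default rule, and one way to raise the share of fraud in a training set, can be sketched as follows. Undersampling the legitimate calls is an assumed technique chosen for the example; the slides do not say how the cited study selected its training data.

```python
import random

def default_rule_accuracy(fraud_rate):
    """Accuracy of a rule that always predicts 'not fraud'."""
    return 1.0 - fraud_rate

def rebalance(records, target_fraud_share, seed=0):
    """Keep all fraud cases and undersample legitimate ones (assumed approach)."""
    fraud = [r for r in records if r["fraud"]]
    legit = [r for r in records if not r["fraud"]]
    keep = round(len(fraud) * (1 - target_fraud_share) / target_fraud_share)
    rng = random.Random(seed)
    return fraud + rng.sample(legit, min(keep, len(legit)))

# Synthetic data: 10,000 calls, of which 20 (0.2%) are fraudulent.
records = [{"id": i, "fraud": i < 20} for i in range(10_000)]
baseline = default_rule_accuracy(20 / 10_000)   # about 0.998
balanced = rebalance(records, target_fraud_share=0.10)
fraud_share = sum(r["fraud"] for r in balanced) / len(balanced)
```

The rebalanced set gives a learner enough fraud cases to form rules, at the cost of misrepresenting the true class distribution, which is the drawback noted above.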
• Telecommunication companies maintain a
great deal of data about their customers. In
addition to the general customer data that
most businesses collect, telecommunication
companies also store call detail records,
which precisely describe the calling behavior
of each customer.
• This information can be used to profile the
customers and these profiles can then be
used for marketing and/or forecasting
purposes.
2. Marketing/Customer Profiling
Example-1 (MCI): AT&T (1991) used a rate-averaging system;
calls between areas of similar distance were billed the same,
regardless of actual transmission costs. This system allowed
sparsely populated areas to enjoy the same low phone rates as
high-volume, major cities.
(http://www.hagley.org/library/exhibits/Mci/earlyyearsmci.html).
The MCI Friends and
Family promotion
relied on DM to
identify associations
within data.
• Example-2: Another marketing application that
relies on this technique is a DM application for
finding the set of Non-U.S. countries most often
called together by U.S. telecommunication
customers (Cortes & Pregibon, 2001).
• One set of countries identified by this DM
application is: {Jamaica, Antigua, Grenada,
Dominica}.
• This information is useful for establishing and
marketing international calling plans.
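Finding countries that are often called together is an association-style task. The sketch below counts co-occurring pairs across customers and keeps those above a support threshold; the toy data, the restriction to pairs, and the threshold are assumptions, and the cited application may well have used a different algorithm.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return country pairs called by at least min_support of the customers."""
    counts = Counter()
    for countries in baskets:
        for pair in combinations(sorted(set(countries)), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical data: the set of countries each customer called.
baskets = [
    {"Jamaica", "Antigua", "Grenada"},
    {"Jamaica", "Antigua"},
    {"Jamaica", "Dominica", "Antigua"},
    {"France", "Germany"},
    {"Jamaica", "Grenada"},
]
pairs = frequent_pairs(baskets, min_support=0.4)
```

Pairs that clear the threshold are candidates for a bundled calling plan; larger sets can be built up from frequent pairs in the same way.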
• Customer Churn involves a customer leaving one
telecommunication company for another.
• Customer Churn is a significant problem because of the
associated loss of revenue and the high cost of attracting
new customers.
• Data Mining techniques now permit companies to
mine historical data in order to predict when a customer is
likely to leave.
• These techniques typically utilize billing data, call detail data,
subscription information (calling plan, features, contract
expiration data) and customer information (e.g., Age).
• Based on the induced Model, the company can then take
Action, if desired.
A. Customer Churn
Example: A wireless company might offer a customer a free phone
for extending their contract. One such effort utilized a neural
network to estimate the probability h(t) of cancellation at a given
time t in the future (Mani, Drew, Betz & Datta, 1999).
• In order to effectively mine the call detail data, it must be
summarized to the customer level (Business or Residence).
• Then, a Classifier Induction Program can be applied to a set of
labeled training examples in order to Build A Classifier. This
approach has been used to identify Fax Lines (Kaplan, Strauss &
Szegedy, 1999) and to classify a phone line as belonging to a
Business or Residence (Cortes & Pregibon, 1998).
• Other applications have used this approach to identify phone
lines belonging to Telemarketers and to classify a phone line as
being used for Voice, Data, Or Fax.
• These rules were generated using SAS Enterprise Miner, a
sophisticated Data Mining Package that supports Multiple Data
Mining Techniques.
• The rules shown here were generated using a Decision Tree
Learner. However, a Neural Network was also used to predict the
probability of a customer being a Business or Residential
customer, based solely on the distribution of calls by time of day
(i.e., the Neural Network had 24 Inputs, One per Hour of The Day).
Pseudo-Code
• Rule 1: if < 43% of calls last 0-10 seconds and < 13.5% of calls occur during the
weekend and neural network says that P(business) > 0.58 based on time of
day call distribution then Business Customer.
• Rule 2: if calls received over two-month period from at most 3 unique area
codes and <56.6% of calls last 0-10 seconds then Residential Customer.
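Rule 1 and Rule 2 can be expressed as a small runnable function. The parameter names, the order in which the rules are tried, and the "unknown" fallback are assumptions for this sketch; `p_business` stands for the output of the neural network, which is not reproduced here.

```python
def classify_line(pct_short_calls, pct_weekend_calls, p_business,
                  unique_area_codes_received):
    """Apply the two quoted rules; percentages are on a 0-100 scale.

    pct_short_calls: share of calls lasting 0-10 seconds.
    p_business: neural-network estimate of P(business) from the
                time-of-day call distribution (supplied by the caller).
    unique_area_codes_received: distinct area codes of calls received
                over the two-month period.
    """
    # Rule 1: business customer.
    if pct_short_calls < 43.0 and pct_weekend_calls < 13.5 and p_business > 0.58:
        return "business"
    # Rule 2: residential customer.
    if unique_area_codes_received <= 3 and pct_short_calls < 56.6:
        return "residential"
    # Neither rule fires (assumed fallback, not part of the quoted rules).
    return "unknown"

label_a = classify_line(20.0, 5.0, 0.80, 12)
label_b = classify_line(50.0, 30.0, 0.20, 2)
label_c = classify_line(60.0, 30.0, 0.20, 9)
```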
Model-Building to Predict Churn
• BSNL Satara, like other telecom companies, suffers from churning
customers who use the provided services without paying their
dues. It provides many services, such as Internet, fax, postpaid and
prepaid mobile phones, and fixed phones. The researchers
focus only on postpaid phones with respect to churn
prediction, which is the purpose of this research work.
• In essence, many aggregated attributes containing the lengths of
calls made by every customer over a five-month period were created.
• The model predicts customers who are at risk of leaving a
company in the telecommunication sector.
• Using this model, the company will be able to identify such
customers.
[Figure: Difference Between Churning And Not Churning Customers. X-axis: weeks 1 to 4; y-axis: 0 to 180; series: Not Churning, Possible Churning.]
Customers | 1st Week | 2nd Week | 3rd Week | 4th Week
Not Churning | 102.5 | 89.2 | 96.1 | 117.2
Possible Churning | 71 | 109.5 | 131.5 | 174
• The major task to be performed at this stage was creating and training a
decision support system that can discriminate between churning and not
churning customers.
• For this purpose, several data mining tools were
available, and MATLAB was found to be the most suitable
because it supports many algorithms and includes the Neural Network
Toolbox.
• The algorithm used for this research work is the Back Propagation
Algorithm (BPA). While building the model, the whole dataset was divided into
three subsets.
• These subsets were the training set, the validation set, and the test set.
• The training set is used to train the network.
• The validation set is used to monitor the error during the training process.
• The test set is used to compare the Performance of the Model.
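A three-way split of this kind can be sketched in a few lines. The 70/15/15 proportions and the fixed random seed are assumptions for illustration; the slides do not state the proportions used in the study.

```python
import random

def three_way_split(rows, train=0.70, validation=0.15, seed=0):
    """Shuffle rows and split them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],                   # used to fit the network weights
        shuffled[n_train:n_train + n_val],    # used to monitor error during training
        shuffled[n_train + n_val:],           # held out to assess the final model
    )

train_set, validation_set, test_set = three_way_split(range(100))
```

Keeping the test set untouched until the end is what makes its error an honest estimate of performance on new customers.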
• Telecommunication companies, as well as other service-
providing companies, often suffer from Insolvent Customers
who use the provided services without paying their dues.
• Companies In Telecommunications Business take precautions
against these customers; however, in most cases this refers to
measures applied quite late, often with no significant effect.
• As a result Many Unpaid Bills end up in the account of
uncollectible debts.
• Detection and prevention of this behavior is an objective of
prime importance for the industry.
• This is especially important now that strong competition is
building in many sectors of the Telecommunications Industry
as monopolies cease to exist.
B. Customer Insolvency
Model-Building to Predict Insolvency
• A model capable of predicting the insolvent behavior
of customers well in advance can
be a useful decision support tool for a service-providing
company.
• However, building such a model is not a trivial
process.
• The objective is to study the feasibility of
building such a tool for a major Telecommunications
Operator using Data Mining Techniques.
• The relevant data are in most cases dispersed, but when inter-related they may contain
valuable information relating to the Insolvency-Prediction
Problem.
• Examples of such data are:
1. Customer Profiles.
2. Usage of the Offered Service.
3. Financial Transactions of the Customers with the Company.
• Like fraudsters in other Fraud Detection Problems, it is assumed
that Insolvent Customers on the average behave differently
from the rest of the customers, especially during a critical
period preceding the due-date for payment.
• The goal is therefore to reveal those Behavioral Patterns, which
may distinguish these customers from the rest.
Model-Building Difficulties
• Some of the inherent limitations that make this a particularly
difficult problem, as identified early in the study, were the
following:
A. Insolvent customer’s behavior can be either the result of
fraudulence or due to factors beyond the customer’s will
(force majeure, social factors etc.).
B. The available data sets, as often is the case with real life
problems, represent the individual customer in a limited
and distorted way.
C. In the available large data set, many parameters can be
defined, often deduced from primary transactional data,
which can describe the behavior of a customer of a
Modern Telecommunications Company.
Model Stages
1. Learning the application domain
2. Creating a target data set
3. Data cleaning and preprocessing
4. Data reduction and projection
5. Choosing the function of data mining
6. Choosing the data mining algorithm(s)
7. Data Mining
8. Interpretation
9. Using discovered knowledge
• The study reported in this Model can be described as a KDD experiment.
• Following a typical KDD framework (Fayyad et al., 1996; Han and
Kamber, 2001), where Data Mining is the core in the overall process, the
experiment went through all nine steps of Fig. 1, starting from the stage of
gaining profound knowledge of the domain till the actual use of
discovered knowledge.
Problem Definition and Application Domain
• Insolvency prediction is the ability to identify the customers who are
about to become insolvent, that is, the customers who will refuse to pay
their telephone bills by the next due date for payment, while there is
still time for preventive (and possibly aversive) measures.
• Insolvency prediction makes sense in business terms only if it can take place
early enough to be of use to the company.
• For this problem, three objectives were set as most prevailing for the
company:
1. Detection of as many insolvent customers as possible.
2. Minimization of the false alarms, i.e. the number of solvent customers that
are falsely classified as insolvent.
3. Timely warning to the service provider in order to take action against a
prospective insolvent customer.
Using Discovered Knowledge for Insolvency Prediction
• In the experiments and by the case-by-case comparison, every case
(customer) was examined separately.
• Then the following scheme was followed: if all three algorithms agreed on
one classification (solvent or insolvent), then the customer was labeled as
such. Otherwise, if any one of the classifiers did not agree with the
others, the customer was considered unclassified.
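The unanimous-agreement scheme can be sketched as follows. The three classifiers themselves are not reproduced; each customer is represented only by the three labels the classifiers produced, and the sample votes are invented for illustration.

```python
def combine(votes):
    """Return the shared label if all classifiers agree, else 'unclassified'."""
    return votes[0] if len(set(votes)) == 1 else "unclassified"

# Hypothetical per-customer outputs of the three classifiers.
predictions = {
    "customer_1": ("insolvent", "insolvent", "insolvent"),
    "customer_2": ("solvent", "solvent", "solvent"),
    "customer_3": ("solvent", "insolvent", "solvent"),
}
labels = {cid: combine(votes) for cid, votes in predictions.items()}
```

Requiring unanimity trades coverage for precision: fewer customers receive a label, but a label is issued only when no classifier dissents, which serves the objective of minimizing false alarms.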
Combined Classification Results in a Case-by-case Comparison
Actual | Predicted Insolvent | Predicted Solvent | Unclassified | Total
Cases Used For Training
Insolvent | 43 (31.6%) | 29 (21.3%) | 64 (47.1%) | 136
Solvent | 0 (0%) | 1174 (96.8%) | 38 (3.2%) | 1212
Cases Used For Testing
Insolvent | 19 (29.69%) | 17 (26.56%) | 28 (43.75%) | 64
Solvent | 1 (0.15%) | 626 (95.72%) | 27 (4.13%) | 654
90.28% of cases used for training were correctly classified.
89.83% of cases used for testing were correctly classified.
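The two overall percentages can be reproduced from the counts in the table, treating unclassified cases as not correctly classified. The dictionary layout below is just one convenient encoding of the table.

```python
# Counts copied from the table: outer key = actual class, inner key = outcome.
training = {
    "insolvent": {"insolvent": 43, "solvent": 29, "unclassified": 64},
    "solvent": {"insolvent": 0, "solvent": 1174, "unclassified": 38},
}
testing = {
    "insolvent": {"insolvent": 19, "solvent": 17, "unclassified": 28},
    "solvent": {"insolvent": 1, "solvent": 626, "unclassified": 27},
}

def accuracy(table):
    """Share of all cases whose predicted label equals the actual label."""
    correct = sum(row[actual] for actual, row in table.items())
    total = sum(sum(row.values()) for row in table.values())
    return correct / total

def detection_rate(table, actual):
    """Share of cases of one actual class that were labeled correctly."""
    return table[actual][actual] / sum(table[actual].values())

train_accuracy_pct = round(100 * accuracy(training), 2)
test_accuracy_pct = round(100 * accuracy(testing), 2)
test_insolvent_detected_pct = round(100 * detection_rate(testing, "insolvent"), 2)
```

The overall accuracy is dominated by the much larger solvent class: in the test set only 19 of the 64 insolvent customers were labeled insolvent, even though about 90% of all cases were classified correctly.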
1. The DM experiment was a multi-stage process.
2. Various algorithms were tested in order to establish which one is most
suitable for the problem and this particular dataset.
a. For the specific problem of customer insolvency in telecommunication
companies, the experiments were highly reusable, except for deviations in
the syntactic and semantic definition of the operators' data.
b. A parametric tool to handle such deviations is one of the objectives of
further research.
3. Network Fault Isolation
• Telecommunication networks are extremely complex
configurations of hardware and software.
• Most of the network elements are capable of at least
limited self-diagnosis, and these elements may
collectively generate millions of status and alarm
messages each month. To manage the network effectively,
alarms must be analyzed automatically so that network
faults are identified in a timely manner, or before they
occur and degrade network performance.
• A proactive response is essential to maintaining the
reliability of the network. Because of the volume of the
data, and because a single fault may cause many
different, seemingly unrelated, alarms to be generated,
the task of network fault isolation is quite difficult.
Data Mining Role for Identifying Faults
• Telecommunication Alarm Sequence Analyzer (TASA) is one
tool that helps with the knowledge acquisition task for alarm
correlation (Klemettinen, Mannila & Toivonen, 1999).
• TASA automatically discovers recurrent patterns of alarms within
the network data along with their statistical properties, using a
specialized Data Mining Algorithm.
• Network specialists then use this information to construct a rule-
based alarm correlation system, which can then be used in real-
time to identify faults.
• TASA is capable of finding episodic rules that depend on temporal
relationships between the alarms. For example, it may discover the
following rule: “if alarms of type link alarm and link failure occur within 5
seconds, then an alarm of type high fault rate occurs within 60 seconds
with probability 0.7.”
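To make the form of such a rule concrete, the sketch below estimates the probability attached to a rule of this shape from a list of time-stamped alarms. It is not TASA's mining algorithm, only a naive check of a single, already given rule, and it rests on simplifying assumptions: an occurrence is anchored at each alarm of the first antecedent type, and both time windows are measured from that anchor. The alarm log and all names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Alarm = Tuple[float, str]  # (timestamp in seconds, alarm type)


def occurs(alarms: Sequence[Alarm], kind: str, start: float, end: float) -> bool:
    """True if an alarm of the given type falls inside [start, end]."""
    return any(k == kind and start <= ts <= end for ts, k in alarms)


def rule_confidence(
    alarms: Sequence[Alarm],
    antecedent: Sequence[str],
    consequent: str,
    antecedent_window: float,
    consequent_window: float,
) -> Optional[float]:
    """Fraction of antecedent occurrences that are followed by the consequent."""
    first, *rest = antecedent
    matches = hits = 0
    for t, kind in alarms:
        if kind != first:
            continue
        # All remaining antecedent alarm types must appear within the short window.
        if all(occurs(alarms, r, t, t + antecedent_window) for r in rest):
            matches += 1
            if occurs(alarms, consequent, t, t + consequent_window):
                hits += 1
    return hits / matches if matches else None


# Hypothetical alarm log.
log = [
    (0.0, "link alarm"), (2.0, "link failure"), (30.0, "high fault rate"),
    (100.0, "link alarm"), (103.0, "link failure"),
    (200.0, "link alarm"), (204.0, "link failure"), (250.0, "high fault rate"),
    (300.0, "link alarm"), (310.0, "link failure"), (320.0, "high fault rate"),
]
confidence = rule_confidence(
    log, ("link alarm", "link failure"), "high fault rate", 5.0, 60.0
)
print(confidence)
```

In this toy log the antecedent occurs three times within 5 seconds (the pair at 300 s and 310 s is too far apart) and is followed by a high fault rate alarm within 60 seconds in two of them.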
• Re-representing the network-level time-series data so that conventional
data mining methods can be applied has drawbacks. The most significant one
is that some information is lost in the reformulation process.
• For example, with a scalar-based representation, all sequence information
is lost.
• Timeweaver (Weiss & Hirsh, 1998) is a Genetic-Algorithm (GA) based
data mining system that is capable of operating directly on the raw
network-level time-series data (as well as other time-series data), thereby
making it unnecessary to re-represent the network-level data.
• Given a sequence of time-stamped events and a target event T,
Timeweaver will identify patterns that successfully predict T.
• Timeweaver essentially searches through the space of possible patterns,
which includes sequence and temporal relationships, to find predictive
patterns.
• The system is especially designed to perform well when the target event is
rare, which is critical since most network failures are rare. In the case studied,
the target event is the failure of components in the 4ESS switching system.
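A search over patterns needs some way to judge how well one candidate pattern predicts the target event. The sketch below shows one plausible scoring step. It is not the system's actual algorithm, and the genetic search itself is not shown. The assumptions are: a pattern is an ordered sequence of event types that must complete within `window` seconds, a prediction is issued when the pattern completes, and a prediction is correct if the target occurs within `horizon` seconds afterwards. All names and the event log are hypothetical.

```python
from typing import List, Sequence, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, event type), sorted by time


def find_predictions(events: Sequence[Event], pattern: Sequence[str], window: float) -> List[float]:
    """Times at which the ordered pattern completes within `window` seconds."""
    times = set()
    for i, (start, kind) in enumerate(events):
        if kind != pattern[0]:
            continue
        pos, end = 1, start
        for ts, k in events[i + 1:]:
            if pos == len(pattern) or ts - start > window:
                break
            if k == pattern[pos]:
                pos, end = pos + 1, ts
        if pos == len(pattern):
            times.add(end)
    return sorted(times)


def score(events, pattern, target, window, horizon):
    """Return (precision, recall) of the pattern as a predictor of `target`."""
    predictions = find_predictions(events, pattern, window)
    targets = [ts for ts, k in events if k == target]
    correct = [p for p in predictions if any(p < t <= p + horizon for t in targets)]
    caught = [t for t in targets if any(p < t <= p + horizon for p in predictions)]
    precision = len(correct) / len(predictions) if predictions else 0.0
    recall = len(caught) / len(targets) if targets else 0.0
    return precision, recall


# Hypothetical event log with a rare target event "FAIL".
events = [
    (0, "A"), (3, "B"), (20, "FAIL"),
    (100, "A"), (104, "B"),
    (200, "B"), (205, "A"),
    (300, "A"), (302, "B"), (330, "FAIL"),
    (500, "FAIL"),
]
precision, recall = score(events, ("A", "B"), "FAIL", window=10, horizon=60)
print(precision, recall)
```

Reporting precision and recall separately, rather than plain accuracy, matters when the target is rare: a predictor that never fires would be almost always "right" yet useless.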
CONCLUSIONS
• This presentation described the importance of Data Mining.
• It also explained Data Mining technology in more detail.
• Furthermore, it described how Data Mining is used in the
telecommunications industry.
• There are three main sources of telecommunication data: call detail, network,
and customer data.
• Common data mining applications are fraud detection, marketing, and
network fault isolation.
• The presentation highlighted several key issues that affect the ability to mine data,
and commented on how they impact the Data Mining process.
• The telecommunications industry has been one of the earliest adopters of Data
Mining technology, largely because of the amount and quality of the data that it
collects.
• This early adoption has resulted in many successful Data Mining applications.
• Data Mining technology is a promising way for companies to achieve their goals,
especially in the telecommunications industry.
REFERENCES
1. Adrian COSTEA, “The Analysis of the Telecommunications Sector by The Means of
Data Mining Techniques” (Paper)
2. Constantinos S. Hilas, John N. Sahalos, “User Profiling for Fraud Detection in
Telecommunication Networks” (Paper)
3. “Data-Mining Concepts” (Chapter from Textbook)
4. Gary M. Weiss, “Data Mining in Telecommunications” (Chapter from Textbook)
5. Jamal Sophieh, “Data mining in telecommunications and studying its status in Iran
telecom companies and operators” (Paper)
6. Mieke Jans, Nadine Lybaert, Koen Vanhoof, “Data Mining for Fraud Detection: Toward
an Improvement on Internal Control Systems?” (Paper)
7. Rahul J. Jadhav, Usharani T. Pawar, “Churn Prediction in Telecommunication Using
Data Mining Technology” (Paper)
8. S. Daskalaki, I. Kopanas, M. Goudara, N. Avouris, “Data mining for decision support on
customer insolvency in telecommunications business” (Paper)
9. V. Umayaparvathi, Dr. K. Iyakutti, “A Fraud Detection Approach in Telecommunication
using Cluster GA” (Paper)
THE END

التنقيب في البيانات - Data Mining

  • 1. Presenter Supervisor Data Mining Role in Telecommunications February 2012
  • 2. 1. Preface 2. Data Mining Definition 3. Data Mining Activities 4. Data Mining Tasks 5. Data Mining Process 6. Data Mining Roots 7. Why Data Mining? 8. Indicators Of Data Quality 9. Data Warehouse Concept 10. Data Warehouse Development 11. Data Warehouse & Data Mining 1. Data Mining Applications 2. Telecom. & Data Mining 4/8/2019 Hunan University 2 - 58 3. Types Of Telecom. Data 4. Telecom. Issues & Data Mining 1) Fraud Detection 5. Conclusions 6. References 2) Marketing/Customer Profiling A. Customer Churn B. Customer Insolvency 3) Network Fault Isolation
  • 3.
  • 4.  In many domains the underlying first principles are unknown, or the systems under study are too complex to be mathematically formalized. With the growing use of computers, there is a great amount of data being generated by such systems.  In the absence of first-principle models, such readily available data can be used to derive models by estimating useful relationships between a system's variables (i.e., unknown input-output dependencies).  Thus there is currently a paradigm shift from classical modeling and analyses based on first principles to developing models and the corresponding analyses directly from data. PREFACE 4/8/2019 Hunan University Need for Analyses of Large, Complex, Information-Rich Data Sets 4 - 58
  • 5.  Data Mining (DM) is: an iterative process within which progress is defined by discovery, through either automatic or manual methods.  DM is: most useful in an exploratory analysis scenario in which there are no predetermined notions about what will constitute an "interesting" outcome.  DM is: the search for new, valuable, and nontrivial information in large volumes of data.  DM is: a cooperative effort of humans and computers. (Best results are achieved by balancing the knowledge of human experts in describing problems and goals with the search capabilities of computers). DATA MINING DEFINITIONS 4/8/2019 Hunan University 5 - 58
  • 6. 1. Predictive DM: the goal of DM is to produce a model, expressed as an executable code, which can be used to perform classification, prediction, estimation, or other similar tasks. 2. Descriptive DM: the goal is to gain an understanding of the analyzed system by uncovering patterns and relationships in large data sets. • So,: a. The relative importance of Prediction and Description for particular DM applications can vary considerably. b. The goals of Prediction and Description are achieved by using DM techniques. DATA MINING ACTIVITIES 4/8/2019 Hunan University 6 - 58
  • 7. 1. Classification : discovery of a predictive learning function that classifies a data item into one of several predefined classes. 2. Regression : discovery of a predictive learning function, which maps a data item to a real-value prediction variable. 3. Clustering : a common descriptive task in which one seeks to identify a finite set of categories or clusters to describe the data. 4. Summarization : an additional descriptive task that involves methods for finding a compact description for a set (or subset) of data. 5. Dependency Modeling : finding a local model that describes significant dependencies between variables or between the values of a feature in a data set or in a part of a data set. 6. Change and Deviation Detection : discovering the most significant changes in the data set. DATA MINING TASKS 4/8/2019 Hunan University 7 - 58
  • 8. Recognize the Iterative Character of A Data Mining Process and Specify Its Basic Steps 1. State the problem and formulate the hypothesis 2. Collect the data 3. Preprocessing the data a)Outlier detection (and removal): Detect and eventually remove outliers as a part of the preprocessing phase. Develop robust modeling methods that are insensitive to outliers. b)Scaling, encoding, and selecting feature. 4. Estimate the model 5. Interpret the model and draw conclusions DATA MINING PROCESS 4/8/2019 Hunan University 8 - 58
  • 9. 1. State the Problem & Formulate The Hypothesis 2. Collect the Data 3. Preprocessing the Data 4. Estimate the Model 5. Interpret the Model & Draw Conclusions DATA MINING PROCESS 4/8/2019 Hunan University 9 - 58
  • 10. • The root of the problem is that the data size and dimensionality are too large for manual analysis and interpretation, or even for some semiautomatic computer-based analyses. • A scientist or a business manager can work effectively with a few hundred or thousand records. • Effectively mining millions of data points, each described with tens or hundreds of characteristics, is another matter. • Imagine the analysis of:  terabytes (TB) of sky-image data with thousands of photographic high-resolution images (23,040 x 23,040 pixels per image).  OR human genome databases with billions of components. DATA MINING ROOTS 4/8/2019 Hunan University 10 - 58
  • 11. WHY DATA MINING? [Figure: Growth of Internet Hosts. x-axis: Date (Jul-1993 to Jan-2012); y-axis: host count, N-HOSTS (0 to 1,000,000,000).] Source: Internet Systems Consortium - www.isc.org/solutions/survey
  • 12. WHY DATA MINING? • In theory, "big data" can lead to much stronger conclusions, but in practice many difficulties arise. • The business community is well aware of today's information overload, and one analysis of managers' perceptions and opinions shows that:
      Perception or opinion | Believe | Don't believe
      Information overload is present in their own workplace. | 61% | 39%
      The situation will get worse. | 80% | 20%
      Ignore data in current decision-making processes because of the information overload. | > 50% | < 50%
      Store this information for the future; it is not used for current analysis. | 84% | 16%
      The cost of gathering information outweighs its value. | 60% | 40%
  • 13. WHY DATA MINING? • What are the solutions? 1. Work harder. (But how long can you keep up? The limits are very close.) 2. Employ an assistant. (Maybe, if you can afford it.) 3. Ignore the data. (But then you are not competitive in the market.) • The only real solution: replace the classical data analysis and interpretation methodologies (both manual and computer-based) with a new DM technology.
  • 14. 1. Data should be Accurate. 2. Data should be Stored according to Data Type. 3. Data should have Integrity. 4. Data should be Consistent. 5. Data should not be Redundant. 6. Data should be Timely. 7. Data should be Well Understood. 8. Data set should be Complete. INDICATORS OF DATA QUALITY 4/8/2019 Hunan University 14 - 58
  • 15. • Definition: The Data Warehouse (DW) is a collection of integrated, Subject-Oriented Databases designed to support the Decision-Support Functions (DSF), where each unit of data is relevant to some moment in time. Types of Elementary or Derived Data 1. Old detail data. 2. New detail data. 3. Lightly summarized data. 4. Highly summarized data. 5. Metadata (the data directory or guide) DATA WAREHOUSE CONCEPT 4/8/2019 Hunan University 15 - 58
  • 16. • To prepare these five types of elementary or derived data in a DW, the fundamental types of Data Transformation (DT) are standardized. • There are four main types of Transformations: 1. Simple Transformations. 2. Cleansing and Scrubbing. 3. Integration. 4. Aggregation and Summarization. DATA WAREHOUSE CONCEPT 4/8/2019 Hunan University 16 - 58
  • 17. A Three-Stage Data-Warehousing Development Process 1. Modeling: In simple terms, to take the time to understand business processes, the information requirements of these processes, and the decisions that are currently made within processes. 2. Building: a. To establish requirements for tools that suit the types of decision support necessary for the targeted business process. b. To create a data model that helps further define information requirements. c. To decompose problems into data specifications and the actual data store, which will, in its final form, represent either a data mart or a more comprehensive DW. DATA WAREHOUSE DEVELOPMENT 4/8/2019 Hunan University 17 - 58
  • 18. 3. Deploying: a. To implement, relatively early in the overall process, the nature of the data to be warehoused and the various business intelligence tools to be employed. b. To begin by training users. The deploy stage explicitly contains a time during which users explore both the repository (to understand data that are and should be available) and early versions of the actual DW. This can lead to an evolution of the DW, which involves adding more data, extending historical periods, or returning to the build stage to expand the scope of the DW through a data model. DATA WAREHOUSE DEVELOPMENT 4/8/2019 Hunan University 18 - 58
  • 19. DATA WAREHOUSE & DATA MINING: How is DM different from SQL & OLAP, which are also applied to DW? • SQL is a standard relational database language that is good for queries that impose some kind of constraints on data in the database in order to extract an answer. • In contrast, DM methods are good for queries that are exploratory in nature, trying to extract hidden, not so obvious information. • SQL is useful when we know exactly what we are looking for and we can describe it formally. • DM methods are used when we know only vaguely what we are looking for. • Therefore, these two classes of DW applications are complementary.
  • 20. DATA WAREHOUSE & DATA MINING: How is DM different from SQL & OLAP, which are also applied to DW? • OLAP tools and methods have become very popular in recent years as they let users analyze data in a warehouse by providing multiple views of the data, supported by advanced graphical representations. In these views, different dimensions of data correspond to different business characteristics. • OLAP tools make it very easy to look at dimensional data from any angle or to slice-and-dice it. • Although OLAP tools, like DM tools, provide answers that are derived from data, the similarity between them ends here.
  • 21. DATA WAREHOUSE & DATA MINING: How is DM different from SQL & OLAP, which are also applied to DW? • OLAP tools do not learn from data, nor do they create new knowledge. They are usually special-purpose visualization tools that can help end-users draw their own conclusions and decisions, based on graphically condensed data. • OLAP tools are very useful for the DM process; they can be a part of it but they are not a substitute.
  • 23. DATA MINING APPLICATIONS: • DM is a young discipline with wide and diverse applications. • There is still a nontrivial gap between general principles of DM and domain-specific, effective DM tools for particular applications. • Some application domains: a. Biomedical and DNA Data Analysis b. Financial Data Analysis c. Retail Industry d. Telecommunication Industry
  • 24. TELECOMMUNICATIONS & DATA MINING: • A rapidly expanding and highly competitive industry with a great demand for DM, in order to: - Understand the business involved. - Identify telecommunication patterns. - Catch fraudulent activities. - Make better use of resources. - Improve the Quality of Service (QoS). • Multidimensional analysis of telecom data: the data are intrinsically multidimensional (calling time, duration, location of caller, type of call, etc.).
  • 25. TYPES OF TELECOMMUNICATIONS DATA: • Telecommunication companies generate and deal with a huge amount of data. • These data include: 1. Call Detail Data: describes the calls that traverse the telecommunication networks. 2. Network Data: describes the state of the hardware and software components in the network. 3. Customer Data: describes the telecommunication customers. • Several DM applications in the telecommunication industry can be used to: 1. Identify telecommunication fraud. 2. Improve marketing effectiveness. 3. Identify network faults.
  • 26. 1. Call Detail Data • Every time a call is placed on a telecommunications network, descriptive information about the call is saved as a call detail record. • The number of call detail records that are generated and stored is huge. For example, AT&T long distance customers alone generate over 300 million call detail records per day (Cortes & Pregibon, 2001). • Given that several months of call detail data is typically kept online, this means that tens of billions of call detail records will need to be stored at any time. • Below is a list of features that one might use when generating a summary description of a customer based on the calls they originate and receive over some time period P: 1.Average Call Duration 2.% No-answer Calls 3. % Calls to/from a Different Area Code 4.% Of Weekday Calls (Monday– Friday) 5. % Of Daytime Calls (9am – 5pm) 6.Average # Calls Received Per Day 7.Average # Calls Originated Per Day 8.# Unique Area Codes Called During P TYPES OF TELECOMMUNICATIONS DATA 4/8/2019 Hunan University 26 - 58
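A rough sketch of how a few of the summary features listed above might be computed from raw call detail records. The record layout (start time, duration, answered flag, area code), the sample calls, and the choice to average duration over answered calls only are assumptions made for this example.

```python
from datetime import datetime

# Hypothetical call detail records: (start, seconds, answered, area_code)
calls = [
    (datetime(2012, 2, 6, 10, 15), 120, True, "212"),   # Monday, daytime
    (datetime(2012, 2, 7, 20, 30), 0, False, "212"),    # Tuesday, evening, no answer
    (datetime(2012, 2, 11, 14, 0), 300, True, "415"),   # Saturday, daytime
    (datetime(2012, 2, 12, 9, 5), 60, True, "617"),     # Sunday, daytime
]

def summarize(calls):
    """Collapse one customer's calls over a period P into scalar features."""
    n = len(calls)
    answered = [c for c in calls if c[2]]
    return {
        # Assumption: average duration is taken over answered calls only
        "avg_call_duration": sum(c[1] for c in answered) / len(answered) if answered else 0.0,
        "pct_no_answer": 100 * (n - len(answered)) / n,
        "pct_weekday": 100 * sum(c[0].weekday() < 5 for c in calls) / n,
        "pct_daytime": 100 * sum(9 <= c[0].hour < 17 for c in calls) / n,
        "unique_area_codes": len({c[3] for c in calls}),
    }

features = summarize(calls)
```

Each customer is reduced to one row of features like this, which is the form most classification and clustering tools expect.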
  • 27. TYPES OF TELECOMMUNICATIONS DATA: In the period of 9am to 4pm, businesses place roughly 1.5 times as many of their total weekday calls as residences do. Note that at 5pm the ratio is close to 1, indicating that the calls during this timeframe are not very useful for distinguishing between a business and a residence. Calls in the evening timeframe (6pm – 1am) are also useful in distinguishing between the two types of customers.
  • 28. • Telecommunication networks are extremely complex configurations of equipment, comprised of thousands of interconnected components. • Each network element is capable of generating error and status messages, which leads to a tremendous amount of network data. • This data must be stored and analyzed in order to support network management functions, such as fault isolation. • As was the case with the call detail data, network data is also generated in real-time as a data stream and must often be summarized in order to be useful for data mining. • This is sometimes accomplished by applying a time window to the data. For example, such a summary might indicate that a hardware component experienced twelve instances of a power fluctuation in a 10-minute period. TYPES OF TELECOMMUNICATIONS DATA 2. Network Data 4/8/2019 Hunan University 28 - 58
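The time-window summarization described above could look roughly like this. The message format, component names, and timestamps are hypothetical; only the 10-minute window comes from the example in the slide.

```python
from collections import Counter

WINDOW_SECONDS = 600  # 10-minute window, as in the example above

# Hypothetical status messages: (seconds since start, component id, message type)
messages = [
    (12, "PSU-7", "power_fluctuation"),
    (95, "PSU-7", "power_fluctuation"),
    (430, "LINK-2", "link_alarm"),
    (610, "PSU-7", "power_fluctuation"),
]

def summarize_by_window(messages, window=WINDOW_SECONDS):
    """Count messages per (window index, component, message type)."""
    counts = Counter()
    for t, component, kind in messages:
        counts[(t // window, component, kind)] += 1
    return counts

summary = summarize_by_window(messages)
```

The resulting counts are fixed-size summaries of an unbounded stream, so they can be fed to ordinary data mining methods.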
  • 29. TYPES OF TELECOMMUNICATIONS DATA: 3. Customer Data • Telecommunication companies, like other large businesses, may have millions of customers, and this means maintaining a database of information on these customers. • This information will include name and address information and may include other information such as service plan and contract information, credit score, family income and payment history. • This information may be supplemented with data from external sources, such as credit reporting agencies. The customer data maintained by telecommunication companies does not substantially differ from that maintained in most other industries; however, customer data is often used in conjunction with other data in order to improve results. • For example, customer data is typically used to supplement call detail data when trying to identify phone fraud.
  • 30. TELECOMMUNICATIONS ISSUES & DATA MINING: 1. Fraud Detection 2. Marketing/Customer Profiling: A. Customer Churn B. Customer Insolvency 3. Network Fault Isolation
  • 31. 1. Fraud Detection • Fraud is a serious problem for telecommunication companies, leading to billions of dollars in lost revenue each year. • Fraud can be divided into two categories: 1. Subscription fraud: occurs when a customer opens an account with the intention of never paying for the account charges. 2. Superimposition fraud: involves a legitimate account with some legitimate activity, but also includes some “superimposed” illegitimate activity by a person other than the account holder. • The most common method for identifying fraud is to build a profile of customer’s calling behavior and compare recent activity against this behavior. • Thus, this Data Mining application relies on deviation detection. The calling behavior is captured by summarizing the call detail records for a customer. TELECOMMUNICATIONS ISSUES & DATA MINING 4/8/2019 Hunan University 31 - 58
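A minimal sketch of deviation detection against a customer profile, as described above. The profile here is simply a list of daily international call minutes, and the z-score measure and alert threshold are assumptions for illustration; real systems use richer profiles.

```python
from statistics import mean, stdev

def deviation_score(history, recent):
    """How many standard deviations `recent` lies from the customer's historical mean."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return 0.0 if recent == mu else float("inf")
    return (recent - mu) / sigma

# Hypothetical profile: international call minutes per day over the past 8 days
profile = [4, 6, 5, 5, 7, 3, 5, 5]
THRESHOLD = 3.0  # assumed alert threshold, not taken from the source

normal_day = deviation_score(profile, 6)
suspicious_day = deviation_score(profile, 60)
is_alert = suspicious_day > THRESHOLD
```

A day with 6 minutes stays well inside the customer's usual range, while a day with 60 minutes is far outside it and would be flagged for review.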
  • 32. TELECOMMUNICATIONS ISSUES & DATA MINING: 1. Fraud Detection • Fraudulent pattern analysis and the identification of unusual patterns: - Identify potentially fraudulent users and their atypical usage patterns. - Detect attempts to gain fraudulent entry to customer accounts. - Discover unusual patterns which may need special attention. • Multidimensional association and sequential pattern analysis: - Find usage patterns for a set of communication services by customer group, by month, etc. - Promote the sales of specific services. - Improve the availability of particular services in a region. • Use of visualization tools in telecommunication data analysis.
  • 33. TELECOMMUNICATIONS ISSUES & DATA MINING: 1. Fraud Detection • If the call detail summaries are updated in real-time, fraud can be identified soon after it occurs. • Because new behavior does not necessarily imply fraud, one fraud-detection system augments this basic approach by comparing the new calling behavior to profiles of generic fraud, and only signals fraud if the behavior matches one of these profiles (Cortes & Pregibon, 2001). • For example, one sample rule that combines call detail and customer-level data for detecting cellular fraud is: “People who have a price plan that makes international calls expensive and who display a sharp rise in international calls are likely victims of cloning fraud.”
  • 34. TELECOMMUNICATIONS ISSUES & DATA MINING: 1. Fraud Detection • Because fraud is relatively rare, and the number of verified fraudulent calls is relatively low, the fraud application involves predicting a relatively rare event where the underlying class distribution is highly skewed. • DM algorithms often have great difficulty dealing with highly skewed class distributions and predicting rare events.
  • 35. TELECOMMUNICATIONS ISSUES & DATA MINING: 1. Fraud Detection • Studies on similar fraud problems appear in the mobile communications, conventional communications and credit card sectors: 1. Fraud in mobile communications. 2. Fraud in conventional communications. 3. Fraud in credit card usage. • These fraud problems share several common characteristics, which are summarized below: 1. Significant loss of revenue. 2. Unpredictability of human behavior. 3. Information can be retrieved only after processing huge amounts of data. 4. Fraudulent cases are rare compared to the legitimate ones.
  • 36. TELECOMMUNICATIONS ISSUES & DATA MINING: 1. Fraud Detection For example (1): if fraud makes up only 0.2% of all calls, many data mining systems will not generate any rules for finding fraud, since a default rule, which never predicts fraud, would be 99.8% accurate. To deal with this issue, the training data is often selected to increase the proportion of fraudulent cases. For example (2): Ezawa & Norton (1995) increase the percentage of fraudulent calls from 1-2% to 9-12%. However, the use of a non-representative training set can be problematic because it does not provide the DM method with accurate information about the true class distribution (Weiss and Provost, 2003).
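The arithmetic behind the default rule, and one simple way to raise the share of fraud cases in a training set, can be sketched as follows. Duplicating fraud examples (oversampling) is just one option and is an assumption here; the slide does not say how the cited studies selected their training data. The 1,000-call data set and the 10% target are likewise invented for the example.

```python
def default_rule_accuracy(fraud_rate):
    """Accuracy of a rule that always predicts 'not fraud'."""
    return 1.0 - fraud_rate

def oversample(examples, target_fraud_share):
    """Duplicate fraud examples until they make up roughly the target share."""
    fraud = [e for e in examples if e[1] == "fraud"]
    legit = [e for e in examples if e[1] != "fraud"]
    if not fraud:
        return legit
    # Solve f / (f + len(legit)) = target for the needed number of fraud rows f
    needed = round(target_fraud_share * len(legit) / (1 - target_fraud_share))
    repeated = [fraud[i % len(fraud)] for i in range(max(needed, len(fraud)))]
    return legit + repeated

# Hypothetical data: 2 fraudulent calls among 1,000 (0.2%)
data = [(i, "fraud") for i in range(2)] + [(i, "legit") for i in range(2, 1000)]
balanced = oversample(data, target_fraud_share=0.10)
share = sum(1 for _, label in balanced if label == "fraud") / len(balanced)
```

With a 0.2% fraud rate, a rule that never predicts fraud is right 99.8% of the time, which is why accuracy alone is a poor yardstick on skewed data.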
  • 37. TELECOMMUNICATIONS ISSUES & DATA MINING: 2. Marketing/Customer Profiling • Telecommunication companies maintain a great deal of data about their customers. In addition to the general customer data that most businesses collect, telecommunication companies also store call detail records, which precisely describe the calling behavior of each customer. • This information can be used to profile the customers, and these profiles can then be used for marketing and/or forecasting purposes. • Example-1 (MCI): AT&T (1991) used a rate-averaging system; calls between areas of similar distance were billed the same, regardless of actual transmission costs. This system allowed sparsely populated areas to enjoy the same low phone rates as high-volume, major cities (http://www.hagley.org/library/exhibits/Mci/earlyyearsmci.html). The MCI Friends and Family promotion relied on DM to identify associations within data.
  • 38. • Example-2: Another marketing application that relies on this technique is a DM application for finding the set of Non-U.S. countries most often called together by U.S. telecommunication customers (Cortes & Pregibon, 2001). • One set of countries identified by this DM application is: {Jamaica, Antigua, Grenada, Dominica}. • This information is useful for establishing and marketing international calling plans. TELECOMMUNICATIONS ISSUES & DATA MINING 4/8/2019 Hunan University 38 - 58 2. Marketing/Customer Profiling
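Finding countries that are often called together is an association-mining task. The sketch below counts co-occurring pairs only, which is a simplification of full frequent-itemset mining, and the customer data and support threshold are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: the set of non-U.S. countries each customer called in a month
customers = [
    {"Jamaica", "Antigua", "Grenada"},
    {"Jamaica", "Antigua", "Dominica"},
    {"Jamaica", "Antigua"},
    {"France", "Germany"},
]

def frequent_pairs(baskets, min_support):
    """Return country pairs called together by at least `min_support` customers."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair has one canonical ordering
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

pairs = frequent_pairs(customers, min_support=2)
```

Larger sets such as the four-country group above would be built up from frequent pairs in the same spirit, by extending only those combinations that already meet the support threshold.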
  • 39. TELECOMMUNICATIONS ISSUES & DATA MINING: 2. Marketing/Customer Profiling: A. Customer Churn • Customer churn involves a customer leaving one telecommunication company for another. • Customer churn is a significant problem because of the associated loss of revenue and the high cost of attracting new customers. • Data mining techniques now allow companies to mine historical data in order to predict when a customer is likely to leave. • These techniques typically utilize billing data, call detail data, subscription information (calling plan, features, contract expiration data) and customer information (e.g., age). • Based on the induced model, the company can then take action, if desired.
  • 40. Example: A wireless company might offer a customer a free phone for extending their contract. One such effort utilized a neural network to estimate the probability h(t) of cancellation at a given time t in the future (Mani, Drew, Betz & Datta, 1999). • In order to effectively mine the call detail data, it must be summarized to the customer level (Business or Residence). • Then, a Classifier Induction Program can be applied to a set of labeled training examples in order to Build A Classifier. This approach has been used to identify Fax Lines (Kaplan, Strauss & Szegedy, 1999) and to classify a phone line as belonging to a Business or Residence (Cortes & Pregibon, 1998). • Other applications have used this approach to identify phone lines belonging to Telemarketers and to classify a phone line as being used for Voice, Data, Or Fax. TELECOMMUNICATIONS ISSUES & DATA MINING 4/8/2019 Hunan University 40 - 58 2. Marketing/Customer Profiling A. Customer Churn
  • 41. • These rules were generated using SAS Enterprise Miner, a sophisticated Data Mining Package that supports Multiple Data Mining Techniques. • The rules shown here were generated using a Decision Tree Learner. However, a Neural Network was also used to predict the probability of a customer being a Business or Residential customer, based solely on the distribution of calls by time of day (i.e., the Neural Network had 24 Inputs, One per Hour of The Day). Pseudo-Code • Rule 1: if < 43% of calls last 0-10 seconds and < 13.5% of calls occur during the weekend and neural network says that P(business) > 0.58 based on time of day call distribution then Business Customer. • Rule 2: if calls received over two-month period from at most 3 unique area codes and <56.6% of calls last 0-10 seconds then Residential Customer. TELECOMMUNICATIONS ISSUES & DATA MINING 4/8/2019 Hunan University 41 - 58 2. Marketing/Customer Profiling A. Customer Churn
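The two pseudo-code rules above translate directly into a small function. Parameter names are invented for this sketch, and checking Rule 1 before Rule 2 and returning None when neither fires are assumptions, since the slide does not say how the rules are combined.

```python
def classify_line(pct_short_calls, pct_weekend_calls, p_business, unique_area_codes):
    """Apply the two sample rules; return None when neither rule fires.

    pct_short_calls: % of calls lasting 0-10 seconds
    pct_weekend_calls: % of calls occurring during the weekend
    p_business: P(business) from the time-of-day neural network
    unique_area_codes: distinct area codes calls were received from in two months
    """
    # Rule 1: few short calls, few weekend calls, network favors "business"
    if pct_short_calls < 43 and pct_weekend_calls < 13.5 and p_business > 0.58:
        return "Business"
    # Rule 2: calls received from at most 3 area codes, moderate share of short calls
    if unique_area_codes <= 3 and pct_short_calls < 56.6:
        return "Residential"
    return None

label = classify_line(pct_short_calls=30, pct_weekend_calls=10,
                      p_business=0.7, unique_area_codes=12)
```

Note that the neural network's output is used as just another input feature of Rule 1, which is how the decision tree and the network are combined in the rules shown.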
  • 42. TELECOMMUNICATIONS ISSUES & DATA MINING: 2. Marketing/Customer Profiling: A. Customer Churn: Model-Building to Predict Churn • BSNL Satara, like other telecom companies, suffers from churning customers who use the provided services without paying their dues. It provides many services, such as Internet, fax, postpaid and prepaid mobile phones, and fixed phones; the researchers focus only on postpaid phones with respect to churn prediction, which is the purpose of this research work. • In essence, many aggregated attributes containing the lengths of calls made by every customer over the five months studied were created. • The model predicts customers who are at risk of leaving a company in the telecommunication sector. • Using this model, the company will be able to find such customers.
  • 43. TELECOMMUNICATIONS ISSUES & DATA MINING: 2. Marketing/Customer Profiling: A. Customer Churn: Model-Building to Predict Churn
      [Figure: Difference Between Churning and Not Churning Customers. x-axis: Weeks (1 to 4); y-axis: Customers (0 to 180); series: Not Churning, Possible Churning.]
      Series | 1st Week | 2nd Week | 3rd Week | 4th Week
      Not Churning | 102.5 | 89.2 | 96.1 | 117.2
      Possible Churning | 71 | 109.5 | 131.5 | 174
  • 44. TELECOMMUNICATIONS ISSUES & DATA MINING: 2. Marketing/Customer Profiling: A. Customer Churn: Model-Building to Predict Churn • The major task to be performed at this stage was creating and training a decision support system that can discriminate between churning and not churning customers. • For the proposed work, we had a choice of several data mining tools, and MATLAB was found to be the most suitable for this purpose because it supports many algorithms and has a Neural Network Toolbox. • The algorithm used for this research work is the Back Propagation Algorithm (BPA). While building the model, the whole dataset was divided into three subsets. • These subsets were the training set, the validation set, and the test set. • The training set is used to train the network. • The validation set is used to monitor the error during the training process. • The test set is used to evaluate the performance of the model.
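The three-way split described above can be sketched in a few lines. The 60/20/20 proportions and the fixed random seed are assumptions for the example; the slide does not state the ratios used in the study, and this is plain Python rather than the toolbox the researchers used.

```python
import random

def split_dataset(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Hypothetical data set of 100 customer records, represented by their ids
train_set, validation_set, test_set = split_dataset(range(100))
```

Keeping the test set untouched until the end is what makes its error a fair estimate of how the model behaves on new customers.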
  • 45. TELECOMMUNICATIONS ISSUES & DATA MINING: 2. Marketing/Customer Profiling: B. Customer Insolvency • Telecommunication companies, as well as other service-providing companies, often suffer from insolvent customers who use the provided services without paying their dues. • Companies in the telecommunications business take precautions against these customers; however, in most cases these are measures applied quite late, often with no significant effect. • As a result, many unpaid bills end up in the account of uncollectible debts. • Detection and prevention of this behavior is an objective of prime importance for the industry. • This is especially important today, when strong competition is building in many sectors of the telecommunications industry as monopolies cease to exist.
  • 46. TELECOMMUNICATIONS ISSUES & DATA MINING: 2. Marketing/Customer Profiling: B. Customer Insolvency: Model-Building to Predict Insolvency • A model capable of predicting insolvent customer behavior well in advance can be a useful decision-support tool for a service-providing company. • However, building such a model is not a trivial process. • The objective is to study the feasibility of building such a tool for a major telecommunications operator using data mining techniques.
  • 47. TELECOMMUNICATIONS ISSUES & DATA MINING: 2. Marketing/Customer Profiling: B. Customer Insolvency: Model-Building to Predict Insolvency • The data a company holds are in most cases dispersed, but when inter-related they may contain valuable information relating to the insolvency-prediction problem. • Examples of such data are: 1. Customer profiles. 2. Usage of the offered service. 3. Financial transactions of the customers with the company. • As with fraudsters in other fraud detection problems, it is assumed that insolvent customers on average behave differently from the rest of the customers, especially during a critical period preceding the due date for payment. • The goal is therefore to reveal those behavioral patterns which may distinguish these customers from the rest.
  • 48. Model-Building Difficulties • Some of the inherent limitations that make this a particularly difficult problem, as identified early in the study, were the following: A. Insolvent customer’s behavior can be either the result of fraudulence or due to factors beyond the customer’s will (force majeure, social factors etc.). B. The available data sets, as often is the case with real life problems, represent the individual customer in a limited and distorted way. C. In the available large data set, many parameters can be defined, often deduced from primary transactional data, which can describe the behavior of a customer of a Modern Telecommunications Company. TELECOMMUNICATIONS ISSUES & DATA MINING 4/8/2019 Hunan University 48 - 58 B. Customer Insolvency 2. Marketing/Customer Profiling
  • 49. TELECOMMUNICATIONS ISSUES & DATA MINING: 2. Marketing/Customer Profiling: B. Customer Insolvency: Model Stages 1. Learning the application domain 2. Creating a target data set 3. Data cleaning and preprocessing 4. Data reduction and projection 5. Choosing the function of data mining 6. Choosing the data mining algorithm(s) 7. Data mining 8. Interpretation 9. Using discovered knowledge
  • 50. TELECOMMUNICATIONS ISSUES & DATA MINING: 2. Marketing/Customer Profiling: B. Customer Insolvency: Model Stages • The study reported here can be described as a KDD experiment. • Following a typical KDD framework (Fayyad et al., 1996; Han and Kamber, 2001), where data mining is the core of the overall process, the experiment went through all nine steps of Fig. 1, from gaining profound knowledge of the domain to the actual use of discovered knowledge. Problem Definition and Application Domain • Insolvency prediction is the ability to predict prospective insolvent customers, that is, the customers that will refuse to pay their telephone bills by the next due date for payment, while there is still time for preventive (and possibly aversive) measures. • Insolvency prediction makes sense in business terms only if it can take place early enough to be of use to the company.
  • 51. Problem Definition and Application Domain • For this problem, three objectives were set as most prevailing for the company: 1. Detection of as many insolvent customers as possible. 2. Minimization of the false alarms, i.e. the number of solvent customers that are falsely classified as insolvent. 3. Timely warning to the service provider in order to take action against a prospective insolvent customer. Using Discovered Knowledge for Insolvency Prediction • In the experiments and by the case-by-case comparison, every case (customer) was examined separately. • Then the following scheme was followed: if all three algorithms agreed on one classification (solvent or insolvent), then the customer was labeled as such. On the contrary, if one of the classifiers did not agree with the others, then the customer was considered as unclassified. TELECOMMUNICATIONS ISSUES & DATA MINING 4/8/2019 Hunan University 51 - 58 B. Customer Insolvency 2. Marketing/Customer Profiling
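The case-by-case scheme described above (label a customer only when all three classifiers agree) is a unanimous vote. A minimal sketch, with invented predictions standing in for the three algorithms' outputs:

```python
def combine(predictions):
    """Label a customer only when every classifier agrees; otherwise 'unclassified'."""
    return predictions[0] if len(set(predictions)) == 1 else "unclassified"

# Hypothetical outputs of three classifiers for three customers
customer_a = combine(["insolvent", "insolvent", "insolvent"])
customer_b = combine(["solvent", "solvent", "solvent"])
customer_c = combine(["insolvent", "solvent", "insolvent"])
```

Requiring unanimity trades coverage for precision: fewer customers receive a label, but a false alarm needs all three classifiers to be wrong at once.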
  • 52. TELECOMMUNICATIONS ISSUES & DATA MINING: 2. Marketing/Customer Profiling: B. Customer Insolvency: Using Discovered Knowledge for Insolvency Prediction
      Table: Combined Classification Results in a Case-by-case Comparison
      Data set | Actual | Predicted insolvent | Predicted solvent | Unclassified | Total
      Cases used for training | Insolvent | 43 (31.6%) | 29 (21.3%) | 64 (47.1%) | 136
      Cases used for training | Solvent | 0 (0%) | 1174 (96.8%) | 38 (3.2%) | 1212
      Cases used for testing | Insolvent | 19 (29.69%) | 17 (26.56%) | 28 (43.75%) | 64
      Cases used for testing | Solvent | 1 (0.15%) | 626 (95.72%) | 27 (4.13%) | 654
      90.28% of cases used for training were correctly classified. 89.83% of cases used for testing were correctly classified.
      1. The DM experiment was a multi-stage process. 2. Various algorithms were tested in order to establish which one is most suitable for our problem and this particular dataset. a. For the specific problem of customer insolvency in telecommunication companies, the experiments were highly reusable, except for deviations in the syntactic and semantic definition of the operators' data. b. A parametric tool to handle such deviations is one of the objectives of further research.
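The two overall percentages can be reproduced from the counts in the table above. The sketch below simply adds the correctly classified insolvent and solvent cases and divides by all cases, counting unclassified customers as not correct; the variable names are invented for the example.

```python
# Counts copied from the combined classification table above
training = {"insolvent_correct": 43, "solvent_correct": 1174, "total": 136 + 1212}
testing = {"insolvent_correct": 19, "solvent_correct": 626, "total": 64 + 654}

def pct_correct(t):
    """Share of all cases whose combined label matches the actual class, in %."""
    return round(100 * (t["insolvent_correct"] + t["solvent_correct"]) / t["total"], 2)

training_pct = pct_correct(training)
testing_pct = pct_correct(testing)

# Share of actual insolvent test customers that were detected
insolvent_detected_pct = round(100 * 19 / 64, 2)
```

The overall figure of about 90% is dominated by the large solvent class; the detection rate for insolvent customers in the test set is under 30%, which echoes the earlier point about skewed class distributions.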
  • 53. • Telecommunication networks are extremely complex configurations of hardware and software. • Most of the network elements are capable of at least limited self-diagnosis, and these elements may collectively generate millions of status and alarm messages each month. In order to effectively manage the network, alarms must be analyzed automatically in order to identify network faults in a timely manner—or before they occur and degrade network performance. • A proactive response is essential to maintaining the reliability of the network. Because of the volume of the data, and because a single fault may cause many different, seemingly unrelated, alarms to be generated, the task of network fault isolation is quite difficult. 3. Network Fault Isolation TELECOMMUNICATIONS ISSUES & DATA MINING 4/8/2019 Hunan University 53 - 58
  • 54. • Telecommunication Alarm Sequence Analyzer (TASA) is one tool that helps with the knowledge acquisition task for alarm correlation (Klemettinen, Mannila & Toivonen, 1999). Data Mining Role for Identifying Faults • TASA automatically discovers recurrent patterns of alarms within the network data along with their statistical properties, using a specialized Data Mining Algorithm. • Network specialists then use this information to construct a rule- based alarm correlation system, which can then be used in real- time to identify faults. • TASA is capable of finding episodic rules that depend on temporal relationships between the alarms. For example, it may discover the following rule: “if alarms of type link alarm and link failure occur within 5 seconds, then an alarm of type high fault rate occurs within 60 seconds with probability 0.7.” TELECOMMUNICATIONS ISSUES & DATA MINING 4/8/2019 Hunan University 54 - 58 3. Network Fault Isolation
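To make the example rule concrete, the sketch below estimates the confidence of such a rule from an alarm log. It is not TASA's algorithm: it only checks one given rule, it assumes the two antecedent alarms occur in a fixed order, and it measures both time windows from the first alarm. The log itself is invented.

```python
def rule_confidence(alarms, first, second, consequent, lhs_window=5, rhs_window=60):
    """Estimate P(consequent follows | `first` then `second` within lhs_window).

    alarms: list of (timestamp_seconds, alarm_type), sorted by time.
    Both windows are measured from the timestamp of the `first` alarm (an assumption).
    """
    matches = hits = 0
    for t_a, kind in alarms:
        if kind != first:
            continue
        # Earliest `second` alarm inside the antecedent window, if any
        t_b = next((t for t, k in alarms
                    if k == second and t_a <= t <= t_a + lhs_window), None)
        if t_b is None:
            continue
        matches += 1
        if any(k == consequent and t_b < t <= t_a + rhs_window for t, k in alarms):
            hits += 1
    return hits / matches if matches else None

# Hypothetical alarm log: (seconds, alarm type)
log = [
    (0, "link_alarm"), (3, "link_failure"), (40, "high_fault_rate"),
    (100, "link_alarm"), (102, "link_failure"),
    (300, "link_alarm"), (400, "link_failure"),
]
confidence = rule_confidence(log, "link_alarm", "link_failure", "high_fault_rate")
```

In this log the antecedent occurs twice and is followed by the consequent once, giving a confidence of 0.5; the third link alarm is ignored because its link failure arrives far outside the 5-second window.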
  • 55. TELECOMMUNICATIONS ISSUES & DATA MINING: 3. Network Fault Isolation: Data Mining Role for Identifying Faults • The most significant drawback of re-representing the raw network data is that some information will be lost in the reformulation process. • For example, using a scalar-based representation, all sequence information is lost. • Time-weaver (Weiss & Hirsh, 1998) is a Genetic-Algorithm (GA) based data mining system that is capable of operating directly on the raw network-level time-series data (as well as other time-series data), thereby making it unnecessary to re-represent the network-level data. • Given a sequence of time-stamped events and a target event T, Time-weaver will identify patterns that successfully predict T. • Time-weaver essentially searches through the space of possible patterns, which includes sequence and temporal relationships, to find predictive patterns. • The system is especially designed to perform well when the target event is rare, which is critical since most network failures are rare. In the case studied, the target event is the failure of components in the 4ESS switching system.
  • 56. CONCLUSIONS • This presentation described how important Data Mining is in our life. • It also explained Data Mining technology in more detail. • Furthermore, it described how Data Mining is used in the telecommunications industry. • There are three main sources of telecommunication data: call detail, network and customer data. • The common data mining applications described are: fraud detection, marketing and network fault isolation. • The presentation highlighted several key issues that affect the ability to mine data, and commented on how they impact the Data Mining process. • The telecommunications industry has been one of the earliest adopters of Data Mining technology, largely because of the amount and quality of the data that it collects. • This has resulted in many successful Data Mining applications. • Data Mining technology offers companies, especially in the telecommunications industry, an effective way to achieve their goals.
  • 57. REFERENCES
      1. Adrian COSTEA, “The Analysis of the Telecommunications Sector by The Means of Data Mining Techniques”, (Paper)
      2. Constantinos S. Hilas, John N. Sahalos, “User Profiling for Fraud Detection in Telecommunication Networks”, (Paper)
      3. “Data-Mining Concepts”, (Chapter from Textbook)
      4. Gary M. Weiss, “Data Mining in Telecommunications”, (Chapter from Textbook)
      5. Jamal Sophieh, “Data mining in telecommunications and studying its status in Iran telecom companies and operators”, (Paper)
      6. Mieke Jans, Nadine Lybaert, Koen Vanhoof, “Data Mining for Fraud Detection: Toward an Improvement on Internal Control Systems?”, (Paper)
      7. Rahul J. Jadhav, Usharani T. Pawar, “Churn Prediction in Telecommunication Using Data Mining Technology”, (Paper)
      8. S. Daskalaki, I. Kopanas, M. Goudara, N. Avouris, “Data mining for decision support on customer insolvency in telecommunications business”, (Paper)
      9. V. Umayaparvathi, Dr. K. Iyakutti, “A Fraud Detection Approach in Telecommunication using Cluster GA”, (Paper)