Data Mining

Presenter
Supervisor
Data Mining Role in Telecommunications
February 2012

1. Preface
2. Data Mining Definition
3. Data Mining Activities
4. Data Mining Tasks
5. Data Mining Process
6. Data Mining Roots
7. Why Data Mining?
8. Indicators Of Data Quality
9. Data Warehouse Concept
10. Data Warehouse Development
11. Data Warehouse & Data Mining
12. Data Mining Applications
13. Telecom. & Data Mining
14. Types Of Telecom. Data
15. Telecom. Issues & Data Mining
    1) Fraud Detection
    2) Marketing/Customer Profiling
       A. Customer Churn
       B. Customer Insolvency
    3) Network Fault Isolation
16. Conclusions
17. References
4/8/2019 Hunan University

• In many domains the underlying first principles are
unknown, or the systems under study are too complex
to be mathematically formalized. With the growing use
of computers, there is a great amount of data being
generated by such systems.
• In the absence of first-principle models, such readily
available data can be used to derive models by
estimating useful relationships between a system's
variables (i.e., unknown input-output dependencies).
• Thus there is currently a paradigm shift from classical
modeling and analyses based on first principles to
developing models and the corresponding analyses
directly from data.
PREFACE
Need for Analyses of Large, Complex, Information-Rich Data Sets

• Data Mining (DM) is: an iterative process within which
progress is defined by discovery, through either
automatic or manual methods.
• DM is: most useful in an exploratory analysis scenario in
which there are no predetermined notions about what
will constitute an "interesting" outcome.
• DM is: the search for new, valuable, and nontrivial
information in large volumes of data.
• DM is: a cooperative effort of humans and computers.
(Best results are achieved by balancing the
knowledge of human experts in describing problems
and goals with the search capabilities of computers.)
DATA MINING DEFINITIONS

1. Predictive DM: the goal is to produce a model,
expressed as executable code, which can be
used to perform classification, prediction, estimation,
or other similar tasks.
2. Descriptive DM: the goal is to gain an understanding
of the analyzed system by uncovering patterns and
relationships in large data sets.
• Therefore:
a. The relative importance of Prediction and
Description for particular DM applications can
vary considerably.
b. The goals of Prediction and Description are
achieved by using DM techniques.
DATA MINING ACTIVITIES

1. Classification : discovery of a predictive learning function
that classifies a data item into one of several predefined
classes.
2. Regression : discovery of a predictive learning function,
which maps a data item to a real-valued prediction variable.
3. Clustering : a common descriptive task in which one seeks to
identify a finite set of categories or clusters to describe the
data.
4. Summarization : an additional descriptive task that involves
methods for finding a compact description for a set (or
subset) of data.
5. Dependency Modeling : finding a local model that
describes significant dependencies between variables or
between the values of a feature in a data set or in a part of
a data set.
6. Change and Deviation Detection : discovering the most
significant changes in the data set.
DATA MINING TASKS

Recognize the Iterative Character of a Data Mining Process and
Specify Its Basic Steps
1. State the problem and formulate the hypothesis
2. Collect the data
3. Preprocess the data
a) Outlier detection (and removal), using one of two strategies:
- Detect and eventually remove outliers as a
part of the preprocessing phase.
- Develop robust modeling methods that are
insensitive to outliers.
b) Scaling, encoding, and selecting features.
4. Estimate the model
5. Interpret the model and draw conclusions
DATA MINING PROCESS
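The preprocessing step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not name a specific outlier test or scaling method, so the 1.5 × IQR rule, the min-max scaling, and the sample values below are choices made for the example.

```python
import statistics

def remove_outliers_iqr(values, k=1.5):
    """Keep values within k * IQR of the quartiles (illustrative rule, an assumption)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up call durations in minutes; 120.0 is an obvious outlier.
call_minutes = [3.0, 4.5, 5.0, 2.5, 4.0, 3.5, 120.0]
cleaned = remove_outliers_iqr(call_minutes)
scaled = min_max_scale(cleaned)
```

Removing outliers before scaling matters here: with the outlier left in, min-max scaling would squeeze all ordinary values into a narrow band near zero.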

• The root of the problem is that the data size and
dimensionality are too large for manual analysis and
interpretation, or even for some semiautomatic
computer-based analyses.
• A scientist or a business manager can work effectively
with a few hundred or thousand records.
• Effectively mining millions of data points, each
described with tens or hundreds of characteristics, is
another matter.
• Imagine the analysis of:
- terabytes (TB) of sky-image data with thousands of
photographic high-resolution images (23,040 x
23,040 pixels per image),
- or human genome databases with billions of
components.
DATA MINING ROOTS

[Figure: Growth of Internet Hosts. Host count (0 to 1,000,000,000) by date, Jul-1993 to Jan-2012.]
Source: Internet System Consortium - www.isc.org/solutions/survey
WHY DATA MINING?

• In theory, "BIG DATA" can lead to much stronger
conclusions, but in practice many difficulties arise.
• The business community is well aware of today's
information overload, and one analysis shows that:
Managers' perceptions and opinions | Believe | Don't believe
Information overload is present in their own workplace. | 61% | 39%
The situation will get worse. | 80% | 20%
Ignore data in current decision-making processes because of the information overload. | > 50% | < 50%
Store this information for the future; it is not used for current analysis. | 84% | 16%
The cost of gathering information outweighs its value. | 60% | 40%
WHY DATA MINING?

• What are the solutions?
1. Work harder.
(But how long can you keep up? The limits
are very close.)
2. Employ an assistant.
(Maybe, if you can afford it.)
3. Ignore the data.
(But then you are not competitive in the
market.)
• The only real solution:
Replace the classical data analysis and
interpretation methodologies (both manual and
computer-based) with new DM technology.
WHY DATA MINING?

1. Data should be Accurate.
2. Data should be Stored according to Data Type.
3. Data should have Integrity.
4. Data should be Consistent.
5. Data should not be Redundant.
6. Data should be Timely.
7. Data should be Well Understood.
8. Data set should be Complete.
INDICATORS OF DATA QUALITY

• Definition:
The Data Warehouse (DW) is a collection of
integrated, Subject-Oriented Databases
designed to support the Decision-Support
Functions (DSF), where each unit of data is
relevant to some moment in time.
Types of Elementary or Derived Data
1. Old detail data.
2. New detail data.
3. Lightly summarized data.
4. Highly summarized data.
5. Metadata (the data directory or guide)
DATA WAREHOUSE CONCEPT

• To prepare these five types of elementary or
derived data in a DW, the fundamental types
of Data Transformation (DT) are standardized.
• There are four main types of Transformations:
1. Simple Transformations.
2. Cleansing and Scrubbing.
3. Integration.
4. Aggregation and Summarization.
DATA WAREHOUSE CONCEPT

A Three-Stage Data-Warehousing Development Process
1. Modeling:
In simple terms, to take the time to understand business
processes, the information requirements of these processes,
and the decisions that are currently made within processes.
2. Building:
a. To establish requirements for tools that suit the types of
decision support necessary for the targeted business
process.
b. To create a data model that helps further define
information requirements.
c. To decompose problems into data specifications and
the actual data store, which will, in its final form,
represent either a data mart or a more comprehensive
DW.
DATA WAREHOUSE DEVELOPMENT

3. Deploying:
a. To implement, relatively early in the overall
process, the nature of the data to be
warehoused and the various business intelligence
tools to be employed.
b. To begin by training users. The deploy stage
explicitly contains a time during which users
explore both the repository (to understand data
that are and should be available) and early
versions of the actual DW. This can lead to an
evolution of the DW, which involves adding more
data, extending historical periods, or returning to
the build stage to expand the scope of the DW
through a data model.
DATA WAREHOUSE DEVELOPMENT

How is DM different from SQL & OLAP, which are also applied to a DW?
• SQL is a standard relational database language that
is good for queries that impose some kind of
constraints on data in the database in order to
extract an answer.
• In contrast, DM methods are good for queries that are
exploratory in nature, trying to extract hidden, not so
obvious information.
• SQL is useful when we know exactly what we are
looking for and we can describe it formally.
• DM methods will be used when we know only vaguely
what we are looking for.
• Therefore, these two classes of DW applications are
complementary.
DATA WAREHOUSE & DATA MINING

• OLAP tools and methods have become very popular in
recent years as they let users analyze data in a
warehouse by providing multiple views of the data,
supported by advanced graphical representations. In
these views, different dimensions of data correspond
to different business characteristics.
• OLAP tools make it very easy to look at dimensional
data from any angle or to slice-and-dice it.
• Although OLAP tools, like DM tools, provide answers
that are derived from data, the similarity between
them ends here.

• OLAP tools do not learn from data, nor do they
create new knowledge. They are usually
special-purpose visualization tools that can
help end-users draw their own conclusions and
decisions, based on graphically condensed
data.
• OLAP tools are very useful for the DM process;
they can be a part of it but they are not a
substitute.

• DM is a young discipline with wide and
diverse applications.
• There is still a nontrivial gap between
general principles of DM and domain-specific,
effective DM tools for particular applications.
• Some application domains:
a. Biomedical and DNA Data Analysis
b. Financial Data Analysis
c. Retail Industry
d. Telecommunication Industry
DATA MINING APPLICATIONS

• A rapidly expanding and highly competitive
industry with a great demand for DM to:
- Understand the Business Involved
- Identify Telecommunication Patterns
- Catch Fraudulent Activities
- Make Better Use of Resources
- Improve the Quality of Service (QoS)
• Multidimensional analysis of telecom. data
- Intrinsically multidimensional: calling time, duration,
location of caller, type of call, etc.
TELECOMMUNICATIONS & DATA MINING

• Telecommunication companies generate and deal
with a huge amount of data.
• These data include:
1. Call Detail Data: which describes the calls that traverse the
telecommunication networks.
2. Network Data: which describes the state of the hardware and
software components in the network.
3. Customer Data: which describes the telecommunication
customers.
• Several DM applications in the telecommunication
industry can be used to:
1. Identify telecommunication fraud.
2. Improve marketing effectiveness.
3. Identify network faults.
TYPES OF TELECOMMUNICATIONS DATA

1. Call Detail Data
• Every time a call is placed on
a telecommunications
network, descriptive
information about the call is
saved as a call detail record.
• The number of call detail
records that are generated
and stored is huge. For
example, AT&T long distance
customers alone generate
over 300 million call detail
records per day (Cortes &
Pregibon, 2001).
• Given that several months of
call detail data is typically
kept online, this means that
tens of billions of call detail
records will need to be stored
at any time.
• Below is a list of features that one might
use when generating a summary
description of a customer based on the
calls they originate and receive over
some time period P:
1. Average Call Duration
2. % No-answer Calls
3. % Calls to/from a Different Area Code
4. % of Weekday Calls (Monday – Friday)
5. % of Daytime Calls (9am – 5pm)
6. Average # Calls Received Per Day
7. Average # Calls Originated Per Day
8. # Unique Area Codes Called During P
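The eight summary features listed above could be computed from raw call detail records roughly as follows. This is a sketch, not the method of the cited work: the record layout (the dictionary keys) and the two sample calls are hypothetical and exist only for illustration.

```python
from datetime import datetime

def summarize_calls(calls, days_in_period):
    """Summarize one customer's call detail records over a period P.

    Each call is a dict with hypothetical keys: 'start' (datetime),
    'duration_s', 'answered' (bool), 'direction' ('in' or 'out'),
    'other_area_code' and 'own_area_code' (str).
    """
    n = len(calls)
    if n == 0:
        return {}
    originated = [c for c in calls if c["direction"] == "out"]
    received = [c for c in calls if c["direction"] == "in"]

    def pct(flags):
        # Percentage of calls for which the flag is true.
        return 100.0 * sum(flags) / n

    return {
        "avg_call_duration_s": sum(c["duration_s"] for c in calls) / n,
        "pct_no_answer": pct(not c["answered"] for c in calls),
        "pct_other_area_code": pct(c["other_area_code"] != c["own_area_code"] for c in calls),
        "pct_weekday": pct(c["start"].weekday() < 5 for c in calls),
        "pct_daytime": pct(9 <= c["start"].hour < 17 for c in calls),
        "avg_received_per_day": len(received) / days_in_period,
        "avg_originated_per_day": len(originated) / days_in_period,
        "unique_area_codes_called": len({c["other_area_code"] for c in originated}),
    }

# Two made-up calls: a Monday morning outgoing call and a Saturday evening incoming call.
calls = [
    {"start": datetime(2012, 2, 6, 10, 0), "duration_s": 120.0, "answered": True,
     "direction": "out", "other_area_code": "212", "own_area_code": "914"},
    {"start": datetime(2012, 2, 11, 20, 0), "duration_s": 60.0, "answered": False,
     "direction": "in", "other_area_code": "914", "own_area_code": "914"},
]
features = summarize_calls(calls, days_in_period=2)
```

The result is one fixed-length record per customer, which is the form most classification and clustering methods expect.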

• During the period from 9am to 4pm, businesses
place roughly 1.5 times as large a share of their
total weekday calls as residences do.
• Note that at 5pm the ratio is close to 1,
indicating that calls during this timeframe are
not very useful for distinguishing between a
Business and a Residence.
• Calls in the evening timeframe (6pm – 1am) are
also useful in distinguishing between the two
types of customers.

• Telecommunication networks are extremely complex
configurations of equipment, comprised of thousands of
interconnected components.
• Each network element is capable of generating error and
status messages, which leads to a tremendous amount of
network data.
• This data must be stored and analyzed in order to support
network management functions, such as fault isolation.
• As was the case with the call detail data, network data is
also generated in real-time as a data stream and must often
be summarized in order to be useful for data mining.
• This is sometimes accomplished by applying a time window
to the data. For example, such a summary might indicate
that a hardware component experienced twelve instances
of a power fluctuation in a 10-minute period.
2. Network Data
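The time-window summarization described above can be illustrated with a short sketch. It assumes fixed, non-overlapping 10-minute windows and a simple (timestamp, component, event type) record layout; both are assumptions for the example, and the component name is invented.

```python
from collections import Counter

def summarize_by_window(events, window_s=600):
    """Count events per (component, event type, window start).

    `events` is an iterable of (timestamp_s, component, event_type) tuples;
    this record layout is an assumption made for illustration.
    """
    counts = Counter()
    for timestamp_s, component, event_type in events:
        window_start = int(timestamp_s // window_s) * window_s
        counts[(component, event_type, window_start)] += 1
    return counts

# Twelve power fluctuations in the first 10 minutes, one more in the next window.
events = [(t, "switch-7", "power_fluctuation") for t in range(0, 600, 50)]
events.append((650, "switch-7", "power_fluctuation"))
summary = summarize_by_window(events)
```

Each key of `summary` then plays the role of one summarized record, such as "twelve power fluctuations on this component in this 10-minute period".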

• Telecommunication companies, like other large businesses,
may have millions of customers, and this means maintaining
a database of information on these customers.
• This information will include name and address information
and may include other information such as service plan and
contract information, credit score, family income and
payment history.
• This information may be supplemented with data from external
sources, such as credit reporting agencies. The customer data
maintained by telecommunication companies does not substantially
differ from that maintained in most other industries; however,
customer data is often used in conjunction with other data in
order to improve results.
• For example, customer data is typically used to supplement
call detail data when trying to identify phone fraud.
3. Customer Data

1. Fraud Detection
2. Marketing/Customer Profiling
   A. Customer Churn
   B. Customer Insolvency
3. Network Fault Isolation
TELECOMMUNICATIONS ISSUES & DATA MINING

1. Fraud Detection
• Fraud is a serious problem for telecommunication companies,
leading to billions of dollars in lost revenue each year.
• Fraud can be divided into two categories:
1. Subscription fraud: occurs when a customer opens an
account with the intention of never paying for the
account charges.
2. Superimposition fraud: involves a legitimate account with
some legitimate activity, but also includes some
“superimposed” illegitimate activity by a person other than
the account holder.
• The most common method for identifying fraud is to build a
profile of a customer’s calling behavior and compare recent
activity against this profile.
• Thus, this Data Mining application relies on deviation
detection. The calling behavior is captured by summarizing
the call detail records for a customer.
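A minimal sketch of this profile-and-compare idea is shown below. It assumes the profile is just the mean and standard deviation of one daily quantity, and that a deviation of more than three standard deviations is worth flagging; real systems use richer profiles, and both the threshold and the sample numbers are arbitrary choices for illustration.

```python
import statistics

def deviation_score(history, recent):
    """Return how many standard deviations `recent` lies from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return 0.0 if recent == mean else float("inf")
    return (recent - mean) / stdev

def is_suspicious(history, recent, threshold=3.0):
    """Flag activity whose deviation exceeds the threshold (3.0 is an arbitrary choice)."""
    return abs(deviation_score(history, recent)) > threshold

# Made-up profile: minutes of international calls per day over one week.
daily_intl_minutes = [2.0, 0.0, 3.0, 1.0, 2.0, 4.0, 2.0]
flag_spike = is_suspicious(daily_intl_minutes, 45.0)
flag_normal = is_suspicious(daily_intl_minutes, 3.0)
```

A flag raised this way only says that behavior changed, not that fraud occurred, so it would normally feed a further check rather than trigger action directly.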

• Fraudulent pattern analysis and the identification of unusual
patterns:
- Identify potentially fraudulent users and their atypical
usage patterns.
- Detect attempts to gain fraudulent entry to customer
accounts.
- Discover unusual patterns which may need special
attention.
• Multidimensional association and sequential pattern analysis:
- Find usage patterns for a set of communication services
by customer group, by month, etc.
- Promote the sales of specific services.
- Improve the availability of particular services in a region.
• Use Of Visualization Tools In Telecommunication Data Analysis.
1. Fraud Detection

• If the call detail summaries are updated in real-time,
fraud can be identified soon after it occurs.
• Because new behavior does not necessarily imply fraud,
one fraud-detection system augments this basic
approach by comparing the new calling behavior to
profiles of generic fraud—and only signals fraud if the
behavior matches one of these profiles (Cortes &
Pregibon, 2001).
• For example, one sample rule that combines call detail
and customer-level data for detecting cellular fraud is:
“People who have a price plan that makes international
calls expensive and who display a sharp rise in
international calls are likely the victims of cloning fraud.”
1. Fraud Detection
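The sample cloning-fraud rule quoted above can be written as a small predicate. The rule itself does not say what counts as "expensive" or as a "sharp rise", so the two thresholds below are hypothetical values chosen only to make the example concrete.

```python
def likely_cloning_victim(intl_rate_per_min, baseline_intl_calls, recent_intl_calls,
                          expensive_rate=1.0, rise_factor=5.0):
    """Apply the sample rule: expensive international plan plus a sharp rise in calls.

    `expensive_rate` and `rise_factor` are hypothetical thresholds;
    the source rule does not quantify them.
    """
    has_expensive_plan = intl_rate_per_min >= expensive_rate
    # Treat a zero baseline as one call so the comparison stays meaningful.
    sharp_rise = recent_intl_calls >= rise_factor * max(baseline_intl_calls, 1)
    return has_expensive_plan and sharp_rise
```

The first argument comes from customer-level data (the price plan) and the other two from summarized call detail data, which is the combination the rule relies on.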

• Because fraud is relatively rare and the number of verified
fraudulent calls is relatively low, the fraud application involves
predicting a rare event where the underlying class distribution is
highly skewed.
• DM algorithms often have great difficulty dealing with highly
skewed class distributions and predicting rare events.
• The fraud problems reviewed here share several common
characteristics, which are summarized below:
1. Significant loss of revenue.
2. Unpredictability of human behavior.
3. Information can be retrieved only after processing
huge amounts of data.
4. Fraudulent cases are rare compared to the
legitimate ones.
1. Fraud Detection

• Studies on similar problems appear in the mobile
communications, conventional communications and
credit card sectors:
1. Fraud In Mobile Communications.
2. Fraud In Conventional Communications.
3. Fraud In Credit Card Usage.
1. Fraud Detection

Example (1): if fraud makes up only 0.2% of all calls,
many data mining systems will not generate any rules for
finding fraud, since a default rule, which never predicts
fraud, would be 99.8% accurate. To deal with this issue,
the training data is often selected to increase the
proportion of fraudulent cases.
Example (2): Ezawa & Norton (1995) increased the
percentage of fraudulent calls in the training data from
1-2% to 9-12%. However, the use of a non-representative
training set can be problematic because it does not provide
the DM method with accurate information about the true class
distribution (Weiss and Provost, 2003).
1. Fraud Detection
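The arithmetic behind the default rule, and one way of raising the proportion of fraudulent cases in the training data, can be sketched as follows. Undersampling the legitimate calls is only one possible selection strategy and is an assumption here; the slides do not say how the training data was selected, and the data below is synthetic.

```python
import random

def default_rule_accuracy(fraud_rate):
    """Accuracy of a rule that always predicts 'not fraud'."""
    return 1.0 - fraud_rate

def rebalance(legit, fraud, target_fraud_share, seed=0):
    """Undersample legitimate cases so fraud makes up `target_fraud_share` of the result."""
    rng = random.Random(seed)
    n_legit = round(len(fraud) * (1 - target_fraud_share) / target_fraud_share)
    kept = rng.sample(legit, min(n_legit, len(legit)))
    return kept + fraud

# Synthetic data: 0.2% of 10,000 calls are fraudulent (label 1).
legit = [("call", i, 0) for i in range(9980)]
fraud = [("call", i, 1) for i in range(20)]
accuracy = default_rule_accuracy(len(fraud) / (len(legit) + len(fraud)))
train = rebalance(legit, fraud, target_fraud_share=0.10)
```

With these numbers the never-predict-fraud rule is 99.8% accurate yet finds no fraud at all, while the rebalanced training set contains 10% fraudulent cases. As noted above, a model trained on such a set no longer sees the true class distribution.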

• Telecommunication companies maintain a
great deal of data about their customers. In
addition to the general customer data that
most businesses collect, telecommunication
companies also store call detail records,
which precisely describe the calling behavior
of each customer.
• This information can be used to profile the
customers and these profiles can then be
used for marketing and/or forecasting
purposes.
2. Marketing/Customer Profiling
Example-1 (MCI): AT&T (1991) used a rate averaging system;
calls between areas of similar distance were billed the same,
regardless of actual transmission costs. This system allowed
sparsely populated areas to enjoy the same low phone rates as
high-volume, major cities
(http://www.hagley.org/library/exhibits/Mci/earlyyearsmci.html).
The MCI Friends and Family promotion relied on DM to
identify associations within data.

• Example-2: Another marketing application that
relies on this technique is a DM application for
finding the set of Non-U.S. countries most often
called together by U.S. telecommunication
customers (Cortes & Pregibon, 2001).
• One set of countries identified by this DM
application is: {Jamaica, Antigua, Grenada,
Dominica}.
• This information is useful for establishing and
marketing international calling plans.
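Finding countries that are often called together is a form of association (frequent itemset) mining. The sketch below shows only its simplest case, counting pairs; the cited work does not specify its algorithm here, and the customer data is invented for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(customer_countries, min_support):
    """Count pairs of countries called by the same customer.

    `customer_countries` maps a customer id to the set of countries called;
    pairs called by at least `min_support` customers are returned.
    """
    pair_counts = Counter()
    for countries in customer_countries.values():
        for pair in combinations(sorted(countries), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

# Invented toy data, not taken from the cited study.
calls = {
    "c1": {"Jamaica", "Antigua", "Grenada"},
    "c2": {"Jamaica", "Antigua"},
    "c3": {"Jamaica", "France"},
}
pairs = frequent_pairs(calls, min_support=2)
```

Larger sets such as the four-country group mentioned above would be built by extending frequent pairs to triples and beyond, which is what Apriori-style algorithms do.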

• Customer Churn involves a customer leaving one
telecommunication company for another.
• Customer Churn is a significant problem because of the
associated loss of revenue and the high cost of attracting
new customers.
• Data Mining techniques now allow companies to
mine historical data in order to predict when a customer is
likely to leave.
• These techniques typically utilize billing data, call detail data,
subscription information (calling plan, features, contract
expiration data) and customer information (e.g., Age).
• Based on the induced Model, the company can then take
Action, if desired.
A. Customer Churn

Example: A wireless company might offer a customer a free phone
for extending their contract. One such effort utilized a neural
network to estimate the probability h(t) of cancellation at a given
time t in the future (Mani, Drew, Betz & Datta, 1999).
• In order to effectively mine the call detail data, it must be
summarized to the customer level (Business or Residence).
• Then, a Classifier Induction Program can be applied to a set of
labeled training examples in order to Build A Classifier. This
approach has been used to identify Fax Lines (Kaplan, Strauss &
Szegedy, 1999) and to classify a phone line as belonging to a
Business or Residence (Cortes & Pregibon, 1998).
• Other applications have used this approach to identify phone
lines belonging to Telemarketers and to classify a phone line as
being used for Voice, Data, Or Fax.
A. Customer Churn

• These rules were generated using SAS Enterprise Miner, a
sophisticated Data Mining Package that supports Multiple Data
Mining Techniques.
• The rules shown here were generated using a Decision Tree
Learner. However, a Neural Network was also used to predict the
probability of a customer being a Business or Residential
customer, based solely on the distribution of calls by time of day
(i.e., the Neural Network had 24 Inputs, One per Hour of The Day).
Pseudo-Code
• Rule 1: if < 43% of calls last 0-10 seconds and < 13.5% of calls occur during the
weekend and neural network says that P(business) > 0.58 based on time of
day call distribution then Business Customer.
• Rule 2: if calls received over two-month period from at most 3 unique area
codes and <56.6% of calls last 0-10 seconds then Residential Customer.
A. Customer Churn
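The two pseudo-code rules above translate directly into a small function. The thresholds are the ones given in the rules; the order in which the rules are tried, and returning `None` when neither fires, are assumptions, since the slide shows only two rules of a larger tree.

```python
def classify_line(pct_short_calls, pct_weekend_calls, p_business, unique_area_codes):
    """Apply Rule 1 and Rule 2 from the slide; return None when neither rule fires.

    pct_short_calls: % of calls lasting 0-10 seconds
    pct_weekend_calls: % of calls occurring during the weekend
    p_business: P(business) from the time-of-day neural network
    unique_area_codes: unique area codes of calls received over two months
    """
    # Rule 1
    if pct_short_calls < 43 and pct_weekend_calls < 13.5 and p_business > 0.58:
        return "Business"
    # Rule 2
    if unique_area_codes <= 3 and pct_short_calls < 56.6:
        return "Residential"
    return None
```

For example, `classify_line(20, 5, 0.8, 10)` satisfies Rule 1, while `classify_line(50, 30, 0.2, 2)` fails Rule 1 and satisfies Rule 2.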

Model-Building to Predict Churn
• BSNL Satara, like other telecom companies, suffers from churning
customers who use the provided services without paying their
dues. It provides many services, such as Internet, fax, postpaid and
prepaid mobile phones, and fixed phones; the researchers focus only
on postpaid phones with respect to churn prediction, which is the
purpose of this research work.
• In essence, many aggregated attributes containing the lengths of
calls made by every customer over a five-month period were created.
• The model predicts customers who are at risk of leaving a
company in the telecommunication sector.
• Using this model, the company will be able to identify such
customers.
A. Customer Churn

[Figure: Difference Between Churning And Not Churning Customers, plotted by week (weeks 1 to 4) for the Not Churning and Possible Churning groups]
Customers | 1st Week | 2nd Week | 3rd Week | 4th Week
Not Churning | 102.5 | 89.2 | 96.1 | 117.2
Possible Churning | 71 | 109.5 | 131.5 | 174
A. Customer Churn

• The major task at this stage was creating and training a
decision support system that can discriminate between churning and
non-churning customers.
• For the proposed system, several data mining tools were
available, and MATLAB was found to be the most suitable for this
purpose because it supports many algorithms and includes the Neural
Network Toolbox.
• The algorithm used for this research work is the Back Propagation
Algorithm (BPA). While building the model, the whole dataset was divided
into three subsets: the training set, the validation set, and the test set.
• The training set is used to train the network.
• The validation set is used to monitor the error during the training process.
• The test set is used to assess the performance of the model.
A. Customer Churn
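The three-way division of the dataset can be sketched as below. The 60/20/20 proportions and the fixed random seed are assumptions for the example; the slides do not state how large each subset was.

```python
import random

def three_way_split(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains after the first two subsets
    return train, val, test

# Stand-in for 100 customer records.
train, val, test = three_way_split(range(100))
```

Keeping the test set untouched until the end is what makes its error estimate trustworthy; the validation set is the one consulted during training, for instance to decide when to stop.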

• Telecommunication companies, as well as other service-
providing companies, often suffer from Insolvent Customers
who use the provided services without paying their dues.
• Companies in the telecommunications business take precautions
against these customers; however, in most cases these are
measures applied quite late, often with no significant effect.
• As a result, many unpaid bills end up in the account of
uncollectible debts.
• Detection and prevention of this behavior is an objective of
prime importance for the industry.
• This is especially important today, as strong competition is
building in many sectors of the telecommunications industry
and monopolies cease to exist.

Model-Building to Predict Insolvency
• A model capable of predicting the insolvent behavior
of customers well in advance can be a useful
decision support tool for a service-providing
company.
• However, building such a model is not a trivial
process.
• The objective is to study the feasibility of
building such a tool for a major telecommunications
operator using Data Mining techniques.

• Although dispersed in most cases, these data, when inter-related, may
contain valuable information relating to the Insolvency-Prediction
Problem.
• Examples of such data are:
1. Customer Profiles.
2. Usage of the Offered Service.
3. Financial Transactions of the Customers with the Company.
• Like fraudsters in other Fraud Detection Problems, it is assumed
that Insolvent Customers on the average behave differently
from the rest of the customers, especially during a critical
period preceding the due-date for payment.
• The goal is therefore to reveal those Behavioral Patterns, which
may distinguish these customers from the rest.
Model-Building to Predict Insolvency

Model-Building Difficulties
• Some of the inherent limitations that make this a particularly
difficult problem, as identified early in the study, were the
following:
A. Insolvent customer’s behavior can be either the result of
fraudulence or due to factors beyond the customer’s will
(force majeure, social factors etc.).
B. The available data sets, as often is the case with real life
problems, represent the individual customer in a limited
and distorted way.
C. In the available large data set, many parameters can be
defined, often deduced from primary transactional data,
which can describe the behavior of a customer of a
Modern Telecommunications Company.

Model Stages
1 Learning the application domain
2 Creating a target data set
3 Data cleaning and preprocessing
4 Data reduction and projection
5 Choosing the function of data mining
6 Choosing the data mining algorithm(s)
7 Data Mining
8 Interpretation
9 Using discovered knowledge

• The study reported in this Model can be described as a KDD experiment.
• Following a typical KDD framework (Fayyad et al., 1996; Han and
Kamber, 2001), where Data Mining is the core in the overall process, the
experiment went through all nine steps of Fig. 1, starting from the stage of
gaining profound knowledge of the domain till the actual use of
discovered knowledge.
Problem Definition and Application Domain
• Insolvency Prediction is the ability to predict which customers will
become insolvent, that is, the customers who will refuse to pay their
telephone bills by the next due date for payment, while there is still
time for preventive (and possibly aversive) measures.
• Insolvency Prediction makes sense in business terms only if it can take
place early enough to be of use to the company.
Model Stages

Problem Definition and Application Domain
• For this problem, three objectives were set as most prevailing for the
company:
1. Detection of as many insolvent customers as possible.
2. Minimization of the false alarms, i.e. the number of solvent customers that
are falsely classified as insolvent.
3. Timely warning to the service provider in order to take action against a
prospective insolvent customer.
Using Discovered Knowledge for Insolvency Prediction
• In the experiments and by the case-by-case comparison, every case
(customer) was examined separately.
• Then the following scheme was followed: if all three algorithms agreed on
one classification (solvent or insolvent), then the customer was labeled as
such. On the contrary, if one of the classifiers did not agree with the
others, then the customer was considered as unclassified.
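The agreement scheme described above is a unanimous vote, which can be written in a few lines. The label strings are assumptions for the example; the three classifiers themselves are not shown.

```python
def combine_predictions(predictions):
    """Return the shared label if all classifiers agree, else 'unclassified'."""
    first = predictions[0]
    if all(p == first for p in predictions):
        return first
    return "unclassified"

agreed = combine_predictions(["insolvent", "insolvent", "insolvent"])
split = combine_predictions(["solvent", "solvent", "insolvent"])
```

Requiring unanimity trades coverage for reliability: fewer customers receive a label, but a label is given only when all three classifiers support it.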

Using Discovered Knowledge for Insolvency Prediction
Combined Classification Results in a Case-by-case Comparison
Actual \ Predicted | Insolvent | Solvent | Unclassified | Total
Cases Used For Training
Insolvent | 43 (31.6%) | 29 (21.3%) | 64 (47.1%) | 136
Solvent | 0 (0%) | 1174 (96.8%) | 38 (3.2%) | 1212
Cases Used For Testing
Insolvent | 19 (29.69%) | 17 (26.56%) | 28 (43.75%) | 64
Solvent | 1 (0.15%) | 626 (95.72%) | 27 (4.13%) | 654
90.28% of the cases used for training were correctly classified.
89.83% of the cases used for testing were correctly classified.
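The two overall percentages can be reproduced from the counts in the table. The sketch below assumes that "correctly classified" means the predicted class equals the actual class, with unclassified cases counted as not correct; that reading is consistent with the reported figures.

```python
def percent_correct(rows):
    """Percentage of cases whose predicted class equals the actual class.

    `rows` maps each actual class to its (insolvent, solvent, unclassified)
    predicted counts. Unclassified cases count as not correct.
    """
    correct = rows["insolvent"][0] + rows["solvent"][1]
    total = sum(sum(counts) for counts in rows.values())
    return 100.0 * correct / total

# Counts copied from the table above.
training = {"insolvent": (43, 29, 64), "solvent": (0, 1174, 38)}
testing = {"insolvent": (19, 17, 28), "solvent": (1, 626, 27)}
train_pct = round(percent_correct(training), 2)
test_pct = round(percent_correct(testing), 2)
```

That is, (43 + 1174) / 1348 for the training cases and (19 + 626) / 718 for the testing cases. Note that these overall figures are dominated by the solvent majority: only about 30% of the insolvent customers were actually detected.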
1. The DM experiment was a multi-stage process.
2. Various algorithms were tested in order to establish which one is
most suitable for this problem and this particular dataset.
a. For the specific problem of customer insolvency in
telecommunication companies, the experiment was highly reusable,
except for deviations in the syntactic and semantic definitions of
the operators' data.
b. A parametric tool to handle such deviations is one of the
objectives of further research.

• Telecommunication networks are extremely complex
configurations of hardware and software.
• Most of the network elements are capable of at least
limited self-diagnosis, and these elements may
collectively generate millions of status and alarm
messages each month. In order to effectively manage
the network, alarms must be analyzed automatically in
order to identify network faults in a timely manner—or
before they occur and degrade network performance.
• A proactive response is essential to maintaining the
reliability of the network. Because of the volume of the
data, and because a single fault may cause many
different, seemingly unrelated, alarms to be generated,
the task of network fault isolation is quite difficult.
3. Network Fault Isolation

• Telecommunication Alarm Sequence Analyzer (TASA) is one
tool that helps with the knowledge acquisition task for alarm
correlation (Klemettinen, Mannila & Toivonen, 1999).
Data Mining Role for Identifying Faults
• TASA automatically discovers recurrent patterns of alarms within
the network data along with their statistical properties, using a
specialized Data Mining Algorithm.
• Network specialists then use this information to construct a rule-
based alarm correlation system, which can then be used in real-
time to identify faults.
• TASA is capable of finding episodic rules that depend on temporal
relationships between the alarms. For example, it may discover the
following rule: “if alarms of type link alarm and link failure occur within 5
seconds, then an alarm of type high fault rate occurs within 60 seconds
with probability 0.7.”
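To make the form of such an episode rule concrete, the sketch below estimates its probability from a time-stamped alarm log. It is not TASA's algorithm: it simply checks one given rule, and it assumes that the first antecedent alarm precedes the second, which the rule as quoted does not require. The sample log is invented.

```python
def episode_rule_confidence(alarms, antecedent, consequent, window_a=5, window_c=60):
    """Estimate P(consequent within window_c | antecedent pair within window_a).

    `alarms` is a list of (timestamp_s, alarm_type) tuples sorted by time.
    Simplification: antecedent[0] must occur before antecedent[1].
    """
    matches = hits = 0
    for i, (t0, kind0) in enumerate(alarms):
        if kind0 != antecedent[0]:
            continue
        # Later alarms of the second antecedent type within window_a seconds.
        second = [t for t, k in alarms[i + 1:] if k == antecedent[1] and t - t0 <= window_a]
        if not second:
            continue
        matches += 1
        t1 = second[0]
        if any(k == consequent and 0 < t - t1 <= window_c for t, k in alarms):
            hits += 1
    return hits / matches if matches else None

# Invented alarm log: the antecedent occurs twice, the consequent follows once.
alarms = [
    (0, "link alarm"), (3, "link failure"), (40, "high fault rate"),
    (100, "link alarm"), (104, "link failure"), (300, "high fault rate"),
    (400, "link alarm"),
]
confidence = episode_rule_confidence(alarms, ("link alarm", "link failure"), "high fault rate")
```

A discovery tool would search over many candidate rules and keep those whose frequency and confidence are high enough; this sketch covers only the scoring of one candidate.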

• Reformulating the network data into a different representation has
drawbacks. The most significant one is that some information will be
lost in the reformulation process.
• For example, with a scalar-based representation, all
sequence information is lost.
• Time-weaver (Weiss & Hirsh, 1998) is a Genetic-Algorithm (GA) based
data mining system that is capable of operating directly on the raw
network-level time series data (as well as other time-series data), thereby
making it unnecessary to re-represent the network level data.
• Given a sequence of time-stamped events and a target event T,
Time-weaver will identify patterns that successfully predict T.
• Time-weaver essentially searches through the space of possible patterns,
which includes sequence and temporal relationships, to find predictive
patterns.
• The system is especially designed to perform well when the target event is
rare, which is critical since most network failures are rare. In the case studied,
the target event is the failure of components in the 4ESS switching system.
Data Mining Role for Identifying Faults

CONCLUSIONS
• This presentation described how important Data Mining is in our lives.
• It also explained Data Mining technology in more detail.
• Furthermore, the presentation described how Data Mining is used in the
telecommunications industry.
• There are three main sources of telecommunication data: call detail, network, and
customer data.
• Common data mining applications include fraud detection, marketing, and
network fault isolation.
• The presentation highlighted several key issues that affect the ability to mine data,
and commented on how they impact the Data Mining process.
• The telecommunications industry has been one of the earliest adopters of Data
Mining technology, largely because of the amount and quality of the data that it
collects.
• This has resulted in many successful Data Mining applications.
• Data Mining technology can be an effective way for companies to achieve their
goals, especially in the telecommunications industry.

REFERENCES
1 Adrian COSTEA, “The Analysis of the Telecommunications Sector by The Means of
Data Mining Techniques” (Paper)
2 Constantinos S. Hilas, John N. Sahalos, “User Profiling for Fraud Detection in
Telecommunication Networks” (Paper)
3 “Data-Mining Concepts” (Chapter from Textbook)
4 Gary M. Weiss, “Data Mining in Telecommunications” (Chapter from Textbook)
5 Jamal Sophieh, “Data mining in telecommunications and studying its status in Iran
telecom companies and operators” (Paper)
6 Mieke Jans, Nadine Lybaert, Koen Vanhoof, “Data Mining for Fraud Detection: Toward
an Improvement on Internal Control Systems?” (Paper)
7 Rahul J. Jadhav, Usharani T. Pawar, “Churn Prediction in Telecommunication Using
Data Mining Technology” (Paper)
8 S. Daskalaki, I. Kopanas, M. Goudara, N. Avouris, “Data mining for decision support on
customer insolvency in telecommunications business” (Paper)
9 V. Umayaparvathi, Dr. K. Iyakutti, “A Fraud Detection Approach in Telecommunication
using Cluster GA” (Paper)
