Data mining seminar report
Data mining seminar report Document Transcript

  • 1. DATA MINING SEMINAR
    TOPIC: DATA MINING FOR TELECOMMUNICATIONS
    SUBMITTED BY: MAYURI KIRAN ANVEKAR
    SUBMITTED TO: RINO CHERIAN
    DATE: 30/03/2012
  • 2. Table of Contents
    I. ABSTRACT
    II. INTRODUCTION
    III. TYPES OF TELECOM DATA
        1. Call Summary Data
        2. Network Data
        3. Customer Data
    IV. Mining Sequential Patterns In Telecommunication Database Using Genetic Algorithm
        1. Sequential Patterns Mining
        2. Genetic Algorithm
        3. Mining Sequential Patterns in Telecommunication Database Using GA
            3.1 Chromosome
            3.2 Genetic Operators
            3.3 Fitness Function
            3.4 SPT-GA Algorithm
        4. Experiment Results
    V. Discovering Structural Patterns In Telecommunications Data
    VI. CONCLUSION
    VII. REFERENCES
  • 3. I. ABSTRACT
    Telecommunication companies generate a tremendous amount of data. These data include call detail data, which describes the calls that traverse the telecommunication networks; network data, which describes the state of the hardware and software components in the network; and customer data, which describes the telecommunication customers. This report describes how data mining can be used to uncover useful information buried within these data sets. Several data mining applications are described, and together they demonstrate that data mining can be used to identify telecommunication fraud, improve marketing effectiveness, and identify network faults.
    II. INTRODUCTION
    International telecommunications play an important role in message sharing and information passing. In the last decade, the structure of telecommunications companies has changed dramatically, from public monopolies to private companies. The rapid development of mobile telephone networks, video calling and Internet technologies has created enormous competitive pressure in the sector. As new competitors enter the market, telecom companies need intelligent tools to stay profitable and withstand the competition. Stock market expectations are also high, and investors and financial analysts need tested tools to learn how companies perform financially compared to their competitors, what they are good at, who their major competitors are, and so on. In other words, telecom companies need to benchmark their performance against their competitors in order to retain an important role in this market. An enormous amount of information about these companies' financial performance is now publicly available. This amount greatly exceeds our capacity to analyze it; the problem is that we often lack tools to process these data quickly and accurately.
    Data mining for telecommunications companies involves the use of simple, traditional and advanced mathematical techniques to analyze large populations of data and deliver insights, forecasts, explanations and predictions of how systems, customers, networks and the marketplace are likely to react to different situations. The hypercompetitive nature of the telecom industry has created a need to understand customers, to keep them, and to model effective ways to market new products. This creates great demand for innovations such as data mining techniques, which help to understand new business trends, catch fraudulent activities, identify telecommunication patterns, make better use of resources and improve the quality of services.
  • 4. III. TYPES OF TELECOM DATA
    The initial step in the data mining process is to understand the data. Here we discuss three main types of telecom data:
    1. Call Summary Data
    Every time a call is placed on a telecom network, descriptive information about the call is saved as a call detail record. At a minimum, each call detail record includes the originating and terminating phone numbers, the date and time of the call, and the duration of the call.
    2. Network Data
    Telecommunication networks are extremely complex configurations of equipment, composed of interconnected components. Each network element is capable of generating error and status messages, which leads to a tremendous amount of network data.
    3. Customer Data
    Telecommunication companies, like other businesses, have millions of customers. Out of necessity, they maintain a database of information on these customers. This information includes name and address, and may include other information such as service plan, credit score, contract information, family income and payment history.
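To make the minimum contents of a call detail record concrete, the following is a small sketch of such a record as a data structure. The class name, field names, helper function and sample values are illustrative assumptions; the text only states which pieces of information a record contains.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CallDetailRecord:
    """Minimum fields of a call detail record, as listed in the text."""
    originating_number: str   # phone number that placed the call
    terminating_number: str   # phone number that received the call
    start_time: datetime      # date and time of the call
    duration: timedelta       # duration of the call


def total_duration(records, originating_number):
    """Sum the call durations for one originating number."""
    return sum(
        (r.duration for r in records if r.originating_number == originating_number),
        timedelta(),
    )


# Made-up example records (illustrative values only).
records = [
    CallDetailRecord("555-0100", "555-0199", datetime(2012, 3, 30, 11, 5), timedelta(minutes=30)),
    CallDetailRecord("555-0100", "555-0142", datetime(2012, 3, 30, 18, 20), timedelta(minutes=5)),
    CallDetailRecord("555-0177", "555-0100", datetime(2012, 3, 30, 19, 0), timedelta(minutes=12)),
]
print(total_duration(records, "555-0100"))
```

Summaries of this kind (for example, total call time per customer) are one simple way in which raw call records can be turned into per-customer features before mining.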
  • 5. IV. Mining Sequential Patterns in Telecommunication Database Using Genetic Algorithm
    Sequential pattern mining is the process of finding relationships between occurrences of sequential events, to determine whether the occurrences follow any specific order. Extracting sequential patterns is not polynomial in execution time. Other algorithms for sequential pattern mining can guarantee optimal solutions, but they do not take into account the time needed to reach them. The algorithm described here, which is based on genetic concepts, gives a possibly non-optimal solution in a reasonable (polynomial) execution time.
    1. Sequential Patterns Mining
    The goal is to find all subsequences in the given sets of transactions. This approach is useful when the data to be mined are sequential in nature, as in databases with time-series characteristics. A sequential pattern can be defined as follows.
    Definition: Let I = {x1, ..., xn} be a set of items. An itemset is a non-empty subset of items, and an itemset with k items is called a k-itemset. A sequence s = (X1, ..., Xl) is an ordered list of itemsets, and an itemset Xi (1 ≤ i ≤ l) in a sequence is called a transaction. In a set of sequences, a sequence s is maximal if s is not contained in any other sequence.
    2. Genetic Algorithm
    Genetic Algorithm (GA) is a part of evolutionary computing, which is a rapidly growing area of artificial intelligence. A genetic algorithm starts with a set of solutions (represented by chromosomes) called the population. Solutions from one population are taken and used to form a new population by mutation and crossover, in the hope that the new population will be better than the old one.
    The solutions used to form new solutions (offspring) are selected according to their fitness. This is repeated until some condition (for example, the number of generations or
  • 6. improvement of the best solution) is satisfied. To measure the quality of a solution, a fitness function is evaluated for each chromosome in the population.
    3. Mining Sequential Patterns in Telecommunication Database Using Genetic Algorithm
    The algorithm for mining sequential patterns in a telecommunication database using a genetic algorithm is called the SPT-GA algorithm. First, we present our chromosome structure and encoding schema and the genetic operators, and then we define the fitness assignment and selection criteria. Finally, we give the structure of the SPT-GA algorithm.
    3.1 Chromosome. The structure of the GA chromosomes and their representation are as follows:
    3.1.1 Structure. In the telecommunication database, country code values are used to create the chromosomes. Chromosomes have a fixed length, equal to the number of country codes available in the database, as in Figure 1.
    3.1.2 Representation. In genetic algorithms there are many alternative ways to represent a chromosome, depending on the problem, such as binary and integer representations. To decide which representation is better for sequential pattern rules, we should choose a coding in which short, low-order schemata are relevant to the underlying problem and relatively unrelated to schemata over other fixed positions. We should also select the smallest alphabet that permits a natural expression of the problem, as presented in [8].
    In the SPT-GA algorithm, we choose the binary representation because it is the most suitable for our algorithm: it needs less space, and it represents the needed information (whether an element occurred or not).
    For example, using the telecommunication database in Table 1, the sequence <249, 973, 91> can be represented as in Figure 2.
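The binary representation described in Section 3.1 can be sketched as follows. This is a minimal illustration, not the SPT-GA implementation: the function names and the six-entry list of country codes are assumptions made for the example (the database in the experiments has 60 country codes), and the actual layout of Figure 2 may differ.

```python
def encode(sequence, country_codes):
    """Binary chromosome: bit k is 1 if country_codes[k] occurs in the sequence."""
    present = set(sequence)
    return [1 if code in present else 0 for code in country_codes]


def decode(chromosome, country_codes):
    """Set of country codes whose bit is 1 (the call order is not stored)."""
    return {code for bit, code in zip(chromosome, country_codes) if bit}


# Hypothetical list of country codes; one bit position per code.
COUNTRY_CODES = [1, 44, 91, 92, 249, 973]

sequence = [249, 973, 91]
chromosome = encode(sequence, COUNTRY_CODES)
print(chromosome)  # [0, 0, 1, 0, 1, 1]
print(decode(chromosome, COUNTRY_CODES))
```

The chromosome has a fixed length equal to the number of country codes, and each bit only records whether a code occurred, so two sequences that contain the same codes in a different order map to the same bit string.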
  • 7. Additionally, as Figure 2 shows, the order cannot be extracted directly from this representation. To solve this problem, we associate the transaction sequence as metadata with each chromosome. For that, we use a vertical bitmap representation, which makes the SPT-GA algorithm take less time and space to execute.
    3.2 Genetic Operators. SPT-GA uses genetic operators to generate the offspring of the existing population.
    The genetic algorithm works on chromosomes represented by binary strings, where each bit corresponds to an element occurrence (0 or 1); the number of bits is equal to the number of items. After the solution domain is encoded, an initial set of chromosome solutions is generated randomly, its size depending on the population size.
    The genetic algorithm selects chromosomes according to our fitness function. The main measure used by our algorithm is the sequential interestingness measure (SIM).
    Then crossover takes place: it selects genes from the parent chromosomes and creates a new offspring. The simplest way to do this is to choose a crossover point at random, copy everything before this point from the first parent, and copy everything after it from the second parent. Crossover is shown in Figure 3.
    After crossover is performed, mutation takes place. Its purpose is to prevent all solutions in the population from falling into a local optimum of the problem being solved. Mutation randomly changes the new offspring. For binary encoding, we can switch a few randomly chosen bits from 1 to 0 or from 0 to 1. Mutation is shown in Figure 3.
    3.3 Fitness Function. The fitness function method discovers sequential patterns corresponding to the interests of users without using background knowledge. This method defines a new criterion called the sequential interestingness measure (SIM).
    Definition: The sequential interestingness measure of a rule A → C is:
    SIM(A → C) = min_{Ci ∈ C} { (Confidence(A | Ci))^α } × Support(A → C)
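The operators of Section 3.2 and the SIM formula of Section 3.3 can be sketched as follows. This is an illustration under stated assumptions, not the SPT-GA code: the function names are hypothetical, mutation is modelled as an independent flip of each bit with a small probability (one common way to "switch a few randomly chosen bits"), and `sim` takes the confidence values Confidence(A | Ci) and the support as precomputed numbers, because the report does not spell out how they are computed from the database.

```python
import random


def single_point_crossover(parent1, parent2, rng):
    """Copy bits before a random point from parent1 and the rest from parent2."""
    point = rng.randint(1, len(parent1) - 1)  # keep at least one bit from each parent
    return parent1[:point] + parent2[point:]


def bit_flip_mutation(chromosome, rate, rng):
    """Flip each bit independently with probability `rate` (assumed variant)."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]


def sim(confidences, support, alpha):
    """SIM = min_i(confidence_i ** alpha) * support, with precomputed inputs."""
    return min(c ** alpha for c in confidences) * support


rng = random.Random(0)  # fixed seed so that runs are repeatable
parent1 = [1, 1, 1, 1, 1, 1]
parent2 = [0, 0, 0, 0, 0, 0]
child = single_point_crossover(parent1, parent2, rng)
mutant = bit_flip_mutation(child, rate=0.001, rng=rng)
print(child, mutant)
print(sim(confidences=[0.5, 0.8], support=0.4, alpha=1))
```

With α = 0 every confidence term becomes 1 and SIM reduces to the support of the rule; larger values of α give the confidence term more weight.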
  • 8. In the definition of SIM, α (α ≥ 0) is a confidence priority that represents how important the frequency of the pattern is, and Ci is a subsequence of C that represents a condition of the sequence, with i = 1 … n, where n is the number of conditions in C.
    The first term of the criterion evaluates whether the sub-patterns are infrequent, while the second term evaluates whether the pattern itself is frequent.
    3.4 SPT-GA Algorithm. The proposed SPT-GA algorithm is described in this section. Figure 4 presents the pseudocode of the SPT-GA algorithm.
    ALGORITHM
    Input:
      Initial samples size: N
      Maximum generations: G
      Threshold value: T
      Minimum fitness: minF
    Output:
      Malpractice user no lists
    Begin
    Step1: Initialize counter count = 0
    Step2: Initial population IN of size N
    Step3: For each chromosome i ∈ IN
      If S1 & S2 are given then
        Measure the fitness F(i, S1, S2, T)
      Else If S1 is given then
        Measure the fitness F(i, S1, T)
      Else If S2 is given then
  • 9. Measure the fitness F(i, S2, T)
      Else
        Measure the fitness F(i, T)
    Step4: Mutate and crossover P.
    Step5: IF (fitness ≥ minF)
      Select fittest rules from P
    Step6: Set count = count + 1
    Step7: IF (count > G) then
      S = P
      Stop
    Else
      Go to Step 3
    End
    After the dataset is encoded using the bitmap representation, the algorithm starts by selecting individuals for the initial population. The following steps are then repeated until the pre-specified maximum number of generations is reached. The fitness value is determined for each selected individual, given the rule antecedent or consequent. The rules in P whose fitness is greater than or equal to the minimum fitness are selected. Giving the antecedent or consequent is not mandatory, but it reduces the search time and extracts more of the desired rules.
    Existing chromosomes are used to generate new ones by applying the crossover and mutation operators. Chromosomes survive based on their fitness. In this way, the set of interesting rules is determined and the target is achieved.
    4. Experiment Results
    The results of some experiments on a telecommunication database are presented and analyzed as follows. The SPT-GA algorithm is written in the MATLAB programming language. The user can tune the population size (N), number of generations (G), confidence priority (α) and minimum fitness (minF).
    The telecommunication database was taken from a telecommunication company and has 1091 transactions and 60 country codes. The crossover probability used is 0.8, while the mutation probability is
  • 10. 0.001. The output of this experiment is a text file that contains the interesting rules representing the most suitable telecommunication sequences. For example, the rule in Figure 5 tells us that when country code 91 is called, 40% of callers will call country code 92 afterwards.
    The SPT-GA algorithm has four parameters that must be set by the user: number of generations (G), population size (N), confidence priority (α) and minimum fitness (minF). The first experiment sets N = 20, α = 1 and minF = 0 with G = [20...200]. The second experiment sets G = 20, α = 1 and minF = 0 with N = [20...200]. Both experiments were run on two days of calls with 1091 transactions and 60 country codes.
    Figure 6 shows the time, in seconds, spent by the SPT-GA algorithm to extract the best rules, while Figure 7 shows the average fitness of the final output. Both figures show the results as the number of generations and the population size increase.
    The tests showed that the running time stays roughly constant as the number of generations increases, whereas it grows as the population size increases. Comparing the two experiments, as shown in Figure 6, the GA takes less time when the number of generations is increased than when the population size is increased. The tests also showed that the best fitness stays roughly the same as the number of generations and the population size increase.
    From the experiments, it is observed that increasing the number of generations takes less time than increasing the population size. Either way, the GA does not take long; it is only a matter
  • 11. of seconds. In addition, increasing the number of generations and the population size does not guarantee a large improvement in average fitness.
    V. Discovering Structural Patterns In Telecommunications Data
    The SUBDUE system is a structural discovery tool that finds substructures in a graph-based representation of structural databases using the minimum description length (MDL) principle. SUBDUE discovers substructures that compress the original data and represent structural concepts in the data. Once a substructure is discovered, it is used to simplify the data by replacing instances of the substructure with a pointer to the newly discovered substructure. The discovered substructures allow abstraction over detailed structures in the original data. Iterating the substructure discovery and replacement process constructs a hierarchical description of the structural data in terms of the discovered substructures. This hierarchy provides varying levels of interpretation that can be accessed based on the specific goals of the data analysis.
    SUBDUE represents structural data as a labeled graph. Objects in the data map to vertices or small subgraphs in the graph, and relationships between objects map to directed or undirected edges in the graph. A substructure is a connected subgraph within the graphical representation. This graphical representation serves as input to the substructure discovery system. Figure 8 shows a geometric example of such an input graph. The objects in the figure become labeled vertices in the graph, and the relationships become labeled edges in the graph. The graphical representation of the substructure discovered by SUBDUE from this data is also shown in Figure 8. One of the four instances of the substructure is highlighted in the input graph. An instance of a substructure in an input graph is a set of vertices and edges from the input graph that match, graph theoretically, the graphical representation of the substructure.
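The labeled-graph representation and the replacement of substructure instances described above can be illustrated with a small sketch. It is not SUBDUE: the pattern is written by hand rather than discovered with the MDL principle, matching is limited to a "call" vertex with labeled attribute edges, and the graph layout, function names and sample calls are all assumptions made for the example.

```python
from collections import namedtuple

# vertices: id -> label; edges: list of (source id, target id, edge label)
Graph = namedtuple("Graph", ["vertices", "edges"])


def build_call_graph(calls):
    """One 'call' vertex per call, with labeled edges to its attribute vertices."""
    vertices, edges = {}, []
    for n, (call_type, period) in enumerate(calls):
        call, t, p = f"c{n}", f"c{n}_type", f"c{n}_period"
        vertices.update({call: "call", t: call_type, p: period})
        edges += [(call, t, "type"), (call, p, "period")]
    return Graph(vertices, edges)


def find_instances(graph, pattern):
    """Return (call vertex, attribute vertices) pairs that match the pattern.

    `pattern` maps an edge label to the required label of the target vertex.
    """
    attrs = {}
    for src, dst, label in graph.edges:
        attrs.setdefault(src, {})[label] = dst
    instances = []
    for v, vlabel in graph.vertices.items():
        if vlabel != "call":
            continue
        out = attrs.get(v, {})
        if all(lbl in out and graph.vertices[out[lbl]] == want
               for lbl, want in pattern.items()):
            instances.append((v, [out[lbl] for lbl in pattern]))
    return instances


def compress(graph, instances, name):
    """Replace each instance by a single vertex labeled `name`."""
    removed = {a for _, attached in instances for a in attached}
    roots = {root for root, _ in instances}
    vertices = {v: (name if v in roots else lbl)
                for v, lbl in graph.vertices.items() if v not in removed}
    edges = [e for e in graph.edges if e[1] not in removed]
    return Graph(vertices, edges)


# Made-up calls: (call type, time-of-day period).
calls = [("local", "midday"), ("long_distance", "evening"),
         ("local", "midday"), ("local", "evening")]
graph = build_call_graph(calls)
pattern = {"type": "local", "period": "midday"}
instances = find_instances(graph, pattern)
smaller = compress(graph, instances, "SUB_1")
print(len(instances), len(graph.vertices), len(smaller.vertices))  # 2 12 8
```

Replacing the two matching instances shrinks the graph from 12 vertices to 8. SUBDUE uses this kind of size reduction, measured as description length, to decide which substructures are worth keeping.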
  • 12. Figure 8: Example Substructure in Graph Form
    Experiment Results:
    In a study in which the SUBDUE system was run on the data in graph form, the six substructures output by the system were analyzed. Figure 9 shows two of these six substructures. The substructures in this figure show that the following patterns are common:
    1. Local calls originating between 11:00am and 2:30pm with 30 to 45 minute durations, and
    2. Long distance calls originating between 6:00pm and 10:00pm with 5 to 20 minute durations.
    Figure 9: Telecom pattern discovered by SUBDUE.
  • 13. VI. CONCLUSION
    A genetic algorithm is applied to find frequent sequences in a telecommunication database in order to help telecommunication companies identify country codes that are related to each other. Telecommunication companies can thus estimate which countries are called in a specific order and offer discounts on calls to these countries. The SPT-GA algorithm exploits the ability of evolutionary algorithms to discover the best rules in a short time with meaningful results.
    VII. REFERENCES
    1. Data Mining In Telecommunications; Gary M. Weiss
    2. Data Mining In Telecommunications And Studying Its Status In Iran Telecom Companies And Operator; Jamal Sophieh
    3. Data Mining And CRM In Telecommunications; D. Camilovic
    4. Genetic Algorithms; William H. Hsu
    5. A Fraud Detection Approach in Telecommunication using Cluster GA; V. Umayaparvathi and K. Iyakutti