DATA MINING PROJECT REPORT

Submitted by
SHYAM KUMAR S
MTHIN GOPINADH
AJITH JOHN ALIAS
RITO GEORGE CHERIAN

1 INTRODUCTION

1.1 ABOUT THE TOPIC

Data Mining is the process of discovering new correlations, patterns, and trends by digging into (mining) large amounts of data stored in warehouses, using artificial intelligence, statistical and mathematical techniques. Data mining can also be defined as the process of extracting knowledge hidden in large volumes of raw data, i.e. the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. Alternative names for data mining are Knowledge Discovery in Databases (KDD), knowledge extraction, data/pattern analysis, etc.

Data mining is the practice of sorting through large amounts of data and picking out relevant information. It is usually used by business intelligence organizations and financial analysts, but it is increasingly used in the sciences to extract information from the enormous data sets generated by modern experimental and observational methods. It has been described as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and "the science of extracting useful information from large data sets or databases".

1.2 ABOUT THE PROJECT

The project has been developed in our college in an effort to identify the most frequently visited sites, the sites from which the most voluminous downloading has taken place, and the sites that have been denied access when referred to by the users.
Our college uses the Squid proxy server, and our aim is to extract useful knowledge from one of the log files in it. After a combined scrutiny of the log files, the log named access.log was chosen as the database. Hence our project was to mine the contents of access.log.
Finally, the PERL programming language was used for manipulating the contents of the log file. PERL EXPRESS 2.5 was the platform used to develop the mining application. The log file content is a standard text file requiring extensive and quick string manipulation to retrieve the necessary contents. The programs were required to sort the mined contents in descending order of frequency of usage and size.

CHAPTER 2 REQUIREMENT ANALYSIS

2.1 INTRODUCTION

Requirement analysis is the process of gathering and interpreting facts, diagnosing problems and using the information to recommend improvements to the system. It is a problem-solving activity that requires intensive communication between the system users and the system developers. Requirement analysis, or study, is an important phase of any system development process.

The system is studied to the minutest detail and analyzed. The system analyst plays the role of an interrogator and delves deep into the working of the present system. The system is viewed as a whole and the inputs to the system are identified. The outputs from the organization are traced through the various processing steps that the inputs pass through in the organization.

A detailed study of these processes must be made using techniques like interviews, questionnaires, etc. The data collected from these sources must be scrutinized to arrive at a conclusion. The conclusion is an understanding of how the system functions; this system is called the existing system. Now the existing system is subjected to close study and the problem areas are identified. The designer then functions as a problem solver and tries to sort out the difficulties that the enterprise faces. The solutions are given as a proposal. The proposal is weighed against the existing system analytically and the best one is selected. The proposal is presented to the user for endorsement, reviewed on user request, and suitable changes are made. This loop ends as soon as the user is satisfied with the proposal.
2.2 PROPOSED SYSTEM

In order to make the programming strategy optimal, complete and least complex, a detailed understanding of data mining, related concepts, and associated algorithms is required. This is to be followed by effective implementation of the algorithm using the best possible alternative.

2.3 DATA MINING (KDD PROCESS)

The Knowledge Discovery from Data process draws on relevant prior knowledge and the goals of the application. It involves creating a large dataset, preprocessing of the data, filtering or cleaning, data transformation, and identifying dimensionality and useful features. It also involves classification, association, regression, clustering and summarization. Choosing the mining algorithm is the most important parameter of the process. The final stage includes pattern evaluation, which means visualization, transformation, removal of redundant patterns, etc., and use of the discovered knowledge.

DM technology and systems: data mining methods involve neural networks, evolutionary programming, memory-based reasoning, decision trees, genetic algorithms and nonlinear regression methods. This work also involves fuzzy logic, a superset of conventional Boolean logic that has been extended to handle the concept of partial truth: truth values between completely true and completely false.

The term data mining is often applied to the two separate processes of knowledge discovery and prediction. Knowledge discovery provides explicit information that has a readable form and can be understood by a user. Forecasting, or predictive modeling, provides predictions of future events and may be transparent and readable in some approaches (e.g. rule-based systems) and opaque in others such as neural networks. Moreover, some data mining systems such as neural networks are inherently geared towards prediction and pattern recognition rather than knowledge discovery.
Metadata, or data about a given data set, is often expressed in a condensed, mineable format, or one that facilitates the practice of data mining. Common examples include executive summaries and scientific abstracts.
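The preprocessing stages named above, cleaning followed by session identification, can be sketched roughly as grouping each client's time-stamped requests into sessions separated by idle gaps. This is an illustrative sketch only; the 30-minute timeout is an assumed parameter, not one taken from the report:

```python
SESSION_GAP = 30 * 60  # assumed session timeout in seconds (illustrative)

def sessionize(requests):
    """Group (timestamp, client_ip) pairs into per-client sessions:
    a new session starts when a client is idle longer than SESSION_GAP."""
    sessions = {}  # client_ip -> list of sessions, each a list of timestamps
    for ts, ip in sorted(requests):
        client = sessions.setdefault(ip, [])
        if client and ts - client[-1][-1] <= SESSION_GAP:
            client[-1].append(ts)   # continue the current session
        else:
            client.append([ts])     # start a new session
    return sessions

reqs = [(0, "a"), (100, "a"), (5000, "a"), (50, "b")]
print(sessionize(reqs))  # → {'a': [[0, 100], [5000]], 'b': [[50]]}
```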
The importance of collecting data that reflects your business or scientific activities in order to achieve competitive advantage is now widely recognized. Powerful systems for collecting data and managing it in large databases are in place in all large and mid-range companies.

[Figure 2.3.1: Process of web usage mining. Log files are preprocessed (data cleaning, session identification, data conversion); frequent itemset, sequence and subtree discovery are then performed under a minimum-support (minsup) threshold, followed by pattern analysis of the results.]

However, the bottleneck in turning this data into success is the difficulty of extracting knowledge about the system under study from the collected data. Decision support systems (DSS) are computerized tools developed to assist decision makers through the process of making decisions. They are inherently prescriptive, enhancing decision making in some way, and are closely related to the concept of rationality, which means the tendency to act in a reasonable way to make good decisions. Producing the key decisions for an organization involves the product or service, distribution of the product through different distribution channels, computation of the output at different times and places, prediction of output trends for an individual product or service within an estimated time frame, and finally the scheduling of production on the basis of demand, capacity and resources.

The main aim and objective of the work is to develop a system for dynamic decisions that depend on the product life cycle; graph analysis of individual characteristics has been done to give enhanced and advanced insight into the pattern of the product. The system has been reviewed in both local and global aspects.

2.4 WORKING OF DATA MINING

While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:

Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.

Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.

Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.

Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.

Data mining consists of five major elements:

• Extract, transform, and load transaction data onto the data warehouse system.
• Store and manage the data in a multidimensional database system.
• Provide data access to business analysts and information technology professionals.
• Analyze the data with application software.
• Present the data in a useful format, such as a graph or table.

• Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID): CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits, while CHAID segments using chi-square tests to create multi-way splits. CART typically requires less data preparation than CHAID.
• Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset. Sometimes called the k-nearest neighbor technique.
• Rule induction: The extraction of useful if-then rules from data based on statistical significance.
• Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.

2.5 DATA MINING ALGORITHMS

The data mining algorithm is the mechanism that creates mining models. To create a model, an algorithm first analyzes a set of data, looking for specific patterns and trends. The algorithm then uses the results of this analysis to define the parameters of the mining model.

The mining model that an algorithm creates can take various forms, including:

• A set of rules that describe how products are grouped together in a transaction.
• A decision tree that predicts whether a particular customer will buy a product.
• A mathematical model that forecasts sales.
• A set of clusters that describe how the cases in a dataset are related.
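The nearest neighbor method described above can be illustrated with a small sketch: classify a record by majority vote among the k most similar historical records. Euclidean distance is assumed here for similarity; any distance measure could be substituted:

```python
import math
from collections import Counter

def knn_classify(history, point, k=3):
    """Classify `point` by majority vote among the k records of
    `history` (feature-vector, label pairs) closest to it."""
    by_distance = sorted(history, key=lambda rec: math.dist(rec[0], point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# hypothetical historical dataset of labelled 2-D records
history = [((1, 1), "low"), ((2, 1), "low"), ((1, 2), "low"),
           ((8, 9), "high"), ((9, 8), "high")]
print(knn_classify(history, (1.5, 1.5)))  # → low
```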
Microsoft SQL Server 2005 Analysis Services (SSAS) provides several algorithms for use in your data mining solutions. These algorithms are a subset of all the algorithms that can be used for data mining. You can also use third-party algorithms that comply with the OLE DB for Data Mining specification. For more information about third-party algorithms, see Plugin Algorithms.

Analysis Services includes the following algorithm types:

• Classification algorithms predict one or more discrete variables, based on the other attributes in the dataset. An example of a classification algorithm is the Decision Trees Algorithm.
• Regression algorithms predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset. An example of a regression algorithm is the Time Series Algorithm.
• Segmentation algorithms divide data into groups, or clusters, of items that have similar properties. An example of a segmentation algorithm is the Clustering Algorithm.
• Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is for creating association rules, which can be used in a market basket analysis.
• Sequence analysis algorithms summarize frequent sequences or episodes in data, such as a Web path flow. An example of a sequence analysis algorithm is the Sequence Clustering Algorithm.

2.6 SOFTWARE REQUIREMENTS

OPERATING SYSTEM: WINDOWS XP SP2
PERL COMPILER: ACTIVE PERL
SCRIPT EDITOR: PERL EXPRESS
SERVER SOFTWARE: IIS SERVER
2.7 FUZZY LOGIC

Fuzzy logic is a form of multi-valued logic derived from fuzzy set theory to deal with reasoning that is approximate rather than precise. Just as in fuzzy set theory the set membership values can range (inclusively) between 0 and 1, in fuzzy logic the degree of truth of a statement can range between 0 and 1 and is not constrained to the two truth values {true, false} as in classic predicate logic. And when linguistic variables are used, these degrees may be managed by specific functions, as discussed below.

Both fuzzy degrees of truth and probabilities range between 0 and 1 and hence may seem similar at first. However, they are distinct conceptually; fuzzy truth represents membership in vaguely defined sets, not the likelihood of some event or condition as in probability theory. For example, if a 100-ml glass contains 30 ml of water, then, for two fuzzy sets, Empty and Full, one might define the glass as being 0.7 empty and 0.3 full.

Note that the concept of emptiness is subjective and thus depends on the observer or designer. Another designer might equally well design a set membership function where the glass would be considered full for all values down to 50 ml. A probabilistic setting would first define a scalar variable for the fullness of the glass, and second, conditional distributions describing the probability that someone would call the glass full given a specific fullness level. Note that the conditioning can be achieved by having a specific observer that randomly selects the label for the glass, a distribution over deterministic observers, or both. While fuzzy logic avoids talking about randomness in this context, this simplification at the same time obscures what is exactly meant by the statement "the glass is 0.3 full".

2.7.1 APPLYING FUZZY TRUTH VALUES

A basic application might characterize sub-ranges of a continuous variable.
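Such sub-range membership functions can be sketched as piecewise-linear maps from a measurement to a truth value in the 0-to-1 range. The breakpoint temperatures below are illustrative assumptions, not values from any particular controller:

```python
def cold(t):
    """Fully true at or below 10, fully false at or above 20, linear between."""
    return max(0.0, min(1.0, (20 - t) / 10))

def hot(t):
    """Fully false at or below 25, fully true at or above 35, linear between."""
    return max(0.0, min(1.0, (t - 25) / 10))

def warm(t):
    """Peaks between the cold and hot ranges."""
    return max(0.0, 1.0 - cold(t) - hot(t))

# one reading maps to three truth values, one per membership function
for t in (5, 18, 30):
    print(t, cold(t), warm(t), hot(t))
```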
For instance, a temperature measurement for anti-lock brakes might have several separate membership functions defining particular temperature ranges needed to control the brakes properly. Each function maps the same temperature value to a truth value in the 0 to 1 range. These truth values can then be used to determine how the brakes should be controlled.

In this image, cold, warm, and hot are functions mapping a temperature scale. A point on that scale has three "truth values", one for each of the three functions. The vertical line in the image represents a particular temperature that the three arrows (truth values) gauge. Since the red arrow
points to zero, this temperature may be interpreted as "not hot". The orange arrow (pointing at 0.2) may describe it as "slightly warm" and the blue arrow (pointing at 0.8) "fairly cold".

2.7.2 FUZZY LINGUISTIC VARIABLES

While variables in mathematics usually take numerical values, in fuzzy logic applications non-numeric linguistic variables are often used to facilitate the expression of rules and facts. A linguistic variable such as age may have a value such as young or its opposite, old. However, the great utility of linguistic variables is that they can be modified via linguistic operations on the primary terms. For instance, if young is associated with the value 0.7, then very young is automatically deduced as having the value 0.7 * 0.7 = 0.49, and not very young gets the value (1 - 0.49), i.e. 0.51. In this example the operator very(X) was defined as X * X; in general, however, these operators may be uniformly but flexibly defined to fit the application, resulting in a great deal of power for the expression of both rules and fuzzy facts.

CHAPTER 3 SYSTEM DESIGN

System design is the solution to the creation of a new system. This phase is composed of several steps. It focuses on the detailed implementation of the feasible system, and its emphasis is on translating design specifications into performance specifications. System design has two phases of development: logical and physical design.

During the logical design phase the analyst describes inputs (sources), outputs (destinations), databases (data stores) and procedures (data flows), all in a format that meets the user's requirements. The analyst also specifies the user needs at a level that virtually determines the information flow into and
out of the system and the data resources. Here the logical design is done through data flow diagrams and database design.

The logical design is followed by physical design, or coding. Physical design produces the working system by defining the design specifications, which tell the programmers exactly what the candidate system must do. The programmers write the necessary programs that accept input from the user, perform the necessary processing on the accepted data and produce the required report on hard copy or display it on the screen.

3.1 DATABASE DESIGN

The data mining process involves the manipulation of large data sets; hence, a large database is a key requirement of the mining operation. An ordered set of information is then to be extracted from this database. The overall objective in the development of database technology has been to treat data as an organizational resource and as an integrated whole. A DBMS allows data to be protected and organized separately from other resources.

A database is an integrated collection of data. The most significant form of data as seen by the programmers is data as stored on direct access storage devices. This is the difference between logical and physical data.

Database files are the key source of information into the system. Database design is the process of designing these files, which should be properly planned for the collection, accumulation, editing and retrieval of the required information. The organization of data in a database aims to achieve three major objectives:

• Data integration.
• Data integrity.
• Data independence.
A large data set is difficult to parse, and it is difficult to interpret the knowledge contained in it. Since the database used in this project is the log file of a proxy server called Squid, a detailed study of Squid-style transaction logging is also required.

3.2 PROXY SERVER

A proxy server is a server (a computer system or an application program) which services the requests of its clients by forwarding requests to other servers. A client connects to the proxy server, requesting some service, such as a file, connection, web page, or other resource available from a different server. The proxy server provides the resource by connecting to the specified server and requesting the service on behalf of the client. A proxy server may optionally alter the client's request or the server's response, and sometimes it may serve the request without contacting the specified server. In this case, it caches the first request to the remote server, so it can save the information for later and make everything as fast as possible.

A proxy server that passes all requests and replies unmodified is usually called a gateway or sometimes a tunneling proxy. A proxy server can be placed on the user's local computer or at specific key points between the user and the destination servers or the Internet.

• Caching proxy server: A proxy server can service requests without contacting the specified server by retrieving content saved from a previous request made by the same client or even other clients. This is called caching.

• Web proxy: A proxy that focuses on WWW traffic is called a "web proxy". The most common use of a web proxy is to serve as a web cache. Most proxy programs (e.g. Squid, NetCache) provide a means to deny access to certain URLs in a blacklist, thus providing content filtering.

• Content filtering web proxy: A content filtering web proxy server provides administrative control over the content that may be relayed through the proxy.
It is commonly used in commercial and non-commercial organizations (especially schools) to ensure that Internet usage conforms to an acceptable use policy.

• Anonymizing proxy server:
An anonymous proxy server (sometimes called a web proxy) generally attempts to anonymize web surfing. These can easily be overridden by site administrators, and thus rendered useless in some cases. There are different varieties of anonymizers.

• Hostile proxy: Proxies can also be installed by online criminals in order to eavesdrop upon the data flow between the client machine and the web. All accessed pages, as well as all forms submitted, can be captured and analyzed by the proxy operator.

3.3 THE SQUID PROXY SERVER

Squid is a caching proxy for the Web supporting HTTP, HTTPS, FTP, and more. It reduces bandwidth and improves response times by caching and reusing frequently-requested web pages. Squid has extensive access controls and makes a great server accelerator. It runs on Unix and Windows and is licensed under the GNU GPL. Squid is used by hundreds of Internet providers world-wide to provide their users with the best possible web access.

Squid optimizes the data flow between client and server to improve performance and caches frequently-used content to save bandwidth. Squid can also route content requests to servers in a wide variety of ways to build cache server hierarchies which optimize network throughput.

Thousands of web sites around the Internet use Squid to drastically increase their content delivery. Squid can reduce your server load and improve delivery speeds to clients. Squid can also be used to deliver content from around the world, copying only the content being used rather than inefficiently copying everything. Finally, Squid's advanced content routing configuration allows you to build content clusters to route and load balance requests via a variety of web servers.

Squid is a fully-featured HTTP/1.0 proxy which is almost HTTP/1.1 compliant. Squid offers a rich access control, authorization and logging environment to develop web proxy and content serving applications.
Squid is one of the projects which grew out of the initial content distribution and caching work in the mid-90s. It has grown to include extra features such as powerful access control, authorization, logging, content distribution/replication, traffic management and shaping, and more. It has many, many workarounds, new and old, to deal with incomplete and incorrect HTTP implementations.
Squid allows Internet providers to save on their bandwidth through content caching. Cached content means data is served locally, and users will see this through faster download speeds for frequently-used content.

A well-tuned proxy server (even without caching!) can improve user speeds purely by optimizing TCP flows. It is easy to tune servers to deal with the wide variety of latencies found on the internet, something that desktop environments just aren't tuned for.

Squid allows ISPs to avoid needing to spend large amounts of money on upgrading core equipment and transit links to cope with ever-growing content demand. It also allows ISPs to prioritize and control certain web content types where dictated by technical or economic reasons.

3.3.1 SQUID-STYLE TRANSACTION LOGGING

Transaction logs allow administrators to view the traffic that has passed through the Content Engine. Typical fields in the transaction log are the date and time when a request was made, the URL that was requested, whether it was a cache hit or a cache miss, the type of request, the number of bytes transferred, and the source IP.

High-performance caching presents additional challenges beyond how to quickly retrieve objects from storage, memory, or the web. Administrators of caches are often interested in what requests have been made of the cache and what the results of these requests were. This information is then used for such applications as:

• Problem identification and solving
• Load monitoring
• Billing
• Statistical analysis
• Security problems
• Cost analysis and provisioning
The Squid log file format is:

time elapsed remotehost code/status bytes method URL rfc931 peerstatus/peerhost type

A Squid logformat example looks like this:

1012429341.115 100 TCP_REFRESH_MISS/304 1100 GET - DIRECT/ -

Squid logs are a valuable source of information about cache workloads and performance. The logs record not only access information but also system configuration errors and resource consumption, such as memory and disk space.
Field: Description

Time: UNIX time stamp as Coordinated Universal Time (UTC) seconds with a millisecond resolution.

Elapsed: Length of time in milliseconds that the cache was busy with the transaction. Note: entries are logged after the reply has been sent, not during the lifetime of the transaction.

Remote Host: IP address of the requesting instance.

Code/Status: Two entries separated by a slash. The first entry contains information on the result of the transaction: the kind of request, how it was satisfied, or in what way it failed. The second entry contains the HTTP result code.

Bytes: Amount of data delivered to the client. This does not constitute the net object size, because headers are also counted. Also, failed requests may deliver an error page, the size of which is also logged here.

Method: Request method used to obtain an object, for example, GET.

URL: URL requested.

Rfc931: Contains the authentication server's identification or lookup names of the requesting client. This field will always be a "-" (dash).

Peerstatus/Peerhost: Two entries separated by a slash. The first entry represents a code that explains how the request was handled, for example, by forwarding it to a peer, or returning the request to the source. The second entry contains the name of the host from which the object was requested. This host may be the origin site, a parent, or any other peer. Also note that the host name may be numerical.

Type: Content type of the object as seen in the HTTP reply header. In the ACNS 4.1 software, this field will always contain a "-" (dash).

Table: Squid-Style Format
3.3.2 SQUID LOG FILES

The logs are a valuable source of information about Squid workloads and performance. They record not only access information, but also system configuration errors and resource consumption (e.g., memory, disk space). There are several log files maintained by Squid. Some have to be explicitly activated at compile time; others can safely be deactivated at run time.

There are a few basic points common to all log files. The time stamps logged into the log files are usually UTC seconds unless stated otherwise. The initial time stamp usually contains a millisecond extension.

SQUID.OUT

If we run Squid from the RunCache script, a file squid.out contains the Squid startup times and also all fatal errors, e.g. as produced by an assert() failure. If we are not using RunCache, we will not see such a file.

CACHE.LOG

The cache.log file contains the debug and error messages that Squid generates. If we start Squid using the default RunCache script, or start it with the -s command line option, a copy of certain messages will go into the syslog facilities. It is a matter of personal preference whether to use a separate file for the Squid log data.

From the area of automatic log file analysis, the cache.log file does not have much to offer. We will usually look into this file for automated error reports, when programming Squid, testing new features, or searching for the reasons for a perceived misbehavior, etc.

USERAGENT.LOG

The user agent log file is only maintained if:

1. We configured the compile-time --enable-useragent-log option, and
2. We pointed the useragent_log configuration option to a file.

From the user agent log file we are able to find out about the distribution of browsers among our clients. Using this option in conjunction with a loaded production Squid might not be the best of all ideas.

STORE.LOG

The store.log file covers the objects currently kept on disk, as well as removed ones. As a kind of transaction log, it is usually used for debugging purposes. A definitive statement as to whether an object resides on your disks is only possible after analyzing the complete log file, since the release (deletion) of an object may be logged at a later time than the swap out (save to disk).

The store.log file may be of interest to log file analysis which looks into the objects on your disks and the time they spend there, or how many times a hot object was accessed; the latter may be covered by another log file, too. With knowledge of the cache_dir configuration option, this log file allows for a URL-to-filename mapping without recursing your cache disks. However, the Squid developers recommend treating store.log primarily as a debug file, and so should you, unless you know what you are doing.
HIERARCHY.LOG

This log file exists for Squid-1.0 only. The format is:

[date] URL peerstatus peerhost

ACCESS.LOG

Most log file analysis programs are based on the entries in access.log. Currently, two file formats are possible for this log file, depending on your configuration of the emulate_httpd_log option. By default, Squid will log in its native log file format. If the option is enabled, Squid will log in the common log file format as defined by the CERN web daemon.

The Common Logfile Format is used by numerous HTTP servers. This format consists of the following seven fields:

remotehost rfc931 authuser [date] "method URL" status bytes

It is parsable by a variety of tools. The common format contains different information than the native log file format; for instance, the HTTP version is logged in the common format but not in the native format. On the other hand, the native format contains more of the information useful to an administrator interested in cache evaluation.

The log contents include the site name, the IP address of the requesting instance, the date and time in Unix time format, the bytes transferred, the request method and other such features. Log files are usually large in size, large enough to be mined. However, the values of an entire line of input change with a change in the header.

access.log is the Squid log that has been made use of in this project. The log file was in the form of a text file, as shown below:
  21. 21. [Figure: screenshot of a raw access.log excerpt in Squid's native format; each entry records a timestamp, elapsed time, client IP, result code (e.g. TCP_MISS/200), byte count, request method, URL, ident, peer information, and content type. The scan is not reproducible here.]

Figure : Access.log used as database

3.3.3 SQUID RESULT CODES

The TCP_ codes refer to requests on the HTTP port (usually 3128). The UDP_ codes refer to requests on the ICP port (usually 3130). If ICP logging was disabled using the log_icp_queries option, no ICP replies will be logged.

TCP_HIT
  22. 22. A valid copy of the requested object was in the cache.

TCP_MISS
The requested object was not in the cache.

TCP_REFRESH_HIT
The requested object was cached but STALE. The IMS query for the object resulted in "304 Not Modified".

TCP_REF_FAIL_HIT
The requested object was cached but STALE. The IMS query failed and the stale object was delivered.

TCP_REFRESH_MISS
The requested object was cached but STALE. The IMS query returned the new content.

TCP_CLIENT_REFRESH_MISS
The client issued a "no-cache" pragma, or some analogous cache control command, along with the request. Thus, the cache has to re-fetch the object.
  23. 23. TCP_IMS_HIT
The client issued an IMS request for an object which was in the cache and fresh.

TCP_SWAPFAIL_MISS
The object was believed to be in the cache, but could not be accessed.

TCP_NEGATIVE_HIT
Request for a negatively cached object, e.g. "404 Not Found", which the cache believes to be inaccessible. Also refer to the explanations for negative_ttl in your squid.conf file.

TCP_MEM_HIT
A valid copy of the requested object was in the cache, and it was in memory, thus avoiding disk accesses.

TCP_DENIED
Access was denied for this request.

TCP_OFFLINE_HIT
The requested object was retrieved from the cache during offline mode. The offline mode never validates any object.

UDP_HIT
A valid copy of the requested object was in the cache.

UDP_MISS
The requested object is not in this cache.

UDP_DENIED
Access was denied for this request.

UDP_INVALID
An invalid request was received.

UDP_MISS_NOFETCH
  24. 24. During "-Y" startup, or during frequent failures, a cache in hit-only mode will return either UDP_HIT or this code. Neighbors will thus only fetch hits.

NONE
Seen with errors and cache manager requests.

3.4 HTTP RESULT CODES

These are taken from RFC 2616 and verified for Squid. Squid-2 uses almost all codes except 307 (Temporary Redirect), 416 (Requested Range Not Satisfiable), and 417 (Expectation Failed). Extra codes include 0 for a result code being unavailable, and 600 to signal an invalid header, a proxy error. Also, some definitions were added as per RFC 2518. Yes, there are really two entries for status code 424; compare with http_status in src/enums.h:

000 USED MOSTLY WITH UDP TRAFFIC
100 CONTINUE
101 SWITCHING PROTOCOLS
102 PROCESSING
200 OK
201 CREATED
202 ACCEPTED
203 NON-AUTHORITATIVE INFORMATION
204 NO CONTENT
205 RESET CONTENT
206 PARTIAL CONTENT
207 MULTI-STATUS
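In the access.log, the Squid result code and the HTTP status are combined in one field separated by a slash, e.g. TCP_MISS/200. A small Python sketch that splits the two and tallies requests into coarse hit/miss/denied buckets, based on the code families listed above (the bucketing itself is our own illustrative choice, not the project's):

```python
from collections import Counter

# Illustrative: tally result-code fields like "TCP_MISS/200" into coarse buckets.
def tally_actions(action_fields):
    counts = Counter()
    for field in action_fields:
        action = field.split("/", 1)[0]   # "TCP_MISS/200" -> "TCP_MISS"
        if "DENIED" in action:            # TCP_DENIED, UDP_DENIED
            counts["denied"] += 1
        elif "HIT" in action:             # TCP_HIT, TCP_MEM_HIT, UDP_HIT, ...
            counts["hit"] += 1
        elif "MISS" in action:            # TCP_MISS, TCP_REFRESH_MISS, ...
            counts["miss"] += 1
        else:                             # NONE and anything unexpected
            counts["other"] += 1
    return counts
```

A tally of this kind is the basis for the denied-access report shown later in the output section.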
  27. 27. PROPPATCH CHANGE PROPERTIES OF AN OBJECT
COPY CREATE A DUPLICATE OF SRC IN DST
MOVE ATOMICALLY MOVE SRC TO DST
LOCK LOCK AN OBJECT AGAINST MODIFICATIONS
UNLOCK UNLOCK AN OBJECT

TABLE 3.4.2 : HTTP request methods

CHAPTER 4 CODING

4.1 FEATURES OF LANGUAGE (PERL)

Practical Extraction and Reporting Language is an interpreted language optimized for scanning arbitrary text files, extracting information from those text files, and printing reports based on that information. It is also a good language for many system management tasks.
•The language is intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant, minimal).
•It combines (in the author's opinion, anyway) some of the best features of C, sed, awk, and sh, so people familiar with those languages should have little difficulty with it. (Language historians will also note some vestiges of Pascal and even BASIC-PLUS.)
•Unlike most UNIX utilities, Perl does not arbitrarily limit the size of our data: if we have the memory, Perl can slurp in a whole file as a single string, and recursion is of unlimited depth.
•The hash tables used by associative arrays grow as necessary to prevent degraded performance. Perl uses sophisticated pattern matching techniques to scan large amounts of data very quickly.
•Although optimized for scanning text, Perl can also deal with binary data, and can make dbm files look like associative arrays (where dbm is available). Setuid Perl scripts are safer than C programs through a dataflow tracing mechanism which prevents many careless security holes.
  28. 28. •The overall structure of Perl derives broadly from C. Perl is procedural in nature, with variables, expressions, assignment statements, brace-delimited code blocks, control structures, and subroutines.
•Perl also takes features from shell programming. All variables are marked with leading sigils, which unambiguously identify the data type (scalar, array, hash, etc.) of the variable in context. Importantly, sigils allow variables to be interpolated directly into strings.
•Perl has many built-in functions which provide tools often used in shell programming (though many of these tools are implemented by programs external to the shell), like sorting and calling on system facilities.
•Perl takes lists from Lisp, associative arrays (hashes) from AWK, and regular expressions from sed. These simplify and facilitate many parsing, text handling, and data management tasks.
•In Perl 5, features were added that support complex data structures, first-class functions (i.e., closures as values), and an object-oriented programming model. These include references, packages, class-based method dispatch, and lexically scoped variables, along with compiler directives.
•All versions of Perl do automatic data typing and memory management. The interpreter knows the type and storage requirements of every data object in the program; it allocates and frees storage for them as necessary using reference counting (so it cannot deallocate circular data structures without manual intervention). Legal type conversions, for example from number to string, are done automatically at run time; illegal type conversions are fatal errors.
•Perl has a context-sensitive grammar which can be affected by code executed during an intermittent run-time phase. Therefore Perl cannot be parsed by a straight Lex/Yacc lexer/parser combination. Instead, the interpreter implements its own lexer, which coordinates with a modified GNU bison parser to resolve ambiguities in the language.
•The execution of a Perl program divides broadly into two phases: compile time and run time. At compile time, the interpreter parses the program text into a syntax tree. At run time, it executes the program by walking the tree.
  29. 29. 4.2 PERL CODE FOR MINING

[Figure: screenshot of the Perl mining program; the listing is not reproducible from the scan.]

FIGURE 4.2.1 : PERL program for mining

The Perl code to mine access.log makes use of the split() construct, which is required to split a line of text in the log file. The extracted site name is pushed into an array for comparison purposes. After the required comparison to determine the number of times that a site has been repeated, both the site and its corresponding count are inserted into a hash. The hash is then used to sort the site names in descending order of their counts, and the count and the corresponding site name are displayed as the output.

4.3 DISPLAYED OUTPUT
  30. 30. [Figure: screenshot of program output listing the requested site names.]

FIGURE 4.2.2 : Visited sites

This is the output of the program in Figure 4.2.1. It displays the sites that have been requested and visited, and even those that have been denied access by the proxy server. Hence, the log records all the transactions that have been successful and those that have failed. The run reported:

TOTAL SITES VISITED : 5238

  31. 31. [Figure: screenshot of the sites and their request counts, under the heading "SITES SORTED IN ORDER OF FREQUENCY OF USAGE", in descending order of count.]

Figure 4.2.3 : Sites sorted in frequency of usage

[Figure: screenshot of the sites and the bytes transferred for each, under the headings "BYTES DOWNLOADED" and "SITE NAME", in descending order of bytes downloaded.]

  32. 32. Figure 4.2.4 : Sites sorted in terms of bytes downloaded

[Figure: screenshot of the number of sites that were denied access, with the corresponding TCP_DENIED/403 entries, under the heading "ACCESS DENIED SITES".]

Figure 4.2.5 : Sites that were denied access
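The pipeline behind these outputs (Section 4.2: split each line, tally site occurrences in a hash, sort in descending order of count) can be sketched in Python, since the original Perl listing survives only as a screenshot. The assumption that the URL is the 7th whitespace-separated field matches Squid's native log format:

```python
from collections import Counter

# Python sketch of the Figure 4.2.1 pipeline (the original is Perl):
# split each access.log line, count occurrences of each site, and
# return the sites in descending order of frequency.
def site_frequencies(log_lines):
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) >= 7:
            counts[fields[6]] += 1   # 7th field is the requested URL
    return counts.most_common()      # already sorted by descending count
```

The same split yields the transfer size (5th field) for the bytes-downloaded report of Figure 4.2.4, and filtering on a TCP_DENIED result code (4th field) gives the denied-access report of Figure 4.2.5.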
  33. 33. CHAPTER 5 TESTING

5.1 SYSTEM TESTING

Testing is a set of activities that can be planned and conducted systematically. Testing begins at the module level and works towards the integration of the entire computer-based system. Nothing is complete without testing, as it is vital to the success of the system.

Testing Objectives: There are several rules that can serve as testing objectives:
Testing is a process of executing a program with the intent of finding an error.
A good test case is one that has a high probability of finding an undiscovered error.
A successful test is one that uncovers an undiscovered error.

If testing is conducted successfully according to the objectives stated above, it will uncover errors in the software. Testing also demonstrates that the software functions appear to be working according to the specification, and that performance requirements appear to have been met.

There are three ways to test a program:
•For correctness
•For implementation efficiency
•For computational complexity

Tests for correctness are supposed to verify that a program does exactly what it was designed to do. This is much more difficult than it may at first appear, especially for large programs.

Tests for implementation efficiency attempt to find ways to make a correct program faster or use less storage. It is a code-refining process, which reexamines the implementation phase of algorithm development.

Tests for computational complexity amount to an experimental analysis of the complexity of an algorithm, or an experimental comparison of two or more algorithms which solve the same problem.

Testing Correctness
  34. 34. The following ideas should be a part of any testing plan:
•Preventive measures
•Spot checks
•Testing all parts of the program
•Test data
•Looking for trouble
•Time for testing
•Retesting

The data is entered in all forms separately, and whenever an error occurred, it was corrected immediately. A quality team deputed by the management verified all the necessary documents and tested the software while entering the data at all levels. The entire testing process can be divided into three phases:
Unit testing
Integration testing
Final/system testing

5.1.1 UNIT TESTING

As this system was partially a GUI-based Windows application, the following were tested in this phase:
Tab order
Reverse tab order
Field length
Front-end validations

In our system, unit testing has been handled successfully. Test data was given to each and every module in all respects, and the desired output was obtained. Each module has been tested and found to be working properly.
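As a concrete illustration of the unit-level checks described above, a test for a site-extraction helper might look like the following; both extract_site and the sample line are hypothetical, not taken from the project's source:

```python
# Hypothetical helper: pull the URL (7th field) out of a native-format line.
def extract_site(log_line):
    return log_line.split()[6]

# Unit test in the spirit of Section 5.1.1: feed the module known input
# and compare its output with the expected value.
def test_extract_site():
    line = ("1204073887.231 8219 10.0.0.1 TCP_MISS/200 10286 GET "
            "http://example.com/ - DIRECT/1.2.3.4 text/html")
    assert extract_site(line) == "http://example.com/"

test_extract_site()
```

Each module of the mining script can be exercised the same way before the modules are combined for integration testing.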
  35. 35. 5.1.2 INTEGRATION TESTING

Test data should be prepared carefully, since the data alone determines the efficiency and accuracy of the system. Artificial data are prepared solely for testing. Every program validates the input data.

5.1.3 VALIDATION TESTING

In this phase, all the code modules were tested individually, one after the other. The following were tested in all the modules:
Loop testing
Boundary value analysis
Equivalence partitioning testing

In our case all the modules were combined and given the test data. The combined module works successfully without any side effects on other programs. Everything was found to be working fine.

5.1.4 OUTPUT TESTING

This is the final step in testing. In this step the entire system was tested as a whole, with all forms, code, modules and class modules. This form of testing is popularly known as black box testing or system testing. Black box testing methods focus on the functional requirements of the software; that is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing attempts to find errors in the following categories: incorrect or missing functions, interface errors, errors in data structures or external database access, performance errors, and initialization and termination errors.

CHAPTER 6 CONCLUSION

The project report entitled "DATAMINING USING FUZZY LOGIC" has come to its final stage. The system has been developed with much care, so that it is free of errors and at the same time efficient and less time consuming. The important thing is that the system is robust. We have tried our level best to complete the project with all its required features.
  36. 36. However, due to time constraints, the fuzzy implementation over the mined data has not been possible. Since the queries related to mining require the proper retrieval of data, the actual count is preferred over applying fuzziness to the count.

APPENDICES

OVERVIEW OF PERL EXPRESS 2.5

PERL EXPRESS 2.5 is a free integrated development environment (IDE) for Perl with multiple tools for writing and debugging scripts. It features multiple CGI scripts for editing, running, and debugging; multiple input files; full server simulation; queries created from an internal Web browser or query editor; testing of MySQL and MS Access scripts; interactive I/O; a directory window; a code library; and code templates.

Perl Express allows us to set the environment variables used for running and debugging a script. It has a customizable code editor with syntax highlighting, unlimited text size, printing, line numbering, bookmarks, column selection, a search-and-replace engine, and multilevel undo/redo operations. Version 2.5 adds a command line and bug fixes.

RESUME

The developed system is flexible and changes can be made easily. The system is developed with an insight into the necessary modifications that may be required in the future. Hence the system can be maintained successfully without much rework.

One of the main future enhancements of our system is to include fuzzy logic, which is a form of multi-valued logic derived from fuzzy set theory to deal with reasoning that is approximate rather than precise.

REFERENCES

1. Frequent Pattern Mining in Web Log Data - Renata Ivancsy, Istvan Vajk
2. Squid-Style Transaction Logging (log formats) -
3. Mining Interesting Knowledge from Weblogs: A Survey - Federico Michele Facca, Pier Luca Lanzi
4.
5.
6.