  • The topic of my Ph. D. proposal is “ DESIGN AND IMPLEMENTATION OF A WEB MINING RESEARCH SUPPORT SYSTEM ”. This research was partially supported by the US National Science Foundation, CISE/IIS-Digital Science and Technology
  • I will first give a brief introduction to the web mining research support system, then describe its four components: web information retrieval, web information extraction, generalization, and simulation and validation. Finally, I will present conclusions and a time plan.
  • The evolution of the World Wide Web has brought us enormous and ever growing amounts of data and information. Some valuable web data sources include on-line databases, web documents and archives. With the abundant data provided by the web, it has become an important resource for research.
  • One example is the Open Source Software (OSS) study. The objective of this project is to study the OSS community, including projects and developers, in order to understand how projects develop and how developers behave. Several web sites host OSS projects. With approximately 70,000 projects, 90,000 developers, and 700,000 registered users, SourceForge.net, sponsored by VA Software, is the largest OSS development and collaboration site. This site provides highly detailed information about the projects and the developers, including project characteristics, most active projects, and "top ranked" developers. Web data mining is necessary to collect information from web sites of this kind.

  • Researchers can retrieve web data by browsing and by keyword searching. However, both approaches have limitations. Browsing is difficult because a web page contains many links to follow. Keyword searching returns a large amount of irrelevant data. On the other hand, traditional data extraction and mining techniques cannot be applied directly to the web because it is semi-structured, heterogeneous, and dynamic. Thus, designing and implementing a web data mining research support system has become a challenge for researchers who want to use information from the web. Yao designed a web-based IR system to help find research papers; it includes functions such as citation analysis and text analysis. Tang proposed a framework to help individual researchers collect data and carry out experiments. Both are still ongoing projects.

  • We propose a web mining research support system to help researchers use web resources and discover knowledge from web data. The goal of this system is to provide a general solution which researchers can follow to utilize web resources in their research. This system should include functions to identify available web resources, extract content from web documents, perform data mining on web data, and analyze as well as validate data. This system will be tested on the OSS study.

  • This web mining research support system consists of four components. The first component is a web information retrieval tool. It will be designed and implemented to find proper web resources according to research needs. This includes identifying the availability, relevance, and importance of web sites. The second component is a web information extraction tool that extracts specific fragments from a web document. Then we will conduct data mining on the extracted web data. This task is called generalization, and it also includes data preprocessing. Finally, based on the web data, we can build models to simulate and validate web information. The proposed system will be tested on the Open Source Software study. I will discuss each component in the rest of this talk.
  • During my discussion of web information retrieval, I will first talk about current information retrieval tools and analyze their limitations. Then web page classification techniques will be introduced. A classification search tool will be proposed and an example will be provided.
  • Providing an efficient and effective web information retrieval tool is important in the web mining research system. The two major kinds of IR tools are directories (Yahoo, Netscape, etc.) and search engines (Lycos, Google, etc.). Directories are subject lists created according to specific indexing criteria. Search engines allow users to query web content by keywords. Current IR systems have several difficulties during web searching. Low precision (i.e., too many irrelevant documents) and low recall (i.e., too little of the web is covered by well-categorized directories) are the two main problems \cite{chekuri96web}. Furthermore, current directories and search engines are designed for general use, so the results they return are not well categorized for research needs. We propose to develop a web research search tool which combines directories and search engines to provide a classified interface to users. With this search tool, users submit a query based on their research needs, and the tool returns automatically categorized directories related to the query.

  • Web pages can be classified either manually or automatically. Manual classification is more accurate than automatic classification. Some commercial IR tools, e.g. GNN, Infomine, and Yahoo, use manual classification. However, with the enormous increase in the size of the web, manual classification becomes impossible. Currently, one challenge in IR is to implement automatic classification and improve its accuracy. Several techniques are used to perform automatic classification. Nearest-neighbor (NN) classifiers assign similar documents the same class label. Given a test document, we use it as a query and let the IR system fetch the k training documents most similar to it. The class that occurs most often among these k training documents is reported as the class of the test document. Feature selection improves efficiency by eliminating features that correlate weakly with the classes. For example, if a feature appears in every class, it should be removed. A Bayesian classifier assigns a class to a test document by using conditional probabilities estimated from the training documents.

  • We will design and implement a web information retrieval system which takes a query as input and outputs hierarchical directories. The categories are predefined manually by users. Users specify their queries and categories of interest through an interface. A web crawler will retrieve web documents matching the user's queries. Existing retrieval tools may be used to help with searching. The search results will be classified automatically by a classification algorithm. Finally, we will use XML to create a directory page and return it to users. XML is used because it facilitates further searching within those categories.
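The nearest-neighbor vote described above can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: the bag-of-words representation, the cosine similarity measure, the function names, and the four training documents are all assumptions made for the example.

```python
import math
from collections import Counter

def bag(text):
    """Represent a document as a bag of lower-cased words (an assumed, simple model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(test_doc, training, k=3):
    """Use the test document as a query, fetch the k most similar
    training documents, and report the most frequent class among them."""
    query = bag(test_doc)
    ranked = sorted(training, key=lambda item: cosine(query, bag(item[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled training documents.
training = [
    ("open source project hosting cvs", "software"),
    ("bug tracker patches cvs release", "software"),
    ("football league match score", "sports"),
    ("tennis match final score", "sports"),
]
label = knn_classify("cvs patches for open source release", training, k=3)
```

A real classifier would weight terms (for example with TF-IDF) and apply feature selection first; the sketch only shows the majority vote among the k nearest training documents.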
  • This is an example of expected returned output by our IR tool. Those red words are categories and subcategories predefined by users.
  • Web IE is the second task of the web mining research support system. I will first briefly introduce a commonly used IE tool, the wrapper, and discuss wrapper generation methods and tools. Then I will present our previous IE work in the OSS project. Finally, a proposed hybrid IE tool will be discussed.
  • Because web data are semi-structured or even unstructured, which can not be manipulated by traditional database techniques, it is imperative to extract web data to port them into databases for further handling. The purpose of Web Information Extraction (IE) in our web mining research support system is to extract a specific portion of web documents useful for a research project. Designing a general web IE system is a hard task. Most IE systems use a wrapper to extract information from a particular web site. A wrapper consists of a set of extraction rules and specific code to apply rules to a particular site.
  • A wrapper can be generated manually, semi-automatically, or automatically. Manual generation requires programmers to understand the structure of a web page and write code to translate it. This approach cannot adapt to dynamic changes in web sites: if new pages appear or the format of existing sources changes, the wrapper must be modified to match. Semi-automatic wrapper generation benefits from support tools that help design the wrapper. Users show the system what information to extract, but the system cannot induce the structure of a site by itself. Wrappers can also be generated automatically by learning extraction rules or patterns. Such systems train themselves to learn the structure of web pages, and learning algorithms must be developed to guide the training process.

  • We designed a wrapper to extract data from the SourceForge web site. The wrapper was implemented in Perl with modules from CPAN (the Comprehensive Perl Archive Network, the repository of Perl modules and libraries). All project home pages in SourceForge have a similar top-level design, and many of these pages are dynamically generated from a database. The web crawler uses LWP, the libwww-perl library, to fetch each project's home page. CPAN has a generic HTML parser that recognizes start tags, end tags, text, comments, etc. Because both statistical and member information are stored in tables, the web crawler uses an existing Perl module called HTML::TableExtract, together with Perl's string comparisons, to extract information. Link extractors are used if the member list spans more than one page. This manual wrapper has a high maintenance cost. Moreover, it cannot deal with the many text-like documents on the web, e.g. discussion boards and newsgroups. We will design and implement a better wrapper to facilitate web data extraction.
  • These are web document type definitions we use in our research. Structured text is defined as text with a fixed order of relevant information and labels or tags that delimit strings to be extracted. Free text is the natural language text which involves syntactic relations between words. Semistructured text falls between the above two types. This is an example. Such semi-structured text is ungrammatical and does not follow any rigid format.
  • Two popular wrapper generation techniques are wrapper induction and natural language processing (NLP). Both generate extraction rules from a given set of training samples. The difference between them is that wrapper induction tools use formatting features to implicitly delineate the structure of data, while NLP relies on semantic and syntactic constraints. Thus, wrapper induction techniques are suitable for structured or semi-structured data. NLP can work on these as well as on free text. However, it is more time-consuming because of the complex linguistic analysis involved.
  • Our IE system will use a hybrid way to deal with web documents. At startup, a web document will be checked for its type by a selector. Wrapper induction techniques will be developed to extract information from a structured/semi-structured text, while NLP techniques are used for free text. For a free text type web document, grammar and syntax are analyzed by NLP learning. Then, extraction rules are generated based on the analysis. The extractor extracts data according to those extraction rules and stores extracted data into the database. The extraction of a structured/semi-structured web document is as follows. Firstly, the parser creates a parse tree for the document. Secondly, users input sample patterns which they want to extract. Then, extraction heuristics are generated to match the sample patterns. Wrappers are created based on extraction heuristics to extract data. If the extracted data contains free text which needs further extraction, the process will be changed to use NLP techniques. Otherwise, data are stored into the database.
  • This hybrid information extraction system will be applied to our Open Source Software study. Our OSS study needs to collect data from several OSS web sites such as SourceForge, Savannah, Linux, Apache, Mozilla, etc. These sites offer structured/semi-structured documents, e.g., membership tables and statistics tables. However, we also need to study information hidden in free text documents. For example, we want to analyze developers' activity by collecting data from their messages. We believe our hybrid information extraction system will provide efficient extraction on those sites.
  • The purpose of generalization is to discover information patterns in the extracted web content. Generalization can be divided into two steps. The first step is preprocessing the data. Preprocessing is necessary because the extracted web content may include missing data, erroneous data, wrong formats and unnecessary characters. The second step is to find patterns by using some advanced techniques such as association rules, clustering, classification and sequential patterns.
  • Preprocessing consists of data cleansing, user identification, and session identification. Data cleansing eliminates irrelevant or unnecessary items in the analyzed data. A web site can be accessed by millions of users, and different users may use different formats when creating data. Furthermore, overlapping and incomplete data also exist. Through data cleansing, errors and inconsistencies are detected and removed to improve the quality of the data. Another preprocessing task is user identification. A single user may use multiple IP addresses, while one IP address can be used by multiple users. In order to study users' behavior, we must identify individual users. Users can be identified by analyzing the user actions recorded in server logs. Session identification divides the page accesses of a single user, who may visit a web site many times, into individual sessions. Like user identification, this task can be performed based on server logs. A server can set up sessions, and session IDs can be embedded into each URI and recorded in server logs.
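When a log carries no session IDs, one common fallback is to split a user's accesses wherever the gap between two requests exceeds a timeout. The sketch below illustrates that heuristic; it is a swapped-in technique rather than the session-ID approach mentioned above, and the 30-minute timeout and the function name are assumptions.

```python
def split_sessions(timestamps, timeout=1800):
    """Split one user's access times (in seconds) into sessions.

    A new session starts whenever the gap since the previous access
    exceeds `timeout` seconds (1800 s = 30 min, an assumed default).
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= timeout:
            sessions[-1].append(t)   # continue the current session
        else:
            sessions.append([t])     # gap too large: open a new session
    return sessions

# Hypothetical access times for a single, already identified user.
example = split_sessions([0, 600, 4000, 4100, 10000])
```

The input is assumed to contain the accesses of one user only, so user identification has to run before this step.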
  • To facilitate web mining, some data mining algorithms can be applied to find patterns and trends in the data collected from the web. Association rules mining tries to find interesting association or correlation relationship among a large set of data items. Association rules mining can also be applied to predict web access patterns for personalization. The association rules mining can be applied to web data to explore the behavior of web users and find patterns of their behaviors. Clustering is used to find natural groupings of data. These natural groupings are clusters. A cluster is a collection of data that are similar to one another. Clustering can be used to group customers with similar behavior and to find groups of pages having related content.
  • The goal of classification is to predict which of several classes a case (or an observation) belongs to. In web mining, classification rules allow one to develop a profile of items according to their common attributes. Sequential patterns are frequently occurring patterns related to time or other sequences, and they have been widely applied to prediction.
  • In our previous OSS study, we used Oracle data mining software 9.2 to analyze data collected from the SourceForge web site. We used the Apriori algorithm to find correlations between features of projects. The algorithm takes two inputs, namely the minimum support and the minimum confidence. We chose 0.01 for minimum support and 0.5 for minimum confidence. We found that the features "all_trks", "cvs" and "downloads" are associated. For classification, we compared the results of the Naive Bayes algorithm and the Adaptive Bayes Network algorithm. Naive Bayes assumes that each attribute is independent of the others. That is not the case in the SourceForge data. For example, the "downloads" feature is closely related to the "cvs" feature, and the "rank" feature is closely related to other features, since it is calculated from them. Oracle's implementation of CART is called Adaptive Bayes Network (ABN). In this case study, we try to predict downloads from other features. As stated previously, the "downloads" feature is binned into ten equal buckets. We predict which bucket the downloads value falls into based on the values of the other features. As expected, the Naive Bayes algorithm is not suitable for predicting "downloads", since that feature is related to other features such as "cvs". The accuracy of Naive Bayes is less than 10%. While Naive Bayes performs badly, the ABN algorithm predicts "downloads" with an accuracy of about 63%. The rules built by the ABN classification model show that "downloads" is closely related to "cvs". We are also interested in grouping projects with similar features into clusters. Two algorithms can be used to accomplish this: k-means and o-cluster. The k-means algorithm is a distance-based clustering algorithm which partitions the data into a predefined number of clusters. The o-cluster algorithm is a hierarchical and grid-based algorithm. The resulting clusters define dense areas in the attribute space. The dimension of the attribute space is the number of attributes involved in the clustering algorithm. We applied the two clustering algorithms to the projects in this case study. Figure \ref{cluster} and Figure \ref{crule} show the resulting clusters and the rules that define the clusters.
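To make the two Apriori inputs concrete, the following sketch computes support and confidence for rules between pairs of items. It is a simplified illustration limited to item pairs, not Oracle's implementation; the function names and the four sample transactions are invented for the example.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def pair_rules(transactions, min_support=0.01, min_confidence=0.5):
    """Return (lhs, rhs, support, confidence) for single-item rules
    lhs -> rhs that meet both thresholds."""
    items = sorted(set().union(*transactions))
    found = []
    for a, b in combinations(items, 2):
        s = support({a, b}, transactions)
        if s < min_support:
            continue  # infrequent pair: no rule can be built from it
        for lhs, rhs in ((a, b), (b, a)):
            confidence = s / support({lhs}, transactions)
            if confidence >= min_confidence:
                found.append((lhs, rhs, s, confidence))
    return found

# Hypothetical data: each set lists the features that are "high" for one project.
transactions = [
    {"cvs", "downloads"},
    {"cvs", "downloads"},
    {"cvs"},
    {"all_trks"},
]
rules = pair_rules(transactions, min_support=0.4, min_confidence=0.6)
```

Full Apriori extends this idea level by level to larger itemsets, pruning any candidate that has an infrequent subset.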
  • We propose to build a generalization infrastructure which consists of data cleansing, data integration, data transformation, data reduction, and pattern discovery. The first four steps belong to data preprocessing. Data cleansing deals with dirty data. Its main tasks include handling missing values, identifying outliers, filtering out noisy data, and correcting inconsistent data. Data integration combines data from multiple sources, such as statistics, forums, etc., into a coherent store. Web data from different sources may describe the same concept under different attribute names. Data transformation, also called data normalization, scales data values to a range. Data reduction reduces a huge data set to a smaller representative subset according to pattern discovery needs. Then we will apply data mining techniques to recognize patterns in the reduced data.

  • In data cleansing, missing data can be replaced by mean values or by values from similar input patterns. Outliers can be detected by data distribution analysis, cluster analysis, and regression. Data integration can be performed by using metadata or correlation analysis, which measures how strongly one attribute implies another. We will use several methods to handle data transformation. The simplest way is to divide the value by $10^n$, where $n$ is the number of digits of the maximum absolute value. The second way is min-max normalization, which rescales values linearly using the observed minimum and maximum. Another way to normalize data is the Z-score transformation, which is useful when the minimum and maximum are unknown.

  • Data reduction includes data aggregation, data compression, and discretization. Data aggregation gathers and expresses data in a summary form. For example, we may summarize data by year. Data compression reduces redundancy in the data to improve mining efficiency. Discretization transforms numeric data into categorical values for data mining algorithms that require them. Pattern discovery applies different data mining functions, such as association rules, clustering, classification, and sequential patterns, to the preprocessed data to find interesting patterns.
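The three transformations can be compared side by side in a short sketch. The function names and sample values are illustrative assumptions; the Z-score here uses the population standard deviation, which is one of two common conventions.

```python
import statistics

def decimal_scaling(values):
    """Divide by 10**n, where n is the digit count of the largest absolute value."""
    n = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** n for v in values]

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale linearly so the smallest value maps to new_min and the largest to new_max.
    Assumes the values are not all equal (otherwise the range is zero)."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def z_score(values):
    """Express each value as a number of standard deviations from the mean."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumed convention)
    return [(v - mu) / sigma for v in values]

scaled = decimal_scaling([120, -45, 980])      # largest absolute value has 3 digits
ranged = min_max([10, 20, 30])
standardized = z_score([2, 4, 4, 4, 5, 5, 7, 9])
```

Decimal scaling and min-max both need the extreme values of the attribute, whereas the Z-score needs only the mean and standard deviation.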
  • We plan to explore web data from Open Source Software sites by using our generalization infrastructure. For example, we want to find patterns which characterize the activeness of a project. The activeness of a project may relate to its downloads, page views and bug reports. By using data mining functions such as association rules, we can determine their relationships. We also want to discover clustering and dependencies of Open Source Software projects, and groups of developers as well as their interrelationships.
  • Analysis of web data involves simulation and validation. Simulation interprets the mined patterns and can be used to test and evaluate hypotheses. Because each research project has its own way of simulating and validating, we focus on validation in the OSS study as an example. In the OSS study, we use agent-based tools to simulate the OSS developers' network and validate the simulations against each other. This process is called docking.

  • There are three ways to validate an agent-based simulation. The first way is to compare the simulation output with the real phenomenon. This way is relatively simple and straightforward. However, we often cannot get complete real data on all aspects of the phenomenon. The second way compares agent-based simulation results with the results of mathematical models. The disadvantage of this way is that we need to construct mathematical models, which may be difficult to formulate for a complex system. The third way is docking with other simulations of the same phenomenon. Docking is the process of aligning two dissimilar models to address the same question or problem, in order to investigate their similarities and differences. It can verify the correctness of simulations and reveal advantages and disadvantages of different development toolkits.

  • Our docking experiments are part of a study of the Open Source Software phenomenon. Data about the SourceForge OSS developer site has been collected for over 2 years. Developer membership in projects is used to model the social network of developers. Social networks based on random graphs, preferential attachment, preferential attachment with constant fitness, and preferential attachment with dynamic fitness are modeled and compared to the collected data. We use two agent-based simulation tools, Swarm and Repast, to build simulations for these four models. The Repast and Swarm simulations are docked by comparing properties of the social networks such as degree distribution, diameter, and clustering coefficient.

  • The Open Source Software community is a classic example of a dynamic social network. In our model of the OSS collaboration network, there are two entities: developer and project. The network can be illustrated as a graph in which the nodes are developers. An edge is added when two developers participate in the same project, and it can be removed when they no longer share a project. A developer can perform several activities: create new projects, join existing projects, abandon projects, or continue with the same projects.

  • We use agent-based modeling to simulate the OSS development community. Unlike developers, projects are passive elements of the social network. Thus, we define only developers as agents, each of which encapsulates a real developer's possible daily interactions with the development network. Our simulation is time stepped instead of event driven, with one day of real time as a time step. Each day, a certain number of new developers are created. Newly created developers use decision rules to create new projects or join other projects. Also, each day existing developers can decide to abandon a randomly selected project, to continue their current projects, or to create a new project. If a preferential model is used, the developers' and projects' preferences need to be updated.
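The graph construction described above can be sketched directly from (project, developer) membership pairs. The function name and the sample memberships are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

def build_network(memberships):
    """Return the set of undirected edges between developers who share a project.

    `memberships` is an iterable of (project_id, developer_id) pairs.
    Each edge is stored once as a sorted (dev_a, dev_b) tuple.
    """
    by_project = defaultdict(set)
    for project_id, developer_id in memberships:
        by_project[project_id].add(developer_id)
    edges = set()
    for developers in by_project.values():
        for a, b in combinations(sorted(developers), 2):
            edges.add((a, b))
    return edges

# Hypothetical memberships: project 1 has three developers, project 3 has one.
memberships = [(1, "ann"), (1, "bob"), (2, "bob"), (2, "cid"), (1, "cid"), (3, "dee")]
edges = build_network(memberships)
```

Rebuilding the edge set from the current memberships after each change handles both edge addition and edge removal, since a pair stays connected only while it shares at least one project.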
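A time-stepped loop of this kind can be sketched as follows. This is not the Swarm or Repast code: the class and function names, the probabilities, and the uniform random choice of projects (no preferential attachment or fitness) are all assumptions made for illustration.

```python
import random

class Developer:
    """Agent holding the set of project ids it currently participates in."""
    def __init__(self, dev_id):
        self.dev_id = dev_id
        self.projects = set()

def step(developers, projects, rng, new_per_day=2, p_create=0.2, p_abandon=0.1):
    """Advance the simulation by one day (one time step)."""
    # Existing developers abandon a project, create one, or simply continue.
    for dev in developers:
        r = rng.random()
        if r < p_abandon:
            if dev.projects:
                dev.projects.discard(rng.choice(sorted(dev.projects)))
        elif r < p_abandon + p_create:
            projects.append(len(projects))
            dev.projects.add(projects[-1])
    # New developers either join a uniformly chosen project or create one.
    for _ in range(new_per_day):
        dev = Developer(len(developers))
        if projects and rng.random() >= p_create:
            dev.projects.add(rng.choice(projects))
        else:
            projects.append(len(projects))
            dev.projects.add(projects[-1])
        developers.append(dev)

def run(days, seed):
    """Run the model for `days` steps with a seeded generator for repeatability."""
    rng = random.Random(seed)
    developers, projects = [], []
    for _ in range(days):
        step(developers, projects, rng)
    return developers, projects
```

Passing an explicitly seeded generator makes two runs comparable step by step, which is the property a docking comparison relies on.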
  • This Figure shows the docking process. The initial simulation was written using Swarm. Our docking process began when the author of the swarm simulation wrote the docking specification. Then, the Repast version was written based on the docking specification. Swarm simulations and Repast simulations are docked for four models of the OSS network. Simulations are validated by comparing four network attributes generated by running these two simulation models—diameter, degree distribution, clustering coefficient and community size.
  • Our docking process sought to verify our Repast migration against the original Swarm simulation. The process began with a comparison of network parameters between corresponding models. Upon finding differences, we compared each developer's actions. We found that Swarm and Repast use different random number generators. For example, the Repast simulation used the COLT random number generator from the European Laboratory for Particle Physics (CERN). We suspected that the different generators caused the systematic differences between the two simulations' output data. To test this, we ran the two simulations with exactly the same set of random numbers: each simulation used the same random number generator with the same seed. The systematic difference remained, so the random number generators were not its cause. To determine the exact reasons, we had the simulations log the action that each developer took during each step. Comparing these logs, two reasons for the differences emerged. First, one simulation would occasionally throw an SQL exception (we store simulation data in a relational database for post-simulation analysis). Such an error can cause further discrepancies at later time steps, since a developer's previous actions affect its future actions. The cause of this error was a problem with the primary keys in the links table of our SQL database (a programming bug). Second, the Swarm scheduler begins at time step 0 while the Repast scheduler begins at time step 1, so Swarm had actually performed one extra time step. With these two problems corrected, the corresponding logs of the developers' actions matched. Using the same sequence of random numbers, the Swarm and Repast simulations produced identical output.

  • Degree distribution, diameter, and clustering coefficient are attributes frequently used to describe a network. The diameter of a network is the maximum distance between any pair of connected nodes. The diameter can also be defined as the average length of the shortest paths between any pair of nodes in the graph. In this paper, the second definition is used, since the average value is more suitable for studying the topology of the OSS network. The neighborhood of a node consists of the set of nodes to which it is connected. The clustering coefficient of a node is the fraction of links actually present among the nodes in its neighborhood relative to the total possible number of such links. The clustering coefficient of a graph is the average of the clustering coefficients of all its nodes. Degree distribution is the distribution of node degrees throughout the network. Degree distribution was believed to be a normal distribution, but Albert and Barabasi recently found that it fits a power law distribution in many real networks.
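Both measures can be computed with a breadth-first search over an adjacency map. The sketch below follows the second (average shortest path) definition of diameter given above; the function names, the choice to assign a coefficient of 0 to nodes with fewer than two neighbors, and the small example graph are assumptions.

```python
from collections import deque
from itertools import combinations

def average_path_length(adj):
    """Average shortest-path length over all connected ordered node pairs."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:                      # breadth-first search from `source`
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1            # reachable nodes other than the source
    return total / pairs if pairs else 0.0

def clustering_coefficient(adj):
    """Average over nodes of (links among neighbors) / (possible links among neighbors)."""
    coefficients = []
    for node, neighbors in adj.items():
        k = len(neighbors)
        if k < 2:
            coefficients.append(0.0)      # assumed convention for degree 0 or 1
            continue
        links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
        coefficients.append(links / (k * (k - 1) / 2))
    return sum(coefficients) / len(coefficients)

# Hypothetical network: a triangle a-b-c with a fourth developer d linked to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
```

In the example graph, nodes a and b have coefficient 1, node c has 1/3 (only the a-b link exists among its three neighbors), and node d has 0.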
  • This figure shows the evolution of the diameter of the network with the time period. We can see that Repast simulations and Swarm simulations are docked. In the real SourceForge developer collaboration network, the diameter of the network decreases as the network grows. In our models, we can observe that ER model does not fit the SourceForge network, while other three models match the real network. However, there are slight differences between Swarm results and Repast results. We believe this difference is caused by different random generators associated with RePast and Swarm.
  • Figure \ref{degree_distribution} gives the developer distributions in the four models implemented in Swarm and Repast. The $X$ coordinate is the number of projects in which each developer participated, and the $Y$ coordinate is the number of developers in the related categories. From the figure, we can observe that there is no power law distribution in the ER model. Its distribution looks more like the mathematically proven normal distribution. The developer distributions in the other three models match the power law distribution.

  • The clustering coefficient of the developer network as a function of time is shown in this figure. All models are docked very well. We can observe a decaying trend of the clustering coefficient in all four models. The reason is that, as the developer network evolves, two co-developers become less likely to join a new project together because the number of projects they participate in is approaching its limit.

  • This figure shows the total number of developers and projects over time in the four models, which describes the growth trends of developers and projects in the network. The upper part is the developers' trend for both simulations, and the lower part is the projects' trend. The numbers of developers and projects are almost the same for the Swarm and Repast simulations.

  • There are several directions for future work. First, we currently compare the results of only one simulation run. Because there is a slight difference between the docking results, which we believe is caused by the different random number generators, we will run both the Swarm and the Repast simulations many times to see whether this difference can be ignored. We will perform statistical analysis, for example hypothesis tests, on these results. Moreover, we will dock more network parameters such as average degree, cluster size distribution, and fitness and life cycle. A statistical comparison with the extracted web data will also be performed.
  • Our proposed work will provide an integrated web mining system which combines web retrieval and data mining techniques together to support research. This system is designed for identifying, extracting, and analyzing data from web resources. Each of its four components will be constructed to improve the effectiveness and efficiency of web mining.
  • The IR system should be developed and implemented by March, 2004. The IE system will be designed and constructed by July, 2004. Generalization and validation will be done by September, 2004 and December, 2004. Dissertation will be written during the whole research. The whole Ph.D study will be finished by May 2005.
  • Four journal papers are expected to be published for this research.
  • The primary data required for this research are two tables: project statistics and developers. The project statistics table consists of records with 9 fields: project ID, lifespan, rank, page views, downloads, bugs, support, patches, and CVS. The developers table has 2 fields: project ID and developer ID. Because projects can have many developers and developers can be on many projects, neither field alone is a unique primary key. Thus the composite key composed of both attributes serves as the primary key. Each project is assigned a unique ID when it registers with SourceForge.
  • Social network theory is the basis of the conceptual framework through which we view the OSS developer activities. The theory, built on mathematical graph theory, depicts interrelated social agents as nodes or vertices of a graph and their relationships as links or edges drawn between the nodes. The number of edges (or links) connected to a node (or vertex) is called the index or degree of the node. Early work in this field by Erdos and Renyi focuses on random graphs, i.e., those where edges between vertices were attached in a random process (called ER graphs here). However, the distributions of index values for the random graphs do not agree with the observed power law distribution for many social networks, including the OSS developer network at SourceForge.
  • Some other evolutionary mechanisms include: 1) the Watts-Strogatz (WS) model, 2) the Barabasi-Albert (BA) model with preferential attachment, 3) the modified BA model with fitness, and 4) an extension of the BA model (with fitness) to include dynamic fitness based on project life cycle reported. The WS model captures the local clustering property of social networks and was extended to include some random reattachment to capture the small world property, but failed to display the power-law distribution of index values. The BA model added preferential attachment, both preserving the realistic properties of the WS model and also displaying the power-law distribution. The BA model was extended with the addition of random fitness to capture the fact that sometimes newly added nodes grow edges faster than previously added nodes.
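A minimal sketch of preferential attachment, the mechanism behind the BA model described above, is given below. It is a generic textbook-style construction, not the simulation code used in this study; the function names, the seed graph and the parameters `n` and `m` are assumptions.

```python
import random
from collections import Counter

def ba_graph(n, m, seed=0):
    """Grow a graph to n nodes; each new node links to m distinct existing
    nodes chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    edges = []
    pool = []  # every node appears once per incident edge
    # Seed graph: a complete graph on m + 1 nodes.
    for u in range(m + 1):
        for v in range(u + 1, m + 1):
            edges.append((u, v))
            pool.extend([u, v])
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            # Uniform draw from the pool equals a degree-weighted draw of nodes.
            chosen.add(rng.choice(pool))
        for target in chosen:
            edges.append((new, target))
            pool.extend([new, target])
    return edges

def degrees(edges):
    """Count the degree (index) of every node in an edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

edges = ba_graph(200, 2)
degree_of = degrees(edges)
```

Plotting a histogram of `degree_of.values()` on log-log axes is the usual way to inspect whether the degree distribution has a heavy tail.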
  • Our Swarm simulation has a hierarchical structure which consists of a developer class, a modelswarm class, an observerswarm class and a main program. The modelswarm creates developers and controls their activities. In modelswarm, a schedule is generated to define a set of activities of the agents. The observerswarm is used to implement data collection and draw graphs. The main program is a driver that starts the whole simulation. The core of a Swarm simulation consists of a group of agents. Agents in our simulation are developers. Each developer is an instance of a Java class. A developer has an identification id, a degree, which is the number of links, and a list of the projects in which the developer participates. Furthermore, the developer class has methods to describe possible daily actions: create, join or abandon a project, or continue the developer's current collaborations.
  • Our Repast simulation of the OSS developer network consists of a model class, a developer class, an edge class and a project class. The class structure of the simulation is different from that of the Swarm simulation. This is due in part to the graphical network display feature of Repast. The model class is responsible for creating developers and controlling their activities. Furthermore, information collection and display are also encapsulated in the model class. The developer class is similar to that in the Swarm simulation. An edge class is used to define an edge in the OSS network. We also create a project class with properties and methods to simulate a project.
  • The docking process validates four simulation models of the OSS developer network using Swarm and Repast. Properties of social networks such as degree distribution, diameter and clustering coefficient are used to dock the simulations. The experiment showed that docking can also be used to validate the migration of a simulation from one software package to another. In our case, the docking process helped with the transfer to Repast to take advantage of its features. The Repast simulation runs faster than the Swarm simulation because Repast is written in pure Java, while Swarm was originally written in Objective-C, which may cause some overhead for Java Swarm. Furthermore, Repast provides more display library packages, such as the network package, which help users with analysis. These two graphs show the random layout and circular layout generated by the Repast network package.
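Two of the docking properties named above can be computed from an edge list as sketched below. The definitions follow the DOCKING PARAMETERS slide in the transcript (diameter taken as the average shortest-path length over connected pairs, clustering coefficient as the average of the per-node values). The function names and the small example graph are assumptions, and nodes with fewer than two neighbours are counted as having a coefficient of zero.

```python
from collections import deque

def adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def clustering_coefficient(adj):
    """Average over nodes of (links among neighbours) / (possible links)."""
    total = 0.0
    for neighbours in adj.values():
        k = len(neighbours)
        if k < 2:
            continue  # contributes 0 to the average
        # Node labels must be mutually comparable for the a < b check.
        links = sum(1 for a in neighbours for b in neighbours
                    if a < b and b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def average_path_length(adj):
    """Mean shortest-path length over all connected ordered pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

# Invented example: a triangle (0, 1, 2) with one pendant node (3).
example = adjacency([(0, 1), (1, 2), (0, 2), (2, 3)])
cc = clustering_coefficient(example)
apl = average_path_length(example)
```

Running the same functions on the networks produced by two toolkits gives directly comparable numbers for a docking check.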
  • Jin_proposal_slides.ppt

    1. 1. DESIGN AND IMPLEMENTATION OF A WEB MINING RESEARCH SUPPORT SYSTEM Jin Xu Department of Computer Science & Engineering Presented at the Ph. D. Candidacy exam Advisors: Dr. Gregory Madey Dr. Patrick Flynn Nov. 21, 2003
    2. 2. OUTLINE <ul><li>INTRODUCTION </li></ul><ul><li>WEB INFORMATION RETRIEVAL (IR) </li></ul><ul><li>WEB INFORMATION EXTRACTION (IE) </li></ul><ul><li>GENERALIZATION </li></ul><ul><li>SIMULATION & VALIDATION </li></ul><ul><li>CONCLUSIONS & TIME PLAN </li></ul>
    3. 3. INTRODUCTION <ul><li>World Wide Web </li></ul><ul><ul><li>Abundant information </li></ul></ul><ul><li>Web resources </li></ul><ul><ul><li>On-line databases </li></ul></ul><ul><ul><li>Web documents </li></ul></ul><ul><ul><li>Archives: forums, newsgroups </li></ul></ul><ul><li>Important resource for research </li></ul>
    4. 4. OPEN SOURCE SOFTWARE STUDY <ul><li>Open Source Software (OSS) study at ND </li></ul><ul><ul><li>Development of projects </li></ul></ul><ul><ul><li>Behaviors of developers </li></ul></ul><ul><li>Web resources </li></ul><ul><ul><li>SourceForge Developer Site </li></ul></ul><ul><ul><ul><li>Largest OSS development site </li></ul></ul></ul><ul><ul><ul><ul><li>70,000 projects, 90,000 developers, 700,000 users </li></ul></ul></ul></ul><ul><ul><ul><li>Detailed information </li></ul></ul></ul><ul><ul><ul><ul><li>Project characteristics </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Developer activities </li></ul></ul></ul></ul>
    5. 5. MOTIVATION <ul><li>Traditional Web Data Retrieval </li></ul><ul><ul><li>Browsing: following links </li></ul></ul><ul><ul><li>keyword query: irrelevant results </li></ul></ul><ul><li>Web data features </li></ul><ul><ul><li>Semi-structured </li></ul></ul><ul><ul><li>Heterogeneous </li></ul></ul><ul><ul><li>Dynamic </li></ul></ul><ul><li>Web data mining research support systems </li></ul><ul><ul><li>Web-based IR support system, J.T. Yao, et al, 2003 </li></ul></ul><ul><ul><li>CUPTRESS, H. Tang, et al, 2003 </li></ul></ul>
    6. 6. SYSTEM DESCRIPTION <ul><li>Objective </li></ul><ul><ul><li>General solution to use web resources and discover knowledge </li></ul></ul><ul><li>Functions </li></ul><ul><ul><li>Web resource identification – IR </li></ul></ul><ul><ul><li>Web data extraction – IE </li></ul></ul><ul><ul><li>Data mining – Generalization </li></ul></ul><ul><ul><li>Analysis & Validation </li></ul></ul><ul><li>Will be tested on the OSS study at ND </li></ul>
    7. 7. FRAMEWORK Web Mining Research Support System Open Source Software Information Retrieval Information Extraction Generalization Simulation & Validation
    8. 8. WEB INFORMATION RETRIEVAL <ul><li>Current IR tools </li></ul><ul><li>Web page classification </li></ul><ul><li>Proposed classification search tool </li></ul><ul><li>Example </li></ul>
    9. 9. CURRENT IR TOOLS <ul><li>Current IR Tools </li></ul><ul><ul><li>Directory - subject lists </li></ul></ul><ul><ul><li>Search engine - keywords query </li></ul></ul><ul><li>Limitations </li></ul><ul><ul><li>Low precision - too many irrelevant documents </li></ul></ul><ul><ul><li>Low recall - too little of the web is covered by well-categorized directories </li></ul></ul><ul><ul><li>Not well-categorized for research needs </li></ul></ul><ul><li>Classification search tool </li></ul>
    10. 10. WEB PAGE CLASSIFICATION <ul><li>Manual classification </li></ul><ul><ul><li>Accurate </li></ul></ul><ul><ul><li>Yahoo, Informine </li></ul></ul><ul><ul><li>Impossible for large number of web pages </li></ul></ul><ul><li>Automatic classification </li></ul><ul><ul><li>Nearest neighbor (NN) classifier (C. Chekuri, 1996) </li></ul></ul><ul><ul><ul><li>Similar documents are assigned the same class label </li></ul></ul></ul><ul><ul><li>Feature selection (S. Chakrabarti, 1998) </li></ul></ul><ul><ul><ul><li>Eliminate features with low correlations </li></ul></ul></ul><ul><ul><li>Naïve Bayes classifier (A.K. McCallum, 2000) </li></ul></ul>
    11. 11. CLASSIFICATION SEARCH TOOL Crawlers Classifier User Interface Existing search tools Web pages XML directories
    12. 12. EXAMPLE <ul><li>Hosted sites </li></ul><ul><ul><li>www.sourceforge.com </li></ul></ul><ul><ul><li>savannah.gnu.org </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><ul><li>Projects </li></ul></ul><ul><ul><li>www.linux.org </li></ul></ul><ul><ul><li>www.apache.org </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><li>Research </li></ul><ul><li>Organizations </li></ul><ul><ul><ul><ul><li>www.nd.edu/~oss/ </li></ul></ul></ul></ul><ul><ul><ul><ul><li>… </li></ul></ul></ul></ul><ul><ul><li>Papers </li></ul></ul><ul><ul><li> … </li></ul></ul><ul><ul><ul><li>Conferences </li></ul></ul></ul><ul><ul><ul><li>… </li></ul></ul></ul>
    13. 13. WEB INFORMATION EXTRACTION <ul><li>Wrapper </li></ul><ul><li>Wrapper generation </li></ul><ul><li>OSS wrapper & data collection </li></ul><ul><li>Wrapper generation tools </li></ul><ul><li>Hybrid IE </li></ul>
    14. 14. WEB INFORMATION EXTRACTION <ul><li>Web IE </li></ul><ul><ul><li>Extract a specific portion of web documents </li></ul></ul><ul><ul><li>Port into databases </li></ul></ul><ul><li>Wrapper </li></ul><ul><ul><li>Extract information from a particular web site </li></ul></ul><ul><ul><li>Extraction rules and code to apply rules </li></ul></ul>
    15. 15. WRAPPER GENERATION (WG) <ul><li>Manual generation </li></ul><ul><ul><li>Understanding of web documents </li></ul></ul><ul><ul><li>Writing codes </li></ul></ul><ul><ul><li>Dynamic changes of web sites </li></ul></ul><ul><li>Semi-automatic generation </li></ul><ul><ul><li>Sample pages </li></ul></ul><ul><ul><li>Demonstration by users </li></ul></ul><ul><li>Automatic generation </li></ul><ul><ul><li>Learn extraction rules </li></ul></ul>
    16. 16. OSS WEB WRAPPER <ul><li>OSS web wrapper </li></ul><ul><ul><li>Perl and CPAN modules </li></ul></ul><ul><ul><li>URL accessing– fetch pages (LWP) </li></ul></ul><ul><ul><li>HTML parser – parse pages </li></ul></ul><ul><ul><ul><li>HTML::TableExtract – extract information </li></ul></ul></ul><ul><ul><ul><li>Link extractor – extract links </li></ul></ul></ul><ul><ul><ul><li>Word extractor </li></ul></ul></ul><ul><li>Features </li></ul><ul><ul><li>Manual </li></ul></ul><ul><ul><li>High maintenance cost </li></ul></ul><ul><ul><li>Not suitable to handle free text </li></ul></ul>
    17. 17. WEB DOCUMENT TYPES <ul><li>Structured text </li></ul><ul><ul><li>Fixed order </li></ul></ul><ul><ul><li>Labels/tags </li></ul></ul><ul><ul><li>On-line data generated by database </li></ul></ul><ul><li>Free text </li></ul><ul><ul><li>Natural language text </li></ul></ul><ul><li>Semistructured text </li></ul><ul><ul><li>Bill Smith – admin, Perl, Graduate student, University of Notre Dame, IN 46556, (574)333-3333 </li></ul></ul>
    18. 18. WG TECHNIQUES <ul><li>Wrapper induction (SoftMealy, 1998; WIEN, 2000; STALKER 2001; LIXTO 2003) </li></ul><ul><ul><li>Training examples </li></ul></ul><ul><ul><li>Extraction rules </li></ul></ul><ul><ul><li>Format features to delineate the structure of data </li></ul></ul><ul><ul><li>Not suitable for free text </li></ul></ul><ul><li>Natural language processing (NLP) (WHISK, 1999) </li></ul><ul><ul><li>Training examples </li></ul></ul><ul><ul><li>Extraction rules </li></ul></ul><ul><ul><li>Linguistic constraints </li></ul></ul><ul><ul><li>Work for all </li></ul></ul><ul><ul><li>Complex, time cost </li></ul></ul>
    19. 19. HYBRID IE ARCHITECTURE structured/semi structured text parser Structured data extractor database Free text selector Web document NLP learning heuristics Wrapper induction Extraction rules user Parse tree Sample pattern Sample pattern
    20. 20. OSS APPLICATION <ul><li>Web sites </li></ul><ul><ul><li>SourceForge, Savannah, Linux, Apache, Mozilla </li></ul></ul><ul><li>Structured/semi-structured </li></ul><ul><ul><li>Membership tables, statistics tables, etc. </li></ul></ul><ul><li>Free text </li></ul><ul><ul><li>Message board, emails, etc. </li></ul></ul>
    21. 21. GENERALIZATION <ul><li>Overview </li></ul><ul><li>Preprocessing </li></ul><ul><li>Data mining functions </li></ul><ul><li>Previous OSS generalization study </li></ul><ul><li>Infrastructure </li></ul>
    22. 22. GENERALIZATION OVERVIEW <ul><li>Discover information patterns </li></ul><ul><li>Two steps </li></ul><ul><ul><li>Preprocessing </li></ul></ul><ul><ul><ul><li>Missing, erroneous data </li></ul></ul></ul><ul><ul><ul><li>Wrong formats </li></ul></ul></ul><ul><ul><ul><li>Unnecessary characters </li></ul></ul></ul><ul><ul><li>Pattern recognition </li></ul></ul><ul><ul><ul><li>Advanced techniques </li></ul></ul></ul>
    23. 23. PREPROCESSING <ul><li>Data cleansing </li></ul><ul><ul><li>Eliminate irrelevant/unnecessary items </li></ul></ul><ul><ul><li>Detect errors/inconsistencies/duplications </li></ul></ul><ul><li>User identification </li></ul><ul><ul><li>A single user uses multiple IP addresses </li></ul></ul><ul><ul><li>An IP address is used by multiple users </li></ul></ul><ul><li>Session identification </li></ul><ul><ul><li>Divides page accesses of a single user into sessions </li></ul></ul>
    24. 24. DATA MINING FUNCTIONS <ul><li>Association Rules (C. Lin, 2000; B. Mobasher 2000) </li></ul><ul><ul><li>Find interesting association or correlation relationship among data items </li></ul></ul><ul><li>Clustering (Y. Fu, 1999; B. Mobasher 2000) </li></ul><ul><ul><li>Find natural groups of data </li></ul></ul><ul><ul><li>Usage clusters – users with similar browsing patterns </li></ul></ul><ul><ul><li>Page clusters – pages having related content </li></ul></ul>
    25. 25. DATA MINING FUNCTIONS (Cont.) <ul><li>Classification (B. Mobasher 2000) </li></ul><ul><ul><li>Map a data item into predefined classes </li></ul></ul><ul><ul><li>Develop a profile of items </li></ul></ul><ul><li>Sequential patterns (H. Pinto, 2001) </li></ul><ul><ul><li>Find patterns related to time or other sequence </li></ul></ul><ul><ul><li>Prediction </li></ul></ul>
    26. 26. PREVIOUS OSS GENERALIZATION <ul><li>Association Rules </li></ul><ul><ul><li>“all tracks”, “downloads” and “CVS” are associated </li></ul></ul><ul><li>Classification </li></ul><ul><ul><li>Predict “downloads” </li></ul></ul><ul><ul><li>Naïve Bayes – Build Time 30 sec, accuracy 9% </li></ul></ul><ul><ul><li>Adaptive Bayes Network - Build Time 20 min, accuracy 63% </li></ul></ul><ul><li>Clustering </li></ul><ul><ul><li>K-means </li></ul></ul><ul><ul><li>O-cluster </li></ul></ul><ul><li>In collaboration with Y. Huang </li></ul>
    27. 27. GENERALIZATION INFRASTRUCTURE … Integrated data Reduced data Association Rules Clustering Classification Sequential patterns Cleansing Integration Transformation Reduction Recognition Statistics Forums Server logs Transformed data … Statistics Forums Server logs
    28. 28. GENERALIZATION TECHNIQUES <ul><li>Data cleansing </li></ul><ul><ul><li>Missing data replaced by mean or similar input </li></ul></ul><ul><ul><li>Outliers can be detected by distribution, regression </li></ul></ul><ul><li>Data integration </li></ul><ul><ul><li>Metadata or correlation analysis </li></ul></ul><ul><li>Data transformation </li></ul><ul><ul><li>Divide by 10^n </li></ul></ul><ul><ul><li>Min-Max </li></ul></ul><ul><ul><li>Z-score transformation </li></ul></ul>
    29. 29. GENERALIZATION TECHNIQUES (Cont.) <ul><li>Data reduction </li></ul><ul><ul><li>Data aggregation </li></ul></ul><ul><ul><li>Data compression </li></ul></ul><ul><ul><li>Discretization </li></ul></ul><ul><li>Pattern discovery </li></ul><ul><ul><li>Apply data mining functions </li></ul></ul>
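The three data transformations listed on the Generalization Techniques slides (divide by 10^n, Min-Max, Z-score) can be sketched as follows. The function names, the choice of the population standard deviation for the z-score, and the sample values are assumptions for illustration.

```python
import statistics

def decimal_scaling(values):
    """Divide by 10^n, with n the smallest integer making every |v| < 1."""
    largest = max(abs(v) for v in values)
    n = 0
    while largest / 10 ** n >= 1:
        n += 1
    return [v / 10 ** n for v in values]

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values onto the interval [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # constant input: no spread to rescale
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

# Invented download counts for three projects.
downloads = [120, 45, 986]
scaled = decimal_scaling(downloads)
```

Min-max scaling is sensitive to outliers because a single extreme value fixes the range, which is one reason to run outlier detection during cleansing first.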
    30. 30. OSS APPLICATION <ul><li>Patterns to characterize the activeness of a project </li></ul><ul><ul><li>Downloads, page view, bug reports </li></ul></ul><ul><li>Clustering and dependencies of OSS projects </li></ul><ul><li>Groups of developers and their relationships </li></ul>
    31. 31. SIMULATION & VALIDATION <ul><li>Introduction </li></ul><ul><li>Validation approaches </li></ul><ul><li>Previous validation – OSS docking </li></ul><ul><li>Future work </li></ul>
    32. 32. SIMULATION & VALIDATION <ul><li>Simulation (In collaboration with Y. Gao) </li></ul><ul><ul><li>Interpret the mined patterns </li></ul></ul><ul><ul><li>Build models and simulations </li></ul></ul><ul><ul><li>Use simulations to test and evaluate hypothesis </li></ul></ul><ul><li>Validation </li></ul><ul><li>OSS validation – docking </li></ul>
    33. 33. VALIDATION <ul><li>Three methods of Validation </li></ul><ul><ul><li>Comparison with real phenomenon </li></ul></ul><ul><ul><li>Comparison with mathematical models </li></ul></ul><ul><ul><li>Docking with other simulations </li></ul></ul><ul><li>Docking </li></ul><ul><ul><li>Verify simulation correctness </li></ul></ul><ul><ul><li>Discover pros & cons of toolkits </li></ul></ul><ul><ul><li>R. Axtell, 1996; M. North 2001; M. Ashworth, 2002 </li></ul></ul>
    34. 34. OSS DOCKING EXPERIMENT <ul><li>Four Models of OSS </li></ul><ul><ul><li>random graphs </li></ul></ul><ul><ul><li>preferential attachment </li></ul></ul><ul><ul><li>preferential attachment with constant fitness </li></ul></ul><ul><ul><li>preferential attachment with dynamic fitness </li></ul></ul><ul><li>Agent-based Simulation </li></ul><ul><ul><li>Swarm </li></ul></ul><ul><ul><li>Repast </li></ul></ul>
    35. 35. OSS NETWORK <ul><li>A classic example of a dynamic social network </li></ul><ul><li>Two Entities: developer, project </li></ul><ul><li>Graph Representation </li></ul><ul><ul><li>Node – developers </li></ul></ul><ul><ul><li>Edge – two developers are participating in the same project </li></ul></ul><ul><li>Activities </li></ul><ul><ul><li>Create projects </li></ul></ul><ul><ul><li>Join projects </li></ul></ul><ul><ul><li>Abandon projects </li></ul></ul><ul><ul><li>Continue with current projects </li></ul></ul>
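The graph representation on the OSS Network slide (developers as nodes, an edge when two developers participate in the same project) can be derived from (project ID, developer ID) pairs as sketched below. The function name and the sample memberships are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def developer_network(memberships):
    """Return the set of undirected edges between developers who share
    at least one project, given (project_id, developer_id) pairs."""
    by_project = defaultdict(set)
    for project_id, developer_id in memberships:
        by_project[project_id].add(developer_id)
    edges = set()
    for developers in by_project.values():
        # Every pair of developers on the same project is linked once.
        for a, b in combinations(sorted(developers), 2):
            edges.add((a, b))
    return edges

# Invented memberships: alice and bob share two projects but get one edge.
memberships = [(1, "alice"), (1, "bob"), (2, "alice"), (2, "carol"), (2, "bob")]
network = developer_network(memberships)
```

Storing each edge as a sorted pair in a set keeps the graph simple, so developers who collaborate on several projects are still joined by a single edge.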
    36. 36. OSS MODEL <ul><li>Agent: developer </li></ul><ul><li>Each time interval: </li></ul><ul><ul><li>A certain number of developers generated </li></ul></ul><ul><ul><li>New developers: create or join </li></ul></ul><ul><ul><li>Old developers: create, join, abandon, idle </li></ul></ul><ul><ul><li>Update preference for preferential models </li></ul></ul>
    37. 37. DOCKING PROCESS OSS models parameters Docking Repast Swarm ER BA BAC BAD Diameter Degree distribution Clustering coefficient Community size toolkits
    38. 38. DOCKING PROCEDURE <ul><li>Process: comparison of the parameters of corresponding models. </li></ul><ul><li>Findings: </li></ul><ul><ul><li>Different random generators </li></ul></ul><ul><ul><li>Database creation errors in the original version </li></ul></ul><ul><ul><li>Different starting times of schedulers </li></ul></ul>
    39. 39. DOCKING PARAMETERS <ul><li>Diameter </li></ul><ul><ul><li>Average length of shortest paths between all pairs of vertices </li></ul></ul><ul><li>Degree distribution </li></ul><ul><ul><li>The distribution of degrees throughout a network </li></ul></ul><ul><li>Clustering coefficient (CC) </li></ul><ul><ul><li>CC i : Fraction representing the number of links actually present relative to the total possible number of links among the vertices in its neighborhood. </li></ul></ul><ul><ul><li>CC: average of all CC i in a network </li></ul></ul><ul><li>Community size </li></ul>
    40. 40. DIAMETER
    44. 44. PROPOSED WORK OF VALIDATION <ul><li>Docking </li></ul><ul><ul><li>Many runs instead of one </li></ul></ul><ul><ul><li>Statistical analysis </li></ul></ul><ul><ul><li>More network parameters </li></ul></ul><ul><ul><ul><li>Average degree </li></ul></ul></ul><ul><ul><ul><li>Cluster size distribution </li></ul></ul></ul><ul><ul><ul><li>Fitness and life cycle </li></ul></ul></ul><ul><li>Statistic comparison </li></ul>
    45. 45. CONTRIBUTIONS <ul><li>Provide an integrated web mining system to support research – a new tool </li></ul><ul><li>Build a classification retrieval tool to improve precision and recall, as well as meet users’ search requirements </li></ul><ul><li>Implement a hybrid IE tool to extract web data effectively and efficiently </li></ul><ul><li>Create a generalization infrastructure which is suitable for web data mining </li></ul><ul><li>Provide methods to validate OSS simulations </li></ul>
    46. 46. TIME PLAN <ul><li>IR system – March 2004 </li></ul><ul><li>IE system – July 2004 </li></ul><ul><li>Generalization – September 2004 </li></ul><ul><li>Validation – December 2004 </li></ul><ul><li>Dissertation – During the research </li></ul><ul><li>Complete – May 2005 </li></ul>
    47. 47. PAPERS <ul><li>Published papers </li></ul><ul><ul><li>“A Research Support System Framework for Web Data mining Research”, Workshop at IEEE/WIC and Intelligent Agent Technology, 2003 </li></ul></ul><ul><ul><li>“Multi-Model Docking Experiment of Dynamic Social Network Simulations”, Agents2003. </li></ul></ul><ul><ul><li>“Docking Experiment: Swarm and Repast for Social Network Modeling”, Swarm2003. </li></ul></ul><ul><li>Future papers </li></ul><ul><ul><li>A classification web information retrieval tool </li></ul></ul><ul><ul><li>A hybrid web information extraction tool </li></ul></ul><ul><ul><li>Data mining results from OSS study </li></ul></ul><ul><ul><li>Validation results of OSS simulations </li></ul></ul>
    48. 48. <ul><li>THANK YOU </li></ul>
    49. 49. OSS DATA COLLECTION <ul><li>Data sources </li></ul><ul><ul><li>Statistics, forums </li></ul></ul><ul><li>Project statistics </li></ul><ul><ul><li>9 fields – project ID, lifespan, rank, page views, downloads, bugs, support, patches and CVS </li></ul></ul><ul><li>Developer statistics </li></ul><ul><ul><li>Project ID and developer ID </li></ul></ul>
    50. 50. EXAMPLES OF DATA MINING FUNC. <ul><li>Association rules </li></ul><ul><ul><li>40% of users who accessed the web page with URL /project1 also accessed /project2 </li></ul></ul><ul><li>Clustering </li></ul><ul><ul><li>Users of project1 can be grouped as developers and common users </li></ul></ul><ul><li>Classification </li></ul><ul><ul><li>50% of users who downloaded software in /product2 were developers of Open Source Software and worked in IT companies </li></ul></ul><ul><li>Sequential patterns </li></ul><ul><ul><li>Clients who downloaded software in /project1 also downloaded software in /project2 within 15 days </li></ul></ul>
    51. 51. SOCIAL NETWORK MODEL <ul><li>Graph Representation </li></ul><ul><ul><li>Node/vertex – Social Agent </li></ul></ul><ul><ul><li>Edge/link – Relationship </li></ul></ul><ul><ul><li>Index/degree - The number of edges connected to a node </li></ul></ul><ul><li>ER (random) Graph </li></ul><ul><ul><li>edges attached in a random process </li></ul></ul><ul><ul><li>No power law distribution </li></ul></ul>
    52. 52. SOCIAL NETWORK MODEL(Cont.) <ul><li>Watts-Strogatz (WS) model </li></ul><ul><ul><li>include some random reattachment </li></ul></ul><ul><ul><li>No power law distribution </li></ul></ul><ul><li>Barabasi-Albert (BA) model with preferential attachment </li></ul><ul><ul><li>Addition of preferential attachment </li></ul></ul><ul><ul><li>Power law distribution </li></ul></ul><ul><li>BA model with constant fitness </li></ul><ul><ul><li>addition of random fitness </li></ul></ul><ul><li>BA model with dynamic fitness </li></ul>
    53. 53. SWARM SIMULATION <ul><li>ModelSwarm </li></ul><ul><ul><li>Creates developers </li></ul></ul><ul><ul><li>Controls the activities of developers in the model </li></ul></ul><ul><ul><li>Generates a schedule </li></ul></ul><ul><li>ObserverSwarm </li></ul><ul><ul><li>Collects information and draws graphs </li></ul></ul><ul><li>main </li></ul><ul><li>Developer (agent) </li></ul><ul><ul><li>Properties: ID, degree, participated projects </li></ul></ul><ul><ul><li>Methods: daily actions </li></ul></ul>
    54. 54. REPAST SIMULATION <ul><li>Model </li></ul><ul><ul><li>creates and controls the activities of developers </li></ul></ul><ul><ul><li>collects information and draws graphs </li></ul></ul><ul><ul><ul><li>Network display </li></ul></ul></ul><ul><ul><ul><li>Movie </li></ul></ul></ul><ul><ul><ul><li>snapshot </li></ul></ul></ul><ul><li>Developer (agent) </li></ul><ul><li>Project </li></ul><ul><li>Edge </li></ul>
    55. 55. VALIDATION CONCLUSION <ul><li>Same results for both simulations </li></ul><ul><li>Better performance of Repast </li></ul><ul><li>Better display provided by Repast </li></ul><ul><ul><li>Network display </li></ul></ul>Random Layout Circular Layout
    56. 56. PROJECT PAGE
    57. 57. FORUM PAGE