Proposal for a Thesis in the Field of Information Technology
In Partial Fulfillment of the Requirements for a Master of Liberal Arts Degree
Harvard University Extension School
10/18/2004

Clifford Lyon
53 West Emerson Street
Melrose, MA 02176-3109
(617) 225-3293
(781) 663-7703

Proposed Start Date: 10/4/2004
Anticipated Date of Graduation: 6/2005
Thesis Directors: Sergei Makar-Limanov and Bhiksha Raj

1 Tentative Thesis Title: Visualization of High-Dimensional Clickstream Data Using Java

Keywords: Clustering, Unsupervised Learning, Critic, Search, Data Visualization, Java 2D/3D, Clickstream, Data Mining, Machine Learning

2 Abstract

Unsupervised learning holds out a promise for the discovery of objectively valid disaggregate patterns within large clickstream data stores. Using an interactive data visualization interface and clustering algorithms, the software designed and delivered by this project will allow the exploration of clickstream data in a subjectively meaningful way.

3 Thesis Project Description

3.1 Background

Clickstream data accumulated by a commercial website offers site managers the potential for objective insight into their audience unparalleled in other publishing media. Unlike their print, television, and radio counterparts, web publishers have access to a detailed record of events generated by their visitors. Each time a visitor requests a URL, a webserver records the request and some information about the visitor’s browser in a log file.

However, the potential for insight remains in large part unrealized for the commercial Internet despite the availability of this detailed behavior record, well-established machine learning algorithms, exponential growth in processing power, and decreased memory and storage cost. This is in contrast to the evident success of personalization and targeting efforts by sites that estimate the posterior probability of user decisions from previous behavior in order to present contextually relevant recommendations. Content automation is certainly one positive outcome of modeling behavior using clickstream data. However, it is fundamentally an application of knowledge at the transaction level, not at the enterprise level.
In contrast, this project seeks to recognize patterns in web data at a high level, and to build an interface capable of presenting these patterns to a non-technical (business) user in a meaningful way.

3.1.1 Challenge of Clickstream Data

A key reason for the lack of progress in the application of standard machine learning algorithms to clickstream data lies in the nature of the data itself. In recent years, academics in market research and applied economics have started building behavioral models using clickstream data. The initial papers are interesting and encouraging. However, the shape of the data presents a fundamental challenge. Unlike typical market research surveys and polls, web data is vast, noisy, and censored. For example, the website contributing data for this project records more than one billion events each month.

The interesting events are those generated by real people interacting with the website using a web browser. Software robots making requests for content generate noise in the system. For very different reasons, these robots traverse a website using the same protocol and transactional processes as real people. For example, a robot might gather
information for use in a search index, cache pages for a proxy server, or artificially increase popularity for a particular product featured on a site by repeatedly requesting information. The noise is not easily separable from interesting events. This is partly because it is easy for a robot to do everything a real person would do, and partly because a person who configures his or her browser to interact as minimally as possible with the webserver may appear to be a robot. Moreover, a proxy server will appear as a single user, but in reality may convey requests for thousands of users.

Typically, a time series known as a session stores the sequence of events generated by a user during a site visit. Analyzing time series data can help separate robot-generated events from real traffic. However, sessions are censored in the sense that there is no event signifying completion. In particular, because the start of an event marks the end of the previous event in the series, the final event has an unknown duration. These factors make it difficult to model user-website interaction using raw clickstream data.

Recent efforts to apply market research techniques to clickstream data have generally used a regression model to expose some sort of interesting user behavior pattern. Typically, websites have extensive reporting at an aggregate level, but little real insight into user segments or the differences between these segments. A recurring theme in recent research is the need to model behavior in a disaggregate way to account for and expose behavior away from the mean. In particular, Bucklin and Sismeiro (2003) suggest that accounting for user heterogeneity is of critical importance, and that using aggregate metrics can potentially lead to the wrong conclusions.

3.2 Approach

We propose inverting the general approach to the user behavior problem: rather than build vectors of user or visit behavior, we will build vectors of pages, with features derived from user or visit behavior. Metric design will account for user heterogeneity by incorporating aggregate metrics from user dimensions as features. For example, the duration of time spent on the page might vary depending on the time of day or the position of the page in the session. Representing duration at a disaggregate level ensures the preservation of variance that allows users, and so pages, to be successfully partitioned.

The utility of an inverted approach is two-fold. First, we express results in terms of website entities, which are under a site manager’s control. A low repeat-visit rate, while important information for a site manager to know, suggests no direct action. On the other hand, a site manager who learns that certain categories of pages are less likely to generate repeat visits has a clear area on which to focus efforts to improve the site. Second, using the data produced by our learning exercise as extra input for an existing user-based model may improve its predictive power. By modeling pages using unsupervised learning first, we remove the bias of categorical features established by the top-down human design of the site. When looking at behavior on a website, it is important to distinguish the hierarchical site structure from the behavior on it as much as possible. Remodeling the business-driven categorical entity as a behavioral class structure can help create better user models. A stretch goal for the project is to use the page classification in a user behavior model to demonstrate the value of clustering as a way of segmenting data behaviorally and supporting heterogeneity.

The software delivered by this project will cluster pages on the website according to their natural order in the data. Clustering groups similar pages together.
For example, pages visited on the weekend by a young audience might fall into one cluster, and pages visited at the start and end of the workday by an older group might fall into a second cluster. The pages within each group or cluster are more similar to each other than to pages in other groups. Thus, the ordering emerges from the data itself, rather than from an external agent. In this sense, the order is “natural.” This 3
approach is termed “unsupervised learning” because there is no known target class for the input data; the model is fit to the features of the input data. The clustering algorithm assigns each page to a class based on features derived from user interaction on that page.

A user interface (GUI) will visualize the clusters. We hope that the framework can make the often opaque results of unsupervised learning subjectively meaningful for the site manager, that is, someone who understands the problem domain well, but not the specifics of the machine learning process. Unless the results have subjective meaning to the user of the software, they will not be useful.

The software framework should be generalizable. Although the data set for our investigation is specific, the application should perform reasonably well on other data sets. Application testing includes scenarios using some of the common public domain machine learning data sets, such as the iris data set (UCI Machine Learning Repository Content Summary).

In order to present data to the user, the GUI will project high-dimensional clustered page vectors in two or three dimensions. There are established methods for achieving such a projection, such as projecting onto the eigenvectors associated with the two or three largest eigenvalues of the data covariance. Generally, the idea is to eliminate or merge features in a way that minimizes the introduction of error into the system as information is lost.

The user interface will allow the user to search among the clusters for items of interest. The user will act as a critic by using the search function to establish subjective validity of a given set of clusters, and by suggesting (weighting) a direction for more useful results. While this technique has some precedent in machine learning literature (Duda, R., Hart, P., & Stork, D. 2001, p. 565), we are unaware of specific applications that use search as a tool for cluster exploration.
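
As a rough illustration of the projection idea described above, the following Java sketch centers the data, builds a covariance matrix, and finds the leading eigenvectors by power iteration with deflation. It is only a sketch under stated assumptions: the class and method names (PcaProjectionSketch, project, leadingEigenvector) are hypothetical, it uses no Weka classes, and it is not the filter the project will deliver.

```java
class PcaProjectionSketch {

    /** Projects each row of data onto its first k principal components. */
    static double[][] project(double[][] data, int k) {
        int n = data.length, d = data[0].length;

        // Center every feature on its mean.
        double[][] x = new double[n][d];
        for (int j = 0; j < d; j++) {
            double mean = 0.0;
            for (int i = 0; i < n; i++) mean += data[i][j];
            mean /= n;
            for (int i = 0; i < n; i++) x[i][j] = data[i][j] - mean;
        }

        // Sample covariance matrix (d x d).
        double[][] cov = new double[d][d];
        for (int a = 0; a < d; a++) {
            for (int b = 0; b < d; b++) {
                double s = 0.0;
                for (int i = 0; i < n; i++) s += x[i][a] * x[i][b];
                cov[a][b] = s / Math.max(1, n - 1);
            }
        }

        double[][] out = new double[n][k];
        for (int c = 0; c < k; c++) {
            double[] v = leadingEigenvector(cov);
            // Coordinate of each instance along this component.
            for (int i = 0; i < n; i++) out[i][c] = dot(x[i], v);
            // Deflation: subtract the component just found so the next
            // pass converges to the next-largest eigenvalue.
            double lambda = dot(v, multiply(cov, v));
            for (int a = 0; a < d; a++)
                for (int b = 0; b < d; b++) cov[a][b] -= lambda * v[a] * v[b];
        }
        return out;
    }

    /** Power iteration; returns a zero vector when no variance is left. */
    static double[] leadingEigenvector(double[][] m) {
        int d = m.length;
        double[] v = new double[d];
        // Fixed, non-uniform start vector (an arbitrary choice for this sketch).
        for (int i = 0; i < d; i++) v[i] = 1.0 / (i + 1);
        for (int iter = 0; iter < 1000; iter++) {
            double[] w = multiply(m, v);
            double norm = Math.sqrt(dot(w, w));
            if (norm < 1e-12) return new double[d];
            for (int i = 0; i < d; i++) v[i] = w[i] / norm;
        }
        return v;
    }

    static double[] multiply(double[][] m, double[] v) {
        double[] r = new double[m.length];
        for (int i = 0; i < m.length; i++) r[i] = dot(m[i], v);
        return r;
    }

    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    public static void main(String[] args) {
        double[][] pages = {{0, 0}, {1, 2}, {2, 4}, {3, 6}};
        for (double[] row : project(pages, 1)) System.out.println(row[0]);
    }
}
```

The sign of each component is arbitrary, so only relative positions in the projected space carry meaning. A production filter would more likely rely on a tested linear algebra routine than on hand-written power iteration.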

Although one readily finds examples of Java data visualization programs on the internet, we found none that offered the feedback mechanism proposed here. An application notable for its approach to dimensionality reduction is the two-dimensional cluster-visualization program produced by IBM’s Alphaworks program. The Alphaworks program places cluster exemplars at the origin and extents of the x- and y-axes, and plots instances based on similarity. The program translates similarity into Euclidean distance on the plane. The x- and y-axes have no units. The exemplars at the origin and extents of the axes triangulate the placement of clustered items in the two-dimensional space: the software places items on the plane based on similarity to the three exemplars. This method has a few nice properties: it is fast, it does not require a lot of extra computation, and it is visually meaningful and intuitive. Exploring this method in three dimensions would be an interesting exercise. It might provide a parsimonious way to scale the cluster space to a low-dimensional representation.

A search for similar or related material uncovered no other papers on the topic of clustering web pages for data visualization.

3.3 Data Description

The clustering algorithm will use data from a well-known shopping services, advice, and news website. The site records approximately 70 million page events each day. Each time a user loads a page, a tracking image is loaded, and the resulting log line in the server log represents a single page event. Each page event has clickstream attributes from the webserver log line and attributes derived from the site delivery application and site meta-data. Additionally, links into and out of the site are tracked using an HTTP redirect. These redirects share the same attributes as the page events.

The following table represents a sample of data available at the atomic level from the database:

Field Name              Description
SESSION_ID              Unique identifier for the session in which the page event occurred. (A session is continuous activity with gaps of no more than 30 minutes.)
EVENT_SEQ_NUM           The sequence number of the event within the session
REFERRING_HOST          If the data is from an external site, the hostname of the external site
NETWORK_IP              Foreign key to third party demographic data based on IP address. Provides Country, US State, DMA, Line speed.
IP_ADDRESS              Client IP Address
USER_AGENT              The user agent of the browser performing the page request
EDITION                 The “branding” of the page
PAGE_TYPE               Identifies the template used to serve the page by the content application
PAGE_DURATION           Amount of time spent on the page
TIME_SINCE_SESS_START   Time elapsed since the first event of the session
IS_REG_USER             Whether the client was a registered user
IS_NEW_USER             Whether the client has been to the site before (cookie based)
IS_COOKIED_USER         Whether the client allows cookies
PAGE_SEQ_NUM            The sequence number of the page within the session (in contrast to the EVENT_SEQ_NUM, which includes redirects)
IS_LAST_PAGE            Whether the event was the last page
TIMESTAMP               The date and time of the page request
ANONYMOUS_ID            ID based on website cookie
SITE_ID                 The site number of the event (40 total sites)
ONTOLOGY_NODE_ID        The location of the page in the site navigational hierarchy
IS_IAB_ROBOT            Whether the user agent is a known robot
IS_BEHAVIORAL_ROBOT     Whether the user agent behaves like a robot
SEARCH_PHRASE           The search phrase that the user types, if any. (Includes third party sites like Google.)
REGISTRATION_ID         The id of the registered user, if any

Table 1 Sample Data Fields

These fields are the raw material that will form the aggregate page vectors. Four entities uniquely identify a page on the website: site, page type, ontology, and asset.
“Site” is a business dimension that groups content together at a high level. The dataset contains tens of sites. “Page type” is an application dimension identifying the template used to render the content. The data contains thousands of page types. An “Ontology” node is a navigational dimension describing the area on the site where the page lives, for example a “door” or a “story” page. The data contains thousands of ontology nodes. “Asset” refers to a particular piece of content or a product featured on a page. There are tens of thousands of assets active each day, and millions historically. We have intentionally left Asset out of the page vector key, as including it would produce far too many instances to be useful for clustering. We expect between one and ten thousand vectors for clustering, depending on the choice of sites.

The page vector will have the following structure: SITE_ID, PAGE_TYPE, ONTOLOGY_NODE, derived attributes 1..n. The derived attributes will be behavioral in nature, and computed from activity over some period: 30 or 60 days, for example. Initially, we are considering the following attributes for each page:
• Count total page views
• Count 1 page sessions (this page was the only page)
• Count 2-5 page sessions (this page occurring in)
• Count 5-10 page sessions
• Count 10+ page sessions
• Count registered user visits
• Count anonymous user visits
• Repeat visitor rate
• Average hits/day
• Average hits/weekday
• Average hits/weekend
• Average hits by hour of day, flattened
• Count session starts
• Count session stops
• Count leads (occurring in sessions with this page in it; leads are redirects to a partner site)
• Total time spent on the page
• Average duration
• Average duration, weekday
• Average duration, weekend
• Average duration, by hours 1-24, by Time Zone
• Average “place in session”: where this event occurs, as a fraction of all events occurring
• Anonymous and Registered user visits, for each world country (flattened)
• Anonymous and Registered user visits, for each US State

This is a starting point. As mentioned, asset is not part of the page vector, but we could preserve attributes of the asset to qualify the metrics. For example, rather than using the product entity
itself to identify a page, we could use the product category. We expect that finding an identifiable set of attributes for the page vector will require some exploration.

3.4 Data Flow

At a logical level, there are two major application components and two sources of input into the system. The inputs are the data and the user feedback, and the components are the machine learning and the data visualization components. Figure 1 illustrates the flow of external data into and through the system:

[Figure 1 Data Flow Diagram. Labels: External Data; Machine Learning (Import Data, Assign Instances, Clusters, Scale for Presentation); Data Visualization (Present Data, Process User Input); User]

• External Data flows into the system as a Weka dataset, a flat set of vectors containing page information.
• The machine-learning component applies the clustering algorithm to the vectors and thereby classifies each instance.
• The machine-learning component projects the clusters in two or three dimensions for presentation in a user interface.
• The data visualization component processes user feedback after the presentation.
• Depending on the feedback, the data visualization component re-presents the data, or re-classifies and then re-presents the data.

3.5 Architecture

The diagram in Figure 2 shows three physical architectural components. The shaded elements represent elements that do not exist today; the unshaded elements represent third-party software or data sources. The following subsections describe each of the three components.

[Figure 2 System Diagram. Labels: Pre-Processing (ClickStream DataStore, Extraction Script, ARFF file); Unsupervised Learning (Cluster Engine (Driver), Weka.core.Instances, Weka.classifiers.Evaluation, Existing Clusterer, Weka.classifiers.Clusterers, New Clusterer, Multi-Dimensional Scaling Filter); Visualization/Interaction (Java GUI, Weka.core.Instances, 2D/3D Projection Window, Control Panel)]

3.5.1 Pre-processing

The preprocessing step prepares data for use. The extraction script reads data from a database and writes it to Weka's own data format (ARFF). The Java application reads the formatted data from disk into memory. We will aggregate the atomic events in the database to the page-level record described above. We will filter out robot traffic as much as possible, using already available flags in the database. Because the final event of a session has an unknown duration, the aggregation will discount the last event of the session for mean duration calculations.
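
The pre-processing rules above can be sketched in Java as follows. This is a minimal illustration, not the project's extraction script: the PageEvent and PageVector classes, the "site|pageType|ontologyNode" key string, the duration unit, and the three ARFF attributes are all assumptions made for the example. Field names loosely follow Table 1.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PageVectorSketch {

    /** One atomic page event; types and units are assumed. */
    static final class PageEvent {
        final int siteId;
        final String pageType;
        final int ontologyNodeId;
        final double pageDuration;  // seconds (assumed unit)
        final boolean isLastPage;
        final boolean isRobot;      // IS_IAB_ROBOT or IS_BEHAVIORAL_ROBOT

        PageEvent(int siteId, String pageType, int ontologyNodeId,
                  double pageDuration, boolean isLastPage, boolean isRobot) {
            this.siteId = siteId;
            this.pageType = pageType;
            this.ontologyNodeId = ontologyNodeId;
            this.pageDuration = pageDuration;
            this.isLastPage = isLastPage;
            this.isRobot = isRobot;
        }
    }

    /** Running totals for one (site, page type, ontology node) key. */
    static final class PageVector {
        long pageViews;
        long timedViews;       // views that are not the last page of a session
        double totalDuration;  // summed over timed views only

        double averageDuration() {
            return timedViews == 0 ? 0.0 : totalDuration / timedViews;
        }
    }

    /** Aggregates events to page level, skipping robot traffic. */
    static Map<String, PageVector> aggregate(List<PageEvent> events) {
        Map<String, PageVector> vectors = new LinkedHashMap<>();
        for (PageEvent e : events) {
            if (e.isRobot) continue;  // filter flagged robot traffic
            String key = e.siteId + "|" + e.pageType + "|" + e.ontologyNodeId;
            PageVector v = vectors.computeIfAbsent(key, k -> new PageVector());
            v.pageViews++;
            if (!e.isLastPage) {      // last event has no known duration
                v.timedViews++;
                v.totalDuration += e.pageDuration;
            }
        }
        return vectors;
    }

    /** Writes the vectors as ARFF text (keys containing quotes are not escaped). */
    static String toArff(Map<String, PageVector> vectors) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation page_vectors\n\n");
        sb.append("@attribute page_key string\n");
        sb.append("@attribute page_views numeric\n");
        sb.append("@attribute average_duration numeric\n\n");
        sb.append("@data\n");
        for (Map.Entry<String, PageVector> e : vectors.entrySet()) {
            PageVector v = e.getValue();
            sb.append('\'').append(e.getKey()).append("',")
              .append(v.pageViews).append(',')
              .append(v.averageDuration()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<PageEvent> events = new ArrayList<>();
        events.add(new PageEvent(1, "story", 7, 30.0, false, false));
        events.add(new PageEvent(1, "story", 7, 10.0, false, false));
        events.add(new PageEvent(1, "story", 7, 0.0, true, false));
        events.add(new PageEvent(1, "story", 7, 5.0, false, true));
        System.out.print(toArff(aggregate(events)));
    }
}
```

In practice the aggregation would probably run inside the data warehouse as SQL, with only the final flat file written to disk; the Java form here is just to make the rules concrete.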

Clickstream Data Store

The data store exists today in the form of a large data warehouse for an anonymous web-publishing company. They have agreed to share data for the project on condition that we obfuscate user or commercially identifiable information. There will be some data transformation within the data store: a script will aggregate the individual events with the page as the key, forming the page vector described in the data description section.

Extraction Module

The extraction script pulls data from the data store. This module is a placeholder for the process that creates a flat file in Weka format from the database. It may be a series of scripts, or it may be a set of actions undertaken to spool query results to disk manually and add a header. Any code developed will be handed in for inspection, but it should be understood that evidence of completion is the Weka data file rather than the code that created it. Anyone seeking to recreate this experiment on his or her own would have to code this module by hand; the rest would flow from there. As such, the extraction module is formally outside the bounds of the project.

Attribute-Relation File Format (ARFF) file

The Attribute-Relation File Format (ARFF) file constitutes the boundary of the application. A Java program using the Weka class libraries can easily read the file into an in-memory representation for machine learning by the various Weka modules. The Weka data file format is described in detail online.

3.5.2 Unsupervised Learning

The unsupervised learning component performs the actual clustering. It clusters instances read from the pre-processing component and refines or redistributes clusters based on feedback from the user through the GUI component.

Cluster Engine

The cluster engine is a driver that uses the Weka data-mining framework to read the ARFF file into memory and exercise a clustering algorithm on that data.
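
To make the clustering step concrete, here is a small stand-alone K-means sketch in plain Java. It is a stand-in for illustration only: the project intends to use the clusterers that ship with Weka, and the class name KMeansSketch, the choice of the first k rows as starting centroids, and the iteration cap are assumptions of this example.

```java
import java.util.Arrays;

class KMeansSketch {

    /** Returns a cluster label in [0, k) for every row of data. */
    static int[] cluster(double[][] data, int k, int maxIterations) {
        int n = data.length, d = data[0].length;

        // Assumed initialization: the first k instances seed the centroids.
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = data[c].clone();

        int[] labels = new int[n];
        Arrays.fill(labels, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each instance to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(data[i], centroids[c])
                            < squaredDistance(data[i], centroids[best])) {
                        best = c;
                    }
                }
                if (labels[i] != best) {
                    labels[i] = best;
                    changed = true;
                }
            }
            if (!changed) break;  // converged

            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[labels[i]]++;
                for (int j = 0; j < d; j++) sums[labels[i]][j] += data[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue;  // keep an empty cluster's centroid
                for (int j = 0; j < d; j++) centroids[c][j] = sums[c][j] / counts[c];
            }
        }
        return labels;
    }

    static double squaredDistance(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            s += diff * diff;
        }
        return s;
    }

    public static void main(String[] args) {
        double[][] pages = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        System.out.println(Arrays.toString(cluster(pages, 2, 100)));
    }
}
```

Features on very different scales (for example, raw page-view counts next to average durations) would normally be standardized before a distance-based clusterer like this is applied.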
After performing the unsupervised learning algorithm, the cluster engine prepares data for presentation by creating low-dimensional projections of the instances. The program appends the location in the low-dimensional space to the existing attributes of the instance. At this point in the data path, the instance includes the original features, the cluster identifier and any related cluster metrics such as distance from the centroid, and the newly appended location in low-dimensional space.

Multi Dimensional Scaling (MDS) Filter

The task of this component is to take the n-dimensional feature vector from the input data and scale it to a projection suitable for presentation, i.e. either a two- or three-dimensional vector. The filter preserves the distance between the points in the original space as closely as possible by minimizing an error function. The MDS filter is completely independent of the clustering, and could be applied to the data on its own. This filter will do some sort of analysis like Principal
Components Analysis (PCA) on the data (Bishop, C. 1995, Appendix E). It may be that we can make use of the existing cluster information in a clever way as another alternative, in which case this filter could stay in the data path as a no-op.

New Cluster Strategies

The Weka framework allows for the easy introduction of new clustering strategies. More general classification techniques could make use of the generic classifier container as well. The Weka library offers several clustering choices out-of-the-box: Cobweb, Expectation Maximization, Farthest-first, and K-means (Witten, I. & Frank, E. 2000, pp. 210-227). This is a good start but by no means exhaustive. The application does not strictly require additional clustering strategies to function; as such, the new strategies are candidates for scope reduction.

3.5.3 Visualization/Interaction

The visualization/interaction component is the user interface to the clustered data. This component allows the user to view the data. It allows the user to transform or refine the clusters through a limited set of interactions.

Java GUI

The Graphical User Interface presents the user with a two- or three-dimensional projection of the source data, using color to represent the class membership established by clustering. Intuitively, the intensity of the color can represent the distance from the centroid or “fuzzy” class memberships. Additionally, the GUI presents the user with a set of controls that allow non-destructive and destructive data transformations. Non-destructive operations include standard graphical transformations such as rotation, pan, and zoom. A search capability allows a user to locate specific instances or groups of instances in the scatter plot. Such a capability may further refine the subjectivity of the results. A proposed destructive operation on the data allows the user to act as a critic by demonstrating what a more appropriate result might be by “forcing” their own bias into the model.
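
The proposal does not fix how the critic's bias enters the model. One possibility, shown here purely as an assumption, is to let the user's feedback become per-feature weights in the distance function, so that up-weighted features dominate cluster assignment. The class and method names (CriticWeightSketch, weightedDistance, nearestCentroid) are hypothetical.

```java
class CriticWeightSketch {

    /** Euclidean distance with a non-negative weight per feature. */
    static double weightedDistance(double[] a, double[] b, double[] weights) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += weights[i] * diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Index of the centroid closest to the instance under the given weights. */
    static int nearestCentroid(double[] instance, double[][] centroids, double[] weights) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (weightedDistance(instance, centroids[c], weights)
                    < weightedDistance(instance, centroids[best], weights)) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] page = {1.0, 3.0};
        double[][] centroids = {{0.0, 3.0}, {1.0, 0.0}};

        // With equal weights the page sits closest to the first centroid.
        System.out.println(nearestCentroid(page, centroids, new double[] {1.0, 1.0}));

        // A critic who stresses feature 0 and discounts feature 1 moves it.
        System.out.println(nearestCentroid(page, centroids, new double[] {100.0, 0.01}));
    }
}
```

Under this reading, a destructive operation would amount to changing the weights and re-running the clustering with the weighted distance.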
The system will re-cluster and re-present the modified instances. If happening interactively, this would likely happen on a random sample of the data.

Projection Window (GUI Component)

The projection window is a Java component that displays a colored scatterplot of two-dimensional or three-dimensional data. It should handle non-destructive transformations like rotation, pan, zoom, scale, color/grayscale toggling, selection, and indicating “interesting” instances.

Control Panel (GUI Component)

The control panel allows a user to interact with the GUI. It will provide a menu of custom controls fulfilling all the tasks defined for the interface.

4 Work Plan

At a high level, there are two phases to the project: building the tools, and using the tools to explore the data.

The approach to the initial phase is to work backwards through the data path and establish baseline functionality. The first component completed is the last in the data path, the Java GUI front-end. Pseudo data will functionally test the GUI. The clustering engine is scheduled next. Finally, the MDS module, which in effect bridges the two initial components, will be completed. The exception to the rule is the projection window. The work on the projection window will happen in two parts. The initial round of work will produce a basic visualization window using test data; a second round of development will extend the functionality for application-specific features.

The next phase involves experimentation. Once the first component in the data path brings in the data, the real project is underway. In addition to trials using various combinations of features, implementation of custom clustering strategies and incremental GUI changes happen during this phase.

Ideally, the tools would be complete by Christmas, and experimentation would start in January. The schedule below presents a slightly less aggressive view of the timeline. Code complete happens on 2/1/2005. Pulling in this date to 1/1/2005 is a stretch goal for the development cycle.

4.1 Assumptions, Risks and Alternatives

The original code written for the project will be in Java. The cluster engine will use the Weka open source data-mining framework (Weka 3 - Data Mining with Open Source Machine Learning Software in Java. 2004). The GUI will use Swing components (Geary, D. 1999). It might use the standard Java 3D extensions (Java 3D API. 2004). If it does use the AWT-based 3D API, the GUI will use heavyweight AWT components instead of their Swing counterparts (Geary, D. 1997). Other small scripts will extract and format data for import; these may be in SQL, Perl, or some other language. CVS versioning software will facilitate milestone releases. The eventual application will run client-side.
It might be packaged as a “Java Web Start” application.

Risks and Alternatives:

• Unachievable Schedule: the schedule as indicated below is aggressive.
  o Alternative: Use third party components in the GUI, especially for prototyping
  o Alternative: Use existing clustering software only
• Personal Schedule Conflict: we are expecting our second child on 4/15/2005.
  o Alternative: Enter into the program later. Re-negotiate graduation date.
  o Alternative: Build in extension.
• No signal in target data: the experiment could fail.
  o Alternative: Establish signal before undertaking the project using sample data
  o Alternative: Establish validity of negative outcome; success of tool

4.2 Preliminary Schedule

Figure 3 shows a high-level view of the schedule.

[Figure 3 High-level View of Schedule. Timeline from 10/4/2004 to 3/31/2005 with milestones: 10/4/2004 Start; 11/1/2004 GUI Complete; 11/22/2004 Clustering Dummy Data; 12/12/2004 Scaling; 12/20/2004 - 1/3/2005 Break; 1/17/2005 Full integ, Built-in Clustering; 1/31/2005 Code Complete; 2/28/2005 Exploration Ends; 3/31/2005 Work ends]

Table 2 shows a detailed view of the proposed schedule:

[Schedule grid. Rows: one per week, from 4-Oct through 28-Mar. Columns: Java GUI, Projection, Controller, Clustering Engine, Scaling/Projection, Extraction, New Clustering Algorithms. Cell legend: Design, Dev, Unit test, Integ Test, Explore]

Table 2 Detailed Schedule View

5 Glossary

Centroid
A pseudo exemplar serving as the statistical center of a given class.

Clustering
Clustering algorithms find groups of items that are similar. For example, clustering could be used by an insurance company to group customers according to income, age, types of policies purchased and prior claims experience. It divides a data set so that records with similar content are in the same group, and groups are as different as possible from each other. Since the categories are unspecified, this is sometimes referred to as unsupervised learning. (Two Crows: Data Mining Glossary. 2001).

Unsupervised Learning
As distinct from supervised learning, the classification of unlabeled data.

Data Mining
The process of automatically extracting valid, useful, previously unknown, and ultimately comprehensible information from large databases and using it to make crucial business decisions. “Torturing the data until they confess” (Hsu, W. 2001)

Weka
An open source Java project for machine learning and data mining.

Webserver
A software application for serving content to browsers on the world wide web.

6 References

6.1 Works Cited

The following is a list of references cited in the document.

Bishop, C. (1995). Neural Networks for Pattern Recognition. New York: Oxford University Press.

Bucklin, R., & Sismeiro, C. (2003). A Model of Website Browsing Behavior Estimated on Clickstream Data. Journal of Marketing Research, XL, 249-267. Retrieved August 15, 2004.

Duda, R., Hart, P., & Stork, D. (2001). Pattern Classification. New York: John Wiley & Sons.

Geary, D. (1997). Graphic Java 1.1: Mastering the AWT. New York: Prentice Hall.

Geary, D. (1999). Graphic Java Volume II: Swing. New York: Prentice Hall.

Hsu, W. (2001). Knowledge Discovery in Databases and Data Mining. Retrieved October 17, 2004.

Java 3D API. Retrieved October 17, 2004.

Two Crows: Data Mining Glossary. (2001). Retrieved October 17, 2004.

UCI Machine Learning Repository Content Summary. Retrieved October 16, 2004.

Weka 3 - Data Mining with Open Source Machine Learning Software in Java. (2004). Retrieved October 17, 2004.

Witten, I., & Frank, E. (2000). Data Mining. San Diego: Academic Press.

6.2 Works Consulted

The following is a list of works consulted while researching the topic.

Ansari, A., & Mela, C. (2003). E-Customization. [Electronic Version] Journal of Marketing Research, XL, 131-145. Retrieved August 10, 2004.

Moe, W., & Fader, P. (2002). Capturing Evolving Visit Behavior in Clickstream Data. [Electronic Version] Retrieved August 10, 2004, from http://www-

Moe, W., & Fader, P. (2003). Dynamic Purchase Behavior at e-Commerce Sites. [Electronic Version] Retrieved August 10, 2004, from: %200303.pdf

Montgomery, A., Li, S., Srinivasan, K., & Liechty, J. (2004). Modeling Online Browsing and Path Analysis Using Clickstream Data. [Electronic Version] Retrieved August 10, 2004.

6.3 Works To Be Consulted

The following is a list of works marked for future review.

Jain, A., Murty, M., & Flynn, P. (1999). Data Clustering: A Review. ACM Computing Surveys, 31(3). Retrieved August 15, 2004.

Leouski, A., & Swan, R. (1997). Interactive Cluster Visualization for Information Retrieval. Retrieved August 10, 2004.

Procopiuc, C., Jones, M., Agarwal, P., & Murali, T. (2002). A Monte Carlo Algorithm for Fast Projective Clustering. [Electronic Version] Presented at ACM SIGMOD 2002. Retrieved August 10, 2004.