Proposal for a
Thesis in the Field of
In Partial Fulfillment of the Requirements
For a Master of Liberal Arts Degree
53 West Emerson Street
Melrose, MA 02176-3109
Proposed Start Date: 10/4/2004
Anticipated Date of Graduation: 6/2005
Thesis Directors: Sergei Makar-Limanov and Bhiksha Raj
1 Tentative Thesis Title:
Visualization of High-Dimensional Clickstream Data Using Java
Keywords: Clustering, Unsupervised Learning, Critic, Search, Data Visualization, Java 2D/3D,
Clickstream, Data Mining, Machine Learning
Unsupervised learning holds out a promise for the discovery of objectively valid disaggregate
patterns within large clickstream data stores. Using an interactive data visualization interface and
clustering algorithms, the software designed and delivered by this project will allow the
exploration of clickstream data in a subjectively meaningful way.
3 Thesis Project Description
Clickstream data accumulated by a commercial website offers site managers the potential for
objective insight into their audience unparalleled in other publishing media. Unlike their print,
television, and radio counterparts, web publishers have access to a detailed record of events
generated by their visitors. Each time a visitor requests a URL, a webserver records the request
and some information about the visitor’s browser in a log file. However, the potential for insight
remains in large part unrealized for the commercial Internet despite the availability of this
detailed behavior record, well-established machine learning algorithms, exponential growth in
processing power, and decreased memory and storage cost. This is in contrast to the evident
success of personalization and targeting efforts by sites such as Amazon.com and Netflix.com
that estimate the posterior probability of user decisions from previous behavior to present
contextually relevant recommendations. Content automation is certainly one positive outcome of
modeling behavior using clickstream data. However, it is fundamentally an application of
knowledge at the transaction level, not at the enterprise level. In contrast, this project seeks to
recognize patterns in web data at a high level, and to build an interface capable of presenting
these patterns to a non-technical (business) user in a meaningful way.
3.1.1 Challenge of Clickstream Data
A key reason for the lack of progress in the application of standard machine learning algorithms
to clickstream data lies in the nature of the data itself. In recent years, academics in market
research and applied economics have started building behavioral models using clickstream data.
The initial papers are interesting and encouraging. However, the shape of the data presents a
fundamental challenge. Unlike typical market research surveys and polls, web data is vast, noisy,
and censored. For example, the website contributing data for this project records more than one
billion events each month. The interesting events are those generated by real people interacting
with the website using a web browser. Software robots making requests for content generate
noise in the system. For very different reasons, these robots traverse a website using the same
protocol and transactional processes as real people. For example, a robot might gather
information for use in a search index, cache pages for a proxy server, or artificially increase
popularity for a particular product featured on a site by repeatedly requesting information. The
noise is not easily separable from interesting events. This is partly because it is easy for a robot
to do everything a real person would do, and partly because a person who configures his or her
browser to interact as minimally as possible with the webserver may appear to be a robot.
Moreover, a proxy server will appear as a single user, but in reality may convey requests for
thousands of users. Typically, a time series known as a session stores the sequence of events
generated by a user during a site visit. Analyzing time series data can help separate robot-
generated events from real traffic. However, sessions are censored in the sense that there is no
event signifying completion. In particular, because the start of an event marks the end of the
previous event in the series, the final event has an unknown duration. These factors make it
difficult to model user-website interaction using raw clickstream data.
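The censoring problem can be made concrete with a minimal sketch. It assumes a session is available as an ascending array of request timestamps in milliseconds; the class and method names are hypothetical and not part of any existing code.

```java
import java.util.OptionalLong;

/** Hypothetical helper: derives per-event durations from one session's timestamps. */
final class SessionDurations {
    /**
     * @param timestampsMillis request times of one session, in ascending order (assumed input)
     * @return one duration per event; the last is empty because no later event bounds it
     */
    static OptionalLong[] durations(long[] timestampsMillis) {
        OptionalLong[] result = new OptionalLong[timestampsMillis.length];
        for (int i = 0; i < timestampsMillis.length; i++) {
            boolean isLast = (i == timestampsMillis.length - 1);
            // The start of the next event marks the end of this one.
            result[i] = isLast
                    ? OptionalLong.empty()
                    : OptionalLong.of(timestampsMillis[i + 1] - timestampsMillis[i]);
        }
        return result;
    }
}
```

Returning an empty value for the final event, rather than zero, keeps the unknown duration from silently biasing any later average.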
Recent efforts to apply market research techniques to clickstream data have generally used a
regression model to expose some sort of interesting user behavior pattern. Typically, websites
have extensive reporting at an aggregate level, but little real insight into user segments or the
differences between these segments. A recurring theme in recent research is the need to model
behavior in a disaggregate way to account for and expose behavior away from the mean. In
particular, Bucklin and Sismeiro (2003) suggest that accounting for user heterogeneity is of
critical importance, and that using aggregate metrics can potentially lead to the wrong
We propose inverting the general approach to the user behavior problem: rather than build
vectors of user or visit behavior, we will build vectors of pages, with features derived from user
or visit behavior. Metric design will account for user heterogeneity by incorporating aggregate
metrics from user dimensions as features. For example, the duration of time spent on the page
might vary depending on the time of day or the position of the page in the session. Representing
duration at a disaggregate level ensures the preservation of variance that allows users, and so
pages, to be successfully partitioned. The utility of an inverted approach is two-fold: first, we
express results in terms of website entities, which are under a site manager’s control. A low
repeat-visit rate, while important information for a site manager to know, suggests no direct
action. On the other hand, a site manager who learns that certain categories of pages are less
likely to generate repeat visits has a clear area on which to focus efforts to improve the site.
Second, using the data produced by our learning exercise as extra input for an existing user based
model may improve its predictive power. By modeling pages using unsupervised learning first,
we remove the bias of categorical features established by the top-down human design of the site.
When looking at behavior on a website, it is important to distinguish the hierarchical site
structure from the behavior on it as much as possible. Remodeling the business-driven
categorical entity as a behavioral class structure can help create better user models. A stretch
goal for the project is to use the page classification in a user behavior model to demonstrate the
value of clustering as a way of segmenting data behaviorally and supporting heterogeneity.
The software delivered by this project will cluster pages on the website according to their natural
order in the data. Clustering groups similar pages together. For example, pages visited on the
weekend by a young audience might fall into one cluster, and pages visited at the start and end of
the workday by an older group might fall into a second cluster. The pages within each group or
cluster are more similar to each other than to pages in other groups. Thus, the ordering emerges
from the data itself, rather than from an external agent. In this sense, the order is “natural.” This
approach is termed “unsupervised learning” because there is no known target class for the input
data; the model is fit to the features of the input data. The clustering algorithm assigns each page
to a class based on features derived from user interaction on that page. A graphical user interface (GUI)
will visualize the clusters. We hope that the framework can make the often opaque results of
unsupervised learning subjectively meaningful for the site manager, that is, someone who
understands the problem domain well, but not the specifics of the machine learning process.
Unless the results have subjective meaning to the user of the software, the results will not be
useful. The software framework should be generalizable. Although the data set for our
investigation is specific, the application should perform reasonably well on other data sets.
Application testing includes scenarios using some of the common public domain machine
learning data sets, such as the iris data set (UCI Machine Learning Repository Content Summary).
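To illustrate what assigning each page to a class means, the following is a minimal standalone sketch of k-means clustering (Lloyd's algorithm). It is not the Weka implementation the project intends to use; the class name, the naive seeding with the first k rows, and the iteration cap are all assumptions made for illustration.

```java
import java.util.Arrays;

/** Hypothetical standalone k-means sketch; the project itself plans to rely on Weka. */
final class KMeansSketch {
    /** Returns the cluster index assigned to each row of data (requires k <= data.length). */
    static int[] cluster(double[][] data, int k, int maxIterations) {
        int n = data.length;
        int dims = data[0].length;
        double[][] centroids = new double[k][];
        // Assumption: seed with the first k rows so the result is deterministic.
        for (int c = 0; c < k; c++) centroids[c] = data[c].clone();
        int[] assignment = new int[n];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: attach each instance to its nearest centroid.
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = squaredDistance(data[i], centroids[c]);
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            if (!changed) break; // converged
            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < dims; j++) sums[assignment[i]][j] += data[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // leave an empty cluster's centroid in place
                for (int j = 0; j < dims; j++) centroids[c][j] = sums[c][j] / counts[c];
            }
        }
        return assignment;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) sum += (a[j] - b[j]) * (a[j] - b[j]);
        return sum;
    }
}
```

Because no target class is supplied, the only inputs are the feature vectors and the number of clusters; the grouping emerges from distances in feature space.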
In order to present data to the user, the GUI will project high-dimensional clustered page vectors
in two or three dimensions. There are established methods for achieving such a projection, such
as projecting onto the eigenvectors associated with the two or three largest eigenvalues of the
covariance matrix of the data. Generally, the idea is to eliminate or merge features in
a way that minimizes the introduction of error into the system as information is lost. The user
interface will allow the user to search among the clusters for items of interest. The user will act
as a critic by using the search function to establish subjective validity of a given set of clusters,
and by suggesting (weighting) a direction for more useful results. While this technique has some
precedent in machine learning literature (Duda, R, Hart, P. & Stork, D. 2001, p. 565), we are
unaware of specific applications that use search as a tool for cluster exploration.
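One simple way search could help establish subjective validity is to show how the hits for a query distribute across clusters: if pages the user considers related land mostly in one cluster, the clustering agrees with the user's intuition. The sketch below assumes each page has a text label and a cluster identifier; the class name, the label array, and the substring matching rule are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: summarizes where search hits fall among the clusters. */
final class ClusterSearch {
    /** Counts, per cluster id, the page labels containing the query (case-insensitive). */
    static Map<Integer, Integer> hitsPerCluster(String[] labels, int[] clusterIds, String query) {
        Map<Integer, Integer> hits = new TreeMap<>();
        String needle = query.toLowerCase();
        for (int i = 0; i < labels.length; i++) {
            if (labels[i].toLowerCase().contains(needle)) {
                hits.merge(clusterIds[i], 1, Integer::sum);
            }
        }
        return hits;
    }
}
```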
Although one readily finds examples of Java data visualization programs on the internet, we
found none that offered the feedback mechanism proposed here. An application notable for its
approach to dimensionality reduction is the two-dimensional cluster-visualization program
produced by IBM’s Alphaworks program, which can be found at
The Alphaworks program places cluster exemplars at the origin and extents of the x- and y- axes,
and plots instances based on similarity. The program translates similarity into Euclidean distance
on the plane. The x- and y-axes have no units. The exemplars at the origin and extents of the
axes triangulate the placement of clustered items in the two-dimensional space: the software
places items on the plane based on similarity to the three exemplars. This method has a few nice
properties: it is fast, it does not require a lot of extra computation, and it is visually meaningful
and intuitive. Exploring this method in three dimensions would be an interesting exercise. It
might provide a parsimonious way to scale the cluster space to a low dimensional representation.
A search for similar or related material uncovered no other papers on the topic of clustering web
pages for data visualization.
3.3 Data Description
The clustering algorithm will use data from a well-known shopping services, advice, and news
website. The site records approximately 70 million page events each day. Each time a user loads
a page, a tracking image is loaded, and the resulting log line in the server log represents a single
page event. Each page event has clickstream attributes from the webserver logline and attributes
derived from the site delivery application and site meta-data. Additionally, links into and out of
the site are tracked using an HTTP redirect. These redirects share the same attributes as the page
The following table represents a sample of data available at the atomic level from the database:
Field Name Description
Unique identifier for the session in which the page event occurred. (A
session is continuous activity with gaps of no more than 30 minutes.)
EVENT_SEQ_NUM The sequence number of the event within the session
REFERRING_HOST If the data is from an external site, the hostname of the external site
Foreign key to third party demographic data based on IP address.
Provides Country, US State, DMA, Line speed.
IP_ADDRESS Client IP Address
USER_AGENT The user agent of the browser performing the page request
EDITION The “branding” of the page.
PAGE_TYPE Identifies the template used to serve the page by the content application
PAGE_DURATION Amount of time spent on the page
TIME_SINCE_SESS_START Time elapsed since the first event of the session
IS_REG_USER Whether the client was a registered user
IS_NEW_USER Whether the client has been to the site before (cookie based)
IS_COOKIED_USER Whether the client allows cookies
The sequence number of the page within the session (in contrast to the
EVENT_SEQ_NUM, which includes redirects)
IS_LAST_PAGE Whether the event was the last page
TIMESTAMP The date and time of the page request
ANONYMOUS_ID ID based on website cookie
SITE_ID The site number of the event (40 total sites)
ONTOLOGY_NODE_ID The location of the page in the site navigational hierarchy
IS_IAB_ROBOT Whether the user agent is a known robot
IS_BEHAVIORAL_ROBOT Whether the user agent behaves like a robot
The search phrase that the user types, if any. (Includes third party sites
REGISTRATION_ID The id of the registered user, if any.
Table 1 Sample Data Fields
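The session definition in Table 1 (continuous activity with gaps of no more than 30 minutes) can be sketched as follows. The sketch assumes one visitor's events are available as ascending timestamps in milliseconds; the class and method names are hypothetical, and the production data store may sessionize differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits one visitor's events into sessions by inactivity gap. */
final class Sessionizer {
    /** Largest gap allowed inside a session: 30 minutes, per the session definition. */
    static final long MAX_GAP_MILLIS = 30L * 60L * 1000L;

    /** Splits ascending timestamps into sessions; a gap over 30 minutes starts a new one. */
    static List<List<Long>> split(long[] sortedTimestampsMillis) {
        List<List<Long>> sessions = new ArrayList<>();
        List<Long> current = null;
        long previous = 0L;
        for (long t : sortedTimestampsMillis) {
            if (current == null || t - previous > MAX_GAP_MILLIS) {
                current = new ArrayList<>();
                sessions.add(current);
            }
            current.add(t);
            previous = t;
        }
        return sessions;
    }
}
```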
These fields are the raw material that will form the aggregate page vectors. Four entities uniquely
identify a page on the website: site, page type, ontology, and asset. “Site” is a business
dimension that groups content together at a high-level. The dataset contains tens of sites. “Page
type” is an application dimension identifying the template used to render the content. The data
contains thousands of page types. An “Ontology” node is a navigational dimension describing
the area on the site where the page lives – for example, a “door”, or a “story” page. The data
contains thousands of ontology nodes. “Asset” refers to a particular piece of content or a product
featured on a page. There are tens of thousands of assets active each day, and millions
historically. We have intentionally left Asset out of the page vector key, as this would produce
far too many instances to be useful for clustering. We expect between one and ten thousand
vectors for clustering, depending on the choice of sites.
The page vector will have the following structure:
SITE_ID, PAGE_TYPE, ONTOLOGY_NODE, derived attributes 1..n.
The derived attributes will be behavioral in nature, and computed from activity over some period:
30 or 60 days, for example. Initially, we are considering the following attributes for each page:
• Count total page views
• Count 1 page sessions (this page was the only page)
• Count 2-5 page sessions (in which this page occurs)
• Count 5-10 page sessions
• Count 10+ page sessions
• Count registered user visits
• Count anonymous user visits
• Repeat visitor rate
• Average hits/day
• Average hits/weekday
• Average hits/weekend
• Average hits by hour of day, flattened
• Count session starts
• Count session stops
• Count leads (occurring in sessions with this page in it – leads are redirects to a partner
• Total time spent on the page
• Average duration
• Average duration, weekday
• Average duration, weekend
• Average duration, by hours 1-24, by Time Zone
• Average “place in session” – where this event occurs, as a fraction of all events occurring
• Anonymous and Registered user visits, for each world country (flattened)
• Anonymous and Registered user visits, for each US State
This is a starting point. As mentioned, asset is not part of the page vector, but we could preserve
attributes of the asset to qualify the metrics. For example, rather than using the product entity
itself to identify a page, we could use the product category. We expect finding an identifiable set
of attributes for the page vector will require some exploration.
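As a rough illustration of the aggregation, the sketch below groups atomic page events by the (SITE_ID, PAGE_TYPE, ONTOLOGY_NODE) key and accumulates three of the candidate attributes. The class names, the string key format, and the definition of "place in session" as page position divided by pages in the session are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of building page vectors from atomic page events. */
final class PageVectors {
    /** One atomic page event, reduced to the fields this sketch needs. */
    static final class PageEvent {
        final int siteId;
        final String pageType;
        final int ontologyNodeId;
        final int pageSeqNum;   // 1-based position of the page within its session
        final int sessionPages; // total number of pages in that session

        PageEvent(int siteId, String pageType, int ontologyNodeId, int pageSeqNum, int sessionPages) {
            this.siteId = siteId;
            this.pageType = pageType;
            this.ontologyNodeId = ontologyNodeId;
            this.pageSeqNum = pageSeqNum;
            this.sessionPages = sessionPages;
        }
    }

    /** Running totals for one (site, page type, ontology node) key. */
    static final class Totals {
        long views;
        long onePageSessions;
        double placeSum;

        double averagePlace() {
            return views == 0 ? 0.0 : placeSum / views;
        }
    }

    /** Aggregates events into one Totals object per page key. */
    static Map<String, Totals> aggregate(Iterable<PageEvent> events) {
        Map<String, Totals> byPage = new LinkedHashMap<>();
        for (PageEvent e : events) {
            // Asset is deliberately absent from the key, as described above.
            String key = e.siteId + "|" + e.pageType + "|" + e.ontologyNodeId;
            Totals t = byPage.computeIfAbsent(key, k -> new Totals());
            t.views++;
            if (e.sessionPages == 1) t.onePageSessions++;
            t.placeSum += (double) e.pageSeqNum / e.sessionPages;
        }
        return byPage;
    }
}
```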
3.4 Data Flow
At a logical level, there are two major application components and two sources of input into the
system. The inputs are the data and the user feedback, and the components are the machine
learning and the data visualization components. Figure 1 illustrates the flow of external data into
and through the system:
Figure 1 Data Flow Diagram (elements: Import Data, Assign Instances, Present Data, User)
• External Data flows into the system as a Weka dataset, a flat set of vectors containing
• The machine-learning component applies the clustering algorithm to the vectors and
thereby classifies each instance.
• The machine-learning component projects the clusters in two or three dimensions for
presentation in a user interface.
• The data visualization component processes user feedback after the presentation.
• Depending on the feedback, the data visualization component re-presents the data, or re-
classifies and then re-presents the data.
The diagram in Figure 2 shows three physical architectural components. The shaded elements
represent elements that do not exist today; the unshaded elements represent third-party software
or data sources. The following subsections describe each of the three components.
Figure 2 System Diagram (components: ClickStream DataStore, Extraction Script, ARFF file, Cluster Engine (Driver), Multi-Dimensional Scaling Filter, 2D/3D Projection Window)
The preprocessing step prepares data for use. The extraction script reads data from a database
and writes it to the Weka-specific ARFF data format. The Java application reads the formatted data
from disk into memory. We will aggregate the atomic events in the database to the page level
record described above. We will filter out robot traffic as much as possible, using already
available flags in the database. The aggregation will discount the last event of the session for
mean duration calculations.
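The two filtering rules above can be sketched in a few lines. The example assumes the events for one page arrive as parallel arrays, with the robot flag standing for either IS_IAB_ROBOT or IS_BEHAVIORAL_ROBOT and the last-page flag standing for IS_LAST_PAGE; the class and parameter names are hypothetical.

```java
/** Hypothetical helper: mean page duration with robots and censored events excluded. */
final class MeanDuration {
    /**
     * @param durationMillis recorded duration of each event on the page
     * @param isRobot        true where either robot flag is set for the event
     * @param isLastPage     true where the event was the last page of its session
     * @return the mean over the remaining events, or NaN if none remain
     */
    static double mean(long[] durationMillis, boolean[] isRobot, boolean[] isLastPage) {
        long sum = 0L;
        int count = 0;
        for (int i = 0; i < durationMillis.length; i++) {
            // Skip robot traffic and last-in-session events, whose duration is unknown.
            if (isRobot[i] || isLastPage[i]) continue;
            sum += durationMillis[i];
            count++;
        }
        return count == 0 ? Double.NaN : (double) sum / count;
    }
}
```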
3.5.1.1 Clickstream Data Store
The data store exists today in the form of a large data warehouse for an anonymous web-
publishing company. They have agreed to share data for the project on condition that we
obfuscate user or commercially identifiable information. There will be some data transformation
within the data store: a script will aggregate the individual events with the page as the key,
forming the page vector described in the data description section.
3.5.1.2 Extraction Module
The extraction script pulls data from the data store. This module is a placeholder for the process
that creates a flat file in Weka format from the database. It may be a series of scripts, or it may
be a set of actions undertaken to spool query results to disk manually and add a header. Any code
developed will be handed in for inspection, but it should be understood that evidence of
completion is the Weka data file rather than code that created it. Anyone seeking to recreate this
experiment on his or her own would have to code this module by hand; the rest would flow from
there. As such, the extraction module is formally outside the bounds of the project.
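For readers unfamiliar with the target format, the sketch below writes a small ARFF document from already aggregated page vectors. It assumes every attribute is numeric and that the relation and attribute names contain no spaces (ARFF requires quoting otherwise); the class name and the sample attribute names in the usage note are hypothetical.

```java
/** Hypothetical helper: renders numeric page vectors as ARFF text. */
final class ArffWriter {
    /** Builds an ARFF document with one numeric attribute per column of rows. */
    static String toArff(String relation, String[] attributeNames, double[][] rows) {
        StringBuilder sb = new StringBuilder();
        // Header section: the relation name followed by one declaration per attribute.
        sb.append("@relation ").append(relation).append("\n\n");
        for (String name : attributeNames) {
            sb.append("@attribute ").append(name).append(" numeric\n");
        }
        // Data section: one comma-separated line per instance.
        sb.append("\n@data\n");
        for (double[] row : rows) {
            for (int j = 0; j < row.length; j++) {
                if (j > 0) sb.append(',');
                sb.append(row[j]);
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

For example, `ArffWriter.toArff("pages", new String[] {"total_views", "avg_duration"}, rows)` yields a header with two numeric attributes followed by one data line per page vector.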
3.5.1.3 Attribute-Relation File Format (ARFF) file
The Attribute-Relation File Format (ARFF) file constitutes the boundary of the application. A
Java program using the Weka class libraries can easily read the file into an in-memory
representation for machine learning by the various Weka modules. A website describing the
Weka data file format in detail is found at
3.5.2 Unsupervised Learning
The unsupervised learning component performs the actual clustering. It clusters instances read
from the pre-processing component and refines or redistributes clusters based on feedback from
the user through the GUI component.
3.5.2.1 Cluster Engine
The cluster engine is a driver that uses the Weka data-mining framework to read the ARFF file
into memory and exercise a clustering algorithm on that data. After performing the unsupervised
learning algorithm, it prepares data for presentation by creating low-dimensional projections of
the instances. The program appends the location in the low-dimensional space to the existing
attributes of the instance. At this point in the data path, the instance includes the original
features, the cluster identifier and any related cluster metrics such as distance from the centroid,
and the newly appended location in low-dimensional space.
3.5.2.2 Multi-Dimensional Scaling (MDS) Filter
The task of this component is to take the n-dimensional feature vector from the input data and
scale it to a projection suitable for presentation, i.e. either a two- or three-dimensional vector.
The filter preserves the distance between the points in the original space as closely as possible by
minimizing an error function. The MDS filter is completely independent of the clustering and
could be applied to the data on its own. The filter will perform an analysis such as Principal
Component Analysis (PCA) on the data (Bishop, C. 1995, Appendix E). Alternatively, we may be able
to derive the projection from the existing cluster information, in which case this filter
could stay in the data path as a no-op.
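A minimal sketch of a PCA-style projection follows. It centers the data, builds the covariance matrix, and extracts the leading eigenvectors by power iteration with deflation. This is one simple numerical approach chosen for illustration, not necessarily the method the filter will use; the class name, the fixed start vector, and the iteration count are assumptions, and the sketch expects at least two rows.

```java
/** Hypothetical PCA sketch using power iteration; not the project's final MDS filter. */
final class PcaSketch {
    /** Projects each row of data onto its top targetDims principal components. */
    static double[][] project(double[][] data, int targetDims) {
        int n = data.length;
        int d = data[0].length;
        double[] mean = new double[d];
        for (double[] row : data) {
            for (int j = 0; j < d; j++) mean[j] += row[j] / n;
        }
        double[][] centered = new double[n][d];
        double[][] cov = new double[d][d];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < d; j++) centered[i][j] = data[i][j] - mean[j];
            for (int a = 0; a < d; a++) {
                for (int b = 0; b < d; b++) cov[a][b] += centered[i][a] * centered[i][b] / (n - 1);
            }
        }
        double[][] projected = new double[n][targetDims];
        for (int c = 0; c < targetDims; c++) {
            double[] v = dominantEigenvector(cov);
            double lambda = dot(v, multiply(cov, v)); // Rayleigh quotient; v has unit length
            for (int i = 0; i < n; i++) projected[i][c] = dot(centered[i], v);
            // Deflation removes the component just found so the next pass finds the next one.
            for (int a = 0; a < d; a++) {
                for (int b = 0; b < d; b++) cov[a][b] -= lambda * v[a] * v[b];
            }
        }
        return projected;
    }

    static double[] dominantEigenvector(double[][] m) {
        double[] v = new double[m.length];
        // Assumption: an arbitrary start vector that is not orthogonal to the answer.
        for (int j = 0; j < v.length; j++) v[j] = j + 1;
        normalize(v);
        for (int iter = 0; iter < 200; iter++) {
            double[] next = multiply(m, v);
            if (!normalize(next)) break; // no remaining variance to capture
            v = next;
        }
        return v;
    }

    static boolean normalize(double[] v) {
        double norm = Math.sqrt(dot(v, v));
        if (norm < 1e-12) return false;
        for (int j = 0; j < v.length; j++) v[j] /= norm;
        return true;
    }

    static double[] multiply(double[][] m, double[] v) {
        double[] out = new double[m.length];
        for (int a = 0; a < m.length; a++) out[a] = dot(m[a], v);
        return out;
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) sum += a[j] * b[j];
        return sum;
    }
}
```

The sign of each component is arbitrary, so a projected plot may appear mirrored between runs on different data without changing the distances between points.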
3.5.2.3 New Cluster Strategies
The Weka framework allows for the easy introduction of new clustering strategies. More general
classification techniques could make use of the generic classifier container as well. The Weka
library offers several clustering choices out-of-the-box: Cobweb, Expectation Maximization,
Farthest-first, and K-means (Witten, I. & Frank, E. 2000, pp. 210-227). This is a good start but
by no means exhaustive. The application does not strictly require additional clustering strategies
to function; as such, the new strategies are candidates for scope reduction.
The visualization/interaction component is the user interface to the clustered data. This
component allows the user to view the data. It allows the user to transform or refine the clusters
through a limited set of interactions.
3.5.3.1 Java GUI
The Graphical User Interface presents the user with a two- or three-dimensional projection of the
source data, using color to represent the class membership established by clustering. Intuitively,
the intensity of the color can represent the distance from the centroid or “fuzzy” class
memberships. Additionally, the GUI presents the user with a set of controls that allow non-
destructive and destructive data transformations. Non-destructive operations include standard
graphical transformations such as rotation, pan, and zoom. A search capability allows a user to locate
specific instances or groups of instances in the scatter plot. Such a capability may help the user judge
whether the results are subjectively meaningful. A proposed destructive operation on the data allows the
user to act as a critic by demonstrating what a more appropriate result might be, “forcing” his or her own
bias into the model. The system will then re-cluster and re-present the modified instances. To remain
interactive, re-clustering would likely run on a random sample of the data.
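The color-intensity idea can be sketched as a mapping from distance to opacity. The example assumes each cluster has a base RGB color and that the largest distance within the cluster is known; the class name, the linear fade, and the minimum opacity of 64 are illustrative choices only.

```java
/** Hypothetical helper: encodes distance from the centroid as color opacity. */
final class ClusterColors {
    /** Lowest alpha used, so that distant points remain faintly visible (assumed value). */
    static final int MIN_ALPHA = 64;

    /**
     * @param baseRgb     the cluster's color as 0xRRGGBB
     * @param distance    the instance's distance from its cluster centroid
     * @param maxDistance the largest such distance in the cluster
     * @return a packed ARGB value whose alpha fades linearly with distance
     */
    static int argb(int baseRgb, double distance, double maxDistance) {
        double closeness = maxDistance <= 0 ? 1.0 : 1.0 - Math.min(distance / maxDistance, 1.0);
        int alpha = (int) Math.round(MIN_ALPHA + closeness * (255 - MIN_ALPHA));
        return (alpha << 24) | (baseRgb & 0xFFFFFF);
    }
}
```

In a Swing or AWT component the packed value could be turned into a paintable color with `new java.awt.Color(argb, true)`.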
3.5.3.2 Projection Window (GUI Component)
The projection window is a Java component that displays a colored scatterplot of two-
dimensional or three-dimensional data. It should handle non-destructive transformations like
rotation, pan, zoom, scale, color/grayscale toggling, selection, and indicating “interesting”
3.5.3.3 Control Panel (GUI Component)
The control panel allows a user to interact with the GUI. It will provide a menu of custom
controls fulfilling all the tasks defined for the interface.
4 Work Plan
At a high level, there are two phases to the project: building the tools, and using the tools to
explore the data.
The approach to the initial phase is to work backwards through the data path, and establish
baseline functionality. The first component to be completed is the last in the data path, the Java GUI
front-end. Dummy data will be used to test the GUI functionally. The clustering engine is scheduled next.
Finally, the MDS module, which in effect bridges the two initial components, will be completed.
The exception to the rule is the projection window. The work on the projection window will
happen in two parts. The initial round of work will produce a basic visualization window using
test data; a second round of development will extend the functionality for application-specific
The next phase involves experimentation. Once the first component in the data path brings in the
data, the real project is underway. In addition to trials using various combinations of features,
implementation of custom clustering strategies and incremental GUI changes happen during this
Ideally, the tools would be complete by Christmas, and experimentation would start in January. The
schedule below presents a slightly less aggressive view of the timeline. Code complete happens
on 2/1/2005. Pulling in this date to 1/1/2005 is a stretch goal for the development cycle.
4.1 Assumptions, Risks and Alternatives
The original code written for the project will be in Java. The cluster engine will use the Weka
open source data-mining framework (Weka 3 - Data Mining with Open Source Machine Learning
Software in Java. 2004). The GUI will use Swing components (Geary, D. 1999). It might use the
standard Java 3D extensions (Java 3D API. 2004). If it does use the AWT-based 3D API, the
GUI will use heavyweight AWT components instead of their Swing counterparts (Geary, D.
1997). Other small scripts will extract and format data for import; these may be in SQL, Perl, or
some other language. CVS versioning software will facilitate milestone releases. The eventual
application will run client-side. It might be packaged as a “Java Web Start” application.
Risks and Alternatives:
• Unachievable Schedule – the schedule as indicated below is aggressive.
o Alternative: Use third party components in the GUI, especially for prototyping
o Alternative: Use existing clustering software only
• Personal Schedule Conflict – we’re expecting our 2nd child 4/15/2005
o Alternative: Enter into the program later. Re-negotiate graduation date.
o Alternative: Build in extension.
• No signal in target data – the experiment could fail.
o Alternative: Establish signal before undertaking the project using sample data
o Alternative: Establish validity of negative outcome; success of tool
4.2 Preliminary Schedule
Figure 3 shows a high-level view of the schedule.
(Timeline from 10/4/2004 to 3/31/2005. Milestones shown: Start; GUI Complete, Dummy Data; Clustering; Scaling; Full integ, Built-in Clustering, 12/20/2004 - 1/3/2005; Code Complete; Exploration Ends; Work ends.)
Figure 3 High-level View of Schedule
Table 2 shows a detailed view of the proposed schedule:
Week   Java GUI   Projection   Controller   Clustering Engine   Projection   Extraction   Clustering Algorithms
Table 2 Detailed Schedule View
Centroid A pseudo exemplar serving as the statistical center of a given class.
Clustering Clustering algorithms find groups of items that are similar. For
example, clustering could be used by an insurance company to
group customers according to income, age, types of policies
purchased and prior claims experience. It divides a data set so that
records with similar content are in the same group, and groups are
as different as possible from each other. Since the categories are
unspecified, this is sometimes referred to as unsupervised learning.
(Two Crows: Data Mining Glossary. 2001).
Unsupervised Learning As distinct from supervised learning, the classification of unlabeled
Data Mining The process of automatically extracting valid, useful, previously
unknown, and ultimately comprehensible information from large
databases and using it to make crucial business decisions.
“Torturing the data until they confess” (Hsu, W. 2001)
Weka An open source Java project for machine learning and data mining
found at: http://www.cs.waikato.ac.nz/~ml/weka/
Webserver A software application for serving content to browsers on the world
6.1 Works Cited
The following is a list of references cited in the document.
Bishop, C. (1995). Neural Networks for Pattern Recognition. New York: Oxford University Press.
Bucklin, R., & Sismeiro, C. (2003). A Model of Website Browsing Behavior Estimated on
Clickstream Data. Journal of Marketing Research, XL, 249-267. Retrieved August 15, 2004,
Duda, R., Hart, P., & Stork, D. (2001). Pattern Classification. New York: John Wiley & Sons.
Geary, D. (1997). Graphic Java 1.1: Mastering the AWT. New York: Prentice Hall.
Geary, D. (1999). Graphic Java Volume II: Swing. New York: Prentice Hall.
Hsu, W. (2001). Knowledge Discovery in Databases and Data Mining. Retrieved October 17,
Java 3D API. Retrieved October 17, 2004 from http://java.sun.com/products/java-media/3D/
Two Crows: Data Mining Glossary. (2001). Retrieved October 17, 2004, from
UCI Machine Learning Repository Content Summary. Retrieved October 16, 2004, from
Weka 3 - Data Mining with Open Source Machine Learning Software in Java. (2004) Retrieved
October 17, 2004, from: http://www.cs.waikato.ac.nz/ml/weka/
Witten, I., & Frank, E. (2000). Data Mining. San Diego: Academic Press.
6.2 Works Consulted
The following is a list of works consulted while researching the topic.
Ansari, A., & Mela, C. (2003). E-Customization. [Electronic Version] Journal of Marketing
Research, XL, 131-145. Retrieved August 10, 2004, from
Moe, W. & Fader, P. (2002) Capturing Evolving Visit Behavior in Clickstream Data [Electronic
Version] Retrieved August 10, 2004, from http://www-
Moe, W. & Fader, P. (2003) Dynamic Purchase Behavior at e-Commerce Sites [Electronic
Version] Retrieved August 10, 2004, from:
Montgomery, A., Li, S., Srinivasan, K., & Liechty, J (2004) Modeling Online Browsing and Path
Analysis Using Clickstream Data [Electronic Version] Retrieved August 10, 2004, from
6.3 Works To Be Consulted
The following is a list of works marked for future review.
Jain, A., Murty, M., & Flynn, P. (1999). Data Clustering: A Review. ACM Computing Surveys,
31(3). Retrieved August 15, 2004, from http://portal.acm.org/citation.cfm?id=331499.331504
Leouski, A., & Swan, R. (1997). Interactive Cluster Visualization for Information Retrieval.
Retrieved August 10, 2004, from
Procopiuc, C., Jones, M., Agarwal, P., & Murali, T. (2002) A Monte Carlo Algorithm for Fast
Projective Clustering. [Electronic Version] Presented at ACM SIGMOD 2002. Retrieved
August 10, 2004, from http://www.research.att.com/resources/papers/Clustering.pdf