The document is a software requirements specification for a system to perform record matching over query results from multiple web databases. It describes the purpose, conventions, intended users, product scope, and references. It provides an overall description of the product perspective and functions, describes user classes and characteristics, operating environment, design constraints, and documentation. It outlines external interface requirements including user interfaces, hardware/software interfaces, and communications interfaces. It details system features and other non-functional requirements around performance, safety, security, quality, and business rules.
Software Requirements
Specification
For
Record Matching over Query
Results from Multiple Web
Databases
Prepared by
Frederick H. Lochovsky
Pelican Infotech
Submitted in partial fulfillment
of the requirements of
Mining sequential patterns matching over high utility data sets
Table of Contents
Introduction
   Purpose
   Document Conventions
   Intended Audience and Reading Suggestions
   Product Scope
   References
Overall Description
   Product Perspective
   Product Functions
   User Classes and Characteristics
   Operating Environment
   Design and Implementation Constraints
   User Documentation
   Assumptions and Dependencies
External Interface Requirements
   User Interfaces
   Hardware Interfaces
   Software Interfaces
   Communications Interfaces
System Features
Other Nonfunctional Requirements
   Performance Requirements
   Safety Requirements
   Security Requirements
   Software Quality Attributes
   Business Rules
Other Requirements
Revision History
Name Date Reason For Changes Version
Introduction
Purpose
This Software Requirements Specification provides a complete description of all the
functions and specifications of the Frederick H. Lochovsky system for mining sequential
pattern matching over high-utility data sets.
Document Conventions
Though this document is intended as a set of requirements, and not a design document,
technical information has been included wherever it was deemed appropriate.
Priority for all functionality is assumed to be equal except where noted.
Intended Audience and Reading Suggestions
The primary audience for this document is the development team. The secondary audience is the
Pelican InfoTech project management team.
Product Scope
A query-dependent, pre-learned method using training examples from previous query
results may fail on the results of a new query. To address the problem of record matching in
the Web database scenario, we present an unsupervised, online record matching method,
UDD, which, for a given query, can effectively identify duplicates from the query result
records of multiple Web databases.

Ambit lick Solutions
Mail Id : Ambitlick@gmail.com , Ambitlicksolutions@gmail.com
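As a rough illustration of the duplicate-identification idea (this is not UDD's actual algorithm, which the document does not reproduce), the sketch below compares result records field by field with a Jaccard token similarity and flags pairs above a threshold; the record values and the 0.5 threshold are hypothetical.

```java
import java.util.*;

// Illustrative sketch only -- records from different Web databases are
// compared field by field with a string similarity measure, and pairs
// whose average similarity reaches a threshold are treated as duplicates.
public class DuplicateSketch {

    // Jaccard similarity over the lower-cased word tokens of two field values.
    static double jaccard(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\W+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\W+")));
        Set<String> inter = new HashSet<>(ta);
        inter.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // Two records match when their average field similarity reaches the threshold.
    static boolean isDuplicate(String[] r1, String[] r2, double threshold) {
        double sum = 0;
        for (int i = 0; i < r1.length; i++) sum += jaccard(r1[i], r2[i]);
        return sum / r1.length >= threshold;
    }

    public static void main(String[] args) {
        String[] a = {"Data Mining Concepts and Techniques", "Han and Kamber"};
        String[] b = {"Data Mining: Concepts and Techniques", "J. Han and M. Kamber"};
        System.out.println(isDuplicate(a, b, 0.5));   // prints "true"
    }
}
```

Token-based similarity is only one of many measures a real record matcher might use; edit distance or field-weighted combinations are common alternatives.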
References
The following references are relevant to the project and can be consulted for a more
detailed view of the technologies and standards used in this project:
1. R. Ananthakrishna, S. Chaudhuri, and V. Ganti, "Eliminating Fuzzy Duplicates in Data Warehouses."
2. R. Baxter, P. Christen, and T. Churches, "A Comparison of Fast Blocking Methods for Record Linkage."
3. S. Chaudhuri, V. Ganti, and R. Motwani, "Robust Identification of Fuzzy Duplicates."
Overall Description
Product Perspective
• False data can reveal actions in which unauthorized users attempted to access
computer systems or authorized users attempted to misuse their privileges.
• Association rule mining.
• An algorithm based on sequential pattern mining, using the same data
collected by the databases.
Product Functions
The product shall allow users to:
• Install and set up an issue tracking database
• Define the formats of acceptable issues
• File preformatted reports in a database
• Submit issues to a database
• Query the database in a number of ways
• Edit issues in the database and resubmit them
• Merge multiple issues into a single issue
• Relate issues to each other in a hierarchical form
• Assemble groups of related issues into a document
User Classes and Characteristics
Individual Local Developers. Individual developers should be able to submit issues, edit
issues, and perform queries on the database to discover what issues are relevant to them,
which issues are open (in the case of issues to which that is relevant, such as defect reports or
unsatisfied requirements), etc. These individual developers are assumed to have some
knowledge of the development environment and are familiar and comfortable with basic
software tools such as text editors. As a result, the individual developer tools will be the
most "primitive" but also the most efficient for use, probably implemented as text-based
command line tools. Since Network simulation is primarily intended as an easy-to-use, free
tool for individual developers and small teams, this is the most critical user class to satisfy.
The tools must be relatively easy to use, and extremely easy to set up.
Local Issue Managers. Issue managers -- those responsible for keeping track of open issues,
etc. -- must have tools capable of querying the database and relating issues to developers. The
tools used for issue managers and individual developers will be very similar, as they will be
doing similar tasks -- querying the database for open issues, assigning people to issues as
appropriate, recategorizing issues or merging/splitting them, etc. However, issue managers
may not be as comfortable with "primitive" tools as individual developers, so some thought
will be given to more "scripted" or directive tools, possibly involving simple GUI elements.
However, the bulk of user-interface issues will be placed on the next user class, remote users.
Remote Users. If Network simulation is used as a defect management system, then remote
users (users of software packages submitting reports to a Network simulation center) will
constitute the bulk of submissions. If Network simulation is to be used in this way, it must
cater to the needs of these users, who will have much lower skills and will require very
simple, easy-to-use interfaces. Primarily these interfaces will focus on problem submission,
but they will also allow some ability to query the database, etc.
Operating Environment
In a computer, the operating environment includes physical factors such as temperature
affecting circuitry; but in particular the term is often used to describe the non-physical environment in which
software runs. This may apply to application software with which users interact, comprising
the "look and feel" of the system, its appearance and the things that have to be done to achieve
desired results. The term may also apply to system software; e.g., software designed for a
Unix environment will do things differently than in a Microsoft Windows environment. Some
operating environments for programming purposes are referred to as programming
environments; e.g., the "UNIX programming environment" for a Unix shell with its look and
feel and functionality.
"Operating environment" is not the totality of the functionality and appearance of an operating
system.
Design and Implementation Constraints
Architecture

[Architecture diagram: the mining tool applies the mining algorithms to the collected data, checks the customer using the RFC model, and analyzes the business customer.]

[Cluster-formation diagram: from the DB, check the user (max: high profit, gold customer; min: low profit), start the mining, then store, manage, and analyze.]
User Documentation
None.
Assumptions and Dependencies
Databases are defined as follows:
@relation 'cpu'
@attribute MYCT real
@attribute MMIN real
@attribute MMAX real
@attribute CACH real
@attribute CHMIN real
@attribute CHMAX real
@attribute class real
@data
125,256,6000,256,16,128,199
29,8000,32000,32,8,32,253
29,8000,32000,32,8,32,253
29,8000,32000,32,8,32,253
29,8000,16000,32,8,16,132
26,8000,32000,64,8,32,290
23,16000,32000,64,16,32,381
23,16000,32000,64,16,32,381
23,16000,64000,64,16,32,749
23,32000,64000,128,32,64,1238
400,1000,3000,0,1,2,23
400,512,3500,4,1,6,24
60,2000,8000,65,1,8,70
50,4000,16000,65,1,8,117
350,64,64,0,1,4,15
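The ARFF fragment above can be read with a few lines of Java. The sketch below is not Weka's own loader (Weka ships converters for this); it simply collects the attribute names from the @attribute lines and the numeric rows after @data, to show the file's structure.

```java
import java.util.*;

// Minimal sketch of parsing an ARFF fragment like the one above:
// attribute names come from @attribute lines, numeric rows follow @data.
public class ArffSketch {
    final List<String> attributes = new ArrayList<>();
    final List<double[]> rows = new ArrayList<>();

    static ArffSketch parse(String arff) {
        ArffSketch result = new ArffSketch();
        boolean inData = false;
        for (String raw : arff.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("%")) continue;   // blanks and comments
            String lower = line.toLowerCase();
            if (lower.startsWith("@attribute")) {
                result.attributes.add(line.split("\\s+")[1]);       // attribute name
            } else if (lower.startsWith("@data")) {
                inData = true;                                      // numeric rows follow
            } else if (inData) {
                String[] parts = line.split(",");
                double[] row = new double[parts.length];
                for (int i = 0; i < parts.length; i++) row[i] = Double.parseDouble(parts[i]);
                result.rows.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ArffSketch cpu = parse("@relation 'cpu'\n@attribute MYCT real\n@attribute class real\n"
                + "@data\n125,199\n29,253");
        System.out.println(cpu.attributes + " rows=" + cpu.rows.size());   // prints "[MYCT, class] rows=2"
    }
}
```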
External Interface Requirements
User Interfaces
Artificial neural networks: Non-linear predictive models that learn through
training and resemble biological neural networks in structure.
Decision trees: Tree-shaped structures that represent sets of decisions. These
decisions generate rules for the classification of a dataset. Specific decision tree
methods include Classification and Regression Trees (CART) and Chi Square
Automatic Interaction Detection (CHAID).
Genetic algorithms: Optimization techniques that use processes such as genetic
combination, mutation, and natural selection in a design based on the concepts of
evolution.
Nearest neighbor method: A technique that classifies each record in a dataset
based on a combination of the classes of the k record(s) most similar to it in a
historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
Rule induction: The extraction of useful if-then rules from data based on
statistical significance.
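The nearest neighbor method described above can be sketched as follows; the historical records, class labels, and Euclidean distance below are illustrative assumptions, not data from this project.

```java
import java.util.*;

// Sketch of the k-nearest-neighbor method: classify a record by majority
// vote over the classes of the k most similar records in a historical dataset.
public class KnnSketch {

    // Euclidean distance between two numeric records.
    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // features: historical records; labels: their classes; k >= 1.
    static String classify(double[][] features, String[] labels, double[] query, int k) {
        Integer[] idx = new Integer[features.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort record indices by distance to the query record.
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> distance(features[i], query)));
        // Majority vote over the k nearest records.
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) votes.merge(labels[idx[i]], 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        double[][] hist = {{1, 1}, {1, 2}, {8, 8}, {9, 8}};
        String[] cls = {"low", "low", "high", "high"};
        System.out.println(classify(hist, cls, new double[]{2, 1}, 3));   // prints "low"
    }
}
```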
Hardware Interfaces
Hardware Specification
Processor Type : Pentium III
Speed : 1.6 GHz
RAM : 128 MB
Hard Disk : 8 GB
Software Interfaces
Java began as a client side platform independent programming language that enabled
stand-alone Java applications and applets. The numerous benefits of Java resulted in an
explosion in the usage of Java in the back end server side enterprise systems. The Java
Development Kit (JDK), which was the original standard platform defined by Sun, was soon
supplemented by a collection of enterprise APIs. The proliferation of enterprise APIs, often
developed by several different groups, resulted in divergence of APIs and caused concern
among the Java developer community.
Java byte code can execute on the server instead of or in addition to the client,
enabling you to build traditional client/server applications and modern thin client Web
applications. Two key server side Java technologies are servlets and JavaServer Pages.
Servlets are protocol and platform independent server side components which extend the
functionality of a Web server. JavaServer Pages (JSPs) extend the functionality of servlets by
allowing Java servlet code to be embedded in an HTML file.
Features of Java
• Platform Independence
o The Write-Once-Run-Anywhere ideal has not been fully achieved (tuning for
different platforms is usually required), but Java comes closer than other languages.
• Object Oriented
• Object oriented throughout - no coding outside of class definitions, including
main().
• An extensive class library available in the core language packages.
• Compiler/Interpreter Combo
• Code is compiled to byte codes that are interpreted by a Java virtual machine
(JVM).
• This provides portability to any machine for which a virtual machine has been
written.
• The two steps of compilation and interpretation allow for extensive code
checking and improved security.
• Robust
• Exception handling built-in, strong type checking (that is, all data must be
declared an explicit type), local variables must be initialized.
• Several dangerous features of C & C++ eliminated:
• No memory pointers
• No preprocessor
• Array index limit checking
• Automatic Memory Management
• Automatic garbage collection - memory management handled by JVM.
• Security
• No memory pointers
• Programs run inside the virtual machine sandbox.
• Array index limit checking
• Code pathologies reduced by:
• Byte code verifier - checks classes after loading
• Class loader - confines objects to unique namespaces. Prevents loading a
hacked "java.lang.SecurityManager" class, for example.
• Security manager - determines what resources a class can access such as
reading and writing to the local disk.
• Dynamic Binding
• The linking of data and methods to where they are located is done at run-time.
• New classes can be loaded while a program is running. Linking is done on the
fly.
• Even if libraries are recompiled, there is no need to recompile code that uses
classes in those libraries. This differs from C++, which uses static binding.
This can result in fragile classes for cases where linked code is changed and
memory pointers then point to the wrong addresses.
• Good Performance
• Interpretation of byte codes slowed performance in early versions, but
advanced virtual machines with adaptive and just-in-time compilation and
other techniques now typically provide performance at 50% to 100% of the
speed of C++ programs.
• Threading
• Lightweight processes, called threads, can easily be spun off to perform
multiprocessing.
• Can take advantage of multiprocessors where available
• Great for multimedia displays.
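The threading support described above can be shown in a few lines; the counter example below is purely illustrative, not part of this system.

```java
// Two lightweight threads increment a shared counter concurrently;
// synchronization ensures no updates are lost, and join() waits for
// both threads to finish before the total is printed.
public class ThreadSketch {
    static int counter = 0;

    static synchronized void increment() { counter++; }

    public static void main(String[] args) {
        Runnable task = () -> { for (int i = 0; i < 1000; i++) increment(); };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        try {
            t1.join();   // wait for both threads before reading the result
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        System.out.println(counter);   // prints "2000"
    }
}
```

Without the synchronized keyword the two threads could interleave their read-modify-write steps and lose updates, so the printed total would be unpredictable.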
• Built-in Networking
• Java was designed with networking in mind and comes with many classes to
develop sophisticated Internet communications.
Communications Interfaces
ECLIPSE
Eclipse is an open-source software framework written primarily in Java. The
initial codebase originated from VisualAge. In its default form it is an Integrated
Development Environment (IDE) for Java developers, consisting of the Java Development
Tools (JDT). Users can extend its capabilities by installing plug-ins written for the Eclipse
software framework, such as development toolkits for other programming languages, and can
write and contribute their own plug-in modules. Language packs provide translations into over
a dozen natural languages.
4.1.1 ARCHITECTURE:
The basis for Eclipse is the Rich Client Platform (RCP). The following
components constitute the rich client platform:
• OSGi - a standard bundling framework
• Core platform - boot Eclipse, run plug-ins
• The Standard Widget Toolkit (SWT) - a portable widget toolkit
• JFace - viewer classes to bring model view controller programming to SWT,
file buffers, text handling, and text editors
• The Eclipse Workbench - views, editors, perspectives, wizards
Eclipse's widgets are implemented by a widget toolkit for Java called SWT,
unlike most Java applications, which use the Java standard Abstract Window Toolkit(AWT)
or Swing. Eclipse's user interface also leverages an intermediate GUI layer called JFace,
which simplifies the construction of applications based on SWT.
Eclipse employs plug-ins in order to provide all of its functionality on top of (and including)
the rich client platform, in contrast to some other applications where functionality is typically
hard coded. This plug-in mechanism is a lightweight software componentry framework. In
addition to allowing Eclipse to be extended using other programming languages such as C and
Python, the plug-in framework allows Eclipse to work with typesetting languages like LaTeX,
networking applications such as telnet, and database management systems. The plug-in
architecture supports writing any desired extension to the environment, such as for
configuration management. Java and CVS support is provided in the Eclipse SDK.
The key to the seamless integration of tools with Eclipse is the plugin. With the exception of
a small run-time kernel, everything in Eclipse is a plug-in. This means that a plug-in you
develop integrates with Eclipse in exactly the same way as other plug-ins; in this respect, all
features are created equal. Eclipse provides plugins for a wide variety of features, some of
which are through third parties using both free and commercial models. Examples of plugins
include UML plugin for Sequence and other UML diagrams, plugin for Database explorer,
etc.
The Eclipse SDK includes the Eclipse Java Development Tools, offering an IDE with a built-
in incremental Java compiler and a full model of the Java source files. This allows for
advanced refactoring techniques and code analysis. The IDE also makes use of a workspace,
in this case a set of metadata over a flat file space, allowing external file modifications as long
as the corresponding workspace "resource" is refreshed afterwards. The Visual Editor project
allows interfaces to be created interactively, hence allowing Eclipse to be used as a RAD tool.
4.1.2 HISTORY
Eclipse began as an IBM Canada project. It was developed by OTI (Object Technology
International) as a replacement for VisualAge, which itself had been developed by OTI. In
November 2001, a consortium was formed to further the development of Eclipse as open
source. In 2003, the Eclipse Foundation was created.
Eclipse 3.0 (released on June 21 2004) selected the OSGi Service Platform specifications as
the runtime architecture.
Eclipse was originally released under the Common Public License, but was later re-licensed
under the Eclipse Public License. The Free Software Foundation has said that both licenses
are free software licenses, but are incompatible with the GNU General Public License (GPL).
Mike Milinkovich, of the Eclipse Foundation has commented that moving to the GPL will be
considered when version 3 of the GPL is released.
4.1.3 MYECLIPSE:
MyEclipse is a commercially available Enterprise Java and AJAX IDE created and
maintained by the company Genuitec, a founding member of the Eclipse Foundation.
MyEclipse is built upon the Eclipse platform, and integrates both proprietary and open source
solutions into the development environment.
MyEclipse has two primary versions: a professional and a standard edition. The
standard edition adds database tools, a visual web designer, persistence tools, Spring tools,
Struts and JSF tooling, and a number of other features to the basic Eclipse Java Developer
profile. It competes with the Web Tools Project, which is a part of Eclipse itself, but
MyEclipse is a separate project entirely and offers a different feature set. Most recently,
MyEclipse has been made available via Pulse, a provisioning tool that maintains Eclipse
software profiles, including those that use MyEclipse.
System Features
Embedding Data into Weka Data mining tool
Weka (Waikato Environment for Knowledge Analysis) is a Java-based data mining
tool developed by Waikato University. After loading the dataset into it, the preprocess
function of Weka allows the user to filter out undesired attributes to prevent them from
affecting the quality of extracted knowledge. Next, the user can apply one of three
techniques to mine the data: classification, clustering, or association rules.
Data Mining is playing a key role in most enterprises, which have to analyse great
amounts of data in order to achieve higher profits. Nevertheless, due to the large datasets
involved in this process, the data mining field must face some technological challenges. Grid
Computing takes advantage of the low-load periods of all the computers connected to a
network, making possible resource and data sharing. Providing Grid services constitute a
flexible manner of tackling the data mining needs. This paper shows the adaptation of Weka, a
widely used Data Mining tool, to a grid infrastructure.
Classifiers in WEKA are models for predicting nominal or numeric quantities.
Implemented learning schemes include decision trees and lists, instance-based classifiers,
support vector machines, multi-layer perceptrons, logistic regression, and Bayes nets.
“Meta”-classifiers include bagging, boosting, stacking, error-correcting output codes, and
locally weighted learning.
WEKA contains “clusters” for finding groups of similar instances in a dataset
Implemented schemes are: k-Means, EM, Cobweb, Farthest First , Clusters can be visualized
Ambit lick Solutions
Mail Id : Ambitlick@gmail.com , Ambitlicksolutions@gmail.Com
19. and compared to “true” clusters (if given) Evaluation based on log likelihood if clustering
scheme produces a probability distribution
Suppose you have some data and you want to build a decision tree from it. A common
situation is for the data to be stored in a spreadsheet or database. However, Weka expects it to
be in ARFF format, introduced in Section 2.4, because it needs type information about each
attribute, which cannot be automatically deduced from the attribute values. Before you can
apply any algorithm to your data, it must be converted to ARFF form. This can be
done very easily. Recall that the bulk of an ARFF file consists of a list of all the instances,
with the attribute values for each instance being separated by commas (Figure 2.2). Most
spreadsheet and database programs allow you to export your data into a file in comma
separated format—as a list of records where the items are separated by commas.
Once this has been done, you need only load the file into a text editor or a word
processor; add the dataset’s name using the @relation tag, the attribute information using
@attribute, and a @data line; save the file as raw text—and you’re done! In the following
example we assume that your data is stored in a Microsoft Excel spreadsheet, and you’re
using Microsoft Word for text processing. Of course, the process of converting data into
ARFF format is very similar for other software packages. Figure 8.1a shows an Excel
spreadsheet containing the weather data. It is easy to save this data in comma-separated
format. First, select the Save As… item from the File pull-down menu. Then, in the ensuing
dialog box, select CSV. Now load this file into Microsoft Word.
The rows of the original spreadsheet have been converted into lines of text, and the
elements are separated from each other by commas. All you have to do is convert the first
line, which holds the attribute names, into the header structure that makes up the beginning of
an ARFF file. The dataset’s name is introduced by a @relation tag, and the
names, types, and values of each attribute are defined by @attribute tags. The data section of
the ARFF file begins with a @data tag. Once the structure of your dataset matches this
layout, you should save it as a text file.
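As a sketch of the finished result, the standard weather dataset distributed with Weka begins like this in ARFF format (only the first few data rows are shown):

```
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
```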
Choose Save as… from the File menu, and specify Text Only with Line Breaks as the
file type by using the corresponding popup menu. Enter a file name, and press the Save
button. We suggest that you rename the file to weather.arff to indicate that it is in ARFF
format. Note that the classification schemes in Weka assume by default that the class is the
last attribute in the ARFF file, which fortunately it is in this case. (We explain in Section 8.3
below how to override this default.) Now you can start analyzing this data using the
algorithms provided. In the following we assume that you have downloaded Weka to your
system, and that your Java environment knows where to find the library. (More information
on how to do this can be found at the Weka Web site.) To see what the C4.5 decision tree
learner described in Section 6.1 does with this dataset, we use the J4.8 algorithm, which is
Weka’s implementation of this decision tree learner. (J4.8 actually implements a later and
slightly improved version called C4.5 Revision 8, which was the last public version of this
family of algorithms before C5.0, a commercial implementation, was released.) Type java
weka.classifiers.j48.J48 -t weather.arff at the command line.
This incantation calls the Java virtual machine and instructs it to execute the J48
algorithm from the j48 package—a sub package of classifiers, which is part of the overall
weka package. Weka is organized in “packages” that correspond to a directory hierarchy.
We’ll give more details of the package structure in the next section: in this case, the sub
package name is j48 and the program to be executed from it is called J48. The –t option
informs the algorithm that the next argument is the name of the training file. After pressing
Return, you’ll see the output shown.
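The correspondence between package names and directories can be pictured as follows (a simplified sketch of the layout, not a complete listing):

```
weka/                     the top-level package
    core/                 weka.core
    classifiers/          weka.classifiers
        j48/              weka.classifiers.j48
            J48.class     run as: java weka.classifiers.j48.J48
```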
5.2.2.1 The weka.core package
The core package is central to the Weka system. It contains classes that are accessed
from almost every other class. You can find out what they are by clicking on the hyperlink
underlying weka.core, which brings up its documentation page. The page is divided into two parts: the Interface
Index and the Class Index. The latter is a list of all classes contained within the package, while
the former lists all the interfaces it provides. An interface is very similar to a class, the only
difference being that it doesn’t actually do anything by itself—it is merely a list of methods
without actual implementations. Other classes can declare that they “implement” a particular
interface, and then provide code for its methods. For example, the OptionHandler interface
defines those methods that are implemented by all classes that can process command-line
options—including all classifiers.
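To make the interface mechanism concrete, here is a minimal sketch in plain Java. The method names mirror Weka's OptionHandler, but the classes below are simplified stand-ins written for illustration, not Weka's actual source:

```java
// A simplified stand-in for Weka's OptionHandler interface: just a list of
// method signatures, with no implementations of its own.
interface OptionHandler {
    void setOptions(String[] options);
    String[] getOptions();
}

// A class that declares it "implements" the interface must supply code
// for every method the interface lists.
class SimpleHandler implements OptionHandler {
    private String[] options = new String[0];

    public void setOptions(String[] options) {
        this.options = options.clone();
    }

    public String[] getOptions() {
        return options.clone();
    }
}
```

Any code that accepts an OptionHandler can then process command-line options without knowing which concrete class it is dealing with.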
The key classes in the core package are called Attribute, Instance, and Instances. An
object of class Attribute represents an attribute. It contains the attribute’s name, its type and,
in the case of a nominal attribute, its possible values. An object of class Instance contains the
attribute values of a particular instance; and an object of class Instances holds an ordered set
of instances, in other words, a dataset. By clicking on the hyperlinks underlying the classes,
you can find out more about them. However, you need not know the details just to use Weka
from the command line. We will return to these classes in Section 8.4 when we discuss how to
access the machine learning routines from other Java code. Clicking on the All Packages
hyperlink in the upper left corner of any documentation page brings you back to the listing of
all the packages in Weka.
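The relationship between the three classes can be sketched with simplified stand-ins (illustrative only; the real weka.core classes carry much more functionality):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for weka.core.Attribute: a name plus, for a nominal attribute,
// the list of its possible values (null for a numeric attribute).
class Attribute {
    final String name;
    final List<String> values;

    Attribute(String name) { this(name, null); }      // numeric attribute
    Attribute(String name, List<String> values) {     // nominal attribute
        this.name = name;
        this.values = values;
    }
}

// Stand-in for weka.core.Instance: the attribute values of one example.
class Instance {
    final double[] values;
    Instance(double... values) { this.values = values; }
}

// Stand-in for weka.core.Instances: an ordered set of instances, i.e. a dataset.
class Instances {
    final String relationName;
    final List<Attribute> attributes;
    private final List<Instance> list = new ArrayList<>();

    Instances(String relationName, List<Attribute> attributes) {
        this.relationName = relationName;
        this.attributes = attributes;
    }

    void add(Instance inst) { list.add(inst); }
    int numInstances() { return list.size(); }
    int numAttributes() { return attributes.size(); }
}
```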
5.2.2.2 The weka.classifiers package
The classifiers package contains implementations of most of the algorithms for
classification and numeric prediction that have been discussed above. (Numeric
prediction is included in classifiers: it is interpreted as prediction of a continuous class.) The
most important class in this package is Classifier, which defines the general structure of any
scheme for classification or numeric prediction. It contains two methods, buildClassifier() and
classifyInstance(), which all of these learning algorithms have to implement. In the jargon of
object-oriented programming, the learning algorithms are represented by subclasses of
Classifier, and therefore automatically inherit these two methods. Every scheme redefines
them according to how it builds a classifier and how it
classifies instances. This gives a uniform interface for building and using classifiers from
other Java code.
Hence, for example, the same evaluation module can be used to evaluate the
performance of any classifier in Weka. Another important class is DistributionClassifier. This
subclass of Classifier defines the method distributionForInstance(), which returns a
probability distribution for a given instance. Any classifier that can calculate class
probabilities is a subclass of DistributionClassifier and implements this method.
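The hierarchy just described can be sketched as follows. This is a toy mock-up of the structure, not Weka's real code: the real methods operate on Weka's Instances and Instance types rather than on raw arrays.

```java
// Stand-in for weka.classifiers.Classifier: every learning scheme must
// implement these two methods, giving a uniform interface.
abstract class Classifier {
    // Build a model from training data: rows of attribute values,
    // with the class value in the last position.
    public abstract void buildClassifier(double[][] data);

    // Predict the class value of a single unseen instance.
    public abstract double classifyInstance(double[] instance);
}

// Stand-in for DistributionClassifier: adds class-probability estimates.
abstract class DistributionClassifier extends Classifier {
    public abstract double[] distributionForInstance(double[] instance);
}

// A toy scheme that predicts the majority class and its relative frequency.
class MajorityClassifier extends DistributionClassifier {
    private double prediction;
    private double frequency;

    public void buildClassifier(double[][] data) {
        java.util.Map<Double, Integer> counts = new java.util.HashMap<>();
        for (double[] row : data) {
            counts.merge(row[row.length - 1], 1, Integer::sum);
        }
        java.util.Map.Entry<Double, Integer> best =
            java.util.Collections.max(counts.entrySet(),
                java.util.Map.Entry.comparingByValue());
        prediction = best.getKey();
        frequency = (double) best.getValue() / data.length;
    }

    public double classifyInstance(double[] instance) {
        return prediction;
    }

    // Two-entry distribution: majority class versus everything else.
    public double[] distributionForInstance(double[] instance) {
        return new double[] { frequency, 1.0 - frequency };
    }
}
```

Because evaluation code only ever calls the abstract methods, any subclass can be plugged into the same evaluation harness unchanged.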
To see an example, click on DecisionStump, which is a class for building a simple one-level
binary decision tree (with an extra branch for missing values). You have to use this rather
lengthy expression if you want to build a decision stump from the command line. The page
then displays a tree structure showing the relevant part of the class hierarchy. As you can see,
DecisionStump is a subclass of DistributionClassifier, and therefore produces class
probabilities. DistributionClassifier, in turn, is a subclass of Classifier, which is itself a
subclass of Object. The Object class is the most general one in Java: all classes are
automatically subclasses of it. After some generic information about the class, its author, and
its version, it gives an index of the constructors and methods of this class.
A constructor is a special kind of method that is called whenever an object of that
class is created, usually initializing the variables that collectively define its state. The index of
methods lists the name of each one, the type of parameters it takes, and a short description of
its functionality. Beneath those indexes, the Web page gives more details about the
constructors and methods. We return to those details later. As you can see, DecisionStump
implements all methods required by both a Classifier and a DistributionClassifier. In addition,
it contains toString() and main() methods. The former returns a textual description of the
classifier, used whenever it is printed on the screen. The latter is called every time you ask for
a decision stump from the command line, in other words, every time you enter a command
beginning with java weka.classifiers.DecisionStump.
The presence of a main() method in a class indicates that it can be run from the command line,
and all learning methods and filter algorithms implement it.
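The toString()/main() convention can be illustrated with a small stand-alone sketch (not Weka's actual DecisionStump source):

```java
// A minimal class following the convention described above.
class StumpDemo {
    private final String description;

    // Constructor: called whenever an object is created, initializing its state.
    StumpDemo(String description) {
        this.description = description;
    }

    // toString(): a textual description, used whenever the object is printed.
    @Override
    public String toString() {
        return "StumpDemo: " + description;
    }

    // main(): makes the class runnable from the command line, e.g. java StumpDemo
    public static void main(String[] args) {
        System.out.println(new StumpDemo("one-level decision tree"));
    }
}
```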
Waikato Environment for Knowledge Analysis
Collection of state-of-the-art machine learning algorithms and data processing
tools implemented in Java
o Released under the GPL
Support for the whole process of experimental data mining
o Preparation of input data
o Statistical evaluation of learning schemes
o Visualization of input data and the result of learning
Used for education, research and applications
Complements “Data Mining” by Witten & Frank
5.2.2.3 Features
49 data preprocessing tools
76 classification/regression algorithms
8 clustering algorithms
15 attribute/subset evaluators + 10 search algorithms for feature selection
3 algorithms for finding association rules
3 graphical user interfaces
o “The Explorer” (exploratory data analysis)
o “The Experimenter” (experimental environment)
o “The Knowledge Flow” (new process model inspired interface)
Continue to develop and support WEKA
MOA (Massive Online Analysis)
o Framework that supports learning from data streams
Facilities for data generation, experimental analysis, learning
algorithms, etc.
o The Moa (another native NZ bird) is not only flightless, like the Weka, but also
extinct
o First public release, probably this Christmas, or perhaps Thanksgiving (as it’s
just another turkey)
MILK
o Multi-Instance Learning Kit
Proper
o Propositionalization toolbox for WEKA
Other Nonfunctional Requirements
Performance Requirements
The system has no specific performance requirements at this time.
Safety Requirements
The system has no specific safety requirements at this time, except to the extent that it is
designed to run without root access.
Security Requirements
The system has no specific security requirements at this time.
Software Quality Attributes
No additional software quality attributes are addressed in the requirements at this time.
Business Rules
There are no explicit business rules for operation of the system at this time. All users
with access to the command line tools and a copy of the repository will be allowed to perform
all actions. Additional security measures and procedures may be added at a future date.
Other Requirements
There are no additional requirements for the product at this time.
Appendix A: Glossary