The document is a software requirements specification for a system to perform record matching over query results from multiple web databases. It describes the purpose, conventions, intended users, product scope, and references. It provides an overall description of the product perspective and functions, describes user classes and characteristics, operating environment, design constraints, and documentation. It outlines external interface requirements including user interfaces, hardware/software interfaces, and communications interfaces. It details system features and other non-functional requirements around performance, safety, security, quality, and business rules.
Software Requirements
Specification
For
Record Matching over Query
Results from Multiple Web
Databases
Prepared by
Frederick H. Lochovsky
Pelican Infotech
Submitted in partial fulfillment
of the requirements of
Mining sequential patterns matching over high utility data sets
Table of Contents
Introduction
   Purpose
   Document Conventions
   Intended Audience and Reading Suggestions
   Product Scope
   References
Overall Description
   Product Perspective
   Product Functions
   User Classes and Characteristics
   Operating Environment
   Design and Implementation Constraints
   User Documentation
   Assumptions and Dependencies
External Interface Requirements
   User Interfaces
   Hardware Interfaces
   Software Interfaces
   Communications Interfaces
System Features
Other Nonfunctional Requirements
   Performance Requirements
   Safety Requirements
   Security Requirements
   Software Quality Attributes
   Business Rules
Other Requirements
Revision History
Name Date Reason For Changes Version
Introduction
Purpose
This Software Requirements Specification provides a complete description of all the
functions and specifications of the Frederick H. Lochovsky system for mining sequential
pattern matching over high-utility data sets.
Document Conventions
Though this document is intended as a set of requirements, and not a design document,
technical information has been included wherever it was deemed appropriate.
Priority for all functionality is assumed to be equal except where noted.
Intended Audience and Reading Suggestions
The primary audience for this document is the development team. The secondary audience is the
Pelican InfoTech project management team.
Product Scope
A query-dependent, pre-learned method using training examples from previous query
results may fail on the results of a new query. To address the problem of record matching in
the Web database scenario, we present an unsupervised, online record matching method,
UDD, which, for a given query, can effectively identify duplicates from the query result
records of multiple Web databases.

Ambit lick Solutions
Mail Id : Ambitlick@gmail.com , Ambitlicksolutions@gmail.com
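As a rough illustration of the duplicate-identification idea (this is not UDD's actual algorithm, which the document does not reproduce), the sketch below compares result records field by field with a Jaccard token similarity and flags pairs above a threshold; the record values and the 0.5 threshold are hypothetical.

```java
import java.util.*;

// Illustrative sketch only -- records from different Web databases are
// compared field by field with a string similarity measure, and pairs
// whose average similarity reaches a threshold are treated as duplicates.
public class DuplicateSketch {

    // Jaccard similarity over the lower-cased word tokens of two field values.
    static double jaccard(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\W+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\W+")));
        Set<String> inter = new HashSet<>(ta);
        inter.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // Two records match when their average field similarity reaches the threshold.
    static boolean isDuplicate(String[] r1, String[] r2, double threshold) {
        double sum = 0;
        for (int i = 0; i < r1.length; i++) sum += jaccard(r1[i], r2[i]);
        return sum / r1.length >= threshold;
    }

    public static void main(String[] args) {
        String[] a = {"Data Mining Concepts and Techniques", "Han and Kamber"};
        String[] b = {"Data Mining: Concepts and Techniques", "J. Han and M. Kamber"};
        System.out.println(isDuplicate(a, b, 0.5));   // prints "true"
    }
}
```

Token-based similarity is only one of many measures a real record matcher might use; edit distance or field-weighted combinations are common alternatives.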
References
The following references are relevant to the project and can be consulted for a more
detailed view of the technologies and standards used in this project:
1. R. Ananthakrishna, S. Chaudhuri, and V. Ganti, "Eliminating Fuzzy Duplicates in Data Warehouses."
2. R. Baxter, P. Christen, and T. Churches, "A Comparison of Fast Blocking Methods for Record Linkage."
3. S. Chaudhuri, V. Ganti, and R. Motwani, "Robust Identification of Fuzzy Duplicates."
Overall Description
Product Perspective
• False data can reveal actions in which unauthorized users attempted to access
computer systems or authorized users attempted to misuse their privileges.
• Association rule mining.
• An algorithm based on sequential pattern mining, using the same data
collected by the databases.
Product Functions
The product shall allow users to:
• Install and set up an issue tracking database
• Define the formats of acceptable issues
• File preformatted reports in a database
• Submit issues to a database
• Query the database in a number of ways
• Edit issues in the database and resubmit them
• Merge multiple issues into a single issue
• Relate issues to each other in a hierarchical form
• Assemble groups of related issues into a document
User Classes and Characteristics
Individual Local Developers. Individual developers should be able to submit issues, edit
issues, and perform queries on the database to discover what issues are relevant to them,
which issues are open (in the case of issues to which that is relevant, such as defect reports or
unsatisfied requirements), etc. These individual developers are assumed to have some
knowledge of the development environment and are familiar and comfortable with basic
software tools such as text editors. As a result, the individual developer tools will be the
most "primitive" but also the most efficient for use, probably implemented as text-based
command line tools. Since Network simulation is primarily intended as an easy-to-use, free
tool for individual developers and small teams, this is the most critical user class to satisfy.
The tools must be relatively easy to use, and extremely easy to set up.
Local Issue Managers. Issue managers -- those responsible for keeping track of open issues,
etc. -- must have tools capable of querying the database and relating issues to developers. The
tools used for issue managers and individual developers will be very similar, as they will be
doing similar tasks -- querying the database for open issues, assigning people to issues as
appropriate, recategorizing issues or merging/splitting them, etc. However, issue managers
may not be as comfortable with "primitive" tools as individual developers, so some thought
will be given to more "scripted" or directive tools, possibly involving simple GUI elements.
However, the bulk of user-interface issues will be placed on the next user class, remote users.
Remote Users. If Network simulation is used as a defect management system, then remote
users (users of software packages submitting reports to a Network simulation center) will
constitute the bulk of submissions. If Network simulation is to be used in this way, it must
cater to the needs of these users, who will have much lower skills and will require very
simple, easy-to-use interfaces. Primarily these interfaces will focus on problem submission,
but they will also allow some ability to query the database, etc.
Operating Environment
In a computer, the operating environment includes physical factors such as temperature
affecting circuitry; but in particular the term is often used to describe the non-physical environment in which
software runs. This may apply to application software with which users interact, comprising
the "look and feel" of the system, its appearance and the things that have to be done to achieve
desired results. The term may also apply to system software; e.g., software designed for a
Unix environment will do things differently than in a Microsoft Windows environment. Some
operating environments for programming purposes are referred to as programming
environments; e.g., the "UNIX programming environment" for a Unix shell with its look and
feel and functionality.
"Operating environment" is not the totality of the functionality and appearance of an operating
system.
Design and Implementation Constraints
Architecture

[Architecture diagram: the mining tool applies the mining algorithms to the collected data, checks the customer using the RFC model, and analyzes the business customer.]

[Cluster-formation diagram: from the DB, check the user (max: high profit, gold customer; min: low profit), start the mining, then store, manage, and analyze.]
User Documentation
None.
Assumptions and Dependencies
Databases are defined as follows:
@relation 'cpu'
@attribute MYCT real
@attribute MMIN real
@attribute MMAX real
@attribute CACH real
@attribute CHMIN real
@attribute CHMAX real
@attribute class real
@data
125,256,6000,256,16,128,199
29,8000,32000,32,8,32,253
29,8000,32000,32,8,32,253
29,8000,32000,32,8,32,253
29,8000,16000,32,8,16,132
26,8000,32000,64,8,32,290
23,16000,32000,64,16,32,381
23,16000,32000,64,16,32,381
23,16000,64000,64,16,32,749
23,32000,64000,128,32,64,1238
400,1000,3000,0,1,2,23
400,512,3500,4,1,6,24
60,2000,8000,65,1,8,70
50,4000,16000,65,1,8,117
350,64,64,0,1,4,15
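The ARFF fragment above can be read with a few lines of Java. The sketch below is not Weka's own loader (Weka ships converters for this); it simply collects the attribute names from the @attribute lines and the numeric rows after @data, to show the file's structure.

```java
import java.util.*;

// Minimal sketch of parsing an ARFF fragment like the one above:
// attribute names come from @attribute lines, numeric rows follow @data.
public class ArffSketch {
    final List<String> attributes = new ArrayList<>();
    final List<double[]> rows = new ArrayList<>();

    static ArffSketch parse(String arff) {
        ArffSketch result = new ArffSketch();
        boolean inData = false;
        for (String raw : arff.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("%")) continue;   // blanks and comments
            String lower = line.toLowerCase();
            if (lower.startsWith("@attribute")) {
                result.attributes.add(line.split("\\s+")[1]);       // attribute name
            } else if (lower.startsWith("@data")) {
                inData = true;                                      // numeric rows follow
            } else if (inData) {
                String[] parts = line.split(",");
                double[] row = new double[parts.length];
                for (int i = 0; i < parts.length; i++) row[i] = Double.parseDouble(parts[i]);
                result.rows.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ArffSketch cpu = parse("@relation 'cpu'\n@attribute MYCT real\n@attribute class real\n"
                + "@data\n125,199\n29,253");
        System.out.println(cpu.attributes + " rows=" + cpu.rows.size());   // prints "[MYCT, class] rows=2"
    }
}
```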
External Interface Requirements
User Interfaces
Artificial neural networks: Non-linear predictive models that learn through
training and resemble biological neural networks in structure.
Decision trees: Tree-shaped structures that represent sets of decisions. These
decisions generate rules for the classification of a dataset. Specific decision tree
methods include Classification and Regression Trees (CART) and Chi Square
Automatic Interaction Detection (CHAID).
Genetic algorithms: Optimization techniques that use processes such as genetic
combination, mutation, and natural selection in a design based on the concepts of
evolution.
Nearest neighbor method: A technique that classifies each record in a dataset
based on a combination of the classes of the k record(s) most similar to it in a
historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
Rule induction: The extraction of useful if-then rules from data based on
statistical significance.
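The nearest neighbor method described above can be sketched as follows; the historical records, class labels, and Euclidean distance below are illustrative assumptions, not data from this project.

```java
import java.util.*;

// Sketch of the k-nearest-neighbor method: classify a record by majority
// vote over the classes of the k most similar records in a historical dataset.
public class KnnSketch {

    // Euclidean distance between two numeric records.
    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // features: historical records; labels: their classes; k >= 1.
    static String classify(double[][] features, String[] labels, double[] query, int k) {
        Integer[] idx = new Integer[features.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort record indices by distance to the query record.
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> distance(features[i], query)));
        // Majority vote over the k nearest records.
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) votes.merge(labels[idx[i]], 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        double[][] hist = {{1, 1}, {1, 2}, {8, 8}, {9, 8}};
        String[] cls = {"low", "low", "high", "high"};
        System.out.println(classify(hist, cls, new double[]{2, 1}, 3));   // prints "low"
    }
}
```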
Hardware Interfaces
Hardware Specification
Processor Type : Pentium III
Speed : 1.6 GHz
RAM : 128 MB
Hard Disk : 8 GB
Software Interfaces
Java began as a client side platform independent programming language that enabled
stand-alone Java applications and applets. The numerous benefits of Java resulted in an
explosion in the usage of Java in the back end server side enterprise systems. The Java
Development Kit (JDK), which was the original standard platform defined by Sun, was soon
supplemented by a collection of enterprise APIs. The proliferation of enterprise APIs, often
developed by several different groups, resulted in divergence of APIs and caused concern
among the Java developer community.
Java byte code can execute on the server instead of or in addition to the client,
enabling you to build traditional client/server applications and modern thin client Web
applications. Two key server side Java technologies are servlets and JavaServer Pages.
Servlets are protocol and platform independent server side components which extend the
functionality of a Web server. JavaServer Pages (JSPs) extend the functionality of servlets by
allowing Java servlet code to be embedded in an HTML file.
Features of Java
• Platform Independence
o The Write-Once-Run-Anywhere ideal has not been fully achieved (tuning for
different platforms is usually required), but Java comes closer than other languages.
• Object Oriented
• Object oriented throughout - no coding outside of class definitions, including
main().
• An extensive class library available in the core language packages.
• Compiler/Interpreter Combo
• Code is compiled to byte codes that are interpreted by a Java virtual machine
(JVM).
• This provides portability to any machine for which a virtual machine has been
written.
• The two steps of compilation and interpretation allow for extensive code
checking and improved security.
• Robust
• Exception handling built-in, strong type checking (that is, all data must be
declared an explicit type), local variables must be initialized.
• Several dangerous features of C & C++ eliminated:
• No memory pointers
• No preprocessor
• Array index limit checking
• Automatic Memory Management
• Automatic garbage collection - memory management handled by JVM.
• Security
• No memory pointers
• Programs run inside the virtual machine sandbox.
• Array index limit checking
• Code pathologies reduced by:
• Byte code verifier - checks classes after loading
• Class loader - confines objects to unique namespaces. Prevents loading a
hacked "java.lang.SecurityManager" class, for example.
• Security manager - determines what resources a class can access such as
reading and writing to the local disk.
• Dynamic Binding
• The linking of data and methods to where they are located is done at run-time.
• New classes can be loaded while a program is running. Linking is done on the
fly.
• Even if libraries are recompiled, there is no need to recompile code that uses
classes in those libraries. This differs from C++, which uses static binding.
This can result in fragile classes for cases where linked code is changed and
memory pointers then point to the wrong addresses.
• Good Performance
• Interpretation of byte codes slowed performance in early versions, but
advanced virtual machines with adaptive and just-in-time compilation and
other techniques now typically provide performance at 50% to 100% of the
speed of C++ programs.
• Threading
• Lightweight processes, called threads, can easily be spun off to perform
multiprocessing.
• Can take advantage of multiprocessors where available
• Great for multimedia displays.
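The threading support described above can be shown in a few lines; the counter example below is purely illustrative, not part of this system.

```java
// Two lightweight threads increment a shared counter concurrently;
// synchronization ensures no updates are lost, and join() waits for
// both threads to finish before the total is printed.
public class ThreadSketch {
    static int counter = 0;

    static synchronized void increment() { counter++; }

    public static void main(String[] args) {
        Runnable task = () -> { for (int i = 0; i < 1000; i++) increment(); };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        try {
            t1.join();   // wait for both threads before reading the result
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        System.out.println(counter);   // prints "2000"
    }
}
```

Without the synchronized keyword the two threads could interleave their read-modify-write steps and lose updates, so the printed total would be unpredictable.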
• Built-in Networking
• Java was designed with networking in mind and comes with many classes to
develop sophisticated Internet communications.
Communications Interfaces
ECLIPSE
Eclipse is an open-source software framework written primarily in Java. The
initial codebase originated from VisualAge. In its default form it is an Integrated
Development Environment (IDE) for Java developers, consisting of the Java Development
Tools (JDT). Users can extend its capabilities by installing plug-ins written for the Eclipse
software framework, such as development toolkits for other programming languages, and can
write and contribute their own plug-in modules. Language packs provide translations into over
a dozen natural languages.
4.1.1 ARCHITECTURE:
The basis for Eclipse is the Rich Client Platform (RCP). The following
components constitute the rich client platform:
• OSGi - a standard bundling framework
• Core platform - boot Eclipse, run plug-ins
• The Standard Widget Toolkit (SWT) - a portable widget toolkit
• JFace - viewer classes to bring model view controller programming to SWT,
file buffers, text handling, and text editors
• The Eclipse Workbench - views, editors, perspectives, wizards
Eclipse's widgets are implemented by a widget toolkit for Java called SWT,
unlike most Java applications, which use the Java standard Abstract Window Toolkit(AWT)
or Swing. Eclipse's user interface also leverages an intermediate GUI layer called JFace,
which simplifies the construction of applications based on SWT.
Eclipse employs plug-ins in order to provide all of its functionality on top of (and including)
the rich client platform, in contrast to some other applications where functionality is typically
hard coded. This plug-in mechanism is a lightweight software componentry framework. In
addition to allowing Eclipse to be extended using other programming languages such as C and
Python, the plug-in framework allows Eclipse to work with typesetting languages like LaTeX,
networking applications such as telnet, and database management systems. The plug-in
architecture supports writing any desired extension to the environment, such as for
configuration management. Java and CVS support is provided in the Eclipse SDK.
The key to the seamless integration of tools with Eclipse is the plugin. With the exception of
a small run-time kernel, everything in Eclipse is a plug-in. This means that a plug-in you
develop integrates with Eclipse in exactly the same way as other plug-ins; in this respect, all
features are created equal. Eclipse provides plugins for a wide variety of features, some of
which are through third parties using both free and commercial models. Examples of plugins
include UML plugin for Sequence and other UML diagrams, plugin for Database explorer,
etc.
The Eclipse SDK includes the Eclipse Java Development Tools, offering an IDE with a built-
in incremental Java compiler and a full model of the Java source files. This allows for
advanced refactoring techniques and code analysis. The IDE also makes use of a workspace,
in this case a set of metadata over a flat file space, allowing external file modifications as long
as the corresponding workspace "resource" is refreshed afterwards. The Visual Editor project
allows interfaces to be created interactively, hence allowing Eclipse to be used as a RAD tool.
4.1.2 HISTORY
Eclipse began as an IBM Canada project. It was developed by OTI (Object Technology
International) as a replacement for VisualAge, which itself had been developed by OTI. In
November 2001, a consortium was formed to further the development of Eclipse as open
source. In 2003, the Eclipse Foundation was created.
Eclipse 3.0 (released on June 21 2004) selected the OSGi Service Platform specifications as
the runtime architecture.
Eclipse was originally released under the Common Public License, but was later re-licensed
under the Eclipse Public License. The Free Software Foundation has said that both licenses
are free software licenses, but are incompatible with the GNU General Public License (GPL).
Mike Milinkovich, of the Eclipse Foundation has commented that moving to the GPL will be
considered when version 3 of the GPL is released.
4.1.3 MYECLIPSE:
MyEclipse is a commercially available Enterprise Java and AJAX IDE created and
maintained by the company Genuitec, a founding member of the Eclipse Foundation.
MyEclipse is built upon the Eclipse platform, and integrates both proprietary and open source
solutions into the development environment.
MyEclipse has two primary versions: a professional and a standard edition. The
standard edition adds database tools, a visual web designer, persistence tools, Spring tools,
Struts and JSF tooling, and a number of other features to the basic Eclipse Java Developer
profile. It competes with the Web Tools Project, which is a part of Eclipse itself, but
MyEclipse is a separate project entirely and offers a different feature set. Most recently,
MyEclipse has been made available via Pulse, a provisioning tool that maintains Eclipse
software profiles, including those that use MyEclipse.
System Features
Embedding Data into Weka Data mining tool
Weka (Waikato Environment for Knowledge Analysis) is a Java-based data mining
tool developed by Waikato University. After loading the dataset into it, the preprocess
function of Weka allows the user to filter out undesired attributes to prevent them from
affecting the quality of extracted knowledge. Next, the user can apply one of three
techniques to mine the data: classification, clustering, or association rules.
Data Mining is playing a key role in most enterprises, which have to analyse great
amounts of data in order to achieve higher profits. Nevertheless, due to the large datasets
involved in this process, the data mining field must face some technological challenges. Grid
Computing takes advantage of the low-load periods of all the computers connected to a
network, making possible resource and data sharing. Providing Grid services constitute a
flexible manner of tackling the data mining needs. This paper shows the adaptation of Weka, a
widely used Data Mining tool, to a grid infrastructure.
Classifiers in WEKA are models for predicting nominal or numeric quantities.
Implemented learning schemes include decision trees and lists, instance-based classifiers,
support vector machines, multi-layer perceptrons, logistic regression, and Bayes nets.
“Meta”-classifiers include bagging, boosting, stacking, error-correcting output codes, and
locally weighted learning.
WEKA contains “clusters” for finding groups of similar instances in a dataset
Implemented schemes are: k-Means, EM, Cobweb, Farthest First , Clusters can be visualized
Ambit lick Solutions
Mail Id : Ambitlick@gmail.com , Ambitlicksolutions@gmail.Com
19. and compared to “true” clusters (if given) Evaluation based on log likelihood if clustering
scheme produces a probability distribution
Suppose you have some data and you want to build a decision tree from it. A common
situation is for the data to be stored in a spreadsheet or database. However, Weka expects it to
be in ARFF format, introduced in Section 2.4, because it needs type information about each
attribute, which cannot be automatically deduced from the attribute values. Before you can
apply any algorithm to your data, it must be converted to ARFF form. This can be
done very easily. Recall that the bulk of an ARFF file consists of a list of all the instances,
with the attribute values for each instance being separated by commas (Figure 2.2). Most
spreadsheet and database programs allow you to export your data into a file in comma
separated format—as a list of records where the items are separated by commas.
Once this has been done, you need only load the file into a text editor or a word
processor; add the dataset’s name using the @relation tag, the attribute information using
@attribute, and a @data line; save the file as raw text—and you’re done! In the following
example we assume that your data is stored in a Microsoft Excel spreadsheet, and you’re
using Microsoft Word for text processing. Of course, the process of converting data into
ARFF format is very similar for other software packages. Figure 8.1a shows an Excel
spreadsheet containing the weather data. It is easy to save this data in comma-separated
format. First, select the Save As… item from the File pull-down menu. Then, in the ensuing
dialog box, select CSV. Now load this file into Microsoft Word.
The rows of the original spreadsheet have been converted into lines of text, and the
elements are separated from each other by commas. All you have to do is convert the first
line, which holds the attribute names, into the header structure that makes up the beginning of
an ARFF file. The dataset’s name is introduced by a @relation tag, and the
names, types, and values of each attribute are defined by @attribute tags. The data section of
the ARFF file begins with a @data tag. Once the structure of your dataset matches this
layout, you should save it as a text file.
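As a sketch of the finished result, the standard weather dataset distributed with Weka begins like this in ARFF format (only the first few data rows are shown):

```
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
```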
Choose Save as… from the File menu, and specify Text Only with Line Breaks as the
file type by using the corresponding popup menu. Enter a file name, and press the Save
button. We suggest that you rename the file to weather.arff to indicate that it is in ARFF
format. Note that the classification schemes in Weka assume by default that the class is the
last attribute in the ARFF file, which fortunately it is in this case. (We explain in Section 8.3
below how to override this default.) Now you can start analyzing this data using the
algorithms provided. In the following we assume that you have downloaded Weka to your
system, and that your Java environment knows where to find the library. (More information
on how to do this can be found at the Weka Web site.) To see what the C4.5 decision tree
learner described in Section 6.1 does with this dataset, we use the J4.8 algorithm, which is
Weka’s implementation of this decision tree learner. (J4.8 actually implements a later and
slightly improved version called C4.5 Revision 8, which was the last public version of this
family of algorithms before C5.0, a commercial implementation, was released.) Type java
weka.classifiers.j48.J48 -t weather.arff at the command line.
This incantation calls the Java virtual machine and instructs it to execute the J48
algorithm from the j48 package—a sub package of classifiers, which is part of the overall
weka package. Weka is organized in “packages” that correspond to a directory hierarchy.
We’ll give more details of the package structure in the next section: in this case, the sub
package name is j48 and the program to be executed from it is called J48. The –t option
informs the algorithm that the next argument is the name of the training file. After pressing
Return, you’ll see the output shown.
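The correspondence between package names and directories can be pictured as follows (a simplified sketch of the layout, not a complete listing):

```
weka/                     the top-level package
    core/                 weka.core
    classifiers/          weka.classifiers
        j48/              weka.classifiers.j48
            J48.class     run as: java weka.classifiers.j48.J48
```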
5.2.2.1 The weka.core package
The core package is central to the Weka system. It contains classes that are accessed
from almost every other class. You can find out what they are by clicking on the hyperlink
underlying weka.core, which brings up its documentation page. The page is divided into two parts: the Interface
Index and the Class Index. The latter is a list of all classes contained within the package, while
the former lists all the interfaces it provides. An interface is very similar to a class, the only
difference being that it doesn’t actually do anything by itself—it is merely a list of methods
without actual implementations. Other classes can declare that they “implement” a particular
interface, and then provide code for its methods. For example, the OptionHandler interface
defines those methods that are implemented by all classes that can process command-line
options—including all classifiers.
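To make the interface mechanism concrete, here is a minimal sketch in plain Java. The method names mirror Weka's OptionHandler, but the classes below are simplified stand-ins written for illustration, not Weka's actual source:

```java
// A simplified stand-in for Weka's OptionHandler interface: just a list of
// method signatures, with no implementations of its own.
interface OptionHandler {
    void setOptions(String[] options);
    String[] getOptions();
}

// A class that declares it "implements" the interface must supply code
// for every method the interface lists.
class SimpleHandler implements OptionHandler {
    private String[] options = new String[0];

    public void setOptions(String[] options) {
        this.options = options.clone();
    }

    public String[] getOptions() {
        return options.clone();
    }
}
```

Any code that accepts an OptionHandler can then process command-line options without knowing which concrete class it is dealing with.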
The key classes in the core package are called Attribute, Instance, and Instances. An
object of class Attribute represents an attribute. It contains the attribute’s name, its type and,
in the case of a nominal attribute, its possible values. An object of class Instance contains the
attribute values of a particular instance; and an object of class Instances holds an ordered set
of instances, in other words, a dataset. By clicking on the hyperlinks underlying the classes,
you can find out more about them. However, you need not know the details just to use Weka
from the command line. We will return to these classes in Section 8.4 when we discuss how to
access the machine learning routines from other Java code. Clicking on the All Packages
hyperlink in the upper left corner of any documentation page brings you back to the listing of
all the packages in Weka.
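The relationship between the three classes can be sketched with simplified stand-ins (illustrative only; the real weka.core classes carry much more functionality):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for weka.core.Attribute: a name plus, for a nominal attribute,
// the list of its possible values (null for a numeric attribute).
class Attribute {
    final String name;
    final List<String> values;

    Attribute(String name) { this(name, null); }      // numeric attribute
    Attribute(String name, List<String> values) {     // nominal attribute
        this.name = name;
        this.values = values;
    }
}

// Stand-in for weka.core.Instance: the attribute values of one example.
class Instance {
    final double[] values;
    Instance(double... values) { this.values = values; }
}

// Stand-in for weka.core.Instances: an ordered set of instances, i.e. a dataset.
class Instances {
    final String relationName;
    final List<Attribute> attributes;
    private final List<Instance> list = new ArrayList<>();

    Instances(String relationName, List<Attribute> attributes) {
        this.relationName = relationName;
        this.attributes = attributes;
    }

    void add(Instance inst) { list.add(inst); }
    int numInstances() { return list.size(); }
    int numAttributes() { return attributes.size(); }
}
```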
5.2.2.2 The weka.classifiers package
The classifiers package contains implementations of most of the algorithms for
classification and numeric prediction that have been discussed above. (Numeric
prediction is included in classifiers: it is interpreted as prediction of a continuous class.) The
most important class in this package is Classifier, which defines the general structure of any
scheme for classification or numeric prediction. It contains two methods, buildClassifier() and
classifyInstance(), which all of these learning algorithms have to implement. In the jargon of
object-oriented programming, the learning algorithms are represented by subclasses of
Classifier, and therefore automatically inherit these two methods. Every scheme redefines
them according to how it builds a classifier and how it
classifies instances. This gives a uniform interface for building and using classifiers from
other Java code.
Hence, for example, the same evaluation module can be used to evaluate the
performance of any classifier in Weka. Another important class is DistributionClassifier. This
subclass of Classifier defines the method distributionForInstance(), which returns a
probability distribution for a given instance. Any classifier that can calculate class
probabilities is a subclass of DistributionClassifier and implements this method.
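The hierarchy just described can be sketched as follows. This is a toy mock-up of the structure, not Weka's real code: the real methods operate on Weka's Instances and Instance types rather than on raw arrays.

```java
// Stand-in for weka.classifiers.Classifier: every learning scheme must
// implement these two methods, giving a uniform interface.
abstract class Classifier {
    // Build a model from training data: rows of attribute values,
    // with the class value in the last position.
    public abstract void buildClassifier(double[][] data);

    // Predict the class value of a single unseen instance.
    public abstract double classifyInstance(double[] instance);
}

// Stand-in for DistributionClassifier: adds class-probability estimates.
abstract class DistributionClassifier extends Classifier {
    public abstract double[] distributionForInstance(double[] instance);
}

// A toy scheme that predicts the majority class and its relative frequency.
class MajorityClassifier extends DistributionClassifier {
    private double prediction;
    private double frequency;

    public void buildClassifier(double[][] data) {
        java.util.Map<Double, Integer> counts = new java.util.HashMap<>();
        for (double[] row : data) {
            counts.merge(row[row.length - 1], 1, Integer::sum);
        }
        java.util.Map.Entry<Double, Integer> best =
            java.util.Collections.max(counts.entrySet(),
                java.util.Map.Entry.comparingByValue());
        prediction = best.getKey();
        frequency = (double) best.getValue() / data.length;
    }

    public double classifyInstance(double[] instance) {
        return prediction;
    }

    // Two-entry distribution: majority class versus everything else.
    public double[] distributionForInstance(double[] instance) {
        return new double[] { frequency, 1.0 - frequency };
    }
}
```

Because evaluation code only ever calls the abstract methods, any subclass can be plugged into the same evaluation harness unchanged.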
To see an example, click on DecisionStump, which is a class for building a simple one-level
binary decision tree (with an extra branch for missing values). You have to use this rather
lengthy expression if you want to build a decision stump from the command line. The page
then displays a tree structure showing the relevant part of the class hierarchy. As you can see,
DecisionStump is a subclass of DistributionClassifier, and therefore produces class
probabilities. DistributionClassifier, in turn, is a subclass of Classifier, which is itself a
subclass of Object. The Object class is the most general one in Java: all classes are
automatically subclasses of it. After some generic information about the class, its author, and
its version, it gives an index of the constructors and methods of this class.
A constructor is a special kind of method that is called whenever an object of that
class is created, usually initializing the variables that collectively define its state. The index of
methods lists the name of each one, the type of parameters it takes, and a short description of
its functionality. Beneath those indexes, the Web page gives more details about the
constructors and methods. We return to those details later. As you can see, DecisionStump
implements all methods required by both a Classifier and a DistributionClassifier. In addition,
it contains toString() and main() methods. The former returns a textual description of the
classifier, used whenever it is printed on the screen. The latter is called every time you ask for
a decision stump from the command line, in other words, every time you enter a command
beginning with java weka.classifiers.DecisionStump.
The presence of a main() method in a class indicates that it can be run from the command line,
and all learning methods and filter algorithms implement it.
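The toString()/main() convention can be illustrated with a small stand-alone sketch (not Weka's actual DecisionStump source):

```java
// A minimal class following the convention described above.
class StumpDemo {
    private final String description;

    // Constructor: called whenever an object is created, initializing its state.
    StumpDemo(String description) {
        this.description = description;
    }

    // toString(): a textual description, used whenever the object is printed.
    @Override
    public String toString() {
        return "StumpDemo: " + description;
    }

    // main(): makes the class runnable from the command line, e.g. java StumpDemo
    public static void main(String[] args) {
        System.out.println(new StumpDemo("one-level decision tree"));
    }
}
```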
Waikato Environment for Knowledge Analysis
Collection of state-of-the-art machine learning algorithms and data processing
tools implemented in Java
o Released under the GPL
Support for the whole process of experimental data mining
o Preparation of input data
o Statistical evaluation of learning schemes
o Visualization of input data and the result of learning
Used for education, research and applications
Complements “Data Mining” by Witten & Frank
5.2.2.3 Features
49 data preprocessing tools
76 classification/regression algorithms
8 clustering algorithms
15 attribute/subset evaluators + 10 search algorithms for feature selection
3 algorithms for finding association rules
3 graphical user interfaces
o “The Explorer” (exploratory data analysis)
o “The Experimenter” (experimental environment)
o “The Knowledge Flow” (new process model inspired interface)
Continue to develop and support WEKA
MOA (Massive Online Analysis)
o Framework that supports learning from data streams
Facilities for data generation, experimental analysis, learning
algorithms, etc.
o The Moa (another native NZ bird) is not only flightless, like the Weka, but also
extinct
o First public release, probably this Christmas, or perhaps Thanksgiving (as it’s
just another turkey)
MILK
o Multi-Instance Learning Kit
Proper
o Propositionalization toolbox for WEKA
Other Nonfunctional Requirements
Performance Requirements
The system has no specific performance requirements at this time.
Safety Requirements
The system has no specific safety requirements at this time, except to the extent that it is
designed to run without root access.
Security Requirements
The system has no specific security requirements at this time.
Software Quality Attributes
No additional software quality attributes are addressed in the requirements at this time.
Business Rules
There are no explicit business rules for operation of the system at this time. All users
with access to the command line tools and a copy of the repository will be allowed to perform
all actions. Additional security measures and procedures may be added at a future date.
Other Requirements
There are no additional requirements for the product at this time.
Appendix A: Glossary