Information communication technology in libya for educational purposes
Edi text
Text Mining Documents in Electronic Data Interchange Environment
Dr. Zakaria Suliman Zubi,
Associate Professor,
Computer Science Department,
Faculty of Science,
Sirt University,
Sirt, Libya.
LOGO
2. Add your company slogan
Contents
1 Abstract.
2 Introduction.
3 Types of Text Mining.
4 Types of Information and Methods.
5 Methods and Algorithms used.
6 Types of Outputs.
7 Applications of Text Mining in EDI databases.
8 Experimental Results.
9 Conclusion.
www.themegallery.com LOGO
Abstract
1. The Internet is a huge source of electronic text documents in many languages.
2. Electronic documents can be interchanged over the web through Electronic Data Interchange (EDI) environments.
3. Text data can be exchanged over the web in an EDI format such as X12.
4. The EDI format can be transformed and stored in a database.
5. The EDI database is normalized and mapped into a flat file, such as a spreadsheet.
6. Text mining with a clustering method is then applied.
7. The k-means algorithm is used with the Euclidean distance measure.
8. We generate a dataset using a text mining application called WEKA to show some experimental results.
Introduction
1. The Internet is a huge source of electronic documents in many languages.
2. Electronic documents may contain text, images, audio, and video.
3. Text documents may contain text in languages written in the Latin script, such as English, French, and Spanish, or in languages written in other scripts, such as Arabic, Chinese, Japanese, and Indian languages.
4. The text content is usually the most valuable part of an electronic document, which makes text mining and information retrieval approaches a natural fit.
5. Electronic Data Interchange (EDI) is another approach to interchanging electronic documents over the web.
Introduction (continued)
6. EDI is becoming increasingly important as an easy mechanism for organizations to manage, buy, sell, and trade information. ANSI has approved a set of EDI standards known as the X12 standards.
7. The X12 standards define how the electronic documents are represented.
8. Agreeing on these standards is a necessary condition for any two organizations to start business transactions.
9. The EDI format can be transformed and stored in a database.
Figure: EDI documents-to-database-to-text-mining life cycle.
Introduction (continued)
10. The EDI database is normalized and mapped into a flat file, such as a spreadsheet.
11. Text mining with a clustering method is then applied.
12. The k-means algorithm is used with the Euclidean distance measure.
13. We generate a dataset using a text mining application called WEKA to show some experimental results.
Types of Text Mining
The purposes of using text mining or data mining are:
To improve customer acquisition and retention.
To reduce fraud.
To identify internal inefficiencies and then revamp operations.
To map the unexplored terrain of the Internet.
The major types of tools used in text mining are:
I. Artificial neural networks;
II. Decision trees;
III. Genetic algorithms;
IV. Rule induction;
V. Nearest neighbor method;
VI. Data visualization.
Types of Information and Methods
Text mining usually produces five types of information:
1. Associations: arise when occurrences are linked in a single event.
2. Sequences: events linked over time, based on the order in which they happen.
3. Classification: can help us discover the characteristics of customers who are likely to leave, and provides a model that can be used to predict who they are.
4. Forecasting: estimates the future value of continuous variables, such as sales figures, based on patterns within the data.
5. Clustering: one of the essential methods used in text mining approaches to discover different groupings within the data.
Types of Information and Methods (continued)
Clustering:
1. Is an unsupervised learning process applied to the text data, without depending on pre-specified knowledge such as class labels.
2. We use a common partitioning method, the k-means algorithm.
3. We measure the distance of each document from the cluster centroids using the Euclidean distance.
4. The aim is to improve the performance of text mining on electronic documents.
Methods and Algorithms used
1. Clustering using the k-means algorithm:
The k-means algorithm assigns each point to the cluster whose centroid is nearest.
The centroid is the average of all the points in the cluster; that is, its coordinates are the arithmetic mean, for each dimension separately, over all the points in the cluster.
For example, suppose the data set has three dimensions and the cluster has two points, X = (x1, x2, x3) and Y = (y1, y2, y3). Then the centroid Z is Z = (z1, z2, z3), where z1 = (x1 + y1)/2, z2 = (x2 + y2)/2, and z3 = (x3 + y3)/2.
The algorithm steps are:
1. Input: D := {d1, d2, ..., dn}; k := the number of clusters;
2. Select k document vectors as the initial centroids of the k clusters;
3. Repeat:
4. Select one vector d from the remaining documents;
5. Compute the similarities between d and the k centroids;
6. Put d in the closest cluster and recompute the centroids;
7. Until the centroids do not change;
8. Output: k clusters of documents.
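The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: it uses the common batch variant (reassign every document, then recompute all centroids) rather than updating after each single document, and it assumes the first k vectors are taken as the initial centroids. The function names and the sample vectors are made up for the example.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(points):
    """Arithmetic mean of the points, taken separately per dimension."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def k_means(docs, k, max_iter=100):
    """Cluster document vectors into k groups (batch k-means sketch)."""
    centroids = [tuple(d) for d in docs[:k]]  # assumption: first k vectors as seeds
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for d in docs:
            # Put d in the cluster whose centroid is nearest.
            nearest = min(range(k), key=lambda j: euclidean(d, centroids[j]))
            clusters[nearest].append(d)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        new_centroids = [
            centroid(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # stop when the centroids do not change
            break
        centroids = new_centroids
    return clusters, centroids

# Hypothetical three-dimensional document vectors.
docs = [(1.0, 0.0, 0.0), (0.0, 5.0, 5.0), (2.0, 0.0, 0.0), (0.0, 6.0, 4.0)]
clusters, centroids = k_means(docs, k=2)
print(centroids)  # [(1.5, 0.0, 0.0), (0.0, 5.5, 4.5)]
```

Each centroid printed here is the per-dimension mean of its two member vectors, matching the Z = (z1, z2, z3) example above.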
Methods and Algorithms used (continued)
2. Bag-of-words document: representing the electronic documents in the EDI database as bags of words leads to the following features:
A text document is represented by the words it contains (and their occurrences), e.g., "Lord of the rings" → {"the", "lord", "rings", "of"}. This representation is highly efficient, which makes learning far simpler and easier. The order of words is not important for certain applications.
Stemming, which identifies a word by its root, is also applied, e.g., flying, flew → fly. It is used to reduce dimensionality.
Stop words are also removed, since the most common words are unlikely to help text mining, e.g., "the", "a", "an", "you", etc.
Each document is represented by the set of its word frequencies and the categories that it belongs to.
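A minimal sketch of this representation is shown below, assuming whitespace tokenization. The stop-word set contains only the four examples named on this slide, and the stemming table is a toy lookup standing in for a real stemmer; both are illustrative assumptions, not the lists used in the study.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "you"}      # only the examples named above
STEMS = {"flying": "fly", "flew": "fly"}    # toy lookup, not a real stemmer

def bag_of_words(text):
    """Return a word -> frequency mapping; word order is discarded."""
    tokens = text.lower().split()
    kept = [STEMS.get(t, t) for t in tokens if t not in STOP_WORDS]
    return dict(Counter(kept))

print(bag_of_words("Lord of the rings"))
# {'lord': 1, 'of': 1, 'rings': 1}
print(bag_of_words("You flew the plane a pilot was flying"))
# {'fly': 2, 'plane': 1, 'pilot': 1, 'was': 1}
```

In the second call, "flew" and "flying" collapse into the single dimension "fly", which is how stemming reduces dimensionality.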
Methods and Algorithms used (continued)
3. Text in EDI document representation:
An EDI text document is represented as a bag of words, in which words appear independently and their order is ignored.
Each word corresponds to a dimension in the resulting data space, and each document then becomes a vector of non-negative values, one per dimension. We also remove stop words.
We use the frequency of each term as its weight, which means that terms that appear more frequently are treated as more important and descriptive for the document.
Let D = {d1, ..., dn} be a set of documents and T = {t1, ..., tm} the set of distinct terms occurring in D.
A document d is represented as a vector t_d. Let tf(d, t) denote the frequency of term t ∈ T in document d ∈ D. Then the vector representation of a document d is t_d = (tf(d, t1), ..., tf(d, tm)).
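The definition of t_d can be illustrated with a short sketch. It assumes the documents are already tokenized and stripped of stop words; the two sample documents and the function name are hypothetical, and sorting the terms is just one way to fix the order of the dimensions.

```python
def term_vectors(documents):
    """Map each tokenized document d to its term-frequency vector t_d."""
    # T = {t1, ..., tm}: the distinct terms occurring in D, in a fixed order.
    terms = sorted({t for doc in documents for t in doc})
    # t_d = (tf(d, t1), ..., tf(d, tm))
    vectors = [[doc.count(t) for t in terms] for doc in documents]
    return terms, vectors

docs = [
    ["payment", "order", "payment"],
    ["invoice", "order"],
]
terms, vectors = term_vectors(docs)
print(terms)    # ['invoice', 'order', 'payment']
print(vectors)  # [[0, 1, 2], [1, 1, 0]]
```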
Methods and Algorithms used (continued)
4. Distance measures map the distance between the representative descriptions of two objects into a single numeric value, which depends on two factors: the properties of the two objects and the measure itself.
To qualify as a metric, a distance measure d must satisfy the following four conditions. Let x and y be any two objects (electronic documents) in a data set, and let d(x, y) be the distance between x and y.
1. The distance between any two points must be non-negative, that is, d(x, y) ≥ 0.
2. The distance between two objects must be zero if and only if the two objects are identical, that is, d(x, y) = 0 if and only if x = y.
3. Distance must be symmetric, that is, the distance from x to y is the same as the distance from y to x, i.e. d(x, y) = d(y, x).
4. The measure must satisfy the triangle inequality, that is, d(x, z) ≤ d(x, y) + d(y, z) for any third object z.
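The four conditions can be spot-checked numerically on a handful of points. The sketch below is an illustration only: passing the check on a sample does not prove that a measure is a metric, but a single failure shows that it is not. The helper name, the tolerance, and the sample points are assumptions; math.dist is the standard-library Euclidean distance (Python 3.8+), and the sum of squared differences is included as a contrasting measure.

```python
import math
from itertools import product

def looks_like_metric(dist, points, tol=1e-12):
    """Check the four metric conditions on every triple drawn from points."""
    for x, y, z in product(points, repeat=3):
        dxy = dist(x, y)
        if dxy < 0:                                # 1. non-negativity
            return False
        if (dxy == 0) != (x == y):                 # 2. zero iff identical
            return False
        if abs(dxy - dist(y, x)) > tol:            # 3. symmetry
            return False
        if dist(x, z) > dxy + dist(y, z) + tol:    # 4. triangle inequality
            return False
    return True

def squared_differences(x, y):
    """Sum of squared coordinate differences (no square root taken)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
print(looks_like_metric(math.dist, points))            # True
print(looks_like_metric(squared_differences, points))  # False
```

The second measure fails condition 4: for the collinear points (0, 0), (1, 0), (2, 0) it gives 4 for the outer pair but only 1 + 1 for the two inner steps.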
Methods and Algorithms used (continued)
Euclidean distance measure:
A widely used measure in text clustering problems.
It is also the default distance measure used with the k-means algorithm.
It is the ordinary distance between two points and can easily be measured with a ruler in two- or three-dimensional space.
Given two documents d_a and d_b represented by their term vectors t_a and t_b respectively, the Euclidean distance between the two documents is defined as:
$\mathrm{distance}(t_a, t_b) = \left( \sum_{i=1}^{m} \left( tf(d_a, t_i) - tf(d_b, t_i) \right)^2 \right)^{1/2}$
Equivalently, for two vectors x and y:
$\mathrm{distance}(x, y) = \left( \sum_{i} (x_i - y_i)^2 \right)^{1/2}$
Squared Euclidean distance is used when we want to place greater weight on objects that are further apart. It is computed as:
$\mathrm{distance}(x, y) = \sum_{i} (x_i - y_i)^2$
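Both formulas translate directly into code. The sketch below uses three hypothetical term vectors to show how squaring places more weight on the pair that is further apart; the vector values are made up for illustration.

```python
import math

def squared_euclidean(x, y):
    """Sum over i of (x_i - y_i)^2."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def euclidean(x, y):
    """Square root of the squared Euclidean distance."""
    return math.sqrt(squared_euclidean(x, y))

# Hypothetical term-frequency vectors over the same three terms.
t_a = [2, 0, 1]
t_b = [0, 0, 1]
t_c = [6, 0, 1]

print(euclidean(t_a, t_b), euclidean(t_a, t_c))                  # 2.0 4.0
print(squared_euclidean(t_a, t_b), squared_euclidean(t_a, t_c))  # 4 16
```

Here t_c is twice as far from t_a as t_b is under the Euclidean distance, but four times as far under the squared version.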
Methods and Algorithms used (continued)
5. Dataset: We propose a collection of banking-transaction EDI electronic text data gathered from EDI databases.
1. The EDI text data were collected and aggregated into seven main categories.
2. We created an EDI corpus.
3. This corpus represents the dataset, which consists of 2000 EDI electronic documents of different lengths that belong to the seven categories.
4. The categories are transaction divisions in the X12 standard EDI format.
Methods and Algorithms used (continued)
6. Translating EDI to databases:
1) This is an essential process for storing and accessing our transaction information in a valid database format, with support for all common database formats.
2) It can be done by translating an EDI message in the X12 standard format into a variety of transactions.
3) Each transaction file format is identified by a mapping file and can be transformed into a flat file format.
4) Mapping the translated EDI message into the database constructs a database, as illustrated in the figure.
5) This flat file can be in any common form, for instance comma-separated format. The redundancy of data in the flat table can be clearly seen from a small portion of an EDI file.
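As a rough illustration of the EDI-to-flat-file step, the sketch below splits a heavily simplified X12-style message into rows and writes them in comma-separated format. The sample message, the function names, and the column names are hypothetical, and the delimiters ("~" ending a segment, "*" separating elements) are assumed rather than read from an interchange header. A real translator would also apply a mapping file per transaction type, which is not modeled here.

```python
import csv
import io

# Hypothetical, simplified X12-style message (not a complete interchange).
SAMPLE_MESSAGE = "ST*820*0001~BPR*C*1500.00~SE*3*0001~"

def segments_to_rows(message, segment_end="~", element_sep="*"):
    """Split a message into segments, and each segment into its elements."""
    return [
        seg.split(element_sep)
        for seg in message.split(segment_end)
        if seg.strip()
    ]

def rows_to_csv(rows):
    """Write the rows as a comma-separated flat file, padded to equal width."""
    width = max(len(r) for r in rows)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["segment"] + [f"element_{i}" for i in range(1, width)])
    for r in rows:
        writer.writerow(r + [""] * (width - len(r)))
    return buffer.getvalue()

rows = segments_to_rows(SAMPLE_MESSAGE)
flat_file = rows_to_csv(rows)
print(flat_file)
```

The resulting text can be opened as a spreadsheet or loaded into a database table, one row per segment.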
Types of Outputs
Using text mining on EDI data, a retailer can identify the demographics of its customers, such as gender, marital status, and number of children, as well as the products that they buy.
This information can have a tremendous positive impact on operations by decreasing inventory movement and by placing inventory in locations where it is likely to sell.
1. Buying patterns of customers; associations among customer demographic characteristics; predictions on which customers will respond to which mailings;
2. Patterns of fraudulent credit card usage; identities of “loyal” customers; credit card spending by customer groups; predictions of customers who are likely to change their credit card affiliation;
3. Predictions on which customers will buy new insurance policies; behavior patterns of risky customers; expectations of fraudulent behavior;
4. Characterizations of patient behavior to predict frequency of office visits.
Applications of Text Mining in EDI databases
Text mining and EDI applications can be used in a variety of sectors: consumer product sales, finance, manufacturing, health, banking, insurance, and utilities.
We can benefit from these technologies (text mining and EDI) if the following types of data are available in EDI databases to support text-mining applications for customer-based businesses:
1) demographics, such as age, gender, and marital status;
2) banking and economic status, such as salary, profession, and household income;
3) geographic details, such as city, state, or region;
4) other details, such as education or hobbies.
Experimental Results
We generate the dataset using a common text mining application called WEKA, applying the k-means algorithm with the Euclidean distance measure to assign every item to its nearest cluster center.
The EDI banking text dataset is normalized into a flat file and represented in comma-separated format, from which a primary dataset is created.
The resulting data file consists of 600 instances.
We use the k-means algorithm to cluster the customers in the bank dataset and to characterize the resulting customer segments.
Because k-means accepts only numerical attribute values, we convert the dataset into the standard spreadsheet format and convert categorical attributes to binary.
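The conversion of categorical attributes to binary can be sketched as follows. This is an assumed one-column-per-value encoding written by hand for illustration; the attribute names and values are hypothetical and are not taken from the bank dataset, and the experiments themselves relied on WEKA for this step.

```python
def to_binary(rows, categorical):
    """Replace each categorical column with one 0/1 column per observed value."""
    values = {c: sorted({r[c] for r in rows}) for c in categorical}
    encoded = []
    for r in rows:
        # Numeric attributes are copied through unchanged.
        out = {k: v for k, v in r.items() if k not in categorical}
        for c in categorical:
            for v in values[c]:
                out[f"{c}={v}"] = 1 if r[c] == v else 0
        encoded.append(out)
    return encoded

# Hypothetical customer records.
rows = [
    {"age": 48, "region": "TOWN", "married": "NO"},
    {"age": 40, "region": "RURAL", "married": "YES"},
]
encoded = to_binary(rows, categorical=["region", "married"])
print(encoded[0])
# {'age': 48, 'region=RURAL': 0, 'region=TOWN': 1, 'married=NO': 1, 'married=YES': 0}
```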
Experimental Results (continued)
The WEKA k-means algorithm uses the Euclidean distance measure to compute distances between instances and clusters.
We enter seven as the number of clusters, together with a seed value that is used to generate the random numbers for the initial assignment of instances to clusters.
WEKA reports the centroid of every cluster, as well as statistics on the number and percentage of instances assigned to the different clusters.
Cluster centroids are the mean vectors for each cluster (so each dimension value in the centroid corresponds to the mean value for that dimension in the cluster).
In the final data portion, each instance has its assigned cluster as the last attribute value.
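The kind of summary described here (instances per cluster, their percentages, and the cluster appended as the last attribute) can be reproduced with a short sketch. The instances and assignments below are made-up values for illustration only; they are not WEKA output or results from the bank dataset.

```python
from collections import Counter

def cluster_summary(assignments):
    """Return {cluster: (count, percentage)} for a list of cluster labels."""
    counts = Counter(assignments)
    total = len(assignments)
    return {c: (n, 100.0 * n / total) for c, n in sorted(counts.items())}

def append_cluster(instances, assignments):
    """Add each instance's assigned cluster as its last attribute value."""
    return [list(row) + [label] for row, label in zip(instances, assignments)]

# Hypothetical numeric instances and cluster assignments.
instances = [[48, 1], [40, 0], [51, 1], [23, 0]]
assignments = [0, 1, 0, 0]

print(cluster_summary(assignments))
# {0: (3, 75.0), 1: (1, 25.0)}
print(append_cluster(instances, assignments))
# [[48, 1, 0], [40, 0, 1], [51, 1, 0], [23, 0, 0]]
```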
Conclusion
In this paper, we combined two common technologies: EDI and text mining.
EDI, together with a transformation process, provided the database storage.
We used text mining to extract useful, hidden, and previously unknown patterns or information from EDI text databases.
We focused only on the most interesting point of intersection between EDI and text mining.
The EDI file was translated into a normalized flat file in comma-separated format.
The flat file represented the EDI database; we proposed a dataset of banking-transaction EDI electronic text data gathered from EDI databases.
For text mining, we suggested using the k-means algorithm as the clustering method, with the Euclidean distance measure, to assign every item to its nearest cluster center.
In the experimental section, we used a text mining application called WEKA to present our results visually.