Edi text

Text Mining Documents in Electronic
Data Interchange Environment
Dr. Zakaria Suliman Zubi,
Associate Professor,
Computer Science Department,
Faculty of Science,
Sirt University,
Sirt, Libya.

Contents
1 Abstract.
2 Introduction.
3 Types of Text Mining.
4 Types of Information and Methods.
5 Methods and Algorithms used.
6 Types of Outputs.
7 Applications of Text Mining in EDI databases.
8 Experimental Results.
9 Conclusion.


Abstract
1. The Internet is a huge source of electronic text documents
in many languages.
2. Electronic documents can be interchanged over the
web via Electronic Data Interchange (EDI) environments.
3. Text data can be exchanged over the web in an EDI
format such as the X12 formats.
4. The EDI format can be transformed and stored in a
database.
5. The EDI database is normalized and mapped into a
flat file in a form such as a spreadsheet.
6. Text mining using a clustering method was applied.
7. The k-means algorithm was used with the Euclidean distance measure.
8. We generate a dataset using a text mining application
called WEKA to show some experimental results.




Introduction
1. The Internet is a huge source of electronic documents in
many languages.

2. Electronic documents may contain text, images, audio and
video.

3. Text documents may contain text in Latin-script languages such as
English, French and Spanish, or in non-Latin-script languages such as
Arabic, Chinese, Japanese and Indian languages.

4. The text content of an electronic document is
its most significant part, which makes
applying text mining or information retrieval approaches
reasonable.

5. Electronic Data Interchange (EDI) is another approach to
interchanging electronic documents over the web, in
EDI environments.



Introduction (cont.)
6. EDI is becoming progressively more
significant as an easy mechanism for
organizations to manage, buy, sell, and
trade information. ANSI has approved a
set of EDI standards known as the X12
standards.

7. The X12 standards define how the electronic
documents are represented.

8. These electronic standards are a
necessary condition for any two
organizations to start business
transactions.

9. The EDI format can be transformed and
stored in a database.
Figure: EDI documents-to-database-to-text
mining life cycle.


Introduction (cont.)
10. The EDI database is
normalized and mapped into a flat
file in a form such as a spreadsheet.

11. Text mining using a clustering
method is applied.

12. The k-means algorithm is used with
the Euclidean distance measure.

13. We generate a dataset using a text
mining application
called WEKA to show some
experimental results.


Types of Text Mining
The purposes of using text mining or data mining are:
• To improve customer acquisition and retention.
• To reduce fraud.
• To identify internal inefficiencies and then revamp
operations.
• To map the unexplored environment of the Internet.

The major types of tools used in text mining are:
I. Artificial neural networks;
II. Decision trees;
III. Genetic algorithms;
IV. Rule induction;
V. Nearest neighbor method;
VI. Data visualization.




Types of Information and Methods
Text mining usually produces five types of information:
1. Associations: arise when occurrences are linked to a single event.

2. Sequences: events linked over time, based on the events that happen.

3. Classifications: classification can help us discover the characteristics of
customers who are likely to leave, and provides a model that can be used to
predict who they are.

4. Forecasting: estimates the future value of continuous variables, such as
sales figures, based on patterns within the data.

5. Clustering: one of the essential methods used in text mining approaches to
discover different groupings within the data.



Types of Information and Methods (cont.)
Clustering:
1. Is an unsupervised learning process
applied to the text data; it does not depend
on pre-specified knowledge such as class labels.

2. We use a common partitioning
method called the k-means algorithm.

3. We calculate the distance of each
document from the centroid using the
Euclidean measure.

4. The aim is to improve the handling of text in
electronic documents.



Methods and Algorithms used
1. Clustering using the k-means algorithm:
• The k-means algorithm assigns each point to the cluster whose
centroid is nearest.

• The centroid is the average of all the points in the cluster; that is, its
coordinates are the arithmetic mean, for each dimension separately,
over all the points in the cluster.

• Example: suppose the data set has three dimensions and the cluster has two points, X =
(x1, x2, x3) and Y = (y1, y2, y3). Then the centroid Z becomes Z = (z1, z2, z3),
where z1 = (x1 + y1)/2, z2 = (x2 + y2)/2 and z3 = (x3 + y3)/2.

The algorithm steps are:

1. Input: D := {d1, d2, …, dn}; k := the number of clusters;
2. Select k document vectors as the initial centroids of the k clusters;
3. Repeat:
4. Select one vector d from the remaining documents;
5. Compute the similarities between d and the k centroids;
6. Put d in the closest cluster and recompute the centroids;
7. Until the centroids do not change;
8. Output: k clusters of documents.
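
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments: it uses the common batch variant (reassign all documents, then recompute all centroids) rather than the one-document-at-a-time update in the steps, and the names k_means, centroid and euclidean, as well as the seed and max_iter parameters, are assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Arithmetic mean of the points, computed separately per dimension."""
    n = len(points)
    return tuple(sum(dim) / n for dim in zip(*points))


def k_means(docs, k, seed=0, max_iter=100):
    """Batch k-means: returns (cluster index per document, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(docs, k)  # step 2: k documents as initial centroids
    assignment = None
    for _ in range(max_iter):
        # Steps 4-6: put every document in the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(d, centroids[c])) for d in docs
        ]
        if new_assignment == assignment:  # step 7: nothing moved, so stop
            break
        assignment = new_assignment
        for c in range(k):
            members = [d for d, a in zip(docs, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = centroid(members)
    return assignment, centroids


if __name__ == "__main__":
    docs = [(0, 0, 0), (0, 1, 0), (10, 10, 10), (10, 11, 10)]
    print(k_means(docs, k=2))
```

With two well-separated groups, as in the small example at the bottom, the two nearby vectors end up in the same cluster regardless of which documents are drawn as initial centroids.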


Methods and Algorithms used (cont.)
2. Bag-of-words documents: representing the electronic
documents in the EDI database as bags of words leads to
the following features:
• A text document is represented by the words it contains (and
their occurrences), e.g., "Lord of the rings" → {"the", "lord",
"rings", "of"}. This representation is highly efficient, which
makes learning far simpler and easier. The order of the words
is not important for certain applications.

• Stemming, which identifies a word by its root, is also applied,
e.g., flying, flew → fly. It is used to reduce dimensionality.

• Stop words are also removed, since the most common words
are unlikely to help text mining, e.g., "the", "a", "an", "you",
etc.

• Each document is represented by the set of its word
frequencies and the categories that it belongs to.
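
As a rough illustration of the bag-of-words idea, the sketch below counts word occurrences and drops stop words. The function name bag_of_words, the tokenization rule and the tiny stop-word list (just the four words quoted above) are assumptions for the example; stemming is left out because it would need a separate stemmer.

```python
import re
from collections import Counter

# Illustrative stop-word list: only the four examples quoted above.
STOP_WORDS = {"the", "a", "an", "you"}


def bag_of_words(text, stop_words=STOP_WORDS):
    """Return word -> occurrence count, ignoring word order and stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)


if __name__ == "__main__":
    # Without stop-word removal, this gives the four words of the example.
    print(bag_of_words("Lord of the rings", stop_words=set()))
    # With the default list, "the" is dropped.
    print(bag_of_words("Lord of the rings"))
```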

3. Text in EDI document representation:
• An EDI text document is represented as a bag of words,
in which words appear independently, without considering their order.
• Each word corresponds to a dimension in the resulting data space,
and each document then becomes a vector consisting of non-negative
values on each dimension. We also remove stop words.
• We use the frequency of each term as its weight, which means that terms
that appear more frequently are more important and descriptive for the
document.
• Let D = {d1, …, dn} be a set of documents and T = {t1, …, tm} the set
of distinct terms occurring in D.
• A document is represented as a vector td. Let tf(d, t) denote the
frequency of term t ∈ T in document d ∈ D. Then the vector
representation of a document d is td = (tf(d, t1), …, tf(d, tm)).
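
The vector representation td can be sketched as follows. This is a minimal example under stated assumptions: documents are already tokenized lists of words, the term set T is ordered alphabetically (any fixed order works), and term_vectors is a name chosen for illustration.

```python
def term_vectors(documents):
    """Build T and the vectors td = (tf(d, t1), ..., tf(d, tm)).

    documents: list of token lists, one per document in D.
    Returns (terms, vectors), where vectors[i][j] = tf(d_i, t_j).
    """
    terms = sorted({t for doc in documents for t in doc})  # the set T
    vectors = [[doc.count(t) for t in terms] for doc in documents]
    return terms, vectors


if __name__ == "__main__":
    # Made-up tokens, for illustration only.
    docs = [["pay", "bank", "pay"], ["bank", "loan"]]
    terms, vectors = term_vectors(docs)
    print(terms)    # the m distinct terms
    print(vectors)  # one non-negative frequency vector per document
```

Every document gets a vector of the same length m, so the vectors can be compared dimension by dimension.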


4. Distance measures map the distance between the representative
descriptions of two objects into a single numeric value, which depends
on two factors: the properties of the two objects and the measure itself.
To qualify as a metric, a distance measure d must satisfy
the following four conditions. Let x, y and z be any objects (electronic documents)
in a data set and d(x, y) be the distance between x and y.
1. The distance between
any two points must be non-negative, that is, d(x, y) ≥ 0.

2. The distance between two objects must be zero if and only if the
two objects are identical, that is, d(x, y) = 0 if and only if x = y.

3. Distance must be symmetric, that is, the distance from x to y is the
same as the distance from y to x, i.e. d(x, y) = d(y, x).

4. The measure must satisfy the triangle inequality, which is d(x, z) ≤
d(x, y) + d(y, z).
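
The four conditions can be spot-checked numerically on a handful of sample points. The sketch below is only a sanity check on the given samples, not a proof; the function name satisfies_metric_axioms and the tolerance tol are assumptions for the example.

```python
import itertools
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def satisfies_metric_axioms(d, points, tol=1e-9):
    """Check the four metric conditions for measure d on the sample points."""
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < 0:                       # 1. non-negativity
            return False
        if (d(x, y) <= tol) != (x == y):      # 2. zero iff identical
            return False
        if abs(d(x, y) - d(y, x)) > tol:      # 3. symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:  # 4. triangle inequality
            return False
    return True


if __name__ == "__main__":
    points = [(0, 0), (1, 0), (2, 0), (3, 4)]
    print(satisfies_metric_axioms(euclidean, points))
    print(satisfies_metric_axioms(squared_euclidean, points))
```

On these points the Euclidean distance passes all four checks, while the sum of squared differences without the square root fails the triangle inequality for the collinear points (0, 0), (1, 0), (2, 0), since 4 > 1 + 1.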


Euclidean distance measure:
• A widely used measure in text clustering problems.
• It is also the default distance measure used with the k-means
algorithm.
• It is the ordinary distance between two points and can be easily
measured with a ruler in two- or three-dimensional space.
• Given two documents da and db represented by their term vectors ta
and tb respectively, the Euclidean distance of the two documents is
defined as:
distance(ta, tb) = { Σi (tf(da, ti) − tf(db, ti))² }^½, for i = 1, …, m.
In general, for two vectors x and y: distance(x, y) = { Σi (xi − yi)² }^½.
• Squared Euclidean distance is used when we want to place greater
weight on objects that are further apart. It is computed as
follows: distance(x, y) = Σi (xi − yi)².
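
Both formulas translate directly into code. The sketch below is a plain-Python illustration; the function names and the two sample term-frequency vectors are made up for the example.

```python
import math


def squared_euclidean(x, y):
    """Sum over i of (xi - yi)^2."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def euclidean(x, y):
    """Square root of the squared Euclidean distance."""
    return math.sqrt(squared_euclidean(x, y))


if __name__ == "__main__":
    ta = [1, 0, 2]  # hypothetical term frequencies of document da
    tb = [1, 1, 0]  # hypothetical term frequencies of document db
    print(squared_euclidean(ta, tb))  # 0 + 1 + 4 = 5
    print(euclidean(ta, tb))          # square root of 5
```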


5. Dataset: We propose a collection of EDI electronic text data on banking
transactions, gathered
from EDI databases.
1. The EDI text data were collected and
aggregated into seven main
categories.
2. We created an EDI corpus.
3. This corpus represents the dataset,
which consists of 2000 EDI electronic
documents of different lengths that
belong to seven categories.
4. The categories are transaction
divisions in the X12 standard EDI
format.


6. Translating EDI to databases:
1) This is an essential process for storing and accessing our
transaction information in a valid database format that
supports all common database formats.
2) It can be done by translating an EDI message in the X12
standard format into a variety of transactions.
3) Each transaction file format is identified by a mapping file
and can be transformed into a flat file format.
4) Mapping the translated EDI message into the database
constructs a database like the one illustrated in the figure.
5) This flat file can be in any common form, for instance
comma-separated format. The
redundancy of data in the flat table can be clearly seen
from a small portion of an EDI file.
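
A rough sketch of the flattening step is shown below. It is not the translator used in this work: it assumes the commonly used delimiters "*" between elements and "~" after each segment (a real X12 interchange declares its delimiters in its header), the sample message is invented and is not a complete or validated transaction, and the function names and the e1, e2, … column names are hypothetical.

```python
import csv
import io


def x12_to_rows(message, segment_terminator="~", element_separator="*"):
    """Split an X12-style message into one record per segment."""
    rows = []
    for raw in message.split(segment_terminator):
        raw = raw.strip()
        if not raw:
            continue
        elements = raw.split(element_separator)
        rows.append({"segment": elements[0], "elements": elements[1:]})
    return rows


def rows_to_csv(rows):
    """Write the segments as a comma-separated flat file (returned as text)."""
    width = max(len(r["elements"]) for r in rows)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["segment"] + [f"e{i + 1}" for i in range(width)])
    for r in rows:
        padding = [""] * (width - len(r["elements"]))  # pad short segments
        writer.writerow([r["segment"]] + r["elements"] + padding)
    return buf.getvalue()


if __name__ == "__main__":
    sample = "ST*820*0001~BPR*C*150.00~SE*3*0001~"  # invented sample message
    print(rows_to_csv(x12_to_rows(sample)))
```

In the flat output, values such as the control number appear in more than one row, which is the kind of redundancy the flat table exhibits.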

Table



Types of Outputs
Using text mining on EDI data, a retailer can identify the demographics of its customers,
such as gender, marital status and number of children, and the products that they buy.
This information can have a tremendous positive impact on the retailer's operations by
decreasing inventory movement and by placing inventory in locations where it is likely
to sell.

1. Buying patterns of customers; associations among customer
demographic characteristics; predictions of which customers will
respond to which mailings.

2. Patterns of fraudulent credit card usage; identities of "loyal" customers;
credit card spending by customer groups; predictions of customers who
are likely to change their credit card affiliation.

3. Predictions of which customers will buy new insurance policies;
behavior patterns of risky customers; expectations of fraudulent
behavior.

4. Characterizations of patient behavior to predict the frequency of office visits.



Applications of Text Mining in EDI databases
• Text mining and EDI applications can be used in a variety of
sectors: consumer product sales, finance, manufacturing, health,
banking, insurance, and utilities.

• We can benefit from these technologies (text mining and EDI) if
the following types of data are available in EDI databases to support text
mining applications for customer-based businesses:
1) demographics, such as age, gender and marital status;

2) banking and economic status, such as salary, profession and
household income;

3) geographic details, such as city, state or region; and

4) other demographics, such as education or hobbies.



Experimental Results
• We generate the clustered dataset using the Euclidean distance measure in
the k-means algorithm, which assigns every item to its nearest cluster
center, using a common text mining application called WEKA.

• The EDI banking text dataset is normalized into a flat file and
represented in comma-separated format. A primary dataset is then
created.

• The resulting data file consists of 600 instances.

• We use the k-means algorithm to cluster the customers in the
bank dataset and to characterize the resulting customer
segments.

• Since k-means requires numerical values for attributes, we convert
the dataset into the standard spreadsheet format and convert
categorical attributes to binary.
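
Converting categorical attributes to binary can be sketched as follows. This is an illustration in plain Python, not the conversion performed for the experiments; the function name to_binary_columns, the "attribute=value" column naming, and the sample records are all assumptions.

```python
def to_binary_columns(records, categorical):
    """Replace each categorical attribute with one 0/1 column per value.

    records: list of dicts (one per instance); categorical: attribute names.
    Numeric attributes are kept and converted to float.
    """
    levels = {col: sorted({r[col] for r in records}) for col in categorical}
    encoded = []
    for r in records:
        row = {}
        for col, value in r.items():
            if col in levels:
                for level in levels[col]:
                    row[f"{col}={level}"] = 1 if value == level else 0
            else:
                row[col] = float(value)
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    # Hypothetical customer records, for illustration only.
    customers = [
        {"age": 30, "gender": "F", "region": "city"},
        {"age": 45, "gender": "M", "region": "rural"},
    ]
    for row in to_binary_columns(customers, ["gender", "region"]):
        print(row)
```

After this step every attribute is numeric, so the Euclidean distance between two instances is well defined.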



Experimental Results (cont.)
• The WEKA k-means algorithm uses the Euclidean distance measure to
compute distances between instances and clusters.

• We enter seven as the number of clusters, together with a seed value used to generate a
random number for making the initial assignment of instances to
clusters.

• WEKA reports the centroid of every cluster as well as statistics
on the number and percentage of instances assigned to the different
clusters.

• Cluster centroids are the mean vectors for each cluster (so each
dimension value in the centroid corresponds to the mean value for
that dimension in the cluster).

• In the final data portion, each instance has its assigned cluster as
the last attribute value.
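
The per-cluster statistics described above (centroid, number and percentage of instances) can be recomputed from any cluster assignment. The sketch below is plain Python and does not reproduce WEKA's output format; the function name summarize_clusters and the sample data are assumptions.

```python
from collections import defaultdict


def summarize_clusters(instances, assignment):
    """Return, per cluster, its size, share of all instances and mean vector."""
    groups = defaultdict(list)
    for instance, cluster in zip(instances, assignment):
        groups[cluster].append(instance)
    total = len(instances)
    summary = {}
    for cluster, members in sorted(groups.items()):
        n = len(members)
        summary[cluster] = {
            "count": n,
            "percent": 100.0 * n / total,
            # Centroid: mean of each dimension over the cluster's members.
            "centroid": [sum(col) / n for col in zip(*members)],
        }
    return summary


if __name__ == "__main__":
    instances = [[1, 2], [3, 4], [10, 10], [20, 20]]  # made-up numeric data
    assignment = [0, 0, 1, 1]                         # cluster of each instance
    for cluster, stats in summarize_clusters(instances, assignment).items():
        print(cluster, stats)
```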



Conclusion
In this paper, we have used a combination of two common technologies,
EDI and text mining.

EDI, together with a transformation process, provided the database storage.

We used text mining to extract useful, hidden and previously unknown patterns or
information from EDI text databases.

We focused only on the most interesting point of intersection between EDI
and text mining.

The EDI-format file was translated into a normalized flat file in comma-separated
format.

The flat file represented the EDI database, for which we propose a dataset of
banking transactions collected from EDI electronic text data gathered from EDI databases.

In text mining, we suggest using the k-means clustering algorithm with the
Euclidean distance measure to assign every item to its nearest cluster center.

In the experimental section, we used a text mining application called
WEKA to present our results visually.

