SlideShare a Scribd company logo
1 of 4
Download to read offline
IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 1, Ver. V (Jan – Feb. 2015), PP 24-27
www.iosrjournals.org
DOI: 10.9790/0661-17152427 www.iosrjournals.org 24 | Page
Research on Clustering Method of Related Cases Based
On Chinese Text
Gang Zeng1
(Police Information Department, Liaoning Police Academy, Dalian, China)
Abstract: With the movement of population, crimes are showing mobility and professionalism characteristics,
the cases which the same suspects committed have the same or similar characteristics, it is helpful to find the
related cases. In this article, firstly, analyzes the factors associated with the cases, discusses how to form vector
space of Chinese cases, then proposes a method to associate cases by joint use of FKM and Canopy clustering
algorithm.
Keywords: related cases, Chinese segmentation, Fuzzy K-Means, Canopy.
I. Introduction
With the development of transportation and movement of population,Crimes are showing
mobility,professionalism and other characteristics, the suspects often gather his associates and wander around
to commit the crime. These cases have same characteristics, because the same person or gang do it. if the cases
are clustered, Investigation will make an unexpected progress.
When the case investigation is in trouble, Investigators often merge cases with the same characteristics
and investigate them, to find out the relationship between the cases, to find more clues and speed up the
investigation of cases. Cases often be searched in the cases database, to find out related cases with the same
characteristics, we call it related cases, after cases clustered, there will be a few hundred records in a class, or
even more. Investigators will spend much time and effort. to find out related cases in these cases, In this way it
is trivial and inefficient.
Ning Han presented a clustering method for related cases using FKM algorithm(Fuzzy K-means
algorithm, also known as the fuzzy C-means algorithm ), and analyze the information of the cases. and do the
experiment based on 681 records of cases, this analysis method is feasible in the low-dimensional stand-alone
environment. But his method does not solve the problem of Chinese word segmentation, Under condition of
high-dimensional and big data, the clustering results could not be gotten within an acceptable time using his
method, We propose a method, in the Hadoop environment, we can analyze related cases using Canopy and
FKM clustering algorithm, this method can overcome the lack of FKM clustering algorithm, and improve the
efficiency and quality of analysis.
II. Conditions Of Associated Cases
Conditions of related case includes many factors, such as time and location of the crime, the way into
the scene of crime, tool for criminal purpose, criminal activity sequence, modus operandi, the way and means of
escape, the way of transportation and sales of stolen goods, and so on. Because a variety of factors, such as the
suspect's life experiences, history of crime, level of education, professional features, life skills, mental set, etc.
In the course of committing the crime in different places, means and methods of the suspect committing the
crime have formed a stable characteristic, we call it crime stereotypes of criminal behavior, in the different cases
that the same suspect has made, the factors that mentioned above may are manifested more or less, this shows
that the stability of the crime. To find more investigation clues of cases, we must analyze the relationship of all
cases recorded in database, to find out the information we need.
1. The Same Or Similar Nature Of The Cases
All kinds of cases , such as criminal cases, security cases, economic cases, etc. the suspects are affected
by the above factors, their criminal behavior presents the characteristics of continuity. The suspect who unlock a
lock by tools will steal things in a long time, the suspect who murder will still kill someone.
2. Regularity Of The Time Of Occurrence Of The Case
A person or a group of people will commit crimes , at the same time or similar time. The time when the
suspect commit crime often shows a certain regularity, for example, a type of cases often occur at a certain time
in a day, a certain day in a week, a specific time period in a year. for example , Car theft cases are far more
likely to occur from midnight to 6 a.m. Robbery, theft and other usurpation of cases occurred on the eve of
Chinese New Year.
Research on Clustering Method of Related Cases Based on Chinese Text
DOI: 10.9790/0661-17152427 www.iosrjournals.org 25 | Page
3. Target Suspects Violated Is Same Or Similar
Target suspects violated is closely related to the suspect's motive, Because of different psychological
needs, the choice of target is in a somewhat tendentious. for example, young single woman often become the
object of rape or robbery, Precious industrial raw materials or digital products become the object of theft.
4. Suspect’S Body Features Is Same Or Similar
if suspect’s body features is same or similar in different cases, we can consider the cases is related.
body features including height, weight, gender, countenance, clothing, accent, and so on.
5. Crime Means, Methods, Processes Are Same Or Similar
Because of the suspect’s history of crime, life skills, and other factors, In terms of tools of crime, the
characteristics of the crime, the crime procedures, etc. a certain pattern has been formed, And with its own
specificity and stability. for example, to destroy grille, some suspect prefer to use the jack, some prefer to use a
pipe wrench, some suspect prefer to use stick. to avoid video surveillance, some suspect choose cap, some
suspect choose mask, some suspect choose headgear.
6. Vestige Are Same Or Similar
Vestige is defined as a vestige of something is a very small part that still remains of something that was
once much larger or more important. we can use it as evidence in court, vestige includes footprints, fingerprint,
handprint, shot mark, semen, bloodstain, hair, toxicant, exploder, handwriting, and so on. After comparison,
overlapping, anastomosis, checkout, If identified as same or similar things in different cases, We can believe
that these cases are related.
Now, police manage cases by RDMS, each field describe a characteristic of the case, for example, the
β€˜time’ field describe the time when the case occurred, the β€˜body features’ field describe the suspect’s features of
his(her) body, including common human characteristics, such as left ring finger broken, yellow hair, myopia and
with glasses. The method for the management of the cases made important contributions.
But disadvantages are also obvious, firstly, It is complex to enter data into database, each case must be
recorded in the management system, a full-time police is responsible for data entry, because it is very complex
to describe the case. this is waste of police. secondly, the quality of the data is uneven. Because of different
understanding of the case, different expression level, different degree of familiarity with the management
system. much of the data is wrong, useless. thirdly, it is excessive precision, When associated with the cases, we
can seldom find related cases, It did not play the desired effect to associate with cases.
For the above reasons, we present that interrogation record of the case is adopted as the basis for
association. we can relate cases based on Chinese text, playing a specialty of big data, we can find related cases
nationwide.
III. Cluster Analysis Of Cases
Vector space model (VSM) is a common method to describe cases, the case set is described as a vector
space. Then we analyze the related cases by studying the similarity of cases. Clustering analysis is an important
aspect of the study of the data mining. It is process of division physical and abstracted collection into several
classes composed of similar objects. Clustering analysis is the learning process of a free guide. It can divide
criminal acts into a number of classes or cluster. In same cluster crime acts enjoy high similarity. On the
contrary, in different clusters they have large difference. in this article we will use fuzzy K-means clustering
algorithm to associate with cases.
1. The Choice Of Distance Measure
It is very important, how to calculate the distance between the vectors, and it is directly related to the
accuracy of clustering. how to calculate is the best way? there are many factors which infects the accuracy, but
the most important factor of all factors is the choice of distance measure.
(1) Euclidean Distance Measure
The Euclidean distance is the simplest of all distance measures. It’s the most intuitive and matches our
normal idea of distance. Euclidean distance between two n-dimensional vectors (a1, a2, ... , an) and (b1,b2, ... ,
bn) is
d = π‘Ž1 βˆ’ 𝑏1
2 + π‘Ž2 βˆ’ 𝑏2
2 + β‹―+ π‘Žπ‘– βˆ’ 𝑏𝑖
2
(2). Manhattan Distance Measure
The distance between any two points is the sum of the absolute differences of their coordinates. This
distance measure takes its name from the grid-like layout of streets in Manhattan. we can’t walk from 3th
Research on Clustering Method of Related Cases Based on Chinese Text
DOI: 10.9790/0661-17152427 www.iosrjournals.org 26 | Page
Avenue and 3th Street to 5thAvenue and 5th Street by walking straight through buildings. The real distance
walked is two blocks up and two blocks over. the Manhattan distance between two n-dimensional vectors (a1,
a2, ... , an) and (b1,b2, ... , bn) is
d = π‘Ž1 βˆ’ 𝑏1 + π‘Ž2 βˆ’ 𝑏2 + β‹― π‘Ž 𝑛 βˆ’ 𝑏 𝑛
(3). Weighted Distance Measure
Weighted distance measure is based on Euclidean and Manhattan distance measures. A weighted
distance measure allows you to give weights to different dimensions in order to either increase or decrease the
effect of a dimension on the value of the distance measure. This will affect specific distance measures
differently but will in general make the distance value more sensitive to differences in a dimension.
2. Representing Chinese Text Documents As Vectors
The vector space model (VSM) is the common way of vectorizing text documents. All the words in the
world form a set of words, each word being assigned a number, the number is the dimension, it’ll occupy in
document vectors. the dimension of these document vectors can be very large. the maximum number of
dimensions possible is the cardinality of the vector. Because the count of all possible words or tokens is
unimaginably large, text vectors are usually assumed to have infinite dimensions.
Chinese text processing involves the Chinese segmentation. Chinese is different from English, English
words in the document are divided into independent word by a space or punctuation, Chinese Document can cut
apart into sentences by Chinese punctuation. but there is no space in sentence. Chinese segmentation is the most
important factor in representing Chinese text documents as vectors. The method commonly used in Chinese
segmentation is "two-way maximum matching", First, maximally splitting in the positive direction, then
maximally splitting in the opposite direction, the combination of the two parts is the final result, during the
procedure, an atomic word base is needed.
The more dimensions in vector space, the more complex the calculations become, It is proved by
theoretical calculations and practical tests. Representing the case with high-dimensional vector require
enormous computing resources, so two questions must be solved to calculate the related cases: firstly, reduce the
dimension of vector space, secondly, provide sufficient computing resources.
Atomic word is usually expressed as a natural word, the natural words are usually not standard, the
same meaning is expressed in different words, atomic word must be regulatory. the method to establish a
thesaurus and stop words dictionary can reduce the dimension of vector space effectively.
Sufficient computing resources in a stand-alone environment is impossible, with the emergence of
Hadoop, MapReduce model can break the huge computing task into small tasks, the small tasks can be executed
at different nodes, Hadoop provides us with sufficient computing resources. so we can calculate the relevance of
the cases in the Hadoop environment.
3. Term Frequency-Inverse Frequency
Term frequency–inverse document frequency (TF-IDF) weighting is a widely used improvement on
simple term-frequency weighting. The IDF part is the improvement; instead of simply using term frequency as
the value in the vector, this value is multiplied by the inverse of the term’s document frequency. That is, its
value is reduced more for words used frequently across all the documents in the dataset than for infrequently
used words.
the TF-IDF weight Wi for a word wi is:
π‘Šπ‘– = 𝑇𝐹𝑖 βˆ™ π‘™π‘œπ‘”
𝑁
𝐷𝐹𝑖
Where: TFi represents the term frequency (TFi) of word wi, N represents the document count. the
document vector will have this value at the dimension for wordi. This is the classic TF-IDF weighting. stop
words get a small weight, and terms that occur infrequently get a large weight. The important words, or the topic
words, usually have a high TF and a somewhat large IDF, so the product of the two becomes a larger value,
thereby giving more importance to these words in the vector produced.
IV. Application Of Clustering Algorithm
1. Fuzzy K-Means Clustering Algorithm
K-means clustering algorithm tries to find the hard clusters (where each point belongs to one cluster),
But fuzzy k-means discovers the soft clusters. In a soft cluster, any point can belong to more than one cluster
with a certain affinity value towards each. So the theory of the fuzzy clustering is more suitable for the nature of
things, and it can reflect objectively the reality. Currently, the fuzzy K-means clustering (FKM) algorithm is the
most widely used.
Research on Clustering Method of Related Cases Based on Chinese Text
DOI: 10.9790/0661-17152427 www.iosrjournals.org 27 | Page
FKM partitions set of n objects X = (π‘₯1, π‘₯2, β‹― , π‘₯ 𝑛 ) into K fuzzy clusters with C = 𝑐1, 𝑐2, β‹― , 𝑐 π‘˜
cluster centers. In fuzzy matrix U = (𝑒𝑖𝑗 , ), uij is the membership degree of the ith object with the cluster, The
characters of uij are as follows :
𝑒𝑖𝑗 ∈ 0,1 𝑖 ∈ 1,2, β‹― , 𝑛; π‘—πœ–1,2,β‹― , π‘˜
𝑒 𝑖, 𝑗 = 1, 𝑗 = 1, β‹―, 𝑛
π‘˜
𝑖=1
Update uij according to (1):
𝑒𝑖𝑗 =
1
𝑑 𝑖𝑗
𝑑 π‘˜π‘—
2
π‘š βˆ’1
π‘˜
π‘˜=1
1
Where: m>1is fuzziness exponent, cj is the clustering center. 𝑑𝑖𝑗 = π‘₯𝑖 βˆ’ 𝑐𝑗 is the distance between xi
and cj. Update cj according to (2)
𝑐𝑗 =
𝑒𝑖𝑗
π‘š
π‘₯𝑖
𝑛
𝑖=1
𝑒𝑖𝑗
π‘šπ‘›
𝑖=1
(2οΌ‰
The objective function is the equation (3):
J U, c1, β‹―, c π‘˜ = 𝑒𝑖𝑗
π‘š
𝑛
𝑗 =1
π‘˜
𝑖=1
𝑑𝑖𝑗
2
(3οΌ‰
The nature of FKM algorithm is to apply the gradient descent method to find out optimal solution, so
there is a local optimization problem. And the algorithm convergence speed is greatly influenced by the initial
value, especially in the case of large number of clusters. For many real-world clustering problems, the number
of clusters isn’t known beforehand, A class of techniques known as approximate clustering algorithms can
estimate the number of clusters as well as the approximate location of the centroids from a given data set. One
such algorithm we can use is called canopy generation.
2. Canopy Clustering Algorithm
Canopy’s advantage lies in its ability to create clusters extremely quicklyβ€”it can do this with a single
pass over the data. But this algorithm may not give accurate and precise clusters. But it can give the optimal
number of clusters without even specifying the number of clusters, k, as required by fuzzy k-means.
V. Conclusion
This article proposed a method to associate cases by joint use of FKM and Canopy clustering
algorithm, it overcome the difficult problem of clustering Chinese cases, but this method is relatively complex,
the results are not very accurate, We hope to improve this method in the future.
Acknowledgment
This work was partly supported by general scientific research project of education department of
Liaoning Province of China (No. L2013490).
References
[1]. Ning Han, Wei Chen. Research on clustering analysis on related Cases. Journal of Chinese People’s Public Security
University(Science and Technology), 2012(1):53-58.
[2]. Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman. Mahout in Action[M]. Manning Publication, Westampton.
[3]. Tony H. Grubesic. On The Application of Fuzzy Clustering for Crime Hot Spot Detection[J]. Journal of Quantitative Criminology,
2006,22(1):77-105.
[4]. [Deguang Wang, Baochang Han, Ming Huang. Application of Fuzzy C-Means Clustering Algorithm Based on Particle Swarm
Optimization in Computer Forensics. 2012 International Conference on Applied Physics and Industrial Engineering, 1186-1191.
[5]. Gang Liu. Programing Hadoop[M]. BeiJing:China Machine Press, 2014.
[6]. Hesam Izakian, Ajith Abraham, VΓ‘clav SnΓ‘Ε‘el. "Fuzzy Clustering Using Hybrid Fuzzy c-means and Fuzzy Particle Swarm
Optimization" 2009 World Congress on Nature & Biologically Inspired Computing (NaBIC 2009), pp.1690-1694,2009.

More Related Content

Viewers also liked

Geostatistical approach to the estimation of the uncertainty and spatial vari...
Geostatistical approach to the estimation of the uncertainty and spatial vari...Geostatistical approach to the estimation of the uncertainty and spatial vari...
Geostatistical approach to the estimation of the uncertainty and spatial vari...IOSR Journals
Β 
Simulation of Signals with Field Signal Simulator
Simulation of Signals with Field Signal SimulatorSimulation of Signals with Field Signal Simulator
Simulation of Signals with Field Signal SimulatorIOSR Journals
Β 
Designing and Performance Evaluation of 64 QAM OFDM System
Designing and Performance Evaluation of 64 QAM OFDM SystemDesigning and Performance Evaluation of 64 QAM OFDM System
Designing and Performance Evaluation of 64 QAM OFDM SystemIOSR Journals
Β 
Effect of yogic practices on Static & Dynamic flexibility of College Student
Effect of yogic practices on Static & Dynamic flexibility of College StudentEffect of yogic practices on Static & Dynamic flexibility of College Student
Effect of yogic practices on Static & Dynamic flexibility of College StudentIOSR Journals
Β 
Modelling and Control of a Robotic Arm Using Artificial Neural Network
Modelling and Control of a Robotic Arm Using Artificial Neural NetworkModelling and Control of a Robotic Arm Using Artificial Neural Network
Modelling and Control of a Robotic Arm Using Artificial Neural NetworkIOSR Journals
Β 
Importance of Development of Quality Checklists
Importance of Development of Quality ChecklistsImportance of Development of Quality Checklists
Importance of Development of Quality ChecklistsIOSR Journals
Β 
Identification and Monitoring the Change of Land Use Pattern Using Remote Sen...
Identification and Monitoring the Change of Land Use Pattern Using Remote Sen...Identification and Monitoring the Change of Land Use Pattern Using Remote Sen...
Identification and Monitoring the Change of Land Use Pattern Using Remote Sen...IOSR Journals
Β 
Ensemble Empirical Mode Decomposition: An adaptive method for noise reduction
Ensemble Empirical Mode Decomposition: An adaptive method for noise reduction Ensemble Empirical Mode Decomposition: An adaptive method for noise reduction
Ensemble Empirical Mode Decomposition: An adaptive method for noise reduction IOSR Journals
Β 
Touchpad Monitored Car
Touchpad Monitored Car Touchpad Monitored Car
Touchpad Monitored Car IOSR Journals
Β 
Thermal conductivity Characterization of Bamboo fiber reinforced in Epoxy Resin
Thermal conductivity Characterization of Bamboo fiber reinforced in Epoxy ResinThermal conductivity Characterization of Bamboo fiber reinforced in Epoxy Resin
Thermal conductivity Characterization of Bamboo fiber reinforced in Epoxy ResinIOSR Journals
Β 
Enhance the Throughput of Wireless Network Using Multicast Routing
Enhance the Throughput of Wireless Network Using Multicast RoutingEnhance the Throughput of Wireless Network Using Multicast Routing
Enhance the Throughput of Wireless Network Using Multicast RoutingIOSR Journals
Β 

Viewers also liked (20)

B017230816
B017230816B017230816
B017230816
Β 
E0622730
E0622730E0622730
E0622730
Β 
I017355767
I017355767I017355767
I017355767
Β 
Geostatistical approach to the estimation of the uncertainty and spatial vari...
Geostatistical approach to the estimation of the uncertainty and spatial vari...Geostatistical approach to the estimation of the uncertainty and spatial vari...
Geostatistical approach to the estimation of the uncertainty and spatial vari...
Β 
Simulation of Signals with Field Signal Simulator
Simulation of Signals with Field Signal SimulatorSimulation of Signals with Field Signal Simulator
Simulation of Signals with Field Signal Simulator
Β 
Designing and Performance Evaluation of 64 QAM OFDM System
Designing and Performance Evaluation of 64 QAM OFDM SystemDesigning and Performance Evaluation of 64 QAM OFDM System
Designing and Performance Evaluation of 64 QAM OFDM System
Β 
D0941822
D0941822D0941822
D0941822
Β 
Effect of yogic practices on Static & Dynamic flexibility of College Student
Effect of yogic practices on Static & Dynamic flexibility of College StudentEffect of yogic practices on Static & Dynamic flexibility of College Student
Effect of yogic practices on Static & Dynamic flexibility of College Student
Β 
Modelling and Control of a Robotic Arm Using Artificial Neural Network
Modelling and Control of a Robotic Arm Using Artificial Neural NetworkModelling and Control of a Robotic Arm Using Artificial Neural Network
Modelling and Control of a Robotic Arm Using Artificial Neural Network
Β 
Importance of Development of Quality Checklists
Importance of Development of Quality ChecklistsImportance of Development of Quality Checklists
Importance of Development of Quality Checklists
Β 
Identification and Monitoring the Change of Land Use Pattern Using Remote Sen...
Identification and Monitoring the Change of Land Use Pattern Using Remote Sen...Identification and Monitoring the Change of Land Use Pattern Using Remote Sen...
Identification and Monitoring the Change of Land Use Pattern Using Remote Sen...
Β 
A012660105
A012660105A012660105
A012660105
Β 
Ensemble Empirical Mode Decomposition: An adaptive method for noise reduction
Ensemble Empirical Mode Decomposition: An adaptive method for noise reduction Ensemble Empirical Mode Decomposition: An adaptive method for noise reduction
Ensemble Empirical Mode Decomposition: An adaptive method for noise reduction
Β 
Touchpad Monitored Car
Touchpad Monitored Car Touchpad Monitored Car
Touchpad Monitored Car
Β 
Thermal conductivity Characterization of Bamboo fiber reinforced in Epoxy Resin
Thermal conductivity Characterization of Bamboo fiber reinforced in Epoxy ResinThermal conductivity Characterization of Bamboo fiber reinforced in Epoxy Resin
Thermal conductivity Characterization of Bamboo fiber reinforced in Epoxy Resin
Β 
M1802028591
M1802028591M1802028591
M1802028591
Β 
Enhance the Throughput of Wireless Network Using Multicast Routing
Enhance the Throughput of Wireless Network Using Multicast RoutingEnhance the Throughput of Wireless Network Using Multicast Routing
Enhance the Throughput of Wireless Network Using Multicast Routing
Β 
E010213039
E010213039E010213039
E010213039
Β 
I010125361
I010125361I010125361
I010125361
Β 
J010116069
J010116069J010116069
J010116069
Β 

Similar to Research on Clustering Method of Chinese Text Cases

Text mining on criminal documents
Text mining on criminal documentsText mining on criminal documents
Text mining on criminal documentsZhongLI28
Β 
Inspection of Certain RNN-ELM Algorithms for Societal Applications
Inspection of Certain RNN-ELM Algorithms for Societal ApplicationsInspection of Certain RNN-ELM Algorithms for Societal Applications
Inspection of Certain RNN-ELM Algorithms for Societal ApplicationsIRJET Journal
Β 
Investigating crimes using text mining and network analysis
Investigating crimes using text mining and network analysisInvestigating crimes using text mining and network analysis
Investigating crimes using text mining and network analysisZhongLI28
Β 
Crime Pattern Detection using K-Means Clustering
Crime Pattern Detection using K-Means ClusteringCrime Pattern Detection using K-Means Clustering
Crime Pattern Detection using K-Means ClusteringReuben George
Β 
Legal memorandum #1 to student fall 2018-ba 18 (81151)
Legal memorandum #1 to student   fall 2018-ba 18 (81151)Legal memorandum #1 to student   fall 2018-ba 18 (81151)
Legal memorandum #1 to student fall 2018-ba 18 (81151)niraj57
Β 
CRIME-INVEST-PPP.pptx
CRIME-INVEST-PPP.pptxCRIME-INVEST-PPP.pptx
CRIME-INVEST-PPP.pptxoluobes
Β 
Database and Analytics Programming - Project report
Database and Analytics Programming - Project reportDatabase and Analytics Programming - Project report
Database and Analytics Programming - Project reportsarthakkhare3
Β 
Propose Data Mining AR-GA Model to Advance Crime analysis
Propose Data Mining AR-GA Model to Advance Crime analysisPropose Data Mining AR-GA Model to Advance Crime analysis
Propose Data Mining AR-GA Model to Advance Crime analysisIOSR Journals
Β 
report
reportreport
reportPeter Kim
Β 
Advancing The Concept Of Problem In Problem-Oriented Policing
Advancing The Concept Of Problem In Problem-Oriented PolicingAdvancing The Concept Of Problem In Problem-Oriented Policing
Advancing The Concept Of Problem In Problem-Oriented PolicingAndrea Porter
Β 
IRJET- Detecting Criminal Method using Data Mining
IRJET- Detecting Criminal Method using Data MiningIRJET- Detecting Criminal Method using Data Mining
IRJET- Detecting Criminal Method using Data MiningIRJET Journal
Β 
georgia-tech-atlanta
georgia-tech-atlantageorgia-tech-atlanta
georgia-tech-atlantaPeter Kim
Β 
A predictive model for mapping crime using big data analytics
A predictive model for mapping crime using big data analyticsA predictive model for mapping crime using big data analytics
A predictive model for mapping crime using big data analyticseSAT Journals
Β 
Running head CRIME ANALYSIS TECHNOLOGY .docx
Running head CRIME ANALYSIS TECHNOLOGY                           .docxRunning head CRIME ANALYSIS TECHNOLOGY                           .docx
Running head CRIME ANALYSIS TECHNOLOGY .docxhealdkathaleen
Β 
Running head CRIME ANALYSIS TECHNOLOGY .docx
Running head CRIME ANALYSIS TECHNOLOGY                           .docxRunning head CRIME ANALYSIS TECHNOLOGY                           .docx
Running head CRIME ANALYSIS TECHNOLOGY .docxtodd271
Β 
Merseyside Crime Analysis
Merseyside Crime AnalysisMerseyside Crime Analysis
Merseyside Crime AnalysisParang Saraf
Β 
ACCESS.2020.3028420.pdf
ACCESS.2020.3028420.pdfACCESS.2020.3028420.pdf
ACCESS.2020.3028420.pdfKiranKumar757501
Β 
Crime Data Analysis and Prediction for city of Los Angeles
Crime Data Analysis and Prediction for city of Los AngelesCrime Data Analysis and Prediction for city of Los Angeles
Crime Data Analysis and Prediction for city of Los AngelesHeta Parekh
Β 

Similar to Research on Clustering Method of Chinese Text Cases (20)

Crime
CrimeCrime
Crime
Β 
Text mining on criminal documents
Text mining on criminal documentsText mining on criminal documents
Text mining on criminal documents
Β 
Inspection of Certain RNN-ELM Algorithms for Societal Applications
Inspection of Certain RNN-ELM Algorithms for Societal ApplicationsInspection of Certain RNN-ELM Algorithms for Societal Applications
Inspection of Certain RNN-ELM Algorithms for Societal Applications
Β 
Investigating crimes using text mining and network analysis
Investigating crimes using text mining and network analysisInvestigating crimes using text mining and network analysis
Investigating crimes using text mining and network analysis
Β 
Crime Pattern Detection using K-Means Clustering
Crime Pattern Detection using K-Means ClusteringCrime Pattern Detection using K-Means Clustering
Crime Pattern Detection using K-Means Clustering
Β 
Legal memorandum #1 to student fall 2018-ba 18 (81151)
Legal memorandum #1 to student   fall 2018-ba 18 (81151)Legal memorandum #1 to student   fall 2018-ba 18 (81151)
Legal memorandum #1 to student fall 2018-ba 18 (81151)
Β 
CRIME-INVEST-PPP.pptx
CRIME-INVEST-PPP.pptxCRIME-INVEST-PPP.pptx
CRIME-INVEST-PPP.pptx
Β 
Database and Analytics Programming - Project report
Database and Analytics Programming - Project reportDatabase and Analytics Programming - Project report
Database and Analytics Programming - Project report
Β 
Propose Data Mining AR-GA Model to Advance Crime analysis
Propose Data Mining AR-GA Model to Advance Crime analysisPropose Data Mining AR-GA Model to Advance Crime analysis
Propose Data Mining AR-GA Model to Advance Crime analysis
Β 
report
reportreport
report
Β 
Advancing The Concept Of Problem In Problem-Oriented Policing
Advancing The Concept Of Problem In Problem-Oriented PolicingAdvancing The Concept Of Problem In Problem-Oriented Policing
Advancing The Concept Of Problem In Problem-Oriented Policing
Β 
IRJET- Detecting Criminal Method using Data Mining
IRJET- Detecting Criminal Method using Data MiningIRJET- Detecting Criminal Method using Data Mining
IRJET- Detecting Criminal Method using Data Mining
Β 
georgia-tech-atlanta
georgia-tech-atlantageorgia-tech-atlanta
georgia-tech-atlanta
Β 
U24149153
U24149153U24149153
U24149153
Β 
A predictive model for mapping crime using big data analytics
A predictive model for mapping crime using big data analyticsA predictive model for mapping crime using big data analytics
A predictive model for mapping crime using big data analytics
Β 
Running head CRIME ANALYSIS TECHNOLOGY .docx
Running head CRIME ANALYSIS TECHNOLOGY                           .docxRunning head CRIME ANALYSIS TECHNOLOGY                           .docx
Running head CRIME ANALYSIS TECHNOLOGY .docx
Β 
Running head CRIME ANALYSIS TECHNOLOGY .docx
Running head CRIME ANALYSIS TECHNOLOGY                           .docxRunning head CRIME ANALYSIS TECHNOLOGY                           .docx
Running head CRIME ANALYSIS TECHNOLOGY .docx
Β 
Merseyside Crime Analysis
Merseyside Crime AnalysisMerseyside Crime Analysis
Merseyside Crime Analysis
Β 
ACCESS.2020.3028420.pdf
ACCESS.2020.3028420.pdfACCESS.2020.3028420.pdf
ACCESS.2020.3028420.pdf
Β 
Crime Data Analysis and Prediction for city of Los Angeles
Crime Data Analysis and Prediction for city of Los AngelesCrime Data Analysis and Prediction for city of Los Angeles
Crime Data Analysis and Prediction for city of Los Angeles
Β 

More from IOSR Journals (20)

A011140104
A011140104A011140104
A011140104
Β 
M0111397100
M0111397100M0111397100
M0111397100
Β 
L011138596
L011138596L011138596
L011138596
Β 
K011138084
K011138084K011138084
K011138084
Β 
J011137479
J011137479J011137479
J011137479
Β 
I011136673
I011136673I011136673
I011136673
Β 
G011134454
G011134454G011134454
G011134454
Β 
H011135565
H011135565H011135565
H011135565
Β 
F011134043
F011134043F011134043
F011134043
Β 
E011133639
E011133639E011133639
E011133639
Β 
D011132635
D011132635D011132635
D011132635
Β 
C011131925
C011131925C011131925
C011131925
Β 
B011130918
B011130918B011130918
B011130918
Β 
A011130108
A011130108A011130108
A011130108
Β 
I011125160
I011125160I011125160
I011125160
Β 
H011124050
H011124050H011124050
H011124050
Β 
G011123539
G011123539G011123539
G011123539
Β 
F011123134
F011123134F011123134
F011123134
Β 
E011122530
E011122530E011122530
E011122530
Β 
D011121524
D011121524D011121524
D011121524
Β 

Recently uploaded

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
Β 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
Β 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
Β 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
Β 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
Β 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
Β 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
Β 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
Β 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
Β 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
Β 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
Β 
WhatsApp 9892124323 βœ“Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 βœ“Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 βœ“Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 βœ“Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
Β 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
Β 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
Β 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
Β 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
Β 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
Β 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
Β 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
Β 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
Β 

Recently uploaded (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
Β 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
Β 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Β 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
Β 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
Β 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Β 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
Β 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
Β 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
Β 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
Β 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Β 
WhatsApp 9892124323 βœ“Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 βœ“Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 βœ“Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 βœ“Call Girls In Kalyan ( Mumbai ) secure service
Β 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
Β 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Β 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Β 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
Β 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
Β 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
Β 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
Β 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
Β 

Research on Clustering Method of Chinese Text Cases

  • 1. IOSR Journal of Computer Engineering (IOSR-JCE) e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 1, Ver. V (Jan – Feb. 2015), PP 24-27 www.iosrjournals.org DOI: 10.9790/0661-17152427 www.iosrjournals.org 24 | Page Research on Clustering Method of Related Cases Based On Chinese Text Gang Zeng1 (Police Information Department, Liaoning Police Academy, Dalian, China) Abstract: With the movement of population, crimes are showing mobility and professionalism characteristics, the cases which the same suspects committed have the same or similar characteristics, it is helpful to find the related cases. In this article, firstly, analyzes the factors associated with the cases, discusses how to form vector space of Chinese cases, then proposes a method to associate cases by joint use of FKM and Canopy clustering algorithm. Keywords: related cases, Chinese segmentation, Fuzzy K-Means, Canopy. I. Introduction With the development of transportation and movement of population,Crimes are showing mobility,professionalism and other characteristics, the suspects often gather his associates and wander around to commit the crime. These cases have same characteristics, because the same person or gang do it. if the cases are clustered, Investigation will make an unexpected progress. When the case investigation is in trouble, Investigators often merge cases with the same characteristics and investigate them, to find out the relationship between the cases, to find more clues and speed up the investigation of cases. Cases often be searched in the cases database, to find out related cases with the same characteristics, we call it related cases, after cases clustered, there will be a few hundred records in a class, or even more. Investigators will spend much time and effort. to find out related cases in these cases, In this way it is trivial and inefficient. Ning Han presented a clustering method for related cases using FKM algorithm(Fuzzy K-means algorithm, also known as the fuzzy C-means algorithm ), and analyze the information of the cases. and do the experiment based on 681 records of cases, this analysis method is feasible in the low-dimensional stand-alone environment. But his method does not solve the problem of Chinese word segmentation, Under condition of high-dimensional and big data, the clustering results could not be gotten within an acceptable time using his method, We propose a method, in the Hadoop environment, we can analyze related cases using Canopy and FKM clustering algorithm, this method can overcome the lack of FKM clustering algorithm, and improve the efficiency and quality of analysis. II. Conditions Of Associated Cases Conditions of related case includes many factors, such as time and location of the crime, the way into the scene of crime, tool for criminal purpose, criminal activity sequence, modus operandi, the way and means of escape, the way of transportation and sales of stolen goods, and so on. Because a variety of factors, such as the suspect's life experiences, history of crime, level of education, professional features, life skills, mental set, etc. In the course of committing the crime in different places, means and methods of the suspect committing the crime have formed a stable characteristic, we call it crime stereotypes of criminal behavior, in the different cases that the same suspect has made, the factors that mentioned above may are manifested more or less, this shows that the stability of the crime. To find more investigation clues of cases, we must analyze the relationship of all cases recorded in database, to find out the information we need. 1. The Same Or Similar Nature Of The Cases All kinds of cases , such as criminal cases, security cases, economic cases, etc. the suspects are affected by the above factors, their criminal behavior presents the characteristics of continuity. The suspect who unlock a lock by tools will steal things in a long time, the suspect who murder will still kill someone. 2. Regularity Of The Time Of Occurrence Of The Case A person or a group of people will commit crimes , at the same time or similar time. The time when the suspect commit crime often shows a certain regularity, for example, a type of cases often occur at a certain time in a day, a certain day in a week, a specific time period in a year. for example , Car theft cases are far more likely to occur from midnight to 6 a.m. Robbery, theft and other usurpation of cases occurred on the eve of Chinese New Year.
  • 2. Research on Clustering Method of Related Cases Based on Chinese Text DOI: 10.9790/0661-17152427 www.iosrjournals.org 25 | Page 3. Target Suspects Violated Is Same Or Similar Target suspects violated is closely related to the suspect's motive, Because of different psychological needs, the choice of target is in a somewhat tendentious. for example, young single woman often become the object of rape or robbery, Precious industrial raw materials or digital products become the object of theft. 4. Suspect’S Body Features Is Same Or Similar if suspect’s body features is same or similar in different cases, we can consider the cases is related. body features including height, weight, gender, countenance, clothing, accent, and so on. 5. Crime Means, Methods, Processes Are Same Or Similar Because of the suspect’s history of crime, life skills, and other factors, In terms of tools of crime, the characteristics of the crime, the crime procedures, etc. a certain pattern has been formed, And with its own specificity and stability. for example, to destroy grille, some suspect prefer to use the jack, some prefer to use a pipe wrench, some suspect prefer to use stick. to avoid video surveillance, some suspect choose cap, some suspect choose mask, some suspect choose headgear. 6. Vestige Are Same Or Similar Vestige is defined as a vestige of something is a very small part that still remains of something that was once much larger or more important. we can use it as evidence in court, vestige includes footprints, fingerprint, handprint, shot mark, semen, bloodstain, hair, toxicant, exploder, handwriting, and so on. After comparison, overlapping, anastomosis, checkout, If identified as same or similar things in different cases, We can believe that these cases are related. Now, police manage cases by RDMS, each field describe a characteristic of the case, for example, the β€˜time’ field describe the time when the case occurred, the β€˜body features’ field describe the suspect’s features of his(her) body, including common human characteristics, such as left ring finger broken, yellow hair, myopia and with glasses. The method for the management of the cases made important contributions. But disadvantages are also obvious, firstly, It is complex to enter data into database, each case must be recorded in the management system, a full-time police is responsible for data entry, because it is very complex to describe the case. this is waste of police. secondly, the quality of the data is uneven. Because of different understanding of the case, different expression level, different degree of familiarity with the management system. much of the data is wrong, useless. thirdly, it is excessive precision, When associated with the cases, we can seldom find related cases, It did not play the desired effect to associate with cases. For the above reasons, we present that interrogation record of the case is adopted as the basis for association. we can relate cases based on Chinese text, playing a specialty of big data, we can find related cases nationwide. III. Cluster Analysis Of Cases Vector space model (VSM) is a common method to describe cases, the case set is described as a vector space. Then we analyze the related cases by studying the similarity of cases. Clustering analysis is an important aspect of the study of the data mining. It is process of division physical and abstracted collection into several classes composed of similar objects. Clustering analysis is the learning process of a free guide. It can divide criminal acts into a number of classes or cluster. In same cluster crime acts enjoy high similarity. On the contrary, in different clusters they have large difference. in this article we will use fuzzy K-means clustering algorithm to associate with cases. 1. The Choice Of Distance Measure It is very important, how to calculate the distance between the vectors, and it is directly related to the accuracy of clustering. how to calculate is the best way? there are many factors which infects the accuracy, but the most important factor of all factors is the choice of distance measure. (1) Euclidean Distance Measure The Euclidean distance is the simplest of all distance measures. It’s the most intuitive and matches our normal idea of distance. Euclidean distance between two n-dimensional vectors (a1, a2, ... , an) and (b1,b2, ... , bn) is d = π‘Ž1 βˆ’ 𝑏1 2 + π‘Ž2 βˆ’ 𝑏2 2 + β‹―+ π‘Žπ‘– βˆ’ 𝑏𝑖 2 (2). Manhattan Distance Measure The distance between any two points is the sum of the absolute differences of their coordinates. This distance measure takes its name from the grid-like layout of streets in Manhattan. we can’t walk from 3th
  • 3. Research on Clustering Method of Related Cases Based on Chinese Text DOI: 10.9790/0661-17152427 www.iosrjournals.org 26 | Page Avenue and 3th Street to 5thAvenue and 5th Street by walking straight through buildings. The real distance walked is two blocks up and two blocks over. the Manhattan distance between two n-dimensional vectors (a1, a2, ... , an) and (b1,b2, ... , bn) is d = π‘Ž1 βˆ’ 𝑏1 + π‘Ž2 βˆ’ 𝑏2 + β‹― π‘Ž 𝑛 βˆ’ 𝑏 𝑛 (3). Weighted Distance Measure Weighted distance measure is based on Euclidean and Manhattan distance measures. A weighted distance measure allows you to give weights to different dimensions in order to either increase or decrease the effect of a dimension on the value of the distance measure. This will affect specific distance measures differently but will in general make the distance value more sensitive to differences in a dimension. 2. Representing Chinese Text Documents As Vectors The vector space model (VSM) is the common way of vectorizing text documents. All the words in the world form a set of words, each word being assigned a number, the number is the dimension, it’ll occupy in document vectors. the dimension of these document vectors can be very large. the maximum number of dimensions possible is the cardinality of the vector. Because the count of all possible words or tokens is unimaginably large, text vectors are usually assumed to have infinite dimensions. Chinese text processing involves the Chinese segmentation. Chinese is different from English, English words in the document are divided into independent word by a space or punctuation, Chinese Document can cut apart into sentences by Chinese punctuation. but there is no space in sentence. Chinese segmentation is the most important factor in representing Chinese text documents as vectors. The method commonly used in Chinese segmentation is "two-way maximum matching", First, maximally splitting in the positive direction, then maximally splitting in the opposite direction, the combination of the two parts is the final result, during the procedure, an atomic word base is needed. The more dimensions in vector space, the more complex the calculations become, It is proved by theoretical calculations and practical tests. Representing the case with high-dimensional vector require enormous computing resources, so two questions must be solved to calculate the related cases: firstly, reduce the dimension of vector space, secondly, provide sufficient computing resources. Atomic word is usually expressed as a natural word, the natural words are usually not standard, the same meaning is expressed in different words, atomic word must be regulatory. the method to establish a thesaurus and stop words dictionary can reduce the dimension of vector space effectively. Sufficient computing resources in a stand-alone environment is impossible, with the emergence of Hadoop, MapReduce model can break the huge computing task into small tasks, the small tasks can be executed at different nodes, Hadoop provides us with sufficient computing resources. so we can calculate the relevance of the cases in the Hadoop environment. 3. Term Frequency-Inverse Frequency Term frequency–inverse document frequency (TF-IDF) weighting is a widely used improvement on simple term-frequency weighting. The IDF part is the improvement; instead of simply using term frequency as the value in the vector, this value is multiplied by the inverse of the term’s document frequency. That is, its value is reduced more for words used frequently across all the documents in the dataset than for infrequently used words. the TF-IDF weight Wi for a word wi is: π‘Šπ‘– = 𝑇𝐹𝑖 βˆ™ π‘™π‘œπ‘” 𝑁 𝐷𝐹𝑖 Where: TFi represents the term frequency (TFi) of word wi, N represents the document count. the document vector will have this value at the dimension for wordi. This is the classic TF-IDF weighting. stop words get a small weight, and terms that occur infrequently get a large weight. The important words, or the topic words, usually have a high TF and a somewhat large IDF, so the product of the two becomes a larger value, thereby giving more importance to these words in the vector produced. IV. Application Of Clustering Algorithm 1. Fuzzy K-Means Clustering Algorithm K-means clustering algorithm tries to find the hard clusters (where each point belongs to one cluster), But fuzzy k-means discovers the soft clusters. In a soft cluster, any point can belong to more than one cluster with a certain affinity value towards each. So the theory of the fuzzy clustering is more suitable for the nature of things, and it can reflect objectively the reality. Currently, the fuzzy K-means clustering (FKM) algorithm is the most widely used.
  • 4. Research on Clustering Method of Related Cases Based on Chinese Text DOI: 10.9790/0661-17152427 www.iosrjournals.org 27 | Page FKM partitions set of n objects X = (π‘₯1, π‘₯2, β‹― , π‘₯ 𝑛 ) into K fuzzy clusters with C = 𝑐1, 𝑐2, β‹― , 𝑐 π‘˜ cluster centers. In fuzzy matrix U = (𝑒𝑖𝑗 , ), uij is the membership degree of the ith object with the cluster, The characters of uij are as follows : 𝑒𝑖𝑗 ∈ 0,1 𝑖 ∈ 1,2, β‹― , 𝑛; π‘—πœ–1,2,β‹― , π‘˜ 𝑒 𝑖, 𝑗 = 1, 𝑗 = 1, β‹―, 𝑛 π‘˜ 𝑖=1 Update uij according to (1): 𝑒𝑖𝑗 = 1 𝑑 𝑖𝑗 𝑑 π‘˜π‘— 2 π‘š βˆ’1 π‘˜ π‘˜=1 1 Where: m>1is fuzziness exponent, cj is the clustering center. 𝑑𝑖𝑗 = π‘₯𝑖 βˆ’ 𝑐𝑗 is the distance between xi and cj. Update cj according to (2) 𝑐𝑗 = 𝑒𝑖𝑗 π‘š π‘₯𝑖 𝑛 𝑖=1 𝑒𝑖𝑗 π‘šπ‘› 𝑖=1 (2οΌ‰ The objective function is the equation (3): J U, c1, β‹―, c π‘˜ = 𝑒𝑖𝑗 π‘š 𝑛 𝑗 =1 π‘˜ 𝑖=1 𝑑𝑖𝑗 2 (3οΌ‰ The nature of FKM algorithm is to apply the gradient descent method to find out optimal solution, so there is a local optimization problem. And the algorithm convergence speed is greatly influenced by the initial value, especially in the case of large number of clusters. For many real-world clustering problems, the number of clusters isn’t known beforehand, A class of techniques known as approximate clustering algorithms can estimate the number of clusters as well as the approximate location of the centroids from a given data set. One such algorithm we can use is called canopy generation. 2. Canopy Clustering Algorithm Canopy’s advantage lies in its ability to create clusters extremely quicklyβ€”it can do this with a single pass over the data. But this algorithm may not give accurate and precise clusters. But it can give the optimal number of clusters without even specifying the number of clusters, k, as required by fuzzy k-means. V. Conclusion This article proposed a method to associate cases by joint use of FKM and Canopy clustering algorithm, it overcome the difficult problem of clustering Chinese cases, but this method is relatively complex, the results are not very accurate, We hope to improve this method in the future. Acknowledgment This work was partly supported by general scientific research project of education department of Liaoning Province of China (No. L2013490). References [1]. Ning Han, Wei Chen. Research on clustering analysis on related Cases. Journal of Chinese People’s Public Security University(Science and Technology), 2012(1):53-58. [2]. Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman. Mahout in Action[M]. Manning Publication, Westampton. [3]. Tony H. Grubesic. On The Application of Fuzzy Clustering for Crime Hot Spot Detection[J]. Journal of Quantitative Criminology, 2006,22(1):77-105. [4]. [Deguang Wang, Baochang Han, Ming Huang. Application of Fuzzy C-Means Clustering Algorithm Based on Particle Swarm Optimization in Computer Forensics. 2012 International Conference on Applied Physics and Industrial Engineering, 1186-1191. [5]. Gang Liu. Programing Hadoop[M]. BeiJing:China Machine Press, 2014. [6]. Hesam Izakian, Ajith Abraham, VΓ‘clav SnΓ‘Ε‘el. "Fuzzy Clustering Using Hybrid Fuzzy c-means and Fuzzy Particle Swarm Optimization" 2009 World Congress on Nature & Biologically Inspired Computing (NaBIC 2009), pp.1690-1694,2009.