Data Mining for Cancer Management in Egypt Case Study: Childhood ...


Published on

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Data Mining for Cancer Management in Egypt Case Study: Childhood ...

  1. 1. World Academy of Science, Engineering and Technology 8 2005 Data Mining for Cancer Management in Egypt Case Study: Childhood Acute Lymphoblastic Leukemia Nevine M. Labib, and Michael N. Malek Decision Trees may be used for classification, clustering, Abstract—Data Mining aims at discovering knowledge out of prediction, or estimation.One of the useful medical data and presenting it in a form that is easily comprehensible to applications in Egypt is the management of Leukemia as it humans. One of the useful applications in Egypt is the Cancer accounts for about 33% of pediatric malignancies. [2] management, especially the management of Acute Lymphoblastic Childhood Acute Lymphoblastic Leukemia (also called Leukemia or ALL, which is the most common type of cancer in acute lymphocytic leukemia or ALL) is a cancer of the blood children. and bone marrow. This type of cancer usually gets worse This paper discusses the process of designing a prototype that can help in the management of childhood ALL, which has a great quickly if it is not treated. It is the most common type of significance in the health care field. Besides, it has a social impact cancer in children. on decreasing the rate of infection in children in Egypt. It also There are different approaches in Data Mining, namely provides valubale information about the distribution and hypothesis testing where a database recording past behavior is segmentation of ALL in Egypt, which may be linked to the possible used to verify or disprove preconceived notions, ideas, and risk factors. hunches concerning relationships in the data, and knowledge Undirected Knowledge Discovery is used since, in the case of this discovery where no prior assumptions are made and the data is research project, there is no target field as the data provided is allowed to speak for itself. As for knowledge discovery, it mainly subjective. This is done in order to quantify the subjective variables. Therefore, the computer will be asked to identify may be directed or undirected. Directed knowledge discovery significant patterns in the provided medical data about ALL. This tries to explain or categorize some particular data field while may be achieved through collecting the data necessary for the undirected knowledge discovery aims at finding patterns or system, determimng the data mining technique to be used for the similarities among groups of records without the use of a system, and choosing the most suitable implementation tool for the particular target field or collection of predefined classes. domain. The remainder of this paper is organized as follows. In The research makes use of a data mining tool, Clementine, so as to apply Decision Trees technique. We feed it with data extracted from Section 2, we give a brief explanation of the data mining real-life cases taken from specialized Cancer Institutes. Relevant concept. In Section 3, we give a recent review of similar work medical cases details such as patient medical history and diagnosis in the field. In Section 4, we describe the implemented are analyzed, classified, and clustered in order to improve the disease approach in detail. As for the results and conclusions, they are management. provided in Section 5. Keywords—Data Mining, Decision Trees, Knowledge Discovery, II. DATA MINING CONCEPTS Leukemia. A. Definition I. INTRODUCTION Data mining may be defined as “the exploration and D ATA Mining or "the efficient discovery of valuable, non-obvious information from a large collection of data" analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules” [3]. [1] has a goal to discover knowledge out of data and present it in a form that is easily comprehensible to humans. Hence, it may be considered mining knowledge from large There are several data mining techniques, such as Market amounts of data since it involves knowledge extraction, as Basket Analysis, Memory-Based Reasoning (MBR), Cluster well as data/pattern analysis [4]. Detection, Link Analysis, Decision Trees, Artificial Neural Networks (ANNs), Genetic Algorithms, and On-Line Analytic B. Tasks Processing (OLAP). Some of the tasks suitable for the application of data mining are classification, estimation, prediction, affinity Michael N. Malek is with Department of Computers and Information grouping, clustering, and description. Some of them are best Systems, Faculty of Management, Sadat Academy for Management Sciences, approached in a top-down manner or hypothesis testing while Cairo, Egypt. others are best approached in a bottom-up manner called knowledge discovery either directed or undirected. 309
  2. 2. World Academy of Science, Engineering and Technology 8 2005 As for Classification, it is the most common data mining B. Decision Trees task and it consists of examining the features of a newly Decision trees are a way of representing a series of rules presented object in order to assign it to one of a predefined that lead to a class or value. Therefore, they are used for set of classes. directed data mining, particularly classification. One of the While classification deals with discrete outcomes, important advantages of decision trees is that the model is estimation deals with continuously-valued outcomes. In real- quite explainable since it takes the form of explicit rules. This life cases, estimation is often used to perform a classification allows the evaluation of results and the identification of key task. attributes in the process. The rules, which can be expressed Prediction deals with the classification of records easily as logic statements, in a language such as SQL, can be according to some predicted future behavior or estimated applied directly to new records. future value. Both Affinity grouping and market basket analysis have as C. Cluster Detection an objective to determine the things that can go together. Cluster detection consists of building models that find data Clustering aims at segmenting a heterogeneous population records similar to each other. This is inherently undirected into a number of more homogeneous subgroups or clusters data mining, since the goal is to find previously unknown that are not predefined. similarities in the data. Clustering data may be considered a Description is concerned with describing and explaining very good way to start any analysis on the data. Self-similar what is going in a complicated database so as to provide a clusters can provide the starting point for knowing what is in better understanding of the available data. the data and for figuring out how to best make use of it. D. Genetic Algorithms C. The Virtuous Cycle of Data Mining Genetic algorithms (GA), which apply the mechanics of The four stages of the virtuous cycle of data mining are: genetics and natural selection to a search, are used for finding 1. Identifying the problem: where the goal is to identify areas the optimal set of parameters that describe a predictive where patterns in data have the potential of providing value. function. Hence, they are mainly used for directed data 2. Using data mining techniques to transform the data into mining. Genetic algorithms use many operators such as the actionable information: for this purpose, the produced results selection, crossover, and mutation to evolve successive need to be understood in order to make the virtuous cycle generations of solutions. As these generations evolve, only successful. Numerous pitfalls can interfere with the ability to the most predictive survive, until the functions converge on use the results of data mining. Some of the pitfalls are bad an optimal solution. data formats, confusing data fields, and lack of functionality. In addition, identifying the right source of data is crucial to the IV. RELATED WORK results of the analysis, as well as bringing the right data together on the computing system used for analysis. A. Classification using Partial Least Squares with 3. Acting on the information: where the results from data Penalized Logistic Regression [6] mining are acted upon then fed into the measurement stage. In this paper, the classification problem is viewed as a 4. Measuring the results: this measurement provides the regression one with few observations and many predictor feedback for continuously improving results. These variables. A new method is proposed combining partial least squares (PLS) and Ridge penalized logistic regression. The measurements make the virtuous cycle of data mining basic methods are then reviewed based on PLS and/or virtuous. Even though the value of measurement and penalized likelihood techniques, their interest in some cases continuous improvement is widely acknowledged, it is usually are outlined and their sometimes poor behavior is theoretically given less attention than it deserves. explained. This procedure is compared with these other classifiers. The predictive performance of the resulting III. DATA MINING TECHNIQUES classification rule is illustrated on three data sets: Some of the mostly used techniques are the following: Leukemia, Colon and Prostate. A. Neural Networks B. Marker Identification and Classification of Cancer A Neural network may be defined as "a model of reasoning Types using Gene Expression Data and SIMCA [7] based on the human brain" [5]. It is probably the most The objective of this research was the development of a common data mining technique, since it is a simple model of computational procedure for feature extraction and neural interconnections in brains, adapted for use on digital classification of gene expression data. The Soft Independent computers. It learns from a training set, generalizing patterns Modeling of Class Analogy (SIMCA) approach was inside it for classification and prediction. Neural networks implemented in a data mining scheme in order to allow the can also be applied to undirected data mining and time-series identification of those genes that are most likely to confer prediction. robust and accurate classification of samples from multiple tumor types. The proposed method was tested on two different microarray data sets where the identified features represent a 310
  3. 3. World Academy of Science, Engineering and Technology 8 2005 rational and dimensionally reduced base for understanding the objective indices to estimate the performance of diagnosis biology of diseases, defining targets of therapeutic results. Sensitivity and specificity are the most two important intervention, and developing diagnostic tools for classification indices that a doctor concerned about. With sensitivity of pathological states. The analysis of the SIMCA model 93.33% and specificity 96.67%, the proposed method residuals was able to identify specific phenotype markers. On provides objective evidences for good diagnoses of breast the other hand, the class analogy approach allowed the tumors. assignment to multiple classes, such as different pathological V. RESEARCH METHODOLOGY conditions or tissue samples, for previously unseen instances. A. What kind of data are we working on? C. Data Mining the NCI Cancer Cell Line Compound Childhood Acute Lymphoblastic Leukemia [13] GI(50) Values: Identifying Quinone Subtypes Effective Childhood acute lymphoblastic leukemia (also called Acute Against Melanoma and Leukemia Cell Classes [8] Lymphocytic Leukemia or ALL) is a cancer of the blood and Using data mining techniques, a subset (1400) of bone marrow. This type of cancer usually gets worse quickly compounds from the large public National Cancer Institute if it is not treated. It is the most common type of cancer in (NCI) compounds data repository has been studied. First, a children. functional class identity assignment for the 60 NCI cancer Normally, the bone marrow produces stem cells (immature testing cell lines was carried out via hierarchical clustering of cells) that develop into mature blood cells. gene expression data. Comprised of nine clinical tissue types, In ALL, too many stem cells develop into a type of white the 60 cell lines were placed into six classes-melanoma, leukemia, renal, lung, and colorectal, and the sixth class was blood cell called lymphocytes. These lymphocytes may also comprised of mixed tissue cell lines not found in any of the be called lymphoblasts or leukemic cells. other five classes. Then, a supervised machine learning was In ALL, the lymphocytes are not able to fight infection very carried out using the GI(50) values tested on a panel of 60 well. Also, as the number of lymphocytes increases in the NCI cancer cell lines. With this approach, identified two small blood and bone marrow, there is less room for healthy white sets of compounds that were most effective in carrying out blood cells, red blood cells, and platelets. This may lead to complete class separation of the melanoma, non-melanoma infection, anemia, and easy bleeding. classes and leukemia, non-leukemia classes, were identified. The following tests and procedures may be used: As for attempts to subclassify melanoma or leukemia cell lines • Physical exam and history. based upon their clinical cancer subtype, they met with limited • Complete blood count(CBC) success. • Bone marrow aspiration and biopsy • Cytogenetic analysis D. Cancer Surveillance using Data Warehousing, Data • Immunophenotyping Mining, and Decision Support Systems [9] This study discussed how data warehousing, data mining, • Blood chemistry studies and decision support systems can reduce the national cancer • Chest x-ray burden or the oral complications of cancer therapies. For this goal to be achieved, it first will be necessary to monitor In childhood ALL, risk groups are used instead of populations; collect relevant cancer screening, incidence, stages.Risk groups are described as: treatment, and outcomes data; identify cancer patterns; explain the patterns, and translate the explanations into • Standard (low) risk: Includes children aged 1 to 9 years effective diagnoses and treatments. who have a white blood cell count of less than 50,000 µ/L Such data collection, processing, and analysis are time at diagnosis. consuming and costly. Success is highly dependent on the • High risk: Includes children younger than 1 year or older abilities, skills, and domain knowledge of the interested than 9 years and children who have a white blood cell parties. Even the most talented and skilled parties have count of 50,000/µL or more at diagnosis. incomplete knowledge about the study domain, pertinent information technology (IT), and relevant analytical tools. It is important to know the risk group in order to plan Data sharing across interested groups is limited. treatment. Consequently, much useful information may be lost in the B. Data Collection cancer surveillance effort. Sources of Data E. Data Mining with Decision Trees for Diagnosis of -National Cancer Institute's database. Breast Tumor in Medical Ultrasonic Images [10] -Real patients' cases from patients' profiles (tickets). In this study, breast masses were evaluated in a series of -Medical reviews (geographical divisions and disease pathologically proven tumors using data mining with decision categories). tree model for classification of breast tumors. -International Cancer Resources (NCI USA) from official Accuracy, sensitivity, specificity, positive predictive value website. [11] and negative predictive value are the five most generally used -Doctors, Professors and Biostatisticians from NCI. 311
  4. 4. World Academy of Science, Engineering and Technology 8 2005 - No. of cases: 172 3) Inconsistent Data There may be inconsistencies in the data recorded for some Methods of Data Collection transactions. Some data inconsistency may be corrected -Data acquisition from the NCI database using digital media manually using external references, for example errors made such as Flash Memories, diskettes and CDs. at data entry may be corrected by performing a paper trace -Capturing data from the NCI network from various spots. (the most used technique in our search, to guarantee the -Collecting data from patients' tickets (hard copies) and maximum data quality possible, by reducing prediction digitizing them (feeding the data into preformatted database). factors). -Note taking Other inconsistency forms are due to data integration, -Using published reviews from NCI containing percentages where a given attribute can have different names in different and distributions [12]. databases. Redundancies may also exist. -Structured interviews with experts. C. Data Cleaning -Real world data, like data acquired from NCI, tend to be incomplete, noisy and inconsistent. Data Cleaning routines attempt to fill on missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. 1) Missing Values Many methods were applied to solve this issue depending Fig. 3 Data Inconsistency (El-Menya and El-Minya represent the on the importance of the missing value and its relation to the same value) search domain. D. Data Integration • Fill in the missing value manually • Use a global constant to fill in the missing value Data Mining often requires data integration, the merging of data from multiple data sources into one coherent data store. These sources include in our case NCI database, flat files, and data entry values. Equivalent real-world entities from multiple data sources must be matched up, for example, patient_id in one database must be matched up with patient_number in Fig. 1 Missing values another database. Careful integration of the data from multiple sources helped 2) Noisy Data reducing and avoiding redundancies and inconsistencies in the Noise is a random error or variance in a measured variable. resulting data set. This helped improving the accuracy and Many techniques were used to smooth out the data and speed of the subsequent mining process. remove the noise. E. Data Selection • Clustering Selecting fields of data of special interest for the search Outliers were detected by clustering, where similar values are domain is the best way to obtain results relevant to the search organized into groups, or clusters, values that fall outside of criteria. In this research Acute Lymphoblastic Leukemia the set of clusters may be considered outliers. clustering was the aim, so data concerning the diagnosis of • Combined computer and human inspection ALL and data concerning the patients of ALL were carefully Using clustering techniques and constructing groups of data selected from the overall data sets, and mining techniques sets, human can then sort through the patterns in the list to were applied to these specific data groups in order to reduce identify the actual garbage ones. This is much faster than the interesting patterns reached to the ones that represent an having to manually search through the entire database. interest for the domain. F. Data Transformation In Data Transformation, the data is transformed or consolidated into forms appropriate for mining. • Smoothing: which works to remove the noise form data. Such techniques include binning, clustering, and regression. • Aggregation: where summary or aggregation operations are applied to the data. Fig. 2 Outliers 312
  5. 5. World Academy of Science, Engineering and Technology 8 2005 • Generalization of the data: where low-level data are replaced by higher-level concepts through concept hierarchies. • Normalization: where the attribute data are scaled so as to fall within a small specified range. • Attribute construction: where new attributes are constructed and added from the given set of attributes to help the mining process. Fig. 4 Database Connection G. Data Mining Viewing the Data in a Tabular Form 1) Choosing the Tool After linking the software with the data source, we view the SPSS Clementine 8.1 data from the database in a tabular form by linking the data As a data mining application, Clementine offers a strategic source to a "table" output. approach to finding useful relationships in large data sets. In contrast to more traditional statistical methods, you do not necessarily need to know what you are looking for when you start. You can explore your data, fitting different models and investigating different relationships, until you find useful information. Working in Clementine is working with data. In its simplest form, working with Clementine is a three-step process. First, you read data into Clementine, then run the data through a series of manipulations, and finally send the data to a destination. This sequence of operations is known as a data Fig. 5 Viewing data in tabular form stream because the data flows record by record from the source through each manipulation and, finally, to the This enables us to assure that the link is successfully built destination--either a model or type of data output. Most of and let us take a look on the form of data read by the software your work in Clementine will involve creating and modifying to detect any loss, inconsistency or noise that may have data streams. occurred in the linking process. At each point in the data mining process, Clementine's visual interface invites your specific business expertise. Manipulating the Data Modeling algorithms, such as prediction, classification, By using the record Ops, operation concerning the records segmentation, and association detection, ensure powerful and as a whole can be applied and used to operate on the data, accurate models. Model results can easily be deployed and using sampling, aggregation, sorting etc… read into databases, SPSS, and a wide variety of other applications. You can also use the add-on component, Clementine Solution Publisher, to deploy entire data streams that read data into a model and deploy results without a full Fig. 6 Record Operations version of Clementine. This brings important data closer to decision makers who need it. Record Ops are linked to the data source directly and their The numerous features of Clementine's data mining output can be in any "output" means or can be directly fed as workbench are integrated by a visual programming interface. an input to other functions. You can use this interface to draw diagrams of data operations Using the Field Ops on specific fields allows us to explore relevant to your business. Each operation is represented by an the data deeper, by using type selecting and filtering some icon or node, and the nodes are linked together in a stream fields as input or output fields, deriving new fields and representing the flow of data through each operation. binning fields. 2) Using the Tool Creating the Data Source to use: In our case an Oracle 9i Database was set and data fed into Fig. 7 Field Operations it in one table, ODBC was used to link the data source with H. Data Evaluation the Clementine engine. After applying the data mining techniques comes the job of identifying the obtained results, in form of interesting patterns representing knowledge depending on interestingness measures. These measures are essential for the efficient 313
  6. 6. World Academy of Science, Engineering and Technology 8 2005 discovery of patterns of value to the given user. Such construct a full view of the resulted patterns and levels of measures can be used after the data mining step in order to accuracy of each technique may be very useful for this rank the discovered patterns according to their interestingness, application. filtering out the uninteresting ones. More importantly, such measures can be used to guide and constrain the discovery ACKNOWLEDGMENT process, improving the search efficiency by pruning away Thanks and gratitude to supervisors from NCI Egypt, subsets of the pattern space that do not satisfy pre-specified Department of Biostatistics & Cancer Epidemiology interestingness constraints. Prof. Dr. Inas El-Attar, Head of Department Dr. Nelly Hassan, Assistant Professor I. Knowledge Representation (outcome) Dr. Manar Mounir, Lecturer In this step visualization and knowledge representation Pediatrics Department : Dr. Wael Zekri, Specialist techniques are used to present the mined knowledge to the Special Thanks to May Nabil El-Shaarawy and Mr. Kamal user. All the operations applied on the records and fields, and Amin. the mining process itself are represented in the form visualizations and graphics in this step. REFERENCES [1] J. P. Bigus, "Data Mining with Neural Networks", New York: McGraw- Hill, 1996 [2] NCI Egypt website ( viewed on 1st August 2005 [3] M. Berry and S. Gordon, "Data Mining Techniques: For Marketing, Sales, and Customer Support", May 1997 [4] Han, J. and Kamber, M., "Data Mining Concepts and Techniques", 2001 [5] M. Negnevitsky, "Artificial Intelligence, A Guide to Intelligent Systems", England: Pearson Education Limited, 2002. [6] G. Fort, S. Lambert Lacroix, “Classification using partial least squares with penalized logistic regression”, England: Bioinformatics-Oxford, 2005. [7] S. Bicciato, A. Luchini, C. Di-Bello, “Marker identification and classification of cancer types using gene expression data and SIMCA”, Germany: Methods-of-information-in-medicine, 2004. Fig. 8 Distribution of IPT [8] K. A. Marx, P. O'Neil, P. Hoffman, M. L. Ujwal, “Data mining the NCI cancer cell line compound GI(50) values: identifying quinone subtypes effective against melanoma and leukemia cell classes”, United-States: Journal-of-chemical-information-and-computer-sciences, 2003. [9] G. A Forgionne, A. Gagopadhyay, and M. Adya, “Cancer Surveillance Using Data Warehousing, Data Mining, and Decision Support Systems”, Topics in Health Information Management, vol. 21(1); Proquest Medical Library, August 2000 [10] W. Kuo, R. Chang, D. Chen and C. C. Lee, “Data Mining with Decision Trees for Diagnosis of Breast Tumor in Medical Ultrasonic Images”, Breast Cancer Research and Treatment, Dordrecht, vol. 66, Iss. 1, Mar 2001. [11] National Cancer Institute official website ( viewed on Fig. 9 Distribution of Gov and IPT 1st August 2005 [12] Periodicals of NCI Egypt (2001) VI. CONCLUSIONS [13] "Introduction to Data Mining and Knowledge Discovery - Third Edition", Two Crows Corporation (pdf) Based on the previous work, the following conclusions were drawn: Nevine Makram Labib, a full timer Lecturer at the Sadat Academy for 1. Decision Trees, as a data mining technique, is very useful Management Sciences, Department of Computer & Information System and in the process of knowledge discovery in the medical field, Vice-Dean of the Training Center Alexandria Branch in addition to being an especially in the domains where available data have many International Trainer in Information Technology and Management-related disciplines. limitations like inconsistent and missing values. Dr. Makram has over ten years experience in the Information Technology In addition, using this technique is very convenient since field. She attained both her Doctoral Degree and Masters Degree specializing the Decision Tree is simple to understand, works with mixed in this very important area and has her academic background into practice, data types, models non-linear functions, handles classification, gaining vast hands on experience. Dr. Makram is known for the wealth of real-life expertise and experience and most of the readily available tools use it. she brings to the courses she runs leaving an outstanding impression for all 2. SPSS Clementine is very suitable as a mining engine her attendees and is always asked to return for more training. She has also with its interface and manipulating modules that allow data served as a speaker at numerous international preferences focusing on exploration, manipulation and exploration of any interesting Artificial Intelligence, Expert Systems, and Information Technology. knowledge patterns Michael Nabil Malek, Researcher graduated from Sadat Academy for 3. Using better quality of data influences the whole process Management Sciences, department of Computers and Information systems. of knowledge discovery, takes less time in cleaning and One year experience in IT field including training as System Administrator in integration, and assures better results from the mining process. Mena House Oberoi Hotel and Casino, and as Communication Engineer in ALCATEL. He has many projects concerning Data Mining, especially in the 4. Using the same data sets with different mining fields of hospitality and medicine. In addition, he has a working knowledge of techniques and comparing results of each technique in order to Oracle Systems (Development and Administration). 314