Building Clinical Data Warehouse for Traditional Chinese Medicine Knowledge Discovery



2008 International Conference on BioMedical Engineering and Informatics

Xuezhong Zhou1, Baoyan Liu2, Yinghui Wang3, Runsun Zhang3, Ping Li3, Shibo Chen2, Yufeng Guo3, Zhuye Gao4, Hua Zhang4

1 College of Computer Science and Information Technology, Beijing Jiaotong University, Beijing, 100044, China
2 China Academy of Chinese Medicine Sciences, Beijing, 100070, China
3 Guanganmen Hospital, China Academy of Chinese Medicine Sciences, Beijing, 100053, China
4 Beijing University of Chinese Medicine, Beijing, 100029, China

978-0-7695-3118-2/08 $25.00 © 2008 IEEE. DOI 10.1109/BMEI.2008.83

Abstract

The clinical data from the daily clinical process, which adheres to traditional Chinese medicine (TCM) theories and principles, is the core empirical knowledge source for TCM research. This paper introduces a data warehouse system, built on a structured electronic medical record system and daily clinical data, for TCM clinical research and medical knowledge discovery. The system consists of several key components: a clinical data schema, an extraction-transformation-loading (ETL) tool, online analytical processing (OLAP) based on Business Objects (a commercial business intelligence software), and integrated data mining functionalities. Currently, the data warehouse contains 20,000 inpatient records of diabetes, coronary heart disease and stroke, and more than 20,000 outpatient records. Moreover, we have developed several important research-oriented subject analyses using OLAP, and conducted several TCM clinical data mining applications. The analysis applications show that the developed clinical data warehouse platform is a promising bridge between TCM clinical practice and theoretical research, and hence will promote related TCM research.

Keywords: Clinical data warehouse, Traditional Chinese medicine, Extraction-transformation-loading, Online analytical processing, Data mining

1. Introduction

Traditional Chinese medicine (TCM) has a long history and distinguished clinical effects. Unlike modern biomedical science, TCM has no general experimental practice in the laboratory. Instead, clinical practice, or clinical experiment, is the core basis of TCM. Hence, new Chinese medical formulas and theoretical knowledge come not from the laboratory but directly from daily clinical practice. Clinical practice with synthesized treatment based on syndrome differentiation (STSD) is the basis of TCM clinical evaluation and clinical study [1]. The huge store of clinical data is first-hand and effective evidence for TCM clinical research. Developing the clinical wet-dry mode is a significant and vital task of TCM research [2].

The data warehouse [3] is a technical solution for the storage, management and processing of immense data. The increased demand for financial analysis [4], disease control [5], clinical decision support [6], adverse drug event analysis [7], laboratory test data analysis [8] and information feedback for hospital practice management [9] in healthcare organizations has given rise to the research and development of clinical data warehouses. Clinical data warehousing is a difficult systematic task with many particularly complicated issues, such as many-to-many relationships, the entity-attribute-value (EAV) data structure and bitemporal data [10]. Data integration for medical data stores is challenging; hence data warehouse architectures [11] have been studied to propose practicable solutions to data integration issues.

Compared with the clinical data of modern medicine, TCM clinical data has some significant and distinct information contents, such as symptom/sign, TCM syndrome, and formula and herb. These three types of information are the core elements of TCM clinical data. Moreover, systematically described symptom/sign information is the foundation of TCM syndrome diagnosis. Therefore, medical records containing symptoms/signs should be structured and stored in a relational database. To utilize and analyze the daily TCM clinical data for TCM research, we have
developed a clinical data warehouse system based on the structured TCM electronic medical record system (SEMR) [12], which stores the information of the medical record (e.g. chief complaint and histories) in structured form. Furthermore, since most TCM clinical data, such as symptom/sign, diagnosis and formula prescription, is represented by terminologies, we have made a systematic study of TCM clinical terminology and nomenclature [13] to facilitate data entry and standard representation.

We have collected about 20,000 inpatient records from TCM hospitals (ten top-grade hospitals in Beijing, China) or TCM wards on diabetes, coronary heart disease (CHD) and stroke. Furthermore, there are more than 20,000 outpatient data instances, which record the outpatient clinical process of more than twenty famous TCM physicians in Beijing, China. By comprehensively analyzing the characteristics of the TCM clinical data structure and the analysis subjects of TCM clinical research, we have designed the information model, physical data model and multidimensional data model for the clinical data warehouse. Meanwhile, we have developed an extraction-transformation-loading (ETL) tool, Medical Integrator (MI), to handle clinical data integration, cleaning and preprocessing. Furthermore, we have integrated the data mining systems, namely Weka and Oracle Data Miner, and a business intelligence tool (Business Objects) to implement a TCM clinical intelligence platform with data mining and online analytical processing (OLAP) abilities.

2. The infrastructure of TCM clinical data warehouse

As a comprehensive platform for TCM clinical and theoretical research, the TCM clinical data warehouse system is built on Java and the J2EE platform. The technology infrastructure of the TCM clinical data warehouse is depicted in Fig.1. The infrastructure aims to integrate different operational data sources (e.g. SQL Server, Oracle, DB2) using a self-developed, purpose-built ETL tool. More data sources can be added by extending the database interface configuration. Because the operational data sources are heterogeneous, we use a series of metadata tables to record the metadata (e.g. database type, hospital information, physician information, data content description and transformation information) of the different data sources.

Data storage management is supported by Oracle (currently, we use Oracle 10g as the database server), and the analysis and query service is mainly supported by the distinguished business intelligence system, Business Objects (BO). BO has design and analysis clients such as Crystal Report, Web/Desktop Intelligence, Dashboard and Performance Manager to implement the OLAP functionalities. Meanwhile, we integrate the Oracle data mining option with its client, Oracle Data Miner, and the machine learning platform Weka to perform online data mining tasks. Therefore, the infrastructure builds a technological framework for the integration, preprocessing, management and online analysis of huge TCM clinical data.

Fig.1. The technology infrastructure of TCM clinical data warehouse.

As a platform aimed at TCM clinical research, the TCM clinical data warehouse system can also directly provide preprocessed data to statistical software (e.g. SPSS, SAS and STATISTICA) for statistical analysis and testing. Hence, from the application perspective, the TCM clinical data warehouse offers an integrated functional platform supporting raw clinical data integration and data cleaning, OLAP, data mining and statistical analysis tasks.

3. Traditional Chinese medicine clinical data model design

Information model analysis and design is the vital step of TCM clinical data warehouse development. A medical information model such as the HL7 reference information model (RIM) [14] is a very complicated system with various concepts and relationships. The objectives of HL7 RIM are to support the medical operational process and, in particular, information exchange between different medical information systems. The semantic network of the unified medical language system (UMLS) [15] is considered the distinguished medical ontology in modern biomedical science. Its semantic types and structures provide a global conceptual view of medical terminologies. The focus of UMLS is to bridge the gap between the different terminological systems used in the medical literature. Hence, the design of the core framework of its semantic network adheres to the principle of conceptual unification.
However, the information model of the TCM clinical data warehouse focuses on the information content that will be analyzed and used in TCM clinical and theoretical research. Hence, the classification and definition of the information generated by TCM clinical processes is the emphasis of our work. We consider the TCM clinical process as a dynamic system with two core entities, namely physician and patient, and three core information elements, namely symptom, disease/TCM syndrome and treatment. The symptom information element is regarded as a relatively objective disease phenomenon, whereas disease/TCM syndrome is a type of human morbid status, which is the diagnosis result of a specific physician. Meanwhile, the TCM treatment is a clinical event that aims to restore the patient's health. Therefore, by abstracting these five core elements (the two entities and three information elements) and constructing the global conceptual framework of the TCM information model, we design an information model for the clinical data warehouse. We consider that the main information content of TCM clinical research is the study of the relationships between different entities in one event and the relationships between different events. Therefore, we can regard the clinical information as various kinds of events (phenomena and activities); in every event, several conceptual entities and physical entities may participate at a specific time. Because TCM and modern medical concepts and methods are mixed in the current TCM clinical process in hospitals, the sub-classes of the entity class are also a mixture of TCM and modern medical classes. For example, we have defined two distinct disease classes in the model. One class represents the disease concept in TCM, while the other represents the modern medical concept. It should be noted that the entity classes are materialized as dictionary tables in the physical data model of the data warehouse. A more detailed description of the information model is given in [16].

Adhering to the defined information model, we have designed the physical data model to store and manage the TCM clinical data. Furthermore, to support multidimensional analysis such as OLAP, we have designed several core multidimensional data models as the data structure basis of the data marts. We have developed several significant subject analysis applications for TCM clinical research. Each subject analysis application has a corresponding relational multidimensional data model. The practical results show that the information model and multidimensional data models support the clinical analysis applications very well.

4. Medical Integrator

ETL is the core component of a successful data warehouse system. Because of the requirements of complex clinical data structures, flexible data checking, integration of multiple heterogeneous data sources and extensive terminological standardization, even commercial ETL systems do not fit the tasks well. Hence we developed MI, a purpose-built ETL tool using Java and the Eclipse standard widget toolkit (SWT), to implement the required functions. Fig.2 is a snapshot of the main form of MI. Its key functions include data connection configuration, data checking, source database consolidation, data transformation and loading, data cleaning, data standardization and the data analysis interface.

Besides the traditional ETL functions, MI focuses on particular functions such as data standardization and the data analysis interface. The data standardization process mainly concerns the standardization of terminological data such as symptom, diagnosis and treatment (herb name, descriptive phrase of the therapeutic method, etc.). Because the clinical data contains various terms and phrases with flexible expressions, and also errors, data standardization is vital for effective analysis. We use a rule-based batch processing approach for these tasks. About 8 rule tables are designed to store the different kinds of standardization rules. The rules are edited and imported into the corresponding tables by TCM clinical experts using Medical Integrator. To keep the original data available for different analysis applications, MI builds the necessary intermediate tables to store the processed data, and provides a standardized data set for different potential data analyses. Take the symptom standardization process as an example. The expressions of symptoms vary widely in clinical practice owing to the personal preferences of different physicians. Erroneous expressions or writings are also possible in such a huge data store. We let domain experts edit four kinds of transformation rules to standardize the symptoms. The four kinds of rules govern noise data cleaning, unified term description, terminological granularity unification and synonym unification. The result of symptom standardization is a set of terminological phrases with unified concepts.

The EAV structure [10][17] is the preferred choice in clinical data models. However, most statistical and data mining systems require conventional flat-style data. Moreover, some analysis systems need encoded data. Hence, to seamlessly integrate the statistical and data mining systems, we have developed several key functions (e.g. automatic encode process,
EAV to flat schema conversion and data exporting) for the data analysis interface. Using the functions of MI, we can prepare high-quality data sets for various data analysis tasks.

Fig. 2. The main interface of Medical Integrator with functional items.

5. Data analysis components

Based on the multidimensional data model and ETL preprocessing, the clinical data has been prepared for the analysis and data mining tasks of clinical research. We use BO to provide the OLAP analysis. The data mining systems Oracle Data Mining and Weka are also integrated into the clinical data warehouse system.

BO has multidimensional analysis report design tools such as Crystal Report and Web/Desktop Intelligence. The BO platform is also a middleware server that supports the management, design and browsing of reports in a browser/server (B/S) framework. The semantic layer is a patented product of the Business Objects company. It maps the data structure to domain knowledge categories. Compared with the complicated physical data structure in the data warehouse, the semantic layer (categories and attributes) is rather simple and medically meaningful.

Oracle Data Mining is an option of Oracle 10g Enterprise Edition. We have integrated its data mining client, Oracle Data Miner, into the TCM clinical data warehouse. Furthermore, we have integrated the famous open-source machine learning platform Weka (version 3.4) [18], configured with JDBC to use the data in the data warehouse directly. Both integrated data mining systems can access the clinical data warehouse online, which makes data mining tasks more convenient.

6. Clinical data analysis and knowledge discovery case studies

Clinical practice has a vital role in TCM research and development. Inductive analysis of the empirical data from clinical practice is a key step of TCM clinical research. Moreover, the study of the relationships between primary conceptual elements such as disease, syndrome, symptom/sign, herb and formula is the central issue of TCM clinical research.

6.1. Online analytical processing and description analysis

Based on the multidimensional data schema and BO semantic layers, we have developed 10 OLAP subject analysis applications with more than 400 analysis reports. The subjects mainly focus on two types of clinical knowledge: the empirical diagnosis and treatment knowledge of famous TCM physicians, and the clinical features of vital chronic diseases such as diabetes, stroke and CHD. The subjects cover the data profile of physicians or diseases, clinical herb and formula usage, and the relationships among clinical findings, TCM syndrome, disease and complication, etc. The analysis reports can be accessed by authorized web users. Besides browsing the reports interactively, the user can also export the results in Excel or PDF format.

Fig. 3 is a screenshot of the global data profile (the graphic area) of a famous TCM physician. It presents the total number of patient instances, consultation times, the disease distribution, herb and formula usage, symptom distribution and therapeutic methods, etc. The global data profile provides the baseline information of the clinical data related to a specific physician. Fig. 3 shows that the clinical data of this physician mainly concerns diseases such as Xiong Bi (thoracic obstruction of Qi), gastric pain, Xin Ji (palpitation) and vertigo.

Fig. 3. The global profile analysis of outpatient clinical data of a famous TCM physician.

We can also learn a famous TCM physician's herb usage knowledge for a TCM syndrome (Fig. 4) or a symptom (Fig. 5). Other empirical knowledge, such as the clinical use of classical formulas and regular herb dosage, is analyzed by the corresponding OLAP reports. All the developed reports have appropriate parameters, such as physician name and disease name, that can
be selected by users on demand to show the analysis results for different physicians or diseases. The exploratory analysis of the inpatient data focuses on the relationships among disease, TCM syndrome and clinical findings.

Fig. 4. The herb usage information on a specific TCM syndrome of a famous TCM physician.

Fig. 5. The relationships between herb and symptom show which herbs would be prescribed for a specific symptom.

6.2. Data mining

With the integrated data mining abilities and preprocessing functions of the clinical data warehouse, we have successfully conducted several preliminary TCM clinical data analysis studies, such as acupuncture prescription knowledge discovery [19], the relationship between formula (herbs) and syndrome in type 2 diabetes mellitus (T2DM) affiliated with metabolic syndrome [20], herb treatment for T2DM [21], and cluster analysis of TCM syndrome types in patients with acute myocardial infarction [22].

The acupuncture prescription knowledge discovery study [19] focuses on the empirical clinical acupuncture prescriptions of Prof. Conghuo Tian of the acupuncture department of Guanganmen Hospital, Beijing, China. Using the association rule mining method in Weka, we obtained 18 acupuncture prescriptions from more than 1,200 medical records. Prof. Tian indicated that one of the eighteen acupuncture prescriptions is not a fixed prescription in his clinical practice. Therefore, we finally obtained 17 useful acupuncture prescriptions (with prescription name, acupuncture point composition, modifications, main efficacy, etc.), which reflect the empirical knowledge of Prof. Tian. More data mining case studies on the outpatient clinical data can be found in [16].

The data mining case studies on the inpatient clinical data focus on T2DM and CHD. T2DM is still a relatively new disease for TCM treatment, and the TCM syndrome classification of T2DM is an open research issue. We studied the TCM syndrome classification of T2DM with metabolic syndrome by herb composition network analysis [20]. We found that, as the disease course extends, the therapeutic methods for T2DM with metabolic syndrome mainly include nourishing Yin and clearing away heat, replenishing Qi and nourishing Yin, and replenishing Qi and nourishing blood, etc. This indicates that the TCM syndrome categories of T2DM affiliated with metabolic syndrome are Yin Deficiency Heat Excess (early stage), Qi-Yin Deficiency (middle stage) and Qi-Deficiency Blood Stasis (terminal stage). The result offers preliminary guidance for the clinical treatment of patients with T2DM affiliated with metabolic syndrome. We have also studied the herb prescription knowledge for T2DM with different complications [21], which likewise provides useful information for the TCM treatment of T2DM.

7. Conclusion and Future Work

In conclusion, clinical research built on real TCM clinical practice, which adheres to STSD, is the essential requirement of TCM research. This paper proposes a data warehouse solution for the organization, management, processing and analysis of clinical data. We have completed the whole framework and developed the core components, such as the clinical information model, ETL tool, OLAP and data mining functions. Moreover, based on the collected structured EMR data, we have developed and performed several research-oriented subject analyses and data mining tasks. The data analysis case studies show that the clinical data warehouse provides a handy platform for TCM clinical knowledge discovery. Therefore, the clinical data warehouse is a promising infrastructure for TCM clinical and theoretical research. However, the project is still in progress. We will focus on the following three tasks in the future.

Privacy and security are the main problems in using and sharing clinical data. We will address the protection of information about both physicians and patients. This has been considered in the current ETL tool and data analysis applications.

Currently, the clinical data only contains TCM research-oriented information, while hospital management information is not yet covered. Because of the decision support requirements of hospital management, we will consider integrating the data from hospital
information system and developing the corresponding subject analyses.

Compared with collecting free-text EMR data, collecting high-quality structured clinical data is still a laborious job. Therefore, the limited amount of data has not yet allowed full use of the whole clinical data warehouse framework. We have been working hard on upgrading the SEMR system to facilitate data entry. Furthermore, with more TCM hospitals adopting the SEMR system as their regular EMR collection tool and more research projects permitted to provide their data, the current data capacity will increase rapidly in the near future.

Acknowledgements

This work is partially supported by Scientific Breakthrough Program of Beijing Municipal Science & Technology Commission, China (H020920010130), China Postdoctoral Science Foundation (2005037106), China Key Technologies R & D Programme (2007BA110B06), China 973 project (2006CB504601) and the Science and Technology Foundation of Beijing Jiaotong University (2007RC072).

References

[1]. Liu B., Hu J., Xie Y., et al, Conception and Study in Establishment of Modern Individualized Diagnosis and Treatment System in TCM (in Chinese). World Science and Technology-Modernization of TCM. 2003; 5(1):1-5.
[2]. Liu B., Zhou X., Design and Practice of Wet-Dry Approach in Clinical Research of TCM (in Chinese). World Science and Technology-Modernization of TCM. 2007; 9(1):85-9.
[3]. Inmon W.H., Building the Data Warehouse (Third Edition), John Wiley & Sons, Inc., 2002.
[4]. Silver M., Sakata T., Su H., et al, Case study: how to apply data mining techniques in a healthcare data warehouse. J Healthc Inf Manag. 2001; 15:155-64.
[5]. Wisniewski M.F., Kieszkowski P., et al, Development of a Clinical Data Warehouse for Hospital Infection Control. JAMIA. 2003; 10(5):455-62.
[6]. Banek M., Tjoa A. M., Stolba N., Integrating Different Grain Levels in a Medical Data Warehouse Federation. In Proceedings of Data Warehousing and Knowledge Discovery, A. Min Tjoa, Juan Trujillo (Eds.), 2006, Krakow, Poland, LNCS, 4081, 185-94.
[7]. Einbinder J.S., Scully K., Using a Clinical Data Repository to Estimate the Frequency and Costs of Adverse Drug Events. JAMIA. 2002 Nov-Dec; 9(6 Suppl 1):s34-s38.
[8]. Allard R.D., The clinical laboratory data warehouse. An overlooked diamond mine. Am J Clin Pathol. 2003, 817-9.
[9]. Grant A., Moshyk A., Diab H., et al, Integrating feedback from a clinical data warehouse into practice organisation. Int J Med Inform. 2006; 75:232-9.
[10]. Pedersen T.B., Jensen C.S., Research Issues in Clinical Data Warehousing. In Proceedings of SSDBM-98, Italy, July 1-3, 1998.
[11]. Sahama T.R., Croll P.R., A data warehouse architecture for clinical data warehousing. In Proceedings of the fifth Australasian symposium on ACSW frontiers, Australian Computer Society, Inc., Darlinghurst, Australia, 2007; 68:227-32.
[12]. Li P., Liu B., Wen T., et al, Traditional Chinese medicine electronic medical record system and the reorganization of TCM theoretical knowledge (in Chinese). Chinese Journal of Information on TCM. 2005; 12(4):7, 39.
[13]. Guo Y., Liu B., Li P., et al, Ontology and Standardization of the TCM Terms (in Chinese). Chinese Archives of TCM. 2007; 25(7):1368-70.
[14]. HL7 Reference Information Model, library/data-model/RIM.
[15]. Lindberg D.A.B., Humphreys B.L., McCray A.T., The Unified Medical Language System. Meth Inform Med. 1993; 32:281-91.
[16]. Zhou X., The Research on TCM Clinical Data Warehousing and Clinical Data Mining Methods (in Chinese). Postdoctoral Report, China Academy of Chinese Medical Sciences, 2007.3.
[17]. Deshpande A.M., Brandt C., Nadkarni P.M., Metadata-driven Ad Hoc Query of Patient Data Meeting the Needs of Clinical Studies. JAMIA. 2002; 9(4):369-82.
[18]. Witten I.H. and Frank E., Data Mining: Practical machine learning tools and techniques (2nd Edition), Morgan Kaufmann, San Francisco, 2005.
[19]. Zhang H., Tian C., Liu B., et al, Study on the idea of clinical acupuncture point combination of TCM physician Tian (in Chinese). Journal of Clinical Acupuncture and Moxibustion. 2007.2, 23(2):36-8.
[20]. Ni Q., Liu B., Chen S., et al, Study of Relationship between Formula (herbs) and Syndrome about Type 2 Diabetes Mellitus Affiliated Metabolic Syndrome Based on the Scale-free Network (in Chinese). Chinese Journal of Information on TCM. 2006; 13(11):19-22.
[21]. Jian Z., Ni Q., Zhou X., et al, Study on treatment law of type 2 diabetes based on structural clinical information collect system (in Chinese). Journal of Shandong University of TCM. 2007; 31(3):195-7.
[22]. Zhuye Gao, Hao Xu, Dazuo Shi, et al, The Cluster Analysis on Syndrome Type of TCM in Patients with Acute Myocardial Infarction (in Chinese). Journal of Emergency in TCM. 2007; 16(4):432-4.
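
Supplementary sketch. Section 4 mentions that Medical Integrator converts EAV data to a flat schema and automatically encodes values for statistical and data mining tools. The following minimal Python sketch illustrates that idea only; it is not the MI implementation (MI is written in Java and its internals are not described in this paper). The function names `eav_to_flat` and `encode_column`, the visit identifiers and the attribute names are all hypothetical, and the sketch assumes that each (entity, attribute) pair carries a single value.

```python
from collections import defaultdict


def eav_to_flat(rows, missing=None):
    """Pivot (entity, attribute, value) triples into one flat row per entity."""
    attributes = sorted({attr for _, attr, _ in rows})
    by_entity = defaultdict(dict)
    for entity, attr, value in rows:
        # Assumption: one value per (entity, attribute); a later triple overwrites.
        by_entity[entity][attr] = value
    header = ["entity"] + attributes
    table = [
        [entity] + [by_entity[entity].get(attr, missing) for attr in attributes]
        for entity in sorted(by_entity)
    ]
    return header, table


def encode_column(table, col_index):
    """Replace the values of one column with integer codes, in place."""
    codes = {}
    for row in table:
        value = row[col_index]
        if value is None:
            continue  # leave missing cells untouched
        row[col_index] = codes.setdefault(value, len(codes))
    return codes


# Hypothetical sample data: two outpatient visits stored as EAV triples.
eav_rows = [
    ("visit-1", "symptom", "thirst"),
    ("visit-1", "syndrome", "Qi-Yin deficiency"),
    ("visit-2", "symptom", "fatigue"),
]

header, table = eav_to_flat(eav_rows)
symptom_codes = encode_column(table, header.index("symptom"))
print(header)
print(table)
print(symptom_codes)
```

In real clinical data a visit usually records several symptoms, so a practical conversion would need one column per symptom term (or a similar multi-valued scheme) rather than the single-value pivot assumed here.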