Optimized Column-Oriented Model: A Storage and Search Efficient Representation of Medical Data

  • 710 views
Uploaded on

Optimized Column-Oriented Model: A Storage and Search Efficient Representation of Medical Data

Optimized Column-Oriented Model: A Storage and Search Efficient Representation of Medical Data

More in: Technology
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
    Be the first to like this
No Downloads

Views

Total Views
710
On Slideshare
0
From Embeds
0
Number of Embeds
0

Actions

Shares
Downloads
14
Comments
0
Likes
0

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Optimized Column-Oriented Model: A Storage and Search Efficient Representation of Medical Data Razan Paul and Abu Sayed Md. Latiful Hoque Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh razanpaul@yahoo.com, asmlatifulhoque@cse.buet.ac.bd Abstract. Medical data have a number of unique characteristics like data sparseness, high dimensionality and rapidly changing set of attributes. Entity Attribute Value (EAV) is the widely used solution to handle the above chal- lenges of medical data, but EAV is neither storage efficient nor search efficient. In this paper, we have proposed a storage & search efficient data model: Opti- mized Column-Oriented Model (OCOM) for physical representation of high dimensional and sparse data as an alternative of widely used EAV. We have implemented both EAV and OCOM models in a medical data warehousing en- vironment and performed different relational and warehouse queries on both the models. The experimental results show that OCOM is dramatically search effi- cient and occupy less storage space compared to EAV. Keywords: OCOM, Sparse Data, Open Schema, EAV, Data Model.1 IntroductionA number of exciting research challenges as described in [1] is posed by health caredata that make them different from the data in other industries. Data sparseness arisesbecause doctors perform only a few different clinical lab tests among thousands oftest attributes for a patient over his lifetime. It requires frequent schema change be-cause healthcare data need to accommodate new laboratory tests, diseases being in-vented every day. For the above reasons, relational data representation requires toomany columns and hence the data is high dimensional. Integrating Large-scale medical data is important for knowledge discovery. Oncethis knowledge is discovered, the results of such analysis can be used to define newguidelines and to improve patient care. We require an open schema data model tosupport dynamic schema change, sparse data, and high dimensional data. In openschema data models, logical model of data is stored as data rather than as schema, sochanges to the logical model can be made without changing the schema. EAV is thewidely used open schema data model to handle these challenges of data representa-tion. However, EAV suffers from higher storage requirement and not search efficient.In this paper, we have proposed a storage and search efficient open schema datamodel: OCOM.S. Khuri, L. Lhotská, and N. Pisanti (Eds.): ITBAM 2010, LNCS 6266, pp. 118–127, 2010.© Springer-Verlag Berlin Heidelberg 2010
  • 2. Optimized Column-Oriented Model: A Storage and Search Efficient Representation 119 Section 2 describes the related work. The overview of COM and EAV, the detailsorganizational structure and analysis of OCOM are given in section 3. A data trans-formation is required to adopt the existing data suitable for data warehouse represen-tation. The transformation is elaborated in section 4. Analytical details of performanceof the proposed models are given in section 5. Section 6 offers the result and discus-sion. Section 7 is the conclusion.2 Related WorkEAV [2] gives us extreme flexibility in data representation but it is not search effi-cient as it keeps attribute name as data in attribute column and has no tracking of howdata are stored. The EAV model for phenotype data management has been given in[3]. Use of EAV for medical observation data is also found in [4], [5], [6]. The Entity-Relationship Model is proposed in [7]. Storage and querying of high dimensionalsparsely populated data is proposed in [8], But this model does not keep the data insearch efficient way. Agarwal et al. in [9] propose several methods for the efficientcomputation of multidimensional aggregates. In [10], authors proposed data cube as arelational aggregation operator generalizing group-by, crosstab, and subtotals.3 Open Schema Data Models3.1 Column Oriented Model (OCM)A column-oriented model [11], [12] stores data separately for each column, so a newcolumn can be added without changing existing data. This model has several advan-tages: high performance in aggregate operation on a small number of columns overmany rows and we can perform various operations on a particular column valueswithout touching other column values. OCM can represent high dimensional data butcannot handle sparse data. This model suffers from higher storage requirement torepresent sparse data.3.2 Entity-Attribute-Value Model (EAV)EAV is an open schema data model, which is suitable for High Dimensional andsparse data like medical data. In EAV, every fact is conceptually stored in a table,with three sets of columns: entity, an attribute, and a value for that attribute. In thisdesign, one row actually stores a single fact. It eliminates sparse data to reduce data-base size and allows changing set of attributes. Moreover, EAV can represent highdimensional data, which cannot be modeled by relational model because existingRDBMS only support a limited number of columns. EAV gives us extreme flexibilitybut it is not search efficient as it keeps attribute name as data in attribute column andhas no tracking of how data are stored.3.3 Optimized Column-Oriented Model (OCOM)To remove the search inefficiency problem of EAV whilst preserving its efficiency ofrepresenting High Dimensional and sparse data, we have developed a search efficient
  • 3. 120 R. Paul and A.S.Md. Latiful Hoque Row Wise Representation Column Wise Representation A1 A2 A3 A4 A1 A2 A3 A4 Keep position 12 34 56 23 Make column 12 34 56 23 and Value for 23 null null 6 wise format. 23 null null 6 non null values. null null null 6 null null null 6 null 5 null null null 5 null null A1 A2 A3 A4 Concatenate binary A1 A2 A3 A4 65548 65570 65592 65559 262149 131078 representation of 1,12 1,34 1,56 1,23 131095 position and value 2,23 4,5 2,6 196614 3,6 Positional Bitmap Representation Positional column wise representation Fig. 1. Optimized Column-Oriented Modelopen schema data model OCOM. This model keeps data in a search efficient way.This approach is a read-optimized representation whereas the EAV approach is write-optimized. Most of the data warehouse systems write once and read many times, sothe proposed approach can serve the practical requirement of data warehouse. Figure 1shows the transformation of sparse relational data representation to an equivalentOCOM representation. In this model, data is represented in a column wise format. Theminimum amount of information that needs to be stored is the position of all non-nullelements with their values in a column. Then both the position and the value are con-verted into a compact single integer PV by concatenating their binary representation. InOCOM, every single fact is conceptually stored with a single field: the PV. This model maps position and value of non-null elements to p-bit and q-bit integerrespectively and concatenate them to form a single compact n-bit integer PV. Formedical domain, position represents patient ID and value represents various attributeof patients. This model stores every fact in sorted order of PV and the first p-bit of PVis entity. Hence, facts in this model are stored in sorted order of entity and we canperform binary search on it based on entity. We have considered both the position and value is 16 bit. 16 bit is good enough torepresent medical information, as cardinality of medical data is not high. However, itis not good enough for position, as the number of patients can be large. To handle thisproblem, we have developed a multilevel index structure for each attribute as shownin Figure 2. Each segment pointer table contains pointer of up to 216 segments ofpositional data. Each segment can hold up to 216 positional bitmaps. In this way, wecan store facts of 232 =4294967296 patients. To store value of an attribute of a
  • 4. Optimized Column-Oriented Model: A Storage and Search Efficient Representation 121 Table of Segment Segment of a Positional Bitmap data Pointer Table of Attribute 268 Pointer Segment Pointer . Number . Attribute Pointer 456 A1 0 A2 . . . 311 . . . . . 65535 . . An 345 345 Segment Pointer . Number . 0 333 234 . . . . . . 65535 123 Fig. 2. Multi level index structure for OCOMparticular patient ID, this model select segment pointer table of that attribute, and findsegment number = patient ID/65536. Then it finds the segment and stores the posi-tional bitmap in that segment in sorted order of PV.4 Data Transformation Using Domain Dictionary and Rule BaseFor knowledge discovery, the medical data have to be transformed into a suitabletransaction format to discover knowledge. We have addressed the problem of map-ping complex medical data to items using domain dictionary and rule base. The medi-cal data of diagnoses, laboratory results, allergies, vital signs, and etc are types ofcategorical, continuous numerical data, Boolean, interval, percentage, fraction andratio. Medical domain expert have the knowledge of how to map ranges of numericaldata for each attribute to a series of items. For example, there are certain conventionsto consider a person is young, adult, or elder with respect to age. A set of rules iscreated for each continuous numerical attribute using the knowledge of medical do-main experts. A rule engine is used to map continuous numerical data to items usingthese developed rules. Data, for which medical domain expert knowledge is not applicable, we have useddomain dictionary approach to transform these data to numerical forms. As cardinal-ity of attributes except continuous numeric data are not high in medical domain,these attribute values are mapped integer values using medical domain dictionaries.Here the mapping process is divided in two phases. Phase 1: a rule base is con-structed based on the knowledge of medical domain experts and dictionaries areconstructed for attributes where domain expert knowledge is not applicable, Phase 2:attribute values are mapped to integer values using the corresponding rule base andthe dictionaries.
  • 5. 122 R. Paul and A.S.Md. Latiful Hoque Original Mapped Original Mapped Generate dictionary for Value Value Value Value each categorical attribute Headache 1 Yes 1 Fever 2 No 2 PatientActual Data Age Smoke Diagnosis Dictionary of Dictionary of ID Diagnosis attribute Smoke attribute 1020D 33 Yes Headache 1021D 63 No Fever Map to categorical items using rule base and Actual Data dictionaries If age <= 12 then 1 Medical If 13<=age<=60 then 2 Domain If 60 <=age then 3 Patient Age Smoke Diagnosis Knowledge If smoke = y then 1 ID If smoke = n then 2 1020D 2 1 1 If Sex = M then 1 1021D 3 2 2 If Sex = F then 2 Rule Base Data suitable for Knowledge Discovery Fig. 3. Data Transformation of Medical Data5 Storage Analysis of EAV and OCOMLet b be the total number of blocks of observation table and k is the total number ofattributes of observation table.5.1 Analysis of Storage Capacity of EAVLet n = total number of facts, q = Average length of attribute names, g = Averagelength of values. In EAV, 32 bits (4 bytes) is required to represent entity. Size of eachfact in EAV is ( 4 + q + g ) bytes. Hence, the total size to hold all facts is S = n × ( 4 + q + g ) bytes.5.2 Space Complexity of Medical Domain Dictionaries and Rule BaseLet C i = cardinality of ith categorical medical attribute, Li = average length of ithattribute value, u = number of categorical attributes. Integer codes of categorical at-tribute values, shown in data transformation, are not stored explicitly and the index ofattribute is the code. Domain Dictionary Storage of ith attribute is C i × Li bytes. uTotal domain dictionaries storage (SD) is ∑ ( C i × Li ) bytes. If the size of rule i =1 ubase storage is R, the dictionary and rule base storage (SDR) is ∑ ( C i × Li ) + R i =1bytes.
  • 6. Optimized Column-Oriented Model: A Storage and Search Efficient Representation 1235.3 Analysis of Storage Capacity of OCOMIn OCOM, 16 bits is required to represent entity and 16 bits is required to representvalue individually. Size of each fact in OCOM is 32 bits = 4 bytes. Let, n = totalnumber of facts, m = total number of facts in a block, w = word size (bytes). Totalnumber of blocks is ⎡n / m ⎤ . If word size is less than or equal to size of each fact, thenumber of words per fact is ⎡32 / w⎤ . The number of words per blockis ⎡ ( m × ⎡32 / w⎤ ) ⎤ , the size of each block is w (⎡ ( m × ⎡32 / w⎤ ) ⎤ ) and hence thetotal size to hold all facts is S = ⎡n / m ⎤ × w × (⎡ ( m × ⎡32 / w⎤ ) ⎤ ) . If word size isgreater than size of each fact, the number of facts per word is ⎡w / 32 ⎤ , the numberof words per block is ⎡ ( m / ⎡w / 32 ⎤ ) ⎤ , the size of each block isw (⎡ ( m / ⎡w / 32⎤ ) ⎤ ) and hence the size to hold all facts is⎡n / m ⎤ × w × (⎡ ( m / ⎡w / 32 ⎤ ) ⎤ ) . In OCOM, total size to hold all facts = storage forfacts + storage for domain dictionaries and rule base u = ⎡n / m ⎤ × w × (⎡ ( m × ⎡32 / w ⎤ ) ⎤ ) + ∑ ( Ci × Li ) + R (If w <= size of fact) or i =1 u = ⎡n / m ⎤ × w × (⎡ ( m / ⎡w / 32 ⎤ ) ⎤ ) + ∑ ( C i × Li ) + R (If w > size of fact). i =16 Results and DiscussionThe experiments were done using PC with core 2 duo processor with a clock rate of1.8 GHz and 3GB of main memory. The operating system was Microsoft Vista andimplementation language was c#. We have designed a data generator that generatesall categories of random data: ratio, interval, decimal, integer, percentage etc. Thisdata set is generated with 5000 attributes and (5-10) attributes per transaction on aver-age. We have used highly skewed attributes in all performance evaluations to measurethe performance improvement of our proposed open schema data models in worstcase. For all performance measurement except storage performance, we have used 1million transactions.6.1 Storage PerformanceFigure 4 shows the storage space required by EAV and OCOM. The EAV occupiessignificantly higher amount of storage than OCOM. This is due to the data redun-dancy of EAV models.6.2 Time Comparison of Projection OperationsFigure 5 shows the performance of Projection operations on various combinationsof attributes. Almost same time is needed with different number of attributes inEAV, as it has to scan all the blocks whatever the number of attributes. In OCOM,it can be observed that the time requirement is proportional to the number of
  • 7. 124 R. Paul and A.S.Md. Latiful Hoque 500 Storage(MB) 0 EAV 1 2 3 OCOM Number of transactions(million) Fig. 4. Storage Performance 60 Time(Seconds) 40 EAV 20 OCOM 0 Q1 Q2 Q3 Fig. 5. Time Comparison of Projection Operationsattributes projected. This is because that the query needs to scan more number ofblocks as the number of attributes increases.6.3 Time Comparison of Multiple Predicates Select QueriesFigure 6 shows the performance of multiple predicates select queries on various com-binations of attributes in three different open schema data models. In OCOM, It canbe observed that the number of attributes in select queries leads to the time taken. It isbecause as number of attributes in select queries increases, it has to scan more numberof attribute partitions. Almost same time is taken with different number of attributesin EAV as it has to scans all the blocks twice whatever the number of attributes inpredicate. This experiment shows that EAV requires much higher time compared toother models. It is because that EAV has no tracking of how data are stored, so it hasto scans all the blocks once to select entities and has to scan all the blocks again toretrieve the attribute values for the selected entities. 150 Time(Seconds) 100 EAV 50 0 OCOM Q4 Q5 Q6 Fig. 6. Time Comparison of Select Queries
  • 8. Optimized Column-Oriented Model: A Storage and Search Efficient Representation 1256.4 Time Comparison of Aggregate OperationsAggregate operations compute a single value by taking a collection of values as input.Figure 7 shows the performance of various Aggregate operations on a single attribute.Time is not varied significantly from one aggregate operation to another as differentaggregate operations need same number of data block access for most of the cases.Figure 7 shows EAV has taken much higher time than OCOM as it has to scan all theblocks to compute each operation. OCOM has taken almost same time to find count,min, max, average on a single attribute. This is because for all operations, it needs toscan the blocks of the attribute for which particular operation is executing. 40 Time(Seconds) 30 20 EAV 10 OCOM 0 Q7 Q8 Q9 Q10 Fig. 7. Time comparison of Aggregate Operations6.5 Time Comparison of Statistical OperationsFigure 8 shows the performance of various statistical operations on a single attribute.Time is varied significantly from one statistical operation to another as different sta-tistical operations need different sorts of processing. This experiment shows EAV hastaken much higher time compared to OCOM. It is because it has no tracking of howdata are stored, so it has to scans all the blocks to compute each operation. For me-dian, mode and standard deviation, OCOM has to scan the blocks of attribute forwhich particular operation is running one, two and two times respectively. 150 Time(Seconds) 100 50 EAV 0 OCOM Q11 Q12 Q13 Fig. 8. Time comparison of Statistical Operations6.6 Time Comparison of CUBE OperationsThe CUBE operation is the n-dimensional generalization of group-by operator. Thecube operator unifies several common and popular concepts: aggregates, group by,roll-ups and drill-downs and, cross tabs. Here No pre-computation is done for
  • 9. 126 R. Paul and A.S.Md. Latiful Hoque 400000 Time(Seconds) 200000 EAV OCOM 0 Q14 Q15 Q16 Fig. 9. Time comparison of CUBE operations Q1: Select Ai from observation; Q2: Select Ai , Aj ,Ak from observation. Q3: Select Ai, Aj, Ak, Al, Am from observation. Q4: Select * from observation where Ai=’XXX’. Q5: Select * from observation where Ai=’XXX’ AND Aj=’YYY’. Q6: Select * from observation where Ai=’XXX’ AND Aj=’YYY’ AND Ak=’ZZZ’. Q7: Select AVG (Ai) from observation. Q8: Select Max (Ai) from observation. Q9: Select Min (Ai) from observation. Q10: Select Count (Ai) from observation. Q11: Select Median (Ai) from observation. Q12: Select Mode (Ai) from observation. Q13: Select Standard Deviation (Ai) from observation. Q14: Select Ai, Aj, Max (Am) from observation CUBE-BY (Ai, Aj ) Q15: Select Ai, Aj, AK, Max (Am) from observation CUBE-BY (Ai, Aj, Ak) Q16: Select Ai, Aj, AK, Am, Max (An) from observation CUBE-BY (Ai, Aj, Ak, Am)aggregates at various levels and on various combinations of attributes. Figure 9 showsthe performance of CUBE Operations on various combinations of attributes. It can beobserved that the number of attributes in cube operations leads to the time taken asCUBE operation computes group-bys corresponding to all possible combinations ofCUBE attributes. The experiment results show that EAV has taken much higher timecompared to OCOM as it is not partitioned attribute wise and it has no entity index, soevery search becomes full scan of all blocks. OCOM does not need to read unusedattributes but does read unused values during the binary search.7 ConclusionEAV is a widely used solution to model data which are sparse, high dimensional andneed frequently schema change, but EAV is not a search efficient data model forknowledge discovery. In this paper, we have proposed a storage and search efficientopen schema data models OCOM to model high dimensional and sparse data as analternative of EAV. We have implemented both EAV and OCOM models in a datawarehousing environment and performed different relational and warehouse querieson both the models. We have achieved a query performance faster in the range of 15to 70 compared to existing EAV model. These efficiencies arise due to binary data
  • 10. Optimized Column-Oriented Model: A Storage and Search Efficient Representation 127representation in OCOM and partition of data attribute wise where as string represen-tation is used in EAV model. The experiment results show our proposed open schemadata model is dramatically efficient in knowledge discovery operation and occupy lessstorage compared to widely used EAV model.References 1. Torben, P.B., Christian, J.S.: Research Issues in Clinical Data Warehousing. In: Proceed- ings of the 10th International Conference on Scientific and Statistical Database Manage- ment, Capri, pp. 43–52 (1998) 2. Stead, W.W., Hammond, E.W., Straube, J.M.: A chartless record—Is it adequate? Journal of Medical Systems 7(2), 103–109 (1983) 3. Li, J., Li, M., Deng, H., Duffy, P., Deng, H.: PhD: a web database application for pheno- type data management. Oxford Bioinformatics 21(16), 3443–3444 (2005) 4. Anhøj, J.: Generic design of Web-based clinical databases. Journal of Medical Internet Re- search 5(4), 27 (2003) 5. Brandt, C., Deshpande, A., Lu, C.: TrialDB: A Web-based Clinical Study Data Manage- ment System. In: Proceedings of the American Medical Informatics Association Annual Symposium, Washington, p. 794 (2003) 6. Nadkarni, P.M., Brandt, C., Frawley, S., et al.: Managing Attribute—Value Clinical Trials Data Using the ACT/DB Client—Server Database System. The Journal of the American Medical Informatics Association 5(2), 139–151 (1998) 7. Chen, P.P.: The entity-relationship model—toward a unified view of data. ACM Transac- tions on Database Systems 1(1), 9–36 (1976) 8. Hoque, A.S.M.L.: Storage and Querying of High Dimensional Sparsely Populated Data in Compressed Representation. In: Shafazand, H., Tjoa, A.M. (eds.) EurAsia-ICT 2002. LNCS, vol. 2510, pp. 418–425. Springer, Heidelberg (2002) 9. Agarwal, S., Agrawal, R., Deshpande, P., et al.: On the Computation of Multidimensional Aggregates. In: Proceedings of the 22th International Conference on Very Large Data Bases, pp. 506–521 (1996)10. Jim, G., Surajit, C., Adam, B.: Data Cube: A Relational Aggregation Operator Generaliz- ing Group-By, Cross-Tab and Sub-Totals. Data Mining and Knowledge Discovery 1(1), 29–53 (1997)11. Copeland, G.P.: A decomposition storage model. In: International Conference on Man- agement of Data, Austin, Texas, United States (1985)12. Stonebraker, M., Abadi, D.J., Batkin, A., et al.: C-Store: A column-oriented DBMS. In: Proceedings of the 31st VLDB Conference, Trondheim, Norway (2005)