It is convenient to distinguish the following areas: lack or excess of data; outliers, including inconsistencies; strange patterns in (joint) distributions; and unexpected analysis results and other types of inferences and abstractions.
During the diagnostic phase, the data munger may have to reconsider prior expectations and/or review quality assurance procedures.
data sink for storage, modelling or future use.
Graphical exploration of distributions: box plots, histograms, and scatter plots. Plots of repeated measurements on the same individual, e.g., growth curves. Statistical outlier detection.
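One common form of statistical outlier detection is the rule that a box plot uses to draw its whiskers (Tukey's fences). The following is a minimal sketch of that rule, not a method prescribed by the slides; the function name `tukey_outliers`, the multiplier `k=1.5` and the blood pressure readings are all assumptions made for illustration.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical systolic blood pressure readings (mmHg), invented for illustration.
readings = [118, 122, 125, 130, 121, 119, 127, 300, 124, 120]
flagged = tukey_outliers(readings)
print(flagged)
```

A flagged value is only a candidate for review; whether it is an entry error or a genuine extreme is a question for the diagnostic phase.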
In statistics, hierarchical linear modeling (HLM), also known as multi-level analysis, is a more advanced form of simple linear regressi...
- (transform, truncate)
- Robust estimation: Estimation of statistical parameters using methods that are less sensitive to the effect of outliers than more conventional methods. These accommodate outliers and reduce errors (LSE, trimmed mean, Winsorized mean: the mean computed after replacing each extreme value with the closest retained value).
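The difference between a trimmed mean and a Winsorized mean can be sketched in a few lines. This is an illustrative sketch only: the function names, the choice of `k` (how many values are treated as extreme at each end) and the sample values are assumptions, not taken from the slides.

```python
import statistics


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    s = sorted(values)
    if 2 * k >= len(s):
        raise ValueError("k removes every observation")
    return statistics.mean(s[k:len(s) - k])


def winsorized_mean(values, k):
    """Mean after replacing the k smallest and k largest values
    with the closest value that is kept."""
    s = sorted(values)
    if 2 * k >= len(s):
        raise ValueError("k replaces every observation")
    low, high = s[k], s[len(s) - k - 1]
    return statistics.mean([min(max(v, low), high) for v in s])


# Hypothetical lab values with one extreme score (43), invented for illustration.
values = [3, 4, 5, 5, 6, 6, 8, 43]
plain = statistics.mean(values)
trimmed = trimmed_mean(values, k=1)
winsorized = winsorized_mean(values, k=1)
print(plain, trimmed, winsorized)
```

Both estimators pull the result back towards the bulk of the data; trimming discards the extremes, while Winsorizing keeps the sample size and only caps them.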
Clinical data munging
Clinical Data Munging
DO NOT ALTER THE RAW DATA
We must all accept that science is data and that data are science, and thus provide for, and justify the need for the support of, much improved data curation.
Brooks Hanson, Andrew Sugden, Bruce Alberts (Science Editorial, February 11th 2011)
Data Munging?
• Manipulating raw data to achieve a final form
• Parsing or filtering data, or the many steps required for data recognition.
• Cleaning the raw data using algorithms (e.g. sorting) or parsing the data into predefined data structures.
Clinical Data Munging?
• Following clinical research ethics to manipulate clinical data to achieve an acceptable form
  – Respect of Persons (Autonomy)
  – Data Security and Storage
  – Data Integrity / Data Quality
  – Privacy and Confidentiality
Why Clinical data munging?
• Analysts devote up to 85% of total time to data cleaning and preparation.
• Health science is driven more by data than by computation
• Identify missing data
Why data munging? Cont.
• Extreme scores: data values falling outside the expected range
• Identify erroneous dates
• Confounders
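Checks for extreme scores and erroneous dates can be written as simple screening rules that flag a record without changing it. The sketch below assumes hypothetical field names (`age_years`, `temp_c`, `admitted`, `discharged`) and hypothetical plausibility limits; real limits would have to come from the study protocol.

```python
from datetime import date

# Hypothetical plausibility limits, for illustration only.
EXPECTED_RANGES = {"age_years": (0, 120), "temp_c": (30.0, 43.0)}


def screen_record(record, today):
    """Return a list of flags for one record; the record is not modified."""
    flags = []
    for field, (low, high) in EXPECTED_RANGES.items():
        value = record.get(field)
        if value is None:
            flags.append(f"{field}: missing")
        elif not low <= value <= high:
            flags.append(f"{field}: {value} outside expected range [{low}, {high}]")
    admitted = record.get("admitted")
    discharged = record.get("discharged")
    if admitted and admitted > today:
        flags.append("admitted: date in the future")
    if admitted and discharged and discharged < admitted:
        flags.append("discharged: earlier than admission")
    return flags


# Invented example record with three problems.
record = {
    "age_years": 212,
    "temp_c": None,
    "admitted": date(2011, 3, 5),
    "discharged": date(2011, 3, 1),
}
flags = screen_record(record, today=date(2011, 6, 1))
for flag in flags:
    print(flag)
```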
Phases in clinical Data Munging
• Screening Phase:
  – lack or excess of data;
  – inconsistencies;
  – strange patterns in distributions;
  – unexpected analysis results and other types of inferences and abstractions
Phases in clinical Data Munging
• Diagnostic Phase: The purpose is to clarify the true nature of the worrisome data points, patterns, and statistics.
  – Documentation should start at this point.
• Treatment Phase: What to do with problematic observations. The options are limited to correcting, deleting, or leaving unchanged.
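The three treatment options, combined with the rule not to alter the raw data, can be sketched as a function that works on a copy and documents every decision. The record layout, the action names (`correct`, `delete`, `leave`) and the log format are assumptions made for this sketch.

```python
import copy


def apply_treatments(raw_records, decisions):
    """Apply documented decisions to a copy of the data.

    Returns (cleaned_records, audit_log); raw_records is left untouched.
    """
    cleaned = copy.deepcopy(raw_records)
    audit_log = []
    for d in decisions:
        record = cleaned[d["index"]]
        old_value = record.get(d["field"])
        if d["action"] == "correct":
            record[d["field"]] = d["new_value"]
        elif d["action"] == "delete":
            record[d["field"]] = None  # value becomes missing
        elif d["action"] != "leave":
            raise ValueError(f"unknown action: {d['action']}")
        audit_log.append({
            "index": d["index"],
            "field": d["field"],
            "action": d["action"],
            "old": old_value,
            "new": record.get(d["field"]),
            "reason": d["reason"],
        })
    return cleaned, audit_log


# Invented example: a weight of 700 kg is corrected after checking the source.
raw = [{"id": 1, "weight_kg": 700}, {"id": 2, "weight_kg": 64}]
decisions = [{
    "index": 0, "field": "weight_kg", "action": "correct",
    "new_value": 70, "reason": "extra zero, confirmed against source document",
}]
cleaned, audit_log = apply_treatments(raw, decisions)
print(cleaned[0], audit_log[0]["old"])
```

Keeping the log next to the cleaned data means every change can be traced back to the raw value and the reason for changing it.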
Phases in clinical Data Munging
[Diagram: Data Warehouse; Core 1; Core 2; Core 3]
Data screening?
• Understand the clinical data and the different clinical data variables
• Categorise the data into groups/cores
• Determine the unique identifier
• Check data normality using frequency distributions, skewness and kurtosis, summary statistics and cross-tabulations
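Skewness and kurtosis can be computed directly from the central moments of a variable. The sketch below uses the simple moment-based (population) formulas; statistical packages often report sample-adjusted versions, so their numbers may differ slightly. The function name and the sample data are assumptions.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (both near 0 for normal data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("all values are identical; shape is undefined")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3


# Invented samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
sym_skew, sym_kurt = skewness_and_excess_kurtosis(symmetric)
tail_skew, tail_kurt = skewness_and_excess_kurtosis(right_skewed)
print(sym_skew, sym_kurt, tail_skew, tail_kurt)
```

A skewness far from 0 suggests an asymmetric distribution, and a large positive excess kurtosis suggests heavy tails; both are prompts to look at the frequency distribution, not formal tests.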
Missing values
• Occur if respondents refuse to answer, tools malfunction, or subjects withdraw from studies
• Missing values are categorized as
  – MAR (missing at random), MCAR (missing completely at random) or MNAR (missing not at random)
• Most modern stat packages require complete data
Dealing with Missing Values
• Use analysis that can deal with incomplete data (Hierarchical Linear Modelling, survival analysis)
• Adjusting the denominator: e.g. remove the unmarried from the married
• Delete cases with missing data: leads to misestimation of the population and thus lowers the power
• Mean substitution: reduces the variance
• Imputation via multiple regression
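The claim that mean substitution reduces the variance can be checked on a small example. This is an illustrative sketch; the haemoglobin values are invented and `None` is assumed to mark a missing value.

```python
import statistics


def mean_substitute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


# Invented haemoglobin values (g/dL) with two missing entries.
hb = [11, 13, None, 15, None, 9, 12]
observed = [v for v in hb if v is not None]
filled = mean_substitute(hb)

var_observed = statistics.variance(observed)
var_filled = statistics.variance(filled)
print(var_observed, var_filled)
```

The mean is unchanged, but the filled-in values add no spread while increasing the sample size, so the sample variance shrinks and standard errors look smaller than they should.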
Other Data Errors
• Duplications: take the first admission using time
• Biologically impossible results
  – Robust estimation: Estimation of statistical parameters, using methods that are less sensitive to the effect of outliers than more conventional methods
• Questionable values
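Taking the first admission by time can be sketched as keeping, for each unique identifier, the record with the earliest timestamp. The field names `patient_id` and `admitted_at` and the sample rows are assumptions made for illustration.

```python
from datetime import datetime


def first_admissions(rows):
    """Keep the earliest admission per patient_id; the input list is not modified."""
    first = {}
    for row in rows:
        pid = row["patient_id"]
        if pid not in first or row["admitted_at"] < first[pid]["admitted_at"]:
            first[pid] = row
    return sorted(first.values(), key=lambda r: r["admitted_at"])


# Invented admissions; patient "A01" appears twice.
rows = [
    {"patient_id": "A01", "admitted_at": datetime(2011, 2, 14, 9, 30)},
    {"patient_id": "B02", "admitted_at": datetime(2011, 2, 10, 16, 0)},
    {"patient_id": "A01", "admitted_at": datetime(2011, 1, 3, 8, 15)},
]
unique_rows = first_admissions(rows)
for row in unique_rows:
    print(row["patient_id"], row["admitted_at"])
```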
Given the rapid growth of the internet such techniques will become increasingly important in the organization of the growing amounts of data available.
Large Synoptic Survey Telescope: 40 TB of data per day calls for a different way of approach… 100+ PB of data in 10 yrs
Tools for a Clinical Data Munger
Features            Stata                            R              SPSS          SAS
Learning Curve      Steep/Gradual                    Pretty Steep   Gradual/Flat  Pretty Steep
User Interface      Code/PnC                         Code           Mostly PnC    Very Strong
Data Manipulation   Very Strong                      Very Strong    Moderate      Very Strong
Data Analysis       Versatile                        Versatile      Powerful      Powerful/Versatile
Graphics            Very good                        Excellent      Very good     Good
Cost                Renewal on upgrade - affordable  Open Source    Expensive     Expensive (yearly renewal)
(PnC = point-and-click)
Other Important Tools
• Python - getting real-time data from social networks
• NVivo - for qualitative data
• Perl