Clinical data munging


A file that describes the clinical data munging process.

  • It is convenient to distinguish the following areas: lack or excess of data; outliers, including inconsistencies; strange patterns in (joint) distributions; and unexpected analysis results and other types of inferences and abstractions.
  • During the diagnostic phase, the data munger may have to reconsider prior expectations and/or review quality assurance procedures.
  • data sink for storage, modelling or future use.
  • Graphical exploration of distributions: box plots, histograms, and scatter plots. Plots of repeated measurements on the same individual, e.g., growth curves. Statistical outlier detection.
  • In statistics, hierarchical linear modeling (HLM), also known as multi-level analysis, is a more advanced form of simple linear regressi...
  • (Transform, truncate.) Robust estimation: estimation of statistical parameters using methods that are less sensitive to the effect of outliers than more conventional methods. Accommodate and reduce errors (LSE, trimmed mean, Winsorized mean: replace the extreme values with the closest retained values, then compute the mean).
  • Clinical data munging

    1. 1. Clinical Data Munging
          DO NOT ALTER THE RAW DATA
    2. 2. "We must all accept that science is data and that data are science, and thus provide for, and justify the need for the support of, much improved data curation."
          Brooks Hanson, Andrew Sugden, Bruce Alberts (Science Editorial, February 11th 2011)
    3. 3. Data Munging?
          • Manipulating raw data to achieve a final form
          • Parsing or filtering data, or the many steps required for data recognition
          • Cleaning the raw data using algorithms (e.g. sorting) or parsing the data into predefined data structures
    4. 4. Clinical Data Munging?
          • Following clinical research ethics to manipulate clinical data to achieve an acceptable form
            - Respect of Persons (Autonomy)
            - Data Security and Storage
            - Data Integrity / Data Quality
            - Privacy and Confidentiality
    5. 5. Why Clinical Data Munging?
          • Analysts devote up to 85% of their total time to data cleaning and preparation
          • Health science is driven more by data than by computation
          • Identify missing data
    6. 6. Why Data Munging? Cont.
          • Extreme scores: data values falling outside the expected range
          • Identify erroneous dates
          • Confounders
    7. 7. Phases in Clinical Data Munging
          • Screening Phase:
            - lack or excess of data;
            - inconsistencies;
            - strange patterns in distributions;
            - unexpected analysis results and other types of inferences and abstractions
    8. 8. Phases in Clinical Data Munging
          • Diagnostic Phase: the purpose is to clarify the true nature of the worrisome data points, patterns, and statistics.
            - Documentation should start at this point.
          • Treatment Phase: what to do with problematic observations. The options are limited to correcting, deleting, or leaving unchanged.
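The treatment options on this slide (correct, delete, leave unchanged), together with the rule from the title slide not to alter the raw data, can be sketched as follows. This is only an illustration: the record layout, the `treat` function, and the decision fields (`index`, `action`, `field`, `value`, `reason`) are hypothetical names, and the records are made up.

```python
from copy import deepcopy

def treat(raw_records, decisions):
    """Apply treatment decisions to a copy of the data and return (cleaned, log).

    The raw records are never modified; every decision is written to the log
    so that the documentation starts alongside the treatment.
    """
    cleaned = deepcopy(raw_records)
    log = []
    to_delete = set()
    for d in decisions:
        idx, action = d["index"], d["action"]
        if action == "correct":
            old = cleaned[idx][d["field"]]
            cleaned[idx][d["field"]] = d["value"]
            log.append((idx, "correct", d["field"], old, d["value"], d["reason"]))
        elif action == "delete":
            to_delete.add(idx)
            log.append((idx, "delete", None, None, None, d["reason"]))
        elif action == "leave":
            log.append((idx, "leave", None, None, None, d["reason"]))
        else:
            raise ValueError(f"unknown action: {action}")
    # Deletions are applied last so that indices stay valid while looping.
    cleaned = [r for i, r in enumerate(cleaned) if i not in to_delete]
    return cleaned, log

# Made-up records for illustration only.
raw = [
    {"id": "P1", "age": 34},
    {"id": "P2", "age": 340},  # assumed keying error
    {"id": "P3", "age": -1},   # impossible value
]
decisions = [
    {"index": 1, "action": "correct", "field": "age", "value": 34,
     "reason": "checked against source form"},
    {"index": 2, "action": "delete", "reason": "impossible value, source unavailable"},
    {"index": 0, "action": "leave", "reason": "plausible value"},
]
cleaned, log = treat(raw, decisions)
```

Working on a deep copy keeps `raw` available for re-checking any decision later, and the log records the old value, the new value, and the reason for each change.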
    9. 9. Phases in Clinical Data Munging
          [Diagram: Data Warehouse; Core 1; Core 2; Core 3]
    10. 10. Data Screening?
            • Understand the clinical data and the different clinical data variables
            • Categorise the data into groups/cores
            • Determine the unique identifier
            • Check data normality using frequency distributions, skewness and kurtosis, summary statistics and cross-tabulations
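A minimal sketch of the screening checks named on this slide, using only the Python standard library. The `screen` function and the sample values are hypothetical. Skewness and kurtosis are computed here from population moments, which is one of several common definitions, so statistical packages may report slightly different numbers.

```python
import statistics
from collections import Counter

def screen(values):
    """Summary statistics plus moment-based skewness and excess kurtosis.

    Assumes at least two distinct values (otherwise the variance is zero).
    """
    n = len(values)
    mean = statistics.fmean(values)
    # Population central moments of order 2, 3 and 4.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "skewness": m3 / m2 ** 1.5,            # 0 for a symmetric sample
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,  # 0 for a normal distribution
    }

# Made-up ages; 199 is an implausible value that screening should surface.
ages = [23, 25, 31, 35, 38, 41, 44, 52, 199]
summary = screen(ages)

# Frequency distribution and cross-tabulation of two categorical variables.
sex = ["F", "M", "F", "F", "M"]
outcome = ["alive", "alive", "dead", "alive", "dead"]
frequencies = Counter(sex)
crosstab = Counter(zip(sex, outcome))
```

A strongly positive skewness together with a maximum far from the median, as in `summary` here, is the kind of pattern that would send a variable on to the diagnostic phase.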
    11. 11. Data visualization
    12. 12. Missing Values
            • Occur when respondents refuse to answer, tools malfunction, or subjects withdraw from studies
            • Missing values are categorized as:
              - MAR (missing at random), MCAR (missing completely at random) or MNAR (missing not at random)
            • Most modern statistical packages require complete data
    13. 13. Dealing with Missing Values
            • Use analyses that can deal with incomplete data (hierarchical linear modelling, survival analysis)
            • Adjust the denominator (e.g. remove the unmarried from the married)
            • Delete cases with missing data: this leads to misestimation of the population and lowers the power
            • Mean substitution: reduces the variance
            • Imputation via multiple regression
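The note that mean substitution reduces the variance can be checked with a small sketch. The weights below are made up, `None` is assumed to mark a missing value, and the function names are hypothetical.

```python
import statistics

def complete_cases(values):
    """Drop missing values (deletion of incomplete cases)."""
    return [v for v in values if v is not None]

def mean_substitute(values):
    """Replace each missing value with the mean of the observed values."""
    fill = statistics.fmean(complete_cases(values))
    return [fill if v is None else v for v in values]

# Made-up body weights in kg; None marks a missing measurement.
weights_kg = [60.0, 72.0, None, 81.0, None, 67.0]

observed = complete_cases(weights_kg)
imputed = mean_substitute(weights_kg)

var_observed = statistics.variance(observed)  # sample variance, 4 values
var_imputed = statistics.variance(imputed)    # sample variance, 6 values
```

Both lists have the same mean (70.0), but the sample variance falls from 78.0 to 46.8: the imputed points sit exactly on the mean, so they add nothing to the sum of squared deviations while increasing the sample size.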
    14. 14. Erroneous dates
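One possible way to screen for erroneous dates is sketched below. The slide does not say which checks were used, so the rules here (invalid calendar dates, dates in the future, admission before birth, discharge before admission) are assumptions, as are the field names `birth`, `admitted` and `discharged` and the ISO `YYYY-MM-DD` format.

```python
from datetime import date, datetime

def parse_date(text, fmt="%Y-%m-%d"):
    """Return a date, or None when the text is missing or not a valid calendar date."""
    try:
        return datetime.strptime(text, fmt).date()
    except (ValueError, TypeError):
        return None

def date_problems(record, today):
    """List the date problems found in one record (empty list if none)."""
    problems = []
    parsed = {name: parse_date(record.get(name))
              for name in ("birth", "admitted", "discharged")}
    for name, value in parsed.items():
        if value is None:
            problems.append(f"{name}: missing or not a valid date")
        elif value > today:
            problems.append(f"{name}: in the future")
    birth, admitted, discharged = (parsed[k] for k in ("birth", "admitted", "discharged"))
    # Order checks only make sense when both dates could be parsed.
    if birth and admitted and admitted < birth:
        problems.append("admitted before birth")
    if admitted and discharged and discharged < admitted:
        problems.append("discharged before admitted")
    return problems

# Made-up records; the reference date is fixed so the example is reproducible.
today = date(2011, 6, 1)
bad_calendar = {"birth": "1980-05-17", "admitted": "2011-02-30", "discharged": "2011-03-02"}
bad_order = {"birth": "1980-05-17", "admitted": "2011-03-01", "discharged": "2011-02-10"}
problems_calendar = date_problems(bad_calendar, today)
problems_order = date_problems(bad_order, today)
```

The function only reports problems; deciding whether to correct, delete or leave each flagged date belongs to the treatment phase.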
    15. 15. Extreme Scores (Fringelier, Outlier)
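The slide gives no detection rule, so the sketch below assumes the box plot convention: a value is flagged when it lies more than 1.5 interquartile ranges outside the quartiles. The function names, the multiplier `k` and the blood pressure readings are illustrative only.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return the (low, high) fences of the box plot rule."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_extremes(values, k=1.5):
    """Return the values that fall outside the fences, in their original order."""
    low, high = iqr_fences(values, k)
    return [v for v in values if v < low or v > high]

# Made-up systolic blood pressure readings (mmHg).
systolic = [110, 115, 118, 120, 122, 125, 128, 130, 240]
fences = iqr_fences(systolic)
extremes = flag_extremes(systolic)
```

Flagging is a screening step only. Whether 240 is a recording error or a genuine reading is a question for the diagnostic phase.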
    16. 16. Other Data Errors
            • Duplications: take the first admission using time
            • Biologically impossible results
              - Robust estimation: estimation of statistical parameters using methods that are less sensitive to the effect of outliers than more conventional methods
            • Questionable values
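Two items from this slide can be sketched briefly: keeping the first admission for each patient, and robust estimates of the mean (trimmed and Winsorized). The field names `patient_id` and `admitted`, the function names and the data are hypothetical. Admission times are assumed to be ISO-formatted strings, which sort chronologically when they share one format.

```python
import statistics

def first_admissions(records):
    """Keep, for each patient id, the record with the earliest admission time."""
    earliest = {}
    for rec in records:
        pid = rec["patient_id"]
        if pid not in earliest or rec["admitted"] < earliest[pid]["admitted"]:
            earliest[pid] = rec
    return list(earliest.values())

def trimmed_mean(values, cut):
    """Mean after dropping the `cut` smallest and `cut` largest values.

    Assumes 2 * cut < len(values).
    """
    ordered = sorted(values)
    return statistics.fmean(ordered[cut:len(ordered) - cut])

def winsorized_mean(values, cut):
    """Mean after replacing the `cut` most extreme values on each side
    with the nearest retained value. Assumes 2 * cut < len(values)."""
    ordered = sorted(values)
    if cut > 0:
        low, high = ordered[cut], ordered[-cut - 1]
        ordered = [min(max(v, low), high) for v in ordered]
    return statistics.fmean(ordered)

# Made-up admissions: patient P1 appears twice.
records = [
    {"patient_id": "P1", "admitted": "2011-03-04T10:15"},
    {"patient_id": "P2", "admitted": "2011-01-20T08:00"},
    {"patient_id": "P1", "admitted": "2011-02-01T09:30"},
]
deduplicated = first_admissions(records)

# Made-up potassium results (mmol/L); 48.0 is biologically impossible.
potassium = [4.1, 4.4, 4.6, 4.8, 5.0, 5.2, 48.0]
plain = statistics.fmean(potassium)
trimmed = trimmed_mean(potassium, cut=1)
winsorized = winsorized_mean(potassium, cut=1)
```

With one value cut from each end, both robust estimates come out at about 4.8 mmol/L, while the plain mean is pulled well above every plausible value by the single impossible result.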
    17. 17. Given the rapid growth of the internet, such techniques will become increasingly important in the organization of the growing amounts of data available.
            Large Synoptic Survey Telescope: 40 TB of data per day calls for a different approach… 100+ PB of data in 10 years
    18. 18. Tools for a Clinical Data Munger (PnC = point-and-click)
            Features           Stata                            R             SPSS          SAS
            Learning Curve     Steep/Gradual                    Pretty Steep  Gradual/Flat  Pretty Steep
            User Interface     Code/PnC                         Code          Mostly PnC    Very Strong
            Data Manipulation  Very Strong                      Very Strong   Moderate      Very Strong
            Data Analysis      Versatile                        Versatile     Powerful      Powerful/Versatile
            Graphics           Very good                        Excellent     Very good     Good
            Cost               Renewal on upgrade, affordable   Open Source   Expensive     Expensive (yearly renewal)
    19. 19. Other Important Tools
            • Python: getting real-time data from social networks
            • NVivo: for qualitative data
            • Perl
    20. 20. Asante (thank you)
            Questions?