Big data in the research life cycle: technologies, infrastructures, policies

Presented by Yannis Ioannidis (University of Athens, Athena RIC) during the 2nd BDE SC5 workshop, 11 October 2016, in Brussels, Belgium


  1. Prof. Yannis Ioannidis, “Athena” Research Center & University of Athens
  2. Diagram labels: BioMed; Oceans; Space & Earth; Culture; Environment; OA Policies; Data Proc; OpenMinTeD
  3. Diagram labels: EXAREME; MaDIS; GRAPHOS; PAROS; CHESS; Optique; AITION/TopMod; KDD/ML; MDP; OpenAIRE; MaDgIK Systems; DCV ML; ResAnal; HBP; Capsella; W-Dance; O-MinTeD; STE; G-kak^3; BB; EarthSrvr; V-Exhibit; EFG1914; Fut-TDM; OpenUP; WDAqua; RDA; StR-ESFRI
  4. Layered architecture diagram (WHAT, WHERE, HOW, WHY)
     • Data provision Layer: Extract, Transform, Load (ETL), Anonymization & pre-processing of existing resources
       ◦ Imaging, Video; Streaming Data; Un/Semi/Structured Biomedical Data; Legacy Data; Simulation Models; Digital Libraries (PubMed etc); Ontologies (UMLS, GO..); Clinician knowledge
     • Federated data Layer & (open) research data infrastructures: Semantic Data modelling, Provenance & Integration
       ◦ Multi-modal, vertical integrated, distributed bio medical data; Biomedical Info Registries & Metadata; Simulation Models; KDD Results
       ◦ Data Infrastructures: ESFRI Infrastructures (ICOS, EMSO, …); E-Infrastructures (OpenAIRE)
     • Middleware Layer: Distributed execution of complex dataflows & distributed querying Engine
       ◦ Upper level declarative language and extensible UDFs; Distributed execution on clouds and ad-hoc clusters; Distributed Query Engine; Ontology Based Data Access
       ◦ Data Processing: Distribution, Federation, Parallelism; EXAREME
     • Application Layer: Data (pre) processing and knowledge discovery platform
       ◦ MADRefine module: Data Preprocessing & Transformation; Curation & Validation
       ◦ AITION clustering & general KDD: SoA Machine Learning Algorithms; Latent Variable & Topic Modelling
       ◦ AITION simulation: Graphical Probabilistic modelling for Statistical simulation
       ◦ Data Analytics: Cleaning & curation; MADRefine; Modeling, Mining; AITION
  5. OpenAIRE HUB (diagram labels)
     • CERN; zenodo
     • Services: Visualize - Manage; Enhanced Publications; Get support (NOADs); Linked Content; Statistics; +++
     • Search & Browse; Curate & collaborate; Deposit Publications & data; Research impact (Citations, usage statistics); +++
     • Processing: Link; Classify; De-duplicate; Cite; Text Mine; APIs
     • Publication repositories (Institutional & Thematic); Open Access Journals: 17,500,000 OA publications; 700+ validated repositories accessing >5K repos/OA journals
     • Data repositories; Data Journals; ResearchID (ORCID, ..); OpenDOAR …; CRIS Systems; National funding; EC funding
     • Usage data; Metadata on publications; Metadata on data
     • Guidelines for Data Providers & Open Data Pilot; Guidelines for Funding Info; Guidelines for Publications
     • OpenAIRE
  6. • ICOS
     • LIFEWATCH
     • EMSO
     • SIOS
     • EURO-ARGO
     • IAGOS
     • EPOS
     • EISCAT
     • COPAL
     • ACTRIS
     • DANUBIUS_RI
  7. • ICOS: Integrated Carbon Observation System
       ◦ Harmonized and High Precision Scientific Data on Carbon Cycle And Greenhouse Gas Budget and Perturbations
     • EMSO: European Multi-disciplinary Seafloor and water-column Observatory
       ◦ Ocean observation systems for long-term, high-resolution, (near) real-time monitoring of environmental processes including natural hazards, climate change, and marine ecosystems
  8. • SIOS: Svalbard Integrated Earth Observing System
       ◦ Arctic environmental and climate-related challenges
     • EURO-ARGO: European contribution to ARGO
       ◦ Ocean observation for oceanography and climate
     • IAGOS: In-service Aircraft for a Global Observing System
       ◦ Atmospheric composition, aerosol and cloud particles
  9. • EISCAT_3D: European Incoherent Scatter
       ◦ Radar systems for the upper atmosphere, the ionosphere and the Aurora Borealis
     • EUFAR-COPAL: European Facility for Airborne Research
       ◦ Airborne research for the environmental and geo sciences in Europe
  10. • ACTRIS: Aerosols, Clouds and Trace gases RI
        ◦ Models and forecast systems by offering high quality data for atmospheric gases, clouds, and trace gases
      • DANUBIUS-RI: Int’l Center for Advanced Studies on River-Sea Systems
        ◦ Addressing conflicts between society’s demands, environmental change and environmental protection in river-sea systems worldwide
  11. Layered architecture diagram
      • Data provision Layer: Extract, Transform, Load (ETL), Anonymization & pre-processing of existing resources
        ◦ Imaging, Video; Streaming Data; Un/Semi/Structured Biomedical Data; Legacy Data; Simulation Models; Digital Libraries (PubMed etc); Ontologies (UMLS, GO..); Clinician knowledge
      • Federated data Layer & (open) research data infrastructures: Semantic Data modelling, Provenance & Integration Layer
        ◦ Multi-modal, vertical integrated, distributed bio medical data; Biomedical Info Registries & Metadata; Simulation Models; KDD Results
        ◦ Data Infrastructures: ESFRI Infrastructures (ELIXIR); E-Infrastructures (OpenAIRE)
      • Middleware Layer: Distributed execution of complex dataflows & distributed querying Engine
        ◦ Upper level declarative language and extensible UDFs; Distributed execution on clouds and ad-hoc clusters; Distributed Query Engine; Ontology Based Data Access
        ◦ Data Processing: Distribution, Federation, Parallelism; EXAREME
      • Application Layer: Data (pre) processing and knowledge discovery platform
        ◦ MADRefine module: Data Preprocessing & Transformation; Curation & Validation
        ◦ AITION clustering & general KDD: SoA Machine Learning Algorithms; Latent Variable & Topic Modelling
        ◦ AITION simulation: Graphical Probabilistic modelling for Statistical simulation
        ◦ Data Analytics: Cleaning & curation; MADRefine; Modeling, Mining; AITION
  12. Architecture diagram labels: Gateway; Master; Worker (×4); Execution Engine (×2); Optimization Engine (×2); Fast Local Net; Data Connector (×2); P2P Net
  13. • Parallel / distributed execution of complex data flows targeting data analysis and mining
      • Data remain at source (hospital): the dataflow / query travels
      • Privacy preserving: transmit only aggregated information from hospital (sufficient statistics)
      • Advanced data compression, on the data partitioning
      • Query Language: SQL + UDFs (in Python)
  14. Query Federation: decompose query into local and global parts
      • Data: tables with columns (id, m-name, m-value) held at sites 1 … N
      • Requested statistics: “count, avg, std” (result columns: m-name, N, avg, std)
      • Local part L: “Σx, Σx², N”. Run local queries at each site; each returns partial aggregated results (m-name, Σx, Σx², N)
      • Global part G: “N, avg, std”. Run global queries over the partial aggregated results to obtain N, avg, std
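The local/global split on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not EXAREME code: the function names and the sample values are hypothetical, and the standard deviation is assumed to be the population form (the slide does not say which form is used).

```python
import math

def local_aggregate(values):
    """Local part L: sufficient statistics (sum x, sum x^2, N) for one site."""
    return (sum(values), sum(v * v for v in values), len(values))

def global_aggregate(partials):
    """Global part G: combine per-site partials into (N, avg, std)."""
    sum_x = sum(p[0] for p in partials)
    sum_x2 = sum(p[1] for p in partials)
    n = sum(p[2] for p in partials)
    avg = sum_x / n
    # Population variance from the sufficient statistics (assumed form);
    # clamp at zero to absorb tiny negative rounding errors.
    variance = max(sum_x2 / n - avg * avg, 0.0)
    return n, avg, math.sqrt(variance)

# Hypothetical measurements held at two sites; raw values never leave a site.
site_1 = [4.0, 5.0, 6.0]
site_2 = [7.0, 8.0]
partials = [local_aggregate(site_1), local_aggregate(site_2)]
n, avg, std = global_aggregate(partials)
```

Only three numbers per site cross the network, which is what makes the "transmit only aggregated information" claim workable for these statistics.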
  15. • Distributed elastic execution
        ◦ Parallel aggregations, unions, and joins
        ◦ Resources are reserved dynamically
      • Iterative dataflow execution
        ◦ Support machine learning algorithms
      • Novel query optimization techniques
        ◦ SQL with User Defined Functions
        ◦ Arbitrary user code with unknown properties
        ◦ Privacy-aware query optimization
  16. • Time and money
      • 2-dimensional optimization
        ◦ Quantum: 1 hour
      • Simple map-reduce flow
        ◦ A: 1 hour; B: 10 minutes; C: 1 hour

      Schedule               Time (hours)  Money (resource hours)  Winner
      One host for all ops   18.60         19                      5x cheaper
      Different host per op  2.16          102                     9x faster
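The time/money trade-off under a one-hour billing quantum can be illustrated with a small sketch. The workload below (twelve independent 10-minute operators) is hypothetical and does not reproduce the figures in the table above, since the slide does not give the operator counts; the function names are likewise made up for illustration.

```python
import math

QUANTUM_MINUTES = 60  # billing quantum from the slide: 1 hour

def one_host(op_minutes):
    """Run all operators back to back on a single leased host."""
    total = sum(op_minutes)
    time_hours = total / 60
    money = math.ceil(total / QUANTUM_MINUTES)  # whole quanta billed
    return time_hours, money

def host_per_op(op_minutes):
    """Run every operator in parallel on its own host.

    Assumes the operators are independent, so elapsed time is the longest one,
    while each host is billed for at least one full quantum.
    """
    time_hours = max(op_minutes) / 60
    money = sum(math.ceil(m / QUANTUM_MINUTES) for m in op_minutes)
    return time_hours, money

ops = [10] * 12  # hypothetical workload: twelve independent 10-minute operators
serial = one_host(ops)       # slower, cheaper
parallel = host_per_op(ops)  # faster, pays for mostly idle quanta
```

In this toy case the serial schedule takes 2 hours for 2 resource hours, while the parallel one finishes in 10 minutes but is billed 12 resource hours, the same kind of tension the table shows.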
  17. • Optimal dataflow scheduling
      • Skyline of all Pareto optimal plans (plot axes: Time, Money)
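A skyline of Pareto-optimal plans keeps every plan that no other plan beats on both time and money. The sketch below is a plain quadratic-time filter, not the optimizer described in the talk; the first two plans reuse the figures from the scheduling table on the previous slide, and the other two are hypothetical.

```python
def pareto_skyline(plans):
    """Return plans not dominated in (time, money); lower is better on both."""
    skyline = []
    for name, t, m in plans:
        dominated = any(
            (t2 <= t and m2 <= m) and (t2 < t or m2 < m)
            for _, t2, m2 in plans
        )
        if not dominated:
            skyline.append((name, t, m))
    return sorted(skyline, key=lambda p: p[1])  # order by time

plans = [
    ("one host for all ops", 18.60, 19),   # from the scheduling table
    ("different host per op", 2.16, 102),  # from the scheduling table
    ("hypothetical plan X", 10.0, 60),     # assumed, for illustration
    ("hypothetical plan Y", 20.0, 25),     # assumed; dominated by the one-host plan
]
skyline = pareto_skyline(plans)
```

Plan Y drops out because the one-host plan is both faster and cheaper; the remaining three plans are incomparable, so the choice among them depends on how the user weighs time against money.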
  18. Layered architecture diagram
      • Data provision Layer: Extract, Transform, Load (ETL), Anonymization & pre-processing of existing resources
        ◦ Imaging, Video; Streaming Data; Un/Semi/Structured Biomedical Data; Legacy Data; Simulation Models; Digital Libraries (PubMed etc); Ontologies (UMLS, GO..); Clinician knowledge
      • Federated data Layer & (open) research data infrastructures: Semantic Data modelling, Provenance & Integration Layer
        ◦ Multi-modal, vertical integrated, distributed bio medical data; Biomedical Info Registries & Metadata; Simulation Models; KDD Results
        ◦ Data Infrastructures: ESFRI Infrastructures (ELIXIR); E-Infrastructures (OpenAIRE)
      • EXAREME Middleware Layer: Distributed execution of complex dataflows & distributed querying Engine
        ◦ Upper level declarative language and extensible UDFs; Distributed execution on clouds and ad-hoc clusters; Distributed Query Engine; Ontology Based Data Access
        ◦ Data Processing: Distribution, Federation, Parallelism; EXAREME
      • Application Layer: Data (pre) processing and knowledge discovery platform
        ◦ MADRefine module: Data Preprocessing & Transformation; Curation & Validation
        ◦ AITION clustering & general KDD: SoA Machine Learning Algorithms; Latent Variable & Topic Modelling
        ◦ AITION simulation: Graphical Probabilistic modelling for Statistical simulation
        ◦ Data Analytics: Cleaning & curation; MADRefine; Modeling, Mining; AITION
  19. Workflow diagram labels: Data Mining; Disease signatures; Patient grouping & similarity; Raw data from biomarker based personalized acquisition; Personalized Model Guided Medicine; For a particular patient; Unknown / missing data; Predict value of missing variable; Variable dependencies & causality; Simulation Models; Create Statistical Simulation Models; Individualized diagnosis, prognosis & treatment plan; Model & Verification; Knowledge Discovery; Reasoning & decision support; Data Preprocessing; Curation & Validation; Transformed & Validated Data; Domain knowledge & assumptions; Clinical workflows; BOTTOM-UP; TOP-DOWN
      • Big Data Analytics: Capture; multi source; multi modal; multi system
      • Management: Data provenance; Sanitization (Anonymization); Process; aggregate; distributed
      • Analysis: Privacy preserving; Algorithms; Mechanisms
      • Modeling: Personalized; De-identified
      • Practice: Ethics; Privacy
  20. Variables (diagram labels): SEX; AgeOnSet; ILAR; JntActDis; GlbActDis; DisDur; JntLOM; GenEval; CHAQ; ESR; CRP; ANA; MEFNIL2RAPoznanski; NSAID; STEROID; DMARD; BIOLOGIC; JADI; JntLOMDiff; CHAQDiff; ESRDiff; CRPDiff; JntActDisDiff; GlbActDisDiff; GenEvalDiff; BOXValidatedOut; Adapted Sharp/van der Heijde Score Out; JADIOut; Extended BOX
      Group labels: Predictors; Medication; Outcome; demographics; imaging; genetics; clinical; lab; Synovial volume; OTHER
  21. Workflow diagram labels: Disease signatures; Patient grouping & similarity; Variable dependencies & causality; Simulation Models; Individualized diagnosis, prognosis & treatment plan; Data Mining; Personalized Model Guided Medicine; For a particular patient; Unknown / missing data; Predict value of missing variable; Create Statistical Simulation Models; Model & Verification; Knowledge Discovery; Reasoning & decision support; Domain knowledge & assumptions; Clinical workflows; Raw data from biomarker based personalized acquisition; Data Preprocessing; Curation & Validation; Transformed & Validated Data
  22. • Extensible validation and data transformation engine
      • Interactive and efficient web-based interface
      • Data cleaning:
        ◦ Typographical error detection (numeric & alphanumeric)
        ◦ Data cleaning rules (functional dependencies, conditional functional dependencies, denial constraints)
        ◦ New/derived columns (discretization, computation of medical scores)
        ◦ Data visualisation (barcharts, piecharts, scatterplots, linecharts, etc.)
      • End-to-end data analysis workflow support (rerun experiments, reproduce results)
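As an illustration of one kind of cleaning rule named on this slide, the sketch below checks a functional dependency (a set of columns that should determine another set of columns) and reports the groups that violate it. It is a generic sketch, not the engine's actual rule syntax; the column names and records are hypothetical.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return the lhs values that map to more than one rhs value (violations of lhs -> rhs)."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in lhs)
        seen[key].add(tuple(row[c] for c in rhs))
    return {key: vals for key, vals in seen.items() if len(vals) > 1}

# Hypothetical records: patient_id should determine sex.
rows = [
    {"patient_id": 1, "sex": "F", "age": 9},
    {"patient_id": 1, "sex": "M", "age": 9},   # conflicts with the row above
    {"patient_id": 2, "sex": "M", "age": 11},
]
violations = fd_violations(rows, lhs=["patient_id"], rhs=["sex"])
```

Here patient 1 is flagged because the same identifier appears with two different values for sex; a curator would then decide which record to correct.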
  23. Workflow diagram labels: Variable dependencies & causality; Simulation Models; Individualized diagnosis, prognosis & treatment plan; Transformed & Validated Data; Personalized Model Guided Medicine; For a particular patient; Unknown / missing data; Predict value of missing variable; Create Statistical Simulation Models; Model & Verification; Reasoning & decision support; Data Preprocessing; Curation & Validation; Domain knowledge & assumptions; Clinical workflows; Data Mining; Raw data from biomarker based personalized acquisition; Knowledge Discovery; Disease signatures; Patient grouping & similarity
  24. • Disease signatures: Latent factors (patterns) that characterize disease
        ◦ Distribution of most relevant variables for disease (e.g., biomarkers)
        ◦ Multiple variables per signature, signatures per disease
      • Patient Cluster: Homogeneous patient group with common characteristics
      • Patient Similarity: Patients “like” me or mine (patient or clinician role)
        ◦ “like” = according to different criteria (e.g., allocation on disease signatures)
  25. Diagram labels: Similarity & Graph clustering; Topics & allocations; Modelling
  26. Workflow diagram labels: Disease signatures; Patient grouping & similarity; Individualized diagnosis, prognosis & treatment plan; Transformed & Validated Data; Personalized Model Guided Medicine; For a particular patient; Unknown / missing data; Predict value of missing variable; Reasoning & decision support; Clinical workflows; Data Mining; Raw data from biomarker based personalized acquisition; Knowledge Discovery; Data Preprocessing; Curation & Validation; Create Statistical Simulation Models; Model & Verification; Domain knowledge & assumptions; Variable dependencies & causality; Simulation Models
  27. • Bayesian Net: Directed Acyclic Graph + Conditional Prob Distributions
        ◦ Features (Nodes) & Dependencies (Edges)
        ◦ Compact representation of joint data distribution
      • Given: patient data table with columns Patient, X1 … X8 (row 1: Y N N Y Y Y N Y; … ; row 1000: N N Y N N Y N N) + Domain Knowledge
      • Find: a network over X1 … X8 (node labels shown: Smoking, Lung cancer, Chronic bronchitis, Genetic Factor, Allergy)
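The "compact representation of joint data distribution" point can be made concrete with a three-node toy network. The sketch below assumes a hypothetical structure (Smoking → Lung cancer, Smoking → Chronic bronchitis) and made-up probabilities; the slide names these variables but the transcript does not preserve its edges or numbers.

```python
from itertools import product

# Hypothetical structure: Smoking -> LungCancer, Smoking -> Bronchitis.
# All probabilities below are invented for illustration.
p_smoking = {True: 0.3, False: 0.7}
p_cancer_given_smoking = {True: 0.10, False: 0.01}      # P(cancer=True | smoking)
p_bronchitis_given_smoking = {True: 0.40, False: 0.05}  # P(bronchitis=True | smoking)

def bernoulli(p_true, value):
    """Probability of a binary outcome given P(outcome=True)."""
    return p_true if value else 1.0 - p_true

def joint(s, c, b):
    """Factorised joint: P(S=s, C=c, B=b) = P(s) * P(c | s) * P(b | s)."""
    return (p_smoking[s]
            * bernoulli(p_cancer_given_smoking[s], c)
            * bernoulli(p_bronchitis_given_smoking[s], b))

# The factorised joint still sums to one over all eight assignments.
total = sum(joint(s, c, b) for s, c, b in product([True, False], repeat=3))

# Marginal P(cancer=True), obtained by summing out the other two variables.
p_cancer = sum(joint(s, True, b) for s, b in product([True, False], repeat=2))
```

This network needs 5 free parameters (1 + 2 + 2) where an unstructured joint over three binary variables needs 7; the saving grows quickly with more variables and sparse dependencies.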
  28. Final DAG (based on MCMC&DP, threshold=0.5)
      • Nodes: Age; ParCHD; Procedures; ExIntoler; Cyanosis; CPBP; CPArrhy; CPConcl; CPTermRsn; BSA; TPVRegurg; TriRegurg; RVD; RedRV; PSMotion; RestrPatt; AVBlock; SupravArrhy; VentricArrhy
      • [Figure: the same variable list appears on both axes, with a scale from 0 to 1]
      • Labels: Modelling; Dependency Analysis; Inference
  29. Workflow diagram labels: Disease signatures; Patient grouping & similarity; Variable dependencies & causality; Simulation Models; Transformed & Validated Data; Data Mining; Raw data from biomarker based personalized acquisition; Knowledge Discovery; Data Preprocessing; Curation & Validation; Create Statistical Simulation Models; Model & Verification; Domain knowledge & assumptions; Personalized Model Guided Medicine; For a particular patient; Unknown / missing data; Predict value of missing variable; Reasoning & decision support; Clinical workflows; Individualized diagnosis, prognosis & treatment plan
  30. Increased RVD is associated with worse values in every MR aspect (TVPRegurg, PSMotion, RedRV, AV_Block, TriRegurg)
  31. Brussels, 6-7 May 2014
  32. MyHealthMyData
  33. Diagram labels
      • Data: Raw Personal Data; Raw Anonymised; Summary Anonymised
      • Access: Private; Controlled Access; Public
      • Users: Bioinformatics services for All Users; Doctors (and Patients?); Researchers
  34. • Obtaining consent not straightforward
      • Anonymisation: necessary, rather complicated, ensuring neither privacy nor data value
      • “Blending in a crowd” and k-anonymity: privacy is property not output of sanitization
      • How do we define privacy?
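For readers unfamiliar with k-anonymity, the sketch below measures it for a small table: a table is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The column names and records are hypothetical, and the function is a generic check rather than anything taken from the talk.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the largest k for which the table is k-anonymous.

    This is the size of the smallest group of rows sharing the same
    quasi-identifier values. Raises ValueError on an empty table.
    """
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

# Hypothetical, already generalised records.
rows = [
    {"age_band": "0-9", "region": "North", "diagnosis": "JIA"},
    {"age_band": "0-9", "region": "North", "diagnosis": "CHD"},
    {"age_band": "10-19", "region": "South", "diagnosis": "JIA"},
    {"age_band": "10-19", "region": "South", "diagnosis": "JIA"},
    {"age_band": "10-19", "region": "South", "diagnosis": "CHD"},
]
k = k_anonymity(rows, ["age_band", "region"])
```

The toy table is 2-anonymous on (age_band, region). The check says nothing about how the table was produced, which is one way to read the slide's remark that privacy is a property rather than an output of sanitization.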
  35. • data publishing: “Sanitization” (Anonymisation) hiding individual info (k-anonymity) but preserving (sufficient) aggregated statistics
      • data mining: Specific algorithms (usually operating in two phases) for classification, clustering, association rules, …
      • mechanisms: Differential Privacy & Crowd-Blending Privacy perturb data or add noise ensuring ε-indistinguishable output distribution
      • encryption: Fully Homomorphic Encryption (FHE) for computation and query to run over encrypted data
      • decentralization: Blockchain to Protect Personal Data: decentralized personal data management, users own and control their data
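The "add noise" idea behind differential privacy can be illustrated with the Laplace mechanism, one standard way to answer a counting query with ε-differential privacy. The slide does not name a specific mechanism, so this choice is an assumption; the function names and the example count are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(true_count, epsilon, rng):
    """Counting query (sensitivity 1): add Laplace noise with scale 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
noisy = private_count(true_count=120, epsilon=0.5, rng=rng)
```

A smaller ε means a larger noise scale and stronger privacy at the cost of accuracy; averaging many independent noisy answers would recover the true count, which is why real deployments also track a privacy budget.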
  36. • Big data is not only about size
      • Data is distributed, data is heterogeneous
      • Processing goes to data, not data to processing
      • ICT (Data management & processing) advances
        ◦ Data compression
        ◦ Federated / privacy-preserving processing
        ◦ Scalable parallel / distributed processing
        ◦ Data curation (otherwise: garbage in, garbage out)
        ◦ Text and data analytics
  37. • http://www.madgik.di.uoa.gr
      • https://www.humanbrainproject.eu
      • http://www.md-paedigree.eu/
      • http://www.openaire.eu
      • http://www.optique-project.eu
