ReComp: Preserving the value of large scale data analytics over time through selective re-computation

Invited talk, Keele University, Dec. 2016

  1. ReComp: Preserving the value of large scale data analytics over time through selective re-computation. recomp.org.uk. Paolo Missier, Jacek Cala, Manisha Rathi, School of Computing Science, Newcastle University. Keele University, Dec. 2016. (*) Panta Rhei (Heraclitus); painting by Johannes Moreelse.
  2. Data Science. Big Data + Meta-knowledge (Algorithms, Tools, Middleware, Reference datasets) → The Big Analytics Machine → “Valuable Knowledge”.
  3. Data Science over time. Big Data → The Big Analytics Machine → “Valuable Knowledge” (versions V1, V2, V3 as time t advances); the Meta-knowledge (Algorithms, Tools, Middleware, Reference datasets) also evolves over time.
  4. Example: supervised learning. Meta-knowledge: Training set → Model learning (classification algorithms, Background Knowledge as prior) → Predictive classifier. Over time, the training set is no longer representative of current data → the model loses predictive power. Ex.: the training set is a sample from a social media stream (Twitter, Instagram, ...). Incremental training is established (neural networks, Bayes classifiers, ...); incremental unlearning has some established work [1].
     [1] Kidera, T., S. Ozawa, and S. Abe. “An Incremental Learning Algorithm of Ensemble Classifier Systems.” Neural Networks, 2006, 6453–59. doi:10.1109/IJCNN.2006.247345.
     [2] Polikar, R., L. Upda, S.S. Upda, and V. Honavar. “Learn++: An Incremental Learning Algorithm for Supervised Neural Networks.” IEEE Transactions on Systems, Man and Cybernetics, Part C 31, no. 4 (2001): 497–508. doi:10.1109/5326.983933.
     [3] Diehl, C.P., and G. Cauwenberghs. “SVM Incremental Learning, Adaptation and Optimization.” Proceedings of the International Joint Conference on Neural Networks 4 (2003): 2685–90. doi:10.1109/IJCNN.2003.1223991.
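A minimal sketch of the incremental-training idea above (not from the talk; the model, the drift simulation and all names are illustrative): an online logistic-regression classifier updated batch by batch, so it can track a drifting training distribution instead of being retrained from scratch.

```python
import numpy as np

class OnlineLogReg:
    """Tiny online logistic-regression classifier, updated in place one
    batch at a time (in the spirit of incremental training), rather than
    retrained from scratch when the data distribution moves."""
    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def partial_fit(self, X, y):
        p = 1.0 / (1.0 + np.exp(-(X @ self.w + self.b)))
        grad = p - y                       # dLoss/dlogit for log loss
        self.w -= self.lr * (X.T @ grad) / len(y)
        self.b -= self.lr * grad.mean()

    def predict(self, X):
        return ((X @ self.w + self.b) > 0).astype(int)

rng = np.random.default_rng(0)
model = OnlineLogReg(n_features=2)
for step in range(50):                     # simulated stream of batches
    X = rng.normal(size=(64, 2))
    y = (X[:, 0] > 0).astype(int)          # current "ground truth"
    model.partial_fit(X, y)

X_test = rng.normal(size=(500, 2))
acc = (model.predict(X_test) == (X_test[:, 0] > 0)).mean()
```

With updates like this, reacting to change means feeding in the new batches; the harder question the slide raises, incremental *unlearning*, has no counterpart in this sketch.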
  5. Example: stream analytics. Meta-knowledge: Data stream → Time series analysis (pattern recognition algorithms, Background Knowledge) → temporal patterns, activity detection, user behaviour, ... Questions: if the output is stable over time, can I save computation and deliver older outcomes instead? How do I quantify the quality/cost trade-offs?
  6. Analytics functions and their dependencies can be complex. Y = f(X, D), where X = inputs (vector of arbitrary data structures, “big data”), D = vector of dependencies (libraries, reference data), Y = outputs (vector of arbitrary data structures, “knowledge”). Ex. machine learning, using Python and scikit-learn: learn a model to recognise an activity pattern; model training takes a training + testing dataset and config; dependencies include scikit-learn, Numpy, Pandas, Python 3, Ubuntu x.y.z on an Azure VM. Ex. workflow to identify mutations in a patient’s genome: analyse input genome variants; inputs are the input genome and config; dependencies include GATK/Picard/BWA, the reference genome and variant DBs, the workflow specification, the workflow manager (and its own dependencies), Ubuntu on an Azure Linux VM cluster.
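The formulation Y = f(X, D) can be made concrete with a small sketch (illustrative only, not ReComp's implementation; all names are hypothetical): dependencies in D carry explicit versions, so a version change can be detected and selectively trigger re-computation of Y.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Dependency:
    """A versioned dependency: a library or a reference dataset."""
    name: str
    version: str

def f(X, D):
    # stand-in analytics function: "knowledge" depends on inputs and deps
    return sum(X) + len(D)

@dataclass
class Computation:
    """One past execution Y = f(X, D), remembered with its dependencies."""
    X: tuple
    D: tuple                      # tuple of Dependency
    Y: object = field(init=False)

    def __post_init__(self):
        self.Y = f(self.X, self.D)

    def refresh_if_changed(self, new_D):
        """Recompute Y only if some dependency version changed."""
        if new_D != self.D:
            self.D, self.Y = new_D, f(self.X, new_D)
            return True
        return False

dep_old = (Dependency("ClinVar", "2016-02"),)
dep_new = (Dependency("ClinVar", "2016-05"),)
c = Computation((1, 2, 3), dep_old)
changed = c.refresh_if_changed(dep_new)
```

Here `refresh_if_changed` is only the cheapest "detect change" step; ReComp's real problem is deciding whether the re-computation is worth its cost.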
  7. Complex NGS pipelines. Stage 1: raw sequences → align (sample sequence aligned to the HG19 reference genome using the BWA aligner) → clean (duplicate elimination; Picard tools) → recalibrate alignments (corrects for system bias on quality scores assigned by the sequencer; GATK) → calculate coverage (computes the coverage of each read). Stage 2: call variants (operates on multiple samples simultaneously; splits samples into chunks; the haplotype caller detects both SNVs and longer indels) → recalibrate variants (attempts to reduce the false positive rate from the caller). Stage 3: filter variants (VCF subsetting by filtering, e.g. non-exomic variants) → annotate (Annovar functional annotations, e.g. MAF, synonymity, SNPs, followed by in-house annotations) → annotated variants.
  8. Problem size: HPC vs Cloud deployment. Configuration: HPC cluster (dedicated nodes): 3x8-core compute nodes, Intel Xeon E5640 2.67GHz CPU, 48 GiB RAM, 160 GB scratch space. Azure workflow engines: D13 VMs with 8-core CPU, 56 GiB of memory and 400 GB SSD. [Chart: response time [hh:mm] vs number of samples, for 3 engines (24 cores), 6 engines (48 cores), 12 engines (96 cores).] Big Data: raw sequences for Whole Exome Sequencing (WES) are 5–20 GB per patient; processed in cohorts of 20–40, or close to 1 TB per cohort; the time required to process a 24-sample cohort can easily exceed 2 CPU months; WES is about 2% of what Whole Genome Sequencing analyses require.
  9. Understanding change: threats and opportunities. Big Data → Life Sciences Analytics → “Valuable Knowledge” (V1, V2, V3 over time), with evolving Meta-knowledge (Algorithms, Tools, Middleware, Reference datasets). Threats: will any of the changes invalidate prior findings? Opportunities: can the findings from the pipelines be improved over time? Cost: need to model future costs based on past history and pricing trends for virtual appliances. Impact analysis: which patients/samples are likely to be affected? How do we estimate the potential benefits for affected patients? Can we estimate the impact of these changes without re-computing entire cohorts? Changes: algorithms and tools; accuracy of input sequences; reference databases (HGMD, ClinVar, OMIM GeneMap, GeneCard, ...).
  10. ReComp. Observe change (in big data, in meta-knowledge) → Assess and measure (knowledge decay) → Estimate (cost and benefits of refresh) → Enact (reproduce analytics processes). A decision support system for selectively re-computing complex analytics in reaction to change: generic (not just for the life sciences) and customisable (e.g. for genomics pipelines).
  11. Challenges. 1. Observability: to what extent can we observe the process and its execution? Process structure; data flow → provenance. 2. Detecting and quantifying changes: in inputs, dependencies, outputs → diff() functions. 3. Control: how much control do we have over the system? Re-run: how often? Total vs partial execution; input density / resolution / incremental update (e.g. nonmonotonic learning / unlearning). The ReComp decision support system takes change events, diff(.,.) functions, “business rules” and a history of past knowledge assets, and produces impact and cost estimates, a reproducibility assessment, and an optimal re-computation prioritisation.
  12. General ReComp problem formulation.
  13. Change Impact.
  14. Example: NGS variant interpretation. Genomics: WES / WGS, variant calling, variant interpretation → diagnosis. E.g. the 100K Genome Project, Genomics England, GeCIP. [Pipeline as on slide 7: align → clean → recalibrate alignments → calculate coverage → call variants → recalibrate variants → filter variants → annotate, over Stages 1–3.] Also metagenomics: species identification, e.g. the EBI metagenomics portal. Interpretation can help to confirm/reject a hypothesis about a patient’s phenotype; it classifies variants into three categories: RED (pathogenic), AMBER (unknown/uncertain) and GREEN (benign).
  15. The SVI example.
  16. Change in variant interpretation. What changes: improved sequencing / variant calling; ClinVar and OMIM evolve rapidly; new reference data sources.
  17. ReComp Problem Statement. 1. Estimate the impact of changes. 2. Optimise ReComp decisions: select the subset of the population that maximises expected impact, subject to a budget constraint. Problem: P is computationally expensive.
  18. Estimators: formalisation and a possible approach. Problem: f() is computationally expensive (and there are local changes to consider). Approach: learn an approximation f’() of f(): a surrogate (emulator), where a stochastic term ε accounts for the error in approximating f and is typically assumed to be Gaussian. Learning f’() requires a training set { (x_i, y_i) }. If f’() can be found, then we can hope to use it to approximate f, which can then be used to carry out sensitivity analysis.
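A toy illustration of the surrogate idea (a polynomial least-squares fit stands in for the emulator; the slide leaves the model family open, and the function and numbers below are invented): learn f’() from a history of (x_i, y_i) pairs, then use the cheap f’() for sensitivity probes instead of the expensive f().

```python
import numpy as np

def f(x):
    """Stand-in for an expensive analytics function."""
    return np.sin(x) + 0.5 * x

rng = np.random.default_rng(1)
xs = rng.uniform(0, 3, size=40)                       # past executions
ys = f(xs) + rng.normal(scale=0.01, size=xs.shape)    # y_i = f(x_i) + eps

# Fit the surrogate f' as a cubic least-squares polynomial
coeffs = np.polyfit(xs, ys, deg=3)
f_prime = np.poly1d(coeffs)

x0 = 1.5
err = abs(f_prime(x0) - f(x0))        # surrogate error at a probe point
# Sensitivity via the surrogate: a finite-difference slope, cheap to get
slope = (f_prime(x0 + 1e-3) - f_prime(x0 - 1e-3)) / 2e-3
```

The same pattern, with a richer emulator (e.g. a Gaussian process), is what makes large-scale sensitivity analysis affordable when each run of f() costs minutes or hours.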
  19. Scope of change. 1. Change: may affect a subset of the patient population → scope: which patients will be affected? 2. Change: affects a single patient → partial re-run.
  20. Challenge 1: battleships. Patient / change impact matrix. First challenge: precisely identify the scope of a change. Blind reaction to change: recompute the entire matrix. Can we do better, i.e. hit the high-impact cases (the X) without re-computing the entire matrix?
  21. SVI process: detailed design. Inputs: patient variants, phenotype hypothesis; reference data: GeneMap, ClinVar. Steps: phenotype to genes → variant selection → variant classification → classified variants.
  22. Baseline: blind recomputation. 17 minutes / patient (single-core VM). Runtime is consistent across different phenotypes; changes to GeneMap/ClinVar have negligible impact on the execution time.
     Run time [mm:ss], μ ± σ, by GeneMap version: 2016-03-08: 17:05 ± 22; 2016-04-28: 17:09 ± 15; 2016-06-07: 17:10 ± 17.
  23. Inside a single instance: partial re-computation, driven by a change in ClinVar or a change in GeneMap.
  24. White-box granular provenance. [Diagram: inputs x11, x12, dependencies D11, D12, process P, output y11.] Using provenance metadata to identify the fragments of SVI that are affected by the change in reference data.
  25. Results.
     Run time [mm:ss], μ ± σ, and savings by GeneMap version: 2016-04-28: 11:51 ± 16 (31% saving); 2016-06-07: 11:50 ± 20 (31% saving).
     By ClinVar version: 2016-02: 9:51 ± 14 (43% saving); 2016-05: 9:50 ± 15 (42% saving).
     How much can we save? It depends on the process structure and on the first usage of the reference data. Overhead: storing the interim data required for partial re-execution: 20–22 MB for GeneMap changes and 2–334 kB for ClinVar changes.
  26. Partial re-computation using input difference. Idea: run SVI but replace the ClinVar query with a query on the ClinVar version diff: Q(CV) → Q(diff(CV1, CV2)). This works for SVI, but is hard to generalise: it depends on the type of process. The bigger gain: diff(CV1, CV2) is much smaller than CV2.
     GeneMap versions (from → to; to-version record count; difference record count; reduction): 16-03-08 → 16-06-07: 15910; 1458; 91%. 16-03-08 → 16-04-28: 15871; 1386; 91%. 16-04-28 → 16-06-01: 15897; 78; 99.5%. 16-06-01 → 16-06-02: 15897; 2; 99.99%. 16-06-02 → 16-06-07: 15910; 33; 99.8%.
     ClinVar versions: 15-02 → 16-05: 290815; 38216; 87%. 15-02 → 16-02: 285042; 35550; 88%. 16-02 → 16-05: 290815; 3322; 98.9%.
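The Q(CV) → Q(diff(CV1, CV2)) idea in miniature (record layout, keys and gene names are invented for illustration, not SVI's real schema):

```python
def diff(old, new):
    """Records added or modified between two versions of a reference DB,
    keyed by record id. This is the small delta we query instead of the
    full new version."""
    return {k: v for k, v in new.items() if old.get(k) != v}

def query(db, gene_ids):
    """Stand-in for SVI's reference-data query: records for given genes."""
    return {k: v for k, v in db.items() if v["gene"] in gene_ids}

cv1 = {1: {"gene": "BRCA1", "class": "benign"},
       2: {"gene": "TP53",  "class": "pathogenic"}}
cv2 = {1: {"gene": "BRCA1", "class": "pathogenic"},   # reclassified
       2: {"gene": "TP53",  "class": "pathogenic"},   # unchanged
       3: {"gene": "MLH1",  "class": "benign"}}       # new record

delta = diff(cv1, cv2)                 # much smaller than cv2 itself
hits = query(delta, {"BRCA1"})         # Q(diff(CV1, CV2)) instead of Q(CV2)
```

Because unchanged records drop out of the delta, the query touches only the 0.01%–13% of records that actually moved between versions (per the reduction figures above).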
  27. Saving resources on stream processing. Baseline stream processing: the raw stream x1, x2, ..., xk, xk+1, ..., x2k is split into windows W1, W2, ...; P computes y1, y2, ... Conditional stream processing: for each window Wi, make a comp / noComp decision; on noComp, deliver an earlier output yi-h (h < i) again. If we could predict that yi+1 will be similar to yi, we could skip computing P(Wi+1), save resources and deliver yi again. Can we make optimal comp/noComp decisions? What is required?
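A sketch of the comp / noComp loop (illustrative only: the "predictor" here is a trivial threshold on a cheap proxy of the window, whereas the slides treat the real decision policy as the open problem):

```python
def process(window):
    """Stand-in for the expensive analytics function P."""
    return sum(window) / len(window)

def stream_with_skips(windows, threshold=0.1):
    """Deliver a result per window, but only re-run P when a cheap proxy
    of the new window strays from the last delivered output (noComp
    otherwise: the previous y is delivered again)."""
    results, last, computed = [], None, 0
    for w in windows:
        cheap_proxy = w[0]            # available without running P
        if last is None or abs(cheap_proxy - last) > threshold:
            last = process(w)         # comp
            computed += 1
        results.append(last)          # comp or noComp: something is delivered
    return results, computed

windows = [[1.0, 1.0], [1.01, 1.0], [1.02, 1.01], [2.0, 2.1]]
ys, n_comp = stream_with_skips(windows)
```

Here P runs only twice for four windows; the quality/cost trade-off is exactly the one the slide asks about, since the skipped windows return slightly stale values.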
  28. Diff and currency functions. The quality of yi is initially maximal, and decreases over time in a way that depends on how rapidly the new values yj diverge from yi.
  29. Measuring DeComp performance. Evaluating the performance, and the cost, of comp / noComp decisions on each window. Boundary cases: a very conservative DeComp computes every value; a very optimistic one computes only the first value.
  30. Diff time series.
  31. Forecasting drift. [Diagram: windows ..., Wi, Wi+1 feed the comp / noComp decision; on noComp an earlier yi-h (h < i) is delivered in place of yi; P produces y’i.] Derived time series → drift forecasting.
  32. Initial experiments: the DEBS’15 taxi routes challenge. Find the most frequent / most profitable taxi routes in Manhattan within each 30’ window. Record format: VehicleId, LicenseId, pickup date, drop-off date, duration, distance, pickup lon/lat, drop-off lon/lat, payment type, fare, ... Sample rows:
     0729...,E775...,2013-01-01 00:00:00,2013-01-01 00:02:00,120,0.44,-73.956528,40.716976,-73.962440,40.715008,CSH, 3.50, ...
     0EC2...,778C...,2013-01-01 00:01:00,2013-01-01 00:03:00,120,0.71,-73.973145,40.752827,-73.965897,40.760445,CSH, 4.00, ...
     4682...,BB89...,2013-01-01 00:00:00,2013-01-01 00:04:00,240,1.71,-73.955383,40.779728,-73.967758,40.760326,CSH, 6.50, ...
     C306...,E255...,2013-01-01 00:01:00,2013-01-01 00:04:00,180,0.84,-73.942841,40.797031,-73.934540,40.797314,CSH, 4.50, ...
  33. Diff time series – taxi routes. [Diagram: the raw data stream (start/finish times with pickup and drop-off coordinates) is windowed (W1, W2, W3) into a routes time series (ft, Ri), which is aggregated into a top-k time series (Ri → Freqi).]
  34. Routes drift – comparing ranked lists. P outputs a list of the top most frequent / most profitable routes. To compare lists we use the generalised Kendall’s tau (Fagin et al. [1]), quantifying how much the top-k changes between one window and the next. Input parameters determine stability / sensitivity: K (how many routes) and the window size (e.g. 30’).
     [1] Fagin, Ronald, Ravi Kumar, and D. Sivakumar. “Comparing Top k Lists.” SIAM Journal on Discrete Mathematics 17, no. 1 (2003): 134–60. doi:10.1137/S0895480102412856.
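For concreteness, an illustrative re-implementation of the comparison (the "optimistic" K^(0) variant of the generalised Kendall distance of Fagin et al. [1], where pairs appearing together in only one list contribute nothing; this is not the talk's code, and the normalisation by the number of item pairs is a choice made here):

```python
from itertools import combinations
from math import comb

def kendall_topk(l1, l2):
    """Normalised K^(0) distance between two top-k lists (best item first)."""
    r1 = {x: i for i, x in enumerate(l1)}
    r2 = {x: i for i, x in enumerate(l2)}
    items = sorted(set(l1) | set(l2), key=str)
    penalty = 0
    for a, b in combinations(items, 2):
        in1, in2 = (a in r1 and b in r1), (a in r2 and b in r2)
        if in1 and in2:
            # pair ranked in both lists: penalise a relative-order swap
            penalty += int((r1[a] - r1[b]) * (r2[a] - r2[b]) < 0)
        elif in1 and (a in r2) != (b in r2):
            # only one of the pair made the other top-k: it is implicitly
            # ranked higher there, so l1 must agree
            present = a if a in r2 else b
            absent = b if present is a else a
            penalty += int(r1[present] > r1[absent])
        elif in2 and (a in r1) != (b in r1):
            present = a if a in r1 else b
            absent = b if present is a else a
            penalty += int(r2[present] > r2[absent])
        elif not in1 and not in2:
            # each element appears only in a different list: discordant
            penalty += 1
        # remaining case (pair in one list only): optimistic variant adds 0
    return penalty / comb(len(items), 2)

drift = kendall_topk(["R1", "R2", "R3"], ["R2", "R1", "R4"])
```

Computing this between consecutive windows yields exactly the kind of drift time series plotted on the following slides.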
  35. [Charts: drift function for top-10 routes over date range [1/Jan 00:00–15/Jan 00:00), with window sizes 2h, 1h and 30m.]
  36. [Charts: drift function with window size 1h over date range [1/Jan 00:00–15/Jan 00:00), for top-40, top-20 and top-10 routes.]
  37. Approach: ARIMA forecasting. [Chart: actual normalised drift vs ARIMA(1,0,2)[1,0,1] forecast; drift function: top-10, window size 1h, date range [20/Jan 00:00–25/Jan 17:00), with new-day markers.] Drift prediction using time series forecasting: this is the derived diff() time series. Autoregressive integrated moving average (ARIMA) is widely used, well understood, well supported and fast to compute; it assumes normality of the underlying random variable. A poor prediction means computing P too often or too rarely.
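The forecasting step, sketched as code. The deck fits ARIMA(1,0,2); as a dependency-free stand-in, this fits a plain AR(p) model to the derived diff() series by ordinary least squares and returns a one-step-ahead drift forecast (illustrative only, on a synthetic series):

```python
import numpy as np

def ar_forecast(series, p=2):
    """One-step-ahead forecast from y_t ~ c + a_1*y_{t-1} + ... + a_p*y_{t-p},
    fitted by least squares. A stand-in for a proper ARIMA fit."""
    y = np.asarray(series, dtype=float)
    # lag matrix: row t holds [y_{t-1}, ..., y_{t-p}] for t = p .. n-1
    lags = np.column_stack([y[p - i - 1 : len(y) - i - 1] for i in range(p)])
    X = np.column_stack([np.ones(len(lags)), lags])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return float(coef @ np.concatenate(([1.0], y[-1 : -p - 1 : -1])))

# a synthetic drift series that follows an exact AR(2) law
drift = [1.0, 0.5]
for _ in range(40):
    drift.append(0.6 * drift[-1] + 0.3 * drift[-2])

forecast = ar_forecast(drift)
```

In the ReComp setting the forecast would feed the comp / noComp decision: run P only when the predicted drift exceeds a currency threshold.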
  38. The next steps: challenges. Can we learn effective surrogate models and estimators of change impact? diff() functions and estimators seem very problem-specific: to what extent can the ReComp framework be made generic and reusable, yet still useful? Metadata infrastructure: a DB of past execution history. Reproducibility: what really happens when I press the “ReComp” button?
  39. Summary and challenges. Forwards: react to changes in data used by processes. Backwards: restore the value of knowledge outcomes. Monitor data changes (input and reference data versioning) → quantify data changes → estimate the impact of changes and the cost of refresh; quantify knowledge decay → estimate the benefit and cost of refresh → optimise / prioritise outcomes → re-compute selected outcomes (drawing on provenance, cost, new ground truth and data change events). ReComp: a meta-process to observe and control underlying analytics processes.
  40. ReComp scenarios (scenario; target impact areas; why ReComp is relevant; proof-of-concept experiments; expected optimisation):
     - Dataflow, experimental science; genomics; rapid knowledge advances and rapid scaling-up of genetic testing at population level; WES/SVI pipeline, workflow implementation (eScience Central); timeliness and accuracy of patient diagnosis subject to budget constraints.
     - Time series analysis; personal health monitoring, smart city analytics, IoT data streams; rapid data drift and the cost of computation at the network edge (e.g. IoT); NYC taxi rides challenge (DEBS’15); use of low-power edge devices when the outcome is predictable and data drift is low.
     - Data layer optimisation; tuning of a large-scale data management stack; optimal data organisation is sensitive to current data profiles; graph DB re-partitioning; system throughput vs cost of re-tuning.
     - Model learning; applications of predictive analytics; predictive models are very sensitive to data drift; Twitter content analysis; sustained model predictive power over time vs retraining cost.
     - Simulation; TBD; repeated simulation is computationally expensive but often not beneficial; flood modelling / CityCat Newcastle; computational resources vs the marginal benefit of a new simulation model.
  41. Observability / transparency: white box vs black box.
     - Structure (static view). White box: dataflow (eScience Central, Taverna, VisTrails, ...), scripting (R, Matlab, Python, ...). Black box: function semantics, packaged components, third-party services.
     - Data dependencies (runtime view). White box: provenance recording of inputs, reference datasets, component versions, outputs. Black box: inputs and outputs only; no data dependencies, no details on individual components.
     - Cost. White box: detailed resource monitoring; cloud → £££. Black box: wall clock time, service pricing, setup time (e.g. model learning).
  42. Project structure. 3 years’ funding from the EPSRC (£585,000 grant) on the Making Sense from Data call, Feb. 2016 – Jan. 2019; 2 RAs fully employed in Newcastle. PI: Dr. Missier, School of Computing Science, Newcastle University (30%). Co-investigators (8% each): Prof. Watson, School of Computing Science, Newcastle University; Prof. Chinnery, Department of Clinical Neurosciences, Cambridge University; Dr. Phil James, Civil Engineering, Newcastle University. Builds upon the experience of the Cloud-e-Genome project (2013–2015), a collaboration between the Institute of Genetic Medicine and the School of Computing Science at Newcastle University, funded by NIHR / Newcastle BRC (£180,000) plus a $40,000 Microsoft Research “Azure for Research” grant. Its aims: to demonstrate cost-effective workflow-based processing of NGS pipelines on the cloud, and to facilitate the adoption of reliable genetic testing in clinical practice.
