Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Sciunits: Resuable Research Object

112 views

Published on

Presented by Gabe Fils at eScience 2017

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Sciunits: Resuable Research Object

  1. 1. 1IEEE eScience 2017 sciunits:sciunits: Reusable Research ObjectsReusable Research Objects School of Computing, College of Computing and Digital MediaSchool of Computing, College of Computing and Digital Media Dai Hai Ton That, Gabe Fils,Dai Hai Ton That, Gabe Fils, Zhihao Yuan, Tanu MalikZhihao Yuan, Tanu Malik Presented by Gabe FilsPresented by Gabe Fils
  2. 2. 2IEEE eScience 2017 Problem SpaceProblem Space No easily creatable, readily reusable, efficiently versioned,No easily creatable, readily reusable, efficiently versioned, discrete unit of computation existsdiscrete unit of computation exists Virtualization Distributed Version Control Research Object portable self-contained repeatable collaborative versioned documented
  3. 3. 3IEEE eScience 2017 Introducing: TheIntroducing: The sciunitsciunit  Captures application executionsCaptures application executions  Repeats executionsRepeats executions  Reproduces executions, changing input argsReproduces executions, changing input args  Versioned executions stored as oneVersioned executions stored as one sciunitsciunit  Uses provenance for self-documentationUses provenance for self-documentation AA reusablereusable research object.research object.
  4. 4. 4IEEE eScience 2017 Sample Applications: FIE, VICSample Applications: FIE, VIC FIE Application WorkflowFIE Application Workflow  City of Chicago Food InspectionsCity of Chicago Food Inspections Evaluation ModelEvaluation Model  Four applicationsFour applications  Two languagesTwo languages  130 files130 files  1580 dependencies1580 dependencies  908 MB908 MB https://github.com/chicago/food-inspections-https://github.com/chicago/food-inspections- evaluationevaluation  Variable Infiltration CapacityVariable Infiltration Capacity  Four applicationsFour applications  Five languagesFive languages  7 GB7 GB https://vic.readthedocs.iohttps://vic.readthedocs.io
  5. 5. 5IEEE eScience 2017 sciunitsciunit ArchitectureArchitecture https://bitbucket.org/geotrust/sciunit-clihttps://bitbucket.org/geotrust/sciunit-cli sciunitsciunit clientclient sciunitsciunit serverserver sciunitsciunit  LightweightLightweight  VersionedVersioned  Self-containedSelf-contained  StoresStores sciunitssciunits  VisualizesVisualizes sciunitssciunits  Linux CLI AppLinux CLI App  C / PythonC / Python  Minimal DependenciesMinimal Dependencies
  6. 6. 6IEEE eScience 2017 ClientClient packagepackage CommandCommand Initialize or retrieve aInitialize or retrieve a sciunitsciunit:: Run FIE, capture into package:Run FIE, capture into package:
  7. 7. 7IEEE eScience 2017 Packaging DetailsPackaging Details Package In Storage (133 MB):Package In Storage (133 MB): 22ndnd Version Of Package (133 MB):Version Of Package (133 MB): FIE pkg-root DirFIE pkg-root Dir 1) Attach to process1) Attach to process 2) Intercept system calls2) Intercept system calls 3) Copy files / executables3) Copy files / executables 4) Log system calls4) Log system calls
  8. 8. 8IEEE eScience 2017 ClientClient repeatrepeat CommandCommand original calloriginal call 61021 exec61021 exec “/usr/bin/python”“/usr/bin/python” replaced callreplaced call 61021 exec61021 exec “/home/user1/pkgroot/usr/bin/python”“/home/user1/pkgroot/usr/bin/python” Repeat a package:Repeat a package: 1) Attach to process1) Attach to process 2) Replace system call args2) Replace system call args
  9. 9. 9IEEE eScience 2017 Versioning SolutionVersioning Solution 80G File, Fixed 4K Chunks:80G File, Fixed 4K Chunks: Same File, 1 Byte Inserted At Start:Same File, 1 Byte Inserted At Start:  When to store?When to store?  During packagingDuring packaging  After packagingAfter packaging  How to store?How to store?  Line-based diffsLine-based diffs  Fixed-size chunksFixed-size chunks  Content-definedContent-defined
  10. 10. 10IEEE eScience 2017 Rabin HashRabin Hash  Hash of subset of file bytes (Hash of subset of file bytes (RH(BRH(B11,, BB22, …, … BBnn))))  Fixed-size sliding windowFixed-size sliding window nn  Hash at any positionHash at any position ii ((RH(XRH(X(i,n)(i,n)))))  Deduplicate chunkDeduplicate chunk
  11. 11. 11IEEE eScience 2017 Storage And RetrievalStorage And Retrieval Deduplicated Container StorageDeduplicated Container Storage Store package:Store package: 1) Archive package-root1) Archive package-root 2) CDC on archive2) CDC on archive 3) Store manifest3) Store manifest Retrieve package:Retrieve package: 1) Retrieve manifest1) Retrieve manifest 2) Concatenate chunks2) Concatenate chunks 3) Extract archive3) Extract archive
  12. 12. 12IEEE eScience 2017 Detailed VisualizationDetailed Visualization Part Of A Normal (Verbose) Provenance LogPart Of A Normal (Verbose) Provenance Log Small Section Of Graph Built From Normal Provenance LogSmall Section Of Graph Built From Normal Provenance Log
  13. 13. 13IEEE eScience 2017 Summarization: Group By SimilaritySummarization: Group By Similarity  Group vertices byGroup vertices by type / connectionstype / connections  Effect: group subprocesses, group files in directoryEffect: group subprocesses, group files in directory Similarity RuleSimilarity Rule Type(u) = Type(v), Input(u) = Input(v), Output(u) = Output(v)Type(u) = Type(v), Input(u) = Input(v), Output(u) = Output(v) Similarity AppliedSimilarity AppliedFull GraphFull Graph
  14. 14. 14IEEE eScience 2017 Summarization: PackSummarization: Pack  Find min-connected nodes, pack into hubsFind min-connected nodes, pack into hubs Packability RulesPackability Rules 1) Type(u) = file, {1) Type(u) = file, { !e | e E ( e=(u,v) e=(v,u) ) }∃ ∈ ∧ ∨!e | e E ( e=(u,v) e=(v,u) ) }∃ ∈ ∧ ∨ 2) Type(u) = process, { !e | e E e=(u,v) }∃ ∈ ∧2) Type(u) = process, { !e | e E e=(u,v) }∃ ∈ ∧ 3) Type(u) = file, { !(e∃3) Type(u) = file, { !(e∃ 11,e,e22) | ( x V, v≠x) ( e∃ ∈ ∧) | ( x V, v≠x) ( e∃ ∈ ∧ 11=(u,v) E, e∈=(u,v) E, e∈ 22=(x,u) E ) }∈=(x,u) E ) }∈ Packability AppliedPackability Applied Similarity AppliedSimilarity Applied
  15. 15. 15IEEE eScience 2017 Summarization: AnnotateSummarization: Annotate  Higher precedence to process nodesHigher precedence to process nodes  File with n > 1 edges → n annotationsFile with n > 1 edges → n annotations Annotation AppliedAnnotation AppliedPackability AppliedPackability Applied
  16. 16. 16IEEE eScience 2017 Package / Repeat PerformancePackage / Repeat Performance  Added ptrace system callsAdded ptrace system calls  I/O-intensive apps: VICI/O-intensive apps: VIC  Non-I/O-intensive apps: FIENon-I/O-intensive apps: FIE Package/Repeat RuntimesPackage/Repeat Runtimes 1) Run app normally1) Run app normally 2) Run with2) Run with packagepackage 3) Run with3) Run with repeatrepeat
  17. 17. 17IEEE eScience 2017 Versioning PerformanceVersioning Performance Commit/Reconstruct TimesCommit/Reconstruct Times Storage SizesStorage Sizes Package/Repeat RuntimesPackage/Repeat Runtimes 1) Size of several versions1) Size of several versions 2) Size after deduplication2) Size after deduplication 3) CDC / concatenation time3) CDC / concatenation time
  18. 18. 18IEEE eScience 2017 Results From Provenance SummarizationResults From Provenance Summarization Reduction Of EdgesReduction Of EdgesReduction Of File NodesReduction Of File Nodes Reduction Of Process NodesReduction Of Process Nodes 1) Full FIE graph1) Full FIE graph 2) All techniques applied2) All techniques applied 3) Dynamic expansion3) Dynamic expansion
  19. 19. 19IEEE eScience 2017 Conclusion And Current WorkConclusion And Current Work  Graph summarization testingGraph summarization testing  Database applicationsDatabase applications  Exact partial repeatabilityExact partial repeatability  Apps with network-operationsApps with network-operations  Parallel HPC applicationsParallel HPC applications  Emerging reusable object formatsEmerging reusable object formats sciunitsciunit is a portable, self-contained, and inherentlyis a portable, self-contained, and inherently understandable versioned unit of computation.understandable versioned unit of computation.
  20. 20. 20IEEE eScience 2017 Links And AcknowledgementsLinks And Acknowledgements sciunitsciunit::  https://sciunit.runhttps://sciunit.run sciunitsciunit paper:paper:  https://arxiv.orghttps://arxiv.org  Search for “Search for “sciunit”sciunit” National Science Foundation grants ICER-1639759,National Science Foundation grants ICER-1639759, ICER-1661918, ICER-1440327, ICER-1343816ICER-1661918, ICER-1440327, ICER-1343816

×