07/02/13




                 EMC Summer School on
                 BIG DATA – NCE/UFRJ

                        Big Data in Astronomy
                        The LIneA-DEXL case


                             Fabio Porto (fporto@lncc.br)
                             LNCC – MCTI
                             DEXL Lab (dexl.lncc.br)




    Outline


      l    Introduction
      l    Big Data in Science
      l    Hypothesis Driven-Research
      l    Data management
            –    Data partitioning
            –    Parallel workflow processing
      l    Final remarks

2   EMC Summer School 2013








    Laboratório Nacional de
    Computação Científica (LNCC)




                      Petropolis, Rio de Janeiro




    LNCC - MCTI
      l    Graduate Course in Computational Modelling
             –    CAPES 6
      l    BioInformatics Laboratory
             –    Roche 454 high throughput sequencing
      l    Coordinator of INCT –MACC
             –    Medicine Supported by Computational Science
      l    Coordinator of SINAPAD
             –    HPC National System
      l    Thematic laboratories
             –    ACIMA
             –    MARTIN
             –    DEXEL
             –    COMCIDIS
             –    HEMOLAB
             –    LABINFO








             SINAPAD – National System for
             High-Performance Computing
•    Organized in
     CENAPADS:
      •    Universities
      •    Research Centers
      •    Different
           Architectures:
             •    Shared Disks
             •    Shared Memory
             •    GPUs








      sinapad.lncc.br
                        CENAPADS




       6,840 CPU cores + 8,192 GPU cores

       ~106.6 TFLOPS / ~17.3 TB RAM / ~2.3 PB storage




    The DEXL Lab Mission

      l    To support in-silico science with Big Data
            management techniques;
             –    To develop interdisciplinary research with
                  contributions on data modelling, design and
                  management;
             –    To develop tools and systems in support to in-
                  silico science data management;




7   EMC Summer School 2013




    e-Astronomy
      l    LNCC is a member of the LIneA Lab:
            –     Laboratório Inter-institucional de Astronomia
                   l    O.N., LNCC, CBPF, RNP
                   l    Development of e-Astronomy infrastructure in support for astronomy surveys
                   l    Official south hemisphere DES node
      l    Large astronomy surveys:
            –     Sloan Digital sky Survey
                   l    Currently SDSS-3
            –     Dark Energy Survey
                   l    DES – Brazil managed by LIneA laboratory
                   l    5.000 square degrees of the sky
            –     Large Synoptic Sky Telescope
                   l    20.000 square degrees of the sky
                   l    Each patch visited 1000 times during 10 years
      l    One of the scientific domains with extreme data processing and
            storage needs
      l    Big Data today !!!!






LSST – Large Synoptic Survey Telescope




                              •  800 images per night
                                 for 10 years
                              •  3D map of the Universe
                              •  30 terabytes per night
                              •  100 petabytes in 10 years
                                  •  10^5 disks of 1 TB
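
The storage figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes 365 observing nights per year and decimal units (1 PB = 1000 TB); both are simplifying assumptions introduced here, not values from the slide.

```python
# Back-of-the-envelope check of the LSST figures quoted on the slide.
TB_PER_NIGHT = 30        # from the slide
YEARS = 10               # from the slide
NIGHTS_PER_YEAR = 365    # assumption: every night is an observing night
TB_PER_PB = 1000         # assumption: decimal units


def total_petabytes(tb_per_night: float, years: int) -> float:
    """Volume in PB accumulated over the whole survey."""
    return tb_per_night * NIGHTS_PER_YEAR * years / TB_PER_PB


def disks_needed(petabytes: float, disk_tb: float = 1.0) -> int:
    """Number of disks of `disk_tb` TB needed to hold `petabytes` PB."""
    return round(petabytes * TB_PER_PB / disk_tb)


print(total_petabytes(TB_PER_NIGHT, YEARS))  # 109.5
print(disks_needed(100))                     # 100000
```

Under these assumptions the nightly rate gives an upper bound of about 110 PB, the same order of magnitude as the 100 PB quoted, and 100 PB corresponds to 100,000 (10^5) disks of 1 TB.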





     Sloan Portal








     SkyServer – Sloan Project




     Dark Energy Survey
     l    Dark Energy Survey
           –    Astronomic project to explain:
                 l  Acceleration of the universe
                 l  Nature of dark energy

           –    Data production
                 l  DECam takes images of 1GB (400/night)
                 l  Images are analyzed;
                        –    galaxies and stars identified and catalogued
                 l    Catalogs are stored in database systems
                        –    Estimates of 1 billion of rows and 1 thousand attributes

     l    LIneA is the official Brazilian contributor for the DES
           collaboration
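
To give a feel for the scale of these data products, here is a rough sizing sketch. The image and catalog figures come from the slide; the 8 bytes per attribute value is an assumption made only for illustration, since the slide does not state a storage format.

```python
# Rough sizing of the DES data products described on the slide.
IMAGES_PER_NIGHT = 400          # from the slide
GB_PER_IMAGE = 1                # from the slide
CATALOG_ROWS = 1_000_000_000    # from the slide (estimate)
CATALOG_ATTRIBUTES = 1_000      # from the slide (estimate)
BYTES_PER_VALUE = 8             # assumption: each attribute is a 64-bit value


def nightly_image_volume_gb() -> int:
    """Raw image volume produced by DECam in one night, in GB."""
    return IMAGES_PER_NIGHT * GB_PER_IMAGE


def catalog_size_tb() -> float:
    """Uncompressed catalog size in decimal terabytes."""
    total_bytes = CATALOG_ROWS * CATALOG_ATTRIBUTES * BYTES_PER_VALUE
    return total_bytes / 1e12


print(nightly_image_volume_gb())  # 400
print(catalog_size_tb())          # 8.0
```

Even before indexes or replication, a catalog of this shape would be on the order of terabytes, which is why the catalogs live in database systems rather than flat files.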




                                  DES Science Pipelines

      [Figure: DES science pipelines. Labels recovered from the diagram:
       Global and local tests; Cluster industrialization; Point source catalog;
       Masks, random catalogs; Test environment & CTIO; Un-supervised process;
       Addstar (MW, GC), Addqso; Findsat, Sparse, fitmodel; Stellar mass, LF, HOD fit;
       Classifier, photo-z; Identification, characterization; Cosmological parameters]








    BIG DATA in Science

    l     Scientific process is being remodelled to be developed
           within an in-silico environment
    l     Powerful instruments:
            –    Digital telescopes
            –    DNA sequencers
            –    Mass spectrometers
    l     Huge simulations
            –    Weak lensing simulations
            –    Cardio-vascular system simulation
    l     Massive amounts of information streams in and out…
    l     Hypothesis-driven research supported by in-silico
           infrastructure, methods, models…


15 EMC Summer School 2013




    Big Data needs for e-science


          l    Data archival infra-structure;
          l    Scientific life cycle metadata management;
          l    Distributed big data management;
                 –    Parallel workflow processing;
                 –    Parallel Analytical algorithms;








         “Scientists are spending most of their time
        manipulating, organizing, finding and moving
        data, instead of researching. And it’s going to
                           get worse”
             –    The Office of Science Data-Management Challenge,
                  report, DoE, 2004








    Scientific Experiment Life-cycle



                            Experiment
                               Data




                                [Mattoso et al. 2010]







      MODELLING -
      HYPOTHESIS-DRIVEN
      RESEARCH




    e-Science life cycle




   Phenomenon → Hypothesis Formulation → Modeling → Experiment Life-cycle → Publication




    Big Data Scenario in Scientific
    exploration life-cycle
      [Figure: scientific exploration life-cycle, adapted from [Mattoso et al. 2010].
       Stages: Hypothesis, experiment goals; Experiment, workflow design;
       Workflow preparation; Workflow execution; Monitoring; Post-execution analysis.
       Stores: Workflow repository; Data sources; Hypotheses database;
       Provenance store; Analysis results]




    Motivation


       l    As experiments produce more and more
             data, extracting meaning out of these data
             requires, among other things, contextualizing
             the data
       l    Metadata about the research allows for
             results sharing, fostering collaborative work
       l    Sharing knowledge about the scientific
             reasoning
23 EMC Summer School 2013




    Hypotheses in Astronomy - DES

       l    Phenomenon:
             –    Universe is speeding-up
                   l    Discovered by scientists in 1998 studying distant supernovae
                   l    Supported by observations of redshift on long distance supernovae
                         light
       l    Hypothesis
             –    A new odd behaviour named “Dark Energy” could make up
                  70% of the universe
             –    The universe is not homogeneous - it has regions with different
                  densities (our location is special….)
       l    Supporting evidences
             –    Weak gravitational lensing
             –    Galaxy clusters in different redshifts







    Hypothesis in Big Data Analytics

      l     Scientific exploration is hypothesis-driven
             –     Nevertheless, hypothesis remain out of reach of
                   in-silico exploration (big data analyses ??)
      l     Big Data Analyses is explorative in nature
             –     Understanding what one is doing when exploring
                   Big Data requires scientific hypothesis-driven
                   approach
      l     Corollary
             –     BIG Data needs hypothesis management

25 EMC Summer School 2013




    Context

       l    Scientists trying to understand some
             phenomenon
              –    Formulate Hypothesis about Phenomenon behaviour
       l    Natural Phenomena
              –    Simulated by computational models
              –    Explained by Scientific hypothesis
       l    Time-Space varying
              –    Space represented by physical meshes
                    l    1D, 3D,…
              –    Time reflected on simulation ticks






    Scientific Hypothesis
    Human Cardio-vascular System








    Elements of hypothesis-driven
    research
      l    Scientific Phenomenon – an observable event
            –    occurs in space-time;
            –    characterized by observable quantities;
      l    Scientific Hypothesis – a falsifiable statement
            proposed to explain a phenomenon [Popper 2012]
            –    We are interested in a conceptual representation that puts
                 forward the idea the hypothesis carries on
      l    Mathematical Model – a language specific
            formalization of a scientific hypothesis
      l    Experiment – the set of computational artifacts put
            together to validate a scientific hypothesis;
      l    Data – observed or experimental data use in
            validating hypotheses;
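
The five elements listed above can be sketched as plain data structures. This is only an illustrative reading of the slide, not the author's schema: every class, field and method name below is hypothetical, and the example values reuse the DES hypothesis from an earlier slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Phenomenon:
    """An observable event, characterized by observable quantities."""
    name: str
    observables: List[str] = field(default_factory=list)


@dataclass
class Hypothesis:
    """A falsifiable statement proposed to explain a phenomenon."""
    statement: str
    explains: Phenomenon


@dataclass
class MathematicalModel:
    """A language-specific formalization of a hypothesis."""
    hypothesis: Hypothesis
    language: str   # name of the formal language (illustrative)
    formula: str    # placeholder text, not a real formula


@dataclass
class Experiment:
    """Computational artifacts and data assembled to validate a hypothesis."""
    model: MathematicalModel
    artifacts: List[str] = field(default_factory=list)
    datasets: List[str] = field(default_factory=list)

    def validates(self) -> Hypothesis:
        """The hypothesis this experiment is meant to validate."""
        return self.model.hypothesis


expansion = Phenomenon("Accelerating universe", ["supernova redshift"])
dark_energy = Hypothesis("Dark energy makes up 70% of the universe", expansion)
model = MathematicalModel(dark_energy, language="placeholder", formula="<formula>")
experiment = Experiment(model, artifacts=["workflow"], datasets=["catalog"])

print(experiment.validates().explains.name)  # Accelerating universe
```

Keeping the links explicit (experiment to model, model to hypothesis, hypothesis to phenomenon) is what lets data produced by an experiment be traced back to the idea it was meant to test.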





          Hypothesis modelling initiatives
                     l     Robot Scientist
                                   –      [R.D.King et al] The automation of science, Science, 2009.
                     l     HyQueu and HyBrow
                                   –      [A. Callahan, M. Dumontier, and N. H. Shah]. HyQue:
                                          Evaluating hypotheses using semantic web technologies.
                                          Journal of Biomedical Semantics, 2(Suppl 2):S3, 2011.
                                   –      Modeling hypothesis as propositions in part of the domain
                                          language
                                               l        Bioinformatics
                     l        SWAN
                                   –      Y. Gao et al. Journal of Web Semantics, 2006
                     l     J. Sowa, Process Ontology



29 EMC Summer School 2013



    Sc Hypothesis Conceptual Model

    [Figure: entity-relationship diagram of the scientific hypothesis conceptual model.
     Entities recovered from the diagram: Phenomenon, Physical Quantities, SC Hypothesis,
     Scientist, Space-Time Dimension, Domain ontology URL, Ph_Process, Continuous Ph_Process,
     Discrete Ph_Process, Discrete Phenomenon Simulation, Formal Language, Formal Representation,
     Mathematical Model, Mathematical equation, Formulae XML, variable, constant, function,
     State, Event, Mesh, Computational Model View, Mesh Data view,
     Data View (query over Data view), Observation Element, Simulated Element.
     Relationships recovered: isTheBlendOf, explains, isBasedOn, represented_as, represents,
     isAuthor, formulatedby, Refers-to, modeled by, Modeled_as, transforms, Represented with,
     Compared_with. Cardinalities cannot be reliably recovered from the extracted text.]

[Porto et al. ER 2008, ER 2012]




    Modelling Hypotheses and their
    interconnections

       [Figure: lattice of hypotheses, bounded above and below by the elements marked Τ.
        Nodes: Weak lensing, Galaxy clustering, Earth special location,
        Dark Energy, Non-uniform universe]

       A lattice-theoretic representation of how hypotheses
       interconnect
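
As a rough illustration of what a lattice over hypotheses can look like, the sketch below builds the simplest possible case: a powerset lattice ordered by set inclusion. This is an assumption made for illustration only; the slide does not say how the hypotheses are ordered, and the atom names are merely borrowed from the figure labels.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set

Hypothesis = FrozenSet[str]

# Atomic hypotheses. Treating them as independent atoms is an assumption
# of this sketch, not something stated on the slide.
ATOMS = ("dark_energy", "non_uniform_universe")


def powerset(atoms: Iterable[str]) -> Set[Hypothesis]:
    """All subsets of `atoms`; ordered by inclusion they form a lattice."""
    items = tuple(atoms)
    return {frozenset(c)
            for size in range(len(items) + 1)
            for c in combinations(items, size)}


def leq(a: Hypothesis, b: Hypothesis) -> bool:
    """Partial order: a is below b when a is contained in b."""
    return a <= b


def join(a: Hypothesis, b: Hypothesis) -> Hypothesis:
    """Least upper bound: the smallest element containing both a and b."""
    return a | b


def meet(a: Hypothesis, b: Hypothesis) -> Hypothesis:
    """Greatest lower bound: the largest element contained in both a and b."""
    return a & b


LATTICE = powerset(ATOMS)
BOTTOM: Hypothesis = frozenset()
TOP: Hypothesis = frozenset(ATOMS)

DARK_ENERGY = frozenset({"dark_energy"})
NON_UNIFORM = frozenset({"non_uniform_universe"})

print(join(DARK_ENERGY, NON_UNIFORM) == TOP)     # True
print(meet(DARK_ENERGY, NON_UNIFORM) == BOTTOM)  # True
```

In this toy version, combining two hypotheses is simply their join, and every pair of elements has both a join and a meet inside the lattice, which is the property that makes the structure usable for organizing how hypotheses build on one another.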




    Focus on Hypothesis modeling


       l    Scientific Hypothesis formulation as a
             conceptual entity
       l    Structuring of research evolution
       l    Isomorphic representation of: hypothesis,
             scientific model and phenomenon
       l    Structure amenable for data representation,
             association, querying and publishing





    Hypotheses Structuring: Lattice








    The core entities of the
    hypothesis conceptual model








    Representation Isomorphism








    Application: Linked Science


       l    An initiative to have a machine-readable
             content describing the scientific exploration;
       l    Support reproducibility of experiments;
       l    To foster reusing previous results;
       l    The community needs a more “open”
             science”







    Linked Science
    (or Linked Open Science)


       l    Is an initiative to interconnect all scientific
             assets;
       l    It is a combination of:
             –    Linked data, semantic web
             –    Open source;
             –    Scientific workflows and provenance (OPM);
             –    Scientific models;
             –    Cloud computing;
             –    …
37 EMC Summer School 2013




    Linked Science Core Vocabulary
    (LSC)


       l    Defines a vocabulary (LSC) with “basic”
             terms for science;
             –    More specific terminology shall be added by
                  individual communities (minimal ontological
                  commitment)








    LSC Core Vocabulary








    Extension to LSC








    Published Research as Linked Data (1) [3]

    rdfs:Class          rdf:Resource   Property         rdf:Literal
    lsc:Researcher      authors1       rdf:value        “P.J. Blanco, M.R. Pivello, S.A. Urquiza, and R.A. Feijóo.”
    lsc:Research        research1      dc:description   “Simulation of hemodynamic conditions in the carotid artery.”
    lsc:Publication     pub1           dc:title         “On the potentialities of 3D–1D coupled models in hemodynamics simulations.”
    lsc:Data            dataset1       dc:description   “Flow rate of 5.0 l/min as an inflow boundary condition at the aortic root, in observation of Avolio (1980) and others.”
    lsc:Data            dataset2       dc:description   “1D mechanical and geometric data from Avolio (1980).”
    lsc:Data            dataset3       dc:description   “MRI images processed for reconstructing the 3D geometry of both the left femoral and the carotid arteries.”
    Phenomenon          p17            dc:description   “Blood flow in the carotid artery.”
    tisc:Region         region1        dc:description   “The carotid artery, a part of the human CVS.”
    owl:IntervalEvent   beat1          dc:description   “A heart beat with period T = 0.8 s.”
    Observable          ob1            dc:description   “Blood flow rate.”
    Observable          ob2            dc:description   “Blood pressure.”
    lsc:Hypothesis      h17            rdfs:label       “blend(h13, h15, h16)”
    Model               m17            dc:description   “3D-1D coupled model with lumped windkessel terminals.”

    [3] Blanco et al.’s published research as an LSC instantiation.
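
Each row of the table above is essentially a pair of triples: one typing the resource and one attaching a literal to it. The sketch below shows that reading with plain Python tuples rather than a real RDF library. The `rdf:type` triples, the `match` helper and the `resources_of_class` helper are assumptions introduced here for illustration; only the resource names, properties and literals come from the slide.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# A subset of the rows from the slide, written as triples.
TRIPLES: List[Triple] = [
    ("research1", "rdf:type", "lsc:Research"),
    ("research1", "dc:description",
     "Simulation of hemodynamic conditions in the carotid artery."),
    ("dataset1", "rdf:type", "lsc:Data"),
    ("dataset2", "rdf:type", "lsc:Data"),
    ("dataset2", "dc:description",
     "1D mechanical and geometric data from Avolio (1980)."),
    ("dataset3", "rdf:type", "lsc:Data"),
    ("h17", "rdf:type", "lsc:Hypothesis"),
    ("h17", "rdfs:label", "blend(h13, h15, h16)"),
]


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> Iterator[Triple]:
    """Yield the triples that agree with every non-None pattern component."""
    for s, p, o in triples:
        if subject is not None and s != subject:
            continue
        if predicate is not None and p != predicate:
            continue
        if obj is not None and o != obj:
            continue
        yield (s, p, o)


def resources_of_class(triples: List[Triple], cls: str) -> List[str]:
    """Names of all resources typed with the given class."""
    return [s for s, _, _ in match(triples, predicate="rdf:type", obj=cls)]


print(resources_of_class(TRIPLES, "lsc:Data"))
# ['dataset1', 'dataset2', 'dataset3']
print(next(match(TRIPLES, subject="h17", predicate="rdfs:label"))[2])
# blend(h13, h15, h16)
```

With the research described this way, questions such as "which datasets belong to this study?" or "what is hypothesis h17 a blend of?" become simple pattern matches over the triples.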







    Published Research as Linked Data (2) [4]

    rdfs:Class       rdf:Resource   property         rdf:Literal
    lsc:Data         dataset4       dc:description   “Plots of hemodynamic observables in the left femoral artery produced to validate the hypothesis.”
    lsc:Data         dataset5       dc:description   “Plots of hemodynamic observables in the carotid artery.”
    lsc:Data         dataset6       dc:description   “Scientific visualization of hemodynamic observables in the left femoral artery produced to validate the hypothesis.”
    lsc:Data         dataset7       dc:description   “Scientific visualization of hemodynamic observables in the carotid artery both with and without aneurysm.”
    lsc:Prediction   predict1       rdf:value        “Sensitivity of local blood flow in the carotid artery to the heart aortic inflow condition.”
    lsc:Prediction   predict2       rdf:value        “Sensitivity of the cardiac pulse to the presence of an aneurysm in the carotid.”
    lsc:Conclusion   conclusion1    rdf:value        “3D-1D coupled models allow to perform quantitative and qualitative studies about how local and global phenomena are related, which is relevant in hemodynamics.”




    [4] Blanco et al.’s published research as an LSC instantiation.




    Find in Blanco et al.'s microtheory a hypothesis (if any)
    explaining phenomena of blood flow in microvascular
    vessels and show which model formulates it.

      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      PREFIX dc: <http://purl.org/dc/elements/1.1/>
      PREFIX lsc: <http://linkedscience.org/lsc/ns#>
      SELECT ?hypothesis_name ?model_name
      WHERE {
        # types of the three resources involved
        ?h a lsc:Hypothesis .
        ?p a lsc:Phenomenon .
        ?m a lsc:Model .
        # the hypothesis explains the phenomenon; the model formulates the hypothesis
        ?h lsc:explains ?p .
        ?m lsc:formulates ?h .
        # names to return
        ?h rdfs:label ?hypothesis_name .
        ?m rdfs:label ?model_name .
        # keep only phenomena described as blood flow in microvascular vessels
        ?p dc:description ?d .
        FILTER regex(?d, "blood flow", "i")
        FILTER regex(?d, "microvascular", "i")
      }
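
    As a rough illustration of what the query asks for, the sketch below
    evaluates the same graph pattern over a tiny in-memory list of triples in
    plain Python. The triples (h1, m1, p1, p2 and their labels) are invented toy
    data, not part of Blanco et al.'s microtheory, and the helper names are
    hypothetical; a real deployment would send the SPARQL text to a triple store.

```python
import re

# Toy triples (subject, predicate, object); every identifier and literal is hypothetical.
TRIPLES = [
    ("h1", "rdf:type", "lsc:Hypothesis"),
    ("h1", "rdfs:label", "hypothesis-A"),
    ("m1", "rdf:type", "lsc:Model"),
    ("m1", "rdfs:label", "model-A"),
    ("p1", "rdf:type", "lsc:Phenomenon"),
    ("p1", "dc:description", "Blood flow in microvascular vessels."),
    ("p2", "rdf:type", "lsc:Phenomenon"),
    ("p2", "dc:description", "Blood flow in the carotid artery."),
    ("h1", "lsc:explains", "p1"),
    ("m1", "lsc:formulates", "h1"),
]


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is a triple."""
    return [o for (s, p, o) in TRIPLES if s == subject and p == predicate]


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is a triple."""
    return [s for (s, p, o) in TRIPLES if p == predicate and o == obj]


def find_hypotheses_and_models():
    """Return (hypothesis label, model label) pairs matching the graph pattern."""
    results = []
    for p in subjects("rdf:type", "lsc:Phenomenon"):
        # Both FILTERs must hold on the same description, case-insensitively.
        if not any(re.search("blood flow", d, re.I) and re.search("microvascular", d, re.I)
                   for d in objects(p, "dc:description")):
            continue
        for h in subjects("lsc:explains", p):
            if "lsc:Hypothesis" not in objects(h, "rdf:type"):
                continue
            for m in subjects("lsc:formulates", h):
                if "lsc:Model" not in objects(m, "rdf:type"):
                    continue
                for h_name in objects(h, "rdfs:label"):
                    for m_name in objects(m, "rdfs:label"):
                        results.append((h_name, m_name))
    return results
```

    On the toy data only p1 passes both filters, so the carotid phenomenon p2
    contributes nothing to the answer.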




    Remarks
      l    Hypothesis modeling reflects the scientist mental model
            during data analyses;
            –    supports hypothesis-driven data exploration
            –    extends current eScience infrastructure;
      l    Scientific Hypothesis, Models and Phenomenon are the
            main primitives;
      l    The primitives maybe represented as isomorphic lattices
            with semantic association among themselves;
      l    One can search, discovery, mine hypotheses and related
            scientific artefacts;
      l    ER 2012- MODIC Workshop
      l    ISWC 2012– Linked Science workshop








      DATA MANAGEMENT






   Dark Energy Survey
    l    Dark Energy Survey
          –    Astronomic project to explain:
                l  Acceleration   of the universe
                l  Nature   of dark energy
          –    Data production
                l  DECam     takes images of 1GB (400/night)
                l  Images
                         are analyzed; galaxies and starts are identified
                  and catalogued
                l  Catalogs   are stored in database systems







   Dark Energy Survey Project
   l    Main technical (CS) issue:
         –    Managing huge catalogs
         –    Relations loaded from std FITS files
   l    Database features
         –    Single relation for each catalog
         –    Volume: 1 billion tuples x 1000 attributes (300GB)
         –    Queries
               l  Users submit ad-hoc queries to the database
               l  Usually too many results for each query
                    –  Need to choose best results, e.g. using top-k techniques
               l  Some      queries scan the whole database
                     –    Looking for clusters of stars




    Processing Astronomy data


     User access                                          Scientific workflows
      - Ad-hoc queries                                       - Analysis
      - downloads



                                          Astronomy
                                           catalogs







    Ad-Hoc Queries

       l    Submitted by users through portal;
       l    For small size queries (Regions of the sky)
             –    Indexing based on ra, dec (e.g. Q3C)
                   l    [Koposov, S.,Bartunov, O., 2006] Q3C Quad Tree Cube,
                         Astronomical Data Analysis Software and Systems, 2006
                   l    HTM, Hierarchical Triangular Mesh, MSSQlServer, Sloan
             –    Spatial function (eg. Radial search)
             –    Other criteria need more fine grained criteria
       l    For large size queries (whole sky)
             –    Explore parallelism over partitioned data
       l    Data partitioning is efficient for small and large
             queries
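
    The radial search mentioned above can be sketched in a few lines of Python.
    This is a minimal illustration under stated assumptions, not how Q3C or HTM
    work internally: objects are plain dictionaries with "ra" and "dec" in
    degrees, the function names are hypothetical, and the declination test is
    only a cheap pre-filter standing in for a real spatial index.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def radial_search(objects, ra0, dec0, radius_deg):
    """Return the objects lying within radius_deg of the position (ra0, dec0)."""
    hits = []
    for obj in objects:
        # The separation is never smaller than the declination difference,
        # so this test safely discards most far-away objects early.
        if abs(obj["dec"] - dec0) > radius_deg:
            continue
        if angular_separation_deg(obj["ra"], obj["dec"], ra0, dec0) <= radius_deg:
            hits.append(obj)
    return hits
```

    Because the haversine formula works on angles, positions on either side of
    ra = 0 are handled without special cases.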





   Astronomer’s coordinate system








    Workflow queries

       l    Workflows process data retrieved from the Catalog
             –    Two systems
                   l    Workflow engine
                   l    Database engine
             –    Lack of integration
                   l    upper bound on performance
             –    Large queries
                   l    Parallelism obtained by data partitioning is jeopardized by
                         consolidation of results operated by DBMS;
                   l    Workflow receives data and redistribute it to parallelize activities
             –    Concurrency among workflows
                   l    May impose huge penalties





    Need to partition data

       l    Beneficial for both access patterns
             –    Ad-hoc and workflow
       l    How to apply it?
       l    Vertical partitioning
             –    Already applied based on semantic clustering of attributes
                   l    Ra, dec
                   l    Photometry, spectrometry, astrometry
       l    Horizontal partitioning
             –    Ra, Dec (the current approach)
             –    Finer-grained criteria
                   l    Being developed in collaboration with INRIA Montpellier






    First Step: Hybrid Data Partitioning (HDP)

    Std criterion: range of ra, dec

    Criterion 1                        ...      Criterion k

    Catalog     (Id, Ra, Dec)                   Catalog     (Id, Ra, Dec)
    Catalog_s   (Id, spectrometry)              Catalog_s   (Id, spectrometry)
    Catalog-ph  (Id, photometry)                Catalog-ph  (Id, photometry)
    Catalog_a   (Id, astrometry)                Catalog_a   (Id, astrometry)
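
    The hybrid scheme can be sketched as follows: each row is first routed to a
    horizontal partition by its (ra, dec) range and then split vertically into
    fragments that all keep the Id. The column names (spec_z, mag_g, ...), the
    90-degree range sizes and the function names are hypothetical choices made
    for this illustration only; ra is assumed to lie in [0, 360).

```python
# Attribute groups used for vertical fragmentation (hypothetical column names).
VERTICAL_GROUPS = {
    "Catalog": ["ra", "dec"],
    "Catalog_s": ["spec_z"],
    "Catalog_ph": ["mag_g", "mag_r"],
    "Catalog_a": ["pm_ra", "pm_dec"],
}


def horizontal_key(row, ra_step=90.0, dec_step=90.0):
    """Index of the (ra, dec) range that holds the row."""
    return (int(row["ra"] // ra_step), int((row["dec"] + 90.0) // dec_step))


def hybrid_partition(rows, ra_step=90.0, dec_step=90.0):
    """Map each (ra, dec) range to its vertical fragments; every fragment keeps the id."""
    partitions = {}
    for row in rows:
        key = horizontal_key(row, ra_step, dec_step)
        fragments = partitions.setdefault(key, {name: [] for name in VERTICAL_GROUPS})
        for name, columns in VERTICAL_GROUPS.items():
            fragment_row = {"id": row["id"]}
            fragment_row.update({c: row[c] for c in columns})
            fragments[name].append(fragment_row)
    return partitions
```

    A query that only needs photometry for one sky region then reads a single
    fragment of a single horizontal partition.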





      IMPLEMENTATION
      ALTERNATIVES





    Using PGPOOL-II

       l    Pgpool II
             –    Implemented on top of PostgreSQL 9.1
             –    Central node coordinates data/query distribution/
                  replication
             –    Requests distributed through nodes
             –    Parallel query Processing
                   l  data partitioning based on a table column       range
                       (e.g. id)
                   l  For short queries, may reduce the number of accessed
                       data
             –    Load Balance
                   l    Concurrent requests directed to different DB copies





    Parallelism & Load Balance

                              Parallel query
                                Pgpool II

              Replication                       Replication
               Pgpool II                         Pgpool II

        PostgreSQL   PostgreSQL           PostgreSQL   PostgreSQL





    Evaluation


       l    Strength
             –    Extends PostgreSQL
             –    Load balance queries from concurrent workflows
             –    Scales up to 128 DB nodes
       l    Weaknesses
             –    Lack of support to spatial functions
             –    Partitioning based on a single column
             –    Ingestion can’t use COPY





    Qserv - LSST


       l    Developed by the LSST DM (Data Management) team
       l    Astronomy data management
       l    Horizontal partitioning based on declination
             zones (nodes); the data on each node are
             distributed into chunks by right ascension (RA)
       l    Approx. 1000 partitions
       l    Native support for spatio-temporal functions
       l    Built on top of MySQL




    Evaluation


          l     Strong
                  –    Designed to support astronomy data surveys
                  –    Highly scalable: ~1000 nodes
                  –    First performance results are very promising
                  –    Alignment with the LSST project
          l     Weaknesses
                  –    Current culture based on PostgreSQL






   Context (3/3)
    l    Requirement
            –     Efficient data storage and processing
    l  Challenges
        –  Big size of the database
        –  High number of attributes
        –  Evolving workload
        –  Mostly Scan Processing
    l    Questions:
            a)  How to efficiently process queries over catalogs?
            b)  How to efficiently process scientific workflows over
                catalogs?





    Current activities at DEXL
      a)    Design data partitioning strategies
            –    Cooperation with INRIA Montpellier - Zenith group
            –    Partition the data into blocks
                  l    such that the number of block accesses made by queries is
                        minimized
                  l    Each block can be stored on a different machine

      b)    Efficient execution of scientific
            workflows over partitioned data






  a) Intuition
                 Q
                                  Queries and scientific workflows take a time
                                  proportional to the amount of data to be processed

                                  Queries and scientific workflows take time
                                  proportional to the size of their data partitioning

                                      Q’               Q’’               Q’’’








   Partitioning the DB into Blocks

                            R(a1,…,a9)     →     B1, B2, …, Bm

                            How to compute the best partitioning?








   Problem statement
    l    Given
          –    Single relation database R(a1,…,an), n ~1000
          –    Initial workload: set of k queries W0 = {q1,…,qk}
          –    m empty fixed size blocks
    l    Assumptions
          –    Accessing a block ≈ accessing all its tuples
          –    Periodically new tuples and queries arrive
          –    No privilege to a particular attribute
    l    Goal
          –    Minimize the total number of block accesses during the execution of queries by:
                l  Optimal partitioning of R’s data in blocks
                l  Optimal query execution
          –    Adapt to the arrival of new data and queries
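
    One way to write the goal above as an optimization problem is sketched
    below. The notation is an assumption made for illustration: q(R) is the set
    of tuples of R read by query q, c is the fixed block capacity, and a query is
    charged one unit per block it touches, following the assumption that
    accessing a block costs about as much as accessing all its tuples.

```latex
\min_{B_1,\dots,B_m}\; \sum_{q \in W_0} \bigl|\{\, i : B_i \cap q(R) \neq \emptyset \,\}\bigr|
\quad \text{subject to} \quad
\bigcup_{i=1}^{m} B_i = R,\qquad
B_i \cap B_j = \emptyset \ (i \neq j),\qquad
|B_i| \le c .
```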





    Overview of the solution
    l    Data partitioning : graph based algorithm
           –    Nodes: each data item (e.g. tuple) represent a node in the graph
           –    Edges: an edge between two data items if are accessed by a
                common query
           –    Edge weight : the number of queries that access both data items
           –    Goal: partition the graph into m equal size sub-graphs with minimum
                edge cut
                 l    Use a min-cut algorithm
    l    Block explanation
           –    Blocks are explained in terms of queries
                 l    Each block is assigned an explaining query Bi = vi(R)
    l    Query processing
           –    Queries are compared to explaining queries
                         –    Matching blocks are selected (we haven’t worked on that yet)
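
    The graph construction can be sketched in Python as below. The workload is
    assumed to be given as one set of accessed tuple ids per query, and all
    function names are hypothetical. The exhaustive balanced bisection is only a
    stand-in that works on toy inputs; it is not the partitioner used in the
    experiments, which rely on a dedicated graph or hypergraph partitioning tool.

```python
from collections import Counter
from itertools import combinations


def co_access_graph(workload):
    """Edge weights of the co-access graph.

    workload: list of sets of tuple ids, one set per query.
    Returns a Counter mapping (u, v) with u < v to the number of queries
    that access both u and v.
    """
    weights = Counter()
    for accessed in workload:
        for u, v in combinations(sorted(accessed), 2):
            weights[(u, v)] += 1
    return weights


def cut_weight(weights, block_of):
    """Total weight of the edges whose endpoints lie in different blocks."""
    return sum(w for (u, v), w in weights.items() if block_of[u] != block_of[v])


def best_bisection(nodes, weights):
    """Try every balanced 2-way split and keep the one with the smallest cut.

    Exponential in the number of nodes, so usable for toy examples only.
    """
    nodes = sorted(nodes)
    best = None
    for left in combinations(nodes, len(nodes) // 2):
        left_set = set(left)
        block_of = {n: (0 if n in left_set else 1) for n in nodes}
        cost = cut_weight(weights, block_of)
        if best is None or cost < best[0]:
            best = (cost, left_set, set(nodes) - left_set)
    return best
```

    For the workload [{1, 2}, {1, 2}, {3, 4}, {2, 3}] the cheapest split keeps
    rows 1 and 2 together and rows 3 and 4 together, cutting only the single
    query that touches both 2 and 3.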





    Partitioning strategy
          Schism: VLDB 2010

    [Figure sequence: a weighted graph is built over rows 1-7 of each vertical
     fragment of the Catalog and then cut into blocks]

       a)  We create a node for each row
       b)  For each query in the workload W = {q1,…,qn}, we increment the arc
           weight when two rows are accessed together
       c)  We execute a min-cut algorithm
       d)  Each partition is assigned a block: B1 = {1, 2, 4}, B2 = {3, 5, 6, 7}




    Partitioned data with queries

  Each block is associated with the queries that access
  some records of the block

      B1 = {1, 2, 4}          queries: {q3, q5, …, q13}
      B2 = {3, 5, 6, 7}       queries: {q1, q2, …, q14}

For a given query q the number of accessed blocks is minimized





  Adaptive Strategy (1/2)
  l    New tuple arrival: [DEXA 2012]
        –    Select the best block
              l    i.e. block to which the new tuple is more correlated
        –    Challenges:
              l  How to select the best block with minimum effort?
                   –  Initial approach : find it based on the correlation of queries
                      to blocks
                   –  Define optimal allocation
                   –  Compute actual allocation efficiency
                   –  Compute block affinity
              l  What if the best block is full?
                   –  Initial approach: split the block
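
    A minimal sketch of affinity-based allocation follows. The affinity measure
    used here (the number of queries that access the new tuple and already
    access the block) is an assumption chosen for illustration, not the
    definition from the DEXA 2012 paper, and the naive split of a full block by
    sorted id is likewise a placeholder. All names are hypothetical.

```python
def block_affinity(tuple_queries, block_queries):
    """Number of queries touching the new tuple that already access the block."""
    return len(tuple_queries & block_queries)


def allocate(tuple_id, tuple_queries, blocks, capacity):
    """Place a new tuple in the block with the highest affinity.

    blocks: dict mapping block name -> {"tuples": set, "queries": set}.
    If the chosen block is full, half of its tuples are moved to a new block
    first. Returns the name of the block that received the tuple.
    """
    best = max(blocks, key=lambda name: block_affinity(tuple_queries, blocks[name]["queries"]))
    if len(blocks[best]["tuples"]) >= capacity:
        members = sorted(blocks[best]["tuples"])
        moved = set(members[len(members) // 2:])
        # The new block conservatively inherits all queries of the old one.
        blocks[best + "'"] = {"tuples": moved, "queries": set(blocks[best]["queries"])}
        blocks[best]["tuples"] -= moved
    blocks[best]["tuples"].add(tuple_id)
    blocks[best]["queries"] |= tuple_queries
    return best
```

    Only the per-block query sets are consulted when a tuple arrives, so the
    co-access graph does not have to be rebuilt for each insertion.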





    Allocation based on
    affinity to blocks








    Elapsed time of incrementing the
    DB as the size increases

    [Plot: execution time (s), log scale from 0.1 to 1000, versus DB size (2M to 20M);
     series: static; DynPart, |D| = 500 k; DynPart, |D| = 1 M]

         Experiment:
         - Sloan DR8: 350 million tuples
         - Workload: 27000 synthetic queries
         - PaToH: hypergraph partitioner




      E-ASTRONOMY WORKFLOWS
      OVER PARTITIONED DATA





   Processing Scientific Workflows
       l    Analytical Workflows process a large part of Catalog data
             –    Catalogs are supported by few indexes, thus most queries
                  scan tens-to-hundreds of millions of tuples
       l    Parallelization comes as a rescue to reduce analyses
             elapsed-time, but
             –    Compromise between:
                   l    Data partitioning and degree of parallelization;
             –    Current solutions consider:
                   l    Centralized files to be distributed through nodes (MapReduce)
                           –    [Alagianins, SIGMOD, 2012] NoDB – reading raw files without data
                                ingestion;
                   l    Distributed databases (Qserv) to serve Workflow engines
                           –    [ Wang.D.L,2011], Qserv: A Distributed Shared-Nothing Database for the
                                LSST catalog;
                   l    Centralized databases to serve Workflow Engine (Orchestration LineA)
                   l    Partitioned database to serve distributed queries (HadoopDB)






    HadoopDB - a step in between
    [Abouzeid09]

       l    Offers parallelism and fault tolerance as Hadoop,
             with SQL queries pushed-down to postgreSQL
             DBMS;
       l    Pushed-down queries are implemented as Map-
             reduce functions;
       l    Data are partitioned through nodes.
             –     Partitioning information stored in the catalog
             –     Distributed through the N nodes







    HadoopDB architecture
                                   SQL query


                                     SMS Planner


                             MapReduce           Catalog
                             Framework




         Node 1                       Node 2                        Node n
                  Task Tracker            Task Tracker                  Task Tracker

        Database        DataNode      Database     DataNode         Database   DataNode







     Example
     SELECT year(SalesDate), sum(revenue)
     FROM Sales
     GROUP BY year(SalesDate)

   a) Table partitioned by year(SalesDate)
        Map:     SQL query above (pushed down to the database) -> FileSink Operator

   b) No partitioning by year(SalesDate)
        Map:     SQL query above (pushed down to the database) -> Reduce Sink Operator
        Reduce:  Group by Operator -> Sum Operator -> FileSink Operator
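
    The difference between the two plans can be illustrated with a small Python
    sketch. It is not HadoopDB code: each node is modelled as a list of
    (year, revenue) pairs, so year(SalesDate) is assumed to be already
    extracted, and the function names are hypothetical.

```python
from collections import defaultdict


def local_aggregate(rows):
    """What the pushed-down SQL computes on one node: revenue summed per year."""
    totals = defaultdict(float)
    for year, revenue in rows:
        totals[year] += revenue
    return dict(totals)


def plan_partitioned_by_year(nodes):
    """Case a): each year lives on a single node, so the map output is final."""
    result = {}
    for rows in nodes:
        result.update(local_aggregate(rows))
    return result


def plan_not_partitioned(nodes):
    """Case b): per-node partial sums must be merged by a reduce step."""
    merged = defaultdict(float)
    for rows in nodes:
        for year, partial in local_aggregate(rows).items():
            merged[year] += partial
    return dict(merged)
```

    Running the map-only plan on data that is not partitioned by year silently
    drops partial sums, which is why plan b) needs the extra reduce phase.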




    Processing Astronomy data


     User access                                    Scientific workflows
      - Ad-hoc queries                                 - Analysis
      - downloads



                                    Astronomy
                                     catalogs







    Traditional WF–Database
    decoupled architecture
                                            Workflow engine

                         act1      act2      act3


                                             Data is consolidated as
                                             input to the workflow engine
                        Database


                            DBp1     DBp2     DBp3






    Problems


       l    Data locality
             –    Workflow activities run in remote nodes wrt the
                  partitioned data;
       l    Load Balance
             –    Local processes facing different processing time








    Data locality

         l    Traditional distributed query processing pushes
               operations through joins and unions so that can
               be done close to the data partitions;
         l    Can we “localize” workflow activities?
               –    Moving activities in workflows require operation
                    semantics to be exposed
               –    Mapping of workflow activities to a known algebra
               –    Equivalence of algebra expressions enabling pushing
                    down operations
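
    The kind of equivalence that makes push-down possible can be shown with the
    simplest case, a filter over a union of partitions. The sketch below is a
    toy illustration with hypothetical function names; partitions are plain
    Python lists and the predicate is any function on a row.

```python
def union(partitions):
    """Consolidate all partitions into a single list of rows."""
    return [row for part in partitions for row in part]


def filter_after_union(partitions, predicate):
    """Decoupled plan: ship every row to the engine, then filter."""
    return [row for row in union(partitions) if predicate(row)]


def filter_pushed_down(partitions, predicate):
    """Localized plan: filter next to each partition, then ship only the survivors."""
    return union([[row for row in part if predicate(row)] for part in partitions])


def rows_shipped(partitions, predicate=None):
    """Rows leaving the partitions; with a predicate, only matching rows are counted."""
    return sum(1 for part in partitions for row in part
               if predicate is None or predicate(row))
```

    Both plans return the same rows, but the pushed-down plan moves only the
    rows that satisfy the predicate, which is the benefit sought when workflow
    activities are localized.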





  Algebraic transformation
  [Figure: four panels showing a workflow with Map and Filter activities over relations R, S, T and results Q, U, V: (i) workflow, relation perspective; (ii) decomposition; (iii) anticipation; (iv) procrastination]



88 EMC Summer School 2013








    Workflow optimization process
       1. Initial algebraic expressions
       2. Generation of search space, using transformation rules
       3. Equivalent algebraic expressions
       4. Evaluation of search strategy, using a cost model
       5. Search more? If yes, go back to step 2; if no, continue
       6. Optimized algebraic expressions

89 EMC Summer School 2013




    Pushing down workflow activities


       l    A first naïve attempt
             –    Push down all operations before a Reduce;
       l    Use a MapReduce implementation where
             –    Mappers execute the “pushed-down” operations
                  close to the data
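
A minimal single-process sketch of this naïve strategy follows. It assumes a Filter and a Map activity precede the Reduce; the attribute names (`ra`, `mag`) and the 10-degree binning are hypothetical. In a real deployment each call to `pushed_down_ops` would run on the node that holds the partition.

```python
from collections import Counter
from functools import reduce

def pushed_down_ops(partition):
    """Mapper: runs every activity that precedes the Reduce on one partition."""
    filtered = (obj for obj in partition if obj["mag"] < 24.0)  # Filter activity
    cells = (int(obj["ra"]) // 10 for obj in filtered)          # Map activity: 10-degree bin
    return Counter(cells)                                       # local partial result

def merge_counts(left, right):
    """Reduce: combine two partial results."""
    return left + right

# Hypothetical partitions of a catalog (illustrative values only).
partitions = [
    [{"ra": 12.3, "mag": 21.0}, {"ra": 15.9, "mag": 25.2}],
    [{"ra": 18.0, "mag": 23.5}, {"ra": 47.1, "mag": 22.0}],
]

partials = [pushed_down_ops(p) for p in partitions]  # one mapper per partition
density = reduce(merge_counts, partials, Counter())
```

Only the small per-partition counters cross the network; the raw catalog rows stay where they are stored.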




90 EMC Summer School 2013








    Typical Implementation at LIneA Portal
                                     Spatial partitioning
               Catalog DB




91 EMC Summer School 2013




    Parallel workflow over partitioned
    data
Partitioned catalogue stored on PostgreSQL

  DBp1                  SkyMap




  DBp2                      SkyMap             SkyAdd


    …

  DBpn                      SkyMap



92 EMC Summer School 2013








 HQOOP - Parallelizing
 Pushed-down Scientific Workflows
       l    Partition data across cluster nodes
             –    Partitioning criteria
                   l    Spatial (currently used and necessary for some applications)
                   l    Random (possible in SkyMap)
                   l    Based on query workload (Miguel Liroz-Gistau’s work)
       l    Process the workflow close to the data location
             –    Reduces data transfer
       l    Use the Apache Hadoop implementation to manage parallel
             execution
                   l    Widely used in Big Data processing;
                   l    Implements the MapReduce programming paradigm;
                   l    Fault tolerance for failed Map processes;
       l    Use QEF as workflow engine
             –    Implements the Mapper interface
             –    Runs workflows in Hadoop seamlessly;


93 EMC Summer School 2013
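
The first two partitioning criteria can be sketched as follows. This is an illustration under stated assumptions, not the HQOOP implementation: spatial partitioning is approximated by right-ascension stripes, and the function names and the `ra` attribute are hypothetical.

```python
import random

def spatial_partition(obj, n_partitions):
    """Assign by right-ascension stripe, so sky neighbours share a partition."""
    stripe_width = 360.0 / n_partitions
    # min() guards against ra == 360.0 falling outside the last stripe.
    return min(int(obj["ra"] // stripe_width), n_partitions - 1)

def random_partition(obj, n_partitions, rng):
    """Assign uniformly at random; balances load but ignores locality."""
    return rng.randrange(n_partitions)

rng = random.Random(0)
objects = [{"ra": 10.0}, {"ra": 95.0}, {"ra": 359.9}]
spatial = [spatial_partition(o, 4) for o in objects]
randomized = [random_partition(o, 4, rng) for o in objects]
```

Spatial assignment suits activities that need neighbouring objects together, while random assignment is enough for activities that treat each object independently.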




    Perspective
    [Figure: comparison of approaches. Labels: Qserv + Workflow; HQOOP; Wkfw Engine Parallelization; Orchestration layer, MapReduce; Query Distribution; Hadoop + Kepler; HadoopDB + Hive; Data distribution]




94 EMC Summer School 2013








    Integrated architecture
                                          Final
                                          Result


         Workflow engine   Workflow engine                       Workflow engine
      act   act     act act    act     act                      act    act    act
       1      2      3   1       2      3                        1      2      3




                         DB1                   DB2                     DB3

95 EMC Summer School 2013




    Experiment Set-up

       l    Cluster SGI
             –    Configurations: 1, 47 and 95 nodes;
             –    Each node:
                   l  2 proc. Intel Zeon – X5650, 6 cores, 2.67 GHz
                   l  24 GB RAM
                   l  500 GB HD

       l    Data
             –    Catalog DC6B
       l    Hadoop
             –    QEF workflow engine


96 EMC Summer School 2013








    Preliminary Results


       l    Preliminary results are encouraging:
             –    Baseline Orchestration layer (234 nodes) –
                  approx. 46 min
             –    1 node HQOOP – approx. 35 min
             –    4 nodes HQOOP – approx. 12.3 min
             –    95 nodes (94 workers) HQOOP – approx. 2.10
                  min
             –    95 nodes (94 workers) Hadoop+Python – approx.
                  2.4 min
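
The scale-up implied by these figures can be checked with a short calculation. It takes the 1-node HQOOP run as the reference and assumes that the 4-node run used 4 workers, which the slide does not state.

```python
# Elapsed times in minutes, taken from the list above.
reference_min = 35.0            # 1 node HQOOP
runs = {4: 12.3, 94: 2.10}      # workers -> elapsed minutes (4 workers is an assumption)

results = {}
for workers, minutes in runs.items():
    speedup = reference_min / minutes   # how many times faster than 1 node
    efficiency = speedup / workers      # fraction of linear scale-up achieved
    results[workers] = (speedup, efficiency)

for workers, (speedup, efficiency) in results.items():
    print(f"{workers} workers: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Under these assumptions the speedup grows with the number of workers while the fraction of linear scale-up drops, which is the usual pattern once fixed start-up and reduce costs dominate.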
97 EMC Summer School 2013




    Resulting Image




98 EMC Summer School 2013








    Conclusions
      l    Big data users (scientists) are in Big Trouble;
            –    Too much data, too fast, too complex;
      l    Different expertise required to cooperate
            towards Big Data Management;
      l    Adapted software development methods
            based on workflows;
      l    Complete support to scientific exploration
            life-cycle
      l    Efficient workflow execution on Big Data

99 EMC Summer School 2013




    Collaborators

      l    LNCC Researchers
            –    Ana Maria de C. Moura
            –    Bruno R. Schulze
            –    Antonio Tadeu Gomes
      l    PhD Students
            –    Bernardo N. Gonçalves
            –    Rocio Millagros
            –    Douglas Ericson de Oliveira
            –    Miguel Liroz-Gistau (INRIA)
            –    Vinicius Pires (UFC)
100 EMC Summer School 2013








   Collaborators
      l    ON
             –    Angelo Fausti
             –    Luiz Nicolaci da Costa
             –    Ricardo Ogando
      l    COPPE-UFRJ
             –    Marta Mattoso
             –    Jonas Dias (PhD student)
             –    Eduardo Ogasawara (CEFET-RJ)
      l    UFC
             –    Vania Vidal
             –    José Antonio F. de Macedo
      l    PUC-Rio
             –    Marco Antonio Casanova
      l    INRIA-Montpellier
             –    Patrick Valduriez group
      l    EPFL
             –    Stefano Spaccapietra




101 EMC Summer School 2013












      Overall performance
      [Figure: two bar charts over the configurations Baseline (234 nodes), 1 node HQOOP, 4 nodes HQOOP, 94 nodes HQOOP and 94 nodes Hadoop. Left chart: elapsed time (min) and linear scale-up. Right chart: elapsed time (min), linear scale-up and % linear scale-up.]

103 EMC Summer School 2013




      [Figure: two bar charts of Hadoop time and Reduce time for 47 and 94 nodes, with QEF and without QEF. First chart: CENT configuration (axis from 0 to 1,400,000). Second chart: DIST configuration (axis from 0 to 160,000).]

104 EMC Summer School 2013








   Execution with 4 nodes
                           Elapsed-time total: 11.27 min




105 EMC Summer School 2013








   Adaptive and Extensible Query Engine


      l    Extensible to data types
      l    Extensible to application algebra
      l    Extensible to execution model
      l    Extensible to heterogeneous data sources




107 EMC Summer School 2013




      Objective

       •  Offer a query processing framework that
       can be extended to adapt to the needs of
       data-centric applications;

       •  Offer transparency in using resources to
       answer queries;
             •    Query optimization introduced transparently
             •    Standardized remote communication using web services, even
                  when dealing with large amounts of unstructured data
             •    Run-time performance monitoring and decision making

108 EMC Summer School 2013








   Control Operators

        •  Add data-flow and transformation operators
        •  Isolate application-oriented operators from
        execution-model data-flow concerns
        •  Parallel grid-based execution model:

             •  Split/Merge - controls the routing of tuples to parallel
                nodes and the corresponding unification of multiple
                routes into a single flow
             •  Send/Receive - marshalling/unmarshalling of tuples
                and interface with communication mechanisms
             •  B2I/I2B - blocks and unblocks tuples
             •  Orbit - implements a loop in a data-flow
             •  Fold/Unfold - logical serialization of complex structures
                (e.g. PointList to Points)
109 EMC Summer School 2013
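
Two of these operator pairs can be sketched with plain Python generators. This is not QEF code: the function names are hypothetical, round-robin routing is only one possible Split policy, and reading I2B as "instances to block" and B2I as "block to instances" is an assumption.

```python
from itertools import islice

def i2b(tuples, block_size):
    """I2B (assumed: instances to block): group a tuple stream into blocks."""
    it = iter(tuples)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block

def b2i(blocks):
    """B2I (assumed: block to instances): unpack blocks into single tuples."""
    for block in blocks:
        yield from block

def split(tuples, n_routes):
    """Split: route tuples to n parallel routes (round-robin here)."""
    routes = [[] for _ in range(n_routes)]
    for i, t in enumerate(tuples):
        routes[i % n_routes].append(t)
    return routes

def merge(routes):
    """Merge: unify several routes into a single flow."""
    for route in routes:
        yield from route

blocks = list(i2b(range(5), 2))
routes = split(range(5), 2)
merged = list(merge(routes))
```

Blocking amortizes the per-message cost of Send/Receive, while Split and Merge keep the application operators unaware of how many parallel routes exist.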




   The Execution Model

             Example of simple QEF Workflow	





                                                                      Output
                                             Operator

                                                Possibly distributed over a
                                                Grid environment
                 Data sources
                    (Input)
                                            Integration unit (Tuple)
                                            containing data source units
110 EMC Summer School 2013








     Iteration Model

                     OPEN          OPEN          OPEN
                              C             B             A

        DataSource

                  GETNEXT         GETNEXT       GETNEXT
                              C             B             A

        DataSource

                    CLOSE          CLOSE         CLOSE
                              C             B             A

         DataSource                                           Results
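
The OPEN / GETNEXT / CLOSE protocol in the diagram can be sketched as a chain of pull-based operators. The class and method names below are hypothetical and only mirror the three calls shown; the operators C, B and A are stand-ins that apply one function per tuple.

```python
class DataSource:
    """Leaf of the plan: hands out stored rows one at a time."""
    def __init__(self, rows):
        self._rows = list(rows)
        self._pos = 0

    def open(self):
        self._pos = 0

    def get_next(self):
        if self._pos >= len(self._rows):
            return None                 # end of stream
        row = self._rows[self._pos]
        self._pos += 1
        return row

    def close(self):
        self._pos = 0


class Operator:
    """Applies fn to each tuple pulled from its child (its producer)."""
    def __init__(self, child, fn):
        self.child = child
        self.fn = fn

    def open(self):
        self.child.open()               # OPEN propagates down to the source

    def get_next(self):
        row = self.child.get_next()     # GETNEXT pulls one tuple on demand
        return None if row is None else self.fn(row)

    def close(self):
        self.child.close()              # CLOSE propagates down to the source


def run(plan):
    """Drive the plan from the top operator and collect the results."""
    plan.open()
    results = []
    row = plan.get_next()
    while row is not None:
        results.append(row)
        row = plan.get_next()
    plan.close()
    return results


source = DataSource([1, 2, 3])
c = Operator(source, lambda x: x + 1)
b = Operator(c, lambda x: x * 2)
a = Operator(b, str)
output = run(a)
```

Each call on A triggers the same call on B, C and finally the data source, so tuples flow one at a time without materializing intermediate results.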

111 EMC Summer School 2013




     Distribution and Parallelization
             Operator distribution	

             A Query Optimizer selects a set of operators in the QEP to
             execute over a Grid environment.	



                                                  B1



                                        C         B2               A

                 DataSource
                                                  B3
112 EMC Summer School 2013








             General Parallel Execution
             Model
          Remote QEP	

          In order to parallelize an execution, the initial QEP is
          modified and sent to remote nodes to handle the
          distributed execution.	





                      Initial                                                               Modified
                       plan                                                                 plan




                                    Legend: Control operator, Distributed operator, User’s operator
                                    R: Receiver, S: Sender, Sp: Split, M: Merge

113 EMC Summer School 2013




 Modifying IQEP to adapt to
 execution model (TCP)
 [Figure: IQEP extended with control operators, spanning a control node and remote node i. Labels: A, TJ, SJ, Velocity, Geometry, Particles, Send, Receive, B2I, I2B, Split, merge, Orbit. Legend: local dataflow, remote dataflow, logical operator, control operator.]
 Query optimizer adds control operators according to execution model and IQEP statistics

114 EMC Summer School 2013








     Grid node allocation algorithm
     (G2N)
                  Grid Greedy Node scheduling algorithm (G2N)

                  •  Offers maximum usage of scheduled resources
                  during query evaluation.
                  •  Basic idea: “an optimal parallel allocation strategy
                  for an independent query operator … is the one in
                  which the computed elapsed-time of its execution is
                  as close as possible to the maximum sequential time
                  in each node evaluating an instance of the operator”.

                  [Figure: operator A and operator instance Bn, annotated with times t1 and t2; t(Bn) is the operator cost on this node; t1 + t2 = t_x(Bn)]

115 EMC Summer School 2013




             Implementation

             •  Core development in Java 1.5.
             •  Globus Toolkit 4.
             •  Derby DBMS (catalog).
             •  Tomcat, AJAX and Google Web Toolkit for the user
             interface.
             •  Runs on Windows, Unix and Linux.
             •  Source code, demo and user guide available at:
             http://dexl.lncc.br
116 EMC Summer School 2013








     Summing-up

       l    HadoopDB extends Hadoop with expressive query
             language, supported by DBMSs
       l    Keeps Hadoop MapReduce framework
       l    Queries are mapped to MapReduce tasks
       l    For scientific applications is a question to be
             answered whether or not scientists will enjoy writing
             SQL queries
       l    Algebraic like languages may seem more natural
             (eg. Pig Latin)
11
7    EMC Summer School 2013




     Pig Latin - a high-level language
     alternative to SQL


       l    The use of high-level languages such as
             SQL may not please the scientific community;
       l    Pig Latin tries to give an answer by providing
             a procedural language whose primitives are
             relational algebra operations;
       l    “Pig Latin: A Not-So-Foreign Language for Data
             Processing”, Christopher Olston, Benjamin
             Reed et al., SIGMOD 2008;
118 EMC Summer School 2013








     Example
       l    urls (url, category, pagerank)
       l    In SQL
             –    SELECT category, AVG(pagerank)
                  FROM urls WHERE pagerank > 0.2
                  GROUP BY category
                  HAVING COUNT(*) > 10^6
       l    In Pig Latin
             –    good_urls = FILTER urls BY pagerank > 0.2;
             –    groups = GROUP good_urls BY category;
             –    big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
             –    output = FOREACH big_groups GENERATE
                         category, AVG(good_urls.pagerank);
119 EMC Summer School 2013
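
The same step-by-step dataflow can be mirrored in plain Python to show what each step produces. This is only an illustrative equivalent, not how Pig executes the program; the function name, its parameters and the sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def avg_pagerank_by_category(urls, min_pagerank=0.2, min_group_size=10**6):
    # Step 1 (FILTER): keep URLs above the pagerank threshold.
    good_urls = [u for u in urls if u["pagerank"] > min_pagerank]
    # Step 2 (GROUP): collect the surviving rows per category.
    groups = defaultdict(list)
    for u in good_urls:
        groups[u["category"]].append(u)
    # Step 3 (FILTER): keep only categories with enough rows.
    big_groups = {c: rows for c, rows in groups.items() if len(rows) > min_group_size}
    # Step 4 (FOREACH ... GENERATE): one output row per remaining category.
    return {c: mean(r["pagerank"] for r in rows) for c, rows in big_groups.items()}

# Tiny hypothetical input; the group-size threshold is lowered to 1 for the demo.
sample = [
    {"url": "a", "category": "news", "pagerank": 0.5},
    {"url": "b", "category": "news", "pagerank": 0.75},
    {"url": "c", "category": "blog", "pagerank": 0.9},
    {"url": "d", "category": "news", "pagerank": 0.1},
]
summary = avg_pagerank_by_category(sample, min_group_size=1)
```

Each intermediate name corresponds to one Pig Latin statement, which is the property the slide contrasts with the single declarative SQL statement.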




     Pig Latin


       l    Program is a sequence of steps
             –    Each step executes one data transformation
       l    Optimizations among steps can be
             dynamically generated, example:
             –    1) spam-urls= FILTER urls BY isSpam(url);
             –    2) Highrankurl = FILTER spam-url BY pagerank >
                  0.8;
                        1            2
12                  2                1
0 EMC Summer School 2013
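
Why swapping the two filters pays off can be shown with a small sketch. It assumes that `is_spam` stands in for an expensive user-defined function and counts how often it is called under each order; the data and names are hypothetical.

```python
calls = {"is_spam": 0}

def is_spam(url):
    """Hypothetical stand-in for an expensive user-defined predicate."""
    calls["is_spam"] += 1
    return "spam" in url

def plan_1_then_2(urls):
    spam_urls = [u for u in urls if is_spam(u["url"])]
    return [u for u in spam_urls if u["pagerank"] > 0.8]

def plan_2_then_1(urls):
    high_rank = [u for u in urls if u["pagerank"] > 0.8]
    return [u for u in high_rank if is_spam(u["url"])]

urls = [
    {"url": "spam.example/a", "pagerank": 0.9},
    {"url": "news.example/b", "pagerank": 0.95},
    {"url": "spam.example/c", "pagerank": 0.3},
    {"url": "blog.example/d", "pagerank": 0.1},
]

calls["is_spam"] = 0
result_12 = plan_1_then_2(urls)
calls_12 = calls["is_spam"]

calls["is_spam"] = 0
result_21 = plan_2_then_1(urls)
calls_21 = calls["is_spam"]
```

Both orders return the same rows because the filters are independent, but running the cheap pagerank test first means the expensive predicate sees fewer tuples.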








   Data Model


      l    Types:
            –    Atom - a single atomic value;
            –    Tuple - a sequence of fields, eg.(‘DB’,’Science’,7)
            –    Bag - a collection of tuples with possible
                 duplicates;
            –    Map - a collection of data items where for each
                 data item a key is associated
                              ‘fanOf’            ‘flamengo’
                                                                   ‘music’

12                                      ‘age’                 20
1 EMC Summer School 2013




   Operations


      l    Per tuple processing: Foreach
            –    Allows the specification of iterations over bags
                  l    Ex:
                          –    Expanded-queries=FOREACH queries generate userId,
                                 expandedQuery (queryString);
                          –    Each tuple in a bag should be independent of all others, so
                               parallelization is possible;
            –    Flatten
                  l    Permits flattening of nested-tuples

                          alice,     Ipod,nano          flatten       alice, ipod, nano
                                     Ipod, shuffle                    alice, ipod, shuffle
12
2 EMC Summer School 2013
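
The flattening step in the diagram can be sketched in a few lines of Python. The helper name `flatten` and the tuple layout (atomic fields followed by one nested bag) are assumptions made for illustration.

```python
def flatten(record):
    """Pair the leading atomic fields with every tuple of the trailing bag."""
    *atoms, bag = record
    return [tuple(atoms) + tuple(inner) for inner in bag]

nested = ("alice", [("ipod", "nano"), ("ipod", "shuffle")])
flat = flatten(nested)
```

One nested input tuple yields one output tuple per element of its bag, which removes the nesting level.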








   Olympic Laboratory




123 EMC Summer School 2013




   Olympic Laboratory

      l    Objective
            –    To study high performance sports as a science discipline
            –    To build the first sports laboratory in South America
      l    US$ 10M Project sponsored by FINEP(Funding
            Agency)
      l    Departments:
            –    Biochemistry, physiology, genetics, nutrition, computational
                 modeling, computer science, physiology


12
4 EMC Summer School 2013








   Our task


      l    To support athlete’s follow-up data
            –    Athlete’s training
            –    Variation on biochemical elements
            –    Variation on biometric variables
      l    More recently
            –    For some modalities, Integrate meteorological
                 conditions

12
5 EMC Summer School 2013




   Analyses Board




126 EMC Summer School 2013








   Athletes follow-up database

      l    Athletes follow-up data modeled as trajectories
            –    Register measurements from athletes in different training
                 states
      l    Trajectory model
            –    Ordered set of measurements
            –    Division of time in training states
            –    Materialized view limited in time-range
            –    Imprecise measurements
                  l    Not detected =0
                  l    < x -> ]0,x[
                  l    y,y≥x
12
7 EMC Summer School 2013
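
One way to store the three kinds of imprecise readings is as intervals. The sketch below is an assumption about a possible encoding, not the schema used in the project; the `Measurement` class, `parse_reading` and the textual input format are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Measurement:
    low: float
    high: float
    low_open: bool = False    # True when the lower bound is excluded
    high_open: bool = False   # True when the upper bound is excluded

def parse_reading(text):
    """Map a lab reading to an interval following the three cases above."""
    text = text.strip()
    if text.lower() == "not detected":
        return Measurement(0.0, 0.0)            # not detected: value = 0
    if text.startswith("<"):
        limit = float(text[1:])
        return Measurement(0.0, limit, True, True)  # "< x": open interval ]0, x[
    value = float(text)
    return Measurement(value, value)            # exact value y

below_limit = parse_reading("< 0.5")
exact = parse_reading("1.2")
absent = parse_reading("Not detected")
```

Keeping both bounds lets later queries compare measurements without pretending that a below-limit reading is an exact number.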




   More on Athlete’s Trajectories

      l    Stops – modelled as measurements
            –    Qualified according the athlete’s training state
            –    Training states (recovery, training, rest,…)
      l    Moves – extrapolation between two stops
      l    Trajectory – the set of measurements,
            ordered in time, and limited in time according
            to some criteria (eg. A training program).
            –    Measurements of the same observable element
            –    Measurements of the same athlete
12
8 EMC Summer School 2013








   Metaphoric Trajectory




129 EMC Summer School 2013




130 EMC Summer School 2013




131 EMC Summer School 2013




   Challenges


      l    Integrating athlete’s trajectory with weather
            information
      l    How to efficiently store metaphoric
            trajectories ?
            –    Trajstore [Cudre-Mauroux et al ICDE 2010]
            –    SciDB
      l    How to express and efficiently process
            similar trajectories
13
2 EMC Summer School 2013








                   Part I: Where are they
                      coming from ?




      l    “Scientists are spending most of their time
            manipulating, organizing, finding and moving
            data, instead of researching. And it’s going
            to get worse”
            –    The Office of Science Data-Management Challenge,
                 DoE



134 EMC Summer School 2013








   A petabyte seems like a lot, but
     LSST – Large Synoptic Survey Telescope




                                      •  800 images per night
                                         for 10 years!
                                      •  3D map of the Universe
                                      •  30 terabytes per night
                                      •  30 petabytes in 10 years
135 EMC Summer School 2013




   LSST




136 EMC Summer School 2013








   DNA Sequences Published
   in GenBank (UK NCBI)

                                     As of April 2012:
                                     •  1.5 x 10^7 sequences
                                         •  50% in 4 years
                                     •  1.3 x 10^11 base pairs
                                         •  30% in 4 years




137 EMC Summer School 2013




   Communities




         According to IDC, the amount of digital data
         available in our cyber-environment will exceed
         Avogadro’s number in 2023 (> 10^23): Yottabyte
138 EMC Summer School 2013








   In numbers:

      l    12 terabytes of tweets every day (IBM, 2012)
      l    10 terabytes on Facebook every day
      l    Some companies produce terabytes per
            hour, every day of the year
            –    Events:
                  l  A subway door opening
                  l  Checking in at the airport
                  l  Buying a song on iTunes



139 EMC Summer School 2013




   Scientific Communities




140 EMC Summer School 2013








    Government Data


      l    Investments
      l    Government programs
      l    Taxes
      l    Contracts, accountability reports
      l    Indices: economic, social, education,
            health, …
      l    Security and defense
141 EMC Summer School 2013




    Historical Data




142 EMC Summer School 2013




Using Photonics to Prototype the Research Campus Infrastructure of the Future...Using Photonics to Prototype the Research Campus Infrastructure of the Future...
Using Photonics to Prototype the Research Campus Infrastructure of the Future...
 
Coupling Australia’s Researchers to the Global Innovation Economy
Coupling Australia’s Researchers to the Global Innovation EconomyCoupling Australia’s Researchers to the Global Innovation Economy
Coupling Australia’s Researchers to the Global Innovation Economy
 
Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.
 
Coupling Australia’s Researchers to the Global Innovation Economy
Coupling Australia’s Researchers to the Global Innovation EconomyCoupling Australia’s Researchers to the Global Innovation Economy
Coupling Australia’s Researchers to the Global Innovation Economy
 
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
 
The Pacific Research Platform
 Two Years In
The Pacific Research Platform
 Two Years InThe Pacific Research Platform
 Two Years In
The Pacific Research Platform
 Two Years In
 
Science Engagement: A Non-Technical Approach to the Technical Divide
Science Engagement: A Non-Technical Approach to the Technical DivideScience Engagement: A Non-Technical Approach to the Technical Divide
Science Engagement: A Non-Technical Approach to the Technical Divide
 
Shrinking the Planet—How Dedicated Optical Networks are Transforming Computat...
Shrinking the Planet—How Dedicated Optical Networks are Transforming Computat...Shrinking the Planet—How Dedicated Optical Networks are Transforming Computat...
Shrinking the Planet—How Dedicated Optical Networks are Transforming Computat...
 

Recently uploaded

Recently uploaded (20)

Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General QuizPragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
 
How to Break the cycle of negative Thoughts
How to Break the cycle of negative ThoughtsHow to Break the cycle of negative Thoughts
How to Break the cycle of negative Thoughts
 
Jose-Rizal-and-Philippine-Nationalism-National-Symbol-2.pptx
Jose-Rizal-and-Philippine-Nationalism-National-Symbol-2.pptxJose-Rizal-and-Philippine-Nationalism-National-Symbol-2.pptx
Jose-Rizal-and-Philippine-Nationalism-National-Symbol-2.pptx
 
Research Methods in Psychology | Cambridge AS Level | Cambridge Assessment In...
Research Methods in Psychology | Cambridge AS Level | Cambridge Assessment In...Research Methods in Psychology | Cambridge AS Level | Cambridge Assessment In...
Research Methods in Psychology | Cambridge AS Level | Cambridge Assessment In...
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
 
50 ĐỀ LUYỆN THI IOE LỚP 9 - NĂM HỌC 2022-2023 (CÓ LINK HÌNH, FILE AUDIO VÀ ĐÁ...
50 ĐỀ LUYỆN THI IOE LỚP 9 - NĂM HỌC 2022-2023 (CÓ LINK HÌNH, FILE AUDIO VÀ ĐÁ...50 ĐỀ LUYỆN THI IOE LỚP 9 - NĂM HỌC 2022-2023 (CÓ LINK HÌNH, FILE AUDIO VÀ ĐÁ...
50 ĐỀ LUYỆN THI IOE LỚP 9 - NĂM HỌC 2022-2023 (CÓ LINK HÌNH, FILE AUDIO VÀ ĐÁ...
 
Open Educational Resources Primer PowerPoint
Open Educational Resources Primer PowerPointOpen Educational Resources Primer PowerPoint
Open Educational Resources Primer PowerPoint
 
INU_CAPSTONEDESIGN_비밀번호486_업로드용 발표자료.pdf
INU_CAPSTONEDESIGN_비밀번호486_업로드용 발표자료.pdfINU_CAPSTONEDESIGN_비밀번호486_업로드용 발표자료.pdf
INU_CAPSTONEDESIGN_비밀번호486_업로드용 발표자료.pdf
 
Danh sách HSG Bộ môn cấp trường - Cấp THPT.pdf
Danh sách HSG Bộ môn cấp trường - Cấp THPT.pdfDanh sách HSG Bộ môn cấp trường - Cấp THPT.pdf
Danh sách HSG Bộ môn cấp trường - Cấp THPT.pdf
 
Matatag-Curriculum and the 21st Century Skills Presentation.pptx
Matatag-Curriculum and the 21st Century Skills Presentation.pptxMatatag-Curriculum and the 21st Century Skills Presentation.pptx
Matatag-Curriculum and the 21st Century Skills Presentation.pptx
 
2024_Student Session 2_ Set Plan Preparation.pptx
2024_Student Session 2_ Set Plan Preparation.pptx2024_Student Session 2_ Set Plan Preparation.pptx
2024_Student Session 2_ Set Plan Preparation.pptx
 
The Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve ThomasonThe Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve Thomason
 
NCERT Solutions Power Sharing Class 10 Notes pdf
NCERT Solutions Power Sharing Class 10 Notes pdfNCERT Solutions Power Sharing Class 10 Notes pdf
NCERT Solutions Power Sharing Class 10 Notes pdf
 
Keeping Your Information Safe with Centralized Security Services
Keeping Your Information Safe with Centralized Security ServicesKeeping Your Information Safe with Centralized Security Services
Keeping Your Information Safe with Centralized Security Services
 
Sectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdfSectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdf
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
 
Application of Matrices in real life. Presentation on application of matrices
Application of Matrices in real life. Presentation on application of matricesApplication of Matrices in real life. Presentation on application of matrices
Application of Matrices in real life. Presentation on application of matrices
 
How to Manage Notification Preferences in the Odoo 17
How to Manage Notification Preferences in the Odoo 17How to Manage Notification Preferences in the Odoo 17
How to Manage Notification Preferences in the Odoo 17
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
 
Salient features of Environment protection Act 1986.pptx
Salient features of Environment protection Act 1986.pptxSalient features of Environment protection Act 1986.pptx
Salient features of Environment protection Act 1986.pptx
 

Emc 2013 Big Data in Astronomy

  • 1. 07/02/13 EMC Summer School on BIG DATA – NCE/UFRJ
    Big Data in Astronomy: The LIneA-DEXL case
    Fabio Porto (fporto@lncc.br), LNCC – MCTI, DEXL Lab (dexl.lncc.br)

    Outline
      • Introduction
      • Big Data in Science
      • Hypothesis-Driven Research
      • Data management
          - Data partitioning
          - Parallel workflow processing
      • Final remarks
  • 2. Laboratório Nacional de Computação Científica (LNCC), Petropolis, Rio de Janeiro

    LNCC - MCTI
      • Graduate Course in Computational Modelling
          - CAPES 6
      • BioInformatics Laboratory
          - Roche 454 high-throughput sequencing
      • Coordinator of INCT-MACC
          - Medicine Supported by Computational Science
      • Coordinator of SINAPAD
          - HPC National System
      • Thematic laboratories
          - ACIMA
          - MARTIN
          - DEXL
          - COMCIDIS
          - HEMOLAB
          - LABINFO
  • 3. SINAPAD – National System of High Performance Processing
      • Organized in CENAPADs:
          - Universities
          - Research centers
          - Different architectures:
              * Shared disks
              * Shared memory
              * GPUs

    CENAPADs (sinapad.lncc.br)
      6840 CPU cores + 8192 GPU cores
      ~106.6 TFlops / ~17.3 TBytes RAM / ~2.3 PBytes storage
  • 4. The DEXL Lab Mission
      • To support in-silico science with Big Data management techniques:
          - To develop interdisciplinary research with contributions on data modelling, design and management;
          - To develop tools and systems in support of in-silico science data management.

    e-Astronomy
      • LNCC is a member of the LIneA Lab:
          - Laboratório Inter-institucional de Astronomia
              * O.N., LNCC, CBPF, RNP
              * Development of e-Astronomy infrastructure in support of astronomy surveys
              * Official southern hemisphere DES node
      • Large astronomy surveys:
          - Sloan Digital Sky Survey
              * Currently SDSS-3
          - Dark Energy Survey
              * DES-Brazil managed by the LIneA laboratory
              * 5,000 square degrees of the sky
          - Large Synoptic Survey Telescope
              * 20,000 square degrees of the sky
              * Each patch visited 1000 times during 10 years
      • One of the scientific domains with extreme data processing and storage needs
      • Big Data today!
  • 5. LSST – Large Synoptic Survey Telescope
      • 800 images per night during 10 years
      • 3D map of the Universe
      • 30 TeraBytes per night
      • 100 PetaBytes in 10 years
      • 10^5 disks of 1 TB

    Sloan Portal
      [figure: screenshot of the Sloan portal]
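The LSST volumes quoted above can be sanity-checked with a few lines of arithmetic. This is only a rough sketch: it assumes one observing night per calendar day (an upper bound, not stated on the slide) and decimal units (1 PB = 1000 TB).

```python
# Back-of-the-envelope check of the LSST volumes quoted on the slide.
TB_PER_NIGHT = 30        # from the slide
YEARS = 10               # from the slide
NIGHTS_PER_YEAR = 365    # assumption: observing every night (upper bound)


def lsst_total_pb(tb_per_night=TB_PER_NIGHT, years=YEARS,
                  nights_per_year=NIGHTS_PER_YEAR):
    """Total raw volume in petabytes, using 1 PB = 1000 TB."""
    return tb_per_night * nights_per_year * years / 1000


def disks_needed(total_pb, disk_tb=1):
    """Number of disks of `disk_tb` terabytes needed to hold `total_pb` petabytes."""
    return int(total_pb * 1000 / disk_tb)


print(lsst_total_pb())      # about 110 PB, same order as the quoted 100 PB
print(disks_needed(100))    # 100000 disks of 1 TB, i.e. 10^5
```

Under these assumptions the nightly rate and the ten-year total agree to within about 10%, and 100 PB corresponds to 10^5 one-terabyte disks.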
  • 6. SkyServer – Sloan Project
      [figure: screenshot of the SkyServer site]

    Dark Energy Survey
      • Dark Energy Survey
          - Astronomy project to explain:
              * Acceleration of the universe
              * Nature of dark energy
          - Data production
              * DECam takes images of 1 GB (400 per night)
              * Images are analyzed; galaxies and stars are identified and catalogued
              * Catalogs are stored in database systems; estimated at 1 billion rows and 1 thousand attributes
      • LIneA is the official Brazilian contributor to the DES collaboration
  • 7. DES Science Pipelines
      [diagram; recoverable labels: Global and local tests; Test environment & CTIO; Un-supervised process; Cluster industrialization; Point source catalog; Masks, random catalogs; Addstar (MW, GC), Addqso; Findsat, Sparse, fitmodel; Stellar mass, LF, HOD fit; Classifier, photo-z; Identification, characterization; Cosmological parameters]
  • 8. BIG DATA in Science
      • The scientific process is being remodelled to be developed within an in-silico environment
      • Powerful instruments:
          - Digital telescopes
          - DNA sequencers
          - Mass spectrometers
      • Huge simulations
          - Weak lensing simulations
          - Cardio-vascular system simulation
      • Massive amounts of information stream in and out
      • Hypothesis-driven research supported by in-silico infrastructure, methods, models

    Big Data needs for e-science
      • Data archival infrastructure;
      • Scientific life cycle metadata management;
      • Distributed big data management;
          - Parallel workflow processing;
          - Parallel analytical algorithms;
  • 9. “Scientists are spending most of their time manipulating, organizing, finding and moving data, instead of researching. And it’s going to get worse”
      - Office of Science, Data-Management Challenge Report, DoE, 2004

    Big Data needs for e-science
      • Data archival infrastructure;
      • Scientific life cycle metadata management;
      • Distributed big data management;
          - Parallel workflow processing;
          - Parallel analytical algorithms;
  • 10. Scientific Experiment Life-cycle
      [figure: experiment life-cycle around Experiment and Data; source: Mattoso et al. 2010]

    MODELLING - HYPOTHESIS-DRIVEN RESEARCH
  • 11. e-Science life cycle
      [diagram; recoverable labels: Phenomenon; Hypothesis Formulation; Modeling; Experiment Life-cycle; Publication]

    Big Data Scenario in Scientific exploration life-cycle
      [diagram, adapted from Mattoso et al. 2010; recoverable labels: Experiment, Hypothesis, Goals; Workflow Design; Workflow Preparation; Workflow experiment repository; Data Sources; Hypotheses database; Workflow Execution; Monitoring; Provenance Store; Results; Analysis; Post-Execution analysis]
  • 12. Motivation
      • As experiments produce more and more data, extracting meaning out of these data requires, among other things, contextualizing the data
      • Metadata about the research allows for results sharing, fostering collaborative work
      • Sharing knowledge about the scientific reasoning

    Hypotheses in Astronomy - DES
      • Phenomenon:
          - The expansion of the universe is speeding up
              * Discovered by scientists in 1998 studying distant supernovae
              * Supported by observations of redshift in the light of distant supernovae
      • Hypotheses
          - A new, odd component named “Dark Energy” could make up 70% of the universe
          - The universe is not homogeneous: it has regions with different densities (our location is special)
      • Supporting evidence
          - Weak gravitational lensing
          - Galaxy clusters at different redshifts
  • 13. Hypothesis in Big Data Analytics
      • Scientific exploration is hypothesis-driven
          - Nevertheless, hypotheses remain out of reach of in-silico exploration (big data analyses?)
      • Big Data analysis is exploratory in nature
          - Understanding what one is doing when exploring Big Data requires a scientific, hypothesis-driven approach
      • Corollary
          - Big Data needs hypothesis management

    Context
      • Scientists trying to understand some phenomenon
          - Formulate hypotheses about the phenomenon's behaviour
      • Natural phenomena
          - Simulated by computational models
          - Explained by scientific hypotheses
      • Time-space varying
          - Space represented by physical meshes
              * 1D, 3D, ...
          - Time reflected in simulation ticks
  • 14. Scientific Hypothesis
      [figure: Human Cardio-vascular System]

    Elements of hypothesis-driven research
      • Scientific Phenomenon: an observable event
          - occurs in space-time;
          - characterized by observable quantities;
      • Scientific Hypothesis: a falsifiable statement proposed to explain a phenomenon [Popper 2012]
          - We are interested in a conceptual representation that puts forward the idea the hypothesis carries
      • Mathematical Model: a language-specific formalization of a scientific hypothesis
      • Experiment: the set of computational artifacts put together to validate a scientific hypothesis;
      • Data: observed or experimental data used in validating hypotheses;
  • 15. Hypothesis modelling initiatives
      • Robot Scientist
          - [R.D. King et al.] The automation of science, Science, 2009.
      • HyQue and HyBrow
          - [A. Callahan, M. Dumontier, and N. H. Shah] HyQue: Evaluating hypotheses using semantic web technologies. Journal of Biomedical Semantics, 2(Suppl 2):S3, 2011.
          - Modeling hypotheses as propositions in part of the domain language
              * Bioinformatics
      • SWAN
          - Y. Gao et al. Journal of Web Semantics, 2006
      • J. Sowa, Process Ontology

    Sc Hypothesis Conceptual Model
      [UML-style diagram, Porto et al. ER 2008, ER 2012; recoverable entities: Phenomenon, Physical Quantities, SC Hypothesis, Scientist, Domain ontology URL, Space-Time Dimension, Ph_Process (Continuous Ph_Process, Discrete Ph_Process), Formal Representation, Formal Language, Mathematical Model, Mathematical equation, Formulae XML, Simulation, Computational Model, Mesh, State, Event, variable, constant, Observation Element, Simulated Element, Data view, Model view; recoverable relationships: isTheBlendOf, explains, isBasedOn, isAuthor, formulatedBy, represents, represented_as, modeled_as, modeled by, transforms, Refers-to, Compared_with]
  • 16. Modelling Hypotheses and their interconnections
      [lattice diagram; node labels: Τ, Weak lensing, Galaxy clustering, Earth special location, Non-uniform universe, Dark Energy, Τ]
      A lattice-theoretic representation of the interconnections among hypotheses

    Focus on Hypothesis modeling
      • Scientific hypothesis formulation as a conceptual entity
      • Structuring of research evolution
      • Isomorphic representation of hypothesis, scientific model and phenomenon
      • Structure amenable to data representation, association, querying and publishing
  • 17. Hypotheses Structuring: Lattice
      [figure]

    The core entities of the hypothesis conceptual model
      [figure]
  • 18. Representation Isomorphism
      [figure]

    Application: Linked Science
      • An initiative to have machine-readable content describing the scientific exploration;
      • To support reproducibility of experiments;
      • To foster the reuse of previous results;
      • The community needs a more “open” science
  • 19. Linked Science (or Linked Open Science)
      • Is an initiative to interconnect all scientific assets;
      • It is a combination of:
          - Linked data, semantic web;
          - Open source;
          - Scientific workflows and provenance (OPM);
          - Scientific models;
          - Cloud computing;
          - ...

    Linked Science Core Vocabulary (LSC)
      • Defines a vocabulary (LSC) with “basic” terms for science;
          - More specific terminology shall be added by individual communities (minimal ontological commitment)
  • 20. LSC Core Vocabulary
      [figure]

    Extension to LSC
      [figure]
  • 21. Published Research as Linked Data (1)
      Blanco et al.’s published research as an LSC instantiation.

      rdfs:Class | rdf:Resource | property | rdf:Literal
      lsc:Researcher | authors1 | rdf:value | “P.J. Blanco, M.R. Pivello, S.A. Urquiza, and R.A. Feijóo.”
      lsc:Research | research1 | dc:description | “Simulation of hemodynamic conditions in the carotid artery.”
      lsc:Publication | pub1 | dc:title | “On the potentialities of 3D–1D coupled models in hemodynamics simulations.”
      lsc:Data | dataset1 | dc:description | “Flow rate of 5.0 l/min as an inflow boundary condition at the aortic root, in observation of Avolio (1980) and others.”
      lsc:Data | dataset2 | dc:description | “1D mechanical and geometric data from Avolio (1980).”
      lsc:Data | dataset3 | dc:description | “MRI images processed for reconstructing the 3D geometry of both the left femoral and the carotid arteries.”
      Phenomenon | p17 | dc:description | “Blood flow in the carotid artery.”
      tisc:Region | region1 | dc:description | “The carotid artery, a part of the human CVS.”
      owl:IntervalEvent | beat1 | dc:description | “A heart beat with period T = 0.8 s.”
      Observable | ob1 | dc:description | “Blood flow rate.”
      Observable | ob2 | dc:description | “Blood pressure.”
      lsc:Hypothesis | h17 | rdfs:label | “blend(h13, h15, h16)”
      Model | m17 | dc:description | “3D-1D coupled model with lumped windkessel terminals.”

    Published Research as Linked Data (2)
      Blanco et al.’s published research as an LSC instantiation.

      rdfs:Class | rdf:Resource | property | rdf:Literal
      lsc:Data | dataset4 | dc:description | “Plots of hemodynamic observables in the left femoral artery produced to validate the hypothesis.”
      lsc:Data | dataset5 | dc:description | “Plots of hemodynamic observables in the carotid artery.”
      lsc:Data | dataset6 | dc:description | “Scientific visualization of hemodynamic observables in the left femoral artery produced to validate the hypothesis.”
      lsc:Data | dataset7 | dc:description | “Scientific visualization of hemodynamic observables in the carotid artery both with and without aneurism.”
      lsc:Prediction | predict1 | rdf:value | “Sensitivity of local blood flow in the carotid artery to the heart aortic inflow condition.”
      lsc:Prediction | predict2 | rdf:value | “Sensitivity of the cardiac pulse to the presence of an aneurysm in the carotid.”
      lsc:Conclusion | conclusion1 | rdf:value | “3D-1D coupled models allow to perform quantitative and qualitative studies about how local and global phenomena are related, which is relevant in hemodynamics.”
  • 22. Find in Blanco et al.'s microtheory a hypothesis (if any) explaining phenomena of blood flow in microvascular vessels and show which model formulates it.

      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      PREFIX dc: <http://purl.org/dc/elements/1.1/>
      PREFIX lsc: <http://linkedscience.org/lsc/ns#>
      SELECT ?hypothesis_name ?model_name
      WHERE {
        ?h rdfs:label ?hypothesis_name .
        ?m rdfs:label ?model_name .
        ?h a lsc:Hypothesis .
        ?p a lsc:Phenomenon .
        ?m a lsc:Model .
        ?h lsc:explains ?p .
        ?m lsc:formulates ?h .
        ?p dc:description ?d .
        FILTER regex(?d, "blood flow", "i") .
        FILTER regex(?d, "microvascular", "i")
      }

    Remarks
      • Hypothesis modeling reflects the scientist's mental model during data analyses;
          - supports hypothesis-driven data exploration;
          - extends current eScience infrastructure;
      • Scientific hypotheses, models and phenomena are the main primitives;
      • The primitives may be represented as isomorphic lattices with semantic associations among themselves;
      • One can search, discover and mine hypotheses and related scientific artefacts;
      • ER 2012 - MODIC Workshop
      • ISWC 2012 - Linked Science workshop
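To make the matching logic of the query above concrete, here is a minimal pure-Python sketch that evaluates the same graph pattern over an in-memory list of triples. It is not a SPARQL engine. The sample triples are assumptions loosely based on the Blanco et al. instantiation: in particular the `lsc:explains` and `lsc:formulates` links, the `rdfs:label` on `m17`, and the `lsc:` prefixes on Phenomenon and Model are illustrative and not confirmed by the slides.

```python
import re

# Toy triple store: (subject, predicate, object). Assumed sample data.
TRIPLES = [
    ("h17", "rdf:type", "lsc:Hypothesis"),
    ("h17", "rdfs:label", "blend(h13, h15, h16)"),
    ("p17", "rdf:type", "lsc:Phenomenon"),
    ("p17", "dc:description", "Blood flow in the carotid artery."),
    ("m17", "rdf:type", "lsc:Model"),
    ("m17", "rdfs:label", "3D-1D coupled model"),   # assumed label
    ("h17", "lsc:explains", "p17"),                 # assumed link
    ("m17", "lsc:formulates", "h17"),               # assumed link
]


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the store."""
    return [o for (s, p, o) in TRIPLES if s == subject and p == predicate]


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the store."""
    return [s for (s, p, o) in TRIPLES if p == predicate and o == obj]


def find_hypotheses(keywords):
    """Return (hypothesis label, model label) pairs whose explained
    phenomenon has a description matching every keyword, ignoring case."""
    results = []
    for h in subjects("rdf:type", "lsc:Hypothesis"):
        for p in objects(h, "lsc:explains"):
            if "lsc:Phenomenon" not in objects(p, "rdf:type"):
                continue
            descriptions = objects(p, "dc:description")
            if not any(all(re.search(k, d, re.I) for k in keywords)
                       for d in descriptions):
                continue
            for m in subjects("lsc:formulates", h):
                if "lsc:Model" not in objects(m, "rdf:type"):
                    continue
                for h_label in objects(h, "rdfs:label"):
                    for m_label in objects(m, "rdfs:label"):
                        results.append((h_label, m_label))
    return results


print(find_hypotheses(["blood flow", "microvascular"]))  # no phenomenon matches
print(find_hypotheses(["blood flow", "carotid"]))
```

With this toy data the microvascular query returns nothing, which is the "if any" case of the question, while swapping the second keyword for "carotid" returns the hypothesis and its model.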
  • 23. DATA MANAGEMENT

    Dark Energy Survey
      • Dark Energy Survey
          - Astronomy project to explain:
              * Acceleration of the universe
              * Nature of dark energy
          - Data production
              * DECam takes images of 1 GB (400 per night)
              * Images are analyzed; galaxies and stars are identified and catalogued
              * Catalogs are stored in database systems
  • 24. Dark Energy Survey Project
      • Main technical (CS) issue:
          - Managing huge catalogs
          - Relations loaded from standard FITS files
      • Database features
          - Single relation for each catalog
          - Volume: 1 billion tuples x 1000 attributes (300GB)
          - Queries
              * Users submit ad-hoc queries to the database
              * Usually too many results for each query: need to choose the best results, e.g. using top-k techniques
              * Some queries scan the whole database, e.g. looking for clusters of stars

    Processing Astronomy data
      [diagram: astronomy catalogs accessed through two paths: user access (ad-hoc queries, downloads) and scientific workflows (analysis)]
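As a small illustration of the top-k idea mentioned above, the sketch below keeps only the k best-scoring rows of a result set with a bounded heap instead of sorting everything. The row layout and the column name `mag_r` are hypothetical, and the slides do not say which top-k technique is actually used.

```python
import heapq


def top_k(rows, k, score):
    """Return the k rows with the highest score without sorting the full result."""
    return heapq.nlargest(k, rows, key=score)


# Hypothetical catalog rows; 'mag_r' (smaller is brighter) is an assumed column.
rows = [
    {"id": 1, "mag_r": 21.3},
    {"id": 2, "mag_r": 18.7},
    {"id": 3, "mag_r": 19.9},
    {"id": 4, "mag_r": 23.1},
]

# The two brightest objects: score is the negated magnitude.
brightest = top_k(rows, 2, score=lambda r: -r["mag_r"])
print([r["id"] for r in brightest])
```

`heapq.nlargest` runs in roughly O(n log k), which matters when a query returns far more rows than the user can inspect.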
  • 25. Ad-Hoc Queries
      • Submitted by users through the portal;
      • For small queries (regions of the sky)
          - Indexing based on ra, dec (e.g. Q3C)
              * [Koposov, S., Bartunov, O., 2006] Q3C Quad Tree Cube, Astronomical Data Analysis Software and Systems, 2006
              * HTM, Hierarchical Triangular Mesh, MS SQL Server, Sloan
          - Spatial functions (e.g. radial search)
          - Other predicates need finer-grained criteria
      • For large queries (whole sky)
          - Explore parallelism over partitioned data
      • Data partitioning is efficient for both small and large queries

    Astronomer’s coordinate system
      [figure]
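A radial (cone) search over ra, dec can be sketched as follows. This is a naive linear scan using the haversine formula for angular separation; it is not the Q3C or HTM implementation, which avoid the scan by indexing. The tuple layout `(id, ra, dec)` with angles in degrees is an assumption.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (ra, dec) points in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    d_ra = ra2 - ra1
    d_dec = dec2 - dec1
    a = (math.sin(d_dec / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin(d_ra / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def radial_search(objects, ra0, dec0, radius_deg):
    """Ids of objects within radius_deg of (ra0, dec0).

    A cheap declination band test runs first; the exact test runs only
    on the survivors.
    """
    hits = []
    for obj_id, ra, dec in objects:
        if abs(dec - dec0) > radius_deg:
            continue
        if angular_separation_deg(ra0, dec0, ra, dec) <= radius_deg:
            hits.append(obj_id)
    return hits


catalog = [(1, 10.1, -30.1), (2, 10.0, -31.0), (3, 190.0, 30.0)]
print(radial_search(catalog, 10.0, -30.0, 0.5))
```

The declination pre-filter mirrors what a range index on dec would do; a spatial index additionally prunes on ra.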
  • 26. Workflow queries
      • Workflows process data retrieved from the Catalog
          - Two systems
              * Workflow engine
              * Database engine
          - Lack of integration
              * Imposes an upper bound on performance
          - Large queries
              * Parallelism obtained by data partitioning is jeopardized by the consolidation of results performed by the DBMS;
              * The workflow receives the data and redistributes it to parallelize activities
          - Concurrency among workflows
              * May impose huge penalties

    Need to partition data
      • Beneficial for both access patterns
          - Ad-hoc and workflow
      • How to apply it?
      • Vertical partitioning
          - Already applied, based on semantic clustering of attributes
              * Ra, dec
              * Photometry, spectrometry, astrometry
      • Horizontal partitioning
          - Ra, Dec (the current approach)
          - Finer-grained criteria
              * Being developed in collaboration with INRIA Montpellier
  • 27. First Step: Hybrid Data Partitioning (HDP)
      [diagram: the catalog is split horizontally by the standard criterion (range of ra, dec) into partitions for Criterion 1 ... Criterion k; each horizontal partition is split vertically into fragments that all keep the Id: Catalog (Id, Ra, Dec), Catalog_s (Id, spectrometry), Catalog_ph (Id, photometry), Catalog_a (Id, astrometry)]

    IMPLEMENTATION ALTERNATIVES
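The hybrid scheme in the diagram above can be sketched in a few lines: rows are first assigned to a horizontal partition by coordinate range, then each row is cut into vertical fragments that all carry the Id. This is an illustrative sketch only. The column names and group contents are assumptions, and for brevity the horizontal criterion uses ra ranges alone rather than ra and dec.

```python
import bisect

# Assumed attribute groups; the column names are illustrative only.
GROUPS = {
    "position": ["ra", "dec"],
    "photometry": ["mag_g", "mag_r"],
    "astrometry": ["pm_ra", "pm_dec"],
}


def hybrid_partition(rows, ra_edges, groups=GROUPS):
    """Map (horizontal partition index, group name) to a list of fragments.

    ra_edges are the sorted interior boundaries of the ra ranges, so
    len(ra_edges) + 1 horizontal partitions are produced.
    """
    partitions = {}
    for row in rows:
        h = bisect.bisect_right(ra_edges, row["ra"])    # horizontal step
        for name, attrs in groups.items():              # vertical step
            fragment = {"id": row["id"]}
            fragment.update({a: row[a] for a in attrs})
            partitions.setdefault((h, name), []).append(fragment)
    return partitions


rows = [
    {"id": 1, "ra": 10.0, "dec": -30.0, "mag_g": 20.1, "mag_r": 19.5,
     "pm_ra": 0.1, "pm_dec": -0.2},
    {"id": 2, "ra": 200.0, "dec": 5.0, "mag_g": 22.4, "mag_r": 21.9,
     "pm_ra": 0.0, "pm_dec": 0.3},
]
parts = hybrid_partition(rows, ra_edges=[120.0, 240.0])
print(sorted(parts))
```

Because every fragment keeps the Id, a query touching both photometry and position for one sky region only needs to join two small fragments of the same horizontal partition.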
  • 28. Using PGPOOL-II
      • Pgpool-II
          - Implemented on top of PostgreSQL 9.1
          - A central node coordinates data/query distribution and replication
          - Requests are distributed through the nodes
          - Parallel query processing
              * Data partitioning based on a table column range (e.g. id)
              * For short queries, may reduce the amount of data accessed
          - Load balancing
              * Concurrent requests are directed to different DB copies

    Parallelism & Load Balance
      [diagram: a Pgpool-II node issuing parallel queries to two further Pgpool-II nodes, each replicating over a pair of PostgreSQL servers]
  • 29. Evaluation
      • Strengths
          - Extends PostgreSQL
          - Load balances queries from concurrent workflows
          - Scales up to 128 DB nodes
      • Weaknesses
          - Lack of support for spatial functions
          - Partitioning based on a single column
          - Ingestion can’t use COPY

    QServ - LSST
      • Developed by the LSST DM team
      • Astronomy data management
      • Horizontal partitioning based on declination zones (nodes), with the data on each node distributed into chunks based on RA
      • Approx. 1000 partitions
      • Native support for spatio-temporal functions
      • Built on top of MySQL
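The two-level idea described for Qserv (declination zones, then RA chunks inside each zone) can be illustrated with a simple chunk-numbering function. This is a hedged sketch and not Qserv's actual partitioning scheme: the uniform zone height, the fixed number of chunks per zone, and the values 25 and 40 (chosen only so that 25 x 40 matches the "approx. 1000 partitions" figure) are all assumptions.

```python
def chunk_id(ra, dec, n_zones=25, chunks_per_zone=40):
    """Map a position in degrees to a partition number in [0, n_zones * chunks_per_zone).

    n_zones and chunks_per_zone are hypothetical parameters, not Qserv settings.
    """
    # Declination zone: equal-height bands from -90 to +90 degrees.
    zone = min(int((dec + 90.0) / 180.0 * n_zones), n_zones - 1)
    # RA chunk inside the zone: equal-width slices of the 0..360 degree range.
    chunk = min(int((ra % 360.0) / 360.0 * chunks_per_zone), chunks_per_zone - 1)
    return zone * chunks_per_zone + chunk


print(chunk_id(0.0, -90.0))
print(chunk_id(0.0, 0.0))
print(chunk_id(359.999, 90.0))
```

A region query then only needs to visit the chunks whose zone and RA slice overlap the region, while a whole-sky scan can run on all chunks in parallel.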
  • 30. Evaluation
      • Strengths
          - Designed to support astronomy data surveys
          - Highly scalable: ~1000 nodes
          - First performance results are very promising
          - Alignment with the LSST project
      • Weaknesses
          - Current culture is based on PostgreSQL

    Context (3/3)
      • Requirement
          - Efficient data storage and processing
      • Challenges
          - Large database size
          - High number of attributes
          - Evolving workload
          - Mostly scan processing
      • Questions:
          a) How to efficiently process queries over catalogs?
          b) How to efficiently process scientific workflows over catalogs?
  • 31. Current activities at DEXL
      a) Design data partitioning strategies
          - Cooperation with INRIA Montpellier, Zenith group
          - Partition the data into blocks
              * such that the number of query accesses to the blocks is minimum
              * each block can be stored on a different machine
      b) Efficient execution of scientific workflows over partitioned data

    a) Intuition
      Queries and scientific workflows take time proportional to the amount of data to be processed.
      [diagram: with data partitioning, a query Q is split into Q’, Q’’ and Q’’’ over the partitions]
  • 32. Partitioning the DB into Blocks
      [diagram: relation R(a1,…,a9) split into blocks B1, B2, …, Bm]
      How to compute the best partitioning?

    Problem statement
      • Given
          - Single-relation database R(a1,…,an), n ~1000
          - Initial workload: set of k queries W0 = {q1,…,qk}
          - m empty fixed-size blocks
      • Assumptions
          - Accessing a block ≈ accessing all its tuples
          - Periodically new tuples and queries arrive
          - No privilege to a particular attribute
      • Goal
          - Minimize the total block access during the execution of queries by:
              * Optimal partitioning of R’s data into blocks
              * Optimal query execution
          - Adapt to the arrival of new data and queries
  • 33. Overview of the solution
      • Data partitioning: graph-based algorithm
          - Nodes: each data item (e.g. tuple) is represented by a node in the graph
          - Edges: there is an edge between two data items if they are accessed by a common query
          - Edge weight: the number of queries that access both data items
          - Goal: partition the graph into m equal-size sub-graphs with minimum edge cut
              * Use a min-cut algorithm
      • Block explanation
          - Blocks are explained in terms of queries
              * Each block is assigned an explaining query Bi = vi(R)
      • Query processing
          - Queries are compared to explaining queries
          - Matching blocks are selected (we haven’t worked on that yet)

    Partitioning strategy (Schism: VLDB 2010)
      [animation frame: node 1] We create a node for each row
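The graph construction described above can be sketched directly: every pair of tuples accessed by the same query gets an edge, and the edge weight counts how many queries access both. Queries are again represented as sets of tuple ids, which is an assumption for illustration.

```python
from collections import Counter
from itertools import combinations


def co_access_graph(workload):
    """Return a Counter mapping (u, v) with u < v to the number of queries
    that access both tuples u and v."""
    weights = Counter()
    for query in workload:
        for u, v in combinations(sorted(set(query)), 2):
            weights[(u, v)] += 1
    return weights


workload = [{1, 3}, {1, 3, 5}, {2, 4}, {6, 7}, {4, 6}]
graph = co_access_graph(workload)
print(graph[(1, 3)], graph[(3, 5)], graph[(1, 2)])
```

Tuples 1 and 3 are accessed together by two queries, so their edge has weight 2, while tuples 1 and 2 never co-occur and have no edge. A graph partitioner then tries to keep heavy edges inside one block.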
  • 34. Partitioning strategy
      [animation frames: nodes 1 and 2, then nodes 1, 2 and 3] We create a node for each row
  • 35. Partitioning strategy
      [animation frame: for each vertical fragment, rows 1 to 7 are shown as graph nodes] We create a node for each row
      [animation frame: query q1 adds an arc of weight 1 between the rows it accesses] We increment the arc weight when two rows are accessed together
  • 36. Partitioning strategy
      [animation frame: query q2 adds or increments arcs between the rows it accesses] We increment the arc weight when two rows are accessed together
      [animation frame: weighted graph over rows 1 to 7 after the whole workload W = {q1,…,qn}] We increment the arc weight when two rows are accessed together
  • 37. Partitioning strategy
      [animation frame: weighted graph over rows 1 to 7, for each vertical fragment] We execute a min-cut algorithm
      [animation frame: the catalog graph is cut into two partitions, B1 = {3, 1, 5} and B2 = {2, 6, 4, 7}] Each partition is assigned a block
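For a graph as small as the one in the animation, the min-cut step can be illustrated by exhaustive search over balanced two-way splits. This is a toy sketch only: it is exponential in the number of nodes and stands in for a real partitioner, and the edge weights below are made up rather than read from the slide.

```python
from itertools import combinations


def cut_weight(weights, part_a):
    """Total weight of edges with exactly one endpoint in part_a."""
    return sum(w for (u, v), w in weights.items()
               if (u in part_a) != (v in part_a))


def best_bisection(nodes, weights):
    """Try every split with len(nodes) // 2 nodes on one side and return
    (cut weight, side A, side B) for the first split with minimum cut."""
    nodes = sorted(nodes)
    half = len(nodes) // 2
    best = None
    for combo in combinations(nodes, half):
        side_a = set(combo)
        cost = cut_weight(weights, side_a)
        if best is None or cost < best[0]:
            best = (cost, side_a, set(nodes) - side_a)
    return best


# Hypothetical co-access weights between four rows.
weights = {(1, 2): 5, (2, 3): 1, (3, 4): 5, (1, 3): 1}
print(best_bisection([1, 2, 3, 4], weights))
```

Rows 1 and 2 end up together, as do rows 3 and 4, because separating either heavily co-accessed pair would cost far more than cutting the two light edges.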
  • 38. Partitioned data with queries
      • Each block is associated with the queries that access some records of the block
          – B1, B2: {q3,q5,…,q13}, {q1,q2,…,q14}
      • For a given query q, the number of accessed blocks is minimized
    Adaptive Strategy (1/2)
      • New tuple arrival: [DEXA 2012]
          – Select the best block
              • i.e. the block to which the new tuple is most correlated
          – Challenges:
              • How to select the best block with minimum effort?
                  – Initial approach: find it based on the correlation of queries to blocks
                  – Define optimal allocation
                  – Compute actual allocation efficiency
                  – Compute block affinity
              • What if the best block is full?
                  – Initial approach: split the block
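The adaptive allocation idea above (pick the block most correlated with the new tuple, split when it is full) could look roughly like the following. This is a hedged sketch, not the DEXA 2012 algorithm: affinity is assumed here to be the number of workload queries shared between the new tuple and a block, and `affinity`, `allocate` and the block dictionary layout are hypothetical names.

```python
def affinity(tuple_queries, block_queries):
    """Assumed affinity measure: number of queries touching both tuple and block."""
    return len(tuple_queries & block_queries)

def allocate(tuple_queries, blocks, capacity):
    """Choose the block with the highest affinity for a newly arrived tuple.

    blocks: dict name -> {"queries": set of query ids, "size": int}
    Returns (block name, action), where action is "insert" or "split".
    """
    # Ties are broken by block name so the choice is deterministic.
    best = max(sorted(blocks),
               key=lambda name: affinity(tuple_queries, blocks[name]["queries"]))
    if blocks[best]["size"] >= capacity:
        # The slide's initial approach for a full block: split it.
        return best, "split"
    blocks[best]["size"] += 1
    blocks[best]["queries"] |= tuple_queries
    return best, "insert"

blocks = {
    "B1": {"queries": {"q3", "q5", "q13"}, "size": 4},
    "B2": {"queries": {"q1", "q2", "q14"}, "size": 3},
}
first = allocate({"q1", "q2"}, blocks, capacity=4)   # fits into B2
second = allocate({"q14"}, blocks, capacity=4)       # B2 is now full
```

How the set of queries matching a new tuple is obtained (for example by evaluating query predicates on it) is left out and would be part of the real system.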
  • 39. Allocation based on affinity to blocks
      [Figure]
    Elapsed time of incrementing the DB as the size increases
      [Plot: execution time (s), log scale from 0.1 to 1000, versus DB size from 2M to 20M; series: static, DynPart |D| = 500 k, DynPart |D| = 1 M]
      • Experiment:
          – Sloan DR8: 350 million tuples
          – Workload: synthetic, 27000 queries
          – PaToH: hypergraph partitioner
  • 40. E-ASTRONOMY WORKFLOWS OVER PARTITIONED DATA
    Processing Scientific Workflows
      • Analytical workflows process a large part of the catalog data
          – Catalogs are supported by few indexes, thus most queries scan tens to hundreds of millions of tuples
      • Parallelization comes to the rescue to reduce the elapsed time of analyses, but
          – There is a compromise between:
              • Data partitioning and degree of parallelization
          – Current solutions consider:
              • Centralized files to be distributed through nodes (MapReduce)
                  – [Alagiannis, SIGMOD, 2012] NoDB: reading raw files without data ingestion
              • Distributed databases (Qserv) to serve workflow engines
                  – [Wang, D.L., 2011] Qserv: A Distributed Shared-Nothing Database for the LSST catalog
              • Centralized databases to serve a workflow engine (Orchestration, LIneA)
              • Partitioned database to serve distributed queries (HadoopDB)
  • 41. HadoopDB: a step in between [Abouzeid09]
      • Offers parallelism and fault tolerance as Hadoop does, with SQL queries pushed down to the PostgreSQL DBMS
      • Pushed-down queries are implemented as MapReduce functions
      • Data are partitioned across nodes
          – Partitioning information stored in the catalog
          – Distributed through the N nodes
    HadoopDB architecture
      [Figure: an SQL query enters the SMS Planner, which uses the Catalog and the MapReduce Framework; Node 1, Node 2, …, Node n each run a Task Tracker, a Database and a DataNode]
  • 42. Example
        Select year(SalesDate), sum(revenue)
        From Sales
        Group by year(SalesDate)
      a) Table partitioned by year(SalesDate)
          – Map: the SQL query above, followed by a FileSink Operator
      b) No partitioning by year(SalesDate)
          – Map: the SQL query above, followed by a Reduce Sink Operator
          – Reduce: Group by Operator, Sum Operator, FileSink Operator
    Processing Astronomy data
      [Figure: astronomy catalogs serve user access (ad-hoc queries, analysis, downloads) and scientific workflows]
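The difference between the two plans in the example can be imitated in plain Python. This is a minimal sketch, assuming each partition is a list of `(year, revenue)` pairs; `local_aggregate` stands in for the SQL pushed down to each node's database and `reduce_merge` for the Reduce phase. Both names are hypothetical and no Hadoop API is used.

```python
from collections import defaultdict

def local_aggregate(partition):
    """Map side: SUM(revenue) GROUP BY year, computed on one partition only."""
    sums = defaultdict(int)
    for year, revenue in partition:
        sums[year] += revenue
    return dict(sums)

def reduce_merge(partials):
    """Reduce side: combine per-partition sums that share the same year."""
    total = defaultdict(int)
    for partial in partials:
        for year, subtotal in partial.items():
            total[year] += subtotal
    return dict(total)

# a) Partitioned by year: each year lives in exactly one partition, so the
#    map outputs are already final and can simply be collected.
by_year = [[(2011, 10), (2011, 5)], [(2012, 7)]]
plan_a = {}
for partial in map(local_aggregate, by_year):
    plan_a.update(partial)

# b) Not partitioned by year: the same year appears in several partitions,
#    so a reduce step must add the partial sums together.
mixed = [[(2011, 10), (2012, 7)], [(2011, 5)]]
plan_b = reduce_merge(map(local_aggregate, mixed))
```

Both plans give the same answer; plan (a) just avoids the shuffle and reduce work, which is why the partitioning attribute matters.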
  • 43. Traditional WF–Database decoupled architecture
      [Figure: a workflow engine runs act1, act2, act3; data from the database partitions DBp1, DBp2, DBp3 is consolidated as input to the workflow engine]
    Problems
      • Data locality
          – Workflow activities run on nodes that are remote with respect to the partitioned data
      • Load balance
          – Local processes face different processing times
  • 44. Data locality
      • Traditional distributed query processing pushes operations through joins and unions so that they can be done close to the data partitions
      • Can we "localize" workflow activities?
          – Moving activities in workflows requires operation semantics to be exposed
          – Mapping of workflow activities to a known algebra
          – Equivalence of algebra expressions enabling operations to be pushed down
    Algebraic transformation
      [Figure with four panels over relations R, S, T, Q, U, V and the operators Map and Filter: (i) workflow, relation perspective; (ii) decomposition; (iii) anticipation; (iv) procrastination]
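The kind of equivalence that makes push-down legal can be checked on a toy case: a filter applied after a union gives the same result as filtering each input first. The sketch below assumes relations are Python lists with bag semantics; `filter_rel` and `union` are hypothetical helper names, not operators of the workflow algebra used in the talk.

```python
def filter_rel(pred, rel):
    """Relational selection over a list of tuples."""
    return [t for t in rel if pred(t)]

def union(*rels):
    """Bag union: concatenate the partitions in order."""
    out = []
    for rel in rels:
        out.extend(rel)
    return out

# Two partitions of a catalog: (object id, magnitude). Illustrative data only.
part1 = [(1, 17.2), (2, 21.5)]
part2 = [(3, 19.9), (4, 23.1)]

def is_bright(t):
    return t[1] < 20.0

# Filter after consolidating the data (decoupled architecture) ...
late = filter_rel(is_bright, union(part1, part2))
# ... versus the filter pushed down to each partition before the union.
early = union(filter_rel(is_bright, part1), filter_rel(is_bright, part2))
```

The pushed-down form moves less data, because only the surviving tuples leave each partition.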
  • 45. Workflow optimization process
      [Flowchart: initial algebraic expressions → generation of the search space (using transformation rules) → equivalent algebraic expressions → evaluation of the search strategy (using a cost model) → search more? (yes: repeat; no: stop) → optimized algebraic expressions]
    Pushing down workflow activities
      • A first naïve attempt
          – Push down all operations before a Reduce
      • Use a MapReduce implementation where
          – Mappers execute the "pushed-down" operations close to the data
  • 46. Typical implementation at the LIneA portal
      [Figure: spatial partitioning of the catalog DB]
    Parallel workflow over partitioned data
      • Partitioned catalogue stored on PostgreSQL
      [Figure: each partition DBp1, DBp2, …, DBpn feeds its own SkyMap activity; the SkyMap outputs are combined by SkyAdd]
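The shape of this parallel workflow (one SkyMap per partition, one SkyAdd at the end) can be sketched as a map followed by a merge. The slides do not say what SkyMap and SkyAdd compute, so the sketch assumes, purely for illustration, that `sky_map` counts objects per sky cell and `sky_add` sums the per-partition counts; the cell size and the `(ra, dec)` row layout are also assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

CELL_DEG = 10  # assumed size of a sky cell, in degrees

def sky_map(partition):
    """Assumed SkyMap: count catalog objects per (ra, dec) cell in one partition."""
    counts = Counter()
    for ra, dec in partition:
        counts[(int(ra // CELL_DEG), int(dec // CELL_DEG))] += 1
    return counts

def sky_add(maps):
    """Assumed SkyAdd: add the per-partition maps into a single map."""
    total = Counter()
    for m in maps:
        total.update(m)
    return total

# Three database partitions, each a list of (ra, dec) positions (toy data).
partitions = [
    [(12.0, -45.0), (15.5, -41.0)],
    [(18.0, -44.0), (33.0, -12.0)],
    [(31.0, -15.0)],
]

# Each SkyMap runs independently, so the partitions can be processed in parallel.
with ThreadPoolExecutor() as pool:
    partial_maps = list(pool.map(sky_map, partitions))
density = sky_add(partial_maps)
```

In the setting of the talk the per-partition step would run next to each PostgreSQL partition rather than in threads on one machine.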
  • 47. HQOOP: Parallelizing Pushed-down Scientific Workflows
      • Partition the data across cluster nodes
          – Partitioning criteria
              • Spatial (currently used and necessary for some applications)
              • Random (possible in SkyMap)
              • Based on query workload (Miguel Liroz-Gistau's work)
      • Process the workflow close to the data location
          – Reduce data transfer
      • Use the Apache Hadoop implementation to manage parallel execution
          – Widely used in Big Data processing
          – Implements the MapReduce programming paradigm
          – Fault tolerance for failed Map processes
      • Use QEF as the workflow engine
          – Implements the Mapper interface
          – Runs workflows in Hadoop seamlessly
    Perspective
      [Figure positioning the systems by layer (workflow engine parallelization, query distribution, data distribution): Qserv + Workflow, HQOOP, Orchestration layer, Hadoop + Kepler, MapReduce, HadoopDB + Hive]
  • 48. Integrated architecture
      [Figure: three workflow engine instances, each running act 1, act 2, act 3 next to its own partition DB1, DB2, DB3, and contributing to the final result]
    Experiment Set-up
      • SGI cluster
          – Configurations: 1, 47 and 95 nodes
          – Each node:
              • 2 Intel Xeon X5650 processors, 6 cores, 2.67 GHz
              • 24 GB RAM
              • 500 GB HD
      • Data
          – Catalog DC6B
      • Hadoop
          – QEF workflow engine
  • 49. Preliminary Results
      • Preliminary results are encouraging:
          – Baseline Orchestration layer (234 nodes): approx. 46 min
          – 1 node HQOOP: approx. 35 min
          – 4 nodes HQOOP: approx. 12.3 min
          – 95 nodes (94 workers) HQOOP: approx. 2.10 min
          – 95 nodes (94 workers) Hadoop+Python: approx. 2.4 min
    Resulting Image
      [Figure]
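The reported times can be turned into speed-up and parallel efficiency figures with a little arithmetic. The sketch below uses only the HQOOP numbers from the slide and the usual definitions (speed-up = single-node time / parallel time, efficiency = speed-up / workers); treating the 4-node run as having 4 workers is an assumption, since the slide only gives the worker count for the 95-node run.

```python
def speedup(t_single, t_parallel):
    """Classic speed-up: how many times faster than the single-node run."""
    return t_single / t_parallel

def efficiency(t_single, t_parallel, workers):
    """Fraction of ideal linear speed-up actually achieved."""
    return speedup(t_single, t_parallel) / workers

T_ONE_NODE = 35.0  # minutes, 1 node HQOOP (from the slide)

# (workers, elapsed minutes); worker count for the 4-node run is assumed.
runs = {"4 nodes": (4, 12.3), "95 nodes": (94, 2.10)}

for label, (workers, minutes) in runs.items():
    s = speedup(T_ONE_NODE, minutes)
    e = efficiency(T_ONE_NODE, minutes, workers)
    print(f"{label}: speed-up {s:.2f}x, efficiency {e:.0%}")
```

Under these assumptions the 4-node run is close to 2.8 times faster than one node, while the 94-worker run is about 16.7 times faster, well below linear scale-up for this input size.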
  • 50. Conclusions
      • Big data users (scientists) are in Big Trouble
          – Too much data, too fast, too complex
      • Different kinds of expertise are required to cooperate towards Big Data management
      • Adapted software development methods based on workflows
      • Complete support for the scientific exploration life-cycle
      • Efficient workflow execution on Big Data
    Collaborators
      • LNCC researchers
          – Ana Maria de C. Moura
          – Bruno R. Schulze
          – Antonio Tadeu Gomes
      • PhD students
          – Bernardo N. Gonçalves
          – Rocio Millagros
          – Douglas Ericson de Oliveira
          – Miguel Liroz-Gistau (INRIA)
          – Vinicius Pires (UFC)
  • 51. Collaborators
      • ON
          – Angelo Fausti
          – Luiz Nicolaci da Costa
          – Ricardo Ogando
      • COPPE-UFRJ
          – Marta Mattoso
          – Jonas Dias (PhD student)
          – Eduardo Ogasawara (CEFET-RJ)
      • UFC
          – Vania Vidal
          – José Antonio F. de Macedo
      • PUC-Rio
          – Marco Antonio Casanova
      • INRIA-Montpellier
          – Patrick Valduriez group
      • EPFL
          – Stefano Spaccapietra
    EMC Summer School on BIG DATA – NCE/UFRJ
      Big Data in Astronomy
      Fabio Porto (fporto@lncc.br), LNCC – MCTI, DEXL Lab (dexl.lncc.br)
  • 52. Overall performance
      [Two bar charts: elapsed time (min) and % of linear scale-up for Baseline (234 nodes), 1 node HQOOP, 4 nodes HQOOP, 94 nodes HQOOP and 94 nodes Hadoop]
      [Two bar charts: Hadoop time and Reduce time for 47 and 94 nodes, with QEF and without QEF, one chart for the CENT configuration and one for the DIST configuration]
  • 53. Execution with 4 nodes
      • Total elapsed time: 11.27 min
  • 54. Adaptive and Extensible Query Engine
      • Extensible to data types
      • Extensible to application algebra
      • Extensible to execution model
      • Extensible to heterogeneous data sources
    Objective
      • Offer a query processing framework that can be extended to adapt to the needs of data-centric applications
      • Offer transparency in using resources to answer queries
      • Query optimization transparently introduced
      • Standardize remote communication using web services, even when dealing with large amounts of unstructured data
      • Run-time performance monitoring and decision
  • 55. Control Operators
      • Add data-flow and transformation operators
      • Isolate application-oriented operators from the data-flow concerns of the execution model
      • Parallel grid-based execution model:
          • Split/Merge: controls the routing of tuples to parallel nodes and the corresponding unification of multiple routes into a single flow
          • Send/Receive: marshalling/unmarshalling of tuples and interface with communication mechanisms
          • B2I/I2B: blocks and unblocks tuples
          • Orbit: implements a loop in a data-flow
          • Fold/Unfold: logical serialization of complex structures (e.g. PointList to Points)
    The Execution Model
      • Example of a simple QEF workflow
      [Figure: data sources (input) feed operators, possibly distributed over a Grid environment, up to the output; the integration unit is a tuple containing data source units]
  • 56. Iteration Model
      [Figure: operators C, B, A over a DataSource; OPEN is propagated from C through B and A to the DataSource, then GETNEXT, then CLOSE, and the results are returned]
    Distribution and Parallelization
      • Operator distribution: a query optimizer selects a set of operators in the QEP to execute over a Grid environment
      [Figure: operator B is replicated as B1, B2, B3 between C and A, which reads the DataSource]
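The OPEN / GETNEXT / CLOSE protocol of the iteration model is the classic pull-based iterator interface. A minimal Python sketch follows, with hypothetical operator classes `Scan` and `Select` standing in for the data source and for an operator such as A, B or C; it is not QEF code.

```python
class Scan:
    """Leaf operator: reads tuples from an in-memory data source."""
    def __init__(self, rows):
        self.rows = rows
        self.it = None

    def open(self):
        self.it = iter(self.rows)

    def get_next(self):
        # None signals end of stream (so None cannot be a data value here).
        return next(self.it, None)

    def close(self):
        self.it = None


class Select:
    """Unary operator: forwards only the tuples that satisfy a predicate."""
    def __init__(self, child, pred):
        self.child = child
        self.pred = pred

    def open(self):
        self.child.open()          # OPEN is propagated down the plan

    def get_next(self):
        # Pull from the child until a qualifying tuple or end of stream.
        while (t := self.child.get_next()) is not None:
            if self.pred(t):
                return t
        return None

    def close(self):
        self.child.close()         # CLOSE is propagated down the plan


def run(root):
    """Drive a plan from its root: open, pull every tuple, close."""
    root.open()
    results = []
    while (t := root.get_next()) is not None:
        results.append(t)
    root.close()
    return results


plan = Select(Select(Scan([1, 2, 3, 4, 5, 6]), lambda x: x % 2 == 0),
              lambda x: x > 2)
output = run(plan)
```

Because every operator exposes the same three calls, control operators such as Split or Send can be inserted into a plan without the neighbouring operators noticing.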
  • 57. General Parallel Execution Model
      • Remote QEP: in order to parallelize an execution, the initial QEP is modified and sent to remote nodes to handle the distributed execution
      [Figure: initial plan and modified plan. Legend: control operators R (Receiver), S (Sender), Sp (Split), M (Merge); distributed operator; user's operator]
    Modifying IQEP to adapt to execution model
      • The query optimizer adds control operators according to the execution model and IQEP statistics
      [Figure: a control node and a remote node i connected over TCP; operators shown: A, TJ, SJ, Velocity, Geometry, Particles, Send, Receive, B2I, I2B, Split, Merge, Orbit. Legend: local dataflow, remote dataflow, logical operator, control operator]
  • 58. Grid node allocation algorithm (G2N)
      • Grid Greedy Node scheduling algorithm (G2N)
      • Offers maximum usage of scheduled resources during query evaluation
      • Basic idea: "an optimal parallel allocation strategy for an independent query operator … is the one in which the computed elapsed-time of its execution is as close as possible to the maximum sequential time in each node evaluating an instance of the operator"
      [Figure: operator instances of A allocated to node Bn, where $t(B_n)$ is the operator cost on this node and $t_1 + t_2 = t_x(B_n)$]
    Implementation
      • Core development in Java 1.5
      • Globus toolkit 4
      • Derby DBMS (catalog)
      • Tomcat, AJAX and Google Web Toolkit for the user interface
      • Runs on Windows, Unix and Linux
      • Source code, demo and user guide available at: http://dexl.lncc.br
  • 59. Summing-up
      • HadoopDB extends Hadoop with an expressive query language, supported by DBMSs
      • Keeps the Hadoop MapReduce framework
      • Queries are mapped to MapReduce tasks
      • For scientific applications, it remains an open question whether scientists will enjoy writing SQL queries
      • Algebra-like languages may seem more natural (e.g. Pig Latin)
    Pig Latin: a high-level language alternative to SQL
      • The use of declarative high-level languages such as SQL may not please the scientific community
      • Pig Latin tries to give an answer by providing a procedural language whose primitives are relational algebra operations
      • Pig Latin: A not-so-foreign language for data processing, Christopher Olston, Benjamin Reed et al., SIGMOD08
  • 60. Example
      • urls (url, category, pagerank)
      • In SQL
          – SELECT category, AVG(pagerank) FROM urls WHERE pagerank > 0.2 GROUP BY category HAVING COUNT(*) > 10^6
      • In Pig
          – good_urls = FILTER urls BY pagerank > 0.2;
          – groups = GROUP good_urls BY category;
          – big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
          – output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
    Pig Latin
      • A program is a sequence of steps
          – Each step executes one data transformation
      • Optimizations among steps can be generated dynamically, for example:
          – 1) spam_urls = FILTER urls BY isSpam(url);
          – 2) high_rank_urls = FILTER spam_urls BY pagerank > 0.8;
      [Figure: the written order 1, 2 can be executed as 2, 1]
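To make the step-by-step nature of the Pig program concrete, the same four steps can be written as ordinary Python over an in-memory list. This is only an analogy to show the data flow, not how Pig executes; the function name `big_category_avg` and the tuple layout `(url, category, pagerank)` are assumptions taken from the slide's schema.

```python
from collections import defaultdict

def big_category_avg(urls, min_rank=0.2, min_count=10**6):
    """Mirror of the Pig steps: FILTER, GROUP, FILTER on group size, FOREACH."""
    # good_urls = FILTER urls BY pagerank > 0.2;
    good_urls = [u for u in urls if u[2] > min_rank]

    # groups = GROUP good_urls BY category;
    groups = defaultdict(list)
    for _url, category, pagerank in good_urls:
        groups[category].append(pagerank)

    # big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
    big_groups = {c: ranks for c, ranks in groups.items() if len(ranks) > min_count}

    # output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
    return {c: sum(ranks) / len(ranks) for c, ranks in big_groups.items()}

urls = [
    ("a.example", "news", 0.5),
    ("b.example", "news", 0.25),
    ("c.example", "blog", 0.9),
    ("d.example", "news", 0.1),
]
# A tiny threshold replaces 10^6 so that the toy data produces a result.
result = big_category_avg(urls, min_count=1)
```

Each intermediate variable corresponds to one named Pig relation, which is what makes the procedural style easy to inspect step by step.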
  • 61. Data Model
      • Types:
          – Atom: a single atomic value
          – Tuple: a sequence of fields, e.g. ('DB', 'Science', 7)
          – Bag: a collection of tuples with possible duplicates
          – Map: a collection of data items where a key is associated with each data item, e.g. 'fanOf' → {'flamengo', 'music'}, 'age' → 20
    Operations
      • Per-tuple processing: FOREACH
          – Allows the specification of iterations over bags
              • Ex: expanded_queries = FOREACH queries GENERATE userId, expandedQuery(queryString);
          – Each tuple in a bag should be independent of all others, so parallelization is possible
          – FLATTEN
              • Permits flattening of nested tuples
              • Ex: (alice, {(ipod, nano), (ipod, shuffle)}) is flattened into (alice, ipod, nano) and (alice, ipod, shuffle)
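The effect of FLATTEN on the slide's example can be reproduced with a small Python function. This is a sketch of the semantics only, assuming a row whose last field is a bag (here a list) of nested tuples; `flatten` is a hypothetical helper, not a Pig API.

```python
def flatten(row):
    """Un-nest the bag in the last field: one output tuple per nested tuple."""
    *atoms, bag = row
    return [tuple(atoms) + tuple(inner) for inner in bag]

nested = ("alice", [("ipod", "nano"), ("ipod", "shuffle")])
flat = flatten(nested)

# FOREACH-style processing: apply the same function to every tuple of a bag.
# Each tuple is handled independently, which is what allows parallel execution.
bag = [nested, ("bob", [("kindle", "paperwhite")])]
all_flat = [t for row in bag for t in flatten(row)]
```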
  • 62. Olympic Laboratory
    Olympic Laboratory
      • Objective
          – To study high performance sports as a science discipline
          – To build the first sports laboratory in South America
      • US$ 10M project sponsored by FINEP (funding agency)
      • Departments:
          – Biochemistry, physiology, genetics, nutrition, computational modeling, computer science, physiology
  • 63. Our task
      • To support athletes' follow-up data
          – Athlete's training
          – Variation in biochemical elements
          – Variation in biometric variables
      • More recently
          – For some modalities, integrate meteorological conditions
    Analyses Board
      [Figure]
  • 64. Athletes follow-up database
      • Athletes' follow-up data modeled as trajectories
          – Register measurements from athletes in different training states
      • Trajectory model
          – Ordered set of measurements
          – Division of time into training states
          – Materialized view limited in time range
          – Imprecise measurements
              • Not detected = 0
              • < x → ]0,x[
              • y, y ≥ x
    More on Athlete's Trajectories
      • Stops: modelled as measurements
          – Qualified according to the athlete's training state
          – Training states (recovery, training, rest, …)
      • Moves: extrapolation between two stops
      • Trajectory: the set of measurements, ordered in time, and limited in time according to some criteria (e.g. a training program)
          – Measurements of the same observable element
          – Measurements of the same athlete
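One way to read the three cases for imprecise measurements is as intervals: "not detected" is the point 0, a reading reported as "< x" is the open interval ]0, x[, and a reading y at or above the detection limit x is the point y. The sketch below encodes that reading; the raw string formats `"ND"` and `"<0.5"`, the `Interval` class and `parse_measurement` are all assumptions made for illustration, not the laboratory's actual data format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    low: float
    high: float
    open_ends: bool = False   # True for ]low, high[, False for the point or [low, high]

    def contains(self, value):
        if self.open_ends:
            return self.low < value < self.high
        return self.low <= value <= self.high

def parse_measurement(raw):
    """Map a raw lab reading (assumed string format) to an interval."""
    raw = raw.strip()
    if raw == "ND":                      # not detected = 0
        return Interval(0.0, 0.0)
    if raw.startswith("<"):              # below detection limit x -> ]0, x[
        x = float(raw[1:])
        return Interval(0.0, x, open_ends=True)
    y = float(raw)                       # exact value y, with y >= x
    return Interval(y, y)

# A short, time-ordered trajectory of one observable for one athlete (toy data).
trajectory = [parse_measurement(r) for r in ["ND", "<0.5", "0.8", "1.2"]]
```

Keeping the interval instead of a single number lets later queries distinguish "certainly below the limit" from "exactly zero".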
  • 65. Metaphoric Trajectory!
      [Figures]
  • 66. Challenges
      • Integrating athletes' trajectories with weather information
      • How to efficiently store metaphoric trajectories?
          – TrajStore [Cudre-Mauroux et al., ICDE 2010]
          – SciDB
      • How to express and efficiently process similar trajectories
  • 67. Part I: Where are they coming from?
      • "Scientists are spending most of their time manipulating, organizing, finding and moving data, instead of researching. And it's going to get worse"
          – Office of Science Data-Management Challenge, DoE
  • 68. A petabyte seems like a lot, but
      LSST – Large Synoptic Survey Telescope
      • 800 images per night for 10 years!
      • 3D map of the Universe
      • 30 terabytes per night
      • 30 petabytes in 10 years
    LSST
      [Figure]
  • 69. DNA sequences published in GenBank (UK NCBI)
      In April 2012:
      • 1.5 × 10^7 sequences
          • 50% in 4 years
      • 1.3 × 10^11 base pairs
          • 30% in 4 years
    Communities
      • According to IDC, the amount of digital data available in our cyber-environment will surpass Avogadro's number in 2023 (> 10^23)
      • Yottabyte
  • 70. In numbers:
      • 12 terabytes of tweets every day (IBM, 2012)
      • 10 terabytes on Facebook every day
      • Some companies produce terabytes per hour, every day of the year
          – Events:
              • A subway door opening
              • Checking in at the airport
              • Buying a song on iTunes
    Scientific Communities
      [Figure]
  • 71. Government Data
      • Investments
      • Government programs
      • Taxes
      • Contracts, financial accountability reports
      • Indices: economic, social, education, health, …
      • Security and defense
    Historical Data
      [Figure]