The document discusses big data in astronomy and the LIneA-DEXL case. It provides an outline and introduction to big data in science and hypothesis-driven research. It discusses data management techniques such as data partitioning and parallel workflow processing. It then provides details on the Laboratório Nacional de Computação Científica (LNCC) and its role in supporting computational modeling and bioinformatics. It discusses astronomy surveys that generate large amounts of data, such as the Dark Energy Survey, and the data challenges of the Large Synoptic Survey Telescope. Finally, it discusses the need for data infrastructure, metadata management, and distributed data management to support scientific research involving big data.
Emc 2013 Big Data in Astronomy
1. 07/02/13
EMC Summer School on
BIG DATA – NCE/UFRJ
Big Data in Astronomy
The LIneA-DEXL case
Fabio Porto (fporto@lncc.br)
LNCC – MCTI
DEXL Lab (dexl.lncc.br)
Outline
l Introduction
l Big Data in Science
l Hypothesis-Driven Research
l Data management
– Data partitioning
– Parallel workflow processing
l Final remarks
2 EMC Summer School 2013
Laboratório Nacional de
Computação Científica (LNCC)
Petropolis, Rio de Janeiro
LNCC - MCTI
l Graduate Course in Computational Modelling
– CAPES 6
l BioInformatics Laboratory
– Roche 454 high throughput sequencing
l Coordinator of INCT –MACC
– Medicine Supported by Computational Science
l Coordinator of SINAPAD
– HPC National System
l Thematic laboratories
– ACIMA
– MARTIN
– DEXL
– COMCIDIS
– HEMOLAB
– LABINFO
SINAPAD – National System of
High-Performance Processing
• Organized in
CENAPADS:
• Universities
• Research Centers
• Different
Architectures:
• Shared Disks
• Shared Memory
• GPUs
sinapad.lncc.br
CENAPADS
6840 CPU Cores + 8192 GPU Cores
~106.6 TFlops / ~17.3 TBytes RAM / ~ 2.3 PBytes Storage
The DEXL Lab Mission
l To support in-silico science with Big Data
management techniques;
– To develop interdisciplinary research with
contributions on data modelling, design and
management;
– To develop tools and systems in support of
in-silico science data management;
e-Astronomy
l LNCC is a member of the LIneA Lab:
– Laboratório Inter-institucional de Astronomia
l O.N., LNCC, CBPF, RNP
l Development of e-Astronomy infrastructure in support of astronomy surveys
l Official southern hemisphere DES node
l Large astronomy surveys:
– Sloan Digital Sky Survey
l Currently SDSS-3
– Dark Energy Survey
l DES – Brazil managed by LIneA laboratory
l 5,000 square degrees of the sky
– Large Synoptic Survey Telescope
l 20,000 square degrees of the sky
l Each patch visited 1000 times during 10 years
l One of the scientific domains with extreme data processing and
storage needs
l Big Data today !!!!
LSST – Large Synoptic Survey Telescope
• 800 images per night
during 10 years !!
• 3D Map of the Universe
• 30 TeraBytes per night
• 100 PetaBytes in 10 years
• 10^5 disks of 1 TB
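The figures above can be cross-checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the nightly volume and the survey duration come from the slide, while observing on all 365 nights of every year is an assumption that gives an upper bound.

```python
# Back-of-the-envelope check of the LSST figures quoted on the slide.
TB_PER_NIGHT = 30        # from the slide
YEARS = 10               # from the slide
NIGHTS_PER_YEAR = 365    # assumption: observing every night (upper bound)

total_tb = TB_PER_NIGHT * NIGHTS_PER_YEAR * YEARS
total_pb = total_tb / 1000   # decimal units: 1 PB = 1000 TB
disks_of_1tb = total_tb      # one 1 TB disk per terabyte, no redundancy

print(f"{total_pb:.1f} PB, about {disks_of_1tb:.0e} disks of 1 TB")
```

The result is of the same order as the 100 PB and 10^5 disks quoted on the slide.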
Sloan Portal
SkyServer – Sloan Project
Dark Energy Survey
l Dark Energy Survey
– Astronomy project to explain:
l Acceleration of the universe
l Nature of dark energy
– Data production
l DECam takes images of 1GB (400/night)
l Images are analyzed;
– galaxies and stars identified and catalogued
l Catalogs are stored in database systems
– Estimated at 1 billion rows and 1 thousand attributes
l LIneA is the official Brazilian contributor for the DES
collaboration
DES
Science
Pipelines
Global and local tests Test environment & CTIO
Un-supervised process
Cluster industrialization
Point source catalog
Masks, random catalogs
Addstar (MW, GC), Addqso
Findsat, Sparse, fitmodel
Stellar mass, LF, HOD fit
Classifier, photo-z
Identification, characterization
Cosmological parameters
BIG DATA in Science
l Scientific process is being remodelled to be developed
within an in-silico environment
l Powerful instruments:
– Digital telescopes
– DNA sequencers
– Mass spectrometers
l Huge simulations
– Weak lensing simulations
– Cardio-vascular system simulation
l Massive amounts of information streams in and out…
l Hypothesis-driven research supported by in-silico
infrastructure, methods, models…
Big Data needs for e-science
l Data archival infrastructure;
l Scientific life cycle metadata management;
l Distributed big data management;
– Parallel workflow processing;
– Parallel Analytical algorithms;
“Scientists are spending most of their time
manipulating, organizing, finding and moving
data, instead of researching. And it’s going to
get worse”
– The Office of Science Data-Management Challenge
Report, DOE, 2004
Scientific Experiment Life-cycle
Experiment
Data
[Mattoso et al. 2010]
MODELLING -
HYPOTHESIS-DRIVEN
RESEARCH
e-Science life cycle
[Figure: life-cycle stages: Phenomenon, Hypothesis Formulation, Modeling, Experiment Life-cycle, Publication]
Big Data Scenario in Scientific
exploration life-cycle
[Figure, adapted from Mattoso et al. 2010. Phases: Hypothesis and experiment goals; Experiment and workflow design; Workflow preparation; Workflow execution; Monitoring; Post-execution analysis. Stores: Hypotheses database, Workflow repository, Data sources, Provenance store, Results, Analysis]
Motivation
l As experiments produce more and more
data, extracting meaning out of these data
requires, among other things, contextualizing
the data
l Metadata about the research allows for
results sharing, fostering collaborative work
l Sharing knowledge about the scientific
reasoning
Hypotheses in Astronomy - DES
l Phenomenon:
– The expansion of the universe is speeding up
l Discovered in 1998 by scientists studying distant supernovae
l Supported by observations of the redshift of light from distant
supernovae
l Hypothesis
– A new odd behaviour named “Dark Energy” could make up
70% of the universe
– The universe is not homogeneous - it has regions with different
densities (our location is special….)
l Supporting evidence
– Weak gravitational lensing
– Galaxy clusters in different redshifts
Hypothesis in Big Data Analytics
l Scientific exploration is hypothesis-driven
– Nevertheless, hypotheses remain out of reach of
in-silico exploration (big data analyses ??)
l Big Data analysis is exploratory in nature
– Understanding what one is doing when exploring
Big Data requires a scientific hypothesis-driven
approach
l Corollary
– BIG Data needs hypothesis management
Context
l Scientists trying to understand some
phenomenon
– Formulate Hypothesis about Phenomenon behaviour
l Natural Phenomena
– Simulated by computational models
– Explained by Scientific hypothesis
l Time-Space varying
– Space represented by physical meshes
l 1D, 3D,…
– Time reflected on simulation ticks
Scientific Hypothesis
Human Cardio-vascular System
Elements of hypothesis-driven
research
l Scientific Phenomenon – an observable event
– occurs in space-time;
– characterized by observable quantities;
l Scientific Hypothesis – a falsifiable statement
proposed to explain a phenomenon [Popper 2012]
– We are interested in a conceptual representation that puts
forward the idea the hypothesis carries on
l Mathematical Model – a language specific
formalization of a scientific hypothesis
l Experiment – the set of computational artifacts put
together to validate a scientific hypothesis;
l Data – observed or experimental data used in
validating hypotheses;
Hypothesis modelling initiatives
l Robot Scientist
– [R.D.King et al] The automation of science, Science, 2009.
l HyQue and HyBrow
– [A. Callahan, M. Dumontier, and N. H. Shah]. HyQue:
Evaluating hypotheses using semantic web technologies.
Journal of Biomedical Semantics, 2(Suppl 2):S3, 2011.
– Modeling hypothesis as propositions in part of the domain
language
l Bioinformatics
l SWAN
– Y. Gao et al. Journal of Web Semantics, 2006
l J. Sowa, Process Ontology
[Figure: conceptual model relating Scientific Hypothesis, Phenomenon, Phenomenon Process (continuous and discrete), Space-Time Dimension, Physical Quantities, Mesh, State, Event, Observation, Simulation, Mathematical Model, Mathematical equation (formulae in XML), Formal Language, Formal Representation, Scientist, Domain ontology URL, Data View and Model View. Source: Porto et al. ER 2008, ER 2012]
Modelling Hypotheses and their
interconnections
[Figure: a lattice-theoretic representation of hypothesis interconnections, with nodes Weak lensing, Galaxy clustering, Earth special location, Non-uniform universe and Dark Energy]
Focus on Hypothesis modeling
l Scientific Hypothesis formulation as a
conceptual entity
l Structuring of research evolution
l Isomorphic representation of: hypothesis,
scientific model and phenomenon
l Structure amenable for data representation,
association, querying and publishing
Hypotheses Structuring: Lattice
The core entities of the
hypothesis conceptual model
Representation Isomorphism
Application: Linked Science
l An initiative to have a machine-readable
content describing the scientific exploration;
l Support reproducibility of experiments;
l To foster reusing previous results;
l The community needs a more “open”
science
Linked Science
(or Linked Open Science)
l Is an initiative to interconnect all scientific
assets;
l It is a combination of:
– Linked data, semantic web
– Open source;
– Scientific workflows and provenance (OPM);
– Scientific models;
– Cloud computing;
– …
Linked Science Core Vocabulary
(LSC)
l Defines a vocabulary (LSC) with “basic”
terms for science;
– More specific terminology shall be added by
individual communities (minimal ontological
commitment)
LSC Core Vocabulary
Extension to LSC
Published Research as Linked Data (1)
Blanco et al.’s published research as an LSC instantiation.
rdfs:Class | rdf:Resource | property | rdf:Literal
lsc:Researcher | authors1 | rdf:value | “P.J. Blanco, M.R. Pivello, S.A. Urquiza, and R.A. Feijóo.”
lsc:Research | research1 | dc:description | “Simulation of hemodynamic conditions in the carotid artery.”
lsc:Publication | pub1 | dc:title | “On the potentialities of 3D–1D coupled models in hemodynamics simulations.”
lsc:Data | dataset1 | dc:description | “Flow rate of 5.0 l/min as an inflow boundary condition at the aortic root, in observation of Avolio (1980) and others.”
lsc:Data | dataset2 | dc:description | “1D mechanical and geometric data from Avolio (1980).”
lsc:Data | dataset3 | dc:description | “MRI images processed for reconstructing the 3D geometry of both the left femoral and the carotid arteries.”
Phenomenon | p17 | dc:description | “Blood flow in the carotid artery.”
tisc:Region | region1 | dc:description | “The carotid artery, a part of the human CVS.”
owl:IntervalEvent | beat1 | dc:description | “A heart beat with period T = 0.8 s.”
Observable | ob1 | dc:description | “Blood flow rate.”
Observable | ob2 | dc:description | “Blood pressure.”
lsc:Hypothesis | h17 | rdfs:label | “blend(h13, h15, h16)”
Model | m17 | dc:description | “3D-1D coupled model with lumped windkessel terminals.”
Published Research as Linked Data (2)
Blanco et al.’s published research as an LSC instantiation.
rdfs:Class | rdf:Resource | property | rdf:Literal
lsc:Data | dataset4 | dc:description | “Plots of hemodynamic observables in the left femoral artery produced to validate the hypothesis.”
lsc:Data | dataset5 | dc:description | “Plots of hemodynamic observables in the carotid artery.”
lsc:Data | dataset6 | dc:description | “Scientific visualization of hemodynamic observables in the left femoral artery produced to validate the hypothesis.”
lsc:Data | dataset7 | dc:description | “Scientific visualization of hemodynamic observables in the carotid artery both with and without aneurism.”
lsc:Prediction | predict1 | rdf:value | “Sensitivity of local blood flow in the carotid artery to the heart aortic inflow condition.”
lsc:Prediction | predict2 | rdf:value | “Sensitivity of the cardiac pulse to the presence of an aneurysm in the carotid.”
lsc:Conclusion | conclusion1 | rdf:value | “3D-1D coupled models allow to perform quantitative and qualitative studies about how local and global phenomena are related, which is relevant in hemodynamics.”
Find in Blanco et al.'s microtheory a hypothesis (if any)
explaining phenomena of blood flow in microvascular
vessels and show which model formulates it.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX lsc: <http://linkedscience.org/lsc/ns#>
SELECT ?hypothesis_name ?model_name
WHERE {
?h rdfs:label ?hypothesis_name .
?m rdfs:label ?model_name .
?h a lsc:Hypothesis .
?p a lsc:Phenomenon .
?m a lsc:Model .
?h lsc:explains ?p .
?m lsc:formulates ?h .
?p dc:description ?d .
FILTER regex(?d, "blood flow", "i") .
FILTER regex(?d, "microvascular", "i")
}
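To make the evaluation of this graph pattern concrete, here is a minimal pure-Python sketch that mimics the query over a hand-written list of triples. It is not a SPARQL engine. The identifiers h17, m17 and p17 come from the slides; p18, h18, m18, the model labels and every lsc:explains / lsc:formulates link are hypothetical illustration data.

```python
import re

# Toy triple store. h17, m17 and p17 come from the slides; everything about
# h18, m18, p18, the model labels and the explains/formulates links is
# hypothetical illustration data.
TRIPLES = [
    ("h17", "rdf:type", "lsc:Hypothesis"),
    ("h17", "rdfs:label", "blend(h13, h15, h16)"),
    ("m17", "rdf:type", "lsc:Model"),
    ("m17", "rdfs:label", "3D-1D coupled model"),
    ("p17", "rdf:type", "lsc:Phenomenon"),
    ("p17", "dc:description", "Blood flow in the carotid artery."),
    ("h17", "lsc:explains", "p17"),
    ("m17", "lsc:formulates", "h17"),
    ("h18", "rdf:type", "lsc:Hypothesis"),
    ("h18", "rdfs:label", "hypothetical h18"),
    ("m18", "rdf:type", "lsc:Model"),
    ("m18", "rdfs:label", "hypothetical m18"),
    ("p18", "rdf:type", "lsc:Phenomenon"),
    ("p18", "dc:description", "Blood flow in microvascular vessels."),
    ("h18", "lsc:explains", "p18"),
    ("m18", "lsc:formulates", "h18"),
]

def objects(subject, predicate):
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]

def subjects(predicate, obj):
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]

def find_hypotheses(*keywords):
    """Return (hypothesis label, model label) pairs whose phenomenon
    description matches every keyword, case-insensitively."""
    results = []
    for p in subjects("rdf:type", "lsc:Phenomenon"):
        descriptions = objects(p, "dc:description")
        if not any(all(re.search(k, d, re.I) for k in keywords) for d in descriptions):
            continue
        for h in subjects("lsc:explains", p):
            if "lsc:Hypothesis" not in objects(h, "rdf:type"):
                continue
            for m in subjects("lsc:formulates", h):
                if "lsc:Model" not in objects(m, "rdf:type"):
                    continue
                for h_label in objects(h, "rdfs:label"):
                    for m_label in objects(m, "rdfs:label"):
                        results.append((h_label, m_label))
    return results
```

Calling find_hypotheses("blood flow", "microvascular") plays the role of the two FILTER clauses: p17 is rejected because its description does not mention microvascular vessels.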
Remarks
l Hypothesis modeling reflects the scientist's mental model
during data analyses;
– supports hypothesis-driven data exploration
– extends current eScience infrastructure;
l Scientific Hypotheses, Models and Phenomena are the
main primitives;
l The primitives may be represented as isomorphic lattices
with semantic associations among themselves;
l One can search, discover and mine hypotheses and related
scientific artefacts;
l ER 2012- MODIC Workshop
l ISWC 2012– Linked Science workshop
DATA MANAGEMENT
Dark Energy Survey
l Dark Energy Survey
– Astronomy project to explain:
l Acceleration of the universe
l Nature of dark energy
– Data production
l DECam takes images of 1GB (400/night)
l Images are analyzed; galaxies and stars are identified
and catalogued
l Catalogs are stored in database systems
Dark Energy Survey Project
l Main technical (CS) issue:
– Managing huge catalogs
– Relations loaded from standard FITS files
l Database features
– Single relation for each catalog
– Volume: 1 billion tuples x 1000 attributes (300GB)
– Queries
l Users submit ad-hoc queries to the database
l Usually too many results for each query
– Need to choose best results, e.g. using top-k techniques
l Some queries scan the whole database
– Looking for clusters of stars
Processing Astronomy data
[Figure: the astronomy catalogs serve two access patterns]
User access
- Ad-hoc queries
- Downloads
Scientific workflows
- Analysis
Ad-Hoc Queries
l Submitted by users through portal;
l For small size queries (Regions of the sky)
– Indexing based on ra, dec (e.g. Q3C)
l [Koposov, S., Bartunov, O., 2006] Q3C Quad Tree Cube,
Astronomical Data Analysis Software and Systems, 2006
l HTM, Hierarchical Triangular Mesh, MS SQL Server, Sloan
– Spatial functions (e.g. radial search)
– Queries on other attributes need finer-grained criteria
l For large size queries (whole sky)
– Exploit parallelism over partitioned data
l Data partitioning is efficient for small and large
queries
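As an illustration of what a radial search over (ra, dec) computes, here is a minimal sketch in plain Python. It does not use Q3C or HTM and does not reproduce their indexing schemes; the function names and the toy catalog are assumptions made for the example. A declination range test plays the role of the index pre-filter, and the exact spherical distance is then checked on the surviving rows.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (ra, dec) positions,
    computed with the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(min(1.0, h))))

def radial_search(catalog, ra, dec, radius_deg):
    # Cheap declination pre-filter first (safe because the separation is
    # never smaller than the declination difference), then the exact test.
    candidates = (row for row in catalog if abs(row["dec"] - dec) <= radius_deg)
    return [row["id"] for row in candidates
            if angular_separation_deg(row["ra"], row["dec"], ra, dec) <= radius_deg]

# Toy catalog with hypothetical objects
catalog = [
    {"id": 1, "ra": 10.0, "dec": -30.0},
    {"id": 2, "ra": 10.5, "dec": -30.2},
    {"id": 3, "ra": 200.0, "dec": 45.0},
]
```

A spatial index replaces the linear scan in the pre-filter, which is why small sky regions can be answered without touching most of the catalog.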
Astronomer’s coordinate system
Workflow queries
l Workflows process data retrieved from the Catalog
– Two systems
l Workflow engine
l Database engine
– Lack of integration
l upper bound on performance
– Large queries
l Parallelism obtained by data partitioning is jeopardized by
the consolidation of results performed by the DBMS;
l The workflow receives the data and redistributes it to parallelize activities
– Concurrency among workflows
l May impose huge penalties
Need to partition data
l Beneficial for both access patterns
– Ad-hoc and workflow
l How to apply it?
l Vertical partitioning
– Already applied based on semantic clustering of attributes
l Ra, dec
l Photometry, spectrometry, astrometry
l Horizontal partitioning
– Ra, Dec (the current approach)
– More fine grained criteria
l Being developed in collaboration with INRIA Montpellier
First Step: Hybrid Data Partitioning(HDP)
Standard criterion: range of ra, dec
Each horizontal partition (Criterion 1, …, Criterion k) holds the same vertical fragments:
Catalog (Id, Ra, Dec)
Catalog_s (Id, spectrometry)
Catalog_ph (Id, photometry)
Catalog_a (Id, astrometry)
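The combination of the two partitioning directions can be sketched as follows. This is an illustrative sketch only: the fragment names follow the slide, but the column names inside each fragment, the 90-degree ra step and the function names are assumptions made for the example.

```python
# Attribute groups follow the slide; the concrete column names are hypothetical.
VERTICAL_FRAGMENTS = {
    "Catalog": ["ra", "dec"],
    "Catalog_s": ["spec_z"],
    "Catalog_ph": ["mag_g", "mag_r"],
    "Catalog_a": ["pm_ra", "pm_dec"],
}

def horizontal_partition(row, ra_step=90.0):
    """Standard criterion: a range of ra (dec ranges would work the same way)."""
    return int(row["ra"] % 360.0 // ra_step)

def hybrid_partition(rows, ra_step=90.0):
    """Return {partition: {fragment: [records]}}; every record keeps the id
    so that the fragments can be joined back together."""
    out = {}
    for row in rows:
        part = out.setdefault(horizontal_partition(row, ra_step), {})
        for fragment, columns in VERTICAL_FRAGMENTS.items():
            record = {"id": row["id"], **{c: row[c] for c in columns}}
            part.setdefault(fragment, []).append(record)
    return out
```

A query that only needs photometry for one region of the sky then reads a single fragment of a single horizontal partition.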
IMPLEMENTATION
ALTERNATIVES
Using PGPOOL-II
l Pgpool II
– Implemented on top of PostgreSQL 9.1
– Central node coordinates data/query distribution/
replication
– Requests distributed through nodes
– Parallel query Processing
l Data partitioning based on a table column range
(e.g. id)
l For short queries, may reduce the amount of data
accessed
– Load balancing
l Concurrent requests directed to different DB copies
Parallelism & Load Balancing
[Figure: a Pgpool-II node sends a parallel query to two further Pgpool-II nodes, each of which replicates to two PostgreSQL instances]
Evaluation
l Strengths
– Extends PostgreSQL
– Load balances queries from concurrent workflows
– Scales up to 128 DB nodes
l Weaknesses
– Lack of support for spatial functions
– Partitioning based on a single column
– Ingestion can’t use COPY
QServ - LSST
l Developed by the LSST DM team
l Astronomy data management
l Horizontal partitioning based on declination
zones (one per node); data on each node is
distributed into chunks based on RA
l Approx. 1000 partitions
l Native support for spatio-temporal functions
l Built on top of MySQL
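The two-level scheme described above (declination zones, then RA chunks inside each zone) can be sketched in a few lines. This is not Qserv code and the counts are not Qserv's configuration: 18 zones of 56 chunks are chosen only because 18 * 56 = 1008 is close to the roughly 1000 partitions mentioned on the slide, and chunk_of is a hypothetical name.

```python
def chunk_of(ra, dec, n_zones=18, chunks_per_zone=56):
    """Map a position to (zone, chunk): zones are equal-height declination
    stripes, chunks are equal-width RA slices inside a zone.
    The default counts are illustrative only."""
    # dec in [-90, 90] maps to zone 0 .. n_zones - 1; dec = +90 goes in the last zone
    zone = min(int((dec + 90.0) * n_zones / 180.0), n_zones - 1)
    # ra is wrapped into [0, 360) before slicing
    chunk = min(int((ra % 360.0) * chunks_per_zone / 360.0), chunks_per_zone - 1)
    return zone, chunk
```

With equal-width RA slices, chunks near the poles cover less sky than chunks near the equator; a production system would need to account for that, which this sketch does not attempt.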
Evaluation
l Strengths
– Designed to support astronomy data surveys
– Highly scalable: ~1000 nodes
– First performance results are very promising
– Alignment with the LSST project
l Weaknesses
– Current culture based on PostgreSQL
Context (3/3)
l Requirement
– Efficient data storage and processing
l Challenges
– Large database size
– High number of attributes
– Evolving workload
– Mostly Scan Processing
l Questions:
a) How to efficiently process queries over catalogs?
b) How to efficiently process scientific workflows over
catalogs?
Current activities at DEXL
a) Design data partitioning strategies
– Cooperation with INRIA Montpellier- Zenith group
– Partition the data into blocks
l such that the number of query accesses to the blocks is
minimum
l Each block can be stored on a different machine
b) Efficient execution of scientific
workflows over partitioned data
a) Intuition
Queries and scientific workflows take a time to be processed
proportional to the size of their data
[Figure: data partitioning replaces a query Q over the whole dataset with sub-queries Q’, Q’’, Q’’’ over the partitions]
Partitioning the DB into Blocks
[Figure: relation R(a1,…,a9) divided into blocks B1, B2, …, Bm]
How to compute the best partitioning?
Problem statement
l Given
– Single relation database R(a1,…,an), n ~1000
– Initial workload: set of k queries W0 = {q1,…,qk}
– m empty fixed size blocks
l Assumptions
– Accessing a block ≈ accessing all its tuples
– Periodically new tuples and queries arrive
– No privilege to a particular attribute
l Goal
– Minimize the total block access during the execution of queries by:
l Optimal partitioning of R’s data in blocks
l Optimal query execution
– Adapt to the arrival of new data and queries
Overview of the solution
l Data partitioning : graph based algorithm
– Nodes: each data item (e.g. tuple) is a node in the graph
– Edges: an edge connects two data items if they are accessed by a
common query
– Edge weight : the number of queries that access both data items
– Goal: partition the graph into m equal size sub-graphs with minimum
edge cut
l Use a min-cut algorithm
l Block explanation
– Blocks are explained in terms of queries
l Each block is assigned an explaining query Bi = vi(R)
l Query processing
– Queries are compared to explaining queries
– Matching blocks are selected (we haven’t worked on that yet)
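The graph construction and the min-cut goal described above can be sketched on a toy workload. This is a minimal sketch under stated assumptions: the workload is invented, only two blocks are produced, and the exhaustive search stands in for a real partitioner, which would use a heuristic instead.

```python
from itertools import combinations

def co_access_graph(workload):
    """workload: list of sets of tuple ids, one set per query.
    Returns {(u, v): weight} with u < v, where weight is the number of
    queries that access both tuples."""
    weights = {}
    for accessed in workload:
        for u, v in combinations(sorted(accessed), 2):
            weights[(u, v)] = weights.get((u, v), 0) + 1
    return weights

def edge_cut(weights, block_of):
    """Total weight of the edges whose endpoints fall in different blocks."""
    return sum(w for (u, v), w in weights.items() if block_of[u] != block_of[v])

def best_bipartition(nodes, weights):
    """Exhaustive balanced 2-way min-cut. Only feasible for toy inputs."""
    nodes = sorted(nodes)
    half = len(nodes) // 2
    best = None
    for left in combinations(nodes, half):
        block_of = {n: (0 if n in left else 1) for n in nodes}
        cut = edge_cut(weights, block_of)
        if best is None or cut < best[0]:
            best = (cut, block_of)
    return best

# Hypothetical workload: five queries over six tuples
workload = [{1, 2, 3}, {1, 3}, {4, 5, 6}, {5, 6}, {3, 4}]
weights = co_access_graph(workload)
cut, block_of = best_bipartition({1, 2, 3, 4, 5, 6}, weights)
```

Here the only edge that crosses the two blocks is the one created by the query touching tuples 3 and 4, so that query is the only one that has to read both blocks.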
Partitioning strategy
Schism: VLDB2010
[Figure sequence: a graph is built step by step over the rows of a vertical fragment of the Catalog]
For each vertical fragment:
– We create a node for each row
– We increment the arc weight when two rows are accessed together,
for every query in the workload W = {q1,…,qn}
– We execute a min-cut algorithm
– Each partition is assigned a block (B1, B2)
Partitioned data with queries
Each block is associated with the queries that access
some records of the block
B1: {q3, q5, …, q13}
B2: {q1, q2, …, q14}
For a given query q the number of accessed blocks is minimized
Adaptive Strategy (1/2)
l New tuple arrival: [DEXA 2012]
– Select the best block
l i.e. the block to which the new tuple is most correlated
– Challenges:
l How to select the best block with minimum effort?
– Initial approach : find it based on the correlation of queries
to blocks
– Define optimal allocation
– Compute actual allocation efficiency
– Compute block affinity
l What if the best block is full?
– Initial approach: split the block
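The allocation of a newly arrived tuple can be illustrated with a small sketch. It is not the algorithm of the DEXA 2012 paper: the affinity measure (number of shared queries), the naive split in half and all function names are assumptions chosen to make the idea concrete.

```python
def affinity(tuple_queries, block_queries):
    """Hypothetical affinity measure: the number of workload queries that
    access both the new tuple and at least one tuple already in the block."""
    return len(tuple_queries & block_queries)

def allocate(tuple_id, tuple_queries, blocks, capacity):
    """blocks: list of dicts {"tuples": [...], "queries": set()}.
    Places the tuple in the block with the highest affinity; a full block is
    split in two first, here naively in half. Returns the chosen block index."""
    best = max(range(len(blocks)),
               key=lambda i: affinity(tuple_queries, blocks[i]["queries"]))
    block = blocks[best]
    if len(block["tuples"]) >= capacity:
        half = len(block["tuples"]) // 2
        # Assumption: the new block inherits the full query set, which
        # over-approximates the queries touching each half.
        new_block = {"tuples": block["tuples"][half:], "queries": set(block["queries"])}
        block["tuples"] = block["tuples"][:half]
        blocks.append(new_block)
    block["tuples"].append(tuple_id)
    block["queries"] |= tuple_queries
    return best
```

The point of the sketch is that the decision only needs the per-block query sets, not a re-run of the graph partitioner over the whole database.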
Allocation based on
affinity to blocks
Elapsed-time of incrementing the
DB as the size increases
[Figure: execution time (s), on a scale from 0.1 to 1000, versus DB size from 2M to 20M, for static partitioning and for DynPart with |D| = 500 k and |D| = 1 M]
Experiment:
- Sloan DR8 – 350 million tuples
- workload: synthetic, 27000 queries
- PaToH – hyper-graph partitioner
E-ASTRONOMY WORKFLOWS
OVER PARTITIONED DATA
Processing Scientific Workflows
l Analytical Workflows process a large part of Catalog data
– Catalogs are supported by few indexes, thus most queries
scan tens-to-hundreds of millions of tuples
l Parallelization comes to the rescue to reduce analysis
elapsed time, but
– Compromise between:
l Data partitioning and degree of parallelization;
– Current solutions consider:
l Centralized files to be distributed through nodes (MapReduce)
– [Alagiannis, SIGMOD, 2012] NoDB – reading raw files without data
ingestion;
l Distributed databases (Qserv) to serve Workflow engines
– [Wang, D.L., 2011] Qserv: A Distributed Shared-Nothing Database for the
LSST catalog;
l Centralized databases to serve Workflow Engine (Orchestration LIneA)
l Partitioned database to serve distributed queries (HadoopDB)
HadoopDB - a step in between
[Abouzeid09]
l Offers parallelism and fault tolerance as Hadoop does,
with SQL queries pushed down to the PostgreSQL
DBMS;
l Pushed-down queries are implemented as MapReduce
functions;
l Data are partitioned across nodes.
– Partitioning information stored in the catalog
– Distributed through the N nodes
HadoopDB architecture
[Figure: an SQL query enters the SMS Planner, which uses the Catalog and the MapReduce Framework; each of Node 1, Node 2, …, Node n runs a Task Tracker, a Database and a DataNode]
Example
Select year(SalesDate), sum(revenue)
From Sales
Group by year(SalesDate)
a) Table partitioned by year(SalesDate)
Map: the full query (Select Year(SalesDate), Sum(revenue) From Sales Group by year(SalesDate)) runs on each node, followed by a FileSink Operator
b) No partitioning by year(SalesDate)
Map: the same query runs on each node, followed by a Reduce Sink Operator
Reduce: Group by Operator, Sum Operator, FileSink Operator
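The difference between the two plans can be sketched in plain Python. This is not HadoopDB code: local_query stands in for the SQL pushed down to one node's database, the partitions are lists of hypothetical (year, revenue) rows, and the function names are assumptions.

```python
from collections import defaultdict

def local_query(partition):
    """Stands in for the SQL pushed down to one node's database:
    SUM(revenue) grouped by year over the local rows only."""
    sums = defaultdict(float)
    for year, revenue in partition:
        sums[year] += revenue
    return dict(sums)

def run_partitioned_by_year(partitions):
    """Case (a): each year lives on exactly one node, so the local results
    are already final and a map-only job just collects them."""
    result = {}
    for partition in partitions:
        result.update(local_query(partition))
    return result

def run_unpartitioned(partitions):
    """Case (b): a year may appear on several nodes, so a reduce phase must
    group the partial sums by year and add them up."""
    totals = defaultdict(float)
    for partition in partitions:
        for year, partial in local_query(partition).items():
            totals[year] += partial
    return dict(totals)
```

Using the map-only plan on data that is not partitioned by year would silently overwrite partial sums, which is why the planner needs the partitioning information kept in the catalog.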
Traditional WF–Database
decoupled architecture
[Figure: a workflow engine running activities act1, act2, act3 on top of a database with partitions DBp1, DBp2, DBp3]
Data is consolidated as
input to the workflow engine
Problems
l Data locality
– Workflow activities run on nodes remote from the
partitioned data;
l Load Balance
– Local processes face different processing times
Data locality
l Traditional distributed query processing pushes
operations through joins and unions so that they can
be executed close to the data partitions;
l Can we “localize” workflow activities?
– Moving activities in workflows requires operation
semantics to be exposed
– Mapping of workflow activities to a known algebra
– Equivalence of algebra expressions enabling pushing
down operations
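One such equivalence is that a filter distributes over a union, so it can be applied at each partition before any data is shipped. The sketch below shows this rule on in-memory lists; the function names and the magnitude predicate are assumptions for the example and are not taken from QEF or any workflow engine.

```python
def union(*relations):
    return [row for relation in relations for row in relation]

def filter_rows(predicate, relation):
    return [row for row in relation if predicate(row)]

def consolidated_plan(partitions, predicate):
    """Filter applied after the partitions are unioned at the workflow engine."""
    return filter_rows(predicate, union(*partitions))

def pushed_down_plan(partitions, predicate):
    """Equivalent plan: Filter(R U S) = Filter(R) U Filter(S), so each
    partition is filtered locally and only the survivors are shipped."""
    return union(*(filter_rows(predicate, p) for p in partitions))

def rows_shipped(partitions, predicate=None):
    """Rows sent over the network: every row without push-down,
    only the rows passing the predicate with it."""
    if predicate is None:
        return sum(len(p) for p in partitions)
    return sum(len(filter_rows(predicate, p)) for p in partitions)
```

Both plans return the same rows; the gain of the pushed-down plan is in the number of rows that leave the data nodes.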
Algebraic transformation
[Figure: four panels showing a workflow of Map and Filter operators over relations R, S, T and their union: (i) workflow, relation perspective; (ii) decomposition; (iii) anticipation; (iv) procrastination]
Workflow optimization process
[Flowchart]
1. Initial algebraic expressions
2. Generation of search space, using transformation rules
3. Equivalent algebraic expressions
4. Evaluation of search strategy, using the cost model
5. Search more? If yes, return to step 2; if no, continue
6. Optimized algebraic expressions
Pushing down workflow activities
l A first naïve attempt
– Push down all operations before a Reduce;
l Use a MapReduce implementation where
– Mappers execute the “pushed-down” operations
close to the data
Typical Implementation at LIneA Portal
Spatial partitioning
Catalog DB
Parallel workflow over partitioned
data
Partitioned catalogue stored on PostgreSQL
[Figure: each partition DBp1, DBp2, …, DBpn feeds its own SkyMap activity, and the SkyMap outputs feed a single SkyAdd activity]
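The shape of this workflow, one activity per partition followed by a single combining activity, can be sketched as below. The slides do not define what SkyMap and SkyAdd compute, so the bodies here are hypothetical stand-ins: sky_map counts objects per grid cell and sky_add sums the per-partition maps.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def sky_map(partition, cell_deg=1.0):
    """Hypothetical stand-in for SkyMap: count the objects of one partition
    in each (ra, dec) cell of a regular grid."""
    counts = Counter()
    for ra, dec in partition:
        counts[(int(ra // cell_deg), int(dec // cell_deg))] += 1
    return counts

def sky_add(maps):
    """Hypothetical stand-in for SkyAdd: add the per-partition maps cell by cell."""
    total = Counter()
    for m in maps:
        total.update(m)
    return total

def run(partitions):
    # One SkyMap per partition, run concurrently, then a single SkyAdd.
    with ThreadPoolExecutor() as pool:
        return sky_add(pool.map(sky_map, partitions))
```

Because the combining step is an addition, the per-partition maps can be produced in any order and on any node.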
HQOOP - Parallelizing
Pushed-down Scientific Workflows
l Partition of data across cluster nodes
– Partitioning criteria
l Spatial (currently used and necessary for some applications)
l Random (possible in SkyMap)
l Based on query workload (Miguel Liroz-Gistau’s work)
l Process the workflow close to data location
– Reduce data transfer
l Use the Apache Hadoop implementation to manage parallel
execution
l Widely used in Big Data processing;
l Implements the MapReduce programming paradigm;
l Fault tolerance for failed Map processes;
l Use QEF as workflow engine
– Implements the Mapper interface
– Runs workflows in Hadoop seamlessly;
Perspective
[Figure: positioning of systems by workflow engine parallelization and data distribution. Labels: Qserv + Workflow, HQOOP, Orchestration layer, Hadoop + Kepler, MapReduce, Query Distribution, HadoopDB + Hive]
Integrated architecture
[Figure: three workflow engine instances, each running activities act1, act2, act3 next to one database partition (DB1, DB2, DB3), together producing the final result]
Experiment Set-up
l Cluster SGI
– Configurations: 1, 47 and 95 nodes;
– Each node:
l 2 proc. Intel Xeon – X5650, 6 cores, 2.67 GHz
l 24 GB RAM
l 500 GB HD
l Data
– Catalog DC6B
l Hadoop
– QEF workflow engine
Preliminary Results
l Preliminary results are encouraging:
– Baseline Orchestration layer (234 nodes) –
approx. 46 min
– 1 node HQOOP – approx. 35 min
– 4 nodes HQOOP – approx. 12.3 min
– 95 nodes (94 workers) HQOOP – approx. 2.10
min
– 95 nodes (94 workers) Hadoop+Python – approx.
2.4 min
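The scale-up implied by these timings can be worked out directly. The elapsed times are the ones reported above; treating the 4-node run as 4 workers is an assumption, since the slide only states the worker count for the 95-node run.

```python
# Elapsed times in minutes, as reported on the slide, keyed by worker count.
# Assumption: the "4 nodes" run is counted as 4 workers.
hqoop_minutes = {1: 35.0, 4: 12.3, 94: 2.10}

def speedup(workers):
    """Speedup relative to the 1-node HQOOP run."""
    return hqoop_minutes[1] / hqoop_minutes[workers]

def efficiency(workers):
    """Fraction of linear scale-up achieved."""
    return speedup(workers) / workers

for workers in (4, 94):
    print(f"{workers} workers: speedup {speedup(workers):.1f}x, "
          f"efficiency {efficiency(workers):.0%}")
```

Under these assumptions the speedup keeps growing with the number of workers while the efficiency drops, which is the usual pattern when a fixed-size job is spread over more nodes.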
Resulting Image
Conclusions
l Big data users (scientists) are in Big Trouble;
– Too much data, too fast, too complex;
l Different expertise required to cooperate
towards Big Data Management;
l Adapted software development methods
based on workflows;
l Complete support to scientific exploration
life-cycle
l Efficient workflow execution on Big Data
Collaborators
l LNCC Researchers
– Ana Maria de C. Moura
– Bruno R. Schulze
– Antonio Tadeu Gomes
l PhD Students
– Bernardo N. Gonçalves
– Rocio Millagros
– Douglas Ericson de Oliveira
– Miguel Liroz-Gistau (INRIA)
– Vinicius Pires (UFC)
Collaborators
l ON
– Angelo Fausti
– Luiz Nicolaci da Costa
– Ricardo Ogando
l COPPE-UFRJ
– Marta Mattoso
– Jonas Dias (PhD Student)
– Eduardo Ogasawara (CEFET-RJ)
l UFC
– Vania Vidal
– José Antonio F. de Macedo
l PUC-Rio
– Marco Antonio Casanova
l INRIA-Montpellier
– Patrick Valduriez group
l EPFL
– Stefano Spaccapietra
Overall performance
[Figure: two bar charts, elapsed time (min) and % linear scale-up, for Baseline (234 nodes), 1 node HQOOP, 4 nodes HQOOP, 94 nodes HQOOP and 94 nodes Hadoop]
[Figure: two stacked bar charts of Hadoop time and Reduce time for 47 and 94 nodes, with and without QEF, one chart for the CENT configurations and one for the DIST configurations]
Execution with 4 nodes
Elapsed-time total: 11.27 min
Adaptive and Extensible Query Engine
l Extensible to data types
l Extensible to application algebra
l Extensible to execution model
l Extensible to heterogeneous data sources
Objective
• Offer a query processing framework that can be extended to adapt to the needs of data-centric applications;
• Offer transparency in using resources to answer queries;
• Query optimization transparently introduced;
• Standardize remote communication using web services, even when dealing with large amounts of unstructured data;
• Run-time performance monitoring and decision
Control Operators
• Add data-flow and transformation operators
• Isolate application-oriented operators from execution-model data-flow concerns
• Parallel grid-based execution model:
• Split/Merge - controls the routing of tuples to parallel nodes and the corresponding unification of multiple routes into a single flow
• Send/Receive - marshalling/unmarshalling of tuples and interface with communication mechanisms
• B2I/I2B - blocks and unblocks tuples
• Orbit - implements a loop in a data-flow
• Fold/Unfold - logical serialization of complex structures (e.g. PointList to Points)
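The control operators listed above can be sketched with plain Python functions. This is a minimal illustration, not QEF's implementation: the function names, the round-robin routing in `split`, and the reading of I2B as "instances to block" and B2I as "block to instances" are all assumptions.

```python
def split(tuples, n):
    """Split: route tuples round-robin to n parallel routes (assumed policy)."""
    routes = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        routes[i % n].append(t)
    return routes


def merge(routes):
    """Merge: unify multiple routes into a single flow."""
    for route in routes:
        yield from route


def i2b(tuples, block_size):
    """I2B (assumed meaning): pack individual tuples into blocks."""
    block = []
    for t in tuples:
        block.append(t)
        if len(block) == block_size:
            yield block
            block = []
    if block:  # emit the last, possibly partial, block
        yield block


def b2i(blocks):
    """B2I (assumed meaning): unpack blocks back into individual tuples."""
    for block in blocks:
        yield from block


points = [(i, i * i) for i in range(7)]
routes = split(points, 3)
blocks_per_route = [list(i2b(route, 2)) for route in routes]
restored = sorted(merge(b2i(blocks) for blocks in blocks_per_route))
```

Application operators placed between `b2i` and `i2b` on each route would see plain tuples only, which is the isolation from data-flow concerns that the slide asks for.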
The Execution Model
Example of a simple QEF workflow
[Diagram labels: Output; Operator (possibly distributed over a Grid environment); Data sources (Input); Integration unit (Tuple) containing data source units]
Iteration Model
[Diagram: the OPEN, GETNEXT and CLOSE calls are each passed along the chain of operators A, B, C and the DataSource; the GETNEXT calls deliver the Results]
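The OPEN / GETNEXT / CLOSE protocol of the iteration model can be sketched as a chain of objects that each forward the call to their input. This is a minimal sketch under assumptions: the class names, the use of `None` as end-of-stream marker, and the choice of C as the operator closest to the results are not taken from the slides.

```python
class DataSource:
    """Leaf of the plan: hands out stored rows one at a time."""

    def __init__(self, rows):
        self._rows = rows
        self._pos = None

    def open(self):
        self._pos = 0

    def get_next(self):
        if self._pos >= len(self._rows):
            return None  # end of stream
        row = self._rows[self._pos]
        self._pos += 1
        return row

    def close(self):
        self._pos = None


class Operator:
    """Unary operator: forwards open/close and transforms each row."""

    def __init__(self, child, fn):
        self.child = child
        self.fn = fn

    def open(self):
        self.child.open()

    def get_next(self):
        row = self.child.get_next()
        return None if row is None else self.fn(row)

    def close(self):
        self.child.close()


def run(root):
    """Drive the plan from its root: open once, pull until exhausted, close."""
    root.open()
    results = []
    while True:
        row = root.get_next()
        if row is None:
            break
        results.append(row)
    root.close()
    return results


source = DataSource([1, 2, 3])
a = Operator(source, lambda x: x + 1)
b = Operator(a, lambda x: x * 10)
c = Operator(b, lambda x: f"v={x}")
results = run(c)  # ['v=20', 'v=30', 'v=40']
```

Each row is pulled through the whole chain before the next one is requested, so no operator has to materialize its full input.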
Distribution and Parallelization
Operator distribution
A Query Optimizer selects a set of operators in the QEP (query execution plan) to execute over a Grid environment.
[Diagram: operator B is replicated as B1, B2 and B3 in a plan with operators A, C and the DataSource]
General Parallel Execution Model
Remote QEP
In order to parallelize an execution, the initial QEP is modified and sent to remote nodes to handle the distributed execution.
[Diagram: initial plan and modified plan. Legend: control operator, distributed operator, user's operator; R: Receiver, S: Sender, Sp: Split, M: Merge]
Modifying IQEP to adapt to execution model (TCP)
Query optimizer adds control operators according to execution model and IQEP statistics
[Diagram labels: Particles, Orbit, Split, merge, Send, Receive, I2B, B2I, SJ, TJ, Velocity, Geometry, A; control node and remote node i. Legend: local dataflow, remote dataflow, logical operator, control operator]
Grid node allocation algorithm (G2N)
Grid Greedy Node scheduling algorithm (G2N)
• Offers maximum usage of scheduled resources during query evaluation.
• Basic idea: “an optimal parallel allocation strategy for an independent query operator … is the one in which the computed elapsed-time of its execution is as close as possible to the maximum sequential time in each node evaluating an instance of the operator”.
[Diagram: operator A feeding an instance Bn; t(Bn) is the operator cost on this node; t1 + t2 = t_x(Bn)]
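The slides state the principle behind G2N but not its steps, so the following is not G2N itself. It is a generic greedy sketch of the stated idea: hand each batch of tuples to the node that would finish it earliest, so that the parallel elapsed time stays close to the sequential time spent on each node. The function names and the per-node batch costs are hypothetical.

```python
import heapq


def greedy_allocate(num_batches, cost_per_batch):
    """Assign batches to nodes greedily.

    cost_per_batch: dict node -> time units needed per batch (hypothetical).
    Returns a dict node -> number of batches assigned.
    """
    # Heap entries: (finish time if this node takes one more batch, node).
    heap = [(cost, node) for node, cost in cost_per_batch.items()]
    heapq.heapify(heap)
    assigned = {node: 0 for node in cost_per_batch}
    for _ in range(num_batches):
        finish, node = heapq.heappop(heap)
        assigned[node] += 1
        heapq.heappush(heap, (finish + cost_per_batch[node], node))
    return assigned


def elapsed(assigned, cost_per_batch):
    """Parallel elapsed time: the slowest node's sequential time."""
    return max(assigned[n] * cost_per_batch[n] for n in assigned)


costs = {"n1": 1.0, "n2": 2.0}  # n2 is twice as slow as n1
plan = greedy_allocate(6, costs)
total = elapsed(plan, costs)
```

In this example the faster node receives four batches and the slower one two, so both nodes finish at the same time and no scheduled resource sits idle.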
Implementation
• Core development in Java 1.5.
• Globus toolkit 4.
• Derby DBMS (catalog).
• Tomcat, AJAX and Google Web Toolkit for user
interface.
• Runs on Windows, Unix and Linux.
• source code, demo, user guide available at:
http://dexl.lncc.br
Summing-up
l HadoopDB extends Hadoop with an expressive query language, supported by DBMSs
l Keeps the Hadoop MapReduce framework
l Queries are mapped to MapReduce tasks
l For scientific applications, it remains an open question whether scientists will enjoy writing SQL queries
l Algebraic-like languages may seem more natural (e.g. Pig Latin)
Pig Latin - a high-level language alternative to SQL
l The use of high-level languages such as SQL may not please the scientific community;
l Pig Latin tries to give an answer by providing a procedural language whose primitives are relational algebra operations;
l Pig Latin: A Not-So-Foreign Language for Data Processing, Christopher Olston, Benjamin Reed et al., SIGMOD 2008;
Example
l Urls (url, category, pagerank)
l In SQL
– SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 10^6
l In Pig Latin
– good_urls = FILTER urls BY pagerank > 0.2;
– groups = GROUP good_urls BY category;
– big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
– output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
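The same query can be traced step by step in Python, which makes the correspondence between the SQL clauses and the Pig Latin steps explicit. This is a minimal sketch: the function name, the in-memory list of tuples and the sample rows are assumptions, and the group-size threshold is lowered in the usage example so that a tiny input produces a result.

```python
from collections import defaultdict


def avg_pagerank_by_category(urls, min_pagerank=0.2, min_count=10**6):
    """urls: iterable of (url, category, pagerank) tuples."""
    # WHERE pagerank > 0.2 / FILTER urls BY pagerank > 0.2, then
    # GROUP BY category / GROUP good_urls BY category
    groups = defaultdict(list)
    for _url, category, pagerank in urls:
        if pagerank > min_pagerank:
            groups[category].append(pagerank)
    # HAVING COUNT(*) > min_count / FILTER groups BY COUNT(good_urls), then
    # SELECT category, AVG(pagerank) / FOREACH ... GENERATE
    return {
        category: sum(ranks) / len(ranks)
        for category, ranks in groups.items()
        if len(ranks) > min_count
    }


sample = [
    ("a.com", "news", 0.9),
    ("b.com", "news", 0.5),
    ("c.com", "news", 0.1),  # dropped by the pagerank filter
    ("d.com", "blog", 0.3),  # its group is too small to be kept
]
result = avg_pagerank_by_category(sample, min_count=1)
```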
Pig Latin
l Program is a sequence of steps
– Each step executes one data transformation
l Optimizations among steps can be dynamically generated, for example:
– 1) spam_urls = FILTER urls BY isSpam(url);
– 2) Highrankurl = FILTER spam_urls BY pagerank > 0.8;
[Diagram: the execution order of steps 1, 2 can be swapped to 2, 1]
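Why swapping the two filters pays off can be shown by counting calls to the expensive predicate. This is a minimal sketch: `is_spam` is a hypothetical stand-in for the `isSpam` function, and the sample rows are invented for illustration.

```python
def make_counting_is_spam():
    """Return a stand-in spam test plus a counter of how often it ran."""
    calls = {"n": 0}

    def is_spam(url):
        calls["n"] += 1  # pretend this call is expensive
        return "spam" in url

    return is_spam, calls


urls = [
    ("spam1.com", 0.9),
    ("spam2.com", 0.1),
    ("ok.com", 0.95),
    ("spam3.com", 0.85),
]

# Order as written: spam test first, then the pagerank filter.
is_spam, calls_as_written = make_counting_is_spam()
spam_urls = [u for u in urls if is_spam(u[0])]
as_written = [u for u in spam_urls if u[1] > 0.8]

# Reordered: cheap pagerank filter first, spam test on the survivors only.
is_spam, calls_reordered = make_counting_is_spam()
high_rank = [u for u in urls if u[1] > 0.8]
reordered = [u for u in high_rank if is_spam(u[0])]
```

Both orders return the same rows because the two filters are independent of each other, but the reordered plan evaluates the expensive test on fewer rows.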
Data Model
l Types:
– Atom - a single atomic value;
– Tuple - a sequence of fields, e.g. (‘DB’, ‘Science’, 7)
– Bag - a collection of tuples with possible duplicates;
– Map - a collection of data items, each associated with a key, e.g.
‘fanOf’ → { ‘flamengo’, ‘music’ }
‘age’ → 20
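For readers more familiar with Python, the four types map roughly onto built-in structures. The mapping below is an analogy chosen for illustration, not part of Pig Latin; the variable names and the helper `bag_count` are assumptions.

```python
# Atom: a single atomic value.
atom = "alice"

# Tuple: a sequence of fields.
record = ("DB", "Science", 7)

# Bag: a collection of tuples; duplicates are allowed, so a list is used
# rather than a set.
bag = [("flamengo",), ("music",), ("music",)]

# Map: each data item is associated with a key; an item may be an atom
# or a nested bag.
profile = {"fanOf": [("flamengo",), ("music",)], "age": 20}


def bag_count(b, item):
    """Number of occurrences of `item` in bag `b`."""
    return sum(1 for t in b if t == item)
```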
Operations
l Per-tuple processing: FOREACH
– Allows the specification of iterations over bags
l Ex:
– expanded_queries = FOREACH queries GENERATE userId, expandedQuery(queryString);
– Each tuple in a bag should be independent of all others, so parallelization is possible;
– FLATTEN
l Permits flattening of nested tuples
(alice, {(ipod, nano), (ipod, shuffle)}) flattens to (alice, ipod, nano), (alice, ipod, shuffle)
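The behaviour of FOREACH and FLATTEN described above can be imitated in a few lines of Python. This is a minimal sketch with assumed names: `expand_query` is a hypothetical stand-in for the `expandedQuery` function, and bags are represented as lists.

```python
def foreach(bag, generate):
    """FOREACH: apply `generate` to every tuple independently of the others."""
    return [generate(t) for t in bag]


def flatten(bag):
    """FLATTEN: pair the leading field with each tuple of the nested bag."""
    return [(key, *inner) for key, nested in bag for inner in nested]


def expand_query(query_string):
    # Hypothetical expansion: one nested tuple per product variant.
    return [(query_string, variant) for variant in ("nano", "shuffle")]


queries = [("alice", "ipod")]
expanded = foreach(queries, lambda t: (t[0], expand_query(t[1])))
flat = flatten(expanded)
```

Because `generate` sees one tuple at a time, the list comprehension in `foreach` could be replaced by a parallel map without changing the result.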
Olympic Laboratory
Olympic Laboratory
l Objective
– To study high-performance sports as a scientific discipline
– To build the first sports laboratory in South America
l US$ 10M project sponsored by FINEP (funding agency)
l Departments:
– Biochemistry, physiology, genetics, nutrition, computational modeling, computer science
Our task
l To support athletes’ follow-up data
– Athletes’ training
– Variation in biochemical elements
– Variation in biometric variables
l More recently
– For some sports, integrate meteorological conditions
Analyses Board
Athletes follow-up database
l Athletes’ follow-up data modeled as trajectories
– Register measurements from athletes in different training states
l Trajectory model
– Ordered set of measurements
– Division of time into training states
– Materialized view limited in time range
– Imprecise measurements
l Not detected: recorded as 0
l Reported as < x: the open interval ]0, x[
l Reported as y, with y ≥ x: the value y
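One way to store the three kinds of imprecise measurement uniformly is to turn each of them into an interval. This is a minimal sketch, not the schema of the athletes database: the `Interval` class and the raw string encodings ("not detected", "<0.5", "1.2") are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    low: float
    high: float
    low_open: bool = False
    high_open: bool = False


def parse_measurement(raw):
    """Map a raw lab value (assumed string encoding) to an interval."""
    raw = raw.strip().lower()
    if raw == "not detected":
        return Interval(0.0, 0.0)  # recorded as exactly 0
    if raw.startswith("<"):
        x = float(raw[1:])
        return Interval(0.0, x, low_open=True, high_open=True)  # ]0, x[
    y = float(raw)
    return Interval(y, y)  # precise value y


below_limit = parse_measurement("<0.5")
absent = parse_measurement("Not detected")
exact = parse_measurement("1.2")
```

With this representation a precise value is simply a degenerate interval, so comparisons between measurements need only one code path.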
More on Athlete’s Trajectories
l Stops – modelled as measurements
– Qualified according to the athlete’s training state
– Training states (recovery, training, rest, …)
l Moves – extrapolation between two stops
l Trajectory – the set of measurements, ordered in time and limited in time according to some criterion (e.g. a training program).
– Measurements of the same observable element
– Measurements of the same athlete
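A trajectory of stops and moves can be queried for a value at any time inside its range. The slides do not say how a move is computed, so the sketch below assumes linear interpolation between the two surrounding stops; the function name and the sample values are hypothetical.

```python
from bisect import bisect_right


def value_at(stops, t):
    """Estimate the observed value at time t.

    stops: list of (time, value) pairs with strictly increasing times.
    """
    times = [time for time, _ in stops]
    if not times[0] <= t <= times[-1]:
        raise ValueError("t is outside the trajectory's time range")
    i = bisect_right(times, t)
    if i == len(stops):  # t is exactly the last stop
        return stops[-1][1]
    t0, v0 = stops[i - 1]
    t1, v1 = stops[i]
    # Move: assumed to be a straight line between the two stops.
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Hypothetical measurements of one observable element for one athlete.
trajectory = [(0, 10.0), (10, 20.0), (20, 10.0)]
midway = value_at(trajectory, 5)
at_stop = value_at(trajectory, 10)
at_end = value_at(trajectory, 20)
```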
Metaphoric Trajectory
Challenges
l Integrating athletes’ trajectories with weather information
l How to efficiently store metaphoric trajectories?
– TrajStore [Cudré-Mauroux et al., ICDE 2010]
– SciDB
l How to express and efficiently process similar trajectories?
Part I: Where are they coming from?
l “Scientists are spending most of their time manipulating, organizing, finding and moving data, instead of researching. And it’s going to get worse”
– Office of Science Data Management Challenge - DoE
A petabyte seems like a lot, but
LSST – Large Synoptic Survey Telescope
• 800 images per night for 10 years!
• 3D map of the Universe
• 30 terabytes per night
• 30 petabytes in 10 years
LSST
DNA Sequences Published in GenBank (US NCBI)
In April 2012:
• 1.5 x 10^7 sequences
• 50% in 4 years
• 1.3 x 10^11 base pairs
• 30% in 4 years
Communities
According to IDC, the amount of digital data available in our cyber-environment will exceed Avogadro's number in 2023 (> 10^23)
Yottabyte
In numbers:
l 12 terabytes of tweets every day (IBM, 2012)
l 10 terabytes on Facebook every day
l Some companies produce terabytes per hour, every day of the year
– Events:
l A subway door opening
l Checking in at the airport
l Buying a song on iTunes
Scientific Communities
Government Data
l Investments
l Government programs
l Taxes
l Contracts, accountability reports
l Indices: economic, social, education, health, …
l Security and Defense
Historical Data