Genoveva Vargas-Solar
Senior Scientist, French Council of Scientific Research, LIG-LAFMIA, France
genoveva.vargas@imag.fr
Moving forward data centric sciences
weaving AI, Big Data & HPC
AICCSA, Jordan, October-November 2018
http://www.vargas-solar.com
DATA REVOLUTION
“Data is everything and everything is data”, Pythian
Turning real-world phenomena into data, thanks to the Big Data trend
DATIFICATION
Rendering into data aspects of the world that have never been quantified.
Any individual can analyse huge amounts of data in short periods of time:
- Analytical knowledge: most of the crucial algorithms are accessible
- Rich data makes evidence-based decisions open to virtually any person or company
DIGITAL HUMANITIES
… UNFINISHED FUGUE
Fuga a 3 Soggetti (Contrapunctus XIV):
- a 4-voice triple fugue
- its third subject is based on the B-A-C-H motif
« At the point where the composer introduces the name BACH in the countersubject to this fugue, the composer died. »
What makes Bach sound like Bach?
http://www.washington.edu/news/2016/11/30/what-makes-bach-sound-like-bach-new-dataset-teaches-algorithms-classical-music/
The Art of Fugue is based on a single subject employed in some variation in each canon and fugue
Tasks enabled by the dataset:
• Identify the notes performed at specific times in a recording
• Classify the instruments that perform in a recording
• Classify the composer of a recording
• Identify precise onset times of the notes in a recording (see the sketch below)
• Predict the next note in a recording, conditioned on history

Music information retrieval
- Automatic music transcription
- Inferring a musical score from a recording

Generative models fabricating performances under various constraints
- Can we learn to synthesize a performance given a score?
- Can we generate a fugue in the style of Bach using a melody by Brahms?
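The onset-identification task, for instance, can be prototyped with an off-the-shelf library. A minimal sketch using librosa on a hypothetical audio file; this is illustrative only, not the dataset's own tooling:

```python
# Minimal onset-detection sketch, assuming librosa is installed
# and "recording.wav" is a hypothetical audio file.
import librosa

y, sr = librosa.load("recording.wav", sr=None)   # keep the native sample rate
onset_times = librosa.onset.onset_detect(y=y, sr=sr, units="time")
print(onset_times[:10])   # first ten estimated note-onset times, in seconds
```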
DATA SCIENCE
The representation of complex environments by rich data opens up the possibility of applying all the scientific knowledge regarding how to infer knowledge from data.
Definition:
- Methodology by which actionable insights can be inferred from data
- Complex, multifaceted field that can be approached from several points of view: ethics, methodology, business models, how to deal with big data, data engineering, data governance, etc.
Objective:
- Production of beliefs informed by data, to be used as the basis of decision making
- N.B. In the absence of data, beliefs are uninformed and decisions are based on best practices or intuition
DATA CENTRIC SCIENCES
Computational Science · Digital Humanities · Social Data Science · Network Science · Experimental Sciences
Data collections as the backbone for conducting experiments, driving hypotheses and leading to "valid" conclusions, models, simulations, understanding.
Develop methodologies weaving data management, greedy algorithms, and programming models that must be tuned for deployment on different target computer architectures.
Scale: 1000 Yottabytes = 1 Brontobyte; 1000 Brontobytes = 1 Geopbyte

DATA
- Computation (algorithm: mathematical model)
- Experiment setting (architecture: computing environment)
DIGITAL DATA COLLECTIONS
Consumed data:
• different sizes
• quality, uncertainty, degree of ambiguity
• evolution in structure, completeness, production conditions, and the conditions in which data is retrieved
• content; explicit cultural, contextual and background properties
• modification of access policies
Conditions of consumption:
• reproducibility, degree of transparency (avoiding "software artefacts")
Raw data: neither manageable nor exploitable as such
• Heterogeneous (variety)
• Huge (volume)
• Incomplete, imprecise, missing, contradictory (veracity)
• Continuous releases produced at different rates (velocity)
• Proprietary, critical, private (value)
EXPLORING DATA COLLECTIONS
https://web.facebook.com/data/
KEY MOTIVATIONS
✓ Helping to select the right tool for preprocessing or analysis
✓ Making use of humans' abilities to recognize patterns
We are not always sure what we are looking for (until we find it).

Query expression [guidance | automatic generation] [3,2]
• Multi-scale query processing for gradual exploration
• Query morphing to adjust for proximity results
• Queries as answers: query alternatives to cope with lack of providence

Results filtering, analysis, visualization [2]
• Result-set post-processing for conveying meaningful data

Data exploration systems & environments [1]
• Data system kernels tailored for data exploration: no preparation, easy to use, fast (database cracking; see the sketch after the references)
• Auto-tuning database kernels: incremental, adaptive, partial indexing
1. Xi, S. L., Babarinsa, O., Wasay, A., Wei, X., Dayan, N., & Idreos, S. (2017, May). Data Canopy: Accelerating exploratory statistical analysis. In Proceedings of the 2017 ACM International Conference on Management of Data (pp. 557-572). ACM.
2. Athanassoulis, M., & Idreos, S. (2015, May). Beyond the wall: Near-data processing for databases. In Proceedings of the 11th International Workshop on Data Management on New Hardware (p. 2). ACM.
3. Idreos, S., Dayan, N., Guo, D., Kester, M. S., Maas, L., & Zoumpatianos, K. Past and Future Steps for Adaptive Storage Data Systems: From Shallow to Deep Adaptivity.
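Database cracking, cited above, builds an index incrementally as a side effect of query processing. A toy, self-contained Python sketch of the idea (not the kernels from the references): each range query stably partitions the column around its bounds, and the resulting split positions are remembered, so later queries over cracked ranges become contiguous slices.

```python
import numpy as np

class CrackedColumn:
    """Toy database cracking: every range query partially reorganizes the
    column, so an index emerges incrementally as a by-product of querying."""

    def __init__(self, values):
        self.data = np.asarray(values).copy()
        self.cracks = {}  # pivot value -> split position: data[:pos] < pivot

    def _crack_on(self, pivot):
        if pivot not in self.cracks:
            below = self.data < pivot            # stable two-way partition
            self.data = np.concatenate([self.data[below], self.data[~below]])
            self.cracks[pivot] = int(below.sum())
        return self.cracks[pivot]

    def range_query(self, low, high):
        lo = self._crack_on(low)                 # first query pays the cost...
        hi = self._crack_on(high)
        return self.data[lo:hi]                  # ...later queries reuse the cracks

rng = np.random.default_rng(0)
col = CrackedColumn(rng.integers(0, 1_000, size=10_000))
print(len(col.range_query(100, 200)))            # cracks at 100 and 200 now persist
```

Because the partition is stable, earlier crack positions stay valid after later queries; this mirrors the "no preparation, adaptive indexing" behaviour the slide describes, in miniature.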
QUANTITATIVE ANALYSIS OF DATA
Concepts:
- Population: the collection of objects or items ("units")
- Sample: the observed part of the population
Descriptive statistics: simplify data by presenting quantitative descriptions (see the pandas sketch below)
- Measures and concepts to describe the quantitative features
- Provide summaries about the samples as an approximation of the population
Examples:
- Frequency of the notes performed at specific intervals in a recording
- Precise onset times of the notes in a recording
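A minimal pandas sketch of such descriptive summaries over note annotations; the CSV file and its column names are hypothetical, chosen only to illustrate the idea:

```python
# Descriptive statistics over a hypothetical per-note annotation file
# (columns: onset_seconds, duration_seconds, pitch, instrument).
import pandas as pd

notes = pd.read_csv("recording_1727_labels.csv")   # hypothetical file

print(notes[["onset_seconds", "duration_seconds", "pitch"]].describe())
print(notes["instrument"].value_counts())          # note frequency per instrument
print(notes.groupby("instrument")["pitch"].agg(["mean", "std"]))
```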
LOOKING BEYOND DATA
Inferential statistics: infer the characteristics of the population
- draws conclusions beyond the analysed data
- reaches conclusions regarding the stated hypotheses (see the SciPy sketch below)
Examples:
- Classify the instruments that perform in a recording
- Predict the next note in a recording, conditioned on history
- Infer a musical score from a recording
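A hedged SciPy sketch of one such inferential step, testing whether mean note durations differ between two sets of recordings; the data files are hypothetical:

```python
# Hypothesis test over hypothetical per-note duration extracts.
import numpy as np
from scipy import stats

bach_durations = np.loadtxt("bach_note_durations.txt")      # hypothetical data
brahms_durations = np.loadtxt("brahms_note_durations.txt")  # hypothetical data

# Do mean note durations differ between the two composers' recordings?
t_stat, p_value = stats.ttest_ind(bach_durations, brahms_durations,
                                  equal_var=False)          # Welch's t-test
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```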
DATA CURATION
Tasks: harvesting, exploring, describing, extracting meta-data, preserving, ETL.
Supporting platforms:
- Parallel data processing platforms: Spark (RDDs over tables/graphs), Hadoop ecosystem tools (e.g., Pig)
- NoSQL & NewSQL stores (parallel)
- Parallel data collection: Flink, Storm, Flume
- Parallel data querying & analytics for structured data provision: Spark (descriptive statistics functions), Hadoop ecosystem tools (e.g., Hive), parallel RDBMS, Big Data analytics stacks (AsterixDB, BDAS), parallel analytics (Matlab, R)
(A PySpark sketch of the descriptive statistics functions follows the citation.)
CURARE: Maintaining and Managing Data Collections Using Views. Gavin Kemp, Catarina Ferreira Da Silva, Genoveva Vargas-Solar, Parisa Ghodous. IEEE Transactions on Big Data (submitted).
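The Spark descriptive statistics functions mentioned above can be sketched in a few lines of PySpark; the HDFS path and schema are hypothetical, so treat this as a minimal sketch rather than the talk's actual pipeline:

```python
# Minimal PySpark sketch of descriptive statistics over a label file
# (path and column names are assumptions).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("curation-stats").getOrCreate()

notes = spark.read.csv("hdfs:///musicnet/labels.csv",
                       header=True, inferSchema=True)

notes.describe("start_time", "end_time", "note").show()  # count/mean/stddev/min/max
notes.groupBy("instrument").count().show()               # distribution per instrument

spark.stop()
```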
ARTIFICIAL INTELLIGENCE
BEYOND KNOWLEDGE
https://ai100.stanford.edu/2016-report
LOOKING BEYOND KNOWLEDGE
Music information retrieval
- Automatic music transcription
- Inferring a musical score from a recording
Generative models fabricating performances under various constraints
- Can we learn to synthesize a performance given a score?
- Can we generate a fugue in the style of Bach using a melody by Brahms?
SETTING UP DATA CENTRIC EXPERIMENTS
https://web.facebook.com/data/
https://azure.microsoft.com/
DATA COLLECTIONS ALIAS BIG DATA
- Data collections with characteristics that make them difficult to process on a single machine or in traditional databases
- A new generation of tools, methods and technologies to collect, process and analyse massive data collections
→ Tools imposing the use of parallel processing and distributed storage
DATA SCIENCE ECOSYSTEM & INTEGRATED DEVELOPMENT ENVIRONMENT
The integrated development environment (IDE) is an essential tool designed to maximize programmer productivity.
- The basic pieces of any IDE are three: the editor, the compiler (or interpreter), and the debugger.
- Examples: PyCharm, WingIDE, Spyder (Scientific PYthon Development EnviRonment)
Programming language:
- Python is one of the most flexible programming languages because it is a multiparadigm language
- Alternatives are MATLAB and R
Fundamental libraries for data scientists in Python: NumPy, SciPy, Scikit-learn, and Pandas (see the sketch below)
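As a toy illustration of how these four libraries combine, a hedged sketch of the composer-classification task from the MusicNet slides; the feature file and its column names are hypothetical:

```python
# Composer classification from simple note-level features
# (file and columns are assumptions, for illustration only).
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("recording_features.csv")       # hypothetical per-recording features
X = stats.zscore(df[["mean_pitch", "pitch_std", "notes_per_second"]].to_numpy(),
                 axis=0)                          # standardize features with SciPy
y = df["composer"].to_numpy()
print("classes:", np.unique(y))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```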
WEB INTEGRATED DEVELOPMENT ENVIRONMENT
DATA SCIENCE VIRTUAL MACHINE
COMPUTING CAPACITY
DATA BEYOND THE COMFORT ZONE
ELASTIC DATA PROCESSING & MANAGEMENT AT SCALE
[Diagram: data store families positioned by the rawness degree of the data collections they serve — relational databases, NewSQL, graph databases, extensible record stores, document stores, key-value stores — along two axes: increased versatility & complexity (toward curated data) versus increased scalability & speed; access patterns span look-up (R/W), querying, aggregation, navigation, processing and analytics.]
Analysis styles: descriptive statistics, inferential statistics, supervised learning, unsupervised learning.
INDEXING & STORING
Pipeline: multimedia, multiform data → data transformation → classification → indexing classes → tagged opus execution; input data is sharded & colocated over a distributed file system.
The annotated labels indicate:
• the precise time of each note in every recording
• the instrument that plays each note
• the note's position in the metrical structure of the composition
(A label-indexing sketch follows.)
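As an illustration of indexing such labels by time, a hedged sketch assuming the labels come as one CSV per recording with columns start_time, end_time, instrument and note; the path, column names and time units are assumptions (the real release may use sample indices rather than seconds):

```python
# Index MusicNet-style note labels by their time span.
import pandas as pd
from intervaltree import IntervalTree   # pip install intervaltree

labels = pd.read_csv("musicnet/labels/1727.csv")   # hypothetical path & schema

tree = IntervalTree()
for row in labels.itertuples():
    # index every note by its [start, end) span in the recording
    tree[row.start_time:row.end_time] = (row.instrument, row.note)

# All notes sounding at t = 30.0 in the recording:
print(sorted(iv.data for iv in tree[30.0]))
```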
SHARDING DATA ACROSS DIFFERENT STORES
Multimedia, multiform input data is sharded & colocated over a distributed file system.
MusicNet: 330 classical music recordings, 1 million annotated labels (http://homes.cs.washington.edu/~thickstn/musicnet.html)
Automatic and elastic data-collection sharding tools parametrize data access & exploitation for parallel programs that need to scale up on different target architectures.
SHARDING ACROSS DIFFERENT STORES
Input data is sharded & colocated over a distributed file system.
Factors: RAM, disk, CPU, network.
A sharded data architecture aims at:
- balanced and smooth fragmentation (size, location, availability)
- optimal distribution across the shards providing storage spaces (chunks)
SHARDING DATA ACROSS DIFFERENT STORES: RAW DATA COLLECTIONS
Persistence:
- Which part of the document must persist?
- Explicit vs. implicit persistence
- In memory / on hard disk / memory cache
Fragmentation/sharding & replication:
- Vertical or horizontal fragmentation
- Strategies: range, hash, tagged (see the PyMongo sketch below)
- Distribution & location
Availability & fault tolerance:
- Replication & distribution
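To make the range/hash/tagged strategies concrete, a hedged PyMongo sketch of hash sharding; the cluster address, database and collection names are hypothetical, and a running sharded cluster with a mongos router is assumed:

```python
# Hash-based sharding with MongoDB (names are assumptions).
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-router:27017")

client.admin.command("enableSharding", "musicnet")
# Hash sharding spreads recordings evenly across shards; range and
# zone/"tagged" sharding are the alternative strategies listed above.
client.admin.command("shardCollection", "musicnet.recordings",
                     key={"recording_id": "hashed"})

client.musicnet.recordings.insert_one(
    {"recording_id": 1727, "composer": "Bach", "labels": []})
```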
1. Idreos, S., Dayan, N., Guo, D., Kester, M. S., Maas, L., & Zoumpatianos, K. Past and Future Steps for Adaptive Storage Data Systems: From Shallow to Deep Adaptivity.
DATA DELIVERY FOR GREEDY PROCESSING
“Multi-view computational problem”
Iterative data processing and visualization tasks need to share CPU cycles
Data is a bottleneck
[Diagram: the application is served from DRAM and disk/database; compute sits on a CPU with multiple cores and a GPU with thousands of cores; link bandwidths are on the order of 1-5 GB/s and 1-10 GB/s.]
Provide data storage, fetching and delivery strategies:
- Architecture: distributed file system across nodes
- Data sharding and replication: on storage and in memory
- Fetching to fulfil multi-facet application requirements:
  - prefetching
  - memory indexing
  - reduced impedance mismatch
(A memory-mapping sketch follows.)
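One way to realize the fetching and prefetching ideas above is memory mapping, where the OS pages data in on demand (and prefetches sequential reads) instead of loading everything into DRAM. A minimal NumPy sketch; the file name and shape are hypothetical:

```python
# Memory-map a large on-disk array so consumers touch only the pages they read.
import numpy as np

SHAPE = (330, 44_100 * 60)   # hypothetical: 330 recordings, 1 minute @ 44.1 kHz

# Create the file once (illustration only).
audio = np.memmap("audio_bank.dat", dtype=np.float32, mode="w+", shape=SHAPE)
audio[0, :5] = [0.1, 0.2, 0.3, 0.2, 0.1]
audio.flush()

# Later consumers page in only the regions they actually access.
bank = np.memmap("audio_bank.dat", dtype=np.float32, mode="r", shape=SHAPE)
print(bank[0, :5])           # only this region is brought into memory
```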
OPPORTUNITIES
- Manage data collections with different uses and access patterns, because these properties tend to reach the limits of:
  - the storage capacity (main memory, cache and disks) required for archiving data collections permanently or for a certain period, and
  - the pace (computing speed) at which data must be consumed (harvested, prepared and processed).
- Build underlying value-added data managers that can:
  - exploit available resources, making a compromise between QoS properties and SLA requirements, considering all the levels of the stack;
  - deliver request results at a reasonable economic price, reliably and efficiently, despite the devices, the resources' availability and the data properties.
FINAL COMMENTS
Move from design based on intuition & experience to a more formal & systematic way to design systems.
Addressing data centric sciences problems is a matter of designing complex systems according to a multidisciplinary vision.
Let’s weave a golden trilogy
Big Data, AI & HPC