This novel, multidisciplinary, data-centric scientific movement promises new and not-yet-imagined applications that rely on massive amounts of evolving data that must be cleaned, integrated, and analysed for modelling purposes. Yet data management issues are not usually perceived as central. In this keynote I will explore the key challenges and opportunities for data management in this new scientific world, and discuss how a data-centric artificial intelligence supported by high-performance computing (HPC) can best contribute to these exciting domains. Even if the motivation is not purely academic, the huge sums being devoted to related applications are moving industry and academia to explore these directions.
Moving forward data centric sciences weaving AI, Big Data & HPC
Genoveva Vargas-Solar
Senior Scientist, French Council of Scientific Research, LIG-LAFMIA, France
genoveva.vargas@imag.fr
AICSSA, Jordan, October-November, 2018
http://www.vargas-solar.com
“Data is everything and everything is data”, Pythian
Turning reality phenomena into data thanks to the Big Data trend
DATIFICATION
Rendering into data aspects of the world that have never been quantified.
Any individual can analyse huge amounts of data in short periods of time:
- Analytical knowledge: most of the crucial algorithms are accessible
- Use rich data to make evidence-based decisions, open to virtually any person or company
… UNFINISHED FUGUE
Fuga a 3 Soggetti (Contrapunctus XIV):
- a 4-voice triple fugue
- the third subject of which is based on the B-A-C-H motif
“At the point where the composer introduces the name BACH in the countersubject to this fugue, the composer died.”
What makes Bach sound like Bach?
http://www.washington.edu/news/2016/11/30/what-makes-bach-sound-like-bach-new-dataset-teaches-algorithms-classical-music/
The Art of Fugue is based on a single subject employed in some variation in each canon and fugue
• Identify the notes performed at specific times in a recording
• Classify the instruments that perform in a recording
• Classify the composer of a recording
• Identify precise onset times of the notes in a recording
• Predict the next note in a recording, conditioned on history
Music information retrieval
- Automatic music transcription
- Inferring a musical score from a recording
Generative models fabricating performances under various constraints
- Can we learn to synthesize a performance given a score?
- Can we generate a fugue in the style of Bach using a melody by Brahms?
DATA SCIENCE
The representation of complex environments by rich data opens up the possibility of applying all the scientific knowledge about how to infer knowledge from data.
Definition:
- A methodology by which actionable insights can be inferred from data
- A complex, multifaceted field that can be approached from several points of view: ethics, methodology, business models, how to deal with big data, data engineering, data governance, etc.
Objective:
- Production of beliefs informed by data, to be used as the basis of decision making
- N.B. In the absence of data, beliefs are uninformed and decisions are based on best practices or intuition
DATA CENTRIC SCIENCES
- Computational Science
- Digital Humanities
- Social Data Science
- Network Science
Data collections are the backbone for conducting experiments, driving hypotheses and leading to “valid” conclusions, models, simulations, and understanding.
Develop methodologies weaving data management, greedy algorithms, and programming models that must be tuned for deployment on different target computer architectures.
DIGITAL DATA COLLECTIONS
Consumed data:
- different sizes
- quality, uncertainty, degree of ambiguity
- evolution in structure, completeness, production conditions, and the conditions in which data is retrieved
- content, and explicit cultural, contextual, and background properties
- modification of access policies
Conditions of consumption:
- reproducibility, degree of transparency (avoid “software artefacts”)
DIGITAL DATA COLLECTIONS
RAW DATA: neither manageable nor exploitable as such
- Heterogeneous (variety)
- Huge (volume)
- Incomplete, imprecise, missing, contradictory (veracity)
- Continuous releases produced at different rates (velocity)
- Proprietary, critical, private (value)
EXPLORING DATA COLLECTIONS: KEY MOTIVATIONS
We are not always sure what we are looking for (until we find it). Exploration helps to select the right tool for preprocessing or analysis, and makes use of humans' ability to recognize patterns.
Query expression [guidance | automatic generation] [3,2]:
- Multi-scale query processing for gradual exploration
- Query morphing to adjust for proximity results
- Queries as answers: query alternatives to cope with the user's lack of a precise query
Results filtering, analysis, visualization [2]:
- Result-set post-processing for conveying meaningful data
Data exploration systems & environments [1]:
- Data system kernels tailored for data exploration: no preparation, easy to use, fast (database cracking)
- Auto-tuning database kernels: incremental, adaptive, partial indexing
1. Xi, S. L., Babarinsa, O., Wasay, A., Wei, X., Dayan, N., & Idreos, S. (2017). Data Canopy: Accelerating exploratory statistical analysis. In Proceedings of the 2017 ACM International Conference on Management of Data (pp. 557-572). ACM.
2. Athanassoulis, M., & Idreos, S. (2015). Beyond the wall: Near-data processing for databases. In Proceedings of the 11th International Workshop on Data Management on New Hardware (p. 2). ACM.
3. Idreos, S., Dayan, N., Guo, D., Kester, M. S., Maas, L., & Zoumpatianos, K. Past and Future Steps for Adaptive Storage Data Systems: From Shallow to Deep Adaptivity.
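The idea behind database cracking can be sketched in a few lines: the column is physically reorganized as a side effect of each range query, so an index emerges incrementally, query by query. This is a hypothetical toy version for illustration only, not the kernel described in the cited papers:

```python
# Toy sketch of database cracking: each range query partitions the column
# in place around its bounds and remembers the partition points, so later
# queries over the same bounds become simple slices.

class CrackedColumn:
    def __init__(self, values):
        self.data = list(values)
        self.index = {}  # pivot -> position p such that data[:p] < pivot <= data[p:]

    def _crack(self, pivot):
        # Partition the array around `pivot` once; later queries reuse it.
        if pivot in self.index:
            return self.index[pivot]
        lo = [v for v in self.data if v < pivot]   # stable partition keeps
        hi = [v for v in self.data if v >= pivot]  # earlier cracks valid
        self.data = lo + hi
        self.index[pivot] = len(lo)
        return len(lo)

    def range_query(self, low, high):
        # Cracking on both bounds makes the qualifying values contiguous.
        a = self._crack(low)
        b = self._crack(high)
        return self.data[a:b]

col = CrackedColumn([7, 2, 9, 4, 1, 8, 3])
print(sorted(col.range_query(3, 8)))  # [3, 4, 7]
```

The second query over the same bounds pays nothing: both pivots are already in the cracker index, so the answer is a slice.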
QUANTITATIVE ANALYSIS OF DATA
Concepts:
- Population: the collection of objects or items (“units”)
- Sample: an observed part of the population
Descriptive statistics: simplify data by presenting quantitative descriptions
- Measures and concepts that describe quantitative features
- Provide summaries about the samples as an approximation of the population
Examples:
- Frequency of the notes performed at specific intervals in a recording
- Identify precise onset times of the notes in a recording
LOOKING BEYOND DATA
Inferential statistics: infer the characteristics of the population
- draw conclusions beyond the analysed data
- reach conclusions about the hypotheses that were made
Examples:
- Classify the instruments that perform in a recording
- Predict the next note in a recording, conditioned on history
- Infer a musical score from a recording
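"Predict the next note, conditioned on history" can be sketched, at its very simplest, as a first-order Markov (bigram) model. The training sequence below is invented for illustration; real systems condition on much longer histories:

```python
# Minimal next-note predictor: count which note follows which in the
# training sequence, then predict the most frequent successor.
from collections import Counter, defaultdict

sequence = ["C", "E", "G", "C", "E", "G", "C", "D"]

transitions = defaultdict(Counter)
for prev, nxt in zip(sequence, sequence[1:]):
    transitions[prev][nxt] += 1

def predict_next(note):
    # Most frequent successor observed after `note` in the training data.
    return transitions[note].most_common(1)[0][0]

print(predict_next("C"))  # 'E' (follows C twice, vs 'D' once)
```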
DATA CURATION
Curation activities: harvesting, exploring, describing, extracting meta-data, preserving (ETL).
Supporting platforms (diagram):
- Parallel data collection: Flink, Storm, Flume
- Parallel data processing platforms: Spark (RDDs for tables/graphs), Hadoop ecosystem tools (e.g., Pig)
- Structured data provision: NoSQL & NewSQL (parallel)
- Parallel data querying & analytics: Spark (descriptive statistics functions), Hadoop ecosystem tools (e.g., Hive), parallel RDBMSs, Big Data analytics stacks (Asterix, BDAS), parallel analytics (MATLAB, R)
Gavin Kemp, Catarina Ferreira Da Silva, Genoveva Vargas-Solar, Parisa Ghodous. CURARE: Maintaining and Managing Data Collections Using Views. IEEE Transactions on Big Data (submitted).
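The map/reduce style behind platforms such as Spark or Hadoop can be emulated in miniature. This hedged sketch uses a thread pool from the standard library on a toy word-count job (the partitions are invented; real platforms distribute them across machines):

```python
# Map/reduce in miniature: count words per partition in parallel ("map"),
# then merge the partial counts ("reduce").
from concurrent.futures import ThreadPoolExecutor
from collections import Counter
from functools import reduce

def map_count(chunk):
    # "map" stage: local count for one partition
    return Counter(chunk.split())

def merge(a, b):
    # "reduce" stage: fold partial counts together
    a.update(b)
    return a

partitions = ["bach fugue", "fugue canon fugue", "bach"]
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(map_count, partitions))
total = reduce(merge, partials, Counter())

print(total["fugue"])  # 3
```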
LOOKING BEYOND KNOWLEDGE
Music information retrieval
- Automatic music transcription
- Inferring a musical score from a recording
Generative models fabricating performances under various constraints
- Can we learn to synthesize a performance given a score?
- Can we generate a fugue in the style of Bach using a melody by Brahms?
DATA COLLECTIONS, ALIAS BIG DATA
- Data collections with characteristics that make them difficult to process on a single machine or in traditional databases
- A new generation of tools, methods and technologies to collect, process and analyse massive data collections
→ Tools imposing the use of parallel processing and distributed storage
DATA SCIENCE ECOSYSTEM & INTEGRATED DEVELOPMENT ENVIRONMENT
The integrated development environment (IDE) is an essential tool designed to maximize programmer productivity.
- An IDE has three basic pieces: the editor, the compiler (or interpreter), and the debugger.
- Examples: PyCharm, WingIDE, Spyder (Scientific PYthon Development EnviRonment)
Programming language:
- Python is one of the most flexible programming languages because it is a multi-paradigm language
- Alternatives are MATLAB and R
Fundamental libraries for data scientists in Python: NumPy, SciPy, Scikit-Learn, and Pandas
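A quick taste of two of the libraries named above: NumPy for arrays and Pandas for tabular data. The note values are invented for the example:

```python
# NumPy handles the numeric arrays; pandas gives labelled, tabular access
# with descriptive statistics almost for free.
import numpy as np
import pandas as pd

onsets = np.array([0.0, 0.5, 1.0, 1.5])
df = pd.DataFrame({"pitch": ["C4", "E4", "G4", "C4"], "onset": onsets})

print(df["onset"].mean())                # 0.75
print(df["pitch"].value_counts()["C4"])  # 2
```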
ELASTIC DATA PROCESSING & MANAGEMENT AT SCALE
Data stores span a spectrum according to the rawness degree of the data collections they target (diagram):
- Relational databases and NewSQL, at the curated end: increased versatility & complexity; querying and analytics
- Extensible record stores, document stores, graph databases: aggregation, processing, navigation
- Key-value stores, at the raw end: increased scalability & speed; look-up (R/W)
Analysis styles across the spectrum: descriptive statistics, inferential statistics, supervised learning, unsupervised learning.
INDEXING & STORING
Pipeline (diagram): multimedia, multiform input data → data transformation → classification → indexing classes over tagged opus executions, with the input data sharded & colocated on a distributed file system.
Labels to index, for every recording:
- the precise time of each note
- the instrument that plays each note
- the note's position in the metrical structure of the composition
SHARDING DATA ACROSS DIFFERENT STORES
(diagram: multimedia, multiform input data sharded & colocated on a distributed file system)
MusicNet: 330 classical music recordings with more than 1 million annotated labels (http://homes.cs.washington.edu/~thickstn/musicnet.html)
Goal: automatic and elastic data-collection sharding tools that parametrize data access & exploitation by parallel programs willing to scale up on different target architectures.
SHARDING ACROSS DIFFERENT STORES
Factors: RAM, disk, CPU, network.
Sharded data architecture (input data sharded & colocated on a distributed file system):
- balanced and smooth fragmentation (size, location, availability)
- optimum distribution across shards providing storage spaces (chunks)
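Two of the sharding strategies the deck names (range and hash) are easy to sketch. This is a hypothetical toy, with a fixed shard count and invented boundaries, not a real store's placement logic:

```python
# Hash sharding spreads keys uniformly (no locality); range sharding
# preserves order, which favours range queries over, e.g., note onsets.
import hashlib

SHARDS = 4

def hash_shard(key: str) -> int:
    # Deterministic hash of the key, mapped onto the shard count.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % SHARDS

def range_shard(onset: float, boundaries=(10.0, 20.0, 30.0)) -> int:
    # The shard is the first range bucket whose upper bound exceeds the value.
    for shard, upper in enumerate(boundaries):
        if onset < upper:
            return shard
    return len(boundaries)
```

A tagged strategy (the third one mentioned) would instead route records by an application-chosen label, trading uniformity for control over colocation.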
SHARDING DATA ACROSS DIFFERENT STORES: RAW DATA COLLECTIONS
Persistence:
- Which part of the document must persist?
- Explicit vs. implicit persistence
- In memory / on hard disk / in a memory cache
Fragmentation/sharding & replication:
- Vertical or horizontal fragmentation
- Strategies: range, hash, tagged
- Distribution & location
Availability & fault tolerance:
- Replication & distribution
DATA DELIVERY FOR GREEDY PROCESSING
A “multi-view computational problem”: iterative data processing and visualization tasks need to share CPU cycles, and data is a bottleneck.
(diagram: the application draws data from DRAM and from disk/database at 1-5 GBps and 1-10 GBps, feeding a CPU with multiple cores and a GPU with thousands of cores)
Provide data storage, fetching and delivery strategies:
- Architecture: a distributed file system across nodes
- Data sharding and replication: on storage and in memory
- Fetching to fulfil multi-facet application requirements: prefetching, memory indexing, reducing the impedance mismatch
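Prefetching, one of the strategies listed above, amounts to fetching chunk i+1 while the application still processes chunk i. A hypothetical single-machine sketch (the `fetch` function and chunk names are invented stand-ins for slow storage):

```python
# A background producer thread fetches ahead of the consumer through a
# bounded queue, overlapping I/O latency with processing.
import threading
import queue
import time

def fetch(chunk_id):
    time.sleep(0.01)  # simulate slow storage
    return f"data-{chunk_id}"

def prefetching_reader(n_chunks):
    buf = queue.Queue(maxsize=2)  # at most 2 chunks fetched ahead
    def producer():
        for i in range(n_chunks):
            buf.put(fetch(i))     # runs ahead of the consumer
        buf.put(None)             # end-of-stream marker
    threading.Thread(target=producer, daemon=True).start()
    while (chunk := buf.get()) is not None:
        yield chunk

chunks = list(prefetching_reader(3))
print(chunks)  # ['data-0', 'data-1', 'data-2']
```

The bounded queue is the design point: it caps memory while keeping the storage path busy whenever the consumer falls behind.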
OPPORTUNITIES
- Manage data collections with different uses and access patterns, because these properties tend to reach the limits of:
  - the storage capacity (main memory, cache and disks) required for archiving data collections permanently or during a certain period, and
  - the pace (computing speed) at which data must be consumed (harvested, prepared and processed).
- Build underlying value-added data managers that can:
  - exploit available resources, striking a compromise between QoS properties and SLA requirements across all levels of the stack
  - deliver request results at a reasonable economic price, in a reliable and efficient manner, despite the devices, the resources' availability and the data properties
Move from design based on intuition & experience to a more formal & systematic way of designing systems.
Addressing data-centric sciences problems is a matter of designing complex systems according to a multidisciplinary vision.