Genoveva Vargas-Solar
Senior Scientist, French Council of Scientific Research, LIG-LAFMIA, France
genoveva.vargas@imag.fr
Moving forward data centric sciences
weaving AI, Big Data & HPC
AICSSA, Jordan, October-November, 2018
http://www.vargas-solar.com
DATA REVOLUTION
“Data is everything and everything is data”, Pythian
Turning real-world phenomena into data thanks to the Big Data trend
Rendering into data aspects of the world that have never been quantified
Any individual can analyse huge amounts of data in short periods of time
- Analytical knowledge: most of the crucial algorithms are accessible
- Use rich data to make evidence-based decisions open to virtually any person or company
DATIFICATION
DIGITAL HUMANITIES
… UNFINISHED FUGUE
Fuga a 3 Soggetti (Contrapunctus XIV):
- 4-voice triple fugue
- the third subject of which is based on the BACH motif
« At the point where the composer introduces the name BACH in the
countersubject to this fugue, the composer died. »
What makes Bach sound like Bach?
http://www.washington.edu/news/2016/11/30/what-makes-bach-sound-like-bach-new-dataset-teaches-algorithms-classical-music/
The Art of Fugue is based on a single subject employed in some variation in each canon and fugue
• Identify the notes performed at specific times in a recording
• Classify the instruments that perform in a recording
• Classify the composer of a recording
• Identify precise onset times of the notes in a recording
• Predict the next note in a recording, conditioned on history
Music information retrieval
- Automatic music transcription
- Inferring a musical score from a recording
Generative models fabricating performances under various constraints
- Can we learn to synthesize a performance given a score?
- Can we generate a fugue in the style of Bach using a melody by Brahms?
DATA SCIENCE
The representation of complex environments by rich data opens up the possibility of applying all the scientific
knowledge regarding how to infer knowledge from data
Definition:
- Methodology by which actionable insights can be inferred from data
- Complex, multifaceted field that can be approached from several points of view: ethics, methodology,
business models, how to deal with big data, data engineering, data governance, etc.
Objective:
- Production of beliefs informed by data and to be used as the basis of decision making
- N.B. In the absence of data, beliefs are uninformed and decisions are based on best practices or intuition
Computational Science
Digital humanities
Social Data Science Network Science
DATA CENTRIC SCIENCES
Data collections are the backbone for conducting experiments, driving hypotheses and reaching “valid”
conclusions, models, simulations and understanding
Develop methodologies weaving together data management, greedy algorithms, and programming
models that must be tuned for deployment on different target computer architectures
1000 Yottabytes = 1 Brontobyte
1000 Brontobytes = 1 Geopbyte
Experimental Sciences
Computation (Algorithm: mathematical model)
Experiment setting (Architecture: computing environment)
DATA
Consumed data:
• different sizes
• quality, uncertainty, ambiguity degree
• evolution in structure, completeness, production conditions, conditions in which data is retrieved
• content, explicit cultural, contextual, background properties
• access policies modification
Conditions of consumption:
• reproducibility, transparency degree (avoid “software artefacts”)
NEITHER MANAGEABLE NOR EXPLOITABLE AS SUCH
RAW DATA
• Heterogeneous (variety)
• Huge (volume)
• Incomplete, imprecise, missing, contradictory (veracity)
• Continuous releases produced at different rates (velocity)
• Proprietary, critical, private (value)
DIGITAL DATA COLLECTIONS
EXPLORING DATA COLLECTIONS
https://web.facebook.com/data/
- Helping to select the right tool for preprocessing or analysis
- Making use of humans’ abilities to recognize patterns
Not always sure what we are looking for (until we find it)
Query expression [guidance | automatic generation] [3, 2]
• Multi-scale query processing for gradual exploration
• Query morphing to adjust for proximity results
• Queries as answers: query alternatives to cope with lack of providence
Results filtering, analysis, visualization [2]
• Result-set post-processing for conveying meaningful data
Data exploration systems & environments [1]
• Data system kernels tailored for data exploration: no preparation, easy to use, fast; database cracking
• Auto-tuning database kernels: incremental, adaptive, partial indexing
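The incremental, adaptive indexing idea mentioned above can be illustrated with a simplified sketch of database cracking: each range query partitions only the piece of the column it touches, so the column becomes progressively more ordered as queries arrive. The class and method names are hypothetical, and pieces are copied rather than swapped in place to keep the sketch short; it is not the implementation of any cited system.

```python
import bisect


class CrackedColumn:
    """Toy cracker column: queries reorganise the data as a side effect."""

    def __init__(self, values):
        self.data = list(values)
        self.pivots = []     # sorted pivot values seen so far
        self.positions = []  # positions[i]: first index holding a value >= pivots[i]

    def _crack(self, pivot):
        """Partition the piece containing `pivot`; return the split position."""
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.positions[i]  # already cracked on this value
        start = self.positions[i - 1] if i > 0 else 0
        end = self.positions[i] if i < len(self.positions) else len(self.data)
        piece = self.data[start:end]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.data[start:end] = smaller + larger
        position = start + len(smaller)
        self.pivots.insert(i, pivot)
        self.positions.insert(i, position)
        return position

    def range_query(self, low, high):
        """Return all values v with low <= v < high (assumes low <= high)."""
        low_pos = self._crack(low)
        high_pos = self._crack(high)
        return self.data[low_pos:high_pos]


column = CrackedColumn([13, 4, 55, 9, 27, 1, 42, 8])
print(sorted(column.range_query(5, 30)))
print(sorted(column.range_query(1, 10)))
```

No index is built up front ("no preparation"); the cost of ordering is paid only for the value ranges that queries actually explore.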
1. Xi, S. L., Babarinsa, O., Wasay, A., Wei, X., Dayan, N., & Idreos, S. (2017, May). Data canopy: Accelerating exploratory statistical analysis. In Proceedings of the 2017 ACM International Conference on Management of Data (pp. 557-572). ACM.
2. Athanassoulis, M., & Idreos, S. (2015, May). Beyond the wall: Near-data processing for databases. In Proceedings of the 11th International Workshop on Data Management on New Hardware (p. 2). ACM.
3. Idreos, S., Dayan, M. A. N., Guo, D., Kester, M. S., Maas, L., & Zoumpatianos, K. Past and Future Steps for Adaptive Storage Data Systems: From Shallow to Deep Adaptivity.
KEY MOTIVATIONS
EXPLORING DATA COLLECTIONS
QUANTITATIVE ANALYSIS OF DATA
Concepts:
- Population: collection of objects, items (“units”)
- Sample: a part of the observed population
Descriptive statistics: simplify data presenting quantitative descriptions
- Measures and concepts to describe the quantitative features
- Provide summaries about the samples as an approximation of the population
- Frequency of the notes performed at specific intervals in a recording
- Identify precise onset times of the notes in a recording
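A minimal sketch of the two descriptive measures above, assuming a sample of notes is available as (onset in seconds, pitch name) pairs. The sample values and function names are hypothetical and only serve as illustration.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical sample: (onset time in seconds, pitch name)
sample = [
    (0.0, "Bb4"), (0.5, "A4"), (1.0, "C5"),
    (1.5, "B4"), (2.0, "Bb4"), (3.0, "A4"),
]


def note_frequencies(notes, start, end):
    """Count how often each pitch starts within the interval [start, end)."""
    return Counter(pitch for onset, pitch in notes if start <= onset < end)


def inter_onset_summary(notes):
    """Summarise the gaps between consecutive onsets of the sample."""
    onsets = sorted(onset for onset, _ in notes)
    gaps = [later - earlier for earlier, later in zip(onsets, onsets[1:])]
    return {
        "mean": mean(gaps),
        "median": median(gaps),
        "min": min(gaps),
        "max": max(gaps),
    }


print(note_frequencies(sample, 0.0, 2.0))
print(inter_onset_summary(sample))
```

These summaries describe the sample only; whether they approximate the whole population depends on how the sample was drawn.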
LOOKING BEYOND DATA
Inferential statistics: infer the population characteristic
- draws conclusions beyond the analysed data
- reaches conclusions regarding made hypotheses
- Classify the instruments that perform in a recording
- Predict the next note in a recording, conditioned on history
- Inferring a musical score from a recording
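As one possible illustration of predicting the next note conditioned on history, the sketch below fits a first-order Markov model (only the last note is used as history) on an observed sequence. The melody and the function names are hypothetical; real models for this task would condition on much longer histories.

```python
from collections import Counter, defaultdict


def fit_transitions(sequence):
    """Count how often each note is followed by each other note."""
    transitions = defaultdict(Counter)
    for previous, following in zip(sequence, sequence[1:]):
        transitions[previous][following] += 1
    return transitions


def predict_next(transitions, history):
    """Return the most frequent successor of the last note, or None if unseen."""
    if not history or history[-1] not in transitions:
        return None
    return transitions[history[-1]].most_common(1)[0][0]


# Hypothetical observed melody (the sample from which we generalise)
melody = ["Bb", "A", "C", "B", "Bb", "A", "Bb", "A"]
model = fit_transitions(melody)
print(predict_next(model, ["C", "B", "Bb"]))
```

The prediction goes beyond the analysed data: it is a hypothesis about notes that were never observed, drawn from the frequencies in the sample.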
DATA CURATION
Preserving, Describing, Extracting meta-data, Exploring, Harvesting
ETL
Parallel data processing platforms: Spark (RDD – Tables/Graphs); Hadoop ecosystem tools (e.g., Pig)
Parallel data processing platforms: NoSQL & NewSQL (parallel)
Parallel data querying & analytics
Structured data provision
Parallel data collection (Flink, Stream, Flume)
Spark (descriptive statistics functions)
Hadoop ecosystem tools (e.g., Hive)
Parallel RDBMS, Big Data Analytics Stacks (Asterix, BDAS)
Parallel analytics (Matlab, R)
CURARE: Maintaining and Managing Data Collections Using Views. IEEE Transactions on Big Data; Gavin Kemp, Catarina Ferreira Da Silva, Genoveva Vargas-Solar, Parisa Ghodous (submitted)
ARTIFICIAL INTELLIGENCE
BEYOND KNOWLEDGE
https://ai100.stanford.edu/2016-report
LOOKING BEYOND KNOWLEDGE
Music information retrieval
- Automatic music transcription
- Inferring a musical score from a recording
Generative models fabricating performances under various constraints
- Can we learn to synthesize a performance given a score?
- Can we generate a fugue in the style of Bach using a melody by Brahms?
SETTING UP DATA CENTRIC EXPERIMENTS
https://web.facebook.com/data/
https://azure.microsoft.com/
- Data collections with characteristics that make them difficult to process on a single machine or with traditional databases
- A new generation of tools, methods and technologies to collect, process and analyse massive data collections
→ Tools imposing the use of parallel processing and distributed storage
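The pattern these tools impose can be sketched as follows: split the collection into partitions, process every partition independently, then merge the partial results. The record layout and function names are hypothetical. Threads are used here only to show the structure; for pure Python code they do not provide real CPU parallelism, and an actual deployment would rely on processes or a cluster engine.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_instruments(partition):
    """Map step: summarise one partition independently of the others."""
    return Counter(record["instrument"] for record in partition)


def parallel_count(partitions, max_workers=4):
    """Run the map step on every partition, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(count_instruments, partitions))
    return reduce(lambda left, right: left + right, partials, Counter())


# Hypothetical collection already split into three partitions
partitions = [
    [{"instrument": "piano"}, {"instrument": "violin"}],
    [{"instrument": "piano"}],
    [{"instrument": "cello"}, {"instrument": "piano"}],
]
print(parallel_count(partitions))
```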
DATA COLLECTIONS ALIAS BIG DATA
DATA SCIENCE ECOSYSTEM &
INTEGRATED DEVELOPMENT ENVIRONMENT
The integrated development environment (IDE) is an essential tool designed to
maximize programmer productivity.
- Any IDE has three basic pieces: the editor, the compiler (or interpreter) and the
debugger.
- Examples: PyCharm, WingIDE, SPYDER (Scientific Python Development EnviRonment)
Programming language:
- Python is one of the most flexible programming languages because it can be seen as a multiparadigm language
- Alternatives are MATLAB and R
Fundamental libraries for data scientists in Python: NumPy, SciPy, Scikit-Learn, and Pandas
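To make the multiparadigm remark concrete, here is a small sketch that solves the same task (counting instrument labels) in procedural, functional and object-oriented style using only the standard library. The label list and all names are hypothetical.

```python
from collections import Counter
from functools import reduce

labels = ["piano", "violin", "piano", "cello", "piano"]  # hypothetical data


def count_procedural(items):
    """Procedural style: explicit loop and mutable state."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts


def count_functional(items):
    """Functional style: fold over the items without mutating shared state."""
    return reduce(
        lambda acc, item: {**acc, item: acc.get(item, 0) + 1}, items, {}
    )


class InstrumentTally:
    """Object-oriented style: state and behaviour live in one object."""

    def __init__(self):
        self._counts = Counter()

    def add(self, item):
        self._counts[item] += 1
        return self

    def as_dict(self):
        return dict(self._counts)


tally = InstrumentTally()
for label in labels:
    tally.add(label)

print(count_procedural(labels), count_functional(labels), tally.as_dict())
```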
WEB INTEGRATED DEVELOPMENT ENVIRONMENT
DATA SCIENCE VIRTUAL MACHINE
COMPUTING CAPACITY
DATA BEYOND THE COMFORT ZONE
[Figure: data stores arranged by data collections rawness degree (up to curated), trading increased versatility & complexity against increased scalability & speed. Stores: key-value stores, document stores, extensible record stores, graph databases, NewSQL, relational databases. Operations: look up (R/W), querying, aggregation, processing, navigation, analytics.]
ELASTIC DATA PROCESSING & MANAGEMENT AT SCALE
[Figure: analytics pipeline. Labels: descriptive statistics, inferential statistics, supervised learning, unsupervised learning; sharded & colocated input data; distributed file system; classification; data transformation; tagged opus execution; multimedia multiform data; indexing classes.]
INDEXING & STORING
• the precise time of each note in every recording
• the instrument that plays each note
• the note's position in the metrical structure of the composition
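A minimal in-memory sketch of indexing such labels, assuming each label carries a start and end time, an instrument, a pitch and its metrical position. The field names and the NoteIndex class are hypothetical and do not reproduce the schema of any particular dataset.

```python
import bisect
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class NoteLabel:
    start: float      # seconds from the beginning of the recording
    end: float
    instrument: str
    pitch: int        # e.g. a MIDI note number
    measure: int      # position in the metrical structure
    beat: float


class NoteIndex:
    """Two access paths over the same labels: by time and by instrument."""

    def __init__(self, labels):
        self._by_start = sorted(labels, key=lambda label: label.start)
        self._starts = [label.start for label in self._by_start]
        self._by_instrument = defaultdict(list)
        for label in self._by_start:
            self._by_instrument[label.instrument].append(label)

    def starting_between(self, t0, t1):
        """Labels whose start time falls in [t0, t1), via binary search."""
        low = bisect.bisect_left(self._starts, t0)
        high = bisect.bisect_left(self._starts, t1)
        return self._by_start[low:high]

    def by_instrument(self, name):
        """Labels played by one instrument, in start-time order."""
        return list(self._by_instrument.get(name, []))


index = NoteIndex([
    NoteLabel(0.0, 0.5, "piano", 70, 1, 0.0),
    NoteLabel(0.5, 1.0, "violin", 69, 1, 0.5),
    NoteLabel(1.0, 1.5, "piano", 72, 1, 1.0),
    NoteLabel(1.5, 2.0, "piano", 71, 1, 1.5),
])
print([label.pitch for label in index.starting_between(0.5, 1.5)])
```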
SHARDING DATA ACROSS DIFFERENT STORES
Sharded & colocated input data
Distributed File System
Multimedia multiform data
MusicNet: 330 classical music recordings, 1 million annotated labels
http://homes.cs.washington.edu/~thickstn/musicnet.html
Automatic and elastic data collections sharding tools to parametrize data access &
exploitation by parallel programs willing to scale-up in different target architectures
SHARDING ACROSS DIFFERENT STORES
Factors:
- RAM
- Disk
- CPU
- Network
Sharded data architecture
Balanced and smooth fragmentation
(size, location, availability)
Optimum distribution across shards
providing storage spaces (chunks)
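One simple way to approach a balanced distribution of chunks is a greedy heuristic: place the largest chunks first, always on the currently least loaded shard. The sketch below balances size only (not location or availability), is a heuristic rather than an optimal solver, and uses hypothetical chunk names and sizes.

```python
import heapq


def place_chunks(chunk_sizes, shard_count):
    """Greedy placement: largest chunk first, onto the least loaded shard."""
    heap = [(0, shard_id) for shard_id in range(shard_count)]
    heapq.heapify(heap)
    placement = {shard_id: [] for shard_id in range(shard_count)}
    by_size = sorted(chunk_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for chunk_id, size in by_size:
        load, shard_id = heapq.heappop(heap)   # least loaded shard so far
        placement[shard_id].append(chunk_id)
        heapq.heappush(heap, (load + size, shard_id))
    return placement


def shard_loads(placement, chunk_sizes):
    """Total size stored on each shard."""
    return {
        shard_id: sum(chunk_sizes[chunk] for chunk in chunks)
        for shard_id, chunks in placement.items()
    }


chunks = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4}  # hypothetical sizes
placement = place_chunks(chunks, shard_count=2)
print(placement, shard_loads(placement, chunks))
```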
Persistence
- Which part of the document must persist?
- Explicit vs. implicit persistence
- In memory / hard disk
Fragmentation/Sharding & replication:
- Vertical or horizontal fragmentation
- Strategies: range, hash, tagged
- Distribution & location
Availability & Fault tolerance
- Replication & distribution
Memory/Cache
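The three horizontal sharding strategies named above (range, hash, tagged) can be sketched as routing functions that map a record to a shard number. The function names, the boundaries and the tag mapping are hypothetical; replication and fault tolerance are left out.

```python
import bisect
import hashlib


def hash_shard(key, shard_count):
    """Hash strategy: spread keys evenly, at the cost of losing key order."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def range_shard(key, upper_bounds):
    """Range strategy: `upper_bounds` are sorted, exclusive upper limits.

    Keys below upper_bounds[0] go to shard 0; keys at or above the last
    bound go to the final shard.
    """
    return bisect.bisect_right(upper_bounds, key)


def tagged_shard(record, tag_to_shard, default_shard=0):
    """Tagged strategy: an explicit tag on the record decides its shard."""
    return tag_to_shard.get(record.get("tag"), default_shard)


# Hypothetical routing of recordings
print(hash_shard("recording-1727", shard_count=4))
print(range_shard(1800, upper_bounds=[1750, 1850]))   # by composition year
print(tagged_shard({"tag": "baroque"}, {"baroque": 1, "romantic": 2}))
```

Range sharding keeps neighbouring keys together, which suits range scans; hash sharding favours balance; tagging gives the application explicit control over colocation.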
SHARDING DATA ACROSS DIFFERENT STORES
Raw data collections
1. Idreos, S., Dayan, M. A. N., Guo, D., Kester, M. S., Maas, L., & Zoumpatianos, K. Past and Future Steps for Adaptive Storage Data Systems: From Shallow to Deep Adaptivity.
DATA DELIVERY FOR GREEDY PROCESSING
“Multi-view computational problem”
Iterative data processing and visualization tasks need to share CPU cycles
Data is a bottleneck
APPLICATION
DRAM
DISK/DATABASE
CPU: multiple cores
GPU: thousands of cores
1-5 GBps, 1-10 GBps
Provide data storage, fetching and delivery strategies
- Architecture: distributed file system across nodes
- Data sharding and replication: on storage and in memory
- Fetch to fulfil multi-facet application requirements
- Prefetching
- Memory indexing
- Reduce impedance mismatch
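Prefetching, one of the delivery strategies listed above, can be sketched with a background thread that loads the next chunks into a bounded buffer while the consumer is still processing the current one. The function name, the `load_chunk` callback and the `depth` parameter are hypothetical.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the stream


def prefetching_reader(load_chunk, chunk_ids, depth=2):
    """Yield chunks in order while a background thread loads ahead.

    At most `depth` loaded chunks wait in the buffer, which bounds memory use.
    """
    buffer = queue.Queue(maxsize=depth)

    def producer():
        try:
            for chunk_id in chunk_ids:
                buffer.put(load_chunk(chunk_id))  # blocks when buffer is full
        finally:
            # Always signal the end, even if load_chunk raised, so the
            # consumer does not wait forever (the error itself is not
            # propagated to the consumer in this sketch).
            buffer.put(_SENTINEL)

    # Daemon thread: it will not keep the program alive if the consumer
    # stops iterating early.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()
        if item is _SENTINEL:
            return
        yield item


def slow_load(chunk_id):
    """Stand-in for fetching a chunk from disk or a remote store."""
    return chunk_id * chunk_id


for chunk in prefetching_reader(slow_load, range(5)):
    print(chunk)
```

While the consumer works on one chunk, the next ones are already being fetched, which hides part of the storage latency behind computation.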
- Manage data collections with different uses and access patterns, because these properties tend to reach the limits of:
  - the storage capacity (main memory, cache and disks) required for archiving data collections permanently or during a certain period, and
  - the pace (computing speed) at which data must be consumed (harvested, prepared and processed).
- Build underlying value-added data managers that can:
  - exploit available resources, making a compromise between QoS properties and SLA requirements across all the levels of the stack
  - deliver request results at a reasonable price, reliably and efficiently, regardless of the devices, the resources’ availability and the data properties
OPPORTUNITIES
FINAL COMMENTS
Move from design based on intuition & experience to a more formal & systematic way
to design systems
Addressing data centric sciences problems is a matter of designing complex systems according
to a multidisciplinary vision
Let’s weave a golden trilogy
Big Data, AI & HPC
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 

Moving forward data centric sciences weaving AI, Big Data & HPC

  • 1. Genoveva Vargas-Solar, Senior Scientist, French Council of Scientific Research, LIG-LAFMIA, France (genoveva.vargas@imag.fr). Moving forward data centric sciences: weaving AI, Big Data & HPC. AICSSA, Jordan, October-November, 2018. http://www.vargas-solar.com
  • 2. DATA REVOLUTION
  • 3. “Data is everything and everything is data”, Pythian. Turning reality phenomena into data thanks to the Big Data trend.
  • 4. DATIFICATION: rendering into data aspects of the world that have never been quantified. Any individual can analyse huge amounts of data in short periods of time.
      - Analytical knowledge: most of the crucial algorithms are accessible
      - Using rich data to make evidence-based decisions is open to virtually any person or company
  • 6. … UNFINISHED FUGUE. Fuga a 3 Soggetti (Contrapunctus XIV):
      - 4-voice triple fugue
      - the third subject of which is based on the B A C H motif
      « At the point where the composer introduces the name BACH in the countersubject to this fugue, the composer died. »
  • 7. What makes Bach sound like Bach? http://www.washington.edu/news/2016/11/30/what-makes-bach-sound-like-bach-new-dataset-teaches-algorithms-classical-music/
      The Art of Fugue is based on a single subject employed in some variation in each canon and fugue.
      • Identify the notes performed at specific times in a recording
      • Classify the instruments that perform in a recording
      • Classify the composer of a recording
      • Identify precise onset times of the notes in a recording
      • Predict the next note in a recording, conditioned on history
      Music information retrieval:
      - Automatic music transcription
      - Inferring a musical score from a recording
      Generative models fabricating performances under various constraints:
      - Can we learn to synthesize a performance given a score?
      - Can we generate a fugue in the style of Bach using a melody by Brahms?
  • 9. DATA SCIENCE. The representation of complex environments by rich data opens up the possibility of applying all the scientific knowledge regarding how to infer knowledge from data.
      Definition:
      - Methodology by which actionable insights can be inferred from data
      - Complex, multifaceted field that can be approached from several points of view: ethics, methodology, business models, how to deal with big data, data engineering, data governance, etc.
      Objective:
      - Production of beliefs informed by data and to be used as the basis of decision making
      - N.B. In the absence of data, beliefs are uninformed and decisions are based on best practices or intuition
  • 10. DATA CENTRIC SCIENCES: Computational Science, Digital humanities, Social Data Science, Network Science.
      - Data collections as the backbone for conducting experiments, driving hypotheses and leading to “valid” conclusions, models, simulations and understanding
      - Develop methodologies weaving data management, greedy algorithms, and programming models that must be tuned to be deployed in different target computer architectures
  • 11. Experimental Sciences: Computational Science, Digital humanities, Social Data Science, Network Science. 1000 Yottabytes = 1 Brontobyte; 1000 Brontobytes = 1 Geopbyte.
  • 12. Computational Science, Digital humanities, Social Data Science, Network Science. 1000 Yottabytes = 1 Brontobyte; 1000 Brontobytes = 1 Geopbyte. Computation (Algorithm: mathematical model). Experiment setting (Architecture: computing environment).
  • 14. DIGITAL DATA COLLECTIONS
      Consumed data:
      • different sizes
      • quality, uncertainty, ambiguity degree
      • evolution in structure, completeness, production conditions, conditions in which data is retrieved
      • content, explicit cultural, contextual, background properties
      • access policies modification
      Conditions of consumption:
      • reproducibility, transparency degree (avoid “software artefacts”)
      RAW DATA, NEITHER MANAGEABLE NOR EXPLOITABLE AS SUCH:
      • Heterogeneous (variety)
      • Huge (volume)
      • Incomplete, imprecise, missing, contradictory (veracity)
      • Continuous releases produced at different rates (velocity)
      • Proprietary, critical, private (value)
  • 18. EXPLORING DATA COLLECTIONS
      KEY MOTIVATIONS:
      ✓ Helping to select the right tool for preprocessing or analysis
      ✓ Making use of humans’ abilities to recognize patterns
      Not always sure what we are looking for (until we find it)
      Query expression [guidance ∣ automatic generation] [3, 2]:
      • Multi-scale query processing for gradual exploration
      • Query morphing to adjust for proximity results
      • Queries as answers: query alternatives to cope with lack of providence
      Results filtering, analysis, visualization [2]:
      • Result-set post processing for conveying meaningful data
      Data exploration systems & environments [1]:
      • Data systems kernels are tailored for data exploration: no preparation, easy-to-use, fast, database cracking
      • Auto-tuning database kernels: incremental, adaptive, partial indexing
      References:
      1. Xi, S. L., Babarinsa, O., Wasay, A., Wei, X., Dayan, N., & Idreos, S. (2017, May). Data canopy: Accelerating exploratory statistical analysis. In Proceedings of the 2017 ACM International Conference on Management of Data (pp. 557-572). ACM.
      2. Athanassoulis, M., & Idreos, S. (2015, May). Beyond the wall: Near-data processing for databases. In Proceedings of the 11th International Workshop on Data Management on New Hardware (p. 2). ACM.
      3. Idreos, S., Dayan, M. A. N., Guo, D., Kester, M. S., Maas, L., & Zoumpatianos, K. Past and Future Steps for Adaptive Storage Data Systems: From Shallow to Deep Adaptivity.
  • 19. QUANTITATIVE ANALYSIS OF DATA
      Concepts:
      - Population: collection of objects, items (“units”)
      - Sample: a part of the observed population
      Descriptive statistics: simplify data by presenting quantitative descriptions
      - Measures and concepts to describe the quantitative features
      - Provide summaries about the samples as an approximation of the population
      - Frequency of the notes performed at specific intervals in a recording
      - Identify precise onset times of the notes in a recording
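The descriptive summaries listed on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `notes` sample, its (onset, pitch) layout and the `describe` helper are hypothetical and are not taken from the talk or from any dataset.

```python
from collections import Counter
from statistics import mean, median, pstdev

# Hypothetical sample: (onset time in seconds, MIDI pitch) pairs for a few notes.
# The values are made up for illustration only.
notes = [(0.00, 62), (0.50, 69), (1.00, 65), (1.50, 62), (2.00, 61), (2.50, 62)]

def describe(sample, start, end):
    """Summarise the notes whose onset falls in the interval [start, end)."""
    pitches = [pitch for onset, pitch in sample if start <= onset < end]
    return {
        "count": len(pitches),
        "frequency": Counter(pitches),   # how often each pitch occurs
        "mean_pitch": mean(pitches),
        "median_pitch": median(pitches),
        "pitch_stdev": pstdev(pitches),  # spread of the selected values
    }

summary = describe(notes, 0.0, 2.0)
print(summary["count"], summary["frequency"].most_common(1))
```

The summary describes only the selected sample; treating it as an approximation of the whole recording (the population) is the step the slide attributes to descriptive statistics.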
  • 20. LOOKING BEYOND DATA
      Inferential statistics: infer the characteristics of the population
      - draws conclusions beyond the analysed data
      - reaches conclusions regarding the hypotheses made
      - Classify the instruments that perform in a recording
      - Predict the next note in a recording, conditioned on history
      - Inferring a musical score from a recording
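As a toy illustration of "predict the next note, conditioned on history", the sketch below counts note-to-note transitions in an observed sample and predicts the most frequent successor. It is a first-order count model chosen for brevity, not the method used in the work cited in the talk; the `sample` sequence and both function names are hypothetical.

```python
from collections import Counter, defaultdict

def fit_transitions(sequence):
    """Count how often each note is followed by each other note."""
    transitions = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        transitions[current][following] += 1
    return transitions

def predict_next(transitions, current):
    """Return the most frequent successor of `current`, or None if unseen."""
    followers = transitions.get(current)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical observed sample of note names (invented for this example).
sample = ["Bb", "A", "C", "B", "Bb", "A", "C", "B", "Bb", "A", "D"]
model = fit_transitions(sample)
print(predict_next(model, "A"))
```

The history here is a single preceding note; conditioning on longer histories would mean keying the counts on tuples of notes instead.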
  • 21. DATA CURATION: Preserving; Describing; Extracting meta-data; Exploring; Harvesting; ETL.
      - Parallel Data Processing Platforms: Spark (RDD – Tables/Graphs); Hadoop ecosystem tools (e.g., Pig)
      - Parallel Data Processing Platforms: NoSQL & NewSQL (Parallel)
      - Parallel Data Querying & Analytics; Structured Data provision; Parallel data collection (Flink, Stream, Flume); Spark (descriptive statistics functions); Hadoop ecosystem tools (e.g., Hive); Parallel RDBMS; Big Data Analytics Stacks (Asterix, BDAS); Parallel analytics (Matlab, R)
      Reference: Gavin Kemp, Catarina Ferreira Da Silva, Genoveva Vargas Solar, Parisa Ghodous. CURARE: Maintaining and Managing Data Collections Using Views. IEEE Transactions on Big Data (submitted).
  • 24. LOOKING BEYOND KNOWLEDGE
      Music information retrieval:
      - Automatic music transcription
      - Inferring a musical score from a recording
      Generative models fabricating performances under various constraints:
      - Can we learn to synthesize a performance given a score?
      - Can we generate a fugue in the style of Bach using a melody by Brahms?
  • 25. SETTING UP DATA CENTRIC EXPERIMENTS
  • 27. DATA COLLECTIONS ALIAS BIG DATA
      - Data collections with characteristics difficult to process on a single machine or traditional databases
      - A new generation of tools, methods and technologies to collect, process and analyse massive data collections
        → Tools imposing the use of parallel processing and distributed storage
  • 28. DATA SCIENCE ECOSYSTEM & INTEGRATED DEVELOPMENT ENVIRONMENT
      The integrated development environment (IDE) is an essential tool designed to maximize programmer productivity.
      - The basic pieces of any IDE are three: the editor, the compiler (or interpreter) and the debugger.
      - Examples: PyCharm, WingIDE, SPYDER (Scientific Python Development EnviRonment)
      Programming language:
      - Python is one of the most flexible programming languages because it can be seen as a multiparadigm language
      - Alternatives are MATLAB and R
      Fundamental libraries for data scientists in Python: NumPy, SciPy, Scikit-Learn, and Pandas
  • 32. DATA BEYOND THE COMFORT ZONE
  • 34. ELASTIC DATA PROCESSING & MANAGEMENT AT SCALE (diagram)
      - Data collections rawness degree: Curated
      - Increased versatility & complexity; Increased scalability & speed
      - Key-Value stores; Document stores; NewSQL; Relational databases; Graph Databases; Extensible record stores
      - Querying; Look up (R/W); Analytics; Aggregation; Processing; Navigation
      - Descriptive Statistics; Inferential Statistics; Supervised Learning; Unsupervised Learning
  • 35. INDEXING & STORING (diagram: Sharded & colocated Input data; Distributed File System; Classification; Data transformation; Tagged opus execution; Multimedia multiform data; Indexing classes)
      • the precise time of each note in every recording
      • the instrument that plays each note
      • the note's position in the metrical structure of the composition
  • 36. SHARDING DATA ACROSS DIFFERENT STORES (diagram: Sharded & colocated Input data; Distributed File System; Multimedia multiform data)
      - MusicNet: 330 classical music recordings, 1 million annotated labels. http://homes.cs.washington.edu/~thickstn/musicnet.html
      - Automatic and elastic data collections sharding tools to parametrize data access & exploitation by parallel programs willing to scale-up in different target architectures
  • 37. SHARDING ACROSS DIFFERENT STORES (diagram: Sharded & colocated Input data; Distributed File System; Sharded data architecture)
      - Factors: RAM, Disk, CPU, Network
      - Balanced and smooth fragmentation (size, location, availability)
      - Optimum distribution across shards providing storage spaces (chunks)
  • 38. SHARDING DATA ACROSS DIFFERENT STORES (Raw data collections)
      Persistence:
      - Which part of the document must persist?
      - Explicit vs. implicit persistence
      - In memory / hard disk
      Fragmentation/Sharding & replication:
      - Vertical or horizontal fragmentation
      - Strategies: range, hash, tagged
      - Distribution & location
      Availability & Fault tolerance:
      - Replication & distribution
      Memory/Cache
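Two of the sharding strategies named on this slide, range and hash, can be sketched as shard-assignment functions. This is a minimal illustration under assumptions: the recording identifiers, the split points and both function names are hypothetical, and the "tagged" strategy is left out because the slide does not define it.

```python
import hashlib
from bisect import bisect_right

def hash_shard(key, num_shards):
    """Hash strategy: a stable digest of the key picks the shard."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def range_shard(key, boundaries):
    """Range strategy: `boundaries` are sorted split points; a key equal to a
    split point goes to the shard on its right."""
    return bisect_right(boundaries, key)

# Hypothetical recording identifiers and split points.
recordings = [1727, 1788, 2186, 2303, 2628]
boundaries = [2000, 2500]  # shard 0: < 2000, shard 1: 2000..2499, shard 2: >= 2500

for rec_id in recordings:
    print(rec_id, range_shard(rec_id, boundaries), hash_shard(str(rec_id), 3))
```

Range sharding keeps neighbouring keys together, which suits scans over an interval, while hashing tends to spread keys evenly at the cost of locality. Which one gives the "balanced and smooth fragmentation" mentioned earlier depends on the key distribution.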
  • 39. 1. Idreos, S., Dayan, M. A. N., Guo, D., Kester, M. S., Maas, L., & Zoumpatianos, K. Past and Future Steps for Adaptive Storage Data Systems: From Shallow to Deep Adaptivity.
  • 40. DATA DELIVERY FOR GREEDY PROCESSING: “Multi-view computational problem”. Iterative data processing and visualization tasks need to share CPU cycles. Data is a bottleneck.
      Diagram: APPLICATION; DRAM; DISK/DATABASE; CPU (Multiple Cores); GPU (Thousands of Cores); 1-5 GBps; 1-10 GBps
      Provide data storage, fetching and delivery strategies:
      - Architecture: distributed file system across nodes
      - Data sharding and replication: on storage and memory
      - Fetch to fulfil multi-facet application requirements
      - Prefetching
      - Memory indexing
      - Reduce impedance mismatch
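The prefetching item above can be illustrated with a small sketch in which a background thread loads chunks ahead of the consumer through a bounded buffer. Everything here is an assumption made for illustration: `prefetching_reader`, `load_chunk` and the `depth` parameter are hypothetical names, and `load_chunk` merely stands in for a disk or database read.

```python
import queue
import threading

def prefetching_reader(load_chunk, chunk_ids, depth=2):
    """Yield chunks in order while a background thread loads up to `depth` ahead."""
    buffer = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for chunk_id in chunk_ids:
            buffer.put(load_chunk(chunk_id))  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item

# Hypothetical loader standing in for a disk or database read.
def load_chunk(chunk_id):
    return [chunk_id * 10 + offset for offset in range(3)]

totals = [sum(chunk) for chunk in prefetching_reader(load_chunk, range(4))]
print(totals)
```

The bounded queue caps how much memory the prefetched data can occupy. The sketch does not handle errors raised inside the loader, which a real delivery layer would need to propagate to the consumer.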
  • 41. OPPORTUNITIES
      - Manage data collections with different uses and access patterns, because these properties tend to reach the limits of:
        - the storage capacity (main memory, cache and disks) required for archiving data collections permanently or during a certain period, and
        - the pace (computing speed) at which data must be consumed (harvested, prepared and processed).
      - Build underlying value-added data managers that can:
        - exploit available resources, making a compromise between QoS properties and SLA requirements considering all the levels of the stack
        - deliver request results at a reasonable economic price and in a reliable and efficient manner, despite the devices, resources’ availability and the data properties
  • 42. FINAL COMMENTS
  • 43. Move from design based on intuition & experience to a more formal & systematic way to design systems. Addressing data centric sciences problems is a matter of designing complex systems according to a multidisciplinary vision.
  • 44. Let’s weave a golden trilogy: Big Data, AI & HPC