(Big) Data (Science) Skills
Big Data Value Association Summit in Madrid
17/06/2015
Oscar Corcho
ocorcho@fi.upm.es
@ocorcho
https://www.slideshare.com/ocorcho
License
• This work is licensed under
CC BY-NC-SA 4.0 International
• http://purl.org/NET/rdflicense/cc-by-nc-sa4.0
• You are free:
• to Share — to copy, distribute and transmit the work
• to Remix — to adapt the work
• Under the following conditions
• Attribution — You must attribute the work by inserting
• “[source Oscar Corcho]” at the footer of each reused slide
• a credits slide stating: “These slides are partially based on
“(Big) Data (Science) Skills” by O. Corcho”
• Non-commercial
• Share-Alike
Data Scientist: Technical and Soft Skills needed
• One of the two or
three pictures
expected from a talk
on skills…
• I could start by going
through
• Each of these topics
• Discussing the
specific skills needed
• However…
Sorry, looking for the reference to add here
What is Big Data?
Source: http://www.philipchircop.com/post/25783275888/seeing-the-full-elephant-its-a-tree-its-a
Big Data and the theory of ecological niches
Characteristics of an ecological niche
• A niche is defined by a spectrum of resource usage
• Species differ from each other in how efficient they are in
using resources that change continuously
• Characteristics of a niche
• Amplitude (range in which resources are used)
• Generic species (they can use a wide range of
resources)
• Specialist species (they require a very specific
combination of resources)
• Overlap (similarity among niches in their usage of resources)
• Competitive exclusion principle (Gause, 1934)
• If two species coexist in a stable environment, they do so by
differentiating their effective ecological niches.
Source: Javier Seoane. Ecología. Unidad Temática 21. Teoría del nicho ecológico
Well, that’s interesting, but…
WHAT’S THE RELATIONSHIP
TO BIG DATA?
Big Data Niche 1. HPC and e-Infrastructure Experts
Background: Computer Science (Systems)
System Administration
Terms used in their native language:
Blades, Infiniband, OpenMPI,
racks, HDF, TBs, Gflops
Their daily life:
Check system logs
Make sure that queues are active
Install a new rack
What’s Big Data for them?
A “commercial” term for something
that they have done for a long time
They really know how to configure
and monitor a Hadoop cluster
They would love to see those who talk
about Big Data run fluid dynamics
simulations
Big Data Niche 2. Data Storage and Access Experts
Background: Computer Science
Database administration
Terms used in their native language:
SQL, NoSQL, Column store
Transactions, Hive, TBs/PBs/…,
TPS (transactions per second)
Their daily life:
Optimise several queries
Run a new benchmark
Design an optimiser/physical operator
What’s Big Data for them?
A new opportunity to work on
optimisation algorithms
They know how to configure a database
They often laugh at those who deploy
a NoSQL solution for a problem
that can be solved with a
relational database
Big Data Niche 3. Machine Learning Experts
Background: Mathematics, Statistics,
Physics, Computer Science
Terms used in their native language:
Complexity, algorithm, p-value,
convergence, precision, recall
ROC curves, bayesian networks, R
Their daily life:
Read about a new problem
Write down a few formulae on the
whiteboard (or even a blackboard)
Prove that the algorithm terminates
What’s Big Data for them?
The same problems applied to data of
larger size, with new challenges
Problems are not solved just by using
Hadoop or a powerful NoSQL DB
Astonished by those who still mix up
correlation and causation
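The remark about mixing up correlation with causation can be illustrated with a small simulation. This is a minimal sketch and not part of the original talk: the names `confounder`, `x` and `y` and the noise levels are assumptions chosen for illustration. Two variables that never influence each other still correlate strongly, because both depend on a hidden third one.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical hidden factor that drives both observed variables.
confounder = [rng.gauss(0, 1) for _ in range(n)]
x = [z + rng.gauss(0, 0.5) for z in confounder]
y = [z + rng.gauss(0, 0.5) for z in confounder]

# x and y are strongly correlated although neither causes the other.
corr_raw = pearson(x, y)

# Removing the confounder's contribution leaves only independent noise.
x_resid = [u - z for u, z in zip(x, confounder)]
y_resid = [v - z for v, z in zip(y, confounder)]
corr_resid = pearson(x_resid, y_resid)

print(f"raw correlation:      {corr_raw:.2f}")
print(f"residual correlation: {corr_resid:.2f}")
```

In this toy setting the raw correlation is high while the residual correlation is close to zero, which is the pattern a careful analyst would check for before claiming a causal link.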
Big Data Niche 4. Slow-data Experts
Background: Computer Science, Statistics,
Library Sciences, Linguistics
Terms used in their native language:
Information model, vocabulary,
ontology, data quality, curation
Their daily life:
Receive a database schema
Talk to data producers and (re)users
Obtain consensus and transform data
What’s Big Data for them?
The difficulty lies in the variety of
data formats and structures
We may integrate data from varied
sources, although this is not
always possible
When you manage to integrate
heterogeneous data, you can achieve
better results
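The "obtain consensus and transform data" step can be sketched in a few lines. This is a minimal illustration under stated assumptions: the two data producers (`city_a`, `city_b`), their field names, and the shared vocabulary are all invented for the example and do not come from the talk.

```python
# Shared vocabulary agreed with producers and (re)users (hypothetical).
# Each source maps its own field name to (shared name, converter).
MAPPINGS = {
    "city_a": {
        "station": ("station_id", str),
        "temp_c": ("temperature_celsius", float),
    },
    "city_b": {
        "id": ("station_id", str),
        # This source reports Fahrenheit, so convert to Celsius.
        "temp_f": ("temperature_celsius", lambda f: (float(f) - 32) * 5 / 9),
    },
}


def harmonise(source, record):
    """Translate one source-specific record into the shared vocabulary."""
    mapping = MAPPINGS[source]
    result = {}
    for field, value in record.items():
        if field not in mapping:
            continue  # fields without an agreed meaning are dropped
        shared_name, convert = mapping[field]
        result[shared_name] = convert(value)
    return result


records = [
    ("city_a", {"station": "A1", "temp_c": "21.5"}),
    ("city_b", {"id": 7, "temp_f": "68", "operator": "unknown"}),
]
integrated = [harmonise(source, rec) for source, rec in records]
for row in integrated:
    print(row)
```

The code itself is trivial; the hard part, as the slide suggests, is agreeing on the contents of the mapping table with the people who produce and reuse the data.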
Big Data Niche 5. (Big Data) Consultants
Background: Computer Science, Economics,
…
Terms used in their native language:
Business model, business opportunity,
Big Data, Data Value Chain,
Hadoop, Spark, R, TBs, GFlops
Their daily life:
Read a Gartner Big Data report
Talk to potential customers
Relay customer needs to technical staff
What’s Big Data for them?
It’s the 4Vs, plus a few more
I have a PPT presentation with a
Big Data infrastructure,
architecture,
and previous projects, which I will
use to sell a project to my
customers
Are we missing any ecological niche?
• We have already seen several ecological
niches…
• They all coexist
• Some of them overlap
Is there anyone who has not yet been
considered?
The evolution of a new species: the Data Scientist
Background: Computer Science + Statistics +
Mathematics + Economics +
…
Terms used in their new exotic language:
HPC, databases, algorithms,
harmonisation, integration,
Hadoop, Spark, R, TBs, GFlops
Their daily life:
Learn about a new infrastructure
Code scripts to be run on Spark
Interpret results
Install a new framework
Read a few scientific papers
Make shiny presentations
Describe in their blog the activities
that they do, so that Big Data is
better known and understood
…
© Volker Markl: “Data Scientist” – “Jack of All Trades!”
Application
Data Science
Control Flow
Iterative Algorithms
Error Estimation
Active Sampling
Sketches
Curse of Dimensionality
Decoupling
Convergence
Monte Carlo
Mathematical Programming
Linear Algebra
Stochastic Gradient Descent
Regression
Statistics
Hashing
Parallelization
Query Optimization
Fault Tolerance
Relational Algebra / SQL
Scalability
Data Analysis Language
Compiler
Memory Management
Memory Hierarchy
Data Flow
Hardware Adaptation
Indexing
Resource Management
NF2 /XQuery
Data Warehouse/OLAP
Domain Expertise (e.g., Industry 4.0, Medicine, Physics, Engineering, Energy, Logistics)
Real-Time
Data Scientists and Pi-shaped people
• Let’s now go into
the expected
discussion
Sorry, looking for the reference to add here
Will all species survive?
• If Big Data defines an ecosystem…
• Which species will survive?
• Will Data Scientists wipe out the other species?
• Or will they be able to live in perfect symbiosis?
What is the ideal training required
for the individuals of these
species so that they can survive?
Data Science starter kits. Are they effective?
Masters in Data Science, Big Data and alike (I)
Expert in Big Data
Expert in Data Science
Masters in Data Science, Big Data and alike (II)
Masters in Data Science, Big Data and alike (III)
Year 1
• Data handling
• Data analysis
• Advanced data analysis and data
management
• Visualization
• Applications
Year 2
Are we doing it right in terms of training?
• This is probably due to the area's lack of maturity, but
syllabi do not seem to be fully compatible…
• It is not easy to believe that we can create Data
Scientists in only one year
• Should we train people to know a bit about everything?
• Or should we separate more clearly the species in our
ecosystem and specialise them better for their work?
How do we manage to keep a
healthy and stable ecosystem?
Shameless self-promotion
• Strategies for success in the
Digital-Data Revolution
• Separation of concerns
• Intellectual ramps
• Data-intensive knowledge
discovery
• Components and usage
patterns
• Data-intensive engineering
• Development vs enactment
• Data-intensive application
experiences
• In Science
• In Business
Can we learn from lessons
learned in Data-Intensive
Science?
Separation of concerns: three clear profiles
• Domain experts (WHAT)
• They know the problems they want to
solve
• They know the application domain
• They can create (scientific) workflows
• Data-intensive analysts (WHAT)
• They know a lot about (Big) data
analysis
• They may not know about the
infrastructure behind the scenes
• They do not necessarily know all the
details of the application domain
• Data-intensive engineers (HOW)
• They know a lot about distributed
computing/infrastructure/HPC/clouds/etc.
• They receive the description of an
algorithm and they can make it more
efficient (parallelisation)
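The division of labour between the analyst, who describes the algorithm, and the engineer, who makes it run efficiently, can be sketched as follows. This is a minimal illustration and not the approach presented in the talk: `score` and `run_parallel` are hypothetical names, and a thread pool from the Python standard library stands in for whatever cluster or HPC platform the engineers actually operate.

```python
from concurrent.futures import ThreadPoolExecutor


# Written by the data-intensive analyst: WHAT should be computed per record.
def score(record):
    """Hypothetical per-record analysis: mean of the record's values."""
    return sum(record) / len(record)


# Written by the data-intensive engineer: HOW the computation is executed.
def run_parallel(func, items, workers=4):
    """Apply func to every item using a pool of worker threads.

    Executor.map returns results in input order, so callers see the
    same output as a plain sequential loop. For CPU-bound Python code a
    process pool or a cluster framework would be the realistic choice.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


data = [[1, 2, 3], [10, 20], [5]]
sequential = [score(r) for r in data]
parallel = run_parallel(score, data)
print(parallel)
```

Because `run_parallel` accepts any function, the engineer can change the execution strategy without touching the analyst's code, which is the separation of concerns the slide describes.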
Separation of concerns: Differentiated tasks
[Figure: workflow diagram. Elements shown: a Programmable Filter Project element (inp, rules, outputs, distrib), four Sort elements (data, rule, outp; rule "second.fURI ASC..."), four Tuple Burst elements (input, structcols, outp; columns ["first,second"]), four De List elements, and a CorrFarm element (inp).]
Rules given to the filter:
[<select = "1<= day(inp.first.start)<=5", project="inp">,
<select = "6<= day(inp.first.start)<=10", project="inp">,
<select = "11<= day(inp.first.start)<=15", project="inp">,
... ]
[Figure: architecture diagram. Labels shown:]
• User and application diversity
• Iterative "what" process development
• Accommodating and facilitating: several application domains, several tool sets, several process representations, several working practices
• DISPEL representation
• Gateway
• System complexity
• Mapping, optimisation, deployment and execution
• Composing and providing: many autonomous resources, one enactment mechanism, a single platform
• Tool level, enactment level, component library
Conclusions
• We all know that there are big opportunities in Big Data
• But we need to be more productive. For that we need to:
• Create real multidisciplinary teams with at least three roles
(application developers, data-intensive analysts and data-intensive
engineers)
• Understand that simply using Hadoop, Spark or R does not
necessarily mean that we are doing Big Data
• Just as coding in Java does not necessarily mean that we
understand object-oriented programming
• Understand that we have to interpret results adequately, from a
scientific point of view
• Understand the importance of homogenising datasets, in order to
facilitate their integration (slow-data)
• Continue working on delivering tools that can be used to develop
Big Data applications more productively
• Should we also be funding this?

More Related Content

What's hot

Drupal Day 2011 - Thinking spatially with your open data
Drupal Day 2011 - Thinking spatially with your open dataDrupal Day 2011 - Thinking spatially with your open data
Drupal Day 2011 - Thinking spatially with your open dataDrupalDay
 
FAIR Workflows: A step closer to the Scientific Paper of the Future
FAIR Workflows: A step closer to the Scientific Paper of the FutureFAIR Workflows: A step closer to the Scientific Paper of the Future
FAIR Workflows: A step closer to the Scientific Paper of the Futuredgarijo
 
Getting Started with Knowledge Graphs
Getting Started with Knowledge GraphsGetting Started with Knowledge Graphs
Getting Started with Knowledge GraphsPeter Haase
 
SSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow TutorialSSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow TutorialSSSW
 
D.3.1: State of the Art - Linked Data and Digital Preservation
D.3.1: State of the Art - Linked Data and Digital PreservationD.3.1: State of the Art - Linked Data and Digital Preservation
D.3.1: State of the Art - Linked Data and Digital PreservationPRELIDA Project
 
Towards Automating Data Narratives
Towards Automating Data NarrativesTowards Automating Data Narratives
Towards Automating Data Narrativesdgarijo
 
Introduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Introduction to the Data Web, DBpedia and the Life-cycle of Linked DataIntroduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Introduction to the Data Web, DBpedia and the Life-cycle of Linked DataSören Auer
 
II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...
II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...
II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...Dr. Haxel Consult
 
II-SDV 2017: Towards Semantic Search at the European Patent Office
II-SDV 2017: Towards Semantic Search at the European Patent OfficeII-SDV 2017: Towards Semantic Search at the European Patent Office
II-SDV 2017: Towards Semantic Search at the European Patent OfficeDr. Haxel Consult
 
Question answering in linked data
Question answering in linked dataQuestion answering in linked data
Question answering in linked dataReza Ramezani
 
Knowledge discoverylaurahollink
Knowledge discoverylaurahollinkKnowledge discoverylaurahollink
Knowledge discoverylaurahollinkSSSW
 
Open source analytics
Open source analyticsOpen source analytics
Open source analyticsAjay Ohri
 
Towards an Open Research Knowledge Graph
Towards an Open Research Knowledge GraphTowards an Open Research Knowledge Graph
Towards an Open Research Knowledge GraphSören Auer
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph IntroductionSören Auer
 
II-PIC 2017: Gain insight into technical, legal and business information thro...
II-PIC 2017: Gain insight into technical, legal and business information thro...II-PIC 2017: Gain insight into technical, legal and business information thro...
II-PIC 2017: Gain insight into technical, legal and business information thro...Dr. Haxel Consult
 
IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data...
IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data...IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data...
IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data...Dr. Haxel Consult
 
Towards digitizing scholarly communication
Towards digitizing scholarly communicationTowards digitizing scholarly communication
Towards digitizing scholarly communicationSören Auer
 

What's hot (20)

Drupal Day 2011 - Thinking spatially with your open data
Drupal Day 2011 - Thinking spatially with your open dataDrupal Day 2011 - Thinking spatially with your open data
Drupal Day 2011 - Thinking spatially with your open data
 
FAIR Workflows: A step closer to the Scientific Paper of the Future
FAIR Workflows: A step closer to the Scientific Paper of the FutureFAIR Workflows: A step closer to the Scientific Paper of the Future
FAIR Workflows: A step closer to the Scientific Paper of the Future
 
Getting Started with Knowledge Graphs
Getting Started with Knowledge GraphsGetting Started with Knowledge Graphs
Getting Started with Knowledge Graphs
 
SSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow TutorialSSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow Tutorial
 
D.3.1: State of the Art - Linked Data and Digital Preservation
D.3.1: State of the Art - Linked Data and Digital PreservationD.3.1: State of the Art - Linked Data and Digital Preservation
D.3.1: State of the Art - Linked Data and Digital Preservation
 
Towards Automating Data Narratives
Towards Automating Data NarrativesTowards Automating Data Narratives
Towards Automating Data Narratives
 
Introduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Introduction to the Data Web, DBpedia and the Life-cycle of Linked DataIntroduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Introduction to the Data Web, DBpedia and the Life-cycle of Linked Data
 
II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...
II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...
II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...
 
II-SDV 2017: Towards Semantic Search at the European Patent Office
II-SDV 2017: Towards Semantic Search at the European Patent OfficeII-SDV 2017: Towards Semantic Search at the European Patent Office
II-SDV 2017: Towards Semantic Search at the European Patent Office
 
Question answering in linked data
Question answering in linked dataQuestion answering in linked data
Question answering in linked data
 
Knowledge discoverylaurahollink
Knowledge discoverylaurahollinkKnowledge discoverylaurahollink
Knowledge discoverylaurahollink
 
Open source analytics
Open source analyticsOpen source analytics
Open source analytics
 
Towards an Open Research Knowledge Graph
Towards an Open Research Knowledge GraphTowards an Open Research Knowledge Graph
Towards an Open Research Knowledge Graph
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph Introduction
 
II-PIC 2017: Gain insight into technical, legal and business information thro...
II-PIC 2017: Gain insight into technical, legal and business information thro...II-PIC 2017: Gain insight into technical, legal and business information thro...
II-PIC 2017: Gain insight into technical, legal and business information thro...
 
CORFU-MTSR 2013
CORFU-MTSR 2013CORFU-MTSR 2013
CORFU-MTSR 2013
 
Linked Open Data and Ontotext Projects
Linked Open Data and Ontotext ProjectsLinked Open Data and Ontotext Projects
Linked Open Data and Ontotext Projects
 
The RDFIndex-MTSR 2013
The RDFIndex-MTSR 2013The RDFIndex-MTSR 2013
The RDFIndex-MTSR 2013
 
IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data...
IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data...IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data...
IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data...
 
Towards digitizing scholarly communication
Towards digitizing scholarly communicationTowards digitizing scholarly communication
Towards digitizing scholarly communication
 

Viewers also liked

Matching Workforce Skills with Employer Needs Now & into the Future
Matching Workforce Skills with Employer Needs Now & into the FutureMatching Workforce Skills with Employer Needs Now & into the Future
Matching Workforce Skills with Employer Needs Now & into the Futurenado-web
 
e-skills reshaping the future of learning
e-skills reshaping the future of learninge-skills reshaping the future of learning
e-skills reshaping the future of learning@cristobalcobo
 
Day of data: skills for the future
Day of data: skills for the futureDay of data: skills for the future
Day of data: skills for the futureSteven Miller
 
Navigating the Changing Economic and Demographic Realities of the 21st Century
Navigating the Changing Economic and Demographic Realities of the 21st Century Navigating the Changing Economic and Demographic Realities of the 21st Century
Navigating the Changing Economic and Demographic Realities of the 21st Century nado-web
 
Official Slideshare for What's the Future of Business by Brian Solis #WTF
Official Slideshare for What's the Future of Business by Brian Solis #WTFOfficial Slideshare for What's the Future of Business by Brian Solis #WTF
Official Slideshare for What's the Future of Business by Brian Solis #WTFBrian Solis
 

Viewers also liked (7)

Matching Workforce Skills with Employer Needs Now & into the Future
Matching Workforce Skills with Employer Needs Now & into the FutureMatching Workforce Skills with Employer Needs Now & into the Future
Matching Workforce Skills with Employer Needs Now & into the Future
 
e-skills reshaping the future of learning
e-skills reshaping the future of learninge-skills reshaping the future of learning
e-skills reshaping the future of learning
 
Day of data: skills for the future
Day of data: skills for the futureDay of data: skills for the future
Day of data: skills for the future
 
Navigating the Changing Economic and Demographic Realities of the 21st Century
Navigating the Changing Economic and Demographic Realities of the 21st Century Navigating the Changing Economic and Demographic Realities of the 21st Century
Navigating the Changing Economic and Demographic Realities of the 21st Century
 
How to hack into the big data team
How to hack into the big data teamHow to hack into the big data team
How to hack into the big data team
 
99 Facts on the Future of Business in the Digital Economy
99 Facts on the Future of Business in the Digital Economy99 Facts on the Future of Business in the Digital Economy
99 Facts on the Future of Business in the Digital Economy
 
Official Slideshare for What's the Future of Business by Brian Solis #WTF
Official Slideshare for What's the Future of Business by Brian Solis #WTFOfficial Slideshare for What's the Future of Business by Brian Solis #WTF
Official Slideshare for What's the Future of Business by Brian Solis #WTF
 

Similar to (Big) Data (Science) Skills

The Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape OverviewThe Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape OverviewDr. Ananth Krishnamoorthy
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...Mihai Criveti
 
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...Tomasz Bednarz
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_publicAttila Barta
 
Data science meetup - Spiros Antonatos
Data science meetup - Spiros AntonatosData science meetup - Spiros Antonatos
Data science meetup - Spiros AntonatosSpiros Antonatos
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021Gérard Dupont
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...Ilkay Altintas, Ph.D.
 
Big data berlin
Big data berlinBig data berlin
Big data berlinkammeyer
 
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAI
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAIMAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAI
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAIBig Data Week
 
Big data and you
Big data and you Big data and you
Big data and you IBM
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Sciencesarith divakar
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSPhilip Filleul
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachMihai Criveti
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data AnalyticsS P Sajjan
 

Similar to (Big) Data (Science) Skills (20)

Proposed Talk Outline for Pycon2017
Proposed Talk Outline for Pycon2017 Proposed Talk Outline for Pycon2017
Proposed Talk Outline for Pycon2017
 
Introduction to BigData
Introduction to BigData Introduction to BigData
Introduction to BigData
 
The Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape OverviewThe Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape Overview
 
On Big Data
On Big DataOn Big Data
On Big Data
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
 
Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-
 
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_public
 
Data science meetup - Spiros Antonatos
Data science meetup - Spiros AntonatosData science meetup - Spiros Antonatos
Data science meetup - Spiros Antonatos
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021
 
Big Data & DS Analytics for PAARL
Big Data & DS Analytics for PAARLBig Data & DS Analytics for PAARL
Big Data & DS Analytics for PAARL
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
Big data berlin
Big data berlinBig data berlin
Big data berlin
 
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAI
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAIMAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAI
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAI
 
Big data and you
Big data and you Big data and you
Big data and you
 
ODSC and iRODS
ODSC and iRODSODSC and iRODS
ODSC and iRODS
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FS
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps Approach
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data Analytics
 

More from Oscar Corcho

Organisational Interoperability in Practice at Universidad Politécnica de Madrid
Organisational Interoperability in Practice at Universidad Politécnica de MadridOrganisational Interoperability in Practice at Universidad Politécnica de Madrid
Organisational Interoperability in Practice at Universidad Politécnica de MadridOscar Corcho
 
Introducción a los Datos Abiertos - Open Data Day 2020
Introducción a los Datos Abiertos - Open Data Day 2020Introducción a los Datos Abiertos - Open Data Day 2020
Introducción a los Datos Abiertos - Open Data Day 2020Oscar Corcho
 
Open Data (and Software, and other Research Artefacts) - A proper management
Open Data (and Software, and other Research Artefacts) -A proper managementOpen Data (and Software, and other Research Artefacts) -A proper management
Open Data (and Software, and other Research Artefacts) - A proper management Oscar Corcho
 
Adiós a los ficheros, hola a los grafos de conocimientos estadísticos
Adiós a los ficheros, hola a los grafos de conocimientos estadísticosAdiós a los ficheros, hola a los grafos de conocimientos estadísticos
Adiós a los ficheros, hola a los grafos de conocimientos estadísticosOscar Corcho
 
Ontology Engineering at Scale for Open City Data Sharing
Ontology Engineering at Scale for Open City Data SharingOntology Engineering at Scale for Open City Data Sharing
Ontology Engineering at Scale for Open City Data SharingOscar Corcho
 
Situación de las iniciativas de Open Data internacionales (y algunas recomen...
Situación de las iniciativas de Open Data internacionales (y algunas recomen...Situación de las iniciativas de Open Data internacionales (y algunas recomen...
Situación de las iniciativas de Open Data internacionales (y algunas recomen...Oscar Corcho
 
STARS4ALL - Contaminación Lumínica
STARS4ALL - Contaminación LumínicaSTARS4ALL - Contaminación Lumínica
STARS4ALL - Contaminación LumínicaOscar Corcho
 
Towards Reproducible Science: a few building blocks from my personal experience
Towards Reproducible Science: a few building blocks from my personal experienceTowards Reproducible Science: a few building blocks from my personal experience
Towards Reproducible Science: a few building blocks from my personal experienceOscar Corcho
 
Publishing Linked Statistical Data: Aragón, a case study
Publishing Linked Statistical Data: Aragón, a case studyPublishing Linked Statistical Data: Aragón, a case study
Publishing Linked Statistical Data: Aragón, a case studyOscar Corcho
 
An initial analysis of topic-based similarity among scientific documents base...
An initial analysis of topic-based similarity among scientific documents base...An initial analysis of topic-based similarity among scientific documents base...
An initial analysis of topic-based similarity among scientific documents base...Oscar Corcho
 
Linked Statistical Data 101
Linked Statistical Data 101Linked Statistical Data 101
Linked Statistical Data 101Oscar Corcho
 
(Big) Data (Science) Skills

  • 1. (Big) Data (Science) Skills Big Data Value Association Summit in Madrid 17/06/2015 Oscar Corcho ocorcho@fi.upm.es @ocorcho https://www.slideshare.com/ocorcho
  • 2. License • This work is licensed under the license CC BY-NC-SA 4.0 International • http://purl.org/NET/rdflicense/cc-by-nc-sa4.0 • You are free: • to Share — to copy, distribute and transmit the work • to Remix — to adapt the work • Under the following conditions • Attribution — You must attribute the work by inserting • “[source Oscar Corcho]” at the footer of each reused slide • a credits slide stating: “These slides are partially based on “(Big) Data (Science) Skills” by O. Corcho” • Non-commercial • Share-Alike
  • 3. Data Scientist: Technical and Soft Skills needed • One of the two or three pictures expected from a talk on skills… • I may start going through • Each of these topics • Discussing the specific skills needed • However… Sorry, looking for the reference to add here
  • 4. What is Big Data? Source: http://www.philipchircop.com/post/25783275888/seeing-the-full-elephant-its-a-tree-its-a
  • 5. Big Data and the theory of ecological niches
  • 6. Characteristics of an ecological niche • A niche is defined by a spectrum of resource usage • Species differ from each other in how efficient they are in using resources that change continuously • Characteristics of a niche • Amplitude (range in which resources are used) • Generic species (they can use a wide range of resources) • Specialist species (they require a very specific combination of resources) • Overlap (similarity among niches in their usage of resources) • Competitive exclusion principle (Gause, 1934) • If two species coexist in a stable environment, they do so by differentiating their effective ecological niches. Source: Javier Seoane. Ecología. Unidad Temática 21. Teoría del nicho ecológico
  • 7. WHAT’S THE RELATIONSHIP TO BIG DATA? Well, that’s interesting, but…
  • 8. Big Data Niche 1. HPC and e-Infrastructure Experts Background: Computer Science (Systems) System Administration Terms used in their native language: Blades, Infiniband, OpenMPI, racks, HDF, TBs, Gflops Their daily life: Check system logs Make sure that queues are active Install a new rack What’s Big Data for them? A “commercial” term for something that they have done for a long time They really know how to configure and monitor a Hadoop cluster They would love to see those who talk about Big Data running fluid dynamics processes
  • 9. Big Data Niche 2. Data Storage and Access Experts Background: Computer Science Database administration Terms used in their native language: SQL, NoSQL, Column store, Transactions, Hive, TBs/PBs/…, TPS (Transactions per second) Their daily life: Optimise several queries Run a new benchmark Design an optimiser/physical operator What’s Big Data for them? A new opportunity to work on optimisation algorithms They know how to configure a database They often laugh at those who deploy a NoSQL solution for a problem that can be solved with a relational database
  • 10. Big Data Niche 3. Machine Learning Experts Background: Mathematics, Statistics, Physics, Computer Science Terms used in their native language: Complexity, algorithm, p-value, convergence, precision, recall, ROC curves, Bayesian networks, R Their daily life: Read about a new problem Write down a few formulae on the whiteboard (even blackboards) Prove that the algorithm terminates What’s Big Data for them? The same problems applied to data of larger size, with new challenges Problems are not only solved in Hadoop or a powerful NoSQL DB Astonished by those who still mix up correlation and causality
  • 11. Big Data Niche 4. Slow-data Experts Background: Computer Science, Statistics, Library Sciences, Linguistics Terms used in their native language: Information model, vocabulary, ontology, data quality, curation Their daily life: Receive a database schema Talk to data producers and (re)users Obtain consensus and transform data What’s Big Data for them? The difficulty lies in the variety of data formats and structures We may integrate data from varied sources, although this is not always possible When you manage to integrate heterogeneous data, you can achieve better results
  • 12. Big Data Niche 5. (Big Data) Consultants Background: Computer Science, Economics, … Terms used in their native language: Business model, business opportunity, Big Data, Data Value Chain, Hadoop, Spark, R, TBs, GFlops Their daily life: Read a Gartner Big Data report Talk to potential customers Transfer needs to technicians What’s Big Data for them? It’s the 4Vs, plus a few more I have a PPT presentation with a Big Data infrastructure, architecture, and previous projects, which I will use to sell a project to my customers
  • 13. Are we missing any ecological niche? • We have already seen a couple of ecological niches… • They all coexist • Some of them are overlapping Is there anyone that has not yet been considered?
  • 14. The evolution of a new species: the Data Scientist Background: Computer Science + Statistics + Mathematics + Economics + … Terms used in their new exotic language: HPC, databases, algorithms, harmonisation, integration, Hadoop, Spark, R, TBs, GFlops Their daily life: Learn about a new infrastructure Code scripts to be run on Spark Interpret results Install a new framework Read a few scientific papers Make shiny presentations Describe in their blog the activities that they do, so that Big Data is better known and understood …
  • 15. © Volker Markl: “Data Scientist” – “Jack of All Trades!” Application Data Science Control Flow Iterative Algorithms Error Estimation Active Sampling Sketches Curse of Dimensionality Decoupling Convergence Monte Carlo Mathematical Programming Linear Algebra Stochastic Gradient Descent Regression Statistics Hashing Parallelization Query Optimization Fault Tolerance Relational Algebra / SQL Scalability Data Analysis Language Compiler Memory Management Memory Hierarchy Data Flow Hardware Adaptation Indexing Resource Management NF2 /XQuery Data Warehouse/OLAP Domain Expertise (e.g., Industry 4.0, Medicine, Physics, Engineering, Energy, Logistics) Real-Time
  • 16. Data Scientists and Pi-shaped people • Let’s now go into the expected discussion Sorry, looking for the reference to add here
  • 17. Will all species survive? • If Big Data defines an ecosystem… • Which species will survive? • Will Data Scientists wipe out the other species? • Or will they be able to live in perfect symbiosis? What is the ideal training required for the individuals of these species so that they can survive?
  • 18. Data Science starter kits. Are they effective?
  • 19. Masters in Data Science, Big Data and alike (I) Expert in Big Data Expert in Data Science
  • 20. Masters in Data Science, Big Data and alike (II)
  • 21. Masters in Data Science, Big Data and alike (III) Year 1 • Data handling • Data analysis • Advanced data analysis and data management • Visualization • Applications Year 2
  • 22. Are we doing it right in terms of training? • Probably it is all about lack of maturity in the area, but syllabi do not seem to be perfectly compatible… • It is not easy to believe that we can create Data Scientists in only one year • Should we train people to know a bit about everything? • Or should we separate more clearly the species in our ecosystem and specialise them better for their work? How do we manage to keep a healthy and stable ecosystem?
  • 23. Shameless self-promotion • Strategies for success in the Digital-Data Revolution • Separation of concerns • Intellectual ramps • Data-intensive knowledge discovery • Components and usage patterns • Data-intensive engineering • Development vs enactment • Data-intensive application experiences • In Science • In Business Can we learn from lessons learned in Data-Intensive Science?
  • 24. Separation of concerns: three clear profiles • Domain experts (WHAT) • They know the problems they want to solve • They know the application domain • They can create (scientific) workflows • Data-intensive analysts (WHAT) • They know a lot about (Big) data analysis • They may not know about the infrastructure behind the scenes • They do not necessarily know all the details of the application domain • Data-intensive engineers (HOW) • They know a lot about distributed computing/infrastructure/HPC/clouds/etc. • They receive the description of an algorithm and they can make it more efficient (parallelisation)
  • 25. Separation of concerns: Differentiated tasks [Figure: workflow diagram with the steps Programmable Filter Project, Sort, Tuple Burst, De List and CorrFarm; the filter uses rules such as <select = "1<= day(inp.first.start)<=5", project="inp">, <select = "6<= day(inp.first.start)<=10", project="inp">, <select = "11<= day(inp.first.start)<=15", project="inp">, ... Labels: User and application diversity; System complexity; Iterative "what" process development; Mapping, optimisation, deployment and execution; Accommodating and facilitating: several application domains, several tool sets, several process representations, several working practices; DISPEL representation; Composing and providing: many autonomous resources, one enactment mechanism, a single platform; Gateway; Tool level; Enactment level; Component library]
  • 26. Conclusions • We all know that there are big opportunities in Big Data • But we need to be more productive. For that we need to: • Create real multidisciplinary teams with at least three roles (application developers, data-intensive analysts and data-intensive engineers) • Understand that simply by using Hadoop, Spark or R we are not necessarily doing Big Data • Just as, by coding in Java, we are not necessarily understanding object-oriented programming • Understand that we have to interpret results adequately, from a scientific point of view • Understand the importance of homogenising datasets, in order to facilitate their integration (slow-data) • Continue working on delivering tools that can be used to develop Big Data applications more productively • Should we also be funding this?
  • 27. (Big) Data (Science) Skills Big Data Value Association Summit in Madrid 17/06/2015 Oscar Corcho ocorcho@fi.upm.es @ocorcho https://www.slideshare.com/ocorcho