Data science:
Origins, methods,
challenges & the future?
CAGATAY TURKAY,
giCentre, City University London
City Unrulyversity, 18 March 2015
Who?
• Lecturer in Applied Data Science, City Univ. London
• @ giCentre
• PhD @VisGroup at Univ. of Bergen, Norway
Research on …
Methods for
Interactive Visual
Data Analysis
WHAT IS
DATA SCIENCE?
Some examples first – Google Flu Trends
http://www.google.org/flutrends/
Ginsberg, Jeremy, et al. "Detecting influenza epidemics using search engine query data." Nature 457.7232 (2009): 1012-1014.
Video: http://goo.gl/4ysAmw
Google says ..
.. relationship between how
many people search for flu-
related topics and how many
people actually have flu
symptoms. … to estimate
how much flu is circulating ….
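A minimal sketch of the idea in the quote above: if weekly counts of flu-related searches rise and fall together with reported flu cases, the search counts can act as a rough proxy for flu activity. The numbers are invented for illustration and the helper name `pearson` is hypothetical; this is not Google's actual model.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical weekly data (invented for this sketch)
searches = [120, 150, 210, 340, 500, 460, 300, 180]  # flu-related queries
cases = [10, 14, 19, 33, 52, 45, 29, 16]             # reported flu cases

r = pearson(searches, cases)
print(f"correlation between searches and cases: {r:.2f}")
```

A correlation close to 1 only says the two series move together in this sample; it does not by itself guarantee that the relationship will hold in later seasons.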
The Shortest Path to Happiness:
Recommending Beautiful, Quiet,
and Happy Routes in the City
Quercia, Daniele, Rossano Schifanella, and Luca Maria Aiello. ACM
conference on Hypertext and social media, 2014.
http://urbangems.org/
Data science is a systematic study
of generalizable extraction of
knowledge from data
From “Data Science and Prediction” by Vasant Dhar, Communications of the ACM, 2013
Data science is a process starting with
formulating a question that can be answered with
data, and iteratively collecting, cleaning,
analysing and modelling the data, while
communicating the answers to the relevant
audience along this iteration.
On the origins ..
Term coined by William S. Cleveland in 2001[*]
…knowledge among computer scientists about how to think
of and approach the analysis of data is limited, just as the
knowledge of computing environments by statisticians is limited.
A merger of the knowledge bases would produce a powerful
force for innovation.
[*] Cleveland, William S. "Data science: an action plan for expanding the technical areas of the field of statistics." (2001)
By Capgemini Consulting, http://www.slideshare.net/capgemini/impact-of-big-data-on-analytics
Data scientist ?
• Sexiest job of the 21st century, according to Harvard Business Review, 2012
• A data scientist is many people in one, someone:
–who understands domain (industry/academia) needs and
terminology
–who is able to mash-up several analytical tools
–who is able to design and implement solutions to extract
knowledge from the data
–who can communicate findings
On data analysts – analyst types
On data analysts – skills vs. types
DATA SCIENCE PROCESS
DS Process
• Understand domain needs & formulate questions
• Collect & make data available
• Get the data ready for analysis
• Exploratively (and visually) analyse the data
• Model the phenomena (if needed)
• Evaluate findings
• Communicate findings
• ITERATE (from any stage to any other stage)!
DS Process – Data collection, efficient storing & access
DS Process – Getting data ready
Many names:
Data wrangling,
data munging,
data cleaning,
data massaging,
data scrubbing,
pre-processing, ….
I spend more than half of my time integrating,
cleansing and transforming data without doing any actual analysis.
Most of the time I’m lucky if I get to do any analysis. Most of
the time once you transform the data you just do an average... the
insights can be scarily obvious. It’s fun when you get to do something
somewhat analytical.
Kandel, Sean, et al. "Enterprise data analysis and visualization: An interview study." IEEE TVCG (2012)
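A small sketch of what "getting data ready" often involves: trimming text, unifying case, parsing numbers, and dropping rows that are unusable or duplicated. The records and the function name `clean_records` are hypothetical, and real pipelines usually need many more rules than this.

```python
def clean_records(rows):
    """Trim text, unify case, parse numbers, drop unusable rows and exact duplicates."""
    cleaned, seen = [], set()
    for city, rate in rows:
        city = city.strip().title()  # "  london " -> "London"
        try:
            rate = float(str(rate).replace("%", "").strip())  # "7.5%" -> 7.5
        except ValueError:
            continue  # missing or unparseable value: skip the row
        key = (city, rate)
        if key not in seen:  # keep only the first copy of each record
            seen.add(key)
            cleaned.append(key)
    return cleaned

# Hypothetical raw input with typical problems
raw = [("  london ", "7.5%"), ("London", 7.5), ("BRISTOL", "n/a"),
       ("leeds", " 6.1 "), ("York", None)]
records = clean_records(raw)
print(records)
```

Five raw rows shrink to two usable ones here, which is a fair picture of why this stage takes so much of an analyst's time.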
DS Process – (Exploratively) Analysing data
Goal: Generate / confirm ideas, findings (hypotheses)
• Analysis tasks:
–Finding anomalies
–Finding relations
–Finding groups
–Summarizing information
–Making predictions
–Understanding uncertainties
–Evaluating hypotheses
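As one illustration of the "finding anomalies" task, the sketch below flags values that sit far from the median, using the modified z-score based on the median absolute deviation. The threshold of 3.5 is a common rule of thumb, and the readings and the name `find_anomalies` are assumptions made for this example.

```python
from statistics import median

def find_anomalies(values, threshold=3.5):
    """Return values whose modified z-score (median/MAD based) exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical sensor readings with one obvious outlier
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2]
outliers = find_anomalies(readings)
print(outliers)
```

The median is used instead of the mean so that the outlier itself does not distort the reference point it is compared against.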
DS Process – Build models
• Statistical models
–For summarising, representing data
–For predictions, estimations
• Machine learning methods, e.g., neural networks
–Predictive tasks
–Classification tasks
–Unsupervised / supervised models
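To make the idea of a supervised classification model concrete, here is a deliberately simple nearest-centroid classifier rather than a neural network. The training data, labels, and function names are hypothetical; in practice a library such as scikit-learn would normally be used.

```python
from math import dist

def fit_centroids(points, labels):
    """Supervised step: compute the mean point (centroid) of each labelled class."""
    sums, counts = {}, {}
    for p, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(s / counts[label] for s in acc)
            for label, acc in sums.items()}

def predict(centroids, point):
    """Classify a point by the closest class centroid (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(centroids[label], point))

# Hypothetical training data: (height cm, weight kg) with a class label
train_x = [(150, 50), (155, 55), (160, 52), (180, 85), (185, 90), (178, 80)]
train_y = ["small", "small", "small", "large", "large", "large"]

model = fit_centroids(train_x, train_y)
prediction = predict(model, (182, 88))
print(prediction)
```

The "model" is just one average point per class, which keeps it easy to inspect and explain.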
Visualise & Communicate along the process!
ITERATE !
(SOME) CHALLENGES
HETEROGENEOUS DATA
+
Text
Material by Tamara Munzner, http://www.cs.ubc.ca/~tmm/talks.html#minicourse14
Why a challenge?
• Computational methods do not work for all types of data
• Hard (cognitively) to link observations
• Gaps in the data (sparsity)
• Data is not necessarily linked!
[by O’Donoghue et al., 2010]
BREADTH OF DATA
Why a challenge?
• Models become complex!
• Hard to interpret and utilise
• Less reliable findings, high uncertainty
Image from: http://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/
The curse of dimensionality
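One way to see the curse of dimensionality, sketched under simple assumptions (uniform random points in a unit cube, a hypothetical helper `relative_contrast`): as the number of dimensions grows, the nearest and farthest points from a query end up at almost the same distance, so distance-based methods lose their ability to discriminate.

```python
import random
from math import dist

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point
    to random points in the unit cube of the given dimension."""
    rng = random.Random(seed)
    query = tuple(rng.random() for _ in range(dim))
    distances = [dist(query, tuple(rng.random() for _ in range(dim)))
                 for _ in range(n_points)]
    return (max(distances) - min(distances)) / min(distances)

for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 2))
```

The printed contrast shrinks as the dimension increases, which is one reason findings on very broad data tend to be less reliable.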
IMPERFECT DATA
This is a straight line that “connects the dots”, i.e., best fits the points.
Small exercise
Think of straight lines for these four sets
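The four sets on the slide are not reproduced in this text. As an assumption, the sketch below uses Anscombe's quartet (1973), a classic example of the same point: four very different-looking data sets that produce practically the same best-fit straight line. The helper `fit_line` is a hypothetical name for an ordinary least-squares fit.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Anscombe's quartet (assumed stand-in for the four sets on the slide)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (slope, intercept) in fits.items():
    print(f"set {name}: y = {intercept:.2f} + {slope:.2f}x")
```

All four fits come out at roughly y = 3 + 0.5x, even though one set is curved, one has a single outlier, and one has almost all points at the same x. The line alone hides these imperfections; plotting the data reveals them.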
DYNAMIC
DATA (& CONCEPTS)
… Internet search behaviour changed during pH1N1,
particularly in the categories “influenza complications”
and “term for influenza.” The complications associated
with pH1N1, the fact that pH1N1 began in the summer
rather than winter, and changes in health-seeking
behaviour ….
13 February 2013,
http://www.nature.com/news/when-google-got-flu-wrong-1.12413
GENERATING
PLAUSIBLE & USEFUL
HYPOTHESES
http://www.tylervigen.com/
HOW WE (TRY TO) DEAL WITH
THESE? -- EXAMPLES
STUDY 1 – MULTIVARIATE GEOGRAPHICAL DATA
Attribute
Signatures:
Dynamic Visual
Summaries for
Analyzing Multivariate
Geographical Data
Cagatay Turkay, Aidan Slingsby,
Helwig Hauser, Jo Wood, Jason
Dykes, InfoVis 2014
UK Census of Population in 2001 and 2011 for the
181,000 Output Areas (OA)
for 41 indicators
question is…
How do all the variables
vary over space (and time)?
we built methods to:
interactively
generate
visual summaries of change
in all the variables
in response to variation
(location, extent, resolution)
Attribute signatures
It starts with a map …
and an attribute to analyze,
e.g., unemployment rates
Attribute signatures
Above baseline
Attribute signatures
~ baseline (country average)
Attribute signatures
Below baseline
Attribute signatures
Attribute signatures
A dynamically generated
visual summary!
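A heavily simplified sketch of the idea behind such a summary, not the authors' implementation: slide a selection window over areas ordered along a path, and record how the local average of an attribute deviates from the baseline (here the overall average). The rates and the function name `attribute_signature` are hypothetical.

```python
def attribute_signature(values, window):
    """Deviation of each sliding-window mean from the overall (baseline) mean."""
    baseline = sum(values) / len(values)
    signature = []
    for start in range(len(values) - window + 1):
        local = values[start:start + window]
        signature.append(sum(local) / window - baseline)
    return signature

# Hypothetical unemployment rates (%) for areas ordered along a path
rates = [4.0, 5.0, 9.0, 10.0, 6.0, 3.0, 2.0, 1.0]

signature = attribute_signature(rates, window=2)
labels = ["above" if d > 0 else "below" if d < 0 else "baseline"
          for d in signature]
print(list(zip(signature, labels)))
```

Changing `window` plays the role of changing the extent of the selection: a larger window gives a smoother signature, a smaller one shows more local variation.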
How about several attributes?
Linked
small
multiples
Along the Southwest England coast
2011 vs. 2001
STUDY 2 – HOW TO CHARACTERIZE CANCER SUBTYPES?
Characterizing
Cancer Subtypes
Using Dual Analysis
in Caleydo
StratomeX
Cagatay Turkay, Alexander Lex, Marc Streit,
Hanspeter Pfister, Helwig Hauser, IEEE CG&A
2014
Patients (samples)
Genes
Candidate Subtype /
Heat Map
Cancer Subtypes are identified
by grouping datasets based on
• gene activity
• mutations
• or a combination of these
http://caleydo.org/
Colour to show
gene activity
There are always many ways to group!
Grouping 1: Group A1, Group A2, Group A3
Grouping 2: Group B1, Group B2
Grouping 3: Group C1, Group C2
What are the groups & Why?
Many shared Patients
Clustering 1 Clustering 2
Sample Overlaps
Gene Overlaps??
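The sample-overlap question can be sketched as a simple cross-tabulation: for every pair of groups from two clusterings, count how many patients they share. Patient ids, group labels, and the function name `sample_overlaps` are hypothetical; this is an illustration of the comparison, not the Caleydo StratomeX implementation.

```python
from collections import Counter

def sample_overlaps(clustering_a, clustering_b):
    """Count shared samples for every pair of groups from two clusterings.

    Each clustering maps a sample id to a group label.
    """
    shared = clustering_a.keys() & clustering_b.keys()
    return Counter((clustering_a[s], clustering_b[s]) for s in shared)

# Hypothetical patient ids and group labels
clustering_1 = {"p1": "A1", "p2": "A1", "p3": "A2", "p4": "A2", "p5": "A3"}
clustering_2 = {"p1": "B1", "p2": "B1", "p3": "B1", "p4": "B2", "p5": "B2"}

overlaps = sample_overlaps(clustering_1, clustering_2)
for (group_a, group_b), count in sorted(overlaps.items()):
    print(group_a, group_b, count)
```

Large counts point to groups that are essentially the same patients under two different groupings; scattered counts suggest the two groupings capture different structure.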
Finding distinctive properties
VISUAL ANALYSIS
TO FACILITATE THE
ITERATIVE PROCESS &
HYPOTHESES GENERATION
Some remarks on the future
• Infrastructure & speed – better solutions?
• More on “Value of the data”
• More sophisticated analyses
• Predictive analytics on the rise
• More central role for the user (through visualisation)
• New sources of data
–Increasingly you! Health apps, etc.
–Internet of things
To conclude ….
Data science is a process starting with
formulating a question that can be answered with
data, and iteratively collecting, cleaning,
analysing and modelling the data, while
communicating the answers to the relevant
audience along this iteration.
Some technologies (we use in teaching DS)
• Python
• Pandas for Statistical Computing
• Scikit-learn for machine learning
• R Statistical Computation Software
• Apache Spark
• Tableau Software
Further reading …
Thank you !
Cagatay.Turkay.1@city.ac.uk
@cagatay_turkay
http://staff.city.ac.uk/cagatay.turkay.1/
More data science?
Data Science MSc @ CITY

Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
 
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 
Econ3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdfEcon3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdf
 
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
 
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
 
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
 
Bangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts ServiceBangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts Service
 
SAP BW4HANA Implementagtion Content Document
SAP BW4HANA Implementagtion Content DocumentSAP BW4HANA Implementagtion Content Document
SAP BW4HANA Implementagtion Content Document
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
 
Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
 
Call Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
Call Girls Lucknow 0000000000 Independent Call Girl Service LucknowCall Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
Call Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
 
Sid Sigma educational and problem solving power point- Six Sigma.ppt
Sid Sigma educational and problem solving power point- Six Sigma.pptSid Sigma educational and problem solving power point- Six Sigma.ppt
Sid Sigma educational and problem solving power point- Six Sigma.ppt
 
06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
 

Data Science: Origins, Methods, Challenges and the future?

  • 1. Data science: Origins, methods, challenges & the future? Cagatay Turkay, giCentre, City University London. City Unrulyversity, 18 March 2015
  • 2. Who? • Lecturer in Applied Data Science, City Univ. London • @ giCentre • PhD @VisGroup at Univ. of Bergen, Norway
  • 3. Research on … Methods for Interactive Visual Data Analysis
  • 5. Some examples first – Google Flu Trends http://www.google.org/flutrends/ Ginsberg, Jeremy, et al. "Detecting influenza epidemics using search engine query data." Nature 457.7232 (2008): 1012-1014. Video: http://goo.gl/4ysAmw Google says .. .. relationship between how many people search for flu-related topics and how many people actually have flu symptoms. … to estimate how much flu is circulating ….
  • 6. The Shortest Path to Happiness: Recommending Beautiful, Quiet, and Happy Routes in the City Quercia, Daniele, Rossano Schifanella, and Luca Maria Aiello. ACM conference on Hypertext and social media, 2014. http://urbangems.org/
  • 7. Data science is a systematic study of generalizable extraction of knowledge from data. From “Data Science and Prediction” by Vasant Dhar, Communications of the ACM, 2013
  • 8. Data science is a process starting with formulating a question that can be answered with data, and iteratively collecting, cleaning, analysing and modelling the data, while communicating the answers to the relevant audience along this iteration.
  • 9. On the origins .. Term coined by William S. Cleveland in 2001[*] …knowledge among computer scientists about how to think of and approach the analysis of data is limited, just as the knowledge of computing environments by statisticians is limited. A merger of the knowledge bases would produce a powerful force for innovation. [*] Cleveland, William S. "Data science: an action plan for expanding the technical areas of the field of statistics." (2001)
  • 10. By Capgemini Consulting, http://www.slideshare.net/capgemini/impact-of-big-data-on-analytics
  • 11. Data scientist? • Sexiest job of the 21st century, according to Harvard Business Review, 2012 • A data scientist is many people in one, someone: – who understands domain (industry/academia) needs and terminology – who is able to mash up several analytical tools – who is able to design and implement solutions to extract knowledge from the data – who can communicate findings
  • 12. On data analysts – analyst types
  • 13. On data analysts – skills vs. types
  • 15. DS Process • Understand domain needs & formulate questions • Collect & make data available • Get the data ready for analysis • Exploratively (and visually) analyse the data • Model the phenomena (if needed) • Evaluate findings • Communicate findings • ITERATE (from any stage to any other stage)!
  • 16. DS Process – Data collection, efficient storing & access
  • 17. DS Process – Getting data ready Many names: Data wrangling, data munging, data cleaning, data massaging, data scrubbing, pre-processing, ….
  • 18. I spend more than half of my time integrating, cleansing and transforming data without doing any actual analysis. Most of the time I’m lucky if I get to do any analysis. Most of the time once you transform the data you just do an average... the insights can be scarily obvious. It’s fun when you get to do something somewhat analytical. Kandel, Sean, et al. "Enterprise data analysis and visualization: An interview study." IEEE TVCG (2012)
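The "getting data ready" stage named on the two slides above can be illustrated with a minimal sketch. The records, the field names (`city`, `unemployment_pct`) and the cleaning rules are all hypothetical, chosen only to show typical steps: trimming whitespace, normalising case, parsing numbers, and dropping missing values and duplicates.

```python
# Hypothetical raw records, as they might arrive from a spreadsheet export.
raw_rows = [
    {"city": " London ", "unemployment_pct": "7.2"},
    {"city": "london", "unemployment_pct": "7.2"},     # duplicate once normalised
    {"city": "Bergen", "unemployment_pct": "n/a"},     # missing measurement
    {"city": "BRISTOL", "unemployment_pct": " 5,9 "},  # decimal comma
]


def parse_number(text):
    """Return a float, or None when the cell does not hold a number."""
    cleaned = text.strip().replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


def clean(rows):
    """Normalise names, parse numbers, drop missing values and duplicates."""
    seen = set()
    result = []
    for row in rows:
        city = row["city"].strip().title()
        value = parse_number(row["unemployment_pct"])
        if value is None:
            continue  # no usable measurement in this row
        key = (city, value)
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        result.append({"city": city, "unemployment_pct": value})
    return result


cleaned_rows = clean(raw_rows)
print(cleaned_rows)
```

In practice a library such as Pandas would usually take over these steps; the plain-Python version only makes each decision visible.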
  • 19. DS Process – (Exploratively) Analysing data Goal: Generate / confirm ideas, findings (hypotheses) • Analysis tasks: – Finding anomalies – Finding relations – Finding groups – Summarizing information – Making predictions – Understanding uncertainties – Evaluating hypotheses
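As one illustration of the "finding anomalies" task listed above, the sketch below flags values that sit far from the median, measured in units of the median absolute deviation (MAD). The readings, the function name `find_anomalies` and the threshold of 3.5 are assumptions made for this example; the talk does not prescribe a particular method.

```python
from statistics import median


def find_anomalies(values, threshold=3.5):
    """Return the values whose robust z-score exceeds the threshold."""
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        return []  # no spread, so nothing can be flagged this way
    # 0.6745 rescales the MAD so scores are comparable to ordinary
    # z-scores when the data are roughly normally distributed.
    return [v for v in values if abs(0.6745 * (v - centre) / mad) > threshold]


# Hypothetical sensor readings with one obvious outlier.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2]
anomalies = find_anomalies(readings)
print(anomalies)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value barely moves them, so the outlier cannot hide itself by inflating the spread.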
  • 20. DS Process – Build models • Statistical models –For summarising, representing data –For predictions, estimations • Machine learning methods, e.g., neural networks –Predictive tasks –Classification tasks –Unsupervised / supervised models
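A supervised classification task of the kind listed above can be sketched with a k-nearest-neighbour classifier, picked here only because it fits in a few lines. The data points, labels and the function name `knn_predict` are hypothetical; in teaching practice a library such as Scikit-learn would typically be used instead.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical labelled data: (feature_1, feature_2) -> class label.
points = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels = ["low", "low", "low", "high", "high", "high"]

prediction_a = knn_predict(points, labels, (1.1, 1.0))
prediction_b = knn_predict(points, labels, (4.9, 5.1))
print(prediction_a, prediction_b)
```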
  • 21. Visualise & Communicate along the process!
  • 25. + Text Material by Tamara Munzner, http://www.cs.ubc.ca/~tmm/talks.html#minicourse14
  • 26. Why a challenge? • Computational methods do not work for all types of data • Hard (cognitively) to link observations • Gaps in the data (sparsity) • Data is not necessarily linked! [by O’Donoghue et al., 2010]
  • 29. The curse of dimensionality. Why a challenge? • Models become complex! • Hard to interpret and utilise • Less reliable findings, high uncertainty Image from: http://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/
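One facet of the curse of dimensionality can be shown numerically: as the number of dimensions grows, the nearest and farthest neighbours of a point end up at almost the same distance, which undermines distance-based methods. The sketch below is an illustration under assumed settings (uniform random points in a unit cube, 200 points, a fixed seed); the function name `relative_contrast` is made up for this example.

```python
import random
from math import dist


def relative_contrast(dimensions, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = [
        dist(query, [rng.random() for _ in range(dimensions)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)


contrast_low = relative_contrast(2)     # 2 dimensions
contrast_high = relative_contrast(500)  # 500 dimensions
print(contrast_low, contrast_high)
```

A large contrast in two dimensions means "near" and "far" are clearly different; a contrast close to zero in 500 dimensions means every point looks roughly equally far away.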
  • 31. Small exercise: This is a straight line that “connects the dots”, i.e., best fits the points.
  • 32. Think of straight lines for these four sets
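The exercise above can be tried in code. The four sets shown on the slide are not reproduced in this transcript, so the sketch assumes they resemble Anscombe's quartet, a classic collection of four small datasets that look very different when plotted. The function name `fit_line` is chosen for this example; it computes the ordinary least-squares line.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Anscombe's quartet (assumed stand-in for the four sets on the slide).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (slope, intercept) in fits.items():
    print(f"{name}: y = {slope:.2f} x + {intercept:.2f}")
```

All four sets yield practically the same line, which is why looking at the data, and not only at the fitted model, matters.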
  • 35. … Internet search behaviour changed during pH1N1, particularly in the categories “influenza complications” and “term for influenza.” The complications associated with pH1N1, the fact that pH1N1 began in the summer rather than winter, and changes in health-seeking behaviour …. 13 February 2013, http://www.nature.com/news/when-google-got-flu-wrong-1.12413
  • 38. HOW WE (TRY TO) DEAL WITH THESE? -- EXAMPLES
  • 40. STUDY 1 – MULTIVARIATE GEOGRAPHICAL DATA Attribute Signatures: Dynamic Visual Summaries for Analyzing Multivariate Geographical Data. Cagatay Turkay, Aidan Slingsby, Helwig Hauser, Jo Wood, Jason Dykes, InfoVis 2014. UK Census of Population in 2001 and 2011 for the 181,000 Output Areas (OA) for 41 indicators
  • 41. The question is… How do all the variables vary over space (and time)?
  • 42. We built methods to: interactively generate visual summaries of change in all the variables in response to variation (location, extent, resolution)
  • 43. Attribute signatures It starts with a map … and an attribute to analyze, e.g., unemployment rates
  • 45. Attribute signatures ~ baseline (country average)
  • 48. Attribute signatures A dynamically generated visual summary!
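The idea of summarising an attribute against a baseline while the selected location changes can be sketched in a few lines. This is not the implementation from the Attribute Signatures paper: the values, the one-dimensional path, the window size and the function name `attribute_signature` are all assumptions made for illustration.

```python
def attribute_signature(values, window):
    """Local mean minus overall mean, for each window position along a path."""
    baseline = sum(values) / len(values)  # stands in for the country average
    signature = []
    for start in range(len(values) - window + 1):
        local = values[start:start + window]
        signature.append(sum(local) / window - baseline)
    return signature


# Hypothetical unemployment rates (%) for areas ordered along a path on the map.
rates = [4.0, 4.0, 6.0, 8.0, 8.0, 6.0]
signature = attribute_signature(rates, window=2)
print(signature)
```

Positive entries mark stretches of the path where the local rate is above the baseline, negative entries where it is below; plotting the list as a small line chart gives one curve per attribute.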
  • 49. How about several attributes? Linked small multiples
  • 51. Along the Southwest England coast
  • 53. STUDY 2 – HOW TO CHARACTERIZE CANCER SUBTYPES? Characterizing Cancer Subtypes Using Dual Analysis in Caleydo StratomeX. Cagatay Turkay, Alexander Lex, Marc Streit, Hanspeter Pfister, Helwig Hauser, IEEE CG&A 2014
  • 54. Patients (samples), Genes, Candidate Subtype / Heat Map. Colour to show gene activity. Cancer subtypes are identified by grouping datasets based on • gene activity • mutations • or a combination of these. http://caleydo.org/
  • 55. There are always many ways to group! Grouping 1: Group A1, Group A2, Group A3. Grouping 2: Group B1, Group B2. Grouping 3: Group C1, Group C2.
  • 56. What are the groups & why? Clustering 1, Clustering 2: many shared patients. Sample overlaps. Gene overlaps??
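The sample overlaps mentioned above can be computed by counting, for every pair of groups drawn from two clusterings, how many patients they share. The patient identifiers and group memberships below are hypothetical, and `overlap_table` is a name chosen for this sketch; it is not part of Caleydo StratomeX.

```python
def overlap_table(clustering_a, clustering_b):
    """Count shared patients for every pair of groups from two clusterings."""
    return {
        (name_a, name_b): len(members_a & members_b)
        for name_a, members_a in clustering_a.items()
        for name_b, members_b in clustering_b.items()
    }


# Hypothetical groupings of six patients produced by two different methods.
clustering_1 = {"A1": {"p1", "p2", "p3"}, "A2": {"p4", "p5"}, "A3": {"p6"}}
clustering_2 = {"B1": {"p1", "p2", "p4"}, "B2": {"p3", "p5", "p6"}}

overlaps = overlap_table(clustering_1, clustering_2)
for pair, shared in sorted(overlaps.items()):
    print(pair, shared)
```

A pair with a high count suggests the two clusterings agree on that set of patients, which makes it a natural starting point for asking why the groups form.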
  • 58. VISUAL ANALYSIS TO FACILITATE THE ITERATIVE PROCESS & HYPOTHESES GENERATION
  • 59. Some remarks on the future • Infrastructure & speed – better solutions? • More on “Value of the data” • More sophisticated analyses • Predictive analytics on the rise • More central role for the user (through visualisation) • New sources of data – Increasingly you! Health apps, etc. – Internet of things
  • 60. To conclude …. Data science is a process starting with formulating a question that can be answered with data, and iteratively collecting, cleaning, analysing and modelling the data, while communicating the answers to the relevant audience along this iteration.
  • 61. Some technologies (we use in teaching DS) • Python • Pandas for Statistical Computing • Scikit-learn for machine learning • R Statistical Computation Software • Apache Spark • Tableau Software