The Data Lifecycle
#harvarddatafest
Mercè Crosas
Chief Data Science and Technology Officer
Institute for Quantitative Social Science
Harvard University
@mercecrosas
Data
Facts that can be analyzed or used in an effort to
gain knowledge or make decisions; information.
Latin Datum “Something Given”
(The American Heritage Dictionary)
Museum of Fine Arts
The Art of the Ancient World
Friday, January 13, 2017, 7pm
Giza Excavation 1902-1947
Archeological Finds as Data
The Giza Archive
and catalogue:
• 21,000 finds
• 3,000 photos
• 3,000 tomb and
monument
records
• 3,000 diary pages
• 7,000 maps
• books,
manuscripts
• metadata
3D Visualization to Re-Explore Giza
Peter Der Manuelian, Harvard’s Visualization Center (Geological Museum),
DARTH, and Dassault Systèmes in Paris
Books as Data
Harvard Cultural
Observatory (Aiden,
Michel, SEAS) + Google:
• 5 million digitized books
(from 130 million total)
• From year 1500 to 2008
• Optical Character
Recognition (OCR)
• Quantify unstructured
text to n-gram count
matrix
Frequency of Words with Time
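The step from OCR text to an n-gram count matrix, and from there to word frequency over time, can be sketched as follows. This is a minimal illustration on a made-up two-year corpus, not the pipeline used for the digitized books; the names `CORPUS`, `ngram_count_matrix`, and `relative_frequency` are hypothetical, and tokenization is plain whitespace splitting.

```python
from collections import Counter

# Hypothetical toy corpus: publication year -> list of OCR'd texts.
CORPUS = {
    1800: ["the war began", "the peace"],
    1900: ["the war ended the war"],
}

def ngrams(text, n):
    """Split text on whitespace and return its n-grams as strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_count_matrix(docs_by_year, n=1):
    """Return (years, vocabulary, matrix) with matrix[i][j] = count of
    vocabulary[j] in all texts published in years[i]."""
    counts = {year: Counter() for year in docs_by_year}
    for year, docs in docs_by_year.items():
        for doc in docs:
            counts[year].update(ngrams(doc, n))
    years = sorted(counts)
    vocab = sorted(set().union(*counts.values()))
    matrix = [[counts[y][g] for g in vocab] for y in years]
    return years, vocab, matrix

def relative_frequency(years, vocab, matrix, gram):
    """Share of all n-grams in each year that equal `gram`."""
    j = vocab.index(gram)
    return {y: (row[j] / sum(row) if sum(row) else 0.0)
            for y, row in zip(years, matrix)}

years, vocab, matrix = ngram_count_matrix(CORPUS, n=1)
war_over_time = relative_frequency(years, vocab, matrix, "war")
```

Dividing by the yearly total matters because the number of digitized books differs from year to year; raw counts alone would mostly track corpus size.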
Historical events as Data
my ab POSITION Places Notes Terrain Notes ID TYPE NAME DATE STEP VALUE UNITS TEXT LAT1 LON1 LAT2 LON2 PATH CERTAINTY
Needs static event map | Needs static event map | 4/6/1760 0
4/7/1760 0
18.3648, -76.8852 | Frontier misplaced to NW | 1 Conspiracy Frontier 4/7/1760 1 5 Rebel conspirators 18.3648 -76.8852 1
18.3542, -76.8990 | Trinity misplaced to NW | 1 Rebels Trinity 4/7/1760 2 100 Rebel force 18.3542 -76.899 1
4/8/1760 0
Fixed | 18.3542, -76.8990 to Ft. Haldane | Still glitchy; start should be from Trinity | 1 Rebels Ft. Haldane 4/8/1760 1 100 Rebel force 18.3542 -76.899 18.3776 -76.8905 1
Fixed | database fixed? | 1 Clash Ft. Haldane 4/8/1760 2 100 Rebel force 18.3776 -76.8905 1
18.3542, -76.8990 | Trinity misplaced to NW | 1 Rebels Trinity 4/8/1760 3 100 Rebel force 18.3776 -76.8905 18.3542 -76.899 1
18.3223, -76.8875 | Ballard's Valley misplaced to NW | To extent possible, movement should follow roads (dotted line) | 1 Rebels Ballard's Valley 4/8/1760 4 400 Rebel force 18.3542 -76.899 18.3223 -76.8875 | [[18.3545, -76.8976], [18.3521, -76.8964], [18.3485, -76.8971], [18.3446, -76.8964], [18.3402, -76.8974], [18.3389, -76.8952], [18.3368, -76.8933], [18.3368, -76.8907], [18.3355, -76.8888], [18.3340, -76.8866], [18.3324, -76.8844], [18.3291, -76.8832], [18.3260, -76.8846], [18.3224, -76.8866]] | 1
18.3223, -76.8875 | Moving clash; Tab is wrong | To extent possible, movement should follow roads (dotted line) | 1 Clash Ballard's Valley 4/8/1760 5 400 Rebel force | More than 12 whites, 30 non-whites killed | 18.3223 -76.8875 1
18.3223, -76.8875 to Esher | Rebels should move from Ballard's Valley to Esher | To extent possible, movement should follow roads (dotted line) | 1 Rebels Esher 4/8/1760 6 18.3223 -76.8875 18.2809 -76.8675 | [[18.3223, -76.8875], [18.3198, -76.8859], [18.3164, -76.8866], [18.3144, -76.8863], [18.3125, -76.8888], [18.3105, -76.8925], [18.3060, -76.8933], [18.2998, -76.8955], [18.2972, -76.8991], [18.2937, -76.9000], [18.2908, -76.8971], [18.2890, -76.8928], [18.2858, -76.8906], [18.2807, -76.8906], [18.2750, -76.8882], [18.2709, -76.8851], [18.2704, -76.8822], [18.2703, -76.8773], [18.2722, -76.8736], [18.2783, -76.8703], [18.2809, -76.8675]] | 1
To extent possible, movement should follow roads (dotted line) | 1 Clash Esher 4/8/1760 7 5 whites killed 18.2809 -76.8675 1
18.3223, -76.8875 | Eliminate movement; Militia appear at single point, Ballard's Valley | To extent possible, movement should follow roads (dotted line) | 2 Militia Ballard's Valley 4/8/1760 8 18.3223 -76.8875 0
Esher to 18.3050, -76.8882 | Starting pt is Esher | To extent possible, movement should follow roads (dotted line) | 1 Rebels Whitehall 4/8/1760 9 400 Rebel force 18.2809 -76.8675 18.305 -76.8882 | [[18.2809, -76.8675], [18.2792, -76.8694], [18.2766, -76.8720], [18.2740, -76.8736], [18.2708, -76.8796], [18.2706, -76.8839], [18.2809, -76.8906], [18.2885, -76.8921], [18.2908, -76.8964], [18.2919, -76.8995], [18.2952, -76.8995], [18.2981, -76.8976], [18.3032, -76.8937], [18.305, -76.8882]] | 1
Geospatial coordinates Time
Jamaica Slave Revolt in Space
and Time (1760 - 1761)
Animated maps to narrate and explore history data
by Vincent Brown (Harvard’s History Department)
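To show how a PATH of latitude/longitude pairs and a time step can drive an animated map, here is a minimal sketch. It uses the Esher to Whitehall path from the event table above; the function names, the haversine distance, and the linear interpolation between waypoints are assumptions made for illustration, not a description of how the actual animation was built.

```python
import math

EARTH_RADIUS_KM = 6371.0

# Esher -> Whitehall path (4/8/1760, step 9), as (latitude, longitude) pairs.
PATH = [
    (18.2809, -76.8675), (18.2792, -76.8694), (18.2766, -76.8720),
    (18.2740, -76.8736), (18.2708, -76.8796), (18.2706, -76.8839),
    (18.2809, -76.8906), (18.2885, -76.8921), (18.2908, -76.8964),
    (18.2919, -76.8995), (18.2952, -76.8995), (18.2981, -76.8976),
    (18.3032, -76.8937), (18.305, -76.8882),
]

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def path_length_km(path):
    """Total length of a path, summed over consecutive waypoints."""
    return sum(haversine_km(a, b) for a, b in zip(path, path[1:]))

def position_at(path, fraction):
    """Point reached after covering `fraction` (0..1) of the path length,
    interpolating linearly in lat/lon within each segment."""
    fraction = min(max(fraction, 0.0), 1.0)
    segments = [haversine_km(a, b) for a, b in zip(path, path[1:])]
    target = fraction * sum(segments)
    for (a, b), d in zip(zip(path, path[1:]), segments):
        if d > 0 and target <= d:
            t = target / d
            return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
        target -= d
    return tuple(path[-1])

total_km = path_length_km(PATH)
halfway = position_at(PATH, 0.5)
```

An animation frame for a given moment would call `position_at` with the elapsed share of the step's duration. Because the path follows roads, its length exceeds the straight-line distance between the two endpoints.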
Polls as Data
2016 Election Data:
• 100s of scientific polls
• Mixed models and sample
frames
• interviewer-administered
telephone polls
• interactive-voice-response polls
• probability-based web polls
• good quality non-probability based
web polls
• Inference and forecasting
(no comment)
Video as Data
Video in developmental
psychology:
• 143 hours of infant-
perspective scenes,
collected from 34
infants aged 1
month to 2 years
• Coding and
annotations
Fausey, C.M., Smith, L.B. & Jayaraman, S. (2015). From faces to hands: Changing
visual input in the first two years. Databrary
Pollution measurements as Data
Combine data from
diverse sources:
• 3,000 air pollution
monitors
• SO2 emissions from
1,800 power plants
• claims data from 60 M
Medicare beneficiaries
• air pollution trajectories
based on weather
models
Zigler, Dominici, Wang, Kim, Choirat (Harvard Chan School of Public Health)
Your Genome as Data
Measurement of Single-
Nucleotide
Polymorphisms (SNPs)
with Sequencer
instrument
Genome Wide Association Study
Comparing genomes from individuals with
(cases) and without disease (controls)
Genetic variants more often found in individuals with constrictions in small blood vessels
(GWAS, Wikipedia)
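The case/control comparison behind a genome-wide association study can be illustrated for a single variant. The sketch below assumes a 2x2 table of carriers and non-carriers and uses an odds ratio with a Pearson chi-square test (one degree of freedom), which is one common choice rather than the method of any particular study; the counts are invented for illustration.

```python
import math

def odds_ratio(case_carrier, case_non, control_carrier, control_non):
    """Odds of carrying the variant among cases relative to controls."""
    return (case_carrier * control_non) / (case_non * control_carrier)

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_1dof(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts for one SNP: 100 cases and 100 controls.
CASES = (30, 70)      # (carriers, non-carriers) among individuals with disease
CONTROLS = (15, 85)   # (carriers, non-carriers) among individuals without

ratio = odds_ratio(*CASES, *CONTROLS)
chi2 = chi_square_2x2(*CASES, *CONTROLS)
p = p_value_1dof(chi2)
```

A real study repeats such a test for hundreds of thousands of SNPs, so the significance threshold has to be far stricter than the usual 0.05 to account for multiple comparisons.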
Particle Detections as Data
The Large Hadron Collider (LHC), CERN
Yes, there is a Higgs Boson!
Likelihoods as result
of comparing LHC
observations with
predictions from
particle physics
Standard Model
Measurements of Higgs boson production and couplings in diboson final states with the ATLAS
detector at the LHC (ATLAS Collaboration, 2013, with 2923 authors).
Acknowledgment Kyle Cranmer (NYU)
(but is it the one we expected?)
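As a rough idea of what "likelihoods as result of comparing observations with predictions" means, the sketch below scores binned event counts under two hypotheses with a Poisson likelihood: background only, and background plus signal. All numbers and names are invented for illustration; this is a toy counting model, not the ATLAS analysis, which involves many channels and systematic uncertainties.

```python
import math

def poisson_log_likelihood(observed, expected):
    """Sum over bins of log P(n | mu) for independent Poisson counts."""
    return sum(n * math.log(mu) - mu - math.lgamma(n + 1)
               for n, mu in zip(observed, expected))

def likelihood_ratio_statistic(observed, background, signal):
    """2 * [log L(signal + background) - log L(background only)].
    Larger values mean the data favor the signal hypothesis."""
    with_signal = [b + s for b, s in zip(background, signal)]
    return 2 * (poisson_log_likelihood(observed, with_signal)
                - poisson_log_likelihood(observed, background))

# Hypothetical counts in three mass bins.
OBSERVED = [12, 25, 14]
BACKGROUND = [10.0, 12.0, 11.0]   # predicted without the new particle
SIGNAL = [1.0, 12.0, 2.0]         # extra events predicted if it exists

q = likelihood_ratio_statistic(OBSERVED, BACKGROUND, SIGNAL)
```

Here the excess in the middle bin drives `q` well above zero, so the toy data prefer the hypothesis that includes a signal.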
Radio Emissions as Data
Caltech Submillimeter Observatory, California
Temperature, Mass, and Distance
derived from Stellar Molecular Lines
Crosas, Menten, Physical Parameters of the IRC+10216 Circumstellar Envelope (ApJ, 1997)
Monte Carlo simulations of radiative transfer compare observations
with predictions from physical laws
raw or primary
data
order, transform,
and analyze them
Catalog
Classify
Visualize
Quantify
Summarize
Geo reference
Inference
Missing data
Forecast
Causal Inference
Coding
Annotations
Associations
Likelihoods
Compare with theory
Humanities
Social Sciences
Health and
Life Sciences
Physical Sciences
gain knowledge,
make decisions
Learn about
the whole
from a part.
Tell a story.
Make a
prediction.
Ultimately
explain.
raw or primary
data
order, transform,
and analyze them
gain knowledge,
make decisions
• Replication*: Independent scientific experiments to validate
findings
• Reproducibility*: Calculation of quantitative results by
others using original datasets and methods
(Stodden, Leisch, Peng, Implementing Reproducible Research, 2014)
* Replication and reproducibility definitions vary across disciplines
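Reproducibility in the sense above depends on others being able to confirm they hold the same dataset and rerun the same calculation. A minimal sketch, assuming a small in-memory dataset, a SHA-256 fingerprint, and a seeded bootstrap; the names `DATA`, `fingerprint`, and `bootstrap_mean_interval` are hypothetical.

```python
import hashlib
import random
import statistics

# Hypothetical measurements shared alongside a paper.
DATA = [2.1, 3.4, 1.9, 4.0, 2.8]

def fingerprint(rows):
    """SHA-256 digest of the dataset, so others can check they have the same data."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()

def bootstrap_mean_interval(values, seed, draws=1000):
    """95% bootstrap interval for the mean; the explicit seed makes
    the random resampling repeatable."""
    rng = random.Random(seed)
    means = sorted(statistics.fmean(rng.choices(values, k=len(values)))
                   for _ in range(draws))
    return means[int(0.025 * draws)], means[int(0.975 * draws)]

original = bootstrap_mean_interval(DATA, seed=2017)
rerun = bootstrap_mean_interval(DATA, seed=2017)
```

Publishing the fingerprint and the seed with the code lets a reader verify both the input and the output; without the seed, a rerun would give a slightly different interval.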
“Nullius in Verba”
(1665, the Royal Society motto and origin of Philosophical Transactions)
rigorous, scientific approach
verify, converge, expand
“Answering even a simple scientific question requires
lots of choices that can shape the results”
“When possible, make
data, methods, and code
open to verify”
“Science/research
might be imperfect, but
is self-correcting”
“It’s not unreliable, but
more challenging than
we give it credit for”
Findable
Accessible
Interoperable
Reusable
(data, code,
methods)
Harvard Data Resources
Computing & Storage Infrastructure
Services and Resources
Orchestra
(HMS)
Odyssey
(FAS)
RCE
(IQSS)
HBS RC
(HBS)
HMS Data
Management
group
Harvard Chan
Bioinformatics
Core
IQSS Data Science
Services
& Survey Research
Center for
Geographic
Analysis
DARTH Arts
and Humanities
Computing
Harvard Library
Services and
Archives
Research
Computing
Council
Harvard Data
Group
(OVPR)
Harvard Dataverse
(IQSS, Library,
HUIT)
VPAL Research
Group
HBS Research
Computing
Services
DataFest
Data Concepts
Project Planning and Data
Science Workflows
Processing and
Analyzing Data*
Dissemination
Plan how to
address your
research question
Address it!
Share your
results, data, code
with others
*Does not include handling sensitive data
Tuesday, January 17
Wednesday, January 18
Follow your sessions
using SCHED
Don’t miss!
If you leave DataFest thinking
“I need to learn more about X”,
we have already succeeded.
(however, we welcome your
feedback, using SCHED)
The Data Lifecycle (Harvard DataFest)

  • 1. The Data Lifecycle #harvarddatafest Mercè Crosas Chief Data Science and Technology Officer Institute for Quantitative Social Science Harvard University @mercecrosas
  • 2. Data Facts that can be analyzed or used in an effort to gain knowledge or make decisions; information. Latin Datum “Something Given” (The American Heritage Dictionary)
  • 3. Museum of Fine Arts The Art of the Ancient World Friday, January 13, 2017, 7pm
  • 5. Archeological Finds as Data The Giza Archive and catalogue: • 21,000 finds • 3,000 photos • 3,000 tomb and monument records • 3,000 diary pages • 7,000 maps • books, manuscripts • metadata
  • 6. 3D Visualization to Re-Explore Giza Peter Der Manuelian, Harvard’s Visualization Center (Geological Museum), DARTH, and Dassault Systèmes in Paris
  • 7. Books as Data Harvard Cultural Observatory (Aiden, Michel, SEAS) + Google: • 5 million digitized books (from 130 million total) • From year 1500 to 2008 • Optical Character Recognition (OCR) • Quantify unstructured text to n-gram count matrix
  • 8. Frequency of Words with Time
  • 9. Historical events as Data
      my ab POSITION Places Notes Terrain Notes ID TYPE NAME DATE STEP VALUE UNITS TEXT LAT1 LON1 LAT2 LON2 PATH CERTAINTY
      Needs static event map Needs static event map 4/6/1760 0
      4/7/1760 0
      18.3648, -76.8852 Frontier misplaced to NW 1 Conspiracy Frontier 4/7/1760 1 5 Rebel conspirators 18.3648 -76.8852 1
      18.3542, -76.8990 Trinity misplaced to NW 1 Rebels Trinity 4/7/1760 2 100 Rebel force 18.3542 -76.899 1
      4/8/1760 0
      Fixed 18.3542, -76.8990 to Ft. Haldane Still glitchy; start should be from Trinity 1 Rebels Ft. Haldane 4/8/1760 1 100 Rebel force 18.3542 -76.899 18.3776 -76.8905 1
      Fixed database fixed? 1 Clash Ft. Haldane 4/8/1760 2 100 Rebel force 18.3776 -76.8905 1
      18.3542, -76.8990 Trinity misplaced to NW 1 Rebels Trinity 4/8/1760 3 100 Rebel force 18.3776 -76.8905 18.3542 -76.899 1
      18.3223, -76.8875 Ballard's Valley misplaced to NW To extent possible, movement should follow roads (dotted line) 1 Rebels Ballard's Valley 4/8/1760 4 400 Rebel force 18.3542 -76.899 18.3223 -76.8875 [[18.3545, -76.8976],[18.3521, -76.8964],[18.3485, -76.8971],[18.3446, -76.8964],[18.3402, -76.8974],[18.3389, -76.8952],[18.3368, -76.8933],[18.3368, -76.8907],[18.3355, -76.8888],[18.3340, -76.8866],[18.3324, -76.8844],[18.3291, -76.8832],[18.3260, -76.8846],[18.3224, -76.8866]] 1
      18.3223, -76.8875 Moving clash; Tab is wrong To extent possible, movement should follow roads (dotted line) 1 Clash Ballard's Valley 4/8/1760 5 400 Rebel force More than 12 whites, 30 non-whites killed 18.3223 -76.8875 1
      18.3223, -76.8875 to Esher Rebels should move from Ballard's Valley to Esher To extent possible, movement should follow roads (dotted line) 1 Rebels Esher 4/8/1760 6 18.3223 -76.8875 18.2809 -76.8675 [[18.3223, -76.8875], [18.3198, -76.8859], [18.3164, -76.8866], [18.3144, -76.8863], [18.3125, -76.8888], [18.3105, -76.8925], [18.3060, -76.8933], [18.2998, -76.8955], [18.2972, -76.8991], [18.2937, -76.9000], [18.2908, -76.8971], [18.2890, -76.8928], [18.2858, -76.8906], [18.2807, -76.8906], [18.2750, -76.8882], [18.2709, -76.8851], [18.2704, -76.8822], [18.2703, -76.8773], [18.2722, -76.8736], [18.2783, -76.8703], [18.2809, -76.8675]] 1
      To extent possible, movement should follow roads (dotted line) 1 Clash Esher 4/8/1760 7 5 whites killed 18.2809 -76.8675 1
      18.3223, -76.8875 Eliminate movement; Militia appear at single point, Ballard's Valley To extent possible, movement should follow roads (dotted line) 2 Militia Ballard's Valley 4/8/1760 8 18.3223 -76.8875 0
      Esher to 18.3050, -76.8882 Starting pt is Esher To extent possible, movement should follow roads (dotted line) 1 Rebels Whitehall 4/8/1760 9 400 Rebel force 18.2809 -76.8675 18.305 -76.8882 [[18.2809, -76.8675], [18.2792, -76.8694], [18.2766, -76.8720], [18.2740, -76.8736], [18.2708, -76.8796], [18.2706, -76.8839], [18.2809, -76.8906], [18.2885, -76.8921], [18.2908, -76.8964], [18.2919, -76.8995], [18.2952, -76.8995], [18.2981, -76.8976], [18.3032, -76.8937],[18.305, -76.8882]] 1
      Geospatial coordinates, Time
  • 10. Jamaica Slave Revolt in Space and Time (1760 - 1761) Animated maps to narrate and explore history data by Vincent Brown (Harvard’s History Department)
  • 11. Polls as Data 2016 Election Data: • 100s scientific polls • Mixed models and sample frames • interviewer-administered telephone polls • interactive-voice-response polls • probability-based web polls • good quality non-probability based web polls • Inference and forecasting
  • 13. Video as Data Video in developmental psychology: • 143 hours of infant-perspective scenes, collected from 34 infants aged 1 month to 2 years • Coding and annotations Fausey, C.M., Smith, L.B. & Jayaraman, S. (2015). From faces to hands: Changing visual input in the first two years. Databrary
  • 14. Pollution measurements as Data Combine data from diverse sources: • 3000 air pollution monitors • 1800 power plants SO2 emissions • claims data from 60 M Medicare beneficiaries • air pollution trajectories based on weather models Zigler, Dominici, Wang, Kim, Choirat (Harvard Chan School of Public Health)
  • 15. Your Genome as Data Measurement of Single-Nucleotide Polymorphisms (SNPs) with Sequencer instrument
  • 16. Genome Wide Association Study Comparing genomes from individuals with (cases) and without disease (controls) Genetic variants more often found in individuals with constrictions in small blood vessels (GWAS, Wikipedia)
  • 17. Particle Detections as Data The Large Hadron Collider (LHC), CERN
  • 18. Yes, there is a Higgs Boson! (but is it the one we expected?) Likelihoods as result of comparing LHC observations with predictions from particle physics Standard Model. Measurements of Higgs boson production and couplings in diboson final states with the ATLAS detector at the LHC (ATLAS Collaboration, 2013, with 2923 authors). Acknowledgment: Kyle Cranmer (NYU)
  • 19. Radio Emissions as Data Caltech Submillimeter Observatory, California
  • 20. Temperature, Mass, and Distance derived from Stellar Molecular Lines Crosas, Menten, Physical Parameters of the IRC+10216 Circumstellar Envelope (ApJ, 1997) Monte Carlo simulations of radiative transfer compare observations with predictions from physical laws
  • 21. raw or primary data order, transform, and analyze them Catalog Classify Visualize Quantify Summarize Geo reference Inference Missing data Forecast Causal Inference Coding Annotations Associations Likelihoods Compare with theory Humanities Social Sciences Health and Life Sciences Physical Sciences gain knowledge, make decisions Learn about the whole from a part. Tell a story. Make a prediction. Ultimately explain.
  • 22. raw or primary data order, transform, and analyze them gain knowledge, make decisions • Replication*: Independent scientific experiments to validate findings • Reproducibility*: Calculation of quantitative results by others using original datasets and methods (Stodden, Leisch, Peng, Implementing Reproducible Research, 2014) * Replication and reproducibility definitions vary across disciplines “Nullius in Verba” (1665, the Royal Society motto and origin of Philosophical Transactions) rigorous, scientific approach verify, converge, expand
  • 23. “Answering even a simple scientific question requires lots of choices that can shape the results” “When possible, make data, methods, and code open to verify” “Science/research might be imperfect, but is self-correcting” “It’s not unreliable, but more challenging than we give it credit for”
  • 25. Harvard Data Resources Computing & Storage Infrastructure Services and Resources Orchestra (HMS) Odyssey (FAS) RCE (IQSS) HBS RC (HBS) HMS Data Management group Harvard Chan Bioinformatics Core IQSS Data Science Services & Survey Research Center for Geographic Analysis DARTH Arts and Humanities Computing Harvard Library Services and Archives Research Computing Council Harvard Data Group (OVPR) Harvard Dataverse (IQSS, Library, HUIT) VPAL Research Group HBS Research Computing Services
  • 26. DataFest Data Concepts Project Planning a Data Science Workflows Processing and Analyzing Data* Dissemination Plan how to address your research question Address it! Share your results, data, code with others *Does not include handling sensitive data Tuesday, January 17 Wednesday, January 18 Follow your sessions using SCHED
  • 28. If you leave DataFest thinking “I need to learn more about X”, we have already succeeded. (however, we welcome your feedback, using SCHED)
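
Slide 7 mentions quantifying unstructured text into an n-gram count matrix. The following is a minimal sketch of that idea, not the Harvard Cultural Observatory or Google pipeline: the function names, the whitespace tokenization, and the toy corpus are all assumptions made for illustration. Each row of the matrix is one n-gram, each column is one year, and each cell is how often that n-gram occurs in that year's text.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_count_matrix(docs_by_year, n=1):
    """Count n-grams per year.

    docs_by_year maps a year to that year's text (hypothetical input shape).
    Returns (vocabulary, years, matrix), where matrix[i][j] is the number of
    times vocabulary[i] occurs in the text for years[j].
    """
    years = sorted(docs_by_year)
    # Naive tokenization (an assumption): lowercase, split on whitespace.
    counts = {
        year: Counter(ngrams(docs_by_year[year].lower().split(), n))
        for year in years
    }
    vocabulary = sorted(set().union(*counts.values()))
    # A Counter returns 0 for n-grams that never appear in a given year.
    matrix = [[counts[year][gram] for year in years] for gram in vocabulary]
    return vocabulary, years, matrix


# Hypothetical toy corpus, standing in for OCR output grouped by year.
toy_corpus = {
    1900: "the war ended and the peace began",
    1950: "the peace held",
}
vocab, years, matrix = ngram_count_matrix(toy_corpus, n=1)
for gram, row in zip(vocab, matrix):
    print(" ".join(gram), row)
```

Dividing each column by that year's total n-gram count would turn the raw counts into the kind of relative frequency over time that slide 8 plots.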