Dr. Sven Nahnsen,
Quantitative Biology Center (QBiC)
Data Management for Quantitative
Biology
Lecture 2: Basics and Challenges of
Biological and Biomedical Data Management
Overview
•  Recap from last week/remaining slides from last week
•  Basics of data management
•  Data management plan
•  Challenges in relation to biomedical research
-  Tier-1 challenges
§  Data transfer
§  Storage concepts
§  Data sharing/data dissemination
§  User logging
•  Data privacy considerations
2
Basic concept of data management
[Slide figure] Biological data (e.g., NGS, proteomics, metabolomics) feeds into data management (store, analyze, integrate, …; see the data life cycle, DMBoK), which produces the outcome: added value, enabled collaboration, sustainability, reproducibility, ….
[Slide figure] A tier model (Tier 0 to Tier 3) of data management challenges, ordered by domain specificity: data management plan; data transfer; handling/storage; sharing/dissemination; annotation (metadata); data processing; data access logging; …
3
Data management plan (DMP)
•  There is no standard for a data management plan
•  However, funding agencies, journals and institutions require
researchers to have a data management plan
•  The DAMA-DMBOK provides a good orientation for any DMP
•  Not all aspects are relevant to a biological/biomedical research
project
•  Essential aspects concern (a minimal skeleton is sketched below):
-  Data acquisition, standards, file formats
-  Data sharing
-  Data preservation
4
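To make these aspects concrete, here is a minimal sketch of how such a DMP skeleton might be captured in a structured, machine-readable form. The section names and example values are illustrative assumptions, not a prescribed standard and not taken from dmptool.org:

```python
# Minimal, illustrative DMP skeleton (field names and values are assumptions).
dmp = {
    "study": {"title": "Example multi-omics study", "creator": "J. Doe", "affiliation": "QBiC"},
    "data_collection": {
        "data_types": ["NGS", "proteomics", "metabolomics"],
        "file_formats": ["FASTQ", "mzML", "CSV"],
        "metadata_standard": "ISA-Tab",
    },
    "storage_and_preservation": {"backup": "daily", "archive_retention_years": 10},
    "sharing": {"repository": "public archive after publication", "access": "controlled during study"},
    "roles": {"data_steward": "core facility", "pi_responsibility": "approval of releases"},
}

# A simple completeness check, e.g. before submitting the plan:
required = {"study", "data_collection", "storage_and_preservation", "sharing", "roles"}
assert required.issubset(dmp)
```

Keeping the plan in a structured form like this makes it easy to validate against whatever sections a funder or institution actually requires.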
Data management plan
5
Creating a DMP (1/4)
•  https://dmptool.org
-  Title of study
-  DMP creator
-  Affiliation
-  Time stamp
-  Copyright
6
Creating a DMP (2/4)
•  https://dmptool.org
-  Data collection, formats and
standards
-  Data storage and
preservation
-  Dissemination methods
7
Creating a DMP (3/4)
•  https://dmptool.org
-  Roles
-  Responsibilities
8
Creating a DMP (4/4)
•  https://dmptool.org
-  Policies for sharing
-  Public access
9
Data management plan – other sources
•  A guide to write a data management plan
-  http://data.research.cornell.edu
10
Data transfer
•  Why does data have to be moved?
-  Data-generating instruments are usually not located at the
data center
-  If data is shared globally -> big issue
-  If raw data has to be brought together with other raw data (e.g.
mass spec with NGS data)
•  Data transfer may be a security hole
•  There are many different data transfer technologies
-  FTP (File Transfer Protocol), based on TCP (Transmission Control
Protocol)
-  HTTP (Hypertext Transfer Protocol), based on TCP
-  rsync
-  FASP (Aspera)
-  We cannot cover all protocols
•  Best solution: avoid data transfer!
11
Data transfer
•  Classical protocols such as FTP, SCP and HTTP work well if the data is
in the MB or low-GB range
•  Such transfer protocols may not be suitable for large data sets
•  However, compression is an option (it reduces the size to be
transferred, but requires compute power on both the sender and recipient
side; see the sketch below)
-  Data compression explained (http://mattmahoney.net/dc/dce.html.),
Matt Mahoney, 2013
-  Bottlenecks
§  Time
§  Memory usage
A compression is a function C mapping a string x to a string C(x) with |C(x)| < |x|, i.e. the compressed string is shorter than the original.
Science Blogs: Good Math, Bad Math, Chu-Carroll 2009. http://goo.gl/Mf1G9L
12
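A small illustration of this trade-off (smaller transfer size versus extra compute time on both ends), sketched with Python's standard gzip module; the repetitive dummy payload is an assumption purely for demonstration:

```python
import gzip
import time

# Minimal sketch of the compression trade-off: a smaller payload to transfer,
# at the cost of CPU time (and memory) on both the sender and recipient side.
payload = b"ACGT" * 5_000_000              # ~20 MB of highly repetitive dummy "sequence" data

start = time.time()
compressed = gzip.compress(payload, compresslevel=6)
elapsed = time.time() - start

restored = gzip.decompress(compressed)
assert restored == payload                 # lossless: decompress(compress(x)) == x

print(f"original: {len(payload) / 1e6:.1f} MB, compressed: {len(compressed) / 1e6:.2f} MB")
print(f"compression time: {elapsed:.2f} s")
```

Real omics data compresses far less dramatically than this artificial payload, which is exactly why the time and memory bottlenecks listed above matter in practice.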
rsync
•  Initial release 1996
•  rsync is a widely used utility for keeping copies of a file on two
computer systems synchronized
•  Can perform integrity checks using checksums
•  Example: longitudinal parity byte (see the sketch below)
-  Break the data into words of n bits
-  Compute the Exclusive OR (XOR) of all words
-  Append the resulting word to the message
-  Check integrity by calculating the XOR over the received message
-  The result needs to be a word of n zeros
13
http://en.wikipedia.org/wiki/Checksum, accessed 22/04/2015, 5 PM
http://en.wikipedia.org/wiki/Exclusive_or, accessed 22/04/2015, 6 PM
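The longitudinal parity check listed above can be sketched in a few lines of Python; here the word size is fixed to 8 bits (one parity byte) for simplicity, and the example message is made up:

```python
from functools import reduce

# Longitudinal parity with 8-bit words, as described in the bullet list above.
def parity_byte(message: bytes) -> int:
    """XOR of all bytes in the message."""
    return reduce(lambda a, b: a ^ b, message, 0)

# Sender: append the parity byte to the (made-up) message.
message = b"GATTACA"
transmitted = message + bytes([parity_byte(message)])

# Receiver: the XOR over everything, including the appended byte,
# must be zero if no single-bit error occurred.
assert parity_byte(transmitted) == 0
```

Note that rsync itself uses stronger rolling and cryptographic checksums; the parity byte is only the simplest example of the integrity-check idea.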
Fast and secure protocol (FASP)
•  Developed by Aspera (bought by IBM)
•  Up to a hundred times faster than FTP and HTTP (both TCP-based)
•  Built-in security using open standards
•  Open architecture
14
ASPERA (FASP)
•  Capabilities:
-  High-Performance data transfer software,
-  Enterprise File Sync and Share (Ad-hoc),
-  Email/mobile/cloud integration
•  Founded:
-  2004 in Emeryville, California
-  Acquired by IBM in 2014
•  Markets Served:
-  Over 3000 customers in
-  Engineering, Media & Entertainment,
Government, Telecommunications, Life
Sciences, etc.
•  The Aspera Difference:
-  Patented FASP™ transport technology
Slide courtesy of Arnd Kohrs, Aspera, 2015.
15
Challenges with TCP and alternative technologies
•  Distance degrades conditions on all networks
-  Latency (or Round Trip Times) increase
-  Packet losses increase
-  Fast networks are just as prone to degradation
•  TCP performance degrades with distance
-  Increased latency and packet loss
•  TCP does not scale with bandwidth
-  TCP designed for low bandwidth
-  Adding more bandwidth does not improve throughput
•  Alternative Technologies
-  TCP-based - Network latency and packet loss must be low
-  Modified TCP – Improves TCP performance but insufficient for
fast networks
-  Data caching - Inappropriate for many large file transfer workflows
-  Data compression - Time consuming and impractical for certain file types
Slide courtesy of Arnd Kohrs, Aspera, 2015.
16
Latency: the time from the source sending a packet to the destination
receiving it
Packet loss: the fraction of data packets that fail to reach the destination
(the back-of-the-envelope sketch below illustrates how both limit TCP throughput)
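The following sketch is my own illustration, not taken from the slide: it uses the well-known Mathis et al. approximation for sustained TCP throughput, MSS / (RTT · √p), to show why latency and packet loss, rather than raw link bandwidth, cap TCP transfer rates over long distances. The RTT and loss values are assumed example figures:

```python
import math

# Mathis et al. approximation: sustained TCP throughput is roughly
# MSS / (RTT * sqrt(p)) -- independent of the link's raw bandwidth.
def tcp_throughput_mbit_s(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    return (mss_bytes * 8 / 1e6) / (rtt_s * math.sqrt(loss_rate))

# Same 1% packet loss and 1460-byte segments, different (assumed) distances:
print(tcp_throughput_mbit_s(1460, 0.010, 0.01))  # ~11.7 Mbit/s at 10 ms RTT (regional link)
print(tcp_throughput_mbit_s(1460, 0.150, 0.01))  # ~0.8 Mbit/s at 150 ms RTT (intercontinental link)
```

Upgrading either link to 10 GbE changes neither number, which is the point made in the "TCP does not scale with bandwidth" bullet above.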
fasp™ — High-performance Data Transport
•  Maximum transfer speed
-  Optimal end-to-end throughput efficiency
-  Scales with bandwidth independent of latency and resilient to packet loss
•  Congestion Avoidance and Policy Control
-  Automatic, full utilization of available bandwidth
-  Ready to use on existing network due to adaptive rate-control
•  Uncompromising security and reliability
-  SSL/SSH2: Secure, user/endpoint authentication
-  AES cryptography in transit and at-rest
•  Central management, monitoring and control
-  Real-time progress, performance and bandwidth utilization
-  Detailed transfer history, logging, and manifest
•  Low Overhead
-  Less than 0.1% overhead on 30% packet loss
-  High performance with large files or large sets of small files
•  Resulting in
-  Transfers up to orders of magnitude faster than FTP
-  Precise and predictable transfer times
-  Extreme scalability (size, bandwidth, distance, number of
endpoints, and concurrency)
17
Slide courtesy of Arnd Kohrs, Aspera, 2015.
Data Storage
•  Data archive vs. data backup
-  Economic factor
-  Use case example
•  RAID technology (e.g., RAID 1 and
RAID 5)
18
Universität Tübingen/Jörg Jäger
Data archive vs. data backup
Archive
•  Stores data that's no longer in
day-to-day use but must still be
retained
•  Speed of retrieval is not as
crucial
•  Archiving time requirements can
be many years/decades
•  Data that is archived should
contain native (standardized)
raw data
Backup
•  Provides rapid recovery of
operational data
•  Use cases: data corruption,
accidental deletion or disaster
recovery (DR) scenarios
•  Speed is crucial
•  Time requirements can be
several weeks/months
•  Data is mostly kept in proprietary
formats
Data backup vs archiving: What's the difference?, Antony Adshead, 2009.
www.computerweekly.com 19
Data backup vs. archive: Big difference in costs
•  Backup is essential
•  Example:
-  10 GbE clients
-  4 daily backups of 100 TB
-  A full weekly backup
saved for 4 weeks
-  End-of-month backups
saved for a year
-  End-of-year backup saved for
seven years
•  25 times the production data
•  Full recovery within 24 h needs
1.2 Gbit/sec
How Archive Can Fix Backup? Spectra Logic, 2011.
https://www.spectralogic.com/how-archive-can-fix-backup/ 20
Data backup vs. archive: Big difference in costs
•  Consider what data really needs to
go into backup (how long?)
•  From experience, at most 20% is
really “hot” data
•  80% of data can go into the archive
Still 25 times the production data,
which is now only 0.5 TB; a 0.24
Gbit/sec connection suffices to recover it
within 24 h (see the arithmetic sketch below)
How Archive Can Fix Backup? Spectra Logic, 2011.
https://www.spectralogic.com/how-archive-can-fix-backup/ 21
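As a rough arithmetic sketch of why the hot/cold split matters for recovery bandwidth: the required restore rate follows directly from the data volume and the recovery window. The data volumes below are assumed purely for illustration and are not taken from the Spectra Logic example:

```python
# Bandwidth needed to restore a given data volume within a recovery window.
def restore_bandwidth_gbit_s(data_tb: float, window_h: float) -> float:
    return data_tb * 1e12 * 8 / (window_h * 3600) / 1e9

# Assumed, illustrative figures: restoring 10 TB within 24 h ...
print(restore_bandwidth_gbit_s(10, 24))   # ~0.93 Gbit/s
# ... versus restoring only a 20% "hot" fraction (2 TB) within the same window.
print(restore_bandwidth_gbit_s(2, 24))    # ~0.19 Gbit/s
```

Moving the cold 80% into the archive therefore shrinks both the backup volume and the network capacity that has to be provisioned for disaster recovery.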
Technologies
Archive
•  Tape storage
-  Inexpensive; can host large
volumes; very robust
-  Slow (data is read in blocks);
hardware management
•  Optical media storage
-  Write (W) once, read many times (no
physical contact)
-  Low capacity and rather slow
•  Disk storage
-  Random access; falling prices; fast;
RAID compatible
Backup
•  Tape and optical media are not really an
option
•  Disk storage (magnetic drives)
-  Fast access is essential; RAID
compatibility is essential
-  Continuous power is needed and
the backup will fail during a power outage
•  Solid state drives
-  Very fast; falling prices
-  Capacity is still an issue
Data archiving techniques: Choosing the best archive, Pierre Dorion.
http://searchdatabackup.techtarget.com/tip/Data-archiving-techniques-Choosing-the-best-archive-media
On the horizon for archive and backup, alike: Cloud technologies
22
RAID (Redundant array of independent disks)
•  Data storage virtualization
•  Combines many disks into one logical volume for improved redundancy and
performance
•  Note: a RAID is not a backup, nor an archive
•  There are different RAID levels, indicating the level of redundancy
•  RAIDs use the parity concept to enable cost-efficient redundancy
-  A parity bit, or check bit, is a bit added to the end of a string of binary
code that indicates whether the number of bits in the string with the
value one is even or odd
-  Distinguish between even and
odd parity (see the sketch below)
http://en.wikipedia.org/wiki/RAID
http://en.wikipedia.org/wiki/Parity_bit 23
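A tiny sketch of even versus odd parity for a bit string; the example bit pattern is made up:

```python
# Even vs. odd parity for a bit string.
def parity_bit(bits: str, even: bool = True) -> str:
    ones = bits.count("1")
    if even:
        return "0" if ones % 2 == 0 else "1"   # total number of 1s (incl. parity) becomes even
    return "1" if ones % 2 == 0 else "0"       # total number of 1s (incl. parity) becomes odd

word = "1011001"                               # four 1s
print(word + parity_bit(word, even=True))      # 10110010 -> still four 1s (even)
print(word + parity_bit(word, even=False))     # 10110011 -> five 1s (odd)
```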
RAID (Redundant array of independent disks)
http://en.wikipedia.org/wiki/RAID
http://en.wikipedia.org/wiki/Parity_bit
•  There are many different RAID levels (differing in performance,
availability and cost). We discuss RAID 1 and RAID 5
•  RAID 1
-  Data mirroring (no parity)
-  Good read performance and reliability
-  But not cost-efficient
•  For the assignments you will also need to
use other RAID levels
24
RAID (Redundant array of independent disks)
http://en.wikipedia.org/wiki/RAID
http://en.wikipedia.org/wiki/Parity_bit
•  There are many different RAID levels (differing in performance,
availability and cost). We discuss RAID 1 and RAID 5
•  RAID 5
-  Block-level striping with distributed parity
-  Can tolerate the loss of one disk (see the reconstruction sketch below)
-  At least three disks are
required
25
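The following toy sketch (not a real RAID implementation, and the block contents are made up) shows the core idea behind parity-based redundancy: with striping and one parity block per stripe, any single lost block can be rebuilt by XORing the surviving blocks:

```python
from functools import reduce

# Toy sketch of parity-based reconstruction for one stripe.
def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data_blocks = [b"AAAA", b"BBBB", b"CCCC"]            # stripe across three data disks
parity_block = xor_blocks(data_blocks)               # stored on a fourth disk

# The disk holding b"BBBB" fails; rebuild it from the surviving blocks and the parity.
surviving = [data_blocks[0], data_blocks[2], parity_block]
assert xor_blocks(surviving) == b"BBBB"
```

In RAID 5 the parity blocks are rotated across all disks rather than kept on a dedicated parity disk, but the reconstruction arithmetic is exactly this XOR.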
Sharing data
•  Methods for sharing data:
-  Distinguish between pre- and post-publication sharing
-  BioMart (biomart.org)
-  Integrated management solutions (e.g., LabKey Server,
openBIS)
-  Many examples can be found in the Baker paper…
•  Decentralized authentication system: OpenID
Nature Methods 9, 39–41 (2012) doi:10.1038/nmeth.1815
Published online 28 December 2011 26
BioMart
•  Free software and data services for the scientific community
•  Researchers can set up their own data sources
•  Your own scientific data can be exposed to the wider research community
•  Your own data can be federated with data from others
27
Integrated data management tools: LabKey
•  Will be discussed thoroughly in lecture 11
•  Most of these tools are open source (partly with commercial
support, e.g., LabKey)
•  Complete workflow:
from the data source to
sharing
•  Sharing can be public
or with dedicated users
28
OpenID
•  Relatively recent concept (OpenID
Connect 1.0, release: 02/2014)
•  Non-profit OpenID Foundation
•  Authentication via co-operating sites
•  Users can log in without the need to
enter all information again
•  Close to a unified webID
•  Many data sharing platforms in biology
and biomedicine are adopting the
OpenID concept (e.g., ICGC); a minimal
sketch of the login flow follows below
•  Advantages vs. disadvantages will be
discussed in the problem sessions
http://openid.net
http://openidexplained.com
Companies involved in OpenID
29
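As a rough sketch of how such a login looks from a data portal's perspective, the snippet below outlines the first step of the OpenID Connect authorization-code flow. All URLs, client identifiers and scopes are hypothetical placeholders, not a real provider's API:

```python
import secrets
from urllib.parse import urlencode

# Hypothetical placeholders for an OpenID Connect login flow.
ISSUER = "https://openid.example.org"            # identity provider ("co-operating site")
CLIENT_ID = "data-sharing-portal"                # the relying party, e.g. a data portal
REDIRECT_URI = "https://portal.example.org/callback"

# Step 1: redirect the user's browser to the provider's authorization endpoint.
state = secrets.token_urlsafe(16)                # random value to protect against CSRF
auth_url = f"{ISSUER}/authorize?" + urlencode({
    "response_type": "code",
    "client_id": CLIENT_ID,
    "redirect_uri": REDIRECT_URI,
    "scope": "openid profile email",
    "state": state,
})
print(auth_url)

# Step 2 (not shown): after the user logs in at the provider, the portal receives
# a one-time code at REDIRECT_URI and exchanges it at the token endpoint for a
# signed ID token (a JWT) identifying the user -- without ever seeing the password.
```

The portal thus never manages passwords itself, which is exactly the "avoids overhead for user management" point made on the next slide.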
OpenID for scientific data
•  Allows logging of the usage of shared data
•  Important step towards guaranteeing intellectual property rights for
openly shared data
•  World-wide data usage can become possible
•  Avoids overhead for user management
http://openid.net
30
Data privacy considerations
31
Some facts
•  Surnames are paternally inherited in most western countries
•  Co-segregation with Y-Chromosome haplotypes
•  Breakers: Adoption, non-paternity, mutations..
•  Business model of genetic genealogy companies (e.g., find
biological father)
•  There are many (big) databases containing haplotype information
(e.g., HapMap project or www.smgf.org)
§  You need to enter a combination of Y-STR alleles (Y chromosome short tandem
repeats)
§  You receive matching records: surnames with geographical location and pedigrees
Definition haplotype:
A haplotype is a set of DNA variations, or polymorphisms, that
tend to be inherited together. A haplotype can refer to a
combination of alleles or to a set of single nucleotide
polymorphisms (SNPs) found on the same chromosome.
http://ghr.nlm.nih.gov/glossary=haplotype
Gymrek et al., 2013. Science. Identifying personal genomes by surname inference. 32
Can surnames be inferred from genome data?
•  Personal genome data is getting increasingly available
•  Open databases containing genealogy information
•  39 k unique surnames vs. 135 k records (R2=0.78)
•  Given a haplotype profile the correct surname can be discovered in 95%
•  If additional demographic data is available (internet searches), the
individual identity can almost be assigned completely
Gymrek et al., 2013. Science. Identifying personal genomes by surname inference
[Slide figures: records per surname (US data); cumulative distribution of US males matching age, state and surname vs. matching only state and age]
33
NGS for haplotype information
•  100 bp (base pair read length) PE (paired end sequencing), 13 x
average coverage
•  Haplotype information can be reconstructed with 99% accuracy
•  Using Craig Venter’s genome sequence, genealogy analysis
revealed the correct surname
•  Surname + date of birth + state reveals
the correct individual
Gymrek et al., 2013. Science. Identifying personal genomes by surname inference
34
Multiple matchings
•  If extensive genealogy information is available (e.g., genealogy testing is
common in a family), the search may lead to many candidates
•  The surname has been recovered
•  Publicly available internet information is
added (obituaries, search engine records)
•  Demographic information from genealogy
databases
•  Resulted in full identification of the
corresponding individuals
•  Note the implications that social media
profiles may have!
Gymrek et al., 2013. Science. Identifying personal genomes by surname inference
35
Summary – surname inference
•  There is potential for vulnerability
•  Accuracy depends on the Y-chromosome read coverage
(longer reads lead to higher coverage)
•  Gymrek et al. suggest:
-  Establishing global data sharing policies
-  Educating participants about potential risks and benefits of
genetic studies
-  Developing legislation on the proper usage of genetic data
Gymrek et al., 2013. Science. Identifying personal genomes by surname inference.
Rodriguez et al., 2013. Science. The Complexities of Genomic Identifiability 36
Contact:
Quantitative Biology Center (QBiC)
Auf der Morgenstelle 10
72076 Tübingen · Germany
dmqb-ss15@informatik.uni-tuebingen.de
Thanks for listening – See you next week