This lecture was presented on April 23, 2015 as the second lecture in the series "Data Management for Quantitative Biology" at the University of Tübingen in Germany.
1. Dr. Sven Nahnsen,
Quantitative Biology Center (QBiC)
Data Management for Quantitative
Biology
Lecture 2: Basics and Challenges of
Biological and Biomedical Data Management
2. Overview
• Recap and remaining slides from last week
• Basics of data management
• Data management plan
• Challenges in relation to biomedical research
- Tier-1 challenges
§ Data transfer
§ Storage concepts
§ Data sharing/data dissemination
§ User logging
• Data privacy considerations
3. Basic concept of data management
• Data → Data management → Outcome
- Data: biological data (e.g., NGS, proteomics, metabolomics)
- Data management: store, analyze, integrate, … (see data life cycle, DMBoK)
- Outcome: generate added value; enable collaboration, sustainability, reproducibility, …
• Tiers of data management tasks (domain specificity increases with the tier):
- Tier 0: data management plan
- Tier 1: data transfer; handling/storage; sharing/dissemination
- Tier 2: annotation (metadata); data processing; data access logging
- Tier 3: …
4. Data management plan (DMP)
• There is no standard for a data management plan
• However, funding agencies, journals and institutions require
researchers to have a data management plan
• The DAMA-DMBOK provides a good orientation for any DMP
• Not all aspects are relevant to a biological/biomedical research
project
• Essential aspects concern:
- Data acquisition, standards, file formats
- Data sharing
- Data preservation
9. Creating a DMP (4/4)
• https://dmptool.org
- Policies for sharing
- Public access
10. Data management plan – other sources
• A guide to writing a data management plan
- http://data.research.cornell.edu
11. Data transfer
• Why does data have to be moved?
- Data generation instruments are usually not located at the data center
- If data is shared globally, transfer becomes a big issue
- If raw data has to be brought together with other raw data (e.g.
mass spec with NGS data)
• Data transfer may be a security hole
• There are many different data transfer technologies
- FTP (File Transfer Protocol), based on TCP (Transmission Control
Protocol)
- HTTP (Hypertext Transfer Protocol), based on TCP
- R-Sync
- FASP (Aspera)
- We cannot cover all protocols here
• Best solution: avoid data transfer!
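Whatever protocol is used, the transferred copy should be verified against the original. A minimal sketch of such an end-to-end integrity check, using Python's standard `hashlib` (the payload here is a made-up stand-in for transferred data):

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex digest used to fingerprint a payload before and after transfer."""
    return hashlib.sha256(data).hexdigest()

payload = b"example sequencing data chunk"
digest_sent = sha256_hex(payload)        # computed on the sender side

received = payload                        # stands in for the transferred copy
digest_received = sha256_hex(received)    # recomputed on the recipient side

# Any corruption in transit would change the digest
assert digest_sent == digest_received
```

The same digest comparison works regardless of whether the bytes moved over FTP, HTTP, rsync or FASP.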
12. Data transfer
• Classical protocols such as FTP, SCP and HTTP work well if the data is in the MB or low GB range
• Large data sets may not be suited to such transfer protocols
• However, compression is an option (it reduces the size to be transferred, but requires compute power on both the sender and the recipient side)
- Data Compression Explained (http://mattmahoney.net/dc/dce.html), Matt Mahoney, 2013
- Bottlenecks
§ Time
§ Memory usage
• A compression is a function C mapping a string x to a string C(x); for lossless compression there is a decompression function D with D(C(x)) = x, and C is useful when |C(x)| is smaller than |x| for typical inputs
Science Blogs: Good Math, Bad Math, Chu-Carroll 2009. http://goo.gl/Mf1G9L
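The definition of a lossless compression can be made concrete with Python's standard `zlib` module: `C` is `compress`, `D` is `decompress`, and losslessness means `D(C(x)) == x`. By a simple counting argument no `C` can shorten every input, only typical ones:

```python
import zlib

def C(x: bytes) -> bytes:
    """Compression function mapping a string x to a string C(x)."""
    return zlib.compress(x)

def D(y: bytes) -> bytes:
    """Decompression function: for lossless compression, D(C(x)) == x."""
    return zlib.decompress(y)

x = b"ACGT" * 1000          # highly repetitive data compresses well
y = C(x)
assert D(y) == x            # the original is recovered exactly
assert len(y) < len(x)      # C(x) is shorter than x for this input
```

The two bottlenecks from the slide show up directly here: both `C` and `D` cost CPU time and working memory on each side of the transfer.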
13. rsync
• Initial release 1998
• rsync is a widely used utility to keep copies of a file on two computer systems the same
• Can invoke integrity checks with checksums
• Example: longitudinal parity byte
- Break the data into words of n bits
- Compute the exclusive OR (XOR) of all words
- Append the resulting word to the message
- Check integrity by calculating the XOR over the message plus parity word
- The result needs to be a word with n zeros
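The steps above can be sketched in Python for n = 8 (byte-sized words); this is an illustrative toy checksum, not rsync's actual rolling checksum:

```python
def longitudinal_parity(data: bytes) -> int:
    """XOR of all 8-bit words in the message (longitudinal parity byte)."""
    parity = 0
    for word in data:
        parity ^= word
    return parity

message = b"biomedical data"
sent = message + bytes([longitudinal_parity(message)])  # append parity word

# Integrity check: XOR over message plus parity must be the all-zero word
assert longitudinal_parity(sent) == 0
```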
http://en.wikipedia.org/wiki/Checksum, accessed 22/04/2015, 5 PM
http://en.wikipedia.org/wiki/Exclusive_or, accessed 22/04/2015, 6 PM
14. Fast and secure protocol (FASP)
• Developed by Aspera (bought by IBM)
• Up to a hundred times faster than FTP and HTTP (both TCP-based)
• Built-in security using open standards
• Open architecture
15. ASPERA (FASP)
• Capabilities:
- High-Performance data transfer software,
- Enterprise File Sync and Share (Ad-hoc),
- Email/mobile/cloud integration
• Founded:
- 2004 in Emeryville, California
- Acquired by IBM in 2014
• Markets served:
- Over 3,000 customers in engineering, media & entertainment, government, telecommunications, life sciences, etc.
• The Aspera Difference:
- Patented FASP™ transport technology
Slide courtesy of Arnd Kohrs, Aspera, 2015.
16. Challenges with TCP and alternative technologies
• Distance degrades conditions on all networks
- Latency (or Round Trip Times) increase
- Packet losses increase
- Fast networks are just as prone to degradation
• TCP performance degrades with distance
- Increased latency and packet loss
• TCP does not scale with bandwidth
- TCP designed for low bandwidth
- Adding more bandwidth does not improve throughput
• Alternative Technologies
- TCP-based - Network latency and packet loss must be low
- Modified TCP – Improves TCP performance but insufficient for
fast networks
- Data caching - Inappropriate for many large file transfer workflows
- Data compression - Time consuming and impractical for certain file types
Slide courtesy of Arnd Kohrs, Aspera, 2015.
Latency: the time from the source sending a packet to the destination
receiving it
Packet loss: occurs when data packets fail to reach the destination
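Why TCP throughput degrades with distance can be quantified with the commonly cited Mathis et al. rule of thumb, throughput ≲ MSS / (RTT · √p). The sketch below omits constant factors and is only an approximation, not a measurement of any specific network:

```python
from math import sqrt

def tcp_throughput_mbit_s(mss_bytes: int, rtt_s: float, loss: float) -> float:
    """Mathis approximation: achievable TCP throughput <= MSS / (RTT * sqrt(p))."""
    return mss_bytes * 8 / (rtt_s * sqrt(loss)) / 1e6

# Same 1% packet loss, 1460-byte segments: throughput collapses as RTT grows
local = tcp_throughput_mbit_s(1460, 0.010, 0.01)     # ~10 ms round trip
distant = tcp_throughput_mbit_s(1460, 0.150, 0.01)   # ~150 ms round trip
assert distant < local / 10   # fifteen-fold RTT, fifteen-fold less throughput
```

This also explains why "adding more bandwidth does not improve throughput": neither RTT nor loss appears smaller when the pipe gets wider.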
17. fasp™ — High-performance Data Transport
• Maximum transfer speed
- Optimal end-to-end throughput efficiency
- Scales with bandwidth independent of latency and resilient to packet loss
• Congestion Avoidance and Policy Control
- Automatic, full utilization of available bandwidth
- Ready to use on existing network due to adaptive rate-control
• Uncompromising security and reliability
- SSL/SSH2: Secure, user/endpoint authentication
- AES cryptography in transit and at rest
• Central management, monitoring and control
- Real-time progress, performance and bandwidth utilization
- Detailed transfer history, logging, and manifest
• Low Overhead
- Less than 0.1% overhead on 30% packet loss
- High performance with large files or large sets of small files
• Resulting in
- Transfers up to orders of magnitude faster than FTP
- Precise and predictable transfer times
- Extreme scalability (size, bandwidth, distance, number of
endpoints, and concurrency)
Slide courtesy of Arnd Kohrs, Aspera, 2015.
18. Data Storage
• Data archive vs. data backup
- Economic factor
- Use case example
• RAID technology (e.g., RAID 1 and
RAID 5)
Universität Tübingen/Jörg Jäger
19. Data archive vs. data backup
Archive
• Stores data that's no longer in
day-to-day use but must still be
retained
• Speed of retrieval is not as
crucial
• Archiving time requirements can
be many years/decades
• Data that is archived should
contain native (standardized)
raw data
Backup
• Provides rapid recovery of
operational data
• Use cases: data corruption,
accidental deletion or disaster
recovery (DR) scenarios
• Speed is crucial
• Time requirements can be
several weeks/months
• Data is mostly kept in proprietary
formats
Data backup vs. archiving: What's the difference?, Antony Adshead, 2009. www.computerweekly.com
20. Data backup vs. archive: Big difference in costs
• Backup is essential
• Example:
- 10 GbE clients
- 4 daily backups of 100 TB
- A full weekly backup
saved for 4 weeks
- End-of-month backup saved for a year
- End-of-year backup saved for seven years
• 25 times the production data
• Full recovery within 24 h needs
1.2 Gbit/sec
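The sizing arithmetic behind such figures can be sketched as follows (assuming decimal units, 1 TB = 8000 Gbit; the function name and the 100 TB example are illustrative, not from the Spectra Logic source):

```python
def recovery_bandwidth_gbit_s(terabytes: float, hours: float) -> float:
    """Sustained rate (Gbit/s) needed to move `terabytes` of data within `hours`."""
    return terabytes * 8000 / (hours * 3600)

# e.g., a full restore of 100 TB within 24 h needs a sustained ~9.3 Gbit/s
rate = recovery_bandwidth_gbit_s(100, 24)
assert 9.2 < rate < 9.3
```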
How Archive Can Fix Backup?, Spectra Logic, 2011. https://www.spectralogic.com/how-archive-can-fix-backup/
21. Data backup vs. archive: Big difference in costs
• Consider what data really needs to go into backup (and for how long)
• From experience, a maximum of 20% is really “hot data”
• 80% of data can go into the archive
Still 25 times the production data, which is now only 0.5 TB and needs only a 0.24 Gbit/s connection to recover within 24 h
22. Technologies
Archive
• Tape storage
- Inexpensive; can host large
volumes; very robust
- Slow (data is read in blocks); requires hardware management
• Optical media storage
- Write once, read many (WORM; no physical contact)
- Low capacity and rather slow
• Disk storage
- Random access; falling prices; fast;
RAID compatible
Backup
• Tape and optical media are not really an
option
• Disk storage (magnetic drives)
- Fast access is essential; RAID
compatibility is essential
- Continuous power is needed; the backup will fail during a power outage
• Solid state drives
- Very fast; falling prices
- Capacity is still an issue
Data archiving techniques: Choosing the best archive, Pierre Dorion.
http://searchdatabackup.techtarget.com/tip/Data-archiving-techniques-Choosing-the-best-archive-media
On the horizon for archive and backup, alike: Cloud technologies
23. RAID (Redundant array of independent disks)
• Data storage virtualization
• Many disks combined into one logical volume for improved redundancy and performance
• Note: a RAID is not a backup, nor an archive
• There are different RAID levels, indicating the level of redundancy
• RAIDs use the parity concept to enable cost-efficient redundancy
- A parity bit, or check bit is a bit added to the end of a string of binary
code that indicates whether the number of bits in the string with the
value one is even or odd.
- Distinguish between even and
odd parity
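A minimal sketch of the parity-bit concept in Python, using even parity (the appended bit makes the total number of 1s even):

```python
def even_parity_bit(bits: str) -> str:
    """Append '1' if the count of 1s is odd, else '0' (even parity)."""
    return "1" if bits.count("1") % 2 else "0"

word = "1011011"                      # five 1s -> parity bit is "1"
sent = word + even_parity_bit(word)

# Receiver check: with even parity, the total number of 1s must be even
assert sent.count("1") % 2 == 0
```

Odd parity is the mirror image: the appended bit makes the total count of 1s odd.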
http://en.wikipedia.org/wiki/RAID
http://en.wikipedia.org/wiki/Parity_bit
24. RAID (Redundant array of independent disks)
• There are many different RAID levels (differing in performance, availability and costs). We discuss RAID 1 and RAID 5
• RAID 1
- Data mirroring (no parity)
- Good read performance and reliability
- But not cost efficient
• For the assignments you will also need to use other RAID levels
25. RAID (Redundant array of independent disks)
• There are many different RAID levels (differing in performance, availability and costs). We discuss RAID 1 and RAID 5
• RAID 5
- Block-level striping with distributed parity
- Can tolerate the loss of one disk
- At least three disks are
required
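The distributed-parity idea can be illustrated with a single stripe in Python (a real RAID 5 additionally rotates the parity block across disks; the block contents here are made up):

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks, as used for RAID 5 parity."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe across three data disks plus one parity block
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d0, d1, d2])

# Disk 1 fails: its block is rebuilt by XOR-ing the survivors with parity
recovered = xor_blocks([d0, d2, parity])
assert recovered == d1
```

Because XOR is its own inverse, any single missing block can be reconstructed this way, which is exactly why RAID 5 tolerates the loss of one disk.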
26. Sharing data
• Methods for sharing data:
- Distinguish between post- and pre-publication sharing
- BioMart (biomart.org)
- Integrated management solutions (e.g., LabKey Server, openBIS)
- Many examples can be found in the Baker paper…
• Decentralized authentication system: OpenID
Nature Methods 9, 39–41 (2012), doi:10.1038/nmeth.1815; published online 28 December 2011
27. BioMart
• Free software and data services for the scientific community
• Researchers can set up their own data source
• Their own scientific data can be exposed to the world of researchers
• Their own data can be federated with data from others
28. Integrated Data management tools LabKey
• Will be discussed thoroughly in lecture 11
• Most of these tools are open source (partly with commercial
support, e.g., LabKey)
• Complete workflow: from data source to sharing
• Sharing can be public
or with dedicated users
29. OpenID
• Relatively recent concept (OpenID Connect 1.0, released 02/2014)
• Non-profit OpenID Foundation
• Authentication via co-operating sites
• Users can log in without the need to
enter all information again
• Close to a unified webID
• Many data sharing platforms in biology and biomedicine are adopting the OpenID concept (e.g., ICGC)
• Advantages vs. disadvantages will be
discussed in the problem sessions
http://openid.net
http://openidexplained.com
(Figure: companies involved in OpenID)
30. OpenID for scientific data
• Allows logging of the usage of shared data
• Important step towards guaranteeing intellectual property rights for
openly shared data
• World-wide data usage can become possible
• Avoids overhead for user management
http://openid.net
32. Some facts
• Surnames are paternally inherited in most western countries
• Co-segregation with Y-Chromosome haplotypes
• Breakers: adoption, non-paternity, mutations, …
• Business model of genetic genealogy companies (e.g., find
biological father)
• There are many (big) databases containing haplotype information
(e.g., HapMap project or www.smgf.org)
§ You need to enter a combination of Y-STR alleles (Y chromosome short tandem
repeats)
§ You receive matching records: surnames with geographical location and pedigrees
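The lookup described above can be pictured as a simple key-value query; the database, allele values and surnames below are entirely made up for illustration:

```python
# Hypothetical miniature genealogy database: Y-STR allele profile -> records
ystr_db = {
    (13, 24, 14, 11): ["Miller, Ohio (pedigree #1)"],
    (14, 23, 15, 10): ["Schmidt, Texas (pedigree #2)", "Smith, Utah (pedigree #3)"],
}

def matching_records(profile):
    """Return surname records whose Y-STR profile matches exactly."""
    return ystr_db.get(tuple(profile), [])

assert matching_records([13, 24, 14, 11]) == ["Miller, Ohio (pedigree #1)"]
```

Real services match on many more markers and tolerate partial matches, but the privacy implication is the same: a genotype keys directly into surnames, locations and pedigrees.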
Definition haplotype:
A haplotype is a set of DNA variations, or polymorphisms, that
tend to be inherited together. A haplotype can refer to a
combination of alleles or to a set of single nucleotide
polymorphisms (SNPs) found on the same chromosome.
http://ghr.nlm.nih.gov/glossary=haplotype
Gymrek et al., 2013, Science: Identifying personal genomes by surname inference.
33. Can surnames be inferred from genome data?
• Personal genome data is becoming increasingly available
• Open databases contain genealogy information
• 39 k unique surnames vs. 135 k records (R² = 0.78)
• Given a haplotype profile, the correct surname can be discovered in 95% of cases
• If additional demographic data is available (internet searches), the individual's identity can almost be fully assigned
(Figures: records per surname in US data; cumulative distribution of US males matching on age, state and surname vs. on state and age only.)
34. NGS for haplotype information
• 100 bp (base-pair) read length, paired-end (PE) sequencing, 13× average coverage
• Haplotype information can be reconstructed with 99% accuracy
• Using Craig Venter's genome sequence, genealogy analysis revealed the correct surname
• Surname + date of birth + state reveals the correct individual
35. Multiple matchings
• If extensive genealogy information is available (e.g., genealogy research is common in the family), the search may lead to many candidates
• The surname has been recovered
• Publicly available internet information is added (obituaries, search engine records)
• Demographic information from genealogy databases
• This resulted in full identification of the corresponding individuals
• Note the implications that social media profiles may have!
36. Summary – surname inference
• There is potential for vulnerability
• Accuracy depends on the Y-chromosome read coverage (longer reads will lead to higher coverage)
• Gymrek et al. suggest:
- Establishing global data sharing policies
- Educating participants about the potential risks and benefits of genetic studies
- Developing legislation on the proper usage of genetic data
Gymrek et al., 2013, Science: Identifying personal genomes by surname inference.
Rodriguez et al., 2013, Science: The complexities of genomic identifiability.
37. Contact:
Quantitative Biology Center (QBiC)
Auf der Morgenstelle 10
72076 Tübingen · Germany
dmqb-ss15@informatik.uni-tuebingen.de
Thanks for listening – See you next week