Managing data

Joachim Jacob
8 and 15 November 2013
Bioinformatics data
Historically, bioinformatics has always used text files
to store data.
PDB file excerpt

Genbank recor...
NGS data
The NGS machines spit a lot of data, stored in plain
text files. These files are multiple gigabytes in size.
Tips for managing NGS data
1. When you move the data, do it in its smallest form.
→ Compress the data.
2. When you unpack ...
Compression: tools in Linux

And some more exist...

http://www.linuxlinks.com/article/20110220091109939/CompressionTools....
Tips
Widely used compression tools:
● GNU zip (gzip)
● Block Sorting compression (bzip2)
Typically, compression tools work...
Tar without compression
Tar (Tape Archive) is a tool for bundling a set of files
or directories into a single archive. The...
Compression: a typical case
Archiving and compression mostly occur together.
The most used formats are tar.gz or tar.bz. T...
Compression: on your desktop
Compression: on your desktop
Compression: on the command line
Tar is the tool for creating .tar archives, but it can
compress in one go, with the z or ...
De-/compression
To compress one or more files:
$ gzip [options] file
$ bzip2 [options] file
To decompress one or more file...
Tips
Many compression tools on the command line allow
to read compressed files (instead of first unpacking
then reading).
...
Exercise
→ a little compression exercise.
Symlinks
Pay attention. Something very convenient!
A symbolic link (or symlink) is a file which points to the
location of ...
Symlinks
To create a symlink, move to the folder in where the symlink
must be created, and execute ln.
~/Projects $ cd But...
Symlinks
The symlink is created. You can check with ls.
To delete a symlink, use unlink.
~/Projects $ cd Butterfly
~/Butte...
Exercise
→ a little symlink exercise
Disks and storage
If you dive into bioinformatics, you will have to
manage disks and storage.
Two types of disks
- solid s...
A disk is a device
Via the terminal, show the disks using
$ sudo fdisk -l
[sudo] password for joachim:
Disk /dev/sda: 13.4...
A disk is divided into partitions
A disk can be divided in parts, called partitions.
An internal disk which runs an operat...
Check out the disk utility tool
The system disk

Name of the disk
The system disk

Name currently highlighted partition
The system disk

Place in the directory structure
where the partition can be accessed
An example of an USB disk
-

Place in the directory structure
where the partition can be accessed
An example of an USB disk
The USB disk is 'mounted' automatically on the
directory tree under /media.
An example of an USB disk
-

This is the type of file system
on the partition.
The partition is said to be formatted
in FA...
File system formats
By default, many USB flash disks are formatted in
FAT32.
Other types are NTFS, ext4, ZFS.
FAT32 – max ...
An example of an USB disk
First unmount the device.
Next, choose format the device.

1

2
Format disks with disk utility
Choose the type of file
system you want to be on
that device.
Format disks with disk utility
Format disks with disk utility
You don't want to know all the commands that work
behind the gnome-disk-utility for you.
Bu...
Checking storage space
By default 'disk usage analyzer'.
Checking storage space
Bonus: K4DirStat. Not installed by default.
Checking storage space
Bonus: K4DirStat. Not installed by default.
K4Dirstat is a KDE package
Rehearsal: what is KDE?
Bonus: what happens when you install this
package on our system?
Space left on disks with df
To check the storage that is used on the
different disks.
~/ $ df -h

Filesystem
/dev/sda1
ude...
The size of directories
To check the size of files or directories.
~/ $ du -sh *
520K bin
281M Compression_exercise
4.0K D...
Wildcards on the command line
Wildcards are used to describe the names of files/dirs.
[]
On that position, the character m...
Wildcards on the command line
Many tools that require an argument to point to
files or directories accept these wildcards....
Wildcards on the command line
Many tools that require an argument to point to
files or directories accept these wildcards....
Wildcards on the command line
Many tools that require an argument to point to
files or directories accept these wildcards....
Wildcards on the command line
Many tools that require an argument to point to
files or directories accept these wildcards....
Keywords
Compression
Archive
Symbolic link
mounting
File system format
partition
Recursively
df
du
unlink
Write in your ow...
Break
Upcoming SlideShare
Loading in...5
×

Managing your data - Introduction to Linux for bioinformatics

576

Published on

This 4th presentation in the 'Introduction to linux for bioinformatics' shows tips and trick for the beginner to start managing data more effectively.

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
576
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
18
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Transcript of "Managing your data - Introduction to Linux for bioinformatics"

  1. 1. Managing data Joachim Jacob 8 and 15 November 2013
  2. 2. Bioinformatics data Historically, bioinformatics has always used text files to store data. PDB file excerpt Genbank record HMM profile
  3. 3. NGS data The NGS machines spit a lot of data, stored in plain text files. These files are multiple gigabytes in size.
  4. 4. Tips for managing NGS data 1. When you move the data, do it in its smallest form. → Compress the data. 2. When you unpack the data, leave it where it is. → Symbolic links point to the data in different folders. 3. Provide enough storage for your data. → choose your file system type wisely
  5. 5. Compression: tools in Linux And some more exist... http://www.linuxlinks.com/article/20110220091109939/CompressionTools.html
  6. 6. Tips Widely used compression tools: ● GNU zip (gzip) ● Block Sorting compression (bzip2) Typically, compression tools work on one file. How to compress directories and their contents?
  7. 7. Tar without compression Tar (Tape Archive) is a tool for bundling a set of files or directories into a single archive. The resulting file is called a tar ball. Syntax to create a tarball: $ tar -cf archive.tar file1 file2 Syntax to extract: $ tar -xvf /path/to/archive.tar
  8. 8. Compression: a typical case Archiving and compression mostly occur together. The most used formats are tar.gz or tar.bz. These files are the result of two processes. Archiving (tar) Compressing (gzip or bzip2)
  9. 9. Compression: on your desktop
  10. 10. Compression: on your desktop
  11. 11. Compression: on the command line Tar is the tool for creating .tar archives, but it can compress in one go, with the z or j option. Creating a compressed tar archive: $ tar cvfz mytararchive.tar.gz $ tar cvfj mytararchive.tar.bz create Compression technique Decompressing a compressed tar archive $ tar xvfz mytararchive.tar.gz $ tar xvfj mytararchive.tar.bz extract files verbose docs/ docs/
  12. 12. De-/compression To compress one or more files: $ gzip [options] file $ bzip2 [options] file To decompress one or more files: $ gunzip [options] file(s) $ bunzip2 [options] file(s)
  13. 13. Tips Many compression tools on the command line allow to read compressed files (instead of first unpacking then reading). $ zcat file(s) $ bzcat file(s) Compression is always a balance between time and compression ratio. Gzip is faster, bzip2 compresses harder. If compression is important to you: benchmark!
  14. 14. Exercise → a little compression exercise.
  15. 15. Symlinks Pay attention. Something very convenient! A symbolic link (or symlink) is a file which points to the location of the linked-to file. You can do anything with the symlink that you can do on the original file. As you move the original file from its location, the symlink is 'dead'. Downloads/ ~ Annotation/ Rice/ Projects/ Butterfly/ Sequences/ alignment.sam
  16. 16. Symlinks To create a symlink, move to the folder in where the symlink must be created, and execute ln. ~/Projects $ cd Butterfly ~/Butterfly $ ln -s ../Rice/Sequences/alignment.sam Link_to_alignment.sam Downloads/ ~ Annotation/ Rice/ Projects/ Butterfly/ Sequences/ alignment.sam
  17. 17. Symlinks The symlink is created. You can check with ls. To delete a symlink, use unlink. ~/Projects $ cd Butterfly ~/Butterfly $ ln -s ../Rice/Sequences/alignment.sam Link_to_alignment.sam ~/Butterfly $ ls -lh Link_to_alignment.sam lrwxrwxrwx 1 joachim joachim 44 Oct 22 14:47 Link_to_alignment.sam -> ../Sequences/alignment.sam Downloads/ ~ Annotation/ Rice/ Projects/ Sequences/ alignment.sam Butterfly/Link_to_alignment.sam
  18. 18. Exercise → a little symlink exercise
  19. 19. Disks and storage If you dive into bioinformatics, you will have to manage disks and storage. Two types of disks - solid state disks Low capacity, high speed, random writes - spinning hard disks High capacity, 'normal' speed, sequential writes. http://en.wikipedia.org/wiki/Solid-state_drive http://en.wikipedia.org/wiki/Hard_disk
  20. 20. A disk is a device Via the terminal, show the disks using $ sudo fdisk -l [sudo] password for joachim: Disk /dev/sda: 13.4 GB, 13408141312 bytes ... Disk /dev/sdb: 3997 MB, 3997171712 bytes ...
  21. 21. A disk is divided into partitions A disk can be divided in parts, called partitions. An internal disk which runs an operating system is usually divided in partitions, one for each functions. An external disk is usually not divided in partitions.
  22. 22. Check out the disk utility tool
  23. 23. The system disk Name of the disk
  24. 24. The system disk Name currently highlighted partition
  25. 25. The system disk Place in the directory structure where the partition can be accessed
  26. 26. An example of an USB disk - Place in the directory structure where the partition can be accessed
  27. 27. An example of an USB disk The USB disk is 'mounted' automatically on the directory tree under /media.
  28. 28. An example of an USB disk - This is the type of file system on the partition. The partition is said to be formatted in FAT32 (in this case).
  29. 29. File system formats By default, many USB flash disks are formatted in FAT32. Other types are NTFS, ext4, ZFS. FAT32 – max 4GB files NTFS – maximum portability (also for use under windows) Ext4 – default file system in Linux, http://en.wikipedia.org/wiki/File_system#File_systems_and_operating_systems
  30. 30. An example of an USB disk First unmount the device. Next, choose format the device. 1 2
  31. 31. Format disks with disk utility Choose the type of file system you want to be on that device.
  32. 32. Format disks with disk utility
  33. 33. Format disks with disk utility You don't want to know all the commands that work behind the gnome-disk-utility for you. But if you do: - mount - umount - fdisk - mkfs You can read the man pages and search for guides on the internet if you want to get to know these (out of scope for this course).
  34. 34. Checking storage space By default 'disk usage analyzer'.
  35. 35. Checking storage space Bonus: K4DirStat. Not installed by default.
  36. 36. Checking storage space Bonus: K4DirStat. Not installed by default.
  37. 37. K4Dirstat is a KDE package Rehearsal: what is KDE? Bonus: what happens when you install this package on our system?
  38. 38. Space left on disks with df To check the storage that is used on the different disks. ~/ $ df -h Filesystem /dev/sda1 udev tmpfs none none /dev/sdb1 ~/ $ df -h . Size 12G 490M 200M 5.0M 498M 3.8G Used Avail Use% Mounted on 5.3G 5.7G 49% / 4.0K 490M 1% /dev 920K 199M 1% /run 0 5.0M 0% /run/lock 76K 498M 1% /run/shm 20M 3.7G 1% /media/test
  39. 39. The size of directories To check the size of files or directories. ~/ $ du -sh * 520K bin 281M Compression_exercise 4.0K Desktop 4.0K Documents 5.0M Downloads 4.0K Music 4.0K Pictures 4.0K Public 373M Rice Example 4.0K Templates 4.0K test 17M test.img 114M ugene-1.12.2 4.0K Videos
  40. 40. Wildcards on the command line Wildcards are used to describe the names of files/dirs. [] On that position, the character may be one of the characters between [ ], e.g. saniti[sz]ation matches: sanitisation and sanitization ? On that position, any character is allowed. e.g. saniti?ation matches: sanitisation, sanitiration, ... * On that position, any length of string is allowed e.g. s* matches: san, sdd, sanitisation, sam.alignment,...
  41. 41. Wildcards on the command line Many tools that require an argument to point to files or directories accept these wildcards. ~/ $ du -sh Do*
  42. 42. Wildcards on the command line Many tools that require an argument to point to files or directories accept these wildcards. ~/ $ du -sh Do* 4.0K Documents 20G Downloads
  43. 43. Wildcards on the command line Many tools that require an argument to point to files or directories accept these wildcards. ~/ $ ls *.fastq
  44. 44. Wildcards on the command line Many tools that require an argument to point to files or directories accept these wildcards. ~/ $ ls *.fastq ERR148552_1.fastq testout.fastq ERR148552_1_prinseq_good_zzwI.fastq ERR148552_2.fastq test.fastq
  45. 45. Keywords Compression Archive Symbolic link mounting File system format partition Recursively df du unlink Write in your own words what the terms mean
  46. 46. Break
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×