A very simple presentation on the current Big Data challenge in bioinformatics. A case study using one of the computational methods for drug discovery is presented. The cost of developing a new drug is increasing dramatically every year, along with the challenges associated with it. The big data approach is penetrating drug discovery slowly but steadily. We believe effective use of big data would be highly beneficial for making several crucial decisions throughout the drug discovery process. Data management using Hadoop and analysis using the R programming package are also discussed.
3. What is Big Data?
Big Data is data that is too large, complex and dynamic for any conventional data tools to capture, store, manage and analyze.
The right use of Big Data allows analysis to spot trends and gives niche insights that help create value and innovation much faster than conventional methods.
However, there is more to the big data deluge than mere volume; in particular, increasing data heterogeneity and complexity make it difficult to extract knowledge from such data.
If the use of big data for drug discovery is indeed to open new frontiers, and not be mere hype, new visions and concepts are required to reduce data complexity and increase data consistency across different sources.
4. What is the Challenge?
The three “V’s” of incoming data (Volume, Variety and
Velocity) are what create the
challenge.
http://hlwiki.slais.ubc.ca/images/1/1a/Big_data_2013.jpg
1 PB = 1000 TB
big challenges in data storage,
processing and analysis.
Coordinated efforts from both
experimental biologists and
bioinformaticists are required
to overcome these challenges.
10. One Target, One Compound
Disease
Enzyme, Drug Target
Potential Drug
Candidate
1 Target, 1 Compound, 1 Disease = 1 Molecular Docking Run
12. One Compound to Many Targets
10,000 Protein
Targets
Disease-1
Disease-2
Disease-N
Potential Drug
Candidate
10,000 Targets, 1 Compound, 10,000 Diseases = Total 10,000 Molecular
Docking Runs
13. One Compound to Many Targets and Their Conformations
10,000 Protein
Targets
Disease-1
Disease-2
Disease-N
Potential Drug
Candidate
10,000 × 2 Target Conformations, 1 Compound, 10,000 Diseases = Total 20,000 Molecular Docking Runs
Conf-1, Conf-2
14. Many Compounds to Many Targets and Their Conformations
10,000 Protein
Targets
Disease-1
Disease-2
Disease-N
60,826,590 Potential Compounds
10,000 × 2 Target Conformations, 60,826,590 Compounds, 10,000 Diseases = Total 1,216,531,800,000 Molecular Docking Runs
Conf-1, Conf-2
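The run counts on the preceding slides all follow from one product: targets × conformations per target × compounds. A minimal Python sketch of that arithmetic, using only the figures quoted on the slides; the function name `docking_runs` is illustrative and does not come from any docking package, and the count assumes an exhaustive screen in which every compound is docked against every conformation of every target.

```python
def docking_runs(targets: int, conformations_per_target: int, compounds: int) -> int:
    """Number of docking runs when every compound is docked against
    every conformation of every target (an exhaustive screen)."""
    return targets * conformations_per_target * compounds


# Scenarios taken from the slides.
scenarios = {
    "One target, one compound": docking_runs(1, 1, 1),
    "One compound, many targets": docking_runs(10_000, 1, 1),
    "One compound, many targets, two conformations": docking_runs(10_000, 2, 1),
    "Many compounds, many targets, two conformations": docking_runs(10_000, 2, 60_826_590),
}

for name, runs in scenarios.items():
    print(f"{name}: {runs:,} docking runs")
```

Because the count is a plain product, adding one more conformation per target or doubling the compound library scales the workload linearly in that factor.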
15. Calculation
Suppose one docking run takes 1 minute on a single processor:
1,216,531,800,000 / 60 = 20,275,530,000 hours
1,216,531,800,000 / (60 × 24) = 844,813,750 days
1,216,531,800,000 / (60 × 24 × 30) = 28,160,458 months
1,216,531,800,000 / (60 × 24 × 30 × 12) = 2,346,704 years
1,216,531,800,000 / (60 × 24 × 30 × 12 × 60) = 39,111 lifetimes (60 years each)
About 84.5 crore (roughly 845 million) processors would be needed to complete all the docking runs within one day
An Excel sheet can accommodate 1,048,576 rows by 16,384 columns
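The conversions on this slide can be checked with a few lines of Python. This sketch keeps the slide's assumptions (1 minute per run on one processor, 30-day months, 12-month years, 60-year lifetimes) and adds one of its own: perfect parallel scaling, with no scheduling or I/O overhead, when estimating how many processors a deadline requires. The helper name `processors_needed` is illustrative.

```python
TOTAL_RUNS = 1_216_531_800_000   # total docking runs from the previous slide
MINUTES_PER_RUN = 1              # the slide's assumption

total_minutes = TOTAL_RUNS * MINUTES_PER_RUN
hours = total_minutes / 60
days = hours / 24
months = days / 30               # 30-day months, as on the slide
years = months / 12
lifetimes = years / 60           # 60-year lifetimes, as on the slide


def processors_needed(total_minutes: int, deadline_minutes: int) -> int:
    """Processors required to finish within the deadline, assuming
    perfect parallel scaling (an idealisation)."""
    # Integer ceiling division avoids floating-point rounding.
    return (total_minutes + deadline_minutes - 1) // deadline_minutes


one_day = processors_needed(total_minutes, 24 * 60)
print(f"Single processor: {int(hours):,} hours, {int(days):,} days, {int(years):,} years")
print(f"Processors needed to finish within one day: {one_day:,}")
```

Under these assumptions the one-day figure is 844,813,750 processors; real clusters would need more, since scaling is never perfect.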
16. What if the same calculations are carried out by two different methods?
18. Supporting Tools/Languages
R is a free software environment for
statistical computing and graphics.
https://www.r-project.org/
Hadoop is an open-source framework that
allows big data to be stored and processed in a
distributed environment across clusters of
computers using simple programming models.
https://hadoop.apache.org/
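To make "simple programming models" concrete: Hadoop's best-known model is MapReduce, in which a map step emits key-value pairs and a reduce step combines all values that share a key. The sketch below imitates that pattern in plain Python on a single machine. It does not call Hadoop, and the record layout (target, compound, docking score) and the sample values are invented purely for illustration. It also assumes that a lower score means better binding, which is a common docking convention but not something stated in the slides.

```python
from collections import defaultdict

# Hypothetical docking output records: (target_id, compound_id, score).
# These values are made up for the example.
records = [
    ("T1", "C1", -7.2),
    ("T1", "C2", -9.1),
    ("T2", "C1", -6.5),
    ("T2", "C3", -8.0),
]


def map_step(record):
    """Emit a (key, value) pair with the target as the key."""
    target, compound, score = record
    return target, (compound, score)


def reduce_step(values):
    """Keep the best (lowest) scoring compound for one target."""
    return min(values, key=lambda pair: pair[1])


# Shuffle: group mapped values by key, which the framework would do
# across machines in a real cluster.
grouped = defaultdict(list)
for key, value in map(map_step, records):
    grouped[key].append(value)

best_per_target = {target: reduce_step(vals) for target, vals in grouped.items()}
print(best_per_target)
```

The appeal of the model for a screen of this size is that the map and reduce functions stay this small while the framework handles distribution of the records across the cluster.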