
ICIC 2017: Cheminformatics Scaling-up for the Age of Big Data



Árpád Figyelmesi (ChemAxon, Hungary)
In recent years, cheminformatics, like other fields, has entered the era of big data. We are seeing a rapid transition from traditional methods toward handling very large data sets, often in unstructured or semi-structured form. Words and phrases such as terabytes, scalability, NoSQL, and cloud solutions have become part of our everyday language. I would like to present a few case studies that highlight key features of this transition and show techniques and technologies for handling different aspects of this new area.

Published in: Internet


  1. CHEMINFORMATICS SCALING-UP FOR THE AGE OF BIG DATA
     Árpád Figyelmesi, 29th ICIC, Heidelberg, 2017
  2. Big data
     Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. (Wikipedia)
     Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. (Gartner, IT Glossary)
  4. Why searching?
     • Find relevant chemical patterns (substructure search)
     • Find similar compounds (similarity search)
     • Find building blocks (superstructure search)
     • Compare libraries (overlap analysis)
     • Identify compound families (clustering)
  5. Screening Library 250k-1M Real-life chemical data
  6. Corporate Compound Repository 3M Chemical data in patent literature
  7. Enumerated virtual compounds
  8. Old "traditional" vs. new "traditional"
  9. SureChEMBL real-time search
  11. MadFast Similarity Search
     • Ultra-fast chemical similarity search
     • In-memory data storage
     • Optimized multi-threaded implementation
  12. Near real-time search of 1 billion
  13. 100k x 1M exhaustive search in a few minutes
  14. Chemical white-space analysis
     • Find potential drug analogs/novel ring systems
     • From synthetically feasible virtual chemical space
     • As close as possible to known drugs
     • As far as possible from patented compounds
  15. Case study
     • Filter drugs from ChEMBL
     • Search analogs in GDB-13
     • Analyze overlap with SureChEMBL
     /mfss-study-poster-23-1-imre-gabor-1.pdf
  16. Results
  17. Results
  18. Results
  20. Extremely fast search engines
     • Search time is not a question anymore
     • Hits as you draw
     • Real-time clustering for search suggestions and drill-down
     New ways of interaction fundamentally change chemical patent search
  21. Beyond the search
     • Real-time analysis of extremely large compound libraries
     • Proactive exploration of synthetically feasible and patentable chemical space
  22. THANK YOU
     Árpád Figyelmesi
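
The similarity search (slides 4 and 11) and the white-space analysis (slide 14) can be illustrated with a small in-memory sketch. This is not the MadFast implementation and uses none of its API. It assumes compounds are already encoded as binary fingerprints, stored here as plain Python integers, and it assumes the Tanimoto coefficient as the similarity measure, which the slides do not name. All compound names, bit patterns, and thresholds below are hypothetical toy values.

```python
from typing import Dict, List, Tuple


def popcount(bits: int) -> int:
    """Count the set bits of a fingerprint stored as a Python int."""
    return bin(bits).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two bit fingerprints: |a AND b| / |a OR b|."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def most_similar(query: int, library: Dict[str, int], k: int = 3) -> List[Tuple[str, float]]:
    """Exhaustive in-memory scan: score every entry and return the k best."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def max_similarity(query: int, library: Dict[str, int]) -> float:
    """Similarity of the query to its nearest neighbour in the library."""
    return max((tanimoto(query, fp) for fp in library.values()), default=0.0)


def white_space(
    candidates: Dict[str, int],
    drugs: Dict[str, int],
    patented: Dict[str, int],
    min_drug_sim: float = 0.5,    # assumed threshold: "close to known drugs"
    max_patent_sim: float = 0.5,  # assumed threshold: "far from patented compounds"
) -> List[str]:
    """Keep candidates that are near some known drug but not near any patented compound."""
    return [
        name
        for name, fp in candidates.items()
        if max_similarity(fp, drugs) >= min_drug_sim
        and max_similarity(fp, patented) < max_patent_sim
    ]


# Hypothetical 8-bit toy fingerprints; real fingerprints are far longer.
drugs = {"drug_a": 0b11110000}
patented = {"pat_x": 0b11111100}
candidates = {
    "cand_1": 0b11100001,  # close to drug_a, not close to pat_x
    "cand_2": 0b11111000,  # close to drug_a, but also close to pat_x
    "cand_3": 0b00001111,  # shares no bits with drug_a
}

print(most_similar(drugs["drug_a"], candidates, k=1))
print(white_space(candidates, drugs, patented))  # ['cand_1']
```

The scan above is a single-threaded loop over a dictionary, so its cost grows linearly with library size. A production engine of the kind described on slide 11 would typically pack fingerprints into contiguous memory and split the scan across threads, but those details are outside what the slides state.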