Intel big data analytics in health and life sciences personalized medicine

Intel Corp. activities in big data analytics in health and life sciences. Focus on - compute, high performance computing, processors, storage, networking, bioinformatics, genomics, GATK, Broad, BGI, Lustre, Hadoop, OHSU, SGI, Dell, Knome, IBM, CLCBio, DNA, RNA, personalized medicine, personalised medicine, payers, providers, pharma, clinics, sequencers, illumina, Cloud Computing, AWS, Security

Transcript

  • 1. The Age of Data-Driven Personalized Medicine Ketan Paranjape Worldwide Director, Health & Life Sciences Intel Corporation www.intel.com/healthcare/bigdata
  • 2. Notice and Disclaimers • Notice: This document contains information on products in the design phase of development. The information here is subject to change without notice. Do not finalize a design with this information. Contact your local Intel sales office or your distributor to obtain the latest specification before placing your product order. • INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications, product descriptions, and plans at any time, without notice. • All products, dates, and figures are preliminary for planning purposes and are subject to change without notice. • Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. • Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. • The Intel products discussed herein may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. • Knights Corner, Knights Ferry, Aubrey Isle and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user. • Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's website at http://www.intel.com. • Intel®, Itanium®, Xeon®, Pentium®, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. • Copyright © 2011-13, Intel Corporation. All rights reserved. • *Other names and brands may be claimed as the property of others.
  • 3. Compute for Personalized Medicine, a.k.a. Big Data Analytics in Healthcare and Analytics
  • 4. ACTIONABLE EHR ANALYTICS - PAYER, PROVIDER
  • 5. Regional Health Information Network (RHIN) – China (Jinzhou, pop. 3M) • Challenge: The RHIN has problems with scalability, performance, and maintenance, and data storage is expensive. • Solution: EMR data and healthcare services running on Intel Hadoop Distribution and Xeon E5 servers. • Benefits: High performance and scalability demonstrated via proof of concept (POC) and stress testing; significantly reduced storage cost. • 1/5 reduction in response time; 5x concurrent users. [Figure: Data processing flow of RHIN platform] http://hadoop.intel.com/pdfs/IntelChinaHealthyCityAnalyticsCaseStudy.pdf
  • 6. GE-Medical Quality Improvement Consortium (MQIC) • Challenge – Gaining value from data in EMRs/EHRs and other digital health information tools • Solution – De-identified data from Centricity EMRs; analytics capabilities to enhance their quality and reporting activities. • 1.6 billion documents representing 30 million de-identified patient records and 209 million office visits. • Benefits – Physician practices and ambulatory care clinics deliver their best care more efficiently, along with population-based research and public health activities. DEMO - http://visualization.geblogs.com/visualization/network/
  • 7. NHS Trust – Leeds Teaching Hospitals • Challenge – Capture data at the point of admission and throughout the patient care cycle, use natural language processing (NLP) to make sense of unstructured care notes, and combine the results with structured care data for analysis • Solution – Partnering with ISVs (Ascribe, Two10degrees, Microsoft) and machines powered by the Intel Xeon processor E5 family • 30M patients, > 7M attendances each year worth of records • Benefits – Billing optimizations (doctors log the correct data), resource optimizations (learning patient trends for resource planning) “The use of big data analysis on our patient care notes enables us to prove things our clinical intuition was telling us. In the new world anecdotal evidence isn’t enough. What we think isn’t sufficient to spend money. We need proof.” Iain MacBrairdy, Business Manager, Emergency Medicine, Leeds Teaching Hospitals
  • 8. Charite “Real-time” Cancer Analysis – Matching proper therapies to patients • Challenge: Real-time analysis of cancer patients using the in-memory SAP HANA Oncolyzer database running on mission-critical Intel Xeon family infrastructure (3.5M data points per patient, up to 20 TB of data per patient) • Solution: Using structured and unstructured data; collecting and analyzing tables used to take up to two days and now takes seconds • Benefits: Improves medical quality in a disruptive way for – Patient – Doctor – Hospital – Research http://moss.ger.ith.intel.com/sites/SAP/SAP%20account%20team%20documents/Marketing/SAP%20HANA/SAPHANA_Charite_case_study_HI.PDF
  • 9. HANA Oncolyzer • Ad-hoc analysis of heterogeneous tumor data for cancer research • Decades of medical records from tens of thousands of patients • Structured and unstructured data (records, time series, free text, etc.) • Solution: • Integrated into a condensed but exhaustive view • On-the-fly analyses (e.g. Kaplan-Meier estimation, cohort statistics) • Includes external data sources (e.g. PubMed, pharmaceutical databases) • Attributes can be native, views, freetext-extracted, calculated
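
The Kaplan-Meier estimation named on slide 9 can be illustrated with a minimal Python sketch. It is not the Oncolyzer implementation: the function name `kaplan_meier` and the six-patient cohort are hypothetical, chosen only to show how the survival curve steps down at each observed event while censored patients simply leave the risk set.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per patient
    observed:  True if the event occurred, False if the patient was censored
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # every patient leaves the risk set at their time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical cohort: months of follow-up and whether the event was observed.
months = [5, 8, 8, 12, 15, 20]
event = [True, True, False, True, False, True]
for t, s in kaplan_meier(months, event):
    print(f"t={t:>2} months  S(t)={s:.3f}")
```

Running an estimator like this per cohort (for example, per therapy) is the kind of on-the-fly analysis the slide refers to; an in-memory database makes the cohort selection step fast enough to do interactively.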
  • 10. LIFE SCIENCES, PHARMA, GENOMICS
  • 11. Life Sciences: At the intersection of transformative forces • HPC: Enabling exascale (10^18) computing on massive data sets • Cloud: Helping enterprises build open interoperable clouds • Open Source: Contributing code and fostering ecosystem
  • 12. Genomics Is A Big Data Problem • 313 Exabytes if everyone in the US has their genes sequenced • 495 Exabytes if every cancer patient in the US has their genes sequenced every 2 weeks • A complex interaction of varied & changing intrinsic and extrinsic factors determines cell response [Figure labels: Affecting Factors; Cell Response]
  • 13. Life Sciences: Key Industry Challenges and Solutions • Many (most) applications are single-threaded, single address space. Intel is delivering optimizations working with the open source community and developing an NGS+HPC curriculum. • Some algorithms scale quadratically with the size of the problem. Large data sets exceed available memory and storage. Innovations in acceleration, compute, storage, networking, security, and *-as-a-service. • International collaboration is an imperative; bioinformatics expertise is scarce. • Intel is working closely with the ecosystem to address enterprise-to-cloud transmission of terabyte payloads. • Databases are distributed; data is siloed and will likely stay that way. Tools like Hadoop, Lustre, GraphLab, In-Memory Analytics, etc. • Need for Balanced Compute Infrastructure
  • 14. Dell Active Infrastructure for HPC Life Sciences • Challenge: Experiment processing takes 7 days with current infrastructure, which delays treatment for sick patients • Solution: Dell Next Generation Sequencing Appliance – Single Rack Solution – 9 Teraflops of Sandy Bridge Processors – Lustre File Storage – Intel SW tools and engineers • Benefits: RNA-Seq processing reduced to 4 hours • Includes everything you need for NGS - compute, storage, software, networking, infrastructure, installation, deployment, training, service & support [Rack diagram: Dell HSS (Lustre, up to 360TB); Dell NSS (NFS, up to 180TB); M420 compute (up to 32 nodes); infrastructure: Dell PE, PC & F10. Actual placement in racks may vary.]
  • 15. IBM, CLC bio Genomics Sequencing Analytics Solution • Challenge: Need for processing power and storage capacity in order to correlate the variants in the genome with the relevant patient symptoms • Solution: IBM®, CLC Genomics server SW, Genomics Workbench client SW; Small (48 Cores, 192 GB), Medium, Large (192 Cores, 768 GB) Analytics Solutions • Benefits: – Reference Mapping for 37x coverage human genome – ~9hr (1 node) to ~30mins (37 nodes) – Variant Calling and annotation for 37x coverage – ~40 hrs (1 node) to ~3hrs (23 nodes) • Infrastructure – IBM System x® 3550 M4, E5-2650; 48 CPU cores and 192 GBs of memory to 192 CPU cores and 768 GBs of memory – IBM Storwize® V7000 – CLC Genomics Server 5.0.2 , Workbench 6.0.1 – 7x 3TB SAS 6 Gbps HDD (16 TB usable) http://www-148.ibm.com/bin/newsletter/tool/landingPage.cgi?lpId=6155
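
The node-scaling figures on slide 15 can be read as speedup and parallel efficiency. The short Python sketch below does that arithmetic; the function name `scaling` is hypothetical, and the inputs are the slide's approximate ("~") runtimes, so the results are approximate as well.

```python
def scaling(t_single_hours, t_parallel_hours, nodes):
    """Return (speedup, parallel efficiency) for a run on `nodes` nodes."""
    speedup = t_single_hours / t_parallel_hours
    efficiency = speedup / nodes
    return speedup, efficiency


# Approximate runtimes quoted on the slide: (1-node hours, N-node hours, N).
runs = {
    "Reference mapping, 37x coverage": (9.0, 0.5, 37),
    "Variant calling and annotation, 37x coverage": (40.0, 3.0, 23),
}
for name, (t1, tn, n) in runs.items():
    s, e = scaling(t1, tn, n)
    print(f"{name}: speedup ~{s:.1f}x on {n} nodes, efficiency ~{e:.0%}")
```

Under these numbers both workloads run at roughly half of ideal linear scaling, which is consistent with the deck's broader point that genomics pipelines still contain serial or poorly parallelized stages.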
  • 16. NGS Appliances BioTeam “SlipStream” • Challenge: Significant IT overhead, limited bioinformatics support, changing landscape • Solution: “Slipstream” Appliance • Benefits: – Minimize lab IT startup costs – Integrate and standardize data management including security, easily traceable results – Adaptable to any Laboratory, Workflow-based Lab Management – Seamless Sequencer Integration • Infrastructure – Dell PowerEdge T620 Desktop Server – 2x Intel Xeon 8 Core Processors (16 cores) – 16x 32GB RAM (512GB), 1x 100GB SSD – 7x 3TB SAS 6 Gbps HDD (16 TB usable)
  • 17. Convey Computing’s Hybrid Core Architecture to Accelerate Algorithms • Challenge: Advances in sequencing technology have significantly increased data generation and require similar computational advances for bioinformatics analysis • Solution: Convey Hybrid-Core (HC) architecture - Intel® x86 microprocessors with a coprocessor comprised of reconfigurable hardware (FPGAs) • Benefits: Accelerated BWA pipeline up to 18x compared to a standard x86 system • Project Characteristics: HC-1: Intel L5408, Xilinx Virtex-5 FPGAs, 1TB SATA disks HC-2: Intel X5670, Xilinx Virtex-5 FPGAs, 1TB SATA disks HC-2ex: 128GB (host), 64GB (coprocessor), 1TB SATA disks
  • 18. Genomics & Health Analytics Appliances • Scale through independent solutions, each targeting a different segment & usage model [Rack diagram; actual placement in racks may vary]
  • 19. Ultra High-Speed Networking Optimizations – Aspera Labs • Challenge: Improving big data transfer to and from the backend data center • Solution: Optimize ultra high-speed (10 Gbps and beyond) data transfer solutions built on Aspera’s FASP™ transport technology and Intel’s innovative hardware platform • Benefits with Intel Xeon E5-2600 (DDIO, SR-IOV) – 300% improvement in Aspera transfer throughput – Same transfer speed performance in both physical and virtualized computing environments – Both LAN and WAN transfer speeds had similar results • Infrastructure and Data Characteristics: – Xeon E5 2687, 32GB DDR3 with Non-Uniform Memory Access (NUMA), Data Direct IO (DDIO), Intel 910 SSD, Intel 82599EB 10 GbE – Aspera Enterprise server 3.1.1.66573, Aspera Performance Automation Suite
  • 20. High-Performance Interconnect (InfiniBand) and HPC – Intel® True Scale Fabric • Challenge: Can high-performance interconnect technology (InfiniBand) keep up with the increase in the number of processor cores? • Workloads: VASP, WIEN2K • Benchmarks: MVAPICH (MPI over InfiniBand), IMPI (Intel MPI) • Results: – Scale-up research – 5- to 10-fold improvement in performance when scaling from a single node to 16 nodes – Intel® True Scale Fabric QDR-40 shows excellent price/performance results • Infrastructure and Data Characteristics: – 1 head + 16 compute nodes, dual Xeon® E5 2680 2.7GHz per node – 32GB of RAM 1666MHz per node – RHEL, Compiler, MPI variations available – Intel® Cluster Suite, Intel® Fabric Suite
  • 21. Data Life Cycle Management with iRODS – EMC, RENCI 4.3M WGS for all US newborns/yr. ~= 100 PB* Can I describe it? Can I find it? Can I access it? Can I move it? * Chris Mason, Weill Cornell Medical College, WGA Mtg. Nov. 2012
  • 22. High Performance Scale-out Storage for Wellcome Trust Sanger Institute • Challenge: Exponential increases in the volume of data being generated, while storage budgets are flat or growing slowly. Large data sets are difficult to proactively manage and can easily overwhelm storage resources. Un-optimized storage has a direct, negative impact on application performance, slowing the time to breakthrough results. • Solution: Exploit the power and scale of HPC-class storage, powered by Intel® Enterprise Edition for Lustre* software, for unprecedented performance with unmatched management simplicity. • Benefits of storage solutions powered by Intel EE for Lustre software: – Openness – Developed and enhanced by the Lustre experts – Global namespace – all clients can access all data – Performance – Upwards of 1 TB/s – Virtually unlimited file system and per-file sizes – Management simplicity using Intel® Manager for Lustre*
  • 23. Heterogeneous Clusters for Biomedical Computing at Virginia Bioinformatics Institute (VBI) • Challenge: The need for scalable infrastructure to handle rapid data growth and run varied applications is driving novel computing needs. • Solution: A combination of Intel® Xeon®, Intel® True Scale QDR InfiniBand and SGI’s InfiniteStorage platform was deployed to deliver a 300% speedup. The overall reduction in cost resulted in the purchase of additional compute nodes. • VBI Cluster – Symmetric multiprocessing (SMP) nodes (large memory Xeon E7) with 1 TB of RAM, massively parallel processing (MPP) nodes (Xeon E5) with 64 GB. 50 PB of tape storage, 600 TB of HDD. Using SGI’s IS16000 platform and Intel TrueScale fabric, VBI moves data through the storage systems at 2 GB per second. “The amazing thing is that we see almost a three times performance increase on 48 nodes compared to 56 nodes of the previous generation, even though the processors are slightly slower clock speed. The Intel® QuickPath Interconnect and Intel® TrueScale Fabric have had a big impact.” Dr. Kevin Shinpaugh, Director of IT and HPC, Virginia Bioinformatics Institute
  • 24. Top-5 Pharmaceutical Company - SAS Grid • Challenge: Need to accelerate and optimize “time to results” clinical trial simulation environment; resource allocation and job prioritization was manual/ad-hoc • Solution: “Scale-Out” architecture: – SAS Visual Analytics, Enterprise Miner, Grid Manager – Red Hat Enterprise Linux – Xeon E5 servers (HP) • Benefits: Clinical trial simulation exercises reduced from hours to < 5 minutes; registration decisions accelerated with multi-hundred million USD impact http://www.intel.com/content/www/us/en/cloud-computing/cloud-computing-xeon-e5-carestream-imaging-brief.html
  • 25. Mitsui Knowledge Industry (MKI) • Challenge: Reduce the amount of time it takes to do complete genomic analysis and deliver results to patients • Solution: Real-Time Big Data Platform – R (Revolution Analytics) – SAP HANA – Hadoop • Benefits: Genomic analysis shortened from several days to 20 minutes; performance for some queries improved 400,000 X http://www.intel.com/content/www/us/en/cloud-computing/cloud-computing-xeon-e5-carestream-imaging-brief.html http://www.saphana.com/docs/DOC-3641
  • 26. Data-Intensive Discovery: Genomics – Intel Distribution • Value: – Enable researchers to discover biomarkers and drug targets by correlating genomic data sets – 90% gain in throughput; 6X data compression • Analytics: – Provide curated data sets with pre-computed analysis (classification, correlation, biomarkers) – Provide APIs for applications to combine and analyze public and private data sets • Data Management: – Use Hive and Hadoop for query and search – Dynamically partition and scale HBase – 10-node cluster / Intel Xeon E5 processors – 10GbE network
  • 27. Big Data, Bioinformatics • Problem Statement: Back in 2008, a genome research team faced compute and scalability issues in comparing all pairs of 4 million proteins; the BLAST search results overwhelmed a single database table. Today they need to compare 14 million proteins, a requirement that cannot be addressed with the existing technology. • Solution: Intel Distribution for Hadoop (IDH), MapReduce, HBase, Hive • Benefits: Ability to compare 14 million proteins and more, reducing the processing time from days to hours. • Project Characteristics: Hadoop: 5-node cluster; Storage: 16TB (internal storage) per server; Servers: Xeon E5, 2 socket 8 cores, 64GB RAM; SLA: reduce processing time from 30 days to less than a day and scale to a 4x4 million samples comparison; Data: multi-terabyte database [Diagram: team website, BLAST program, genome data, proteins comparison, high-performance scalable Hadoop/HBase cluster]
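
The jump from 4 million to 14 million proteins on slide 27 is larger than it looks, because an all-pairs comparison grows quadratically. The Python sketch below counts the comparisons under one stated assumption that is not in the slide: each unordered pair is compared exactly once (an all-vs-all BLAST run may instead treat pairs as ordered, which doubles both counts but leaves the ratio unchanged). The helper name `unordered_pairs` is hypothetical.

```python
def unordered_pairs(n):
    """Number of distinct pairs among n items: n choose 2."""
    return n * (n - 1) // 2


old_n = 4_000_000    # proteins compared in 2008, per the slide
new_n = 14_000_000   # proteins to compare today, per the slide

old_pairs = unordered_pairs(old_n)
new_pairs = unordered_pairs(new_n)

print(f"{old_n:,} proteins -> {old_pairs:.2e} pairs")
print(f"{new_n:,} proteins -> {new_pairs:.2e} pairs")
print(f"3.5x more proteins means {new_pairs / old_pairs:.2f}x more comparisons")
```

This is why the workload is framed as a scale-out problem: the input grew 3.5x, but the work and the result table grew by roughly the square of that factor.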
  • 28. High Throughput Science: Embracing Cloud-based Analytics • Challenge: Team of cancer researchers had to screen a drug concept with a list of tens of millions of molecules working with a tight deadline, a fixed budget, and strict security and compliance requirements. Schrödinger’s existing in-house servers would be tied up for weeks • Solution: Schrödinger leveraged software from AWS partner, Cycle Computing, to provision a fully secured cluster of 50,000 cores, powered by the Intel® Xeon® processor E5 family. – This configuration enabled the team to run 16 million molecular simulations an hour. – Developed 1000 molecule list in < 8hrs.
  • 29. High Throughput Science: Large Scale Computational Chemistry Simulation • Challenge: Sustaining access to 50000+ compute cores for large scale computational chemistry simulation results in under a week. Ability to monitor and re-launch jobs, no additional capital expenditure with internal HPCC already running at capacity. • Solution: Novartis leveraged software from AWS partner, Cycle Computing, and MolSoft to provision a fully secured cluster of 30,000 CPUs, powered by the Intel® Xeon® processor E5 family. – Completed screening of 3.2 million compounds in approximately 9 hrs, compared to 4 -14 days on existing resources. Virtual Screening
  • 30. Goals and Current applications target • Focus on improving genomics pipelines • Optimize individual applications • Work with code authors to release optimizations • Intel® Xeon® processor focus – Selectively experiment with Intel® Xeon Phi™ coprocessor

        DOMAIN     Applications           Intel® Architecture Target
        Genomics   Bowtie 1*, Bowtie 2*   Xeon® processor
                   BWA*                   Xeon® processor
                   BLAST*                 Xeon® processor
                   GATK*                  Xeon® processor
                   HMMER*                 Xeon® processor, Xeon® Phi™ coprocessor
                   Abyss*                 Xeon® processor
                   Velvet*                Xeon® processor

        *Other names and brands may be claimed as the property of others.
  • 31. TGen* RNA sequencing pipeline • Partnership between Intel®, DELL*, TGen* • 1.8x** • ** 2-socket Intel(R) Xeon(R) CPU E5-2687W / 3.1 GHz *Other names and brands may be claimed as the property of others.
  • 32. Goals and Current applications target • Optimize for Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor (node and cluster) • Increase availability of applications on Intel® Xeon Phi™ coprocessor • Work with code authors to release optimizations

        DOMAIN:                        Molecular Dynamics/Chemistry
        Applications:                  AMBER*, NAMD*, GROMACS*, GAMESS*, Quantum Espresso*, Gaussian*, VASP*, CP2K*, QBOX*, CPMD*, LAMMPS*
        Intel® Architecture Targets:   Xeon® processor, Xeon® Phi™ coprocessor

        *Other names and brands may be claimed as the property of others.
  • 33. Intel® Xeon® processor: new platforms and architecture improve Life Science applications • 2-socket “Ivybridge vs. Sandybridge”: Ivybridge 12c/24T 2.7GHz, Sandybridge 8c/16T 2.9GHz http://www.intel.com/content/www/us/en/benchmarks/server/xeon-e5-2600-v2/xeon-e5-v2-hpc-life-sciences.html *Other names and brands may be claimed as the property of others.
  • 34. Optimizing/Accelerating the DNA Pipeline • Compression – IPP library – HW Acceleration – Custom library • FPGA Acceleration
  • 35. Incorporating Intel IPP Deflater into Picard Tools
  • 36. Picard MarkDuplicates Optimizations • Two-Fold Approach: 1. Added optional tag ‘MC’ to SAM Specification • Tag ‘MC’ is used to store the Mate CIGAR for a paired read whose mate is mapped. • SAM JDK extended to support tag ‘MC’ • MergeBamAlignment modified to include the new ‘MC’ tag within each relevant record of the SAM/BAM file 2. Redesign of MarkDuplicates • Inclusion of the ‘MC’ tag provides an opportunity for algorithmic redesign of MarkDuplicates • Overall speedup ~2x for MergeBamAlignment/MarkDuplicates • Additional Gains: Enables streaming of records for the entire pre-GATK phase (from ‘bwa mem’ to ‘MarkDuplicates’) in a typical bwa_mem+GATK workflow
  • 37. MarkDuplicates • BASELINE: 1) Store information for each read, used to determine the unclipped 5’ coordinate for both ends & the orientation of the pair 2) Sort reads within the entire file by unclipped 5’ coordinate and MarkDuplicates 3) Write out BAM file • OPTIMIZED: 1) Sort reads within a small window by unclipped 5’ coordinate: - MarkDuplicates - Write out [Diagram: reads RdX_1, RdY_1, RdX_2, RdY_2 of PairX and PairY; in the baseline each record carries only its own Cigar, in the optimized version each record also carries an MC tag]
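
Slides 36 and 37 rely on the "unclipped 5’ coordinate" of each read end, which is derived from the alignment position and the CIGAR string. The Python sketch below shows one way to compute it; it is an illustration, not Picard's code, and the function name `unclipped_five_prime` and the example reads are hypothetical. It shows why carrying the mate's CIGAR in the ‘MC’ tag matters: with the mate's position, strand, and CIGAR available on the same record, both ends of the pair can be keyed without first looking up the mate record.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference
CLIPS = set("SH")             # soft and hard clips


def unclipped_five_prime(pos, cigar, reverse):
    """1-based unclipped 5' coordinate of a read.

    pos:     1-based leftmost aligned position (SAM POS field)
    cigar:   CIGAR string, e.g. "5S95M"
    reverse: True if the read maps to the reverse strand
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if not reverse:
        # Forward strand: 5' end is on the left, so undo the leading clips.
        leading = 0
        for n, op in ops:
            if op not in CLIPS:
                break
            leading += n
        return pos - leading
    # Reverse strand: 5' end is on the right, so extend past the trailing clips.
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    trailing = 0
    for n, op in reversed(ops):
        if op not in CLIPS:
            break
        trailing += n
    return pos + ref_len - 1 + trailing


# Hypothetical pair: this read's own fields plus its mate's position and MC tag.
read_key = unclipped_five_prime(100, "5S95M", reverse=False)
mate_key = unclipped_five_prime(300, "90M10S", reverse=True)  # CIGAR from MC tag
print("pair key:", (read_key, mate_key))
```

Because the pair key is available as soon as either record is seen, duplicates can be resolved within a small sorted window instead of after a pass over the entire file, which is the redesign the slide describes.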
  • 38. DNA Pipeline: BWA+GATK: Whole Genome Sample: ~65x Coverage • Collaborating with our Partners and Medical community • Process level Parallelism • Thread-level Parallelism

        Step (Tool)                               # of Threads   Runtime (hours)
        Read Alignment (bwa mem)                  24             7
        View (samtools)                           24             2
        Sort + Index (samtools)                   24             3
        MarkDuplicates (picardtools) + Index      1              11
        RealignerTargetCreator (GATK)             24             1
        IndelRealigner* (GATK) + Index            24             6.5
        BaseRecalibrator (GATK)                   24             1.3
        PrintReads* (GATK) + Index + Flagstat     24             12.3
        TOTAL (hours)                                            44

        Step (Tool)                               # of Threads   Runtime (hours)
        Read Alignment (bwa)                      16             8
        Sampe (bwa)                               1              24
        Import (samtools)                         1              11
        Sort + Index (samtools)                   1              14.5
        MarkDuplicates (picardtools) + Index      1              11.5
        UnifiedGenotyper* (GATK)                  16             7.5
        SomaticIndelDetector (GATK)               1              3
        RealignerTargetCreator (GATK)             16             0.8
        IndelRealigner* (GATK) + Index            1              17.5
        BaseRecalibrator* (GATK)                  1              62
        PrintReads* (GATK) + Index + Flagstat     1              25
        TOTAL (hours)                                            177

    • Algorithmic Improvement: 6X improvement so far, 4X without major code change and the rest with code changes • Redesign of Mark Duplicates + Merge Bam Align: 30-36 hours
  • 39. Profiling: Single Instance Run – Lower Latency • # of Machines = 1 • # of cores/Machine = 24 • Temporary Storage – RAID0 2x4TB HDD • Input Dataset: G15512.HCC1954.1, coverage: 65x • Average CPU utilization is very low. Most cores not being used • Average I/O bandwidth is very low. Application not I/O bound • Average memory footprint is small. Application not using memory available in newer systems • There is a lot of room to improve
  • 40. Smith Waterman Acceleration • Working on accelerating two versions of Smith Waterman: 1. Simplified version where gap open, gap extension, and mismatch penalties are identical 2. Affine gap penalty (as implemented in BWA-MEM) • Initial results on #1 seem promising • Speedup is measured in terms of throughput for these runs • Banded Smith Waterman implementation • Bitwise parallelism: Packed32: 32-bit uint; Packed64: 64-bit uint; AVX: 256-bit vector; Xeon Phi: 512-bit vector
  • 41. Optimizing/Accelerating • Compression – IPP library – HW Acceleration – Custom library • FPGA Acceleration
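
A scalar reference for the simplified variant on slide 40 (version 1, where gap open, gap extension, and mismatch share one penalty) is sketched below in Python, restricted to a diagonal band. It is not the accelerated implementation the slide describes: the bit-packed, AVX, and Xeon Phi versions are not shown, and the function name, default scores, and band width are assumptions made for illustration.

```python
def banded_smith_waterman(a, b, match=1, penalty=1, band=8):
    """Best local alignment score using only cells with |i - j| <= band.

    A single `penalty` is charged for a mismatch and for every gap position,
    which corresponds to the simplified scoring scheme.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        lo = max(1, i - band)
        hi = min(len(b), i + band)
        for j in range(lo, hi + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else -penalty)
            up = prev[j] - penalty        # gap in b
            left = curr[j - 1] - penalty  # gap in a
            # Local alignment: scores never drop below zero. Cells outside
            # the band stay at zero, so they can never raise a score.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


read = "ACGTACGT"
ref = "ACGTTACGT"  # same sequence with one extra T inserted
print(banded_smith_waterman(read, ref))          # band wide enough for the gap
print(banded_smith_waterman(read, ref, band=0))  # diagonal only, no gaps allowed
```

The band limits work to O(band x length) cells instead of the full matrix, which is what makes the kernel a good fit for fixed-width vector registers; the trade-off is that alignments needing more gap positions than the band allows are missed.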
  • 42. Genomics - Big Data Problem • 313 Exabytes if everyone in the US has their genes sequenced • 495 Exabytes if every cancer patient in the US has their genes sequenced every 2 weeks. Images, assays, and drug response data will push it further up, as shown by the blue line • A complex interaction of varied & changing intrinsic and extrinsic factors determines cell response • With genomic data growing rapidly, hospitals and research centers need to access local data (the data not shared) and centralized public/private data for various analyses and analytics for genomic research, development, and medicine. Compute has to be done “where data is” and needs to be consistent locally and in the cloud. Energy and total cost of operation are key. • The day when every newborn gets their DNA sequenced is not far away: http://www.nih.gov/news/health/sep2013/nhgri-04.htm • Source: Knight Cancer Institute, Oregon Health & Science University & Intel [Figure labels: Affecting Factors; Cell Response: Proliferation, Apoptosis, Differentiation, DNA Repair, Motility, Senescence; Invasion, Metastasis & therapeutic response]
  • 43. PairHMM Matrix Dependencies: Wave-Front Computation in AVX [Figure: matrix cells labeled 1 2 2 3 3 3 4 4 4 4 5 5 5 5 by wave-front step]
  • 44. Pair HMM Acceleration using AVX • Computation kernel and bottleneck in GATK Haplotype Caller • AVX enables 8 floating point SIMD operations in parallel • 2 ways to vectorize HMM computation • Intra-Sequence – Parallelize computation within one HMM matrix operation. Run multiple (8) computations concurrently along diagonal • Inter-Sequence – Perform multiple (8) HMM matrix operations at once

        Configuration                 Time (seconds)   Speedup (C++ / Java)
        Serial C++                    1540             1x / 9x
        1 core with AVX (Intra)       340              4.5x / 40.7x
        1 core with AVX (Inter)       285              5.4x / 48.6x
        24 cores with AVX (Inter)     14.3             108x / 970x
        24 cores hybrid (Inter)       15.7             98x / 882x
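
The intra-sequence scheme on slides 43 and 44 works because every cell on one anti-diagonal depends only on cells from earlier anti-diagonals. The Python sketch below shows that traversal order. It deliberately uses a stand-in recurrence (each cell is the sum of its upper, left, and upper-left neighbors) rather than the PairHMM probabilities, and the names `wavefronts`, `fill_by_wavefront`, and `lanes` are hypothetical; `lanes=8` mirrors the 8 floating point operations per AVX instruction stated on the slide.

```python
def wavefronts(rows, cols):
    """Yield one list of (i, j) cells per anti-diagonal, in dependency order."""
    for d in range(rows + cols - 1):
        yield [(i, d - i) for i in range(max(0, d - cols + 1), min(rows, d + 1))]


def fill_by_wavefront(rows, cols, lanes=8):
    """Fill a matrix whose cell (i, j) needs (i-1, j), (i, j-1) and (i-1, j-1).

    Returns the matrix and the number of vector-sized batches issued.
    """
    m = [[0] * cols for _ in range(rows)]
    batches = 0
    for cells in wavefronts(rows, cols):
        # Cells on one anti-diagonal are mutually independent, so a SIMD
        # kernel could process them `lanes` at a time. Here a batch is a loop.
        for start in range(0, len(cells), lanes):
            batches += 1
            for i, j in cells[start:start + lanes]:
                if i == 0 or j == 0:
                    m[i][j] = 1  # boundary condition of the stand-in recurrence
                else:
                    m[i][j] = m[i - 1][j] + m[i][j - 1] + m[i - 1][j - 1]
    return m, batches


# A 4 x 5 matrix reproduces the wave-front sizes at the start of the figure.
print([len(cells) for cells in wavefronts(4, 5)])
matrix, batches = fill_by_wavefront(4, 4)
print(matrix[3][3], batches)
```

Short anti-diagonals near the corners leave vector lanes idle, which is one plausible reason the inter-sequence approach (8 independent matrices at once) is the faster single-core variant in the table above.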
  • 45.
  • 46. Policy – United States, European Union • Snapshot of US, EU Recommendations • Develop an ICT-enabled European Strategy for Personalised Medicine 2014-2020 • Driving research to unleash the potential of ICT at the point-of-care • EU R&D initiatives must address: – Interoperability of technical standards for managing and sharing sequence data in research and clinical samples; – Development of hardware, software and workflow algorithms to accelerate cost efficient analysis of genetic abnormalities that cause cancer and other complex diseases; – Research to ensure convergence of Big Data and Cloud Computing infrastructure to meet the requirements of High Performance Computing and data throughout the life sciences and healthcare value chains • The eHealth Action Plan 2020 should include Personalised Medicine as a priority: – Gain knowledge of the challenges and barriers (technical, organizational, legal and political) to the adoption of ICT in support of Personalised Medicine leveraged by genomic information; – Evaluate how to change workflows and education requirements to facilitate adoption of ICT mediated personalized medicine in clinical practice; – Expand collaboration with other regions of the world in matters of common interest, e.g. by leveraging the eHealth MoU with the United States of America; – Study, evaluate and disseminate technology neutral risk assessment frameworks for data privacy and security, covering the entire ICT enabled Personalised Medicine delivery chain; – Develop effective methods for enabling the use of medical information for public health and research
  • 47. Intel Assets for Life Sciences
    Intel Xeon E5: • Up to 80% greater performance • Up to 70% more energy efficiency • Up to 30% less network latency • Hardware-accelerated security (AES-NI) • Broad industry adoption • Consistent performance gains each generation
    Intel Xeon Phi: • Performance and programmability for highly-parallel workloads • Programming continuity and scalable parallel programming models: common source code and software tools between multicore Intel® Xeon® and manycore Intel® Xeon Phi™ • Partner ecosystem continues growing and making progress
    Intel Fabric: • Intel® True Scale Fabric designed from the ground up for HPC • QDR-40 and QDR-80 deliver performance that scales - high MPI message rates and end-to-end latency that stays low at scale • Optimized support for Intel® Xeon® E5 and Xeon® Phi processors • Intel Fabric Suite – IB Fabric Management & FastFabric Management tools
    Intel Storage: • Intel® Xeon® processors and platforms are enabled with beneficial storage optimizations • Solid State Drives (SSD) and other NVM technologies improve storage performance • Intel® Cache Acceleration Software • Intel’s open source Lustre file-system support/development and Chroma management/provisioning tools
    Intel Software: • Intel® Cluster Studio XE compilers, libraries, analysis tools, OpenMP and MPI • Intel® Hadoop Distribution • Intel® Data Center Manager and Intel® Node Manager (NM) • Intel® Expressway Service Gateway for Cloud usage models
  • 48. Summary • Enabling ecosystem of partners to innovate and make Personalized Medicine vision a reality • Delivering hardware-enhanced capabilities and software to accelerate science, translate results, deliver today. • Looking for collaboration opportunities to take Personalized Medicine mainstream by 2020 • Big Data/Analytics in Health & Life Sciences • www.intel.com/healthcare • hadoop.intel.com
  • 49. INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries. Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 Legal Disclaimer & Optimization Notice Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.