Jax 2013 - Big Data and Personalised Medicine
 

  • Our main building blocks are:
    Server: The Xeon family of processors consists of the E3, E5 and E7 product lines, which offer different combinations of capabilities and price points for different workloads. The upcoming Intel MIC (Many Integrated Core) processor is targeted primarily at the portion of the HPC market that values maximum parallel processing density such as…. Our Atom line aims at the low-cost, low-power, ultra-dense microserver market where node density is paramount.
    Networking: Intel offers the industry’s #1 selling 1GbE and 10GbE adapters and silicon, and also a family of industry-leading, low-latency 10GbE/40GbE switch silicon products.
    Storage: One of the biggest trends in storage is the increasing use of compute within the storage box to reduce latencies and to lower the overall cost/GB through more efficient storage. For large data sets and for storage workloads requiring the lowest latencies, Xeon is the industry choice. Xeon provides the compute capability in over 80% of the storage market. Intel enterprise SSDs are designed for the demanding performance and endurance needs of the datacenter.
    Software and other technologies: We are developing strong open-source components such as our Intel Distribution of Hadoop. Intel Datacenter Manager enables better power management at the server, rack and datacenter level. Advanced RAS (reliability, availability and serviceability) features ensure high levels of system resiliency and availability. Intel’s heavy investment in industry enabling ensures these become available in the widest choice of systems. The most popular are general-purpose systems, but many of our partners innovate further to create highly workload-optimized platforms and converged architecture systems. The greater level of bundling and integration in these systems allows for simpler and faster deployments and ongoing maintenance.
    Now let’s look at the specific building blocks…
  • Field note: There are a few hyperlinks in the blue boxes in this presentation. The first link in E5 leads to a solution showing a 25x increase in data analytics running on Intel architecture, which shows the capability of the new Xeon E5 processor family, using AVX technology and a variety of other performance optimizations from IBM. The second link in E5 leads to a solution brief highlighting how Intel® Xeon® E5 processor based servers running Hadoop are at least three times faster than the previous solution. They can load, sort, and perform their data analyses faster, and Intel® Hyper-Threading Technology really helps with Hadoop workloads. The link in the E7 proof point is focused on a scale-up in-memory analytics solution, SAP HANA, running on Intel’s Xeon E7 processor family. All these proof points help the customer understand the power and variability of our processor solutions for Big Data.
    Key points:
    Significant performance gains are delivered by features such as the new Intel® Advanced Vector Extensions and improved Intel® Turbo Boost Technology 2.0.
    To improve flexibility and operational efficiency, there are significant improvements in I/O with the new Intel® Integrated I/O, which reduces latency ~30% while adding more lanes and higher bandwidth with support for PCI Express 3.0.
    Story: To meet the growing demands of IT, such as readiness for cloud computing, the growth in users and the ability to tackle the most complex technical problems, Intel has focused on increasing the capabilities of the processor that lies at the heart of a next-generation data center. The Intel Xeon processor E5-2600 product family is the next-generation Xeon processor that replaces platforms based on the Intel Xeon processor 5600 and 5500 series.
    Continuing to build on the success of the Xeon 5600, the E5-2600 product family has increased core count and cache size, in addition to supporting more efficient instructions with Intel® Advanced Vector Extensions, to deliver up to an average of 80% more performance across a range of workloads. These processors will offer better than ever performance no matter what your constraint is (floor space, power or budget) and on workloads that range from the most complicated scientific exploration to simple, yet crucial, web serving and infrastructure applications. In addition to the raw performance gains, we’ve invested in improved I/O with Intel Integrated I/O, which reduces latency ~30% while adding more lanes and higher bandwidth with support for PCIe 3.0. This helps to reduce network and storage bottlenecks to unleash the performance capabilities of the latest Xeon processor. The Intel® Xeon® processor E5-2600 product family: versatile processors at the heart of today’s data center. Let’s look at just what kind of performance these products are capable of…
    Legal Info:
    Configuration for 80% claim: Source: Performance comparison using best submitted/published 2-socket server results on the SPECfp*_rate_base2006 benchmark as of 6 March 2012. Baseline score of 271 published by Itautec on the Servidor Itautec MX203* and Servidor Itautec MX223* platforms based on the prior generation Intel® Xeon® processor X5690. New score of 492 submitted for publication by Dell on the PowerEdge T620 platform and Fujitsu on the PRIMERGY RX300 S7* platform based on the Intel® Xeon® processor E5-2690. For additional details, please visit www.spec.org. Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase.
    Configuration for latency reduction: Source: Intel internal measurements of average time for an I/O device read to local system memory under idle conditions comparing the Intel® Xeon® processor E5-2600 product family (230 ns) vs. the Intel® Xeon® processor 5500 series (340 ns). Baseline Configuration: Green City system with two Intel® Xeon® processor E5520 (2.26GHz, 4C), 12GB memory @ 1333, C-States Disabled, Turbo Disabled, SMT Disabled. New Configuration: Meridian system with two Intel® Xeon® processor E5-2665 (2.4GHz, 8C), 32GB memory @ 1600 MHz, C-States Enabled, Turbo Enabled. The measurements were taken with a LeCroy* PCIe* protocol analyzer using Intel internal Rubicon (PCIe* 2.0) and Florin (PCIe* 3.0) test cards running under Windows* 2008 R2 w/SP1.
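
As a quick sanity check on the two headline figures in the legal notes, the short sketch below recomputes them from the quoted numbers (SPECfp scores of 271 and 492, latencies of 340 ns and 230 ns). The function names are illustrative only; nothing here comes from Intel tooling.

```python
# Figures quoted in the legal notes of the speaker notes above.
BASELINE_SPECFP = 271       # prior generation Xeon X5690
NEW_SPECFP = 492            # Xeon E5-2690
BASELINE_LATENCY_NS = 340   # Xeon 5500 series, idle I/O read
NEW_LATENCY_NS = 230        # Xeon E5-2600 product family


def percent_gain(old, new):
    """Relative increase of `new` over `old`, in percent."""
    return (new - old) / old * 100


def percent_reduction(old, new):
    """Relative decrease from `old` to `new`, in percent."""
    return (old - new) / old * 100


perf_gain = percent_gain(BASELINE_SPECFP, NEW_SPECFP)
latency_cut = percent_reduction(BASELINE_LATENCY_NS, NEW_LATENCY_NS)

print(f"Performance gain: {perf_gain:.1f}%")
print(f"Latency reduction: {latency_cut:.1f}%")
```

Both results land close to the rounded claims in the notes ("80% more performance", "~30%" lower latency).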
  • Field note: There is a link to a proof point on this slide. Intel IT has a whitepaper on the performance benefits of 10GbE on Apache Hadoop. This whitepaper is at our Intel IT Resource Center, which is useful in many ways for your customer. We would recommend pointing the customer to this site for answers to a variety of questions and configurations.
    Up to 20x performance boost over legacy infrastructure with optimizations on Intel® Xeon processors, SSD storage, and 10GbE networking.
    10 Gigabit Ethernet (GbE) networks allow you to quickly import large data sets for processing in multiple locations.
    Network: 10 Gigabit Ethernet (10GbE) networking demonstrates its value in the form of high levels of network utilization in the Hadoop cluster. The full use of greater bandwidth can reduce time to ingest and to export data by 80 percent. Moreover, the cost per gigabit of bandwidth with 10GbE is now much lower than with 1GbE, making it a natural choice for big data.
    Much of the performance gain from the underlying hardware requires deep optimization in the software as well as careful tuning of Hadoop configuration parameters. The Intel Distribution is optimized with the latest Intel® processor, storage, and networking hardware components to ensure that the platform delivers balanced performance for the widest range of use cases.
    The Need for a Balanced System: Hadoop is designed and optimized for commonly available hardware. The pace of server innovation has continued unabated for many years, and mainstream systems now deliver massive processing power. To keep pace with that capability, it is vital to deploy Hadoop in the environment it was designed for, one that is balanced between compute, storage, and networking.
    Hadoop* is increasingly popular for processing big data. Dramatic improvements in mainstream compute and storage resources help make Hadoop clusters viable for most organizations. But to provide a balanced system, those building blocks must be complemented by 10 Gigabit Ethernet (10GbE), rather than legacy Gigabit Ethernet (GbE) networking. This study found success by building on a 10GBASE-T foundation that combines Arista switches, Intel® Ethernet 10 Gigabit Converged Network Adapters, and Intel® Xeon® processor based servers. In the area of networking for this balanced system, the performance of Gigabit Ethernet (GbE) implementations for Hadoop has been a major limiting factor to overall performance. Using the large block size means that, for example, when a packet is dropped and retransmitted, the system needs to handle a large piece of data, which strains network bandwidth in a GbE environment. 10 Gigabit Ethernet (10GbE) networking proves its value in Hadoop clusters through high observed levels of network utilization, demonstrating the benefit of the higher bandwidth.
    4x Increase in Write Performance: Hadoop* PUT operation completed in 80 percent less time using 10 Gigabit Ethernet, compared to Gigabit Ethernet.
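
The notes quote both "80 percent less time" and a "4x increase in write performance" for the same PUT measurement. The sketch below shows one reading under which the two figures agree; it is an interpretation, not something the whitepaper is known to state. The helper name is hypothetical.

```python
def speedup_from_time_reduction(reduction_percent):
    """Throughput multiplier implied by finishing the same work in less time.

    If a job takes (100 - r)% of its original time, it runs
    100 / (100 - r) times as fast.
    """
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction must be in [0, 100)")
    return 100 / (100 - reduction_percent)


speedup = speedup_from_time_reduction(80)   # same work in 20% of the time
increase_over_baseline = speedup - 1        # how much was added on top of 1x

print(f"{speedup:.0f}x the baseline throughput")
print(f"{increase_over_baseline:.0f}x increase over the baseline")
```

Under this reading, 80 percent less time means five times the baseline write throughput, which is a fourfold increase on top of the baseline.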
  • Field note: There is a link to Intel CAS throughput performance data that is in the backup of this presentation.
    Field note: There is a link to a proof point for the performance of Oracle TimesTen using Intel SSDs. This is a useful whitepaper that shows how adding SSDs to a system configuration saves on both hardware acquisition and software license costs, savings that repay the initial investment many times over.
    There are a variety of new opportunities for solid state disk technologies in the enterprise, and this is enhanced by our new Intel CAS software.
    Intel Solid State drives come in a variety of form factors, and have enterprise-class levels of reliability along with capacities that are near those of fast rotating media. They can be used as a direct replacement for rotating media. For high-performance needs in the datacenter, Intel SSDs are a great solution that will likely pay for themselves in a short time. We have a pointer to an example that uses Oracle TimesTen if you’re interested in further information or examples.
    For some applications, adding the Intel Cache Acceleration Software (Intel CAS) solution enables an SSD to act as a local buffer for data on rotating media in the server. This lets you add a minimum of cost and get performance at near-SSD levels for all your data, which is a good hybrid solution for cost-conscious deployments. We can look at the performance data in backup if you’re interested.
  • Key Message: Whatever the solution, Intel is actively working with partners to optimize solutions for analyzing the huge variety of data, providing new insight models, and delivering real-time or near real-time information services.
    Intel is at the core of Big Data across provisioning models and in understanding the right data methods for the right data structure. In the last 24 months there has been more innovation in the DB product market than at any time in the last 10 years. While the locality and distribution of compute, storage and I/O platforms may vary, Intel has been actively working to optimize its technology portfolio within relational databases, emerging technologies and the commercially available analytical engines.
  • While Intel has started doing work in the area of Big Data with a distribution of Apache Hadoop, you should not assume that this will be the only thing we plan to do. It’s useful to look at what we’re doing and understand the type of capability we can bring to your company with our optimized tools.
    We are currently focusing our IDH efforts on adding key functionality that we can uniquely provide. For instance, we have added AES-NI support to the distribution, which makes encryption of the data set up to 20x faster. In other words, you have the capability to encrypt your data “for free” in terms of performance, making your data secure without penalty.
    We are also using our Intel CAS software to optimize data acquisition, and we are adding a variety of other features. Many of these features will be checked back into the Apache open source, providing benefit. If you have interest in understanding our Hadoop roadmap, we would be happy to set up a more detailed meeting with our team to give you details.
    Note to field: There is an additional slide in the backup for the Intel Lustre file system distribution, as another example of where Intel is contributing to Big Data, specifically in the area of open-source file systems for better performance.
    Intel Tools for Apache Hadoop: getting under the hood of Hadoop for tuning and insight.
    HiTune: Monitors key performance metrics on each server in the cluster, then aggregates and correlates these low-level indicators with high-level data flow models, providing insight into performance bottlenecks, hardware problems, application hot spots and more.
    HiBench: Measures, validates and compares performance of Hadoop clusters across a variety of workloads. Cluster performance can be measured for specific, common tasks such as sorting, word counting, web searching and data analytics.
    Distributed Hadoop environments can be challenging to fine-tune because of the way the framework handles data partitioning, load balancing, fault tolerance, and other low-level operations that Hadoop structures automatically. Intel recently introduced two open-source tools, HiBench and HiTune, to help optimize Hadoop clusters for faster analytics.
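
Word counting is named above as one of the common tasks used to measure a cluster. As a rough illustration of what that workload computes, here is a minimal single-process sketch of the map, shuffle and reduce steps. It does not use Hadoop or HiBench, and all function names are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "Big value"])
print(counts)
```

In a real cluster the map and reduce steps run on different nodes and the shuffle moves data across the network, which is why such a simple job is still a useful probe of network and storage balance.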
  • Many (most) applications are single threaded, single address space:
    1) Many (most) applications are single threaded.
    2) Many (most) applications are written for a single address space.
    NGS-size data quickly pushes 1) and 2) beyond the capacity of a single node: we need multiple threads and a large memory footprint.
    Some algorithms (SW as an example) scale quadratically with the size of the problem, motivating algorithmic substitution or hardware acceleration.
    Cloud: Building in house means capital equipment investment, DC operating costs, and fixed capacity for growing workloads. Building in the cloud offers elastic hourly capacity expansion, but brings challenges around management, ease of use, and data movement. How best to leverage cloud resources in the HPC business process?
    As a service – Working subsets are growing too large to fit into available memory. Mapping/aligning with BW and assembly with De Bruijn are good examples, motivating algorithmic innovations and novel approaches to large-memory computers.
    The amount of data barely fits into currently available disk space (and soon might not).
    Databases are distributed and will likely stay that way, motivating much talk of “bringing the computing to the data”, of preprocessing for downstream upload, etc.
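
To make the quadratic-scaling remark concrete, here is a minimal sketch of local alignment scoring, assuming "SW" in the notes refers to the Smith-Waterman algorithm. The scoring values (match 2, mismatch -1, linear gap -1) are arbitrary choices for illustration, not parameters taken from the talk.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b.

    Fills a (len(a) + 1) x (len(b) + 1) dynamic programming table one row
    at a time, so the work grows with len(a) * len(b).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                          # start a new local alignment
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # gap in b
                curr[j - 1] + gap,           # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


identical = smith_waterman_score("ACGT", "ACGT")
unrelated = smith_waterman_score("AAAA", "TTTT")
print(identical, unrelated)
```

Doubling the length of both sequences quadruples the number of table cells, which is the cost that motivates either a different algorithm or hardware acceleration.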
  • Cisco* UCS Server (1): Intel® Xeon® 5600.
    Dell* PowerEdge* C Series (2): Intel Xeon 5500/5600. The Dell | Cloudera* solution for Apache* Hadoop combines Dell servers and networking components with Cloudera’s Distribution Including Apache Hadoop (CDH), as well as management tools, training, technology support and professional services, to give customers a single source to deploy, manage, and scale a comprehensive Apache Hadoop-based stack.
    Oracle* Sun Fire* server (3): Intel Xeon E7-4800. The Oracle Exalytics* In-Memory Machine features the Oracle BI Foundation Suite and Oracle TimesTen In-Memory Database for Exalytics, enhanced for an Oracle server designed for in-memory analytics. It contains 1 Terabyte of RAM, 40 Gb/s InfiniBand and 10 Gb/s Ethernet connectivity, and Integrated Lights Out Management.
  • IMS Demo Unit provided to BioTeam, configured with:
    3 blades, each with dual 5650 CPUs, 24GB of RAM and 4 GbE NICs
    Dual Ethernet switches
    7 x 600GB Intel 320 Series SSD drives
    Turnkey solution: MiniLIMS + Local Analysis Engine
    Plan is to link to cloud resources: automatic backup and link to hosted MiniLIMS
    Will ship with Ion Torrent initially
    Solution for any lab needing LIMS
  • Cost to sequence the full genome to soon reach $1000: http://www.youtube.com/watch?v=F27BvqqNcY4

Jax 2013 - Big Data and Personalised Medicine Presentation Transcript

  • Big Data in Genomics and Personalized Medicine – Challenges and Solutions. Gaurav Kaul, Software Architect, Intel. JAX London 2013
  • Agenda: Global Healthcare Trends; The Rise of Personalized Medicine; Big Data Scenarios in Healthcare; Methods to Manage Big Data; Use Cases; Summary and Next Steps. *Other names and brands may be claimed as the property of others
  • We are at an Inflection Point in Healthcare - TRENDS
    Global AGING: Average age 60+ growing from 10% to 21% by 2050. [Map: % of population over age 60, in bands 0-9%, 10-19%, 20-24%, 25-29%, 30+%; 2050 WW average age 60+: 21%.] Source: United Nations “Population Aging 2002”
    Healthcare costs are RISING: Significant % of GDP
    US Healthcare BIG DATA Value: $300 Billion in value/year, ~0.7% annual productivity growth
    Source: McKinsey Global Institute Analysis; ESG Research Report 2011 – North American Health Care Provider Market Size and Forecast
  • We are at an Inflection Point in Healthcare - TRENDS
    Storage Growth: [Chart: Total Data Healthcare Providers (PB), 0 to 15000, for 2010 to 2015; series labels as extracted: Admin Imaging, Medical Imaging Archive, EMR, Email, File, Non Clin Img, Research. Projection case from just 1 healthcare system.]
    Data Explosion: projected to reach 35 Zettabytes by 2020, with a 44-fold increase from 2009 (5)
    Source: McKinsey Global Institute Analysis; ESG Research Report 2011 – North American Health Care Provider Market Size and Forecast
  • Sequencing Cost Trend
  • Vision for Personalized Medicine
  • How can we take Personalized Medicine mainstream by 2020?
  • A “bioinformatics computing system” includes technologies from this entire “stack”: Software Frameworks; Applications; Programming Model (abstraction); Virtualization; System Software and Resource Management; Computer Hardware, Storage and Networks
  • A “bioinformatics computing system” includes technologies from this entire “stack”: Software Frameworks; Applications; Programming Model (abstraction); Virtualization; System Software and Resource Management; Computer Hardware, Storage and Networks
    Multiple Cores: shared memory, multiple threads, OpenMP
    Multiple Nodes: MPI; GAS, PGAS; Hadoop
    galaxy.psu.edu
    Searching for SNPs with cloud computing, Langmead, Schatz et al.
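
As a small illustration of the "multiple cores, shared memory, multiple threads" layer of the stack, the sketch below fans a per-read computation out over a thread pool. It uses Python's standard `concurrent.futures` rather than OpenMP, and the GC-fraction task and function names are chosen only for this example. In standard CPython builds the global interpreter lock limits CPU-bound threads, so this shows the shape of the programming model more than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_fraction(read):
    """Fraction of G and C bases in one sequencing read."""
    if not read:
        return 0.0
    return sum(base in "GC" for base in read.upper()) / len(read)


def gc_fractions_parallel(reads, workers=4):
    """Apply gc_fraction to every read using a pool of worker threads.

    All threads share the same `reads` list in memory; results come back
    in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gc_fraction, reads))


fractions = gc_fractions_parallel(["GGCC", "ATAT", "ACGT"])
print(fractions)
```

The "multiple nodes" layer (MPI, Hadoop) follows the same split-and-combine idea, except that the data has to be partitioned and moved between machines instead of being shared in one address space.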
  • The Crossbow Pipeline
  • Big Data – A Foundation For Delivering Big Value: Big Data Building Blocks
    Compute: Intel® Xeon® Product Family E3-E5-E7; Intel® Atom™; Xeon Phi™
    Network: Intel® Ethernet Controllers; Ethernet Adapters; Intel® Ethernet Switch Silicon; Intel® True Scale Fabric
    Storage: Intelligent Storage (1); Scale-out Storage (1); Scale-up Storage (1); Intel® SSD 710 series, DC S3700 (SATA); Intel® SSD 910 series (PCIe)
    Software & Technologies: Intel® Distribution for Apache Hadoop; Intel® Node Manager; Intel® Data Center Manager; Intel® Expressway Service Gateway; Intel® Cache Acceleration Software; Intel’s Lustre; Intel® VT and Intel® TXT; Intel® AES-NI
    Attributes: Energy Efficient; Responsive; Choice; High Availability; Secure
    Intel’s Foundational Technologies Offer Advanced Solutions for Big Data Analytics
    (1) Xeon-based storage systems are available in a wide range of configuration options from the industry’s leading storage vendors
  • Big Data Compute Platform Optimizations
    Intel® Xeon® E5 Family: up to 8 cores; up to 20 MB cache; up to 4 channels DDR3 1600 MHz memory; integrated PCI Express* 3.0, up to 40 lanes per socket. SCALE-OUT with Hadoop and analytic/DW engines. Proof points: E5 Analytics 25X Improvement; Hadoop on E5
    Intel® Xeon® E7 Family (Xeon E7-4800): up to 10 cores; up to 30 MB cache; up to 8 channels DDR3 1066 MHz memory; 4 QPI 1.0 lanes for robust scalability. SCALE-UP in-memory analytic engines and databases: Oracle*, SAS*, SAP Hana*. Proof point: SAP HANA
  • Big Data – A Foundation For Delivering Big Value: Intel® Ethernet Reduces Time to Process Large Data Sets
    Trends and Challenges: Big data is hitting the enterprise with unprecedented volume, velocity, variety, complexity, and OPPORTUNITY
    Intel® Ethernet Solution: Up to 20x performance boost over legacy infrastructure with optimizations on Intel® Xeon® processors, Intel® SSD storage, and 10Gb Intel® Ethernet networking. 10 Gigabit Ethernet allows quicker import and export of large data sets for processing
    Moving the Data with 10GbE: [Diagram: virtualized hosts with 10 ports of 1GbE network connections vs. 2 ports of 10GbE.] Up to 80% reduction in cables & switch ports; up to 15% reduction in infrastructure costs; up to 2x improved bandwidth per server
    1 http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/10gbe-10gbase-t-hadoop-clusters-paper.pdf
  • Big Data – A Foundation For Delivering Big Value: Intel® CAS with Intel® SSD Solution
    Added as cache layer, accelerates Big Data workloads: 50X IOPS; 3X TPC-C; 20X TPC-H
    Performance near equal to replacing all hard drives with SSDs, at significantly lower cost
    Link: throughput performance
    http://www.intel.com/content/www/us/en/mission-critical/mission-critical-scalability-oracle-intel-brief.html
  • Big Data – A Foundation For Delivering Big Value: Data Methods for the Right Data Structure
    Unstructured Data: Emerging Technologies; Analytical Paradigms; MapReduce/Hive
    Structured Data: Relational Database; EXALYTICS
  • Big Data – A Foundation For Delivering Big Value: Intel® Distribution for Apache Hadoop* & Tools
    Intel® Distribution features: File-based encryption in HDFS; up to 20x faster decryption with AES-NI*; role-based access control for Hadoop services; up to 8.5X faster Hive queries using HBase co-processor; optimized for SSD with Cache Acceleration Software; adaptive replication in HDFS and HBase; integrated text search with Lucene; simplified deployment & comprehensive monitoring; deployment of HBase across multiple datacenters; automated configuration with Intel® Active Tuner; detailed profiling of Hadoop jobs; simplified design of HBase schemas (+ in 2.4); REST APIs for deployment and management (+ in 2.4)
    HiTune (URL): [Diagram labels: MapReduce; Instrument; Aggregation Engine; Report Engine; HiTune Controller]
    HiBench (URL): 1) Micro Benchmarks: Sort, WordCount, TeraSort; 2) Web Search: Nutch Indexing, Page Rank; 3) Machine Learning: Bayesian Classification, K-Means Clustering; 4) HDFS: Enhanced DFSIO
    Result = many Hadoop optimization tips (IDF2012 presentation “Big Data Analytics on a Performance-optimized Hadoop Infrastructure”)
  • Life Sciences 2013: Key Industry Challenges and Solutions
    Challenges: Many (most) applications are single threaded, single address space. Some algorithms scale quadratically with the size of the problem. Large data sets exceed available memory and storage. International collaboration is an imperative; bioinformatics expertise is scarce. Databases are distributed, data is siloed and will likely stay that way.
    Solutions: Intel is delivering optimizations working with the open source community, and developing an NGS+HPC curriculum. Innovations in acceleration, compute, storage, networking, security, and *-as-a-service. Intel is working closely with the ecosystem to address enterprise-to-cloud transmission of terabyte payloads.
    Need for Balanced Compute Infrastructure
  • Examples of Intel®-powered Servers in Big Data and Analytics
    Cisco* UCS Server (1), Intel® Xeon® 5600: Cisco UCS server with EMC Greenplum MR software, an “enterprise-class” Hadoop* distribution that features technology from MapR
    Dell* PowerEdge* C Series (2), Intel Xeon 5500/5600: The Dell | Cloudera* solution for Apache* Hadoop, sold pre-configured
    Oracle* Sun Fire* server (3), Intel Xeon E7-4800: Oracle Exalytics* In-Memory Machine, features the Oracle BI Foundation Suite and Oracle TimesTen In-Memory Database for Exalytics
    1 http://gigaom.com/cloud/ciscos-servers-now-tuned-for-hadoop/
    2 http://www.businesswire.com/news/home/20110804005376/en/Dell-Cloudera-Collaborate-Enable-Large-Scale-Data
    3 http://www.itp.net/mobile/588145-oracle-unveils-exalytics-in-memory-machine
  • Solution 4.0 – NGS Appliances
    Configurations (as extracted): 16 Cores, 96 GB RAM, 18T Red. Storage, SSD for OS; 32 Cores, 1.2 TFlops, 18-56TB RAID
    [Rack diagram labels: NSS-HA Pair; NSS User Data; HSS Metadata Pair; HSS OSS Pair; HSS User Data; 2U Plenum. Actual placement in racks may vary.]
    Scale through independent solutions, each targeting a different segment & usage model
  • NGS Appliance: Dell Scalable Unit “SANGER”
    Challenge: Experiment processing takes 7 days with current infrastructure. Delays treatment for sick patients
    Solution: Dell Next Generation Sequencing Appliance: 9 Teraflops of Sandy Bridge processors; Lustre file storage; Intel SW tools and engineers
    Benefits: RNA-Seq processing reduced to 4 hours
    Infrastructure: Dell PE, PC & F10. Dell NSS (NFS) (up to 180TB): NSS-HA Pair, NSS User Data. Dell HSS (Lustre) (up to 360TB): HSS Metadata Pair, HSS OSS Pair, HSS User Data. M420 (Compute) (up to 32 nodes). 2U Plenum
    Single Rack Solution. Actual placement in racks may vary.
    Includes everything you need for NGS: compute, storage, software, networking, infrastructure, installation, deployment, training, service & support
  • Use Case: NEXTBIO – Analytics for Genomics Data
    Cost to sequence a genome has fallen by 800x in the last 4 years
    Each genome has ~4 million variants
    Growth in genomics data in the public and private domain
    Data available in a variety of sources: structured, semi-structured, unstructured
    New aggregated data growing exponentially
    [Pipeline diagram: Sequencing (3 billion base pairs); Data Processing; Cloud Storage; Visualization (millions of variants); Interpretation & Analytics (millions of variants, millions of patients); Commercializing: Targeted Therapeutics, Companion Diagnostics, Actionable Biomarkers]
  • Data-Intensive Discovery: Genomics
    Value: Enable researchers to discover biomarkers and drug targets by correlating genomic data sets. 90% gain in throughput; 6X data compression
    Analytics: Provide curated data sets with pre-computed analysis (classification, correlation, biomarkers). Provide APIs for applications to combine and analyze public and private data sets
    Data Management: Use Hive and Hadoop for query and search. Dynamically partition and scale HBase
    Configuration: 10-node cluster / Intel Xeon E5 processors; 10GbE network; Intel Distribution
  • Use Case: NEXTBIO – Nextbio & Intel Collaboration
    Technical Challenge: 1 genome → 10 million rows; 100 genomes → 1 billion rows; 1M genomes → 10 trillion rows; 100M genomes → 1 quadrillion (1,000,000,000,000,000) rows
    Immutable data: write once, read many times, never change. Traditional Bloom filters work. Hadoop & HBase well suited. App can dynamically partition HBase as data size grows
    Intel Optimizations for Hadoop: Optimized Hadoop stack in open source. Stabilize HBase to provide reliable, scalable deployment
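
The slide mentions Bloom filters in the context of write-once, read-many data. A Bloom filter suits that pattern because a plain one supports adding keys but not removing them. The sketch below is a generic textbook version; it is not the HBase implementation, and the class name, the sizes and the key format are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


# Hypothetical row keys of the form genome:chromosome:position.
row_filter = BloomFilter()
stored_keys = ["genome1:chr1:12345", "genome1:chr2:67890"]
for key in stored_keys:
    row_filter.add(key)

print(all(row_filter.might_contain(key) for key in stored_keys))
```

A store can consult such a filter before touching disk: a "definitely absent" answer skips the read entirely, which matters when the table grows toward the row counts quoted on the slide.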
  • Putting it together: Software Frameworks; Applications; Programming Model (abstraction); Virtualization; System Software and Resource Management; Computer Hardware, Storage and Networks
  • Summary
    Enabling an ecosystem of partners to innovate and make the Personalized Medicine vision a reality
    Delivering hardware-enhanced capabilities and software to deploy Personalized Medicine
    Working with Big Data vendors to onboard an increasing number of life science workloads to Hadoop and other analytics technologies
  • Q&A GAURAV.KAUL@INTEL.COM