IDC Perspectives on Big Data Outside of HPC


In this video from the IDC Breakfast Briefing at ISC'13, Steve Conway presents: IDC's Perspective on Big Data Outside of HPC.

Watch the presentation video:

Check out more talks from the show at our ISC'13 Video Gallery:

  • Here’s a general definition of Big Data using the now-familiar schema of the “four V’s.” This isn’t specific to high performance data analysis; it applies to Big Data across all markets. To qualify as Big Data in this general context, the data set has to be large in volume, time-critical to analyze, it has to include multiple types of data, and it has to be worthwhile to someone, preferably with a monetary value.
  • The emerging market for high performance data analysis is narrower than that. As I said a minute ago, it’s the market being formed by the convergence of data-intensive simulation and data-intensive analytical methods, so it’s really a union set. As the slide shows, this evolving market is very inclusive in relation to methods, types of data, and market sectors. The common denominator across these segments is the use of models that incorporate algorithmic complexity. You typically don’t find that kind of algorithmic complexity in online transaction processing or in commercial applications such as supply chain management and customer relationship management. The ultimate criterion for HPDA is that it requires HPC resources.
  • There are important HPDA market drivers on both the data ingestion side and the data output side. Data sources have become much more powerful: CERN’s Large Hadron Collider generates 1PB/second when it’s running, and the Square Kilometer Array telescope will produce 1EB/day when it becomes operational in 2016. But those are extreme examples. Much more common are sensor networks for power grids and other infrastructure, gene sequencers, MRI machines, and so on. Online sales transactions produce a lot of data and a lot of opportunity for fraud, and standards, regulations and lawsuits are on the rise. Boeing stores all its engineering data for the 30-year lifetime of its commercial airplanes, not just as a reference for designing future planes but in case there’s a crash and a lawsuit. On the output side, more powerful HPC systems are kicking out far more data in response to the growing user requirements you see listed here.
  • Moving data costs time and money. Energy has become very expensive: it can take 100 times more energy to move the results of a calculation than to perform the calculation in the first place. It’s no wonder that oil and gas companies, for example, still rely heavily on courier services for overnight shipping of disk drives; it would take too long and cost too much to send the data over a computer network. If you’re a vendor, you have two main strategies available: you can accelerate data movement, mainly through better interconnects; you can minimize data movement by pre-filtering data or bringing the compute to the data; or you can do both.
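The energy arithmetic above can be checked with a toy model. It uses the talk's illustrative figures of roughly 1 picojoule to perform a calculation and up to 100 picojoules to move its result; the cost function is a sketch of that claim, not a measured model of any real system.

```python
# Illustrative figures from the talk (not measured values):
COMPUTE_PJ = 1.0   # ~1 picojoule per calculation
MOVE_PJ = 100.0    # up to ~100 picojoules to move one result

def energy_picojoules(n_ops: int, fraction_moved: float) -> float:
    """Total energy for n_ops calculations when a given fraction
    of results must be moved across the system (toy model)."""
    return n_ops * (COMPUTE_PJ + fraction_moved * MOVE_PJ)

# Moving every result costs ~101x the compute-only energy,
# which is why pre-filtering and compute-near-data pay off.
ratio = energy_picojoules(1_000_000, 1.0) / energy_picojoules(1_000_000, 0.0)
print(ratio)  # → 101.0
```

This is why the two vendor strategies are complementary: faster interconnects lower the per-move cost, while pre-filtering lowers `fraction_moved`.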
  • The data in most HPDA jobs assigned to HPC resources will continue to have regular access patterns, whether the data is structured or unstructured. This means it can be partitioned and mapped onto a standard cluster or other distributed-memory machine for running Hadoop or other software. But there’s a rising tide of data work that exhibits irregular access patterns and can’t take advantage of data locality features; caches are highly inefficient for jobs like this. These jobs benefit from global memory combined with powerful interconnects and other data movement capabilities. Partitionable jobs are very important now, and non-partitionable jobs are becoming more important. By the way, SGI systems address both types. One general remark: as the data analysis side of HPC expands, HPC architectures will need to become less compute-centric and offer more support for data integration and analysis. “Many current approaches to big data have been about ‘search’ – the ability to efficiently find something that you know is there in your data,” said Arvind Parthasarathi, President of YarcData. “uRiKA was purposely built to solve the problem of ‘discovery’ in big data – to discover things, relationships or patterns that you don’t know exist. By giving organizations the ability to do much faster hypothesis validation at scale and in real time, we are enabling the solution of business problems that were previously difficult or impossible – whether it be discovering the ideal patient treatment, investigating fraud, detecting threats, finding new trading algorithms or identifying counter-party risk. Basically, we are systematizing serendipity.”
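The search-versus-discovery distinction can be sketched with a toy graph. A discovery query walks links between entities whose memory locations are unpredictable, which is exactly the irregular access pattern described above. The node names and the tiny edge set here are purely illustrative, not any vendor's data model.

```python
from collections import deque

# Toy linkage graph: accounts connected through shared cards and
# addresses. Each hop lands at an unpredictable memory location --
# the access pattern that caches and data partitioning handle poorly.
edges = {
    "acct_1": ["card_X"],
    "card_X": ["acct_2"],
    "acct_2": ["addr_Y"],
    "addr_Y": ["acct_3"],
}

def discover(start: str) -> set:
    """Breadth-first traversal: find every entity transitively
    linked to `start`, i.e. relationships you didn't know existed."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}

print(sorted(discover("acct_1")))  # → ['acct_2', 'acct_3', 'addr_Y', 'card_X']
```

A keyword search would only have found `acct_1` itself; the traversal surfaces the two other accounts linked to it through a card and an address.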
  • HPC servers are often used for more than one purpose. IDC classifies HPC servers according to the primary purpose they’re used for. So, an HPDA server is one that’s used more than 50% for HPDA work. As this table shows, IDC forecasts that revenue for HPC servers acquired primarily for HPDA use will grow robustly (10.4% CAGR) to approach $1 billion in 2015. Because HPDA revenue starts as such a relatively small chunk of overall HPC server revenue, the HPDA share of the overall HPC server revenue will still be in the single digits in 2015, despite the fast growth rate.
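The forecast above can be sanity-checked with the compound-growth formula behind a CAGR figure. The 10.4% rate and the "approaching $1 billion in 2015" endpoint are from the talk; the roughly $0.6B starting revenue and five-year horizon are illustrative assumptions chosen to make the arithmetic land near that endpoint.

```python
def project(revenue_now: float, cagr: float, years: int) -> float:
    """Compound annual growth: revenue * (1 + CAGR) ** years."""
    return revenue_now * (1.0 + cagr) ** years

# Hypothetical base of ~$0.6B growing at the forecast 10.4% CAGR
# lands just under $1B after five years (figures in billions).
print(round(project(0.6, 0.104, 5), 2))  # → 0.98
```

The same function shows why the HPDA share stays in single digits: the much larger overall HPC server base compounds from a far bigger starting value.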
  • Let’s look at some real-world use cases
  • This slide lists some of the most prominent use cases, meaning ones where repeated sales of HPC products have been happening. Fraud detection and life sciences are emerging fastest. By the way, I didn’t include financial services here because we’ve been tracking back-office FSI analytics as part of the HPC market for more than 20 years. But FSI is an important part of the high performance data analysis market, though not an easy one to penetrate for the first time.
  • I want to zero in more on the PayPal example because they gave me permission to use these slides and because in many ways they are representative of a larger group of commercial companies whose business requirements are pushing them up into HPC. The slides are from a talk PayPal gave at IDC’s September 2012 HPC User Forum meeting in Dearborn, Michigan. By the way, if you want a copy of this talk or any of the long list of talks on one of our first slides, just email me at sconway [at]
  • PayPal is an eBay subsidiary and, among other things, has responsibility for detecting fraud across eBay and Skype. Five years ago, a day's worth of data was processed in an overnight batch, and fraud wasn't detected until as much as two weeks later. They realized they needed to detect fraud in real time, and for that they needed graph analysis. They were most interested in checking for collusion between multiple parties, such as when a credit card shows activity from four or more users. They needed to be able to stop that before the credit card got hit. IBM Watson on the Jeopardy game show was amazing, but it was a needle-in-a-haystack problem, meaning that Watson could only find answers that were already in its database. PayPal’s problem was different, because there was no visible needle to be found. Graph analysis let them uncover hidden relationships and behavior patterns.
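The specific collusion signal mentioned above, a credit card showing activity from four or more users, reduces to a simple aggregation once the events are in hand; the hard part at PayPal's scale is doing it in real time over enormous streams. The event tuples and field names below are made up for illustration, not PayPal's actual schema.

```python
from collections import defaultdict

# Hypothetical (card, account) activity events.
events = [
    ("card_A", "user_1"), ("card_A", "user_2"),
    ("card_A", "user_3"), ("card_A", "user_4"),
    ("card_B", "user_1"), ("card_B", "user_5"),
]

def flag_shared_cards(events, threshold=4):
    """Return cards used by `threshold` or more distinct accounts --
    the multi-party collusion signal described in the talk."""
    users_per_card = defaultdict(set)
    for card, user in events:
        users_per_card[card].add(user)
    return [card for card, users in users_per_card.items()
            if len(users) >= threshold]

print(flag_shared_cards(events))  # → ['card_A']
```

In production this check has to run per incoming transaction, before authorization, which is what pushes the problem onto HPC-class hardware.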
  • This gives you an idea of PayPal’s data volumes and HPDA requirements. These are going up all the time.
  • Here’s what PayPal is using. For the serious fraud detection and analysis, they’re using SGI servers and storage on an InfiniBand network. For the less-challenging work that doesn’t involve pattern discovery and real-time requirements, they’re running Hadoop on a cluster. By the way, PayPal says HPC has already saved them $710 million in fraud they wouldn’t have been able to detect before.
  • For cost and growth reasons, GEICO moved to automated insurance quotes on the phone. They needed to provide quotes instantaneously, in 100 milliseconds or less. They couldn’t do these calculations nearly fast enough on the fly. GEICO’s solution was to install an HPC system and, every weekend, run updated quotes for every adult and every household in the United States. That takes 60 wall-clock hours today. The phones tap into the stored quotes and return the correct one in 100 milliseconds.
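The precompute-then-look-up pattern GEICO uses can be sketched in a few lines. Here `expensive_quote` is a hypothetical stand-in for the actuarial model that is too slow to run per call, and `weekend_batch` plays the role of the 60-hour HPC run; the phone system then answers with a constant-time dictionary lookup.

```python
def expensive_quote(person_id: str) -> float:
    """Placeholder for the real pricing model (hypothetical formula)."""
    return 100.0 + 5.0 * len(person_id)

def weekend_batch(people):
    """Pre-calculate a quote for every known person up front."""
    return {pid: expensive_quote(pid) for pid in people}

# Weekend: price everyone (in GEICO's case, every U.S. adult/household).
quotes = weekend_batch(["alice", "bob", "carol"])

# Call time: a hash lookup, comfortably inside the 100 ms budget.
print(quotes["bob"])  # → 115.0
```

The design trades enormous batch compute for trivially cheap reads, which is exactly the right trade when the query population is known in advance.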
  • Here’s a real-world example of one of the biggest names in global package delivery. Their problem is not so different from PayPal’s. This courier service is doing real-time fraud detection on huge volumes of packages that come into their sorting facility from many locations and leave the facility for many other locations around the world. They ran a difficult benchmark. The winner hasn’t been publicly announced yet, but IDC’s back channels tell us the vendor has a 3-letter name that starts with S.
  • Schrödinger is a global life sciences software company with offices in Munich and Mannheim. One of the major things they do is use molecular dynamics to identify promising candidates for new drugs to combat cancer and other diseases – and it seems they’ve been using the cloud for this High Performance Data Analysis problem. That’s not so surprising, since molecular dynamics codes are often highly parallel.
  • Here’s the architecture they used. Note that they were already using HPC in their on-premises data center, but the resources weren’t big enough for this task. That’s why they burst out to Amazon EC2, using a software management layer from Cycle Computing to access more than 50,000 additional cores. Bringing a new drug to market can cost as much as £10 billion and a decade of time, so security is a major concern with commercial drug discovery. Apparently, Schrödinger felt confident about the cloud security measures.
  • You may have seen the recent news that Optum, which is part of United Health Group, is teaming with the Mayo Clinic to build a huge center in Cambridge, Massachusetts to lay the research groundwork for outcomes-based medicine. They’ll have more than 100 million patient records at their disposal for this enormous data-intensive work. They’ll be using data-intensive methods to look at other aspects of health care, too. A week ago, United Health issued a press release in which they said they believe that improved efficiencies alone could reduce Medicare costs by about 40%, obviating much of the need for the major reforms the political parties have been fighting about.
  • In the U.S., the largest urban gangs are the Crips and the Bloods. They’re rival gangs that are at each other’s throats all the time, fighting for money and power. Both gangs are national in scope, but the national organizations aren’t that strong. The branches of these gangs in each city have a lot of autonomy to do what they want. What you see here, again in blurred form, was something that astounded the police department of Atlanta, Georgia, a city with about 4 million inhabitants. Through real-time monitoring of social networks, they were able to witness, as it happened, the planned merger of these rival gangs in their city. This information allowed the police to adapt their own plans accordingly.
  • In summary, we defined HPDA and told you that IDC is forecasting rapid growth from a small base. HPDA is about the convergence of data-intensive HPC and high-end commercial analytics. One of the most interesting aspects of this, to us, is that the demands of the commercial market are moving this along faster in the commercial sector than in the traditional HPC market. PayPal is a great example of this (story of how PayPal was shy about presenting at User Forum – both sides should be learning from each other). On the analytics side, some attractive use cases are already out there. In the time allotted to us here, we described some of the more prominent ones, but there are many others. Most of the work will be done on clusters, but some economically important use cases need more capable architectures, especially for graph analytics. Many of the large investment firms are IDC clients, so our growth estimates tend to err on the side of conservatism. There is potential for the HPDA market to grow faster than our current forecast. But we talk with a lot of people and we update the forecasts often, so we don’t get too far off the mark.
  • This is a partial list of the user and vendor talks on this topic that we’ve lined up in the past two years as part of the HPC User Forum. IDC has operated the HPC User Forum since 1999 for a volunteer steering committee made up of senior HPC people from government, industry and academia – organizations like Boeing, GM, Ford, NSF and others. We hold meetings across the world, and the talks listed here include perspectives on High Performance Data Analysis from the Americas, Europe and Asia. I’ll ask Chirag to explain how we define High Performance Data Analysis. I’ll return later to walk you through some real-world use cases. Chirag...

    1. IDC’s Perspective On Big Data Outside Of HPC (Jul-13, © 2013 IDC)
    2. Big Data: A General Definition. Value +  Lots of data  Time critical  Multiple types (e.g., numbers, text, video)  Worth something to someone
    3. Defining Big Data: For the Broader IT Market
    4. Top Drivers For Implementing Big Data
    5. Organizational Challenges With Big Data: Government Compared To All Others
    6. Big Data Software
    7. Big Data Software Technology Stack
    8. Big Data Software Shortcomings -- Today
    10. HPDA (High Performance Data Analysis): Data-Intensive Simulation and Analytics
        HPDA = tasks involving sufficient data volumes and algorithmic complexity to require HPC resources/approaches
         Established (simulation) or newer (analytics) methods
         Structured data, unstructured data, or both
         Regular (e.g., Hadoop) or irregular (e.g., graph) patterns
         Government, industry, or academia
         Upward extensions of commercial business problems
         Accumulated results of iterative problem-solving methods (e.g., stochastic modeling, parametric modeling)
    11. HPDA Market Drivers
         More input data (ingestion)
          • More powerful scientific instruments/sensor networks
          • More transactions/higher scrutiny (fraud, terrorism)
         More output data for integration/analysis
          • More powerful computers
          • More realism
          • More iterations in available time
         Real-time, near-real-time requirements
          • Catch fraud before it hits credit cards
          • Catch terrorists before they strike
          • Diagnose patients before they leave the office
          • Provide insurance quotes before callers leave the phone
         The need to pose more intelligent questions
          • Smarter mathematical models and algorithms
    12. Data Movement Is Expensive: In Energy and Time-to-Solution
        Energy Consumption
         1 MW ≈ $1 million
         Computing 1 calculation ≈ 1 picojoule
         Moving 1 calculation = up to 100 picojoules
         => It can take 100 times more energy to move the results of a calculation than to perform the calculation in the first place.
        Strategies
         Accelerate data movement (bandwidth, latency)
         Minimize data movement (e.g., data reduction, in-memory compute, in-storage compute, etc.)
    13. Different Systems for Different Jobs
        Partitionable Big Data Work
         Most jobs are here!
         Goal: search
         Regular access patterns (locality)
         Global memory not important
         Standard clusters + Hadoop, Cassandra, etc.
        Non-Partitionable Work
         Toughest jobs (e.g., graph analysis)
         Goal: discovery
         Irregular access patterns
         Global memory very important
         Systems turbo-charged for data movement + graph analysis
        HPC architectures today are compute-centric (FLOPS vs. IOPS)
    14. IDC HPDA Server Forecast
         Fast growth from a small starting point
         In 2015, conservatively approaching $1B
    16. Some Major Use Cases for HPDA
        • Fraud/error detection across massive databases
           A horizontal use – applicable in many domains
        • National security/crime-fighting
           SIGINT/anomaly detection/anti-hacking
           Anti-terrorism (including evacuation planning)/anti-crime
        • Health care/medical informatics
           Drug design, personalized medicine
           Outcomes-based diagnosis & treatment planning
           Systems biology
        • Customer acquisition/retention
        • Smart electrical grids
        • Design of social network architectures
    17. Use Case: PayPal Fraud Detection / Internet Commerce
        Slides and permission provided by PayPal, an eBay company
    18. The Problem
        Finding suspicious patterns that we don’t even know exist in related data sets.
    19. What Kind of Volume? PayPal’s Data Volumes and HPDA Requirements
    20. Where PayPal Used HPC
    21. The Results
         $710 million saved in fraud that they wouldn’t have been able to detect before (in the first year)
    22. GEICO: Real-Time Insurance Quotes
         Problem: Need accurate automated phone quotes in 100 ms. They couldn’t do these calculations nearly fast enough on the fly.
         Solution: Each weekend, use a new HPC cluster to pre-calculate quotes for every American adult and household (60-hour run time)
    23. Global Courier Service: Fraud/Error Detection
         Check 1 billion-plus packages per hour in central sorting facility
         Benchmark won by an HPC vendor with a turbo-charged interconnect and memory system
    24. Apollo Group/University of Phoenix: Student Recruitment and Retention
        Apollo Group is approaching 300,000 online students. To maintain and grow, they have to target millions of prospective students.
         Must target millions of potential students
         Must track student performance for early identification of potential dropouts – “churn” is very expensive
         Solution: sophisticated, cluster-based Big Data models
    25. They use the cloud for this High Performance Data Analysis problem -- that’s not so surprising, since molecular dynamics codes are often highly parallel.
    26. Architecture
    27. Optum + Mayo Initiative to Move Past Procedures-Based Healthcare
        Optum, part of United Health Group, is teaming with the Mayo Clinic to build a large center ($500K) in Cambridge, Massachusetts to lay the research groundwork for outcomes-based medicine.
         Data: 100M United Health Group claims (20 years) + 5M Mayo Clinic archived patient records. Option for genomic data
         Findings will be published
         Goal: outcomes-based care
    28. [no slide text]
    29. Summary: HPDA Market Opportunity
         HPDA: simulation + newer high-performance analytics
          • IDC predicts fast growth from a small starting point
         HPC and high-end commercial analytics are converging
          • Algorithmic complexity is the common denominator
          • Technologies will evolve greatly
         Economically important use cases are emerging
         No single HPC solution is best for all problems
          • Clusters with MR/Hadoop will handle most but not all work (e.g., graph analysis)
          • New technologies will be required in many areas
         IDC believes our growth estimates could be conservative
    30. HPDA User Talks: HPC User Forums, UK, Germany, France, China and U.S.
        • HPC in Evolutionary Biology, Andrew Meade, University of Reading
        • HPC in Pharmaceutical Research: From Virtual Screening to All-Atom Simulations of Biomolecules, Jan Kriegl, Boehringer-Ingelheim
        • European Exascale Software Initiative, Jean-Yves Berthou, Electricite de France
        • Real-time Rendering in the Automotive Industry, Cornelia Denk, RTT-Munich
        • Data Analysis and Visualization for the DoD HPCMP, Paul Adams, ERDC
        • Why HPCs Hate Biologists, and What We're Doing About It, Titus Brown, Michigan State University
        • Scalable Data Mining and Archiving in the Era of the Square Kilometre Array, the Square Kilometre Array Telescope Project, Chris Mattmann, NASA/JPL
        • Big Data and Analytics in HPC: Leveraging HPC and Enterprise Architectures for Large Scale Inline Transactional Analytics in Fraud Detection at PayPal, Arno Kolster, PayPal, an eBay Company
        • Big Data and Analytics Vendor Panel: How Vendors See Big Data Impacting the Markets and Their Products/Services, Panel Moderator: Chirag Dekate, IDC
        • Data Analysis and Visualization of Very Large Data, David Pugmire, ORNL
        • The Impact of HPC and Data-Centric Computing in Cancer Research, Jack Collins, National Cancer Institute
        • Urban Analytics: Big Cities and Big Data, Paul Muzio, City University of New York
        • Stampede: Intel MIC and Data-Intensive Computing, Jay Boisseau, Texas Advanced Computing Center
        • Big Data Approaches at Convey, John Leidel
        • Cray Technical Perspective on Data-Intensive Computing, Amar Shan
        • Data-Intensive Computing Research at PNNL, John Feo, Pacific Northwest National Laboratory
        • Trends in High Performance Analytics, David Pope, SAS
        • Processing Large Volumes of Experimental Data, Shane Canon, LBNL
        • SGI Technical Perspective on Data-Intensive Computing, Eng Lim Goh, SGI
        • Big Data and PLFS: A Checkpoint File System for Parallel Applications, John Bent, EMC
        • HPC Data-Intensive Computing Technologies, Scott Campbell, Platform/IBM