Analyzing Big Data - Jeff Scheel


Analyzing Big Data.
A presentation given by Jeff Scheel, Chief Engineer for Linux on Power at IBM, at the OPEN14 conference in Belgium.

Published in: Technology, Business

  1. Analyzing Big Data. Jeff Scheel, Chief Engineer, Linux on Power. Open '14, June 2, 2014. © 2014 IBM Corporation
  2. Agenda
     1. Getting started with Big Data
     2. OpenPOWER Foundation
     3. The future of Analytics
  3. Getting started with Big Data
  4. Big Data is growing and moving fast from a variety of sources. Are you keeping up?
     • 1 trillion connected devices generate 2.5 quintillion bytes of data per day
     • 80% of the world’s data today is unstructured
     • 1 in 2 business leaders don’t have access to the data they need
  5. “Data is the new oil.” (Clive Humby) In its raw form, oil has little value. Once processed and refined, it helps power the world.
     • “Big Data has arrived at Seton Health Care Family, fortunately accompanied by an analytics tool that will help deal with the complexity of more than two million patient contacts a year…”
     • “At the World Economic Forum last month in Davos, Switzerland, Big Data was a marquee topic. A report by the forum, ‘Big Data, Big Impact,’ declared data a new class of economic asset, like currency or gold.”
     • “Increasingly, businesses are applying analytics to social media such as Facebook and Twitter, as well as to product review websites, to try to ‘understand where customers are, what makes them tick and what they want’, says Deepak Advani, who heads IBM’s predictive analytics group.”
     • “Companies are being inundated with data, from information on customer-buying habits to supply-chain efficiency. But many managers struggle to make sense of the numbers.”
  6. The challenge: handling the large Volume, Variety, Velocity, and Veracity of data to find new insights and improve business outcomes
     Diagram labels:
     • Analytic Applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics
     • IBM Big Data Platform: Systems Management, Application Development, Visualization & Discovery, Accelerators, Information Integration & Governance, Hadoop System, Stream Computing, Data Warehouse
     Industry examples:
     • Manufacturing: analyze & correlate log records to improve service and predict failures
     • Telco: address customer satisfaction, predict churn, and match promotions in real time
     • Healthcare: detect life-threatening conditions at hospitals in time to intervene
     • Retail: multi-channel customer sentiment and experience analysis
     • Financial Services: make risk decisions based on real-time transactional data
     • Law Enforcement: identify criminals and threats from video and audio feeds
  7. Customers are deploying new infrastructure to leverage all data types
     Diagram labels:
     • Data in Motion, Data at Rest, Data in Many Forms
     • Information Ingestion and Operational Information; Decision Management; BI and Predictive Analytics; Navigation and Discovery; Intelligence Analysis
     • Landing Area, Analytics Zone and Archive: Raw Data, Structured Data, Text Analytics, Data Mining, Entity Analytics, Machine Learning
     • Real-time Analytics: Video/Audio, Network/Sensor, Entity Analytics, Predictive
     • Exploration, Integrated Warehouse, and Mart Zones: Discovery, Deep Reflection, Operational, Predictive
     • Stream Processing, Data Integration, Master Data, Streams
     • Information Governance, Security and Business Continuity
     • Hadoop Infrastructure – currently being deployed on commodity hardware
  8. WATSON: IBM and Red Hat innovating in Healthcare with Watson
     Two new Watson-based products:
     • Interactive Care Insights for Oncology
     • The WellPoint Interactive Care Guide and Interactive Care Reviewer
     Watson's oncology education:
     • 600,000 pieces of medical evidence
     • 2 million pages of text
     • 25,000 training cases
     Watson can review 1.5 million patient records in less time than it takes most office computers to boot up
  9. Big Data implementation patterns
     Diagram: four patterns, each combining Structured and/or Unstructured data, Hadoop and/or a Warehouse, and App / BI, Visualization, and Exploration tools:
     • Common analysis of structured & unstructured data
     • Warehouse and BigInsights partitioning
     • Warehouse batch offload
     • Separate unstructured & structured analysis
  10. What the experts say
      1. Seek project input from Sales, Marketing, and Operations teams
      2. Select projects which are well-defined and have quick ROI (less than a year)
      3. Leverage your experiences from data warehouse and business intelligence projects
      4. Avoid starting with a “Big Bang”
      Source:
  11. More ideas for starting
      Diagram: separate unstructured & structured analysis, with a new Hadoop stack (App / BI, Visualization, Exploration) alongside the existing BI stack (Warehouse, App / BI, Visualization, Exploration)
      • Find a small problem to solve, e.g. an internal phone directory, and start “on-the-side”.
      • Locate relevant data and identify which pieces are “in motion” or “at rest”.
      • For data at rest, build open source Hadoop on your PowerLinux system or try the InfoSphere BigInsights Basic Edition (no charge).
      • For data in motion, use the InfoSphere Streams trial download.
      • Reference the IBM Information Center for details on how to import data into Hadoop and how to write applications using Streams Studio.
      • Explore Datameer to visualize your Hadoop-based Big Data.
  12. PowerLinux jump start services facilitate starting with Big Data Analytics
      5 Day IBM Power Analytics Services Jump Start includes:
      • 5 days, on-site service offering
      • Quick Analytics Assessment Workshop
      • Software installation
      • Hands-on education in getting started
      • Evaluating the analytical approach for your business that will make the biggest impact
      • Quick sample application to consume customer data
      • Reference Architecture Workshop
      Why Jump Start Services for your IBM Power Analytics solution?
      • Learn how to optimally leverage IBM Power System for Analytics
      • Learn the benefits and reasoning of Big Data
      • Learn how to gain business value from the data you have
      2 Day IBM Power Analytics Services Jump Start includes:
      • 2 days, on-site Big Data Analytics service offering
      • Software installation
      • Hands-on education in getting started
      • Evaluating the analytical approach for your business that will make the biggest impact
      IBM Systems Lab Services & Training - Power Systems Services for PowerLinux, AIX, and OS
      Contact – Linda Hoben, Opportunity Manager,
      IBM Power Servers is an ideal platform for streaming data and performing analytic computations for a multitude of applications. Let us help make you successful!
  13. IBM POWER has a strong history in transactional processing workloads
      Chart: tpcC and $/tpcC by system (values paired in the order they appear on the chart)
      System     tpcC       $/tpcC
      S70        1,556      $109.00
      S7A        2,845      $89.00
      S80        5,669      $52.70
      S85        9,200      $43.00
      p690       12,602     $17.80
      p690+      23,871     $8.31
      p690++     32,046     $5.42
      p5-595     50,164     $5.19
      p5-595+    63,021     $2.97
      P6 595     95,081     $2.81
      P7 780     150,000    $0.69
  14. POWER8 Processor
      Caches
      • 512 KB SRAM L2 / core
      • 96 MB eDRAM shared L3
      • Up to 128 MB eDRAM L4 (off-chip)
      Cores
      • 12 cores (SMT8)
      • 8 dispatch, 10 issue, 16 exec pipe
      • 2X internal data flows/queues
      • Enhanced prefetching
      • 64K data cache, 32K instruction cache
      Accelerators
      • Crypto & memory expansion
      • Transactional Memory
      • VMM assist
      • Data Move / VM Mobility
      Energy Management
      • On-chip Power Management Micro-controller
      • Integrated Per-core VRM
      • Critical Path Monitors
      Technology
      • 22nm SOI, eDRAM, 15 ML, 650 mm²
      Memory
      • Up to 230 GB/s sustained bandwidth
      Bus Interfaces
      • Durable open memory attach interface
      • Integrated PCIe Gen3
      • SMP Interconnect
      • CAPI (Coherent Accelerator Processor Interface)
      Press comments:
      • ComputerWorld: To make the chip faster, IBM has turned to a more advanced manufacturing process, increased the clock speed and added more cache memory, but perhaps the biggest change heralded by the Power8 cannot be found in the specifications. After years of restricting Power processors to its servers, IBM is throwing open the gates and will be licensing Power8 to third-party chip and component makers.
      • The Register: the Power8 is so clearly engineered for midrange and enterprise systems for running applications on a giant shared memory space, backed by lots of cores and threads. Power8 does not belong in a smartphone unless you want one the size of a shoebox that weighs 20 pounds. But it most certainly does belong in a badass server, and Power8 is by far one of the most elegant chips that Big Blue has ever created, based on the initial specs.
      • PCWorld: With Power8, IBM has more than doubled the sustained memory bandwidth from the Power7 and Power7+, to 230 GB/s, as well as I/O speed, to 48 GB/s. Put another way, Watson’s ability to look up and respond to information has more than doubled as well.
      • Microprocessor Report: Called Power8, the new chip delivers impressive numbers, doubling the performance of its already powerful predecessor, Power7+. Oracle currently leads in server-processor performance, but IBM’s new chip will crush those records. The Power8 specs are mind boggling.
      Source: Hot Chips presentation
  15. POWER8 delivers 2.5x performance on Big Data / Hadoop
      POWER8 reduces the number of servers by 60% based on the best x86 published Terasort result
      • POWER8 S822L will deliver over 2x the performance of the best published x86 system … and continues to offer far superior RAS
      • POWER8 delivers 1.7X over HP on a per-core normalized benchmark
      • POWER8 exploits additional cores, more threads, larger caches, memory bandwidth
      • Terasort is a popular benchmark to measure the performance of a Hadoop solution
      • Sorts a large dataset (10 TB) in parallel
      • Exercises the MapReduce framework and Hadoop Distributed File System (HDFS)
      Chart: Relative System Performance, POWER8 vs. Cisco (labels: 2.5x, >2x)
      IBM Analytics Stack: IBM Power System S822L; 24 cores / 192 threads, POWER8; 3.0GHz, 512 GB memory, RHEL 6.5, InfoSphere BigInsights 3.0. Compared to a 16-core HP system.
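The slide describes Terasort only as a parallel sort that exercises MapReduce and HDFS. The general shape of such a sort (sample the keys, route them into ordered key ranges, sort each range independently, concatenate) can be illustrated with a minimal single-process sketch. This is not Hadoop's implementation and uses no Hadoop API; the names `sample_split_points`, `partition`, and `terasort_like` are hypothetical, and the per-bucket sort that runs sequentially here would run in parallel across nodes on a real cluster.

```python
import bisect
import random
from typing import List


def sample_split_points(keys: List[int], num_partitions: int,
                        sample_size: int, seed: int = 0) -> List[int]:
    """Pick num_partitions - 1 split points from a random sample of the keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]


def partition(keys: List[int], split_points: List[int]) -> List[List[int]]:
    """'Map' side: route each key to the bucket whose key range contains it."""
    buckets: List[List[int]] = [[] for _ in range(len(split_points) + 1)]
    for key in keys:
        buckets[bisect.bisect_right(split_points, key)].append(key)
    return buckets


def terasort_like(keys: List[int], num_partitions: int = 4,
                  sample_size: int = 100) -> List[int]:
    """Toy range-partitioned sort (illustrative only, not Hadoop's code)."""
    if not keys:
        return []
    split_points = sample_split_points(keys, num_partitions, sample_size)
    buckets = partition(keys, split_points)
    # 'Reduce' side: every bucket is sorted independently of the others.
    sorted_buckets = [sorted(bucket) for bucket in buckets]
    # Buckets cover ordered, non-overlapping key ranges, so concatenating
    # them in bucket order yields a globally sorted result.
    return [key for bucket in sorted_buckets for key in bucket]
```

Because the buckets never overlap, no merge step is needed after the per-bucket sorts, which is what lets the work spread across many servers.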
  16. New IBM Power Systems based on POWER8 (1 & 2 Sockets)
      • Power Systems S812L: 1-socket, 2U, Linux Only
      • Power Systems S822L: 2-socket, 2U, Linux Only
      • Power Systems S822: 2-socket, 2U, All Operating Systems
      • Power Systems S814: 1-socket, 4U, All Operating Systems
      • Power Systems S824: 2-socket, 4U, All Operating Systems
      • Power Systems S824L: 2-socket, 4U, Linux Only, SOD
  17. OpenPOWER Foundation – The emerging ecosystem
  18. Industry trends (© OpenPOWER Foundation 2014)
      • The number of companies designing & building servers is increasing
        – Traditionally there have been few companies designing systems: HP, IBM, SUN, Dell, etc.
        – Today there are many more: Google, Microsoft, Facebook, Rackspace, Huawei, Sugon, Inspur, etc.
        – A fairly mature ecosystem including the Taiwanese ODMs is a key enabler of this trend
      • Numerous disruptive forces are impacting these custom system designs and driving designers to consider new ways of innovating
        – Ability to handle rapid growth in Big Data & Analytics based solutions
        – Choice and Innovation
        – CPU SOC integration drives the need for chip development
      • These trends create a need for a server-targeted “chip-system-software” ecosystem
        – IBM has technology and a software stack ready to meet these needs
        – IBM recognizes the need to work with partners to create this ecosystem
        – IBM recognizes the need for choice and options in processor sourcing
  19. OpenPOWER Foundation Structure: OpenPOWER is an industry foundation based on the POWER architecture, enabling an Open community for development and opportunity for member differentiation and growth
  20. Building collaboration and innovation at all levels
      • Welcoming new members in all areas of the ecosystem
      • 100+ inquiries and numerous active dialogues underway
      • Levels: Boards/Systems; I/O, Storage, Acceleration; Chip/SOC; System/Software/Services
  21. OpenPOWER Proposed Ecosystem Enablement
      A modern development environment is emerging based on tools and services
      Diagram labels:
      • System Operating Environment Software Stack: Cloud Software, Operating System / KVM, Standard Operating Environment (System Mgmt), XCAT, Firmware, Hardware
      • Power Open Source Software Stack Components: Existing Open Source Software Communities, New OSS Community
      • OpenPOWER Technology: OpenPOWER Firmware, POWER8, CAPP, PCIe, CAPI over PCIe
      • Multiple Options to Design with POWER Technology Within OpenPOWER: “Standard POWER Products” – 2014; “Custom POWER SoC” – Future
      • Customizable Framework to Integrate System IP on Chip; Industry IP License Model
  22. Non-IBM POWER8 products
      The Tyan reference (ATX) board, SP010, measures 12” by 9.6”
      • one single-chip module (SCM)
      • four DDR3 memory slots
      • four 6 Gb/sec SATA peripheral connectors
      • two USB 3.0 ports
      • two Gigabit Ethernet network interfaces
      • keyboard and video
      • intended for developers
      The Google reference board
      • two single-chip modules (SCMs)
      • four modified SATA ports
      • Google use only
  23. The future of Analytics
  24. The future of Analytics: An open approach. Open Platform for Choice
  25. Coherent Accelerator Processor Interface (CAPI)
      Diagram: POWER8 (CAPP, Coherence Bus) connected over PCIe Gen 3 (transport for encapsulated messages) to an FPGA or ASIC containing the PSL and the Custom Hardware Application
      Customizable Hardware Application Accelerator
      • Specific system SW, middleware, or user application
      • Written to durable interface provided by PSL
      Processor Service Layer (PSL)
      • Present robust, durable interfaces to applications
      • Offload complexity / content from CAPP
      Virtual Addressing
      • Accelerator can work with same memory addresses that the processors use
      • Pointers de-referenced same as the host application
      • Removes OS & device driver overhead
      Hardware Managed Cache Coherence
      • Enables the accelerator to participate in “Locks” as a normal thread
      • Lowers latency over IO communication model
  26. Coherent Accelerator Processor Interface (CAPI) Overview
      Diagram: POWER8 Processor (CAPP) connected over PCIe to an FPGA containing the IBM Supplied POWER Service Layer, CAPI, and accelerator functions (Function0, Function1, Function2, … Functionn)
      Typical I/O Model Flow:
      1. DD Call
      2. Copy or Pin Source Data
      3. MMIO Notify Accelerator
      4. Acceleration
      5. Poll / Int Completion
      6. Copy or Unpin Result Data
      7. Ret. From DD Completion
      Flow with a Coherent Model:
      1. Shared Mem. Notify Accelerator
      2. Acceleration
      3. Shared Memory Completion
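The contrast between the two flows on this slide can be made concrete with a toy model. The sketch below is purely illustrative: it does not use any real CAPI, device driver, or FPGA interface, the step lists are one reading of the flattened slide text, and `toy_accelerate`, `run_typical_io`, and `run_coherent` are hypothetical names. It shows only the bookkeeping difference, namely that the I/O model stages data in and out of a separate buffer while the coherent model lets the accelerator work on the application's own buffer.

```python
from typing import Tuple

# Step names follow the slide; the grouping into steps is an assumption.
TYPICAL_IO_FLOW = [
    "device driver call",
    "copy or pin source data",
    "MMIO notify accelerator",
    "acceleration",
    "poll / interrupt completion",
    "copy or unpin result data",
    "return from device driver (completion)",
]
COHERENT_FLOW = [
    "shared-memory notify accelerator",
    "acceleration",
    "shared-memory completion",
]


def toy_accelerate(buf: bytearray) -> None:
    """Stand-in for an accelerator function: increments each byte in place."""
    for i in range(len(buf)):
        buf[i] = (buf[i] + 1) % 256


def run_typical_io(app_data: bytearray) -> Tuple[bytearray, int]:
    """Copy-in / copy-out model. Returns (result, number of bytes copied)."""
    staging = bytearray(app_data)   # copy source data into a staging buffer
    toy_accelerate(staging)         # accelerator only sees the staging buffer
    result = bytearray(staging)     # copy the result back to the application
    return result, len(app_data) + len(staging)


def run_coherent(app_data: bytearray) -> Tuple[bytearray, int]:
    """Shared-address model: the accelerator updates the caller's buffer."""
    toy_accelerate(app_data)        # same memory the application uses
    return app_data, 0              # nothing was copied
```

In this model the coherent path has fewer steps and copies no bytes, which mirrors the slide's claims about removing device driver overhead; real latency gains depend on hardware and are not modeled here.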
  27. Example: Innovative “In-Memory” NoSQL/KVS Integrated Solution via POWER8 CAPI-attached Flash
      Today’s NoSQL in memory (x86)
      Diagram: WWW, 10Gb Uplink, Load Balancer, six 500GB Cache Nodes, Backup Node
      Infrastructure Requirements
      • Large Distributed (Scale out)
      • Large Memory per node
      • Networking Bandwidth Needs
      • Load Balancing
      Differentiated NoSQL (POWER8 + CAPI Flash)
      Diagram: WWW, 10Gb Uplink, POWER8 Server, Flash Array w/ up to 40TB
      Infrastructure Attributes
      • 192 threads in 4U Server drawer
      • 40 TB of memory based Flash per 4U Drawer
      • Shared Memory & Cache for dynamic tuning
      • Elimination of I/O and Network Overhead
      • Cluster solution in a box
      5X Cost Reduction with equivalent performance
      Power CAPI-attached Flash model for NoSQL offers dramatic (24:1) density advantage
  28. Wrap-up
  29. For more information on Big Data / Analytics
      • Sales kits
        – PartnerWorld
        – IBM internal
      • Worldwide contacts
        – Renato Loffreda-Mancinelli, World Wide Business Analytics and Big Data Solutions on Power - Business Dev. Leader
        – Michael Tabron, Solution Offering Manager, Power Analytics
        – Gina King, Solution Offering Manager, Big Data Analytics
        – Bob Friske, Marketing Manager
  30. Q & A
      Summary:
      1. Getting started with Big Data is the toughest part. Start simple, small, and on the side.
      2. The OpenPOWER Foundation enables new systems and helps support the emerging analytic solutions around NoSQL databases.
      3. POWER8 technology like CAPI will enable new solutions from IBM and the OpenPOWER Foundation.
  31. Special notices
      This document was developed for IBM offerings in the United States as of the date of publication. IBM may not make these offerings available in other countries, and the information is subject to change without notice. Consult your local IBM business contact for information on the IBM offerings available in your area.
      Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
      IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785 USA.
      All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
      The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or guarantees either expressed or implied.
      All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations and conditions.
      IBM Global Financing offerings are provided through IBM Credit Corporation in the United States and other IBM subsidiaries and divisions worldwide to qualified commercial and government clients. Rates are based on a client's credit rating, financing terms, offering type, equipment type and options, and may vary by country. Other restrictions may apply. Rates and offerings are subject to change, extension or withdrawal without notice.
      IBM is not responsible for printing errors in this document that result in pricing or information inaccuracies.
      All prices shown are IBM's United States suggested list prices and are subject to change without notice; reseller prices may vary.
      IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.
      Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this document may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document should verify the applicable data for their specific environment.
      Revised September 26, 2006
  32. Backup
  33. Where to find more information?