IBM Systems
Anand Haridass
Chief Engineer POWER Integrated Solutions (BD&A)
Senior Technical Staff Member
India Systems Development Lab
anharida@in.ibm.com
POWER Up Your Insights
IBM Systems
Acknowledgement
Sources of these slides are numerous IBM presentations/tutorials/studies
– Thank you
IBM Systems
Agenda
The Big Picture about Big Data
Hadoop
Spark
IBM POWER Systems – Big Data
IBM Systems
Variety, Volume, Velocity
Information is THE resource of the 21st Century …
2.5 quintillion bytes of data/day
90% of data created in the last 2 years
35 zettabytes in 2020!
Rich Media
Weather
Consumer
Geospatial
Internet of Things
Social Media
Webpages
An unprecedented increase in the use of digital devices is causing an enormous amount of data to be
generated and captured by businesses. This digital data, also known as
Big Data, has the potential to transform businesses and create value.
There will be over 200 billion connected devices
There will be over 12 billion machine-to-machine devices
Machine generated data will be 42% of all data
IBM Systems
Merging the Traditional & Big Data approaches
Traditional Approach: Structured & Repeatable
Business Users: Determine what question to ask
IT: Structures the data to answer that question
Examples: Monthly sales reports, Profitability analysis, Customer surveys

Big Data Approach: Iterative & Exploratory
IT: Delivers a platform to enable creative discovery
Business: Explores what questions could be asked
Examples: Brand sentiment, Product strategy, Maximum asset utilization
IBM Systems
Big Data : Value from Insights
Stages in order of increasing value:
Descriptive: What is happening
Diagnostic: Why did it happen
Predictive: What could happen
Prescriptive: What should I do
Cognitive: What did I learn
Cognitive computing describes systems that learn at scale, reason with
purpose, and interact with humans naturally. Cognitive systems are
probabilistic. This is a core point of difference: they are not
programmed; instead, they are trained. Cognitive systems can generate
not just answers to questions but also hypotheses, reasoned responses, and
recommendations about more complex and meaningful data.
IBM Systems
What is Hadoop?
• Open source project to enable processing of large data sets
• Batch oriented
• Structured, unstructured, semi-structured data
• Written in Java
• Scalable to thousands of machines
• Fault tolerant
• Core components: HDFS, MapReduce, Hadoop Common
Time to read 1 TB of data at a disk read rate of 200 MB/s:
1 server, 1 disk: 5000 sec
1 server, 10 disks: 500 sec
100 servers (x10 disks): 5 sec
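The read times on this slide follow from dividing the data size by the aggregate disk bandwidth. A minimal sketch of that arithmetic, assuming 1 TB = 1,000,000 MB and perfectly parallel reads with no network or coordination overhead (the function name `scan_seconds` is illustrative, not part of Hadoop):

```python
# Illustrative arithmetic only: assumes 1 TB = 1,000,000 MB and that reads
# scale linearly across disks with no network or coordination overhead.
DATA_MB = 1_000_000          # 1 TB of data
DISK_READ_MB_PER_S = 200     # per-disk read rate quoted on the slide


def scan_seconds(servers: int, disks_per_server: int) -> float:
    """Time to read the whole data set when it is spread evenly over all disks."""
    total_disks = servers * disks_per_server
    return DATA_MB / (DISK_READ_MB_PER_S * total_disks)


for servers, disks in [(1, 1), (1, 10), (100, 10)]:
    seconds = scan_seconds(servers, disks)
    print(f"{servers} server(s) x {disks} disk(s): {seconds:.0f} sec")
```

Real clusters will not reach these ideal figures, but the calculation shows why spreading data across many disks and servers is the core idea behind Hadoop.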
IBM Systems
Hadoop Basic Flow
Input Data: Logs, Social Data, Devices, DBs
-> HDFS (3 copies of data)
-> Map: Create key/value pairs
-> Shuffle: Sort key/value pairs
-> Reduce: Processes data, write output
-> Read Results / Extract Data
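The map, shuffle, and reduce stages in this flow can be sketched in a few lines of plain Python. This is a single-process word count meant only to show the shape of the data at each stage; it does not use the Hadoop API, and the function names and sample records are made up for illustration.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (key, value) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield word.lower(), 1


def shuffle_phase(pairs):
    """Shuffle: group values by key and sort the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reduce: combine the values for each key into one output value."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical log lines standing in for input data read from HDFS.
records = ["error disk full", "warning disk slow", "error network down"]
counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)
```

In a real cluster the map and reduce functions run in parallel on many nodes, and the shuffle moves data across the network between them.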
IBM Systems
What is Spark?
• Open source (Apache License 2.0), version 1.x
• Written in Scala
• In-Memory, On-Disk, Batch, Interactive, Streaming (Near Real-Time)
• Rapid in-memory processing of resilient distributed datasets (RDDs)
• Multiple Workflows
• Multiple Libraries
• Multiple APIs
Fast flexible engine for big data processing - 10x (on disk) to
100x (in memory) faster than MapReduce
IBM Systems
Spark SQL: Spark module for structured data processing using either SQL or a DataFrame API.
Provides a common way to access a wide range of data sources.
Spark Streaming: Micro-batch processing engine that enables applications to process real-time
streams of data with latency as low as 0.5 seconds.
GraphX: Graph processing engine with an API for graphs and graph computation.
MLlib: Collection of machine learning libraries that can run on a distributed cluster.
SparkR: Enables R programmers to use existing tools (RStudio) while Spark does the actual
processing behind the scenes.
IBM Systems
Apache Spark - Resilient Distributed Dataset (RDD)
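The RDD idea can be illustrated with a toy class: data is split into partitions, transformations are recorded rather than executed, and any partition can be recomputed from the recorded steps. This is a hedged sketch in plain Python, not the Spark API; `ToyRDD` and its methods are hypothetical names chosen for illustration.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: partitions plus recorded transformations."""

    def __init__(self, partitions, lineage=()):
        self._partitions = partitions  # source data, split into partitions
        self._lineage = lineage        # transformations recorded but not yet run

    def map(self, fn):
        # Transformations return a new object; nothing is computed here.
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._partitions, self._lineage + (("filter", predicate),))

    def compute_partition(self, index):
        """Replay the lineage on one partition, e.g. to rebuild it after a failure."""
        items = list(self._partitions[index])
        for kind, fn in self._lineage:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def collect(self):
        """Action: run the recorded transformations on every partition."""
        result = []
        for index in range(len(self._partitions)):
            result.extend(self.compute_partition(index))
        return result


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])
evens_squared = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(evens_squared.collect())
```

Because each result is defined by its source partitions and its lineage, a lost partition can be rebuilt by calling `compute_partition` again instead of restoring a replicated copy.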
IBM Systems
Spark as a Service
Spark Standalone
Spark on Hadoop
Spark with Mesos
IBM Systems
Open Data Platform Initiative
• ODPi has an open governance model.
Developers form a Technical Steering
Committee
• All members have an equal vote on ODPi
Core decisions.
• ODPi has a Board of Directors
responsible for the financial, legal and
promotional aspects of ODPi.
• Non-profit organization accelerating the
delivery of Big Data solutions by powering
a platform called ODPi Core.
• The ODPi Core focuses on a small but
critical set of projects
• Goal: enables a rapid start and an industry
driven definition
ODPi Members include: Ampool, Altiscale, ArenaData, AsiaInfo, Capgemini,
DataTorrent, EMC, GE, Hortonworks, IBM, Infosys, NEC, Pivotal, PLDT, SAS, Squid
Solutions, SyncSort, Telstra, Toshiba, UNIFi, VMware, WANdisco, Xiilab, zData and
Zettaset.
ODPi & Apache Software Foundation (ASF)
ODPi supports the ASF mission
ASF provides governance around individual projects
without looking at ecosystem and collections of
projects
ODPi provides a vendor-led consistent packaging
model and certification for Big Data components as an
ecosystem - Test once ; Run anywhere for big data
applications
Improves ecosystem interoperability
Unlocks customer choice
Eliminates wasteful guesswork
IBM Systems
Hadoop and Spark Offer Significant Business Benefits
[Chart: Value (Low to High) against Big Data Maturity (Low to High); stages: Operations, Data Warehousing, Line of Business and Analytics, New Business Imperatives]
Data-Informed
Decision Making
• Full dataset analysis
(no more sampling)
• Extract value from
non-relational data
• 360° view of all
enterprise data
• Exploratory analysis
and discovery
Warehouse
Modernization
• Data lake
• Data offload
• ETL offload
• Queryable archive
and staging
Lower the Cost
of Storage
Business
Transformation
• Create new business
models
• Risk-aware decision
making
• Fight fraud and
counter threats
• Optimize operations
• Attract, grow, retain
customers
Value
IBM Systems
IBM POWER Systems
IBM Systems
Driving Innovation Beyond The Chip
Microprocessors alone no longer drive sufficient Cost/Performance improvements
System stack innovations are required to drive Cost/Performance
IBM Systems
• Moore’s law no longer
satisfies performance gain
• Numerous IT consumption
models
• Mature Open software
ecosystem
Open Development
open software, open hardware
Collaboration of thought leaders
simultaneous innovation, multiple disciplines
Performance of POWER architecture
amplified capability
•Rich software ecosystem
•Spectrum of power servers
•Multiple hardware options
• Derivative POWER chips
Market Shifts New Open Innovation
The OpenPOWER Foundation
[Figure: OpenPOWER Foundation member ecosystem by layer: Technology / FAB; I/O / Networking / Storage; FW (Open Source); SYS (ODM, OEM); SW (Linux, ISV, Open Source); Chip (SoC Dev, IP Dev); WEB 2.0 / Data Center / MSP / Cloud]
120+ members and growing.
The goal of the OpenPOWER Foundation is to create an open ecosystem, using the
POWER Architecture to share expertise, investment, and server-class intellectual
property to serve the evolving needs of customers.
Platinum
Members
IBM Systems
POWER8 Processor
Bus Interfaces
Integrated PCIe Gen3
SMP Interconnect
CAPI
Accelerators
Crypto & memory expansion
Transactional Memory
Data Move / VM Mobility
[Figure: POWER8 die layout: 12 cores, each with its own L2; L3 Cache & Chip Interconnect (8M L3 Region); two memory controllers; SMP Links; Accelerators; PCIe]
Caches
64K Data cache (L1)
512 KB SRAM L2 / core
96 MB eDRAM shared L3
Up to 128 MB eDRAM L4 (off-chip)
Cores
12 cores (SMT8)
8 dispatch, 10 issue, 16 exec pipe
2X internal data flows/queues
Enhanced prefetching
Memory
Dual memory Controllers
230 GB/sec Sustained bandwidth
Technology
22nm SOI, eDRAM, 15 metal layers, 650 mm²
Reference: IBM Journal of Research and Development, Issue 1, Jan.-Feb. 2015 (on IEEE Xplore)
Energy Management
On-chip Power Management Micro-controller
Integrated Per-core VRM
Critical Path Monitors
IBM Systems
POWER8 is designed & optimized for Big Data & Analytics
Processors: flexible, fast execution of analytics algorithms
4X threads per core vs. x86 (up to 1536 threads per system)
Memory: large, fast workspace to maximize business insight
~4X memory bandwidth vs. x86 (up to 16TB of memory)
Cache: ensure continuous data load for fast responses
4X more cache vs. x86 (up to 231MB cache per socket)
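The thread and cache totals above can be cross-checked against the per-chip figures on the POWER8 Processor slide. The sketch below shows one way the numbers could add up. It assumes a 16-socket system and a 32 KB L1 instruction cache per core; neither value appears in these slides, so treat both as assumptions.

```python
# Per-socket figures taken from the POWER8 Processor slide.
CORES_PER_SOCKET = 12
THREADS_PER_CORE = 8      # SMT8
L2_KB_PER_CORE = 512
L3_MB = 96
L4_MB = 128               # off-chip, "up to"

# Assumptions, not stated in the slides.
SOCKETS_PER_SYSTEM = 16
L1_KB_PER_CORE = 64 + 32  # 64 KB data (slide) + 32 KB instruction (assumed)

threads_per_system = SOCKETS_PER_SYSTEM * CORES_PER_SOCKET * THREADS_PER_CORE

per_core_cache_mb = (L1_KB_PER_CORE + L2_KB_PER_CORE) / 1024
cache_mb_per_socket = CORES_PER_SOCKET * per_core_cache_mb + L3_MB + L4_MB

print(f"threads per system: {threads_per_system}")
print(f"cache per socket: {cache_mb_per_socket:.1f} MB")
```

Under these assumptions the totals line up with the quoted 1536 threads per system and roughly 231 MB of cache per socket.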
Optimized for a broad range of big data & analytics workloads:
Industry Solutions
5X
Faster
Supports growth of users,
reports and complex
queries
Delivers fast analytics
results for real-time
decision-making
Handles large volumes of
data for better response
times
Yateesh Vusirika – Open Databases SQL / NoSQL – What’s on offer ?
IBM Systems
Streaming and SQL benefit from High Thread Density and Concurrency
Processing multiple packets of a stream and different stages of a message stream pipeline
Processing multiple rows from a query
Machine Learning benefits from Large Caches and Memory Bandwidth
Iterative Algorithms on the same data
Fewer core pipeline stalls and overall higher throughput
Graph also benefits from Large Caches, Memory Bandwidth and Higher Thread Strength
Flexibility to go from 8 SMT threads per core to 4 or 2
Manage Balance between thread performance and throughput
Headroom
Balanced resource utilization, more efficient scale-out
Multi-tenant deployments
POWER Advantages for Spark
IBM Systems
Machine Learning SQL Graph
1.5X
•Spend 33% less on
infrastructure
supporting the same
amount of workload
•Spend the same on
infrastructure but host
50% more workload
* Based on SoftLayer pricing; subject to change
Price Performance of Spark on POWER Cloud
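The two bullet points on this slide are both consequences of the 1.5X price-performance figure. A short worked calculation, assuming price-performance means workload delivered per unit of infrastructure spend:

```python
# Price-performance ratio quoted on the slide (workload per unit of spend).
PRICE_PERF_RATIO = 1.5

# Same workload: spend shrinks to 1 / 1.5 of the original.
infrastructure_saving = 1 - 1 / PRICE_PERF_RATIO

# Same spend: workload grows by the ratio itself.
extra_workload = PRICE_PERF_RATIO - 1

print(f"{infrastructure_saving:.0%} less infrastructure spend")
print(f"{extra_workload:.0%} more workload")
```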
IBM Systems
GPU Use Case Example: Adverse Drug Reaction Prediction built on Spark
Fast and general engine for large-scale data processing
• 25X speed-up for the Building Model stage (using Spark MLlib Logistic Regression)
• Transparent to the Spark Application
• Game changer for Personalized Medicine
IBM Systems
IBM Big Data on Power Offerings
Stage 1: Prove Value
Offering: Digital Start for Big Data on Power
Ready access for Power customers
On Premise or Cloud
Organization: Line of Business (LOB) or Data Science team
Limited data investment per project, often <10TB
Single project, limited use cases

Stage 2: Scale for Multiple Projects
Offering: IBM Data Engine for Hadoop and Spark
Simplify operations: easy to deploy & manage
Advanced resource & storage management
Better resilience for big data
Spark: 2X better price perf vs x86
Organization: LOB or Data Science team
Moderate data investment per project, 50TB to PB
Many independent use case projects across LOBs

Stage 3: Scale for Mixed Analytics
Offering: IBM Data Engine for Analytics
Designed for consolidation and mixed analytics workloads: streams, at rest, text
Lowest $/TB and less than half the storage infrastructure
Leadership resilience for big data environment
Adapt and scale to your changing analytics needs
Organization: IT infrastructure team supporting LOBs and data team
Significant data investment per project, 1/2 to multi PB
Multiple use cases with diverse SLAs
IBM Systems
POWER Hadoop and Spark Integrated Options
IBM Data Engine for
Analytics
• Compute only servers with shared storage
• Single replica of data
• Newer write oriented workloads
• Sophisticated scheduler
• POSIX compliant file system
• Ideal for larger deployments
Integrated Solution
IBM Data Engine for
Hadoop and Spark
• Scale-out storage rich servers
• Three replicas of data
• Traditional read dominated workloads
• Ideal for simpler workload patterns
• Ideal for smaller deployments
Integrated Solution
IBM Systems
IBM Data Engine for Hadoop and Spark: IDE-HS
[Diagram: three OpenPOWER nodes, each running IOP +]
Spectrum Scale FPO Option
• Internal replicated disk
• POSIX compliant
• Encryption/replication
Platform Symphony (optional)
• Higher utilization
• Shared cluster
• Better throughput
OpenPOWER (POWER8) S812LC
• 2x x86 core performance
• Lowest cost Power HW
Solution
• Pre-assembled/tested cluster
• On-site services
• Lower risk & faster time to value
IBM Open Platform
• Open Hadoop
• Value Add Options
Platform Cluster Mgr.
• Simplified physical cluster management
OpenPOWER innovation with IBM Open Platform with Apache Hadoop for a high performance,
storage dense and fully integrated cluster offering.
IBM Systems
IBM Data Engine for Analytics: IDEA
[Diagram: three POWER8 nodes running BigInsights, each with a Spectrum Scale client]
Platform Cluster Mgr.
• Simplified physical cluster management
Platform Symphony
• Higher utilization
• Shared cluster
• Better throughput
Spectrum Scale ESS
• One copy of data
• POSIX compliant
• Erasure coding
• Encryption/replication
POWER8 - S822L
• 2X x86 core performance
• Fewer nodes
IBM Open Platform
• Industry standard Hadoop
Solution
• Grow disk/CPU separately
• Pre-assembled/tested cluster
• On-site services
• Lower risk & faster time to value
A fully integrated solution with software and infrastructure optimized for Big Data & Analytics
Appliance-like but much more versatile!
IBM Systems
[Chart: configurations ranging from Compute Intensive to Storage Intensive; axes: Add More Servers, Add More Storage]
Add servers or storage or both as needed
Adjust compute to storage ratio as workload
needs change
Standard Hadoop configurations with local
storage and triple replica can result in
overprovisioned compute to meet the storage
demands
Data Engine for Analytics allows right sizing of
compute and storage independently to create an
optimized configuration
Data Engine for Analytics offers Independent Scaling of
Servers & Storage
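The over-provisioning argument on this slide can be made concrete with a small sizing sketch. All workload and node figures below are hypothetical, chosen only to illustrate how triple replication on storage-rich servers can force more nodes than the compute demand requires.

```python
import math


def coupled_nodes(usable_tb, compute_cores, raw_tb_per_node, cores_per_node, replicas=3):
    """Nodes needed when each server supplies both compute and replicated storage."""
    nodes_for_storage = math.ceil(usable_tb * replicas / raw_tb_per_node)
    nodes_for_compute = math.ceil(compute_cores / cores_per_node)
    return max(nodes_for_storage, nodes_for_compute)


def decoupled_compute_nodes(compute_cores, cores_per_node):
    """Compute nodes needed when storage is sized separately on shared storage."""
    return math.ceil(compute_cores / cores_per_node)


# Hypothetical workload and node sizes, for illustration only.
USABLE_TB = 1000
COMPUTE_CORES = 480
RAW_TB_PER_NODE = 48
CORES_PER_NODE = 24

coupled = coupled_nodes(USABLE_TB, COMPUTE_CORES, RAW_TB_PER_NODE, CORES_PER_NODE)
decoupled = decoupled_compute_nodes(COMPUTE_CORES, CORES_PER_NODE)
print(f"coupled cluster: {coupled} nodes (storage-driven)")
print(f"decoupled cluster: {decoupled} compute nodes plus separately sized storage")
```

In this made-up case the coupled design is driven entirely by storage capacity, so most of its cores would sit idle. Sizing compute and storage independently avoids that.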
IBM Systems
Client: Multinational Telecommunications Company
A multinational telecommunication company with over 6M
subscribers. Strategic value as they influence the IT decisions in
other countries.
Challenges
Expectations of a Real Time Marketing (RTM) based solution to
run event-based campaigns
Enable event-based marketing, analysing various sources of
input data containing information regarding subscribers' actions
Dispatch the triggered events to downstream applications such
as campaign management, for associated campaign execution.
Architecture
• IBM Data Engine for Analytics: 20 x Power S822L, 2 x ESS
GL4, Spectrum Scale, PCM
• BigInsights, Streams, SPSS Modeler, SPSS Analytics Server
Solution and Approach
Solution was to provide a Hadoop-based Big Data platform, integrated
to the RTM decision engine, that will enable data monetisation
opportunities, including location based analytics
Customer was not comfortable with the huge number of x86 Data
Nodes approach of typical Hadoop Architecture
The IBM team designed the Power solution and conducted a technical
workshop with the client on newly redefined Hadoop architecture
based on IDEA.
Demonstrated the one IBM team value as an integrated approach to
the client
Key Client Benefits
Optimized Big Data deployment based on the IDEA architecture
with Linux on Power, Elastic Storage Server and Spectrum Scale
Lower TCO with 4 racks on Power against 12 racks on x86
More I/O bandwidth with a 40GbE network on Power against 10GbE on the
x86-based solution
3x fewer racks for a 2 PB
Big Data solution
4 vs. 12
Client Example – IDEA Architecture
IBM Systems
POWER Processor Roadmap
Focus on Enterprise; Technology and Performance Driven

POWER6 Architecture: High Frequency, Enhanced RAS, Dynamic Energy Management
  2007  POWER6   2 cores, 65nm   New Micro-Architecture, New Process Technology
  2008  POWER6+  2 cores, 65nm+  Enhanced Micro-Architecture, Enhanced Process Technology

POWER7 Architecture: Large eDRAM L3 Cache, Optimized VSX, Enhanced Memory Subsystem
  2010  POWER7   8 cores, 45nm   New Micro-Architecture, New Process Technology
  2012  POWER7+  8 cores, 32nm   Enhanced Micro-Architecture, New Process Technology

Focus on Scale-Out and Enterprise; Cost and Acceleration Driven

POWER8 Architecture: Optimized for Data-Centric Workloads, Integrated PCIe, CAPI Acceleration / I/O
  2014  POWER8            12 cores, 22nm  New Micro-Architecture, New Process Technology
  2016  POWER8 w/ NVLink  12 cores, 22nm  Enhanced Micro-Architecture with NVLink

POWER9 Architecture: Scale-Out Datacenter TCO Optimization, Scale-up Performance Optimization, Acceleration Enhancements to CAPI and NVLink, Modularity for OpenPOWER
  2017  P9 SO  24 cores, 14nm   New Micro-Architecture, Direct Attach Memory, New Process Technology
  TBD   P9 SU  TBD cores, 14nm  Enhanced Micro-Architecture, Buffered Memory

Partner Chip POWER8/9: OpenPOWER Ecosystem Design, Targeting Partner Markets & Systems, Leveraging Modularity
  2018 - 20  P8/9 SO  10nm - 7nm  Existing Micro-Architecture, Foundry Technology

Price, performance, feature and ecosystem innovation
  2020+   POWER10  New Micro-Architecture, New Technology, New Features and Functions
  Future  TBD
IBM Systems
What’s in the works ?
GPUs and FPGAs for Compute
offload, Machine Learning, Graph
and other specialized acceleration
CAPI Flash for Memory
consolidation/expansion, and Storage
acceleration
RDMA for better latency, better
network utilization, lower CPU
utilization, lower Memory utilization
OpenPOWER extends the ability to innovate around Spark into the hardware and accelerators.
IBM Systems
Backup
IBM Systems
Power Systems and NVIDIA GPU Roadmap
Current GPU Attach: Power Chip (System Memory) to NVIDIA GPU (Graphics Memory) over PCIe x16, 16+16 GB/s
Future NVLink GPU Attachment: Power Chip with NVLink (System Memory) to NVIDIA GPU with NVLink (Graphics Memory), 40+40 GB/s (80 GB/s Peak*)
CPU to GPU NVLink Enables
Easier Programming of GPU Accelerators
Better Application Throughput
Expanded Set of Accelerated Applications
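The bandwidth figures on this slide can be compared directly. A small worked calculation using only the per-direction numbers quoted here (16+16 GB/s for PCIe x16 and 40+40 GB/s for NVLink), which are peak values rather than measured throughput:

```python
# Per-direction peak bandwidth in GB/s, as quoted on the slide.
PCIE_X16_GB_S = (16, 16)
NVLINK_GB_S = (40, 40)

pcie_total = sum(PCIE_X16_GB_S)      # both directions combined
nvlink_total = sum(NVLINK_GB_S)      # matches the "80 GB/s Peak" label
speedup = nvlink_total / pcie_total

print(f"PCIe x16: {pcie_total} GB/s, NVLink: {nvlink_total} GB/s")
print(f"NVLink peak bandwidth is {speedup:.1f}x PCIe x16")
```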
New Server: Early Shipments in 4Q ‘16
IBM-NVIDIA NVLink Acceleration Lab
Seeking Clients Now
Apply at accellab@us.ibm.com
IBM Systems
IBM Data Engine for Analytics (IDEA) overview (1 of 2)
Challenges with traditional Hadoop model
Typical Hadoop solutions use a storage-rich, server-based scale-out design
Lose control of storage since it's lumped in with compute
No separate storage capital planning
Usually no backup, archiving, or disaster recovery facilities
Usually fewer storage security controls and less auditing
Disk failures incur a heavy rebuild penalty, consume network resources, and reduce application
performance
Multiple replicas are expensive and superfluous if the workload doesn’t need lots of tasks
accessing the same data over and over (read mostly)
Typically cannot share resources with non-Hadoop workloads
Cannot reuse existing infrastructure, requires different infrastructure
IBM Systems
IBM Data Engine for Analytics (IDEA) overview (2 of 2)
Value Proposition
Compute only servers with shared storage
Single replica of data
Better suited for workloads that tend to have a significant write component
Ideal for complex analytic solutions where additional components may need to interact with
BigInsights
Composed of several integrated components
Compute: POWER8 compute nodes – 2 or more
Storage: IBM Elastic Storage System with advanced distributed and parallel filesystem
(Spectrum Scale Based)
FileSystem accessed over network supported by Spectrum Scale (GPFS) protocol
Networking: Ethernet and optionally InfiniBand (no Fibre Channel)
System Software: Linux, Cluster Provisioning and Management using PCM and xCAT
Middleware: Platform Symphony and BigInsights
IBM Systems
Solution Architecture Component Model
IBM Systems
Power Hadoop/Spark Solutions
Roll Your Own

IBM Integrated Offerings:

IBM Data Engine for Hadoop and Spark (IDE-HS)
• Point solution
• Classic Spark or Hadoop architecture
• Open Platform for Apache Hadoop
• Optional IBM value-adds
• Based on Power LC

IBM Data Engine for Analytics (IDEA)
• Enterprise solution
• Shared external storage (GPFS)
• BigInsights
• Platform Computing for Resource and Cluster Mgt.
• Based on Power L
IBM Systems
IBM Data Engine for Analytics Architecture
Hadoop
Management
2 VMs each
Edge Nodes
(Data Ingress)
Data Nodes
1-2 VMs each
Physical Cluster
Management
Shared
High Bandwidth
Storage
High Speed
Network
+ Pre-assembled
+ On-site Services
IBM Systems
Physical Cluster Management
  Power S812L, RHEL 6.5, 10 Cores, 32 GB Memory
  Platform Cluster Manager 4.2 Advanced Edition, xCAT 2.9
  Single LPAR, HA server option available

Hadoop Management
  Power S822L, RHEL 6.5, 24 Cores, 256 GB Memory
  BigInsights 4.1
  Platform Symphony 6.1.1, Platform Cluster Manager 4.2 Standard Edition
  Spectrum Scale 4.1 Client
  Minimum 2 servers, 2 LPARs each

Data Nodes
  Power S822L, RHEL 6.5, 24 Cores, 256 GB Memory
  BigInsights 4.1
  Platform Symphony 6.1.1, Platform Cluster Manager 4.2 Standard Edition
  Spectrum Scale 4.1 Client
  User specified number of servers, 1 LPAR each (2 LPARs when running Big SQL)

Shared Storage
  IBM Elastic Storage Server Models GL2, GL4, GL6, GS2, GS4, or GS6
  GSS 2.2 Software, GSS Management/Maintenance
  Spectrum Scale 4.1.1 TL1 Server
  User specified number of storage servers

Edge Nodes (Opt)
  Power S822L, RHEL 6.5, 24 Cores, 256 GB Memory
  BigInsights 4.1
  Platform Symphony 6.1.1, Platform Cluster Manager 4.2 Standard Edition
  Spectrum Scale 4.1 Client
  User specified number of edge nodes, 1 or 2 LPARs each

Network (Opt)
  Mellanox 10 or 40 Gb RoCE Ethernet or Mellanox 56 Gb InfiniBand
IBM Power Systems
Apache Hadoop Ecosystem
IBM Systems
Platform Symphony Differentiation vs Open Source
YARN
• Monitor CPU and Memory only
• XML file to setup
prioritization/scheduling strategy
• Limited Pre-emption
available with FAIR scheduler.
Capacity scheduler does not have
pre-emption so 100% elasticity
unwise.
• Cron to change strategy by time-of-day
Platform Symphony
• 50% Faster Time to Insights with
MapReduce
• Deeper performance insight with
visualizations of 150 machine metrics to
support tuning & planning
• GUI based setup/administration & Visual
validation of policies
• Advanced Pre-emption
Pre-empt the least running jobs
Round robin pre-emption
• Different resource strategy by time of day
• Showback Reports
IBM Systems
Spectrum Scale will enhance your Hadoop Environment !
Hadoop HDFS
• HDFS NameNode HA added in version 2.0; NameNode HA in active/passive configuration
• Difficult to ingest data; special tools required
• Single-purpose, Hadoop MapReduce only
• Lacking enterprise readiness
• Large block sizes; poor support for small files
• Compute and storage tightly coupled, leading to very low CPU utilization
• Non-POSIX file system with obscure commands; does not support in-place updates

IBM Spectrum Scale
• No single point of failure; distributed metadata in active/active configuration since 1998
• Ingest data using policies for data placement
• Versatile, multi-purpose, hybrid storage (locality and shared)
• Enterprise ready with support for advanced storage features (encryption, DR, replication, SW RAID, etc.)
• Variable block sizes, suited to multiple types of data and metadata access patterns
• Scale compute and storage independently (policy based ILM)
• POSIX file system; easy to use and manage

2016 August POWER Up Your Insights - IBM System Summit Mumbai

  • 1.
    IBM Systems Anand Haridass ChiefEngineer POWER Integrated Solutions (BD&A) Senior Technical Staff Member India Systems Development Lab anharida@in.ibm.com POWER Up Your Insights
  • 2.
    IBM Systems Acknowledgement Sources ofthese slides are numerous IBM presentations/tutorials/studies – Thank you | 2
  • 3.
    IBM Systems Agenda The BigPicture about Big Data Hadoop Spark IBM POWER Systems – Big Data
  • 4.
    IBM Systems VarietyVolume Velocity Informationis THE resource of the 21st Century … 2.5 quintillion bytes of data/day 90% of data created in 2 years 35 zettabytes in 2020 ! Rich Media Weather Consumer Geospatial Internet of Things Social Media Webpages An unprecedented increase in use of digital devices is causing humungous amount of data to be generated and captured by businesses. This tremendous amount of digital data, also known as Big Data has the potential to transform businesses and create value. There will be over 200 billion connected devices There will be over 12 billion machine-to-machine devices Machine generated data will be 42% of all data
  • 5.
    IBM Systems Merging theTraditional & Big Data approaches | 5 IT Structures the data to answer that question IT Delivers a platform to enable creative discovery Business Explores what questions could be asked Business Users Determine what question to ask Monthly sales reports Profitability analysis Customer surveys Brand sentiment Product strategy Maximum asset utilization Big Data Approach Iterative & Exploratory Traditional Approach Structured & Repeatable
  • 6.
    IBM Systems Big Data: Value from Insights Descriptive What is happening Cognitive What did I learn Value Prescriptive What should I do Predictive What could happen Diagnostic Why did it happen Cognitive computing defines systems that learn at scale, reason with purpose & interact with humans naturally. Cognitive systems are probabilistic, this is a core point of difference as it means they are not programmed, instead they have been trained. Cognitive systems can generate not just answers to questions but hypotheses, reassured responses & recommendations about more complex & meaningful data.
  • 7.
    IBM Systems What isHadoop? • Open source project to enable processing of large data sets • Batch oriented • Structured, unstructured, semi-structured data • Written in Java • Scalable to thousands of machines • Fault tolerant • Core components: HDFS, MapReduce, Hadoop Common Data 1TB Disk Read 200MB/s 1 server 1 Disk 5000 sec 10 Disks 500 sec 100 server (x10 Disks) 5 sec
  • 8.
    IBM Systems Hadoop BasicFlow Reduce Processes data, write output Logs Social Data Map Create key/value pairs HDFS (3 copies of data) Shuffle Shuffle Sort key/value pairs Map Reduce Extract Data Read ResultsInput Data Devices DBs
  • 9.
  • 10.
    IBM Systems What isSpark? • Open Source, Apache 2.0, version 1.x • Written in Scala • In-Memory, On-Disk, Batch, Interactive, Streaming (Near Real-Time) • Rapid in-memory processing of resilient distributed datasets (RDDs) • Multiple Workflows • Multiple Libraries • Multiple API’s Fast flexible engine for big data processing - 10x (on disk) to 100x (in memory) faster than MapReduce
  • 11.
    IBM Systems 11 SparkSQL Spark module for structured data processing using either SQL or a DataFrame API. Provides a common way to access a wide range of data sources. Spark Streaming Micro-batch processing engine that enables applications to process real-time streams of data with latency as low as 0.5 seconds. GraphX API for graphs and graph computation - is a graph processing engine MLlib It is a collection of machine learning libraries that can run on a distributed cluster SparkR enables R programmers to use existing tools (Rstudio) while Spark does the actual processing behind the scenes
  • 12.
    IBM Systems Apache Spark- Resilient Distributed Dataset (RDD)
  • 13.
    IBM Systems 13 Sparkas a Service Spark Standalone Spark on Hadoop Spark with Mesos
  • 14.
    IBM Systems Open DataPlatform Initiative 14 • ODPi has an open governance model. Developers form a Technical Steering Committee • All members have an equal vote on ODPi Core decisions. • ODPi has a Board of Directors responsible for the financial, legal and promotional aspects of ODPi. • Non-profit organization accelerating the delivery of Big Data solutions by powering a platform called ODPi Core. • The ODPi Core focuses on a small but critical set of projects • Goal: enables a rapid start and an industry driven definition ODPi Members include: Ampool, Altiscale, ArenaData, AsiaInfo, Capgemini, DataTorrent, EMC, GE, Hortonworks, IBM, Infosys, NEC, Pivotal, PLDT, SAS, Squid Solutions, SyncSort, Telstra, Toshiba, UNIFi, VMware, WANdisco, Xiilab, zData and Zettaset. ODPi & Apache Software Foundation (ASF) ODPi supports the ASF mission ASF provides governance around individual projects without looking at ecosystem and collections of projects ODPi provides a vendor-led consistent packaging model and certification for Big Data components as an ecosystem - Test once ; Run anywhere for big data applications Improves ecosystem interoperability Unlocks customer choice Eliminates wasteful guesswork
  • 15.
    IBM Systems Hadoop andSpark Offer Significant Business Benefits 15 Operations Data Warehousing Line of Business and Analytics New Business Imperatives Big Data Maturity High High Low Data-Informed Decision Making • Full dataset analysis (no more sampling) • Extract value from non-relational data • 360 o view of all enterprise data • Exploratory analysis and discovery Warehouse Modernization • Data lake • Data offload • ETL offload • Queryable archive and staging Lower the Cost of Storage Business Transformation • Create new business models • Risk-aware decision making • Fight fraud and counter threats • Optimize operations • Attract, grow, retain customers Value
  • 16.
  • 17.
    IBM Systems Driving InnovationBeyond The Chip 17 Microprocessors alone no longer drive sufficient Cost/Performance improvements System stack innovations are required to drive Cost/Performance
  • 18.
    IBM Systems • Moore’slaw no longer satisfies performance gain • Numerous IT consumption models • Mature Open software ecosystem Open Development open software, open hardware Collaboration of thought leaders simultaneous innovation, multiple disciplines Performance of POWER architecture amplified capability •Rich software ecosystem •Spectrum of power servers •Multiple hardware options • Derivative POWER chips Market Shifts New Open Innovation 18 The OpenPOWER Foundation Technology FAB I/O Networking Storage FW Open Source SYS ODM OEM SW Linux ISV Open Source Chip SoC Dev IP Dev Technology FAB I/O Networking Storage FW Open Source SYS ODM OEM SW Linux ISV Open Source Chip SoC Dev IP Dev WEB 2.0 Data Center MSP Cloud Members And growing …. 120+The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise, investment, and server-class intellectual property to serve the evolving needs of customers. Platinum Members
  • 19.
    IBM Systems POWER8 Processor BusInterfaces Integrated PCIe Gen3 SMP Interconnect CAPI Accelerators Crypto & memory expansion Transactional Memory Data Move / VM Mobility Core L2 Core L2 Core L2 Core L2 Core L2 Core L2 Core L2 Core L2 Core L2 Core L2 Core L2 Core L2 L3 Cache & Chip Interconnect 8M L3 Region Mem. Ctrl.Mem. Ctrl. SMPLinks Accelerators SMPLinks PCIe Caches 64K Data cache (L1) 512 KB SRAM L2 / core 96 MB eDRAM shared L3 Up to 128 MB eDRAM L4 (off-chip) Cores 12 cores (SMT8) 8 dispatch, 10 issue, 16 exec pipe 2X internal data flows/queues Enhanced prefetching Memory Dual memory Controllers 230 GB/sec Sustained bandwidth Technology 22nm SOI, eDRAM, 15 ML 650mm2 IBM Journal of Research and Development Issue 1 • Date Jan.-Feb. 2015 On IEEE Explore - Link Energy Management On-chip Power Management Micro-controller Integrated Per-core VRM Critical Path Monitors
  • 20.
    IBM Systems POWER8 isdesigned & optimized for Big Data & Analytics 20 Processors flexible, fast execution of analytics algorithms Memory large, fast workspace to maximize business insight Cache ensure continuous data load for fast responses 4X threads per core vs. x86 (up to 1536 threads per system) ~4X memory bandwidth vs. x861 (up to 16TB of memory) 4X more cache vs. x862 (up to 231MB cache per socket) Optimized for a broad range of big data & analytics workloads: Industry Solutions 5X Faster Supports growth of users, reports and complex queries Delivers fast analytics results for real-time decision-making Handles large volumes of data for better response times Yateesh Vusirika – Open Databases SQL / NoSQL – What’s on offer ?
  • 21.
    IBM Systems 21 Streamingand SQL benefit from High Thread Density and Concurrency Processing multiple packets of a stream and different stages of a message stream pipeline Processing multiple rows from a query Machine Learning benefits from Large Caches and Memory Bandwidth Iterative Algorithms on the same data Fewer core pipeline stalls and overall higher throughput Graph also benefits from Large Caches, Memory Bandwidth and Higher Thread Strength Flexibility to go from 8 SMT threads per core to 4 or 2 Manage Balance between thread performance and throughput Headroom Balanced resource utilization, more efficient scale-out Multi-tenant deployments POWER Advantages for Spark
  • 22.
    Price Performance of Spark on POWER Cloud
    Machine Learning, SQL, Graph: 1.5X
    • Spend 33% less on infrastructure supporting the same amount of workload
    • Spend the same on infrastructure but host 50% more workload
    * Based on SoftLayer pricing; subject to change
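    The two bullets on this slide are two readings of the same 1.5X price-performance ratio. A minimal sketch of that arithmetic follows; the function names are hypothetical and only the 1.5X figure comes from the slide.

```python
def relative_cost(price_performance: float) -> float:
    """Cost of running the same workload, relative to a baseline of 1.0."""
    return 1.0 / price_performance


def relative_workload(price_performance: float) -> float:
    """Workload hosted for the same spend, relative to a baseline of 1.0."""
    return price_performance


def savings_percent(price_performance: float) -> float:
    """Percentage saved on infrastructure for an unchanged workload."""
    return (1.0 - relative_cost(price_performance)) * 100.0


def extra_workload_percent(price_performance: float) -> float:
    """Percentage of additional workload hosted for an unchanged spend."""
    return (relative_workload(price_performance) - 1.0) * 100.0


if __name__ == "__main__":
    ratio = 1.5  # price-performance figure from the slide
    print(f"Spend about {savings_percent(ratio):.0f}% less for the same workload")
    print(f"Host about {extra_workload_percent(ratio):.0f}% more for the same spend")
```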
  • 23.
    GPU Use Case Example: Adverse Drug Reaction Prediction built on Spark
    Fast and general engine for large-scale data processing
    • 25X speed-up for the Building Model stage (using Spark MLlib Logistic Regression)
    • Transparent to the Spark application
    • Game changer for personalized medicine
  • 24.
    IBM Big Data on Power Offerings
    Stage 1: Prove Value. Digital Start for Big Data on Power
    • Ready access for Power customers
    • On premise or cloud
    • Organization: Line of Business (LOB) or Data Science team
    • Limited data investment per project, often <10 TB
    • Single project, limited use cases
    Stage 2: Scale for Multiple Projects. IBM Data Engine for Hadoop and Spark
    • Simplify operations: easy to deploy & manage
    • Advanced resource & storage management
    • Better resilience for big data
    • Spark: 2X better price performance vs. x86
    • Organization: LOB or Data Science team
    • Moderate data investment per project, 50 TB to PB
    • Many independent use case projects across LOBs
    Stage 3: Scale for Mixed Analytics. IBM Data Engine for Analytics
    • Designed for consolidation and mixed analytics workloads: streams, at rest, text
    • Lowest $/TB and less than half the storage infrastructure
    • Leadership resilience for big data environment
    • Adapt and scale to your changing analytics needs
    • Organization: IT infrastructure team supporting LOBs and data team
    • Significant data investment per project, 1/2 to multi PB
    • Multiple use cases with diverse SLAs
  • 25.
    POWER Hadoop and Spark Integrated Options
    IBM Data Engine for Analytics (Integrated Solution)
    • Compute-only servers with shared storage
    • Single replica of data
    • Newer write-oriented workloads
    • Sophisticated scheduler
    • POSIX compliant file system
    • Ideal for larger deployments
    IBM Data Engine for Hadoop and Spark (Integrated Solution)
    • Scale-out storage-rich servers
    • Three replicas of data
    • Traditional read-dominated workloads
    • Ideal for simpler workload patterns
    • Ideal for smaller deployments
  • 26.
    IBM Data Engine for Hadoop and Spark: IDE-HS
    OpenPOWER innovation with IBM Open Platform with Apache Hadoop for a high performance, storage dense and fully integrated cluster offering.
    Cluster diagram: OpenPOWER nodes, each running IOP
    Platform Cluster Mgr.: simplified physical cluster management
    IBM Open Platform:
    • Open Hadoop
    • Value-add options
    Spectrum Scale FPO option:
    • Internal replicated disk
    • POSIX compliant
    • Encryption/replication
    Opt. Platform Symphony:
    • Higher utilization
    • Shared cluster
    • Better throughput
    OpenPOWER (POWER8) S812LC:
    • 2x x86 core performance
    • Lowest cost Power HW
    Solution:
    • Pre-assembled/tested cluster
    • On-site services
    • Lower risk & faster time to value
  • 27.
    IBM Data Engine for Analytics: IDEA
    A fully integrated solution with software and infrastructure optimized for Big Data & Analytics. Appliance-like but much more versatile!
    Cluster diagram: POWER8 nodes, each running BigInsights and a Spectrum Scale client
    Platform Cluster Mgr.: simplified physical cluster management
    Platform Symphony:
    • Higher utilization
    • Shared cluster
    • Better throughput
    Spectrum Scale ESS:
    • One copy of data
    • POSIX compliant
    • Erasure coding
    • Encryption/replication
    POWER8 - S822L:
    • 2X x86 core performance
    • Fewer nodes
    IBM Open Platform:
    • Industry standard Hadoop
    Solution:
    • Grow disk/CPU separately
    • Pre-assembled/tested cluster
    • On-site services
    • Lower risk & faster time to value
  • 28.
    Data Engine for Analytics offers Independent Scaling of Servers & Storage
    Diagram: Storage Intensive vs. Compute Intensive; Add More Servers, Add More Storage
    • Add servers or storage or both as needed
    • Adjust the compute-to-storage ratio as workload needs change
    • Standard Hadoop configurations with local storage and triple replica can result in overprovisioned compute to meet the storage demands
    • Data Engine for Analytics allows right-sizing of compute and storage independently to create an optimized configuration
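    The overprovisioning point on this slide can be illustrated with a rough sizing calculation. This is a sketch under stated assumptions: the 48 TB of raw disk per node, the 2 PB data set and the 20 compute nodes are hypothetical inputs, and the function names are illustrative.

```python
import math


def nodes_for_storage(usable_tb: float, replicas: int,
                      raw_tb_per_node: float) -> int:
    """Nodes needed purely to hold the data when storage lives in the servers."""
    return math.ceil(usable_tb * replicas / raw_tb_per_node)


def coupled_cluster_nodes(usable_tb: float, compute_nodes_needed: int,
                          replicas: int = 3,
                          raw_tb_per_node: float = 48.0) -> int:
    """Node count when compute and storage must scale together."""
    return max(compute_nodes_needed,
               nodes_for_storage(usable_tb, replicas, raw_tb_per_node))


if __name__ == "__main__":
    usable_tb = 2000.0       # hypothetical 2 PB data set
    compute_needed = 20      # hypothetical compute requirement
    coupled = coupled_cluster_nodes(usable_tb, compute_needed)
    print(f"Coupled cluster: {coupled} nodes, "
          f"{coupled - compute_needed} more than compute alone requires")
```

    With compute and storage decoupled, the server count can follow the compute requirement while shared storage is sized separately for capacity.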
  • 29.
    Client Example – IDEA Architecture
    Client: Multinational Telecommunications Company
    A multinational telecommunications company with over 6M subscribers. Strategic value, as they influence the IT decisions in other countries.
    Challenges
    • Expectation of a Real Time Marketing (RTM) based solution to run event-based campaigns
    • Enable event-based marketing by analysing various sources of input data containing information about subscribers' actions
    • Dispatch the triggered events to downstream applications, such as campaign management, for associated campaign execution
    Architecture
    • IBM Data Engine for Analytics: 20 x Power S822L, 2 x ESS GL4, Spectrum Scale, PCM
    • BigInsights, Streams, SPSS Modeler, SPSS Analytics Server
    Solution and Approach
    • The solution was to provide a Hadoop-based Big Data platform, integrated with the RTM decision engine, that will enable data monetisation opportunities, including location-based analytics
    • The customer was not comfortable with the large number of x86 data nodes required by a typical Hadoop architecture
    • The IBM team designed the Power solution and conducted a technical workshop with the client on the newly redefined Hadoop architecture based on IDEA
    • Demonstrated the "one IBM team" value as an integrated approach to the client
    Key Client Benefits
    • Optimized Big Data deployment architecture based on IDEA, with Linux on Power, Elastic Storage Server and Spectrum Scale
    • Lower TCO with 4 racks on Power against 12 racks on x86
    • More I/O bandwidth with a 40GbE Power network against 10GbE on the x86-based solution
    3x fewer racks for a 2 PB Big Data solution: 4 vs. 12
  • 30.
    POWER Processor Roadmap
    Focus on Enterprise; Technology and Performance Driven
    POWER6 Architecture (High Frequency; Enhanced RAS; Dynamic Energy Management)
    • 2007: POWER6, 2 cores, 65nm. New micro-architecture; new process technology
    • 2008: POWER6+, 2 cores, 65nm+. Enhanced micro-architecture; enhanced process technology
    POWER7 Architecture (Large eDRAM L3 Cache; Optimized VSX; Enhanced Memory Subsystem)
    • 2010: POWER7, 8 cores, 45nm. New micro-architecture; new process technology
    • 2012: POWER7+, 8 cores, 32nm. Enhanced micro-architecture; new process technology
    Focus on Scale-Out and Enterprise; Cost and Acceleration Driven
    POWER8 Architecture (Optimized for Data-Centric Workloads; Integrated PCIe; CAPI Acceleration / I/O)
    • 2014: POWER8, 12 cores, 22nm. New micro-architecture; new process technology
    • 2016: POWER8 w/ NVLink, 12 cores, 22nm. Enhanced micro-architecture with NVLink
    POWER9 Architecture (Scale-Out Datacenter TCO Optimization; Scale-Up Performance Optimization; Acceleration Enhancements to CAPI and NVLink; Modularity for OpenPOWER)
    • 2017: P9 SO, 24 cores, 14nm. New micro-architecture; direct attach memory; new process technology
    • TBD: P9 SU, TBD cores, 14nm. Enhanced micro-architecture; buffered memory
    Partner Chip POWER8/9 (Price, performance, feature and ecosystem innovation)
    • 2018 - 20: P8/9 SO, 10nm - 7nm. Existing micro-architecture; foundry technology
    • OpenPOWER ecosystem design targeting partner markets & systems, leveraging modularity
    POWER10
    • 2020+: New micro-architecture; new technology; new features and functions
    Future TBD
  • 31.
    What’s in the works?
    • GPUs and FPGAs for compute offload, machine learning, graph and other specialized acceleration
    • CAPI Flash for memory consolidation/expansion and storage acceleration
    • RDMA for better latency, better network utilization, lower CPU utilization and lower memory utilization
    OpenPOWER extends the ability to innovate around Spark into the hardware and accelerators.
  • 32.
  • 33.
    Power Systems and NVIDIA GPU Roadmap
    Current GPU attach: Power chip (system memory) to NVIDIA GPU (graphics memory) over PCIe x16, 16+16 GB/s
    Future NVLink GPU attachment: Power chip with NVLink (system memory) to NVIDIA GPUs with NVLink (graphics memory), 40+40 GB/s, 80 GB/s peak*
    CPU to GPU NVLink enables:
    • Easier programming of GPU accelerators
    • Better application throughput
    • Expanded set of accelerated applications
    New server: early shipments in 4Q ‘16
    IBM-NVIDIA NVLink Acceleration Lab: seeking clients now. Apply at accellab@us.ibm.com
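    To put the two attach rates on this slide side by side, the sketch below compares ideal one-way transfer times. It assumes the quoted per-direction peak figures (16 GB/s for PCIe x16, 40 GB/s for NVLink) and ignores protocol overhead, so real transfers will be slower; the function name and the 80 GB payload are hypothetical.

```python
PCIE_X16_GB_PER_S = 16.0  # per direction, from the slide (16+16 GB/s)
NVLINK_GB_PER_S = 40.0    # per direction, from the slide (40+40 GB/s)


def transfer_seconds(gigabytes: float, gb_per_sec: float) -> float:
    """Ideal one-way transfer time at a given peak bandwidth."""
    if gb_per_sec <= 0:
        raise ValueError("bandwidth must be positive")
    return gigabytes / gb_per_sec


def nvlink_speedup() -> float:
    """Ratio of the quoted NVLink peak rate to the quoted PCIe x16 peak rate."""
    return NVLINK_GB_PER_S / PCIE_X16_GB_PER_S


if __name__ == "__main__":
    payload_gb = 80.0  # hypothetical data set moved from system to graphics memory
    print(f"PCIe x16: {transfer_seconds(payload_gb, PCIE_X16_GB_PER_S):.1f} s")
    print(f"NVLink:   {transfer_seconds(payload_gb, NVLINK_GB_PER_S):.1f} s")
    print(f"Peak ratio: {nvlink_speedup():.1f}x")
```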
  • 34.
    IBM Data Engine for Analytics (IDEA) overview (1 of 2)
    Challenges with the traditional Hadoop model
    • Typical Hadoop solutions use a scale-out design based on storage-rich servers
    • Lose control of storage since it's lumped in with compute
    • No separate storage capital planning
    • Usually no backup, archiving or disaster recovery facilities
    • Usually fewer storage security controls and less auditing
    • Disk failures incur a heavy rebuild penalty, consume network resources and reduce application performance
    • Multiple replicas are expensive and superfluous if the workload doesn’t need lots of tasks accessing the same data over and over (read mostly)
    • Typically cannot share resources with non-Hadoop workloads
    • Cannot reuse existing infrastructure; requires different infrastructure
  • 35.
    IBM Data Engine for Analytics (IDEA) overview (2 of 2)
    Value Proposition
    • Compute-only servers with shared storage
    • Single replica of data
    • Better suited for workloads that tend to have a significant write component
    • Ideal for complex analytic solutions where additional components may need to interact with BigInsights
    Composed of several integrated components
    • Compute: POWER8 compute nodes, 2 or more
    • Storage: IBM Elastic Storage Server with an advanced distributed and parallel file system (Spectrum Scale based)
    • File system accessed over the network using the Spectrum Scale (GPFS) protocol
    • Networking: Ethernet and optionally InfiniBand (no Fibre Channel)
    • System Software: Linux; cluster provisioning and management using PCM and xCAT
    • Middleware: Platform Symphony and BigInsights
  • 36.
  • 37.
    Power Hadoop/Spark Solutions
    Roll Your Own
    IBM Integrated Offerings:
    IBM Data Engine for Hadoop and Spark (IDE-HS)
    • Point solution
    • Classic Spark or Hadoop architecture
    • Open Platform for Apache Hadoop
    • Optional IBM value-adds
    • Based on Power LC
    IBM Data Engine for Analytics (IDEA)
    • Enterprise solution
    • Shared external storage (GPFS)
    • BigInsights
    • Platform Computing for resource and cluster management
    • Based on Power L
  • 38.
    IBM Data Engine for Analytics Architecture
    • Hadoop Management: 2 VMs each
    • Edge Nodes (Data Ingress)
    • Data Nodes: 1-2 VMs each
    • Physical Cluster Management
    • Shared High Bandwidth Storage
    • High Speed Network
    + Pre-assembled
    + On-site Services
  • 39.
    Physical Cluster Management
    • Power S812L, RHEL 6.5, 10 cores, 32 GB memory
    • Platform Cluster Manager 4.2 Advanced Edition, xCAT 2.9
    • Single LPAR, HA server option available
    Hadoop Management
    • Power S822L, RHEL 6.5, 24 cores, 256 GB memory
    • BigInsights 4.1, Platform Symphony 6.1.1, Platform Cluster Manager 4.2 Standard Edition, Spectrum Scale 4.1 Client
    • Minimum 2 servers, 2 LPARs each
    Data Nodes
    • Power S822L, RHEL 6.5, 24 cores, 256 GB memory
    • BigInsights 4.1, Platform Symphony 6.1.1, Platform Cluster Manager 4.2 Standard Edition, Spectrum Scale 4.1 Client
    • User-specified number of servers, 1 LPAR each; 2 LPARs when running Big SQL
    Shared Storage
    • IBM Elastic Storage Server models GL2, GL4, GL6, GS2, GS4 or GS6
    • GSS 2.2 software, GSS management/maintenance, Spectrum Scale 4.1.1 TL1 Server
    • User-specified number of storage servers
    Edge Nodes (Opt)
    • Power S822L, RHEL 6.5, 24 cores, 256 GB memory
    • BigInsights 4.1, Platform Symphony 6.1.1, Platform Cluster Manager 4.2 Standard Edition, Spectrum Scale 4.1 Client
    • User-specified number of edge nodes, 1 or 2 LPARs each
    Network (Opt)
    • Mellanox 10 or 40 Gb RoCE Ethernet, or Mellanox 56 Gb InfiniBand
  • 40.
    IBM Power Systems Apache Hadoop Ecosystem
  • 41.
    Platform Symphony Differentiation vs Open Source
    YARN
    • Monitors CPU and memory only
    • XML file to set up prioritization/scheduling strategy
    • Limited pre-emption available with the FAIR scheduler. The Capacity scheduler does not have pre-emption, so 100% elasticity is unwise.
    • Cron to change strategy by time of day
    Platform Symphony
    • 50% faster time to insights with MapReduce
    • Deeper performance insight with visualizations of 150 machine metrics to support tuning & planning
    • GUI-based setup/administration & visual validation of policies
    • Advanced pre-emption: pre-empt the least running jobs; round robin pre-emption
    • Different resource strategy by time of day
    • Showback reports
  • 42.
    Spectrum Scale will enhance your Hadoop Environment!
    Hadoop HDFS
    • HDFS NameNode HA added in version 2.0; NameNode HA in active/passive configuration
    • Difficulty to ingest data: special tools required
    • Lacking enterprise readiness
    • Large block sizes: poor support for small files
    • Compute and storage tightly coupled, leading to very low CPU utilization
    • Single-purpose, Hadoop MapReduce only
    • Non-POSIX file system: obscure commands. Does not support in-place updates.
    IBM Spectrum Scale
    • No single point of failure; distributed metadata in active/active configuration since 1998
    • Ingest data using policies for data placement
    • Enterprise ready with support for advanced storage features (encryption, DR, replication, SW RAID, etc.)
    • Variable block sizes: suited to multiple types of data and metadata access patterns
    • Scale compute and storage independently (policy-based ILM)
    • Versatile, multi-purpose, hybrid storage (locality and shared)
    • POSIX file system: easy to use and manage