JuanMa Rebes – IBM Systems and Solutions
2. • 2.5 quintillion bytes of data created every day
• 90% of the world's data was created in the last 2 years
• 35 zettabytes by 2020!
[Figure: IBM Big Data Platform, showing BI/Reporting and Exploration capabilities over an Information Integration & Governance layer]
Law Enforcement - Identify criminals and threats from video and audio feeds
Manufacturing - Analyze and correlate log records to improve service and predict failures
Telco - Address customer satisfaction, predict churn, and match promotions in real time
Healthcare - Detect life-threatening conditions at hospitals in time to intervene
Retail - Analyze multi-channel customer sentiment and experience
Financial Services - Make risk decisions based on real-time transactional data
Handling the large Volume, Variety, Velocity, and Veracity of
data to find new insights and improve business outcomes.
3. How to know if a big data solution is right for your organization
– Before making the decision to invest in a big data solution, evaluate the data
available for analysis; the insight that might be gained from analyzing it; and the
resources available to define, design, create, and deploy a big data platform.
Asking the right questions is a good place to start.
4. Can my current data environment be expanded?
– Are the current datasets very large — on the order of terabytes or petabytes?
– Does the existing warehouse environment contain a repository of all data
generated or acquired?
– Is there a significant amount of cold or low-touch data that is not being analyzed
to derive business insight?
– Do you have to throw data away because you are unable to store or process it?
– Do you want to be able to perform data exploration on complex and large
amounts of data?
– Do you want to be able to do analysis of non-operational data?
– Are you interested in using your data for traditional and new types of analytics?
– Are you trying to delay an upgrade to your existing data warehouse?
– Are you looking for ways to lower your overall cost of doing analytics?
Ask these questions to determine whether you can
augment the existing data warehouse platform
5. Is the data complexity increasing?
The variety of data might demand a big data solution if:
– The data content and structure cannot be anticipated or predicted.
– The data format varies, including structured, semi-structured, and unstructured.
– The data can be generated by users and machines in any format, for example:
Microsoft® Word files, Microsoft Excel® spreadsheets, Microsoft PowerPoint
presentations, PDF files, social media, web and software logs, email, photos and
video footage from cameras, information-sensing mobile devices, aerial sensory
technologies, genomics, and medical records.
– New types of data have emerged from sources that weren't previously mined for insight.
– Domain entities take on different meanings in different contexts.
Has the variety of data increased?
6. Is the data complexity increasing?
You may want to consider a big data solution if:
– The data is sized in petabytes and exabytes, and in the near future,
might grow to zettabytes.
– The data volume is posing technical and economic challenges to store,
search, share, analyze, and visualize using traditional methods, such as
relational database engines.
– The data processing can make use of massively parallel
processing power on available hardware.
Has the volume of data increased?
7. Is the data complexity increasing?
Consider whether your data:
– Is changing rapidly and must be responded to immediately
– Has overwhelmed traditional technologies and methods, which are no
longer adequate to handle data coming in real time
Has the velocity of the data increased or changed?
8. Is the data complexity increasing?
Consider a big data solution if:
– The authenticity or accuracy of the data is unknown.
– The data includes ambiguous information.
– It's unclear whether the data is complete.
Is your data trustworthy?
A big data solution might be appropriate if there is reasonable
complexity in the volume, variety, velocity, or veracity of the data.
For more complex data, assess any risks associated with
implementing a big data solution. For less complex data,
traditional solutions should be assessed.
9. What is the impact on existing IT governance?
– Security and privacy— In keeping with local regulations, what data can the
solution access? What data can be stored? What data must be encrypted
in motion? At rest? Who is allowed to see the raw data and the insights?
– Standardization of data— Are there standards governing the data? Is the data
in a proprietary format? Is some of the data in a non-standard format?
– Timeframe in which the data is available— Is the data available in a timeframe
that allows action to be taken in a timely fashion?
– Ownership of data— Who owns the data? Does the solution have appropriate
access and permission to use the data?
– Allowable uses— How is the data allowed to be used?
Consider the following governance-related issues in
the context of your situation
10. Are the right skills on board and the right people aligned?
Before undertaking a new big data project, make sure the right people are on board:
– Do you have buy-in from stakeholders and other business sponsors who are
willing to invest in the project?
– Are data scientists available who understand the domain, who can look at the
massive quantity of data and who can identify ways to generate meaningful and
useful insights from the data?
Specific skills are required to understand and analyze the requirements
and maintain the big data solution. These skills include industry
knowledge, domain expertise, and technical knowledge on big data tools
and technologies. Data scientists with expertise in modeling, statistics,
analytics, and math are key to the success of any big data initiative.
11. Using big data type to classify big data characteristics
– Analysis type. Whether the data is analyzed in real time or batched for later analysis.
Give careful consideration to choosing the analysis type, since it affects several other
decisions about products, tools, hardware, data sources, and expected data frequency.
A mix of both types may be required by the use case:
– Fraud detection; analysis must be done in real time or near real time
– Trend analysis for strategic business decisions; analysis can be in batch mode
– Processing methodology. The type of technique to be applied for processing data
(e.g., predictive, analytical, ad-hoc query, and reporting). Business requirements
determine the appropriate processing methodology. A combination of techniques can
be used. The choice of processing methodology helps identify the appropriate tools
and techniques to be used in your big data solution.
– Data frequency and size. How much data is expected, and at what frequency does it
arrive? Knowing frequency and size helps determine the storage mechanism, storage
format, and the necessary preprocessing tools. Data frequency and size depend on
the data sources:
– On demand, as with social media data
– Continuous feed, real-time (weather data, transactional data)
– Time series (time-based data)
12. Using big data type to classify big data characteristics
– Data type. Type of data to be processed — transactional, historical, master data, and
others. Knowing the data type helps segregate the data in storage.
– Content format. Format of incoming data — structured (RDBMS, for example),
unstructured (audio, video, and images, for example), or semi-structured. Format
determines how the incoming data needs to be processed and is key to choosing tools
and techniques and defining a solution from a business perspective.
– Data source. Sources of data (where the data is generated) — web and social media,
machine-generated, human-generated, etc. Identifying all the data sources helps
determine the scope from a business perspective. The figure shows the most widely
used data sources.
– Data consumers — A list of all of the possible consumers of the processed data:
– Business processes
– Business users
– Enterprise applications
– Individual people in various business roles
– Part of the process flows
– Other data repositories or enterprise applications
Big data classification
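These classification dimensions can be written down as a simple per-workload checklist. Below is a minimal Python sketch of such a record; the class, field names, and example values are illustrative assumptions for this document, not part of any IBM tooling.

    from dataclasses import dataclass
    from enum import Enum

    class AnalysisType(Enum):
        REAL_TIME = "real-time"   # e.g., fraud detection
        BATCH = "batch"           # e.g., trend analysis

    @dataclass
    class WorkloadProfile:
        """One record per use case, capturing the dimensions above."""
        name: str
        analysis_type: AnalysisType
        processing_methodology: list   # e.g., ["predictive", "reporting"]
        data_frequency: str            # "on demand", "continuous", "time series"
        data_type: str                 # "transactional", "historical", "master"
        content_format: str            # "structured", "semi-structured", "unstructured"
        data_sources: list             # "web and social media", "machine-generated", ...
        consumers: list                # "business users", "enterprise applications", ...

    fraud = WorkloadProfile(
        name="fraud detection",
        analysis_type=AnalysisType.REAL_TIME,
        processing_methodology=["predictive"],
        data_frequency="continuous",
        data_type="transactional",
        content_format="structured",
        data_sources=["machine-generated"],
        consumers=["business processes"],
    )

Filling in one such record per use case forces the product, tool, and hardware decisions the slides above describe to be made explicitly.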
14. Logical layers of a big data solution
– Big data sources
– Data massaging and store layer
– This layer is responsible for acquiring data from the data sources and, if
necessary, converting it to a format that suits how the data is to be analyzed. For
example, an image might need to be converted so it can be stored in a Hadoop
Distributed File System (HDFS) store or a Relational Database Management
System (RDBMS) warehouse for further processing. Compliance regulations and
governance policies dictate the appropriate storage for different types of data.
– Analysis layer
– Decisions must be made with regard to how to manage the tasks to:
– Produce the desired analytics
– Derive insight from the data
– Find the entities required
– Locate the data sources that can provide data for these entities
– Understand what algorithms and tools are required to perform the analytics.
– Consumption layer.
– The consumers can be visualization applications, human beings, business
processes, or services.
15. Logical layers of a big data solution
16. Big Data Sources - Enterprise Systems
– Enterprise Legacy Systems
– Customer relationship management systems
– Billing operations
– Mainframe applications
– Enterprise resource planning
– Web applications and other data sources augment the enterprise-owned
data. Such applications can expose the data using custom protocols and APIs.
– Data Management Systems (DMS)
– DMS store legal data, processes, policies, and various other kinds of
documents: Microsoft® Excel® spreadsheets, Microsoft Word documents
– These documents can be converted into structured data that can be used
for analytics. The document data can be exposed as domain entities or the
data massaging and storage layer can transform it into the domain entities.
17. Big Data Sources
– Data stores— Includes enterprise data warehouses, operational databases, and
transactional databases. Typically structured and consumed directly or
transformed easily to suit requirements. May or may not be stored in the
distributed file system, depending on the context of the situation
– Smart devices— Smartphones, meters, healthcare devices... For the most part,
smart devices do real-time analytics, but the information stemming from smart
devices can be analyzed in batch, as well.
– Aggregated data providers— Huge volumes of data pour in, in a variety of
formats, produced at different velocities, and made available by various data
providers and sensors: geographical information, human-generated content,
and sensor data.
18. Data massaging and store layer
Because incoming data characteristics can vary, components in the data massaging
and store layer must be capable of reading data at various frequencies, in various
formats and sizes, and over various communication channels:
– Data acquisition— Acquires data from various data sources and sends the data to
the data digest component or stores it in specified locations. This component must be
intelligent enough to choose whether and where to store the incoming data. It must
be able to determine whether the data should be massaged before it can be stored or
if the data can be directly sent to the business analysis layer.
– Data digest— Responsible for massaging the data in the format required to achieve
the purpose of the analysis. This component can have simple transformation logic or
complex statistical algorithms to convert source data. The analysis engine determines
the specific data formats that are required. The major challenge is accommodating
unstructured data formats, such as images, audio, video, and other binary formats.
– Distributed data storage— Responsible for storing the data from data sources.
Often, multiple data storage options are available in this layer, such as distributed file
storage (DFS), cloud, structured data sources, NoSQL, etc.
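As a concrete illustration of the routing decision the data acquisition component makes, here is a minimal Python sketch. The format names, the "priority" flag, and the three callbacks are assumptions made for the example, not actual platform APIs.

    def needs_massaging(record: dict) -> bool:
        # Assumption: only these formats can bypass the digest step.
        return record.get("format") not in ("relational", "csv", "json")

    def acquire(record: dict, digest, store, analyze) -> None:
        """Route one incoming record, as described above."""
        if needs_massaging(record):
            record = digest(record)   # convert to the format the analysis expects
        store(record)                 # persist in distributed data storage
        if record.get("priority") == "real-time":
            analyze(record)           # hot data goes straight to the analysis layer

    # Example wiring with trivial stand-ins for the three components
    acquire({"format": "image", "priority": "real-time"},
            digest=lambda r: {**r, "format": "json"},
            store=print, analyze=print)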
19. Analysis layer
This is the layer where business insight is extracted from the data:
– Analysis-layer entity identification— Responsible for identifying and
populating the contextual entities. This is a complex task that requires efficient
high-performance processes. The data digest component should complement
this entity identification component by massaging the data into the required
format. Analysis engines will need the contextual entities to perform the analysis.
– Analysis engine— Uses other components (specifically, entity identification,
model management, and analytic algorithms) to process and perform the
analysis. The analysis engine can have various workflows, algorithms, and tools
that support parallel processing.
– Model management— Responsible for maintaining various statistical models
and for verifying and validating these models by continuously training the models
to be more accurate. The model management component then promotes these
models, which can be used by the entity identification or analysis engine.
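A toy Python sketch of this flow follows: a stand-in entity identification step feeds contextual entities to a model obtained from model management. The regex pattern, entity name, and scoring model are illustrative only; real engines use trained, continuously validated models.

    import re

    def identify_entities(text: str) -> dict:
        # Hypothetical contextual entity: customer IDs of the form C-12345
        return {"customer_ids": re.findall(r"C-\d{5}", text)}

    def analysis_engine(text: str, model) -> dict:
        entities = identify_entities(text)   # entity identification component
        return {"entities": entities, "score": model(entities)}

    # Stand-in "promoted model": scores by how many entities were found
    result = analysis_engine("order from C-10001 and C-10002",
                             model=lambda e: len(e["customer_ids"]))
    print(result)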
20. Consumption layer
The outcome of the analysis is consumed by customers, vendors, partners, and
internal users. Business processes can also be triggered. For example, the process
to create a new order if the customer has accepted an offer can be triggered
automatically, as can the process to block the use of a credit card if a customer
has reported fraud.
Another option is a recommendation engine (RE) that can match customers with the
products they like. An RE analyzes available information and provides personalized
and real-time recommendations.
For internal consumers, the ability to build reports and dashboards for business
users enables stakeholders to make informed decisions and to design appropriate
strategies. To improve operational effectiveness, real-time business alerts can be
generated from the data, and operational key performance indicators can be
monitored, as in the sketch below:
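A minimal Python sketch of threshold-based KPI alerting; the KPI names and thresholds are invented for illustration.

    KPI_THRESHOLDS = {"order_error_rate": 0.05, "avg_response_ms": 500}

    def check_kpis(snapshot: dict) -> list:
        """Return an alert message for every KPI that breaches its threshold."""
        alerts = []
        for kpi, limit in KPI_THRESHOLDS.items():
            value = snapshot.get(kpi)
            if value is not None and value > limit:
                alerts.append(f"ALERT: {kpi}={value} exceeds limit {limit}")
        return alerts

    print(check_kpis({"order_error_rate": 0.08, "avg_response_ms": 320}))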
21. Vertical layers
The aspects that affect all of the components of the logical layers (big data
sources, data massaging and storage, analysis, and consumption) are
covered by the vertical layers:
– Information integration
– Big data governance
– Systems management
– Quality of service
22. Vertical layers – Information Integration
Big data applications acquire data from various data origins, providers, and
data sources and store it in data storage systems such as HDFS and NoSQL
stores (MongoDB, for example). This vertical layer is used by various components
(data acquisition, data digest, model management, and transaction
interceptor, for example) and is responsible for connecting to various data
sources. Integrating information across data sources with varying
characteristics (protocols and connectivity, for example) requires quality
connectors and adapters. Accelerators are available to connect to most of
the known and widely used sources. These include social media adapters
and weather data adapters. This layer can also be used by components to
store information in big data stores and to retrieve information from big
data stores for processing. Most of the big data stores have services and
APIs available to store and retrieve the information.
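The connector-and-adapter idea can be pictured as one small interface that every data source implements, so downstream components stay source-agnostic. The interface and the social media adapter below are illustrative sketches, not the IBM accelerators mentioned above.

    from abc import ABC, abstractmethod
    from typing import Iterator

    class SourceAdapter(ABC):
        @abstractmethod
        def connect(self) -> None: ...

        @abstractmethod
        def read(self) -> Iterator[dict]: ...

    class SocialMediaAdapter(SourceAdapter):
        def connect(self) -> None:
            pass  # would authenticate against the provider's API

        def read(self) -> Iterator[dict]:
            yield {"source": "social", "text": "example post"}  # stand-in record

    def ingest(adapter: SourceAdapter, store) -> None:
        adapter.connect()
        for record in adapter.read():
            store(record)   # hand off to the data massaging and store layer

    ingest(SocialMediaAdapter(), store=print)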
23. Vertical layers – Big Data Governance
Strong guidelines and processes are required to monitor, structure, store, and
secure the data from the time it enters the enterprise through being processed,
stored, analyzed, and purged or archived. In addition to normal data governance
considerations, governance for big data includes additional factors:
– Managing high volumes of data in a variety of formats.
– Continuously training and managing the statistical models required to pre-
process unstructured data and analytics. Keep in mind that this is an
important step when dealing with unstructured data.
– Setting policy and compliance regulations for external data regarding its
retention and usage.
– Defining the data archiving and purging policies.
– Creating the policy for how data can be replicated across various systems.
– Setting data encryption policies.
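To make the archiving and purging policies concrete, here is a small Python sketch that assigns a disposition to a record based on its age and origin. The retention windows are invented placeholders; real values come from regulation and company policy.

    from datetime import timedelta

    RETENTION = {"external": timedelta(days=365),
                 "internal": timedelta(days=7 * 365)}

    def disposition(age: timedelta, origin: str) -> str:
        limit = RETENTION[origin]
        if age > limit:
            return "purge"      # past its retention window
        if age > limit / 2:
            return "archive"    # cold data moves to cheaper storage
        return "keep"

    print(disposition(timedelta(days=400), "external"))  # -> purge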
24. Vertical layers – Quality of Service (Data)
– Completeness in identifying all of the data elements required
– Timeliness for providing data at an acceptable level of freshness
– Accuracy in verifying that the data respects data accuracy rules
– Adherence to a common language (data elements fulfill the requirements
expressed in plain business language)
– Consistency in verifying that the data from multiple systems respects the data
consistency rules
– Technical conformance in meeting the data specification and information
architecture guidelines
– Data frequency. How frequently is fresh data available? Is it on-demand,
continuous, or offline?
– Size of fetch. This attribute helps define the size of data that can be fetched and
consumed per fetch
– Filters. Standard filters remove unwanted data and noise in the data and leave
only the data required for analysis.
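A few of these quality-of-service attributes (completeness, timeliness, and standard filters) can be expressed directly as checks on incoming records. The field names, freshness window, and filter rule in this Python sketch are assumptions for illustration.

    from datetime import datetime, timedelta

    REQUIRED_FIELDS = {"id", "timestamp", "value"}
    MAX_AGE = timedelta(minutes=15)   # acceptable level of freshness

    def complete(record: dict) -> bool:
        return REQUIRED_FIELDS <= record.keys()

    def timely(record: dict, now: datetime) -> bool:
        return now - record["timestamp"] <= MAX_AGE

    def standard_filter(records: list, now: datetime) -> list:
        # keep only complete, fresh records with a plausible value
        return [r for r in records
                if complete(r) and timely(r, now) and r["value"] >= 0]

    now = datetime.now()
    print(standard_filter([{"id": 1, "timestamp": now, "value": 3.2}], now))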
25. Vertical layers – Quality of Service (privacy and security)
– Data can originate from different regions and countries and must be
treated accordingly. Decisions must be made about data masking and
the storage of such data. Consider the following data access policies:
– Data availability / Criticality / Authenticity
– Data sharing and publishing
– Data storage and retention, including questions such as: Can the external
data be stored? If so, for how long? What kind of data can be stored?
– Constraints of data providers (political, technical, regional)
26. Vertical layers – Systems Management
– Critical for big data because it involves many systems across clusters
and boundaries of the enterprise:
– Managing the logs of systems, virtual machines, applications, and other devices
– Correlating the various logs and helping investigate and monitor the situation
– Monitoring real-time alerts and notifications
– Using a real-time dashboard showing various parameters
– Referring to reports and detailed analysis about the system
– Setting and abiding by service-level agreements
– Managing storage and capacity
– Archiving and managing archive retrieval
– Performing system recovery, cluster management, and network management
– Policy management
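The log correlation task can be illustrated in a few lines of Python: group log lines from different systems by a shared request identifier so that one transaction can be traced across the cluster. The "host|request_id|message" line format is an assumption for the example.

    from collections import defaultdict

    def correlate(log_lines: list) -> dict:
        by_request = defaultdict(list)
        for line in log_lines:
            host, request_id, message = line.split("|", 2)
            by_request[request_id].append((host, message))
        return by_request

    logs = ["web01|req-7|accepted", "hdfs03|req-7|block written"]
    print(correlate(logs)["req-7"])   # both events from the same request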
30. What insights are possible with big data technologies?
– Compliance and regulatory reporting
– Risk analysis and management
– CRM and customer loyalty programs
– Credit risk, scoring, and analysis
– High-speed arbitrage trading
– Trade surveillance
– Abnormal trading pattern analysis
Financial services
31. What insights are possible with big data technologies?
– Large-scale click-stream analytics
– Ad targeting, analysis, forecasting, and optimization
– Abuse and click-fraud prevention
– Social graph analysis and profile segmentation
– Campaign management and loyalty programs
Web and digital media
32. What insights are possible with big data technologies?
– Fraud detection
– Threat detection
– Compliance and regulatory analysis
– Energy consumption and carbon footprint management
33. What insights are possible with big data technologies?
– Health insurance fraud detection
– Campaign and sales program optimization
– Brand management
– Patient care quality and program analysis
– Medical device and pharmaceutical supply-chain management
– Drug discovery and development analysis
Health and life sciences
34. What insights are possible with big data technologies?
– Mashups: Mobile user location and precision targeting
– Machine-generated data
– Online dating: A leading online dating service uses sophisticated analysis
to measure the compatibility between individual members, so it can
suggest good matches
– Online gaming
– Predictive maintenance of aircraft and automobiles
Introducing… A NEW GENERATION OF IBM Power Systems
Open innovation to put data to work:
– flexible, fast execution of analytics
– large, fast workspace to maximize insight
– bring massive amounts of information to compute resources in real time
– more threads per core vs. x86
– more memory bandwidth vs. x86
– more I/O bandwidth than POWER7
Designed for Big Data - optimized performance
Delivering insights 82x faster
Optimized for a broad range of data and analytics.
82X is based on IBM internal tests as of April 17, 2014 comparing IBM DB2 with BLU Acceleration on Power with a comparably tuned competitor row store database server on x86 executing a materially identical 2.6TB BI workload in a
controlled laboratory environment. Test measured 60 concurrent user report throughput executing identical Cognos report workloads. Competitor configuration: HP DL380p, 24 cores, 256GB RAM, Competitor row-store database, SuSE Linux
11SP3 (Database) and HP DL380p, 16 cores, 384GB RAM, Cognos 10.2.1.1, SuSE Linux 11SP3 (Cognos). IBM configuration: IBM S824, 24 cores, 256GB RAM, DB2 10.5, AIX 7.1 TL2 (Database) and IBM S822L, 16 of 20 cores activated,
384GB RAM, Cognos 10.2.1.1, SuSE Linux 11SP3 (Cognos). Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment.
First generation of systems built on the POWER8 design, optimized for big data & analytics
Next-generation in-memory technology for analytics at the speed of thought
unstructured big data
Announcing new open innovation to put data to work
38. 82X faster
Power Systems can deliver insight to the point of impact
with big data & analytics accelerators
POWER8 with CAPI
1- Source: COCC Case Study http://bit.ly/1iQemuu
2- 82X claim: based on the same IBM internal tests as of April 17, 2014 described in the note above. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment.
3- 24:1 system consolidation ratio (12:1 rack density improvement) based on a single IBM S824, (24 cores, POWER8 3.5 GHz), 256GB RAM, AIX 7.1 with 40 TB memory based Flash replacing 24 HP DL380p, 24 cores, E5-2697 v2 2.7 GHz), 256GB
RAM, SuSE Linux 11SP3
IBM can help you build your solution on the platform
that was designed for big data & analytics
Business & Predictive Analytics
40. IBM Solution for Hadoop - Power Systems Edition
Higher ingest rates deliver 37% faster
insights than competitive Hadoop
solutions with 31% fewer data nodes.1
Better reliability and resiliency with
73% fewer outages and 92% fewer
performance problems than x86.2
PowerLinux 7R2 servers with DCS3700 storage
NEW. A storage-dense integrated
platform optimized to simplify and
accelerate unstructured big data & analytics.
Integrated platform solution for Hadoop
1) Based on STG Performance testing comparing to Cloudera/HP published benchmark
2) Solitaire Interglobal Paper - Power Boost Your Big Data Analytics Strategy – http://www.ibm.com/systems/power/solutions/assets/bigdata-analytics.html?LNK=wf
PowerLinux vs. x86 Big Data Architecture – Overview of Differences

Attribute         x86          PowerLinux      PowerLinux Advantages
Arch. type        Monolithic   Modular         1. Tailor compute/storage mix to achieve an ideal match for the workload  2. Monitor storage performance
Compute/storage   Fixed        Variable        1. Achieve top performance by tailoring compute/storage mix for actual workload characteristics
Storage           Internal     External        1. Central monitoring of I/O performance  2. Upgrade storage and compute separately
Hadoop copies     3            2               1. Use less raw storage  2. Fewer data nodes needed for equivalent capacity
RAID              RAID0        RAID5           1. Better failure resiliency  2. Better cluster performance during disk failure  3. Use fewer Hadoop copies
Network           10 Gb Eth    10 Gb Eth/TOR
42. 37% Faster Business Insights with BigInsights on Linux on Power
– 10 TB TeraSort benchmark
– Achieved a 4216 sec result for 10 TB, improving on the comparison result by 22%
43. Get Better Storage Efficiency with Big Data Running on Linux on
User Space (user data after
1 PB 1 PB
Hadoop copies 3 2
Temp space 25% 25%
Raw storage required 3.25 PB 2.25 PB
Data nodes required (4 TB
disks, 12 data disks/node)
Example sizing comparison:
For equivalent data volume and compute performance, big data running on Linux on
Power solution requires less raw storage and fewer data nodes, leading to significant
up front and TCO savings.
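The arithmetic behind the table can be checked with a short Python sketch. The raw storage figures follow directly from user space × Hadoop copies plus temp space; the node counts are derived estimates for illustration (the slide's own node values did not survive extraction), assuming 1 PB = 1000 TB and 12 × 4 TB data disks per node.

    import math

    def raw_storage_pb(user_pb: float, copies: int, temp_fraction: float) -> float:
        # raw = replicated user data + temp space reserved on top of user data
        return user_pb * copies + user_pb * temp_fraction

    def data_nodes(raw_pb: float, disk_tb: int = 4, disks_per_node: int = 12) -> int:
        # each node contributes disks_per_node * disk_tb of raw capacity
        return math.ceil(raw_pb * 1000 / (disk_tb * disks_per_node))

    for label, copies in (("x86 (3 copies)", 3), ("Linux on Power (2 copies)", 2)):
        raw = raw_storage_pb(1, copies, 0.25)
        print(f"{label}: raw = {raw} PB, data nodes ~ {data_nodes(raw)}")
    # x86: raw = 3.25 PB, ~68 nodes; Linux on Power: raw = 2.25 PB, ~47 nodes

Under these assumptions, cutting Hadoop copies from 3 to 2 (enabled by RAID5 external storage, per the comparison table above) is what removes about a petabyte of raw storage and roughly 20 data nodes.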