This presentation details the sources of big data, the value of big data, what to do with big data, and the platforms, infrastructures and architectures for big data analytics
A Technical Introduction to Big Data Analytics
Pethuru Raj PhD
Infrastructure Architect
IBM Global Cloud Center of Excellence (CoE)
IBM India, Bangalore
E-mail: peterindia@gmail.com
The Classification of the IT Trends
• The Technology Space - There is a cornucopia of technologies (Computing, Connectivity,
Miniaturization, Middleware, Sensing, Actuation, Perception, Analyses, Knowledge Engineering, etc.)
• The Process Space – With new kinds of services, applications, data, infrastructures, and devices
joining into the mainstream IT, fresh process consolidation, orchestration, governance and
management mechanisms are emerging. That is, process excellence is the ultimate aim
• Infrastructure Space – Infrastructure consolidation, convergence, centralization, federation,
automation and sharing methods clearly indicate the infrastructure trends in the computing and
communication disciplines. Physical infrastructures are turning into virtual infrastructures. Two major
infrastructural types are
• System Infrastructure (Compute, Storage, & Network)
• Application Infrastructure – Integration Backbones, Platforms (Design, Development, Deployment,
Delivery, Management, etc.), Messaging Middleware, Databases (SQL and NoSQL), etc.
• Architecture Space – Service oriented architecture (SOA), event-driven architecture (EDA), model-
driven architecture (MDA), resource oriented architecture (ROA) and so on are the leading
architectural patterns
• The Device Space is fast evolving (Slim & Sleek, handy & trendy, mobile, wearable, implantable,
portable, etc.). Everyday machines are connected to one another as well as to the remote Web / Cloud
• Data Space – Data are being produced in an automated and massive manner
The Tectonic Trends Towards the Ensuing Knowledge Era
1. Data is being positioned as a strategic asset for any organization
2. Analytics has become an important ingredient for worldwide business enterprises to
Strategize and Plan Ahead
Make Informed Decisions
Proceed with Confidence and Clarity (Insights-driven Enterprises)
With the arrival of newer technologies, the capabilities and competencies of
analytics have been consistently on the climb.
In sync with big data platforms and infrastructures, big insights will become the
norm for worldwide organizations
For any Strategic and Sustainable Transformation
Leverage Data Assets Insightfully
Optimize Infrastructure Technologically
Innovate Processes Consistently
Assimilate Architectures Appropriately
Choose Technologies Carefully
Ensure Accessibility, Simplicity & Consumability Cognitively
• The Deeper and Broader Integration Produces Big Data
• Device to Device (D2D) Integration
• Device to Enterprise (D2E) Integration - In order to have remote and real-time
monitoring, management, repair, and maintenance, and for enabling decision-
support and expert systems, ground-level heterogeneous devices have to be
synchronized with control-level enterprise packages such as ERP, SCM, CRM,
KM etc.
• Device to Cloud (D2C) Integration - As most of the enterprise systems are
moving to clouds, device to cloud (D2C) connectivity is gaining importance.
• Cloud to Cloud (C2C) Integration – Disparate, distributed and decentralised
clouds are getting connected to provide better prospects
The Unequivocal Result: the Data-driven World
Business Transactions, Interactions, Operations, and Analytical data
System Infrastructure Log files
Social & People data
Customer, Product, Sales and other business data
Machine and Sensor Data
Scientific Experimentation & Observation Data (Genetics, Particle
Physics, Climate Modeling, Drug Discovery, etc.)
Why is Big Data Strategically Significant for Businesses?
Big Data brings in
Enhanced Business Value through better performance and productivity
Bigger and Bigger Insights through a host of newer Analytics and Use Cases
Big Data → Big Insights
Aggregate all kinds of distributed, different and decentralized data
Analyze the formatted and formalized data
Articulate the extracted actionable intelligence
Act based on the insights delivered and raise the bar for futuristic analytics
(Real-time, predictive, prescriptive and personal analytics)
Accentuate business performance and productivity
The Drivers for Big Data Analysis
1. There is an Exponential Growth in Data Generation due to
◦ The continued increase in diverse and distributed data sources
2. The Maturity, Stability and Convergence of Technologies – Data Virtualization, Management,
Storage, Transmission, Analysis and Visualization Techniques, Tips, and Tools
3. The Massive Adoption and Adaptation of Cloud Infrastructures (Compute, Storage and Network)
4. The Realization of more comprehensive, accurate, and speedier Knowledge Discovery and
Dissemination Platforms and Processes
5. Enhanced Business Value
6. Newer Types of Analytics
◦ Domain-specific Analytics (Customer Sentiment, Social, Security, Retail, Fraud Detection
Analysis, etc.) and
◦ Generic Analytics (Predictive, Prescriptive, High-Performance, Real-time, Smarter
Analytics, etc.)
Machine Data Analytics - Use Cases
Here are a few ROI examples from a 1% improvement in productivity across different industries:
Commercial aviation industry — a 1% improvement in fuel savings would yield savings of $30
billion over 15 years.
Utilities — in the global gas-fired power plant fleet, a 1% improvement could yield a $66 billion
savings in fuel consumption.
Global health care industry — a 1% efficiency gain from the reduction of process inefficiencies
globally could yield more than $63 billion in health care savings.
Railway networks — a 1% improvement in freight moved across the world's rail networks could
yield another $27 billion in fuel savings.
Upstream oil and gas exploration — a 1% improvement in capital utilization in upstream oil and
gas exploration and development could total $90 billion in avoided or deferred capital expenditures.
The convergence of intelligent devices, intelligent networks and intelligent decisioning (Insight vs. Hindsight
analytics) is definitely paving the foundation for the next growth spurt or productivity gains.
Big Data Analytics: The Platforms
Analytical, Distributed, Scalable and Parallel Databases
Data Warehouses, Data Marts, etc.
In-Memory Systems (SAP HANA, etc.)
In-Database Systems (SAS, etc.)
Distributed File Systems (HDFS)
Hadoop Implementations (Cloudera, MapR, Hortonworks, Apache
Hadoop, DataStax, etc.)
NoSQL & Hybrid Databases
Parallel DBMS
Standard relational tables and SQL
◦ Indexing, compression, caching, I/O sharing
◦ Tables partitioned over nodes
◦ Transparent to the user
Meets performance expectations
◦ Needs highly skilled DBAs
Flexible query interfaces
◦ UDFs vary across implementations
Fault tolerance
◦ Does not score so well
Assumption: failures are rare
Assumption: dozens of nodes in clusters
MapReduce Programming Model & Hadoop Platforms
MapReduce is a programming model which specifies:
◦ A map function that processes a key/value pair to generate a set of intermediate key/value pairs,
◦ A reduce function that merges all intermediate values associated with the same intermediate key.
Hadoop comprises large-scale, distributed, elastic, and fault-tolerant data processing and storage
modules
◦ Is a MapReduce implementation for processing large data sets over 1000s of nodes.
◦ Maps and Reduces run independently of each other over blocks of data distributed across a
cluster
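The two functions above can be sketched as a single-process word count in Python — a toy illustration of the MapReduce model, not a Hadoop job (the input lines are made up for the example):

```python
from collections import defaultdict
from itertools import chain

def map_fn(_key, line: str):
    """Map: emit an intermediate (word, 1) pair for each word in the line."""
    return [(word, 1) for word in line.split()]

def reduce_fn(key: str, values):
    """Reduce: merge all intermediate values associated with the same key."""
    return key, sum(values)

lines = ["big data big insights", "big data analytics"]

# Shuffle/sort phase: group intermediate values by intermediate key.
groups = defaultdict(list)
for word, count in chain.from_iterable(map_fn(i, l) for i, l in enumerate(lines)):
    groups[word].append(count)

result = dict(reduce_fn(k, v) for k, v in groups.items())
print(result)  # {'big': 3, 'data': 2, 'insights': 1, 'analytics': 1}
```

In Hadoop, the map calls run independently over distributed data blocks and the framework performs the shuffle across the cluster; the sequential loop here only mimics that grouping step.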
Why Hadoop?
Better application development productivity through a more flexible data model;
Greater ability to scale dynamically to support more users and data;
Improved performance to satisfy expectations of users wanting highly responsive
applications and to allow more complex processing of data.
Scalability to large data volumes:
◦ Scan 100 TB on 1 node @ 50 MB/sec = 23 days
◦ Scan on 1000-node cluster = 33 minutes
Divide-And-Conquer (i.e., data partitioning)
Cost-efficiency
◦ Commodity nodes (cheap, but unreliable)
◦ Commodity network
◦ Automatic fault-tolerance (fewer administrators)
◦ Easy to use (fewer programmers)
Provides fault tolerance
Works in heterogeneous environments
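The scan-time figures above follow directly from dividing data volume by aggregate scan bandwidth; a quick check of the arithmetic, assuming the stated 50 MB/sec per node:

```python
def scan_days(terabytes: float, nodes: int, mb_per_sec: float = 50.0) -> float:
    """Days needed to scan `terabytes` of data spread evenly over `nodes` machines."""
    seconds = terabytes * 1_000_000 / (nodes * mb_per_sec)  # 1 TB = 1e6 MB
    return seconds / 86_400  # seconds per day

print(round(scan_days(100, 1), 1))             # ~23.1 days on a single node
print(round(scan_days(100, 1000) * 24 * 60))   # ~33 minutes on a 1000-node cluster
```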
NoSQL Databases
NoSQL encompasses a wide variety of different database technologies and were developed in response
to a rise in the volume of data stored about users, objects and products, the frequency in which this data
is accessed, and performance and processing needs.
Document databases pair each key with a complex data structure known as a document. Documents
can contain many different key-value pairs, or key-array pairs, or even nested documents.
Graph stores are used to store information about networks, such as social connections. Graph stores
include Neo4J and HyperGraphDB.
Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an
attribute name (or "key"), together with its value. Examples of key-value stores are Riak and Voldemort.
Some key-value stores, such as Redis, allow each value to have a type, such as "integer", which adds
functionality.
Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and
store columns of data together, instead of rows.
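The contrast between a key-value record and a document can be sketched with plain Python structures; the user and order fields below are illustrative, not tied to any particular product:

```python
import json

# Key-value store view: the value is an opaque blob the database cannot inspect.
kv_store = {"user:1001": "Alice,alice@example.com"}

# Document store view: the value is a structured JSON document whose fields
# (including nested key-array pairs) the database can index and query.
doc_store = {
    "user:1001": {
        "name": "Alice",
        "email": "alice@example.com",
        "orders": [                       # a nested key-array pair
            {"id": "o-1", "total": 42.5},
            {"id": "o-2", "total": 13.0},
        ],
    }
}

print(json.dumps(doc_store["user:1001"], indent=2))
```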
Cassandra (Facebook) (CQL is the query language)
BigTable (Google)
Dynamo (Amazon)
Riak (SoftLayer) (Apache Lucene)
MongoDB
CouchDB (UNQL is the query language)
Relational Vs. NoSQL Databases
SQL Databases:
The relational model takes data and separates it into many interrelated tables. Tables reference each
other through foreign keys.
The relational model minimizes the amount of storage space required, because each piece of data is
stored in only one place. However, space efficiency comes at the expense of increased complexity
when looking up data. The desired information must be collected from many tables (often hundreds
in today's enterprise applications) and combined before it can be provided to the application. When
writing data, the write needs to be coordinated and performed across many tables.
Developers generally use object-oriented programming languages to build applications. It is usually
most efficient to work with data in the form of an object with a complex structure consisting of
nested data, lists, arrays, etc. The relational data model provides a very limited data structure that
does not map well to the object model. Instead, data must be stored in and retrieved from tens or
even hundreds of interrelated tables. Object-relational frameworks provide some relief, but the
fundamental impedance mismatch still exists between the way an application would like to see its
data and the way it is actually stored in a relational database.
NoSQL Databases:
NoSQL databases have a very different model. For example, a document-oriented NoSQL database
takes the data you want to store and aggregates it into documents using the JSON format. Each JSON
document can be thought of as an object to be used by your application. A JSON document might,
for example, take all the data stored in a row that spans 20 tables of a relational database and
aggregate it into a single document/object.
Aggregating this information may lead to duplication of information, but since storage is no longer
cost-prohibitive, the resulting data model flexibility, the ease of efficiently distributing the resulting
documents, and the read and write performance improvements make it an easy trade-off for
web-based applications.
Document databases can store an entire object in a single JSON document and support complex
data structures. This makes it easier to conceptualize data as well as to write, debug, and evolve
applications, often with fewer lines of code.
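The aggregation step described above — collapsing a row that spans several relational tables into one document — can be sketched as follows; the customer/address/order tables and field names are hypothetical:

```python
import json

# Hypothetical rows from three interrelated relational tables.
customers = {1: {"name": "Alice"}}
addresses = {1: {"customer_id": 1, "city": "Bangalore"}}
orders = [
    {"id": "o-1", "customer_id": 1, "total": 42.5},
    {"id": "o-2", "customer_id": 1, "total": 13.0},
]

def to_document(cid: int) -> dict:
    """Aggregate one customer's rows into a single denormalized document."""
    return {
        "customer_id": cid,
        "name": customers[cid]["name"],
        "address": {"city": addresses[cid]["city"]},
        "orders": [o for o in orders if o["customer_id"] == cid],
    }

doc = to_document(1)
print(json.dumps(doc, indent=2))
```

Note the trade-off the slide mentions: the document duplicates data that normalization would store once, in exchange for reading the whole object in a single lookup.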
Relational Vs. NoSQL Databases
SQL Databases:
Relational technology requires strict definition of a schema prior to storing any data in the database.
Changing the schema once data has been inserted is a big deal. Want to start capturing new
information not previously considered? Want to make rapid changes to application behavior requiring
changes to data formats and content? With relational technology, changes like these are extremely
disruptive and frequently avoided.
An RDBMS supports scale-up, implying the fundamentally centralized, shared-everything architecture
of relational database technology.
Enhancement techniques include
1. Sharding
2. Denormalization
3. Distributed caching
NoSQL Databases:
NoSQL databases, especially document databases, are typically schemaless, allowing you to freely
add fields to JSON documents without having to first define changes. The format of the data being
inserted can be changed at any time, without application disruption. This allows application
developers to move quickly to incorporate new data into their applications.
NoSQL databases use a cluster of standard physical or virtual servers to store data and support
database operations. They support the following:
Auto-sharding
Data Replication
Distributed query support – "Sharding" a relational database can reduce, or in certain cases
eliminate, the ability to perform complex data queries. NoSQL database systems retain their full
query expressive power even when distributed across hundreds of servers.
Integrated caching – Transparently cache data in system memory. This behavior is transparent to
the application developer and the operations team, in contrast to relational technology, where a
caching tier is usually a separate infrastructure tier that must be developed against, deployed on
separate servers, and explicitly managed by the ops team.
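Sharding — whether applied manually to a relational database or performed automatically by a NoSQL cluster — typically routes each record to a node by hashing its key. A minimal sketch, where the node names and key format are illustrative assumptions:

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical cluster members

def shard_for(key: str, nodes=NODES) -> str:
    """Route a record key to a shard via a stable hash of the key."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

# Every lookup for the same key lands on the same node:
assert shard_for("customer:42") == shard_for("customer:42")
```

Simple modulo hashing reshuffles most keys when the node count changes; production systems therefore prefer consistent hashing or range partitioning for auto-sharding.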
Big Data Analytics – The Emerging Infrastructures
Analytic, Scalable, Parallel and Distributed Databases & Data Warehouses –
Hardware Appliances (MPP and SMP)
In-Memory Compute Infrastructures (SAP HANA on IBM Power 7)
In-Database Compute Infrastructures (SAS on Teradata, etc.)
Expertly Integrated Systems (IBM PureData System for Hadoop, Analytics,
etc.)
Clouds (public, private and hybrid) comprising bare-metal servers and
virtual machines (VMs)
In-Memory Data Grid (IMDG)
An IMDG is a distributed non-relational data or object store. It can be distributed to
span more than one server.
Reading from memory is more than 3,300 times faster than reading from disk. A
simple calculation would suggest that if it takes an hour to read a set of information
from disk, it would take just over a second to read it from memory.
This approach brings data to the cloud, where the application can interact with it,
while the application is completely shielded from the complexity of having to persist
or replicate data back to the on-premises store.
The use of an IMDG also means that while the data is available in the cloud, it is
only available in memory and is never stored on a disk in the cloud.
IMDGs usually support linear scaling to support high loads, data partitioning,
redundancy, and automatic data recovery in case of failures.
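The partitioning, redundancy, and recovery behavior described above can be illustrated with a toy in-memory grid built from plain dictionaries; all names are illustrative, and real IMDGs distribute these structures across separate servers rather than one process:

```python
class ToyIMDG:
    """Minimal in-memory data grid: hash-partitioned dicts with one backup copy."""

    def __init__(self, partitions: int = 4):
        self.primaries = [{} for _ in range(partitions)]
        self.backups = [{} for _ in range(partitions)]

    def _pid(self, key) -> int:
        return hash(key) % len(self.primaries)

    def put(self, key, value):
        pid = self._pid(key)
        self.primaries[pid][key] = value
        # Synchronously copy to the "next" partition, mimicking redundancy.
        self.backups[(pid + 1) % len(self.primaries)][key] = value

    def get(self, key):
        return self.primaries[self._pid(key)].get(key)

    def fail_partition(self, pid: int):
        """Simulate losing a node, then rebuild its data from the backup copy."""
        self.primaries[pid] = dict(self.backups[(pid + 1) % len(self.primaries)])

grid = ToyIMDG()
grid.put("session:9", {"user": "alice"})
grid.fail_partition(grid._pid("session:9"))  # lose the primary copy...
print(grid.get("session:9"))                 # ...the data survives via the backup
```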
Why Big Data Analytics in Clouds?
Agility & Affordability – No large capital investment in infrastructure; just use
and pay
Hadoop Platforms in Clouds – Deploying and using any Hadoop platform (generic or
specific, open-source or commercial-grade, etc.) is fast
NoSQL Databases in Clouds – NoSQL databases are made available in clouds
WAN Optimization Technologies – There are WAN optimization products and
platforms for efficiently transmitting data over the Internet infrastructure
Business Applications in Clouds – With enterprise information systems (EISs), high-
performance computing systems, data storage, and social, device and sensor clouds
going up in public clouds, big data analytics in remote, Internet-scale clouds
makes sense
Cloud Integrators, Brokers & Orchestrators – There are products and platforms for
seamless interoperability among different and distributed systems, services and data
Entering into the Hybrid World
1. The Traditional Analytical Systems (Data Warehouses) Vs. the
Big Data Analytical Systems (Hadoop)
2. The Traditional Databases (RDBMS) Vs. the NoSQL
Databases
3. The Scalable, Distributed, Parallel RDBMS Vs. the NoSQL
Databases
Big Data Analytics: the Summary
Digitalization, service-enablement, extreme connectivity, distribution,
commoditization, consumerization, industrialization, etc. are the
brewing trends towards big data
Data Volume, Variety, Velocity and Variability are on the rise, signalling
a heightened Data Value. This development is due to the diversity
and multiplicity of data sources.
Data Capturing, Transmission, Cleansing, Filtering, Formatting, and
Storage Tasks, Tools, and Technologies are maturing fast
Big Data platforms, patterns, practices, products, processes and
infrastructures are being developed to streamline big data analytics