Big Data Analytics
Quite a few data analytics and visualization
tools are available on the market today from
leading vendors and products such as IBM, Tableau,
SAS, and World Programming System (WPS), along
with other analytics and statistics packages, to
help process and analyze your big data.
The Scope of Business
Intelligence
Smaller organizations:
Excel spreadsheets
Larger organizations:
Data mining, predictive
analytics, dashboards
What can you do with Business
Intelligence?
Business Intelligence Applications
• Multidimensional Analysis or Online Analytical
Processing (OLAP)
• Data Mining
• Decision Support Systems
Big Data - Definition and Concepts
• Big Data means different things to people with different
backgrounds and interests
• Traditionally, “Big Data” = massive volumes of data
– Example: the volume of data at CERN, NASA, Google, …
• Where does Big Data come from?
– Everywhere! Web logs, RFID, GPS systems, sensor networks,
social networks, Internet-based text documents, Internet search
indexes, call detail records, astronomy, atmospheric science,
biology, genomics, nuclear physics, biochemical experiments,
medical records, scientific research, military surveillance,
multimedia archives, …
Big Data - Definition and Concepts
• Big Data is a misnomer!
• Big Data is more than just “big”
• The Vs that define Big Data
– Volume
– Variety
– Velocity
– Veracity
– Variability
– Value
A High-Level Conceptual Architecture
for Big Data Solutions (by AsterData /
Teradata)
[Figure: Unified Data Architecture, system conceptual view.
Big Data sources (ERP, SCM, CRM, images, audio and video, machine
logs, text, Web and social) feed, through move / manage / access
layers and event processing, a data platform, an integrated data
warehouse, and a discovery platform. Analytic tools and apps (math
and stats, data mining, business intelligence, applications,
languages) serve the users: marketing, executives, operational
systems, frontline workers, customers, partners, engineers, data
scientists, and business analysts.]
Fundamentals of Big Data
Analytics
• Big Data by itself, regardless of the size, type, or
speed, is worthless
• Big Data + “big” analytics = value
• With the value proposition, Big Data also brought
about big challenges
– Effectively and efficiently capturing, storing, and
analyzing Big Data
– New breed of technologies needed (developed or
purchased or hired or outsourced …)
Big Data Considerations
• You can’t process the amount of data that you want to because of the
limitations of your current platform.
• You can’t include new/contemporary data sources (for example, social media,
RFID, sensor, Web, GPS, and textual data) because they do not comply with
the data storage schema.
• You need to (or want to) integrate data as quickly as possible to keep your
analysis current.
• You want to work with a schema-on-demand data storage paradigm
because of the variety of data types involved.
• The data is arriving so fast at your organization’s doorstep that your
traditional analytics platform cannot handle it.
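To make the schema-on-demand consideration concrete, here is a minimal sketch in which raw records are stored as-is and a schema is applied only when the data is read. The event contents, field names, and the helper `read_with_schema` are hypothetical and chosen purely for illustration.

```python
import json

# Hypothetical raw events from different sources, stored without a fixed schema.
RAW_EVENTS = [
    '{"source": "web", "user": "u1", "url": "/home", "ms": 120}',
    '{"source": "gps", "device": "d7", "lat": 36.1, "lon": -97.0}',
    '{"source": "rfid", "tag": "A-991", "reader": "dock-3"}',
]


def read_with_schema(raw_lines, fields):
    """Apply a schema at read time: keep records that contain every field."""
    for line in raw_lines:
        record = json.loads(line)
        if all(name in record for name in fields):
            # Project the record onto the requested fields only.
            yield {name: record[name] for name in fields}


# The same raw store can serve different questions with different schemas.
gps_points = list(read_with_schema(RAW_EVENTS, ["device", "lat", "lon"]))
sources = [json.loads(line)["source"] for line in RAW_EVENTS]
```

In this sketch, adding a new data source requires no change to the store itself, only a new field list at query time.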
Critical Success Factors for Big
Data Analytics
• A clear business need (alignment with the vision and the
strategy)
• Strong, committed sponsorship (executive champion)
• Alignment between the business and IT strategy
• A fact-based decision-making culture
• A strong data infrastructure
• The right analytics tools
• The right people with the right skills
Critical Success Factors for
Big Data Analytics
[Figure: Keys to Success with Big Data Analytics: a clear business
need; strong, committed sponsorship; alignment between the business
and IT strategy; a fact-based decision-making culture; a strong data
infrastructure; the right analytics tools; personnel with advanced
analytical skills.]
Enablers of Big Data Analytics
• In-memory analytics
– Storing and processing the complete data set in RAM
• In-database analytics
– Placing analytic procedures close to where data is stored
• Grid computing & MPP
– Use of many machines and processors in parallel (MPP -
massively parallel processing)
• Appliances
– Combining hardware, software, and storage in a single unit for
performance and scalability
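The idea behind in-database analytics can be sketched with Python's built-in `sqlite3` module standing in for a real analytic database. The `sales` table and its values are made up for illustration. The first approach pulls every row to the client before aggregating; the second lets the database compute the aggregate where the data is stored.

```python
import sqlite3

# An in-memory SQLite database stands in for the data store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)

# Client-side analytics: move all rows out of the database, then aggregate.
client_totals = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    client_totals[region] = client_totals.get(region, 0.0) + amount

# In-database analytics: the aggregation runs next to the data,
# and only the small result set leaves the database.
db_totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)

conn.close()
```

Both approaches give the same totals; the difference is how much data has to travel, which is what matters at Big Data scale.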
Challenges of Big Data Analytics
• Data volume
– The ability to capture, store, and process the huge volume of data in a
timely manner
• Data integration
– The ability to combine data quickly and at reasonable cost
• Processing capabilities
– The ability to process the data quickly, as it is captured (i.e., stream
analytics)
• Data governance (… security, privacy, access)
• Skill availability (… data scientist)
• Solution cost (ROI)
Business Problems Addressed by
Big Data Analytics
• Process efficiency and cost reduction
• Brand management
• Revenue maximization, cross-selling/up-selling
• Enhanced customer experience
• Churn identification, customer recruiting
• Improved customer service
• Identifying new products and market opportunities
Business Problems
Addressed by Big Data
Analytics
• Risk management
• Regulatory compliance
• Enhanced security capabilities
Big Data Technologies--MapReduce
• MapReduce distributes the processing of very large multi-
structured data files across a large cluster of ordinary
machines/processors
• Goal - achieving high performance with “simple” computers
• Developed and popularized by Google
• Good at processing and analyzing large volumes of multi-
structured data in a timely manner
• Example tasks: indexing the Web for search, graph analysis,
text analysis, machine learning, …
Big Data Technologies--
MapReduce
• How does MapReduce work?
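As a rough illustration of the question above, the following single-process sketch mimics the three logical phases of a MapReduce job (map, shuffle, reduce) on a word-count task. It is not Hadoop or Google's implementation; the function names are hypothetical, and a real cluster would run the map and reduce calls in parallel on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document plays the role of an input split handled by one mapper.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big analytics", "data value"])
```

Because each map call sees only its own split and each reduce call sees only one key, the work can be spread across ordinary machines without shared state.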
Big Data Technologies--Hadoop
• Hadoop is an open source framework for storing and analyzing
massive amounts of distributed, unstructured data
– Originally created by Doug Cutting and Mike Cafarella; much of
its early development took place at Yahoo!
• Hadoop clusters run on inexpensive commodity hardware so
projects can scale out inexpensively
– Hadoop is now a project of the Apache Software Foundation
– Open source - hundreds of contributors continuously
improve the core technology
Big Data Technologies--
Hadoop
• Hadoop Technical Components
– Hadoop Distributed File System (HDFS)
– Name Node (primary facilitator)
– Secondary Name Node (periodically checkpoints the Name Node’s
metadata; despite its name, it is not a hot standby)
– Job Tracker
– Slave Nodes (the grunts of any Hadoop cluster)
– Additionally, the Hadoop ecosystem is made up of a
number of complementary sub-projects: NoSQL
(Cassandra, HBase), DW (Hive), …
• NoSQL = not only SQL
Big Data and Data Warehousing
• What is the impact of Big Data on DW?
– Big Data and RDBMS do not go nicely together
– Will Hadoop replace data warehousing/RDBMS?
• Use Cases for Hadoop
– Hadoop as the repository and refinery
– Hadoop as the active archive
• Use Cases for Data Warehousing
– Data warehouse performance
– Integrating data that provides business value
– Interactive BI tools
Hadoop Versus Data Warehouse
When to Use Which Platform
Requirement                                            Data Warehouse   Hadoop
Low latency, interactive reports, and OLAP             ✓                -
ANSI 2003 SQL compliance is required                   ✓                ✓
Preprocessing or exploration of raw unstructured data  -                ✓
Online archives alternative to tape                    -                ✓
High-quality cleansed and consistent data              ✓                ✓
100s to 1,000s of concurrent users                     ✓                ✓
Discover unknown relationships in the data             -                ✓
Definition of big data
• Anything beyond the human and technical
infrastructure needed to support storage,
processing and analysis.
• Today’s BIG may be tomorrow’s NORMAL.
• Terabytes or petabytes or zettabytes of data.
• Arguably, it is about the 3 Vs: volume, velocity, and variety.
Definition of big data
[Figure: Big data is high-volume, high-velocity, and high-variety
information that calls for cost-effective, innovative forms of
information processing, leading to enhanced insight and decision
making.]
CHALLENGES WITH BIG DATA
• Data today is growing at an exponential rate. Most
of the data that we have today has been generated
in the last 2-3 years.
• Cloud computing and virtualization are here to
stay. Cloud computing is the answer to managing
infrastructure for big data as far as cost-efficiency,
elasticity, and easy upgrading/downgrading are
concerned. This further complicates the decision to
host big data solutions outside the enterprise.
CHARACTERISTICS OF DATA
• Composition:
the composition of data deals with the
structure of data, that is, the sources of
data, the granularity, the types, and the
nature of data as to whether it is static or
real-time streaming.
• Condition:
the condition of data deals with the state
of data, that is, “Can one use this data as is
for analysis?” or “Does it require cleansing
for further enhancement and enrichment?”
Classification of analytics
[Figure: More data produced, more data stored, more data analyzed,
better predictions; steady growth of analysis.]
SYMMETRIC MULTIPROCESSOR
SYSTEM (SMP)
• In a symmetric multiprocessor (SMP) system,
a single main memory is shared between two
or more identical processors, which are
controlled by a single operating system instance.
• Each processor has its own high-speed memory,
called cache memory, and the processors are
connected through a shared interconnect such
as a system bus.
Stream Analytics
A Use Case in Energy Industry
[Figure: Sensor data (energy production system status) from the
energy production system (traditional and renewable), meteorological
data (wind, light, temperature, etc.), and usage data (smart meters,
smart grid devices) from the energy consumption system (residential
and commercial) flow into data integration and temporary staging.
From there the data goes to a permanent storage area and to streaming
analytics (predicting usage, production, and anomalies), which
supports capacity decisions and pricing decisions.]
Big Data and Stream
Analytics
• Data-in-motion analytics and real-time data
analytics
– One of the Vs in Big Data = Velocity
• Analytic process of extracting actionable
information from continuously flowing data
• Why Stream Analytics?
– It may not be feasible to store the data, or the data may
lose its value if it is not analyzed right away
• Stream Analytics Versus Perpetual Analytics
• Critical Event Processing
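As a small illustration of analyzing data in motion, the sketch below processes readings one at a time and keeps only a short sliding window in memory rather than storing the whole stream. The function `detect_spikes`, the window size, the threshold factor, and the sample readings are all hypothetical choices for illustration, not a method taken from the slides.

```python
from collections import deque


def detect_spikes(readings, window=3, factor=2.0):
    """Yield (index, value) when a reading exceeds factor x the recent mean.

    Only the last `window` readings are retained, so the stream itself
    never has to be stored.
    """
    recent = deque(maxlen=window)
    for index, value in enumerate(readings):
        if len(recent) == window:
            baseline = sum(recent) / window
            if value > factor * baseline:
                yield index, value
        recent.append(value)


# Hypothetical smart-meter readings arriving one by one.
stream = iter([10, 11, 9, 10, 40, 12, 11])
alerts = list(detect_spikes(stream))
```

The spike at position 4 is flagged as soon as it arrives, which is the point of stream analytics: the decision is made while the data is still flowing.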

Big data unit 2
