BIG Data Basics
Atulya Khobragade (atulyadk@gmail.com)
None of the content is original; it is sourced from different articles. Credit for the information included here
goes entirely to the original sources and authors. I have only compiled the information in the way I thought was best.
Big Data is not only about the sheer size of data
• Used to improve analytics and statistics
• The ability to analyze large volumes of multi-structured data from sources such as databases, websites, blogs, social media and sensors
• Efficient architectures that run in parallel, are highly scalable, and can manage, process and analyze data up to several petabytes
The five Vs of Big Data
• VOLUME: huge volumes of data can be processed
• VELOCITY: the speed of data transfer (importing/exporting) is increased
• VERACITY: helps achieve a high degree of trustworthiness in the data
• VARIETY: a large variety of data can be processed, stored and analyzed
• VALUE: useful insight can be extracted from the data
Applications
DIGITAL MARKETING OPTIMIZATION
• Web analytics
• Attribution
• Golden Path analysis
DATA EXPLORATION AND DISCOVERY
• Identifying new data-driven products
• New markets
FRAUD DETECTION AND PREVENTION
• Revenue protection
• Site integrity and uptime
SOCIAL NETWORK & RELATIONSHIP ANALYSIS
• Influencer marketing
• Outsourcing
• Attrition prediction
MACHINE-GENERATED DATA ANALYTICS
• Remote device insight
• Remote sensing
• Location-based intelligence
Big data usage by industry
Manufacturing
• SCM (supply chain management)
• Customer care call centers
• Preventive maintenance
• CRM
Telecommunications
• Improved network performance
• Call detail records analysis
• CRM
• New product creation
Energy
• Smart meters
• Condition-based maintenance
• Distribution and load forecasting
Common Big data sources
Social network profiles
• Traffic, logins, registrations
• Page visits, behaviour in response to marketing
Social influencers
• Things or activities affecting the customer or the end user
Activity-generated data
• Continuous monitoring of automated processes across several machines, locations or industries
SaaS & Cloud apps
• Data from cloud servers
Public web information
MapReduce results
• Output of complex data analysis performed with MapReduce
Data warehouse appliances
NoSQL databases
Network and in-stream monitoring technologies
• Data from network routers
Legacy documents
• Very old data, or data stored in old and obsolete systems
New analytics
(massively parallel processing and algorithms)
Evolution of the data warehouse:
• OLTP as the data warehouse
• Proprietary and dedicated data warehouse
• General-purpose data warehouse
• Enterprise data warehouse
• Logical data warehouse
OLTP (On-line Transaction Processing) is
characterized by a large number of short online
transactions (INSERT, UPDATE, DELETE). The main
emphasis for OLTP systems is very fast query
processing, maintaining data integrity in multi-access
environments, and effectiveness measured in
transactions per second. An OLTP database holds
detailed, current data, and the schema used to store
transactional data is the entity model (usually 3NF).
OLAP (On-line Analytical Processing) is characterized by
a relatively low volume of transactions. Queries are often
very complex and involve aggregations. For OLAP
systems, response time is the effectiveness measure.
OLAP applications are widely used in data mining.
An OLAP database holds aggregated, historical data,
stored in multidimensional schemas (usually a star
schema).
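The contrast above can be sketched in a few lines of SQL driven from Python. This is a minimal, illustrative example only: the single-table schema and the sample rows are hypothetical, and a real OLTP system would normalize data across many tables.

```python
import sqlite3

# In-memory database for illustration; table and data are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL, day TEXT)"
)

# OLTP side: many short INSERT transactions over detailed, current data.
rows = [(1, "alice", 20.0, "2024-01-01"),
        (2, "bob",   35.0, "2024-01-01"),
        (3, "alice", 15.0, "2024-01-02")]
cur.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
conn.commit()

# OLAP side: a complex aggregation over the accumulated history.
cur.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
)
totals = cur.fetchall()
print(totals)  # [('alice', 35.0), ('bob', 35.0)]
```

The OLTP workload is measured by how many of those small writes commit per second; the OLAP workload is measured by how quickly the aggregation query answers.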
Elements of the logical data warehouse
Repository Management
• Data virtualization supports a broad range of data warehouse extensions.
Data Virtualization
• Data virtualization virtually integrates data within the enterprise and beyond.
Distributed Processes
• Data virtualization integrates big data sources such as Hadoop and enables integration with distributed processes performed in the cloud.
Auditing Statistics and Performance Evaluation Services
• Data virtualization provides the data governance, auditability and lineage required.
Service Level Agreement Management
• Data virtualization’s scalable query optimizers and caching deliver the flexibility needed to ensure SLA performance.
Taxonomy / Ontology Resolution
• Data virtualization also provides an abstracted, semantic-layer view of enterprise data across repository-based, virtualized and distributed sources.
Metadata Management
• Data virtualization leverages metadata from data sources as well as internal metadata needed to automate and control key logical data warehouse functions.
Storage trends
Object storage
• Audio data (unstructured) + metadata
• Video data (unstructured) + metadata
• Any data (unstructured) + metadata
Distributed file systems
• Permanent storage of data in logical units (files, blocks etc.)
• Support access to files on remote servers, concurrency, distribution and replication of data
• Examples: Network File System (NFS), General Parallel File System (GPFS), Hadoop Distributed File System (HDFS), GlusterFS
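The object-storage model above, where each object is opaque data plus user-defined metadata addressed by a key, can be sketched with a plain dictionary. The store, keys, and metadata fields here are all invented for illustration; real object stores expose the same key-to-(data, metadata) shape through network APIs.

```python
# Toy object store: key -> {opaque bytes, user-defined metadata}.
# No directory hierarchy -- objects are addressed only by key.
store = {}

def put_object(key, data, metadata):
    """Store opaque bytes together with their descriptive metadata."""
    store[key] = {"data": data, "metadata": metadata}

def get_object(key):
    """Retrieve the object (data + metadata) by its key."""
    return store[key]

# An unstructured audio blob only becomes findable via its metadata.
put_object("clip-001", b"\x00\x01\x02", {"type": "audio", "structured": False})
obj = get_object("clip-001")
print(obj["metadata"]["type"])  # audio
```

The metadata is what makes otherwise unstructured blobs searchable and manageable, which is why each item in the slide pairs its data with "+ metadata".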
• XML files: semi-structured data
• Word docs, PDF files, text files: unstructured data
• ERP & CRM: structured data
• E-mail: unstructured data
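The difference between semi-structured and unstructured data in the list above comes down to whether fields can be extracted mechanically. A short sketch, using made-up sample content: XML carries its own tags, so a parser can pull fields out directly, while plain text needs text analytics (here reduced to a naive keyword scan).

```python
import xml.etree.ElementTree as ET

# Semi-structured: the XML markup tells us where each field lives.
xml_doc = "<customer><name>Alice</name><city>Pune</city></customer>"
root = ET.fromstring(xml_doc)
name = root.findtext("name")  # recovered directly via the structure

# Unstructured: no schema, so extracting entities requires analysis;
# a naive keyword scan stands in for real text analytics here.
plain_text = "Alice wrote from Pune about her order."
mentions_alice = "Alice" in plain_text

print(name, mentions_alice)  # Alice True
```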
Big data’s customer requirements
E-COMMERCE
- Recommendation engines
- Ad targeting
- Search quality
- Abuse and click-fraud detection
TELECOMMUNICATIONS
- Customer churn prevention
- Network performance optimization
- Call detail record (CDR) analysis
- Network analysis to predict failure
HEALTHCARE & LIFE SCIENCES
- Health information exchange
- Gene sequencing
- Serialization
- Healthcare service quality improvements
- Drug safety
GOVERNMENTS
- Fraud detection and cyber security
- Welfare schemes
- Judiciary systems
BANKS & FINANCIAL INSTITUTIONS
- Modeling true risk
- Threat analysis
- Fraud detection
- Trade surveillance
- Credit sourcing and analysis
RETAIL
- Point-of-sale transaction analysis
- Customer churn analysis
- Sentiment analysis
Current analytics vs. Big data’s solution
Current analytics
• The data moves to the processing location through the compute grid
• A lot of data has to be archived, because the complete data cannot be moved to the computing location
• Involves a lot of importing/exporting of data
Big data’s solution
• A combined storage and compute layer
• The program comes to the storage location for computation
• This greatly reduces transfer time and increases speed
• This keeps the complete data live and avoids archiving
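The "program comes to the data" idea above can be sketched as a word count in the MapReduce style. The blocks and words below are invented for illustration: picture each string as a data block on a different storage node; the map function runs locally where each block lives, and only the small per-node summaries travel over the network to be merged.

```python
from collections import Counter
from functools import reduce

# Data blocks as they would sit on three separate storage nodes.
blocks = [
    "big data big value",
    "data moves less code moves more",
    "big code small transfer",
]

def map_block(block):
    """Map step: runs on the node holding the block; counts its words."""
    return Counter(block.split())

def merge_counts(a, b):
    """Reduce step: merges the small per-node summaries centrally."""
    return a + b

word_counts = reduce(merge_counts, (map_block(b) for b in blocks))
print(word_counts["big"], word_counts["data"])  # 3 2
```

Only the Counters cross the node boundary, never the raw blocks, which is exactly why moving the program to the data cuts transfer time.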
Thank you
