Grab some coffee and enjoy the pre-show banter before the top of the hour!
The Data Lake Survival Guide
Exploratory Webcast | October 26, 2016
SPONSORED BY
Presenting
Robin Bloor
Chief Analyst, The Bloor Group
@robinbloor robin.bloor@bloorgroup.com
Host: Eric Kavanagh
CEO, The Bloor Group
@eric_kavanagh eric.kavanagh@bloorgroup.com
Dez Blanchfield
Data Scientist, The Bloor Group
@dez_blanchfield dez.blanchfield@bloorgroup.com
Findings Webcast
January 12, 2017
Data Lake Survival Guide
Roundtable Webcast
December 8, 2016
Exploratory Webcast
October 26, 2016
Data Lake
Survival
Robin Bloor, PhD
The Sequence of Topics…
1  Disturbance in the Force
2  What is a Data Lake, exactly?
3  Streams and Events
1  Disturbance in the Force
The Generic Dimensions of IT
q  All IT involves 4 components (only)
q  Users
q  Software
q  Data
q  Hardware
q  They all relate to each other
q  Change any one of these and the other three components have to adjust
q  Aggregate these and you get a process
q  Time will impose change anyway
q  We can also consider:
q  Staff
q  Business Processes
q  Business Information
q  Facility
q  And also
q  People
q  Information
q  Human Activity
q  Civilization (Stuff)
Four Fundamental (IT) Factors
Hardware
Users
Software
Data
Business Information
Business Process
Human Activity
All Information
Staff
Facility
People
Civilization
TIME
The Technology Layers
§  The buying impulse descends through the stack
§  The impact of technology change rises up the stack
§  This ensures the eventual “legacification” of all technology
The Buying Impulse Goes Down
Technology Change Rises Up
The Technology Layers
Disruption in the Technology Layers
§  Disruption (as innovation) can happen in any layer
§  Where it occurs it will impact all layers above it
§  And it may also impact the layers below it (but less quickly)
§  There is no such thing as future-proof; but some technologies definitely live longer
§  Mainframe Computer (Batch architecture)
§  On-line Interaction (Centralized architecture)
§  PC (Client Server)
§  Internet (Multi-tier architecture)
§  Mobile (Service Oriented architecture)
§  Internet of Things (Event Driven Architecture)
Tech Revolutions
Note that all of these disruptive changes were driven by hardware innovation
Cloud
Centralized Computer Systems: limited process power; terminals only; few applications; no external data sources
PC Based Systems: moderate process power; PCs; spreadsheets & email; many applications; few external data sources
Integrated Systems: extensive process power; PCs & apps; analytics capability; wealth of applications; many external data sources
Parallelism: The Imp Out of the Bottle
u  Multicore chips enabled parallelism
u  It has changed the whole performance equation
u  It enabled Big Data
u  Big Data is really Big Processing
The Impact of Parallelism
We used to see 10x performance improvement every 6 years; now we see 1000x (and that’s just an approximation)
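To put the two rates on the same footing, a quick calculation converts each six-year multiple into an equivalent yearly factor. This is an illustrative sketch only: the function name `annual_factor` is made up here, and the 10x and 1000x inputs are the slide's own approximations.

```python
def annual_factor(multiple: float, years: float) -> float:
    """Yearly growth factor equivalent to `multiple` over `years` years."""
    return multiple ** (1.0 / years)


old_rate = annual_factor(10, 6)     # about 1.47x per year
new_rate = annual_factor(1000, 6)   # about 3.16x per year (the square root of 10)
print(f"{old_rate:.2f}x vs {new_rate:.2f}x per year")
```

On these figures, yearly improvement goes from under 1.5x to over 3x.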
Hardware Factors
q  CPUs, GPUs & FPGAs
q  Cross breeding
q  SoCs
q  3D XPoint and PCM (and memristor?)
q  SSDs & parallel access
q  Parallel hardware architectures
Performance is accelerating and costs continue to fall.
The Perfect Storm (Software)
q  The triumph of Open Source as a business model
q  The dominance of Apache
q  Hadoop, the platform for data
q  Spark, for speed
q  Kafka, for connectivity
q  The triumph of the cloud and its dominance
q  Little data is also big data
q  Cost challenges
Then the Data Lake evaporated into the Cloud
2  What is a Data Lake?
Everything in flux
u  Hardware (network, storage, servers)
u  Data Sources
u  Data Staging
u  Data Volumes
u  Data Flow
u  Data Governance
u  Data Usage
u  Data Structures
u  Schema definition
u  Ingest Speeds
u  Data Workloads
Hadoop Applications
The Scale Out Applications
§  Data Ingest & Staging
§  Data Governance
§  Software development platform
§  Analytics environment
§  Database/Data Warehouse
§  Data Archiving
§  Video rendering & other niche apps
The Data Lake involves just the first two and does not necessarily involve Hadoop
Data Lake, Refinery, Hub, in Overview
Think Logical, Implement Physical
The Data Lake Analytics Picture
Diagram labels: Data Sources; Data Streams; ETL; Staging Area (Hadoop); Round-Up and Wrangling; Data Cleansing; Data Lineage; MetaData Discovery; MetaData Mgt; MDM; Life Cycle Mgt; Service Mgt; Analytics; Data Warehouse or other location
How Data Gets to be Wrong
u  Accidentally born wrong
u  Deliberately born wrong
u  Defective sensor/data source
u  Murdered (truncated, overwritten)
u  Corrupted in flight (rare)
u  Corrupted by bad code (surely not!)
u  Corrupted by bad DBA
Data Governance
If data governance was important before Big Data (and it was), it is far more important in the era of Data Lakes
What Needs To Be Governed
Data Governance
  Data Flows and Data Storage
  Security & Access
  Data cleansing and transformation
  Data meaning
  Data provenance and lineage
  Data archive and disposal
  Availability and performance
Analytics Is a Process Not an Activity
q Data Analytics is a multi-disciplinary end-to-end process
q Until recently it was a walled garden. But the walls were torn down by…
§  Data availability
§  Scalable technology
§  Open source tools
q It is now becoming an integrated process
Data Governance is a process, not an activity!!
The Global Map and Data Options
u  Move the data to the processing
u  Move the processing to the data
u  Move the processing and the data
u  Shard
All network nodes can be data creators, data stores and processing points.
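As a rough illustration of the "Shard" and "move the processing to the data" options, the sketch below splits records across nodes with a stable hash and then computes a partial result on each shard before combining. The function names, the `device` and `value` fields, and the choice of SHA-256 are all assumptions made for the example, not anything taken from the deck.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_records(records, key_field, num_shards):
    """Distribute records into num_shards lists according to their key."""
    shards = [[] for _ in range(num_shards)]
    for rec in records:
        shards[shard_for(str(rec[key_field]), num_shards)].append(rec)
    return shards


records = [{"device": f"dev-{i}", "value": i} for i in range(10)]
shards = shard_records(records, "device", num_shards=3)

# Processing moves to the data: each shard computes a local total,
# and only the small partial results are combined centrally.
local_totals = [sum(r["value"] for r in shard) for shard in shards]
grand_total = sum(local_totals)
```

A stable hash matters here because every node must agree on where a given key lives across processes and restarts.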
Logical Data Lakes
Soon we will be speaking of a logical data lake and multiple physical data lakes
3  Events and Streams
Big Data, Event Data – The Data of Everything
WHAT IS BIG DATA?
Business data
Traditional data
Log file data
Operational data
Mobile data
Location data
Social network data
Public data
Commercial databases
Streaming data
Internet of Things
A TRANSACTION is a MOLECULE of ATOMIC EVENTS
The ATOM of data has become the EVENT
Events: Atoms and Molecules
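One way to picture the atom and molecule idea is a transaction object that is nothing more than a group of smaller events. The sketch below is a hypothetical example: the class names, the field names, and the funds-transfer scenario are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Event:
    """An atomic event: one thing that happened to one account."""
    name: str
    account: str
    amount: float  # signed: negative for a debit, positive for a credit


@dataclass
class Transaction:
    """A molecule: a group of atomic events that belong together."""
    tx_id: str
    events: List[Event] = field(default_factory=list)

    def is_balanced(self) -> bool:
        """True when the signed amounts of all events cancel out."""
        return abs(sum(e.amount for e in self.events)) < 1e-9


transfer = Transaction(
    "tx-1",
    [Event("debit", "acct-A", -50.0), Event("credit", "acct-B", 50.0)],
)
```

Storing the individual events rather than only the finished transaction keeps the finer-grained record available for later analysis.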
It’s Become an Event Based World
Events
Think of events as drops of water. They can live in streams, and they can also live in data pools and data lakes.
Two Data Flows
The Traffic Cop (Events)
Event Types
q  Instantiation Event
q  A State Report
q  A Trigger Event
q  A Correction Event
We also need to consider:
Data Refinement
Aggregations
Homogeneous Collections
Derived Data
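The four event types listed above could be modelled as an enumeration, with current state derived by replaying events in order. The deck does not define the types precisely, so the semantics in the comments, the dictionary layout, and the `pump-1` example are assumptions made for this sketch.

```python
from enum import Enum


class EventType(Enum):
    INSTANTIATION = "instantiation"  # assumed: an entity comes into existence
    STATE_REPORT = "state_report"    # assumed: the entity reports its current value
    TRIGGER = "trigger"              # assumed: something requests an action
    CORRECTION = "correction"        # assumed: an earlier report is amended


def apply_event(state, event):
    """Return a new state dict after applying one event (the input is not mutated)."""
    entity = event["entity"]
    kind = event["type"]
    new_state = {name: dict(rec) for name, rec in state.items()}
    if kind is EventType.INSTANTIATION:
        new_state[entity] = {"value": None, "actions": [], "corrected": False}
    elif kind is EventType.STATE_REPORT:
        new_state[entity]["value"] = event["value"]
    elif kind is EventType.CORRECTION:
        new_state[entity]["value"] = event["value"]
        new_state[entity]["corrected"] = True
    elif kind is EventType.TRIGGER:
        new_state[entity]["actions"] = new_state[entity]["actions"] + [event["action"]]
    return new_state


events = [
    {"type": EventType.INSTANTIATION, "entity": "pump-1"},
    {"type": EventType.STATE_REPORT, "entity": "pump-1", "value": 7.2},
    {"type": EventType.TRIGGER, "entity": "pump-1", "action": "open_valve"},
    {"type": EventType.CORRECTION, "entity": "pump-1", "value": 6.9},
]

state = {}
for e in events:
    state = apply_event(state, e)
```

Replaying the same events always rebuilds the same state, which is the practical appeal of treating events as the unit of record.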
§  The pulse and the threshold alert
§  Some of this involves distributed processing
§  There are known apps and unknown apps, so analytical exploration needs to be enabled
§  Only aggregations will migrate
Diagram labels: Sensors, controllers, CPUs; Source Proc.; Depot; Depot; Depot Proc.; Central Hub; Central Proc.; Data (three instances)
Event Based IoT Architecture
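The points about threshold alerts and "only aggregations will migrate" can be sketched as a depot that inspects raw readings locally and forwards just a summary plus any alerts. The function name, the summary fields, and the sample temperatures are hypothetical; the deck does not specify what a depot computes.

```python
from statistics import mean


def summarize_at_depot(readings, threshold):
    """Process raw readings locally; return (summary, alerts) for the central hub.

    The raw readings stay at the depot. Only the aggregate and any
    values above the threshold travel onward.
    """
    if not readings:
        raise ValueError("no readings to summarize")
    alerts = [r for r in readings if r > threshold]
    summary = {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": mean(readings),
    }
    return summary, alerts


summary, alerts = summarize_at_depot([20.5, 21.0, 35.2, 20.8], threshold=30.0)
```

Four readings collapse into one small summary and a single alert, which is why pushing processing toward the source reduces what has to move.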
u Time
u Geographic location
u Virtual/logical location
u Source device
u Device ID
u Actors
u Ownership/Provenance
u Values
Events and Event Data
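The attributes listed on this slide map naturally onto a record type. The sketch below is one possible shape, not a schema from the deck: the field names, the types, and the sample values are assumptions.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class EventRecord:
    time: datetime                               # when the event occurred
    geo_location: Optional[Tuple[float, float]]  # (latitude, longitude), if known
    logical_location: str                        # virtual/logical location
    source_device: str                           # kind of device that emitted it
    device_id: str                               # identifier of that device
    actors: List[str]                            # people or systems involved
    provenance: str                              # ownership/provenance
    values: Dict[str, Any] = field(default_factory=dict)  # the measured payload


event = EventRecord(
    time=datetime(2016, 10, 26, 15, 0, tzinfo=timezone.utc),
    geo_location=(51.5, -0.12),
    logical_location="plant-1/line-3",
    source_device="temperature-sensor",
    device_id="ts-0042",
    actors=["controller-7"],
    provenance="ops-team",
    values={"temperature_c": 21.4},
)

as_dict = asdict(event)  # plain dict, convenient for serializing into a stream or lake
```

Carrying time, location, and provenance on every event is what lets governance and lineage questions be answered later.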
Spark, Storm, Flink & Kafka
u  Spark has dethroned Hadoop as a platform and has momentum, both for microbatch and streaming
u  Storm provides streaming (event processing capabilities); combined with a batch layer, it delivers batch and streaming concurrently via the lambda architecture
u  Flink was purpose built for streaming
u  Kafka is the pipe
u  Lambda and Zeta Architectures…
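The lambda architecture mentioned above is usually described as a batch layer and a speed layer whose results are merged at query time. The sketch below shows that merge in plain Python with event counts per key. It uses no Spark, Storm, Flink, or Kafka APIs, and the function names and sample data are assumptions for illustration.

```python
from collections import Counter


def batch_view(master_events):
    """Batch layer: recomputed periodically over the full historical dataset."""
    return Counter(e["key"] for e in master_events)


def speed_view(recent_events):
    """Speed layer: covers only events that arrived since the last batch run."""
    return Counter(e["key"] for e in recent_events)


def query(key, batch, speed):
    """Serving layer: merge both views to answer with up-to-date counts."""
    return batch[key] + speed[key]


history = [{"key": "sensor-a"}, {"key": "sensor-a"}, {"key": "sensor-b"}]
recent = [{"key": "sensor-a"}, {"key": "sensor-c"}]

batch = batch_view(history)
speed = speed_view(recent)
total_a = query("sensor-a", batch, speed)
```

The batch view is complete but stale and the speed view is fresh but partial, so the query combines the two.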
In Summary
1  Disturbance in the Force
2  What is a Data Lake, exactly?
3  Streams and Events
Questions?
THANK YOU!
FIND OUT MORE at InsideAnalysis.com

The Central Hub: Defining the Data Lake
