Cloud computing and big data
September 27, 2016
Ben Sharma | CEO
ben@zaloni.com
•  Award-winning provider of enterprise data lake
management solutions:
Integrated data lake management platform
Self-service data preparation
•  Data Lake Design and Implementation Services:
POC, Pilot, Production, Operations, Training
•  Data Science Professional Services
Zaloni Proprietary
Why cloud now?
“By 2018, at least half of IT
spending will be cloud-based,
reaching 60% of all IT infrastructure”
From IDC Research:
“By 2018, cloud becomes a
preferred delivery mechanism for
analytics, increasing public information
consumption by 150%”
Zaloni Confidential and Proprietary
Why are companies moving to a cloud-based platform
Infrastructure Drivers
•  Infrastructure agility
•  Cost
•  Compute and
storage elasticity
•  Heterogeneous
compute and storage
platforms
•  Converged
architectures for
various workloads
Data Locality
•  Data Gravity
•  Compliance and
regulatory
requirements
(international)
•  Keep data close to
where it is generated
New Requirements
•  Lot of data is
generated externally
•  Need to handle all
types of data –
Structured,
unstructured, images,
etc.
•  Latency and Currency
Zaloni Confidential and Proprietary
5.4 BILLION
IoT volume driving to cloud adoption
Cloud computing required
to provide the virtual
infrastructure needed to
process enormous volume
of data from the IoT
By 2020 there will be
Connected devices1, like smart
meters and connected cars —
This is the Internet of Things.
And it’s going to be big…
Exponential growth
loT: THE NEXT BIG THING
1.2B
5.4B
Source: ABI Research
2011 2014 2020
Zaloni Confidential and Proprietary
On-
Premises
32%
Cloud
Only
23%
Cloud Plus
On-
Premises
29%
Gartner’s Sept 2015 report: Hadoop Expansion Boosts Cloud and Unsupported On-Premises Deployment
Hadoop deployment trends
Zaloni Confidential and Proprietary
Cloud big data use case: Real-time data processing
Fleet Data Collection
Streaming Analytics
Idle Time Calculation Idle Time
reporting
Data-driven
Apps
Dispatchers
QueueCollectors
Ingestion
On-board
Unit
Data
Collectors
Zaloni Confidential and Proprietary
Data Lake in the Cloud
Consumption
Zone
Source
System
File Data
DB Data
ETL Extracts
Streaming
Transient
Loading Zone
Raw Data
Refined
Data
Trusted
Data
Discovery
Sandbox
Original unaltered
data attributes
Tokenized Data
APIs
Reference Data Master Data
Data Wrangling
Data Discovery
Exploratory Analytics
Metadata Data Quality Data Catalog Security
Data Lake
Integrate to
common format
Data Validation
Data Cleansing
Aggregations
OLTP or ODS
Enterprise Data
Warehouse
Logs
(or other unstructured
data)
Data Services
Business Analysts
Researchers
Data Scientists
Zaloni Confidential and Proprietary
•  Storage – Block, object and file level abstractions, with different degrees of
redundancy, availability and consistency guarantees, and cost considerations.
•  Compute - A variety of compute server types are possible, optimized for
different types of memory and processing requirements depending on the
workload.
•  Cloud native services – Higher levels of platform abstractions such as cloud
provider managed Hadoop clusters, managed databases, warehouses,
messaging services, etc.
•  Data Management, Governance, Entitlements and Security
Cloud Data Lake options
Zaloni Confidential and Proprietary
Cloud Data Lake Maturity model
Lift and Shift
Cloud Native
features
Multi and
Hybrid Cloud
Replicate on-
premise Data Lake
in the cloud
Leverage Object
stores, Transient
compute platforms,
Messaging systems
Abstraction over
multiple clouds,
consistent Data
Management and
Governance
Zaloni Confidential and Proprietary
•  Patterns:
§  Implement Data Lake in the cloud using elastic compute and cloud
optimized storage
§  Use Data Lake provided as a cloud service that is managed and optimized
by the cloud provider
§  Data pipelines with processing components decoupled by queuing
services
§  Leaving the heavy lifting to cloud provider services, example, for elastic
clusters, streaming, analytics and machine learning
§  Using cloud storage rather than ephemeral storage with data lifecycle
management
§  Real time processing with event driven architectures for streaming data
Patterns and Anti-patterns
Zaloni Confidential and Proprietary
•  Anti-Patterns:
§  Fork lift migration of on-premise Data Lake to the cloud.
§  Unmanaged, unmonitored, long term usage of resources such as
persistent on-demand compute instances.
§  Dedicating cloud resources for service peaks rather than using auto scaling
cloud services
Patterns and Anti-patterns
Zaloni Confidential and Proprietary
Governance considerations within cloud/hybrid environments
Zaloni Confidential and Proprietary
•  Repeatable Ingestion of vast amounts of data from a wide
variety of sources and formats (streaming, files, custom)
•  Data visibility across hybrid cloud environments with
proper security and access control. Data Masking, and
Encryption of sensitive data
•  Need to capture operational metadata implicitly during
ingestion and processing. Metadata persistent across
cluster instances
•  Reusable Managed Data Pipelines for Processing:
Validation, Standardization, Enrichments
Zaloni Confidential and Proprietary
•  Data Lake on IaaS with bare metal or virtualized infrastructures.
•  PaaS layers - managed data platforms that include various options for event
based data ingestion, data processing and serving layers.
•  Several cloud providers are also starting to offer Analytics as a Service with
Machine Learning offerings built on top of their IaaS and PaaS layers.
•  Geographical coverage due to any local in-country data requirements.
•  Cost, TCO for Cloud Data Lake
Assessing Cloud Data providers
Cloud options in the context of big data and data science
Zaloni Confidential and Proprietary15
IaaS
Platform
Analytics
Machine Learning
OR
OR
Cloud Providers Hadoop Ecosystem
Cortana
Amazon EMR
HDInsight
Cloud Machine Learning
MLlib
Streams
AWS Lambda
OR
Streaming
Analytics Dataflow
Dataproc
Streaming
DATA LAKE MANAGEMENT
AND GOVERNANCE PLATFORM
SELF-SERVICE DATA PREPARATION
FREE T-SHIRT!
Building a Modern Data Architecture
Ben Sharma, CEO and Founder, Zaloni
Wednesday, 2:05 p.m. – 1 E 09
Demo and FREE
copy of book
“Architecting Data Lakes”
Speaking Sessions:
Cloud Computing and Big Data
Ben Sharma, CEO and Founder, Zaloni
Tuesday, 9:30 a.m. – 1B 01/02
Visit Booth #644 for these giveaways!

Cloud Computing and Big Data

  • 1.
    Cloud computing andbig data September 27, 2016 Ben Sharma | CEO ben@zaloni.com
  • 2.
    •  Award-winning providerof enterprise data lake management solutions: Integrated data lake management platform Self-service data preparation •  Data Lake Design and Implementation Services: POC, Pilot, Production, Operations, Training •  Data Science Professional Services
  • 3.
    Zaloni Proprietary Why cloudnow? “By 2018, at least half of IT spending will be cloud-based, reaching 60% of all IT infrastructure” From IDC Research: “By 2018, cloud becomes a preferred delivery mechanism for analytics, increasing public information consumption by 150%”
  • 4.
    Zaloni Confidential andProprietary Why are companies moving to a cloud-based platform Infrastructure Drivers •  Infrastructure agility •  Cost •  Compute and storage elasticity •  Heterogeneous compute and storage platforms •  Converged architectures for various workloads Data Locality •  Data Gravity •  Compliance and regulatory requirements (international) •  Keep data close to where it is generated New Requirements •  Lot of data is generated externally •  Need to handle all types of data – Structured, unstructured, images, etc. •  Latency and Currency
  • 5.
    Zaloni Confidential andProprietary 5.4 BILLION IoT volume driving to cloud adoption Cloud computing required to provide the virtual infrastructure needed to process enormous volume of data from the IoT By 2020 there will be Connected devices1, like smart meters and connected cars — This is the Internet of Things. And it’s going to be big… Exponential growth loT: THE NEXT BIG THING 1.2B 5.4B Source: ABI Research 2011 2014 2020
  • 6.
    Zaloni Confidential andProprietary On- Premises 32% Cloud Only 23% Cloud Plus On- Premises 29% Gartner’s Sept 2015 report: Hadoop Expansion Boosts Cloud and Unsupported On-Premises Deployment Hadoop deployment trends
  • 7.
    Zaloni Confidential andProprietary Cloud big data use case: Real-time data processing Fleet Data Collection Streaming Analytics Idle Time Calculation Idle Time reporting Data-driven Apps Dispatchers QueueCollectors Ingestion On-board Unit Data Collectors
  • 8.
    Zaloni Confidential andProprietary Data Lake in the Cloud Consumption Zone Source System File Data DB Data ETL Extracts Streaming Transient Loading Zone Raw Data Refined Data Trusted Data Discovery Sandbox Original unaltered data attributes Tokenized Data APIs Reference Data Master Data Data Wrangling Data Discovery Exploratory Analytics Metadata Data Quality Data Catalog Security Data Lake Integrate to common format Data Validation Data Cleansing Aggregations OLTP or ODS Enterprise Data Warehouse Logs (or other unstructured data) Data Services Business Analysts Researchers Data Scientists
  • 9.
    Zaloni Confidential andProprietary •  Storage – Block, object and file level abstractions, with different degrees of redundancy, availability and consistency guarantees, and cost considerations. •  Compute - A variety of compute server types are possible, optimized for different types of memory and processing requirements depending on the workload. •  Cloud native services – Higher levels of platform abstractions such as cloud provider managed Hadoop clusters, managed databases, warehouses, messaging services, etc. •  Data Management, Governance, Entitlements and Security Cloud Data Lake options
  • 10.
    Zaloni Confidential andProprietary Cloud Data Lake Maturity model Lift and Shift Cloud Native features Multi and Hybrid Cloud Replicate on- premise Data Lake in the cloud Leverage Object stores, Transient compute platforms, Messaging systems Abstraction over multiple clouds, consistent Data Management and Governance
  • 11.
    Zaloni Confidential andProprietary •  Patterns: §  Implement Data Lake in the cloud using elastic compute and cloud optimized storage §  Use Data Lake provided as a cloud service that is managed and optimized by the cloud provider §  Data pipelines with processing components decoupled by queuing services §  Leaving the heavy lifting to cloud provider services, example, for elastic clusters, streaming, analytics and machine learning §  Using cloud storage rather than ephemeral storage with data lifecycle management §  Real time processing with event driven architectures for streaming data Patterns and Anti-patterns
  • 12.
    Zaloni Confidential andProprietary •  Anti-Patterns: §  Fork lift migration of on-premise Data Lake to the cloud. §  Unmanaged, unmonitored, long term usage of resources such as persistent on-demand compute instances. §  Dedicating cloud resources for service peaks rather than using auto scaling cloud services Patterns and Anti-patterns
  • 13.
    Zaloni Confidential andProprietary Governance considerations within cloud/hybrid environments Zaloni Confidential and Proprietary •  Repeatable Ingestion of vast amounts of data from a wide variety of sources and formats (streaming, files, custom) •  Data visibility across hybrid cloud environments with proper security and access control. Data Masking, and Encryption of sensitive data •  Need to capture operational metadata implicitly during ingestion and processing. Metadata persistent across cluster instances •  Reusable Managed Data Pipelines for Processing: Validation, Standardization, Enrichments
  • 14.
    Zaloni Confidential andProprietary •  Data Lake on IaaS with bare metal or virtualized infrastructures. •  PaaS layers - managed data platforms that include various options for event based data ingestion, data processing and serving layers. •  Several cloud providers are also starting to offer Analytics as a Service with Machine Learning offerings built on top of their IaaS and PaaS layers. •  Geographical coverage due to any local in-country data requirements. •  Cost, TCO for Cloud Data Lake Assessing Cloud Data providers
  • 15.
    Cloud options inthe context of big data and data science Zaloni Confidential and Proprietary15 IaaS Platform Analytics Machine Learning OR OR Cloud Providers Hadoop Ecosystem Cortana Amazon EMR HDInsight Cloud Machine Learning MLlib Streams AWS Lambda OR Streaming Analytics Dataflow Dataproc Streaming
  • 16.
    DATA LAKE MANAGEMENT ANDGOVERNANCE PLATFORM SELF-SERVICE DATA PREPARATION
  • 17.
    FREE T-SHIRT! Building aModern Data Architecture Ben Sharma, CEO and Founder, Zaloni Wednesday, 2:05 p.m. – 1 E 09 Demo and FREE copy of book “Architecting Data Lakes” Speaking Sessions: Cloud Computing and Big Data Ben Sharma, CEO and Founder, Zaloni Tuesday, 9:30 a.m. – 1B 01/02 Visit Booth #644 for these giveaways!