Agile
Enterprise Analytics
on AWS
April 26, 2019
Copyright 2019 - Don Gillis, Zapwerx
Overview
Enterprise data on average is growing at over 30% year over year, yet
traditional analytics approaches have proven to be expensive and
unyielding. The result is that a growing proportion of our data is unused
“dark data”.
However, there is an analytics “perfect storm” happening right now, to the
benefit of enterprises that know how to harness its power:
● Open data formats
● Open source analytics
● Low cost cloud storage
● Rapid cloud innovation
● Low cost pay-as-you-go queries
● Easy-to-use serverless components
● Cheap and accessible machine learning and AI tools
Being a Data Driven Organization
Enabling evidence-based decision making
● Data Agility
○ Widely trained data literacy
○ Quickly iterate on questions, queries & analysis
● Data Access
○ In-context integration
○ Wide availability
○ Lower cost scalability
● Data Governance
○ Centralized
○ Single sourced
○ Attributed & controlled
● Data Community
○ Analytics for everyone
○ Shareable stories
“...enabling numbers people with imagination and story people with discipline.”
A. Damodaran
Why a Data Lake?
A data lake brings organization-wide discipline to data use and governance
● Data sources are defined, captured, and maintained
● Data is ingested and updated automatically
● Data alignment & enrichment processes are explicit
● Data access authorization is defined, asserted, and audited
● Data is accessible for ad hoc inquiries
● Data is source-complete
● Data is portable and stored in well-known open formats
S3 Data Lake Strategy
Tier 1
Raw data as received from batch or streaming data sources.
Apply real-time analytics.
Immediately process to Tier 2, then archive to low-cost archival storage.
Tier 2
Raw data optimized in structure and size, ready for access by multiple tools.
Apply a partition strategy and optimize for file size.
Apply a highly compressible columnar data format such as Parquet or ORC, allowing casual queries.
Tier 3 (...n)
Purpose-built and/or tool-specific optimizations, views, and applications:
● Redshift
● Elasticsearch
● Elastic MapReduce
Data Catalog
Single point of discovery, authorization, and access control.
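The Tier 1 to Tier 2 flow can be sketched as a partition-key convention. This is a minimal illustration only: the `tier2/` prefix, the `orders` table name, and the file-naming pattern are hypothetical placeholders, not a fixed AWS convention.

```python
from datetime import datetime, timezone

def tier2_key(table: str, event_time: datetime, part: int) -> str:
    """Build a Hive-style partitioned key for a Tier 2 Parquet file.

    Partitioning by year/month/day lets engines such as Athena or
    Presto prune whole prefixes instead of scanning every object.
    """
    return (
        f"tier2/{table}/"
        f"year={event_time.year:04d}/"
        f"month={event_time.month:02d}/"
        f"day={event_time.day:02d}/"
        f"part-{part:05d}.parquet"
    )

key = tier2_key("orders", datetime(2019, 4, 26, tzinfo=timezone.utc), 7)
# -> "tier2/orders/year=2019/month=04/day=26/part-00007.parquet"
```

In practice a Glue job or Lambda function would write the converted Parquet objects under keys like these; the point here is only the layout that makes partition pruning possible.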
Key Solution Elements
● Data Lake - Centralize your data into a data lake flowing from raw to fully prepared,
with each “Tier” having its defined purpose.
● Data Catalog - Establish a single point of data registration, discovery, access, and audit.
● Tiered Data Retention - Keep interesting Tier 1 raw data long-term for future uses.
● Open Data Formats - At Tier 2, apply open standard columnar data formats for
portability, discoverability, speed, and compression. Access them using open source
technologies like Hadoop, Presto, and columnar in-memory database engines.
● Schema on Read - Separating the schema from the data allows better portability,
flexibility, and agility.
● Serverless Components - Use serverless or managed components for data streaming,
storage, and processing, making the service both quick to experiment with and easy to scale.
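Schema on read can be illustrated in plain Python: raw records stay untyped in storage, and each consumer applies its own schema only at read time. The records and the name-to-type map below are purely illustrative.

```python
import json
from datetime import date

# Raw Tier 1 records stay untyped on disk (here, JSON lines).
RAW_LINES = [
    '{"order_id": "1001", "amount": "19.99", "day": "2019-04-26"}',
    '{"order_id": "1002", "amount": "5.00", "day": "2019-04-26"}',
]

# A hypothetical reader-side schema: field name -> casting function.
SCHEMA = {
    "order_id": int,
    "amount": float,
    "day": date.fromisoformat,
}

def read_with_schema(lines, schema):
    """Parse raw JSON lines, casting each field per the reader's schema."""
    for line in lines:
        raw = json.loads(line)
        yield {name: cast(raw[name]) for name, cast in schema.items()}

orders = list(read_with_schema(RAW_LINES, SCHEMA))
```

A different consumer could read the same raw lines with a different schema, with no rewrite of the stored data: that decoupling is what gives schema on read its portability and agility.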
Approach
● Identify starter use case and its success measures
● Open Access - provide secure but wide access to your data lake allowing the business to:
○ Gain insights
○ Enhance visibility
○ Discover new data applications
○ Make better data driven decisions
○ Drive and measure business value
● Include a long-term AI strategy
○ Retain raw data
○ Build and enhance your data acquisition strategy
○ Experiment with easy-to-use AI tools
● Build a plan to encourage adoption
○ Develop a change management & communication plan
○ Provide training and workshops for data analysts, builders, and users
Security & Privacy
● Security - Be aware of the mounting liability of data privacy and security
○ Centralize the point of authorization, access control, monitoring, and audit
○ Use built-in encryption in transit and at rest
○ Build on cloud native security tools
○ Allow for GDPR-type subject access requests
○ Use cloud native tooling for data protection
○ Use cloud native tooling for identity & access control, anomaly detection, and response
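As one concrete sketch of built-in encryption at rest, the function below assembles the S3 default-encryption configuration. With boto3 this dict would be passed to `s3.put_bucket_encryption(...)`; here it is only built and inspected so the example stays self-contained, and the `"my-key-id"` value in the usage note is a placeholder.

```python
def default_encryption_config(kms_key_id=None):
    """Encrypt every new object at rest: SSE-KMS if a key is given, else SSE-S3.

    Returns the ServerSideEncryptionConfiguration dict expected by S3's
    put_bucket_encryption API.
    """
    if kms_key_id:
        default = {"SSEAlgorithm": "aws:kms", "KMSMasterKeyID": kms_key_id}
    else:
        default = {"SSEAlgorithm": "AES256"}
    return {"Rules": [{"ApplyServerSideEncryptionByDefault": default}]}

config = default_encryption_config()
```

Making encryption the bucket default, rather than a per-upload option, is what turns "encrypt at rest" from a convention into a guarantee.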
Technology
● Simplify Tooling
○ Your data is big and complex and growing
○ Your tooling should not add to the complexity
○ Use simple cloud native services and patterns
● Machine Learning - Experiment with easy-to-use cloud tools for
○ Classifying unstructured data
○ Audio / Video recognition
○ Anomaly detection
○ Personalization
○ Recommendations
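The anomaly-detection idea can be shown with a toy, purely local stand-in: flag metric values far from the mean in standard-deviation units. Managed cloud services replace this with trained models, but the concept of scoring points against a learned baseline is the same; the request counts below are made up.

```python
import statistics

def zscore_anomalies(values, threshold=2.0):
    """Return the values whose z-score against the sample exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # constant series: nothing is anomalous
    return [v for v in values if abs(v - mean) / stdev > threshold]

daily_requests = [100, 102, 98, 101, 99, 100, 500]
anomalies = zscore_anomalies(daily_requests)  # flags only the 500 spike
```

Once a Tier 2 data lake makes metrics like these queryable, the same scoring can be handed off to managed anomaly-detection services rather than hand-rolled statistics.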
Workshop Roles
Participants
The people who bring an understanding of your
business, its data, and its goals.
The people who will continue to develop,
manage, and secure your enterprise analytics
service.
Facilitator
Brings an understanding of modern
enterprise analytics and how to implement it
on Amazon Web Services (AWS).
Brings a strategy for building your agile
enterprise analytics service. This will act as a
basis upon which we will refine your vision
and build your service startup plan.
Uses techniques like Value Stream Analysis
to define and refine processes.
Workshop deliverables
Vision
A vision for your Agile Enterprise Analytics
service within the context of your business, its data,
and its goals.
Start-up Plan
A start-up plan for implementing an initial Agile
Enterprise Analytics service.
Project Proposal
A proposal with pricing and terms and conditions.
Delivery Schedule
A recommended delivery schedule to meet your
needs.
Beyond the Workshop
What’s next...
With your Data Lake taking form, it may be time to build your skills in the
application of Machine Learning and AI. You can learn to build and maintain
accurate models, deploy those models efficiently on AWS, and take full
advantage of AI and machine learning to make better predictions faster and
improve your bottom line.
Contact Us
Thank you.
info@zapwerx.com
Copyright 2019 - Don Gillis, Zapwerx
