REAN Cloud is an AWS partner that provides Big Data solutions and services. The document outlines REAN Cloud's workshop agenda which includes an introduction to Big Data and selected AWS products like EMR, Athena, Kinesis, DynamoDB and Redshift. It also summarizes REAN Cloud's capabilities, customer focus areas, and provides a demo of using EMR and Athena. The workshop is aimed at helping customers implement and automate secure Big Data architectures on AWS.
1. www.reancloud.com
USA: Herndon | Philadelphia | Los Angeles
India: Hyderabad | Pune | Udaipur | Bangalore
Big Data on AWS
How to implement and automate secure Big
Data architectures & solutions on AWS
2. @2017 Copyright REAN Cloud. All rights reserved. info@reancloud.com | www.reancloud.com
Workshop Agenda
Introduction to REAN Cloud
What is Big Data
Why AWS is the most complete Big Data platform
Overview of selected AWS Big Data Products
Why, What, How { Innovation, Best Practices, and Security }
Equivalent Open Source Software ( where applicable )
REAN Data Lake Reference Architecture
Demo & Discussion
Live demo / hands-on of using EMR, Athena
Organizational Profile
Established: 2013
Presence:
USA, India
Number of Employees: 300+
AWS Certifications: 150+ (Including 15+ Professional
Certifications)
Industry Focus:
Education, Government, Healthcare / Life Sciences,
Financial Services, and ISV
Management team consisting of executives formerly
from Fortune 500 Enterprises (AWS, Amdocs, BMC,
Merck and Cognizant) and with a deep AWS Cloud
Computing experience.
Recognized by TechTarget as the top AWS Partner
providing innovative DevSecOps services.
24x7 follow the sun model with offices around the
world and continuous operations in multiple time
zones - EST, PST and IST.
Recognized in Gartner’s March 2017 Magic Quadrant
for Public Cloud Infrastructure Managed Service
Providers - Leading the Niche Players in completeness
of Vision & Ability to Execute.
AWS Competencies
Customers by Vertical
Government / Public Sector
ISV
Education
Financial Services
Healthcare / Life Sciences
What is Big Data?
Big Data Characteristics: Ever Expanding Horizon
VOLUME
VELOCITY
VARIETY
Source: www.datasciencecentral.com
AWS Big Data technologies usually fall into one of these two categories:
Lightswitch or Cockpit
AWS - The Most Complete Platform for Big Data - 1/2
Analytic Frameworks Real Time Analytics Storage & Databases Data Warehousing Business Intelligence
Amazon EMR
Amazon Athena Kinesis Firehose
Kinesis Streams
Kinesis Analytics Amazon Athena
DynamoDB
Redshift QuickSight
Elasticsearch Service
S3
HBase on EMR, RDS & Aurora
AWS - The Most Complete Platform for Big Data - 2/2
Artificial Intelligence
Lex, Polly, Rekognition, ML, MXNet
Internet of Things
IoT, Greengrass
Serverless Compute
Lambda
Amazon EC2 Instances
Optimised { Compute, Memory,
Storage, GPU }
Data Movement
Direct Connect, Snowball, Database
Migration Service, Storage Gateway
Big Data Analytic Frameworks
Amazon EMR Amazon Athena
Why EMR ? - 1 / 4
How to deal with Big Data ?
Massively Parallel Processing (MPP) frameworks !
Hadoop = MapReduce + HDFS
MapReduce is a popular distributed processing framework
HDFS provides scalable and reliable distributed data storage
Running a Hadoop cluster is challenging
Scale data nodes as data size increases
Scale out/in compute nodes to tackle spiky traffic
Managing resources (nodes, disks) is time-consuming
Enterprise Hadoop distributions fill the need…
Very expensive ($$$)
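The MapReduce model named above can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, shuffle, and reduce phases; the function names and the word-count task are hypothetical choices, and a real Hadoop job would run each phase in parallel across cluster nodes.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into one total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(splits):
    """Run the three phases in sequence over a list of text splits."""
    mapped = []
    for split in splits:
        mapped.extend(map_phase(split))
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical input: each string stands in for one HDFS block.
print(word_count(["big data on AWS", "big clusters process big data"]))
```

Because each split is mapped independently, the map phase is the part that scales out as more nodes are added.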
What is EMR ? - 2 / 4
An Enterprise-grade Hadoop distribution from Amazon with no Licensing fee
Compute Storage & Database Data Import/Export
Orchestration Configuration
Web UI
Machine Learning Monitoring
EMR Innovations - 3 / 4
Easy cluster resizing - add/remove task/core nodes
Auto scaling - scale out/in based on workload
Use spot instances for handling spiky traffic
Spin-up transient / long-running clusters
for pre-defined workflows and ad hoc tasks
High Availability
Monitors nodes and replaces unhealthy nodes.
Cost Savings
EMR = $304.63
Hadoop = $338.40
(5x m4.10xlarge instances / day)
Decouple storage and compute
No contention of system resources (CPU and RAM)
EMR Filesystem - EMRFS
Mount S3 as an HDFS endpoint
S3 offers virtually unlimited and inexpensive storage
No need to add Hadoop data nodes to EMR clusters
Data Transfer from S3 to EMR within same region is free
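Taking the two daily cost figures on this slide at face value, the saving can be worked out as below. The helper name is hypothetical, and the slide does not say how the two totals were derived, so this is arithmetic on the quoted numbers only.

```python
def daily_saving(emr_cost_usd, hadoop_cost_usd):
    """Return (absolute saving in USD, saving as a percentage of the Hadoop cost)."""
    saving = hadoop_cost_usd - emr_cost_usd
    percent = 100.0 * saving / hadoop_cost_usd
    return round(saving, 2), round(percent, 2)


# Figures quoted on the slide for 5x m4.10xlarge instances per day.
saving, percent = daily_saving(emr_cost_usd=304.63, hadoop_cost_usd=338.40)
print("Saving per day: ${:.2f} ({:.2f}%)".format(saving, percent))
```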
EMR Security - 4 / 4
Isolation
Logical isolation using VPC
Security Groups ( firewall rules )
Master/Slave security groups
Encryption at-rest
S3 server/client side encryption with AWS KMS /
Custom keys
Authentication, Authorization, Accounting
Use IAM roles
Audit API calls using CloudTrail
Use custom EMR AMI to encrypt boot volumes (KMS)
Why Athena ? - 1 / 4
Analyze data in all shapes and sizes with minimal effort
Prototype a solution before full-blown implementation
Building a data pipeline is complex…
Requires a variety of specialized skills
ETL is time-consuming…
Especially if you just want to quickly explore data
Data Scientists just want to get going asap!
Impeded by the need to perform ETL, build data pipelines
Data Analysts want a self-service solution…
without relying on Data Engineers for everything
What is Athena ? - 2 / 4
Athena is an interactive query service
Analyze data in Amazon S3 using Standard ANSI SQL
Athena is Serverless
• No infrastructure to maintain; zero spin-up time.
• Uses warm compute pools across multiple AZs
Built using well-known open source technologies
• Presto SQL query engine
• Hive for DDL functionality
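As a rough illustration of the split between Hive-style DDL and Presto-style SQL, the sketch below builds the two statements one might submit to Athena for a CSV dataset in S3. The bucket, database, table, and column names are all hypothetical, and the helper only assembles strings; it does not call any AWS API.

```python
def build_external_table_ddl(database, table, columns, s3_location, delimiter=","):
    """Assemble a Hive-style CREATE EXTERNAL TABLE statement for delimited text in S3."""
    column_lines = ",\n".join(
        "  `{}` {}".format(name, col_type) for name, col_type in columns
    )
    return (
        "CREATE EXTERNAL TABLE IF NOT EXISTS {db}.{tbl} (\n"
        "{cols}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delim}'\n"
        "LOCATION '{loc}'"
    ).format(db=database, tbl=table, cols=column_lines,
             delim=delimiter, loc=s3_location)


# Hypothetical dataset: web request logs stored as CSV under an S3 prefix.
ddl = build_external_table_ddl(
    database="weblogs",
    table="requests",
    columns=[("request_time", "string"), ("status", "int"), ("bytes_sent", "bigint")],
    s3_location="s3://example-bucket/logs/",
)

# Once the table exists, analysis is ordinary SQL against it.
query = (
    "SELECT status, COUNT(*) AS hits "
    "FROM weblogs.requests "
    "GROUP BY status ORDER BY hits DESC LIMIT 10"
)

print(ddl)
print(query)
```

The DDL only registers a schema over files that already sit in S3; no data is copied or loaded.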
Athena Innovations - 3 / 4
No loading of data
• Stream data directly from S3
• No ETL required
Query data in its raw format
• Text, CSV, JSON, Web logs, AWS service logs
• Convert to ORC/Parquet for best performance / low cost
Pay per query
• Automatically parallelizes queries
Results also stored in S3
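To see why a columnar format lowers cost under pay-per-query pricing, the sketch below estimates a query's charge from the bytes it scans. The price per TB and the per-query minimum are assumptions that do not come from the slides and should be checked against current Athena pricing; the function name and the scan sizes are hypothetical.

```python
import math

MB = 1024 ** 2
TB = 1024 ** 4


def estimate_query_cost(bytes_scanned, price_per_tb_usd=5.0, minimum_mb=10):
    """Estimate a query charge from bytes scanned.

    Assumptions (not from the slides): billing is per TB scanned, rounded up
    to the next MB, with a minimum billable amount per query.
    """
    billed_mb = max(math.ceil(bytes_scanned / MB), minimum_mb)
    return billed_mb * MB / TB * price_per_tb_usd


# Hypothetical comparison: a full scan of raw CSV versus reading only the
# needed columns from a compressed columnar copy of the same data.
raw_csv_scan = 1 * TB
columnar_scan = TB // 10

print("raw CSV :", estimate_query_cost(raw_csv_scan))
print("columnar:", estimate_query_cost(columnar_scan))
```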
Athena Security - 4 / 4
Access
Use IAM policies to manage service access
Use S3 permissions to manage r/w data access
Encryption at-rest
Data in S3 can be encrypted (SSE-S3, SSE-KMS, CSE-KMS)
Query results can be encrypted when stored back in S3
Encryption in-transit
Uses TLS encryption for accessing data stored in S3
Uses SSL encryption for JDBC clients
Real Time Big Data Analytics
Kinesis Firehose, Kinesis Streams, Kinesis Analytics
Why Kinesis ? - 1 / 5
Scalable Service bus to connect data producers and consumers
IoT, Mobile / Web clients
Building an enterprise-grade queuing system is challenging…
Challenges
Scale - as the number or size of input records increases
Reliability - store data until consumers download and process it
Push / Pull support
...
Introduction to Kinesis Family - 2 / 5
Amazon Kinesis
Streams: Analyze streaming data
Firehose: Prepare and load streaming data into AWS
Analytics: Analyze streaming data with standard SQL
Fully managed service
Easily collect, process, and analyze real-time, streaming data.
Equivalent OSS - Apache Kafka
Kinesis Streams - 3 / 5
Connects real-time data producers and consumers
Put data using the Kinesis Producer Library (KPL)
Get data using the Kinesis Client Library (KCL)
Data-consumer apps pull data from streams
Streams are made of Shards
add/remove shards to adjust capacity
Encrypt sensitive data using SSE and AWS KMS
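How records land on shards can be sketched as follows. Kinesis Streams hashes each record's partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash-key range contains it. The even split of the hash space and the function names below are assumptions made for illustration, not the service's internals.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_ranges(shard_count):
    """Split the hash space into contiguous, evenly sized (start, end) ranges."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last shard absorbs any remainder so the ranges cover the space.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hash_value <= end:
            return index
    raise ValueError("hash value falls outside every shard range")


# Hypothetical stream with four shards and a few device IDs as partition keys.
ranges = shard_ranges(4)
for key in ["device-1", "device-2", "device-3"]:
    print(key, "-> shard", shard_for_key(key, ranges))
```

Adding shards narrows each range, which is how capacity is adjusted without changing how producers pick partition keys.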
Kinesis Firehose - 4 / 5
Capture, transform, and push streaming real-time data into...
Kinesis Analytics, S3, Redshift, Elasticsearch Service
Loads new data within 60 seconds after being received
Auto-scaling (stream management, sharding) and monitoring
Batch, compress, and encrypt the data before loading it
Add data conversion on-the-fly
without a data processing pipeline
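The batch-and-compress behaviour described above can be approximated with a small buffer that flushes when it reaches a size limit or an age limit. This is a local simulation only: the class name, the default limits, and the explicit `now` argument (used instead of a real clock to keep the example deterministic) are all assumptions.

```python
import gzip


class DeliveryBuffer:
    """Collect records and release them as one gzip-compressed batch."""

    def __init__(self, max_bytes=1024, max_age_seconds=60):
        self.max_bytes = max_bytes
        self.max_age_seconds = max_age_seconds
        self.records = []
        self.size = 0
        self.opened_at = None  # time the first buffered record arrived

    def add(self, record, now):
        """Buffer one record (bytes); return a batch if a flush is due, else None."""
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        too_big = self.size >= self.max_bytes
        too_old = now - self.opened_at >= self.max_age_seconds
        return self.flush() if too_big or too_old else None

    def flush(self):
        """Compress the buffered records, reset the buffer, return the payload."""
        payload = gzip.compress(b"\n".join(self.records))
        self.records, self.size, self.opened_at = [], 0, None
        return payload


# Hypothetical usage: a tiny size limit so the second record triggers delivery.
buffer = DeliveryBuffer(max_bytes=10, max_age_seconds=60)
print(buffer.add(b"abc", now=0))                   # still buffering
print(buffer.add(b"defghijk", now=1) is not None)  # size limit reached
```

In this sketch a flush is only checked when a record arrives; a managed service would also flush on a timer when the stream goes quiet.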
Kinesis Analytics - 5 / 5
Process streaming data in real time with standard SQL
Sub-second processing latencies
In: Kinesis Streams or Firehose
Automatically recognizes standard data formats; suggests schema
Manually update schema or provide new one for unstructured data
Out: Kinesis Streams and Firehose
S3, Redshift, Elasticsearch Service, or custom destination.
Write your queries using standard SQL
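A typical streaming query groups events into fixed time windows and aggregates within each one. The slides describe doing this in SQL; the sketch below shows the same tumbling-window idea in plain Python over an in-memory list, purely to make the windowing logic concrete. The event shape and function name are hypothetical.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window start, key) using fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_in_seconds, key) pairs.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Align the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical click events: (seconds since start, page name).
events = [(0, "home"), (5, "home"), (12, "cart"), (19, "home")]
print(tumbling_window_counts(events, window_seconds=10))
```

A real streaming engine emits each window's result as the window closes rather than after reading the whole input.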
Big Data Storage & Databases
DynamoDB Aurora
Why DynamoDB ? - 1 / 2
More entities producing unstructured data
Need a scalable & reliable solution to capture, process
Building a Scalable, Reliable, Resilient NoSQL DB Infrastructure is an engineering challenge !
Some challenges...
Infrastructure layer - replicate, scale, restore
Database layer - synchronization, read-after-write
Network layer - load balancer
Security - restricted protected access
….
What is DynamoDB - 2 / 2
Fast and flexible NoSQL database service
Single-digit millisecond latency at any scale
For microsecond response, use DynamoDB Accelerator (DAX)
Supports both document and key-value store models
Similar to MongoDB from this perspective
Integrates with IAM, AWS Lambda for triggers
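The document and key-value models share one wire format: every attribute is tagged with its type (S for string, N for number, BOOL, NULL, L for list, M for map). The converter below is a minimal sketch of that encoding for plain Python values; the function name and the sample item are hypothetical, and it covers only the types listed here.

```python
def to_attribute_value(value):
    """Encode a Python value in DynamoDB's type-tagged attribute format."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return {"BOOL": value}
    if value is None:
        return {"NULL": True}
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}  # numbers travel as strings
    if isinstance(value, list):
        return {"L": [to_attribute_value(item) for item in value]}
    if isinstance(value, dict):
        return {"M": {key: to_attribute_value(item) for key, item in value.items()}}
    raise TypeError("unsupported type: {}".format(type(value).__name__))


# Hypothetical item: a key-value lookup by "user_id" with a nested document.
item = {
    "user_id": "u-100",
    "age": 30,
    "active": True,
    "profile": {"city": "Herndon", "tags": ["aws", "bigdata"]},
}
print(to_attribute_value(item)["M"])
```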
Why Aurora ? - 1 / 4
Open-source RDBMS are free-to-use but…
Do not meet commercial RDBMS performance
Commercial RDBMS provide performance but…
Are too expensive for many organizations
Is there a best of both worlds ?
What is Aurora ? - 2 / 4
Is a managed database service
Similar to RDS
Is a new relational database engine
Compatible with MySQL 5.6, PostgreSQL 9.6
Existing ecosystem will work with little or no change
Throughput
5x MySQL, 2x PostgreSQL
On par with commercial databases
Example: 500K reads, 100K writes
1/10th the cost of commercial databases
Aurora Innovations - 3 / 4
High Availability and Durability
> 99.99% availability
Data is replicated 6 times across 3 AZs
Continuously backed up to S3
Failure recovery is transparent / automated
Database Migration
Use standard MySQL import and export tools
Use Database Migration Service (DMS) to migrate to Aurora
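For a sense of scale, the availability figure above can be converted into a yearly downtime budget. The helper name is hypothetical and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def max_downtime_minutes_per_year(availability_percent):
    """Convert an availability percentage into allowed downtime per year."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100.0)


# 99.99% availability leaves a budget of under an hour per year.
print(round(max_downtime_minutes_per_year(99.99), 2), "minutes per year")
```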
Aurora Security - 4 / 4
Isolation
Network isolation using VPC
Encryption-at-rest
Storage is encrypted
Includes backups, snapshots and replicas
AWS KMS is transparently used
Encryption in-motion
Using SSL
Data Warehousing
Redshift
Why Redshift ? - 1 / 5
Why do you need a data warehouse ?
OLTP systems are good for transactions
• Not a good fit for analytics-type workloads
A centralized location to capture data from OLTP
• Used for analytics, trend analytics
Challenges
Performance
Cost
Scalability
…
How AWS built Redshift ? - 2 / 5
What is Redshift ? - 3 / 5
Distributed computing architecture
Each node is independent, self-sufficient
No single-point of failure
Nodes don’t share memory or disk storage (shared-nothing architecture)
Leader node
Endpoint: SQL, JDBC/ODBC, BI tools
Stores metadata, Coordinates parallel SQL processing
Compute nodes
Local, columnar storage
Ingestion, Backup, Restore to…
S3 / EMR / DynamoDB / SSH
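The leader/compute split can be pictured as a scatter-gather: rows are spread across compute nodes, each node aggregates its own slice, and the leader merges the partial results. The sketch below simulates that in one process. The CRC32-based distribution and all function names are illustrative assumptions, not Redshift's actual hashing or internals.

```python
import zlib
from collections import defaultdict


def distribute(rows, dist_key, node_count):
    """Assign each row to a compute node by hashing its distribution key."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        node = zlib.crc32(str(row[dist_key]).encode("utf-8")) % node_count
        nodes[node].append(row)
    return nodes


def node_partial_sum(node_rows, group_key, value_key):
    """Work done on one compute node: aggregate only its local rows."""
    partial = defaultdict(float)
    for row in node_rows:
        partial[row[group_key]] += row[value_key]
    return partial


def leader_query(rows, node_count, group_key, value_key):
    """Leader node: fan the work out, then merge the per-node partial sums."""
    totals = defaultdict(float)
    for node_rows in distribute(rows, group_key, node_count):
        for key, subtotal in node_partial_sum(node_rows, group_key, value_key).items():
            totals[key] += subtotal
    return dict(totals)


# Hypothetical sales rows; equivalent to SELECT region, SUM(sales) ... GROUP BY region.
rows = [
    {"region": "us", "sales": 10.0},
    {"region": "in", "sales": 5.0},
    {"region": "us", "sales": 2.5},
]
print(leader_query(rows, node_count=3, group_key="region", value_key="sales"))
```

Because no node needs another node's memory or disk to compute its partial result, adding nodes adds capacity without adding contention.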
Redshift Innovations - 4 / 5
1-click deployment to launch
in multiple regions around the world
Pay-as-you-go pricing
$1000 / TB / Year
Large collection of pre-installed software, libraries
Pandas, NumPy and SciPy
Write new User Defined Functions
using Python 2.7
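A scalar UDF is an ordinary Python function body wrapped in a CREATE FUNCTION statement. The function below is a hypothetical example written to run unchanged on Python 2.7 and later; the name `f_percent_change` and its arguments are illustrative, and the SQL wrapper in the comment is a sketch of the usual form rather than a tested deployment script.

```python
def f_percent_change(old_value, new_value):
    """Return the percentage change from old_value to new_value.

    Returns None (SQL NULL) when either input is missing or the base is zero.
    """
    if old_value is None or new_value is None or old_value == 0:
        return None
    return (new_value - old_value) * 100.0 / old_value


# Inside Redshift, the body above would be registered roughly like this:
#
#   CREATE FUNCTION f_percent_change (old_value float, new_value float)
#   RETURNS float
#   IMMUTABLE
#   AS $$
#       if old_value is None or new_value is None or old_value == 0:
#           return None
#       return (new_value - old_value) * 100.0 / old_value
#   $$ LANGUAGE plpythonu;

print(f_percent_change(100.0, 125.0))
```

Keeping the logic in a plain function like this makes it easy to unit test locally before registering it in the warehouse.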
Redshift Security - 5 / 5
IAM roles
Role-based Access Control
REAN Cloud Data Lake
Quick start / reference architecture
REAN Cloud gains AWS Big Data Competency
Data Lake Reference architecture
Demo, Workshop
Demo
• Live walkthrough of launching EMR, using Athena
Workshop
• Using Athena to analyze an open dataset
43. Thank You!
REAN Cloud LLC
2201 Cooperative Way, Suite 302
Herndon, VA 20171
+1(844) 377- (7326)
Do you have any questions?
info@reancloud.com
www.reancloud.com
Larry Bradley
S. P. T. Krishnan
Steve Toback
Steve Vaughan