Introduction to AWS Data Pipeline Services
Brought to you by Riley Shu

Goals
1. Identify the problem
2. Choose the best tool
3. Implement the solution

Agenda
1. What’s Big Data?
2. Data Temperature
3. Why AWS?
4. Pipeline
5. Security
6. Cost
7. AWS Design Principles
8. Demo (Glue + Athena + QuickSight)
9. Resources

Big Data
• Velocity: Facebook users upload more than 900 million photos a day
• Volume: As far back as 2016, Facebook had 2.5 trillion posts
• Variety: text, photos, audio, videos

Why AWS
• Managed Service
• Durable & Available
• Integrated ecosystem

Collect: File Types
• Transactional: small pieces of data that must be stored and retrieved quickly
• File: data transmitted as individual files, typically ingested from connected devices
• Stream: data such as click-stream logs, ingested through a streaming solution so it is available for real-time processing and analysis
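The stream case can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes a client object exposing the boto3 Firehose `put_record` call, and the event fields (`user_id`, `page`, `ts`) and any stream name are hypothetical.

```python
import json
import time


def encode_click_event(user_id, page, ts=None):
    """Serialize one click event as a newline-terminated JSON record."""
    event = {
        "user_id": user_id,  # hypothetical field names, for illustration only
        "page": page,
        "ts": ts if ts is not None else int(time.time()),
    }
    # Newline-delimited JSON keeps records separable after they are
    # concatenated into larger objects downstream.
    return (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")


def send_click_event(firehose_client, stream_name, user_id, page, ts=None):
    """Send one event through a Firehose-style client's put_record call."""
    return firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": encode_click_event(user_id, page, ts)},
    )
```

Against a real account, `firehose_client` would typically be `boto3.client("firehose")`; passing the client in keeps the sketch testable without AWS credentials.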

Collect: AWS Tools
• Near Real-time: Amazon Kinesis Firehose
• Data Import: AWS Snowball
• Message Queuing: Amazon SQS
• Web/app Servers: Amazon EC2

Collect: Data Transfer Options

Collect: Streaming Tools Comparison

Store: Storage Types
• Data Lake: stores all data, structured and unstructured, in one centralized repository.
• Data Warehouse: a central repository that stores accumulated, integrated data from one or more disparate sources.
• NoSQL Databases: schemaless, with no common query language analogous to SQL; query flexibility is generally traded for high I/O performance and horizontal scalability.
  • Columnar
  • Document
  • Graph

Store: Tools
• Object Storage
  • Amazon S3
  • Amazon Glacier
• Near Real-time
  • Amazon Kinesis Streams
• RDBMS
  • Amazon RDS
• NoSQL: Amazon DynamoDB
• Search: Amazon CloudSearch

Process & Analyze
• Hadoop Ecosystem: Amazon EMR
• Near Real-time
  • AWS Lambda
  • Amazon Kinesis Analytics
• Data Warehousing: Amazon Redshift
• Machine Learning: Amazon SageMaker
• Elasticsearch: Amazon Elasticsearch Service
• Process and Move
  • AWS Data Pipeline
  • AWS Glue
• Ad Hoc Analysis: Amazon Athena

Consume: Tools
• Visualizations: Amazon QuickSight
• Elasticsearch Analytics: Amazon Elasticsearch Service

Security
• Shared Responsibility Model
• VPC: Virtual Private Cloud
• IAM: Identity and Access Management

Security: VPC
Virtual Private Cloud

Security: IAM
Identity and Access Management
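As one illustration of what IAM controls, the sketch below builds a least-privilege policy document granting read-only access to a single S3 bucket. The policy layout follows the standard IAM JSON format; the helper name `read_only_bucket_policy` and the bucket name used with it are assumptions made for this example, not part of the original deck.

```python
import json


def read_only_bucket_policy(bucket_name):
    """Build an IAM policy document allowing read-only access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
            },
            {
                # Reading applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            },
        ],
    }


def policy_as_json(bucket_name):
    """Render the policy as the JSON text that IAM expects."""
    return json.dumps(read_only_bucket_policy(bucket_name), indent=2)
```

Calling `policy_as_json("example-data-lake")` (a hypothetical bucket name) yields a document that could be attached to a role used by an analytics job, so the job can read the data but not modify it.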

Example: Simple Monthly Calculator
Which tool should we use, S3 or DynamoDB?

Cost
• Simple Monthly Calculator
• Pricing pushes you in the right direction
• Understand the different pricing models and follow the best practice for each service
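The S3-versus-DynamoDB question can be framed as simple arithmetic: a storage charge plus a request charge. The sketch below is illustrative only. All prices are invented placeholders, not AWS list prices, and the two option names are hypothetical stand-ins for an object store and a key-value store.

```python
def monthly_cost(storage_gb, requests, price_per_gb, price_per_million_requests):
    """Monthly cost = storage charge + request charge."""
    storage_charge = storage_gb * price_per_gb
    request_charge = (requests / 1_000_000) * price_per_million_requests
    return storage_charge + request_charge


# PLACEHOLDER prices, invented for illustration only; not AWS list prices.
OBJECT_STORE = {"price_per_gb": 0.02, "price_per_million_requests": 5.0}
KEY_VALUE_STORE = {"price_per_gb": 0.25, "price_per_million_requests": 1.0}


def cheaper_option(storage_gb, requests):
    """Return the name and monthly cost of the cheaper hypothetical option."""
    costs = {
        "object_store": monthly_cost(storage_gb, requests, **OBJECT_STORE),
        "key_value_store": monthly_cost(storage_gb, requests, **KEY_VALUE_STORE),
    }
    name = min(costs, key=costs.get)
    return name, costs[name]
```

Under these placeholder prices, a large, rarely read dataset favors the object store, while a small dataset with very many requests favors the key-value store. Real decisions should use current figures from the pricing calculator.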

AWS Design Principles
• Multi-Stage Decoupled “Data Bus”
  • Multiple processing applications can read from/write to different data stores
  • Processing frameworks can read from multiple data stores
• Use the right tool
  • Data temperature, latency
• Lambda architecture ideas
  • Append-only, immutable
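The Lambda architecture idea can be sketched with an append-only log of immutable events, a batch view computed over older data, and a speed view covering what the batch layer has not processed yet. This is a toy in-memory model under stated assumptions: the `Event` type, its fields, and the cutoff-timestamp convention are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """An immutable fact; events are appended, never updated in place."""
    ts: int
    page: str


def batch_view(log, cutoff_ts):
    """Batch layer: page counts over all events before the cutoff."""
    return Counter(e.page for e in log if e.ts < cutoff_ts)


def speed_view(log, cutoff_ts):
    """Speed layer: page counts for recent events at or after the cutoff."""
    return Counter(e.page for e in log if e.ts >= cutoff_ts)


def query(log, cutoff_ts):
    """Serving layer: merge the batch and speed views into one answer."""
    return batch_view(log, cutoff_ts) + speed_view(log, cutoff_ts)
```

Because events are never mutated, the batch view can always be recomputed from the raw log, which is what makes the append-only design forgiving of processing bugs.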

Demo
S3 + Glue + Athena + QuickSight
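The Athena step of such a demo might look like the sketch below. It is not the presenter's demo code. It assumes a client exposing the boto3 Athena calls `start_query_execution` and `get_query_execution`, and it assumes a Glue crawler has already catalogued the S3 data; the database name, SQL, and output location passed in are hypothetical.

```python
import time


def run_athena_query(athena_client, sql, database, output_location,
                     poll_seconds=1.0, max_polls=60):
    """Start a query and poll until it reaches a terminal state."""
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    for _ in range(max_polls):
        status = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not finish in time")
```

With real credentials, `athena_client` would typically be `boto3.client("athena")`, and QuickSight could then be pointed at the same Athena database for visualization.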

Resources
• Big Data Technology Fundamentals
• AWS Public Dataset Program
• AWS edX Courses
• AWS Quick Start
