Introduction to AWS Data Pipeline Services
Brought to you by Riley Shu
1
Goals
1. Identify the problem
2. Choose the best tool
3. Implement the solution
2
Agenda
1. What’s Big Data?
2. Data Temperature
3. Why AWS?
4. Pipeline
5. Security
6. Cost
7. AWS Design Principles
8. Demo (Glue + Athena + QuickSight)
9. Resources
3
Big Data
• Velocity: Facebook users upload more than 900 million photos a day
• Volume: As of 2016, Facebook had 2.5 trillion posts
• Variety: text, photos, audio, videos
4
Data Temperature 5
Why AWS
• Managed Service
• Durable & Available
• Integrated ecosystem
6
AWS Data Pipeline 7
Collect: File Types
• Transactional: small pieces of data that must be stored and retrieved quickly
• File: data transmitted as individual files, typically ingested from connected devices
• Stream: data such as click-stream logs, which should be ingested through an appropriate
solution so that it is available for real-time processing and analysis
8
Collect: AWS Tools
• Near Real-time: Amazon Kinesis Firehose
• Data Import: AWS Snowball
• Message Queuing: Amazon SQS
• Web/app Servers: Amazon EC2
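As a rough illustration of the near real-time path through Amazon Kinesis Firehose, the sketch below encodes events as newline-delimited JSON and groups them into batches. It assumes a boto3 Firehose client is passed in as `client` and that `put_record_batch` accepts at most 500 records per call; the stream name and the helper names (`to_firehose_records`, `batches`, `send`) are hypothetical and not part of the original slides.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

MAX_BATCH = 500  # assumed per-call record limit for PutRecordBatch


def to_firehose_records(events: Iterable[Dict]) -> List[Dict]:
    """Encode each event as one newline-terminated JSON line."""
    return [{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events]


def batches(records: List[Dict], size: int = MAX_BATCH) -> Iterator[List[Dict]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def send(client: Any, stream_name: str, events: Iterable[Dict]) -> int:
    """Send events via client.put_record_batch; return the number of calls made."""
    calls = 0
    for batch in batches(to_firehose_records(events)):
        client.put_record_batch(DeliveryStreamName=stream_name, Records=batch)
        calls += 1
    return calls
```

A production version would also inspect `FailedPutCount` in each response and retry the failed records; that is left out here to keep the sketch short.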
9
Collect: Data Transfer Options 10
Collect: Streaming Tools Comparison 11
Store: Storage Types
• Data Lake: stores all of an organization’s data, structured and unstructured, in one
centralized repository.
• Data Warehouse: a central repository that stores accumulations of integrated data from
one or more disparate sources.
• NoSQL Databases: schemaless databases with no common query language analogous to SQL;
they generally trade query flexibility for high I/O performance and horizontal scalability.
• Columnar
• Document
• Graph
12
Store: Tools
• Object Storage
• Amazon S3
• Amazon Glacier
• Near Real-time:
• Amazon Kinesis Streams
• RDBMS
• Amazon RDS
• NoSQL: Amazon DynamoDB
• Search: Amazon CloudSearch
13
Store 14
Store 15
Process & Analyze
• Hadoop Ecosystem: Amazon EMR
• Near Real-time
• AWS Lambda
• Amazon Kinesis Analytics
• Data Warehousing: Amazon Redshift
• Machine Learning: Amazon SageMaker
• Elasticsearch: Amazon Elasticsearch Service
• Process and Move
• AWS Data Pipeline
• AWS Glue
• Ad Hoc Analysis: Amazon Athena
16
Process & Analyze 17
Consume: Tools
• Visualizations: Amazon QuickSight
• Elasticsearch Analytics: Amazon Elasticsearch Service
18
Security
• Shared Responsibility Model
• VPC: Virtual Private Cloud
• IAM: Identity and Access Management
19
Security: Model 20
Security: VPC
Virtual Private Cloud
21
Security: IAM
Identity and Access Management
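To make the IAM slide more concrete, here is a minimal sketch of a least-privilege policy document that grants read-only access to one prefix of an S3 bucket. The bucket name, prefix, and the helper `read_only_prefix_policy` are hypothetical; only the policy grammar (`Version`, `Statement`, `Effect`, `Action`, `Resource`, `Condition`) follows the standard IAM JSON format.

```python
import json


def read_only_prefix_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy allowing list and read access under one S3 prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, narrowed by a prefix condition.
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Reading is an object-level action, scoped to keys under the prefix.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # "example-data-lake" and "curated/" are placeholder names.
    print(json.dumps(read_only_prefix_policy("example-data-lake", "curated/"), indent=2))
```

Granting only `s3:ListBucket` and `s3:GetObject` keeps analysts from writing to or deleting from the data lake, which fits the append-only principle listed later in the deck.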
22
Example: Simple Monthly Calculator
Which tool should we use, S3 or DynamoDB?
23
Example: Simple Monthly Calculator 24
Cost
• Simple Monthly Calculator
• Pricing pushes you in the right direction
• Understand the different pricing models and follow the best practice for each service
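The kind of comparison the calculator supports can be sketched in a few lines. The model below assumes a monthly bill made of three parts (storage, writes, reads); every rate in it is a placeholder chosen for illustration and is NOT an actual AWS price, and the names `Rates` and `monthly_cost` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Unit prices in USD. All values used below are PLACEHOLDERS, not AWS prices."""
    storage_per_gb_month: float
    per_million_writes: float
    per_million_reads: float


def monthly_cost(rates: Rates, stored_gb: float, writes: int, reads: int) -> float:
    """Sum storage cost and request cost for one month."""
    return (
        stored_gb * rates.storage_per_gb_month
        + writes / 1_000_000 * rates.per_million_writes
        + reads / 1_000_000 * rates.per_million_reads
    )


# Hypothetical rates: cheap storage with pricier requests vs. the opposite.
object_store = Rates(storage_per_gb_month=0.01, per_million_writes=4.0, per_million_reads=0.5)
key_value_db = Rates(storage_per_gb_month=0.10, per_million_writes=1.0, per_million_reads=0.2)

workload = dict(stored_gb=1_000, writes=2_000_000, reads=50_000_000)
for name, rates in [("object store", object_store), ("key-value database", key_value_db)]:
    print(f"{name}: ${monthly_cost(rates, **workload):,.2f} per month")
```

Changing the workload shifts the result: a storage-heavy workload favors the cheap-storage option, while a small, request-heavy workload can favor the other. Real figures should come from the calculator itself.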
25
AWS Design Principles
• Multiple Stage Decoupled “Data Bus”
• Multiple processing applications can read from/write to different data stores
• Processing frameworks can read from multiple data stores
• Use the right tool
• Data temperature, latency
• Lambda architecture ideas
• append-only, immutable
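The append-only, immutable idea behind the Lambda architecture can be illustrated with a small in-memory sketch. It assumes a page-view counting use case; the names `Event`, `EventLog`, `count_pages`, and `query` are hypothetical. A batch view is recomputed from the full log, and a query merges it with the events that arrived after the batch run.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)  # frozen: an event cannot be modified after creation
class Event:
    seq: int
    page: str


class EventLog:
    """Append-only master dataset: there is no update or delete method."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, page: str) -> Event:
        event = Event(seq=len(self._events), page=page)
        self._events.append(event)
        return event

    def read(self, start: int = 0) -> Tuple[Event, ...]:
        """Return an immutable snapshot of events from position `start` onward."""
        return tuple(self._events[start:])


def count_pages(events: Iterable[Event]) -> Counter:
    return Counter(e.page for e in events)


def query(log: EventLog, batch_counts: Counter, batch_upto: int) -> Counter:
    """Merge the precomputed batch view with events that arrived after it."""
    return batch_counts + count_pages(log.read(start=batch_upto))


log = EventLog()
for page in ["home", "cart", "home"]:
    log.append(page)
batch_upto = len(log.read())
batch_counts = count_pages(log.read())  # complete recomputation (batch layer)
log.append("home")                      # arrives after the batch run (speed layer)
print(query(log, batch_counts, batch_upto))
```

Because events are never changed in place, the batch view can always be rebuilt from scratch, so a bug in the processing code does not corrupt the stored data.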
26
Demo
S3 + Glue + Athena + QuickSight
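The slides do not include the demo steps, so the following is only a sketch of one common layout for this combination: data is written to S3 under Hive-style `key=value` partition prefixes, catalogued by Glue, and queried from Athena with a filter on the partition columns. The prefix, table name, the `page` column, and both helper functions are hypothetical, and the query assumes the partition columns are catalogued as strings.

```python
from datetime import date


def partitioned_key(prefix: str, day: date, filename: str) -> str:
    """Build an S3 object key with Hive-style year/month/day partitions."""
    return (
        f"{prefix.strip('/')}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


def daily_count_query(table: str, day: date) -> str:
    """Build a SQL query that restricts the scan to a single day's partition."""
    return (
        f"SELECT page, COUNT(*) AS views FROM {table} "
        f"WHERE year = '{day.year:04d}' AND month = '{day.month:02d}' "
        f"AND day = '{day.day:02d}' GROUP BY page ORDER BY views DESC"
    )


if __name__ == "__main__":
    d = date(2018, 3, 7)
    print(partitioned_key("clicks", d, "part-0.json"))
    print(daily_count_query("clicks", d))
```

Filtering on partition columns limits how much data Athena has to scan, which ties back to the cost slide since Athena charges by data scanned.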
27
Resources
• Big Data Technology Fundamentals
• AWS Public Dataset Program
• AWS edX Courses
• AWS Quick Start
28
29
