AWS Big Data Tools Made Simple

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Dickson Yue
Solutions Architect
Amazon Web Services
April 2017

Agenda
Big data challenges
How to simplify big data processing
What technologies should you use?
Why?
How?
Reference architecture

Ever-increasing data
Volume
Velocity
Variety

Evolution of big data
Batch processing: Report
Machine learning: Predict
Stream processing: Alert

Evolution of cloud services
Virtual machines
Managed services
Serverless

Plethora of tools
Amazon Glacier
S3
DynamoDB
RDS
EMR
Amazon Redshift
Data Pipeline
Amazon Kinesis
Cassandra
CloudSearch
Kinesis-enabled app
Lambda
ML
SQS
ElastiCache
DynamoDB Streams

Simplify Big Data Processing
COLLECT → STORE → PROCESS / ANALYZE → CONSUME
Data in, answers out
Time to answer (latency)
Throughput
Cost
Start here: with a business case

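Architectural principles
Build decoupled systems
  Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
  Data structure, latency, throughput, access patterns
Leverage AWS managed services
  Scalable/elastic, available, reliable, secure, no/low admin
Use log-centric design patterns
  Immutable logs, materialized views
Be cost-conscious
  Big data ≠ big cost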
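
The log-centric principle (an immutable log plus materialized views derived from it) can be illustrated with a minimal in-memory sketch. The `Event` schema, the `AppendOnlyLog` class, and the balance view are hypothetical names chosen for illustration; they are not part of the talk or of any AWS API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One immutable log record (hypothetical schema)."""
    offset: int
    account: str
    amount: int


class AppendOnlyLog:
    """A log that supports only appends and reads; records are never updated."""

    def __init__(self):
        self._events = []

    def append(self, account, amount):
        event = Event(offset=len(self._events), account=account, amount=amount)
        self._events.append(event)
        return event

    def read(self, from_offset=0):
        # Return a tuple so callers cannot mutate the log through the result.
        return tuple(self._events[from_offset:])


def materialize_balances(log):
    """Rebuild a view (balance per account) by replaying the whole log."""
    balances = defaultdict(int)
    for event in log.read():
        balances[event.account] += event.amount
    return dict(balances)


log = AppendOnlyLog()
log.append("alice", 100)
log.append("bob", 50)
log.append("alice", -30)

view = materialize_balances(log)
print(view)
```

Because the view is derived entirely from the log, it can be dropped and rebuilt at any time, and several different views can be computed from the same records.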

Types of data (COLLECT)
Applications (mobile apps, web apps, data centers via AWS Direct Connect) → RECORDS
  In-memory data structures, database records: Transactions
Logging and transport (Amazon CloudWatch, AWS CloudTrail, AWS Import/Export Snowball) → DOCUMENTS, FILES
  Search documents, log files: Files
Messaging → MESSAGES
  Messages: Events
IoT (devices, sensors & IoT platforms, AWS IoT) → STREAMS
  Data streams: Events

Data characteristics: hot, warm, cold

             | Hot data  | Warm data | Cold data
-------------|-----------|-----------|----------
Volume       | MB–GB     | GB–TB     | PB–EB
Item size    | B–KB      | KB–MB     | KB–TB
Latency      | ms        | ms, sec   | min, hrs
Durability   | Low–high  | High      | Very high
Request rate | Very high | High      | Low
Cost/GB      | $$-$      | $-¢¢      | ¢

Types of data stores (STORE)
Database: SQL & NoSQL databases
Search: Search engines
File store: File systems
Queue: Message queues
Stream storage: Pub/sub message queues
In-memory: Caches, data structure servers

Message and stream storage
Amazon SQS
• Managed message queue service
Apache Kafka
• High-throughput distributed streaming platform
Amazon Kinesis Streams
• Managed stream storage and processing service
Amazon Kinesis Firehose
• Managed stream storage service
Amazon DynamoDB
• Managed NoSQL database service
• Tables can be stream-enabled

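Why stream storage?
Decouple producers and consumers
Persistent buffer
Collect multiple streams
Preserve record ordering
Parallel consumption by multiple consumers
[Diagram: producers 1 to n write keyed records (for example, key = violet) into shard 1 / partition 1 and shard 2 / partition 2 of a DynamoDB stream, Amazon Kinesis stream, or Kafka topic. Each shard keeps records in order 1, 2, 3, 4. Consumer 1 reports count of red = 4 and count of violet = 4; consumer 2 reports count of blue = 4 and count of green = 4.]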
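
The diagram's idea (records with the same key land in the same shard, stay in order, and are counted by consumers working in parallel) can be sketched in memory. This is a simplified illustration, not the Kinesis or Kafka client API: the modulo-of-hash shard assignment, `NUM_SHARDS`, and the function names are assumptions made for the example.

```python
import hashlib
from collections import Counter, defaultdict

NUM_SHARDS = 2  # assumption: two shards, as in the diagram


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a partition key to a shard using a stable hash (simplified)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_shards


def produce(shards, key, sequence):
    """Append a record to the shard that owns its key."""
    shards[shard_for(key)].append((key, sequence))


def consume(records):
    """One consumer per shard: count the records seen for each key."""
    counts = Counter()
    for key, _sequence in records:
        counts[key] += 1
    return counts


shards = defaultdict(list)
for sequence in range(1, 5):
    for color in ("red", "violet", "blue", "green"):
        produce(shards, color, sequence)

# Each shard can be read independently, so consumers can run in parallel.
results = {shard_id: consume(records) for shard_id, records in shards.items()}
for shard_id, counts in sorted(results.items()):
    print(shard_id, dict(counts))
```

Every color is counted four times by exactly one consumer, because a key never moves between shards. Which colors share a shard depends on the hash, so the grouping may differ from the diagram.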

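Amazon Kinesis Streams
• Scalability and elasticity
  • Scale by adding shards
• Durability and availability
  • Replication
  • Retention
• Interfaces
  • Input: data in
  • Output: data out
  • Kinesis Firehose
• Anti-patterns
  • Small-scale, consistent throughput
  • Long-term data storage and analytics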

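What about Amazon SQS?
• Decouple producers and consumers
• Persistent buffer
• Collect multiple streams
• No record ordering (standard queue)
  • FIFO queue preserves client ordering
• No streaming MapReduce
• No parallel consumption by multiple consumers
  • Amazon SNS can publish to multiple SNS subscribers (queues or Lambda functions)
[Diagram: a publisher sends to an Amazon SNS topic, which fans out to an AWS Lambda function and to Amazon SQS queues read by subscribers and consumers. A standard queue may deliver messages 1 to 4 out of order or more than once; a FIFO queue delivers 1, 2, 3, 4 in order.]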
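
The fan-out pattern mentioned above (one topic copying each message to several queues, so that each consumer gets its own copy) can be sketched with in-memory stand-ins. The `Queue` and `Topic` classes and the queue names are hypothetical and only mimic the behavior; they are not the Amazon SNS or SQS APIs.

```python
from collections import deque


class Queue:
    """In-memory stand-in for a queue: each message is received once."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        return self._messages.popleft() if self._messages else None


class Topic:
    """In-memory stand-in for a pub/sub topic that copies messages to subscribers."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, queue):
        self._subscribers.append(queue)

    def publish(self, message):
        for queue in self._subscribers:
            queue.send(message)


def drain(queue):
    """Read a queue until it is empty and return what was received."""
    received = []
    while (message := queue.receive()) is not None:
        received.append(message)
    return received


topic = Topic()
billing_queue, audit_queue = Queue(), Queue()
topic.subscribe(billing_queue)
topic.subscribe(audit_queue)

for order_id in (1, 2, 3, 4):
    topic.publish(order_id)

billing = drain(billing_queue)
audit = drain(audit_queue)
print(billing, audit)
```

A single queue hands each message to only one reader, which is why a topic in front of several queues is used when more than one consumer needs the full stream.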

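Which message or stream storage should I use?

                      | Amazon DynamoDB Streams | Amazon Kinesis Streams | Amazon Kinesis Firehose     | Apache Kafka       | Amazon SQS (Standard)  | Amazon SQS (FIFO) [New]
----------------------|-------------------------|------------------------|-----------------------------|--------------------|------------------------|------------------------
AWS managed           | Yes                     | Yes                    | Yes                         | No                 | Yes                    | Yes
Guaranteed ordering   | Yes                     | Yes                    | No                          | Yes                | No                     | Yes
Delivery (deduping)   | Exactly-once            | At-least-once          | At-least-once               | At-least-once      | At-least-once          | Exactly-once
Data retention period | 24 hours                | 7 days                 | N/A                         | Configurable       | 14 days                | 14 days
Availability          | 3 AZ                    | 3 AZ                   | 3 AZ                        | Configurable       | 3 AZ                   | 3 AZ
Scale / throughput    | No limit / ~ table IOPS | No limit / ~ shards    | No limit / automatic        | No limit / ~ nodes | No limits / automatic  | 300 TPS / queue
Parallel consumption  | Yes                     | Yes                    | No                          | Yes                | No                     | No
Stream MapReduce      | Yes                     | Yes                    | N/A                         | Yes                | N/A                    | N/A
Row/object size       | 400 KB                  | 1 MB                   | Destination row/object size | Configurable       | 256 KB                 | 256 KB
Cost                  | Higher (table cost)     | Low                    | Low                         | Low (+admin)       | Low-medium             | Low-medium

Hot → Warm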

File storage
[Diagram: COLLECT → STORE. Stream storage (hot): Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams. Message storage: Amazon SQS. File storage: Amazon S3.]

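Why is Amazon S3 good for big data?
• Supported by big data frameworks (Spark, Hive, Presto, etc.)
• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Hadoop clusters and Amazon EC2 Spot Instances
• Multiple distinct compute clusters can use the same data
• Unlimited number of objects and volume of data
• Very high bandwidth, with no aggregate throughput limit
• Designed for 99.99% availability; can tolerate zone failure
• Designed for 99.999999999% durability
• No need to pay for data replication
• Native support for versioning
• Tiered storage (Standard, IA, Amazon Glacier) via lifecycle policies
• Secure: SSL, client-side and server-side encryption
• Low cost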

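What about HDFS and data tiering?
• Use HDFS for very frequently accessed (hot) data
• Use Amazon S3 Standard for frequently accessed data
• Use Amazon S3 Standard - IA for less frequently accessed data
• Use Amazon Glacier for cold data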

In-memory, database, search
[Diagram: COLLECT → STORE. Stream storage (hot): Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams. Message storage: Amazon SQS. File storage: Amazon S3. The in-memory, database, and search stores are highlighted.]

Best practice: choose the right tool for the job
Data tier
Search: Amazon Elasticsearch Service
In-memory: Amazon ElastiCache (Redis, Memcached)
SQL: Amazon Aurora, Amazon RDS (MySQL, PostgreSQL, Oracle, SQL Server)
NoSQL: Amazon DynamoDB, Cassandra, HBase, MongoDB

[Diagram: COLLECT → STORE. Stream (hot): Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams. Message: Amazon SQS. File: Amazon S3. Cache: Amazon ElastiCache. NoSQL: Amazon DynamoDB. SQL: Amazon RDS. Search: Amazon Elasticsearch Service.]
Amazon ElastiCache
• Managed Memcached or Redis service
Amazon DynamoDB
• Managed NoSQL database service
Amazon RDS
• Managed relational database service
Amazon Elasticsearch Service
• Managed Elasticsearch service

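Which type of data store should I use?
Data structure → fixed schema, JSON, key-value
Access patterns → store data in the format you will access it
Data characteristics → hot, warm, cold
Cost → the right cost for the benefit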

Data structure and access patterns

Access pattern                      | What to use?
------------------------------------|-----------------
Put/Get (key, value)                | In-memory, NoSQL
Simple relationships → 1:N, M:N     | NoSQL
Multi-table joins, transaction, SQL | SQL
Faceting, search                    | Search

Data structure     | What to use?
-------------------|-----------------
Fixed schema       | SQL, NoSQL
Schema-free (JSON) | NoSQL, Search
(Key, value)       | In-memory, NoSQL

Which data store should I use?

                      | Amazon ElastiCache | Amazon DynamoDB       | Amazon RDS/Aurora   | Amazon ES         | Amazon S3                | Amazon Glacier
----------------------|--------------------|-----------------------|---------------------|-------------------|--------------------------|------------------
Average latency       | ms                 | ms                    | ms, sec             | ms, sec           | ms, sec, min (~ size)    | hrs
Typical data stored   | GB                 | GB–TBs (no limit)     | GB–TB (64 TB max)   | GB–TB             | MB–PB (no limit)         | GB–PB (no limit)
Typical item size     | B-KB               | KB (400 KB max)       | KB (64 KB max)      | B-KB (2 GB max)   | KB-TB (5 TB max)         | GB (40 TB max)
Request rate          | High – very high   | Very high (no limit)  | High                | High              | Low – high (no limit)    | Very low
Storage cost GB/month | $$                 | ¢¢                    | ¢¢                  | ¢¢                | ¢                        | ¢4/10
Durability            | Low - moderate     | Very high             | Very high           | High              | Very high                | Very high
Availability          | High, 2 AZ         | Very high, 3 AZ       | Very high, 3 AZ     | High, 2 AZ        | Very high, 3 AZ          | Very high, 3 AZ

Hot data → Warm data → Cold data

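Cost-conscious design example:
Should I use Amazon S3 or Amazon DynamoDB?
"I'm currently scoping out a project. The design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month..."

Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month
--------------------------|---------------------|-----------------------|------------------
300                       | 2,048               | 1,483                 | 777,600,000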

Simple Monthly Calculator: https://calculator.s3.amazonaws.com/index.html

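Amazon S3 or DynamoDB?

Scenario   | Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month | Use
-----------|---------------------------|---------------------|-----------------------|-------------------|----------------
Scenario 1 | 300                       | 2,048               | 1,483                 | 777,600,000       | Amazon DynamoDB
Scenario 2 | 300                       | 32,768              | 23,730                | 777,600,000       | Amazon S3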
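
The figures in the scenario table follow from the request rate and object size alone. The sketch below reproduces them under two assumptions that the slides do not state explicitly: a 30-day month and binary gigabytes (1024^3 bytes). The function name is made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30        # assumption: a 30-day month
BYTES_PER_GB = 1024 ** 3   # assumption: binary gigabytes


def monthly_volume(writes_per_sec, object_size_bytes):
    """Return (objects per month, total size in GB per month)."""
    objects = writes_per_sec * SECONDS_PER_DAY * DAYS_PER_MONTH
    total_gb = objects * object_size_bytes / BYTES_PER_GB
    return objects, total_gb


scenario_1 = monthly_volume(300, 2048)
scenario_2 = monthly_volume(300, 32768)

for name, (objects, total_gb) in (("Scenario 1", scenario_1),
                                  ("Scenario 2", scenario_2)):
    print(f"{name}: {objects:,} objects, {int(total_gb):,} GB/month")
```

With the request count and stored volume in hand, both can be entered into the pricing calculator to compare per-request and per-GB charges for each service.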

PROCESS / ANALYZE

Batch
Takes minutes to hours
Example: daily/weekly/monthly reports
Amazon EMR (MapReduce, Hive, Pig, Spark)
Interactive
Takes seconds
Example: self-service dashboards
Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)
Message
Takes milliseconds to seconds
Example: message processing
Amazon SQS applications on Amazon EC2
Stream
Takes milliseconds to seconds
Example: fraud alerts, 1-minute metrics
Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, Storm, AWS Lambda
Machine learning
Takes milliseconds to minutes
Example: fraud detection, demand forecasting
Amazon ML, Amazon EMR (Spark ML), GPU instance + Deep Learning AMI
Analytics types and frameworks (PROCESS / ANALYZE)
ML: Amazon Machine Learning
Message: Amazon SQS apps on Amazon EC2
Stream (fast): Amazon Kinesis Analytics, KCL apps, AWS Lambda, streaming on Amazon EC2, Amazon EMR
Interactive (fast): Amazon Redshift, Presto on Amazon EMR, Amazon Athena
Batch (slow): Amazon EMR

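Amazon Redshift
Petabyte scale
Massively parallel processing
Relational data warehouse
Fully managed, with no administration required
As low as $1,000/TB/year
A lot faster
A lot cheaper
A lot simpler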

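Amazon Redshift
• Ideal usage patterns: analytics
  • Sales data
  • Historical data
  • Gaming data
  • Social trends
  • Ad data
• Performance
  • Massively parallel processing
  • Columnar storage
  • Data compression
  • Zone maps
  • Direct-attached storage
• Cost model
  • No upfront costs or long-term commitments
  • Free backup storage equal to 100% of provisioned storage
With columnar storage, you read only the data you need.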
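
The point about columnar storage can be made concrete with a toy comparison. This is a sketch of the general idea only, not of how Amazon Redshift lays out data on disk; the sample rows and function names are invented for the example.

```python
# Hypothetical sample table: three orders with four fields each.
rows = [
    {"order_id": 1, "region": "TW", "amount": 120, "note": "gift wrap"},
    {"order_id": 2, "region": "HK", "amount": 80, "note": ""},
    {"order_id": 3, "region": "TW", "amount": 200, "note": "express"},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_store(rows, column):
    """Row layout: every field of every row is read to reach one column."""
    total, values_read = 0, 0
    for row in rows:
        values_read += len(row)
        total += row[column]
    return total, values_read


def sum_column_store(columns, column):
    """Column layout: only the requested column is read."""
    values = columns[column]
    return sum(values), len(values)


columns = to_columns(rows)
row_total, row_reads = sum_row_store(rows, "amount")
col_total, col_reads = sum_column_store(columns, "amount")
print(row_total, row_reads, col_total, col_reads)
```

Both layouts return the same sum, but the row layout touches 12 values while the column layout touches 3. The gap grows with the number of columns the query does not need.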

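Amazon Redshift
• Resize or scale: change the number or type of nodes with a few clicks
• Replication
• Backup
• Automatic recovery from failed drives and nodes
• Interfaces
  • JDBC/ODBC interface (including BI/ETL tools)
  • Amazon S3 or DynamoDB
• Anti-patterns
  • Small datasets
  • OLTP
  • Unstructured data
  • BLOB data
[Diagram: Redshift cluster with JDBC/ODBC client access, nodes connected over 10 GigE (HPC), and ingestion, backup, and restore paths.]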

Amazon EMR
Launch clusters quickly
Pay by the hour and save with Spot pricing
MapReduce, Apache Spark, Presto

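Amazon EMR
• Resize a running cluster
• Add more core or task nodes
• Fault tolerance for worker nodes (HDFS)
• Backup to S3 for resilience against master node failure
• Interfaces
  • Hive, Pig, Spark, HBase, Impala, Hunk, Presto, and other popular tools
• Anti-patterns
  • Small datasets
  • ACID (atomicity, consistency, isolation, durability)
[Diagram: multiple Amazon EMR clusters]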

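Amazon Athena
Serverless interactive query service
• Easily analyze data in Amazon S3 using standard SQL, without setting up or managing any servers or data warehouses
• No data loading required; query directly from S3
• Fast interactive query performance, with no need to worry about having enough compute resources
• Supports a variety of standard data formats, including CSV, JSON, ORC, Avro, and Parquet
• Pay only for the data scanned by the queries you run. Compressing, partitioning, and converting data to columnar formats can save 30% to 90% per query and improve performance.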
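
Because the charge depends on the data scanned, reducing scanned bytes reduces cost proportionally. The sketch below shows that arithmetic. The price per TB is a hypothetical placeholder, not a quoted rate, and the 90% reduction is simply the upper end of the range mentioned above; check current pricing before relying on any figure.

```python
BYTES_PER_TB = 1024 ** 4
PRICE_PER_TB = 5.0  # hypothetical price per TB scanned, for illustration only


def query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB):
    """Cost of one query when billing is proportional to bytes scanned."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


def scanned_after_optimization(raw_bytes, reduction):
    """Bytes scanned after compression/partitioning/columnar conversion.

    `reduction` is the fraction of scanned data eliminated (0.0 to 1.0).
    """
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0.0 and 1.0")
    return raw_bytes * (1.0 - reduction)


raw_bytes = 2 * BYTES_PER_TB  # assumption: a query that scans 2 TB of raw data
baseline = query_cost(raw_bytes)
optimized = query_cost(scanned_after_optimization(raw_bytes, 0.9))
print(f"baseline: {baseline:.2f}, optimized: {optimized:.2f}")
```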

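Amazon Machine Learning
A managed service designed to make machine learning easy for developers of all skill levels
Based on the ML technology that Amazon's internal data scientists have used for years
Amazon Machine Learning uses scalable and robust implementations of industry-standard ML algorithms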

Which analytics tool should I use?

                    | Amazon Redshift                                     | Amazon Athena                                 | Amazon EMR (Presto, Spark, Hive)
--------------------|-----------------------------------------------------|-----------------------------------------------|---------------------------------
Use case            | Optimized for data warehousing                      | Ad-hoc interactive queries                    | Presto: interactive query; Spark: general purpose (iterative ML, RT, ..); Hive: batch
Scale/throughput    | ~ Nodes                                             | Automatic / no limits                         | ~ Nodes
AWS managed service | Yes                                                 | Yes, serverless                               | Yes
Storage             | Local storage (Standard)                            | Amazon S3                                     | Amazon S3, HDFS
Optimization        | Columnar storage, data compression, and zone maps   | CSV, TSV, JSON, Parquet, ORC, Apache Web log  | Framework dependent
Metadata            | Amazon Redshift managed                             | Athena Catalog Manager                        | Hive Meta-store
BI tools support    | Yes (JDBC/ODBC)                                     | Yes (JDBC)                                    | Yes (JDBC/ODBC & Custom)
Access controls     | Users, groups, and access controls                  | AWS IAM                                       | Integration with LDAP
UDF support         | Yes (Scalar)                                        | No                                            | Yes

The rightmost column (Hive, batch) is labeled: Slow

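What about ETL?
https://aws.amazon.com/big-data/partner-solutions/
[Diagram: ETL sits between STORE and PROCESS / ANALYZE]
Data integration partners
Reduce the effort needed to move, cleanse, synchronize, manage, and automate data-related processes.
AWS Glue (New)
AWS Glue is a fully managed ETL service that makes it easy to understand your data sources, prepare the data, and move it reliably between data stores.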

CONSUME
[Diagram: COLLECT → STORE → ETL → PROCESS / ANALYZE → CONSUME]
Apps & services: applications and APIs
Analysis and visualization (Amazon QuickSight): business users
Notebooks, IDE: data scientists, developers

Amazon QuickSight
Build visualizations
Perform ad-hoc analysis
Share and collaborate through storyboards
Native access on major mobile platforms

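Introducing Amazon QuickSight
A cloud-powered business intelligence service
at 1/10th the cost of traditional BI software
✓ No IT involvement required; no need to build dimensional models
✓ Automatic discovery of all AWS data sources
✓ Super-fast, parallel, in-memory calculation engine (SPICE)
✓ Fully managed
aws.amazon.com/quicksight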

Putting It All Together

Reference architecture
COLLECT
  Mobile apps, web apps, data centers (AWS Direct Connect) → RECORDS
  AWS Import/Export Snowball, logging (Amazon CloudWatch, AWS CloudTrail) → DOCUMENTS, FILES
  Messaging → MESSAGES
  Devices, sensors & IoT platforms (AWS IoT) → STREAMS
STORE
  Stream (hot): Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams
  Message: Amazon SQS
  File: Amazon S3
  Cache: Amazon ElastiCache
  NoSQL: Amazon DynamoDB
  SQL: Amazon RDS
  Search: Amazon Elasticsearch Service
ETL
PROCESS / ANALYZE
  Batch: Amazon EMR
  Interactive: Amazon Redshift, Presto on Amazon EMR, Amazon Athena
  Message: Amazon SQS apps on Amazon EC2
  Stream: Amazon Kinesis Analytics, KCL apps, AWS Lambda, streaming on Amazon EC2
  ML: Amazon Machine Learning
CONSUME
  Apps & services (API)
  Analysis & visualization (Amazon QuickSight)
  Notebooks
  IDE

Reference architecture: Data Lake
[Diagram: data lake architecture including Amazon Athena and AWS Glue]

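Summary
Build decoupled systems
  Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
  Data structure, latency, throughput, access patterns
Leverage AWS managed services
  Scalable/elastic, available, reliable, secure, no/low admin
Use log-centric design patterns
  Immutable logs, materialized views
Be cost-conscious
  Big data ≠ big cost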

Q&A
http://aws.amazon.com/big-data/

Remember to complete
your evaluations!
