AWS provides a broad platform of managed services to help you build, secure, and seamlessly scale end-to-end Big Data applications quickly and with ease. Want to get ramped up on how to use Amazon's big data web services? Learn when to use which service? Want to write your first big data application on AWS? Join us in this session as we discuss reference architecture, design patterns, and best practices for pulling together various AWS services to meet your big data challenges.
2. • Big Data Challenges
• Architectural Principles
• Stages in a Data Processing Pipeline
• Demo : Build a data processing pipeline
• Design Patterns
Agenda
6. Architectural Principles
Build decoupled systems
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Use log-centric design patterns
• Immutable logs, materialized views
Be cost-conscious
• Big data ≠ big cost
8. COLLECT STORE PROCESS/
ANALYZE
CONSUME
Simplify Big Data Processing
Transactions
Connected
devices
Web logs /
cookies
Social media
ERP
Answers &
Insights
Data analysts
Data scientists
Business users
Engagements
Automation / events
Data
10. COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENTS
FILES
Messaging
Message MESSAGES
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
LoggingIoTApplicationsTransportMessaging
Streaming App
11. COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENTS
FILES
Messaging
Message MESSAGES
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
LoggingIoTApplicationsTransportMessaging
Streaming App
Amazon DynamoDB
Amazon ElastiCache
Amazon RDS
SQLNoSQLCache
12. COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENTS
FILES
Messaging
Message
Devices
Sensors &
IoT platforms
AWS IoT
LoggingIoTApplicationsTransportMessaging
Streaming App
Amazon S3
FileSearch
Amazon Elasticsearch
Service
Amazon DynamoDB
Amazon ElastiCache
Amazon RDS
SQLNoSQLCache
MESSAGES
STREAMS
13. COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENTS
FILES
Messaging
Message MESSAGES
Devices
Sensors &
IoT platforms
AWS IoT
Amazon SQS
Message
LoggingIoTApplicationsTransportMessaging
Streaming App
Amazon S3
FileSearch
Amazon Elasticsearch
Service
Amazon DynamoDB
Amazon ElastiCache
Amazon RDS
SQLNoSQLCache
STREAMS
14. COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENTS
FILES
Messaging
Message MESSAGES
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
Apache Kafka
Amazon Kinesis
Streams
Amazon Kinesis
Firehose
Amazon DynamoDB
Streams
Stream
Amazon SQS
Message
LoggingIoTApplicationsTransportMessaging
Streaming App
Amazon S3
FileSearch
Amazon Elasticsearch
Service
Amazon DynamoDB
Amazon ElastiCache
Amazon RDS
SQLNoSQLCache
15. COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENTS
FILES
Messaging
Message MESSAGES
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
Apache Kafka
Amazon Kinesis
Streams
Amazon Kinesis
Firehose
Amazon DynamoDB
Streams
Stream
Amazon SQS
Message
LoggingIoTApplicationsTransportMessaging
Streaming App
Amazon S3
FileSearch
Amazon Elasticsearch
Service
Amazon DynamoDB
Amazon ElastiCache
Amazon RDS
SQLNoSQLCache
16. Weblogs – Common Log Format (CLF)
75.35.230.210 - - [20/Jul/2016:22:22:42 -0700]
"GET /images/pigtrihawk.jpg HTTP/1.1" 200 29236
"http://www.swivel.com/graphs/show/1163466"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.11)
Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)"
17. Amazon Kinesis
Streams
• For Technical Developers
• Build your own custom
applications that process
or analyze streaming
data
Amazon Kinesis
Firehose
• For all developers, data
scientists
• Easily load massive
volumes of streaming data
into S3, Amazon Redshift
and Amazon Elasticsearch
Amazon Kinesis
Analytics
• For all developers, data
scientists
• Easily analyze data
streams using standard
SQL queries
Amazon Kinesis: Streaming Data Made Easy
Services make it easy to capture, deliver and process streams on AWS
18. Why Is Amazon S3 Good for Big Data?
• Unlimited number of objects and volume of data
• Very high bandwidth – no aggregate throughput limit
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• No need to run compute clusters for storage (unlike HDFS)
• Multiple & heterogeneous analysis clusters can use the same data
• Designed for 99.99% availability – can tolerate zone failure
• Designed for 99.999999999% durability
• No need to pay for data replication
• Native support for versioning
• Tiered-storage (Standard, IA, Amazon Glacier) via life-cycle policies
• Secure – SSL, client/server-side encryption at rest
• Low cost
31. Building a pipeline
PROCESS:
Amazon EMR with Spark & Hive
STORE
ANALYZE & VISUALIZE:
Amazon Redshift and Amazon QuickSight
COLLECT:
Amazon Kinesis Firehose
32. • Check the files which were ingested into Amazon S3
• Clean the data using Amazon EMR (Spark)
• Create a table in Amazon Athena
• Query data using SQL
Demo