This document discusses big data and the AWS tools for managing it. It defines big data as data characterized by high volume, velocity, and variety. AWS provides scalable tools such as EC2, EMR, Kinesis, and Redshift for ingesting, processing, and analyzing large, diverse datasets. These tools work together in an integrated environment and scale automatically with demand, providing a cost-effective answer to big data challenges. A real-time IoT analytics use case illustrates how the different AWS products interact to process sensor data streams.
2. Agenda
• What is Big Data?
• What is AWS?
• Presenting the tools: how Big Data and AWS fit together
3. What is Big Data?
• It sits at the intersection of the 3 Vs of data:
• Velocity (batch / real-time / streaming)
• Volume (terabytes / petabytes)
• Variety (structured / semi-structured / unstructured)
4. Why is everybody talking about it?
• The cost of generating data has gone down
• By 2015, 3B people will be online, pushing the volume of data created to 8 zettabytes
• More data = more insights = better decisions
• The ease and cost of processing keep falling thanks to cloud platforms
5. Data flow and constraints
Generate → Ingest / Store → Process → Visualize / Share
The 3 Vs introduce heterogeneity, which makes each of these steps hard to achieve.
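The four pipeline stages above can be sketched locally. This is a minimal illustration with invented function names and an in-memory list standing in for durable storage; in a real deployment each stage would be backed by an AWS service.

```python
# Minimal local sketch of the four pipeline stages from the slide.
# All function names and the in-memory "store" are illustrative.

def generate():
    # Generate: raw readings from some source (here, hard-coded samples)
    return [{"sensor": "s1", "temp": 21.5}, {"sensor": "s2", "temp": 84.0}]

def ingest(records, store):
    # Ingest / Store: durably persist raw records (a list stands in for S3/Kinesis)
    store.extend(records)
    return store

def process(store):
    # Process: derive an insight, e.g. the maximum temperature seen
    return max(r["temp"] for r in store)

def visualize(result):
    # Visualize / Share: render the result for consumers
    return f"max temperature: {result:.1f}"

store = []
ingest(generate(), store)
print(visualize(process(store)))  # max temperature: 84.0
```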
6. What is AWS?
• AWS is a cloud computing platform
• On-demand delivery of IT resources
• Pay-as-you-go pricing model
7. Cloud Computing
Compute + Storage + Networking
Adapts dynamically to ever-changing needs, closely tracking the user's infrastructure and application requirements.
8. How does AWS help with Big Data?
• Removes constraints on the ingesting, storing, and processing layers, and adapts closely to demand
• Provides a collection of integrated tools that adapt to the 3 Vs of Big Data
• Virtually unlimited storage capacity and processing power fit changing data storage and analysis requirements
10. Computing Solutions for Big Data on AWS
EC2
All-purpose computing instances
Dynamic provisioning and resizing
Lets you scale your infrastructure at low cost
Use Case: well suited for running custom or proprietary applications (e.g. SAP HANA, Tableau…)
11. Computing Solutions for Big Data on AWS
EMR
"Hadoop in the cloud"
Adapts to the complexity of the analysis and the volume of data to process
Use Case: offline processing of very large volumes of data, possibly unstructured (the Variety variable)
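EMR runs Hadoop-style map/reduce jobs. The pure-Python word count below mirrors the map and reduce phases locally to show the programming model; it is a sketch of the paradigm, not the EMR API.

```python
from collections import Counter

def map_phase(lines):
    # Map: emit (word, 1) pairs for each word in the input
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # Reduce: sum the counts per key
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

result = reduce_phase(map_phase(["big data", "Big clusters"]))
print(result)  # {'big': 2, 'data': 1, 'clusters': 1}
```

On EMR the same two phases would run distributed across the cluster's nodes, with the framework shuffling intermediate pairs between them.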
12. Computing Solutions for Big Data on AWS
Kinesis
Stream processing
Real-time data
Scales to adapt to the flow of inbound data
Use Case: complex event processing, click streams, sensor data, computation over a window of time
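"Computation over a window of time" is the kind of logic one would run inside a Kinesis consumer. The class below is an illustrative sliding-window aggregator with names of our choosing; it is not a Kinesis API.

```python
from collections import deque

class TimeWindowAverage:
    """Average of values seen within the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        self.events.append((timestamp, value))
        # Evict events that have fallen out of the window
        while self.events and self.events[0][0] < timestamp - self.window:
            self.events.popleft()

    def average(self):
        if not self.events:
            return None
        return sum(v for _, v in self.events) / len(self.events)

w = TimeWindowAverage(window_seconds=60)
w.add(0, 10.0)
w.add(30, 20.0)
w.add(90, 30.0)   # the t=0 event is now outside the 60 s window
print(w.average())  # 25.0
```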
13. Computing Solutions for Big Data on AWS
Redshift
Data warehouse in the cloud
Scales to petabytes
Supports SQL querying
Start small for just $0.25/h
Use Case: BI analysis, use of legacy ODBC/JDBC software to analyze or visualize data
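Because Redshift speaks standard SQL over ODBC/JDBC, the kind of BI aggregation it serves can be shown with SQLite as a local stand-in. Table and column names here are invented for the example.

```python
import sqlite3

# SQLite stands in locally for Redshift to show the shape of a typical
# BI aggregation query; the SQL itself is standard.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("home", 120), ("pricing", 45), ("home", 80)])

# Aggregate views per page, most popular first
rows = conn.execute(
    "SELECT page, SUM(views) AS total FROM page_views "
    "GROUP BY page ORDER BY total DESC"
).fetchall()
print(rows)  # [('home', 200), ('pricing', 45)]
```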
15. Storage Solutions for Big Data on AWS
DynamoDB
NoSQL database
Consistent, low-latency access
Flexible, schemaless data model
Use Case: serving workloads that need consistent low-latency reads and writes, such as sensor state lookups or user session data
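DynamoDB's flexible data model means items in the same table share a primary key attribute but can otherwise carry different attributes. The dict-backed model below illustrates that idea locally; real access goes through the DynamoDB API, not this code.

```python
# Local model of a schemaless table: a plain dict keyed by the primary key.
table = {}  # primary key -> item

def put_item(item):
    table[item["sensor_id"]] = item

def get_item(sensor_id):
    # Key-based lookup is what gives DynamoDB its consistent low latency
    return table.get(sensor_id)

put_item({"sensor_id": "s1", "temp": 20.1})
put_item({"sensor_id": "s2", "temp": 19.8, "firmware": "2.1"})  # extra attribute
print(get_item("s2")["firmware"])  # 2.1
```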
16. Storage Solutions for Big Data on AWS
S3
Versatile storage service
Low cost
Fast retrieval of data
Use Case: backups and disaster recovery, media storage, storage for data analysis
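When S3 is used as storage for data analysis, it is common to encode a date partition into the object key so downstream jobs (e.g. on EMR) read only the slices they need. The key layout below is one hypothetical convention, not a prescribed one.

```python
from datetime import datetime, timezone

def make_key(dataset, ts, filename):
    """Build a date-partitioned object key like raw/2014/06/01/events.json.gz."""
    return f"{dataset}/{ts:%Y/%m/%d}/{filename}"

key = make_key("raw", datetime(2014, 6, 1, tzinfo=timezone.utc), "events.json.gz")
print(key)  # raw/2014/06/01/events.json.gz
```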
17. Storage Solutions for Big Data on AWS
Glacier
Archive storage for cold data
Extremely low cost
Optimized for infrequently accessed data
Use Case: storing raw data logs, storing media archives, magnetic tape replacement
19. Integrated Environment for Big Data
Given the 3 Vs, a collection of tools is usually needed for data processing and storage.
AWS Big Data solutions come already integrated with each other.
They also integrate with the whole AWS ecosystem (security, identity management, logging, backups, Management Console…).
21. A tightly integrated, rich environment of tools
+
On-demand scaling that tracks processing requirements
=
An extremely cost-effective, easy-to-deploy solution for big data needs
22. Use Case: Real-Time IoT Analytics
Gather data in real time from sensors deployed in a factory and send it for immediate processing:
• Error detection: real-time detection of hardware problems
• Optimization and energy management
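A real-time error-detection rule of the kind evaluated over such a sensor stream might flag a sensor after several consecutive out-of-range readings. The threshold, sensor names, and rule below are invented for illustration.

```python
def detect_errors(readings, limit=80.0, consecutive=2):
    """Return ids of sensors whose value exceeded `limit` `consecutive` times in a row."""
    streak, flagged = {}, set()
    for sensor, value in readings:
        # Extend the streak on an out-of-range reading, reset it otherwise
        streak[sensor] = streak.get(sensor, 0) + 1 if value > limit else 0
        if streak[sensor] >= consecutive:
            flagged.add(sensor)
    return flagged

stream = [("pump-1", 75.0), ("pump-2", 91.0), ("pump-2", 95.5), ("pump-1", 79.9)]
print(detect_errors(stream))  # {'pump-2'}
```

Requiring consecutive violations rather than a single one makes the rule less sensitive to one-off noisy readings.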
23. First Version of the Infrastructure
On the customer site, a node.js stream processor aggregates the sensor data and evaluates rules over a time window. Results go to MongoDB, which feeds the algorithms, while raw data is written to an in-house Hadoop cluster for further processing and backup.
24. Second Version of the Infrastructure
On the customer site, sensor data is aggregated and sent to Kinesis, which evaluates the rules over a time window. Processed data flows into Redshift for BI analysis, while raw data is written to Glacier for archiving.
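The batch the on-site aggregator could send to Kinesis can be sketched without calling AWS. The payload below matches the shape of the Kinesis PutRecords API (a `Data` blob and a `PartitionKey` per record); the actual SDK call is omitted, and the sensor fields are invented.

```python
import json

def build_batch(readings):
    """Build a PutRecords-style batch from a list of sensor readings."""
    return [
        {
            "Data": json.dumps(r).encode("utf-8"),
            # Partitioning by sensor id keeps each sensor's readings
            # ordered within a single shard
            "PartitionKey": r["sensor"],
        }
        for r in readings
    ]

batch = build_batch([{"sensor": "press-3", "temp": 71.2, "ts": 1000}])
print(batch[0]["PartitionKey"])  # press-3
```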