The document provides an overview of big data concepts and Amazon Web Services (AWS) products for big data and analytics. It describes challenges of big data including unpredictable resource demand and job orchestration complexities. It then summarizes AWS products for data collection, storage, processing, analytics and machine learning. Specific examples are given using AWS services like Redshift, EMR, Kinesis and DynamoDB for scenarios like data warehousing, real-time streaming and Hadoop workloads. Core principles and common challenges of big data implementations on AWS are also outlined.
1. Big Data & Analytics
Randall Barnes - Bill Moritz - Kevin Dillon
2. Today’s Session Objectives
• Describe core concepts, common objectives and lessons learned
• Present specific platforms and products available in AWS
• Provide live, hands-on deployment experience
3. Big Data applications are defined as having data volume or variety or velocity characteristics that render traditional tools/processes impractical
Great potential…
• Keep pace with the accelerating information explosion
• New insights and analytics to improve business decisions
• Create new applications requiring massive real-time data processing
…and at times challenging
• Unpredictable resource demand
• Job orchestration and management complexities
• Geo-distribution of data sources
4. Why Big Data solutions have worked well in the cloud
• Reduce costs per workload, saving money and creating opportunities
• Extremely flexible: ability to provide answers to analytics questions that don't yet exist
6. Three types of data-driven development
• Retrospective analysis and reporting: Amazon Redshift, Amazon EMR
• Here-and-now real-time processing and dashboards: Amazon Kinesis, Amazon EC2, AWS Lambda
• Predictions to enable smart applications: Amazon Machine Learning, Amazon EMR
7. Core Principles for Successful Implementations
Elastic resource capacity
• Data Storage, I/O, Computing resources scale on demand
• Dynamically support multiple ephemeral environments such as dev, test and QA validations
• No up-front capital expenses; pay only for what you use
Streamlined management of platforms and solutions
• Raw infrastructure resources
• Application stacks
Well-supported ecosystem of tools and applications
• Data integration tools
• Analytics and reporting applications
• Resource and job orchestration
8. What is changing…
Diverse and non-traditional workloads
• Using big data strategies, tools and products to solve problems that have not traditionally been viewed as big data.
Leverage managed solutions to reduce complexity and staff constraints
• AWS-managed platforms
• 3rd-Party frameworks
Making The Cloud Work For Your Enterprise
9. Most common implementation challenges:
• Managing distributed data sets
• Application platform migration – limited resources
• ETL integration, especially leveraging existing IP and business logic
10. TCO Mistakes
Overprovisioning
• High I/O storage space for non-active data sets
• Non-linear cost increase for certain instance types
Static resources
• Low overall utilization
• Not leveraging spot instance pricing
Not leveraging Reserved Instance (RI) price strategies
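To make the "static resources" point concrete, here is a small sketch comparing capacity fixed at peak with capacity that follows demand. The hourly rate, instance counts and demand profile are made-up illustrative values, not AWS prices.

```python
# Hypothetical figures for illustration only; not actual AWS pricing.
HOURLY_RATE = 0.50       # assumed cost per instance-hour (USD)
PEAK_INSTANCES = 20      # capacity needed at peak

# Assumed demand over one day: instances actually needed in each hour.
hourly_demand = [4] * 16 + [20] * 8


def static_cost(demand, peak, rate):
    """Cost when capacity is fixed at peak for every hour."""
    return len(demand) * peak * rate


def elastic_cost(demand, rate):
    """Cost when capacity follows demand hour by hour."""
    return sum(demand) * rate


def utilization(demand, peak):
    """Fraction of statically provisioned instance-hours actually used."""
    return sum(demand) / (len(demand) * peak)


static = static_cost(hourly_demand, PEAK_INSTANCES, HOURLY_RATE)
elastic = elastic_cost(hourly_demand, HOURLY_RATE)
used = utilization(hourly_demand, PEAK_INSTANCES)
print(f"static: ${static:.2f}, elastic: ${elastic:.2f}, utilization: {used:.0%}")
```

With this assumed profile, less than half of the statically provisioned instance-hours are used, which is the low-utilization pattern the slide warns about.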
11. Example 1: MPP Data Warehouse
[Diagram: Bulk Transfer; Data Stage and Archive; Data Warehouse; Reporting and Analytics]
12. Example 1: Elastic Data Warehouse
[Diagram: Import/Export Service; Direct Connect; S3; Glacier; Redshift]
13. Data Collection, Ingestion and Consumption
Import/Export Service
• Ship storage devices directly to Amazon
• Transfer to EBS or S3
• Up to 4TB per device
Direct Connect
• Higher bandwidth, more consistent performance
• 1Gbps and 10Gbps ports [network providers may offer slice]
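A back-of-the-envelope sketch can help when choosing between shipping a device and using a network link. It estimates how long a full 4TB device's worth of data would take over the 1Gbps and 10Gbps port speeds listed above. It assumes decimal units and an ideal link; the `efficiency` parameter is a hypothetical knob for protocol overhead, not a measured value.

```python
TB = 10**12     # decimal terabyte, in bytes
GBPS = 10**9    # gigabit per second, in bits per second


def transfer_hours(data_bytes, link_bits_per_sec, efficiency=1.0):
    """Hours needed to move data_bytes over a link at the given efficiency."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = data_bytes * 8 / (link_bits_per_sec * efficiency)
    return seconds / 3600


for label, speed in [("1Gbps", 1 * GBPS), ("10Gbps", 10 * GBPS)]:
    hours = transfer_hours(4 * TB, speed)
    print(f"4TB over {label}: {hours:.1f} hours")
```

Real throughput is usually lower than the port speed, so passing something like `efficiency=0.7` gives a more cautious estimate.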
14. Amazon Simple Storage Service (S3)
Object storage container with virtually unlimited capacity
• Store files (objects) in containers (buckets)
• Redundant copies for high durability and reliability
• Available on the internet via REST requests, directly or through an SDK
• Multiple strategies to secure contents
• Set permissions, access policies and optionally require MFA
• Encryption: server-side (simplified) or client-side
• Audit logging (optional) records all access requests made via the API
• Built-in tools for managing versioning, object lifecycle and creating static websites
• Low pay-as-you-go pricing, a function of storage amount (~$0.03/GB/month) plus metering of I/O requests
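As one illustration of "access policies and optionally require MFA", the sketch below builds a bucket policy document that denies S3 actions on requests that were not authenticated with MFA. The bucket name and prefix are hypothetical placeholders, and the code only constructs and prints the JSON; attaching it to a bucket is left out.

```python
import json


def mfa_required_policy(bucket, prefix=""):
    """Build a bucket policy that denies S3 actions on requests made without MFA."""
    resource = f"arn:aws:s3:::{bucket}/{prefix}*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyRequestsWithoutMFA",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": resource,
                # The MFA age key is absent (null) when the request
                # was not authenticated with an MFA device.
                "Condition": {"Null": {"aws:MultiFactorAuthAge": True}},
            }
        ],
    }


# Hypothetical bucket and prefix, used only for illustration.
policy = mfa_required_policy("example-data-stage-bucket", "restricted/")
print(json.dumps(policy, indent=2))
```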
15. Amazon Redshift
Scalable, fully-managed Data Warehouse
• High performance, massively parallel columnar storage architecture providing streamlined scalability
• Mainstream SQL query syntax (PostgreSQL) allowing for rapid platform adoption
• Flexible node type and RI options allowing for workload alignment and cost efficiency
• Integrated with other AWS Big Data platforms (S3, EMR, DynamoDB, Data Pipeline)
• Streamlined administrative tasks (snapshot/restore, node increase/decrease)
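The "columnar storage" idea can be sketched in a few lines. This is a toy, in-memory illustration of why a column layout suits analytic queries, not a description of how Redshift stores data internally; the sample rows are made up.

```python
# Made-up sample rows, as a row-oriented store would hold them.
rows = [
    {"order_id": 1, "region": "west", "amount": 120.0},
    {"order_id": 2, "region": "east", "amount": 75.5},
    {"order_id": 3, "region": "west", "amount": 30.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate such as SUM(amount) reads one column,
# not every field of every row.
total_amount = sum(columns["amount"])
print(total_amount)
```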
16. Recap: Elastic Data Warehouse
[Diagram: Import/Export Service; Direct Connect; S3; Glacier; Redshift]
17. Example 2: Real-Time Data Streaming and NoSQL
[Diagram: Application Tier; Backend Apps; Real-Time Processing; NoSQL; Data Warehouse]
18. Example 2: Real-Time Data Streaming and NoSQL
[Diagram: Application Tier; Backend Apps; Kinesis; DynamoDB; Data Warehouse]
19. Amazon Kinesis
Scalable real-time diverse data processing
• Fully managed service
• Real-time log/application data ingestion and transformations
• Real-time reporting and analytics
• Data ordering, deterministic routing and replay (up to 24 hours)
• Records: Partition Key, Sequence Number, Data Blob (payload)
• Shards: units of incremental throughput capacity
• Use SDK APIs for PUT/GET operations
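The record and shard concepts above can be sketched with a toy in-memory stream. Kinesis maps a partition key to a shard through an MD5 hash of the key; the even split of the hash space, the `ToyStream` class and the position-based sequence numbers below are simplifying assumptions for illustration, not the service's actual behavior or API.

```python
import hashlib

HASH_SPACE = 2**128  # MD5 yields a 128-bit value


def shard_for(partition_key, shard_count):
    """Map a partition key to a shard index using evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


class ToyStream:
    """In-memory stand-in for a stream: one ordered record list per shard."""

    def __init__(self, shard_count):
        self.shards = [[] for _ in range(shard_count)]

    def put_record(self, partition_key, data):
        shard_id = shard_for(partition_key, len(self.shards))
        shard = self.shards[shard_id]
        sequence_number = len(shard)  # toy sequence number: position in shard
        shard.append((sequence_number, partition_key, data))
        return shard_id, sequence_number


stream = ToyStream(shard_count=4)
for i in range(3):
    print(stream.put_record("device-42", f"reading-{i}".encode()))
```

Because the same partition key always hashes to the same shard, records for one key stay in order relative to each other, which is what "data ordering" and "deterministic routing" refer to.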
20. Amazon DynamoDB
• Seamless and virtually unlimited scalability; managed automatically
• Ability to define specific resource allocation limits
• Easy administration and well-supported development model
• Integration with other core Amazon data services
• GET/PUT operations with a user-defined Primary Key
• Tables contain items (PK + Attributes) up to 400KB
• Data types: scalar, set (collections), key-values, documents
• Secondary Indexes (Global and Local)
• Provisioned read- and write-throughput, SSD storage
Challenge: Proprietary API via AWS SDKs (e.g. Java, .NET)
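Provisioned read and write throughput is expressed in capacity units. The sketch below estimates them under the standard rules: one read unit covers one strongly consistent read per second of an item up to 4KB (eventually consistent reads cost half), and one write unit covers one write per second of an item up to 1KB. The function names and the workload numbers are illustrative assumptions.

```python
import math

MAX_ITEM_KB = 400  # item size limit stated on the slide


def _check_size(item_size_kb):
    if not 0 < item_size_kb <= MAX_ITEM_KB:
        raise ValueError("item size must be between 0 and 400 KB")


def read_capacity_units(item_size_kb, reads_per_sec, strongly_consistent=True):
    """Read units needed: item size is rounded up to the next 4 KB block."""
    _check_size(item_size_kb)
    total = math.ceil(item_size_kb / 4) * reads_per_sec
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_size_kb, writes_per_sec):
    """Write units needed: item size is rounded up to the next 1 KB block."""
    _check_size(item_size_kb)
    return math.ceil(item_size_kb) * writes_per_sec


# Assumed workload: 6 KB items, 100 reads/sec and 50 writes/sec.
print(read_capacity_units(6, 100))
print(read_capacity_units(6, 100, strongly_consistent=False))
print(write_capacity_units(6, 50))
```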
21. Recap: Real-Time Data Streaming and NoSQL
[Diagram: Application Tier; Backend Apps; Kinesis; DynamoDB; Data Warehouse]
22. Example 3: Hadoop Workloads
[Diagram: Application and Data Stage Tiers; Hadoop Processing; Data Warehouse; Analytics]
23. Example 3: Hadoop Workloads
[Diagram: Application and Data Stage Tiers; EMR; Data Warehouse; Analytics]
24. Amazon Elastic MapReduce (EMR)
• Semi-managed service (access to underlying OS)
• Apache Hadoop framework
• Robust, streamlined management for MapReduce jobs
• Simple API for popular extensions, e.g. Hive, Pig, Spark
• Spot Instance pricing available
• HDFS or S3 storage
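For readers new to the MapReduce model that EMR manages, here is a minimal single-process sketch of the classic word count. It imitates the map, shuffle and reduce phases in plain Python and does not use Hadoop or any EMR API; the input lines are made up.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data on AWS", "Big data at scale"])
print(counts)
```

On a real cluster the map and reduce calls run in parallel across nodes, with input and output on HDFS or S3.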
Your Data + Machine Learning = Smart Applications
28. Example 4: Machine learning
Machine learning is the technology that automatically finds patterns in your data and uses them to make predictions for new data points as they become available
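The definition above can be illustrated with a tiny model that learns a pattern from past examples and scores new ones. This is a generic one-feature logistic regression written from scratch, not a description of Amazon ML internals; the training data, learning rate and epoch count are made-up assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=2000, learning_rate=0.1):
    """Fit a weight and bias for one-feature logistic regression by gradient descent."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        for x, label in examples:
            error = sigmoid(weight * x + bias) - label
            weight -= learning_rate * error * x
            bias -= learning_rate * error
    return weight, bias


def predict(model, x):
    """Return the estimated probability that the label is 1."""
    weight, bias = model
    return sigmoid(weight * x + bias)


# Made-up history: (order amount in hundreds of dollars, 1 = fraudulent).
history = [(0.2, 0), (0.5, 0), (0.8, 0), (1.0, 0),
           (4.0, 1), (4.5, 1), (5.0, 1), (6.0, 1)]
model = train(history)

# Score two new data points the model has not seen.
print(predict(model, 0.3) < 0.5, predict(model, 5.5) > 0.5)
```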
29. Amazon ML
• Easy to use, managed machine learning service built for developers
• Robust, powerful machine learning technology based on Amazon’s internal systems
• Create models using your data already stored in the AWS cloud (S3 files, Redshift query, MySQL RDS query)
• Deploy models to production in seconds
30. Smart applications by example
• Based on what you know about an order: Is this order fraudulent?
• Based on what you know about the user: Will they use your product?
• Based on what you know about a news article: What other articles are interesting?
31. And a few more examples…
• Fraud detection: Detecting fraudulent transactions, filtering spam emails, flagging suspicious reviews, …
• Personalization: Recommending content, predictive content loading, improving user experience, …
• Targeted marketing: Matching customers and offers, choosing marketing campaigns, cross-selling and up-selling, …
• Content classification: Categorizing documents, matching hiring managers and resumes, …
• Churn prediction: Finding customers who are likely to stop using the service, free-tier upgrade targeting, …
• Customer support: Predictive routing of customer emails, social media listening, …
32. Securing Data in the Cloud
• Secure your AWS console root account
• Use complex passwords and rotate regularly
• Secure data stage locations
34. Contact Us
Contact Info
Randall Barnes
Principal Architect, 2nd Watch
rbarnes@2ndwatch.com
Bill Moritz
Sr Cloud Engineer, 2nd Watch
bmoritz@2ndwatch.com
2nd Watch, Inc.
1-888-317-7920
info@2ndwatch.com
www.2ndwatch.com
Locations
SEATTLE, NEW YORK, VIRGINIA, ATLANTA, PHILADELPHIA, HOUSTON, LIBERTY LAKE, LOS ANGELES, CHICAGO