SlideShare a Scribd company logo
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ian Meyers, Principal Solutions Architect
July 7th
, 2016
Big Data Architectural Patterns
and Best Practices on AWS
Big Data Evolution
Batch
processing
Stream
processing
Machine
learning
Powerful Services, Open Ecosystem
Amazon
Glacier
S3 DynamoDB
RDS
EMR
Amazon
Redshift
Data Pipeline
Amazon
Kinesis
Kinesis-enabled
app
Lambda ML
SQS
ElastiCache
DynamoDB
Streams
Amazon Elasticsearch
Service
Amazon Web Services Open Source & Commercial Ecosystem
Separation of compute & storage
• Event driven, loose coupling, independent scaling & sizing
“Data Sympathy” - use the right tool for the job
Data structure, latency, throughput, access patterns
Managed services, open technologies
Scalable/elastic, available, reliable, secure, no/low admin
Focus on the data, and your business, not undifferentiated tasks
Architectural Principles
Big Data != Big Cost
Aim to reduce the cost of experimentation to near 0
Simplify Big Data Processing
COLLECT STORE
PROCESS/
ANALYZE
CONSUME
Time to answer (Latency)
Throughput
Cost
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
IoTApplications
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENTS
FILES
LoggingTransport
Messaging
Message MESSAGES
Messaging
COLLECT
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
AWS Mobile SDK
AWS IoT
Amazon CloudWatchLogs Agent
Amazon CognitoSync
AWS Mobile Analytics
Easily Capture Mobile Data
through Sync and Server
Logging
File based integration to
Streaming Storage
API Integration via SDK’s
and Ecosystem Tools
Amazon Kinesis Producer Library
Amazon Kinesis Agent
Principles for Collection Architecture
• Should be easy to install, manage, and
understand
• Meet requirements for data durability & recovery
• Ensure ability to meet latency & performance
requirements
• Data handling should be sympathetic to the type
of data
Simplify Big Data Processing
COLLECT STORE
PROCESS/
ANALYZE
CONSUME
Time to answer (Latency)
Throughput
Cost
Types of Data
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
IoTApplications
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENTS
FILES
LoggingTransport
Messaging
Message MESSAGES
Messaging
COLLECT
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
Database
STORE
Search
File store
Queue
Stream
storage
Cache Heavily read, application defined structure
Consistently read, relational or document data
Frequently read, often requiring natural language
processing or geospatial access
Data received as files, or where object storage is
appropriate
Data intended for a single recipient, with low
latency
Data intended for many recipients, with low
latency
COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENT
S
FILES
Messaging
Message MESSAGES
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
Stream
Amazon SQS
Message
Amazon Elasticsearch
Service
Amazon DynamoDB
Amazon S3
Amazon ElastiCache
Amazon RDS
SearchSQLNoSQLCacheFile
LoggingIoTApplicationsTransportMessaging
File store
Queue
Amazon Kinesis
Firehose
Amazon Kinesis
Streams
Amazon DynamoDB
Streams
Stream
storage
Cache,
Database,
Search
Database Anti-pattern
Myriad of Applications
One GREAT BIG Database
Myriad of Applications
Best Practice - Use the Right Tool for the job
Data Tier
Search
Amazon Elasticsearch Service
Amazon CloudSearch
Cache
Amazon ElastiCache
Redis
Memcached
SQL
Amazon Aurora
Amazon RDS
Amazon Redshift
NoSQL
Amazon DynamoDB
Cassandra
HBase
MongoDB
Database Tier
COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENT
S
FILES
Messaging
Message MESSAGES
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
Stream
Amazon SQS
Message
Amazon S3
File
LoggingIoTApplicationsTransportMessaging
Amazon Kinesis
Firehose
Amazon Kinesis
Streams
Amazon DynamoDB
Streams
Queue
Stream
storage
Amazon Elasticsearch
Service
Amazon DynamoDB
Amazon ElastiCache
Amazon RDS
File store
Why is Amazon S3 Good for Big Data?
• Virtually unlimited storage, unlimited object count
• Very high bandwidth – no aggregate throughput limit
• Highly available – can tolerate AZ failure
• Designed for 99.999999999% durability
• Tired-storage (Standard, IA, Amazon Glacier) via life-cycle
policies
• Native support for Versioning
• Secure – SSL, client/server-side encryption at rest
• Low cost
Why is Amazon S3 Good for Big Data?
• Natively supported by big data frameworks (Spark, Hive,
Presto, etc.)
• No need to run compute clusters for storage (unlike HDFS)
• No need to pay for data replication
• Can run transient Hadoop clusters & Amazon EC2 Spot
Instances
• Multiple & heterogenous analysis clusters can use the same
data
What about HDFS & Data Tiering (S3 IA & Amazon Glacier)?
• Use HDFS for very frequently accessed
(hot) data
• Use Amazon S3 Standard as system of
record for durability
• Use Amazon S3 Standard – IA for near line
archive data
• Use Amazon Glacier for cold/offline archive
data
Amazon Kinesis
Firehose
Amazon Kinesis
Streams
Amazon DynamoDB
Streams
Amazon SQS
Amazon SQS
Managed message queue
service
Amazon Kinesis Streams
Managed stream storage +
processing
Amazon Kinesis Firehose
Managed data delivery
Amazon DynamoDB
Streams
Managed NoSQL database
Tables can be stream-enabled
Message & Stream Data
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
IoT
COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
Applications
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENTS
FILES
LoggingTransport
Messaging
Message MESSAGES
Messaging
Queue
Stream
Amazon Elasticsearch
Service
Amazon DynamoDB
Amazon ElastiCache
Amazon RDS
Amazon S3
Queue
Stream
storage
Why Stream Storage?
Decouple producers & consumers
Persistent buffer
Collect multiple streams
Preserve client ordering
Streaming analysis
Parallel consumption
4 4 3 3 2 2 1 1
4 3 2 1
4 3 2 1
4 3 2 1
4 3 2 1
4 4 3 3 2 2 1 1
shard 1 / partition 1
shard 2 / partition 2
Consumer 1
Count of
red = 4
Count of
violet = 4
Consumer 2
Count of
blue = 4
Count of
green = 4
DynamoDB stream Amazon Kinesis stream
What Data Store Should I Use?
Data structure → Fixed schema, JSON, key-value
Access patterns → Store data in the format you will access it
Data / access characteristics → Hot, warm, cold
Cost → Right cost
Data Structure and Access Patterns
Access Patterns What to use?
Put/Get (key, value) Cache, NoSQL
Simple relationships → 1:N, M:N NoSQL
Cross table joins, transaction, SQL SQL
Faceting, natural language, geo Search
Data Structure What to use?
Fixed schema SQL, NoSQL
Schema-free (JSON) NoSQL, Search
(Key, value) Cache, NoSQL
What Is the Temperature of Your Data / Access ?
Hot Warm Cold
Volume MB–GB GB–TB PB
Item size B–KB KB–MB KB–TB
Latency ms ms, sec min, hrs
Durability Low – high High Very high
Request rate Very high High-Med Low
Cost/GB $$-$ $-¢¢ ¢
Hot data Warm data Cold data
Data / Access Characteristics: Hot, Warm, Cold
Amazon ElastiCache Amazon
DynamoDB
Amazon
RDS/Aurora
Amazon
Elasticsearch
Amazon S3 Amazon Glacier
Average
latency
ms ms ms, sec ms,sec ms,sec,min
(~ size)
hrs
Typical
data stored
GB GB–TBs
(no limit)
GB–TB
(64 TB max)
GB–TB MB–PB
(no limit)
GB–PB
(no limit)
Typical
item size
B-KB KB
(400 KB max)
KB
(64 KB max)
KB
(2 GB max)
KB-TB
(5 TB max)
GB
(40 TB max)
Request
Rate
High – very high Very high
(no limit)
High High Low – high
(no limit)
Very low
Storage cost
GB/month
$$ ¢¢ ¢¢ ¢¢ ¢ ¢/10
Durability Low - moderate Very high Very high High Very high Very high
Availability High
2 AZ
Very high
3 AZ
Very high
3 AZ
High
2 AZ
Very high
3 AZ
Very high
3 AZ
Hot data Warm data Cold data
What Data Store Should I Use?
Simplify Big Data Processing
COLLECT STORE
PROCESS/
ANALYZE
CONSUME
Time to answer (Latency)
Throughput
Cost
COLLECT STORE PROCESS / ANALYZE
Amazon Elasticsearch
Service
Amazon SQS
Amazon DynamoDB
Amazon S3
Amazon ElastiCache
Amazon RDS
HotWarm
SearchSQLNoSQLCacheFileMessage
Stream
Mobile apps
Web apps
Devices
Messaging
Message
Sensors &
IoT platforms
AWS IoT
Data centers
AWS Direct
Connect
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
RECORDS
DOCUMENT
S
FILES
MESSAGES
STREAMS
LoggingIoTApplicationsTransportMessaging
Amazon Kinesis
Firehose
Amazon Kinesis
Streams
Amazon DynamoDB
Streams
Hot
Tools and Frameworks
Amazon SQS apps
Streaming
Amazon Kinesis
Analytics
Amazon KCL
apps
AWS Lambda
Amazon Redshift
PROCESS / ANALYZE
Amazon Machine
Learning
Presto
Amazon
EMR
FastSlowFast
BatchMessageInteractiveStreamML
Amazon EC2
Amazon EC2
Batch - Minutes or hours on cold data
Daily/weekly/monthly reports
Interactive – Seconds on warm/cold data
Self-service dashboards
Messaging – Milliseconds or seconds on hot data
Message/eventbuffering
Streaming - Milliseconds or seconds on hot data
Billing/fraud alerts, 1 minute metrics
Tools and Frameworks
Batch
Amazon EMR (MapReduce, Hive, Pig, Spark)
Interactive
Amazon Redshift, Amazon EMR (Presto, Spark)
Messaging
Amazon SQS application on Amazon EC2
Streaming
Micro-batch: Spark Streaming, KCL
Real-time: Amazon Kinesis Analytics, Storm,
AWS Lambda, KCL
Amazon SQS apps
Streaming
Amazon Kinesis
Analytics
Amazon KCL
apps
AWS Lambda
Amazon Redshift
PROCESS / ANALYZE
Amazon Machine
Learning
Presto
Amazon
EMR
FastSlowFast
BatchMessageInteractiveStreamML
Amazon EC2
Amazon EC2
What Analytics Technology Should I Use?
Amazon
Redshift
Amazon EMR
Presto Spark Hive
Query
latency
Low Low Low High
Durability High High High High
Data volume 1.6 PB max ~Nodes ~Nodes ~Nodes
AWS
managed
Yes Yes Yes Yes
Storage Native HDFS / S3 HDFS / S3 HDFS / S3
SQL
compatibility
High High Low (SparkSQL) Medium (HQL)
Slow
What Streaming Analysis Technology Should I Use?
Spark
Streaming
Apache
Storm
Kinesis KCL
Application
AWS Lambda Amazon SQS
Apps
Scale ~ Nodes ~ Nodes ~ Nodes Automatic ~ Nodes
Micro-batch or
Real-time
Micro-batch Real-time Near-real-time Near-real-time Near-real-time
AWS managed
service
Yes (EMR) No (EC2) No (KCL + EC2
+ Auto Scaling)
Yes No (EC2 + Auto
Scaling)
Scalability No limits
~ nodes
No limits
~ nodes
No limits
~ nodes
No limits No limits
Availability Single AZ Configurable Multi-AZ Multi-AZ Multi-AZ
Programming
languages
Java, Python,
Scala
Any language
via Thrift
Java, via
MultiLang
Daemon (.NET,
Python, Ruby,
Node.js)
Node.js, Java,
Python
AWS SDK
languages
(Java, .NET,
Python, …)
Simplify Big Data Processing
COLLECT STORE
PROCESS/
ANALYZE
CONSUME
Time to answer (Latency)
Throughput
Cost
Amazon SQS apps
Streaming
Amazon Kinesis
Analytics
Amazon KCL
apps
AWS Lambda
Amazon Redshift
COLLECT STORE CONSUMEPROCESS / ANALYZE
Amazon Machine
Learning
Presto
Amazon
EMR
Amazon Elasticsearch
Service
Amazon SQS
Amazon DynamoDB
Amazon S3
Amazon ElastiCache
Amazon RDS
HotHotWarm
FastSlowFast
BatchMessageInteractiveStreamML
SearchSQLNoSQLCacheFileMessage
Stream
Amazon EC2
Amazon EC2
Mobile apps
Web apps
Devices
Messaging
Message
Sensors &
IoT platforms
AWS IoT
Data centers
AWS Direct
Connect
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
RECORDS
DOCUMENT
S
FILES
MESSAGES
STREAMS
LoggingIoTApplicationsTransportMessaging
ETL
Amazon Kinesis
Firehose
Amazon Kinesis
Streams
Amazon DynamoDB
Streams
STORE CONSUMEPROCESS / ANALYZE
Amazon QuickSight
Analysis&visualizationNotebooksIDE
Apps & Services
API
Applications & API
Reporting, Exploratory and
Ad-Hoc Analysis and Data
Visualization
Notebooks
Data Science
Business
users
Data scientist,
developers
COLLECT ETL
Design Patterns
Primitive: Batch Analysis & Enhancement
process
store
Amazon S3
Amazon
DynamoDB
Amazon EMR
Presto
Spark
SQL
Amazon
Redshift
Amazon
Quicksight
AWS
Marketplace
Primitive: Event Processing “Data Bus”
Multiple stages
Storage raises events to decouple processing stages
Store Process Store Process
Primitive: Pub/Sub
process
store
Amazon
Kinesis
AWS Lambda
Amazon Kinesis
S3
Connector
Apache Spark
Kinesis Client
Library
Combining Primitives: Event Processing Network
Combines data bus and pub/sub patterns to create
asynchronous, continuously executing transformations
Store Process Store Process
Process Store Process
Process
Process
Primitive: Messaging “Fan Out”
process
store
Amazon SQS AWS Lambda
Amazon
Kinesis
Real Time
Processing
Batch Analytics
process
store
Amazon S3
Amazon
Redshift
Amazon EMR
Presto
Hive
Pig
Spark
Real Time
StorageAmazon
ElastiCache
Amazon
DynamoDB
Amazon
RDS
Amazon
ES
Amazon
ML
KCL
AWS Lambda
Storm
Real Time &
Batch
Architecture
Spark Streaming
on Amazon EMR
Applications
Amazon
Kinesis
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Zoltan Arvai, Big Data Architect
July 7th, 2016
Leveraging S3 and EMR at Adbrain
About Adbrain
• Adbrain is a market leader in providing Probabilistic
Identity Solutions
• At the core of our business we build an identity graph
that represents connected devices on the planet
• We enable our clients to understand and target their
audience better
What does that mean?
• Ingesting large amounts of log data from
different data sources
• Build a graph using statistical and
machine learning algorithms
• Extract a relevant user graph for our
clients
Our main scenarios
• Production graph generation and servicing our clients
• Data Science team to run computations (Spark)
• R&D work with ML
• Data feed evaluation
• Graph analytics
• Data Engineering development workload
Our original setup
• HDFS Cluster on 18 nodes with most of our hot data
• Not leveraging co-location of data
• Spark clusters are separated
• Spark instance type optimized according to workload
• Custom cluster provisioning solution to run Spark jobs
• Built in-house
• Required maintenance
challenge
What are the issues?
• “Hadoop is a cheap Big Data solution on
commodity hardware” … define cheap for a
startup
• Scaling Hadoop means more focus on
infrastructure
• Data grows really quickly (adding a new country,
data feed, etc.)
What do we need?
• Better scale
• Cheaper solution
• Less focus on infrastructure
• Reduced risk
EMR and S3 to the rescue!
• S3 is significantly cheaper compared to HDFS
• We use infrequent access and Glacier storage as
well
• EMR is a fully managed solution
• Up to date Spark distributions
• Support of Spot instances
• Managed Hadoop when / if needed
Amazon
S3
Amazon
EMR
The implementation challenge
Some technical challenges
• Spark 1.3.1 did not work well with S3 and
Parquet files
• Writing to S3 using the default
ParquetOutputComitter is slow
• Calculating output summary metadata is slow
• S3 is not consistent by default (it’s not a file
system)
• S3 can be slower than HDFS
Solutions
• Upgrade to Spark 1.5.1+
• Using DirectParquetOutputCommitter with speculation
disabled
• parquet.enable.summary-metadata = false
• Enable S3 Consistent View in EMR!
• (Remember the DynamoDB tables!)
• In case of a multi-step Spark workflow, utilize HDFS on
the Spark Cluster
The end result
• Decreasing costs by 70%+
• Focusing on development instead of
commodities (hardware/versioning/patching)
• Eliminated risks
Thank you!
aws.amazon.com/big-data
Remember to complete
your evaluations!

More Related Content

What's hot

Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
Databricks
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
James Serra
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks Streaming
Databricks
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design Patterns
John Yeung
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
James Serra
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
James Serra
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies
Rodney Joyce
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra
 
Introduction to AWS Glue
Introduction to AWS Glue Introduction to AWS Glue
Introduction to AWS Glue
Amazon Web Services
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - Webinar
Amazon Web Services
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
Databricks
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
Databricks
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Cathrine Wilhelmsen
 
Azure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene PolonichkoAzure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene Polonichko
Dimko Zhluktenko
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
Big Data Use Cases and Solutions in the AWS Cloud
Big Data Use Cases and Solutions in the AWS CloudBig Data Use Cases and Solutions in the AWS Cloud
Big Data Use Cases and Solutions in the AWS Cloud
Amazon Web Services
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
Cobus Bernard
 

What's hot (20)

Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks Streaming
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design Patterns
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Introduction to AWS Glue
Introduction to AWS Glue Introduction to AWS Glue
Introduction to AWS Glue
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - Webinar
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
 
Azure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene PolonichkoAzure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene Polonichko
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Big Data Use Cases and Solutions in the AWS Cloud
Big Data Use Cases and Solutions in the AWS CloudBig Data Use Cases and Solutions in the AWS Cloud
Big Data Use Cases and Solutions in the AWS Cloud
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
 

Viewers also liked

Big Data Architectural Patterns
Big Data Architectural PatternsBig Data Architectural Patterns
Big Data Architectural Patterns
Amazon Web Services
 
(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS
Amazon Web Services
 
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Amazon Web Services
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Amazon Web Services
 
Getting started with Amazon Kinesis
Getting started with Amazon KinesisGetting started with Amazon Kinesis
Getting started with Amazon Kinesis
Amazon Web Services
 
(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive
Amazon Web Services
 
Exploitez le Big Data dans le cadre de votre stratégie MDM
Exploitez le Big Data dans le cadre de votre stratégie MDMExploitez le Big Data dans le cadre de votre stratégie MDM
Exploitez le Big Data dans le cadre de votre stratégie MDM
Jean-Michel Franco
 
The Impact of Messaging Standards on Event-Driven Architecture and IoT
The Impact of Messaging Standards on Event-Driven Architecture and IoTThe Impact of Messaging Standards on Event-Driven Architecture and IoT
The Impact of Messaging Standards on Event-Driven Architecture and IoT
Solace
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackReal-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stack
Anirvan Chakraborty
 
Demystifying salesforce for developers
Demystifying salesforce for developersDemystifying salesforce for developers
Demystifying salesforce for developers
Heitor Souza
 
Kafka Lambda architecture with mirroring
Kafka Lambda architecture with mirroringKafka Lambda architecture with mirroring
Kafka Lambda architecture with mirroringAnant Rustagi
 
Extreme Salesforce Data Volumes Webinar
Extreme Salesforce Data Volumes WebinarExtreme Salesforce Data Volumes Webinar
Extreme Salesforce Data Volumes Webinar
Salesforce Developers
 
How Apache Kafka is transforming Hadoop, Spark and Storm
How Apache Kafka is transforming Hadoop, Spark and StormHow Apache Kafka is transforming Hadoop, Spark and Storm
How Apache Kafka is transforming Hadoop, Spark and Storm
Edureka!
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Martin Zapletal
 
AWS re:Invent 2016: Lift and Evolve – Saving Money in the Cloud is Easy, Maki...
AWS re:Invent 2016: Lift and Evolve – Saving Money in the Cloud is Easy, Maki...AWS re:Invent 2016: Lift and Evolve – Saving Money in the Cloud is Easy, Maki...
AWS re:Invent 2016: Lift and Evolve – Saving Money in the Cloud is Easy, Maki...
Amazon Web Services
 
Machine Learning & Data Lake for IoT scenarios on AWS
Machine Learning & Data Lake for IoT scenarios on AWSMachine Learning & Data Lake for IoT scenarios on AWS
Machine Learning & Data Lake for IoT scenarios on AWS
Amazon Web Services
 
Handling of Large Data by Salesforce
Handling of Large Data by SalesforceHandling of Large Data by Salesforce
Handling of Large Data by Salesforce
Thinqloud
 
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...
Data Con LA
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Streamsets Inc.
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache Spark
Martin Zapletal
 

Viewers also liked (20)

Big Data Architectural Patterns
Big Data Architectural PatternsBig Data Architectural Patterns
Big Data Architectural Patterns
 
(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS
 
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
 
Getting started with Amazon Kinesis
Getting started with Amazon KinesisGetting started with Amazon Kinesis
Getting started with Amazon Kinesis
 
(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive
 
Exploitez le Big Data dans le cadre de votre stratégie MDM
Exploitez le Big Data dans le cadre de votre stratégie MDMExploitez le Big Data dans le cadre de votre stratégie MDM
Exploitez le Big Data dans le cadre de votre stratégie MDM
 
The Impact of Messaging Standards on Event-Driven Architecture and IoT
The Impact of Messaging Standards on Event-Driven Architecture and IoTThe Impact of Messaging Standards on Event-Driven Architecture and IoT
The Impact of Messaging Standards on Event-Driven Architecture and IoT
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackReal-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stack
 
Demystifying salesforce for developers
Demystifying salesforce for developersDemystifying salesforce for developers
Demystifying salesforce for developers
 
Kafka Lambda architecture with mirroring
Kafka Lambda architecture with mirroringKafka Lambda architecture with mirroring
Kafka Lambda architecture with mirroring
 
Extreme Salesforce Data Volumes Webinar
Extreme Salesforce Data Volumes WebinarExtreme Salesforce Data Volumes Webinar
Extreme Salesforce Data Volumes Webinar
 
How Apache Kafka is transforming Hadoop, Spark and Storm
How Apache Kafka is transforming Hadoop, Spark and StormHow Apache Kafka is transforming Hadoop, Spark and Storm
How Apache Kafka is transforming Hadoop, Spark and Storm
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
 
AWS re:Invent 2016: Lift and Evolve – Saving Money in the Cloud is Easy, Maki...
AWS re:Invent 2016: Lift and Evolve – Saving Money in the Cloud is Easy, Maki...AWS re:Invent 2016: Lift and Evolve – Saving Money in the Cloud is Easy, Maki...
AWS re:Invent 2016: Lift and Evolve – Saving Money in the Cloud is Easy, Maki...
 
Machine Learning & Data Lake for IoT scenarios on AWS
Machine Learning & Data Lake for IoT scenarios on AWSMachine Learning & Data Lake for IoT scenarios on AWS
Machine Learning & Data Lake for IoT scenarios on AWS
 
Handling of Large Data by Salesforce
Handling of Large Data by SalesforceHandling of Large Data by Salesforce
Handling of Large Data by Salesforce
 
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache Spark
 

Similar to Big Data Architectural Patterns and Best Practices on AWS

Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
Amazon Web Services
 
Big Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best PracticesBig Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best Practices
Amazon Web Services
 
Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017
Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017
Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017
Amazon Web Services
 
Building a Data Processing Pipeline on AWS - AWS Summit SG 2017
Building a Data Processing Pipeline on AWS - AWS Summit SG 2017Building a Data Processing Pipeline on AWS - AWS Summit SG 2017
Building a Data Processing Pipeline on AWS - AWS Summit SG 2017
Amazon Web Services
 
Database and Analytics on the AWS Cloud - AWS Innovate Toronto
Database and Analytics on the AWS Cloud - AWS Innovate TorontoDatabase and Analytics on the AWS Cloud - AWS Innovate Toronto
Database and Analytics on the AWS Cloud - AWS Innovate Toronto
Amazon Web Services
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
Amazon Web Services
 
Building a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWSBuilding a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWS
Amazon Web Services
 
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
Amazon Web Services
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Amazon Web Services
 
Deep Dive in Big Data
Deep Dive in Big DataDeep Dive in Big Data
Deep Dive in Big Data
Amazon Web Services
 
ABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWSABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWS
Amazon Web Services
 
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...
Amazon Web Services
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
Amazon Web Services
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Amazon Web Services
 
Deep Dive on Big Data
Deep Dive on Big Data Deep Dive on Big Data
Deep Dive on Big Data
Amazon Web Services
 
Simplify Big Data with AWS
Simplify Big Data with AWSSimplify Big Data with AWS
Simplify Big Data with AWS
Julien SIMON
 
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Amazon Web Services
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Amazon Web Services
 
AWS re:Invent 2016: Building Big Data Applications with the AWS Big Data Plat...
AWS re:Invent 2016: Building Big Data Applications with the AWS Big Data Plat...AWS re:Invent 2016: Building Big Data Applications with the AWS Big Data Plat...
AWS re:Invent 2016: Building Big Data Applications with the AWS Big Data Plat...
Amazon Web Services
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
Amazon Web Services
 

Similar to Big Data Architectural Patterns and Best Practices on AWS (20)

Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Big Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best PracticesBig Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best Practices
 
Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017
Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017
Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017
 
Building a Data Processing Pipeline on AWS - AWS Summit SG 2017
Building a Data Processing Pipeline on AWS - AWS Summit SG 2017Building a Data Processing Pipeline on AWS - AWS Summit SG 2017
Building a Data Processing Pipeline on AWS - AWS Summit SG 2017
 
Database and Analytics on the AWS Cloud - AWS Innovate Toronto
Database and Analytics on the AWS Cloud - AWS Innovate TorontoDatabase and Analytics on the AWS Cloud - AWS Innovate Toronto
Database and Analytics on the AWS Cloud - AWS Innovate Toronto
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Building a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWSBuilding a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWS
 
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
 
Deep Dive in Big Data
Deep Dive in Big DataDeep Dive in Big Data
Deep Dive in Big Data
 
ABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWSABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWS
 
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
 
Deep Dive on Big Data
Deep Dive on Big Data Deep Dive on Big Data
Deep Dive on Big Data
 
Simplify Big Data with AWS
Simplify Big Data with AWSSimplify Big Data with AWS
Simplify Big Data with AWS
 
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
 
AWS re:Invent 2016: Building Big Data Applications with the AWS Big Data Plat...
AWS re:Invent 2016: Building Big Data Applications with the AWS Big Data Plat...AWS re:Invent 2016: Building Big Data Applications with the AWS Big Data Plat...
AWS re:Invent 2016: Building Big Data Applications with the AWS Big Data Plat...
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
Amazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
Amazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
Amazon Web Services
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Amazon Web Services
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
Amazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
Amazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Amazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
Amazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Amazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
Amazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 

Recently uploaded (20)

Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 

Big Data Architectural Patterns and Best Practices on AWS

  • 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Ian Meyers, Principal Solutions Architect July 7th , 2016 Big Data Architectural Patterns and Best Practices on AWS
  • 3. Powerful Services, Open Ecosystem Amazon Glacier S3 DynamoDB RDS EMR Amazon Redshift Data Pipeline Amazon Kinesis Kinesis-enabled app Lambda ML SQS ElastiCache DynamoDB Streams Amazon Elasticsearch Service Amazon Web Services Open Source & Commercial Ecosystem
  • 4. Separation of compute & storage • Event driven, loose coupling, independent scaling & sizing “Data Sympathy” - use the right tool for the job Data structure, latency, throughput, access patterns Managed services, open technologies Scalable/elastic, available, reliable, secure, no/low admin Focus on the data, and your business, not undifferentiated tasks Architectural Principles
  • 5. Big Data != Big Cost Aim to reduce the cost of experimentation to near 0
  • 6. Simplify Big Data Processing COLLECT STORE PROCESS/ ANALYZE CONSUME Time to answer (Latency) Throughput Cost
  • 7. Devices Sensors & IoT platforms AWS IoT STREAMS IoTApplications AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail DOCUMENTS FILES LoggingTransport Messaging Message MESSAGES Messaging COLLECT Mobile apps Web apps Data centers AWS Direct Connect RECORDS AWS Mobile SDK AWS IoT Amazon CloudWatchLogs Agent Amazon CognitoSync AWS Mobile Analytics Easily Capture Mobile Data through Sync and Server Logging File based integration to Streaming Storage API Integration via SDK’s and Ecosystem Tools Amazon Kinesis Producer Library Amazon Kinesis Agent
  • 8. Principles for Collection Architecture • Should be easy to install, manage, and understand • Meet requirements for data durability & recovery • Ensure ability to meet latency & performance requirements • Data handling should be sympathetic to the type of data
  • 9. Simplify Big Data Processing COLLECT STORE PROCESS/ ANALYZE CONSUME Time to answer (Latency) Throughput Cost
  • 10. Types of Data Devices Sensors & IoT platforms AWS IoT STREAMS IoTApplications AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail DOCUMENTS FILES LoggingTransport Messaging Message MESSAGES Messaging COLLECT Mobile apps Web apps Data centers AWS Direct Connect RECORDS Database STORE Search File store Queue Stream storage Cache Heavily read, application defined structure Consistently read, relational or document data Frequently read, often requiring natural language processing or geospatial access Data received as files, or where object storage is appropriate Data intended for a single recipient, with low latency Data intended for many recipients, with low latency
  • 11. COLLECT STORE Mobile apps Web apps Data centers AWS Direct Connect RECORDS AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail DOCUMENT S FILES Messaging Message MESSAGES Devices Sensors & IoT platforms AWS IoT STREAMS Stream Amazon SQS Message Amazon Elasticsearch Service Amazon DynamoDB Amazon S3 Amazon ElastiCache Amazon RDS SearchSQLNoSQLCacheFile LoggingIoTApplicationsTransportMessaging File store Queue Amazon Kinesis Firehose Amazon Kinesis Streams Amazon DynamoDB Streams Stream storage Cache, Database, Search
  • 12. Database Anti-pattern Myriad of Applications One GREAT BIG Database
  • 13. Myriad of Applications Best Practice - Use the Right Tool for the job Data Tier Search Amazon Elasticsearch Service Amazon CloudSearch Cache Amazon ElastiCache Redis Memcached SQL Amazon Aurora Amazon RDS Amazon Redshift NoSQL Amazon DynamoDB Cassandra HBase MongoDB Database Tier
  • 14. COLLECT STORE Mobile apps Web apps Data centers AWS Direct Connect RECORDS AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail DOCUMENT S FILES Messaging Message MESSAGES Devices Sensors & IoT platforms AWS IoT STREAMS Stream Amazon SQS Message Amazon S3 File LoggingIoTApplicationsTransportMessaging Amazon Kinesis Firehose Amazon Kinesis Streams Amazon DynamoDB Streams Queue Stream storage Amazon Elasticsearch Service Amazon DynamoDB Amazon ElastiCache Amazon RDS File store
  • 15. Why is Amazon S3 Good for Big Data? • Virtually unlimited storage, unlimited object count • Very high bandwidth – no aggregate throughput limit • Highly available – can tolerate AZ failure • Designed for 99.999999999% durability • Tired-storage (Standard, IA, Amazon Glacier) via life-cycle policies • Native support for Versioning • Secure – SSL, client/server-side encryption at rest • Low cost
  • 16. Why is Amazon S3 Good for Big Data? • Natively supported by big data frameworks (Spark, Hive, Presto, etc.) • No need to run compute clusters for storage (unlike HDFS) • No need to pay for data replication • Can run transient Hadoop clusters & Amazon EC2 Spot Instances • Multiple & heterogenous analysis clusters can use the same data
  • 17. What about HDFS & Data Tiering (S3 IA & Amazon Glacier)? • Use HDFS for very frequently accessed (hot) data • Use Amazon S3 Standard as system of record for durability • Use Amazon S3 Standard – IA for near line archive data • Use Amazon Glacier for cold/offline archive data
  • 18. Amazon Kinesis Firehose Amazon Kinesis Streams Amazon DynamoDB Streams Amazon SQS Amazon SQS Managed message queue service Amazon Kinesis Streams Managed stream storage + processing Amazon Kinesis Firehose Managed data delivery Amazon DynamoDB Streams Managed NoSQL database Tables can be stream-enabled Message & Stream Data Devices Sensors & IoT platforms AWS IoT STREAMS IoT COLLECT STORE Mobile apps Web apps Data centers AWS Direct Connect RECORDS Applications AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail DOCUMENTS FILES LoggingTransport Messaging Message MESSAGES Messaging Queue Stream Amazon Elasticsearch Service Amazon DynamoDB Amazon ElastiCache Amazon RDS Amazon S3 Queue Stream storage
  • 19. Why Stream Storage? Decouple producers & consumers Persistent buffer Collect multiple streams Preserve client ordering Streaming analysis Parallel consumption 4 4 3 3 2 2 1 1 4 3 2 1 4 3 2 1 4 3 2 1 4 3 2 1 4 4 3 3 2 2 1 1 shard 1 / partition 1 shard 2 / partition 2 Consumer 1 Count of red = 4 Count of violet = 4 Consumer 2 Count of blue = 4 Count of green = 4 DynamoDB stream Amazon Kinesis stream
  • 20. What Data Store Should I Use? Data structure → Fixed schema, JSON, key-value Access patterns → Store data in the format you will access it Data / access characteristics → Hot, warm, cold Cost → Right cost
  • 21. Data Structure and Access Patterns Access Patterns What to use? Put/Get (key, value) Cache, NoSQL Simple relationships → 1:N, M:N NoSQL Cross table joins, transaction, SQL SQL Faceting, natural language, geo Search Data Structure What to use? Fixed schema SQL, NoSQL Schema-free (JSON) NoSQL, Search (Key, value) Cache, NoSQL
  • 22. What Is the Temperature of Your Data / Access ?
  • 23. Hot Warm Cold Volume MB–GB GB–TB PB Item size B–KB KB–MB KB–TB Latency ms ms, sec min, hrs Durability Low – high High Very high Request rate Very high High-Med Low Cost/GB $$-$ $-¢¢ ¢ Hot data Warm data Cold data Data / Access Characteristics: Hot, Warm, Cold
  • 24. Amazon ElastiCache Amazon DynamoDB Amazon RDS/Aurora Amazon Elasticsearch Amazon S3 Amazon Glacier Average latency ms ms ms, sec ms,sec ms,sec,min (~ size) hrs Typical data stored GB GB–TBs (no limit) GB–TB (64 TB max) GB–TB MB–PB (no limit) GB–PB (no limit) Typical item size B-KB KB (400 KB max) KB (64 KB max) KB (2 GB max) KB-TB (5 TB max) GB (40 TB max) Request Rate High – very high Very high (no limit) High High Low – high (no limit) Very low Storage cost GB/month $$ ¢¢ ¢¢ ¢¢ ¢ ¢/10 Durability Low - moderate Very high Very high High Very high Very high Availability High 2 AZ Very high 3 AZ Very high 3 AZ High 2 AZ Very high 3 AZ Very high 3 AZ Hot data Warm data Cold data What Data Store Should I Use?
  • 25. Simplify Big Data Processing COLLECT STORE PROCESS/ ANALYZE CONSUME Time to answer (Latency) Throughput Cost
  • 26. COLLECT STORE PROCESS / ANALYZE Amazon Elasticsearch Service Amazon SQS Amazon DynamoDB Amazon S3 Amazon ElastiCache Amazon RDS HotWarm SearchSQLNoSQLCacheFileMessage Stream Mobile apps Web apps Devices Messaging Message Sensors & IoT platforms AWS IoT Data centers AWS Direct Connect AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail RECORDS DOCUMENT S FILES MESSAGES STREAMS LoggingIoTApplicationsTransportMessaging Amazon Kinesis Firehose Amazon Kinesis Streams Amazon DynamoDB Streams Hot
  • 27. Tools and Frameworks Amazon SQS apps Streaming Amazon Kinesis Analytics Amazon KCL apps AWS Lambda Amazon Redshift PROCESS / ANALYZE Amazon Machine Learning Presto Amazon EMR FastSlowFast BatchMessageInteractiveStreamML Amazon EC2 Amazon EC2 Batch - Minutes or hours on cold data Daily/weekly/monthly reports Interactive – Seconds on warm/cold data Self-service dashboards Messaging – Milliseconds or seconds on hot data Message/eventbuffering Streaming - Milliseconds or seconds on hot data Billing/fraud alerts, 1 minute metrics
  • 28. Tools and Frameworks Batch Amazon EMR (MapReduce, Hive, Pig, Spark) Interactive Amazon Redshift, Amazon EMR (Presto, Spark) Messaging Amazon SQS application on Amazon EC2 Streaming Micro-batch: Spark Streaming, KCL Real-time: Amazon Kinesis Analytics, Storm, AWS Lambda, KCL Amazon SQS apps Streaming Amazon Kinesis Analytics Amazon KCL apps AWS Lambda Amazon Redshift PROCESS / ANALYZE Amazon Machine Learning Presto Amazon EMR FastSlowFast BatchMessageInteractiveStreamML Amazon EC2 Amazon EC2
  • 29. What Analytics Technology Should I Use? Amazon Redshift Amazon EMR Presto Spark Hive Query latency Low Low Low High Durability High High High High Data volume 1.6 PB max ~Nodes ~Nodes ~Nodes AWS managed Yes Yes Yes Yes Storage Native HDFS / S3 HDFS / S3 HDFS / S3 SQL compatibility High High Low (SparkSQL) Medium (HQL) Slow
  • 30. What Streaming Analysis Technology Should I Use? Spark Streaming Apache Storm Kinesis KCL Application AWS Lambda Amazon SQS Apps Scale ~ Nodes ~ Nodes ~ Nodes Automatic ~ Nodes Micro-batch or Real-time Micro-batch Real-time Near-real-time Near-real-time Near-real-time AWS managed service Yes (EMR) No (EC2) No (KCL + EC2 + Auto Scaling) Yes No (EC2 + Auto Scaling) Scalability No limits ~ nodes No limits ~ nodes No limits ~ nodes No limits No limits Availability Single AZ Configurable Multi-AZ Multi-AZ Multi-AZ Programming languages Java, Python, Scala Any language via Thrift Java, via MultiLang Daemon (.NET, Python, Ruby, Node.js) Node.js, Java, Python AWS SDK languages (Java, .NET, Python, …)
  • 31. Simplify Big Data Processing COLLECT STORE PROCESS/ ANALYZE CONSUME Time to answer (Latency) Throughput Cost
  • 32. Amazon SQS apps Streaming Amazon Kinesis Analytics Amazon KCL apps AWS Lambda Amazon Redshift COLLECT STORE CONSUMEPROCESS / ANALYZE Amazon Machine Learning Presto Amazon EMR Amazon Elasticsearch Service Amazon SQS Amazon DynamoDB Amazon S3 Amazon ElastiCache Amazon RDS HotHotWarm FastSlowFast BatchMessageInteractiveStreamML SearchSQLNoSQLCacheFileMessage Stream Amazon EC2 Amazon EC2 Mobile apps Web apps Devices Messaging Message Sensors & IoT platforms AWS IoT Data centers AWS Direct Connect AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail RECORDS DOCUMENT S FILES MESSAGES STREAMS LoggingIoTApplicationsTransportMessaging ETL Amazon Kinesis Firehose Amazon Kinesis Streams Amazon DynamoDB Streams
  • 33. STORE CONSUMEPROCESS / ANALYZE Amazon QuickSight Analysis&visualizationNotebooksIDE Apps & Services API Applications & API Reporting, Exploratory and Ad-Hoc Analysis and Data Visualization Notebooks Data Science Business users Data scientist, developers COLLECT ETL
  • 34.
  • 36. Primitive: Batch Analysis & Enhancement process store Amazon S3 Amazon DynamoDB Amazon EMR Presto Spark SQL Amazon Redshift Amazon Quicksight AWS Marketplace
  • 37. Primitive: Event Processing “Data Bus” Multiple stages Storage raises events to decouple processing stages Store Process Store Process
  • 38. Primitive: Pub/Sub process store Amazon Kinesis AWS Lambda Amazon Kinesis S3 Connector Apache Spark Kinesis Client Library
  • 39. Combining Primitives: Event Processing Network Combines data bus and pub/sub patterns to create asynchronous, continuously executing transformations Store Process Store Process Process Store Process Process Process
  • 40. Primitive: Messaging “Fan Out” process store Amazon SQS AWS Lambda Amazon Kinesis
  • 41. Real Time Processing Batch Analytics process store Amazon S3 Amazon Redshift Amazon EMR Presto Hive Pig Spark Real Time StorageAmazon ElastiCache Amazon DynamoDB Amazon RDS Amazon ES Amazon ML KCL AWS Lambda Storm Real Time & Batch Architecture Spark Streaming on Amazon EMR Applications Amazon Kinesis
  • 42. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Zoltan Arvai, Big Data Architect July 7th, 2016 Leveraging S3 and EMR at Adbrain
  • 43. About Adbrain • Adbrain is a market leader in providing Probabilistic Identity Solutions • At the core of our business we build an identity graph that represents connected devices on the planet • We enable our clients to understand and target their audience better
  • 44. What does that mean? • Ingesting large amounts of log data from different data sources • Build a graph using statistical and machine learning algorithms • Extract a relevant user graph for our clients
  • 45. Our main scenarios • Production graph generation and servicing our clients • Data Science team to run computations (Spark) • R&D work with ML • Data feed evaluation • Graph analytics • Data Engineering development workload
  • 46. Our original setup • HDFS Cluster on 18 nodes with most of our hot data • Not leveraging co-location of data • Spark clusters are separated • Spark instance type optimized according to workload • Custom cluster provisioning solution to run Spark jobs • Built in-house • Required maintenance
  • 48. What are the issues? • “Hadoop is a cheap Big Data solution on commodity hardware” … define cheap for a startup • Scaling Hadoop means more focus on infrastructure • Data grows really quickly (adding a new country, data feed, etc.)
  • 49. What do we need? • Better scale • Cheaper solution • Less focus on infrastructure • Reduced risk
  • 50. EMR and S3 to the rescue! • S3 is significantly cheaper compared to HDFS • We use infrequent access and Glacier storage as well • EMR is a fully managed solution • Up to date Spark distributions • Support of Spot instances • Managed Hadoop when / if needed Amazon S3 Amazon EMR
  • 52. Some technical challenges • Spark 1.3.1 did not work well with S3 and Parquet files • Writing to S3 using the default ParquetOutputComitter is slow • Calculating output summary metadata is slow • S3 is not consistent by default (it’s not a file system) • S3 can be slower than HDFS
  • 53. Solutions • Upgrade to Spark 1.5.1+ • Using DirectParquetOutputCommitter with speculation disabled • parquet.enable.summary-metadata = false • Enable S3 Consistent View in EMR! • (Remember the DynamoDB tables!) • In case of a multi-step Spark workflow, utilize HDFS on the Spark Cluster
  • 54. The end result • Decreasing costs by 70%+ • Focusing on development instead of commodities (hardware/versioning/patching) • Eliminated risks