© 2020, Amazon Web Services, Inc. or its Affiliates.
Big Data for Startups
How to build Big Data applications in Serverless mode
Fausto Palma
AWS Solution Architect
Use case for Big Data tools
Fits in standard DBs
Structured data
No excessive load spikes (CPU over time)
Volume: large amount of data not fitting resources (disk space, RAM or CPU)
Variety: different data formats (tabular, nested, images, video)
Velocity: streaming, real-time analysis
Use case for Big Data tools
Data lake
Open formats
Central catalog
Data collected when available, even in raw format
Recommendation systems
Text mining
Supply chain flow optimization
Social network analysis
Anomaly detection
Sentiment analysis
Customer churn prevention
…
Analytics overall architecture (Data Lake)
Data movement
Storage Analytics Data value
Catalog
Management | Security
AWS services
Management | Security: AWS Lake Formation, AWS Key Management Service, AWS Identity & Access Management, Amazon Macie, …
Data movement: Managed Streaming for Apache Kafka, Amazon Kinesis Video Streams, Kinesis Data Streams, Kinesis Data Firehose
Storage: S3, Glacier
Catalog: AWS Glue data catalog
Analytics: Redshift, EMR (Spark & Hadoop), Athena, Elasticsearch Service, Kinesis Data Analytics, AWS Glue (Spark & Python)
Data value: QuickSight, SageMaker, Comprehend, Rekognition, Translate, Pinpoint, …
General pattern to scale
Inputs: Data, Messages, Streams, …
Mapping / Sharding
Parallel Processing
Shuffling
Reduction / Aggregation
Outputs
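The pattern above can be sketched in a few lines of plain Python. This is only an illustration on a single machine: the word-count task, the function names and the shard and reducer counts are assumptions, not part of the original slides.

```python
from collections import defaultdict


def map_phase(record):
    """Mapping: turn one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(mapped_shards, num_reducers):
    """Shuffling: route all pairs with the same key to the same reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for shard in mapped_shards:
        for key, value in shard:
            # Sum of code points: a deterministic stand-in for a real hash.
            buckets[sum(map(ord, key)) % num_reducers][key].append(value)
    return buckets


def reduce_phase(bucket):
    """Reduction / aggregation: combine the values collected for each key."""
    return {key: sum(values) for key, values in bucket.items()}


def word_count(records, num_shards=3, num_reducers=2):
    # Sharding: split the input into independent chunks.
    shards = [records[i::num_shards] for i in range(num_shards)]
    # Each shard could be mapped in parallel; here they run one after another.
    mapped = [[pair for rec in shard for pair in map_phase(rec)] for shard in shards]
    result = {}
    for bucket in shuffle(mapped, num_reducers):
        result.update(reduce_phase(bucket))
    return result


counts = word_count(["big data", "data lake", "big data tools"])
print(counts)
```

Because every key lands in exactly one reducer bucket, the per-bucket results can be merged without conflicts.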
Most secure infrastructure: certifications
Global
CSA: Cloud Security Alliance Controls
ISO 9001: Global Quality Standard
ISO 27001: Security Management Controls
ISO 27017: Cloud Specific Controls
ISO 27018: Personal Data Protection
PCI DSS Level 1: Payment Card Standards
SOC 1: Audit Controls Report
SOC 2: Security, Availability, & Confidentiality Report
SOC 3: General Controls Report
United States
CJIS: Criminal Justice Information Services
DoD SRG: DoD Data Processing
FedRAMP: Government Data Standards
FERPA: Educational Privacy Act
FIPS: Government Security Standards
FISMA: Federal Information Security Management
GxP: Quality Guidelines and Regulations
FFIEC: Financial Institutions Regulation
HIPAA: Protected Health Information
ITAR: International Arms Regulations
MPAA: Protected Media Content
NIST: National Institute of Standards and Technology
SEC Rule 17a-4(f): Financial Data Standards
VPAT/Section 508: Accountability Standards
Asia Pacific
FISC [Japan]: Financial Industry Information Systems
IRAP [Australia]: Australian Security Standards
K-ISMS [Korea]: Korean Information Security
MTCS Tier 3 [Singapore]: Multi-Tier Cloud Security Standard
My Number Act [Japan]: Personal Information Protection
Europe
C5 [Germany]: Operational Security Attestation
Cyber Essentials Plus [UK]: Cyber Threat Protection
G-Cloud [UK]: UK Government Standards
IT-Grundschutz [Germany]: Baseline Protection Methodology
https://aws.amazon.com/compliance/programs/
Amazon Simple Storage Service “S3”
§ Built to store any amount of data
§ Runs on the world’s largest global cloud infrastructure
§ Designed to deliver 99.999999999% durability
§ Geographic redundancy & automatic replication
§ Tiered storage to optimize price/performance
[Figure: S3 spanning three Availability Zones (AZ), with Transit links]
Process Data in Place
Amazon S3
Amazon Athena
Amazon Redshift Spectrum
Amazon SageMaker
AWS Glue
Amazon S3 Select
Input
Format: delimited text (CSV, TSV), JSON, Parquet …
Compression: GZIP, BZIP2 …
SQL
Clauses: Select, From, Where
Data types: String, Integer, Float, Decimal, Timestamp, Boolean
Operators: Conditional, Math, Logical, String (Like, ||)
Functions: String, Cast, Math, Aggregate
Output
Format: delimited text (CSV, TSV), JSON …
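As a minimal sketch of how the input format, compression, SQL and output format fit together, the snippet below builds the parameter set that boto3's `select_object_content` call accepts. The bucket, key and column names are hypothetical, and the request is only assembled here, not sent.

```python
def build_s3_select_request(bucket, key, sql):
    """Assemble the keyword arguments for an S3 Select call."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        # Input: gzip-compressed CSV whose first line holds the column names.
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": "GZIP",
        },
        # Output: one JSON document per matching row.
        "OutputSerialization": {"JSON": {}},
    }


request = build_s3_select_request(
    bucket="example-bucket",             # hypothetical bucket name
    key="sensors/readings.csv.gz",       # hypothetical object key
    sql="SELECT s.sensorId, s.status FROM S3Object s WHERE s.status = 'FAIL'",
)
print(request["Expression"])
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("s3").select_object_content(**request)`.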
S3 – how to access
AWS S3 console: https://s3.console.aws.amazon.com/s3/
AWS S3 API documentation: https://docs.aws.amazon.com/AmazonS3/latest/API/API_Operations.html
AWS S3 CLI: https://docs.aws.amazon.com/cli/latest/reference/s3/#available-commands
Kinesis Data Firehose — How it Works
Ingest: AWS IoT, Amazon Kinesis Agent, Amazon Kinesis Streams, Amazon CloudWatch Logs, Amazon CloudWatch Events, Managed Streams for Kafka
Transform: Lambda function
Deliver: Amazon S3, Amazon Redshift, Amazon Elasticsearch Service
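The Transform step can be illustrated with a small Lambda handler. Firehose hands the function a batch of base64-encoded records and expects each one back with its `recordId`, a `result` status and the re-encoded `data`. The payload fields and the Fahrenheit conversion below are assumptions chosen for illustration only.

```python
import base64
import json


def lambda_handler(event, context):
    """Decode each record, enrich it, and return it in the shape Firehose expects."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Hypothetical transformation: add the temperature in Fahrenheit.
        payload["temperatureF"] = payload["currentTemperature"] * 9 / 5 + 32
        # Newline-delimited JSON keeps delivered objects easy to query.
        data = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(data).decode("utf-8"),
        })
    return {"records": output}


# Local dry run with a hand-built event (no AWS account needed).
sample = {"sensorId": 7, "currentTemperature": 20, "status": "OK"}
sample_event = {
    "records": [
        {"recordId": "1", "data": base64.b64encode(json.dumps(sample).encode()).decode()}
    ]
}
response = lambda_handler(sample_event, None)
transformed = json.loads(base64.b64decode(response["records"][0]["data"]))
print(transformed)
```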
Kinesis Data Firehose – how to access
Kinesis Data Firehose console: https://eu-west-1.console.aws.amazon.com/kinesis/
Kinesis Data Firehose API documentation: https://docs.aws.amazon.com/firehose/latest/APIReference/API_Operations.html
Kinesis Data Firehose CLI: https://docs.aws.amazon.com/cli/latest/reference/firehose/index.html#available-commands
Simple demo
App → Amazon Kinesis Data Firehose (Data movement) → Amazon Simple Storage Service (S3) (Storage)
Amazon Kinesis Data Generator (KDG)
https://awslabs.github.io/amazon-kinesis-data-generator/web/help.html
{
"sensorId": {{random.number(50)}},
"currentTemperature": {{random.number(
{
"min":15,
"max":38
}
)}},
"status":
"{{random.weightedArrayElement(
{
"weights": [0.9,0.03,0.07],
"data": ["OK","FAIL","WARN"]
}
)}}"
}
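For readers who want to reproduce similar test data without the KDG web page, the template above can be approximated in plain Python. This sketch assumes the template's number ranges are inclusive and starts `sensorId` at 0; the seed and record count are arbitrary.

```python
import json
import random


def make_record(rng):
    """Build one sensor reading shaped like the KDG template."""
    return {
        "sensorId": rng.randint(0, 50),
        "currentTemperature": rng.randint(15, 38),
        # Weighted pick: mostly OK, occasionally FAIL or WARN.
        "status": rng.choices(["OK", "FAIL", "WARN"], weights=[0.9, 0.03, 0.07])[0],
    }


rng = random.Random(42)  # fixed seed so runs are repeatable
records = [make_record(rng) for _ in range(1000)]
print(json.dumps(records[0]))
```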
Glue Catalog and Crawlers
Data Lake: S3
AWS Glue Crawler
AWS Glue Data Catalog (Hive metastore service)
EMR
Athena
AWS Glue Jobs
Glue Catalog console
Glue Crawlers console
Athena console
https://eu-west-1.console.aws.amazon.com/athena/
Select catalog
Select database
Write Query
S3
Presto architecture
Client
Coordinator: Parsing, Planning, Scheduling
Metastore
Workers: 4 Worker nodes
Connectors
Data locations
SELECT
sport,
count(distinct location) as locations,
count(distinct event_id) as events,
count(*) as tickets,
avg(ticket_price) as avg_ticket_price
FROM sporting_event_ticket_info
GROUP BY 1
ORDER BY 1;
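To see what this query returns, it can be tried locally against a tiny in-memory table. The sketch below uses SQLite rather than Athena or Presto, names the grouping column explicitly, and invents four sample rows. The column types are assumptions, since the slides do not show the table definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sporting_event_ticket_info "
    "(sport TEXT, location TEXT, event_id INTEGER, ticket_price REAL)"
)
# Invented sample rows, for illustration only.
conn.executemany(
    "INSERT INTO sporting_event_ticket_info VALUES (?, ?, ?, ?)",
    [
        ("baseball", "Boston", 1, 40.0),
        ("baseball", "Boston", 1, 60.0),
        ("baseball", "Chicago", 2, 50.0),
        ("football", "Dallas", 3, 100.0),
    ],
)

query = """
SELECT sport,
       count(distinct location) AS locations,
       count(distinct event_id) AS events,
       count(*) AS tickets,
       avg(ticket_price) AS avg_ticket_price
FROM sporting_event_ticket_info
GROUP BY sport
ORDER BY sport
"""
summary = conn.execute(query).fetchall()
for row in summary:
    print(row)
```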
Row vs Columnar file orientation
Tabular data
File in storage or streaming
Nested data
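The difference between the two orientations can be shown with ordinary Python structures. This is a conceptual sketch only: real formats add encoding, compression and metadata, and the sample records are invented.

```python
rows = [
    {"sensorId": 1, "status": "OK", "temperature": 21},
    {"sensorId": 2, "status": "FAIL", "temperature": 35},
    {"sensorId": 3, "status": "OK", "temperature": 19},
]


def to_row_layout(records):
    """Row orientation: all fields of record 1, then all fields of record 2, ..."""
    return [value for record in records for value in record.values()]


def to_column_layout(records):
    """Column orientation: all values of field 1, then all values of field 2, ..."""
    return {name: [record[name] for record in records] for name in records[0]}


row_file = to_row_layout(rows)
col_file = to_column_layout(rows)

# An aggregate over one field only needs to read that column.
avg_temp = sum(col_file["temperature"]) / len(col_file["temperature"])
print(row_file)
print(col_file)
print(avg_temp)
```

In the row layout the same aggregate would have to skip over every `sensorId` and `status` value, which is why analytical queries tend to favour columnar files.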
Different file formats
Avro ORC (Optimized Row Columnar) Parquet
Compression ★ ★ ★ ★ ★ ★ ★ ★ ★
Schema evolution ★ ★ ★ ★ ★ ★ ★
Row vs column row column column
Splittability ★ ★ ★ ★ ★ ★ ★ ★ ★ ★
Nested fields support ★ ★ ★ ★ ★ ★ ★ ★ ★ ★
Best for Schema evolution Compression Nested fields
Glue jobs console https://aws.amazon.com/blogs/big-data/making-etl-easier-with-aws-glue-studio/
RDD data structure
RDD
§ Resilient: resiliency by re-executing the DAG (directed acyclic graph) in case of failures
§ Distributed: distributed across multiple nodes to take advantage of parallel processing
§ Datasets: collection of objects that may be organized in key-object pairs (Key 1 → Object 1, Key 2 → Object 2, Key 3 → Object 3, …, Key n → Object n)
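The three properties can be mimicked with plain lists. The sketch below is not Spark code: the modulo placement, the `lineage` function and the simulated failure are assumptions used only to show key-object pairs spread over nodes and a lost partition being rebuilt from its source.

```python
def partition_by_key(pairs, num_nodes):
    """Distributed: place each (key, object) pair on a node chosen from its key."""
    nodes = [[] for _ in range(num_nodes)]
    for key, obj in pairs:
        nodes[key % num_nodes].append((key, obj))
    return nodes


def lineage(partition):
    """The recorded transformation; re-running it rebuilds a lost partition."""
    return [(key, obj.upper()) for key, obj in partition]


source = [(1, "object 1"), (2, "object 2"), (3, "object 3"), (4, "object 4")]
source_partitions = partition_by_key(source, num_nodes=2)
computed = [lineage(p) for p in source_partitions]

# Resilient: simulate losing node 1, then recompute only that partition.
computed[1] = None
computed[1] = lineage(source_partitions[1])
print(computed)
```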
Narrow transformation – no shuffling among partitions
§ map()
§ flatMap()
§ mapPartitions()
§ filter()
§ sample()
§ union()
Wide transformation – shuffling among partitions
§ intersection()
§ distinct()
§ reduceByKey()
§ groupByKey()
§ join()
§ cartesian()
§ repartition()
§ coalesce() (shuffles only when shuffle=True)
Spark Operations
Narrow transformations: map(), flatMap(), mapPartitions(), filter(), sample(), union()
Wide transformations: intersection(), distinct(), reduceByKey(), groupByKey(), join(), cartesian(), repartition(), coalesce() (shuffles only when shuffle=True)
Actions: count(), collect(), take(), top(), countByValue(), reduce(), fold(), aggregate(), foreach()
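The split between transformations and actions matters because transformations are lazy: they only record what to do, and nothing runs until an action asks for a result. The class below is a toy, single-machine stand-in written for this note (`LazyDataset` is not a Spark class) that shows the same behaviour with two narrow transformations and three actions.

```python
from itertools import islice


class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps

    # Transformations: return a new dataset, compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the recorded steps and return a concrete result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(islice(self._run(), n))


calls = []


def double(x):
    calls.append(x)  # lets us observe when the function really runs
    return 2 * x


ds = LazyDataset(range(5)).map(double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # still 0: nothing has executed
result = ds.collect()
print(calls_before_action, result)
```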
Spark basic job execution process
spark-submit mycode.py
Driver
spark = SparkSession...
spark.sparkContext
rdd_1 = spark.read...
rdd_2 = spark.read...
rdd_3 = rdd_1.filter(...)
rdd_4 = rdd_2.filter(...)
rdd_5 = rdd_3.join(rdd_4)
rdd_6 = rdd_5.filter(...)
output = rdd_6.count(...)
DAG Scheduler
Builds the DAG, splits the job into stages (stage_1, stage_2, stage_3) and tasks, and signals the Task Scheduler
Cluster Manager
Allocates worker nodes and starts executors
Task Scheduler
Places tasks on executors
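One way to picture how the DAG Scheduler arrives at three stages is to cut the lineage at every wide dependency. The sketch below is a simplified model, not Spark's scheduler: the dictionary encoding is hypothetical, the join is assumed to shuffle, and each narrow dependency is assumed to have a single parent listed before its child.

```python
# rdd name -> (parents, dependency kind); a hand-written encoding of the driver code
lineage = {
    "rdd_1": ([], None),                     # read
    "rdd_2": ([], None),                     # read
    "rdd_3": (["rdd_1"], "narrow"),          # filter
    "rdd_4": (["rdd_2"], "narrow"),          # filter
    "rdd_5": (["rdd_3", "rdd_4"], "wide"),   # join (assumed to shuffle)
    "rdd_6": (["rdd_5"], "narrow"),          # filter
}


def split_into_stages(lineage):
    """Group RDDs into stages, starting a new stage at each source or wide dependency."""
    stage_of = {}
    stages = []
    for rdd, (parents, kind) in lineage.items():
        if kind == "narrow":
            # Narrow dependency: stay in the parent's stage (pipelined, no shuffle).
            stage = stage_of[parents[0]]
        else:
            # Source or wide dependency: a shuffle boundary opens a new stage.
            stage = []
            stages.append(stage)
        stage.append(rdd)
        stage_of[rdd] = stage
    return stages


stages = split_into_stages(lineage)
for number, stage in enumerate(stages, start=1):
    print(f"stage_{number}: {stage}")
```

Under these assumptions the two read-and-filter branches form one stage each, and the join with the final filter forms the third.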
Additional features in Glue jobs (focus on PySpark)
PySpark Transforms
GlueTransform
ApplyMapping
DropFields
DropNullFields
ErrorsAsDynamicFrame
Filter
FlatMap
Join
Map
MapToCollection
Relationalize
RenameField
ResolveChoice
SelectFields
SelectFromCollection
Spigot
SplitFields
SplitRows
Unbox
UnnestFrame
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-transforms.html
AWS Glue
PySpark Extensions
getResolvedOptions
Types
DynamicFrame
DynamicFrameCollection
DynamicFrameWriter
DynamicFrameReader
GlueContext
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-extensions.html
Spark: RDD, DataFrame, DataSet
Glue: DynamicFrame
Demo custom script
movie_pairs = movies.join(movies, on=user)
movies (movie user rating)
A 1 ★★★★
A 2 ★
A 3 ★★★
B 1 ★
B 2 ★★★★
B 3 ★
C 1 ★★★
C 2 ★
C 3 ★★★★
D 1 ★★
D 2 ★★★★
D 3 ★★
movie_pairs (movieX userX ratingX movieY userY ratingY)
A 1 ★★★★ B 1 ★
A 2 ★ B 2 ★★★★
A 3 ★★★ B 3 ★
A 1 ★★★★ C 1 ★★★
A 2 ★ C 2 ★
A 3 ★★★ C 3 ★★★★
A 1 ★★★★ D 1 ★★
A 2 ★ D 2 ★★★★
A 3 ★★★ D 3 ★★
B 1 ★ C 1 ★★★
B 2 ★★★★ C 2 ★
B 3 ★ C 3 ★★★★
B 1 ★ D 1 ★★
B 2 ★★★★ D 2 ★★★★
B 3 ★ D 3 ★★
C 1 ★★★ D 1 ★★
C 2 ★ D 2 ★★★★
C 3 ★★★★ D 3 ★★
movie_pairs = movie_pairs.groupBy((movieX,movieY))
movieX movieY ratingX ratingY
A B (★★★★, ★, ★★★) (★, ★★★★, ★)
A C (★★★★, ★, ★★★) (★★★, ★, ★★★★)
A D (★★★★, ★, ★★★) (★★, ★★★★, ★★)
B C (★, ★★★★, ★) (★★★, ★, ★★★★)
B D (★, ★★★★, ★) (★★, ★★★★, ★★)
C D (★★★, ★, ★★★★) (★★, ★★★★, ★★)
similarity = movie_pairs.mapValues(cosine_similarity)
movieX movieY similarity
A B ≠
A C =
A D ≠
B C ≠
B D =
C D ≠
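The three steps of the demo (self-join on user, group by movie pair, cosine similarity per pair) can be reproduced without a cluster. The sketch below is plain Python rather than the PySpark script shown in the talk. It reads each star as one rating point and uses the same four movies and three users as the slides.

```python
from itertools import combinations
from math import sqrt

# (movie, user, rating) with one point per star
ratings = [
    ("A", 1, 4), ("A", 2, 1), ("A", 3, 3),
    ("B", 1, 1), ("B", 2, 4), ("B", 3, 1),
    ("C", 1, 3), ("C", 2, 1), ("C", 3, 4),
    ("D", 1, 2), ("D", 2, 4), ("D", 3, 2),
]


def cosine_similarity(pairs):
    """Cosine of the angle between the ratingX and ratingY vectors."""
    dot = sum(x * y for x, y in pairs)
    norm_x = sqrt(sum(x * x for x, _ in pairs))
    norm_y = sqrt(sum(y * y for _, y in pairs))
    return dot / (norm_x * norm_y)


# Join on user: collect every movie rated by the same user.
by_user = {}
for movie, user, rating in ratings:
    by_user.setdefault(user, []).append((movie, rating))

# Group by (movieX, movieY): one list of (ratingX, ratingY) per pair.
movie_pairs = {}
for user_ratings in by_user.values():
    for (movie_x, rating_x), (movie_y, rating_y) in combinations(sorted(user_ratings), 2):
        movie_pairs.setdefault((movie_x, movie_y), []).append((rating_x, rating_y))

# Map values: replace each rating list with its similarity score.
similarity = {pair: cosine_similarity(values) for pair, values in movie_pairs.items()}
for pair, score in sorted(similarity.items()):
    print(pair, round(score, 4))
```

The pairs marked "=" on the slide (A with C, B with D) are the ones whose score comes out closest to 1.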
QuickSight console
AWS Training & Certification
https://www.aws.training: Free on-demand courses to help you build new cloud skills
E-Learning: Data Analytics Fundamentals
https://www.aws.training/Details/eLearning?id=35364
E-Learning: AWS Hadoop Fundamentals
https://www.aws.training/Details/eLearning?id=40337
Learning Path: Internet of Things Foundation Series
https://www.aws.training/Details/Curriculum?id=27289
Video: Serverless Analytics
https://www.aws.training/Details/Video?id=26848
Available AWS Certifications
Thanks!

Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 
Come costruire un'architettura Serverless nel Cloud AWS
Come costruire un'architettura Serverless nel Cloud AWSCome costruire un'architettura Serverless nel Cloud AWS
Come costruire un'architettura Serverless nel Cloud AWS
 

AWS Big Data Tools for Startups

  • 1. © 2020, Amazon Web Services, Inc. or its Affiliates. Big Data for Startups: how to build Big Data applications in serverless mode. Fausto Palma, AWS Solution Architect
  • 2. Use case for Big Data tools. Standard workloads: fit in standard DBs; structured data; no excessive load spikes (CPU over time). Volume: large amount of data not fitting the available resources (disk space, RAM, or CPU). Velocity: streaming, real-time analysis. Variety: different data formats (tabular, nested, images, video).
  • 3. Use case for Big Data tools. Data lake: open formats; central catalog; data collected when available, even in raw format. Applications: recommendation systems, text mining, supply chain flow optimization, social network analysis, anomaly detection, sentiment analysis, customer churn prevention, …
  • 4. Analytics overall architecture (Data Lake). Stages: Data movement, Storage, Analytics, Data value. Cross-cutting: Catalog; Management | Security.
  • 5. AWS services. Management and security: AWS Lake Formation, AWS Key Management Service, AWS Identity & Access Management, Amazon Macie, … Data movement: Managed Streaming for Apache Kafka, Amazon Kinesis Video Streams, Kinesis Data Streams, Kinesis Data Firehose. Storage: S3, Glacier. Analytics: Redshift, EMR (Spark & Hadoop), Athena, Elasticsearch Service, Kinesis Data Analytics, AWS Glue (Spark & Python). Data value: QuickSight, SageMaker, Comprehend, Rekognition, Translate, Pinpoint, … Catalog: AWS Glue data catalog.
  • 6. General pattern to scale. Inputs: data, messages, streams, … Steps: mapping/sharding, parallel processing, shuffling, reduction/aggregation. Result: outputs.
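The pattern on this slide can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the word-count task, the sample shards, and the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are hypothetical and do not come from the deck.

```python
from collections import defaultdict


def map_phase(shard):
    """Map: emit a (key, 1) pair for every word in one shard of input lines."""
    return [(word.lower(), 1) for line in shard for word in line.split()]


def shuffle_phase(mapped_shards):
    """Shuffle: route all values that share a key into the same group."""
    groups = defaultdict(list)
    for pairs in mapped_shards:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate the values of each key into a single output."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input, already split (sharded) into two independent pieces.
shards = [
    ["big data for startups", "serverless big data"],
    ["data lake on s3"],
]

# Each shard can be mapped independently, which is what allows parallelism.
mapped = [map_phase(shard) for shard in shards]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

Only the shuffle step needs to see data from every shard, which is why it tends to be the expensive part when the same pattern runs on a cluster.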
  • 7. Most secure infrastructure: certifications. Global: CSA (Cloud Security Alliance Controls); ISO 9001 (Global Quality Standard); ISO 27001 (Security Management Controls); ISO 27017 (Cloud Specific Controls); ISO 27018 (Personal Data Protection); PCI DSS Level 1 (Payment Card Standards); SOC 1 (Audit Controls Report); SOC 2 (Security, Availability, & Confidentiality Report); SOC 3 (General Controls Report). United States: CJIS (Criminal Justice Information Services); DoD SRG (DoD Data Processing); FedRAMP (Government Data Standards); FERPA (Educational Privacy Act); FIPS (Government Security Standards); FISMA (Federal Information Security Management); GxP (Quality Guidelines and Regulations); FFIEC (Financial Institutions Regulation); HIPAA (Protected Health Information); ITAR (International Arms Regulations); MPAA (Protected Media Content); NIST (National Institute of Standards and Technology); SEC Rule 17a-4(f) (Financial Data Standards); VPAT/Section 508 (Accountability Standards). Asia Pacific: FISC [Japan] (Financial Industry Information Systems); IRAP [Australia] (Australian Security Standards); K-ISMS [Korea] (Korean Information Security); MTCS Tier 3 [Singapore] (Multi-Tier Cloud Security Standard); My Number Act [Japan] (Personal Information Protection). Europe: C5 [Germany] (Operational Security Attestation); Cyber Essentials Plus [UK] (Cyber Threat Protection); G-Cloud [UK] (UK Government Standards); IT-Grundschutz [Germany] (Baseline Protection Methodology). https://aws.amazon.com/compliance/programs/
  • 8. AWS services (same service map as slide 5)
  • 9. Amazon Simple Storage Service “S3” § Built to store any amount of data § Runs on the world’s largest global cloud infrastructure § Designed to deliver 99.999999999% durability § Geographic redundancy & automatic replication § Tiered storage to optimize price/performance. Diagram labels: S3, AZ, AZ, AZ, Transit.
  • 10. Process Data in Place. Amazon S3 with Amazon Athena, Amazon Redshift Spectrum, Amazon SageMaker, AWS Glue.
  • 11. Amazon S3 Select (SQL). Input format: delimited text (CSV, TSV), JSON, Parquet … Compression: GZIP, BZIP2 … Output format: delimited text (CSV, TSV), JSON … Clauses: Select, From, Where. Data types: String; Integer, Float, Decimal; Timestamp; Boolean. Operators: Conditional, Math, Logical, String (Like, ||). Functions: String, Cast, Math, Aggregate.
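As a rough illustration of how the S3 Select options on this slide map onto an API call, the sketch below builds the keyword arguments for boto3's `select_object_content`. The bucket name, object key, column names, and the helper `build_select_request` are hypothetical; the block only assembles the request so that it runs without AWS credentials.

```python
def build_select_request(bucket, key, sql, gzip_compressed=False):
    """Assemble keyword arguments for an S3 Select call on a CSV object."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        # Input: delimited text with a header row, optionally GZIP-compressed.
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": "GZIP" if gzip_compressed else "NONE",
        },
        # Output: one JSON document per matching row.
        "OutputSerialization": {"JSON": {}},
    }


# Hypothetical bucket, key, and columns. CSV values arrive as strings,
# so the numeric comparison goes through CAST.
request = build_select_request(
    bucket="example-data-lake",
    key="sensors/readings.csv.gz",
    sql=(
        "SELECT s.sensorId, s.status FROM s3object s "
        "WHERE CAST(s.currentTemperature AS INT) > 30"
    ),
    gzip_compressed=True,
)
print(request["Expression"])

# With credentials configured, the request could be sent like this:
#   import boto3
#   response = boto3.client("s3").select_object_content(**request)
```

Keeping the request builder separate from the network call makes the SQL and serialization settings easy to inspect before anything is sent.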
  • 12. S3 – how to access. AWS S3 console: https://s3.console.aws.amazon.com/s3/ AWS S3 API documentation: https://docs.aws.amazon.com/AmazonS3/latest/API/API_Operations.html AWS S3 CLI: https://docs.aws.amazon.com/cli/latest/reference/s3/#available-commands
  • 13. AWS services (same service map as slide 5)
  • 14. Kinesis Data Firehose: how it works. Ingest: AWS IoT, Amazon Kinesis Agent, Amazon Kinesis Streams, Amazon CloudWatch Logs, Amazon CloudWatch Events, Managed Streaming for Apache Kafka. Transform: Lambda function. Deliver: Amazon S3, Amazon Redshift, Amazon Elasticsearch Service.
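The transform step on this slide can be sketched as a Lambda handler. Firehose passes each record as base64-encoded data with a `recordId` and expects every record back with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. The filtering rule (drop readings whose status is OK), the added `alert` field, and the sample event are assumptions made for illustration only.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform a batch of Firehose records (the filter rule is hypothetical)."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
        except ValueError:
            # Undecodable input: hand the record back marked as failed.
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed",
                           "data": record["data"]})
            continue
        if payload.get("status") == "OK":
            # Healthy readings are dropped from the delivery stream.
            output.append({"recordId": record["recordId"],
                           "result": "Dropped",
                           "data": record["data"]})
            continue
        # Enrich the remaining records and newline-delimit them for S3.
        payload["alert"] = True
        line = json.dumps(payload) + "\n"
        output.append({"recordId": record["recordId"],
                       "result": "Ok",
                       "data": base64.b64encode(line.encode("utf-8")).decode("ascii")})
    return {"records": output}


def encode(payload):
    """Helper for local testing: encode a dict the way Firehose delivers it."""
    return base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")


# Hypothetical event with one healthy and one failing sensor reading.
sample_event = {"records": [
    {"recordId": "1", "data": encode({"sensorId": 7, "status": "OK"})},
    {"recordId": "2", "data": encode({"sensorId": 9, "status": "FAIL"})},
]}
response = lambda_handler(sample_event, None)
print([r["result"] for r in response["records"]])
```

Every incoming `recordId` appears exactly once in the response, which is what lets Firehose match transformed records to the originals.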
  • 15. Kinesis Data Firehose – how to access. AWS Kinesis Data Firehose console: https://eu-west-1.console.aws.amazon.com/kinesis/ AWS Kinesis Data Firehose API documentation: https://docs.aws.amazon.com/firehose/latest/APIReference/API_Operations.html AWS Kinesis Data Firehose CLI: https://docs.aws.amazon.com/cli/latest/reference/firehose/index.html#available-commands
  • 16. Simple demo. App to Amazon Kinesis Data Firehose (data movement) to Amazon Simple Storage Service (S3) (storage).
  • 17. Amazon Kinesis Data Generator (KDG) https://awslabs.github.io/amazon-kinesis-data-generator/web/help.html Record template: { "sensorId": {{random.number(50)}}, "currentTemperature": {{random.number( { "min":15, "max":38 } )}}, "status": "{{random.weightedArrayElement( { "weights": [0.9,0.03,0.07], "data": ["OK","FAIL","WARN"] } )}}" }
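For readers without access to the KDG web tool, records shaped like this template can be approximated locally. The sketch below is not the KDG itself: the function name `make_record`, the fixed seed, and the assumption that `random.number(50)` yields an integer from 0 to 50 inclusive are illustrative choices.

```python
import json
import random


def make_record(rng):
    """Build one sensor reading shaped like the KDG template on the slide."""
    return {
        # Assumed to mirror random.number(50): an integer in 0..50.
        "sensorId": rng.randint(0, 50),
        # Mirrors random.number({"min": 15, "max": 38}).
        "currentTemperature": rng.randint(15, 38),
        # Mirrors random.weightedArrayElement with the same weights and data.
        "status": rng.choices(
            ["OK", "FAIL", "WARN"], weights=[0.9, 0.03, 0.07], k=1
        )[0],
    }


rng = random.Random(42)  # fixed seed so repeated runs generate the same data
records = [make_record(rng) for _ in range(1000)]
print(json.dumps(records[0]))
```

Each record could then be serialized with `json.dumps` and sent to a delivery stream, for example through the Firehose `PutRecord` operation.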
  • 18. AWS services (same service map as slide 5)
  • 19. Glue Catalog and Crawlers. Diagram labels: Data Lake (S3), AWS Glue Crawler, AWS Glue Data Catalog (Hive metastore service), EMR, Athena, AWS Glue Jobs.
  • 20. Glue Catalog console
  • 21. Glue Crawlers console
  • 22. AWS services (same service map as slide 5)
  • 23. Athena console https://eu-west-1.console.aws.amazon.com/athena/ Steps: select catalog, select database, write query. Data source: S3.
  • 24. Presto architecture. Client. Coordinator: parsing, planning, scheduling; Metastore. Workers: connectors to the data locations. Example query: SELECT sport, count(distinct location) as locations, count(distinct event_id) as events, count(*) as tickets, avg(ticket_price) as avg_ticket_price FROM sporting_event_ticket_info GROUP BY 1 ORDER BY 1;
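To see what the example query returns, it can be run against a tiny in-memory table. In this sketch SQLite stands in for Athena/Presto purely for illustration; the table contents and column types are made up, and only the query text comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sporting_event_ticket_info "
    "(sport TEXT, location TEXT, event_id INTEGER, ticket_price REAL)"
)

# Hypothetical tickets: three for baseball (two events), one for football.
rows = [
    ("baseball", "Stadium A", 1, 20.0),
    ("baseball", "Stadium A", 1, 30.0),
    ("baseball", "Stadium B", 2, 40.0),
    ("football", "Stadium C", 3, 50.0),
]
conn.executemany(
    "INSERT INTO sporting_event_ticket_info VALUES (?, ?, ?, ?)", rows
)

# Query text as shown on the slide; GROUP BY 1 groups by the first column.
QUERY = """
SELECT sport,
       count(distinct location) as locations,
       count(distinct event_id) as events,
       count(*) as tickets,
       avg(ticket_price) as avg_ticket_price
FROM sporting_event_ticket_info
GROUP BY 1
ORDER BY 1;
"""
result = conn.execute(QUERY).fetchall()
for row in result:
    print(row)
```

In Athena the same statement would be planned by the coordinator and split across workers; here a single process does all the work, so only the result semantics carry over.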
  • 25. Row vs Columnar file orientation. Tabular data; file in storage or streaming.
  • 26. Row vs Columnar file orientation. Tabular data; file in storage or streaming.
  • 27. Row vs Columnar file orientation. Nested data; file in storage or streaming.
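The difference between the two orientations can be shown with plain Python data structures. This is a conceptual sketch only: the sample rows reuse the sensor fields from the KDG template, and the helper `to_columnar` is hypothetical; real columnar formats add encoding, compression, and metadata on top.

```python
# Row orientation: each record keeps all of its fields together.
rows = [
    {"sensorId": 1, "currentTemperature": 21, "status": "OK"},
    {"sensorId": 2, "currentTemperature": 35, "status": "WARN"},
    {"sensorId": 3, "currentTemperature": 28, "status": "OK"},
]


def to_columnar(records):
    """Column orientation: all values of one field are stored together."""
    columns = {name: [] for name in records[0]}
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return columns


columns = to_columnar(rows)

# An aggregate over one field touches a single column, not every record.
temperatures = columns["currentTemperature"]
avg_temperature = sum(temperatures) / len(temperatures)
print(columns)
print(avg_temperature)
```

Row layouts suit workloads that read or write whole records, such as streaming, while column layouts suit analytic queries that scan a few fields across many records.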
  • 28. Different file formats: Avro, Parquet, ORC (Optimized Row Columnar). Compression ★ ★ ★ ★ ★ ★ ★ ★ ★ Schema evolution ★ ★ ★ ★ ★ ★ ★ Row vs column: row, column, column. Splittability ★ ★ ★ ★ ★ ★ ★ ★ ★ ★ Nested fields support ★ ★ ★ ★ ★ ★ ★ ★ ★ ★ Best for: Schema evolution, Compression, Nested fields.
  • 29. AWS services (same service map as slide 5)
  • 30. Glue jobs console https://aws.amazon.com/blogs/big-data/making-etl-easier-with-aws-glue-studio/
  • 31. RDD data structure. RDD: Resilient Distributed Datasets § Collection of objects that may be organized in key-object pairs (Key 1/Object 1, Key 2/Object 2, Key 3/Object 3, … Key n/Object n) § Distributed on multiple nodes to take advantage of parallel processing § Resiliency by re-executing the DAG (directed acyclic graph) in case of failures
  • 32. Narrow transformation – no shuffling among partitions (worker nodes) § map() § flatMap() § mapPartitions() § filter() § sample() § union()
  • 33. Wide transformation – shuffling among partitions (worker nodes) § intersection() § distinct() § reduceByKey() § groupByKey() § join() § cartesian() § repartition() § coalesce()
  • 34. Spark Operations. Narrow transformations: map(), flatMap(), mapPartitions(), filter(), sample(), union(). Wide transformations: intersection(), distinct(), reduceByKey(), groupByKey(), join(), cartesian(), repartition(), coalesce(). Actions: count(), collect(), take(), top(), countByValue(), reduce(), fold(), aggregate(), foreach().
  • 35. Spark basic job execution process. Submission: spark-submit mycode.py ... Driver: spark = SparkSession... spark.sparkContext rdd_1 = spark.read... rdd_2 = spark.read... rdd_3 = rdd_1.filter(...) rdd_4 = rdd_2.filter(...) rdd_5 = rdd_3.join(rdd_4) rdd_6 = rdd_5.filter(...) output = rdd_6.count(...) DAG Scheduler: builds the DAG, splits the job into stages (stage_1, stage_2, stage_3) and tasks, and signals the Task Scheduler. Cluster Manager: allocates worker nodes and starts executors. Task Scheduler: places tasks on executors.
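The driver code on this slide only describes work; nothing runs until the final action (`count`). A toy model of that behaviour in plain Python is sketched below. The class `LazyDataset` is hypothetical and far simpler than Spark: it has a single partition and no scheduler, and it only shows how transformations are recorded as lineage and replayed when an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on demand."""

    def __init__(self, source, lineage=()):
        self._source = source      # re-iterable input data
        self._lineage = lineage    # recorded (kind, function) steps

    def filter(self, predicate):
        # Transformation: extend the lineage, compute nothing yet.
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    def map(self, func):
        # Transformation: extend the lineage, compute nothing yet.
        return LazyDataset(self._source, self._lineage + (("map", func),))

    def _compute(self):
        # Replay the whole lineage from the source, step by step.
        items = iter(self._source)
        for kind, func in self._lineage:
            items = filter(func, items) if kind == "filter" else map(func, items)
        return items

    def count(self):
        # Action: triggers execution of the recorded lineage.
        return sum(1 for _ in self._compute())

    def collect(self):
        # Action: triggers execution and returns all results.
        return list(self._compute())


calls = []


def is_even(x):
    calls.append(x)  # record every evaluation to show when work happens
    return x % 2 == 0


dataset = LazyDataset(range(10)).filter(is_even).map(lambda x: x * x)
calls_before_action = len(calls)  # still 0: only the lineage was built
total = dataset.count()           # the action runs the pipeline
print(calls_before_action, total)
```

Because each dataset keeps its lineage rather than its data, a lost result can be rebuilt by replaying the same steps, which is the idea behind RDD resiliency.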
  • 36. Additional features in Glue jobs (focus on PySpark). PySpark Transforms: GlueTransform, ApplyMapping, DropFields, DropNullFields, ErrorsAsDynamicFrame, Filter, FlatMap, Join, Map, MapToCollection, Relationalize, RenameField, ResolveChoice, SelectFields, SelectFromCollection, Spigot, SplitFields, SplitRows, Unbox, UnnestFrame. https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-transforms.html AWS Glue PySpark Extensions: getResolvedOptions, Types, DynamicFrame, DynamicFrameCollection, DynamicFrameWriter, DynamicFrameReader, GlueContext. https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-extensions.html Data structures: RDD, DataFrame, DataSet (Spark); DynamicFrame (Glue).
  • 37. Demo custom script
  • 38. Input ratings (movie, user, rating): A 1 ★★★★, A 2 ★, A 3 ★★★; B 1 ★, B 2 ★★★★, B 3 ★; C 1 ★★★, C 2 ★, C 3 ★★★★; D 1 ★★, D 2 ★★★★, D 3 ★★. Self-join on user: movie_pairs = movies.join(movies, on=user). Resulting columns: movieX, userX, ratingX, movieY, userY, ratingY, shown for the pairs A-B, A-C, A-D, B-C, B-D, C-D.
  • 39. movie_pairs = movie_pairs.groupBy((movieX,movieY)). Grouped ratings per pair (ratingX | ratingY): A-B ★★★★ ★ ★★★ | ★ ★★★★ ★; A-C ★★★★ ★ ★★★ | ★★★ ★ ★★★★; A-D ★★★★ ★ ★★★ | ★★ ★★★★ ★★; B-C ★ ★★★★ ★ | ★★★ ★ ★★★★; B-D ★ ★★★★ ★ | ★★ ★★★★ ★★; C-D ★★★ ★ ★★★★ | ★★ ★★★★ ★★.
  • 40. movie_pairs = movie_pairs.groupBy((movieX,movieY)), then similarity = movie_pairs.mapValues(cosine_similarity). Result (movieX, movieY, similarity): A B ≠; A C =; A D ≠; B C ≠; B D =; C D ≠.
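The similarity step of the demo can be reproduced in plain Python with the star ratings shown on slide 38. This is a sketch rather than the demo's Glue script: the dictionary layout and the loop over pairs are assumptions, and only the ratings and the use of cosine similarity come from the slides.

```python
import math
from itertools import combinations

# Star ratings from slide 38: movie -> {user: rating}.
ratings = {
    "A": {1: 4, 2: 1, 3: 3},
    "B": {1: 1, 2: 4, 3: 1},
    "C": {1: 3, 2: 1, 3: 4},
    "D": {1: 2, 2: 4, 3: 2},
}


def cosine_similarity(x, y):
    """Cosine of the angle between two equally long rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm


similarity = {}
for movie_x, movie_y in combinations(sorted(ratings), 2):
    # Equivalent of the self-join on user: keep users who rated both movies.
    shared_users = sorted(set(ratings[movie_x]) & set(ratings[movie_y]))
    rating_x = [ratings[movie_x][user] for user in shared_users]
    rating_y = [ratings[movie_y][user] for user in shared_users]
    similarity[(movie_x, movie_y)] = cosine_similarity(rating_x, rating_y)

# Print the pairs from most to least similar.
for pair, score in sorted(similarity.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

The two highest-scoring pairs are A-C and B-D, the same pairs that slide 40 marks with "=".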
  • 41. AWS services (same service map as slide 5)
  • 42. QuickSight console
  • 43. AWS Training & Certification. https://www.aws.training: free on-demand courses to help you build new cloud skills. E-Learning: Data Analytics Fundamentals (https://www.aws.training/Details/eLearning?id=35364). E-Learning: AWS Hadoop Fundamentals (https://www.aws.training/Details/eLearning?id=40337). Learning Path: Internet of Things Foundation Series (https://www.aws.training/Details/Curriculum?id=27289). Video: Serverless Analytics (https://www.aws.training/Details/Video?id=26848). Available AWS Certifications.
  • 44. Thanks!