1. Deploy a Spark Application
on a Spark Cluster @
AWS Elastic MapReduce (EMR)
Dr. Rim Moussa
University of Carthage
2. Amazon S3 - Amazon Simple Storage Service
Upload
– S3 bucket for Spark code: .jar
– S3 bucket for Data
Uploads to S3 can be done via the Terminal for big data sets:
launch an EC2 instance to upload data into the S3 bucket
curl ftp://ftp.ais.dk/ais_data/dk_csv_jun2018.zip | aws s3 cp - s3://aisdma
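The streaming upload above can be sketched end-to-end. Note that `aws s3 cp` reading from stdin (`-`) needs a full object key as the destination, not just a bucket; the key `dk_csv_jun2018.zip` below is an assumption added for illustration (the slide gives only the bucket `s3://aisdma`). Requires configured AWS credentials.

```shell
# Stream a remote archive straight into S3 without storing it on local disk.
# Bucket s3://aisdma is from the slide; the object key is an assumed example.
curl ftp://ftp.ais.dk/ais_data/dk_csv_jun2018.zip \
  | aws s3 cp - s3://aisdma/dk_csv_jun2018.zip
```

This avoids filling the EC2 instance's disk: the download is piped directly into the S3 upload.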
Manipulation of S3 buckets and files can be done via the
Terminal
aws s3 ls s3://data.info                         # list the bucket's contents
aws s3 cp s3://spark.jars/rm-1.0-veracity.jar .  # download an object to the current directory
aws s3 rm s3://data.info                         # delete (needs an object key, or --recursive to empty the bucket)
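A typical bucket round-trip combining the commands above; the bucket name `my-demo-bucket` and file `results.csv` are hypothetical examples, and the session assumes configured AWS credentials:

```shell
# Hypothetical end-to-end S3 session; names are examples, not from the slides.
aws s3 mb s3://my-demo-bucket                  # create a bucket
aws s3 cp results.csv s3://my-demo-bucket/     # upload a local file
aws s3 ls s3://my-demo-bucket                  # list its contents
aws s3 rm s3://my-demo-bucket/results.csv      # delete the object
aws s3 rb s3://my-demo-bucket                  # remove the now-empty bucket
```

`aws s3 rm` deletes objects, while `aws s3 rb` removes the bucket itself (it must be empty first, unless `--force` is given).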
3. Open Datasets on Amazon
Amazon hosts a repository of big datasets
https://registry.opendata.aws/
Amazon runs the AWS Public Dataset Program
to democratize access to data and to encourage the
development of communities that benefit from access to
shared datasets.
https://aws.amazon.com/opendata/public-datasets/
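Many registry datasets live in public S3 buckets that can be read anonymously with the `--no-sign-request` flag (no AWS credentials needed). The bucket below, `noaa-ghcn-pds` (NOAA climate data), is one example from the registry, chosen here as an assumption:

```shell
# Browse a public open-data bucket without AWS credentials.
aws s3 ls s3://noaa-ghcn-pds/ --no-sign-request
```

The same flag works with `aws s3 cp` to download sample files for local Spark testing before running on EMR.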