24. STARTING AN EMR CLUSTER WITH HADOOP ECOSYSTEM TOOLS PRE-INSTALLED
25. COPY & LOAD OUR DATASET
$ scp -i EMRKeyPair.pem ~/aws/hadoop/LHRarrivals*.csv \
    hadoop@ec2-54-76-242-238.eu-west-1.compute.amazonaws.com:
$ ssh -i EMRKeyPair.pem hadoop@ec2-54-76-242-238.eu-west-1.compute.amazonaws.com
$ hadoop fs -mkdir /data/
$ hadoop fs -put <uploaded_files> /data/
$ hadoop fs -ls -h -R /data/
Or, at scale, use S3DistCp (distributed copy) to load from S3 in parallel:
$ . /home/hadoop/impala/conf/impala.conf
$ hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
    -Dmapreduce.job.reduces=30 \
    --src s3://s3bucketname/ \
    --dest hdfs://$HADOOP_NAMENODE_HOST:$HADOOP_NAMENODE_PORT/data/ \
    --outputCodec 'none'
** Run on the cluster master node
26. CREATE EXTERNAL TABLE
$ # check the size of our data set
$ wc -l LHRarrivals*.csv
   850 LHRarrivals2.csv
  1526 LHRarrivals.csv
  2376 total
$ impala-shell
Welcome to the Impala shell.
> CREATE EXTERNAL TABLE flights (
    input        STRING,
    id           BIGINT,
    widget       STRING,
    source       STRING,
    resultnum    BIGINT,
    pageurl      STRING,
    scheduled    STRING,
    flightnumber STRING,
    airport      STRING,
    status       STRING,
    terminal     STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/';
> SELECT count(*) FROM flights;
Should return count(*) = 2376, reflecting the size of the data set.
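Once the table exists, ordinary aggregates work the same way. A hedged sketch of a follow-up query, assuming the column names from the CREATE TABLE above and using impala-shell's -q option to run a one-off query from the master node:

```shell
# Hypothetical example: count arrivals per flight status
$ impala-shell -q "SELECT status, count(*) AS arrivals
                   FROM flights
                   GROUP BY status
                   ORDER BY arrivals DESC;"
```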
27. DEMO OF ODBC ACCESS
Doing this part on Amazon WorkSpaces using the Simba Cloudera
Impala ODBC Driver.!
Set up an SSH tunnel to the master node to allow us to connect to port
25010 from the WorkSpaces desktop to the Impala ODBC port!
A previously configured system DSN allows us to work with the data from
our EMR/Impala cluster directly within Microsoft Excel!
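The tunnel setup can be sketched as follows; the key file and master-node hostname are reused from the earlier slides, and forwarding local port 25010 to the same port on the master is an assumption based on the port mentioned above:

```shell
# Hypothetical sketch: forward local port 25010 on the WorkSpaces
# desktop to port 25010 on the EMR master node (-N: no remote command,
# keep the session open purely for the tunnel)
$ ssh -i EMRKeyPair.pem -N \
    -L 25010:localhost:25010 \
    hadoop@ec2-54-76-242-238.eu-west-1.compute.amazonaws.com
```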
28. GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
Amazon S3,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
Data on Amazon EC2
33. Real-time response to content in semi-structured data streams
Relatively simple computations on data (aggregates, filters, sliding windows, etc.)
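As a toy illustration of the "simple computations" a streaming layer performs, here is a minimal sliding-window sum over a stream of numbers, sketched with awk; the window size of 3 and the sample values are made up for illustration:

```shell
# Hypothetical example: 3-element sliding-window sum over a stream
printf '1\n2\n3\n4\n5\n' |
awk '{ buf[NR % 3] = $1                # keep only the last 3 values
       if (NR >= 3) {                  # once the window is full...
         s = 0
         for (i in buf) s += buf[i]    # ...sum the window
         print s
       } }'
# windows: 1+2+3=6, 2+3+4=9, 3+4+5=12
```

The same shape (bounded buffer, emit one aggregate per arriving record) underlies the filters and aggregates run over real data streams.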
34. Hourly server logs: how your systems went wrong an hour ago
Weekly / monthly bill: what you spent this past billing cycle
Daily customer report from your website: tells you what deal or ad to try next time
Daily fraud reports: tells you if there was fraud yesterday
Daily business reports: tells me how customers used AWS services yesterday
Real-time metrics: what just went wrong now
Real-time spending alerts/caps: guaranteeing you can’t overspend
Real-time analysis: what to offer the current customer now
Real-time detection: blocks fraudulent use now
Fast ETL into Amazon Redshift: how are customers using services now