08:30 AM Breakfast
09:00 AM Introduction and Strengths of Technologies
10:00 AM Break + set up query tool
10:20 AM Hadoop hands-on
10:55 AM Break
11:10 AM Redshift hands-on
11:40 AM Operationalizing your code
12:00 PM Adjourn
• Why an Analytic Database?
• What is Amazon Redshift?
• ‘Fire Up’ a Redshift Database
• Load Data
• Do a few queries
• Shut it down
• Have fun!
Why an Analytic Database?
Why use one?
• It is a database optimized for read-only queries.
• It’s fast
• It can handle a lot of data
Why not to use one?
• Poor transaction processing (aka OLTP): rollback, multi-phase commits, etc.
Under the hood.
Analytic databases typically have features like:
• Column (as opposed to row) storage
• Parallel queries across clusters of machines
• Support for partitioning
• Other cool stuff to make your queries fast
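To make the first two bullets concrete, here is a toy Python sketch. It is not how Redshift is implemented; the rows and the two-slice split are made up for illustration. The same table is held column-wise so that an aggregate reads only the one column it needs, and the aggregate is then computed per slice and merged, which is the basic idea behind parallel queries.

```python
# Toy illustration only: real engines add compression, sorting, and more.
rows = [
    ("a.com", 10, 1.5),
    ("b.com", 55, 0.2),
    ("c.com", 31, 4.0),
    ("d.com", 7, 2.3),
]

# Row storage: every query walks whole rows, even to read a single field.
row_sum = sum(row[1] for row in rows)

# Column storage: each column is its own array, so SUM(pageRank) touches one.
columns = {
    "pageURL": [r[0] for r in rows],
    "pageRank": [r[1] for r in rows],
    "adRevenue": [r[2] for r in rows],
}
col_sum = sum(columns["pageRank"])

# Parallel query: each (hypothetical) slice sums its share, a leader merges.
slices = [columns["pageRank"][0::2], columns["pageRank"][1::2]]
partial_sums = [sum(s) for s in slices]
parallel_sum = sum(partial_sums)

print(row_sum, col_sum, parallel_sum)
```

All three sums agree; the difference is how much data each approach has to read and how much of the work can happen at the same time.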
Load Data from S3
copy uservisits FROM 's3://big-data-benchmark/pavlo/text/tiny/uservisits/' CREDENTIALS
'aws_access_key_id=<your access key>;aws_secret_access_key=<your secret key>' delimiter ',';
copy rankings FROM 's3://big-data-benchmark/pavlo/text/tiny/rankings/' CREDENTIALS
'aws_access_key_id=<your access key>;aws_secret_access_key=<your secret key>' delimiter ',';
Load Bigger Data
-- To load a bigger data set, replace "tiny" in the S3 path of each copy command.
-- options: "tiny", "1node", "5nodes", "10nodes"
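Switching data set sizes means editing the S3 path in each copy command. A small helper can generate the statement instead. This is a sketch under one assumption: that the larger data sets sit under the same S3 prefix with only the size segment changed, as the options list suggests. The name `build_copy_sql` is hypothetical, and the helper only builds text; it does not connect to Redshift.

```python
# Hypothetical helper: builds the COPY text only, no database connection.
S3_PREFIX = "s3://big-data-benchmark/pavlo/text"
SIZES = ("tiny", "1node", "5nodes", "10nodes")


def build_copy_sql(table: str, size: str = "tiny") -> str:
    """Return a COPY statement for `table` at the given data set size."""
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}, got {size!r}")
    return (
        f"copy {table} FROM '{S3_PREFIX}/{size}/{table}/' CREDENTIALS\n"
        "'aws_access_key_id=<your key>;aws_secret_access_key=<your key>' "
        "delimiter ',';"
    )


print(build_copy_sql("uservisits", "1node"))
```

Paste the printed statement into the query tool after filling in your own keys. Bigger sizes take longer to load, so start with "tiny".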
select * from uservisits limit 100;
SELECT COUNT(*) from uservisits;
select * from rankings limit 100;
SELECT COUNT(*) from rankings;
SELECT pageURL, pageRank FROM rankings WHERE pageRank > 10;
SELECT sourceIP, SPLIT_PART(sourceIP, '.', 1) as fn, SPLIT_PART(sourceIP, '.', 2) as sn FROM
uservisits LIMIT 100;
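SPLIT_PART splits a string on a delimiter and returns the part at a 1-based position. The Python sketch below mimics that behavior as it is commonly documented, returning an empty string when the requested part does not exist. The function is a local stand-in, and the IP address is made up.

```python
def split_part(text: str, delimiter: str, part: int) -> str:
    """Mimic SPLIT_PART: 1-based index, '' when the part does not exist."""
    if part < 1:
        raise ValueError("part must be 1 or greater")
    pieces = text.split(delimiter)
    return pieces[part - 1] if part <= len(pieces) else ""


source_ip = "158.112.27.3"  # made-up address for illustration
fn = split_part(source_ip, ".", 1)  # first octet, like the fn column
sn = split_part(source_ip, ".", 2)  # second octet, like the sn column
print(fn, sn)
```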
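-- Total ad revenue and average page rank per source IP.
-- The subquery selects the columns the outer query references.
SELECT sourceIP,
SUM(adRevenue) AS totalRevenue,
AVG(pageRank) AS pageRank
FROM rankings R
JOIN (SELECT sourceIP, destinationURL, adRevenue
FROM uservisits uv) NUV ON (R.pageURL = NUV.destinationURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC LIMIT 100;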
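The revenue-per-source-IP query can be sanity-checked locally before running it on a cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in for Redshift, with a cut-down schema and a few made-up rows. The subquery columns (sourceIP, destinationURL, adRevenue) are the ones the outer query refers to. SQLite is not Redshift, so treat this as a check of the query logic only.

```python
import sqlite3

QUERY = """
SELECT sourceIP,
       SUM(adRevenue) AS totalRevenue,
       AVG(pageRank) AS pageRank
FROM rankings R
JOIN (SELECT sourceIP, destinationURL, adRevenue
      FROM uservisits uv) NUV ON (R.pageURL = NUV.destinationURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC LIMIT 100;
"""

conn = sqlite3.connect(":memory:")
# Cut-down tables: only the columns this query needs (an assumption).
conn.executescript("""
CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER);
CREATE TABLE uservisits (sourceIP TEXT, destinationURL TEXT, adRevenue REAL);
""")
# Made-up sample rows for illustration.
conn.executemany(
    "INSERT INTO rankings VALUES (?, ?)",
    [("a.com", 10), ("b.com", 30)],
)
conn.executemany(
    "INSERT INTO uservisits VALUES (?, ?, ?)",
    [
        ("1.1.1.1", "a.com", 0.5),
        ("1.1.1.1", "b.com", 1.5),
        ("2.2.2.2", "a.com", 0.25),
    ],
)

result = conn.execute(QUERY).fetchall()
for source_ip, total_revenue, avg_rank in result:
    print(source_ip, total_revenue, avg_rank)
conn.close()
```

Each visit is matched to the ranking of the page it landed on, then revenue is summed and page rank averaged per source IP, highest revenue first.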