AWS Chicago User Group	

Big Data Day
Have an idea for a meetup? Talk to me:

Margaret Walker

CohesiveFT	

Tweet: @MargieWalker

#AWSChicago	

Agenda
6:00 pm Introductions	

6:05 pm Short Talks	

"AWS Storage Options" Ben Blair, CTO at MarkITx
@stochastic_code 	

"APIs and Big Data in AWS" - Kin Lane,API Evangelist
@kinlane 	

"Democratizing Data Analysis with Amazon Redshift" - Bill
Wanjohi @billwanjohi and Michelangelo D'Agostino
@MichelangeloDA, Civis Analytics 	

6:45 pm Q & A 	

7:00 pm Networking, drinks and pizza
Next Meetups: 

October 15?	

+Nov 12

Let’s drink at re:Invent
Keep it Secret,
Keep it Safe
(and Fast and Available would be nice too)
Hi
Ben Blair
CTO @ MarkITx
We live on AWS
TL;DW
• Use IAM roles for access control
• Use DynamoDB for online storage &
transactions
• Use Redshift for offline storage & analysis
• Use S3 to keep *everything*
It’s hard to keep a
secret
Use IAM EC2 roles instead
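A minimal sketch, assuming the code runs on an EC2 instance launched with an IAM role that grants S3 read access (bucket name is hypothetical):

```python
# boto resolves temporary credentials from the instance metadata service,
# so no secret keys appear in code, config files, or the environment.
import boto

conn = boto.connect_s3()  # note: no access key / secret key arguments
bucket = conn.get_bucket('example-bucket')  # hypothetical bucket
for key in bucket.list(prefix='logs/'):
    print(key.name)
```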
3rd normal form,
anyone?
Data duplication is OK
Optimize for each context
Interactive Data goes
in DynamoDB
If your users read or write it, and it’s not huge, it should
probably go into DynamoDB
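A minimal sketch with boto's dynamodb2 API (table name, key, and attributes are hypothetical):

```python
# Reads and writes against an existing table whose hash key is 'order_id'.
# On EC2, the same IAM-role credentials apply here too.
from boto.dynamodb2.table import Table

orders = Table('orders')  # hypothetical table
orders.put_item(data={'order_id': '42', 'status': 'open', 'total': 99})
item = orders.get_item(order_id='42')
print(item['status'])
```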
Why DynamoDB
• Works with tests. Tests are good.
• Predictable Performance & Cost
• Low Maintenance
Why Not DynamoDB
• Vendor lock-in vs Cassandra
• Can’t add / change indexes (but that’s ok)
• Need to watch utilization
SimpleDB
No, just no
ElastiCache
Good place to end, bad place to start
RDS
Hosted SQL Goodness
Redshift
Seriously wonderful
Redshift vs RDS
• Start with RDS
• Redshift is actually very cheap
• RDS for simple reporting on small data sets
• Redshift for all other analysis
S3
Store Everything.
You won’t, and you’ll regret it later.
EBS
Distributed Availability > Instance Recovery
Names Matter
Distributed systems care about your keyspace even
when you don’t
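For S3, for instance, a sequential prefix (timestamp, auto-increment ID) concentrates load on one internal partition; a short hash prefix spreads it out. A sketch with hypothetical names:

```python
# Prefixing keys with a few hex characters of a hash distributes them
# across S3's internal partitions instead of hot-spotting one prefix.
import hashlib

def s3_key(record_id):
    prefix = hashlib.md5(str(record_id).encode()).hexdigest()[:4]
    return '%s/records/%d.json' % (prefix, record_id)

print(s3_key(10001))  # '<4 hex chars>/records/10001.json'
```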
Thanks
ben@markitx.com
@stochastic_code
github.com/markitx
"APIs and Big Data in AWS" 	

Kin Lane

API Evangelist 	

@kinlane 	

Slides available on GitHub

Democratizing Data
Analysis with Amazon
Redshift
Michelangelo D’Agostino - Civis Analytics Senior Data Scientist
Bill Wanjohi - Civis Analytics Senior Engineer
What you’ll learn
● advantages of Redshift
● some pitfalls
● workflows and recommendations on best practices
Why should you listen?
● 18 months of heavy Redshift use
● Two complementary perspectives:
The Scientist and The Engineer
Michelangelo @MichelangeloDA
Bill @billwanjohi
Life before Redshift
● collaborated on a monolithic Vertica analytics database
● dozens of TB of data
● scaled from 4 to 20 server blades
● dozens of concurrent users across departments (hundreds total)
● arbitrary SQL allowed/encouraged
Our early requirements
● SQL language
● low starting cost
● easy to integrate with OSS, other DBs
● performant on large data sets
● minimal database administration
Choosing Redshift
● timing: first full release in Feb 2013
● drastically cheaper to start than other
commercial offerings
● very similar to our previous choice, HP
Vertica
● many fewer administration tasks
Basics
● RDBMS
● MPP/Columnar
Supports window functions
Few enforceable constraints
No concept of an index
● Redshift <= ParAccel <= PostgreSQL 8
Postgres drivers work (see sketch below)
ORM requires mocking
● Most data I/O via S3 service
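A minimal sketch of the driver point: psycopg2 connects to Redshift as if it were Postgres (endpoint, database, and credentials are hypothetical):

```python
import psycopg2

conn = psycopg2.connect(
    host='example-cluster.abc123.us-east-1.redshift.amazonaws.com',
    port=5439,                  # Redshift's default port
    dbname='analytics',
    user='analyst',
    password='hypothetical')
cur = conn.cursor()
cur.execute('SELECT COUNT(*) FROM events')  # hypothetical table
print(cur.fetchone()[0])
```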
Things analytics DBs are good at
● Big aggregates
● Parallel I/O
● Merge joins between tables
Things they’re not good at
● Updates
● Retrieval of individual records
● Enforcing data quality
How’s it worked out?
Pretty good!
● adequate performance
○ big step up from traditional RDBMS
○ comparable to other analytics DBs
● easy to stand up new clusters
● cheaper clusters now available
● most workflows can live entirely in-database
● S3 is a good broker for what can’t
Data Science Workflow
Our custom plumbing syncs tables from dozens
of source databases into Redshift at varying
refresh frequencies.
We’ve found that SQL just invites so many
more people to the analytics game.
Analysts and data scientists run exploratory
SQL and build up complex tables for statistical
modeling, utilizing crazy joins, aggregates, and
rollup features.
Redshift supports powerful window functions
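A hypothetical example of those window functions: a running total and a per-donor rank, computed without collapsing rows (table and columns are invented):

```python
# Executed through any Postgres driver connected to Redshift. Note that
# Redshift requires an explicit frame clause on ordered window aggregates.
QUERY = """
SELECT donor_id,
       amount,
       SUM(amount) OVER (PARTITION BY donor_id
                         ORDER BY donated_at
                         ROWS UNBOUNDED PRECEDING) AS running_total,
       RANK() OVER (PARTITION BY donor_id
                    ORDER BY amount DESC) AS gift_rank
FROM donations
"""
```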
Predictive Modeling
Data is pulled directly from Redshift into
python/R to train statistical models
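A minimal sketch of that pull on the Python side, assuming psycopg2, pandas, and scikit-learn (query, endpoint, and model choice are hypothetical):

```python
import pandas as pd
import psycopg2
from sklearn.linear_model import LogisticRegression

conn = psycopg2.connect('host=example-cluster.abc123.us-east-1.redshift.amazonaws.com '
                        'port=5439 dbname=analytics user=analyst password=hypothetical')
df = pd.read_sql('SELECT * FROM model_features', conn)  # hypothetical table
model = LogisticRegression().fit(df.drop('label', axis=1), df['label'])
```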
Predictive Modeling
For simple linear models, scoring is done
directly in Redshift via SQL.
For more complicated models, data is pulled
from Redshift to S3 with an UNLOAD SQL
command, processed in EMR, and loaded back
into Redshift with a COPY command.
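Roughly, the round trip looks like this (bucket, tables, and IAM role ARN are hypothetical; the connection is as in the earlier sketch):

```python
import psycopg2

conn = psycopg2.connect('host=example-cluster.abc123.us-east-1.redshift.amazonaws.com '
                        'port=5439 dbname=analytics user=analyst password=hypothetical')
cur = conn.cursor()

# 1. Export features to S3 for the EMR job to consume.
cur.execute("""
    UNLOAD ('SELECT * FROM model_features')
    TO 's3://example-bucket/features/'
    CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftS3'
""")

# 2. (EMR job scores the exported files and writes results back to S3.)

# 3. Load the scores back into Redshift.
cur.execute("""
    COPY model_scores
    FROM 's3://example-bucket/scores/'
    CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftS3'
""")
conn.commit()
```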
Hurdles we’ve faced along
the way
● inconsistent runtimes
● catalog contention
● bugs (databases are hard)
● resizing
● too easy to end up with uncompressed data (see sketch after this list)
● “missing” PostgreSQL functionality
● complex workload management
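One mitigation for the uncompressed-data pitfall (our suggestion, not from the talk): let COPY pick column encodings on the first load into an empty table, and audit existing tables with ANALYZE COMPRESSION. Names are hypothetical:

```python
import psycopg2

conn = psycopg2.connect('host=example-cluster.abc123.us-east-1.redshift.amazonaws.com '
                        'port=5439 dbname=analytics user=analyst password=hypothetical')
cur = conn.cursor()

# COMPUPDATE ON lets Redshift choose compression encodings automatically
# when loading into an empty table.
cur.execute("""
    COPY events FROM 's3://example-bucket/events/'
    CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftS3'
    COMPUPDATE ON
""")

# Reports suggested encodings for an existing table.
cur.execute('ANALYZE COMPRESSION events')
for row in cur.fetchall():
    print(row)
```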
Setup Recommendations
● at least two nodes
● send 35-day snapshots to other regions
● at-rest encryption
● enforce SSL
● provision with boto or AWS CLI (sketch below)
● cluster isolation to hide objects
● buy 3-year reservations
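A minimal provisioning sketch with boto's Redshift API (identifiers, node type, and password are hypothetical): two nodes with at-rest encryption, per the list above.

```python
import boto.redshift

rs = boto.redshift.connect_to_region('us-east-1')
rs.create_cluster(
    'example-cluster',       # cluster identifier
    'dw1.xlarge',            # node type (region/era dependent)
    'admin',                 # master username
    'Hypothetical123',       # master password
    cluster_type='multi-node',
    number_of_nodes=2,
    encrypted=True)
```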
We’re Hiring!
Through research, experimentation, and iteration, we’re
transforming how organizations do analytics. Our clients
range in scale and focus from local to international, all
empowered by our individual-level, data-driven approach.
civisanalytics.com/apply
