Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine Learning | AWS Public Sector Summit 2016

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Greg Khairallah, Business Development Manager, AWS
Malini Saxena, Senior Consultant, AWS
Raj Chary, VP of Technology / Architecture, WagglePractice
Lige Hensley, Chief Technology Officer, Ivy Tech
June 20, 2016
Easy Analytics with AWS

What to expect from this session
• AWS toolkit for analytics
• Understand stakeholders
• Demo
• Case Study – WagglePractice
• Case Study – Ivy Tech
• Q&A

Big data portfolio, but what do I recommend?
Stages: Collect, Store, Analyze
Services shown:
• Amazon Glacier
• Amazon S3
• Amazon DynamoDB
• Amazon RDS, Amazon Aurora
• AWS Data Pipeline
• Amazon CloudSearch
• Amazon EMR
• Amazon EC2
• Amazon Redshift
• Amazon Machine Learning
• Amazon Elasticsearch Service
• AWS Database Migration Service
• Amazon Kinesis Analytics
• Amazon Kinesis Firehose
• AWS Import/Export
• AWS Direct Connect
• Amazon Kinesis Streams
• Amazon QuickSight

Match the toolset to the right persona
• Business intelligence (BI) analyst
– Primary tool is SQL
– Historical data resides in a data warehouse such as Amazon Redshift
• Data scientist: uses programming languages such as R or Python
• Application developer: requires an API to integrate with AWS services

BI analyst with existing BI tools
• Primary tool is SQL
• Data is largely structured with well-known data sources
• Primary concern is fast, consistent performance
• Need to extend SQL with custom functions
Diagram labels: BI analyst, BI tools, Amazon EC2, Amazon Redshift, QuickSight API, Amazon QuickSight

Amazon Redshift system architecture
Leader node
• SQL endpoint
• Stores metadata
• Coordinates query execution
Compute nodes
• Local, columnar storage
• Execute queries in parallel
• Load, backup, restore via Amazon S3; load from Amazon DynamoDB, Amazon EMR, or SSH
Two hardware platforms
• Optimized for data processing
• DS2: HDD; scale from 2 TB to 2 PB
• DC1: SSD; scale from 160 GB to 326 TB
Diagram labels: JDBC/ODBC (client connections), 10 GigE (HPC) interconnect
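
The load path through Amazon S3 can be illustrated with a small Python sketch that only assembles the text of a Redshift COPY statement; it does not connect to a cluster. The table name, bucket path, and IAM role ARN are hypothetical placeholders, and the helper `build_copy_statement` is not part of any AWS library. The resulting statement would be submitted through the SQL endpoint on the leader node.

```python
def build_copy_statement(table, s3_path, iam_role_arn, gzip=False):
    """Return the text of a Redshift COPY command that loads CSV files from S3.

    All argument values used below are hypothetical placeholders.
    """
    parts = [
        "COPY {}".format(table),
        "FROM '{}'".format(s3_path),
        "IAM_ROLE '{}'".format(iam_role_arn),
        "FORMAT AS CSV",
    ]
    if gzip:
        # Tell COPY that the input files are gzip-compressed
        parts.append("GZIP")
    return "\n".join(parts) + ";"


copy_sql = build_copy_statement(
    table="sales_staging",                      # hypothetical table
    s3_path="s3://example-bucket/sales/2016/",  # hypothetical bucket and prefix
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",  # hypothetical role
    gzip=True,
)
print(copy_sql)
```

Because COPY reads all files under the S3 prefix, the compute nodes can load them in parallel, which matches the "execute queries in parallel" point above.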

New SQL functions
We add SQL functions regularly to expand Amazon Redshift’s query capabilities
Added 25+ window and aggregate functions since launch, including:
LISTAGG
[APPROXIMATE] COUNT
DROP IF EXISTS, CREATE IF NOT EXISTS
REGEXP_SUBSTR, _COUNT, _INSTR, _REPLACE
PERCENTILE_CONT, _DISC, MEDIAN
PERCENT_RANK, RATIO_TO_REPORT
We’ll continue iterating but also want to enable you to write your own
Window function examples: http://docs.aws.amazon.com/redshift/latest/dg/r_Window_function_examples.html
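
To make two of the listed functions concrete, the following Python sketch reproduces what PERCENT_RANK and RATIO_TO_REPORT compute over a single partition. It is plain Python on invented sample values, not Redshift code, and the helper names are hypothetical.

```python
def percent_rank(values):
    """PERCENT_RANK for each value: (rank - 1) / (rows - 1), ascending order.

    Tied values share the lowest rank, as with SQL RANK().
    """
    n = len(values)
    if n == 1:
        return [0.0]
    ordered = sorted(values)
    # ordered.index(v) counts the rows strictly smaller than v, which is rank - 1
    return [ordered.index(v) / (n - 1) for v in values]


def ratio_to_report(values):
    """RATIO_TO_REPORT for each value: value / sum of all values in the partition."""
    total = sum(values)
    return [v / total for v in values]


sales = [10, 20, 20, 50]  # invented sample partition
print(percent_rank(sales))
print(ratio_to_report(sales))
```

In SQL, the same results would come from `PERCENT_RANK() OVER (ORDER BY amount)` and `RATIO_TO_REPORT(amount) OVER ()`, where `amount` is a hypothetical column.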

Scalar user defined functions
You can write UDFs using Python 2.7
• Syntax is largely identical to PostgreSQL UDF
• Python execution is performed in parallel
• System and network calls within UDFs are prohibited
Comes integrated with the Pandas, NumPy, SciPy, DateUtil, and Pytz analytic libraries
• Import your own libraries for even more flexibility
• Take advantage of thousands of functions available through Python libraries to perform operations not easily expressed in SQL
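
A minimal sketch of a scalar UDF, assuming the `CREATE FUNCTION ... LANGUAGE plpythonu` form described in the Scalar UDF documentation listed under Resources. The function name `f_greater` and its logic are illustrative only. The body is kept as an ordinary Python function so it can be tried locally before being wrapped in the DDL text.

```python
def greater(a, b):
    """Logic for the UDF body: return the larger of two numbers."""
    if a > b:
        return a
    return b


# DDL text that would register the same logic in Amazon Redshift.
# f_greater is a hypothetical name; the code between $$ ... $$ runs as Python 2.7.
CREATE_UDF_SQL = """
CREATE FUNCTION f_greater (a float, b float)
  RETURNS float
STABLE
AS $$
  if a > b:
    return a
  return b
$$ LANGUAGE plpythonu;
"""

print(greater(3.0, 7.5))
print(CREATE_UDF_SQL.strip())
```

Once created, the function could be called like a built-in, for example `SELECT f_greater(list_price, sale_price) FROM sales;` where the table and columns are hypothetical.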

What is Amazon QuickSight?
A very fast, cloud-powered business intelligence service for 1/10 the cost of traditional BI software

Amazon QuickSight diagram labels:
• Users: business users
• Clients: mobile devices, web browsers, partner BI products
• Interfaces: QuickSight UI, QuickSight API
• Components: SPICE, Connectors, Data Prep, Metadata, Suggestions
• Data sources: Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon EMR, Amazon Redshift, Amazon RDS, files, third-party

Data scientist with existing toolsets
• Work with unstructured datasets
• Use existing toolsets to connect to Amazon Redshift
Diagram labels: data scientist; toolkits like SAS or R Studio installed on Amazon EC2; unstructured data in Amazon S3; structured data in Amazon Redshift

Querying Amazon Redshift with R packages
• RJDBC: supports SQL queries
• dplyr: uses R code for data analysis
• RPostgreSQL: R-compliant driver or database interface (DBI)
Diagram labels: R user; R Studio; Amazon EC2; unstructured data in Amazon S3; user profile in Amazon RDS; Amazon Redshift
Connecting R with Amazon Redshift blog post: https://blogs.aws.amazon.com/bigdata/post/Tx1G8828SPGX3PK/Connecting-R-with-Amazon-Redshift

Querying Amazon Redshift with R packages example

Application developers can build smart applications using Amazon Machine Learning
• All skill levels
• Amazon Machine Learning technology is accessed through APIs and SDKs
• Embed visualizations in applications
Diagram labels: Application; Amazon Machine Learning (generate/query predictions); Amazon Redshift (structured data/predictions); Amazon QuickSight (visualize)
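
As a sketch of the "accessed through APIs and SDKs" point, the function below requests one real-time prediction through a client object that is assumed to behave like `boto3.client("machinelearning")`, whose `predict` call takes `MLModelId`, `Record`, and `PredictEndpoint`. The model ID and record attributes are hypothetical, and a stand-in client with a canned response is used so the example runs without AWS credentials.

```python
def predict_label(ml_client, model_id, endpoint_url, record):
    """Request one real-time prediction and return the predicted label.

    ml_client is expected to behave like boto3.client("machinelearning");
    record maps attribute names to string values.
    """
    response = ml_client.predict(
        MLModelId=model_id,
        Record=record,
        PredictEndpoint=endpoint_url,
    )
    return response["Prediction"].get("predictedLabel")


class FakeMLClient(object):
    """Stand-in client so the sketch runs without AWS credentials."""

    def predict(self, MLModelId, Record, PredictEndpoint):
        # Canned response in the shape of an Amazon ML Predict result
        return {"Prediction": {"predictedLabel": "1",
                               "predictedScores": {"1": 0.87}}}


label = predict_label(
    FakeMLClient(),
    model_id="ml-EXAMPLEMODELID",  # hypothetical model ID
    endpoint_url="https://realtime.machinelearning.us-east-1.amazonaws.com",
    record={"gpa": "2.1", "credits_attempted": "12"},  # hypothetical attributes
)
print(label)
```

Passing the client in as an argument keeps the application code independent of how credentials and regions are configured, and makes the call easy to exercise in tests.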

Raj Chary, WagglePractice
Vice President of Technology/Architecture

What is Waggle?
Smart, responsive practice
Math and ELA (Grades 2-8)
Provides students the right challenge at the right time

What is Waggle?
Right Challenge, Right Time
Waggle looks for more than correct answers. Waggle continually analyzes each student’s decisions and progress. That way, students get tougher material right when they’re ready.

What is Waggle?
Productive Struggle
Waggle motivates students to push themselves forward. How? Through helpful hints, supportive feedback, and achievement badges that build grit and confidence.

What is Waggle?
Constructive Grouping
Waggle’s insights mean you can easily group students together based on learning needs, all without sacrificing the quality of individual instruction.

Waggle: Product Demo
• Data Creators
– Differentiated learning experience
– Fun and engaging
• Data Visualizers
– Seamless integration with application
– Analytics with a Story
– Actionable Data

Redshift: Data Warehouse Layout
• Data sources: APIs, OLTP
• Write cluster (Compute – dw2.large): initial and incremental {processed} data loads via S3 COPY
• Read cluster (Compute – dw2.large): for serving Jaspersoft reports
• History cluster (Density – dw1.xlarge): periodic data snapshots for historical analysis
Diagram labels: Staging; Data mart (aggregations); Nodes; S3 UNLOAD and Load between clusters; + UPSERTS
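
The slide does not show how the upserts are implemented. A common way to perform them in Amazon Redshift is to load new rows into a staging table, delete the matching rows from the target, and then insert everything from staging inside one transaction. The sketch below only builds the SQL text for that pattern; the table and column names are hypothetical, and this is not presented as Waggle's actual code.

```python
def build_upsert_statements(target, staging, key_columns):
    """Return SQL statements that merge rows from a staging table into a target.

    Rows in target that match staging on key_columns are deleted, then all
    staging rows are inserted, inside one transaction.
    """
    join_condition = " AND ".join(
        "{t}.{c} = {s}.{c}".format(t=target, s=staging, c=col)
        for col in key_columns
    )
    return [
        "BEGIN;",
        "DELETE FROM {t} USING {s} WHERE {cond};".format(
            t=target, s=staging, cond=join_condition),
        "INSERT INTO {t} SELECT * FROM {s};".format(t=target, s=staging),
        "END;",
    ]


statements = build_upsert_statements(
    target="student_activity",           # hypothetical table names
    staging="student_activity_staging",
    key_columns=["student_id", "activity_id"],
)
for statement in statements:
    print(statement)
```

The staging table would typically be filled with a COPY from S3 just before these statements run, which fits the S3 COPY and UNLOAD/Load arrows in the layout.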

Results and Lessons Learned
• Performance Metrics
– Millions of records are processed in <1 minute
  • LOAD/UNLOAD commands | UPSERTS | S3 COPY command
– Report queries average < 1 to ~1.5 seconds
– {compression} – gained 20+% efficiencies in data retrieval
• Best Practices
– {sort keys} – lens-based data model: visualize data in a variety of ways
– {commit stats} – Redshift is not a transactional system
– {nested loops} – no Cartesian products; ensure joins are well managed
– {queries that queue} – tune the WLM configuration
– {query runtimes} – faster queries mean less queuing
– {stats missing} – analyze and vacuum when possible
– {alerts with tables} – monitor to ensure queries are running optimally

Ivy Tech & Amazon Redshift
May 25, 2016

What We’re Doing
• Transforming the culture of the College to be more data driven
• Moving from reporting silos to an Integrated Analytics system, which we call a Data Democracy
• Collecting and analyzing a vast variety of data at a scale that no one in Higher Ed is doing
• Using machine learning tools to identify students who may need further assistance
• Starting this fall, we are implementing a one-on-one coaching initiative for the students we identified with the machine learning tools

But it’s not just education…
96% of organizations in the United States use data in the same way.
…and it’s wrong.

The “Standard” Approach
VIP

Data Regimes
• Data Dictatorship: Data is controlled and its use is restricted. There is asymmetric distribution of information based on your position.
• Data Aristocracy: Data analysts, scientists and PhDs are needed to do anything meaningful. Power concentrates in the hands of these employees and their supervisors.
• Data Anarchy: Business users feel underserved and take matters into their own hands. They create “shadow IT” systems and work around the “unresponsive” IT group.
• Data Democracy: Everybody gets timely and equitable access to data. Line of business users are empowered and “own” the data. Executives and IT get out of the way.
1 Shash Hegde, Mariner, “The Rise of Data Regimes”, 9/12/13, http://www.mariner-usa.com/rise-data-regimes/ (image substitution for Mao Zedong)

Data Maturity Model
Every organization moves through increasingly complex stages of data accessibility.
… very few complete the transition to Integrated Analytics

Stage 1: Report Silos
Diagram labels: Request Tracker, Banner, Blackboard, Luminis, Starfish, SCCM, CAS, Authentication
This is what we have had for decades at Ivy Tech…

Stage 2: Data Warehousing
Diagram labels: Request Tracker, Authentication
This is what most companies do… but we are taking this a step further…

Stage 3: Integrated Analytics
Diagram labels: Request Tracker, Authentication
Curated collections shown: Students by Financial Aid, Students by Award, Students by Term, Students by Class, Classes by Class Section, Students
These curated collections of data are designed to enable direct access to the data you need, regardless of where it came from. Quickly. Easily.

GPA Graduation - Cumulative
Graduation Grade Point Average (Cumulative) is an indication of a student's academic progress for all semester credit classes for all registered terms up to and including the selected term. Letter grades are assigned points (A=4, B=3, C=2, D=1, F=0) and the GPA is calculated by taking the number of grade points a student earned in a selected period of time divided by the total number of classes taken during that same period.
GPA Graduation - Cumulative = Sum of a student's total grade points earned in credit classes for all classes for all registered terms up to and including the selected term / Sum of student's total classes taken during that same period
NOTES ON USING THIS TERM: GPA Graduation - Cumulative does not include grades from remedial classes.
Related Terms: [GPA Graduation - Term]
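
The definition above can be worked through in a few lines of Python. This is a sketch that follows the glossary wording literally (grade points divided by the number of classes, remedial classes excluded); the function name and the sample record are invented for illustration.

```python
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def gpa_graduation_cumulative(classes):
    """Compute the cumulative graduation GPA as the glossary entry defines it.

    classes: list of (letter_grade, is_remedial) tuples covering all registered
    terms up to and including the selected term. Remedial classes are excluded.
    Returns None when no credit classes remain.
    """
    credit_grades = [grade for grade, is_remedial in classes if not is_remedial]
    if not credit_grades:
        return None
    total_points = sum(GRADE_POINTS[grade] for grade in credit_grades)
    return total_points / len(credit_grades)


# Invented sample record: four credit classes and one remedial class
record = [("A", False), ("B", False), ("C", False), ("B", False), ("F", True)]
print(gpa_graduation_cumulative(record))  # (4 + 3 + 2 + 3) / 4 = 3.0
```

The remedial F does not lower the result, which is the behavior the usage note calls out.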

Resources
Amazon Redshift Getting Started Guide: http://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html
Scalar UDF Documentation: http://docs.aws.amazon.com/redshift/latest/dg/user-defined-functions.html
Introduction to Python UDFs in Amazon Redshift: https://blogs.aws.amazon.com/bigdata/post/Tx1IHV1G67CY53T/Introduction-to-Python-UDFs-in-Amazon-Redshift
Connecting R with Amazon Redshift: https://blogs.aws.amazon.com/bigdata/post/Tx1G8828SPGX3PK/Connecting-R-with-Amazon-Redshift
Databricks Apache Spark–Amazon Redshift Tutorial: https://github.com/databricks/spark-redshift/tree/master/tutorial
Amazon ML Getting Started Guide: https://aws.amazon.com/machine-learning/getting-started/
Amazon QuickSight (Preview Registration): https://aws.amazon.com/quicksight/
