McGraw-Hill Optimizes Analytics Workloads with Databricks

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
May 24, 2018 | 1PM-2PM PDT
McGraw-Hill Education Optimizes Analytics Workloads with Databricks

Today’s presenters
Pratap Ramamurthy, Partner Solutions Architect, Amazon Web Services
Brian Dirking, Senior Director of Partner Marketing, Databricks
Matthew Ashbourne, Lead Software Engineer, McGraw-Hill Education

Today’s agenda
1. Overview of AWS and AWS data lake services
2. Databricks: solutions that extend data lake management capabilities
3. McGraw-Hill Education recognizes the need to transform
4. McGraw-Hill Education revolutionizes digital learning with AWS and Databricks
5. Q&A/Discussion

Learning objectives:
1. How data lakes, using a unified analytics platform, can enable advanced analytics use cases such as machine learning
2. How to optimize data lakes to work effectively with real-time and fast-moving data
3. How to streamline the read/write process for data lakes

The Data Lake and AWS
Drive business value with disparate types of data

Legacy Data Warehouses & RDBMS
• Complex to set up and manage
• Do not scale
• Take months to add new data sources
• Queries take too long
• Cost $MM upfront

Should I Build a Data Lake?
Starting by amassing "all your data" and dumping
into a large repository for the data gurus to start
finding "insights" is like trying to win the lottery by
buying all the tickets

Rethink How to Become a Data-driven Business
• Business outcomes - start with the insights and actions you
want to drive, then work backwards to a streamlined design
• Experimentation - start small, test many ideas, keep the
good ones and scale those up, paying only for what you
consume
• Agile and timely - deploy data processing infrastructure in
minutes, not months. Take advantage of a rich platform of
services to respond quickly to changing business needs

Business Case Determines Platform Design
[Diagram: Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers & Insights. Start here, with a business case.]

Experiment and Scale Based on Your Business Needs
[Diagram: match available data (Metrics and Monitoring, Workflow, Logs, ERP, Transactions) to the pipeline: Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers & Insights.]

Business Outcomes on a Modern Data Architecture
Outcome 1: Modernize and consolidate
• Insights to enhance business applications and create new digital services
Outcome 2: Innovate for new revenues
• Personalization, demand forecasting, risk analysis
Outcome 3: Real-time engagement
• Interactive customer experience, event-driven automation, fraud detection
Outcome 4: Automate for expansive reach
• Automation of business processes and physical infrastructure

Data Lake on AWS
Services shown: Snowball, Snowmobile, Kinesis Data Firehose, Kinesis Data Streams, Kinesis Video Streams, S3, Redshift, EMR, Athena, Kinesis, Elasticsearch Service, AI Services
• Relational and non-relational data
• Schema defined during analysis
• Unmatched durability and availability at EB scale
• Best security, compliance, and audit capabilities
• Run any analytics on the same data without movement
• Scale storage and compute independently
• Store data at $0.023/GB-month; query for $0.05/GB scanned

Why Amazon S3 for Modern Data Architecture?
Durable
• Designed for 11 9s of durability
Available
• Designed for 99.99% availability
High performance
• Multiple upload
• Range GET
Scalable
• Store as much as you need
• Scale storage and compute independently
• No minimum usage commitments
Integrated
• Amazon EMR
• Amazon Redshift Spectrum
• Amazon DynamoDB
• Amazon Athena
• AWS Glue
• Amazon Kinesis
• Amazon SageMaker
Easy to use
• Simple REST API
• AWS SDKs
• Read-after-create consistency
• Event notification
• Lifecycle policies
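
To make the durability and availability design targets concrete, the arithmetic can be sketched as below. This is only a back-of-the-envelope illustration: reading "11 9s" as an annual per-object loss probability of 1e-11 is a simplifying assumption, and the object count in the usage note is arbitrary.

```python
# Back-of-the-envelope arithmetic for the design targets on this slide.
# Assumption: "11 9s" of durability means an annual per-object
# loss probability of 1e-11 (that is, 1 - 0.99999999999).
ANNUAL_LOSS_PROBABILITY = 1e-11
AVAILABILITY = 0.9999  # 99.99%
MINUTES_PER_YEAR = 365 * 24 * 60


def expected_objects_lost_per_year(object_count: int) -> float:
    """Expected number of objects lost per year for a given object count."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def max_unavailable_minutes_per_year(availability: float = AVAILABILITY) -> float:
    """Minutes per year that fall outside the availability target."""
    return MINUTES_PER_YEAR * (1.0 - availability)
```

Under these assumptions, 10,000,000 stored objects correspond to an expected loss of about 0.0001 objects per year, and a 99.99% availability target leaves roughly 52.6 minutes per year outside the target.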

Decouple Storage and Compute
• Legacy design was large databases or
data warehouses with integrated
hardware
• Big Data architectures often benefit
from decoupling storage and compute

Analyzing streaming and historical data at scale with Databricks
Brian Dirking, Senior Director of Partner Marketing, Databricks

Unify big data and AI with Databricks on AWS
Powered by Apache Spark, the Unified Analytics Platform from Databricks runs on
AWS for cloud infrastructure.
5 – 8x

Streaming Data Presents Challenges
PROBLEM: DESCRIPTION
• Writing Data for Access: Optimize data writes for retrieval
• Unreliable Data: Data is being written while you access it for analysis. How do you ensure that you get a good data set?
• Storing Data: Store data in a way that enables low cost and fast access
• Blending with Historical Data: How do you access historical data to blend with streaming data for analytics models?

Databricks Delta
Delta is a unified data management system that brings data reliability and
performance optimizations to cloud data lakes
• Help ensure data integrity with transactional guarantees.
• Enable the most consistent view of your streaming data.
• Modify data after it has been written with upserts.
• Leverage Amazon S3 for massive scale.
• Separate compute from storage for cost efficiency.
• Enable data portability with an open file format.
• Accelerate query speeds through indexing and caching.
• Self-optimize data layouts and simplify partition management.
• Up to 100x faster than Apache Spark on Parquet.
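
The upsert capability listed above can be illustrated with a small in-memory sketch. This is not Delta's implementation or API; it only shows the semantics of "update the row if the key exists, insert it otherwise". The function name `upsert` and the sample rows are hypothetical.

```python
def upsert(table, updates, key):
    """Merge `updates` into `table` by `key`.

    Rows whose key already exists are updated field by field;
    rows with a new key are appended. Input lists are not modified.
    """
    merged = {row[key]: dict(row) for row in table}
    for row in updates:
        existing = merged.get(row[key], {})
        merged[row[key]] = {**existing, **row}
    # dicts preserve insertion order, so existing rows keep their position
    return list(merged.values())


# Hypothetical sample data
scores = [
    {"learner_xid": "l1", "raw_score": 70.0},
    {"learner_xid": "l2", "raw_score": 55.0},
]
changes = [
    {"learner_xid": "l2", "raw_score": 60.0},  # update
    {"learner_xid": "l3", "raw_score": 90.0},  # insert
]
result = upsert(scores, changes, key="learner_xid")
```

A transactional system additionally has to make such a merge atomic and visible to readers all at once, which this sketch does not attempt.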

McGraw-Hill Education optimizes
its digital learning platform
Driving innovation with Databricks on AWS
Matthew Ashbourne
Lead Software Engineer
McGraw-Hill Education

Our history
McGraw-Hill Education is a 129-year-old
company that was reborn with the mission of accelerating
learning through intuitive and engaging experiences –
grounded in research
McGraw-Hill Education is working to revolutionize education with machine learning and AI,
complementing our suite of traditional reporting products.
We capture student interaction data within our online learning platforms to:
• Deliver a personalized learning experience
• Drive higher retention and pass rates
• Provide overview and detailed views of student work within online learning
environments to instructors

A Case Study in
Learning Science

Connect retention

Challenges

Key challenges
Data Access, Processing, Scale

[Diagram: previous architecture. Components: Systems, Kinesis, AWS Lambda, 3rd-party data integration platform, Spark cluster in datacenter, Amazon ES, Data Mart. Grouped by Data Access, Processing, Scale.]

Problems we encountered
• Unable to productionize data science output
• Schema-bound to transactional DBs
• Data engineering splintered across tech stacks
• Scaling issues
• Small file challenges
• High overhead adding new data sources

Data lake Solution

Data lake ecosystem

McGraw-Hill Education’s data lake requirements
• Low engineering effort to get off the ground
• Support concurrent writes
• Resilient and auto-healing so a small team can easily manage it
• Ability to compact small files to improve read performance
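
The compaction requirement can be pictured as grouping many small files into batches that are each rewritten as one larger file. The sketch below is a simple greedy planner, not the algorithm any particular product uses; the 128 MB target and the function name `plan_compaction` are assumptions chosen for illustration.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Group file sizes into batches whose total stays at or under target_mb.

    Files are taken smallest first. A single file larger than the target
    ends up alone in its own batch.
    """
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        # Close the current batch if adding this file would exceed the target
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


plan = plan_compaction([100, 10, 64, 20, 30], target_mb=128)
```

Here five input files become two output batches, so a reader opens two files instead of five.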

Issues with open source data lake approach
• Small/too-large output files
• Dirty output directories
• Schema management
• No transactional support or safety
• No safe way to have multiple writers to the same table

Databricks Delta to the rescue
ACID Transactions - Multiple writers can simultaneously modify a dataset and
see consistent views.
DELETES/UPDATES/UPSERTS - Writers can modify a dataset without
interfering with jobs reading the dataset.
Data Validation - Ensures that data meets specified invariants (for
example, NOT NULL) by rejecting invalid data.
Automatic File Management - Speeds up data access by organizing data into
large files that can be read efficiently.
Statistics and Data Skipping - Speeds up reads by 10-100x by tracking
statistics about the data in each file and avoiding reading irrelevant information.
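
The idea behind statistics and data skipping can be sketched in a few lines: keep a minimum and maximum per file for a column, and read only the files whose range can overlap the query's filter. This is a conceptual illustration, not how Delta stores or uses its statistics; the class `FileStats`, the file names, and the values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-file min/max statistics for one column (hypothetical layout)."""
    path: str
    min_score: float
    max_score: float


def files_to_scan(files, lower, upper):
    """Return only the files whose [min, max] range overlaps [lower, upper]."""
    return [f for f in files if f.max_score >= lower and f.min_score <= upper]


files = [
    FileStats("part-000.parquet", 0.0, 40.0),
    FileStats("part-001.parquet", 35.0, 70.0),
    FileStats("part-002.parquet", 71.0, 100.0),
]

# Query: raw_score BETWEEN 80 AND 90
selected = files_to_scan(files, 80.0, 90.0)
```

With these sample statistics only one of the three files needs to be read. The benefit depends on how well the data layout clusters values, which is why skipping and layout optimization tend to go together.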

[Diagram: new architecture. Components: Systems, Kinesis, Kafka, Databricks Spark cluster on EC2 Spot, Databricks Delta, data lake on S3, Data Mart. Grouped by Data Access, Processing, Scale.]

Problems solved with Databricks
PROBLEM: SOLUTION
• Unable to productionize data science output: Data scientists now work in the same environment and toolchain
• Schema-bound to transactional DBs: Eliminated direct schema binding through event contract
• Data engineering splintered across tech stacks: Unified data processing on Spark API
• Scaling issues: Huge increase in scaling with simple lift & shift to Spark
• Small file challenges: Small files are automatically compacted
• High overhead adding new data sources: Push-based model makes it easier for new data sources to expose themselves to analytics

Data lake implementation

You still need information architecture
Avoid a data swamp with
• Consistent names and identifiers for services, entities, event topics/streams
• Schemas
• Documentation on how data joins across domains

Flexible data lake schema
COLUMN                   TYPE
raw                      JSON string
event_class              string
header                   struct
body                     JSON string
json_schema              string
header_version           string
partition_event_source   string
partition_event_name     string
partition_event_date     date
meta                     struct<type:string,value:string,pipeline_datetime:timestamp>
Write data with a minimal schema
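
As a rough illustration of writing with a minimal schema, the sketch below wraps an arbitrary payload in an envelope: only the partition columns are typed, and the body stays an opaque JSON string to be interpreted at read time. The helper names, the sample values ("enterprise", "roster"), and the payload are hypothetical; the `column=value` directory layout is the Hive-style convention that Spark's `partitionBy` produces.

```python
import json
from datetime import date

PARTITION_COLUMNS = (
    "partition_event_source",
    "partition_event_name",
    "partition_event_date",
)


def wrap_event(source: str, name: str, day: date, body: dict) -> dict:
    """Build an envelope record; the payload is kept as a JSON string."""
    return {
        "body": json.dumps(body, sort_keys=True),
        "partition_event_source": source,
        "partition_event_name": name,
        "partition_event_date": day.isoformat(),
    }


def partition_path(record: dict) -> str:
    """Hive-style partition directory for a record."""
    return "/".join(f"{col}={record[col]}" for col in PARTITION_COLUMNS)


# Hypothetical event
record = wrap_event("enterprise", "roster", date(2018, 5, 24), {"learner_xid": "abc"})
path = partition_path(record)
```

Because the body is not parsed at write time, a source system can add fields to its payload without forcing a table schema change.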

Read/Write batch
READ:
spark.read.format("delta").load("/delta/events")
WRITE:
events.write.format("delta").partitionBy("date").save("/delta/events")

Read/Write streaming
READ:
spark.readStream.format("delta").load("/delta/events")
WRITE:
events.writeStream.format("delta")
.partitionBy("date")
.outputMode("append")
.option("checkpointLocation", "/delta/events/_checkpoints/my-stream")
.start("/delta/events")

Schema on read
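import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._  // enables the $"column" syntax

// Schema applied to the JSON body at read time
val eventBodySchema = new StructType()
  .add("learner_xid", StringType)
  .add("assignment_xid", StringType)
  .add("raw_score", DoubleType)

// datalake is assumed to be a DataFrame read from the Delta table
val parsed = datalake.withColumn("parsed_body",
  from_json($"body", eventBodySchema))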
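
The same schema-on-read idea can be shown without Spark. The sketch below applies a schema to a JSON body only when the record is read: fields outside the schema are dropped, missing or mistyped fields become None, and a malformed body yields None. It mirrors the field names above but is a plain-Python approximation; the exact null-handling rules of Spark's `from_json` may differ, and `parse_body` is a hypothetical helper.

```python
import json

EVENT_BODY_SCHEMA = {
    "learner_xid": str,
    "assignment_xid": str,
    "raw_score": float,
}


def parse_body(body, schema=EVENT_BODY_SCHEMA):
    """Project a JSON string onto `schema`; return None if it is not a JSON object."""
    try:
        data = json.loads(body)
    except (TypeError, ValueError):
        return None
    if not isinstance(data, dict):
        return None
    parsed = {}
    for field, expected in schema.items():
        value = data.get(field)
        # Accept JSON integers where a float is expected (but not booleans)
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        parsed[field] = value if isinstance(value, expected) else None
    return parsed


good = parse_body('{"learner_xid": "l1", "assignment_xid": "a1", "raw_score": 87, "extra": true}')
partial = parse_body('{"learner_xid": "l1"}')
bad = parse_body("not json")
```

Since the stored body is untouched, a different consumer can read the same events with a different schema.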

Views
SELECT
  assessment,
  learner,
  sensed_datetime
FROM delta.`/delta/events`
LATERAL VIEW json_tuple(assessment, 'xid', 'due_date') v1 AS xid, due_datetime
LATERAL VIEW json_tuple(learner, 'xid') v2 AS learner_xid

Bringing it all together

Case study: new enterprise.roster pipeline
[Diagram: old pipeline: Rostering System → Batch Extract → Staging → Batch Transform & Load → Data Mart. New pipeline: Rostering System → Kafka → Structured Streaming → Delta Data lake → Batch ETL → Data Mart.]

Case study: new enterprise.roster pipeline
[Diagram: new pipeline: Rostering System → Kafka → Structured Streaming → Delta Data lake → Batch ETL → Data Mart.]

The real deal

Possible pitfalls
• You need a strategy up front
• Organizational change
• Source systems need to embrace event instrumentation
• Training
• Data validation

Some surprises
Data pipeline development is not faster, BUT
• More options and possibilities
• ETLs faster and more reliable with better scaling
• Easily combine streaming and batch data
• Unified Analytics Platform matters

Next steps and further information
• Data Lake solution on AWS:
https://aws.amazon.com/big-data/data-lake-on-aws/
• Take a Free 30-Day Trial of Databricks:
https://databricks.com/try-databricks
• Try AWS for free (full offer details available at the link below):
https://aws.amazon.com/free

Q & A

Thank you!
