© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
May 24, 2018 | 1PM-2PM PDT
McGraw-Hill Education Optimizes Analytics Workloads with Databricks
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Today’s presenters
Pratap Ramamurthy, Partner Solutions Architect, Amazon Web Services
Brian Dirking, Senior Director of Partner Marketing, Databricks
Matthew Ashbourne, Lead Software Engineer, McGraw-Hill Education
Today’s agenda
1. Overview of AWS and AWS data lake services
2. Databricks: solutions that extend data lake management capabilities
3. McGraw-Hill Education recognizes the need to transform
4. McGraw-Hill Education revolutionizes digital learning with AWS and
Databricks
5. Q&A/Discussion
Learning objectives:
1. How data lakes, using a unified analytic platform, can enable
advanced analytic use cases such as machine learning
2. How to optimize data lakes to work effectively with real-time and fast-moving data
3. How to streamline the read/write process for data lakes
The Data Lake and AWS
Drive business value with disparate types of data
Legacy Data Warehouses & RDBMS
• Complex to set up and manage
• Do not scale
• Take months to add new data sources
• Queries take too long
• Cost $MM upfront
Should I Build a Data Lake?
Starting by amassing "all your data" and dumping
into a large repository for the data gurus to start
finding "insights" is like trying to win the lottery by
buying all the tickets
Rethink How to Become a Data-driven Business
• Business outcomes - start with the insights and actions you
want to drive, then work backwards to a streamlined design
• Experimentation - start small, test many ideas, keep the
good ones and scale those up, paying only for what you
consume
• Agile and timely - deploy data processing infrastructure in
minutes, not months. Take advantage of a rich platform of
services to respond quickly to changing business needs
Business Case Determines Platform Design
Diagram: Data flows through Ingest/Collect, Store, Process/analyze, and Consume/visualize to produce Answers & Insights
Callout: START HERE WITH A BUSINESS CASE
Experiment and Scale Based on Your Business Needs
Callout: MATCH AVAILABLE DATA
Data sources shown: Metrics and Monitoring, Workflow, Logs, ERP, Transactions
Diagram: Data flows through Ingest/Collect, Store, Process/analyze, and Consume/visualize to produce Answers & Insights
Business Outcomes on a Modern Data Architecture
Outcome 1: Modernize and consolidate
• Insights to enhance business applications and create new digital services
Outcome 2: Innovate for new revenues
• Personalization, demand forecasting, risk analysis
Outcome 3: Real-time engagement
• Interactive customer experience, event-driven automation, fraud detection
Outcome 4: Automate for expansive reach
• Automation of business processes and physical infrastructure
Data Lake on AWS
Services shown: Snowball, Snowmobile, Kinesis Data Firehose, Kinesis Data Streams, Kinesis Video Streams, S3, Redshift, EMR, Athena, Kinesis, Elasticsearch Service, AI Services
• Relational and non-relational data
• Schema defined during analysis
• Unmatched durability and availability at EB scale
• Best security, compliance, and audit capabilities
• Run any analytics on the same data without movement
• Scale storage and compute independently
• Store data at $0.023/GB per month; query for $0.05/GB scanned
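As a rough illustration of the pricing line on this slide, the arithmetic can be sketched as below. The sketch assumes the storage figure means $0.023 per GB per month and takes the $0.05 per GB scanned figure as quoted. The function name and the example volumes are hypothetical, and real pricing varies by region, tier, and service.

```python
from decimal import Decimal

# Rates as quoted on the slide. Assumption: the storage rate is per GB per month.
STORAGE_USD_PER_GB_MONTH = Decimal("0.023")
QUERY_USD_PER_GB_SCANNED = Decimal("0.05")


def estimate_monthly_cost(stored_gb: int, scanned_gb: int) -> Decimal:
    """Return a rough monthly bill in USD: storage cost plus query-scan cost."""
    storage = STORAGE_USD_PER_GB_MONTH * stored_gb
    queries = QUERY_USD_PER_GB_SCANNED * scanned_gb
    return storage + queries


if __name__ == "__main__":
    # Hypothetical workload: 1000 GB stored, 200 GB scanned by queries in a month.
    print(estimate_monthly_cost(1000, 200))
```

Because storage and scanning are billed separately, the two terms can be tuned independently, which is the point of the "scale storage and compute independently" bullet.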
Why Amazon S3 for Modern Data Architecture?
Durable
• Designed for 11 9s of durability
Available
• Designed for 99.99% availability
High performance
• Multiple upload
• Range GET
Scalable
• Store as much as you need
• Scale storage and compute independently
• No minimum usage commitments
Integrated
• Amazon EMR
• Amazon Redshift Spectrum
• Amazon DynamoDB
• Amazon Athena
• AWS Glue
• Amazon Kinesis
• Amazon SageMaker
Easy to use
• Simple REST API
• AWS SDKs
• Read-after-create consistency
• Event notification
• Lifecycle policies
Decouple Storage and Compute
• Legacy design was large databases or
data warehouses with integrated
hardware
• Big Data architectures often benefit
from decoupling storage and compute
Analyzing streaming and historical data at scale with Databricks
Brian Dirking, Senior Director of Partner Marketing, Databricks
Unify big data and AI with Databricks on AWS
Powered by Apache Spark, the Unified Analytics Platform from Databricks runs on
AWS for cloud infrastructure.
5 – 8x
Streaming Data Presents Challenges
PROBLEM: DESCRIPTION
Writing Data for Access: Optimize data writes for retrieval
Unreliable Data: Data is being written while you access it for analysis. How do you ensure that you get a good data set?
Storing Data: Storing data in a way that enables low cost and fast access
Blending with Historical Data: How do you access historical data to blend with streaming data for analytics models?
Databricks Delta
Delta is a unified data management system that brings data reliability and
performance optimizations to cloud data lakes
• Help ensure data integrity with transactional guarantees.
• Enable the most consistent view of your streaming data.
• Modify data after it has been written with upserts.
• Leverage Amazon S3 for massive scale.
• Separate compute from storage for cost efficiency.
• Enable data portability with an open file format.
• Accelerate query speeds through indexing and caching.
• Self-optimize data layouts and simplify partition management.
• Up to 100x faster than Apache Spark on Parquet.
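The upsert bullet above can be read as "update the row if its key already exists, insert it otherwise". Below is a minimal sketch of that semantics on in-memory rows, in plain Python rather than Delta's own machinery. The `upsert` helper and the sample values are hypothetical; the column names are borrowed from the schema-on-read slide later in the deck.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def upsert(table: Iterable[Row], updates: Iterable[Row], key: str) -> List[Row]:
    """Return a new table: rows whose key already exists are updated, others are inserted."""
    # Copy each row so the caller's table is never mutated.
    merged: Dict[Any, Row] = {row[key]: dict(row) for row in table}
    for row in updates:
        # Overlay the new values on the existing row, or start from an empty row.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return list(merged.values())


if __name__ == "__main__":
    scores = [{"learner_xid": "a", "raw_score": 1.0}]
    changes = [{"learner_xid": "a", "raw_score": 2.0}, {"learner_xid": "b", "raw_score": 3.0}]
    print(upsert(scores, changes, key="learner_xid"))
```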
McGraw-Hill Education optimizes
its digital learning platform
Driving innovation with Databricks on AWS
Matthew Ashbourne
Lead Software Engineer
McGraw-Hill Education
Our history
McGraw-Hill Education is a 129-year-old
company that was reborn with the mission of accelerating
learning through intuitive and engaging experiences
grounded in research.
McGraw-Hill is working to revolutionize education with machine learning and AI,
complementing our suite of traditional reporting products.
We capture student interaction data within our online learning platforms to:
• Deliver a personalized learning experience
• Drive higher retention and pass rates
• Provide overview and detailed views of student work within online learning
environments to instructors
A Case Study in
Learning Science
Connect retention
Challenges
Key challenges
• Data Access
• Processing
• Scale
Architecture diagram. Components: Systems, Kinesis, AWS Lambda, 3rd party data integration platform, Spark cluster in datacenter, Amazon ES, Data Mart
Column labels: Data Access, Processing, Scale
Problems we encountered
• Unable to productionize data science output
• Schema-bound to transactional DBs
• Data engineering splintered across tech stacks
• Scaling issues
• Small file challenges
• High overhead adding new data sources
Data lake Solution
Data lake ecosystem
McGraw-Hill Education’s data lake requirements
• Low engineering effort to get off the ground
• Support concurrent writes
• Resilient and auto-healing so a small team can easily manage it
• Ability to compact small files to improve read performance
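The compaction requirement can be illustrated with a minimal sketch of the planning step: group small files into batches whose combined size stays within a target, so each batch can be rewritten as one larger file. This is a generic greedy grouping in plain Python, not how Databricks Delta implements compaction. The function name, the file names, and the 128 MB target are assumptions made for illustration.

```python
from typing import Dict, List


def plan_compaction(file_sizes: Dict[str, int], target_bytes: int) -> List[List[str]]:
    """Greedily group small files so each group's total size stays within target_bytes.

    Files already at or above the target are skipped. Groups that end up with a
    single file are dropped, since rewriting one file on its own gains nothing.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    # Sort by size, then name, so the plan is deterministic.
    for name, size in sorted(file_sizes.items(), key=lambda kv: (kv[1], kv[0])):
        if size >= target_bytes:
            continue
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return [group for group in groups if len(group) > 1]


if __name__ == "__main__":
    MB = 1024 * 1024
    sizes = {"part-0": 10 * MB, "part-1": 20 * MB, "part-2": 100 * MB, "part-3": 200 * MB}
    print(plan_compaction(sizes, target_bytes=128 * MB))
```

Fewer, larger files mean fewer object-store requests per query, which is why compaction shows up as a read-performance requirement.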
Issues with open source data lake approach
• Small/too-large output files
• Dirty output directories
• Schema management
• No transactional support or safety
• No safe way to have multiple writers to same table
Databricks Delta to the rescue
ACID Transactions - Multiple writers can simultaneously modify a dataset and
see consistent views.
DELETES/UPDATES/UPSERTS - Writers can modify a dataset without
interfering with jobs reading the dataset.
Data Validation - Ensures that data meets specified invariants (for
example, NOT NULL) by rejecting invalid data.
Automatic File Management - Speeds up data access by organizing data into
large files that can be read efficiently.
Statistics and Data Skipping - Speeds up reads by 10-100x by tracking
statistics about the data in each file and avoiding reading irrelevant information.
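The idea behind statistics and data skipping can be sketched in a few lines: keep the minimum and maximum of a column for each file, then read only the files whose range can contain the value a query asks for. This is a simplified illustration in plain Python, not Databricks Delta's implementation. The `FileStats` type, the file names, and the date values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    path: str
    min_value: str  # smallest value of the column in this file
    max_value: str  # largest value of the column in this file


def files_to_read(stats: List[FileStats], wanted: str) -> List[str]:
    """Return only the files whose [min, max] range can contain `wanted`."""
    return [s.path for s in stats if s.min_value <= wanted <= s.max_value]


if __name__ == "__main__":
    # ISO dates compare correctly as strings, so no date parsing is needed here.
    stats = [
        FileStats("events-001", "2018-05-01", "2018-05-10"),
        FileStats("events-002", "2018-05-11", "2018-05-20"),
    ]
    print(files_to_read(stats, "2018-05-15"))
```

The benefit grows with how well the data is clustered: if every file spans the full value range, no file can be skipped.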
Architecture diagram. Components: Systems, Kinesis, Kafka, Databricks Spark cluster on EC2 Spot, Databricks Delta, Data lake on S3, Data Mart
Column labels: Data Access, Processing, Scale
Problems solved with Databricks
Unable to productionize data science output
• Data scientists now work in the same environment and toolchain
Schema-bound to transactional DBs
• Eliminated direct schema binding through event contract
Data engineering splintered across tech stacks
• Unified data processing on Spark API
Scaling issues
• Huge increase in scaling with simple lift & shift to Spark
Small file challenges
• Small files are automatically compacted
High overhead adding new data sources
• Push-based model makes it easier for new data sources to expose themselves to analytics
Data lake implementation
You still need information architecture
Avoid a data swamp with:
• Consistent names and identifiers for services, entities, and event topics/streams
• Schemas
• Documentation on how data joins across domains
Flexible data lake schema
Column                   Type
raw                      JSON string
event_class              string
header                   struct
body                     JSON string
json_schema              string
header_version           string
partition_event_source   string
partition_event_name     string
partition_event_date     date
meta                     struct<type:string,value:string,pipeline_datetime:timestamp>
Write data with a minimal schema
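One way to picture "write data with a minimal schema" is an envelope: a few stable columns around an event whose body stays an unparsed JSON string. The sketch below, in plain Python, fills a subset of the columns listed on the slide. It assumes the incoming event is a JSON object with an optional top-level "body" key, which the slide does not state. The `to_lake_row` helper and the sample source and event names are hypothetical.

```python
import json
from datetime import date
from typing import Any, Dict


def to_lake_row(raw: str, event_source: str, event_name: str, event_date: date) -> Dict[str, Any]:
    """Wrap one raw JSON event in a small, stable set of columns.

    Assumption: the event is a JSON object with an optional top-level "body".
    The body is kept as a JSON string, so new body fields never change the table schema.
    """
    event = json.loads(raw)
    return {
        "raw": raw,
        "body": json.dumps(event.get("body", {}), sort_keys=True),
        "partition_event_source": event_source,
        "partition_event_name": event_name,
        "partition_event_date": event_date.isoformat(),
    }


if __name__ == "__main__":
    sample = '{"body": {"learner_xid": "abc", "raw_score": 0.9}}'
    print(to_lake_row(sample, "example_source", "example_event", date(2018, 5, 24)))
```

Keeping the body opaque at write time moves schema decisions to the reader.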
Read/Write batch
READ:
spark.read.format("delta").load("/delta/events")
WRITE:
events.write.format("delta").partitionBy("date").save("/delta/events")
Read/Write streaming
READ:
spark.readStream.format("delta").load("/delta/events")
WRITE:
events.writeStream.format("delta")
  .partitionBy("date")
  .outputMode("append")
  .option("checkpointLocation", "/delta/events/_checkpoints/my-stream")
  .start("/delta/events")
Schema on read
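import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._ // enables the $"column" syntax

// Schema applied to the JSON body at read time
val eventBodySchema = new StructType()
  .add("learner_xid", StringType)
  .add("assignment_xid", StringType)
  .add("raw_score", DoubleType)

// dataLake: a DataFrame read from the Delta table (variable name assumed)
val parsed = dataLake.withColumn("parsed_body",
  from_json($"body", eventBodySchema))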
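To make the schema-on-read step concrete without a Spark cluster, the following plain-Python sketch mimics what the snippet above does: the body stays a JSON string in storage, and a schema is applied only when reading. It is an approximation, not Spark's `from_json`. The `parse_body` helper and the sample values are hypothetical, while the three field names come from the slide.

```python
import json
from typing import Any, Callable, Dict, Optional

# Field names from the slide; str and float stand in for StringType and DoubleType.
EVENT_BODY_SCHEMA: Dict[str, Callable[[Any], Any]] = {
    "learner_xid": str,
    "assignment_xid": str,
    "raw_score": float,
}


def parse_body(body: str, schema: Dict[str, Callable[[Any], Any]]) -> Optional[Dict[str, Any]]:
    """Apply `schema` to a JSON string at read time.

    Returns None if the body is not a JSON object. Fields missing from the body,
    or whose value cannot be cast, become None. Fields not in the schema are ignored.
    """
    try:
        data = json.loads(body)
    except (TypeError, ValueError):
        return None
    if not isinstance(data, dict):
        return None
    parsed: Dict[str, Any] = {}
    for name, cast in schema.items():
        value = data.get(name)
        try:
            parsed[name] = None if value is None else cast(value)
        except (TypeError, ValueError):
            parsed[name] = None
    return parsed


if __name__ == "__main__":
    print(parse_body('{"learner_xid": "L1", "raw_score": 7}', EVENT_BODY_SCHEMA))
```

Because the schema lives with the reader, a new consumer can ask for different fields from the same stored body without rewriting the data.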
Views
SELECT
assessment,
learner,
sensed_datetime
FROM delta.`/delta/events`
LATERAL VIEW json_tuple(assessment, 'xid', 'due_date') v1 AS xid, due_datetime
LATERAL VIEW json_tuple(learner, 'xid') v2 AS learner_xid
Bringing it all together
Case study: new enterprise.roster pipeline
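Diagram showing two pipelines side by side
Components of one pipeline: Rostering System, Batch Extract, Staging, Batch Transform & Load, Data Mart
Components of the other pipeline: Rostering System, Kafka, Structured Streaming, Delta Data lake, Batch ETL, Data Mart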
Case study: new enterprise.roster pipeline
Diagram components: Rostering System, Kafka, Structured Streaming, Delta Data lake, Batch ETL, Data Mart
The real deal
Possible pitfalls
• You need a strategy up front
• Organizational change
• Source systems need to embrace event instrumentation
• Training
• Data validation
Some surprises
Data pipeline development is not faster, BUT
• More options and possibilities
• ETLs faster and more reliable with better scaling
• Easily combine streaming and batch data
• Unified Analytics Platform matters
Next steps and further information
• Data Lake solution on AWS:
https://aws.amazon.com/big-data/data-lake-on-aws/
• Take a Free 30-Day Trial of Databricks:
https://databricks.com/try-databricks
• Try AWS for free (full offer details available at the link below):
https://aws.amazon.com/free
Q & A
Thank you!

More Related Content

What's hot

How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...Amr Awadallah
 
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...Microsoft Tech Community
 
Spark Streaming with Azure Databricks
Spark Streaming with Azure DatabricksSpark Streaming with Azure Databricks
Spark Streaming with Azure DatabricksDustin Vannoy
 
IBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveIBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveTorsten Steinbach
 
Machine Learning and AI
Machine Learning and AIMachine Learning and AI
Machine Learning and AIJames Serra
 
Designing a modern data warehouse in azure
Designing a modern data warehouse in azure   Designing a modern data warehouse in azure
Designing a modern data warehouse in azure Antonios Chatzipavlis
 
Azure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data LakeAzure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data LakeRick van den Bosch
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overviewJames Serra
 
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0   virtual - subhadeep duttaCWIN17 India / Insights platform architecture v1 0   virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep duttaCapgemini
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseJames Serra
 
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Amazon Web Services
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudJames Serra
 
Azure SQL Database Managed Instance
Azure SQL Database Managed InstanceAzure SQL Database Managed Instance
Azure SQL Database Managed InstanceJames Serra
 
Benefits of the Azure cloud
Benefits of the Azure cloudBenefits of the Azure cloud
Benefits of the Azure cloudJames Serra
 
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...James Serra
 
Dipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAsDipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAsBob Pusateri
 
RDX Insights Presentation - Microsoft Business Intelligence
RDX Insights Presentation - Microsoft Business IntelligenceRDX Insights Presentation - Microsoft Business Intelligence
RDX Insights Presentation - Microsoft Business IntelligenceChristopher Foot
 
Overview on Azure Machine Learning
Overview on Azure Machine LearningOverview on Azure Machine Learning
Overview on Azure Machine LearningJames Serra
 
Module 1 - CP Datalake on AWS
Module 1 - CP Datalake on AWSModule 1 - CP Datalake on AWS
Module 1 - CP Datalake on AWSLam Le
 

What's hot (20)

How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
 
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...
 
Spark Streaming with Azure Databricks
Spark Streaming with Azure DatabricksSpark Streaming with Azure Databricks
Spark Streaming with Azure Databricks
 
IBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveIBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep Dive
 
Machine Learning and AI
Machine Learning and AIMachine Learning and AI
Machine Learning and AI
 
Designing a modern data warehouse in azure
Designing a modern data warehouse in azure   Designing a modern data warehouse in azure
Designing a modern data warehouse in azure
 
Azure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data LakeAzure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data Lake
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0   virtual - subhadeep duttaCWIN17 India / Insights platform architecture v1 0   virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data Warehouse
 
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
 
Azure SQL Database Managed Instance
Azure SQL Database Managed InstanceAzure SQL Database Managed Instance
Azure SQL Database Managed Instance
 
Benefits of the Azure cloud
Benefits of the Azure cloudBenefits of the Azure cloud
Benefits of the Azure cloud
 
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
 
Synapse for mere mortals
Synapse for mere mortalsSynapse for mere mortals
Synapse for mere mortals
 
Dipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAsDipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAs
 
RDX Insights Presentation - Microsoft Business Intelligence
RDX Insights Presentation - Microsoft Business IntelligenceRDX Insights Presentation - Microsoft Business Intelligence
RDX Insights Presentation - Microsoft Business Intelligence
 
Overview on Azure Machine Learning
Overview on Azure Machine LearningOverview on Azure Machine Learning
Overview on Azure Machine Learning
 
Module 1 - CP Datalake on AWS
Module 1 - CP Datalake on AWSModule 1 - CP Datalake on AWS
Module 1 - CP Datalake on AWS
 

Similar to McGraw-Hill Optimizes Analytics Workloads with Databricks

Architecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the EnterpriseArchitecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the EnterpriseAmazon Web Services
 
TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
 TiVo: How to Scale New Products with a Data Lake on AWS and Qubole TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
TiVo: How to Scale New Products with a Data Lake on AWS and QuboleAmazon Web Services
 
TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
 TiVo: How to Scale New Products with a Data Lake on AWS and Qubole TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
TiVo: How to Scale New Products with a Data Lake on AWS and QuboleAmazon Web Services
 
Fanatics Ingests Streaming Data to a Data Lake on AWS
Fanatics Ingests Streaming Data to a Data Lake on AWSFanatics Ingests Streaming Data to a Data Lake on AWS
Fanatics Ingests Streaming Data to a Data Lake on AWSAmazon Web Services
 
Automating Big Data Technologies for Faster Time-to-Value
 Automating Big Data Technologies for Faster Time-to-Value Automating Big Data Technologies for Faster Time-to-Value
Automating Big Data Technologies for Faster Time-to-ValueAmazon Web Services
 
Architecting an Open Data Lake for the Enterprise
 Architecting an Open Data Lake for the Enterprise  Architecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the Enterprise Amazon Web Services
 
Leveraging Data Analytics in the Cloud to Support Data-Driven Decisions
Leveraging Data Analytics in the Cloud to Support Data-Driven DecisionsLeveraging Data Analytics in the Cloud to Support Data-Driven Decisions
Leveraging Data Analytics in the Cloud to Support Data-Driven DecisionsAmazon Web Services
 
Citrix Moves Data to Amazon Redshift Fast with Matillion ETL
 Citrix Moves Data to Amazon Redshift Fast with Matillion ETL Citrix Moves Data to Amazon Redshift Fast with Matillion ETL
Citrix Moves Data to Amazon Redshift Fast with Matillion ETLAmazon Web Services
 
在 AWS 上構建無服務器分析
在 AWS 上構建無服務器分析在 AWS 上構建無服務器分析
在 AWS 上構建無服務器分析Amazon Web Services
 
STG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data OceansSTG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data OceansAmazon Web Services
 
GPSWKS301_Comprehensive Big Data Architecture Made Easy
GPSWKS301_Comprehensive Big Data Architecture Made EasyGPSWKS301_Comprehensive Big Data Architecture Made Easy
GPSWKS301_Comprehensive Big Data Architecture Made EasyAmazon Web Services
 
Comprehensive Big Data Analytics Architecture Made Easy - The AWS Marketplace...
Comprehensive Big Data Analytics Architecture Made Easy - The AWS Marketplace...Comprehensive Big Data Analytics Architecture Made Easy - The AWS Marketplace...
Comprehensive Big Data Analytics Architecture Made Easy - The AWS Marketplace...Amazon Web Services
 
Using Amazon SageMaker to build, train, and deploy your ML Models
Using Amazon SageMaker to build, train, and deploy your ML ModelsUsing Amazon SageMaker to build, train, and deploy your ML Models
Using Amazon SageMaker to build, train, and deploy your ML ModelsAmazon Web Services
 
Self-Service Analytics with AWS Big Data and Tableau - ARC217 - re:Invent 2017
Self-Service Analytics with AWS Big Data and Tableau - ARC217 - re:Invent 2017Self-Service Analytics with AWS Big Data and Tableau - ARC217 - re:Invent 2017
Self-Service Analytics with AWS Big Data and Tableau - ARC217 - re:Invent 2017Amazon Web Services
 
NEW LAUNCH! Introducing Amazon SageMaker - MCL365 - re:Invent 2017
NEW LAUNCH! Introducing Amazon SageMaker - MCL365 - re:Invent 2017NEW LAUNCH! Introducing Amazon SageMaker - MCL365 - re:Invent 2017
NEW LAUNCH! Introducing Amazon SageMaker - MCL365 - re:Invent 2017Amazon Web Services
 
Using Amazon SageMaker to Build, Train, and Deploy Your ML Models
Using Amazon SageMaker to Build, Train, and Deploy Your ML ModelsUsing Amazon SageMaker to Build, Train, and Deploy Your ML Models
Using Amazon SageMaker to Build, Train, and Deploy Your ML ModelsAmazon Web Services
 
Transform Your Risk Systems for Greater Agility with Accenture & AWS PPT
 Transform Your Risk Systems for Greater Agility with Accenture & AWS PPT Transform Your Risk Systems for Greater Agility with Accenture & AWS PPT
Transform Your Risk Systems for Greater Agility with Accenture & AWS PPTAmazon Web Services
 
Driving Machine Learning and Analytics Use Cases with AWS Storage (STG302) - ...
Driving Machine Learning and Analytics Use Cases with AWS Storage (STG302) - ...Driving Machine Learning and Analytics Use Cases with AWS Storage (STG302) - ...
Driving Machine Learning and Analytics Use Cases with AWS Storage (STG302) - ...Amazon Web Services
 
AWS Data-Driven Insights Learning Series ANZ Sep 2019 Part 1
AWS Data-Driven Insights Learning Series ANZ Sep 2019 Part 1AWS Data-Driven Insights Learning Series ANZ Sep 2019 Part 1
AWS Data-Driven Insights Learning Series ANZ Sep 2019 Part 1Amazon Web Services
 

Similar to McGraw-Hill Optimizes Analytics Workloads with Databricks (20)

Architecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the EnterpriseArchitecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the Enterprise
 
TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
 TiVo: How to Scale New Products with a Data Lake on AWS and Qubole TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
 
TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
 TiVo: How to Scale New Products with a Data Lake on AWS and Qubole TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
 
Fanatics Ingests Streaming Data to a Data Lake on AWS
Fanatics Ingests Streaming Data to a Data Lake on AWSFanatics Ingests Streaming Data to a Data Lake on AWS
Fanatics Ingests Streaming Data to a Data Lake on AWS
 
Automating Big Data Technologies for Faster Time-to-Value
 Automating Big Data Technologies for Faster Time-to-Value Automating Big Data Technologies for Faster Time-to-Value
Automating Big Data Technologies for Faster Time-to-Value
 
Architecting an Open Data Lake for the Enterprise
 Architecting an Open Data Lake for the Enterprise  Architecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the Enterprise
 
Leveraging Data Analytics in the Cloud to Support Data-Driven Decisions
Leveraging Data Analytics in the Cloud to Support Data-Driven DecisionsLeveraging Data Analytics in the Cloud to Support Data-Driven Decisions
Leveraging Data Analytics in the Cloud to Support Data-Driven Decisions
 
Citrix Moves Data to Amazon Redshift Fast with Matillion ETL
 Citrix Moves Data to Amazon Redshift Fast with Matillion ETL Citrix Moves Data to Amazon Redshift Fast with Matillion ETL
Citrix Moves Data to Amazon Redshift Fast with Matillion ETL
 
在 AWS 上構建無服務器分析
在 AWS 上構建無服務器分析在 AWS 上構建無服務器分析
在 AWS 上構建無服務器分析
 
STG206_Big Data Data Lakes and Data Oceans
GPSWKS301_Comprehensive Big Data Architecture Made Easy
Comprehensive Big Data Analytics Architecture Made Easy - The AWS Marketplace...
Building Data Lakes with AWS
Using Amazon SageMaker to build, train, and deploy your ML Models
Self-Service Analytics with AWS Big Data and Tableau - ARC217 - re:Invent 2017
NEW LAUNCH! Introducing Amazon SageMaker - MCL365 - re:Invent 2017
Using Amazon SageMaker to Build, Train, and Deploy Your ML Models
Transform Your Risk Systems for Greater Agility with Accenture & AWS PPT
Driving Machine Learning and Analytics Use Cases with AWS Storage (STG302) - ...
AWS Data-Driven Insights Learning Series ANZ Sep 2019 Part 1


More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Esegui pod serverless con Amazon EKS e AWS Fargate
Costruire Applicazioni Moderne con AWS
Come spendere fino al 90% in meno con i container e le istanze spot
Open banking as a service
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Computer Vision con AWS
Database Oracle e VMware Cloud on AWS i miti da sfatare
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
API moderne real-time per applicazioni mobili e web
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Tools for building your MVP on AWS
How to Build a Winning Pitch Deck
Building a web application without servers
Fundraising Essentials
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Introduzione a Amazon Elastic Container Service

McGraw-Hill Optimizes Analytics Workloads with Databricks

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. May 24, 2018 | 1PM-2PM PDT. McGraw-Hill Education Optimizes Analytics Workloads with Databricks. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 2. Today’s presenters: Pratap Ramamurthy, Partner Solutions Architect, Amazon Web Services; Brian Dirking, Senior Director of Partner Marketing, Databricks; Matthew Ashbourne, Lead Software Engineer, McGraw-Hill Education
  • 3. Today’s agenda: 1. Overview of AWS and AWS data lake services 2. Databricks: solutions that extend data lake management capabilities 3. McGraw-Hill Education recognizes the need to transform 4. McGraw-Hill Education revolutionizes digital learning with AWS and Databricks 5. Q&A/Discussion
  • 4. Learning objectives: 1. How data lakes, using a unified analytic platform, can enable advanced analytic use cases such as machine learning 2. How to optimize data lakes to work effectively with real-time and fast-moving data 3. How to streamline the read/write process for data lakes
  • 5. The Data Lake and AWS: Drive business value with disparate types of data
  • 6. Legacy Data Warehouses & RDBMS • Complex to set up and manage • Do not scale • Take months to add new data sources • Queries take too long • Cost $MM upfront
  • 7. Should I Build a Data Lake? Starting by amassing "all your data" and dumping into a large repository for the data gurus to start finding "insights" is like trying to win the lottery by buying all the tickets
  • 8. Rethink How to Become a Data-driven Business • Business outcomes - start with the insights and actions you want to drive, then work backwards to a streamlined design • Experimentation - start small, test many ideas, keep the good ones and scale those up, paying only for what you consume • Agile and timely - deploy data processing infrastructure in minutes, not months. Take advantage of a rich platform of services to respond quickly to changing business needs
  • 9. Business Case Determines Platform Design. [Diagram: pipeline stages Ingest/Collect, Store, Process/Analyze, Consume/Visualize, turning Data into Answers & Insights. Callout: start here with a business case.]
  • 10. Experiment and Scale Based on Your Business Needs. [Diagram: pipeline stages Ingest/Collect, Store, Process/Analyze, Consume/Visualize, turning Data into Answers & Insights. Callout: match available data (Metrics and Monitoring, Workflow Logs, ERP, Transactions).]
  • 11. Business Outcomes on a Modern Data Architecture • Outcome 1: Modernize and consolidate. Insights to enhance business applications and create new digital services • Outcome 2: Innovate for new revenues. Personalization, demand forecasting, risk analysis • Outcome 3: Real-time engagement. Interactive customer experience, event-driven automation, fraud detection • Outcome 4: Automate for expansive reach. Automation of business processes and physical infrastructure
  • 12. Data Lake on AWS. Services shown: Snowball, Snowmobile, Kinesis Data Firehose, Kinesis Data Streams, S3, Redshift, EMR, Athena, Kinesis, Elasticsearch Service, Kinesis Video Streams, AI Services. • Relational and non-relational data • Schema defined during analysis • Unmatched durability and availability at EB scale • Best security, compliance, and audit capabilities • Run any analytics on the same data without movement • Scale storage and compute independently • Store data at $0.023 / month; query for $0.05/GB scanned
  • 13. Why Amazon S3 for Modern Data Architecture? • Durable: designed for 11 9s of durability • Available: designed for 99.99% availability • High performance: multiple upload; range GET • Scalable: store as much as you need; scale storage and compute independently; no minimum usage commitments • Integrated: Amazon EMR, Amazon Redshift Spectrum, Amazon DynamoDB, Amazon Athena, AWS Glue, Amazon Kinesis, Amazon SageMaker • Easy to use: simple REST API; AWS SDKs; read-after-create consistency; event notification; lifecycle policies
  • 14. Decouple Storage and Compute • Legacy design was large databases or data warehouses with integrated hardware • Big Data architectures often benefit from decoupling storage and compute
  • 15. Analyzing streaming and historical data at scale with Databricks. Brian Dirking, Senior Director of Partner Marketing, Databricks
  • 16. Unify big data and AI with Databricks on AWS. Powered by Apache Spark, the Unified Analytics Platform from Databricks runs on AWS for cloud infrastructure. [Figure callout: 5 – 8x]
  • 17. [Figure callout: 5 – 8x]
  • 18. Streaming Data Presents Challenges (problem: description) • Writing Data for Access: optimize data writes for retrieval • Unreliable Data: data is being written while you access it for analysis; how do you ensure that you get a good data set? • Storing Data: storing data in a way that enables low cost and fast access • Blending with Historical Data: access historical data to blend with streaming data for analytics models?
  • 19. Databricks Delta: Delta is a unified data management system that brings data reliability and performance optimizations to cloud data lakes • Help ensure data integrity with transactional guarantees. • Enable the most consistent view of your streaming data. • Modify data after it has been written with upserts. • Leverage Amazon S3 for massive scale. • Separate compute from storage for cost efficiency. • Enable data portability with an open file format. * • Accelerate query speeds through indexing and caching. • Self-optimize data layouts and simplify partition management. • Up to 100x faster than Apache Spark on Parquet.
  • 20. McGraw-Hill Education optimizes its digital learning platform: Driving innovation with Databricks on AWS. Matthew Ashbourne, Lead Software Engineer, McGraw-Hill Education
  • 21. Our history. McGraw-Hill Education is a 129-year-old company that was reborn with the mission of accelerating learning through intuitive and engaging experiences, grounded in research. McGraw-Hill is working to revolutionize education with machine learning and AI complementing our suite of traditional reporting products. We capture student interaction data within our online learning platforms to: • Deliver a personalized learning experience • Drive higher retention and pass rates • Provide overview and detailed views of student work within online learning environments to instructors
  • 22. A Case Study in Learning Science
  • 23. Connect retention
  • 24. Connect retention
  • 25. Challenges
  • 26. Key challenges: Data Access, Processing, Scale
  • 27. [Architecture diagram: Systems, Kinesis, AWS Lambda, Amazon ES, Data Mart, 3rd party data integration platform, Spark Cluster in datacenter; annotated with Data Access, Processing, Scale]
  • 28. Problems we encountered • Unable to productionize data science output • Schema-bound to transactional DBs • Data engineering splintered across tech stacks • Scaling issues • Small file challenges • High overhead adding new data sources
  • 29. Data lake solution
  • 30. Data lake ecosystem
  • 31. McGraw-Hill Education’s data lake requirements • Low engineering effort to get off the ground • Support concurrent writes • Resilient and auto-healing so a small team can easily manage it • Ability to compact small files to improve read performance
  • 32. Issues with open source data lake approach • Small/too-large output files • Dirty output directories • Schema management • No transactional support or safety • No safe way to have multiple writers to same table
  • 33. Databricks Delta to the rescue • ACID Transactions: multiple writers can simultaneously modify a dataset and see consistent views. • DELETES/UPDATES/UPSERTS: writers can modify a dataset without interfering with jobs reading the dataset. • Data Validation: ensures that data meets specified invariants (for example, NOT NULL) by rejecting invalid data. • Automatic File Management: speeds up data access by organizing data into large files that can be read efficiently. • Statistics and Data Skipping: speeds up reads by 10-100x by tracking statistics about the data in each file and avoiding reading irrelevant information.
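The "Statistics and Data Skipping" item on slide 33 can be illustrated with a small, self-contained Python sketch. This is not Databricks Delta's implementation: the `FileStats` record, the `files_to_scan` helper, and the file names are hypothetical, and the sketch assumes a single numeric column with per-file minimum and maximum values recorded at write time.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file statistics for one numeric column."""
    path: str
    min_score: float
    max_score: float


def files_to_scan(files, lo, hi):
    """Return paths of files whose [min, max] range overlaps the query range [lo, hi].

    Files whose range cannot contain a matching row are skipped without being read.
    """
    return [f.path for f in files if f.max_score >= lo and f.min_score <= hi]


# Illustrative statistics, not real data.
files = [
    FileStats("part-000.parquet", 0.0, 40.0),
    FileStats("part-001.parquet", 35.0, 70.0),
    FileStats("part-002.parquet", 71.0, 100.0),
]

# A query for scores between 80 and 90 only needs the third file.
print(files_to_scan(files, 80.0, 90.0))  # → ['part-002.parquet']
```

The speed-up in such a scheme depends on how well the data is clustered: if every file spans the full value range, no file can be skipped.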
  • 34. [Architecture diagram: Systems, Kinesis, Kafka, Databricks Spark Cluster on EC2 Spot, Databricks Delta, Data lake on S3, Data Mart; annotated with Data Access, Processing, Scale]
  • 35. Problems solved with Databricks • Unable to productionize data science output: data scientists now work in same environment and toolchain • Schema-bound to transactional DBs: eliminated direct schema binding through event contract • Data engineering splintered across tech stacks: unified data processing on Spark API • Scaling issues: huge increase in scaling with simple lift & shift to Spark • Small file challenges: small files are automatically compacted • High overhead adding new data sources: push-based model makes it easier for new data sources to expose themselves to analytics
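To make the "small files are automatically compacted" point on slide 35 concrete, here is a rough Python sketch of how a compaction planner might group small files into larger rewrite batches. It is a generic greedy grouping written for illustration, not the algorithm Databricks Delta uses; `plan_compaction`, the file names, and the 128 MB target are all assumptions.

```python
def plan_compaction(file_sizes_mb, target_mb):
    """Greedily group files, in order, into batches of roughly target_mb.

    A batch is closed as soon as adding the next file would exceed target_mb.
    A single file larger than target_mb ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in file_sizes_mb:
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Illustrative (name, size in MB) pairs, not real data.
small_files = [("a", 40), ("b", 50), ("c", 30), ("d", 200), ("e", 10)]
print(plan_compaction(small_files, target_mb=128))
# → [['a', 'b', 'c'], ['d'], ['e']]
```

Each batch would then be rewritten as one larger file, which reduces the number of objects a reader has to open.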
  • 36. Data lake implementation
  • 37. You still need information architecture. Avoid a data swamp with • Consistent names and identifiers for services, entities, event topics/streams • Schemas • Documentation on how data joins across domains
  • 38. Flexible data lake schema: write data with a minimal schema. Columns: raw (JSON string), event_class (string), header (struct), body (JSON string), json_schema (string), header_version (string), partition_event_source (string), partition_event_name (string), partition_event_date (date), meta (struct<type:string,value:string,pipeline_datetime:timestamp>)
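As a rough illustration of writing with a minimal schema (slide 38), the Python sketch below wraps an arbitrary event payload in a small envelope whose body stays an opaque JSON string, so new event types do not require a table schema change. Only a subset of the slide's columns is shown; the `wrap_event` helper and the sample source, event name, and payload are hypothetical.

```python
import json
from datetime import datetime, timezone


def wrap_event(source, name, body, occurred_at):
    """Build an envelope row: typed partition columns plus an untyped JSON body."""
    return {
        # The payload is serialized as-is; its structure is interpreted at read time.
        "body": json.dumps(body, sort_keys=True),
        "partition_event_source": source,
        "partition_event_name": name,
        "partition_event_date": occurred_at.date().isoformat(),
    }


# Hypothetical event, for illustration only.
row = wrap_event(
    source="roster",
    name="enrollment_created",
    body={"learner_xid": "l1"},
    occurred_at=datetime(2018, 5, 24, 13, 0, tzinfo=timezone.utc),
)
print(row["partition_event_date"])  # → 2018-05-24
```

The trade-off is that validation moves from write time to read time, which is why the deck pairs this design with schemas and documentation.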
  • 39. Read/Write batch. READ: spark.read.format("delta").load("/delta/events") WRITE (events is a DataFrame): events.write.format("delta").partitionBy("date").save("/delta/events")
  • 40. Read/Write (streaming). READ: spark.readStream.format("delta").load("/delta/events") WRITE: events.writeStream.format("delta").partitionBy("date").outputMode("append").option("checkpointLocation", "/delta/events/_checkpoints/my-stream").start("/delta/events")
  • 41. Schema on read. import org.apache.spark.sql.functions._; import org.apache.spark.sql.types._; val eventBodySchema = new StructType().add("learner_xid", StringType).add("assignment_xid", StringType).add("raw_score", DoubleType); val parsed = dataLake.withColumn("parsed_body", from_json(col("body"), eventBodySchema))
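The schema-on-read step on slide 41 applies a schema to the JSON body only when the data is read. The plain-Python sketch below shows the same idea without Spark, under stated assumptions: `EVENT_BODY_SCHEMA` and `parse_body` are hypothetical names, the field list mirrors the slide, and the null-handling shown here is a simplification that is not claimed to match Spark's `from_json` in every case.

```python
import json

# Hypothetical read-time schema: field name -> expected Python type.
EVENT_BODY_SCHEMA = {"learner_xid": str, "assignment_xid": str, "raw_score": float}


def parse_body(body, schema):
    """Parse a JSON string against a schema chosen at read time.

    Returns None for unparseable input; otherwise a dict with every schema
    field, using None where a field is missing or has the wrong type.
    """
    try:
        raw = json.loads(body)
    except json.JSONDecodeError:
        return None
    if not isinstance(raw, dict):
        return None
    parsed = {}
    for name, typ in schema.items():
        value = raw.get(name)
        # Accept JSON integers for float fields (but not booleans).
        if typ is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        parsed[name] = value if isinstance(value, typ) else None
    return parsed


print(parse_body('{"learner_xid": "l1", "assignment_xid": "a1", "raw_score": 87}',
                 EVENT_BODY_SCHEMA))
```

Because the stored body is untouched, a later reader can apply a different or extended schema to the same rows.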
  • 42. Views. SELECT assessment, learner, sensed_datetime FROM delta.`/delta/events` LATERAL VIEW json_tuple(assessment, 'xid', 'due_date') v1 AS xid, due_datetime LATERAL VIEW json_tuple(learner, 'xid') v2 AS learner_xid
  • 43. Bringing it all together
  • 44. Case study: new enterprise.roster pipeline. [Diagram showing two pipelines. Components of one: Rostering System, Batch Extract, Staging, Batch Transform & Load, Data Mart. Components of the other: Rostering System, Kafka, Structured Streaming, Delta Data lake, Batch ETL, Data Mart.]
  • 45. Case study: new enterprise.roster pipeline. [Diagram components: Rostering System, Kafka, Structured Streaming, Delta Data lake, Batch ETL, Data Mart.]
  • 46. The real deal
  • 47. Possible pitfalls • You need a strategy up front • Organizational change • Source systems need to embrace event instrumentation • Training • Data validation
  • 48. Some surprises: Data pipeline development is not faster, BUT • More options and possibilities • ETLs faster and more reliable with better scaling • Easily combine streaming and batch data • Unified Analytics Platform matters
  • 49. Next steps and further information • Data Lake solution on AWS: https://aws.amazon.com/big-data/data-lake-on-aws/ • Take a Free 30-Day Trial of Databricks: https://databricks.com/try-databricks • Try AWS for free (full offer details available at the link below): https://aws.amazon.com/free
  • 50. Q & A
  • 51. Thank you!