Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Prajakta Damle, Sr Product Manager – AWS Glue
Ben Snively, Specialist SA – Data and Analytics
September 14, 2017
Tackle Your Dark Data Challenge with AWS Glue

Agenda
• What is Dark Data?
• Automatically discovering your Dark Data
• Understanding the Dark Data
• Analyzing, processing and transforming your Dark Data
• Demonstration
• Conclusion

What is Dark Data?
“Dark data” is data that an organization collects and stores but does not use in any process or analytics.
• As a result, dark data currently provides very little value.
Organizations, however, believe that their dark data can provide value, so they want to:
• Discover the dark data that they have
• Query / analyze it to drive additional insights to move the business forward

AWS Glue
Discover: Automatically discovers and categorizes your dark data to make it immediately searchable and queryable
Develop: Generates code to clean, enrich, and reliably move data between data stores; you can also use your favorite tools to build ETL jobs
Deploy: Runs your jobs on a serverless, fully managed, scale-out environment without needing to provision or manage compute resources

AWS Glue: Components
Data Catalog
 Apache Hive Metastore compatible with enhanced functionality
 Crawlers automatically extract metadata and create tables
 Integrated with Amazon Athena, Amazon Redshift Spectrum
Job Execution
 Runs jobs on a serverless Apache Spark environment
 Provides flexible scheduling
 Handles dependency resolution, monitoring, and alerting
Job Authoring
 Auto-generates ETL code
 Built on open frameworks – Python and Apache Spark
 Developer-centric – editing, debugging, sharing

AWS Glue Data Catalog
Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single
categorized list that is searchable

Glue Data Catalog
Data Catalog automatically populated through Crawlers (can also populate using Apache Hive DDL or bulk import script)
Manage table metadata through an Apache Hive metastore API or Apache Hive SQL (supported by tools like Apache Hive, Presto, Apache Spark, etc.)
We added a few extensions:
 Search over metadata for data discovery
 Connection info – JDBC URLs, credentials
 Classification for identifying and parsing files
 Versioning of table metadata as schemas evolve and other metadata are updated

Glue Data Catalog: Crawlers
 Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Apache Hive style partitions on Amazon S3
 Built-in classifiers for popular data types
• Custom classifiers using Grok expressions
 Run ad hoc or on a schedule; serverless – only pay when crawler runs
Crawlers automatically build your Data Catalog and keep it in sync
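
As a rough illustration of the crawler settings listed above (a target data store, an IAM role, an optional schedule, and a policy for schema changes), the sketch below assembles the parameters one might pass to boto3's `glue.create_crawler`. The crawler name, role ARN, database name, bucket path, and schedule are hypothetical placeholders, and the actual call is left commented out so the snippet runs without AWS credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Assemble keyword arguments for boto3's glue.create_crawler call."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Update table definitions in the catalog when the schema changes,
        # and only log objects that have disappeared.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }
    if schedule is not None:
        # Omit the schedule to run the crawler ad hoc instead.
        request["Schedule"] = schedule
    return request


# All values below are made up for illustration.
crawler_request = build_crawler_request(
    name="dark-data-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="dark_data",
    s3_path="s3://example-bucket/logs/",
    schedule="cron(0 12 * * ? *)",
)

# With credentials configured, the request could be submitted like this:
# import boto3
# boto3.client("glue").create_crawler(**crawler_request)
```

Leaving out `schedule` corresponds to the ad hoc case; the crawler would then be started explicitly.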

Crawlers: Classifiers
An IAM role grants the Glue crawler access to:
• Data lakes: Amazon S3 (object connection)
• Data warehouses: Amazon Redshift (JDBC connection)
• Databases: Amazon RDS (JDBC connection)
Built-in classifiers:
• MySQL, MariaDB, PostgreSQL, Aurora, Redshift
• Avro, Parquet, ORC, JSON & BSON
• Logs (Apache, Linux, MS, Ruby, Redis, and many others)
• Delimited (comma, pipe, tab, semicolon)
• Compressed formats (ZIP, BZIP, GZIP, LZ4, Snappy)
Create additional custom classifiers with Grok!
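
To give a feel for what a Grok-based custom classifier does, here is a minimal sketch that expands `%{PATTERN:field}` tokens into named regular-expression groups and uses the result to parse a log line. It is not Glue's implementation: the three pattern definitions are simplified assumptions, and the sample log format is invented for the example.

```python
import re

# Tiny subset of Grok-style building blocks, written as plain regexes.
# These definitions are simplified assumptions, not Glue's built-in patterns.
GROK_PATTERNS = {
    "TIMESTAMP_ISO8601": r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}",
    "LOGLEVEL": r"DEBUG|INFO|WARN|ERROR",
    "GREEDYDATA": r".*",
}


def grok_to_regex(expression):
    """Expand %{PATTERN:field} tokens into named regex groups."""
    def expand(match):
        pattern, field = match.group(1), match.group(2)
        return "(?P<%s>%s)" % (field, GROK_PATTERNS[pattern])

    body = re.sub(r"%\{(\w+):(\w+)\}", expand, expression)
    return re.compile("^" + body + "$")


def classify(line, compiled):
    """Return the extracted fields, or None if the line does not match."""
    match = compiled.match(line)
    return match.groupdict() if match else None


# Hypothetical application log format.
APP_LOG = grok_to_regex(
    "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}"
)
row = classify("2017-09-14 10:00:00 ERROR disk full", APP_LOG)
unmatched = classify("not a log line", APP_LOG)
```

A line that matches yields one value per named field, which is what lets a classifier propose column names for a table; a line that does not match returns `None`, so another classifier can be tried.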

Crawler: Detecting partitions
Estimate schema similarity among files at each level to handle semi-structured logs, schema evolution…
[Figure: S3 bucket hierarchy (month=Nov, then date=10 … date=15, then file 1 … file N) annotated with schema similarity scores (sim=.99, sim=.95, sim=.93) and mapped to a table definition]
Table definition:
Column   Type
month    str
date     str
col 1    int
col 2    float
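
The two ideas on this slide can be sketched in a few lines: reading Hive-style `key=value` folders out of an S3 object key, and scoring how alike two files' schemas are. The slides do not say how the crawler computes similarity, so the Jaccard overlap of field names used here is purely an assumption for illustration; the object key and field names are made up.

```python
def parse_partitions(key):
    """Split an S3 object key into Hive-style partition values and a file name."""
    *folders, filename = key.split("/")
    partitions = {}
    for folder in folders:
        if "=" in folder:
            name, value = folder.split("=", 1)
            partitions[name] = value
    return partitions, filename


def schema_similarity(fields_a, fields_b):
    """Jaccard overlap of two sets of field names (1.0 means identical).

    Assumption: a stand-in metric, not the crawler's actual algorithm.
    """
    a, b = set(fields_a), set(fields_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


partitions, filename = parse_partitions("logs/month=Nov/date=10/file1.json")
similarity = schema_similarity(["col1", "col2"], ["col1", "col2", "col3"])
```

Under this sketch, folders whose files score close to 1.0 against each other would be folded into one table, with `month` and `date` becoming partition columns.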

Glue Data Catalog: Table details
Table schema
Table properties
Data statistics
Nested fields

Glue Data Catalog: Version control
List of table versions
Compare schema versions

Analyzing and Processing your dark data

Job authoring in AWS Glue
You have choices on how to get started:
 Python code generated by AWS Glue
 Connect a notebook or IDE to AWS Glue
 Existing code brought into AWS Glue

Job authoring: Automatic code generation
1. Customize the mappings
2. Glue generates transformation graph and Python code
3. Connect your notebook to development endpoints to customize your code

Job authoring: ETL code
 Human-readable, editable, and portable PySpark code
 Flexible: Glue’s ETL library simplifies manipulating complex, semi-structured data
 Customizable: Use native PySpark, import custom libraries, and/or leverage Glue’s libraries
 Collaborative: share code snippets via GitHub, reuse code across jobs

Job Authoring: Glue Dynamic Frames
[Figure: Dynamic frame schema, showing fields A, B1, B2, C (X, Y), and D [ ]]
Like Apache Spark’s Data Frames, but better for:
• Cleaning and (re)-structuring semi-structured data sets, e.g. JSON, Avro, Apache logs ...
No upfront schema needed:
• Infers schema on-the-fly, enabling transformations in a single pass
Easy to handle the unexpected:
• Tracks new fields and inconsistent, changing data types with choices, e.g. integer or string
• Automatically marks and separates error records
Job Authoring: Glue transforms
Adaptive and flexible
[Figure: ResolveChoice() applied to a choice field B with three options (project, cast, separate into cols); ApplyMapping() applied to fields A and C (X, Y), producing columns A, X, Y]
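
As a plain-Python analogy for the transforms pictured above, the sketch below resolves a mixed-type field either by casting every value to one type or by separating the values into one column per type, and then applies a simple field mapping. The function names and the three-element mapping tuples are simplifications introduced for this example; they are not the signatures of Glue's `ResolveChoice` or `ApplyMapping`.

```python
def resolve_cast(records, field, target):
    """'cast' style: convert every value of `field` to a single type."""
    return [
        {**r, field: target(r[field])} if field in r else dict(r)
        for r in records
    ]


def resolve_make_cols(records, field):
    """'separate into cols' style: one column per observed type."""
    out = []
    for r in records:
        row = {k: v for k, v in r.items() if k != field}
        if field in r:
            row["%s_%s" % (field, type(r[field]).__name__)] = r[field]
        out.append(row)
    return out


def apply_mapping(records, mappings):
    """Keep, rename, and convert fields; mappings are (source, target, type)."""
    return [
        {target: cast(r[source]) for source, target, cast in mappings if source in r}
        for r in records
    ]


# Hypothetical input where "zip" is sometimes an int and sometimes a str.
records = [{"id": 1, "zip": 98101}, {"id": 2, "zip": "10001"}]

as_strings = resolve_cast(records, "zip", str)
as_columns = resolve_make_cols(records, "zip")
mapped = apply_mapping(
    as_strings, [("id", "customer_id", int), ("zip", "postal_code", str)]
)
```

Casting keeps a single column at the cost of forcing one type, while separating into columns preserves the original values and leaves the decision to downstream queries.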

Job authoring: Relationalize() transform
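[Figure: a semi-structured schema (A, B, B, C with X and Y, D [ ]) mapped to a relational schema: a root table with columns A, B, B, C.X, C.Y and a foreign key (FK), plus a child table with a primary key (PK), Value, and Offset]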
• Transforms and adds new columns, types, and tables on-the-fly
• Tracks keys and foreign keys across runs
• SQL on the relational schema is orders of magnitude faster than JSON processing
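
The flattening described above can be sketched without Glue: nested structs become dotted column names, and an array is moved into a child table keyed back to its parent row. This is a simplified illustration of the idea, not the `Relationalize()` implementation; the column names `id`, `offset`, and `value` and the sample record are assumptions made for the example.

```python
def flatten(value, prefix=""):
    """Turn nested dicts into a single dict with dotted column names."""
    flat = {}
    for key, inner in value.items():
        name = prefix + "." + key if prefix else key
        if isinstance(inner, dict):
            flat.update(flatten(inner, name))
        else:
            flat[name] = inner
    return flat


def relationalize(records, array_field):
    """Split records into a root table and a child table for one array field."""
    root, child = [], []
    for row_id, record in enumerate(records):
        items = record.get(array_field, [])
        row = flatten({k: v for k, v in record.items() if k != array_field})
        # The array column is replaced by a key pointing into the child table.
        row[array_field] = row_id
        root.append(row)
        for offset, value in enumerate(items):
            child.append({"id": row_id, "offset": offset, "value": value})
    return root, child


# Hypothetical record shaped like the slide: A, C (X, Y), and an array D.
records = [{"A": 1, "C": {"X": "x", "Y": "y"}, "D": [10, 20]}]
root, child = relationalize(records, "D")
```

Once the data is in this shape, the root and child tables can be joined on the key with ordinary SQL instead of parsing nested JSON for every query.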

Job authoring: Glue transforms
 Prebuilt transformation: Click and add to your job with simple configuration
 Spigot writes sample data from DynamicFrame to S3 in JSON format
 Expanding… more transformations to come

Job authoring: Write your own scripts
Import custom libraries required by your code
Convert to Apache Spark Data Frame for complex SQL-based ETL
Convert back to Glue Dynamic Frame for semi-structured processing and AWS Glue connectors

Job authoring: Development endpoints
 Environment to iteratively develop and test ETL code.
 Connect your IDE or notebook (e.g. Zeppelin) to a Glue development endpoint.
 When you are satisfied with the results you can create an ETL job that runs your code.
[Figure: a remote interpreter connects to the interpreter server in the Glue Apache Spark environment]

Conclusion
Data Catalog
 Apache Hive Metastore compatible with enhanced functionality
 Crawlers automatically extract metadata and create tables
 Integrated with Amazon Athena, Amazon Redshift Spectrum
Job Execution
 Runs jobs on a serverless Apache Spark environment
 Provides flexible scheduling
 Handles dependency resolution, monitoring, and alerting
Job Authoring
 Auto-generates ETL code
 Built on open frameworks – Python and Apache Spark
 Developer-centric – editing, debugging, sharing

Thank you!
https://aws.amazon.com/glue/developer-resources/
