Session ID: 1495
Build a DataWarehouse
for your (alert!) logs
With Python, AWS Athena
and AWS Glue
Wednesday, April 25, 2018
Maxym Kharchenko
Sr. Database Engineer
Amazon.com
whoami
• Sr. Database Engineer @ amazon.com, Big Data Technologies team
• Developer <-> DBA
• OCM, ACE Associate, AWS Developer (all “alumni”)
• I have stickers!
Agenda
• Why query (alert) logs with SQL
• How to query (alert) logs with SQL
• How to make it easy and efficient with AWS Athena and Glue
• Demo
Logs are the best operational data
about your system
Logs are great at simple “tactical” questions
“Why did my query fail at 17:17 yesterday?”
Sun Feb 11 17:17:04 2018
ORA-01115: IO error reading block from file (block # )
ORA-01110: data file 16:
'/ora02/database/mydb/tbs12mydb_01.dbf'
“Why am I missing today’s partition?”
Thu Jan 11 11:40:55 2018
Errors in file /logs/mydb/trace/mydb-36_j005_38530.trc:
ORA-12012: error on auto execute of job "PART_ADMIN"."CREATE_PARTITION"
ORA-00028: your session has been killed
mydb
alert.log
But not so great when questions get “broader”
“Did the last patch solve our problem?”
> grep ORA-28 alert.log
opiodr aborting process unknown ospid (3411) as a result of ORA-28
opiodr aborting process unknown ospid (65973) as a result of ORA-28
opiodr aborting process unknown ospid (56719) as a result of ORA-28
opiodr aborting process unknown ospid (129663) as a result of ORA-28
opiodr aborting process unknown ospid (11260) as a result of ORA-28
opiodr aborting process unknown ospid (22534) as a result of ORA-28
mydb
alert.log
Or when analyzing multiple logs
“What is the timeline of the latest cluster lockup issue?”
Wed May 24 11:17:10 2017
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Wed May 24 11:17:17 2017
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Wed May 24 11:17:28 2017
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
[Diagram: the same timeline scattered across logs from 18 cluster nodes]
Or when correlating data across different logs
“Are we seeing more node crashes because of:
- Disk malfunctions?
- ASM issues?
- Network disconnects?”
> grep "WARNING: inbound connection timed out" alert*.log
> grep "corrupted block" asm*.log
> grep -P "failed|error|critical" kern*.log
> grep -P "long wait|error|disconnect" tnsping*.log
[Diagram: logs from 18 cluster nodes]
Or when looking for trends
“Has the rate of network disconnects increased over the last 6 months?”
“What databases have the highest archived log switch rate?”
“Do we see more problems in specific datacenter locations?”
“Are there times of the day with almost no user activity?”
Logs are not exactly easy to query
(in bulk)
If only there was a simpler way
to query all my logs …
SELECT trunc(event_time, 'DD'), db, count(1) AS errors
FROM "all my logs"
WHERE event_time > sysdate - interval '90' days
AND (
    message LIKE '%ORA-00028%'
    OR
    message LIKE '%ORA-28%'
)
GROUP BY trunc(event_time, 'DD'), db
ORDER BY 1, 2
/
How to query
(application, db, …) logs
with SQL
Is it even possible to query “unstructured text” with SQL?
SQL Engines!
“Table”
• Linux “directory”
• HDFS “folder”
• Cloud storage “folder”
Can log files (aka: “text”) be a “table”?
How to make logs “queryable”
1. Structur-ize
2. Table-ize
3. Transform and Compact-ize
Step 1: Structur-ize
“Raw” logs (i.e. alert_db.log) → “Structured” (i.e. JSON) logs
Step 1: Find “structure” in logs
Jan 11 20:30:59 host-12 kernel: [185012.404818] sd 2:0:1:168: [sdgfp] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185012.425995] sd 2:0:1:167: [sdgfn] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185012.426272] sd 2:0:1:169: [sdgfs] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185022.726345] rport-2:0-22: blocked FC remote port
Jan 11 20:30:59 host-12 kernel: [185076.513763] sd 2:0:13:0: [sdd] Write Protect is off
Thu Jan 11 17:15:54 2018
Thread 32 advanced to log sequence 34018 (LGWR switch)
Current log# 251 seq# 34018 mem# 0: +DG1/mydb-1/onlinelog/group_12.384.931698439
Thu Jan 11 17:16:25 2018
Unable to create archive log file '+DG1'
ARC1: Error 19504 Creating archive log file to '+DG1'
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance mydb-1 - Archival Error
ORA-16038: log 12 sequence# 34017 cannot be archived
ORA-19504: failed to create file ""
ORA-00312: online log 254 thread 32: '+DG1/mydb-1/onlinelog/group_12.593.933491557'
Step 1: Make log structure explicit
#!/usr/bin/env python3
import json
import re
import sys

# Line format: <timestamp> <message>
# i.e. Jan 11 20:30:59 host-12 kernel: [185012.404818] sd 2:0:1:168: [sdgfp]
LINE_FORMAT = re.compile(r"^(\w+\s+\d+\s+\d+:\d+:\d+)\s+(.*)$")

for line in sys.stdin:
    matched = LINE_FORMAT.match(line)
    if matched:
        # CSV alternative: print(",".join(matched.groups()))
        print(json.dumps(
            dict(zip(("event_time", "message"), matched.groups()))
        ))
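The same trick works for alert.log, where the timestamp sits on its own line and the following lines belong to it. A minimal sketch (not from the talk), assuming timestamps like "Thu Jan 11 17:15:54 2018":

#!/usr/bin/env python3
import json
import re
import sys

# alert.log format: a bare timestamp line, then one or more message lines
TS_LINE = re.compile(r"^\w{3}\s+\w{3}\s+\d+\s+\d+:\d+:\d+\s+\d{4}$")

current_ts = None
for line in sys.stdin:
    line = line.rstrip("\n")
    if TS_LINE.match(line):
        current_ts = line  # remember the most recent timestamp
    elif current_ts and line:
        print(json.dumps({"event_time": current_ts, "message": line}))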
Step 1: Make log structure explicit
Jan 11 20:30:59 host-12 kernel: [185012.404818] sd 2:0:1:168: [sdgfp] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185012.425995] sd 2:0:1:167: [sdgfn] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185012.426272] sd 2:0:1:169: [sdgfs] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185022.726345] rport-2:0-22: blocked FC remote port
Jan 11 20:30:59 host-12 kernel: [185076.513763] sd 2:0:13:0: [sdd] Write Protect is off
{
  "message": "host-12 kernel: [185012.404818] sd 2:0:1:168: [sdgfp] Attached SCSI disk",
  "event_time": "Jan 11 20:30:59"
}
{
  "message": "host-12 kernel: [185012.425995] sd 2:0:1:167: [sdgfn] Attached SCSI disk",
  "event_time": "Jan 11 20:30:59"
}
Step 1: Make log structure explicit
Jan 11 20:30:59 host-12 kernel: [185012.404818] sd 2:0:1:168: [sdgfp] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185012.425995] sd 2:0:1:167: [sdgfn] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185012.426272] sd 2:0:1:169: [sdgfs] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185022.726345] rport-2:0-22: blocked FC remote port
Jan 11 20:30:59 host-12 kernel: [185076.513763] sd 2:0:13:0: [sdd] Write Protect is off
{
  "message": "host-12 kernel: [185012.404818] sd 2:0:1:168: [sdgfp] Attached SCSI disk",
  "event_time": "2018-01-11 20:30:59.000"
}
{
  "message": "host-12 kernel: [185012.425995] sd 2:0:1:167: [sdgfn] Attached SCSI disk",
  "event_time": "2018-01-11 20:30:59.000"
}
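Syslog-style timestamps omit the year, so the normalization step has to supply one. A minimal sketch, assuming the log's year is known:

from datetime import datetime

def normalize(ts, year=2018):
    # "Jan 11 20:30:59" -> "2018-01-11 20:30:59.000"
    parsed = datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S")
    return parsed.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]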
Step 2: Table-ize
“Structured” (i.e. JSON) logs → Table “directory” + Table “Metadata” (CREATE TABLE …)
Step 2: Create table and “ingest” data
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
  `event_time` timestamp,
  `message` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://databucket/mydb/mytable/'
;

> cp log*.json /data/mydb/mytable
> hadoop fs -put log*.json /data/mydb/mytable
> aws s3 cp . s3://databucket/mydb/mytable/ --recursive --exclude "*" --include "log*.json"
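The S3 "ingest" can also be scripted. A minimal boto3 sketch (bucket and prefix are the same placeholders as above):

import glob
import boto3

s3 = boto3.client("s3")
for path in glob.glob("log*.json"):
    # anything landing under the table LOCATION becomes queryable
    s3.upload_file(path, "databucket", "mydb/mytable/" + path)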
Step 3: Transform (into final form)
• Rollup
• Aggregations
• Materializing complex joins
• Partitioning
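In Glue, these transforms are just Spark code. A minimal PySpark sketch of a daily rollup with partitioned output (paths and the db column are illustrative, not from the talk):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# staging JSON in, partitioned Parquet out
logs = spark.read.json("s3://databucket/mydb/mytable/")
daily = (logs
         .withColumn("day", F.to_date("event_time"))
         .groupBy("day", "db")
         .agg(F.count("*").alias("errors")))
daily.write.mode("overwrite").partitionBy("day") \
    .parquet("s3://databucket/mydb/mytable_daily/")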
Step 3: Compact-ize
“Scan all files!”
Open data formats
TSV:
• Text-based
• Row-oriented
• Some compression
• Limited filtering
• Easy to make
Parquet:
• Binary
• Columnar
• Really good compression
• Advanced filtering
• More difficult to make
Step 3: Transform and Compact-ize
JSON logs → Parquet logs
• Format Transform
• SQL Transform
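For a one-off format transform, even pandas will do. A sketch (requires pyarrow; file names are illustrative):

import pandas as pd

# newline-delimited JSON in, compressed columnar Parquet out
df = pd.read_json("kern.json", lines=True)
df.to_parquet("kern.parquet", compression="snappy")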
Step 4: Query
Parquet logs
The SQL-on-logs pipeline
“Raw” logs → Structured logs → Staging table(s) → Final table(s)
Step 5: Make it simple with AWS
“Raw” logs → Structured logs → “Staging” S3 bucket → AWS Glue → “Final” S3 bucket → AWS Athena
AWS Athena
• “Query data in S3 using SQL”
• Serverless Presto cluster
• Rich SQL
• Supports multiple open data formats
• Fast, interactive performance
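Athena is also scriptable. A minimal boto3 sketch (bucket names are placeholders):

import time
import boto3

athena = boto3.client("athena")
qid = athena.start_query_execution(
    QueryString="SELECT count(*) FROM mydb.mytable",
    QueryExecutionContext={"Database": "mydb"},
    ResultConfiguration={"OutputLocation": "s3://databucket/athena-results/"},
)["QueryExecutionId"]

# poll until the query finishes, then fetch the result set
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state not in ("QUEUED", "RUNNING"):
        break
    time.sleep(1)
rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]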
AWS Glue
• “Prepare and load data (ETL!)”
• Serverless Apache Spark
• Crawlers: “data discovery” and automatic catalog maintenance
• Job scheduling
• Integrated with many data
“sources” and “sinks”
• ETL script generation (or BYO)
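Crawlers can be set up from code too. A minimal boto3 sketch (crawler and role names are assumptions):

import boto3

glue = boto3.client("glue")

# the crawler infers the schema and keeps the catalog table current
glue.create_crawler(
    Name="mytable-crawler",
    Role="MyGlueServiceRole",  # an existing IAM role with Glue + S3 access
    DatabaseName="mydb",
    Targets={"S3Targets": [{"Path": "s3://databucket/mydb/mytable/"}]},
)
glue.start_crawler(Name="mytable-crawler")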
Demo time
Extending the SQL-on-logs pipeline
Pre-parse logs in the cloud
“Raw” logs → S3: “Raw” logs → Lambda: to_json() → S3: “Staging” (JSON) → Glue: to_parquet() → S3: “Final” (Parquet) → Athena
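to_json() maps naturally onto a Lambda handler fired by each raw-log upload. A sketch (the staging bucket name and parse_line() are hypothetical; parse_line() would hold the regex logic from Step 1):

import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # triggered by an S3 PUT of a raw log file
    rec = event["Records"][0]["s3"]
    bucket, key = rec["bucket"]["name"], rec["object"]["key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode()

    out = []
    for line in body.splitlines():
        parsed = parse_line(line)  # hypothetical: the regex parsing from Step 1
        if parsed:
            out.append(json.dumps(parsed))

    s3.put_object(Bucket="staging-bucket", Key=key + ".json",
                  Body="\n".join(out).encode())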
Build materialized views
“Raw” logs → S3: “Raw” logs → Lambda: to_json() → S3: “Staging” (JSON) → Glue: to_parquet() → S3: “Final” (Parquet) → Athena
Glue: make_mview() → S3: “Final” (Parquet)
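One way to implement make_mview() is a scheduled CTAS statement that rewrites an aggregate as a new Parquet-backed table. A sketch, assuming Athena CTAS support (table and bucket names are placeholders):

import boto3

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="""
        CREATE TABLE mydb.errors_by_day
        WITH (format = 'PARQUET',
              external_location = 's3://final-bucket/mydb/errors_by_day/') AS
        SELECT date_trunc('day', event_time) AS day, count(*) AS errors
        FROM mydb.mytable
        GROUP BY 1
    """,
    ResultConfiguration={"OutputLocation": "s3://databucket/athena-results/"},
)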
Use different SQL front-ends
“Raw” logs → S3: “Raw” logs → Lambda: to_json() → S3: “Staging” (JSON) → Glue: to_parquet() → S3: “Final” (Parquet) → Athena
to_redshift() → Redshift
to_oracle() → RDS ORACLE
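to_redshift() can be as thin as a COPY from the “Final” bucket. A sketch, assuming Redshift's Parquet COPY support and a psycopg2 connection (cluster, role, and table names are placeholders):

import psycopg2

conn = psycopg2.connect(host="my-cluster.xyz.redshift.amazonaws.com",
                        port=5439, dbname="mydb",
                        user="admin", password="...")
with conn, conn.cursor() as cur:
    # load the Parquet "final" table straight into Redshift
    cur.execute("""
        COPY mytable
        FROM 's3://final-bucket/mydb/mytable/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS PARQUET
    """)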
Thank you!
maxym@amazon.com
