Session ID: 1495
Build a DataWarehouse
for your (alert!) logs
With Python, AWS Athena
and AWS Glue
Wednesday, April 25, 2018
Maxym Kharchenko
Sr. Database Engineer
Amazon.com
whoami
• Sr. Database Engineer @ amazon.com, Big Data Technologies team
• Developer <-> DBA
• OCM, ACE Associate, AWS Developer (all “alumni”)
• I have stickers!
Agenda
• Why query (alert) logs with SQL
• How to query (alert) logs with SQL
• How to make it easy and efficient with AWS Athena and Glue
• Demo
Logs are the best operational data
about your system
Logs are great at simple “tactical” questions
“Why did my query fail at 17:17 yesterday?”
Sun Feb 11 17:17:04 2018
ORA-01115: IO error reading block from file (block # )
ORA-01110: data file 16:
'/ora02/database/mydb/tbs12mydb_01.dbf'
“Why am I missing today’s partition?”
Thu Jan 11 11:40:55 2018
Errors in file /logs/mydb/trace/mydb-36_j005_38530.trc:
ORA-12012: error on auto execute of job "PART_ADMIN"."CREATE_PARTITION"
ORA-00028: your session has been killed
mydb
alert.log
But not so great when questions get “broader”
“Did the last patch solve our problem?”
> grep ORA-28 alert.log
opiodr aborting process unknown ospid (3411) as a result of ORA-28
opiodr aborting process unknown ospid (65973) as a result of ORA-28
opiodr aborting process unknown ospid (56719) as a result of ORA-28
opiodr aborting process unknown ospid (129663) as a result of ORA-28
opiodr aborting process unknown ospid (11260) as a result of ORA-28
opiodr aborting process unknown ospid (22534) as a result of ORA-28
mydb
alert.log
Or when analyzing multiple logs
“What is the timeline of the latest cluster lockup issue?”
Wed May 24 11:17:10 2017
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Wed May 24 11:17:17 2017
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Wed May 24 11:17:28 2017
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
[Diagram: the same timeline scattered across logs from 18 cluster nodes]
Or when correlating data across different logs
“Are we seeing more node crashes because of:
- Disk malfunctions?
- ASM issues?
- Network disconnects?”
> grep "WARNING: inbound connection timed out" alert*.log
> grep "corrupted block" asm*.log
> grep -P "failed|error|critical" kern*.log
> grep -P "long wait|error|disconnect" tnsping*.log
[Diagram: logs from 18 cluster nodes]
Or when looking for trends
“Has the rate of network disconnects increased over the last 6 months?”
“What databases have the highest archived log switch rate?”
“Do we see more problems in specific datacenter locations?”
“Are there times of the day with almost no user activity?”
Logs are not exactly easy to query
(in bulk)
If only there was a simpler way
to query all my logs …
SELECT trunc(event_time, 'DD'), db, count(1) AS errors
FROM "all my logs"
WHERE event_time > sysdate - interval '90' days
AND (
    message LIKE '%ORA-00028%'
    OR
    message LIKE '%ORA-28%'
)
GROUP BY trunc(event_time, 'DD'), db
ORDER BY 1, 2
/
How to query
(application, db, …) logs
with SQL
Is it even possible to query “unstructured text” with SQL?
SQL Engines!
“Table”
• Linux “directory”
• HDFS “folder”
• Cloud storage “folder”
Can log files (aka: “text”) be a “table”?
How to make logs “queryable”
1. Structur-ize
2. Table-ize
3. Transform and Compact-ize
Step 1: Structur-ize
“Raw” logs (i.e. alert_db.log) → “Structured” (i.e. JSON) logs
Step 1: Find “structure” in logs
Jan 11 20:30:59 host-12 kernel: [185012.404818] sd 2:0:1:168: [sdgfp] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185012.425995] sd 2:0:1:167: [sdgfn] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185012.426272] sd 2:0:1:169: [sdgfs] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185022.726345] rport-2:0-22: blocked FC remote port
Jan 11 20:30:59 host-12 kernel: [185076.513763] sd 2:0:13:0: [sdd] Write Protect is off
Thu Jan 11 17:15:54 2018
Thread 32 advanced to log sequence 34018 (LGWR switch)
Current log# 251 seq# 34018 mem# 0: +DG1/mydb-1/onlinelog/group_12.384.931698439
Thu Jan 11 17:16:25 2018
Unable to create archive log file '+DG1'
ARC1: Error 19504 Creating archive log file to '+DG1'
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance mydb-1 - Archival Error
ORA-16038: log 12 sequence# 34017 cannot be archived
ORA-19504: failed to create file ""
ORA-00312: online log 254 thread 32: '+DG1/mydb-1/onlinelog/group_12.593.933491557'
Step 1: Make log structure explicit
#!/usr/bin/env python3
import json
import re
import sys

# Line format: <timestamp> <message>
# i.e. Jan 11 20:30:59 host-12 kernel: [185012.404818] sd 2:0:1:168: [sdgfp]
LINE_FORMAT = re.compile(r"^(\w+\s+\d+\s+\d+:\d+:\d+)\s+(.*)$")

for line in sys.stdin:
    matched = LINE_FORMAT.match(line)
    if matched:
        # CSV alternative: print(",".join(matched.groups()))
        print(json.dumps(
            dict(zip(("event_time", "message"), matched.groups()))
        ))
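The same trick works for alert.log, where the timestamp sits on its own line and the following lines belong to it. A minimal sketch (not from the talk), assuming timestamps like "Thu Jan 11 17:15:54 2018":

#!/usr/bin/env python3
import json
import re
import sys

# alert.log format: a bare timestamp line, then one or more message lines
TS_LINE = re.compile(r"^\w{3}\s+\w{3}\s+\d+\s+\d+:\d+:\d+\s+\d{4}$")

current_ts = None
for line in sys.stdin:
    line = line.rstrip("\n")
    if TS_LINE.match(line):
        current_ts = line  # remember the most recent timestamp
    elif current_ts and line:
        print(json.dumps({"event_time": current_ts, "message": line}))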
Step 1: Make log structure explicit
Jan 11 20:30:59 host-12 kernel: [185012.404818] sd 2:0:1:168: [sdgfp] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185012.425995] sd 2:0:1:167: [sdgfn] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185012.426272] sd 2:0:1:169: [sdgfs] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185022.726345] rport-2:0-22: blocked FC remote port
Jan 11 20:30:59 host-12 kernel: [185076.513763] sd 2:0:13:0: [sdd] Write Protect is off
{
  "message": "host-12 kernel: [185012.404818] sd 2:0:1:168: [sdgfp] Attached SCSI disk",
  "event_time": "Jan 11 20:30:59"
}
{
  "message": "host-12 kernel: [185012.425995] sd 2:0:1:167: [sdgfn] Attached SCSI disk",
  "event_time": "Jan 11 20:30:59"
}
Step 1: Make log structure explicit
Jan 11 20:30:59 host-12 kernel: [185012.404818] sd 2:0:1:168: [sdgfp] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185012.425995] sd 2:0:1:167: [sdgfn] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185012.426272] sd 2:0:1:169: [sdgfs] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185022.726345] rport-2:0-22: blocked FC remote port
Jan 11 20:30:59 host-12 kernel: [185076.513763] sd 2:0:13:0: [sdd] Write Protect is off
{
  "message": "host-12 kernel: [185012.404818] sd 2:0:1:168: [sdgfp] Attached SCSI disk",
  "event_time": "2018-01-11 20:30:59.000"
}
{
  "message": "host-12 kernel: [185012.425995] sd 2:0:1:167: [sdgfn] Attached SCSI disk",
  "event_time": "2018-01-11 20:30:59.000"
}
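Syslog-style timestamps omit the year, so the normalization step has to supply one. A minimal sketch, assuming the log's year is known:

from datetime import datetime

def normalize(ts, year=2018):
    # "Jan 11 20:30:59" -> "2018-01-11 20:30:59.000"
    parsed = datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S")
    return parsed.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]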
Step 2: Table-ize
“Structured” (i.e. JSON) logs → Table “directory” + Table “Metadata” (CREATE TABLE …)
Step 2: Create table and “ingest” data
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
  `event_time` timestamp,
  `message` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://databucket/mydb/mytable/'
;

> cp log*.json /data/mydb/mytable
> hadoop fs -put log*.json /data/mydb/mytable
> aws s3 cp . s3://databucket/mydb/mytable/ --recursive --exclude "*" --include "log*.json"
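The S3 "ingest" can also be scripted. A minimal boto3 sketch (bucket and prefix are the same placeholders as above):

import glob
import boto3

s3 = boto3.client("s3")
for path in glob.glob("log*.json"):
    # anything landing under the table LOCATION becomes queryable
    s3.upload_file(path, "databucket", "mydb/mytable/" + path)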
Step 3: Transform (into final form)
• Rollup
• Aggregations
• Materializing complex joins
• Partitioning
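In Glue, these transforms are just Spark code. A minimal PySpark sketch of a daily rollup with partitioned output (paths and the db column are illustrative, not from the talk):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# staging JSON in, partitioned Parquet out
logs = spark.read.json("s3://databucket/mydb/mytable/")
daily = (logs
         .withColumn("day", F.to_date("event_time"))
         .groupBy("day", "db")
         .agg(F.count("*").alias("errors")))
daily.write.mode("overwrite").partitionBy("day") \
    .parquet("s3://databucket/mydb/mytable_daily/")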
Step 3: Compact-ize
“Scan all files!”
Open data formats
TSV:
• Text-based
• Row-oriented
• Some compression
• Limited filtering
• Easy to make
Parquet:
• Binary
• Columnar
• Really good compression
• Advanced filtering
• More difficult to make
Step 3: Transform and Compact-ize
JSON logs → Parquet logs
• Format Transform
• SQL Transform
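For a one-off format transform, even pandas will do. A sketch (requires pyarrow; file names are illustrative):

import pandas as pd

# newline-delimited JSON in, compressed columnar Parquet out
df = pd.read_json("kern.json", lines=True)
df.to_parquet("kern.parquet", compression="snappy")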
Step 4: Query
Parquet logs
The SQL-on-logs pipeline
“Raw” logs → Structured logs → Staging table(s) → Final table(s)
Step 5: Make it simple with AWS
“Raw” logs → Structured logs → “Staging” S3 bucket → AWS Glue → “Final” S3 bucket → AWS Athena
AWS Athena
• “Query data in S3 using SQL”
• Serverless Presto cluster
• Rich SQL
• Supports multiple open data formats
• Fast, interactive performance
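Athena is also scriptable. A minimal boto3 sketch (bucket names are placeholders):

import time
import boto3

athena = boto3.client("athena")
qid = athena.start_query_execution(
    QueryString="SELECT count(*) FROM mydb.mytable",
    QueryExecutionContext={"Database": "mydb"},
    ResultConfiguration={"OutputLocation": "s3://databucket/athena-results/"},
)["QueryExecutionId"]

# poll until the query finishes, then fetch the result set
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state not in ("QUEUED", "RUNNING"):
        break
    time.sleep(1)
rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]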
AWS Glue
• “Prepare and load data (ETL!)”
• Serverless Apache Spark
• Crawlers: “data discovery” and automatic catalog maintenance
• Job scheduling
• Integrated with many data
“sources” and “sinks”
• ETL script generation (or BYO)
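Crawlers can be set up from code too. A minimal boto3 sketch (crawler and role names are assumptions):

import boto3

glue = boto3.client("glue")

# the crawler infers the schema and keeps the catalog table current
glue.create_crawler(
    Name="mytable-crawler",
    Role="MyGlueServiceRole",  # an existing IAM role with Glue + S3 access
    DatabaseName="mydb",
    Targets={"S3Targets": [{"Path": "s3://databucket/mydb/mytable/"}]},
)
glue.start_crawler(Name="mytable-crawler")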
Demo time
Extending the SQL-on-logs pipeline
Pre-parse logs in the cloud
“Raw” logs → S3: “Raw” logs → Lambda: to_json() → S3: “Staging” (JSON) → Glue: to_parquet() → S3: “Final” (Parquet) → Athena
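to_json() maps naturally onto a Lambda handler fired by each raw-log upload. A sketch (the staging bucket name and parse_line() are hypothetical; parse_line() would hold the regex logic from Step 1):

import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # triggered by an S3 PUT of a raw log file
    rec = event["Records"][0]["s3"]
    bucket, key = rec["bucket"]["name"], rec["object"]["key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode()

    out = []
    for line in body.splitlines():
        parsed = parse_line(line)  # hypothetical: the regex parsing from Step 1
        if parsed:
            out.append(json.dumps(parsed))

    s3.put_object(Bucket="staging-bucket", Key=key + ".json",
                  Body="\n".join(out).encode())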
Build materialized views
“Raw” logs → S3: “Raw” logs → Lambda: to_json() → S3: “Staging” (JSON) → Glue: to_parquet() → S3: “Final” (Parquet) → Athena
Glue: make_mview() → S3: “Final” (Parquet)
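One way to implement make_mview() is a scheduled CTAS statement that rewrites an aggregate as a new Parquet-backed table. A sketch, assuming Athena CTAS support (table and bucket names are placeholders):

import boto3

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="""
        CREATE TABLE mydb.errors_by_day
        WITH (format = 'PARQUET',
              external_location = 's3://final-bucket/mydb/errors_by_day/') AS
        SELECT date_trunc('day', event_time) AS day, count(*) AS errors
        FROM mydb.mytable
        GROUP BY 1
    """,
    ResultConfiguration={"OutputLocation": "s3://databucket/athena-results/"},
)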
Use different SQL front-ends
“Raw” logs → S3: “Raw” logs → Lambda: to_json() → S3: “Staging” (JSON) → Glue: to_parquet() → S3: “Final” (Parquet) → Athena
to_redshift() → Redshift
to_oracle() → RDS ORACLE
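to_redshift() can be as thin as a COPY from the “Final” bucket. A sketch, assuming Redshift's Parquet COPY support and a psycopg2 connection (cluster, role, and table names are placeholders):

import psycopg2

conn = psycopg2.connect(host="my-cluster.xyz.redshift.amazonaws.com",
                        port=5439, dbname="mydb",
                        user="admin", password="...")
with conn, conn.cursor() as cur:
    # load the Parquet "final" table straight into Redshift
    cur.execute("""
        COPY mytable
        FROM 's3://final-bucket/mydb/mytable/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS PARQUET
    """)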
Thank you!
maxym@amazon.com
