
Eight's Behavior Log Utilization Platform Built with AWS Services

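These are the slides for a talk on rebuilding Eight's behavior log collection platform, presented at AWS Dev Day Tokyo 2018, DB Track #04. The talk shares the difficulties encountered when using AWS Glue and the key points of the implementation.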

Published in: Technology

  1. Eight's Behavior Log Utilization Platform Built with AWS Services. Sansan, Eight. AWS Dev Day Tokyo 2018, Database Track #04, Nov. 01, 2018
  2. Sansan Eight
 SRE Amazon Web Services (2015)
 JAWS-UG (10)
 2014
  3. Your Business Network Eight DSOC Sansan
  4. SNS
  5. Your business network
  6. Eight AWS
  7. Web/API KPI DB NW Web/API Feed Recommend
  8. Agenda
  9. Eight
  10.
  11. Eight =
 Dev Day Tokyo 2017
 https://d1.awsstatic.com/events/jp/2017/summit/devday/D4T8-5.pdf
  12. KPI
  13. Eight
  14. Aggregator S3 (store) Elasticsearch (view) Cloud Service Redshift (view/store) Redash (view) Servers / Applications Kinesis Data Streams Lambda DynamoDB Lambda SQS ElastiCache EC2 (Poll) DynamoDB Lambda ElastiCache Athena (view)
  15. Eight
  16. Cloud Service Redshift (view/store) Redash (view) Servers / Applications Aggregator S3 (store) Elasticsearch (view) Kinesis Data Streams Lambda DynamoDB Lambda SQS ElastiCache EC2 (Poll) DynamoDB Lambda ElastiCache Athena (view)
  17. fluentd Redshift Redshift or Redash Applications Cloud Service Redshift Redash Analyst
  18. fluentd + Kinesis Data Firehose S3 + Redshift Glue + Redshift + Redash +
 Aggregators → Kinesis Data Firehose → S3 (Delivered logs) → Lambda (Classify by name) → S3 (Classified Logs) → Glue → Redshift → Redash → Analyst
  19. Aggregators → Kinesis Data Firehose → S3 (Delivered logs) → Lambda (Classify by name) → S3 (Classified Logs) → Glue → Redshift
 fluent-plugin-kinesis Kinesis Data Firehose S3 Firehose Lambda Firehose S3 Redshift S3
  20. Kinesis Data Firehose Lambda (Classify by name) Glue Redshift Kinesis Data Streams Kinesis Data Firehose S3 Lambda 1 -1 Firehose DeliveryStream Lambda Redshift COPY OK Glue S3 Athena Redshift Spectrum
  21. AWS Glue: Fully Managed ETL Service
  22. Glue Redshift
  23. Glue DPU 0.44 1 ETL 10 5DPU 2DPU DPU 0.44 1 10 2DPU
 1: 0.44 (USD/DPU hour) * 10/60 (hour) * 2 (DPU) = 0.147 (USD)
 50: 50 * 0.44 (USD/DPU hour) * 10/60 (hour) * 2 (DPU) = 7.333 (USD)
 30: $5,280
  24. 76 1 1 20 25 IOwait S3 Read / Redshift Write Python concurrent.futures.ThreadPoolExecutor 4DPU 15 5 5 1
  25. Redshift 1 Job Bookmark
  26. choice Glue int/long 100 3458395800 1 string/ value JSON long/string ID 'undefined' Glue (DynamicFrame) choice Redshift NULL information_schema.columns resolveChoice
 {"user_id": 3234567890, "device_type": "iOS", "logged_in": 1234567890}
 {"user_id": "123", "device_type": "iOS", "logged_in": 1234567890}
 {"user_id": 12345, "device_type": "Android", "logged_in": 1234567890}
 {"user_id": {"int": null, "long": 3234567890}, "device_type": "iOS", "logged_in": 1234567890}
 {"user_id": {"int": "123", "long": null}, "device_type": "iOS", "logged_in": 1234567890}
 {"user_id": {"int": 12345, "long": null}, "device_type": "Android", "logged_in": 1234567890}
  27. timestamp UNIX timestamp "YYYY-mm-dd HH:MM:SS" NG or OK "YYYY-mm-dd HH:MM:SS" JSON string COLUMN_NAME_string COLUMN_NAME SELECT S3 Glue timestamp ApplyMapping source - target COLUMN_NAME_string
 {"user_id": 1234567, "device_type": "iOS", "logged_in": "2018-11-01 01:00:00"}
 {"user_id": "123", "device_type": "iOS", "logged_in": "2018-11-01 03:00:00"}
 {"user_id": 12345, "device_type": "Android", "logged_in": "2018-11-01 05:20:00"}
 Table columns:
 id | bigint | not null default…
 user_id | bigint | not null
 device_type | integer | not null
 logged_in | timestamp without time zone | not null
 Table columns with logged_in_string added:
 id | bigint | not null default…
 user_id | bigint | not null
 device_type | integer | not null
 logged_in | timestamp without time zone | not null
 logged_in_string | character varying(255) |
  28. Not Null Redshift Not Null information_schema.columns NULL
  29. S3 to Glue GlueContext.create_dynamic_frame.from_catalog Glue GlueContext.create_dynamic_frame.from_options S3 S3 key pyspark.sql.functions.input_file_name() from_catalog
  30. Glue to Redshift GlueContext.write_dynamic_frame.from_jdbc_conf COPY spark-redshift DataFrame information_schema DynamicFrame DropNullFields COPY TRUNCATECOLUMNS Job Bookmark
  31. Job Bookmark Job Bookmark S3 Redshift
  32. Glue CloudWatch Logs /aws-glue/jobs/error
  33. Glue 10 Trigger 1 1 3
  34. information_schema.columns Redshift Redshift SELECT Job Bookmark 1 1 Max concurrency = 1
  35. Aggregator S3 (store) Elasticsearch (view) Redshift (view/store) Redash (view) Servers / Applications Kinesis Data Streams Lambda DynamoDB Lambda SQS ElastiCache EC2 (Poll) DynamoDB Lambda ElastiCache Athena (view) Kinesis Data Firehose S3 Lambda S3 Glue
  36. Personalized Feed Feed Feed Aurora
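
The cost arithmetic on slide 23 can be reproduced in a few lines of Python. This is only a sketch: the rate (0.44 USD per DPU-hour), the 10-minute duration, the 2 DPUs and the multiplier of 50 are taken from the slide, while the function name is hypothetical. Reading the final figure as "50 runs every hour for 30 days" is an assumption; it is not stated in the surviving slide text, but it does reproduce the slide's $5,280.

```python
def glue_run_cost(usd_per_dpu_hour: float, minutes: float, dpus: int) -> float:
    """Cost in USD of one job run billed by DPU-hour (hypothetical helper)."""
    return usd_per_dpu_hour * (minutes / 60) * dpus


# Figures from slide 23: 0.44 USD/DPU-hour, 10 minutes, 2 DPUs.
single_run = glue_run_cost(0.44, 10, 2)

# The slide multiplies the single-run cost by 50.
fifty_runs = 50 * single_run

# Assumption: those 50 runs happen every hour, 24 hours a day, for 30 days.
thirty_days = fifty_runs * 24 * 30

print(f"single run : {single_run:.3f} USD")
print(f"50 runs    : {fifty_runs:.3f} USD")
print(f"30 days    : {thirty_days:,.0f} USD")
```

Under that reading, the per-run cost looks negligible, but the minimum billing unit dominates once many small jobs run on a frequent schedule.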
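
Slide 24 names Python's concurrent.futures.ThreadPoolExecutor next to I/O wait on S3 reads and Redshift writes. The sketch below shows that general pattern, assuming the aim is to overlap I/O-bound per-table work inside a single job. `load_table`, the table names and the worker count are all hypothetical, and the `time.sleep` call merely stands in for the real read and write.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_table(table: str) -> str:
    """Hypothetical stand-in for one table's S3 read and Redshift write."""
    time.sleep(0.05)  # simulates time spent waiting on I/O
    return table


def load_all(tables, max_workers=8):
    """Run load_table for every table on a thread pool, collecting outcomes."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(load_table, t): t for t in tables}
        for future in as_completed(futures):
            table = futures[future]
            try:
                results[table] = ("ok", future.result())
            except Exception as exc:  # keep going if one table fails
                results[table] = ("failed", repr(exc))
    return results


tables = [f"log_table_{i}" for i in range(16)]  # hypothetical table names
results = load_all(tables)
print(sorted(results))
```

Threads suit this case because the work is mostly waiting on the network, so the interpreter lock is not the bottleneck. Recording failures per table, rather than letting the first exception abort the loop, is a design choice of this sketch and not something the slide specifies.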

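Slide 26 describes a column (`user_id`) that arrives as an int, a long, or a string such as 'undefined', which Glue represents as a choice type. The following is plain Python, not the Glue API. It only illustrates the idea of forcing an ambiguous column to a single target type and turning values that cannot be converted into NULL. The `TARGET_TYPES` mapping is a hypothetical stand-in for types looked up from information_schema.columns, and the third record is made up from the 'undefined' value mentioned on the slide. Glue's own resolveChoice may treat individual values differently.

```python
import json

# Hypothetical target types, standing in for a lookup against
# information_schema.columns on the destination table.
TARGET_TYPES = {"user_id": int, "device_type": str, "logged_in": int}


def cast_or_none(value, target):
    """Convert value to target type; return None when conversion fails."""
    if value is None:
        return None
    try:
        return target(value)
    except (TypeError, ValueError):
        return None


def normalize(record: dict) -> dict:
    """Force every known column of a log record to its target type."""
    return {
        column: cast_or_none(record.get(column), target)
        for column, target in TARGET_TYPES.items()
    }


lines = [
    '{"user_id": 3234567890, "device_type": "iOS", "logged_in": 1234567890}',
    '{"user_id": "123", "device_type": "iOS", "logged_in": 1234567890}',
    '{"user_id": "undefined", "device_type": "Android", "logged_in": 1234567890}',
]
rows = [normalize(json.loads(line)) for line in lines]
for row in rows:
    print(row)
```

Here the numeric string is converted to an integer and the 'undefined' value becomes None, which would load as NULL.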