Building a Log Analysis Pipeline

Quick internal presentation on work we've been doing to deploy an ELK stack for our security analysis needs.

Published in: Technology


  1. Building a Log Analysis Pipeline: A Brief Tour
  2. Problem
     • Limited visibility into the environment
     • SIEM solutions inadequate for risk management purposes
     • Requests for extracts difficult or impossible to provide
     • Unable to correlate different data sources
  3. Requirements
     • Cheap
       ◦ Budget + labor ≈ 0
       ◦ Hobby project
     • Scalable
       ◦ SIEM data in the TB range
       ◦ Need to retain historical data
       ◦ Decoupled from logging infrastructure
     • Performance
       ◦ Batch processing is okay…
       ◦ …but batches can't be too slow
       ◦ Need near-real-time exploration options
     • Confidentiality
       ◦ This is security data. Let's not create more problems than solutions.
  4. Resources
     • SIEM does a good job with log aggregation
       ◦ Stores raw syslog events
     • Easy access to raw events on the SIEM
     • Data is relatively large, but not BIG
  5. A Plan Is Born
     "I have a cunning plan!" – S. Baldrick, Blackadder
  6. Early Approaches
     • Method 1 – MongoDB
       ◦ Python regexes to create JSON
       ◦ Load into MongoDB
       ◦ Run Mongo MapReduce
       ◦ Worked, but slow; required AWS for sufficient memory to run the MapReduce flows
     • Method 2 – Pure Python
       ◦ Python regexes to create CSV
       ◦ Pull off to the analysis workspace
       ◦ Python MapReduce in the shell
       ◦ Worked, but limited and rigid
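The regex-to-JSON step from Method 1 can be sketched roughly like this; the pattern, field names, and sample line are hypothetical stand-ins for whatever the SIEM's raw events actually look like:

```python
import json
import re

# Hypothetical pattern for a firewall-style syslog event; the real extracts
# would use patterns matched to the SIEM's raw log format.
LINE_RE = re.compile(
    r"(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<action>ACCEPT|DENY) src=(?P<src>\S+) dst=(?P<dst>\S+) dport=(?P<dport>\d+)"
)

def line_to_json(line):
    """Parse one raw syslog line into a JSON document for loading into MongoDB."""
    m = LINE_RE.match(line)
    return json.dumps(m.groupdict()) if m else None

sample = "Mar  1 12:00:01 fw01 DENY src=10.0.0.5 dst=192.0.2.9 dport=445"
print(line_to_json(sample))
```

The same parse drives Method 2; the only difference is emitting CSV rows instead of JSON documents.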
  7. Premature Data Truncation Leads to Poor Results
     • Lose the ability to query context
     • Additional queries not possible without custom redesign
       ◦ Blocks vs. passes
       ◦ Port information
     • Querying peer-node relations, etc. not practical
  8. Unleash the ELK!
  9. Elasticsearch
     • Full-text search engine based on Apache Lucene
     • Incredibly fast and flexible query DSL
     • Built for distributed search (horizontal scaling) from the ground up
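To illustrate the query DSL, here is a sketch of the kind of question this pipeline wants to answer, built as a plain dict; the index name and field names are hypothetical, since the real mappings come from whatever Logstash emits:

```python
import json

# Hypothetical query: denied connections to port 445 over the last day,
# bucketed by source address.
query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"action": "DENY"}},
                {"term": {"dport": "445"}},
                {"range": {"@timestamp": {"gte": "now-1d"}}},
            ]
        }
    },
    "aggs": {"top_sources": {"terms": {"field": "src", "size": 10}}},
}

# This body would be POSTed to a node, e.g.:
#   curl -s localhost:9200/logstash-*/_search -d @query.json
print(json.dumps(query, indent=2))
```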
  10. Logstash
     • Open-source log intake and processing
     • Easy-to-use pattern matching
       ◦ No more opaque regexes!
     • Terrific metadata enrichment
     • Scores of plugins
       ◦ Inputs, outputs, filters, codecs
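The pattern matching is grok, which layers named, reusable patterns over raw regexes. A minimal filter sketch, assuming firewall-style events; the field names and the geoip source field are hypothetical:

```conf
filter {
  grok {
    # Named built-in patterns instead of an opaque regex; events that
    # fail to match get a _grokparsefailure tag instead of vanishing.
    match => { "message" => "%{SYSLOGTIMESTAMP:ts} %{SYSLOGHOST:host} %{WORD:action} src=%{IP:src} dst=%{IP:dst}" }
  }
  geoip {
    source => "src"   # enrichment: adds geo fields from MaxMind data
  }
}
```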
  11. Kibana
     • Lightweight HTML5 interface to Elasticsearch for logs
     • Not a full SIEM replacement
     • Targeting the Splunk market
  12. Infrastructure
     • On the SIEM
       ◦ Python for creating extracts
       ◦ Bash for tarring up raw logs
     • Transport
       ◦ SCP from the SIEM to a Windows file share
       ◦ USB from the Windows file share
       ◦ Sneakernet to the analysis workspace
     • On the analysis workspace
       ◦ Vagrant
       ◦ Chef
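The SIEM-side tarring step might look roughly like this; the directories and the file-share hostname are hypothetical (temp dirs stand in for the real paths so the sketch is self-contained):

```shell
#!/bin/sh
# Sketch: bundle a day's raw syslog files into a dated tarball for transfer.
set -eu
RAW_DIR=$(mktemp -d)       # stand-in for the SIEM's raw log directory
STAGE_DIR=$(mktemp -d)     # stand-in for the SCP staging area
echo "Mar  1 12:00:01 fw01 DENY src=10.0.0.5" > "$RAW_DIR/messages.log"
DAY=$(date +%Y-%m-%d)
tar -czf "$STAGE_DIR/raw-logs-$DAY.tar.gz" -C "$RAW_DIR" .
# Next hop toward the Windows file share (not run here):
#   scp "$STAGE_DIR/raw-logs-$DAY.tar.gz" user@fileshare:/incoming/
ls -l "$STAGE_DIR"
```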
  13. Demo
  14. Pieces Involved
  15. Next Steps – Infrastructure
     • Complete provisioning scripts for Hadoop & AWS
     • Transfer raw GZ files to an encrypted S3 bucket
       ◦ Allows extract jobs to run on AWS EMR
     • Process via Logstash into Elasticsearch
       ◦ Elasticsearch for short-term exploration
       ◦ Archive structured data to S3
     • Set up the Elasticsearch-Hadoop connector
     • Use AWS EMR for ad hoc extracts off the structured S3 buckets
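One way the archived structured data could be laid out so EMR extracts can prune by source and date; this key scheme is a hypothetical sketch, not the deck's actual layout:

```python
import datetime

def archive_key(source, day, part=0):
    """Build a partition-style S3 key for archived, structured log data.

    A Hive/EMR-friendly layout: ad hoc jobs can restrict a scan to one
    log source and date range instead of reading the whole bucket.
    """
    return f"structured/source={source}/date={day:%Y-%m-%d}/part-{part:05d}.json.gz"

print(archive_key("firewall", datetime.date(2014, 3, 1)))
# Each batch of Logstash output would be uploaded under a key like this,
# with server-side encryption enabled on the (hypothetical) bucket.
```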
  16. Next Steps – Data Products
     • Full MaxMind integration
       ◦ Accuracy & detail
     • Reputation
       ◦ REN-ISAC integration
     • Graph exploration
       ◦ Who else talked to whom
       ◦ Clustering
     • Future
       ◦ Proxy logs
       ◦ DNS logs
  17. Thanks
     • Google Groups
     • IRC: #logstash, #chef, #vagrant, #elasticsearch
     • Seattle Search and Machine Learning Meetup
     • Seattle Chef Meetup
     • Hortonworks Sandbox
     • The Phoenix Project
     • Data-Driven Security
     • AlienVault
     • …and more!
