Hadoop For OpenStack Log Analysis


Published on

Published in: Technology, Business
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Hadoop For OpenStack Log Analysis

  1. 1. Hadoop For OpenStack Log AnalysisHavana Summit, April 16th, 2013,Mike PittaroPrincipal Architect, Big Data Solutions@pmikeyp Freenode: mikeypmichael_pittaro@dell.com
  2. 2. The Problem : Operating OpenStack at Scale2 Revolutionary Cloud Team
  3. 3. The Search for the Holy Grail of OpenStack Operations • Imagine if we could follow a request … – Through the entire system … – Across compute, storage, network … – Independent of physical nodes … – With timestamps … – Correlated with events outside OpenStack …3 Revolutionary Cloud Team
  4. 4. It’s Easy ! Collect Analyze ? Sleep ! Logs4 Revolutionary Cloud Team
  5. 5. OpenStack Log Analysis is a Big Data Problem Big Data is when the data itself is part of the problem. Volume • A large amount of data, growing at large rates Velocity • The speed at which the data must be processed Variety • The range of data types and data structure5 Revolutionary Cloud Team
  6. 6. Initial Focus and Scope of Our Efforts • The Operators – Assist in running OpenStack – Not for tenants • The Data – Load all detail into Hadoop – Extract and index significant fields – Enable future analysis • The Patterns – What works – What is repeatable across installs – How can we collaborate6 Revolutionary Cloud Team
  7. 7. Hadoop 101 – Simplified Block Diagram HUE Mahout Oozie (Machine (Workflow) (Web UI) Learning) Sqoop Pig Hive API’s Flume (Batch (SQL-ish Language) query) JDBC MapReduce Distributed HBase Database Processing HDFS Distributed Storage7 Revolutionary Cloud Team
  8. 8. The Big Pieces • Log Collection – Continuous streaming of log data into Hadoop from OpenStack • Intelligent Log parsing, Indexing and Search – Should ‘know’ about OpenStack • Well Defined Storage Organization – Defined schema for the data – Predefined queries for high level status - dashboard • Straightforward implementation pattern – Add as little complexity as possible • Ability to perform deeper analysis – Hadoop enables this8 Revolutionary Cloud Team
  9. 9. OpenStack Log Analysis Block Diagram OpenStack Nodes Search Python Syslog Logging Solr Cloud Query FlumeNG Agent Pig Hive Lucene Indexes Nagios + Ganglia Avro Avro MapReduce Sqoop HDFS Jobs /Flume Avro9 Revolutionary Cloud Team
  10. 10. Current Development Status • Batch Only, no Flume Collection • Converting logs to AVRO format • First cut of schema in place • Loading into Hadoop • Processing into SOLR indexes • Starting to look at data – Solr Searches – Pig scripts10 Revolutionary Cloud Team
  11. 11. Schema Thoughts 2013-03-26 11:57:41 WARNING nova.db.sqlalchemy.session [req-ace2ccc0-919e-4fd1-9f3a-671c0c87d28f None None] Got mysql server has gone away: (2006, MySQL server has gone away) {"namespace": "logfile.openstack", "type": "record", "name": "logentry", "fields": [ {"name": "hostname", "type": "string"}, {"name": "date", "type": "string"}, {"name": "time", "type": "string"}, {"name": "level", "type": "string"}, {"name": "module", "type": "string"}, {"name": "request_id1", "type": ["string", "null"]}, {"name": "request_id2", "type": ["string", "null"]}, {"name": "request_id3", "type": ["string", "null"]}, {"name": "data", "type": "string"} ] }11 Revolutionary Cloud Team
  12. 12. Demo: Where we are today • Solr Indexing and Search – Example of indexed fields and searching. • Pig for batch analysis – Reconstruct a sequence of messages related to an API request12 Revolutionary Cloud Team
  13. 13. Data Collection Thoughts • Sources – OpenStack subsystems – Syslog files – Nagios and Ganglia – General Infrastructure Data – Network Switches / Routers • Input Formats – Mostly Semi-structured text – Subsystem, timestamp, hostname, severity and error level are important • Output Formats – Avro – Thrift – Protocol Buffers13 Revolutionary Cloud Team
  14. 14. Log Collection Thoughts • Well understood patterns – Evolving best practices • Commonly Used Tools – Kafka – Scribe – Flume and FlumeNG • Key Requirements – Distributed – Reliable – Aggregators – consolidate streams – Store and Forward – when links are down14 Revolutionary Cloud Team
  15. 15. Storage Organization Thoughts • File Organization within Hadoop – Naming Convention – Directory Organization • Data Lifecycle – Input and Staging – ‘Hot’ and ‘Cold’ Data – Tiered Indexes – Compression – Archival and Deletion15 Revolutionary Cloud Team
  16. 16. What should we do next ? • Document the basic patterns so far ? • Are there any related efforts ? • Begin deeper discussions – Lots of decisions to make – Need community input and suggestions • Collaborate on Schema Design • What upstream OpenStack changes are needed ? • Get sample logs in hands of Hadoopians – Cleansed reference log sets would be useful16 Revolutionary Cloud Team
  17. 17. References • The unified logging infrastructure for data analytics at Twitter • Building LinkedIn’s Real-time Activity Data Pipeline • Advances and challenges in log analysis • BP: Ceilometer HBase Storage Backend • BP: Cross Service Request ID • Log Everything All the Time • Holy Grail, Gangam Style17 Revolutionary Cloud Team