Large Scale Log Analysis with HBase and Solr at Amadeus (Martin Alig, ETH Zurich)
This talk was held at the second meeting of the Swiss Big Data User Group on July 16 at ETH Zürich.

http://www.bigdata-usergroup.ch/item/296477

Large Scale Log Analysis with HBase and Solr at Amadeus (Martin Alig, ETH Zurich) Presentation Transcript

  • 1. Large Scale Log Analysis with HBase and Solr at Amadeus. Martin Alig (aligma@student.ethz.ch)
  • 2. Overview: Problem; Solution - Overview; HBase; Solr; Solution - Details; Results
  • 3. Problem Amadeus is the worlds leading technology provider to the travel industry, providing marketing, distribution and IT services worldwide. The Amadeus computer reservation system (CRS) processed 850 million billable travel transactions in 2010. Current logging framework produces 100000 - 1000000 messages per secondMontag, 16. Juli 2012 3
  • 4. Problem - Log Messages: Messages have an average size of 1 KB. A message can be anything: XML, EDIFACT, hex dump, ... A few fixed attributes are given per message: timestamp, source, various IDs.
  • 5. Problem - Current Solution: Write log messages to plain text files. Split, compress and copy them to a SAN. Queries? Search? Statistics?
  • 6. Solution Overview: Use Apache HBase for storage and instant random access, Hadoop MapReduce for complex queries, and Apache Solr as a full-text search engine for queries on the log messages.
  • 7. Apache HBase: Open-source, non-relational, distributed database. Modeled after Google's BigTable. Runs on top of the Hadoop Distributed File System (HDFS).
  • 8. HBase - Terms: Region: a contiguous range of rows stored together; dynamically split/merged and distributed. RegionServer (slave): serves regions, i.e. the data for reads and writes. HMaster (master): responsible for coordination; assigns regions to RegionServers, detects failures, admin functions.
  • 9. HBase - Architecture (diagram: clients talk to a ZooKeeper ensemble, the HMaster and the RegionServers; the RegionServers store their data in HDFS)
  • 10. HBase - Data Access: Java API; REST; Apache Avro, Apache Thrift; Hadoop MapReduce.
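
    To make the Java API concrete, here is a minimal sketch of writing and reading one log record with the client API of that era (circa HBase 0.92/0.94). The table name "logs", the column family "msg" and the row-key scheme are illustrative assumptions, not the schema used in the talk.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LogWriteRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "logs");  // hypothetical table name

            // Hypothetical row key: source plus reversed timestamp, so recent rows sort first
            byte[] rowKey = Bytes.toBytes("sitst201-" + (Long.MAX_VALUE - System.currentTimeMillis()));

            // Write one log message into column family "msg"
            Put put = new Put(rowKey);
            put.add(Bytes.toBytes("msg"), Bytes.toBytes("body"), Bytes.toBytes("<xml>...</xml>"));
            table.put(put);

            // Instant random read of the same row
            Result result = table.get(new Get(rowKey));
            System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("msg"), Bytes.toBytes("body"))));

            table.close();
        }
    }
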
  • 11. HBase - Secondary Indexes: No native support for secondary indexes. Different choices: client managed (write the value to the data table and the index to an index table); coprocessors that automatically create the secondary index; periodic update (use a MapReduce job to build the index).
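
    A hedged sketch of the first option, a client-managed index: the client writes the data row and then a pointer row to an index table whose key embeds the indexed attribute. The table names and the "CorrID" attribute are assumptions for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ClientManagedIndex {
        // Writes the message to the data table and a pointer row to a hypothetical index table.
        public static void insert(Configuration conf, byte[] rowKey, String corrId, byte[] message)
                throws Exception {
            HTable dataTable = new HTable(conf, "logs");
            HTable indexTable = new HTable(conf, "logs_by_corrid");
            try {
                Put dataPut = new Put(rowKey);
                dataPut.add(Bytes.toBytes("msg"), Bytes.toBytes("body"), message);
                dataTable.put(dataPut);

                // Index row key "<CorrID>-<data row key>": all rows of a CorrID sort together
                Put indexPut = new Put(Bytes.add(Bytes.toBytes(corrId + "-"), rowKey));
                indexPut.add(Bytes.toBytes("ref"), Bytes.toBytes("key"), rowKey);
                indexTable.put(indexPut);
            } finally {
                dataTable.close();
                indexTable.close();
            }
        }
    }
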
  • 12. HBase - Coprocessors: Run arbitrary code on any node. Observer: RegionObserver, MasterObserver, WALObserver provide hooks for code execution (prePut, postPut, preGet, postGet, ...). Endpoint: installed on nodes, executed on client request.
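
    As an illustration of the Observer variant, a hedged sketch of a RegionObserver that maintains the same kind of index table from inside the region server. The postPut signature follows the 0.92/0.94-era API (it changed in later releases); the column and table names are again invented.

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
    import org.apache.hadoop.hbase.util.Bytes;

    // Mirrors the (hypothetical) attr:corrid column of every inserted row into an index table.
    public class CorrIdIndexObserver extends BaseRegionObserver {
        @Override
        public void postPut(ObserverContext<RegionCoprocessorEnvironment> e,
                            Put put, WALEdit edit, boolean writeToWAL) throws IOException {
            List<KeyValue> kvs = put.get(Bytes.toBytes("attr"), Bytes.toBytes("corrid"));
            if (kvs.isEmpty()) {
                return;  // row carries no CorrID, nothing to index
            }
            byte[] corrId = kvs.get(0).getValue();

            Put indexPut = new Put(Bytes.add(corrId, Bytes.toBytes("-"), put.getRow()));
            indexPut.add(Bytes.toBytes("ref"), Bytes.toBytes("key"), put.getRow());

            HTableInterface indexTable = e.getEnvironment().getTable(Bytes.toBytes("logs_by_corrid"));
            try {
                indexTable.put(indexPut);
            } finally {
                indexTable.close();
            }
        }
    }
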
  • 13. Apache Solr: Apache Lucene plus many features such as a distributed index, distributed search, ... Apache Lucene is a high-performance, full-featured text search engine library.
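
    Once messages are indexed, querying them from Java goes through SolrJ. A minimal sketch against 3.x/4.x-era SolrJ; the core URL and the field names ("body", "rowkey") are assumed for illustration.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class LogSearch {
        public static void main(String[] args) throws Exception {
            // Hypothetical Solr core holding the indexed log messages
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/logs");

            SolrQuery query = new SolrQuery("body:EXRU*");  // full-text query on the message body
            query.setRows(20);

            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                // "rowkey" is a stored field pointing back at the HBase row
                System.out.println(doc.getFieldValue("rowkey"));
            }
        }
    }
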
  • 14. Solution - Details: The client inserts log messages into HBase and creates secondary indexes for predefined attributes. HBase uses coprocessor functionality to index each log message in Solr after the insert.
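
    A hedged sketch of how that HBase-to-Solr step could look: a RegionObserver whose postPut forwards the just-written message to Solr via SolrJ. This is not the talk's actual code; the endpoint, the field names and the per-Put (rather than batched) add are simplifying assumptions.

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Makes every inserted log message full-text searchable by pushing it to Solr after the Put.
    public class SolrIndexObserver extends BaseRegionObserver {
        private final SolrServer solr = new HttpSolrServer("http://solr-host:8983/solr/logs");

        @Override
        public void postPut(ObserverContext<RegionCoprocessorEnvironment> e,
                            Put put, WALEdit edit, boolean writeToWAL) throws IOException {
            List<KeyValue> kvs = put.get(Bytes.toBytes("msg"), Bytes.toBytes("body"));
            if (kvs.isEmpty()) {
                return;  // nothing to index for this row
            }
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("rowkey", Bytes.toString(put.getRow()));
            doc.addField("body", Bytes.toString(kvs.get(0).getValue()));
            try {
                solr.add(doc);  // a real deployment would batch and commit asynchronously
            } catch (Exception ex) {
                throw new IOException("Indexing in Solr failed", ex);
            }
        }
    }

    Whether Solr can absorb writes at this rate is exactly the open question raised later on the "Solution - Problems" slide.
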
  • 15. Solution - Cluster Configuration (diagram: a client node; a master node running ZooKeeper, NameNode, SecondaryNameNode and HMaster; slave nodes each running a DataNode, a RegionServer and a Solr instance; ...)
  • 16. Solution - HBase & MapReduce Very good integration of MapReduce into HBase Easy to use HBase as data source, data sink or both Provides helper classesMontag, 16. Juli 2012 16
  • 17. Solution - Problems: Can Solr keep up with HBase? Is Solr full-text search practical for log messages (XML, other formats, ...)?
  • 18. Results Not many, yet. Generic experiments with random data Experiments with real log data just startedMontag, 16. Juli 2012 18
  • 19. Results - Write Random Data - HBase Only: Insert random data, 1 KB records. Cluster configuration: 5 nodes, each with 24 GiB RAM, an Intel Xeon L5520 (2.26 GHz) CPU and 2x 73 GB 15k RPM SAS disks (RAID 1); node 1 is the master (NameNode, HMaster, ZooKeeper), nodes 2-5 are slaves (DataNode, RegionServer); the client runs on a separate node. The experiment was executed with and without secondary indexes (5 additional indexes).
  • 20. Results - Write Random Data - HBase Only: Without secondary indexes, ~30,000 avg. inserts/sec; with secondary indexes, ~6,000 avg. inserts/sec (not counting the index inserts).
  • 21. Results - Write Read Data - HBase & Solr: No real numbers yet. First tests: a single Solr instance indexes ~1,000 log messages per second.
  • 22. Questions
  • 24. HBase - Architecture. Source: HBase - The Definitive Guide
  • 25. HBase - Key Design. Source: HBase - The Definitive Guide
  • 26. HBase - Hardware: Master: RAM 24 GB, CPU dual quad-core, disks 4 x 1 TB SATA (RAID 0+1). Slave: RAM 24 GB or more, CPU dual quad-core, disks 6 x 1 TB SATA (JBOD).
  • 27. HBase - Monitoring: Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and grids. HBase provides metrics for Ganglia.
  • 28. Log Message Example (1): 2012/05/15 04:33:04.783757 sitst201 srvT2M-838059 Trace name: all0302 Message sent [con=19104962 (FE_EXT_TCIL-ISO9735_ETK-310_OPK2_ETK-REQ), cxn=1498840662 (172.17.39.174:13101), addr=0x1db58830, len=354, CorrID=000100E1A1EU42, MsgID=SQ8ZK36LG3TJ12JE6XMU2O8] UNB^]IATB^_1^]1AETH^_^_LY^]CDBETICKET^_^_LY^]120515^_0433^]00JNQPH79K0001^]^]^]O^UNH^]1^]TKCREQ^_08^_5^_1A^]000100E1A1EU42^DCX^]134^]<DCC VERS="1.0"><MW><UKEY VAL="EXRU$3013#GJ12V4K#1IZ" TRXNB="1"/><$
  • 29. Log Message Example (2): 2012/05/15 04:33:04.783671 sitst201 srvT2M-838059 Trace name: all0302 Query [SAP=1ASICDBETK, DCXID=EXRU$3013#GJ12V4K#1IZ, TRXNB=1, CorrID=000100E1A1EU42, MsgID=SQ8ZK36LG3TJ12JE6XMU2O8]
  • 30. Log Message Example (3): 2012/05/15 04:32:42.289282 sitmt301 muxT2-332108 Trace name: all0302 Message received [con=17697 (inSrvT2_TCIL_1), cxn=1626671045 (194.156.170.210:8000), addr=0x13e9b830, len=1710, CorrID=09B5840E, MsgID=OX7E09RYABBLS61HR2DXTL] +----- ADDR -----+--------------- HEX ---------------+----- ASCII ----+---- EBCDIC ----+ 0000000013e9b830 554e421d 49415442 1f311d31 4153494c UNB.IATB.1.1ASIL .+.............< 0000000013e9b840 53533243 53544e1d 3141304c 53534352 SS2CSTN.1A0LSSCR ......+....<.... 0000000013e9b850 591d3132 30353135 1f303433 321d3030 Y.120515.0432.00 ................ 0000000013e9b860 39 ...