This is Jianwen Wei's presentation at the 2011 International Workshop on Data Cloud (D-CLOUD 2011).
This presentation introduces a scalable cloud-based network log analysis platform named Analysis Farm. Analysis Farm meets our need to store and analyze more than 400 million log records every day.
D-Cloud 2011 http://www.cse.ust.hk/~lingu/D-CLOUD/ is held in affiliation with the 2011 International Conference on Cloud and Service Computing (IEEE CSC 2011) http://csc2011.comp.polyu.edu.hk/ . D-Cloud is held in Hong Kong on Dec 12.
Email me if you need the full-length paper. Be sure to introduce yourself in your message :-)
D-Cloud 2011 A Cloud-based Scalable Aggregation and Query Platform for Network Log Analysis
1. Analysis Farm:
A Cloud-based Scalable Aggregation and
Query Platform for Network Log Analysis
Jianwen Wei, Yusu Zhao, Kaida Jiang, Rui Xie, Yaohui Jin
School of Electronic Information and Electrical Engineering, SJTU
Network and Information Center, SJTU
Shanghai Jiaotong University
wei.jianwen@gmail.com
Dec 12th, 2011
The 2011 International Workshop on Data Cloud (D-CLOUD 2011), Hong Kong
15. Background
Related Work
• loggly.com
• “Logging as a Service”
• Yottaa.com
• Log-based Website performance analysis
• They use cloud-based solutions for
scalability
17. Design and Implementation
Our Approach: Cloud Computing + NoSQL
• Cloud Computing
• manageable, scalable, on demand resources
• OpenStack open source toolset for building clouds
• NoSQL (Not Only SQL)
• weaken ACID to improve performance
• MongoDB document-oriented distributed database
18. Design and Implementation
The Architecture of Analysis Farm
[Figure: layered architecture of Analysis Farm.
Application Layer: users send requests to mongos, which works with a configuration server and four mongod instances.
IaaS Layer: four VMs.
Hardware Resource Pool: CPU, memory, network, and an iSCSI storage pool.]
19. Design and Implementation
How do we tackle the three challenges?
• Storage Scalability
• Online Storage Expansion
• Computation Scalability
• MongoDB Scale out
• Query Agility
• MongoDB Handles ad hoc queries effectively
20. Design and Implementation
Address the Storage Scalability
Online Storage Expansion
1. The application servers ask the IaaS layer for more disk space.
2. The IaaS layer asks the hardware resource pool to attach new block devices.
3. The application servers execute online filesystem expansion.
No service interruption
21. Design and Implementation
Address the Computation Scalability
MongoDB Scale out
1. The IaaS provides a new server to the cluster.
2. The MongoDB cluster rebalances data automatically.
No service interruption
[Figure: a MapReduce request arrives at mongos, which runs a combiner; each of the four mongod instances runs a mapper and a combiner.]
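The mapper/combiner split on this slide can be illustrated with a small pure-Python sketch. It is not the MapReduce job used in Analysis Farm: the record fields `start_t` and `app` are borrowed from the index shown later in the deck, `start_t` is assumed to be in seconds, and the function names and sample records are hypothetical.

```python
from collections import Counter

BUCKET_SECONDS = 600  # 10-minute aggregation window (assumes start_t is in seconds)


def mapper(record):
    """Emit a ((bucket_start, app), 1) pair for one log record."""
    bucket = record["start_t"] - record["start_t"] % BUCKET_SECONDS
    return (bucket, record["app"]), 1


def combine(shard_records):
    """Pre-aggregate on one shard so only partial counts leave the shard."""
    partial = Counter()
    for record in shard_records:
        key, value = mapper(record)
        partial[key] += value
    return partial


def reduce_partials(partials):
    """Merge the per-shard partial counts into the final counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


# Hypothetical records spread over two shards.
shards = [
    [{"start_t": 0, "app": "http"}, {"start_t": 30, "app": "http"}, {"start_t": 610, "app": "dns"}],
    [{"start_t": 599, "app": "http"}, {"start_t": 600, "app": "dns"}],
]
result = reduce_partials(combine(shard) for shard in shards)
print(result)
```

Running the combiner next to the data keeps the traffic to the merging step proportional to the number of distinct keys per shard rather than the number of records, which is presumably why the diagram places a combiner on every mongod.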
22. Design and Implementation
Address the Query Agility
MongoDB handles ad hoc queries effectively
• Expressive Data Model
• Building Blocks for Compound Queries
• Aggregating tools such as Group, MapReduce
• Effective Optimization Methods, such as index
24. Experimental Results
Aggregating and Querying
• Aggregating Log
• Ad hoc Querying
SPEED is our primary focus.
25. Experimental Results
Experimental Setup for Aggregating
• Method
• Aggregate 10-min log with MongoDB MapReduce
• Dataset
• One day’s log records, ~400 million records
• Configurations for Comparison
• 1x farm: 4 mongod threads on a single server
• 4x farm: 4 mongod threads on four servers
• 8x farm: 8 mongod threads on eight servers
26. Experimental Results
Experimental Results for Aggregating
Type   Records Processed   Time   Rate (records/s)
1x     3201454             523s   6119
4x     3103742             200s   15568
8x     3317013             111s   29883
Experimental Results for 10-minute Log Aggregating
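To read the scaling behaviour off the table, one option is to compare the reported rates directly. The sketch below only restates the numbers from the slide; the `speedup` helper and the choice of the 1x farm as baseline are illustrative assumptions, not part of the original analysis.

```python
# Reported aggregation results: farm size -> (records processed, seconds, reported rate)
results = {
    1: (3201454, 523, 6119),
    4: (3103742, 200, 15568),
    8: (3317013, 111, 29883),
}


def speedup(farm, baseline=1):
    """Ratio of the reported rate of `farm` to the reported rate of `baseline`."""
    return results[farm][2] / results[baseline][2]


for farm in sorted(results):
    print(f"{farm}x farm: speedup {speedup(farm):.2f} over the 1x farm")
```

By this measure the 4x and 8x farms run at roughly 2.5 and 4.9 times the rate of the 1x farm, so throughput grows with the number of servers but less than linearly.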
27. Experimental Results
Experimental Setup for ad hoc Querying
• Method
• Execute ad hoc querying
• Dataset
• One day’s log records, ~400 million records
• Index
• (start_t, end_t, src_IP, dst_IP, app)
• Configuration for Analysis Farm
• 8x farm: 8 mongod threads on eight servers
28. Experimental Results
Experimental Setup for ad hoc Querying (cont.)
• Query Types
• IP-initial Query: src_IP == IP
• IP-engaging Query: src_IP == IP OR dst_IP == IP
• IP-pair Query: IP-pair engaging AND app == HTTP
• Time Scopes
• 10 minutes, 30 minutes, 60 minutes
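The three query types and the time scopes can be written down as MongoDB-style filter documents. This is a sketch under assumptions: the field names come from the index on the previous slide, `start_t` is assumed to be a timestamp in seconds, the time scope is assumed to apply to `start_t`, "IP-pair engaging" is read as traffic in either direction between two addresses, and the builder functions are hypothetical. The resulting dicts use standard query operators (`$gte`, `$lt`, `$or`) and could be passed to a driver call such as pymongo's `Collection.find`.

```python
def time_scope(t0, minutes):
    """Restrict start_t to the half-open window [t0, t0 + minutes)."""
    return {"start_t": {"$gte": t0, "$lt": t0 + minutes * 60}}


def ip_initial(ip, t0, minutes):
    """Records initiated by `ip`: src_IP == IP."""
    return {**time_scope(t0, minutes), "src_IP": ip}


def ip_engaging(ip, t0, minutes):
    """Records involving `ip` on either side: src_IP == IP OR dst_IP == IP."""
    return {**time_scope(t0, minutes), "$or": [{"src_IP": ip}, {"dst_IP": ip}]}


def ip_pair(ip_a, ip_b, t0, minutes, app="http"):
    """Records between two addresses in either direction, limited to one app."""
    return {
        **time_scope(t0, minutes),
        "$or": [
            {"src_IP": ip_a, "dst_IP": ip_b},
            {"src_IP": ip_b, "dst_IP": ip_a},
        ],
        "app": app,
    }


# Example: a 10-minute IP-engaging filter for a placeholder address.
example = ip_engaging("10.0.0.1", t0=0, minutes=10)
print(example)
```

Keeping the time window as a separate helper makes it easy to rerun the same query over the 10-, 30-, and 60-minute scopes used in the experiments.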
29. Experimental Results
Experimental Results for IP-initial Query
Time Scope   Execution Time   Records Scanned   Rate (records/s)
10min        3.085s           227581            73770
30min        8.816s           643259            72965
60min        18.517s          1370443           73795
Experimental Results for IP-initial Query
(src_IP == IP)
30. Experimental Results
Experimental Results for IP-engaging Query
Time Scope   Execution Time   Records Scanned   Rate (records/s)
10min        18.012s          1234582           68542
30min        54.708s          3673304           67144
60min        119.034s         7912644           66474
Experimental Results for IP-engaging Query
(src_IP == IP OR dst_IP == IP)
31. Experimental Results
Experimental Results for IP-pair Query
Time Scope   Execution Time   Records Scanned   Rate (records/s)
10min        5.670s           296772            52340
30min        6.267s           324813            51829
60min        19.327s          1027513           53165
Experimental Results for IP-pair Query
(the IP-pair engages AND app == http)
33. Summary
• Analysis Farm is built on OpenStack and
MongoDB
• Analysis Farm is easy-to-manage and
easy-to-scale-out
• Feasibility in aggregating and querying is
verified
• We use Analysis Farm to analyze 400
million log records, or 350 GB, every day
34. Acknowledgement
• 973 program and NSFC
• My partners in Shanghai Jiaotong Univ.
• Dr. Lin Gu in HKUST
• Workshop organizers and reviewers
35. Analysis Farm:
A Cloud-based Scalable Aggregation and Query Platform for Network Log Analysis
Shanghai Jiaotong University
wei.jianwen@gmail.com @JianwenWEI
Thank you!
The 2011 International Workshop on Data Cloud (D-CLOUD 2011), Hong Kong