This document summarizes SoftServe's Hadoop demo lab project. It introduces SoftServe, explaining that they are a product development company with expertise in big data analytics. It then discusses why SoftServe started the demo lab project, including to increase internal Hadoop experience and provide a demo environment for customers. The document outlines the high-level tasks of the project, including ingesting log data and building a Lambda architecture. It also covers the solution architecture, such as using a Lambda architecture with Hadoop, Hive and Impala. Finally, it discusses trade-off analyses that were performed and development aspects like automation.
2. Agenda
1) Who we are & What we do
2) Why we started this project
3) High-Level Task Overview
4) Task Analysis
5) Solution Architecture
6) Trade-off Analysis
7) Development Aspects
4. Who we are
▪ Leading global Product and Application Development partner, founded in 1993
▪ 3,300+ employees across North America, Ukraine and Western Europe
▪ Thousands of successful outsourcing projects
SaaS/Cloud Solutions . Mobility Solutions . UX/UI
BI/Analytics/Big Data . Software Architecture . Security
5. Why SoftServe
• Dedicated Architecture Group (40+ architects, including BI and Big Data)
• Demo Hadoop Environment
• Reference architecture library
• 10+ successful BI/Big Data projects
• Certified Big Data engineers (Hadoop, MongoDB)
• Partnerships with major RDBMS, BI and Big Data vendors
6. What we do: Services
1) Design & Assessment
2) Optimization & Modernization
3) POC & Prototyping
4) Development and Quality Control
5) Production and Non-Production Support
13. Demo Lab: Input Data
Data Volume
270-300 Web Servers (Apache HTTPD)
447 392 events per minute
644 245 094 events per day
~100-250 bytes per event
150 GB of data per day
Log Types
1) Apache HTTPD access log
2) Apache HTTPD error log
3) Service log (CPU, RAM, Disk I/O, Disk Space)
4) Application server servlet log
Retention
Last 30 days: Raw data
Last 24 hours: per minute aggregation
Whole period: per hour aggregation
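The volume figures on this slide can be cross-checked with a little arithmetic. The sketch below is only a sanity check, not part of the original deck: the function names are hypothetical, a gigabyte is assumed to be 10^9 bytes, and HDFS replication and compression are ignored because the slide does not specify them.

```python
# Back-of-the-envelope check of the slide's input figures.
# Assumptions (not from the slide): 1 GB = 10**9 bytes, no replication, no compression.

EVENTS_PER_DAY = 644_245_094        # slide figure
BYTES_PER_EVENT_RANGE = (100, 250)  # slide figure (approximate)
RAW_GB_PER_DAY = 150                # slide figure
RAW_RETENTION_DAYS = 30             # slide figure
MINUTES_PER_DAY = 24 * 60
GB = 10**9


def events_per_minute(events_per_day: int) -> int:
    """Average event rate per minute, rounded to the nearest event."""
    return round(events_per_day / MINUTES_PER_DAY)


def daily_volume_gb(events_per_day: int, bytes_per_event: int) -> float:
    """Raw daily volume in GB for a given average event size."""
    return events_per_day * bytes_per_event / GB


def raw_retention_tb(gb_per_day: float, days: int) -> float:
    """Raw data kept for the retention window, in TB."""
    return gb_per_day * days / 1000


if __name__ == "__main__":
    low, high = (daily_volume_gb(EVENTS_PER_DAY, b) for b in BYTES_PER_EVENT_RANGE)
    print("events per minute:", events_per_minute(EVENTS_PER_DAY))
    print(f"daily volume range: {low:.1f} to {high:.1f} GB")
    print(f"implied bytes per event at {RAW_GB_PER_DAY} GB/day:",
          round(RAW_GB_PER_DAY * GB / EVENTS_PER_DAY))
    print("raw data over retention window:",
          raw_retention_tb(RAW_GB_PER_DAY, RAW_RETENTION_DAYS), "TB")
```

The per-minute rate matches the slide, and 150 GB per day sits near the upper end of the range implied by 100-250 bytes per event.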
14. Demo Lab: Log Data Examples
Access log:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
Error log:
[Sun Mar 7 20:58:27 2004] [info] [client 64.242.88.10] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 21:16:17 2004] [error] [client 24.70.56.49] File does not exist: /home/httpd/twiki/view/Main/WebHome
vmstat:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 305416 260688 29160 2356920 2 2 4 1 0 0 6 1 92 2 0
iostat:
Linux 2.6.32-100.28.5.el6.x86_64 (dev-db) 07/09/2011
avg-cpu: %user %nice %system %iowait %steal %idle
5.68 0.00 0.52 2.03 0.00 91.76
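To illustrate how an access log line like the one above might be turned into structured fields, here is a minimal Python sketch. It is not the project's actual ingestion code: the regular expression, the field names and the helper functions are assumptions made for illustration, and only the plain Apache access log layout shown on the slide is handled.

```python
import re
from datetime import datetime

# Hypothetical pattern for the access log layout shown on the slide:
# host ident user [timestamp] "method path protocol" status size
ACCESS_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_access_line(line):
    """Return a dict of parsed fields, or None if the line does not match."""
    match = ACCESS_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Month abbreviations are parsed with the current locale (English assumed).
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # Apache writes "-" when no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def minute_bucket(record):
    """Truncate the event time to the minute, e.g. as a per-minute aggregation key."""
    return record["time"].replace(second=0, microsecond=0)


if __name__ == "__main__":
    sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
              '"GET /apache_pb.gif HTTP/1.0" 200 2326')
    parsed = parse_access_line(sample)
    print(parsed)
    print(minute_bucket(parsed))
```

Lines that do not match return None rather than raising, so a caller can count or route malformed events separately.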
30. BI Tool Selection
Options:
• Tableau
• Jaspersoft
• MicroStrategy
• QlikView
MicroStrategy:
• Powerful, feature-rich BI tool
• 31-day trial period without a trial key
• Well integrated with Hadoop (and Impala)
• Easy to install in silent mode (from the command line)
35. Reference
1) Install Hadoop (CDH4) on 5 nodes with VMware, CDH4, Cloudera Manager 4
https://www.youtube.com/watch?v=CobVqNMiqww
2) Puppet & Vagrant Tutorial
http://puppetlabs.com/blog/puppet-and-vagrant-tutorial
3) Hardware for Hadoop
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.0/bk_cluster-planning-guide/content/hardware-selection-for-hbase.html
4) How to Refine and Visualize Server Log Data
http://hortonworks.com/hadoop-tutorial/how-to-refine-and-visualize-server-log-data/
5) Hadoop Cluster Sizing
http://hortonworks.com/wp-content/uploads/downloads/2013/06/Hortonworks.ClusterConfigGuide.1.0.pdf
36. Thank You!
SoftServe US Office
One Congress Plaza,
111 Congress Avenue, Suite 2700
Austin, TX 78701
Tel: 512.516.8880
Contacts
Valentyn Kropov
vkrop@softserveinc.com
Tel: 866.687.3588 x4341
Editor's Notes
All of these vendors are used in our projects. They include global leaders (Oracle, IBM, IBI, SAS, SAP) and many niche players (Infobright, Pentaho, etc.), covering both proprietary and open-source software.