Robert McPherson & Ryan Talabis "The World of Security Seen Through Analytics"




Analytics. Well, you've probably heard about it. It's being discussed everywhere from scientific papers to boardrooms, from textbooks to the water cooler. It's a buzzword that a lot of companies use to be perceived as cutting edge (like living in a Smarter Planet!). But under all the hype, analytics comes down to something simple: discovering meaningful insights in data. What you probably don't know is that analytics is everywhere now: when you apply for insurance or a loan, when someone calls you about a credit card purchase, or even when you see an ad on a website! In this talk, Bob and Ryan will give us a glimpse of what analytics is all about and, more importantly, how we as security professionals can use it in our day-to-day activities. Have you ever tried to look through millions of lines of security logs manually? Have you ever tried to make sense of hundreds of thousands of vulnerability results? Has your boss ever asked you to review years' worth of VPN access logs? Did you ever want to analyze trends in exploit development to see if any of your systems are at risk? Do you need to be a statistics or algorithm guru for this? No! Do you have to buy a fancy server appliance and business intelligence software? Not really! In this talk, Bob and Ryan will show you, step by step (yes, there will be live demos), how to use readily available open source analytics tools and techniques such as text analysis, outlier detection, and clustering to augment dreary security chores. This will get you started on becoming the resident security analytics guru in your workplace!




    Robert McPherson & Ryan Talabis "The World of Security Seen Through Analytics": Presentation Transcript

    • Forensic Server Log Analysis with Hive and MapReduce Ryan Talabis and Robert McPherson Shakacon 2013
    • Big Data Methods for Server Log Analysis Problem • Server security and data breaches are increasingly in the news. • Forensic analysis of server logs is needed o Identify vulnerabilities o Perform damage assessments o Prescribe mitigative measures o Collect evidence. • Massive collections of server log data which are difficult to analyze Open Source Technology Solutions • The Hadoop, MapReduce, and Hive software stack • HiveQL (HQL) queries to assist in performing forensic analysis • Hive can produce analytical data sets for other software tools, such as R and Mahout. Methods of analysis • Perform fuzzy searches for keywords and terms • Time aggregations of web log variables for trending • Perform sorting to identify potential problem sources • Create analytical data sets suitable for further analysis 1
    • Many Servers Produce Many Logs 2 • Millions or billions of row entries - big data territory • Can analyze in the cloud • We'll use Amazon's cloud computing environment
    • Hive and Hadoop on Amazon's Elastic MapReduce Service 3 • Hadoop enables storing of large text files on a grid of inexpensive, off-the-shelf server hardware • Hive enables use of SQL-like syntax to query data on Hadoop • Hive runs MapReduce batch jobs behind the scenes - can query terabytes of data, or larger Logs Amazon S3 Bucket Hadoop Distributed File System (HDFS) Hive Query Language (HQL)
    • Data and Code Needed to Run this Analysis • The server logs can be downloaded from this location: https://drive.google.com/folderview?id=0B9G18MREXE-dOS1USk5QRjFqOEE&usp=sharing • The data consists of Apache combined-format server log files. • Server logs one through six are from Amazon's sample collection. • An additional file has been added containing examples of log entries with known security breaches. • HQL code: https://www.box.com/s/coa1ycjx6zx5cydffvm7 • R code and data: https://www.box.com/s/g1ytwggtewiojdqkryjj 4
    • Software Needed to Run this Analysis • A working Hive environment on Hadoop and MapReduce is required. • Mahout and R for some statistical analysis on analytical data sets • Hive is excellent for canned query applications, and ad hoc customized searches - beneficial for forensic analysis. • There are several resources available online to assist in utilizing these packages, including the following. Instructions for setting up an account for Amazon cloud computing: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html Instructions for starting a Hive cluster on Amazon's Elastic MapReduce platform: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/CLI_CreatingaJobFlowUsingHive.html If you want to run Hadoop, MapReduce, and Hive on your own equipment instead of Amazon's: http://www.cloudera.com/content/support/en/documentation/cdh3-documentation/cdh3-documentation-v3-latest.html R Statistical Environment: http://cran.r-project.org/ Mahout: https://github.com/cloudera/mahout 5
    • Hive Query Language - HQL • Hive Query Language, or HQL, is almost identical to SQL. • The main difference is that HQL allows user defined Java Jar files to be used in the SQL statement. • SQL stands for Structured Query Language, and is easy to learn. • All SQL has the same basic pattern. • SELECT ...(Column names) • FROM ...(Table name) • WHERE ...(Filter) • GROUP BY ...(Category) • ORDER BY ...(Column) • Can also do joins between tables, and we will see some examples. • Hive uses Java code to make text files look like tables in a relational database to the user, but it is not a relational database – NoSQL. 6
    • Hive Runs on MapReduce and Hadoop 7 Hive Queries MapReduce Data Aggregations Hadoop File System
    • Data Setup on Amazon Elastic MapReduce 8 --Make a directory on the Hadoop Distributed File System (HDFS) to store log files hadoop dfs -mkdir data --Copy data from Amazon S3 bucket, to Hadoop File System (HDFS) directory hadoop dfs -cp 's3n://JunePresentation/Data/access*' /user/hadoop/data --Start Hive hive --Add Java archive file path for Hive add jar hive/contrib/hive-contrib-0.8.1.jar; --View column headers in the Hive output set hive.cli.print.header=true; • Comments in code are marked by '--' at the beginning of a line • Move the data from the Amazon S3 storage bucket, into the Hadoop Distributed File System (HDFS) • Start Hive with the "hive" command • Add Hive Java archive ("jar") file path • Set Hive to show column headers, for convenience
    • Examples Run on Apache Server Log Files 9 • Log files are in "combined log format" • This format consists of five fields from the common format, plus two additional fields indicating referrer and agent o Host: IP address of the remote client that made the request of the server o Identity: Client identity, if available o User: ID of the person requesting the document, if available o Time: when the server finished processing the request o Request: URL line sent from the client o Status: the status code sent from the server, to the client o Size: of any object sent to the client o Referrer: gives the site that the client reports having been referred from o Agent: identifying information that the client browser reports about itself • Can parse fields using "regular expressions" • Method can be adapted to any log format
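The nine-field layout above can be sketched with a small regular expression. This is an illustrative Python stand-in (not the RegexSerDe pattern the deck uses later); the sample log line and the `COMBINED` name are invented:

```python
import re

# Illustrative parser for the Apache combined log format; group names
# mirror the nine fields listed on the slide.
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

line = ('192.0.2.1 - - [18/Sep/2009:00:00:55 -0800] '
        '"GET /index.html HTTP/1.1" 200 3164 "-" "Mozilla/5.0"')
fields = COMBINED.match(line).groupdict()
```

The trailing non-capturing group makes the referrer and agent optional, which is what lets the same method adapt to the plain common log format.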
    • Create a Log Table on Hive Next, create the table on Hive. CREATE TABLE apachelog ( host STRING, identity STRING, user STRING, time STRING, request STRING, status STRING, size STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?", "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s" ) STORED AS TEXTFILE; 10
    • Load the Log Files into the Hive Table • I then loaded all seven of the log files. • The seventh file is one that I created, containing examples of security breaches. • The other six were extracted from Amazon’s examples. Command: LOAD DATA INPATH "YourFilePath" INTO TABLE apachelog; hive> > LOAD DATA INPATH "/user/hadoop/data/access*" INTO TABLE apachelog; Loading data to table default.apachelog OK Time taken: 0.295 seconds 11
    • Discovery Process for Specific Attack Vectors • Find patterns within the “request” field • Shows the URL information for the resource or web page that was requested • Many attacks leave telltale fingerprints behind within this field that can be identified through HQL's LIKE operator. • The following slide has an example of a direct search on a file inclusion attack. 12
    • HQL: Directory Traversal and File Inclusion • Most of the terms included in this query relate to directories at the root of a system. • Also includes the traversal command, “..”, as well as executables, ".exe", and ".ini". • Not an exhaustive list - merely a few examples SELECT * FROM apachelog WHERE LOWER(request) LIKE '%usr/%' OR LOWER(request) LIKE '%~/%' OR LOWER(request) LIKE '%.exe%' OR LOWER(request) LIKE '%.ini%' OR LOWER(request) LIKE '%etc/%' OR LOWER(request) LIKE '%dev/%' OR LOWER(request) LIKE '%opt/%' OR LOWER(request) LIKE '%root/%' OR LOWER(request) LIKE '%sys/%' OR LOWER(request) LIKE '%boot/%' OR LOWER(request) LIKE '%mnt/%' OR LOWER(request) LIKE '%proc/%' OR LOWER(request) LIKE '%sbin/%' OR LOWER(request) LIKE '%srv/%' OR LOWER(request) LIKE '%var/%' OR LOWER(request) LIKE '%c:%' OR LOWER(request) LIKE '%..%'; 13
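The LOWER/LIKE pattern above reduces to case-insensitive substring matching, which can be mimicked in a few lines of Python. The indicator list and sample requests here are illustrative, not the full query's term list:

```python
# Sketch of the LIKE-based search: flag any request containing a
# lowercase traversal or file-inclusion indicator (list is illustrative,
# not exhaustive, just like the slide's query).
INDICATORS = ['etc/', 'root/', 'boot/', 'proc/', 'sbin/', '.exe', '.ini', 'c:', '..']

def suspicious(request: str) -> bool:
    r = request.lower()  # mirrors LOWER(request) LIKE '%term%'
    return any(term in r for term in INDICATORS)

hits = [req for req in [
    'GET /cgi-bin/r.cgi?FILE=../../../../etc/passwd HTTP/1.1',
    'GET /index.html HTTP/1.1',
] if suspicious(req)]
```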
    • Results: Directory Traversal and File Inclusion • The query found the following examples of directory traversal and file inclusion attack attempts. host identity user time size referer status agent - - [25/Apr/2013:15:31:46 -0400] "GET /cgi-bin/powerup/r.cgi?FILE=../../../../../../../../../../etc/passwd HTTP/1.1" 404 539 "-" "Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003175)" - - [25/Apr/2013:15:31:46 -0400] "GET /cgi-bin/r.cgi?FILE=../../../../../../../../../../etc/passwd HTTP/1.1" 404 531 "-" "Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003176)" xxx.xxx.64.79 - - [18/Sep/2009:00:00:55 -0800] "GET /example.com/doc/..%5c../Windows/System32/cmd.exe?/c+dir+c: HTTP/1.1" 200 3164 "-" "Mozilla/5.0 (compatible) Feedfetcher- Google; (+http://www.google.com/feedfetcher.html)" xxx.xxx.64.79 - - [18/Sep/2009:00:00:55 -0800] "GET /example.com/example.asp?display=../../../../../Windows/system.ini HTTP/1.1" 200 3164 "-" "Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)" 14
    • Command Injection • This attack tries to disguise commands with HTML URL encoding. • The following query includes keywords for some common examples. SELECT * FROM apachelog WHERE LOWER(request) LIKE '%&comma%' OR LOWER(request) LIKE '%20echo%' OR LOWER(request) LIKE '%60id%'; 15
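Because this attack hides behind HTML URL encoding (%60 is a backtick, %20 a space), decoding the request before matching catches variants the raw keywords would miss. A small Python sketch of that idea (the `injection_markers` helper and its marker list are invented for illustration):

```python
from urllib.parse import unquote

# Decode percent-encoding first, then look for shell-injection fragments.
# The marker list is illustrative, mirroring the slide's '%20echo'/'%60id' terms.
def injection_markers(request: str):
    decoded = unquote(request).lower()
    return [m for m in ('echo', '`id', 'die()') if m in decoded]

req = ("GET /vbcalendar.php?calbirthdays=1&action=getday&day=2001-8-15"
       "&comma=%22;echo%20'';%20echo%20%60id%20%60;die();echo%22 HTTP/1.1")
markers = injection_markers(req)
```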
    • Result: Command Injection Query host identity user time request status size referer agent - - [25/Apr/2013:15:31:46 -0400] "GET /forumscalendar.php?calbirthdays=1&action=getday&day=2001-8- 15&comma=%22;echo%20'';%20echo%20%60id%20%60;die();echo%22 HTTP/1.1" 404 536 "-" "Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003039)" - - [25/Apr/2013:15:31:46 -0400] "GET /forumzcalendar.php?calbirthdays=1&action=getday&day=2001-8- 15&comma=%22;echo%20'';%20echo%20%60id%20%60;die();echo%22 HTTP/1.1" 404 536 "-" "Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003040)" - - [25/Apr/2013:15:31:46 -0400] "GET /htforumcalendar.php?calbirthdays=1&action=getday&day=2001-8- 15&comma=%22;echo%20'';%20echo%20%60id%20%60;die();echo%22 HTTP/1.1" 404 537 "-" "Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003041)" - - [25/Apr/2013:15:31:46 -0400] "GET /vbcalendar.php?calbirthdays=1&action=getday&day=2001-8- 15&comma=%22;echo%20'';%20echo%20%60id%20%60;die();echo%22 HTTP/1.1" 404 532 "-" "Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003042)" - - [25/Apr/2013:15:31:46 -0400] "GET /vbulletincalendar.php?calbirthdays=1&action=getday&day=2001-8- 15&comma=%22;echo%20'';%20echo%20%60id%20%60;die();echo%22 HTTP/1.1" 404 539 "-" "Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003043)" - - [25/Apr/2013:15:31:46 -0400] "GET /cgi- bin/calendar.php?calbirthdays=1&action=getday&day=2001-8- 15&comma=%22;echo%20'';%20echo%20%60id%20%60;die();echo%22 HTTP/1.1" 404 538 "-" "Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003044)" Time taken: 13.51 seconds 16
    • Finding the Unknown Unknowns 17
    • Tracking and Tallying Failed Request Statuses • Often an attacker may experience a number of failures before hitting upon the right combination which brings success. • Tracking status codes may help to locate attack attempts, but is a much less direct method than those shown above. • Hosts sending the request can be sorted to determine which IP addresses produce the most failures. • While a high number of failures is not a definitive indication that an attack has taken place, it could serve as a starting point for further investigation. • Tracking failed status codes could also help to identify where users are having difficulty with the system, as well as those IP addresses that may be putting the system under the greatest stress. 18
    • Hosts with the Most Failed Requests • Status codes in the server log are shown as a three-digit code. • Codes in the 100, 200, or 300 series range indicate successful client requests, and server responses. • However, codes in the 400 range indicate failed requests, and the 500 range indicates failed responses due to problems on the server side. • A series of failed requests could be an indication of an attacker using trial and error until the attack succeeds. • Failed server responses could be an indication that an attack has succeeded in doing server side damage. 19
    • Create Status Groupings View CREATE VIEW IF NOT EXISTS statusgroupings AS SELECT host ,identity ,user ,time ,request ,status ,size ,referer ,agent ,CASE substr(status,1,1) WHEN '1' THEN '0' WHEN '2' THEN '0' WHEN '3' THEN '0' WHEN '4' THEN '1' WHEN '5' THEN '1' ELSE '0' END AS failedaccess FROM apachelog; 20
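The CASE expression in this view boils down to one rule: a status code whose first digit is 4 or 5 is a failure. A one-line Python equivalent of that rule (function name is my own):

```python
# Python equivalent of the view's CASE expression: status codes beginning
# with '4' or '5' are flagged as failed ('1'), everything else as '0'.
def failed_access(status: str) -> str:
    return '1' if status[:1] in ('4', '5') else '0'

flags = [failed_access(s) for s in ('200', '304', '404', '500')]
```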
    • Time Aggregations • The queries in this section parse the time field in the server log. • This enables the evaluation of server activity over time. • The time field appears in the log file as “[20/Jul/2009:20:12:22 -0700].” • The first digit, ‘20’ in this case, is the day, which is followed by the month, ‘Jul’, and the year, ‘2009’. • Three sets of numbers follow the year, separated by a colon, representing the hour, minute, and second respectively. • The final four digits that follow a dash represent the time zone. 21
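The deck's Hive views parse this timestamp with substr() and CASE; for comparison, the same string unpacks in one call in Python, where %z consumes the trailing time zone offset (the variable names are mine):

```python
from datetime import datetime

# Parse the bracketed combined-log timestamp described on the slide.
ts = datetime.strptime('[20/Jul/2009:20:12:22 -0700]', '[%d/%b/%Y:%H:%M:%S %z]')

yearmonth = ts.strftime('%Y%m')  # same shape as the Hive view's "yearmonth"
monthday = ts.strftime('%m%d')   # same shape as the Hive view's "monthday"
```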
    • Query to Parse Year, Month, and Day • The following query view parses the year, month, and day into "year", "yearmonth", and "monthday" fields. • The query makes use of Hive's concatenate function, CONCAT(). CREATE VIEW IF NOT EXISTS by_month AS SELECT host,identity,user,time,CASE substr(time,5,3) WHEN 'Jan' THEN '01' WHEN 'Feb' THEN '02' WHEN 'Mar' THEN '03' WHEN 'Apr' THEN '04' WHEN 'May' THEN '05' WHEN 'Jun' THEN '06' WHEN 'Jul' THEN '07' WHEN 'Aug' THEN '08' WHEN 'Sep' THEN '09' WHEN 'Oct' THEN '10' WHEN 'Nov' THEN '11' WHEN 'Dec' THEN '12' ELSE '00' END AS month,substr(time,9,4) AS year,concat(substr(time,9,4) ,CASE substr(time,5,3) WHEN 'Jan' THEN '01' WHEN 'Feb' THEN '02' WHEN 'Mar' THEN '03' WHEN 'Apr' THEN '04' WHEN 'May' THEN '05' WHEN 'Jun' THEN '06' WHEN 'Jul' THEN '07' WHEN 'Aug' THEN '08' WHEN 'Sep' THEN '09' WHEN 'Oct' THEN '10' WHEN 'Nov' THEN '11' WHEN 'Dec' THEN '12' ELSE '00' END) AS yearmonth ,concat(CASE substr(time,5,3) WHEN 'Jan' THEN '01' WHEN 'Feb' THEN '02' WHEN 'Mar' THEN '03' WHEN 'Apr' THEN '04' WHEN 'May' THEN '05' WHEN 'Jun' THEN '06' WHEN 'Jul' THEN '07' WHEN 'Aug' THEN '08' WHEN 'Sep' THEN '09' WHEN 'Oct' THEN '10' WHEN 'Nov' THEN '11' WHEN 'Dec' THEN '12' ELSE '00' END,substr(time,2,2)) AS monthday,request,status,size,referer,agent FROM apachelog; • The resulting new time columns appear as the following. month year monthday yearmonth 07 2009 0720 200907
    • Generate Daily Time Series for Failed Requests • A daily time series can reveal attack trend sequences. • The query view below combines the year, month, and day into a single string that can be sorted, in the form ‘20090720’, where the first four digits represents the year, the next two are the month, and the last two digits are the day. --enable the creation of a time series by day over multiple years and months CREATE VIEW by_day AS SELECT host ,identity ,user ,time ,concat(year, monthday) AS yearmonthday ,request ,status ,size ,referer ,agent FROM by_month; SELECT * FROM by_day LIMIT 10; 23
    • Two Views: Successful and Failed Requests by Day • The two additional views shown below reference the preceding view, "by_day". • They generate two sets of data that can be used to produce a ratio of failed to successful requests by day. --Unsuccessful server calls as a time series by year, month, and day Create VIEW FailedRequestsTimeSeriesByDay AS SELECT yearmonthday,COUNT(yearmonthday) AS failedrequest_freq FROM by_day WHERE substr(status,1,1) IN('4','5') GROUP BY yearmonthday ORDER BY yearmonthday ASC; --Successful server calls as a time series by year, month, and day Create VIEW SuccessfulRequestsTimeSeriesByDay AS SELECT yearmonthday ,COUNT(yearmonthday) AS successfulrequest_freq FROM by_day WHERE substr(status,1,1) IN('1','2','3') GROUP BY yearmonthday ORDER BY yearmonthday ASC; 24
    • Calculate Failed Request Ratio by Day Finally, the preceding two views are joined in the query below, to produce a ratio of failed to successful requests, by day. --Calculate ratio of failed to successful requests by year, month, and day SELECT a.yearmonthday ,a.failedrequest_freq / b.successfulrequest_freq AS failratio FROM FailedRequestsTimeSeriesByDay a JOIN SuccessfulRequestsTimeSeriesByDay b ON a.yearmonthday = b.yearmonthday ORDER BY yearmonthday ASC; • A small sample of the results are shown below. yearmonthday failratio 20090720 0.023759608665269043 20090721 0.024037482175595846 20090722 0.029298848252172157 20090723 0.032535684298908484 20090724 0.04544235924932976 20090725 0.030345800988002825 20090726 0.031446540880503145 20090727 0.03494060097833683 25
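The join above, in miniature: an inner join on the day key, then failed count divided by successful count. The counts in this Python sketch are invented for illustration:

```python
# Miniature version of the ratio join: match failed and successful request
# counts on the day key, then divide. Counts are illustrative.
failed = {'20090720': 19, '20090721': 24}
successful = {'20090720': 800, '20090721': 1000}

failratio = {day: failed[day] / successful[day]
             for day in failed if day in successful}  # inner join on the key
```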
    • Export Data for Further Analysis • These results can also be exported, and analyzed in a program, such as R, Excel, or SAS. • Aggregating data into a time series is one way to turn large data into small data that can be analyzed with more conventional tools. • Rerunning the last query, which produces the daily time series of the failed request ratio, and adding the INSERT OVERWRITE LOCAL DIRECTORY command, exports the results for further analysis. • An example of entering this into the command line is shown below. • Caution: This command will overwrite everything in a folder that already has files in it, so provide a new folder name at the end of the location address string - "FailedRequestsByDay", in my case. INSERT OVERWRITE LOCAL DIRECTORY 'FailedRequestsByDay' SELECT a.yearmonthday ,a.failedrequest_freq / b.successfulrequest_freq AS failratio FROM FailedRequestsTimeSeriesByDay a JOIN SuccessfulRequestsTimeSeriesByDay b ON a.yearmonthday = b.yearmonthday ORDER BY yearmonthday ASC; 26
    • R-Code: Find Days with Failed Requests Beyond Two Standard Deviations of the Mean rm(list=ls()) #Remove any objects library(fBasics) #Import failed requests file, using standard delimiter in Hive failedRequests <- read.table("FailedRequestsByDay.txt",sep="^A") #Add column headings colnames(failedRequests) <- c("Date","FailedRequestsRatio") stdev <- sd(failedRequests$FailedRequestsRatio) #calculate the standard deviation avg <- mean(failedRequests$FailedRequestsRatio) #calculate the average avgPlus2Stdev <- avg + 2*stdev #mean plus 2 standard deviations #Identify the days with failed request ratios more than two standard deviations above the mean failedRequests[failedRequests[,2]>avgPlus2Stdev,] • Results in three days being identified as beyond the control threshold. > failedRequests[failedRequests[,2]>avgPlus2Stdev,] Date FailedRequestsRatio 5 20090724 0.04544236 68 20090925 0.04203776 71 20090928 0.04075795 > 27
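The same control-threshold logic as the R snippet, sketched in Python: flag ratios beyond the mean plus two sample standard deviations. The ratio values here are invented for illustration, not the deck's data:

```python
from statistics import mean, stdev

# Flag values more than two (sample) standard deviations above the mean,
# mirroring the R code's avgPlus2Stdev threshold. Ratios are illustrative.
ratios = [0.023, 0.024, 0.029, 0.032, 0.045, 0.030, 0.031, 0.034]
threshold = mean(ratios) + 2 * stdev(ratios)
outliers = [r for r in ratios if r > threshold]
```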
    • R-Code: Create a Control Chart, and Test for Autocorrelation • The following code generates a control chart, showing the days that are beyond two standard deviations of the mean of the failed request ratio, and exports the chart to a PDF file. • It also generates a plot of the autocorrelation function, "acfPlot()". #Produce a plot and save it as a PDF pdf("PlotOfFailedRequestRatioSeries.pdf") plot(failedRequests[,2],type='l',main="Ratio of Failed Server Requests to Successful Requests by Day" ,xlab="Day",ylab="Ratio of Failed Requests to Successful Requests") lines(rep(avg,length(failedRequests[,2]))) lines(rep(avgPlus2Stdev,length(failedRequests[,2])),lty=2) legend("topright",c("Average","2 X Standard Deviation"),lty=c(1,2)) dev.off() #Create autocorrelation plot to test for seasonality or other autocorrelation effects pdf("FailedRequestsAutoCorrelation.pdf") acfPlot(failedRequests[,2],lag.max=60) dev.off() 28
    • Control Chart 29
    • Autocorrelation Plot 30 • Reveals whether there is any seasonality or other autocorrelation effects. • Tallest bar is at zero, and no other bars are beyond the significance line. • In this case, there is no evidence of significant seasonality.
    • Producing other Analytical Data Sets with Hive • As just demonstrated with R, Hive can be useful for producing analytical data sets to be used in other software packages. • The “statusgroupings” view that was introduced in a previous slide can also be used to produce an analytical data set that could be useful for analysis in other tools. • As an experiment, a logistic regression was run to determine whether any patterns could be found among the variables that might be predictive of request failure or success, as indicated by the status codes. • Use similar code as below, to export the "statusgroupings" view to a new folder in the local directory. INSERT OVERWRITE LOCAL DIRECTORY 'ApacheLog' SELECT * FROM statusgroupings; 31
    • Importing the Data into Mahout • The logistic regression was run using Mahout. • While Hive’s standard delimiter of ‘^A’ can be useful for a number of data sets and analysis tools, such as R, Mahout appears to want data in traditional CSV format, with a comma. • You may be able to use the Hive command, “set hive.io.output.fileformat = CSVTextFile”, prior to running the export snippet above. • However, this does not seem to always work for everyone, perhaps depending upon the environment. • Barring this, you may be able to do a find and replace using awk or sed, or with an editor, such as Emacs or Vi. 32
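The find-and-replace fallback mentioned above can also be done in a couple of lines of Python: rewrite Hive's default '^A' (0x01) field delimiter as commas. A naive sketch (function name is mine; it assumes fields contain no literal commas):

```python
# Convert a Hive-exported line from '\x01'-delimited to comma-delimited,
# so tools expecting CSV (like Mahout, per the slide) can read it.
# Naive: assumes no field contains a comma or embedded '\x01'.
def hive_to_csv(line: str) -> str:
    return line.rstrip('\n').replace('\x01', ',')

row = hive_to_csv('host1\x01-\x01-\x01200\x010\n')
```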
    • Mahout Logistic Regression Command • Outside of Hive, such as in a new command shell, navigate to where the data was exported to from the "statusgroupings" view, in the preceding step. • The following may be pasted into the command line to train a logistic regression model in Mahout. /usr/bin/mahout trainlogistic --input statusgroupings.csv --output ./model --target failedaccess --categories 2 --predictors host --types word --features 50 --passes 20 --rate 50 33
    • Logistic Regression Output Sample • The full command line output is much too lengthy to list here. • However, segments of it are shown below. • Incidentally, the term, “failedaccess” is the variable name to be predicted, and is not an error message. failedaccess ~ -3.757*Intercept Term + -19.927*host=xxx.xxx.11.174 + - 18.807*host=xxx.xxx.200.192 + ... ... Intercept Term -3.75710 host=xxx.xxx.11.174 -19.92741 host=xxx.xxx.200.192 -18.80668 host=xxx.xxx.3.173 -0.08841 host=xxx.xxx.43.173 1.03845 host=xxx.xxx.36.26 0.33822 host=xxx.xxx.246.36 3.49805 host=xxx.xxx.199.144 0.47374 host=xxx.xx.145.56 -3.32575 ... 34
    • Visualization of the Coefficients for Each Host 35 • Negative coefficients on left side of graph indicate high proportion of hosts contribute to consistently good status codes • Positive coefficients on right indicate hosts that tend to contribute to a disproportionate number of bad status codes
    • Time Series Cross Correlation Analysis 36 • Sum status code occurrences for each day • Export resulting time series data from Hive to be imported into R for analysis • Run correlations with different lags to search for possible leading or lagging effects between the status codes • Are any status codes associated with each other? • Look for evidence of trial and error attacks, where failed status codes are related to successful codes
    • Sum All Status Code Occurrences by Day in Hive 37 CREATE VIEW summed_status_by_day AS SELECT year ,monthday ,concat(year, monthday) AS yearmonthday ,SUM(100Continue) AS 100Continue ,SUM(101SwitchingProtocols) AS 101SwitchingProtocols ,SUM(102Processing) AS 102Processing ,SUM(200OK) AS 200OK ,SUM(201Created) AS 201Created ... ... ,SUM(511NetworkAuthenticationRequired) AS 511NetworkAuthenticationRequired ,SUM(598NetworkReadTimeoutError) AS 598NetworkReadTimeoutError ,SUM(599NetworkConnectTimeoutError) AS 599NetworkConnectTimeoutError FROM unstacked_status_by_day GROUP BY year, monthday ORDER BY year, monthday ASC; INSERT OVERWRITE LOCAL DIRECTORY 'SummedStatusByDay' SELECT * FROM summed_status_by_day;
    • Cross Correlation Analysis: R Code 38 #Import summed status code frequencies grouped by day statusFrequencies <- read.table("SummedStatusByDay.txt",sep="") statusFrequenciesNumeric <- statusFrequencies[-1,4:length(statusFrequencies)]
    • Cross Correlation Analysis: R Code 39 #Copy headers from Hive and paste them here. colnames(statusFrequenciesNumeric) <- c("100continue","101switchingprotocols","102processing","200ok","201created","202accepted","203nonauthoritativeinformation","204nocontent","205resetcontent","206partialcontent","207multistatus","208alreadyreported","226imused","300multiplechoices","301movedpermanently","302found","303seeother","304notmodified","305useproxy","306switchproxy","307temporaryredirect","308permanentredirect","400badrequest","401unauthorized","402paymentrequired","403forbidden","404notfound","405methodnotallowed","406notacceptable","407proxyauthenticationrequired","408requesttimeout","409conflict","410gone","411lengthrequired","412preconditionfailed","413requestentitytoolarge","414requesturitoolong","415unsupportedmediatype","416requestedrangenotsatisfiable","417expectationfailed","418imateapot","420enhanceyourcalm","422unprocessableentity","423locked","424faileddependency","424methodfailure","425unorderedcollection","426upgraderequired","428preconditionrequired","429toomanyrequests","431requestheaderfieldstoolarge","444noresponse","449retrywith","450blockedbywindowsparentalcontrols","451unavailableforlegalreasonsorredirect","494requestheadertoolarge","495certerror","496nocert","497httptohttps","499clientclosedrequest","500internalservererror","501notimplemented","502badgateway","503serviceunavailable","504gatewaytimeout","505httpversionnotsupported","506variantalsonegotiates","507insufficientstorage","508loopdetected","509bandwidthlimitexceeded","510notextended","511networkauthenticationrequired","598networkreadtimeouterror","599networkconnecttimeouterror")
    • Cross Correlation Analysis: R Code 40 colSums <- apply(statusFrequenciesNumeric,2,sum) nonEmptyCols <- which(colSums>0) str(nonEmptyCols) X <- statusFrequenciesNumeric[,which(colSums>0)] plot(X) cor(X) acf(diff(X$"200ok")) acf(diff(X$"405methodnotallowed")) ccf(y=diff(X$"200ok"),x=diff(X$"405methodnotallowed"),ylab="Cross-correlation") ccf(y=diff(X$"405methodnotallowed"[10:length(X$"405methodnotallowed")]),x=diff(X $"405methodnotallowed"),ylab="Cross-correlation")
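What ccf() reports is the Pearson correlation between one series and a lag-shifted copy of another. A hand-rolled Python sketch of that calculation (the two status-code series and the helper names are invented for illustration):

```python
# Pearson correlation between x[t] and y[t + lag], sketching what R's ccf()
# computes for each lag. Series values are illustrative, not the deck's data.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def lagged_corr(x, y, lag):
    # correlate x[t] against y[t + lag], lag >= 0
    return pearson(x[:len(x) - lag] if lag else x, y[lag:])

ok_200 = [5, 7, 6, 9, 12, 11, 14, 13]          # daily counts of code 200
not_allowed_405 = [1, 2, 1, 3, 5, 4, 6, 5]     # daily counts of code 405
same_day = lagged_corr(ok_200, not_allowed_405, 0)
```

A strong same-day correlation between a failure code and 200 "OK" is the kind of signal the deck reads as possible trial and error followed by success.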
    • 41 Correlation Plots of All Status Codes
    • Correlation Matrix of Code 200 with All Others 42 • Increase in code 405 "Method Not Allowed" associated with code 200 "OK" • Might indicate trial and error attack followed by success 200 OK 200 OK 1.0 404 Not Found 0.44 405 Method Not Allowed 0.94 501 Not Implemented 0.77
    • Cross Correlation Analysis 43 • Increase in code 405 correlated with same day increase in code 200 • Four days later, increase in 200 associated with decrease in 405
    • Conclusion • Hive provides a very useful framework for analyzing large amounts of server log data. • Since attack vectors can be so varied, a flexible tool that enables drilldowns and ad hoc analysis on the fly is very useful. • However, it can also be useful to have a collection of queries and methods for analyzing common vectors of attack as a starting point. • It is hoped that the ideas offered here may serve as a catalyst for further discussion and research into this topic. • For comments, suggestions, or questions, contact Bob McPherson at Robert.L.McPherson@Gmail.com. 44
    • Bibliography “[#HIVE-662] Add a Method to Parse Apache Weblogs - ASF JIRA.” Accessed April 17, 2013. https://issues.apache.org/jira/browse/HIVE-662. “Analyze Log Data with Apache Hive, Windows PowerShell, and Amazon EMR : Articles & Tutorials : Amazon Web Services.” Accessed May 4, 2013. http://aws.amazon.com/articles/3681655242374956. “Analyzing Apache Logs with Hadoop Map/Reduce. | Rajvish.” Accessed April 17, 2013. http://rajvish.wordpress.com/2012/04/30/analyzing-apache-logs-with-hadoop-mapreduce/. “Apache Log Analysis with Hadoop, Hive and HBase.” Accessed April 17, 2013. https://gist.github.com/emk/1556097. “Apache LogAnalysis Using Pig : Articles & Tutorials : Amazon Web Services.” Accessed April 17, 2013. http://aws.amazon.com/code/Elastic-MapReduce/2728. “Blind-sqli-regexp-attack.pdf.” Accessed May 7, 2013. http://www.ihteam.net/papers/blind-sqli-regexp-attack.pdf. Devi, T. “Hive and Hadoop for Data Analytics on Large Web Logs,” May 8, 2012. http://www.devx.com/Java/Article/48100. “Exploring Apache Log Files Using Hive and Hadoop | Johnandcailin.” Accessed April 17, 2013. http://www.johnandcailin.com/blog/cailin/exploring-apache-log-files-using-hive-and-hadoop. “Fingerprinting Port80 Attacks: A Look into Web Server, and Web Application Attack Signatures: Part Two.” Accessed May 8, 2013. http://www.cgisecurity.com/fingerprinting-port80-attacks-a-look-into-web-server-and-web-application-attack-signatures-part-two.html. “Googlebot Makes An Appearance In Web Analytics Reports.” Accessed May 5, 2013. http://searchengineland.com/is-googlebot-skewing-google-analytics-data-22313. Continued... 45
    • Bibliography “IBM Security Intelligence with Big Data.” Accessed April 20, 2013. http://www-03.ibm.com/security/solution/intelligence-big-data/. “Intro to Mahout -- DC Hadoop.” Accessed May 6, 2013. http://www.slideshare.net/gsingers/intro-to-mahout-dc-hadoop. “Kick Start Hadoop: Analyzing Apache Logs with Pig.” Accessed April 17, 2013. http://kickstarthadoop.blogspot.com/2011/06/analyzing-apache-logs-with-pig.html. “LAthesis.pdf - Fry_MASc_F2011.pdf.” Accessed May 4, 2013. http://spectrum.library.concordia.ca/7769/1/Fry_MASc_F2011.pdf. “Log Files - Apache HTTP Server.” Accessed April 25, 2013. http://httpd.apache.org/docs/current/logs.html#accesslog. “Mod_log_config - Apache HTTP Server.” Accessed April 17, 2013. http://httpd.apache.org/docs/2.0/mod/mod_log_config.html. “Parsing Logs with Apache Pig and Elastic MapReduce : Articles & Tutorials : Amazon Web Services.” Accessed April 22, 2013. http://aws.amazon.com/articles/2729. “Reading the Log Files - Apache.” Accessed April 24, 2013. http://www.devshed.com/c/a/Apache/Logging-in-Apache/2/. “Recommender Documentation.” Accessed May 6, 2013. https://cwiki.apache.org/MAHOUT/recommender-documentation.html. “SQL Injection Cheat Sheet.” Accessed May 7, 2013. http://ferruh.mavituna.com/sql-injection-cheatsheet-oku/. “Tutorial.” Accessed May 5, 2013. https://cwiki.apache.org/Hive/tutorial.html#Tutorial-Builtinoperators. “User Agent - Wikipedia, the Free Encyclopedia.” Accessed May 7, 2013. https://en.wikipedia.org/wiki/User_agent. “Using Grep.exe for Forensic Log Parsing and Analysis on Windows Server and IIS - IIS - Sysadmins of the North.” Accessed May 4, 2013. http://www.saotn.org/using-grep-exe-for-forensic-log-parsing-and-analysis-on-windows-server-iis/. “Using Hive for Weblog Analysis | Digital Daaroo - by Saurabh Nanda.” Accessed April 17, 2013. http://www.saurabhnanda.com/2009/07/using-hive-for-weblog-analysis.html.
“Using Versioning - Amazon Simple Storage Service.” Accessed February 22, 2013. http://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html. 46