Hadoop / Spark on Malware Expression

Presentation given by Sungwook Yoon, MapR Data Scientist

Topics Covered:
Advanced Persistent Threat (APT)
Big Data + Threat Intelligence
Hadoop + Spark Solution
Example Detection Algorithm Development Scenarios (most of them are still open problems)




Hadoop / Spark on Malware Expression: Presentation Transcript

  • © 2014 MapR Technologies
  • Objective: Topics covered in this talk
    • Advanced Persistent Threat (APT)
    • Big Data + Threat Intelligence
    • Hadoop + Spark Solution
    • Example Detection Algorithm Development Scenarios (most of them are still open problems)
  • Advanced Persistent Threat
  • APT: Overview
    • Advanced Persistent Threat (APT) is one of the biggest headaches for IT departments
      – Target compromise
      – Countless DDoS attacks (thousands a day, according to Arbor Networks)
      – These are only the known cases, which could be just the tip of the iceberg
    • Why is APT so prevalent?
      – It is no longer a hobby for smart hackers
      – Huge money is involved, with even organized crime behind it
      – Political tool (the recent conflict between Ukraine and Russia sparked malware warfare between them)
      – Cyber warfare (Stuxnet)
  • APT: Status
    • Hard to detect
      – More software stack layers appear every day without thorough vulnerability testing
        • Storm, Spark, YARN, Grails, Play, Spring, Flask, …
      – The mobile area is even worse
        • Particularly Android
        • By some estimates, 30% or more of devices worldwide are already compromised
    • Anti-virus is useful only up to a certain point
      – It takes months to years to define a malware signature
      – Zero-day attacks are still unpreventable
      – It has become almost a placebo
    • Firewalls are not much use anymore
      – A device can be infected when the user takes it outside the firewall perimeter
    • Botnets themselves are becoming more complex, with many hierarchies
      – Minimal binary delivery
      – Surreptitious C&C connections with a complex hierarchy, or even headless peer-to-peer bots (Gameover Zeus botnet)
  • APT: Defense, state of the art
    • Snort / Suricata
      – Rule-based systems
      – Community support, pre/post-compromise detection
      – Constant updates are needed; cannot detect zero-day attacks
      – Sourcefire provides a paid service
    • Sandbox technology
      – Firewall + on-premise detection
      – FireEye
    • Polymorphing technology
      – Shape Security
    • Log data mining based methods
      – Splunk / Sumo Logic, Solutionary
  • APT: Threat Report
    • Many security labs worldwide run malware labs and generate threat reports
    • The analysis takes from 2 weeks to months
    • It involves
      – Decoding binary execution and decrypting load / config parameters
      – Complete timeline analysis, from infection to exploit
      – Which devices, IPs, and domain names are involved
        • Sometimes this means analyzing IRC data, or even social network data
      – Botnet connection, and verifying the command and control
    • Can we automate this with Big Data?
  • APT: Example Annual Threat Report (from FireEye, 2013, Europe)
    • The top two industries in threat findings were healthcare and finance
  • APT: Example Threat Report (from FireEye)
    • Configuration (decrypted)
        ID: F16 08-07-2013
        Group:
        DNS/Port: Direct: toornt.servegame.com:443,
        Proxy DNS/Port:
        Proxy Hijack: No
        ActiveX Startup Key:
        HKLM Startup Entry:
        File Name:
        Install Path: C:\Documents and Settings\Admin\Local Settings\Temp\morse.exe
        Keylog Path: C:\Documents and Settings\Admin\Local Settings\Temp\morse
        Inject: No
        Process Mutex: gdfgdfgdg
        Key Logger Mutex:
        ActiveX Startup: No
        HKLM Startup: No
        Copy To: No
        Melt: No
        Persistence: No
        Keylogger: No
        Password: !@#GooD#@!
    • C&C servers
        toornt.servegame.com
        updateo.servegame.com
        egypttv.sytes.net
        skype.servemp3.com
        natco2.no-ip.net
    • Why does it need a password?
  • APT: Another Example, Fiesta EK, from malware-traffic-analysis.net
    • Associated domains
        www.toonzone.net - Compromised website
        ilinsting.com - Redirect
        bgbyhn.in.ua - Fiesta EK
    • Infection chain of events
        06:40:07 UTC - www.toonzone.net - GET /forums/adult-swim-toonami-forum/
        06:40:08 UTC - ilinsting.com - GET /szjhmucw.js?3ad1359a5153d640
        06:40:09 UTC - bgbyhn.in.ua - GET /hdjng94/?2
        06:40:11 UTC - bgbyhn.in.ua - GET /hdjng94/?25b6d1b1cb76ec625b500e0d560a50040703520d5053520a0706510355090109
        06:40:12 UTC - bgbyhn.in.ua - GET /hdjng94/?2d8a97d01a056fdd41084e5a0b0c56050752085a0d55540b07570b54080f0708;5110411
        06:40:14 UTC - bgbyhn.in.ua - GET /hdjng94/?02bb88c62d7306c8534209590a035103050452590c5a530d0501515709000057;5
        06:40:15 UTC - bgbyhn.in.ua - GET /hdjng94/?02bb88c62d7306c8534209590a035103050452590c5a530d0501515709000057;5;1
        06:40:42 UTC - bgbyhn.in.ua - GET /hdjng94/?2ad5cdef3fc4ef9851110f0e515f57530757540e5706555d07525700525c065e;6
        06:40:43 UTC - bgbyhn.in.ua - GET /hdjng94/?2ad5cdef3fc4ef9851110f0e515f57530757540e5706555d07525700525c065e;6;1
        06:40:43 UTC - bgbyhn.in.ua - GET /hdjng94/?5998786b9c7a1ffe544b580305030457000f0903035a0659000a0a0d0600555a
        06:40:49 UTC - bgbyhn.in.ua - GET /hdjng94/?59576b00f4cfd03e5641500c04590205000f050c0200000b000a0602075a5308;1;2
        06:40:49 UTC - bgbyhn.in.ua - GET /hdjng94/?59576b00f4cfd03e5641500c04590205000f050c0200000b000a0602075a5308;1;2;1
  • Big Data + Threat Intelligence
  • Big Data + Threat Intelligence: Pros
    • Tom Brady + Gisele Bundchen: an ideal marriage
    • With all the advances in computing and data resources, why can't we automate malware detection?
    • Big Data is an ideal platform for malware study
      – Simple packet capture can easily generate petabytes of data, even from small offices
      – Huge storage + fast processing is essential for malware study
    • Various aspects of Big Data fit well with malware analysis
      – Streaming analysis (Storm, Spark Streaming)
      – Volumetric data analysis (Spark)
      – Graph analysis
        • View network devices as nodes and discover command and control roles
        • Each URL can be a node and the basis of graph analysis
      – Visualization for intuitive analysis
  • Big Data + Threat Intelligence: Cons (e.g., Gwyneth Paltrow and Chris Martin)
    • Anomaly detection
      – Typical log analysis
      – Routers / switches have built-in alarm settings
        • Simple level-based detection
      – Is this going to be useful?
        • How much can you tell?
    • Machine learning
      – Not very useful
        • Labeled data is not easy to get
        • Even with labeled data, it is very hard to develop a feature set
          – If the feature set is known, hackers will revise their code
        • Zero-day attacks do not come with a label
      – Modeling needs a complete understanding of criminal minds
  • Big Data + Threat Intelligence: An Example Architecture
    • [Figure: architecture diagram with these labels: Storm Spout (packet stream or binary downloads); Storm Bolt (packet analysis); alert and store packet data; store to HDFS; Spark analysis; Storm Bolt (metadata extraction)]
    • The packet stream truly reveals malware expression, compared to logs
    • Connect the dots with strong in-memory processing
  • Big Data + Threat Intelligence: False Positives
    • Reduce false positives
      – The mantra of the malware detection business
    • Big Data is a great resource for reducing false positives (Type I errors)
      – As soon as an algorithm is updated, test it against the Big Data test cases
      – The test can even be applied to old cases, greatly reducing false positives
        • Typically, we had to sample test data, weighting old data lower
  • Big Data + Threat Intelligence: Packet to HDFS
    • Wireshark (tshark) is the go-to software for packet analysis
      – It is a huge memory hog
    • Need to put packet data onto HDFS
    • Packetpig has been developed from Hortonworks
      – A lot more has to be done to come anywhere near the strength of Wireshark
    • Need to design efficient metadata collection and storage mechanisms
      – Use Snort or a custom C platform library to extract essential flow data
        • A flow is a 5-tuple: source IP, destination IP, source port, destination port, protocol
        • Flow is the de facto unit of network malware expression analysis
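The flow definition on this slide can be made concrete with a small sketch. It is only an illustration in plain Python, not the Snort or C-library extraction the slide has in mind: the packet dictionaries, their field names (`ts`, `length`, and so on), and the `FlowKey` / `FlowStats` / `aggregate_flows` names are all assumptions made for the example.

```python
from typing import Dict, Iterable, NamedTuple


class FlowKey(NamedTuple):
    """The 5-tuple that identifies a flow."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


class FlowStats(NamedTuple):
    """Per-flow metadata small enough to keep instead of raw packets."""
    packets: int
    total_bytes: int
    first_seen: float
    last_seen: float


def aggregate_flows(packets: Iterable[dict]) -> Dict[FlowKey, FlowStats]:
    """Group packet records (hypothetical dict format) by their 5-tuple."""
    flows: Dict[FlowKey, FlowStats] = {}
    for p in packets:
        key = FlowKey(p["src_ip"], p["dst_ip"],
                      p["src_port"], p["dst_port"], p["protocol"])
        ts, size = p["ts"], p["length"]
        if key in flows:
            s = flows[key]
            flows[key] = FlowStats(s.packets + 1, s.total_bytes + size,
                                   min(s.first_seen, ts), max(s.last_seen, ts))
        else:
            flows[key] = FlowStats(1, size, ts, ts)
    return flows


# Made-up sample packets (documentation-range addresses).
sample = [
    {"ts": 0.0, "src_ip": "10.0.0.5", "dst_ip": "203.0.113.9",
     "src_port": 49152, "dst_port": 443, "protocol": "tcp", "length": 60},
    {"ts": 1.5, "src_ip": "10.0.0.5", "dst_ip": "203.0.113.9",
     "src_port": 49152, "dst_port": 443, "protocol": "tcp", "length": 1500},
    {"ts": 2.0, "src_ip": "10.0.0.7", "dst_ip": "198.51.100.2",
     "src_port": 53000, "dst_port": 53, "protocol": "udp", "length": 80},
]
flows = aggregate_flows(sample)
```

Storing one `FlowStats` record per flow rather than every packet is one way to keep the metadata compact; which fields are worth keeping is a design decision the slide leaves open.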
  • Big Data + Threat Intelligence: IP based analysis
    • Big Data provides the opportunity to map out all the IP addresses used on a particular network
    • Through graph analysis, find rogue IP addresses
    • Use geographical information with IP to find abnormal connection behavior
    • DNS provides many insights into malware connections
      – A static IP cannot be used for malware control purposes
      – Fast flux
      – Awkward names
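As a rough illustration of the fast-flux point, the sketch below flags domains that resolve to many distinct IP addresses with short TTLs. The input format (tuples of domain, IP, TTL), the function name, and both thresholds are assumptions chosen for the example, not values from the talk.

```python
from collections import defaultdict


def fast_flux_candidates(dns_answers, min_distinct_ips=5, max_ttl=300):
    """Return domains seen with many distinct IPs, all with short TTLs.

    dns_answers: iterable of (domain, ip, ttl_seconds) tuples (assumed format).
    The thresholds are illustrative and would need tuning on real data.
    """
    ips = defaultdict(set)
    max_seen_ttl = defaultdict(int)
    for domain, ip, ttl in dns_answers:
        ips[domain].add(ip)
        max_seen_ttl[domain] = max(max_seen_ttl[domain], ttl)
    return sorted(
        d for d in ips
        if len(ips[d]) >= min_distinct_ips and max_seen_ttl[d] <= max_ttl
    )


# Made-up observations: one domain rotating IPs quickly, one stable domain.
answers = [("flux.example", "198.51.100.%d" % i, 60) for i in range(6)]
answers += [("static.example", "192.0.2.1", 3600)] * 6
suspects = fast_flux_candidates(answers)
```

Legitimate content delivery networks also rotate addresses, so a heuristic like this would only nominate candidates for further analysis.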
  • Big Data + Threat Intelligence: Flow to detect malware expression
    • Flow is an essential malware analysis unit
    • Flow identifies
      – Who's connecting to whom
        • Frequency, duration, communication bandwidth
    • The app can be identified from the flow
      – Port, actual content
      – Palo Alto Networks
    • Normal flow vs. abnormal flow
      – With enough data, we could potentially identify normal flow
        • Use the first 16 bytes?
      – Cluster analysis, detect anomalies
  • Spark on Hadoop
  • Apache Spark
    • spark.apache.org
    • github.com/apache/spark
    • user@spark.apache.org
    • Originally developed in 2009 in UC Berkeley's AMP Lab
    • Fully open sourced in 2010; now at the Apache Software Foundation
  • Easy: Example, IP Count
    • Hadoop MapReduce
        import java.io.IOException;
        import java.util.Iterator;
        import java.util.StringTokenizer;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.*;

        // Mapper: emits (token, 1) for every whitespace-separated token in a line
        public static class WordCountMapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> output,
                          Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              output.collect(word, one);
            }
          }
        }

        // Reducer: sums the counts collected for each key
        public static class WordCountReduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterator<IntWritable> values,
                             OutputCollector<Text, IntWritable> output,
                             Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
              sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
          }
        }
    • Spark
        // sparkHome and jars are optional constructor arguments
        val spark = new SparkContext(master, appName)
        val file = spark.textFile("hdfs://...")
        // Take the first comma-separated field of each line as the IP
        val counts = file.map(line => line.split(",")(0))
                         .map(ip => (ip, 1))
                         .reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs://...")
  • Fast: Using RAM, Operator Graphs
    • In-memory caching
    • Data partitions read from RAM instead of disk
    • Operator graphs
    • Scheduling optimizations
    • Fault tolerance
    • [Figure: operator graph over RDDs A to F, split into Stages 1 to 3, with map, join, filter, and groupBy operators; the legend marks RDDs and cached partitions]
  • SPARK RDD
    • The Resilient Distributed Dataset (RDD) is the key (potentially) in-memory data structure
    • An RDD is distributed over Hadoop nodes and typically resides in memory
    • Transform an RDD, then get data from it; evaluation is lazy
      – Two sets of interfaces are provided: one for transformations, the other for actions (e.g., count, save)
    • Most of the interface is quite similar to Lisp operations and SQL operations
    • Use persist (cache) to keep the RDD in memory
  • RDD
  • Working With RDDs
    • [Figure: RDDs pass through transformations into further RDDs; an action then produces a value]
        textFile = sc.textFile("SomeFile.txt")
        linesWithSpark = textFile.filter(lambda line: "Spark" in line)
        linesWithSpark.count()
        74
        linesWithSpark.first()
        # Apache Spark
  • Spark, Hadoop Malware Analysis: Why useful
    • Packet stream
    • Real-time event processing
    • Fast classification or anomaly detection
    • Construct a group of suspected flows in an RDD
      – E.g., suspected DNS tunnels, IRC communications
    • Analyze with Spark on the RDD, in memory
      – Connect the dots: flows, syslogs, and events
      – Huge advantage over Wireshark!
    • Store in HDFS for easy access and use HBase for database support
  • SPARK and Hadoop: Why Hadoop
    • Connecting the dots needs huge storage and fast access
      – Potential need to go back in time to find correlating events
        • A DDoS attack found today + spotty IRC chat 10 days ago + NXDOMAIN events 20 days ago from the suspected infected machine
      – Sometimes it takes months to learn that a domain the machine contacted is suspicious (e.g., scored in VirusTotal)
      – Then see if these patterns match known malware expressions
      – Approximate matching technology is quite important here
        » HMM and correlation modeling
      – HDFS + HBase would be a good solution
        • Store relevant temporal data
        • Retrieve it fast according to the criteria
    • Spark + Hadoop provides a fast development cycle
      – From prototype to evaluation
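The "go back in time" example on this slide (a DDoS today, IRC chat 10 days earlier, NXDOMAIN events 20 days earlier, all from one machine) can be sketched as a look-back query over stored events. This is a minimal in-memory illustration with exact matching only; the event tuple format, the event type names, and the function name are assumptions, and a real deployment would read from HDFS or HBase and likely use the approximate matching the slide mentions.

```python
from datetime import datetime, timedelta


def hosts_matching_pattern(events, required_types, window):
    """Return hosts whose events within `window` of their latest event
    include every type in `required_types`.

    events: iterable of (host, event_type, timestamp) tuples (assumed format).
    required_types: set of event type names.
    window: timedelta giving how far back to look.
    """
    by_host = {}
    for host, etype, ts in events:
        by_host.setdefault(host, []).append((ts, etype))

    matches = []
    for host, evs in by_host.items():
        latest = max(ts for ts, _ in evs)
        seen = {etype for ts, etype in evs if latest - ts <= window}
        if required_types <= seen:
            matches.append(host)
    return sorted(matches)


# Made-up events mirroring the slide's scenario.
now = datetime(2014, 6, 30)
events = [
    ("host-a", "ddos", now),
    ("host-a", "irc_chat", now - timedelta(days=10)),
    ("host-a", "nxdomain", now - timedelta(days=20)),
    ("host-b", "ddos", now),
]
pattern = {"ddos", "irc_chat", "nxdomain"}
infected = hosts_matching_pattern(events, pattern, timedelta(days=30))
```

With a 30-day window only `host-a` shows all three event types; shrinking the window to 15 days drops the NXDOMAIN events and the match disappears, which is why long retention matters for this kind of correlation.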
  • Example Detection Algorithm Development Scenarios
  • Introduction to Botnet (Terminology)
    • [Figure: botnet diagram with these labels: Bot Master, Bots, Code Server, IRC Server, Victim, IRC Channel, Attack, C&C Traffic, Updates]
    • Old-days botnet operation, just for reference
    • Companies are interested in finding these on their premises
  • (Malware Expression) Detection Phases
    • Pre-infection detection
      – Intrusion detection system
    • Active infection detection
      – Recruit and reconnaissance in the internal network
    • Post-infection detection
      – Exploit and monetize
  • Pre Infection Detection: CAMP
    • Detect suspicious URLs
      – When a device tries to contact or download from suspicious URLs, block it
    • How it works
      – If suspicious or unknown content is detected, send it to the back-end Big Data deep analysis engine
      – Update suspicious IPs / domain names / URLs
      – Update the hash of the binary
      – Regularly remove old hashes / suspicious URLs
  • Ongoing infection detection: In-network infection propagation
    • How it works
      – Detect suspicious internal behavior
      – Develop a normal behavioral model for the target customer site
      – Detect abnormal authentication behavior, e.g., Kerberos, LDAP
      – Detect suspicious data movement
      – Detect suspicious port usage
      – Detect tunnels
    • It is highly important to leverage Big Data to develop a sustainable normal behavioral model and update it constantly, because network data and models are constantly changing
    • Consult with security experts to define the measurement points
  • Post Infection Detection: Protocol Abnormality
    • HTTP and DNS are the most frequently abused protocols
      – Firewalls let these ports through
      – If needed, play man in the middle for SSL data inspection
    • Ill-formed HTTP header detection
      – Abnormal location
      – Abnormal referrer
      – Abnormal User-Agent
      – Abnormal size
    • Abnormal HTTP POST detection (e.g., entropy analysis)
    • Ill-formed XML / HTML
    • SQL injection
      – SELECT * FROM users WHERE name = '' OR '1'='1';
    • LDAP code injection
    • Process: collect malware expression samples; develop a feature set with Hadoop and SMEs; deploy and continually update the model
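The entropy analysis mentioned for HTTP POST bodies can be sketched with Shannon entropy over byte frequencies. This is only an illustration of the measurement itself; the function name is an assumption, and the talk does not say which entropy measure or threshold would be used.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Made-up POST bodies: a readable form submission and a uniform byte blob.
form_body = b"username=alice&comment=hello+world"
blob_body = bytes(range(256))

form_entropy = shannon_entropy(form_body)
blob_entropy = shannon_entropy(blob_body)
```

Encrypted or compressed payloads tend toward the 8 bits per byte upper bound, while form fields and other text sit well below it. Legitimate uploads such as images are also high-entropy, so the score is better treated as one feature among several than as a verdict.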
  • Post Infection Detection: Volumetric Abnormality
    • Click fraud
    • Like fraud
    • DDoS
    • Spam
  • Post Infection Detection: Command and Control Contact
    • Cadence
    • Weird domain name resolution
    • Fast-fluxing domain names
    • Abnormal IRC traffic behavior
    • Abnormal Twitter behavior
    • Abnormal Facebook behavior
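One possible reading of "cadence" is that a bot contacting its command and control server on a timer produces suspiciously regular gaps between connections. The sketch below scores that regularity with the coefficient of variation of the gaps; the function name and the interpretation are assumptions, not something the talk specifies.

```python
import statistics


def beacon_score(timestamps):
    """Coefficient of variation of the gaps between contact times.

    Values near 0 mean very regular (timer-like) contact. Returns None
    when there are fewer than three timestamps or the mean gap is zero.
    """
    if len(timestamps) < 3:
        return None
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    mean_gap = statistics.mean(gaps)
    if mean_gap == 0:
        return None
    return statistics.pstdev(gaps) / mean_gap


# Made-up contact times in seconds: a 60-second beacon and human-like browsing.
regular = beacon_score([0, 60, 120, 180, 240])
irregular = beacon_score([0, 5, 100, 130])
```

Real beacons often add jitter, so a practical detector would look for a low score rather than exactly zero, and would compute it per pair of source host and destination.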
  • DGA (domain generation algorithm), ClickSecurity.com
    • What features would you use?
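The slide leaves the feature question open. As one hedged starting point, the sketch below computes a few lexical features that are commonly tried for algorithmically generated domain names: label length, character entropy, digit ratio, and vowel ratio. The function name and the choice of features are assumptions for illustration, not the speaker's answer, and the code looks only at the first label of the name it is given.

```python
import math
from collections import Counter


def domain_features(domain: str) -> dict:
    """Lexical features of the first label of a domain name.

    Assumes the caller passes a registered domain such as "example.com",
    without a leading subdomain like "www".
    """
    label = domain.lower().split(".")[0]
    n = len(label)
    if n == 0:
        return {"length": 0, "entropy": 0.0,
                "digit_ratio": 0.0, "vowel_ratio": 0.0}
    counts = Counter(label)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "length": n,
        "entropy": entropy,
        "digit_ratio": sum(ch.isdigit() for ch in label) / n,
        "vowel_ratio": sum(ch in "aeiou" for ch in label) / n,
    }


# Made-up names: one dictionary-like, one random-looking.
plain = domain_features("weather.example")
random_like = domain_features("x7k2q9zt4vbn.example")
```

Features like these would typically feed a classifier or clustering step rather than a fixed rule, since short or brand-name domains can look random too.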
  • Conclusion
  • Conclusion
    • Threat intelligence and Big Data are very hot
    • Big Data is the ideal analysis platform for malware expression analysis
      – Caution: remember the cons
      – Useful for efficiently connecting the dots
    • Big Data enables
      – Persistent model building and updating
      – Reducing false positives through exhaustive data checks rather than spot checks
    • Hadoop / Spark provides an ideal platform for malware expression analysis
      – Spark provides strong in-memory processing power for complex malware data analysis, with simpler scripting-level coding
        • Scala
      – MapR provides the fastest data access on Hadoop nodes
        • M7
    • MapR is the better Hadoop
    • Don't underestimate NFS and volume convenience
    • Questions are welcome; send them to syoon@maprtech.com, mvasquez@maprtech.com, nestrada@maprtech.com