Hadoop World 2011: Leveraging Big Data in the Fight Against Spam and Other Security Threats - Wade Chambers, Proofpoint

In 2004, Bill Gates told a select group of participants in the World Economic Forum that "two years from now, the spam issue will be solved." Eight years later, the spam problem is only getting worse, with no sign of relief. Big Data technologies such as Hadoop, MapReduce, Cassandra, and real-time stream processing can be leveraged to develop new approaches that fight spam, phishing, and other email-borne threats more effectively than ever before. This session will focus on the development of radical new "spam anomalytics" techniques, in which billions of messages and message-related events are analyzed daily to find statistical norms, and to identify deviations from those norms, in order to better detect and defend against email threats as they emerge.

Published in: Technology, News & Politics


  1. Stopping Email Threats: A Big Data Approach. Wade Chambers, Executive Vice President, Development/Operations
  2. Overview: Current Solutions/Problems; Introducing Anomalytics; Details, details …
  3. Current Solutions/Problems (agenda: Current Solutions/Problems; Introducing Anomalytics; Details, details …)
  4. Spam Technology is better … Spam detection effectiveness has vastly improved to ~99.5%.
  5. Spam Volumes are Down … [Chart: Overall Message Volume, August 2010 to August 2011, by month from Sep-10 to Aug-11.] Spam volumes near 12-month low.
  6. But the game is changing …
     Spam is still an annoyance, but companies are seeing:
       • An increase in sophisticated, malicious, targeted attacks
       • More diverse organizations being hit with targeted, smaller-scale attacks
     Traditional security architectures aren't keeping pace with evolving malicious attacks:
       • Outdated detection and prevention technology
       • Lack of innovation from large vendors
  7. International Data Trafficking: Stolen Data is Now a Marketable Commodity
     In the past 5 years a sophisticated market in stolen data has emerged:
       • Diverse buyers
       • Sophisticated suppliers
       • Vast impact of cybercrime, industrial espionage

     Data Type                                                Street Value
     Valid Email Address                                      $1 per 10,000
     Username / Password / Emails from Compromised Website    $1 per 1,000
     Credit Card #                                            $2-$90
     Medical Record                                           $8-$20
     Bank Credentials                                         $80+
     Rent 100 Botnet Infected Machines                        $700+/month
     Celebrity Medical Record                                 $1000+
     Admin Access to High-traffic Compromised Website         $3,000+
     Company Financials, Intellectual Property                $10,000+ - ???

     US intelligence agencies estimate the cost of lost business due to theft of technology and business ideas at $100 - $250 billion/year.
  8. Result: An Epidemic of Breaches
     A Small Sampling of Recent Breaches:
       • DOE Laboratories: April & July 2011
       • International Monetary Fund: June 2011
       • Epsilon: April 2011
       • RSA: March 2011
       • Securities and Exchange Commission: May 2011
       • Human Services Agency of San Francisco: Feb 2011
       • Austrian Police Agencies: August 2011
       • Hyundai Capital (South Korea): April 2011
  9. Phishing for Account Credentials
  10. HTML Attachments
  11. Mobile End Users Highly Susceptible to Phishing [Screenshot labeled "FAKE!!!!"]
  12. What are the Best of Breed Solutions?
      Reputation:
        • IP, URL, Domain, Registrar, Sender, Receiver, …
        • Local and Global automation
        • Super fast, but blunt edged
      Content:
        • Words, Phrases, Patterns, RegEx …
        • Definitions built using machine-learning and Data Analysts
        • Trainable by Language
      Spam Traps:
        • Millions of messages
        • Thousands of domains
      Verification:
        • Recipient, DKIM, …
        • Local integration
        • Rejects lessen load on system and provide evidence
  13. The nasty little secret: You have to see it to defend against it, via: spam traps • honey points • honey pots • reported
  14. The nasty little problem: The time difference between the email hitting your server and your vendor's spam trap introduces risk to your organization.
  15. Anomalytics (agenda: Current Solutions/Problems; Introducing Anomalytics; Details, details …)
  16. The new challenge: Not a needle in a hay stack . . . a needle in a needle stack
        • Organizations see things vendors never see (extremely targeted attacks)
        • Organizations see new threats before vendors can train on them
        • Organizations have amazingly rich data as a by-product of processing their email …
        • … that can be used to help stop new threats the first time they appear
  17. We have to flip the model …
      Existing techniques try to understand "Bad":
        • Always a half-step behind
        • Can be defeated by changing the pattern
      Anomalytics: Model "Good" to find the abnormal:
        • Find faster: anything outside normal is suspect
        • Hard to defeat: normal is both dynamic and variable
  18. Surface level data from email messages …

      Attribute            Value
      spf_result           Fail
      attachment_count     1
      attachment_size      255
      charset              WINDOWS-1252
      country              ua
      ip                   94.153.252.70
      InsertionDate        2011-04-26 12:37:05
      SmtpHelo             94-153-252-70-kh.ip.kyivstar.net
      SmtpHostIp           94.153.252.70
      msgsize              261
      ip reputation        100
      virus score          18
      number recipients    1
      adult score          81
      bulk score           1
      phish score          0
      spam score           97
      sender               <onewayhash>@domain.com
      recipient            <onewayhash>@bestspecials.biz

      Evidence gathered: "HELO_DYNAMIC_IPADDR2", "MISSING_HEADERS", "MISSING_SUBJECT", "PP_ATTACHMENT_TXT", "PP_FROM_NOANGLES", "PP_HAS_RCVD", "PP_IMG_COUNT_0", "PP_IP_COUNTRY_UA", "PP_IP_SCORE_100", "PP_MIME_PLAINTEXT_ONLY", "PP_NO_CTE", "PP_NO_CTYPE", "PP_NO_MSGID", "PP_NO_MUA", "PP_RCVD_FROM_HOME_ISP", "PP_TO_NOANGLES", "TO_CC_NONE"
  19. Hadoop/Anomalytics enables …
      Senders/Receivers: Understanding User Trends/Behavior
        • Circle of Trust
        • Sender Analytics
        • Receiver Analytics
        • …
      Domain/Company: Understanding Domain Trends/Behavior
        • Domain to sending IP mappings
        • Domain % spam sent
        • Domain forensics (SPF/DKIM/Headers/etc.)
        • …
      Infrastructure: Understanding Infrastructure Trends/Behavior
        • % email spam from IP
        • Average message size from IP (and message size distribution)
        • Average number of recipients per message from IP
        • …
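The infrastructure features listed on slide 19 can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the Hadoop jobs the talk describes. The function name `summarize_by_ip`, the event field names, the spam-score threshold of 50, and every sample event except the first (whose values come from slide 18) are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_ip(events, spam_threshold=50):
    """Group message events by sending IP and compute simple per-IP features.

    spam_threshold is a hypothetical cut-off: a message counts as spam
    when its spam score is at or above it.
    """
    by_ip = defaultdict(list)
    for event in events:
        by_ip[event["ip"]].append(event)

    summary = {}
    for ip, msgs in by_ip.items():
        spam = sum(1 for m in msgs if m["spam_score"] >= spam_threshold)
        summary[ip] = {
            "messages": len(msgs),
            "pct_spam": 100.0 * spam / len(msgs),
            "avg_msgsize": mean(m["msgsize"] for m in msgs),
            "avg_recipients": mean(m["recipients"] for m in msgs),
        }
    return summary


# First event uses values from slide 18; the others are invented.
events = [
    {"ip": "94.153.252.70", "spam_score": 97, "msgsize": 261, "recipients": 1},
    {"ip": "94.153.252.70", "spam_score": 10, "msgsize": 739, "recipients": 3},
    {"ip": "192.0.2.10", "spam_score": 0, "msgsize": 1200, "recipients": 2},
]
stats = summarize_by_ip(events)
```

At the scale described in the talk, the same grouping would be expressed as a MapReduce or Hive aggregation keyed on the IP address rather than as an in-memory dictionary.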
  20. Circle of Trust: Build a Ledger
      Build explicit counts:
        • Sender → Receiver
        • Sender → Group
        • Sender → Company
        • Receiver → Sender
        • Receiver → Sender Domain
        • Receiver Group → Sender
        • Receiver Group → Sender Domain
        • Receiver Domain → Sender
        • Receiver Domain → Sender Domain
      Example counts (rows: receiver level; columns: sender level):
        Receiver User      0/3     n/a    0/56
        Receiver Group     0/21    n/a    0/127
        Receiver Domain    1/79    n/a    4/9,215
  21. Circle of Trust: A Friend of a Friend
      [Diagram: nodes A, B, C, and D connected by edges labeled "Trust" and "No Trust".]
      Extra Credit: Well-known machine learning algorithms allow you to build a "friend of friends" solution.
      "A Trusts C: A friend of a friend is my friend."
  22. Circle of Trust (cont.)
      Whether you, your group, or your company has ever sent email to the sender of an incoming message is a strong indicator of whether that message is "normal".
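A ledger of this kind can be sketched as a set of counters keyed by who wrote to whom. The `TrustLedger` class and its method names below are hypothetical, group-level counts are left out because they would need a directory lookup that the slides do not describe, and the friend-of-a-friend check is a plain two-hop lookup standing in for the machine learning algorithms the slide mentions.

```python
from collections import Counter, defaultdict


class TrustLedger:
    """Counts of outbound mail at user and domain level (hypothetical sketch)."""

    def __init__(self):
        self.counts = Counter()
        self.contacts = defaultdict(set)  # user -> users they have written to

    def record_sent(self, sender, receiver):
        """Record one message sent from sender to receiver."""
        s_dom = sender.split("@", 1)[1]
        r_dom = receiver.split("@", 1)[1]
        self.counts[("user", sender, "user", receiver)] += 1
        self.counts[("user", sender, "domain", r_dom)] += 1
        self.counts[("domain", s_dom, "user", receiver)] += 1
        self.counts[("domain", s_dom, "domain", r_dom)] += 1
        self.contacts[sender].add(receiver)

    def sent_count(self, from_level, who, to_level, whom):
        """How many messages 'who' has sent to 'whom' at the given levels."""
        return self.counts[(from_level, who, to_level, whom)]

    def trusts(self, user, other):
        """Direct trust: user has written to other at least once."""
        return other in self.contacts.get(user, set())

    def trusts_via_friend(self, user, other):
        """Two-hop trust: someone user has written to has written to other."""
        return any(
            other in self.contacts.get(friend, set())
            for friend in self.contacts.get(user, set())
        )


ledger = TrustLedger()
ledger.record_sent("a@corp.example", "b@partner.example")
ledger.record_sent("b@partner.example", "c@vendor.example")
```

For an incoming message from `c@vendor.example` to `a@corp.example`, `ledger.trusts("a@corp.example", "c@vendor.example")` is false, but `ledger.trusts_via_friend(...)` is true because B links the two.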
  23. Anomalytics: Looking for norms…
      Use historical data to establish what is normal for any feature at a specific time of day on a given day of the week.
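One way to read this slide is as a baseline per (day of week, hour) slot, with new observations scored by how far they fall from that slot's history. The sketch below assumes a mean and standard deviation as the "norm" and a z-score as the deviation measure; the slides do not say which statistics were used, and the function names and sample counts are invented.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


def build_baseline(observations):
    """Map (weekday, hour) -> (mean, stdev) from (timestamp, value) pairs."""
    buckets = defaultdict(list)
    for ts, value in observations:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {slot: (mean(vals), pstdev(vals)) for slot, vals in buckets.items()}


def deviation(baseline, ts, value):
    """Z-score of value against its slot; None if the slot has no usable history."""
    slot = baseline.get((ts.weekday(), ts.hour))
    if slot is None or slot[1] == 0:
        return None
    mu, sigma = slot
    return (value - mu) / sigma


# Invented history: messages per hour from one sender, four Tuesdays at noon.
history = [
    (datetime(2011, 3, 29, 12), 8),
    (datetime(2011, 4, 5, 12), 10),
    (datetime(2011, 4, 12, 12), 12),
    (datetime(2011, 4, 19, 12), 10),
]
baseline = build_baseline(history)

# A burst of 40 messages on the next Tuesday at noon is far outside the norm.
z = deviation(baseline, datetime(2011, 4, 26, 12, 37), 40)
```

Because the baseline is rebuilt from recent history, "normal" shifts as behavior shifts, which matches the slide 17 claim that normal is both dynamic and variable.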
  24. Applying Big Data Analysis . . .
      Spear phishing example:
        • Neither you, nor anyone on your team, nor anyone in your company has ever sent email to the sender or the sender's domain
        • The sender's IP has unknown reputation AND is associated with a suspicious registrar AND was published less than 24 hours ago
        • This sender has sent 5 emails in 5 minutes to your company, all to your group
        • The content contains a URL that has never been seen before and has an extremely low Alexa ranking
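The spear phishing example combines several independent signals. A hedged sketch of that combination follows; the `MessageContext` fields, the signal names, and the thresholds are hypothetical stand-ins for values that would come from the ledger, reputation data, and URL history, and the Alexa ranking check is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class MessageContext:
    """Pre-computed facts about one incoming message (all fields hypothetical)."""
    prior_contact: bool             # anyone in the company has written to the sender or domain
    ip_reputation_known: bool
    registrar_suspicious: bool
    published_hours_ago: float
    msgs_to_group_last_5_min: int   # messages from this sender to the same group
    url_seen_before: bool


def spear_phishing_signals(ctx, burst_threshold=5, max_age_hours=24):
    """Return the names of the slide-24 signals that fire for this message."""
    signals = []
    if not ctx.prior_contact:
        signals.append("no_prior_contact")
    if (not ctx.ip_reputation_known
            and ctx.registrar_suspicious
            and ctx.published_hours_ago < max_age_hours):
        signals.append("new_unknown_infrastructure")
    if ctx.msgs_to_group_last_5_min >= burst_threshold:
        signals.append("burst_to_group")
    if not ctx.url_seen_before:
        signals.append("never_seen_url")
    return signals


suspect = MessageContext(
    prior_contact=False, ip_reputation_known=False, registrar_suspicious=True,
    published_hours_ago=6.0, msgs_to_group_last_5_min=5, url_seen_before=False,
)
regular = MessageContext(
    prior_contact=True, ip_reputation_known=True, registrar_suspicious=False,
    published_hours_ago=5000.0, msgs_to_group_last_5_min=1, url_seen_before=True,
)
```

Returning the list of fired signals, rather than a single yes/no verdict, leaves the weighting to a downstream scorer.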
  25. Behavioral Analysis using Big Data
      Reputation:
        • IP, URL, Domain, Registrar, Sender, Receiver, …
        • Local and Global automation
        • Super fast, but blunt edged
      Content:
        • Words, Phrases, Patterns, RegEx …
        • Definitions built using machine-learning and Data Analysts
        • Trainable by Language
      Spam Traps:
        • Millions of messages
        • Thousands of domains
      Verification:
        • Recipient, DKIM, …
        • Local integration
        • Rejects lessen load on system and provide evidence
      Behavioral:
        • Cloud/Big Data solutions leveraged to catch evolving threats
        • Can leverage any high-level facet of a message to compute rates, norms, deviations, clusters
  26. Details, details … (agenda: Current Solutions/Problems; Introducing Anomalytics; Details, details …)
  27. Architectural Overview
      [Diagram: Customer Datacenters (Legacy Spam Filter Appliance) and Proofpoint Datacenters (Legacy Systems, Hosted Spam Filter, Aggregator) send email traffic events, FN/FP events, and scoring requests to Amazon AWS, which hosts the Scorer, Collector, Hive + Hadoop MR, and S3 with a Model Repository and an Event Repository.]
        • EC2 for compute
        • S3 for long-term storage
        • Servers and applications built on top of Proofpoint Platform
        • HTTP-based APIs
        • Deployments and application lifecycle managed via Galaxy
        • Other AWS technologies: ELB, ElasticMapReduce (+ Hive), CloudFormation
  28. Architectural Overview
      [Diagram: Email traffic events and FN/FP events (legacy XML) → Normalizer → JSON → Event Collector → Local Spool → Snappy-compressed JSON → Staging Area (S3) → Combiner → Event Repository (S3), holding Email traffic and FN/FP data.]
      The Normalizer:
        • Transforms from legacy hierarchical XML format into JSON
        • Canonicalizes URLs and email addresses
        • Annotates with additional features (ASN, nameserver for sender IP)
        • Forwards to the generic event collection layer
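The slides do not say which canonicalization rules the Normalizer applies. As an illustration only, the sketch below assumes a common set: lower-case the scheme and host, drop default ports and fragments, and lower-case the domain part of an email address while leaving the local part untouched. Both function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize_url(url):
    """Lower-case scheme and host, drop default port and fragment (assumed rules).

    Simplification: userinfo is dropped and IPv6 literals are not handled.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, ""))


def canonicalize_email(address):
    """Strip angle brackets and lower-case the domain; keep the local part as-is."""
    local, _, domain = address.strip().strip("<>").rpartition("@")
    return f"{local}@{domain.lower()}"


url = canonicalize_url("HTTP://Example.COM:80/a?b=1#frag")
addr = canonicalize_email("<User@Example.COM>")
```

Canonical forms matter here because the later counting steps key on URLs and addresses: two spellings of the same URL would otherwise be counted as two never-before-seen URLs.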
  29. Data Collection and Storage
      Tradeoff:
        • S3 files are immutable, write-once, and not available for reads until "complete"
        • Ability to process new data as soon as possible requires writing small files
        • … but Hadoop is more efficient at processing large files
      Solution:
        • Local spool in collectors (1 minute or 512 MB)
        • Upload to staging area in S3
        • Compressed using Snappy
          - Framing format supports concatenated compressed files
          - Pure Java implementation: https://github.com/dain/snappy
        • Simulate "append" by repeatedly concatenating staged files into hourly buckets (or 512 MB)
        • S3 multipart upload API with references to existing files in S3
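Two ideas from this slide lend themselves to a short sketch: rolling the local spool when it reaches an age or size limit, and "appending" by concatenating independently compressed files. The `Spool` class is hypothetical, and the example uses gzip from the Python standard library in place of Snappy, because gzip also allows concatenated compressed streams to be read back as one.

```python
import gzip
import time


class Spool:
    """Buffer records and hand back a batch after 1 minute or 512 MB (sketch)."""

    def __init__(self, max_age_s=60, max_bytes=512 * 1024 * 1024,
                 clock=time.monotonic):
        self.max_age_s = max_age_s
        self.max_bytes = max_bytes
        self.clock = clock
        self._reset()

    def _reset(self):
        self.chunks = []
        self.size = 0
        self.opened_at = None

    def append(self, record: bytes):
        """Add one record; return the finished batch if a limit was hit, else None.

        Simplification: age is only checked on append. A real collector
        would also flush on a timer so idle spools do not sit forever.
        """
        if self.opened_at is None:
            self.opened_at = self.clock()
        self.chunks.append(record)
        self.size += len(record)
        too_big = self.size >= self.max_bytes
        too_old = self.clock() - self.opened_at >= self.max_age_s
        if too_big or too_old:
            batch = b"".join(self.chunks)
            self._reset()
            return batch
        return None


# "Append" by concatenation: each staged file is compressed on its own,
# and the joined bytes still decompress as one stream.
staged = [gzip.compress(b'{"n": 1}\n'), gzip.compress(b'{"n": 2}\n')]
hourly_bucket = b"".join(staged)
combined = gzip.decompress(hourly_bucket)
```

The concatenation property is what lets a combiner build a large hourly file from many small staged files without decompressing and recompressing them.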
  30. Processing
        • Elastic MapReduce
        • Custom MR jobs over S3 files
        • Hive jobs over external tables
          - JSON SerDe: https://github.com/proofpoint/hive-serde
        • Final output into S3
  31. Building RESTful Services
        • Toolkit for building Java-based web services and applications
        • Mostly "glue" for common Java technologies: JAX-RS (Jersey), HTTP (Jetty), JSON (Jackson), JMX
        • Some abstractions to produce applications with uniform:
          - service discovery
          - configuration
          - logging
          - monitoring hooks
          - event generation
          - packaging and deployment
        • Applications deployable via Galaxy: https://github.com/dain/galaxy-server
        • Support for Rails apps (via JRuby)
        • https://github.com/proofpoint/platform
  32. Questions?
