Parallel SPAM Clustering with Hadoop
Presentation Transcript

  • Parallel Spam Clustering with Apache Hadoop Thibault Debatty
  • Spam
    ● 70% of total email volume
    ● Estimated cost: $20.5 billion/year
    ● To fight it better, we need better strategic knowledge
    ● Examples:
      ● “Guaranteed Results”
      ● “Make YourPenis 3-inches longer & thicker, girl will love you 1k”
    ● Annotations on the examples: close IP, same domain
  • Problem statement
    ● Cluster spams in parallel:
      ● To get useful insights
      ● Fast!
    ● Dataset: 1 million spams (231 MB)
  • Problem statement (example spam)
    ● Subject: Your Special Order #253650
    ● Charset: windows-1250
    ● Geo: GB
    ● Day: 2010-10-01
    ● Host: virginmedia.com
    ● IP: 82.4.229.158
    ● Lang: english
    ● Size: 1482
    ● From: berry_wagnertl@migrosbank.ch
    ● Rcpt: brady@domain0140.com
  • What's next...
    1. MapReduce and Apache Hadoop
    2. Parallel K-means
    3. Implementation
    4. Benchmarks and speedup analysis
    5. Cluster visualization
  • 1. MapReduce
    ● Model for processing large data sets
    ● Master node splits and distributes the dataset
    ● 2 steps:
      1. Map: worker nodes process data and pass partial results to the master
      2. Reduce: the master combines the partial results
    ● Also the name of Google's implementation
  • 1. Apache Hadoop
    ● Free implementation of MapReduce
    ● Written in Java
    ● Processes large amounts of data (petabytes)
    ● Used by:
      ● Yahoo: more than 10,000 cores
      ● Facebook: 30 PB of data
    ● Distributed filesystem (HDFS) + data locality
  • 1. Apache Hadoop
    ● Job Tracker
      ● ≃ Master
      ● Divides input data into “splits”
      ● Schedules map tasks (with data locality)
      ● Schedules reduce tasks on nodes
      ● Checks task health
  • 1. Apache Hadoop [Figure: data flow labeled <key, value> and <key, list of values>]
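The <key, value> to <key, list of values> flow of section 1 can be imitated in memory, without Hadoop. The sketch below is only an illustration of the model: the class name `MapReduceSketch`, its method names, and the example host names are hypothetical and do not come from the slides. It counts records per host, with the map step emitting <host, 1> and the reduce step summing the grouped values.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of the map / group-by-key / reduce flow (no Hadoop involved). */
class MapReduceSketch {

    /** Map step: one input record becomes one <key, value> pair, here <host, 1>. */
    static Map.Entry<String, Integer> map(String host) {
        return new AbstractMap.SimpleEntry<>(host, 1);
    }

    /** Grouping step: collect all values that share a key into <key, list of values>. */
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce step: combine the list of values of one key into a single result. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs the three steps in sequence and returns the number of records per host. */
    static Map<String, Integer> countPerHost(List<String> hosts) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String host : hosts) {
            pairs.add(map(host));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : group(pairs).entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical host names, for illustration only.
        System.out.println(countPerHost(Arrays.asList("a.example", "b.example", "a.example")));
    }
}
```

In a real Hadoop job the grouping step is performed by the framework between the map and reduce tasks; here it is written out only to make the two data shapes visible.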
  • 2. KMeans
    ● Select initial centers
    ● Until stop criterion is reached:
      ● Assign each point to the closest center
      ● Compute the new centers
    ● Advantages:
      ● Suited to large datasets
      ● Can be implemented in parallel
      ● Computation: O(nki) (n points, k centers, i iterations)
  • 2. Parallel KMeans
    ● “Parallel K-Means Clustering Based on MapReduce”, Weizhong Zhao, Huifang Ma and Qing He
    ● Map (point):
      ● Compute distance to each center
      ● Output <id of closest center, point>
    ● Reduce (list of points):
      ● Compute center
      ● Output <center>
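One iteration of the map and reduce steps above can be sketched for plain 2D points as follows. This is a minimal in-memory illustration, not the presented implementation: the class `KMeansStepSketch`, the squared Euclidean distance, and the rule that an empty cluster keeps its old center are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** One k-means iteration on 2D points, written as a map step and a reduce step. */
class KMeansStepSketch {

    /** Map: return the id of the center closest to the point (squared Euclidean distance). */
    static int closestCenter(double[] point, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dx = point[0] - centers[c][0];
            double dy = point[1] - centers[c][1];
            double dist = dx * dx + dy * dy;
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** Reduce: the new center is the mean of the points assigned to the old one. */
    static double[] mean(List<double[]> points) {
        double sx = 0;
        double sy = 0;
        for (double[] p : points) {
            sx += p[0];
            sy += p[1];
        }
        return new double[] {sx / points.size(), sy / points.size()};
    }

    /** One iteration: map every point, group by center id, reduce each group. */
    static double[][] iterate(List<double[]> points, double[][] centers) {
        Map<Integer, List<double[]>> groups = new TreeMap<>();
        for (double[] p : points) {
            groups.computeIfAbsent(closestCenter(p, centers), k -> new ArrayList<>()).add(p);
        }
        double[][] next = new double[centers.length][];
        for (int c = 0; c < centers.length; c++) {
            // Assumption of this sketch: a center with no assigned point keeps its position.
            next[c] = groups.containsKey(c) ? mean(groups.get(c)) : centers[c];
        }
        return next;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
                new double[] {0, 0}, new double[] {0, 2},
                new double[] {10, 10}, new double[] {10, 12});
        double[][] centers = {{1, 1}, {9, 9}};
        System.out.println(Arrays.deepToString(iterate(points, centers)));
    }
}
```

Because each point is mapped independently and each center is reduced independently, both steps can be spread over several workers, which is what makes the algorithm fit the MapReduce model.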
  • 3. Implementation: KMeans
    ● Abstract KMeans
    ● Abstract KMeansMapper
    ● Abstract KMeansReducer
    ● Interface IPoint
    ● Interface ICenter
    ● 2 concrete implementations:
      ● Spam
      ● Simple 2D points
  • 3. Implementation: Abstract KMeans
    // Write to "/it_0/part-00000"
    this.writeInitialCentroids();
    for (…) {
        conf.setMapperClass(this.mapper);
        conf.setReducerClass(this.reducer);
        conf.setInt("iteration", iteration);
        SetOutputPath(... "/it_" + (iteration + 1));
        ...
    }
  • 3. Implementation: Abstract KMeansMapper
    public void configure(JobConf job) {
        // reads from
        // "/it_" + job.get("iteration") + "/part-xxxxx"
        this.fetchCenters(job);
    }
    public void map(key, value,...) {
        IPoint point = this.createPointInstance();
        point.parse(value);
        ...
    }
    public abstract IPoint createPointInstance();
    public abstract ICenter createCenterInstance();
  • 3. Implementation: Abstract KMeansReducer
    public void reduce(key, values, …) {
        new_center = this.createCenterInstance();
        new_center.setOldCenter(old_center);
        while (values.hasNext()) {
            // Assumed: each value is parsed into a point, as in the mapper
            IPoint point = this.createPointInstance();
            point.parse(values.next());
            new_center.addPoint(point);
        }
        new_center.compute();
        output.collect(new_center);
    }
    public abstract IPoint createPointInstance();
    public abstract ICenter createCenterInstance();
  • 3. Implementation: Spam Clustering
    ● Distance between spams: weighted average of feature distances
    ● Text features: Jaro distance
  • 3. Implementation: Spam Clustering
    ● Jaro similarity = $\frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)$
    ● Where:
      ● $|s_1|$, $|s_2|$ = lengths of the two strings;
      ● m = number of matching characters;
      ● t = number of matching characters not located at the same position / 2.
    ● Matching = not farther apart than $\left\lfloor \max(|s_1|, |s_2|) / 2 \right\rfloor - 1$ positions
    ● => Takes misspelling into account
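A minimal sketch of the Jaro similarity as it is usually defined: one third of the sum m/|s1| + m/|s2| + (m - t)/m, where two equal characters match only if they are at most floor(max(|s1|, |s2|) / 2) - 1 positions apart. The class name `JaroSketch` and the choice of distance = 1 - similarity are assumptions for illustration; the slides do not show the author's code for this part.

```java
/** Jaro similarity between two strings, following the usual textbook definition. */
class JaroSketch {

    static double similarity(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        // Two equal characters match only if they are at most 'window' positions apart.
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] aMatched = new boolean[a.length()];
        boolean[] bMatched = new boolean[b.length()];
        int m = 0;
        for (int i = 0; i < a.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(b.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!bMatched[j] && a.charAt(i) == b.charAt(j)) {
                    aMatched[i] = true;
                    bMatched[j] = true;
                    m++;
                    break;
                }
            }
        }
        if (m == 0) {
            return 0.0;
        }
        // Count matched characters that appear in a different order in the two strings.
        int outOfOrder = 0;
        int j = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!aMatched[i]) {
                continue;
            }
            while (!bMatched[j]) {
                j++;
            }
            if (a.charAt(i) != b.charAt(j)) {
                outOfOrder++;
            }
            j++;
        }
        double t = outOfOrder / 2.0;
        return (m / (double) a.length() + m / (double) b.length() + (m - t) / m) / 3.0;
    }

    /** Assumption of this sketch: the distance is the complement of the similarity. */
    static double distance(String a, String b) {
        return 1.0 - similarity(a, b);
    }

    public static void main(String[] args) {
        System.out.println(similarity("MARTHA", "MARHTA"));
        System.out.println(distance("Your receipt #12", "Your reciept #34"));
    }
}
```

Swapped or slightly displaced characters still count as matches, which is why the measure tolerates the misspellings that are common in spam subjects.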
  • 3. Implementation: Spam Clustering
    ● Distance between spams: weighted average of feature distances
      ● Text features: Jaro distance
      ● IP: number of different bits / 32
      ● Size: max 10% difference
      ● Day: arctangent-shaped function
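The IP distance and the weighted average can be sketched as below. The class `SpamDistanceSketch`, its method names, and the weights used in `main` are hypothetical: the slides give neither the weights nor the exact form of the size and day distances, so only the two parts that are fully specified are shown.

```java
/** Sketch of two pieces of the spam distance: IP bit distance and weighted average. */
class SpamDistanceSketch {

    /** Parses a dotted IPv4 address such as "82.4.229.158" into a 32-bit integer. */
    static int parseIpv4(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ip);
        }
        int value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Octet out of range in: " + ip);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /** Number of differing bits between the two addresses, divided by 32. */
    static double ipDistance(String a, String b) {
        return Integer.bitCount(parseIpv4(a) ^ parseIpv4(b)) / 32.0;
    }

    /** Weighted average of per-feature distances; weights must not sum to zero. */
    static double weightedAverage(double[] distances, double[] weights) {
        if (distances.length != weights.length) {
            throw new IllegalArgumentException("distances and weights differ in length");
        }
        double sum = 0;
        double weightSum = 0;
        for (int i = 0; i < distances.length; i++) {
            sum += weights[i] * distances[i];
            weightSum += weights[i];
        }
        return sum / weightSum;
    }

    public static void main(String[] args) {
        double ip = ipDistance("82.4.229.158", "82.4.229.159");
        // Hypothetical weights and a hypothetical subject distance of 0.2.
        System.out.println(weightedAverage(new double[] {0.2, ip}, new double[] {2.0, 1.0}));
    }
}
```

Counting differing bits makes addresses in the same subnet close to each other, since they share their leading bits.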
  • 3. Implementation: Spam Clustering
    ● Center of cluster:
      ● Text features: Longest Common Subsequence;
      ● Charset, Geo (country code), Lang, Day: most frequently occurring value;
      ● Size: average value.
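Two of the center computations can be sketched as follows: a longest common subsequence of two strings (dynamic programming) and the most frequent value of a categorical feature. The class `CenterSketch` is hypothetical, and how the author extends the LCS from two strings to a whole cluster is not stated in the slides, so only the pairwise case is shown.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of two cluster-center computations: pairwise LCS and most frequent value. */
class CenterSketch {

    /** Longest common subsequence of two strings, by dynamic programming. */
    static String lcs(String a, String b) {
        // len[i][j] = length of the LCS of a.substring(i) and b.substring(j)
        int[][] len = new int[a.length() + 1][b.length() + 1];
        for (int i = a.length() - 1; i >= 0; i--) {
            for (int j = b.length() - 1; j >= 0; j--) {
                if (a.charAt(i) == b.charAt(j)) {
                    len[i][j] = 1 + len[i + 1][j + 1];
                } else {
                    len[i][j] = Math.max(len[i + 1][j], len[i][j + 1]);
                }
            }
        }
        // Walk the table from the start to rebuild one longest subsequence.
        StringBuilder out = new StringBuilder();
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            if (a.charAt(i) == b.charAt(j)) {
                out.append(a.charAt(i));
                i++;
                j++;
            } else if (len[i + 1][j] >= len[i][j + 1]) {
                i++;
            } else {
                j++;
            }
        }
        return out.toString();
    }

    /** Most frequent value; on a tie, the value seen first wins. Returns null for an empty list. */
    static String mostFrequent(List<String> values) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical subjects and country codes, for illustration only.
        System.out.println(lcs("Your receipt #123", "Your receipt #987"));
        System.out.println(mostFrequent(Arrays.asList("GB", "US", "GB")));
    }
}
```

An LCS keeps only the characters that the strings share in the same order, which would explain why cluster centers can read as partial strings with the varying parts dropped.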
  • 4. Benchmarks
    ● Small cluster: 3 nodes
      ● Single core
      ● 2 GB RAM
      ● Gigabit Ethernet network
      ● Data replication: 3
  • 4. Benchmarks
    ● n = 1M spams
    ● k = 30
    ● i = 10
    ● => 1131 sec
  • 4. Benchmarks: scalability [Figure: execution time (sec), axis from 0 to 3500, for 1 node, 2 nodes and 3 nodes]
  • 4. Benchmarks: Hadoop Overhead
    ● Sequential: 2424 sec
    ● 3 servers (theoretical): 808 sec
    ● 3 servers (real): 1131 sec
    ● Overhead: 323 sec (40%)
      ● No data (setup): 76 sec (9.5%)
      ● Trivial distance (setup + sort): 242 sec
      ● Sort: 166 sec (20.5%)
      ● Remaining: 81 sec (10%)
    ● [Figure: MPI Jumpshot]
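The overhead figures can be re-derived from the measured times. The worked arithmetic below assumes that every percentage is taken relative to the 808 sec theoretical time, which is how the reported values appear to be computed; the symbol names are introduced here for illustration only.

```latex
\begin{align*}
T_{\text{theoretical}} &= \frac{2424}{3} = 808\ \text{sec} \\
T_{\text{overhead}} &= 1131 - 808 = 323\ \text{sec}, & \frac{323}{808} &\approx 40\% \\
T_{\text{sort}} &= 242 - 76 = 166\ \text{sec}, & \frac{166}{808} &\approx 20.5\% \\
T_{\text{remaining}} &= 323 - 242 = 81\ \text{sec}, & \frac{81}{808} &\approx 10\%
\end{align*}
```

Under the same assumption the setup time gives 76 / 808 ≈ 9.4%, slightly below the 9.5% reported on the slide, which matches 40% - 20.5% - 10% instead.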
  • 4. Benchmarks: Weka and Mahout
    ● 10 million 2D points
    ● Weka (sequential): 5355 sec
    ● Hadoop: 1841 sec (2.9x faster)
    ● Mahout: more than 4 h?
  • 4. Benchmarks
    ● Bigger cluster:
      ● 27 nodes
      ● 2 x 4 cores
      ● 16 GB
    ● Deployment:
      ● Shared home dir (NFS)
      ● Custom setup script
      ● Executed on all nodes through SSH
  • 4. Benchmarks: Comparison (clustering 1M spams)
    ● Small cluster:
      ● 3 cores
      ● k = 30
      ● 1131 sec
    ● Bigger cluster:
      ● 216 cores (x 72)
      ● k = 4000 (x 133)
      ● 2484 sec
    ● Expected: 2089 sec
    ● Difference: 19%
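The expected time can be reproduced with a simple scaling argument. It assumes that run time grows linearly with k (consistent with the O(nki) cost, with n and i unchanged) and shrinks linearly with the number of cores; these assumptions are not stated on the slide but they reproduce its numbers.

```latex
\begin{align*}
T_{\text{expected}} &= 1131 \times \frac{133}{72} \approx 2089\ \text{sec} \\
\frac{T_{\text{real}} - T_{\text{expected}}}{T_{\text{expected}}} &= \frac{2484 - 2089}{2089} \approx 19\%
\end{align*}
```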
  • 4. Benchmarks: Profiling and optimization
    ● With String dates: 1131 sec
    ● With timestamps: 770 sec (-32%)
  • 5. Results
    ● "Your receipt #"
      ● From: ""
      ● To: "@domain4.com"
    ● "LinkedIn Messages, /0/2010"
      ● From: "adjustsc5837@rodneymoore.com"
      ● To: "@domain0140.com"
    ● ""
      ● From: "LiliKepp5219@telemar.net.br"
      ● To: "@domain4.c"
  • 5. Results Visualization
    ● "eil rder #"
    ● From: "hilton_ns@datares.com.my"
  • Conclusion
    ● Hadoop allows faster clustering
    ● But:
      ● Limitations:
        ● Lacks a graphical performance analysis tool (like MPI Jumpshot)
        ● The programmer needs to understand the inner workings!
      ● A lot of room for improvement:
        ● Memcached to store intermediate centers?
        ● MPI to intercept method calls between JVMs?
        ● Selection of initial centers (canopy?), stop criterion?
        ● Distance computation (WOWA)
        ● Clustering algorithm (online clustering)
        ● Influence of data locality and data size?
  • Questions?