  • https://docs.google.com/a/cloudera.com/spreadsheet/ccc?key=0AnZTJfxZEqfodGRJQ2xWY3hxU2pLMFo2a3dRcFhMZWc#gid=0
  • 20130521mlmeetup

    1. Cloudera ML
       Jeff Hammerbacher
    2. Presentation Outline
       • Cloudera
       • k-means
       • Intrusion detection
       • Cloudera ML workflow: clustering
       • Cloudera ML: future
       • References
    3. Cloudera
    4. Cloudera
       • Founded 2008
       • HQ in Palo Alto and San Francisco
       • Raised $141M
       • 352 employees
       • Several hundred customers
    5. Cloudera Products
       • Subscription
         • Proprietary software
         • Support
       • Training and Certification
       • Services
    6. Cloudera Software
       • Open source
         • Cloudera’s Distribution, including Apache Hadoop (CDH)
         • Cloudera Hue
         • Cloudera Impala
         • Cloudera ML
         • CDK
       • Proprietary
         • Cloudera Manager
         • Cloudera BDR
         • Cloudera Navigator
    7. Cloudera ML
       • Collection of Java libraries and command-line tools
       • Goal: make data scientists more productive with CDH
         • Exploratory data analysis
         • Data preparation
         • Model fitting
         • Model evaluation
       • Apache 2.0 licensed
       • Developed on GitHub: http://github.com/cloudera/ml
    8. Cloudera ML: Building Blocks
       • Apache Hadoop: scalable data storage (HDFS) and processing (MapReduce)
       • Apache Hive: metadata for structured data in HDFS
       • Apache Crunch: easy MapReduce pipelines
       • Apache Mahout: vector interface
       • Apache Avro: serialization format
    9. k-Means
    10. k-Partition Clustering
        • Given
          • A set of n points in Euclidean space
          • An integer k
        • Find
          • A partition of these points into k subsets
          • Each with a representative, also known as a center
        • k-means
          • Choose the center to be the point that minimizes the sum of squared distances to the other points in the subset
    11. k-Means: Lloyd’s Algorithm
        • Choose k initial centers
        • Repeat until a stopping criterion is met:
          • Assign each point to its nearest center
          • Recalculate the centers
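The assign/update loop above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the Cloudera ML implementation; the function name and the convergence tolerance are my own choices.

```python
# A minimal sketch of Lloyd's algorithm: alternate assignment and
# center-update steps until the centers stop moving.
import numpy as np

def lloyd(points, centers, max_iters=100, tol=1e-6):
    centers = centers.copy()
    for _ in range(max_iters):
        # Assignment step: each point picks its nearest center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if (labels == j).any() else centers[j]
            for j in range(len(centers))
        ])
        # Stopping criterion: total center movement below tolerance.
        if np.linalg.norm(new_centers - centers) < tol:
            return new_centers, labels
        centers = new_centers
    return centers, labels
```

The "scaling is easy" point on the next slide corresponds to the assignment step: each point finds its own center independently, so it parallelizes trivially.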
    12. k-Means: Properties
        • Good
          • Scaling is easy: let each point find its own center
        • Bad
          • Runtime can be exponential in the worst case
          • Solution can be locally optimal but globally suboptimal
    13. k-Means: Initial Centers Selection
        • Goals
          • Improve the global quality of the solution
          • Reduce the number of iterations required
    14. k-Means++: Algorithm
        1. Choose one center uniformly at random from the data points
        2. For each data point x, compute D(x), the distance to the nearest center
        3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)²
        4. Repeat steps 2 and 3 until k centers have been chosen
        5. Proceed with Lloyd’s iterations
        Source: http://en.wikipedia.org/wiki/K-means%2B%2B
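The D(x)²-weighted seeding can be sketched directly from those five steps. This is an illustrative sketch, not the Cloudera ML code; the function name and the fixed seed are my own.

```python
# A minimal sketch of k-means++ seeding: each new center is sampled
# with probability proportional to D(x)^2.
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_seed(points, k, rng=random.Random(0)):
    # Step 1: first center uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Step 2: D(x)^2 = squared distance to the nearest chosen center.
        d2 = [min(dist2(p, c) for c in centers) for p in points]
        # Step 3: sample the next center proportional to D(x)^2.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers
```

Because points far from every existing center get large weights, well-separated clusters each tend to receive a seed, which is what improves on purely uniform initialization.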
    15. k-Means++: Limitations
        • Selection of the (n+1)st center depends on the nth center
        • Hence, not parallelizable
    16. k-Means||
        • Instead of sampling a single point in each pass of the k-means++ algorithm, sample O(k) points in each round
        • Repeat the process for approximately O(log n) rounds
        • At the end of the iteration, O(k log n) points have been obtained
        • Weight each sampled point by the number of points in the original set closest to it
        • Re-cluster the O(k log n) weighted points to obtain k initial centers for Lloyd’s iterations
        • In practice, O(log n) rounds are not necessary; 5 rounds work well
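A single-machine sketch of the k-means|| rounds, under some loudly labeled simplifications: the real algorithm runs the per-round sampling in parallel (e.g. over MapReduce), and the final step should re-cluster the weighted candidates with weighted k-means++, which this sketch replaces with a naive highest-weight pick. Function name, oversampling factor, and seed are my own choices.

```python
# A minimal sketch of k-means|| oversampling: keep each point in a round
# independently with probability ~ ell * D(x)^2 / sum(D^2), for a few rounds.
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_parallel_seed(points, k, rounds=5, rng=random.Random(1)):
    ell = 2 * k                      # oversample O(k) points per round
    centers = [rng.choice(points)]
    for _ in range(rounds):
        d2 = [min(dist2(p, c) for c in centers) for p in points]
        total = sum(d2)
        if total == 0:
            break
        centers += [p for p, d in zip(points, d2)
                    if rng.random() < ell * d / total]
    # Weight each candidate by how many input points are closest to it.
    weights = [0] * len(centers)
    for p in points:
        weights[min(range(len(centers)), key=lambda j: dist2(p, centers[j]))] += 1
    # Stand-in for the weighted re-clustering step: keep the k heaviest.
    ranked = sorted(zip(weights, centers), key=lambda t: -t[0])
    return [c for _, c in ranked[:k]]
```

The key contrast with k-means++ is that each round's keep/drop decisions are independent per point, so a round is a single parallel pass over the data rather than k sequential passes.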
    17. Intrusion Detection
    18. Data
        • 1998 DARPA Intrusion Detection Evaluation Program
          • 9 weeks of tcpdump data (with simulated attacks)
          • 7 weeks of training data
          • 2 weeks of test data
        • KDD Cup 1999 Data
          • 4 GB of raw training data -> 5M connection records
          • 2M connection records in the test set
        • Using the 10% subset of the training data
    19. Examples
        • Connection
          • A sequence of TCP packets
          • Starting and ending at well-defined times
          • Single source IP, target IP, and protocol
        • Each connection is labeled as “normal” or “attack”
          • An attack can be only one of N attack types
        • Each connection record is ~100 bytes
    20. Labels
        • Attack categories
          • DOS: denial of service (e.g. SYN flood)
          • R2L: unauthorized access from a remote machine
          • U2R: unauthorized access to local root privileges
          • Probing
        • Attack types
          • 24 types in the training set
          • 14 additional types in the test set
    21. Features
        • Symbolic/discrete or continuous
        • Basic features
        • Content features
          • Use domain knowledge
          • Extract features from packet content
        • Derived features
          • Time-based traffic features (2s window): DOS
            • Same (destination) host
            • Same service
          • Host-based traffic features (100 connections): probing
    22. Sample
        • 494,021 rows (connections) in the 10 percent sample
        • 42 columns: 41 features (3 groups), 1 label
        • Google Docs has a 400k cell limit
        • So: let’s look at the first 1,000 connections
        • http://bit.ly/10KQ6TU
    23. Cloudera ML Workflow: Clustering
    24. Cloudera ML: summary
        • client/bin/ml summary
          • --input-paths kddcup.data_10_percent (HDFS)
          • --format text
          • --header-file examples/kdd99/header.csv (local FS)
          • --summary-file examples/kdd99/s.json (local FS)
    25. Cloudera ML: summary
        [Diagram: the summary step reads kddcup.data_10_percent from HDFS and header.csv from the local FS]
    26. Cloudera ML: summary
        [Diagram: as above, now with s.json written to the local FS]
    27. Cloudera ML: summary
        • s.json
          • Categorical features: histogram
          • Numerical features: distribution summary
    28. Cloudera ML: normalize
        • client/bin/ml normalize
          • --input-paths kddcup.data_10_percent (HDFS)
          • --format text
          • --summary-file examples/kdd99/s.json (local FS)
          • --transform Z
          • --output-path kdd99 (HDFS)
          • --output-type avro
          • --id-column category
          • --compress
    29. Cloudera ML: normalize
        [Diagram: the normalize step reads kddcup.data_10_percent from HDFS and s.json from the local FS]
    30. Cloudera ML: normalize
        [Diagram: as above, now with kdd99/ written to HDFS]
    31. Cloudera ML: normalize
        • kdd99/part-m-0000[0|1].avro
        • Examples (rows)
          • Part 0: 442,454 vectors
          • Part 1: 51,567 vectors
          • Total: 494,021 vectors
        • Features (columns)
          • Before: 41 fields
          • After: 143 fields
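The growth from 41 to 143 fields is consistent with a Z (z-score) transform on numeric columns plus indicator (one-hot) expansion of the symbolic columns. A minimal sketch of both transforms, assuming record dicts and a known set of categorical field names (not the Cloudera ML code):

```python
# A minimal sketch of z-score normalization plus one-hot expansion:
# numeric fields become (x - mean) / stddev, each categorical field
# becomes one indicator component per observed level.
from statistics import mean, pstdev

def normalize_records(records, categorical):
    fields = list(records[0].keys())
    stats = {f: (mean(r[f] for r in records), pstdev(r[f] for r in records))
             for f in fields if f not in categorical}
    levels = {f: sorted({r[f] for r in records}) for f in categorical}
    out = []
    for r in records:
        vec = []
        for f in fields:
            if f in categorical:
                # One field expands to len(levels[f]) indicator components.
                vec += [1.0 if r[f] == v else 0.0 for v in levels[f]]
            else:
                m, s = stats[f]
                vec.append((r[f] - m) / s if s else 0.0)
        out.append(vec)
    return out
```

Standardizing matters for k-means in particular: without it, a feature measured in bytes would dominate the squared-distance computation over one measured in counts.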
    32. Cloudera ML: ksketch
        • client/bin/ml ksketch
          • --input-paths kdd99 (HDFS)
          • --format avro
          • --points-per-iteration 500
          • --output-file wc.avro (local FS)
          • --seed 1729
          • --iterations 5
          • --cross-folds 2
    33. Cloudera ML: ksketch
        [Diagram: the ksketch step reads kdd99/ from HDFS]
    34. Cloudera ML: ksketch
        [Diagram: as above, now with wc.avro written to the local FS]
    35. Cloudera ML: ksketch
        • wc.avro
        • Examples (rows)
          • 2 “folds” of 2,501 examples
            • 1 initial example
            • 500 examples from each of 5 iterations
          • Each example has an associated weight
        • Features (columns)
          • 143 features (still)
    36. Cloudera ML: kmeans
        • client/bin/ml kmeans
          • --input-file wc.avro (local FS)
          • --centers-file centers.avro (local FS)
          • --seed 19
          • --clusters 1,10,25,35,45
          • --best-of 2
          • --num-threads 4
          • --eval-stats-file kmeans_stats.csv (local FS)
    37. Cloudera ML: kmeans
        [Diagram: the kmeans step reads wc.avro from the local FS]
    38. Cloudera ML: kmeans
        [Diagram: as above, now with centers.avro and kmeans_stats.csv written to the local FS]
    39. Cloudera ML: kmeans
        • centers.avro
          • 1 row for each run of k-means++
          • 9 total runs: 1 for k=1, 2 each for k=10, 25, 35, and 45
        • kmeans_stats.csv
          • Clustering quality scores
    40. Cloudera ML: kassign
        • client/bin/ml kassign
          • --input-paths kdd99 (HDFS)
          • --format avro
          • --centers-file centers.avro (local FS)
          • --center-ids 4
          • --output-path assigned (HDFS)
          • --output-type csv
    41. Cloudera ML: kassign
        [Diagram: the kassign step reads kdd99/ from HDFS and centers.avro from the local FS]
    42. Cloudera ML: kassign
        [Diagram: as above, now with assigned/ written to HDFS]
    43. Cloudera ML: kassign
        • assigned/part-m-0000[0|1]
        • Rows
          • Part 0: 442,454
          • Part 1: 51,567
          • Total: 494,021
        • Columns
          • Point ID (normal/attack type, in this case)
          • Index in centers.avro
          • Assigned cluster ID
          • Squared distance to the nearest cluster
    44. Cloudera ML: sample
        • client/bin/ml sample
          • --input-paths assigned (HDFS)
          • --format text
          • --header-file examples/kdd99/kassign_header.csv (local FS)
          • --weight-field squared_distance
          • --group-fields clustering_id,closest_center_id
          • --output-type csv
          • --size 20
          • --output-path extremal (HDFS)
    45. Cloudera ML: sample
        [Diagram: the sample step reads assigned/ from HDFS and kassign_header.csv from the local FS]
    46. Cloudera ML: sample
        [Diagram: as above, now with extremal/ written to HDFS]
    47. Cloudera ML: sample
        • extremal/part-r-00000
        • Rows
          • Up to 20 examples from each cluster
          • Examples that are furthest from the center of their cluster
        • Columns
          • Point ID (normal/attack type, in this case)
          • Index in centers.avro
          • Assigned cluster ID
          • Squared distance to the nearest cluster
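Drawing a fixed-size, weight-biased sample per group in one pass over the data is the weighted reservoir sampling problem the deck's references cover. A sketch of the A-Res scheme (Efraimidis and Spirakis) as one way such a step could work; this is illustrative, not the actual `ml sample` implementation, and the function name and seed are my own:

```python
# A minimal sketch of weighted reservoir sampling (A-Res scheme):
# keep the `size` items with the largest random keys u^(1/w), which
# yields a sample biased toward large weights in a single pass.
import heapq
import random

def weighted_reservoir(stream, size, rng=random.Random(42)):
    heap = []  # min-heap of (key, item); smallest key is evicted first
    for item, w in stream:
        if w <= 0:
            continue
        key = rng.random() ** (1.0 / w)
        if len(heap) < size:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]
```

With `squared_distance` as the weight, the far-from-center "extremal" points get keys close to 1 and are very likely to survive in the reservoir, which matches the slide's description of the output.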
    48. Cloudera ML: future
    49. Future
        • Clustering
          • Adapt k-means|| for Bregman divergences
        • Classification
          • Work underway in the “classifier” branch
          • Sofia-ML in Java (SVM)
          • Ensemble classifiers (Random Forests)
        • Recommender systems
          • Collaboration with Sean Owen of Taste/Myrrix
        • Deep learning
    50. References
    51. References
        • Reservoir sampling
          • “Random sampling with a reservoir” (1985)
          • “Weighted random sampling with a reservoir” (2006)
          • “Weighted Random Sampling over Data Streams” (2010)
        • k-means++
          • “The Effectiveness of Lloyd-Type Methods for the k-Means Problem” (2006)
          • “k-means++: the advantages of careful seeding” (2007)
        • k-means||
          • “Scalable K-Means++” (2012)
        • Cluster quality
          • “Cluster Validation by Prediction Strength” (2005)
        • Intrusion detection
          • “Cost-based modeling for fraud and intrusion detection: results from the JAM project” (2000)
          • “Intrusion detection with unlabeled data using clustering” (2001)
          • “Service-independent payload analysis to improve intrusion detection in network traffic” (2008)
          • “Outside the Closed World: On Using Machine Learning for Network Intrusion Detection” (2010)
    52. References
        • Single-pass k-means
          • “Fast and Accurate k-means For Large Datasets” (2011)
        • Clustering with Bregman divergences
          • “Clustering with Bregman Divergences” (2005)
          • “Fast nearest neighbor retrieval for Bregman divergences” (2008)
        • Classification
          • “Large Scale Learning to Rank” (2009)
          • “Detecting Adversarial Advertisements in the Wild” (2011)
          • “How-to: Resample from a Large Data Set in Parallel (with R on Hadoop)” (2013) (blog)
        • Deep learning
          • “An Analysis of Single-Layer Networks in Unsupervised Feature Learning” (2011)
          • “Learning Feature Representations with k-means” (2012)
    54. Unused Slides