Machine Learning and Hadoop

Presentation on Machine Learning techniques for Hadoop and a peek at the near future of ML on Hadoop.

Published in: Technology, Education

  1. 1. September 2011 – HUG – Atlanta, GA<br />Machine Learning With Hadoop<br />Josh Patterson | Sr. Solutions Architect<br />
  2. 2. Who is Josh Patterson?<br />josh@cloudera.com<br />Master’s Thesis: self-organizing mesh networks<br />Published in IAAI-09: TinyTermite: A Secure Routing Algorithm<br />Conceived, built, and led Hadoop integration for openPDC project at Tennessee Valley Authority (TVA)<br />Led team which designed classification techniques for time series and MapReduce<br />Open source work at<br />http://openpdc.codeplex.com<br />https://github.com/jpatanooga<br />Today<br />Sr. Solutions Architect at Cloudera<br />
  3. 3. Outline<br />Hadoop Today<br />Data Mining<br />Mahout and Friends<br />A Peek at the Road Ahead<br />
  4. 4. “After the refining process, one barrel of crude oil yielded more than 40% gasoline and only 3% kerosene, creating large quantities of waste gasoline for disposal.”<br />--- Excerpt from the book “The American Gas Station”<br />Hadoop Today: The Oil Industry Circa 1900<br />
  5. 5. DNA Sequencing Trends<br />Cost of DNA Sequencing Falling Very Fast<br />
  6. 6. Unstructured Data Explosion<br />Complex, Unstructured<br />Relational<br /><ul><li>2,500 exabytes of new information in 2012, with the Internet as the primary driver</li><li>Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 zettabytes this year</li></ul>
  7. 7. Obstacles to Leveraging Data<br />Copyright 2010 Cloudera Inc. All rights reserved<br /><ul><li>Data comes in many shapes and sizes: relational tuples, log files, semistructured textual data (e.g., e-mail)</li><li>Sometimes makes the data unwieldy</li><li>Customers are not creating schemas for all of their data</li><li>Yet still may want to join data sets</li><li>Customers are moving some of it to tape or cold storage, or throwing it away because “it doesn’t fit”</li><li>They are throwing data away because it’s too expensive to hold</li><li>Similar to the oil industry in 1900</li></ul>
  8. 8. A New Platform for an Evolving Landscape<br />Ability to look at true distribution of data<br />Previously impossible due to scale<br />Lower cost of analysis<br />Ad hoc analysis now more open and flexible<br />Speed @ Scale is the new Killer App<br />Results that previously took 1 day to process can gain new value when created in 10 minutes.<br />Greater Flexibility<br />Less restrictive than SQL-only systems<br />
  14. 14. Data Mining<br />“How is it possible for a slow, tiny brain, whether biological or electronic, to perceive, understand, predict, and manipulate a world far larger and more complicated than itself?”<br />--- Peter Norvig, “Artificial Intelligence: A Modern Approach”<br />
  15. 15. Basic Concepts<br />What is Data Mining?<br />“the process of extracting patterns from data”<br />Why are we interested in Data Mining?<br />Raw data essentially useless<br />Data is simply recorded facts<br />Information is the patterns underlying the data<br />We want to learn these patterns<br />Information is key<br />
  16. 16. How does Machine Learning differ from Data Mining?<br />Data Mining<br />Extracting information from data<br />Finds patterns in data<br />Machine Learning<br />Algorithms for acquiring structural descriptions from data “examples”<br />Process of learning “concepts”<br />“structural descriptions” represent patterns explicitly<br />
  17. 17. Shades of Gray<br />Information Retrieval<br />information science, information architecture, cognitive psychology, linguistics, and statistics.<br />Natural Language Processing<br />grounded in machine learning, especially statistical machine learning<br />Statistics<br />Math and stuff<br />Machine Learning<br />Considered a branch of artificial intelligence<br />
  18. 18. Types of Machine Learning<br />Classification<br />Association<br />Clustering<br />Numeric Prediction<br />AKA: “Regression”<br />
  19. 19. Tools, Applications, and Mahout<br />
  20. 20. ML Focused on in Mahout<br />Classification<br />Naïve Bayes in Text Classification<br />Stochastic Gradient Descent (Logistic Regression)<br />Random Forests<br />Recommendation<br />Collaborative Filtering, Taste Engine<br />Item to item<br />Clustering<br />K-means, Fuzzy K-means<br />(Latent) Dirichlet Process<br />
  21. 21. Naïve Bayes and Text<br />Doc classification is an important domain in Machine Learning<br />Docs are characterized by the words that appear in them<br />One approach is to treat presence / absence of each word as a boolean attribute<br />Naïve Bayes is popular here because it is fast and often reasonably accurate<br />
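The boolean-attribute approach on this slide corresponds to what is usually called Bernoulli naïve Bayes. The following is a minimal sketch in plain Python, not Mahout's implementation; the function names, the toy documents, and the choice of add-one (Laplace) smoothing are assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(docs, labels):
    """Fit a Bernoulli naive Bayes model on tokenized documents."""
    vocab = sorted({word for doc in docs for word in doc})
    docs_by_class = defaultdict(list)
    for doc, label in zip(docs, labels):
        docs_by_class[label].append(set(doc))

    model = {"vocab": vocab, "prior": {}, "p_word": {}}
    for label, class_docs in docs_by_class.items():
        model["prior"][label] = len(class_docs) / len(docs)
        # P(word present | class) with add-one smoothing (an assumption here).
        model["p_word"][label] = {
            word: (sum(word in d for d in class_docs) + 1) / (len(class_docs) + 2)
            for word in vocab
        }
    return model


def classify(model, doc):
    """Return the class with the highest log-posterior for a tokenized doc."""
    present = set(doc)
    best_label, best_score = None, -math.inf
    for label, prior in model["prior"].items():
        score = math.log(prior)
        for word in model["vocab"]:
            p = model["p_word"][label][word]
            # Both presence and absence of a word contribute evidence.
            score += math.log(p if word in present else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus.
docs = [
    ["cheap", "pills", "buy"],
    ["buy", "cheap", "now"],
    ["meeting", "agenda", "notes"],
    ["project", "meeting", "tomorrow"],
]
labels = ["spam", "spam", "ham", "ham"]
model = train_bernoulli_nb(docs, labels)
print(classify(model, ["cheap", "buy", "today"]))
```

Words never seen in training (such as "today" above) are simply ignored, which is one common design choice among several.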
  22. 22. What Are Recommenders?<br />An algorithm that looks at a user’s past actions and suggests<br />Products<br />Services<br />People<br />
  23. 23. Collaborative Filtering<br />Collaborative filtering produces recommendations based on user preferences for items<br />Comes in “user based” and “item based” variants<br />Does not require knowledge of the specific properties of the items<br />In contrast, content-based recommendation produces recommendations based on intimate knowledge of the properties of items<br />
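To make the "no knowledge of item properties" point concrete, here is a minimal user-based collaborative filtering sketch. It is not the Mahout Taste API; the ratings, the function names, and the use of cosine similarity over co-rated items are assumptions chosen for illustration.

```python
import math

# Hypothetical user -> {item: rating} preferences; items are opaque IDs.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 4, "B": 3, "C": 5, "D": 5},
    "carol": {"A": 1, "B": 5, "D": 2},
}


def cosine(u, v):
    """Cosine similarity computed over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    numerator = denominator = 0.0
    for other, prefs in ratings.items():
        if other == user or item not in prefs:
            continue
        sim = cosine(ratings[user], prefs)
        numerator += sim * prefs[item]
        denominator += abs(sim)
    return numerator / denominator if denominator else None


print(predict(ratings, "alice", "D"))
```

The prediction for item "D" leans toward bob's rating because bob's past preferences resemble alice's more closely than carol's do. Nothing in the code inspects what the items actually are.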
  24. 24. Clustering: Topic Modeling<br />Cluster words across docs to identify topics<br />Latent Dirichlet Allocation<br />
  25. 25. What is time series data?<br />Time series data is defined as a sequence of data points measured typically at successive times spaced at uniform time intervals <br />Examples in finance<br />daily adjusted close price of a stock at the NYSE <br />Example in Sensors / Signal Processing / Smart Grid<br />sensor readings on a power grid occurring 30 times a second.<br />For more reference on time series data<br />http://www.cloudera.com/blog/2011/03/simple-moving-average-secondary-sort-and-mapreduce-part-1/<br />
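The blog post linked on this slide deals with simple moving averages over time series. As a single-machine illustration only (the MapReduce and secondary-sort mechanics are not shown), a simple moving average over uniformly spaced points can be sketched as follows; the function name and the sample prices are hypothetical.

```python
from collections import deque


def simple_moving_average(values, window):
    """Return the mean of each full sliding window over uniformly spaced points."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    buffer = deque()
    running_total = 0.0
    for value in values:
        buffer.append(value)
        running_total += value
        if len(buffer) > window:
            # Drop the oldest point so the buffer holds exactly one window.
            running_total -= buffer.popleft()
        if len(buffer) == window:
            averages.append(running_total / window)
    return averages


# Hypothetical daily adjusted close prices.
closes = [10.0, 11.0, 12.0, 13.0, 14.0]
sma = simple_moving_average(closes, 3)
print(sma)  # [11.0, 12.0, 13.0]
```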
  26. 26. NERC Sensor Data Collection<br />openPDC PMU Data Collection circa 2009<br /><ul><li>120 Sensors</li><li>30 samples/second</li><li>4.3B Samples/day</li><li>Housed in Hadoop</li></ul>
  27. 27. Story Time: Keogh, SAX, and the openPDC<br />NERC wanted high-res smart grid data tracked<br />Started openPDC project @ TVA<br />http://openpdc.codeplex.com/<br />We used Hadoop to store and process time series data<br />https://openpdc.svn.codeplex.com/svn/Hadoop/Current%20Version/<br />Needed to find “unbounded oscillations”<br />Time series unwieldy to work with at scale<br />We found “SAX” by Keogh and his folks for dealing with time series<br />Copyright 2011 Cloudera Inc. All rights reserved<br />
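SAX (Symbolic Aggregate approXimation) turns a numeric time series into a short string, which is what makes it easier to handle at scale. The sketch below follows the commonly described steps (z-normalize, average over equal-width segments, map each average to a letter using breakpoints that split a standard normal distribution into equally likely regions). It is not the openPDC or Lumberyard code; the function name, the default four-letter alphabet, and the requirement that the series length be divisible by the segment count are assumptions.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word with one letter per segment."""
    if len(series) % segments != 0:
        raise ValueError("this sketch assumes len(series) is divisible by segments")

    # Step 1: z-normalize so the breakpoints of N(0, 1) apply.
    mu, sigma = mean(series), pstdev(series)
    normalized = [0.0 if sigma == 0 else (x - mu) / sigma for x in series]

    # Step 2: piecewise aggregate approximation (mean of each segment).
    size = len(normalized) // segments
    paa = [mean(normalized[i * size:(i + 1) * size]) for i in range(segments)]

    # Step 3: breakpoints that cut N(0, 1) into equiprobable regions.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]

    # Step 4: map each segment mean to the letter of its region.
    return "".join(alphabet[bisect_right(breakpoints, value)] for value in paa)


print(sax([1, 2, 3, 4, 5, 6, 7, 8], 4))   # abcd
print(sax([1, 1, 2, 2, 8, 8, 9, 9], 4))   # aadd
```

Two series with similar shapes produce similar words, so string comparison and indexing techniques become applicable to sensor data.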
  30. 30. What is Lumberyard?<br />Lumberyard is time series iSAX indexing stored in HBase for persistent and scalable index storage<br />It’s interesting for<br />Indexing large amounts of time series data<br />Low latency fuzzy pattern matching queries on time series data<br />Lumberyard is open source under the Apache License 2.0 on GitHub:<br />https://github.com/jpatanooga/Lumberyard/<br />
  31. 31. Genome Data as Time Series<br />A, C, G, and T<br />Could be thought of as “1, 2, 3, and 4”!<br />If we have sequence X, what is the “closest” subsequence in a genome?<br />Doesn’t have to be an exact match!<br />Example:<br />ATATAT<br />TATATA<br />Useful in proteomics as well<br />iSAX Indexing<br />Lumberyard use case<br />
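The "closest subsequence" question can be illustrated with a brute-force scan. This is deliberately not iSAX indexing, which exists to avoid exactly this kind of linear scan; it is only a sketch of what is being asked. The function names, the sample genome string, and the use of Hamming distance as the closeness measure are assumptions.

```python
# Numeric encoding mentioned on the slide: bases treated as 1, 2, 3, 4.
ENCODING = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode(sequence):
    """Map a DNA string to a list of integers."""
    return [ENCODING[base] for base in sequence]


def closest_subsequence(genome, query):
    """Return (start, distance) of the window with the fewest mismatches."""
    if not query or len(query) > len(genome):
        raise ValueError("query must be non-empty and no longer than the genome")
    g, q = encode(genome), encode(query)
    best_start, best_distance = -1, len(q) + 1
    for start in range(len(g) - len(q) + 1):
        window = g[start:start + len(q)]
        distance = sum(a != b for a, b in zip(window, q))  # Hamming distance
        if distance < best_distance:
            best_start, best_distance = start, distance
    return best_start, best_distance


print(closest_subsequence("GGCATATTTCG", "ATATAT"))  # (3, 1): inexact match
print(closest_subsequence("TATATAT", "ATATAT"))      # (1, 0): exact after a shift
```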
  32. 32. Bioinformatics<br />Applications in DNA Sequencing<br />Shortest Superstring Problem (SSP)<br />Take lots of reads from sequencing<br />We want the “superstring” of all the reads<br />We want a long string that “explains” all the reads we generated<br />We want the shortest string possible<br />NP-complete<br />We can reduce SSP to the Traveling Salesman Problem<br />Graph processing / algorithms now applicable<br />
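Because the exact problem is intractable at scale, a common way to get a feel for SSP is a greedy heuristic that repeatedly merges the two reads with the largest overlap. The sketch below is that heuristic, not the TSP reduction mentioned on the slide, and it is not guaranteed to find the shortest superstring. Function names and the sample reads are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads):
    """Merge the best-overlapping pair of reads until one string remains."""
    # Remove duplicates and reads fully contained in another read.
    reads = list(dict.fromkeys(reads))
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]

    while len(reads) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left read, index of right read)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads[0] if reads else ""


reads = ["ATTAG", "TAGGC", "GGCAT"]
superstring = greedy_superstring(reads)
print(superstring)  # ATTAGGCAT
```

Each merge step corresponds to following a high-weight edge in an overlap graph, which is where the connection to graph algorithms comes from.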
  33. 33. Packages For Hadoop<br />DataFu<br />http://sna-projects.com/datafu/<br />UDFs in Pig<br />Used at LinkedIn in many off-line workflows for data-derived products<br />“People You May Know”<br />“Skills”<br />Techniques<br />PageRank<br />Quantiles (median), variance, etc.<br />Sessionization<br />Convenience bag functions<br />Convenience utility functions<br />
  34. 34. Integration with Libs<br />Mix MapReduce with Machine Learning Libs<br />WEKA<br />KXEN<br />CPLEX<br />Map side “groups data”<br />Reduce side processes groups of data with Lib in parallel<br />Involves tricks in getting K/V pairs into lib<br />Pipes, tmp files, task cache dir, etc<br />
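The "map groups, reduce calls the library" pattern can be simulated on one machine. The sketch below does not use the Hadoop API: a thread pool stands in for parallel reducers, and `statistics.mean` stands in for an external library such as WEKA. All function names and the sample records are hypothetical.

```python
import statistics
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(records):
    """Emit (key, value) pairs; the key decides which group a record joins."""
    for sensor_id, reading in records:
        yield sensor_id, reading


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(item):
    """Hand one whole group to a library call (statistics.mean is a stand-in)."""
    key, values = item
    return key, statistics.mean(values)


def run_job(records, workers=2):
    """Run map, shuffle, and then the per-group reduce calls in parallel."""
    groups = shuffle(map_phase(records))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(reduce_group, sorted(groups.items())))


records = [("s1", 1.0), ("s2", 10.0), ("s1", 3.0), ("s2", 20.0)]
result = run_job(records)
print(result)  # {'s1': 2.0, 's2': 15.0}
```

In a real job the awkward part is the one this sketch hides: getting each group's key/value pairs into the library's expected input format.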
  35. 35. What Hadoop Is Not Good At in Data Mining<br />Anything highly iterative<br />Anything that is extremely CPU bound and not disk bound<br />Algorithms that can’t be inherently parallelized<br />Examples<br />Stochastic Gradient Descent (SGD)<br />Support Vector Machines (SVM)<br />Doesn’t mean they aren’t great to use<br />
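Why SGD counts as "highly iterative" is easier to see in code: every update reads the weights written by the previous update, so the inner loop is a long chain of small dependent steps rather than one large independent pass. The following is a minimal single-machine sketch of SGD for logistic regression, not Mahout's implementation; the function names, the learning rate, the epoch count, and the toy data are assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_logistic(data, epochs=500, learning_rate=0.1, seed=0):
    """Fit logistic regression weights with one update per training example."""
    rng = random.Random(seed)
    examples = list(data)
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        rng.shuffle(examples)
        for features, label in examples:
            # This prediction depends on the weights from the previous update.
            z = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(z) - label
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


def predict(weights, bias, features):
    """Return the predicted class label (0 or 1)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0


# Hypothetical one-feature data set: small values are class 0, large are class 1.
data = [([0.0], 0), ([1.0], 0), ([3.0], 1), ([4.0], 1)]
weights, bias = sgd_logistic(data)
print([predict(weights, bias, x) for x, _ in data])
```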
  36. 36. MRv2: A Peek at the Road Ahead<br />
  37. 37. MRv2<br />Not everything fits great in MapReduce<br />Mahout as evidence of this<br />Examples<br />Stochastic Gradient Descent (SGD)<br />Support Vector Machines (SVM)<br />As we build further into verticals our analysis needs will become more complicated<br />MRv2 gives us new options<br />CDH4 will be based on 0.23.x (or later)<br />0.23.0 doesn't include MRv1<br />(via Tom White) CDH4 will *only* include MRv2<br />
  38. 38. Existing Parallel Frameworks<br />MapReduce<br />Java, Pig, Hive<br />Spark<br />Scala, hides complexity like Hive/Pig<br />Runs on Hadoop, MRv2 already<br />Giraph<br />Bulk-synchronous parallel model<br />relative to graphs where vertices can send messages to other vertices during a given superstep<br />MPI<br />Older parallel lib<br />Includes primitives for data exchange, synchronization<br />Standardized and portable<br />GraphLab<br />“graph parallel” vs MR’s “data parallel”<br />Better at iterative style<br />
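The bulk-synchronous parallel model described for Giraph can be sketched on one machine: in each superstep every vertex with incoming messages processes them, optionally sends new messages, and a barrier separates one superstep from the next. The example below propagates the maximum value through a graph. It is not the Giraph API; the function name, the adjacency-list representation, and the sample graph are assumptions.

```python
def bsp_max_value(graph, values):
    """Propagate the maximum vertex value using BSP-style supersteps.

    graph maps each vertex to its list of neighbors; values maps each vertex
    to its starting number. Every neighbor must also appear as a key.
    """
    values = dict(values)

    # Superstep 0: every vertex sends its value to all of its neighbors.
    inbox = {vertex: [] for vertex in graph}
    for vertex, neighbors in graph.items():
        for neighbor in neighbors:
            inbox[neighbor].append(values[vertex])
    supersteps = 1

    # Later supersteps run until no messages are in flight.
    while any(inbox.values()):
        outbox = {vertex: [] for vertex in graph}
        for vertex in graph:
            if not inbox[vertex]:
                continue  # no messages: the vertex stays halted
            largest = max(inbox[vertex])
            if largest > values[vertex]:
                values[vertex] = largest
                for neighbor in graph[vertex]:
                    outbox[neighbor].append(largest)
        inbox = outbox  # barrier: messages are delivered in the next superstep
        supersteps += 1
    return values, supersteps


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
final_values, steps = bsp_max_value(graph, {"a": 3, "b": 6, "c": 2})
print(final_values)  # {'a': 6, 'b': 6, 'c': 6}
```

State lives with the vertices across supersteps, which is why this style suits iterative graph algorithms better than chaining many MapReduce jobs.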
  39. 39. Frameworks Currently in Dev – MRv2<br />Giraph<br />https://issues.apache.org/jira/browse/GIRAPH-13<br />Hama BSP plans to integrate with MRv2<br />https://issues.apache.org/jira/browse/HAMA-431<br />MPI<br />https://issues.apache.org/jira/browse/MAPREDUCE-2911<br />Spark<br />https://github.com/mesos/spark-yarn<br />GraphLab<br />Discussion in user-mahout<br />
  40. 40. The Rise of the Meta Heuristic?<br />We’re seeing a data deluge drive demand for new data products<br />MapReduce applications are still relatively new<br />Customers have gotten a taste of data products with Hadoop<br />They like it<br />They want more<br />MRv2 has the potential to open up a range of meta heuristics to the Hadoop sector<br />Techniques like genetic algorithms that were previously considered “boutique”<br />
  41. 41. The Shape of Things to Come<br />Pig, Hive, Scala, Java<br />Compiler to build workflows of { Data, Algorithm, Framework }<br />Algorithm Library: Mahout, SGD, SVM, NeuralNetworks<br />Framework Library: MPI, Spark, GraphLab, MapReduce<br />MRv2<br />HDFS for large streaming files<br />HBase for small low-latency transactions<br />
  42. 42. Questions? (Thanks!)<br />Hadoop World 2011<br />You should go<br />Talks are high quality<br />Lots more Machine Learning talks<br />Developer class 10/10/2011<br />http://www.eventbrite.com/event/1951335497<br />10% discount with code atlhug<br />
