Machine Learning and Hadoop


Presentation on Machine Learning techniques for Hadoop and a peek at the near future of ML on Hadoop.

Published in: Technology, Education
  • Theme: they threw away a lot of valuable gas and oil, just as we throw away data today.
  • But what if some constraints changed?
  • Talk about the changing market dynamics of storage cost. What if some of the previously held constraints changed? Enter Hadoop.
  • Examples of key information: selecting embryos based on 60 features. You may be asking, “Why aren’t we talking about Mahout?” What we want to do here is look at the fundamentals that will underlie all of the systems, not just Mahout. Some of the wording may be different, but it’s the same.
  • ML: Can be used to predict the outcome in a new situation. Can be used to understand and explain how a prediction is derived (may be even more important). Methods originate from artificial intelligence, statistics, and research on databases. DM: about the process. ML: about the algorithms. “Can machines really learn?” is a long discussion, but from some perspectives yes. Good philosophical talk over beers.
  • Mention how different books lay out information in different formats, or may not group techniques in exactly the same way. Lots of bleed-over, from NLP to IR to ML.
  • SGD – online learning, non-batch, not parallelizable, good performance
  • “What do other people w/ similar tastes like?” “Strength of associations.”
  • Let’s set the stage in the context of a story: why we were looking at big data for time series.
  • OK, so how did we get to this point? Older SCADA systems take 1 data point per 2-4 seconds; PMUs sample 30 times a second. 120 PMUs, growing by a 10x factor.
  • On Monday Steve from Google talked about working with genomic data; genomic data is time series. Our take-home demo actually works with a small bit of genomic data. Lots of chatter at OSCON about genomics; I just sat in on one today.
  • Check this against the Mahout impl
  • Dryad, CielmHyperFlow, ASTERIX, Hyracks, HaLoop
  • Machine Learning and Hadoop

    1. 1. September 2011 – HUG – Atlanta, GA<br />Machine Learning With Hadoop<br />Josh Patterson | Sr Solution Architect<br />
    2. 2. Who is Josh Patterson?<br /><br />Master’s Thesis: self-organizing mesh networks <br />Published in IAAI-09: TinyTermite: A Secure Routing Algorithm<br />Conceived, built, and led Hadoop integration for openPDC project at Tennessee Valley Authority (TVA)<br />Led team which designed classification techniques for time series and Map Reduce<br />Open source work at <br /><br /><br />Today<br />Sr. Solutions Architect at Cloudera<br />
    3. 3. Outline<br />Hadoop Today<br />Data Mining<br />Mahout and Friends<br />A Peek at the Road Ahead<br />
    4. 4. Hadoop Today: The Oil Industry Circa 1900<br />“After the refining process, one barrel of crude oil yielded more than 40% gasoline and only 3% kerosene, creating large quantities of waste gasoline for disposal.”<br />--- Excerpt from the book “The American Gas Station”<br />
    5. 5. DNA Sequencing Trends<br />Cost of DNA Sequencing Falling Very Fast<br />
    6. 6. Unstructured Data Explosion<br />Complex, Unstructured<br />Relational<br />2,500 exabytes of new information in 2012 with Internet as primary driver<br />Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year<br />
    7. 7. Obstacles to Leveraging Data<br />Copyright 2010 Cloudera Inc. All rights reserved<br />Data comes in many shapes and sizes: relational tuples, log files, semistructured textual data (e.g., e-mail)<br />Sometimes makes the data unwieldy<br />Customers are not creating schemas for all of their data<br />Yet still may want to join data sets<br />Customers are moving some of it to tape or cold storage, throwing it away because “it doesn’t fit”<br />They are throwing data away because it’s too expensive to hold<br />Similar to the oil industry in 1900<br />
    13. 13. A New Platform for an Evolving Landscape<br />Ability to look at true distribution of data<br />Previously impossible due to scale<br />Lower cost of analysis<br />Ad hoc analysis now more open and flexible<br />Speed @ Scale is the new Killer App<br />Results that previously took 1 day to process can gain new value when created in 10 minutes.<br />Greater Flexibility<br />Less restrictive than SQL-only systems<br />
    14. 14. Data Mining<br />“How is it possible for a slow, tiny brain, whether biological or electronic, to perceive, understand, predict, and manipulate a world far larger and more complicated than itself?”<br />--- Peter Norvig, “Artificial Intelligence: A Modern Approach”<br />
    15. 15. Basic Concepts<br />What is Data Mining?<br />“the process of extracting patterns from data”<br />Why are we interested in Data Mining?<br />Raw data essentially useless<br />Data is simply recorded facts<br />Information is the patterns underlying the data<br />We want to learn these patterns<br />Information is key<br />
    16. 16. How does Machine Learning differ from Data Mining?<br />Data Mining<br />Extracting information from data<br />Finds patterns in data<br />Machine Learning<br />Algorithms for acquiring structural descriptions from data “examples”<br />Process of learning “concepts”<br />“structural descriptions” represent patterns explicitly<br />
    17. 17. Shades of Gray<br />Information Retrieval<br />information science, information architecture, cognitive psychology, linguistics, and statistics.<br />Natural Language Processing<br />grounded in machine learning, especially statistical machine learning<br />Statistics<br />Math and stuff<br />Machine Learning<br />Considered a branch of artificial intelligence<br />
    18. 18. Types of Machine Learning<br />Classification<br />Association<br />Clustering<br />Numeric Prediction<br />AKA: “Regression”<br />
    19. 19. Tools, Applications, and Mahout<br />
    20. 20. ML Focused on in Mahout<br />Classification<br />Naïve Bayes in Text Classification<br />Stochastic Gradient Descent (Logistic Regression)<br />Random Forests<br />Recommendation<br />Collaborative Filtering, Taste Engine<br />Item to item<br />Clustering<br />K-means, Fuzzy K-means<br />(Latent) Dirichlet Process<br />
    21. 21. Naïve Bayes and Text<br />Doc classification is an important domain in Machine Learning<br />Docs are characterized by the words that appear in them<br />One approach is to treat presence / absence of each word as a boolean attribute<br />Naïve Bayes is popular here, fast, accurate<br />
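The "presence / absence of each word as a boolean attribute" idea on the slide above can be sketched in a few lines of Python. This is a minimal single-machine illustration of a Bernoulli Naïve Bayes classifier with Laplace smoothing, not Mahout's implementation; the function names `train_bernoulli_nb` and `classify` and the toy documents are assumptions made up for this example.

```python
import math
from collections import defaultdict

def train_bernoulli_nb(docs, labels):
    """docs: list of word sets; labels: parallel list of class names."""
    vocab = set().union(*docs)
    by_class = defaultdict(list)
    for words, label in zip(docs, labels):
        by_class[label].append(words)
    model = {"vocab": vocab, "prior": {}, "p_word": {}}
    for label, members in by_class.items():
        model["prior"][label] = len(members) / len(docs)
        # Laplace smoothing: (docs containing word + 1) / (docs in class + 2)
        model["p_word"][label] = {
            w: (sum(w in d for d in members) + 1) / (len(members) + 2)
            for w in vocab
        }
    return model

def classify(model, words):
    """Return the class with the highest log-posterior for a set of words."""
    best_label, best_score = None, -math.inf
    for label, prior in model["prior"].items():
        score = math.log(prior)
        for w in model["vocab"]:
            p = model["p_word"][label][w]
            # Each vocabulary word is a boolean attribute: present or absent.
            score += math.log(p if w in words else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy corpus, invented for illustration.
docs = [{"cheap", "pills", "buy"}, {"buy", "now", "cheap"},
        {"meeting", "agenda", "notes"}, {"project", "meeting", "now"}]
labels = ["spam", "spam", "ham", "ham"]
model = train_bernoulli_nb(docs, labels)
print(classify(model, {"cheap", "buy"}))
print(classify(model, {"meeting", "notes"}))
```

Training is just counting, which is why this family of classifiers maps well onto MapReduce: the counts per class and word can be accumulated independently and summed.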
    22. 22. What Are Recommenders?<br />An algorithm that looks at a user’s past actions and suggests<br />Products<br />Services<br />People<br />
    23. 23. Collaborative Filtering<br />Collaborative filtering produces recommendations based on <br />user preferences for items, <br />“User Based”<br />does not require knowledge of the specific properties of the items. <br />In contrast, <br />content-based recommendation produces recommendations based on intimate knowledge of the properties of items.<br />“Item based”<br />
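As a rough illustration of the user-based case described above, the sketch below recommends items using only the ratings matrix, with no knowledge of item properties. It is not the Mahout Taste API; the names `cosine` and `recommend`, the choice of cosine similarity, and the sample preferences are assumptions for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two users' rating dicts (item -> rating)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(prefs, user, top_n=3):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores, weights = {}, {}
    for other, ratings in prefs.items():
        if other == user:
            continue
        sim = cosine(prefs[user], ratings)
        if sim <= 0:
            continue  # ignore users with no overlap in taste
        for item, rating in ratings.items():
            if item not in prefs[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
    return [item for _, item in ranked[:top_n]]

# Hypothetical ratings, invented for illustration.
prefs = {
    "alice": {"hadoop_book": 5, "pig_book": 4},
    "bob": {"hadoop_book": 5, "pig_book": 5, "hive_book": 4},
    "carol": {"cooking_book": 5, "hive_book": 1, "garden_book": 4},
}
print(recommend(prefs, "alice"))
```

Here "alice" only receives suggestions from "bob", the one user whose ratings overlap with hers; "carol" shares no rated items and is skipped.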
    24. 24. Clustering: Topic Modeling<br />Cluster words across docs to identify topics<br />Latent Dirichlet Allocation<br />
    25. 25. What is time series data?<br />Time series data is defined as a sequence of data points measured typically at successive times spaced at uniform time intervals <br />Examples in finance<br />daily adjusted close price of a stock at the NYSE <br />Example in Sensors / Signal Processing / Smart Grid<br />sensor readings on a power grid occurring 30 times a second.<br />For more reference on time series data<br /><br />
    26. 26. NERC Sensor Data Collection<br />openPDC PMU Data Collection circa 2009<br />120 Sensors<br />30 samples/second<br />4.3B Samples/day<br />Housed in Hadoop<br />
    29. 29. Story Time: Keogh, SAX, and the openPDC<br />NERC wanted high res smart grid data tracked<br />Started openPDC project @ TVA<br /><br />We used Hadoop to store and process time series data<br /><br />Needed to find “unbounded oscillations”<br />Time series unwieldy to work with at scale<br />We found “SAX” by Keogh and his folks for dealing with time series<br />Copyright 2011 Cloudera Inc. All rights reserved<br />
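For readers who have not met SAX (Symbolic Aggregate approXimation), the sketch below shows the basic idea: z-normalize the series, average it over equal-width segments (piecewise aggregate approximation), then map each segment mean to a letter using breakpoints that split the standard normal distribution into equal-probability regions. This is a simplified illustration, not the code used in the openPDC work; the function name `sax_word`, the four-letter alphabet, and the requirement that the series length divide evenly are assumptions. It needs Python 3.8+ for `statistics.NormalDist`.

```python
from statistics import NormalDist, mean, pstdev

def sax_word(series, n_segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word of n_segments letters."""
    if len(series) % n_segments:
        raise ValueError("series length must be divisible by n_segments")
    mu, sigma = mean(series), pstdev(series)
    # Step 1: z-normalize so series with different scales are comparable.
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]
    # Step 2: piecewise aggregate approximation (mean of each segment).
    size = len(z) // n_segments
    paa = [mean(z[i * size:(i + 1) * size]) for i in range(n_segments)]
    # Step 3: breakpoints cut the standard normal into equal-probability bins.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(j / k) for j in range(1, k)]
    # Each segment mean becomes the letter of the bin it falls into.
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)

print(sax_word([1, 1, 2, 2, 8, 8, 9, 9], 4))
print(sax_word(list(range(8)), 4))
```

The short words are what make time series tractable at scale: similar shapes map to similar strings, which can then be indexed and compared cheaply.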
    30. 30. What is Lumberyard?<br />Lumberyard is time series iSAX indexing stored in HBase for persistent and scalable index storage<br />It’s interesting for<br />Indexing large amounts of time series data<br />Low latency fuzzy pattern matching queries on time series data<br />Lumberyard is open source and ASF 2.0 licensed on GitHub:<br /><br />
    31. 31. Genome Data as Time Series<br />A, C, G, and T<br />Could be thought of as “1, 2, 3, and 4”!<br />If we have sequence X, what is the “closest” subsequence in a genome that is most like it?<br />Doesn’t have to be an exact match!<br />Example:<br />ATATAT<br />TATATA<br />Useful in proteomics as well<br />iSAX Indexing<br />Lumberyard use case<br />
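The "closest subsequence" question above can be made concrete with a brute-force scan. The sketch below encodes bases as 1 to 4, as the slide suggests, and slides the query along the genome looking for the window with the smallest squared Euclidean distance. It is an illustration only: an index such as iSAX exists precisely to avoid this linear scan, and the numeric encoding, the distance measure, and the name `closest_subsequence` are assumptions for this example.

```python
# Encoding from the slide: A, C, G, T thought of as 1, 2, 3, 4.
ENCODING = {"A": 1, "C": 2, "G": 3, "T": 4}

def encode(seq):
    """Turn a base string into a numeric series."""
    return [ENCODING[base] for base in seq]

def closest_subsequence(genome, query):
    """Return (position, subsequence, distance) of the best inexact match."""
    g, q = encode(genome), encode(query)
    best_pos, best_dist = -1, float("inf")
    for start in range(len(g) - len(q) + 1):
        # Squared Euclidean distance between the query and this window.
        dist = sum((g[start + i] - q[i]) ** 2 for i in range(len(q)))
        if dist < best_dist:
            best_pos, best_dist = start, dist
    return best_pos, genome[best_pos:best_pos + len(query)], best_dist

print(closest_subsequence("GGATATCTCC", "ATATAT"))
```

Note that the 1 to 4 encoding makes A and T look "farther apart" than A and C, which has no biological meaning; it is used here only because it matches the slide.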
    32. 32. Bioinformatics<br />Applications in DNA Sequencing<br />Shortest Superstring Problem (SSP)<br />Take lots of reads from sequencing<br />We want the “superstring” of all the reads<br />We want a long string that “explains” all the reads we generated<br />We want the shortest string possible<br />NP-complete<br />We can reduce SSP to the Traveling Salesman Problem<br />Graph processing / algorithms now applicable<br />
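Because the exact problem is intractable at scale, a common first approximation is a greedy merge: repeatedly join the two reads with the largest suffix/prefix overlap. The sketch below shows that heuristic. It is not the TSP reduction mentioned on the slide and is not guaranteed to find the shortest superstring; the names `overlap` and `greedy_superstring` and the sample reads are assumptions for this example.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(reads):
    """Greedy heuristic: keep merging the pair with the largest overlap."""
    # Drop duplicates and reads fully contained in another read.
    reads = list(dict.fromkeys(reads))
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left read, index of right read)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads[0] if reads else ""

# Hypothetical reads, invented for illustration.
reads = ["ATTAGA", "TAGACC", "GACCAT"]
print(greedy_superstring(reads))
```

The pairwise overlap computation is the expensive part, and it is also the part that parallelizes: each pair can be scored independently, which is where graph processing frameworks come in.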
    33. 33. Packages For Hadoop<br />DataFu<br /><br />UDFs in Pig<br />Used at LinkedIn in many off-line workflows for data-derived products<br />“People You May Know”<br />“Skills”<br />Techniques<br />PageRank<br />Quantiles (median), variance, etc.<br />Sessionization<br />Convenience bag functions<br />Convenience utility functions<br />
    34. 34. Integration with Libs<br />Mix MapReduce with Machine Learning Libs<br />WEKA<br />KXEN<br />CPLEX<br />Map side “groups data”<br />Reduce side processes groups of data with Lib in parallel<br />Involves tricks in getting K/V pairs into lib<br />Pipes, tmp files, task cache dir, etc.<br />
    35. 35. What Hadoop Is Not Good At in Data Mining<br />Anything highly iterative<br />Anything that is extremely CPU bound and not disk bound<br />Algorithms that can’t be inherently parallelized<br />Examples<br />Stochastic Gradient Descent (SGD)<br />Support Vector Machines (SVM)<br />Doesn’t mean they aren’t great to use<br />
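To see why SGD is awkward for MapReduce, it helps to look at the update loop. In the sketch below, which trains a logistic regression model one example at a time, every update reads the weights written by the previous update, so the loop is inherently sequential. This is a plain single-machine illustration, not Mahout's SGD implementation; the function names, learning rate, epoch count, and toy data are assumptions.

```python
import math
import random

def sigmoid(z):
    # Clamp the input so math.exp cannot overflow on extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(examples, epochs=200, lr=0.1, seed=0):
    """examples: list of (feature_list, label) with label 0 or 1."""
    rng = random.Random(seed)
    w = [0.0] * len(examples[0][0])
    b = 0.0
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            # The prediction uses the weights left by the previous example,
            # which is the sequential dependency that resists parallelism.
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0

# Hypothetical, linearly separable toy data.
examples = [([-2.0, -1.0], 0), ([-1.0, -2.0], 0),
            ([1.0, 2.0], 1), ([2.0, 1.0], 1)]
w, b = sgd_logistic(examples)
print([predict(w, b, x) for x, _ in examples])
```

The same property that hurts on Hadoop is what makes SGD attractive elsewhere: it learns online, touching each example once per pass, with no need to hold the data set in memory.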
    36. 36. MRv2: A Peek at the Road Ahead<br />©2011 Cloudera, Inc. All Rights Reserved.<br />
    37. 37. MRv2<br />Not everything fits great in MapReduce<br />Mahout as evidence of this<br />Examples<br />Stochastic Gradient Descent (SGD)<br />Support Vector Machines (SVM)<br />As we build further into verticals our analysis needs will become more complicated<br />MRv2 gives us new options<br />CDH4 will be based on 0.23.x (or later)<br />0.23.0 doesn't include MRv1<br />(via Tom White) CDH4 will *only* include MRv2<br />
    38. 38. Existing Parallel Frameworks<br />MapReduce<br />Java, Pig, Hive<br />Spark<br />Scala, hides complexity like Hive/Pig<br />Runs on Hadoop, MRv2 already<br />Giraph<br />Bulk-synchronous parallel model<br />relative to graphs where vertices can send messages to other vertices during a given superstep<br />MPI<br />Older parallel lib<br />Includes primitives for data exchange, synchronization<br />Standardized and portable<br />GraphLab<br />“graph parallel” vs MR’s “data parallel”<br />Better at iterative style<br />
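The bulk-synchronous parallel model mentioned under Giraph can be illustrated with a tiny single-process simulation. In each superstep every vertex reads the messages sent to it in the previous superstep, updates its own value, and sends messages to its neighbours; a barrier separates supersteps. The sketch propagates the maximum value through a graph. It does not use the Giraph API; the function name `bsp_max_value` and the sample graph are assumptions, and every neighbour is assumed to appear as a key in `graph`.

```python
def bsp_max_value(graph, values):
    """graph: vertex -> list of neighbours; values: vertex -> initial number."""
    values = dict(values)
    # Superstep 0: every vertex sends its value to all of its neighbours.
    inbox = {v: [] for v in graph}
    for v, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[v])
    supersteps = 1
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            # A vertex only acts (and sends) if it learned a larger value.
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in graph[v]:
                    outbox[n].append(values[v])
        # Barrier: messages sent now are only visible in the next superstep.
        inbox = outbox
        supersteps += 1
    return values, supersteps

# Hypothetical three-vertex chain a - b - c.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
final_values, n_supersteps = bsp_max_value(graph, {"a": 3, "b": 6, "c": 2})
print(final_values, n_supersteps)
```

The computation halts when no vertex has anything left to send, which is the vertex-centric analogue of convergence and the reason this model suits iterative graph algorithms better than chained MapReduce jobs.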
    39. 39. Frameworks Currently in Dev – MRv2<br />Giraph<br /><br />Hama BSP plans to integrate with MRv2<br /><br />MPI<br /><br />Spark<br /><br />GraphLab<br />Discussion in user-mahout<br />
    40. 40. The Rise of the Meta Heuristic?<br />We’re seeing a data deluge drive demand for new data products<br />MapReduce applications are still relatively new<br />Customers have gotten a taste of data products with Hadoop<br />They like it<br />They want more<br />MRv2 has the potential to open up a range of meta heuristics to the Hadoop sector<br />Techniques like genetic algorithms that were previously considered “boutique”<br />
    41. 41. The Shape of Things to Come<br />Pig, Hive, Scala, Java<br />Compiler to build workflows of { Data, Algorithm, Framework }<br />Algorithm Library: Mahout, SGD, SVM, NeuralNetworks<br />Framework Library: MPI, Spark, GraphLab, MapReduce<br />MRv2<br />HDFS for large streaming files<br />HBase for small low latency transactions<br />
    42. 42. Questions? (Thanks!)<br />Hadoop World 2011<br />You should go<br />Talks are high quality<br />Lots more Machine Learning talks<br />Developer class 10/10/2011<br /><br />10% discount with code atlhug<br />