Machine Learning and Hadoop

Presentation on Machine Learning techniques for Hadoop and a peek at the near future of ML on Hadoop.

  • Theme: they threw away a lot of valuable gas and oil, just like we throw away data today
  • But what if some constraints changed?
  • Talk about changing market dynamics of storage cost. What if some of the previously held constraints changed? Enter Hadoop.
  • Examples of key information: selecting embryos based on 60 features. You may be asking “why aren’t we talking about Mahout?” What we want to do here is look at the fundamentals that will underlie all of the systems, not just Mahout. Some of the wording may be different, but it’s the same.
  • ML: can be used to predict outcomes in new situations. Can be used to understand and explain how a prediction is derived (may be even more important). Methods originate from artificial intelligence, statistics, and research on databases. DM: about the process. ML: about the algorithms. “Can machines really learn?” Long discussion, but from some perspectives yes. Good philosophical talk over beers.
  • Mention how different books lay out information in different formatting, or may not group techniques exactly the same. Lots of bleed-over, from NLP, to IR, to ML.
  • SGD – online learning, non batch, not parallelizable, good performance
  • “What do other people w/ similar tastes like?” “Strength of associations”
  • Let’s set the stage in the context of story, why we were looking at big data for time series.
  • Ok, so how did we get to this point? Older SCADA systems take 1 data point per 2-4 seconds; PMUs take 30 per second. 120 PMUs, growing by a 10x factor.
  • On Monday Steve from Google talked about working with genomic data; genomic data is time series. Our take-home demo actually works with a small bit of genomic data. Lots of chatter @ OSCON about genomics, I just sat in one today.
  • Check this against the Mahout impl
  • Dryad, CielmHyperFlow, ASTERIX, Hyracks, HaLoop


  • 1. September 2011 – HUG – Atlanta, GA
    Machine Learning With Hadoop
    Josh Patterson | Sr Solution Architect
  • 2. Who is Josh Patterson?
    Master’s Thesis: self-organizing mesh networks
    Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
    Conceived, built, and led Hadoop integration for openPDC project at Tennessee Valley Authority (TVA)
    Led team which designed classification techniques for time series and Map Reduce
    Open source work at
    Sr. Solutions Architect at Cloudera
  • 3. Outline
    Hadoop Today
    Data Mining
    Mahout and Friends
    A Peek at the Road Ahead
  • 4. “After the refining process, one barrel of crude oil yielded more than 40% gasoline and only 3% kerosene, creating large quantities of waste gasoline for disposal.”
    --- Excerpt from the book “The American Gas Station”
    Hadoop Today: The Oil Industry Circa 1900
  • 5. DNA Sequencing Trends
    Cost of DNA Sequencing Falling Very Fast
  • 6. Unstructured Data Explosion
    Complex, Unstructured
    • 2,500 exabytes of new information in 2012 with Internet as primary driver
    • Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year
  • Obstacles to Leveraging Data
    Copyright 2010 Cloudera Inc. All rights reserved
    • Data comes in many shapes and sizes: relational tuples, log files, semistructured textual data (e.g., e-mail)
    • Sometimes makes the data unwieldy
    • Customers are not creating schemas for all of their data
    • Yet still may want to join data sets
    • Customers are moving some of it to tape or cold storage, throwing it away because “it doesn’t fit”
    • They are throwing data away because it’s too expensive to hold
    • Similar to the oil industry in 1900
  • A New Platform for an Evolving Landscape
    Ability to look at true distribution of data
    Previously impossible due to scale
    Lower cost of analysis
    Ad Hoc analysis now more open and flexible
    Speed @ Scale is the new Killer App
    Results that previously took 1 day to process can gain new value when created in 10 minutes.
    Greater Flexibility
    Less restrictive than SQL-only systems
  • 14. Data Mining
    “How is it possible for a slow, tiny brain, whether biological or electronic, to perceive, understand, predict, and manipulate a world far larger and more complicated than itself?”
    --- Stuart Russell and Peter Norvig, “Artificial Intelligence: A Modern Approach”
  • 15. Basic Concepts
    What is Data Mining?
    “the process of extracting patterns from data”
    Why are we interested in Data Mining?
    Raw data essentially useless
    Data is simply recorded facts
    Information is the patterns underlying the data
    We want to learn these patterns
    Information is key
  • 16. How does Machine Learning differ from Data Mining?
    Data Mining
    Extracting information from data
    Finds patterns in data
    Machine Learning
    Algorithms for acquiring structural descriptions from data “examples”
    Process of learning “concepts”
    “structural descriptions” represent patterns explicitly
  • 17. Shades of Gray
    Information Retrieval
    information science, information architecture, cognitive psychology, linguistics, and statistics.
    Natural Language Processing
    grounded in machine learning, especially statistical machine learning
    Math and stuff
    Machine Learning
    Considered a branch of artificial intelligence
  • 18. Types of Machine Learning
    Numeric Prediction
    AKA: “Regression”
  • 19. Tools, Applications, and Mahout
  • 20. ML Focused on in Mahout
    Naïve Bayes in Text Classification
    Stochastic Gradient Descent (Logistic Regression)
    Random Forests
    Collaborative Filtering, Taste Engine
    Item to item
    K-means, Fuzzy K-means
    Dirichlet Process Clustering, Latent Dirichlet Allocation
  • 21. Naïve Bayes and Text
    Doc classification is an important domain in Machine Learning
    Docs are characterized by the words that appear in them
    One approach is to treat presence / absence of each word as a boolean attribute
    Naïve Bayes is popular here, fast, accurate
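The boolean-attribute approach above can be sketched as a small Bernoulli Naïve Bayes classifier. This is a minimal illustration in plain Python, not Mahout's implementation; the function names, the toy documents, and the Laplace smoothing constants are assumptions made for the example.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(docs, labels):
    """Train on tokenized docs; each word is a present/absent boolean attribute."""
    vocab = sorted({w for d in docs for w in d})
    class_docs = defaultdict(list)
    for d, y in zip(docs, labels):
        class_docs[y].append(set(d))
    model = {"vocab": vocab, "prior": {}, "p_word": {}}
    for y, ds in class_docs.items():
        model["prior"][y] = len(ds) / len(docs)
        # Laplace smoothing (assumed): (docs containing word + 1) / (docs + 2)
        model["p_word"][y] = {
            w: (sum(w in d for d in ds) + 1) / (len(ds) + 2) for w in vocab
        }
    return model


def classify(model, tokens):
    """Return the class with the highest log-posterior for the given tokens."""
    present = set(tokens)
    best, best_score = None, -math.inf
    for y, prior in model["prior"].items():
        score = math.log(prior)
        for w in model["vocab"]:
            p = model["p_word"][y][w]
            # Absence of a word is evidence too, hence the (1 - p) term.
            score += math.log(p if w in present else 1.0 - p)
        if score > best_score:
            best, best_score = y, score
    return best


# Hypothetical toy corpus
docs = [["cheap", "pills", "buy"], ["buy", "cheap", "now"],
        ["meeting", "agenda", "notes"], ["project", "meeting", "now"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_bernoulli_nb(docs, labels)
print(classify(model, ["cheap", "buy"]))
```

Training needs only per-class word counts, which is why this family of models maps naturally onto a counting job in MapReduce.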
  • 22. What Are Recommenders?
    An algorithm that looks at a user’s past actions and suggests
  • 23. Collaborative Filtering
    Collaborative filtering produces recommendations based on
    user preferences for items,
    “User Based”
    does not require knowledge of the specific properties of the items.
    In contrast,
    content-based recommendation produces recommendations based on detailed knowledge of the properties of items.
    “Item based”
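A user-based collaborative filter can be sketched in a few lines. The version below is a simplified illustration, not the Taste API: the ratings dictionary, the cosine similarity choice, and the unnormalized similarity-weighted score are all assumptions. Note that it never looks at item properties, only at who rated what.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, k=2):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                # Similar users contribute more to an item's score.
                scores[item] = scores.get(item, 0.0) + sim * rating
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical preference data
ratings = {
    "alice": {"A": 5, "B": 4},
    "bob": {"A": 5, "B": 5, "C": 5},
    "carol": {"A": 1, "C": 1, "D": 5},
}
print(recommend(ratings, "alice"))
```

Here "alice" is most similar to "bob", so the item "bob" rated highly is ranked ahead of the item favored only by the less similar "carol".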
  • 24. Clustering: Topic Modeling
    Cluster words across docs to identify topics
    Latent Dirichlet Allocation
  • 25. What is time series data?
    Time series data is defined as a sequence of data points measured typically at successive times spaced at uniform time intervals
    Examples in finance
    daily adjusted close price of a stock at the NYSE
    Example in Sensors / Signal Processing / Smart Grid
    sensor readings on a power grid occurring 30 times a second.
    For more reference on time series data
  • 26. NERC Sensor Data Collection
    openPDC PMU Data Collection circa 2009
    • 120 Sensors
    • 30 samples/second
    • 4.3B Samples/day
    • Housed in Hadoop
  • Story Time: Keogh, SAX, and the openPDC
    NERC wanted high res smart grid data tracked
    Started openPDC project @ TVA
    We used Hadoop to store and process time series data
    Needed to find “unbounded oscillations”
    Time series unwieldy to work with at scale
    We found “SAX” by Keogh and his folks for dealing with time series
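For readers unfamiliar with SAX (Symbolic Aggregate approXimation), the basic idea is: z-normalize the series, average it over fixed-width segments, and map each average to a letter using breakpoints that split a standard normal distribution into equal-probability regions. The sketch below is a simplified illustration and not the openPDC or Lumberyard code; the four-letter alphabet and the requirement that the series length divide evenly are assumptions.

```python
import statistics

# Breakpoints that cut N(0, 1) into 4 equiprobable regions (alphabet size 4).
BREAKPOINTS = [-0.67, 0.0, 0.67]
ALPHABET = "abcd"


def sax(series, n_segments):
    """Convert a numeric series into a SAX word of n_segments letters."""
    if len(series) % n_segments != 0:
        raise ValueError("sketch assumes len(series) is divisible by n_segments")
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    # Z-normalize so that the Gaussian breakpoints are meaningful.
    z = [(x - mean) / std if std > 0 else 0.0 for x in series]
    seg_len = len(z) // n_segments
    word = []
    for s in range(n_segments):
        chunk = z[s * seg_len:(s + 1) * seg_len]
        avg = sum(chunk) / len(chunk)  # piecewise aggregate approximation
        symbol_index = sum(avg > b for b in BREAKPOINTS)
        word.append(ALPHABET[symbol_index])
    return "".join(word)


print(sax([0, 1, 2, 3, 4, 5, 6, 7], 4))
print(sax([1, 1, 2, 2, 8, 8, 9, 9], 4))
```

The resulting short strings are much cheaper to index and compare than raw samples, which is what makes the representation attractive at scale.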
  • 30. What is Lumberyard?
    Lumberyard is time series iSAX indexing stored in HBase for persistent and scalable index storage
    It’s interesting for
    Indexing large amounts of time series data
    Low latency fuzzy pattern matching queries on time series data
    Lumberyard is open source and Apache 2.0 licensed at GitHub:
  • 31. Genome Data as Time Series
    A, C, G, and T
    Could be thought of as “1, 2, 3, and 4”!
    If we have sequence X, what is the “closest” subsequence in a genome that is most like it?
    Doesn’t have to be an exact match!
    Useful in proteomics as well
    iSAX Indexing
    Lumberyard use case
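The "closest subsequence" question can be made concrete with a brute-force scan. This sketch encodes bases as 1 to 4 as the slide suggests and slides the query across the genome, keeping the window with the smallest squared Euclidean distance. It is only an illustration of the problem statement: an iSAX index such as Lumberyard exists precisely to avoid this linear scan, and the function names and distance measure are assumptions.

```python
ENCODING = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode(seq):
    """Turn a base string into a numeric series."""
    return [ENCODING[base] for base in seq]


def closest_subsequence(genome, query):
    """Return (position, matched text, distance) of the best-matching window."""
    if len(query) > len(genome):
        raise ValueError("query must not be longer than the genome")
    g, q = encode(genome), encode(query)
    best_pos, best_dist = -1, float("inf")
    for start in range(len(g) - len(q) + 1):
        window = g[start:start + len(q)]
        dist = sum((a - b) ** 2 for a, b in zip(window, q))
        if dist < best_dist:  # ties keep the earliest position
            best_pos, best_dist = start, dist
    return best_pos, genome[best_pos:best_pos + len(q)], best_dist


print(closest_subsequence("ACGTTGCA", "GTT"))   # exact match exists
print(closest_subsequence("AAAACCCC", "ACG"))   # only an approximate match
```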
  • 32. Bioinformatics
    Applications in DNA Sequencing
    Shortest Superstring Problem (SSP)
    Take lots of reads from sequencing
    We want the “superstring” of all the reads
    We want a long string that “explains” all the reads we generated
    We want the shortest string possible
    We can reduce SSP to the Traveling Salesman Problem
    Graph processing / algorithms now applicable
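To make the superstring idea tangible, here is a greedy heuristic that repeatedly merges the two reads with the largest suffix/prefix overlap. This is a swapped-in approximation, not the TSP reduction named above, and it is not guaranteed to find the shortest superstring. The overlap it computes is the same quantity that serves as the edge weight when the reads are treated as nodes of a graph.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads):
    """Greedily merge reads by maximum overlap (heuristic, not optimal)."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are already contained in another read.
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]
    while len(reads) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads[0] if reads else ""


# Hypothetical reads
reads = ["ACGT", "GTAC", "ACTT"]
result = greedy_superstring(reads)
print(result, len(result))
```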
  • 33. Packages For Hadoop
    UDFs in Pig
    used at LinkedIn in many off-line workflows for data-derived products
    “People You May Know”
    Quantiles (median), variance, etc.
    Convenience bag functions
    Convenience utility functions
  • 34. Integration with Libs
    Mix MapReduce with Machine Learning Libs
    Map side “groups data”
    Reduce side processes groups of data with Lib in parallel
    Involves tricks in getting K/V pairs into lib
    Pipes, tmp files, task cache dir, etc
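The pattern above can be simulated in a single process to show the data flow. In this sketch the map side only assigns a grouping key, the shuffle collects values per key, and the reduce side hands each complete group to a library call. Python's `statistics.fmean` stands in for the external machine learning library; the record layout and function names are assumptions, and a real job would run each reduce call on a different node.

```python
import statistics
from collections import defaultdict


def map_phase(records):
    """Map side: emit (group key, value) pairs; no heavy computation here."""
    for sensor_id, value in records:
        yield sensor_id, value


def shuffle(pairs):
    """Stand-in for Hadoop's shuffle/sort: collect all values per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce side: pass the whole group to a library (placeholder call)."""
    return key, statistics.fmean(values)


def run_job(records):
    groups = shuffle(map_phase(records))
    # Each group is independent, so these calls could run in parallel.
    return dict(reduce_phase(k, vs) for k, vs in sorted(groups.items()))


print(run_job([("a", 1.0), ("b", 4.0), ("a", 3.0)]))
```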
  • 35. What Hadoop Is Not Good At in Data Mining
    Anything highly iterative
    Anything that is extremely CPU bound and not disk bound
    Algorithms that can’t be inherently parallelized
    Stochastic Gradient Descent (SGD)
    Support Vector Machines (SVM)
    Doesn’t mean they aren’t great to use
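A minimal SGD logistic regression shows why the algorithm resists the MapReduce model: every update reads the weights written by the previous example, so the inner loop is a chain of dependent steps rather than independent map tasks. This sketch is plain Python, not Mahout's implementation; the learning rate, epoch count, and toy data are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_logistic(examples, epochs=500, lr=0.1):
    """Train logistic regression one example at a time (online updates)."""
    n_features = len(examples[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            # This prediction depends on w and b from the previous update.
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict(w, b, x):
    """Return the predicted class label (0 or 1)."""
    return int(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5)


# Hypothetical one-feature data: small values are class 0, large are class 1.
examples = [([0.0], 0), ([1.0], 0), ([3.0], 1), ([4.0], 1)]
w, b = sgd_logistic(examples)
print(predict(w, b, [0.0]), predict(w, b, [4.0]))
```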
  • 36. MRv2: A Peek at the Road Ahead
  • 37. MRv2
    Not everything fits great in MapReduce
    Mahout as evidence of this
    Stochastic Gradient Descent (SGD)
    Support Vector Machines (SVM)
    As we build further into verticals our analysis needs will become more complicated
    MRv2 gives us new options
    CDH4 will be based on 0.23.x (or later)
    0.23.0 doesn't include MRv1
    (via Tom White) CDH4 will *only* include MRv2
  • 38. Existing Parallel Frameworks
    Java, Pig, Hive
    Scala, hides complexity like hive/pig
    Runs on hadoop, MRv2 already
    Bulk-synchronous parallel model
    relative to graphs where vertices can send messages to other vertices during a given superstep
    Older parallel lib
    Includes primitives for data exchange, synchronization
    Standardized and portable
    “graph parallel” vs MR’s “data parallel”
    Better at iterative style
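The bulk-synchronous parallel model mentioned above can be illustrated with the classic "propagate the maximum value" example. In each superstep every vertex reads the messages sent to it in the previous superstep, updates its state, and sends messages that are only delivered after a barrier. This is a single-process sketch under assumed names and an assumed graph layout; it is not the API of any of the frameworks listed.

```python
def bsp_max(graph, values):
    """graph: vertex -> list of neighbours; values: vertex -> initial number.

    Every neighbour must also appear as a key in graph.
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value to all of its neighbours.
    inbox = {v: [] for v in graph}
    for v, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[v])
    # Later supersteps: run until no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v in graph:  # conceptually, all vertices compute in parallel
            if inbox[v]:
                candidate = max(inbox[v])
                if candidate > values[v]:
                    values[v] = candidate
                    for n in graph[v]:
                        outbox[n].append(candidate)
        inbox = outbox  # barrier: messages become visible next superstep
    return values


# Hypothetical three-vertex chain a - b - c
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(bsp_max(graph, {"a": 3, "b": 6, "c": 2}))
```

State stays with the vertices between supersteps, which is why this style suits iterative graph algorithms better than chaining MapReduce jobs that reread their input on every pass.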
  • 39. Frameworks Currently in Dev – MRv2
    Hama BSP plans to integrate with MRv2
    Discussion in user-mahout
  • 40. The Rise of the Meta Heuristic?
    We’re seeing a data deluge drive demand for new data products
    MapReduce applications are still relatively new
    Customers have gotten a taste of data products with Hadoop
    They like it
    They want more
    MRv2 has the potential to open up a range of metaheuristics to the Hadoop sector
    Techniques like genetic algorithms that were previously considered “boutique”
  • 41. The Shape of Things to Come
    Pig, Hive, Scala, Java
    Compiler to build workflows of { Data, Algorithm, Framework }
    Algorithm Library: Mahout, SGD, SVM, NeuralNetworks
    Framework Library, MPI, Spark, GraphLab, MapReduce
    HDFS for large streaming files
    HBase for small low latency transactions
  • 42. Questions? (Thanks!)
    Hadoop World 2011
    You should go
    Talks are high quality
    Lots more Machine Learning talks
    Developer class 10/10/2011
    10% discount with code atlhug