Machine Learning and Hadoop

September 2011 – HUG – Atlanta, GA. Machine Learning With Hadoop. Josh Patterson | Sr. Solutions Architect

Who is Josh Patterson? josh@cloudera.com. Master’s thesis: self-organizing mesh networks. Published in IAAI-09: “TinyTermite: A Secure Routing Algorithm”. Conceived, built, and led Hadoop integration for the openPDC project at Tennessee Valley Authority (TVA). Led the team which designed classification techniques for time series and MapReduce. Open source work at http://openpdc.codeplex.com and https://github.com/jpatanooga. Today: Sr. Solutions Architect at Cloudera.

Outline: Hadoop Today; Data Mining; Mahout and Friends; A Peek at the Road Ahead

“After the refining process, one barrel of crude oil yielded more than 40% gasoline and only 3% kerosene, creating large quantities of waste gasoline for disposal.” --- Excerpt from the book “The American Gas Station” Hadoop Today: The Oil Industry Circa 1900

DNA Sequencing Trends: Cost of DNA Sequencing Falling Very Fast

Unstructured Data Explosion (chart labels: Complex, Unstructured; Relational)

Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year

Sometimes makes the data unwieldy

Customers are not creating schemas for all of their data

Yet still may want to join data sets

Customers are moving some of it to tape or cold storage, or throwing it away because “it doesn’t fit”

They are throwing data away because it’s too expensive to hold

Similar to the oil industry in 1900

Data Mining “How is it possible for a slow, tiny brain, whether biological or electronic, to perceive, understand, predict, and manipulate a world far larger and more complicated than itself?” --- Peter Norvig, “Artificial Intelligence: A Modern Approach”

Basic Concepts. What is Data Mining? “The process of extracting patterns from data.” Why are we interested in Data Mining? Raw data is essentially useless: data is simply recorded facts, while information is the patterns underlying the data. We want to learn these patterns. Information is key.

How does Machine Learning differ from Data Mining? Data Mining: extracting information from data; finds patterns in data. Machine Learning: algorithms for acquiring structural descriptions from data “examples”; the process of learning “concepts”; “structural descriptions” represent patterns explicitly.

Shades of Gray. Information Retrieval: information science, information architecture, cognitive psychology, linguistics, and statistics. Natural Language Processing: grounded in machine learning, especially statistical machine learning. Statistics: math and stuff. Machine Learning: considered a branch of artificial intelligence.

Types of Machine Learning: Classification; Association; Clustering; Numeric Prediction (AKA “Regression”)

ML Focused on in Mahout. Classification: Naïve Bayes in text classification; Stochastic Gradient Descent (Logistic Regression); Random Forests. Recommendation: Collaborative Filtering, Taste Engine; item to item. Clustering: K-means, Fuzzy K-means; (Latent) Dirichlet Process.

Naïve Bayes and Text. Document classification is an important domain in Machine Learning. Docs are characterized by the words that appear in them. One approach is to treat the presence or absence of each word as a boolean attribute. Naïve Bayes is popular here: fast and accurate.
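The presence/absence approach described on this slide can be sketched in a few lines. This is a minimal, in-memory illustration of a Bernoulli-style Naïve Bayes classifier, not Mahout's implementation; the function names, the whitespace tokenizer, and the add-one smoothing are assumptions made for the example.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(docs):
    """Train on a list of (label, text) pairs; names here are hypothetical."""
    vocab = set()
    doc_count = defaultdict(int)                             # documents per label
    word_doc_count = defaultdict(lambda: defaultdict(int))   # label -> word -> docs containing it
    for label, text in docs:
        words = set(text.lower().split())   # only presence/absence matters, so use a set
        vocab |= words
        doc_count[label] += 1
        for w in words:
            word_doc_count[label][w] += 1
    total = sum(doc_count.values())
    model = {"vocab": vocab, "prior": {}, "p_word": {}}
    for label, n in doc_count.items():
        model["prior"][label] = n / total
        # Add-one smoothing (an assumption) keeps every probability strictly between 0 and 1.
        model["p_word"][label] = {
            w: (word_doc_count[label][w] + 1) / (n + 2) for w in vocab
        }
    return model


def classify(model, text):
    """Return the label with the highest log-posterior for the given text."""
    present = set(text.lower().split())
    best_label, best_score = None, float("-inf")
    for label, prior in model["prior"].items():
        score = math.log(prior)
        for w in model["vocab"]:
            p = model["p_word"][label][w]
            # Each vocabulary word is a boolean attribute: present or absent.
            score += math.log(p if w in present else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Treating every vocabulary word as a boolean attribute means absent words also contribute evidence, which is the main difference from the multinomial variant that counts word occurrences.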

What Are Recommenders? An algorithm that looks at a user’s past actions and suggests products, services, or people.

Collaborative Filtering. Collaborative filtering produces recommendations based on user preferences for items (“user based”); it does not require knowledge of the specific properties of the items. In contrast, content-based recommendation produces recommendations based on detailed knowledge of the properties of items (“item based”).
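As a rough illustration of the first approach, the sketch below recommends items using only a table of user ratings, with no knowledge of item properties. It is not the Mahout Taste API; the function names, the cosine similarity measure, and the similarity-weighted average are assumptions chosen for brevity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two {item: rating} dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=3):
    """Rank items the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, prefs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], prefs)
        if sim <= 0:
            continue  # ignore users with nothing in common
        for item, rating in prefs.items():
            if item in ratings[user]:
                continue  # only suggest unseen items
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
    return [item for _, item in ranked[:top_n]]


# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"A": 5, "B": 4},
    "bob": {"A": 5, "B": 4, "C": 5},
    "carol": {"A": 1, "D": 2},
}
print(recommend(ratings, "alice"))  # ['C', 'D']
```

Note that the items are opaque identifiers throughout: the code never inspects what "A" or "C" actually is, which is the property the slide highlights.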

Clustering: Topic Modeling. Cluster words across docs to identify topics. Latent Dirichlet Allocation.

What is time series data? Time series data is defined as a sequence of data points, typically measured at successive times spaced at uniform intervals. Example in finance: the daily adjusted close price of a stock at the NYSE. Example in sensors / signal processing / smart grid: sensor readings on a power grid occurring 30 times a second. For more on time series data: http://www.cloudera.com/blog/2011/03/simple-moving-average-secondary-sort-and-mapreduce-part-1/
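The linked post's title refers to a simple moving average over time series data. A minimal in-memory sketch of that calculation, assuming uniformly spaced samples, might look like the following; it is not the MapReduce implementation from the post, and the function name is hypothetical.

```python
from collections import deque


def simple_moving_average(values, window):
    """Return the mean of each full window of `window` consecutive samples."""
    if window <= 0:
        raise ValueError("window must be positive")
    out = []
    buf = deque()
    running = 0.0
    for v in values:
        buf.append(v)
        running += v
        if len(buf) > window:
            running -= buf.popleft()   # slide the window forward by one sample
        if len(buf) == window:
            out.append(running / window)
    return out


# Hypothetical daily closing prices
print(simple_moving_average([1, 2, 3, 4, 5], 3))  # [2.0, 3.0, 4.0]
```

Because the samples are assumed to be evenly spaced, the window can be expressed as a count of samples rather than a time span.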

NERC Sensor Data Collection: openPDC PMU Data Collection circa 2009

Housed in Hadoop

What is Lumberyard? Lumberyard is time series iSAX indexing stored in HBase for persistent and scalable index storage. It’s interesting for indexing large amounts of time series data and for low latency fuzzy pattern matching queries on time series data. Lumberyard is open source, licensed under the Apache License 2.0 (ASF 2.0), on GitHub: https://github.com/jpatanooga/Lumberyard/ Copyright 2011 Cloudera Inc. All rights reserved
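iSAX builds on SAX (Symbolic Aggregate approXimation), which turns a time series into a short word of symbols so that similar shapes map to similar words. The sketch below shows plain SAX only, as background; it is not Lumberyard's code and does not implement the iSAX index itself. The function name, the fixed four-symbol alphabet, and the requirement that the series length be divisible by the segment count are assumptions.

```python
import bisect
import statistics

# Breakpoints that split a standard normal distribution into four
# equally likely regions (its quartiles), one region per symbol.
BREAKPOINTS = [-0.6745, 0.0, 0.6745]
ALPHABET = "abcd"


def sax_word(series, segments):
    """Convert a numeric series into a SAX word of `segments` symbols."""
    if len(series) % segments != 0:
        raise ValueError("series length must be divisible by segments")
    mean = statistics.mean(series)
    std = statistics.pstdev(series)
    # Step 1: z-normalize so that shape, not scale, determines the word.
    if std == 0:
        normalized = [0.0] * len(series)
    else:
        normalized = [(x - mean) / std for x in series]
    # Step 2: piecewise aggregate approximation (mean of each segment).
    seg_len = len(series) // segments
    word = []
    for s in range(segments):
        chunk = normalized[s * seg_len:(s + 1) * seg_len]
        avg = sum(chunk) / len(chunk)
        # Step 3: map each segment mean to the symbol of its region.
        word.append(ALPHABET[bisect.bisect_left(BREAKPOINTS, avg)])
    return "".join(word)


print(sax_word([1, 1, 1, 1, 9, 9, 9, 9], 2))  # ad
print(sax_word([1, 2, 3, 4, 5, 6, 7, 8], 4))  # abcd
```

Two series with similar shapes produce the same or nearby words, which is what makes approximate ("fuzzy") matching over an index of such words possible.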

Genome Data as Time Series. A, C, G, and T could be thought of as “1, 2, 3, and 4”! If we have sequence X, what is the “closest” subsequence in a genome that is most like it? It doesn’t have to be an exact match! Example: ATATAT and TATATA. Useful in proteomics as well. iSAX indexing: a Lumberyard use case.
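The "closest subsequence" question can be illustrated with a brute-force scan. This sketch uses the 1, 2, 3, 4 encoding from the slide and squared Euclidean distance as the similarity measure; both the distance choice and the function names are assumptions, and an index such as iSAX exists precisely to avoid this kind of linear scan on large data.

```python
# Encoding taken from the slide; note it imposes an artificial ordering
# (A is numerically closer to C than to T), which is a known limitation.
ENCODING = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode(seq):
    """Map a DNA string to a list of numbers."""
    return [ENCODING[base] for base in seq.upper()]


def closest_subsequence(genome, query):
    """Return (offset, subsequence, distance) of the best inexact match."""
    g, q = encode(genome), encode(query)
    if len(q) > len(g):
        raise ValueError("query is longer than genome")
    best_offset, best_dist = -1, float("inf")
    for offset in range(len(g) - len(q) + 1):
        window = g[offset:offset + len(q)]
        dist = sum((a - b) ** 2 for a, b in zip(window, q))
        if dist < best_dist:
            best_offset, best_dist = offset, dist
    return best_offset, genome[best_offset:best_offset + len(q)], best_dist


# Hypothetical genome fragment: no exact match for ATATAT exists in it.
print(closest_subsequence("GGCATATACCG", "ATATAT"))  # (3, 'ATATAC', 4)
```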

Bioinformatics Applications in DNA Sequencing. Shortest Superstring Problem (SSP): take lots of reads from sequencing; we want the “superstring” of all the reads, a long string that “explains” all the reads we generated, and we want the shortest string possible. NP-complete. We can reduce SSP to the Traveling Salesman Problem. Graph processing / algorithms now applicable.
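Because the problem is NP-complete, practical approaches rely on heuristics. The sketch below shows one well-known heuristic, greedy merging of the pair of reads with the largest overlap; it is not the TSP reduction mentioned on the slide and it is not guaranteed to find the shortest superstring. The function names and the example reads are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads):
    """Repeatedly merge the two reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))   # drop exact duplicates, keep order
    # Drop reads that are already contained in another read.
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]
    while len(reads) > 1:
        best = (-1, 0, 1)                # (overlap length, index i, index j)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads[0] if reads else ""


print(greedy_superstring(["ATTAGA", "TAGACC", "GACCAT"]))  # ATTAGACCAT
```

The pairwise overlaps computed here are the edge weights of the overlap graph, which is the structure that makes graph algorithms applicable to the problem.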

Packages For Hadoop. DataFu (http://sna-projects.com/datafu/): UDFs in Pig, used at LinkedIn in many off-line workflows for data-derived products such as “People You May Know” and “Skills”. Techniques: PageRank; quantiles (median), variance, etc.; sessionization; convenience bag functions; convenience utility functions.

Integration with Libs. Mix MapReduce with machine learning libs: WEKA, KXEN, CPLEX. Map side “groups data”; reduce side processes groups of data with the lib in parallel. Involves tricks in getting K/V pairs into the lib: pipes, tmp files, task cache dir, etc.
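The group-then-process pattern on this slide can be simulated locally to show its shape. This is a single-process sketch, not Hadoop code: the map step groups records by key, and the reduce step hands each group to a library call in parallel. The `library_model` function is a hypothetical stand-in for a call into an external library such as WEKA, and all other names are assumptions as well.

```python
import statistics
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(records, key_fn):
    """'Map side groups data': bucket records by key."""
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record)
    return groups


def library_model(values):
    """Hypothetical stand-in for handing one group to an external ML library."""
    return {"mean": statistics.mean(values), "n": len(values)}


def reduce_phase(groups, value_fn, lib_fn):
    """'Reduce side processes groups with the lib in parallel'."""
    with ThreadPoolExecutor() as pool:
        futures = {
            key: pool.submit(lib_fn, [value_fn(r) for r in recs])
            for key, recs in groups.items()
        }
        return {key: f.result() for key, f in futures.items()}


# Hypothetical (sensor id, reading) records
records = [("sensor-a", 1.0), ("sensor-b", 10.0), ("sensor-a", 3.0)]
groups = map_phase(records, key_fn=lambda r: r[0])
results = reduce_phase(groups, value_fn=lambda r: r[1], lib_fn=library_model)
print(results["sensor-a"])  # {'mean': 2.0, 'n': 2}
```

In a real job the hard part is the boundary the slide mentions: getting each group's key/value pairs into the library's expected input format, which this sketch sidesteps by passing Python lists.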

What Hadoop Is Not Good At in Data Mining. Anything highly iterative. Anything that is extremely CPU bound and not disk bound. Algorithms that can’t be inherently parallelized. Examples: Stochastic Gradient Descent (SGD), Support Vector Machines (SVM). That doesn’t mean they aren’t great to use.
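One way to see why SGD counts as highly iterative is to look at its update loop. In the minimal logistic regression sketch below, every update reads the weights written by the previous update, so the steps form a sequential chain rather than independent tasks. The function names, learning rate, and epoch count are assumptions for illustration; this is not Mahout's SGD implementation.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_logistic(examples, epochs=200, lr=0.1, seed=0):
    """Fit weights on (features, label) pairs with plain SGD."""
    rng = random.Random(seed)
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            # The prediction depends on the weights from the previous step,
            # so these updates cannot simply be run side by side.
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(weights, bias, x):
    """Probability that x belongs to class 1."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical one-feature data: small values are class 0, large are class 1.
examples = [([0.0], 0), ([1.0], 0), ([3.0], 1), ([4.0], 1)]
weights, bias = sgd_logistic(examples)
```

Running each pass over the data as a separate MapReduce job would pay job startup and disk I/O costs hundreds of times, which is the mismatch the slide points at.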
