September 2011, HUG, Atlanta, GA
Machine Learning With Hadoop
Josh Patterson | Sr Solution Architect

Who is Josh Patterson?
  josh@cloudera.com
  Master’s Thesis: self-organizing mesh networks
    Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
  Conceived, built, and led Hadoop integration for openPDC project at Tennessee Valley Authority (TVA)
  Led team which designed classification techniques for time series and MapReduce
  Open source work at
    http://openpdc.codeplex.com
    https://github.com/jpatanooga
  Today
    Sr. Solutions Architect at Cloudera

Outline
  Hadoop Today
  Data Mining
  Mahout and Friends
  A Peek at the Road Ahead

Hadoop Today: The Oil Industry Circa 1900
  “After the refining process, one barrel of crude oil yielded more than 40% gasoline and only 3% kerosene, creating large quantities of waste gasoline for disposal.”
  --- Excerpt from the book “The American Gas Station”

DNA Sequencing Trends
  Cost of DNA Sequencing Falling Very Fast

Unstructured Data Explosion
  Complex, Unstructured
  Relational
  2,500 exabytes of new information in 2012 with Internet as primary driver
  Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year

Obstacles to Leveraging Data
  Data comes in many shapes and sizes: relational tuples, log files, semistructured textual data (e.g., e-mail)
    Sometimes makes the data unwieldy
  Customers are not creating schemas for all of their data
    Yet still may want to join data sets
  Customers are moving some of it to tape or cold storage, throwing it away because “it doesn’t fit”
    They are throwing data away because it’s too expensive to hold
    Similar to the oil industry in 1900
  Copyright 2010 Cloudera Inc. All rights reserved

A New Platform for an Evolving Landscape
  Ability to look at true distribution of data
    Previously impossible due to scale
  Lower cost of analysis
    Ad hoc analysis now more open and flexible
  Speed @ Scale is the new Killer App
    Results that previously took 1 day to process can gain new value when created in 10 minutes.
  Greater Flexibility
    Less restrictive than SQL-only systems

Data Mining
  “How is it possible for a slow, tiny brain, whether biological or electronic, to perceive, understand, predict, and manipulate a world far larger and more complicated than itself?”
  --- Peter Norvig, “Artificial Intelligence: A Modern Approach”

Basic Concepts
  What is Data Mining?
    “the process of extracting patterns from data”
  Why are we interested in Data Mining?
    Raw data essentially useless
      Data is simply recorded facts
      Information is the patterns underlying the data
    We want to learn these patterns
      Information is key

How does Machine Learning differ from Data Mining?
  Data Mining
    Extracting information from data
    Finds patterns in data
  Machine Learning
    Algorithms for acquiring structural descriptions from data “examples”
    Process of learning “concepts”
    “structural descriptions” represent patterns explicitly

Shades of Gray
  Information Retrieval
    information science, information architecture, cognitive psychology, linguistics, and statistics
  Natural Language Processing
    grounded in machine learning, especially statistical machine learning
  Statistics
    Math and stuff
  Machine Learning
    Considered a branch of artificial intelligence

Types of Machine Learning
  Classification
  Association
  Clustering
  Numeric Prediction
    AKA: “Regression”

Tools, Applications, and Mahout

ML Focused on in Mahout
  Classification
    Naïve Bayes in Text Classification
    Stochastic Gradient Descent (Logistic Regression)
    Random Forests
  Recommendation
    Collaborative Filtering, Taste Engine
    Item to item
  Clustering
    K-means, Fuzzy K-means
    (Latent) Dirichlet Process

Naïve Bayes and Text
  Doc classification is an important domain in Machine Learning
  Docs are characterized by the words that appear in them
  One approach is to treat presence / absence of each word as a boolean attribute
  Naïve Bayes is popular here: fast and accurate
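
The presence / absence approach on this slide can be sketched as a small Bernoulli naïve Bayes classifier. This is a minimal in-memory illustration, not Mahout's implementation: the function names `train_bernoulli_nb` and `classify`, the toy documents, and the use of Laplace smoothing are all assumptions made for the example.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(docs, labels):
    """Fit a Bernoulli naive Bayes model from tokenized documents."""
    vocab = sorted({w for doc in docs for w in doc})
    class_docs = defaultdict(int)                      # documents per class
    word_docs = defaultdict(lambda: defaultdict(int))  # documents per class containing a word
    for doc, label in zip(docs, labels):
        class_docs[label] += 1
        for w in set(doc):                             # presence only, counts are ignored
            word_docs[label][w] += 1
    total = len(docs)
    priors = {c: n / total for c, n in class_docs.items()}
    # Laplace smoothing (an assumption of this sketch) keeps every
    # probability strictly between 0 and 1.
    cond = {
        c: {w: (word_docs[c][w] + 1) / (class_docs[c] + 2) for w in vocab}
        for c in class_docs
    }
    return vocab, priors, cond


def classify(doc, model):
    """Return the class with the highest log-posterior for a tokenized document."""
    vocab, priors, cond = model
    present = set(doc)
    best_label, best_score = None, -math.inf
    for c in priors:
        score = math.log(priors[c])
        for w in vocab:
            p = cond[c][w]
            # Each vocabulary word is a boolean attribute: present or absent.
            score += math.log(p if w in present else 1.0 - p)
        if score > best_score:
            best_label, best_score = c, score
    return best_label


# Hypothetical toy corpus.
docs = [
    ["cheap", "pills", "buy"],
    ["buy", "cheap", "watches"],
    ["meeting", "agenda", "notes"],
    ["project", "meeting", "schedule"],
]
labels = ["spam", "spam", "ham", "ham"]
model = train_bernoulli_nb(docs, labels)
prediction = classify(["cheap", "watches", "today"], model)
```

Words outside the training vocabulary (here "today") are simply ignored, which is one common convention rather than the only one.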

What Are Recommenders?
  An algorithm that looks at a user’s past actions and suggests
    Products
    Services
    People

Collaborative Filtering
  Collaborative filtering produces recommendations based on user preferences for items (“User Based”)
    Does not require knowledge of the specific properties of the items
  In contrast, content-based recommendation produces recommendations based on detailed knowledge of the properties of items (“Item based”)
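
A user-based collaborative filter of the kind described above can be sketched in a few lines. The ratings, user names, and the choice of cosine similarity over co-rated items are assumptions for illustration; Mahout's Taste engine offers several similarity measures and is not reproduced here.

```python
import math

# Hypothetical user -> item -> rating data.
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 4.0, "B": 3.0, "C": 5.0, "D": 5.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 1.0},
}


def cosine_similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict(user, item, ratings):
    """Similarity-weighted average of other users' ratings for the item."""
    numerator = denominator = 0.0
    for other, prefs in ratings.items():
        if other == user or item not in prefs:
            continue
        sim = cosine_similarity(ratings[user], prefs)
        numerator += sim * prefs[item]
        denominator += abs(sim)
    return numerator / denominator if denominator else None


# Alice has not rated item D; estimate it from Bob and Carol.
predicted = predict("alice", "D", ratings)
```

Note that nothing in the code looks at what items A through D actually are, which is the point of the slide: only the preference data is used.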

Clustering: Topic Modeling
  Cluster words across docs to identify topics
  Latent Dirichlet Allocation

What is time series data?
  Time series data is defined as a sequence of data points measured typically at successive times spaced at uniform time intervals
  Examples in finance
    daily adjusted close price of a stock at the NYSE
  Example in Sensors / Signal Processing / Smart Grid
    sensor readings on a power grid occurring 30 times a second
  For more reference on time series data
    http://www.cloudera.com/blog/2011/03/simple-moving-average-secondary-sort-and-mapreduce-part-1/
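
The blog post linked above covers a simple moving average over time series data. As a rough single-machine illustration of that calculation (not the secondary-sort MapReduce version from the post), a sliding-window average over uniformly spaced points might look like this; the function name and the closing prices are made up for the example.

```python
from collections import deque


def simple_moving_average(values, window):
    """Average of each run of `window` consecutive, uniformly spaced points."""
    if window <= 0:
        raise ValueError("window must be positive")
    buf = deque(maxlen=window)
    running = 0.0
    averages = []
    for v in values:
        if len(buf) == window:
            running -= buf[0]      # the oldest point is about to fall out of the window
        buf.append(v)
        running += v
        if len(buf) == window:
            averages.append(running / window)
    return averages


# Hypothetical daily adjusted close prices.
closes = [10.0, 11.0, 12.0, 13.0, 14.0]
sma = simple_moving_average(closes, 3)
```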

NERC Sensor Data Collection
  openPDC PMU Data Collection circa 2009
  120 Sensors
  30 samples/second
  4.3B Samples/day
  Housed in Hadoop

Story Time: Keogh, SAX, and the openPDC
  NERC wanted high res smart grid data tracked
  Started openPDC project @ TVA
    http://openpdc.codeplex.com/
  We used Hadoop to store and process time series data
    https://openpdc.svn.codeplex.com/svn/Hadoop/Current%20Version/
  Needed to find “unbounded oscillations”
    Time series unwieldy to work with at scale
  We found “SAX” by Keogh and his folks for dealing with time series
  Copyright 2011 Cloudera Inc. All rights reserved
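
For readers unfamiliar with SAX (Symbolic Aggregate approXimation), the basic idea is to z-normalize a series, average it down to a few segments (PAA), and map each segment to a letter using breakpoints that split a standard normal distribution into equal-probability regions. The sketch below is a plain-Python illustration under those assumptions; it is not the openPDC or Lumberyard code, and the function names and sample readings are hypothetical.

```python
import bisect
import statistics


def znormalize(series):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0:
        return [0.0] * len(series)
    return [(x - mean) / std for x in series]


def paa(series, segments):
    """Piecewise Aggregate Approximation: mean of each equal-length segment."""
    if len(series) % segments != 0:
        raise ValueError("this sketch needs len(series) divisible by segments")
    size = len(series) // segments
    return [statistics.fmean(series[i * size:(i + 1) * size]) for i in range(segments)]


def sax_word(series, segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word over the given alphabet."""
    k = len(alphabet)
    # Breakpoints cut the standard normal into k equal-probability regions.
    breakpoints = [statistics.NormalDist().inv_cdf(i / k) for i in range(1, k)]
    reduced = paa(znormalize(series), segments)
    return "".join(alphabet[bisect.bisect_right(breakpoints, v)] for v in reduced)


# Hypothetical sensor readings: a steady ramp.
readings = [1, 2, 3, 4, 5, 6, 7, 8]
word = sax_word(readings, 4)
```

Eight readings collapse to a four-letter word, which is what makes large collections of time series easier to index and compare.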

What is Lumberyard?
  Lumberyard is time series iSAX indexing stored in HBase for persistent and scalable index storage
  It’s interesting for
    Indexing large amounts of time series data
    Low latency fuzzy pattern matching queries on time series data
  Lumberyard is open source and ASF 2.0 Licensed at Github:
    https://github.com/jpatanooga/Lumberyard/

Genome Data as Time Series
  A, C, G, and T
    Could be thought of as “1, 2, 3, and 4”!
  If we have sequence X, what is the “closest” subsequence in a genome that is most like it?
    Doesn’t have to be an exact match!
  Example: ATATATTATATA
  Useful in proteomics as well
  iSAX Indexing
    Lumberyard use case
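
The "closest subsequence" question can be illustrated with a brute-force scan: encode the bases as 1 to 4, slide the query along the genome, and keep the window with the smallest squared Euclidean distance. This is only a sketch of the problem statement; it does not use an iSAX index as Lumberyard does, and the genome string, query, and function names are hypothetical.

```python
import math

# Numeric encoding suggested on the slide.
ENCODING = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode(sequence):
    """Turn a base string into a numeric series."""
    return [ENCODING[base] for base in sequence]


def closest_subsequence(genome, query):
    """Return (offset, subsequence, distance) of the best-matching window."""
    g, q = encode(genome), encode(query)
    if not q or len(q) > len(g):
        raise ValueError("query must be non-empty and no longer than the genome")
    best_offset, best_dist = -1, math.inf
    for offset in range(len(g) - len(q) + 1):
        window = g[offset:offset + len(q)]
        dist = sum((a - b) ** 2 for a, b in zip(window, q))
        if dist < best_dist:
            best_offset, best_dist = offset, dist
    return best_offset, genome[best_offset:best_offset + len(query)], best_dist


# The query does not occur exactly anywhere in this made-up genome.
result = closest_subsequence("GATTACAGATTTCA", "GATTGCA")
```

One caveat of the 1 to 4 encoding is visible here: two windows each differ from the query by a single base, but the numeric distance ranks a T-for-G substitution as closer than an A-for-G one, which is an artifact of the encoding rather than a biological statement.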

Bioinformatics
  Applications in DNA Sequencing
  Shortest Superstring Problem (SSP)
    Take lots of reads from sequencing
    We want the “superstring” of all the reads
      We want a long string that “explains” all the reads we generated
      We want the shortest string possible
    NP-complete
  We can reduce SSP to the Traveling Salesman Problem
    Graph processing / algorithms now applicable
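
Because the exact problem is intractable at scale, a common heuristic is to repeatedly merge the two reads with the largest suffix/prefix overlap. The sketch below shows that greedy approximation, which is a swapped-in technique and not the TSP reduction named on the slide; the reads and function names are invented for illustration, and the result is not guaranteed to be the shortest possible superstring.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads):
    """Greedy approximation: merge the best-overlapping pair until one string remains."""
    # Drop duplicates, then reads that are already contained in another read.
    frags = list(dict.fromkeys(reads))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


# Hypothetical sequencing reads.
reads = ["ATTAGA", "TAGACC", "GACCAT", "CCATTG"]
superstring = greedy_superstring(reads)
```

The pairwise overlaps computed here are the edge weights of the overlap graph, which is where the graph-processing view on the slide comes from.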

Packages For Hadoop
  DataFu
    http://sna-projects.com/datafu/
    UDFs in Pig
    Used at LinkedIn in many off-line workflows for data-derived products
      “People You May Know”
      “Skills”
    Techniques
      PageRank
      Quantiles (median), variance, etc.
      Sessionization
      Convenience bag functions
      Convenience utility functions

Integration with Libs
  Mix MapReduce with Machine Learning Libs
    WEKA
    KXEN
    CPLEX
  Map side “groups data”
  Reduce side processes groups of data with Lib in parallel
  Involves tricks in getting K/V pairs into lib
    Pipes, tmp files, task cache dir, etc.
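
The map-side grouping and reduce-side library call described above can be mimicked in a single process. In this sketch the `shuffle` function stands in for what the Hadoop framework does between map and reduce, and Python's `statistics` module stands in for an external library such as WEKA; the record layout and function names are assumptions for illustration only.

```python
import statistics
from collections import defaultdict


def map_phase(records):
    """Map side: emit (key, value) pairs so that records are grouped by key."""
    for sensor_id, reading in records:
        yield sensor_id, reading


def shuffle(pairs):
    """Stand-in for the framework's shuffle: collect all values for each key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce side: hand one whole group to a library routine."""
    # statistics is only a placeholder for a real machine learning library.
    return key, (statistics.mean(values), statistics.pstdev(values))


# Hypothetical (sensor, reading) records.
records = [("s1", 1.0), ("s2", 10.0), ("s1", 3.0), ("s2", 14.0)]
results = dict(reduce_phase(k, v) for k, v in shuffle(map_phase(records)).items())
```

On a cluster each group would be processed by a separate reduce call, so the library runs on many groups in parallel even if the library itself is single-threaded.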

What Hadoop Is Not Good At in Data Mining
  Anything highly iterative
  Anything that is extremely CPU bound and not disk bound
  Algorithms that can’t be inherently parallelized
  Examples
    Stochastic Gradient Descent (SGD)
    Support Vector Machines (SVM)
  Doesn’t mean they aren’t great to use
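
To see why SGD is listed as highly iterative, consider a minimal logistic regression trained with it: every update reads the weights produced by the previous update, so the inner loop is a chain of dependent steps repeated over many passes. The sketch below is a toy one-feature example with made-up data, learning rate, and epoch count; it is not Mahout's SGD implementation.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_logistic(samples, epochs=200, lr=0.5, seed=0):
    """Fit y ~ sigmoid(w * x + b) with one weight update per sample."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(samples)
    for _ in range(epochs):            # many passes over the same data
        rng.shuffle(data)
        for x, y in data:
            error = sigmoid(w * x + b) - y
            # Each step depends on the w and b left by the previous step.
            w -= lr * error * x
            b -= lr * error
    return w, b


# Hypothetical one-dimensional, linearly separable data.
samples = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b = sgd_logistic(samples)
```

Expressed as MapReduce, each pass would be a separate job that rereads its input, which is the overhead the slide is pointing at.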

Deep Learning Intro - Georgia Tech - CSE6242 - March 2015Deep Learning Intro - Georgia Tech - CSE6242 - March 2015
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015
 
Vectorization - Georgia Tech - CSE6242 - March 2015
Vectorization - Georgia Tech - CSE6242 - March 2015Vectorization - Georgia Tech - CSE6242 - March 2015
Vectorization - Georgia Tech - CSE6242 - March 2015
 
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
 
Georgia Tech cse6242 - Intro to Deep Learning and DL4J
Georgia Tech cse6242 - Intro to Deep Learning and DL4JGeorgia Tech cse6242 - Intro to Deep Learning and DL4J
Georgia Tech cse6242 - Intro to Deep Learning and DL4J
 
Intro to Vectorization Concepts - GaTech cse6242
Intro to Vectorization Concepts - GaTech cse6242Intro to Vectorization Concepts - GaTech cse6242
Intro to Vectorization Concepts - GaTech cse6242
 
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on HadoopHadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
 
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARNMLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
 
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNHadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
 
Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2
 
Knitting boar - Toronto and Boston HUGs - Nov 2012
Knitting boar - Toronto and Boston HUGs - Nov 2012Knitting boar - Toronto and Boston HUGs - Nov 2012
Knitting boar - Toronto and Boston HUGs - Nov 2012
 

Machine Learning and Hadoop

  • 1. September 2011 – HUG – Atlanta, GA. Machine Learning With Hadoop. Josh Patterson | Sr Solution Architect
  • 2. Who is Josh Patterson? josh@cloudera.com. Master’s Thesis: self-organizing mesh networks. Published in IAAI-09: TinyTermite: A Secure Routing Algorithm. Conceived, built, and led Hadoop integration for openPDC project at Tennessee Valley Authority (TVA). Led team which designed classification techniques for time series and Map Reduce. Open source work at http://openpdc.codeplex.com and https://github.com/jpatanooga. Today: Sr. Solutions Architect at Cloudera
  • 3. Outline: Hadoop Today; Data Mining; Mahout and Friends; A Peek at the Road Ahead
  • 4. “After the refining process, one barrel of crude oil yielded more than 40% gasoline and only 3% kerosene, creating large quantities of waste gasoline for disposal.” --- Excerpt from the book “The American Gas Station”. Hadoop Today: The Oil Industry Circa 1900
  • 5. DNA Sequencing Trends: Cost of DNA Sequencing Falling Very Fast
  • 6. Unstructured Data Explosion. Complex, Unstructured; Relational. 2,500 exabytes of new information in 2012 with Internet as primary driver
  • 7. Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year. Obstacles to Leveraging Data. Copyright 2010 Cloudera Inc. All rights reserved. Data comes in many shapes and sizes: relational tuples, log files, semistructured textual data (e.g., e-mail)
  • 8. Sometimes makes the data unwieldy
  • 9. Customers are not creating schemas for all of their data
  • 10. Yet still may want to join data sets
  • 11. Customers are moving some of it to tape or cold storage, throwing it away because “it doesn’t fit”
  • 12. They are throwing data away because it's too expensive to hold
  • 13. Similar to the oil industry in 1900. A New Platform for an Evolving Landscape. Ability to look at true distribution of data: previously impossible due to scale. Lower cost of analysis: ad hoc analysis now more open and flexible. Speed @ Scale is the new Killer App: results that previously took 1 day to process can gain new value when created in 10 minutes. Greater flexibility: less restrictive than SQL-only systems.
  • 14. Data Mining. “How is it possible for a slow, tiny brain, whether biological or electronic, to perceive, understand, predict, and manipulate a world far larger and more complicated than itself?” --- Peter Norvig, “Artificial Intelligence: A Modern Approach”
  • 15. Basic Concepts. What is Data Mining? “the process of extracting patterns from data”. Why are we interested in Data Mining? Raw data essentially useless. Data is simply recorded facts. Information is the patterns underlying the data. We want to learn these patterns. Information is key.
  • 16. How does Machine Learning differ from Data Mining? Data Mining: extracting information from data; finds patterns in data. Machine Learning: algorithms for acquiring structural descriptions from data “examples”; process of learning “concepts”; “structural descriptions” represent patterns explicitly.
  • 17. Shades of Gray. Information Retrieval: information science, information architecture, cognitive psychology, linguistics, and statistics. Natural Language Processing: grounded in machine learning, especially statistical machine learning. Statistics: math and stuff. Machine Learning: considered a branch of artificial intelligence.
  • 18. Types of Machine Learning: Classification; Association; Clustering; Numeric Prediction (AKA “Regression”)
  • 19. Tools, Applications, and Mahout
  • 20. ML Focused on in Mahout. Classification: Naïve Bayes in Text Classification; Stochastic Gradient Descent (Logistic Regression); Random Forests. Recommendation: Collaborative Filtering, Taste Engine; Item to item. Clustering: K-means, Fuzzy K-means; (Latent) Dirichlet Process.
  • 21. Naïve Bayes and Text. Doc classification is an important domain in Machine Learning. Docs are characterized by the words that appear in them. One approach is to treat presence / absence of each word as a boolean attribute. Naïve Bayes is popular here: fast, accurate.
  • 22. What Are Recommenders? An algorithm that looks at a user’s past actions and suggests: Products; Services; People.
  • 23. Collaborative Filtering. Collaborative filtering produces recommendations based on user preferences for items (“User Based”); it does not require knowledge of the specific properties of the items. In contrast, content-based recommendation produces recommendations based on intimate knowledge of the properties of items (“Item based”).
  • 24. Clustering: Topic Modeling. Cluster words across docs to identify topics. Latent Dirichlet Allocation.
  • 25. What is time series data? Time series data is defined as a sequence of data points measured typically at successive times spaced at uniform time intervals. Example in finance: daily adjusted close price of a stock at the NYSE. Example in Sensors / Signal Processing / Smart Grid: sensor readings on a power grid occurring 30 times a second. For more reference on time series data: http://www.cloudera.com/blog/2011/03/simple-moving-average-secondary-sort-and-mapreduce-part-1/
  • 26. NERC Sensor Data Collection. openPDC PMU Data Collection circa 2009. 120 Sensors
  • 29. Housed in Hadoop. Story Time: Keogh, SAX, and the openPDC. NERC wanted high res smart grid data tracked. Started openPDC project @ TVA (http://openpdc.codeplex.com/). We used Hadoop to store and process time series data (https://openpdc.svn.codeplex.com/svn/Hadoop/Current%20Version/). Needed to find “unbounded oscillations”. Time series unwieldy to work with at scale. We found “SAX” by Keogh and his folks for dealing with time series. Copyright 2011 Cloudera Inc. All rights reserved
  • 30. What is Lumberyard? Lumberyard is time series iSAX indexing stored in HBase for persistent and scalable index storage. It’s interesting for: indexing large amounts of time series data; low latency fuzzy pattern matching queries on time series data. Lumberyard is open source and ASF 2.0 Licensed at Github: https://github.com/jpatanooga/Lumberyard/
  • 31. Genome Data as Time Series. A, C, G, and T could be thought of as “1, 2, 3, and 4”! If we have sequence X, what is the “closest” subsequence in a genome that is most like it? Doesn’t have to be an exact match! Example: ATATATTATATA. Useful in proteomics as well. iSAX Indexing. Lumberyard use case.
  • 32. Bioinformatics. Applications in DNA Sequencing. Shortest Superstring Problem (SSP): take lots of reads from sequencing; we want the “superstring” of all the reads; we want a long string that “explains” all the reads we generated; we want the shortest string possible; NP-complete. We can reduce SSP to the Traveling Salesman Problem. Graph processing / algorithms now applicable.
  • 33. Packages For Hadoop. DataFu (http://sna-projects.com/datafu/): UDFs in Pig; used at LinkedIn in many off-line workflows for data derived products such as “People You May Know” and “Skills”. Techniques: PageRank; Quantiles (median), variance, etc.; Sessionization; Convenience bag functions; Convenience utility functions.
  • 34. Integration with Libs. Mix MapReduce with Machine Learning Libs: WEKA; KXEN; CPLEX. Map side “groups data”. Reduce side processes groups of data with Lib in parallel. Involves tricks in getting K/V pairs into lib: pipes, tmp files, task cache dir, etc.
  • 35. What Hadoop Is Not Good At in Data Mining. Anything highly iterative. Anything that is extremely CPU bound and not disk bound. Algorithms that can’t be inherently parallelized. Examples: Stochastic Gradient Descent (SGD); Support Vector Machines (SVM). Doesn’t mean they aren't great to use.
  • 36. MRv2: A Peek at the Road Ahead
  • 37. MRv2. Not everything fits great in MapReduce; Mahout as evidence of this. Examples: Stochastic Gradient Descent (SGD); Support Vector Machines (SVM). As we build further into verticals our analysis needs will become more complicated. MRv2 gives us new options. CDH4 will be based on 0.23.x (or later). 0.23.0 doesn't include MRv1. (via Tom White) CDH4 will *only* include MRv2.
  • 38. Existing Parallel Frameworks. MapReduce: Java, Pig, Hive. Spark: Scala, hides complexity like Hive/Pig; runs on Hadoop, MRv2 already. Giraph: bulk-synchronous parallel model relative to graphs where vertices can send messages to other vertices during a given superstep. MPI: older parallel lib; includes primitives for data exchange, synchronization; standardized and portable. GraphLab: “graph parallel” vs MR’s “data parallel”; better at iterative style.
  • 39. Frameworks Currently in Dev – MRv2. Giraph (https://issues.apache.org/jira/browse/GIRAPH-13). Hama BSP plans to integrate with MRv2 (https://issues.apache.org/jira/browse/HAMA-431). MPI (https://issues.apache.org/jira/browse/MAPREDUCE-2911). Spark (https://github.com/mesos/spark-yarn). GraphLab: discussion in user-mahout.
  • 40. The Rise of the Meta Heuristic? We’re seeing a data deluge drive demand for new data products. MapReduce applications are still relatively new. Customers have gotten a taste of data products with Hadoop: they like it; they want more. MRv2 has the potential to open up a range of meta heuristics to the Hadoop sector: techniques like genetic algorithms that were previously considered “boutique”.
  • 41. The Shape of Things to Come. Pig, Hive, Scala, Java. Compiler to build workflows of { Data, Algorithm, Framework }. Algorithm Library: Mahout, SGD, SVM, Neural Networks. Framework Library: MPI, Spark, GraphLab, MapReduce. MRv2. HDFS for large streaming files. HBase for small low latency transactions.
  • 42. Questions? (Thanks!) Hadoop World 2011: you should go; talks are high quality; lots more Machine Learning talks. Developer class 10/10/2011 (http://www.eventbrite.com/event/1951335497): 10% discount with code atlhug
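
Slide 21 describes treating the presence or absence of each word as a boolean attribute and classifying with Naïve Bayes. The idea can be sketched as below. This is a minimal single-machine illustration, not Mahout's implementation: the function names `train` and `classify`, the toy training documents, and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import defaultdict


def train(docs):
    """Fit a Bernoulli Naive Bayes model from (label, text) pairs.

    Each word is a boolean attribute: present or absent in a document.
    """
    vocab = {w for _, text in docs for w in text.lower().split()}
    doc_counts = defaultdict(int)                     # documents per label
    present = defaultdict(lambda: defaultdict(int))   # per label: docs containing word
    for label, text in docs:
        doc_counts[label] += 1
        for w in set(text.lower().split()):
            present[label][w] += 1
    priors = {c: n / len(docs) for c, n in doc_counts.items()}
    # Add-one smoothing keeps every probability strictly between 0 and 1.
    word_probs = {
        c: {w: (present[c][w] + 1) / (doc_counts[c] + 2) for w in vocab}
        for c in doc_counts
    }
    return vocab, priors, word_probs


def classify(model, text):
    """Return the label with the highest log-posterior for the text."""
    vocab, priors, word_probs = model
    words = set(text.lower().split())
    best_label, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for w in vocab:
            p = word_probs[c][w]
            # Absent words count as evidence too (boolean attribute = False).
            score += math.log(p if w in words else 1.0 - p)
        if score > best_score:
            best_label, best_score = c, score
    return best_label


# Hypothetical toy corpus, for illustration only.
TRAINING_DOCS = [
    ("spam", "cheap pills buy now"),
    ("spam", "buy cheap watches"),
    ("ham", "meeting agenda for monday"),
    ("ham", "lunch on monday"),
]
model = train(TRAINING_DOCS)
print(classify(model, "buy cheap pills"))
print(classify(model, "monday meeting"))
```

The counting step in `train` is a sum over documents, which is why this style of classifier maps naturally onto a map (emit per-label word counts) and reduce (sum them) workflow.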

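Slide 31 asks: given a sequence X, what is the "closest" subsequence in a genome, where A, C, G, and T are treated as 1, 2, 3, and 4? A brute-force version of that query can be sketched as follows. It is not the iSAX index that Lumberyard uses; it scans every window and compares by squared Euclidean distance. The function names and the example strings are hypothetical, and the numeric encoding imposes an artificial ordering on the bases (A ends up "nearer" to C than to T), which is an assumption inherited from the slide's framing.

```python
ENCODING = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode(seq):
    """Map a DNA string to a numeric series using the 1-4 encoding."""
    return [ENCODING[base] for base in seq.upper()]


def closest_subsequence(genome, query):
    """Return (offset, subsequence, distance) of the best-matching window.

    Distance is the squared Euclidean distance between the encoded query
    and each equally long window of the encoded genome. Ties go to the
    earliest offset.
    """
    g, q = encode(genome), encode(query)
    if not q or len(q) > len(g):
        raise ValueError("query must be non-empty and no longer than genome")
    best_offset, best_dist = -1, float("inf")
    for offset in range(len(g) - len(q) + 1):
        window = g[offset:offset + len(q)]
        dist = sum((a - b) ** 2 for a, b in zip(window, q))
        if dist < best_dist:
            best_offset, best_dist = offset, dist
    return best_offset, genome[best_offset:best_offset + len(q)], best_dist


# The query differs from the window at offset 3 by one base (G vs C).
print(closest_subsequence("CCGATTACAGG", "ATTAGA"))  # (3, 'ATTACA', 1)
```

The scan costs time proportional to genome length times query length per query, which is the cost an index such as iSAX is meant to avoid at scale.
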
Editor's Notes

  1. Theme: they threw away a lot of valuable gas and oil just like we throw away data today
  2. But what if some constraints changed?
  3. Talk about changing market dynamics of storage cost. What if some of the previously held constraints changed? Enter Hadoop
  4. Examples of key information: selecting embryos based on 60 features. You may be asking “why aren't we talking about Mahout?” What we want to do here is look at the fundamentals that will underlie all of the systems, not just Mahout. Some of the wording may be different, but it’s the same
  5. ML: can be used to predict outcome in new situation; can be used to understand and explain how prediction is derived (may be even more important). Methods originate from artificial intelligence, statistics, and research on databases. DM: about the process. ML: about the algorithms. “Can machines really learn?” --- long discussion, but from some perspectives yes. Good philosophical talk over beers.
  6. Mention how different books lay out information in different formatting, or may not group techniques exactly the same. Lots of bleed over, from NLP, to IR, to ML
  7. SGD – online learning, non batch, not parallelizable, good performance
  8. “What do other people w/ similar tastes like?” “strength of associations”
  9. Let’s set the stage in the context of story, why we were looking at big data for time series.
  10. Ok, so how did we get to this point? Older SCADA systems take 1 data point per 2-4 seconds --- PMUs --- 30 times a sec, 120 PMUs, growing by 10x factor
  11. On Monday Steve from Google talked about working with genomic data --- genomic data is time series. Our take home demo actually works with a small bit of genomic data. Lots of chatter @ OSCON about genomics, I just sat in one today
  12. Check this against the Mahout impl
  13. Dryad, Ciel, HyperFlow, ASTERIX, Hyracks, HaLoop
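
Note 7 describes SGD as online, non-batch learning, and the slides pair it with logistic regression. The sketch below shows why it is awkward to parallelize in its plain form: the weights are updated after every single example, so each step depends on the one before it. This is a minimal pure-Python illustration under stated assumptions, not Mahout's SGD code; the function names, learning rate, epoch count, and toy data are all hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_logistic(examples, epochs=50, lr=0.1, seed=0):
    """Train logistic regression with per-example (online) updates.

    examples: list of (features, label) with label 0 or 1.
    Returns (weights, bias).
    """
    rng = random.Random(seed)
    n_features = len(examples[0][0])
    w = [0.0] * n_features
    b = 0.0
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit examples in a fresh random order each pass
        for x, y in data:
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            err = p - y  # gradient of the log loss with respect to the logit
            # Sequential dependency: this update uses the weights left
            # behind by the previous example.
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict(w, b, x):
    return 1 if sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) >= 0.5 else 0


# Hypothetical, linearly separable one-feature data: negative x is class 0.
DATA = [([-2.0], 0), ([-1.0], 0), ([1.0], 1), ([2.0], 1)]
weights, bias = sgd_logistic(DATA)
print([predict(weights, bias, x) for x, _ in DATA])
```

Each pass over the data here is cheap, but the loop over epochs is the "highly iterative" pattern that slide 35 lists as a poor fit for classic MapReduce.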