Apache MADlib AI/ML

This presentation gives an overview of the Apache MADlib AI/ML project. It explains Apache MADlib in terms of its functionality, architecture, and dependencies, and also gives an SQL example.

Links for further information and connecting

http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/

https://nz.linkedin.com/pub/mike-frampton/20/630/385

https://open-source-systems.blogspot.com/

Published in: Technology

  1. 1. What Is Apache MADlib? ● For scalable in-database analytics ● Open source Apache 2.0 license ● For machine learning in SQL ● At big data scale ● Offers graph, statistics, analytics, deep learning ● Provides data-parallel implementations ● For structured and unstructured data
  2. 2. MADlib Prerequisites ● Currently supported databases – PostgreSQL ● Requires the Python extension – Greenplum (distributed database) – Apache HAWQ (v1.12+) (distributed database) ● Requires the GNU M4 Unix macro processor ● Works with Python 2.6 and 2.7
  3. 3. MADlib Architecture
  4. 4. MADlib Architecture ● MADlib has three main layers ● Python driver functions – Main entry point from user input – Largely responsible for algorithm flow control – Validating input parameters – Executing SQL statements – Evaluating the results – Potentially looping to execute more SQL statements ● Until a convergence criterion has been met
  5. 5. MADlib Architecture ● MADlib has three main layers ● C++ implementation functions – C++ definitions of the core functions/aggregates ● Needed for particular algorithms – Implemented in C++ rather than Python ● For performance reasons
  6. 6. MADlib Architecture ● MADlib has three main layers ● C++ database abstraction layer – Provides a programming interface – Abstracts all the Postgres internal details – Provides support for different backend platforms – Focuses on the internal functionality ● Rather than the platform integration logic
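
The driver-function flow described in the architecture slides (validate parameters, execute SQL, evaluate the result, loop until convergence) can be sketched in a few lines. This is only an illustration of the pattern, not MADlib code: the function `fit_slope`, the table `points`, and the use of SQLite in place of PostgreSQL or Greenplum are all assumptions made for the example.

```python
import sqlite3


def fit_slope(conn, table, max_iter=100, tolerance=1e-9, step=0.1):
    """Hypothetical driver: fit y ~ w * x by gradient descent.

    The database evaluates one SQL aggregate per iteration; Python only
    controls the flow, as the driver layer in the slides is described.
    """
    # 1. Validate input parameters before touching the database.
    if max_iter <= 0 or tolerance <= 0 or step <= 0:
        raise ValueError("max_iter, tolerance and step must be positive")
    if not table.isidentifier():
        raise ValueError("table must be a plain identifier")

    w = 0.0
    for iteration in range(1, max_iter + 1):
        # 2. Execute SQL: the gradient is aggregated over all rows in-database.
        (gradient,) = conn.execute(
            f"SELECT AVG((? * x - y) * x) FROM {table}", (w,)
        ).fetchone()
        if gradient is None:
            raise ValueError("table is empty")

        # 3. Evaluate the result and decide whether another pass is needed.
        new_w = w - step * gradient
        if abs(new_w - w) < tolerance:
            return new_w, iteration
        w = new_w
    return w, max_iter


# Toy data following y = 2x, held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany("INSERT INTO points VALUES (?, ?)", [(1, 2), (2, 4), (3, 6)])

w, iterations = fit_slope(conn, "points")
print(round(w, 6), iterations)
```

The point of the design is that the heavy work (the aggregate over every row) stays inside the database, while the Python side holds only a single number between iterations.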
  7. 7. MADlib Data Types and Transformations ● Arrays and Matrices ● Encoding Categorical Variables ● Path ● Pivot ● Sessionize ● Stemming
  8. 8. MADlib Graph Functionality ● All Pairs Shortest Path ● Breadth-First Search ● HITS ● Measures ● PageRank ● Single Source Shortest Path ● Weakly Connected Components
  9. 9. MADlib Model Selection / Sampling ● Model Selection – Cross Validation – Prediction Metrics – Train-Test Split ● Sampling – Balanced Sampling – Stratified Sampling
  10. 10. MADlib Statistics / Supervised Learning ● Statistics – Descriptive Statistics – Inferential Statistics – Probability Functions ● Supervised Learning – Conditional Random Field – k-Nearest Neighbors – Neural Network – Regression Models – Support Vector Machines – Tree Methods
  11. 11. MADlib Time Series / Unsupervised Learning ● Time Series Analysis – ARIMA ● Unsupervised Learning – Association Rules – Clustering – Dimensionality Reduction – Topic Modelling
  12. 12. MADlib Utilities ● Columns to Vector ● Database Functions ● Linear Solvers ● Mini-Batch Preprocessor ● PMML Export ● Term Frequency ● Vector to Columns
  13. 13. MADlib Deep Learning Example SQL ● First define the model configurations to train ● Meaning either model architectures or hyperparameters ● Load them into a model selection table ● The combination of model architectures and hyperparameters ● Constitutes the model configurations to train ● In the picture there are three model configurations ● Represented by the three different purple shapes
  14. 14. MADlib Deep Learning Example SQL
  15. 15. MADlib Deep Learning Example SQL ● Once we have model combinations ● In the model selection table ● Call the fit function to train the models – In parallel. ● In the picture the three orange shapes ● Represent the three models that have been trained
  16. 16. MADlib Deep Learning Example SQL
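
The two steps in the deep learning slides (build a model selection table from architectures and hyperparameters, then fit every configuration in parallel) can be sketched conceptually as follows. In MADlib these steps are carried out by SQL functions inside the database; the Python below is not MADlib's API. The names `build_model_selection_table`, `fit` and `fit_all`, the architecture name, the hyperparameter values, and the use of threads are hypothetical stand-ins chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Invented inputs: one architecture and three hyperparameter settings,
# giving three configurations (the count mentioned in the slides).
architectures = ["example_cnn"]
hyperparameters = [
    {"learning_rate": 0.1},
    {"learning_rate": 0.01},
    {"learning_rate": 0.001},
]


def build_model_selection_table(architectures, hyperparameters):
    """Return one row per (architecture, hyperparameters) combination."""
    combinations = product(architectures, hyperparameters)
    return [
        {"config_id": i, "architecture": arch, "params": params}
        for i, (arch, params) in enumerate(combinations, start=1)
    ]


def fit(config):
    """Stand-in for training: returns a placeholder 'trained model' record."""
    return {"config_id": config["config_id"], "trained": True}


def fit_all(table, workers=4):
    """Train every configuration concurrently; results keep the table order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fit, table))


table = build_model_selection_table(architectures, hyperparameters)
models = fit_all(table)
print(len(table), len(models))
```

Each row of `table` corresponds to one model configuration, and each entry of `models` to one trained model.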
  17. 17. Available Books ● See “Big Data Made Easy” – Apress Jan 2015 ● See “Mastering Apache Spark” – Packt Oct 2015 ● See “Complete Guide to Open Source Big Data Stack” – Apress Jan 2018 ● Find the author on Amazon – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/ ● Connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020
  18. 18. Connect ● Feel free to connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020 ● See my open source blog at – open-source-systems.blogspot.com/ ● I am always interested in – New technology – Opportunities – Technology-based issues – Big data integration