What Is Apache MADlib?
● For scalable in-database analytics
● Open source Apache 2.0 license
● For machine learning in SQL
● At big data scale
● Offers graph, statistics, analytics, deep learning
● Provides data-parallel implementations
● For structured and unstructured data
MADlib Prerequisites
● Currently supported databases
– PostgreSQL
● Needs the Python extension (PL/Python) enabled
– Greenplum (distributed db)
– Apache HAWQ (v1.12+) (distributed db)
● Requires the GNU M4 Unix macro processor
● Works with Python 2.6 and 2.7
MADlib Architecture
● MADlib has three main layers
● Python driver functions
– Main entry point from user input
– Largely responsible for algorithm flow control
– Validating input parameters
– Executing SQL statements
– Evaluating the results
– Potentially looping to execute more SQL statements
● Until a convergence criterion has been met
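The driver-layer flow listed above (validate, execute SQL, evaluate, loop until convergence) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard-library sqlite3 module in place of PostgreSQL or Greenplum, and the function name `fit_mean_by_iteration`, the table `obs` and the toy "model" (an iteratively estimated mean) are hypothetical, not part of MADlib.

```python
import sqlite3


def fit_mean_by_iteration(conn, table, column, step=0.5, tol=1e-6, max_iter=100):
    """Hypothetical driver: estimate the mean of `column` iteratively,
    issuing one SQL aggregate query per iteration."""
    # 1. Validate input parameters before touching the data.
    if not (table.isidentifier() and column.isidentifier()):
        raise ValueError("table and column must be plain identifiers")
    if not 0 < step <= 1:
        raise ValueError("step must be in (0, 1]")

    estimate = 0.0
    for iteration in range(1, max_iter + 1):
        # 2. Execute a SQL statement; the database scans the data.
        (gradient,) = conn.execute(
            f"SELECT AVG({column} - ?) FROM {table}", (estimate,)
        ).fetchone()
        # 3. Evaluate the result and update the model state.
        estimate += step * gradient
        # 4. Stop once the convergence criterion is met, otherwise loop.
        if abs(gradient) < tol:
            break
    return estimate, iteration


# Usage with an in-memory database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL)")
conn.executemany("INSERT INTO obs VALUES (?)", [(1.0,), (2.0,), (3.0,), (4.0,)])
estimate, iterations = fit_mean_by_iteration(conn, "obs", "x")
```

The point of the pattern is that the Python code only orchestrates: each pass over the data happens inside the database, and the driver decides whether another pass is needed.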
MADlib Architecture
● MADlib has three main layers
● C++ implementation functions
– C++ definitions of the core functions/aggregates
● Needed for particular algorithms
– Implemented in C++ rather than Python
● For performance reasons
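To make "core functions/aggregates" more concrete: in PostgreSQL-family databases an aggregate is generally built from a transition function (fold one row into a state), an optional merge function (combine partial states from different data segments) and a final function. The sketch below mimics that shape in Python for a simple mean. All names are hypothetical and this is not MADlib's C++ code; it only shows why such aggregates lend themselves to data-parallel execution.

```python
from functools import reduce


def transition(state, value):
    """Fold one row into a running (count, total) state."""
    count, total = state
    return (count + 1, total + value)


def merge(left, right):
    """Combine the partial states computed on two data segments."""
    return (left[0] + right[0], left[1] + right[1])


def final(state):
    """Turn the merged state into the aggregate's result."""
    count, total = state
    return total / count if count else None


# Three lists stand in for rows stored on three database segments.
segments = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]

# Each segment computes its partial state independently ...
partials = [reduce(transition, rows, (0, 0.0)) for rows in segments]
# ... and the partial states are merged before the final step.
mean = final(reduce(merge, partials, (0, 0.0)))
```

Because `merge` is associative, the partial states can be combined in any grouping, which is what lets each segment work on its own rows without coordination.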
MADlib Architecture
● MADlib has three main layers
● C++ database abstraction layer
– Provides a programming interface
– Abstracts all the Postgres internal details
– Provides support for different back end platforms
– Focuses on the internal functionality
● Rather than the platform integration logic
MADlib Data Types and Transformations
● Arrays and Matrices
● Encoding Categorical Variables
● Path
● Pivot
● Sessionize
● Stemming
MADlib Graph Functionality
● All Pairs Shortest Path
● Breadth-First Search
● HITS
● Measures
● PageRank
● Single Source Shortest Path
● Weakly Connected Components
MADlib Model Selection / Sampling
● Model Selection
– Cross Validation
– Prediction Metrics
– Train-Test Split
● Sampling
– Balanced Sampling
– Stratified Sampling
MADlib Statistics / Supervised Learning
● Statistics
– Descriptive Statistics
– Inferential Statistics
– Probability Functions
● Supervised Learning
– Conditional Random Field
– k-Nearest Neighbors
– Neural Network
– Regression Models
– Support Vector Machines
– Tree Methods
MADlib Time Series / Unsupervised Learning
● Time Series Analysis
– ARIMA
● Unsupervised Learning
– Association Rules
– Clustering
– Dimensionality Reduction
– Topic Modelling
MADlib Utilities
● Columns to Vector
● Database Functions
● Linear Solvers
● Mini-Batch Preprocessor
● PMML Export
● Term Frequency
● Vector to Columns
MADlib Deep Learning Example SQL
● First define the model configurations to train
● Meaning either model architectures or hyperparameters
● Load them into a model selection table
● The combination of model architectures and hyperparameters
● Constitutes the model configurations to train
● In the picture there are three model configurations
● Represented by the three different purple shapes
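As a rough illustration of "the combination of model architectures and hyperparameters", the sketch below builds an in-memory stand-in for a model selection table by taking the cross product of architecture ids and parameter strings. The column names (`mst_key`, `model_id`, `compile_params`, `fit_params`) and all values are assumptions chosen for illustration; in MADlib itself the table is populated through SQL, not Python.

```python
from itertools import product

# Hypothetical architecture ids and hyperparameter strings.
model_arch_ids = [1, 2]
compile_params = [
    "optimizer='adam', loss='categorical_crossentropy'",
    "optimizer='sgd', loss='categorical_crossentropy'",
]
fit_params = ["batch_size=32, epochs=1"]

# One row per (architecture, compile params, fit params) combination.
model_selection_table = [
    {"mst_key": key, "model_id": arch, "compile_params": comp, "fit_params": fit}
    for key, (arch, comp, fit) in enumerate(
        product(model_arch_ids, compile_params, fit_params), start=1
    )
]

for row in model_selection_table:
    print(row["mst_key"], row["model_id"], row["compile_params"])
```

Each row is one model configuration to train, so two architectures and two sets of compile parameters yield four configurations here.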
MADlib Deep Learning Example SQL
● Once we have model combinations
● In the model selection table
● Call the fit function to train the models
– In parallel.
● In the picture the three orange shapes
● Represent the three models that have been trained
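The idea of one fit call training every configuration concurrently can be sketched as follows. This is a toy under stated assumptions: `train` and `fit_multiple` are hypothetical names, the "model" is a one-parameter linear fit rather than a neural network, and a thread pool merely stands in for the database segments that do the real parallel work.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy data for y = 2x; stands in for the training table.
DATA = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


def train(config):
    """Fit y = w * x by gradient descent using one configuration."""
    w = 0.0
    for _ in range(config["epochs"]):
        grad = sum(2 * (w * x - y) * x for x, y in DATA) / len(DATA)
        w -= config["learning_rate"] * grad
    loss = sum((w * x - y) ** 2 for x, y in DATA) / len(DATA)
    return {"mst_key": config["mst_key"], "weight": w, "loss": loss}


def fit_multiple(configs, max_workers=3):
    """Train every configuration concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(train, configs))


# Three hypothetical configurations, one per row of the selection table.
configs = [
    {"mst_key": 1, "learning_rate": 0.01, "epochs": 20},
    {"mst_key": 2, "learning_rate": 0.05, "epochs": 20},
    {"mst_key": 3, "learning_rate": 0.10, "epochs": 20},
]
models = fit_multiple(configs)
best = min(models, key=lambda m: m["loss"])
```

Training all configurations in one call makes it straightforward to compare them afterwards, for example by picking the row with the lowest loss as done with `best`.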
Available Books
● See “Big Data Made Easy”
– Apress Jan 2015
● See “Mastering Apache Spark”
– Packt Oct 2015
● See “Complete Guide to Open Source Big Data Stack”
– Apress Jan 2018
● Find the author on Amazon
– www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
● Connect on LinkedIn
– www.linkedin.com/in/mike-frampton-38563020
Connect
● Feel free to connect on LinkedIn
– www.linkedin.com/in/mike-frampton-38563020
● See my open source blog at
– open-source-systems.blogspot.com/
● I am always interested in
– New technology
– Opportunities
– Technology based issues
– Big data integration

Apache MADlib AI/ML
