This talk was recorded in London on October 30, 2018.
KNIME Analytics Platform is an easy-to-use and comprehensive open source data integration, analysis, and exploration platform, enabling data scientists to visually compose end-to-end data analysis workflows. The over 2,000 available modules ("nodes") cover each step of the analysis workflow, including blending heterogeneous data types, data transformation, wrangling and cleansing, advanced data visualization, and model training and deployment.
Many of these nodes are provided through open source integrations (why reinvent the wheel?). This provides seamless access to large open source projects such as Keras and TensorFlow for deep learning, Apache Spark for big data processing, Python and R for scripting, and more. These integrations can be used in combination with other KNIME nodes, meaning that data scientists can freely select from a vast variety of options when tackling an analysis problem.
The integration of H2O in KNIME offers an extensive set of nodes encapsulating the functionality of the H2O open source machine learning libraries, making it easy to use H2O algorithms from a KNIME workflow without touching any code. Each of the H2O nodes looks and feels just like a normal KNIME node, and the data scientist benefits from the high performance and proven quality of the H2O libraries during execution. For prototyping, these algorithms are executed locally; training and deployment can easily be scaled up using a Sparkling Water cluster.
In our talk we give a short introduction to KNIME Analytics Platform and then demonstrate how data scientists benefit from using KNIME Analytics Platform and H2O machine learning in combination, using a real-world analysis example.
Bio: Christian received a Master’s degree in Computer Science from the University of Konstanz. Having gained experience as a research software engineer at the University of Konstanz, where he developed frameworks and libraries in the fields of bioimage analysis and machine learning, Christian moved on to become a software engineer at KNIME. He now focuses on developing new functionalities and extensions for KNIME Analytics Platform. Some of his recent projects include deep learning integrations built upon Keras and Tensorflow, extensions for image analysis and active learning, and the integration of H2O Machine Learning and H2O Sparkling Water in KNIME Analytics Platform.
2. H2O Distributed Machine Learning Algorithms
Supervised Learning

Statistical Analysis
• Generalized Linear Models: binomial, Gaussian, gamma, Poisson, and Tweedie families
• Naïve Bayes

Ensembles
• Distributed Random Forest: classification or regression models
• Gradient Boosting Machine: produces an ensemble of decision trees with increasingly refined approximations

Deep Neural Networks
• Deep Learning: creates multi-layer feedforward neural networks, starting with an input layer followed by multiple layers of nonlinear transformations

Unsupervised Learning

Clustering
• K-means: partitions observations into k clusters/groups of the same spatial size; automatically detects the optimal k

Dimensionality Reduction
• Principal Component Analysis: linearly transforms correlated variables into independent components
• Generalized Low Rank Models: extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data

Anomaly Detection
• Autoencoders: find outliers using nonlinear dimensionality reduction via deep learning
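The K-means bullet above can be illustrated with a minimal plain-Python sketch. H2O's distributed implementation is far more sophisticated (and also estimates the optimal k automatically); the 1-D data, seed, and k=2 below are made up for illustration:

```python
import random

def kmeans(points, k, iters=20, seed=42):
    """Minimal 1-D k-means: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups, around 1 and around 10:
print(kmeans([0.9, 1.0, 1.1, 9.9, 10.0, 10.1], 2))
```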
3. Platforms with H2O Integration
H2O + KNIME talk at KNIME Summit, March 2018
This competition was published by a Japanese restaurant chain. They wanted to know the number of future visitors for their different stores. Let's see what kind of data they provided us to solve this problem.
This is the top-level workflow we used to solve the problem. It will guide us through the major steps, from reading in the data up to doing the prediction, and showcases the interaction of the KNIME-native and the H2O nodes.
Let's jump right into our data preparation.
Data preparation part of the workflow
We'll not discuss it in too much detail. In the end we get two datasets: the training set with information about the number of visitors (the target variable), which we will use to build our model in the next steps, and the test dataset without the number of visitors. These have to be predicted by our model and submitted to Kaggle later on.
We just did the data preparation; before we jump right into the modeling, we have to create a local H2O context and convert our KNIME table into an H2O frame.
This frame will be used to build our models
At the moment there are three H2O models implemented in KNIME that are capable of solving such a regression task: Random Forest, Generalized Linear Model, and Gradient Boosting Machine.
Let's have a look at one of those to see how we trained, optimized, and evaluated our models.
The actual learning of a model happens in one single node: the H2O Random Forest Learner takes the H2O frame with the training set and builds a model.
In the configuration dialog you specify the target variable you want to predict, here the number of visitors, and enter some model-specific parameters, e.g. the number of levels of a single tree and the number of trees in the forest.
Next we use the H2O Predictor to apply the just-created model and predict the visitors for our test set.
Afterwards the score of the model is computed with the H2O Regression Scorer. As performance measure we used the root mean squared logarithmic error, as this measure is also used on Kaggle to evaluate the final submissions.
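The root mean squared logarithmic error has a simple closed form; here is a plain-Python sketch (the function name and the sample numbers are illustrative, not from the workflow):

```python
import math

def rmsle(actual, predicted):
    """Root mean squared logarithmic error: RMSE computed on
    log(1 + x), so relative errors matter more than absolute ones."""
    n = len(actual)
    return math.sqrt(sum((math.log1p(p) - math.log1p(a)) ** 2
                         for a, p in zip(actual, predicted)) / n)

# A 10% over-prediction on a busy store costs roughly as much as a
# 10% under-prediction on a quiet one:
print(rmsle([100, 50], [110, 45]))
```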
To avoid overfitting we use the H2O cross-validation loop, which partitions the data and trains one model for each partition of the data.
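Conceptually, the cross-validation loop splits the rows into k folds and validates one model per fold on the data that model never saw during training. A minimal sketch of the partitioning (fold count and data below are made up; the KNIME loop nodes handle this for you):

```python
def kfold_splits(rows, k):
    """Yield (train, validation) partitions: each fold serves as the
    validation set exactly once; the remaining folds form the
    training set."""
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        yield train, folds[i]

for train, valid in kfold_splits(list(range(10)), 5):
    print(len(train), len(valid))
```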
With one machine learning algorithm, e.g. a random forest, you can solve different problems. With its parameters, e.g. the number of trees and the tree depth for a random forest, one can adapt it to a specific problem with respect to the objective function. Here we are looking for parameters that minimize the error of our model validations.
We did this with a grid search that performs one iteration of the loop for every possible combination of parameters.
At the end of the loop we have a table with all parameter combinations and their respective scores.
We selected the parameters that led to the best result and trained a new model on the complete public dataset.
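The grid-search-and-select-best idea can be sketched as a plain loop over every parameter combination. Here `train_and_score` is a made-up stub standing in for the train/validate/score part of the workflow; real scores would come from the cross-validated RMSLE:

```python
from itertools import product

def grid_search(param_grid, train_and_score):
    """Evaluate every combination of parameter values and collect
    (params, score) rows, like the table at the end of the loop."""
    names = sorted(param_grid)
    results = []
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        results.append((params, train_and_score(params)))
    return results

# Toy stand-in for "train a forest and return its validation error":
def fake_score(p):
    return 1.0 / (p["ntrees"] * p["max_depth"])

grid = {"ntrees": [50, 100], "max_depth": [5, 10]}
table = grid_search(grid, fake_score)
best = min(table, key=lambda row: row[1])  # lowest error wins
print(best)
```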
As you can see, we have a nested loop here. Luckily the new H2O nodes are really fast, so this is not going to be a performance issue.
The steps I just showed you happen in all three nodes.
Afterwards we select the model which scored best and convert it into an H2O MOJO, a model object that is optimized to be embedded in any Java environment. By doing this we are able to use our just-created model outside of an H2O context. For example, we can do our prediction for the submission dataset from Kaggle. Or we can deploy it wherever we want, so we just stored it somewhere for Christian. Let's see what he is doing with it.