
Gabriele Nocco - Massive distributed processing with H2O - Codemotion Milan 2017

H2O is the most interesting distributed data-analysis platform of recent years, named by Gartner as a Visionary in Data Science. It runs massively parallel computations and connects easily to essential platforms for distributed computing and predictive analytics, including Spark, TensorFlow and MXNet. In this talk we will get to know the architecture of this library, its features and its potential, through interactive demos with Spark, TensorFlow and some particularly popular algorithms.


  1. Massive distributed processing with H2O
     Codemotion, Milan, 10 November 2017
     Gabriele Nocco, Senior Data Scientist
  2. AGENDA
     ● H2O Introduction
     ● GBM
     ● Demo
  4. H2O INTRODUCTION
     H2O is an open-source, in-memory Machine Learning engine. Java-based, it exposes convenient APIs in Java, Scala, Python and R. It also has a notebook-like user interface called Flow. This breadth of language support opens the framework to many different professional roles, from analysts to programmers, up to more “academic” data scientists. H2O can therefore serve as a complete infrastructure, from the prototype model to the engineered solution.
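A minimal sketch of what the Python API looks like (the file path is a placeholder for illustration, not from the slides):

```python
import h2o

# Start a local H2O cluster, or attach to one that is already running.
h2o.init()

# Load a CSV into H2O's distributed in-memory store as an H2OFrame.
frame = h2o.import_file("path/to/data.csv")

# Summary statistics, computed in parallel on the cluster.
frame.describe()
```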
  5. H2O INTRODUCTION - GARTNER
     In 2017, H2O.ai became a Visionary in the Magic Quadrant for Data Science Platforms:
     STRENGTHS: ● Market awareness ● Customer satisfaction ● Flexibility and scalability
     CAUTIONS: ● Data access and preparation ● High technical bar for use ● Visualization and data exploration ● Sales execution
     https://www.gartner.com/doc/reprints?id=1-3TKPVG1&ct=170215&st=sb
  6. H2O INTRODUCTION - FEATURES
     ● H2O Eco-System Benefits:
       ○ Scalable to massive datasets on large clusters, fully parallelized
       ○ Low-latency Java (“POJO”) scoring code is auto-generated
       ○ Easy to deploy on a laptop, server, Hadoop cluster, Spark cluster or HPC
       ○ APIs include R, Python, Flow, Scala, Java, JavaScript, REST
     ● Regularization techniques: Dropout, L1/L2
     ● Early stopping, N-fold cross-validation, grid search (see the sketch below)
     ● Handling of categorical, missing and sparse data
     ● Gaussian/Laplace/Poisson/Gamma/Tweedie regression with offsets, observation weights, various loss functions
     ● Unsupervised mode for nonlinear dimensionality reduction, outlier detection
     ● File types supported: CSV, ORC, SVMLight, ARFF, XLS, XLSX, Avro, Parquet
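As an example of the cross-validation and grid-search bullets, this is roughly what the Python API offers (the `frame` and its binary `label` column are assumptions for illustration):

```python
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

# Hyperparameter grid to explore; values here are illustrative.
hyper_params = {"max_depth": [3, 5, 7], "learn_rate": [0.05, 0.1]}

grid = H2OGridSearch(
    model=H2OGradientBoostingEstimator(nfolds=5, seed=42),  # 5-fold CV per model
    hyper_params=hyper_params,
)
grid.train(y="label", training_frame=frame)

# Rank the trained models by cross-validated AUC and keep the best one.
best = grid.get_grid(sort_by="auc", decreasing=True).models[0]
```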
  7. H2O INTRODUCTION - ALGORITHMS
  8. H2O INTRODUCTION - ENSEMBLES
     In statistics and machine learning, ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models. If your set of base learners does not contain the true prediction function, ensembles can give a good approximation of that function, and they typically perform better than the individual base algorithms. You can either build an ensemble of weak learners or combine the predictions from multiple models (generalized model stacking).
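A hedged sketch of the stacking route using H2O's stacked ensembles (`frame` and the binary `label` column are placeholders; base models must share the same folds and keep their cross-validation predictions):

```python
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

# Identical fold setup, so the metalearner can be trained on the
# out-of-fold predictions of every base model.
common = dict(nfolds=5, fold_assignment="Modulo",
              keep_cross_validation_predictions=True, seed=1)

gbm = H2OGradientBoostingEstimator(**common)
gbm.train(y="label", training_frame=frame)

rf = H2ORandomForestEstimator(**common)
rf.train(y="label", training_frame=frame)

# The stacked ensemble learns how to combine the base models' predictions.
stack = H2OStackedEnsembleEstimator(base_models=[gbm.model_id, rf.model_id])
stack.train(y="label", training_frame=frame)
```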
  9. H2O INTRODUCTION - DRIVERLESS AI
     At the research level, machine learning problems are complex and unpredictable, but the reality is that many companies today use machine learning for relatively predictable problems. Driverless AI is the latest product from H2O.ai, aimed at lowering the barrier to making data science work in a corporate context.
  10. H2O INTRODUCTION - ARCHITECTURE
  11. H2O INTRODUCTION - ARCHITECTURE
  12. H2O INTRODUCTION - H2O + TENSORFLOW
     H2O can develop Deep Neural Networks natively, or through integration with TensorFlow. It is now possible to build very deep networks (5 to 1000 layers!) and to handle huge amounts of data, on the order of GBs or TBs. Another great advantage is the ability to exploit GPUs to perform the computations.
  13. H2O INTRODUCTION - H2O + TENSORFLOW
     With the release of TensorFlow, H2O embraced the wave of enthusiasm around the growth of Deep Learning. Thanks to Deep Water, H2O lets us interact in a direct and simple way with Deep Learning tools like TensorFlow, MXNet and Caffe.
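A hedged sketch of the Deep Water API as it stood around 2017 (Deep Water was an experimental package, so exact availability depends on the build; `frame` and its categorical `label` column are placeholders):

```python
from h2o.estimators.deepwater import H2ODeepWaterEstimator

dw = H2ODeepWaterEstimator(
    backend="tensorflow",  # alternatives: "mxnet", "caffe"
    network="lenet",       # a predefined convolutional architecture
    epochs=10,
)
dw.train(y="label", training_frame=frame)
```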
  14. H2O INTRODUCTION - ARCHITECTURE
  15. H2O INTRODUCTION - H2O + SPARK
     One of the first plugins developed for H2O was the one for Apache Spark, named Sparkling Water. Binding to a rising open-source project such as Spark, with the computing power that distributed processing allows, has been a great driving force for the growth of H2O.
  16. H2O INTRODUCTION - H2O + SPARK
     A Sparkling Water application runs as a job that can be started with spark-submit. The Spark Master then produces the DAG and divides the execution among the Workers, each of which loads the H2O libraries into its Java process.
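A hedged sketch of what this looks like from PySparkling, the Python flavor of Sparkling Water (paths are placeholders; the script would typically be launched via spark-submit with the Sparkling Water assembly on the classpath):

```python
from pyspark.sql import SparkSession
from pysparkling import H2OContext

spark = SparkSession.builder.appName("sparkling-water-demo").getOrCreate()

# Starts the H2O services inside the Spark executors' JVMs.
hc = H2OContext.getOrCreate(spark)

# Spark DataFrames and H2OFrames convert back and forth.
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)
h2o_frame = hc.as_h2o_frame(df)
```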
  17. H2O INTRODUCTION - H2O + SPARK
     The Sparkling Water solution is, of course, certified for all the major Spark distributions: Hortonworks, Cloudera, MapR. Databricks provides a Spark cluster in the cloud, and H2O works perfectly in this environment: “H2O Rains with Databricks Cloud!”
  18. AGENDA
     ● H2O Introduction
     ● GBM
     ● Demo
  19. GBM: Gradient Boosting Machine
     Gradient Boosting Machine is one of the most powerful techniques for building predictive models. It can be applied to classification or regression, so it is a supervised algorithm. It is one of the most widespread algorithms in the Kaggle community, performing better than SVMs, Decision Trees and Neural Networks in a large number of cases. http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/ GBM can be an optimal solution when the size of the dataset or the available computing power does not allow training a Deep Neural Network.
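A minimal GBM training sketch in the H2O Python API (the `frame` and its binary `label` column are illustrative assumptions):

```python
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Hold out 20% of the data for validation.
train, valid = frame.split_frame(ratios=[0.8], seed=1)

gbm = H2OGradientBoostingEstimator(ntrees=100, max_depth=5, learn_rate=0.1)
gbm.train(y="label", training_frame=train, validation_frame=valid)

# AUC on the validation frame (meaningful for a binary response).
print(gbm.auc(valid=True))
```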
  20. GBM - KAGGLE
     Kaggle is the biggest platform for Machine Learning contests in the world. https://www.kaggle.com/ In early March 2017, Google announced the acquisition of the Kaggle community.
  21. GBM - GRADIENT BOOSTING: How Gradient Boosting Works
     Summarizing, GBM requires specifying three different components:
     ● The loss function with respect to the new weak learners.
     ● The specific form of the weak learner (e.g., short decision trees).
     ● A technique for combining weak learners additively so as to minimize the loss function.
  22. GBM - GRADIENT BOOSTING: Loss Function
     The loss function determines the behavior of the algorithm. The only requirement is differentiability, in order to allow gradient descent on it. Although you can define arbitrary losses, in practice only a handful are used: for example, regression may use a squared error and classification may use a logarithmic loss.
  23. GBM - GRADIENT BOOSTING: Weak Learner
     In H2O, the weak learners are implemented as decision trees. To allow their outputs to be added together, regression trees (which output real values) are used. When building each decision tree, the algorithm iteratively selects the split point that minimizes the loss. The depth of the trees can be increased to handle more complex problems; conversely, to limit overfitting we can constrain the topology of the trees, e.g. by limiting the depth, the number of splits, or the number of leaf nodes.
  24. GBM - GRADIENT BOOSTING: Additive Model
     In a GBM with squared loss, the resulting algorithm is extremely simple: at each step we train a new tree on the “residual errors” with respect to the previous weak learners. This can be seen as a gradient descent step with respect to our loss, where all previous weak learners are kept fixed and the gradient is approximated (formally, it is optimization in a functional space). This generalizes easily to different losses.
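In standard gradient-boosting notation (added here for illustration, not from the slides): at stage m the ensemble F_{m-1} is kept fixed, a regression tree h_m is fit to the pseudo-residuals, and it is added with a shrinkage factor ν:

```latex
r_{im} = -\left[\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F = F_{m-1}},
\qquad
h_m = \arg\min_{h} \sum_{i=1}^{n} \big(r_{im} - h(x_i)\big)^2,
\qquad
F_m(x) = F_{m-1}(x) + \nu\, h_m(x).
```

With squared loss L(y, F) = ½(y − F)², the pseudo-residuals reduce to r_im = y_i − F_{m-1}(x_i): exactly “training a new tree on the residual errors” as described above.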
  25. GBM - GRADIENT BOOSTING: Output and Stop Condition
     The output of the new tree is then added to the output of the existing sequence of trees in an effort to correct or improve the final output of the model. In particular, a different weighting parameter is associated with each decision region of the newly constructed tree. A fixed number of trees is added, or training stops once the loss reaches an acceptable level or no longer improves on an external validation dataset.
  26. GBM - GRADIENT BOOSTING: Improvements to Basic Gradient Boosting
     Gradient boosting is a greedy algorithm and can overfit a training dataset quickly. It benefits from regularization methods that penalize various parts of the algorithm, generally improving its performance by reducing overfitting. There are four enhancements to basic gradient boosting (see the sketch below):
     ● Tree Constraints
     ● Learning Rate
     ● Stochastic Gradient Boosting
     ● Penalized Learning (L1 or L2 regularization of the regression trees' output)
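A hedged sketch of how these enhancements map onto H2O GBM parameters (names are from the H2O Python API; values are illustrative, and the fourth enhancement, L1/L2 penalization of the tree outputs as popularized by XGBoost, is not shown here):

```python
from h2o.estimators.gbm import H2OGradientBoostingEstimator

gbm = H2OGradientBoostingEstimator(
    max_depth=4,            # tree constraints: shallow trees
    min_rows=10,            # tree constraints: min observations per leaf
    learn_rate=0.05,        # learning rate (shrinkage) on each tree's contribution
    sample_rate=0.8,        # stochastic gradient boosting: row subsampling
    col_sample_rate=0.8,    # stochastic gradient boosting: column subsampling
    ntrees=500,
    stopping_rounds=5,      # stop early when the validation metric plateaus
    stopping_metric="logloss",
)
gbm.train(y="label", training_frame=train, validation_frame=valid)
```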
  27. AGENDA
     ● H2O Introduction
     ● GBM
     ● Demo
  28. Q&A
  29. mail: gabriele.nocco@gmail.com
     meetup: https://www.meetup.com/it-IT/Machine-Learning-Data-Science-Meetup/
     IAML - Italian Association for Machine Learning: https://www.iaml.it/
