In this talk, I will give you an overview of our company (H2O.ai), our open-source machine learning platform (H2O), as well as our new projects (e.g. Deep Water and Steam). This will be useful for attendees who are not familiar with H2O.
Introduction to Distributed Computing Engines for Data Processing - Simone Ro... (Data Science Milan)
This document provides an introduction to distributed computing engines for data processing. It discusses what distributed computing systems are and how they address the problem of data and tasks being too large for a single machine. It then covers key distributed computing systems like Hadoop, Spark and Flink. For each system, it summarizes what it is, when and where it originated, why it was created, and how it works at a high level. It also provides brief examples of common use cases for each system today.
Project “Deep Water” (H2O integration with other deep learning libraries) - Jo... (Data Science Milan)
The “Deep Water” project is about integrating our H2O platform with other open-source deep learning libraries such as TensorFlow, mxnet and Caffe. I will talk about the motivation and potential benefits of this project and then carry out a live demo using mxnet as the GPU backend.
This document provides an overview of Think Big Analytics, an analytics consulting firm. It discusses their services portfolio including data engineering, data science, analytics operations and managed services. It also highlights their global delivery model and successful projects with over 100 clients. The document then discusses their approach to artificial intelligence and deep learning, including applications across industries like banking, connected cars, and automated check processing. It emphasizes the need for a phased implementation approach to AI and challenges around technology, data, and deployment.
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2O - Sri Ambati
Arno Candel introduces Deep Water, which brings Tensorflow, Caffe, Mxnet to H2O. It also brings support for GPUs, image classification, NLP and much more to H2O.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Love & Innovative technology presented by a technology pioneer and an AI expe... - Romeo Kienzler
This document discusses the rise of connected devices and machine learning. It notes that the number of connected devices is expected to grow from 15 billion in 2015 to 40 billion in 2020. It then covers various machine learning techniques including machine learning on historic data, online learning, neural networks, convolutional neural networks, recurrent neural networks, LSTM networks, and IBM's TrueNorth neural network chip. The document argues that neural networks can learn mathematical functions and algorithms and have outperformed traditional methods for problems like anomaly detection. However, neural networks are also computationally complex.
Deep Learning and Advanced Machine Learning on IoT - Romeo Kienzler
This document discusses advances in machine learning and deep learning on IoT devices. It notes that the number of connected devices is growing rapidly and will reach 40 billion by 2020. It then covers different types of machine learning approaches like online learning vs learning from historic data. It also demonstrates several deep learning techniques including neural networks, convolutional neural networks, LSTMs, and autoencoders. Finally, it discusses challenges like computational complexity and potential solutions like IBM's TrueNorth neuromorphic chip.
Dmitry will show the audience how to get started with Mxnet and how to build Deep Learning models to classify images, sound and text.
Towards the Cytoscape Cyberinfrastructure - Keiichiro Ono
1) The document discusses ongoing projects at the Cytoscape Core Developer Team to integrate Cytoscape into larger computational workflows by sharing data and computing resources over the network.
2) This will utilize standard tools like RStudio and IPython Notebook as primary workbenches for advanced users.
3) One project is a simple web application called CyNetShare that allows sharing of network visualization using Cytoscape.js in a web browser.
Some "challenges" on the open-source/open-data front - Greg Landrum
The document discusses challenges with chemical data interoperability and proposes some solutions. It notes that different software tools produce inconsistent results for chemical descriptors and structure representations. It suggests standardizing an open-source cheminformatics toolkit and defining open formats for common file types like SMILES, to improve reproducibility. It also proposes developing new open standards for representing complex molecules like organometallics containing metals.
cyREST provides platform-independent access to Cytoscape's data models and functions via REST. This allows different tools like RStudio, IPython notebooks, command line utilities, and web apps to interact with Cytoscape. The goal is for all bioinformatics tools to work seamlessly together. cyREST demonstrates controlling Cytoscape from an IPython notebook to enable interactive data analysis across environments and computing resources.
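The REST interaction described above can be sketched with Python's standard library. Port 1234 and the /v1 prefix are cyREST's documented defaults; the helper name is invented, and actually issuing the request of course requires a running Cytoscape instance with cyREST installed:

```python
import json
import urllib.request

# cyREST serves Cytoscape's data model over HTTP on port 1234 by default;
# adjust the base URL if your instance is configured differently.
BASE = "http://localhost:1234/v1"

def list_network_ids(base=BASE):
    """Return the SUIDs of all networks in the running Cytoscape session."""
    with urllib.request.urlopen(base + "/networks") as resp:
        return json.loads(resp.read().decode("utf-8"))

endpoint = BASE + "/networks"
print(endpoint)  # http://localhost:1234/v1/networks
```

Any HTTP-capable client (R, a notebook, curl) can hit the same endpoints, which is what makes the workbench-agnostic workflows above possible.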
Congresso Sociedade Brasileira de Computação CSBC2016 Porto Alegre (Brazil)
Workshop on Cloud Networks & Cloudscape Brazil
Rodolfo Azevedo - Associate professor at University of Campinas, Brazil
Interdisciplinary Research for Cloud Computing: Future and challenges
Overview of Modern Graph Analysis Tools - Keiichiro Ono
This document discusses modern tools for graph analysis and making graph workflows reproducible. It introduces cyREST, a RESTful API for programmatic access to Cytoscape, and language-specific wrappers like RCy3 and py2cytoscape that provide natural APIs. These tools allow running Cytoscape workflows in notebooks and on remote machines. It also covers graph libraries for analysis like NetworkX, igraph, graph-tool, and PGX for smaller graphs, and distributed frameworks like GraphX, GraphLab Create, and Neo4j for extremely large graphs with billions of nodes. The document recommends not using NetworkX for large data and considering cloud-based options for tools that are difficult to install.
Ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. Practitioners may prefer ensemble algorithms when model performance is valued above other factors such as model complexity and training time. The Super Learner algorithm, also called "stacking", learns the optimal combination of the base learner fits. The latest version of H2O now contains a "Stacked Ensemble" method, which allows the user to stack H2O models into a Super Learner. The Stacked Ensemble method is the native H2O version of stacking, previously only available in the h2oEnsemble R package, and now enables stacking from all the H2O APIs: Python, R, Scala, etc.
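The stacking idea can be illustrated without H2O itself. The toy sketch below (all function names invented) uses two trivial base learners, leave-one-out holdout predictions, and a grid search over convex weights as the metalearner; H2O's Stacked Ensemble performs this shape of computation at scale with cross-validation and a trainable metalearner:

```python
# Toy sketch of the Super Learner (stacking) idea: combine cross-validated
# predictions of base learners with a metalearner.

def fit_mean(xs, ys):                 # base learner 1: constant predictor
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):                 # base learner 2: 1-D least squares
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return lambda x: a + b * x

def cv_predictions(fit, xs, ys):      # leave-one-out holdout predictions
    preds = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i+1:], ys[:i] + ys[i+1:])
        preds.append(model(xs[i]))
    return preds

def super_learner(xs, ys):
    p1 = cv_predictions(fit_mean, xs, ys)
    p2 = cv_predictions(fit_line, xs, ys)
    # metalearner: pick the convex combination minimizing squared error
    best_w = min((w / 100 for w in range(101)),
                 key=lambda w: sum((w * a + (1 - w) * b - y) ** 2
                                   for a, b, y in zip(p1, p2, ys)))
    f1, f2 = fit_mean(xs, ys), fit_line(xs, ys)
    return lambda x: best_w * f1(x) + (1 - best_w) * f2(x)

xs, ys = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.0, 8.1]   # roughly y = 2x
model = super_learner(xs, ys)
print(model(5.0))
```

On this near-linear data the metalearner assigns almost all weight to the linear base learner, so the ensemble predicts close to 10 at x = 5.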
Erin is a Statistician and Machine Learning Scientist at H2O.ai. Before joining H2O, she was the Principal Data Scientist at Wise.io (acquired by GE Digital) and Marvin Mobile Security (acquired by Veracode) and the founder of DataScientific, Inc. Erin received her Ph.D. from University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing.
Presentation given at the Stockholm R useR Group (SRUG) meetup on Dec 6, 2016. Contains a general overview of deep learning, material on using TensorFlow in R, etc.
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG - Thamme Gowda
Presented at Machine Learning Reading Group (MLRG) at NASA Jet Propulsion Laboratory (JPL).
Data programming is helpful for creating large datasets; our application is in the Mars Target Encyclopedia (MTE) project.
This document provides an overview of predictive churn modeling using H2O and Sparkling Water. It discusses what predictive churn is and key performance measures like lift. It also introduces H2O as a machine learning platform, Apache Spark, and H2O Sparkling Water which integrates H2O with Spark. The document demonstrates building a predictive churn model on telco customer data using different approaches in H2O Flow, Spark Scala, and R. It discusses deploying a model via REST API, Docker, and H2O Steam.
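Lift, one of the performance measures mentioned, compares the churn rate among the customers the model scores highest against the overall churn rate. A minimal sketch, with invented scores and labels:

```python
# Lift at the top fraction: how much more likely are the customers with the
# highest predicted churn scores to churn than the average customer.

def lift_at(scores, labels, fraction=0.1):
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(lab for _, lab in ranked[:k]) / k     # churn rate in top k
    base_rate = sum(labels) / len(labels)                # overall churn rate
    return top_rate / base_rate

scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    0,    1,    0,    0,    0]
print(lift_at(scores, labels, fraction=0.2))  # → 2.5
```

A lift of 2.5 at the top 20% means targeting that segment reaches churners 2.5 times more efficiently than contacting customers at random.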
Use of standards and related issues in predictive analytics - Paco Nathan
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
Presentation slides for the SDCSB Cytoscape Workshop on 5/19/2016. The presentation covers the current status of the Cytoscape project and an overview of the Cytoscape ecosystem. It briefly mentions the Cytoscape Cyberinfrastructure.
Speaker: Pierre Richemond, Data Science Institute of Imperial College
Title: Cutting edge generative models: Applications and implications
Abstract: This talk will examine recent developments in deep learning content generation at scale. Whether it be images or text, the latest methods have now reached a level of quality making it hard to discriminate between human- and AI-generated content. We will review recent examples of such generative models, and put their significance in a broader context, in light of such powerful tools’ potential for dual use.
Bio: Pierre is currently researching his PhD in deep reinforcement learning at the Data Science Institute of Imperial College. He also teaches Deep Learning at the Graduate School, and helps to run the Deep Learning Network and organises thematic reading groups. His background is in mathematics - he has studied electrical engineering at ENST, probability theory and stochastic processes at Universite Paris VI - Ecole Polytechnique, and business management at HEC.
Today, Google Cloud Platform (GCP) is one of the leaders among cloud APIs. Although it was established only five years ago, GCP has gained notable expansion due to its suite of public cloud services, built on a huge, solid infrastructure. GCP allows developers to use these services by accessing the GCP RESTful API, which is described through HTML pages on its website. However, the documentation of the GCP API is written in natural language (English prose) and therefore shows several drawbacks, such as Informal Heterogeneous Documentation, Imprecise Types, Implicit Attribute Metadata, Hidden Links, Redundancy and Lack of Visual Support. To avoid confusion and misunderstandings, cloud developers clearly need a precise specification of the knowledge and activities in GCP. Therefore, this paper introduces GCP MODEL, an inferred formal model-driven specification of GCP which describes without ambiguity the resources offered by GCP. GCP MODEL conforms to the Open Cloud Computing Interface (OCCI) metamodel and is implemented with the open-source, model-driven, Eclipse-based OCCIWARE tool chain. Thanks to our GCP MODEL, we offer corrections to the drawbacks we identified.
SDCSB Advanced Tutorial: Reproducible Data Visualization Workflow with Cytosc... - Keiichiro Ono
This document provides an overview of a tutorial on building reproducible network data visualization workflows using Cytoscape and IPython Notebook. The tutorial will cover integrating data, analyzing networks, visualizing results, and preparing outputs for publication. It will demonstrate setting up a portable data analysis environment using Docker and sharing work through GitHub. The bulk of the tutorial will focus on using IPython Notebook as an electronic lab notebook for interactive and reproducible experiments with Cytoscape.
SDCSB Cytoscape and Network Analysis Workshop at Sanford Consortium - Keiichiro Ono
This document provides an overview and update on Cytoscape, an open source platform for biological network analysis and visualization. Key points discussed include:
- Cytoscape 3.2.1 is the latest desktop application release with new features like a chart editor and exporting visualizations as web applications.
- Cytoscape.js is a JavaScript library for building web applications that visualize networks, and there are examples of web apps built with it.
- Cytoscape's cyberinfrastructure initiative aims to make the software more accessible and integratable for computational biologists through services, apps, and repositories.
Cytoscape and External Data Analysis Tools - Keiichiro Ono
This document summarizes Keiichiro Ono's lab meeting presentation about developing a RESTful API for Cytoscape. The presentation covered the motivation for external tools to programmatically access Cytoscape, the design of a new Cytoscape module that exposes a RESTful API, and a proof-of-concept demo. The goal is to make Cytoscape more accessible for hardcore users to embed in automated workflows from languages like R and Python.
Inaugural talk Data Science Milan - Gianmario Spacagna
This document summarizes the inaugural talk for the Data Science Milan meetup group. It provides background on the speaker, Gianmario Spacagna, including his work experience and interests in machine learning systems, Scala, and the Professional Data Science Manifesto. It also gives an overview of the Data Science Milan meetup group, including its goals of promoting data-driven innovation and knowledge sharing among its members. Additionally, it outlines partnerships with Big Data consultancy startups and other meetup groups. Finally, it summarizes the results of an initial interests survey of group members.
Data intensive applications with Apache Flink - Simone Robutti, Radicalbit (Data Science Milan)
"Data intensive applications with Apache Flink" by Simone Robutti, Machine Learning Engineer @ Radicalbit
In the last 10 years, the IT industry has seen a complete revolution in the perceived value that computing has for businesses and in how engineers think about applications: in several application domains, the need for data has outgrown the capacity of commodity hardware, and the need for information has outpaced traditional processing technologies and approaches. In this talk we'll introduce Apache Flink, a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. It is an open-source project that builds on proven approaches as well as innovative algorithms. We will go in depth on how this tool can be used to implement data-intensive applications, in particular regarding present tools and future perspectives for using machine learning algorithms in a distributed context.
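The keyed, windowed aggregation at the heart of such dataflow engines can be sketched in plain Python. This toy version is single-threaded and ignores the distribution and fault tolerance that Flink actually provides; all names and the event data are illustrative:

```python
# Conceptual sketch of a keyed tumbling-window aggregation, the core
# dataflow pattern of engines like Flink (DataStream -> keyBy -> window
# -> sum). Events are (timestamp, key, value) tuples in time order.
from collections import defaultdict

def keyed_tumbling_window(events, window_size):
    """Yield one {key: running_sum} dict per closed time window."""
    window = defaultdict(float)
    window_end = None
    for ts, key, value in events:
        if window_end is None:
            window_end = ts + window_size
        while ts >= window_end:          # window closed: emit and advance
            yield dict(window)
            window.clear()
            window_end += window_size
        window[key] += value
    if window:                           # flush the final partial window
        yield dict(window)

stream = [(0, "a", 1), (1, "b", 2), (2, "a", 3), (5, "a", 1), (6, "b", 1)]
print(list(keyed_tumbling_window(stream, window_size=5)))
```

Running this groups the first three events into the [0, 5) window and the last two into [5, 10), which is the same shape of result a Flink keyed window sum would produce.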
Simone Robutti, 27, is a Machine Learning Engineer at Radicalbit. He earned a Master’s Degree at Università degli studi di Milano with a thesis on SVMs for noisy labeled datasets. From then on his interests shifted towards the engineering side of Machine Learning and Big Data: implementation, deployment, portability and maintainability of ML-intensive systems. Right now his focus at Radicalbit is Flink and its Machine Learning library, FlinkML.
The Barclays Data Science Hackathon: Building Retail Recommender Systems base... (Data Science Milan)
In the depths of the last cold, wet British winter, the Advanced Data Analytics team from Barclays escaped to a villa on Lanzarote, Canary Islands, for a one-week hackathon where they collaboratively developed a recommendation system on top of Apache Spark. The contest consisted of using Bristol customer shopping behaviour data to make personalised recommendations in a Kaggle-like competition, where each team's goal was to build an MVP and then repeatedly iterate on it using common interfaces defined by a purpose-built framework.
The talk will cover:
• How to rapidly prototype in Spark (via the native Scala API) on your laptop and magically scale to a production cluster without huge re-engineering effort.
• The benefits of doing type-safe ETLs representing data in hybrid, and possibly nested, structures like case classes.
• Enhanced collaboration and fair performance comparison by sharing ad-hoc APIs plugged into a common evaluation framework.
• The co-existence of machine learning models available in MLlib and domain-specific bespoke algorithms implemented from scratch.
• A showcase of different families of recommender models (business-to-business similarity, customer-to-customer similarity, matrix factorisation, random forest and ensembling techniques).
• How Scala (and functional programming) helped our cause.
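The type-safe ETL bullet above can be sketched outside Scala as well. Below is a minimal, hypothetical Python analogue (the talk used Spark with Scala case classes; `Product` and `Transaction` are invented record types used only to illustrate typed, nested ETL structures):

```python
from dataclasses import dataclass

# Invented, simplified records: the talk's Scala case classes played
# the same role of giving ETL steps a typed, possibly nested schema.
@dataclass(frozen=True)
class Product:
    category: str
    price: float

@dataclass(frozen=True)
class Transaction:
    customer_id: str
    products: list  # list of Product

def total_spend(tx: Transaction) -> float:
    # Typed access: a typo such as p.prise fails loudly instead of
    # silently producing nulls, which is the point of type-safe ETL.
    return sum(p.price for p in tx.products)

tx = Transaction("c42", [Product("food", 3.5), Product("books", 12.0)])
print(total_spend(tx))  # 15.5
```

The same shape scales from a laptop prototype to a cluster because the schema travels with the data rather than living in stringly-typed column names.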
Gianmario is a Senior Data Scientist at Pirelli Tyre, processing telemetry data for smart manufacturing and connected vehicles applications. His main expertise is on building production-oriented machine learning systems. Co-author of the Professional Manifesto for Data Science, he loves evangelising his passion for best practices and effective methodologies amongst the community. Prior to Pirelli, he worked in Financial Services (Barclays), Cyber Security (Cisco) and Predictive Marketing (AgilOne).
This document provides an introduction to H2O, an open source machine learning platform, and discusses potential Internet of Things (IoT) use cases for predictive maintenance and outlier detection. The document outlines Joe Chow's background and experience, provides an overview of H2O's capabilities including algorithms, interfaces, and exporting models for production. It then demonstrates how to use H2O for predictive maintenance on a dataset of sensor readings to predict equipment failures, and for outlier detection on the MNIST handwritten digits dataset to identify anomalous images.
Introduction to Machine Learning with H2O and Python (Jo-fai Chow)
This document provides an introduction and overview of machine learning with H2O and Python. It begins with background information about the presenter, Joe Chow, including his work experience and side projects. The agenda then outlines topics to be covered, including an introduction to H2O.ai (the company and the machine learning platform), followed by a Python tutorial and examples. The tutorial covers importing and manipulating data, basic and advanced regression and classification models, and using H2O in the cloud.
This is my Deep Water talk for the TensorFlow Paris meetup.
Deep Water is H2O's integration with multiple open source deep learning libraries such as TensorFlow, MXNet and Caffe. On top of the performance gains from GPU backends, Deep Water naturally inherits all H2O properties in scalability, ease of use and deployment.
This document summarizes a presentation given by Joe Chow on machine learning using H2O.ai's platform. The presentation introduced H2O, its products like Deep Water for deep learning, and demonstrated examples of building models with R and Python. It showed how H2O provides a unified interface for TensorFlow, MXNet and Caffe, allowing users to easily build and deploy deep learning models with different frameworks. The document provided an overview of the company and platform capabilities like scalable algorithms, model export and multiple language interfaces like R and Python.
This document summarizes a presentation given by Joe Chow on machine learning using H2O.ai's platform. The presentation covered:
1) An introduction to Joe and H2O.ai, including the company's mission to operationalize data science.
2) An overview of the H2O platform for machine learning, including its distributed algorithms, interfaces for R and Python, and model export capabilities.
3) A demonstration of deep learning using H2O's Deep Water integration with TensorFlow, MXNet, and Caffe, allowing users to build and deploy models across different frameworks.
Introduction to H2O and Model Stacking Use Cases (Jo-fai Chow)
This document provides an introduction and overview of H2O, an open source machine learning platform. It discusses H2O's capabilities for supervised and unsupervised learning using algorithms like gradient boosted machines, random forests, and deep learning. It also introduces the concept of model stacking in H2O, which uses the predictions from multiple models as inputs to train a new meta-model, and provides examples of stacking for regression and classification problems using various datasets.
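The stacking recipe described here (predictions from multiple models become inputs to a meta-model) can be illustrated in miniature. This is not H2O's implementation: the targets and base-model predictions below are invented, and the "meta-model" is reduced to a single blending weight chosen by grid search:

```python
# Out-of-fold predictions from two base models feed the meta-model;
# here the meta-model is just the convex weight w blending A and B.
y      = [1.0, 2.0, 3.0, 4.0]   # true targets on a held-out fold
pred_a = [1.2, 1.8, 3.3, 3.9]   # base model A's out-of-fold predictions
pred_b = [0.8, 2.4, 2.7, 4.2]   # base model B's out-of-fold predictions

def mse(w):
    blend = [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]
    return sum((p - t) ** 2 for p, t in zip(blend, y)) / len(y)

# Grid-search the blending weight on the held-out predictions.
best_w = min((w / 100 for w in range(101)), key=mse)
print(best_w, mse(best_w))
```

A real stacker trains a full model (e.g. a GLM) on the base predictions, but the principle of fitting the combiner on held-out predictions is the same.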
Joe Chow gave a presentation about machine learning use cases using H2O. He introduced H2O, an open source machine learning platform that works with R, Python, and other languages. He discussed how companies like Telenor and customers in Brazil are using H2O for tasks like predictive modeling. Joe also highlighted current algorithms in H2O and how it is evolving with new features like deep learning integration. The talk provided an overview of H2O and real-world examples of how organizations are applying machine learning with the platform.
Kaggle Competitions, New Friends, New Skills and New Opportunities (Jo-fai Chow)
1) Joe Chow transitioned from a career in civil engineering to data science after completing his PhD and participating in his first massive open online course (MOOC), which introduced him to Kaggle competitions.
2) His first Kaggle competition exposed gaps in his skills with open-source languages like R and Python and collaboration. He worked to fill these gaps by taking more MOOCs and learning from other Kagglers.
3) Participating in Kaggle competitions led to new opportunities for Joe, including side projects applying his skills, presentations, collaborations, a job at H2O.ai, and becoming a respected member of the data science community.
Introduction to Machine Learning with H2O and Python (Jo-fai Chow)
This document provides an introduction and agenda for a tutorial on machine learning with H2O and Python. The tutorial introduction covers the speaker's background and experience with machine learning. The agenda then outlines the topics to be covered, including an overview of H2O.ai as a company and machine learning platform, examples of importing and manipulating data with H2O and Python, training regression and classification models with different algorithms, improving model performance through tuning and ensembling techniques, and using H2O in the cloud. Code examples and Jupyter notebooks are referenced throughout the tutorial sections.
Introduction to Machine Learning with H2O and Python (Sri Ambati)
This document provides an introduction and agenda for a tutorial on machine learning with H2O and Python. The introduction discusses the presenter's background and qualifications. The agenda outlines topics to be covered including an overview of H2O.ai as a company and machine learning platform, tutorials on using the H2O Python module to import data, build regression and classification models, and improve model performance through techniques like cross-validation, grid search, and stacking. Case study notebooks and examples will be used to demonstrate key machine learning concepts and the H2O framework in Python.
This document summarizes a presentation about H2O's machine learning platform and Deep Water distributed deep learning capabilities. The presentation introduces H2O, its open source in-memory machine learning platform, performance advantages, and interfaces for R, Python and Flow. Deep Water is introduced as H2O's integration with TensorFlow, MXNet and Caffe that provides a unified interface for distributed deep learning on GPUs. Examples are shown training convolutional neural networks on image datasets using Deep Water with different backends.
Automatic and Interpretable Machine Learning in R with H2O and LIME (Sri Ambati)
This is a hands-on tutorial for R beginners. I will demonstrate the use of two R packages, h2o & LIME, for automatic and interpretable machine learning. Participants will be able to follow and build regression and classification models quickly with H2O’s AutoML. They will then be able to explain the model outcomes with a framework called Local Interpretable Model-Agnostic Explanations (LIME).
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools... (Artefactual Systems - AtoM)
These slides accompanied a June 4th, 2016 presentation made by Dan Gillean of Artefactual Systems at the Association of Canadian Archivists' 2016 Conference in Montreal, QC, Canada.
This presentation aims to examine several existing or emerging computing paradigms, with specific examples, to imagine how they might inform next-generation archival systems to support digital preservation, description, and access. Topics covered include:
- Distributed Version Control and git
- P2P architectures and the BitTorrent protocol
- Linked Open Data and RDF
- Blockchain technology
The session is part of an attempt by the ACA to create interactive "working sessions" at its conferences. Accompanying notes can be found at: http://bit.ly/tech-Proche
Participants were also asked to use the Twitter hashtag of #techProche for online interaction during the session.
Intro to Machine Learning with H2O and AWS (Sri Ambati)
Navdeep Gill @ Galvanize Seattle - May 2016
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Testing In Production (TiP) Advances with Big Data and the Cloud (SOASTA)
The document summarizes a webinar presentation about methodologies and technologies for testing in production (TiP). The presentation discusses leveraging active and passive monitoring of real user data in production for testing purposes. It also covers experimenting with new features on real users in production, and using load testing to evaluate system stress and scalability under real world conditions. The presentation was given by employees from Microsoft and SOASTA to discuss TiP strategies.
H2O Deep Water - Making Deep Learning Accessible to Everyone (Jo-fai Chow)
H2O Deep Water is a tool that integrates distributed deep learning with H2O's machine learning platform. It allows users to build, stack, and deploy deep learning models from libraries like TensorFlow, MXNet, and Caffe through a unified interface. Deep Water inherits properties from H2O like scalability, ease of use, and deployment capabilities. It also makes deep learning more accessible by supporting popular network architectures and allowing easy ensemble of deep models with other H2O algorithms.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2nwSwEh.
Marco Bonzanini discusses the process of building data pipelines, e.g. extraction, cleaning, integration, pre-processing of data; in general, all the steps necessary to prepare data for a data-driven product. In particular, he focuses on data plumbing and on the practice of going from prototype to production. Filmed at qconlondon.com.
Marco Bonzanini is Data Scientist and co-organizer of PyData London Meetup.
The Quest for an Open Source Data Science Platform (QAware GmbH)
Cloud Native Night July 2019, Munich: Talk by Jörg Schad (@joerg_schad, Head of Engineering & ML at ArangoDB)
=== Please download slides if blurred! ===
Abstract: With the rapid and recent rise of data science, the Machine Learning Platforms being built are becoming more complex. For example, consider the various Kubeflow components: Distributed Training, Jupyter Notebooks, CI/CD, Hyperparameter Optimization, Feature Store, and more. Each of these components produces metadata: different (versions of) datasets, different versions of a Jupyter notebook, different training parameters, test/training accuracy, different features, model serving statistics, and many more.
For production use it is critical to have a common view across all these metadata as we have to ask questions such as: Which jupyter notebook has been used to build Model xyz currently running in production? If there is new data for a given dataset, which models (currently serving in production) have to be updated?
In this talk, we look at existing implementations, in particular MLMD as part of the TensorFlow ecosystem. Further, we propose a first draft of an (MLMD-compatible) universal Metadata API. We demo the first implementation of this API using ArangoDB.
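The lineage question posed above ("which Jupyter notebook was used to build Model xyz?") amounts to walking a metadata graph backwards. A toy sketch with an invented API (not MLMD's API nor the proposed Metadata API; artifact names are made up):

```python
# Store typed artifacts plus directed "produced-from" edges,
# then answer lineage queries by walking the edges upstream.
artifacts = {}   # id -> {"type": ..., "name": ...}
parents = {}     # artifact id -> list of upstream artifact ids

def record(aid, atype, name, inputs=()):
    artifacts[aid] = {"type": atype, "name": name}
    parents[aid] = list(inputs)

record("d1", "dataset",  "clicks_v3")
record("n1", "notebook", "train_model.ipynb", inputs=["d1"])
record("m1", "model",    "xyz", inputs=["n1"])

def upstream(aid, atype):
    """All ancestors of `aid` having the given artifact type."""
    found, stack = [], list(parents.get(aid, []))
    while stack:
        cur = stack.pop()
        if artifacts[cur]["type"] == atype:
            found.append(artifacts[cur]["name"])
        stack.extend(parents.get(cur, []))
    return found

print(upstream("m1", "notebook"))  # ['train_model.ipynb']
```

The reverse walk ("new data arrived for dataset d1, which production models must be retrained?") is the same traversal along child edges.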
Similar to Introduction to Machine Learning with H2O - Jo-Fai (Joe) Chow, H2O (20)
ML & Graph algorithms to prevent financial crime in digital payments (Data Science Milan)
This document discusses using machine learning and graph algorithms to prevent financial crime in digital payments. It presents a layered approach: Level 0 uses rule-based SQL queries to detect anomalies, Level 1 applies supervised machine learning to classify transactions, and Level 2 uses a graph database and rules to model network anomalies. Level 3 combines machine learning, graph algorithms, and personalized PageRank to spread anomaly scores throughout a transaction network and identify suspicious groups. The strategies are being piloted through the Infinitech Project to develop technologies for applications in financial crime prevention, cybersecurity, and personalized products using AI, big data, IoT, and blockchain.
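The Level-3 score-spreading step can be sketched as a plain power iteration of personalized PageRank; the transaction graph and the flagged seed account below are invented for illustration:

```python
# Directed transaction graph and a restart distribution concentrated
# on "a", the account flagged as anomalous by the earlier levels.
graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["c"]}
seed = {"a": 1.0, "b": 0.0, "c": 0.0, "d": 0.0}
alpha = 0.85  # damping factor

rank = dict(seed)
for _ in range(50):  # power iteration
    new = {n: (1 - alpha) * seed[n] for n in graph}
    for n, outs in graph.items():
        for m in outs:
            new[m] += alpha * rank[n] / len(outs)
    rank = new

print(sorted(rank, key=rank.get, reverse=True))
```

Accounts reachable from the flagged seed inherit part of its anomaly mass, while disconnected accounts (here "d") stay near zero, which is exactly how suspicious groups surface.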
How to use the Economic Complexity Index to guide innovation plans (Data Science Milan)
The document discusses how to use the Economic Complexity Index (ECI) and Product Complexity Index (PCI) to guide innovation plans. It explains that the ECI and PCI are network measures that provide insights into economic development patterns by measuring diversity and ubiquity. The talk will show how to compute these metrics based on network theory and how they can be interpreted to compare countries, markets, products, and inform data-driven plans. Occupation complexity is also calculated based on skill diversity and ubiquity to understand changing skill demands over time.
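The two raw ingredients of ECI/PCI, diversity and ubiquity, are straightforward to compute from a binary country-by-product export matrix; the tiny matrix below is invented, and the real method of reflections then iterates averages of these counts:

```python
# M[c][p] = 1 if country c exports product p competitively.
M = {
    "A": {"cars": 1, "chips": 1, "wine": 1},
    "B": {"cars": 0, "chips": 0, "wine": 1},
}
products = ["cars", "chips", "wine"]

# Diversity k_{c,0}: how many products a country exports.
diversity = {c: sum(row.values()) for c, row in M.items()}
# Ubiquity k_{p,0}: how many countries export a product.
ubiquity = {p: sum(M[c][p] for c in M) for p in products}

print(diversity)  # {'A': 3, 'B': 1}
print(ubiquity)   # {'cars': 1, 'chips': 1, 'wine': 2}
```

Country A is diverse and exports low-ubiquity products (cars, chips), the signature of a complex economy; country B only exports the ubiquitous product.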
"You don't need a bigger boat": serverless MLOps for reasonable companies (Data Science Milan)
It is indeed a wonderful time to build machine learning systems, as the growing ecosystem of tools and shared best practices makes even small teams incredibly productive at scale. In this talk, we present our philosophy for modern, no-nonsense data pipelines, highlighting the advantages of an (almost) pure serverless and open-source approach, and showing how the entire toolchain works, from raw data to model serving, on a real-world dataset.
Finally, we argue that the crucial component for analyzing data pipelines is not the model per se, but the surrounding DAG, and present our proposal for producing automated "DAG cards" from Metaflow classes.
Bio:
Jacopo Tagliabue was co-founder and CTO of Tooso, an A.I. company in San Francisco acquired by Coveo in 2019. Jacopo is currently the Lead A.I. Scientist at Coveo. When not busy building A.I. products, he is exploring research topics at the intersection of language, reasoning and learning, with several publications at major conferences (e.g. WWW, SIGIR, RecSys, NAACL). In previous lives, he managed to get a Ph.D., do scienc-y things for a pro basketball team, and simulate a pre-Columbian civilization.
Topics: MLOps, Metaflow, model cards.
Question generation using Natural Language Processing by QuestGen.AI (Data Science Milan)
Ramsri Goutham presented on generating multiple choice questions (MCQs) from text using natural language processing. He discussed using T5 transformers and sense2vec vectors to generate questions from news articles and generate wrong answer choices using WordNet and Sense2vec. Ramsri also shared an open source question generation library called Questgen and demonstrated generating MCQs from sample text about Elon Musk and cryptocurrencies in a Google Colab notebook.
Abstract: Data preparation and modelling are the activities that take most of the time in a typical data scientist workday. In this session we’ll see how AWS services for Analytics and data management can be effectively used and integrated in AI/ML pipelines. We’ll focus on AWS Glue, AWS Glue DataBrew and AWS Data Wrangler with a bit of theory and hands-on demos.
Bio:
Francesco Marelli is a senior solutions architect at Amazon Web Services. He has lived and worked in the UK, Italy, Switzerland and other countries in EMEA. He specializes in the design and implementation of Analytics, Data Management and Big Data systems. Francesco also has strong experience in systems integration and in the design and implementation of applications.
Topics: machine learning pipelines, AWS, cloud.
Helixa uses serverless machine learning architectures to power an audience intelligence platform. It ingests large datasets and uses machine learning models to provide insights. Helixa's machine learning system is built on AWS serverless services like Lambda, Glue, Athena and S3. It features a data lake for storage, a feature store for preprocessed data, and uses techniques like map-reduce to parallelize tasks. Helixa aims to build scalable and cost-effective machine learning pipelines without having to manage servers.
MLOps with a Feature Store: Filling the Gap in ML Infrastructure (Data Science Milan)
A Feature Store enables machine learning (ML) features to be registered, discovered, and used as part of ML pipelines, thus making it easier to transform and validate the training data that is fed into machine learning systems. Feature stores can also enable consistent engineering of features between training and inference, but to do so, they need a common data processing platform. The first Feature Stores, developed at hyperscale AI companies such as Uber, Airbnb, and Facebook, enabled feature engineering using domain specific languages, providing abstractions tailored to the companies’ feature engineering domains. However, a general purpose Feature Store needs a general purpose feature engineering, feature selection, and feature transformation platform.
In this talk, we describe how we built a general purpose, open-source Feature Store for ML around dataframes and Apache Spark. We will demonstrate how data engineers can transform and engineer features from backend databases and data lakes, while data scientists can use PySpark to select and transform features into train/test data in a file format of choice (.tfrecords, .npy, .petastorm, etc) on a file system of choice (S3, HDFS). Finally, we will show how the Feature Store enables end-to-end ML pipelines to be factored into feature engineering and data science stages that can each run at a different cadence.
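The core promise, one registered transformation serving both training and inference, can be sketched with an invented registry API (this is not the Feature Store API described in the talk, and `basket_size` is a made-up feature):

```python
# A feature store in miniature: register a transformation once,
# then apply the *same* code path at training time and at serving
# time, avoiding training/serving skew.
registry = {}

def feature(name):
    def wrap(fn):
        registry[name] = fn
        return fn
    return wrap

@feature("basket_size")
def basket_size(raw):
    return len(raw["items"])

def build_vector(raw, names):
    return [registry[n](raw) for n in names]

train_row = build_vector({"items": ["a", "b"]}, ["basket_size"])
serve_row = build_vector({"items": ["a", "b"]}, ["basket_size"])
print(train_row == serve_row)  # True
```

A production feature store adds versioning, storage, and discovery on top, but the single-source-of-truth transformation is the essential idea.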
Bio:
Fabio Buso is the head of engineering at Logical Clocks AB, where he leads the Feature Store development. Fabio holds a master's degree in cloud computing and services with a focus on data intensive applications, awarded by a joint program between KTH Stockholm and TU Berlin.
Topics: feature store, MLOps.
This document provides an overview of reinforcement learning. It discusses the reinforcement learning framework including actors like agents, environments, states, actions, rewards, and policies. It also summarizes several common reinforcement learning methods including value-based methods, policy-based methods, and model-based methods. Value-based methods estimate value functions using algorithms like Q-learning and deep Q-networks. Policy-based methods directly learn policies using policy gradient algorithms like REINFORCE. Model-based methods learn models of the environment and then plan based on these models.
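As a concrete instance of the value-based family mentioned above, here is tabular Q-learning on a toy four-state chain (the environment and hyperparameters are invented for illustration):

```python
import random

# Toy chain MDP: states 0..3, actions move left (-1) or right (+1),
# reward 1 for reaching the goal state 3, zero otherwise.
random.seed(0)
n_states, actions = 4, [-1, +1]
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, eps = 0.5, 0.9, 0.2  # learning rate, discount, exploration

for _ in range(500):
    s = 0
    while s != 3:
        if random.random() < eps:
            a = random.choice(actions)                  # explore
        else:
            a = max(actions, key=lambda x: Q[(s, x)])   # exploit
        s2 = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s2 == 3 else 0.0
        best_next = max(Q[(s2, b)] for b in actions)
        # Q-learning update toward the bootstrapped target.
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# Greedy policy extracted from the learned Q-table.
policy = [max(actions, key=lambda x: Q[(s, x)]) for s in range(3)]
print(policy)
```

A deep Q-network replaces the table with a neural network, and policy-gradient methods skip the value table entirely, but the update rule above is the canonical value-based step.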
Time Series Classification with Deep Learning | Marco Del Pra (Data Science Milan)
Today a great deal of data is stored in the form of time series, and with the current wide diffusion of real-time applications, many areas show a strongly increasing interest in applications based on this kind of data: for example finance, advertising, marketing, health care, automated disease detection, biometrics, retail, and the identification of anomalies of any kind. It is therefore very interesting to understand the role and potential of machine learning in this sector.
Many methods can be used for the classification of the time series, but all of them, apart from deep learning, require some kind of feature engineering as a separate stage before the classification is performed, and this can imply the loss of some important information and the increase of the development and test time. On the contrary, deep learning models such as recurrent and convolutional neural networks already incorporate this kind of feature engineering internally, optimizing it and eliminating the need to do it manually. Therefore they are able to extract information from the time series in a faster, more direct, and more complete way.
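Mechanically, the "built-in feature engineering" of a convolutional network is just learned kernels sliding over the series. Below is a hand-rolled 1-D convolution with a fixed difference kernel; in a real CNN the kernel values are learned during training, and the series here is invented:

```python
# Slide a kernel over the series and emit a feature map.
# [1, -1] is a difference filter: it highlights changes in the signal.
series = [0.0, 0.0, 1.0, 1.0, 0.0]
kernel = [1.0, -1.0]

feature_map = [
    sum(k * series[i + j] for j, k in enumerate(kernel))
    for i in range(len(series) - len(kernel) + 1)
]
print(feature_map)  # [0.0, -1.0, 0.0, 1.0]
```

Stacking many such (learned) filters, plus nonlinearities and pooling, is what lets the network extract discriminative features directly from raw series without a separate feature-engineering stage.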
Bio:
Marco Del Pra
I am 41 years old and was born in Venice, and I have two master's degrees (Computer Science and Mathematics). I have been working for about 10 years in Artificial Intelligence, first as a Data Scientist, then as a Team Leader and finally as Head of Data. Among others, I have worked for Microsoft, for the European Commission (JRC of Ispra) and for Cuebiq. I am currently working as a freelancer and, together with two other co-founders, am creating an innovative AI startup. I have two important publications in applied mathematics.
Topics: recurrent and convolutional neural networks, deep learning, time-series.
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AI (Data Science Milan)
The talk will introduce Ludwig, a deep learning toolbox that allows users to train models and use them for prediction without writing code. It is unique in its ability to make deep learning easier to understand for non-experts and to enable faster model-improvement iteration cycles for experienced machine learning developers and researchers alike. By using Ludwig, experts and researchers can simplify the prototyping process and streamline data processing so that they can focus on developing deep learning architectures.
Bio:
Piero Molino is a Senior Research Scientist at Uber AI, focusing on machine learning for language and dialogue. Piero completed a PhD on Question Answering at the University of Bari, Italy, and founded QuestionCube, a startup that built a framework for semantic search and QA. He worked for Yahoo Labs in Barcelona on learning to rank and for IBM Watson in New York on natural language processing with deep learning, then joined Geometric Intelligence, where he worked on grounded language understanding. After Uber acquired Geometric Intelligence, he became one of the founding members of Uber AI Labs.
Audience projection of target consumers over multiple domains a ner and baye... (Data Science Milan)
Traditional market research is generally conducted via questionnaires or other forms of explicit feedback, put directly to an ad hoc panel of individuals who in aggregate are representative of a larger group of people. Unfortunately, those traditional approaches are often invasive, non-scalable, and biased. Indirect approaches based on sparse and implicit consumer feedback (e.g., social network interactions, web browsing, or online purchases) are more scalable, more authentic, and more suitable for real-time consumer insights.
Although those sources of implicit consumer feedback provide relevant and detailed pictures of the population, they individually provide only a limited set of observable behaviors.
The Holy Grail of market research is the ability to merge different sources of consumer interests into an augmented view that connects all the dots across multiple domains.
Unfortunately, user-centric "fusion" algorithms present many limitations in the case of heterogeneous datasets strongly differing in terms of size and density and when the number of sources to merge increases.
We propose a novel Audience Projection approach, able to define a target audience as a subset of the population in a source domain and to project this target onto a set of users in a destination dataset.
We will show how libraries such as spaCy provide deep learning implementations of Named Entity Recognition (NER) to match related brands, and how Bayesian inference can transfer knowledge from the source domain. This way, we can estimate the probability that a user belongs to the target, using the source-domain distribution of interest volumes for common entities as model evidence and the source target size as the prior probability.
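As a rough illustration of the Bayesian step described above, here is a minimal pure-Python sketch. All entity statistics and the prior are invented for the example (they are not Helixa's data): given source-domain interest rates for common entities among target users and among the whole population, Bayes' rule yields the probability that a destination-domain user belongs to the target.

```python
import math

# Hypothetical source-domain statistics: for each shared entity (e.g. a
# brand matched via NER), the fraction of target users and of all users
# who show interest in it.
p_interest_given_target = {"brand_a": 0.60, "brand_b": 0.30}
p_interest_overall = {"brand_a": 0.20, "brand_b": 0.25}

prior_target = 0.10  # target audience size in the source domain (prior)

def posterior_target(observed_entities):
    """Naive-Bayes style projection: P(target | observed interests)."""
    log_odds = math.log(prior_target / (1 - prior_target))
    for e in observed_entities:
        p_t = p_interest_given_target[e]
        # P(interest | not target), derived from the overall rate:
        p_nt = (p_interest_overall[e] - prior_target * p_t) / (1 - prior_target)
        log_odds += math.log(p_t / p_nt)
    return 1 / (1 + math.exp(-log_odds))

print(round(posterior_target(["brand_a", "brand_b"]), 3))  # → 0.345
```

A user showing interest in both (hypothetical) brands moves from a 10% prior to roughly a 34% posterior probability of belonging to the target.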
Bio:
Gianmario Spacagna is the chief scientist and head of AI at Helixa. His team’s mission is building the next generation of behavior algorithms and models of human decision making with careful attention to their potential and effects on society. His experience covers a diverse portfolio of machine learning algorithms and data products across different industries. Previously, he worked as a data scientist in IoT automotive (Pirelli Cyber Technology), retail and business banking (Barclays Analytics Centre of Excellence), threat intelligence (Cisco Talos), predictive marketing (AgilOne), plus some occasional freelancing. He’s a co-author of the book Python Deep Learning, contributor to the “Professional Manifesto for Data Science,” and founder of the Data Science Milan community. Gianmario holds a master’s degree in telematics (Polytechnic of Turin) and software engineering of distributed systems (KTH of Stockholm). After having spent half of his career abroad, he now lives in Milan. His favorite hobbies include home cooking, hiking, and exploring the surrounding nature on his motorcycle.
Weakly Supervised Learning: Introduction and Best Practices
In this talk we will introduce the three main types of weakly supervised learning: incomplete, inexact, and inaccurate supervision. We will examine how models can be trained under weak supervision and look at real applications of weakly supervised learning, showing how it can improve results and decrease costs.
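A minimal sketch of one flavor of weak supervision (inaccurate labels produced by cheap heuristics, in the spirit of labeling-function approaches); the rules and documents below are invented for illustration:

```python
# Toy weak supervision: several noisy labeling functions (heuristics)
# vote on unlabeled examples; a majority vote produces "inaccurate"
# training labels far more cheaply than manual annotation would.

def lf_contains_refund(text):   # heuristic 1: refund requests
    return 1 if "refund" in text else 0

def lf_contains_angry(text):    # heuristic 2: explicit anger
    return 1 if "angry" in text else 0

def lf_exclamations(text):      # heuristic 3: lots of exclamation marks
    return 1 if text.count("!") >= 2 else 0

LFS = [lf_contains_refund, lf_contains_angry, lf_exclamations]

def weak_label(text):
    """Majority vote over labeling functions: 1 = complaint, 0 = other."""
    votes = [lf(text) for lf in LFS]
    return 1 if sum(votes) * 2 > len(votes) else 0

docs = [
    "I want a refund now!! This is unacceptable!!",
    "Thanks for the quick delivery.",
    "I am angry, please refund my order",
]
print([weak_label(d) for d in docs])  # → [1, 0, 1]
```

The resulting noisy labels can then train an ordinary classifier, which often generalizes beyond the individual heuristics.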
Bio:
Kristina Khvatova works as a Software Engineer at Softec S.p.A. She is currently involved in the development of a project for data analysis and visualisation, which includes quantitative and qualitative analysis based on classification, optimisation, time-series prediction, and anomaly-detection techniques. She obtained a master's degree in Mathematics at Saint Petersburg State University and a master's degree in Computer Science at the University of Milano-Bicocca.
GANs beyond nice pictures: real value of data generation, Alex Honchar - Data Science Milan
Generative modeling can be used for problems beyond just generation, such as anomaly detection, determining factors of variation in datasets, domain adaptation between not-aligned datasets, and building better embeddings for supervised learning tasks. Generative models can model the underlying distribution of data to check if a point belongs to that distribution or create new points from the distribution. They can learn low-dimensional manifolds on which real-world high-dimensional data like images lie. This allows generative models to be applied to challenges like filtering, style transfer, and improving embeddings to boost performance on downstream tasks.
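As a toy illustration of the "model the distribution, then check membership" idea, the sketch below uses a fitted 1-D Gaussian as a stand-in for a learned generative model (a real GAN-based detector would use discriminator or reconstruction scores instead; everything here is an assumption for illustration):

```python
import math
import random

random.seed(0)

# Stand-in for a learned generative model: fit a 1-D Gaussian to
# "normal" data, then flag points whose likelihood is too low.
normal_data = [random.gauss(0.0, 1.0) for _ in range(1000)]

mu = sum(normal_data) / len(normal_data)
var = sum((x - mu) ** 2 for x in normal_data) / len(normal_data)

def log_density(x):
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

# Threshold: the 1st percentile of the training log-likelihoods.
scores = sorted(log_density(x) for x in normal_data)
threshold = scores[len(scores) // 100]

def is_anomaly(x):
    return log_density(x) < threshold

print(is_anomaly(0.1), is_anomaly(8.0))  # typical point vs. far outlier
```

Points the model assigns very low likelihood are flagged as not belonging to the learned distribution, which is exactly the anomaly-detection use of generative models described above.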
Continual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco - Data Science Milan
Humans have the extraordinary ability to learn continually from experience. Not only can we apply previously learned knowledge and skills to new situations, we can also use these as the foundation for later learning. One of the grand goals of AI is building an artificial continually learning agent that constructs a sophisticated understanding of the world from its own experience through the autonomous incremental development of ever more complex skills and knowledge.
"Continual Learning" (CL) is indeed a fast emerging topic in AI concerning the ability to efficiently improve the performance of a deep model over time, dealing with a long (and possibly unlimited) sequence of data/tasks. In this workshop, after a brief introduction of the topic, we’ll implement different Continual Learning strategies and assess them on common vision benchmarks. We’ll conclude the workshop with a look at possible real world applications of CL.
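One of the simplest CL strategies covered in such workshops is rehearsal with a replay memory. A minimal sketch using reservoir sampling and placeholder examples (this is an illustrative assumption, not the workshop's actual code):

```python
import random

random.seed(42)

class ReplayBuffer:
    """Reservoir-sampled memory for rehearsal-based continual learning.

    Keep a small random sample of past examples and mix it into each new
    task's training batches, so the model is reminded of earlier tasks
    (mitigating catastrophic forgetting).
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0

    def add(self, example):
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = example  # replace with decaying probability

    def rehearsal_batch(self, new_batch, k):
        """New-task batch augmented with k replayed old examples."""
        return new_batch + random.sample(self.buffer, min(k, len(self.buffer)))

memory = ReplayBuffer(capacity=50)
for task_id in range(3):            # a short sequence of tasks
    for i in range(200):
        memory.add((task_id, i))    # (task, example) placeholders
batch = memory.rehearsal_batch([("new", 0)], k=10)
print(len(memory.buffer), len(batch))  # → 50 11
```

The buffer ends up holding a roughly uniform sample across all tasks seen so far, regardless of how long the task sequence grows.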
Vincenzo Lomonaco is a Deep Learning PhD student at the University of Bologna and founder of ContinualAI.org. He is also the PhD student representative at the Department of Computer Science and Engineering (DISI) and a teaching assistant for the courses “Machine Learning” and “Computer Architectures” in the same department. Previously, he was a machine learning software engineer at IDL in-line Devices and a master's student at the University of Bologna, where he graduated cum laude in 2015 with the dissertation “Deep Learning for Computer Vision: a Comparison Between CNNs and HTMs on Object Recognition Tasks”.
Processing 3D images has many use cases: improving autonomous driving, enabling digital conversion of old factory buildings, powering augmented-reality solutions for medical surgery, and more. 3D images also help with 3D modeling and product safety evaluation.
3D image processing brings enormous benefits but also amplifies computing costs. The size of the point cloud, the number of points, its sparse and irregular structure, and the adverse impact of light reflections, (partial) occlusions, etc. make point clouds difficult for engineers to process.
Moving from hand-crafted features to deep learning techniques for semantic segmentation, object classification and detection, action detection in 3D videos, and more, we have come a long way in 3D image processing.
3D Point Cloud image processing is increasingly used to solve Industry 4.0 use cases to help architects, builders and product managers. I will share some of the innovations that are helping the progress of 3D point cloud processing. I will share the practical implementation issues we faced while developing deep learning models to make sense of 3D Point Clouds.
Attendees: beginner and intermediate practitioners in image processing and 3D point clouds
Profile of the speaker:
SK Reddy is the Chief Product Officer AI at Hexagon (www.hexagon.com). He is an AI and ML expert and a two-time successful startup entrepreneur, as well as an AI startup advisor. He is also a frequent conference speaker and an AI blogger.
Deep time-to-failure: predicting failures, churns and customer lifetime with ... - Data Science Milan
1. The document discusses using deep learning models like recurrent neural networks to predict time-to-failure events from time series data. It specifically focuses on a technique called Deep Time-to-Failure which extends a Weibull Time-to-Event Recurrent Neural Network to predict a single failure event.
2. As a case study, the technique is applied to predict failure times of NASA jet engines using sensor data as inputs. The model is trained on historical sequences of data to learn the distribution of time-to-failure and can provide probabilistic predictions and confidence intervals.
3. Key aspects of the Deep Time-to-Failure approach include using both censored and uncensored training data and consuming raw time series as input.
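Once a network of this kind predicts Weibull parameters, survival probabilities and quantiles follow in closed form. A sketch with illustrative parameter values (these numbers are assumptions for the example, not NASA results):

```python
import math

# A network's predicted Weibull parameters for one engine
# (alpha = scale, beta = shape; illustrative values only):
alpha, beta = 120.0, 1.8

def survival(t):
    """P(failure occurs after time t) under Weibull(alpha, beta)."""
    return math.exp(-((t / alpha) ** beta))

def quantile(p):
    """Time by which failure has occurred with probability p."""
    return alpha * (-math.log(1 - p)) ** (1 / beta)

median_ttf = quantile(0.5)
print(round(survival(60.0), 3))  # → 0.75  chance of surviving past t=60
print(round(median_ttf, 1))      # → 97.9  median remaining useful life
```

Because the model outputs a full distribution rather than a point estimate, any quantile can serve as a confidence interval on the time to failure.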
50 Shades of Text - Leveraging Natural Language Processing (NLP), Alessandro ... - Data Science Milan
50 Shades of Text - Leveraging Natural Language Processing (NLP) to validate, improve, and expand the functionalities of a product
Nowadays, every company either stores or produces text data: from web logs and user queries to translations and support tickets. Yet not everyone knows how to extract valuable insights from it. In this session, we will present a practical case of moving from raw text data to a valuable business application, leveraging some of the major NLP methodologies (word embeddings, word2vec, doc2vec, fastText, etc.)
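A toy sketch of the common doc2vec-style baseline: average word vectors into a document vector and compare documents by cosine similarity. The 3-dimensional "word vectors" below are hand-made for illustration; a real pipeline would train them with word2vec or fastText on the company's text.

```python
import math

# Hand-made toy "word vectors" (assumption for illustration only).
vectors = {
    "shipping": [0.9, 0.1, 0.0], "delivery": [0.8, 0.2, 0.1],
    "invoice":  [0.1, 0.9, 0.0], "payment":  [0.0, 0.8, 0.2],
}

def doc_vector(tokens):
    """Baseline document embedding: average the known word vectors."""
    known = [vectors[t] for t in tokens if t in vectors]
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

ticket1 = doc_vector(["shipping", "delivery"])  # logistics complaint
ticket2 = doc_vector(["invoice", "payment"])    # billing question
ticket3 = doc_vector(["delivery"])              # another logistics ticket
print(cosine(ticket1, ticket3) > cosine(ticket1, ticket2))  # → True
```

Even this crude averaging groups the two logistics tickets together, which is the building block behind search, routing, and deduplication applications on support text.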
Bio: Alessandro is a data veteran. He holds two Master’s degrees in computer engineering, one from Politecnico di Milano and the other from University of Illinois at Chicago (UIC).
He started his career in data consultancy, where he mastered Apache Spark for Machine Learning projects and subsequently joined WW Grainger, one of the largest MRO e-commerce companies in the United States. In September 2017, after more than 5 years in the USA, Alessandro returned to his native country, Italy, where he is now leading a team of data scientists. His current work focuses on achieving energy efficiency through the automation of energy management processes for commercial customers.
Pricing Optimization: Close-out, Online and Renewal strategies, Data Reply - Data Science Milan
This document contains summaries of three projects related to pricing optimization:
1) Optimal discount strategy for products in close-out phase to balance margin loss and inventory costs. The solution involved sales forecasting, price elasticity modeling, and discount optimization.
2) Online pricing optimization using contextual multi-armed bandit algorithms to maximize ticket revenues. The solution used algorithms like UCB1 and ORAT.
3) Renewal price optimization for subscription products by developing elasticity curves and using simplex optimization to determine optimal prices given business objectives and constraints.
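The UCB1 algorithm named in the online-pricing project can be sketched in a few lines; the arms' conversion rates below are hypothetical stand-ins for candidate prices:

```python
import math
import random

random.seed(7)

def ucb1(true_rates, rounds=5000):
    """UCB1 bandit: pick the arm maximizing mean + sqrt(2 ln t / n)."""
    n_arms = len(true_rates)
    counts = [0] * n_arms
    rewards = [0.0] * n_arms
    total = 0.0
    for t in range(1, rounds + 1):
        if t <= n_arms:                 # play each arm once to initialize
            arm = t - 1
        else:
            arm = max(range(n_arms),
                      key=lambda a: rewards[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if random.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        rewards[arm] += reward
        total += reward
    return counts, total

# Three hypothetical prices with unknown conversion rates:
counts, total = ucb1([0.05, 0.30, 0.15])
print(counts.index(max(counts)))  # → 1  (the best arm gets pulled most)
```

The exploration bonus shrinks as an arm is sampled, so the algorithm gradually concentrates ticket offers on the best-performing price while still revisiting the others occasionally.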
"How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrig... - Data Science Milan
"How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrigoni, Senior Data Scientist, Pirelli (pirelli.com)
Abstract:
Pirelli, a global performance tire manufacturer, uses data science in its 20 factories to improve quality and efficiency, and reduce energy consumption. For this “Smart Manufacturing” initiative, Pirelli’s data science team has developed predictive models and analytics tools to monitor processes, machines and materials on the factory floors. In this talk we will show some of the solutions we deploy, demonstrate how we used Domino’s data science platform and Plot.ly to build these solutions, and discuss the next steps in this journey towards predictive maintenance.
Bio:
Alberto Arrigoni is a data scientist at Pirelli, where he works to process sensors and telemetry data for IoT, Smart Factories and connected-vehicle applications.
He works closely with all major business units such as R&D, industrial engineering and BI to develop tailored machine learning algorithms and production systems.
He holds a PhD in biostatistics from the University of Milan Bicocca and prior to joining Pirelli was a staff data scientist at the National Institute of Molecular Genetics (Milan), as well as a Fulbright student at the Santa Clara University and visiting PhD student at Pacific Biosciences (Menlo Park, CA).
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 - Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
HCL Notes and Domino License Cost Reduction in the World of DLAU - panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar, with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your costs through an optimized configuration and keep them low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Full-RAG: A modern architecture for hyper-personalization - Zilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Essentials of Automations: The Art of Triggers and Actions in FME - Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Threats to mobile devices are increasingly prevalent and growing in scope and complexity. Users want to take full advantage of their devices' features, but many of those features trade security for convenience and capability. This best practices guide outlines steps users can take to better protect personal devices and information.
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries: Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher overall coverage. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns, and DIAR helps you find such seeds.
These are the slides of a talk given at the IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2022.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Unlock the Future of Search with MongoDB Atlas: Vector Search Unleashed - Malak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
Infrastructure Challenges in Scaling RAG with Custom AI models - Zilliz
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
GraphRAG for Life Science to increase LLM accuracy - Tomaz Bratanic
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
How to Get CNIC Information System with Paksim Ga - danishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0! - SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
UiPath Test Automation using UiPath Test Suite series, part 6 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series, part 6. In this session, we will cover test automation with generative AI and OpenAI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI?
Test Automation with generative AI and OpenAI
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
HCL Notes and Domino License Cost Reduction in the World of DLAU - panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licensing under the CCB and CCX model have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefits it offers you. Above all, you surely want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We will explain how to resolve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also practices that can lead to unnecessary spending, such as using a person document instead of a mail-in database for shared mailboxes. We will show you such cases and their solutions. And of course, we will explain the new licensing model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder will introduce you to this new world. It will give you the tools and know-how to stay on top of things. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
These topics will be covered
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Introduction to Machine Learning with H2O - Jo-Fai (Joe) Chow, H2O
1. Introduction to
Machine Learning with H2O
Jo-fai (Joe) Chow
Data Scientist
joe@h2o.ai
@matlabulous
Data Science Milan
Politecnico di Milano
10th October, 2016
2. About Me: Civil Engineer → Data Scientist
• 2005 - 2015
• Water Engineer
o Consultant for Utilities
o Industrial PhD
• Water Engineering +
Machine Learning
• Discovered H2O in 2014!
• 2015 - Present
• Data Scientist
o Virgin Media (UK)
o Domino Data Lab (US)
o H2O.ai (US)
2
Why? Long story – see bit.ly/joe_h2o_talk2
3. Agenda
• First Talk (25 mins)
o About H2O.ai
o Demo
• A Simple Classification Task
• H2O’s Web Interface
o Why H2O?
• Our Community
• Our Customers
o What’s Next?
• New H2O Features
• Second Talk (25 mins)
o H2O for IoT
• Predictive Maintenance
• Anomaly Detection
• H2O’s R Interface
• Third Talk (25 mins)
o Deep Water
o Demo
• H2O + mxnet on GPU
• H2O’s Python Interface
3
5. About H2O.ai
• H2O.ai, the Company
o Team: 80 (70 shown)
o Founded in 2012
o HQ: Mountain View, California
• H2O, the Platform
o Open Source (Apache 2.0)
o Algorithms written in Java
• Fast, distributed and scalable
o Multiple interfaces to suit different users
• Web, R, Python, Java, Scala, REST/JSON
o Works with desktop/laptop, cloud, Spark
and Hadoop
Joe
11. A Typical Machine Learning Task
• Demo
o Dataset – MNIST
• LeCun et al. (1999)
• Hand-written Digits
o Import & Explore Data
o Build & Evaluate Models
o Make Predictions
11
Photo credit: http://www.opendeep.org/v0.0.5/docs/tutorial-classifying-handwritten-mnist-images
12. MNIST Hand-Written Digits
• 784 Inputs
o 28 x 28 = 784 pixels
• 1 Output
o 0, 1, 2, 3, 4, 5, 6, 7, 8 or 9
o Classification
• Files
o Train (60k Records)
o Test (10k)
• Links
o https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz
o https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz
12
Photo credit: https://ml4a.github.io/ml4a/neural_networks/
13. H2O Flow (Web Interface) Demo
• Download and unzip jar
from www.h2o.ai
• In terminal:
o java -jar h2o.jar
• Web browser:
o localhost:54321
13
17. More Advanced Topics
• Advanced Features
o Hyperparameters Tuning
o Model Stacking
o Saving/Loading Models
o Export Plain Old Java
Object (POJO)
• Key Resources
o docs.h2o.ai
• Joe’s Previous H2O Talks
o bit.ly/joe_h2o_talk3
o bit.ly/h2o_budapest_1
o bit.ly/h2o_paris_1
17
20. Szilard Pafka – Chief Data Scientist at Epoch
• Szilard’s talks / blog
posts about H2O:
o ML Benchmark
o Intro to ML with H2O
o H2O Scoring
o Tweets
20
29. H2O in Action
29
Thank you
Data Science Milan – May 19, 2016
Bringing Deep Learning into production - Paolo Platter, AgileLab
http://www.slideshare.net/ds_mi/bringing-deep-learning-into-production-paolo-platter-agilelab
31. H2O is Evolving
• H2O Open Tour NYC
YouTube Playlist
o Advanced data munging
o Visual ML
o Deep Water (3rd talk)
o Sparkling Water
• PySparkling & RSparkling
o Steam
31
Next time?
33. End of First Talk – Thanks!
33
• Data Science Milan
• Gianmario Spacagna
• Politecnico di Milano
• Resources
o bit.ly/h2o_milan_1
o www.h2o.ai
o docs.h2o.ai
• Contact
o joe@h2o.ai
o @matlabulous
o github.com/woobe
41. 41
Users have full access to all available parameters
to fine-tune the model training process
For example, I am using
rectifier with dropout as the activation
to train the model for 20 epochs
with class balancing
Leaving other settings as default
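The same settings shown here in Flow can also be expressed through H2O's Python interface. A minimal sketch, assuming H2O 3.x: `activation`, `epochs`, and `balance_classes` are real `H2ODeepLearningEstimator` arguments, while the dataset handling below is a placeholder.

```python
# Settings from the Flow demo, as keyword arguments for H2O's Python API.
params = {
    "activation": "RectifierWithDropout",  # rectifier with dropout
    "epochs": 20,                          # train for 20 epochs
    "balance_classes": True,               # class balancing
}

# Usage sketch (requires the h2o package and a running local cluster):
#   import h2o
#   from h2o.estimators import H2ODeepLearningEstimator
#   h2o.init()
#   train = h2o.import_file("train.csv.gz")
#   train[-1] = train[-1].asfactor()   # response column as categorical
#   model = H2ODeepLearningEstimator(**params)
#   model.train(x=train.columns[:-1], y=train.columns[-1],
#               training_frame=train)

print(sorted(params))
```

All other estimator parameters keep their defaults, matching the "leaving other settings as default" step in the Flow demo.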
42. 42
Training the model with estimated remaining time
– users can stop the process early if they want to