Google AutoML, AWS SageMaker and other ML tools automate some but not all steps in machine learning workflows. Learn about problem formulation, data engineering, monitoring, and fairness assessment.
A Microservices Framework for Real-Time Model Scoring Using Structured Stream...Databricks
Open-source technologies allow developers to build microservices frameworks for a myriad of real-time applications; one such application is real-time model scoring. In this session,
we will showcase how to architect a microservices framework and, in particular, how to use it to build a low-latency, real-time model scoring system. At the core of the architecture lies Apache Spark’s Structured
Streaming capability, which delivers low-latency predictions, coupled with Docker and Flask as additional open-source tools for model serving. In this session, you will walk away with:
* Knowledge of enterprise-grade model as a service
* Streaming architecture design principles enabling real-time machine learning
* Key concepts and building blocks for real-time model scoring
* Real-time and production use cases across industries, such as IIoT, predictive maintenance, fraud detection, and sepsis detection
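Setting the talk's specific stack (Structured Streaming, Docker, Flask) aside, the heart of any scoring microservice is a stateless score function applied to each event in a stream. A minimal standard-library sketch, with a hypothetical linear model standing in for the real artifact:

```python
import json

# Hypothetical pre-trained model: a simple linear scorer standing in for
# whatever artifact the real service would load at startup.
WEIGHTS = {"temperature": 0.4, "vibration": 0.6}
THRESHOLD = 0.5

def score(event: dict) -> dict:
    """Score a single event and attach the prediction to it."""
    s = sum(WEIGHTS.get(k, 0.0) * v for k, v in event["features"].items())
    return {"id": event["id"], "score": s, "alert": s > THRESHOLD}

def handle_stream(lines):
    """Consume JSON-encoded events (one per line) and yield scored results --
    the role Structured Streaming plays at scale, and what a Flask endpoint
    would do per request."""
    for line in lines:
        yield score(json.loads(line))

if __name__ == "__main__":
    events = [
        '{"id": 1, "features": {"temperature": 0.2, "vibration": 0.1}}',
        '{"id": 2, "features": {"temperature": 0.9, "vibration": 0.8}}',
    ]
    for result in handle_stream(events):
        print(result)
```

In the real architecture the model load, the stream source, and the sink are all swappable; the scoring function itself stays this small.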
Michal Malohlava's presentation on Building Your Own Recommendation Engine 03.17.16
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Jakub Hava, H2O.ai - Productionizing Apache Spark Models using H2O - H2O Worl...Sri Ambati
This session was recorded in San Francisco on February 5th, 2019 and can be viewed here: https://youtu.be/PavE9px22Mo
Spark pipelines are a powerful concept for productionizing machine learning workflows. Their API allows combining data processing with machine learning algorithms and opens opportunities for integration with various machine learning libraries. However, to benefit from the power of pipelines, users need the freedom to choose and experiment with any machine learning algorithm or library.
Therefore, we developed Sparkling Water, which embeds the H2O machine learning library of advanced algorithms into the Spark ecosystem and exposes them via the pipeline API. Furthermore, the algorithms benefit from H2O MOJOs (Model Object, Optimized), a powerful concept shared across the entire H2O platform for storing and exchanging models. MOJOs are designed for effective model deployment, with a focus on scoring speed, traceability, exchangeability, and backward compatibility. In this talk we will explain the architecture of Sparkling Water, with a focus on its integration into Spark pipelines and MOJOs.
We’ll demonstrate the creation of pipelines integrating H2O machine learning models and their deployment using Scala or Python. Furthermore, we will show how to use pre-trained MOJOs with Spark pipelines.
Bio: Jakub (or “Kuba” as we call him) completed his Bachelor’s Degree in Computer Science and Master’s Degree in Software Systems at Charles University in Prague. For his bachelor’s thesis, Kuba wrote a small platform for distributed computing of any type of task. During his master’s degree studies, he developed a cluster monitoring tool for JVM-based languages which makes debugging and reasoning about the performance of distributed systems easier, using a concept called distributed stack traces. Kuba enjoys dealing with problems and learning new programming languages. At H2O.ai, Kuba works on Sparkling Water. Aside from programming, Kuba enjoys exploring new cultures and bouldering. He’s also a big fan of tea preparation and the associated ceremony.
I am an instructor of an MLOps workshop for an anonymous startup incubation program, where the objectives are (1) to orchestrate and deploy updates to the application and the deep learning model in a unified way, and (2) to design a DevOps pipeline that coordinates retrieving the latest best model from the model registry, packaging the web application, deploying the web application, and serving inference over a web service.
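The workshop objectives above amount to a generic CI/CD job sequence. A sketch of the stages as pipeline configuration; the stage names, scripts, and flags are illustrative, not actual workshop material:

```yaml
# Hypothetical CI/CD pipeline: fetch the latest registered model,
# package it with the web app, deploy, then smoke-test the endpoint.
stages:
  - fetch_model     # pull the latest "Production" model from the registry
  - build           # bake model + web app into one container image
  - deploy          # roll out the image to the serving environment
  - smoke_test      # send a known request and check the prediction

fetch_model:
  script:
    - python fetch_latest_model.py --stage Production --out model/
build:
  script:
    - docker build -t webapp:latest .
deploy:
  script:
    - ./deploy.sh webapp:latest
smoke_test:
  script:
    - python smoke_test.py --endpoint "$SERVICE_URL"
```

The key property is that the model artifact and the application are versioned and released through the same pipeline, so a model update is just another deployment.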
Productionalizing Models through CI/CD Design with MLflowDatabricks
Model deployment and integration often consist of several moving parts that require intricate steps woven together. Automating this pipeline and feedback loop can be incredibly challenging, especially in light of varying model development techniques.
Productionizing ML Models Using MLflow Model ServingDatabricks
Productionizing ML models requires ensuring model integrity, efficiently replicating runtime environments across servers, and keeping track of how each of our models was created. This helps us better trace the root cause of changes and issues over time as we acquire new data and update our models, and gives us greater accountability over our models and the results they generate.
MLflow Model Serving delivers cost-effective, one-click deployment of models for real-time inference. Model versions deployed with Model Serving can also be conveniently managed with the MLflow Model Registry. We will cover three topics: deployment, consumption, and monitoring. For deployment, we will demo deploying different versions and validating the deployment. For consumption, we will demo connecting Power BI and generating a prediction report using an ML model deployed in MLflow Serving. Lastly, we will wrap up with managing MLflow Serving, including access rights and monitoring capabilities.
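For consumption, a client (Power BI or any other HTTP client) posts records to the served model's `/invocations` endpoint. A request body in the `dataframe_split` format accepted by recent MLflow scoring servers looks roughly like this (the exact schema depends on the MLflow version, and the column names here are hypothetical):

```json
{
  "dataframe_split": {
    "columns": ["feature_a", "feature_b"],
    "data": [[1.0, 2.5], [0.3, 4.1]]
  }
}
```

The response is a JSON list of predictions, one per input row.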
An introduction to using R in Power BI via the various touch points such as: R script data sources, R transformations, custom R visuals, and the community gallery of R visualizations
Presentation from Mumbai Tech Meetup on December 13, 2015. This deck presents various updates to the Google Cloud Platform in the last 6+ months. Covers : App Engine, Compute Engine, Cloud Vision API, Cloud Shell, Containers and more.
Using Apache Spark to Predict Installer Retention from Messy Clickstream Data...Databricks
Clickstream data is messy. A single user session in a Zynga game can generate thousands of events, with each game, client version, and OS having its own event schemas. Unfortunately, most ML models require their training data to be formatted as a uniform matrix, with each user having exactly the same columns. It’s a time-consuming challenge to develop feature sets that capture all the nuanced trends and interactions of event streams.
At Zynga we’ve developed a technique to represent user game actions with temporal heatmap feature sets. Utilizing the power of PySpark, our generic data pipeline can generate thousands of features without the need to manually interpret the events of each game. The graphical structure of the heatmaps allows us to take advantage of established image classification techniques to make personalized user-level predictions. Within 30 minutes of a game install, Zynga is able to make accurate predictions on whether a new installer will churn or become a payer.
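Zynga's pipeline is PySpark, but the core idea of a temporal heatmap -- bucketing raw events into a fixed event-type × time-window count matrix, so every user yields identically shaped features regardless of which events occurred -- can be sketched with the standard library (the event taxonomy below is hypothetical):

```python
from collections import Counter

EVENT_TYPES = ["session_start", "level_up", "purchase"]  # hypothetical taxonomy
N_BUCKETS = 6          # e.g. the first 30 minutes split into 5-minute windows
BUCKET_SECONDS = 300

def heatmap_features(events):
    """Turn a user's raw (event_type, seconds_since_install) stream into a
    uniform len(EVENT_TYPES) x N_BUCKETS count matrix, flattened to one row."""
    counts = Counter()
    for event_type, ts in events:
        bucket = min(int(ts // BUCKET_SECONDS), N_BUCKETS - 1)
        counts[(event_type, bucket)] += 1
    # Every user gets the same columns, whether or not an event type occurred.
    return [counts.get((e, b), 0) for e in EVENT_TYPES for b in range(N_BUCKETS)]

row = heatmap_features([("session_start", 10), ("level_up", 400), ("level_up", 410)])
```

Because the output matrix has a fixed 2-D shape, standard image-classification models can be applied to it directly, which is the trick the abstract describes.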
This talk will present R as a programming language suited for solving data analysis and modeling problems, MLflow as an open source project to help organizations manage their machine learning lifecycle and the intersection of both by adding support for R in MLflow. It will be highly interactive and touch on some of the technical implementation choices taken while making R available in MLflow. It will also demonstrate using MLflow tracking, projects, and models directly from R as well as reusing R models in MLflow to interoperate with other programming languages and technologies.
This presentation is about Web APIs in general and MicroProfile GraphQL in particular. It has been used for EclipseCon 2020 and is backed by a GitHub project (link on slide 11).
This class will introduce the Forge platform from the perspective of an early adopter – starting with business aspects, paradigm shift, cloud concepts, and the future of Autodesk cloud platform strategy. We will cover some of the technical challenges with web programming from the perspective of someone migrating from a desktop programming environment to the cloud, and discuss how to overcome them. We will then walk through some simple yet representative code samples helping you to get started with the Forge platform through Model Derivative API and Design Automation services.
In this presentation, Suraj Kumar Paul of Valuebound walks us through GraphQL. Created by Facebook in 2012, GraphQL is a data query language that provides an alternative to REST and web service architectures.
Here he discusses the core ideas of GraphQL, the limitations of RESTful APIs, operations, arguments, fragments, variables, mutations, and more.
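The ideas listed above (operations, arguments, variables, fragments, and mutations) all fit in a few lines of GraphQL; the schema used here (`User` with `name` and `email`) is hypothetical:

```graphql
# A named query operation with a variable argument, reusing a fragment.
query GetUser($id: ID!) {
  user(id: $id) {
    ...userFields
  }
}

# A mutation writing through the same schema.
mutation RenameUser($id: ID!, $name: String!) {
  updateUser(id: $id, name: $name) {
    ...userFields
  }
}

# A fragment factors out a shared field selection.
fragment userFields on User {
  name
  email
}
```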
----------------------------------------------------------
Get Socialistic
Our website: http://valuebound.com/
LinkedIn: http://bit.ly/2eKgdux
Facebook: https://www.facebook.com/valuebound/
Presentation used in December 2017 monthly community call for SharePoint Patterns and Practices (PnP). Monthly summary on guidance, sample and community work. Also 3 specific live demos on SharePoint development.
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...Databricks
A traditional data team has roles including data engineer, data scientist, and data analyst. However, many organizations are finding success by integrating a new role – the analytics engineer. The analytics engineer develops a code-based data infrastructure that can serve both analytics and data science teams. He or she develops re-usable data models using the software engineering practices of version control and unit testing, and provides the critical domain expertise that ensures that data products are relevant and insightful. In this talk we’ll discuss the role and skill set of the analytics engineer, and how dbt, an open source data transformation framework, empowers anyone with a SQL skill set to fulfill this new role on the data team. We’ll demonstrate how to use dbt to build version-controlled data models on top of Delta Lake, test both the code and our assumptions about the underlying data, and orchestrate complete data pipelines on Apache Spark™.
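Concretely, a dbt model is just a version-controlled SELECT statement, and "testing assumptions about the data" is declared alongside it. A minimal sketch (the table and column names are hypothetical):

```sql
-- models/orders_enriched.sql: a re-usable model other models can ref()
select
    o.order_id,
    o.customer_id,
    o.amount,
    c.segment
from {{ ref('stg_orders') }} o
join {{ ref('stg_customers') }} c
  on o.customer_id = c.customer_id
```

```yaml
# models/schema.yml: declarative tests on the model's assumptions
models:
  - name: orders_enriched
    columns:
      - name: order_id
        tests: [unique, not_null]
```

`dbt run` materializes the model and `dbt test` checks the declared assumptions, which is what makes the workflow feel like software engineering rather than ad-hoc SQL.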
Learn how to build advanced GraphQL queries, how to work with filters and patches and how to embed GraphQL in languages like Python and Java. These slides are the second set in our webinar series on GraphQL.
SharePoint Framework, React and Office UI SPS Paris 2016 - d01Sonja Madsen
This session is about building client-side web parts, list-based and page-based applications on SharePoint. I'll show the workbench, the web part and a list based application, React and how to apply simple CSS styles for typography, color, icons, animations, and responsive grid layouts with Office UI Fabric.
This presentation is from the Integration Day event, a TechMeet360 Community Initiative, held on September 10, 2016 at Microsoft GSTC in Bangalore. In this slide, Microsoft's Escalation Engineers Tulika Chaudharie and Harikharan Krishnaraju explain using Azure Functions for Integration. The presentation starts with a general overview of Azure Functions and then it moves on to some of the common Integration Patterns and how Azure Functions fit into the scenarios.
A practical guidance of the enterprise machine learning Jesus Rodriguez
This session provides an analysis of the machine learning market in the enterprise. The analysis includes vendors, platforms and best practices that should be considered by companies implementing data science solutions at an enterprise scale
A Collaborative Data Science Development WorkflowDatabricks
Collaborative data science workflows have several moving parts, and many organizations struggle with developing an efficient and scalable process. Our solution consists of data scientists individually building and testing Kedro pipelines and measuring performance using MLflow tracking. Once a strong solution is created, the candidate pipeline is trained on cloud-agnostic, GPU-enabled containers. If this pipeline is production worthy, the resulting model is served to a production application through MLflow.
Training and Deploying Machine Learning Models Using Google Clo...Sotrender
Okay, I already have my great model in a notebook -- what next? Most machine learning courses and resources prepare us well for implementing machine learning algorithms and building more or less complicated models. In most cases, however, the model is only a small piece of a larger system, and deploying and maintaining it turns out in practice to be a time-consuming, error-prone process. The problem compounds when we have not one but several models to productionize. Although more and more tools and platforms to streamline this process appear every year, it is a topic that still receives relatively little attention.
In my presentation I will show the approaches, good practices, tools, and Google Cloud Platform services we use at Sotrender to efficiently train and productionize our ML models for analyzing social media data. I will discuss which DevOps aspects we pay attention to when building products based on ML models (MLOps), and how Google Cloud Platform makes them easy to adopt in your startup or company.
Presentation by Maciej Pieńkosz of Sotrender at Data Science Summit 2020
Slides from my talk at Big Data Conference 2018 in Vilnius
Doing data science today is far more difficult than it will be in the next 5-10 years. Sharing and collaborating on data science workflows is painful, and pushing models into production is challenging.
Let’s explore what Azure provides to ease Data Scientists’ pains. What tools and services can we choose based on a problem definition, skillset or infrastructure requirements?
In this talk, you will learn about Azure Machine Learning Studio, Azure Databricks, Data Science Virtual Machines and Cognitive Services, with all the perks and limitations.
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022GoDataDriven
Deploy your own modern data stack using open-source components and cloud-agnostic Terraform tooling. By leveraging open-source components you can deploy a state-of-the-art modern data platform in a day. What are the pros and cons of “build-it-yourself” in the data and analytics space?
Databricks is a popular tool used with large amounts of data, applying to many roles - including data analysts, data engineers, data scientists, and machine learning engineers. It can be found on many cloud platforms - including Azure, AWS, and GCP. In this talk, we will look at a flight-themed end-to-end solution using Azure Databricks, Azure Data Factory, Azure Storage, and Power BI. By the end of this session, you will have a better understanding of Databricks' capabilities and how it integrates with other Azure offerings.
Deploying ML models in production, with or without CI/CD, is significantly more complicated than deploying traditional applications. That is mainly because ML models do not just consist of the code used for their training, but they also depend on the data they are trained on and on the supporting code. Monitoring ML models also adds additional complexity beyond what is usually done for traditional applications. This talk will cover these problems and best practices for solving them, with special focus on how it's done on the Databricks platform.
Dutch Oracle Architects Platform - Reviewing Oracle OpenWorld 2017 and New Tr...Lucas Jellema
Not since the rise of Service Oriented Architecture (and the supporting Fusion Middleware technology) over a decade ago have we seen so much rapid change in terms of application and infrastructure architecture. Cloud, Microservices and DevOps are perhaps the most explicit examples – but many other developments in technology, architecture and even the industry at large have an impact on how enterprises consider and employ IT – such as machine learning, IoT, blockchain.
In this session for (infrastructure, solution, application, enterprise, security, data) architects, we will present the main stories, roadmaps, and technologies from Oracle OpenWorld 2017 (and JavaOne) that influence, shape, and enable architecture. We will brainstorm together on the consequences of the new directions outlined by Oracle – and coming our way from other quarters. We are seeing a lot of change. New opportunities arise – opportunities that may become challenges or threats if we fail to recognize and embrace the change in time. This session will help us all get a better handle on the winds in enterprise IT in general and in Oracle land in particular.
Among the topics we will present and discuss are:
- The Only Way is Up – the inevitable and imminent move from on premises to the cloud, and upwards in the stack – from IaaS to SaaS
- Security and Ops in a hybrid landscape (multiple clouds & on premises, multiple technologies & interaction channels)
- Autonomous Database – what, when, how
- Oracle’s cloud strategy, High PaaS and Low PaaS, open [source] technology (star of the show: Apache Kafka), and the commoditization of the traditional Oracle platform
- Container and Cloud Native at Oracle Cloud (Docker, Kubernetes Container Platform, Wercker, Istio Service Mesh, CNCF)
- Serverless
- Java Reborn – for microservices and cloud, modularized (highlights from the JavaOne conference)
- Disruptive: Blockchain, IoT, Machine Learning
Migrating Your Data Platform At a High Growth StartupDatabricks
At Abnormal Security, Spark has played a fundamental role in helping us create an ML system that detects thousands of sophisticated email threats every day. Initially, we set up our Spark infrastructure using YARN on EMR because we had previous experience with it. But after growing very quickly in a short amount of time, we found ourselves spending too much time solving problems with our Spark infrastructure and less time solving problems for our customers. Given we’re in a high growth environment where the only constant is change, we asked ourselves: aren’t these problems only going to get worse as we add more employees, more products, and more data?
Over the past few months, Abnormal Security executed a full migration of our Spark infrastructure to Databricks, not only improving cost, operational overhead, and developer productivity, but simultaneously laying the foundation for a modern Data Platform via the Lakehouse architecture.
In this talk, we’ll cover how we executed the migration in a few months’ time, from pre-Databricks POC, through the POC, to the migration itself. We’ll talk about how to figure out exactly what it is that you care about when evaluating Databricks, splitting the must-haves from the nice-to-haves, so that you can best utilize the Databricks trial and get concrete, measurable results. We’ll talk about the work we did to execute the migration and the tooling we created to minimize downtime and properly measure performance and cost. Finally, we’ll talk about how the move to Databricks not only solved problems with our legacy Spark infrastructure, but will save us huge amounts of time in the long run by adopting a Lakehouse architecture.
Tokyo Azure Meetup #7 - Introduction to Serverless Architectures with Azure F...Tokyo Azure Meetup
Serverless architecture is the next big shift in computing - completely abstracting the underlying infrastructure and focusing 100% on the business logic.
Today we can create applications directly in our browser and leave the decision of how they are hosted and scaled to the cloud provider. Moreover, this approach gives us incredible control over the granularity of our applications, since most of the time we are dealing with a single function at a time.
In this presentation we will cover:
• Introduce Serverless Architectures
• Talk about the advantages of Serverless Architectures
• Discuss event-driven computing in detail
• Cover common Serverless approaches
• See practical applications with Azure Functions
• Compare AWS Lambda and Azure Functions
• Talk about open source alternatives
• Explore the relation between Microservices and Serverless Architectures
Forge - DevCon 2016: From Desktop to the Cloud with ForgeAutodesk
Fernando Malard, OFCdesk
This class will introduce the Forge platform from the perspective of an early adopter – starting with business aspects, paradigm shift, cloud concepts, and the future of Autodesk cloud platform strategy. We will cover some of the technical challenges with web programming from the perspective of someone migrating from a desktop programming environment to the cloud, and discuss how to overcome them. We will then walk through some simple yet representative code samples helping you to get started with the Forge platform through Model Derivative API and Design Automation services.
After this presentation you will know how to:
- sell Drupal 8 to businesses in large enterprises
- plan migration of code and content
- technically migrate a lot of custom code and data
- automate migration process
- test migration and regression
- overcome migration challenges, based on a JYSK case
https://drupalcampkyiv.org/node/55
Accelerating Machine Learning on Databricks RuntimeDatabricks
"We all know the unprecedented potential impact for Machine Learning. But how do you take advantage of the myriad of data and ML tools now available? How do you streamline processes, speed up discovery, share knowledge, and scale up implementations for real-life scenarios?
In this talk, we'll cover some of the latest innovations brought into the Databricks Unified Analytics Platform for Machine Learning. In particular we will show you how to:
- Get started quickly using the Databricks Runtime for Machine Learning, that provides pre-configured Databricks clusters including the most popular ML frameworks and libraries, Conda support, performance optimizations, and more.
- Get started with the most popular Deep Learning frameworks within a few minutes and go deep with state-of-the-art DL model diagnostics tools.
- Scale up Deep Learning training workloads from a single machine to large clusters for the most demanding applications using the new HorovodRunner with ease.
- Expose all of these ML frameworks to large and distributed data using Databricks Runtime for Machine Learning."
Going Serverless - an Introduction to AWS GlueMichael Rainey
Going "serverless" is the latest technology trend for enterprises moving their processing to the cloud, including data integration and ETL tools. But what does that mean and when should I use serverless ETL? In this session, we'll dive into the world of Amazon's fully managed data processing service called AWS Glue. With no server to provision or resources to allocate, and an easy to populate metadata catalog, AWS Glue allows the data engineer to focus on his or her craft; building data transformations and pipelines. Gaining an understanding of the similarities and differences between traditional ETL tools, such as Oracle Data Integrator, and Glue will prepare attendees for the new world of data integration. Presented at Collaborate 18.
Architecting an Open Source AI Platform 2018 editionDavid Talby
How to build a scalable AI platform using open source software. The end-to-end architecture covers data integration, interactive queries & visualization, machine learning & deep learning, deploying models to production, and a full 24x7 operations toolset in a high-compliance environment.
BigQuery is an analytical database designed to scale to petabytes of data. To optimize BigQuery workloads, we need to use practices and patterns that take advantage of the BigQuery architecture.
Google Cloud Professional Data Engineer certification prepares machine learning engineers for running ML models in production. This includes DevOps tasks, such as monitoring and scaling.
Text mining techniques like sentiment analysis, topic modeling, named entity extraction, and event extraction are used to map unstructured text to conventional data store structures.
How to Measure Document Similarity and Build Text Classifiers: A First Look at Term Frequency-Inverse Document Frequency (TF-IDF) Representations
Text data is potentially valuable for many data science projects but working with text is different from working with structured data. One representation of text that has worked well for many text mining and machine learning applications is the term frequency - inverse document frequency (TF-IDF) vector. In spite of the long winded name, this method is easy to understand, performs well in many applications, and has been implemented in commonly used data science tools. This presentation will introduce TF-IDF and show examples of how to use TF-IDF for document classification and measuring the similarity between documents.
This presentation does not assume any background in text mining or natural language processing. Examples will use Python.
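The core TF-IDF idea the abstract describes can be sketched in a few lines of plain Python: weight each term by its in-document frequency times the log of its inverse document frequency, then compare documents with cosine similarity. This is a minimal illustration, not the presenter's code; real projects would use a library such as scikit-learn.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF vectors for a list of tokenized documents.

    TF is the raw term count in a document; IDF is log(N / df),
    where df is the number of documents containing the term.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    idf = {term: math.log(n / count) for term, count in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    "the cat sat on the mat".split(),
    "the cat chased the mouse".split(),
    "stock prices fell sharply today".split(),
]
vecs = tfidf_vectors(docs)
# The two cat documents should score as more similar than the cat/stock pair.
print(cosine_similarity(vecs[0], vecs[1]) > cosine_similarity(vecs[0], vecs[2]))
```

Note how "the", appearing in every document, gets an IDF of zero and so contributes nothing to similarity — exactly the behavior that makes TF-IDF robust to common stopwords.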
As relational and NoSQL database continue to adopt characteristic of each other, it becomes more important to understand that ACID-BASE is a spectrum. Instead of making a binary choice between ACID and BASE, developers and designers choose a combination of varying levels of data consistency, availability and network partition tolerance. This presentation briefly describes the ACID-BASE spectrum, the CAP Theorem and how to find the right balance of trade-offs for your application.
Big data, bioscience and the cloud biocatalyst june 2015 sullivanDan Sullivan, Ph.D.
One of the most challenging problems in bioscience is data integration. From subcellular studies to population simulations, we are faced with large volumes of difficult-to-integrate data. Presentation includes tips on getting started in big data bioscience.
Tools and Techniques for Analyzing Texts: Tweets to Intellectual PropertyDan Sullivan, Ph.D.
Text has evolved from a second-class citizen in the world of data management to a principal source of insight. In this class, you will learn ways of analyzing text (statistical, syntactic and semantic methods), common text mining tasks (classification, named entity extraction, and information extraction), and the advantages and disadvantages of various algorithms. The class begins with an overview of statistical text mining, syntactic parsing, and semantic representations. Statistical techniques will focus on n-grams and their advantages and limitations. Syntactic parsing is described along with a discussion of well-developed open-source parsers. The need for integration with structured data drives the discussion of semantic representations. Algorithms are introduced for classification with particular emphasis on term frequency – inverse document frequency (TF-IDF) representations and support vector machines (SVMs). This combination is widely used but there are limits to the precision and recall that one can achieve.
Alternative formulations, such as distributed word representations, are discussed in detail. The problem of named entity extraction is addressed using conditional random fields. New advances in applying neural networks to create distributed word representations and their advantages over TF-IDF representations are discussed. Examples will be drawn from a large-scale text mining project (approximately 25 million documents) that applies machine learning (neural networks and support vector machines) and statistical analysis. The class will also include a discussion of open-source tools for text mining that include R, Spark, NLTK and the Python scientific stack. The session will conclude with a checklist of tips for planning and managing large-scale text mining projects.
Document databases are more flexible in many ways than relational databases and this presents both opportunities and challenges. Poorly designed document structures adversely affect performance, increase maintenance overhead, and lead to unnecessarily complex application code. This presentation describes 5 commonly used design patters in document databases: one-to-many, many-to-many, simple table inheritance, trees and lookup patterns.
Text Mining for Biocuration of Bacterial Infectious DiseasesDan Sullivan, Ph.D.
Specialty gene sets, such as virulence factors and antibiotic resistance genes, are of particular interest to infectious disease researchers. Much of the information about specialty genes’ function is described in literature but unavailable as structured data in bioinformatics databases. The steadily increasing volume of literature makes it difficult to manually find relevant papers and extract assertion sentences about specialty genes. This presentation describes efforts to build an automatic classifier for such sentences. Experiments were conducted to assess the impact of the imbalance of positive and negative examples in source documents on classification; develop a support vector machine (SVM) classifier using term frequency-inverse document frequency (TF-IDF) representation of text; and assess the marginal benefit of additional training examples on the quality of the classifier. Analysis of learning curves indicates that additional training examples will not likely improve the quality of the classifier. We discuss options for other text representation schemes to investigate in order to improve the quality of the classifier as measured by F-score.
Bioinformaticians constantly face challenges with data: from the large volumes of data to the need to integrate diverse data types. Relational databases have a long and successful history of managing data but have been unable to meet emerging needs of big data and highly integrated data stores. This talk discusses the limitations we face when using relational data models for bioinformatics applications. It describes the features, limitations and use cases of four alternative database models: key value databases, document databases, wide column data stores and graph databases. Use in bioinformatics applications is demonstrate with text mining and atherosclerosis research projects. The talk concludes with guidance on choosing an appropriate database model for varying bioinformatics requirements.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
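For reference, the "Monolithic PageRank" baseline the report compares against is standard power iteration, with dead ends (vertices with no out-edges) handled by redistributing their rank across all vertices each iteration. A minimal, non-distributed sketch in plain Python (illustrative only; the report's actual CPU/GPU implementations differ):

```python
def pagerank(adj, damping=0.85, iters=100):
    """Monolithic (standard) PageRank by power iteration.

    adj maps each vertex to its list of out-neighbors. Dead ends
    (vertices with no out-edges) are handled by spreading their
    rank uniformly over all vertices each iteration.
    """
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        # Rank that would otherwise leak out at dead-end vertices.
        leaked = sum(rank[v] for v, outs in adj.items() if not outs)
        base = (1.0 - damping) / n + damping * leaked / n
        new = {v: base for v in adj}
        for v, outs in adj.items():
            if outs:
                share = damping * rank[v] / len(outs)
                for u in outs:
                    new[u] += share
        rank = new
    return rank

# Tiny graph with a dead end at vertex "d".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": []}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # → c
```

Because the leaked rank is redistributed, the ranks still sum to one — the precondition Levelwise PageRank avoids by requiring a dead-end-free input graph.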
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
2. Bio
• Principal Engineer, PEAK6 Technologies
• Author
• Instructor
• Udemy
• Google Cloud
• LinkedIn Learning
• Data Science
• Machine Learning
• Databases & Data Modeling
4. Overview
• Machine Learning Workflow
• Formulating an ML Problem
• Building ML Models in GCP
• Data Engineering
• Monitoring and Evaluating Fairness
6. 0. Machine Learning Workflow
• Formulate problem
• Identify data sources
• Prepare data
• Train, evaluate, and tune model
• Deploy model
• Use model in production
• Monitor and Evaluate Fairness
https://thenounproject.com/term/workflow/2409348/
7. Define the Problem to Be Solved
• Informal description
• What is the value of solving the problem?
• How can the problem be solved?
• Regression
• Classification
https://static.thenounproject.com/png/230138-200.png
8. Identify Data Sources
• Amount of data available
• Quality
• Rate of generation
• Requirements to access
• Limitations on the use of data
https://commons.wikimedia.org/wiki/File:Data_types_-_en.svg
11. Cloud AutoML
• Designed for model builders with limited ML experience
• GUI for training, evaluating, and tuning
• Services for sight, language, and structured data
• AutoML Tables uses structured data to build regression and classification models
12. AI Platform Training
• Trains and runs models built in
• TensorFlow
• scikit-learn
• XGBoost
• Hosted frameworks, but can run custom containers
• Service provisions compute resources needed for a job and then executes the job
13. Kubeflow
• Kubeflow is a machine learning toolkit for Kubernetes
• Packages models like applications
• Compose, deploy, and manage ML workflows
14. Dataproc and Spark ML
• Dataproc is a managed Spark and Hadoop service
• Spark ML is a machine learning library
• ML algorithms
• Feature engineering
• Pipelines
• Persistence
• Utilities
15. BigQuery ML
• BigQuery is a serverless analytical database
• BigQuery ML brings machine learning functions to SQL
• Key advantages are:
• Ability to train and run models in BigQuery
• Use SQL, not Python or ML frameworks
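To give a feel for the SQL-only workflow, here is a small helper that builds a BigQuery ML `CREATE MODEL` statement. The dataset, table, and column names are hypothetical placeholders, and actually executing the statement would require a BigQuery client and project; only the statement construction is shown.

```python
def create_model_sql(model_name, model_type, label_col, source_table):
    """Build a BigQuery ML CREATE MODEL statement.

    All names passed in are placeholders for illustration; in BigQuery
    the statement is run directly as SQL, with no Python required.
    """
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS(model_type='{model_type}', input_label_cols=['{label_col}']) AS\n"
        f"SELECT * FROM `{source_table}`"
    )

sql = create_model_sql(
    "mydataset.churn_model",        # hypothetical model name
    "logistic_reg",                 # BigQuery ML classification model type
    "churned",                      # hypothetical label column
    "mydataset.customer_features",  # hypothetical training table
)
print(sql)
```

Once trained, the model is queried with `ML.PREDICT` in the same SQL dialect, which is what lets analysts stay entirely inside BigQuery.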
17. Cloud Composer
• Managed Apache Airflow service
• Executes workflows defined in directed acyclic graphs (DAGs)
• Accessed through the console or command line (gcloud composer environments)
18. Apache Airflow DAGs
• A workflow is a collection of tasks with dependencies
• DAGs stored in Cloud Storage
• Supports custom plugins for operators, hooks, and interfaces
• Python dependencies (packages)
19. DAGs are Python Programs
Source: https://cloud.google.com/composer/docs/how-to/using/writing-dags
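A minimal DAG sketch in the style of the linked docs, with illustrative task names and schedule (not a production workflow definition):

```python
# Minimal Airflow DAG sketch; dag_id, schedule, and tasks are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract():
    print("pull training data")

def train():
    print("train the model")

with DAG(
    dag_id="ml_training",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    train_task = PythonOperator(task_id="train", python_callable=train)
    extract_task >> train_task  # train runs only after extract succeeds
```

In Cloud Composer, dropping a file like this into the environment's DAGs bucket in Cloud Storage is enough for the scheduler to pick it up.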
20. Cloud Composer Environments
• Deployed in environments, which are collections of GCP service components based on Kubernetes Engine
• Uses a combination of tenant and customer project resources
22. Cloud Data Fusion
• Managed service based on the open source CDAP data analytics platform
• Code-free ETL/ELT development tool
• Over 160 connectors and transformations
• Drag-and-drop ETL/ELT construction
23. Execution Environment
• Cloud Data Fusion deployed as an instance
• Two editions
• Basic – visual designer, transformations, SDK, etc.
• Enterprise – Basic plus streaming pipelines, integration metadata repository, high availability, triggers, schedules, etc.
28. Monitoring ML Models: Best Practices
• Monitor for data skew
• Watch for changes in dependencies
• Refresh models as needed
• Assess model prediction quality
• Test for unfairness
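"Monitor for data skew" can be made concrete with even a very simple check: compare the serving-time feature distribution against the training distribution and alert when it drifts too far. The sketch below flags skew when the serving mean moves more than a chosen number of training standard deviations; the threshold and data are illustrative, and production systems typically compare full distributions (e.g. with PSI or KL divergence).

```python
import statistics

def skew_alert(train_values, serving_values, threshold=3.0):
    """Flag data skew when the serving mean drifts more than
    `threshold` training standard deviations from the training mean.

    A deliberately simple check for illustration; real monitoring
    compares whole distributions, not just means.
    """
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    drift = abs(statistics.mean(serving_values) - mu) / sigma
    return drift > threshold

train = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]
print(skew_alert(train, [10.1, 9.9, 10.4]))   # → False (similar distribution)
print(skew_alert(train, [25.0, 26.5, 24.8]))  # → True (clearly drifted)
```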
29. Fairness
• Anti-classification
• Protected attributes not used in the model
• Example: gender
• Classification parity
• Measures of predictive performance are equal across groups
• Calibration
• Outcomes are independent of protected attributes, conditional on risk scores
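Classification parity is straightforward to check in code: compute the same performance metric (here, true positive rate) for each group and look at the largest gap. This is an illustrative sketch using toy labels, not a complete fairness audit; libraries such as Fairlearn cover many more metrics.

```python
def true_positive_rate(y_true, y_pred):
    """Fraction of actual positives the model correctly flags."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    positives = sum(y_true)
    return tp / positives if positives else 0.0

def classification_parity_gap(y_true, y_pred, groups):
    """Largest TPR difference between any two groups.

    A gap of 0 means perfect classification parity on this metric.
    """
    rates = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        rates[g] = true_positive_rate([y_true[i] for i in idx],
                                      [y_pred[i] for i in idx])
    return max(rates.values()) - min(rates.values()), rates

# Toy data: group "a" has one missed positive, group "b" has none.
y_true = [1, 1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
groups = ["a", "a", "a", "b", "b", "b"]
gap, per_group = classification_parity_gap(y_true, y_pred, groups)
print(gap, per_group)  # gap of 0.5 between groups a and b
```

The same pattern works for false positive rate, precision, or any other per-group metric one chooses to equalize.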
31. Quick Summary
• Machine learning workflows are multi-step
• Automated ML addresses some, but not all, steps
• Lots of data engineering and monitoring is still required