Production Grade Data Science for Hadoop


This deck discusses scaling machine learning models from a laboratory setting to production. It proposes PMML as a standardized representation for models produced by R and Scikit-Learn, so that the same model can be deployed across different frameworks and languages. It outlines APIs for evaluating, maintaining, and integrating models as reusable functions within data pipelines built on Hadoop-ecosystem frameworks such as Spark, Pig, and Cascading. The goal is a portable, platform-agnostic architecture for operationalizing machine learning based on open standards.

Production Grade
Data Science for Hadoop
Villu Ruusmann
Openscoring OÜ

About openscoring.io
"Standards-based, Open-source Middleware for
Predictive Analytics Applications"
aka
"Rapid deployment of R and Scikit-Learn models on JVM"
2/25

From grams (to kilograms) to megagrams
Scaling the vessel / Hardware
Scaling the chemical reaction / Software and processes
4/25

Broader objectives
● Platform
○ Portability of applications
● Application
○ Central governance and dissemination of models
● Model
○ "Decisioning as a Service"
● Decision
○ Traceability, reproducibility, explainability
6/25

R and
Scikit-Learn
PMML
PFA
Java and C
Domain-Specific Languages
General-Purpose Languages
7/25

X model producers
Y model consumers
(ten years into past & future)
1 model API
9/25

Model API pipeline
Ephemeral: Conversion → Deployment
Persistent, asset-like: Conversion → Maintenance → Deployment
10/25

Conversion into PMML
Capturing and expressing the essentials of the modeling
workflow using PMML vocabulary:
Input → Feature vector → Response vector → Output
Connecting stable data schemas
11/25

Standardized representation
R: ada, cforest, gbm, randomForest, xgb.Booster
Scikit-Learn: AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier
<MiningModel function="classification">
  <Segmentation multipleModelMethod="weightedAverage">
    <Segment id="1" weight="1">
      <True/>
      <TreeModel>
        <Node>
          ...
        </Node>
      </TreeModel>
    </Segment>
    ...
  </Segmentation>
</MiningModel>
14/25

Supra-standardized representation
Model model = MiningModelUtil.createClassifierEnsemble(
    MultipleModelMethodType.WEIGHTED_AVERAGE,
    Arrays.asList(
        PMMLUtil.loadModel("xgboost.pmml"),
        PMMLUtil.loadModel("keras-mlp.pmml"),
        PMMLUtil.loadModel("sklearn-rf.pmml")
    ),
    Arrays.asList(5d/9d, 2d/9d, 2d/9d)
);
PMMLUtil.storeModel(model, "kaggle-submission.pmml");
15/25
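
The arithmetic behind a weighted-average ensemble can be sketched in plain Java. This is an illustration only, not JPMML code: the class `WeightedAverageSketch`, the method `combine` and the example probabilities are all hypothetical, and dividing by the sum of the weights is an assumption of this sketch. Only the weights 5/9, 2/9, 2/9 come from the slide.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of any JPMML library: it only illustrates
// the arithmetic of averaging class probabilities with per-model weights.
final class WeightedAverageSketch {

    // Combines per-model class probabilities into one distribution.
    // Assumption: weights are normalized by their sum, so they need not add up to 1.
    static Map<String, Double> combine(List<Map<String, Double>> predictions, List<Double> weights) {
        if (predictions.size() != weights.size()) {
            throw new IllegalArgumentException("Need exactly one weight per model");
        }
        double weightSum = 0d;
        for (double w : weights) {
            weightSum += w;
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (int i = 0; i < predictions.size(); i++) {
            double weight = weights.get(i) / weightSum;
            for (Map.Entry<String, Double> entry : predictions.get(i).entrySet()) {
                // Accumulate the weighted probability for each class label
                result.merge(entry.getKey(), weight * entry.getValue(), Double::sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up probabilities from three member models for a single record
        List<Map<String, Double>> predictions = Arrays.asList(
            Map.of("yes", 0.9, "no", 0.1),
            Map.of("yes", 0.6, "no", 0.4),
            Map.of("yes", 0.3, "no", 0.7)
        );
        // Weights as on the slide
        List<Double> weights = Arrays.asList(5d / 9d, 2d / 9d, 2d / 9d);
        System.out.println(combine(predictions, weights));
    }
}
```

With these made-up inputs the heavily weighted first model dominates the combined distribution, which is the point of giving it 5/9 of the vote.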

State machine for model maintenance
[Diagram: model states Raw, Standardized, Enhanced and Optimized]
16/25

Model as a function
Target = f(Active_1, Active_2, ..., Active_n)
Output_feature = f_feature(Target)
17/25
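
A minimal sketch of the "model as a function" idea, using only `java.util.function.Function`. The field names `x1` and `x2`, the coefficients and the 3.0 threshold are invented for illustration; a real model would be loaded from a PMML file rather than written inline.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration, not JPMML code
final class ModelAsFunctionSketch {

    // Target = f(Active_1, ..., Active_n): a stand-in linear model over two active fields.
    // Field names and coefficients are made up.
    static final Function<Map<String, Double>, Double> MODEL =
        active -> 1.5 + 2.0 * active.get("x1") - 0.5 * active.get("x2");

    // Output_feature = f_feature(Target): a derived output, here a threshold decision.
    static final Function<Double, String> DECISION =
        target -> (target >= 3.0) ? "accept" : "reject";

    // Composing the two gives one function from input record to decision
    static final Function<Map<String, Double>, String> PIPELINE = MODEL.andThen(DECISION);

    public static void main(String[] args) {
        Map<String, Double> record = Map.of("x1", 1.0, "x2", 2.0);
        System.out.println(MODEL.apply(record));
        System.out.println(PIPELINE.apply(record));
    }
}
```

Treating both steps as functions means they compose like any other function, which is what lets a model slot into a data pipeline as a single reusable step.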

Model metadata API
Evaluator evaluator = getEvaluator();
List<FieldName> activeFields = evaluator.getActiveFields();
for(FieldName activeField : activeFields){
    DataField dataField = evaluator.getDataField(activeField);
    // Inspect data type, operational type, value space etc.
}
18/25

Model evaluation API
Evaluator evaluator = getEvaluator();
while(!done){
    Map<FieldName, ?> arguments = readInRecord();
    Map<FieldName, ?> results = evaluator.evaluate(arguments);
    writeOutRecord(results);
}
19/25
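
The read, evaluate, write loop above can be made self-contained with an in-memory source and sink. This is a sketch under stated assumptions: `SimpleEvaluator` is a hypothetical stand-in for the slide's `Evaluator` (it is not the JPMML interface), plain `String` keys replace `FieldName`, and the "amount" field with its threshold is invented.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

final class ScoringLoopSketch {

    // Hypothetical stand-in for the Evaluator on the slide; not the JPMML interface
    interface SimpleEvaluator {
        Map<String, Object> evaluate(Map<String, Object> arguments);
    }

    // Reads records until the source is exhausted, scores each one
    // and hands the result to the sink. Returns the number of records scored.
    static int scoreAll(SimpleEvaluator evaluator,
                        Iterator<Map<String, Object>> source,
                        Consumer<Map<String, Object>> sink) {
        int count = 0;
        while (source.hasNext()) {
            Map<String, Object> arguments = source.next();
            sink.accept(evaluator.evaluate(arguments));
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Toy evaluator: flags records whose "amount" exceeds a made-up threshold
        SimpleEvaluator evaluator = arguments -> {
            double amount = ((Number) arguments.get("amount")).doubleValue();
            Map<String, Object> results = new LinkedHashMap<>();
            results.put("flag", amount > 100d);
            return results;
        };
        Map<String, Object> first = Map.of("amount", 42);
        Map<String, Object> second = Map.of("amount", 250);
        List<Map<String, Object>> input = List.of(first, second);
        List<Map<String, Object>> output = new ArrayList<>();
        int scored = scoreAll(evaluator, input.iterator(), output::add);
        System.out.println(scored + " records scored: " + output);
    }
}
```

Because the loop depends only on an iterator and a consumer, the same scoring code could sit behind a file reader, a queue or a framework-specific record source.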

http://github.com/jpmml/jpmml-${framework}
Volume
Velocity
{ REST }
20/25

JPMML-Cascading
Evaluator evaluator = getEvaluator();
PMMLPlanner pmmlPlanner = new PMMLPlanner(evaluator);
pmmlPlanner.setHeadName("input");
pmmlPlanner.setTailName("output");
FlowDef flowDef = ...;
flowDef.addAssemblyPlanner(pmmlPlanner);
21/25

JPMML-Pig
grunt> REGISTER jpmml-pig-distributable-1.0.jar;
grunt> DEFINE my_udf org.jpmml.pig.PMMLFunc('model.pmml');
grunt> output = FOREACH input GENERATE my_udf(*);
22/25

JPMML-Spark
Evaluator evaluator = getEvaluator();
PMMLFunction pmmlFunction = new PMMLFunction(evaluator);
JavaRDD<Row> input = ...;
JavaRDD<Row> output = input.map(pmmlFunction);
23/25

Thoughts on API-driven architecture
● Based on a relevant standard
○ "Conventions over configuration"
● High(est) abstraction level
○ Productivity
○ Maintainability
● End-to-end value proposition
○ Separation of concerns
24/25

Q&A
villu@openscoring.io
http://openscoring.io
http://github.com/jpmml
25/25



More from DataWorks Summit/Hadoop Summit

Running Apache Spark & Apache Zeppelin in Production

DataWorks Summit/Hadoop Summit

This document discusses running Apache Spark and Apache Zeppelin in production. It begins by introducing the author and their background. It then covers security best practices for Spark deployments, including authentication using Kerberos, authorization using Ranger/Sentry, encryption, and audit logging. Different Spark deployment modes like Spark on YARN are explained. The document also discusses optimizing Spark performance by tuning executor size and multi-tenancy. Finally, it covers security features for Apache Zeppelin like authentication, authorization, and credential management.

State of Security: Apache Spark & Apache Zeppelin

DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger

DataWorks Summit/Hadoop Summit

The document discusses the Virtual Data Connector project which aims to leverage Apache Atlas and Apache Ranger to provide unified metadata and access governance across data sources. Key points include: - The project aims to address challenges of understanding, governing, and controlling access to distributed data through a centralized metadata catalog and policies. - Apache Atlas provides a scalable metadata repository while Apache Ranger enables centralized access governance. The project will integrate these using a virtualization layer. - Enhancements to Atlas and Ranger are proposed to better support the project's goals around a unified open metadata platform and metadata-driven governance. - An initial minimum viable product will be built this year with the goal of an open, collaborative ecosystem around shared

Enabling Digital Diagnostics with a Data Science Platform

DataWorks Summit/Hadoop Summit

This document discusses using a data science platform to enable digital diagnostics in healthcare. It provides an overview of healthcare data sources and Yale/YNHH's data science platform. It then describes the data science journey process using a clinical laboratory use case as an example. The goal is to use big data and machine learning to improve diagnostic reproducibility, throughput, turnaround time, and accuracy for laboratory testing by developing a machine learning algorithm and real-time data processing pipeline.

Revolutionize Text Mining with Spark and Zeppelin

DataWorks Summit/Hadoop Summit

This document discusses using Apache Spark and MLlib for text mining on big data. It outlines common text mining applications, describes how Spark and MLlib enable scalable machine learning on large datasets, and provides examples of text mining workflows and pipelines that can be built with Spark MLlib algorithms and components like tokenization, feature extraction, and modeling. It also discusses customizing ML pipelines and the Zeppelin notebook platform for collaborative data science work.

Double Your Hadoop Performance with Hortonworks SmartSense

DataWorks Summit/Hadoop Summit

This document compares the performance of Hive and Spark when running the BigBench benchmark. It outlines the structure and use cases of the BigBench benchmark, which aims to cover common Big Data analytical properties. It then describes sequential performance tests of Hive+Tez and Spark on queries from the benchmark using a HDInsight PaaS cluster, finding variations in performance between the systems. Concurrency tests are also run by executing multiple query streams in parallel to analyze throughput.

Hadoop Crash Course

DataWorks Summit/Hadoop Summit

The document discusses modern data applications and architectures. It introduces Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. Hadoop provides massive scalability and easy data access for applications. The document outlines the key components of Hadoop, including its distributed storage, processing framework, and ecosystem of tools for data access, management, analytics and more. It argues that Hadoop enables organizations to innovate with all types and sources of data at lower costs.

Data Science Crash Course

DataWorks Summit/Hadoop Summit

This document provides an overview of data science and machine learning. It discusses what data science and machine learning are, including extracting insights from data and computers learning without being explicitly programmed. It also covers Apache Spark, which is an open source framework for large-scale data processing. Finally, it discusses common machine learning algorithms like regression, classification, clustering, and dimensionality reduction.

Apache Spark Crash Course

DataWorks Summit/Hadoop Summit

This document provides an overview of Apache Spark, including its capabilities and components. Spark is an open-source cluster computing framework that allows distributed processing of large datasets across clusters of machines. It supports various data processing workloads including streaming, SQL, machine learning and graph analytics. The document discusses Spark's APIs like DataFrames and its libraries like Spark SQL, Spark Streaming, MLlib and GraphX. It also provides examples of using Spark for tasks like linear regression modeling.

Dataflow with Apache NiFi

DataWorks Summit/Hadoop Summit

This document provides an overview of Apache NiFi and dataflow. It begins with an introduction to the challenges of moving data effectively within and between systems. It then discusses Apache NiFi's key features for addressing these challenges, including guaranteed delivery, data buffering, prioritized queuing, and data provenance. The document outlines NiFi's architecture and components like repositories and extension points. It also previews a live demo and invites attendees to further discuss Apache NiFi at a Birds of a Feather session.

Schema Registry - Set you Data Free

DataWorks Summit/Hadoop Summit

Many Organizations are currently processing various types of data and in different formats. Most often this data will be in free form, As the consumers of this data growing it’s imperative that this free-flowing data needs to adhere to a schema. It will help data consumers to have an expectation of about the type of data they are getting and also they will be able to avoid immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the Data Pipeline a really easy way to integrate and support various systems that use different data formats. SchemaRegistry is a central repository for storing, evolving schemas. It provides an API & tooling to help developers and users to register a schema and consume that schema without having any impact if the schema changed. Users can tag different schemas and versions, register for notifications of schema changes with versions etc. In this talk, we will go through the need for a schema registry and schema evolution and showcase the integration with Apache NiFi, Apache Kafka, Apache Storm.

Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...

DataWorks Summit/Hadoop Summit

There is increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model could take hours. This is a problem when the model needs to be more up-to-date. For example, when recommending TV programs while they are being transmitted the model should take into consideration users who watch a program at that time. The promise of online recommendation systems is fast adaptation to changes, but methods of online machine learning from streams is commonly believed to be more restricted and hence less accurate than batch trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system for uniting batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.

Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...

DataWorks Summit/Hadoop Summit

DeepLearning is not just a hype - it outperforms state-of-the-art ML algorithms. One by one. In this talk we will show how DeepLearning can be used for detecting anomalies on IoT sensor data streams at high speed using DeepLearning4J on top of different BigData engines like ApacheSpark and ApacheFlink. Key in this talk is the absence of any large training corpus since we are using unsupervised machine learning - a domain current DL research threats step-motherly. As we can see in this demo LSTM networks can learn very complex system behavior - in this case data coming from a physical model simulating bearing vibration data. Once draw back of DeepLearning is that normally a very large labaled training data set is required. This is particularly interesting since we can show how unsupervised machine learning can be used in conjunction with DeepLearning - no labeled data set is necessary. We are able to detect anomalies and predict braking bearings with 10 fold confidence. All examples and all code will be made publicly available and open sources. Only open source components are used.

Mool - Automated Log Analysis using Data Science and ML

DataWorks Summit/Hadoop Summit

QE automation for large systems is a great step forward in increasing system reliability. In the big-data world, multiple components have to come together to provide end-users with business outcomes. This means, that QE Automations scenarios need to be detailed around actual use cases, cross-cutting components. The system tests potentially generate large amounts of data on a recurring basis, verifying which is a tedious job. Given the multiple levels of indirection, the false positives of actual defects are higher, and are generally wasteful. At Hortonworks, we’ve designed and implemented Automated Log Analysis System - Mool, using Statistical Data Science and ML. Currently the work in progress has a batch data pipeline with a following ensemble ML pipeline which feeds into the recommendation engine. The system identifies the root cause of test failures, by correlating the failing test cases, with current and historical error records, to identify root cause of errors across multiple components. The system works in unsupervised mode with no perfect model/stable builds/source-code version to refer to. In addition the system provides limited recommendations to file/open past tickets and compares run-profiles with past runs.

How Hadoop Makes the Natixis Pack More Efficient

DataWorks Summit/Hadoop Summit

Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together. This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear: • How and why the business and IT requirements originated • How we leverage the platform to fulfill security and production requirements • How we organize a community to: o Guard all the players, no one gets left on the ground! o Us the platform appropriately (Not every problem is eligible for Big Data and standard databases are not dead) • What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match! DETAILS This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.

HBase in Practice

DataWorks Summit/Hadoop Summit

HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.

The Challenge of Driving Business Value from the Analytics of Things (AOT)

DataWorks Summit/Hadoop Summit

There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. Which means that, now, companies have new ways to transform their businesses – both operationally, and through their products and services – by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is “no” in most cases. In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.

Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop

DataWorks Summit/Hadoop Summit

In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotify's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HDFS (HopsFS) and the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared-nothing database. For YARN, we will discuss optimizations that enable 2X throughput increases for the Capacity scheduler, enabling scalability to clusters with >20K nodes. We will discuss the journey of how we reached this milestone, including some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open source and supports a pluggable database backend for distributed metadata, although it currently only supports MySQL Cluster as a backend. Hops opens up the potential for new directions for Hadoop when metadata is available for tinkering in a mature relational database.

From Regulatory Process Verification to Predictive Maintenance and Beyond wit...

DataWorks Summit/Hadoop Summit

In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs. Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.

Backup and Disaster Recovery in Hadoop

DataWorks Summit/Hadoop Summit

While you could be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces you to the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows you a viable approach using built-in tools. You will also learn not to take this topic lightly and what is needed to implement and guarantee continuous operation of Hadoop cluster based solutions.

Recently uploaded

A Comprehensive Guide to DeFi Development Services in 2024

Intelisync

DeFi represents a paradigm shift in the financial industry. Instead of relying on traditional, centralized institutions like banks, DeFi leverages blockchain technology to create a decentralized network of financial services. This means that financial transactions can occur directly between parties, without intermediaries, using smart contracts on platforms like Ethereum. In 2024, we are witnessing an explosion of new DeFi projects and protocols, each pushing the boundaries of what’s possible in finance. In summary, DeFi in 2024 is not just a trend; it’s a revolution that democratizes finance, enhances security and transparency, and fosters continuous innovation. As we proceed through this presentation, we'll explore the various components and services of DeFi in detail, shedding light on how they are transforming the financial landscape. At Intelisync, we specialize in providing comprehensive DeFi development services tailored to meet the unique needs of our clients. From smart contract development to dApp creation and security audits, we ensure that your DeFi project is built with innovation, security, and scalability in mind. Trust Intelisync to guide you through the intricate landscape of decentralized finance and unlock the full potential of blockchain technology. Ready to take your DeFi project to the next level? Partner with Intelisync for expert DeFi development services today!

Main news related to the CCS TSI 2023 (2023/1695)

Jakub Marek

An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers. The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 . The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .

Choosing The Best AWS Service For Your Website + API.pptx

Brandon Minnick, MBA

Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API? Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose? Which one is cheapest? Which one is fastest? Which one will scale to meet our needs? Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!

Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr

saastr

Monitoring and Managing Anomaly Detection on OpenShift.pdf

Tosin Akinosho

Monitoring and Managing Anomaly Detection on OpenShift

Overview: Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.

Key Topics Covered
1. Introduction to Anomaly Detection - Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT) - Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD? - Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices - Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3 - Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake - Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus? - Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus - Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K? - Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines - Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook? - Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples - Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.

Your One-Stop Shop for Python Success: Top 10 US Python Development Providers

akankshawande

Fueling AI with Great Data with Airbyte Webinar

Zilliz

Presentation of the OECD Artificial Intelligence Review of Germany

innovationoecd

Serial Arm Control in Real Time Presentation

tolgahangng

Trusted Execution Environment for Decentralized Process Mining

LucaBarbaro3

Columbus Data & Analytics Wednesdays - June 2024

Jason Packer

Operating System Used by Users in day-to-day life.pptx

Pravash Chandra Das

Dive into the realm of operating systems (OS) with Pravash Chandra Das, a seasoned Digital Forensic Analyst, as your guide. 🚀 This comprehensive presentation illuminates the core concepts, types, and evolution of OS, essential for understanding modern computing landscapes. Beginning with the foundational definition, Das clarifies the pivotal role of OS as system software orchestrating hardware resources, software applications, and user interactions. Through succinct descriptions, he delineates the diverse types of OS, from single-user, single-task environments like early MS-DOS iterations, to multi-user, multi-tasking systems exemplified by modern Linux distributions. Crucial components like the kernel and shell are dissected, highlighting their indispensable functions in resource management and user interface interaction. Das elucidates how the kernel acts as the central nervous system, orchestrating process scheduling, memory allocation, and device management. Meanwhile, the shell serves as the gateway for user commands, bridging the gap between human input and machine execution. 💻 The narrative then shifts to a captivating exploration of prominent desktop OSs, Windows, macOS, and Linux. Windows, with its globally ubiquitous presence and user-friendly interface, emerges as a cornerstone in personal computing history. macOS, lauded for its sleek design and seamless integration with Apple's ecosystem, stands as a beacon of stability and creativity. Linux, an open-source marvel, offers unparalleled flexibility and security, revolutionizing the computing landscape. 🖥️ Moving to the realm of mobile devices, Das unravels the dominance of Android and iOS. Android's open-source ethos fosters a vibrant ecosystem of customization and innovation, while iOS boasts a seamless user experience and robust security infrastructure. Meanwhile, discontinued platforms like Symbian and Palm OS evoke nostalgia for their pioneering roles in the smartphone revolution. 
The journey concludes with a reflection on the ever-evolving landscape of OS, underscored by the emergence of real-time operating systems (RTOS) and the persistent quest for innovation and efficiency. As technology continues to shape our world, understanding the foundations and evolution of operating systems remains paramount. Join Pravash Chandra Das on this illuminating journey through the heart of computing. 🌟

HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU

panagenda

Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/ DLAU and licensing under the CCB and CCX models have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefit it brings you. Above all, you surely want to stay within budget and save costs wherever possible. We understand that and want to help! We explain how to resolve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some practices that can lead to unnecessary spending, for example using a person document instead of a mail-in for shared mailboxes. We show you such cases and their solutions. And of course we explain the new licensing model. Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It gives you the tools and know-how to keep track of everything. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future. Topics covered: - Reducing license costs by finding and fixing misconfigurations and superfluous accounts - How do CCB and CCX licenses really work? - Understanding the DLAU tool and how best to use it - Tips for common problem areas such as team mailboxes, functional/test users, etc. - Practical examples and best practices you can apply immediately

Artificial Intelligence for XMLDevelopment

Octavian Nadolu

In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject. We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup. Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved. The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring. The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise. By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. 
We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.

Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...

Tatiana Kojar

Skybuffer AI, built on the robust SAP Business Technology Platform (SAP BTP), is the latest and most advanced version of our AI development, reaffirming our commitment to delivering top-tier AI solutions. Skybuffer AI harnesses all the innovative capabilities of the SAP BTP in the AI domain, from Conversational AI to cutting-edge Generative AI and Retrieval-Augmented Generation (RAG). It also helps SAP customers safeguard their investments into SAP Conversational AI and ensure a seamless, one-click transition to SAP Business AI. With Skybuffer AI, various AI models can be integrated into a single communication channel such as Microsoft Teams. This integration empowers business users with insights drawn from SAP backend systems, enterprise documents, and the expansive knowledge of Generative AI. And the best part of it is that it is all managed through our intuitive no-code Action Server interface, requiring no extensive coding knowledge and making the advanced AI accessible to more users.

Programming Foundation Models with DSPy - Meetup Slides

Zilliz

Introduction of Cybersecurity with OSS at Code Europe 2024

Hiroshi SHIBATA

I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems. The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS. Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application. I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.

Letter and Document Automation for Bonterra Impact Management (fka Social Sol...

Jeffrey Haguewood

Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows. We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases. This video focuses on automated letter generation for Bonterra Impact Management using Google Workspace or Microsoft 365. Interested in deploying letter generation automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.

Taking AI to the Next Level in Manufacturing.pdf

ssuserfac0301

Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as: 1. How quickly AI is being implemented in manufacturing. 2. Which barriers stand in the way of AI adoption. 3. How data quality and governance form the backbone of AI. 4. Organizational processes and structures that may inhibit effective AI adoption. 6. Ideas and approaches to help build your organization's AI strategy.

dbms calicut university B. sc Cs 4th sem.pdf

Shinana2

Production Grade Data Science for Hadoop

1. Production Grade Data Science for Hadoop Villu Ruusmann Openscoring OÜ

2. About openscoring.io "Standards-based, Open-source Middleware for Predictive Analytics Applications" aka "Rapid deployment of R and Scikit-Learn models on JVM" 2/25

3. "From Laboratory to Factory" 3/25

4. From grams (to kilograms) to megagrams Scaling the vessel / Hardware Scaling the chemical reaction / Software and processes 4/25

5. Scalability through re-engineering 5/25

6. Broader objectives
● Platform
  ○ Portability of applications
● Application
  ○ Central governance and dissemination of models
● Model
  ○ "Decisioning as a Service"
● Decision
  ○ Traceability, reproducibility, explainability
6/25

7. R and Scikit-Learn PMML PFA Java and C Domain-Specific Languages General-Purpose Languages 7/25

8. The PMML connection 8/25

9. X model producers Y model consumers (ten years into past & future) 1 model API 9/25

10. Model API pipeline Ephemeral: Conversion → Deployment. Persistent, asset-like: Conversion → Maintenance → Deployment. 10/25

11. Conversion into PMML Capturing and expressing the essentials of the modeling workflow using PMML vocabulary: Input → Feature vector → Response vector → Output Connecting stable data schemas 11/25

12. 12/25

13. 13/25

14. Standardized representation
R: ada, cforest, gbm, randomForest, xgb.Booster
Scikit-Learn: AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier
    <MiningModel function="classification">
      <Segmentation multipleModelMethod="weightedAverage">
        <Segment id="1" weight="1">
          <True/>
          <TreeModel>
            <Node> ... </Node>
          </TreeModel>
        </Segment>
        ...
      </Segmentation>
    </MiningModel>
14/25

15. Supra-standardized representation
    Model model = MiningModelUtil.createClassifierEnsemble(
        MultipleModelMethodType.WEIGHTED_AVERAGE,
        Arrays.asList(
            PMMLUtil.loadModel("xgboost.pmml"),
            PMMLUtil.loadModel("keras-mlp.pmml"),
            PMMLUtil.loadModel("sklearn-rf.pmml")
        ),
        Arrays.asList(5d/9d, 2d/9d, 2d/9d)
    );
    PMMLUtil.storeModel(model, "kaggle-submission.pmml");
15/25
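To make the weighted-average idea on slides 14 and 15 concrete, here is a minimal plain-Java sketch of how class probabilities from several member models could be combined with weights such as 5/9, 2/9 and 2/9. It does not use any JPMML classes; the class name `WeightedAverageSketch`, the `combine` method and the example probabilities are assumptions made for illustration, and the formula (weighted sum divided by the sum of weights) is the usual reading of a weighted average rather than a quote from the slides.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: plain Java, no JPMML classes involved.
class WeightedAverageSketch {

    // Combines per-model class probabilities as sum(w_i * p_i) / sum(w_i).
    static Map<String, Double> combine(List<Map<String, Double>> memberProbabilities,
                                       List<Double> weights) {
        if (memberProbabilities.size() != weights.size()) {
            throw new IllegalArgumentException("Expected one weight per member model");
        }
        double weightSum = 0d;
        Map<String, Double> result = new LinkedHashMap<>();
        for (int i = 0; i < memberProbabilities.size(); i++) {
            double weight = weights.get(i);
            weightSum += weight;
            // Accumulate the weighted probability of every class label.
            for (Map.Entry<String, Double> entry : memberProbabilities.get(i).entrySet()) {
                result.merge(entry.getKey(), weight * entry.getValue(), Double::sum);
            }
        }
        final double total = weightSum;
        // Normalize by the total weight so that the weights need not sum to one.
        result.replaceAll((label, value) -> value / total);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical probabilities from three member models.
        List<Map<String, Double>> members = List.of(
            Map.of("yes", 0.9, "no", 0.1),
            Map.of("yes", 0.6, "no", 0.4),
            Map.of("yes", 0.3, "no", 0.7)
        );
        List<Double> weights = List.of(5d / 9d, 2d / 9d, 2d / 9d);
        System.out.println(combine(members, weights));
    }
}
```

With these made-up inputs the combined probability of "yes" is (5 × 0.9 + 2 × 0.6 + 2 × 0.3) / 9 = 0.7, so the heaviest member dominates but does not fully decide the result.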

16. State machine for model maintenance [Diagram; model states: Raw, Standardized, Enhanced, Optimized] 16/25

17. Model as a function $\mathrm{Target} = f(\mathrm{Active}_1, \mathrm{Active}_2, \ldots, \mathrm{Active}_n)$ $\mathrm{Output}_{\mathrm{feature}} = f_{\mathrm{feature}}(\mathrm{Target})$ 17/25
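The "model as a function" view on slide 17 can be sketched with ordinary Java functional interfaces: one function maps the active fields to a target, and a second function derives an output feature from that target. Everything in this sketch is hypothetical (the field names `x1` and `x2`, the linear formula, the threshold of 3.0); it only illustrates the composition, not any particular PMML model.

```java
import java.util.Map;
import java.util.function.Function;

// Illustrative only: a model expressed as two composed functions.
class ModelAsFunctionSketch {

    // Target = f(Active_1, ..., Active_n): a hypothetical linear scorer over named inputs.
    static final Function<Map<String, Double>, Double> TARGET =
        active -> 0.5 * active.get("x1") + 2.0 * active.get("x2");

    // Output_feature = f_feature(Target): a hypothetical decision derived from the target.
    static final Function<Double, String> DECISION =
        target -> (target >= 3.0) ? "accept" : "reject";

    // The deployed "model" is the composition of the two functions.
    static final Function<Map<String, Double>, String> PIPELINE = TARGET.andThen(DECISION);

    public static void main(String[] args) {
        // 0.5 * 2.0 + 2.0 * 1.5 = 4.0, which is at or above the threshold.
        System.out.println(PIPELINE.apply(Map.of("x1", 2.0, "x2", 1.5)));
    }
}
```

Treating the model as a pure function of its active fields is what lets the same artifact be dropped into different pipelines as a reusable step.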

18. Model metadata API
    Evaluator evaluator = getEvaluator();
    List<FieldName> activeFields = evaluator.getActiveFields();
    for(FieldName activeField : activeFields){
        DataField dataField = evaluator.getDataField(activeField);
        // Inspect data type, operational type, value space etc.
    }
18/25

19. Model evaluation API
    Evaluator evaluator = getEvaluator();
    while(!done){
        Map<FieldName, ?> arguments = readInRecord();
        Map<FieldName, ?> results = evaluator.evaluate(arguments);
        writeOutRecord(results);
    }
19/25
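The read, evaluate, write loop on slide 19 can be tried out without any PMML library by substituting a small stand-in interface. In the sketch below, `SketchEvaluator`, `scoreAll` and `toyEvaluator` are hypothetical names invented for illustration; they are not the JPMML `Evaluator` API, and the toy rule (age 18 or above maps to "adult") is made up. The sketch only shows the shape of the loop: pick the active fields out of each record, evaluate, collect the results.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: the evaluation loop against a stand-in evaluator.
class EvaluationLoopSketch {

    // Hypothetical stand-in for the slide's Evaluator; not the JPMML interface.
    interface SketchEvaluator {
        List<String> getActiveFields();
        Map<String, Object> evaluate(Map<String, Object> arguments);
    }

    // Scores every record, passing only the fields that the model declares as active.
    static List<Map<String, Object>> scoreAll(SketchEvaluator evaluator,
                                              List<Map<String, Object>> records) {
        List<Map<String, Object>> results = new ArrayList<>();
        for (Map<String, Object> record : records) {
            Map<String, Object> arguments = new LinkedHashMap<>();
            for (String field : evaluator.getActiveFields()) {
                if (!record.containsKey(field)) {
                    throw new IllegalArgumentException("Missing active field: " + field);
                }
                arguments.put(field, record.get(field));
            }
            results.add(evaluator.evaluate(arguments));
        }
        return results;
    }

    // A toy model with one active field and one output field.
    static SketchEvaluator toyEvaluator() {
        return new SketchEvaluator() {
            @Override
            public List<String> getActiveFields() {
                return List.of("age");
            }

            @Override
            public Map<String, Object> evaluate(Map<String, Object> arguments) {
                int age = (Integer) arguments.get("age");
                return Map.<String, Object>of("segment", age >= 18 ? "adult" : "minor");
            }
        };
    }

    public static void main(String[] args) {
        List<Map<String, Object>> records = List.of(
            Map.<String, Object>of("age", 42, "name", "a"),
            Map.<String, Object>of("age", 7, "name", "b")
        );
        System.out.println(scoreAll(toyEvaluator(), records));
    }
}
```

Because the loop depends only on the interface, the same driver code would work for any model that exposes its active fields and an evaluate method, which is the portability argument the slides make.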

20. http://github.com/jpmml/jpmml-${framework} Volume Velocity { REST } 20/25

21. JPMML-Cascading
    Evaluator evaluator = getEvaluator();
    PMMLPlanner pmmlPlanner = new PMMLPlanner(evaluator);
    pmmlPlanner.setHeadName("input");
    pmmlPlanner.setTailName("output");
    FlowDef flowDef = ...;
    flowDef.addAssemblyPlanner(pmmlPlanner);
21/25

22. JPMML-Pig
    grunt> REGISTER jpmml-pig-distributable-1.0.jar;
    grunt> DEFINE my_udf org.jpmml.pig.PMMLFunc('model.pmml');
    grunt> output = FOREACH input GENERATE my_udf(*);
22/25

23. JPMML-Spark
    Evaluator evaluator = getEvaluator();
    PMMLFunction pmmlFunction = new PMMLFunction(evaluator);
    JavaRDD<Row> input = ...;
    JavaRDD<Row> output = input.map(pmmlFunction);
23/25

24. Thoughts on API-driven architecture
● Based on a relevant standard
  ○ "Conventions over configuration"
● High(est) abstraction level
  ○ Productivity
  ○ Maintainability
● End-to-end value proposition
  ○ Separation of concerns
24/25

25. Q&A villu@openscoring.io http://openscoring.io http://github.com/jpmml 25/25
