Flink Forward SF 2017: Trevor Grant - Introduction to Online Machine Learning...Flink Forward
Online algorithms are an increasingly popular yet often misunderstood branch of machine learning, where model parameter estimates are updated for each new piece of information received. While mini-batch methods have often been mislabeled as 'streaming-machine learning', true online methods have different implementations and goals. This talk will explain key differences between online and offline machine learning, an introduction to many common online algorithms, and how online algorithms can be analyzed. An example using Apache Flink to detect trends on Twitter will be presented. Attendees will come away from this talk with a better understanding of the challenges and opportunities from working with online algorithms and how they can begin implementing their own algorithms in Apache Flink.
Márton Balassi Streaming ML with Flink- Flink Forward
http://flink-forward.org/kb_sessions/streaming-ml-with-flink/
As continuous big data processing is gaining popularity it naturally implies that there is a need to transition many of the distributed machine learning functionality to a streaming backend. The most common use case is to give streaming predictions based on the model learnt in batch, however in some cases it is beneficial to also update the model on the fly. It is not uncommon that streaming learners need different algorithms than their batch counterparts. The talk discusses the common use cases and the pitfalls of the streaming ML transition through the example of recommender systems. It also offer a dive into the implementation of a Scala library augmenting FlinkML with streaming predictors.
This document discusses machine learning pipelines and introduces Evan Sparks' presentation on building image classification pipelines. It provides an overview of feature extraction techniques used in computer vision like normalization, patch extraction, convolution, rectification and pooling. These techniques are used to transform images into feature vectors that can be input to linear classifiers. The document encourages building simple, intermediate and advanced image classification pipelines using these techniques to qualitatively and quantitatively compare their effectiveness.
This document discusses moving machine learning models from prototype to production. It outlines some common problems with the current workflow where moving to production often requires redevelopment from scratch. Some proposed solutions include using notebooks as APIs and developing analytics that are accessed via an API. It also discusses different data science platforms and architectures for building end-to-end machine learning systems, focusing on flexibility, security, testing and scalability for production environments. The document recommends a custom backend integrated with Spark via APIs as the best approach for the current project.
Deploying Enterprise Deep Learning Masterclass Preview - Enterprise Deep Lea...Sam Putnam [Deep Learning]
This document summarizes Sam Putnam's presentation on deploying enterprise deep learning. It discusses how deep learning was used to analyze housing price data and predict prices. A neural network model was built using TensorFlow that improved on previous linear regression and gradient boosted decision tree models. The presentation provides an overview of deep learning concepts like neural networks, activation functions, and model architectures for different data types. It emphasizes real-world considerations for developing and deploying deep learning models in a production setting.
From Pipelines to Refineries: scaling big data applications with Tim HunterDatabricks
Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data. Apache Spark provides strong building blocks for batch processes, streams and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted by the need of rapid iterations. This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole program checks, auto-caching, and aggressive computation parallelization and reuse.
The AutoML Toolkit provides tools to simplify machine learning tasks. It features techniques for feature engineering like feature interaction that combines features to gain additional predictive power. It also addresses class imbalance issues through techniques like K-Sampling, a distributed version of SMOTE oversampling that generates synthetic samples for the minority class. The toolkit uses genetic algorithms to automatically tune machine learning models for optimal performance. An upcoming roadmap includes additional tools for stacked ensembles, improved genetic search algorithms, statistical analysis of features, and visualizations.
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...Spark Summit
KeystoneML is a software framework for building scalable machine learning pipelines. It provides tools for data loading, feature extraction, model training, and evaluation that work across multiple domains like computer vision, NLP, and speech. Pipelines built with KeystoneML can achieve state-of-the-art results on large datasets using modest computing resources. The framework is open source and available on GitHub.
Flink Forward SF 2017: Trevor Grant - Introduction to Online Machine Learning...Flink Forward
Online algorithms are an increasingly popular yet often misunderstood branch of machine learning, where model parameter estimates are updated for each new piece of information received. While mini-batch methods have often been mislabeled as 'streaming-machine learning', true online methods have different implementations and goals. This talk will explain key differences between online and offline machine learning, an introduction to many common online algorithms, and how online algorithms can be analyzed. An example using Apache Flink to detect trends on Twitter will be presented. Attendees will come away from this talk with a better understanding of the challenges and opportunities from working with online algorithms and how they can begin implementing their own algorithms in Apache Flink.
Márton Balassi Streaming ML with Flink- Flink Forward
http://flink-forward.org/kb_sessions/streaming-ml-with-flink/
As continuous big data processing is gaining popularity it naturally implies that there is a need to transition many of the distributed machine learning functionality to a streaming backend. The most common use case is to give streaming predictions based on the model learnt in batch, however in some cases it is beneficial to also update the model on the fly. It is not uncommon that streaming learners need different algorithms than their batch counterparts. The talk discusses the common use cases and the pitfalls of the streaming ML transition through the example of recommender systems. It also offer a dive into the implementation of a Scala library augmenting FlinkML with streaming predictors.
This document discusses machine learning pipelines and introduces Evan Sparks' presentation on building image classification pipelines. It provides an overview of feature extraction techniques used in computer vision like normalization, patch extraction, convolution, rectification and pooling. These techniques are used to transform images into feature vectors that can be input to linear classifiers. The document encourages building simple, intermediate and advanced image classification pipelines using these techniques to qualitatively and quantitatively compare their effectiveness.
This document discusses moving machine learning models from prototype to production. It outlines some common problems with the current workflow where moving to production often requires redevelopment from scratch. Some proposed solutions include using notebooks as APIs and developing analytics that are accessed via an API. It also discusses different data science platforms and architectures for building end-to-end machine learning systems, focusing on flexibility, security, testing and scalability for production environments. The document recommends a custom backend integrated with Spark via APIs as the best approach for the current project.
Deploying Enterprise Deep Learning Masterclass Preview - Enterprise Deep Lea...Sam Putnam [Deep Learning]
This document summarizes Sam Putnam's presentation on deploying enterprise deep learning. It discusses how deep learning was used to analyze housing price data and predict prices. A neural network model was built using TensorFlow that improved on previous linear regression and gradient boosted decision tree models. The presentation provides an overview of deep learning concepts like neural networks, activation functions, and model architectures for different data types. It emphasizes real-world considerations for developing and deploying deep learning models in a production setting.
From Pipelines to Refineries: scaling big data applications with Tim HunterDatabricks
Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data. Apache Spark provides strong building blocks for batch processes, streams and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted by the need of rapid iterations. This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole program checks, auto-caching, and aggressive computation parallelization and reuse.
The AutoML Toolkit provides tools to simplify machine learning tasks. It features techniques for feature engineering like feature interaction that combines features to gain additional predictive power. It also addresses class imbalance issues through techniques like K-Sampling, a distributed version of SMOTE oversampling that generates synthetic samples for the minority class. The toolkit uses genetic algorithms to automatically tune machine learning models for optimal performance. An upcoming roadmap includes additional tools for stacked ensembles, improved genetic search algorithms, statistical analysis of features, and visualizations.
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...Spark Summit
KeystoneML is a software framework for building scalable machine learning pipelines. It provides tools for data loading, feature extraction, model training, and evaluation that work across multiple domains like computer vision, NLP, and speech. Pipelines built with KeystoneML can achieve state-of-the-art results on large datasets using modest computing resources. The framework is open source and available on GitHub.
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsAnyscale
Apache Spark has rapidly become a key tool for data scientists to explore, understand and transform massive datasets and to build and train advanced machine learning models. The question then becomes, how do I deploy these model to a production environment? How do I embed what I have learned into customer facing data applications?
In this webinar, we will discuss best practices from Databricks on
how our customers productionize machine learning models
do a deep dive with actual customer case studies,
show live tutorials of a few example architectures and code in Python, Scala, Java and SQL.
Automated Hyperparameter Tuning, Scaling and TrackingDatabricks
Automated Machine Learning (AutoML) has received significant interest recently. We believe that the right automation would bring significant value and dramatically shorten time-to-value for data science teams. Databricks is automating the Data Science and Machine Learning process through a combination of product offerings, partnerships, and custom solutions. This talk will focus on how Databricks can help automate hyperparameter tuning.
For both traditional Machine Learning and modern Deep Learning, tuning hyperparameters can dramatically increase model performance and improve training times. However, tuning can be a complex and expensive process. In this talk, we'll start with a brief survey of the most popular techniques for hyperparameter tuning (e.g., grid search, random search, and Bayesian optimization). We will then discuss open source tools that implement each of these techniques, helping to automate the search over hyperparameters.
Finally, we will discuss and demo improvements we built for these tools in Databricks, including integration with MLflow:
Apache PySpark MLlib integration with MLflow for automatically tracking tuning
Hyperopt integration with Apache Spark to distribute tuning and with MLflow for automatic tracking
Recording and notebooks will be provided after the webinar so that you can practice at your own pace.
Presenters
Joseph Bradley, Software Engineer, Databricks
Joseph Bradley is a Software Engineer and Apache Spark PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.
Yifan Cao, Senior Product Manager, Databricks
Yifan Cao is a Senior Product Manager at Databricks. His product area spans ML/DL algorithms and Databricks Runtime for Machine Learning. Prior to Databricks, Yifan worked on two Machine Learning products, applying NLP to find metadata and applying machine learning to predict equipment failures. He helped build the products from ground up to multi-million dollars in ARR. Yifan started his career as a researcher in quantum computing. Yifan received his B.S in UC Berkeley and Master from MIT.
Experimental Design for Distributed Machine Learning with Myles BakerDatabricks
This document discusses experimental design for distributed machine learning models. It outlines common problems in machine learning modeling like selecting the best algorithm and evaluating a model's expected generalization error. It describes steps in a machine learning study like collecting data, building models, and designing experiments. The goal of experimentation is to understand how model factors affect outcomes and obtain statistically significant conclusions. Techniques discussed for analyzing distributed model outputs include precision-recall curves, confusion matrices, and hypothesis testing methods like the chi-squared test and McNemar's test. The document emphasizes that experimental design for distributed learning poses new challenges around data characteristics, computational complexity, and reproducing results across models.
Data Science Salon: A Journey of Deploying a Data Science Engine to ProductionFormulatedby
Presented by Mostafa Madjipour., Senior Data Scientist at Time Inc.
Next DSS NYC Event 👉 https://datascience.salon/newyork/
Next DSS LA Event 👉 https://datascience.salon/la/
Reducing the gap between R&D and production is still a challenge for data science/ machine learning engineering groups in many companies. Typically, data scientists develop the data-driven models in a research-oriented programming environment (such as R and python). Next, the data/machine learning engineers rewrite the code (typically in another programming language) in a way that is easy to integrate with production services.
This process has some disadvantages: 1) It is time consuming; 2) slows the impact of data science team on business; 3) code rewriting is prone to errors.
A possible solution to overcome the aforementioned disadvantages would be to implement a deployment strategy that easily embeds/transforms the model created by data scientists. Packages such as jPMML, MLeap, PFA, and PMML among others are developed for this purpose.
In this talk we review some of the mentioned packages, motivated by a project at Time Inc. The project involves development of a near real-time recommender system, which includes a predictor engine, paired with a set of business rules.
Scalable Automatic Machine Learning in H2OSri Ambati
Abstract:
In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts. Although H2O and other tools have made it easier for practitioners to train and deploy machine learning models at scale, there is still a fair bit of knowledge and background in data science that is required to produce high-performing machine learning models. Deep Neural Networks in particular, are notoriously difficult for a non-expert to tune properly.
In this presentation, we provide an overview of the the field of "Automatic Machine Learning" and introduce the new AutoML functionality in H2O. H2O's AutoML provides an easy-to-use interface which automates the process of training a large, comprehensive selection of candidate models and a stacked ensemble model which, in most cases, will be the top performing model in the AutoML Leaderboard.
H2O AutoML is available in all the H2O interfaces including the h2o R package, Python module and the Flow web GUI. We will also provide simple code examples to get you started using AutoML.
Erin’s Bio:
Erin is a Statistician and Machine Learning Scientist at H2O.ai. She is the main author of H2O Ensemble. Before joining H2O, she was the Principal Data Scientist at Wise.io and Marvin Mobile Security (acquired by Veracode in 2012) and the founder of DataScientific, Inc. Erin received her Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing. She also holds a B.S. and M.A. in Mathematics.
Balancing Automation and Explanation in Machine LearningDatabricks
For a machine learning application to be successful, it is not enough to give highly accurate predictions: Customers also want to know why the model has made that prediction, so they can compare it against their intuition and (hopefully) gain trust in the model. However, there is a trade-off between model accuracy and explainability - for example, the more complex your feature transformations become, the harder it is to explain what the resulting features mean to the end customer. However, with the right system design this doesn't mean it has to be a binary choice between these two goals. It is possible to combine complex, even automatic, feature engineering with highly accurate models and explanations. We will describe how we are using lineage tracing to solve this issue at Salesforce Einstein, allowing good model explanations to coexist with automatic feature engineering and model selection. By building this into an open source AutoML library TransmogrifAI, an extension to SparkMlLib, it is easy to ensure a consistent level of transparency in all of our ML applications. As model explanations are provided out of the box, data scientists don't need to re-invent the wheel when model explanations need to be surfaced.
MLLeap, or How to Productionize Data Science Workflows Using Spark by Mikha...Spark Summit
MLeap is a machine learning platform that enables data scientists and engineers to collaborate using a single environment. It allows machine learning models trained using Spark to be deployed to production APIs without dependencies on Spark. MLeap addresses the common problems of data scientists and engineers having to re-write data pipelines and model code for production. It provides core machine learning components, a runtime for executing models without Spark, and tools for converting Spark models to the MLeap format. A demo is shown training and deploying models to an API in under 5 minutes.
This document provides an overview of machine learning concepts and techniques using Apache Spark. It begins with introducing machine learning and describing supervised and unsupervised learning. Then it discusses Spark and how it can be used for large-scale machine learning tasks through its MLlib library and GraphX API. Several examples of machine learning applications are presented, such as classification, regression, clustering, and graph analytics. The document concludes with demonstrating machine learning algorithms in Spark.
Using H2O AutoML for Kaggle CompetitionsSri Ambati
The document discusses H2O AutoML, a tool for automated machine learning. It begins by showing some top Kagglers who have used AutoML to achieve good results with less effort. It then provides an overview of what AutoML automates in the model building process like preprocessing, training, tuning, stacking ensembles. AutoML is suitable for novice users who want automation as well as experts who want to save time on routine tasks. The document explains the interface and shows the grid search, stacking and cross-validation process done behind the scenes to build accurate models with less work.
Machine Learning in Production
The era of big data generation is upon us. Devices ranging from sensors to robots and sophisticated applications are generating increasing amounts of rich data (time series, text, images, sound, video, etc.). For such data to benefit a business’s bottom line, insights must be extracted, a process that increasingly requires machine learning (ML) and deep learning (DL) approaches deployed in production applications use cases.
Production ML is complicated by several challenges, including the need for two very distinct skill sets (operations and data science) to collaborate, the inherent complexity and uniqueness of ML itself, when compared to other apps, and the varied array of analytic engines that need to be combined for a practical deployment, often across physically distributed infrastructure. Nisha Talagala shares solutions and techniques for effectively managing machine learning and deep learning in production with popular analytic engines such as Apache Spark, TensorFlow, and Apache Flink.
Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...MLconf
Recommendations for Building Machine Learning Software: Building a real system that uses machine learning can be a difficult both in terms of the algorithmic and engineering challenges involved. In this talk, I will focus on the engineering side and discuss some of the practical lessons we’ve learned from years of developing the machine learning systems that power Netflix. I will go over what it takes to get machine learning working in a real-life feedback loop with our users and how that imposes different requirements and a different focus than doing machine learning only within a lab environment. This involves lessons around challenges such as where to place algorithmic components, how to handle distribution and parallelism, what kinds of modularity are useful, how to support both production experimentation, and how to test machine learning systems.
The document discusses mentorship modeling based on authorship graphs from Scopus data. It describes building features from co-authorship and correspondence graphs using Spark, validating predictions via crowdsourcing, and visualizing mentorship subgraphs. Key points include normalizing authorship data, aggregating node and edge features, applying pairwise mentorship models, obtaining training data via email campaigns, and using D3.js to interactively display mentorship subgraphs in Spark applications.
Advanced Hyperparameter Optimization for Deep Learning with MLflowDatabricks
Building on the "Best Practices for Hyperparameter Tuning with MLflow" talk, we will present advanced topics in HPO for deep learning, including early stopping, multi-metric optimization, and robust optimization. We will then discuss implementations using open source tools. Finally, we will discuss how we can leverage MLflow with these tools and techniques to analyze the performance of our models.
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache SparkDatabricks
Interested in learning how Showtime is leveraging the power of Spark to transform a traditional premium cable network into a data-savvy analytical competitor? The growth in our over-the-top (OTT) streaming subscription business has led to an abundance of user-level data not previously available. To capitalize on this opportunity, we have been building and evolving our unified platform which allows data scientists and business analysts to tap into this rich behavioral data to support our business goals. We will share how our small team of data scientists is creating meaningful features which capture the nuanced relationships between users and content; productionizing machine learning models; and leveraging MLflow to optimize the runtime of our pipelines, track the accuracy of our models, and log the quality of our data over time. From data wrangling and exploration to machine learning and automation, we are augmenting our data supply chain by constantly rolling out new capabilities and analytical products to help the organization better understand our subscribers, our content, and our path forward to a data-driven future.
Authors: Josh McNutt, Keria Bermudez-Hernandez
This document discusses challenges in running machine learning applications in production environments. It notes that while Kaggle competitions focus on accuracy, real-world applications require balancing accuracy with interpretability, speed and infrastructure constraints. It also emphasizes that machine learning in production is as much a software and systems problem as a modeling problem. Key aspects that are discussed include flexible and scalable deployment architectures, model versioning, packaging and serving, online evaluation and experiments, and ensuring reproducibility of results.
Spark Summit EU talk by Oscar CastanedaSpark Summit
The document discusses transforming Excel spreadsheets into Spark DataFrames by automatically translating Excel formulas into Spark code. It presents a program transformation pipeline that takes Excel formulas, parses them using a grammar and parser to generate a parse tree, and then generates Spark code from the parse tree. Key aspects covered include using an existing grammar and parser called XLParser to parse Excel formulas, treating Excel as a domain-specific language, and generating code by writing a pretty printer for the target Spark language. The talk concludes with a demonstration of the code generation approach.
The document discusses Spark application development and common problems that can occur. It introduces Unravel Data, a startup that aims to help developers visualize Spark job data, optimize performance through automated analysis and diagnoses, and strategize to prevent problems and meet goals. Key points include discussing common issues like failures, wrong results, poor performance, and resource problems; the difficulty of debugging using logs alone; and demonstrating Unravel's platform to address these challenges.
Splice Machine's use of Apache Spark and MLflowDatabricks
Splice Machine is an ANSI-SQL Relational Database Management System (RDBMS) on Apache Spark. It has proven low-latency transactional processing (OLTP) as well as analytical processing (OLAP) at petabyte scale. It uses Spark for all analytical computations and leverages HBase for persistence. This talk highlights a new Native Spark Datasource - which enables seamless data movement between Spark Data Frames and Splice Machine tables without serialization and deserialization. This Spark Datasource makes machine learning libraries such as MLlib native to the Splice RDBMS . Splice Machine has now integrated MLflow into its data platform, creating a flexible Data Science Workbench with an RDBMS at its core. The transactional capabilities of Splice Machine integrated with the plethora of DataFrame-compatible libraries and MLflow capabilities manages a complete, real-time workflow of data-to-insights-to-action. In this presentation we will demonstrate Splice Machine's Data Science Workbench and how it leverages Spark and MLflow to create powerful, full-cycle machine learning capabilities on an integrated platform, from transactional updates to data wrangling, experimentation, and deployment, and back again.
Apache Spark's MLlib's Past Trajectory and new DirectionsDatabricks
- MLlib has rapidly developed over the past 5 years, growing from a few initial algorithms to over 50 algorithms and featurizers today.
- It has shifted focus from just adding algorithms to improving existing algorithms and infrastructure like DataFrame integration.
- This allows for scalable machine learning workflows on big data from small laptop datasets to large clusters, with seamless integration between SQL, DataFrames, streaming, and other Spark components.
- Going forward, areas of focus include continued improvements to scalability, enhancing core algorithms, extending APIs to support custom algorithms, and building out a standard library of machine learning components.
Machine learning techniques are powerful, but building and deploying such models for production use require a lot of care and expertise.
A lot of books, articles, and best practices have been written and discussed on machine learning techniques and feature engineering, but putting those techniques into use on a production environment is usually forgotten and under- estimated , the aim of this talk is to shed some lights on current machine learning deployment practices, and go into details on how to deploy sustainable machine learning pipelines.
Kostas Kloudas - Complex Event Processing with Flink: the state of FlinkCEP Ververica
Pattern matching over event streams is increasingly being employed in many areas including financial services and click stream analysis. Flink, as a true stream processing engine, emerges as a natural candidate for these usecases. In this talk, we will present FlinkCEP, a library for Complex Event Processing (CEP) based on Flink. At the conceptual level, we will see the different patterns the library can support, we will present the main building blocks we implemented to support them, and we will discuss possible future additions that will further enhance the coverage of the library. At the practical level, we will show how the integration of FlinkCEP with Flink allows the former to take advantage of Flink's rich ecosystem (e.g. connectors) and its stream processing capabilities, such as support for event-time processing, exactly-once state semantics, fault-tolerance, savepoints and high throughput.
Machine Learning with Apache Flink at Stockholm Machine Learning GroupTill Rohrmann
This presentation presents Apache Flink's approach to scalable machine learning: Composable machine learning pipelines, consisting of transformers and learners, and distributed linear algebra.
The presentation was held at the Machine Learning Stockholm group on the 23rd of March 2015.
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsAnyscale
Apache Spark has rapidly become a key tool for data scientists to explore, understand and transform massive datasets and to build and train advanced machine learning models. The question then becomes, how do I deploy these model to a production environment? How do I embed what I have learned into customer facing data applications?
In this webinar, we will discuss best practices from Databricks on
how our customers productionize machine learning models
do a deep dive with actual customer case studies,
show live tutorials of a few example architectures and code in Python, Scala, Java and SQL.
Automated Hyperparameter Tuning, Scaling and TrackingDatabricks
Automated Machine Learning (AutoML) has received significant interest recently. We believe that the right automation would bring significant value and dramatically shorten time-to-value for data science teams. Databricks is automating the Data Science and Machine Learning process through a combination of product offerings, partnerships, and custom solutions. This talk will focus on how Databricks can help automate hyperparameter tuning.
For both traditional Machine Learning and modern Deep Learning, tuning hyperparameters can dramatically increase model performance and improve training times. However, tuning can be a complex and expensive process. In this talk, we'll start with a brief survey of the most popular techniques for hyperparameter tuning (e.g., grid search, random search, and Bayesian optimization). We will then discuss open source tools that implement each of these techniques, helping to automate the search over hyperparameters.
Finally, we will discuss and demo improvements we built for these tools in Databricks, including integration with MLflow:
Apache PySpark MLlib integration with MLflow for automatically tracking tuning
Hyperopt integration with Apache Spark to distribute tuning and with MLflow for automatic tracking
Recording and notebooks will be provided after the webinar so that you can practice at your own pace.
Presenters
Joseph Bradley, Software Engineer, Databricks
Joseph Bradley is a Software Engineer and Apache Spark PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.
Yifan Cao, Senior Product Manager, Databricks
Yifan Cao is a Senior Product Manager at Databricks. His product area spans ML/DL algorithms and Databricks Runtime for Machine Learning. Prior to Databricks, Yifan worked on two Machine Learning products, applying NLP to find metadata and applying machine learning to predict equipment failures. He helped build the products from ground up to multi-million dollars in ARR. Yifan started his career as a researcher in quantum computing. Yifan received his B.S in UC Berkeley and Master from MIT.
Experimental Design for Distributed Machine Learning with Myles BakerDatabricks
This document discusses experimental design for distributed machine learning models. It outlines common problems in machine learning modeling like selecting the best algorithm and evaluating a model's expected generalization error. It describes steps in a machine learning study like collecting data, building models, and designing experiments. The goal of experimentation is to understand how model factors affect outcomes and obtain statistically significant conclusions. Techniques discussed for analyzing distributed model outputs include precision-recall curves, confusion matrices, and hypothesis testing methods like the chi-squared test and McNemar's test. The document emphasizes that experimental design for distributed learning poses new challenges around data characteristics, computational complexity, and reproducing results across models.
Data Science Salon: A Journey of Deploying a Data Science Engine to ProductionFormulatedby
Presented by Mostafa Madjipour., Senior Data Scientist at Time Inc.
Next DSS NYC Event 👉 https://datascience.salon/newyork/
Next DSS LA Event 👉 https://datascience.salon/la/
Reducing the gap between R&D and production is still a challenge for data science/ machine learning engineering groups in many companies. Typically, data scientists develop the data-driven models in a research-oriented programming environment (such as R and python). Next, the data/machine learning engineers rewrite the code (typically in another programming language) in a way that is easy to integrate with production services.
This process has some disadvantages: 1) It is time consuming; 2) slows the impact of data science team on business; 3) code rewriting is prone to errors.
A possible solution to overcome the aforementioned disadvantages would be to implement a deployment strategy that easily embeds/transforms the model created by data scientists. Packages such as jPMML, MLeap, PFA, and PMML among others are developed for this purpose.
In this talk we review some of the mentioned packages, motivated by a project at Time Inc. The project involves development of a near real-time recommender system, which includes a predictor engine, paired with a set of business rules.
Scalable Automatic Machine Learning in H2OSri Ambati
Abstract:
In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts. Although H2O and other tools have made it easier for practitioners to train and deploy machine learning models at scale, there is still a fair bit of knowledge and background in data science that is required to produce high-performing machine learning models. Deep Neural Networks in particular, are notoriously difficult for a non-expert to tune properly.
In this presentation, we provide an overview of the the field of "Automatic Machine Learning" and introduce the new AutoML functionality in H2O. H2O's AutoML provides an easy-to-use interface which automates the process of training a large, comprehensive selection of candidate models and a stacked ensemble model which, in most cases, will be the top performing model in the AutoML Leaderboard.
H2O AutoML is available in all the H2O interfaces including the h2o R package, Python module and the Flow web GUI. We will also provide simple code examples to get you started using AutoML.
Erin’s Bio:
Erin is a Statistician and Machine Learning Scientist at H2O.ai. She is the main author of H2O Ensemble. Before joining H2O, she was the Principal Data Scientist at Wise.io and Marvin Mobile Security (acquired by Veracode in 2012) and the founder of DataScientific, Inc. Erin received her Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing. She also holds a B.S. and M.A. in Mathematics.
Balancing Automation and Explanation in Machine LearningDatabricks
For a machine learning application to be successful, it is not enough to give highly accurate predictions: Customers also want to know why the model has made that prediction, so they can compare it against their intuition and (hopefully) gain trust in the model. However, there is a trade-off between model accuracy and explainability - for example, the more complex your feature transformations become, the harder it is to explain what the resulting features mean to the end customer. However, with the right system design this doesn't mean it has to be a binary choice between these two goals. It is possible to combine complex, even automatic, feature engineering with highly accurate models and explanations. We will describe how we are using lineage tracing to solve this issue at Salesforce Einstein, allowing good model explanations to coexist with automatic feature engineering and model selection. By building this into an open source AutoML library TransmogrifAI, an extension to SparkMlLib, it is easy to ensure a consistent level of transparency in all of our ML applications. As model explanations are provided out of the box, data scientists don't need to re-invent the wheel when model explanations need to be surfaced.
MLLeap, or How to Productionize Data Science Workflows Using Spark by Mikha...Spark Summit
MLeap is a machine learning platform that enables data scientists and engineers to collaborate using a single environment. It allows machine learning models trained using Spark to be deployed to production APIs without dependencies on Spark. MLeap addresses the common problems of data scientists and engineers having to re-write data pipelines and model code for production. It provides core machine learning components, a runtime for executing models without Spark, and tools for converting Spark models to the MLeap format. A demo is shown training and deploying models to an API in under 5 minutes.
This document provides an overview of machine learning concepts and techniques using Apache Spark. It begins with introducing machine learning and describing supervised and unsupervised learning. Then it discusses Spark and how it can be used for large-scale machine learning tasks through its MLlib library and GraphX API. Several examples of machine learning applications are presented, such as classification, regression, clustering, and graph analytics. The document concludes with demonstrating machine learning algorithms in Spark.
Using H2O AutoML for Kaggle CompetitionsSri Ambati
The document discusses H2O AutoML, a tool for automated machine learning. It begins by showing some top Kagglers who have used AutoML to achieve good results with less effort. It then provides an overview of what AutoML automates in the model building process like preprocessing, training, tuning, stacking ensembles. AutoML is suitable for novice users who want automation as well as experts who want to save time on routine tasks. The document explains the interface and shows the grid search, stacking and cross-validation process done behind the scenes to build accurate models with less work.
Machine Learning in Production
The era of big data generation is upon us. Devices ranging from sensors to robots and sophisticated applications are generating increasing amounts of rich data (time series, text, images, sound, video, etc.). For such data to benefit a business’s bottom line, insights must be extracted, a process that increasingly requires machine learning (ML) and deep learning (DL) approaches deployed in production applications use cases.
Production ML is complicated by several challenges, including the need for two very distinct skill sets (operations and data science) to collaborate, the inherent complexity and uniqueness of ML itself, when compared to other apps, and the varied array of analytic engines that need to be combined for a practical deployment, often across physically distributed infrastructure. Nisha Talagala shares solutions and techniques for effectively managing machine learning and deep learning in production with popular analytic engines such as Apache Spark, TensorFlow, and Apache Flink.
Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...MLconf
Recommendations for Building Machine Learning Software: Building a real system that uses machine learning can be a difficult both in terms of the algorithmic and engineering challenges involved. In this talk, I will focus on the engineering side and discuss some of the practical lessons we’ve learned from years of developing the machine learning systems that power Netflix. I will go over what it takes to get machine learning working in a real-life feedback loop with our users and how that imposes different requirements and a different focus than doing machine learning only within a lab environment. This involves lessons around challenges such as where to place algorithmic components, how to handle distribution and parallelism, what kinds of modularity are useful, how to support both production experimentation, and how to test machine learning systems.
The document discusses mentorship modeling based on authorship graphs from Scopus data. It describes building features from co-authorship and correspondence graphs using Spark, validating predictions via crowdsourcing, and visualizing mentorship subgraphs. Key points include normalizing authorship data, aggregating node and edge features, applying pairwise mentorship models, obtaining training data via email campaigns, and using D3.js to interactively display mentorship subgraphs in Spark applications.
Advanced Hyperparameter Optimization for Deep Learning with MLflowDatabricks
Building on the "Best Practices for Hyperparameter Tuning with MLflow" talk, we will present advanced topics in HPO for deep learning, including early stopping, multi-metric optimization, and robust optimization. We will then discuss implementations using open source tools. Finally, we will discuss how we can leverage MLflow with these tools and techniques to analyze the performance of our models.
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache SparkDatabricks
Interested in learning how Showtime is leveraging the power of Spark to transform a traditional premium cable network into a data-savvy analytical competitor? The growth in our over-the-top (OTT) streaming subscription business has led to an abundance of user-level data not previously available. To capitalize on this opportunity, we have been building and evolving our unified platform which allows data scientists and business analysts to tap into this rich behavioral data to support our business goals. We will share how our small team of data scientists is creating meaningful features which capture the nuanced relationships between users and content; productionizing machine learning models; and leveraging MLflow to optimize the runtime of our pipelines, track the accuracy of our models, and log the quality of our data over time. From data wrangling and exploration to machine learning and automation, we are augmenting our data supply chain by constantly rolling out new capabilities and analytical products to help the organization better understand our subscribers, our content, and our path forward to a data-driven future.
Authors: Josh McNutt, Keria Bermudez-Hernandez
This document discusses challenges in running machine learning applications in production environments. It notes that while Kaggle competitions focus on accuracy, real-world applications require balancing accuracy with interpretability, speed and infrastructure constraints. It also emphasizes that machine learning in production is as much a software and systems problem as a modeling problem. Key aspects that are discussed include flexible and scalable deployment architectures, model versioning, packaging and serving, online evaluation and experiments, and ensuring reproducibility of results.
Spark Summit EU talk by Oscar CastanedaSpark Summit
The document discusses transforming Excel spreadsheets into Spark DataFrames by automatically translating Excel formulas into Spark code. It presents a program transformation pipeline that takes Excel formulas, parses them using a grammar and parser to generate a parse tree, and then generates Spark code from the parse tree. Key aspects covered include using an existing grammar and parser called XLParser to parse Excel formulas, treating Excel as a domain-specific language, and generating code by writing a pretty printer for the target Spark language. The talk concludes with a demonstration of the code generation approach.
The document discusses Spark application development and common problems that can occur. It introduces Unravel Data, a startup that aims to help developers visualize Spark job data, optimize performance through automated analysis and diagnoses, and strategize to prevent problems and meet goals. Key points include discussing common issues like failures, wrong results, poor performance, and resource problems; the difficulty of debugging using logs alone; and demonstrating Unravel's platform to address these challenges.
Splice Machine's use of Apache Spark and MLflowDatabricks
Splice Machine is an ANSI-SQL Relational Database Management System (RDBMS) on Apache Spark. It has proven low-latency transactional processing (OLTP) as well as analytical processing (OLAP) at petabyte scale. It uses Spark for all analytical computations and leverages HBase for persistence. This talk highlights a new Native Spark Datasource - which enables seamless data movement between Spark Data Frames and Splice Machine tables without serialization and deserialization. This Spark Datasource makes machine learning libraries such as MLlib native to the Splice RDBMS . Splice Machine has now integrated MLflow into its data platform, creating a flexible Data Science Workbench with an RDBMS at its core. The transactional capabilities of Splice Machine integrated with the plethora of DataFrame-compatible libraries and MLflow capabilities manages a complete, real-time workflow of data-to-insights-to-action. In this presentation we will demonstrate Splice Machine's Data Science Workbench and how it leverages Spark and MLflow to create powerful, full-cycle machine learning capabilities on an integrated platform, from transactional updates to data wrangling, experimentation, and deployment, and back again.
Apache Spark's MLlib's Past Trajectory and new DirectionsDatabricks
- MLlib has rapidly developed over the past 5 years, growing from a few initial algorithms to over 50 algorithms and featurizers today.
- It has shifted focus from just adding algorithms to improving existing algorithms and infrastructure like DataFrame integration.
- This allows for scalable machine learning workflows on big data from small laptop datasets to large clusters, with seamless integration between SQL, DataFrames, streaming, and other Spark components.
- Going forward, areas of focus include continued improvements to scalability, enhancing core algorithms, extending APIs to support custom algorithms, and building out a standard library of machine learning components.
Machine learning techniques are powerful, but building and deploying such models for production use require a lot of care and expertise.
A lot of books, articles, and best practices have been written and discussed on machine learning techniques and feature engineering, but putting those techniques into use on a production environment is usually forgotten and under- estimated , the aim of this talk is to shed some lights on current machine learning deployment practices, and go into details on how to deploy sustainable machine learning pipelines.
Kostas Kloudas - Complex Event Processing with Flink: the state of FlinkCEP Ververica
Pattern matching over event streams is increasingly being employed in many areas including financial services and click stream analysis. Flink, as a true stream processing engine, emerges as a natural candidate for these usecases. In this talk, we will present FlinkCEP, a library for Complex Event Processing (CEP) based on Flink. At the conceptual level, we will see the different patterns the library can support, we will present the main building blocks we implemented to support them, and we will discuss possible future additions that will further enhance the coverage of the library. At the practical level, we will show how the integration of FlinkCEP with Flink allows the former to take advantage of Flink's rich ecosystem (e.g. connectors) and its stream processing capabilities, such as support for event-time processing, exactly-once state semantics, fault-tolerance, savepoints and high throughput.
Machine Learning with Apache Flink at Stockholm Machine Learning GroupTill Rohrmann
This presentation presents Apache Flink's approach to scalable machine learning: Composable machine learning pipelines, consisting of transformers and learners, and distributed linear algebra.
The presentation was held at the Machine Learning Stockholm group on the 23rd of March 2015.
Flink Forward Berlin 2017: Kostas Kloudas - Complex Event Processing with Fli...Flink Forward
Pattern matching over event streams is increasingly being employed in many areas including financial services and click stream analysis. Flink, as a true stream processing engine, emerges as a natural candidate for these usecases. In this talk, we will present FlinkCEP, a library for Complex Event Processing (CEP) based on Flink. At the conceptual level, we will see the different patterns the library can support, we will present the main building blocks we implemented to support them, and we will discuss possible future additions that will further enhance the coverage of the library. At the practical level, we will show how the integration of FlinkCEP with Flink allows the former to take advantage of Flink's rich ecosystem (e.g. connectors) and its stream processing capabilities, such as support for event-time processing, exactly-once state semantics, fault-tolerance, savepoints and high throughput.
Flink Forward SF 2017: Eron Wright - Introducing Flink TensorflowFlink Forward
This session will introduce a new open-source project - Flink TensorFlow - that enables Flink programs to operate on data using TensorFlow machine learning models. Applications include real-time image processing, NLP, and anomaly detection. The session will: - Introduce TensorFlow and describe its component model which allows for model reuse across environments - Demonstrate how to use TensorFlow models in Flink ML and Flink Streaming environments - Present a roadmap and provide opportunities to contribute
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...Flink Forward
SK telecom shares our experience of using Flink in building a solution for Predictive Maintenance (PdM). Our PdM solution named metatron PdM consists of (1) a Deep Neural Network (DNN)-based prediction model for precise prediction, and (2) a Flink-based runtime system which applies the model to a sliding window on sensor data streams. Efficient handling of multi-sensor streaming data for real-time prediction of equipment condition is a critical component of our product. In this talk, we first show why we choose Flink as a core engine for our streaming use case in which we generate real-time predictions using DNNs trained with Keras on top of TensorFlow and Theano. In addition, we present a comparative study of methods to exploit learning models on JVM such as directly using Python libraries on CPython embedded in JVM, using TensorFlow Java API (including Flink TensorFlow), and making RPC calls to TensorFlow Serving. We then explain how we implement the runtime system using Flink DataStream API, especially with event time, various window mechanisms, timestamp and watermark, custom source and sink, and checkpointing. Lastly, we present how we use the official Flink Docker image for solution delivery and the Flink metric system for monitoring and management of our solution. We hope our use case sets a good example of building a DNN-based streaming solution using Flink.
Electricity price forecasting with Recurrent Neural NetworksTaegyun Jeon
This document discusses using recurrent neural networks (RNNs) for electricity price forecasting with TensorFlow. It begins with an introduction to the speaker, Taegyun Jeon from GIST. The document then provides an overview of RNNs and their implementation in TensorFlow. It describes two case studies - using an RNN to predict a sine function and using one to forecast electricity prices. The document concludes with information on running and evaluating the RNN graph and a question and answer section.
Apache Fink 1.0: A New Era for Real-World Streaming AnalyticsSlim Baltagi
These are the slides of my talk at the Chicago Apache Flink Meetup on April 19, 2016. This talk explains how Apache Flink 1.0 announced on March 8th, 2016 by the Apache Software Foundation, marks a new era of Real-Time and Real-World streaming analytics. The talk will map Flink's capabilities to streaming analytics use cases.
This document compares Apache Spark and Apache Flink. Both are open-source platforms for distributed data processing. Spark was created in 2009 at UC Berkeley and donated to the Apache Foundation in 2013. It uses resilient distributed datasets (RDDs) and lazy evaluation. Flink was started in 2010 as a collaboration between universities in Germany and became an Apache project in 2014. It uses cyclic data flows and supports both batch and stream processing. While Spark is currently more mature with more components and community support, Flink claims to be faster for stream and batch processing. Overall, both platforms continue to evolve and improve.
This document provides an overview of Apache Flink and discusses why it is suitable for real-world streaming analytics. The document contains an agenda that covers how Flink is a multi-purpose big data analytics framework, why streaming analytics are emerging, why Flink is suitable for real-world streaming analytics, novel use cases enabled by Flink, who is using Flink, and where to go from here. Key points include Flink innovations like custom memory management, its DataSet API, rich windowing semantics, and native iterative processing. Flink's streaming features that make it suitable for real-world use include its pipelined processing engine, stream abstraction, performance, windowing support, fault tolerance, and integration with Hadoop.
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksSlim Baltagi
Slides of my talk at the Hadoop Summit Europe in Dublin, Ireland on April 13th, 2016. The talk introduces Apache Flink as both a multi-purpose Big Data analytics framework and real-world streaming analytics framework. It is focusing on Flink's key differentiators and suitability for streaming analytics use cases. It also shows how Flink enables novel use cases such as distributed CEP (Complex Event Processing) and querying the state by behaving like a key value data store.
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksSlim Baltagi
This document provides an overview of Apache Flink and discusses why it is suitable for real-world streaming analytics. The document contains an agenda that covers how Flink is a multi-purpose big data analytics framework, why streaming analytics are emerging, why Flink is suitable for real-world streaming analytics, novel use cases enabled by Flink, who is using Flink, and where to go from here. Key points include Flink innovations like custom memory management, its DataSet API, rich windowing semantics, and native iterative processing. Flink's streaming features that make it suitable for real-world use include its pipelined processing engine, stream abstraction, performance, windowing support, fault tolerance, and integration with Hadoop.
Portable Streaming Pipelines with Apache Beamconfluent
1) Apache Beam is an open source unified model for defining both batch and streaming data processing pipelines. It allows writing pipelines once that can run on multiple distributed processing backends.
2) The Beam model separates the data processing logic from runtime requirements. It defines concepts like processing time vs event time to allow portability across batch and streaming runners.
3) Beam supports extensible IO connectors and aims to allow pipelines written in one language to run on different runtimes through language-specific SDKs. Currently, Java and Python SDKs can run on backends like Apache Spark, Flink, and Google Cloud Dataflow.
Present and future of unified, portable, and efficient data processing with A...DataWorks Summit
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the big data ecosystem together; it enables users to "run any data processing pipeline anywhere."
This talk will briefly cover the capabilities of the Beam model for data processing and discuss its architecture, including the portability model. We’ll focus on the present state of the community and the current status of the Beam ecosystem. We’ll cover the state of the art in data processing and discuss where Beam is going next, including completion of the portability framework and the Streaming SQL. Finally, we’ll discuss areas of improvement and how anybody can join us on the path of creating the glue that interconnects the big data ecosystem.
Speaker
Davor Bonaci, Apache Software Foundation; Simbly, V.P. of Apache Beam; Founder/CEO at Operiant
Realizing the promise of portability with Apache BeamJ On The Beach
The world of big data involves an ever changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam (incubating) aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms.
In this talk, I will:
Cover briefly the capabilities of the Beam model for data processing and integration with IOs, as well as the current state of the Beam ecosystem.
Discuss the benefits Beam provides regarding portability and ease-of-use.
Demo the same Beam pipeline running on multiple runners in multiple deployment scenarios (e.g. Apache Flink on Google Cloud, Apache Spark on AWS, Apache Apex on-premise).
Give a glimpse at some of the challenges Beam aims to address in the future.
Portable batch and streaming pipelines with Apache Beam (Big Data Application...Malo Denielou
Apache Beam is a top-level Apache project which aims at providing a unified API for efficient and portable data processing pipeline. Beam handles both batch and streaming use cases and neatly separates properties of the data from runtime characteristics, allowing pipelines to be portable across multiple runtimes, both open-source (e.g., Apache Flink, Apache Spark, Apache Apex, ...) and proprietary (e.g., Google Cloud Dataflow). This talk will cover the basics of Apache Beam, describe the main concepts of the programming model and talk about the current state of the project (new python support, first stable version). We'll illustrate the concepts with a use case running on several runners.
Artsem Semianenko (Adform) - "Flink in action или как приручить белочку"
Slides for presentation: https://www.youtube.com/watch?v=YSI5_RFlcPE
Source: https://github.com/art4ul/flink-demo
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Provectus
Apache Beam is an open source, unified model and set of language-specific SDKs for defining and executing data processing pipelines, and also data ingestion and integration flows, supporting for both batch and streaming use cases. In presentation I will provide a general overview of Apache Beam and programming model comparison Apache Beam vs Apache Spark.
Near real-time anomaly detection at Lyftmarkgrover
Near real-time anomaly detection at Lyft, by Mark Grover and Thomas Weise at Strata NY 2018.
https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/69155
This document provides an overview and introduction to Apache Flink, a stream-based big data processing engine. It discusses the evolution of big data frameworks to platforms and the shortcomings of Spark's RDD abstraction for streaming workloads. The document then introduces Flink, covering its history, key differences from Spark like its use of streaming as the core abstraction, and examples of using Flink for batch and stream processing.
Unified Batch and Real-Time Stream Processing Using Apache FlinkSlim Baltagi
This talk was given at Capital One on September 15, 2015 at the launch of the Washington DC Area Apache Flink Meetup. Apache flink is positioned at the forefront of 2 major trends in Big Data Analytics:
- Unification of Batch and Stream processing
- Multi-purpose Big Data Analytics frameworks
In these slides, we will also find answers to the burning question: Why Apache Flink? You will also learn more about how Apache Flink compares to Hadoop MapReduce, Apache Spark and Apache Storm.
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019Thomas Weise
Apache Beam is a unified programming model for batch and streaming data processing that provides portability across distributed processing backends. It aims to support multiple languages like Java, Python and Go. The Beam Python SDK allows writing pipelines in Python that can run on distributed backends like Apache Flink. Lyft developed a Python SDK runner for Flink that translates Python pipelines to native Flink APIs using the Beam Fn API for communication between the SDK and runner. Future work includes improving performance of Python pipelines on JVM runners and supporting multiple languages in a single pipeline.
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
Wes McKinney is a leading open source developer who created Python's pandas library and now leads the Apache Arrow project. Apache Arrow is an open standard for in-memory analytics that aims to improve data sharing and reuse across systems by defining a common columnar data format and memory layout. It allows data to be accessed and algorithms to be reused across different programming languages with near-zero data copying. Arrow is being integrated into various data systems and is working to expand its computational libraries and language support.
Zaiyang Li discusses the technology stack for a project, including the front-end framework Angular and build tools like Webpack, and the back-end framework ASP.NET Core with libraries like Dapper and testing tools like Xunit and Moq. Chart.js is used for data visualization, and a bespoke event sourcing framework is in active development to run in the same process as the web server. Protocol buffers are used for serialization and provide benefits like high performance and backward compatibility.
Malibou Pitch Deck For Its €3M Seed Roundsjcobrien
French start-up Malibou raised a €3 million Seed Round to develop its payroll and human resources
management platform for VSEs and SMEs. The financing round was led by investors Breega, Y Combinator, and FCVC.
Using Query Store in Azure PostgreSQL to Understand Query PerformanceGrant Fritchey
Microsoft has added an excellent new extension in PostgreSQL on their Azure Platform. This session, presented at Posette 2024, covers what Query Store is and the types of information you can get out of it.
8 Best Automated Android App Testing Tool and Framework in 2024.pdfkalichargn70th171
Regarding mobile operating systems, two major players dominate our thoughts: Android and iPhone. With Android leading the market, software development companies are focused on delivering apps compatible with this OS. Ensuring an app's functionality across various Android devices, OS versions, and hardware specifications is critical, making Android app testing essential.
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian CompaniesQuickdice ERP
Explore the seamless transition to e-invoicing with this comprehensive guide tailored for Saudi Arabian businesses. Navigate the process effortlessly with step-by-step instructions designed to streamline implementation and enhance efficiency.
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemPeter Muessig
Learn about the latest innovations in and around OpenUI5/SAPUI5: UI5 Tooling, UI5 linter, UI5 Web Components, Web Components Integration, UI5 2.x, UI5 GenAI.
Recording:
https://www.youtube.com/live/MSdGLG2zLy8?si=INxBHTqkwHhxV5Ta&t=0
Most important New features of Oracle 23c for DBAs and Developers. You can get more idea from my youtube channel video from https://youtu.be/XvL5WtaC20A
Flutter is a popular open source, cross-platform framework developed by Google. In this webinar we'll explore Flutter and its architecture, delve into the Flutter Embedder and Flutter’s Dart language, discover how to leverage Flutter for embedded device development, learn about Automotive Grade Linux (AGL) and its consortium and understand the rationale behind AGL's choice of Flutter for next-gen IVI systems. Don’t miss this opportunity to discover whether Flutter is right for your project.
Project Management: The Role of Project Dashboards.pdfKarya Keeper
Project management is a crucial aspect of any organization, ensuring that projects are completed efficiently and effectively. One of the key tools used in project management is the project dashboard, which provides a comprehensive view of project progress and performance. In this article, we will explore the role of project dashboards in project management, highlighting their key features and benefits.
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsPeter Muessig
The UI5 tooling is the development and build tooling of UI5. It is built in a modular and extensible way so that it can be easily extended by your needs. This session will showcase various tooling extensions which can boost your development experience by far so that you can really work offline, transpile your code in your project to use even newer versions of EcmaScript (than 2022 which is supported right now by the UI5 tooling), consume any npm package of your choice in your project, using different kind of proxies, and even stitching UI5 projects during development together to mimic your target environment.
Unveiling the Advantages of Agile Software Development.pdfbrainerhub1
Learn about Agile Software Development's advantages. Simplify your workflow to spur quicker innovation. Jump right in! We have also discussed the advantages.
Unveiling the Advantages of Agile Software Development.pdf
Data Intensive Applications with Apache Flink
1. Milan – July 13 2016
Data Intensive Applications with Apache Flink
Simone Robutti
Machine Learning Engineer at Radicalbit
@SimoneRobutti
2. Agenda
1. Brief Introduction to Apache Flink
○ Why
○ What
○ How
2. Machine Learning on Flink
○ Present landscape
○ Future of the Ecosystem
3. Closing notes on Radicalbit (shameless plug ahead)
14. “I have seen people insisting on using Hadoop for
datasets that could easily fit on a flash drive and could
easily be processed on a laptop.”
- Yann LeCun
-
ML on Flink
23. Apache Beam
Programming model for data processing pipelines
● Streaming first, batch as a bounded stream
● Layered API: What, Where, When, How
● Platform agnostic: same program, different
runners
30. Our vision
Flink can become the ideal choice to build real-time decision-
heavy applications with high data-throughput
To achieve this:
● Ambitious applications (aim for real-time services)
● Reliable distributed online learning (Proteus?)
● A Pipelining Framework (experiment fast, increase testability and
modularity)