Title: An Introduction to Machine Learning with Python and Scikit-Learn
Speaker: Julien Simon (https://linkedin.com/in/juliensimon/)
Date: Thursday, March 14, 2019
Event: https://www.meetup.com/Athens-Big-Data/events/259091496/
Augmenting Machine Learning with Databricks Labs AutoML ToolkitDatabricks
Instead of better understanding and optimizing their machine learning models, data scientists spend a majority of their time training and iterating through different models even in cases where there the data is reliable and clean. Important aspects of creating an ML model include (but are not limited to) data preparation, feature engineering, identifying the correct models, training (and continuing to train) and optimizing their models. This process can be (and often is) laborious and time-consuming.
In this session, we will explore this process and then show how the AutoML toolkit (from Databricks Labs) can significantly simplify and optimize machine learning. We will demonstrate all of this financial loan risk data with code snippets and notebooks that will be free to download.
SparkML: Easy ML Productization for Real-Time BiddingDatabricks
dataxu bids on ads in real-time on behalf of its customers at the rate of 3 million requests a second and trains on past bids to optimize for future bids. Our system trains thousands of advertiser-specific models and runs multi-terabyte datasets. In this presentation we will share the lessons learned from our transition towards a fully automated Spark-based machine learning system and how this has drastically reduced the time to get a research idea into production. We'll also share how we: - continually ship models to production - train models in an unattended fashion with auto-tuning capabilities - tune and overbooked cluster resources for maximum performance - ported our previous ML solution into Spark - evaluate the performance of high-rate bidding models
Speakers: Maximo Gurmendez, Javier Buquet
Splice Machine's use of Apache Spark and MLflowDatabricks
Splice Machine is an ANSI-SQL Relational Database Management System (RDBMS) on Apache Spark. It has proven low-latency transactional processing (OLTP) as well as analytical processing (OLAP) at petabyte scale. It uses Spark for all analytical computations and leverages HBase for persistence. This talk highlights a new Native Spark Datasource - which enables seamless data movement between Spark Data Frames and Splice Machine tables without serialization and deserialization. This Spark Datasource makes machine learning libraries such as MLlib native to the Splice RDBMS . Splice Machine has now integrated MLflow into its data platform, creating a flexible Data Science Workbench with an RDBMS at its core. The transactional capabilities of Splice Machine integrated with the plethora of DataFrame-compatible libraries and MLflow capabilities manages a complete, real-time workflow of data-to-insights-to-action. In this presentation we will demonstrate Splice Machine's Data Science Workbench and how it leverages Spark and MLflow to create powerful, full-cycle machine learning capabilities on an integrated platform, from transactional updates to data wrangling, experimentation, and deployment, and back again.
Learn to Use Databricks for Data ScienceDatabricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data.. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
The More the Merrier: Scaling Model Building Infrastructure at ZendeskDatabricks
Significant amount of effort is required to transform a machine learning (ML) model into a useful machine learning product. The incorporation of ML into real world applications almost feels like "1% algorithm and 99% perspiration". I will share with you my team experience in building 3 ML products at Zendesk. I will also discuss some real-world problems and scaling complexities you may encounter when building these products at web scale. Close collaboration with different groups including product, engineering and data science is imperative to strike the balance between model performance, scalability and computational efficiency. The talk mainly focuses on scaling our model building infrastructure with an aim to build at least 50,000 models a day. This is achieved as part of our efforts to deliver a ML product called Content Cues. In a nutshell, Content Cues summarizes text from customers support tickets to form insightful topics. It combines multiple ML algorithms including deep learning, clustering and other natural language processing approaches. These ML algorithms are then run through tens of thousands of eligible Zendesk customer data every day. My talk will cover the following topics: How we implement a horizontally scalable model building and model serving pipeline by combining AWS EMR, AWS Batch and Kubernetes How we tune the model building pipeline to optimize cost and efficiency without compromising resiliency Challenges in model monitoring, model versioning evolution and capturing of user feedback
Speaker: Wai Chee Yau
Automated Hyperparameter Tuning, Scaling and TrackingDatabricks
Automated Machine Learning (AutoML) has received significant interest recently. We believe that the right automation would bring significant value and dramatically shorten time-to-value for data science teams. Databricks is automating the Data Science and Machine Learning process through a combination of product offerings, partnerships, and custom solutions. This talk will focus on how Databricks can help automate hyperparameter tuning.
For both traditional Machine Learning and modern Deep Learning, tuning hyperparameters can dramatically increase model performance and improve training times. However, tuning can be a complex and expensive process. In this talk, we'll start with a brief survey of the most popular techniques for hyperparameter tuning (e.g., grid search, random search, and Bayesian optimization). We will then discuss open source tools that implement each of these techniques, helping to automate the search over hyperparameters.
Finally, we will discuss and demo improvements we built for these tools in Databricks, including integration with MLflow:
Apache PySpark MLlib integration with MLflow for automatically tracking tuning
Hyperopt integration with Apache Spark to distribute tuning and with MLflow for automatic tracking
Recording and notebooks will be provided after the webinar so that you can practice at your own pace.
Presenters
Joseph Bradley, Software Engineer, Databricks
Joseph Bradley is a Software Engineer and Apache Spark PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.
Yifan Cao, Senior Product Manager, Databricks
Yifan Cao is a Senior Product Manager at Databricks. His product area spans ML/DL algorithms and Databricks Runtime for Machine Learning. Prior to Databricks, Yifan worked on two Machine Learning products, applying NLP to find metadata and applying machine learning to predict equipment failures. He helped build the products from ground up to multi-million dollars in ARR. Yifan started his career as a researcher in quantum computing. Yifan received his B.S in UC Berkeley and Master from MIT.
Detecting Financial Fraud at Scale with Machine LearningDatabricks
Detecting fraudulent patterns at scale is a challenge given the massive amounts of data to sift through, the complexity of the constantly evolving techniques, and the very small number of actual examples of fraudulent behavior. In finance, added security concerns and the importance of explaining how fraudulent behavior was identified further increases the difficulty of the task. Legacy systems rely on rule-based detection that is difficult to implement and run at scale. The resulting code is very complex and brittle, making it difficult to update to keep up with new threats.
In this talk, we will go over how to convert a rule based financial fraud detection program to use machine learning on Spark as part of a scalable, modular solution. We will examine how to identify appropriate features and labels and how to create a feedback loop that will allow the model to evolve and improve overtime. We will also look at how MLflow may be leveraged throughout this effort for experiment tracking and model deployment.
Specifically, we will discuss:
-How to create a fraud-detection data pipeline
-How to leverage a framework for building features from large datasets
-How to create modular code to re-use and maintain new machine learning models
-How to choose appropriate models and algorithms for a given fraud-detection problem
Augmenting Machine Learning with Databricks Labs AutoML ToolkitDatabricks
Instead of better understanding and optimizing their machine learning models, data scientists spend a majority of their time training and iterating through different models even in cases where there the data is reliable and clean. Important aspects of creating an ML model include (but are not limited to) data preparation, feature engineering, identifying the correct models, training (and continuing to train) and optimizing their models. This process can be (and often is) laborious and time-consuming.
In this session, we will explore this process and then show how the AutoML toolkit (from Databricks Labs) can significantly simplify and optimize machine learning. We will demonstrate all of this financial loan risk data with code snippets and notebooks that will be free to download.
SparkML: Easy ML Productization for Real-Time BiddingDatabricks
dataxu bids on ads in real-time on behalf of its customers at the rate of 3 million requests a second and trains on past bids to optimize for future bids. Our system trains thousands of advertiser-specific models and runs multi-terabyte datasets. In this presentation we will share the lessons learned from our transition towards a fully automated Spark-based machine learning system and how this has drastically reduced the time to get a research idea into production. We'll also share how we: - continually ship models to production - train models in an unattended fashion with auto-tuning capabilities - tune and overbooked cluster resources for maximum performance - ported our previous ML solution into Spark - evaluate the performance of high-rate bidding models
Speakers: Maximo Gurmendez, Javier Buquet
Splice Machine's use of Apache Spark and MLflowDatabricks
Splice Machine is an ANSI-SQL Relational Database Management System (RDBMS) on Apache Spark. It has proven low-latency transactional processing (OLTP) as well as analytical processing (OLAP) at petabyte scale. It uses Spark for all analytical computations and leverages HBase for persistence. This talk highlights a new Native Spark Datasource - which enables seamless data movement between Spark Data Frames and Splice Machine tables without serialization and deserialization. This Spark Datasource makes machine learning libraries such as MLlib native to the Splice RDBMS . Splice Machine has now integrated MLflow into its data platform, creating a flexible Data Science Workbench with an RDBMS at its core. The transactional capabilities of Splice Machine integrated with the plethora of DataFrame-compatible libraries and MLflow capabilities manages a complete, real-time workflow of data-to-insights-to-action. In this presentation we will demonstrate Splice Machine's Data Science Workbench and how it leverages Spark and MLflow to create powerful, full-cycle machine learning capabilities on an integrated platform, from transactional updates to data wrangling, experimentation, and deployment, and back again.
Learn to Use Databricks for Data ScienceDatabricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data.. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
The More the Merrier: Scaling Model Building Infrastructure at ZendeskDatabricks
Significant amount of effort is required to transform a machine learning (ML) model into a useful machine learning product. The incorporation of ML into real world applications almost feels like "1% algorithm and 99% perspiration". I will share with you my team experience in building 3 ML products at Zendesk. I will also discuss some real-world problems and scaling complexities you may encounter when building these products at web scale. Close collaboration with different groups including product, engineering and data science is imperative to strike the balance between model performance, scalability and computational efficiency. The talk mainly focuses on scaling our model building infrastructure with an aim to build at least 50,000 models a day. This is achieved as part of our efforts to deliver a ML product called Content Cues. In a nutshell, Content Cues summarizes text from customers support tickets to form insightful topics. It combines multiple ML algorithms including deep learning, clustering and other natural language processing approaches. These ML algorithms are then run through tens of thousands of eligible Zendesk customer data every day. My talk will cover the following topics: How we implement a horizontally scalable model building and model serving pipeline by combining AWS EMR, AWS Batch and Kubernetes How we tune the model building pipeline to optimize cost and efficiency without compromising resiliency Challenges in model monitoring, model versioning evolution and capturing of user feedback
Speaker: Wai Chee Yau
Automated Hyperparameter Tuning, Scaling and TrackingDatabricks
Automated Machine Learning (AutoML) has received significant interest recently. We believe that the right automation would bring significant value and dramatically shorten time-to-value for data science teams. Databricks is automating the Data Science and Machine Learning process through a combination of product offerings, partnerships, and custom solutions. This talk will focus on how Databricks can help automate hyperparameter tuning.
For both traditional Machine Learning and modern Deep Learning, tuning hyperparameters can dramatically increase model performance and improve training times. However, tuning can be a complex and expensive process. In this talk, we'll start with a brief survey of the most popular techniques for hyperparameter tuning (e.g., grid search, random search, and Bayesian optimization). We will then discuss open source tools that implement each of these techniques, helping to automate the search over hyperparameters.
Finally, we will discuss and demo improvements we built for these tools in Databricks, including integration with MLflow:
Apache PySpark MLlib integration with MLflow for automatically tracking tuning
Hyperopt integration with Apache Spark to distribute tuning and with MLflow for automatic tracking
Recording and notebooks will be provided after the webinar so that you can practice at your own pace.
Presenters
Joseph Bradley, Software Engineer, Databricks
Joseph Bradley is a Software Engineer and Apache Spark PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.
Yifan Cao, Senior Product Manager, Databricks
Yifan Cao is a Senior Product Manager at Databricks. His product area spans ML/DL algorithms and Databricks Runtime for Machine Learning. Prior to Databricks, Yifan worked on two Machine Learning products, applying NLP to find metadata and applying machine learning to predict equipment failures. He helped build the products from ground up to multi-million dollars in ARR. Yifan started his career as a researcher in quantum computing. Yifan received his B.S in UC Berkeley and Master from MIT.
Detecting Financial Fraud at Scale with Machine LearningDatabricks
Detecting fraudulent patterns at scale is a challenge given the massive amounts of data to sift through, the complexity of the constantly evolving techniques, and the very small number of actual examples of fraudulent behavior. In finance, added security concerns and the importance of explaining how fraudulent behavior was identified further increases the difficulty of the task. Legacy systems rely on rule-based detection that is difficult to implement and run at scale. The resulting code is very complex and brittle, making it difficult to update to keep up with new threats.
In this talk, we will go over how to convert a rule based financial fraud detection program to use machine learning on Spark as part of a scalable, modular solution. We will examine how to identify appropriate features and labels and how to create a feedback loop that will allow the model to evolve and improve overtime. We will also look at how MLflow may be leveraged throughout this effort for experiment tracking and model deployment.
Specifically, we will discuss:
-How to create a fraud-detection data pipeline
-How to leverage a framework for building features from large datasets
-How to create modular code to re-use and maintain new machine learning models
-How to choose appropriate models and algorithms for a given fraud-detection problem
Presented by David Taieb, Architect, IBM Cloud Data Services
Along with Spark Streaming, Spark SQL and GraphX, MLLib is one of the four key architectural components of Spark. It provides easy-to-use (even for beginners), powerful Machine Learning APIs that are designed to work in parallel using Spark RDDs. In this session, we’ll introduce the different algorithms available in MLLib, e.g. supervised learning with classification (binary and multi class) and regression but also unsupervised learning with clustering (K-means) and recommendation systems. We’ll conclude the presentation with a deep dive on a sample machine learning application built with Spark MLLib that predicts whether a scheduled flight will be delayed or not. This application trains a model using data from real flight information. The labeled flight data is combined with weather data from the “Insight for Weather” service available on IBM Bluemix Cloud Platform to form the training, test and blind data. Even if you are not a black belt in machine learning, you will learn in this session how to leverage powerful Machine Learning algorithms available in Spark to build interesting predictive and prescriptive applications.
About the Speaker: For the last 4 years, David has been the lead architect for the Watson Core UI & Tooling team based in Littleton, Massachusetts. During that time, he led the design and development of a Unified Tooling Platform to support all the Watson Tools including accuracy analysis, test experiments, corpus ingestion, and training data generation. Before that, he was the lead architect for the Domino Server OSGi team responsible for integrating the eXpeditor J2EE Web Container in Domino and building first class APIs for the developer community. He started with IBM in 1996, working on various globalization technologies and products including Domino Global Workbench (used to develop multilingual Notes/Domino NSF applications) and a multilingual Content Management system for the Websphere Application Server. David enjoys sharing his experience by speaking at conferences. You’ll find him at various events like the Unicode conference, Eclipsecon, and Lotusphere. He’s also passionate about building tools that help improve developer productivity and overall experience.
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle ManagementDatabricks
The ML Lifecycle management process is quickly becoming the bottleneck for a lot of ML projects. With MLflow’s newest release, and its enhanced integration with Azure Machine Learning, this process is now showing the right promise and capabilities on Azure. In this talk, we intend to take a tour of the integration details and how MLOps is now becoming a strength of the platform. We’ll talk about versioning, maintaining run history, production pipeline automation, deployment to cloud and edge, and CI/CD pipelines with MLOps as the backdrop.
Be prepared for an interactive conversation as we intend to seek a lot of feedback on the integration and capabilities being lit up.
CyberMLToolkit: Anomaly Detection as a Scalable Generic Service Over Apache S...Databricks
Cybercrime is one the greatest threats to every company in the world today and a major problem for mankind in general. The damage due to Cybercrime is estimated to be around $6 Trillion By 2021. Security professionals are struggling to cope with the threat. As a result, powerful and easy to use tools are necessary to aid in this battle. For this purpose we created an anomaly detection framework focused on security which can identify anomalous access patterns. It is built on top of Apache Spark and can be applied in parallel over multiple tenants. This allows the model to be trained over the data of thousands of customers over a Databricks cluster within less than an hour. The model leverages proven technologies from Recommendation Engines to produce high quality anomalies. We thoroughly evaluated the model’s ability to identify actual anomalies by using synthetically generated data and also by creating an actual attack and showing that the model clearly identifies the attack as anomalous behavior. We plan to open source this library as part of a cyber-ML toolkit we will be offering.
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...Databricks
A long time ago, there was Caffe and Theano, then came Torch and CNTK and Tensorflow, Keras and MXNet and Pytorch and Caffe2….a sea of Deep learning tools but none for Spark developers to dip into. Finally, there was BigDL, a deep learning library for Apache Spark. While BigDL is integrated into Spark and extends its capabilities to address the challenges of Big Data developers, will a library alone be enough to simplify and accelerate the deployment of ML/DL workloads on production clusters? From high level pipeline API support to feature transformers to pre-defined models and reference use cases, a rich repository of easy to use tools are now available with the ‘Analytics Zoo’. We’ll unpack the production challenges and opportunities with ML/DL on Spark and what the Zoo can do
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Databricks
As data grows in size and connectedness dramatically in all dimensions, the potential for graph-enriched machine learning grows likewise, but scalable technologies are needed to both build models and apply them in real-time. Real-time deep-link graph pattern matching and analytics provides new opportunities for enriching your machine learning models with graph features.
‘In addition to the real-time deep-link aspect, the ability to process large datasets in a production pipeline provides a synergistic approach for the two distributed and performant platforms: Spark and TigerGraph. The TigerGraph graph database provides scalable real-time deep link graph analytics and augments Spark with graph analytics and predictions for a wide range of Machine Learning use cases.
In this session, we will explain the architecture and technical implementation for a TigerGraph+Spark graph-enhanced Machine Learning pipeline: Use TigerGraph both before training to extract (graph and non-graph) features and after training to apply the model on streaming data; use Spark to train and tune machine learning models at scale. As an example, we will present a solution in production at China Mobile that detects and prevents phone-based scams using machine learning with TigerGraph.
Specifically, the solution generates 118 graph features for 600 million users, to feed a machine learning system which detects three types of unwanted phone calls. TigerGraph then helps to deploy the model by extracting these 118 features in real-time for up to 10,000 calls per second, to give customers a real-time diagnosis of their incoming calls.
These are slides presented at MLconf in San Francisco, November 14, 2014. I share the approach to real-time machine learning for recommender systems developed at if(we). We achieve rapid iterative cycles by adhering to a strict approach to structuring and accessing our data, as well as to building the online features that comprise our models. These developments support teams of data scientist and data engineers, who work together to solve complex recommendation problems. We also introduce the Antelope Realtime Events framework, an open source demonstration application which derives from our scalable proprietary software stack.
Advanced Hyperparameter Optimization for Deep Learning with MLflowDatabricks
Building on the "Best Practices for Hyperparameter Tuning with MLflow" talk, we will present advanced topics in HPO for deep learning, including early stopping, multi-metric optimization, and robust optimization. We will then discuss implementations using open source tools. Finally, we will discuss how we can leverage MLflow with these tools and techniques to analyze the performance of our models.
Why APM Is Not the Same As ML MonitoringDatabricks
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
Applied Machine Learning for Ranking Products in an Ecommerce SettingDatabricks
As a leading e-commerce company in fashion in the Netherlands, Wehkamp dedicates itself to provide a better shopping experience for the customers. Using Spark, the data science team is able to develop various machine-learning projects for this purpose based on the large scale data of products and customers. A major topic for the data science team is ranking products. If a visitor enters a search phrase, what are the best products that fit the search phrase and in what order should the products been shown? Ranking products is also important if a visitor enters a product overview page, where hundreds or even thousands of products of a certain article type are displayed.
In this project, Spark is used in the whole pipeline: retrieving and processing the search phrases and their results, making click models, creating feature sets, training and evaluating ranking models, pushing the models to production using ElasticSearch and creating Tableau dashboarding. In this talk, we are going to demonstrate how we use Spark to build up the whole pipeline of ranking products and the challenges we faced along the way.
Auto-Pilot for Apache Spark Using Machine LearningDatabricks
At Qubole, users run Spark at scale on cloud (900+ concurrent nodes). At such scale, for efficiently running SLA critical jobs, tuning Spark configurations is essential. But it continues to be a difficult undertaking, largely driven by trial and error. In this talk, we will address the problem of auto-tuning SQL workloads on Spark. The same technique can also be adapted for non-SQL Spark workloads. In our earlier work[1], we proposed a model based on simple rules and insights. It was simple yet effective at optimizing queries and finding the right instance types to run queries. However, with respect to auto tuning Spark configurations we saw scope of improvement. On exploration, we found previous works addressing auto-tuning using Machine learning techniques. One major drawback of the simple model[1] is that it cannot use multiple runs of query for improving recommendation, whereas the major drawback with Machine Learning techniques is that it lacks domain specific knowledge. Hence, we decided to combine both techniques. Our auto-tuner interacts with both models to arrive at good configurations. Once user selects a query to auto tune, the next configuration is computed from models and the query is run with it. Metrics from event log of the run is fed back to models to obtain next configuration. Auto-tuner will continue exploring good configurations until it meets the fixed budget specified by the user. We found that in practice, this method gives much better configurations compared to configurations chosen even by experts on real workload and converges soon to optimal configuration. In this talk, we will present a novel ML model technique and the way it was combined with our earlier approach. Results on real workload will be presented along with limitations and challenges in productionizing them. [1] Margoor et al,'Automatic Tuning of SQL-on-Hadoop Engines' 2018,IEEE CLOUD
Bootstrapping of PySpark Models for Factorial A/B TestsDatabricks
A/B testing, i.e., measuring the impact of proposed variants of e.g. e-commerce websites, is fundamental for increasing conversion rates and other key business metrics.
We have developed a solution that makes it possible to run dozens of simultaneous A/B tests, obtain conclusive results sooner, and get more interpretable results than just statistical significance, but rather probabilities of the change having a positive effect, how much revenue is risked, etc.
To compute those metrics, we need to estimate the posterior distributions of the metrics, which are computed using Generalized Linear Models (GLMs). Since we process gigabytes of data, we use a PySpark implementation, which however does not provide standard errors of coefficients. We, therefore, use bootstrapping to estimate the distributions.
In this talk, I’ll describe how we’ve implemented parallelization of an already parallelized GLM computation to be able to scale this computation horizontally over a large cluster in Databricks and describe various tweaks and how they’ve improved the performance.
Pinterest - Big Data Machine Learning Platform at PinterestAlluxio, Inc.
This was presented by the Yongsheng Wu, head of big data and ML platform at Pinterest, at the Alluxio bay area meetup.
Yongsheng shares Pinterest's journey to build a fast and scalable big data and ML platform in AWS for Pinterest to handle the requests and complexity in data at scale. In this talk, he will cover different aspects from the requirements of the platform, the challenges encountered, the technologies chosen, and the tradeoffs that were made.
Introducing apache prediction io (incubating) (bay area spark meetup at sales...Databricks
PredictionIO cofounder and creator Donald Szeto presents what, why and how of Apache PredictionIO as a Machine Learning framework running on top of Apache Spark.
Transforming AI with Graphs: Real World Examples using Spark and Neo4jFred Madrid
Graphs – or information about the relationships, connection, and topology of data points – are transforming machine learning. We’ll walk through real world examples of how to get transform your tabular data into a graph and how to get started with graph AI. This talk will provide an overview of how we to incorporate graph based features into traditional machine learning pipelines, create graph embeddings to better describe your graph topology, and give you a preview of approaches for graph native learning using graph neural networks. We’ll talk about relevant, real world case studies in financial crime detection, recommendations, and drug discovery. This talk is intended to introduce the concept of graph based AI to beginners, as well as help practitioners understand new techniques and applications. Key take aways: how graph data can improve machine learning, when graphs are relevant to data science applications, what graph native learning is and how to get started.
NLP Text Recommendation System Journey to Automated TrainingDatabricks
This talk will cover how we built and productionized automated machine learning pipelines at Salesforce. Starting with heuristics to automated retraining using technologies including but not limited to Scala, Python, Apache Spark, Docker, Sagemaker for training, and serving. We will walk through the generally applicable data prep, feature engineering, training, evaluation/comparisons, and continuous model training including data feedback loops in containerized environments with Sagemaker. We will talk about our deployment and validation approach. Finally, we’ll draw lessons from iteratively building an enterprise ML product. Attendees will learn about the mental models for building end to end prod ML pipelines and GA ready products.
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Databricks
GE Aviation has hundreds of data scientists and engineers developing algorithms. The majority of these people do not have the time to learn Apache Spark and continue to develop on local machines in Python or R. We also have lots of historical code that was not developed for Spark. However, the business wanted to deploy to a Spark environment for scalability, as quickly as possible. So how did we bridge the gap? A data scientist and software engineer will co-present to share how we approached the problem of building, unifying and scaling these algorithms.
An Architecture for Agile Machine Learning in Real-Time ApplicationsJohann Schleier-Smith
Presented at KDD, August 11, 2015.
Abstract of the paper:
Machine learning techniques have proved effective in recommender systems and other applications, yet teams working to deploy them lack many of the advantages that those in more established software disciplines today take for granted. The well-known Agile methodology advances projects in a chain of rapid development cycles, with subsequent steps often informed by production experiments. Support for such workflow in machine learning applications remains primitive.
The platform developed at if(we) embodies a specific machine learning approach and a rigorous data architecture constraint, so allowing teams to work in rapid iterative cycles. We require models to consume data from a time-ordered event history, and we focus on facilitating creative feature engineering. We make it practical for data scientists to use the same model code in development and in production deployment, and make it practical for them to collaborate on complex models.
We deliver real-time recommendations at scale, returning top results from among 10,000,000 candidates with sub-second response times and incorporating new updates in just a few seconds. Using the approach and architecture described here, our team can routinely go from ideas for new models to production-validated results within two weeks.
Presented by David Taieb, Architect, IBM Cloud Data Services
Along with Spark Streaming, Spark SQL and GraphX, MLLib is one of the four key architectural components of Spark. It provides easy-to-use (even for beginners), powerful Machine Learning APIs that are designed to work in parallel using Spark RDDs. In this session, we’ll introduce the different algorithms available in MLLib, e.g. supervised learning with classification (binary and multi class) and regression but also unsupervised learning with clustering (K-means) and recommendation systems. We’ll conclude the presentation with a deep dive on a sample machine learning application built with Spark MLLib that predicts whether a scheduled flight will be delayed or not. This application trains a model using data from real flight information. The labeled flight data is combined with weather data from the “Insight for Weather” service available on IBM Bluemix Cloud Platform to form the training, test and blind data. Even if you are not a black belt in machine learning, you will learn in this session how to leverage powerful Machine Learning algorithms available in Spark to build interesting predictive and prescriptive applications.
About the Speaker: For the last 4 years, David has been the lead architect for the Watson Core UI & Tooling team based in Littleton, Massachusetts. During that time, he led the design and development of a Unified Tooling Platform to support all the Watson Tools including accuracy analysis, test experiments, corpus ingestion, and training data generation. Before that, he was the lead architect for the Domino Server OSGi team responsible for integrating the eXpeditor J2EE Web Container in Domino and building first class APIs for the developer community. He started with IBM in 1996, working on various globalization technologies and products including Domino Global Workbench (used to develop multilingual Notes/Domino NSF applications) and a multilingual Content Management system for the Websphere Application Server. David enjoys sharing his experience by speaking at conferences. You’ll find him at various events like the Unicode conference, Eclipsecon, and Lotusphere. He’s also passionate about building tools that help improve developer productivity and overall experience.
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle ManagementDatabricks
The ML Lifecycle management process is quickly becoming the bottleneck for a lot of ML projects. With MLflow’s newest release, and its enhanced integration with Azure Machine Learning, this process is now showing the right promise and capabilities on Azure. In this talk, we intend to take a tour of the integration details and how MLOps is now becoming a strength of the platform. We’ll talk about versioning, maintaining run history, production pipeline automation, deployment to cloud and edge, and CI/CD pipelines with MLOps as the backdrop.
Be prepared for an interactive conversation as we intend to seek a lot of feedback on the integration and capabilities being lit up.
CyberMLToolkit: Anomaly Detection as a Scalable Generic Service Over Apache S...Databricks
Cybercrime is one the greatest threats to every company in the world today and a major problem for mankind in general. The damage due to Cybercrime is estimated to be around $6 Trillion By 2021. Security professionals are struggling to cope with the threat. As a result, powerful and easy to use tools are necessary to aid in this battle. For this purpose we created an anomaly detection framework focused on security which can identify anomalous access patterns. It is built on top of Apache Spark and can be applied in parallel over multiple tenants. This allows the model to be trained over the data of thousands of customers over a Databricks cluster within less than an hour. The model leverages proven technologies from Recommendation Engines to produce high quality anomalies. We thoroughly evaluated the model’s ability to identify actual anomalies by using synthetically generated data and also by creating an actual attack and showing that the model clearly identifies the attack as anomalous behavior. We plan to open source this library as part of a cyber-ML toolkit we will be offering.
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...Databricks
A long time ago, there was Caffe and Theano, then came Torch and CNTK and Tensorflow, Keras and MXNet and Pytorch and Caffe2….a sea of Deep learning tools but none for Spark developers to dip into. Finally, there was BigDL, a deep learning library for Apache Spark. While BigDL is integrated into Spark and extends its capabilities to address the challenges of Big Data developers, will a library alone be enough to simplify and accelerate the deployment of ML/DL workloads on production clusters? From high level pipeline API support to feature transformers to pre-defined models and reference use cases, a rich repository of easy to use tools are now available with the ‘Analytics Zoo’. We’ll unpack the production challenges and opportunities with ML/DL on Spark and what the Zoo can do
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Databricks
As data grows in size and connectedness dramatically in all dimensions, the potential for graph-enriched machine learning grows likewise, but scalable technologies are needed to both build models and apply them in real-time. Real-time deep-link graph pattern matching and analytics provides new opportunities for enriching your machine learning models with graph features.
‘In addition to the real-time deep-link aspect, the ability to process large datasets in a production pipeline provides a synergistic approach for the two distributed and performant platforms: Spark and TigerGraph. The TigerGraph graph database provides scalable real-time deep link graph analytics and augments Spark with graph analytics and predictions for a wide range of Machine Learning use cases.
In this session, we will explain the architecture and technical implementation for a TigerGraph+Spark graph-enhanced Machine Learning pipeline: Use TigerGraph both before training to extract (graph and non-graph) features and after training to apply the model on streaming data; use Spark to train and tune machine learning models at scale. As an example, we will present a solution in production at China Mobile that detects and prevents phone-based scams using machine learning with TigerGraph.
Specifically, the solution generates 118 graph features for 600 million users, to feed a machine learning system which detects three types of unwanted phone calls. TigerGraph then helps to deploy the model by extracting these 118 features in real-time for up to 10,000 calls per second, to give customers a real-time diagnosis of their incoming calls.
These are slides presented at MLconf in San Francisco, November 14, 2014. I share the approach to real-time machine learning for recommender systems developed at if(we). We achieve rapid iterative cycles by adhering to a strict approach to structuring and accessing our data, as well as to building the online features that comprise our models. These developments support teams of data scientist and data engineers, who work together to solve complex recommendation problems. We also introduce the Antelope Realtime Events framework, an open source demonstration application which derives from our scalable proprietary software stack.
Advanced Hyperparameter Optimization for Deep Learning with MLflowDatabricks
Building on the "Best Practices for Hyperparameter Tuning with MLflow" talk, we will present advanced topics in HPO for deep learning, including early stopping, multi-metric optimization, and robust optimization. We will then discuss implementations using open source tools. Finally, we will discuss how we can leverage MLflow with these tools and techniques to analyze the performance of our models.
Why APM Is Not the Same As ML MonitoringDatabricks
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
Applied Machine Learning for Ranking Products in an Ecommerce SettingDatabricks
As a leading e-commerce company in fashion in the Netherlands, Wehkamp dedicates itself to provide a better shopping experience for the customers. Using Spark, the data science team is able to develop various machine-learning projects for this purpose based on the large scale data of products and customers. A major topic for the data science team is ranking products. If a visitor enters a search phrase, what are the best products that fit the search phrase and in what order should the products been shown? Ranking products is also important if a visitor enters a product overview page, where hundreds or even thousands of products of a certain article type are displayed.
In this project, Spark is used in the whole pipeline: retrieving and processing the search phrases and their results, making click models, creating feature sets, training and evaluating ranking models, pushing the models to production using ElasticSearch and creating Tableau dashboarding. In this talk, we are going to demonstrate how we use Spark to build up the whole pipeline of ranking products and the challenges we faced along the way.
Auto-Pilot for Apache Spark Using Machine LearningDatabricks
At Qubole, users run Spark at scale on cloud (900+ concurrent nodes). At such scale, for efficiently running SLA critical jobs, tuning Spark configurations is essential. But it continues to be a difficult undertaking, largely driven by trial and error. In this talk, we will address the problem of auto-tuning SQL workloads on Spark. The same technique can also be adapted for non-SQL Spark workloads. In our earlier work[1], we proposed a model based on simple rules and insights. It was simple yet effective at optimizing queries and finding the right instance types to run queries. However, with respect to auto tuning Spark configurations we saw scope of improvement. On exploration, we found previous works addressing auto-tuning using Machine learning techniques. One major drawback of the simple model[1] is that it cannot use multiple runs of query for improving recommendation, whereas the major drawback with Machine Learning techniques is that it lacks domain specific knowledge. Hence, we decided to combine both techniques. Our auto-tuner interacts with both models to arrive at good configurations. Once user selects a query to auto tune, the next configuration is computed from models and the query is run with it. Metrics from event log of the run is fed back to models to obtain next configuration. Auto-tuner will continue exploring good configurations until it meets the fixed budget specified by the user. We found that in practice, this method gives much better configurations compared to configurations chosen even by experts on real workload and converges soon to optimal configuration. In this talk, we will present a novel ML model technique and the way it was combined with our earlier approach. Results on real workload will be presented along with limitations and challenges in productionizing them. [1] Margoor et al,'Automatic Tuning of SQL-on-Hadoop Engines' 2018,IEEE CLOUD
Bootstrapping of PySpark Models for Factorial A/B TestsDatabricks
A/B testing, i.e., measuring the impact of proposed variants of e.g. e-commerce websites, is fundamental for increasing conversion rates and other key business metrics.
We have developed a solution that makes it possible to run dozens of simultaneous A/B tests, obtain conclusive results sooner, and get more interpretable results than just statistical significance, but rather probabilities of the change having a positive effect, how much revenue is risked, etc.
To compute those metrics, we need to estimate the posterior distributions of the metrics, which are computed using Generalized Linear Models (GLMs). Since we process gigabytes of data, we use a PySpark implementation, which however does not provide standard errors of coefficients. We, therefore, use bootstrapping to estimate the distributions.
In this talk, I’ll describe how we’ve implemented parallelization of an already parallelized GLM computation to be able to scale this computation horizontally over a large cluster in Databricks and describe various tweaks and how they’ve improved the performance.
Pinterest - Big Data Machine Learning Platform at PinterestAlluxio, Inc.
This was presented by the Yongsheng Wu, head of big data and ML platform at Pinterest, at the Alluxio bay area meetup.
Yongsheng shares Pinterest's journey to build a fast and scalable big data and ML platform in AWS for Pinterest to handle the requests and complexity in data at scale. In this talk, he will cover different aspects from the requirements of the platform, the challenges encountered, the technologies chosen, and the tradeoffs that were made.
Introducing apache prediction io (incubating) (bay area spark meetup at sales...Databricks
PredictionIO cofounder and creator Donald Szeto presents what, why and how of Apache PredictionIO as a Machine Learning framework running on top of Apache Spark.
Transforming AI with Graphs: Real World Examples using Spark and Neo4jFred Madrid
Graphs – or information about the relationships, connection, and topology of data points – are transforming machine learning. We’ll walk through real world examples of how to get transform your tabular data into a graph and how to get started with graph AI. This talk will provide an overview of how we to incorporate graph based features into traditional machine learning pipelines, create graph embeddings to better describe your graph topology, and give you a preview of approaches for graph native learning using graph neural networks. We’ll talk about relevant, real world case studies in financial crime detection, recommendations, and drug discovery. This talk is intended to introduce the concept of graph based AI to beginners, as well as help practitioners understand new techniques and applications. Key take aways: how graph data can improve machine learning, when graphs are relevant to data science applications, what graph native learning is and how to get started.
NLP Text Recommendation System Journey to Automated TrainingDatabricks
This talk will cover how we built and productionized automated machine learning pipelines at Salesforce. Starting with heuristics to automated retraining using technologies including but not limited to Scala, Python, Apache Spark, Docker, Sagemaker for training, and serving. We will walk through the generally applicable data prep, feature engineering, training, evaluation/comparisons, and continuous model training including data feedback loops in containerized environments with Sagemaker. We will talk about our deployment and validation approach. Finally, we’ll draw lessons from iteratively building an enterprise ML product. Attendees will learn about the mental models for building end to end prod ML pipelines and GA ready products.
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Databricks
GE Aviation has hundreds of data scientists and engineers developing algorithms. The majority of these people do not have the time to learn Apache Spark and continue to develop on local machines in Python or R. We also have lots of historical code that was not developed for Spark. However, the business wanted to deploy to a Spark environment for scalability, as quickly as possible. So how did we bridge the gap? A data scientist and software engineer will co-present to share how we approached the problem of building, unifying and scaling these algorithms.
An Architecture for Agile Machine Learning in Real-Time ApplicationsJohann Schleier-Smith
Presented at KDD, August 11, 2015.
Abstract of the paper:
Machine learning techniques have proved effective in recommender systems and other applications, yet teams working to deploy them lack many of the advantages that those in more established software disciplines today take for granted. The well-known Agile methodology advances projects in a chain of rapid development cycles, with subsequent steps often informed by production experiments. Support for such workflow in machine learning applications remains primitive.
The platform developed at if(we) embodies a specific machine learning approach and a rigorous data architecture constraint, so allowing teams to work in rapid iterative cycles. We require models to consume data from a time-ordered event history, and we focus on facilitating creative feature engineering. We make it practical for data scientists to use the same model code in development and in production deployment, and make it practical for them to collaborate on complex models.
We deliver real-time recommendations at scale, returning top results from among 10,000,000 candidates with sub-second response times and incorporating new updates in just a few seconds. Using the approach and architecture described here, our team can routinely go from ideas for new models to production-validated results within two weeks.
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...Amazon Web Services
Machine learning (ML) is rapidly being adopted by enterprises, enabling them to be nimble and align technical solutions to solve real-world business problems. ML use cases include diagnosis and research in healthcare, financial fraud detection, natural language processing (NLU), and accurate statistics in sports. Amazon SageMaker is a fully managed platform that enables developers to build, train, and deploy enterprise-scale ML models quickly and easily. In this workshop, we build an ML model using Amazon SageMaker’s built-in algorithms and frameworks. We train the model to achieve a high level of accurate predictions, then we deploy the model in production to achieve best results. Gain an understanding of how Amazon SageMaker removes the complexity and barriers to use and deploy ML models.
[AWS Innovate 온라인 컨퍼런스] 간단한 Python 코드만으로 높은 성능의 기계 학습 모델 만들기 - 김무현, AWS Sr.데이...Amazon Web Services Korea
발표자료 다시보기: https://youtu.be/xnimaVNTWfc
여러분의 애플리케이션에 인공 지능 기능을 추가하는 방법 중 하나로, GluonCV 및 AutoGluon 라이브러리를 이용해서 간단한 Python 코드로 높은 성능의 기계 학습 모델을 만들고 이를 예측에 사용하는 방법을 소개합니다. 정형 데이터에 대한 분류 또는 수치 예측 모델 생성부터 이미지 분류, 객체 탐지, 세그먼테이션, 행동 인식 등의 모델을 기계 학습에 대한 전문 지식이 없이도 자동으로 만들고 활용하는 방법을 알아봅니다.
Building Deep Learning Applications with TensorFlow and SageMaker on AWS - Te...Amazon Web Services
Deep learning continues to push the state of the art in domains such as computer vision, natural language understanding, and recommendation engines. One of the key reasons for this progress is the availability of highly flexible and developer friendly deep learning frameworks. In this workshop, we provide an overview of deep learning, focusing on getting started with the TensorFlow framework on AWS.
Build Deep Learning Applications with TensorFlow and Amazon SageMakerAmazon Web Services
Deep learning continues to push the state of the art in domains such as computer vision, natural language understanding and recommendation engines. In this workshop, you’ll learn how to get started with the TensorFlow deep learning framework using Amazon SageMaker, a platform to easily build, train and deploy models at scale. You’ll learn how to build a model using TensorFlow by setting up a Jupyter notebook to get started with image and object recognition. You’ll also learn how to quickly train and deploy a model through Amazon
Build a Custom Model for Object & Logo Detection (AIM421) - AWS re:Invent 2018Amazon Web Services
Detecting specific objects and logos is a feature that can help companies in any industry, from media and entertainment to financial services. However, detecting new objects or logos requires building a custom model. In this chalk talk, learn how to use Amazon Rekognition and Amazon SageMaker to build a custom model to detect logos, objects, or even inappropriate content.
NLP in Healthcare to Predict Adverse Events with Amazon SageMaker (AIM346) - ...Amazon Web Services
In healthcare, pharmacovigilance is key to improving patient outcomes. The prediction of adverse events will enable pharmaceutical companies and drug distributors in accurately meeting their pharmacovigilance requirements and scaling their operations. In this chalk talk, we discuss how Amazon SageMaker can be used to classify large-scale agent and reporter interaction summaries. We also discuss natural language processing (NLP) methods and results.
Building, Training, and Deploying fast.ai Models Using Amazon SageMaker (AIM4...Amazon Web Services
In a short space of time, fast.ai has become a popular Deep Learning library, driven by the success of the fast.ai online Massive Open Online Course (MOOC). It has allowed SW developers to achieve, in the span of a few weeks, state-of-the-art results in domains such as Computer Vision (CV), Natural Language Processing (NLP), and structured data machine learning. In this chalk talk, we go into the details of building, training, and deploying fast.ai-based models using Amazon SageMaker.
Building Deep Learning Applications with TensorFlow and Amazon SageMakerAmazon Web Services
by Steve Shirkey, Solutions Architect ASEAN
Deep learning continues to push the state of the art in domains such as computer vision, natural language understanding and recommendation engines. In this workshop, you’ll learn how to get started with the TensorFlow deep learning framework using Amazon SageMaker, a platform to easily build, train and deploy models at scale. You’ll learn how to build a model using TensorFlow by setting up a Jupyter notebook to get started with image and object recognition. You’ll also learn how to quickly train and deploy a model through Amazon.
Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...Amazon Web Services
The Apache MXNet deep learning framework is used for developing, training, and deploying diverse AI applications, including computer vision, speech recognition, natural language processing, and more at scale. In this session, learn how to get started with Apache MXNet on the Amazon SageMaker machine learning platform. Chick-fil-A share how they got started with MXNet on Amazon SageMaker to measure waffle fry freshness and how they leverage AWS services to improve the Chick-fil-A guest experience.
"Automated machine learning (AutoML) is the process of automating the end-to-end process of applying machine learning to real-world problems. In a typical machine learning application, practitioners must apply the appropriate data pre-processing, feature engineering, feature extraction, and feature selection methods that make the dataset amenable for machine learning. Following those preprocessing steps, practitioners must then perform algorithm selection and hyperparameter optimization to maximize the predictive performance of their final machine learning model. As many of these steps are often beyond the abilities of non-experts, AutoML was proposed as an artificial intelligence-based solution to the ever-growing challenge of applying machine learning. Automating the end-to-end process of applying machine learning offers the advantages of producing simpler solutions, faster creation of those solutions, and models that often outperform models that were designed by hand."
In this talk we will discuss how QuSandbox and the Model Analytics Studio can be used in the selection of machine learning models. We will also illustrate AutoML frameworks through demos and examples and show you how to get started
Windows Machine Learning, Deep Learning, and Artificial Intelligence (WIN330)...Amazon Web Services
Are you interested in training and running your machine learning models on Windows? Are you unsure of how to get your data from Microsoft SQL Server to Amazon SageMaker? In this session, we focus on the patterns and practices for machine learning on Windows. This is an interactive question-and-answer session. Please bring your questions, and join us for this discussion.
Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...Amazon Web Services
The Apache MXNet deep learning framework is used for developing, training, and deploying diverse AI applications, including computer vision, speech recognition, and natural language processing at scale. In this session, learn how to get started with MXNet on the Amazon SageMaker machine learning platform. Hear from Workday about how they built computer vision and natural language processing (NLP) models using MXNet to automatically extract information from paper documents, such as expense receipts and populate data records. Workday also shares its experience using Sockeye, an MXNet toolkit for quickly prototyping sequence-to-sequence NLP models.
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...Databricks
I will share the vision and the production journey of how we build enterprise shared AI As A Service platforms with distributed deep learning technologies. Including those topics:
1) The vision of Enterprise Shared AI As A Service and typical AI services use cases at FinTech industry
2) The high level architecture design principles for AI As A Service
3) The technical evaluation journey to choose an enterprise deep learning framework with comparisons, such as why we choose Deep learning framework based on Spark ecosystem
4) Share some production AI use cases, such as how we implemented new Users-Items Propensity Models with deep learning algorithms with Spark,improve the quality , performance and accuracy of offer and campaigns design, targeting offer matching and linking etc.
5) Share some experiences and tips of using deep learning technologies on top of Spark , such as how we conduct Intel BigDL into a real production.
Similar to 16th Athens Big Data Meetup - 1st Talk - An Introduction to Machine Learning with Python and Scikit-Learn (20)
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...Athens Big Data
Title: MLOps Workshop: The Full ML Lifecycle - How to Use ML in Production
Speakers: Spyros Cavadias (https://www.linkedin.com/in/spyros-cavadias/), Konstantinos Pittas (https://www.linkedin.com/in/konstantinos-pittas-83310270/), Thanos Gkinakos (https://www.linkedin.com/in/thanos-gkinakos-03582a128/)
Date: Saturday, December 17, 2022
Event: https://www.meetup.com/athens-big-data/events/289927468/
21st Athens Big Data Meetup - 2nd Talk - Dive into ClickHouse storage systemAthens Big Data
Title: Dive into ClickHouse storage system
Speaker: Alexander Sapin (https://github.com/alesapin/)
Date: Thursday, March 5, 2020
Event: https://meetup.com/Athens-Big-Data/events/268379195/
19th Athens Big Data Meetup - 2nd Talk - NLP: From news recommendation to wor...Athens Big Data
Title: NLP: From news recommendation to word prediction
Speaker: Mihalis Papakonstantinou (https://linkedin.com/in/mihalispapak/)
Date: Thursday, December 12, 2019
Event: https://meetup.com/Athens-Big-Data/events/266762191/
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...Athens Big Data
Title: Fast and simple data exploration with ClickHouse
Speaker: Alexander Kuzmenkov (https://github.com/akuzm/)
Date: Thursday, March 5, 2020
Event: https://meetup.com/Athens-Big-Data/events/268379195/
20th Athens Big Data Meetup - 2nd Talk - Druid: under the coversAthens Big Data
Title: Druid: under the covers
Speaker: Peter Marshall (https://linkedin.com/in/amillionbytes/)
Date: Tuesday, January 28, 2020
Event: https://meetup.com/Athens-Big-Data/events/266900242/
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...Athens Big Data
Title: Druid: the open source, performant, real-time, analytical datastore
Speaker: Peter Marshall (https://linkedin.com/in/amillionbytes/)
Date: Tuesday, January 28, 2020
Event: https://meetup.com/Athens-Big-Data/events/266900242/
18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on KubernetesAthens Big Data
Title: Run Spark and Flink Jobs on Kubernetes
Speaker: Chaoran Yu (https://linkedin.com/in/chaoran-yu-97b1144a/)
Date: Thursday, November 14, 2019
Event: https://meetup.com/Athens-Big-Data/events/265957761/
18th Athens Big Data Meetup - 1st Talk - Timeseries Forecasting as a ServiceAthens Big Data
Title: Timeseries Forecasting as a Service
Speaker: Thanassis Spyrou (https://linkedin.com/in/thanassis-spyrou-92911959/)
Date: Thursday, November 14, 2019
Event: https://meetup.com/Athens-Big-Data/events/265957761/
17th Athens Big Data Meetup - 2nd Talk - Data Flow Building and Calculation P...Athens Big Data
Title: Data Flow Building and Calculation Pipelines via PySpark and ML Modeling via Python
Speaker: Theodoros Michalareas (https://linkedin.com/in/theodorosmichalareas/)
Date: Tuesday, September 24, 2019
Event: https://meetup.com/Athens-Big-Data/events/264702584/
16th Athens Big Data Meetup - 2nd Talk - A Focus on Building and Optimizing M...Athens Big Data
Title: A Focus on Building and Optimizing ML Models with Amazon SageMaker
Speaker: Pavlos Mitsoulis-Ntompos (https://linkedin.com/in/pavlosmitsoulis/)
Date: Thursday, March 14, 2019
Event: https://www.meetup.com/Athens-Big-Data/events/259091496/
15th Athens Big Data Meetup - 1st Talk - Running Spark On MesosAthens Big Data
Title: Running Spark On Mesos
Speaker: Mr. Chris Sidiropoulos (https://linkedin.com/in/chris-sidiropoulos-2a6b156a//)
Date: Tuesday, November 27, 2018
Event: https://meetup.com/Athens-Big-Data/events/256098657/
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...Athens Big Data
Title: Real-Time Training and Deploying Spark ML Recommendations With Kafka and NetflixOSS
Speaker: Chris Fregly (https://linkedin.com/in/cfregly/)
Date: Monday, October 17, 2016
Event: https://meetup.com/Athens-Big-Data/events/234546355/
13th Athens Big Data Meetup - 2nd Talk - Training Neural Networks With Enterp...Athens Big Data
Title: Training Neural Networks With Enterprise Relational Data
Speaker: Dr. Christos Malliopoulos (https://linkedin.com/in/cmalliopoulos//)
Date: Thursday, June 21, 2018
Event: https://meetup.com/Athens-Big-Data/events/251411957/
11th Athens Big Data Meetup - 2nd Talk - Beyond Bitcoin; Blockchain Technolog...Athens Big Data
Title: Beyond Bitcoin; Blockchain Technologies And How They Disrupt Every Industry
Speaker: Mr. Dimitrios Kouzis-Loukas (https://linkedin.com/in/lookfwd//)
Date: Wednesday, December 27, 2017
Event: https://meetup.com/Athens-Big-Data/events/246033156/
9th Athens Big Data Meetup - 2nd Talk - Lead Scoring And GradingAthens Big Data
Title: Lead Scoring And Lead Grading
Speaker: Dr. Lefteris Mantelas (https://linkedin.com/in/lefterismantelas//)
Date: Tuesday, October 24, 2017
Event: https://meetup.com/Athens-Big-Data/events/243829781/
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
4. Artificial Intelligence: design software applications which
exhibit human-like behavior, e.g. speech, natural language
processing, reasoning or intuition
Machine Learning: using statistical algorithms, teach
machines to learn from featurized data without being
explicitly programmed
Deep Learning: using neural networks, teach machines to
learn from complex data where features cannot be explicitly
expressed