This document describes a fast, scalable, online, distributed machine learning classifier built on Apache Spark. Leveraging recent research in online learning, it handles large, sparse datasets with up to hundreds of millions of features in a single pass. Online learning techniques such as stochastic gradient descent update the model incrementally as new data arrives, with no separate training and testing phases and no repeated passes over the training data, which makes the system well suited to streaming applications that need real-time predictions. Key challenges addressed include feature scaling, handling different feature frequencies, and efficiently encoding sparse features; examples from large-scale advertising and IoT applications are provided.
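As a rough illustration of the ideas summarized here, the sketch below combines the hashing trick for encoding sparse features with a single-pass stochastic-gradient-descent logistic model. The feature names, bucket count, and learning rate are invented for illustration; this is not the system's actual implementation.

```python
import math
import zlib

def hash_feature(name, n_buckets=2 ** 20):
    """Map a sparse feature name to a fixed-size index (the hashing trick)."""
    return zlib.crc32(name.encode("utf-8")) % n_buckets

class OnlineLogistic:
    """Single-pass logistic regression trained by stochastic gradient descent."""

    def __init__(self, n_buckets=2 ** 20, lr=0.1):
        self.w = {}  # sparse weight vector: bucket index -> weight
        self.n_buckets = n_buckets
        self.lr = lr

    def predict(self, features):
        z = sum(self.w.get(hash_feature(f, self.n_buckets), 0.0) for f in features)
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, label):
        """Incremental update from one (features, label) example; label is 0 or 1."""
        p = self.predict(features)
        g = p - label  # gradient of the log-loss with respect to the logit
        for f in features:
            i = hash_feature(f, self.n_buckets)
            self.w[i] = self.w.get(i, 0.0) - self.lr * g

# The model learns as examples stream in; there is no separate training phase.
model = OnlineLogistic()
stream = [(["ad:sports", "hour:9"], 1), (["ad:finance", "hour:9"], 0)] * 50
for feats, y in stream:
    model.update(feats, y)
```

After the stream above, the model separates the two ad features while the shared `hour:9` feature stays near zero weight.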
This document provides an overview of NoSQL schema design and examples using a document database like MongoDB or MapR-DB. It discusses how to model complex, flexible schemas to store object-oriented data like products, users, and music catalog information. Examples show how a music database could be reduced from over 200 tables to just a few collections by embedding objects and references. Flexible schemas in a document database more closely match object models and allow easy evolution of the data model.
Dremio is a startup founded in 2015 by experts in big data and open source. It aims to provide a platform for interactive analysis across disparate data sources through a storage-agnostic and client-agnostic approach, leveraging Apache Arrow for high-performance in-memory columnar execution. Dremio uses Apache Drill as its query engine, allowing users to query data across different systems like HDFS, S3, and MongoDB through SQL as if they were a single relational database. Its extensible architecture allows new data sources to be added easily via plugins.
This document summarizes Patrick Pletscher's presentation on training large-scale ad ranking models in Apache Spark. It discusses using Spark to implement logistic regression for click-through rate prediction on billions of daily ad impressions at Yahoo. Key points include joining impression and click data, implementing an incremental learning architecture in Spark, using feature hashing and online learning algorithms like follow-the-regularized-leader for model training, and lessons learned around Spark configurations, accumulators, and RDDs vs DataFrames.
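The follow-the-regularized-leader approach mentioned above can be sketched per coordinate. Below is a minimal pure-Python sketch of the FTRL-proximal recipe popularized for click-through-rate prediction (McMahan et al.); the hyperparameter values and feature names are illustrative assumptions, and a real system would run such updates over hashed features inside Spark.

```python
import math

class FTRLProximal:
    """Per-coordinate FTRL-proximal update, sketched for binary features;
    alpha/beta/l1/l2 values here are illustrative, not tuned."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-feature accumulated adjusted gradients
        self.n = {}  # per-feature accumulated squared gradients

    def weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 keeps weights of weak features exactly zero (sparsity)
        n = self.n.get(i, 0.0)
        sign = -1.0 if z < 0 else 1.0
        return -(z - sign * self.l1) / ((self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, features):
        logit = sum(self.weight(i) for i in features)
        return 1.0 / (1.0 + math.exp(-logit))

    def update(self, features, y):
        p = self.predict(features)
        g = p - y  # log-loss gradient for a present (value 1) binary feature
        for i in features:
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self.weight(i)
            self.n[i] = n + g * g

model = FTRLProximal()
for _ in range(100):
    model.update(["ad:sports"], 1)   # clicked impression
    model.update(["ad:finance"], 0)  # unclicked impression
```

The L1 term is what makes FTRL attractive at this scale: features with too little evidence keep an exactly zero weight and need not be stored.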
The document discusses machine learning techniques for big data, including:
1) Various machine learning models like decision trees, linear models, neural networks and their assumptions.
2) Applications of machine learning like predictive modeling, clustering, personalization and optimization.
3) Key aspects of building machine learning systems like feature selection, model selection, evaluation and continuous adaptation.
This document discusses securing Spark applications. It covers encryption to protect data in transit and at rest, authentication using Kerberos to identify users, and authorization for access control through tools like Sentry and a proposed RecordService. While Spark can be secured today by leveraging Hadoop security, continued work is needed for easier encryption, improved Kerberos support for long-running jobs, and row/column-level authorization beyond file permissions.
Spark Streaming allows processing of live data streams using Spark. It works by dividing the data stream into batches called micro-batches, which are then processed using Spark's batch engine to generate RDDs. This allows for fault tolerance, exactly-once processing, and integration with other Spark APIs like MLlib and GraphX.
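As a conceptual sketch only (not Spark's implementation), micro-batching can be pictured as chunking an incoming stream and handing each chunk to a batch computation. For simplicity the sketch cuts batches by record count rather than by time interval, which is what Spark Streaming actually uses.

```python
def micro_batches(stream, batch_size):
    """Group an unbounded stream of records into small batches, the way
    Spark Streaming groups records arriving within each batch interval."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

# Each micro-batch is handed to a batch computation (here: a running word count).
counts = {}
for batch in micro_batches(["a", "b", "a", "c", "a"], batch_size=2):
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
```

Because each batch is processed by the ordinary batch engine, the same fault-tolerance and exactly-once guarantees of batch jobs carry over to the stream.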
Linear Regression Algorithm | Linear Regression in R | Data Science Training ... (Edureka!)
This Edureka Linear Regression tutorial will help you understand the basics of the linear regression machine learning algorithm, along with examples. It is ideal both for beginners and for professionals who want to learn or brush up on their Data Science concepts. Below are the topics covered in this tutorial:
1) Introduction to Machine Learning
2) What is Regression?
3) Types of Regression
4) Linear Regression Examples
5) Linear Regression Use Cases
6) Demo in R: Real Estate Use Case
You can also take a complete structured training, check out the details here: https://goo.gl/AfxwBc
The document discusses approaches for using deep learning with small datasets, including transfer learning techniques like fine-tuning pre-trained models, multi-task learning, and metric learning approaches for few-shot and zero-shot learning problems. It also covers domain adaptation techniques when labels are not available, as well as anomaly detection for skewed label distributions. Traditional models like SVM are suggested as initial approaches, with deep learning techniques applied if those are not satisfactory.
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le... (Simplilearn)
This document discusses support vector machines (SVM) and provides an example of using SVM for classification. It begins with common applications of SVM, such as face detection and image classification. It then gives an overview of SVM, explaining how it finds the optimal separating hyperplane between two classes by maximizing the margin between them. An example demonstrates SVM by classifying people as male or female based on height and weight data. It also discusses how kernels can be used to handle non-linearly separable data. The document concludes by showing an implementation of SVM on a zoo dataset to classify animals as crocodiles or alligators.
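To make the margin idea concrete, here is a minimal sketch of a linear SVM trained by sub-gradient descent on the hinge loss, a simple stand-in for the quadratic-programming solvers real SVM libraries use. The standardized height/weight values and the hyperparameters are invented for illustration.

```python
def train_linear_svm(data, epochs=200, lr=0.01, lam=0.01):
    """Sub-gradient descent on the regularized hinge loss max(0, 1 - y*(w.x + b))."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in data:  # y is +1 or -1
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:  # inside the margin: push out
                w[0] += lr * (y * x1 - lam * w[0])
                w[1] += lr * (y * x2 - lam * w[1])
                b += lr * y
            else:  # correctly classified outside the margin: only regularize
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

# Standardized (height, weight) values; +1 = male, -1 = female in the toy example.
data = [((1.0, 1.5), 1), ((0.5, 1.3), 1), ((-1.0, -1.0), -1), ((-1.2, -1.5), -1)]
w, b = train_linear_svm(data)
classify = lambda x1, x2: 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1
```

Only points on or inside the margin drive the updates, which mirrors the fact that the final hyperplane is determined by the support vectors alone.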
Scalable Learning Technologies for Big Data Mining (Gerard de Melo)
These are slides of a tutorial by Gerard de Melo and Aparna Varde presented at the DASFAA 2015 conference.
As data expands into big data, enhanced or entirely novel data mining algorithms often become necessary. The real value of big data is often only exposed when we can adequately mine and learn from it. We provide an overview of new scalable techniques for knowledge discovery. Our focus is on the areas of cloud data mining and machine learning, semi-supervised processing, and deep learning. We also give practical advice for choosing among different methods and discuss open research problems and concerns.
This document provides an introduction and overview of machine learning algorithms. It begins by discussing the importance and growth of machine learning. It then describes the three main types of machine learning algorithms: supervised learning, unsupervised learning, and reinforcement learning. Next, it lists and briefly defines ten commonly used machine learning algorithms including linear regression, logistic regression, decision trees, SVM, Naive Bayes, and KNN. For each algorithm, it provides a simplified example to illustrate how it works along with sample Python and R code.
The document describes the author's approach to building a machine learning pipeline for a Kaggle competition to predict product categories from tabular data. The pipeline includes: 1) Loading and processing the training, testing, and submission data, 2) Performing cross-validated model training and evaluation using algorithms like XGBoost, LightGBM and CatBoost, 3) Averaging the results to generate final predictions and create a submission file. The author aims to share details of algorithms, hardware performance, and results in subsequent blog posts.
The document provides an overview of machine learning, including definitions of machine learning, the differences between programming and machine learning, examples of machine learning applications, and descriptions of various machine learning algorithms and techniques. It discusses supervised learning methods like classification and regression. Unsupervised learning methods like clustering are also covered. The document outlines the machine learning process and provides cautions about machine learning.
Logistic Regression in R | Machine Learning Algorithms | Data Science Trainin... (Edureka!)
This Logistic Regression tutorial will give you a clear understanding of how a Logistic Regression machine learning algorithm works in R. Towards the end, in our demo we will be predicting which patients have diabetes using Logistic Regression! In this Logistic Regression tutorial you will understand:
1) The 5 Questions asked in Data Science
2) What is Regression?
3) Logistic Regression - What and Why?
4) How does Logistic Regression Work?
5) Demo in R: Diabetes Use Case
6) Logistic Regression: Use Cases
The document discusses overfitting and transformation-based learning. It begins by defining overfitting as when a machine learning model learns the details and noise in the training data too well, reducing its ability to generalize to new data. It then describes transformation-based learning as an error-driven machine learning method where rules are learned from annotated training data and can make use of context like surrounding parts of speech to tag words. The document provides an example of part-of-speech tagging using this method.
This document discusses big data and data science. It begins by questioning whether big data and data science are hype. It then discusses how companies can use data science techniques like A/B testing to continuously improve their online systems. The document provides examples of how recommender systems and anomaly detection can be improved using these techniques. It describes the data flow and challenges of training models offline and updating them online. It also discusses evaluating models offline and testing them online to identify the best performing models.
WEKA: Credibility - Evaluating What's Been Learned (weka Content)
- Training and test sets are used to measure classification success rates, with the test set being independent of the training set. The error rate on the training set is optimistic. Cross validation techniques like 10-fold stratified cross validation are used when data is limited.
- True success rates are predicted using properties of statistics and normal distributions. Confidence levels determine the range within which the true rate is expected to lie.
- Techniques like paired t-tests are used to statistically compare the performance of different algorithms or data mining methods. They determine if performance differences are statistically significant.
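A minimal sketch of the stratified k-fold idea, dealing each class round-robin into folds so that every fold preserves the overall class proportions (labels assumed to fit in memory; Weka's own implementation differs in details):

```python
from collections import defaultdict

def stratified_kfold(labels, k=10):
    """Yield (train_indices, test_indices) pairs whose test folds preserve
    the overall class proportions (round-robin dealing within each class)."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test

# 30 positives and 70 negatives: every test fold gets 3 positives and 7 negatives.
labels = [1] * 30 + [0] * 70
splits = list(stratified_kfold(labels, k=10))
```

Stratification matters precisely in the limited-data setting described above: with plain random folds, a rare class can vanish from a fold entirely and distort the error estimate.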
This document discusses various techniques for evaluating machine learning models and comparing their performance, including:
- Measuring error rates on separate test and training sets to avoid overfitting
- Using techniques like cross-validation, bootstrapping, and holdout validation when data is limited
- Comparing algorithms using statistical tests like paired t-tests
- Accounting for costs of different prediction outcomes in evaluation and model training
- Visualizing performance using lift charts and ROC curves to compare models
- The Minimum Description Length principle for selecting the model that best compresses the data
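The paired t-test comparison mentioned in the list above can be sketched directly on per-fold scores. The accuracies below are made-up numbers, and 2.262 is the standard two-sided 5% critical value for 9 degrees of freedom taken from t tables.

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """t statistic over paired per-fold differences between two learners."""
    d = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Made-up per-fold accuracies from the same 10 cross-validation folds.
algo_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80, 0.85, 0.79]
algo_b = [0.76, 0.77, 0.80, 0.75, 0.79, 0.74, 0.78, 0.77, 0.80, 0.76]
t = paired_t_statistic(algo_a, algo_b)
# Two-sided 5% critical value for 9 degrees of freedom is 2.262.
significant = abs(t) > 2.262
```

Pairing by fold removes the fold-to-fold variance that both learners share, which is why the paired test is more sensitive than comparing two independent mean accuracies.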
This document discusses classification and clustering techniques used in search engines. It covers classification tasks like spam detection, sentiment analysis, and ad classification. Naive Bayes and support vector machines are described as common classification approaches. Features, feature selection, and evaluation metrics for classifiers are also summarized.
Recommender Systems from A to Z – Model Training (Crossing Minds)
This second meetup will be about training different models for our recommender system. We will review the simple models we can build as a baseline. After that, we will present the recommender system as an optimization problem and discuss different training losses. We will mention linear models and matrix factorization techniques. We will end the presentation with a simple introduction to non-linear models and deep learning.
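As a toy sketch of the matrix-factorization idea framed as an optimization problem: stochastic gradient descent on the squared error of the observed ratings, with a small L2 penalty. All values and hyperparameters below are illustrative.

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.02, lam=0.02, epochs=500):
    """Factor a sparse rating matrix into user and item vectors by SGD on the
    squared error of the observed (user, item, rating) triples."""
    random.seed(0)
    U = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(U[u][f] * V[i][f] for f in range(k))
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - lam * uf)
                V[i][f] += lr * (err * uf - lam * vf)
    return U, V

# Two users with similar tastes rating three items (made-up values).
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.5), (1, 1, 1.5), (1, 2, 5.0)]
U, V = factorize(ratings, n_users=2, n_items=3)
predict = lambda u, i: sum(U[u][f] * V[i][f] for f in range(2))
```

The payoff is in the unobserved cells: user 0 never rated item 2, but because user 0's learned vector resembles user 1's, the model scores item 2 well above the item user 0 disliked.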
Feature extraction for classifying students based on their academic performance (Venkat Projects)
This document describes a project to classify student academic performance using machine learning algorithms. It extracts four features from a university dataset to label students as poor or good performers. These features identify failing, dropout, lower than expected grade, and lower grade with course difficulty students. It then applies SVM, Random Forest, Decision Tree, and Gradient Boosting algorithms. Decision Tree achieved the highest accuracy at 89% while Gradient Boosting had the best F1 score. The models are used to predict performance reasons for new student records.
1. The document discusses a data mining competition hosted by DonorsChoose.org to identify school donation projects that are exceptionally exciting. It describes the provided data files and classification algorithms used, including logistic regression, which performed best.
2. Extensive data preprocessing techniques were applied, including feature selection, handling null values, categorizing numeric features, and text feature extraction from project essays. Cross validation was used to evaluate models during development.
3. Logistic regression with the data divided into two parts for training performed best, achieving an area under the ROC curve of 0.69853 with optimized hyperparameters.
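The AUC metric used to score such competitions equals the probability that a randomly chosen positive example is ranked above a randomly chosen negative one. A minimal sketch of computing it, on toy labels and scores rather than the competition's data:

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive example is scored above a randomly chosen negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# One positive is ranked below one negative, so 8 of the 9 pairs are ordered
# correctly and the AUC is 8/9.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

This pairwise form makes clear why AUC of 0.5 means random ranking and why the metric is insensitive to the classification threshold.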
The learning method used by our approach is usually known in the artificial intelligence community as learning by observation, imitation learning, learning from demonstration, programming by demonstration, learning by watching or learning by showing. For consistency, learning by observation will be used from here on.
This document discusses educational data mining and various methods used in EDM. It begins with an introduction to EDM, defining it as an emerging discipline concerned with exploring unique data from educational settings to better understand students and learning environments. It then outlines several common classes of EDM methods including information visualization, web mining, clustering, classification, outlier detection, association rule mining, sequential pattern mining, and text mining. The rest of the document focuses on specific EDM methods like prediction, clustering, relationship mining, discovery with models, and distillation of data for human judgment. It provides examples and explanations of how these methods are used in EDM.
Data.Mining.C.6(II): Classification and Prediction (Margaret Wang)
The document summarizes different machine learning classification techniques including instance-based approaches, ensemble approaches, co-training approaches, and partially supervised approaches. It discusses k-nearest neighbor classification and how it works. It also explains bagging, boosting, and AdaBoost ensemble methods. Co-training uses two independent views to label unlabeled data. Partially supervised approaches can build classifiers using only positive and unlabeled data.
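A minimal sketch of the k-nearest-neighbor rule described here: Euclidean distance, majority vote among the k closest training points. The 2-D points and labels are toy data.

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Classify a query point by majority vote among its k nearest neighbors."""
    neighbors = sorted((math.dist(x, query), y) for x, y in train)[:k]
    votes = Counter(y for _, y in neighbors)
    return votes.most_common(1)[0][0]

# Toy 2-D points with two class labels.
train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
         ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b")]
```

As an instance-based method, all the work happens at query time: there is no training step beyond storing the examples, which is exactly the trade-off the summary's "instance-based approaches" refers to.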
This document discusses running Apache Spark and Apache Zeppelin in production. It begins by introducing the author and their background. It then covers security best practices for Spark deployments, including authentication using Kerberos, authorization using Ranger/Sentry, encryption, and audit logging. Different Spark deployment modes like Spark on YARN are explained. The document also discusses optimizing Spark performance by tuning executor size and multi-tenancy. Finally, it covers security features for Apache Zeppelin like authentication, authorization, and credential management.
This document discusses Spark security and provides an overview of authentication, authorization, encryption, and auditing in Spark. It describes how Spark leverages Kerberos for authentication and uses services like Ranger and Sentry for authorization. It also outlines how communication channels in Spark are encrypted and some common issues to watch out for related to Spark security.
Similar to Fast Distributed Online Classification
The document discusses the Virtual Data Connector project which aims to leverage Apache Atlas and Apache Ranger to provide unified metadata and access governance across data sources. Key points include:
- The project aims to address challenges of understanding, governing, and controlling access to distributed data through a centralized metadata catalog and policies.
- Apache Atlas provides a scalable metadata repository while Apache Ranger enables centralized access governance. The project will integrate these using a virtualization layer.
- Enhancements to Atlas and Ranger are proposed to better support the project's goals around a unified open metadata platform and metadata-driven governance.
- An initial minimum viable product will be built this year with the goal of an open, collaborative ecosystem around shared
This document discusses using a data science platform to enable digital diagnostics in healthcare. It provides an overview of healthcare data sources and Yale/YNHH's data science platform. It then describes the data science journey process using a clinical laboratory use case as an example. The goal is to use big data and machine learning to improve diagnostic reproducibility, throughput, turnaround time, and accuracy for laboratory testing by developing a machine learning algorithm and real-time data processing pipeline.
This document discusses using Apache Spark and MLlib for text mining on big data. It outlines common text mining applications, describes how Spark and MLlib enable scalable machine learning on large datasets, and provides examples of text mining workflows and pipelines that can be built with Spark MLlib algorithms and components like tokenization, feature extraction, and modeling. It also discusses customizing ML pipelines and the Zeppelin notebook platform for collaborative data science work.
This document compares the performance of Hive and Spark when running the BigBench benchmark. It outlines the structure and use cases of the BigBench benchmark, which aims to cover common Big Data analytical properties. It then describes sequential performance tests of Hive+Tez and Spark on queries from the benchmark using a HDInsight PaaS cluster, finding variations in performance between the systems. Concurrency tests are also run by executing multiple query streams in parallel to analyze throughput.
The document discusses modern data applications and architectures. It introduces Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. Hadoop provides massive scalability and easy data access for applications. The document outlines the key components of Hadoop, including its distributed storage, processing framework, and ecosystem of tools for data access, management, analytics and more. It argues that Hadoop enables organizations to innovate with all types and sources of data at lower costs.
This document provides an overview of data science and machine learning. It discusses what data science and machine learning are, including extracting insights from data and computers learning without being explicitly programmed. It also covers Apache Spark, which is an open source framework for large-scale data processing. Finally, it discusses common machine learning algorithms like regression, classification, clustering, and dimensionality reduction.
This document provides an overview of Apache Spark, including its capabilities and components. Spark is an open-source cluster computing framework that allows distributed processing of large datasets across clusters of machines. It supports various data processing workloads including streaming, SQL, machine learning and graph analytics. The document discusses Spark's APIs like DataFrames and its libraries like Spark SQL, Spark Streaming, MLlib and GraphX. It also provides examples of using Spark for tasks like linear regression modeling.
This document provides an overview of Apache NiFi and dataflow. It begins with an introduction to the challenges of moving data effectively within and between systems. It then discusses Apache NiFi's key features for addressing these challenges, including guaranteed delivery, data buffering, prioritized queuing, and data provenance. The document outlines NiFi's architecture and components like repositories and extension points. It also previews a live demo and invites attendees to further discuss Apache NiFi at a Birds of a Feather session.
Many Organizations are currently processing various types of data and in different formats. Most often this data will be in free form, As the consumers of this data growing it’s imperative that this free-flowing data needs to adhere to a schema. It will help data consumers to have an expectation of about the type of data they are getting and also they will be able to avoid immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the Data Pipeline a really easy way to integrate and support various systems that use different data formats.
SchemaRegistry is a central repository for storing, evolving schemas. It provides an API & tooling to help developers and users to register a schema and consume that schema without having any impact if the schema changed. Users can tag different schemas and versions, register for notifications of schema changes with versions etc.
In this talk, we will go through the need for a schema registry and schema evolution and showcase the integration with Apache NiFi, Apache Kafka, Apache Storm.
There is increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model could take hours. This is a problem when the model needs to be more up-to-date. For example, when recommending TV programs while they are being transmitted the model should take into consideration users who watch a program at that time.
The promise of online recommendation systems is fast adaptation to changes, but methods of online machine learning from streams is commonly believed to be more restricted and hence less accurate than batch trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system for uniting batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
DeepLearning is not just a hype - it outperforms state-of-the-art ML algorithms. One by one. In this talk we will show how DeepLearning can be used for detecting anomalies on IoT sensor data streams at high speed using DeepLearning4J on top of different BigData engines like ApacheSpark and ApacheFlink. Key in this talk is the absence of any large training corpus since we are using unsupervised machine learning - a domain current DL research threats step-motherly. As we can see in this demo LSTM networks can learn very complex system behavior - in this case data coming from a physical model simulating bearing vibration data. Once draw back of DeepLearning is that normally a very large labaled training data set is required. This is particularly interesting since we can show how unsupervised machine learning can be used in conjunction with DeepLearning - no labeled data set is necessary. We are able to detect anomalies and predict braking bearings with 10 fold confidence. All examples and all code will be made publicly available and open sources. Only open source components are used.
QE automation for large systems is a great step forward in increasing system reliability. In the big-data world, multiple components have to come together to provide end-users with business outcomes. This means, that QE Automations scenarios need to be detailed around actual use cases, cross-cutting components. The system tests potentially generate large amounts of data on a recurring basis, verifying which is a tedious job. Given the multiple levels of indirection, the false positives of actual defects are higher, and are generally wasteful.
At Hortonworks, we’ve designed and implemented Automated Log Analysis System - Mool, using Statistical Data Science and ML. Currently the work in progress has a batch data pipeline with a following ensemble ML pipeline which feeds into the recommendation engine. The system identifies the root cause of test failures, by correlating the failing test cases, with current and historical error records, to identify root cause of errors across multiple components. The system works in unsupervised mode with no perfect model/stable builds/source-code version to refer to. In addition the system provides limited recommendations to file/open past tickets and compares run-profiles with past runs.
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Us the platform appropriately (Not every problem is eligible for Big Data and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.
There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. Which means that, now, companies have new ways to transform their businesses – both operationally, and through their products and services – by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is “no” in most cases.
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotiy's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HSFS (HopsFS) and the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared nothing database. For YARN, we will discuss optimizations that enable 2X throughput increases for the Capacity scheduler, enabling scalability to clusters with >20K nodes. We will discuss the journey of how we reached this milestone, discussing some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open-source and supports a pluggable database backend for distributed metadata, although it currently only support MySQL Cluster as a backend. Hops opens up the potential for new directions for Hadoop when metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
While you could be tempted assuming data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like: "What happens if the entire datacenter fails?, or "How do I recover into a consistent state of data, so that applications can continue to run?" are not a all trivial to answer for Hadoop. Did you know that HDFS snapshots are handling open files not as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned are not (yet) important. This talk first is introducing you to the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, to finally show you a viable approach using built-in tools. You will also learn not to take this topic lightheartedly and what is needed to implement and guarantee a continuous operation of Hadoop cluster based solutions.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
Generating privacy-protected synthetic data using Secludy and MilvusZilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Tatiana Kojar
Skybuffer AI, built on the robust SAP Business Technology Platform (SAP BTP), is the latest and most advanced version of our AI development, reaffirming our commitment to delivering top-tier AI solutions. Skybuffer AI harnesses all the innovative capabilities of the SAP BTP in the AI domain, from Conversational AI to cutting-edge Generative AI and Retrieval-Augmented Generation (RAG). It also helps SAP customers safeguard their investments into SAP Conversational AI and ensure a seamless, one-click transition to SAP Business AI.
With Skybuffer AI, various AI models can be integrated into a single communication channel such as Microsoft Teams. This integration empowers business users with insights drawn from SAP backend systems, enterprise documents, and the expansive knowledge of Generative AI. And the best part of it is that it is all managed through our intuitive no-code Action Server interface, requiring no extensive coding knowledge and making the advanced AI accessible to more users.
This presentation provides valuable insights into effective cost-saving techniques on AWS. Learn how to optimize your AWS resources by rightsizing, increasing elasticity, picking the right storage class, and choosing the best pricing model. Additionally, discover essential governance mechanisms to ensure continuous cost efficiency. Whether you are new to AWS or an experienced user, this presentation provides clear and practical tips to help you reduce your cloud costs and get the most out of your budget.
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
5th LF Energy Power Grid Model Meet-up SlidesDanBrown980551
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Mircosoft Teams session or in person at TU/e located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid -Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframePrecisely
Inconsistent user experience and siloed data, high costs, and changing customer expectations – Citizens Bank was experiencing these challenges while it was attempting to deliver a superior digital banking experience for its clients. Its core banking applications run on the mainframe and Citizens was using legacy utilities to get the critical mainframe data to feed customer-facing channels, like call centers, web, and mobile. Ultimately, this led to higher operating costs (MIPS), delayed response times, and longer time to market.
Ever-changing customer expectations demand more modern digital experiences, and the bank needed to find a solution that could provide real-time data to its customer channels with low latency and operating costs. Join this session to learn how Citizens is leveraging Precisely to replicate mainframe data to its customer channels and deliver on their “modern digital bank” experiences.
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...Alex Pruden
Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol based on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme. The difficulty, is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security.
Paper Link: https://eprint.iacr.org/2024/257
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for life science domain, where you retriever information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Fast Distributed Online Classification
1. Fast Distributed Online Classification
Ram Sriharsha (Product Manager, Apache Spark, Databricks)
Prasad Chalasani (SVP Data Science, MediaMath)
13 April, 2016
2. Summary
We leveraged recent machine-learning research to develop a
- fast, practical,
- scalable (up to 100s of millions of sparse features),
- online,
- distributed (built on Apache Spark),
- single-pass
ML classifier that has significant advantages over most similar ML packages.
3. Key Conceptual Take-aways
- Supervised Machine Learning
- Online vs Batch Learning, and the importance of Online
- Challenges in online learning
- Distributed implementation in Spark
4. Supervised Machine Learning: Overview
Given:
- training data D: n labeled examples {(x1, y1), (x2, y2), ..., (xn, yn)}, where
  - xi is a k-dimensional feature-vector
  - yi is the label (0 or 1) that we want to predict
- an error (or loss) metric L(p, y) from predicting p when the true label is y.
Fix a family of functions fw(x) ∈ F that are parametrised by a weight-vector w.
Goal: find w that minimizes the average loss over D:
L(w) = (1/n) Σ_{i=1}^n Li(w) = (1/n) Σ_{i=1}^n L(fw(xi), yi).
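The average-loss objective above can be sketched in plain Python. This is a minimal illustration, not from the deck: the toy data, the linear scorer, and the squared-error loss are all placeholders chosen just to make the objective concrete.

```python
def average_loss(data, predict, loss):
    """Mean of loss(predict(x), y) over a labeled data-set D."""
    return sum(loss(predict(x), y) for x, y in data) / len(data)

# Toy example (assumed): two labeled examples, a linear scorer w.x,
# and a squared-error loss L(p, y) = (p - y)^2.
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
w = [0.4, -0.2]
predict = lambda x: sum(wi * xi for wi, xi in zip(w, x))
loss = lambda p, y: (p - y) ** 2

print(average_loss(data, predict, loss))  # average of 0.36 and 0.04
```

Minimizing this quantity over w is exactly the goal stated above; the rest of the deck specializes `loss` to the logistic loss.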
9. Logistic Regression
Logistic model: fw(x) = 1 / (1 + e^(-w·x)) (has a probability interpretation)
Loss function: Li(w) = -yi ln(fw(xi)) - (1 - yi) ln(1 - fw(xi))
Overall loss: L(w) = Σ_{i=1}^n Li(w)
L(w) is convex:
- no local minima
- differentiate and follow gradients
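The logistic model and its per-example loss can be written directly from the formulas above; a small self-contained sketch (function names are ours, not the deck's):

```python
import math

def f(w, x):
    """Logistic model fw(x) = 1 / (1 + e^(-w.x))."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def logistic_loss(w, x, y):
    """Li(w) = -y ln(fw(x)) - (1 - y) ln(1 - fw(x))."""
    p = f(w, x)
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

# With w = 0 the model outputs 0.5 for any x, and the loss is ln 2.
print(f([0.0, 0.0], [1.0, 2.0]))
print(logistic_loss([0.0, 0.0], [1.0, 2.0], 1))
```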
13. Gradient Descent
Basic idea:
- start with an initial guess of the weight-vector w
- at iteration t, update w to a new weight-vector w':
  w' = w - λ·gt
  where
  - gt is the (vector) gradient of L(w) w.r.t. w at time t,
  - λ is the learning rate.
16. Gradient Descent
gt = ∂L(w)/∂w = Σ_i ∂Li(w)/∂w = Σ_i gti
This is Batch Gradient Descent (BGD):
- to make one weight-update, compute the gradient over the entire training data-set
- repeat this until convergence
BGD is not scalable to large data-sets.
18. Online (Stochastic) Gradient Descent (SGD)
A drastic simplification:
Instead of computing the gradient based on the entire training data-set,

    g_t = Σ_i ∂Li(w)/∂w,

and doing an update w' = w - λ g_t ...
19. Online (Stochastic) Gradient Descent (SGD)
A drastic simplification: shuffle the data-set (if it is not naturally shuffled), compute the gradient based on a single example,

    g_ti = ∂Li(w)/∂w,

and do an update w' = w - λ g_ti.
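The SGD update for logistic regression has a particularly simple form: the per-example gradient of the log-loss is (p - y)·x, where p = f_w(x). A minimal sketch with a toy shuffled stream (the function name and data are illustrative):

```python
import math
import random

def sgd_step(w, x, y, lam=0.1):
    """One SGD update w' = w - lam * g_ti for logistic regression.
    The per-example log-loss gradient is (p - y) * x, with p = f_w(x)."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [wj - lam * (p - y) * xj for wj, xj in zip(w, x)]

# Toy separable stream: label is 1 iff the first feature is positive.
random.seed(0)
data = [([1.0, 1.0], 1), ([-1.0, 1.0], 0)] * 50
random.shuffle(data)

w = [0.0, 0.0]
for x, y in data:        # one fast update per example
    w = sgd_step(w, x, y)
# After the pass, w[0] has grown positive: the model has learned
# that the first feature predicts the label.
```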
21. Batch vs Online Gradient Descent
Batch:
- to make one step, compute the gradient w.r.t. the entire data-set
- extremely slow updates
- correct gradient
Online:
- to make one step, compute the gradient w.r.t. one example
- extremely fast updates
- not necessarily correct gradient
24. Batch vs Online Learning
Batch Learning:
- process a large training data-set, generate a model
- use the model to predict labels of the test data-set
Drawbacks:
- infeasible/impractically slow for large data-sets
- need to repeat the batch process to update the model with new data
30. Batch vs Online Learning
Online Learning:
- for each "training" example:
  - generate a prediction (score),
  - compare with the true label,
  - update the model (weights w)
- for each "test" example:
  - predict with the latest learned model (weights w)
38. Batch vs Online Learning
Online Learning benefits:
- does not pre-process the entire training data-set
- does not explicitly retain previously-seen examples
- extremely light-weight: space- and time-efficient
- no distinct "training" and "testing" phases:
  - incremental, continual learning
  - adapts to changing patterns
  - easily update an existing model with new data
  - better generalization to unseen observations.
39. The Online Learning Paradigm
As each labeled example (xi, yi) is seen:
- make a prediction given only the current weight-vector w
- update the weight-vector w
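The predict-then-update order matters: because each prediction is made before the model has seen that example's label, the running loss on those predictions ("progressive validation") estimates generalization without a held-out test set. A minimal sketch of the loop, assuming sparse examples as {feature index: value} maps (names are illustrative):

```python
import math

def online_learn(stream, lam=0.5):
    """Online learning paradigm for logistic regression:
    for each (x, y), first predict with the current w, then update w.
    Returns the final weights and the average progressive log-loss."""
    w = {}                 # sparse weight-vector: feature index -> weight
    total_loss, n = 0.0, 0
    for x, y in stream:    # x is a sparse map {index: value}
        z = sum(w.get(j, 0.0) * v for j, v in x.items())
        p = 1.0 / (1.0 + math.exp(-z))      # predict BEFORE updating
        q = min(max(p, 1e-12), 1 - 1e-12)   # clip for log
        total_loss += -y * math.log(q) - (1 - y) * math.log(1 - q)
        n += 1
        for j, v in x.items():              # then SGD update on this example
            w[j] = w.get(j, 0.0) - lam * (p - y) * v
    return w, total_loss / n
```

Nothing is retained between examples except the weight-vector itself, which is what makes the loop so light-weight.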
45. Online Learning: Use Scenarios
- extremely large data-sets where
  - batch learning is computationally infeasible/impractical, and
  - only a single pass over the data is possible.
- data arrives in real-time, and
  - decisions/predictions must be made quickly
  - the learned model needs to adapt quickly to recent observations.
50. Online Learning Example: Advertising (MediaMath)
Listen to 100 billion ad-opportunities daily from Ad Exchanges.
For each opportunity, need to predict whether the exposed user will buy, as a function of several features:
- hour_of_day, browser_type, geo_region, age, ...
Online learning benefits:
- fast update of the learned model to reflect the latest observations
- light-weight models are extremely quick to compute
58. Online Learning: Feature Scaling
Example from the wearable-devices domain:
- feature 1 = heart-rate, range 40 to 200
- feature 2 = step-count, range 0 to 500,000
Extreme scale differences ⇒ convergence problems.
Convergence is much faster when features are on the same scale:
- normalize each feature by dividing by its maximum possible value.
60. Online Learning: Feature Scaling
But often:
- the ranges of features are not known in advance, and
- we cannot make a separate pass over the data to find the ranges.
⇒ Need single-pass algorithms that adaptively normalize features with each new observation.
[Ross, Mineiro, Langford 2013] proposed such an algorithm, which we implemented in our online ML system.
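To illustrate the single-pass idea (this is a simplification, not the Ross-Mineiro-Langford algorithm, which additionally corrects the weights when a scale estimate grows): track the largest magnitude seen so far per feature and divide by it as each example streams past.

```python
def scaled_stream(stream):
    """Single-pass adaptive scaling sketch: maintain a running
    max |value| per feature and normalize each example by the
    max seen SO FAR. No separate pass over the data is needed."""
    max_abs = {}   # feature index -> largest |value| seen so far
    for x, y in stream:
        for j, v in x.items():
            max_abs[j] = max(max_abs.get(j, 0.0), abs(v))
        yield ({j: v / max_abs[j] for j, v in x.items() if max_abs[j] > 0}, y)
```

Early examples are scaled by an underestimate of the true range, which is the price of a single pass; the estimate converges as more data is seen.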
65. Online Learning: Feature Frequency Differences
Some sparse features occur much more frequently than others, e.g.:
- categorical feature country with 200 values,
  - encoded as a vector of length 200 with exactly one entry = 1, and the rest 0
  - country=USA may occur much more often than country=Belgium
- indicator feature visited_site = 1 much more often than purchased = 1.
69. Online Learning: Feature Frequency Differences
Often, rare features are much more predictive than frequent features.
The same learning rate for all features ⇒ slow convergence.
⇒ Rare features should have larger learning rates:
- bigger steps whenever a rare feature is seen
- much faster convergence
Effectively, the algorithm pays more attention to rare features, enabling it to find rare but predictive features.
ADAGRAD is an algorithm for this [Duchi, Hazan, Singer 2010], and we implemented it in our learning system.
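The core of AdaGrad is a per-feature learning rate: each feature's step is divided by the square root of the sum of its squared past gradients, so rarely-seen features (small accumulated gradient) take larger steps. A minimal sketch for sparse logistic regression (function name and constants are illustrative):

```python
import math

def adagrad_update(w, g2, x, y, lam=0.5):
    """AdaGrad-style update for one sparse example x ({index: value}).
    g2 accumulates the squared gradients per feature; the effective
    learning rate for feature j is lam / sqrt(g2[j])."""
    z = sum(w.get(j, 0.0) * v for j, v in x.items())
    p = 1.0 / (1.0 + math.exp(-z))
    for j, v in x.items():
        g = (p - y) * v                       # per-feature log-loss gradient
        g2[j] = g2.get(j, 0.0) + g * g        # accumulate squared gradient
        w[j] = w.get(j, 0.0) - lam * g / (math.sqrt(g2[j]) + 1e-8)
    return w, g2
```

A feature's very first update takes a full-size step (the gradient is divided by its own magnitude); subsequent steps shrink as evidence accumulates, which is exactly the "pay more attention to rare features" behavior.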
79. Online Learning: Sparse Features
E.g. site_domain has a large (unknown) set of possible values:
- google.com, yahoo.com, cnn.com, ...
- need to encode (conceptually) as 1-hot vectors, e.g.
  - google.com = (1, 0, 0, 0, ...)
  - yahoo.com = (0, 1, 0, 0, ...)
  - cnn.com = (0, 0, 1, 0, ...)
  - ...
- all possible values are not known in advance
- cannot pre-process the data to find all possible values
- don't want to encode explicit (long) vectors
80. Online Learning: Sparse Features, Hashing Trick
E.g. observation:
- country = "china" (categorical)
- age = 32 (numerical)
- domain = "google.com" (categorical)
81. Online Learning: Sparse Features, Hashing Trick
Hash the feature-names:
- hash("country_china") = 24378
- hash("age") = 32905
- hash("domain_google.com") = 84395
84. Online Learning: Sparse Features, Hashing Trick
Represent the observation as a (special) Map:
    {24378 → 1.0, 32905 → 32.0, 84395 → 1.0}
Sparse representation (no explicit vectors).
No need for a separate pass over the data (unlike Spark MLlib).
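A sketch of the hashing trick in Python: hash each feature-name (for categoricals, name_value) into a fixed index range, and keep only the non-zero entries in a map. This uses md5 for illustration; a production system would use a faster non-cryptographic hash (e.g. Murmur), and the bucket count is an arbitrary choice here.

```python
import hashlib

NUM_BUCKETS = 2 ** 24   # fixed index range; collisions are rare and tolerated

def hash_feature(name):
    """Map a feature-name to a stable index without knowing the
    set of possible values in advance."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def encode(obs):
    """Encode an observation as a sparse map {hashed index: value}.
    Categorical features become name_value -> 1.0 (conceptual 1-hot);
    numerical features keep their value."""
    out = {}
    for name, value in obs.items():
        if isinstance(value, str):
            out[hash_feature(f"{name}_{value}")] = 1.0   # categorical
        else:
            out[hash_feature(name)] = float(value)       # numerical
    return out

# e.g. encode({"country": "china", "age": 32, "domain": "google.com"})
# yields a three-entry sparse map, with no pre-pass over the data.
```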
87. Distributed Online Logistic Regression
Stochastic Gradient Descent (SGD) is inherently sequential:
- how to parallelize?
Our (Scala) implementation in Apache Spark:
- randomly re-partition the training data into shards
- use SGD to learn a model for each shard
- average the models using TreeReduce (~ "AllReduce")
- leverages Spark/Hadoop fault-tolerance.
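The parallelization scheme above (shard, learn per-shard, average) can be sketched without Spark. In the sketch below the shards are processed sequentially for self-containment; on Spark each shard would be a partition trained on an executor, with TreeReduce performing the averaging. Names and the toy data are illustrative.

```python
import math
import random

def sgd_pass(shard, lam=0.5):
    """Single sequential SGD pass over one shard; returns its model."""
    w = [0.0, 0.0]
    for x, y in shard:
        z = sum(wj * xj for wj, xj in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))
        w = [wj - lam * (p - y) * xj for wj, xj in zip(w, x)]
    return w

def parallel_sgd(data, num_shards=4):
    """Randomly re-partition the data into shards, learn a model per
    shard independently (parallel on Spark), then average the models
    (what TreeReduce / AllReduce computes across executors)."""
    random.shuffle(data)                               # random re-partition
    shards = [data[i::num_shards] for i in range(num_shards)]
    models = [sgd_pass(shard) for shard in shards]
    k = len(models[0])
    return [sum(m[j] for m in models) / len(models) for j in range(k)]
```

Each shard's SGD remains sequential, so the scheme sidesteps SGD's inherent sequentiality by paying a single averaging step at the end.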
89. Slider
Fast, distributed, online, single-pass learning system.
- written in Scala on top of Spark
- works directly with Spark DataFrames
- usable as a library within other JVM systems
- leverages Spark/Hadoop fault-tolerance
- Stochastic Gradient Descent
- online feature-scaling/normalization
- adaptive (per-feature) learning-rates
- single-pass
- hashing-trick to encode sparse features
90. Slider vs Vowpal Wabbit (VW) vs Spark-ML (SML)
Fast, distributed, online, single-pass learning system.
- written in Scala on top of Spark (SML)
- works directly with Spark DataFrames (SML)
- usable as a library within other JVM systems (SML)
- leverages Spark/Hadoop fault-tolerance (SML)
- Stochastic Gradient Descent (SGD) (VW, SML)
- online feature-scaling/normalization (VW)
- adaptive (per-feature) learning-rates (VW)
- single-pass (VW, SML)
- hashing-trick to encode sparse features (VW)
96. Slider vs Spark ML
Task: predict conversion probability from ad-impression features
- 14M impressions from 1 ad campaign
- 17 categorical features, 2 numerical features
- train on the first 80%, test on the remaining 20%
Spark ML (using Pipelines):
- makes 17 passes over the data: one for each categorical feature
- trains and scores in 40 minutes
- need to specify iterations, etc.
- AUC = 0.52 on test data
97. Slider vs Spark ML
Same task. Slider:
- makes just one pass over the data
- trains and scores in 5 minutes
- no tuning needed
- AUC = 0.68 on test data
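For reference, the AUC reported above is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one (so 0.5 is random and 1.0 is perfect ranking). A direct, O(pos·neg) sketch of that definition:

```python
def auc(scores, labels):
    """AUC = P(score of a random positive > score of a random negative),
    counting ties as 1/2. Brute-force over all positive/negative pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Real evaluators sort once and use ranks for O(n log n), but the pairwise form makes the metric's meaning explicit.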
98. Other Work
- online version of k-means clustering
- FTRL algorithm (a regularized alternative to SGD)
Ongoing/Future:
- online learning with Spark Streaming
- benchmarking vs other ML systems