This document provides an overview of machine learning and feature engineering. It discusses how machine learning can be used for tasks like classification, regression, similarity matching, and clustering. It explains that feature engineering involves transforming raw data into numeric representations called features that machine learning models can use. Different techniques for feature engineering text and images are presented, such as bag-of-words and convolutional neural networks. Dimensionality reduction through principal component analysis is demonstrated. Finally, information is given about upcoming machine learning tutorials and Dato's machine learning platform.
Feature Engineering for Machine Learning at QConSP (Amanda Casari)
Machine learning fits mathematical models to data to derive insights or make predictions. Engineering the features that sit between data and models is a crucial step in the machine learning pipeline, because the right features can ease the difficulty of modeling and enable higher-quality results. In this talk, we will dive deeper into the mechanisms behind popular feature engineering techniques and walk through cases where these techniques are most useful. You will be able to better identify which methods to use based on your data and the problem you are working to solve.
Feature engineering--the underdog of machine learning. This deck provides an overview of feature generation methods for text, image, and audio data, as well as feature cleaning and transformation methods, how well they work, and why.
Valencian Summer School 2015
Day 1
Lecture 5
Data Transformation and Feature Engineering
Charles Parker (Alston Trading)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
Intuitive introduction with easy-to-understand explanation of fundamental concepts in machine learning and neural networks. No prior machine learning or computing experience required.
Data Science, Machine Learning and Neural Networks (BICA Labs)
Lecture briefly overviewing state of the art of Data Science, Machine Learning and Neural Networks. Covers main Artificial Intelligence technologies, Data Science algorithms, Neural network architectures and cloud computing facilities enabling the whole stack.
Utilizing additional information in factorization methods (research overview,... (Balázs Hidasi)
This presentation contains the main points of my recommender systems related research. It describes the arc of my research, starting from improving matrix factorization, through the development of my context-aware algorithms and addressing scalability issues, to developing a general factorization framework and dealing with context dimension modeling. The slides were presented at the Delft University of Technology, where I was invited to give this introductory talk as part of the collaboration between participants of the CrowdRec project. The presentation was given on 11th April 2014.
With the explosive growth of online information, recommender systems have become an effective tool to overcome information overload and promote sales. In recent years, deep learning's revolutionary advances in speech recognition, image analysis and natural language processing have gained significant attention. Meanwhile, recent studies also demonstrate its efficacy in coping with information retrieval and recommendation tasks. Applying deep learning techniques to recommender systems has been gaining momentum due to its state-of-the-art performance. In this talk, I will present recent developments of deep learning based recommender models and highlight some future challenges and open issues of this research field.
Deep learning to the rescue - solving long standing problems of recommender ... (Balázs Hidasi)
I gave this talk at the 1st Budapest RecSys and Personalization Meetup about using deep learning to solve long-standing problems of recommender systems. I also presented our approach to using RNNs for session-based recommendations in detail.
Machine Learning has become a must to improve insight, quality and time to market. But it's also been called the 'high interest credit card of technical debt' with challenges in managing both how it's applied and how its results are consumed.
Context-aware preference modeling with factorization (Balázs Hidasi)
This talk was presented at the Doctoral Symposium of RecSys'15. It is a summary of the core part of my PhD research in the last few years. The research revolves around solving the implicit feedback based context-aware recommendation problem with factorization.
Associated paper: http://dl.acm.org/citation.cfm?id=2796543
Details of presented algorithms/methods (public versions available on http://hidasi.eu):
iTALS: http://link.springer.com/chapter/10.1007/978-3-642-33486-3_5
iTALSx: http://www.infocommunications.hu/documents/169298/1025723/InfocomJ_2014_4_5_Hidasi.pdf
ALS-CG/CD: http://link.springer.com/article/10.1007/s10115-015-0863-2
GFF: http://link.springer.com/article/10.1007/s10618-015-0417-y
Parallel Recurrent Neural Network Architectures for Feature-rich Session-base... (Balázs Hidasi)
Slides for my RecSys 2016 talk on integrating image and textual information into session-based recommendations using novel parallel RNN architectures.
Link to the paper: http://www.hidasi.eu/en/publications.html#p_rnn_recsys16
Machine Learning, AI, Deep Learning, Statistics, Data Mining… in short, all of these are today's buzzwords, but what is hiding behind them?
Through concrete examples, we will go over the different approaches to Machine Learning and the main families of algorithms (don't worry: without diving into the core of their implementations), then the tools and frameworks available to Data Scientists… and finally, we will try to predict the future!
Salon Data - Nantes - 19 September 2017
https://salondata.fr/2017/07/12/0930-1030-ml/
The term Machine Learning was coined in 1959 by Arthur Samuel, an American pioneer in the fields of computer gaming and artificial intelligence, who stated that it "gives computers the ability to learn without being explicitly programmed." In 1997, Tom Mitchell gave a "well-posed" mathematical and relational definition: "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."
Machine learning is needed for tasks that are too complex for humans to code directly. Instead, we provide a large amount of data to a machine learning algorithm and let the algorithm work it out by exploring that data and searching for a model that achieves what the programmers have set out to achieve.
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec... (Balázs Hidasi)
Slides of my presentation at CIKM2018 about version 2 of the GRU4Rec algorithm, a recurrent neural network based algorithm for the session-based recommendation task.
We discuss sampling strategies and introduce additional sampling to the algorithm. We also redesign the loss function to cope with the additional sampling. The resulting BPR-max loss function is able to efficiently handle many negative samples without encountering the vanishing gradient problem. We also introduce constrained embeddings, which speed up the convergence of item representations and reduce memory usage by a factor of 4. These improvements increase offline measures by up to 52%.
In the talk we also discuss an online A/B test and the implications of long-term observations. Most of these observations are exclusive to this talk and are not in the paper.
You can access the preprint version of the paper on arXiv: https://arxiv.org/abs/1706.03847
The code is available on GitHub: https://github.com/hidasib/GRU4Rec
Aaron Roth, Associate Professor, University of Pennsylvania, at MLconf NYC 2017 (MLconf)
Aaron Roth is an Associate Professor of Computer and Information Sciences at the University of Pennsylvania, affiliated with the Warren Center for Network and Data Science, and co-director of the Networked and Social Systems Engineering (NETS) program. Previously, he received his PhD from Carnegie Mellon University and spent a year as a postdoctoral researcher at Microsoft Research New England. He is the recipient of a Presidential Early Career Award for Scientists and Engineers (PECASE) awarded by President Obama in 2016, an Alfred P. Sloan Research Fellowship, an NSF CAREER award, and a Yahoo! ACE award. His research focuses on the algorithmic foundations of data privacy, algorithmic fairness, game theory and mechanism design, learning theory, and the intersections of these topics. Together with Cynthia Dwork, he is the author of the book “The Algorithmic Foundations of Differential Privacy.”
Abstract Summary:
Differential Privacy and Machine Learning:
In this talk, we will give a friendly introduction to Differential Privacy, a rigorous methodology for analyzing data subject to provable privacy guarantees, that has recently been widely deployed in several settings. The talk will specifically focus on the relationship between differential privacy and machine learning, which is surprisingly rich. This includes both the ability to do machine learning subject to differential privacy, and tools arising from differential privacy that can be used to make learning more reliable and robust (even when privacy is not a concern).
Erik Bernhardsson is the CTO at Better, a small startup in NYC working with mortgages. Before Better, he spent five years at Spotify managing teams working with machine learning and data analytics, in particular music recommendations.
Abstract Summary:
Nearest Neighbor Methods And Vector Models: Vector models are being used in a lot of different fields: natural language processing, recommender systems, computer vision, and others. They are fast and convenient and are often state of the art in terms of accuracy. One of the challenges with vector models is that as the number of dimensions increases, finding similar items gets challenging. Erik developed a library called "Annoy" that uses a forest of random trees to do fast approximate nearest neighbor queries in high-dimensional spaces. We will cover some specific applications of vector models and how Annoy works.
Deep Learning and Reinforcement Learning summer schools summary
26th June-6th July 2017, Montreal, Quebec
Things I learned. What was your favourite lesson?
Recommendation system using collaborative deep learning (Ritesh Sawant)
Collaborative filtering (CF) is a successful approach commonly used by many recommender systems. Conventional CF-based methods use the ratings given to items by users as the sole source of information for learning to make recommendation. However, the ratings are often very sparse in many applications, causing CF-based methods to degrade significantly in their recommendation performance. To address this sparsity problem, auxiliary information such as item content information may be utilized. Collaborative topic regression (CTR) is an appealing recent method taking this approach which tightly couples the two components that learn from two different sources of information. Nevertheless, the latent representation learned by CTR may not be very effective when the auxiliary information is very sparse. To address this problem, we generalize recent advances in deep learning from i.i.d. input to non-i.i.d. (CF-based) input and propose in this paper a hierarchical Bayesian model called collaborative deep learning (CDL), which jointly performs deep representation learning for the content information and collaborative filtering for the ratings (feedback) matrix. Extensive experiments on three real-world datasets from different domains show that CDL can significantly advance the state of the art.
Hanie Sedghi, Research Scientist at Allen Institute for Artificial Intelligen... (MLconf)
Hanie Sedghi is a Research Scientist at Allen Institute for Artificial Intelligence (AI2). Her research interests include large-scale machine learning, high-dimensional statistics and probabilistic models. More recently, she has been working on inference and learning in latent variable models. She has received her Ph.D. from University of Southern California with a minor in Mathematics in 2015. She was also a visiting researcher at University of California, Irvine working with professor Anandkumar during her Ph.D. She received her B.Sc. and M.Sc. degree from Sharif University of Technology, Tehran, Iran.
Abstract summary
Beating Perils of Non-convexity: Guaranteed Training of Neural Networks using Tensor Methods:
Neural networks have revolutionized performance across multiple domains such as computer vision and speech recognition. However, training a neural network is a highly non-convex problem and the conventional stochastic gradient descent can get stuck in spurious local optima. We propose a computationally efficient method for training neural networks that also has guaranteed risk bounds. It is based on tensor decomposition which is guaranteed to converge to the globally optimal solution under mild conditions. We explain how this framework can be leveraged to train feedforward and recurrent neural networks.
Recently I gave a talk at UC Berkeley regarding the transition from academia to industry in the context of Machine Learning and Data Science related roles. I based most of my slides on my own transition from being an Astrophysicist to a Machine Learning Expert. I hope this will be useful to many. Feedback is welcome!
[GAN by Hung-yi Lee] Part 2: The application of GAN to speech and text processing (NAVER Engineering)
Generative Adversarial Network and its Applications on Speech and Natural Language Processing, Part 2.
Speaker: Hung-yi Lee (Professor, National Taiwan University)
Date: July 2018
Generative adversarial network (GAN) is a new idea for training models, in which a generator and a discriminator compete against each other to improve the generation quality. Recently, GAN has shown amazing results in image generation, and a large amount and a wide variety of new ideas, techniques, and applications have been developed based on it. Although there are only a few successful cases, GAN has great potential to be applied to text and speech generation to overcome limitations in the conventional methods.
In the first part of the talk, I will give an introduction to GAN and provide a thorough review of this technology. In the second part, I will focus on the applications of GAN to speech and natural language processing. I will demonstrate the applications of GAN to voice conversion, unsupervised abstractive summarization, and sentiment-controllable chat-bots. I will also talk about the research directions towards unsupervised speech recognition by GAN.
Presented at http://dsatlconf.com/
Demonstrating how you can run Machine Learning inside the database and be several orders of magnitude more efficient. We also talk about how you can build Machine Learning models with Probabilistic Rules inside the database.
In this Lunch & Learn session, Chirag Jain gives us a friendly & gentle introduction to Machine Learning & walks through High-Level Learning frameworks using Linear Classifiers.
Introduction to Applied Machine Learning in R (Brian Spiering)
Introduction to Applied Machine Learning in R
by Dr. Brian J. Spiering
What is applied machine learning? Machine learning is programming computers to automatically learn and generalize from examples. It allows for predictions or decisions beyond explicit programming of rules. Applying machine learning is a force multiplier for analysts, given the rise in the volume and velocity of data. This talk will cover the basic, practical applications of machine learning in common contexts using the R programming language.
Presented at Bay Area Entrepreneurs in Statistics
A short presentation for beginners introducing Machine Learning: what it is, how it works, the popular Machine Learning techniques and learning models (supervised, unsupervised, semi-supervised, reinforcement learning), and how they work, with various industry use-cases and popular examples.
Facets and Pivoting for Flexible and Usable Linked Data Exploration (Roberto García)
The success of Open Data initiatives has increased the amount of data available on the Web. Unfortunately, most of this data is only available in raw tabular form, which makes analysis and reuse quite difficult for non-experts. Linked Data principles allow for a more sophisticated approach by making explicit both the structure and semantics of the data. However, from the end-user viewpoint, the datasets continue to be monolithic files that are completely opaque or difficult to explore without tedious semantic queries. Our objective is to help the user grasp what kind of entities are in the dataset, how they are interrelated, which are their main properties and values, etc. Rhizomer is a tool for data publishing whose interface provides a set of components borrowed from Information Architecture (IA) that facilitate awareness of the dataset at hand. It automatically generates navigation menus and facets based on the kinds of things in the dataset and how they are described through metadata properties and values. Moreover, motivated by recent tests with end-users, it also provides the possibility to pivot among the faceted views created for each class of resources in the dataset.
Data Science in Industry - Applying Machine Learning to Real-world Challenges (Yuchen Zhao)
This slide deck gives an introduction to data science, focusing on the three most common tasks: regression, classification and clustering. Each task comes with a real-world data science project to illustrate the concepts. This presentation was initially created for a one-hour guest lecture at Utah State University for teaching and education purposes.
The Incredible Disappearing Data Scientist (Rebecca Bilbro)
The last decade saw advances in compute power combine with an avalanche of open source software development, resulting in a revolution in machine learning and scalable analytics. “Data science” and “data product” are now household terms. This led to a new job description, the Data Scientist, which quickly became one of the most significant, exciting, and misunderstood jobs of the 21st century. One part statistician, one part computer scientist, and one part domain expert, data scientists seem poised to become the most pivotal value creators of the information age. And yet, danger (supposedly) lies ahead: human decisions are increasingly outsourced to algorithms of questionable ethical design; we’re putting everything on the blockchain; and perhaps most disturbingly, data science salaries are dropping precipitously as new graduates and Machine Learning as a Service (MLaaS) offerings flood the market. As we move into a future where predictive analytics is no longer a differentiator but instead a core business function, will data scientists proliferate or be automated out of a job?
In this talk, one humble data scientist attempts to cut through the hype to present an alternate vision of what data science is and can become. If not the “Sexiest Job of the 21st Century" as the Harvard Business Review once quipped, what is it like to be a workaday data scientist? What problems are we solving? How do we integrate with mature engineering teams? How do we engage with clients and product owners? How do we deploy non-deterministic models in production? In particular, we’ll examine critical integration points — technological and otherwise — we are currently tackling, which will ultimately determine our success, and our viability, over the next 10 years.
Data Workflows for Machine Learning - Seattle DAML (Paco Nathan)
First public meetup at Twitter Seattle, for Seattle DAML:
http://www.meetup.com/Seattle-DAML/events/159043422/
We compare/contrast several open source frameworks which have emerged for Machine Learning workflows, including KNIME, IPython Notebook and related Py libraries, Cascading, Cascalog, Scalding, Summingbird, Spark/MLbase, MBrace on .NET, etc. The analysis develops several points for "best of breed" and what features would be great to see across the board for many frameworks... leading up to a "scorecard" to help evaluate different alternatives. We also review the PMML standard for migrating predictive models, e.g., from SAS to Hadoop.
1. Overview of Machine Learning & Feature Engineering
Machine Learning 101 Tutorial
Strata + Hadoop World, NYC, Sep 2015
Alice Zheng, Dato
2. About us
Chris DuBois: Intro to recommenders
Alice Zheng: Overview of ML
Piotr Teterwak: Intro to image search & deep learning
Krishna Sridhar: Deploying ML as a predictive service
Danny Bickson: TA
Alon Palombo: TA
20. Machine learning … how?
Data → Answers
"I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, …"
Many systems, many tools, many teams, lots of methods/jargon
21. The machine learning pipeline
"I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, …"
Raw data → Features → Models → Predictions → Deploy in production
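The pipeline can be made concrete with a small sketch (not from the deck; the documents and labels below are made up) that wires raw text, features, a model, and predictions together with scikit-learn:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy raw data: three tiny "documents" with labels (1 = about puppies).
docs = [
    "I fell in love the instant I laid my eyes on that puppy",
    "his big eyes and playful tail, his soft furry paws",
    "the quarterly report is due next week",
]
labels = [1, 1, 0]

pipeline = Pipeline([
    ("features", CountVectorizer()),    # raw data -> features
    ("model", LogisticRegression()),    # features -> model
])
pipeline.fit(docs, labels)                        # train
print(pipeline.predict(["such a cute puppy"]))    # predictions on new raw data
```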
22. Three things to know about ML
• Feature = numeric representation of raw data
• Model = mathematical "summary" of features
• Making something that works = choose the right model and features, given data and task
24. Representing natural text
Raw text: "It is a puppy and it is extremely cute."
What's important? Phrases? Specific words? Ordering? Subject, object, verb?
Classify: puppy or not?
Bag of Words: {"it": 2, "is": 2, "a": 1, "puppy": 1, "and": 1, "extremely": 1, "cute": 1}
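A bag-of-words count like the one on this slide can be sketched in a few lines of Python (not part of the original deck):

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lower-case, pull out word tokens, and count occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

print(bag_of_words("It is a puppy and it is extremely cute."))
# Counter({'it': 2, 'is': 2, 'a': 1, 'puppy': 1, 'and': 1, 'extremely': 1, 'cute': 1})
```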
25. Representing natural text
Raw text: "It is a puppy and it is extremely cute." Classify: puppy or not?
Bag of Words as a sparse vector over a fixed vocabulary:
it 2, they 0, I 1, am 0, how 0, puppy 1, and 1, cat 0, aardvark 0, cute 1, extremely 1, …
Sparse vector representation
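To get a fixed-vocabulary sparse vector like the one above, one option (an illustration, not the tooling used in the deck) is scikit-learn's CountVectorizer with an explicit vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

vocabulary = ["it", "they", "am", "how", "puppy", "and",
              "cat", "aardvark", "cute", "extremely"]
vectorizer = CountVectorizer(vocabulary=vocabulary)

# transform() returns a SciPy sparse matrix: only non-zero counts are stored.
X = vectorizer.transform(["It is a puppy and it is extremely cute."])
print(dict(zip(vocabulary, X.toarray()[0])))
# words absent from the sentence ("they", "cat", "aardvark", ...) stay at 0
```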
26. Representing images
Raw image: millions of RGB triplets, one for each pixel
Classify: person or animal?
Raw Image → Bag of Visual Words
Image source: "Recognizing and learning object categories," Li Fei-Fei, Rob Fergus, Antonio Torralba, ICCV 2005-2009.
27. Representing images
Classify: person or animal?
Raw Image → Deep learning features
Example dense feature vector: 3.29, -15, -5.24, 48.3, 1.36, 47.1, -1.92, 36.5, 2.83, 95.4, -19, -89, 5.09, 37.8
Dense vector representation
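The deck uses Dato's deep learning tooling to produce these features; as a rough, hedged sketch of the same idea with a different library (a pretrained CNN whose penultimate-layer activations serve as a dense feature vector), using torchvision with an arbitrary model choice and a hypothetical input file:

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Pretrained CNN; ResNet-18 is an arbitrary choice for illustration.
cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
cnn.fc = torch.nn.Identity()   # drop the classifier head, keep the 512-d features
cnn.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("puppy.jpg").convert("RGB")      # hypothetical image file
with torch.no_grad():
    features = cnn(preprocess(img).unsqueeze(0))  # dense vector, shape (1, 512)
print(features.shape)
```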
28. Feature space in machine learning
• Raw data → high-dimensional vectors
• Collection of data points → point cloud in feature space
• Feature engineering = creating features of the appropriate granularity for the task
29. "Crudely speaking, mathematicians fall into two categories: the algebraists, who find it easiest to reduce all problems to sets of numbers and variables, and the geometers, who understand the world through shapes."
-- Masha Gessen, "Perfect Rigor"
34. Why are we looking at spheres?
[Figure: several everyday objects without holes shown as "equivalent" to a sphere]
Poincaré Conjecture: every physical object without holes is "equivalent" to a sphere.
35. The power of higher dimensions
• A sphere in 4D can model the birth and death process of physical objects
• High-dimensional features can model many things
37. The challenge of high-dimensional geometry
• Feature space can have hundreds to millions of dimensions
• In high dimensions, our geometric imagination is limited - algebra comes to our aid
38. Visualizing bag-of-words
[Figure: the sentence "I have a puppy and it is extremely cute" plotted as the point (1, 1) in a 2-D feature space with axes "puppy" and "cute"]
Full sparse vector: it 1, they 0, I 1, am 0, how 0, puppy 1, and 1, cat 0, aardvark 0, zebra 0, cute 1, extremely 1, …
47. When does bag-of-words fail?
[Figure: the documents "I have a puppy", "I have a cat", "I have a kitten", and "I have a dog and I have a pen" plotted in a 3-D feature space with axes "puppy", "cat", and "have"; the last document sits at have = 2]
Task: find a surface that separates documents about dogs vs. cats
Problem: the word "have" adds fluff instead of information
48. Improving on bag-of-words
• Idea: "normalize" word counts so that popular words are discounted
• Term frequency (tf) = number of times a term appears in a document
• Inverse document frequency of a word (idf) = log(N / number of documents containing the word)
• N = total number of documents
• Tf-idf count = tf x idf
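A minimal sketch (not from the deck) of computing these tf-idf weights in plain Python, using the toy corpus from the next slides:

```python
import math
from collections import Counter

docs = [
    "I have a puppy",
    "I have a cat",
    "I have a kitten",
    "I have a dog and I have a pen",
]
tokenized = [d.lower().split() for d in docs]
N = len(docs)

def idf(term):
    """log(N / number of documents containing the term)."""
    df = sum(term in doc for doc in tokenized)
    return math.log(N / df)

def tfidf(doc_tokens):
    tf = Counter(doc_tokens)
    return {term: count * idf(term) for term, count in tf.items()}

print(idf("puppy"))   # log 4, since "puppy" occurs in 1 of 4 documents
print(idf("have"))    # log 1 = 0, since "have" occurs in every document
print(tfidf(tokenized[0]))
```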
49. From BOW to tf-idf
[Figure: the documents "I have a puppy", "I have a cat", "I have a kitten", and "I have a dog and I have a pen" plotted on axes "puppy", "cat", and "have"]
idf(puppy) = log 4
idf(cat) = log 4
idf(have) = log 1 = 0
(With four documents, "puppy" and "cat" each appear in only one document, so their idf is log(4/1) = log 4, while "have" appears in all four, so its idf is log(4/4) = 0.)
50. From BOW to tf-idf
[Figure: the same documents re-weighted by tf-idf, with tfidf(puppy) = log 4, tfidf(cat) = log 4, tfidf(have) = 0; the points collapse along the "have" axis and a decision surface separates the dog and cat documents]
Tf-idf flattens uninformative dimensions in the BOW point cloud
51. Entry points of feature engineering
• Start from data and task
- What's the best text representation for classification?
• Start from modeling method
- What kind of features does k-means assume?
- What does linear regression assume about the data?
53. Dato's machine learning platform
Raw data → Features → Models → Predictions → Deploy in production
GraphLab Create, Dato Distributed, Dato Predictive Services
54. Data structures for feature engineering
Features → SFrames (tabular data: User, Com. Title, Body)
User Disc. → SGraphs
55. Machine learning toolkits in GraphLab Create
• Classification/regression
• Clustering
• Recommenders
• Deep learning
• Similarity search
• Data matching
• Sentiment analysis
• Churn prediction
• Frequent pattern mining
• And on…
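A rough sketch of how these toolkits were typically invoked (based on the general shape of the GraphLab Create API at the time; the file name and column names are hypothetical):

```python
import graphlab as gl

# Load tabular data into an SFrame (hypothetical CSV with 'review' and 'sentiment' columns).
data = gl.SFrame.read_csv("reviews.csv")

# Feature engineering: turn raw text into bag-of-words dictionaries.
data["word_count"] = gl.text_analytics.count_words(data["review"])

train, test = data.random_split(0.8)
model = gl.logistic_classifier.create(train, target="sentiment",
                                      features=["word_count"])

print(model.evaluate(test))        # accuracy, confusion matrix, ...
predictions = model.predict(test)  # predicted labels
```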
59. PCA: Principal Component Analysis
Find a line such that the average distance of every data point to the line is minimized. This is the 1st principal component.
60. PCA: Principal Component Analysis
Find a 2nd line, at right angles to the 1st, such that the average distance of every data point to the line is minimized. This is the 2nd principal component.
61. PCA: Principal Component Analysis
Find a 3rd line, at right angles to the previous lines, such that the average distance of every data point to the line is minimized. …
There can only be as many principal components as the dimensionality of the data.
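A minimal sketch (not part of the deck) of this procedure with scikit-learn, assuming a small synthetic point cloud:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic 3-D point cloud that mostly varies along one direction.
points = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])

pca = PCA(n_components=3)   # at most as many components as the data has dimensions
projected = pca.fit_transform(points)

print(pca.components_)                # the principal directions (mutually orthogonal)
print(pca.explained_variance_ratio_)  # how much variance each component captures
```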
63. Coursera Machine Learning Specialization
• Learn machine learning in depth
• Build and deploy intelligent applications
• Year-long certification program
• Joint project between University of Washington + Dato
• Details: https://www.coursera.org/specializations/machine-learning
64. Next up today
alicez@dato.com, @RainyData, #StrataConf
11:30am - Intro to recommenders (Chris DuBois)
1:30pm - Intro to image search & deep learning (Piotr Teterwak)
3:30pm - Deploying ML as a predictive service (Krishna Sridhar)