BigData: My Learnings from data analytics at Uber
Reference (highly recommended):
* Designing Data-Intensive Applications http://bit.ly/big_data_architecture
* Big Data and Machine Learning using Python tools http://bit.ly/big_data_machine_learning
* Uber Engineering Blog http://eng.uber.com
* Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale
http://bit.ly/hadoop_guide_bigdata
Vectors in Search - Towards More Semantic Matching (Simon Hughes)
With the advent of deep learning and algorithms like word2vec and doc2vec, vector-based representations are increasingly being used in search to represent anything from documents to images and products. However, search engines work with documents made of tokens, not vectors, and are typically not designed for fast vector matching out of the box. In this talk, I will give an overview of how vectors can be derived from documents to produce a semantic representation of a document that can be used to implement semantic / conceptual search without hurting performance. I will then describe a few different techniques for efficiently searching vector-based representations in an inverted index, such as learning sparse representations of vectors, clustering, and learning binary vectors. Finally, I will discuss some of the pitfalls of vector-based search, and how to get the best of both worlds by combining vector-based scoring with traditional relevancy metrics such as BM25.
Exploiting Structure in Representation of Named Entities using Active Learning (Yunyao Li)
Slides for our COLING'18 paper: http://aclweb.org/anthology/C18-1058
Fundamental to several knowledge-centric applications is the need to identify named entities from their textual mentions. However, entities lack a unique representation and their mentions can differ greatly. These variations arise in complex ways that cannot be captured using textual similarity metrics. However, entities have underlying structures, typically shared by entities of the same entity type, that can help reason over their name variations. Discovering, learning and manipulating these structures typically requires high manual effort in the form of large amounts of labeled training data and handwritten transformation programs. In this work, we propose an active-learning based framework that drastically reduces the labeled data required to learn the structures of entities. We show that programs for mapping entity mentions to their structures can be automatically generated using human-comprehensible labels. Our experiments show that our framework consistently outperforms both handwritten programs and supervised learning models. We also demonstrate the utility of our framework in relation extraction and entity resolution tasks.
Many powerful Machine Learning algorithms are based on graphs, e.g., PageRank (Pregel), recommendation engines (collaborative filtering), text summarization, and other NLP tasks. Also, recent developments in Graph Neural Networks connect the worlds of graphs and Machine Learning even further.
Considering data pre-processing and feature engineering, which are both vital tasks in Machine Learning pipelines, extends this relationship across the entire ecosystem. In this session, we will investigate the entire range of Graphs and Machine Learning with many practical exercises.
Neuron is a serverless Deep Learning and AI experimentation platform for analytics where you can build, deploy and visualise data models.
Practical lab with cloud access from anywhere.
https://www.learntek.org/machine-learning-using-spark/
Learntek is a global online training provider for Big Data Analytics, Hadoop, Machine Learning, Deep Learning, IoT, AI, Cloud Technology, DevOps, Digital Marketing and other IT and management courses.
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp... (Ed Fernandez)
Adoption of ML at scale in the Enterprise, Machine Learning Platforms & AutoML
[1] Definitions & Context
• Machine Learning Platforms, Definitions
• ML models & apps as first class assets in the Enterprise
• Workflow of an ML application
• ML Algorithms, overview
• Architecture of a ML platform
• Update on the Hype cycle for ML & predictive apps
[2] Adopting ML at Scale
• The Problem with Machine Learning - Scaling ML in the
Enterprise
• Technical Debt in ML systems
• How many models are too many models
• The need for ML platforms
[3] The Market for ML Platforms
• ML platform Market References - from early adopters to
mainstream
• Custom Build vs Buy: ROI & Technical Debt
• ML Platforms - Vendor Landscape
[4] Custom Built ML Platforms
• ML platform Market References - a closer look
Facebook - FBlearner
Uber - Michelangelo
AirBnB - BigHead
• ML Platformization Going Mainstream: The Great Enterprise Pivot
[5] From DevOps to MLOps
• DevOps <> ModelOps
• The ML platform driven Organization
• Leadership & Accountability (labour division)
[6] Automated ML - AutoML
• Scaling ML - Rapid Prototyping & AutoML:
• Definition, Rationale
• Vendor Comparison
• AutoML - OptiML: Use Cases
[7] Future Evolution for ML Platforms
Appendix I: Practical Recommendations for ML onboarding in the Enterprise
Appendix II: List of References & Additional Resources
The Machine Learning Workflow with Azure (Ivo Andreev)
Machine learning is not black magic but a discipline that involves data analysis, data science and, of course, hard work. From searching for patterns in data and applying algorithms to converting the output into usable predictions, you need background knowledge and appropriate tools. In this session, we will go through major approaches to prepare data, build and deploy ML models in Azure (ML Studio, DataScience VM, Jupyter Notebook). Most importantly, based on some examples from the real world, we will provide you with a workflow of best practices.
These slides were presented by me at the PHPIndonesia and FemaleGeek Meetup on 18th June, 2016.
On this occasion, I shared how Kudo started and organized our data team and, more technically, how Kudo uses and implements ETL and machine learning.
Mentoring Session with Innovesia: Advance Robotics (Dony Riyanto)
This is my mentoring session presentation for Innovesia. I'm covering several sub-topics such as:
- Mechatronics Programming (robotics)
- Autonomous Programming
- Hard-real-time systems
- Safety compliance and standard issues
In this video from the ISC Big Data'14 Conference, Ted Willke from Intel presents: The Analytics Frontier of the Hadoop Eco-System.
"The Hadoop MapReduce framework grew out of an effort to make it easy to express and parallelize simple computations that were routinely performed at Google. It wasn’t long before libraries, like Apache Mahout, were developed to enable matrix factorization, clustering, regression, and other more complex analyses on Hadoop. Now, many of these libraries and their workloads are migrating to Apache Spark because it supports a wider class of applications than MapReduce and is more appropriate for iterative algorithms, interactive processing, and streaming applications. What’s next beyond Spark? Where is big data analytics processing headed? How will data scientists program these systems? In this talk, we will explore the current analytics frontier, the popular debates, and discuss some potentially clever additions. We will also share the emergent data science applications and collaborative university research that inform our thinking."
Learn more:
http://www.isc-events.com/bigdata14/schedule.html
and
http://www.intel.com/content/www/us/en/software/intel-graph-solutions.html
Watch the video presentation: https://www.youtube.com/watch?v=qlfx495Ekw0
As the complexity of choosing optimised, task-specific steps and ML models is often beyond non-experts, the rapid growth of machine learning applications has created a demand for off-the-shelf machine learning methods that can be used easily and without expert knowledge. We call the resulting research area, which targets progressive automation of machine learning, AutoML.
Although it focuses on end users without expert knowledge, AutoML also offers new tools to machine learning experts, for example to:
1. Perform architecture search over deep representations
2. Analyse the importance of hyperparameters.
In this talk, I will discuss various approaches to accelerating deep learning solutions from the notebook or research environment to production, and how these solutions can be transformed into an enterprise-level, end-to-end Deep Learning Solution that can be consumed as a service by any software application, with a practical use-case example.
The key challenge in making AI technology more accessible to the broader community is the scarcity of AI experts. Most businesses simply don’t have the much needed resources or skills for modeling and engineering. This is why automated machine learning and deep learning technologies (AutoML and AutoDL) are increasingly valued by academics and industry. The core of AI is the model design. Automated machine learning technology reduces the barriers to AI application, enabling developers with no AI expertise to independently and easily develop and deploy AI models. Automated machine learning is expected to completely overturn the AI industry in the next few years, making AI ubiquitous.
Machine Learning in Production
The era of big data generation is upon us. Devices ranging from sensors to robots and sophisticated applications are generating increasing amounts of rich data (time series, text, images, sound, video, etc.). For such data to benefit a business's bottom line, insights must be extracted, a process that increasingly requires machine learning (ML) and deep learning (DL) approaches deployed in production use cases.
Production ML is complicated by several challenges, including the need for two very distinct skill sets (operations and data science) to collaborate, the inherent complexity and uniqueness of ML itself, when compared to other apps, and the varied array of analytic engines that need to be combined for a practical deployment, often across physically distributed infrastructure. Nisha Talagala shares solutions and techniques for effectively managing machine learning and deep learning in production with popular analytic engines such as Apache Spark, TensorFlow, and Apache Flink.
Presented by David Taieb, Architect, IBM Cloud Data Services
Along with Spark Streaming, Spark SQL and GraphX, MLLib is one of the four key architectural components of Spark. It provides easy-to-use (even for beginners), powerful Machine Learning APIs that are designed to work in parallel using Spark RDDs. In this session, we’ll introduce the different algorithms available in MLLib, e.g. supervised learning with classification (binary and multi class) and regression but also unsupervised learning with clustering (K-means) and recommendation systems. We’ll conclude the presentation with a deep dive on a sample machine learning application built with Spark MLLib that predicts whether a scheduled flight will be delayed or not. This application trains a model using data from real flight information. The labeled flight data is combined with weather data from the “Insight for Weather” service available on IBM Bluemix Cloud Platform to form the training, test and blind data. Even if you are not a black belt in machine learning, you will learn in this session how to leverage powerful Machine Learning algorithms available in Spark to build interesting predictive and prescriptive applications.
About the Speaker: For the last 4 years, David has been the lead architect for the Watson Core UI & Tooling team based in Littleton, Massachusetts. During that time, he led the design and development of a Unified Tooling Platform to support all the Watson Tools including accuracy analysis, test experiments, corpus ingestion, and training data generation. Before that, he was the lead architect for the Domino Server OSGi team responsible for integrating the eXpeditor J2EE Web Container in Domino and building first class APIs for the developer community. He started with IBM in 1996, working on various globalization technologies and products including Domino Global Workbench (used to develop multilingual Notes/Domino NSF applications) and a multilingual Content Management system for the Websphere Application Server. David enjoys sharing his experience by speaking at conferences. You’ll find him at various events like the Unicode conference, Eclipsecon, and Lotusphere. He’s also passionate about building tools that help improve developer productivity and overall experience.
Ryan Curtin, Principal Research Scientist, Symantec at MLconf ATL 2016 (MLconf)
mlpack: Or, How I Learned To Stop Worrying and Love C++: mlpack is a cutting-edge C++ machine learning library containing fast implementations of both standard machine learning algorithms and recently-published algorithms. In this talk, I will introduce mlpack, its design philosophy, and discuss how C++ is helpful for making implementations fast, as well as the pros and cons of C++ as a language choice. I will briefly review the capabilities of mlpack, then focus on mlpack’s flexibility by demonstrating the k-means clustering code (and maybe some other algorithms too, like nearest neighbor search), and how it might be used in a production environment. The project website can be found at http://www.mlpack.org/.
Machine Learning using Spark Online Training (Learntek1)
http://www.learntek.org/product/machine-learning-using-spark/
What is Machine Learning?
Machine Learning Using Spark: Spark MLlib is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in data and make better decisions in the future based on the examples that we provide. The primary aim is to allow the computers to learn automatically without human intervention or assistance and adjust actions accordingly.
http://www.learntek.org/
Learntek is a global online training provider for Big Data Analytics, Hadoop, Machine Learning, Deep Learning, IoT, AI, Cloud Technology, DevOps, Digital Marketing and other IT and management courses. We are dedicated to designing, developing and implementing training programs for students, corporate employees and business professionals.
[PythonPH] Transforming the call center with Text mining and Deep learning (C... (Paul Lo)
Transforming the call center with Text mining and Deep learning:
1. Text mining tool to unlock user insights
2. Artificial Intelligence revolution in call centers: deep learning-based bot
Data Workflows for Machine Learning - Seattle DAML (Paco Nathan)
First public meetup at Twitter Seattle, for Seattle DAML:
http://www.meetup.com/Seattle-DAML/events/159043422/
We compare/contrast several open source frameworks which have emerged for Machine Learning workflows, including KNIME, IPython Notebook and related Py libraries, Cascading, Cascalog, Scalding, Summingbird, Spark/MLbase, MBrace on .NET, etc. The analysis develops several points for "best of breed" and what features would be great to see across the board for many frameworks... leading up to a "scorecard" to help evaluate different alternatives. We also review the PMML standard for migrating predictive models, e.g., from SAS to Hadoop.
Slide deck for my session at Insider Dev Tour 2019 (Lisbon, Jul 29th).
Mostly about tools and platform support for AI workloads and the options for edge computing and cloud computing:
ML.NET, WinML, DirectML, Model Builder, Azure Cognitive Services, ...
mlflow: Accelerating the End-to-End ML lifecycle (Databricks)
Building and deploying a machine learning model can be difficult to do once. Enabling other data scientists (or yourself, one month later) to reproduce your pipeline, to compare the results of different versions, to track what’s running where, and to redeploy and rollback updated models is much harder.
In this talk, I’ll introduce MLflow, a new open source project from Databricks that simplifies the machine learning lifecycle. MLflow provides APIs for tracking experiment runs between multiple users within a reproducible environment, and for managing the deployment of models to production. MLflow is designed to be an open, modular platform, in the sense that you can use it with any existing ML library and development process. MLflow was launched in June 2018 and has already seen significant community contributions, with over 50 contributors and new features including language APIs, integrations with popular ML libraries, and storage backends. I’ll show how MLflow works and explain how to get started with MLflow.
ML Framework for auto-responding to customer support queries (Varun Nathan)
This presentation covers how ML can be employed to develop a bot that can understand natural language and provide suitable responses.
This session will demystify (generative) AI by exploring its workings as an advanced statistical modelling tool (suitable for any level of technical knowledge). Not only will this session explain the technological underpinnings of AI, it will also address concerns and (long-term) requirements around ethical and practical usage of AI. This includes data preparation and cleaning, data ownership, and the value of data generated, but not owned, by libraries. It will also discuss the potential of (hypothetical) use cases of AI in collections environments and making collections data AI-ready, providing examples of AI capabilities and applications beyond chatbots.
NESMA is most commonly known as the owner of the Dutch FPA functional size measurement standard ISO/IEC 24570. In this presentation we show that NESMA is more than just function points and we present our vision for software metrics in 2020.
Embark on a transformative journey into the world of data science with Tsofttech Institution's comprehensive Data Science Excellence program. In today's data-driven world, harnessing the power of data is essential for making informed decisions and driving innovation.
Course Highlights:
Practical Learning: Our hands-on approach allows you to gain practical experience by working on real-world data science projects. You'll learn to extract insights, analyze trends, and make data-driven decisions.
Cutting-Edge Curriculum: Stay at the forefront of data science with a curriculum that covers the latest tools and techniques, including data analysis, machine learning, data visualization, and more.
Expert Instructors: Learn from seasoned data scientists and industry experts who will guide you through the intricacies of data analysis and modeling, providing valuable insights and mentorship.
Personalized Learning: Our flexible course modules cater to learners of all levels, whether you're a beginner or an experienced professional. We tailor your learning experience to meet your specific needs and goals.
Certification: Receive a prestigious certification upon completing the program, validating your data science skills and boosting your career prospects.
Key Topics Covered:
Data Cleaning and Preprocessing
Exploratory Data Analysis
Machine Learning Algorithms
Predictive Analytics
Data Visualization
Big Data Technologies
Deep Learning
Natural Language Processing (NLP)
Business Analytics
Capstone Projects
Open the doors to a world of opportunities with a solid foundation in data science from Tsofttech Institution. Whether you aim to drive business decisions, conduct advanced research, or seek career growth, our program equips you with the skills needed to excel in this dynamic field.
Join us today and start your journey towards Data Science Excellence at Tsofttech Institution!
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank, operate on graph representations such as Compressed Sparse Row (CSR), an adjacency-list based format.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
[Taipei.py] Improving User Experience with Text Mining and Deep Learning in Uber
1. Paul Lo, 2018/12 @ Taipei.py
Data Analytics @ Uber, Asia-Pacific Community Operation Central team
paullo0106@gmail.com | http://paullo.myvnc.com/blog/
Improving User Experience with Text Mining and Deep Learning in Uber
2. Table of contents
Improving User Experience with Text Mining and Deep Learning in Uber
Self-introduction: Who am I? What does our analytics team do for Asia-Pacific?
Project #1: Text mining tool to unlock user insights. Python libs: natural language processing, topic modeling.
Project #2: Deep learning-based answering bot for a call center. Python libs: machine learning related, such as tensorflow, keras, sklearn, numpy, etc.
4. Scope of Community Operation in Uber APAC
Scope: 10+ languages in ~20 countries
Central Team based in Manila, Singapore, India
India
Singapore (South East and North Asia)
Australia
5. Data @ Uber
How big is Big Data? Uber's Data Lake stores 30+ petabytes of data across ~M clusters in N data centers (thousands of servers).
So how much data is that really?
~100,000 years of music, which is 50x the amount of music streamed on Spotify each year
50+ billion books or 50 million Kindles, equivalent to the entire written works of mankind from the beginning of recorded history, in all languages
150+ years of 24/7 Full HD video recording, the amount of storage required to render 50 Avatar movies simultaneously
6. Data-driven business decision culture
Data helps us tell the story to the public and operate better
Typical policy and communications questions:
● How many jobs does Uber provide in Taipei?
● How is Uber Pool reducing congestion in Manila?
● What proportion of our trips start or end at public transportation?
** Uber's open city transportation data: https://movement.uber.com
Typical city operation questions:
● Do we have enough drivers for the New Year?
● How can we reduce the ETA for our riders?
● When is it best to introduce an EATS delivery fee in my city?
7. Data tools to support Big data
Source: https://eng.uber.com
8. What are our roles at Uber?
Uber's Data Lake
App + Support Data: Rides, Eats, etc.
Payments Data: Collection, Payments
External Data: Traffic, Weather, Holidays, Maps
Marketing Data: Clicks, Impressions, Sentiment
Machine learning platform / Programming interface / Query interface
Internal BI Tools / Company-wide dashboards
9. Improving user experience is one of our core missions
Improve user experience
Drive down defect rate
Optimize operational efficiency
Manage the cost of business operation
10. Project #1:
Text mining and NLP for user experience enhancement
Acknowledgement: Troy James Palanca, Lorenzo Ampil
11. Value proposition
Speed up the workflow on user experience enhancement
(workflow diagram) User feedback database → Community Operation reviews customer feedback in tickets → defect rate and issue type leaderboard → root cause analysis and recommended feature or policy changes (Product, Engineering, etc.) → user experience enhancement
12. Value proposition
Speed up the workflow on user experience enhancement
(same workflow diagram) Making this process more efficient
13. Problem
How can we quickly get insights from users' feedback?
Problem: reviewing tickets manually to diagnose the root cause is not scalable and is unsystematic.
Ticket dataset: Driver > Trips > Fare … > … > Technical issue (diagram: a large pile of individual tickets in each category)
14. Problem
How can we quickly get insights from users' feedback?
Solution: use topic modeling techniques to efficiently group tickets and assign them to reasonably named topics.
Ticket dataset: Driver > Trips > Fare … > … > Technical issue → App stuck/crash (35%), Fare calculation dispute (15%), GPS issue (55%)
15. Key features of our solution
Using a topic-modeling based tool to learn pain points from our users
Ticket snippet with user profile: respective ticket samples are displayed when clicking on a keyword
Word cloud view: users can switch to this view to see the most relevant (TF-IDF score) keywords in each topic
>> DEMO
16. Sample results
"Fare Disputes" in one of the cities we operate in are mainly about payments, airport issues, and wrong riders:
● Credit cards and other modes of payment (18%)
● Overcharging (28.8%)
● Wrong profiles being billed (12.8%)
● Airport terminal issues (12.9%)
● Someone else taking the trip (12.5%)
17. Sample results
Lots of "rude", "loud music", "drunk", and "slam door" keywords were detected as the pain points of our NY driver partners
18. Sample results
More than 10% of driver cancellation tickets in Singapore are related to car seat rules for child safety: many sample tickets show that drivers want their cancellation fee reimbursed because their riders brought children without prior notice.
19. Tool architecture
Computing node (any Uber servers): data collection → data preparation → LDA model training
Web server (AWS node): HTML and JSON files from training results
User Interface (d3js)
Train the model for each country with top issues monthly
20. Workflow overview
Unlocking support insights from textual content
Data input: ticket text as raw data; sample ~50,000 tickets for each training in each issue category
Output: topic model clusters
21. Workflow overview
Unlocking support insights from textual content
Data input: ticket text as raw data (~50,000 sampled tickets per issue category) → Data Preparation (text processing): extract useful information and transform the corpus to a sparse matrix → Data Modeling (Latent Dirichlet Allocation): main computation to perform topic modeling → Output: topic model clusters
Text processing libraries: nltk, BeautifulSoup, re, TextBlob
LDA library: gensim.models.ldamodel.LdaModel and pyLDAvis
22. Workflow overview
(same pipeline) The Data Preparation steps:
Remove invalid words: numbers, HTML tags, custom dictionary
Stemming and lemmatization
Tokenization
TFIDF (Term Frequency Inverse Document Frequency)
23. Workflow overview
(same pipeline) Remove invalid words:
● Numbers: re.sub(r'\d+', '', text)
● HTML tags: BeautifulSoup(document).get_text(), BeautifulSoup(document).find_all('b')
● Custom dictionary
Then: stemming and lemmatization, tokenization, TFIDF (Term Frequency Inverse Document Frequency)
24. Workflow overview
(same pipeline) Stemming and lemmatization: reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. For instance:
○ cancel, cancels, cancelled -> cancel
○ riders, rider -> rider
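A minimal sketch of this normalization step with nltk (one of the text processing libraries listed on slide 21); the words are the slide's own examples:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # the lemmatizer needs the WordNet corpus

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("cancels"))                     # -> cancel
print(lemmatizer.lemmatize("cancelled", pos="v"))  # -> cancel
print(lemmatizer.lemmatize("riders", pos="n"))     # -> rider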
25. Workflow overview
(same pipeline) Tokenization: part-of-speech based word detection
26. Workflow overview
(same pipeline) TFIDF (Term Frequency Inverse Document Frequency): common practice to score each term by weighted frequency and relevance
27. Data Preparation (Natural Language Processing)
Using TFIDF to filter the most important keywords before they reach the Machine Learning Model
28. Data Preparation (Natural Language Processing)
Using TFIDF to filter the most important keywords: Term Frequency × Inverse Document Frequency
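A minimal TF-IDF sketch with scikit-learn to make the scoring concrete; the tickets below are made-up examples, not real Uber data:

from sklearn.feature_extraction.text import TfidfVectorizer

tickets = [
    "app crashed during the trip",
    "fare was wrong and I was overcharged for the trip",
    "gps showed the wrong location in the app",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(tickets)  # sparse matrix: tickets x terms

# terms frequent in one ticket but rare across tickets score highest
terms = vectorizer.get_feature_names_out()
for term, score in zip(terms, X.toarray()[0]):
    if score > 0:
        print(f"{term}: {score:.2f}")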
29. Workflow overview
Data preparation can be very time-consuming
(same pipeline: data input → Data Preparation → Data Modeling → topic model clusters; remove invalid words, stemming and lemmatization, tokenization, TFIDF)
30. Speed up data processing
Pandas runs on a single thread by default
Data Preparation on a pandas DataFrame with 50k+ rows: text_processing() is a heavy function containing many things:
● Tokenization
● Removal of numbers, HTML tags, and other invalid words
● Stemming and lemmatization
● TFIDF
df['content'].apply(text_processing) → single thread by default
31. Speed up data processing
Pandas runs on a single thread by default
(diagram: split the work across Worker 1 … Worker N, then combine the keywords)
32. Data processing speedup trick in Pandas
Pandas runs on a single thread by default
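The code for this trick did not survive in the transcript; below is a common way to implement what the slide describes, splitting the DataFrame into chunks and applying the heavy text_processing() function from slide 30 in parallel worker processes. The helper names are illustrative:

import multiprocessing as mp
import numpy as np
import pandas as pd

def process_chunk(chunk):
    # each worker runs the heavy per-row function on its own chunk
    return chunk.apply(text_processing)

def parallel_apply(series, n_workers=mp.cpu_count()):
    chunks = np.array_split(series, n_workers)
    with mp.Pool(n_workers) as pool:
        results = pool.map(process_chunk, chunks)
    return pd.concat(results)

# keywords = parallel_apply(df['content'])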
33. Many handy text processing libraries
TextBlob as an example:
Tokenization
Sentence correction: .correct()
Part of speech: .tags
Sentiment analysis: .sentiment.polarity
NLP libraries: TextBlob, spaCy
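A short usage sketch of the TextBlob features named above; the sentence is an illustrative example:

from textblob import TextBlob
# one-time corpora setup: python -m textblob.download_corpora

blob = TextBlob("The driver was verry rude and played loud music")

print(blob.words)               # tokenization
print(blob.correct())           # sentence correction: "verry" -> "very"
print(blob.tags)                # part-of-speech tags
print(blob.sentiment.polarity)  # sentiment score in [-1, 1]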
34. Workflow overview
Unlocking support insights from textual content - but how?
(same pipeline: ~50,000 sampled tickets → Data Preparation → Data Modeling (LDA) → topic model clusters)
LDA:
- Unsupervised learning
- Bag of words
- "topic distribution"
Usage:
lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=4, random_state=some_number)
lda.show_topics()
35. Latent Dirichlet Allocation model
General concept of this model
Unsupervised learning method - does not require any class labels; similar to clustering
'Bag of words' model - uses word counts in messages without regard for their order (Peter owes Alice money = Alice owes Peter money)
Estimated iteratively - starts with random initialization, then adjusts probabilities to reduce perplexity / increase fit (EM; Expectation Maximization)
(diagram: per-document topic probabilities across Doc 1 … Doc n, e.g. 30% health (topic 1), 60% fruits (topic 2), 10% disease (topic 3))
36. Latent Dirichlet Allocation model
Model implementation and visualization
(same pipeline: ~50,000 sampled tickets → Data Preparation → Data Modeling (LDA) → topic model clusters)
Usage:
from gensim.models import LdaModel
from pyLDAvis.gensim import prepare  # pyLDAvis.gensim_models in newer releases
import pyLDAvis                      # pyLDAvis.save_html writes the view to HTML
lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=4, random_state=some_number)
lda.show_topics()
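A self-contained sketch of the pipeline the slide references: building the id2word dictionary and bag-of-words corpus that the LdaModel call assumes, then exporting the pyLDAvis view. The token lists are illustrative stand-ins for preprocessed tickets:

from gensim.corpora import Dictionary
from gensim.models import LdaModel
import pyLDAvis
import pyLDAvis.gensim  # pyLDAvis.gensim_models in newer pyLDAvis releases

docs = [
    ["app", "crash", "trip"],
    ["fare", "overcharge", "payment"],
    ["gps", "location", "wrong"],
]

dictionary = Dictionary(docs)                       # id2word mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=4, random_state=42)
print(lda.show_topics())

vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "topics.html")  # interactive topic visualization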
37. Future work and learnings
How to make the results more useful and actionable?
Customization is needed:
● Not suited for specific issue categories out of the box
● Build our own dictionary for the removal of irrelevant words
Open questions:
● # of topics for convergence
● Time and performance tradeoff
● Other "deep NLP" models? Word2vec, GloVe, fastText
38. Table of contents
Improving User Experience with Text Mining and Deep Learning in Uber
Self-introduction: Who am I? What does our analytics team do for Asia-Pacific?
Project #1: Text mining tool to unlock user insights. Python libs: natural language processing, topic modeling.
Project #2: Deep learning-based answering bot for a call center. Python libs: machine learning related, such as tensorflow, keras, sklearn, numpy, etc.
39. Product owners: Huaixiu Zheng, Yi-Chia Wang, and Hugh Williams in Uber's Applied Machine Learning team
Project #2:
Artificial Intelligence revolution in call centers
40. A CSR's sample workflow to respond to users in a call center
How do our users submit an issue?
41. A CSR's sample workflow to respond to users in a call center
Online support via in-app help
42. The issue for call center operation: scalability and cost
The growth comes at a price again….
43. Solution? Let's start with a basic example
"I want to change my rating for a rider" - a very rule-based, deterministic flow
44. The business impact of a simple bot-solving solution
3k+ weekly solves; a team of 18 CSRs; 28k USD monthly
47. Our machine learning solution design
Why go with "semi-automated" assistance rather than a real robot?
Product designed by Hugh Williams, Huaixiu Zheng, Yi-Chia Wang in the Applied Machine Learning team
48. Our machine learning solution design
'Assistant to CSR' - provide suggestions for replies and actions: issue category/type suggestion, action suggestion
Technical model training: 10M+ tickets, plus the correct responses from agents to those 10M+ tickets
Product design
49. Typical Machine Learning process
Note: picture from Mark Peng's "General Tips for participating Kaggle Competitions" on SlideShare
50. Typical Machine Learning process
Model selection
ML 101: start with a simple model first
Data source: https://eng.uber.com/cota-v2/
53. Deep Learning Architecture
Reference: Uber AML Lab: http://eng.uber.com/cota
arXiv Paper: COTA: Improving the Speed and Accuracy of Customer Support through Ranking and Deep Networks (link)
CNN: Max pooling
Optimizers: Adam (SGD, RMSProp), Batch Normalization
Regularization: L2 Reg, Dropout, Batch Normalization, early stopping
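This is not the COTA architecture itself (see the linked paper); it is a minimal Keras sketch that wires together the pieces this slide names: a text CNN with max pooling, the Adam optimizer, batch normalization, L2 regularization, dropout, and early stopping. All sizes are illustrative:

from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Embedding(input_dim=20000, output_dim=128),
    layers.Conv1D(64, 5, activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4)),  # L2 reg
    layers.GlobalMaxPooling1D(),   # max pooling
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

early_stop = keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.1, callbacks=[early_stop])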
54. Development environment for deep learning model training
What does model training look like?
>> DEMO
Main codebase + data set
GRID K520
55. Feature engineering and feature importance
Trade-off between "capacity" and "interpretability"
56. Feature engineering and feature importance
What are the important features? Very easy to learn that in a simpler model
57. Feature engineering and feature importance
What are the important features? Very easy to get explanations in simpler models
58. Feature engineering and feature importance
What are the important features? An NN model is like our brain's intuition… a black box
60. Feature engineering and feature importance
What are the important features?
Sklearn: Recursive feature elimination (sklearn.feature_selection.RFE), on a mockup dataset
61. Feature engineering and feature importance
What are the important features? Time on model training >>> prediction
Shuffle each feature to create noise… on the testing set (mockup dataset)
62. Feature engineering and feature importance
What are the important features?
Shuffle each feature to create noise… on the testing set (mockup example)
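The shuffle-one-feature idea above is known as permutation importance, and scikit-learn ships an implementation; mock data stands in for the deck's mockup dataset:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# shuffle each feature on the *testing* set and measure the score drop
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i, drop in enumerate(result.importances_mean):
    print(f"feature {i}: {drop:.3f}")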
63. Why is NumPy faster?
Python vectorization: Single Instruction, Multiple Data (SIMD)
64. Why is NumPy faster?
Python vectorization: Single Instruction, Multiple Data (SIMD)
65. Why is NumPy faster?
Python vectorization: locality of reference (spatial locality)
Java/C++ versus Python…
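A quick timing sketch of why vectorization wins: a single NumPy call runs a compiled, SIMD-friendly loop over contiguous memory instead of interpreting a Python-level loop element by element:

import timeit
import numpy as np

a = np.random.rand(1_000_000)

loop_time = timeit.timeit(lambda: sum(x * x for x in a), number=10)
vec_time = timeit.timeit(lambda: np.dot(a, a), number=10)

print(f"python loop: {loop_time:.3f}s")
print(f"numpy dot:   {vec_time:.3f}s")  # typically orders of magnitude faster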
66. Issue category suggestion, action suggestion, product design
Last stop: making business impact
Ensure KPI measurement is well planned from the beginning
67. Last stop: making business impact
Identify key business metrics, and cautiously conduct and monitor experiments
Source: https://eng.uber.com/cota-v2/
Experiment notes:
* Network effect → switchback tests instead of A/B tests
* Guardrail variables and decision variables (risk control)
* Monitoring versus peeking
* Novelty effect
69. Other learnings
How to become a better programmer or data scientist?
● Long-term growth: don't just know how to call APIs →
○ Understand what's happening beneath (math and low-level manipulation are key)
○ Understand the pros and cons of your tool/model/framework choice
● Coding at scale: resources and infra are rich, but data is also huge (as well as the risk) → optimize time and space, but don't overdesign
● Communication: everybody is busy → organize and communicate your work well, and build good social relationships
70. Recommended reading
How to become a better programmer or data scientist? Read more books, write more code, share more.
Data Science from Scratch (用Python學資料科學): highly recommended; also try the original English edition.
Python資料運算與分析實戰 (Practical Python Data Computation and Analysis): NumPy, SciPy, Pandas. Japanese authors also write excellent programming books...
流暢的Python (Fluent Python)
For Java, my recommended bible is Effective Java; this book may not be at that level yet, but it is also recommended!
71. Recommended reading
How to become a better programmer or data scientist? Read & code & share, and repeat
Machine Learning and Deep Learning with Python: focuses on scikit-learn and TensorFlow
Data Science from Scratch: highly recommended; a Python-based, hands-on walkthrough of classical concepts and algorithms