12 Resolutions for a Great Year at Work - O.C. Tanner
If you are looking to improve your experience at work, here are 12 resolutions that will help you make 2016 your most productive, happy and engaged year yet.
What is ChatGPT and how can we use it? This is a talk given at Affiliate Summit West -- January 2023 to explain what ChatGPT is and isn't and how we can use it in Search.
All images were created using Dall-e.
A brief introduction to data science, explaining concepts, algorithms, machine learning, supervised and unsupervised learning, clustering, statistics, data preprocessing, real-world applications, and more.
It's part of a Data Science Corner campaign where I will be discussing the fundamentals of data science, AI/ML, statistics, and more.
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look - Applitools
During this webinar, Anand Bagmar demonstrates how AI tools such as ChatGPT can be applied to various stages of the software development life cycle (SDLC) using an eCommerce application case study. Find the on-demand recording and more info at https://applitools.info/b59
Key takeaways:
• Learn how to use ChatGPT to add AI power to your testing and test automation
• Understand the limitations of the technology and where human expertise is crucial
• Gain insight into different AI-based tools
• Adopt AI-based tools to stay relevant and optimize work for developers and testers
* ChatGPT and OpenAI belong to OpenAI, L.L.C.
GENERATIVE AI, THE FUTURE OF PRODUCTIVITY - Andre Muscat
Discuss the impact and opportunity of using Generative AI to support your development and creative teams
* Explore business challenges in content creation
* Cost-per-unit of different types of content
* Use AI to reduce cost-per-unit
* New partnerships being formed that will have a material impact on the way we search and engage with content
Part 4 of a 9-part research series, "What matters in AI", published on www.andremuscat.com
Let's talk about GPT: A crash course in Generative AI for researchers - Steven Van Vaerenbergh
This talk delves into the extraordinary capabilities of the emerging technology of generative AI, outlining its recent history and emphasizing its growing influence on scientific endeavors. Through a series of practical examples tailored for researchers, we will explore the transformative influence of these powerful tools on scientific tasks such as writing, coding, data wrangling and literature review.
This ChatGPT SEO guide was created to help individuals take advantage of ChatGPT's potential for SEO. ChatGPT has rapidly gained popularity, registering over 1 million users in just 5 days. This guide provides an in-depth understanding of using ChatGPT to improve your SEO and attain higher search engine rankings.
ChatGPT: What It Is and How Writers Can Use It - Adsy
Have you heard of ChatGPT? This smart model seems to be changing the way we work in the content marketing field.
We've investigated what this AI tool can do for content writing and are ready to share the results.
Check this presentation to learn how this chatbot can assist you with content creation.
Using ChatGPT can be helpful in presentations to explain concepts in easy-to-understand terms.
Pairing that with Dall-E 2 can make your slides fun and interesting.
Time Management & Productivity - Best Practices - Vit Horky
Here's my presentation on proven best practices for managing your work time effectively and improving your productivity. It includes practical tips on how to use tools such as Slack, Google Apps, Hubspot, Google Calendar, Gmail, and others.
A non-technical introduction to ChatGPT - SEDA.pptx - Sue Beckingham
This presentation provides a brief history and context to ChatGPT, gives examples of what ChatGPT can do, considers the implications and issues and the next steps to consider.
This is a collection of prompt examples to be used with the ChatGPT model.
The ChatGPT model is a large language model trained by OpenAI that is capable of generating human-like text. By providing it with a prompt, it can generate responses that continue the conversation or expand on the given prompt.
Prompt engineering is a fundamental concept within the field of artificial intelligence, with particular relevance to natural language processing. It involves the strategic embedding of task descriptions within the input data of an AI system, often in the form of a question or query, as opposed to explicitly providing the task description separately. This approach optimizes the efficiency and effectiveness of AI models by encapsulating the desired outcome within the input context, thereby enabling more streamlined and context-aware responses.
Deriving Intelligence from Customer Actions: Data Marketing 2015 presentation - Mathew Sweezey
Understanding customers' wants, needs, and desires is a tricky business, but it can be made much easier if you understand three key ideas: stage-based marketing, a system of relevance, and the fundamentals of modern consumer desires. This presentation was created for the Data Marketing conference in Toronto in 2015; it outlines new research into modern buyers and shows how to understand their needs, derive intelligence from their actions, and provide the best experience possible in the modern era.
Slides from SXSW 2015 session on the intersection of data and design:
http://schedule.sxsw.com/2015/events/event_IAP41090
By Trina Chiasson from https://infoactive.co
Here at Table19, we believe that great work is only possible when clients and their agencies work together as a team. This is a presentation written by our Executive Creative Director Graham Wall, who on his first day in this industry heard the senior team he was shadowing say something he couldn’t understand: that the client had bought the wrong idea.
This set in motion a desire to understand how and why this had happened, and make sure it never happened again. This presentation details Graham’s learnings and philosophies, and shows how agencies and clients can create better work together.
Data mining Course
Chapter 2: Data preparation and processing
Introduction
Domain Expert
Goal identification and Data Understanding
Data Cleaning
Missing values
Noisy Data
Inconsistent Data
Data Integration
Data Transformation
Data Reduction
Feature Selection
Sampling
Discretization
Data Mining – definition, challenges, tasks. Data pre-processing: data cleaning, missing data, dimensionality reduction, data transformation, measures of similarity and dissimilarity. Introduction to association rules: the APRIORI algorithm, the partition algorithm, the FP-growth algorithm. Introduction to classification techniques: decision trees, the Naïve Bayes classifier, the k-nearest neighbour classification algorithm.
Missing data arise in almost all serious statistical analyses. In this post I discuss a variety of methods to handle missing data, including some relatively simple approaches that can often yield reasonable results.
Python for Data Analysis: A Comprehensive Guide - Aivada
In an era where data reigns supreme, the importance of data analysis for insightful decision-making cannot be overstated. Python, with its ease of learning and a plethora of libraries, stands as a preferred choice for data analysts.
Top 30 Data Analyst Interview Questions.pdf - ShaikSikindar1
Data analytics has emerged as one of the central aspects of business operations. Consequently, the quest for professional positions within the data analytics domain has assumed unimaginable proportions. So this is for anyone who aspires to make it as a data analyst.
This KnolX session is an introduction to machine learning, in which we cover the basics of various algorithms. It isn't a complete intro to ML, but it can be a good starting point for anyone who wants to get started. At the end, we look at a demo analyzing the FIFA dataset, walking through various data analysis techniques and using an ML algorithm to find 5 players that are similar to each other.
Uncover Trends and Patterns with Data Science.pdf - Uncodemy
In today's data-driven world, the vast amount of information generated every second presents both challenges and opportunities for businesses and researchers alike. Harnessing this data effectively can provide valuable insights, unlock hidden trends, and identify patterns that drive innovation and strategic decision-making.
Machine Learning Approaches and Its Challenges - ijcnes
Real-world data sets are rarely in proper shape; they may contain incomplete or missing values, and identifying missing attributes is a challenging task. To impute the missing data, data preprocessing has to be done: data preprocessing is a data mining process that cleanses the data, and handling missing data is a crucial part of any data mining technique. Major industries and many real-time applications care deeply about their data, because data loss can drag a company's growth down. For example, the health care industry holds a great deal of data about patient details; to diagnose a particular patient we need exact data, and if attribute values are missing it is very difficult to rely on the records. Given the drawbacks of missing values in the data mining process, many techniques and algorithms have been implemented, and many of them are not very efficient. This paper elaborates the various techniques and machine learning approaches to handling missing attribute values and makes a comparative analysis to identify the most efficient method.
Data Analyst Interview Questions & Answers - Satyam Jaiswal
Practice the best data analyst interview questions for thorough preparation for a data analyst interview. These questions are very popular and have been asked many times in data analyst interviews.
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Covers graph algorithms like PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Techniques to optimize the PageRank algorithm usually fall into two categories: one tries to reduce the work per iteration, and the other tries to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices (those with the same in-links) helps reduce duplicate computations and thus could reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
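As a quick illustration of the first idea only, here is a minimal sketch (mine, not from the report, and not the STICD algorithm) of pull-based PageRank that stops updating vertices once their per-iteration rank change falls below a tolerance. The function name and graph encoding are assumptions for the example, and freezing "converged" vertices is a heuristic, since their neighbors may still move.

```python
import numpy as np

def pagerank_skip_converged(in_nbrs, out_deg, d=0.85, tol=1e-10, max_iter=100):
    """Pull-based PageRank that skips recomputation of vertices whose
    rank change has already fallen below a per-vertex tolerance."""
    n = len(in_nbrs)
    r = np.full(n, 1.0 / n)
    active = np.ones(n, dtype=bool)           # vertices still being updated
    for _ in range(max_iter):
        r_new = r.copy()
        for v in range(n):
            if not active[v]:
                continue                      # converged vertex: skip its update
            s = sum(r[u] / out_deg[u] for u in in_nbrs[v] if out_deg[u] > 0)
            r_new[v] = (1.0 - d) / n + d * s
            if abs(r_new[v] - r[v]) < tol:
                active[v] = False             # freeze it (heuristic)
        r = r_new
        if not active.any():
            break
    return r / r.sum()

# Tiny 4-vertex example: edges 0->1, 1->2, 2->0, 3->0
in_nbrs = [[2, 3], [0], [1], []]              # in-neighbors of each vertex
out_deg = [1, 1, 1, 1]
print(pagerank_skip_converged(in_nbrs, out_deg))
```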
4. 4
What is Machine Learning?
Machine Learning is the field of study that gives computers the capability to learn without being explicitly programmed. ML is one of the most exciting technologies that one would have ever come across. As is evident from the name, it gives the computer that which makes it more similar to humans: the ability to learn. Machine learning is actively being used today, perhaps in many more places than one would expect.
11. 11
7 Steps of Machine Learning
1. Gathering Data, 2. Preparing Data, 3. Choosing a Model, 4. Training the Model, 5. Model Evaluation, 6. Hyperparameter Tuning, 7. Prediction.
Step 1, Gathering Data: data collection is the process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes.
12. 12
7 Steps of Machine Learning
Step 1: Gathering Data
Data collection is the process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes.
13. 13
Data Gathering: Notes
• Volume: a high volume of data demands scalable techniques (incremental learning; computation and storage resource issues).
• Data type: data may be structured or unstructured.
• Sensitive information: privacy-preserving issues.
• Distribution of data: inconsistency of incoming records.
14. 14
7 Steps of Machine Learning
Step 2: Preparing Data
Once the data is collected, it's time to assess its condition, including looking for trends, outliers, exceptions, and incorrect, inconsistent, missing, or skewed information.
15. 15
Step 2: Preparing Data
1. Data exploration and profiling: once the data is collected, it's time to assess its condition, including looking for trends, outliers, exceptions, and incorrect, inconsistent, missing, or skewed information.
2. Formatting data to make it consistent: ensure your data is formatted in a way that best fits your machine learning model. If you are aggregating data from different sources, or if your data set has been manually updated by more than one stakeholder, you'll likely discover anomalies in how the data is formatted (e.g., USD5.50 versus $5.50).
3. Improving data quality: start by having a strategy for dealing with erroneous data, missing values, extreme values, and outliers in your data.
4. Feature engineering: the art and science of transforming raw data into features that better represent a pattern to the learning algorithms.
5. Splitting data into training and test sets: the final step is to split your data into two sets, one for training your algorithm and another for evaluation purposes.
16. 16
1. Data Exploration and Profiling
Once the data is collected, it's time to assess its condition, including looking for trends, outliers, exceptions, and incorrect, inconsistent, missing, or skewed information. This is important because your source data will inform all of your model's findings, so it is critical to be sure it does not contain unseen biases.
• Incorrect and inconsistent data discovery: this is the time to catch any issues that could incorrectly skew your model's findings, on the entire data set, not just on partial or sample data sets.
• Data visualization tools and techniques: use data visualization or statistical analysis techniques to explore the data.
• Discover unseen bias in data: if you are looking at customer behavior nationally, but only pulling in data from a limited sample, you might miss important geographic regions.
17. 17
2. Formatting Data to Make It Consistent
The next step in great data preparation is to ensure your data is formatted in a way that best fits your machine learning model. Consistent data formatting takes away these errors so that the entire data set uses the same input formatting protocols.
• Anomalies in how the data is formatted: e.g., $50 vs USD 50. If you are aggregating data from different sources, or if your data set has been manually updated by more than one stakeholder, you'll likely discover such anomalies (e.g., USD5.50 versus $5.50).
• Inconsistency checking: inconsistent values from different sources of data. Standardizing values in a column (e.g., state names that could be spelled out or abbreviated) will ensure that your data aggregates correctly.
• Formatting inconsistency in unstructured data: two different word lengths and array sizes result in different vectors.
18. 18
3. Improving Data Quality
Start by having a strategy for dealing with erroneous data, missing values, extreme values, and outliers in your data. Self-service data preparation tools can help if they have intelligent facilities built in to match data attributes from disparate datasets and combine them intelligently. For instance, if you have columns for FIRST NAME and LAST NAME in one dataset, and another dataset has a column called CUSTOMER that seems to hold a FIRST and LAST NAME combined, intelligent algorithms should be able to determine a way to match these and join the datasets to get a singular view of the customer.
• Handling missing data: take care before automatically deleting all records with a missing value, as too many deletions could skew your data set so it no longer reflects real-world situations.
• Unbalanced data handling: use data visualization or statistical analysis techniques to explore the distribution of classes and values.
• Outlier removal: for continuous variables, use histograms to review the distribution of your data and reduce the skewness. Be sure to examine records outside an accepted range of values. An "outlier" could be an inputting error, or it could be a real and meaningful result that could inform future events; duplicate or similar values could carry the same information and should be eliminated.
19. 19
Handling Missing Data: Notes
Different techniques for handling missing data in machine learning research:
1. Assigning a unique category, such as 'unknown'.
2. Predicting the missing values, with the help of machine learning techniques such as regression on other features.
3. Using algorithms which support missing values, like KNN or Random Forest.
4. Deleting rows: delete the rows with at least one missing value.
5. Deleting columns: delete the columns with at least one missing value.
6. Replacing with the mean/median/mode.
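As a concrete sketch of a few of these techniques, here is a short pandas example; the DataFrame and column names are invented purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 47, 31, np.nan],
    "city":   ["Oslo", None, "Pune", None, "Lima"],
    "income": [52_000, 48_000, np.nan, 61_000, 58_000],
})

# 1. Assign a unique category such as 'unknown' (categorical column)
df["city_filled"] = df["city"].fillna("unknown")

# 4. Delete rows with at least one missing value
df_rows_dropped = df.dropna(axis=0)

# 5. Delete columns with at least one missing value
df_cols_dropped = df.dropna(axis=1)

# 6. Replace with the mean / median
df["age_mean"]   = df["age"].fillna(df["age"].mean())
df["income_med"] = df["income"].fillna(df["income"].median())
```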
21. 21
Outlier Removal
What are anomalies/outliers? The set of data points that are considerably different from the remainder of the data.
Applications: credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection.
Anomaly detection schemes, general steps: 1) Build a profile of the "normal" behavior; the profile can be patterns or summary statistics for the overall population. 2) Use the "normal" profile to detect anomalies. Profiles can be linear, non-linear, or piecewise linear.
22. 22
Outlier Removal
Different techniques for outlier removal in machine learning research:
• Clustering-based: cluster the data into groups of different density. Choose points in small clusters as candidate outliers. Compute the distance between candidate points and non-candidate clusters. If candidate points are far from all other non-candidate points, they are outliers.
• Nearest-neighbor based: compute the distance between every pair of data points. Data points for which there are fewer than p neighboring points within a given distance are outliers.
• Density-based: for each point, compute the density of its local neighborhood. Outliers are points with the largest difference between the average local density of sample p and the density of its nearest neighbors.
• Graphical approaches: boxplot (1-D), scatter plot (2-D), spin plot (3-D), Chernoff faces, ... Limitation: time-consuming.
• Convex hull method: extreme points are assumed to be outliers; use the convex hull method to detect extreme values. But what if the outlier occurs in the middle of the data?
• Statistical approaches: assume a parametric model describing the distribution of the data (e.g., a normal distribution) and apply a statistical test that depends on the data distribution.
• Distance-based methods.
23. 23
Outlier Removal: Notes
• Outlier or rare distribution of data?
• Validation can be quite challenging, in conjunction with the algorithms of the next steps.
• The method is fully unsupervised: outliers come from the data and may lead the algorithm in wrong directions.
• How many outliers? The number of outliers in the dataset matters.
• Working assumption: there are considerably more "normal" observations than "abnormal" observations.
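To make the graphical/statistical flavor concrete, here is a minimal sketch of the 1-D boxplot rule; the helper name and toy data are my own choices, not something prescribed by the slides.

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (the 1-D boxplot rule)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return (x < lo) | (x > hi)

x = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 25.0])  # 25.0 is far from the rest
print(iqr_outliers(x))  # [False False False False False  True]
```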
24. 24
4. Feature Engineering
Feature engineering is the process of using domain knowledge of the data to select features or generate new features that allow machine learning algorithms to work more accurately. Coming up with features is difficult and time-consuming, and requires expert knowledge. Some machine learning projects succeed and some fail; what makes the difference? Easily the most important factor is the features used. Good data preparation and feature engineering are integral to better prediction.
27. 27
Feature Generation
Feature generation is also known as feature construction or feature extraction: the process of creating new features from raw data or from one or multiple existing features.
Three general methodologies:
1. Feature extraction: domain-specific features, statistical features, a predefined class of features.
2. Mapping data to a new space: e.g., Fourier transformation, wavelet transformation, manifold approaches.
3. Feature construction: combining features, data discretization.
28. 28
Feature Selection
Sometimes a lot of features exist. Feature selection is the method of reducing data dimension while doing predictive analysis.
Why:
• Curse of dimensionality
• Easier visualization
• Meaningful distance
• Lower storage need
• Faster computation
Two kinds of feature selection techniques:
• Filter method
• Wrapper method
29. 29
Feature Selection: Filter Method
This method uses a variable ranking technique to select the variables for ordering; the selection of features is independent of the classifiers used. Ranking expresses how useful and important each feature is expected to be for classification. The filter method removes irrelevant and redundant features.
• Statistical test: a chi-square test to test the independence of two events; hypothesis testing.
• Variance threshold: this approach to feature selection removes all features whose variance does not meet some threshold. Generally, it removes all zero-variance features, i.e., features that have the same value in all samples.
• Information gain: information gain measures how much information a feature gives about the class.
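A minimal scikit-learn sketch of these three filter techniques, using the Iris data purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import (SelectKBest, VarianceThreshold,
                                       chi2, mutual_info_classif)

X, y = load_iris(return_X_y=True)

# Variance threshold: drop features below the threshold
# (Iris has no near-constant features, so nothing is removed here)
X_var = VarianceThreshold(threshold=0.1).fit_transform(X)

# Chi-square test: keep the k features most dependent on the class
# (requires non-negative feature values)
X_chi2 = SelectKBest(chi2, k=2).fit_transform(X, y)

# Information gain: mutual information between each feature and the class
print(mutual_info_classif(X, y, random_state=0))
```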
30. 30
Feature Selection: Wrapper Method
In the wrapper approach, the feature subset selection algorithm exists as a wrapper around the induction algorithm. One of the main drawbacks of this technique is the mass of computation required to obtain the feature subset. Some examples of wrapper methods:
• Genetic algorithms: can be used to find a subset of features.
• Recursive feature elimination: RFE is a feature selection method that fits a model and removes the weakest feature until the specified number of features is reached.
• Sequential feature selection: this naive algorithm starts with a null set, adds the single feature that maximizes the objective function in the first step, and from the second step onwards adds each remaining feature individually to the current subset, evaluating each new subset. The process is repeated until the required number of features is added.
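As an illustration of the wrapper idea, here is a short scikit-learn sketch of recursive feature elimination; the choice of logistic regression as the wrapped estimator and the dataset are assumptions for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Wrap an estimator; repeatedly fit it and drop the weakest feature
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the 5 selected features
print(rfe.ranking_)   # 1 = selected; larger = eliminated earlier
```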
32. 32
Overfitting and Underfitting
Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. Underfitting occurs when a model cannot adequately capture the underlying structure of the data.
33. 33
Overfitting and Underfitting
Overfitting can have many causes and is usually a combination of the following:
• Too powerful a model: e.g., you allow polynomials up to degree 100. With polynomials up to degree 5 you would have a much less powerful model which is much less prone to overfitting.
• Not enough data: getting more data can sometimes fix overfitting problems.
• Too many features: your model can identify single data points by single features and build a special case just for a single data point. For example, think of a classification problem and a decision tree. If you have feature vectors (x1, x2, ..., xn) with binary features and n points, and each feature vector has exactly one 1, then the tree can simply use this as an identifier.
(Figure: an overfitted model vs. an appropriately fitted one.)
34. 34
5. Splitting Data
• Training set: a set of examples used for learning, that is, to fit the parameters [i.e., weights] of the classifier.
• Validation set: a set of examples used to avoid overfitting or to tune the hyperparameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network.
• Test set: a set of examples used only to assess the performance [generalization] of a fully specified classifier.
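A minimal sketch of a three-way split using scikit-learn's train_test_split; the 60/20/20 proportions and the Iris data are illustrative choices, not a rule.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve out the held-out test set, then split the rest into
# train/validation. stratify keeps class proportions similar in every subset.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42)
# Result: 60% train, 20% validation, 20% test.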
35. 35
5. Splitting Data
If you plot the training error and validation error against the number of training iterations, you get the most important graph in machine learning.
36. 36
5. Splitting Data: Tips
1. Non-fair sampling
2. Class imbalance problem
3. Unspanned area
Inappropriate sampling may result in wrong classification of data, as the figure shows.
37. 37
7 Steps of Machine Learning
Step 3: Choosing a Model
38. 38
7 Steps of Machine Learning
Step 3: Choosing a Model
Some broad ML tasks:
• Classification: assign a category to each item (e.g., document classification).
• Regression: predict a real value for each item (prediction of stock values, economic variables).
• Clustering: partition data into 'homogeneous' regions (analysis of very large data sets).
• Dimensionality reduction: find a lower-dimensional manifold preserving some properties of the data.
• Semi-supervised learning: use existing side information during learning.
39. 39
Classification
Different techniques:
• Decision Tree Induction
• Random Forest
• Bayesian Classification
• K-Nearest Neighbor
• Neural Networks
• Support Vector Machines
• Hidden Markov Models
• Rule-based Classifiers
• Many more, including ensemble methods
40. 40
Decision Tree
Decision tree learning uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves).
41. 41
Decision Trees: Notes
• Purity measures: entropy, information gain, purity.
• Incremental learning.
• Over-fitting and under-fitting: pruning and growing as solutions.
• Optimal tree length.
• Handling numerical features.
42. 42
SVM: The Story
This line represents the decision boundary: ax + by − c = 0. There are lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness].
• SVM (a.k.a. the large margin classifier) maximizes the margin around the separating hyperplane.
• Solving SVMs is a quadratic programming problem.
43. 43
SVM: Theory
The decision function is fully specified by a subset of the training samples, the support vectors.
44. 44
Non-vectorial inputs
The dot product of data is enough to train SVM,
Dimensionality of data is not a challenge
Parameter C
Trade-off between maximizing margin
and minimizing training error. How do
we determine C?
Difficult to handle missing values, Robust to noise
Non-linear classification using Kernels
Different kernel types, kernel parameters
SVM
Notes
46. 46
KNN: Notes
• The issue of irrelevant features: irrelevant features easily swamp information from relevant attributes, and KNN is sensitive to the distance function.
• High storage complexity: Delaunay triangulation as a solution.
• High computation complexity: partial distance calculation as a solution.
• Weights of neighbors.
• Optimal value of K.
47. 47
Clustering
Clustering is a method of unsupervised learning and is a common technique for statistical data analysis used in many fields. In data science, we can use clustering analysis to gain valuable insights from our data by seeing what groups the data points fall into when we apply a clustering algorithm.
48. 48
Clustering: Different Techniques (1)
• Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors. Typical methods: k-means, k-medoids.
• Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion. Typical methods: DIANA, AGNES, BIRCH.
• Density-based approach: based on connectivity and density functions. Typical methods: DBSCAN.
• Grid-based approach: based on a multiple-level granularity structure. Typical methods: STING.
49. 49
Clustering: Different Techniques (2)
• Model-based: a model is hypothesized for each of the clusters, and the algorithm tries to find the best fit of that model to the data. Typical methods: EM, SOM.
• Fuzzy clustering: based on the theory of fuzzy membership. Typical methods: c-means.
• Constraint-based: clustering that considers user-specified or application-specific constraints. Typical methods: constrained clustering.
• Graph-based clustering: uses graph theory to group data into clusters. Typical methods: graph-cut.
• Also: subspace clustering, spectral clustering, consensus clustering.
50. 50
K-Means
Given k, the k-means algorithm works as follows:
1. Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2. Assign each data point to the closest centroid.
3. Re-compute the centroids using the current cluster memberships.
4. If a convergence criterion is not met, go to 2.
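For illustration, here is a minimal NumPy implementation of these four steps; it is a didactic sketch (plain random seeding, no k-means++), not a production implementation.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly choose k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each point to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Re-compute centroids from the current memberships
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        # 4. Stop when the centroids no longer move (convergence criterion)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```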
51. 51
K-Means
Strengths:
• Simple: easy to understand and to implement.
• Efficient: time complexity is O(tkdn), where n is the number of data points, k is the number of clusters, d is the dimension of the data, and t is the number of iterations. Since both k and t are small, k-means is considered a linear algorithm.
• K-means is the most popular clustering algorithm.
• Note: it terminates at a local optimum if SSE is used; the global optimum is hard to find due to complexity.
52. 52
K-Means
Weaknesses:
• The algorithm is only applicable if the mean is defined. For categorical data, use k-modes, where the centroid is represented by the most frequent values (e.g., if the feature vector is (eye color, graduation level)).
• The user needs to specify k.
• The algorithm is sensitive to initial seeds.
• Different numbers of points in each cluster can lead to large clusters being split 'unnaturally'.
• The k-means algorithm is not suitable for discovering clusters with different shapes.
• The algorithm is sensitive to outliers: outliers are data points that are very far away from other data points; they could be errors in the data recording or some special data points with very different values.
53. 53
K-Means
The algorithm is only applicable if the mean is defined. For categorical data, use k-modes, where the centroid is represented by the most frequent values (e.g., if the feature vector is (eye color, graduation level)).
54. 54
K-Means
The user needs to specify k.
• Use clustering stability techniques to estimate k.
• Use the elbow method to estimate k.
(Figure: the elbow method.)
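A small sketch of the elbow method using scikit-learn's KMeans; the synthetic three-blob data is an assumption for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])

# Sum of squared errors (inertia) for k = 1..9; the "elbow" where the
# curve flattens suggests a good k (here, 3).
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```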
58. 58
K-Means
The algorithm is sensitive to outliers.
• Outliers are data points that are very far away from other data points.
• Outliers could be errors in the data recording or some special data points with very different values.
59. 59
Hierarchical Clustering
Unlike k-means, hierarchical clustering (HC) doesn't require the user to specify the number of clusters beforehand. Instead it returns an output (typically as a dendrogram), from which the user can decide the appropriate number of clusters. HC typically comes in two flavors (essentially, bottom-up or top-down):
• Agglomerative (bottom-up): individual points are iteratively combined until all points belong to the same cluster.
• Divisive (top-down): starts with the entire dataset comprising one cluster that is iteratively split, one point at a time, until each point forms its own cluster; essentially the agglomerative method in reverse.
60. 60
DBSCAN
Two important parameters are required for DBSCAN: epsilon ("eps") and minimum points ("MinPts"). The parameter eps defines the radius of the neighborhood around a point x, called the ε-neighborhood of x. The parameter MinPts is the minimum number of neighbors within the "eps" radius.
• Any point x in the data set with a neighbor count greater than or equal to MinPts is marked as a core point.
• x is a border point if the number of its neighbors is less than MinPts, but it belongs to the ε-neighborhood of some core point z.
• Finally, if a point is neither a core nor a border point, it is called a noise point or an outlier.
DBSCAN starts by defining three terms:
1. Directly density-reachable,
2. Density-reachable,
3. Density-connected.
A density-based cluster is defined as a group of density-connected points.
62. 62
Kernel Method
Motivations:
• Efficient computation of inner products in high dimension.
• Non-linear decision boundary.
• Non-vectorial inputs.
• Flexible selection of more complex features.
Linear separation is impossible in most problems, so use a non-linear mapping from the input space to a high-dimensional feature space, Φ : X → F.
63. 63
Kernel Method
Linear separation is impossible in most problems, so use a non-linear mapping from the input space to a high-dimensional feature space, Φ : X → F.
64. 64
Dimension Reduction
Motivations:
• Some features may be irrelevant.
• We want to visualize high-dimensional data.
• The "intrinsic" dimensionality may be smaller than the number of features.
• Cost of computation and storage.
Techniques: PCA, SVD, LLE, MDS, LEM, ISOMAP, and many others.
65. 65
PCA
Intuition: find the axis that shows the greatest variation, and project all points onto this axis.
Input: data x ∈ ℝ^d.
Output: the d principal components, i.e., d eigenvectors v_1, v_2, ..., v_d (also known as components) and d eigenvalues λ_1, λ_2, ..., λ_d.
66. 66
PCA: Notes
How many PCs should we choose in PCA? Keep the number of components that retains α% (e.g., 98%) of the variance of the data. In PCA the total variance is the sum of all eigenvalues:
λ_1 + λ_2 + ... + λ_d = σ_1² + σ_2² + ... + σ_d², with λ_1 > λ_2 > ... > λ_d.
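One convenient sketch: scikit-learn's PCA accepts a fraction in (0, 1) as n_components and keeps just enough components to retain that share of the variance (98% here, matching the note above); the Iris data is used purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# A float in (0, 1) selects the smallest number of components whose
# eigenvalues explain at least that fraction of the total variance.
pca = PCA(n_components=0.98, svd_solver="full").fit(X)
print(pca.n_components_)                       # components kept
print(pca.explained_variance_ratio_.cumsum())  # cumulative variance retained
```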
67. 67
Multidimensional Scaling (MDS)
Multidimensional Scaling (MDS) is used to go from a proximity matrix (similarity or dissimilarity) between a series of N objects to the coordinates of these same objects in a p-dimensional space. p is generally fixed at 2 or 3 so that the objects may be visualized easily.
68. 68
Metric Learning
Similarity between objects plays an important role in both human cognitive processes and artificial systems for recognition and categorization. How to appropriately measure such similarities for a given task is crucial to the performance of many machine learning, pattern recognition, and data mining methods. Metric learning is a set of techniques to automatically learn similarity and distance functions from data.
69. 69
Metric Learning
A popular method is Mahalanobis distance metric learning; the Mahalanobis distance metric is in fact a transformation of the data into a new space:
Mahalanobis(x, y) = (x − y)ᵀ M⁻¹ (x − y)
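A minimal NumPy sketch of this formula. Using the sample covariance as M is an illustrative choice (it gives the classical squared Mahalanobis distance); metric learning would instead learn M from data.

```python
import numpy as np

def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M^{-1} (x - y)."""
    diff = x - y
    return float(diff @ np.linalg.inv(M) @ diff)

# Illustrative choice: M = sample covariance of the data
X = np.random.default_rng(0).normal(size=(200, 3))
M = np.cov(X, rowvar=False)
print(mahalanobis_sq(X[0], X[1], M))
```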
70. 70
Semi-supervised Learning
Semi-supervised learning is a class of machine learning tasks and techniques that also make use of unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data. Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy.
71. 71
Active Learning
The concept of active learning is simple: it involves a feedback loop between human and machine that eventually tunes the machine model. The model begins with a set of labeled data that it uses to judge incoming data. Human contributors then label a select sample of the machine's output, and their work is plowed back into the model. Humans continue to label data until the model achieves sufficient accuracy.
72. 72
7 Steps of Machine Learning
Step 4: Training the Model
Train the constructed model on the training set, using the validation set for tuning.
73. 73
7 Steps of Machine Learning
Step 4: Training the Model
74. 74
7 Steps of Machine Learning
Step 5: Model Evaluation
Evaluating the machine learning algorithm is an essential part of any project.
75. 75
7 Steps of Machine Learning
Step 5: Model Evaluation
Evaluating the machine learning algorithm is an essential part of any project. Different measures exist to evaluate trained models, such as accuracy, TPR, FPR, F-measure, precision, recall, ...
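A short scikit-learn sketch computing several of these measures on invented toy labels:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))    # fraction of correct predictions
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TPR: TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(confusion_matrix(y_true, y_pred))  # rows: true class, cols: predicted
```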
78. 78
ROC Plot
A ROC plot shows:
• The relationship between TPR and FPR. For example, an increase in TPR results in an increase in FPR.
• Test accuracy: the closer the graph is to the top and left-hand borders, the more accurate the test. Likewise, the closer the graph is to the diagonal, the less accurate the test. A perfect test would go straight from zero up to the top-left corner and then straight across the horizontal.
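A minimal sketch of computing the points of a ROC curve (and the area under it) with scikit-learn; the scores are invented toy values.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true   = [0, 0, 0, 0, 1, 1, 1, 1]
y_scores = [0.1, 0.3, 0.35, 0.8, 0.4, 0.6, 0.7, 0.9]  # positive-class scores

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(list(zip(fpr, tpr)))              # points tracing the ROC curve
print(roc_auc_score(y_true, y_scores))  # area under the curve (1.0 = perfect)
```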
82. 82
7 Steps of Machine Learning
Step 6: Hyperparameter Tuning
The process of searching for the ideal model architecture is referred to as hyperparameter tuning.
83. 83
7 Steps of Machine Learning
Step 6: Hyperparameter Tuning
When creating a machine learning model, you'll be presented with design choices as to how to define your model architecture. Parameters which define the model architecture are referred to as hyperparameters, and thus the process of searching for the ideal model architecture is referred to as hyperparameter tuning. These hyperparameters might address model design questions such as:
• What degree of polynomial features should I use for my non-linear model?
• What should be the maximum depth allowed for my decision tree?
• What should be the minimum number of samples required at a leaf node in my decision tree?
• How many trees should I include in my random forest?
• How many neurons should I have in my neural network layer?
• How many layers should I have in my neural network?
• What should I set my learning rate to for gradient descent?
84. 84
Hyperparameter Tuning
I want to be absolutely clear: hyperparameters are not model parameters, and they cannot be directly trained from the data. Model parameters are learned during training when we optimize a loss function using something like gradient descent. In general, the tuning process includes:
• Define a model
• Define the range of possible values for all hyperparameters
• Define a method for sampling hyperparameter values
• Define evaluative criteria to judge the model
• Define a cross-validation method
85. 85
Tuning: An Example
Example: how to search for the optimal structure of a random forest classifier. Random forests are an ensemble model comprised of a collection of decision trees; when building such a model, two important hyperparameters to consider are:
1. How many estimators (i.e., decision trees) should I use?
2. What should be the maximum allowable depth for each decision tree?
If we had access to a plot of model performance across these two hyperparameters, choosing the ideal combination would be trivial. However, calculating such a plot at fine granularity would be prohibitively expensive.
86. 86
Tuning Methods: Grid Search
Grid search is arguably the most basic hyperparameter tuning method. With this technique, we simply build a model for each possible combination of all of the hyperparameter values provided, evaluate each model, and select the architecture which produces the best results.
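A minimal grid search sketch with scikit-learn's GridSearchCV; the random forest, dataset, and particular grid values are illustrative assumptions tied to the example above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# One model per combination: 3 x 3 = 9 candidates, each cross-validated 5-fold.
param_grid = {
    "n_estimators": [50, 100, 200],  # how many trees
    "max_depth":    [3, 5, None],    # maximum depth per tree
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```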
87. 87
Tuning Methods: Random Search
Random search differs from grid search in that we no longer provide a discrete set of values to explore for each hyperparameter; rather, we provide a statistical distribution for each hyperparameter from which values may be randomly sampled.
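The same example with RandomizedSearchCV, drawing each candidate from a distribution instead of a fixed grid; again an illustrative sketch, with the distributions chosen arbitrarily.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Distributions, not fixed grids: each of the 20 candidates samples from them.
param_dist = {
    "n_estimators": randint(20, 300),
    "max_depth":    randint(2, 12),
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=20, cv=5,
                            random_state=0, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```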
88. 88
7 Steps of Machine Learning
Step 7: Prediction
Use the trained model to predict results in this step.
89. 89
7 Steps of Machine Learning
Step 7: Prediction
Predict unseen data with the trained model.