Slides for a talk delivered at the Python Pune meetup on 31st Jan 2014.
Categorical data is a problem many data scientists face. This talk is about how to tame it.
1. Categorical Data Analysis in Python
By Jaidev Deshpande
Data Scientist, DataCulture Analytics
twitter.com/jaidevd
2. Problem: Who's likely to attend the next meetup?
● Who comes often?
● Men / Women?
● Where do you live? How far from the venue?
● Proficiency with Python (Beginner / Intermediate / Advanced)?
● Area of interest?
3. Something like..

Attendees    Attendance (%)  Gender  Pincode  Proficiency in Python  Interest            ...
attendee_1   80              M       411013   Intermediate           Web                 ...
attendee_2   30              F       411040   Advanced               Test / Automation   ...
attendee_3   55              M       411001   Beginner               Scientific          ...
...          ...             ...     ...      ...                    ...                 ...

● 1. Numerical features – continuous and quantitative
● 2. Categorical features – discrete and qualitative
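The numerical/categorical split maps directly onto pandas dtypes. A minimal sketch using made-up rows that mirror the example table (column names here are illustrative, not from the talk):

```python
import pandas as pd

# Illustrative attendee data, mirroring the example table above.
df = pd.DataFrame({
    "attendance_pct": [80, 30, 55],                # numerical: continuous
    "gender": ["M", "F", "M"],                     # categorical: nominal
    "pincode": ["411013", "411040", "411001"],     # categorical, despite the digits
    "proficiency": ["Intermediate", "Advanced", "Beginner"],
    "interest": ["Web", "Test / Automation", "Scientific"],
})

# Tell pandas which columns are categorical; proficiency has a natural order.
df["gender"] = df["gender"].astype("category")
df["proficiency"] = pd.Categorical(
    df["proficiency"],
    categories=["Beginner", "Intermediate", "Advanced"],
    ordered=True,
)
```

Note that pincode is stored as strings on purpose: averaging or subtracting pincodes is meaningless, so treating them as numbers invites exactly the mistakes the next slides warn about.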
4. Common Numerical Operations on Data
● Obviously – add, subtract, multiply, divide
● Statistical moments
● Operations in vector spaces
  – Distance measures
  – Slicing
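All of these numerical operations are one-liners in NumPy. A minimal sketch (the array values are made up for illustration):

```python
import numpy as np

attendance = np.array([80.0, 30.0, 55.0])  # attendance % from the example table

# Arithmetic and statistical moments
mean = attendance.mean()
var = attendance.var()

# Vector-space operations: Euclidean distance between two feature vectors
a = np.array([80.0, 1.0])
b = np.array([30.0, 0.0])
dist = np.linalg.norm(a - b)

# Slicing
top_two = attendance[:2]
```

None of these have an obvious counterpart for strings like "M" or "Intermediate", which is exactly the gap the next slide makes vivid.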
5. Comparison of Operations

Numerical Data:
● Add, subtract, multiply, divide
● Mean, variance, standard deviation
● Vector spaces – the very idea of 'measuring'

Categorical Data (Strings, etc.):
● What's the product of two strings?
● The average pincode of two areas?
● &%%#&$$*&!!!!
● At least get some numbers!
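One standard way to "at least get some numbers" out of a categorical column is one-hot encoding, where each category becomes a 0/1 indicator column. A minimal sketch with pandas (data and column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"proficiency": ["Beginner", "Advanced", "Beginner"]})

# Each category becomes an indicator column, which arithmetic,
# distances and most estimators can then operate on.
dummies = pd.get_dummies(df["proficiency"], prefix="prof")
print(dummies.columns.tolist())  # ['prof_Advanced', 'prof_Beginner']
```

This buys back arithmetic, but it ignores any structure between categories; correspondence analysis, coming up next, tries to recover that structure.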
10. Correspondence Analysis
● How are proficiencies related w.r.t. gender? (Row profiles)
● How are genders related w.r.t. proficiency? (Column profiles)
  – Cosine similarity
  – Correlation / Covariance
● How are they interrelated?
  – Weighted chi-squared distance
● Can the dimensionality be reduced?
  – Singular value decomposition / PCA
  – sklearn.decomposition.PCA
  – sklearn.decomposition.TruncatedSVD
11. Sample Problem
● Consider the proficiency and interest features from the original problem
● Fake data with 100 observations
● Contingency matrix:

              automation  scientific  web
advanced      8           1           7
beginner      13          9           35
intermediate  7           1           19
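A contingency matrix like the one above falls straight out of pd.crosstab. A sketch with a handful of made-up observations (the talk's actual dataset had 100 fake rows):

```python
import pandas as pd

# A few made-up (proficiency, interest) observations
obs = pd.DataFrame({
    "proficiency": ["advanced", "beginner", "beginner", "intermediate", "advanced"],
    "interest":    ["web",      "web",      "scientific", "web",        "automation"],
})

# Cross-tabulate: rows are proficiencies, columns are interests,
# cells count how many attendees fall in each combination.
contingency = pd.crosstab(obs["proficiency"], obs["interest"])
print(contingency)
```

The resulting table is exactly the input that the correspondence-analysis steps on the previous slide operate on.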