Historical Data on Avocado prices and Sales volume in US Markets
2. Overview:
• Data Resource
• Problem Definition
• Visualization
• Prediction – Tools to support data analysis
• Presenting findings
• Solving the problem framed in the beginning
3. Dataset: Avocado
Historical Data on Avocado prices and Sales volume in US Markets
• The retail sales data used for this analysis are based on scanner data collected and provided by the Hass Avocado Board.
• The data include total weekly retail sales in value and volume for fresh Hass avocados (aggregated across all relevant PLU codes) in 45 distinct local market areas and eight regions (53 cross-sectional observations in total) for the years 2015–2018.
• These data represent an aggregation of retail outlets that includes the following channels: grocery, mass merchandisers, club stores, drugstores, dollar outlets, and military commissaries.
• An average price, or unit value, is computed in each market and each week by dividing sales value by the number of fresh Hass avocados sold.
4. Columns:
• Date – the date of the observation
• AveragePrice – the average price of a single avocado
• Total Volume – total number of avocados sold
• 4046 – total number of avocados with PLU 4046 sold
• 4225 – total number of avocados with PLU 4225 sold
• 4770 – total number of avocados with PLU 4770 sold
• Total Bags
• Small Bags
• Large Bags
• XLarge Bags
• Type – conventional or organic
• Year – the year of the observation
• Region – the city or region of the observation
6. Our Problem and Roadmap, and WHERE we are!
Problem: Whether to import avocados for 2020 or not?
[Roadmap diagram. Descriptive Analytics: Data Visualization, Outliers, Text Mining, Clustering. Predictive Analytics: Regression. Prescriptive Analytics: Utility Theory, Optimization, Decision Analysis.]
7. Snapshot of our dataset after cleaning:
Shape: (18249, 14)
Null values: none
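A minimal pandas sketch of this cleaning step. The file name avocado.csv and the leftover unnamed index column are assumptions; only the final shape and the absence of nulls come from the slide.

```python
import pandas as pd

# Load the Hass Avocado Board retail scan data (file name assumed).
df = pd.read_csv("avocado.csv")

# Drop a leftover unnamed index column, if the export carried one over (assumed).
df = df.drop(columns=[c for c in df.columns if c.startswith("Unnamed")])

# Remove rows with missing values, then confirm the cleaned shape.
df = df.dropna()
print(df.shape)                 # slide reports (18249, 14)
print(df.isnull().sum().sum())  # slide reports no null values, so 0
```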
8. Snapshot of our dataset after mining the Date column:
• Converted the Date column to a datetime type and split it into Month and Day.
• Converted Type (organic or conventional) to a dummy variable.
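A sketch of the date mining and dummy encoding, continuing from the cleaned df above; column names follow slide 4, and the 1 = organic / 0 = conventional mapping is an assumption, since the slides don't show it.

```python
import pandas as pd

# Convert Date to a datetime type and split it into Month and Day.
df["Date"] = pd.to_datetime(df["Date"])
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day

# Dummy-encode Type: 1 = organic, 0 = conventional (mapping assumed).
df["Type"] = (df["Type"] == "organic").astype(int)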
9. Our Problem and Roadmap, and WHERE we are!
Problem: Whether to import avocados for 2020 or not?
[Roadmap diagram repeated to mark progress; see slide 6.]
10. Which type of avocado is more in demand (conventional/non-organic vs. organic)?
• Organic vs. conventional: the main difference between organic and conventional food products is the chemicals involved during production and processing. Interest in organic food products has been rising steadily in recent years as new health "superfruits" emerge.
11. Which type of avocado is more in demand (conventional/non-organic vs. organic, aggregated by Total Volume)?
[Pie chart of Total Volume by Type]
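One way the pie chart could be produced with pandas/matplotlib; mapping the labels back from the assumed 0/1 dummy encoding is also an assumption.

```python
import matplotlib.pyplot as plt

# Total Volume by type, as in the slide's pie chart. After the dummy
# encoding above, Type is 0/1, so labels are mapped back for display.
vol = df.groupby("Type")["Total Volume"].sum()
vol.index = vol.index.map({0: "conventional", 1: "organic"})
vol.plot.pie(autopct="%1.1f%%", ylabel="")
plt.title("Share of Total Volume: Conventional vs. Organic")
plt.show()
```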
12. Now, let's look at the average price distribution.
In which range does the average price lie?
[Distribution plot of AveragePrice]
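A seaborn sketch of the distribution plot; the KDE overlay is an assumption about how the slide's plot was drawn.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of AveragePrice across all observations.
sns.histplot(df["AveragePrice"], kde=True)
plt.title("Distribution of Average Avocado Price")
plt.xlabel("AveragePrice (USD)")
plt.show()
```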
13. How is the average price distributed over the months for conventional and organic types?
[Line plot of monthly AveragePrice by Type]
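One way to build the line plot, using the Month column created on slide 8; the column renaming again assumes the 0/1 dummy mapping.

```python
import matplotlib.pyplot as plt

# Mean AveragePrice per month, split by type.
monthly = df.groupby(["Month", "Type"])["AveragePrice"].mean().unstack("Type")
monthly.columns = ["conventional", "organic"]  # assumed 0/1 mapping
monthly.plot(marker="o")
plt.title("Average Price by Month: Conventional vs. Organic")
plt.xlabel("Month")
plt.ylabel("AveragePrice (USD)")
plt.show()
```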
14. Now let's see the average price distribution by region.
What are the top 5 regions where the average price is highest?
[Bar chart of mean AveragePrice by region]
15. What are the top 5 regions where the average price is highest?
These are the regions where the price is highest:
HartfordSpringfield
SanFrancisco
NewYork
Philadelphia
Sacramento
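A sketch that reproduces the top-5 ranking behind this bar chart; the column name Region follows the list on slide 4.

```python
import matplotlib.pyplot as plt

# Top 5 regions by mean AveragePrice.
top_price = df.groupby("Region")["AveragePrice"].mean().nlargest(5)
print(top_price)
top_price.plot.bar()
plt.title("Top 5 Regions by Average Price")
plt.ylabel("Mean AveragePrice (USD)")
plt.show()
```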
16. What are the top 5 regions where average consumption is highest?
[Bar chart of mean Total Volume by region]
17. What are the top 5 regions where average consumption is highest?
These are the regions where consumption is highest (note that these are the larger aggregate regions from slide 3 rather than individual city markets):
West
California
SouthCentral
Northeast
Southeast
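The same ranking for consumption, using mean Total Volume as the consumption measure (an assumption; the slides don't state the aggregation used).

```python
# Top 5 regions by mean Total Volume. The winners come out as the larger
# aggregate regions (West, California, ...) rather than single-city markets.
top_volume = df.groupby("Region")["Total Volume"].mean().nlargest(5)
print(top_volume)
```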
18. How are the dataset features correlated with each other?
[Correlation heatmap]
As the heatmap shows, the features are largely uncorrelated with the AveragePrice column; instead, most of them are strongly correlated with each other. This is a concern, because weak predictors will make it harder to build a good model. Let's try and see.
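A seaborn sketch of the heatmap over the numeric columns; the exact styling is assumed.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the numeric columns, drawn as a heatmap.
corr = df.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature Correlations")
plt.show()
```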
19. Our Problem and Roadmap, and WHERE we are!
Problem: Whether to import avocados for 2020 or not?
[Roadmap diagram repeated to mark progress; see slide 6.]
20. Model selection/predictions
To study fluctuations in the US avocado market, several machine learning techniques were evaluated to estimate the average price of a unit (in dollars) of this agricultural product. For this purpose, we used the dataset described earlier and three algorithms from scikit-learn:
21. Linear regression: a technique for determining the relationship of a variable y with one or more other variables x1, ..., xk. In a machine learning setting, it searches over functions that model the relationship between the variables and selects the one that most closely fits the given data.
Decision tree: builds regression or classification models in the form of a tree structure. It breaks the dataset down into smaller and smaller subsets while an associated decision tree is incrementally developed; the final result is a tree with decision nodes and leaf nodes.
Random forest: a meta-estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control overfitting.
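A hedged training sketch for the three scikit-learn models named above. The predictor set, split ratio, and hyperparameters are assumptions; the slides only name the algorithms.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Non-numeric columns are dropped for simplicity (an assumption).
X = df.drop(columns=["AveragePrice", "Date", "Region"])
y = df["AveragePrice"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100,
                                                   random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
```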
22. Comparison of tools: performance metrics

Metric                      LinearRegression    DecisionTreeRegressor   RandomForestRegressor
                            (baseline to beat)
R-squared                   0.43                0.94                    0.95
MAE (mean absolute error)   0.23                0.13                    0.10
MSE (mean squared error)    0.09                0.04                    0.025
RMSE (square root of MSE)   0.30                0.21                    0.15
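The table's four metrics can be computed on a held-out set like this (a sketch; the actual evaluation split is not shown in the slides):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Evaluate each fitted model on the held-out test set.
for name, model in models.items():
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    print(f"{name}: R2 = {r2_score(y_test, pred):.2f}, "
          f"MAE = {mean_absolute_error(y_test, pred):.2f}, "
          f"MSE = {mse:.3f}, RMSE = {np.sqrt(mse):.2f}")
```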
24. Model selection/predictions
The random forest's RMSE is lower than that of the two previous models, so the RandomForestRegressor is the best model in this case.
[Prediction plots: Linear Regression, Decision Tree Regression, RandomForest Regression]
25. Model selection/predictions
Residual = observed value - predicted value: e = y - ŷ. (For an ordinary least-squares fit with an intercept, both the sum and the mean of the residuals are exactly zero.)
Here our residuals look approximately normally distributed, which is a good sign: it suggests the model was an appropriate choice for the data.
[Residual distribution plot: RandomForest Regressor]
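A sketch of the residual check for the random forest, under the same assumed split as the training sketch above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Residuals e = y - y_hat for the random forest on the test set.
rf = models["RandomForestRegressor"]
residuals = y_test - rf.predict(X_test)
sns.histplot(residuals, kde=True)
plt.title("Residual Distribution: RandomForest Regressor")
plt.xlabel("Residual (observed - predicted)")
plt.show()
```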
27. Our Problem and Roadmap, and WHERE we are!
Problem: Whether to import avocados for 2020 or not?
[Roadmap diagram repeated to mark progress; see slide 6.]
28. Problem Map:
• Retailer chain in Dallas and Houston with around 1,000 stores
• Goal: increase profit
• Options: procure from the local market, or import directly from Mexico
• Constraint: no/very low risk
• Prefers importing from Mexico due to an existing partnership
• May incur an import duty of 10%, with probability 5%
• Must meet consumption demand
30. Case 1: Procure from the local wholesale market. Case 2: Direct import from Mexico.
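A toy sketch of the expected-cost logic behind comparing the two cases. Only the 10% duty and its 5% probability come from the slides; every cost figure below is a hypothetical placeholder, not the deck's numbers.

```python
# Expected per-unit cost of direct import, accounting for the chance of duty.
p_duty, duty_rate = 0.05, 0.10    # from slide 28
cost_local = 1.00                 # hypothetical $/unit, local wholesale market
cost_import = 0.80                # hypothetical $/unit, direct import from Mexico

expected_import = (1 - p_duty) * cost_import \
                + p_duty * cost_import * (1 + duty_rate)
print(f"Local: ${cost_local:.3f}/unit   "
      f"Import (expected): ${expected_import:.3f}/unit")
```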
31. Options that can be executed:
• No import – continue the existing model
• Direct import from Mexico – the preferred model, with the potential to increase revenue by $8,246,156.40
• Test direct import from Mexico for one week; based on the test results, decide the next step