Thera Bank - Loan Purchase Modeling
This case concerns a bank with a growing customer base. The majority of these customers are liability customers (depositors) with deposits of varying size. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and, in the process, earn more through interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors). A campaign the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio with a minimal budget. The department wants to build a model that will help identify the potential customers who have a higher probability of purchasing the loan, increasing the success ratio while reducing the cost of the campaign. The dataset has data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer's response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.
Our job is to build the best model which can classify the right customers who have a higher probability of purchasing the loan. We are expected to do the following:
EDA of the data available. Showcase the results using appropriate graphs
Apply appropriate clustering on the data and interpret the output.
Build appropriate models (CART & Random Forest) on the training data and apply them to the test data. Interpret all the model outputs and make the necessary modifications wherever applicable (such as pruning).
Check the performance of all the models you have built (on both train and test data). Use all the model performance measures you have learned so far, and share your remarks on which model performs the best.
This project requires us to understand which mode of transport employees prefer to commute to their office. The available data includes information about employees' mode of transport as well as personal and professional details such as age, salary, and work experience. We need to predict whether or not an employee will use a car as their mode of transport, and identify which variables are significant predictors of this decision.
The following is expected in this assessment.
EDA
Perform an EDA on the data
Illustrate the insights based on EDA
Check for multicollinearity: plot a graph (e.g., a correlation plot) showing the multicollinearity and treat it.
Data Preparation
Prepare the data for analysis (SMOTE)
Modeling
Create multiple models and explore how each model performs using appropriate model performance metrics:
KNN
Naive Bayes
Logistic Regression
Apply both bagging and boosting modeling procedures to create two models and compare their accuracy with the best model from the step above. An illustrative sketch of the data preparation and modeling steps follows this list.
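An illustrative sketch of the preparation and modeling steps above, in R. The data frame name (commute), the target column (Transport, a factor with levels "Car"/"Other"), and the use of the DMwR package for SMOTE are assumptions, not details from the source:

library(DMwR)    # SMOTE (package now archived on CRAN; themis is a modern alternative)
library(class)   # knn
library(e1071)   # naiveBayes

set.seed(42)
idx   <- sample(nrow(commute), floor(0.7 * nrow(commute)))
train <- commute[idx, ]
test  <- commute[-idx, ]

# Balance the minority class ("Car") before modeling
train_bal <- SMOTE(Transport ~ ., data = train, perc.over = 200, perc.under = 200)

# Logistic regression
glm_fit  <- glm(Transport ~ ., data = train_bal, family = binomial)
glm_prob <- predict(glm_fit, newdata = test, type = "response")

# Naive Bayes
nb_pred <- predict(naiveBayes(Transport ~ ., data = train_bal), test)

# KNN on scaled numeric predictors (test set scaled with training parameters)
num  <- names(train_bal)[sapply(train_bal, is.numeric)]
tr_x <- scale(train_bal[, num])
te_x <- scale(test[, num], center = attr(tr_x, "scaled:center"),
              scale = attr(tr_x, "scaled:scale"))
knn_pred <- knn(tr_x, te_x, cl = train_bal$Transport, k = 5)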
Customer churn is a burning problem for telecom companies. In this project, we simulate one such case of customer churn, working on a dataset of postpaid customers with a contract. The data has information about customer usage behavior, contract details, and payment details, and it also indicates which customers canceled their service. Based on this past data, we need to build a model that can predict whether a customer will cancel their service in the future or not.
The objective of the project is to use the dataset 'Factor-Hair-Revised.csv' to build an optimum regression model to predict satisfaction.
Perform exploratory data analysis on the dataset, showcasing appropriate charts and graphs. Check for outliers and missing values.
Check for evidence of multicollinearity.
Perform simple linear regression for the dependent variable with every independent variable.
Perform PCA/factor analysis by extracting 4 factors. Interpret the output and name the factors.
Perform multiple linear regression with customer satisfaction as the dependent variable and the four factors as independent variables.
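An illustrative sketch of this workflow in R, assuming the CSV contains a Satisfaction column plus numeric predictor columns only (the column name is an assumption):

hair <- read.csv("Factor-Hair-Revised.csv")
X    <- subset(hair, select = -Satisfaction)   # numeric predictors

round(cor(X), 2)                     # evidence of multicollinearity

fa <- factanal(X, factors = 4, scores = "regression")
print(fa$loadings, cutoff = 0.4)     # inspect loadings to name the factors

fit <- lm(hair$Satisfaction ~ fa$scores)   # satisfaction on the four factor scores
summary(fit)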
We are required to create an India credit risk (default) model using the data provided, and to validate it, using the logistic regression framework to develop the credit default model.
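A minimal sketch of that framework in R; the file names and the Default column are placeholders, not details from the source:

train <- read.csv("raw-data.csv")        # assumed training file name
fit   <- glm(Default ~ ., data = train, family = binomial)
summary(fit)                             # inspect significant predictors

val  <- read.csv("validation_data.csv")  # assumed validation file name
prob <- predict(fit, newdata = val, type = "response")
table(predicted = prob > 0.5, actual = val$Default)   # assumed 0.5 cutoff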
The purpose of this project is to do a visual analysis of the given data and arrive at useful insights for the principal company. There is no clear set of instructions in such open-ended problems; the consultant is expected to do a lot of exploration first and formulate the problems themselves.
Target Audience
The target audience for this report is the top management and the CXOs of the insurance company whose data we are analyzing.
It could also provide insights to people from the same industry, as well as from industries that work in close coordination with insurance companies.
Tool Used
The tool used for this analysis is Tableau Desktop (Public edition) running on macOS.
Prediction of house price using multiple regression (vinovk)
- Constructed a mathematical model using Multiple Regression to estimate the Selling price of the house based on a set of predictor variables.
- SAS was used for Variable profiling, data transformations, data preparation, regression modeling, fitting data, model diagnostics, and outlier detection.
Data Science: Prediction analysis for houses in Ames, Iowa (Ashish Menkudale)
For the vastly diversified realty market, with property prices increasing exponentially, it becomes essential to study the factors which directly or indirectly affect a customer's decision to buy a house, and to predict the market trend. In general, for any purchase, a potential customer makes the decision based on value for money.
The problem statement was taken from the website Kaggle. We chose this specific problem because it provided us an opportunity to build a prediction model for real-life problems like the prediction of prices for houses in Ames, Iowa.
Automation of IT Ticket Automation using NLP and Deep Learning (Pranov Mishra)
Overview of the problem solved: IT leverages the Incident Management process to ensure business operations are never impacted. The assignment of incidents to appropriate IT groups is still a manual process in many IT organizations. Manual assignment of incidents is time-consuming and requires human effort. There may be mistakes due to human error, and resources are consumed ineffectively because of misrouting. Manual assignment increases response and resolution times, which results in deteriorating user satisfaction and poor customer service.
Solution: Multiple deep learning sequential models with GloVe embeddings were attempted and the results compared to arrive at the best model. The two best models are highlighted below through their results.
1. A bi-directional LSTM on the data set gave an accuracy of 71% and a precision of 71%.
2. The accuracy and precision were further improved to 73% and 76% respectively when an ensemble of 7 Bi-LSTMs was built.
An NLP-based deep learning model was built to solve the above problem; the code is at the link below.
https://github.com/Pranov1984/Application-of-NLP-in-Automated-Classification-of-ticket-routing
Data Trend Analysis by Assigning Polynomial Function For Given Data Set (IJCERT)
This paper explains the method of creating a polynomial equation from a given data set, which can serve as a representation of the data itself and can be queried with aggregations to find results. The approach uses the least-squares technique to construct a model of the data fitted to a polynomial. Differential calculus is then used on this equation to generate aggregated results that represent the original data set.
Data Science - Part III - EDA & Model Selection (Derek Kane)
This lecture introduces the concept of EDA, understanding, and working with data for machine learning and predictive analysis. The lecture is designed for anyone who wants to understand how to work with data and does not get into the mathematics. We will discuss how to utilize summary statistics, diagnostic plots, data transformations, variable selection techniques including principal component analysis, and finally get into the concept of model selection.
This project aims to determine the housing prices of California properties for new sellers and also for buyers to estimate the profitability of the deal using various regression models.
Below are the details of the models implemented and their performance scores:
Linear Regression: RMSE- 68321.7051304
Decision Tree Regressor: RMSE- 70269.5738668
Random Forest Regressor: RMSE- 52909.1080535
Support Vector Regressor: RMSE- 110914.791356
Fine Tuning the Hyperparameters for Random Forest Regressor: RMSE- 49261.2835608
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI... (ijcseit)
Scrutiny for prediction is the hallmark of modern statistics, where accuracy matters most. Matching algorithms with the right statistical implementation provides better outcomes in terms of accurate prediction on data sets. Prolific use of algorithms leads toward the simplification of mathematical models, which require less manual calculation. Prediction is the essence of data science and machine learning applications, imparting control over situations. Implementing any of these approaches requires proper feature extraction, which helps in proper model building and assists precision. This paper is predominantly based on different statistical analyses, including correlation significance and proper categorical data distribution using feature engineering techniques, that unravel the accuracy of different machine learning models.
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019 (Rebecca Bilbro)
Machine learning is ultimately a search for the best combination of features, algorithm, and hyperparameters that result in the best performing model. Oftentimes, this leads us to stay in our algorithmic comfort zones, or to resort to automated processes such as grid searches and random walks. Whether we stick to what we know or try many combinations, we are sometimes left wondering if we have actually succeeded.
By enhancing model selection with visual diagnostics, data scientists can inject human guidance to steer the search process. Visualizing feature transformations, algorithmic behavior, cross-validation methods, and model performance allows us a peek into the high-dimensional realm in which our models operate. As we continue to tune our models, trying to minimize both bias and variance, these glimpses allow us to be more strategic in our choices. The result is more effective modeling, speedier results, and greater understanding of underlying processes.
Visualization is an integral part of the data science workflow, but visual diagnostics are directly tied to machine learning transformers and models. The Yellowbrick library extends the scikit-learn API providing a Visualizer object, an estimator that learns from data and produces a visualization as a result. In this tutorial, we will explore feature visualizers, visualizers for classification, clustering, and regression, as well as model analysis visualizers. We'll work through several examples and show how visual diagnostics steer model selection, making machine learning more informed, and more effective.
Use of Linear Regression in Machine Learning for Ranking (ijsrd.com)
Machine learning is a growing field within AI. In this paper we discuss the use of a supervised learning algorithm, regression learning, for ranking. Regression learning is used as a prediction model: the values of the dependent variable are predicted by the regression model based on the values of the independent variables. If, after experience E, a program improves its performance P, the program is said to be doing regression learning. We chose linear regression for ranking and discuss approaches for building a rank regression model by selecting the best ranking parameters from domain knowledge and confirming their selection by performing regression analysis during model building. An example is explained and its results analyzed, and we discuss how the combined regression and ranking approach is better for enhancing the use of linear regression for ranking. We conclude and suggest future work on ranking and regression.
Machine learning session 6 (decision trees, random forest) (Abhimanyu Dwivedi)
Concepts include decision trees with examples; measures used for splitting in decision trees, such as the Gini index, entropy, and information gain; pros and cons; and validation. Also covers the basics of random forests with examples and uses.
Data Science - Part V - Decision Trees & Random Forests (Derek Kane)
This lecture provides an overview of decision tree machine learning algorithms and random forest ensemble techniques. The practical example includes diagnosing Type II diabetes and evaluating customer churn in the telecommunication industry.
Decision Trees for Classification: A Machine Learning Algorithm (Palin Analytics)
Decision Trees in Machine Learning - The decision tree method is a commonly used data mining method for establishing classification systems based on several covariates or for developing prediction algorithms for a target variable.
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES (Vikash Kumar)
Image classification using KNN, random forest, and SVM algorithms on glaucoma datasets, explaining the accuracy, sensitivity, and specificity of each algorithm.
The presentation covers the application of a machine learning approach to classification and regression for modelling the expected loss in P&C insurance.
An Efficient Clustering Method for Aggregation on Data Fragments (IJMER)
Clustering is an important step in the process of data analysis, with applications in numerous fields. Clustering ensembles have emerged as a powerful technique for combining different clustering results to obtain a quality cluster. Existing clustering aggregation algorithms are applied directly to a large number of data points and are inefficient when the number of data points is large. This project defines an efficient approach for clustering aggregation based on data fragments, where a data fragment is any subset of the data. To increase the efficiency of the proposed approach, clustering aggregation is performed directly on data fragments; under a comparison measure and normalized mutual information measures, enhanced clustering aggregation algorithms (agglomerative, furthest, and local search) are described that achieve minimal computational complexity while increasing accuracy.
AI professionals use top machine learning algorithms to automate models that analyze larger and more complex data than was possible with older machine learning algorithms.
Ijaems apr-2016-23 Study of Pruning Techniques to Predict Efficient Business ... (Infogain Publication)
The shopping mall domain is a dynamic and unpredictable environment. Traditional techniques such as fundamental and technical analysis can provide investors with some tools for managing their shops and predicting their business growth. However, these techniques cannot discover all the possible relations affecting business growth, so a different approach providing a deeper kind of analysis is needed. Data mining can be used extensively in shopping malls to help increase business growth, so there is a need to find a suitable algorithm for this kind of environment. We therefore study a few methods of pruning decision trees. Finally, we make use of the cost-based pruning method to obtain an objective evaluation of the tendency to over-prune or under-prune observed in each method.
This presentation discusses the following topics:
Types of Problems Solved Using Artificial Intelligence Algorithms
Problem categories
Classification Algorithms
Naive Bayes
Example: A person playing golf
Decision Tree
Random Forest
Logistic Regression
Support Vector Machine
K Nearest Neighbors
The owner of the restaurant wants us to use this data to come up with a set of recommendations that can help his café chain increase its revenues. He has not been able to launch a loyalty program and cannot provide a data set with customer-level information, but he can provide a point-of-sale (POS) data set for one of his chains.
The case needs methods and tools for displaying and analyzing univariate time series forecasts, including exponential smoothing via state space models and automatic ARIMA modelling. Explore the gas dataset (Australian monthly gas production).
Read the data as a time series object in R. Plot the data.
Understand which components of the time series are present in this dataset.
Check the periodicity of the dataset.
Partition the dataset so that the data from 1994 onwards forms the test set.
Inspect the series visually as well as conduct an ADF test.
Write down the null and alternative hypotheses for the stationarity test. De-seasonalise the series if seasonality is present.
Develop an initial forecast for the next 20 periods and check it using the various accuracy metrics; after finalising the model, develop a final forecast for the 12 time periods. Use both manual ARIMA and auto.arima (show and explain all the steps; a sketch follows this list).
Report the accuracy of the model.
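A minimal sketch of this workflow in R using the forecast and tseries packages; the gas dataset ships with the forecast package, while the exact split point and model settings below are illustrative assumptions:

library(forecast)   # gas data, auto.arima, forecast, accuracy
library(tseries)    # adf.test

plot(gas)                        # Australian monthly gas production
frequency(gas)                   # 12, i.e. monthly periodicity
plot(decompose(gas))             # trend / seasonal / remainder components

gas_train <- window(gas, end = c(1993, 12))    # hold out 1994 onwards
gas_test  <- window(gas, start = c(1994, 1))

adf.test(gas_train)              # H0: unit root (non-stationary) vs H1: stationary

fit <- auto.arima(gas_train, seasonal = TRUE)
fc  <- forecast(fit, h = 20)     # initial 20-period forecast
accuracy(fc, gas_test)           # compare against the hold-out data
plot(fc)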
Problem 1
Cold Storage started its operations in Jan 2016. They are in the business of storing pasteurized fresh whole or skimmed milk, sweet cream, and flavored milk drinks. To ensure that there is no change of texture or body appearance and no separation of fats, the optimal temperature to be maintained is between 2 and 4 deg C.
In the first year of business they outsourced the plant maintenance work to a professional company with stiff penalty clauses. It was agreed that if it was statistically proven that the probability of the temperature going outside the 2-4 deg C range during the one-year contract was above 2.5% and less than 5%, then the penalty would be 10% of the AMC (annual maintenance contract) fee. If it exceeded 5%, then the penalty would be 25% of the AMC fee.
Problem 2
In Mar 2018, Cold Storage started getting complaints from their clients, who had in turn been getting complaints from end consumers that the dairy products were going sour and often smelled. On getting these complaints, the supervisor pulled out the temperature data for the last 35 days. As a safety measure, the supervisor had been vigilant in maintaining the temperature below 3.9 deg C.
Assume 3.9 deg C as the upper limit of the acceptable temperature range. At alpha = 0.1, do you feel that there is a need for corrective action in the Cold Storage plant, or is the problem on the procurement side, from where Cold Storage is getting the dairy products?
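A minimal sketch of the Problem 2 hypothesis test in R, assuming the 35 daily temperatures are in a vector named temps (the name is a placeholder): we test H0: mu <= 3.9 against H1: mu > 3.9 at alpha = 0.1.

# One-sided, one-sample t-test of the mean daily temperature
t.test(temps, mu = 3.9, alternative = "greater", conf.level = 0.90)
# If the p-value < 0.1, reject H0: corrective action is needed at the plant.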
Adjusting primitives for graph : SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank, typically operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the PageRank algorithm usually fall into two categories. One tries to reduce the work per iteration, and the other tries to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps reduce duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Bank loan purchase modeling
1. Project – 3.
DATA MINING
Thera Bank - Loan Purchase Modeling
By,
Saleesh Satheeshchandran, PGP BABI 2019-20 (G6)
2. Objective of the project - Thera Bank - Loan Purchase Modeling
The objective of the project is to build the best model which can classify the right customers who have a higher probability of purchasing the loan.
The dataset has data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.
Assumptions
There are no particular assumptions.
Exploratory Data Analysis
1. The dimensions of the data show 5000 observations.
2. Some of the categorical variables are converted to factors.
3. Boxplots of the following variables were examined:
- Age – no outliers
- Experience – no outliers
- Income – numerous outliers
- Zipcode – 1 outlier
- Family Members – no outliers
7. Clustering
The following clustering models were applied:
1. K-Means
2. Hierarchical
The k-means algorithm is executed in the following steps:
• Partition the objects into k non-empty subsets.
• Identify the cluster centroids (mean points) of the current partition.
• Compute the distance from each point to every centroid and assign each point to the cluster whose centroid is nearest.
• After re-assigning the points, recompute the centroid of each new cluster.
• Repeat until the assignments no longer change.
• Three clusters are obtained by the k-means clustering.
• These are not very distinct clusters.
• In this project the k-means clustering has not been very effective.
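A minimal sketch of the k-means step in R; the data frame name (bank) and the choice of columns to scale are assumptions for illustration:

# Scale the numeric customer variables so no single one dominates the distance
scaled <- scale(bank[, c("Age", "Experience", "Income", "CCAvg", "Mortgage")])
set.seed(123)
km <- kmeans(scaled, centers = 3, nstart = 25)
table(km$cluster)                                     # cluster sizes
aggregate(as.data.frame(scaled),
          by = list(cluster = km$cluster), FUN = mean)  # cluster profiles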
Hierarchical clustering builds a cluster hierarchy that is commonly displayed as a tree diagram called a dendrogram. It begins with each object in a separate cluster. At each step, the two clusters that are most similar are joined into a single new cluster. Once fused, objects are never separated.
The horizontal axis of the dendrogram represents the clusters. The vertical scale represents the distance or dissimilarity. Each joining (fusion) of two clusters is represented on the diagram by the splitting of a vertical line into two vertical lines. The vertical position of the split, shown by a short bar, gives the distance (dissimilarity) between the two clusters.
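A minimal sketch of the hierarchical step in R, reusing the scaled matrix above; Euclidean distance and Ward linkage are assumptions for illustration:

d  <- dist(scaled, method = "euclidean")
hc <- hclust(d, method = "ward.D2")
plot(hc, labels = FALSE)       # dendrogram
groups <- cutree(hc, k = 3)    # cut the tree into three clusters
table(groups)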
10. CART Model
The decision tree algorithm works by repeatedly partitioning the data into multiple sub-spaces, so that the outcomes in each final sub-space are as homogeneous as possible. This approach is technically called recursive partitioning.
The resulting tree is composed of decision nodes, branches, and leaf nodes. The tree is drawn top-down, so the root is at the top and the leaves indicating the outcomes are at the bottom.
The tree stops growing when one of the following three criteria is met:
all leaf nodes are pure, with a single class;
no splitting method can usefully partition the training observations remaining in a leaf node;
the number of observations in a leaf node reaches the pre-specified minimum.
However, a full tree including all predictors tends to be very complex and difficult to interpret when we have a large data set with multiple predictors. Additionally, it is easy to see that a fully grown tree will overfit the training data and might lead to poor test set performance.
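A minimal sketch of growing the CART model in R with rpart; the data frame and target names, and the control values for the full tree, are assumptions for illustration:

library(rpart)
library(rpart.plot)

# Grow a full tree (cp = 0 disables cost-complexity stopping, so it overfits)
cart <- rpart(Personal.Loan ~ ., data = train, method = "class",
              control = rpart.control(minsplit = 20, cp = 0, xval = 10))
rpart.plot(cart)   # the unpruned tree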
12. PRUNING
Unpruned Tree
Briefly, our goal here is to see if a smaller subtree can give us results comparable to the fully grown tree. If yes, we should go for the simpler tree, because it reduces the likelihood of overfitting.
One possible robust strategy for pruning the tree (or stopping it from growing) consists of avoiding splitting a partition if the split does not significantly improve the overall quality of the model.
Pruned Tree
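A minimal sketch of cost-complexity pruning with rpart, continuing the cart object above; selecting cp by minimum cross-validated error is one common rule, not necessarily the one used in the original deck:

printcp(cart)   # cross-validated error for each subtree size
best_cp <- cart$cptable[which.min(cart$cptable[, "xerror"]), "CP"]
pruned  <- prune(cart, cp = best_cp)
rpart.plot(pruned)   # the pruned tree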
14. Random Forest
Random forest is a tree-based algorithm which involves building several decision trees, then combining their output to improve the generalization ability of the model. The method of combining trees is known as an ensemble method: ensembling combines weak learners (individual trees) to produce a strong learner.
A random forest model is created with default parameters, and the model is then fine-tuned by changing 'mtry'. We can tune the random forest model by changing the number of trees (ntree) and the number of variables randomly sampled at each split (mtry).
ntree: the number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times.
mtry: the number of variables randomly sampled as candidates at each split. The default values are different for classification (sqrt(p), where p is the number of variables in x) and regression (p/3).
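A minimal sketch of the random forest fit and mtry tuning in R; ntree = 500 and the tuneRF settings are assumptions for illustration:

library(randomForest)
set.seed(123)
rf <- randomForest(Personal.Loan ~ ., data = train,
                   ntree = 500, importance = TRUE)   # default mtry = sqrt(p)

# Tune mtry against the out-of-bag error estimate
x <- train[, setdiff(names(train), "Personal.Loan")]
tuned <- tuneRF(x, train$Personal.Loan, ntreeTry = 500,
                stepFactor = 1.5, improve = 0.01, trace = FALSE)
varImpPlot(rf)   # which variables drive the predictions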
15. Testing
After building the random forest on the training data, the next step is to apply the same model to the test data and check its validity, as sketched below.
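A minimal sketch of scoring the hold-out set (object names continue the assumptions above):

rf_pred <- predict(rf, newdata = test, type = "class")     # hard class labels
rf_prob <- predict(rf, newdata = test, type = "prob")[, 2] # P(loan accepted)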
16. Model Performance
The following performance measures are used to test the robustness of the model in the real world:
1. Confusion Matrix
2. ROC Curve, Gini Coefficient
3. Concordance-Discordance Ratio
Confusion Matrix:
A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized with count values, broken down by each class; a sketch follows.
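A minimal sketch using the caret package, assuming Personal.Loan is a factor and the positive class is labeled "1" (the label is an assumption):

library(caret)
confusionMatrix(rf_pred, test$Personal.Loan, positive = "1")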
ROC Curve
A ROC curve plots the performance of a binary classifier under various threshold
settings; this is measured by true positive rate and false positive rate. If your
classifier predicts “true” more often, it will have more true positives (good) but also
more false positives (bad). If your classifier is more conservative, predicting “true”
less often, it will have fewer false positives but fewer true positives as well. The ROC
curve is a graphical representation of this tradeoff.
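A minimal sketch using the ROCR package on the probabilities computed earlier; the Gini coefficient is derived from the AUC as Gini = 2 * AUC - 1:

library(ROCR)
pred <- prediction(rf_prob, test$Personal.Loan)
perf <- performance(pred, "tpr", "fpr")
plot(perf)                                      # the ROC curve
auc  <- performance(pred, "auc")@y.values[[1]]
gini <- 2 * auc - 1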
Concordance-Discordance ratio
Concordant and discordant pairs are used to describe the relationship between pairs of observations. To calculate the concordant and discordant pairs, the data are treated as ordinal, so ordinal data should be appropriate for your application. The numbers of concordant and discordant pairs are also used in calculating Kendall's tau, which measures the association between two ordinal variables.
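A minimal sketch of counting concordant and discordant pairs from the predicted probabilities; the brute-force pairing below is O(n1 * n2), fine for illustration but slow on large data:

# Compare every (event, non-event) pair of predicted probabilities
concordance <- function(prob, actual) {
  ones  <- prob[actual == 1]
  zeros <- prob[actual == 0]
  pairs <- expand.grid(one = ones, zero = zeros)
  c(concordant = mean(pairs$one > pairs$zero),
    discordant = mean(pairs$one < pairs$zero),
    tied       = mean(pairs$one == pairs$zero))
}
concordance(rf_prob, as.numeric(as.character(test$Personal.Loan)))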