AI Final report 1.pdf

Project Report on
PREDICTION OF BEST LOCATION FOR SOLAR FARM IN ORDER TO MEET ENERGY DEMAND AND COMPANY PROFIT

Submitted by: SHRUTEJ JARIWALA, PARSHWA BHAVSAR, VIRAL SUREJA, VISHNUVARDHAN CHOWDARY
Subject: AI BASICS
Prof.: JEAN-MICHEL TAVERNE
1. Problem Summary

The task is to find locations for a solar farm that can meet the customer's energy demand in two scenarios:
1) 2 million kWh/a
2) 3 million kWh/a

Limitations: Building space in the regions is limited, so at most the following area may be built in each region:
- North-West: 3,000 m2
- North-East: 3,000 m2
- South-West: 2,000 m2
- South-East: 2,000 m2

For each square meter of solar plant, Smart Energy LLC has to pay 100 € for the material plus the cost of the land. The budget is 2 million € for scenario 1; to fulfil scenario 2, a budget of 3 million € can be invested.

Objective

We have to find the best solar farm locations, which meet the consumer's energy demand while keeping the company profitable. To solve this problem we apply machine learning techniques.

1.1. What is Machine Learning?

Machine learning is a subfield of computer science that is concerned with building algorithms which, to be useful, rely on a collection of examples of some phenomenon. These examples can come from nature, be handcrafted by humans or generated by another algorithm. Machine learning can also be defined as the process of solving a practical problem by 1) gathering a dataset, 2) algorithmically building a statistical model based on that dataset. That statistical model is assumed to be used somehow to solve the practical problem. To save keystrokes, I use the terms “learning” and “machine learning” interchangeably. (Burkov, 2020)

Types of learning can be supervised, semi-supervised, unsupervised and reinforcement.

1.2. Supervised Learning

In supervised learning, the dataset is the collection of labeled examples $\{(x_i, y_i)\}_{i=1}^{N}$. Each element $x_i$ among $N$ is called a feature vector. A feature vector is a vector in which each dimension $j = 1, \ldots, D$ contains a value that describes the example somehow. That value is called a feature and is denoted as $x^{(j)}$.
For instance, if each example $x$ in our collection represents a person, then the first feature, $x^{(1)}$, could contain height in cm, the second feature, $x^{(2)}$, could contain weight in kg, $x^{(3)}$ could contain gender, and so on. For all examples in the dataset, the feature at position $j$ in the feature vector always contains the same kind of information. It means that if $x_i^{(2)}$ contains weight in kg in some example $x_i$, then $x_k^{(2)}$ will also contain weight in kg in every example $x_k$, $k = 1, \ldots, N$. The label $y_i$ can be either an element belonging to a finite set of classes $\{1, 2, \ldots, C\}$, or a real number, or a more complex structure, like a vector, a matrix, a tree, or a graph. Unless otherwise stated, $y_i$ is either one of a finite set of classes or a real number. You can see a class as a category to which an example belongs. For instance, if your examples are email messages and your problem is spam detection, then you have two classes {spam, not spam}.

The goal of a supervised learning algorithm is to use the dataset to produce a model that takes a feature vector $x$ as input and outputs information that allows deducing the label for this feature vector. For instance, the model created using the dataset of people could take as input a feature vector describing a person and output a probability that the person has cancer.

1.3. Unsupervised Learning

In unsupervised learning, the dataset is a collection of unlabelled examples $\{x_i\}_{i=1}^{N}$. Again, $x$ is a feature vector, and the goal of an unsupervised learning algorithm is to create a model that takes a feature vector $x$ as input and either transforms it into another vector or into a value that can be used to solve a practical problem. For example, in clustering, the model returns the id of the cluster for each feature vector in the dataset. In dimensionality reduction, the output of the model is a feature vector that has fewer features than the input $x$; in outlier detection, the output is a real number that indicates how $x$ is different from a “typical” example in the dataset.
1.4. Reinforcement Learning

Reinforcement learning is a subfield of machine learning where the machine “lives” in an environment and is capable of perceiving the state of that environment as a vector of features. The machine can execute actions in every state. Different actions bring different rewards and could also move the machine to another state of the environment.

2. Datasets:
- Installed solar plants
- New locations data set

We have two datasets. The first, “Installed Solar Plants”, contains data on 20 already installed power plants and gives each plant's sunshine hours per year, solar panel area in m^2 and generated energy in kWh/a. The second contains 56 months of data, from January 2018 to August 2022, for 59 unique new locations, and gives sunshine hours, price per m^2 and average wind speed.
3. Requirements:
- Python
- IDE: Jupyter Notebook
- Libraries: pandas, NumPy, matplotlib, seaborn, scikit-learn

3.1. Libraries:
- import pandas as pd: pandas is a popular Python data analysis toolkit. It offers a wide range of utilities, from parsing multiple file formats to converting an entire data table into a NumPy array, which makes pandas a trusted ally in data science and machine learning. Like NumPy, pandas deals primarily with data in 1-D and 2-D arrays; however, pandas handles the two differently.
- import matplotlib.pyplot as plt: matplotlib.pyplot is stateful: it keeps track of the current figure and plotting area, and its plotting functions are directed to the current axes.
- import seaborn as sns: Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and integrates closely with pandas data structures. Seaborn helps you explore and understand your data. Its plotting functions operate on data frames and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots. Its dataset-oriented, declarative API lets you focus on what the different elements of your plots mean, rather than on the details of how to draw them.
- import numpy as np: NumPy provides a large set of numeric datatypes that you can use to construct arrays. NumPy tries to guess a datatype when you create an array, but functions that construct arrays usually also include an optional argument to explicitly specify the datatype.

4. Initial thoughts and Observation
- Both datasets have one column in common: sunshine hours.
- Every region has a limited buildable area in m^2. If we can find out how much energy one m^2 generates in each region, we know how much energy can be generated there.
- We also want to find out whether there is any correlation between sunshine hours and energy per m^2.

5. Solution process:

6.1 Importing data: Import the data using the pandas library.
We divide the generated energy by the solar panel area in m^2 to obtain the energy generated per square meter.

6.2 Data exploration: installed plant data

Using the seaborn library, we plot a pair plot of the data:

df['energy_per_m2'] = df['Generated energy kWh/a']/df['Size Solar Panel m2']
sns.pairplot(df)
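The per-square-meter normalisation can be reproduced on a small sample. The sketch below uses the five rows printed by `df.head()` in the notebook export later in this report; building the frame inline instead of reading the Excel file is an assumption made to keep the example self-contained.

```python
import pandas as pd

# First five rows of the installed-plants dataset, copied from the df.head() output.
df = pd.DataFrame({
    "Sunshine Hours per year": [1418, 1474, 1335, 1224, 1320],
    "Size Solar Panel m2": [794, 1726, 5776, 6494, 2085],
    "Generated energy kWh/a": [233616, 525410, 1612292, 1681651, 576313],
})

# Energy yield per square meter of panel, in kWh/a per m2.
df["energy_per_m2"] = df["Generated energy kWh/a"] / df["Size Solar Panel m2"]
print(df[["Sunshine Hours per year", "energy_per_m2"]])
```

Dividing by panel area removes plant size from the comparison, so plants of very different sizes can be compared on yield alone.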
We try to find out how strongly the sunshine hours per year column is correlated with the energy per m^2 column.

Observation:
1) Energy per m^2 is directly proportional to sunshine hours per year.
2) Solar panel m^2 is directly proportional to sunshine hours per year.

# Let's see how strongly they are correlated.
# We compute the correlation and plot it as a heatmap.
sns.heatmap(df.corr(),annot = True)
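The strength of the relationship can also be read off as a single number. The following is a minimal sketch that computes the Pearson correlation on the five sample rows shown by `df.head()` in the notebook export; the variable name `sample` is illustrative, and the value for the full dataset may differ.

```python
import pandas as pd

# Five sample rows from the installed-plants dataset (see df.head() in the notebook export).
sample = pd.DataFrame({
    "Sunshine Hours per year": [1418, 1474, 1335, 1224, 1320],
    "Size Solar Panel m2": [794, 1726, 5776, 6494, 2085],
    "Generated energy kWh/a": [233616, 525410, 1612292, 1681651, 576313],
})
sample["energy_per_m2"] = sample["Generated energy kWh/a"] / sample["Size Solar Panel m2"]

# Pearson correlation between the feature and the target.
corr = sample["Sunshine Hours per year"].corr(sample["energy_per_m2"])
print(corr)

# Full correlation matrix; this is what sns.heatmap(df.corr(), annot=True) visualises.
print(sample.corr())
```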
Observation:
1) Energy per m^2 depends almost entirely on sunshine hours per year.
2) We can predict energy per m^2 from sunshine hours per year, so we can apply a regression model.

6.3 Classification vs. Regression

Classification is a problem of automatically assigning a label to an unlabelled example. Spam detection is a famous example of classification. In machine learning, the classification problem is solved by a classification learning algorithm that takes a collection of labelled examples as inputs and produces a model that can take an unlabelled example as input and either directly output a label or output a number that can be used by the analyst to deduce the label. An example of such a number is a probability. In a classification problem, a label is a member of a finite set of classes. If the size of the set of classes is two (“sick”/“healthy”, “spam”/“not spam”), we talk about binary classification (also called binomial in some sources). Multiclass classification (also called multinomial) is a classification problem with three or more classes. While some learning algorithms naturally allow for more than two classes, others are by nature binary classification algorithms. There are strategies allowing to turn a binary classification learning algorithm into a multiclass one.

Regression is a problem of predicting a real-valued label (often called a target) given an unlabelled example. Estimating house price valuation based on house features, such as area, the number of bedrooms, location and so on is a famous example of regression. The regression problem is solved by a regression learning algorithm that takes a collection of labelled examples as inputs and produces a model that can take an unlabelled example as input and output a target. (Burkov, 2020)

6.4 Linear Regression

Linear regression is a popular regression learning algorithm that learns a model which is a linear combination of features of the input example.

6.4.1 Problem Statement

We have a collection of labeled examples $\{(x_i, y_i)\}_{i=1}^{N}$, where $N$ is the size of the collection, $x_i$ is the $D$-dimensional feature vector of example $i = 1, \ldots, N$, $y_i$ is a real-valued target and every feature $x_i^{(j)}$, $j = 1, \ldots, D$, is also a real number. We want to build a model $f_{w,b}(x)$ as a linear combination of features of example $x$:

$$f_{w,b}(x) = wx + b, \qquad (1)$$

where $w$ is a $D$-dimensional vector of parameters and $b$ is a real number. The notation $f_{w,b}$ means that the model $f$ is parametrized by two values: $w$ and $b$. We will use the model to predict the unknown $y$ for a given $x$ like this: $y \leftarrow f_{w,b}(x)$. Two models parametrized by two different pairs $(w, b)$ will likely produce two different predictions when applied to the same example. We want to find the optimal values $(w^*, b^*)$. Obviously, the optimal values of parameters define the model that makes the most accurate predictions. You could have noticed that the form of our linear model in eq. 1 is very similar to the form of the SVM model.
The only difference is the missing sign operator. The two models are indeed similar. However, the hyperplane in the SVM plays the role of the decision boundary: it’s used to separate two groups of examples from one another. As such, it has to be as far from each group as possible. On the other hand, the hyperplane in linear regression is chosen to be as close to all training examples as possible. You can see why this latter requirement is essential by looking at the illustration in Figure 1. It displays the regression line (in red) for one-dimensional examples (blue dots). We can use this line to predict the value of the target ynew for a new unlabelled input example xnew. If our examples are D-dimensional feature vectors (for D > 1), the only difference with the one-dimensional case is that the regression model is not a line but a plane or a hyperplane (for D > 2 ).
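The linear model described in the problem statement can be written in a few lines. This is only an illustrative sketch: the function name `predict` and the parameter values are hypothetical and are not taken from the report.

```python
import numpy as np

def predict(w: np.ndarray, b: float, x: np.ndarray) -> float:
    """Linear model f_{w,b}(x) = w . x + b for a D-dimensional feature vector x."""
    return float(np.dot(w, x) + b)

# Hypothetical parameters and input, for illustration only (D = 2).
w = np.array([0.5, -1.0])
b = 2.0
x_new = np.array([4.0, 1.0])

y_new = predict(w, b, x_new)  # 0.5 * 4 - 1.0 * 1 + 2.0
print(y_new)
```

For D = 1 the same function describes a line; for larger D it describes a plane or hyperplane.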
Now you see why it’s essential to have the requirement that the regression hyperplane lies as close to the training examples as possible: if the red line in Figure 1 was far from the blue dots, the prediction $y_{new}$ would have fewer chances to be correct.

6.4.2 Solution

To get this latter requirement satisfied, the optimization procedure which we use to find the optimal values for $w^*$ and $b^*$ tries to minimize the following expression:

$$\frac{1}{N} \sum_{i=1}^{N} \left(f_{w,b}(x_i) - y_i\right)^2.$$

In mathematics, the expression we minimize or maximize is called an objective function, or, simply, an objective. The expression $(f_{w,b}(x_i) - y_i)^2$ in the above objective is called the loss function. It’s a measure of penalty for misclassification of example $i$. This particular choice of the loss function is called squared error loss. All model-based learning algorithms have a loss function and what we do to find the best model is we try to minimize the objective known as the cost function. In linear regression, the cost function is given by the average loss, also called the empirical risk. The average loss, or empirical risk, for a model, is the average of all penalties obtained by applying the model to the training data.

Why is the loss in linear regression a quadratic function? Why couldn’t we get the absolute value of the difference between the true target $y_i$ and the predicted value $f(x_i)$ and use that as a penalty? We could. Moreover, we also could use a cube instead of a square. We decided to use the linear combination of features to predict the target. However, we could use a square or some other polynomial to combine the values of features. We could also use some other loss function that makes sense: the absolute difference between $f(x_i)$ and $y_i$ makes sense, the cube of the difference too; the binary loss (1 when $f(x_i)$ and $y_i$ are different and 0 when they are the same) also makes sense, right? If we made different decisions about the form of the model, the form of the loss function, and about the choice of the algorithm that minimizes the average loss to find the best values of parameters, we would end up inventing a different machine learning algorithm. Sounds easy, doesn’t it? However, do not rush to invent a new learning algorithm. (Burkov, 2020)

Implementing Linear Regression:

We took the “Sunshine Hours” column as the feature and energy per m^2 as the label. We then split the data in a 60/40 ratio to create training and test data, and imported the LinearRegression model from the scikit-learn library. After fitting the model on the training data, we test it on the remaining 40% of the data: with the predict function we predict the energy for the test data and compare the predicted values with the test labels. That is how we measure the error of our model. The algorithm also gives us the regression coefficient and the intercept.
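The implementation steps described above can be sketched end to end. The example below is an approximation under stated assumptions: it uses only the five sample rows printed by `df.head()` in the notebook export, and `random_state=0` is an assumed value because the original argument is cut off in the export, so the numbers it prints will not match the notebook exactly.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Five sample rows from the installed-plants dataset (not the full data).
df = pd.DataFrame({
    "Sunshine Hours per year": [1418, 1474, 1335, 1224, 1320],
    "Size Solar Panel m2": [794, 1726, 5776, 6494, 2085],
    "Generated energy kWh/a": [233616, 525410, 1612292, 1681651, 576313],
})
df["energy_per_m2"] = df["Generated energy kWh/a"] / df["Size Solar Panel m2"]

X = df[["Sunshine Hours per year"]]  # feature matrix (2-D)
y = df["energy_per_m2"]              # target

# 60/40 split as in the report; random_state=0 is an assumed value.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

lm = LinearRegression().fit(X_train, y_train)
predictions = lm.predict(X_test)

mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print("slope:", lm.coef_[0], "intercept:", lm.intercept_)
print("MAE:", mae, "R^2:", r2)
```

With so few rows the split leaves only two test examples, so the scores here are indicative at best; the report's own figures come from the full dataset.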
Observation: In the scatter plot the points lie on a straight line, which shows little deviation and high accuracy. We then check the mean absolute error and the R^2 score.

6.5. Analysis of the second dataset: location dataset
Observation:
- In the first row we can see the scatter plot of longitude vs. latitude, which gives the location of the properties. We can also observe four clusters of regions.
- When we group the data by Date and count the values, we find that there are 59 unique locations and that 56 months of data are given for each location.
- The data is monthly, so we have to convert it to a yearly format.

Steps:
1) First, we create a dataset of the 59 locations.
2) We assign each location to a region.
3) We find the average sunshine hours, average price and average wind speed of each location on a yearly basis.
4) Then we predict the energy for each region.

Finding average sunshine hours: For each location, we add the sunshine hours of all 56 months and divide by 56, which gives the average sunshine hours per month. We then multiply by 12 to get the average value for one year.

Predicting regions: We have only two features, ['Longitude', 'Latitude'], and no labels; that is why we choose unsupervised learning to group the locations into regions. We use the k-means clustering algorithm.

9.2 Clustering

Clustering is a problem of learning to assign a label to examples by leveraging an unlabelled dataset. Because the dataset is completely unlabelled, deciding on whether the learned model is optimal is much more complicated than in supervised learning.
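The monthly-to-yearly conversion described under “Finding average sunshine hours” can be sketched with a pandas group-by. The data below is hypothetical (two locations with three months each, for brevity); only the column names follow the notebook export, including its spelling `Sunhine Hours`.

```python
import pandas as pd

# Hypothetical monthly records for two locations (3 months each, for brevity).
monthly = pd.DataFrame({
    "Longitude": [0.22, 0.22, 0.22, 0.72, 0.72, 0.72],
    "Latitude": [0.27, 0.27, 0.27, 0.25, 0.25, 0.25],
    "Sunhine Hours": [40.0, 100.0, 160.0, 50.0, 110.0, 170.0],
})

# Mean monthly sunshine per location, scaled to a 12-month year.
yearly = (
    monthly.groupby(["Longitude", "Latitude"])["Sunhine Hours"]
    .mean()
    .mul(12)
    .rename("sunshine_hours_per_year")
    .reset_index()
)
print(yearly)
```

Note that `mean()` skips missing values, so a location with a missing month is averaged over the months that are present rather than over all 56.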
There is a variety of clustering algorithms, and, unfortunately, it’s hard to tell which one is better in quality for your dataset. Usually, the performance of each algorithm depends on the unknown properties of the probability distribution the dataset was drawn from. In this Chapter, I outline the most useful and widely used clustering algorithms. (Burkov, 2020)

9.2.1 K-Means

The k-means clustering algorithm works as follows. First, you choose $k$, the number of clusters. Then you randomly put $k$ feature vectors, called centroids, to the feature space. We then compute the distance from each example $x$ to each centroid $c$ using some metric, like the Euclidean distance. Then we assign the closest centroid to each example (like if we labelled each example with a centroid id as the label). For each centroid, we calculate the average feature vector of the examples labelled with it. These average feature vectors become the new locations of the centroids. We recompute the distance from each example to each centroid, modify the assignment and repeat the procedure until the assignments don’t change after the centroid locations were recomputed. The model is the list of assignments of centroid IDs to the examples. The initial position of centroids influences the final positions, so two runs of k-means can result in two different models. Some variants of k-means compute the initial positions of centroids based on some properties of the dataset.

One run of the k-means algorithm is illustrated in Figure 2. The circles in Figure 2 are two-dimensional feature vectors; the squares are moving centroids. Different background colours represent regions in which all points belong to the same cluster. The value of $k$, the number of clusters, is a hyperparameter that has to be tuned by the data analyst. There are some techniques for selecting $k$. None of them is proven optimal. Most of those techniques require the analyst to make an “educated guess” by looking at some metrics or by examining cluster assignments visually.

9.2.3 Determining the Number of Clusters

The most important question is how many clusters does your dataset have? When the feature vectors are one-, two- or three-dimensional, you can look at the data and see “clouds” of points in the feature space. Each cloud is a potential cluster. However, for $D$-dimensional data, with $D > 3$, looking at the data is problematic. One way of determining the reasonable number of clusters is based on the concept of prediction strength. The idea is to split the data into training and test set, similarly to how we do in supervised learning. Once you have the training and test sets, $S_{tr}$ of size $N_{tr}$ and $S_{te}$ of size $N_{te}$ respectively, you fix $k$, the number of clusters, and run a clustering algorithm $C$ on sets $S_{tr}$ and $S_{te}$ and obtain the clustering results $C(S_{tr}, k)$ and $C(S_{te}, k)$. Let $A$ be the clustering $C(S_{tr}, k)$ built using the training set. The clusters in $A$ can be seen as regions. If an example falls within one of those regions, then that example belongs to some specific cluster. For example, if we apply the k-means algorithm to some dataset, it results in a partition of the feature space into $k$ polygonal regions, as we saw in Figure 2.
Define the $N_{te} \times N_{te}$ co-membership matrix $D[A, S_{te}]$ as follows: $D[A, S_{te}]^{(i,i')} = 1$ if and only if examples $x_i$ and $x_{i'}$ from the test set belong to the same cluster according to the clustering $A$. Otherwise $D[A, S_{te}]^{(i,i')} = 0$. Let’s take a break and see what we have here. We have built, using the training set of examples, a clustering $A$ that has $k$ clusters. Then we have built the co-membership matrix that indicates whether two examples from the test set belong to the same cluster in $A$. Intuitively, if the quantity $k$ is the reasonable number of clusters, then two examples that belong to the same cluster in clustering $C(S_{te}, k)$ will most likely belong to the same cluster in clustering $C(S_{tr}, k)$. On the other hand, if $k$ is not reasonable (too high or too low), then training data-based and test data-based clustering will likely be less consistent.

Experiments suggest that a reasonable number of clusters is the largest $k$ such that $ps(k)$, the prediction strength for $k$ clusters, is above 0.8. You can see in Figure 5 examples of predictive strength for different values of $k$ for two-, three- and four-cluster data. For non-deterministic clustering algorithms, such as k-means, which can generate different clusterings depending on the initial positions of centroids, it is recommended to do multiple runs of the clustering algorithm for the same $k$ and compute the average prediction strength $\bar{ps}(k)$ over multiple runs.

Another effective method to estimate the number of clusters is the gap statistic method. Other, less automatic methods, which some analysts still use, include the elbow method and the average silhouette method.

Implementing k-means clustering.
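A minimal sketch of k-means on location coordinates, using scikit-learn's `KMeans`. The coordinates are hypothetical and chosen to form four well-separated groups; `k = 4` follows the four regions named in the problem summary, while `n_init=10` and `random_state=0` are assumed settings, not values taken from the report.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (longitude, latitude) pairs forming four well-separated groups.
points = np.array([
    [0.20, 0.25], [0.22, 0.27], [0.25, 0.22],  # group A
    [0.20, 0.75], [0.23, 0.78], [0.26, 0.72],  # group B
    [0.75, 0.25], [0.78, 0.22], [0.72, 0.27],  # group C
    [0.75, 0.75], [0.78, 0.72], [0.72, 0.78],  # group D
])

# Several initialisations (n_init) reduce the dependence on the starting centroids.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels)                   # one cluster id per location
print(kmeans.cluster_centers_)  # final centroid positions
```

The cluster ids themselves are arbitrary; mapping them to region names such as North-West still requires looking at the centroid positions.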
We observe that the Region column is categorical; we can convert it into binary data by feature engineering.

Feature Engineering

When a product manager tells you “We need to be able to predict whether a particular customer will stay with us. Here are the logs of customers’ interactions with our product for five years.” you cannot just grab the data, load it into a library and get a prediction. You need to build a dataset first.
Remember from the first chapter that the dataset is the collection of labeled examples $\{(x_i, y_i)\}_{i=1}^{N}$. Each element $x_i$ among $N$ is called a feature vector. A feature vector is a vector in which each dimension $j = 1, \ldots, D$ contains a value that describes the example somehow. That value is called a feature and is denoted as $x^{(j)}$.

The problem of transforming raw data into a dataset is called feature engineering. For most practical problems, feature engineering is a labour-intensive process that demands from the data analyst a lot of creativity and, preferably, domain knowledge. For example, to transform the logs of user interaction with a computer system, one could create features that contain information about the user and various statistics extracted from the logs. For each user, one feature would contain the price of the subscription; other features would contain the frequency of connections per day, week and year. Another feature would contain the average session duration in seconds or the average response time for one request, and so on. Everything measurable can be used as a feature.

The role of the data analyst is to create informative features: those would allow the learning algorithm to build a model that predicts well labels of the data used for training. Highly informative features are also called features with high predictive power. For example, the average duration of a user’s session has high predictive power for the problem of predicting whether the user will keep using the application in the future. We say that a model has a low bias when it predicts the training data well. That is, the model makes few mistakes when we use it to predict labels of the examples used to build the model.

5.1.1 One-Hot Encoding

Some learning algorithms only work with numerical feature vectors. When some feature in your dataset is categorical, like “colors” or “days of the week,” you can transform such a categorical feature into several binary ones.

If your example has a categorical feature “colors” and this feature has three possible values: “red,” “yellow,” “green,” you can transform this feature into a vector of three numerical values:

red = [1, 0, 0]
yellow = [0, 1, 0]
green = [0, 0, 1]

By doing so, you increase the dimensionality of your feature vectors. You should not transform red into 1, yellow into 2, and green into 3 to avoid increasing the dimensionality because that would imply that there’s an order among the values in this category and this specific order is important for the decision making. If the order of a feature’s values is not important, using ordered numbers as values is likely to confuse the learning algorithm, because the algorithm will try to find a regularity where there’s no one, which may potentially lead to overfitting. (Burkov, 2020)
Implementing one-hot encoding:

We can also assign the region with a lambda function using if/else syntax.
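Both steps, assigning a region with a lambda and one-hot encoding the result, can be sketched as follows. The split of the map at 0.5 on both axes is a hypothetical rule used only for illustration, since the report does not state its actual boundaries; the helper name `region` is also illustrative.

```python
import pandas as pd

# Hypothetical locations, one per quadrant.
locations = pd.DataFrame({
    "Longitude": [0.22, 0.72, 0.18, 0.80],
    "Latitude": [0.27, 0.25, 0.75, 0.70],
})

def region(row):
    # Hypothetical rule: split the map at 0.5 on both axes.
    ns = "North" if row["Latitude"] >= 0.5 else "South"
    ew = "East" if row["Longitude"] >= 0.5 else "West"
    return f"{ns}-{ew}"

# Row-wise assignment with a lambda.
locations["Region"] = locations.apply(lambda row: region(row), axis=1)

# One binary column per region value.
encoded = pd.get_dummies(locations, columns=["Region"], dtype=int)
print(encoded)
```

Each row ends up with exactly one region column set to 1, which avoids implying an order among the regions.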
Data analysis by regions:

1) Location counts by region:
- The South-West has the highest number of locations.

2) Energy generated by each region:
- The North-East has the highest energy.

3) Cheapest prices for each region:
- The South-West has the cheapest locations.

Now we create a data frame for each region.
Scenario 1

We want to pick locations that combine low prices with the best possible energy yield, that can generate 2 million kWh/a, and that fit our budget of 2 million €.

Total energy = sum over the four regions of (energy per m^2 × m^2)
= (238.62 × 3000) + (236.11 × 2000) + (242.04 × 3000) + (239.78 × 2000)
= 2,393,760 kWh/a > 2 million kWh/a

Total cost = sum over the four regions of (price + material cost) × m^2 = 1,645,000 € < 2 million €

So, if we use all the area available to us and build plants there, we get more than 2 million kWh/a at a cost of about 1.6 million €. But we can optimize further by using only half of the land where prices are highest.
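The scenario 1 energy total can be checked with a few lines of arithmetic. The energy and area figures are the ones used in the calculation above; the `site_cost` helper and the land price passed to it are hypothetical, because the report does not list the per-region prices behind its cost total.

```python
# (predicted energy per m2 in kWh/a, usable area in m2) for the four regions,
# in the order used in the report's calculation.
regions = [(238.62, 3000), (236.11, 2000), (242.04, 3000), (239.78, 2000)]

total_energy = sum(energy * area for energy, area in regions)
meets_demand = total_energy > 2_000_000
print(round(total_energy), meets_demand)

MATERIAL_COST_PER_M2 = 100  # EUR per m2, from the problem summary

def site_cost(land_price_per_m2, area_m2):
    """Cost of building on area_m2: (land price + material cost) per m2 times area."""
    return (land_price_per_m2 + MATERIAL_COST_PER_M2) * area_m2

# Hypothetical land price of 150 EUR/m2 on 1,000 m2, for illustration only.
print(site_cost(150, 1000))
```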
We can generate more than 2 million kWh/a with a minimum budget of 1,384,000 €.

Locations for Scenario 1:

This method can also be used for scenario 2, where one must take locations in the regions where the most energy is generated. But for scenario 2 we will try the method of hierarchical clustering.
Scenario 2

Hierarchical clustering:

In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two categories:
- Agglomerative: This is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Divisive: This is a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
Implementation of hierarchical clustering
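A minimal sketch of the agglomerative ("bottom-up") variant using scikit-learn's `AgglomerativeClustering`. The site values are hypothetical, and the choice of features (energy and price), two clusters and Ward linkage are assumptions; the report does not state which settings it used.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical candidate sites as (yearly energy per m2, land price per m2).
sites = np.array([
    [236.0, 90.0], [237.0, 95.0], [236.5, 92.0],     # lower energy, cheaper
    [243.0, 150.0], [242.5, 155.0], [244.0, 148.0],  # higher energy, pricier
])

# Each site starts as its own cluster; the closest clusters are merged
# step by step until only n_clusters remain.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(sites)
print(labels)
```

Because the features are on different scales, the price dominates the distances here; standardising the columns first would be a reasonable refinement.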
After implementing the same method for all region datasets, we get this result.

Observations:
1) From the above result we want to find the points that generate the highest energy at a low price.
2) First, we choose the optimal point for all the regions; we can see that the point with the second-highest energy is cheap compared to the highest point. But the calculation shows that it does not fulfil the demand of 3 million kWh/a. Now we choose the points that have the highest energy value.

Although we use the highest-energy points, we do not fulfil the energy demand. Scenario 2 is therefore not feasible.

Conclusion
1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false

Code: Project to find the best location for solar farms that can fulfil our energy requirement.

In [476…
# import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [477…
#import Datasets
df = pd.read_excel(r'C:UsersSHRUTEJDesktopAI ProjectInstalled Solar Plants.xls
df_1 = pd.read_excel(r'C:UsersSHRUTEJDesktopAI ProjectEnvironment Solar Data.x

In [478…
df.head()

Out[478]:
   Model ID  Sunshine Hours per year  Size Solar Panel m2  Generated energy kWh/a
0         1                     1418                  794                  233616
1         2                     1474                 1726                  525410
2         3                     1335                 5776                 1612292
3         4                     1224                 6494                 1681651
4         5                     1320                 2085                  576313

In [479…
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 4 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   Model ID                 19 non-null     int64
 1   Sunshine Hours per year  19 non-null     int64
 2   Size Solar Panel m2      19 non-null     int64
 3   Generated energy kWh/a   19 non-null     int64
dtypes: int64(4)
memory usage: 736.0 bytes

Observation: We want to find the energy per m^2, which we obtain by dividing generated energy by solar panel m2.
In [480…
df['energy_per_m2'] = df['Generated energy kWh/a']/df['Size Solar Panel m2']

In [481…
df.head()

Out[481]:
   Model ID  Sunshine Hours per year  Size Solar Panel m2  Generated energy kWh/a  energy_per_m2
0         1                     1418                  794                  233616     294.226700
1         2                     1474                 1726                  525410     304.409038
2         3                     1335                 5776                 1612292     279.136427
3         4                     1224                 6494                 1681651     258.954573
4         5                     1320                 2085                  576313     276.409113

EDA of installed plant dataset.

In [482…
sns.pairplot(df)

Out[482]: <seaborn.axisgrid.PairGrid at 0x21012d5f6d0>
Observation: 1) Energy per m^2 is directly proportional to Sunshine Hours per year.

In [483…
# Let's take a closer look with a jointplot.
sns.jointplot(x='Sunshine Hours per year',y='energy_per_m2',data = df)

Out[483]: <seaborn.axisgrid.JointGrid at 0x21012d5fa30>

Observation: From the straight line we can consider implementing a linear regression model.

In [484…
# Let's see how strongly they are correlated.
# We compute the correlation and plot it as a heatmap.
sns.heatmap(df.corr(),annot = True)

Out[484]: <AxesSubplot:>
Model implementation:

In [485…
# creating Train and Test data:
X = df[[ 'Sunshine Hours per year']]
y = df['energy_per_m2']

In [486…
# make a split in the dataset, import the Linear Regression model and fitting on
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_sta
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)

Out[486]: LinearRegression()

Result of regression

In [487…
# intercept is the value of c in y = mx + c
print(lm.intercept_)

36.40591577945423

In [488…
# coefficient is the slope of the line:
lm.coef_

Out[488]: array([0.18182098])

In [489…
# Let's check the predicted values
predictions = lm.predict(X_test)
predictions

Out[489]: array([258.95479913, 250.04557096, 304.41004491, 279.13692826, 294.22806986, 248.22736112, 260.40936699, 255.13655848])

In [490…
# We can check how they vary from the actual values by plotting a scatter plot.
plt.scatter(y_test,predictions)

Out[490]: <matplotlib.collections.PathCollection at 0x21015f39ee0>

Observation: From the graph we can see there is almost no deviation between y_test and the predictions.

In [491…
from sklearn import metrics
metrics.mean_absolute_error(y_test,predictions)

Out[491]: 0.00046171013969242836

In [492…
from sklearn.metrics import r2_score
r2_score(y_test, predictions)

Out[492]: 0.9999999989429613

From the result we can see that there is almost zero error in our model and the R2 score is also nearly 1, which is the best possible outcome.
Now let's explore the second dataset.

In [493…
df_1.head()

Out[493]:
         Date  Longitude  Latitude  Sunhine Hours  Avg. Wind Speed  Property prices
0  2018-01-01       0.22      0.27          45.36            2.560            304.0
1  2018-01-01       0.28      0.23          34.02            3.216            318.0
2  2018-01-01       0.28      0.27          45.36            3.144            102.0
3  2018-01-01       0.35      0.20          30.78            3.664            248.0
4  2018-01-01       0.18      0.30          33.21            2.632            326.0

In [494…
df_1.describe()

Out[494]:
         Longitude     Latitude  Sunhine Hours  Avg. Wind Speed  Property prices
count  3304.000000  3304.000000    3302.000000      3303.000000      3303.000000
mean      0.448610     0.495627      99.031788         3.776776       146.799273
std       0.266589     0.255900      51.760654        17.347552        78.796555
min       0.014000     0.002000      20.160000         2.400000        57.000000
25%       0.210000     0.250000      48.640000         3.042000        92.000000
50%       0.350000     0.514000      99.560000         3.474000       126.000000
75%       0.720000     0.729000     140.800000         3.897000       159.000000
max       0.873000     0.929000    1000.000000      1000.000000       330.000000

In [495…
df_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3304 entries, 0 to 3303
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Date             3304 non-null   datetime64[ns]
 1   Longitude        3304 non-null   float64
 2   Latitude         3304 non-null   float64
 3   Sunhine Hours    3302 non-null   float64
 4   Avg. Wind Speed  3303 non-null   float64
 5   Property prices  3303 non-null   float64
dtypes: datetime64[ns](1), float64(5)
memory usage: 155.0 KB

In [496…
#EDA of dataset
sns.pairplot(df_1)

Out[496]: <seaborn.axisgrid.PairGrid at 0x21015dacac0>
35. In [497]: df_1.groupby(['Date']).count()
36. 36. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 36/62 Out[497]: Longitude Latitude Sunhine Hours Avg. Wind Speed Property prices Date 2018-01-01 59 59 59 59 59 2018-02-01 59 59 59 59 59 2018-03-01 59 59 58 59 59 2018-04-01 59 59 59 59 59 2018-05-01 59 59 59 59 59 2018-06-01 59 59 59 59 59 2018-07-01 59 59 59 59 59 2018-08-01 59 59 59 59 59 2018-09-01 59 59 59 59 59 2018-10-01 59 59 58 59 59 2018-11-01 59 59 59 59 59 2018-12-01 59 59 59 59 59 2019-01-01 59 59 59 59 59 2019-02-01 59 59 59 59 59 2019-03-01 59 59 59 59 59 2019-04-01 59 59 59 59 59 2019-05-01 59 59 59 58 59 2019-06-01 59 59 59 59 59 2019-07-01 59 59 59 59 59 2019-08-01 59 59 59 59 59 2019-09-01 59 59 59 59 59 2019-10-01 59 59 59 59 59 2019-11-01 59 59 59 59 59 2019-12-01 59 59 59 59 59 2020-01-01 59 59 59 59 59 2020-02-01 59 59 59 59 59 2020-03-01 59 59 59 59 59 2020-04-01 59 59 59 59 59 2020-05-01 59 59 59 59 59 2020-06-01 59 59 59 59 59 2020-07-01 59 59 59 59 59 2020-08-01 59 59 59 59 59 2020-09-01 59 59 59 59 59 2020-10-01 59 59 59 59 59 2020-11-01 59 59 59 59 59
37. Out[497] (continued):
Date        Longitude  Latitude  Sunhine Hours  Avg. Wind Speed  Property prices
2020-12-01  59         59        59             59               59
2021-01-01  59         59        59             59               59
2021-02-01  59         59        59             59               59
2021-03-01  59         59        59             59               59
2021-04-01  59         59        59             59               59
2021-05-01  59         59        59             59               59
2021-06-01  59         59        59             59               59
2021-07-01  59         59        59             59               59
2021-08-01  59         59        59             59               59
2021-09-01  59         59        59             59               59
2021-10-01  59         59        59             59               58
2021-11-01  59         59        59             59               59
2021-12-01  59         59        59             59               59
2022-01-01  59         59        59             59               59
2022-02-01  59         59        59             59               59
2022-03-01  59         59        59             59               59
2022-04-01  59         59        59             59               59
2022-05-01  59         59        59             59               59
2022-06-01  59         59        59             59               59
2022-07-01  59         59        59             59               59
2022-08-01  59         59        59             59               59
From the two results above, we can see that data is available for 59 unique properties. The pair plot also shows that the properties are divided into 4 clusters. The dataset covers 56 months, so we have to convert it to a yearly format. Price and wind speed remain the same throughout the time period for each property. We will create a dataset of the 59 properties.
In [498]: # Find the unique properties.
locations_ = df_1[['Longitude','Latitude']].drop_duplicates()
locations_
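The observations above (number of locations, months per location, price constant per location) can be verified directly rather than read off the tables. A minimal sketch on made-up data: the frame `monthly` and its values are assumptions, only the column names follow the report.

```python
import pandas as pd

# Made-up monthly records for two locations (assumption, not the report's data).
monthly = pd.DataFrame({
    "Date": pd.to_datetime(["2018-01-01", "2018-02-01"] * 2),
    "Longitude": [0.22, 0.22, 0.28, 0.28],
    "Latitude": [0.27, 0.27, 0.23, 0.23],
    "Sunhine Hours": [45.36, 50.00, 34.02, 40.00],
    "Property prices": [304.0, 304.0, 318.0, 318.0],
})

by_location = monthly.groupby(["Longitude", "Latitude"])

n_locations = by_location.ngroups
months_per_location = by_location["Date"].nunique()
# A column is constant per location if every group holds exactly one distinct value.
price_is_constant = bool((by_location["Property prices"].nunique() == 1).all())
# Yearly sunshine estimate: monthly mean times 12, as the report does later.
yearly_sunshine = by_location["Sunhine Hours"].mean() * 12
print(n_locations, price_is_constant)
print(yearly_sunshine)
```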
38. 38. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 38/62 Out[498]: Longitude Latitude 0 0.220 0.270 1 0.280 0.230 2 0.280 0.270 3 0.350 0.200 4 0.180 0.300 5 0.310 0.320 6 0.300 0.250 7 0.200 0.200 8 0.230 0.250 9 0.210 0.210 10 0.220 0.700 11 0.280 0.680 12 0.280 0.690 13 0.350 0.700 14 0.180 0.800 15 0.310 0.750 16 0.300 0.720 17 0.200 0.770 18 0.230 0.760 19 0.210 0.740 20 0.720 0.700 21 0.640 0.680 22 0.630 0.690 23 0.680 0.700 24 0.770 0.800 25 0.770 0.750 26 0.760 0.720 27 0.740 0.770 28 0.720 0.760 29 0.700 0.740 30 0.720 0.220 31 0.640 0.260 32 0.630 0.280 33 0.680 0.250 34 0.770 0.180 35 0.770 0.310
39. Out[498] (continued):
    Longitude  Latitude
36  0.760      0.300
37  0.740      0.200
38  0.720      0.230
39  0.700      0.210
40  0.233      0.929
41  0.617      0.514
42  0.373      0.002
43  0.864      0.838
44  0.081      0.805
45  0.124      0.413
46  0.164      0.106
47  0.137      0.710
48  0.064      0.835
49  0.160      0.173
50  0.014      0.729
51  0.025      0.472
52  0.715      0.211
53  0.808      0.505
54  0.873      0.379
55  0.856      0.100
56  0.233      0.623
57  0.567      0.727
58  0.180      0.611
In [499]: # Average sunshine hours per month for each property.
df_1.groupby(['Longitude','Latitude'])['Sunhine Hours'].mean().reset_index()
40. 40. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 40/62 Out[499]: Longitude Latitude Sunhine Hours 0 0.014 0.729 96.410714 1 0.025 0.472 94.302857 2 0.064 0.835 94.963929 3 0.081 0.805 94.248214 4 0.124 0.413 96.712857 5 0.137 0.710 98.401786 6 0.160 0.173 96.977455 7 0.164 0.106 93.440714 8 0.180 0.300 107.601830 9 0.180 0.611 94.173571 10 0.180 0.800 108.632009 11 0.200 0.200 107.670134 12 0.200 0.770 107.224554 13 0.210 0.210 105.145714 14 0.210 0.740 104.628616 15 0.220 0.270 107.102411 16 0.220 0.700 105.796607 17 0.230 0.250 106.368348 18 0.230 0.760 105.567589 19 0.233 0.623 93.926071 20 0.233 0.929 94.735714 21 0.280 0.230 107.218527 22 0.280 0.270 106.851696 23 0.280 0.680 108.749732 24 0.280 0.690 105.909911 25 0.300 0.250 106.069821 26 0.300 0.720 105.526205 27 0.310 0.320 105.487634 28 0.310 0.750 106.812723 29 0.350 0.200 106.869375 30 0.350 0.700 109.638409 31 0.373 0.002 95.600357 32 0.567 0.727 111.343214 33 0.617 0.514 92.911429 34 0.630 0.280 91.532143 35 0.630 0.690 96.505357
41. 41. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 41/62 Longitude Latitude Sunhine Hours 36 0.640 0.260 95.721429 37 0.640 0.680 93.820357 38 0.680 0.250 95.400000 39 0.680 0.700 92.683929 40 0.700 0.210 95.352857 41 0.700 0.740 94.743571 42 0.715 0.211 93.179286 43 0.720 0.220 90.614286 44 0.720 0.230 94.259643 45 0.720 0.700 93.306786 46 0.720 0.760 95.486429 47 0.740 0.200 92.203929 48 0.740 0.770 96.747500 49 0.760 0.300 93.957143 50 0.760 0.720 96.785357 51 0.770 0.180 92.680357 52 0.770 0.310 97.258929 53 0.770 0.750 95.524643 54 0.770 0.800 96.507143 55 0.808 0.505 92.385714 56 0.856 0.100 95.538929 57 0.864 0.838 93.215357 58 0.873 0.379 94.596429 In [500… df_1.groupby(['Longitude','Latitude'])['Avg. Wind Speed'].mean().reset_index()
42. 42. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 42/62 Out[500]: Longitude Latitude Avg. Wind Speed 0 0.014 0.729 3.680839 1 0.025 0.472 3.540536 2 0.064 0.835 3.565125 3 0.081 0.805 3.615589 4 0.124 0.413 3.676500 5 0.137 0.710 3.593732 6 0.160 0.173 3.511929 7 0.164 0.106 3.597750 8 0.180 0.300 3.360143 9 0.180 0.611 3.478821 10 0.180 0.800 3.196000 11 0.200 0.200 3.168286 12 0.200 0.770 3.327143 13 0.210 0.210 3.179000 14 0.210 0.740 3.243714 15 0.220 0.270 3.214571 16 0.220 0.700 3.245143 17 0.230 0.250 3.097286 18 0.230 0.760 3.140571 19 0.233 0.623 3.607393 20 0.233 0.929 3.706714 21 0.280 0.230 3.232714 22 0.280 0.270 3.204429 23 0.280 0.680 3.212714 24 0.280 0.690 3.137429 25 0.300 0.250 3.321429 26 0.300 0.720 20.975714 27 0.310 0.320 3.174714 28 0.310 0.750 3.206571 29 0.350 0.200 3.235286 30 0.350 0.700 3.258857 31 0.373 0.002 3.812143 32 0.567 0.727 3.604982 33 0.617 0.514 3.448929 34 0.630 0.280 3.697875 35 0.630 0.690 3.590196
43. 43. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 43/62 Longitude Latitude Avg. Wind Speed 36 0.640 0.260 3.705429 37 0.640 0.680 3.495375 38 0.680 0.250 3.606429 39 0.680 0.700 3.578625 40 0.700 0.210 3.546321 41 0.700 0.740 3.635679 42 0.715 0.211 3.567857 43 0.720 0.220 3.610607 44 0.720 0.230 3.617357 45 0.720 0.700 3.485571 46 0.720 0.760 3.708321 47 0.740 0.200 3.597911 48 0.740 0.770 3.554196 49 0.760 0.300 3.630857 50 0.760 0.720 3.608679 51 0.770 0.180 3.552107 52 0.770 0.310 3.673768 53 0.770 0.750 3.644357 54 0.770 0.800 3.588218 55 0.808 0.505 3.583125 56 0.856 0.100 3.668304 57 0.864 0.838 3.624429 58 0.873 0.379 3.682125 In [501… df_1.groupby(['Longitude','Latitude'])['Property prices'].mean().reset_index()
44. 44. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 44/62 Out[501]: Longitude Latitude Property prices 0 0.014 0.729 67.0 1 0.025 0.472 116.0 2 0.064 0.835 91.0 3 0.081 0.805 65.0 4 0.124 0.413 126.0 5 0.137 0.710 86.0 6 0.160 0.173 130.0 7 0.164 0.106 95.0 8 0.180 0.300 326.0 9 0.180 0.611 127.0 10 0.180 0.800 276.0 11 0.200 0.200 105.0 12 0.200 0.770 224.0 13 0.210 0.210 273.0 14 0.210 0.740 312.0 15 0.220 0.270 304.0 16 0.220 0.700 174.0 17 0.230 0.250 159.0 18 0.230 0.760 137.0 19 0.233 0.623 73.0 20 0.233 0.929 128.0 21 0.280 0.230 318.0 22 0.280 0.270 102.0 23 0.280 0.680 131.0 24 0.280 0.690 330.0 25 0.300 0.250 129.0 26 0.300 0.720 245.0 27 0.310 0.320 232.0 28 0.310 0.750 320.0 29 0.350 0.200 248.0 30 0.350 0.700 277.0 31 0.373 0.002 139.0 32 0.567 0.727 149.0 33 0.617 0.514 137.0 34 0.630 0.280 61.0 35 0.630 0.690 150.0
45. Out[501] (continued):
    Longitude  Latitude  Property prices
36  0.640      0.260     117.0
37  0.640      0.680     74.0
38  0.680      0.250     134.0
39  0.680      0.700     93.0
40  0.700      0.210     93.0
41  0.700      0.740     149.0
42  0.715      0.211     107.0
43  0.720      0.220     82.0
44  0.720      0.230     98.0
45  0.720      0.700     73.0
46  0.720      0.760     90.0
47  0.740      0.200     72.0
48  0.740      0.770     87.0
49  0.760      0.300     96.0
50  0.760      0.720     137.0
51  0.770      0.180     57.0
52  0.770      0.310     139.0
53  0.770      0.750     108.0
54  0.770      0.800     92.0
55  0.808      0.505     99.0
56  0.856      0.100     112.0
57  0.864      0.838     74.0
58  0.873      0.379     115.0
Now we merge all columns together:
In [502]: locations_ = locations_.sort_values(by=['Longitude', 'Latitude'])
In [503]: locations_['Avg Sunshine hours'] = df_1.groupby(['Longitude','Latitude'])['Sunhine…
In [504]: locations_['Avg Wind speed'] = df_1.groupby(['Longitude','Latitude'])['Avg. Wind S…
In [505]: locations_['Avg Price'] = df_1.groupby(['Longitude','Latitude'])['Property prices…
In [506]: locations_.head()
Out[506]:
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price
50  0.014      0.729     96.785357           3.608679        137.0
51  0.025      0.472     92.680357           3.552107        57.0
48  0.064      0.835     96.747500           3.554196        87.0
46. Out[506] (continued):
44  0.081      0.805     94.259643           3.617357        98.0
45  0.124      0.413     93.306786           3.485571        73.0
We now have a dataset of 59 properties, but we do not know which region (area) each of them belongs to.
In [507]: # Plot longitude against latitude to see where the properties are.
Reg = np.array(locations_[['Longitude','Latitude']])
plt.scatter(Reg[:,0], Reg[:,1])
Out[507]: <matplotlib.collections.PathCollection at 0x2101ea7df10>
We can see that there are 4 regions. No labels are available, so we have to use unsupervised learning to find the clusters. We will use the K-means algorithm for clustering.
In [508]: from sklearn.cluster import KMeans
Kmeans = KMeans(n_clusters = 4)
Kmeans.fit(Reg)
Out[508]: KMeans(n_clusters=4)
In [509]: Reg_list = Kmeans.labels_
Reg_list
Out[509]: array([3, 3, 3, 3, 0, 3, 0, 0, 0, 3, 3, 0, 3, 0, 3, 0, 3, 0, 3, 3, 3, 0, 0, 3, 3, 0, 3, 0, 3, 0, 3, 0, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1, 2])
In [510]: # From the cluster centers we can tell which region is which.
Kmeans.cluster_centers_
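One caution about cells In [502] to In [506]: in pandas, assigning a Series to a DataFrame column aligns on index labels, not on row position. If the truncated cells end in `.mean().reset_index()[...]` (an assumption, since the export cuts them off), the averages are matched to `locations_` by row label rather than by coordinates. The displayed rows are consistent with that reading: location (0.014, 0.729) carries 96.785357, 3.608679 and 137.0 in Out[506], while Out[499] to Out[501] list 96.410714, 3.680839 and 67.0 for those coordinates. A minimal sketch of a merge keyed on the coordinates instead, using made-up data and hypothetical variable names:

```python
import pandas as pd

# Made-up monthly records (assumption); column names follow the report.
df = pd.DataFrame({
    "Longitude": [0.72, 0.72, 0.014, 0.014],
    "Latitude": [0.70, 0.70, 0.729, 0.729],
    "Sunhine Hours": [90.0, 96.0, 95.0, 97.0],
    "Property prices": [73.0, 73.0, 67.0, 67.0],
})

keys = ["Longitude", "Latitude"]

# One row per location, with all averages computed in a single pass.
averages = (
    df.groupby(keys)
      .agg(avg_sunshine=("Sunhine Hours", "mean"),
           avg_price=("Property prices", "mean"))
      .reset_index()
)

# Unique locations in their original order, then a merge on the coordinates,
# so each average lands on the right location whatever the index labels are.
locations = df[keys].drop_duplicates()
merged = locations.merge(averages, on=keys, how="left", validate="one_to_one")
print(merged)
```

Because the merge matches on `Longitude` and `Latitude`, sorting or re-indexing either frame beforehand does not change which average ends up on which location.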
47. Out[510]: array([[0.2415    , 0.22814286], [0.71328571, 0.70671429], [0.73646154, 0.24076923], [0.19594444, 0.72355556]])
In [511]: fig, ax1 = plt.subplots(1, sharey=True, figsize=(5,5))
ax1.set_title('Separate Regions using Kmeans')
ax1.scatter(Reg[:,0], Reg[:,1], c=Kmeans.labels_, cmap='rainbow')
Out[511]: <matplotlib.collections.PathCollection at 0x2101eacacd0>
In [512]: locations_['region'] = Reg_list
In [513]: locations_.head()
Out[513]:
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region
50  0.014      0.729     96.785357           3.608679        137.0      3
51  0.025      0.472     92.680357           3.552107        57.0       3
48  0.064      0.835     96.747500           3.554196        87.0       3
44  0.081      0.805     94.259643           3.617357        98.0       3
45  0.124      0.413     93.306786           3.485571        73.0       0
In [514]: # Define the regions using one-hot encoding.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore')
encoder_df = pd.DataFrame(encoder.fit_transform(locations_[['region']]).toarray())
locations_1 = locations_.join(encoder_df)
48. locations_1.columns = ['Longitude','Latitude','Avg Sunhine Hours','Avg. Wind Speed… 'SW','SE']
locations_1.head()
Out[514]:
    Longitude  Latitude  Avg Sunhine Hours  Avg. Wind Speed  Avg prices  region  NE   NW   SW   SE
50  0.014      0.729     96.785357          3.608679         137.0       3       0.0  1.0  0.0  0.0
51  0.025      0.472     92.680357          3.552107         57.0        3       0.0  0.0  1.0  0.0
48  0.064      0.835     96.747500          3.554196         87.0        3       0.0  1.0  0.0  0.0
44  0.081      0.805     94.259643          3.617357         98.0        3       0.0  0.0  1.0  0.0
45  0.124      0.413     93.306786          3.485571         73.0        0       0.0  1.0  0.0  0.0
In [515]: # Define the region name with a lambda function:
locations_['Area']= locations_['region'].apply(lambda region:"South-West" if region…
In [516]: locations_.head()
Out[516]:
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Area
50  0.014      0.729     96.785357           3.608679        137.0      3       South-West
51  0.025      0.472     92.680357           3.552107        57.0       3       South-West
48  0.064      0.835     96.747500           3.554196        87.0       3       South-West
44  0.081      0.805     94.259643           3.617357        98.0       3       South-West
45  0.124      0.413     93.306786           3.485571        73.0       0       South-East
We want the sunshine hours on a yearly basis, so we multiply the monthly average by 12.
In [517]: locations_['Avg Sunshine hours'] = locations_['Avg Sunshine hours']*12
Now we predict the energy per m^2 using the coefficient and intercept of the linear regression.
In [518]: locations_['Pridected Energy'] = lm.coef_[0]*locations_['Avg Sunshine hours'] + lm…
locations_.head()
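K-means numbers its clusters arbitrarily, and without a fixed `random_state` the numbering can change between runs, so a hard-coded number-to-name mapping like the one in In [515] is fragile. A hedged sketch of deriving the name from each cluster center instead. It assumes that longitude grows eastward, latitude grows northward, and 0.5 splits the map in both directions; these are assumptions about the coordinate system, which the report does not define. The helper `name_region` is hypothetical; the centers are those printed in Out[510].

```python
def name_region(longitude, latitude, split=0.5):
    """Name a quadrant, assuming east = larger longitude, north = larger latitude."""
    north_south = "North" if latitude >= split else "South"
    east_west = "East" if longitude >= split else "West"
    return f"{north_south}-{east_west}"


# Cluster centers as printed in Out[510], as (longitude, latitude) pairs.
centers = [
    (0.2415, 0.22814286),
    (0.71328571, 0.70671429),
    (0.73646154, 0.24076923),
    (0.19594444, 0.72355556),
]

# Map each K-means label to a name derived from its center.
label_to_area = {
    label: name_region(lon, lat) for label, (lon, lat) in enumerate(centers)
}
print(label_to_area)
```

Under these assumptions the names differ from the `Area` labels shown in Out[516] (cluster 3, for example, would be North-West rather than South-West), so it is worth checking which convention the task intends before applying the per-region area limits.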
49. Out[518]:
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Area        Pridected Energy
50  0.014      0.729     1161.424286         3.608679        137.0      3       South-West  247.577221
51  0.025      0.472     1112.164286         3.552107        57.0       3       South-West  238.620720
48  0.064      0.835     1160.970000         3.554196        87.0       3       South-West  247.494623
44  0.081      0.805     1131.115714         3.617357        98.0       3       South-West  242.066487
45  0.124      0.413     1119.681429         3.485571        73.0       0       South-East  239.987494
Now we will group the data by region and do the analysis.
In [519]: sns.countplot(x='Area', data=locations_)
Out[519]: <AxesSubplot:xlabel='Area', ylabel='count'>
In [520]: locations_.groupby(['Area']).count()
Out[520]:
Area        Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Pridected Energy
North-East  13         13        13                  13              13         13      13
North-West  14         14        14                  14              14         14      14
50. Out[520] (continued):
South-East  14         14        14                  14              14         14      14
South-West  18         18        18                  18              18         18      18
In [521]: sns.barplot(x='Area', y='Pridected Energy', data=locations_, estimator=max)
Out[521]: <AxesSubplot:xlabel='Area', ylabel='Pridected Energy'>
In [522]: locations_.groupby(['Area'], sort=False)['Pridected Energy'].max()
Out[522]:
Area
South-West    273.424860
South-East    271.177163
North-West    273.681714
North-East    279.340308
Name: Pridected Energy, dtype: float64
In [523]: sns.barplot(x='Area', y='Avg Price', data=locations_, estimator=max)
Out[523]: <AxesSubplot:xlabel='Area', ylabel='Avg Price'>
51. In [524]: sns.barplot(x='Area', y='Avg Price', data=locations_, estimator=min)
Out[524]: <AxesSubplot:xlabel='Area', ylabel='Avg Price'>
In [525]: locations_.groupby(['Area'], sort=False)['Avg Price'].min()
Out[525]:
Area
52. Out[525] (continued):
South-West    57.0
South-East    65.0
North-West    74.0
North-East    61.0
Name: Avg Price, dtype: float64
In [526]: NW_ = locations_[locations_['Area'] == 'North-West']
NE_ = locations_[locations_['Area'] == 'North-East']
SE_ = locations_[locations_['Area'] == 'South-East']
SW_ = locations_[locations_['Area'] == 'South-West']
For scenario 1, we will see how we can optimize cost and energy.
In [527]: NW_
Out[527]:
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Area        Pridected Energy
57  0.567      0.727     1118.584286         3.624429        74.0       1       North-West  239.788010
41  0.617      0.514     1136.922857         3.635679        149.0      1       North-West  243.122347
22  0.630      0.690     1282.220357         3.204429        102.0      1       North-West  269.540482
21  0.640      0.680     1286.622321         3.232714        318.0      1       North-West  270.340851
23  0.680      0.700     1304.996786         3.212714        131.0      1       North-West  273.681714
29  0.700      0.740     1282.432500         3.235286        248.0      1       North-West  269.579054
20  0.720      0.700     1136.828571         3.706714        128.0      1       North-West  243.105204
28  0.720      0.760     1281.752679         3.206571        320.0      1       North-West  269.455448
27  0.740      0.770     1265.851607         3.174714        232.0      1       North-West  266.564299
26  0.760      0.720     1266.314464         20.975714       245.0      1       North-West  266.648457
25  0.770      0.750     1272.837857         3.321429        129.0      1       North-West  267.834546
24  0.770      0.800     1270.918929         3.137429        330.0      1       North-West  267.485645
53  0.808      0.505     1146.295714         3.644357        108.0      1       North-West  244.826530
43  0.864      0.838     1087.371429         3.610607        82.0       1       North-West  234.112858
In [528]: # Sort the values of the price and energy columns and choose the row where price is…
NW_sorted = NW_.sort_values(by=["Avg Price", "Pridected Energy"], ascending=[True,…
53. best_row_nw = NW_sorted.head(1)
best_row_nw
Out[528]:
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Area        Pridected Energy
57  0.567      0.727     1118.584286         3.624429        74.0       1       North-West  239.78801
In [529]: SE_
Out[529]:
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Area        Pridected Energy
45  0.124      0.413     1119.681429         3.485571        73.0       0       South-East  239.987494
49  0.160      0.173     1127.485714         3.630857        96.0       0       South-East  241.406477
46  0.164      0.106     1145.837143         3.708321        90.0       0       South-East  244.743152
4   0.180      0.300     1160.554286         3.676500        126.0      0       South-East  247.419037
7   0.200      0.200     1121.288571         3.597750        95.0       0       South-East  240.279706
9   0.210      0.210     1130.082857         3.478821        127.0      0       South-East  241.878692
0   0.220      0.270     1156.928571         3.680839        67.0       0       South-East  246.759806
8   0.230      0.250     1291.221964         3.360143        326.0      0       South-East  271.177163
1   0.280      0.230     1131.634286         3.540536        116.0      0       South-East  242.160774
2   0.280      0.270     1139.567143         3.565125        91.0       0       South-East  243.603134
6   0.300      0.250     1163.729455         3.511929        130.0      0       South-East  247.996349
5   0.310      0.320     1180.821429         3.593732        86.0       0       South-East  251.104029
3   0.350      0.200     1130.978571         3.615589        65.0       0       South-East  242.041552
42  0.373      0.002     1118.151429         3.567857        107.0      0       South-East  239.709308
In [530]: SE_sorted = SE_.sort_values(by=["Avg Price", "Pridected Energy"], ascending=[True,…
best_row_se = SE_sorted.head(1)
best_row_se
Out[530]:
54. Out[530] (continued):
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Area        Pridected Energy
3   0.35       0.2       1130.978571         3.615589        65.0       0       South-East  242.041552
In [531]: SW_
Out[531]:
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Area        Pridected Energy
50  0.014      0.729     1161.424286         3.608679        137.0      3       South-West  247.577221
51  0.025      0.472     1112.164286         3.552107        57.0       3       South-West  238.620720
48  0.064      0.835     1160.970000         3.554196        87.0       3       South-West  247.494623
44  0.081      0.805     1131.115714         3.617357        98.0       3       South-West  242.066487
47  0.137      0.710     1106.447143         3.597911        72.0       3       South-West  237.581223
58  0.180      0.611     1135.157143         3.682125        115.0      3       South-West  242.801303
14  0.180      0.800     1255.543393         3.243714        312.0      3       South-West  264.690050
17  0.200      0.770     1276.420179         3.097286        159.0      3       South-West  268.485888
19  0.210      0.740     1127.112857         3.607393        73.0       3       South-West  241.338684
10  0.220      0.700     1303.584107         3.196000        276.0      3       South-West  273.424860
18  0.230      0.760     1266.811071         3.140571        137.0      3       South-West  266.738750
56  0.233      0.623     1146.467143         3.668304        112.0      3       South-West  244.857699
40  0.233      0.929     1144.234286         3.546321        93.0       3       South-West  244.451719
11  0.280      0.680     1292.041607         3.168286        105.0      3       South-West  271.326191
12  0.280      0.690     1286.694643         3.327143        224.0      3       South-West  270.354001
16  0.300      0.720     1269.559286         3.245143        174.0      3       South-West  267.238433
15  0.310      0.750     1285.228929         3.214571        304.0      3       South-West  270.087503
13  0.350      0.700     1261.748571         3.179000        273.0      3       South-West  265.818281
In [532]: SW_sorted = SW_.sort_values(by=["Avg Price", "Pridected Energy"], ascending=[True,…
best_row_sw = SW_sorted.head(1)
best_row_sw
Out[532]:
55. Out[532] (continued):
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Area        Pridected Energy
51  0.025      0.472     1112.164286         3.552107        57.0       3       South-West  238.62072
In [533]: NE_
Out[533]:
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Area        Pridected Energy
32  0.630      0.280     1336.118571         3.604982        149.0      2       North-East  279.340308
31  0.640      0.260     1147.204286         3.812143        139.0      2       North-East  244.991727
33  0.680      0.250     1114.937143         3.448929        137.0      2       North-East  239.124883
39  0.700      0.210     1112.207143         3.578625        93.0       2       North-East  238.628512
52  0.715      0.211     1167.107143         3.673768        139.0      2       North-East  248.610484
30  0.720      0.220     1315.660909         3.258857        277.0      2       North-East  275.620676
38  0.720      0.230     1144.800000         3.606429        134.0      2       North-East  244.554577
37  0.740      0.200     1125.844286         3.495375        74.0       2       North-East  241.108031
36  0.760      0.300     1148.657143         3.705429        117.0      2       North-East  245.255887
34  0.770      0.180     1098.385714         3.697875        61.0       2       North-East  236.115486
35  0.770      0.310     1158.064286         3.590196        150.0      2       North-East  246.966303
55  0.856      0.100     1108.628571         3.583125        99.0       2       North-East  237.977853
54  0.873      0.379     1158.085714         3.588218        92.0       2       North-East  246.970199
In [534]: NE_sorted = NE_.sort_values(by=["Avg Price", "Pridected Energy"], ascending=[True,…
best_row_ne = NE_sorted.head(1)
best_row_ne
Out[534]:
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Area        Pridected Energy
34  0.77       0.18      1098.385714         3.697875        61.0       2       North-East  236.115486
In [535]: Scenari0_1 = pd.concat([best_row_nw, best_row_se, best_row_sw, best_row_ne], axis=0…
In [536]: Scenari0_1
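The four per-region cells above repeat the same steps: sort by price (ascending) and predicted energy, then take the first row. A minimal sketch of doing this for all regions at once. It assumes the truncated `ascending` list ends in `False`, so that ties on price go to the higher energy (an assumption, since the export cuts the line off). The frame holds a few rows copied from Out[527] to Out[533]; the variable names are illustrative.

```python
import pandas as pd

# A few rows copied from Out[527] to Out[533]; the index holds the row labels.
candidates = pd.DataFrame(
    {
        "Area": ["North-West", "North-West", "South-East", "South-East",
                 "South-West", "South-West", "North-East", "North-East"],
        "Avg Price": [74.0, 82.0, 65.0, 67.0, 57.0, 72.0, 61.0, 74.0],
        "Pridected Energy": [239.788010, 234.112858, 242.041552, 246.759806,
                             238.620720, 237.581223, 236.115486, 241.108031],
    },
    index=[57, 43, 3, 0, 51, 47, 34, 37],
)

# Cheapest first; for equal prices prefer the higher predicted energy (assumed tie-break).
ranked = candidates.sort_values(
    by=["Avg Price", "Pridected Energy"], ascending=[True, False]
)
# groupby(...).head(1) keeps the first (best-ranked) row of every region.
best_per_area = ranked.groupby("Area").head(1)
print(best_per_area)
```

Picking the cheapest site per region ignores how much energy each euro buys; ranking by price divided by predicted energy would be one alternative, though it is not what the report does.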
56. Out[536]:
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Area        Pridected Energy
57  0.567      0.727     1118.584286         3.624429        74.0       1       North-West  239.788010
3   0.350      0.200     1130.978571         3.615589        65.0       0       South-East  242.041552
51  0.025      0.472     1112.164286         3.552107        57.0       3       South-West  238.620720
34  0.770      0.180     1098.385714         3.697875        61.0       2       North-East  236.115486
In [537]: # Calculate the energy
Total_Energy = (238.620720 * 3000) + (236.115486 * 2000) + (242.041552 * 2000) + (2…
In [538]: Total_Energy = round(Total_Energy)
Total_Energy
Out[538]: 2031858
In [539]: # Cost
Total_cost = round((3000 * 157) + (2000 * 161) + (2000 * 165) + (1500 * 174))
Total_cost
Out[539]: 1384000
In [540]: print(" Total energy generated in kWh/a:", Total_Energy)
print(" Total cost in Euro:", Total_cost)
print(" Total area occupied: 8500 m^2")
 Total energy generated in kWh/a: 2031858
 Total cost in Euro: 1384000
 Total area occupied: 8500 m^2
Scenario 2
We could also use the method above to solve scenario 2, but we will try hierarchical clustering instead:
In [541]: from sklearn.cluster import AgglomerativeClustering
# Extract the two feature columns to use for clustering
NW_copmare = NW_[['Avg Price','Pridected Energy']]
# Create an instance of the AgglomerativeClustering class
cluster_NW = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='w…
# Fit the model to the data
cluster_NW.fit(NW_copmare)
# Predict the clusters for each data point
pred = cluster_NW.fit_predict(NW_copmare)
# Create a scatter plot of the clusters
57. plt.scatter(NW_copmare['Avg Price'], NW_copmare['Pridected Energy'], c=pred, cmap=…
plt.show()
In [542]: pred
Out[542]: array([3, 0, 1, 2, 1, 4, 0, 2, 4, 4, 1, 2, 0, 3], dtype=int64)
In [543]: SE_copmare = SE_[['Avg Price','Pridected Energy']]
# Create an instance of the AgglomerativeClustering class
cluster_SE = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='w…
# Fit the model to the data (the original cell called cluster_NW.fit here by mistake)
cluster_SE.fit(SE_copmare)
# Predict the clusters for each data point
pred = cluster_SE.fit_predict(SE_copmare)
# Create a scatter plot of the clusters
plt.scatter(SE_copmare['Avg Price'], SE_copmare['Pridected Energy'], c=pred, cmap=…
plt.show()
58. In [544]: SW_copmare = SW_[['Avg Price','Pridected Energy']]
# Create an instance of the AgglomerativeClustering class
cluster_SW = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='w…
# Fit the model to the data
cluster_SW.fit(SW_copmare)
# Predict the clusters for each data point
pred = cluster_SW.fit_predict(SW_copmare)
# Create a scatter plot of the clusters
plt.scatter(SW_copmare['Avg Price'], SW_copmare['Pridected Energy'], c=pred, cmap=…
plt.show()
59. In [545]: NE_copmare = NE_[['Avg Price','Pridected Energy']]
# Create an instance of the AgglomerativeClustering class
cluster_NE = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='w…
# Fit the model to the data
cluster_NE.fit(NE_copmare)
# Predict the clusters for each data point
pred = cluster_NE.fit_predict(NE_copmare)
# Create a scatter plot of the clusters
plt.scatter(NE_copmare['Avg Price'], NE_copmare['Pridected Energy'], c=pred, cmap=…
plt.show()
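The four clustering cells differ only in the region they use, so they can be folded into one helper. A minimal sketch, assuming scikit-learn is installed and that the truncated `linkage` argument is `'ward'` (an assumption). Ward linkage only works with Euclidean distance, which is also the default, so the distance argument is left out here; its keyword was renamed from `affinity` to `metric` in newer scikit-learn releases. The helper name `cluster_price_energy` is hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def cluster_price_energy(points, n_clusters=5):
    """Group (price, energy) pairs with Ward-linkage agglomerative clustering."""
    model = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
    # fit_predict fits the model and returns one cluster label per row,
    # so a separate fit() call beforehand is not needed.
    return model.fit_predict(np.asarray(points, dtype=float))


# (Avg Price, Pridected Energy) pairs copied from the North-East table, Out[533].
ne_points = [
    (149.0, 279.340308), (139.0, 244.991727), (137.0, 239.124883),
    (93.0, 238.628512), (139.0, 248.610484), (277.0, 275.620676),
    (134.0, 244.554577), (74.0, 241.108031), (117.0, 245.255887),
    (61.0, 236.115486), (150.0, 246.966303), (99.0, 237.977853),
    (92.0, 246.970199),
]
labels = cluster_price_energy(ne_points)
print(labels)
```

Price spans a much wider numeric range than predicted energy in these tables, so the Euclidean distance is dominated by price; standardizing both columns first would be one way to balance them.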
60. We will use all the optimal points, which have high energy and a low price.
In [546]: # SW + NW + NE + SE
Energy_op = (252*2000) + (272 * 3000) + (273 * 3000) + (280 *2000)
Energy_op
Out[546]: 2699000
Now we will use the points which have the highest energy.
In [547]: Energy_hi = (271*2000) + (273.5 * 3000) + (273 * 3000) + (280 *2000)
Energy_hi
Out[547]: 2741500.0
In [548]: cost = (376*3000) + (249*2000) + (231 * 3000) + (426*2000)
cost
Out[548]: 3171000
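The totals for both scenarios can be checked against the demand and budget from the problem statement (2 million kWh/a and 2 million € for scenario 1; 3 million kWh/a and 3 million € for scenario 2). A minimal sketch: the helper `evaluate_plan` is hypothetical. The last scenario 1 term (239.788010 kWh/m² on 1500 m²) is an assumption, because the export truncates In [537]; it does reproduce the reported total of 2031858. The pairing of energy and cost terms in scenario 2 is inferred from the price tables (land price plus 100 € material) and does not affect the totals.

```python
def evaluate_plan(sites, demand_kwh, budget_eur):
    """sites: list of (energy_per_m2, cost_per_m2, area_m2) tuples."""
    energy = sum(e * area for e, _, area in sites)
    cost = sum(c * area for _, c, area in sites)
    return {
        "energy": round(energy),
        "cost": round(cost),
        "meets_demand": energy >= demand_kwh,
        "within_budget": cost <= budget_eur,
    }


# Scenario 1: cost per m2 = land price + 100 EUR material.
scenario_1 = [
    (238.620720, 157, 3000),
    (236.115486, 161, 2000),
    (242.041552, 165, 2000),
    (239.788010, 174, 1500),  # assumed: this term is cut off in the export
]
# Scenario 2, highest-energy points (figures from In [547] and In [548]).
scenario_2 = [
    (271.0, 426, 2000),
    (273.5, 376, 3000),
    (273.0, 231, 3000),
    (280.0, 249, 2000),
]

result_1 = evaluate_plan(scenario_1, demand_kwh=2_000_000, budget_eur=2_000_000)
result_2 = evaluate_plan(scenario_2, demand_kwh=3_000_000, budget_eur=3_000_000)
print(result_1)
print(result_2)
```

With these inputs scenario 1 meets its demand within budget, while scenario 2 falls short of 3 million kWh/a and exceeds 3 million €, which matches the conclusion the report draws.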
61. Conclusion
Scenario 1
We can easily fulfil the demand of 2 million kWh/a, and we do not even need the full 2 million € budget: the selected sites deliver 2,031,858 kWh/a for 1,384,000 €. We are also not using all of the proposed area. In "North-West", whose cheapest location is the most expensive of the four selected, we use only 1,500 m², half of the land available there, which saves money.
Scenario 2
First we used the points with the second-highest energy, which are cheaper than the highest-energy points, but they do not fulfil the energy demand (2,699,000 kWh/a against 3 million). We then took the points with the highest energy without considering cost, but this still does not fulfil the demand (2,741,500 kWh/a at a cost of 3,171,000 €). We need more land and a little more money.
62. Bibliography
Burkov, A. (2020). The Hundred-Page Machine Learning Book.