
Programming for big data


Analysis of user interaction with online ads on a major news website with Python and Apache Spark on IBM Bluemix


1. Higher Diploma in Data Analytics
Programming for Big Data Project
Alexandros Papageorgiou, Student ID: 15019004

Analysis I: The major factors determining the interest rate of a Lending Club loan request
Analysis II: Prediction of activity based on mobile phone spatial measurements
Analysis III: Analysis of user interaction with online ads on a major news website with Spark
2. Analysis I: The major factors determining the interest rate of a Lending Club loan request

Objectives of the analysis
Loans are commonplace nowadays, and advances in the finance industry have made requesting a loan a highly automated process. A major component of this process is the interest rate, which is determined from a number of factors drawn both from the applicant's credit history and from the data submitted with the application, such as employment history and creditworthiness scores (lendingclub.com, 2015). Determining the interest rate can be a complex task that requires advanced data analysis.
The purpose of this analysis is to identify associations between the interest rate and a number of other factors, based on the loan application data as well as data provided by external sources, in order to better understand how the interest rate is determined and to quantify these relationships. In particular, the study investigates which factors beyond FICO (the main measure of the applicant's creditworthiness) have an impact. Using exploratory analysis and standard multiple regression techniques, it is shown that there is a significant relationship between the interest rate and the FICO score, as well as two other variables: the amount requested and the length of the loan.

Dataset Description
The analysis uses a sample of 2,500 observations (rows) and 14 variables (columns) from the Lending Club website, downloaded using the R programming language (R-Core-Team, 2015). The variables, under their coded names, are the following:
• Amount.Requested: the amount (in dollars) requested in the loan application
• Amount.Funded.By.Investors: the amount (in dollars) loaned to the individual
• Interest.rate: the lending interest rate
• Loan.length: the length of time (in months) of the loan
• Loan.Purpose: the purpose of the loan as stated by the applicant
• Debt.to.Income.Ratio: the percentage of the consumer's gross income that goes towards paying debts
• State: the abbreviation for the U.S. state of residence of the loan applicant
• Home.Ownership: whether the applicant owns, rents, or has a mortgage on their home
• Monthly.income: the monthly income of the applicant (in dollars)
• FICO.range: a range indicating the applicant's FICO score, a measure of their creditworthiness
• Open.Credit.Lines: the number of open lines of credit the applicant had at the time of application
• Revolving.Credit.Balance: the total amount outstanding across all lines of credit
• Inquiries.in.the.Last.6.Months: the number of authorized credit queries in the 6 months before the loan was issued
• Employment.Length: the length of time the applicant has been employed at their current job
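The report loads and explores this dataset in R; purely as an illustration, the short pandas sketch below performs the equivalent loading and inspection, with the file name (loansData.csv) assumed rather than taken from the report.

```python
import pandas as pd

# Hypothetical local copy of the Lending Club sample; the report downloads it with R.
loans = pd.read_csv("loansData.csv")

print(loans.shape)         # expected: (2500, 14)
print(loans.dtypes)        # shows which variables arrived as text rather than numbers
print(loans.isna().sum())  # per-column count of missing values
```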
3. Challenge: data not in a tidy form
Exploratory analysis was the method used, via plots and relevant tables, to examine the quality of the data provided and to explore possible associations between the interest rate and the independent variables. This was done after handling the 7 missing values found, ensuring that the analysis is performed on complete cases only; given the small number of missing values, they were assumed to have no significant effect on the analysis.
Other data type transformations:
• several factor or character variables converted to numeric
• removal of the % symbol from the interest rate
• the FICO range converted into a single figure
• renaming of variables where appropriate
Rationale for the transformations: these changes enable more flexible handling of the data in R, in particular by bringing variables into numeric form. To relate the interest rate to its major components, a standard linear regression model was deployed. Model selection was performed on the basis of the exploratory analysis and of prior knowledge about the factors considered critical to the determination of the interest rate.

Data processing activities
As noted above, a minimal number of missing values were identified and, where appropriate, removed. Beyond that, the data was found to be within normal and acceptable ranges, with no extreme values in the interest rate or in the other independent variables. The final dataset was in line with the tidy data principles (Wickham, 2015).
As a first step in the exploratory analysis, a correlation analysis of all numeric variables was carried out to identify possible associations among them, and in particular the variables that correlate well with the interest rate. The results reveal a fairly strong negative correlation between the interest rate and the FICO score (r = -0.7), and some correlation with the amount requested and the amount funded (around r = 0.33). The correlations of the other variables with the interest rate were relatively low.
To carry the analysis further, the information provided on the Lending Club website was considered; it lists the following credit risk indicators as factors in the interest rate model:
• requested loan amount
• loan maturity (36 or 60 months)
• debt-to-income ratio
• length of credit history
• number of other accounts opened
• payment history
• number of other credit inquiries initiated over the past six months
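The cleaning steps above were carried out in R; the pandas sketch below mirrors them for illustration only, continuing from the loading sketch on the previous slide. Column names follow the report's coded names and may differ in casing from the downloaded file.

```python
# Keep complete cases only (the report drops the rows with missing values).
loans = loans.dropna()

# Remove the % symbol from the interest rate and convert it to a number.
loans["Interest.rate"] = (
    loans["Interest.rate"].astype(str).str.rstrip("%").astype(float)
)

# Collapse the FICO range (e.g. "720-724") into a single figure, here its lower bound.
loans["FICO.score"] = loans["FICO.range"].str.split("-").str[0].astype(float)

# Correlation of the numeric variables with the interest rate.
numeric = loans.select_dtypes("number")
print(numeric.corr()["Interest.rate"].sort_values())
```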
4. The FICO score is explicitly mentioned as a decisive factor, so in this context it will unavoidably be one of the variables that define the model. A number of experiments with box plots were performed to identify possible relationships between the interest rate and the categorical variables; what appears to have an impact on the interest rate is the length of the loan.
Overloading the model by including all the variables is not the optimal strategy (Cohen, 2009), so a selection had to be made based on the results of the correlation analysis, the box plots for the categorical variables, and the information provided on the website. After testing a number of models, the one found to fit this analysis best is the following:
Interest Rate = b0 + b1(FICO Score) + b2(Requested Amount) + b3(Length of the Loan) + e
where b0 is an intercept term, b1 represents the (negative) change in the interest rate for a one-unit change in FICO score, and b2 represents the impact on the interest rate of a one-dollar increase in the requested amount. The length of the loan is a two-level categorical variable representing the change in the interest rate when moving from a 36-month to a 60-month loan period, at average levels of the other two independent variables. The error term e represents all sources of unmeasured and un-modelled random variation (Stockburger, 2015).
For the length of the loan a set of dummy variables was created so that the R function can interpret the data more effectively, with "36 months" selected as the reference level. Regarding the amount, due to confounding concerns (the two amount variables correlate not only with the interest rate but, most obviously, with each other), only the amount requested was included in the regression model, since the funded amount directly depends on the amount originally requested.
We observed a highly statistically significant (P < 2e-16) association between the interest rate and the FICO score: a one-unit change in FICO score corresponded to a change of b1 = -8.9e-02 in the interest rate (95% confidence interval: -8.98e-02 to -8.51e-02). There was also an association between the interest rate and the amount requested (P < 2e-16): a one-unit change in the amount requested corresponded to a change of b2 = 1.446e-04 in the interest rate (95% confidence interval: 1.32e-04 to 1.57e-04). Finally, the intercept of 7.245e+01 is the projected interest rate when all predictors are set to zero, which corresponds to the 36-month loan period; when the loan-length dummy takes the value one, the estimate corresponds to the 60-month loan length (P < 2e-16).
The model has an adjusted R-squared of 0.7454, which is the proportion of variation explained by the model. A limited-scope analysis of the residuals, comparing the multiple regression against the simple linear regression model, shows that the multiple model better accounts for the non-random residual variance.
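The model above was fitted with R's linear regression routine; as an illustration only, the statsmodels sketch below fits the same specification on the cleaned pandas frame from the earlier sketches, with columns renamed for readability (the names and the Python workflow are assumptions, not the report's code).

```python
import statsmodels.formula.api as smf

# Short, formula-friendly aliases for the columns used in the model.
model_data = loans.rename(columns={
    "Interest.rate": "rate",
    "FICO.score": "fico",
    "Amount.Requested": "amount",
    "Loan.length": "length",
})

# Loan length is treated as a two-level categorical variable; with the labels
# "36 months" and "60 months", the default reference level is "36 months".
fit = smf.ols("rate ~ fico + amount + C(length)", data=model_data).fit()

print(fit.summary())     # coefficients, p-values, adjusted R-squared
print(fit.conf_int())    # 95% confidence intervals for each coefficient
```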
5. Conclusions
The analysis suggests that there is a significant negative association between the interest rate and the FICO score, as well as significant associations with the loan length and the amount requested. The relationships were estimated using a linear model.
This analysis provides some insight into the way a lending institution such as Lending Club determines the cost of money for its customers. It therefore makes sense for borrowers to be aware of the major factors that determine the interest rate they will be asked to pay and, based on this knowledge, to take action that could contribute to more favourable terms (for example, asking for a lower amount and repaying the loan sooner rather than later).
It is important to keep in mind that this study is based on a limited dataset from just one institution and might therefore be subject to bias. As time goes on, and depending on other parameters of the national and international economy, other factors may come to play critical roles too. In any case, an informed customer who is aware of this type of analysis is likely to make better decisions about his or her loan purchase.

Works Cited
Cohen, Y., 2009. Statistics and Data with R. s.l.: Wiley.
lendingclub.com, 2015. Interest rates and how we set them. [Online] Available at: https://www.lendingclub.com/public/how-we-set-interest-rates.action [Accessed 25 11 2015].
R-Core-Team, 2015. R: A language and environment for statistical computing. [Online] Available at: http://www.R-project.org
Stockburger, D. W., 2015. Multiple Regression with Categorical Variables. [Online] Available at: http://www.psychstat.missouristate.edu/multibook/mlt08m.html
Wickham, H., 2015. Tidy Data. [Online] Available at: http://vita.had.co.nz/papers/tidy-data.pdf
6. Title: Prediction of activity based on mobile phone spatial measurements

Introduction and Objectives
Advances in mobile phone technology and the proliferation of smart devices have enabled the collection of spatial data from smartphone users, with the intention of studying the relation between the measurements registered by the devices and the corresponding synchronous activity of the subjects. A data analysis methodology with a prediction model is used to determine user activity based on a wide range of signals related to body motion.
In particular, the analysis is based on the records of the Activity Recognition database, which was built from the recordings of 30 subjects performing Activities of Daily Living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors, including accelerometers and gyroscopes. The objective is the recognition of six different human activities based on the quantitative measurements of the Samsung phones.

Data Description
A group of 30 volunteers was selected for this task by the original research team. Each person was instructed to follow a predefined set of activities while wearing a waist-mounted smartphone. The six selected ADLs were standing, sitting, lying down, walking, walking downstairs and walking upstairs. The corresponding data set was downloaded from https://sparkpublic.s3.amazonaws.com/dataanalysis/samsungData.rda. The data was partly preprocessed by the authors of the Coursera data analysis course (https://github.com/jtleek/dataanalysis) to facilitate its use within the R environment.
The data consists of 7,352 entities (rows), each corresponding to a time-indexed activity of one of the 21 subjects present in the file, and 563 variables (columns) corresponding to measurements from two sensors. Specifically, for each record the data provides:
• acceleration from the accelerometer (total acceleration) and the estimated body acceleration (X, Y, Z axes)
• angular velocity from the gyroscope (X, Y, Z axes)
• various descriptive statistics based on the above measurements
Two additional pieces of information are included as variables:
• the corresponding activity
• the subject identifier

Data processing
The different activities are relatively evenly distributed, and the same applies to the observations for every subject, so no extreme imbalances were found in this respect.
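The report works on the .rda file directly in R; the sketch below, purely illustrative, assumes the same table has been exported to a CSV (samsungData.csv) with an "activity" label column and a "subject" id column, and checks the distributions described above.

```python
import pandas as pd

# Hypothetical CSV export of the samsungData table used in the report.
samsung = pd.read_csv("samsungData.csv")

print(samsung.shape)                       # expected: (7352, 563)
print(samsung["activity"].value_counts())  # how evenly the six activities are distributed
print(samsung["subject"].value_counts())   # observations available per subject
```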
7. All the columns referring to measurements are numeric, the subject is an integer and the activity is of character type. The activity is transformed to a factor to assist with processing in R, given that it is the dependent variable of this dataset.
Prior to the analysis, a number of additional data transformations had to take place. There are some issues: for example, a number of variables appear to have the same names but different values. Specifically, the bandsEnergy-related variables are repeated in sets of 3; for example, columns 303-316, 317-330 and 331-344 have the same column names. To fix this, the variables were renamed so as to avoid duplication and possible problems in the analysis of the data. These transformations were made in order to enable more flexible handling of the data in R. Moreover, the variable names were cleaned by removing punctuation such as the "( )" and "-" characters, to make them syntactically valid.
An unusual fact observed is that all the numeric data lie within a range of -1 to 1, but this turns out to be because the data is normalized. No missing values were found (only complete cases), and apart from the above no other data type transformations were found to be necessary.
The dataset was randomly split into two sets, for training and testing. The training set includes the subject ids 1, 3, 5 and 6, a total of 4 subjects corresponding to 328 observations. The test set includes the ids 27, 28, 29 and 30, likewise a total of 4 subjects, corresponding to 371 observations.

Results
The selected method for the analysis is classification trees. Trees are particularly useful when there are many explanatory variables. If the "twigs" of the tree are categorical, a classification tree is recommended, in order to partition the data into groups that are as homogeneous as possible (Pekelis, 2013).
The next decision is the selection of the variables to be included in the model. Given the very high number of columns in the dataset, it was decided that it would not be meaningful to examine each one individually. The first exploratory attempt was to fit a tree on the first variable set, related to body acceleration (tBodyAcc), which comprises the first 15 variables of the data set. A classification tree was grown on the training set and the predictive model was tested on the test data; the misclassification rate was as high as 41%, which led to the straightforward abandonment of this first set of variables.
As a next step, and considering again the nature of the variables, a more comprehensive approach was chosen: include all the variables (with the obvious exception of the subject id) and let the classification tree algorithm choose the critical nodes. The variables selected as nodes were:
"tBodyAcc.std.X", "tGravityAcc.mean.X", "tGravityAcc.max...Y", "tGravityAcc.mean.Y", "fBodyGyro.meanFreq.X", "tGravityAcc.arCoeff.X.1", "tBodyAccJerk.max.X", "tBodyGyroMag.arCoeff..1", "fBodyAcc.bandsEnergy.1.8", "tBodyAcc.arCoeff.X.3"
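The report grows the tree with R's tree package; continuing the illustrative Python sketch from the previous slide, an analogous split by subject id and tree fit with scikit-learn could look as follows (column and file names are assumptions).

```python
from sklearn.tree import DecisionTreeClassifier

# Split by subject id, as described in the report.
train = samsung[samsung["subject"].isin([1, 3, 5, 6])]
test = samsung[samsung["subject"].isin([27, 28, 29, 30])]

# Use every measurement column; exclude the label and the subject id.
features = [c for c in samsung.columns if c not in ("activity", "subject")]

tree = DecisionTreeClassifier(random_state=0)
tree.fit(train[features], train["activity"])

# Misclassification rate = 1 - accuracy, on the training and test sets.
print("training error:", 1 - tree.score(train[features], train["activity"]))
print("test error:    ", 1 - tree.score(test[features], test["activity"]))
```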
8. These variables correspond to various summary-statistic measurements from the two sensors. The misclassification error rate was about 3% on the training set, misclassifying the activity only nine times out of 328. The value of the model, however, is not proven until it is applied to the test set; there the prediction error reaches 19.6%, so the model is successful in predicting over 80% of the cases. The most significant nodes at the higher levels of the tree are the body acceleration standard deviation on X, the gravity acceleration mean on X and the gravity acceleration coefficient on X.
To check whether the tree has any potential to improve its performance, a cross-validation experiment was carried out to relate the deviance and the number of misclassifications to the size of the model. From the resulting graph, the model appears to have the lowest misclassification count and deviance at a size of 8. Given this result, the next step is to prune the tree by fixing the number of terminal nodes at 8. Fitting the 8-node model to the test data produces a 20.2% error rate, marginally higher than that of the previous model.

Conclusions
Based on the results of the data analysis, the classification tree method proved effective at handling a large number of variables which, due to their number and nature, could not have been analysed one by one. That said, there is definitely room for improvement, especially if an analysis can look deeper into the meaning of the variables and identify patterns and relationships between them that could guide variable selection for the model, instead of opting for the all-variables approach taken here. In the relatively straightforward approach adopted, no potential confounders were identified, as the model did not include any linear analysis. The main criterion for judging the model was accuracy, but additional ways of measuring error can be considered. Techniques that are likely to improve on the initial classification tree results, such as random forests, could also be used (Chen, 2009).

Works Cited
Chen, F., 2009. R Examples of Using Some Prediction Tools. [Online] Available at: stat.fsu.edu/~fchen/prediction-tool.pdf [Accessed 05 12 2015].
Pekelis, L., 2013. Classification And Regression Trees: A Practical Guide for Describing a Dataset. [Online] Available at: http://statweb.stanford.edu/~lpekelis/talks/13_datafest_cart_talk.pdf [Accessed 04 12 2015].
Theodoridis, Y., 1996. A model for the prediction of R-tree performance. ACM Digital Library. [Online] Available at: dl.acm.org/citation.cfm?id=237705 [Accessed 07 12 2015].
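The report prunes the tree to 8 terminal nodes using the R tree package's cross-validation and pruning functions; a rough scikit-learn analogue (an assumption, not the report's code) is to cap the number of leaves when refitting the tree from the previous sketch.

```python
# Refit the tree with the number of terminal nodes capped at 8,
# approximating the pruning step described in the report.
pruned = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0)
pruned.fit(train[features], train["activity"])

print("pruned test error:", 1 - pruned.score(test[features], test["activity"]))
```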
9. Title: Analysis of user interaction with online ads on a major news website

The dataset
The dataset is part of a sequence of files containing daily click-through data for online ads, based on user characteristics, as recorded on the New York Times website in May 2012. The datasets are available in the GitHub repository of the "Doing Data Science" book (Schutt, 2013); in this analysis only the first day of available data is considered. It contains over 458,000 observations of 5 variables:
• Age of user: numerical variable
• Gender: binary variable
• Signed_In: binary variable indicating whether the user was logged in or not
• Impressions: the number of ad impressions during the session
• Clicks: the number of click-throughs to one or more ads on the website
Every row corresponds to a user. It is, generally speaking, a simple low-dimensional dataset, which can nevertheless be used to conduct a basic analysis of user behaviour on the website in relation to the ad content shown on the site.

Configurations: setting up the environment
The platform used for this analysis is IBM Bluemix, which, via its integrated notebook interface, gives access to Apache Spark, an open-source in-memory data processing engine for cluster computing that shares some common ground with the Hadoop MapReduce programming framework. The dataset, in CSV format, is first uploaded as a new data source on Bluemix using the Object Storage service, which is associated with the required credentials. The first step is to define a function that sets the Hadoop configuration, taking the credentials as a parameter. Next, using the "insert code" function of the data sources panel, a dictionary with the credentials associated with the data source is created and then passed as an argument to the set-Hadoop-configuration function in order to activate the service. For the data processing activities that follow, PySpark, the Python API to Spark, is used; the data structure used throughout the analysis is the Spark DataFrame.

Algorithms, results, challenges
The algorithms used for the analysis are for the most part of the 'split-apply-combine' category: the data are grouped by an attribute, and a function is then applied to the grouped data, summarising the values within each group into a single value. This is in line with the MapReduce principle of creating key-value pairs, grouping by key so that all values sharing a key are collected into a sequence associated with that key, and then reducing that sequence to one aggregate value that represents all the values under the common key.
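The report does not reproduce its notebook code on this slide; the sketch below is only an indicative PySpark shape for the two setup steps described: a Hadoop-configuration helper fed with the Object Storage credentials dictionary, followed by loading the CSV into a DataFrame. The property names, the container and file names, and the spark-csv reader are assumptions that depend on the Bluemix Object Storage and Spark versions in use.

```python
def set_hadoop_config(credentials):
    """Register the Object Storage credentials with Spark's Hadoop configuration."""
    prefix = "fs.swift.service." + credentials["name"]
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + ".auth.url", credentials["auth_url"] + "/v3/auth/tokens")
    hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
    hconf.set(prefix + ".tenant", credentials["project_id"])
    hconf.set(prefix + ".username", credentials["user_id"])
    hconf.set(prefix + ".password", credentials["password"])
    hconf.set(prefix + ".region", credentials["region"])
    hconf.setBoolean(prefix + ".public", True)

# 'credentials' is the dictionary generated by the notebook's "insert code" helper.
credentials["name"] = "keystone"
set_hadoop_config(credentials)

# Read the uploaded CSV into a Spark DataFrame (spark-csv package, Spark 1.x style);
# the container and file names here are placeholders.
df = (sqlContext.read.format("com.databricks.spark.csv")
      .options(header="true", inferSchema="true")
      .load("swift://notebooks.keystone/nyt1.csv"))
```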
10. Although in Spark the map and reduce process differs from the Hadoop MapReduce implementation (Owen, 2014), specialised Spark functions such as reduceByKey, groupByKey and flatMap provide equivalent functionality. In Spark the above procedure is a transformation, which is lazily evaluated when an action is performed, i.e. when an answer from the system is explicitly requested.
• For example, the observations are grouped by gender, and a function is then applied to the groups that outputs the mean age by gender (22.9 for men and 40.8 for women).
• Other algorithms transform existing variables into new ones; for example, the number of clicks and the number of impressions are combined as a ratio to produce the click-through rate.
• In other cases, SQL-type analysis is deployed, for example to filter the observations, keeping only the subset that belongs to the 25-35 age group, and then focus the analysis on that specific segment.
• There are also implementations of summary statistics, including the count of observations (458,441).
• The mean values for the key variables are computed, for example the average age (29.4 years), the average number of impressions (5) and the average number of clicks (just below 0.1).
• The minimum reported age is 0 and the maximum 99; the maximum number of ad impressions is 9 and the maximum number of clicks is 4.
• The Pearson correlation between impressions and clicks is 0.13, a positive but relatively weak correlation, implying that more ad impressions do not always lead to more clicks.
The algorithms used, along with the full results, are presented in detail in the attached Jupyter notebook.
One of the challenges of working with large-scale data is the need to use distributed frameworks for the computation, which translates into the need for suitable data structures. In Spark the core data structure is the RDD (Resilient Distributed Dataset), essentially a collection of elements that can be partitioned across the nodes of a cluster. Data manipulation with RDDs is not as intuitive and expressive for data analysis as with other data structures. Given the introduction of Spark DataFrames, that was the data structure selected for the analysis: thanks to the named columns it supports, it makes the data analysis tasks more intuitive, and also more efficient in terms of computational speed, than plain RDDs.

Works Cited
Apache Spark, 2015. Spark Programming Guide. [Online] Available at: spark.apache.org/docs/latest/programming-guide.html
Owen, S., 2014. How-to: Translate from MapReduce to Apache Spark. [Online] Available at: https://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
Schutt, R., 2013. Doing Data Science. [Online] Available at: https://github.com/oreillymedia/doing_data_science
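As an illustration of the split-apply-combine operations described above, a hedged PySpark DataFrame sketch might look like the following. Column names are taken from the dataset description; the guard against zero impressions in the click-through rate is an added assumption, not part of the report.

```python
from pyspark.sql import functions as F

# Mean age by gender: group, then aggregate within each group.
df.groupBy("Gender").agg(F.avg("Age").alias("mean_age")).show()

# Derive the click-through rate from clicks and impressions,
# guarding against sessions with zero impressions.
with_ctr = df.withColumn(
    "CTR",
    F.when(F.col("Impressions") > 0,
           F.col("Clicks") / F.col("Impressions")).otherwise(0.0))

# SQL-style filtering: focus on the 25-35 age segment.
segment = with_ctr.filter((F.col("Age") >= 25) & (F.col("Age") <= 35))
segment.describe("Age", "Impressions", "Clicks").show()

# Overall count and the Pearson correlation between impressions and clicks.
print(df.count())
print(df.stat.corr("Impressions", "Clicks"))
```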
