1. Introduction to R
Regression Module
Pinaki M Mukherjee
Regression Module Introduction to R Pinaki M Mukherjee 1 / 42
2. Outline
1 About R software environment
2 Download and install R
3 R packages
4 Types of data in R & import data in R
5 Regression Modelling
6 Go to R Lab
7 Interpretation of R regression output
3. About R software environment
4. About R software environment
What is R?
R is a powerful software environment for statistical computing and graphics
R is free, open-source software licensed under the GNU General Public License
R runs on Linux, Mac and Windows
Users of R
Economists
Financial Analysts
Market Researchers
Academicians
Bioscientists
Many other professionals doing quantitative research
5. About R software environment
Why use R?
There are many statistical software packages, such as SAS, SPSS, EViews and Stata. Why use R?
Some of the very strong reasons
It's free
More than 2 million users around the world
High acceptability and recognition: extensively used by large corporate houses, business schools and universities (such as Stanford, Harvard, Johns Hopkins, Princeton, Washington University and MIT)
New features are being developed all the time
Active R community and free, up-to-date resources
6. About R software environment
Applications of R
Data sourcing
Data cleaning
Data structuring
Data warehousing
Statistical and econometric modelling
Report document generation
Preparing presentations
Automating reproducible research
7. Download and install R
8. Download and install R
Install R
Open www.r-project.org
Click the “download R” link
Select a CRAN location (a mirror site) and click the corresponding link.
Click on the “Download R for Windows” link at the top of the page
Click on the “install R for the first time” link at the top of the page.
Click “Download R for Windows” and save the executable file
Run the .exe file and follow the installation instructions
After installing R, you need to download and install RStudio. RStudio is an easy-to-use interface for R, loaded with many user-friendly and useful features. Like R, RStudio is also free.
9. Download and install R
Install RStudio
Open www.rstudio.com
Click on the “Download RStudio” button
Click on “Download RStudio Desktop”
Click on the version recommended for your system, or the latest
Windows version
Download and save the executable file
Run the .exe file and follow the installation instructions
11. R packages
About R packages
Packages are collections of R functions, data and compiled code in a well-defined format
There are more than 6000 packages available for R
However, we need only a handful of packages for our work
About 1% of the external packages are enough to efficiently execute 99% of the work
The packages are kept on dedicated servers maintained by the R community. In India, we have a server at IIT Madras
This network of servers is called CRAN (Comprehensive R Archive Network)
12. R packages
Install R packages
(You need a working internet connection to run this code successfully)
To install one specific package in R
On the R Console, type install.packages("forecast"), where "forecast" is the name of the package
Run the following code in the “Console” of RStudio to install only the most important packages relevant for us.
install.packages("ctv")
library("ctv")
install.views("Econometrics")
install.views("ReproducibleResearch")
install.views("Finance")
install.views("MachineLearning")
13. R packages
Some important commands
Load a package in the R session
library(forecast)
See the packages loaded in the R session
search()
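A common companion to library() and search() is checking whether a package is installed before loading it. The sketch below is an illustration only; it uses the stats package (shipped with R) as a stand-in so that it runs without an internet connection, and the variable name pkg is an assumption made for this example.

```r
# Name of the package to load; "stats" ships with every R installation
pkg <- "stats"

# requireNamespace() returns TRUE if the package is installed, FALSE otherwise
if (requireNamespace(pkg, quietly = TRUE)) {
  # character.only = TRUE lets library() accept the name stored in a variable
  library(pkg, character.only = TRUE)
} else {
  message("Package not installed: ", pkg)
}

# List everything attached to the current session
loaded <- search()
print(loaded)
```

The same pattern could be used with "forecast" in place of "stats" once that package has been installed.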
14. Types of data in R & import data in R
15. Types of data in R & import data in R
Vectors, Factors & Matrix
A vector is a collection of data elements of the same type (also called class in R). Common classes of data elements are
Character
Numeric
Logical
Date and
Integer
Factors are qualitative variables used extensively in modelling, for example, the interest rate change made by the RBI at different monetary policy review meetings
A matrix is a numeric vector arranged in two dimensions: rows and columns
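The classes listed above can be tried out directly in the console. The following is a minimal sketch; all values and variable names are made up for illustration and are not real policy data.

```r
# Vectors: every element of a vector shares one class
labels_chr <- c("meeting 1", "meeting 2", "meeting 3")   # character
rates_num  <- c(7.5, 8.0, 8.0)                           # numeric
counts_int <- c(1L, 2L, 3L)                              # integer (note the L suffix)
flags_lgl  <- c(TRUE, FALSE, FALSE)                      # logical
dates      <- as.Date(c("2014-01-01", "2014-02-01", "2014-03-01"))  # Date

class(labels_chr)   # "character"
class(rates_num)    # "numeric"
class(counts_int)   # "integer"
class(flags_lgl)    # "logical"
class(dates)        # "Date"

# Factor: a qualitative variable with a fixed set of levels
decision <- factor(c("hike", "hold", "hold", "cut"),
                   levels = c("cut", "hold", "hike"))
levels(decision)
table(decision)     # counts per level

# Matrix: a vector arranged in rows and columns
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)              # 2 rows, 3 columns
```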
16. Types of data in R & import data in R
Dataframe & List
Dataframe
Most frequently used format for statistical analysis
Unlike a matrix, it can store vectors of different classes
It can be created in R by simple data entry, or
It can be imported from external sources, such as datasets in csv files, Excel files etc.
A list is a collection of different objects, such as data frames. It may resemble a workbook with different sheets
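To make the difference between a data frame and a list concrete, here is a small sketch. The column names and values are invented for illustration; cars is a dataset that ships with R.

```r
# A data frame created by simple data entry; columns have different classes
df <- data.frame(
  medium = c("TV", "Radio", "Newspaper"),  # character
  budget = c(100, 50, 25),                 # numeric
  active = c(TRUE, TRUE, FALSE),           # logical
  stringsAsFactors = FALSE
)
str(df)            # structure: one line per column
sapply(df, class)  # class of each column

# A list can hold several data frames (like sheets in a workbook),
# and also any other kind of object
workbook <- list(
  sheet1 = df,
  sheet2 = head(cars),
  note   = "a list element can be any object"
)
names(workbook)
workbook$sheet1$budget   # access a column of a data frame stored in the list
workbook[["sheet2"]]     # access an element by name
```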
17. Types of data in R & import data in R
Import external data into R
From csv or txt file
read.csv("data.csv")
From Excel
install.packages("xlsx") (if the xlsx package is not already installed)
library(xlsx) (load the xlsx package into the R session)
read.xlsx("data.xlsx", sheetIndex = 1) (import the data in Sheet 1 of "data.xlsx")
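The import calls above need a file on disk. The sketch below creates small temporary files first so that it runs anywhere; the file contents come from the built-in cars dataset, which is an assumption made for this example and not the data used in the talk. It also shows read.delim() for a tab-separated txt file.

```r
# Write a small CSV file to a temporary location, then read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(head(cars), csv_path, row.names = FALSE)

dat <- read.csv(csv_path)   # the header row becomes the column names
str(dat)
head(dat, 3)

# The same data as a tab-separated text file
txt_path <- tempfile(fileext = ".txt")
write.table(head(cars), txt_path, sep = "\t", row.names = FALSE, quote = FALSE)

dat_txt <- read.delim(txt_path)   # read.delim() expects tab-separated values
str(dat_txt)
```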
19. Regression Modelling
Linear regression
Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X1, X2, ..., Xp is linear
True regression functions are never linear
Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically
20. Regression Modelling
Questions we might ask
Is there a relationship between the dependent and independent
variable?
How strong is the relationship between the dependent and independent
variable?
Which independent variables contribute to the dependent variable?
Is the relationship linear?
How accurately can we forecast the value of the dependent variable?
21. Regression Modelling
Simple linear regression using a single predictor X
We assume a model:
$$Y = \beta_0 + \beta_1 X + \epsilon$$
where $\beta_0$ and $\beta_1$ are two unknown constants that represent the intercept and slope, also known as coefficients or parameters, and $\epsilon$ is the error term
Given the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ of the model coefficients, we can forecast Y using the following equation
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
where $\hat{y}$ indicates a prediction of Y on the basis of X = x. The hat symbol denotes an estimated value.
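In R, a simple linear regression of this form can be fitted with lm(). The sketch below uses the built-in cars dataset (stopping distance against speed) as a stand-in, since the advertising data is not included here; x_new is an arbitrary value chosen for illustration.

```r
# Fit dist = beta0 + beta1 * speed + error by least squares
fit <- lm(dist ~ speed, data = cars)
coef(fit)   # estimated intercept and slope

# Extract the two estimates
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["speed"]]

# Forecast for a new value of the predictor, computed by hand
x_new <- 15
y_hat_manual <- b0 + b1 * x_new

# The same forecast using predict()
y_hat_predict <- predict(fit, newdata = data.frame(speed = x_new))

print(y_hat_manual)
print(y_hat_predict)
```

Both approaches give the same forecast; predict() is usually more convenient when there are many new observations.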
22. Regression Modelling
Estimation of the parameters by least squares
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
Let $\hat{y}_i$ be the prediction for Y based on the ith value $x_i$ of X
$e_i = y_i - \hat{y}_i$ represents the ith residual
We define the residual sum of squares (RSS) as
$$RSS = e_1^2 + e_2^2 + e_3^2 + \dots + e_n^2$$
The least squares approach chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the RSS
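For a single predictor, the least squares estimates have a well-known closed form: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of x, and the intercept follows from the means. The sketch below computes them by hand on the built-in cars dataset (an assumption for illustration) and checks that moving away from them increases the RSS. The helper rss() is defined here only for this example.

```r
x <- cars$speed
y <- cars$dist

# Closed-form least squares estimates for one predictor
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Residual sum of squares for any candidate intercept and slope
rss <- function(beta0, beta1) {
  residuals_i <- y - (beta0 + beta1 * x)
  sum(residuals_i^2)
}

rss_min   <- rss(b0, b1)       # RSS at the least squares estimates
rss_other <- rss(b0 + 1, b1)   # RSS after shifting the intercept by 1

# lm() arrives at the same estimates and the same RSS
fit <- lm(dist ~ speed, data = cars)
coef(fit)
deviance(fit)   # for an lm object, deviance() is the RSS

print(c(b0 = b0, b1 = b1, rss_min = rss_min, rss_other = rss_other))
```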
23. Regression Modelling
Simple regression model: The advertisement data
[Figure: scatter plot "Sales to TV Advertisement"; x-axis: TV Ad budget, y-axis: Sales]
26. Regression Modelling
Multiple regression using more than one predictor
We assume a model:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$$
where $\beta_0$, $\beta_1$ and $\beta_2$ are three unknown constants that represent the intercept and the two slopes, also known as coefficients or parameters, and $\epsilon$ is the error term
Given the estimates $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\beta}_2$ of the model coefficients, we can forecast Y using the following equation
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$$
where $\hat{y}$ indicates a prediction of Y on the basis of $X_1 = x_1$ and $X_2 = x_2$. The hat symbol denotes an estimated value.
27. Regression Modelling
Estimation of the parameters by least squares
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i}$$
Let $\hat{y}_i$ be the prediction for Y based on the ith value $x_{1i}$ of $X_1$ and the ith value $x_{2i}$ of $X_2$
$e_i = y_i - \hat{y}_i$ represents the ith residual
We define the residual sum of squares (RSS) as
$$RSS = e_1^2 + e_2^2 + e_3^2 + \dots + e_n^2$$
The least squares approach chooses $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\beta}_2$ to minimize the RSS
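A multiple regression with two predictors is fitted with the same lm() function; the predictors are joined with + in the formula. The sketch below uses the built-in trees dataset (Volume explained by Girth and Height) as a stand-in for the advertising data; the values in new_obs are arbitrary.

```r
# Fit Volume = beta0 + beta1 * Girth + beta2 * Height + error
fit2 <- lm(Volume ~ Girth + Height, data = trees)
b <- coef(fit2)
print(b)   # intercept and two slopes

# Forecast for one new observation, computed by hand
new_obs <- data.frame(Girth = 12, Height = 75)
y_hat_manual <- b[["(Intercept)"]] +
  b[["Girth"]]  * new_obs$Girth +
  b[["Height"]] * new_obs$Height

# The same forecast using predict()
y_hat_predict <- predict(fit2, newdata = new_obs)

# Residual sum of squares that least squares has minimized
rss2 <- sum(residuals(fit2)^2)

print(c(manual = y_hat_manual, rss = rss2))
```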
28. Regression Modelling
Multiple regression model: The advertisement data
[Figure: 3D plot "Adding elements"; axes: TV, Radio, Sales]
31. Go to R Lab
32. Go to R Lab
Go to R Lab: Objective of the Lab
Import external data
Correlation matrix
Estimating regression coefficients
Estimating error term/residuals
Print regression model summary
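The lab steps can be sketched end to end as follows. This is not the lab script itself: the built-in mtcars dataset stands in for the imported data, and the object names lab_data, cor_mat and res are assumptions made for this example. The model object is called reg, matching the name used in the export code later in the deck.

```r
# Step 1: data (a built-in dataset stands in for the imported file)
lab_data <- mtcars[, c("mpg", "wt", "hp")]

# Step 2: correlation matrix of all variables
cor_mat <- cor(lab_data)
print(round(cor_mat, 2))

# Step 3: estimate the regression coefficients
reg <- lm(mpg ~ wt + hp, data = lab_data)
print(coef(reg))

# Step 4: estimate the error term (residuals), one per observation
res <- residuals(reg)
head(res)

# Step 5: print the regression model summary
print(summary(reg))
```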
33. Interpretation of R regression output
34. Interpretation of R regression output
Accuracy of the estimated coefficient: Confidence interval
The standard error of an estimator reflects how it varies under repeated
sampling
These standard errors can be used to compute confidence intervals
A 95% confidence interval is defined as a range of values such that
with 95% probability, the range will contain the true unknown value of
the parameter.
It has the form $\hat{\beta}_1 \pm 2 \cdot SE(\hat{\beta}_1)$
That is, there is approximately a 95% chance that the interval
$$[\hat{\beta}_1 - 2 \cdot SE(\hat{\beta}_1),\ \hat{\beta}_1 + 2 \cdot SE(\hat{\beta}_1)]$$
will contain the true value of $\beta_1$
For the advertising data, the 95% confidence interval for $\beta_1$ is [0.042, 0.053]
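The approximate interval above can be compared with the interval R reports through confint(), which replaces the factor 2 with the exact quantile of the t distribution. The sketch below uses the built-in cars dataset as a stand-in for the advertising data.

```r
fit <- lm(dist ~ speed, data = cars)

# Estimate and standard error of the slope from the coefficient table
coef_table <- coef(summary(fit))
est <- coef_table["speed", "Estimate"]
se  <- coef_table["speed", "Std. Error"]

# Approximate 95% interval: estimate plus or minus 2 standard errors
ci_approx <- c(lower = est - 2 * se, upper = est + 2 * se)

# Exact 95% interval from confint()
ci_exact <- confint(fit, "speed", level = 0.95)

# confint() uses the t quantile with the residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))
ci_by_hand <- est + c(-1, 1) * t_crit * se

print(ci_approx)
print(ci_exact)
print(ci_by_hand)
```

With 48 residual degrees of freedom the t quantile is slightly above 2, so the exact interval is marginally wider than the approximation.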
35. Interpretation of R regression output
Hypothesis testing
Standard errors can also be used to perform hypothesis tests on the
coefficients
The most common hypothesis test involves testing
The null hypothesis of H0 : There is no relationship between X and Y
Vs
The alternative hypothesis of HA : There is some relationship between X
and Y
36. Interpretation of R regression output
Hypothesis testing: mathematical formulation
Testing
$$H_0 : \beta_1 = 0$$
versus
$$H_A : \beta_1 \neq 0$$
If $\beta_1 = 0$, X is not associated with Y
To test the null hypothesis, we compute a t-statistic, given by
$$t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}$$
Using R, it is easy to compute the probability of observing any value equal to |t| or larger in absolute value, assuming $\beta_1 = 0$. We call this probability the p-value.
If we see a small p-value, we can infer that there is an association between the predictor and the response. We reject the null hypothesis; that is, we declare that a relationship exists between X and Y
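The t-statistic and p-value can be reproduced by hand and compared with the values in the summary() table. The sketch below again uses the built-in cars dataset as a stand-in for the advertising data.

```r
fit <- lm(dist ~ speed, data = cars)
coefs <- coef(summary(fit))   # columns: Estimate, Std. Error, t value, Pr(>|t|)

# t-statistic for H0: beta1 = 0
t_stat <- (coefs["speed", "Estimate"] - 0) / coefs["speed", "Std. Error"]

# Two-sided p-value: probability of a value at least as large as |t|
# under a t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

print(t_stat)
print(p_value)

# The same numbers as reported by summary()
print(coefs["speed", c("t value", "Pr(>|t|)")])
```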
37. Interpretation of R regression output
Assessing the Overall Accuracy of the Model: $R^2$
R-squared, or the fraction of variance explained, is
$$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$$
TSS = total sum of squares, also called total variation
RSS = residual sum of squares, also called unexplained variation
Explained variation = total variation - unexplained variation
$R^2$ measures the proportion of variability in the dependent variable that can be explained using the independent variable
An $R^2$ statistic that is close to 1 indicates that a large proportion of the variability in the response has been explained by the regression
A number near 0 indicates that the regression did not explain much of the variability in the response
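R-squared can be computed directly from its definition and checked against the value stored in the model summary. The sketch below uses the built-in cars dataset as a stand-in for the advertising data.

```r
fit <- lm(dist ~ speed, data = cars)
y <- cars$dist

# Total variation: squared deviations of y from its mean
tss <- sum((y - mean(y))^2)

# Unexplained variation: squared residuals of the fitted model
rss <- sum(residuals(fit)^2)

# Fraction of variance explained
r2 <- 1 - rss / tss

print(r2)
print(summary(fit)$r.squared)   # value reported by R
```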
38. Interpretation of R regression output
Assessing the Overall Accuracy of the Model: $R^2$ (continued)
$R^2$ always lies between 0 and 1
However, it can still be challenging to determine what a good $R^2$ value is
It depends on the application
Physics: close to 1; a smaller value might indicate a serious problem
Biology, psychology: well below 0.1 might be more realistic
Economics and finance: well above 0.6 might be more acceptable
What is the value of $R^2$ in our data?
39. Interpretation of R regression output
Assessing the Overall Accuracy of the Model: F Test
$$F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}$$
n = number of observations
p = number of independent variables
Intuitively, if the model is a good fit, then the explained variation (TSS − RSS) will be high relative to the RSS
An F value higher than 1 is desired
Just how high depends on the sample size n and the number of independent variables p
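The F statistic can be computed from the formula and compared with the one reported by summary(). The sketch below uses a two-predictor model on the built-in mtcars dataset as a stand-in for the advertising data; the associated p-value comes from the F distribution with p and n - p - 1 degrees of freedom.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
y <- mtcars$mpg

n <- nrow(mtcars)   # number of observations
p <- 2              # number of independent variables (wt and hp)

tss <- sum((y - mean(y))^2)    # total variation
rss <- sum(residuals(fit)^2)   # unexplained variation

# F statistic: explained variation per predictor relative to
# unexplained variation per residual degree of freedom
f_stat <- ((tss - rss) / p) / (rss / (n - p - 1))

# Probability of an F value at least this large if no predictor matters
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

print(f_stat)
print(p_value)
print(summary(fit)$fstatistic)   # value, numerator df, denominator df
```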
40. Interpretation of R regression output
Answer to “Questions we might ask” in advertisement data
Is there a relationship between the dependent and independent variable?
Is there a relationship between advertising budget and sales?
How strong is the relationship between the dependent and independent
variable?
How strong is the relationship?
Which independent variables contribute to the dependent variable?
Which media contribute to sales?
How large is the effect of each medium on sales?
Is the relationship linear?
Is the relationship linear?
How accurately can we forecast the value of the dependent variable?
41. Interpretation of R regression output
Exporting the regression output
a <- capture.output(summary(reg))  # reg: the fitted model object, e.g. from lm()
cat(a, file = "trial.txt", sep = "\n", append = TRUE)  # one output line per file line
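A self-contained variant of the export step is sketched below. It fits a small model on the built-in cars dataset so that reg exists, and writes to a temporary directory instead of the working directory; both choices, and the names out_path and out_lines, are assumptions made for this example. writeLines() is used here as an alternative to cat() and overwrites the file rather than appending to it.

```r
# A fitted model to export (stand-in for the model from the lab)
reg <- lm(dist ~ speed, data = cars)

# Capture the printed summary as a character vector, one element per line
out_lines <- capture.output(summary(reg))

# Write the lines to a text file in the session's temporary directory
out_path <- file.path(tempdir(), "trial.txt")
writeLines(out_lines, out_path)

# Read the file back to confirm what was written
written <- readLines(out_path)
cat(written, sep = "\n")
```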
42. Interpretation of R regression output
Online free resources
R Cookbook : http://www.cookbook-r.com/
Try R: http://tryr.codeschool.com/
Video tutorials: http://www.twotorials.com/
I shall be glad to help you
Follow me on Google Plus and on my blog
Email: pinaki.economics@gmail.com
Mobile: +91 9818383989