A very quick introduction to the R language. It covers basic data structures, data manipulation steps, plots, control structures and more: enough material to get you started in R.
A high level introduction to R statistical programming language that was presented at the Chicago Data Visualization Group's Graphing in R and ggplot2 workshop on October 8, 2012.
1. 1
Introduction to R
What is R?
Getting Started
Data structures
Scalar (number, string, Boolean, date-time), Vector, Matrix, Data frame, List
Input / Output
Plots
Control Logic
Working with Strings
Writing Functions
Angshuman Saha
2. 2
What is R?
• R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.
• R can be downloaded and installed from the CRAN website (http://www.r-project.org/)
• CRAN stands for Comprehensive R Archive Network
• Installation comes with the base, stats and a few other packages. Beyond those, there are hundreds of contributed packages that enable users to perform a variety of specialized computations on data
4. 4
Getting Started
Double-click on the R icon on your desktop to start R. This launches the R GUI window.
1. Using the command prompt
At the command prompt you can type your code directly and hit Enter to run it. This, however, runs the code one line at a time.
2. Using external text files
You can use a standard text editor like Notepad to create your R code and save it in a text file. You can then copy the whole code from there and paste it into the R GUI window. This will run the whole code.
3. Using .r files
You may save your R code in a text file with extension ".r". You can then source this file to run the code. Use "File > Source R code" from the menu to do this. Alternatively, you may type the command source("D:/myFirstRcode.r") at the R prompt to run the code. The path of your R code file must be given within quotes when using source(); a full path always works, and a path relative to the current working directory works too.
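As a minimal sketch of running code from a .r file, the snippet below writes a two-line script to a temporary file and runs it with source(). The script contents and the names a, b and script_path are made up for illustration; tempfile() is used only so that the example does not depend on a D: drive being present.

```r
# Hypothetical two-line script, written to a temporary file for illustration
script_path = tempfile(fileext = ".r")
writeLines(c("a = 2 + 3", "b = a * 10"), script_path)

# source() reads the file and runs every line in it, in order
source(script_path)
print(b)   # 50
```

After source() returns, the variables created by the script (a and b here) are available in the workspace, just as if the lines had been typed at the prompt.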
6. 6
Vector > Creation
x = c(10, 12.3, 45) # create a vector of 3 numbers
x = c(FALSE, TRUE, TRUE, FALSE) # create a vector of 4 logical (boolean) values
x = c("red", "green", "blue") # create a vector of 3 strings
x = c(1:15) # create a vector of integers 1 to 15
x = 1:15 # equivalent to previous code
x = rep(5.6, 10) # repeat 5.6 ten times: a vector of length 10 with all entries equal to 5.6
x = rep(c(1,2), c(3,2)) # x = (1,1,1,2,2)
x = seq(10, 14, 2) # sequence from 10 to 14 in steps of 2. x = (10,12,14)
x = vector(mode="numeric", length=0) # initialize a zero-length numeric vector; values will be put inside it later
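The zero-length vector is meant to be filled in later. One way this might look is sketched below; the loop and the squared values are illustrative assumptions, not part of the original slides.

```r
x = vector(mode = "numeric", length = 0)   # empty numeric vector
for (i in 1:3) {
  x = c(x, i^2)                            # append one value at a time
}
print(x)     # 1 4 9

x[5] = 25    # assigning past the end pads the gap with NA
print(x)     # 1 4 9 NA 25
```

Appending with c() is fine for short vectors. When the final length is known upfront, creating the vector at full length and filling it by index avoids repeated copying.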
7. 7
Vector > Accessing Elements
x = c(10, 12.3 , 45, 55, 65, 75, 85) # create a vector
y=x[2] # y has value 12.3
y=x[c(5,6,7)] # y is a vector with 5th,6th and 7th value of x
y=x[ -c(5,6,7) ] # y is a vector with all but 5th,6th and 7th value of x
y=x[c(1,1,3,4,7,7)] # y = (10,10,45,55,85,85)
Vector > Naming
x = c(10, 45, 55 ) # create a vector
names(x) = c("first", "second", "third") # name the elements of x
y = x["second"] # y = 45. Elements can be accessed by name.
a = "third" ; y = x[a] # y = 55. The name can be passed through another variable
8. 8
Vector > operations
x = c(10, 45, 55 ) ; y = c(1, 5, 6 ) # create two vectors x and y
z = x + y # z=(11,50,61) . Element-wise addition
z = x - y # z=(9,40,49) . Element-wise subtraction
z = x * y # z=(10,225,330) . Element-wise multiplication
z = x / y # z=(10,9,9.16667) . Element-wise division
z = x ^2 # z=(100,2025,3025) . Element-wise squaring
z = x[x>20] # z=(45,55) . All elements of x that are >20
z= which(x>20) # z= (2,3). Indices of x where x>20
z1 = x[x>20] ; z2 = x[ which( x>20 ) ] ; u= which(x>20) ; z3=x[u]
# z1 z2 and z3 are all identical
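The equivalence of logical indexing and which() holds for the vector above. As a small sketch, the snippet below checks it and also shows one case where the two differ: when the vector contains NA. The vector w is a made-up example.

```r
x = c(10, 45, 55)
mask = x > 20                       # FALSE TRUE TRUE
idx = which(mask)                   # 2 3
same = identical(x[mask], x[idx])   # TRUE: both give (45, 55)

# With a missing value the two approaches differ
w = c(10, NA, 55)
by_mask = w[w > 20]                 # NA 55 : the NA comparison is carried through
by_which = w[which(w > 20)]         # 55    : which() keeps only the TRUE positions
```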
10. 10
Matrix > Creation
x = matrix( 10, nrow=3 , ncol = 5) # x is a 3 by 5 matrix with all entries = 10
Matrix can be created from a vector
x = 1:12 ; mat = matrix(x , nrow = 4 , ncol=3)
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
By default, numbers are stacked column-wise.
To change that, use byrow = TRUE
x = 1:12 ; mat = matrix(x, nrow = 4, ncol = 3, byrow = TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
Row and column names can be assigned
colnames(mat) = c("col1", "col2", "col3")
rownames(mat) = paste("rowID", 1:4, sep="_")
        col1 col2 col3
rowID_1    1    2    3
rowID_2    4    5    6
rowID_3    7    8    9
rowID_4   10   11   12
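To make the fill order concrete, here is a short sketch that builds the same matrix both ways and compares a single cell. The variable names by_col, by_row and same are illustrative only.

```r
x = 1:12
by_col = matrix(x, nrow = 4, ncol = 3)                 # default: filled column by column
by_row = matrix(x, nrow = 4, ncol = 3, byrow = TRUE)   # filled row by row

by_col[2, 1]   # 2
by_row[2, 1]   # 4

# Filling a 4-by-3 matrix by row gives the transpose of filling a 3-by-4 matrix by column
same = all(by_row == t(matrix(x, nrow = 3, ncol = 4)))   # TRUE
```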
11. 11
Matrix > Subsetting
Consider the matrix mat from the previous example
        col1 col2 col3
rowID_1    1    2    3
rowID_2    4    5    6
rowID_3    7    8    9
rowID_4   10   11   12
x = mat[2, ] # a vector containing the second row of mat
y = mat[ ,3] # a vector containing the third column of mat
x = mat["rowID_3", ] # third row of mat
x = mat[ ,"col2"] # second column of mat
newmat = mat[1:2, 2:3] # sub-matrix of mat
newmat = mat[c(1,2,4), c(1,3)] # sub-matrix of mat
diag_entries = diag(mat) # vector (1,5,9)
Row / column names can be changed
rownames(mat)[3] = "third" ; colnames(mat)[2] = "Nm2"
        col1 Nm2 col3
rowID_1    1   2    3
rowID_2    4   5    6
third      7   8    9
rowID_4   10  11   12
Set all values > 9 to 99
mat[mat > 9] = 99
        col1 Nm2 col3
rowID_1    1   2    3
rowID_2    4   5    6
third      7   8    9
rowID_4   99  99   99
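Selecting a single row or column returns a plain vector, as noted above. If a one-row matrix is wanted instead, the drop argument of the subsetting operator can be used. The sketch below is an illustrative addition, not part of the original slides.

```r
mat = matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE)

row_vec = mat[2, ]                  # plain vector (4, 5, 6); the dimensions are dropped
row_mat = mat[2, , drop = FALSE]    # stays a 1-by-3 matrix

is.null(dim(row_vec))   # TRUE
dim(row_mat)            # 1 3
```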
14. 14
Data Frame > Background
• A data frame can be thought of as a matrix where the columns may be of different types (e.g. text, date, number, logical)
• Most datasets we work with can be stored as a data frame
• Row / column subsetting works just like for matrices
• Row and column names can be assigned
15. 15
Data Frame > Creation
Data frames can be created by stacking individual vectors column-wise
cust = c("Bob", "John", "Jane")
age = c(67, 45, 52)
ownHouse = c(FALSE, FALSE, TRUE)
cust_dat = data.frame(Name = cust, Age = age, ownHouse = ownHouse)
  Name Age ownHouse
1  Bob  67    FALSE
2 John  45    FALSE
3 Jane  52     TRUE
Data frames can also be created by reading data from a csv
cust_dat = read.csv(file = "custData.csv", header = TRUE, stringsAsFactors = FALSE)
header = TRUE says that the 1st row of the file contains column names
stringsAsFactors = FALSE says not to convert character vectors to "factors"
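As a quick check that a data frame really holds columns of different types, the following sketch rebuilds the customer data and inspects it. The names col_types and mean_age are illustrative; stringsAsFactors = FALSE is passed explicitly so the result does not depend on the R version.

```r
cust_dat = data.frame(Name = c("Bob", "John", "Jane"),
                      Age = c(67, 45, 52),
                      ownHouse = c(FALSE, FALSE, TRUE),
                      stringsAsFactors = FALSE)

col_types = sapply(cust_dat, class)   # one type per column: character, numeric, logical
mean_age = mean(cust_dat$Age)         # a single column is accessed with $
print(col_types)
print(mean_age)
```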
16. 16
Data Frame > Creation
Two data frames can be stacked below each other
Consider two data frames, cust1 and cust2
cust1:
  Name Age ownHouse
1  Bob  67    FALSE
2 John  45    FALSE
3 Jane  52     TRUE
cust2:
  Name Age ownHouse
1 Bill  55     TRUE
2 Jack  75     TRUE
3  Deb  49     TRUE
cust = rbind(cust1, cust2)
  Name Age ownHouse
1  Bob  67    FALSE
2 John  45    FALSE
3 Jane  52     TRUE
4 Bill  55     TRUE
5 Jack  75     TRUE
6  Deb  49     TRUE
A new data frame can be created by subsetting an existing data frame
cust = cust[cust$Age > 60, ]
  Name Age ownHouse
1  Bob  67    FALSE
5 Jack  75     TRUE
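Note that the subset keeps the original row names (1 and 5). The sketch below reproduces the stacking and subsetting and then renumbers the rows; resetting row names this way is an optional extra step, shown here as an assumption about what a reader might want.

```r
cust1 = data.frame(Name = c("Bob", "John", "Jane"), Age = c(67, 45, 52),
                   ownHouse = c(FALSE, FALSE, TRUE), stringsAsFactors = FALSE)
cust2 = data.frame(Name = c("Bill", "Jack", "Deb"), Age = c(55, 75, 49),
                   ownHouse = c(TRUE, TRUE, TRUE), stringsAsFactors = FALSE)

cust = rbind(cust1, cust2)          # 6 rows
older = cust[cust$Age > 60, ]       # Bob and Jack
rn_before = rownames(older)         # "1" "5"

rownames(older) = NULL              # renumber the rows from 1
rn_after = rownames(older)          # "1" "2"
```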
17. 17
Data Frame > Creation
An empty data frame can be created by specifying column names and types. It can be populated later.
cust0 = data.frame(Name = character(0), Age = numeric(0), ownHouse = logical(0))
[1] Name Age ownHouse
<0 rows> (or 0-length row.names)
An empty data frame can be created from an existing data frame
cust0 = cust[0,]
[1] Name Age ownHouse
<0 rows> (or 0-length row.names)
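One possible way to populate an empty data frame later is to stack new rows onto it with rbind(). This is a sketch under that assumption; the row values and the name new_row are made up.

```r
cust0 = data.frame(Name = character(0), Age = numeric(0), ownHouse = logical(0),
                   stringsAsFactors = FALSE)

# A one-row data frame with the same column names and types
new_row = data.frame(Name = "Bob", Age = 67, ownHouse = FALSE,
                     stringsAsFactors = FALSE)

cust0 = rbind(cust0, new_row)
nrow(cust0)   # 1
```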
18. 18
Data Frame > Creation
Two data frames can be merged by a common column
By default, only common records are returned. Using the options all, all.x and all.y, different record sets are obtained. Records may contain missing values.
cust1:
  Name Age ownHouse
1  Bob  67    FALSE
2 John  45    FALSE
3 Jane  52     TRUE
cust2:
  Name PetCount hasCar
1  Bob        1   TRUE
2 John        0  FALSE
3 Jill        5   TRUE
cust = merge(cust1, cust2, by = "Name")
  Name Age ownHouse PetCount hasCar
1  Bob  67    FALSE        1   TRUE
2 John  45    FALSE        0  FALSE
cust = merge(cust1, cust2, by = "Name", all = TRUE)
  Name Age ownHouse PetCount hasCar
1  Bob  67    FALSE        1   TRUE
2 Jane  52     TRUE       NA     NA
3 Jill  NA       NA        5   TRUE
4 John  45    FALSE        0  FALSE
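The slides show the default merge and all = TRUE. As a sketch of the remaining two options, the snippet below uses the same two data frames with all.x and all.y; the result names keep_left and keep_right are illustrative.

```r
cust1 = data.frame(Name = c("Bob", "John", "Jane"), Age = c(67, 45, 52),
                   ownHouse = c(FALSE, FALSE, TRUE), stringsAsFactors = FALSE)
cust2 = data.frame(Name = c("Bob", "John", "Jill"), PetCount = c(1, 0, 5),
                   hasCar = c(TRUE, FALSE, TRUE), stringsAsFactors = FALSE)

# all.x = TRUE keeps every record of cust1; Jane gets NA for PetCount and hasCar
keep_left = merge(cust1, cust2, by = "Name", all.x = TRUE)

# all.y = TRUE keeps every record of cust2; Jill gets NA for Age and ownHouse
keep_right = merge(cust1, cust2, by = "Name", all.y = TRUE)
```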
20. 20
List > Background
• A list can be thought of as a vector whose elements may be of different types
[Figure: a list holding a vector, a matrix and another list]
21. 21
List > Creation
An empty list
mylist = list() # nothing is known about the list
mylist = vector(mode="list", length=5) # length is known upfront
Non-empty list
mylist = list(c(1,5,7), "abc", matrix(0,3,3))
List with names
mylist = list(comp1 = c(1,5,7), comp2 = "abc", comp3 = matrix(0,3,3))
22. 22
List > Accessing the entries
By Index
mylist = list(c(1,5,7), "abc", matrix(0,3,3))
x = mylist[[1]] # x is a vector (1,5,7)
x = mylist[[2]] # x is a string "abc"
x = mylist[[3]] # x is a 3-by-3 matrix of zeros
By Name
mylist = list(comp1 = c(1,5,7), comp2 = "abc", comp3 = matrix(0,3,3))
x = mylist$comp1 # x is a vector (1,5,7)
x = mylist$comp2 # x is a string "abc"
x = mylist$comp3 # x is a 3-by-3 matrix of zeros
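The examples above use double brackets. Single brackets also work on lists but return something different, which is a common source of confusion. The sketch below contrasts the two; the variable names are illustrative.

```r
mylist = list(comp1 = c(1, 5, 7), comp2 = "abc", comp3 = matrix(0, 3, 3))

as_list = mylist[1]              # single brackets: a list of length 1 containing the vector
as_vec = mylist[[1]]             # double brackets: the vector (1, 5, 7) itself
by_name = mylist[["comp2"]]      # double brackets also accept a name, same as mylist$comp2

class(as_list)   # "list"
class(as_vec)    # "numeric"
```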
23. 23
List > Updating entries
By Index
mylist = list(comp1 = c(1,5,7), comp2 = "abc", comp3 = matrix(0,3,3))
mylist[[4]] = 1024 # create a new entry at the 4th position: the number 1024
mylist = mylist[-3] # drop the third entry from mylist
mylist[[2]] = "New Entry" # update the second entry
By Name
mylist$comp99 = 1024 # create a new entry at the end of the list, named "comp99"
mylist$comp1 = c(10,10) # update the entry "comp1"
Renaming components
mylist = list(comp1 = c(1,5,7), comp2 = "abc", comp3 = matrix(0,3,3))
names(mylist) # returns the vector ("comp1", "comp2", "comp3")
names(mylist) = c("A", "B", "C") # change the names of the components
names(mylist)[2] = "second" # change only the name of the second component
Subsets
newlist = mylist[c(1,3,4)] # new list contains the first, third and fourth entry of mylist (mylist needs at least four entries for this)
25. 25
Data Structure: Dates
Sys.time() # returns the current system date and time
String to Date-time
x = strptime("02-07-2012", format="%m-%d-%Y")
x = strptime("02-feb-2012", format="%d-%b-%Y")
x = strptime("02-feb-2012 15:45:10", format="%d-%b-%Y %H:%M:%S")
Date-time to String
x = Sys.time() # on typing x in the console you see: "2012-06-22 11:44:01 IST"
y = strftime(x, format="%d-%b-%Y") # "22-Jun-2012"
y = strftime(x, format="date: %d-%b-%Y >> Time: %H+%M+%S") # "date: 22-Jun-2012 >> Time: 11+44+01"
y = strftime(x, format="%d-%b-%Y %a >> Time: %H hour %M min %S sec") # "22-Jun-2012 Fri >> Time: 11 hour 44 min 01 sec"
Study the R help on date-time variables (?strptime) to learn about the large number of possible format options
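A small round trip from string to date-time and back can be sketched as follows. It assumes a fixed time zone (tz = "UTC") and purely numeric format codes so that the result does not depend on the machine's locale; format() is used for the conversion back to a string, and strftime() is a wrapper around the same formatting.

```r
# String to date-time
x = strptime("02-07-2012 15:45:10", format = "%m-%d-%Y %H:%M:%S", tz = "UTC")
class(x)    # "POSIXlt" "POSIXt"

# Date-time to string, in a different layout
y = format(x, "%Y/%m/%d %H:%M")
print(y)    # "2012/02/07 15:45"
```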
26. 26
Data Structure: Dates
Two main (internal) formats for date-time are POSIXct and POSIXlt
POSIXct: a short format of date-time, typically used to store date-time columns in a data frame
POSIXlt: a long format of date-time, from which various sub-units of time can be extracted
Examples
x = Sys.time() # on typing x in the console you see: "2012-06-22 11:44:01 IST"
y = as.POSIXlt(x) # convert from POSIXct to POSIXlt
z = c(y$mon, y$year, y$hour, y$min, y$wday) # z = (5, 112, 11, 44, 5); months count from 0 and years from 1900
difftime
x1 = strptime("02-07-2012 14:20:34", format="%m-%d-%Y %H:%M:%S")
x2 = strptime("11-07-2012 14:20:34", format="%m-%d-%Y %H:%M:%S")
y = x2 - x1 # y is a difftime object
x1 + as.difftime(1, units="days") # "2012-02-08 14:20:34 IST"
x1 + as.difftime(10, units="mins") # "2012-02-07 14:30:34 IST"
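The difference between the two dates above can be worked out explicitly. The sketch below uses the same two time stamps but fixes the time zone to UTC, which is an assumption made so that daylight-saving rules cannot affect the result; difftime() is called directly to choose the units.

```r
x1 = as.POSIXct("2012-02-07 14:20:34", tz = "UTC")
x2 = as.POSIXct("2012-11-07 14:20:34", tz = "UTC")

gap = difftime(x2, x1, units = "days")
gap_days = as.numeric(gap)                    # 274

later = x1 + as.difftime(10, units = "mins")
later_clock = format(later, "%H:%M:%S")       # "14:30:34"

lt = as.POSIXlt(x1)                           # long format, for extracting sub-units
parts = c(lt$mon, lt$year + 1900, lt$min)     # (1, 2012, 20): February is month 1
```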
28. 28
Data Structures: Others
Other than the data structures described so far, there are a few very useful data types.
NULL
NULL is typically used for initializing variables. The code "x = NULL" creates a variable x of length zero. It can later be overwritten with other values. The function is.null() returns TRUE or FALSE and tells whether a variable is NULL or not.
NA
NA is used for denoting missing values. The code "x = NA" creates a variable x holding a missing value. The function is.na() returns TRUE or FALSE and tells whether a value is NA or not.
NaN
NaN stands for "Not a Number". The code "x = sqrt(-10) ; y = log(-10)" sets the values of x and y to NaN and also prints a warning message in the console. The function is.nan() lets you check whether the value of a variable is NaN or not.
Inf
Inf stands for "Infinity". The code "x = 10/0 ; y = -3/0" sets the value of x to Inf and y to -Inf. The function is.infinite() lets you check whether a value is infinite; is.finite() checks the opposite, and is also FALSE for NA and NaN.
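The checks described above can be tried out directly. This is a minimal sketch; the vector v and the other variable names are made up for illustration, and 0/0 is used to produce NaN because it does not print a warning.

```r
x = NULL
null_check = c(is.null(x), length(x) == 0)     # TRUE TRUE

v = c(1, NA, 3)
na_flags = is.na(v)                            # FALSE TRUE FALSE
plain_mean = mean(v)                           # NA: a missing value propagates
clean_mean = mean(v, na.rm = TRUE)             # 2: missing values are left out

nan_check = is.nan(0/0)                        # TRUE
finite_flags = is.finite(c(1, Inf, NA, NaN))   # TRUE FALSE FALSE FALSE
inf_check = is.infinite(-3/0)                  # TRUE
```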
30. 30
Input
Read data (row-column format) from a csv file
x = read.csv(file = "D:/mydata.csv", header = TRUE, stringsAsFactors = FALSE)
# x is a data frame containing the data in the csv
Read data (row-column format) from a delimited file
x = read.table(file = "D:/mydata.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)
# x is a data frame containing the data in the csv
# read.csv is a special case of read.table with sep=","
# In read.table you may specify a separator of your choice
Reading arbitrary data using a lower level function: scan()
scan() reads the contents of a file item by item into a vector or list, which gives finer control than read.table().
These functions have many more optional input arguments that let the user control the way in which data is read.
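To try the two reading functions without needing a file on D:, the sketch below first writes a tiny csv to a temporary location and then reads it back both ways. The file contents and the name csv_path are assumptions made for the example.

```r
# Hypothetical three-line csv, written to a temporary file for illustration
csv_path = tempfile(fileext = ".csv")
writeLines(c("Name,Age,ownHouse", "Bob,67,FALSE", "Jane,52,TRUE"), csv_path)

x = read.csv(file = csv_path, header = TRUE, stringsAsFactors = FALSE)
y = read.table(file = csv_path, sep = ",", header = TRUE, stringsAsFactors = FALSE)

str(x)   # 2 observations of 3 variables; the column types are guessed from the data
```

Both calls should give the same data frame here, since read.csv() is read.table() with csv-friendly defaults.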
31. 31
Output
Write a data frame to a file on disk
# Assume: x is a data frame
# write.csv() writes it to a csv file on disk
# (column names are always written; write.csv() ignores col.names with a warning)
write.csv(x, file = "D:/out.csv", row.names = FALSE, na = "")
# write.table() writes it to a delimited file with a user-specified separator.
# write.csv() is a special case of write.table()
write.table(x, file = "D:/out.txt",
row.names = FALSE, col.names = TRUE, na = "", sep = "\t")
Write an R object in the R workspace to disk
# Assume: x is an object in R workspace
save(x, file = "D:/out.RData")
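To check that written files can be read back, here is a minimal round-trip sketch. It assumes temporary files instead of the D:/ paths, and the data frame x is made up for the example. An object written with save() is restored under its original name with load().

```r
x = data.frame(id = 1:3, value = c(2.5, NA, 4), site = c("A", "B", "C"))

# Tab-separated text file; missing values are written as empty fields
f_txt = tempfile(fileext = ".txt")
write.table(x, file = f_txt, row.names = FALSE, col.names = TRUE,
            na = "", sep = "\t")
x_txt = read.table(f_txt, header = TRUE, sep = "\t")

# save() stores the object together with its name; load() restores it
f_rda = tempfile(fileext = ".RData")
save(x, file = f_rda)
x_copy = x
rm(x)
load(f_rda)            # recreates the object x in the workspace
```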
Plots – xy plot
x = rnorm(100, mean = 2 , sd = 2)
y = rnorm(100, mean = 10 , sd = 1)
plot(x,y,
xlab = "x-variable" , ylab = "y-variable",
main = "scatter plot example" ,
pch = 19 , cex= 0.7, col="blue")
[Figure: x-y scatter plot, with the title set by main and the axis labels set by xlab and ylab]
A large number of options are available to control axes, tick
marks, axis labels, legends, font type and size, etc.
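As a hedged sketch of a few of these options, the example below suppresses the default axes, draws custom ones with axis() and adds a legend with legend(). The tick positions and legend text are arbitrary choices for illustration.

```r
x = rnorm(100, mean = 2, sd = 2)
y = rnorm(100, mean = 10, sd = 1)

# Suppress the default axes, then draw custom ones
plot(x, y, pch = 19, col = "blue", axes = FALSE,
     xlab = "x-variable", ylab = "y-variable", main = "custom axes")
axis(side = 1, at = seq(-6, 10, by = 2))   # x axis with ticks every 2 units
axis(side = 2, las = 1)                    # y axis with horizontal tick labels
box()                                      # frame around the plot region

# legend() returns (invisibly) the position of the legend box and text
lg = legend("topright", legend = "simulated points", pch = 19, col = "blue")
```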
Plots - overlay
x = rnorm(100, mean = 2 , sd = 2)
y = rnorm(100, mean = 10 , sd = 1)
# Generate a plot
plot(x, y, xlab = "x-variable", ylab = "y-variable",
main = "scatter plot example", pch = 19, cex = 0.7, col = "blue")
# Add red points later
x1 = rnorm(30, mean = 0 , sd = 1)
y1 = rnorm(30, mean = 12 , sd = 0.5)
points(x1,y1,pch = 15 , col="red" , cex=1)
Plots – multi panel plot
x = rnorm(100, mean = 2 , sd = 2)
y = rnorm(100, mean = 10 , sd = 1)
par(mfrow=c(2,2))
plot(x, y, xlab = "x-variable", ylab = "y-variable",
main = "scatter plot example", pch = 19, cex = 0.7, col = "blue")
hist(x, xlab = "x-variable" , ylab = "frequency",
main = "histogram-x" , col = "grey",
border="blue" , lwd=2 )
hist(y, xlab = "y-variable" , ylab = "frequency",
main = "histogram-y" , col = "grey",
border="blue" , lwd=2 )
plot(density(x),col="limegreen",lwd=2,
xlab="x",ylab="density",main="density plot")
par(mfrow = c(2,2)) splits the plot region into a 2-by-2 matrix.
The next 4 plot commands fill it row by row, creating plots in cells (1,1), (1,2), (2,1), (2,2).
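The panel layout stays in effect for later plots on the same device. A common pattern, sketched here with a 1-by-2 layout chosen only for illustration, is to keep the value returned by par() and restore it afterwards:

```r
x = rnorm(100, mean = 2, sd = 2)

# par() returns the previous settings, so they can be restored afterwards
old_par = par(mfrow = c(1, 2))
hist(x, main = "histogram-x", col = "grey")
plot(density(x), main = "density plot")
par(old_par)     # back to the previous layout
```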
Plots – saving to a file
x = rnorm(100, mean = 2 , sd = 2)
y = rnorm(100, mean = 10 , sd = 1)
png(file = "D:/testplots.png")
par(mfrow=c(2,2))
plot(x, y, xlab = "X", ylab = "Y", main = " ", pch = 19, cex = 0.7, col = "blue")
plot(0, 0, type = "n", axes = FALSE, xlab = "", ylab = "", main = "")
text(0, 0, "NO DATA")
hist(y, xlab = "Y", ylab = "frequency", main = "histogram-y",
col = "grey", border = "blue", lwd = 2)
plot(density(x), col = "limegreen", lwd = 2,
xlab = "x", ylab = "density", main = "density plot (X)")
dev.off()
The code creates a 2-by-2 panel of plots and saves it in a png file
at the location D:/testplots.png.
Nothing is drawn on screen, and the file is complete only after dev.off() is called.
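Other file devices follow the same open, plot, close pattern. The sketch below uses pdf() instead of png() and writes to a temporary directory. The file name is hypothetical, and width and height are given in inches.

```r
out_file = file.path(tempdir(), "testplot.pdf")   # hypothetical output location

pdf(file = out_file, width = 6, height = 4)   # open the file device
plot(1:10, (1:10)^2, type = "b", xlab = "X", ylab = "Y", main = "saved plot")
dev.off()                                     # close it; the file is now complete

saved = file.exists(out_file)
```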
Control
while ()
# Generate k random numbers from N(0,1)
# k is not fixed a priori.
# Stop when the sum of the values exceeds 5
x = NULL ; stopIter = FALSE
while (!stopIter) {
x = c(x, rnorm(1, mean = 0, sd = 1))
sumx = sum(x)
if (sumx > 5) { stopIter = TRUE }
}
for ()
# Example of for loop
x = rnorm(100) ; y = rep(0, length(x))
for (i in seq_along(x)) { y[i] = x[i]^3 }
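For element-wise calculations like the cube above, a loop is often not needed, because arithmetic operators in R work on whole vectors. A minimal sketch comparing the two approaches (set.seed() is added only to make the random numbers reproducible):

```r
set.seed(1)                      # makes the random numbers reproducible
x = rnorm(100)

# Loop version
y_loop = rep(0, length(x))
for (i in seq_along(x)) { y_loop[i] = x[i]^3 }

# Vectorized version: the operator is applied to every element at once
y_vec = x^3

same = all.equal(y_loop, y_vec)  # TRUE
```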
Working with Strings
x = nchar("WRA data Filtering") # counts number of characters – x = 18 in this case
MetID = 2 ; x = paste("Met", MetID, sep = ":") # string concatenation – x = "Met:2"
x = substr("Met 12", start = 1, stop = 5) # substring from position 1 to 5 – x = "Met 1"
x = strsplit("Met1 has no data", split = " ") # splits the string by " ". Returns a list
y = unlist(x) # y is a vector with 4 elements – "Met1" , "has", "no", "data"
x = sub(pattern = "Met1", replacement = "Met2", x = "Met1 is empty")
# replaces the first match – x = "Met2 is empty"
x = gsub("Met1", "Met2", x = "Met1 is empty. Met1 has no data.")
# replaces all matches – x = "Met2 is empty. Met2 has no data."
x = c("red", "Blue", "green", "skyblue")
y = grep(pattern = "blue", x = x, ignore.case = TRUE) # y = (2,4) – positions of matches
z = grep(pattern = "blue", x = x, ignore.case = TRUE, value = TRUE)
# z = ("Blue", "skyblue") – returns the actual strings that match the pattern
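These functions also work on whole vectors of strings, which makes them easy to combine. A small sketch, assuming some made-up labels of the form "Met:&lt;number&gt;":

```r
labels = c("Met:1", "Met:12", "Met:7")       # hypothetical labels

parts = strsplit(labels, split = ":")        # list with one character vector per label
ids = as.numeric(sapply(parts, function(p) p[2]))   # 1, 12, 7

n = nchar(labels)                            # 5, 6, 5
new_labels = paste("Tower", ids, sep = "_")  # "Tower_1", "Tower_12", "Tower_7"
```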
Regular Expressions
x = c("ht_10m", "ht:20m", " ht_30m")
y = gsub("^ht_", "HT:", x) # y = ("HT:10m" , "ht:20m" , " ht_30m")
# Replace "ht_" at the beginning of the string with "HT:"
y = gsub("m$", "mtr", x) # y = ("ht_10mtr" , "ht:20mtr" , " ht_30mtr")
# Replace "m" at the end of the string with "mtr"
y = gsub("[0-9]+", "XXX", x) # y = ("ht_XXXm" , "ht:XXXm" , " ht_XXXm")
# Replace one or more consecutive digits with "XXX"
y = gsub("_[0-9]+", "XXX", x) # y = ("htXXXm" , "ht:20m" , " htXXXm")
# Replace one or more digits preceded by "_" (together with the "_") with "XXX"
ok = grepl("^ht_[0-9]+m", x) ; y = x ; y[!ok] = "invalid!"
# y = ("ht_10m" , "invalid!" , "invalid!")
# Used for checking the validity of format of a string
# (grepl() returns TRUE/FALSE per element, so this also works when nothing matches)
Regular expressions provide a vast number of options for manipulating
strings. See the R help page ?regex to learn more.
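Two further patterns that are often useful are extracting the matched text and reusing part of a match in the replacement. The sketch below is only illustrative. It uses the same x as above and first trims the leading blank so that anchored patterns match all elements.

```r
x = c("ht_10m", "ht:20m", " ht_30m")

# Remove leading blanks so that " ht_30m" matches patterns anchored with ^
x_trim = sub("^ +", "", x)                    # "ht_10m" "ht:20m" "ht_30m"

# Extract the first run of digits from each string
m = regexpr("[0-9]+", x_trim)
heights = as.numeric(regmatches(x_trim, m))   # 10 20 30

# Capture group: \\1 refers to the text matched inside the parentheses
y = sub("^ht_([0-9]+)m$", "\\1 metres", x_trim)
# y = ("10 metres" , "ht:20m" , "30 metres")
```

Strings that do not match the pattern, such as "ht:20m", are returned unchanged by sub().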
Further Help on R
- http://cran.r-project.org/
- http://www.r-project.org/search.html
This page provides links to search engines specific to R
- Search for “R tutorial” , “R forum” …
Have fun exploring
the world of R