Create a correlation plot from joined tables and lag times

Join tables, create lag times, and base correlation plots upon lag times
Using the dplyr and corrplot packages in R
by Doug Loqa

Goal: Create a correlation plot between columns of 2 related tables
Key Price1 Price2 Price3
1 .. .. ..
2 .. .. ..
3 .. .. ..
4 .. .. ..
Key Demand1 Demand2 Demand3
1 .. .. ..
2 .. .. ..
3 .. .. ..
4 .. .. ..
• Subgoal1: Combine 2 tables
• Subgoal2: Create lag times
• Subgoal3: Create correlation table
• Subgoal4: Create a correlation plot
Cost1 Demand1 Demand1 lag Demand1 2 lags
.. .. .. ..
.. .. .. ..
Cost1 Demand1 Demand1 lag
Cost1 1 .. ..
Demand1 .. 1 ..
Demand1 lag .. .. 1

Install your packages, and input your tables
• Call:
install.packages("dplyr")
install.packages("corrplot")
• Call:
library(dplyr)
library(corrplot)
• Create 2 tables, one based on cost and one based on demand:
cost <- read.csv(first, header = TRUE, stringsAsFactors = FALSE)
(here first stands for the path to your cost file)
Set up a demand table the same way.
Cost
• ……
• ……
Demand
• ……
• ……
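As a minimal sketch of the loading step, the snippet below reads two small tables from inline text instead of files. The column names follow the Goal slide; the numbers are made up purely for illustration, and cost_csv and demand_csv are hypothetical stand-ins for your own file paths.

```r
# Hypothetical inline data standing in for the two CSV files.
cost_csv <- "Key,Price1,Price2,Price3
1,10,5,7
2,12,5,8
3,15,6,8
4,19,7,9"

demand_csv <- "Key,Demand1,Demand2,Demand3
1,100,40,20
2,90,42,21
3,85,41,19
4,70,39,18"

# read.csv() accepts a text = argument, so no file on disk is needed here.
cost   <- read.csv(text = cost_csv,   header = TRUE, stringsAsFactors = FALSE)
demand <- read.csv(text = demand_csv, header = TRUE, stringsAsFactors = FALSE)

str(cost)
str(demand)
```

With real data, replace the text = argument with the path to each file.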

Sub Goal 1: Extract relevant columns and join
2 tables
• From the dplyr package, use the select() call to create 2 new scaled-down tables that only have relevant columns.
• Use the merge() function to join your scaled-down tables. In this case, a left join is exemplified.
• select() works like the SELECT clause in SQL, but in place of FROM, you just enter the table source as the first argument, and then give the number (or name) of each column you want.
• merge() works like a SQL join. The two tables joined are listed as the first two arguments, the common field or "key" is listed third and specified with by = "Key", and you finally specify left, right, or full outer with all.x = TRUE, all.y = TRUE, or all = TRUE (leaving all three out gives an inner join). In this case, use:
merge(cost, demand, by = "Key", all.x = TRUE)
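The select-then-merge step can be sketched as follows. The toy tables and the names cost_small, demand_small, and joined are assumptions made for illustration. The sketch falls back to base R column indexing when dplyr is not installed, so it runs either way.

```r
# Hypothetical toy tables; Key 4 has no match in demand, Key 5 has none in cost.
cost   <- data.frame(Key = c(1, 2, 3, 4),
                     Price1 = c(10, 12, 15, 19),
                     Price2 = c(5, 5, 6, 7))
demand <- data.frame(Key = c(1, 2, 3, 5),
                     Demand1 = c(100, 90, 85, 70),
                     Demand2 = c(40, 42, 41, 39))

if (requireNamespace("dplyr", quietly = TRUE)) {
  # Keep only columns 1 and 2 (the key and the first measure) of each table.
  cost_small   <- dplyr::select(cost, 1, 2)
  demand_small <- dplyr::select(demand, 1, 2)
} else {
  # Base R equivalent when dplyr is unavailable.
  cost_small   <- cost[, c(1, 2)]
  demand_small <- demand[, c(1, 2)]
}

# all.x = TRUE keeps every row of the first table (a left join).
joined <- merge(cost_small, demand_small, by = "Key", all.x = TRUE)
print(joined)
```

Because this is a left join, Key 4 is kept with a missing Demand1 value, while Key 5 (present only in the demand table) is dropped.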

Sub Goal 2: Create a new table tracking changes
over time in both cause and effect variables
• Create vectors of change for input: get the difference between each interval of the Cost column.
• Create output vector using a single interval time lag: get the difference between each interval of the Demand column.
• Create output vector using a double interval time lag: get the difference between every other value of the Demand column.

Approach: Use the diff() function to create vectors, then bind all three vectors together
(here cost and Demand stand for the numeric columns being compared)
• Use the as.numeric() and diff() functions on the Cost column:
one <- diff(as.numeric(cost), lag = 1)
• Use the same functions on the Demand column:
two <- diff(as.numeric(Demand), lag = 1)
• Create a third similar vector using a difference of 2:
three <- diff(as.numeric(Demand), lag = 2)
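To see what the lag argument of diff() does, here is a small sketch on made-up numbers. The vectors cost_col and demand_col are hypothetical stand-ins for the Cost and Demand columns.

```r
# Hypothetical column values for five time points.
cost_col   <- c(10, 12, 15, 19, 24)
demand_col <- c(100, 90, 85, 70, 60)

# lag = 1: difference between each value and the one before it.
one <- diff(as.numeric(cost_col), lag = 1)
two <- diff(as.numeric(demand_col), lag = 1)

# lag = 2: difference between each value and the one two steps before it.
three <- diff(as.numeric(demand_col), lag = 2)

print(one)    # 2 3 4 5
print(two)    # -10 -5 -15 -10
print(three)  # -15 -20 -25
```

Note that each result is shorter than the input by the size of the lag, which is why the vectors need padding before they can be bound together.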

Pad each new vector with the appropriate
amount of “0”s and then combine them
Use concatenation with c() to return new equally sized vectors
• For the first 2 columns just add a single 0 like this:
one <- c(one, 0)
two <- c(two, 0)
• For every succeeding vector add one more 0 than the last:
three <- c(three, 0, 0)
four <- c(four, 0, 0, 0)
……..
• Finally bind all of these columns together into a table:
lag <- cbind(Cost = one, Demand = two, Demand2 = three, Demand3 = four)
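The padding rule above can be sketched with a small helper. pad_to() is a hypothetical convenience function, not part of any package, and the input vectors are made-up differences from a five-row series.

```r
# Made-up difference vectors from a series of 5 observations.
one   <- c(2, 3, 4, 5)          # lag 1, length 4
two   <- c(-10, -5, -15, -10)   # lag 1, length 4
three <- c(-15, -20, -25)       # lag 2, length 3
four  <- c(-30, -30)            # lag 3, length 2

# Hypothetical helper: append zeros until the vector has length n.
pad_to <- function(x, n) c(x, rep(0, n - length(x)))

n <- 5  # number of rows in the original table
lag_tbl <- cbind(Cost    = pad_to(one, n),
                 Demand  = pad_to(two, n),
                 Demand2 = pad_to(three, n),
                 Demand3 = pad_to(four, n))
print(lag_tbl)
```

Each column ends up with as many trailing zeros as its lag, so all four have the same length and cbind() can combine them.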

Your table should have 0s at the end
Your vectors represent changes in their respective columns, with 0s at the end:
Cost: Chng, Chng, Chng, …, …, 0
Demand: Chng, Chng, Chng, …, …, 0
Demand2: Chng2, Chng2, Chng2, …, 0, 0
Demand3: Chng3, Chng3, Chng3, 0, 0, 0
cbind all of these (which should be of equal length due to the 0s) to get the following table; call it "lag"
Cost Demand Demand2 Demand3
Chng Chng Chng2 Chng3
… … … 0
… … 0 0
0 0 0 0
Chng = change between single spots
Chng2 = change between 2 time spots
Chng3 = change between 3 time spots

Now, make sure these columns are numeric
and then add the “key” column from before
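• Apply as.numeric to the columns (specify 2 to iterate across columns):
lag <- apply(lag, 2, as.numeric)
• Bind the key column from the Cost table, keeping the result as a data frame so columns can be referenced with $:
lag <- data.frame(Key = cost[, 1], lag)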
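A minimal sketch of these two steps, assuming a lag table whose values arrived as text and a cost table whose first column is the key. The names lag_tbl, lag_num, and lag_keyed and all values are made up for illustration.

```r
# Hypothetical lag table stored as character data.
lag_tbl <- cbind(Cost = c("2", "3", "0"), Demand = c("-10", "-5", "0"))

# MARGIN = 2 applies as.numeric to each column; note the function is passed
# without parentheses.
lag_num <- apply(lag_tbl, 2, as.numeric)

# Hypothetical cost table whose first column is the key.
cost <- data.frame(Key = 1:3, Price1 = c(10, 12, 15))

# Attach the key; a data frame lets later steps use lag_keyed$Key.
lag_keyed <- data.frame(Key = cost[, 1], lag_num)
print(lag_keyed)
```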

Sub Goal 3: You might want to group data in
the Key and use the aggregate() function
Fact Key Cost Demand1 Demand2
1 1, 2, 3, 4, 5 Sum(chng) Sum(chng) Sum(chng)
2 6, 7, 8, 9, 10 Sum(chng) Sum(chng) Sum(chng)
… 11 .. .. ..
• Group your key data by a segment (say by each 5 records).
Ex. Fact <- ceiling(lag$Key / 5)
This sets up another column that can be used as a grouping factor; bind it to the lag table:
lag <- cbind(lag, Fact)
• Then use the aggregate() function, which requires a list for its grouping argument, and specify sum as the function to aggregate by. Here is how that looks:
Ex. lag <- aggregate(lag, list(lag$Fact), FUN = sum)
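The grouping step can be sketched on a ten-row toy table. The values and the names lag_tbl and agg are made up; unlike the call above, this sketch aggregates only the change columns so that the key is not summed along with them.

```r
# Hypothetical lag table with a key and two change columns.
lag_tbl <- data.frame(Key    = 1:10,
                      Cost   = c(2, 3, 4, 1, 0, 1, 1, 2, 2, 0),
                      Demand = c(-5, -4, -6, -1, 0, -2, -3, -1, -4, 0))

# Keys 1-5 fall in group 1, keys 6-10 in group 2.
lag_tbl$Fact <- ceiling(lag_tbl$Key / 5)

# The grouping variable must be wrapped in a list; FUN = sum totals each group.
agg <- aggregate(lag_tbl[, c("Cost", "Demand")],
                 by = list(Fact = lag_tbl$Fact),
                 FUN = sum)
print(agg)
```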

Now create a correlation matrix
and remove Key if you aggregated in previous step
Select only the columns you need to compare as a matrix (kicking out Key)
• Now that you have the table (aggregated or not), select the cost and demand lag columns as a new matrix; call this lag2:
lag2 <- as.matrix(cbind(Cost = lag$Cost, Demand = lag$Demand, Demand2 = lag$Demand2, Demand3 = lag$Demand3))
The cor() function is then applied to this matrix, which leaves out the Key column if you did aggregate
Cost Demand Demand2 Demand3
Chng_Agg Chng_Agg Chng2_Agg Chng3_Agg
This will look much like the initial lag table when it was first created.

Sub Goal 4: Now, create your correlation
matrix and a plot
Create the correlation matrix and plot with the cor() and corrplot() functions
clag <- cor(lag2)
LagPlot <- corrplot(clag)

Finally, you can adapt your correlation plot to
see values and see shading better
Use arguments like method = "shade", tl.srt = 30, and addCoef.col = "red" to modify as seen here.
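Putting the last two steps together, a minimal sketch on made-up numbers might look like this. The matrix lag2 here is hypothetical toy data, and the plotting call is guarded so the sketch still runs where corrplot is not installed.

```r
# Hypothetical lag matrix with padded zeros at the end of each column.
lag2 <- cbind(Cost    = c(2, 3, 4, 5, 0),
              Demand  = c(-10, -5, -15, -10, 0),
              Demand2 = c(-15, -20, -25, 0, 0))

# cor() returns the pairwise correlation matrix of the columns.
clag <- cor(lag2)
print(round(clag, 2))

if (requireNamespace("corrplot", quietly = TRUE)) {
  # method = "shade" fills cells by strength, tl.srt rotates the text labels,
  # and addCoef.col prints the coefficients in the given colour.
  corrplot::corrplot(clag, method = "shade", tl.srt = 30, addCoef.col = "red")
}
```

The diagonal of clag is always 1, since each column correlates perfectly with itself; the off-diagonal cells are the ones to read.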

Explore other modifications and enjoy making
correlation plots!
