This document provides an overview of using data.table in R. It discusses how to create and subset data.tables, manipulate columns by reference, perform grouped operations, and use keys and indexes. Some key points include:
- data.tables allow fast subsetting, updating, and grouping of large data sets using keys and indexes.
- Columns can be added, updated, or removed efficiently by reference using the := operator.
- Grouped operations such as sums are performed efficiently using by, which splits the data.table into groups.
- A key set on one or more columns allows fast row selection, similar to SQL queries on indexed columns.
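As a quick illustration of these three points, here is a minimal sketch (assuming the data.table package is installed; the toy columns V1/V2/V3 are illustrative, not from any particular dataset):

```r
library(data.table)

# Toy table: V1 alternates 1/2, V2 cycles A/B/C (illustrative names).
DT <- data.table(V1 = rep(1:2, 3), V2 = rep(c("A", "B", "C"), 2))

DT[, V3 := V1 * 10]                        # add a column by reference with :=
sums <- DT[, .(total = sum(V3)), by = V2]  # grouped sum using by
setkey(DT, V2)                             # set a key for fast row selection
DT["A"]                                    # key-based subset, like an indexed SQL lookup
```

Each of these operations is covered in detail in the cheat sheet below.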
Extended network and algorithm finding maximal flows IJECEIAES
Graphs are a powerful mathematical tool applied in many fields, such as transportation, communication, informatics, and economics. In an ordinary graph, the weights of edges and vertices are considered independently, and the length of a path is the sum of the weights of the edges and vertices on that path. In many practical problems, however, the weight at a vertex is not the same for all paths passing through it, but depends on the incoming and outgoing edges. This paper develops a model of extended networks that can model many practical problems more exactly and effectively. Its main contribution is an algorithm for finding maximal flows on extended networks.
Slides from the session "Las tripas de un sistema Solr" ("The guts of a Solr system") given at Drupalcamp Conil 2019.
It covers the main elements of a Solr query and how to debug errors using debug mode and the tools of the default UI.
Introduction to Information Technology Lecture 2 (MikeCrea)
Number Systems
Types of number systems
Number bases
Range of possible numbers
Conversion between number bases
Common powers
Arithmetic in different number bases
Shifting a number
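The base-conversion topic in this outline can be sketched in R (the document's working language) using only base functions; to_binary is a helper name introduced here for illustration, not a built-in:

```r
# Base-2/16 to decimal with base R's strtoi:
strtoi("1011", base = 2)   # binary 1011 -> 11
strtoi("ff", base = 16)    # hex ff -> 255

# Decimal to a binary digit string (helper written for this sketch):
to_binary <- function(n) {
  if (n == 0) return("0")
  digits <- integer(0)
  while (n > 0) {
    digits <- c(n %% 2, digits)   # collect remainders, most significant first
    n <- n %/% 2
  }
  paste(digits, collapse = "")
}
to_binary(11)   # "1011"
```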
imager package in R and example
References:
http://dahtah.github.io/imager/
http://dahtah.github.io/imager/imager.html
https://cran.r-project.org/web/packages/imager/imager.pdf
R Data Visualization - Spatial data and Maps in R: Using R as a GIS (Dr. Volkan OBAN)
Reference: https://pakillo.github.io/R-GIS-tutorial/
Basic packages
library(sp) # classes for spatial data
library(raster) # grids, rasters
library(rasterVis) # raster visualization
library(maptools) # reading and handling spatial objects
library(rgeos) # geometry operations
library(dismo) # species distribution modelling
library(googleVis) # Google visualization tools
library(rworldmap) # country-level world maps
library(RgoogleMaps) # Google Maps tiles in R
ggTimeSeries (a ggplot2 extension)
This R package offers novel time series visualisations. It is based on ggplot2 and offers geoms and pre-packaged functions for easily creating any of the offered charts. Some examples are listed below.
The package can be installed from GitHub by installing the devtools library and then running devtools::install_github('Ather-Energy/ggTimeSeries').
reference: https://github.com/Ather-Energy/ggTimeSeries
R Machine Learning packages( generally used)
prepared by Volkan OBAN
reference:
https://github.com/josephmisiti/awesome-machine-learning#r-general-purpose
In MATLAB, a vector is created by assigning the elements of the vector to a variable. This can be done in several ways depending on the source of the information.
—Enter an explicit list of elements
—Load matrices from external data files
—Using built-in functions
—Using own functions in M-files
MATLAB DOCUMENTATION ON SOME OF THE MODULES
A. Generate videos, in MATLAB, in which a skeleton of a person performs the following gestures:
1. Tilting the head to the right and left
2. Tilting a hand to the right and left
3. Walking
B. Write a MATLAB program that converts a decimal number to Roman number and vice versa.
C. Using ezplot and anonymous functions, plot the following:
· Y=Sqrt(X)
· Y= X^2
· Y=e^(-XY)
D. Take your picture and:
· Show the R, G, B channels along with the RGB image in the same figure using subplots.
· Convert it to HSV (hue, saturation, value) and show the H, S, V channels along with the HSV image.
E. Record your name pronounced by yourself. Display the signal (your name) as a plot vs. time, using MATLAB.
F.Write a script to open a new figure and plot five circles, all centered at the origin and with increasing radii. Set the line width for each circle to something thick (at least 2 points), and use the colors from a 5-color jet colormap (jet).
G. NEWTON RAPHSON AND SECANT METHOD
H. Write a program that does the following using file operations:
1.Create or Open a file
2. Read data from the file and write data to another file
3. Append some text to already existed file
4. Close the file
I. Write a function to perform the following set operations:
1.Union of A and B
2. Intersection of A and B
3. Complement of A and B
(Assume A= {1, 2, 3, 4, 5, 6}, B= {2, 4, 6})
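The exercise targets MATLAB, but for reference the same set operations look like this in R (this document's main language), using the given A and B:

```r
A <- c(1, 2, 3, 4, 5, 6)
B <- c(2, 4, 6)

union(A, B)      # 1 2 3 4 5 6
intersect(A, B)  # 2 4 6
setdiff(A, B)    # relative complement of B in A: 1 3 5
```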
Covid19py by Konstantinos Kamaropoulos
A tiny Python package for easy access to up-to-date Coronavirus (COVID-19, SARS-CoV-2) cases data.
ref:https://github.com/Kamaropoulos/COVID19Py
https://pypi.org/project/COVID19Py/
"optrees" package in R and examples.(optrees:finds optimal trees in weighted ...Dr. Volkan OBAN
Finds optimal trees in weighted graphs. In particular, this package provides solving tools for minimum cost spanning tree problems, minimum cost arborescence problems, shortest path tree problems, and the minimum cut tree problem.
by Volkan OBAN
k-means Clustering in Python
scikit-learn--Machine Learning in Python
from sklearn.cluster import KMeans
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both algorithms. Additionally, they both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.[wikipedia]
ref: http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html
Forecasting through ARIMA Modeling using R
ref:http://ucanalytics.com/blogs/step-by-step-graphic-guide-to-forecasting-through-arima-modeling-in-r-manufacturing-case-study-example/
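As a minimal companion to the linked guide, here is a base-R ARIMA sketch on the built-in AirPassengers series; the (1,1,1)(0,1,1)[12] order is an illustrative assumption, not the tutorial's tuned model:

```r
# Fit a seasonal ARIMA to the monthly AirPassengers series (base stats package).
fit <- arima(AirPassengers, order = c(1, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# Point forecasts (and standard errors) for the next 12 months.
fc <- predict(fit, n.ahead = 12)
fc$pred
```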
k-means Clustering and Clustergram with R.
K-means clustering is an unsupervised learning algorithm that tries to cluster data based on their similarity. Unsupervised learning means that there is no outcome to be predicted; the algorithm just tries to find patterns in the data. In k-means clustering we have to specify the number of clusters we want the data to be grouped into. The algorithm randomly assigns each observation to a cluster, and finds the centroid of each cluster.
ref:https://www.r-bloggers.com/k-means-clustering-in-r/
ref:https://rpubs.com/FelipeRego/K-Means-Clustering
ref:https://www.r-bloggers.com/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/
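A minimal sketch of the procedure described above, using base R's kmeans on the built-in iris data (k = 3 and the chosen columns are assumptions for illustration):

```r
set.seed(42)  # k-means starts from random assignments, so fix the seed

# Cluster on two numeric columns; nstart restarts guard against a poor
# random initialisation.
km <- kmeans(iris[, c("Petal.Length", "Petal.Width")],
             centers = 3, nstart = 25)

km$centers        # one centroid per cluster
table(km$cluster) # cluster sizes
```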
Data Science and its Relationship to Big Data and Data-Driven Decision Making (Dr. Volkan OBAN)
To cite this article:
Foster Provost and Tom Fawcett. Data Science and its Relationship to Big Data and Data-Driven Decision Making. Big Data. Volume 1, Issue 1: February 2013, 51-59. doi:10.1089/big.2013.1508.
ref:http://online.liebertpub.com/doi/full/10.1089/big.2013.1508
https://www.researchgate.net/publication/256439081_Data_Science_and_Its_Relationship_to_Big_Data_and_Data-Driven_Decision_Making
Data visualization with R.
Mosaic plots.
Ref: https://www.stat.auckland.ac.nz/~ihaka/120/Lectures/lecture17.pdf
http://www.statmethods.net/advgraphs/mosaic.html
https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/mosaicplot.html
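A minimal base-graphics example of a mosaic plot (the built-in Titanic table is used purely for illustration):

```r
# Collapse the 4-way Titanic table to a 2x2 Sex-by-Survived table,
# then draw a mosaic plot of the counts.
counts <- margin.table(Titanic, c(2, 4))  # dims 2 (Sex) and 4 (Survived)
mosaicplot(counts, main = "Titanic survival by sex", color = TRUE)
```

Tile areas are proportional to cell counts, which is what makes the mosaic plot useful for eyeballing association between two categorical variables.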
Some R Examples[R table and Graphics]
Advanced Data Visualization in R (Some Examples)
References:
http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/
http://www.cookbook-r.com/
http://moderndata.plot.ly/trisurf-plots-in-r-using-plotly/
I hope that it will be useful for useRs, and for everyone interested in the R language.
Volkan OBAN
A short list of the most useful R commands
reference: http://www.personality-project.org/r/r.commands.html
Prepared for everyone who is interested in R or has just started learning it.
ref:https://www.ggplot2-exts.org/ggtree.html
ggtree
https://bioconductor.org/packages/release/bioc/html/ggtree.html
ggtree is designed for visualizing phylogenetic trees and different types of associated annotation data.
Adjusting primitives for graphs: SHORT REPORT / NOTES (Subhajit Sahu)
Compressed Sparse Row (CSR) is an adjacency-list-based graph representation used by graph algorithms such as PageRank.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract: Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It comes, however, with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large number of small workload submissions, and is expected to be a non-issue when the computation is performed on massive graphs.
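For reference, standard (monolithic) PageRank power iteration can be sketched in a few lines of R. This is not the Levelwise algorithm from the report, just the baseline it compares against, with the common uniform-redistribution trick for dead ends; the 3-node graph and damping factor are illustrative:

```r
# adj[i, j] == 1 means an edge from vertex i to vertex j.
pagerank <- function(adj, d = 0.85, tol = 1e-10) {
  n   <- nrow(adj)
  out <- rowSums(adj)       # out-degrees; zero marks a dead end
  r   <- rep(1 / n, n)      # start from the uniform distribution
  repeat {
    contrib  <- ifelse(out > 0, r / out, 0)       # rank pushed along each edge
    dangling <- sum(r[out == 0]) / n              # dead-end rank, spread uniformly
    r_new <- (1 - d) / n + d * (as.vector(t(adj) %*% contrib) + dangling)
    if (sum(abs(r_new - r)) < tol) return(r_new)
    r <- r_new
  }
}

# 3-node cycle: 1 -> 2, 2 -> 3, 3 -> 1
adj <- matrix(c(0, 1, 0,
                0, 0, 1,
                1, 0, 0), nrow = 3, byrow = TRUE)
pagerank(adj)   # a symmetric cycle, so all ranks are equal (1/3 each)
```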
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize marketing strategies and enhance campaign effectiveness. Suitable for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
The official Cheat Sheet for the DataCamp course
DATA ANALYSIS THE DATA.TABLE WAY
General form: DT[i, j, by] "Take DT, subset rows using i, then calculate j grouped by by"
CREATE A DATA TABLE
Create a data.table and call it DT.
library(data.table)
set.seed(45L)
DT <- data.table(V1=c(1L,2L),
V2=LETTERS[1:3],
V3=round(rnorm(4),4),
V4=1:12)
> DT
V1 V2 V3 V4
1: 1 A -1.1727 1
2: 2 B -0.3825 2
3: 1 C -1.0604 3
4: 2 A 0.6651 4
5: 1 B -1.1727 5
6: 2 C -0.3825 6
7: 1 A -1.0604 7
8: 2 B 0.6651 8
9: 1 C -1.1727 9
10: 2 A -0.3825 10
11: 1 B -1.0604 11
12: 2 C 0.6651 12
SUBSETTING ROWS USING i
What? Example Notes Output
Subsetting rows by numbers. DT[3:5,] #or DT[3:5] Selects third to fifth row. V1 V2 V3 V4
1: 1 C -1.0604 3
2: 2 A 0.6651 4
3: 1 B -1.1727 5
Use column names to select rows in i based on
a condition using fast automatic indexing. Or
for selecting on multiple values:
DT[column %in% c("value1","value2")],
which selects all rows that have value1 or
value2 in column.
DT[ V2 == "A"] Selects all rows that have value A in column
V2.
V1 V2 V3 V4
1: 1 A -1.1727 1
2: 2 A 0.6651 4
3: 1 A -1.0604 7
4: 2 A -0.3825 10
DT[ V2 %in% c("A","C")] Selects all rows that have the value A or C in
column V2.
V1 V2 V3 V4
1: 1 A -1.1727 1
2: 1 C -1.0604 3
...
7: 2 A -0.3825 10
8: 2 C 0.6651 12
MANIPULATING COLUMNS IN j
What? Example Notes Output
Select 1 column in j. DT[,V2] Column V2 is returned as a vector. [1] "A" "B" "C" "A"
"B" "C" ...
Select several columns in j. DT[,.(V2,V3)] Columns V2 and V3 are
returned as a data.table.
V2 V3
1: A -1.1727
2: B -0.3825
3: C -1.0604
…
.() is an alias to list(). If .() is used, the returned value is a data.table. If .() is not used, the result is a vector.
Call functions in j. DT[,sum(V1)] Returns the sum of all
elements of column V1 in a vector.
[1] 18
Computing on several columns. DT[,.(sum(V1),sd(V3))] Returns the sum of all
elements of column V1 and the standard
deviation of V3 in a data.table.
V1 V2
1: 18 0.7634655
Assigning column names to
computed columns.
DT[,.(Aggregate = sum(V1),
Sd.V3 = sd(V3))]
The same as above, but with new names. Aggregate Sd.V3
1: 18 0.7634655
Columns get recycled if different
length.
DT[,.(V1, Sd.V3 = sd(V3))] Selects column V1, and compute std. dev. of V3,
which returns a single value and gets recycled.
V1 Sd.V3
1: 1 0.7634655
2: 2 0.7634655
...
11: 1 0.7634655
12: 2 0.7634655
Multiple expressions can be
wrapped in curly braces.
DT[,{print(V2)
plot(V3)
NULL}]
Print column V2 and plot V3. [1] "A" "B" "C" "A"
"B" "C" ...
#And a plot
DOING j BY GROUP
What? Example Notes Output
Doing j by group. DT[,.(V4.Sum = sum(V4)),by=V1] Calculates the sum of V4, for every group in
V1.
V1 V4.Sum
1: 1 36
Doing j by several groups
using .().
DT[,.(V4.Sum = sum(V4)),by=.(V1,V2)] The same as above, but for every group in V1
and V2.
V1 V2 V4.Sum
1: 1 A 8
2: 2 B 10
3: 1 C 12
4: 2 A 14
5: 1 B 16
6: 2 C 18
Call functions in by. DT[,.(V4.Sum = sum(V4)),by=sign(V1-1)] Calculates the sum of V4, for every group in
sign(V1-1).
sign V4.Sum
1: 0 36
2: 1 42
Assigning new column
names in by.
DT[,.(V4.Sum = sum(V4)),
by=.(V1.01 = sign(V1-1))]
Same as above, but with a new name for the
variable we are grouping by.
V1.01 V4.Sum
1: 0 36
2: 1 42
Grouping only on a subset by
specifying i.
DT[1:5,.(V4.Sum = sum(V4)),by=V1] Calculates the sum of V4, for every group in
V1, after subsetting on the first five rows.
V1 V4.Sum
1: 1 9
2: 2 6
Using .N to get the total
number of observations of each
group.
DT[,.N,by=V1] Count the number of rows for every group in
V1.
V1 N
1: 1 6
2: 2 6
ADDING/UPDATING COLUMNS BY REFERENCE IN j USING :=
What? Example Notes Output
Adding/updating a column by
reference using := in one line.
Watch out: extra assignment
(DT <- DT[...]) is redundant.
DT[, V1 := round(exp(V1),2)] Column V1 is updated by what is after :=. Returns the result invisibly.
Column V1 went from: [1] 1 2 1
2 … to [1] 2.72 7.39 2.72
7.39 …
Adding/updating several
columns by reference using :=.
DT[, c("V1","V2") := list
(round(exp(V1),2), LETTERS
[4:6])]
Column V1 and V2 are updated by what is
after :=.
Returns the result invisibly.
Column V1 changed as above.
Column V2 went from: [1] "A"
"B" "C" "A" "B" "C" … to: [1]
"D" "E" "F" "D" "E" "F" …
Using functional :=. DT[, ':=' (V1 =
round(exp(V1),2),
V2 = LETTERS[4:6])][]
Another way to write the same line as
above this one, but easier to write
comments side-by-side. Also, when [] is
added the result is printed to the screen.
Same changes as line above this
one, but the result is printed to the
screen because of the [] at the end
of the statement.
Remove a column instantly
using :=.
DT[, V1 := NULL] Removes column V1. Returns the result invisibly.
Column V1 became NULL.
Remove several columns
instantly using :=.
DT[, c("V1","V2") := NULL] Removes columns V1 and V2. Returns the result invisibly. Col-
umn V1 and V2 became NULL.
Wrap the name of a variable
which contains column names in
parenthesis to pass the contents
of that variable to be deleted.
Cols.chosen = c("A","B")
DT[, Cols.chosen := NULL] Watch out: this deletes the column with
column name Cols.chosen.
Returns the result invisibly.
Column with name Cols.chosen
became NULL.
DT[, (Cols.chosen) := NULL] Deletes the columns specified in the
variable Cols.chosen (V1 and V2).
Returns the result invisibly.
Columns V1 and V2 became NULL.
INDEXING AND KEYS
What? Example Notes Output
Use setkey() to set a key on a DT.
The data is sorted on the column we
specified by reference.
setkey(DT,V2) A key is set on column V2. Returns results
invisibly.
Use keys like supercharged rownames
to select rows.
DT["A"] Returns all the rows where the key column (set to
column V2 in the line above) has the value A.
V1 V2 V3 V4
1: 1 A -1.1727 1
2: 2 A 0.6651 4
3: 1 A -1.0604 7
4: 2 A -0.3825 10
DT[c("A","C")] Returns all the rows where the key column (V2) has the
value A or C.
V1 V2 V3 V4
1: 1 A -1.1727 1
2: 2 A 0.6651 4
...
7: 1 C -1.1727 9
8: 2 C 0.6651 12
The mult argument is used to control
which row that i matches to is
returned, default is all.
DT["A", mult ="first"] Returns first row of all rows that match the value A in
the key column (V2).
V1 V2 V3 V4
1: 1 A -1.1727 1
DT["A", mult = "last"] Returns last row of all rows that match the value A in
the key column (V2).
V1 V2 V3 V4
1: 2 A -0.3825 10
The nomatch argument is used to
control what happens when a value
specified in i has no match in the rows
of the DT. Default is NA, but can be
changed to 0.
0 means no rows will be
returned for that non-matched row of i.
DT[c("A","D")] Returns all the rows where the key column (V2) has the
value A or D. A is found, D is not so NA is returned for
D.
V1 V2 V3 V4
1: 1 A -1.1727 1
2: 2 A 0.6651 4
3: 1 A -1.0604 7
4: 2 A -0.3825 10
5: NA D NA NA
DT[c("A","D"), nomatch
= 0]
Returns all the rows where the key column (V2) has the
value A or D. Value D is not found and not returned
because of the
nomatch argument.
V1 V2 V3 V4
1: 1 A -1.1727 1
2: 2 A 0.6651 4
3: 1 A -1.0604 7
4: 2 A -0.3825 10
by=.EACHI allows to group by each
subset of known groups in i. A key
needs to be set to use by=.EACHI.
DT[c("A","C"),
sum(V4)]
Returns one total sum of column V4, for the rows of the
key column (V2) that have values A or C.
[1] 52
DT[c("A","C"),
sum(V4), by=.EACHI]
Returns one sum of column V4 for the rows of column
V2 that have value A, and
another sum for the rows of column V2 that have value
C.
V2 V1
1: A 22
2: C 30
Any number of columns can be set as
key using setkey(). This way rows
can be selected on 2 keys which is an
equijoin.
setkey(DT,V1,V2) Sorts by column V1 and then by column V2 within each
group of column V1.
Returns results
invisibly.
DT[.(2,"C")] Selects the rows that have the value 2 for the first key
(column V1) and the value C for the second key (column
V2).
V1 V2 V3 V4
1: 2 C -0.3825 6
2: 2 C 0.6651 12
DT[.(2,
c("A","C"))]
Selects the rows that have the value 2 for the first key
(column V1) and within those rows the value A or C for
the second key (column V2).
V1 V2 V3 V4
1: 2 A 0.6651 4
2: 2 A -0.3825 10
3: 2 C -0.3825 6
4: 2 C 0.6651 12
ADVANCED DATA TABLE OPERATIONS
What? Example Notes Output
.N contains the number of rows or the
last row.
Usable in i: DT[.N-1] Returns the penultimate row of the
data.table.
V1 V2 V3 V4
1: 1 B -1.0604 11
Usable in j: DT[,.N] Returns the number of rows. [1] 12
.() is an alias to list() and means
the same. The .() notation is not
needed when there is only one item in
by or j.
Usable in j: DT[,.(V2,V3)] #or
DT[,list(V2,V3)]
Columns V2 and V3 are returned as a
data.table.
V2 V3
1: A -1.1727
2: B -0.3825
3: C -1.0604
...
Usable in by: DT[, mean(V3),
by=.(V1,V2)]
Returns the result of j, grouped by all
possible combinations of groups
specified in by.
V1 V2 V1
1: 1 A -1.11655
2: 2 B 0.14130
3: 1 C -1.11655
4: 2 A 0.14130
5: 1 B -1.11655
6: 2 C 0.14130
.SD is a data.table and holds all the
values of all columns, except the one
specified in by. It reduces
programming time but keeps
readability. .SD is only accessible in j.
DT[, print(.SD), by=V2] To look at what .SD
contains.
#All of .SD (output
too long to display
here)
DT[,.SD[c(1,.N)], by=V2] Selects the first and last row grouped by
column V2.
V2 V1 V3 V4
1: A 1 -1.1727 1
2: A 2 -0.3825 10
3: B 2 -0.3825 2
4: B 1 -1.0604 11
5: C 1 -1.0604 3
6: C 2 0.6651 12
DT[, lapply(.SD, sum), by=V2] Calculates the sum of all columns in .SD
grouped by V2.
V2 V1 V3 V4
1: A 6 -1.9505 22
2: B 6 -1.9505 26
3: C 6 -1.9505 30
.SDcols is used together with .SD, to
specify a subset of the columns of .SD to
be used in j.
DT[, lapply(.SD,sum), by=V2,
.SDcols = c("V3","V4")]
Same as above, but only for columns V3
and V4 of .SD.
V2 V3 V4
1: A -1.9505 22
2: B -1.9505 26
3: C -1.9505 30
.SDcols can be the result of a function call.
DT[, lapply(.SD,sum), by=V2,
.SDcols = paste0("V",3:4)]
Same result as the line above.
CHAINING HELPS TACK EXPRESSIONS TOGETHER AND
AVOID (UNNECESSARY) INTERMEDIATE ASSIGNMENTS
What? Example Notes Output
Do 2 (or more) sets of statements
at once by chaining them into one
statement. This corresponds to HAVING in SQL.
DT<-DT[, .(V4.Sum = sum(V4)),by=V1]
DT[V4.Sum > 40] #no chaining
First calculates sum of V4, grouped by V1. Then
selects that group of which the sum is > 40
without chaining.
V1 V4.Sum
1: 1 36
2: 2 42
DT[, .(V4.Sum = sum(V4)),
by=V1][V4.Sum > 40 ]
Same as above, but with chaining. V1 V4.Sum
1: 2 42
Order the results by chaining. DT[, .(V4.Sum = sum(V4)),
by=V1][order(-V1)]
Calculates sum of V4, grouped by V1, and then
orders the result on V1.
V1 V4.Sum
1: 2 42
2: 1 36
USING THE set()-FAMILY
What? Example Notes Output
set() is used to repeatedly
update rows and columns by
reference. set() is a loopable,
low-overhead version of :=.
Watch out: it cannot handle
grouping operations.
Syntax of set(): for (i in from:to) set(DT, row, column, new value).
rows = list(3:4,5:6)
cols = 1:2
for (i in seq_along(rows))
{ set(DT,
i=rows[[i]],
j = cols[i],
value = NA) }
Sequence along the values of rows,
and for the values of cols, set the
values of those elements equal to NA.
Returns the result invisibly.
> DT
V1 V2 V3 V4
1: 1 A -1.1727 1
2: 2 B -0.3825 2
3: NA C -1.0604 3
4: NA A 0.6651 4
5: 1 NA -1.1727 5
6: 2 NA -0.3825 6
7: 1 A -1.0604 7
8: 2 B 0.6651 8
setnames() is used to create
or update column names by
reference.
Syntax of setnames():
setnames(DT,"old","new")[]
Changes (set) the name of column old to new. Also, when [] is added at the
end of any set() function the result is printed to the screen.
setnames(DT,"V2","Rating") Sets the name of column V2 to Rating. Returns the result invisibly.
setnames(DT,c("V2","V3"),
c("V2.rating","V3.DataCamp"))
Changes two column names. Returns the result invisibly.
setcolorder() is used to
reorder columns by reference.
setcolorder(DT, "neworder") neworder is a character vector of the new column name ordering.
setcolorder(DT,
c("V2","V1","V4","V3"))
Changes the column ordering to the
contents of the vector.
Returns the result invisibly. The new
column order is now [1] "V2" "V1"
"V4" "V3"