Grouped data frames allow dplyr functions to manipulate each group separately. The group_by() function creates a grouped data frame, while ungroup() removes grouping. summarise() applies summary functions such as mean() to columns to create a new table, and count() tallies the rows in each group. Join functions combine tables by matching values; left, right, inner, and full joins each retain a different combination of values from the two tables.
Data Transformation with dplyr Cheat Sheet

dplyr functions work with pipes and expect tidy data. In tidy data, each variable is in its own column and each observation, or case, is in its own row. Pipes: x %>% f(y) becomes f(x, y).

Group Cases

Use group_by() to create a "grouped" copy of a table. dplyr functions will manipulate each "group" separately and then combine the results.

mtcars %>%
  group_by(cyl) %>%
  summarise(avg = mean(mpg))

group_by(.data, ..., add = FALSE)
  Returns a copy of the table grouped by the given variables.
  g_iris <- group_by(iris, Species)

ungroup(x, ...)
  Returns an ungrouped copy of the table.
  ungroup(g_iris)
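Grouped operations also work with mutate(): the computation runs within each group and the result keeps every row. A minimal sketch (the mpg_centered column name is illustrative):

library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  mutate(mpg_centered = mpg - mean(mpg)) %>%  # mean(mpg) is the group mean here
  ungroup()                                   # drop the grouping when done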
Summarise Cases

These apply summary functions to columns to create a new table. Summary functions take vectors as input and return one value (see the Summary Functions list further down).

summarise(.data, ...)
  Compute a table of summaries. Also summarise_().
  summarise(mtcars, avg = mean(mpg))

count(x, ..., wt = NULL, sort = FALSE)
  Count the number of rows in each group defined by the variables in ... Also tally().
  count(iris, Species)

Variations
• summarise_all() - Apply funs to every column.
• summarise_at() - Apply funs to specific columns.
• summarise_if() - Apply funs to all cols of one type.
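A quick sketch of the three variants, assuming the funs() notation from dplyr 0.5.0:

library(dplyr)
summarise_all(mtcars, funs(mean))                    # every column
summarise_at(mtcars, vars(mpg, hp), funs(mean, sd))  # named columns only
summarise_if(iris, is.numeric, funs(median))         # all numeric columns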
Manipulate Cases

Row functions return a subset of rows as a new table. Use a variant that ends in _ for non-standard evaluation friendly code.

Extract Cases

filter(.data, ...)
  Extract rows that meet logical criteria. Also filter_().
  filter(iris, Sepal.Length > 7)

distinct(.data, ..., .keep_all = FALSE)
  Remove rows with duplicate values. Also distinct_().
  distinct(iris, Species)

sample_frac(tbl, size = 1, replace = FALSE, weight = NULL, .env = parent.frame())
  Randomly select a fraction of rows.
  sample_frac(iris, 0.5, replace = TRUE)

sample_n(tbl, size, replace = FALSE, weight = NULL, .env = parent.frame())
  Randomly select size rows.
  sample_n(iris, 10, replace = TRUE)

slice(.data, ...)
  Select rows by position. Also slice_().
  slice(iris, 10:15)

top_n(x, n, wt)
  Select and order the top n entries (by group if grouped data).
  top_n(iris, 5, Sepal.Width)

Logical and boolean operators to use with filter():
  >   >=   !is.na()   !      &
  <   <=   is.na()    %in%   |   xor()
See ?base::logic and ?Comparison for help.

Arrange Cases

arrange(.data, ...)
  Order rows by values of a column (low to high); use with desc() to order from high to low.
  arrange(mtcars, mpg)
  arrange(mtcars, desc(mpg))

Add Cases

add_row(.data, ..., .before = NULL, .after = NULL)
  Add one or more rows to a table.
  add_row(faithful, eruptions = 1, waiting = 1)
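A small worked pipeline chaining the row operations above:

library(dplyr)
iris %>%
  filter(Species %in% c("setosa", "virginica"),   # logical criteria
         !is.na(Sepal.Length)) %>%
  distinct() %>%                                  # drop duplicate rows
  arrange(desc(Sepal.Length)) %>%                 # order high to low
  slice(1:5)                                      # keep the first five by position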
Manipulate Variables

Column functions return a set of columns as a new table. Use a variant that ends in _ for non-standard evaluation friendly code.

Extract Variables

select(.data, ...)
  Extract columns by name. Also select_if().
  select(iris, Sepal.Length, Species)

Use these helpers with select(), e.g. select(iris, starts_with("Sepal")):
  contains(match)
  ends_with(match)
  matches(match)
  num_range(prefix, range)
  one_of(...)
  starts_with(match)
  :, e.g. mpg:cyl
  -, e.g. -Species

Make New Variables

These apply vectorized functions to columns. Vectorized funs take vectors as input and return vectors of the same length as output (see the Vectorized Functions list further down).

mutate(.data, ...)
  Compute new column(s).
  mutate(mtcars, gpm = 1/mpg)

transmute(.data, ...)
  Compute new column(s), drop others.
  transmute(mtcars, gpm = 1/mpg)

mutate_all(.tbl, .funs, ...)
  Apply funs to every column. Use with funs().
  mutate_all(faithful, funs(log(.), log2(.)))

mutate_at(.tbl, .cols, .funs, ...)
  Apply funs to specific columns. Use with funs(), vars() and the helper functions for select().
  mutate_at(iris, vars(-Species), funs(log(.)))

mutate_if(.tbl, .predicate, .funs, ...)
  Apply funs to all columns of one type. Use with funs().
  mutate_if(iris, is.numeric, funs(log(.)))

add_column(.data, ..., .before = NULL, .after = NULL)
  Add new column(s).
  add_column(mtcars, new = 1:32)

rename(.data, ...)
  Rename columns.
  rename(iris, Length = Sepal.Length)
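A short sketch combining selection helpers with a vectorized mutate (the gpm and size column names are illustrative):

library(dplyr)
mtcars %>%
  select(mpg, cyl:hp) %>%                          # a name plus a range of columns
  mutate(gpm  = 1 / mpg,                           # one output value per row
         size = if_else(cyl >= 6, "big", "small")) # element-wise if() + else()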
Summary Functions (to use with summarise())

summarise() applies summary functions to columns to create a new table. Summary functions take vectors as input and return single values as output.

Counts
  dplyr::n() - number of values/rows
  dplyr::n_distinct() - # of uniques
  sum(!is.na()) - # of non-NA's

Location
  mean() - mean, also mean(!is.na())
  median() - median

Logicals
  mean() - proportion of TRUE's
  sum() - # of TRUE's

Position/Order
  dplyr::first() - first value
  dplyr::last() - last value
  dplyr::nth() - value in nth location of vector

Rank
  quantile() - nth quantile
  min() - minimum value
  max() - maximum value

Spread
  IQR() - inter-quartile range
  mad() - median absolute deviation
  sd() - standard deviation
  var() - variance
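A minimal sketch mixing several of these summary functions in one grouped call:

library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(
    n_cars  = n(),           # rows per group
    avg_mpg = mean(mpg),     # location
    sd_mpg  = sd(mpg),       # spread
    pct_man = mean(am == 1)  # proportion of TRUE's
  )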
Vectorized Functions (to use with mutate())

mutate() and transmute() apply vectorized functions to columns to create new columns. Vectorized functions take vectors as input and return vectors of the same length as output.

Offsets
  dplyr::lag() - offset elements by 1
  dplyr::lead() - offset elements by -1

Cumulative Aggregates
  dplyr::cumall() - cumulative all()
  dplyr::cumany() - cumulative any()
  cummax() - cumulative max()
  dplyr::cummean() - cumulative mean()
  cummin() - cumulative min()
  cumprod() - cumulative prod()
  cumsum() - cumulative sum()

Rankings
  dplyr::cume_dist() - proportion of all values <=
  dplyr::dense_rank() - rank with ties = min, no gaps
  dplyr::min_rank() - rank with ties = min
  dplyr::ntile() - bins vector into n bins
  dplyr::percent_rank() - min_rank scaled to [0,1]
  dplyr::row_number() - rank with ties = "first"

Math
  +, -, *, /, ^, %/%, %% - arithmetic ops
  log(), log2(), log10() - logs
  <, <=, >, >=, !=, == - logical comparisons

Misc
  dplyr::between() - x >= left & x <= right
  dplyr::case_when() - multi-case if_else()
  dplyr::coalesce() - first non-NA values by element across a set of vectors
  dplyr::if_else() - element-wise if() + else()
  dplyr::na_if() - replace specific values with NA
  pmax() - element-wise max()
  pmin() - element-wise min()
  dplyr::recode() - vectorized switch()
  dplyr::recode_factor() - vectorized switch() for factors
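A minimal sketch of window-style vectorized functions inside mutate(); on a grouped table each one computes within its group (column names are illustrative):

library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  mutate(
    mpg_rank  = min_rank(desc(mpg)),  # rank within each cyl group
    mpg_delta = mpg - lag(mpg),       # change from the previous row
    run_total = cumsum(mpg)           # cumulative sum within group
  )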
Combine Tables
Combine Variables

Use bind_cols() to paste tables beside each other as they are.

bind_cols(...)
  Returns tables placed side by side as a single table.
  BE SURE THAT ROWS ALIGN.

Use a "Mutating Join" to join one table to columns from another, matching values with the rows that they correspond to. Each join retains a different combination of values from the tables. The examples below use these sample tables:

  x:  A B C      y:  A B D      z:  A B C
      a t 1          a t 3          c v 3
      b u 2          b u 2          d w 4
      c v 3          d w 1

left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
  Join matching values from y to x.

right_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
  Join matching values from x to y.

inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
  Join data. Retain only rows with matches.

full_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
  Join data. Retain all values, all rows.

Use by = c("col1", "col2") to specify the column(s) to match on.
  left_join(x, y, by = "A")

Use a named vector, by = c("col1" = "col2"), to match on columns with different names in each data set.
  left_join(x, y, by = c("C" = "D"))

Use suffix to specify the suffix to give to duplicate column names.
  left_join(x, y, by = c("C" = "D"), suffix = c("1", "2"))
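A runnable sketch that builds the sample tables and compares the mutating joins:

library(dplyr)
x <- data.frame(A = c("a", "b", "c"), B = c("t", "u", "v"), C = 1:3,
                stringsAsFactors = FALSE)
y <- data.frame(A = c("a", "b", "d"), B = c("t", "u", "w"), D = c(3, 2, 1),
                stringsAsFactors = FALSE)

left_join(x, y, by = c("A", "B"))   # all rows of x; D is NA where unmatched
inner_join(x, y, by = c("A", "B"))  # only the matching rows (a and b)
full_join(x, y, by = c("A", "B"))   # all rows of both tables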
Combine Cases

Use bind_rows() to paste tables below each other as they are.

bind_rows(..., .id = NULL)
  Returns tables one on top of the other as a single table. Set .id to a column name to add a column of the original table names.

intersect(x, y, ...)
  Rows that appear in both tables.

setdiff(x, y, ...)
  Rows that appear in the first table but not the second.

union(x, y, ...)
  Rows that appear in either table (duplicates removed). union_all() retains duplicates.

Use setequal() to test whether two data sets contain the exact same rows (in any order).
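A sketch of stacking and set operations on two overlapping tables (DF is an illustrative .id column name):

library(dplyr)
x <- data.frame(A = c("a", "b", "c"), B = c("t", "u", "v"), C = 1:3,
                stringsAsFactors = FALSE)
z <- data.frame(A = c("c", "d"), B = c("v", "w"), C = 3:4,
                stringsAsFactors = FALSE)

bind_rows(x = x, z = z, .id = "DF")  # DF records which table each row came from
intersect(x, z)  # rows in both: c v 3
setdiff(x, z)    # rows only in x: a t 1, b u 2
union(x, z)      # rows in either, duplicates removed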
Extract Rows

Use a "Filtering Join" to filter one table against the rows of another.

semi_join(x, y, by = NULL, ...)
  Return rows of x that have a match in y.
  USEFUL TO SEE WHAT WILL BE JOINED.

anti_join(x, y, by = NULL, ...)
  Return rows of x that do not have a match in y.
  USEFUL TO SEE WHAT WILL NOT BE JOINED.
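A sketch of the filtering joins with the same sample tables:

library(dplyr)
x <- data.frame(A = c("a", "b", "c"), B = c("t", "u", "v"), C = 1:3,
                stringsAsFactors = FALSE)
y <- data.frame(A = c("a", "b", "d"), B = c("t", "u", "w"), D = c(3, 2, 1),
                stringsAsFactors = FALSE)

semi_join(x, y, by = c("A", "B"))  # rows of x with a match in y: a and b
anti_join(x, y, by = c("A", "B"))  # rows of x without a match: c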
Row Names

Tidy data does not use rownames, which store a variable outside of the columns. To work with the rownames, first move them into a column.

rownames_to_column()
  Move row names into a column.
  a <- rownames_to_column(iris, var = "C")

column_to_rownames()
  Move a column into the row names.
  column_to_rownames(a, var = "C")

Also has_rownames() and remove_rownames().
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com
Learn more with browseVignettes(package = c("dplyr", "tibble")) • dplyr 0.5.0 • tibble 1.2.0 • Updated: 2017-01