R can perform various data analysis and data science tasks for free through its extensive packages and community support. It is an open-source statistical programming language that is widely used for data manipulation, visualization, and machine learning. Some key features of R include its ability to perform interactive visualization, ensemble learning, text/social media mining, and integration with other languages and technologies like SQL, Python, and Tableau. While powerful, R does have some limitations like a steep learning curve and slower execution compared to other languages.
Big Data refers to a large amount of data both structured and unstructured. For managing and analyzing this amount of data we need technologies like Hadoop and language like R.
http://www.techsparks.co.in/thesis-in-big-data-with-r/
Basic of R Programming Language,
Introduction, How to run R, R Sessions and Functions, Basic Math, Variables, Data Types, Vectors, Conclusion, Advanced Data Structures, Data Frames, Lists, Matrices, Arrays, Classes
Basic of R Programming Language
R is a programming language and environment commonly used in statistical computing, data analytics and scientific research.
These Lecture series are relating the use R language software, its interface and functions required to evaluate financial risk models. Furthermore, R software applications relating financial market data, measuring risk, modern portfolio theory, risk modeling relating returns generalized hyperbolic and lambda distributions, Value at Risk (VaR) modelling, extreme value methods and models, the class of ARCH models, GARCH risk models and portfolio optimization approaches.
Data Science - Part II - Working with R & R studioDerek Kane
This tutorial will go through a basic primer for individuals who want to get started with predictive analytics through downloading the open source (FREE) language R. I will go through some tips to get up and started and building predictive models ASAP.
This is our first version of the key products that have been used to offer services to our clients. We have about 30 tools mostly open source that are being used at our startup to develop minimum viable products
R is among the most popular programming languages among data science professionals. In this guide learn about the basic concepts and various functionalities it offers.
As part of the GSP’s capacity development and improvement programme, FAO/GSP have organised a one week training in Izmir, Turkey. The main goal of the training was to increase the capacity of Turkey on digital soil mapping, new approaches on data collection, data processing and modelling of soil organic carbon. This 5 day training is titled ‘’Training on Digital Soil Organic Carbon Mapping’’ was held in IARTC - International Agricultural Research and Education Center in Menemen, Izmir on 20-25 August, 2017.
The R language is a project designed to create a free, open source language which can be used as a replacement for the S-PLUS language, originally developed as the S language at AT&T Bell Labs, and currently marketed by Insightful Corporation of Seattle, Washington. R is an open source implementation of S, and differs from S-plus largely in its command-line only format.
Topics Covered:
1.Introduction to R
2.Installing R
3.Why Learn R
4.The R Console
5.Basic Arithmetic and Objects
6.Program Example
7.Programming with Big Data in R
8.Big Data Strategies in R
9.Applications of R Programming
10.Companies Using R
11.What R is not so good at
12.Conclusion
6. Freedom 0: The freedom to run the program however, whenever, and for whatever purpose you wish.
Freedom 1: The freedom to study how the program works and adapt it to your
needs. Access to the source code is a precondition for this.
Freedom 2: The freedom to redistribute copies so you can help
your neighbor.
Freedom 3: The freedom to improve the program and release
your improvements to the public, so that the whole community
benefits.
7. What is data science ?
Hacking ( Programming) +
Maths/Statistics + Domain Knowledge =
Data Science
8. So next: what is a data scientist?
A data scientist is simply a person who can
write code = in R, Python, Java, SQL, Hadoop (Pig, HQL, MR), etc.
= for data storage, querying, summarization, visualization
= how: efficiently and in time (fast results?)
= where: on databases, on cloud, on servers
and understand enough statistics
to derive insights from data
so the business can make decisions
9. Data Science with R :A popular language in Data Science
https://www.tiobe.com/tiobe-index/
10. There are some milestone dates in the development of R:
R version 4.2.1 (Funny-Looking Kid) was released on 2022-06-23.
► Early 1990s: The development of R began.
► August 1993: The software was announced on the S-news mailing list.
► www.r-project.org/mail.html
► June 1995: After some persuasive arguments by Martin Mächler, the code was made available as “free software” under
the FSF’s GNU GPL, Version 2.
► Mid-1997: The initial R Development Core Team (the core group) was formed.
► February 2000: The first version of R, version 1.0.0, was released.
► R : Past and Future History (r-project.org); https://cran.r-project.org/doc/html/interface98-paper/paper.html
11. What's great about R?
CRAN Packages By Date (r-project.org) https://cran.r-project.org/web/packages/
R can perform various data analysis and data science tasks for free
Interactive Visualization with Shiny package (Equivalent SAS Product : Visual Analytics)
Ensemble Learning / Machine Learning (SAS Product : SAS Enterprise Miner)
Text / Social Media Mining (SAS Product : SAS Text Miner)
Optimization and Forecasting (SAS Product : SAS ETS, PROC OPTMODEL)
RStudio IDE (SAS Product : SAS Enterprise Guide)
Integration: Tableau, SQL Server, VS, Power BI
The system saves data sets between sessions, so you don't need to reload them each time. It
saves your command history too.
12. What is R?
Free alternative to MATLAB, Excel, SAS and SPSS.
R is a:
1. Statistical Software
2. Language
3. Environment
4. Ecosystem
Used by Google, Facebook, Bank of America, etc.
Millions of users worldwide
13. Where is R used?
Big data demands of companies
analyse user behaviour.
online advertising and e-commerce
Weather services use it for weather forecasts.
It is a fundamental tool for analytics-driven
organizations
14.
15.
16. What is R?
R is a dialect of S.
S is a language that was developed by John
Chambers and others at Bell Labs.
S was initiated in 1976 as an internal statistical analysis environment,
originally implemented as Fortran libraries.
Early versions of the language did not contain functions for statistical
modelling.
17. Historical Notes
In 1988, the system was rewritten in the C language to make it more
portable across systems, and it began to resemble the system that we have
today.
In 1993 Bell Labs gave a corporation called StatSci, which became Insightful
Corporation, an exclusive license to develop and sell the S language.
In 2004, Insightful purchased the S language outright from Lucent for $2 million,
becoming its owner.
In 2006, Alcatel purchased Lucent Technologies, and it's now called Alcatel-Lucent.
Insightful sells its implementation of the S language under the product name S-PLUS
and has built a number of fancy features (mostly GUI) on top of it, hence the “PLUS”.
18. Version 4 of the S language was released in 1998, and it's the
version we more or less use today. The book Programming with Data, which is
a reference for this course, is written by John Chambers. It is sometimes called the
green book, and it documents version 4 of the S language.
In 2008 the Insightful Corporation was acquired by a company called TIBCO for $25
million.
The basic fundamentals of the S language have not really changed since 1998.
In 1998 S won the Association for Computing Machinery’s Software System Award.
19. 1991: R was created in New Zealand by Ross Ihaka
and Robert Gentleman.
1993: First announcement to the public.
1995: Martin Mächler convinced Ross and Robert to license R under the
GNU General Public License, making R free software.
1996: Public mailing lists are created (R-help and R-devel).
1997: The core group is formed. The core group controls the source code of R.
2000: R version 1.0.0 is released.
R version 4.2.1 (Funny-Looking Kid) was released on 2022-06-23.
20. What is R
R is an integrated suite of software facilities for data manipulation, calculation and graphical
display. Among other things it has:
An effective data handling and storage facility,
A suite of operators for calculations on arrays, in particular matrices,
A large, coherent, integrated collection of intermediate tools for data analysis,
Graphical facilities for data analysis and display either directly at the computer or on hardcopy, and
A well developed, simple and effective programming language (called ‘S’) which includes
conditionals, loops, user defined recursive functions and input and output facilities.
21. R is a system for statistical computation and graphics.
It consists of a language plus a run-time environment with graphics, a
debugger , access to certain system functions, and the ability to run programs
stored in script files.
It is free software distributed under a GNU-style copyleft, and an official part
of the GNU project (“GNU S”).
24. Alternatives to the standard R editors
Eclipse StatET www.walware.de/goto/statet IDE -java
Emacs Speaks Statistics http://ess.r-project.org Emacs, a powerful text and code
editor
Tinn-R www.sciviews.org/Tinn-R : This editor, developed specifically for working with
R, is available only for Windows
25. Design of the R System
The R system is divided into 2 conceptual parts:
1. The base R system that you download from CRAN
2. Everything else
R functionality is divided into a number of packages.
The “base” system contains, among other things, the base package, which is
required to run R and contains the most fundamental functions.
26. There are other packages contained in the base system, which include for
example utils, stats, datasets, graphics, grDevices, grid, methods, tools,
parallel, compiler, stats4.
There are also recommended packages, which are fundamental
packages that more or less everyone might use: boot for bootstrapping,
class for classification, cluster,
codetools, foreign, and a variety of other packages.
27. How R works
1. R is an interpreted language, not a compiled one.
2. Syntax: lm(y ~ x), which means “fit a linear model with y as response and x
as predictor”.
3. ls() lists the objects in the workspace; typing ls without parentheses prints the content of the function itself.
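As a small illustration of the lm(y ~ x) syntax above, here is a minimal sketch. The numbers in x and y are made-up toy data, not from the slides.

```r
# Hypothetical toy data: y grows roughly linearly with x
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Fit a linear model with y as response and x as predictor
fit <- lm(y ~ x)

# Intercept and slope of the fitted line
print(coef(fit))

# ls() lists the objects created so far (x, y, fit, ...)
print(ls())
```

Because R is interpreted, each line can be typed at the prompt and run immediately, with no separate compile step.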
29. Features
►It comes as free, open-source code: stable and reliable. License: www.r-project.org/COPYING.
►It runs anywhere: Mac, Windows, Unix systems.
►It supports extensions: data manipulation, statistical modeling, and graphics.
Extensibility: write your own software and distribute it in the form of add-on packages.
►It provides an engaged community :
► www.r-project.org/mail.html
www.stackoverflow.com/questions/tagged/r
http://stats.stackexchange.com/questions/tagged/r
www.twitter.com/search/rstats(R regional Conferences)
30. It connects with other languages: the R package foreign
(http://cran.r-project.org/web/packages/foreign/index.html) reads data from SPSS, SAS, and Stata;
RODBC and ROracle connect to databases.
Unique features:
Performing multiple calculations with vectors: R is a vector-based language.
Ex: x <- 1:5
Call x:
> x
[1] 1 2 3 4 5
> x + 2
> x + 6:10 (two vectors)
Processing more than just statistics:
data processing, graphic visualization,
and analysis of all sorts.
Running code without a compiler:
the development cycle is easy; the downside of an
interpreted language is that it is slow.
Object-oriented and functional
programming.
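The vector example on this slide can be run as written. The sketch below repeats it and adds the results as comments; the names plus_two and pairwise are only illustrative.

```r
x <- 1:5
print(x)               # 1 2 3 4 5

# A scalar is applied to every element
plus_two <- x + 2
print(plus_two)        # 3 4 5 6 7

# Two vectors of equal length are combined element by element
pairwise <- x + 6:10
print(pairwise)        # 7 9 11 13 15
```

No explicit loop is needed in either case, which is what "vector-based" means here.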
31. Distributed Computing
In distributed computing, tasks are split between multiple processing nodes to reduce
processing time and increase efficiency. The ddR and multidplyr packages target large data sets.
Compatibility with Other Data Processing Technologies
R can be easily paired with other data processing and distributed computing
technologies like Hadoop and Spark.
It is possible to remotely use a Spark cluster to process large datasets using R.
Generates reports in any desired format: R Markdown (the rmarkdown package)
32. Limitations of R
Steep Learning Curve: R is not an easy language to get started with. Beginners find it
hard to get their feet wet due to the command-line interface. (RStudio)
Hungry for Physical Memory: R stores all its data in physical memory, which makes it hard to
handle large data sets. Hadoop integration for R
Slower execution: R would need a lot of optimizations before your code can run as
fast as it does on MATLAB or Python.
33. Drawbacks of R
Essentially based on 40-year-old technology.
Little built-in support for dynamic and 3-D graphics.
Functionality is based on consumer demand and user contributions.
Objects must be stored in the physical memory of the computer, but there have been
advancements to deal with this too.
Not ideal for all possible situations.
34. Some important commands
1. help(command), ?command: show the help page for a command
2. help.start(): opens the help system in the system default browser
3. apropos("partword"): shows all the commands whose names contain "partword"
4. install.packages("pkg"): installs a library of commands from the CRAN website
5. installed.packages(): lists the packages installed
6. library(pkg): loads a package of commands and makes them available for use (the
pkg must be installed)
7. search(): shows a list of all packages (and other objects) that are loaded and
available for use
8. detach(package:name): unloads a package; replace name with the package name
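A short session using several of these commands might look like the sketch below. It only touches packages that ship with R (stats and tools). The install.packages() call is left commented out because it needs a network connection, and "pkg" is a placeholder name.

```r
# Find every visible object whose name contains "mean"
hits <- apropos("mean")
print(head(hits))

# Names of all installed packages
pkgs <- rownames(installed.packages())

# Attach a package, confirm it is on the search path, then detach it
library(tools)
attached <- "package:tools" %in% search()
detach("package:tools")
still_attached <- "package:tools" %in% search()

# install.packages("pkg")   # placeholder name; requires internet access
```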
35. pacman: makes all packages
available
install.packages("pacman")
require(pacman)   # gives a confirmation message
library(pacman)   # no message
p_unload(dplyr, ggplot2, tidyr)   # clear specific packages
p_unload(all)   # clear all add-on packages
detach("package:datasets", unload = TRUE)   # for base packages
38. Reserved Words in R
The reserved words in R's parser are
if else repeat while function for in next break
TRUE FALSE NULL Inf NaN NA NA_integer_ NA_real_
NA_complex_ NA_character_
... and ..1, ..2 etc,
40. Operator Syntax and Precedence
The following unary and binary operators are defined. They are
listed in precedence groups, from highest to lowest.
:: ::: access variables in a namespace
$ @ component / slot extraction
[ [[ indexing
^ exponentiation (right to left)
- + unary minus and plus
: sequence operator
%any% |> special operators (including %% and %/%)
* / multiply, divide
41. Operator and Precedence
+ - (binary) add, subtract
< > <= >= == != ordering and comparison
! negation
& && and
| || or
~ as in formulae
-> ->> rightwards assignment
<- <<- assignment (right to left)
= assignment (right to left)
? help (unary and binary)
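A few one-liners make the precedence table concrete. This is a minimal sketch; the variable names are only for illustration.

```r
# ^ binds tighter than unary minus: -(2^2)
p1 <- -2^2
# ^ groups right to left: 2^(3^2)
p2 <- 2^3^2
# unary minus binds tighter than ':' : (-1):3
p3 <- -1:3
# ':' binds tighter than '*': (1:3) * 2
p4 <- 1:3 * 2
# '!' binds tighter than '&': (!TRUE) & FALSE
p5 <- !TRUE & FALSE
# '&' binds tighter than '|': TRUE | (FALSE & FALSE)
p6 <- TRUE | FALSE & FALSE

print(list(p1, p2, p3, p4, p5, p6))
```

When in doubt, parentheses make the intended grouping explicit and cost nothing.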
51. # R program to illustrate
# the use of Arithmetic operators
vec1 <- c(0, 2)
vec2 <- c(2, 3)
# Performing operations on operands; "\n" ends each output line
cat("Addition of vectors :", vec1 + vec2, "\n")
cat("Subtraction of vectors :", vec1 - vec2, "\n")
cat("Multiplication of vectors :", vec1 * vec2, "\n")
cat("Division of vectors :", vec1 / vec2, "\n")
cat("Modulo of vectors :", vec1 %% vec2, "\n")
cat("Power operator :", vec1 ^ vec2, "\n")
52. %in%
The %in% operator in R can be used to identify if an element (e.g., a
number) belongs to a vector or data frame. For example, it can be used
to see if the number 1 is in the sequence of numbers 1 to 10.
53. %*% Operator:
This operator performs matrix multiplication, for example multiplying a matrix by its transpose.
The number of columns of the first matrix must be equal to the
number of rows of the second matrix.
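As a quick check of the dimension rule, the sketch below multiplies a 2 x 3 matrix by its 3 x 2 transpose. The matrix m is a made-up example.

```r
# 2 x 3 matrix filled column by column: rows are (1, 3, 5) and (2, 4, 6)
m <- matrix(1:6, nrow = 2)

# (2 x 3) %*% (3 x 2) gives a 2 x 2 result
prod_mt <- m %*% t(m)
print(prod_mt)

# (2 x 3) %*% (2 x 3) is not conformable, so R signals an error
bad <- tryCatch(m %*% m, error = function(e) e)
print(inherits(bad, "error"))
```

Note that `*` on two matrices multiplies element by element; only `%*%` does the row-by-column product.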
54. What is the Difference Between the == and
%in% Operators in R
The %in% operator is used for matching values. It is built on match(), which “returns a vector of the positions
of (first) matches of its first argument in its second”; %in% itself returns a logical vector saying whether each element has a match.
On the other hand, the == operator is a logical operator used to compare, element by element, whether
two values are exactly equal. Using the %in% operator you can compare
vectors of different lengths to see if elements of one vector match at least one
element in another.
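The difference is easiest to see side by side. A minimal sketch, using two short example vectors:

```r
a <- 1:5
b <- 3:12

# %in%: is each element of a found anywhere in b?
in_b <- a %in% b        # FALSE FALSE TRUE TRUE TRUE

# match(): position in b of the first match, NA when there is none
pos <- match(a, b)      # NA NA 1 2 3

# ==: element-by-element comparison
eq3 <- a == 3           # FALSE FALSE TRUE FALSE FALSE

print(in_b)
print(pos)
print(eq3)
```

Here a and b have different lengths, which %in% handles naturally; == on vectors of different lengths would recycle the shorter one instead.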
55. 1: Using %in% to Compare two Sequences of
Numbers (vectors)
# sequence of numbers 1:
a <- seq(1, 5)
# sequence of numbers 2:
b <- seq(3, 12)
# using the %in% operator to check matching values in the vectors
a %in% b
57. Arithmetic Operators
These unary and binary operators perform
arithmetic on numeric or complex vectors (or
objects)
+ x - x
x + y x - y
x * y
x / y
x ^ y
x %% y
x %/% y
61. Language objects
Language objects : calls, expressions, and names.
objects have modes "call", "expression", and "name",
They can be created directly from expressions using the quote
mechanism and converted to and from lists by the as.list and as.call
functions.
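A minimal sketch of the three kinds of language object, built with quote() and converted with as.list() and as.call(). The example expressions are arbitrary.

```r
# A call, created without being evaluated
cl <- quote(sum(1, 2))
print(mode(cl))                  # "call"

# A name (symbol) and an expression vector
nm <- quote(x)
ex <- expression(1 + 2)
print(mode(nm))                  # "name"
print(mode(ex))                  # "expression"

# A call converts to a list: the function name followed by its arguments
parts <- as.list(cl)

# ...and a list converts back to a call, which can then be evaluated
cl2 <- as.call(list(as.name("max"), 4, 9))
result <- eval(cl2)
print(result)                    # 9
```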
62.
63. Entering Input
• At the R prompt we type expressions.
> x <- 1
> print(x)
[1] 1
> msg <- "Welcome"
> S <- rep(obj, times = 10)
> seq(length = 100, from = 4, by = 1)
The grammar of the language determines whether an expression is
complete or not.
> x <-   # incomplete expression
64. R commands
R commands, case sensitivity, etc. (country locale)
Executing commands from or diverting output to a file
source("commands.R") # execute the commands saved in the file named commands.R
sink("record.lis") # divert all subsequent output from the console to an external
file, record.lis.
sink() # restores output to the console once again.
.RData # all objects in the workspace
.Rhistory # command lines used in the session
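To try source() and sink() without leaving files behind, the sketch below uses temporary files in place of commands.R and record.lis. The script content (answer <- 6 * 7) is made up for the example.

```r
# Write a one-line script to a temporary file and run it with source()
script_file <- tempfile(fileext = ".R")
writeLines("answer <- 6 * 7", script_file)
source(script_file)
print(answer)

# Divert console output to a temporary file, then restore it
out_file <- tempfile(fileext = ".lis")
sink(out_file)
print(1:3)
sink()

# The diverted output is now in the file, not on the console
captured <- readLines(out_file)
print(captured)
```

Forgetting the closing sink() is a common slip: output keeps going to the file and the console looks silent.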
65. Data permanency and removing objects
The entities that R creates and manipulates are known as objects.
variables, arrays of numbers, character strings, functions, or
structures built from such components.
During an R session, objects are created and stored by name
> objects()
The collection of objects currently stored is called the workspace.
To remove objects
> rm(x, y, ….)
66. Objects, their modes and attributes
The entities R operates on are technically known as objects.
Example: “atomic” vectors, whose components all have the same mode
Recursive: list, function and expression
mode: the basic type of an object's fundamental constituents. This is a special case of a
“property” of an object
mode(object) and length(object)
compl <- c(2+3i, 4+5i)   # length 2, mode "complex"
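The complex-vector example from this slide, plus a few other objects for comparison, can be checked with mode() and length(). A minimal sketch:

```r
compl <- c(2+3i, 4+5i)
print(mode(compl))      # "complex"
print(length(compl))    # 2

# Atomic vectors: every component has the same mode
num_mode <- mode(c(1, 2, 3))
chr_mode <- mode(c("a", "b"))

# Recursive objects can hold components of different modes
lst <- list(1, "a", TRUE)
lst_mode <- mode(lst)
fun_mode <- mode(sum)

print(c(num_mode, chr_mode, lst_mode, fun_mode))
```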
67. Properties of an object are usually provided by attributes(object)
as.character(object) as.complex(object)
Empty object
emp_obj <- character()
emp_obj[6] <- 57
Changing the length of an object
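A minimal sketch of the empty-object example: assigning past the end of a vector extends it, and assigning to length() truncates it. The matrix at the end is an extra illustration of attributes().

```r
# Start with an empty character vector
emp_obj <- character()
len_before <- length(emp_obj)

# Assigning to element 6 extends the vector; elements 1 to 5 become NA,
# and the number 57 is coerced to the string "57"
emp_obj[6] <- 57
len_after <- length(emp_obj)
print(emp_obj)

# Shrinking: keep only the first three elements
length(emp_obj) <- 3
print(length(emp_obj))

# attributes() reports non-intrinsic properties, such as a matrix's dim
attrs <- attributes(matrix(1:4, nrow = 2))
print(attrs)
```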
68. The class of an object
All objects in R have a class, reported by the function class.
A special attribute known as the class of the object is used to allow
for an object-oriented style of programming in R.
# To remove temporarily the effects of class, use the function
unclass(). For example if winter has the class "data.frame" then
> winter
will print it in data frame form, which is rather like a matrix, whereas
101. Learn R Programming (Tutorial & Examples) | Free Introduction Course (statisticsglobe.com)
R Guides – Statology
https://www.youtube.com/watch?v=fpl_ny-jX5Y&list=RDCMUC87aeHqMrlR6ED0w2SVi5nw&start_radio=1&rv=fpl_ny-jX5Y&t=0
https://www.youtube.com/watch?v=UYclmg1_KLk&list=PLqzoL9-eJTNDw71zWePXyHx3_cm_fMP8S&index=3
102.
103. Your best quote that reflects your
approach… “It’s one small step for
man, one giant leap for mankind.”
- NEIL ARMSTRONG
109. Identifying potential problems.
Optimizing price dynamically.
Improving the allocation of “available to promise” inventory.
What is supply chain management? | IBM
www.fsf.org
https://www.youtube.com/watch?v=ckdHNu4kfL0
Why is Supply Chain Management is important?
Editor's Notes
1. When R is running, variables, data, functions, results, etc, are stored in the active memory of the computer in the form of objects which have a name.
The user can do actions on these objects with operators (arithmetic, logical, comparison, . . .) and functions (which are themselves objects).
All the actions of R are done on objects stored in the active memory of the computer: no temporary files are used (Fig. 1).
The readings and writings of files are used for input and output of data and results (graphics, . . .).
The user executes the functions via some commands. The results are displayed directly on the screen, stored in an object,
or written on the disk (particularly for graphics). Since the results are themselves objects, they can be considered as data and analyzed as such.
Data files can be read from the local disk or from a remote server through internet.
In every computer language variables provide a means of accessing the data stored in memory. R does not provide direct access to the computer’s memory but rather provides a number of specialized data structures we will refer to as objects. These objects are referred to through symbols or variables. In R, however, the symbols are themselves objects and can be manipulated in the same way as any other object. This is different from many other languages and has wide ranging effects. In this chapter we provide preliminary descriptions of the various data structures provided in R. More detailed discussions of many of them will be found in the subsequent chapters. The R specific function typeof returns the type of an R object. Note that in the C code underlying R, all objects are pointers to a structure with typedef SEXPREC; the different R data types are represented in C by SEXPTYPE, which determines how the information in the various parts of the structure is used.
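The typeof function mentioned above can be tried on a few everyday objects. A minimal sketch; the stored names are only for illustration.

```r
t_int  <- typeof(1L)         # "integer"
t_dbl  <- typeof(1.5)        # "double"
t_chr  <- typeof("a")        # "character"
t_lgl  <- typeof(TRUE)       # "logical"
t_lst  <- typeof(list())     # "list"

# Symbols are objects too, and so are functions
t_sym  <- typeof(quote(x))   # "symbol"
t_clo  <- typeof(mean)       # "closure" (a function written in R)
t_blt  <- typeof(sum)        # "builtin" (a function implemented in C)

print(c(t_int, t_dbl, t_chr, t_lgl, t_lst, t_sym, t_clo, t_blt))
```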
So factor is a special type of vector, which is used to create,
to represent categorical data.
So, there are two types of factor, unordered and ordered. So
you can think of an unordered factor as storing data that
have labels that are categorical but have no ordering, so for
example male and female.
Or you can have ordered factors which might represent things that are ranked.
So they have an order but they're not numerical for example you know,
in many universities you'll have assistant professors, associates professors, and
full professors.
Those are categorical but they're ordered.
So one, you can think of a factor as an integer vector where
each integer has a label.
So for example, you might, you can think of it as a vector as one two three,
where one represents you know, high, for example high value and
two represents a medium value and three represents a low value.
So you might have a, a variable that's called high, medium and low.
And underlying in R is represented by the numbers one, two, and three.
so, factors are important because they're treated specially by modeling functions
like lm and glm which we'll talk about later.
But these are functions for, for, for fitting linear models.
And factors are with labels generally speaking are better than using
simple integer vectors because the factors are, what are called self describing.
So having a variable that has values male and female is more
descriptive than having a variable that just, that just has ones and twos.
So for example, in many data sets you'll find that
there will be a variable that's coded as one and two, and it's not
easy to know whether that variable is really a numeric variable that only
takes values one and two, but the problem is that's not something that's coded in
the data set, so it's hard to tell.
If you use a factor variable then the coding for the labels is all,
is kind of built into the variable and it's much easier to understand.
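The points about factors can be sketched in a few lines. The data below (four sex labels, three academic ranks) are made up to mirror the examples in the talk.

```r
# Unordered factor: labels with no ranking
sex <- factor(c("male", "female", "female", "male"))
print(levels(sex))                # levels sort alphabetically: "female" "male"

# Underneath, a factor is an integer vector where each integer has a label
codes <- as.integer(sex)
print(codes)                      # 2 1 1 2

# Ordered factor: categorical but ranked
prof <- factor(c("assistant", "full", "associate"),
               levels = c("assistant", "associate", "full"),
               ordered = TRUE)
lower <- prof[1] < prof[2]        # is assistant ranked below full?
print(lower)

# Counts per label are self-describing
print(table(sex))
```

Passing levels explicitly also controls which level modeling functions treat as the baseline.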
So there's a special type of object that we haven't talked too much about yet.
And these are missing values.
Missing values in R are denoted by either NA or NaN, which we talked about before.
NaN is used for undefined mathematical operations.
And NA is pretty much used for everything else.
And so, there's a function in R called is.na which is used to test objects to
see if they are NA,
to see if there are missing values in that object.
There's another function called is.nan which is used to test for NaNs.
So, NA values can have a class, too.
So you can have missing integer val, values or
you can have missing character values or missing numeric values etc.
And so even though it looks like it's all NAs,
the NAs can have different classes potentially.
And then a NaN value is also considered to be NA, so for
example, a NaN value is considered to be missing.
But the reverse is not true.
So an NA value is not necessarily a NaN value.
I've got a few different types of missing values listed here.
So, here I created a vector x which is 1, 2, NA, 10, and 3.
Now, this is a numeric vector,
and the NA value in here is going to be a numeric missing value.
So when I call is.na on x, what it returns is a logical vector.
And the logical vector indicates whether each element of the vector x
is missing or not.
There's only one missing element in this vector, and
that's the third element.
So you can see in the logical vector that's returned,
the first two are false, the third is true, and the fourth and
the fifth are false.
So the element that's true indicates where the missing value is.
If I call is.nan on this vector,
you'll see that the vector that's returned is all false,
because there aren't any NaN values
in this vector, so everything's false.
Of course, if I create a vector that has a NaN value and
an NA value in it,
you'll see that is.na returns true for both of them.
But is.nan only returns true for the value that's actually NaN.
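Here is roughly what that looks like at the console. The first vector is the one described above; the second vector, y, is an assumed example, since the exact values on the slide are not given in the transcript.

```r
# The vector from the lecture: one NA in the third position.
x <- c(1, 2, NA, 10, 3)
is.na(x)    # FALSE FALSE  TRUE FALSE FALSE
is.nan(x)   # FALSE FALSE FALSE FALSE FALSE

# An assumed vector containing both a NaN and an NA.
y <- c(1, 2, NaN, NA, 4)
is.na(y)    # FALSE FALSE  TRUE  TRUE FALSE
is.nan(y)   # FALSE FALSE  TRUE FALSE FALSE
```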
The last data type I'm going to talk about here is the data frame.
The data frame is a key data type in R, and it's used to store tabular data.
Tabular data make up a lot of what we use in statistics.
Of course, not all types of data are tabular,
but because so much data comes in tabular form,
data frames are very important in R.
Data frames are basically represented as a special type of list,
where every element of that list has the same length.
So you can think of each column of the data frame as an element of the list,
and of course, in order to be a table, every column has to have the same length.
However, each column doesn't have to be the same type.
So the first column could be numbers, the second column could be a factor,
the third column could be integers, the fourth column could be logicals;
it doesn't matter what the different types are.
So, unlike matrices,
which have to store the same type of object in every single element of
the matrix, data frames can store objects of different classes.
Data frames also have some special attributes.
The first special attribute is called row.names,
so every row of a data frame has a name.
This can be useful for annotating the data.
For example, each row might represent a subject enrolled in a study,
and then the row names would be the subject IDs.
However, sometimes the row names are not interesting, and
often you'll just use row names of 1, 2, 3, et cetera.
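To make the list-of-columns idea and the row names concrete, here is a small sketch. The study data, the column names, and the subject IDs are all made up for illustration; only data.frame, row.names, is.list, and sapply are real base R functions.

```r
# Hypothetical study data: each row is a subject, labeled by a subject ID.
study <- data.frame(
  age       = c(34L, 51L, 27L),       # integer column
  treated   = c(TRUE, FALSE, TRUE),   # logical column
  row.names = c("subj01", "subj02", "subj03")
)

is.list(study)         # TRUE: a data frame is a special kind of list
sapply(study, class)   # each column keeps its own class
row.names(study)       # "subj01" "subj02" "subj03"

# Without explicit row names, the rows are simply labeled 1, 2, 3, ...
plain <- data.frame(age = c(34L, 51L, 27L))
row.names(plain)       # "1" "2" "3"
```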
Data frames are most often created by calling the read.table or read.csv
function, and we'll get into that a little bit when I talk about reading data into R.
You can also create a matrix from a data frame by
calling the data.matrix function.
Now, if you have a data frame that has many different types of objects
and you coerce it into a matrix, it's going to force
each object to be coerced so that they're all the same type.
So you may get something that's not exactly what you expected.
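A quick sketch of that coercion, using an assumed data frame called mixed with a numeric, a logical, and a factor column. With data.matrix, logicals become 0/1 and factors are replaced by their internal integer codes, so everything ends up numeric.

```r
# Assumed example data frame with three different column types.
mixed <- data.frame(
  score  = c(1.5, 2.5),
  passed = c(TRUE, FALSE),
  group  = factor(c("b", "a"))   # levels are sorted: "a" = 1, "b" = 2
)

m <- data.matrix(mixed)
m           # passed becomes 1/0, group becomes the codes 2 and 1
typeof(m)   # "double": every element now has the same type
```

Note that the group labels "a" and "b" are gone from the matrix; only their codes survive, which is the kind of result that may not be what you expected.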
Besides using read.table or
read.csv, you can create data frames with the data.frame function, and here I've
created a very simple data frame where the first column
is the foo variable and the second column is the bar variable.
The foo variable is an integer sequence from one to four, and
the bar variable is a logical vector with two trues and two falses.
So when I autoprint the data frame out, you'll see that
it prints out the two columns, and the row names, since I didn't specify any
special row names, just default to 1, 2, 3, 4, because there are four rows.
And then when I call the nrow function on x,
I see that there are four rows, and the ncol function shows me that there are two columns.
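The data frame just described can be reproduced along these lines. The order of the trues and falses in bar is an assumption, since the transcript only says there are two of each.

```r
# foo is the integer sequence 1 to 4; bar is a logical vector
# (the TRUE/FALSE order here is assumed).
x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))

x         # autoprints two columns with default row names 1, 2, 3, 4
nrow(x)   # 4
ncol(x)   # 2
```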
R objects can also have names.
This is not true just for data frames;
it's true for R objects in general.
And this can be very useful for writing readable code and self-describing objects.
So for example, I'm creating a vector that's an integer sequence 1, 2,
3, and by default there are no names.
So when I call the names function on x, it gives me a NULL value.
However, I can give a name to each element of the vector x.
So for example, I can say the first element's called foo,
the second element's called bar, and the third element's called norf.
So now when I print out my x vector, I get a vector 1, 2, 3, but
each one has a name over it, which is the name I just specified.
And so when I call the names function I get
the names that are associated with each element of the vector: foo, bar, and norf.
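In code, the vector example looks something like this. The last line, indexing by name, is an extra illustration that is not in the lecture.

```r
x <- 1:3
names(x)    # NULL: no names by default

# Assign one name per element.
names(x) <- c("foo", "bar", "norf")
x           # prints 1 2 3 with foo, bar, norf above the values
names(x)    # "foo" "bar" "norf"

# Names also let you pick out elements by label.
x["bar"]
```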
Lists can also have names.
And so for example here I'm creating a list with the list function
where the first element is called a, the second element is called b, and
the third element is called c.
And so when I print out the list, it prints out the names of each element and
the values associated with those names.
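A short sketch of a named list. The element names a, b, and c come from the lecture; the values 1, 2, and 3 are assumed, since the transcript does not state them.

```r
# Names are given directly in the call to list().
x <- list(a = 1, b = 2, c = 3)

x          # prints $a, $b, $c, each followed by its value
names(x)   # "a" "b" "c"
x$b        # elements can be extracted by name
```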
Finally, matrices can have names.
These are called dimnames.
So here I created a matrix from the sequence 1 to 4.
It's a two by two matrix.
And so when I use the dimnames function, I assign it a list,
where the first element of the list is the vector of row names and
the second element of the list is the vector of column names.
So here I want to name the rows a and b, and I want to name the columns c and d.
So that's what I assign with the dimnames function.
And now when I print out my matrix I can see that the row names and
the column names are labeled as I wanted.
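The matrix example can be sketched as follows. Note that matrix fills column by column, so the second column holds 3 and 4.

```r
m <- matrix(1:4, nrow = 2, ncol = 2)

# Assign a list: row names first, then column names.
dimnames(m) <- list(c("a", "b"), c("c", "d"))

m
#   c d
# a 1 3
# b 2 4
```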