The document outlines an agenda for a workshop on hands-on data science with R. The pre-lunch session covers R fundamentals including installing R and RStudio, data types and structures, control flow, and data import/export. The post-lunch session covers data analysis techniques in R like working with datasets, visualization, classification, regression, clustering, and association rule mining. Examples use the built-in iris dataset and other datasets.
The document defines a function called covcor() that calculates and returns the covariance and correlation between variables in a data frame. The function takes a data frame as input, splits it by a grouping variable, applies covariance and correlation calculations to subsets of the data, and combines the results into an output data frame. Three methods for defining the covcor() function are presented: 1) Using subset() and merge(), 2) Using tapply(), and 3) Using ddply() from the plyr package. The function is demonstrated on orange tree data to calculate covariance and correlation between tree age and circumference for each tree. Transforming the circumference variable affects the covariance but not the correlation, demonstrating properties of these statistical measures.
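As a hedged sketch, the ddply() variant might look like the following; covcor() here is a plausible reconstruction of the function described above, not the slides' exact code, and Orange is R's built-in orange tree growth dataset:

```r
library(plyr)

# Reconstruction of covcor() using the ddply() approach
covcor <- function(df) {
  ddply(df, .(Tree), summarise,
        covariance  = cov(age, circumference),
        correlation = cor(age, circumference))
}

covcor(Orange)

# Rescaling circumference changes the covariance but not the correlation
covcor(transform(Orange, circumference = 2.54 * circumference))
```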
Overview of a few ways to group and summarize data in R using sample airfare data from DOT/BTS's O&D Survey.
Starts with a naive approach using subset() and loops, shows base R's tapply() and aggregate(), and highlights the doBy and plyr packages (a base-R sketch follows below).
Presented at the March 2011 meeting of the Greater Boston useR Group.
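For flavor, the two base-R approaches look like this on a stand-in dataset (the DOT/BTS airfare data is not bundled with R, so iris is used here):

```r
# tapply(): returns a vector of group summaries
tapply(iris$Sepal.Length, iris$Species, mean)

# aggregate(): the same summary as a data frame, via the formula interface
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
```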
Is it easier to add functional programming features to a query language, or to add query capabilities to a functional language? In Morel, we have done the latter.
Functional and query languages have much in common, and yet much to learn from each other. Functional languages have a rich type system that includes polymorphism and functions-as-values and Turing-complete expressiveness; query languages have optimization techniques that can make programs several orders of magnitude faster, and runtimes that can use thousands of nodes to execute queries over terabytes of data.
Morel is an implementation of Standard ML on the JVM, with language extensions to allow relational expressions. Its compiler can translate programs to relational algebra and, via Apache Calcite’s query optimizer, run those programs on relational backends.
In this talk, we describe the principles that drove Morel’s design, the problems that we had to solve in order to implement a hybrid functional/relational language, and how Morel can be applied to implement data-intensive systems.
(A talk given by Julian Hyde at Strange Loop 2021, St. Louis, MO, on October 1st, 2021.)
This set of slides is based on the presentation I gave at ACM Data Science Camp 2014. It is suitable for those who are still new to R: it covers a few basic data manipulation techniques, then goes into the basics of using the dplyr package (by Hadley Wickham). #rstats #dplyr
(Presented by Antonio Piccolboni to Strata 2012 Conference, Feb 29 2012).
RHadoop is an open source project spearheaded by Revolution Analytics to grant data scientists access to Hadoop’s scalability from their favorite language, R. RHadoop consists of three packages.
- rhdfs provides file-level manipulation for HDFS, the Hadoop file system
- rhbase provides access to HBase, the Hadoop database
- rmr allows you to write MapReduce programs in R
rmr allows R developers to program in the MapReduce framework, and offers all developers an alternative way to implement MapReduce programs that strikes a delicate compromise between power and usability. It lets you write general MapReduce programs with the full power and ecosystem of an existing, established programming language. It doesn’t force you to replace the R interpreter with a special run-time: it is just a library. You can write logistic regression in half a page and even understand it. It feels and behaves almost like the usual R iteration and aggregation primitives. It consists of a handful of functions with a modest number of arguments and sensible defaults that combine in many useful ways. But there is no way to prove that an API works: one can only show examples of what it makes possible, and we will do that, covering a few from machine learning and statistics. Finally, we will discuss how to get involved.
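As a hedged sketch of the rmr style, using the later rmr2 API and its local backend so no Hadoop cluster is needed (the original talk predates rmr2, so details may differ):

```r
library(rmr2)
rmr.options(backend = "local")  # run without a Hadoop cluster

small.ints <- to.dfs(1:100)                       # put data into the (local) DFS
out <- mapreduce(
  input  = small.ints,
  map    = function(k, v) keyval(v %% 10, v),     # key each value by its last digit
  reduce = function(k, vv) keyval(k, sum(vv)))    # sum the values per key
from.dfs(out)                                     # pull results back into R
```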
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer and in the projects that use it, such as Apache Hive, Phoenix, and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
A talk given by Julian Hyde at Apache: Big Data, Miami, on May 16th 2017.
Move your data (Hans Rosling style) with googleVis + 1 line of R code (Jeffrey Breen)
This document describes a lightning talk presented at the Greater Boston useR Group in July 2011 about using the googleVis package in R to create motion charts with only one line of code. It discusses Hans Rosling's use of animated charts, how Google incorporated this into their visualization API, and how the googleVis package allows users to leverage this in R. The talk includes examples of creating motion charts in R with googleVis using sample airline data.
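The "one line" looks roughly like this; the sketch below uses the Fruits demo dataset that ships with googleVis rather than the talk's airline data:

```r
library(googleVis)

M <- gvisMotionChart(Fruits, idvar = "Fruit", timevar = "Year")
plot(M)  # renders the motion chart in a browser
```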
This document provides an overview of the dplyr package in R. It describes several key functions in dplyr for manipulating data frames, including verbs like filter(), select(), arrange(), mutate(), and summarise(). It also covers grouping data with group_by() and joining data with joins like inner_join(). Pipelines of dplyr operations can be chained together using the %>% operator from the magrittr package. The document concludes that dplyr provides simple yet powerful verbs for transforming data frames in a convenient way.
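A minimal pipeline illustrating the verbs and %>% chaining described above, on a built-in dataset:

```r
library(dplyr)

mtcars %>%
  filter(cyl != 6) %>%             # keep matching rows
  select(mpg, cyl, wt) %>%         # keep named columns
  mutate(kml = mpg * 0.425) %>%    # add a derived column
  arrange(desc(mpg)) %>%           # sort
  head()
```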
This document provides an overview of using sparklyr to perform data science tasks with Apache Spark. It discusses how sparklyr allows users to import data, transform it using dplyr verbs and Spark SQL, build and evaluate models using Spark MLlib and H2O, and visualize results. It also covers important concepts like connecting to different Spark cluster configurations and tuning Spark settings. Finally, it demonstrates an example workflow performing preprocessing, modeling, and visualization on the iris dataset using sparklyr functions.
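A hedged sketch of that workflow against a local Spark instance (the cluster configuration and tuning settings from the deck are omitted; note that copy_to() turns dots in column names into underscores):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, overwrite = TRUE)

iris_tbl %>%
  group_by(Species) %>%
  summarise(mean_petal = mean(Petal_Length)) %>%
  collect()                                        # bring the result back to R

fit <- ml_kmeans(iris_tbl, ~ Petal_Length + Petal_Width, k = 3)  # Spark MLlib
spark_disconnect(sc)
```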
This document provides a cheat sheet for frequently used commands in Stata for data processing, exploration, transformation, and management. It highlights commands for viewing and summarizing data, importing and exporting data, string manipulation, merging datasets, and more. Keyboard shortcuts for navigating Stata are also included.
The document discusses recent developments in the R programming environment for data analysis, including packages like magrittr, readr, tidyr, and dplyr that enable data wrangling workflows. It provides an overview of the key functions in these packages that allow users to load, reshape, manipulate, model, visualize, and report on data in a pipeline using the %>% operator.
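In miniature, the pipeline style these packages enable; the file and column names below are hypothetical:

```r
library(readr)
library(tidyr)
library(dplyr)

read_csv("measurements.csv") %>%          # load
  gather(variable, value, -subject) %>%   # reshape wide -> long
  group_by(subject, variable) %>%         # manipulate
  summarise(mean_value = mean(value))     # summarise
```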
This document discusses various techniques for manipulating data in R, including sorting, subsetting, ordering, reshaping between wide and long formats using the reshape2 package, and using plyr for efficient splitting and combining of large datasets. Specific functions and examples covered include sort(), order(), cut(), melt(), dcast(), and plyr functions. The goal is to demonstrate common ways to manipulate and rearrange data for further processing and analysis in R.
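A small melt()/dcast() round trip showing the wide-to-long-and-back reshaping described above:

```r
library(reshape2)

long <- melt(iris, id.vars = "Species")   # wide -> long
head(long)

# long -> wide, aggregating with mean() since each Species repeats
dcast(long, Species ~ variable, fun.aggregate = mean)
```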
Python for R developers and data scientists (Lambda Tree)
This is an introductory talk aimed at data scientists who are well versed with R but would like to work with Python as well. I will cover common workflows in R and how they translate into Python. No Python experience necessary.
Desk reference for data transformation in Stata. Co-authored with Tim Essam (@StataRGIS, linkedin.com/in/timessam). See all cheat sheets at http://bit.ly/statacheatsheets. Updated 2016/06/03.
These slides give a full overview of reading different types of data into R. They are based on a Coursera course, which was quite comprehensive, and I made some additional changes too.
Stata cheat sheet: programming. Co-authored with Tim Essam (linkedin.com/in/timessam). See all cheat sheets at http://bit.ly/statacheatsheets. Updated 2016/06/04
The document discusses various SQL commands and techniques. It provides explanations and examples of:
1) How to use the SPOOL command to output a SQL query to a text file. The SPOOL command directs query output to a file, and SPOOL OFF closes the file.
2) Common questions about SQL, including the differences between DDL, DML, and DCL commands, escaping special characters in queries, eliminating duplicate rows, generating primary keys, calculating time differences between dates, and adding to date values.
3) Techniques for counting distinct values or value ranges in a column, retrieving a specific row number from a table, and implementing IF-THEN-ELSE logic in a SQL statement.
Grouped data frames allow dplyr functions to manipulate each group separately. The group_by() function creates a grouped data frame, while ungroup() removes grouping. Summarise() applies summary functions to columns to create a new table, such as mean() or count(). Join functions combine tables by matching values. Left, right, inner, and full joins retain different combinations of values from the tables.
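For example (the lookup table here is invented for illustration):

```r
library(dplyr)

per_cyl <- mtcars %>%
  group_by(cyl) %>%                            # grouped data frame
  summarise(n = n(), mean_mpg = mean(mpg)) %>%
  ungroup()                                    # remove the grouping

labels <- tibble(cyl    = c(4, 6, 8),
                 engine = c("small", "medium", "large"))

left_join(per_cyl, labels, by = "cyl")  # keeps every row of per_cyl
```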
This document provides a cheat sheet on tidying data with the tidyr package in R. It defines tidy data as having each variable in its own column and each observation in its own row. It discusses tibbles as an enhanced data frame format and provides functions for constructing, subsetting, and printing tibbles. It also covers reshaping data through pivoting, handling missing values, expanding and completing data, and working with nested data through nesting, unnesting, and applying functions to list columns.
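Two of the cheat sheet's operations in miniature (table4a is tidyr's own example table; the column name "data" is arbitrary):

```r
library(tidyr)
library(dplyr)

# Pivot: move the year columns into a single variable
table4a %>%
  pivot_longer(-country, names_to = "year", values_to = "cases")

# Nest: one row per group, with the remaining columns as a list-column
iris %>% nest(data = -Species)
```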
Introduction to Pandas and Time Series Analysis [Budapest BI Forum] (Alexander Hendorf)
As a senior consultant at the German management consultancy Königsweg, Alexander guides enterprises and institutions through digitalisation and automation change processes.
Alexander has always loved data almost as much as music, so it is no wonder he is an organiser of local meetups and one of the 25 MongoDB Community Masters.
He loves to share this expertise and engages with the global community as an organiser and program chair of the EuroPython conference, and as a speaker and trainer at international conferences such as MongoDB World, EuroPython, CeBIT, and PyData.
This document discusses various data structures in R programming including vectors, matrices, arrays, data frames, lists, and factors. It provides examples of how to create each structure and access elements within them. Various methods for importing and exporting data in different file formats like Excel, CSV, and text files are also covered.
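One constructor per structure, as a quick reference:

```r
v  <- c(1, 2, 3)                                  # vector
m  <- matrix(1:6, nrow = 2)                       # matrix
a  <- array(1:24, dim = c(2, 3, 4))               # array
df <- data.frame(x = 1:3, y = c("a", "b", "c"))   # data frame
l  <- list(numbers = v, table = df)               # list
f  <- factor(c("low", "high", "low"))             # factor

df$y[2]        # element access by column name, then index
l[["table"]]   # list element by name
```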
Here are the R commands to create the requested graph from the MASS leuk dataset and save it as MASSleuk.jpeg:
```r
library(MASS)  # the leuk dataset ships with the MASS package
data(leuk)
windows()  # Windows-only; use x11() or quartz() on other platforms
par(mfrow = c(2, 2))  # 2 x 2 panel layout
plot(leuk$time, main = "Scatter plot of time", ylab = "time")
hist(leuk$time, main = "Histogram of time", xlab = "time")
boxplot(leuk$time, main = "Boxplot of time")
qqnorm(leuk$time); qqline(leuk$time)
dev.copy(jpeg, "MASSleuk.jpeg")  # jpeg device to match the .jpeg file name
dev.off()  # close the copied device so the file is written to disk
```
This will open a graphics window, draw the four plots in a 2 x 2 layout, and write the result to MASSleuk.jpeg once the device is closed.
Pandas is a powerful Python library for data analysis and manipulation. It provides rich data structures for working with structured and time series data easily. Pandas allows for data cleaning, analysis, modeling, and visualization. It builds on NumPy and provides data frames for working with tabular data similarly to R's data frames, as well as time series functionality and tools for plotting, merging, grouping, and handling missing data.
This document discusses various functions in R for exporting data, including print(), cat(), paste(), paste0(), sprintf(), writeLines(), write(), write.table(), write.csv(), and sink(). It provides descriptions, syntax, examples, and help documentation for each function. The functions can be used to print to the console, write to files, or save R objects. write.table() and write.csv() convert data to a data frame or matrix before writing to a text file or CSV. sink() diverts R output to a file instead of the console.
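The main export paths in miniature (the file names are arbitrary):

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

cat("rows:", nrow(df), "\n")                  # console output
writeLines(sprintf("rows: %d", nrow(df)))     # formatted lines

write.csv(df, "df.csv", row.names = FALSE)    # CSV file
write.table(df, "df.txt", sep = "\t")         # delimited text file

sink("summary.txt")                           # divert output to a file
summary(df)
sink()                                        # back to the console
```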
This document summarizes a brown-bag seminar on introducing R and data visualization. It covers loading and exploring data; basic data types and data frames; performing operations and asking questions of the data; subsetting, filtering, and handling duplicates and missing values; basic statistical analysis and graphs; and using the ggplot2 package to create bar plots, scatter plots, line graphs, box plots and more to visualize relationships in the data. Key points covered include using geom_bar with stat="bin", adding colors and facets, setting linetype and point shapes, and layering geoms to overlay points and lines on the same plot.
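A layered ggplot2 sketch touching several of the features mentioned (colours, point shapes, facets, and geoms layered on one plot); this is an illustrative example, not the seminar's own code:

```r
library(ggplot2)

ggplot(iris, aes(Sepal.Length, Sepal.Width, colour = Species)) +
  geom_point(shape = 1) +          # points layer
  geom_smooth(method = "lm") +     # a fitted-line layer on top
  facet_wrap(~ Species)            # one panel per species
```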
Learn to manipulate strings in R using the built in R functions. This tutorial is part of the Working With Data module of the R Programming Course offered by r-squared.
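A few of the built-in string functions such a tutorial typically covers:

```r
s <- "Working With Data"

nchar(s)                          # number of characters
toupper(s); tolower(s)            # case conversion
substr(s, 1, 7)                   # substring: "Working"
gsub("a", "@", s)                 # global substitution
strsplit(s, " ")[[1]]             # split into words
paste("R", "strings", sep = "-")  # concatenation: "R-strings"
```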
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI) (Serban Tanasa)
1) The document provides a quick guide to using data.table in R and Pentaho Data Integration (PDI) for fast data loading and manipulation. It discusses benchmarks showing data.table is 2-20x faster than traditional methods for reading, ordering, and transforming large data.
2) The outline discusses how to use basic data.table functions for speed gains and to overcome R's scaling limitations. It also provides a very brief overview of PDI's capabilities for Extract/Transform/Load (ETL) workflows without writing code.
3) The benchmarks section shows data.table is up to 500% faster than traditional R methods for reading large CSV files, and orders of magnitude faster for sorting and aggregating.
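The core data.table idioms behind those benchmarks, sketched on a built-in dataset (in practice the data would come from fread(); the file name below is hypothetical):

```r
library(data.table)

DT <- as.data.table(mtcars)   # real workloads: DT <- fread("big.csv")
setkey(DT, cyl)               # sort once, index by key

DT[, .(n = .N, mean_mpg = mean(mpg)), by = cyl]   # fast grouped aggregate
```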
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It consists of HDFS for distributed storage and MapReduce for distributed processing. HDFS stores large files across multiple machines and provides high throughput access to application data. MapReduce allows processing of large datasets in parallel by splitting the work into independent tasks called maps and reduces. Companies use Hadoop for applications like log analysis, data warehousing, machine learning, and scientific computing on large datasets.
This document provides an overview of the basics of R. It discusses why R is useful, outlines its interface and workspace, describes how to get help and install packages, and explains some key concepts like objects, functions, and the search path. The document is intended to introduce new users to commonly used R functions and features to get started with the programming language.
This document provides an overview of the basics of R including why R is used, tutorials and links for learning R, an overview of the R interface and workspace, and how to get help in R. It discusses that R is a free and open-source statistical programming language used for statistical analysis and graphics. It has a steep learning curve due to the interactive nature of analyzing data through chained commands rather than single procedures. Help is provided through a built-in system and various online tutorials.
Modeling in R Programming Language for Beginners.ppt (anshikagoel52)
This document provides an overview of machine learning in R. It discusses R's capabilities for statistical analysis and visualization. It describes key R concepts like objects, data structures, plots, and packages. It explains how to import and work with data, perform basic statistics and machine learning algorithms like linear models, naive Bayes, and decision trees. The document serves as an introduction for using R for machine learning tasks.
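As one concrete instance of the algorithms listed, a decision tree on iris; rpart is the usual package for this, though the document's own code may differ:

```r
library(rpart)

fit  <- rpart(Species ~ ., data = iris, method = "class")
pred <- predict(fit, iris, type = "class")
table(predicted = pred, actual = iris$Species)   # confusion matrix
```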
R is a free software environment for statistical computing and graphics. It provides statistical techniques and is highly extensible. Packages can be installed from repositories like CRAN and Bioconductor to expand R's functionality. Basic plots can be created using functions like plot() and hist(). Data can be imported from files and combined from multiple files into a single data frame.
This document provides an overview and introduction to using R. It discusses why R is useful, outlines the R interface and workspace, describes how to get help and install packages, and provides tips on resolving conflicting object names. The document is intended to help new users get started with the basic functionality of R.
R is a language and environment for statistical computing and graphics. It is based on S, an earlier language developed at Bell Labs. R features include being cross-platform, open source, having a package-based repository, strong graphics capabilities, and active user and developer communities. Useful URLs and books for learning R are provided. Instructions for installing R and RStudio on different platforms are given. R can be used for a wide range of statistical analyses and data visualization.
INFORMATIVE ESSAY: The purpose of the Informative Essay assignme.docx (carliotwaycave)
INFORMATIVE ESSAY
The purpose of the Informative Essay assignment is to choose a job or task that you know how to do and then write a minimum of 2 full pages, maximum of 3 full pages, Informative Essay teaching the reader how to do that job or task. You will follow the organization techniques explained in Unit 6.
Here are the details:
1. Read the Lecture Notes in Unit 6. You may also find the information in Chapter 10.5 in our text on Process Analysis helpful. The lecture notes will really be the most important to read in writing this assignment. However, here is a link to that chapter that you may look at in addition to the lecture notes:
https://open.lib.umn.edu/writingforsuccess/chapter/10-5-process-analysis/
2. Choose your topic, that is, the job or task you want to teach. As the notes explain, this should be a job or task that you already know how to do, and it should be something you can do well. At this point, think about your audience (reader). Will your reader need any knowledge or experience to do this job or task, or will you write these instructions for a general reader where no experience is required to perform the job?
3. Plan your outline to organize this essay. Unit 6 notes offer advice on this organization process. Be sure to include an introductory paragraph that has the four main points presented in the lecture notes.
4. Write the essay. It will need to be at least 2 FULL pages long, maximum of 3 full pages long. You will use the MLA formatting that you used in previous essays from Units 3, 4, and 5.
5. Be sure to include a title for your essay.
6. After writing the essay, be sure to take time to read it several times for revision and editing. It would be helpful to have at least one other person proofread it as well before submitting the assignment.
Quiz2
# comments start with #
# to quit q()
# two steps to install any library
#install.packages("rattle")
#library(rattle)
setwd("D:/AJITH/CUMBERLANDS/Ph.D/SEMESTER 3/Data Science & Big Data Analy (ITS-836-51)/RStudio/Week2")
getwd()
x <- 3 # x is a vector of length 1
print(x)
v1 <- c(2,4,6,8,10)
print(v1)
print(v1[3])
v <- c(1:10) # creates a vector of 10 elements, numbered 1 through 10
print(v)
print(v[6])
# Import test data
test<-read.csv("CVEs.csv")
test1<-read.csv("CVEs.csv", sep=",")
test2<-read.table("CVEs.csv", sep=",")
write.csv(test2, file="out.csv")
# Write CSV in R
write.table(test1, file = "out1.csv",row.names=TRUE, na="",col.names=TRUE, sep=",")
head(test)
tail(test)
summary(test)
head <- head(test)
tail <- tail(test)
cor(test$X, test$index)
sd(test$index)
var(test$index)
plot(test$index)
hist(test$index)
str(test$index)
quit()
Quiz3
setwd("C:/Users/ialsmadi/Desktop/University_of_Cumberlands/Lectures/Week2/RScripts")
getwd()
# Import test data
data<-read.csv("yearly_sales.csv")
#A 5-number summary is a set of 5 descriptive statistics for summarizing a continuous univariate data set.
#It consists of the minimum, lower quartile, median, upper quartile, and maximum.
This document provides an overview of a lecture on R. The lecture will cover what R is and why to use it, setting up R and RStudio, performing calculations and functions in R, handling files and creating plots and graphics, performing statistical analyses, and writing functions. It also discusses R's strengths like data management, statistics, graphics, and its active user community, as well as weaknesses like not being very user friendly initially and being slower than other programming languages.
R is an open-source statistical programming language that can be used for data analysis and visualization. The document provided an introduction to R including how to install R, create variables, import and assemble data, perform basic statistical analyses like t-tests and linear regression, and create plots and graphs. Key functions and concepts introduced included using c() to combine values into vectors, reading in data from CSV files, using lm() for linear regression, and the basic plot() function.
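The two analyses named above, run on built-in datasets:

```r
t.test(extra ~ group, data = sleep)   # two-sample t-test

fit <- lm(mpg ~ wt, data = mtcars)    # simple linear regression
summary(fit)
plot(mtcars$wt, mtcars$mpg)
abline(fit)                           # overlay the fitted line
```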
ComputeFest 2012: Intro To R for Physical Sciences (alexstorer)
This document provides an introduction to the R programming language presented by Alex Storer at ComputeFest 2012. It discusses why R should be used over other languages like MATLAB and Python, provides examples of basic R syntax and functions, and walks through an example of loading climate data and creating plots to visualize rainfall anomalies over time. The goal is to provide attendees with a foundation of R basics while working through a real data analysis problem.
STAT-522 (Data Analysis Using R) by SOUMIQUE AHAMED.pdf (SOUMIQUE AHAMED)
STAT-522 (Data Analysis Using R) by SOUMIQUE AHAMED, Division of Agronomy, Faculty of Agriculture - Wadura, Sher-e-Kashmir University of Agricultural Sciences and Technology of Kashmir.
This document provides an introduction to the statistical programming language R. It outlines what R is, how to access and use its interface, and how to work with basic data types like vectors, matrices, and factors. It also demonstrates how to import and export data, perform basic plotting and graphics, and gives examples working with biological data from Affymetrix chips. The presenter encourages attendees to ask questions and notes they are not a perfect teacher.
Rattle is Free (as in Libre) Open Source Software and the source code is available from the Bitbucket repository. We give you the freedom to review the code, use it for whatever purpose you like, and to extend it however you like, without restriction, except that if you then distribute your changes you also need to distribute your source code too.
Rattle - the R Analytical Tool To Learn Easily - is a popular GUI for data mining using R. It presents statistical and visual summaries of data, transforms data that can be readily modelled, builds both unsupervised and supervised models from the data, presents the performance of models graphically, and scores new datasets. One of the most important features (according to me) is that all of your interactions through the graphical user interface are captured as an R script that can be readily executed in R independently of the Rattle interface.
Rattle clocks between 10,000 and 20,000 installations per month from the RStudio CRAN node (one of over 100 nodes). Rattle has been downloaded several million times overall.
Apache Spark is a cluster computing framework that allows for fast, easy, and general processing of large datasets. It extends the MapReduce model to support iterative algorithms and interactive queries. Spark uses Resilient Distributed Datasets (RDDs), which allow data to be distributed across a cluster and cached in memory for faster processing. RDDs support transformations like map, filter, and reduce and actions like count and collect. This functional programming approach allows Spark to efficiently handle iterative algorithms and interactive data analysis.
R is a free implementation of the S programming language for statistical analysis and graphics. It allows for both interactive analysis and programming. The document discusses reading data into R from various sources, performing operations and computations on data frames, creating subsets of data, and producing graphics. Key points covered include using formulas for modeling, designing legible graphs, and exporting graphs to other formats like PDF.
R is a free programming language and software environment for statistical analysis and graphics. It contains functions for data manipulation, calculation, and graphical displays. Some key features of R include being free, running on multiple platforms, and having extensive statistical and graphical capabilities. Common object types in R include vectors, matrices, data frames, and lists. R also has packages that add additional functions.
This document discusses using the doSNOW package in R to perform parallel programming and speed up simulations. It explains how to register clusters, use foreach loops with .combine functions, and load necessary packages within loops. Testing with different numbers of clusters shows speedups over serial execution, with optimal speedups achieved when the number of clusters matches or exceeds the number of cores. Processing jobs in parallel reduces the elapsed time for each job.
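The basic pattern described above; the worker count is illustrative and should match your core count:

```r
library(doSNOW)

cl <- makeCluster(4)        # start 4 workers
registerDoSNOW(cl)          # register them with foreach

res <- foreach(i = 1:8, .combine = c) %dopar% {
  sqrt(i)                   # stand-in for one simulation job
}
stopCluster(cl)
res
```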
This document provides information on tools for research plotting in Python and R, discussing matplotlib and R graphics respectively. It provides examples of different plot types that can be created, such as line plots, scatter plots, and bar plots, and includes code samples. It also discusses installing and working with matplotlib and RStudio, and reading in data from files to create plots.
This document discusses text clustering and sentiment analysis using machine learning. It provides an overview of text clustering for topic modelling using techniques like vector space models and cosine similarity. It also discusses sentiment analysis using machine learning algorithms and provides examples of document clustering using k-means and sentiment analysis of Amazon movie reviews. Finally, it briefly introduces chatbots.
This document provides an outline and overview of training convolutional neural networks. It discusses update rules like stochastic gradient descent, momentum, and Adam. It also covers techniques like data augmentation, transfer learning, and monitoring the training process. The goal of training a CNN is to optimize its weights and parameters to correctly classify images from the training set by minimizing output error through backpropagation and updating weights.
This document provides an overview of structures in C programming. It defines structures as named collections of data items of different data types that are accessed by name. It discusses defining and declaring structures, initializing and accessing members, nested structures, arrays of structures, unions, and the typedef keyword as it relates to structures. The document aims to explain these fundamental concepts of structures in C to help readers better understand how to use structures in their own programs.
Templates and Exception Handling in C++Nimrita Koul
This document discusses templates and exception handling in C++. It provides an overview of templates, including why they are used for generic programming and how to define function and class templates. Exception handling in C++ uses try, catch, and throw blocks. The try block contains code that may throw exceptions, catch blocks handle specific exceptions, and throw explicitly throws an exception. The document contains examples of templates, exception handling, and derived class exceptions. It also discusses opportunities available at the School of CIT at Reva University.
The document provides an overview of the field of bioinformatics. It defines bioinformatics as using computational tools to analyze biological data to answer questions about molecular composition, structure, function, and evolution. Bioinformatics is interdisciplinary, drawing from fields like mathematics, computer science, statistics, and information science. It discusses some key areas of bioinformatics work like analyzing nucleic acid and protein sequences, predicting molecular structures and interactions, and constructing networks to model metabolic and gene regulatory pathways. It also lists some common databases and tools used in bioinformatics as well as examples of computational problems and applications in areas like disease diagnosis, drug discovery, and more.
This document provides an overview of linear regression analysis. It defines key terms like dependent and independent variables. It describes simple linear regression, which involves predicting a dependent variable based on a single independent variable. It covers techniques for linear regression including least squares estimation to calculate the slope and intercept of the regression line, the coefficient of determination (R2) to evaluate the model fit, and assumptions like independence and homoscedasticity of residuals. Hypothesis testing methods for the slope and correlation coefficient using the t-test and F-test are also summarized.
Deep learning uses multilayered neural networks to process information in a robust, generalizable, and scalable way. It has various applications including image recognition, sentiment analysis, machine translation, and more. Deep learning concepts include computational graphs, artificial neural networks, and optimization techniques like gradient descent. Prominent deep learning architectures include convolutional neural networks, recurrent neural networks, autoencoders, and generative adversarial networks.
Machine learning is a branch of artificial intelligence concerned with building systems that can learn from data. The document discusses various machine learning concepts including what machine learning is, related fields, the machine learning workflow, challenges, different types of machine learning algorithms like supervised learning, unsupervised learning and reinforcement learning, and popular Python libraries used for machine learning like Scikit-learn, Pandas and Matplotlib. It also provides examples of commonly used machine learning algorithms and datasets.
This document provides instructions for installing Python and an overview of key Python concepts. It begins with an outline of topics to be covered, including Python datatypes, flow control, functions, files, exceptions, and projects. Detailed step-by-step instructions are given for installing Python on Windows. Short summaries are then provided of Python's history and name, the Python prompt and IDE, comments, identifiers, operators, keywords, None, as, and datatypes. Examples and explanations are provided of Python's core datatypes including numbers, strings, lists, tuples, sets, and dictionaries.
Discover the latest insights on Data Driven Maintenance with our comprehensive webinar presentation. Learn about traditional maintenance challenges, the right approach to utilizing data, and the benefits of adopting a Data Driven Maintenance strategy. Explore real-world examples, industry best practices, and innovative solutions like FMECA and the D3M model. This presentation, led by expert Jules Oudmans, is essential for asset owners looking to optimize their maintenance processes and leverage digital technologies for improved efficiency and performance. Download now to stay ahead in the evolving maintenance landscape.
Applications of artificial Intelligence in Mechanical Engineering.pdfAtif Razi
Historically, mechanical engineering has relied heavily on human expertise and empirical methods to solve complex problems. With the introduction of computer-aided design (CAD) and finite element analysis (FEA), the field took its first steps towards digitization. These tools allowed engineers to simulate and analyze mechanical systems with greater accuracy and efficiency. However, the sheer volume of data generated by modern engineering systems and the increasing complexity of these systems have necessitated more advanced analytical tools, paving the way for AI.
AI offers the capability to process vast amounts of data, identify patterns, and make predictions with a level of speed and accuracy unattainable by traditional methods. This has profound implications for mechanical engineering, enabling more efficient design processes, predictive maintenance strategies, and optimized manufacturing operations. AI-driven tools can learn from historical data, adapt to new information, and continuously improve their performance, making them invaluable in tackling the multifaceted challenges of modern mechanical engineering.
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELijaia
As digital technology becomes more deeply embedded in power systems, protecting the communication
networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3)
represents a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data
Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities.
Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because
of the interconnection of these networks, which makes them vulnerable to a variety of cyberattacks. To
solve this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion
detection in smart grids. The proposed approach is a combination of the Convolutional Neural Network
(CNN) and the Long-Short-Term Memory algorithms (LSTM). We employed a recent intrusion detection
dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to
train and test our model. The results of our experiments show that our CNN-LSTM method is much better
at finding smart grid intrusions than other deep learning algorithms used for classification. In addition,
our proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection
accuracy rate of 99.50%.
Supermarket Management System Project Report.pdfKamal Acharya
Supermarket management is a stand-alone J2EE using Eclipse Juno program.
This project contains all the necessary required information about maintaining
the supermarket billing system.
The core idea of this project to minimize the paper work and centralize the
data. Here all the communication is taken in secure manner. That is, in this
application the information will be stored in client itself. For further security the
data base is stored in the back-end oracle and so no intruders can access it.
Mechatronics is a multidisciplinary field that refers to the skill sets needed in the contemporary, advanced automated manufacturing industry. At the intersection of mechanics, electronics, and computing, mechatronics specialists create simpler, smarter systems. Mechatronics is an essential foundation for the expected growth in automation and manufacturing.
Mechatronics deals with robotics, control systems, and electro-mechanical systems.
Generative AI Use cases applications solutions and implementation.pdfmahaffeycheryld
Generative AI solutions encompass a range of capabilities from content creation to complex problem-solving across industries. Implementing generative AI involves identifying specific business needs, developing tailored AI models using techniques like GANs and VAEs, and integrating these models into existing workflows. Data quality and continuous model refinement are crucial for effective implementation. Businesses must also consider ethical implications and ensure transparency in AI decision-making. Generative AI's implementation aims to enhance efficiency, creativity, and innovation by leveraging autonomous generation and sophisticated learning algorithms to meet diverse business challenges.
https://www.leewayhertz.com/generative-ai-use-cases-and-applications/
Null Bangalore | Pentesters Approach to AWS IAMDivyanshu
#Abstract:
- Learn more about the real-world methods for auditing AWS IAM (Identity and Access Management) as a pentester. So let us proceed with a brief discussion of IAM as well as some typical misconfigurations and their potential exploits in order to reinforce the understanding of IAM security best practices.
- Gain actionable insights into AWS IAM policies and roles, using hands on approach.
#Prerequisites:
- Basic understanding of AWS services and architecture
- Familiarity with cloud security concepts
- Experience using the AWS Management Console or AWS CLI.
- For hands on lab create account on [killercoda.com](https://killercoda.com/cloudsecurity-scenario/)
# Scenario Covered:
- Basics of IAM in AWS
- Implementing IAM Policies with Least Privilege to Manage S3 Bucket
- Objective: Create an S3 bucket with least privilege IAM policy and validate access.
- Steps:
- Create S3 bucket.
- Attach least privilege policy to IAM user.
- Validate access.
- Exploiting IAM PassRole Misconfiguration
-Allows a user to pass a specific IAM role to an AWS service (ec2), typically used for service access delegation. Then exploit PassRole Misconfiguration granting unauthorized access to sensitive resources.
- Objective: Demonstrate how a PassRole misconfiguration can grant unauthorized access.
- Steps:
- Allow user to pass IAM role to EC2.
- Exploit misconfiguration for unauthorized access.
- Access sensitive resources.
- Exploiting IAM AssumeRole Misconfiguration with Overly Permissive Role
- An overly permissive IAM role configuration can lead to privilege escalation by creating a role with administrative privileges and allow a user to assume this role.
- Objective: Show how overly permissive IAM roles can lead to privilege escalation.
- Steps:
- Create role with administrative privileges.
- Allow user to assume the role.
- Perform administrative actions.
- Differentiation between PassRole vs AssumeRole
Try at [killercoda.com](https://killercoda.com/cloudsecurity-scenario/)
Home security is of paramount importance in today's world, where we rely more on technology, home
security is crucial. Using technology to make homes safer and easier to control from anywhere is
important. Home security is important for the occupant’s safety. In this paper, we came up with a low cost,
AI based model home security system. The system has a user-friendly interface, allowing users to start
model training and face detection with simple keyboard commands. Our goal is to introduce an innovative
home security system using facial recognition technology. Unlike traditional systems, this system trains
and saves images of friends and family members. The system scans this folder to recognize familiar faces
and provides real-time monitoring. If an unfamiliar face is detected, it promptly sends an email alert,
ensuring a proactive response to potential security threats.
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...shadow0702a
This document serves as a comprehensive step-by-step guide on how to effectively use PyCharm for remote debugging of the Windows Subsystem for Linux (WSL) on a local Windows machine. It meticulously outlines several critical steps in the process, starting with the crucial task of enabling permissions, followed by the installation and configuration of WSL.
The guide then proceeds to explain how to set up the SSH service within the WSL environment, an integral part of the process. Alongside this, it also provides detailed instructions on how to modify the inbound rules of the Windows firewall to facilitate the process, ensuring that there are no connectivity issues that could potentially hinder the debugging process.
The document further emphasizes on the importance of checking the connection between the Windows and WSL environments, providing instructions on how to ensure that the connection is optimal and ready for remote debugging.
It also offers an in-depth guide on how to configure the WSL interpreter and files within the PyCharm environment. This is essential for ensuring that the debugging process is set up correctly and that the program can be run effectively within the WSL terminal.
Additionally, the document provides guidance on how to set up breakpoints for debugging, a fundamental aspect of the debugging process which allows the developer to stop the execution of their code at certain points and inspect their program at those stages.
Finally, the document concludes by providing a link to a reference blog. This blog offers additional information and guidance on configuring the remote Python interpreter in PyCharm, providing the reader with a well-rounded understanding of the process.
Accident detection system project report.pdf (by Kamal Acharya)
The rapid growth of technology and infrastructure has made our lives easier. The advent of technology has also increased traffic hazards; road accidents take place frequently and cause huge loss of life and property because of poor emergency facilities. Many lives could be saved if emergency services could get accident information and reach the scene in time. Our project provides an optimum solution to this drawback. A piezoelectric sensor can be used as a crash or rollover detector of the vehicle during and after a crash. With signals from the piezoelectric sensor, a severe accident can be recognized. In this project, when a vehicle meets with an accident or rolls over, the piezoelectric sensor immediately detects the signal. Then, with the help of a GSM module and a GPS module, the location is sent to the emergency contact. After confirming the location, the necessary action is taken. If the person meets with a small accident and there is no serious threat to anyone's life, the driver can cancel the alert message with a switch provided, in order to avoid wasting the valuable time of the medical rescue team.
Software Engineering and Project Management - Introduction, Modeling Concepts... (by Prakhyath Rai)
Introduction, Modeling Concepts and Class Modeling: What is object orientation? What is OO development? OO Themes; Evidence for usefulness of OO development; OO modeling history. Modeling as Design Technique: Modeling, Abstraction, The Three Models. Class Modeling: Object and Class Concepts, Link and Association Concepts, Generalization and Inheritance, A Sample Class Model, Navigation of Class Models, and UML Diagrams.
Building the Analysis Models: Requirement Analysis, Analysis Model Approaches, Data Modeling Concepts, Object-Oriented Analysis, Scenario-Based Modeling, Flow-Oriented Modeling, Class-Based Modeling, Creating a Behavioral Model.
1. Hands-on Data Science with R
Presented by: Nimrita Koul
Title: Assistant Professor
School: School of Computing & IT
Affiliation: REVA University, Bengaluru
2. Agenda
Pre Lunch Session: R Fundamentals
1.1 What is R? RStudio?
1.2 Installing R and RStudio
1.3 Data Types and Structures – vector, matrix, data frames, list
1.4 Control Flow
1.5 Data Import and Export
3. Agenda…
Post Lunch Session: Data Analysis with R
2.1 Working with iris data set
---2.1.1 Exploring individual variables of iris dataset
---2.1.2 Plotting Histogram, scatter plot, table, pie chart
---2.1.3 scatterplot3d
---2.1.4 contour
---2.1.5 persp
---2.1.6 Saving charts to files
4. Agenda
2.2 Classification
2.2.1 Decision Trees
2.3 Regression - Linear Regression
2.4 Clustering
2.4.1 K means clustering
2.5 Association Rule Mining on Titanic Dataset using the Apriori Algorithm
2.6 Sentiment Analysis of Twitter Text
5. Session 1
6. What is R?
R is a free software environment for statistical computing and
graphics.
R can be easily extended with 9,200+ packages available on CRAN (as
of Oct 2016).
http://www.r-project.org/
http://cran.r-project.org/
http://www.bioconductor.org/
http://cran.r-project.org/manuals.html
Comprehensive documentation on R is available at:
http://cran.r-project.org/doc/manuals/R-intro.pdf
14. What is RStudio?
RStudio is an integrated development environment (IDE) for R.
It runs on various operating systems, including Windows, Mac OS X and Linux.
An RStudio project has the following folders for default content types:
code: source code
data: raw data, cleaned data
figures: charts and graphs
docs: documents and reports
models: analytics models
15. Installing RStudio
Follow the steps below to install RStudio:
1. Go to https://www.rstudio.com/products/rstudio/download/
2. In the 'Installers for Supported Platforms' section, click the RStudio installer for your operating system. The download begins as soon as you click.
3. Run the downloaded installer and click Next...Next...Finish.
To start RStudio, click its desktop icon.
19. The RStudio window looks like this (screenshot of the RStudio IDE)
20. Data Types & Structures
Data types
Integer
Numeric
Character
Factor
Logical
Data structures
Vector
Matrix
Data frame
List
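A minimal sketch of these types and structures (the values are illustrative):
> i <- 5L                                  # integer
> n <- 3.14                                # numeric
> ch <- "iris"                             # character
> f <- factor(c("a", "b", "a"))            # factor with levels a and b
> lg <- TRUE                               # logical
> v <- c(1.5, 2.0, 3.5)                    # vector
> m <- matrix(1:6, nrow=2)                 # 2 x 3 matrix
> df <- data.frame(id=1:3, grp=f)          # data frame
> l <- list(vec=v, mat=m, frame=df)        # list holding mixed structures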
25. Control Flow…for loop
for (i in 1:5)
  print(1:i)   # prints 1; 1 2; 1 2 3; 1 2 3 4; 1 2 3 4 5
for (i in 1:5)
  print(2:i)   # note 2:1 counts down, so the first iteration prints 2 1
for (i in 1:5)
  print(3:i)
for (i in 1:5)
  print(4:i)
for (i in 1:5)
  print(5:i)   # 5:1 counts down, so the first iteration prints 5 4 3 2 1
26. Control Flow... if statement
x <- 1:15
y <- sample(x, 1)   # draw one value; if() needs a length-one condition
if (y <= 10) {
  print("y is less than or equal to 10")
} else {
  print("y is greater than 10")
}
# barplot of 1000 draws from {1, 2, 3} with probabilities 0.30, 0.60, 0.10
barplot(table(sample(1:3, size=1000, replace=TRUE, prob=c(.30, .60, .10))))
27. Data Import and Export
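The original slide shows screenshots; a minimal sketch of general import and export with read.table() and write.table() (the file names are placeholders):
> df <- read.table("mydata.txt", header=TRUE, sep="\t")         # import a tab-separated file
> write.table(df, "mydata_out.txt", sep="\t", row.names=FALSE)  # export a data frame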
28. Save and Load R Objects
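The original slide shows a screenshot; a minimal sketch of save() and load() (the object and file names are placeholders):
> a <- 1:10
> save(a, file="mydata.Rdata")   # save object(s) to a binary .Rdata file
> rm(a)                          # remove the object from the workspace
> load("mydata.Rdata")           # restore the saved object(s)
> print(a)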
29. Import from and Export to .csv Files
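The original slide shows a screenshot; a minimal sketch of CSV export and import (the file name is a placeholder):
> df1 <- data.frame(VarInt=1:5, VarReal=(1:5)/10, VarChar=letters[1:5])
> write.csv(df1, "dummyData.csv", row.names=FALSE)   # export to a .csv file
> df2 <- read.csv("dummyData.csv")                   # import it back
> print(df2)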
30. Session 2: Data Analysis with R
31. Working with the iris data set
We use the iris dataset to demonstrate data exploration with R.
The dimension and names of the data can be obtained with dim() and names() respectively. Functions str() and attributes() return the structure and attributes of the data.
> dim(iris)
> names(iris)
> str(iris)
> attributes(iris)
32. Structure of iris dataset
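The original slide shows a screenshot; the output of str(iris) looks like this:
> str(iris)
'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 ...
 $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...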
33. Working with the iris dataset
> iris[1:5]     # all 5 columns of the iris data set
> iris[1:2]     # first two columns
> iris[5]       # only the 5th column
> head(iris)    # first six rows
> tail(iris)    # last six rows
> iris[1:10, "Sepal.Length"]   # first 10 values of column Sepal.Length
> iris$Sepal.Length[1:10]      # same as above
Summarize individual variables with summary() and quantile():
> summary(iris)
> quantile(iris$Sepal.Length)
34. Visualization & Graphics
Histogram
> hist(iris$Sepal.Length)
Density Plot
> plot(density(iris$Sepal.Length))
Table
> table(iris$Species)
Pie Chart
> pie(table(iris$Species))
Box Plot
>boxplot(Sepal.Length~Species,data=iris)
35. Visualization and Graphics
Scatter plot
A scatter plot of two numeric variables can be drawn with plot() as below. Using function with(), we don't need to prefix variable names with "iris$". In the code below, the colors (col) and symbols (pch) of points are set by Species.
> with(iris, plot(Sepal.Length, Sepal.Width, col=Species,
+   pch=as.numeric(Species)))
When there are many points, some of them may overlap. We can use jitter() to add a small amount of noise to the data before plotting:
> plot(jitter(iris$Sepal.Length), jitter(iris$Sepal.Width))
36. 3D scatter plot and interactive 3D plots
A 3D scatter plot can be produced with package scatterplot3d. We first need to install the package and then load it:
> install.packages("scatterplot3d")
> library(scatterplot3d)
> scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
For an interactive 3D plot, use package rgl:
> library(rgl)
> plot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
37. Heatmap and contour plot
A heat map presents a 2D display of a data matrix and can be generated with heatmap() in R. We calculate the distance between flowers in the iris data with dist() and then plot it as a heat map:
> distMatrix <- as.matrix(dist(iris[, 1:4]))
> heatmap(distMatrix)
A contour plot can be drawn with filled.contour() from base graphics (package lattice also provides contourplot()):
> filled.contour(volcano, color=terrain.colors, asp=1,
+   plot.axes=contour(volcano, add=TRUE))
38. Advanced graphics: persp() and ggplot2
Another way to illustrate a numeric matrix is a 3D surface plot, generated with function persp():
> persp(volcano, theta=25, phi=30, expand=0.5, col="lightblue")
Package ggplot2 supports complex graphics, which are very useful for exploring data:
> library(ggplot2)
> qplot(Sepal.Length, Sepal.Width, data=iris, facets=Species ~ .)
39. Saving charts to pdf
> pdf("myplot.pdf")
> x <- 1:50
> plot(x, log(x))
> graphics.off()   # close the device; file myplot.pdf will contain the resulting plot
40. Covariance and
Correlation
> cov(iris$Sepal.Length, iris$Petal.Length)
> cov(iris[, 1:4])
> cor(iris$Sepal.Length, iris$Petal.Length)
> cor(iris[, 1:4])
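As a check on how the two measures relate, correlation is just covariance scaled by the standard deviations:
> cov(iris$Sepal.Length, iris$Petal.Length) /
+   (sd(iris$Sepal.Length) * sd(iris$Petal.Length))   # equals cor(iris$Sepal.Length, iris$Petal.Length)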
41. Classification with Decision Trees
We can build a decision tree for the iris data with function ctree() in package "party". Sepal.Length, Sepal.Width, Petal.Length and Petal.Width are used to predict the Species of flowers.
In the package, function ctree() builds a decision tree and predict() makes predictions for new data.
Before modeling, the iris data is split below into two subsets: training (70%) and test (30%). The random seed is set to a fixed value to make the results reproducible.
42. Decision Trees
> set.seed(1234)
> ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
> trainData <- iris[ind==1,]
> testData <- iris[ind==2,]
> library(party) #party package is required to create decision trees
> myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
> iris_ctree <- ctree(myFormula, data=trainData)
> # check the prediction
> table(predict(iris_ctree), trainData$Species)
> print(iris_ctree)
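The table above checks predictions on the training data only; a minimal follow-up sketch for evaluating the tree on the held-out test set (using predict() with newdata, as slide 41 describes):
> testPred <- predict(iris_ctree, newdata=testData)
> table(testPred, testData$Species)   # confusion matrix on the test set
> plot(iris_ctree)                    # draw the fitted tree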
43. Linear Regression
Regression is to build a function of independent variables (also known
as predictors) to predict a dependent variable (also called response).
For example, banks assess the risk of home-loan applicants based on
their age, income, expenses, occupation, number of dependents, total
credit limit, etc.
y = c0 + c1*x1 + c2*x2 + ... + ck*xk,
where x1, x2, ..., xk are predictors and y is the response to predict.
We demonstrate linear regression using function lm() on Australian Consumer Price Index (CPI) data, quarterly from 2008 to 2010.
The data is created and plotted; the x axis is added manually with function axis(), and las=3 makes the tick labels vertical.
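The data-creation slide is not included in this extract; a sketch of the setup the prose describes (the CPI values below are illustrative placeholders, not official figures):
> year <- rep(2008:2010, each=4)
> quarter <- rep(1:4, 3)
> cpi <- c(162.2, 164.6, 166.5, 166.0, 166.2, 167.0,
+   168.6, 169.5, 171.0, 172.1, 173.3, 174.0)
> plot(cpi, xaxt="n", ylab="CPI", xlab="")                        # suppress the default x axis
> axis(1, at=1:12, labels=paste(year, quarter, sep="Q"), las=3)   # vertical labels like 2008Q1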
45. Checking correlation and plotting in 3D
> cor(quarter, cpi)
> fit <- lm(cpi ~ year + quarter)
> fit
> plot(fit)
> attributes(fit)
> library(scatterplot3d)
> s3d <- scatterplot3d(year, quarter, cpi, highlight.3d=TRUE, type="h", lab=c(2,3))
> s3d$plane3d(fit)
46. Linear Regression
With the above linear model, CPI is calculated as
cpi = c0 + c1*year + c2*quarter,
where c0, c1 and c2 are the coefficients from the fitted model.
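A minimal sketch of extracting those coefficients from the fitted model and computing the fitted values manually:
> cefs <- coef(fit)    # c0 (intercept), c1 (year), c2 (quarter)
> cpi.hat <- cefs[1] + cefs[2]*year + cefs[3]*quarter
> residuals(fit)       # differences between observed and fitted values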
47. Clustering: k-means
We remove Species from the data to cluster, then apply function kmeans() to iris2 and store the clustering result in kmeans.result. The number of clusters is set to 3 in the code below.
> iris2 <- iris
> iris2$Species <- NULL
> (kmeans.result <- kmeans(iris2, 3))
> table(iris$Species, kmeans.result$cluster)
The result above shows that cluster "setosa" can be easily separated from the other clusters, and that clusters "versicolor" and "virginica" overlap with each other to a small degree.
48. Plotting clusters
> plot(iris2[c("Sepal.Length", "Sepal.Width")], col=kmeans.result$cluster)
> # plot cluster centers
> points(kmeans.result$centers[, c("Sepal.Length", "Sepal.Width")], col=1:3, pch=8, cex=2)
49. Association Rule Mining on the Titanic dataset
Attributes: Class (passenger class), Sex, Age (child/adult), Survived
> str(Titanic)
table [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
- attr(*, "dimnames")=List of 4
..$ Class : chr [1:4] "1st" "2nd" "3rd" "Crew"
..$ Sex : chr [1:2] "Male" "Female"
..$ Age : chr [1:2] "Child" "Adult"
..$ Survived: chr [1:2] "No" "Yes"
> df <- as.data.frame(Titanic)
> head(df)
50. Titanic Dataset
Class Sex Age Survived Freq
1 1st Male Child No 0
2 2nd Male Child No 0
3 3rd Male Child No 35
4 Crew Male Child No 0
5 1st Female Child No 0
6 2nd Female Child No 0
51. Titanic Raw
> # expand the contingency table into raw data: each row of df is repeated Freq times
> titanic.raw <- NULL
> for (i in 1:4) {
+   titanic.raw <- cbind(titanic.raw, rep(as.character(df[, i]), df$Freq))
+ }
> titanic.raw <- as.data.frame(titanic.raw)
> names(titanic.raw) <- names(df)[1:4]
> dim(titanic.raw)
> str(titanic.raw)
52. Titanic Raw
> head(titanic.raw)
> summary(titanic.raw)
> library(arules)
> # find association rules with default settings
> rules.all <- apriori(titanic.raw)
> rules.all
> inspect(rules.all)
53. Apriori Algorithm
> # minlen=2 requires at least two items per rule; supp and conf set minimum support and confidence
> # appearance restricts the rhs to the survival outcomes; anything may appear on the lhs
> rules <- apriori(titanic.raw, control=list(verbose=FALSE),
+   parameter=list(minlen=2, supp=0.005, conf=0.8),
+   appearance=list(rhs=c("Survived=No", "Survived=Yes"), default="lhs"))
> quality(rules) <- round(quality(rules), digits=3)
> rules.sorted <- sort(rules, by="lift")
> inspect(rules.sorted)
54. Removing redundant rules
> # rules.sorted is ordered by lift; a rule is redundant if an earlier rule is a subset of it
> subset.matrix <- is.subset(rules.sorted, rules.sorted)
> subset.matrix[lower.tri(subset.matrix, diag=TRUE)] <- NA
> redundant <- colSums(subset.matrix, na.rm=TRUE) >= 1
> which(redundant)
> # remove redundant rules
> rules.pruned <- rules.sorted[!redundant]
> inspect(rules.pruned)
55. Visualizing Association Rules
> library(arulesViz)
> plot(rules.all)
> plot(rules.all, method="grouped")
> plot(rules.all, method="graph")
> plot(rules.all, method="graph", control=list(type="items"))