Abhik Seal
Indiana University School of Informatics and Computing
Get/set your working directory
A basic component of working with data is knowing your working directory
The two main commands are getwd() and setwd().
Be aware of relative versus absolute paths
Important difference on Windows: backslashes must be doubled - setwd("C:\\Users\\datasc\\Downloads")
Relative - setwd("./data"), setwd("../")
Absolute - setwd("/Users/datasc/data/")
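A minimal sketch of moving between directories with relative and absolute paths. The scratch directory and the names original, scratch and inData are made up for this illustration; the demo restores the starting directory at the end.

```r
# Remember where we started so the demo leaves no side effects.
original <- getwd()

# Hypothetical scratch area containing a "data" subdirectory.
scratch <- tempfile("wd-demo-")
dir.create(file.path(scratch, "data"), recursive = TRUE)

setwd(scratch)               # absolute path
setwd("./data")              # relative path: into the subdirectory
inData <- basename(getwd())  # name of the directory we are now in
setwd("../")                 # relative path: back up one level

setwd(original)              # restore the starting directory
```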
Checking for and creating directories
file.exists("directoryName") will check to see if the directory exists
dir.create("directoryName") will create a directory if it doesn't exist
Here is an example checking for a "data" directory and creating it if it doesn't exist
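The check-then-create pattern can be sketched as follows. The wrapper ensure_dir is a hypothetical helper added for illustration; only file.exists() and dir.create() come from the text above.

```r
# Hypothetical helper: create a directory only when it is missing.
ensure_dir <- function(path) {
  if (!file.exists(path)) {
    dir.create(path)
  }
  invisible(file.exists(path))
}

# Check for a "data" directory and create it if it doesn't exist.
ensure_dir("data")
```

Calling ensure_dir("data") a second time does nothing, because the directory already exists.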
Reading data files
We will look at each of these methods:
From Internet
Reading local files
Reading Excel Files
Reading XML
Reading JSON
Reading MySQL
Reading HDF5
Reading from other resources
Getting data from Internet
Data from
Use of download.file()
Useful for downloading tab-delimited, csv, and other files
fileUrl <- ""
## [1] "NAMCS.csv"
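A hedged sketch of the download step. The real URL is not available here, so the example writes a small stand-in CSV (made-up rows shaped like the NAMCS output) to a temporary file and addresses it with a file:// URL, which is assumed to work as shown on Unix-like systems; with a real source you would put the http(s) address in fileUrl instead.

```r
# Stand-in source: a tiny CSV in a temp file, addressed by a file:// URL.
src <- tempfile(fileext = ".csv")
writeLines(c("Region,Period,Adoption",
             "Alabama,2013,0.48",
             "Alaska,2013,0.50"), src)
fileUrl <- paste0("file://", src)

# Create the target directory if needed, then download into it.
if (!file.exists("data")) dir.create("data")
download.file(fileUrl, destfile = "./data/NAMCS.csv", quiet = TRUE)

# Record when the file was downloaded.
dateDownloaded <- date()
list.files("./data")
```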
Getting data from Internet
Reading the data using read.csv()
## Region Period Adoption.of.Basic.EHRs..Overall.Physician.Practices
## 1 Alabama 2013 0.48
## 2 Alaska 2013 0.50
## Adoption.of.Basic.EHRs..Primary.Care.Providers
## 1 0.50
## 2 0.52
## Adoption.of.Basic.EHRs..Rural.Providers
## 1 0.54
## 2 0.37
## Adoption.of.Basic.EHRs..Small.Practices
## 1 0.40
## 2 0.39
## 1 0.74
## 2 0.75
Some notes about download.file()
If the url starts with http you can use download.file()
If the url starts with https on Windows you may be ok
If the url starts with https on Mac you may need to set method="curl"
If the file is big, this might take a while
Be sure to record when you downloaded.
Loading flat files - read.table()
This is the main function for reading data into R
Flexible and robust but requires more parameters
Reads the data into RAM - big data can cause problems
Important parameters: file, header, sep, row.names, nrows
Related: read.csv(), read.csv2()
Both read.table() and read.fwf() use scan to read the file, and then process the results of scan.
They are very convenient, but sometimes it is better to use scan directly
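As a small illustration of using scan directly, the sketch below reads a whitespace-separated numeric file into a vector and reshapes it. The file contents and the names numPath, values and m are invented for the example.

```r
# Made-up numeric file: two rows of three whitespace-separated values.
numPath <- tempfile(fileext = ".txt")
writeLines(c("1 2 3", "4 5 6"), numPath)

# scan() returns a plain numeric vector, with no data frame overhead.
values <- scan(numPath, quiet = TRUE)

# Reshape into a matrix with one row per line of the file.
m <- matrix(values, ncol = 3, byrow = TRUE)
```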
Example data
fileUrl <- ""
## [1] "NAMCS.csv"
Data <- read.table("./data/NAMCS.csv")
## Error: line 2 did not have 87 elements
## Error: object 'Data' not found
Example parameters
read.csv sets sep="," and header=TRUE, so
cameraData <- read.table("./data/NAMCS.csv",sep=",",header=TRUE)
gives the same result here as
cameraData <- read.csv("./data/NAMCS.csv")
## Region Period Adoption.of.Basic.EHRs..Overall.Physician.Practices
## 1 Alabama 2013 0.48
## 2 Alaska 2013 0.50
## 3 Arizona 2013 0.51
## 4 Arkansas 2013 0.46
## 5 California 2013 0.54
## 6 Colorado 2013 0.39
## Adoption.of.Basic.EHRs..Primary.Care.Providers
## 1 0.50
## 2 0.52
## 3 0.63
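The equivalence of the two calls can be checked on a small file. This is a sketch on made-up rows written to a temporary file; csvPath, viaTable and viaCsv are illustrative names.

```r
# Made-up comma-separated file with a header row.
csvPath <- tempfile(fileext = ".csv")
writeLines(c("Region,Period,Adoption",
             "Alabama,2013,0.48",
             "Alaska,2013,0.50"), csvPath)

# read.table needs sep and header spelled out; read.csv sets them by default.
viaTable <- read.table(csvPath, sep = ",", header = TRUE)
viaCsv   <- read.csv(csvPath)

# Compare the two data frames.
all.equal(viaTable, viaCsv)
```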
Some more important parameters
A common source of trouble when reading flat files is quotation marks (' or ") inside data values;
setting quote="" often resolves this.
quote - tells R which characters mark quoted values; quote="" means no quoting.
na.strings - set the character that represents a missing value.
nrows - how many rows to read of the file (e.g. nrows=10 reads 10 lines).
skip - number of lines to skip before starting to read
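A short sketch showing these parameters together. The file contents, the NA_VALUE marker and the variable names are invented for illustration; the apostrophes in the names are the kind of stray quote that quote="" is meant to handle.

```r
# Made-up file: one junk line, a header, then rows containing apostrophes.
txtPath <- tempfile(fileext = ".txt")
writeLines(c("exported by a hypothetical tool",
             "name,score",
             "O'Brien,10",
             "D'Angelo,NA_VALUE",
             "Smith,7"), txtPath)

dat <- read.table(txtPath, sep = ",", header = TRUE,
                  skip = 1,                 # skip the junk first line
                  quote = "",               # treat ' and " as ordinary characters
                  na.strings = "NA_VALUE",  # marker used for missing values
                  nrows = 2)                # read only the first two data rows
dat
```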
read.xlsx(), read.xlsx2() {xlsx package}
library(xlsx)
Data <- read.xlsx("./data/ADME_genes.xlsx",sheetIndex=1,header=TRUE)
Reading specific rows and columns
colIndex <- 2:3
rowIndex <- 1:4
dataSub <- read.xlsx("./data/ADME_genes.xlsx",sheetIndex=1,
                     colIndex=colIndex,rowIndex=rowIndex)
Further notes
The write.xlsx function will write out an Excel file with similar arguments.
read.xlsx2 is much faster than read.xlsx, but may be slightly unstable when reading subsets of rows.
XLConnect is a Java-based solution, so it is cross-platform and returns satisfactory results.
For large data sets it may be very slow.
xlsReadWrite is very fast, but it doesn't support .xlsx files.
The gdata package provides a good cross-platform solution. It is available for Windows, Mac and
Linux. gdata requires you to install additional Perl libraries. Perl is usually already installed on
Linux and Mac, but can require more effort on Windows.
In general it is advised to store your data in either a database or in comma separated files (.csv)
or tab separated files (.tab/.txt), as they are easier to distribute.
A user-written function for importing xlsx files can be found on the web. It should work on all platforms
and uses the XML package:
xlsxToR("myfile.xlsx", header = TRUE)
Working with XML
Extensible markup language
Frequently used to store structured data
Particularly widely used in internet applications
Extracting XML is the basis for most web scraping
Markup - labels that give the text structure
Content - the actual text of the document
Read the file into R
library(XML)
fileUrl <- ""
doc <- xmlTreeParse(fileUrl,useInternal=TRUE)
rootNode <- xmlRoot(doc)
xmlName(rootNode)
## [1] "breakfast_menu"
names(rootNode)
## food food food food food
## "food" "food" "food" "food" "food"
Directly access parts of the XML document
## <food>
## <name>Belgian Waffles</name>
## <price>$5.95</price>
## <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
## <calories>650</calories>
## </food>
## <name>Belgian Waffles</name>
Go for a tour of XML package
Official XML tutorials short, long
An outstanding guide to the XML package
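The XML steps above can be tried without a network connection by parsing a string. This is a sketch under assumptions: it needs the XML package installed, the first food entry mirrors the output shown earlier, and the second entry is stand-in data invented for the example.

```r
library(XML)

# Inline stand-in for the menu file; the second <food> entry is made up.
menuXml <- '<breakfast_menu>
  <food><name>Belgian Waffles</name><price>$5.95</price><calories>650</calories></food>
  <food><name>French Toast</name><price>$4.50</price><calories>600</calories></food>
</breakfast_menu>'

# Parse from text instead of a URL, then walk the tree.
doc <- xmlParse(menuXml, asText = TRUE)
rootNode <- xmlRoot(doc)

xmlName(rootNode)    # name of the root element
rootNode[[1]]        # first <food> element
rootNode[[1]][[1]]   # first child of the first <food>: its <name>

# XPath: pull the text of every <name> node in the document.
foodNames <- xpathSApply(rootNode, "//name", xmlValue)
```

Indexing with [[ ]] follows the tree position by position, while the XPath query selects nodes by name wherever they occur.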
JavaScript Object Notation (JSON)
Lightweight data storage
Common format for data from application programming interfaces (APIs)
Similar structure to XML but different syntax/format
Data stored as
Numbers (double)
Strings (double quoted)
Boolean (true or false)
Array (ordered, comma separated, enclosed in square brackets [])
Object (unordered, comma separated collection of key:value pairs in curly brackets {})
Example JSON file
Reading data from JSON {jsonlite package}
# Using chembl api
jsonData <- fromJSON("")
## [1] "compound"
## [1] "CHEMBL1"
Writing data frames to JSON
myjson <- toJSON(iris, pretty=TRUE)
## [
## {
## "Sepal.Length" : 5.1,
## "Sepal.Width" : 3.5,
## "Petal.Length" : 1.4,
## "Petal.Width" : 0.2,
## "Species" : "setosa"
## },
## {
## "Sepal.Length" : 4.9,
## "Sepal.Width" : 3,
## "Petal.Length" : 1.4,
## "Petal.Width" : 0.2,
## "Species" : "setosa"
## },
## {
## "Sepal.Length" : 4.7,
Further resources
A good tutorial on jsonlite -
jsonlite vignette
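The JSON reading and writing steps can be sketched offline as a round trip, assuming the jsonlite package is installed. The JSON string below is a made-up stand-in shaped like the API response shown earlier; the field preferredName and its value are hypothetical.

```r
library(jsonlite)

# Made-up stand-in for an API response; "preferredName" is hypothetical.
jsonText <- '{"compound": {"chemblId": "CHEMBL1", "preferredName": "example"}}'

# fromJSON accepts a JSON string as well as a URL or file path.
jsonData <- fromJSON(jsonText)
names(jsonData)
jsonData$compound$chemblId

# Write a data frame to JSON, then read it back into a data frame.
irisJson <- toJSON(head(iris, 3), pretty = TRUE)
irisBack <- fromJSON(irisJson)
```

Note that factor columns such as Species come back as character vectors after the round trip.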
MySQL is free and widely used open source database software
Widely used in internet based applications
Data are structured in
Databases
Tables within databases
Fields within tables
Each row is called a record
Step 2 - Install RMySQL Connector
On a Mac: install.packages("RMySQL")
On Windows:
Official instructions - (may be useful for
Mac/UNIX users as well)
Potentially useful guide -
Connecting and listing databases
ucscDb <- dbConnect(MySQL(),user="genome",
result <- dbGetQuery(ucscDb,"show databases;"); dbDisconnect(ucscDb);
## [1] TRUE
## Database
## 1 information_schema
## 2 ailMel1
## 3 allMis1
## 4 anoCar1
## 5 anoCar2
## 6 anoGam1
Connecting to hg19 and listing tables
hg19 <- dbConnect(MySQL(),user="genome", db="hg19",
allTables <- dbListTables(hg19)
## [1] 11006
## [1] "HInv" "HInvGeneMrna" "acembly" "acemblyClass"
## [5] "acemblyPep"
Get dimensions of a specific table
## [1] "bin" "matches" "misMatches" "repMatches" "nCount"
## [6] "qNumInsert" "qBaseInsert" "tNumInsert" "tBaseInsert" "strand"
## [11] "qName" "qSize" "qStart" "qEnd" "tName"
## [16] "tSize" "tStart" "tEnd" "blockCount" "blockSizes"
## [21] "qStarts" "tStarts"
dbGetQuery(hg19, "select count(*) from affyU133Plus2")
## count(*)
## 1 58463
Read from the table
affyData <- dbReadTable(hg19, "affyU133Plus2")
## bin matches misMatches repMatches nCount qNumInsert qBaseInsert
## 1 585 530 4 0 23 3 41
## 2 585 3355 17 0 109 9 67
## 3 585 4156 14 0 83 16 18
## 4 585 4667 9 0 68 21 42
## 5 585 5180 14 0 167 10 38
## 6 585 468 5 0 14 0 0
## tNumInsert tBaseInsert strand qName qSize qStart qEnd tName
## 1 3 898 - 225995_x_at 637 5 603 chr1
## 2 9 11621 - 225035_x_at 3635 0 3548 chr1
## 3 2 93 - 226340_x_at 4318 3 4274 chr1
## 4 3 5743 - 1557034_s_at 4834 48 4834 chr1
## 5 1 29 - 231811_at 5399 0 5399 chr1
## 6 0 0 - 236841_at 487 0 487 chr1
## tSize tStart tEnd blockCount
## 1 249250621 14361 15816 5
## 2 249250621 14381 29483 17
Select a specific subset
query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")
affyMis <- fetch(query); quantile(affyMis$misMatches)
## 0% 25% 50% 75% 100%
## 1 1 2 2 3
affyMisSmall <- fetch(query,n=10); dbClearResult(query);
## [1] TRUE
## [1] 10 22
# close connection
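The connect, query, fetch, clear and disconnect cycle can be rehearsed without a MySQL server. The sketch below swaps in an in-memory SQLite database through the DBI and RSQLite packages (both assumed installed); it is not the UCSC server, and the table affy with its 20 made-up rows is purely illustrative. The same DBI calls apply to an RMySQL connection.

```r
library(DBI)

# Swapped-in backend: in-memory SQLite instead of a remote MySQL server.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Made-up table loosely shaped like the probe data above.
probes <- data.frame(qName = paste0("probe", 1:20),
                     misMatches = rep(0:4, times = 4))
dbWriteTable(con, "affy", probes)

dbListTables(con)                             # tables in the database
dbListFields(con, "affy")                     # column names of one table
dbGetQuery(con, "select count(*) from affy")  # row count

# Send a query, fetch only the first few rows, then release the result set.
query <- dbSendQuery(con, "select * from affy where misMatches between 1 and 3")
firstRows <- dbFetch(query, n = 5)
dbClearResult(query)

# Always close the connection when finished.
dbDisconnect(con)
```

Fetching with n limits how much is pulled into memory, which matters for large remote tables; clearing the result and disconnecting frees resources on the server.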
Further resources
RMySQL vignette
R data import and export
Set up R odbc with postgres
A nice blog post summarizing some other commands
HDF5 is used for storing large data sets
Supports storing a range of data types
Hierarchical data format
Groups contain zero or more data sets and metadata
Groups have a group header with group name and list of attributes
Groups have a group symbol table with a list of objects in the group
Datasets are multidimensional arrays of data elements with metadata
Datasets have a header with name, datatype, dataspace, and storage layout
Datasets have a data array with the data
R HDF5 package
The rhdf5 package works really well, although it is not on CRAN; it is installed from Bioconductor:
## Bioconductor version 2.13 (BiocInstaller 1.12), ?biocLite for help
## A newer version of Bioconductor is available after installing a new
## version of R, ?BiocUpgrade for help
## BioC_mirror:
## Using Bioconductor version 2.13 (BiocInstaller 1.12.1), R version 3.0.3.
## Installing package(s) 'rhdf5'
## The downloaded binary packages are in
## /var/folders/pm/jg6blwt55b71g8jl64wfw8ch0000gn/T//RtmpuYnNzs/downloaded_packages
Creating an HDF5 file and group hierarchy
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
hdf5 continued
Saving multiple objects to an HDF5 file
## group name otype dclass dim
## 0 / baa H5I_GROUP
## 1 / foo H5I_GROUP
## 2 /foo foobaa H5I_GROUP
A = 1:7; B = 1:18; D = seq(0,1,by=0.1)
h5save(A, B, D, file="newfile2.h5")
## $A
## [1] 1 2 3 4 5 6 7
## $B
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
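A compact sketch of the HDF5 workflow, assuming the rhdf5 package from Bioconductor is installed. The group names follow the listing above; the temporary file, the dataset name foo/A and the variable names are made up for the example.

```r
library(rhdf5)

# Create an empty HDF5 file in a temporary location.
h5File <- tempfile(fileext = ".h5")
h5createFile(h5File)

# Build the group hierarchy: two top-level groups and one nested group.
h5createGroup(h5File, "foo")
h5createGroup(h5File, "baa")
h5createGroup(h5File, "foo/foobaa")

# Write a vector into a group, list the file contents, and read it back.
A <- 1:7
h5write(A, h5File, "foo/A")
h5ls(h5File)
readA <- h5read(h5File, "foo/A")
```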
Reading from other resources
foreign package
Loads data from Minitab, S, SAS, SPSS, Stata, Systat
Basic functions
read.arff (Weka)
read.dta (Stata)
read.mtp (Minitab)
read.octave (Octave)
read.spss (SPSS)
read.xport (SAS)
readline() reads a line from the console (base R, not part of foreign)
See the help page for more details
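As one illustration of the foreign package, the sketch below writes a small data frame to Stata format and reads it back with read.dta. The temporary file and the use of the built-in mtcars data are choices made for the example, not part of the slides.

```r
library(foreign)

# Write the first rows of a built-in data set to a Stata .dta file.
dtaPath <- tempfile(fileext = ".dta")
write.dta(head(mtcars), dtaPath)

# Read the Stata file back into a data frame.
back <- read.dta(dtaPath)
str(back)
```

The other readers in the list follow the same pattern: pass the file path and receive a data frame or list.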
Reading images
jpeg
readbitmap
png
EBImage (Bioconductor)
Reading GIS data
rgdal
rgeos
raster
Reading music data
tuneR
seewave
Jeff Leek University of Washington and Coursera Getting and Cleaning data
R For Natural Resources Course
R Data import comprehensive guide

Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...

Data handling in R

  • 2. Get/set your working directory
    A basic component of working with data is knowing your working directory.
    - The two main commands are getwd() and setwd().
    - Be aware of relative versus absolute paths:
        Relative - setwd("./data"), setwd("../")
        Absolute - setwd("/Users/datasc/data/")
    - Important difference on Windows: backslashes in paths must be escaped, e.g. setwd("C:\\Users\\datasc\\Downloads")
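A minimal sketch of the relative/absolute distinction above, assuming a scratch "data" folder inside R's temporary directory (the folder and the variable names are hypothetical, not from the slides):

```r
# Remember the starting directory so it can be restored afterwards.
old_wd <- getwd()

# Hypothetical scratch location: a "data" folder inside R's temporary directory.
scratch <- file.path(tempdir(), "data")
dir.create(scratch, showWarnings = FALSE)

# Absolute path: valid regardless of the current directory.
setwd(scratch)
in_data <- basename(getwd())

# Relative path: ".." moves one level up from the current directory.
setwd("..")
moved_up <- normalizePath(getwd()) == normalizePath(tempdir())

# On Windows, escape backslashes ("C:\\Users\\...") or use forward slashes instead.
win_path <- "C:\\Users\\datasc\\Downloads"
cat(win_path, "\n")   # prints the path with single backslashes

setwd(old_wd)         # restore the original working directory
```

Restoring the original directory at the end keeps the rest of a script unaffected by the temporary change.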
  • 3. Checking for and creating directories
    - file.exists("directoryName") checks whether the directory exists.
    - dir.create("directoryName") creates the directory if it doesn't exist.
    - Here is an example checking for a "data" directory and creating it if it doesn't exist:
        if (!file.exists("data")) {
          dir.create("data")
        }
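As a possible variation on the check above, base R (3.2.0 and later) also provides dir.exists(), which is TRUE only for directories. The sketch below assumes a hypothetical "example_data" folder under the temporary directory:

```r
# Hypothetical target directory inside the session's temporary folder.
data_dir <- file.path(tempdir(), "example_data")

# dir.exists() is TRUE only for directories, whereas file.exists()
# is also TRUE when a regular file of that name exists.
if (!dir.exists(data_dir)) {
  dir.create(data_dir, recursive = TRUE)
}
created <- dir.exists(data_dir)
```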
  • 4. Reading data files
    We will look at each of these methods:
    - From the Internet
    - Reading local files
    - Reading Excel files
    - Reading XML
    - Reading JSON
    - Reading MySQL
    - Reading HDF5
    - Reading from other resources
  • 5. Getting data from Internet
    - Data from
    - Use of download.file()
    - Useful for downloading tab-delimited, csv, and other files
        fileUrl <- ""
        download.file(fileUrl, destfile="./data/NAMCS.csv", method="curl")
        list.files("./data")
        ## [1] "NAMCS.csv"
  • 6. Getting data from Internet
    Reading the data using read.csv()
        data <- read.csv("")
        head(data, 2)
        ##    Region Period Adoption.of.Basic.EHRs..Overall.Physician.Practices
        ## 1 Alabama   2013                                                0.48
        ## 2  Alaska   2013                                                0.50
        ##   Adoption.of.Basic.EHRs..Primary.Care.Providers
        ## 1                                           0.50
        ## 2                                           0.52
        ##   Adoption.of.Basic.EHRs..Rural.Providers
        ## 1                                    0.54
        ## 2                                    0.37
        ##   Adoption.of.Basic.EHRs..Small.Practices
        ## 1                                    0.40
        ## 2                                    0.39
        ##
        ## 1 0.74
        ## 2 0.75
  • 7. Some notes about download.file()
    - If the URL starts with http, you can use download.file().
    - If the URL starts with https on Windows, you may be OK.
    - If the URL starts with https on Mac, you may need to set method="curl".
    - If the file is big, this might take a while.
    - Be sure to record when you downloaded.
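The advice to record the download time can be sketched as follows. To keep the example runnable offline, a locally written CSV addressed through a file:// URL stands in for a remote file; the file contents and variable names are assumptions, not the NAMCS data from the slides:

```r
# Stand-in for a remote file: write a small CSV locally and address it with a file:// URL.
src <- tempfile(fileext = ".csv")
writeLines(c("Region,Period", "Alabama,2013", "Alaska,2013"), src)
fileUrl <- paste0("file://", normalizePath(src, winslash = "/"))

dest <- tempfile(fileext = ".csv")
# For a real https URL on an older setup, method = "curl" or "libcurl" may be needed.
status <- download.file(fileUrl, destfile = dest, quiet = TRUE)

# Record when the file was downloaded, since online data can change over time.
dateDownloaded <- date()

downloaded <- read.csv(dest)
```

download.file() returns 0 (invisibly) on success, so checking status before reading the file is a cheap safeguard.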
  • 8. Loading flat files - read.table()
    - This is the main function for reading data into R.
    - Flexible and robust, but requires more parameters.
    - Reads the data into RAM, so big data can cause problems.
    - Important parameters: file, header, sep, row.names, nrows
    - Related: read.csv(), read.csv2()
    - Both read.table() and read.fwf() use scan() to read the file and then process the results of scan(). They are very convenient, but sometimes it is better to use scan() directly.
  • 9. Example data
        fileUrl <- ""
        download.file(fileUrl, destfile="./data/NAMCS.csv", method="curl")
        list.files("./data")
        ## [1] "NAMCS.csv"
        Data <- read.table("./data/NAMCS.csv")
        ## Error: line 2 did not have 87 elements
        head(Data, 2)
        ## Error: object 'Data' not found
  • 10. Example parameters
    read.csv() sets sep="," and header=TRUE by default, so the following two calls are equivalent:
        cameraData <- read.table("./data/NAMCS.csv", sep=",", header=TRUE)
        cameraData <- read.csv("./data/NAMCS.csv")
        head(cameraData)
        ##       Region Period Adoption.of.Basic.EHRs..Overall.Physician.Practices
        ## 1    Alabama   2013                                                0.48
        ## 2     Alaska   2013                                                0.50
        ## 3    Arizona   2013                                                0.51
        ## 4   Arkansas   2013                                                0.46
        ## 5 California   2013                                                0.54
        ## 6   Colorado   2013                                                0.39
        ##   Adoption.of.Basic.EHRs..Primary.Care.Providers
        ## 1                                           0.50
        ## 2                                           0.52
        ## 3                                           0.63
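A small sketch of why the default read.table() call fails on a comma-separated file while the explicit call matches read.csv(). The three-line file below is made up for illustration and is not the NAMCS data:

```r
# Small stand-in for a comma-separated file (values are made up for illustration).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Region,Period,Adoption",
             "New York,2013,0.52",
             "Alaska,2013,0.50"), csv_path)

# Default read.table() splits on whitespace, so "New York,..." has one more
# field than the other lines and the read fails.
default_try <- try(read.table(csv_path), silent = TRUE)
failed <- inherits(default_try, "try-error")

# Supplying sep and header explicitly matches what read.csv() does by default.
viaTable <- read.table(csv_path, sep = ",", header = TRUE)
viaCsv   <- read.csv(csv_path)
same <- isTRUE(all.equal(viaTable, viaCsv))
```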
  • 11. Some more important parameters
    A common source of trouble when reading flat files is quotation marks (` or ") placed inside data values; setting quote="" often resolves this.
    - quote - tells R whether there are any quoted values; quote="" means no quotes.
    - na.strings - sets the character string that represents a missing value.
    - nrows - how many rows of the file to read (e.g. nrows=10 reads 10 lines).
    - skip - number of lines to skip before starting to read.
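The four parameters above can be tried together on inline text. This is a sketch under assumed inputs: the lines, the "NA?" missing-value marker, and the column names are all hypothetical:

```r
# Made-up file contents: one junk line to skip, an apostrophe inside a value,
# and "NA?" used as the marker for a missing value.
lines <- c("exported from a hypothetical system",
           "name,score",
           "O'Brien,10",
           "Smith,NA?",
           "Jones,30",
           "Lee,40")

parsed <- read.table(text = lines, sep = ",", header = TRUE,
                     skip = 1,            # ignore the first line
                     quote = "",          # treat ' and " as ordinary characters
                     na.strings = "NA?",  # marker for missing values
                     nrows = 3)           # read only the first three data rows
```

Without quote="", the apostrophe in "O'Brien" would be read as the start of a quoted string and the rows after it would be parsed incorrectly.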
  • 12. read.xlsx(), read.xlsx2() {xlsx package}
    Reading specific rows and columns
        library(xlsx)
        Data <- read.xlsx("./data/ADME_genes.xlsx", sheetIndex=1, header=TRUE)
        colIndex <- 2:3
        rowIndex <- 1:4
        dataSub <- read.xlsx("./data/ADME_genes.xlsx", sheetIndex=1,
                             colIndex=colIndex, rowIndex=rowIndex)
  • 13. Further notes
    - The write.xlsx() function will write out an Excel file with similar arguments.
    - read.xlsx2() is much faster than read.xlsx(), but may be slightly unstable when reading subsets of rows.
    - XLConnect is a Java-based solution, so it is cross-platform and returns satisfactory results. For large data sets it may be very slow.
    - xlsReadWrite is very fast, but it doesn't support .xlsx files.
    - The gdata package provides a good cross-platform solution. It is available for Windows, Mac or Linux. gdata requires you to install additional Perl libraries. Perl is usually already installed on Linux and Mac, but sometimes requires more effort on Windows.
    - In general it is advised to store your data either in a database or in comma-separated (.csv) or tab-separated (.tab/.txt) files, as they are easier to distribute.
    - I found on the web a self-made function to easily import xlsx files. It should work on all platforms and uses XML:
        source("")
        xlsxToR("myfile.xlsx", header = TRUE)
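The recommendation to keep data in comma-separated files can be illustrated with a base R round trip. The data frame below is a small illustrative table, not the contents of ADME_genes.xlsx:

```r
# Illustrative table (not taken from the slides' Excel file).
genes <- data.frame(gene  = c("CYP3A4", "ABCB1"),
                    class = c("enzyme", "transporter"))

# Write without row names so the file holds only the data columns.
csv_path <- tempfile(fileext = ".csv")
write.csv(genes, csv_path, row.names = FALSE)

# Reading it back needs no extra packages, Java, or Perl.
restored <- read.csv(csv_path)
```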
  • 14. Working with XML
    - Extensible markup language
    - Frequently used to store structured data
    - Particularly widely used in internet applications
    - Extracting XML is the basis for most web scraping
    - Components:
        Markup - labels that give the text structure
        Content - the actual text of the document
  • 15. Read the file into R
        library(XML)
        fileUrl <- ""
        doc <- xmlTreeParse(fileUrl, useInternal=TRUE)
        rootNode <- xmlRoot(doc)
        xmlName(rootNode)
        ## [1] "breakfast_menu"
        names(rootNode)
        ##   food   food   food   food   food
        ## "food" "food" "food" "food" "food"
  • 16. Directly access parts of the XML document
        rootNode[[1]]
        ## <food>
        ##   <name>Belgian Waffles</name>
        ##   <price>$5.95</price>
        ##   <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
        ##   <calories>650</calories>
        ## </food>
        rootNode[[1]][[1]]
        ## <name>Belgian Waffles</name>
    Go for a tour of the XML package:
    - Official XML tutorials: short, long
    - An outstanding guide to the XML package
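Beyond positional indexing, values can be pulled out by node name with XPath. The sketch below assumes the XML package is installed and parses an inline stand-in document; the second menu item and its values are illustrative, not taken from the slides:

```r
library(XML)

# Inline stand-in for the breakfast menu document (second item is illustrative).
xmlText <- "<breakfast_menu>
  <food><name>Belgian Waffles</name><price>$5.95</price><calories>650</calories></food>
  <food><name>French Toast</name><price>$4.50</price><calories>600</calories></food>
</breakfast_menu>"

doc <- xmlTreeParse(xmlText, asText = TRUE, useInternalNodes = TRUE)
rootNode <- xmlRoot(doc)

# XPath "//name" selects every <name> node; xmlValue extracts its text content.
foodNames <- xpathSApply(rootNode, "//name", xmlValue)
calories  <- as.numeric(xpathSApply(rootNode, "//calories", xmlValue))
```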
  • 17. JSON
    - JavaScript Object Notation
    - Lightweight data storage
    - Common format for data from application programming interfaces (APIs)
    - Similar structure to XML but different syntax/format
    - Data stored as:
        Numbers (double)
        Strings (double quoted)
        Boolean (true or false)
        Array (ordered, comma separated, enclosed in square brackets [])
        Object (unordered, comma-separated collection of key:value pairs in curly brackets {})
  • 18. Example JSON file
  • 19. Reading data from JSON {jsonlite package}
        library(jsonlite)
        # Using chembl api
        jsonData <- fromJSON("")
        names(jsonData)
        ## [1] "compound"
        jsonData$compound$chemblId
        ## [1] "CHEMBL1"
        jsonData$compound$stdInChiKey
        ## [1] "GHBOEFUAGSHXPO-XZOTUCIWSA-N"
  • 20. Writing data frames to JSON
        myjson <- toJSON(iris, pretty=TRUE)
        cat(myjson)
        ## [
        ##   {
        ##     "Sepal.Length" : 5.1,
        ##     "Sepal.Width" : 3.5,
        ##     "Petal.Length" : 1.4,
        ##     "Petal.Width" : 0.2,
        ##     "Species" : "setosa"
        ##   },
        ##   {
        ##     "Sepal.Length" : 4.9,
        ##     "Sepal.Width" : 3,
        ##     "Petal.Length" : 1.4,
        ##     "Petal.Width" : 0.2,
        ##     "Species" : "setosa"
        ##   },
        ##   {
        ##     "Sepal.Length" : 4.7,
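Reading and writing can be combined into an offline round trip. This sketch assumes the jsonlite package is installed; the JSON string is a made-up stand-in for an API response, and only the chemblId value echoes the slide:

```r
library(jsonlite)

# Made-up JSON text standing in for an API response.
txt <- '{"compound": {"chemblId": "CHEMBL1", "synonyms": ["a", "b"], "known": true}}'
parsed <- fromJSON(txt)

# Convert a small data frame to JSON and back again.
small <- head(iris[, c("Sepal.Length", "Species")], 3)
roundTrip <- fromJSON(toJSON(small))
```

JSON objects become named lists, arrays of strings become character vectors, and a JSON array of records comes back as a data frame. Factor columns such as Species return as plain character vectors, because JSON has no factor type.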
  • 21. Further resources
    - A good tutorial on jsonlite - encoderdecoder/
    - jsonlite vignette
  • 22. mySQL
    - Free and widely used open source database software
    - Widely used in internet-based applications
    - Data are structured in:
        Databases
        Tables within databases
        Fields within tables
    - Each row is called a record
  • 23. Step 2 - Install RMySQL Connector
    - On a Mac: install.packages("RMySQL")
    - On Windows:
        Official instructions - (may be useful for Mac/UNIX users as well)
        Potentially useful guide - windows/
  • 25. Connecting and listing databases
        library(DBI)
        library(RMySQL)
        ucscDb <- dbConnect(MySQL(), user="genome", host="")
        result <- dbGetQuery(ucscDb, "show databases;"); dbDisconnect(ucscDb);
        ## [1] TRUE
        head(result)
        ##             Database
        ## 1 information_schema
        ## 2            ailMel1
        ## 3            allMis1
        ## 4            anoCar1
        ## 5            anoCar2
        ## 6            anoGam1
  • 26. Connecting to hg19 and listing tables
        library(RMySQL)
        hg19 <- dbConnect(MySQL(), user="genome", db="hg19", host="")
        allTables <- dbListTables(hg19)
        length(allTables)
        ## [1] 11006
        allTables[1:5]
        ## [1] "HInv"         "HInvGeneMrna" "acembly"      "acemblyClass"
        ## [5] "acemblyPep"
  • 27. Get dimensions of a specific table
        dbListFields(hg19, "affyU133Plus2")
        ##  [1] "bin"         "matches"     "misMatches"  "repMatches"  "nCount"
        ##  [6] "qNumInsert"  "qBaseInsert" "tNumInsert"  "tBaseInsert" "strand"
        ## [11] "qName"       "qSize"       "qStart"      "qEnd"        "tName"
        ## [16] "tSize"       "tStart"      "tEnd"        "blockCount"  "blockSizes"
        ## [21] "qStarts"     "tStarts"
        dbGetQuery(hg19, "select count(*) from affyU133Plus2")
        ##   count(*)
        ## 1    58463
  • 28. Read from the table
        affyData <- dbReadTable(hg19, "affyU133Plus2")
        head(affyData)
        ##   bin matches misMatches repMatches nCount qNumInsert qBaseInsert
        ## 1 585     530          4          0     23          3          41
        ## 2 585    3355         17          0    109          9          67
        ## 3 585    4156         14          0     83         16          18
        ## 4 585    4667          9          0     68         21          42
        ## 5 585    5180         14          0    167         10          38
        ## 6 585     468          5          0     14          0           0
        ##   tNumInsert tBaseInsert strand        qName qSize qStart qEnd tName
        ## 1          3         898      -  225995_x_at   637      5  603  chr1
        ## 2          9       11621      -  225035_x_at  3635      0 3548  chr1
        ## 3          2          93      -  226340_x_at  4318      3 4274  chr1
        ## 4          3        5743      - 1557034_s_at  4834     48 4834  chr1
        ## 5          1          29      -    231811_at  5399      0 5399  chr1
        ## 6          0           0      -    236841_at   487      0  487  chr1
        ##       tSize tStart  tEnd blockCount
        ## 1 249250621  14361 15816          5
        ## 2 249250621  14381 29483         17
  • 29. Select a specific subset
        query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")
        affyMis <- fetch(query); quantile(affyMis$misMatches)
        ##   0%  25%  50%  75% 100%
        ##    1    1    2    2    3
        affyMisSmall <- fetch(query, n=10); dbClearResult(query);
        ## [1] TRUE
        dim(affyMisSmall)
        ## [1] 10 22
        # close connection
        dbDisconnect(hg19)
  • 30. Further resources
    - RMySQL vignette
    - R data import and export
    - Set up R odbc with postgres
    - A nice blog post summarizing some other commands
  • 31. HDF5
    - Used for storing large data sets
    - Supports storing a range of data types
    - Hierarchical data format
    - groups: contain zero or more data sets and metadata
        Have a group header with the group name and a list of attributes
        Have a group symbol table with a list of objects in the group
    - datasets: multidimensional arrays of data elements with metadata
        Have a header with name, datatype, dataspace, and storage layout
        Have a data array with the data
  • 32. R HDF5 package
    The rhdf5 package works really well, although it is not on CRAN. To install it:
        source("")
        ## Bioconductor version 2.13 (BiocInstaller 1.12), ?biocLite for help
        ## A newer version of Bioconductor is available after installing a new
        ## version of R, ?BiocUpgrade for help
        biocLite("rhdf5")
        ## BioC_mirror:
        ## Using Bioconductor version 2.13 (BiocInstaller 1.12.1), R version 3.0.3.
        ## Installing package(s) 'rhdf5'
        ##
        ## The downloaded binary packages are in
        ## /var/folders/pm/jg6blwt55b71g8jl64wfw8ch0000gn/T//RtmpuYnNzs/downloaded_packages
  • 33. Creating an HDF5 file and group hierarchy
        library(rhdf5)
        h5createFile("myhdf5.h5")
        ## [1] TRUE
        h5createGroup("myhdf5.h5", "foo")
        ## [1] TRUE
        h5createGroup("myhdf5.h5", "baa")
        ## [1] TRUE
        h5createGroup("myhdf5.h5", "foo/foobaa")
        ## [1] TRUE
  • 34. HDF5 continued
    Saving multiple objects to an HDF5 file
        h5ls("myhdf5.h5")
        ##   group   name     otype dclass dim
        ## 0     /    baa H5I_GROUP
        ## 1     /    foo H5I_GROUP
        ## 2  /foo foobaa H5I_GROUP
        A = 1:7; B = 1:18; D = seq(0, 1, by=0.1)
        h5save(A, B, D, file="newfile2.h5")
        h5dump("newfile2.h5")
        ## $A
        ## [1] 1 2 3 4 5 6 7
        ##
        ## $B
        ##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
        ##
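Individual objects can also be written into a specific group and read back by path. This is a sketch assuming the rhdf5 package from Bioconductor is installed; the temporary file name and the matrix are hypothetical:

```r
library(rhdf5)

# Hypothetical temporary HDF5 file with a single group "foo".
h5file <- tempfile(fileext = ".h5")
h5createFile(h5file)
h5createGroup(h5file, "foo")

# Write a matrix into the group, then read it back using the same path.
A <- matrix(1:10, nrow = 5, ncol = 2)
h5write(A, h5file, "foo/A")
readA <- h5read(h5file, "foo/A")

# Release any HDF5 handles that are still open.
h5closeAll()
```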
  • 35. Reading from other resources
    - foreign package: loads data from Minitab, S, SAS, SPSS, Stata, Systat
    - Basic functions:
        read.arff (Weka)
        readline() (read from console)
        read.dta (Stata)
        read.clipboard()
        read.mtp (Minitab)
        read.octave (Octave)
        read.spss (SPSS)
        read.xport (SAS)
    - See the help page for more details
  • 36. Reading images
    - jpeg
    - readbitmap
    - png
    - EBImage (Bioconductor)
  • 37. Reading GIS data
    - rgdal
    - rgeos
    - raster
  • 38. Reading music data
    - tuneR
    - seewave
  • 39. Acknowledgement
    - Jeff Leek, University of Washington, and the Coursera course Getting and Cleaning Data
    - R For Natural Resources Course
    - R Data import comprehensive guide