Transcript

  • 1. Basic Data Ingestion in R
    Denver RUG 11/16/10
    @jrideout
    Software Engineer & Data Monkey
    @ReturnPath
  • 2. Where is the data?
    Flat-file (text/binary)
    Relational Database
    Where is … (from google suggestions)
    chuck norris
    the love
    my mind
    the love lyrics (apparently a song by Black Eyed Peas)
  • 3. read.*
    read.table
    read.csv(2)
    csv2 for "," decimal points, ";" delimiter
    read.delim(2)
    Tab-delimited by default
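
A minimal sketch of how these variants differ, assuming small inline strings in place of real files (the `text=` argument and the sample data are illustrative, not from the talk):

```r
# The same two-row table written in three conventions (made-up sample data).
us_text  <- "x,y\n1.5,a\n2.5,b"      # comma delimiter, dot decimal point
eu_text  <- "x;y\n1,5;a\n2,5;b"      # semicolon delimiter, comma decimal point
tab_text <- "x\ty\n1.5\ta\n2.5\tb"   # tab delimiter

us  <- read.csv(text = us_text)
eu  <- read.csv2(text = eu_text)
tab <- read.delim(text = tab_text)

# Each variant is read.table with different defaults, so the generic
# call can reproduce read.csv2 by setting sep and dec explicitly.
generic <- read.table(text = eu_text, header = TRUE, sep = ";", dec = ",")
```

All four calls should yield the same numeric column `x`; only the delimiter and decimal-mark defaults change.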
  • 4. read.*
    library(foreign) provides read.* for:
    systat, xport, ssd, octave, spss, mtp, epiinfo, dta, dbf
    Many Others:
    Search http://crantastic.org/
  • 5. Scan
    Better for numeric matrices
    M1 <- matrix(scan("test.data"), nrow=x, ncol=y, byrow=TRUE)
    Read 10000000 items
    user system elapsed
    28.565 18.513 50.882
    M2 <- as.matrix(read.table("test.data"))
    More than 40 minutes on my laptop
    (Actually, read.* just uses scan anyway)
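
The two approaches on this slide can be tried on a small scale. This is a sketch under assumptions: the temporary file below is a hypothetical stand-in for `test.data`, and the dimensions are chosen only for illustration.

```r
# Write a tiny whitespace-separated numeric file (stand-in for test.data).
path <- tempfile(fileext = ".data")
writeLines(c("1 2 3", "4 5 6"), path)

# scan() returns one flat numeric vector; matrix() reshapes it row by row.
m1 <- matrix(scan(path, quiet = TRUE), nrow = 2, ncol = 3, byrow = TRUE)

# read.table() builds a data frame first, which is then converted.
m2 <- as.matrix(read.table(path))
dimnames(m2) <- NULL   # drop the automatic V1..V3 column names

unlink(path)
```

Both routes produce the same values; the difference the slide reports is in time and memory on large inputs, which a file this small will not show.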
  • 6. Others
    readLines
    sqldf
    MapReduce
    bigmemory
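
Of the options listed, `readLines` is the one available in base R without extra packages. A rough sketch of using it for input that is not cleanly tabular, assuming a made-up `key=value` line format:

```r
# Hypothetical log-like file that read.table would not parse directly.
path <- tempfile()
writeLines(c("# header comment", "id=1 value=10", "id=2 value=20"), path)

lines   <- readLines(path)              # one character string per line
records <- lines[!grepl("^#", lines)]   # drop comment lines

# Pull out the number after "value=" with a regular expression.
values <- as.numeric(sub(".*value=", "", records))

unlink(path)
```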
  • 7. Some tricks
    comment.char=""
    Use colClasses or as.is for read.table
    stringsAsFactors
    colnames(data) <- c('newName', 'other')
    na.strings = "."
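
The tricks above can be combined in a single `read.table` call. A minimal sketch, assuming an inline three-column sample (the data and column names are invented for illustration):

```r
# Made-up sample: "#a" must survive as data and "." marks a missing value.
raw <- "id,score,label\n1,3.5,#a\n2,.,b"

df <- read.table(text = raw, header = TRUE, sep = ",",
                 comment.char = "",     # do not treat '#' as a comment start
                 colClasses = c("integer", "numeric", "character"),
                 na.strings = ".",      # read "." as NA
                 stringsAsFactors = FALSE)

# Rename columns after reading.
colnames(df) <- c("id", "score", "tag")
```

Declaring `colClasses` up front spares R from guessing each column's type, which is usually where the speed-up on large files comes from.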
  • 8. Working with the DF
    attach(df); fieldname
    df[[index]]
    df$fieldname
    plyr/reshape
    name abbreviation (partial matching with $)
    as.*, matrix, data.matrix
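
A short sketch of the access styles listed above, using a small made-up data frame (`with()` stands in for `attach()` here so the search path is left untouched; that substitution is my choice, not the slide's):

```r
df <- data.frame(fieldname = c(1.5, 2.5), other = c(10, 20))

by_index  <- df[[1]]              # column by position
by_name   <- df$fieldname         # column by full name
by_abbrev <- df$field             # '$' accepts an unambiguous name prefix
by_with   <- with(df, fieldname)  # bare name, scoped to one expression

m <- as.matrix(df)                # all-numeric data frame -> numeric matrix
```

Abbreviated names are convenient interactively but fragile in scripts, since adding a column with a similar prefix silently changes what matches.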
  • 9. Type coercion
    Check types with str(), typeof()
    attributes()
    logical < integer < double < complex
    It’s better to get the read.* methods right than coerce later.
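
The coercion ordering can be observed by combining values with `c()`, which promotes everything to the highest type present. A minimal sketch with invented values:

```r
x <- c(TRUE, 2L)     # logical + integer
y <- c(x, 3.5)       # integer + double
z <- c(y, 1i)        # double + complex

types <- c(typeof(x), typeof(y), typeof(z))
# types is c("integer", "double", "complex")

# Coercing after the fact: strings that are not numbers become NA.
raw_col <- c("1", "2", "x")
as_num  <- suppressWarnings(as.numeric(raw_col))
```

The `NA` produced for `"x"` is the kind of silent loss that getting the types right at read time avoids.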
  • 10. ?