Data Structures and Visualization
Invited talk from the big data symposium at the 2011 ADSA meeting in New Orleans, LA.

Presentation Transcript

  • Data Structures and Visualization. J. B. Cole, Animal Improvement Programs Laboratory, Agricultural Research Service, USDA, Beltsville, MD 20705-2350, john.cole@ars.usda.gov
  • Really Big Data: Processing and Analysis of Very Large Datasets, ADSA/ASAS Joint Annual Meeting, July 2011 (Cole)
  • Introduction
    • We’re drowning in information
    • Genetics are viewed as a commodity
    • We need to get better data from fewer cows
    • Do we have the resources we need?
  • U.S. dairy population
    [Figure: cows (millions; axis 0 to 30) plotted against year (axis ticks 40 through 00)]
  • We need to do more with less
    • 47% of U.S. dairy cows are enrolled in DHIA testing
    • The Class III milk price is $17/cwt
    • Grain prices are very high
      ◦ Corn averaged $6/bu in May
      ◦ Soybeans averaged $13/bu in May
    • Enrollment and cow numbers are unlikely to increase
  • Major topics
    • Different sources of data
    • Data source integration and quality
    • Data mining models
    • Visualization examples
    • Computational resources
  • Data currently in national database
    • Identification and registration
    • Conformation scores
    • Milk production and composition
    • Fertility
    • Longevity
    • Some genotypes
  • What are big data?
      Type of record              Number of records [1]
      Cows with lactation data    28,394,976
      Lactations                  68,373,863
      Individual test days        508,574,732
      Calving ease records        20,770,758
      Animals in pedigree file    58,893,009
      Bull genotypes              50,393
      Cow genotypes               70,687
      [1] Totals include animals from all breeds.
  • Data not routinely available
    • Farm and herd management
      ◦ Geography and climate
      ◦ Housing systems
      ◦ Feed intake
    • Milk composition
      ◦ Milk fats, proteins, vitamins, minerals
      ◦ Conductivity, lactose, MUN
    • DNA data
      ◦ Cow SNP genotypes, DNA sequence data
    Photo: NOAA
  • Data “trapped” on the farm
    • Fertility
      ◦ Insemination information
      ◦ Use of estrus synchronization
    • Cow health and longevity
      ◦ Body condition scores
      ◦ Birth weights and mature weights
      ◦ Disease occurrence data
  • Electronic milk meters
    • Currently can provide:
      ◦ Milk yield
      ◦ Milking speed
      ◦ Electrical conductivity
    • May eventually supply:
      ◦ Progesterone levels
      ◦ Milk temperature
      ◦ Fat and protein concentrations
    Photo: afimilk
  • Other sources of data
    • RFID tags lower the ID error rates associated with meter data
    • Pedometers are useful for detecting estrus, the onset of calving, and some early-stage disease
    Photos: top, Allflex; bottom, afimilk
  • Current sources of data
    [Diagram linking AIPL, CDCB, NAAB, PDCA, DHI, and universities]
    • AIPL: Animal Improvement Programs Laboratory, USDA
    • CDCB: Council on Dairy Cattle Breeding
    • DHI: Dairy Herd Improvement (milk recording organizations)
    • NAAB: National Association of Animal Breeders (AI)
    • PDCA: Purebred Dairy Cattle Association (breed registries)
  • Sources of genomic data
    [Diagram with nodes: AIPL; requester (e.g., AI, breeds); dairy producers; DNA laboratories; samples]
  • Data source integration
    • Incoming data from different sources are checked against one another
    • The AIPL edits system consists of ~64,000 SLOC
      ◦ Mostly C, some Fortran 90
    • Data are stored in a relational database
  • Typical edits
    • Match birth date with dam’s calving date
    • Compare with other sources (e.g., breed association)
    • Investigate maternal sibs born within 9 mo of each other (may assume ET)
    • Animals with IDs within 100 of each other and the same sire, dam, and birth date are assumed to be twins
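The last two edits can be sketched as simple predicates. This is only an illustration, not the AIPL edits system (which the talk describes as mostly C and Fortran 90): the record layout, the numeric IDs, the inclusive 100-ID window, and the 274-day stand-in for "9 mo" are all assumptions.

```python
from datetime import date

def likely_twins(a, b, max_id_gap=100):
    """Twin rule: close IDs plus identical sire, dam, and birth date.

    Assumes numeric IDs and an inclusive window (hypothetical choices).
    """
    return (abs(a["id"] - b["id"]) <= max_id_gap
            and a["sire"] == b["sire"]
            and a["dam"] == b["dam"]
            and a["birth"] == b["birth"])

def possible_et(a, b, max_gap_days=274):
    """Flag maternal sibs born on different dates less than ~9 months apart."""
    gap = abs((a["birth"] - b["birth"]).days)
    return a["dam"] == b["dam"] and 0 < gap < max_gap_days

# Hypothetical records, for illustration only.
calf1 = {"id": 1001, "sire": "S1", "dam": "D1", "birth": date(2010, 3, 1)}
calf2 = {"id": 1002, "sire": "S1", "dam": "D1", "birth": date(2010, 3, 1)}
calf3 = {"id": 5000, "sire": "S2", "dam": "D1", "birth": date(2010, 7, 15)}

print(likely_twins(calf1, calf2))  # same parents and birth date, adjacent IDs
print(possible_et(calf1, calf3))   # same dam, births about 4.5 months apart
```

A flagged pair would be investigated, not changed automatically; the predicates only mark records for review.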
  • How do we assess data quality?
    • Consistency
      ◦ e.g., calving, progeny birth, breeding, dry dates
    • Parentage verification
    • Electronic ID
    • Within-herd heritability
  • Data mining
    • The discovery of useful, possibly unexpected patterns in data
    • Four principal tasks
      ◦ Association
      ◦ Clustering
      ◦ Classification
      ◦ Regression
  • Bonferroni’s principle
    • You will find interesting patterns if you look hard enough
    • Not all relationships are legitimate
    • You must have enough data to support the questions you’re asking
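A rough way to see the principle is to count how many "hits" pure chance would produce. The sketch below assumes independent tests at a fixed significance level and borrows the 43,382-SNP count from a later slide; the function names and the 0.05 level are illustrative choices, and the second function is the related Bonferroni correction, which the slide does not mention.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of chance 'discoveries' when no real effect exists."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, alpha):
    """Per-test threshold that keeps the family-wise error rate near alpha."""
    return alpha / n_tests

n_snps = 43_382  # SNP count quoted on the "One image, millions of points" slide
print(expected_false_positives(n_snps, 0.05))  # a little over 2,000 spurious hits
print(bonferroni_threshold(n_snps, 0.05))      # roughly 1.15e-06
```

If the number of patterns expected by chance is comparable to the number actually found, the findings deserve suspicion.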
  • Association analysis
    • Discover interesting relationships among variables in large databases
      ◦ e.g., predicting protein function and identifying SNP-disease associations
      ◦ Not statistical association analysis!
    • Lots of algorithms, many based on counting attributes
    • Watch for false positives
      ◦ Measures co-occurrence, not causality
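The counting idea can be illustrated with support and confidence, two standard measures from association-rule mining. The talk does not name a specific algorithm, so this is a generic sketch; the records and attribute names are made up.

```python
def support(transactions, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Share of records with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical cow records, each a set of observed attributes.
records = [
    {"mastitis", "high_scc"},
    {"mastitis", "high_scc", "lame"},
    {"high_scc"},
    {"lame"},
]

print(support(records, {"mastitis", "high_scc"}))       # 2 of 4 records
print(confidence(records, {"mastitis"}, {"high_scc"}))  # every mastitis record
print(confidence(records, {"high_scc"}, {"mastitis"}))  # 2 of 3 high_scc records
```

The two confidence values differ by direction, and neither says anything about cause: both only count co-occurrence.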
  • Clustering
    • Place items into distinct groups such that
      ◦ Items in a group are similar
      ◦ Items in one group are dissimilar to those in other groups
    • Hierarchical or partitional approaches
  • Partitional clustering
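The slide for this topic is a figure, so the method it showed is unknown. As one common example of a partitional method, here is a minimal k-means (Lloyd's algorithm) on one-dimensional data; the values and starting centers are hypothetical.

```python
def kmeans_1d(values, centers, n_iter=10):
    """Lloyd's algorithm in one dimension.

    Each pass assigns every value to its nearest center, then moves each
    center to the mean of its group. Empty groups keep their old center.
    """
    centers = list(centers)
    groups = []
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical daily milk yields (kg) from two apparent groups of cows.
yields = [20, 22, 21, 40, 41, 39]
centers, groups = kmeans_1d(yields, centers=[0, 50])
print(centers)
print(groups)
```

Every item ends up in exactly one group, which is what distinguishes a partitional result from a nested hierarchy.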
  • Hierarchical clustering
    • Nested clusters organized into hierarchical trees
    • Data objects may belong to multiple subsets
    • Examples
      ◦ Relationships among species
      ◦ Evolutionary history of proteins
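A small agglomerative example shows how nested clusters arise. This sketch uses single linkage (the closest pair of members defines the distance between clusters) on one-dimensional points. The linkage rule and the data are assumptions, and the quadratic search is for clarity, not speed.

```python
def single_linkage(points):
    """Merge the two closest clusters repeatedly; return the merge history."""
    clusters = [(p,) for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

merges = single_linkage([1, 2, 6, 7, 20])
for left, right, dist in merges:
    print(left, "+", right, "at distance", dist)
```

Reading the merge history from last to first gives the tree from the root down: each point belongs to every cluster along its path, which is how an object sits in multiple nested subsets.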
  • [Figure: cattle breeds grouped by sequencing project]
    • BFGL-Illumina Deep SNP Discovery: Angus, Holstein, Limousin, Jersey, Nelore, Brahman, Romagnola, Gir
    • BFGL Genome Assemblies: Nelore, Water Buffalo
    • Pfizer Light SNP Discovery: Angus, Holstein, Jersey, Hereford, Charolais, Simmental, Brahman, Wagyu
    • Partners Deep SNP Discovery: N’Dama, Sahiwal, Simmental, Hanwoo, Blonde d’Aquitaine, Montbeliard
  • Classification
    • Training set used to develop a rule for assigning individuals to classes
    • Validation set used to assess the accuracy of the classification rule
    • Examples
      ◦ Identify cows with subclinical mastitis
      ◦ Mate assignment
  • Classification methods
    • Bayesian belief networks
    • Decision trees
    • Nearest-neighbor classification
    • Neural networks
    • Rule-based classification
    • Support vector machines
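To make the training/validation idea concrete, here is a minimal 1-nearest-neighbor classifier, one of the methods listed above. The single feature (a somatic cell score) and every value and label are hypothetical.

```python
def nearest_neighbor(train, x):
    """Return the label of the training example closest to x (1-NN)."""
    return min(train, key=lambda rec: abs(rec[0] - x))[1]

def accuracy(train, validation):
    """Fraction of validation examples that the rule labels correctly."""
    hits = sum(nearest_neighbor(train, x) == y for x, y in validation)
    return hits / len(validation)

# Hypothetical (somatic cell score, status) pairs.
train = [(2.0, "healthy"), (2.5, "healthy"),
         (5.5, "subclinical"), (6.0, "subclinical")]
validation = [(2.2, "healthy"), (5.8, "subclinical"), (4.5, "healthy")]

print(nearest_neighbor(train, 5.8))
print(accuracy(train, validation))  # the 4.5 record is misclassified
```

The validation records play no part in building the rule, so the accuracy reflects how it behaves on animals it has not seen.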
  • Decision tree classification
    [Figure] Pinzón-Sánchez et al., 2011, JDS, 94:1873-1892.
  • Rule-based classification
    • Classify records using a series of “if…then” rules
    • Rules come directly from the data, or from other classification models
    • e.g., if (PTA NM$ ≥ $800) and (EFI ≤ 0.05) then (breed to cow)
    • Easy to generate and interpret
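The example rule translates almost directly into code. In this sketch the thresholds come from the slide, while the function name, the argument names, and the fallback label are illustrative assumptions.

```python
def mating_rule(pta_nm, efi):
    """Apply the slide's rule: PTA NM$ >= 800 and EFI <= 0.05 -> breed to cow.

    pta_nm: predicted transmitting ability for net merit, in dollars
    efi:    expected future inbreeding, as a proportion (assumed scale)
    """
    if pta_nm >= 800 and efi <= 0.05:
        return "breed to cow"
    return "no decision"  # hypothetical fallback; the slide gives no else-branch

print(mating_rule(850, 0.04))
print(mating_rule(850, 0.08))
print(mating_rule(600, 0.03))
```

A full rule-based classifier would be an ordered list of such rules, with the first rule that fires deciding the class.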
  • Regression models
    • Prediction of real-valued outputs
    • Given one or more attributes, we can predict, for example:
      ◦ Breeding values
      ◦ Feed intake
      ◦ Milk and components yields
    • Very mature analytical tools
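The simplest case, one attribute predicting one real-valued output, can be written out as ordinary least squares. The data are invented and the variable meanings are placeholders. Models used for breeding values are far more elaborate than this.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical attribute (x) and response (y) values.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(intercept + slope * 5)  # prediction for a new attribute value
```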
  • Visualization
    • How do we present lots of numbers in a compact form?
    • “Graphical methods can retain the information in the data.” (Deming)
    • Complements numerical techniques
      ◦ Tukey (1977), Tufte (1983, 1990, 1997, 2006), Cleveland (1985, 1993), Wickham (2009)
  • One image, millions of points
    [Figure] 43,382 SNP solutions × 4,064 animals = 176,304,448 data points
  • Use size to denote importance
    [Figure] Colors differentiate among chromosomes, and marker sizes are proportional to effect sizes.
  • O-Style Haplotypes (chromosome 15)
  • Correlations among calving traits
  • Provide multiple cues
    [Figure] Lines are differentiated by color and pattern.
    Cole and VanRaden. 2011. J. Anim. Breed. Genet. Online, 1-10.
  • Interstitial figures
    [Figure] Cole and VanRaden. 2010. J. Dairy Sci. 93(6):2727-2740.
  • Computational capacity is abundant
    [Figure] Wikimedia Commons, Wgsimon, Transistor_Count_and_Moore%27s_Law_-_2011.svg
  • Supercomputer performance
    • Cray-1 (1976): 136 megaFLOPS (10^6)
    • Fujitsu K machine (2011): 8.16 petaFLOPS (10^15)
    • Commodity hardware has also experienced gains in performance
    Photos: top, Sherwin Gooch; bottom, Riken
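The gap between the two machines is easier to appreciate as a ratio. The short calculation below uses only the two figures quoted on the slide; the constant and function names are illustrative.

```python
CRAY_1_FLOPS = 136e6       # 136 megaFLOPS, as quoted for the Cray-1 (1976)
K_MACHINE_FLOPS = 8.16e15  # 8.16 petaFLOPS, as quoted for the K machine (2011)

def speedup(new_flops, old_flops):
    """How many times faster the newer machine is."""
    return new_flops / old_flops

ratio = speedup(K_MACHINE_FLOPS, CRAY_1_FLOPS)
print(f"{ratio:.2e}")  # about 6.00e+07, a sixty-million-fold increase
```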
  • Storage costs are plummeting
    [Figure] Matthew Komorowski, http://www.mkomo.com/cost-per-gigabyte
  • Data storage technologies
    • Storage costs are now as low as $100/TB
      ◦ Quality costs!
    • Solid state disks are promising, but relatively low-capacity
    • What do you do about backups?
    Photos: top, Snopes/IBM; bottom, Tom’s Hardware
  • Memory is very cheap
    [Figure] Lev Lafayette, http://www.organdi.net/article.php3?id_article=82
  • Random access memory
    • RAM is still much faster than disk (ns vs. ms access times)
    • A 64-bit OS can address 16 EiB (about 18.4 EB, or 16.8 million TiB), in theory
    • How much can your motherboard hold?
    Photos: top, Stan Yack; bottom, Samsung
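The theoretical limit of a 64-bit address space follows from simple arithmetic: 2^64 byte addresses. The sketch below expresses that count in binary and decimal units; it assumes byte-addressable memory and ignores the narrower limits that real processors and motherboards impose.

```python
ADDRESSABLE_BYTES = 2 ** 64  # one byte per 64-bit address

EIB = 2 ** 60   # exbibyte (binary unit)
EB = 10 ** 18   # exabyte (decimal unit)
TIB = 2 ** 40   # tebibyte (binary unit)

print(ADDRESSABLE_BYTES // EIB)            # 16 EiB
print(round(ADDRESSABLE_BYTES / EB, 1))    # 18.4 EB
print(f"{ADDRESSABLE_BYTES // TIB:,}")     # 16,777,216 TiB
```

The same quantity reads as 16, 18.4, or about 16.8 million depending on the unit, so the unit should always accompany the number.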
  • Software
    • Complexity is increasing
      ◦ Parallelism is hard and debugging is much harder
    • Productive developers are expensive and difficult to find
      ◦ A top programmer may be 10x as productive as an average worker
  • Conclusions
    • The more data we get, the more data we want
    • Relationships among traits may become as important as individual traits
    • Software may be more limiting than hardware
  • Questions?