Ecosystem Challenges Around Data Use
Leonid Zhukov

Presentation at the panel on data use at CRA 2014

  1. Ecosystem Challenges Around Data Use
       Leonid Zhukov
  2. Ancestry.com
       •  World’s largest online family history resource
       •  Started as a publishing company in 1983, online from 1996
       •  2.7 million worldwide subscribers
  3. Data at Ancestry
       •  Historical records: content collections acquired by the company
       •  User created content:
         – Ancestor profiles and family trees
         – Uploaded photographs and stories
       •  User behavior data on Ancestry.com
       •  Customer DNA data
       •  10 PB of structured and unstructured data
  4. Historical records
       •  Historical content
         – 14 billion historical records going back to the 17th century
         – Digitized and searchable
  5. Historical records
       •  More than 30,000 content collections
  6. User family trees
       •  Family trees:
         – 60 million family trees
         – 6 billion profiles
  7. Family trees
       •  Power law distribution of tree sizes
       •  Figure labels: 500 nodes, 700 edges, 55 generations; axis: time
  8. User contributed content
       •  200 million uploaded family photos and stories
  9. Person and record search
       •  Search query
  10. Record linkage
       •  Record linkage: finding and matching records in multiple data sets with non-unique identifiers (data matching, entity disambiguation, duplicate detection, etc.)
       •  Goal: bring together information about the same person
       •  Some non-unique identifiers:
         – Names: first name, last name (John Smith: 300,000 records)
         – Dates: date of birth, date of death
         – Places: place of birth, residence, place of death
         – Extra: family members, life events
       •  Records are often incomplete and contain mistakes
       •  Other industries: banking, insurance, government, etc.
  11. User behavior data
       •  User behavior data:
         – 75 million searches daily
         – 10 million profiles added daily
         – 3.5 million records attached daily
  12. DNA data
       •  Direct-to-consumer DNA test
       •  700,000 SNPs per sample
       •  400,000 DNA samples
       •  No medical studies
  13. Ancestry DNA
       •  Genetic ethnicity
         – Reference panel
         – 26 ethnic regions, 3000 samples
  14. Ancestry DNA
       •  Genetic inheritance
         – Identity-by-descent
         – Cousin matching
       [Figure: Matching DNA]
  15. DNA data: privacy and research
       [Figure: Trade-offs between SNPs and privacy. Labels: privacy (low, medium, high); independent SNPs; insufficient for future genomic research; insufficient for privacy protection; needed to find genetic relationships.]
       Source: Z. Lin, A. Owen, R. Altman, Science, vol 305, 2004
  16. Challenges
       •  Engineering
         – Scalability
         – Availability
         – Security
       •  Research
         – Information retrieval
         – DNA genomic research
       •  Privacy
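
The record linkage problem on slide 10 can be illustrated with a minimal sketch. This is not Ancestry's method; it only shows one simple way to combine non-unique identifiers (names, dates, places) into a match score while tolerating missing fields and small mistakes. The Record class, the field weights, the 5-year birth-year tolerance and the 0.85 threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass(frozen=True)
class Record:
    # Hypothetical record layout; real collections carry many more fields.
    first_name: str
    last_name: str
    birth_year: Optional[int] = None   # records are often incomplete
    birth_place: Optional[str] = None


def text_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_score(r1: Record, r2: Record) -> float:
    """Weighted agreement over the fields present in both records."""
    parts = [
        (0.30, text_similarity(r1.first_name, r2.first_name)),
        (0.30, text_similarity(r1.last_name, r2.last_name)),
    ]
    if r1.birth_year is not None and r2.birth_year is not None:
        gap = abs(r1.birth_year - r2.birth_year)
        # Full credit for the same year, none beyond a 5-year gap (assumed).
        parts.append((0.25, max(0.0, 1.0 - gap / 5.0)))
    if r1.birth_place and r2.birth_place:
        parts.append((0.15, text_similarity(r1.birth_place, r2.birth_place)))
    total_weight = sum(weight for weight, _ in parts)
    return sum(weight * score for weight, score in parts) / total_weight


def same_person(r1: Record, r2: Record, threshold: float = 0.85) -> bool:
    """Declare a match only if some non-name field could be compared."""
    has_year = r1.birth_year is not None and r2.birth_year is not None
    has_place = bool(r1.birth_place) and bool(r2.birth_place)
    if not (has_year or has_place):
        return False  # names alone are too ambiguous (the "John Smith" case)
    return match_score(r1, r2) >= threshold


a = Record("John", "Smith", 1850, "Boston, Massachusetts")
b = Record("Jon", "Smith", 1851, "Boston, Massachusetts")   # misspelled, off by a year
c = Record("John", "Smith", 1902, "Leeds, England")         # same name, different person
d = Record("John", "Smith")                                  # name only

print(same_person(a, b), same_person(a, c), same_person(a, d))
```

Comparing every pair of records this way is quadratic in the number of records, so a sketch like this would need a candidate-selection step (for example, grouping by surname) before it could be applied to large collections.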
