Big Data at Ancestry.com
Presentation at Big Data Summit, April 2013, SF

Published in: Technology

Transcript

  • 1. Learning from Data: Who Do You Think You Are? DNA Scott Sorensen and Leonid Zhukov
  • 2. Ancestry.com Mission
  • 3. Discoveries
      • It's the "aha" moment of a discovery that drives our business!
  • 4. World's largest online family history resource: Historical Content
      • Over 30,000 historical content collections
      • 11 billion records and images
      • Records dating back to 16th century
  • 5. World's largest online family history resource: User Contributed Content
      • 45 million family trees
      • More than 4 billion profiles
      • 200 million stories and photos
  • 6. DNA Data
      • Over 120,000 DNA samples
      • 700,000 SNPs for each sample
      • 2,000,000 4th cousin matches
      • [Figure] DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism). (http://en.wikipedia.org/wiki/Singlenucleiotide_polymorphism)
      • "Spit in a tube, pay $99, learn your past" (Derrick Harris, GigaOm)
  • 7. User Behavior Data
      • 40 million searches / day
      • 10 million people added to trees / day
      • 5 million Hints accepted / day
      • 3.5 million Records attached / day
      • [Two charts, time axis from 1/12 to 12/12]
  • 8. Real-time data feed
  • 9. Technology: Machine Learning
  • 10. Person and record search
      • Search query
  • 11. Hint suggestions system
      • Hints: suggestions to attach a record
  • 12. Record linkage
      • Record linkage: finding and matching records in multiple data sets with non-unique identifiers
      • Goal: bring together information about the same person
      • Some non-unique identifiers:
          – Names: first name, last name (John Smith: 300,000 records)
          – Dates: date of birth, date of death
          – Places: place of birth, residence, place of death
          – Extra: family members, life events
      • Records are often incomplete
      • Records contain mistakes
      • Exact and fuzzy match
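The distinction between exact and fuzzy matching on a non-unique identifier can be sketched as follows. This is a minimal illustration, not Ancestry's implementation: the function names, the 0.8 threshold, and the use of Python's `difflib` ratio as the similarity measure are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import Optional


def normalize(value: str) -> str:
    """Lowercase and trim so that trivial formatting differences are ignored."""
    return value.strip().lower()


def exact_match(a: str, b: str) -> bool:
    """Exact match after normalization."""
    return normalize(a) == normalize(b)


def fuzzy_score(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def fuzzy_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy match: tolerate small spelling mistakes (threshold is an assumption)."""
    return fuzzy_score(a, b) >= threshold


def compare_field(a: Optional[str], b: Optional[str]) -> Optional[bool]:
    """Return None when either side is missing, so that an incomplete
    record is treated as 'unknown' rather than as a mismatch."""
    if not a or not b:
        return None
    return fuzzy_match(a, b)
```

Returning `None` for a missing field keeps "incomplete" separate from "contradictory", which matters because the slide notes that records are often incomplete.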
  • 13. Life events in collections
      • Life events:
          – Birth: 2.59 bln
          – Marriage: 114 mln
          – Census: 2.74 bln
          – Death: 467 mln
      • Total: 5.91 bln events
  • 14. Candidate set funnel: exact match
      • John Smith: 300,000
      • John Smith, 1870: 2,200
      • John Smith, 1870, Boston, MA: 10
      • Search: high precision
  • 15. Candidate set funnel: fuzzy match
      • John Smith: 380,000
      • John Smith, 1870: 97,000
      • John Smith, 1870, Boston, MA: 1,400
      • Exploration: high recall
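The two funnels above can be illustrated with a small filter that narrows a candidate set by name, then year, then place. This is only a sketch: the `Record` fields, the `funnel` function, and the fuzzy settings (name variants, a year tolerance, keeping records with missing fields) are hypothetical choices for the example, not the system described in the slides.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Record:
    name: str
    birth_year: Optional[int]
    place: Optional[str]


def funnel(
    records: Sequence[Record],
    name: str,
    year: int,
    place: str,
    name_variants: Sequence[str] = (),
    year_tolerance: int = 0,
    keep_missing: bool = False,
) -> Tuple[List[Record], List[int]]:
    """Narrow candidates step by step.

    Returns the surviving records and the candidate count after each of the
    three steps (name, name + year, name + year + place).
    """
    names = {name.lower()} | {v.lower() for v in name_variants}
    step = [r for r in records if r.name.lower() in names]
    counts = [len(step)]

    def year_ok(r: Record) -> bool:
        if r.birth_year is None:
            return keep_missing  # fuzzy mode may keep incomplete records
        return abs(r.birth_year - year) <= year_tolerance

    step = [r for r in step if year_ok(r)]
    counts.append(len(step))

    def place_ok(r: Record) -> bool:
        if r.place is None:
            return keep_missing
        return r.place.lower() == place.lower()

    step = [r for r in step if place_ok(r)]
    counts.append(len(step))
    return step, counts


# Toy data, invented for illustration only.
RECORDS = [
    Record("John Smith", 1870, "Boston, MA"),
    Record("John Smith", 1870, "Chicago, IL"),
    Record("John Smith", 1845, "Boston, MA"),
    Record("Jon Smith", 1871, "Boston, MA"),
    Record("John Smith", None, "Boston, MA"),
    Record("Mary Jones", 1870, "Boston, MA"),
]

exact_hits, exact_counts = funnel(RECORDS, "John Smith", 1870, "Boston, MA")
fuzzy_hits, fuzzy_counts = funnel(
    RECORDS, "John Smith", 1870, "Boston, MA",
    name_variants=["Jon Smith"], year_tolerance=2, keep_missing=True,
)
```

With the loosened settings every step keeps more candidates, which mirrors the precision-versus-recall trade-off between the two slides.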
  • 16. Results set
      • [Figure labels: exact match, name edit distance, short names, initials, extended dates, missing fields]
  • 17. Hints suggestion system
      • User feedback loop:
          – Accept suggestion
          – Reject suggestion
  • 18. A place for machine learning
      • Supervised machine learning
      • Learn a similarity measure between Person and Record (how to combine identifiers)
      • Training & testing sets:
          – User accepts, rejects
      • Features (> 500):
          – First and last name, DOB, POB, DOD, POD
          – Parents, children, siblings, spouses
          – Fuzzy matches
      • Similar to a "learning to rank" problem
      • [Diagram labels: Candidate k-set, ML, suggest]
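Turning accept/reject feedback into a supervised training set might look roughly like the sketch below. The field names, the four features, and the `-1.0` sentinel for missing values are assumptions for illustration; the slide mentions more than 500 features but does not list them.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple

Profile = Dict[str, Optional[object]]
MISSING = -1.0  # sentinel for "cannot compare" (an assumption of this sketch)


def text_similarity(a: Optional[str], b: Optional[str]) -> float:
    """Fuzzy string similarity in [0, 1], or MISSING if a value is absent."""
    if not a or not b:
        return MISSING
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def year_difference(a: Optional[int], b: Optional[int]) -> float:
    """Absolute difference in years, or MISSING if a value is absent."""
    if a is None or b is None:
        return MISSING
    return float(abs(a - b))


def pair_features(person: Profile, record: Profile) -> List[float]:
    """One feature vector per (person, record) pair."""
    return [
        text_similarity(person.get("first_name"), record.get("first_name")),
        text_similarity(person.get("last_name"), record.get("last_name")),
        year_difference(person.get("birth_year"), record.get("birth_year")),
        text_similarity(person.get("birth_place"), record.get("birth_place")),
    ]


def build_training_set(
    feedback: List[Tuple[Profile, Profile, bool]]
) -> Tuple[List[List[float]], List[int]]:
    """Each user decision becomes one labeled row: 1 = accepted, 0 = rejected."""
    X = [pair_features(person, record) for person, record, _ in feedback]
    y = [1 if accepted else 0 for _, _, accepted in feedback]
    return X, y
```

The resulting `X` and `y` could be passed to any supervised learner; the slides name random forests, but nothing in this sketch depends on that choice.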
  • 19. Similarity measure learning
      • Training [diagram labels]: Label, Person ID, Record ID, Feature generation, Index, Ancestry collections, Member trees, ML, Random forest, Hadoop, Hive
      • Scoring [diagram labels]: Person ID, Top-k records candidate set, Feature generation, Model, Ranked list
  • 20. Large scale machine learning
      • [Diagram labels: Hadoop HDFS, Hadoop streaming, four Random forest (R) boxes, Model]
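The diagrams show several random forests trained in parallel and a single model that produces a ranked list from a top-k candidate set. The slides do not say how the parallel forests are combined; one common approach, assumed here, is to treat them as one larger ensemble and average their scores. In this sketch each "model" is just a Python callable standing in for a trained forest, and all names are hypothetical.

```python
import heapq
from statistics import mean
from typing import Callable, Dict, List, Sequence

Model = Callable[[Sequence[float]], float]  # stand-in for a trained forest


def ensemble_score(models: Sequence[Model], features: Sequence[float]) -> float:
    """Average the scores of independently trained models (assumed scheme)."""
    return mean(model(features) for model in models)


def rank_candidates(
    models: Sequence[Model],
    candidates: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Return the ids of the k best-scoring candidate records, best first."""
    scored = (
        (ensemble_score(models, features), record_id)
        for record_id, features in candidates.items()
    )
    return [record_id for _, record_id in heapq.nlargest(k, scored)]


# Toy stand-ins: each "model" looks at a single feature.
toy_models = [lambda f: f[0], lambda f: f[1]]
toy_candidates = {"r1": [0.9, 0.7], "r2": [0.2, 0.4], "r3": [1.0, 1.0]}
top_two = rank_candidates(toy_models, toy_candidates, k=2)
```

`heapq.nlargest` avoids sorting the whole candidate set when only the top few suggestions are shown to the user.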
  • 21. Data: Big Data, Big Picture
  • 22. Family tree
      • User generated family trees:
          – 45 mln family trees
          – 4.9 bln profiles
  • 23. Family tree as a graph (DAG)
      • 2020 nodes
      • 572 marriage edges
      • 2910 family edges
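A family tree with the two edge types named above can be represented as a small graph structure. The sketch below is an assumption about one reasonable representation (directed parent-to-child "family" edges, undirected "marriage" edges); the class and method names are hypothetical and not taken from the presentation.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Set


class FamilyTree:
    """Family edges are directed (parent -> child); marriage edges are undirected."""

    def __init__(self) -> None:
        self.people: Set[str] = set()
        self.children: Dict[str, Set[str]] = defaultdict(set)
        self.marriages: Set[FrozenSet[str]] = set()

    def add_family_edge(self, parent: str, child: str) -> None:
        self.people.update((parent, child))
        self.children[parent].add(child)

    def add_marriage(self, a: str, b: str) -> None:
        self.people.update((a, b))
        self.marriages.add(frozenset((a, b)))

    def node_count(self) -> int:
        return len(self.people)

    def family_edge_count(self) -> int:
        return sum(len(kids) for kids in self.children.values())

    def marriage_edge_count(self) -> int:
        return len(self.marriages)

    def generations(self) -> int:
        """Number of people on the longest parent-to-child chain.

        Assumes the family edges are acyclic, as in a DAG.
        """
        depth: Dict[str, int] = {}

        def longest_from(person: str) -> int:
            if person not in depth:
                kids = self.children.get(person, ())
                depth[person] = 1 + max((longest_from(c) for c in kids), default=0)
            return depth[person]

        return max((longest_from(p) for p in self.people), default=0)


tree = FamilyTree()
tree.add_marriage("Ann", "Al")
tree.add_family_edge("Ann", "Bob")
tree.add_family_edge("Al", "Bob")
tree.add_marriage("Bob", "Bea")
tree.add_family_edge("Bob", "Cid")
tree.add_family_edge("Bea", "Cid")
```

Only the family edges need to be acyclic for the generation count to be well defined; marriage edges are kept separately so they do not affect it.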
  • 24. Family trees
  • 25. Family trees statistics
      • "Power law" distribution
      • 44 mln trees
  • 26. History from family trees
      • 500 nodes
      • 700 edges
      • 55 generations
      • [Axis label: time]
  • 27. Historical immigration to the US
      • Immigration is the movement of people into a country or region to which they are not native in order to settle there
      • Immigrants are those who were born outside the US and died in the US
      • Based on family tree profiles:
          – Birth/death dates range 1500-1990
          – Select only complete profiles with FLN, POB, DOB, POD, DOD
          – Perform de-duplication, remove same ancestors from different family trees
          – Select only those with POB != US, POD == US
      • 15 mln profiles (0.3% of 4.9 bln profiles)
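The selection steps listed above can be sketched as a single pass over profile dictionaries. The field names, the use of country codes for places, whole years for dates, and the naive de-duplication key are simplifying assumptions for this example; the slides do not describe how de-duplication was actually performed.

```python
from typing import Dict, Iterable, List, Optional

Profile = Dict[str, Optional[object]]
REQUIRED_FIELDS = ("name", "pob", "dob", "pod", "dod")  # FLN, POB, DOB, POD, DOD


def is_complete(profile: Profile) -> bool:
    """A profile is complete when every required field is present and non-empty."""
    return all(profile.get(field) for field in REQUIRED_FIELDS)


def select_immigrants(
    profiles: Iterable[Profile], start: int = 1500, end: int = 1990
) -> List[Profile]:
    """Apply the filters in the order given on the slide."""
    seen = set()
    immigrants = []
    for p in profiles:
        if not is_complete(p):
            continue
        if not (start <= p["dob"] <= end and start <= p["dod"] <= end):
            continue
        # Naive de-duplication: identical normalized fields count as the
        # same ancestor appearing in different trees (an assumption).
        key = (p["name"].lower(), p["pob"], p["dob"], p["pod"], p["dod"])
        if key in seen:
            continue
        seen.add(key)
        if p["pob"] != "US" and p["pod"] == "US":
            immigrants.append(p)
    return immigrants


# Share of profiles reported on the slide: 15 mln out of 4.9 bln.
share_percent = 15e6 / 4.9e9 * 100
```

The computed share is about 0.31%, consistent with the 0.3% quoted on the slide.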
  • 28. Immigration to the USA 1500-1990
  • 29.
  • 30. Immigration map
  • 31. Ports of arrival (1800-1980)
  • 32. Data Science
      • Ancestry is building a data science team
      • We work on product data and BI
      • We are hiring
      • Special thanks to Mercator Group for infographics
