World Vital Records Case Study

A case study presented at the 2009 Enterprise Search Summit by World Vital Records CEO Paul Allen.

Mr. Allen discussed the challenges of scaling cost-effectively, indexing and querying large data sets, and staying competitive in the marketplace.

  • One of the better presentations at Enterprise Search Summit, New York, NY, 2009. WorldVitalRecords.com uses the Perfect Search Appliance to index and provide access to a billion+ records for millions of users on the web. In this presentation the founder of WorldVitalRecords.com, Paul Allen, explains his reasons for selecting Perfect Search over Lucene and other search engines, and the scalability, performance, and TCO benefits his company received from this search technology.

World Vital Records Case Study

  1. Indexing and Searching Massive Data Sets
     Paul Allen, CEO
     May 13, 2009, Enterprise Search Summit
  2. About
     • WorldVitalRecords.com
         • Provides access to genealogy databases and family history tools, including birth, death, military, census, and parish records
     • WorldHistory.com
         • Provides historical and biographical content
     • We're Related application on Facebook
         • Designed to help users find their family members, build a family tree, and share news and photos with family
         • 15,636,941 active users as of March 2009
  3.
     • Founded in 2006; includes several key members of the original Ancestry.com team
     • Goal is to be the #2 genealogy company on the web
     • Currently has 12,000+ databases
     • 1.2 billion names
     • 25,000 subscribers
  4. The Challenge
     • Rapidly expanding data set, set to grow into the billions of records
     • Mixture of structured and unstructured data
     • Indexing and search costs for this massive content repository escalate quickly
     • Increased customer traffic places additional load on query servers, requiring additional servers and costs
     How do I provide an affordable search solution to handle the explosive data growth?
  5. My Experience With Massive Data Sets
     • Saw content repositories grow to billions of records
     • Saw millions spent on capital expenditures on servers and data centers
     • Saw millions spent on annual server maintenance and energy costs
  6. My Options
     • Build a proprietary search engine
     • Buy a solution from an enterprise search vendor
     • Build a Lucene open source search platform
  7. Traditional Solution to Handle Massive Data Sets
     Lots of servers! Lots of money!
  8. Cluster Architecture
     • Cluster made up of rows and columns of servers
     • Determinants of cluster size
         • Size of index: determines the number of columns
         • Amount of peak traffic (queries per second): determines the number of rows
  9. Index Size – Number of Columns
     Index size = 80 GB = 5 columns
     Query servers needed = 5
     [Diagram: a collation server and five query servers, 8 GB each]
     Assumptions:
     • Up to 50% of index (16 GB for each cluster) resides mostly in cache
     • QPS per server = 20
  10. Peak Traffic Load – Number of Rows
     Index size = 20 GB: 5 columns needed
     Max traffic rate = 100 queries per second (QPS): 10 rows needed
     Query servers needed = 5 columns x 10 rows = 50 servers
     [Diagram: a collation server and a row of five query servers]
     Assumptions:
     • Server memory = 8 GB
     • Up to 50% of index resides in cache
     • QPS per server = 10
  11. WVR Server Configuration
     [Diagram: a collation server and seven servers, six with 8 GB and one with 16 GB]
     • 800,000,000 records
     • Maximum query volume system capacity = 7 queries per second
     • 64-bit Windows-based servers
     • Dual-core CPUs
  12. Future WVR Cluster Needs
     • Projections show
         • 1+ terabyte of data (3.5 billion records)
         • 200 queries per second at peak load
     • Would require a cluster architecture of:
         • 29 columns
         • 40 rows
     • WVR would need 1,160 query servers!
         • Over $3.5 million in initial capital expenditure
         • Over $2.3 million in recurring yearly costs
  13. Search Challenges
     • Many solutions work well with
         • Low traffic
         • Initially small data sets
     • As data and traffic grow, however, so do the costs and associated problems
         • Slow indexing times
         • Low queries-per-second capacity
         • Ranked search would have required a significant expansion of servers to handle the increased search load
         • Required skilled staff to modify and optimize to handle growth
  14. Enterprise Search Solution
  15. Perfect Search Approach
     • Replace existing Lucene
     • Utilize PS indexing
     • Utilize PS search engine
     • Match business rules
     • Incorporate Near Exact modules
         • Soundex, Metaphone
     • Match or improve results
     • Provide query results back to WVR for display
     • Disk-based index
  16. Data Growth Past Year

                              August 2008      May 2009         % Growth
      Number of records       800,000,000      1,200,000,000    50%
      Number of databases     9,000            12,000           33%
  17. Current WVR Perfect Search Server Configuration
     [Diagram: a collation server and seven servers, six with 8 GB and one with 16 GB]
     • 1.2 billion records
     • 12,000 databases
     • Maximum query volume system capacity = 40 queries per second (5x faster!)
     • 64-bit Windows-based servers
     • Dual-core CPUs
  18. Benefits to WorldVitalRecords.com
     • Reduced indexing time to 1/100 of Lucene times
     • Reduced query servers from 7 to 1
     • Provided sub-second query response times
     • Allows for continued dramatic customer growth without significant server expansion
     • Allows World Vital Records to compete with market leaders at a fraction of the server capitalization and maintenance costs
  19. Future Growth Projections
     • World Vital Records growth plans
         • 1+ terabyte (3 times growth in data)
         • 200 queries per second (20 times growth in customers)

                                            Perfect Search    Lucene
      Servers                               20                1,160
      Server capital expenditure            $60,000           $3,480,000
      Recurring power/maintenance costs     $40,000           $2,320,000
  20. Questions?
     Paul Allen, CEO, FamilyLink.com
     paul@familylink.com
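
The sizing arithmetic on slides 8 through 12 and the cost table on slide 19 can be reproduced with a short Python sketch. It is an illustration, not the presenter's model. The function names are hypothetical. The column formula (cached share of the index divided by server memory) is one reading of the slide 9 assumptions. The per-server costs of $3,000 and $2,000 are not stated in the slides; they are inferred by dividing the slide 19 totals by the server counts. The 29 columns and 40 rows for the future cluster are taken directly from slide 12 rather than derived.

```python
import math


def columns_needed(index_gb, server_memory_gb, cached_fraction):
    """Columns: servers needed to hold the cached share of the index in memory."""
    return math.ceil(index_gb * cached_fraction / server_memory_gb)


def rows_needed(peak_qps, qps_per_server):
    """Rows: copies of the column set needed to absorb peak query traffic."""
    return math.ceil(peak_qps / qps_per_server)


def cluster_costs(servers, capex_per_server, yearly_per_server):
    """Return (initial capital expenditure, recurring yearly cost) in dollars."""
    return servers * capex_per_server, servers * yearly_per_server


# Slide 9: 80 GB index, 8 GB servers, up to 50% of the index held in cache.
columns = columns_needed(index_gb=80, server_memory_gb=8, cached_fraction=0.5)

# Slide 10: 100 QPS at peak, 10 QPS per server.
rows = rows_needed(peak_qps=100, qps_per_server=10)
example_servers = columns * rows

# Assumed per-server costs, inferred from slide 19 ($60,000 / 20 and $40,000 / 20).
CAPEX_PER_SERVER = 3_000
YEARLY_PER_SERVER = 2_000

# Slide 12: 29 columns x 40 rows, taken as given from the slide.
lucene_servers = 29 * 40
lucene_costs = cluster_costs(lucene_servers, CAPEX_PER_SERVER, YEARLY_PER_SERVER)

# Slide 19: 20 servers projected for Perfect Search.
perfect_search_costs = cluster_costs(20, CAPEX_PER_SERVER, YEARLY_PER_SERVER)

print("Example cluster:", columns, "columns x", rows, "rows =", example_servers, "servers")
print("Lucene projection:", lucene_servers, "servers, costs", lucene_costs)
print("Perfect Search projection: 20 servers, costs", perfect_search_costs)
```

Under these assumptions the sketch matches the slide figures: 5 columns and 10 rows for the 50-server example, and 1,160 servers at $3,480,000 and $2,320,000 against 20 servers at $60,000 and $40,000.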
