DNA
Big Data at Ancestry
Bill Yetman, Sr. Director Commerce, Data, Analytics
March 22, 2013
Roots tech 2013 Big Data at Ancestry (3-22-2013) - no animations

This was one of my first presentations on Big Data at Ancestry.com. The audience was split between family historians interested in the technology and developers interested in our Big Data story, so the presentation is a mix. I think there is plenty for someone with an interest in technology and enough meat for a "technologist".

Keep this in mind as you look at this presentation.

Thanks,

-Bill-

Published in: Technology, Education

Roots tech 2013 Big Data at Ancestry (3-22-2013) - no animations

  1. DNA
     Big Data at Ancestry
     Bill Yetman, Sr. Director Commerce, Data, Analytics
     March 22, 2013
  2. Agenda
     • Introduction
     • Understanding Big Data Scale
     • Big Data Story – How it works
     • What is Big Data?
     • Ancestry’s Big Data
     • Three Big Data Examples
       – DNA Matching
       – Machine Learning
       – Data Mining
  3. Understanding Big Data Scale
     Dollar Amounts | Data Sizes
     $1 | Byte
     $1,000 (Thousands) | Kilobyte (KB: 10^3)
     $1,000,000 (Millions) | Megabyte (MB: 10^6)
     $1,000,000,000 (Billions) | Gigabyte (GB: 10^9)
     $1,000,000,000,000 (Trillions) | Terabyte (TB: 10^12)
     $1,000,000,000,000,000 (Quadrillion) | Petabyte (PB: 10^15)
     $1,000,000,000,000,000,000 (Quintillion) | Exabyte (EB: 10^18)
     • Before the bubble, $100 million used to be big; now $1 trillion is routine
     • Data has gone through a similar trend
     "640K ought to be enough for anybody." (attributed to Bill Gates, 1981)
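To make the scale table on slide 3 concrete, here is a minimal Python sketch that formats a byte count with the same decimal prefixes (10^3 per step). The function name `human_size` is an illustrative choice, not something from the presentation.

```python
# Decimal (SI) prefixes, matching the powers of ten in the slide's table.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]

def human_size(num_bytes: int) -> str:
    """Format a byte count using decimal prefixes (1 KB = 10^3 bytes)."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value drops below 1000,
        # or at the largest unit we know about.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

print(human_size(4 * 10**15))  # the deck's "4 Petabytes" figure
print(human_size(1500))
```

Each step up the table is a factor of 1,000, so the loop simply divides until the number is small enough to read.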
  4. Big Data at Work
     Bill’s Facebook Story
       – Signed up in early 2007
       – Must have been a visionary
       – Not really…
     1 Friend for 2 ½ Years
     Daughter joined in 2010 …
     • Back in 2007, Facebook did not care!
     • How Big Data has changed FB …
  5. Big Data at Work (cont’d)
     What happens today…
       – Fill out your profile and that data is used to hook you into the “social graph”
       – FB wants you to make 10 friend requests within the first 5 minutes of signing up
       – Within 15 minutes you understand intrinsically how the service works
       – Social graph (a huge dataset) is used to engage the individual user
  6. What is Big Data?
     • Volume (amount of data)
       – Exceeds the processing capacity of conventional database systems
     • Velocity (speed of data in/out)
       – Needs a high-performance data pipeline and massively parallel processing
     • Variety (range of data types and sources)
       – Fluid data requirements where up-front schema design is a poor fit
     [Figure: Big Data Characteristics: Volume, Variety, Velocity]
  7. Big Data: Why now?
     • The simplified version of Moore’s Law states that processor speeds, or overall processing power for computers, will double every two years.
       – Intel co-founder Gordon Moore
     • Over the last 30 years, space per unit cost has doubled roughly every 14 months (increasing by an order of magnitude every 48 months).*
       – In early 2010, terabyte drives were available at a cost of roughly $100 per TB, or about $0.10 per GB
     * http://www.mkomo.com/cost-per-gigabyte
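The storage claim on slide 7 can be checked with a few lines of arithmetic. This sketch assumes a constant 14-month doubling period, as the slide states; the variable names are illustrative.

```python
import math

DOUBLING_MONTHS = 14  # from the slide: space per unit cost doubles every ~14 months

# Months needed for a tenfold gain: 14 * log2(10)
months_per_10x = DOUBLING_MONTHS * math.log2(10)

# Total growth factor over 30 years (360 months) at that rate
growth_30_years = 2 ** (360 / DOUBLING_MONTHS)

print(f"10x every {months_per_10x:.1f} months")
print(f"30-year growth factor: {growth_30_years:.2e}")
```

The tenfold period comes out at about 46.5 months, which is consistent with the slide's rounded figure of 48 months.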
  8. Big Data Technologies
     What is Hadoop?
     1. Hadoop* is an open-source platform for processing large amounts of data in a scalable, fault-tolerant, affordable fashion
     2. Hadoop specifies a distributed file system called HDFS
     3. Hadoop supports a processing methodology known as MapReduce
     4. Many tools are built on top of Hadoop, such as HBase, Hive, and Flume
     * http://en.wikipedia.org/wiki/Apache_Hadoop
     The four Hadoop definition points are from a presentation by Jeremy Pollack, Ancestry.com, 3/2/2013
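For readers new to MapReduce (point 3 on slide 8), the following is a single-process Python sketch of the idea, using the classic word-count example. It does not use Hadoop itself; the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative, and in a real cluster the framework performs the grouping step across many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

lines = ["big data at Ancestry", "big data at work"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

Because each map call sees only one line and each reduce call sees only one key, both phases can be spread over many workers without coordination, which is where the scalability comes from.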
  9. What is Ancestry’s Big Data?
     • Family Trees: 44 million+
     • Profiles on the Family Trees: 4 billion+
     • Records Attached to Family Trees: 2 billion+
     • Photographs, Scanned Documents and Written Stories: 185 million+
     • Total # of Records: 11 billion+
     • Total # of Titles: 30,000
     • Total # of DNA Samples: 100,000+
     • 4 Petabytes of structured and unstructured data
  10. DNA Matching
      • Framing the Problem
        – 46 chromosomes, 23 pairs, 3 billion+ base pairs, AGTC
        – 99.9% of human DNA is shared
        – Academic programs work at small scale (GERMLINE)
        – New samples are matched against all previous samples
      • Use Big Data technologies to create a scalable matching platform
        – Hadoop and MapReduce
        – HBase
      http://www1.cs.columbia.edu/~gusev/germline/
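Matching every new sample against all previous samples means the number of comparisons grows quadratically with the size of the database. A small sketch of that arithmetic, using the 100,000-sample figure quoted elsewhere in the deck (the function name `total_pairs` is illustrative):

```python
def total_pairs(n: int) -> int:
    """Number of distinct sample pairs among n samples: n * (n - 1) / 2."""
    return n * (n - 1) // 2

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} samples -> {total_pairs(n):>13,} pairs")
```

A tenfold increase in samples gives roughly a hundredfold increase in pairs, which is why a tool designed for small academic datasets needs a different architecture at this scale.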
  11. DNA Matching
      • Raw Data
        3 123456789_RZZZZ2_XXXXXXH3Q7U7Q2B_YYYY84598-DNA 0 0 0 -9 C C G G G G G G A A A A C C G G A A A A C C G G G G A A G G G A A A G G A G A A C C A A A A G G A A A G G G G G C C G G A A G G G G G G G A A A A C G A A A A G A G A A A A G G G G G G A G G G G G G G … (continues for 700,000+ SNPs)
      • Map File
        0 rs10005853 0 0
        0 rs10015934 0 0
        0 rs1004236 0 0
        0 rs10059646 0 0
        0 rs10085382 0 0
        0 rs10123921 0 0
        0 rs10127827 0 0
        0 rs10155688 0 0
        0 rs10162780 0 0
        0 rs1017484 0 0
        0 rs10188129 0 0
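The raw data and map file on slide 11 look like PLINK-style PED and MAP text files: six header fields followed by allele pairs, and one map row per SNP with the rs identifier in the second column. The slide does not name the format, so treat that reading as an assumption. Under it, a minimal parsing sketch (with hypothetical function names and a made-up short sample line) could look like this:

```python
def parse_sample(line: str):
    """Split one raw-data line into a sample id and genotype pairs.

    Assumes six leading header fields (PED-style), then alleles.
    """
    fields = line.split()
    header, alleles = fields[:6], fields[6:]
    if len(alleles) % 2:
        raise ValueError("expected an even number of alleles")
    # Consecutive alleles form one genotype, e.g. "C C" -> ("C", "C")
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return header[1], genotypes

def parse_map(lines):
    """Return the SNP identifier (second column) from each map-file row."""
    return [row.split()[1] for row in lines if row.strip()]

# Shortened, made-up sample line; the rs ids are the first three from the slide.
sample_id, genotypes = parse_sample("3 SAMPLE-1 0 0 0 -9 C C G G A G")
snp_ids = parse_map(["0 rs10005853 0 0", "0 rs10015934 0 0", "0 rs1004236 0 0"])
by_snp = dict(zip(snp_ids, genotypes))
print(sample_id, by_snp)
```

Pairing the two files by position is what ties each genotype to a named SNP: the n-th allele pair in the sample line belongs to the n-th row of the map file.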
  12. DNA Matching
      • Walk through an example of how the open source algorithm GERMLINE works
        – http://www1.cs.columbia.edu/~gusev/germline/
      • Highly simplified
        – Simple, small example
        – Uses names from Battlestar Galactica
      • Two key steps when comparing two samples
        – Initial “word match”
        – Second “fuzzy logic” step
      • Worst case
        – Identical twins or two samples from the same user
      [Figure: sample names Kara Thrace and Admiral Adama]
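The two steps named on slide 12 can be illustrated with a toy Python sketch. This is not the GERMLINE implementation: the word size, the mismatch threshold, the marker strings, and the function names are all assumptions chosen to keep the example small. A run of words is seeded by at least one exactly equal word (the "word match") and extends across neighbouring words that differ in only a few positions (the "fuzzy" step).

```python
WORD_SIZE = 4  # hypothetical; chosen small for illustration

def to_words(markers, size=WORD_SIZE):
    """Cut a marker string into fixed-size, non-overlapping words."""
    return [markers[i:i + size] for i in range(0, len(markers), size)]

def match_segments(a, b, max_mismatches=1):
    """Return inclusive (start_word, end_word) runs shared by a and b.

    Assumes both marker strings have the same length.
    """
    # Number of differing positions in each aligned pair of words
    diffs = [sum(x != y for x, y in zip(wa, wb))
             for wa, wb in zip(to_words(a), to_words(b))]
    segments, start, seeded = [], None, False
    # The extra sentinel value closes a run that reaches the last word.
    for i, d in enumerate(diffs + [max_mismatches + 1]):
        if d <= max_mismatches:
            if start is None:
                start = i
            seeded = seeded or d == 0   # at least one exact word match
        else:
            if start is not None and seeded:
                segments.append((start, i - 1))
            start, seeded = None, False
    return segments

kara = "ACGTACGTACGTACGT"    # made-up marker strings
adama = "ACGAACGTACGTTTTT"
print(match_segments(kara, adama))
```

With identical inputs, every word matches and the result is one segment covering the whole string, which mirrors the worst case the slide mentions (identical twins or the same user tested twice).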
  13. DNA Matching with Big Data
      • Map phase “adds words” to the HBase table for each new sample and saves data to a “fuzzy match” table
      • Reduce phase uses those tables to create segment matches
      • Matched Segments are analyzed statistically to determine relationship distance (M0 to M11)
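The benefit of storing words in a table is that a new sample only has to be looked up, not compared against every earlier sample from scratch. The sketch below illustrates that idea with an in-memory dictionary standing in for the HBase table. The table layout, word size, and function names are assumptions for illustration, and the "fuzzy match" table from the slide is left out.

```python
from collections import defaultdict

WORD_SIZE = 4  # hypothetical word length

# Stand-in for the HBase words table: (word position, word) -> sample ids
word_table = defaultdict(set)

def map_add_sample(sample_id, markers):
    """Map step: record each word of a new sample and return, for every
    earlier sample, the word positions it shares with the new one."""
    shared = defaultdict(list)
    for pos in range(0, len(markers), WORD_SIZE):
        key = (pos // WORD_SIZE, markers[pos:pos + WORD_SIZE])
        for other in word_table[key]:
            shared[other].append(key[0])
        word_table[key].add(sample_id)
    return shared

def reduce_segments(positions):
    """Reduce step: merge consecutive word positions into segments."""
    segments = []
    for p in sorted(positions):
        if segments and p == segments[-1][1] + 1:
            segments[-1] = (segments[-1][0], p)
        else:
            segments.append((p, p))
    return segments

map_add_sample("kara", "ACGTACGTACGTACGT")            # made-up markers
shared = map_add_sample("adama", "ACGTTTTTACGTACGT")
matches = {other: reduce_segments(pos) for other, pos in shared.items()}
print(matches)
```

The work for each new sample depends on the number of words it contains and the samples that share them, rather than on a full rescan of everyone already in the database.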
  14. DNA Matching – Big Data Results
      [Chart: Projected Process vs. Big Data Process, with the annotation “Introduced Big Data Hadoop Matching Process”]
  15. Machine Learning – Automated Content Pipeline
      • OCR to extract the text from an image
      • Use machine learning to categorize the words and phrases to extract names, dates, places, relationships, and other information
      • Natural language processing using supervised machine learning
        – Use a set of data that is annotated for learning
        – Validation set to test
        – Run against the extracted text from a collection
      • Result is a fully automated content delivery process
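The supervised workflow on slide 15 (learn from annotated data, check against a validation set, then run on new text) can be shown with a deliberately tiny Python sketch. The features, labels, and example tokens are all invented for illustration; a production pipeline would use far richer features and models than this most-common-label rule.

```python
from collections import Counter, defaultdict

def features(token):
    """Describe a token by coarse shape rather than by its exact spelling."""
    if token.isdigit() and len(token) == 4:
        return "four_digits"
    if token.istitle():
        return "capitalized"
    return "other"

def train(annotated):
    """Learn the most common label for each feature from annotated tokens."""
    counts = defaultdict(Counter)
    for token, label in annotated:
        counts[features(token)][label] += 1
    return {feat: c.most_common(1)[0][0] for feat, c in counts.items()}

def predict(model, token):
    """Label a token; fall back to OTHER for unseen feature values."""
    return model.get(features(token), "OTHER")

# Hypothetical annotated data, split into training and validation sets
training = [("John", "NAME"), ("Mary", "NAME"), ("1880", "DATE"),
            ("born", "OTHER"), ("in", "OTHER"), ("1902", "DATE")]
validation = [("Sarah", "NAME"), ("1915", "DATE"), ("married", "OTHER")]

model = train(training)
correct = sum(predict(model, tok) == label for tok, label in validation)
print(f"validation accuracy: {correct}/{len(validation)}")
```

Keeping the validation tokens out of training is the point of the second step: it measures how the model behaves on text it has not seen. This toy rule would, for instance, label a capitalized place name as a NAME, and a validation set is how such errors are caught.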
  16. Data Mining: Tree Sizes
      [Chart: x-axis: tree size, y-axis: number of trees of a specific size]
      More small trees than large trees, but we also have extremely large trees in the system (> 500,000)
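The chart on slide 16 is a histogram: for each tree size, the number of trees of that size. A minimal sketch of how such counts could be computed, using made-up sizes (not Ancestry data) and assuming "tree size" means the number of profiles in a tree:

```python
from collections import Counter

# Hypothetical tree sizes (profiles per tree); not Ancestry data
tree_sizes = [1, 1, 1, 2, 2, 3, 5, 5, 40, 500_000]

# tree size -> number of trees of that size
histogram = Counter(tree_sizes)

for size, n_trees in sorted(histogram.items()):
    print(f"size {size:>7,}: {n_trees} tree(s)")
```

With a distribution this skewed, a logarithmic axis is usually needed to see the many small trees and the few very large ones on the same plot.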
  17. Data Mining: What do trees look like?
  18. Data Mining: What do trees look like?
  19. Final Thoughts/Questions
      • Just three examples of Big Data processing and technologies in use at Ancestry
        – Hinting
        – Search
        – Trees
        – More…
      • Big Data technologies to watch
        – Hadoop and HDFS move from batch to real-time processing
        – Impala and Apache Drill
      • Questions?
