Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Dark Data 
1
It’s all about me… 
Prof. Mark Whitehorn 
Emeritus Professor of Analytics 
School of Computing 
University of Dundee 
Cons...
It’s all about me… 
School of Computing 
Teach Masters in: 
Data Science 
Part time - aimed at existing data professionals...
Dark Data 
First mention: 
“Shedding Light on the Dark Data in the Long Tail of Science”, P. Bryan Heidorn, LIBRARY TRENDS...
Dark Data 
First usage (modern sense) of which I am aware was in Alyth, Lands of Loyal, 2013. 
BIDS (Business Intelligence...
Dark Data 
Companies currently collect vast swathes of data; some is analysed, much isn’t. The stuff that isn’t analysed i...
Dark Data 
I think we have three identifiable flavours: 
Dark Data 1.0 – The data that is in the transactional systems but...
Dark Data of the first kind 
Is all around us and I believe it is very important. We are going to be moving a great deal o...
The Andria Doria 
Dark Data of the first kind
The Andria Doria 
Collision between the Andrea Doria and the Stockholm. 
Worst maritime disaster to occur in US waters 
25...
Dark Data of the first kind 
The Andria Doria 
Officers on board the Andrea Doria did not use the plotting equipment that ...
Dark Data of the first kind 
The Andria Doria 
Both had radar running and yet they collided.
Dark Data of the first kind 
The Andria Doria Effect 
Bats flying into nets in caves 
Griffin, Donald R. * 
*Listening in ...
Dark Data of the first kind 
Bats 
“For example, Griffin (1958) describes bats, returning to the roost after an evening's ...
Dark Data of the first kind 
You are unlikely to collect data or information from or about bats and/or ocean-going liners....
Dark Data of the second kind 
DD 2.0 – Data that is published but which the publisher does not expect to be analysed 
Data...
Dark Data of the second kind 
Aug Sept Oct Nov Dec Jan Feb Mar Apr May Jun July Aug Sep Oct Nov Dec Jan Feb Mar Apr 
Amazo...
I also had the sales figures (number of books sold) from my publisher. I could correlate these with the Amazon position fi...
I also had the sales figures (number of books sold) from my publisher. I could correlate these with the Amazon position fi...
Dark Data of the second kind 
The estimation of the maximum of a discrete uniform distribution by sampling (without replac...
Dark Data of the second kind 
The estimation of the maximum of a discrete uniform distribution by sampling (without replac...
Dark Data of the second kind 
The German Tank problem 
The allies were very interested in how many tanks the Germans were ...
Dark Data of the second kind 
The German Tank problem 
The allies started taking the serial numbers from the captured/disa...
The German Tank problem 
As an example, in June 1941, intelligence suggested tank production was 1,550. 
The estimate from...
An Empirical Approach to Economic Intelligence in World War II Author(s): Richard Ruggles and Henry Brodie Source: Journal...
So how is it done? 
Suppose (simplistically) that the actual numbers run from 1 to n. 
If we see a number like 24 then at ...
So how is it done? 
If we capture 5 tanks (we’ll call this value C) 432, 435, 1542 2187, 3245, then the guess has to be hi...
So how is it done? 
C=5, B=3,245, s=432, n=number of tanks produced 
1 s 1K 2K 3K B 4K 
1..…432……………………………………………………………...3...
So how is it done? 
C=5, B=3,245, s=432, n=number of tanks produced 
Well, we hypothesised that (n-B) is roughly equal to ...
You can also use: 
n = (1+1/C)B 
n = (1 +1/5)3,245 = 3,894* 
* Roger Johnson of Carleton College in Northfield, Minnesota ...
Dark Data of the second kind
There are various ways you can do the sums. 
I certainly wouldn’t want to argue for any particular one. 
I certainly would...
Other uses 
The serial number data was also used to deduce the: 
•number of tank factories and their relative production 
...
Of course, if the Germans had been aware of this they could have: 
•Coded the serial numbers, perhaps using a random surro...
Local paper – certain pages were printed out of register enabling us to estimate the number of printing presses in use. 
T...
Again, don’t think of this technique as being limited to tanks. 
What data is your company leaking? 
What can you do about...
Let’s see. 
7517 – 6108 = 
1409 in 130 minutes 
= about 10 per minute.
70836 - 67517 = 
3,319 in 1,105 minutes 
= about 3 per minute; implying the rate drops overnight.
Quick test. 
70845 followed, one minute later, by 70836. 9 in one minute and it looks like they are sequential (but much m...
The German Bomb problem 
DD 3.0 – Data that vanishes and the disappearance is highly significant 
My mother worked on expl...
The German Bomb problem 
If the allies had started analysing in the middle of the war, they would have missed the missing ...
Take home message 
Collect data and use it imaginatively. 
Dark Data
Simpson’s Paradox 
Why is Homer Simpson yellow?
Simpson’s paradox can be seen when percentages and/or actions are combined without thought. 
Two schools and two teachers ...
Brian beats Sally in both schools. 
Simpson’s Paradox 
Brian 
Sally 
School 1 
90% 
88% 
School 2 
75% 
60%
But when we do the sums for all children, Sally beats Brian. 
Simpson’s Paradox 
Brian 
Sally 
School 1 
90% 
88% 
School ...
The paradox arises because the teachers teach very different ability groups in the two schools and the schools themselves ...
Brian may have achieve 90% in school 1, but this is based on a very small sample – he taught far more students in school 2...
Both small and large samples are producing the same kind of value (a percentage). Combining these means that samples with ...
We have one set of data 
1 
140 
9 
100 
10 
190 
12 
150 
14 
140 
15 
170 
15 
150 
17 
150 
17 
200 
20 
250 
20 
270 
...
And another set for the same two axes 
y = 0.1673x + 90.477 R² = 0.3125 
0 
20 
40 
60 
80 
100 
120 
140 
160 
180 
200 
...
But when we combining the sets 
1 
140 
9 
100 
10 
190 
12 
150 
14 
140 
15 
170 
15 
150 
17 
150 
17 
200 
20 
250 
20...
How likely is Simpson’s paradox? 
Marios G. Pavlides and Michael D. Perlman (August 2009). 
The American Statistician 63 (...
Simpson’s Paradox 
Simpson’s Paradox is usually attributed to Simpson in 1951. However Pavlides and Perlman quote various ...
Simpson’s Paradox 
This is a wonderful example of Stigler's law of eponymy. 
Stephen Stigler in his 1980 publication "Stig...
Dark Data by Prof. Mark Whitehorn
Upcoming SlideShare
Loading in …5
×

Dark Data by Prof. Mark Whitehorn

2,299 views

Published on

"Dark Data" by Prof. Mark Whitehorn. A presentation given at DunDDD 2014 at the University of Dundee.

Published in: Data & Analytics
  • Be the first to comment

Dark Data by Prof. Mark Whitehorn

  1. 1. Dark Data 1
  2. 2. It’s all about me… Prof. Mark Whitehorn Emeritus Professor of Analytics School of Computing University of Dundee Consultant Writer (author) m.a.f.whitehorn@dundee.ac.uk 2
  3. 3. It’s all about me… School of Computing Teach Masters in: Data Science Part time - aimed at existing data professionals Data Engineering 3
  4. 4. Dark Data First mention: “Shedding Light on the Dark Data in the Long Tail of Science”, P. Bryan Heidorn, LIBRARY TRENDS, Vol. 57, No. 2, Fall 2008 (“Institutional Repositories: Current State and Future,” edited by Sarah L. Shreeves and Melissa H. Cragin), pp. 280– 299 Heidorn writes that: ‘“Dark data” is not carefully indexed and stored so it becomes nearly invisible to scientists and other potential users and therefore is more likely to remain underutilized and eventually lost.’ However this is not synonymous with the modern usage. Heidorn was talking about scientific data that is not published.
  5. 5. Dark Data First usage (modern sense) of which I am aware was in Alyth, Lands of Loyal, 2013. BIDS (Business Intelligence & Data Science) conference.
  6. 6. Dark Data Companies currently collect vast swathes of data; some is analysed, much isn’t. The stuff that isn’t analysed is Dark. Clearly the origin of the term is ‘Dark Matter’. Donald Rumsfelt told us about: Known knowns Known unknowns Unknown unknowns
  7. 7. Dark Data I think we have three identifiable flavours: Dark Data 1.0 – The data that is in the transactional systems but never makes it to the analytical - visible but unseen DD 2.0 – Data that is published but which the publisher does not expect to be analysed DD 3.0 – Data that vanishes and the disappearance is highly significant
  8. 8. Dark Data of the first kind Is all around us and I believe it is very important. We are going to be moving a great deal of this data into analytical systems.
  9. 9. The Andria Doria Dark Data of the first kind
  10. 10. The Andria Doria Collision between the Andrea Doria and the Stockholm. Worst maritime disaster to occur in US waters 25 July 1956. Happily not too great a loss of life. Collision near Nantucket, Mass. Dark Data of the first kind
  11. 11. Dark Data of the first kind The Andria Doria Officers on board the Andrea Doria did not use the plotting equipment that was available to them to calculate the speed and position of the Stockholm. The navigation officer on board the Stockholm thought his radar was set to 15 miles when in fact it was set to 5 miles.
  12. 12. Dark Data of the first kind The Andria Doria Both had radar running and yet they collided.
  13. 13. Dark Data of the first kind The Andria Doria Effect Bats flying into nets in caves Griffin, Donald R. * *Listening in the dark: the acoustic orientation of bats and men. New Haven, Yale University Press, 1958. 415 p.
  14. 14. Dark Data of the first kind Bats “For example, Griffin (1958) describes bats, returning to the roost after an evening's hunt, crashing into a newly erected cave barricade even though it reflected strong echoes. Griffin coins this mishap the “Andrea Doria Effect,” because the bats seemed to ignore important information from echo returns to guide their behavior and instead favored spatial memory.” http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2927269/
  15. 15. Dark Data of the first kind You are unlikely to collect data or information from or about bats and/or ocean-going liners. But your organisation is likely to collect data that is not being analysed (in the sense that the information it contains is not being extracted). •Patient monitors in intensive care •Robots in car production - Volvo
  16. 16. Dark Data of the second kind DD 2.0 – Data that is published but which the publisher does not expect to be analysed Data leakage Amazon publishes the position of a book relative to other books. “One” is good (think J. K. Rowlings). I collected the position of my books for several years.
  17. 17. Dark Data of the second kind Aug Sept Oct Nov Dec Jan Feb Mar Apr May Jun July Aug Sep Oct Nov Dec Jan Feb Mar Apr Amazon positional data (IRDB) for Aug 98 – Mar 00
  18. 18. I also had the sales figures (number of books sold) from my publisher. I could correlate these with the Amazon position figures. I also tracked the Amazon position of ‘competing’ books. So….. Dark Data of the second kind
  19. 19. I also had the sales figures (number of books sold) from my publisher. I could correlate these with the Amazon position figures. I also tracked the Amazon position of ‘competing’ books. So I could estimate the sales figures of my competitors. Dark Data of the second kind
  20. 20. Dark Data of the second kind The estimation of the maximum of a discrete uniform distribution by sampling (without replacement).
  21. 21. Dark Data of the second kind The estimation of the maximum of a discrete uniform distribution by sampling (without replacement). A superb example of this is: The German Tank problem
  22. 22. Dark Data of the second kind The German Tank problem The allies were very interested in how many tanks the Germans were producing during the second World War. Intelligence suggested the production was about 1,000 - 1,500 tanks per month.
  23. 23. Dark Data of the second kind The German Tank problem The allies started taking the serial numbers from the captured/disabled German tanks. It turned out that the gearbox numbers were the most useful. (However the chassis numbers and wheel moulds were also used.) These numbers were also used to estimate production.
  24. 24. The German Tank problem As an example, in June 1941, intelligence suggested tank production was 1,550. The estimate from serial numbers was 244. The actual value (verified after the war) was 271. In August 1942 the numbers were 1,550, 327, 342. It worked. Dark Data of the second kind
  25. 25. An Empirical Approach to Economic Intelligence in World War II Author(s): Richard Ruggles and Henry Brodie Source: Journal of the American Statistical Association, Vol. 42, No. 237 (Mar., 1947), pp. 72- 91 Published by: American Statistical Association Stable URL: http://www.jstor.org/stable/2280189
  26. 26. So how is it done? Suppose (simplistically) that the actual numbers run from 1 to n. If we see a number like 24 then at least 24 tanks have been produced. But we can be smarter. If we see numbers like: 6, 7, 9, 11, 12, 23, 16, 17, 24 then an intelligent guess would be around 30-40. Dark Data of the second kind
  27. 27. So how is it done? If we capture 5 tanks (we’ll call this value C) 432, 435, 1542 2187, 3245, then the guess has to be higher than 3,245 – perhaps 4,000. But we can apply logic and a formula. The biggest number (B) is 3,245, and the smallest (s) is 432. We could reasonably expect as many ‘hidden’ numbers above 3,245 as there are below 432. Dark Data of the second kind
  28. 28. So how is it done? C=5, B=3,245, s=432, n=number of tanks produced 1 s 1K 2K 3K B 4K 1..…432……………………………………………………………...3245.….n So n is about 3,245 + 432 = 3,677 Dark Data of the second kind
  29. 29. So how is it done? C=5, B=3,245, s=432, n=number of tanks produced Well, we hypothesised that (n-B) is roughly equal to (s-1), so if n-B = s-1 then: n=B+(s-1) n= 3,245 + (432-1) = 3,676 Dark Data of the second kind
  30. 30. You can also use: n = (1+1/C)B n = (1 +1/5)3,245 = 3,894* * Roger Johnson of Carleton College in Northfield, Minnesota Teaching Statistics Vol. 16, number 2 Summer 1994 http://www.mcs.sdsmt.edu/rwjohnso/html/tank.pdf Dark Data of the second kind
  31. 31. Dark Data of the second kind
  32. 32. There are various ways you can do the sums. I certainly wouldn’t want to argue for any particular one. I certainly would argue that this may well be worth doing. Dark Data of the second kind
  33. 33. Other uses The serial number data was also used to deduce the: •number of tank factories and their relative production •logistics of tank deployment •changes in production due to, for example, bombing Dark Data of the second kind
  34. 34. Of course, if the Germans had been aware of this they could have: •Coded the serial numbers, perhaps using a random surrogate key (boring) •Created misinformation (a much more interesting option) Dark Data of the second kind
  35. 35. Local paper – certain pages were printed out of register enabling us to estimate the number of printing presses in use. This is an example of “It’s the exception that proves the rule.” Dark Data of the second kind
  36. 36. Again, don’t think of this technique as being limited to tanks. What data is your company leaking? What can you do about it? Which of your competitors is leaking? (satellite photographs of oppositions parking lot) (classis car advertisements) Dark Data of the second kind
  37. 37. Let’s see. 7517 – 6108 = 1409 in 130 minutes = about 10 per minute.
  38. 38. 70836 - 67517 = 3,319 in 1,105 minutes = about 3 per minute; implying the rate drops overnight.
  39. 39. Quick test. 70845 followed, one minute later, by 70836. 9 in one minute and it looks like they are sequential (but much more testing required).
  40. 40. The German Bomb problem DD 3.0 – Data that vanishes and the disappearance is highly significant My mother worked on explosives and casings from German bombs and shells during the war. At the start of the war the Germans used very good quality alloys. The allies tracked the elements in the alloy. As the war progressed, certain elements disappeared completely from the mix. Dark Data of the third kind
  41. 41. The German Bomb problem If the allies had started analysing in the middle of the war, they would have missed the missing data. (Would that have made it an unknown, unknown, unknown?) As it was, they could see what elements were becoming scarce in Germany and plan accordingly. Dark Data of the third kind
  42. 42. Take home message Collect data and use it imaginatively. Dark Data
  43. 43. Simpson’s Paradox Why is Homer Simpson yellow?
  44. 44. Simpson’s paradox can be seen when percentages and/or actions are combined without thought. Two schools and two teachers – School 1 – Sally School 2 – Brian But they occasionally work in each other’s schools. Scored on the percentage exam pass of the students. Simpson’s Paradox
  45. 45. Brian beats Sally in both schools. Simpson’s Paradox Brian Sally School 1 90% 88% School 2 75% 60%
  46. 46. But when we do the sums for all children, Sally beats Brian. Simpson’s Paradox Brian Sally School 1 90% 88% School 2 75% 60% Overall 75% 87%
  47. 47. The paradox arises because the teachers teach very different ability groups in the two schools and the schools themselves differ in intake. Simpson’s Paradox Brian Sally Pass Total Percentage Pass Total Percentage School 1 No. of children 9 10 90% 700 800 88% School 2 No. of children 600 800 75% 6 10 60% Combined No. of children 609 810 75% 706 804 75%
  48. 48. Brian may have achieve 90% in school 1, but this is based on a very small sample – he taught far more students in school 2. And Sally taught very few children in school 2 and far more in school 1. Simpson’s Paradox Brian Sally Pass Total Percentage Pass Total Percentage School 1 No. of children 9 10 90% 700 800 88% School 2 No. of children 600 800 75% 6 10 60% Combined No. of children 609 810 75% 706 804 75%
  49. 49. Both small and large samples are producing the same kind of value (a percentage). Combining these means that samples with the high numbers of students ‘overwhelm’ the samples with low numbers. Simpson’s Paradox Brian Sally Pass Total Percentage Pass Total Percentage School 1 No. of children 9 10 90% 700 800 88% School 2 No. of children 600 800 75% 6 10 60% Combined No. of children 609 810 75% 706 804 75%
  50. 50. We have one set of data 1 140 9 100 10 190 12 150 14 140 15 170 15 150 17 150 17 200 20 250 20 270 22 270 23 450 23 270 34 430 34 370 45 490 56 480 60 330 y = 6.4397x + 111.66 R² = 0.6344 0 100 200 300 400 500 600 0 10 20 30 40 50 60 70
  51. 51. And another set for the same two axes y = 0.1673x + 90.477 R² = 0.3125 0 20 40 60 80 100 120 140 160 180 200 0 100 200 300 400 500 600 98 95 125 180 134 70 148 85 152 75 200 125 200 135 220 135 230 120 230 135 340 180 340 185 450 180 560 140
  52. 52. But when we combining the sets 1 140 9 100 10 190 12 150 14 140 15 170 15 150 17 150 17 200 20 250 20 270 22 270 23 450 23 270 34 430 34 370 45 490 56 480 60 330 98 95 125 180 134 70 148 85 152 75 200 125 200 135 220 135 230 120 230 135 340 180 340 185 450 180 560 140 y = -0.2662x + 238.52 R² = 0.0989 0 100 200 300 400 500 600 0 100 200 300 400 500 600
  53. 53. How likely is Simpson’s paradox? Marios G. Pavlides and Michael D. Perlman (August 2009). The American Statistician 63 (3): 226–233. www.stat.washington.edu/research/reports/2009/tr558.pdf Simpson’s Paradox
  54. 54. Simpson’s Paradox Simpson’s Paradox is usually attributed to Simpson in 1951. However Pavlides and Perlman quote various other sources which suggest that the underlying idea goes back at least as far 1899.
  55. 55. Simpson’s Paradox This is a wonderful example of Stigler's law of eponymy. Stephen Stigler in his 1980 publication "Stigler’s law of eponymy". Essentially this postulates that scientific discoveries are not usually named after the original discoverer. Stigler law was originally postulated by the sociologist Robert K. Merton; thus ensuring the Stigler’s law obeys Stigler’s law. Gieryn, T. F., ed. (1980). Science and social structure: a festschrift for Robert K. Merton. New York: NY Academy of Sciences. pp. 147–57. ISBN 0-89766-043-9. , republished in Stigler's collection "Statistics on the Table"

×