Data randomness, variation, coincidences, populations and
estimation and the use and abuse of statistics
September 9/11 Coincidences
911 is the emergency number
The twin towers looked like the number 11 so perhaps all 9/11 things relate to 11
9 + 1 + 1 = 11, the first flight to hit the twin towers was flight 11
Flight 11 had 92 people on board, 9 + 2 = 11
September 11 is the 254th day of the year (2 + 5 + 4 = 11, and 365 – 254 = 111)
11 letters each in “New York City”, “Afghanistan”, “the Pentagon”, and “George W. Bush”
New York was the 11th state admitted to the union
119 (1 + 1 + 9 = 11) used to be the area code to both Iraq and Iran
Flight 77 that crashed in Pennsylvania had 65 people on board, 6 + 5 = 11
March 11 (2004) attack in Spain. There are exactly 911 days between this and the September 11 (2001) attack.
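The calendar arithmetic behind some of these claims can be checked directly. The sketch below is illustrative only, and the variable names are assumptions. It shows that the "911 days" figure holds only if both attack dates are excluded from the count, which is typical of how such coincidences depend on the counting rule chosen.

```python
from datetime import date

def digit_sum(n: int) -> int:
    """Sum the decimal digits of a non-negative integer."""
    return sum(int(ch) for ch in str(n))

sept_11 = date(2001, 9, 11)
madrid = date(2004, 3, 11)

day_of_year = sept_11.timetuple().tm_yday   # position of the date within 2001
days_remaining = 365 - day_of_year          # 2001 is not a leap year
gap = (madrid - sept_11).days               # plain difference between the dates
days_strictly_between = gap - 1             # excludes both attack dates

print("Day of year:", day_of_year, "digit sum:", digit_sum(day_of_year))
print("Days remaining in the year:", days_remaining)
print("Date difference:", gap, "days; strictly between:", days_strictly_between)
print("Digit sums of 92, 65, 119:", digit_sum(92), digit_sum(65), digit_sum(119))
```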
The Strange Coincidence of the Girl from Petrovka
Key Research Finding - Serendipity is essential
"I knew before coming to university that I wanted to do something different, I wanted to take advantage of all opportunities"
"I randomly received an email for the Young competition event and decided to enter it"
"An executive from the association addressed my class for volunteers"
to be lucky,
"I found the idea of my business while doing an assignment"
(Sood and Marchand 2012)
High Calibre Analytics Graduates
Data Scientist Job Roles
(LinkedIn 16 September 2012)
Notes: Word count shown next to each word
Exclusion words: ability area bay com experience francisco job linkedin preferred san
Adapted from and source: firstname.lastname@example.org
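A word cloud of this kind rests on a simple word count with an exclusion list. The following is a minimal sketch, assuming lower-cased alphabetic tokens. The sample advertisement text is invented for illustration, and only the exclusion words come from the notes above.

```python
import re
from collections import Counter

# Exclusion words as listed in the slide notes.
EXCLUSION_WORDS = {
    "ability", "area", "bay", "com", "experience",
    "francisco", "job", "linkedin", "preferred", "san",
}

def word_counts(text: str, exclude=EXCLUSION_WORDS) -> Counter:
    """Count lower-cased alphabetic words, skipping excluded ones."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in exclude)

# Hypothetical job-ad text, not taken from the original data set.
sample_ads = (
    "Data Scientist job in San Francisco Bay area. "
    "Experience with data mining and statistics preferred. "
    "Data analysis ability required."
)

counts = word_counts(sample_ads)
for word, count in counts.most_common(5):
    print(word, count)
```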
Statistics as Hypothesis Driven Process
Leading Questions: Yes Prime Minister
The risk with data mining is the discovery of meaningless
patterns and given enough data and time you can support almost
Sir Humphrey Appleby demonstrates use of leading questions to
skew an opinion survey to support or oppose National Service
Taken from the 1st Season of Yes Prime Minister - Episode 2, The
Yes Prime Minister is a British political satire/ comedy aired in the
ATLAS: The observed (full line) and expected (dashed line) 95%
CL combined upper limits on the SM Higgs boson production
cross section divided by the Standard Model expectation as a
function of mH in the full mass range considered in this analysis
(a) and in the low mass range (b). The dashed curves show the
median expected limit in the absence of a signal and the green
and yellow bands indicate the corresponding 68% and 95%
Statistics of the Higgs Boson
Particle physics has an accepted definition for a “discovery”: a five-sigma level of certainty
The number of standard deviations, or sigmas, is a measure of how unlikely it is that an experimental result
is simply down to chance rather than a real effect
Similarly, tossing a coin and getting a number of heads in a row may just be chance, rather than a sign of a
The “three sigma” level represents about the same likelihood of tossing more than eight heads in a row
Five sigma, on the other hand, would correspond to tossing more than 20 in a row
About 68% of the data lie within one standard deviation of the mean (~ 1 in 3 fall outside)
About 95.5% of the data will be inside two standard deviations (~ 1 in 22 fall outside)
About 99.7% lie within three standard deviations (~ 1 in 370 fall outside)
Four standard deviation events occur 1 in 15,787 times
Five standard deviation events occur 1 in every 1,744,278 times.
So a five sigma effect, which two experiments now have, means that such a thing would be observed by
chance with a probability of 1/1,744,278 = 5.7 x 10^-7.
This is so unlikely that this is the criterion for accepting an effect as real in particle physics, when it is
corroborated by another experiment as in this case.
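The "1 in N" figures above are two-sided tail probabilities of the normal distribution, and can be reproduced with the standard library. This is a sketch under that assumption. Particle physics often quotes the one-sided value instead, which for five sigma is roughly 1 in 3.5 million. The coin-toss lines are only an order-of-magnitude comparison with the one-sided tails.

```python
from math import erfc, sqrt

def two_sided_tail(k: float) -> float:
    """P(|Z| > k) for a standard normal variable Z."""
    return erfc(k / sqrt(2))

def one_sided_tail(k: float) -> float:
    """P(Z > k) for a standard normal variable Z."""
    return 0.5 * erfc(k / sqrt(2))

for k in range(1, 6):
    p = two_sided_tail(k)
    print(f"{k} sigma: {1 - p:.5%} inside, about 1 in {1 / p:,.0f} outside")

# "More than eight heads in a row" means at least 9; "more than 20" means at least 21.
for sigma, heads in ((3, 9), (5, 21)):
    print(f"{sigma} sigma one-sided: {one_sided_tail(sigma):.2e}, "
          f"{heads} heads in a row: {0.5 ** heads:.2e}")
```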
24104 Emerging Marketing Issues and Social Media
Assessment item 1: Project (Group)
Objective(s): This addresses Subject Learning Objective/s 1-4 Weighting: 30%
Due: The group report is due by start of lecture in Week 14.
Length: The final report must be long enough to document:
1. The acquisition of the social data and supporting process
2. Visualisation of the network data and key measures
3. Description of models built from social data
4. Conclusion highlighting any useful insights
Task: Groups of students (4-5) participate in a practical project to data mine social media data.
Completion of this task requires the group to provide a report documenting the experience in acquiring and
discovering the social data using visualisation, setting up the data mining environment, describing the
findings with regard to the models built from the data and concluding insights. The approach to mine the
data is in 2 stages:
1. Visualise a social network of data freely available to the group e.g. LinkedIn, Twitter, YouTube,
Identify and describe key network measures
2. Mine the data to build models from the social data
This project uses the sophisticated REVOLUTION R ENTERPRISE software as a platform for data mining.
The software is free for academic use. The Rattle (R Analytical Tool To Learn Easily) package provides a
graphical user interface specifically for data mining using R and overcomes the need to use heavy
The following resources help to bootstrap the project and amuse the group project members:
Kaggle – Kaggle.com/competitions
AnalyticBridge: a social network for analytics professionals - analyticbridge.com
The R Inferno “If you are using R and you think you're in hell, this is a map for you”
Furnas, Alexander (2012) Everything You Wanted to Know About Data Mining but Were Afraid to Ask, The
Atlantic, 3 April http://www.theatlantic.com/technology/archive/2012/04/everything-you-wanted-to-know-about-data-mining-but-were-afraid-to-ask/255388/
Train of Thought Analysis
A bottom-up approach
Perceptual process of discovery to uncover structure
Distinguish patterns, structure, relationships and anomalies
Reveals indirect links
Knowledge is colour coded
Marketing Analyst can spot irregularities
Not sure why but where does this lead
Harnesses the power of the human mind
How to Find a Killer using Visualisation
In the 1990s Ivan Milat killed 7 backpackers, making him Australia's most notorious serial killer
Everyone in Australia was a suspect
Enormous volumes of data from multiple sources
RTA Vehicle records
Gun Licensing records
Internal Police records
Police applied visualisation techniques (NetMap) to the data
Reduced the suspect list from 18 million to 230
Further analysis with the use of additional information reduced this to 32
NodeXL - Excel 2007/2010/2013 workbook template for viewing and analyzing network graphs
Import ego, Fan page and groups networks from Facebook using
Social Network Importer for NodeXL
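To make "key network measures" concrete, here is a minimal sketch that computes degree, degree centrality and density for a small undirected ego network. The edge list and node names are hypothetical. NodeXL reports such measures from within Excel, and this code is not its API.

```python
from collections import defaultdict

# Hypothetical ego network: "ego" knows three people, two of whom know each other.
edges = [("ego", "ana"), ("ego", "ben"), ("ego", "cai"), ("ana", "ben")]

# Build an undirected adjacency structure.
neighbours = defaultdict(set)
for a, b in edges:
    neighbours[a].add(b)
    neighbours[b].add(a)

n = len(neighbours)
degree = {node: len(adj) for node, adj in neighbours.items()}
# Degree centrality: share of the other n - 1 nodes a node is tied to.
degree_centrality = {node: d / (n - 1) for node, d in degree.items()}
# Density: ties present divided by ties possible in an undirected graph.
density = 2 * len(edges) / (n * (n - 1))

for node in sorted(degree, key=degree.get, reverse=True):
    print(f"{node}: degree={degree[node]}, centrality={degree_centrality[node]:.2f}")
print(f"density={density:.2f}")
```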
A&F, Beijing, Gucci, LVMH, New York, Old Navy,
Paris, Sydney, Tiffany, Tokyo, Tommy, Versace
Austria, California, Canada, China, Egypt, England,
Finland, France, Germany, Guernsey, Holland, India,
Indonesia, Ireland, Israel, Italy, Japan, Kuwait,
Malaysia, Nepal, Paraguay, Philippines, Phillipines,
Portugual, Saudi Arabia, Singapore, South Africa,
Spain, Sweden, Taiwan, Thailand, UK, USA
Ambivalent, Employee, Opposer, Reporter, Supporter
11. Committed Partnerships, 12. Compartmentalised
Friendship, 13. Childhood friendship, 14. Courtship, 15. Fling, 16.
Secret-Affair, 17. Enslavement, 2. Marriages of Convenience, 3.
Best Friendships, 4. Kinships, 5. Rebounds/Avoidance-Driven, 6.
Courtships, 7. Dependencies, 8. Enmities, 9. Love-Hate (Sweeney and
Model Comparison By Variables/Predictors
Elaboration of Trip to Paris Blog Story (Means-End & Heider)
Woodside, Sood & Miller (2008), When Consumers and Brands Talk, Psychology & Marketing
17. "I wanted Paige to get a feel
for shopping experiences that
she would not have at home (aka
the ubiquitous mall). "
16. "On our trip to Giverny, we met a young
woman from Brisbane, Australia who was
traveling on her own and we invited her to join
us. Three of us enjoyed delicious and
innovative soufflés, while Paige had the rack of
lamb. We shared two dessert soufflés, one
chocolate and the other cherry/almond. Yum"
14. "They had decide to come to Paris
to find the Harley Davidson store so
they could buy Harley Paris t-shirts."
was my cousin
15. "Michael Osman is an American artists
living in Paris."
"He supplements his income by being a
tour guide." "I found out about him on
"So I engaged Michael for two days."
5. “I am a Canadian
and get by in
6. "All I can say is WOW! We rented a 2
bedroom, 1 ½ bath apartment (two
showers), "Merlot" from ParisPerfect
http://www.parisperfect.com/ and boy was
it ever perfect! "
7. “We had a full view of the Eiffel from
our charming little terrace. ....We were
within walking distance to two metro
stops (Pont d'Alma or Ecole Militaire) "
13."The father stretched out his cupped
hands which held all of the pieces they were
able to recover, including the memory stick
and he very solemnly said, "El muerto...".
12. Unforgettable Memories
"This trip had so many memories, but here are a few choice
highlights........On our very first night, knowing that the Eiffel
Tower light show started at 10:00 p.m.... she [Paige] dropped
her camera…down 6 flights…we were stunned…Spanish
Family below standing below [with pieces of the camera]”
8. "We were walkable to many good
bistros, cafes and bakeries and only a
few blocks from the wonderful market
street Rue Cler."
18."We went on Fat
Tire's day trip to
Monet's gardens and
house in Giverny, about
an hour outside Paris."+
19....."I know Paige will
treasure the memory of
this girl's trip for many
years to come."
•L'Arc de Triomphe - 248 steps up and 248 steps
•Les Invalides, Napoleon's Tomb and the
•Train to Vernon, bike to Giverny with Fat Tire
9. "I bought a Paris Pratique pocket-sized book at a
Metro station. This handy guide has detailed maps
of each arrondisement, as well as the metro lines,
the bus lines, the RER and the SCNF (trains). I'll
never be without this again."
10."Six months before our trip, I gave
Paige a couple of good guide books on
Paris and suggested she let me know
what her interests were since after all,
this was to be her trip."
Linguistic Inquiry and Word Count (LIWC)
Text Analysis : The Psychological Power of Words
“I love Paris”
(I, me, my)
Overall cognitive words
Articles (a, an, the)
Big words (> 6 letters)
Pennebaker, J. W., Francis, M. E. and Booth, R. J. (2001). Linguistic Inquiry and Word Count (LIWC):
LIWC2001. Mahwah: Lawrence Erlbaum Associates.
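The idea behind this kind of text analysis can be sketched as the percentage of words falling into each category. The toy example below uses only the word lists shown on the slide: (I, me, my), the articles, and words longer than six letters. LIWC itself relies on much larger validated dictionaries, so the category names and the function here are illustrative assumptions rather than LIWC's interface. Cognitive words are omitted because no word list is given.

```python
import re

# Tiny illustrative dictionaries taken from the slide; real LIWC lists are far larger.
CATEGORIES = {
    "first_person_singular": {"i", "me", "my"},
    "articles": {"a", "an", "the"},
}

def liwc_style_profile(text: str) -> dict:
    """Return the percentage of words in each category, plus big words (> 6 letters)."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    profile = {
        name: 100 * sum(w in vocab for w in words) / total
        for name, vocab in CATEGORIES.items()
    }
    profile["big_words"] = 100 * sum(len(w) > 6 for w in words) / total
    return profile

profile = liwc_style_profile("I love Paris")
for category, percent in profile.items():
    print(f"{category}: {percent:.1f}%")
```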
Which Pattern is Random?
Ceiling of the Waitomo cave in New Zealand.
Which Pattern is Random?
Journal of the Institute of Actuaries 72 (1946) 481
Newcomb Discovery (1881)
• American mathematician/astronomer Simon Newcomb noticed that the
first few pages of a logarithmic table, corresponding to the lower
significant digits (typically those below 5), were comparatively dirtier than
the later pages corresponding to the higher significant digits (typically
those above 5)
• Newcomb attributed the greater wear to users looking up numbers that
started with digit 1 more often than numbers starting with, say, digit 5
• This implies that the probability of a user accessing any given page
at any given time was skewed in favour of the earlier pages corresponding
to the lower significant digits
• This is in direct contrast with the usual expectation
that the probability of randomly picking any digit
between one and nine should be equal to the single value of 1/9 or
Leading (first) digit
If the leading (first) digit is d, then the frequency of
occurrence (probability) of the leading digit is
P(d) = log10(1 + 1/d), for d = 1, 2, ..., 9
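Evaluating the formula for each digit shows how strongly the distribution favours low leading digits. A minimal sketch:

```python
from math import log10

# Expected frequency of each leading digit d under the law: log10(1 + 1/d).
benford = {d: log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"leading digit {d}: {p:.1%}")

# The nine probabilities telescope to log10(10) = 1.
print("total:", round(sum(benford.values()), 12))
```

A leading 1 is expected about 30% of the time and a leading 9 under 5% of the time, compared with 1/9 for each digit under a uniform assumption.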
Benford Stumbles Over Newcomb Finding
• In 1938, more than half a century after Newcomb's discovery, Frank
Benford was going through a large collection of numerical
data from disparate sources when he stumbled upon a similar
• Benford used a huge volume of data to empirically support his
finding including areas of rivers, street addresses of “American
men of Science” and numbers appearing in front-page
newspaper stories. He went on to publish his findings in a
number of papers including the 1938 “The Law of Anomalous
Numbers”. Thus the 'principle' came to be known as
Human choices are not random, so invented numbers are unlikely to follow Benford’s Law
Only works with natural numbers (those numbers that are not ordered in a particular
When people invent numbers, their digit patterns (which have been artificially added to a list
of true numbers) will cause the data set to appear unnatural
See Durtschi, Hillison and Pacini (2004), The Effective Use of Benford’s Law to Assist in Detecting Fraud in
Accounting Data.
Does not work with Lottery!
Formally proven in 1996
Corpus of over 650 papers available at
Tools: Benford's Law plug-in for Kirix Strata, the R package “BenfordTests”, or visualise in Tableau
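A conformity check of the kind these tools perform can be sketched as a chi-square comparison of observed and expected first-digit counts. The code below is a hand-rolled illustration, not the interface of the BenfordTests package or of any plug-in. The two data sets are assumptions chosen to contrast a sequence known to follow the law (powers of 2) with evenly used lottery-style numbers from 1 to 45.

```python
from collections import Counter
from math import log10

def first_digit(n: int) -> int:
    """Leading decimal digit of a non-zero integer."""
    return int(str(abs(n))[0])

def benford_chi_square(values) -> float:
    """Chi-square statistic of first-digit counts against Benford's Law."""
    values = [v for v in values if v != 0]
    observed = Counter(first_digit(v) for v in values)
    n = len(values)
    chi2 = 0.0
    for d in range(1, 10):
        expected = n * log10(1 + 1 / d)
        chi2 += (observed[d] - expected) ** 2 / expected
    return chi2

# 5% critical value of the chi-square distribution with 8 degrees of freedom.
CRITICAL_5_PERCENT = 15.507

powers_of_two = [2 ** k for k in range(1, 1001)]
lottery_style = list(range(1, 46)) * 100   # every number 1..45 used equally often

chi2_powers = benford_chi_square(powers_of_two)
chi2_lottery = benford_chi_square(lottery_style)

print(f"powers of 2:   chi2 = {chi2_powers:.2f}")
print(f"lottery-style: chi2 = {chi2_lottery:.2f}")
print(f"reject conformity above {CRITICAL_5_PERCENT}")
```

A statistic above the critical value signals that the digits do not conform, which is what is expected for lottery numbers and may be a prompt for closer inspection in accounting data.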
Smartphone, Google Glass or Apple Watch will
Know What you Want before you do
“…from 2014 your phone [glasses or watch] will
anticipate your needs, do the research, tell you
what you want to know – sometimes
before the question even occurs to you…”
Chapman, Jake (2013), The Wired World in 2014
Useful References Informing our Thinking
There is a potential 93% average predictability in user mobility, an exceptionally high
value rooted in the inherent regularity of human behavior. Yet it is not the 93%
predictability that we find the most surprising. Rather, it is the lack of variability in
predictability across the population.
Scellato et al. (2011), NextPlace: A Spatio-temporal Prediction Framework for
Pervasive Systems. Proceedings of the 9th International Conference on Pervasive
Daily and weekly routines => Few significant places every day => Regularity in human
activities => Regularity leads to predictability
De Domenico, M., Lima, A. and Musolesi, M. (2012) Interdependence and Predictability of Human
Mobility and Social Interactions. Proceedings of the Nokia Mobile Data Challenge
we have shown that it is possible to exploit the correlation between movement data and
social interactions in order to improve the accuracy of forecasting of the future geographic
position of a user. In particular, mobility correlation, measured by means of mutual
information, and the presence of social ties can be used to improve movement forecasting
by exploiting mobility data of friends. Moreover, this correlation can be used as indicator of
potential existence of physical or distant social interactions and vice versa.
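Mutual information, the measure of mobility correlation mentioned above, can be illustrated for two discrete location sequences. This is a minimal sketch using empirical frequencies. The location labels and sequences are hypothetical and far shorter than real mobility traces.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys) -> float:
    """Empirical mutual information (in bits) between two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((count_x[x] / n) * (count_y[y] / n)))
        for (x, y), c in count_xy.items()
    )

# Hypothetical hourly locations for a user and two acquaintances.
user = ["home", "work", "home", "work", "home", "work", "home", "work"]
close_friend = list(user)                        # always in the same place
stranger = ["home", "home", "work", "work"] * 2  # unrelated to the user's pattern

mi_friend = mutual_information(user, close_friend)
mi_stranger = mutual_information(user, stranger)
print(f"user vs close friend: {mi_friend:.2f} bits")
print(f"user vs stranger:     {mi_stranger:.2f} bits")
```

Higher mutual information means one person's location tells us more about the other's, which is the property that makes a friend's mobility data useful for forecasting.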
Sadilek, A. and Krumm, J. (2012) Far Out: Predicting Long-Term Human Mobility
Where are you going to be 285 days from now at 2pm …we show that it is possible to
predict location of a wide variety of hundreds of subjects even years into the future and
with high accuracy.
“Children never put off till
tomorrow what will keep
them from going to bed