Al-Mqbali, Leila, Big Data - Research Project

Big Data:
An Investigationof the Big Data Phenomenon
and its Implications for Accuracy in Modelling
and Analysis
By Leila Al-Mqbali
Directed Research in Social Sciences: SCS 4150
Supervisor: Roman Meyerovich, Canada Revenue Agency
Program Director: Professor Kathleen Day
Disclaimer: Anyviewsoropinionspresentedinthisreportare solelythose of the studentanddonot
representthose of the CanadaRevenue Agency.
April 23, 2014

1
Table of Contents
Introduction..................................................................................................................................... 3
What is Big Data? ........................................................................................................................... 5
Volume......................................................................................................................................... 6
Velocity........................................................................................................................................ 7
Variety......................................................................................................................................... 7
The Rise of Big Data and Predictive Analytics .............................................................................. 9
Recording Data through the Ages: From Ancient to Modern..................................................... 9
Datafication vs. Digitization ..................................................................................................... 12
Big Data: Inference Challenge...................................................................................................... 14
Introducing “Messiness” .......................................................................................................... 14
Data Processing and Analysis: Making a Case for Sampling .................................................. 14
Random Error vs. Non-Random Error...................................................................................... 15
Precision vs. Accuracy .............................................................................................................. 16
Accuracy, Non-Random Error, and Validity............................................................................. 17
Precision, Random Error, and Reliability ................................................................................ 18
Mathematical Indicators of Precision .................................................................................... 18
Precision and Sample Size..................................................................................................... 19
Minimizing Random Errors and Systematic Errors.................................................................. 20
Hypothesis Testing and Sampling Errors.................................................................................. 21
Big Data: The Heart of Messiness ............................................................................................ 26
Patterns and Context: Noise or Signal? ........................................................................................ 28
The Predictive Capacity of Big Data: Understanding the Human’s Role and Limitations in
Predictive Analytics ...................................................................................................................... 31
Models and Assumptions........................................................................................................... 31
The Danger of Overfitting......................................................................................................... 34
Personal Bias, Confidence, and Incentives............................................................................... 38
Big Data: IT Challenge ................................................................................................................. 40
The Big Data Stack.................................................................................................................... 40
Legacy Systems and System Development ................................................................................ 40

2
Building the Big Data Stack...................................................................................................... 41
Storage ................................................................................................................................... 42
Platform Infrastructure........................................................................................................... 43
Data........................................................................................................................................ 43
Application Code, Functions and Services............................................................................ 44
Business View ....................................................................................................................... 44
Presentation and Consumption .............................................................................................. 44
Big Data: Benefits and Value........................................................................................................ 46
Unlocking Big Data’s Latent Value: Recycling Data ............................................................... 46
Product and Service Innovation................................................................................................ 47
Competitive Advantage ......................................................................................................... 47
Improved Models and Subsequent Cost Reductions ............................................................. 48
Improved Models and Subsequent Time Reductions ............................................................ 49
Big Data: Costs and Challenges.................................................................................................... 51
Conceptual Issues: How to Measure the Value of Data ........................................................... 51
Recycling Data: Does Data’s Value Diminish?........................................................................ 53
Big Data and Implications for Privacy ..................................................................................... 55
Cautiously Looking to the Future ................................................................................................. 57
Can N=All?............................................................................................................................... 57
Can Big Data Defy the Law of Diminishing Marginal Returns? .............................................. 58
Final Remarks ............................................................................................................................... 60
Reference List ............................................................................................................................... 61

3
Introduction
As the volume and variety of available data continues to expand, many industries are
becoming increasingly fixated on harnessing data for their own advantage. Coupled with ever
advancing technology and predictive analytics, the accumulation of larger datasets allows
researchers to analyze and interpret information faster, and at a much lower cost than has ever
previously been viable. Undoubtedly, big data has many advantages when applied to a broad
range of business applications; such as cost reductions, time reductions, and more informed
decision making. However, big data also presents its own set of challenges, including a higher
potential for privacy invasion, a higher level of imprecision, and mistaking noise for true insight.
This paper will address the following. First, we attempt to construct a unified and
comprehensive definition of Big Data, characterized by rapidly increasing volume, velocity, and
variety of data. Subsequently, we will discuss the progressions in analytical thinking which
prompted the emergence of Big Data and predictive analytics. In particular, increased processing
capacity and a willingness to permit “messiness” in datasets were instrumental factors in
facilitating the shift from “small” data to “big”. The nature of this “messiness” is explored
through a discussion of sampling, random error, and systematic error.
In addition, we will address the importance of considering correlations in context, in
order to discern noise from signal. The predictive capacity of Big Data is restricted by the
models used to infer testable hypotheses, and so we must consider the limitations and
shortcomings models introduce into the analysis through their underlying assumptions.
Specifically, we will assess the dangers of overfitting, personal bias, confidence, and incentives.

4
Big Data signifies major environmental changes in firm objectives, often requiring
considerable modifications to computer and processing systems. To address these changes, a new
architectural construct has emerged known as the “Big Data Stack”. The architectural construct
is made of several interlinked components, and we address each part in turn in order to assess the
effectiveness of the Big Data Stack in Big Data analytics.
As Big Data continues to develop, it is essential that we undertake careful examination of
the potential costs and benefits typically associated with its deployment. In particular, we will
discuss the benefits of Big Data in terms of data’s potential for re-use, as well as cost and
decision time reductions resulting from improved modelling techniques. With regards to the
potential challenges presented by Big Data, we discuss conceptual issues, privacy concerns, and
the loss of data utility over time.

5
What is Big Data?
Before discussing Big Data in depth, it is essential that we have a good comprehension of
what we understand Big Data to be. Definitions are important, and a single accepted definition
across industries and sources would be ideal, as ambiguous definitions often lead to inarticulate
arguments, inaccurate cost/benefit analyses, and poor recommendations. Unfortunately,
following the review of various sources it is evident that providing a definitive and cohesive
definition of Big Data is perhaps not so simple. Some institutions view Big Data as a broad
process used to encompass the continuous expansion and openness of information; for example
Statistical Analysis System (SAS) characterizes Big Data as “a popular term used to describe the
exponential growth and availability of data, both unstructured and structured.” (“Big Data: What
it is and why it matters”, n.d.) Others focus more on the increased processing capabilities that the
use of Big Data necessitates in order to construct a definition. Strata O’Reilly, a company
involved with big data technology and business strategy, state that “Big Data is data that exceeds
the processing capacity of conventional database systems.” (Dumbill, 2012, para. 1). Yet further
ambiguity is introduced by other authorities who define Big Data as a repository of its potential
future, rather than current, use and value. For instance, Forbes magazine claims that Big Data is
“a collection of data from traditional and digital sources inside and outside your company that
represents a source for ongoing discovery and analysis” (Arthur, 2013, para. 7).
Is there a way forward, given such divergent opinions? Obviously, each of the definitions
captures an important conceptual element, worthy of note. Big data can be big in terms of
volume, and indeed it also requires advanced processing techniques and has important
implications for firm profitability and innovation. However, where all these definitions fall short

6
is in comprehending that Big Data must embody all three of these characteristics at once. Big
Data is more akin to a subject or discipline than a description of a single event or a process, and
therefore emphasizing individual characteristics is not enough to distinguish it as such. Big Data
is unique in that its elements of data volume, velocity, and variety, are increasing rapidly at
different rates and in different formats. Ultimately, this leads to challenges in data integration,
measurement, and interpretation and replicability of the results.
Volume
The volume of Big Data is increasing due to a combination of factors. Primarily, as we
move forward in time the amount of data points available to us increases. Moreover, where
previously all data were amassed internally by company employees’ direct interaction with
clients, innovation in other industries, such as the invention of cellular devices, has resulted in
more and more data being indirectly generated by machines and consumers. Cellular usage data
did not exist before the creation of the cell phone, and today millions upon millions of cellular
devices transmit usage data to various networks all over the world. Furthermore, the creation of
the internet has permitted consumers to play an active role in data generation, as they knowingly
and willingly provide information about themselves that is available to various third parties. The
internet and web logs have allowed for transactions-based data to be stored and retrieved, and
have facilitated the collection of data from social media sites and search engines. Thus, the
combination of a larger number of data sources with a larger quantity of data results in the
exponential growth of data volume, which is related to improved measurement. However, it is

7
important to stress that it is not greater volume in absolute terms that characterizes Big Data, but
rather a higher volume relative to some theoretical, final set of the clients’ data.
Velocity
With regards to velocity, as innovation continues to modify many industries, the flow of
data is increasing at an unparalleled speed. Ideally, data should be processed and analyzed as
quickly as possible in order to obtain accurate and relevant results. Analyzing information is not
an issue as long as the velocity of incoming data is slower than the time it takes to process it. In
this case, the information will still be relevant when we obtain the results. However, with Big
Data the velocity of information is so rapid that it undermines previous methods of data
processing and distillation, and new tools and techniques must be introduced to produce results
that are still relevant to decision-makers. Data is streaming in at an accelerated rate, while
simultaneously processing delays are decreasing at such a rate that the data arrival and
processing may eventually approach real time. However, at present such vast capabilities are not
yet on the horizon, and the need for immediate reaction to data collection poses an ongoing
challenge for many firms and industries.
Variety
Finally, the third central element of Big Data is its variety. Consisting of many different
forms, Big Data represents the mix of all types of data, both structured and unstructured.
McKinsey Global Institute (2011) defines structured data as “data that resides in fixed fields.

8
Examples of structured data include relational databases or data in spreadsheets” (Manyika,
Chui, Brown, Bughin, Dobbs, Roxburgh, & Byers, 2011, p.34). In contrast, unstructured data is
described as “data that do not reside in fixed fields. Examples include free-form text, (e.g. books,
articles, body of e-mail messages), untagged audio, image and video data” (Manyika et al., 2011,
p. 34). However, Big Data is trending towards less structured data and a greater variety of
formats (due to a rising number of applications). Where increased volume is related to improved
measurement, increased variety is associated with greater potential for innovation. Lacking
cohesion in input configuration, the effective management and reconciliation of the varying data
formats remains a persistent obstacle that organizations are attempting to overcome.
Having discussed the various components of Big Data, it is evident that articulating a
succinct and precise definition in a few simple sentences is challenging, if not impossible. Big
Data is not a series of discreetly separable trends; it is rather a dynamic and multi-dimensional
phenomenon. In confining our definition to a few lines, we restrict our understanding and
introduce a haze of ambiguity and uncertainty. Instead, by focussing on Big Data as a
multidimensional process, we bring ourselves a step closer to a fuller and deeper understanding
of this new phenomenon.

9
The Rise of Big Data and Predictive Analytics
Previously, we defined Big Data as consisting of three intertwined dimensions: volume,
velocity, and variety. Now, we briefly look at changes in analytical thinking that took place over
a long period of time, and in the final analysis gave rise to Big Data. At their core are several
concurrent changes in the analysts’ mindset that support and reinforce each other. Firstly, there
was a move towards the capacity to process and analyze increasingly sizeable amounts of data
pertaining to a question of interest. Second, there was a readiness to permit messiness in datasets
rather than restricting our analysis to favour the utmost accuracy and precision.
Recording Data through the Ages: From Ancient to Modern
The emergence of Big Data is rooted in our natural desire to measure, record, and evaluate
information. Advances in technology and the introduction of the Internet have simply made
documentation easier and faster, and as a result we are now able to analyse progressively larger
datasets. In fact, the methods used to document history have been developing for millennia; from
Neanderthal cave art to early Sumerian pictograms, and finally to the digital era we know today.
Basic counting and an understanding of the passage of time are possibly the oldest conceptual
records known to us, but in 3500BC, the early Mesopotamians made a discovery that
transformed the way information was transmitted through the generations and across regions.
Mesopotamians had discovered a method of record keeping (now known as Cuneiform) by
inscribing symbols onto clay tablets, which were used to communicate objects or ideas. It was
this – the invention of writing – that gave rise to the dawn of the information revolution,

10
permitting “news and ideas to be carried to distant places without having to rely on a messenger's
memory”(“Teacher Resource Center Ancient Mesopotamia: The Invention of Writing”, para. 3).
Cuneiform script formed the basis of future record keeping, and as records advanced to printed
text and then again to the digital world, Big Data emerged in its wake.
Essentially, the combination of “measuring and recording ... facilitated the creation of
data” (Schonberger & Cukier, 2013, p. 78), which in turn had valuable effects on society.
Sumerians employed what is known today as descriptive analytics: they were able to draw
insight from the historical records they created. However, somewhere along the journey of
documenting information, a desire was born to use it. It was now possible for humanity to
reproduce past endeavours from documentation of their dimensions, and the process of recording
allowing for more methodical experimentation – one variable could be modified while holding
others constant. Moreover, industrial transactions could be calculated and recorded, aiding in
predicting events such as annual crop yield, and further developments in mathematics “gave new
meaning to data – it could now be analyzed, not just recorded and retrieved” (Schonberger &
Cukier, 2013, p. 80). Thus, it is evident that developments in data documentation had significant
implications for civilization. Parallel to these advances, means of measurement were also
increasing dramatically in precision – allowing for more accurate predictions that could be
derived from the collected documentation.
Nurtured by the rapid growth in computer technology, the first corporate analytics group
was created in 1954 by UPS, marking the beginning of modern analytics. Characterized by a
relatively small volume of data (mostly structured data) from internal sources, analytics were
mainly descriptive and analysts were far removed from decision makers. Following the
beginning of the second millennia, internet-based companies such as Google began to exploit

11
online data and integrate Big Data-type analytics with internal decision making. Increasingly,
data was externally sourced and the “fast flow of data meant that it had to be stored and
processed rapidly” (Davenport & Dyché, 2013, p. 27).
Advancements in computers aided in cementing the transition from descriptive analytics to
predictive analytics, as efficiency was increased through faster computations and increased
storage capacity. Predictive analytics is defined by SAS as “a set of business intelligence (BI)
technologies that uncovers relationships and patterns within large volumes of data that can be
used to predict behavior and events” (Eckerson, 2007, p. 5). As the amount of data continues to
grow with technological developments, these relationships are being discovered at a much faster
speed and with greater accuracy than previously attainable. In addition, it is important to
distinguish between predictive analytics and forecasting. Forecasting entails predicting future
events, while predictive analytics adds a counter-factual by asking “questions regarding what
would have happened... given different conditions” (Waller & Fawcett, 2013, p. 80).
Furthermore, there is a growing interest in the field of behavioural analytics; consumers are
leaving behind “‘digital footprint(s)’ from online purchases ... and social media commentary
that’s resulting in part of the Big Data explosion” (Davenport & Dyché, 2013, p. 27). Effectively,
these communications are informing targeting strategies for various industries and advertisers1.
In sum, using larger quantities of information to inform and enrich various types of
business analytics was a fundamental factor in the shift to Big Data. Thus, as the volume of data
increased exponentially with the arrival of computers and the internet, so too did the variety of
the information and the potential value that could be extracted from it. Continuously developing
1 While promisingtremendous benefits, behavioural analytics entails certainrisks andchallenges for society (such as implications forthe roleof
free will) which must be addressed in a timely manner toavoidpolitical andsocial backlash.Theseissues are beyondthe scope of this paper.

12
computing technologies and software, combined with their increasingly widespread use,
facilitated the shift to Big Data.
Datafication vs. Digitization
One important technological development in the evolution of Big Data is what
Schonberger and Cukier (2007) call “datafication”, distinct from the earlier invention of
digitization. Digitization refers to the process of converting data into a machine-readable digital
format. For example, an image from a soft copy book scanned to a computer and saved as a
bitmap file.
Datafication, on the other hand, involves taking something not previously perceived to
have informational worth beyond its original function, and transforming it into a “numerically
quantified format” (Mayer-Schonberger & Cukier, 2007, p. 76), so that it may then be charted
and analyzed. Data that has no informational worth beyond its original function is said to lack
stored value, as it cannot be held and retrieved for analytical purposes, and has no usefulness
other than what it presents at face value. Essentially, digitization is an initial step in the
datafication process. For example, consider Google Books: pages of text were scanned to
Google’s servers (digitized) so that they could be accessed by the public through use of the
internet. Retrieving this information was difficult as it required knowing the specific page
number and book title; one could not search for specific words or conduct textual information
analysis because the pages had not been datafied. Lacking datification, the pages were simply
images that could only be converted into constructive information by the act of reading –
offering no value other than the narrative they described.

13
To add value, Google used advanced character-recognition software that had the ability to
distinguish individual letters and words: they had transformed the digital images to datified text
(Mayer-Schonberger & Cukier, 2007, p. 82). Possessing inherent value to readers and analysts
alike, this data allowed the uses of particular words or idioms to be charted over time, thus, as an
example, providing new insight on the progression of human philosophy. For instance, it was
able to show that “until 1900 the term ‘causality’ was more frequently used than ‘correlation,’
but then the ratio reversed” (Mayer-Schonberger & Cukier, 2007, p. 83). Combined with
advances in measurement techniques, the development of digital technology has further
increased our ability to analyze a larger volume of data.

14
Big Data: Inference Challenge
Introducing “Messiness”
Despite Big Data’s noted advances in technological sophistication, it has been argued that
“increasing the volume [and complexity of data] opens up the door to inexactitude” in results
(Mayer-Schonberger & Cukier, p. 32). This inexactitude has been referred to as Big Data
“messiness”, and the following sections will explore the nature of messiness and why it seems to
be unavoidable in Big Data analytical solutions. Furthermore, we will consider how sampling
errors and sources of data bias are impacted by the use of Big Data analytics.
Data Processing and Analysis: Making a Case for Sampling
Historically, data collection and processing was slow and costly. Attempts to use whole
population counts (i.e. census) produced outdated results that were consequently not of much use
in making meaningful inferences at the time they were needed. This divergence between growth
in data volume and advances in processing methods was only increasing over time, leading the
U.S. Census Bureau in the 1880s to contract inventor Herman Hollerith to develop new
processing methods for use in the 1890 census.
Remarkably, Hollerith was able to reduce the processing time by more than 88%, so that
the results could now be released in less than a year. Despite this feat, it was still so expensive
for the Bureau to acquire and collect the data that the Census Bureau could not justify running a
census more frequently than once every decade. The lag, however, was unhelpful because the
country was growing so rapidly that the census results were largely irrelevant by the time of their

15
release. Here lay the dilemma: should the Bureau use a sample as opposed to the population in
order to help facilitate the development of speedier census procedures (Mayer-Schonberger &
Cukier, 2013, pp. 21-22).
Clearly, gathering data from an entire population is the ideal, as it affords the analyst far
more comprehensive results. However, using a sample is much more efficient in terms of time
and cost. The idea of sampling quickly took root, but with it emerged a new dilemma – how
should samples be chosen? And how does the choice of sample affect the results?
Random Error vs. Non-Random Error
The underlying assumption in sampling theory is that the units selected will be
representative of the population from which they are selected. In the design stage, significant
efforts are undertaken to ensure that, as far as possible, this is the case. Even when conceptually
correct processing methods for sampling selection are used, a sample cannot be exactly
representative of the entire population. Inevitably, errors will occur, and these are known as
sampling errors. True population parameters differ from observed sample values for two
reasons: random error and non-random error (also called systematic bias).2 Random error refers
to the “statistical fluctuations (in either direction) in the measured data due to the precision
limitations of the measurement” (Allain, n.d.). More specifically, random error comes as a result
of the chosen sampling method’s inability to cover the entire range of population variance
(random sampling error), the way estimates are measured, and the subject of the study.
2
Random andnon-random errors are both types ofsamplingerrors.Non-samplingerrors will be discussed later.

16
On the other hand, systematic errors describe “reproducible inaccuracies that are
consistently in the same direction [and] are often due to a problem which persists throughout the
entire experiment3” (Allain, n.d.). For example, non-random error may result from systematic
overestimation or underestimation of the population (scale factor error), or from the failure of the
measuring instrument to read as zero when the measured quantity is in fact zero (zero error).
Non-Random errors accumulate and cause bias in the final results. In order to evaluate the
impact these non-random errors have on results, we must first consider the concepts of accuracy
and precision.
Precision vs. Accuracy
Bennett (1996) defines accuracy as “the extent to which the values of a sampling
distribution for a statistic approach the population value of the statistic for the entire population”
(p. 135). If the difference between the sample statistic and the population statistic is small, the
result is said to be accurate (also referred to as unbiased), otherwise it is said to be inaccurate. It
is important to note that accuracy depends on the entire range of sample values, not a particular
estimate, and so we refer to the accuracy of a statistic as opposed to that of an estimate.
In contrast, precision reveals “the extent to which information in a sample represents
information in a population of interest” (Bennet, 1996, p. 136). An estimator is called precise if
the sample estimates it generates are not far from their collective average value. Note however,
that these estimates may all be very close together, and yet all may be far from the true
3
Note that humanerror or“mistakes” are not includedin error analysis. Examples of such flaws include faults in calculation,and
misinterpretationofdata orresults.

17
population statistic. Therefore, we can observe results which are accurate but not precise, precise
yet not accurate, both, or neither. To put it differently, “precision does not necessarily imply
accuracy and accuracy does not necessarily imply precision” (Bennett, 1996, p.138). These
outcomes are illustrated below, where the true statistic is represented graphically by the bulls-
eye:
The first drawing is precise because the sample estimates are clustered close to one another. It is
not accurate, however, because they are far from the centre of the inner circle. The
interpretations of the other drawings follow similar analysis.
Accuracy, Non-RandomError, and Validity
The accuracy of a statistic is primarily affected by non-random error. For example, as
previously discussed, non-random error may result from estimates being scaled upwards or
downwards if the instrument persistently records changes in the variable to be greater or less
than the actual change in the observation. In this case, we might find the sample means of our
estimates – though clustered together – are persistently higher than the population mean by a
particular value or percentage, producing a consistent but wholly inaccurate set of results.
Source: Vig, (1992). Accuracy,Stability,andPrecision Examples for a Marksman. IntroductiontoQuartz Frequency Standards.

18
Moreover, Bennett (1996) notes that “probably the greatest threat to accuracy is failure to
properly represent some part of the population of interest in the set of units being selected or
measured” (p. 140).
For example, a mailed literacy survey that participants are invited to fill out and return will
result in gross inaccuracy, as it is bound to exclude those people who are illiterate. The concept
of accuracy is also closely linked to validity. Validity is the term used to indicate the degree to
which a variable measures the characteristic that it is designed to measure. Put differently, an
estimator is not valid when it “systematically misrepresents the concept or characteristic it is
supposed to represent” (Bennett, 1996, p.141). For example, taxable income may not be a valid
indicator of household income if particular types of income (such as welfare payments) are
excluded from the data. It is important to note that the validity of an estimator is largely
determined by non-random (systematic) errors in measurement and experimental design.
Therefore, eliminating a systematic error improves accuracy but does not alter precision, as an
increase in precision can only result from a decrease in random error.
Precision, Random Error, and Reliability
Mathematical Indicators of Precision
The extent of the random error present in an experiment determines the degree of precision
afforded to the analyst. In addition, precision and random error are also closely linked to the
perceive reliability of an estimator. Fundamentally, an estimated statistic is considered reliable
when it produces “the same results again and again when measured on similar subjects in similar

19
circumstances” (Bennett, 1996, p.144). Put differently, results which closely resemble one
another represent a more precise estimator and a lower degree of random error.
Recall that random error is the part of total error that varies between measurements, all else
held equal. The lower the degree of random error, the more precise our estimate will be. How
then do we measure the extent to which random error is present in experiments? Confidence
intervals are commonly used as an indicator of precision, as they measure the probability that a
population statistic will lie within the specified interval.
For example, given a 95% confidence interval, the mean of a sample may be between 5.3
and 6.7. In effect, this means that we can expect 95% of our sample estimates for the mean to fall
somewhere between 5.3 and 6.7. The narrower the confidence band, the more precise the
estimator. Moreover, the standard error of an estimate is also used to indicate precision.
Standard error is essentially the extent of the fluctuation from the population statistic due to pure
chance in sample estimates, and is calculated by dividing the sample variance by the sample size
and then taking the square root. An estimate with high precision (and thus small random error)
will have low standard error.
Precision and Sample Size
Depending on the statistic under consideration, precision may be dependent on any number
of factors (such as the unit of measurement etc.). However, it is always dependent on sample
size. The explanation for this comes from the nature of non-random errors. As we have
discussed, non-random errors can occur in any number of observations in an experiment, and
each observation is not necessarily distorted to the same degree. Therefore, if we were to repeat a

20
test with random error and average the results, the precision of the estimate will increase. Also,
“the greater the variation in the scores of a variable or variables on which a statistic is based, the
greater the sample size necessary to adequately capture that variance” (Bennett, 1996, p.139).
Essentially, an experiment with higher random error necessitates a larger sample size to achieve
precision, and the estimate will become more precise the more times the experiment is repeated.
This result follows from the Central Limit Theorem, which states that as the sample size
increases, the sample distribution of a statistic approaches a normal distribution regardless of the
shape of the population distribution. Thus, the theorem demonstrates why sampling errors
decrease with larger samples.
Minimizing Random Errors and Systematic Errors
While it is possible to minimize random errors by repeating the study and averaging the
results, non-random errors are more difficult to detect and can only be reduced by improvement
of the test itself. This is due to the fact that non-random errors systematically distort each
observation in the same direction, whereas random errors may irregularly distort observations in
either direction. To illustrate this more clearly, let us consider the following example. If the same
weight is put on the same scale several times and a different reading (slightly higher or lower) is
recorded with each measurement, then our experiment is said to demonstrate some degree of
random error. Repeating the experiment many times and averaging the result will increase the
precision. However, if the same weight is put on the same scale several times and the results are
persistently higher or persistently lower than the true statistic by a fixed ratio or amount, the

21
experiment is said to have systematic error. In this case, repeating the test will only reinforce the
false result, and so systematic errors are much more difficult to detect and rectify.
Hypothesis Testing and Sampling Errors
An important principle of sampling is that samples must be randomly selected in order to
establish the validity of the hypothesis test. Hypothesis testing is a method of statistical inference
used to determine the likelihood that a premise is true. A null hypothesis H0 is tested against an
alternate hypothesis H1 (hence H0 and H1 are disjoint) and the null hypothesis is rejected if there
is strong evidence against it, or equivalently if there is strong evidence in favour of the alternate
hypothesis. It is important to note that failure to reject H0 therefore denotes a weak statement; it
does not necessarily imply that H0 is true, only that there did not exist sufficient evidence to
reject it.
As an example, imagine a simple court case: the null hypothesis is that a person is not
guilty, and that person will only be convicted if there is enough evidence to merit conviction. In
this case, failure to reject H0 merely implies there is inadequate evidence to call for a guilty
verdict – not that the person is innocent. Moreover, it is possible to repeat an experiment many
times under different null hypotheses and fail to reject any of them. Consider if we were to put
each person in the world on trial for a crime – we could hypothetically fail to find sufficient
evidence to convict anyone, even if someone did commit a crime. Therefore, the goal of
hypothesis testing should always be to reject the null hypothesis and in doing so confirm the
alternate, as it represents a much stronger statement than failure to reject the null.

22
In the probabilistic universe, there is always some level of imprecision and inaccuracy,
however small. Occasionally, an innocent person will be convicted, and sometimes a guilty
person will walk free. Every hypothesis test is subject to error as we are imperfect beings with
imperfect empirical knowledge, as some data points are always missing which affects
measurement. Furthermore, every study has a level of “acceptable” error (typically denoted by
alpha), which is directly related to the probability that the results inferred will be inexact. For
example, alpha=0.05 is indicative of accepting 5% error in our results – and so if we repeated an
experiment 1000 times we would reasonably expect 950 significant results from the 1000. Type I
error occurs as a result of random error, and results when one rejects the null hypothesis when it
is true. The probability of such an error is the level of significance (alpha) used to test the
hypothesis. Put differently, Type I error is a “false positive” result and a higher level of
acceptable error (i.e. a larger value of alpha) increases the likelihood of imprecision.
According to the Central Limit Theorem, larger samples result in lower Type 1 error. As
noted previously, Big Data is not only represented by bigger data sets (volume) but also by
different data types (variety). Therefore, in the case of Big Data, the probability that a Type I
error will occur is significantly higher than it would be in a “small” data problem, as the move to
Big Data involves increasing the value of alpha due to increased variety. Indeed, “the era of Big
Data only seems to be worsening the problems of false positive findings in the research
literature” (Silver, 2012, p.253).
To lower the likelihood of Type I error, one lowers the level of acceptable error: one
tightens the restrictions regarding which data are permitted in the analysis, thereby reducing the
size of the sample to a small data problem. However, experiments making use of small data are
more prone to errors of Type II: accepting the null hypothesis when it is not true (the alternative

23
is true). In other words, Type II error refers to a situation where a study fails to find a difference
when in fact a difference exists (also referred to as a false negative result). Effectively,
committing a Type II error is caused by systematic error, and entails accepting a false
hypothesis. This can negatively impact results as adopting false beliefs (and drawing further
inferences from analyses under the assumption that your beliefs are correct) can result in further
erroneous conclusions. The possible outcomes for hypothesis testing are shown in the table
below:
Outcomesfrom HypothesisTesting
RealitysfjsdgReality
The null
hypothesis is
true
(no.difference)
The alternative
hypothesis is
true (difference)
Result from Research
The null hypothesis
is true (no
difference)
Accurate Type 2 Error
The alternative
hypothesis is true
(difference)
Type 1 Error Accurate
Thus, for a given sample size the real problem is to choose alpha so as to achieve the greatest
benefit from the results; we consider which type of error we deem to be “more” acceptable. This
is not a simple question as the level of acceptable error is contingent upon the type of research
we are conducting.
For instance, if a potential benefactor refuses to fund a new business venture, they are
avoiding Type I error – which would result in a loss of finances. At the same time, however, they
open themselves to the possibility of Type II error; that they may by bypassing a potential profit.

24
It is simply an issue of potential costs vs. potential benefits, and weighing the risk and
uncertainty. Risk is “something you can put a price on” (Knight, 1921, as cited by Nate Silver,
2012, p. 29), whereas uncertainty is “risk that is hard to measure” (Silver, 2012, p. 29).Whereas
risk is exact (e.g. odds of winning a lottery), uncertainty introduces imprecision. Silver (2012)
notes that, “you might have some vague awareness of the demons lurking out there. You might
even be acutely concerned about them. But you have no idea how many of them there are or
when they might strike” (p. 29). In the case of the potential backer, there was too much
uncertainty surrounding the outcome for him to feel comfortable financing the new business.
Similarly to our hypothetical patron, many people are averse to uncertainty when making
decisions – that is many people would prefer lower returns with known risks as opposed to
higher returns with unknown risks – and are consequently more inclined to minimize Type I
errors and accept Type II errors when making decisions.
Consider a second example; results from cancer screening, where the null hypothesis is
that a patient is healthy. Type I error entails telling a patient they have cancer when they do not,
and Type II error involves failing to detect a cancer that is present. Here, the costs of the errors
seem to be much higher, as the patient’s life may be at stake. Type I error can lead to serious side
effects from unnecessary treatment and patient trauma, however an error of Type II could result
in a patient dying from an undiagnosed disease which could have potentially been treated. In this
scenario, the cost of a Type II error seems to be much greater than that of a Type I error.
Therefore, in this scenario a false positive is more desirable than a false negative, and we seek to
minimize Type II errors. This is exactly the case with hypothesis tests which utilize Big Data; by

25
increasing the sample size, the power4 of the test is amplified, and thus Type II errors are
minimized.
Assessing the costs of different decisional errors, we can see that the choice of alpha (and
relative likelihood of Type I and Type II errors) must be made on a situational basis, and making
any decision will involve a trade-off between the two types. Furthermore, one cannot easily
make the argument that one type of error is always worse than the other; the gravity of Type I
and Type II errors can only be gauged in the context of the null hypothesis.
The discussion of sampling errors and other sources of bias have significant implications
for Big Data. Decisions regarding Type I and Type II errors introduce bias into datasets, as each
organization executes these decisions in order to fulfil their individual objectives. Typically, each
party is not obligated (or inclined) to share their decision making processes with other parties,
and therefore each organization has imperfect information regarding the data held by others. The
resulting set of Big Data employed by each organization represents an unknown combination of
decisions (biases) to all other organizations. Society’s continuing shift to Big Data implies the
costs of false positives are not perceived to be serious (or the costs of false negatives are
understood to be relatively more serious) for the types of issues being addressed in the
experiments.
4 The power of a test refers to the ability of a hypothesis test to reject the null hypothesis when the alternative
hypothesis is true.

26
Big Data:The Heart of Messiness
Consider a coin that is tossed 10 times and each observation recorded. We might find the
probability of heads to be 0.8 from our sample, and therefore we do not have sufficient evidence
to reject H0: P(heads)=0.75. In this case, we would be making a Type II error as we would fail to
reject the null when the alternative is true. However, as we increase our sample size to 10000 we
may find that the probability of heads is now 0.52 and so we may consider this sufficient to
reject the null hypothesis. Clearly, in this case, a bigger sample is better as it allows us to gather
more data which can be used as evidence. This result follows from the Law of Large Numbers,
which states that as sample size increases, the sample mean approaches the population mean.
However, it is important to note that this law is valid only for samples that are unbiased: a larger
biased sample will yield next to no improvement in accuracy. Bias is the tendency of the
observed result to fall more on one side of the population statistic than the other: it is a persistent
deviation to one side. With regards to our example, a coin is fair and unbiased in nature (unless it
has been tampered with). A coin toss is just as likely to come up tails as it is to come up heads,
and since there are only two possible outcomes the probability of either is 0.5. In other words,
the unbiased coin “has no favourites”. Thus, as the sample size of coin tosses increases, the
sample mean approaches the true population mean.
Let us now consider a biased experiment. For example, internet surveys are explicitly
(through not deliberately, by design) biased to include only those people who use the internet.
Increasing the number of participants in the survey will not make it any more representative of
the whole population as each time it is repeated it replicates the same bias against people who do
not use the internet. It is important, therefore, to note that increasing the size of a biased sample
is not likely to result in any increase in accuracy.

27
Herein lays the crux of messiness. Following from the Central Limit Theorem, we
previously discussed precision as increasing with sample size. However, the Law of Large
Numbers states that if the sample is biased, using a larger sample does not reduce the bias and
may even amplify it, thereby magnifying inaccuracies. Big Data samples may contain a number
of biases, such as self-reporting on social media sites, etc., making accuracy extremely unlikely
to increase with the larger sample size. Some data points are likely to be missing, and it can
never be known with complete certainty exactly what has been omitted. With additional volume
and variety in data points comes additional random errors and systematic bias. The problem lies
in our inability to discern which is increasing faster. This is the nature of messiness in Big Data.
Clearly, the total absence of error is unattainable, as there are always some data points
missing from the experiment. While in theory increasing sample size can increase precision, the
biases inherent in Big Data mean the increase in volume is unlikely to result in any meaningful
improvement.

28
Patterns and Context: Noise or Signal?
When dealing with Big Data solutions, it is important to distinguish between data and
knowledge so as not to mistake noise for true insight. Data “simply exists and has no
significance beyond its existence... it can exist in any form, usable or not” (Ackoff, 1989. As
cited in Riley & Delic, 2010, p. 439). Stated differently, data is signified by a fact or statement of
event lacking an association to other facts or events. In contrast, knowledge is “the appropriate
collection of information, such that it's intent is to be useful. Knowledge is a deterministic
process” (Ackoff, 1989. As cited in Riley & Delic, 2010, p. 439). Therefore, knowledge involves
data which has been given context, and is more than a series of correlations; it typically imparts a
high degree of reliability as to events that will follow an expressed state. To put it differently,
knowledge has the potential to be useful, as it can be analyzed to reveal latent fundamental
principles. The table below provides examples of these related concepts:
Data vs. Knowledge:
Data Knowledge
Example 1 2, 4, 8, 16
Knowing that this is equivalent to 21, 22, 23, 24, and
being able to infer the next numbers in the sequence.
Example 2 It is raining.
The temperature dropped and then it started raining.
Inferring that a drop in temperature may be correlated
with the incidence of rain.
Example 3 The chair is broken
I sat heavy items on a chair and it broke. Inferring that
the chair may not be able to withstand heavy weights.
Clearly, understanding entails synthesizing different pieces of knowledge to form new
knowledge. By understanding a set of correlations, we open the door to the possibility for the
prediction of future events in similar states. Fundamentally, Big Data embodies the progression

29
from data to understanding with the purpose of uncovering underlying fundamental principles.
Analysts can then exercise this newfound insight to promote more effective decision making.
Further to this intrinsic procedure from data to understanding, patterns often also emerge
from the manipulation and analysis of the data itself. For example, results demonstrate that there
are visible patterns and connections in data variability; trends in social media due to seasonal and
world events can disrupt the typical data load. Data flows have a tendency to vary in velocity and
variety during peak seasons and periods throughout any given year, and on a much more intricate
scale it is even possible to observe varying fluctuations in the data stream across particular times
of day.
This variability of data can often be challenging for analysts to manage (e.g. server crashes
due to unforeseen escalations in online activity) and furthermore it can significantly impact the
accuracy of the results. Inconsistencies can emerge which obscure other meaningful information
with noise; not all data should necessarily be included in all types of analysis as inaccurate
conclusions may result from the inclusion of superfluous data points. Therefore, it is necessary to
weigh the cost of permitting data with severe variability against the potential value the increase
in volume may provoke. Whereas the internal and patterned process from data to understanding
serves to provide us with insight, the patterns of data variability often present us with obstacles
to this insight.
Identifying a pattern is not enough. Almost any large dataset can reveal some patterns,
most of which are likely to be obvious or misleading. For example, late in the 2002 season
Cleveland Cavaliers basketball team showed a consistent tendency to “go over” the total for the

30
game.5 Upon investigation, it was found that the reason behind this trend was that Ricky Davis’
contract was to expire at the end of the season, and so he was doing his utmost best to improve
his statistics and thereby render himself more marketable to other teams (Ricky Davis was the
teams’ point-guard). Given that both the Cavaliers and many of their opponents were out of
contention for the playoffs and thus their only objective was to improve their statistics, a tacit
agreement was reached where both teams would play weak defence so that each team could
score more points.
The pattern of high scores in Cavalier games may seem to be easily explainable. However,
many bettors committed a serious error when setting the line; they failed to consider the context
under which these high scores were attained (Silver, 2012, pp. 239-240). Discerning a pattern is
easily done in a data-rich environment, but it is crucial to consider these patterns within their
context in order to ascertain whether they indicate noise or signal.
5 When assigningodds to basketball scores,bookmakers setan expected total for the game. This total refers to the
number of points likely to be scored in the game. Thus, a tendency to “go over” this total refers to the factthat
consistently,in any given game, more points are being scored than expected.

31
The Predictive Capacity of Big Data: Understanding the
Human’s Role and Limitations in Predictive Analytics
Advocates of the Big Data movement argue that the substantial growth in volume, velocity,
and variety increases the potential gains from predictive analytics. According to them, the shift
towards Big Data should effectively afford the analyst greater capacity to infer accurate
predictions. More sceptical observers argue that in connecting the analyst’s subjective view of
reality with the objective facts about the universe, the possibility for more accurate predictions
hinges on a belief in an objective truth and an awareness that we can only perceive it imperfectly.
As human beings we have imperfect knowledge, and so “wherever there is human
judgement there is the potential for bias” (Silver, 2012, p.73). Forecasters rely on many different
methods when making predictions, but all of these methods are contingent upon specific
assumptions and inferences about the relevant states or events in question – assumptions that
may be wrong. Let us now further examine the limitations that assumptions introduce to the
analysis.
Models and Assumptions
Assumptions lie at the foundation of every model. A model is essentially a theoretical
construct which uses a simplified framework in order to infer testable hypotheses regarding a
question of interest. Dependent upon the analyst’s perceptions of reality, models guide selection
criteria regarding which data is to be included and how it is to be assembled. The analyst must

32
decide which variables are important and which relationships between these variables are
relevant. Ultimately, all models contain a certain degree of subjectivity, as they employ many
simplifying assumptions and thus capture only a slice of reality. However, the choice of
assumptions in data analysis is of critical importance, as varying assumptions often generate very
different results.
All models are tools to help us understand the intricate details of the universe, but they
must never be mistaken for a substitute for the universe. As Norbert Wiener famously put it, “the
best material model for a cat is another, or preferably the same cat.” (Rosenblueth & Wiener,
1945, p.320). In other words, every model omits some detail of the reality, as all models involve
some simplifications of the world. Moreover, “how pertinent that detail might be will depend on
exactly what problem we’re trying to solve and on how precise an answer we require” (Silver,
2012, p.230).
Again, this emphasizes the importance of constructing a model in such a way that its
design is consistent with appropriate assumptions and examines the relationship between all
relevant variables. As Big Data attracts increasing focus, we must not fail to recognize that the
predictions we infer from analysis are only as valid and reliable as the models they are founded
on.
For example, consider a situation where you are asked to provide a loan to a new company
which operates ten branches across the country. Each branch is determined to have a relatively
small (say 3%) chance of defaulting, and if one branch defaults, their debt to you will be spread
between the remaining branches. Thus, the only situation where you would not be paid back is
the situation in which all ten branches default. What is the likelihood that you will not be repaid?

33
In fact, the answer to this question depends on the assumptions you will make in your
calculations.
One common dilemma faced by analysts is whether or not to assume event independence.
Two events are said to be independent if the incidence of one event does not affect the likelihood
that the other will also occur. In our hypothetical scenario, if you were to assume that each
branch is independent of the other, the risk of the loan defaulting would be exceptionally small
(specifically, the chance that you would not be repaid is (0.03)10). Even if nine branches were to
default, the probability that the tenth branch would also fail to repay the loan is still only 3%.
This assumption of independence may be reasonable if the branches were well diversified, and
each branch sold very distinct goods from all other branches. In this case, if one branch defaulted
due to low demand for the specific goods that they offered, it is unlikely that the other branches
would now be more prone to default as they offer very different commodities.
However, if each branch is equipped with very similar merchandise, then it is more likely
that low demand for the merchandise in one branch will coincide with low demand in the other
branches, and thus the assumption of independence may not be appropriate. In fact, considering
the extreme case where each branch has identical products and consumer profiles are the same
across the country, either all branches will default or none will. Consequently, your risk is now
assessed on the outcome of one event rather than ten, and the risk of losing your money is now
3%, which is several hundred thousand times higher than the risk calculated under the
assumption of independence.

34
Evidently, the underlying assumptions of our analysis can have a profound effect on the
results. If the assumptions upon which a model is founded are inappropriate, predictions based
on this model will naturally be wrong.
The Dangerof Overfitting
Another root cause of failure in attempts to construct accurate predictions is model
overfitting. The concept of overfitting has its origins in Occam’s Razor (also called the principle
of parsimony), which states that we should use models which “contain all that is necessary for
the modeling but nothing more” (Hawkins, 2003, p.1). In other words, if a variable can be
described using only two predictors, then that is all that should be used: including more than two
predictors in regression analysis would infringe upon the principle of parsimony. Overfitting is
the term given to the models which violate this principle. More generally, it describes the act of
misconstruing noise as signal6, and results in forecasts with inferior predictive capacity.
Overfitting generally results from the use of too many parameters relative to the quantity of
observations, thereby increasing the random error present in the model and obscuring the
underlying relationships between the relevant predictors. In addition, the potential for overfitting
also depends on the model’s compatibility with the shape of the data, and on the relative
magnitude of model error to expected noise in the data. Here, we define model error as the
divergence between the outcomes in the model and reality due to approximations and
assumptions.
6 This is in contrastwith underfitting, which describes the scenario when one does not capture as much of the
signal as ispossible.Putdifferently, underfitting results fromthe factthat some relevant predictors aremissing
from the model. We focus here on overfitting as itis more common in practice.

35
In order to see how the concept of overfitting arises in practice, consider the following.
Suppose we have a dataset with 100 observations, and we know beforehand exactly what the
data will look like. Clearly, there is some randomness (noise) inherent in the dataset, although
there appears to be enough signal to identify the relationship as parabolic. The relationship is as
shown below:
However, in reality, the number of observations available to us is usually restricted. Suppose we
now only have access to 25 of the hundred observations. Without knowing the true fit of the data
beforehand, the true relationship appears to be less certain. Cases such as these are prone to
overfitting, as analysts design complex functional relationships that strive to include outlying
data points – mistaking the randomness for signal (Silver, 2012). Below, the overfit model is
represented by the solid line; and the true relationship by the dotted line:
Source: Silver (2012). TrueDistributionof Data. The Signal andthe Noise.

36
Errors such as these can broaden the gap between the analyst’s subjective knowledge and the
true state of the world, leading to false conclusions and decreased predictive capacity.
Overfitting is highly probable in situations where the analyst has a limited understanding of the
underlying fundamental relationships between variables, and when the data is noisy and too
restricted.
Now that we have examined potential scenarios in which overfitting may occur, we will
examine why it is undesirable in Big Data predictive analytics. First, including predictors which
perform no useful function necessitates the need to “measure and record these predictors so that
you can substitute their values in the model” (Hawkins, 2003, p.2) in all future regressions
undertaken with the model. In addition to wasting valuable resources by documenting ineffectual
parameters, this also increases the likelihood of random errors which can lead to less precise
predictions. A related issue comes from the fact that including irrelevant predictors and
estimating their coefficients increases the amount of random variation (fluctuations due to mere
chance) in the resulting predictions. Despite these issues, however, perhaps the most pressing
concern regarding overfitting results from its tendency to make the model appear to be more
Source: Silver,(2012).Overfit Model.The SignalandtheNoise.

37
valid and reliable than it really is. One frequently used method of testing the appropriateness of a
model is to measure how much variability in the data is explained by the model. In many cases,
overfit models explain a higher percentage of the variance than the correctly fit model. However,
it is critical that we recognize that the overfit model achieves this higher percentage “in essence
by cheating – by fitting noise rather than signal. It actually does a much worse job of explaining
the real world.” (Silver, 2012, p.167).
The crux of the problem of overfitting in predictive analytics is that, because the overfit
model looks like a better imitation of reality and thus provides the illusion of greater predictive
capacity, it is likely to receive more attention from publications etc. than models with a more
correct fit and lower prediction value. If the overfit models are the ones which are accepted, the
decision-making suffers as a result of misleading results.
With Big Data, the problem of overfitting may be amplified, as the nature of Big Data tools
and applications allows us to investigate increasingly complex questions. There are various
techniques for avoiding the problem, some of which are designed to explicitly penalize models
which violate the principle of parsimony. Other techniques test the model’s performance by
splitting the data, and using half to build the model and half to validate it (also known as early
stopping). The choice of avoidance mechanism is at the discretion of the analyst and is
influenced by the nature of the issue the test addresses.

38
Personal Bias, Confidence, and Incentives
As we have discussed, many predictions will fail due to the underlying construct of the
model: assumptions may be inappropriate, key pieces of context may be omitted, and models
may be overfitted. However, even if these problems are avoided, there is still a risk that the
prediction may fail due to the attitudes and behaviours of humans themselves. In fact, failure to
recognize our attitudes and behaviours as obstacles to better prediction can potentially increase
the odds of such failure. As Silver (2012) notes, “data driven predictions can succeed – and they
can fail. It is when we deny our role in the process that the odds of failure rise” (p. 9).
Again, the root of the problem is that all predictions involve exercising some degree of
human judgment, where each individual bases his/her judgments on their own subjective
knowledge, psychological characteristics, and even monetary incentives. It has been shown that,
rather than accepting results from statistical analysis at face value, individual judgmental
adjustments result in “forecasts that were about 15% more accurate” (Silver, 2012, p. 198). For
example, a more cautious individual - or one with a lot at stake if their prediction is wrong – may
choose to believe the average (aggregate) prediction rather than the prediction of any one
individual forecaster. In fact, specialists in many different fields of study have observed the
tendency for group forecasts to outperform individual forecasts, and so choosing the aggregate
forecast may be a reasonable judgment in some cases. However, in other cases choosing the
aggregate prediction may hinder potential improvements to forecasts, as improvements to any
individual prediction will subsequently improve the group prediction as well.
Moreover, applying individual judgments to analyses introduces the potential for bias, as it
has been shown that people may construct their forecasts to cohere with their personal beliefs

39
and incentives. For instance, researchers have found that forecasts which are managed
anonymously outperform – in the long run - predictions which name their forecaster. The reason
for this trend lies in the fact that incentives change when people have to take responsibility for
their predictions: “if you work for a poorly known firm, it may be quite rational for you to make
some wild forecasts that will draw big attention when they happen to be right, even if they aren’t
going to be right very often” (Silver, 2012, p. 199). Effectively, individuals with lower
professional profiles have less to lose by declaring bolder or riskier predictions. However,
concerns with status and reputation distract from the primary goal of making the most precise
and accurate prediction possible.

40
Big Data: IT Challenge
The Big Data Stack
In an attempt to infer more accurate predictions, many experts are analyzing larger
volumes of data and are aiming for increasingly sophisticated modelling techniques. As the trend
towards the “Big Data” approach to knowledge and discovery grows, there is a new architectural
construct which requires development and through which data must travel. Often referred to as
the “Big Data stack”, this construct is made of several moving components which work together
to “comprise a holistic solution that’s fine-tuned for specialized, high-performance processing
and storage” (Davenport & Dyché, 2013, p. 29).
Legacy Systems and System Development
Developing a computer-based system requires a great deal of time and effort, and therefore
such systems tend to be designed for a long lifespan. For example, “much of the world’s air
traffic control still relies of software and operational processes that were originally developed in
the 1960s and 1970s” (Somerville, 2010, para. 1). These types of systems are called legacy
systems, and they combine dated hardware, software, and procedures in their operation.
Therefore, it is difficult and often impossible to alter methods of task execution as these methods
rely on the legacy software: “Changes to one part of the system inevitably involve changes to
other components” (Somerville, 2010, para. 2).
However, discarding these systems is often too expensive after only several years of
implementation, and so instead they are frequently modified to facilitate changes to business

41
environments. For example, additional compatibility layers may be regularly added as new tools
and software are often incompatible with the system. Clearly, the development of computer-
based systems must be considered in juxtaposition with the evolution in its surrounding
environment. Somerville (2010) notes that “changes to the environment lead to system change
that may then trigger further environmental changes” (p.235), in some cases resulting in a shift
of focus from innovation to maintaining current status.
Building theBig Data Stack
The advent of Big Data constitutes a major environmental change in terms of firms’
objectives, and it has necessitated considerable modifications and redesign of computer and
process systems. One such solution, the Big Data Stack, is well equipped to facilitate businesses’
continuous system innovations, as its configuration uses packaged software solutions that are
specifically fine-tuned to fit the variety of data formats. The composition and assembly of the
Stack is shown below:

42
Storage
The storage layer is the foundation of the edifice. Before data is collected, there must be
space for it to be recorded and held until it has been processed, distilled, and analyzed.
Previously available technologies offered limited space capacity, and storage devices with large
capacity were new commodities and therefore not cost-effective. As a result, the amount of data
that could be used in analysis was restricted right from the outset. However, disk technologies
are becoming increasingly efficient which is producing a subsequent cost decrease in the storage
of large and varied data sets, and increased storage capacity represents new possibilities to
collect larger amounts of data.
Source: Davenport & Dyché,2013. The BigDataStack. International Institute for Analytics.

43
Platform Infrastructure
Data can move from the storage layer to the platform infrastructure, which is comprised of
various functions which collaborate to achieve the high-performance processing that is
demanded in companies which utilize Big Data. Consisting of “capabilities to integrate, manage,
and apply sophisticated computational processing to the data” (Davenport & Dyché, 2013, p. 9),
the platform infrastructure is generally built on a Hadoop foundation. Hadoop foundations are
cost-effective, flexible, and fault tolerant software frameworks. Fundamentally, Hadoop enables
the processing of high volume data sets across collections of servers, and it can be created to
scale from an individual machine to a multitude of servers.
Offering high performance processing at a low price to performance ratio, Hadoop
foundations are both flexible and resilient as the software is able to detect and manage faults at
an early stage of the process.
Data
As previously discussed, Big Data is vast and structurally complex, and the data layer
combines elements such as Hadoop software structures with different types of databases for the
purpose of combining data retrieval mechanisms with pattern identification and data analysis.
This combination of databases is used to design Big Data strategies, and therefore the data layer
manages data quality, reconciliation, and security when formulating such schemes.

44
Application Code, Functions and Services
Big Data’s use differs with the underlying objectives of analysis, and each objective
necessitates its own unique data code which often takes considerable time to implement and
process. In solution to these issues, Hadoop employs a processing engine called MapReduce.
Using this engine, analysts can redistribute data across disks and at the same time perform
intricate computations and searches on the data. From these operations, new data structures and
datasets can then be formed using the results from computation (e.g. Hadoop could apply
MapReduce to sort through social media transactions, looking for words like “love”, “bought”,
etc. and thereby establish a new dataset listing key customers and/or products (Davenport &
Dyché, 2013, p. 11).
Business View
Depending on the application of Big Data, additional processing may be necessary.
Between data and results, an intermediate stage may be required, often in the form of a statistical
model. This model can then be analysed to achieve results consistent with original objective.
Therefore, the business view guarantees that Big Data is “more consumable by the tools and the
knowledge workers that already exist in an organization” (Davenport & Dyché, 2013, p. 11).
Presentation and Consumption
One particular distinguishing characteristic of Big Data is that it has adopted “data
visualisation” techniques. Traditional intelligence technologies and spreadsheets can be

45
cumbersome and difficult to navigate in a timely manner. However, data visualization tools
permit information to be viewed in the most efficient manner possible.
For example, information can be presented graphically to depict trends in the data, which
may lead to a faster gain in insight or give rise to further questions – thereby prompting further
testing and analysis. Many data visualization software are now so advanced that they are more
cost and time effective than traditional presentation systems. It is important to note though that
data visualizations become more complicated to read when we are dealing with multivariate
predictor models, as the visualization in these cases encompasses more than two dimensions.
Methods to address this challenge are in development, and there now exist some
visualization tools that select the most suitable and easy-to-read display given the form of the
data and the number of variables.

46
Big Data: Benefits and Value
Attracting attention from firms in all industries, Big Data offers many benefits to those
companies with the ability to harness its full potential. Firms using “small” and internally
assembled data derive all of the data’s worth from its primary use (the purpose for which the data
was initially collected). With Big Data, “data’s value shifts from its primary use towards its
potential future uses” (Mayer-Schonberger & Cukier, 2013, p.99) thus leading to considerable
increases in efficiency. Employing Big Data analytics allows firms to increase their innovative
capacity, and realize substantial cost and decision time reductions. In addition, Big Data
techniques can be applied to support internal business decisions by identifying complex
relationships within data. Despite these promising benefits, it is also important to recognize that
much of Big Data’s value is “largely predicated on the public’s continued willingness to give
data about themselves freely” (Brough, n.d., para. 11). Therefore, if such data were to be no
longer publically available due to regulation etc., the value of Big Data would be significantly
diminished.
Unlocking Big Data’s LatentValue: Recycling Data
As the advances in Big Data take hold, the perceived intrinsic monetary value of data is
changing. In addition to supporting internal business decisions, data is increasingly considered to
be the good to be traded. Decreasing storage costs combined with the increased technical
capacity to collect data means that many companies are finding it easier to justify preserving the
data rather than discarding it when they have completed its primary processing and utilization.

47
Effectively, increased computational abilities in Big Data analytics have helped to facilitate data-
re-use. Data is now viewed as an intangible asset, and unlike material goods its value does not
diminish after a one-time use. Indeed, data can be processed multiple times – either in the same
way for the purpose of validation, or in a number of different ways to meet different goals and
objectives. After its initial use, the intrinsic value of data “still exists, but lies dormant, storing its
potential... until it is applied to a secondary use” (Mayer-Schonberger & Cukier, 2013, p. 104).
Implicit in this is the fact that even if the first several exploitations of the data generate little
value, there is still potential value in data which may eventually be realized.
Ultimately, the value of data is subject to the analyst’s abilities. Highly creative analysts
may think to employ the data in more diverse ways, and as such the sum of the value they extract
from the data’s iterative uses may be far greater than that extracted by another analyst with the
same dataset. For example, sometimes the latent value of data is only revealed when two
particular datasets are combined, as it is often hard to discern their worth by examining some
datasets on their own. In the age of Big Data, “the sum is more valuable than its parts, and when
we recombine the sums of multiple datasets together, that sum too is worth more than its
individual ingredients” (Mayer-Schonberger & Cukier, 2013, p. 108).
Product and Service Innovation
Competitive Advantage
Big Data analytics have enabled innovation in a broad range of products and services,
while reshaping business models and decision making processes alike. For example, advances in
Big Data storage and processing capabilities have facilitated the creation of online language

48
translation services. Advanced algorithms allow users to alert service providers whenever their
intentions are misunderstood by the systems. In effect, users are now integrated into the
innovation process, educating and refining systems as much as the system creators do. This not
only allows for vast cost and time reductions, but also offers a powerful competitive advantage to
companies who integrate these types of analytic processes.
For instance, a new provider of an online translation service may have trouble competing
with an established enterprise not only due to the lack of brand recognition, but also because
their competitors already have access to an immense quantity of data. The fact that so much of
the performance of established services, such as Google Translate, are the result of the consumer
data they have been incorporating over many years may constitute a significant barrier to entry
by others into the same markets. In other words, what is considered an advantage of Big Data
innovation for one firm (competitive advantage) is conversely a disadvantage for other firms
(barriers to entry).
Improved Models and Subsequent Cost Reductions
Many innovations are made possible due to the increased capacity of Big Data analytics
to identify complex relationships in data. For example, the accumulation of a greater volume of
observations has made it much easier to correctly discern non-linear relationships, which may
revise predictive and decision making models to afford the analyst greater accuracy. For
instance, models used in fraud detection are becoming increasingly sophisticated, often allowing
anomalies to be detected in near real time and resulting in significant cost reductions. Some
estimate that Big Data will “drive innovations with an estimated value of GDP 24 billion... and

49
there are also big gains for government, with perhaps GBP 2 billion to be saved through fraud
detection” (Brough, n.d., para. 7).
In addition, innovations and subsequent cost reductions have been achieved through the
development of personalized customer preference technologies. The nature of Big Data allows
for greater insight into human behavioural patterns, as fewer data points are omitted from
analysis in an attempt to glean all possible value from each observation. Consequently, greater
attention to detail “makes it possible to offer products and services based on relevance to the
individual customer and in a specific context” (Brough, n.d., para. 3). Innovative services such
as Amazon’s personalized book recommendations allow firms to efficiently direct their
advertising campaigns to target those consumers perceived as more likely to be interested in their
products, and thereby reduce advertising and marketing costs. Therefore, much of the value of
Big Data comes not only from offering new products, but often from offering the same product
(Amazon is still offering books) in a more efficient way. In other words, value from Big Data
can also be enhanced when we do the same thing as before, but cheaper or more effectively using
advanced analytic models.
Improved Models and Subsequent Time Reductions
As well as offering significant cost reductions, Big Data processing techniques have
helped facilitate vast reductions in the time it takes to complete tasks. For instance, by employing
Big Data analytics and more sophisticated models, Macy’s was able to reduce the time taken to
optimize the pricing of its entire range of products by approximately 73% from over 27 hours to
approximately 1 hour (Davenport and Dyché, 2013). Not only does this substantial time

50
reduction afford Macy’s greater internal efficiency, but it also “makes it possible for Macy’s to
re-price items much more frequently to adapt to changing conditions in the retail marketplace”
(Davenport and Dyché, 2013, p. 5), thereby affording the company a greater competitive edge.
Firms can gain larger shares of their respective markets by being able to make faster decisions
and adapt to changing economic conditions faster than their rivals, making decision-time and
time-to-market reductions in to a significant competitive advantage of using Big Data over
“small”.
The power of Big Data analytics also affords companies greater opportunities to drive
customer loyalty through time reductions in interactions between firms and consumers. For
example, for firms utilizing small data, once a customer leaves their store/facility, they are
unable to sell/market to that person until that individual chooses to return to the store. However,
firms using Big Data technologies are able to exercise much more control over the marketing
process as they can interact with consumers whenever they wish, regardless of whether or not an
individual is specifically looking to buy from their company at any given moment. Many firms
now possess advanced technology that allows them to send e-mails, targeted offers, etc. to
customers and interact with them in real time, potentially impacting customer loyalty.

51
Big Data: Costs and Challenges
Notwithstanding Big Data’s obvious benefits, Big Data potentially poses challenges with
regards to privacy, and (operationally) in the determination of which data to include in the
models’ development process. As the Wall Street Journal notes: “in our rush to embrace the
possibilities of Big Data, we may be overlooking the challenges that Big Data poses – including
the way companies interpret the information, manage the politics of data and find the necessary
talent to make sense of the flood of new information” (Jordan, 2013, para. 2).
For every apparent benefit in using Big Data, there exists a potential challenge. Thus, the
efficacy of data re-use will not necessarily add further value if data loses utility over time. In
addition, increased enterprise efficiency due to reduced costs and decision time has to be
balanced against large investments that are required to develop Big Data infrastructure. Thus,
companies with large investments in Big Data technologies stand to lose their investment and
incur opportunity costs if Big Data does not help them realize their objectives more effectively.
Employing Big Data analytics requires careful cost/benefit analysis with decisions of when and
how to utilize Big Data being made according to the results.
ConceptualIssues: How to Measure the Valueof Data
The market-place is still struggling to effectively quantify the value of data, and since
many companies today essentially consist of nothing but data (e.g. social media websites) it is
increasingly difficult to appraise the net value of firms. Consider Facebook: on May 18th 2012,
Facebook officially became a public company. Boasting an impressive status as the world’s

52
largest social network, on May 17th 2012 Facebook had been evaluated at $38 per share,
effectively setting it up to have the third largest technology initial public offering (IPO) in
history7. If all shares were to be floated, including monetizing those stock options held by
Facebook executives and employees, the company’s total worth was estimated at near $107
billion (Pepitone, 2012).
As is often the case with IPOs, stock prices soared by close to 13% within hours of the
company going public to reach a high of approximately $43. However, within that same day
stocks began to decline, and Facebook stocks closed at the end of the day at just $38.23. Worse
still, Madura (2015) notes that “three months after the IPO, Facebook’s stock price was about
$20 per share, or about 48% below the IPO open price. In other words, its market valuation
declined by about $50 billion in three months” (p. 259).
What was the explanation for such a drastic plunge? To explain it, we must first look at the
company’s valuation using standard accounting practices. In its financial statements for the year
2011, Facebook’s assets were estimated at $6.3 billion, where assets’ values accounted for
hardware and office equipment, etc. Financial statements also include valuations of intangible
assets such as goodwill, patents, and trademarks etc., and the relative magnitude of these
intangible assets as compared with physical assets is increasing. Indeed, “there is widespread
agreement that the current method of determining corporate worth, by looking at a company’s
‘book value’ (that is, mostly the worth of its cash and physical assets), no longer adequately
reflects the true value” (Mayer-Schonberger & Cukier, 2013, p. 118). Herein lies the reason for
the divergence between Facebook’s estimated market worth and its worth under accounting
criteria.
7 Visa has had the largesttech IPO to date, followed by auto maker General Motors (GM).

53
As mentioned previously, intangible assets are generally accepted as including goodwill
and strategy, but increasingly, for many data-intensive companies, raw data itself is also
considered an intangible asset. As data analytics have become increasingly more prominent in
business decision making, the potential value of a company’s data is increasingly taken into
account when estimating corporate net worth. Companies like Facebook contain data on millions
of users – Facebook now reports over 1.11 billion users – and as such each user represents a
monetized sum in the form of data.
In essence, the above example serves to illustrate that as of yet there is no clear way to
measure the value of data. As discussed previously, data’s value is contingent on its potential
worth from re-use and recombination, and there is no direct way to observe or even anticipate
what this worth may be. Therefore, while data’s value may now be increased exponentially as
firms and governments alike begin to realize its potential for re-use, exactly how to measure this
value is unclear.
Recycling Data: Does Data’sValueDiminish?
Previously, it was noted that advanced storage capacities combined with the decreasing
costs of storing data have provided strong incentives for companies to keep and reuse data for
purposes not originally foreseen, rather than discard it after its initial use. It does seem, however,
that the value which can be wrought from data re-use has its limits.
It is inevitable that most data loses some degree of value over time. In these cases
“continuing to rely on old data doesn’t just fail to add value; it actually destroys the value of
fresher data” (Mayer-Schonberger & Cukier, 2013, p.110). As the environment around us is

54
continually changing, newer data tends to outweigh older data in its predictive capacity. This
then raises the question: how much of the older data should be included in order to guarantee the
effectiveness of the experiment?
Consider again Amazon’s personalized book recommendations site. This service is only
representative of increased marketing efficiency if its recommendations adequately reflect the
individual consumer’s interests. A book a customer bought twenty years ago may no longer be
an accurate indicator of their interests, thereby suggesting that Amazon should perhaps exclude
older data from analysis. If this data is included, a customer may see related recommendations,
presume all recommendations are just as irrelevant, and subsequently fail to pay any attention to
Amazon’s recommendations service. If this is the case for just one customer, it may not pose
such a huge problem or constitute a large waste of resources. However, if many customers
perceive Amazon’s service as being of little worth, then Amazon are effectively wasting precious
money and resources marketing to customers who are not paying any attention. This example
serves to illustrate the clear motivation for companies to use information only so long as it
remains productive. The problem lies in knowing which data is no longer useful, and in
determining the point beyond which it begins to diminish the value of more recent data.
In fact, many companies (including Amazon) have now introduced advanced modelling
techniques to resolve these challenges. For example, Amazon can now keep track of what books
people look at, even if they do not purchase them. If these books were recommended to
customers based on previous purchases, Amazon’s models interpret that those previous
purchases are still representative of the consumer’s current preferences. In this way, previous
purchases can now be ranked in order of their perceived relevance to customers, to further
advance the recommendations service. For instance, the system may interpret from your

55
purchase history that you value both cooking books and science fiction, but because you buy
cooking books only half as often as you buy science fiction they may ensure that the large
majority of their recommendations pertain to science fiction (the category which they believe to
be more representative of your interests). Knowing which data is relevant and for how long still
represents a significant obstacle for many companies. However, successful steps to resolve these
challenges can result in positive feedback in the form of improved services and sales.
Big Data and Implicationsfor Privacy
Another issue central to the discussion of Big Data is its implications for peoples’
privacy. The rise of social media in recent years has resulted in a rapid increase in the amount of
unstructured online data, and many data driven companies are using this consumer data for
purposes that individuals are often unaware of. When consumers post or search online, their
online activities are being closely monitored and stored, often without their knowledge or
consent. Even when they do consent to have companies such as Amazon or Google keep records
of their consumer history, they still do not often have any awareness of many potential secondary
uses of this data. At the heart of Big Data privacy concerns are questions regarding data
ownership and use, and the future of Big Data is contingent upon the answers.
As explained previously, data’s value is now more dependent on its cumulative potential
uses than its initial use. Since it is unlikely that a single firm will be able to unlock all of the
latent value from a given dataset, in order to maximize Big Data’s value, many firms license the
use of accumulated data to third parties in exchange for royalties. In doing so, all parties have

56
incentive to maximize the value that can be extracted by means of re-using and recombining
data.
Threats to privacy result from companies which conduct data aggregation on a massive
scale, particularly personal information, and “data brokers” who realize that there is money to be
made in selling such information. For these firms, data is the raw material, and because they
compete on having more data to sell than their competitors, they have incentive to over collect
data. Firms which pay for this information include insurance companies and other corporations
which collect and create “profiles” of individuals in order to establish indicators such as credit
ratings, insurance tables, etc. Due to the large inherent biases in Big Data, it has, for instance,
been shown that these credit reports are often inaccurate, leading some experts to express
concerns that “people’s virtual selves could get them written off as undesirable, whether the
[consumer profile] is correct or not” (White, 2012). Such outcomes have been dubbed by some
as “discrimination by algorithm”. In other words, in Big Data solutions, “data may be used to
make determinations about individuals as if correlation were a reasonable proxy for causation”
(Big Data Privacy, 2013).

57
Cautiously Looking to the Future
As the Big Data movement continues to evolve, questions are emerging regarding its
limitations. Increasingly, Big Data technologies are facilitating the aggregation of ever larger
datasets, but it has yet to be determined whether “N” will ever equal “all”, thus resolving vastly
accumulating biases inside. Furthermore, while it has been shown that employing Big Data
analytics can lead to improved efficiency and better predictions, it has not been “shown that the
benefit of increasing data size is unbounded” (Junqué de Fortuny, Martens, & Provost, 2013, p.
10) Questions still remain concerning whether, given the required scale of investment in data
infrastructure, the return on investment would be positive, and if it is, can it continue to increase
at the rate exceeding the costs of necessary upgrades in infrastructure?
Can N=All?
There is an increasing focus on collecting as much data as possible, and many specialists
are beginning to question whether it may eventually be possible to obtain a theoretically
complete, global dataset. In other words, can N=all?
Aiming for a comprehensive dataset necessitates advanced processing techniques and
storage capacity. In addition, forecasters must have the ability to create and analyze sophisticated
models to obtain meaningful results. Previously, each of these issues presented obstacles to the
progression of Big Data, but as new methods and procedures are developed, “increasingly, we
will aim to go for it all” (Mayer Schonberger & Cukier, p. 31).

58
While striving for a dataset which approaches N=all may appear to be more feasible, it is
questionable whether one can ever obtain a dataset which is equivalent to N=all. For instance,
although it is hypothetically possible to “record and analyse every message on Twitter and use it
to draw conclusions about the public mood... Twitter users are not representative of the
population as a whole” (Harford, 2014). In this case, N=all is simply an illusion. We have N=all
in the sense that we have the entire set of data from Twitter, however the conclusions we are
drawing from this complete dataset pertain to a much broader population. Conclusions regarding
public mood relate to the global population, many of whom do not use Twitter. As discussed
earlier, Big Data is messy and involves many sources of systematic bias, and so while datasets
may sometimes appear to be comprehensive, we must always question exactly what (or who) is
missing from our datasets.
Can Big Data Defy the Law of Diminishing MarginalReturns?
Another important consideration is whether the on-going aggregation of larger datasets can
result in diminishing marginal returns to scale. The law of diminishing marginal returns holds
that “as the usage of one input increases, the quantities of other inputs being held fixed, a point
will be reached beyond which the marginal product of the variable input will decrease” (Besanko
& Braeutigam, 2011, p. 207). We have noted that in the Big Data movement, data has become a
factor of production in its own right. Consequently, it may be the case that after a certain number
of data points, the inclusion of additional data results in lower per-unit economic returns.
In fact, it has been shown that, past a certain level, some firms do experience diminishing
returns to scale when the volume of data is increased. For example, it has been observed that in

59
predictive modelling for Yahoo Movies, predictive performance was observed to increase with
sample size, but it appeared to be increasing at a progressively slower rate. There are several
reasons for this trend. First, “there is simply a maximum possible predictive performance due to
the inherent randomness in the data and the fact that accuracy can never be better than perfect”
(Junqué de Fortuny et al., 2013, p. 5). Second, predictive modelling exhibits a tendency to detect
larger and more significant correlations first, and as sample size increases the model begins to
detect more minor relationships that could not be seen with smaller samples (as smaller samples
lose granularity). Minor relationships rarely add value, and if not removed from modelling they
can result in overfitting.
It is important to note that techniques are not yet sophisticated enough to determine
whether decreasing returns are experienced by all firms in all industries. In addition, it remains to
be seen whether there exists a ceiling on returns. Further research and more advanced procedures
are needed to address these issues.

Al-Mqbali, Leila, Big Data - Research Project

Al-Mqbali, Leila, Big Data - Research Project

Recommended

Recommended

More Related Content

What's hot

What's hot (19)

Similar to Al-Mqbali, Leila, Big Data - Research Project

Similar to Al-Mqbali, Leila, Big Data - Research Project (20)

Al-Mqbali, Leila, Big Data - Research Project