Although the existence of various forms of complexity in social systems is now widely recognised, the complexity approach to explanation faces two major challenges that turn out to be intimately connected. The first is the existing conflict in social science between “micro” and “macro” styles of social explanation. The second is the relationship of complexity to the kind of data routinely collected in social science. To be accepted, complexity approaches need to sidestep the first conflict while making much better use of existing forms of data.
The first part of the talk will provide an introduction to the simulation approach and a discussion of various concepts in complexity with reference to simulation as a distinctive theory-building tool and methodology. The second part of the talk will develop these ideas in more depth using simulations by the author as case studies.
The Complexity of Data: Computer Simulation and “Everyday” Social Science
1. DEPARTMENT OF SOCIOLOGY
The Complexity of Data: Computer Simulation and “Everyday” Social Science
Edmund Chattoe-Brown
ecb18@le.ac.uk
2. Plan of talk
• Simulation as a confusing term.
• A simple (but revealing) example.
• The importance of data collection: Simulation
methodology.
• Where does complexity fit into all this?
• A more challenging example: DrugChat.
• Conclusions.
3. Simulation as a confusing term
• Not “gaming” or “role playing”: Student United Nations.
• Not system dynamics, discrete event simulation, analogue
simulation and so on, though these are ancestors.
• Not simulation as discussed by Bourdieu, whatever that is.
• Instrumental versus descriptive simulation: Not just a technical
tool (doing the same sums quicker) but a distinctive way of
understanding (explaining) social behaviour.
• A social process described as a computer programme rather
than a narrative or a statistical/mathematical model.
• Other disciplines, other approaches: Experiments, time series,
documents/content analysis, GIS.
4. Spatial segregation (Schelling)
• Agents live on a square grid (like a US city) so each has a maximum of
eight neighbours.
• There are two “types” of agents (red and green) and some spaces in the
grid are vacant. Initially agents and vacancies are distributed randomly.
• All agents decide what to do in the same very simple way.
• Each agent has a preferred proportion (PP) of neighbours of its own kind
(a PP of 0.5 means you want at least half your neighbours to be your own
kind, but you would be happy with all of them being so, i.e. PP is a minimum).
• If an agent is in a position that satisfies its PP then it does nothing
otherwise it moves to an unoccupied position chosen at random.
• A time period is defined as the time it takes for each agent (chosen in
random order to avoid non-robust patterns) to “take a turn” at deciding and
possibly moving.
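The rules above can be sketched in a few lines of Python. Grid size, density and the PP value here are illustrative choices, not figures from the talk:

```python
import random

# A minimal sketch of the Schelling rules described on this slide.
SIZE = 20          # 20 x 20 grid, so each agent has at most 8 neighbours
DENSITY = 0.8      # fraction of cells occupied (illustrative)
PP = 0.5           # minimum preferred proportion of like neighbours

def make_grid():
    cells = [None] * (SIZE * SIZE)          # None marks a vacancy
    n_agents = int(SIZE * SIZE * DENSITY)
    agents = ['red'] * (n_agents // 2) + ['green'] * (n_agents - n_agents // 2)
    cells[:n_agents] = agents
    random.shuffle(cells)                   # random initial distribution
    return cells

def neighbours(i):
    x, y = i % SIZE, i // SIZE
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if (dx, dy) != (0, 0) and 0 <= x + dx < SIZE and 0 <= y + dy < SIZE:
                yield cells[(y + dy) * SIZE + (x + dx)]

def satisfied(i):
    same = other = 0
    for n in neighbours(i):
        if n == cells[i]:
            same += 1
        elif n is not None:
            other += 1
    # PP is a minimum; an agent with no neighbours counts as satisfied.
    return same + other == 0 or same / (same + other) >= PP

cells = make_grid()
for period in range(50):
    order = [i for i, c in enumerate(cells) if c is not None]
    random.shuffle(order)                   # random order avoids artefacts
    for i in order:
        if not satisfied(i):
            j = random.choice([k for k, c in enumerate(cells) if c is None])
            cells[j], cells[i] = cells[i], None   # move to a random vacancy
```

Varying PP in a loop over runs of this sketch is enough to explore both questions on the next slide.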
6. Two questions
• What is the smallest PP (i.e. a number between 0 and 1) that will produce clusters?
• What happens when the PP is 1?
7. Simple individuals but complex system
[Figure: “Individual Desires and Collective Outcomes” – % similar wanted
(individual) on the x-axis against % similar achieved (social) on the y-axis,
with two series: “% similar” and “% unhappy”.]
Counter-intuitive macro (social) results from simple micro interactions. A
non-linear system.
8. Deconstructing this example
• Clearly unrealistic in many senses: Property values, decision processes,
unstructured space, communication, neighbourhood knowledge.
• However, not unrealistic in important sense that simulation contains no
arbitrary parameters and agents operate on plausible local knowledge.
The only “parameters” in the model are individual PP values (measured
by experiment? Already in surveys: Mare.)
• The simulation also generates unintended consequences (PP=1) and
patterns that were not “built in”. For example, is the distribution of empty
sites random or buffering? This emergence allows the possibility of
genuine falsification and has heuristic fertility: What does compatibility of
desires mean? When does it occur?
• We need two sorts of data: Quantitative (what patterns are we trying to
explain?) and qualitative (what social processes create these?)
9. Aside …
• It is very clear that we need the “complexity approach” because
we are not very good at deducing how complex systems work “in
either direction” (micro to macro or vice versa).
• But what is the complexity approach in this context? Is it a set of
methods, a set of subject areas, a family of interesting
models/results, a way of looking at problems or all of the above?
• How does “the complexity approach” compare with “the sociology
approach” or “the physics approach?”
• Should complexity be more than simulation calibrated on real
data? If so, what?
• IMO, the main problem with complexity is “where’s the data?”
10. Quantitative data collection approach
• Collect survey data: Cross sectional, time series or whatever.
• Choose a model and accept/reject it on grounds of statistical fit.
• Model coefficients are “results” conditional on acceptable model.
• In what sense do models explain observed patterns? (If we find a
correlation between income and academic success of a particular
size, what have we really learnt?)
• Technical problems: Explanatory range depends on sample size.
• Basic problem doesn’t go away even with “fancier” techniques
like time series/multi-level modelling: A description isn’t an
explanation.
• Rarely heuristically fertile.
11. Deriving a quantitative coefficient
[Figure: number of strikes (units, roughly 50–80) plotted against
unemployment (millions, roughly 1–2).]
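The derivation this slide gestures at is just an ordinary least squares slope; a minimal version, with strike/unemployment numbers invented purely for illustration:

```python
# Hypothetical data: unemployment (millions) and strikes per year.
# These numbers are invented for illustration only.
unemployment = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
strikes = [52, 58, 61, 68, 73, 79]

def mean(xs):
    return sum(xs) / len(xs)

def ols_slope(xs, ys):
    # Ordinary least squares slope: cov(x, y) / var(x).
    mx, my = mean(xs), mean(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
            sum((x - mx) ** 2 for x in xs))

slope = ols_slope(unemployment, strikes)
intercept = mean(strikes) - slope * mean(unemployment)
print(f"strikes = {intercept:.1f} + {slope:.1f} x unemployment")
```

Whatever the fitted numbers turn out to be, the previous slide’s point stands: the coefficient describes the association; it says nothing about the process that produced it.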
12. Quantitative example
• “The most important empirical findings of this study can be summarized as
follows:
• … there is a moderate tendency for individuals with higher service class origins
to be more likely than others to enrol in PhD programmes.
• …
• The estimated effect of class drops to zero when controlling for parents’
education and employment in research or higher education.
• The overall implication of these findings is that the transition from graduate to
doctoral studies is influenced by social origins to a considerable degree. Thus,
the notion that such effects disappear at transitions at higher educational levels
- due either to changes over the life course or to differential social selection - is
not supported.” (Mastekaasa, Acta Sociologica, 2006, 49(4), pp. 448-449.)
13. Translating back into simulation …
• Agents start with particular attributes (like being red or green and
having a particular PP in Schelling). These might include things
like IQ and motivation.
• They undergo a long sequence of social interactions in
institutional contexts, being influenced by parents, peers and
teachers in classroom, playground, public library and so on.
They also make choices and operate within institutional contexts
(like rules for “streaming” or school allocation by catchment).
• The quantitative approach described here tries to link “late”
attributes (starting a PhD) to “early” ones (parental occupation)
in the hope that regularities in social life support this.
• Is parental occupation an attribute or a process?
14. Qualitative data collection approach
• Collect data (cognitive, behavioural, structural) by observation
and questioning.
• Try (though surprisingly rarely) to induce a pattern from the data: The
example of the “addiction cycle”; compare with accounts of drug use based
on amount (frequency) and type.
• Result is rich coherent narrative(s): What heroin addiction means
from the inside and in a particular context.
• Are the results generalisable? (What is N?)
• Can we correctly envisage the overall consequences of complex
social interaction sequences presented using narratives?
(Compare Schelling case again.)
• Often heuristically fertile (“addiction cycle”).
15. Qualitative example
• “Turkish interviewees do not include themselves when they are evaluating the
status of ‘Turkish women’ in general. While referring to ‘Turkish women’, most
Turkish interviewees use the pronoun ‘they’:
• Turkish women are more home-oriented. I think that they are left in the backstage
because they do not have education, because they are not given equal opportunities
with men. (T3)
• One of the Turkish interviewees stated that it was difficult for her to answer the
questions related to her status ‘as a woman’, because:
• I don’t think of myself as a Turkish women, but as a Turkish person. I mean I never
think about what kind of role I have in the society as a woman. (T1)
• Most Norwegian interviewees, on the other hand, identify with ‘Norwegian
women’ in general, and they refer to ‘Norwegian women’ as ‘we’:
• I think that in a way Norwegian women, that is we, at least have our rights on paper.
We have equal rights for education and we have good welfare arrangements … (N1)”
(Sümer, Acta Sociologica, 1998, 41(1), p. 122)
16. Translating back into simulation …
• Agents choose “appropriate” actions on the basis of perceived
identity.
• A range of identities is “given” to agents by biological difference
(skin colour) and social structure (“mother”, “worker”).
• Identities are made more salient by patterns of social interaction
and socialisation. For example, perhaps a Turkish upbringing
stresses female identities that are traditional (mother) or liberal
(worker) and de-stresses the existence of a separate “woman’s
identity” while a Norwegian upbringing stresses that identity as
the underpinning of both work and child-rearing.
• Clearly this simulation needs to be much more cognitive,
contextual and detailed than the Schelling example.
17. What is going on here?
• Qualitative research tells us how people interact and make
decisions within environments but can’t usually tell us what
large scale patterns result.
• Quantitative research tells us what the large scale patterns are
but may not really explain them. (Inability to reason about
complexity may result in naïve attribution, i.e. treating clusters as
evidence of xenophobia.)
• Simulation shows how we might bridge the gap between the
levels of description with a “generative” social theory
expressed as a computer programme. (Coleman “boat”.)
18. How are we doing with complexity?
• Large number of elements which interact dynamically.
• Interaction rich (mutual influence between significant numbers of
elements).
• Non-linearity.
• Interaction short range and each element ignorant of the behaviour of the
system as a whole. [2OE on clusters?]
• Interaction loops.
• Open system far from equilibrium requiring energy input. [?]
• Has a history.
• Source: Compressed losslessly from Cilliers, Complexity and
Postmodernism, pp. 3-5.
19. Different kinds of “difficulty”
• Difficult patterns: Chaos, self-organised criticality.
(Mathematical strand: We are studying formal systems,
we don’t need data.)
• Difficult mental processes: Reflexivity, self-awareness,
subconscious motives. (Social theory strand: We are
too embedded in these systems and our reflections on
them to bracket anything off as objective data.)
• Difficult social systems: Rich context, negotiated roles,
complex artefacts. (Ethnographic strand: The world is
too complex for general theories.)
20. Degrees of similarity in Schelling
• Predict exact positions of clusters?
• Predict that there will be clusters at all?
• Predict spatial stability of clusters?
• Predict the size distribution (or separation) of clusters?
• Predict (for three “types”) that clusters will be separated/nested?
• Predict that most cosmopolitan agents will form perimeters of
clusters?
• Predict that empty sites will be randomly distributed for
cosmopolitan agents but form buffer zones for more xenophobic
agents? (“Looking at the holes”: A heuristic idea, “vacancy
chains”.)
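One way to make a test like the size-distribution prediction operational is to measure clusters on the grid directly with a flood fill. The toy grid below is invented (R and G are the two agent types, “.” a vacancy), and 4-connectivity for cluster membership is a modelling choice:

```python
from collections import deque

# Measure the cluster size distribution on a Schelling-style grid.
grid = ["RR.GG",
        "RR.GG",
        ".....",
        "GG.RR",
        "G..RR"]
H, W = len(grid), len(grid[0])

def cluster_sizes():
    seen, sizes = set(), []
    for y in range(H):
        for x in range(W):
            if grid[y][x] == '.' or (x, y) in seen:
                continue
            kind, size = grid[y][x], 0
            queue = deque([(x, y)])
            seen.add((x, y))
            while queue:                    # flood-fill one cluster
                cx, cy = queue.popleft()
                size += 1
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nx, ny = cx + dx, cy + dy
                    if (0 <= nx < W and 0 <= ny < H and
                            grid[ny][nx] == kind and (nx, ny) not in seen):
                        seen.add((nx, ny))
                        queue.append((nx, ny))
            sizes.append(size)
    return sorted(sizes)

print(cluster_sizes())   # [3, 4, 4, 4] for this toy grid
```

The same machinery extends to “looking at the holes”: classify each vacancy by the types adjacent to it to test the buffer-zone prediction.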
21. Ideal simulation methodology
• Choose a target system: Ethnic segregation in cities.
• Build a simulation of the target system and calibrate it, typically on
micro level data: Ethnography and experiments? How do agents
make relocation decisions and where do they go?
• Run simulation and look for regularities and their preconditions:
Do we observe clusters (always, never, only with high PP, fixed,
identical, moving) and buffer zones?
• Compare these regularities with statistical data on real residential
patterns. What effective similarity tests do we have?
• If there is a “good” match then we haven’t yet falsified the claim
that the simulation “generates” the target system and therefore
explains it (a progressive process of course).
23. Case Study I: DrugChat
• A reimplementation of Agar’s DrugTalk for the
DTI Foresight Programme.
• Based on ethnographic data but generates
some qualitatively realistic aggregate data.
• Problematises both the “attribute” based
approach to social regularity and the “transition
probability” based approach to modelling.
24. Assumptions I
• Networks: Many have few ties and few have many.
• Types: Non-users, users and addicts. (Distinguished by
patterns of behaviour not level of use.)
• Choice based on attitudes to risk (fixed and normally
distributed around 50) and to drugs (varies by
experience and social influence initialised at 50).
• System driven by “arrival” of drug doses: Addicts get
few doses with high probability, users get more doses
with lower probability and non-users get few doses with
very low probability.
25. Assumptions II
• Choice simply compares ATR and ATD (but addicts
don’t choose).
• “Stash”: Users share all bar one dose with friends
(“partying”) while addicts don’t share.
• Drug use experience evaluated on each dose and can
be good and bad. Counts kept of these update ATD.
Early experiences have more impact than late ones
and bad experiences more impact than good.
• After 5 doses, addiction occurs (physiology).
26. Assumptions III
• Addict communication is ignored but status as addicts
has strong negative effect on friends.
• Current users have a direct “congruence” influence via
drug experience (good or bad).
• Non current users and non users only influence slightly
through “gossip” - telling “drug stories” (total counts of
good and bad experiences across all friends used to
update ATD).
• Clearly a complicated system: Is it a complex one?
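The core choice and attitude-update rules from the three “Assumptions” slides can be sketched as follows. The standard deviation of ATR, the odds of a good experience and the exact weighting scheme are guesses where the slides do not fix them:

```python
import random

# A sketch of one DrugChat agent's choice and attitude-update rules.
class Agent:
    ADDICTION_THRESHOLD = 5                 # doses before addiction (physiology)

    def __init__(self):
        self.atr = random.gauss(50, 10)     # attitude to risk: fixed, around 50
        self.atd = 50.0                     # attitude to drugs: initialised at 50
        self.doses_taken = 0
        self.good = self.bad = 0            # running counts of experiences

    @property
    def addict(self):
        return self.doses_taken >= self.ADDICTION_THRESHOLD

    def offered_dose(self):
        # Choice simply compares ATD and ATR - but addicts don't choose.
        if self.addict or self.atd > self.atr:
            self.take_dose()

    def take_dose(self):
        self.doses_taken += 1
        good_experience = random.random() < 0.5     # illustrative odds
        if good_experience:
            self.good += 1
        else:
            self.bad += 1
        # Early experiences have more impact than late ones, and bad
        # experiences more impact than good (here: double weight).
        weight = 10.0 / self.doses_taken
        self.atd += weight if good_experience else -2.0 * weight
```

Sharing (the “stash”), gossip and the negative influence of addict status would sit on top of this, over the skewed network described in Assumptions I.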
29. Reading these outputs
• Producing an “S curve” is very weak support for the
simulation assumptions. Too many other assumption
sets produce it too. (Back to issue of qualitative
similarity.)
• Because this simulation is only broadly empirical, the
failure to predict user status on ATR does not
“disprove” the statistical approach. It only shows how
systems at a particular level of “complicatedness” (in
fact not very high) may break down relationships
between attributes which statistical approaches rely on.
30. Aside …
• The Caulkins model also has three states: User, non-
user and addict and assumes that there are fixed
transition rates between states.
• These TRs are for NU to U, U to A, U to NU and A to
NU. The only behavioural restriction on the TRs is that
A to NU is assumed to be smaller than U to NU.
• This model is fitted to real data.
• What happens if we use the DrugChat simulation to
calculate transition probabilities of the Caulkins kind?
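If each simulated agent’s state (NU/U/A) is logged per period, Caulkins-style transition rates are just normalised counts. A sketch, with invented state histories:

```python
from collections import Counter

# Estimate transition rates between drug-use states from state histories.
# NU = non-user, U = user, A = addict; sequences invented for illustration.
histories = [
    ["NU", "NU", "U", "U", "A", "A"],
    ["NU", "U", "U", "NU", "NU", "NU"],
    ["NU", "NU", "NU", "U", "U", "A"],
]

def transition_rates(histories):
    moves, starts = Counter(), Counter()
    for h in histories:
        for a, b in zip(h, h[1:]):
            starts[a] += 1
            moves[(a, b)] += 1
    # Rate = observed transitions from a to b / periods spent in state a.
    return {(a, b): n / starts[a] for (a, b), n in moves.items() if a != b}

rates = transition_rates(histories)
```

Re-estimating these rates over successive windows of a DrugChat run is what reveals that they do not stay constant.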
32. Reading this output
• Again, DrugChat is not calibrated well enough to prove that it
is “right” and Caulkins et al. are “wrong”.
• However, this output (not only are transition probabilities not
constant but they change sign!) does suggest that constant
transition probabilities are not likely to be a very effective
approximation in social systems with even a rather low level of
“complicatedness”. (The Caulkins model doesn’t even work in
the simplified world of DrugChat.)
• Should we start asking questions about how likely different
approaches are to work and how we would go about
establishing this? (Hendry and model reductions.)
34. Reading this output
• Initially there is little information in the system. ATD=50.
• Then the agent has two bad experiences with drugs.
• By then, much gossip and experience is reporting good
things about the drug which is true “on average” before
its addictive nature is recognised.
• This promotes more use, each time with mixed results.
• Unfortunately by this point, addiction has kicked in.
• This particular agent becomes addicted despite several
bad drug experiences via social influence.
35. What are we doing here?
• Collecting different kinds of data from the simulated
system which can be compared not only with real data
but with underlying assumptions of various theoretical
approaches (simple statistical models, models based
on “stocks and flows”). Access to multiple kinds of data
allows stronger falsification of methods and models.
• Reflecting (at least broadly) on where we might get the
kinds of data we need to calibrate the model properly
(behavioural, cognitive, physiological, institutional,
structural) within the context of existing methods.
36. Why is this a good idea?
• Simulated systems recognise and can represent
different kinds of social “difficulty” - which may include
various things people intend by complexity (reflexivity,
chaotic output) but also make their “ontological” status
clearer. (Is this “difficulty” in the heads of individuals, in
their processes of interaction or what?)
• However, unlike a lot of complexity theory (albeit for
different reasons) there is an “old fashioned”
commitment to integrating data and theory and to
explaining across levels of description. This may work
better using the new approach too.
37. Conclusions
• Complexity needs to think very carefully about what “kind of
thing” it is if it is going to survive after the “fad” phase.
• Simulation has tools to offer the approach which (at least in
principle) tap into the methods and data of social science. (I
haven’t talked about the physical sciences but I think some of
the same arguments go through.)
• Simulation of Innovation: A Node (SIMIAN): ESRC funded under
NCRM for three years with Professor Nigel Gilbert (Sociology @
Surrey) to train and do methodologically innovative research. A
good time for collaboration?
38. Now read on?
• Gilbert and Troitzsch (2005) Simulation for the Social Scientist, second edition
(Open University Press). [Examples/resources online. All examples in
NetLogo.]
• J. Artificial Societies and Social Simulation: http://jasss.soc.surrey.ac.uk/
[Free, fully peer reviewed, interdisciplinary and only online.]
• Chattoe (2006) ‘Using Simulation to Develop and Test Functionalist
Explanations: A Case Study of Dynamic Church Membership’, British Journal
of Sociology, 57(3), September, pp. 379-397.
• Chattoe and Hamill (2005) ‘It’s Not Who You Know – It’s What You Know
About People You Don’t Know That Counts: Extending the Analysis of Crime
Groups as Social Networks’, British Journal of Criminology, 45(6), pp. 860-
876.
• Chattoe, Hickman and Vickerman (2005) Foresight: Drugs Futures 2025?
Modelling Drug Use, Office of Science and Technology, Department of Trade
and Industry. [Available from the presenter or online.]