The stuff I am presenting is very much ongoing work. The goal is to give you an impression of the type of CSS (computational social science) we do at GESIS and to throw out some ideas and thoughts about the potential usefulness and limitations of using observational data for tackling social science research questions.
During my PhD I interned at the OU, PARC and HP. My mixture of interests, which ranges from pure CS methods to behavioral analysis of users or user groups, brought me to GESIS, where in 2013 the first CSS department in Europe was founded.
Established in 1986; it's publicly funded. Huge potential of found data for the social sciences (SS). Found data are data which were not generated for a specific purpose (e.g., server logs). Found data are nothing new for the quantitative social sciences, and social scientists have a pretty good understanding of WHEN to use them and WHEN not. For example, when studying alcohol consumption with surveys they will miss teenage drinking, so they study the trash bins near schools to correct the survey results.
Studying online traces to learn sth about the offline world is only part of the CSS story, and it's the part where you have to argue a lot with social scientists. The other part of the story: How does the availability of these data shape our behavior? What are the societal implications of these data and of the algorithms that decide which data is accessible to whom and when? Understanding these questions is important for 2 reasons: 1) if algorithms e.g. reinforce gender biases online, then you want to know that, and maybe new dimensions for evaluating algorithms are needed. 2) We need to understand the bias that algorithms introduce in the data we observe! Project or discussion: Gender Biases in Wikipedia. Wikipedia reflects an unbalanced world, so males have higher indegree, get higher PageRank (PR) and so on. Most algorithms will reinforce the bias by making these pages more visible.
Example: BM paper on ranking algorithms for articles about notable people in Wikipedia. David Lazer's group: personalization of Google searches and price comparison sites.
The 2 research projects I will present both fall into the second category: use online data to understand the offline world.
Food is very central to all human beings and it is affected by many factors, e.g. economics, social factors, cultural factors, biology… Social scientists are mainly interested in it because it affects the health status of society and it helps us to learn about social groups and cultural differences. Most of our food-related preferences are learned through experience. Nowadays people interact with food online a lot. One of the first thing…
Kochbar is 4 times larger than ichkoche. Chefkoch is 5 times larger than ichkoche.
Server log data from the three largest recipe platforms in the German-speaking area. Recipe popularity distribution -> compare it across cities or over time, look at the relation between the weather and what people eat, or at the relation between city (or neighbourhood) characteristics and what people eat.
Then the inferred popularity of tomatoes would be 60k. So that's how we generate an ingredient popularity distribution. The question now is whether this ingredient popularity distribution tells us sth about the taste of the people who generated it, or whether it is just a side product of the ingredient universality distribution, which tells us in how many recipes an ingredient is used. The idea is similar to: if the ingredients used in recipes were randomly selected, what number of shared flavour compounds would we observe, and what do we observe empirically in different cuisines?
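A minimal sketch of the two distributions mentioned above, assuming that an ingredient's popularity is the sum of the views of all recipes containing it and that its universality is the number of recipes using it. The recipe records, field names and view counts are hypothetical toy values, not taken from the server logs.

```python
from collections import Counter


def ingredient_popularity(recipes):
    """Sum each recipe's views over every ingredient the recipe lists."""
    popularity = Counter()
    for recipe in recipes:
        # set() so an ingredient listed twice in one recipe counts once
        for ingredient in set(recipe["ingredients"]):
            popularity[ingredient] += recipe["views"]
    return popularity


def ingredient_universality(recipes):
    """Count in how many recipes each ingredient is used."""
    universality = Counter()
    for recipe in recipes:
        universality.update(set(recipe["ingredients"]))
    return universality


# Hypothetical toy data for illustration only.
RECIPES = [
    {"name": "tomato soup", "views": 10_000, "ingredients": ["tomato", "salt"]},
    {"name": "pasta", "views": 20_000, "ingredients": ["tomato", "pasta", "salt"]},
    {"name": "salad", "views": 30_000, "ingredients": ["tomato", "oil"]},
    {"name": "pancakes", "views": 5_000, "ingredients": ["flour", "salt"]},
]

popularity = ingredient_popularity(RECIPES)
universality = ingredient_universality(RECIPES)
```

In this toy data tomato and salt are equally universal (three recipes each) but differ in popularity, which is the kind of gap the question above is about.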
So one of the first things we did, besides looking at the shape of the distributions, was to analyze how stable the preferences are over time, i.e. how much the popularity ranking changes during the course of a week. We use a top-weighted overlap measure and compare two ranked lists of items! Ingredients tend to be more stable since the head of the distribution does not change: salt, sugar and oil are always the most popular ingredients. However, the recipe popularity changes at the weekend, and some key ingredients change as well.
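The notes do not name the top-weighted overlap measure. One common choice is rank-biased overlap (RBO); the sketch below uses a truncated, normalized variant of it, which is an assumption and not necessarily the measure used in the study. The function name and the persistence parameter `p` are illustrative.

```python
def truncated_rbo(ranking_a, ranking_b, p=0.9):
    """Top-weighted overlap of two ranked lists, scaled to [0, 1].

    At each depth d the share of items common to both top-d prefixes is
    weighted by p**(d - 1), so disagreements near the top cost more.
    """
    depth = min(len(ranking_a), len(ranking_b))
    if depth == 0:
        return 0.0
    seen_a, seen_b = set(), set()
    score = 0.0
    total_weight = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        weight = p ** (d - 1)
        score += weight * len(seen_a & seen_b) / d
        total_weight += weight
    # Dividing by the total weight makes identical lists score 1
    return score / total_weight


# Hypothetical rankings for illustration only.
monday = ["salt", "sugar", "oil", "flour"]
swap_at_top = ["sugar", "salt", "oil", "flour"]
swap_at_bottom = ["salt", "sugar", "flour", "oil"]
```

With these lists a swap at the top lowers the score more than the same swap at the bottom, which is why a stable head keeps the ingredient rankings similar across days.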
Since we know from offline dietary studies that certain types of ingredients are especially popular during the weekend (e.g. meat), we looked at the popularity of different types of ingredients during the course of a week. And we indeed see that people eat more meat on weekends (turkey and chicken are not seen as meat). Carbohydrates trend on workdays, fish on Friday, vegetables at the beginning of the week and alcohol on the weekend. Is this pattern universal?
So we observe a slight shift from weekday towards weekend preferences and a pretty clear cut from Sunday to Monday. Is this pattern universal?
So we picked some ingredients, but what happens if we look at all ingredients?
What the hell are these people in Frankfurt checking out?
Most cities are extremely similar in their recipe preferences!
Of course Germany is special in the sense that it was divided for more than 40 years (1949 until 1989). So if the platform is introducing a very strong bias, we might not even be able to see a difference between East and West, right?
Still there are striking differences between East and West, and at GESIS we actually have the data to compare how attitudes, beliefs and values of Germans in East and West Germany changed over the last 10 years. Surveys started in 1990 and go on until today! So now we can ask whether the dietary preferences are distinct as well.
Federal states within the West are more similar to each other than federal states within the East. Cities across East and West Germany are less similar than cities within either East or West.
Regions in the West are more similar to other West regions than regions in the East are to other East regions. But in general, in-group similarity is on average higher than across-group similarity.
Food is one dimension of culture, since social groups often differentiate themselves by what they eat or don't eat, when they eat or don't eat, and how they prepare their food. Marco Calvo -> situations of migration. Traditional: culture as shared meanings and beliefs; surveys. BUT the researcher himself has a cultural background. Hofstede, or Alavi in 2007: Collective Orientation Scale. Pierre Bourdieu, La Distinction: he argues that taste and related practices are used to differentiate oneself from others.
An alternative perspective on culture was introduced by West and Graham in 2004: they use the origin of language to define cultural distance and found that 40% of the shared values can be explained by language. Another example is the Science article by Michel et al. from 2011, who use digitized books to quantify culture.
Server log data are only available for DE, AT and CH. Wikipedia allows us to observe how language communities preserve their culture online by describing it, and how it is consumed. Every person perceives their own culture and other cultures through their own ethnic and cultural lenses. Therefore biases and misunderstandings may be present on Wikipedia. We hypothesize that those biases and misunderstandings reflect the real world, but also that Wikipedia impacts the real world and has the power to hinder or foster cross-cultural misunderstandings.
We associate cuisines with language communities, e.g. the Spanish-speaking community with the Spanish cuisine (though the South American cuisines too), the German-speaking community with the German, Austrian and Swiss cuisines. The article „Spanische Küche“ ("Spanish cuisine") represents the view of the Spanish cuisine as seen by the German-speaking language group.
3 datasets: view statistics, word count and outlink count (first and second hop) (explain outlinks!). View counts correlate very strongly (above 0.9) with outlink counts. Not every cuisine is described in each Wikipedia.
So we wanted to develop an automated method to quantify the relation between different language communities. How similar are their cultures? How well do they understand the culture of other groups? How much interest do they have in the culture of others? Affinity biases?
What's the cultural similarity between Germany and Italy, and the cultural understanding and cultural affinity from the perspective of the German-speaking population and from the perspective of the Italian-speaking population?
Using outlinks and their overlap. Local and global perspective.
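A rough sketch of an outlink overlap, assuming it is measured as the Jaccard index between the outlink sets of two articles, e.g. two language editions describing the same cuisine. The notes do not say which overlap measure was used, and the link sets below are hypothetical.

```python
def outlink_overlap(outlinks_a, outlinks_b):
    """Jaccard index of two outlink collections (0 = disjoint, 1 = identical)."""
    a, b = set(outlinks_a), set(outlinks_b)
    if not a and not b:
        # Two articles without outlinks: define the overlap as zero
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical outlinks of articles about the Spanish cuisine,
# already mapped to language-independent concept names.
GERMAN_VIEW = {"Paella", "Tapas", "Olive oil", "Jamon"}
SPANISH_VIEW = {"Paella", "Tapas", "Olive oil", "Gazpacho", "Tortilla"}

overlap = outlink_overlap(GERMAN_VIEW, SPANISH_VIEW)
```

Comparing links across language editions requires mapping them to shared concepts first (for example via interlanguage links); that step is assumed to have happened before this function is called.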
On average each country is 1.5 times more similar to its neighbours than to distant ones. We also set up a CrowdFlower task and asked crowdworkers to assess which pair of cuisines is more similar. We only compared top-ranked versus low-ranked pairs and got more than 99% correct. We had 10 judgements per pair and 225 pairs.
Italian, Swedish, German. The Swedish community would understand the Italian cuisine better than the German community does.
Small overlaps: most food cultures are badly understood. Some food cultures, like those of France, Italy and Turkey, are however better understood by most language communities. Italian and French cuisine are famous. But what about Turkey? The good understanding of the Turkish cuisine by different language communities can be explained by the fact that many immigrants from Turkey moved to EU countries. E.g. in Germany, where the largest number of non-nationals live, 22% of them are Turkish residents. Also in the Netherlands and Denmark most immigrants are Turkish. Besides Turkey, also Romania, Poland, Estonia and Russia have many residents living in other European countries.
The largest numbers of non-nationals living in the EU on 1 January 2013 were found in Germany (7.7 million persons), Spain (5.1 million), the United Kingdom (4.9 million), Italy (4.4 million) and France (4.1 million).
External sources: no proper ground truth. We assumed that cultural understanding can have two explanations: cultural understanding due to cultural similarity, and cultural understanding due to exchange (migration). Migration data come from the Global Bilateral Migration Database. Migration seems to explain cultural understanding better than cultural similarity does. External sources do not correlate well with each other. Not valid?
F(l,o): how often does a language community l view an object o, compared to how much it views all other objects? Normalized by the popularity of the objects.
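One possible reading of F(l,o), sketched under assumptions: the share of a community's views that goes to an object, minus the share that object gets across all communities. The exact normalization is not given in the notes (a ratio or log-ratio would be an equally plausible reading), and the view counts below are hypothetical.

```python
def affinity(views, community, obj):
    """Observed minus expected view share of `obj` for one community.

    `views` maps community -> {object: view count}. Values near zero mean
    no special affinity; positive values mean more attention than the
    object's global popularity would suggest.
    """
    own_total = sum(views[community].values())
    all_total = sum(sum(counts.values()) for counts in views.values())
    obj_total = sum(counts.get(obj, 0) for counts in views.values())
    observed = views[community].get(obj, 0) / own_total
    expected = obj_total / all_total  # normalizes by global popularity
    return observed - expected


# Hypothetical view counts of cuisine pages per language community.
VIEWS = {
    "de": {"italian": 60, "turkish": 40},
    "sv": {"italian": 40, "turkish": 60},
}

de_italian = affinity(VIEWS, "de", "italian")
```

In this toy example the German-speaking community gives 60% of its views to the Italian cuisine while the global share is 50%, so the score is positive; the Swedish-speaking community gets the mirror-image negative score.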
We compared the lists of country pairs ranked by their Wikipedia affinity values and by their Eurovision affinity values. We found a correlation of 0.25. One can see that the distribution of affinity values is also different. Most affinities which we infer from Wikipedia are around zero (no special affinities), with some strong positive exceptions.
So what role does affinity play on Wikipedia? How can we reproduce the cross-cultural affinity score distribution which we inferred from Wikipedia? IDEA: simulate the cross-cultural attention process (view statistics) and infer affinity values from these synthetic view data. 2 simulation models which make different assumptions about what drives the cross-cultural attention process. The popularity model assumes that the global popularity of countries/cuisines is the only factor which impacts how each agent decides to distribute its attention. The popularity-affinity model additionally includes directed, edge-specific weights. We use both models to generate synthetic cross-country click distributions and compute affinity values for the synthetic click distributions. We then compare the distributions of affinities obtained from the empirical data with the synthetic ones.
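A rough sketch of how the two simulation models could look. Everything concrete here is an assumption: popularity decays exponentially with rank, the edge-specific weight is a normally distributed value applied multiplicatively via `exp`, and each community draws a fixed number of views. Function and parameter names are hypothetical.

```python
import math
import random


def popularity_weights(n_targets, lam):
    """Exponentially decaying popularity over targets ordered by rank."""
    raw = [math.exp(-lam * rank) for rank in range(n_targets)]
    total = sum(raw)
    return [w / total for w in raw]


def simulate_views(n_communities, n_targets, lam, affinity_sd=0.0,
                   views_per_community=10_000, seed=0):
    """Return views[community][target] as synthetic view counts.

    affinity_sd == 0 gives the popularity model; affinity_sd > 0 adds a
    directed, edge-specific weight (popularity-affinity model).
    """
    rng = random.Random(seed)
    popularity = popularity_weights(n_targets, lam)
    views = []
    for _ in range(n_communities):
        # Assumption: affinity a ~ N(0, sd) rescales popularity by exp(a)
        weights = [p * math.exp(rng.gauss(0.0, affinity_sd))
                   for p in popularity]
        counts = [0] * n_targets
        for target in rng.choices(range(n_targets), weights=weights,
                                  k=views_per_community):
            counts[target] += 1
        views.append(counts)
    return views


popularity_only = simulate_views(3, 5, lam=1.0)
with_affinity = simulate_views(3, 5, lam=1.0, affinity_sd=0.5)
```

Affinity scores would then be computed from these synthetic counts in the same way as from the empirical view statistics.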
Models generate synthetic cross-cultural view data and infer synthetic affinity scores. Compare the distribution of the synthetic scores with that of the empirical scores using a divergence. If the divergence (y-axis) is ZERO we have a perfect fit: the 2 distributions are the same. Popularity of countries/cuisines follows an exponential distribution. Lambda describes the skewness of the distribution. If lambda is too big (i.e. the distribution is too skewed) the fit is bad. That means an exponential with lambda <= 1 is best: some cuisines are more popular, but not much more. Popularity-affinity model: in addition to the global popularity we introduce affinities between countries which follow a normal distribution. Most countries have no special affinities (i.e. are around zero), with some outliers.
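The notes do not say which divergence is plotted. As an illustration only, the sketch below bins two score samples into histograms and compares them with the Jensen-Shannon divergence (base 2), which is zero for identical distributions and one for non-overlapping ones. The bin settings and sample values are hypothetical.

```python
import math


def histogram(values, bins, lo, hi):
    """Relative frequencies of `values` (non-empty) in equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Values outside [lo, hi) are clamped into the outermost bins
        idx = min(max(int((v - lo) / width), 0), bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence of two discrete distributions, in bits."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


# Hypothetical affinity scores for illustration only.
empirical = [0.0, 0.01, -0.02, 0.15, 0.0, 0.03]
synthetic = [0.0, 0.02, -0.01, 0.01, 0.0, -0.03]
fit = js_divergence(histogram(empirical, 10, -0.2, 0.2),
                    histogram(synthetic, 10, -0.2, 0.2))
```

A lower value means the simulated affinity distribution is closer to the empirical one.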
Most countries have a strong self-focus bias, especially when you look at views. Around half of our language communities show a slight positive regional bias and half show a slight negative one. But the affinity values are rather low and close to zero.
Maybe we need to accept the fact that we have to come up with our own scales which describe what we can measure online in a platform-independent way. And maybe we also need to acknowledge the fact that each platform introduces a bias which we need to understand, or we need at least several platforms to make sure that we are fine.
Food and Culture
GESIS & University of Koblenz
6nd Nov 2014, Yahoo Labs, Spain
Research and Services at GESIS
Survey Design and
Raise the standards of surveys
at all phases of the survey life
Gender studies, Political
science (e.g., GLES),
Values and Attitudes
research (e.g. ALLBUS),
18.11.2014 Claudia Wagner 3
CSS Agenda @GESIS
Support traditional Social Science research with
computational methods and tools
Develop new instruments to tap into the potential
of found data and crowds building a telescope
for the Social Sciences
Online impacts offline! Build new algorithms and
tools to shift the current configurations of
societies towards better futures.
~ 470k Unique Users
~1 Mil. Page Impressions per week
– 2,27 Mil. Unique User in July 2014
– 1.29 million Visits (12.1 Mio. PI) in December 2008
– 11,05 Mil. Unique User in July 2014
– 28 Mio. Visits and 242 Mio. PI in December 2010
What may explain Cultural
• Create for each country a
list of countries ranked
by where most of its
immigrants come from
• Create for each country a
list of countries ranked
by how similar their
values and beliefs are
according to ESS
Pair ρ (p-value)
wiki – ess 0.18 (0.00019)
wiki – migration 0.36 (1.74e-22)
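A minimal sketch of how a rank correlation between two country rankings could be computed, assuming ρ in the table is Spearman's rank correlation (the slide does not say so explicitly). The p-values are not reproduced here, and the example scores are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores of four country pairs under two measures.
wiki_scores = [0.9, 0.4, 0.7, 0.1]
migration_scores = [120, 30, 45, 5]
rho = spearman_rho(wiki_scores, migration_scores)
```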
• View statistics of cuisine pages in different
• How much more attention than we would expect
does language community A pay to the culture of
de/tr/German Croatian (+0.1464)
de/tr/French Serbian (+0.0850)
de/tr/Italian Polish (+0.0051)
Popularity Model Popularity-Affinity Model
• Affinities between language communities are present in
Wikipedia and drive the attention process
• Cultural understanding can to some extent be explained by
• Cultural similarities inferred from Wikipedia are pretty
• Relation between similarity, understanding and affinities?
– Understanding and affinity: -0.35
– Similarity and affinity: 0.27
– Similarity and understanding: 0.19