Geographic and linguistic normalization opensym2014 poster

Han-Teng Liao successfully defended his PhD
at the Oxford Internet Institute (OII) in July 2014.
His research focuses on user-generated content
and data, Web analytics (webometrics), Chinese
Internet research and integrated digital research
designs (both qualitative and quantitative).
Thomas Petzold is a social technology analyst, TED
speaker and professor of media management at HMKW
– University of Applied Sciences for Media,
Communication and Management in Berlin, Germany.
As a research fellow at the WZB (2011–2013), he led a
project on languages and big data in social technology.
[photo: David Ausserhofer]
Abstract
What is Data Normalization?
Finer normalization: geolinguistic unit
A language tag:
• often starts with a language code followed by a country code
• e.g. “fr‐CA” = the geolinguistic unit of French used in Canada.
• has corresponding data points in Unicode’s Common Locale
Data Repository (CLDR) project.
• e.g. “fr‐CA” = a language population of 7,605,004 [12]
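As a minimal sketch of how a language tag maps to a geolinguistic unit, the snippet below splits a tag into its language and territory parts and looks up the language population. The function name `parse_language_tag` and the lookup table are hypothetical; only the fr-CA figure comes from the poster [12], and a real analysis would load the full CLDR data instead.

```python
def parse_language_tag(tag):
    """Split a simple tag such as 'fr-CA' into (language, territory).

    Hypothetical helper for illustration; it handles only the common
    language[-script]-territory shape, not the full tag grammar.
    """
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    territory = None
    for sub in parts[1:]:
        # A territory subtag is two letters (e.g. CA) or three digits (e.g. 419).
        if (len(sub) == 2 and sub.isalpha()) or (len(sub) == 3 and sub.isdigit()):
            territory = sub.upper()
            break
    return language, territory


# Language population per geolinguistic unit.
# The fr-CA value is the one quoted in the poster [12]; the table is otherwise empty.
LANGUAGE_POPULATION = {("fr", "CA"): 7_605_004}

unit = parse_language_tag("fr-CA")
print(unit, LANGUAGE_POPULATION[unit])
```

Tags that carry a script subtag, such as "zh-Hant-TW", still resolve to a language and territory pair because the script part is skipped.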
Finer geolinguistic data normalization is useful …
• for finer comparison between, say, Egyptian Arabic and Saudi
Arabian Arabic speakers, or between speakers of Spanish in Spain
and in Mexico
• for analysts or designers to better know, and thus support, their
users by providing appropriate interfaces and content [7]
• for better understanding of the Wikipedia traffic data
References (partial: those mentioned in this poster)
[1] American Planning Association 2006.
Planning and Urban Design Standards.
John Wiley & Sons.
[2] Cote, P. Effective Cartography:
Mapping with Quantitative Data. Harvard
Graduate School of Design.
[3] Crowston, K. et al. 2013. Sustainability
of Open Collaborative Communities:
Analyzing Recruitment Efficiency.
Technology Innovation Management
Review. January: Open Source
Sustainability (2013).
Acknowledgments
We thank Wikimedia UK for the scholarship that enabled Han‐Teng
Liao to present the findings at OpenSym 2014. We also
acknowledge the open source tool Scrapy, which made the
web mining tasks easier.
Data normalization, or geographic normalization, allows data to be
compared using a sensible common denominator, thereby
producing measurements of intensity or density, such as
population density [1, 2].
Data normalization is useful …
• in “factoring out the size” in order to facilitate comparisons
across unequal areas or populations [2]
• in dividing one numeric attribute (e.g. GDP)
by another (e.g. population)
so as to derive a third numeric attribute (e.g. GDP per capita)
• in minimizing the differences caused by the size of a geographic
unit
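The division described above can be sketched in a few lines. The function name `normalize`, the unit names and all figures below are hypothetical and chosen only to show how dividing by size changes the ranking of two units.

```python
def normalize(numerator, denominator):
    """Divide one attribute by another, unit by unit.

    Units with a missing or zero denominator are skipped rather than guessed.
    """
    result = {}
    for unit, value in numerator.items():
        size = denominator.get(unit)
        if size:
            result[unit] = value / size
    return result


# Hypothetical figures, for illustration only.
gdp = {"A": 1_000_000.0, "B": 4_000_000.0}
population = {"A": 100, "B": 800}

gdp_per_capita = normalize(gdp, population)
print(gdp_per_capita)
```

In this made-up example unit B has the larger total but unit A has the larger per-capita value, which is the kind of difference that "factoring out the size" is meant to reveal.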
Our approach is similar to that of Crowston, Julien and Ortega [3] in
“factoring out the size” but differs in the choice of size unit.
• Crowston et al. [3] proposed a measurement to
compare how efficiently a language version turns potential users
into actual contributors.
• They found “a strong (but not perfect) correlation” between
the total number of Wikipedia contributors on one side, and
the Internet population and total tertiary‐educated population
on the other.
Han‐Teng Liao (hanteng@gmail.com) and Thomas Petzold (t.petzold@hmkw.de)
Towards a better understanding of the geolinguistic dynamics of knowledge
Geographic And Linguistic Normalization
OpenSym '14, Aug 27‐29 2014, Berlin, Germany
ACM 978‐1‐4503‐3016‐9/14/08.
http://dx.doi.org/10.1145/2641580.2641623
We propose a method of geo‐linguistic normalization to advance
the existing comparative analysis of open collaborative
communities, with multilingual Wikipedia projects as the example.
Such normalization requires data regarding the potential users
and/or resources of a geolinguistic unit.
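The poster does not spell out the normalization formula, so the following is only one plausible reading, stated as an assumption: each geolinguistic unit's share of traffic is divided by its share of the language population, so a value above 1 means the unit contributes more traffic than its population size alone would suggest. The function name `normalized_share` and all numbers are hypothetical.

```python
def normalized_share(page_views, language_population):
    """Traffic share divided by language-population share, per unit.

    Assumed formula for illustration; not confirmed by the poster.
    Units without population data are left out of both totals.
    """
    units = [u for u in page_views if language_population.get(u)]
    total_views = sum(page_views[u] for u in units)
    total_pop = sum(language_population[u] for u in units)
    if not total_views:
        return {}
    return {
        u: (page_views[u] / total_views) / (language_population[u] / total_pop)
        for u in units
    }


# Hypothetical monthly page views and language populations, for illustration only.
views = {"ar-EG": 500, "ar-SA": 400, "ar-KW": 100}
speakers = {"ar-EG": 800, "ar-SA": 150, "ar-KW": 50}

for unit, ratio in sorted(normalized_share(views, speakers).items()):
    print(unit, round(ratio, 2))
```

With these made-up inputs the unit with the largest raw share of traffic ends up with the smallest normalized value, which mirrors the kind of reordering visible between the raw and normalized figures.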
[Figure 1. Viewing traffic trend lines. x-axis: Year/Month; y-axis: percent of the traffic (pgViews_perLang), 0% to 80%; series: Egypt, Saudi Arabia, Other, Algeria.]
[Figure 3. Normalized viewing traffic trend lines. x-axis: Year/Month; y-axis: pgViews_perLang normalized by language population, 0 to 8; series: Israel, Kuwait, Saudi Arabia, UAE, Jordan, Bahrain, Qatar, Egypt.]
Comparing results: before and after data normalization
Arabic Wikipedia viewing traffic
Arabic Wikipedia editing traffic?
Please refer to the extended abstract or ask the authors for more
(Figure 2 and Figure 4).
English Wikipedia viewing traffic
[Figure EN1. Viewing traffic trend lines. x-axis: Year/Month; y-axis: pgViews_perLang, 0% to 50%; series: United States, Other, United Kingdom, Canada.]
[Figure EN3. Normalized viewing traffic trend lines. x-axis: Year/Month; y-axis: pgViews_perLang normalized by language population, 0 to 2; series: Canada, United Kingdom, New Zealand, Australia, Ireland, United States, Malaysia, Netherlands, Spain, Italy, France, Germany.]
English Wikipedia editing traffic
[Figure EN2. Editing traffic trend lines. x-axis: Year/Month; y-axis: pgEdits_perLang, 0% to 50%; series: United States, United Kingdom, Other, Canada.]
[Figure EN4. Normalized editing traffic trend lines. x-axis: Year/Month; y-axis: pgEdits_perLang normalized by language population, 0 to 2.5; series: United Kingdom, New Zealand, Canada, Ireland, Australia.]
[7] Liao, H.-T. 2013. How does localization
influence online visibility of user-generated
encyclopedias? A case study on
Chinese-language Search Engine Result Pages
(SERPs). Proceedings of the 9th
International Symposium on Open
Collaboration (Hong Kong, Aug. 2013).
[12] Unicode Consortium 2014.
Language-Territory Information, CLDR Version 25.
