Does a “Renaissance Man” Create Good Wikipedia Articles?
Jacek Szejda
3
, Marcin Sydow
1,2
and Dominika Czerniawska
4
1
Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
2
Polish-Japanese Institute of IT, Warsaw, Poland
3
Educational Research Institute, Warsaw, Poland
4
Digital Economy Lab, Warsaw, Poland
Keywords:
Diversity Of Interests, Article Quality, Open Collaboration, Wikipedia.
Abstract:
We introduce a concept of diversity of interests or versatility of a member of an open-collaboration environ-
ment such as Wikipedia and aim to study how versatility influences the work quality. We introduce versatility
measure based on entropy. In preliminary experiments on Wikipedia data we indicate the positive role of
editors’ versatility on the quality of the articles they co-edit.
1 INTRODUCTION
Open-collaboration environments like Wikipedia pro-
duce outcome of varying quality. It is important
to study what properties of community members in-
crease chances for high-quality results of their work.
Such studies can help in future in developing tools
that improve and support open-collaboration team-
building process.
For example, it is interesting to study whether ed-
itors that have diverse interests tend to create better
Wikipedia articles.
Diversity has prooved to play important role in
multiple fields of applications: text summarisation,
web search, databases, recommender systems and se-
mantic entity summarisation. Recently, the concept
of diversity has attracted interest also in the domain of
open collaboration research (e.g. (Aggarwal, 2014)).
In this paper we introduce a quantitative measure
of diversity of interests of a member of an open-
collaboration environment such as Wikipedia and aim
to study how versatility influences the work quality.
The measure is based on the information-theoretic
concept of entropy. We demonstrate on Wikipedia
data that versatility of editor seems to be correlated
with the quality of articles they co-edit.
1.1 Sociological Background
Team diversity is one of the fundamental issues in so-
cial and organisational studies that has been broadly
researched on free software communities. Wikipedia
has a similar workflow where the community mem-
bers can edit any article. It rises analogous issues
concerning team’s coherence vs efficiency. There are
two competing theories describing efficient team or-
ganisation: modularity and integrity. The first was
introduced by David Parnas who suggested that co-
dependence between components should be elimi-
nated by limiting the communication (Parnas, 1972).
In our approach, a module corresponds to a task of
creating an article on Wikipedia. Participation in a
module does not require knowledge about the whole
system or other modules, e.g. Wikipedia users can co-
author articles about social science without knowing
anything about life sciences or mathematics. It leads
to higher specialisation and less diversity in individual
performance. Modular approach enables more flex-
ibility and decentralized management (Sanchez and
Mahoney, 1996). On the other hand, integral ap-
proach to organisation is easier to adapt to new envi-
ronments, to change the cooperation rules and gives
better results when it comes to fine-tuning of the
system (Langlois and Garzarelli, 2008). In an inte-
gral mode team members have diverse knowledge and
skills. We aim to study whether modular/specialized
or integral collaboration pattern is more successful in
creating high-quality Wikipedia articles.
1.2 Related Work
The potentially positive role of diversity was no-
ticed very early in the beginnings of Information Re-
trieval a few decades ago (Goffman, 1964). One
425
Szejda J., Sydow M. and Czerniawska D..
Does a “Renaissance Man” Create Good Wikipedia Articles?.
DOI: 10.5220/0005155804250430
In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2014), pages 425-430
ISBN: 978-989-758-048-2
Copyright
c
2014 SCITEPRESS (Science and Technology Publications, Lda.)
of the earliest successful applications of diversity-
aware approach was reported in (Carbonell and Gold-
stein, 1998) in the context of text summarisation. Re-
cently, diversity-awareness has gained increasing in-
terest in other information-related areas where the
actual user’s information need is unknown and/or
the user query is ambiguous. Examples range from
databases (e.g.(Vee et al., 2008)) to Web search (e.g.
(Agrawal et al., 2009)) or very recently to the quite
novel problem of graphical entity summarisation in
semantic knowledge graphs (Sydow et al., 2013).
From the open collaboration point of view, diversity
can be considered from many perspectives, for exam-
ple as a team diversity vs homogeneity or a single edi-
tors’s diversity of interest vs specialisation. For exam-
ple, the positive role of team diversity was studied in
(Chen et al., 2010), but the used definitons of diver-
sity and its measures (e.g. Blau index) are different
than in our paper, where it is based on the concept of
entropy. Most importantly, in contrast to our work,
the mentioned work studies the influence of diversity
on amount of accomplished work and withdrawal be-
haviour rather than the work quality that is consid-
ered here. In contrast to our work most of previ-
ous works focus on diversity of editor teams in terms
of categories such as culture, ethnicity, age, etc. A
very recent example, with a special emphasis on ad-
hoc “swift” teams where the members have very lit-
tle previous interactions with each other is (Aggarwal,
2014). (L
´
opez and Butler, 2013) studies how the con-
tent diversity influences online public spaces in the
context of local communities.
2 MODEL DESCRIPTION
In this section we explain the model of editor’s inter-
est diversity that we apply in our approach. We will
use Wikipedia terminology, to illustrate the concepts,
however our model can be adapted to other, similar
open-collaboration environments.
Let X denote the set of Wikipedia editors. Editors
participate in editing Wikipedia articles. Each article
can be mapped to one or more of some pre-defined set
of categories C = {c
1
,...,c
k
} that represent topics.
Each editor x X in our model is characterised by
their editing activity i.e. all editing actions done by x.
We assume that the interests of an editor x can be
represented by the amount of work that x committed
to articles in particular categories.
Let t(x) denote the total amount of textual con-
tent (in bytes) that x contributed to all articles they
co-edited and let t
i
(x) denote the total amount of tex-
tual content that editor x contributed to the articles
belonging to a specific category c
i
.
1
Now, lets introduce the following denotation:
p
i
(x) = t
i
(x)/t(x) and interpret it as representing xs
interest in category c
i
. Henceforth, we will use a
shorter denotation p
i
for p
i
(x) whenever x is under-
stood from the context.
2.1 Interest Profile
Finally, we define the interest profile of the editor x,
denoted as ip(x), as the interest distribution vector
over the set of categories of the articles that x edited:
ip(x) = (p
1
(x),..., p
k
(x))
Notice that according to the definition the interest
profile represents a valid distribution vector i.e. its
coordinates sum up to 1.
2.1.1 Example
Assume that the set of categories C consists of 8 cat-
egories: {c
i
}
1i8
and that editor x has contributed
t(x) = 10kB of text in total, out of which t
2
(x) = 8kB
of text has been contributed to articles in category
c
2
, t
5
(x) = 2kB in category c
5
and nothing to arti-
cles that were not assigned to c
2
nor c
5
. Thus the
x
0
s interest in c
2
is p
2
(x) = t
2
(x)/t(x) =
4
5
, in c
5
is
p
5
(x) = t
5
(x)/t(x) =
1
5
and is equal to 0 for all other
categories. The interest profile of this user is:
ip(x) = (0,
4
5
,0,0,
1
5
,0,0,0)
2.2 Measuring the Diversity of Interests
There are many possible ways of measuring diversity.
Since the interest profile ip(x) is modelled as a distri-
bution vector over categories, we define diversity of
interests (or equivalently versatility) of x, V (x), as the
entropy of interest profile of x:
V (x) = H((p
1
, p
2
,..., p
k
)) =
1ik
p
k
(lg(p
k
)) (1)
Where lg denotes binary logarithm. The value of
entropy ranges from 0 (extreme specialisation, i.e. to-
tal devotion to a single category) to lg(k) (extreme
diversity, i.e. equal interest in all categories).
1
Since a single article can be assigned to multiple cat-
egories, we split the contribution equally for all the cate-
gories of the article
KDIR2014-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
426
Figure 1: Versatility vs Quality.
2.2.1 Example
The versatility of user x from Example 2.1.1 has the
following value:
V (x) = p
2
lg(p
2
) p
5
lg(p
5
) =
= 0.8 × 0.32 + 0.2 × 2.32 = 0.25 + 0.46 = 0.6
Now assume that another user x
0
has contributed
equally to the four first categories, i.e. their interest
profile is: ip(x
0
) = (
1
4
,
1
4
,
1
4
,
1
4
,0,0,0,0). The versatil-
ity value for this editor has the following value:
H(ip(x
0
)) = 4 × 0.25 × (log
2
(0.25)) = 2
Notice that the versatility measure of x
0
is higher
than that of x and that this is according to the intuition
since x
0
has similar interest in four different categories
and x only in two (mostly in one). In other words, x
0
is
more versatile while x is more specialised. Maximum
versatility for eight categories would have value of 3,
for an editor that is equally intested in all categories.
3 EXPERIMENTS
In this section we report experiments made on data
Doesa"RenaissanceMan"CreateGoodWikipediaArticles?
427
extracted from Wikipedia that reflects recorded activ-
ity of its editors.
The goal is to experimentally study the depen-
dence between editors’ versatility as defined in Sec-
tion 2 and the quality of articles they co-edit. In the re-
ported experiments the quality of articles is modelled
based on the information available in the data. More
precisely, we utilise two kinds of information regard-
ing the articles’ quality: some articles are marked as
featured and, independently, some as good. We treat
this information as “gold-truth” in our experiments.
3.1 Data
The data covers sample of 2714 contributors to
German-language edition in 2013. We used the
Wikipedia API for retrieving the list of contributors
and their activity logs, and database dumps for the
page (article) list and category graph.
Considering the categories mentioned in the Sec-
tion 2, we utilise the fact that each Wikipedia arti-
cle can be mapped to one of the eight main content
categories: Art & Culture, Economy, History, Knowl-
edge, Religion, Society, Sport, Technology. Techni-
cally, the mapping to categories was computed so that
they were encountered by the algorithm traversing the
category graph using given article as a root node and
iterating over neighbors up to 1000 times. If the ar-
ticle was mapped to more than one category, contri-
bution size was split equally among them, so that we
could use valid totals after per-user aggregation.
3.2 Experimental Results
We analysed four groups of editors: N,G,F,GF that
denote editors who co-edited: none good nor featured
article, at least one good, at least one featured and at
least one article that is both good and featured, respec-
tively. Notice that the four groups represent a graded
“hierarchy” of high-quality editors, with the GF rep-
resenting the highest-quality editors in some way. For
each of the four groups we computed some statistics
concerning versatility measure V () (Equation 1), in-
cluding mean, median and quartiles. The results are
presented on Figure 1, where one can observe a no-
ticeable regularity that indicates clear positive con-
nection between editors versatility and the quality of
their work. More precisely, the aggregated versatil-
ity statistics for the groups N,G,F,FG are strictly in-
creasing.
Furthermore, we observed that the distribution of
user versatility has a negative skew (Figure 2), with
median value at 2.29 bits (out of 3-bit maximum).
Users co-authoring at least one featured article score
Figure 2: Distribution of Editor’s Versatility.
2.31 on mean versatility measure, compared to 2.00
of those who co-authored only non-featured articles.
3.3 Versatility, Quality and Productivity
We also computed for each editor, their productivity
defined as the total amount of text (in Bytes) commit-
ted to the articles they co-edited. We divided editors
into two groups: F (at least one co-edited featured
article) and X \ F and made scatterplots of versatil-
ity vs productivity for these two groups (see Figure
3). Again, one can notice that the authors of featured
articles are noticeably more versatile than others.
Since the results on Figure 3 might suggest that
versatility and productivity are somehow correlated,
we additionally repeated analogous (to that reported
in Section 3.2) experiment on comparison of article
quality and editors’ productivity (Figure 4).
Finally, since the results of this experiment also
seem to indicate some positive influcence of produc-
tivity on quality we finally decided to compare the in-
fluence of versatility and productivity on quality in a
more quantitative way. For this reason we built the
logistic model with versatility and productivity as ex-
planatory variables. Table 1 shows no significant role
Table 1: Explaining quality with logistic model.
Estimate Std. Error z value Pr(> |z|)
(Intercept) -3.566e+00 2.720e-01 -13.111 < 2e 16
versatility 1.434e+00 1.214e-01 11.820 < 2e 16
productivity 4.822e-07 6.017e-07 0.801 0.423
vers. * prod. 5.474e-07 2.865e-07 1.911 0.056
KDIR2014-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
428
Figure 3: Versatility – Activity Scatterplot.
of productivity in explaining the quality of contribu-
tions (the fact of authoring at least one featured article
by a given user), however a significant, 4.2 odds ratio
for one-bit increase in versatility measure.
4 CONCLUSIONS AND FUTURE
WORK
We proposed a model of user interests and entropy-
based measure of interest diversity of a single
Wikipedia editor. Preliminary experiments indicate
that editors with more diversed interests seem to co-
author better-quality content. On the other hand, de-
spite an observed correlation between versatility and
productivity, the latter one does not seem to explain
article quality so well.
The continuation work would benefit from deep-
ened and repeated experiments on other datasets and
settings. For example, other choice of main the-
matic categories can be considered in next experi-
ments. Also, other interest-diversity measures can be
proposed. Since the reported preliminary experiments
are promising, a natural future extension of this work
would be to define team diversity based on some con-
cepts introduced here and to extend the study on the
issue of team-work quality.
ACKNOWLEDGEMENTS
The work is supported by Polish National Science
Centre grant 2012/05/B/ST6/03364.
Doesa"RenaissanceMan"CreateGoodWikipediaArticles?
429
Figure 4: Productivity vs Quality. The denotations are anal-
ogous to those on Figure 1.
REFERENCES
Aggarwal, A. (2014). Decision making in diverse swift
teams: An exploratory study. In Proceedings of the
2014 47th Hawaii International Conference on Sys-
tem Sciences, HICSS ’14, pages 278–288, Washing-
ton, DC, USA. IEEE Computer Society.
Agrawal, R., Gollapudi, S., Halverson, A., and Ieong, S.
(2009). Diversifying search results. In Proceedings
of the Second ACM International Conference on Web
Search and Data Mining, WSDM ’09, pages 5–14,
New York, NY, USA. ACM.
Carbonell, J. and Goldstein, J. (1998). The use of mmr,
diversity-based reranking for reordering documents
and producing summaries. In Proceedings of the 21st
annual international ACM SIGIR conference on Re-
search and development in information retrieval, SI-
GIR ’98, pages 335–336, New York, NY, USA. ACM.
Chen, J., Ren, Y., and Riedl, J. (2010). The effects of di-
versity on group productivity and member withdrawal
in online volunteer groups. In Proceedings of the
SIGCHI Conference on Human Factors in Comput-
ing Systems, CHI ’10, pages 821–830, New York, NY,
USA. ACM.
Goffman, W. (1964). A searching procedure for information
retrieval. Information Storage and Retrieval, 2(2):73
– 78.
Langlois, R. N. and Garzarelli, G. (2008). Of Hackers and
Hairdressers: Modularity and the Organizational Eco-
nomics of Open-source Collaboration. Working pa-
pers 2008-53, University of Connecticut, Department
of Economics.
L
´
opez, C. A. and Butler, B. S. (2013). Consequences of
content diversity for online public spaces for local
communities. In Proceedings of the 2013 Conference
on Computer Supported Cooperative Work, CSCW
’13, pages 673–682, New York, NY, USA. ACM.
Parnas, D. L. (1972). On the criteria to be used in de-
composing systems into modules. Commun. ACM,
15(12):1053–1058.
Sanchez, R. and Mahoney, J. T. (1996). Modularity, Flex-
ibility, and Knowledge Management in Product and
Organization Design. Strategic Management Journal,
17:63–76.
Sydow, M., Pikula, M., and Schenkel, R. (2013). The no-
tion of diversity in graphical entity summarisation on
semantic knowledge graphs. Journal of Intelligent In-
formation Systems, 41:109–149.
Vee, E., Srivastava, U., Shanmugasundaram, J., Bhat, P.,
and Yahia, S. A. (2008). Efficient computation of di-
verse query results. In Proceedings of the 2008 IEEE
24th International Conference on Data Engineering,
ICDE ’08, pages 228–236, Washington, DC, USA.
IEEE Computer Society.
KDIR2014-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
430