ORGANOGRAPHS

Multi-faceted Hierarchical Categorization of Web Documents

Rodrigo Dias Arruda Senra and Claudia Bauzer Medeiros

Institute of Computing, University of Campinas, UNICAMP, Campinas, SP, Brazil

Keywords:

Organographs, Hierarchical content categorization, Bookmarking, Classiﬁcation.

Abstract:

The deluge of information on the Web challenges users to organize their references to interesting

content on the Web as well as in their private off-line storage space. Having an automatically managed personal

index to content acquired from the Web is useful for everybody, but critical to researchers and scholars. In this

paper, we discuss concepts and problems related to organizing information through multi-faceted hierarchical

categorization. We introduce the organograph as a mechanism to specify multiple views of how content is

organized. Organographs can help scientists to automatically organize their documents along multiple axes,

improving sharing and navigation through themes and concepts according to a particular research objective.

1 INTRODUCTION

The organization, archival and sharing of digital content generated by scientists (e.g., reports, algorithms and data) is important in eScience research. Scientists must be able to efficiently organize and disseminate their knowledge, not only within a project, but

also with the community at large, where the preferred

collaboration environment is the Web. This compli-

cates document management, since each scientist (or

group) may use different document standards, storage

mechanisms and vocabularies. Such issues are nowa-

days a prime research topic in Web Science – a novel

research domain which is concerned with the Web as

the primary object of interest.

Continuing our research (Senra and Medeiros, 2009) on data sharing in eScience, this paper is concerned with information organization and collaboration on the Web. We are interested in the organization of scholarly digital content (documents used by

scientists), through automatic hierarchical structur-

ing, multifaceted ﬁltering and sharing. Hierarchical

structuring is a pervasive approach towards information organization. It is the cognitive pattern we use to

organize everything (e.g., ﬁles, messages). Although

ubiquitous, we often create our hierarchies manually

and in an ad hoc fashion. We choose ﬁlesystems as

the pivot artifact to discuss our approach to hierar-

chical structuring, considering that digital content is

often encapsulated, stored and shared as ﬁles.

The issues for ﬁlesystems can be transported to

other manifestations of the hierarchical structuring

pattern. First, the membership relation between files and directories is often static and manually defined by the user on a per-file-instance basis. Second, the hierarchy implicitly defines a content categorization, so a given file can frequently be placed at only a single point in the hierarchy. We refer

to this issue as the single category problem, further

exploring it in section 2.2. Above all, the organization of a directory hierarchy is rarely shared separately from the content itself; e.g., people do not exchange

hierarchies of empty folders even though there is valu-

able knowledge encoded in their structuring. We only

share directory trees associated with content, because

we lack the tools to categorize our content according

to a foreign directory hierarchy and vice-versa. We re-

fer to this issue as the content-recategorization prob-

lem, and discuss it in section 2.4.

The Web has aggravated these issues. It is often easier and quicker to find a piece of information on the Internet than to locate a copy already available on our local computer or Intranet. Not only is this an inefficient way of managing content, but it can lead to

problems such as needless duplication, loss of quality

and provenance mishaps. The information available

“off-line” (i.e. in our local computer or Intranet) is

not the same information available in the Internet. Al-

though not necessarily as fresh, in some aspects local

information can be richer. It has been filtered, possibly transformed, can be sensitive or private, and may no longer be available on the Web.

Dias Arruda Senra R. and Bauzer Medeiros C.

ORGANOGRAPHS - Multi-faceted Hierarchical Categorization of Web Documents.

DOI: 10.5220/0003319205830588

In Proceedings of the 7th International Conference on Web Information Systems and Technologies (WEBIST-2011), pages 583-588

ISBN: 978-989-8425-51-5

© 2011 SCITEPRESS (Science and Technology Publications, Lda.)

This paper proposes an approach towards the auto-

matic organization of documents generated and used

by scientists. Our approach enables this organization

along multiple axes, thus facilitating its sharing and

dissemination. Our hypothesis is that hierarchies can

be shared in isolation from their generative collection,

and used to organize other (non-generative) collec-

tions.

2 APPROACHES TO CONTENT

ORGANIZATION

We are interested in information organization, and we

define it as: “the structuring of information units by a

group of agents according to a set of consensual and

shared principles to achieve a deﬁned goal”. Our def-

inition has ﬁve elements. The agents represent the

users and software artifacts that interact. The infor-

mation units (IUs) represent the granularity at which

content is encapsulated and manipulated. The struc-

ture is deﬁned by intermediate aggregators that parti-

tion the set of IUs; these aggregators are often related

to content categorization or clusterization. The prin-

ciples are the algorithms and knowledge bases that

drive the structuring process. The goal is usually to

improve how the agents obtain access to the desired

IUs. In this paper, we refer to ﬁle and document in-

terchangeably as concrete instances of IUs; similarly,

directories are instances of hierarchical aggregators to

enforce structuring upon the IUs.

2.1 Basic Concepts

Organization as part of Information Retrieval (IR), in-

volves several tasks (Jackson and Moulinier, 2002) –

e.g., sorting, summarizing, indexing. We are partic-

ularly interested in Classiﬁcation or Categorization

– the assignment of IUs to a pre-established set of

classes or categories defined by the agents. Methods for categorization differ in the form of the classifier, the technique for training, and the representation of the IU; see, e.g., (Weigend et al., 1999) on text. While the categorization task assumes an existing set of classes (categories), clustering aims to create or discover a set of classes (clusters).

Another important concept is the notion of ontology (Uschold and Gruninger, 1996): an explicit and rigorous specification of a conceptualization that organizes some knowledge domain. For

many applications, ontologies are mostly hierarchical,

containing all the entities and their relations, usually

restricted to is-a and part-of. Ontologies can describe

IUs, categories, and relationships between them.

2.2 The Single Category Problem

The way we organize digital documents in hierarchies

still follows the metaphors of the physical world:

archives, drawers and folders. The majority of tools

used to organize documents relies on such single tax-

onomic organization patterns - e.g., ﬁle managers

or browsers, e-mail clients, or software navigational

menus. When the digital space inherited this pattern

from the physical world, it also inherited some unnec-

essary limitations. For instance, many software arti-

facts restrict a digital document to be placed in a sin-

gle directory in the ﬁlesystem hierarchy, or an email

message to be stored in a single folder.

Digital libraries are presented as a means to solve

this issue – by allowing multifaceted content organization using links and metadata structures. Another means of circumventing this limitation is the use of copies, hard links (a.k.a. clones) and soft links (a.k.a.

symbolic links). While these mechanisms support

content multi-categorization, they also introduce new

problems. Copies increase storage space, complicate

consistency maintenance and may cause redundant

processing. Links change the hierarchy traversal from

tree-based into a graph-based procedure that often in-

troduces problematic cycles. Furthermore, links in-

troduce a duality: sometimes they are treated trans-

parently, other times not – causing an identity prob-

lem between the link and its referred object. Thus,

archiving systems and digital libraries still lack ﬂex-

ible content organization support, and suffer from a

rigid content structuring.
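The duality introduced by links can be observed with a few standard-library calls. The following is a minimal sketch, assuming a POSIX-like filesystem on which symbolic links can be created; the directory names, the file name and the helper function are hypothetical and serve only as illustration.

```python
import os
import tempfile

def categorize_with_symlink(root):
    """Place one document under two category directories using a symbolic link."""
    databases = os.path.join(root, "databases")
    web = os.path.join(root, "web")
    os.makedirs(databases)
    os.makedirs(web)

    # The document physically lives in a single category directory.
    original = os.path.join(databases, "paper.txt")
    with open(original, "w") as handle:
        handle.write("organographs")

    # A soft link makes it reachable from a second category.
    alias = os.path.join(web, "paper.txt")
    os.symlink(original, alias)
    return original, alias

with tempfile.TemporaryDirectory() as root:
    original, alias = categorize_with_symlink(root)
    # Transparent treatment: both paths resolve to the same underlying file.
    same_file = os.path.samefile(original, alias)
    # Non-transparent treatment: only the alias is reported as a link.
    alias_is_link = os.path.islink(alias)
    original_is_link = os.path.islink(original)
```

Depending on which call a tool uses, the two paths are either the same object or two distinct ones, which is the identity problem described above.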

2.3 Alternatives to Rigid Hierarchical

Structuring

In contrast to the hierarchical organization strategy,

there are two other approaches to organizing content:

folksonomies and full-text management engines. The

full-text search approach abolishes classes and cate-

gories altogether. Objects are indexed by their tex-

tual content and retrieved by a subset of keywords.

The object collection cannot be browsed by topic (or

any other property) because it is unstructured; only ranked result sets can be iterated. Furthermore, full-text mechanisms do not support multimedia content,

which is increasingly common.

Folksonomies (or social tagging) are a recent response to the demands of content organization. In this

paradigm, content is annotated and categorized by

multiple labels or tags that form a cloud of un-

structured categories, represented by words or short

phrases. Any object can be associated with any num-

ber of tags, therefore belonging to multiple categories,

solving the single category problem. Tags can evolve

dynamically, being used to browse and retrieve IUs.

Therefore, folksonomies preserve classes or emulate

categories, but not their inter-relations and structure.

To compensate for the lack of structure, some tagging systems (e.g., Delicious and Connotea) use co-occurrence of tags for navigational purposes. Folksonomies have other limitations (Giannakidou et al., 2008) that restrict their usability, such as tag validation, uncontrolled vocabularies, spamming, and term ambiguity and redundancy.

Summing up, taxonomies suffer from the single

category problem, full-text search provides neither structuring nor categories, and folksonomies have unstructured and unrelated categories.
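To make the use of tag co-occurrence for navigation concrete, the sketch below counts how often pairs of tags annotate the same object and ranks the neighbours of a given tag. It is an illustration under our own assumptions (the sample bookmarks and both function names are hypothetical), not a description of how Delicious or Connotea implement it.

```python
from collections import Counter
from itertools import combinations

def tag_cooccurrence(tagged_items):
    """Count how often each unordered pair of tags annotates the same item."""
    counts = Counter()
    for tags in tagged_items.values():
        # Sorting gives every pair a canonical (alphabetical) order.
        for pair in combinations(sorted(set(tags)), 2):
            counts[pair] += 1
    return counts

def related_tags(counts, tag):
    """Return the tags that co-occur with `tag`, most frequent first."""
    related = Counter()
    for (left, right), n in counts.items():
        if left == tag:
            related[right] += n
        elif right == tag:
            related[left] += n
    return [name for name, _ in related.most_common()]

# Hypothetical bookmarks: item ID -> tags assigned by users.
bookmarks = {
    "doc1": ["ontology", "web", "tagging"],
    "doc2": ["ontology", "web"],
    "doc3": ["tagging", "spam"],
}
counts = tag_cooccurrence(bookmarks)
neighbours = related_tags(counts, "ontology")
```

Such a ranking lets a user move from one tag to related ones, but the relation is purely statistical: it carries no is-a or part-of semantics.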

2.4 The Content-recategorization

Problem

As mentioned in section 1, hierarchies still lack the

desirable property of dynamic and ﬂexible member-

ship relation between content and category. One of

our goals is to allow hierarchical organizations to be

shared in isolation from the content they categorize.

First we need to distinguish the generative collection

from a subordinate collection. A generative collec-

tion is the set of IUs from which the agents derived

a hierarchical categorization scheme. A subordinate

collection is the reciprocal entity, i.e., any set of IUs

subject to a speciﬁc (generated) classiﬁcation scheme.

For example, suppose some researcher has a hier-

archical collection of articles. The folders represent

categories, and the union of all articles is the generative collection. There are two recategorization scenarios. The first is to fit her articles to an external

categorization, for instance according to the Library

of Congress Subject Headings (LCSH). In this case,

each heading from LCSH would become a folder (cat-

egory), and the researcher’s articles would become the

subordinate collection to be organized according to

this new hierarchy. The second scenario is to ﬁt other

documents to her own hierarchy scheme, for example

to browse a colleague’s collection as if it were orga-

nized with her own personal classiﬁcation scheme.

Each scenario means solving the hierarchical cat-

egorization problem. The content-recategorization

problem is a variation of hierarchical classiﬁcation

because the content subject to classiﬁcation is already

classiﬁed according to a source hierarchy, which

could be used to improve the classiﬁcation process

towards the new given target hierarchy. For a full dis-

cussion and survey about hierarchical classiﬁcation

see (Gordon, 1996). Many classification methods discussed in the literature are not fully automatic, requiring supervised training and user feedback – e.g., naive Bayes, support vector machines, or neural networks.
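As an illustration of how a source hierarchy might inform classification towards a target hierarchy, the sketch below trains a small multinomial naive Bayes classifier in which the name of the source folder is added as an extra feature. This is one possible reading under our own assumptions, not a method evaluated in this paper; the class, the `features` helper, the training texts and the folder names are hypothetical, and the two target labels are used only as sample category names.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over tokens."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> token counts
        self.doc_counts = Counter()              # category -> training documents
        self.vocabulary = set()

    def train(self, tokens, category):
        self.word_counts[category].update(tokens)
        self.doc_counts[category] += 1
        self.vocabulary.update(tokens)

    def classify(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokens:
                if token in self.vocabulary:  # unseen tokens carry no evidence
                    score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_category, best_score = category, score
        return best_category

def features(text, source_folder):
    """Tokens of the text plus one pseudo-token for the folder in the source hierarchy."""
    return text.lower().split() + ["folder=" + source_folder]

model = NaiveBayes()
model.train(features("query optimization index join", "db"), "Information systems")
model.train(features("neural network training", "ml"), "Computing methodologies")

target = model.classify(features("join index tuning", "db"))
```

The training examples still have to be supplied by a person, which is the supervision cost mentioned above.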

If the categorization process is distributed across

different people, or even done by a single person at

different times, then categorizations will differ – e.g.,

(Bonifacio et al., 2000) have shown that community

members keep their own perspective on a community repository. Many community repository initiatives fail because a single (though shared) categorization scheme is not accepted or understood by the entire community. This reinforces the idea that single-category approaches to classification are doomed to fail.

2.5 Categories for Categorization

We are interested in improving automation for dif-

ferent categorization tasks, which we call: ﬁrst-

time, refactor, shared, and synchronized categoriza-

tion. First-time categorization is when we categorize

an assorted digital document collection for the ﬁrst

time.

Refactor categorization is when we already have a

categorized collection, but we need to either judge its

coherence or refactor it. The quality and suitability of

the categorization may vary for different user groups

or for the same group doing tasks at different times.

Shared categorization is when we want to re-use

the categorization scheme from a different group and

apply it to our own content, or do the inverse task –

using our own categorization scheme to browse the

shared content from others.

Synchronized categorization is a stricter version

of shared categorization, when two groups are simul-

taneously manipulating content and categorization.

Each group may adopt a different categorization, and

yet they must edit and evolve the same documents.

Supposing that one group devised a proper (refactor)

categorization for its purposes, then this group should

be capable of applying that same categorization to

any exchanged content. In other words, in the last

two tasks what is sought is a solution to the content-

recategorization problem.

3 THE ORGANOGRAPH

FRAMEWORK

In order to accomplish those four categorization tasks,

we present a semantically-grounded organizational

structure that we call organographs. An organograph is a user parametrization of a hierarchical categorization task that is editable, persistable and shareable.

Organographs improve access to digital content

because they are context-sensitive and built to meet

speciﬁc user needs, preserving user familiarity with

categories and their structure. Multiple organographs

can be generated from the same collection given a

different parametrization, and a single organograph

could be applied to different and unrelated collec-

tions. Each generated organograph should be inter-

preted as a multi-faceted view (Dakka et al., 2007)

of the generative collection. A concrete example of

an organograph is given in figure 1, which is detailed in

section 3.3.

3.1 Use Case

Consider the following example: a researcher uses a

hierarchy created to gather all material related to his

research project. Some documents were generated lo-

cally, while others were retrieved from the Web. The

document tree contains his publications, unpublished

papers, some papers (categorized by subject) he has already read, and papers to read. He wants to make

this document tree available for his research group,

including his own publications, but not his unpub-

lished materials. Moreover, he wants to organize the

resulting collection by publication date (year/month)

and then by ACM’s 1998 Computing Classiﬁcation

System (http://www.acm.org/about/class/1998). Notice that the classifying criteria are based either on attributes that are intrinsic to the IU’s content (e.g., publication date, text subject, annotations), or on attributes that depend on the user context and are content-independent (e.g., read vs unread, published

vs unpublished). In this example, the attributes pub-

lished/not published and read/not read are implicitly

given by categories in the original document organization; therefore, annotations can be derived automatically using suitable attribute extractors (Dakka and

Ipeirotis, 2008). In other cases, annotations must be

performed manually.

3.2 Methodology

Prior to constructing the target organizing hierar-

chy, all documents in the source collection are pre-

processed and indexed to populate what we call the

attribute-space. The attribute-space is a dictionary-

like database whose keys are document IDs and

whose values are records with heterogeneous schemas

– because different documents might have distinct

sets of attributes. It can be implemented on top of a

NoSQL database or on top of a relational database with

a star-join (multi-dimensional) schema. It is built by

applying all suitable information extractors available

to all documents in the source collection. The car-

dinality of the attribute-space for a given document

depends on the availability of information extractors

applicable to the respective document type. Hence, it

grows incrementally with the advent of new informa-

tion extractors.

For instance, the attribute-space for some x.pdf

document has the schema: title, author, publica-

tion date, document type, word frequency vector,

and other user-defined keywords. The first four attributes are tagged (by the extractor) with their respective Dublin Core element counterparts. Furthermore, x.pdf’s attribute-space is augmented with attributes retrieved from the data source (the host filesystem), such as document size, last access/modification date, owner, group, and access permissions. If the document originated from the Web, the attribute-space could be augmented with information extracted from the host site

or the surrounding Web pages. Once the attribute-

space is populated, the researcher possesses a vocab-

ulary of attributes with which he can write organo-

graph speciﬁcations that will guide the materializa-

tion of diverse multi-faceted views.
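The attribute-space described above can be sketched as a dictionary keyed by document ID, populated by whichever extractors apply to each document. The sketch below is a minimal in-memory illustration under our own assumptions: both extractors, the attribute names and the sample files are hypothetical, and a real deployment would persist the records in a NoSQL or relational store as discussed.

```python
import os
import tempfile

def filesystem_extractor(path):
    """Attributes obtained from the host filesystem (applies to every file)."""
    info = os.stat(path)
    return {"size": info.st_size, "modified": info.st_mtime}

def text_extractor(path):
    """Attributes derived from content; applies to plain-text documents only."""
    if not path.endswith(".txt"):
        return {}
    with open(path, encoding="utf-8") as handle:
        words = handle.read().lower().split()
    frequencies = {}
    for word in words:
        frequencies[word] = frequencies.get(word, 0) + 1
    return {"word_frequency": frequencies}

EXTRACTORS = [filesystem_extractor, text_extractor]

def build_attribute_space(paths, extractors=EXTRACTORS):
    """Map each document ID to a record merging the output of every extractor."""
    space = {}
    for doc_id, path in enumerate(sorted(paths)):
        record = {"path": path}
        for extractor in extractors:
            record.update(extractor(path))
        space[doc_id] = record
    return space

with tempfile.TemporaryDirectory() as root:
    note = os.path.join(root, "a.txt")
    blob = os.path.join(root, "b.bin")
    with open(note, "w", encoding="utf-8") as handle:
        handle.write("web science web")
    with open(blob, "wb") as handle:
        handle.write(b"\x00\x01")
    space = build_attribute_space([note, blob])
```

The two records end up with different schemas, since only the text file receives a `word_frequency` attribute; registering a new extractor in `EXTRACTORS` enlarges the records without changing the rest of the code.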

We propose four steps to construct organographs:

Step 1: apply information extraction techniques to the IUs (e.g. documents), factoring out attributes and properties (defining an attribute space). Each IU is assigned a unique ID that serves for indexing purposes.

Step 2: automatically generate categories from a user-given organograph that either explicitly enumerates categories or provides generative rules that define them. User parametrization should anchor categories and their inter-relationships to concepts in ontologies.

Step 3: run a categorization algorithm specified in the user-given organograph, resulting in the assignment of IU IDs (step 1) to the generated categories (step 2).

Step 4: use virtualization and links to materialize the emergent categorization using the same structuring metaphor (e.g. directories) of the generative collection.
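The four steps can be sketched end to end for the simplest case, in which every level of the hierarchy is given by a generative rule over one attribute. This is an illustrative sketch under stated assumptions rather than our implementation: the attribute space is hard-coded in place of step 1, the organograph is reduced to a list of functions, the document paths are hypothetical, and symbolic links on a POSIX-like filesystem stand in for the virtualization layer.

```python
import os
import tempfile
from datetime import date

# Step 1 (assumed already done): an attribute space keyed by IU ID.
attribute_space = {
    1: {"path": "/data/a.pdf", "publication_date": date(2009, 10, 5)},
    2: {"path": "/data/b.pdf", "publication_date": date(2011, 5, 6)},
    3: {"path": "/data/c.pdf", "publication_date": date(2011, 5, 20)},
}

# Hypothetical organograph: one generative rule per hierarchy level.
organograph = [
    lambda record: record["publication_date"].strftime("%Y"),  # first level: year
    lambda record: record["publication_date"].strftime("%m"),  # second level: month
]

def categorize(space, rules):
    """Steps 2 and 3: derive category paths from the rules and assign IU IDs to them."""
    assignment = {}
    for iu_id, record in space.items():
        category = tuple(rule(record) for rule in rules)
        assignment.setdefault(category, []).append(iu_id)
    return assignment

def materialize(space, assignment, root):
    """Step 4: create one directory per category and one symbolic link per IU."""
    links = []
    for category, iu_ids in assignment.items():
        directory = os.path.join(root, *category)
        os.makedirs(directory, exist_ok=True)
        for iu_id in iu_ids:
            target = space[iu_id]["path"]
            link = os.path.join(directory, os.path.basename(target))
            os.symlink(target, link)  # the content is referenced, never copied
            links.append(link)
    return links

assignment = categorize(attribute_space, organograph)
with tempfile.TemporaryDirectory() as root:
    links = [os.path.relpath(link, root)
             for link in materialize(attribute_space, assignment, root)]
```

Because the rules are separate from the collection, the same `organograph` list could be applied to any other attribute space that provides a `publication_date` attribute.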

3.3 Concrete Organograph Example

We provide the organograph described in ﬁgure 1 to

illustrate what we mean by user parametrization.

Line 1 defines an organograph instance with an identifier for persistence purposes. Line 2 defines the topmost level (root node). Lines 3-7 define the labels for the first-level nodes, consisting of the 4-digit year present in the publication date attribute. Lines 8-30 define the second nested level, where lines 9-13 similarly define a 3-letter month label for this level’s nodes based on the same attribute. Lines 14-29 define the third level, where lines 15-24 apply an algorithm to do topical clusterization according to ACM’s CCS ontology. The choice of categorization algorithm is left to the user; in this example we chose “naive bayes”, omitting from the example some necessary parameters such as the training set used. Line 19 defines the generative datasource collection subject to inclusive (line 20) and exclusive filters (line 21). Finally, lines 26-28 define the innermost level nodes (tree leaves), consisting of hyperlinks to the filtered IUs (documents).

Figure 1: Sample Organograph Specification.

The exact syntax and semantics of the domain spe-

ciﬁc language used to code the organograph lies out-

side the scope of this paper.
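For illustration only, the levels described above could also be written down as a plain data structure and serialized for sharing. The sketch below is not our domain-specific language and does not follow the line numbering of figure 1; every key name, the identifier, the root label, the filter patterns and the collection URL are assumptions introduced for this example.

```python
import json

# Hypothetical rendering of an organograph as nested dictionaries.
organograph = {
    "id": "research-project-view",  # identifier used for persistence
    "root": "publications",         # topmost level (root node)
    "levels": [
        {"label": "year", "attribute": "publication_date", "format": "%Y"},
        {"label": "month", "attribute": "publication_date", "format": "%b"},
        {
            "label": "topic",
            "algorithm": "naive bayes",  # training set and other parameters omitted
            "ontology": "ACM CCS 1998",
            "datasource": {
                "collection": "file:///home/researcher/project",
                "include": ["*.pdf"],
                "exclude": ["unpublished/*"],
            },
        },
    ],
    "leaves": "hyperlink",  # innermost nodes link to the filtered IUs
}

def level_labels(spec):
    """Return the labels of the nested levels, outermost first."""
    return [level["label"] for level in spec["levels"]]

# Persisting and sharing reduce to serializing the specification.
serialized = json.dumps(organograph, indent=2, sort_keys=True)
restored = json.loads(serialized)
```

Since the specification contains no documents, only rules and filters, it can be exchanged and edited independently of the collection it was written for.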

4 RELATED WORK

There are many research initiatives on tagging and

text mining on the Web whose goal is document

sharing. However, to our knowledge, ours is the ﬁrst

work that is geared towards organizing for sharing in a

collaborative Web environment, in which each partic-

ipating scientist can construct a personal organization

scheme to allow other researchers to “see” organiza-

tions differently.

The CAIMAN system (Lacher and Groh, 2001)

facilitates document exchange between geographi-

cally dispersed people. Each community member

organizes his collection according to his own cate-

gorization scheme (ontology), then CAIMAN maps

concepts in personal ontologies to concepts in a com-

munity ontology. (Bloehdorn et al., 2005) performed

text mining experiments in the medical domain in

which the ontological structures used were acquired

automatically in an unsupervised learning process.

They showed that automatically learned ontologies and manually engineered ones were both competitive and improved results on text clustering and

classiﬁcation tasks. (Giannakidou et al., 2008) pro-

pose an approach for social data clustering which

combines semantic, social and content-based infor-

mation. They devised an unsupervised model for

efﬁcient and scalable mining on multimedia social-

related data, which leads to the extraction of rich

and trustworthy semantics and the improvement of

retrieval in a social tagging system. (Du and Chen,

2007) devised a desktop-based personal information

management system that further exploits a social networking

environment for collaborative knowledge creation, in-

tegration and sharing, based on the integration of Se-

mantic Web technologies and collaborative tagging.

(Chen and Roberts, 2007) presented an architecture

capable of recovering the context of tags and driving

emergent semantics, using them to organically build

ontologies. The generated semantic hierarchy is used

to enforce structure and semantics in collaborative

tagging. That approach was adopted in practice in the

Online Open Publishing System.

Our work also goes toward ﬁlling gaps in Web Sci-

ence research, in the area of designing and develop-

ing infrastructures for collaboration on the Web. The

term Web Science was first introduced by Berners-Lee. It has since given rise to large international research efforts, including the Web Science Trust (http://webscience.org/) and the Brazilian Institute for

Web Science Research (http://webscience.org.br).

Formally, research in Web Science is concerned

with the Web as the primary object of interest. In

our work, this means, among other things, concentrating on

organographs as a means of sharing and exchanging

knowledge. Furthermore, once document organiza-

tions are shared, the researchers can reuse each other’s

data – which is the essence of scientiﬁc collaboration

– without having to concern themselves with estab-

lishing standards for document organization.

5 CONCLUSIONS

This paper presented a conceptual framework to au-

tomate information organization and support collab-

orative work on the Web. Our core proposal is the

organograph – a persistent and shareable organiza-

tion that emerges from automatic feature-extraction,

classiﬁcation and clustering. Though our discussion

was centered on document sharing and reuse, our end-users are scientists who work cooperatively in some

eScience domain. In such a context, documents refer

not only to scientiﬁc papers and reports, but also data

ﬁles containing experimental data, or images. Under

this perspective, our proposal can be extended to other

domains in which cooperation on the Web is required.

At the same time, we need to concern ourselves

with the Web Science issue of visibility. It is not

enough to share organographs, if we also want the

documents to be visible beyond a research group. In-

deed, the validation of scientiﬁc experiments requires

reproducibility – and this means that documents asso-

ciated with an eScience project must all, at the end,

become available. This means that we must also con-

sider some sort of Publication Directory, in which a

group’s (or a project’s) organographs can be accessed

by everyone interested in the main results of that group. This kind of solution is part of our ongoing research. We will validate this concept using real applications that run on the Web and that have been implemented by our research group in distinct scientific

domains.

ACKNOWLEDGEMENTS

This work was supported by Fapesp, CNPq, CAPES

and INCT in Web Science (CNPq 557.128/2009-9).

REFERENCES

Bloehdorn, S., Cimiano, P., and Hotho, A. (2005). Learn-

ing ontologies to improve text clustering and classi-

ﬁcation. In From Data and Information Analysis to

Knowledge Engineering: Proceedings of the 29th An-

nual Conference of the German Classiﬁcation Society.

Bonifacio, M., Bouquet, P., and Manzardo, A. (2000). A

distributed intelligence paradigm for knowledge man-

agement. In AAAI Spring Symposium Series 2000 on

Bringing Knowledge to Business Processes.

Chen, L. and Roberts, C. (2007). Semantic tagging for

large-scale content management. In WI ’07: Proceed-

ings of the IEEE/WIC/ACM International Conference

on Web Intelligence.

Dakka, W. and Ipeirotis, P. G. (2008). Automatic extrac-

tion of useful facet hierarchies from text databases. In

ICDE, pages 466–475.

Dakka, W., Ipeirotis, P. G., and Wood, K. R. (2007). Faceted

browsing over large databases of text-annotated ob-

jects. In ICDE, pages 1489–1490.

Du, Y. and Chen, L. (2007). Using personalized knowledge

portal for information and knowledge integration and

sharing. In SKG ’07: Proceedings of the Third In-

ternational Conference on Semantics, Knowledge and

Grid.

Giannakidou, E., Kompatsiaris, I., and Vakali, A. (2008).

Semsoc: Semantic, social and content-based cluster-

ing in multimedia collaborative tagging systems. In

ICSC ’08: Proceedings of the 2008 IEEE Interna-

tional Conference on Semantic Computing.

Gordon, A. (1996). Hierarchical classiﬁcation. Clustering

and classiﬁcation.

Jackson, P. and Moulinier, I. (2002). Natural language pro-

cessing for online applications: text retrieval, extrac-

tion, and categorization. John Benjamins Publishing

Company.

Lacher, M. and Groh, G. (2001). Facilitating the exchange

of explicit knowledge through ontology mappings. In

Proceedings of the Fourteenth International Florida

Artiﬁcial Intelligence Research Society Conference.

Senra, R. D. A. and Medeiros, C. B. (2009). SciFrame: a

conceptual framework to describe data sharing in e-

Science. SBBD. III e-Science Workshop.

Uschold, M. and Gruninger, M. (1996). Ontologies: Princi-

ples, methods and applications. The Knowledge Engi-

neering Review, 11(02).

Weigend, A. S., Wiener, E. D., and Pedersen, J. O. (1999).

Exploiting hierarchy in text categorization. Inf. Retr.,

1(3).
