relations or events can involve referents of multiple
entities, the likelihood of accurately extracting all
arguments of a relation or event is low.
Using ACE terminology, a relation is defined as
an ordered pair of entities with an asserted
relationship of a specific interesting type (“ACE
English,” 2005). A relation can thus be thought of as a
four-tuple: <entity1, relation, entity2, time>. For
example, “Scott was a member of ACM for four
years” contains a relation where the first entity is
“Scott,” the second entity is “ACM,” and the time is
a duration of “four years.” This relation has a type
of “Organization Affiliation” and has a subtype of
“Membership.”
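The four-tuple view can be made concrete with a small record type. The following is a minimal sketch, not part of the ACE specification: the class name `Relation`, its field names, and the choice to keep the time value as raw text are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Relation:
    """Illustrative record for <entity1, relation, entity2, time> (hypothetical names)."""
    entity1: str                 # first argument; argument order matters
    rel_type: str                # e.g. "Organization Affiliation"
    rel_subtype: str             # e.g. "Membership"
    entity2: str                 # second argument
    time: Optional[str] = None   # optional time value, kept as raw text here

    def as_tuple(self) -> Tuple[str, str, str, Optional[str]]:
        """Return the four-tuple form, folding type and subtype into one label."""
        label = f"{self.rel_type}/{self.rel_subtype}"
        return (self.entity1, label, self.entity2, self.time)


# The example sentence: "Scott was a member of ACM for four years"
membership = Relation(
    entity1="Scott",
    rel_type="Organization Affiliation",
    rel_subtype="Membership",
    entity2="ACM",
    time="four years",
)
print(membership.as_tuple())
```

Keeping type and subtype as separate fields mirrors the way the example is described, while `as_tuple` recovers the compact four-tuple form.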
An event is defined as a specific occurrence
involving participants and a “trigger” word that best
represents the event (e.g., “attacked” in “The rebels
attacked the convoy yesterday”) (“ACE English,”
2005). Despite this broad definition, ACE limits its
events to a set of types and subtypes that are most
interesting. For example, “Jen flew from Boston to
Paris” contains a “travel” event, defined as an event
that captures movement of people, that is, a change
of location of one or more people. The captured
arguments of the event would be the travelling entity
“Jen,” the origin “Boston,” and the destination
“Paris.” Like relations, events can have associated
time values (“Working Guidelines,” 2007).
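An event differs from a relation in that it carries a trigger word and a variable set of arguments. The sketch below is one possible representation under stated assumptions: the class name `Event`, the role names `traveller`, `origin`, and `destination`, and the choice of “flew” as the trigger of the example sentence are illustrative and are not taken from the ACE guidelines.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Event:
    """Illustrative event record (hypothetical names, not the ACE schema)."""
    event_type: str                                            # e.g. "travel"
    trigger: str                                               # word that best represents the event
    arguments: Dict[str, str] = field(default_factory=dict)   # role name -> entity text
    time: Optional[str] = None                                 # events may carry a time value


# "Jen flew from Boston to Paris"; "flew" as trigger is an assumption
travel = Event(
    event_type="travel",
    trigger="flew",
    arguments={"traveller": "Jen", "origin": "Boston", "destination": "Paris"},
)

for role, entity in travel.arguments.items():
    print(f"{role}: {entity}")
```

A dictionary keyed by role is used because the number of arguments varies from one event type to another, unlike the fixed two arguments of a relation.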
When a leading rule-based commercial extractor was
examined on 230 annotated internal documents, it
identified the “ORG-AFF/Membership” relation
with a precision of 47%
(meaning that 47% of the times it identified this
relation, the relation existed in the data). The recall
was also 47%, meaning that 53% of the membership
relations in the data were missed by the system. For
those relations that were identified, the first entity,
the person, was identified with 71% precision,
meaning that 29% of the items that the system
returned were incorrect. For the second entity, the
organization, the precision was 85%. After the
company improved its system, the relation
identification score rose to 70%, while the scores
for the two entity arguments remained the same. A
member of this company suggested that this score was
considered “very good” for relations and was unsure
that much more improvement could be obtained.
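The precision and recall figures above follow the standard definitions, which can be written out as a short sketch. The counts in the example are invented solely so that both measures come out at 47%; they are not the counts from the evaluation described here, which the text does not report.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of the relations returned by the extractor that are correct."""
    returned = true_positives + false_positives
    return true_positives / returned if returned else 0.0


def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of the relations present in the data that the extractor found."""
    present = true_positives + false_negatives
    return true_positives / present if present else 0.0


# Hypothetical counts chosen only to reproduce the 47% figures.
tp, fp, fn = 47, 53, 53
p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision={p:.0%} recall={r:.0%} missed={1 - r:.0%}")
```

The share of relations missed by the system is simply one minus the recall, which is how a recall of 47% corresponds to 53% of the membership relations being missed.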
Unfortunately, relations and events are often the
key assertions that one needs in a knowledge base in
order to identify information about people and/or
organizations. Due to the high error rate in
extraction technology, rather than introducing errors
into the knowledge base, a preferred solution might
be semi-automatic population of a knowledge base,
involving the presentation of extracted information
to users who can validate the information, including
accepting, rejecting, correcting, or modifying it
before uploading it to the knowledge base. This
interface must be designed in a manner that supports
the users’ workflow when doing this task. Ideally,
the interface would make populating the knowledge
base significantly faster than entering the data
manually. Since extractor recall tends to be less
than 60%, the interface must let users not only
correct the precision errors that the extractor makes
but also add information that the extractor missed
(recall errors).
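The validation step described above can be sketched as a simple filter between the extractor and the knowledge base. This is a minimal illustration under stated assumptions and is not the FEEDE implementation: the names `Decision`, `Assertion`, and `validate` are hypothetical, correcting and modifying are collapsed into a single `CORRECT` action, and the sample assertions are made up.

```python
from enum import Enum
from typing import Dict, List, Optional, Sequence, Tuple

Assertion = Tuple[str, str, str]  # (entity1, relation, entity2), hypothetical shape


class Decision(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    CORRECT = "correct"  # the user supplies a fixed version of the assertion


def validate(
    candidates: Sequence[Assertion],
    decisions: Dict[int, Tuple[Decision, Optional[Assertion]]],
    additions: Sequence[Assertion] = (),
) -> List[Assertion]:
    """Return only the assertions a user has approved for upload."""
    approved: List[Assertion] = []
    for index, candidate in enumerate(candidates):
        decision, fixed = decisions.get(index, (None, None))
        if decision is Decision.ACCEPT:
            approved.append(candidate)          # extractor was right
        elif decision is Decision.CORRECT and fixed is not None:
            approved.append(fixed)              # precision error repaired by the user
        # Rejected or unreviewed candidates never reach the knowledge base.
    approved.extend(additions)                  # recall errors: items the extractor missed
    return approved


# Made-up extractor output and user decisions.
candidates = [
    ("Scott", "Membership", "ACM"),
    ("Paris", "Membership", "Boston"),
    ("Jen", "Membership", "AC"),
]
decisions = {
    0: (Decision.ACCEPT, None),
    1: (Decision.REJECT, None),
    2: (Decision.CORRECT, ("Jen", "Membership", "ACM")),
}
additions = [("Jen", "Membership", "IEEE")]
print(validate(candidates, decisions, additions))
```

Unreviewed candidates are held back by default, which reflects the stated preference for not introducing extractor errors into the knowledge base.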
In this paper, we describe the challenges faced in
this task and define the design for our system,
FEEDE – Fix Extractor Errors before Database
Entry. We also discuss the required elements as
defined by our end users, the interface’s design, and
an examination of the extractors used to populate it
with initial content to be authenticated. Given the
daunting task of manually entering all important
information in a knowledge base from unstructured
text, we believe this effort is important for saving
users time, which is both a valuable commodity in
this information age and a cost saving for the
enterprise.
To our knowledge, this is the first research effort
on developing an interface using content extraction
from unstructured text for populating knowledge
bases. It has only been in the past year (“Automatic
Content Extraction,” 2008) that the automatic
extraction community has started to focus on text
extraction for the purpose of populating databases.
In 1996, there was an interface effort for structured
data (metadata) (Barclay, 1996). Furthermore, since
content extraction efforts have not been focused on
the database issue, they are missing certain items
that are important for such endeavours. A recent
survey of extraction elements important to our users
revealed that only 25 out of the 47 requested (53%)
were in the ACE guidelines.
2 CONTENT EXTRACTION FOR
DATABASING ISSUES
ACE provides specifications for tagging and
characterizing entities, relations and events in text,
as well as some other features. For entities, the key
attributes are type and subtype. Mention categories
are also important attributes, determining the
specificity of the entities, such as a pronoun referent
to an entity name. Relations and events also feature
types and subtypes as well as arguments—two for
relations, where the order matters, and potentially
many different arguments for events where the
allowed set depends on the event type. Although
quite extensive, the ACE guidelines (“ACE