A SIMPLE METHOD FOR MINING AND VISUALIZING
COMPANY RELATIONS BASED ON WEB SOURCES
Maximilien Kintz and Jan Finzen
Fraunhofer IAO, Nobelstraße 12, 70569 Stuttgart, Germany
Keywords: Competitive Intelligence, Graph visualization, Company name normalization.
Abstract: One of the important aspects of market and competitive intelligence is the observation and analysis of a
partner, customer or competitor’s relations with other companies. Using Web-based sources such as press
releases, corporate Web sites or news articles and text mining technologies such as Named Entity
Recognition, it is possible to automatically extract company relations out of Web content and to build
network graphs showing how companies interact. Visualization software that can be integrated in a Web-
based application offers means to explore, search, and analyse these networks and their meaning for a
company. In this paper we demonstrate how to build a powerful company relation mining application with
very little effort by effectively connecting open source toolkits.
1 INTRODUCTION
Competitive intelligence is the activity of
monitoring and studying one’s partners and
competitors, their current activities, products,
relations etc. For a company working in a highly
competitive market, it is of major importance to gain
and maintain a current and complete overview of
competitors and partners as well as their relations
(e.g. are they customers, suppliers, etc.). As multiple
relations between a high number of companies and
organizations can be involved, this data can become
quite complex. An appropriate visual representation
of the data supports the user in analysing and
interpreting the contained information.
We built a prototype (shown in Figure 1) for
company relations visualization demonstrating the
effectiveness of two specific aspects:
• The use of freely available information on the
  Internet (we rely only on data available
  publicly and free of charge, for example in
  press releases, news sections of corporate
  websites and specialized news sites), and
• The power of mashing up free software and
  services for Web crawling, scraping,
  recognition of company names and graph
  visualization.
The remainder of this paper is organized as
follows: In Section 2 we present related work. In
Section 3 we present the methods used to retrieve
data from the Web and prepare it for the
visualization. In Section 4 we describe the user front
end and visualization possibilities. In the concluding
Section 5 we discuss some limitations of the current
implementation and propose ways to further develop
the tool and possible future outcomes.
2 RELATED WORK
The visualization of graph structures and networks is
a widely investigated research topic. Much research
was, for example, carried out in the 1990s in the
telecommunications sector (Becker et al., 1995),
focusing on the visual representation of telephone
network graphs.
More recently, several methods and
implementations have been proposed for the
visualization of groups of people or social networks,
both in generic real-world use cases (Freeman, 2000) and
with a specific focus on web-based social
networks (Buzgar and Buraga, 2008; Matsuo et al.,
2007).
The visualization of company relations, however,
remains a research area less investigated. Hu et al.
(2009) describe an approach to extracting and
analysing company relations that focuses on relation
type identification and temporal information.
However, they hardly discuss the visualization of the data.
Figure 1: Standard view of the application and (force-directed) graph for the company Acciona Energia.
Kintz M. and Finzen J.
A SIMPLE METHOD FOR MINING AND VISUALIZING COMPANY RELATIONS BASED ON WEB SOURCES.
DOI: 10.5220/0003300705970602
In Proceedings of the 7th International Conference on Web Information Systems and Technologies (WEBIST-2011), pages 597-602
ISBN: 978-989-8425-51-5
Copyright © 2011 SCITEPRESS (Science and Technology Publications, Lda.)
Some commercial Web Intelligence solutions
include graph-based visualization, but they generally
focus on specific topics and web sources
classification rather than on companies and
organizations. Some interesting implementations can
be found in solutions provided by IBM with its
COBRA tools (www.ibm.com/us/en), Vico Research
(www.vico-research.com) as well as by eCairn
(http://ecairn.com/). Unfortunately, the underlying
algorithms are not well documented in the literature.
3 IMPLEMENTATION OF DATA
RETRIEVAL AND MINING
Although much textual information can be easily
found on the Web, the graph-based visualization
needs to rely on a specific and structured data
format. In this section, we present the methods and
techniques used to transform unstructured textual
content into lists of relations that can easily be fed to
a visualization toolkit.
3.1 Web Crawling and Scraping
To build a corpus large enough to allow for
interesting visualizations, we defined a list of a
dozen sites that are active in a specific domain (in our
case renewable energies) and provide news
content via RSS feeds. The feeds are continuously
parsed with the ROME Java API
(https://rome.dev.java.net/) and complemented with the
full text extracted directly from the web site. We
applied two methods for scraping the meaningful text
from the web sites while ignoring the irrelevant parts
(navigational elements, advertising, etc.):
• Maintaining regular expressions for each of
  the considered websites, which provides
  very good quality but requires preconfiguration.
  It also has the drawback of not being robust
  against design changes of the target websites.
  We successfully applied this approach in our
  meta search engine for press releases (Finzen
  et al., 2009).
• A more generic approach based on relatively
  simple heuristics such as “longest paragraph of
  coherent English words”. This approach
  proved effective within a more powerful web
  mining framework (see Finzen and Kintz,
  2011), as it works well for basically any
  HTML page. However, it does not reduce
  noise as reliably as the first approach.
For the purpose of extracting company names
and visualizing relations between companies, we
found that both approaches worked quite well.
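The second, generic heuristic can be sketched as follows. This is only a minimal illustration and not the implementation used in the prototype: it assumes that the page can be split into blocks at paragraph-like tags, and it approximates "coherent English words" by the share of tokens found in a tiny stop-word list. The function name, the word list and the threshold are hypothetical.

```python
import re
from html import unescape

# Tiny illustrative stop-word list (assumption); a real system would use a dictionary.
COMMON_WORDS = {"the", "of", "and", "to", "in", "a", "is", "for", "on", "with", "that", "by"}


def longest_coherent_paragraph(html: str, min_ratio: float = 0.15) -> str:
    """Return the longest text block that looks like running English prose."""
    # Drop script and style elements together with their content.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    # Split the page at common block-level tags.
    blocks = re.split(r"(?i)</?(?:p|div|br|li|td|h[1-6])\b[^>]*>", html)
    best = ""
    for block in blocks:
        # Remove remaining inline tags and collapse whitespace.
        text = unescape(re.sub(r"<[^>]+>", " ", block))
        text = re.sub(r"\s+", " ", text).strip()
        words = re.findall(r"[A-Za-z']+", text.lower())
        if not words:
            continue
        # Share of very common English words as a crude coherence score.
        ratio = sum(w in COMMON_WORDS for w in words) / len(words)
        if ratio >= min_ratio and len(text) > len(best):
            best = text
    return best
```

Navigation bars and link lists typically contain few function words and are therefore skipped, which is also why this approach filters noise less reliably than site-specific expressions.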
3.2 Identification and Normalization
of Company Names
Once texts have been extracted from Web sites, we
need to identify the actual company and organization
names that will be the basis of our visualization.
This is achieved using Named Entity Recognition
(NER). Many tools and Web services are available
to perform NER (especially for the English
language). In our case, we used the OpenCalais
(www.opencalais.com/) service provided by
Thomson Reuters. Unfortunately, as of today, this
web service does not support German. For German
texts we therefore fall back on the Alchemy
(www.alchemyapi.com/) service, which offers
similar functionality but (according to our findings) in lower
quality. The dispatching between the two services is
based on the Alchemy language detection service
(www.alchemyapi.com/api/lang/). The service
results include many different annotation types
(people, places, dates, organizations and others), all of which
we store in a database for further analysis. The
visualization of the company relations is based solely
on company names.
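The dispatching described above can be illustrated with a short sketch. The actual OpenCalais and Alchemy client calls are deliberately not shown; instead the language detector and the NER services are passed in as plain callables. The annotation shape (a dictionary with "type" and "text" keys) and the type labels are assumptions made for this example.

```python
from typing import Callable, Dict, List

Annotation = Dict[str, str]  # assumed shape: {"type": ..., "text": ...}


def recognise_entities(
    text: str,
    detect_language: Callable[[str], str],
    ner_by_language: Dict[str, Callable[[str], List[Annotation]]],
) -> List[Annotation]:
    """Send the text to the NER service registered for its language."""
    language = detect_language(text)
    ner = ner_by_language.get(language)
    if ner is None:
        # No service available for this language: nothing to annotate.
        return []
    return ner(text)


def company_names(annotations: List[Annotation]) -> List[str]:
    """Keep only the annotations used for the visualization."""
    wanted = ("Company", "Organization")  # hypothetical type labels
    return [a["text"] for a in annotations if a["type"] in wanted]
```

All annotations returned by `recognise_entities` would be stored; only the subset returned by `company_names` feeds the graphs.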
Organization and company names can be found
in texts in many different forms. For example, the
company IBM can be referred to as “IBM
Corporation”, “IBM Corp.” etc. The OpenCalais
service tries to normalize the company names when
analysing a single text document, but does not give
unified answers over a set of multiple documents. To
avoid having different versions of the same company
name in the graphs, we developed and implemented
a simple normalization algorithm. The algorithm
works in six steps:
i) The name is written in lower case, in order to
eliminate case problems (names are written in
upper case in some texts).
ii) Special characters such as accents are ignored
because they are too often inconsistently
used.
iii) A list of common suffixes in German, French
and English company names (our primary
focus being those three languages) is
searched and if found removed. Common
suffixes include “Corp.”, “GmbH” or
“SARL”.
iv) Some other keywords such as “(c)” are
removed.
v) The name is written in title case (to look like
a “real name”).
vi) Special cases are considered. For example
“IBM” should be written all upper case.
Although some limitations of this algorithm are
obvious (no distinction between Apple Corp. and
Apple Inc., etc.), our tests showed that it significantly
improved the quality of the graphs.
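The six steps can be sketched in a few lines. This is a minimal sketch under stated assumptions: the suffix, keyword and special-case lists below are short illustrative examples and not the lists used in the prototype, and the function name is hypothetical.

```python
import re
import unicodedata

# Illustrative, incomplete lists (assumptions).
SUFFIXES = ["corporation", "corp.", "corp", "inc.", "inc", "ltd.", "ltd",
            "gmbh", "ag", "sarl", "sa"]
KEYWORDS = ["(c)", "(r)", "(tm)"]
SPECIAL_CASES = {"ibm": "IBM"}


def normalize_company_name(name: str) -> str:
    # i) Lower case to eliminate case differences.
    s = name.lower()
    # ii) Ignore accents: decompose characters and drop combining marks.
    s = "".join(c for c in unicodedata.normalize("NFKD", s)
                if not unicodedata.combining(c))
    # iii) Remove common legal-form suffixes at the end of the name.
    s = s.strip(" ,")
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if s.endswith(" " + suffix):
                s = s[: -len(suffix)].rstrip(" ,")
                changed = True
    # iv) Remove other keywords such as "(c)".
    for keyword in KEYWORDS:
        s = s.replace(keyword, " ")
    s = re.sub(r"\s+", " ", s).strip(" ,")
    # v) Title case, and vi) special cases such as "IBM".
    return " ".join(SPECIAL_CASES.get(w, w.capitalize()) for w in s.split())
```

With these lists, "Apple Corp." and "Apple Inc." both map to "Apple", which reproduces the limitation mentioned above.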
Another approach to company name
normalization has been proposed with the goal to
match names against spelling errors and facilitate
database integration (Magnani and Montesi, 2007).
Using simple pattern matching methods, the authors
were able to implement a high quality company
name harmonization tool. However, similar
drawbacks to those mentioned in our case were
observed.
The annotations are stored in the database as
follows: nature of the annotation, text of the
annotation, ID or URL of the text in which the
annotation was found, start index and length of the
annotation in the text. Thus it is possible to display a
version of the text in which all annotations are
highlighted.
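As an illustration, the stored fields map directly onto a small record type, and the start index and length are sufficient to highlight annotations in the text. The field names and the bracket markers are assumptions made for this sketch, which also assumes that annotations do not overlap.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Annotation:
    kind: str    # nature of the annotation, e.g. "Company"
    text: str    # text of the annotation
    source: str  # ID or URL of the text in which it was found
    start: int   # start index in the text
    length: int  # length of the annotation in the text


def highlight(text: str, annotations: List[Annotation],
              open_mark: str = "[", close_mark: str = "]") -> str:
    """Wrap every annotated span in markers (assumes non-overlapping spans)."""
    out = text
    # Work from right to left so that earlier offsets remain valid.
    for a in sorted(annotations, key=lambda a: a.start, reverse=True):
        end = a.start + a.length
        out = out[:a.start] + open_mark + out[a.start:end] + close_mark + out[end:]
    return out
```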
3.3 Identification of Company
Relations
The simplest relation between two companies that
can be extracted from Web texts is the co-occurrence
relation. This means that we consider
two companies mentioned in the same text to be in a
relation of some kind. The more texts contain
both names, the stronger the
relation. As the co-occurrence relation is not
directed, a total of n(n-1)/2 relations are extracted
from a text in which n companies are identified.
Tests showed that, in order to avoid having too
much noise (i.e. meaningless relations) in the graphs, it
is reasonable to ignore relations stemming from
articles containing a very large number of company
names, because such articles (e.g. stock reports) are likely
to contain only a list of unrelated companies.
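A minimal sketch of the co-occurrence extraction, including the noise filter, could look as follows. The cut-off value of 10 companies per text is an assumption chosen for the example; the threshold used in the prototype is not stated.

```python
from itertools import combinations
from typing import Iterable, List, Tuple

Relation = Tuple[str, str, str]  # (first company, second company, source)


def cooccurrence_relations(companies: Iterable[str], source: str,
                           max_companies: int = 10) -> List[Relation]:
    """Return the undirected co-occurrence relations found in one text."""
    # Each company counts once per text; sorting fixes the order within a pair.
    names = sorted(set(companies))
    if len(names) > max_companies:
        # Probably a list-like article (e.g. a stock report): ignore it.
        return []
    # n distinct names yield n(n-1)/2 unordered pairs.
    return [(a, b, source) for a, b in combinations(names, 2)]
```

Because each pair is emitted in sorted order, the same two companies always produce the same (first, second) tuple, which simplifies counting co-occurrences later.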
Once computed, the relations are stored in the
database as follows: (normalized) name of first
company, (normalized) name of second company
and URL or ID of the text from which the relation is
extracted. This is all the information needed to build
the graphs.
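The relation table and the notion of relation strength can be illustrated with an in-memory SQLite database. The table and column names are hypothetical, the database system used in the prototype is not specified, and the sketch assumes that each pair is always stored in the same order.

```python
import sqlite3
from typing import Iterable, List, Tuple


def create_store() -> sqlite3.Connection:
    """Create an in-memory database with one row per extracted relation."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE relations ("
        "first_company TEXT NOT NULL, "
        "second_company TEXT NOT NULL, "
        "source TEXT NOT NULL)"
    )
    return conn


def add_relations(conn: sqlite3.Connection,
                  relations: Iterable[Tuple[str, str, str]]) -> None:
    conn.executemany("INSERT INTO relations VALUES (?, ?, ?)", relations)
    conn.commit()


def relation_strengths(conn: sqlite3.Connection) -> List[Tuple[str, str, int]]:
    """Count the distinct texts in which each pair of companies co-occurs."""
    query = """
        SELECT first_company, second_company, COUNT(DISTINCT source) AS strength
        FROM relations
        GROUP BY first_company, second_company
        ORDER BY strength DESC, first_company, second_company
    """
    return conn.execute(query).fetchall()
```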
Full texts, named entities (annotations), company
names and relations are stored in the database. The
whole process is performed repeatedly: the web sites
specified in the first step are crawled every 30
minutes for new content.
4 IMPLEMENTATION
OF VISUALIZATION
Once the data has been extracted from web sources
and stored in appropriate formats in a database, it
can be queried and transformed into visualizations.
The core of the implementation of the visualization
relies on the Prefuse Flare (http://flare.prefuse.org/)
toolkit, an Adobe Flex (www.adobe.com/products/flex)
visualization toolkit very similar to
the older and well-known Java-based Prefuse
visualization toolkit (Heer et al., 2005; http://prefuse.org/),
which was developed at the University of California,
Berkeley and used in our previous work (Finzen et al.,
2009). The visualization runs on the client side in a Flex
application; a server is used to query
the database and to return a list of relations to the
client.
In the following paragraphs, we describe the
general user interface (UI) developed for the tool,
the database querying process, the graph layouts and
the interaction possibilities.
4.1 Input and Search UI
Based on our own research and on discussions with
a partner company intending to use the tool for its
own competitive intelligence needs, we developed
four ways allowing the user to specify a query:
• Choosing a company name from the list of all
  names available in the database and
  displaying the graph for this company. The
  standard graph includes the companies
  directly related to the chosen company as
  well as the companies related to those
  companies; we
  speak of a graph of level two. This gives
  a general overview of the partners and
  competitors of a company.
• Choosing two companies from two lists and
  checking whether a direct (or
  indirect, i.e. via one intermediary
  company) relation exists between the two
  companies. After the first company has been
  selected by the user, the second list is
  automatically reduced to give the user a set of
  meaningful choices (for example, the first
  company is removed from the list and only
  those companies that co-occur with the first
  one at least an adjustable number of times are
  shown). This allows checking whether a relation
  that the user assumes to exist can be
  attested by a Web source.
• Entering a search expression (keywords, start
  and end date) and displaying the relations graph
  corresponding to the texts matching this
  search expression. This restricts the graph to
  co-occurrences of company names in a
  certain context.
• Specifying a search expression, a time frame
  and a step size, and animating the graph by
  interpolating between the steps in the time
  interval. This allows detecting significant
  changes and developments in the
  activity of a company’s network regarding
  a certain topic.
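The graph of level two used in the first query type can be derived from the flat list of relations. The following sketch is an illustration only; the function name and the return format are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Relation = Tuple[str, str, str]  # (first company, second company, source)


def level_two_graph(relations: List[Relation],
                    company: str) -> Tuple[Set[str], Set[str], List[Relation]]:
    """Return direct neighbours, indirect neighbours and the relations to draw."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for a, b, _source in relations:
        neighbours[a].add(b)
        neighbours[b].add(a)
    # Level one: companies co-occurring with the chosen company.
    level_one = set(neighbours.get(company, set()))
    # Level two: companies related to a level-one company.
    level_two: Set[str] = set()
    for n in level_one:
        level_two |= neighbours[n]
    level_two -= level_one | {company}
    # Keep every relation touching the chosen company or a direct neighbour.
    inner = level_one | {company}
    edges = [r for r in relations if r[0] in inner or r[1] in inner]
    return level_one, level_two, edges
```

For the second query type, a relation between two companies X and Y is direct if Y is in the first set returned for X, and indirect if Y is in the second set.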
4.2 Querying the Database
The data to be visualized is obtained using a REST
interface. All search parameters are passed in a
URL. A back-end server (we use a Tomcat servlet
container) then performs a database query, either
searching the relations table directly (if the user
specified a company name) or matching the query
against the documents and then selecting the relations
extracted from the matching documents (if the user entered a
generic search query). Finally, the server returns a
JSON-formatted list of relations with the name of
the first and second companies as well as an ID or
source URL indicating the document that originated
the relation. The graph is then built client-side by the
Flex application using the Prefuse Flare toolkit.
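To illustrate the exchange format, the sketch below serialises a list of relations to JSON and rebuilds nodes and weighted edges from it, as a client would. The key names "first", "second" and "source" are assumptions; the paper does not specify the exact JSON layout, and the actual client is a Flex application rather than Python code.

```python
import json
from collections import Counter
from typing import Dict, List, Tuple

Relation = Tuple[str, str, str]  # (first company, second company, source)


def relations_to_json(relations: List[Relation]) -> str:
    """Server side: encode the relations as a JSON list of objects."""
    return json.dumps(
        [{"first": a, "second": b, "source": s} for a, b, s in relations]
    )


def build_graph(payload: str) -> Tuple[List[str], Dict[Tuple[str, str], int]]:
    """Client side: derive nodes and edge weights from the JSON payload."""
    records = json.loads(payload)
    nodes = sorted({r["first"] for r in records} | {r["second"] for r in records})
    # An edge's weight is the number of relations between the same two nodes.
    weights = Counter(
        tuple(sorted((r["first"], r["second"]))) for r in records
    )
    return nodes, dict(weights)
```

The edge weight computed here corresponds to the number of co-occurrences and could drive the edge shading in the graph.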
4.3 Graph Layout
The core of the application is a custom-built Adobe
Flex visualization tool based on the Flare toolkit. It
provides a user interface as well as layout engines
and interaction controllers allowing the user to
interact with the data. The Flare toolkit was chosen
as it is an established open source visualization
toolkit that is relatively well documented and offers
many customization and extension possibilities. This
choice implied the use of a Flex-based user interface,
which allows for the easy creation of user-friendly
and interactive interfaces. The main drawback of
this choice is that it prevents the tool from being run
on hardware that does not support Adobe Flash, such as
Apple’s iPad.
The Prefuse Flare toolkit provides all methods
needed to communicate with a server using a REST
interface. Very little scripting is needed to transform
a list of company relations into a graph structure. In
the graph, the companies correspond to the nodes
whereas the relations between them correspond to
the edges. The force-directed and radial graph
layouts provided by the Prefuse Flare toolkit proved
most useful in our use case. The force-directed
layout is animated and allows for a lot of interaction
from the user, with the possibility to drag
nodes in order to explore specific regions or blocks
of the graph. The radial layout (shown in Figure 2)
is static and gives a good general view of the
number of companies directly and indirectly related
to a chosen company, but is less appropriate for
obtaining a detailed view of specific regions of
the graph (mainly because some nodes tend to
overlap). Using the radial layout, the directly related
companies can clearly be seen on the first, inner
circle, the indirectly related companies being on a
second, bigger circle.
Figure 2: Radial graph for the company Actebis (orange, in the centre).
As can be seen in Figure 3, some edges are
darker than others. This indicates that
some relations are more important (e.g. based on
more co-occurrences) than others. Clicking on such
an edge replaces it with a number of curved edges,
each corresponding to a specific co-occurrence, as
shown in Figure 3. The implementation of this
functionality required an extension of the Prefuse
Flare toolkit to distinguish multiple edges between
the same two nodes. We use Bézier curves
alternating on either side of the line linking the two nodes.
This allows for a compact and clear visualization of
the edges and avoids overlaps with other edges of
the graph.
Figure 3: Multiple mentions of a relation between Actebis
and the Commerzbank AG.
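One way to obtain such alternating curves is to place the control point of a quadratic Bézier curve on the perpendicular bisector of the line between the two nodes, switching sides and increasing the distance for each additional edge. The sketch below illustrates this geometry only; the choice of quadratic curves, the spacing value and the function names are assumptions, and the actual Flare extension is not reproduced here.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def control_points(p0: Point, p1: Point, count: int,
                   spacing: float = 20.0) -> List[Point]:
    """Control points for `count` curved edges between the nodes p0 and p1."""
    (x0, y0), (x1, y1) = p0, p1
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("the two nodes must not coincide")
    # Unit vector perpendicular to the line p0 -> p1.
    nx, ny = -dy / length, dx / length
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    points = []
    for i in range(count):
        # Offsets +1, -1, +2, -2, ... times the spacing.
        offset = (i // 2 + 1) * spacing * (1 if i % 2 == 0 else -1)
        points.append((mx + nx * offset, my + ny * offset))
    return points


def quadratic_bezier(p0: Point, c: Point, p1: Point, t: float) -> Point:
    """Point on the quadratic Bezier curve for parameter t in [0, 1]."""
    u = 1 - t
    return (u * u * p0[0] + 2 * u * t * c[0] + t * t * p1[0],
            u * u * p0[1] + 2 * u * t * c[1] + t * t * p1[1])
```

Each curve starts and ends at the two nodes, so all expanded edges stay within a narrow lens-shaped area around the original edge.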
4.4 Interaction
User interaction is a central aspect of the tool
and the main reason for the choice of a Flex-based
visualization framework. In addition to standard
interaction techniques directly provided by the Flare
toolkit, such as zooming with the mouse wheel and
panning by clicking and dragging, we added some
simple custom-built click-based interactions:
A double click on a node (one node representing
one company) displays a pop-up window offering
the choice between loading a graph centered on the
chosen company and loading a Web page associated
with the company name (this could be the list of all texts
associated with the company in the database; our
current implementation directs the user to a search
engine results page showing information related to
the company). A single click on an edge switches
the display of multiple relations between a collapsed
and an expanded mode, as explained in the previous
paragraph. A double click on an edge loads a Web
page helping the user understand the current
relation; in our implementation we show the text
from which the relation was inferred.
By moving the mouse over parts of the graph, the
user can get a zooming effect: company names
(nodes) as well as relations (edges) are highlighted
using a different color and have their size increased
to help the user distinguish them from other non-
highlighted elements of the graph.
5 CONCLUSIONS
AND OUTLOOK
Using mostly simple and freely available
technologies, we were able to build a powerful tool that
visualizes relations between companies and
organizations extracted from a defined set of web
sources. Furthermore, it is possible to integrate the
tool with a larger solution that lets the user define
and adapt the list of sources to monitor. We found
that even without analyzing the detailed semantics of
the relations between organizations and only
focusing on co-occurrence, it was possible to
quickly obtain meaningful and helpful graphs for
day to day web intelligence and especially
competitive intelligence activities.
We presented some ways to improve the tool by
developing the text mining aspects and using or
building better tools to identify the nature of the
relation between companies. Furthermore, we
presented ways to use this information to improve
the visualization and usage of the knowledge gained
from building these graphs.
Although the simple implementation developed
in less than one man month gave interesting and
useful results that are currently being evaluated
during a field test with a partner company, some
limitations as well as ways to improve the tool can
be mentioned.
5.1 Limitations
An obvious limitation of the current state of the tool
is that all relations are co-occurrence relations and
are presented in the same way, regardless of their
actual meaning and importance. It would be helpful
to define a certain number of relation types and to
use the possibilities of a visual user interface to
distinguish between these types as well as between relations
of primary and of secondary importance
with regard to the use case.
Not only relations but also company types could be
distinguished. With regard to the competitive
intelligence use case, it would be helpful to display
partner companies in one color and competitor
companies in another color, further distinguishing
between customers, suppliers, etc.
Another kind of limitation comes from the
implementation: the whole graph is loaded at once
and is either updated completely or not at all. A more
interactive, on-the-fly data retrieval would help
the navigation in large company networks.
5.2 Future Work
Some of these limitations are to be addressed in our
future work. Using and adapting advanced text-
mining tools, it is possible to detect and classify a
certain number of relations between companies, such
as “customer of” or “acquirer of”, as shown e.g. by
Hu et al. (2009). This work will be accompanied by
a proposed classification of company relation types
(customer, supplier, etc.) and attributes (directed,
transitive, etc.). Another important aspect related to
this classification is the analysis of internal company
or group structures.
Another part of the planned work consists of
improving the organization name
normalization algorithm. An aspect that has been ignored
so far is the multilingualism of Web sources,
which means that “Microsoft Germany” and
“Microsoft Deutschland” are currently considered
two distinct companies. In many cases, this could be
addressed by well-built look-up lists.
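Such a look-up list could be as simple as a dictionary that maps known language variants to one canonical name. The entries and the function name below are purely illustrative assumptions.

```python
from typing import Dict

# Hypothetical look-up list: lower-cased variant -> canonical company name.
ALIASES: Dict[str, str] = {
    "microsoft deutschland": "Microsoft Germany",
}


def resolve_alias(name: str, aliases: Dict[str, str] = ALIASES) -> str:
    """Return the canonical name for a known variant, else the name itself."""
    return aliases.get(name.strip().lower(), name)
```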
An evaluation of the recall achieved by
automatic detection of company relations is also
planned.
REFERENCES
Finzen, Jan, Kintz, Maximilien, Kett, Holger, Koch,
Steffen. 2009. Strategic Innovation Management on
the Basis of Searching and Mining Press Releases.
Proceedings of the 5th WEBIST conference, Lisbon,
Portugal, March 23-26, 2009.
Finzen, Jan, Kintz, Maximilien. 2011. Innovation Mining.
Proceedings of the 7th WEBIST conference,
Noordwijkerhout, The Netherlands, May 06-09, 2011.
Heer, Jeffrey, Card, Stuart K., Landay, James A. 2005.
Prefuse: a toolkit for interactive information
visualization. Proceedings of the SIGCHI conference
on Human factors in computing systems, Portland,
Oregon, USA, April 02-07, 2005.
Freeman, Linton C. 2000. Visualizing Social Groups.
Proceedings of the Section on Statistical Graphics.
American Statistical Association.
Buzgar, Adrian N., Buraga, Sabin C. 2008. Visualizing
Online Social Networks in the Context of Web 2.0.
Sisteme Distribuite, University Stefan cel Mare of
Suceava, Suceava, Romania.
Hu, Changjian, Xu, Liqin, Shen, Guoyang, Fukushima,
Toshikazu. 2009. Temporal Company Relation Mining
from the Web. Lecture Notes in Computer Science,
2009, Volume 5446/2009, 392-403.
Magnani, M., Montesi, D. 2007. A study on company
name matching for database integration. Technical
Report UBLCS-07-15. May 2007.
Matsuo, Yutaka, Mori, Junichiro, Hamasaki, Masahiro,
Nishimura, Takuichi, Takeda, Hideaki, Hasida, Koiti,
Ishizuka, Mitsuru. 2007. POLYPHONET: An
advanced social network extraction system from the
Web. Web Semantics. 5, 4 (December 2007), 262-278.