A Structural Model of Internet Organization Discovery
Zi-yu Yang, Xiao-yun Wang, Hong-mei Ma and Li Qin
Library, China Defense Science and Technology Information Center, Beijing, China
Keywords: Organization Discovery, Social Network, IP Allocations, Routing Registry.
Abstract: This paper presents a highly structured model to automatically discover Internet organizations from the data
of RIR (Regional Internet Registry) and IRR (Internet Routing Registry), where network operators register
their networking resources such as IP addresses and routing policies. Our basic idea is to discover network
operators that have close ties among each other from those registry activities, and consider them as being
from the same organization. With the data from two RIRs, this model produces to date the first organization
level network of current Internet. The model shows its reasonability in our validation with real Internet
routing data, and is likely to be applied extensively in networking area.
1 INTRODUCTION
The Internet is running under the administration of
thousands of organizations, each of which can be an
institution, a company or a university. An Internet
Organization (IORG) is an organization that uses at
least an IP network to host application services or
access the Internet. Despite the important role of
IORGs in advancing the Internet forward, neither
those organizations nor the relationships in between
have been well understood or characterized yet. In
fact, even how to identify IORGs is still an open
question, making the further studies such as their
formed ecosystem impossible.
IORG discovery is difficult for several reasons.
First of all, considering the tremendous size of the
Internet, the amount of organizations is huge as well.
Secondly, there is no authoritative source of IORG
data. Last but not least, as the growth of the Internet,
IORGs are highly dynamic. Some IORGs may
vanish while some new ones may appear. Therefore,
any approach that used to discover IORGs should be
highly structured and automated such that the
discovery process can be regularly repeated with
new input.
In this paper, we design and propose a highly
structured model to automatically discover IORGs
from the data of RIR (Regional Internet Registry),
where network operators register their networking
resources such as IP addresses and routing policy.
Let network operators be the representative of
organizations in the registry activities, our basic idea
is to discover network operators that have close ties
among each other from those registry activities, and
consider them as being from the same organization.
To our best knowledge, this has been no similar
research in the direction of IORG discovery so far.
Researches closest to ours are (Siganos and
Faloutsos, 2007) and (Cai and Heidemann, 2010).
Siganos et al. used the allocation records from RIRs,
registered ISP (Internet Service Provider) routing
policy from IRR(Internet Routing Registry) to detect
erroneous and suspicious routing behaviour.
However, they restricted their work on data only,
and did nothing about IORGs. Cai et al. aimed for an
AS(Autonomous System)-to-Organization map that
allows a more accurate view of Internet in the
granularity of AS. Since the size of organizations
vary largely, and only large organizations are
qualified to apply for AS numbers, their work
actually focused on only large organizations. On the
contrary, we provide a finer granularity to observe
the organization-level Internet. Beside, different
from that both these two studies rely on ad-hoc
methods and require a large amount of manual
intervention, our model is highly structured and
automated.
The remainder of this paper is structured as
follows. Section 2 introduces the registry activities
of network operators that serve as the base of our
model. We present the discovery methodology in
section 3. The discovery and validation results are
shown in section 4, and we conclude this paper in
section 5.
344
Yang, Z., Wang, X., Ma, H. and Qin, L..
A Structural Model of Internet Organization Discovery.
In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2015) - Volume 1: KDIR, pages 344-349
ISBN: 978-989-758-158-8
Copyright
c
2015 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
2 INTERNET REGISTRY
ACTIVITIES
The virtual activities we consider to be the mirror of
IORGs activities in real world are Internet registry
activities, which can be divided into two categories,
regarding Internet number resources and routing
policies respectively.
2.1 Registry of Resource Allocations
Internet number resources refer to IP addresses and
AS numbers, which are essentials to access the
Internet or perform network management on AS
level.
IANA (Internet Assigned Numbers Authority)
serves as the root of resource allocation chain to
ensure the unique use of Internet number resources.
RIRs are set up to coordinate the allocation of those
resources inside their own regions on behalf of
IANA. There are currently five RIRs: ARIN, APNIC,
AFRNIC, RIPE and LACNIC. RIRs subsequently
allocate number resources received from IANA to
NIRs (National Internet Registry, e.g. CNNIC), or
directly to LIRs (Local Internet Registry, e.g. AT&T)
which are usually large ISPs (Internet Service
Provider). NIRs and LIRs can further allocate the
resources they received from RIRs to end users or
other ISPs.
The allocation activities of number resources are
required to be registered in related RIR databases.
To ensure the completeness and freshness of registry
information, RIRs usually carry out strict policy to
assure that allocation activities are registered in time.
Figure 1: IP allocations registered by two institutes.
For example, ARIN claims that every allocation or
assignment that contains eight or more IP addresses
should be recorded in its database. If not, future
allocations would be impacted. Similarly, RIPE will
check the correctness of relevant registry
information when a LIR or ISP requests for a new
allocation.
In Figure 1, there are two allocation records of IP
address, registered by RIPE-NCC and Netvision
company respectively. Based on these two records,
we can know that RIPE-NCC has the administration
authority over IP addresses ranging from 62.0.0.0 to
62.255.255.255, while Netvision company owns the
network 62.0.4.0~62.0.4.255.
2.2 Registry of Routing Policy
Routing registry is used to improve the Internet wide
routing by sharing routing policies among ISPs, and
the institution in charge is IRR (Internet Routing
Registry). An ISP can leverage IRR to publish its
routing policy, or look up peering agreements to
optimize its routing policy. IRR consists of several
distributed databases that usually mirror each other.
For instance, RIPE has mirrored more than 10
partners, including ARIN-RR, APNIC, NTTCOM
DB, Merit RADB and so on.
Figure 2: A route object registered by an Internet company.
Figure 2 depicts a route record registered by
Internet ProLink NZ Limited, a company located in
Auckland, New Zealand. By registering this record
in the database of APNIC, this company claims that
AS6831 is authorized to originate the network prefix
202.36.121.0/24 in Internet routing system, and ISPs
worldwide can use this information to filter false
announcements regarding this network.
2.3 Route Policy Specific Language
RIRs use different languages to describe resource
allocations and registered routing polices. Currently,
ARIN uses SWIP (Shared Whois Project), RIPE and
APNIC use RPSL(Route Policy Specific Language),
while LACNIC uses a mix of both. Our model is
A Structural Model of Internet Organization Discovery
345
based on RPSL for its wider usage, and it can be
applied to SWIP with tiny modifications.
RPSL is designed to specify routing policy at
various levels, ranging from router to AS. In an ideal
case, low-level router configurations can be directly
generated from the routing policies described at AS
level. Like typical object-orientated language, RPSL
comprises several classes, each of which uses a set
of attributes to describe its object instances. In our
model, RPSL classes are classified into three
categories: PoC(Point of Contact), NR(Number
Resource) and RP(Routing Policy), according to the
content being described.
(1) PoC classes.
PoC classes describe contact information. For
details, PoC classes include mntner, person and role
class. The mntner class specifies authentication
information required to add, delete or modify other
objects. The person class describes the information
necessary to contact a person. The role class is very
similar to person class except for that instead of
describing a human being, a role object describes a
role performed by one or more human beings. In this
way, role does not have to change when a person
performing this role changes.
(2) NR classes.
NR classes describe Internet number resources,
such as inetnum, inet6num and domain class in
RPSL.
(3) RP classes.
RP classes are used to describe routing policy.
For example, the inet-rtr class defines a router via
this router’s DNS name, the IP address of each
interface, the AS number of the AS which owns or
operates this router and such information.
The NR and RP objects can establish direct
connections with network operators by referring to
the class key of PoC objects through their admin-c,
tech-c and mnt-* attributes(including mnt-by, mnt-
lower, mnt-routes and so on), as shown in Figure 3.
Figure 3: PoC objects are referred as contact points.
While admin-c attribute usually refers to
someone who is physically located at the site of the
network, the tech-c attribute indicates a person
responsible for the day-to-day operation of the
network, but does not need to be physically located
at the site of the network.
3 METHODOLOGY
Our methodology is as follows. Firstly, we build a
MDN (multiple dimension network) to characterize
the interrelationship of various elements in registry
data, which is the outcome of network operators’
registry activities. Secondly, we quantify how close
are two network operators with the tie strength in
between, which is calculated based on the paths
between these two operators in the built MDN. At
last, network operators are grouped into clusters
according to the tie strength among them, and each
cluster is considered as an organization.
3.1 Building Multi-Dimension Network
Let symbol D={as-block,as-set,aut-num,inet6num,
inetnum ,mntner…, mail, phone/fax number} denote
the dimension vector of the MDN, each element in
D represents either a RPSL or a user-defined class.
3.1.1 Vertexes of MDN
Let V
(i)
denote the set of vertexes from dimension i
D, then V
(i)
should be the union set of all the
object instances’ class keys of class i. Similarly, the
vertex sets of email and phone/fax number
dimension are all the email addresses and phone/fax
numbers that appeared in the dataset.
3.1.2 Discovering Links from RPSL Objects
MDN links are primarily generated from RPSL
objects, each of which is essentially a collection of
attributes. For each RPSL object r, it has a key
attribute (r.k) and a set of non-key attributes (r.NK).
For each attribute xNK, the non-key attribute x can
be:
(1) Key of other RPSL objects. The definition of
r leverages the information that has already been
defined by some other RPSL objects. In this case, a
link kx is added to the link set.
(2) Plain text. Since natural language processing
is not so accurate, no links are generated in this case
to avoid importing uncertainty.
KDIR 2015 - 7th International Conference on Knowledge Discovery and Information Retrieval
346
(3) Email, phone/fax number. By referring to
email addresses, phone or fax numbers, we know
how to reach the personnel responsible for record r.
In this case, we generate a link between k and x, and
add it to the link set.
3.1.3 Discovering Links via Correlation
within Dimension
Within each dimension i D, our purpose is to
discover any two vertexes whose key attributes are
related, and generate a link between these two
vertexes.
Take dimension inetnum as an example, as
shown in Figure 1, since the IP address block
62.0.4.0-62.0.4.255 is a subset of another, 62.0.0.0-
62.255.255.255, we add a link between these two
inetnum objects, as shown in Figure 4. This method
also works for dimension as-block, and we do not
repeat here for the sake of brevity.
Figure 4: Adding a link between two inetnums relevant.
As for email dimension, we manually build a
blacklist of domain names that are from RIRs or
public email services, such as Gmail, Yahoo, then
add a link for every two email addresses that sharing
the same domain name.
Correlations within fax and phone dimension
may also be useful in link generation. However,
since there is no universe method to parse phone
number into country code, region code and
institution code, we prefer to be conservative and do
not perform this correlation.
3.1.4 Discovering Links via Correlation
across Dimensions
Correlations across dimensions occur between set
object and its member objects. For example, an as-
block object usually describes several consecutive
AS numbers, thus we can add a link between this as-
block object and every aut-num object whose AS
number is included. This theory also works for irt-
set object (a set of routers) and inet-irt object (a
router), rtr-set (a set of routes) and route object(a
route), and so on.
3.2 Calculating Tie Strength
We then derive a weighted graph of network
operators G
o
=<V
o
, E
o
> that V
o
consists of all the
PoC objects and E
o
is the link set. For any two
vertexes u, vV
o
, there is a link (u, v)E
o
if and
only if they are connected in MDN. Let w
(u, v)
denote
the strength of link (u, v) in G
o
, it is defined to be the
accumulated strength of the multiple paths between
u and v that may traverse through one or multiple
dimensions in MDN.
Moreover, the strength of a path is linearly
proportional to the strength of each single link on
that path for these links are in series.
Let S be a n-dimension matrix(n is the number of
dimensions in MDN that n=|D|), and each of its
element s
i,j
(1i, jn) be the strength of a link
between two vertices within the i
th
dimension and j
th
dimension respectively, the tie strength of link (u,v)
E should be
0)(-1)(
),(
1/
),(
,),(
vuPl
k
lji
ddvu
ji
skw
Where P(u,v) denotes all the paths between vertex
u and v, k is the length of path l, ɑ is a constant
between 0 and 1, and d
i
is the dimension where
vertex i is from.
3.3 Classifying Network Operators into
Clusters
To group network operators into clusters, we
adopted the algorithm presented in (Blondel and
Guillaume, 2008), which aims to maximum the
network’s modularity, defined as
ji
ji
ji
ji
cc
m
kk
A
m
Q
,
,
),(][
22
1
Where A
i,j
represents the weight of the link
between i and j, k
i
is the sum of the weights of the
links attached to vertex i, c
i
is the community to
which vertex i is assigned. The function (u,v) is 1
if u=v and 0 otherwise, and m is the sum of weights
of all the links in the network.
4 INTERNET ORGANIZATION
DISCOVERY
In this section, we first show the clustering results
A Structural Model of Internet Organization Discovery
347
with real data from two RIRs, and then validate the
reasonability and effectiveness of the obtained
results.
4.1 Datasets and Parameter Settings
The datasets used in this paper are collected from
RIPE and APNIC, both of which provide allocation
registry service and routing registry service with a
shared database.
Table 1: Datasets.
Data APNIC RIPE
22
n
d
, Jun 2008 AP-1 RP-1
21
st
, Dec 2010 AP-2 RP-2
25
th
, Feb 2014 RP-3
10
th
, Aug 2015 AP-3 RP-4
Our datasets include 7 snapshots of RIR data at 4
distinct time points, as shown in Table 1. For
validation purpose, we also collected 4 BGP (Border
Gateway Protocol) route tables at each time points
from RIPE-RIS project (collector rrc03 is chosen for
it has more peering ASes). While time diversity
enables us to observe the evolution of RIR data over
time, these datasets are not consistent with each
other for the change of RIR policy in data release. In
particular, the dataset here is largely different from
that used in (Cai,2010) in the following three aspects.
No org Attributes, including AP-1, AP-2 and
AP-3. While the org attribute provided a good
coverage on the AS objects (90% in percentage)
in (Cai and Heidemann, 2010), APNIC did not
use this attribute at all.
Partially Anonymous, including RP-1. PoC
objects are anonymized by replacing the
associated phone numbers, emails with
+31205354444, the number of RIPE NCC and
unread@ripe.net, respectively. However, the
admin-c and tech-c attributes of other objects are
still available.
Completely Anonymous, including RP-2, RP-3
and RP-3. Compared with RP-1, not only
telephone numbers and email addresses, but also
admin-c and tech-c attributes are removed.
As for the parameters used in the calculation of
tie strength between network operators, we set a to
be 1/2, and each element s
i,j
(1i, jn) of matrix S is
set to be 1 for simplicity. That is, no matter a link
goes across two dimensions or not, it contributes the
same amount of strength to the tie between network
operators.
4.2 Discovery Results
The IORG discovery results are shown in Table 2.
The first column denotes the number of vertexes in
the built MDN, while the second column denotes the
number of links (links generated via correlation
procedure are also included).
As for the datasets, we can observe a clear trend
that both the vertexes and links grow fast as time
goes on, while RIPE has a higher speed. However,
the number of IORGs in both APNIC and RIPE
grow much slower than vertexes and links.
Table 2: Discovery results of IORGs.
Dataset #of vertexes #of links #of IORGs
AP-1 838,143 7,459,473 56,236
AP-2 973,218 9,138,517 57,861
AP-3 1,051,342 10,496,599 57,964
RP-1 1,004,208 9,663,494 73,247
RP-2 3,824,610 38,219,328 75,336
RP-3 4,936,486 53,807,697 75,912
RP-4 5,534,401 70,021,241 76,238
Figure 5 depicts the Cdf (Cumulative distribution
function) of the size of IORGs (the number of RPSL
person objects in each IORG) from AP-2 and RP-2
dataset respectively.
As we can see, 20.0% of the IORGs in RIPE and
28.7% in APNIC contain only 1 person object, and
IORGs consisting of fewer than 10 person objects
account for 69.5% in RIPE and 81.3% in APNIC.
That is, most of the IORGs are very small in scale.
In fact, although there are more and more IORGs in
the datasets as time goes, as shown in Table 2, the
fraction of small size IORGs is becoming larger and
larger. We attribute this increase to the fact that
more and more users are required to register their
allocations in RIR databases.
Figure 5: Cdf of the size of IORGs.
10
0
10
1
10
2
0
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Number of PoC objects in an IORG
Cumulative Percentage
APNIC
RIPE
KDIR 2015 - 7th International Conference on Knowledge Discovery and Information Retrieval
348
4.3 Validation with IP-to-AS Mappings
The IP-to-AS mappings (prefix, originator AS) in
BGP routing reflect the up-to-date usage of Internet
number resources. In an ideal case, an organization
would advertise its IP prefixes with its own AS
numbers, producing mappings whose prefix and AS
number belong to the same organization. However,
since not all the organizations are qualified to apply
for AS numbers, an organization can also delegate
its prefixes to ISP providers for advertisement in
BGP. In this case, this organization and its providers
are supposed to be relevant and close to each other
in G
0
.
Two indicators are defined to quantify the
relativeness and closeness of the two organizations
(O
IP
and O
AS
) that own the IP prefix and AS number
respectively involved in a (prefix, originator AS)
mapping observed from BGP route tables.
Relativeness(γ): Assuming that most of current
usages are reasonable, a mapping is considered
connected if the corresponding O
IP
and O
AS
are
connected in G
o
. Relativeness γ is defined to be the
fraction of connected mappings, compared with the
mappings whose O
IP
and O
AS
can be pinpointed in
the organization graph.
Closeness(): This indicator is defined to be the
fraction of mappings whose O
IP
and O
AS
belong to
the same organization, compared with mappings that
are connected in G
o
.
The validation results are shown in Table 3. Our
conclusions are two-fold. First, the relativeness
indicator γ is really high. That is, for most of the IP-
to-AS mappings, the two organizations O
IP
and O
AS
have a close tie in between. This finding means that
our discovery results can be used to detect prefix
hijacking, routing leak or similar events, as Siganos
et al. had done in (Siganos and Faloutsos, 2007).
Second, the closeness indicator is actually the ratio
that O
IP
and O
AS
belong to the same organization.
This is the upper bound of accuracy that traditional
methods such as (Siganos and Faloutsos, 2007) can
reach if they do not perform clustering operation on
network operators.
Table 3: Validation results.
Dataset γ
AP-1 94.5% 78.3%
AP-2 95.2% 80.5%
AP-3 94.6% 76.9%
RP-1 98.8% 84.3%
RP-2 96.5% 85.7%
RP-3 96.1% 84.4%
RP-4 97.0% 82.1%
5 CONCLUSIONS
In this paper, we develop a systematic approach to
discover organizations in the Internet. Our basic idea
is to discover network operators that have close ties
among each other, and consider them as being from
the same organization. To be honest, the model is
still very coarse. However, the preliminary
discovery results can enable us to start looking into
the organization level Internet ecosystem.
In our future work, we would continue adjusting
our model and related approach, and then extend this
approach to ARIN, AFRNIC and LACNIC to obtain
an organization level picture of the global Internet.
REFERENCES
Siganos, G., Faloutsos, M., 2007. Neighborhood watch for
Internet Routing: Can we improve the robustness of
Internet Routing today?. In INFOCOM’07, 26th IEEE
Annual Conference on Computer Communications.
IEEE Press.
Cai, X., Heidemann, J., Krishnamurthy, B., Willinger, W.,
2010. Towards an AS-to-organization map. In ACM
IMC’10, 10th Annual Conference on Internet
Measurement. ACM Press.
Blondel, V. D., Guillaume, J.L., Lambiotte, R., Lefebvre,
E., 2008. Fast unfolding of communities in large
networks. Journal of Statistical Mechanics: Theory
and Experiment.
SWIP, Shared Whois Project.
http://www.arin.net/reference/database.html.
RFC2622, 1999. Routing Policy Specification Language
(RPSL). IETF rfc.
RIPE-RIS Project.
http://ris.ripe.net.
A Structural Model of Internet Organization Discovery
349