Comparative Evaluation of Kernel Bypass Mechanisms for
High-performance Inter-container Communications
Gabriele Ara 1 (https://orcid.org/0000-0001-5663-4713), Tommaso Cucinotta 1 (https://orcid.org/0000-0002-0362-0657), Luca Abeni 1 (https://orcid.org/0000-0002-7080-9601) and Carlo Vitucci 2
1 Scuola Superiore Sant'Anna, Pisa, Italy
2 Ericsson, Stockholm, Sweden
Keywords: Kernel Bypass, DPDK, NFV, Containers, Cloud Computing.
Abstract:
This work presents a framework for evaluating the performance of various virtual switching solutions, each
widely adopted on Linux to provide virtual network connectivity to containers in high-performance scenarios,
like in Network Function Virtualization (NFV). We present results from the use of this framework for the
quantitative comparison of the performance of software-based and hardware-accelerated virtual switches on a
real platform with respect to a number of key metrics, namely network throughput, latency and scalability.
1 INTRODUCTION
As an emerging technology and business paradigm,
cloud computing has seen a stable growth in the past
few years, becoming one of the most interesting ap-
proaches to high-performance computing. Thanks to
the high flexibility of these platforms, more applica-
tions get redesigned every day to follow distributed
computing models.
Recently, network operators started replacing tra-
ditional physical network infrastructures with more
flexible cloud-based systems, which can be dynami-
cally instantiated on demand to provide the required
level of service performance when needed. In this
context, the paradigm represented by Network Func-
tion Virtualization (NFV) aims to replace most of
the highly specialized hardware appliances that tradi-
tionally would be used to build a network infrastruc-
ture with software-based Virtualized Network Func-
tions (VNFs). These are equivalent implementations
of the same services provided in software, often en-
riched with the typical elasticity of cloud applications,
i.e., the ability to scale the service out and back in
as needed. This brings new flexibility in physical
resources management and allows the realization of
more dynamic networking infrastructures.
Given the nature of the services usually deployed
in NFV infrastructures, these systems must be charac-
terized by high performance both in terms of through-
put and latency among VNFs. Maintaining low
communication overheads among interacting soft-
ware components has become a critical issue for these
systems. Requirements of NFV applications are so
tight that the industry has already shifted its focus
from traditional Virtual Machines (VMs) to Operat-
ing System (OS) containers to deploy VNFs, given
the reduced overheads characterizing container solu-
tions like LXC or Docker, that exhibit basically the
same performance as bare-metal deployments (Felter
et al., 2015; Barik et al., 2016).
Given the superior performance provided by con-
tainers when used to encapsulate VNFs, primary re-
search focus is now on further reducing communi-
cation overheads by adopting user-space networking
techniques, in combination with OS containers, to by-
pass the kernel when exchanging data among contain-
ers on the same or multiple hosts. These techniques
are generally indicated as kernel bypass mechanisms.
1.1 Contributions
In this work, we propose a novel benchmarking
framework that can be used to compare the perfor-
mance of various networking solutions that leverage
kernel bypass to exchange packets among VNFs de-
ployed on a private cloud infrastructure using OS con-
tainers. Among the available solutions, this work fo-
cuses on the comparison of virtual switches based on
the Data Plane Development Kit (DPDK) framework,
which occupies a prominent position in the industry and
can be used to efficiently exchange packets locally on
a single machine or among multiple hosts accessing
Network Interface Controller (NIC) adapters directly
from the user-space bypassing the OS kernel. We
present results from a number of experiments compar-
ing the performance of these virtual switches (either
software-based or hardware-accelerated) when sub-
ject to synthetic workloads that resemble the behavior
of real interacting VNF components.
2 BACKGROUND
There are a number of different options when multi-
ple applications, each encapsulated in its own OS con-
tainer or other virtualized environment, need to com-
municate through network primitives. Usually, they
involve some form of network virtualization to provide
each application with a set of gateways to exchange
data over a virtual network infrastructure.
For the purposes of this work, we will fo-
cus on the following: (i) kernel-based networking,
(ii) software-based user-space networking, (iii) hard-
ware-accelerated user-space networking.
In the following, we briefly summarize the main
characteristics of each of these techniques when
adopted in NFV scenarios to interconnect OS con-
tainers within a private cloud infrastructure. Given
the demanding requirements of NFV applications in
terms of performance, both with respect to throughput
and latency, we will focus on the performance that can
be attained when adopting each solution on general-
purpose computing machines running a Linux OS.
2.1 Containers Networking through the
Linux Kernel
A first way to interconnect OS containers is to place
virtual Ethernet ports as gateways in each container
and to connect them all to a virtual switch embedded
within the Linux kernel, called "linux-bridge".
Through this switch, VNFs can communicate on the
same host with other containerized VNFs or with
other hosts via forwarding through actual Ethernet
ports present on the machine, as shown in Figure 1a.
The virtual ports assigned to each container are
implemented in the Linux kernel as well and they em-
ulate the behavior of actual Ethernet ports. They can
be accessed via blocking or nonblocking system calls,
for example using the traditional POSIX Socket API,
exchanging packets via send() and recv() (or their
more general forms sendmsg() and recvmsg()); as
a result, at least two system calls are required to ex-
change each UDP datagram over the virtual network,
therefore networking overheads grow proportionally
with the number of packets exchanged.
These overheads can be partially amortized by resorting
to batch APIs that exchange multiple packets with a
single system call (either sendmmsg() or recvmmsg()),
grouping packets in bursts and spreading the cost of
each system call over the packets in each burst.
However, packets still need to traverse the
whole in-kernel networking stack, going through ad-
ditional data copies, to be sent or received.
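As an illustration, the following sketch (a simplified example with hypothetical burst and payload sizes, not code taken from the framework described later) shows how a whole burst of UDP datagrams can be handed to the kernel with a single sendmmsg() call; a symmetric loop based on recvmmsg() applies on the receiving side.

    #define _GNU_SOURCE           /* sendmmsg() is a GNU/Linux extension */
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    #define BURST   32            /* packets per system call (illustrative) */
    #define PAYLOAD 64            /* UDP payload size in bytes (illustrative) */

    /* Send one burst of UDP datagrams with a single system call; returns the
     * number of datagrams accepted by the kernel, or -1 on error. */
    static int send_burst(int sock, const struct sockaddr_in *dst)
    {
        static char payload[BURST][PAYLOAD];
        struct iovec iov[BURST];
        struct mmsghdr msgs[BURST];

        memset(msgs, 0, sizeof(msgs));
        for (int i = 0; i < BURST; i++) {
            iov[i].iov_base = payload[i];
            iov[i].iov_len = PAYLOAD;
            msgs[i].msg_hdr.msg_iov = &iov[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
            msgs[i].msg_hdr.msg_name = (void *)dst;
            msgs[i].msg_hdr.msg_namelen = sizeof(*dst);
        }
        /* One syscall for the whole burst instead of BURST send() calls. */
        return sendmmsg(sock, msgs, BURST, 0);
    }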
This last problem can be tackled by partially bypassing
the kernel networking stack, using raw sockets
instead of regular UDP sockets and implementing
data encapsulation in user-space. This is often done
in combination with zero-copy APIs and memory-
mapped I/O to transfer data quickly between the ap-
plication and the virtual Ethernet port, greatly reduc-
ing the time needed to send a packet (Rizzo, 2012b).
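The sketch below gives a rough idea of this approach (again illustrative and simplified: IP/UDP header construction, error handling and the memory-mapped PACKET_MMAP transmission path are omitted): an AF_PACKET raw socket is opened and an Ethernet frame is assembled entirely in user-space before being sent out of a given interface.

    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <net/ethernet.h>

    /* Send one raw Ethernet frame out of the interface identified by ifindex
     * (requires CAP_NET_RAW). The Ethernet header is built in user-space;
     * IP and UDP headers would be assembled the same way, right after it. */
    static int send_raw_frame(int ifindex,
                              const unsigned char dst[ETH_ALEN],
                              const unsigned char src[ETH_ALEN],
                              const char *payload, size_t len)
    {
        if (len > ETH_DATA_LEN)
            return -1;
        int sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (sock < 0)
            return -1;

        unsigned char frame[ETH_FRAME_LEN];
        struct ether_header *eh = (struct ether_header *)frame;
        memcpy(eh->ether_dhost, dst, ETH_ALEN);
        memcpy(eh->ether_shost, src, ETH_ALEN);
        eh->ether_type = htons(ETHERTYPE_IP);      /* IPv4 payload */
        memcpy(frame + sizeof(*eh), payload, len);

        struct sockaddr_ll addr = {
            .sll_family  = AF_PACKET,
            .sll_ifindex = ifindex,
            .sll_halen   = ETH_ALEN,
        };
        memcpy(addr.sll_addr, dst, ETH_ALEN);

        ssize_t n = sendto(sock, frame, sizeof(*eh) + len, 0,
                           (struct sockaddr *)&addr, sizeof(addr));
        close(sock);
        return n < 0 ? -1 : 0;
    }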
Despite the various techniques available to reduce
the impact of the kernel on network performance,
their use is often not sufficient to match the demand-
ing requirements of typical NFV applications (Ara
et al., 2019). In typical scenarios, the only solution
to these problems is to resort to techniques that
completely bypass the kernel when exchanging packets,
both locally and between multiple hosts.
2.2 Inter-container Communications
with Kernel Bypass
Significant performance improvements for inter-
container networking can be achieved by avoiding
system calls, context switches and unneeded data
copies as much as possible. Various I/O frameworks
take such an approach, resorting to kernel bypass
techniques to exchange batches of raw Ethernet frames
among applications without a single system call.
For example, frameworks based on
virtualized network interfaces that rely on shared
memory to achieve high-performance communication
among containers located on the same host, a situa-
tion which is fairly common in NFV scenarios. While
virtio network devices are typically implemented within
hypervisors (e.g., QEMU/KVM), a complete user-space
implementation of the virtio specification, called
vhost-user, has recently become available and can be
used by OS containers to
completely bypass the kernel. However, virtio cannot
be used to access the physical network without any
user-space implementation of a virtual switch. This
means that it cannot be used alone to achieve both dy-
namic and flexible communications among indepen-
dently deployed containers.
Figure 1: Different approaches to inter-container networking. (a) Kernel-based solution; (b) using DPDK with vhost-user; (c) using SR-IOV support.
2.2.1 Data Plane Development Kit (DPDK)
Data Plane Development Kit (DPDK, https://www.dpdk.org/) is an open
source framework for fast packet processing imple-
mented entirely in user-space and characterized by a
high portability across multiple platforms. Initially
developed by Intel, it now provides a high-level pro-
gramming abstraction that applications can use to
gain access to low-level resources in user-space with-
out depending on specific hardware devices.
In addition, DPDK supports virtio-based network-
ing via its implementation of vhost-user interfaces.
Applications can hence exchange data efficiently us-
ing DPDK to communicate locally using vhost-user
ports and remotely via DPDK user-space implemen-
tations of actual Ethernet device drivers: the frame-
work will simply initialize the specified ports accord-
ingly in complete transparency from the application
point of view. For this reason, DPDK has become ex-
tremely popular over the past few years when it is nec-
essary to implement fast data plane packet processing.
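To give a concrete idea of the programming model, the following minimal sketch (not one of DPDK's own samples; port and queue configuration via rte_eth_dev_configure(), the queue setup calls and rte_eth_dev_start() is assumed to have been performed already) polls port 0 for bursts of packets and echoes them back. Whether that port is a vhost-user/virtio-based virtual device or a physical NIC driven in user-space is decided only by the EAL arguments passed at start-up, not by the application code.

    #include <stdint.h>
    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    int main(int argc, char **argv)
    {
        /* The EAL arguments select the actual ports (e.g., a virtio-user
         * vdev backed by a vhost-user socket, or a PCI NIC bound to a
         * user-space driver). */
        if (rte_eal_init(argc, argv) < 0)
            return -1;

        const uint16_t port = 0;
        /* Port/queue configuration and rte_eth_dev_start() assumed here. */

        struct rte_mbuf *bufs[BURST_SIZE];
        for (;;) {
            /* Poll the RX queue: no system call, no interrupt. */
            uint16_t nb_rx = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
            if (nb_rx == 0)
                continue;

            /* Echo the whole burst back out of the same port. */
            uint16_t nb_tx = rte_eth_tx_burst(port, 0, bufs, nb_rx);

            /* Free any mbuf that the TX queue did not accept. */
            for (uint16_t i = nb_tx; i < nb_rx; i++)
                rte_pktmbuf_free(bufs[i]);
        }
    }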
2.3 High-performance Switching among
Containers
Having described virtio in Section 2.2, we can now
explain how virtio, in combination with other tools,
can be used to realize a virtual networking
infrastructure in user-space. There are essentially
two ways to achieve this goal: 1) by assigning each
container a virtio port, using vhost-user to bypass the
kernel, and then connecting each port to a virtual switch
application running on the host (Figure 1b); 2) by
leveraging special capabilities of certain NIC devices
that allow concurrent access from multiple applica-
tions and that can be accessed in user-space by us-
ing DPDK drivers (Figure 1c). The virtual switch in-
stance, either software or hardware, is then connected
to the physical network via the actual NIC interface
present on the host.
Many software implementations of L2/L3
switches are based on DPDK or other kernel bypass
frameworks, each implementing their own packet
processing logic responsible for packet forward-
ing. For this reason, performance may differ greatly
from one implementation to another. In addition, these switches
consume a non-negligible amount of processing
power on the host they are deployed onto, to achieve
very high network performance. On the other hand,
special NIC devices that support the Single-Root I/O
Virtualization (SR-IOV) specification allow traffic
offloading to the hardware switch implementation
that they embed, which applications can access
concurrently without interfering with each other.
Below, we briefly describe the most common soft-
ware virtual switches in the NFV industrial practice,
and the characteristics of SR-IOV compliant devices.
DPDK Basic Forwarding Sample Application (https://doc.dpdk.org/guides/sample_app_ug/skeleton.html) is a
sample application provided by DPDK that can be
used to connect DPDK-compatible ports, either vir-
tual or physical, in pairs: this means that each appli-
cation using a given port can only exchange packets
with a corresponding port chosen during system ini-
tialization. For this reason, this software does not per-
form any packet processing operation, hence it cannot
be used in real use-case scenarios.
Open vSwitch (OVS, https://www.openvswitch.org) is an open source virtual
switch for general-purpose usage with enhanced flex-
ibility thanks to its compatibility with the OpenFlow
protocol (Pfaff et al., 2015). Recently, Open vSwitch
(OVS) has been updated to support DPDK and virtio-based
ports, which considerably accelerated packet
forwarding operations by performing them in user-space
rather than within a kernel module (Intel, 2015).
FD.io Vector Packet Processing (VPP, https://fd.io/) is an ex-
tensible framework for virtual switching released
by the Linux Foundation Fast Data Project (FD.io).
Since it is developed on top of DPDK, it can run on
various architectures and it can be deployed in VMs,
containers or bare-metal environments. It uses Cisco's
Vector Packet Processing (VPP) technology, which processes
packets in batches, improving the performance thanks to
the better exploitation of instruction and data cache
locality (Barach et al., 2018).
Snabb (https://github.com/snabbco/snabb) is a packet processing framework that can
be used to provide networking functionality in user-
space. It allows for programming arbitrary packet
processing flows (Paolino et al., 2015) by connecting
functional blocks in a Directed Acyclic Graph (DAG).
While not based on DPDK, it provides its own user-space
implementations of virtio and of some NIC drivers,
which can be included in the DAG.
Single-Root I/O Virtualization (SR-IOV) (Dong
et al., 2012) is a specification that allows a single
NIC device to appear as multiple PCIe devices, called
Virtual Functions (VFs), that can be independently
assigned to VMs or containers and move data
through dedicated buffers within the device. VMs
and containers can directly access dedicated VFs
and leverage the L2 hardware switch embedded in
the NIC for either local or remote communications
(Figure 1c). Using DPDK APIs, applications within
containers can access the dedicated VFs bypassing
the Linux kernel, removing the need for any software
switch running on the host; however, a DPDK
daemon is needed on the host to manage the VFs.
3 RELATED WORK
The proliferation of different technologies to ex-
change packets among containers has created the need
for new tools to evaluate the performance of virtual
switching solutions with respect to throughput, la-
tency and scalability. Various works in the research
literature addressed the problem of network perfor-
mance optimization for VMs and containers.
A recent work compared various kernel bypass
frameworks like DPDK and Remote Direct Memory
Access (RDMA, http://www.rdmaconsortium.org)
against traditional sockets, fo-
cusing on the round-trip latency measured between
two directly connected hosts (Géhberger et al., 2018).
This work showed that both DPDK and RDMA out-
perform POSIX UDP sockets, as only with kernel by-
pass mechanisms it is possible to achieve single-digit
microsecond latency, with the only drawback that ap-
plications must continuously poll the physical devices
for incoming packets, leading to high CPU utilization.
Another work (Lettieri et al., 2017) compared
both qualitatively and quantitatively common high-
performance networking setups. It measured CPU
utilization and throughput achievable using
SR-IOV, Snabb, OVS, or Netmap (Rizzo, 2012b),
which is another networking framework for high-
performance I/O. Their focus was on VM to VM com-
munications, either on a single host or between mul-
tiple hosts, and they concluded that in their setups
Netmap is capable of reaching up to 27 Mpps (when
running on a 4 GHz CPU), outperforming SR-IOV
due to the limited hardware switch bandwidth.
Another similar comparison of various kernel by-
pass technologies (Gallenmüller et al., 2015) eval-
uates throughput performance of three networking
frameworks: DPDK, Netmap, and PF_RING
(https://www.ntop.org/products/packet-capture/pf_ring/). They
concluded that for each of these solutions there are
two major bottlenecks that can potentially limit net-
working performance: CPU capacity and NIC maxi-
mum transfer rate. Among them, the latter is the dom-
inating bottleneck when per-packet processing cost is
kept low, while the former has a bigger impact as
soon as this cost increases and the CPU becomes fully
loaded. Their evaluations also showed that DPDK
is able to achieve the highest throughput in terms of
packets per second, independently of the burst size;
on the contrary, Netmap reaches its highest throughput
only for burst sizes greater than 128 packets per
burst, and even then it cannot match the performance
of DPDK or PF_RING.
The scalability of various virtual switching solu-
tions with respect to the number of VMs deployed
on the same host has been addressed in another re-
cent work (Pitaev et al., 2018), in which VPP and
OVS have been compared against SR-IOV. From their
evaluations, they concluded that SR-IOV can sustain
a greater number of VMs with respect to software-
based solutions, since the total throughput of the sys-
tem scales almost linearly with the number of VMs.
On the contrary, both OVS and VPP are only able
to scale the total throughput up to a certain plateau,
which is influenced by the amount of allocated CPU
resources: the more VMs are added, the more the system
performance degrades if no additional resources are
allocated to the software virtual switch.
Finally, a preliminary work (Ara et al., 2019) was
presented by the authors of this paper, performing a
basic comparison of various virtual switching techni-
ques for inter-container communications.

Figure 2: Main elements of the proposed framework (benchmarking applications stack; software tools for system setup, test setup, deployment and post-processing; supported fast networking frameworks: OVS, VPP, Snabb, DPDK Basic Forwarding Application).

That evaluation however was limited to only a single transmis-
sion flow between a sender and a receiver applica-
tion deployed on the same machine, for which SR-
IOV was the most suitable among the tested solutions.
This work extends that preliminary study to a much
broader number of test cases and working conditions,
on either one or multiple machines, and presents a
new set of tools that can be used to conveniently re-
peat the experiments in other scenarios.
Compared to the above mentioned works, in this
paper we present for the first time an open-source
framework that can be used to evaluate the perfor-
mance of various widely adopted switching solutions
among Linux containers in the common NFV indus-
trial practice. This framework can be used to carry out
experiments from multiple points of view, depending
on the investigation focus, while varying testing pa-
rameters and the system configuration. The proposed
framework eases the task of evaluating the perfor-
mance of a system under some desired working con-
ditions. With this framework, we evaluated system
performance in a variety of application scenarios, by
deploying sender/receiver and client/server applica-
tions on one or multiple hosts. This way we were able
to draw some general conclusions about the charac-
teristics of multiple networking solutions when vary-
ing the evaluation point of view, either throughput,
latency or scalability of the system. Being open-source,
the framework can be conveniently extended
by researchers or practitioners, should they need to
write further customized testing applications.
4 PROPOSED FRAMEWORK
In this section we present the framework we realized
to evaluate and compare the performance of different
virtual networking solutions, focusing on Linux con-
tainers. The framework we developed can be easily
installed and configured on one or multiple general-
purpose servers to instantiate a number of OS contain-
ers and deploy in each of them a selected application
that generates or consumes synthetic network traf-
fic. These applications, also developed for this frame-
work, serve the dual purpose of generating/consuming
network traffic and collecting statistics to evaluate
system performance in the given configuration.
In particular, this framework can be used to con-
figure a number of tests, each running for a specific
amount of time with a given system configuration;
after each test is done, the framework collects per-
formance results from each running application and
provides the desired statistics to the user. In addi-
tion, multiple tests can be performed consecutively by
specifying multiple values for any test parameter.
The software is freely available on GitHub, under
a GPLv3 license, at https://github.com/gabrieleara/nfv-testperf.
The architecture of the framework is de-
picted in Figure 2, which highlights the various tools
it includes. First there are a number of software tools
and Bash scripts that are used to install system de-
pendencies, configure and customize installation, run
performance tests and collect their results. Among
the dependencies installed by the framework there are
the DPDK framework (which includes also the Ba-
sic Forwarding Sample Application), and the other
software-based virtual switches that will be used dur-
ing testing to connect applications in LXC contain-
ers: OVS (compiled with DPDK support), VPP, and
Snabb (configured as a simple learning switch). Each
virtual switch is configured to act as a simple learning
L2 switch, with the only exception represented by the
DPDK Basic Forwarding Sample Application, which
does not have this functionality. In addition, VPP and
OVS can be connected to physical Ethernet ports to
perform tests for inter-machine communications.
Finally, a set of benchmarking applications is included
in the framework, serving the dual purpose of
generating/consuming network traffic and collecting
statistics that can be used to evaluate the actual system
performance in the provided configuration. These
applications can measure the performance from various viewpoints:
Throughput. As many VNFs deal with huge
amounts of packets per second, it is important to eval-
uate the maximum performance of each networking
solution with respect to either the number of pack-
ets or the amount of data per second that it is able to
effectively process and deliver to destination. To do
this, the sender/receiver application pair is provided
to generate/consume unidirectional traffic (from the
sender to the receiver) according to the given param-
eters.
Latency. Many VNFs operate with networking pro-
tocols designed for hardware implementations of cer-
tain network functions; for this reason, these proto-
cols expect very low round-trip latency between interacting
components, in the order of single-digit
microseconds. In addition, in NFV infrastruc-
tures it is crucial to keep the latency of individual
interactions as low as possible to reduce the end-to-end
latency between components across long service
chains. For this purpose, the client/server application
pair is used to generate bidirectional traffic to eval-
uate the average round-trip latency for each packet
when multiple packets are sent and received in bursts
through a certain virtual switch. To do so, the server
application will send back each packet it receives to
its corresponding client.
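As a rough illustration of this client/server exchange, the sketch below uses plain blocking POSIX calls for brevity (whereas the actual benchmarking applications poll their DPDK or socket back-ends): the server echoes every datagram back to its sender, and the client derives the mean round-trip time of a burst.

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <time.h>

    /* Server side: echo every received datagram back to its sender. */
    static void echo_loop(int sock)
    {
        char buf[1500];
        struct sockaddr_in peer;
        socklen_t plen = sizeof(peer);
        for (;;) {
            ssize_t n = recvfrom(sock, buf, sizeof(buf), 0,
                                 (struct sockaddr *)&peer, &plen);
            if (n > 0)
                sendto(sock, buf, (size_t)n, 0,
                       (struct sockaddr *)&peer, plen);
        }
    }

    /* Client side: mean round-trip time of one burst, in nanoseconds,
     * measured on a socket already connect()ed to the server. */
    static double burst_rtt_ns(int sock, const char *pkt, size_t len, int burst)
    {
        char buf[1500];
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < burst; i++)
            send(sock, pkt, len, 0);            /* send the whole burst */
        for (int i = 0; i < burst; i++)
            recv(sock, buf, sizeof(buf), 0);    /* wait for all the echoes */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        return ((t1.tv_sec - t0.tv_sec) * 1e9 +
                (t1.tv_nsec - t0.tv_nsec)) / burst;
    }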
Scalability. Evaluations from this point of view are
orthogonal to the two previous dimensions,
in particular to throughput: since
full utilization of a system is achieved only when
multiple VNFs are deployed on each host within the
infrastructure, it is extremely important
to evaluate how this affects performance in the men-
tioned dimensions. For this purpose there are no ded-
icated applications: multiple instances of each des-
ignated application can be deployed concurrently to
evaluate how that affects global system performance.
The benchmarking applications are implemented
in C and they are built over a custom API that masks
the differences between POSIX and DPDK; this way,
they can be used to evaluate system performance us-
ing linux-bridge or other virtual switches that bypass
the kernel to exchange data. When POSIX APIs are
used to exchange packets, raw sockets can also be
used instead of regular UDP sockets to partially bypass
the Linux networking stack, building Ether-
net, IP and UDP packet headers in user-space. The
traffic generated/consumed by these applications can
be configured varying a set of parameters that include
the sending rate, packet size, burst size, etc. To maxi-
mize application performance, packets are always ex-
changed using polling and measurements of elapsed
time are performed by checking the TSC register in-
stead of less precise timers provided by Linux APIs.
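A possible shape for this abstraction layer is sketched below; the names (pkt_ops, tsc_now, rx_loop) are hypothetical and do not correspond to the framework's actual API, but they illustrate the idea: each back-end (UDP sockets, raw sockets, DPDK ports) fills a small table of burst-oriented function pointers, and elapsed time is measured by reading the TSC directly.

    #include <stdint.h>
    #include <stddef.h>
    #include <x86intrin.h>   /* __rdtsc() */

    /* Back-end-independent burst I/O interface: each implementation
     * (POSIX UDP sockets, raw sockets, DPDK ports) provides its own table. */
    struct pkt_ops {
        int (*send_burst)(void *ctx, void *const pkts[], size_t n);
        int (*recv_burst)(void *ctx, void *pkts[], size_t n);
    };

    /* Read the Time Stamp Counter: finer-grained and cheaper than
     * clock_gettime(), at the cost of being expressed in CPU cycles. */
    static inline uint64_t tsc_now(void)
    {
        return __rdtsc();
    }

    /* Busy-polling receive loop shared by all back-ends: it counts packets
     * and elapsed TSC cycles, which the post-processing tools can later
     * convert into packet and bit rates. */
    static void rx_loop(const struct pkt_ops *ops, void *ctx,
                        uint64_t target_pkts,
                        uint64_t *pkts_out, uint64_t *cycles_out)
    {
        void *burst[32];
        uint64_t pkts = 0;
        uint64_t start = tsc_now();

        while (pkts < target_pkts) {
            int n = ops->recv_burst(ctx, burst, 32);
            if (n > 0)                 /* otherwise poll again immediately */
                pkts += (uint64_t)n;
        }
        *pkts_out = pkts;
        *cycles_out = tsc_now() - start;
    }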
During each test, each application is deployed
within a LXC container on one or multiple hosts and
it is automatically connected to the other designated
application in the pair. The Linux distribution that
is used to realize each container is based on a simple
rootfs built from a basic BusyBox image and it con-
tains only the necessary resources to run the bench-
marking applications. The framework then takes care
of all the setup necessary to interconnect these ap-
plications with the desired networking technology.
The latter may be linux-bridge, one of the
software-based virtual switches (using virtio and vhost-user
ports), or an SR-IOV Ethernet adapter; each sce-
nario is depicted in Figure 1. In any case, deployed
applications use polling to exchange network traffic
over the selected ports. For tests involving multiple
hosts, only OVS or VPP can be used among software-
based virtual switches to interconnect the benchmark-
ing applications; otherwise, it is possible to assign to
each container a dedicated VF and leverage the em-
bedded hardware switch in the SR-IOV network card
to forward traffic from one host to another.
The proposed framework can be easily extended
to include more virtual switching solutions among
containerized applications or to develop different test-
ing applications that generate/consume other types
of synthetic workload. From this perspective, the
inclusion of other virtio-based virtual switches is
straightforward, and it does not require any modifi-
cation of the existing test applications. In contrast,
other networking frameworks that rely on custom port
types/programming paradigms (e.g., Netmap) may re-
quire the extension of the API abstraction layer or
the inclusion of other custom test applications. Fur-
ther details about the framework’s extensibility can be
found at https://github.com/gabrieleara/nfv-testperf/wiki/Extending.
5 EXPERIMENTAL RESULTS
To test the functionality of the proposed framework,
we performed a set of experiments in a synthetic use-
case scenario, comparing various virtual switches.
Experiments were performed on two similar hosts:
the first has been used for all local inter-container
communications tests, while both hosts have been
used for multi-host communication tests (using con-
tainers as well). The two hosts are two Dell PowerEdge
R630 V4 servers, each equipped with two
Intel® Xeon® E5-2640 v4 CPUs at 2.40 GHz, 64 GB
of RAM, and an Intel® X710 DA2 Ethernet Controller
for 10 GbE SFP+ (used in SR-IOV experiments and
multi-host scenarios). The two Ethernet controllers
have been connected directly with a 10 Gigabit Eth-
ernet cable. Both hosts are configured with Ubuntu
18.04.3 LTS, Linux kernel version 4.15.0-54, DPDK
version 19.05, OVS version 2.11.1, Snabb version
2019.01, and VPP version 19.08. To maximize the
reproducibility of the results, the framework carries out each
test with CPU frequency scaling disabled (governor set
to performance and Turbo Boost disabled) and it has
been configured not to use hyperthreads to deploy
the testing applications.
Table 1: List of parameters used to run performance tests with the framework.

    Parameter       Symbol  Values
    Test Dimension  D       throughput or latency
    Hosts Used      L       local or remote (i.e., single or multi-host)
    Containers Set  S       NvsN (where N is the number of pairs)
    Virtual Switch  V       linux-bridge, basicfwd, ovs, snabb, sriov, vpp
    Packet Size     P       expressed in bytes
    Sending Rate    R       expressed in pps
    Burst Size      B       expressed in number of packets per burst
5.1 Testing Parameters
The framework can be configured to vary certain pa-
rameters before running each test. Each configura-
tion is represented by a set of applications (each run-
ning within a container), deployed either on one or
both hosts and grouped in pairs (i.e. sender/receiver or
client/server) and a set of parameters that specify the
synthetic traffic to be generated. Given the number of
evaluated test cases, for each different perspective we
will show only relevant results.
To identify each test, we use the notation shown in
Table 1: each test is uniquely identified by a tuple
of the following form:

(D, L, S, V, P, R, B)

When referring to multiple tests, the notation omits
those parameters that are free to vary within a
predefined set of values. For example, to show some
tests performed while varying the sending rate, the R
parameter will not be included in the tuple.
Each test uses only a fixed set of parameters and
runs for 1 minute; once the test is finished, an aver-
age value of the desired metric (either throughput or
latency) is obtained after discarding values related to
initial warm-up and shutdown phases, thus consider-
ing only values related to steady state conditions of
the system. The scenarios that we considered in our
experiments are summarized in Figure 3.
5.2 Kernel-based Networking
First, we show that the performance achieved without
kernel bypass techniques is much worse
than that achieved using techniques like vhost-user
in similar set-ups. Table 2 reports the maximum
throughput achieved using socket APIs and linux-
bridge to interconnect a pair of sender and receiver
applications on a single host (Figure 3a). In this sce-
nario, it is not possible to reach even 1 Mpps using
kernel-based techniques, while using DPDK any virtual
switch can easily support at least 2 Mpps in similar
set-ups. That is why all the results that will follow
will consider kernel bypass technologies only.

Table 2: Maximum throughput achieved for various socket-based solutions: (D = throughput, L = local, S = 1vs1, V = linux-bridge, P = 64, R = 1M, B = 64).

    Technique                                Max Throughput (kpps)
    UDP sockets using send/recv              338
    UDP sockets using sendmmsg/recvmmsg      409
    Raw sockets using send/recv              360
    Raw sockets using sendmmsg/recvmmsg      440
5.3 Throughput Evaluations
Moving on to techniques that use DPDK, we first
evaluated throughput performance between two ap-
plications in a single pair, deployed either on the same
host or on multiple hosts, varying the desired sending
rate, packet and burst sizes. In all our experiments,
we noticed that varying the burst size from 32 to
256 packets per burst did not affect throughput per-
formance, thus we will always refer to the case of
32 packets per burst in further reasoning.
5.3.1 Single Host
We first deployed a single pair of sender/receiver ap-
plications on a single host (Figure 3a), varying the
sending rate from 1 to 20 Mpps and the packet size
from 64 to 1500 bytes:
(D = throughput, L = local, S = 1vs1)
In all our tests, each virtual switch achieves the desired
throughput up to a certain maximum value, after
which its capabilities saturate,
as shown in Figure 4a. This maximum throughput de-
pends on the size of the packets exchanged through
the virtual switch, thus from now on we will consider
only the maximum achievable throughput for each
virtual switch while varying the size of each packet.
Figures 4b and 4c show the maximum receiving
rates achieved in all our tests that employed only two
containers on a single host, expressed in Mpps and
in Gbps respectively. In each plot, the achieved rate
(Y axis) is expressed as a function of the size of each
packet (X axis), for which a logarithmic scale has
been used.
Figure 3: Different testing scenarios used for our evaluations. (a) Single pair, local throughput tests; (b) multiple pairs, local throughput tests; (c) local latency tests; (d) single pair, remote throughput tests; (e) multiple pairs, remote throughput tests; (f) remote latency tests. Panels (a), (b), and (c) refer to single-host scenarios, while (d), (e), and (f) refer to scenarios that consider multiple hosts.

From both figures it is clear that the maximum
performance is attained by offloading network traffic
to the SR-IOV device, exploiting its embedded hard-
ware switch. The Basic Forwarding Sample Application
ranked second, which was expected since it
does not implement any actual switching logic. The
very small performance gap between VPP and the
latter solution is also a clear indicator that the batch
packet processing features included in VPP can dis-
tribute packet processing overhead effectively among
incoming packets. Finally, OVS and Snabb follow.
Comparing these results with other evaluations
in the literature (Ara et al., 2019), we were able to
conclude that the major bottleneck for Snabb is repre-
sented by its internal L2 switching component.
In addition, notice that while the maximum
throughput in terms of Mpps is achieved with the
smallest of the packet sizes (64 bytes), the maximum
throughput in terms of Gbps is actually achieved for
the biggest packet size (1500 bytes). In fact, through-
put in terms of Gbps is characterized by a logarithmic
growth related to the increase of the packet size, as
shown in Figure 4c.
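For reference, the two metrics are tied by the packet size: neglecting Ethernet framing overhead, the bit rate follows from the packet rate as T[Gbps] ≈ T[Mpps] × P × 8 / 1000, where P is the packet size in bytes (e.g., a hypothetical 1.5 Mpps at P = 1500 bytes corresponds to 18 Gbps, while the same packet rate at P = 64 bytes would yield less than 1 Gbps).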
From these plots we derived that for software
switches the efficiency of each implementation im-
pacts performance only for smaller packet sizes,
while for bigger packets the major limitation becomes
the ability of the software to actually move pack-
ets from one CPU core to another, which is equiva-
lent for any software implementation. Given also the
slightly superior performance achieved by SR-IOV,
especially for smaller packet sizes, we also concluded
that its hardware switch is more efficient at moving
large numbers of packets between CPU cores than the
software implementations that we tested.
5.3.2 Multiple Hosts
We repeated these evaluations deploying the receiver
application on a separate host (Figure 3d), using the
only virtual switches able to forward traffic between
multiple hosts (the Basic Forwarding Sample Application
does not implement any switching logic, while Snabb was
not compatible with our selected SR-IOV Ethernet controller):

(D = throughput, L = remote, S = 1vs1, V ∈ {ovs, sriov, vpp})
Figure 4d shows the maximum receiving rates
achieved for a burst size of 32 packets. In this sce-
nario, results depend on the size of the exchanged
packets: for smaller packet sizes the dominating bot-
tleneck is still represented by the CPU for OVS and
VPP, while for bigger packets the Ethernet standard
limits the total throughput achievable by any virtual
switch to only 10 Gbps. From these results we con-
cluded that when the expected traffic is characterized
by relatively small packet sizes (up to 256 bytes) de-
ploying a component on a directly connected host
does not negatively impact system performance when
OVS or VPP are used. In addition, we noticed that
in this scenario there is no clear best virtual switch
with respect to the others: while for smaller packet
sizes SR-IOV is more efficient, both OVS and VPP
perform better for bigger ones.
5.4 Throughput Scalability
To test how throughput performance scales when the
number of applications on the system increases, we
repeated all our evaluations deploying multiple application
pairs on the same host (Figure 3b):

(D = throughput, L = local, S ∈ {1vs1, 2vs2, 4vs4})
Figures 4e and 4f show the relationship between
the maximum total throughput of the system achiev-
able and the number of pairs deployed simultane-
ously, for packets of 64 and 1500 bytes respectively.
Figure 4: Throughput performance obtained varying system configuration and virtual switch used to connect sender/receiver applications deployed in LXC containers (compared solutions: basicfwd, ovs, snabb, sriov, vpp). Panels: (a) actual receiving rate vs. tentative sending rate (D = throughput, L = local, S = 1vs1, B = 32, P = 64); (b) maximum throughput [Mpps] vs. packet size (D = throughput, L = local, S = 1vs1, B = 32); (c) maximum throughput [Gbps] vs. packet size (D = throughput, L = local, S = 1vs1, B = 32); (d) maximum throughput [Gbps] vs. packet size (D = throughput, L = remote, S = 1vs1, B = 32); (e) maximum total throughput [Gbps] vs. scenario S (D = throughput, L = local, P = 64, B = 32); (f) maximum total throughput [Gbps] vs. scenario S (D = throughput, L = local, P = 1500, B = 32); (g) maximum total throughput [Gbps] vs. scenario S (D = throughput, L = remote, P = 64, B = 32); (h) maximum total throughput [Gbps] vs. scenario S (D = throughput, L = remote, P = 1500, B = 32).
While the maximum total throughput does not change
for relatively small packet sizes when increasing the
number of senders, for bigger packet sizes SR-IOV
can sustain 4 senders with only a per-pair perfor-
mance drop of about 18%, achieving almost linear
performance improvements with the increase of the
number of participants. On the contrary, virtio-based
switches can only distribute the same amount of re-
sources over a bigger number of network flows. From
these results we concluded that while for SR-IOV the
major limitation is mostly represented by the number
of packets exchanged through the network, for virtio-
based switches the major bottleneck is represented
by the capability of the CPU to move data from one
application to another, which depends on the overall
amount of bytes exchanged.
Figure 5: Average round-trip latency obtained varying system configuration and virtual switch used to connect client/server applications deployed in LXC containers (compared solutions: basicfwd, ovs, sriov, vpp). All panels plot the mean round-trip latency [µs] against the packet size (64 to 1500 bytes) for D = latency, S = 1vs1, R = 1000: (a) L = local, B = 4; (b) L = local, B = 32; (c) L = local, B = 128; (d) L = remote, B = 4; (e) L = remote, B = 32; (f) L = remote, B = 128. Panels (a)-(c) refer to tests performed on a single host, while (d)-(f) involve two separate hosts.
Repeating scalability evaluations on multiple
hosts (L = remote), we were able to deploy up to 8 ap-
plication pairs (S = 8vs8) transmitting data from one
host to the other one (Figure 3e). Results of these
evaluations, shown in Figures 4g and 4h, indicate that
the outcome strongly depends on the size of the pack-
ets exchanged: for bigger packets the major bottle-
neck is represented by the limited throughput of the
Ethernet standard; for smaller ones, the inability of
each virtual switch to scale adequately with the number
of packet flows degrades system performance even further.
5.5 Latency Performance
We finally evaluated the round-trip latency between
two applications when adopting each virtual switch
to estimate the per-packet processing overhead intro-
duced by each different solution. In particular we
focused on the ability of each solution to distribute
packet processing costs over multiple packets when
increasing the burst size. For this purpose, a single
pair of client/server applications has been deployed
on a single or multiple hosts to perform each evalua-
tion. Each test is configured to exchange a very low
number of packets per second, so that there is no inter-
ference between the processing of a burst of packets
and the following one.
First we considered a single pair of client/server
applications deployed on a single host (Figure 3c).
In these tests we varied the burst size from 4 to
128 packets and the packet size from 64 to 1500 bytes:
(D = latency, L = local, S = 1vs1, R = 1000)
Comparative Evaluation of Kernel Bypass Mechanisms for High-performance Inter-container Communications
53
In all our tests, Snabb was not able to achieve
comparable latency with respect to other solutions;
for example, while all other virtual switches could
maintain their average latency under 40 µs for B = 4,
Snabb’s delay was always greater than 60 µs, even for
very small packet sizes. Since this overall behavior is
repeated in all our tests, regardless of which parame-
ters are applied, Snabb will not be discussed further.
Figures 5a to 5c show that only virtio-based so-
lutions were able to achieve single-digit microsec-
ond round-trip latency on a single host, while SR-
IOV has a higher latency for small packet and burst
sizes. Increasing the burst size over 32 packets, these
roles are reversed, with SR-IOV becoming the most
lightweight solution, although it was never able to
achieve single-digit microsecond latency. From this
we inferred that SR-IOV performance is less influenced
by the variation of the burst size than the other
available options, and thus it is more suitable
when traffic on a single host is grouped into bigger
bursts.
We then repeated the same evaluations by deploy-
ing the server application on a separate host (L =
remote, Figure 3f). In this new scenario, SR-IOV al-
ways outperforms OVS and VPP, as shown in Fig-
ures 5d to 5f. This can be easily explained: when
a virtio-based switch is used to access the physical
network, two additional levels of forwarding are introduced
with respect to the SR-IOV-only case, as the two
software switch instances, each running on its respective
host, perform in software computations that the
SR-IOV device carries out in hardware.
6 CONCLUSIONS AND FUTURE
WORK
In this work, we presented a novel framework aimed
at evaluating and comparing the performance of various
virtual networking solutions based on kernel bypass.
We focused on the interconnection of VNFs deployed
in Linux containers in one or multiple hosts.
We performed a number of performance evalu-
ations on some virtual switches commonly used in
the NFV industrial practice. Test results show that
SR-IOV has superior performance, both in terms of
throughput and scalability, when traffic is limited to a
single host. However, in scenarios that consider inter-
host communications, each solution is mostly con-
strained by the limits imposed by the Ethernet stan-
dard. Finally, from a latency perspective we showed
that both for local and remote communications SR-
IOV can attain smaller round-trip latency with respect
to its competitors when bigger burst sizes are adopted.
In the future, we plan to support additional vir-
tual networking solutions, like NetVM (Hwang et al.,
2015), Netmap (Rizzo, 2012a), and VALE (Rizzo and
Lettieri, 2012). We also plan to extend the functional-
ity of the proposed framework, for example by allowing
the number of CPU cores allocated to the virtual
switch to be customized during the runs. Finally, we
would like to compare various versions of the consid-
ered virtual switching solutions, as we noticed some
variability in the performance figures after upgrading
some of the components, during the development of
the framework.
REFERENCES
Ara, G., Abeni, L., Cucinotta, T., and Vitucci, C. (2019).
On the use of kernel bypass mechanisms for high-
performance inter-container communications. In High
Performance Computing, pages 1–12. Springer Inter-
national Publishing.
Barach, D., Linguaglossa, L., Marion, D., Pfister, P.,
Pontarelli, S., and Rossi, D. (2018). High-speed
software data plane via vectorized packet processing.
IEEE Communications Magazine, 56(12):97–103.
Barik, R. K., Lenka, R. K., Rao, K. R., and Ghose, D.
(2016). Performance analysis of virtual machines and
containers in cloud computing. In 2016 International
Conference on Computing, Communication and Au-
tomation (ICCCA). IEEE.
Dong, Y., Yang, X., Li, J., Liao, G., Tian, K., and Guan, H.
(2012). High performance network virtualization with
SR-IOV. Journal of Parallel and Distributed Comput-
ing, 72(11):1471–1480.
Felter, W., Ferreira, A., Rajamony, R., and Rubio, J. (2015).
An updated performance comparison of virtual ma-
chines and linux containers. In 2015 IEEE Interna-
tional Symposium on Performance Analysis of Sys-
tems and Software (ISPASS). IEEE.
Gallenmüller, S., Emmerich, P., Wohlfart, F., Raumer,
D., and Carle, G. (2015). Comparison of frame-
works for high-performance packet IO. In 2015
ACM/IEEE Symposium on Architectures for Network-
ing and Communications Systems (ANCS). IEEE.
Géhberger, D., Balla, D., Maliosz, M., and Simon, C.
(2018). Performance evaluation of low latency com-
munication alternatives in a containerized cloud envi-
ronment. In 2018 IEEE 11th International Conference
on Cloud Computing (CLOUD). IEEE.
Hwang, J., Ramakrishnan, K. K., and Wood, T. (2015).
NetVM: High performance and flexible networking
using virtualization on commodity platforms. IEEE
Transactions on Network and Service Management,
12(1):34–47.
Intel (2015). Open vSwitch enables SDN and NFV trans-
formation. White Paper, Intel.
Lettieri, G., Maffione, V., and Rizzo, L. (2017). A survey
of fast packet I/O technologies for Network Function
Virtualization. In Lecture Notes in Computer Science,
pages 579–590. Springer International Publishing.
Paolino, M., Nikolaev, N., Fanguede, J., and Raho, D.
(2015). SnabbSwitch user space virtual switch bench-
mark and performance optimization for NFV. In 2015
IEEE Conference on Network Function Virtualization
and Software Defined Network (NFV-SDN). IEEE.
Pfaff, B., Pettit, J., Koponen, T., Jackson, E., Zhou, A., Ra-
jahalme, J., Gross, J., Wang, A., Stringer, J., Shelar, P.,
et al. (2015). The design and implementation of Open
vSwitch. In 12th USENIX Symposium on Networked
Systems Design and Implementation (NSDI 15), pages
117–130.
Pitaev, N., Falkner, M., Leivadeas, A., and Lambadaris, I.
(2018). Characterizing the performance of concur-
rent virtualized network functions with OVS-DPDK,
FD.IO VPP and SR-IOV. In Proceedings of the
2018 ACM/SPEC International Conference on Perfor-
mance Engineering - ICPE '18. ACM Press.
Rizzo, L. (2012a). Netmap: A novel framework for fast
packet I/O. In 2012 USENIX Annual Technical Con-
ference (USENIX ATC 12), pages 101–112, Boston,
MA. USENIX Association.
Rizzo, L. (2012b). Revisiting network I/O APIs: The
Netmap framework. Queue, 10(1):30.
Rizzo, L. and Lettieri, G. (2012). VALE, a switched eth-
ernet for virtual machines. In Proceedings of the
8th international conference on Emerging networking
experiments and technologies - CoNEXT '12. ACM
Press.
Russell, R., Tsirkin, M. S., Huck, C., and Moll, P. (2015).
Virtual I/O Device (VIRTIO) Version 1.0. Standard,
OASIS Specification Committee.