
entirely transparent to the applications.  While such 
transparency is desirable, it forces a tight integration 
at the memory subsystem, either at the physical level 
or at the hypervisor level. At the physical level, the 
memory controller needs to be able to handle remote 
memory accesses. To mitigate the impact of long 
remote memory access latencies, we expect that a large 
cache system is required. Disaggregated GPUs and 
FPGAs can be accessed as I/O devices based on 
direct integration through PCIe over Ethernet.  
As with disaggregated memory, the programming 
models remain unchanged once the disaggregated 
resource has been mapped into the I/O address space 
of the local compute node. 
In the second approach, access to 
disaggregated resources can be exposed at the 
hypervisor/container/operating system level.  New 
hypervisor-level primitives - such as getMemory, 
getGPU, getFPGA, etc. - need to be defined to allow 
applications to explicitly request the provisioning 
and management of these resources in a manner 
similar to malloc (a minimal sketch is given below). It is 
also possible to modify the paging mechanism within the 
hypervisor/operating system so that paging to HDD 
goes through a new memory hierarchy consisting of 
disaggregated memory, SSD, and HDD.  In this 
case, the application does not need to be modified at 
all.  Accessing remote NVIDIA GPUs through rCUDA 
(Duato 2010) has been demonstrated, and has been 
shown to actually outperform a locally connected 
GPU when there is appropriate network 
connectivity.    
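To illustrate what a hypervisor-level getMemory primitive of this kind might look like from the guest's perspective, the following minimal C sketch defines a hypothetical getMemory/freeMemory pair.  The names, the dm_handle structure, and the local-malloc stand-in that backs the region are illustrative assumptions, not an existing interface; a real implementation would issue a hypercall and map a remote segment into the guest's address space.

```c
#include <stdlib.h>
#include <stdint.h>

/* Hypothetical handle to a region of disaggregated memory.  In a real
 * system the hypervisor would map the remote segment into the guest's
 * address space; the names and fields here are illustrative only. */
typedef struct {
    void    *vaddr;      /* local virtual address of the mapping      */
    size_t   length;     /* size of the region in bytes               */
    uint64_t remote_id;  /* identifier of the backing remote segment  */
} dm_handle;

/* getMemory: request 'length' bytes from the disaggregated pool,
 * analogous to malloc but served by a hypervisor primitive.  The stub
 * backs the region with local malloc purely so the sketch compiles;
 * a real implementation would issue a hypercall. */
static int getMemory(size_t length, dm_handle *out)
{
    out->vaddr = malloc(length);
    if (out->vaddr == NULL)
        return -1;
    out->length    = length;
    out->remote_id = 0;              /* would be assigned by the pool */
    return 0;
}

static void freeMemory(dm_handle *h)
{
    free(h->vaddr);                  /* real version: return the segment to the pool */
    h->vaddr = NULL;
}

int main(void)
{
    dm_handle h;
    if (getMemory(1 << 20, &h) != 0) /* ask for 1 MiB from the pool */
        return 1;
    /* Once mapped, the region is accessed with ordinary loads and stores. */
    ((char *)h.vaddr)[0] = 42;
    freeMemory(&h);
    return 0;
}
```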
Disaggregation details and resource remoteness 
can also be directly exposed to applications.  
Disaggregated resources can be exposed via high-
level APIs (e.g. put/get for memory).  As an 
example, it is possible to define GetMemory, in the 
form of Memory as a Service, as an OpenStack 
service. The OpenStack service sets up a 
channel between the host and the memory pool 
service through RDMA. Through the GetMemory 
service, the application can explicitly control 
which part of its address space is deemed remote, and 
therefore controls, or is at least cognizant of, which 
memory and application objects will be placed 
remotely.  In the case of GPU as a service, a new 
service primitive GetGPU can be defined to locate 
an available GPU from a GPU resource pool and a 
host from the host resource pool.  The system 
establishes the channel between the host and the 
GPU through RDMA/PCIe and exposes GPU 
access to applications via a library or a virtual 
device.   
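The following minimal C sketch illustrates how an application might consume such a Memory-as-a-Service API with explicit put/get semantics.  The maas_* names are hypothetical and do not correspond to an existing OpenStack API; the stubs simply stand in locally for the RDMA channel that the service would establish to the memory pool.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical Memory-as-a-Service client API.  A GetMemory-style
 * service would set up an RDMA channel to the memory pool; the stubs
 * below stand in locally so that the sketch compiles. */
typedef struct { void *base; size_t len; } maas_region;

static maas_region *maas_get_memory(size_t bytes)       /* reserve remote memory */
{
    maas_region *r = malloc(sizeof *r);
    if (r == NULL)
        return NULL;
    r->base = malloc(bytes);                             /* real: RDMA-registered buffer */
    r->len  = bytes;
    if (r->base == NULL) { free(r); return NULL; }
    return r;
}

static int maas_put(maas_region *r, size_t off, const void *buf, size_t len)
{
    memcpy((char *)r->base + off, buf, len);             /* real: RDMA write to the pool */
    return 0;
}

static int maas_get(maas_region *r, size_t off, void *buf, size_t len)
{
    memcpy(buf, (char *)r->base + off, len);             /* real: RDMA read from the pool */
    return 0;
}

/* The application explicitly chooses which objects live remotely: here a
 * lookup table is pushed to the pool while hot data stays in local DRAM. */
int main(void)
{
    double table[4] = {1, 2, 3, 4}, back[4];
    maas_region *r = maas_get_memory(sizeof table);
    if (r == NULL)
        return 1;
    maas_put(r, 0, table, sizeof table);
    maas_get(r, 0, back, sizeof back);
    return back[2] == 3.0 ? 0 : 1;
}
```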
4  NETWORK CONSIDERATIONS  
One of the primary challenges for a disaggregated 
datacenter architecture is the latency incurred by the 
network when accessing memory, SSD, GPU, and 
FPGA resources from remote resource pools.   The latency 
sensitivity depends on how the disaggregated 
resources are exposed to the programming model: 
as direct hardware, through the hypervisor, or as 
a service. 
The most stringent requirement on the network 
arises when disaggregated memory is mapped into the 
address space of the compute node and is accessed 
in a byte-addressable manner. The total 
access latency across the network cannot be 
significantly larger than the typical access time of 
DRAM, which is on the order of 75 ns.   As a 
result, silicon photonics and optical circuit switches 
(OCS) are likely to be the only options to enable 
memory disaggregation beyond a rack. Large caches 
can reduce the impact of remote access. When the 
block sizes are aligned with the page sizes of the 
system, the remote memory can be managed as an 
extension of the virtual memory system of the local 
host through hypervisor and OS management. 
In this configuration, local DRAM is used as a cache 
for the remote memory, which is managed in page-
size blocks and can be moved via RDMA 
operations.  
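A minimal sketch of this caching arrangement is given below.  The rdma_read_page and rdma_write_page helpers are hypothetical stand-ins for one-sided RDMA verbs, and a direct-mapped placement policy is used purely to keep the example short; a hypervisor or OS pager would apply its usual replacement policy.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_SIZE    4096
#define LOCAL_FRAMES 1024          /* local DRAM frames used as a cache */

/* Stand-ins for one-sided RDMA page transfers (real code would post
 * IBV_WR_RDMA_READ / IBV_WR_RDMA_WRITE work requests). */
static void rdma_read_page(uint64_t remote_page, void *frame)
{
    (void)remote_page;
    memset(frame, 0, PAGE_SIZE);   /* pretend the remote page arrived */
}
static void rdma_write_page(uint64_t remote_page, const void *frame)
{
    (void)remote_page; (void)frame;
}

static uint8_t  frames[LOCAL_FRAMES][PAGE_SIZE];   /* local DRAM cache    */
static uint64_t tags[LOCAL_FRAMES];                /* cached remote pages */
static bool     valid[LOCAL_FRAMES], dirty[LOCAL_FRAMES];

/* Return a local frame holding 'remote_page', fetching it over RDMA on a
 * miss.  Direct-mapped for brevity. */
static void *touch_page(uint64_t remote_page, bool will_write)
{
    uint64_t f = remote_page % LOCAL_FRAMES;
    if (!valid[f] || tags[f] != remote_page) {
        if (valid[f] && dirty[f])
            rdma_write_page(tags[f], frames[f]);   /* write back the victim  */
        rdma_read_page(remote_page, frames[f]);    /* fetch the missing page */
        tags[f]  = remote_page;
        valid[f] = true;
        dirty[f] = false;
    }
    if (will_write)
        dirty[f] = true;
    return frames[f];
}

int main(void)
{
    uint8_t *p = touch_page(42, true);   /* remote page 42 now cached locally */
    p[0] = 7;
    return 0;
}
```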
Disaggregating GPUs and FPGAs is much less 
demanding, as each GPU and FPGA is likely to 
have its own local memory and will often engage in 
computations that last many microseconds or 
milliseconds.  The predominant communication 
mode between a compute node and disaggregated 
GPU and FPGA resources is therefore likely to be bulk 
data transfer.  Reano et al. (2013) have shown 
that adequate bandwidth, such as that 
offered by RDMA at the FDR data rate (56 Gb/s), 
already yields performance superior to a 
locally connected GPU.     
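A back-of-the-envelope calculation illustrates why bandwidth rather than latency dominates in this regime.  The 64 MiB buffer size and the 5 us per-message network latency assumed below are illustrative values, not measurements, and 56 Gb/s is treated as roughly 7 GB/s of usable bandwidth.

```c
#include <stdio.h>

/* Rough check of why bulk transfer dominates for a disaggregated GPU:
 * at ~7 GB/s, the per-message network latency is a negligible fraction
 * of the time to move a realistic input buffer, let alone of a kernel
 * that runs for milliseconds. */
int main(void)
{
    const double bytes       = 64.0 * 1024 * 1024;  /* 64 MiB input buffer (assumed) */
    const double bandwidth   = 7.0e9;               /* ~56 Gb/s in bytes per second  */
    const double net_latency = 5.0e-6;              /* assumed 5 us network latency  */

    double transfer_s = bytes / bandwidth;          /* roughly 9.6 ms                */
    double overhead   = net_latency / transfer_s;   /* latency share of the transfer */

    printf("transfer time: %.2f ms, latency overhead: %.4f %%\n",
           transfer_s * 1e3, overhead * 100.0);
    return 0;
}
```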
Current SSD technologies offer on the order of 
100K IOPS (or more) at ~100 us access latency.  
Consequently, the network access latency to a non-buffered 
disaggregated SSD should be kept on the order of 10 us.  This latency 
may be achievable using conventional Top-of-
Rack (TOR) switch technologies if the 
communication is limited to within a rack.  A flat 
network across a PoD or a datacenter, with a two-tier 
spine-leaf model or a single-tier spline model, is 
required in order to achieve less than 10 us latency if 
the communication between the local hosts and the 
disaggregated SSD resource pools crosses a PoD 
or a datacenter.   
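The following small calculation makes the latency budget explicit.  The ~100 us device latency and the 10 us network target are the figures discussed above; the larger network latencies are hypothetical points of comparison.

```c
#include <stdio.h>

/* Latency budget for a disaggregated SSD: with ~100 us of device access
 * latency, an added network latency of ~10 us keeps the end-to-end
 * slowdown around 10%, whereas much higher network latencies erode the
 * benefit of remote flash. */
int main(void)
{
    const double ssd_us    = 100.0;                 /* device access latency     */
    const double net_us[]  = {10.0, 50.0, 100.0};   /* candidate network latency */

    for (int i = 0; i < 3; i++) {
        double total = ssd_us + net_us[i];
        printf("net %.0f us -> total %.0f us (%.0f%% slower, %.1fK IOPS at queue depth 1)\n",
               net_us[i], total, 100.0 * net_us[i] / ssd_us, 1e6 / total / 1e3);
    }
    return 0;
}
```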