Scale-Out NUMA. Stanko Novaković Alexandros Daglis Edouard Bugnion Babak Falsafi Boris Grot. Abstract. 1. Introduction - PDF

In Proceedings of ASPLOS-XIX, March 2014 Scale-Out NUMA Stanko Novaković Alexandros Daglis Edouard Bugnion Babak Falsafi Boris Grot EcoCloud, EPFL University of Edinburgh Abstract Emerging datacenter applications

Please download to get full document.

View again

of 15
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.


Publish on:

Views: 20 | Pages: 15

Extension: PDF | Download: 0

In Proceedings of ASPLOS-XIX, March 2014 Scale-Out NUMA Stanko Novaković Alexandros Daglis Edouard Bugnion Babak Falsafi Boris Grot EcoCloud, EPFL University of Edinburgh Abstract Emerging datacenter applications operate on vast datasets that are kept in DRAM to minimize latency. The large number of servers needed to accommodate this massive memory footprint requires frequent server-to-server communication in applications such as key-value stores and graph-based applications that rely on large irregular data structures. The fine-grained nature of the accesses is a poor match to commodity networking technologies, including RDMA, which incur delays of x over local DRAM operations. We introduce Scale-Out NUMA (sonuma) an architecture, programming model, and communication protocol for low-latency, distributed in-memory processing. sonuma layers an RDMA-inspired programming model directly on top of a NUMA memory fabric via a stateless messaging protocol. To facilitate interactions between the application, OS, and the fabric, sonuma relies on the remote memory controller a new architecturally-exposed hardware block integrated into the node s local coherence hierarchy. Our results based on cycle-accurate full-system simulation show that sonuma performs remote reads at latencies that are within 4x of local DRAM, can fully utilize the available memory bandwidth, and can issue up to 10M remote memory operations per second per core. Categories and Subject Descriptors C.1.4 [Computer System Organization]: Parallel Architectures Distributed Architectures; C.5.5 [Computer System Organization]: Computer System Implementation Servers Keywords 1. Introduction RDMA, NUMA, System-on-Chips Datacenter applications are rapidly evolving from simple data-serving tasks to sophisticated analytics operating over Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). ASPLOS 14, March 1 4, 2014, Salt Lake City, Utah, USA. Copyright is held by the owner/author(s). ACM /14/03. enormous datasets in response to real-time queries. To minimize the response latency, datacenter operators keep the data in memory. As dataset sizes push into the petabyte range, the number of servers required to house them in memory can easily reach into hundreds or even thousands. Because of the distributed memory, applications that traverse large data structures (e.g., graph algorithms) or frequently access disparate pieces of data (e.g., key-value stores) must do so over the datacenter network. As today s datacenters are built with commodity networking technology running on top of commodity servers and operating systems, node-to-node communication delays can exceed 100µs [50]. In contrast, accesses to local memory incur delays of around 60ns a factor of 1000 less. The irony is rich: moving the data from disk to main memory yields a 100,000x reduction in latency (10ms vs. 100ns), but distributing the memory eliminates 1000x of the benefit. The reasons for the high communication latency are well known and include deep network stacks, complex network interface cards (NIC), and slow chip-to-nic interfaces [21, 50]. RDMA reduces end-to-end latency by enabling memory-to-memory data transfers over Infini- Band [26] and Converged Ethernet [25] fabrics. By exposing remote memory at user-level and offloading network processing to the adapter, RDMA enables remote memory read latencies as low as 1.19µs [14]; however, that still represents a 10x latency increase over local DRAM. We introduce Scale-Out NUMA (sonuma), an architecture, programming model, and communication protocol for distributed, in-memory applications that reduces remote memory access latency to within a small factor ( 4x) of local memory. sonuma leverages two simple ideas to minimize latency. The first is to use a stateless request/reply protocol running over a NUMA memory fabric to drastically reduce or eliminate the network stack, complex NIC, and switch gear delays. The second is to integrate the protocol controller into the node s local coherence hierarchy, thus avoiding state replication and data movement across the slow PCI Express (PCIe) interface. sonuma exposes the abstraction of a partitioned global virtual address space, which is useful for big-data applications with irregular data structures such as graphs. The programming model is inspired by RDMA [37], with applica- 1 tion threads making explicit remote memory read and write requests with copy semantics. The model is supported by an architecturally-exposed hardware block, called the remote memory controller (RMC), that safely exposes the global address space to applications. The RMC is integrated into each node s coherence hierarchy, providing for a frictionless, lowlatency interface between the processor, memory, and the interconnect fabric. Our primary contributions are: the RMC a simple, hardwired, on-chip architectural block that services remote memory requests through locally cache-coherent interactions and interfaces directly with an on-die network interface. Each operation handled by the RMC is converted into a set of stateless request/reply exchanges between two nodes; a minimal programming model with architectural support, provided by the RMC, for one-sided memory operations that access a partitioned global address space. The model is exposed through lightweight libraries, which also implement communication and synchronization primitives in software; a preliminary evaluation of sonuma using cycleaccurate full-system simulation demonstrating that the approach can achieve latencies within a small factor of local DRAM and saturate the available bandwidth; an sonuma emulation platform built using a hypervisor that runs applications at normal wall-clock speeds and features remote latencies within 5x of what a hardwareassisted RMC should provide. The rest of the paper is organized as follows: we motivate sonuma ( 2). We then describe the essential elements of the sonuma architecture ( 3), followed by a description of the design and implementation of the RMC ( 4), the software support ( 5), and the proposed communication protocol ( 6). We evaluate our design ( 7) and discuss additional aspects of the work ( 8). Finally, we place sonuma in the context of prior work ( 9) and conclude ( 10). 2. Why Scale-Out NUMA? In this section, we discuss key trends in datacenter applications and servers, and identify specific pain points that affect the latency of such deployments. 2.1 Datacenter Trends Applications. Today s massive web-scale applications, such as search or analytics, require thousands of computers and petabytes of storage [60]. Increasingly, the trend has been toward deeper analysis and understanding of data in response to real-time queries. To minimize the latency, datacenter operators have shifted hot datasets from disk to DRAM, necessitating terabytes, if not petabytes, of DRAM distributed across a large number of servers. The distributed nature of the data leads to frequent serverto-server interactions within the context of a given computation, e.g., Amazon reported that the rendering of a single page typically requires access to over 150 services [17]. These interactions introduce significant latency overheads that constrain the practical extent of sharding and the complexity of deployed algorithms. For instance, latency considerations force Facebook to restrict the number of sequential data accesses to fewer than 150 per rendered web page [50]. Recent work examining sources of network latency overhead in datacenters found that a typical deployment based on commodity technologies may incur over 100µs in roundtrip latency between a pair of servers [50]. According to the study, principal sources of latency overhead include the operating system stack, NIC, and intermediate network switches. While 100µs may seem insignificant, we observe that many applications, including graph-based applications and those that rely on key-value stores, perform minimal computation per data item loaded. For example, read operations dominate key-value store traffic, and simply return the object in memory. With 1000x difference in data access latency between local DRAM (100ns) and remote memory (100µs), distributing the dataset, although necessary, incurs a dramatic performance overhead. Server architectures. Today s datacenters employ commodity technologies due to their favorable cost-performance characteristics. The end result is a scale-out architecture characterized by a large number of commodity servers connected via commodity networking equipment. Two architectural trends are emerging in scale-out designs. First, System-on-Chips (SoC) provide high chip-level integration and are a major trend in servers. Current server SoCs combine many processing cores, memory interfaces, and I/O to reduce cost and improve overall efficiency by eliminating extra system components, e.g., Calxeda s ECX SoC [9] combines four ARM Cortex-A9 cores, memory controller, SATA interface, and a fabric switch [8] into a compact die with a 5W typical power draw. Second, system integrators are starting to offer glueless fabrics that can seamlessly interconnect hundreds of server nodes into fat-tree or torus topologies [18]. For instance, Calxeda s on-chip fabric router encapsulates Ethernet frames while energy-efficient processors run the standard TCP/IP and UDP/IP protocols as if they had a standard Ethernet NIC [16]. The tight integration of NIC, routers and fabric leads to a reduction in the number of components in the system (thus lowering cost) and improves energy efficiency by minimizing the number of chip crossings. However, such glueless fabrics alone do not substantially reduce latency because of the high cost of protocol processing at the end points. Remote DMA. RDMA enables memory-to-memory data transfers across the network without processor involvement on the destination side. By exposing remote memory and re- 2 Latency (s) Latency Bandwidth Request size (Bytes) Figure 1: Netpipe benchmark on a Calxeda microserver. liable connections directly to user-level applications, RDMA eliminates all kernel overheads. Furthermore, one-sided remote memory operations are handled entirely by the adapter without interrupting the destination core. RDMA is supported on lossless fabrics such as InfiniBand [26] and Converged Ethernet [25] that scale to thousands of nodes and can offer remote memory read latency as low as 1.19µs [14]. Although historically associated with the highperformance computing market, RDMA is now making inroads into web-scale data centers, such as Microsoft Bing [54]. Latency-sensitive key-value stores such as RAMCloud [43] and Pilaf [38] are using RDMA fabrics to achieve object access latencies of as low as 5µs. 2.2 Obstacles to Low-Latency Distributed Memory As datasets grow, the trend is toward more sophisticated algorithms at ever-tightening latency bounds. While SoCs, glueless fabrics, and RDMA technologies help lower network latencies, the network delay per byte loaded remains high. Here, we discuss principal reasons behind the difficulty of further reducing the latency for in-memory applications. 0 Bandwidth (Mbps) Node scalability is power-limited. As voltage scaling grinds to a halt, future improvements in compute density at the chip level will be limited. Power limitations will extend beyond the processor and impact the amount of DRAM that can be integrated in a given unit of volume (which governs the limits of power delivery and heat dissipation). Together, power constraints at the processor and DRAM levels will limit the server industry s ability to improve the performance and memory capacity of scale-up configurations, thus accelerating the trend toward distributed memory systems. Deep network stacks are costly. Distributed systems rely on networks to communicate. Unfortunately, today s deep network stacks require a significant amount of processing per network which factors considerably into end-toend latency. Figure 1 shows the network performance between two directly-connected Calxeda EnergyCore ECX SoCs, measured using the standard netpipe benchmark [55]. The fabric and the integrated NICs provide 10Gbps worth of bandwidth. Despite the immediate proximity of the nodes and the lack of intermediate switches, we observe high latency (in excess of 40µs) for small sizes and poor bandwidth scalability (under 2 Gbps) with large s. These bottlenecks exist due to the high processing requirements of TCP/IP and are aggravated by the limited performance offered by ARM cores. Large-scale shared memory is prohibitive. One way to bypass complex network stacks is through direct access to shared physical memory. Unfortunately, large-scale sharing of physical memory is challenging for two reasons. First is the sheer cost and complexity of scaling up hardware coherence protocols. Chief bottlenecks here include state overhead, high bandwidth requirements, and verification complexity. The second is the fault-containment challenge of a single operating system instance managing a massive physical address space, whereby the failure of any one node can take down the entire system by corrupting shared state [11]. Sharing caches even within the same socket can be expensive. Indeed, recent work shows that partitioning a single many-core socket into multiple coherence domains improves the execution efficiency of scale-out workloads that do not have shared datasets [33]. PCIe/DMA latencies limit performance. I/O bypass architectures have successfully removed most sources of latency except the PCIe bus. Studies have shown that it takes ns to communicate short bursts over the PCIe bus [21], making such transfers 7-8x more expensive, in terms of latency, than local DRAM accesses. Furthermore, PCIe does not allow for the cache-coherent sharing of control structures between the system and the I/O device, leading to the need of replicating system state such as page tables into the device and system memory. In the latter case, the device memory serves as a cache, resulting in additional DMA transactions to access the state. SoC integration alone does not eliminate these overheads, since IP blocks often use DMA internally to communicate with the main processor [5]. Distance matters. Both latency and cost of high-speed communication within a datacenter are severely impacted by distance. Latency is insignificant and bandwidth is cheap within a rack, enabling low-dimensional topologies (e.g., 3-D torus) with wide links and small signal propagation delays (e.g., 20ns for a printed circuit board trace spanning a 44U rack). Beyond a few meters, however, expensive optical transceivers must be used, and non-negotiable propagation delays (limited by the speed of light) quickly exceed DRAM access time. The combination of cost and delay puts a natural limit to the size of tightly interconnected systems. 3. Scale-Out NUMA This work introduces sonuma, an architecture and programming model for low-latency distributed memory. 3 App OS core core L1.. L1.. App LLC core L1 data control RMC L1 NI Figure 2: sonuma overview. NUMAfabric sonuma addresses each of the obstacles to low-latency described in 2.2. sonuma is designed for a scale-out model with physically distributed processing and memory: (i) it replaces deep network stacks with a lean memory fabric; (ii) eschews system-wide coherence in favor of a global partitioned virtual address space accessible via RMDA-like remote memory operations with copy semantics; (iii) replaces transfers over the slow PCIe bus with cheap cache-to-cache transfers; and (iv) is optimized for rack-scale deployments, where distance is minuscule. In effect, our design goal is to borrow the desirable qualities of ccnuma and RDMA without their respective drawbacks. Fig. 2 identifies the essential components of sonuma. At a high level, sonuma combines a lean memory fabric with an RDMA-like programming model in a rack-scale system. Applications access remote portions of the global virtual address space through remote memory operations. A new architecturally-exposed block, the remote memory controller (RMC), converts these operations into network transactions and directly performs the memory accesses. Applications directly communicate with the RMC, bypassing the operating system, which gets involved only in setting up the necessary in-memory control data structures. Unlike traditional implementations of RDMA, which operate over the PCI bus, the RMC benefits from a tight integration into the processor s cache coherence hierarchy. In particular, the processor and the RMC share all data structures via the cache hierarchy. The implementation of the RMC is further simplified by limiting the architectural support to one-sided remote memory read, write, and atomic operations, and by unrolling multi-line requests at the source RMC. As a result, the protocol can be implemented in a stateless manner by the destination node. The RMC converts application commands into remote requests that are sent to the network interface (NI). The NI is connected to an on-chip low-radix router with reliable, point-to-point links to other sonuma nodes. The notion of fast low-radix routers borrows from supercomputer interconnects; for instance, the mesh fabric of the Alpha connected 128 nodes in a 2D torus using an on-chip router with a pin-to-pin delay of just 11ns [39]. sonuma s memory fabric bears semblance (at the link and network layer, but not at the protocol layer) to the QPI and HTX solutions that interconnect sockets together into multiple NUMA domains. In such fabrics, parallel transfers over traces minimize pin-to-pin delays, short messages (header + a payload of a single cache line) minimize buffering requirements, topology-based routing eliminates costly CAM or TCAM lookups, and virtual lanes ensure deadlock freedom. Although Fig. 2 illustrates a 2D-torus, the design is not restricted to any particular topology. 4. Remote Memory Controller The foundational component of sonuma is the RMC, an architectural block that services remote memory accesses originating at the local node, as well as incoming requests from remote nodes. The RMC integrates into the processor s coherence hierarchy via a private L1 cache and communicates with the application threads via memory-mapped queues. We first describe the software interface ( 4.1), provide a functional overview of the RMC ( 4.2), and describe its microarchitecture ( 4.3). 4.1 Hardware/Software Interface sonuma provides application nodes with the abstraction of globally addressable, virtual address spaces that can be accessed via explicit memory operations. The RMC exposes this abstraction to applications, allowing them to safely and directly copy data to/from global memory into a local buffer using remote write, read, and atomic operations, without kernel intervention. The interface offers atomicity guarantees at the cache-line granularity, and no ordering guarantees within or across requests. sonuma s hardware/software interface is centered around four main abstractions directly exposed by the RMC: (i) the context identifier (ctx id), which is used by all nodes participating in the same application to create a global address space; (ii) the context segment, a range of the node s address space which is globally accessible by others; (iii) the queue pair (QP), used by applications to schedule remote memory operations and get notified of their completion; and (iv) local buffers, which can be used as the source or destination of remote operations. The QP model consists of a work queue (WQ), a bounded buffer written exclus
Related Search
Similar documents
View more...
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks