Performance gains when multicore partners with AdvancedTCA
In recent years processor technology has changed its focus: instead of increasing processor clock rates, it is adding more and more processor cores. Combining this silicon development direction with the AdvancedTCA form factor results in multi-level performance scaling options. Performance can be scaled by using processor silicon devices that have more processor cores and by adding more AdvancedTCA blades into the chassis. Moreover, AdvancedTCA systems can be tailored to a specific workload by combining standard multicore x86 processors with more specialized packet processors. Here Gene explores such AdvancedTCA system creation options and the performance gains they offer.
Multicore processor technology combined with the AdvancedTCA form factor results in multi-faceted performance scaling options: engineers can scale performance by using processor silicon with more processor cores as well as by adding more AdvancedTCA blades into the chassis. Moreover, AdvancedTCA systems can be adapted to a specific workload by combining standard multicore x86 processors with specialized packet processors. Having multiple cores within a processor is potentially highly advantageous, of course, but the cores are useless unless the software infrastructure can put them to work.
Virtualization is one technique that makes it possible for multiple cores to run multiple applications and their operating systems in parallel. Development tools ease both new application development and the porting of existing applications to a multicore processor. Packet processors in particular have a powerful set of tools that allow the design of applications that run in parallel on multiple cores.
The rise of multicore
Just a few years ago, each new processor silicon release brought along a worthwhile clock frequency improvement. Today, however, clock frequency is not the main news in a new generation processor release; instead the number of processor cores within the device commands center stage. As usual, small startups such as Cavium Networks and RMI (now NetLogic MicroSystems) were the first to market with multicore general-purpose processors. Then followed the giants: Intel, AMD, and Freescale. Today, 4-8 cores within a processor is the norm – and there are architectures available that feature as many as 64 cores within one processor.
The motivation for multicore processors is fairly simple: when running a typical application, the processor spends most of its time waiting for data to process. Historically, memory latency improved at a much slower pace than the speed of the processor. Today, the mismatch between processor and memory is such that adding a few extra clocks to the processor doesn't improve performance to any worthwhile degree. As if this is not a big enough problem, there is the issue of power consumption: adding a few extra hertz to the clock translates into a significant increase in power consumption.
From the multicore architecture perspective, having multiple cores, each running perhaps at a slightly slower speed, results in a higher overall performance solution. Considering that the processor spends roughly three quarters of its time waiting for the memory, this approach works well for applications that can benefit from parallel processing. Obviously, the memory subsystem implementation has to support multiple data accesses in parallel, which is typically the case today.
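The reasoning above can be made concrete with a toy model. The sketch below assumes that compute time scales with clock frequency while memory stall time does not, that tasks are perfectly parallel, and that the memory subsystem serves all cores without contention. The cycle counts and clock rates are hypothetical, chosen only so that the baseline core spends three quarters of its time waiting.

```python
def task_time_ns(compute_cycles, clock_ghz, stall_ns):
    """Wall-clock time for one task: compute scales with clock, stalls do not."""
    return compute_cycles / clock_ghz + stall_ns


def throughput(cores, compute_cycles, clock_ghz, stall_ns):
    """Tasks per microsecond, assuming perfectly parallel tasks and no
    memory contention between cores (both are simplifying assumptions)."""
    return cores * 1000.0 / task_time_ns(compute_cycles, clock_ghz, stall_ns)


# Hypothetical baseline: 225 cycles at 3.0 GHz = 75 ns compute, plus 225 ns
# of memory stalls, so the core waits for 75% of each 300 ns task.
baseline = throughput(1, compute_cycles=225, clock_ghz=3.0, stall_ns=225)
faster_clock = throughput(1, compute_cycles=225, clock_ghz=3.6, stall_ns=225)
four_slower = throughput(4, compute_cycles=225, clock_ghz=2.4, stall_ns=225)

print(f"20% faster clock, 1 core : {faster_clock / baseline:.2f}x baseline")
print(f"20% slower clock, 4 cores: {four_slower / baseline:.2f}x baseline")
```

Under these assumptions a 20 percent clock increase buys only a few percent, while four slower cores deliver well over three times the baseline throughput. Real workloads with serial sections or memory contention will scale less favorably.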
AdvancedTCA systems
Let's move the focus from the silicon to the system. When a single server with two or four multicore processors is required, the 19-inch rack-mountable enclosure (pizza box) works very well. When the application requires more than that, or when redundancy and higher reliability are required, AdvancedTCA becomes a good choice for system implementation (Figure 1). The AdvancedTCA chassis can support up to 14 dual-processor blades interconnected via two high-performance Ethernet switches in a redundant fashion (Figure 2). All blades within the chassis share power supplies and cooling fans, which are also implemented to support redundancy and higher reliability.
Figure 1: AdvancedTCA integrated platform from NEI
Figure 2: 16-slot AdvancedTCA chassis internal interconnect diagram
A key requirement when building a multiblade system is a high-speed, reliable interconnect between the blades. From this perspective, an AdvancedTCA system interconnects each blade via a Fabric Interface and a Base Interface. The Fabric Interface, which is considered to be a data path interface, is predominantly 10 Gbit Ethernet today and will soon support 40 Gbit Ethernet. The Base Interface is a control path and is implemented using 1 Gbit Ethernet. Both Fabric and Base Interfaces are implemented in a redundant fashion, such that each AdvancedTCA blade connects to both AdvancedTCA hubs, which provide the required Ethernet switching resources. Routing all connectivity through the AdvancedTCA backplane reduces external cabling, thereby making the overall system more reliable and more serviceable. And separating the control plane and data plane not only enables high-performance blade management and control services, but also isolates the control traffic from the revenue-generating data plane traffic. Such isolation of the two planes becomes critical when overall system security is considered. Plane isolation ensures that data plane traffic, which is typically customer-facing traffic, cannot intentionally or unintentionally interfere with the Ethernet switches and disrupt the operation of the complete system.
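The effect of connecting every blade to both hubs can be sketched in a few lines. The function below is a simplified illustration, not part of any AdvancedTCA specification: the name `usable_fabric_gbps`, the hub labels, and the `active_active` flag are all assumptions made for the example. It reports the Fabric Interface bandwidth one blade can use given which hubs are healthy, either using one hub at a time or both in parallel.

```python
def usable_fabric_gbps(link_gbps, hub_up, active_active=True):
    """Fabric bandwidth available to one blade.

    link_gbps      -- speed of the blade's link to each hub
    hub_up         -- mapping of hub name to True (healthy) or False (failed)
    active_active  -- True to use all healthy hubs in parallel,
                      False to use a single hub and keep the other as standby
    """
    healthy = sum(1 for up in hub_up.values() if up)
    if healthy == 0:
        return 0  # blade is isolated from the fabric
    return link_gbps * (healthy if active_active else 1)


both_up = {"hub_a": True, "hub_b": True}
one_failed = {"hub_a": False, "hub_b": True}

print(usable_fabric_gbps(10, both_up))                       # both hubs in use
print(usable_fabric_gbps(10, both_up, active_active=False))  # one hub on standby
print(usable_fabric_gbps(10, one_failed))                    # survivor carries traffic
```

In either mode a single hub failure leaves the blade reachable; the active-active mode simply trades spare capacity for bandwidth while both hubs are healthy.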
Doubling available bandwidth
Depending on the application type, the high-performance interconnect brings a different value proposition. In a compute type application, it's essential that large numbers of processors communicate with high throughput and very low latency. To that extent, 10 Gbit and 40 Gbit Ethernet can provide the required data throughput. Some Ethernet switches also support a cut-through (pass-through) switching mode, in which packet transmission starts before the packet is fully received. In such cases, packet switching latency can be lower than 500 ns. Although configuring two hubs (Ethernet switches) in an AdvancedTCA chassis is primarily for redundancy, it is also possible to use both hubs in parallel, effectively doubling the available bandwidth.
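A quick calculation shows why starting transmission before the packet is fully received matters. A switch that waits for the whole frame cannot begin forwarding until the frame has been clocked in, so its latency is at least the frame's serialization time. The sketch below computes that lower bound; it ignores preamble, inter-frame gap, and the switch's internal processing time, and the frame sizes are the usual minimum and maximum untagged Ethernet frames.

```python
def serialization_delay_ns(frame_bytes, link_gbps):
    """Time to clock one whole frame on or off the wire, in nanoseconds.

    Bits divided by Gbit/s yields nanoseconds directly.
    """
    return frame_bytes * 8 / link_gbps


for link_gbps in (10, 40):
    for frame_bytes in (64, 1518):
        delay = serialization_delay_ns(frame_bytes, link_gbps)
        print(f"{frame_bytes:>4}-byte frame at {link_gbps} Gbit/s: {delay:.1f} ns")
```

At 10 Gbit/s a full-size frame takes more than 1.2 microseconds to receive, so a sub-500 ns switching latency for such frames is only possible when forwarding starts before reception completes.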
From the compute power density perspective, it is interesting to note that a group of 14 GE Intelligent Platforms A10200 ATCA blades (Figure 3), with each blade featuring dual Intel 6-core Westmere processors, yields no fewer than 168 x86 cores within a single AdvancedTCA chassis, all interconnected via an in-chassis high-speed interconnect.
Figure 3: The GE Intelligent Platforms A10200 ATCA single board computer
Compute applications also tend to require significant storage capacity, bandwidth, and reliability. There are three main ways to address storage requirements. At the lowest level, each AdvancedTCA blade can have local hard disks, located on the blade itself or on an associated RTM: these could be two redundant SAS drives. At the next level, one or more storage AdvancedTCA blades could be used within the system. Such storage blades would be accessed via Ethernet using either the Fibre Channel over Ethernet (FCoE) or iSCSI protocols. AdvancedTCA storage blades can be shared among multiple processor blades. Finally, an external storage array can be connected via Fibre Channel, FCoE, or iSCSI.
Communication applications requirements
A key feature of communication applications is their requirement for high data throughput and packet processing. Also, they typically lend themselves well to parallel processing, which is where multicore technology finds its optimal advantage. Although processors from both AMD and Intel are excellent computing devices – especially when multiple cores are considered – both lack the ability to efficiently get data in and out at very high data rates.
Packet processors, another type of multicore processor architecture, are specifically optimized to address the problem of the efficient movement in and out of packetized data. Such devices are readily available in the AdvancedTCA blade form factor allowing system designers to take advantage of both x86 compute resources and packet processor packet manipulation resources within the same system. The interoperability inherent in the AdvancedTCA specification enables designers to plug in multiple x86 processor blades as well as multiple packet processor blades and interconnect them via high-performance Ethernet interfaces.
From this perspective, Ethernet switches within hubs add value in load distribution. Ethernet switches today can steer packets to a specific AdvancedTCA blade. Ushering the packets to just the right blade occurs thanks to sophisticated Access Control List features that take into account packets' Layer-2 to Layer-4 information. Such policy-based routing allows packet streams to be distributed at very high data rates (10 Gbit/sec to 100 Gbit/sec) among multiple AdvancedTCA blades while ensuring that packets belonging to the same flow are always directed to the same blade.
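The flow-affinity property can be illustrated in software. The sketch below is not how any particular switch implements its Access Control Lists; it swaps in a simple hash of the Layer-3 and Layer-4 header fields to show why every packet of a flow lands on the same blade. The `FiveTuple` type, the `pick_blade` function, and the slot numbering are hypothetical names introduced for the example.

```python
import zlib
from collections import namedtuple

# Header fields that identify a flow (Layer-3 and Layer-4 information).
FiveTuple = namedtuple("FiveTuple", "src_ip dst_ip protocol src_port dst_port")


def pick_blade(flow, blade_slots):
    """Map a flow to one blade; identical header fields always give the same slot."""
    key = "|".join(str(field) for field in flow).encode()
    return blade_slots[zlib.crc32(key) % len(blade_slots)]


blade_slots = list(range(3, 15))  # hypothetical slot numbers for 12 blades

flow = FiveTuple("10.0.0.1", "192.0.2.7", "tcp", 40000, 443)
first_packet = pick_blade(flow, blade_slots)
later_packet = pick_blade(flow, blade_slots)
print(first_packet == later_packet)  # same flow, same blade

# Many distinct flows spread across the available blades.
flows = [FiveTuple("10.0.0.1", "192.0.2.7", "tcp", port, 443)
         for port in range(40000, 41000)]
used = {pick_blade(f, blade_slots) for f in flows}
print(f"{len(used)} of {len(blade_slots)} blades received traffic")
```

A static hash like this keeps per-flow state on one blade without any coordination between blades, at the cost of uneven load when a few flows dominate the traffic.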
An example of a high-performance communication system is shown in Figure 4. This AdvancedTCA system employs two Ethernet hubs, two GE A10200 multicore x86 processor blades, and up to 12 GE AT2-5800 packet processor blades with dual 16-core OCTEON Plus processors. With the 10 packet processor blades shown in Figure 4, simple math reveals that this system provides 320 MIPS64 cores (OCTEON devices) and 24 x86 cores (Intel Westmere devices). From the data processing perspective, data enters the system via Ethernet hubs where packets are distributed, based on policies, among packet processing blades. Then, within the packet processor blade, packets are further distributed between two OCTEON devices and finally, within each OCTEON device, between the cores. The packet processors perform the majority of the high-throughput packet processing. Specific packets requiring more extensive processing power are forwarded to the x86-based blades. The key principle here is that although the majority of packets require little processing, a small subset requires more significant processing power.
Figure 4: AdvancedTCA System with two Westmere blades and 10 OCTEON blades
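The core counts quoted in this article follow directly from blades times processors per blade times cores per processor, and the first two stages of packet distribution can be sketched the same way. In the example below, `total_cores` and `locate` are hypothetical helper names; `locate` assumes a simple modulo mapping from a flow hash to a blade and an OCTEON device, and leaves the final choice of core to the device itself.

```python
def total_cores(blades, processors_per_blade, cores_per_processor):
    """Core count for a group of identical blades."""
    return blades * processors_per_blade * cores_per_processor


def locate(flow_hash, blades, devices_per_blade):
    """Map a flow hash to (blade index, device index on that blade)."""
    slot = flow_hash % (blades * devices_per_blade)
    return divmod(slot, devices_per_blade)


print(total_cores(14, 2, 6))    # 14 dual 6-core x86 blades: 168 cores
print(total_cores(2, 2, 6))     # two x86 blades: 24 cores
print(total_cores(10, 2, 16))   # 10 dual 16-core OCTEON blades: 320 cores
print(total_cores(12, 2, 16))   # a full complement of 12 such blades: 384 cores

print(locate(19, blades=10, devices_per_blade=2))  # last device on the last blade
```

The 320-core figure corresponds to the 10 packet processor blades named in the Figure 4 caption; populating all 12 available slots would raise it to 384.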
Happily separated
Historically, most applications were written without any parallel computing concepts in mind. Consequently, although modern compilers attempt to recognize areas in the code that lend themselves to parallel processing and try to harness the power of multiple cores, performance improvements are very limited when running legacy applications on multicore hardware.
Virtualization is often used today to better utilize multiple processor cores. In a virtualized environment, multiple instances of the operating system, or even multiple dissimilar operating systems, run on the same multicore processor. Having no relationship with each other, the operating systems can be happily executed in parallel on multiple cores. The hardware, with the help of a hypervisor, ensures that each operating system can safely access its own memory and I/O devices without disturbing its neighbors. Virtualization makes it possible to consolidate multiple physical servers into one server with multicore processors.
AdvancedTCA allows further consolidation of multiple blades with multiple multicore processors. Racks of legacy servers can be reduced to a single AdvancedTCA chassis. Virtualization within the AdvancedTCA environment also facilitates redundancy and high availability. Using a high availability virtualized operating system, an application can be migrated from one physical server to the other if hardware failure occurs.
Designed for parallel processing from the start, packet processors have a software environment and development tools that are fully geared toward application development in a multicore environment. Although OCTEON and similar devices are often called packet processors, internally they are based on standard processor architectures, such as MIPS64, and can run standard operating systems such as Linux. Their performance advantages, however, are best exposed when running simplified proprietary operating systems, such as Cavium's Simple Executive. It is important not to confuse these devices and their operating systems with the network processors of the past, such as Intel's XScale. Modern packet processors are programmed using the standard C and C++ languages even when their proprietary operating system is being used; in fact, they allow existing C code to be simply ported.
Simple applications, such as a packet filter or a Layer-2 or Layer-3 switch, can be developed as sequential code that runs to completion and executes in an endless loop, with the same code running on all cores. The parallel nature of the processing would be provided by the hardware itself, which would schedule a packet processing event onto the next available processor core, enforcing packet ordering and atomicity rules if desired. Having memory management and cache coherency handled by hardware allows developers to focus on the application itself. Inter-core communication can be implemented by setting aside a shared memory region or by using a shared variable.
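The structure of such a run-to-completion application can be sketched in Python, with the caveat that this is only a model of the programming pattern and not the Simple Executive API. A thread-safe queue stands in for the hardware scheduler, threads stand in for cores, and a lock-protected dictionary stands in for a shared variable. All names, including the blocked-port policy, are hypothetical, and the sketch does not enforce packet ordering.

```python
import queue
import threading

BLOCKED_PORTS = {23, 445}  # hypothetical filter policy
STOP = None                # sentinel that tells a worker loop to exit


def filter_packet(packet):
    """Run-to-completion handler: decide the fate of one packet."""
    return "drop" if packet["dst_port"] in BLOCKED_PORTS else "forward"


def core_loop(work_queue, counters, lock):
    """Identical loop executed by every 'core'."""
    while True:
        packet = work_queue.get()   # next available core takes the next packet
        if packet is STOP:
            break
        verdict = filter_packet(packet)
        with lock:                  # atomic update of the shared counters
            counters[verdict] += 1


def run(packets, num_cores=4):
    work_queue = queue.Queue()
    counters = {"forward": 0, "drop": 0}
    lock = threading.Lock()
    cores = [threading.Thread(target=core_loop, args=(work_queue, counters, lock))
             for _ in range(num_cores)]
    for core in cores:
        core.start()
    for packet in packets:
        work_queue.put(packet)
    for _ in cores:                 # one stop sentinel per worker
        work_queue.put(STOP)
    for core in cores:
        core.join()
    return counters


packets = [{"dst_port": port} for port in (80, 23, 443, 445, 22, 80)]
print(run(packets))
```

The application logic lives entirely in `filter_packet`, which knows nothing about how many cores exist; adding cores changes only the `num_cores` argument. Python threads do not execute this code truly in parallel, so the sketch shows the shape of the design rather than its speedup.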
Depending on the application and development requirements, a number of software packages can help developers get a head start. One notable example is 6WINDGate software. It allows the seamless marriage of x86 processors with packet processors, offloading time-critical tasks to be run by the packet processors' Simple Executive, and providing a large number of frequently needed protocols. 6WINDGate can be used standalone, or as a base platform for a specific application, and can abstract inter-processor and inter-core communications, significantly simplifying software development effort.
Conclusion
Today, multicore processors are an integral element of electronics design and are well supported by the AdvancedTCA infrastructure. AdvancedTCA enables very high compute density, without sacrificing reliability and redundancy. Redundant high-speed chassis-wide interconnect options support high-performance computing clusters as well as high-performance communication applications. Load balancing and policy routing techniques enable packet distribution among the blades, avoiding bottlenecks and fully utilizing multicore devices. Although most legacy applications can't take advantage of multicore performance, software techniques such as virtualization let multiple legacy applications run on the same processor, taking full advantage of the available multiple cores. Finally, software tools and hardware offload elements ease new application development or existing application porting to multicore environments.
Gene Juknevicius is Senior Architect & Technologist, GE Intelligent Platforms. He has participated in the PICMG, AMC, and MicroTCA committees, is currently an active member of the SCOPE Alliance and is responsible for new product definition and architecture at GE Intelligent Platforms. He received his M.S. degree in Electrical Engineering from Stanford University.
GE Intelligent Platforms
www.ge-ip.com
gene.juknevicius@ge.com




