What is TOPSTREAM™ Multi-Core Platform

The TOPSTREAM™ Multi-Core Platform is a scalable heterogeneous multi-core platform implementation designed to provide the following features.

- Provide scalability of up to 9 parallel processing cores (Task-Level Parallelism)

- Provide a wide range of flexibility through the Data Processing Engine (DPE)’s dual ISA

- Enable optimization of memory hierarchy with register-file configuration

- Enable quick configuration and reuse of Multi-Core for derivatives

The task-level parallelism that embedded systems can exploit generally fits within fewer than 10 processors because of the inherent concurrency of the application tasks, which is why the TOPSTREAM™ Multi-Core platform scales up to 9 parallel processing cores. For example, an MPEG-2 encoder consists of several operations that run concurrently, such as motion estimation (ME), discrete cosine transform (DCT), quantization (Q), inverse quantization (IQ), inverse DCT (IDCT), and Huffman coding. Given the large amount of computation required for each frame, and given that frames typically enter the system at 30 frames/sec, these steps must be performed in parallel to meet that timing requirement. This type of parallelism is easy to leverage because the system specification naturally decomposes into a small number of tasks (six or fewer here), although a decomposition based on the specification may not be the best one. Since similar types of tasks can share a processor as long as the timing requirements are met, the number of processors required for the task-level parallelism of such a system can be considered to be no more than 10, even when extra operations such as audio encoding are added.
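The timing argument above can be illustrated with a small sketch. Only the 30 frames/sec rate and the task names come from the text; the per-task execution times are invented for illustration, and the first-fit packing ignores data dependencies between pipeline stages. It shows how tasks could share processors as long as each processor's total work fits within one frame period.

```python
from typing import Dict, List

FRAME_RATE_HZ = 30                         # frame rate stated in the text
FRAME_BUDGET_MS = 1000.0 / FRAME_RATE_HZ   # time available per frame

# Hypothetical per-frame execution times in milliseconds.
# These are assumptions for illustration, not platform measurements.
TASK_TIME_MS: Dict[str, float] = {
    "ME": 28.0,
    "DCT": 12.0,
    "Q": 5.0,
    "IQ": 5.0,
    "IDCT": 12.0,
    "Huffman": 9.0,
}


def pack_tasks(task_time_ms: Dict[str, float], budget_ms: float) -> List[List[str]]:
    """Assign tasks to processors with a first-fit-decreasing heuristic.

    A task joins the first processor whose accumulated load still fits
    in the frame budget; otherwise a new processor is opened.
    """
    processors: List[List[str]] = []
    loads: List[float] = []
    # Place the longest tasks first.
    for name, time_ms in sorted(task_time_ms.items(), key=lambda kv: -kv[1]):
        if time_ms > budget_ms:
            raise ValueError(f"{name} alone exceeds the frame budget")
        for i, load in enumerate(loads):
            if load + time_ms <= budget_ms:
                processors[i].append(name)
                loads[i] += time_ms
                break
        else:
            processors.append([name])
            loads.append(time_ms)
    return processors


if __name__ == "__main__":
    for index, tasks in enumerate(pack_tasks(TASK_TIME_MS, FRAME_BUDGET_MS)):
        print(f"processor {index}: {tasks}")
```

With these assumed times the six tasks fit on three processors, which is consistent with the claim that the processor count stays well below 10.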

The TOPSTREAM™ Multi-Core platform supports integration of up to eight heterogeneous DPEs organized as an asymmetric multi-processor (AMP), with dual ISA support on each DPE. All DPEs share a common, compact 32-bit RISC ISA for processing the control flow of application programs. In addition, each DPE may have an application-specific ISA, used mainly for data processing, that is dedicated to reducing the clock cycles required for the target application tasks.

If we could use the same processor architecture, organized as a symmetric multi-processor (SMP), for many different applications, we could manufacture the chips in even larger volumes, which would allow lower prices. Programmers could also develop software more easily, since they would be familiar with the platform and would have a richer tool set. An SMP would also make it easier to map an application onto the architecture. However, SoCs for Information Appliances must meet several constraints that do not apply to the scientific computations typically run on SMPs. Such SoCs must perform real-time computing, be area-efficient, and be energy-efficient. All these constraints push the TOPSTREAM™ Multi-Core platform toward a flexible heterogeneous Multi-Core.

Real-time computation requires a processor to produce results at predictable times. This requires careful design of the hardware, such as the instruction set, memory system, and system bus, and of the software that takes advantage of the hardware features. Many mechanisms used in general-purpose processors deliver performance under an easy programming model, but some of them make system performance less predictable. For example, cache snooping manages cache coherency dynamically, but at the cost of less predictable delays, since the time required for a memory access depends on the state of several caches. The TOPSTREAM™ Multi-Core platform uses a memory system and application-specific instructions whose configuration can be specialized to the needs of the target application. For example, each DPE can have multiple banks of general-purpose registers as well as multiple banks of data registers. Scratch Pad Memory (SPM) can optionally be added to each DPE. In addition, global SPMs, such as the instruction memory (IM) and the data memory (DM), are shared between DPEs and can also be configured for target applications.

Since different tasks in an application often have different characteristics, different parts of the architecture often need different hardware structures. If a system architect can predict some aspect of the memory behavior of the target application, those characteristics can be reflected in the SoC architecture. The cache configuration of the master controller (MC) is one example: a considerably smaller cache can be used when the target application has regular memory access patterns.

AMPs, that is, heterogeneous Multi-Cores, are more area-efficient than SMPs. The task-level parallelism in embedded computing applications is inherently heterogeneous. For example, in MPEG-2 encoding, each function does something different from the others and has different computation requirements. A special-purpose DPE can be much faster and smaller than a general-purpose processor or a DSP. For example, matching the processor's datapath width to the native data size of the application can save a considerable amount of area. In addition, choosing an optimum number of local registers and a datapath organization that match the application characteristics can greatly improve performance. For these reasons, the fundamental DPE architecture allows the application-specific instruction set architecture to be extended with any native data size, such as 64-bit or 128-bit, while the common instruction set for control flow remains a 32-bit RISC.

As with area efficiency, specialization can save power. Reducing the clock rate by increasing the work done in each clock cycle, through application-specific instructions with a native data size, can drastically increase energy efficiency. Removing features that are not necessary for the target application also increases energy efficiency.
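The relationship between datapath width and clock rate can be sketched with simple arithmetic. The workload figures below (frame size, operations per pixel, 8-bit elements) are hypothetical, and the model assumes one full-width operation completes per clock cycle, which a real DPE may not achieve.

```python
def elements_per_cycle(datapath_bits: int, element_bits: int) -> int:
    """Number of data elements one full-width instruction can process."""
    return datapath_bits // element_bits


def required_clock_hz(elements_per_second: float,
                      datapath_bits: int,
                      element_bits: int) -> float:
    """Clock rate needed to sustain a workload, assuming one
    full-width operation per cycle (an idealized assumption)."""
    return elements_per_second / elements_per_cycle(datapath_bits, element_bits)


# Hypothetical workload: 720x480 frames at 30 frames/sec,
# 100 operations per 8-bit pixel (all assumed values).
WORKLOAD = 720 * 480 * 30 * 100.0

if __name__ == "__main__":
    for width in (32, 128):
        clock_mhz = required_clock_hz(WORKLOAD, width, 8) / 1e6
        print(f"{width}-bit datapath: about {clock_mhz:.0f} MHz")
```

Under these assumptions a 128-bit datapath needs one quarter of the clock rate of a 32-bit datapath for the same workload, which is the effect the paragraph above describes.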

Embedded system applications continue to require as much computational power as can be supplied. For example, data rates continue to rise in most applications, such as data communication, video, and audio. In addition, new appliances increasingly combine these previously available applications. A single Multi-Core SoC may perform wireless communication, video compression, and speech recognition, for example. Scalability therefore becomes an important feature of a Multi-Core SoC platform. The scalability of the TOPSTREAM™ Multi-Core platform helps SoC designers quickly develop derivatives by reusing DPE cores and software that have previously been designed and verified. Fig. 1 shows an overview of the TOPSTREAM™ Multi-Core platform.

The TOPSTREAM™ Multi-Core platform enables high-performance, energy-efficient designs with a short time-to-market by providing a 32-bit RISC processor as a master controller (MC) and up to eight DPEs that may have different ISA extensions. The platform also provides scalable distributed-arbitration buses (the TOPSTREAM™ bus) for efficient communication and integration between processors, memories, and I/Os. The TOPSTREAM™ bus consists of three buses: the I-bus for instruction fetch by DPEs, the D-bus for on-chip data memory access by DPEs, and the S-bus for system resource access, such as external and internal memories and I/Os, by both the MC and the DPEs. Each bus has a 128-bit data width to provide enough bandwidth for the data transfers of most applications, both by and between processors. The memory hierarchy, including register files, caches, and on-chip scratch pad memory (SPM), is configurable for cost, performance, and energy trade-offs. In addition, an on-chip peripheral bus is provided for IP core integration. The TOPSTREAM™ Multi-Core platform architecture and chip were introduced in 2001. The platform allows users to configure Multi-Core SoCs for derivatives and, because it is provided as synthesizable RTL source code (so-called soft IP), enables retargeting to future process technologies.
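The structure described above (one MC, up to eight DPEs, three 128-bit buses) can be captured in a small configuration model. This is only an illustrative sketch: the class and function names are hypothetical, the ISA extension labels are made up, and the bandwidth figure assumes one transfer per clock cycle at a clock rate the source does not specify.

```python
from dataclasses import dataclass, field
from typing import List

MAX_DPES = 8                  # up to eight DPEs, per the text
BUS_WIDTH_BITS = 128          # data width of each TOPSTREAM bus, per the text
BUSES = ("I-bus", "D-bus", "S-bus")


@dataclass
class PlatformConfig:
    """Hypothetical description of one Multi-Core SoC derivative."""
    # One entry per DPE, naming its application-specific ISA extension.
    dpe_isa_extensions: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if len(self.dpe_isa_extensions) > MAX_DPES:
            raise ValueError(f"at most {MAX_DPES} DPEs are supported")

    @property
    def core_count(self) -> int:
        """The master controller plus all configured DPEs."""
        return 1 + len(self.dpe_isa_extensions)


def peak_bus_bytes_per_second(clock_hz: float,
                              width_bits: int = BUS_WIDTH_BITS) -> float:
    """Peak bandwidth of one bus, assuming one transfer per cycle."""
    return clock_hz * width_bits / 8


if __name__ == "__main__":
    # Hypothetical derivative with three DPEs and an assumed 100 MHz bus clock.
    config = PlatformConfig(["video-simd", "audio-mac", "crypto"])
    print(config.core_count, "cores")
    for bus in BUSES:
        print(bus, peak_bus_bytes_per_second(100e6), "bytes/sec peak")
```

A fully populated configuration has nine cores, matching the stated scalability limit.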

The DPE’s base 32-bit RISC ISA supports 16 32-bit general-purpose registers and can be configured with up to 16 register banks, for 256 registers in total. In addition, each DPE can have data registers supported by an additional ISA for application-specific data processing. The data registers can also be configured with up to 16 register banks, depending on how data locality is optimized for the target application. For example, a DPE with a 128-bit SIMD ISA and eight banks of 128-bit data registers, that is, 128 x 128-bit data registers, can be implemented.
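The register counts quoted above follow from simple multiplication, sketched below. The class name is hypothetical, and the assumption that data-register banks hold 16 registers each is inferred from the "eight banks, 128 registers" example rather than stated directly.

```python
from dataclasses import dataclass

REGS_PER_BANK = 16   # 16 registers per bank (inferred for data registers)
MAX_BANKS = 16       # up to 16 banks, per the text


@dataclass(frozen=True)
class RegisterFile:
    """Hypothetical model of one banked register file in a DPE."""
    width_bits: int
    banks: int

    def __post_init__(self) -> None:
        if not 1 <= self.banks <= MAX_BANKS:
            raise ValueError(f"banks must be between 1 and {MAX_BANKS}")

    @property
    def registers(self) -> int:
        """Total number of registers across all banks."""
        return REGS_PER_BANK * self.banks

    @property
    def storage_bytes(self) -> int:
        """Total storage held by the register file."""
        return self.registers * self.width_bits // 8


if __name__ == "__main__":
    general_purpose = RegisterFile(width_bits=32, banks=16)
    simd_data = RegisterFile(width_bits=128, banks=8)
    print(general_purpose.registers, "general-purpose registers")
    print(simd_data.registers, "data registers,",
          simd_data.storage_bytes, "bytes")
```

The fully banked general-purpose file gives the 256 registers mentioned in the text, and the eight-bank SIMD example gives 128 data registers.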

Fig. 1 Overview of TOPSTREAM™ platform

The architectural challenges in defining the TOPSTREAM™ Multi-Core platform are to design a processor core architecture and on-chip networks that provide scalable performance while increasing energy efficiency, and to enable easy configuration of Multi-Core SoCs with high levels of software and hardware reuse for high design efficiency.

In particular, software development methodology is one of the critical challenges for the TOPSTREAM™ Multi-Core platform. The software running on the platform must be high-performance, real-time, and energy-efficient. Although much research has been done on these issues, much remains to be done. In addition, each SoC requires its own software development environment, including a compiler, debugger, simulator, and other tools.

The task-level programming model is also a major challenge for SoC software. Task-level parallelism is relatively easy to identify in SoC applications, but it is important to exploit it well. Real-time operating systems provide scheduling mechanisms for tasks; however, the detailed behavior of a task, such as how it accesses memory and how its flow is controlled, strongly influences its execution time. We therefore need to understand how to abstract tasks so that system-level analysis properly captures the essential characteristics of their detailed behavior.

© Copyright 2003, 2004, 2005, 2006, 2007, 2008, 2009 TOPS SYSTEMS Corporation
