A Low-cost 32-channel Module with High-speed Digital Interfaces for Portable Ultrasound Systems

M. Lewandowski, K. Sielewicz, M. Walczak
Department of Ultrasound
Institute of Fundamental Technological Research PAS
Warsaw, Poland
mlew@ippt.pan.pl

Abstract— There is a continuous trend towards small and portable ultrasound systems with multichannel processing. The objective of the work was to develop a modular acquisition and processing platform based on the following architecture principles: limited hardware processing, external high-speed data communication and software based on SAFT processing using embedded graphics processing unit (GPU). The acquisition module connected via PCIe or USB 3.0 interface can stream either raw RF data or demodulated ones. A low-power embedded PC with embedded GPU will implement ultrasound signal processing, as well as control and visualization functions. The performed feasibility study showed that AMD APU G-Series embedded x86 CPU+GPU is capable of real-time SAFT image reconstruction at limited resolution.

Keywords—ultrasonic imaging; synthetic aperture; medical electronics; GPU; FPGA

I. INTRODUCTION

Portable ultrasound imaging systems are widely used in medical and industrial applications. Low-power, low-cost and small size can be accomplished by smart system design and signal processing optimization.

Most commercial systems, and most of those described in the literature, are based on FPGA front-end processing. Portable ultrasound scanners are equipped with 16-64 parallel receive channels, a digital beamformer (ASIC or FPGA) and software back-end processing (embedded PC or DSP). An example of a contemporary design based on that architecture is the system described by Kim et al. [1]. Their system has a 32-channel receive beamformer and back-end processing implemented in a single FPGA chip (Xilinx, Spartan-3). An embedded ARM processor running the Linux operating system is dedicated to system control and display functions. A similar architecture, but with image processing on the software side, was developed by Zonare in the Z.ONE commercial medical scanner [2]. Z.ONE is also based on a single FPGA chip for front-end processing and a cluster of 3 DSPs (Texas Instruments TMS320C6455) for mid-end and back-end processing. Zonare implemented a patented ‘zone sonography’, a kind of synthetic aperture focusing technique (SAFT). The user functions and display are controlled by an embedded PowerPC processor (Freescale PPC5200) running the Nucleus RTOS. Another miniaturized solution is MANUS, offered by Aurotech as an OEM product [3]. MANUS is a flexible system scalable from 64 to 128 channels by stacking up to 4 main MANUS modules, 32 channels each. Each module is based on an FPGA with an implemented beamformer.

Newly developed, highly integrated low-power SoCs (System-on-Chip) are routinely equipped with an embedded GPU. These GPUs can serve as general-purpose parallel processors thanks to the availability of new programming tools (e.g. OpenCL, CUDA). Our goal is to create a low-cost portable ultrasound platform capable of real-time imaging implemented on an embedded GPU. We adopted the system architecture of our own versatile research ultrasound platform and scaled it down for a mobile system [4].

In the following sections we present the design of the system and a feasibility study of SAFT image reconstruction on the embedded GPU.

II. SYSTEM DESIGN

A. System Architecture

Our system consists of a 32-channel acquisition module and a connected (stacked) 32-channel transmit module. A single FPGA chip supports both transmit and data acquisition functions, as well as wire-speed RF processing (e.g. bandpass filtering, DC-offset removal, and optional demodulation). The number of acquisition channels can potentially be increased by connecting more boards.
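The front-end processing chain named above (DC-offset removal, filtering, optional demodulation) can be illustrated with a minimal software sketch. It is not the FPGA implementation: the function name, the moving-average low-pass filter, and the numeric values (40 MSPS sampling, 5 MHz carrier, decimation by 4) are assumptions chosen for illustration.

```python
import cmath
import math

def demodulate(rf, fs, f0, decim):
    """Return decimated complex baseband (I/Q) samples for one channel."""
    # DC-offset removal: subtract the mean of the record.
    dc = sum(rf) / len(rf)
    # Mix down to baseband with a complex exponential at the carrier frequency.
    mixed = [(x - dc) * cmath.exp(-2j * math.pi * f0 * n / fs)
             for n, x in enumerate(rf)]
    # Low-pass filter: a moving average over one carrier period,
    # which has a null at 2*f0 and so removes the mixing image.
    win = round(fs / f0)
    lowpassed = [sum(mixed[i:i + win]) / win
                 for i in range(len(mixed) - win + 1)]
    # Decimation: keep every decim-th filtered sample.
    return lowpassed[::decim]

# Illustrative values (assumptions, not taken from the hardware description).
FS = 40e6   # sampling rate, Hz
F0 = 5e6    # carrier (transducer centre) frequency, Hz
rf = [0.3 + 2.0 * math.cos(2 * math.pi * F0 * n / FS) for n in range(64)]
iq = demodulate(rf, FS, F0, decim=4)
```

For this test tone of amplitude 2.0 with a 0.3 offset, every output sample has magnitude 1.0 (half the RF amplitude), and the output record is about four times shorter than the input.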

The digital RF or demodulated I/Q data are streamed to the embedded PC over a high-speed PCIe or USB 3.0 interface. Hybrid CPU+GPU software signal processing is performed on the data in system memory. Because standard I/O interfaces are used, the modules can serve various applications and work with different processing hardware. An important benefit of using commercial off-the-shelf components for the communication and processing infrastructure is that system performance can be improved by adopting newer, more advanced embedded solutions.

B. Electronic Modules

The module was designed to operate in two modes: as a standalone device utilizing an embedded PC, or as a peripheral module controlled by a standard computer. The module communicates with the MIO-5270 embedded PC (Advantech, Taiwan) through a dedicated MIOe 2.0 extension connector. In this mode the raw or demodulated RF data are sent directly to the PC through a 4-lane PCIe gen. 1 interface. The developed platform is also capable of operating as a peripheral module controlled by a standalone PC. In that case the ultrasound data are streamed to the computer through an external SuperSpeed USB 3.0 interface.

The analog ultrasound signals from 32 channels are conditioned and sampled at 40-65 MSPS by a multichannel analog conditioning and analog-to-digital converter module (SMM9132, Samplify, USA) with a dedicated time gain compensation circuit. The FPGA (ARRIA V A5GXMB3G6F35C6, Altera, USA) located on the 32-channel RX-CTRL board (Fig. 1) performs digital demodulation separately for each channel. External DDR3 SDRAM chips (MT41J128M16, Micron, USA) are connected to the FPGA’s dedicated 32-bit hard memory controllers. The demodulated or raw RF data are then streamed directly to the embedded PC through the PCIe interface implemented in the FPGA, or through the SuperSpeed USB 3.0 controller CYUSB3014 (Cypress Semiconductor, USA). The FPGA is connected to the USB 3.0 controller by a 32-bit bi-directional port called General Programmable Interface (GPIF II). The internal FPGA registers can also be controlled through GPIF II. The FPGA can be configured using JTAG, configuration flash memory, the USB 3.0 controller or the PCIe interface. For long-duration off-line data acquisition, an external SATA hard drive can be connected. A low-power microcontroller (MSP430F552, Texas Instruments, USA) provides reliable system monitoring and power control. A high-performance clock generator-synthesizer (Si5338, Silicon Labs, USA) delivers a low-jitter clock to the ADCs and the rest of the system.
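As a rough sanity check on the link budget, the aggregate ADC output rate can be compared with the nominal rates of the two interfaces. The sketch below assumes 12-bit samples packed without padding, and uses the nominal line rates of PCIe gen. 1 (2.5 GT/s per lane) and USB 3.0 (5 Gbit/s), both with 8b/10b encoding and before any protocol overhead. All function and constant names are illustrative.

```python
def raw_rate_bytes_per_s(channels, bits_per_sample, sample_rate_hz):
    """Aggregate ADC output rate, assuming tightly packed samples."""
    return channels * bits_per_sample * sample_rate_hz / 8

def line_rate_bytes_per_s(bit_rate_hz, lanes=1):
    """Nominal data rate of an 8b/10b-encoded serial link (no protocol overhead)."""
    return bit_rate_hz * 8 / 10 / 8 * lanes

# Assumed sample width: 12 bits per sample.
raw_40 = raw_rate_bytes_per_s(32, 12, 40e6)    # 1.92e9 bytes/s
raw_65 = raw_rate_bytes_per_s(32, 12, 65e6)    # 3.12e9 bytes/s
pcie_gen1_x4 = line_rate_bytes_per_s(2.5e9, lanes=4)   # 1.0e9 bytes/s
usb3 = line_rate_bytes_per_s(5e9)                      # 5.0e8 bytes/s

# Fraction of the time the ADCs could stream raw data continuously over PCIe.
pcie_duty_at_40 = pcie_gen1_x4 / raw_40
```

Under these assumptions the continuous raw output of 32 channels (about 1.9 GB/s at 40 MSPS) exceeds the nominal rate of either link. This suggests that raw RF streaming relies on acquiring a limited depth range per transmit event and buffering it in the DDR3 memory, while demodulation with decimation lowers the sustained rate.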

The ultrasound transmitter module (Fig. 2) uses 32 pulsers (STHV478, STMicroelectronics, USA) with integrated TX/RX switches, which saves PCB space. The TX channels are multiplexed to 128 transducers through high-voltage switches (MAX14802, Maxim Integrated, USA). An integrated DC/DC converter (AD35S200/12, Beta Dyne, USA) generates the high-voltage supply for the transducers.

III. GPU PROCESSING

In our study we developed an optimized SAFT imaging algorithm using OpenCL [5] and tested it on the embedded PC with an AMD Fusion APU (Accelerated Processing Unit) G-T56N processor clocked at 1.65 GHz. The AMD APU is a hybrid integrated CPU and GPU with shared system memory (Fig. 3).

The tight integration between CPU and GPU on the AMD Fusion platform provides a consistent OpenCL programming model for both processors and a very high inter-processor data exchange rate through the shared system memory. Compared with hybrid processing based on a separate CPU and GPU or DSP, the AMD Fusion APU ensures better integration and simplifies inter-processor communication. OpenCL provides an abstraction of the computing resources/processors, which enables the signal processing tasks to be distributed or even dynamically migrated between them. Common source code and run-time compilation for the CPU and GPU targets also greatly simplify algorithm development. All these features make the embedded GPU an interesting alternative to the multicore DSP.

Another important issue in discrete GPU solutions is the PCIe bandwidth, which limits the data exchange rate between the CPU and GPU. Novel architectures like the AMD Fusion, which replace the PCIe interface with internal high-speed buses, hold the promise of overcoming this bottleneck and thus improving the processing performance. A report by Daga et al. [6] showed that for some applications the data access cost can be reduced by as much as 6x in comparison to a discrete GPU system. According to the same research, application performance improved 3-fold for selected benchmarks.

Figure 3. The AMD Fusion APU system architecture.

We implemented and benchmarked two SAFT algorithms: STA (Synthetic Transmit Aperture) and PWI (Plane Wave Imaging). All implemented algorithms follow the recommendations for optimizing OpenCL code [7]. These recommendations concern using the smallest possible number of dimensions for the array of threads in each block and, where possible, using the native functions (e.g. hypot, mix). It is also important to minimize code branching and the use of conditional expressions.
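The core of STA reconstruction is a delay-and-sum over every transmit/receive element pair. The following is a minimal single-threaded sketch of that principle, not the OpenCL kernel used in this work: the function names, the nearest-sample rounding, the unit-impulse point target, and the geometry (4 elements, 0.3 mm pitch, 1540 m/s, 40 MSPS) are all assumptions made for illustration.

```python
import math

def element_distances(elem_x, x, z):
    """Distance from each array element (at depth 0) to the point (x, z)."""
    return [math.hypot(x - ex, z) for ex in elem_x]

def simulate_point_target(elem_x, target, c, fs, n_samples):
    """Synthetic STA data set: rf[tx][rx] holds a unit impulse at the echo time."""
    dist = element_distances(elem_x, *target)
    rf = [[[0.0] * n_samples for _ in elem_x] for _ in elem_x]
    for tx, d_tx in enumerate(dist):
        for rx, d_rx in enumerate(dist):
            rf[tx][rx][round((d_tx + d_rx) / c * fs)] = 1.0
    return rf

def sta_delay_and_sum(rf, elem_x, pixels, c, fs):
    """Sum, for each pixel, the samples at the two-way delay of every tx/rx pair."""
    image = []
    for x, z in pixels:
        dist = element_distances(elem_x, x, z)
        acc = 0.0
        for tx, d_tx in enumerate(dist):
            for rx, d_rx in enumerate(dist):
                idx = round((d_tx + d_rx) / c * fs)  # nearest-sample delay
                if idx < len(rf[tx][rx]):
                    acc += rf[tx][rx][idx]
        image.append(acc)
    return image

# Illustrative geometry (assumed values).
C = 1540.0                                   # speed of sound, m/s
FS = 40e6                                    # sampling rate, Hz
ELEM_X = [-0.45e-3, -0.15e-3, 0.15e-3, 0.45e-3]
TARGET = (0.0, 10e-3)
PIXELS = [TARGET, (0.0, 15e-3)]

rf = simulate_point_target(ELEM_X, TARGET, C, FS, n_samples=1024)
image = sta_delay_and_sum(rf, ELEM_X, PIXELS, C, FS)
```

At the target pixel all 16 element pairs add coherently, while the pixel 5 mm deeper receives no contribution. In a GPU version each pixel would typically map to one work-item; PWI follows the same pattern with the transmit distance replaced by the plane-wave path, (z cos θ + x sin θ) for steering angle θ.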

IV. RESULTS

The high-speed 12-bit LVDS deserializers were implemented and successfully fitted into the FPGA device. Moreover, a VHDL module for synchronizing the incoming ADC data and reconfiguring the deserializers was developed and fully tested. The ultrasound RF data can be demodulated in real time by the demodulator block or streamed directly to the GPU. The PCIe gen. 1 x4 Hard-IP native endpoint controller was implemented. Additionally, the DDR3 SDRAM memory controllers were constrained and correctly fitted into the device. A simple microprocessor executing a small set of commands was developed to allow flexible implementation of transmit-receive schemes.

The SAFT reconstruction time required to obtain an LRI (Low Resolution Image) of size 64x256 from a 32-channel I/Q signal does not exceed 50 ms on the embedded AMD APU. This provides an imaging frame rate of 20 Hz. The fastest image reconstruction, 32 ms per LRI at a resolution of 64x256, was obtained for the raw RF signal. Reconstruction from the original RF signals is faster because it does not require the phase correction needed for I/Q data. The benefit of using the I/Q data is a 2-4x reduction in the data bandwidth due to inherent decimation. The additional overhead of envelope detection and scan conversion for curved arrays, needed for B-mode display, was estimated at less than 5% of the total time. For different resolutions (64x256, 128x256, 128x512 pixels) and different transducer sizes (32, 64 elements) the reconstruction time scales linearly.
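The phase correction mentioned above arises because a demodulated echo arriving with delay τ carries a residual carrier phase of -2πf0τ, which must be undone before the channels are summed. The sketch below illustrates this with an idealized demodulator model; the function names, the 5 MHz carrier and the four delays are assumptions for illustration, not parameters of the described system.

```python
import cmath
import math

def iq_sample(envelope, f0, tau):
    """Baseband sample of an echo arriving with delay tau (ideal demodulator)."""
    return envelope * cmath.exp(-2j * math.pi * f0 * tau)

def das_iq(samples, f0, delays, phase_correct=True):
    """Sum per-channel I/Q samples, optionally restoring the carrier phase."""
    acc = 0j
    for s, tau in zip(samples, delays):
        if phase_correct:
            # Undo the residual phase left by mixing with the carrier.
            s *= cmath.exp(2j * math.pi * f0 * tau)
        acc += s
    return acc

# Illustrative values: delays differ by a quarter of the carrier period.
F0 = 5e6
DELAYS = [10.00e-6, 10.05e-6, 10.10e-6, 10.15e-6]
samples = [iq_sample(1.0, F0, tau) for tau in DELAYS]

coherent = das_iq(samples, F0, DELAYS, phase_correct=True)
uncorrected = das_iq(samples, F0, DELAYS, phase_correct=False)
```

With the correction the four unit echoes add to a magnitude of 4; without it, the quarter-period delay steps make them cancel. The extra complex multiplication per sample is the cost that RF-domain summation avoids.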

The results show that SAFT-based imaging at limited resolution is feasible on a low-power embedded GPU. The benefit of using a commercial off-the-shelf embedded PC is the ease of system upgrade.

V. CONCLUSIONS

We presented the design of a low-cost, mobile ultrasound platform with SAFT-based image reconstruction implemented on an embedded GPU. The compact 32-channel acquisition module with standard high-speed interfaces can be connected to a variety of embedded computers. System performance is easily scaled by changing the processing subsystem without any changes to the acquisition module. Direct access to the raw RF or I/Q data, combined with software processing, makes the system an open platform for the development of new ultrasound imaging algorithms.

In the near future all the described system components will be integrated into a single compact chassis. We are planning to offer the modules and the software to the research community. An implementation of other ultrasound processing algorithms (e.g. Doppler) in OpenCL is planned as well.

ACKNOWLEDGMENT

The project was funded by NCBiR within the LIDER programme 2010-2013: “Development of economic ultrasound platform”.

REFERENCES


