I've been playing with a parallella board. Here are my first impressions of a board that hopes to point the way for massively parallel computing.
In short, the parallella is a single board computer, much like the Raspberry Pi. It is similarly sized and has roughly the same IO set as the Pi.
The primary SoC is a Xilinx Zynq, a dual-core ARMv7 processor coupled to an FPGA, with a secondary Epiphany multi-core coprocessor alongside it. For the price and power draw, this is what sets it apart from the other single board computers currently on the market.
The CPU is a fairly mundane dual-core ARMv7 Cortex-A9. For comparison, the Raspi2 offers a quad-core Cortex-A7. The great thing about choosing an ARMv7 architecture is that there is a lot of distro support for this processor. Even Microsoft's flagship will work on this instruction set.
As a primary CPU, the ARMv7 is a low-power 32-bit RISC architecture with hardware floating-point instructions by default (unlike the ARMv6). The ARMv6 generation can typically be found in two forms: one using software-implemented floats (armel), and the other using the hardware float unit for significant speedups (armhf). Most distros only have the softfloat version for ARMv6, so a recompilation of Debian was required for the original Raspberry Pi.
The 64-bit ARMv8 processors, known as aarch64, are starting to become available and while some phones are transitioning to this platform, these 64-bit processors are likely to see massive adoption in the data-center.
Doing anything computationally expensive on the primary CPU of an ARMv7 is tedious and slow. Compiling a distro is likely to take hours, if not days, generate a lot of heat, and kill a nearby battery pack. Likewise, video decompression will quickly become a bottleneck.
The trick to modern mobile computing is to offload these hard compute tasks onto a dedicated GPU/DSP. The BCM2835 (BCM2836 on the Pi2) is mostly composed of GPU, as it is a purpose-built media applications processor. On the Raspberry Pi, if you have any floating-point sensitive computation, find a way to use the GPU: there's a massive performance boon to be gained.
The parallella has no dedicated GPU, instead there is a separate chip, the 16 (or 64) core epiphany coprocessor. The parallella's Zynq SoC also contains an FPGA, but more on that later.
A single epiphany core is computationally similar to the ARM. Both are 32-bit RISC cores capable of efficient floating-point and integer operations.
The ARM is capable of running a full Operating System, such as Linux, and can interface with the external world via the FPGA. Conversely, the epiphany has very limited external IO capabilities. In fact, any data flowing into or out of the coprocessor needs to be mediated by the ARM host.
An application running on the host (as root) needs to fully supervise any tasks. Resetting registers, loading binaries, setting any initial state, and reading final results are all controlled by applications running on the ARM.
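That host-side supervision maps onto Adapteva's eSDK e-hal API. Here is a minimal sketch of the lifecycle, assuming a 1x1 workgroup at core (0,0); the binary name `e_task.elf` and the result address `0x4000` are illustrative, and this obviously won't run without a board and the eSDK headers:

```c
/* Host-side supervision sketch using the eSDK e-hal API.
 * The binary name and result offset below are placeholders. */
#include <stdio.h>
#include <e-hal.h>

int main(void)
{
    e_platform_t platform;
    e_epiphany_t dev;
    int result = 0;

    e_init(NULL);                    /* bring up the HAL */
    e_reset_system();                /* reset the whole chip */
    e_get_platform_info(&platform);

    e_open(&dev, 0, 0, 1, 1);        /* open a 1x1 workgroup at (0,0) */

    /* load the ecore binary and start it immediately */
    e_load("e_task.elf", &dev, 0, 0, E_TRUE);

    /* ... wait for the core to finish, then read the result back
     * from a known address in that core's local memory ... */
    e_read(&dev, 0, 0, 0x4000, &result, sizeof(result));
    printf("result = %d\n", result);

    e_close(&dev);
    e_finalize();
    return 0;
}
```

Note that the host polls or reads memory to learn anything: there is no interrupt-driven callback path up to userspace in this simple model.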
The simplicity of the design means that the epiphany coprocessor is a very power efficient computing platform. Most of the heat generated by a parallella board comes from the SoC.
The BCM2835/6 is a pragmatic chip choice for the Raspberry Pi. For media processing there is a lot of support within the provided VideoCore libraries, and it's easy enough to find high-level SDL and OpenGL APIs. They can even work from Python apps.
Step outside of the API sandbox and you're stonewalled. While the GPU is technically documented, it is not a trivial platform to program in the absence of available free software compilers.
On the parallella, the epiphany chip is a surprisingly easy platform to target. Adapteva have open-sourced their entire toolchain and licensed it (for the most part) under the GPLv3.
The compiler is a modified gcc/binutils toolchain and you have access to most of the standard C libraries (and anything else you can compile into a static elf binary). For development ease, you can even locally compile epiphany code on the host ARM. It feels like ordinary C, but there is no OS underneath it on the coprocessor.
It turns out that, even without system calls in epiphany code, the coprocessor is still easy to debug. The host ARM has the ability to read and write memory segments local to a specific epiphany core. Most of the examples use this mechanism: the host ARM simply reaches into the epiphany's memory and printfs the values it finds there.
At some point, I think that I'll try implementing a ringbuffer to communicate between the host and the coprocessor, a task that I don't even know how to approach on the Raspberry Pi.
One of the hidden gems of the parallella is the presence of the FPGA. It is a boot-time programmable logic engine. Most of the marketing is silent about the capabilities of the Zynq. In the provided bitstreams, the FPGA is configured to drive the HDMI, GPIO pins and the host-side eLink communication between the ARM and epiphany.
It is possible, by swapping out the parallella.bit.bin file for another bitstream and rebooting, to customise the FPGA functionality. There are practical ideas about reducing power-consumption by removing the HDMI, or expanding the GPIO. If you know what you're doing, then it could even be used to implement cryptographic primitives in hardware.
Unfortunately, I'm not sure that much innovation can be expected in this arena. FPGA programming can get tedious and while the toolchain is freely available, it isn't necessarily Free Software.
On the parallella, it is possible to run the same code base compiled for either the ARM or the epiphany. I chose an integer-based prime counting problem (how many primes under 16 million, one million trials per ecore). I found a little under 2 million primes in 47 seconds. The equivalent single-threaded ARM binary found them in a little over 11 minutes.
I calculated this to be a 14x speedup (11 minutes is about 660 seconds, and 660/47 is roughly 14), but considering that this was an embarrassingly parallel problem on 16 cores, I feel slightly disappointed that the per-core performance of the epiphany chip is comparable to the Cortex.
Of course, the epiphany makes up for this in that many more cores can be packed into these processors, with a much lower energy requirement.
It will be interesting to see what is possible with a larger parallella cluster, using domestic power supplies.
The parallella was kickstarted back in 2012, a few months after the initial release of the Raspberry Pi. The Raspberry Pi 2, with its quad-core ARMv7, is a welcome update, but with the looming GA of the 64-core epiphany processor, the parallella could be a useful learning platform for truly parallel programming.
Personally, I can see the sun setting on this generation of single board PCs. The ARMv7 family is starting to be deprecated, and it's trivially easy to rent some time on multi-core, high clock-speed amd64 platforms in the cloud.
The benefit of the RPi2 and parallella will be in pioneering the way forward for multi-core algorithms and, if anything, in shrugging off the stigma that small is slow.