|
Message
From: Marko Mlinar<markom@o...>
Date: Tue Sep 7 11:17:41 CEST 2004
Subject: [oc] Winning with a reconfigurable computer
Bill,quite a discussion you have out here ;) I have spend quite some time on this subject and I am sure it will be one of the hottest research areas in the following decade.
First let me say I wrote several compilers from ANSI C to Verilog RTL, one of the compilers (the first one) is available also on Opencores.
The biggest problems as you already noted is of course memory bandwidth. More specifically the problem is to deliver the data on time -- more specifically the biggest problem here is latency. Therefore search for *only* bigger bandwidth RAM will not help.
Although it may seem at the first glance that C->RTL conversion is straightforward and all it requires is a lot of work, but this is not true. I have not been able to prove that solving this problem in finite time would require a machine computationally better than Turing, but that seems very likely, since the problem becomes very similar to code optimization.
However the problem can be non-optimally solved with transformations which maintain code equality like current compilers do. This is of course very hard with all the pointers in the C-code like you already mentioned. With simple transformations you don't get enough optimizations to compete with high frequency sequential CPUs.
Java and C# therefore seem much more suitable for conversion than full ANSI C compatibility. Or saying that C pointers cannot point to just any area, but just to specific arrays.
One of the biggest finding I had is that every loop that can be optimized (containing memory accesses) can be duplicated, where from the first loop you can calculate memory addresses and in the second you collect data. In the worst case your second loop is empty and you have no optimizations, but in most cases you can generate the addresses needed to cope with long RAM access latencies.
The next important finding is that it is quite impossible for a machine to change the architecture (usually needs to change whole algorithm) to be more suitable for HW and to save area.
best regards, Marko
On Tuesday 07 September 2004 10:31, Bill Cox wrote: On Tue, 2004-09-07 at 02:17, Richard Herveille wrote: > > Let's see if I get it right. Main issue here is getting random data fast > > enough into multiple execution units, or one big execution unit that > > works on a lot of data (what's the difference?), right? > > Right. > > The concept of an execution unit might not be quite right. Basically, > I'm talking about synthesizing the critical inner loops of an algorithm > directly to hardware, and executing a whole loop per clock cycle. > > So, the execution unit in this case is custom for a specific software > loop. This custom execution unit often needs several fields from > different classes, all in parallel. With several memory ports, we could > provide that data just as the loop's execution unit needs it. > > > How about using QDR-II, or DDR-II for that matter, memories? These are > > true SRAMs, produced in a 90nm technology. The QDR-II has separate read > > and write ports, uses a 250MHz double pumped (DDR) clock per port. This > > provides an astonishingly: > > - 1G read/write operations per sec > > - 18Gbit/sec bandwidth per port (36bit databus) > > - 2GByte/sec bandwidth per port (32bit databus) > > > > This all without any latency! Pipelined yes, latency no. > > Now these devices are available in 72Mbit (2M x 36bits) densities. Not > > directly enough for main memory, but it would make a hell of a cache to > > execute your while loops from. > > Best of all, both Xilinx and Altera claim they can handle these memories > > with their FPGAs. > > Ok, that's cool. While these memories seem on the small side, they're > large enough to do practical work on real EDA problems. > > For example, let's look at the placement problem. A big design might > have 1 million placeable instances. Each QDR-II memory could handle a > 32-bit field for any class that had at most 2M instances. > > Memory for instance ports would be a problem. We'd need memories with > several times the size of the instance memory to hold even a single > field for ports on instances. > > So, I'd guess that I'd need several memories of the 2Mx36 bit variety, > but also at least a couple that were several times deeper. We could use > the 2M by 36 bit memories as cache for the bigger arrays, and store most > data in a single large external memory. > > > Then there's another memory variant called QuadPorts. These memories have > > 4 independent ports. All running at 133MHz. This allows you to access the > > same shared memory from 4 units independantly. > > Unfortunately these memories are pretty small; 1Mbit max. But still, this > > might be a candidate for your cache. > > These don't appeal to me much. I'd rather go with true parallel > memories. > > Bill
|
 |