Message
From: Richard Herveille <richard@h...>
Date: Tue Sep 7 08:17:42 CEST 2004
Subject: [oc] Winning with a reconfigurable computer
Let's see if I get it right. Main issue here is getting random data fast enough into multiple execution units, or one big execution unit that works on a lot of data (what's the difference?), right?
How about using QDR-II (or DDR-II, for that matter) memories? These are true SRAMs, produced in a 90nm technology. The QDR-II has separate read and write ports, each using a 250MHz double-pumped (DDR) clock. This provides an astonishing:
- 1G read/write operations per sec
- 18Gbit/sec bandwidth per port (36bit databus)
- 2GByte/sec bandwidth per port (32bit databus)
This all without any latency! Pipelined yes, latency no. Now these devices are available in 72Mbit (2M x 36bits) densities. Not directly enough for main memory, but it would make a hell of a cache to execute your while loops from. Best of all, both Xilinx and Altera claim they can handle these memories with their FPGAs.
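For anyone who wants to check the numbers, here is a rough back-of-the-envelope sketch in Python. The clock, bus width and density come from the figures above; the reading that only 32 of the 36 bits carry payload (the rest being parity or similar) and all of the variable names are my assumptions, not anything from a datasheet.

```python
# Back-of-the-envelope check of the QDR-II figures quoted in this message.
CLOCK_HZ = 250_000_000      # per-port clock (from the message)
TRANSFERS_PER_CLOCK = 2     # double pumped (DDR)
BUS_BITS = 36               # full data bus width
DATA_BITS = 32              # assumption: 4 of the 36 bits are parity, not payload
PORTS = 2                   # one read port plus one write port
WORDS = 2 * 2**20           # 2M words (the 2M x 36 organisation)

# Transfers per second on a single port.
transfers_per_port = CLOCK_HZ * TRANSFERS_PER_CLOCK

# Combined read + write operations per second.
total_ops_per_sec = transfers_per_port * PORTS

# Raw and payload bandwidth per port.
raw_bits_per_sec = transfers_per_port * BUS_BITS
payload_bytes_per_sec = transfers_per_port * DATA_BITS // 8

# Capacity: raw bits, and payload bytes if only 32 bits per word are data.
capacity_mbit = WORDS * BUS_BITS // 2**20
payload_mbyte = WORDS * DATA_BITS // 8 // 2**20

print(f"operations/sec (read + write): {total_ops_per_sec:,}")
print(f"raw bandwidth per port:        {raw_bits_per_sec / 1e9:g} Gbit/s")
print(f"payload bandwidth per port:    {payload_bytes_per_sec / 1e9:g} GByte/s")
print(f"capacity:                      {capacity_mbit} Mbit ({payload_mbyte} MByte of payload)")
```

Under those assumptions a single 72Mbit device would hold about 8 MByte of cacheable data.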
Then there's another memory variant called QuadPorts. These memories have 4 independent ports, all running at 133MHz. This allows you to access the same shared memory from 4 units independently. Unfortunately these memories are pretty small; 1Mbit max. But still, this might be a candidate for your cache.
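To put the two candidates side by side, here is a small Python sketch. It assumes (my assumption, not a datasheet figure) that each QuadPort port completes one access per 133MHz clock; the `MemoryOption` class and its field names are purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryOption:
    """Hypothetical summary record for one cache memory candidate."""
    name: str
    ports: int
    accesses_per_port_hz: int   # accesses per second on a single port
    capacity_mbit: int

    @property
    def aggregate_accesses_hz(self) -> int:
        # Upper bound: every port busy on every cycle.
        return self.ports * self.accesses_per_port_hz


# QDR-II: one read and one write port, 250MHz double pumped.
QDR2 = MemoryOption("QDR-II", ports=2,
                    accesses_per_port_hz=500_000_000, capacity_mbit=72)

# QuadPort: 4 independent ports at 133MHz.
# Assumption: one access per clock per port (single data rate).
QUADPORT = MemoryOption("QuadPort", ports=4,
                        accesses_per_port_hz=133_000_000, capacity_mbit=1)

for option in (QDR2, QUADPORT):
    print(f"{option.name:9s} {option.ports} ports, "
          f"{option.aggregate_accesses_hz / 1e6:g}M accesses/s peak, "
          f"{option.capacity_mbit} Mbit")
```

The peak access rate favours QDR-II, but only the QuadPort lets 4 execution units each read and write the same memory without arbitration, so which one wins depends on how many units share the cache.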
Cheers, Richard
> -----Original Message-----
> From: cores-bounces@o...
> [mailto:cores-bounces@o...] On Behalf Of Bill Cox
> Sent: Tuesday, September 07, 2004 5:22 AM
> To: bporcella
> Cc: Discussion list about free open source IP cores
> Subject: Re: [oc] Winning with a reconfigurable computer
>
> On Mon, 2004-09-06 at 21:48, bporcella wrote:
> > At 12:16 PM 9/6/2004, you wrote:
> >
> > >Latency and cache misses are the key issues that keep my inner loops
> > >running slowly, from what I can see.
> >
> > Ok gotta admit thinking about this stuff is MUCH more fun than doing
> > real work.......
>
> It's Labor Day weekend! Who needs real work?
>
> > So suppose that we could convince ourselves that it is possible to say
> > quadruple DDR - cache throughput (relative to present state of the art
> > microprocessor) without (of course) compromising latency -- on a single
> > chip . i.e. we can create 4 instances of independent caches (32 -64
> > bits) that we can run at say 2GHz. What's next ???
> >
> > Well "layout" IS one of those interesting problems that have been seriously
> > studied over the last 50 years or so... If we could show a real 4X
> > speedup using 4 RISC type engines and a configurable logic
> > interconnect system we could possibly convince some hard hitting
> > venture capitalist ( Sequoia ? ) to drop a few million on further research.
>
> Actually, Xilinx has a venture fund. They also have all the
> really hard pieces to make something like this possibly work:
> a great fab relationship, money, true layout artists, not to
> mention an excellent FPGA fabric and most of the hard IP
> needed. I also have high respect for their management.
>
> They also probably have a bit more sense than to invest in a
> new CPU architecture... By now, everyone should understand
> that it's not how good your CPU is, it's what software you
> run, and who you are, and how much it all costs (this
> architecture doesn't feel particularly cheap).
>
> However, I love trying to convince myself that something
> could actually happen. I'd love to work on a project like
> this. In the meantime, I'll keep my day-job making one-mask
> structured ASICs better...
>
> > Some ideas......
> >
> > I suspect that the answer to the throughput question is already pretty
> > well understood by somebody -- so we should be able to discover it
> > with a reasonable level of research.
>
> Someone listening out there probably knows the answers: How
> fast can a Xilinx FPGA access external memory? How many
> ports can they support?
> How fast of a direct mapped cache can the internal memories make?
>
> > For the application speedup question we could put together a
> > simulation. The cache and processor are easy to simulate - If we could
> > come up with ANY interconnect system that would show a 4x speedup over
> > present state of the art we would have something.
> >
> >
> > Any Ideas on how to configure the interconnect system ???
>
> I'm pretty comfortable with interconnect. It'd be custom for
> each algorithm being sped up. If only inner loops are
> optimized, the routing to the SRAM buses should be fairly
> clean. Once the code for the memory access is synthesized,
> we should be able to throw the problem at the place and route
> tools and get fair results. It's not like we're trying to
> route data from any bus to any functional unit. There should
> only be a few destinations and sources per SRAM interface for
> a typical algorithm.
>
> Putting together some demonstrations showing speedups of
> inner loops should be doable. The hardware compiler is a big
> project. Building the prototype system isn't simple.
>
> I have to say that pulling off this project sounds
> far-fetched. Unless Xilinx or Altera wants bragging rights to
> the fastest general purpose computing machine, there probably
> won't be any money available to build it.
>
> Still, it's a three day weekend, and it's time to think about
> far-fetched ideas. How cool would it be to build the world's
> fastest single processor computer?
>
> Way cool... I think I'll go to bed and dream about it now.
>
> Thanks for the lively discussion!
>
> Bill