LOGIN   :::   RECOVER PASS   :::   GET ACCOUNT    
Browse
  • Projects
  • Code (CVS)
  • Forums
  • News
  • Articles
  • Polls
  •  
    OpenCores
  • FAQ
  • CVS HowTo
  • Mission
  • Media
  • Tools
  • Sponsors
  • Mirrors
  • Logos
  • Contact us
  •  
    Tools
  • Search
      
  • Download Cores (CVSGet)
  •  
    More
  • Wishbone
  • Perlilog
  • EDA tools
  • OpenTech CD
  •  
    Navigation: All forums > Cores > Message List > Message Post

    Message

    Reply | Reply all
    Date Prev | Date Next | Thread Prev | Thread Next Date Index | Thread Index

    From: Marko Mlinar<markom@o...>
    Date: Tue Sep 7 11:17:41 CEST 2004
    Subject: [oc] Winning with a reconfigurable computer
    Top
    Bill,

    quite a discussion you have out here ;)

    I have spend quite some time on this subject and I am sure it will be one of
    the hottest research areas in the following decade.

    First let me say I wrote several compilers from ANSI C to Verilog RTL, one of
    the compilers (the first one) is available also on Opencores.

    The biggest problems as you already noted is of course memory bandwidth. More
    specifically the problem is to deliver the data on time -- more specifically
    the biggest problem here is latency.
    Therefore search for *only* bigger bandwidth RAM will not help.

    Although it may seem at the first glance that C->RTL conversion is
    straightforward and all it requires is a lot of work, but this is not true.
    I have not been able to prove that solving this problem in finite time would
    require a machine computationally better than Turing, but that seems very
    likely, since the problem becomes very similar to code optimization.

    However the problem can be non-optimally solved with transformations which
    maintain code equality like current compilers do. This is of course very hard
    with all the pointers in the C-code like you already mentioned. With simple
    transformations you don't get enough optimizations to compete with high
    frequency sequential CPUs.

    Java and C# therefore seem much more suitable for conversion than full ANSI C
    compatibility. Or saying that C pointers cannot point to just any area, but
    just to specific arrays.

    One of the biggest finding I had is that every loop that can be optimized
    (containing memory accesses) can be duplicated, where from the first loop you
    can calculate memory addresses and in the second you collect data. In the
    worst case your second loop is empty and you have no optimizations, but in
    most cases you can generate the addresses needed to cope with long RAM access
    latencies.

    The next important finding is that it is quite impossible for a machine to
    change the architecture (usually needs to change whole algorithm) to be more
    suitable for HW and to save area.

    best regards,
    Marko

    On Tuesday 07 September 2004 10:31, Bill Cox wrote:
    On Tue, 2004-09-07 at 02:17, Richard Herveille wrote:
    > > Let's see if I get it right. Main issue here is getting random data fast
    > > enough into multiple execution units, or one big execution unit that
    > > works on a lot of data (what's the difference?), right?
    >
    > Right.
    >
    > The concept of an execution unit might not be quite right. Basically,
    > I'm talking about synthesizing the critical inner loops of an algorithm
    > directly to hardware, and executing a whole loop per clock cycle.
    >
    > So, the execution unit in this case is custom for a specific software
    > loop. This custom execution unit often needs several fields from
    > different classes, all in parallel. With several memory ports, we could
    > provide that data just as the loop's execution unit needs it.
    >
    > > How about using QDR-II, or DDR-II for that matter, memories? These are
    > > true SRAMs, produced in a 90nm technology. The QDR-II has separate read
    > > and write ports, uses a 250MHz double pumped (DDR) clock per port. This
    > > provides an astonishingly:
    > > - 1G read/write operations per sec
    > > - 18Gbit/sec bandwidth per port (36bit databus)
    > > - 2GByte/sec bandwidth per port (32bit databus)
    > >
    > > This all without any latency! Pipelined yes, latency no.
    > > Now these devices are available in 72Mbit (2M x 36bits) densities. Not
    > > directly enough for main memory, but it would make a hell of a cache to
    > > execute your while loops from.
    > > Best of all, both Xilinx and Altera claim they can handle these memories
    > > with their FPGAs.
    >
    > Ok, that's cool. While these memories seem on the small side, they're
    > large enough to do practical work on real EDA problems.
    >
    > For example, let's look at the placement problem. A big design might
    > have 1 million placeable instances. Each QDR-II memory could handle a
    > 32-bit field for any class that had at most 2M instances.
    >
    > Memory for instance ports would be a problem. We'd need memories with
    > several times the size of the instance memory to hold even a single
    > field for ports on instances.
    >
    > So, I'd guess that I'd need several memories of the 2Mx36 bit variety,
    > but also at least a couple that were several times deeper. We could use
    > the 2M by 36 bit memories as cache for the bigger arrays, and store most
    > data in a single large external memory.
    >
    > > Then there's another memory variant called QuadPorts. These memories have
    > > 4 independent ports. All running at 133MHz. This allows you to access the
    > > same shared memory from 4 units independantly.
    > > Unfortunately these memories are pretty small; 1Mbit max. But still, this
    > > might be a candidate for your cache.
    >
    > These don't appeal to me much. I'd rather go with true parallel
    > memories.
    >
    > Bill

     
    Copyright (c) 1999 OPENCORES.ORG. All rights reserved.