Message
From: Matjaz Breskvar<phoenix@o...>
Date: Sun May 8 21:46:22 CEST 2005
Subject: [openrisc] [or1ksim #85] [RFC] Overhoul memory code
* György 'nog' Jeney (nog@s...) wrote:
> Hi,
>
> This is quite a major rewrite of the memory code, making the implementation more
> linear, i.e. fewer conditionals, which also leads to faster execution:
>
> With the patch:
>
> real 0m50.808s
> user 0m39.720s
> sys 0m11.090s
>
> Without the patch:
>
> real 1m0.258s
> user 0m48.570s
> sys 0m11.690s
>
> I have changed the callback interface to provide a relative address. This is
> done to avoid having to pass a struct mem pointer to the simmem_{read,write}*
> functions.
>
> The memory controller was a real hack. It thought that it owned every single
> peripheral and it was impossible to have more than one memory controller. With
> this patch, each peripheral that wants to be under the control of a memory
> controller will have to register with the appropriate one. I have also
> decoupled it from the innards of abstract.c, i.e. it doesn't mess with the
> dev_memarea list. In 2 places it still digs into the dev_memarea structures
> that are registered with it, which still needs to be fixed, but it's still
> cleaner now.
>
> With this patch it's simple to fix the debug unit to provide the weird behaviour
> of being able to load a program to flash (even when it's unwriteable). I've
> kept this out of this already very big patch. I would argue that this is wrong
> behaviour but people requested it...
>
> There is some stuff I'm not really clear on so I would really like to hear
> comments/opinions on them:
>
> Granularity
> -----------
>
> I have changed the register_memory_area function (renamed to reg_mem_area) so
> that you can specify 32, 16 and 8 bit read/write functions but only the memory
> peripheral uses them now (makes it much faster). With this, only one of the
> read/write delays gets added to the cycle counter, unlike previously where if an
> 8-bit write happened to a 32-bit memory area the expense of a write plus a read
> was added to the cycle counter. I don't know what the memory controller would
> do if, say, the data width of the memory sitting behind it was not that of the
> access. The question: What does the hardware do in this case?
if you do an l.sb (store byte), from an architectural point of view only the
write delay should be incurred (though in principle l.sb could have a
different delay than l.sw). what actually happens depends on the memory
controller you are using.
the most common thing with 'normal' memories is that 8-bit, 16-bit and 32-bit
writes all take the same amount of time (the same goes for reads). this
assumes you have 32-bit wide access to the memory. if not, the delays will be
longer, depending on the mc width...
> Also, how do peripherals not under the control of a memory controller (uart,
> ethernet, etc) react to writing less data to a register than the width of the
> register? As an example what happens when you do a 16-bit write to a 32-bit
> wide register in the ethernet?
this is core dependent. some devices might not even allow accesses narrower
than 32 bits (or some other access widths). by not allowing i mean sending
back a bus error, which raises the 0x200 exception.
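A minimal sketch of the two behaviours described above, assuming a big-endian 32-bit register and a device that declares which access widths it accepts. Every name here (`periph_reg`, `reg_write`, `access_cycles`, `ACCESS_BUS_ERROR`) is hypothetical and not taken from or1ksim; the delay model of one delay per bus-wide beat is likewise only an assumption about how a narrower memory controller might behave.

```c
#include <assert.h>
#include <stdint.h>

/* All names below are hypothetical; none are taken from or1ksim. */
enum access_result { ACCESS_OK, ACCESS_BUS_ERROR };

struct periph_reg {
    uint32_t value;    /* register contents */
    unsigned allowed;  /* OR of the access widths (in bytes) the device accepts */
};

/* Write `width` bytes at byte `offset` inside a 32-bit register.
 * Offset 0 is the most significant byte (big-endian, as on OpenRISC).
 * An unsupported or misaligned access reports a bus error, which the
 * caller would turn into the 0x200 exception. */
static enum access_result reg_write(struct periph_reg *r, unsigned offset,
                                    unsigned width, uint32_t data)
{
    if (width != 1 && width != 2 && width != 4)
        return ACCESS_BUS_ERROR;
    if (!(r->allowed & width))
        return ACCESS_BUS_ERROR;
    if (offset % width != 0 || offset + width > 4)
        return ACCESS_BUS_ERROR;

    uint32_t mask = (width == 4) ? 0xffffffffu : ((1u << (width * 8)) - 1u);
    unsigned shift = (4 - offset - width) * 8;
    r->value = (r->value & ~(mask << shift)) | ((data & mask) << shift);
    return ACCESS_OK;
}

/* Assumed delay model: an access no wider than the bus costs one delay;
 * a wider access is split into several bus-wide beats. */
static unsigned access_cycles(unsigned width, unsigned bus_width, unsigned delay)
{
    assert(bus_width > 0);
    unsigned beats = (width + bus_width - 1) / bus_width;
    return beats * delay;
}
```

With a device that only sets `allowed = 4`, a 16-bit write is rejected with `ACCESS_BUS_ERROR`. With a 32-bit bus, `access_cycles` returns the same cost for 1, 2 and 4 byte accesses.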
> What happens when you do, say a 16-bit read to a 32-bit wide register when the
> address of the read is not aligned on a 32-bit boundary?
device dependent; the developer could have decided to implement 16-bit access
or not (so you could get the requested data back or a bus error). if the
16-bit (or 8-bit) access is supported, the most natural assumption is that it
takes 1 'write/read delay' for any of 8, 16 or 32 bit accesses.
> The code in this patch mimics that of the previous implementation, so everything
> should still work the same as it did before. Which I'm not really sure is
> always right.
>
> Endianness
> ----------
>
> This is a stupid topic that drives me insane. or1ksim, in effect, emulates a
> little-endian core (and a little endian bus, etc.) on a little endian machine and
> a big-endian core on a big-endian machine (I'm not even convinced that the sim
> will work on a big endian machine, though). This causes some real headaches
> when implementing a peripheral that is registered with more than one granularity.
> I'm also stumped as to how I should implement proper support for IDE in which it
> is defined that all data transfers are little-endian. When the ata peripheral
> is compiled on a little endian machine all data returned from it is (naturally)
> little endian but when running linux compiled with the ide driver it decides to
> byteswap all data received from the ata, because openrisc is big-endian but
> since we have a little endian simulator the sim now tries to do arithmetic with
> big endian numbers, which fails in some interesting ways.
>
> Linux and its userspace stuff just happen to work since when loading the linux
> elf binary everything is byteswapped into little-endian but what would happen if
> you would feed some code to linux running inside the sim from an external
> source? Say, an nfs mounted on a remote machine? I'm pretty sure that it would
> fail to execute since it will not be byteswapped. I would test it myself but
> since I don't have a LAN and I couldn't figure out how to get the ethernet
> peripheral to work if I don't have a proper ethernet running.
>
> The only way that I can think of to solve this is to byteswap all data moving
> between the registers and the memory, but then the peripherals will have to know
> that the data that they get(/must return) is(/must be) in big-endian order.
> Meaning that it will have to potentially be swapped twice. Not a good thing for
> execution speed. This will also involve byte-swapping all the instruction
> reads. It may be possible to avoid doing this with clever tricks in or32.c,
> though I haven't looked at it. It may also be needed to keep the data in the
> registers in big-endian order, again I don't know. Any comments/ideas/help/
> corrections in this regard will be greatly appreciated.
Am not a big fan of bit/byte swapping either... The way i see it is that
internally to the simulator we can have the data in any bit/byte order we
choose (maybe depending on the host); the only place we *can* have
transformations is on the borders (ie input/output). for io we can have a
bunch of read/write functions that while reading/writing the data can
(depending on the host,...) do proper transformations.
This is not entirely true though. The problems may also (as you pointed out)
be at peripheral borders if peripherals want/have data in some special
order. in this case some more swapping might be necessary.... the way i see
this could be achieved is:
- openrisc functions in big-endian format (architecturally). the
representation of this in the simulator running on a real machine may or may
not be the same. all we care about is that the end result is identical (we get
the 'results' on outputs (which in some way depend on inputs, so it should
only be necessary to change these two)). we also care about performance, so
we'd like to achieve the correct result using as few transformations as
possible
- the added complication of peripherals that architecturally define the
order of data on their input and output to be different from that of an
openrisc should be avoidable with some transformations that depend
on the architectural order and its representation on the host machine. i'm
not saying that this is trivial though...
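As a rough illustration of "transformations only on the borders", the sketch below assembles values byte by byte, so the result does not depend on the host byte order. The helper names (`load_be32`, `store_be32`, `load_le16`, `store_le16`) are made up for this example and are not or1ksim functions. The big-endian pair stands in for the openrisc side, the little-endian pair for a peripheral such as ATA whose data is defined to be little-endian.

```c
#include <stdint.h>

/* Hypothetical border helpers; not part of or1ksim. */

/* Read a 32-bit big-endian value (architectural OpenRISC order). */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Write a 32-bit value out in big-endian order. */
static void store_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Read a 16-bit little-endian value (e.g. a peripheral defined as LE). */
static uint16_t load_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}

/* Write a 16-bit value out in little-endian order. */
static void store_le16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
}
```

Inside the simulator the values stay in whatever order the host prefers; only code that touches an external byte stream (program loading, device data) would go through helpers of this kind. Whether this is fast enough for instruction fetches is a separate question.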
> ChangeLog:
> * Separate out the code used for handling the memory peripheral to
> peripheral/memory.c
> * Mostly decouple the memory controller from the internals of the memory
> handling.
> * Rewrite memory handling to be more linear and thus much faster.
>
> nog.
> diff -upr --unidirectional-new-file ./cache/dcache_model.c /home/nog/or1ksim-ac/cache/dcache_model.c
> --- ./cache/dcache_model.c 2005-03-31 16:39:39.000000000 +0200
> +++ /home/nog/or1ksim-ac/cache/dcache_model.c 2005-04-29 19:06:49.000000000 +0200
> @@ -79,7 +79,7 @@ void dc_info()
> - refill cache line
> */
>
> -uint32_t dc_simulate_read(oraddr_t dataaddr, int width)
> +uint32_t dc_simulate_read(oraddr_t dataaddr, oraddr_t virt_addr, int width)
> {
> int set, way = -1;
> int i;
> @@ -90,25 +90,13 @@ uint32_t dc_simulate_read(oraddr_t dataa
> (!testsprbits(SPR_SR, SPR_SR_DCE)) ||
> data_ci) {
> if (width == 4)
> - tmp = evalsim_mem32(dataaddr);
> + tmp = evalsim_mem32(dataaddr, virt_addr);
> else if (width == 2)
> - tmp = evalsim_mem16(dataaddr);
> + tmp = evalsim_mem16(dataaddr, virt_addr);
> else if (width == 1)
> - tmp = evalsim_mem8(dataaddr);
> + tmp = evalsim_mem8(dataaddr, virt_addr);
this looks suspicious to me. the data cache should not need to know about
virtual addresses (since the MMU is not between the cache and memory). why did
you add this parameter to dc_simulate_read and also to evalsim_mem32? (this was
just a function to read from memory, ie, always using the physical address)
best regards,
p.