Welcome to Emulationworld



Subject: Does Generator have a 68000 Dynarec?
Posted by: finaldave
Posted on: 12/10/03 03:35 PM




Have been looking at Generator, and it seems a great project - wish I could have done an emulator project at Uni, it would have been wicked!

But... does Generator have a 68000 Dynarec? I'm confused...

Links on the Internet say it does, and even the start of the University writeup seems to indicate it does, or at least was intended to, yet I can't find the x86 Back-End in the source code.

Am I going crazy - is it there?


As far as I can gather it has a code emitter similar to the other 68000 emulators except it produces C code fragments for each opcode.

These code fragments are then executed based on opcodes with run-time compiled IL structures (I think...).

Yet the PDF talks about flag optimisation, and an ARM backend!

How can you have an x86 or ARM backend if it is producing C code fragments on an opcode basis and then executing them one at a time based on a run-time intermediate language?

I mean I can't understand how it could work in C otherwise.... You can't afaik have a C Dynarec. The nearest I saw was a dynarec with some 'naked' C functions memcpyed together, but I imagine there would be issues even so. I couldn't see that in Generator....


It doesn't seem like it's a dynarec in the sense that I expected... am I wrong though, is there a Dynarec and I missed it?


You learn something old everyday...



Subject: Re: Does Generator have a 68000 Dynarec?
Posted by: Bart T.
Posted on: 12/10/03 03:48 PM



> As far as I can gather it has a code emitter similar to the other 68000
> emulators except it produces C code fragments for each opcode.
>
> These code fragments are then executed based on opcodes with run-time compiled
> IL structures (I think...).
>
> Yet the PDF talks about flag optimisation, and an ARM backend!
>
> How can you have an x86 or ARM backend if it is producing C code fragments on an
> opcode basis and then executing them one at a time based on a run-time
> intemediate language?

I thought the way Generator worked was that it emulated instructions with C functions (generated at compile time) invoked by a sequence of CALL instructions (in the native processor language.)

It could probably do flag optimization by having different functions for each instruction which calculate different combinations of flags (or none at all.)
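
A rough C sketch of that idea, under loud assumptions: this is not Generator's code, the CPU state, the opcode and every name below are made up, and a portable array of function pointers stands in for the native CALL sequence. It only shows two variants of one instruction (with and without flag calculation) and a block that picks between them per instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified CPU state: two data registers, Z and N flags. */
typedef struct {
    uint32_t d[2];
    int z, n;
} Cpu;

typedef void (*Handler)(Cpu *);

/* Two variants of the same made-up "ADD D1,D0" opcode. */
static void add_d1_d0(Cpu *c)
{
    c->d[0] += c->d[1];              /* result only, flags left alone */
}

static void add_d1_d0_flags(Cpu *c)
{
    c->d[0] += c->d[1];
    c->z = (c->d[0] == 0);           /* zero flag */
    c->n = (int)((c->d[0] >> 31) & 1); /* negative flag */
}

/* Run a pre-built sequence of handlers, one call per guest instruction. */
static void run(Cpu *c, const Handler *code, size_t count)
{
    assert(c != NULL && code != NULL);
    for (size_t i = 0; i < count; i++)
        code[i](c);
}

/* Example block: only the last ADD has its flags consumed, so only it pays for them. */
static const Handler example_block[] = { add_d1_d0, add_d1_d0, add_d1_d0_flags };
```

Calling run(&cpu, example_block, 3) executes the three adds; the first two skip the flag work entirely.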

I'm not really sure, though. Email James and see what he says :) And then post the answer here ;)


----
Bart


Subject: Ahhh! That's genius!
Posted by: finaldave
Posted on: 12/11/03 05:03 AM



> > As far as I can gather it has a code emitter similar to the other 68000
> > emulators except it produces C code fragments for each opcode.
> >
> > These code fragments are then executed based on opcodes with run-time compiled
> > IL structures (I think...).
> >
> > Yet the PDF talks about flag optimisation, and an ARM backend!
> >
> > How can you have an x86 or ARM backend if it is producing C code fragments on
> an
> > opcode basis and then executing them one at a time based on a run-time
> > intemediate language?
>
> I thought the way Generator worked was that it emulated instructions with C
> functions (generated at compile time) invoked by a sequence of CALL instructions
> (in the native processor language.)
>
> It could probably do flag optimization by having different functions for each
> instruction which calculate different combinations of flags (or none at all.)
>
> I'm not really sure, though. Email James and see what he says :) And then post
> the answer here ;)


No I think you are right - that's brilliant!!, because then you could basically run through the entire code and sort-of static recompile it at run-time but BEFORE emulation, by filling a big array of function pointers. You'd need 2x the memory of the rom BUT you could also do dead flag stuff as well (which is one of the two big overheads in C emulators)!

e.g.
move.w (A4),d0
moveq #$00,d0
movea.l d0,a6
move a6,usp
moveq #$17,d1

->
// Pregenerate these at compile time
static void MoveW_An_Dm(void)
{
  D[m]=A[n];
}

static void MoveW_An_Dm_flags(void)
{
  D[m]=A[n];
  // ... plus work out the flags here: C, N, V, Z, X
}

... etc for all opcodes

// Build this array at run-time from the rom data (or maybe just from basic blocks to save memory)
void (*handlers[])(void)=
{
  MoveW_An_Dm, // move.w (A4),d0 flags unused
  MoveW_Imm_Dm, // moveq #$00,d0 flags unused
  MoveW_Dn_Am, // movea.l d0,a6 flags unused
  MoveW_Am_USP, // move a6,usp flags unused
  MoveW_Imm_Dm_flags, // moveq #$17,d1 flags might be used
};

GENIUS! This could potentially run faster than a standard assembler core! What do you think?
Anyone else done anything like this before (apart from Mr. Ponder?)


The best thing about this type of emu is that you get a potential Static recompiler for free...!

You just do this:
switch (pc)
{
case 0x206:
  MoveW_An_Dm(); // move.w (A4),d0 flags unused
  MoveW_Imm_Dm(); // moveq #$00,d0 flags unused
  MoveW_Dn_Am(); // movea.l d0,a6 flags unused
  MoveW_Am_USP(); // move a6,usp flags unused
  MoveW_Imm_Dm_flags(); // moveq #$17,d1 flags might be used
  break;
...
}

And then the compiler will optimise the crap out of the individual opcodes... I'm pretty excited about this :))


You learn something old everyday...



Subject: Re: Ahhh! That's genius!
Posted by: smf
Posted on: 12/12/03 07:56 AM



I believe that is the way the i960 is emulated in virtua & nebulam2. You could do it dynamically too, although it might not be worth it unless the decode was particularly heavy.

smf





Subject: Re: Ahhh! That's genius!
Posted by: ElSemi
Posted on: 12/13/03 07:31 AM



> I believe that is the way the i960 is emulated in virtua & nebulam2. You could
> do it dynamically too, although it might not be worth it unless the decode was
> particularly heavy.
>
> smf
>

It seems a bit different from what I use in nebulam2. It seems that it generates a func for each instruction variation (register and flags), while for m2 I only generate 1 variation and use a struct to hold the pointers to the src and dst operands. That requires extra work on decoding; I don't know if it's faster or slower.

I should work some day on making the DSP alu ops not to compute the flags always, actually the code to compute flags is twice as big as the code to just compute the result (that will be a very nice speedup).


Subject: Re: Ahhh! That's genius!
Posted by: finaldave
Posted on: 12/13/03 05:19 PM



> > I believe that is the way the i960 is emulated in virtua & nebulam2. You could
> > do it dynamically too, although it might not be worth it unless the decode was
> > particularly heavy.
> >
> > smf
> >
>
> it seems a bit different from what I use in nebulam2, it seems that it generates
> a func for each instruction variation (register and flags), while for m2 I only
> generate 1 variation, and use a struct to hold the pointers to the src and dst
> operands, that requires extra work on decoding, I don't know if it's faster or
> slower.
>
> I should work some day on making the DSP alu ops not to compute the flags
> always, actually the code to compute flags is twice as big as the code to just
> compute the result (that will be a very nice speedup).
>

I've found that often too - things like half-carry on z80 for example can take so long to calculate when you don't have an opcode like lahf
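
To make the cost concrete, here is a small C sketch of an 8-bit add that derives half-carry and carry by hand. The helper name and the FLAG_* macros are made up for illustration; the bit positions are the usual Z80 F register ones (H in bit 4, C in bit 0), and the other flags are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FLAG_H 0x10  /* half-carry: bit 4 of the Z80 F register */
#define FLAG_C 0x01  /* carry: bit 0 */

/* 8-bit ADD that also works out H and C without any help from host flags. */
static uint8_t add8_with_flags(uint8_t a, uint8_t b, uint8_t *flags)
{
    assert(flags != NULL);
    unsigned sum = (unsigned)a + b;
    uint8_t f = 0;
    /* A carry out of bit 3 shows up as bit 4 of a ^ b ^ sum. */
    if ((a ^ b ^ sum) & 0x10) f |= FLAG_H;
    /* A carry out of bit 7 shows up as bit 8 of the wide sum. */
    if (sum & 0x100)          f |= FLAG_C;
    *flags = f;
    return (uint8_t)sum;
}
```

Even for these two flags there is more flag code than arithmetic, which is the overhead a no-flags handler would avoid.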

Even in assembler though it can be a nasty overhead. Dead flag elimination could go quite far...!
Especially if you replaced the opcode directly with the handler

I'm wondering if (because C makes quite a good job on the register copies but a bad job on the flag calculation) dead flag analysis could bring C emulators very close to Asm ones and possibly even past them!

(Of course the ideal would be a Asm emulator with dead flag analysis as well, but who can be bothered!)




You learn something old everyday...



Subject: Re: Ahhh! That's genius!
Posted by: Bart T.
Posted on: 12/13/03 08:31 PM



> I'm wondering if (because C makes quite a good job on the register copies but a
> bad job on the flag calculation) dead flag analysis could bring C emulators very
> close to Asm ones and possibly even past them!

You could always write a better emulator in assembly which uses the same trick.

What's nice about this approach is that you don't have to worry about instruction decoding anymore. This can normally take a considerable amount of time and memory. If you look at the fast assembly 68K emulators, they generate many different permutations of each instruction simply to avoid having to extract register fields.

Starscream and Turbo68K had mostly the same approach and generated a lot of instructions. A68K is faster and I think that's mostly due to the fact that it generates much less code.

I'm not sure exactly what Generator does, but you can easily generate different permutations for instructions based on flags and then have a fetch/decode stage which outputs code like this:

push reg1
push reg2
call _OpXXXX
add esp,8
push reg1
push reg2
call _OpYYYY
... etc. ...

Or, to make it truly portable, have an array of structs of a type like this:

struct instruction
{
void (*Handler)(struct instruction *);
UINT32 field1, field2, field3; /* etc. */
};

And simply iterate through this. I think this is what ElSemi was talking about (correct me if I'm wrong.)
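
A minimal runnable sketch of that struct-array approach, assuming a toy register file and two invented opcodes (none of these names come from Generator, nebulam2 or any real emulator). The operand fields are extracted once when the array is built, so execution is just a walk over the array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t regs[8];   /* hypothetical guest register file */

struct instruction {
    void (*handler)(const struct instruction *);
    uint32_t dst, src, imm;  /* pre-extracted operand fields */
};

static void op_move_imm(const struct instruction *i) { regs[i->dst] = i->imm; }
static void op_add_reg(const struct instruction *i)  { regs[i->dst] += regs[i->src]; }

/* Decoding happened when the array was built; running it needs no field extraction. */
static void run_block(const struct instruction *code, size_t count)
{
    assert(code != NULL);
    for (size_t n = 0; n < count; n++)
        code[n].handler(&code[n]);
}

static const struct instruction demo[] = {
    { op_move_imm, 0, 0, 5 },   /* r0 = 5       */
    { op_move_imm, 1, 0, 7 },   /* r1 = 7       */
    { op_add_reg,  0, 1, 0 },   /* r0 = r0 + r1 */
};
```

The trade-off is visible in the struct itself: each predecoded entry is much larger than the guest opcode it replaces.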


It's a good approach, especially when you don't have to keep re-fetching code (if it's mostly in ROM, for instance, or in protected non-writeable memory pages) and you want to avoid having to write a full-blown dynarec.

I don't know if it would be worth it for emulating a 68K on X86, though. Flag calculation is simple (LAHF, SETO AL), and decoding can be pretty optimal, as A68K shows.

It's probably great for something like ARM or a low-end MIPS or PPC.


----
Bart


Subject: Re: Ahhh! That's genius!
Posted by: ElSemi
Posted on: 12/14/03 05:18 AM



> > I'm wondering if (because C makes quite a good job on the register copies but
> a
> > bad job on the flag calculation) dead flag analysis could bring C emulators
> very
> > close to Asm ones and possibly even past them!
>
> You could always write a better emulator in assembly which uses the same trick.
>
> What's nice about this approach is that you don't have to worry about
> instruction decoding anymore. This can normally take a considerable amount of
> time and memory. If you look at the fast assembly 68K emulators, they generate
> many different permutations of each instruction simply to avoid having to
> extract register fields.
>
> Starscream and Turbo68K had mostly the same approach and generated a lot of
> instructions. A68K is faster and I think that's mostly due to the fact that it
> generates much less code.
>
> I'm not sure exactly what Generator does, but you can easily generate different
> permutations for instructions based on flags and then have a fetch/decode stage
> which outputs code like this:
>
> push reg1
> push reg2
> call _OpXXXX
> add esp,4
> push reg1
> push reg2
> call _OpYYYY
> ... etc. ...
>
> Or, to make it truly portable, have an array of structs of a type like this:
>
> struct instruction
> {
> void (*Handler)(struct instruction *);
> UINT32 field1, field2, field3, etc. ;
> };
>
> And simply iterate through this. I think this is what ElSemi was talking about
> (correct me if I'm wrong.)
>
>
> It's a good approach, especially when you don't have to keep re-fetching code
> (if it's mostly in ROM, for instance, or in protected non-writeable memory
> pages) and you want to avoid having to write a full-blown dynarec.
>
> I don't know if it would be worth it for emulating a 68K on X86, though. Flag
> calculation is simple (LAHF, SETO AL), and decoding can be pretty optimal, as
> A68K shows.
>
> It's probably great for something like ARM or a low-end MIPS or PPC.
>
>
> ----
> Bart
>

Actually, instead of predecoding to the register value, I predecode to a pointer to the register.
I have 2 cases:

For the CPUs that don't have implicit side effect on register access (like i960 and SHARC) I store the address
of the register to read/write, with a structure like this:

struct _Inst
{
void (*Op)();
UINT32 *src1,*src2,*dst;

...
};

this way, an opcode like ADD can be written as:

void ADD()
{
*dst=*src1+*src2;
}


in the case of cpu with register side effects, like the MB86235 used in model2c, I store handlers to perform
the register reading. Also DSPs can perform more than 1 op in the same inst in parallel, so all regs must be
fetched before any write. I can't use direct register write in the handlers because the "next" microop in the
same opcode might be reading the reg we have already written, and it will read the changed value,
not the original one, thus making the execution sequential, and not parallel.

struct _TGPInst
{
void (*OP)();

UINT32 (*GetASrc1)();
UINT32 (*GetASrc2)();
void (*SetADst)(UINT32);
UINT32 (*AOP)();

.....
};

OP controls the execution order of the ops, so a typical ADDER+MUL op is

void ALU2()
{
UINT32 ADD=thisinst->AOP();
UINT32 MUL=thisinst->MOP();

thisinst->SetADst(ADD);
thisinst->SetMDst(MUL);
}

UINT32 AADD()
{
UINT32 dst=thisinst->GetASrc1()+thisinst->GetASrc2();
AFLAGS(dst); //This just computes simple flags like ADDERZero and ADDERSign.
return dst;
}

also side effects must be performed in the correct priority order. Emulating DSPs is a real pain :).
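
The read-everything-then-write-everything rule can be shown with a tiny C sketch. This is an illustration only, not model2 emulator code: the register file, the MoveSlot type and both functions are invented, and each parallel slot is reduced to a plain register move.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t reg[4];  /* hypothetical register file */

/* One slot of a parallel instruction: dst = src (a plain move, for brevity). */
typedef struct { int dst, src; } MoveSlot;

/* Execute two slots as if simultaneous: fetch all sources, then do all writes. */
static void exec_parallel(MoveSlot a, MoveSlot b)
{
    assert(a.dst != b.dst);  /* two writers to one register are not modelled here */
    uint32_t va = reg[a.src];
    uint32_t vb = reg[b.src];
    reg[a.dst] = va;
    reg[b.dst] = vb;
}

/* Naive version for comparison: the second slot sees the first slot's write. */
static void exec_sequential(MoveSlot a, MoveSlot b)
{
    reg[a.dst] = reg[a.src];
    reg[b.dst] = reg[b.src];
}
```

With slots {0,1} and {1,0}, exec_parallel swaps the two registers, while exec_sequential leaves both holding the same value.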

About the flags thing: I think I could generate flags and noflags variations of each alu operation, use the
noflags one by default and then, on finding a conditional inst, trace back to the first instruction that affects
that flag and replace its handler with the flags version. I haven't thought about it much, but it might have
some trouble with flags at the end of subs and such. I'll try when I have some free time.



Subject: Re: Ahhh! That's genius!
Posted by: Riff
Posted on: 12/15/03 00:38 AM



> > Or, to make it truly portable, have an array of structs of a type like this:
> >
> > struct instruction
> > {
> > void (*Handler)(struct instruction *);
> > UINT32 field1, field2, field3, etc. ;
> > };
> >
> > And simply iterate through this. I think this is what ElSemi was talking about
> > (correct me if I'm wrong.)

This is more or less how instructions are cached in Nuance except that I store indices into a handler array instead of storing a direct pointer. Each instruction cache entry contains up to five intermediate instructions which I call Nuances. NuonEm uses a similar structure but the fields are explicitly named and referenced for a fixed execution order. Nuance reschedules instructions within each packet and uses a generic for-loop to call the handlers. This method is well-suited both for caching pre-decoded instruction packets that are to be interpreted, and for use as an intermediate code representation to supply to compiler routines.

> > It's a good approach, especially when you don't have to keep re-fetching code
> > (if it's mostly in ROM, for instance, or in protected non-writeable memory
> > pages) and you want to avoid having to write a full-blown dynarec.

I assume that by fetching, you really mean decoding. The amount of data fetched will usually be greater in the cached representation than it is in the original encoding. Indeed that is the whole point, to trade size for speed.

I'd argue that it's a good solution in any situation where decoding carries a moderate cost or greater unless there are memory limitations on the host platform. Memory is cheap and plentiful. Decoding is often quite expensive. This is particularly true for variable-length instruction encodings and VLIW streams where you don't know where the instruction packet ends until you fully decode it.

Besides, the generic array structure is suitable for both interpretation and for doing analysis and code generation. For simple CPUs, it might be a waste of memory, but it still has a good chance of being faster if a given code sequence is frequently executed.

> >
> > I don't know if it would be worth it for emulating a 68K on X86, though. Flag
> > calculation is simple (LAHF, SETO AL), and decoding can be pretty optimal, as
> > A68K shows.

The only problem is that this works only if you program the instruction handlers in assembly, and only if you have native flags and flag extraction instructions. The most expensive common flag to calculate is overflow, and without native flag calculation the only choice is to compute it manually. The overflow flag is almost never examined, so it's not needed in the majority of cases.
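
For reference, a manual overflow calculation for a 32-bit add looks roughly like this in portable C. The function name is made up; the bit trick itself is the standard one: signed overflow happened if both operands have the same sign and the result's sign differs from it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if a + b overflows as a signed 32-bit add, 0 otherwise.
   The wrapped sum is stored through 'result'. */
static int add32_overflow(uint32_t a, uint32_t b, uint32_t *result)
{
    assert(result != NULL);
    uint32_t r = a + b;   /* unsigned add wraps, which is well defined in C */
    *result = r;
    /* ~(a ^ b): sign bits of a and b agree; (a ^ r): sign of the result differs from a. */
    return (int)((uint32_t)(~(a ^ b) & (a ^ r)) >> 31);
}
```

That is several extra operations on top of the add itself, for a flag that is rarely tested.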

For an interpreter, I would use an all-or-nothing approach. Do the flag analysis and select between handlers which calculate all flags and which calculate no flags. This significantly reduces the number of handler routines you will have to generate.

For a recompiler, I don't see the point in not doing dead flag elimination. You're going to emit machine code anyway and not only will you save time on calculating flags, you also eliminate dependencies on the result of the operation which may significantly reduce code time. At the very least, you will be improving the use-efficiency of your emulator code cache and the instruction cache of the system you are running on.

> Also DSPs can perform more than 1 op in the same inst in
> parallel, so all regs must be
> fetched before any write. I can't use direct register write in the handlers
> because the "next" microop in the
> same opcode might be reading the reg we have already written, and it will read
> the changed value,
> not the original one, thus making the execution sequential, and not parallel.

This is how the Nuon Aries3 processor works. It can execute up to seven instructions in parallel using five resource units: ALU, ECU, RCU, MUL and MEM. Instead of using pointers directly, I store pointers to the various register banks in addition to the standard register index values. Almost all of the units can operate on four-register vectors and I didn't want to have a separate operand fetch stage so I use an all-or-nothing approach. In almost all cases, the instructions can be reordered to eliminate any operand dependencies. I do full blown dependency tracking in the decode stage and choose an optimal reordering in the scheduling phase. If there is a dependency, I copy *all* possible input registers to temporaries (a hefty penalty), but parallel instructions rarely contain operand dependencies so usually the instructions can be arranged to eliminate the dependencies completely. I'm not so sure now that this is a big win over having a separate operand fetch stage, but it's ultimately desirable for compiling phases as it will reduce code size as well as reduce register spilling, or at least eliminate unnecessary register saves.

>
> about the flags thing, I think I could generate the flags and noflags variation
> of each alu operation, using the
> noflags by default and then when finding a conditional inst, trace back to the
> first op instruction that affects
> that flag and replacing the handler by the flags version. I haven't thought on
> it too much, but it might have
> some trouble with flags at the end of subs and such. I'll try when I have some
> free time.

That doesn't quite work because non-conditionals might use flags as an input. You have the right idea in that you can force synchronization at branches and can work backwards, but you have to do complete dependency analysis, keeping track of which registers/flags are outputs of instruction N and which registers/flags are input dependencies of instructions N+1,...N+M. If the mask value at instruction N shows a given output flag to be a future input dependency then that instruction needs to update the flag. The simplest synchronization method is to force dependencies for all flags at branch points (as well as for instructions that modify all flags or where you don't know if they were modified) but the best method is going to differ depending on what kind of program flow you allow. If you wanted extremely optimal reduction you would have to follow both paths at each branch to determine the smallest set of flags required to be synchronized.

A backwards one-pass method works for both basic blocks and superblocks where a taken branch forces a return to the dispatch loop. If you are going to allow backwards branching in your emitted code, it will work if you force flag updates at branch points or if you follow both paths to make sure all flags are synchronized. I imagine it would work for forward branches in a similar manner.
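
A minimal sketch of such a backwards pass over one basic block, in C. Everything here is an assumption for illustration (the flag bit assignments, the InsnFlags record, the function name); it is not taken from Nuance. Each instruction records which flags it reads and which it would normally write, and the pass marks only the writes that some later instruction, or the block exit, still depends on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { F_C = 1, F_V = 2, F_Z = 4, F_N = 8, F_X = 16, F_ALL = 31 };

typedef struct {
    uint8_t reads;    /* flags this instruction consumes */
    uint8_t writes;   /* flags this instruction would normally define */
    uint8_t needed;   /* output: subset of 'writes' that must really be computed */
} InsnFlags;

/* One backward pass over a basic block. 'live_out' is the set of flags assumed
   live at the block exit; F_ALL is the conservative choice at a branch point. */
static void mark_needed_flags(InsnFlags *block, size_t count, uint8_t live_out)
{
    assert(block != NULL || count == 0);
    uint8_t live = live_out;
    for (size_t i = count; i-- > 0; ) {
        /* Only writes that something later still wants are worth computing. */
        block[i].needed = block[i].writes & live;
        /* Flags written here are satisfied; flags read here become live above. */
        live = (uint8_t)((live & ~block[i].writes) | block[i].reads);
    }
}
```

For a block of two plain adds followed by an add-with-extend style instruction that reads X, with only Z live at the exit, the pass asks the first add for no flags, the second for X only, and the last for Z only. The middle case is exactly the non-conditional flag consumer that a search triggered only by conditional instructions would miss.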


Subject: Re: Ahhh! That's genius!
Posted by: Bart T.
Posted on: 12/15/03 01:17 AM



> > > It's a good approach, especially when you don't have to keep re-fetching
> code
> > > (if it's mostly in ROM, for instance, or in protected non-writeable memory
> > > pages) and you want to avoid having to write a full-blown dynarec.
>
> I assume that by fetching, you really mean decoding.

Both. If it's in RAM, you have to keep track of whether it's been overwritten, which can introduce significant overhead.

This sort of interpreter is essentially what's known as a "threaded interpreter." They have some of the same drawbacks as a dynarec.

> > > I don't know if it would be worth it for emulating a 68K on X86, though.
> Flag
> > > calculation is simple (LAHF, SETO AL), and decoding can be pretty optimal,
> as
> > > A68K shows.
>
> The only problem is that works only if you program the instruction handlers in
> assembly and its only going to work if you have native flags and flag extraction
> instructions.

Right, that's exactly what I was saying. If it's possible to do it good and fast in assembly, and you don't care too much about portability, go for it!

> For an interpreter, I would use an all-or-nothing approach. Do the flag
> analysis and select between handlers which calculate all flags and which
> calculate no flags. This significantly reduces the number of handler routines
> you will have to generate.

Agreed. But there may be a handful of instructions that often require a particular flag to be calculated, so it might be beneficial to occasionally generate that one extra handler.

> This is how the Nuon Aries3 processor works.

Speaking of which, is there a programmer's manual for this processor? I'd love to add it to my little collection of CPU/DSP documentation :)


----
Bart


Subject: Re: Ahhh! That's genius!
Posted by: Riff
Posted on: 12/15/03 02:27 PM



> Both. If it's in RAM, you have to keep track of whether it's been overwritten,
> which can introduce significant overhead.

I think my approach to the problem would be to partition memory space into equal sized regions and keep a bitmap. For every memory write I would set the bit associated with the region containing the memory address. Then in the dispatch loop, the code would clear the cache if the region to be executed has been written and the bitmap would also be cleared. Writes to the current execution region would also require code to do an early exit back to the dispatch loop.
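
Roughly, in C, and with loud assumptions: the 1 MiB address space, the 4 KiB region size and all names are invented for the sketch, and for simplicity it clears only the bit of the region being checked rather than the whole bitmap. The early exit for writes into the region currently executing is not shown.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_SIZE     (1u << 20)   /* assumed 1 MiB guest address space */
#define REGION_SHIFT 12           /* assumed 4 KiB regions */
#define NUM_REGIONS  (MEM_SIZE >> REGION_SHIFT)

static uint8_t guest_mem[MEM_SIZE];
static uint8_t dirty[NUM_REGIONS / 8];   /* one bit per region */

/* Every guest store goes through here and marks its region as written. */
static void write8(uint32_t addr, uint8_t value)
{
    assert(addr < MEM_SIZE);
    guest_mem[addr] = value;
    uint32_t region = addr >> REGION_SHIFT;
    dirty[region >> 3] |= (uint8_t)(1u << (region & 7));
}

/* Called by the dispatch loop before executing at 'pc'. Returns 1 if the cached
   decode for that region must be thrown away, and clears the region's bit. */
static int region_needs_flush(uint32_t pc)
{
    assert(pc < MEM_SIZE);
    uint32_t region = pc >> REGION_SHIFT;
    uint8_t mask = (uint8_t)(1u << (region & 7));
    int was_dirty = (dirty[region >> 3] & mask) != 0;
    dirty[region >> 3] &= (uint8_t)~mask;
    return was_dirty;
}
```

The per-write cost is a shift, a mask and an OR; the region size decides how much cached code a single stray write throws away.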

For Nuance, I didn't really have to worry about this. None of the games use self-modifying code. The main processor is cached so the cache only needs to be flushed when the appropriate BIOS call is made. The three other processors are not cached but local memory is split into IRAM and DTRAM and the only way to access external memory is through DMA. The only check I make is to clear the cache for DMA reads from external memory to local IRAM. I assume that any code in the DTRAM area will be loaded by the main MPE via BIOS. This handles all existing situations so far. The instruction encodings are convoluted enough that self-modifying code would probably be more expensive than non-modifying code in terms of both speed and space so I'm not really concerned about supporting it.

> > This is how the Nuon Aries3 processor works.
>
> Speaking of which, is there a programmer's manual for this processor? I'd love
> to add it to my little collection of CPU/DSP documentation :)
>

The public version of the official documentation set is in the SDK, which you can get from http://www.nuon-dome.com in the downloads section. It's very detailed but there are some errors, mostly typos, except in the case of the BCLR description, where the real-world flag behavior is different from what is shown in the architecture document. It also omits a lot of *cough*nuances*cough*, particularly involving pixel DMA for Z-buffered formats. Most importantly, the architecture document does not explain the decoding scheme and completely omits instruction encodings. For these, you have to go to my instruction encoding document, which is on http://emuforge.org in the files section of the forums.


Subject: Re: Ahhh! That's genius!
Posted by: finaldave
Posted on: 12/15/03 03:48 PM



> > Both. If it's in RAM, you have to keep track of whether it's been overwritten,
> > which can introduce significant overhead.

I'm thinking mainly of ARM, MIPS portable machines here by the way - x86 of course for 68000 is done and dusted since Pentium 2s. With an ARM and only about 8Meg of memory (or maybe even less on the GP32?), you have to be more careful - I don't think the structures for each opcode (dynamically or statically 'recompiled') can be too big.

I'm tempted by the idea of filling in the address of the handler for each opcode however adding arguments seems to be throwing away too much ram... I'm not sure it could be afforded on a portable machine.

How about passing in a pointer to a previous opcode 'instance'...

like this:
struct Opcode
{
void (*execute)(int currentPc); // Handler for executing the opcode, either the flags version or not
char ra,rb; // Registers to use
};

And then you allocate a new one for each unique opcode (e.g. move d0,d2 gets a new Opcode structure, but following move d0,d2 opcodes point to the previous one)

Hmmm, not even convinced myself here! Because you need a unique current Program Counter for each one.

What I'd ideally like I guess is
struct Opcode
{
unsigned int pc; // The PC where this opcode is
void (*execute)(); // Pointer to correct handler for executing the opcode, either the flags version or not

// Do you think it is really worth decoding 68000 source and dest register here?
//I mean isn't it just a shift and AND? Maybe worth including the opcode number itself instead
};

Am I worrying a bit too much here... I mean I guess that's only roughly the same size as the code in the original 68000 Program anyway (as long as you only recompile real opcode starts)? Do you think it would be roughly the same?

By the way... was there a name for this thing? It seems halfway between an interpreting emulator and a recompiler... Generator calls it a Dynarec, but it doesn't quite seem to me that it is because it doesn't emit real code, only IL code.

If it doesn't have a name, shall we give it a name? ;)
Intepretarec or InteRec or something... err no maybe not.....
Opcode Representation, Opcode Preparse, or something


>
> I think my approach to the problem would be to partition memory space into equal
> sized regions and keep a bitmap. For every memory write I would set the bit
> associated with the region containing the memory address. Then in the dispatch
> loop, the code would clear the cache if the region to be executed has been
> written and the bitmap would also be cleared. Writes to the current execution
> region would also require code to do an early exit back to the dispatch loop.
>
> For Nuance, I didn't really have to worry about this. None of the games use
> self-modifying code. The main processor is cached so the cache only needs to be
> flushed when the appropriate BIOS call is made. The three other processors are
> not cached but local memory is split into IRAM and DTRAM and the only way to
> access external memory is through DMA. The only check I make is to clear the
> cache for DMA reads from external memory to local IRAM. I assume that any code
> in the DTRAM area will be loaded by the main MPE via BIOS. This handles all
> existing situations so far. The instruction encodings are convoluted enough
> that self-modifying code would probably be more expensive than non-modifying
> code in both terms of speed and space so I'm not really concerned about
> supporting it.
>
> > > This is how the Nuon Aries3 processor works.
> >
> > Speaking of which, is there a programmer's manual for this processor? I'd love
> > to add it to my little collection of CPU/DSP documentation :)
> >
>
> The public version of the official documentation set is in the SDK which you can
> get from http://www.nuon-dome.com in the downloads section. [...] For these, you
> have to go to my instruction encoding document which is on http://emuforge.org in the
> files section of the forums.
>


You learn something old everyday...



Subject: Re: Ahhh! That's genius!
Posted by: Bart T.
Posted on: 12/15/03 04:30 PM



> I'm thinking mainly of ARM, MIPS portable machines here by the way - x86 of
> course for 68000 is done and dusted since Pentium 2s.

68K on a slow MIPS would suck. On ARM, it probably would, too, but I don't know how much because I don't know much about ARM's subtleties off the top of my head.

X86 is nice because of the almost 1:1 flag mapping. What about ARM? ARM can automatically shift operands and supports conditional execution -- would that help in flag calculation and/or on-the-fly instruction decoding?

I don't think anyone's even tried writing a hand-tuned 68K interpreter for ARM in assembly.

> like this:
> struct Opcode
> {
> void (*execute)(int currentPc); // Handler for executing the opcode, either
> the flags version or not
> char ra,rb; // Registers to use
> };
>
> And then you allocate a new one for each unique opcode (e.g. move d0,d2 gets a
> new Opcode structure, but following move d0,d2 opcodes point to the previous
> one)

Wouldn't that imply that you'd have to translate an instruction sequence into an array of pointers to Opcode structs? The idea you have, if I understand correctly, is that a lot of the same opcode structs would be used.

But this adds a level of indirection and might even increase memory usage (because, realistically, how often is the same instance of a single instruction used?)

> By the way... was there a name for this thing? It seems halfway between an
> interpreting emulator and a recompiler...

I think the term "threaded interpreter" still applies here. Or "pre-decoded interpreter", if you want -- that sounds descriptive.

>Generator calls it a Dynarec, but it
> doesn't quite seem to me that it is because it doesn't emit real code, only IL
> code.

It could generate real code to do the CALLs and parameter pushing if it wanted to.


----
Bart


Subject: Re: Ahhh! That's genius!
Posted by: Bart T.
Posted on: 12/16/03 03:20 AM



> The public version of the official documentation set is in the SDK which you can
> get from http://www.nuon-dome.com in the downloads section.

> For these, you have to go to my
> instruction encoding document which is on http://emuforge.org in the
> files section of the forums.

Thanks! I've grabbed them both.


----
Bart


Subject: Re: Ahhh! That's genius!
Posted by: finaldave
Posted on: 12/16/03 05:16 AM



> > I'm thinking mainly of ARM, MIPS portable machines here by the way - x86 of
> > course for 68000 is done and dusted since Pentium 2s.
>
> 68K on a slow MIPS would suck. On ARM, it probably would, too, but I don't know
> how much because I don't know much about ARM's subtleties off the top of my
> head.
>
> X86 is nice because of the almost 1:1 flag mapping. What about ARM? ARM can
> automatically shift operands and supports conditional execution -- would that
> help in flag calculation and/or on-the-fly instruction decoding?
>
> I don't think anyone's even tried writing a hand-tuned 68K interpreter for ARM
> in assembly.

Actually I'm writing one at the moment called Cyclone, it's roughly 90% done.
IIRC it seems to run 68000 code at about 40-70MHz on a 205MHz Pocket PC without graphics, but a lot less with. Also it's for 32-bit ARM mode only, not Thumb sadly, because I just ran out of registers, and there's lots of handy conditional execution stuff you can't use on Thumb.

Thing is, you have to wonder whether a threaded interpreter would have been more sensible; after all, ARM might only last a couple more years with the PSP on the horizon! Ah well, I could just finish one and then do the other with the experience learned,
e.g. never utter the words "Ah I'm sure no-one will ever use the fact that add.b 0x40,0x40 sets the negative bit!" :P In fact when writing an emulator you should never ever utter the words "Ah I'm sure no-one will ever use..." because rest assured someone ALWAYS uses it!
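
To make the add.b case concrete, here is a small sketch of byte-sized ADD flag calculation. The `Flags` struct and the `add_b` name are made up for this example and are not Cyclone's code; the flag rules are the standard 68000 ones for ADD.

```c
#include <stdint.h>

/* 68000 condition codes as separate 0/1 fields (hypothetical layout). */
typedef struct { uint8_t x, n, z, v, c; } Flags;

/* Byte-sized ADD: returns the 8-bit result and fills in all five flags. */
uint8_t add_b(uint8_t src, uint8_t dst, Flags *f)
{
    unsigned sum = (unsigned)src + dst;
    uint8_t  res = (uint8_t)sum;

    f->c = (sum >> 8) & 1;                         /* carry out of bit 7 */
    f->x = f->c;                                   /* X copies C on ADD */
    f->z = (res == 0);
    f->n = res >> 7;                               /* sign bit of the result */
    f->v = (((src ^ res) & (dst ^ res)) >> 7) & 1; /* operands share a sign, result differs */
    return res;
}
```

With 0x40 + 0x40 the result is 0x80, so N and V are set while C, X and Z are clear. This is exactly the kind of edge case that is easy to assume nobody relies on.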



You learn something old everyday...



SubjectRe: Ahhh! That's genius! new Reply to this message
Posted byBart T.
Posted on12/16/03 01:04 PM



> Actually I'm writing one at the moment called Cyclone, it's roughly 90% done.
> Iirc it seems to run 68000 code at about 40-70Mhz on a 205Mhz Pocket PC without
> graphics but a lot less with. Also it's for 32-bit only, not

That sounds pretty good. Do any Genesis emulators (like Generator ported to ARM) run at full speed on a 200MHz ARM? If not, you should take another stab at writing a Genesis emulator ;)


----
Bart


SubjectRe: Ahhh! That's genius! new Reply to this message
Posted byfinaldave
Posted on12/16/03 03:40 PM



> > Actually I'm writing one at the moment called Cyclone, it's roughly 90% done.
> > Iirc it seems to run 68000 code at about 40-70Mhz on a 205Mhz Pocket PC
> without
> > graphics but a lot less with. Also it's for 32-bit only, not
>
> That sounds pretty good. Do any Genesis emulators (like Generator ported to ARM)
> run at full speed on a 200MHz ARM?

Generator (aka PocketGenesis) runs at about 50% or so, however the port is quite unfriendly to use (for example you have to hard reset the Pocket PC to get out of it).

>If not, you should take another stab at
> writing a Genesis emulator ;)

Yes am doing, at least a minimal Z80-less one for now to test out the 68000 - and I'll tell you it's a lot easier when you go straight for the line engine from day one, instead of retrofitting once you encounter raster effects. At the moment no games are playable :(
Revenge of Shinobi plays the intro but the demo shows just repeated tiles in a scrolling screen
Strider plays the intro and the game rotates the palette but doesn't scroll :(
Mickey Mouse, Ghouls and Ghosts, and Sonic 1 play the intro but go blank for the game.
Rolling Thunder 2 plays part of the intro. Most other stuff is just blank screen at the moment, but then again the Megadrive emu is quite minimal at the moment, I was trying to concentrate on getting the 68000 perfect first.
Runs full speed on a frameskip of about 0-1. But this is without Z80 or sound. Bit of a drop-off from the 40-70MHz I measured a couple of months ago, disappointingly... I guess graphics is expensive on Pocket PC.
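
On the line-engine point above, a minimal sketch of what a per-scanline frame loop might look like. The `Machine` struct, the callback names and the stub functions are hypothetical, and the constants (262 lines per NTSC frame, 224 visible, roughly 488 68000 cycles per line) are approximations assumed for the example rather than values from PicoDrive.

```c
#include <stddef.h>

/* Approximate NTSC timing, assumed for this sketch. */
enum { LINES_PER_FRAME = 262, VISIBLE_LINES = 224, CYCLES_PER_LINE = 488 };

typedef struct Machine {
    void (*cpu_run)(struct Machine *m, int cycles);  /* run the 68000 for a cycle budget */
    void (*draw_line)(struct Machine *m, int line);  /* render one scanline from current video state */
    long cycles_total;
    int  lines_drawn;
} Machine;

/* Stubs standing in for the real CPU core and renderer. */
void stub_cpu_run(Machine *m, int cycles) { m->cycles_total += cycles; }
void stub_draw_line(Machine *m, int line) { (void)line; m->lines_drawn++; }

/* One frame: the CPU always runs line by line, so mid-frame register
   writes (raster effects) take effect on the lines drawn after them.
   Frameskip only skips the drawing, never the CPU time. */
void run_frame(Machine *m, int skip_render)
{
    for (int line = 0; line < LINES_PER_FRAME; line++) {
        m->cpu_run(m, CYCLES_PER_LINE);
        if (line < VISIBLE_LINES && !skip_render)
            m->draw_line(m, line);
    }
}
```

Because the CPU and the renderer interleave per line from the start, raster effects need no special handling later on.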
Any ideas for names? I'm calling it PicoDrive at the moment (where the library is called 'Pico' and the functions are labelled 'Pico*'), but I'm not too fond of that now.
Was thinking about NanoDrive or TeraDrive or GigaDrive or something - the reason I picked PicoDrive was to get across the idea of it being diddy and portable ;)



You learn something old everyday...



SubjectRe: Ahhh! That's genius! new Reply to this message
Posted bysmf
Posted on12/17/03 06:33 AM



> I guess graphics is expensive on Pocket PC.

Probably, but making sure you use the correct bitmap depth so it doesn't have to do any conversion may help. What are you using to display graphics?

smf





SubjectRe: Ahhh! That's genius! new Reply to this message
Posted byfinaldave
Posted on12/17/03 06:39 AM



> > I guess graphics is expensive on Pocket PC.
>
> Probably, but making sure you use the correct bitmap depth so it doesn't have to
> do any conversion may help. What are you using to display graphics?
>
> smf
>

GAPI - yeah I know PocketHAL is a bit quicker. I can always plug it in later if needed...

Rendering in 565. I'll do some profiling at some point...

But the main focus for me is getting a quick 68000 emu on a portable machine, since that's a bit unexplored at the moment.

You learn something old everyday...



SubjectRe: Ahhh! That's genius! new Reply to this message
Posted bysmf
Posted on12/17/03 09:57 AM



IIRC the older Pocket PCs were 12-bit colour.

smf





SubjectRe: Ahhh! That's genius! new Reply to this message
Posted byfinaldave
Posted on12/17/03 12:39 PM



> IIRC the older pocket pc's were 12bit colour.
>
> smf
>

Yup - I have one! It's so buggered it won't operate on batteries for more than a fraction of a second. So chances of me playing Revenge of Shinobi on the toilet still limited till I get a new one!

Luckily the bits are in the right places (e.g. RRRRRGGGGGGBBBBB versus RRRR-GGGG--BBBB-), so you only need to write 565 and you support 99% of Pocket PCs out there (apart from 3130s).
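
A small sketch of why that works, assuming the two bit layouts exactly as written above. The function and macro names are invented for the example.

```c
#include <stdint.h>

/* Pack 8-bit channels into RRRRRGGGGGGBBBBB. */
uint16_t pack565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Bits a 12-bit panel would look at: RRRR-GGGG--BBBB-. */
#define MASK_444_IN_565 0xF79Eu

/* The 4-bit channels a 12-bit display would see in a 565 pixel. */
void unpack444(uint16_t px, uint8_t *r, uint8_t *g, uint8_t *b)
{
    *r = (px >> 12) & 0xF;   /* top 4 of the 5 red bits   */
    *g = (px >> 7)  & 0xF;   /* top 4 of the 6 green bits */
    *b = (px >> 1)  & 0xF;   /* top 4 of the 5 blue bits  */
}
```

Since the 12-bit fields are just the most significant bits of the 565 fields, a renderer that writes 565 gets the right colours on both kinds of display without a conversion pass.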

Someone had their thinking cap on when they designed that!

By the way, I saw a Nomad the other day, someone in the office brought one in - they are really big chunky devices: more like a toaster than an Atari Lynx! Still, great to see one!


You learn something old everyday...



SubjectRe: Ahhh! That's genius! new Reply to this message
Posted bysmf
Posted on12/17/03 02:59 PM



The computer exchange retro shop in london had stacks of nomads the last time I looked. It's always worth popping in there if you're down that way.

smf





SubjectRe: Ahhh! That's genius! new Reply to this message
Posted byKayamon
Posted on12/18/03 08:56 AM



> >Generator calls it a Dynarec, but it
> > doesn't quite seem to me that it is because it doesn't emit real code, only IL
> > code.
>
> It could generate real code to do the CALLs and parameter pushing if it wanted
> to.

For the record, this is in fact what it does. For ARM, it generates a set of CALL instructions for most opcodes, but I think the MOVE opcode is emitted directly into the code.

For non-ARM, it just has a list of function pointers which get called from a C loop.

It's been a while since I looked at this, so I might be wrong though...
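
As an illustration only (this is not Generator's code), the "CALL" on ARM would be a BL instruction. A sketch of emitting one BL per pre-decoded opcode might look like this, with the function names and the address arguments being assumptions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode an ARM-mode "BL target" for an instruction placed at address `at`. */
uint32_t arm_bl(uint32_t at, uint32_t target)
{
    int32_t offset = (int32_t)(target - (at + 8));      /* PC reads as instruction address + 8 */
    assert((offset & 3) == 0);                          /* must be word aligned */
    assert(offset >= -(1 << 25) && offset < (1 << 25)); /* signed 24-bit word offset */
    return 0xEB000000u | (((uint32_t)offset >> 2) & 0x00FFFFFFu);
}

/* Emit one BL per handler address; returns the number of words written. */
size_t emit_block(uint32_t *out, uint32_t out_addr,
                  const uint32_t *handlers, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = arm_bl(out_addr + 4 * (uint32_t)i, handlers[i]);
    return n;
}
```

A real emitter would also have to save and restore the link register around the block, since each BL overwrites it.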


SubjectRe: Ahhh! That's genius! new Reply to this message
Posted byfinaldave
Posted on12/18/03 09:36 AM



> > >Generator calls it a Dynarec, but it
> > > doesn't quite seem to me that it is because it doesn't emit real code, only
> IL
> > > code.
> >
> > It could generate real code to do the CALLs and parameter pushing if it wanted
> > to.
>
> For the record, this is in fact what it does. For ARM, it generates a set of
> CALL instructions for most opcodes, but I think the MOVE opcode is emitted
> directly into the code.

Ah - good idea, since a lot of 68K instructions are moves, which would make Generator a hybrid dynarec/threaded interpreter.

Brilliant - I did remember noticing that PocketGenesis ran a bit faster than I would have expected from a simple Musashi port.



You learn something old everyday...



SubjectRe: Ahhh! That's genius! Reply to this message
Posted bysmf
Posted on12/19/03 04:33 AM



I imagine it can't inline all moves: if you can't easily work out the address it's reading from or writing to, then you'll need to do some extra checks (but I guess you get that problem with all dynarecs, unless you can have your own MMU setup). You could probably do a lot of inlining if you knew that certain flags weren't used, etc. Working that out takes time though.
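
One common way to work out which flags are unused is a backward pass over a block of instructions. The sketch below assumes each pre-decoded instruction already carries "reads" and "writes" flag masks. The `Insn` struct and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { F_C = 1, F_V = 2, F_Z = 4, F_N = 8, F_X = 16, F_ALL = 31 };

typedef struct {
    uint8_t flags_read;     /* flags this instruction consumes */
    uint8_t flags_written;  /* flags this instruction would set */
    uint8_t flags_needed;   /* output: subset of flags_written that is live */
} Insn;

/* Walk the block backwards, tracking which flags some later
   instruction still reads before they are overwritten. */
void mark_live_flags(Insn *block, size_t n)
{
    uint8_t live = F_ALL;   /* conservatively assume all flags are read after the block */
    for (size_t i = n; i-- > 0; ) {
        block[i].flags_needed = block[i].flags_written & live;
        live = (uint8_t)((live & ~block[i].flags_written) | block[i].flags_read);
    }
}
```

For an ADD followed by a MOVE and then a BNE, the pass reports that the ADD only needs to produce X, because the MOVE overwrites N, Z, V and C before anything reads them. The cost is one extra pass per block at translation time.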

smf




