Data Pointer (aka TOC) Immediate Loading

See https://bugs.libre-soc.org/show_bug.cgi?id=286

The concept and convention of a Table of Contents in RISC is a common one, to deal with the fsct that immediates range is short and very inefficient to get a 64 bit immediate (6 instructions in v3.0B). The TOC allows a short-range immediate to access full 64 bit data via LDs.

However with a huge percentage of typical instructions being TOC style LDs (typically 6%) it makes them a high priority target for reducing executable size and I-Cache pressure. What if there were a way to indicate that a given use of an immediate, in any instruction, was to be a micro-coded implicit TOC Load? Typical code:

addi  r2, r2, TOC.lopart
addis r9, TOC.highpart(r2)
ld r10, imm(r9) # getting the TOC, still not done yet!
lwz r9, 4(r10) # finally get actual data 4 bytes into array

The first three instructions are setup to establish the desired address with the start of the dara structure. What if, instead, the immediate indicated (through a special encoding, taking up some small part of its range) that those three previous instructions were implicit and micro-coded?

lwz r9, {TOC+4}(r0)

Behind the scenes, the first ld (estsblishing the entry pointed to via the TOC) occurs automatically. However this is only one instruction saved. What if large immediates were stored at data pointed to via the TOC, as well? Take the following code (a common pattern for 64 bit immediates):

addi r9, r0, NNNN
addis r9, r9, NNNN
sldi r9, r9, 32
addi r9, r9, NNNN
addis r9, r9, NNNN
cmpl r5, r9 # actual operation, test r5

What instead if this could be replaced by:

cmpi r5, {TOC+8}

where again, behind the scenes, a hidden micro-coded LD occurs at an address 8(TOC) to be loaded into the immediate operand, as if it were possible to have a full 64 bit operand in the cmpli instruction?

This could hypothetically be encoded with existing v3.0B instructions, loading the 64 bit immediate into a temporary register, followed by using cmp rather than cmpi:

ld r9, 8(r10) # r10 loaded from TOC
cmp r5, r9

However the very fact that it requires the extra LD instruction (explicitly, rather than implicitly micro-coded) tells us that there is still a benefit to this approach. Additionally: one less GPR is required.