Bug 741 - bitmanip ALU implementation
Summary: bitmanip ALU implementation
Status: IN_PROGRESS
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code (show other bugs)
Version: unspecified
Hardware: PC Linux
: --- enhancement
Assignee: Jacob Lifshay
URL:
Depends on: 745 755 782 757
Blocks: 690
  Show dependency treegraph
 
Reported: 2021-11-05 23:16 GMT by Jacob Lifshay
Modified: 2022-03-25 03:33 GMT (History)
3 users (show)

See Also:
NLnet milestone: NLnet.2021.02A.052.CryptoRouter
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation: 772
child tasks for budget allocation: 745 755 782 852
The table of payments (in EUR) for this task; TOML format:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jacob Lifshay 2021-11-05 23:16:34 GMT
ALU implementation for bit manipulation instructions:
https://libre-soc.org/openpower/sv/bitmanip/

https://lists.libre-soc.org/pipermail/libre-soc-dev/2021-November/004078.html
Comment 1 Jacob Lifshay 2021-11-05 23:17:46 GMT
added ternaryi to the decoder, I probably missed some stuff when I added the new forms, fields, etc.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=a17a252e474d5d5bf34026c25a19682e3f2015c3
Comment 2 Luke Kenneth Casson Leighton 2021-11-06 02:49:40 GMT
no looks good, remember ternary is 4in RT RA RB RC ternaryi 3in RA RB RT
so yes, separate pipeline (all pipelines are created/grouped based on register
profile)

usual trick following how
RB is shared with immediate field is to do the same, drop
TII into DecoderBImm class in powerdecoder2.py
then add TII to InOp2 enum and set RB field in CSV
to TII.

also to save having to create subdecoders of op5 suggest
using 11101----- ie use dashes for ignored bits of field
selector
Comment 3 Luke Kenneth Casson Leighton 2021-11-06 03:14:31 GMT
(In reply to Luke Kenneth Casson Leighton from comment #2)
> no looks good, remember ternary is 4in RT RA RB RC ternaryi 3in RA RB RT

which means ternaryi should be RT RA TII RC 
because TII drops into RB immediate in DecodeBImm

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/power_decoder2.py;h=6db5c61511a79a22b6f0eccc570c9018a434d32e;hb=a17a252e474d5d5bf34026c25a19682e3f2015c3#l268

In2Sel.CONST_TII

ternary instr is then "for free"

OP_TERNARY not OP_TERNARYI
Comment 4 Luke Kenneth Casson Leighton 2021-11-06 04:40:26 GMT
hm no 4 in support in decoder, would need
Decode4thReg scratch CONST_TTI idea.
more tomorrow needs thought.
Comment 5 Luke Kenneth Casson Leighton 2021-11-06 11:12:30 GMT
ok, morning now: can do a more in-depth response / walkthrough / insights

* ternary is an overwrite 4-in 1-out instruction. this is like nothing
  that Power ISA has seen before.  (overwrite: the destination is also
  one of the sources)

* ternaryi is an overwrite 3-in 1-out large-immediate instruction.
  other Power ISA 3-in 1-out instructions include FMAC and MAC.
  those other Power ISA 3-in/1-out instructions do not include Rc=1
  because again it is a mad amount of registers

* ternarycr is an overwrite 3-in 1-out large-immediate instruction
  where SVP64 is expected to extend the 3-bit-only fields to 5-bit.
  it is reasonably similar to crops (extends the 2-bit selection
  to 3-bit).  its primary purpose is for manipulating predicates,
  hence the 4-bit "mask" option to be able to select one or multiple
  of EQ/LT/GT/SO.

the problematic one is ternary. given that no other Function Unit
has 4-input 64-bit pathways it is a lot of work and will be
slow or resource-hungry.

because of the time-pressure i do not recommend it be included:
it will require the addition of a 4th register decoder
(In4Sel), the addition of a 4th input register column in all
CSV files, the addition of 

OR

use one single 64-bit input register and split it into 4 parts
(see "swizzle and vec4" variant)

* part 1: input 1
* part 2: input 2
* part 3: input 3
* part 4: 8-bit selector

SVP64 will be capable of dropping down to:

* 64-bit - 16-bit for parts 1-3
* 32-bit - 8-bit for parts 1-3
* 16-bit - 2-bit for parts 1-3
* 8-bit undefined

the advantage of not doing a 4-in 1-out variant is that the
pressure on the opcode space is greatly reduced:

0.5	6.10	11.15	16.20	21.25	26...30	31
NN	RT	RA	RB	RC	mode 001	Rc

001 would be "freed", potentially for use for svp64 ops
(have to examine that).
Comment 6 Luke Kenneth Casson Leighton 2021-11-06 11:35:14 GMT
just spotted this:

@@ -475,6 +478,7 @@ class In3Sel(Enum):
     FRS = 3
     FRC = 4
     RC = 5  # for SVP64 bit-reverse LD/ST
+    CONST_TII = 6  # for ternaryi

that won't work.  there are 3 register inputs: all 3 are required
as actual registers

* In1Sel = RA
* In2Sel = RB
* In3Sel = RC

if there was an In4Sel *that* could be set to CONST_TII
(or, better, In2Sel could be set to CONST_TII and In4Sel
allowed to be RC)

however, an In4Sel as explained in comment #5 is a hell of a lot
of work for a lot of pain / pressure (time, resources, regfile
port pressure)

three input registers are required.  the pseudocode is:

for i in range(64):
    idx = RT[i] << 2 | RA[i] << 1 | RB[i]
    RT[i] = (imm & (1<<idx)) != 0

and what that means is: Insel3 needs to have *RT* as an input
option:

@@ -475,6 +478,7 @@ class In3Sel(Enum):
     FRS = 3
     FRC = 4
     RC = 5  # for SVP64 bit-reverse LD/ST
+    RT = 6  # for ternaryi to be able to use RT as an overwrite-input

i'll sort that now.
Comment 7 Luke Kenneth Casson Leighton 2021-11-06 11:49:44 GMT
hmmm no.  instead of adding extra options (requiring changes to
power_decoder2.py) i noticed that

* In3Sel can select RC
* In2Sel can (as always) select RB
* In1Sel can select RS (for logical ops)

but - drat: this still leaves RT as being only an output.

drat, power_decoder2.py will need a special-case for OP_TERNARY
to select RT as an *input* - 

 | 0.5|6.10|11.15|16.20| 21..25| 26..30   |31|
 | -- | -- | --- | --- | ----- | -------- |--|
 | NN | RT | RA  | RB  | im0-4 | im5-7 00 |Rc|
        ^
        | input into InSel3

for i in range(64):
    idx = RT[i] << 2 | RA[i] << 1 | RB[i]
    RT[i] = (imm & (1<<idx)) != 0
Comment 8 Jacob Lifshay 2021-11-10 03:30:03 GMT
maybe we should have:
input A: RA
input B: RB
input C: RT
input D: imm
output: RT

since that way we won't need to shift RA and RB over into inputs B and C saving a few gates.

input D can just be implicit in the CSV (the ALU extracts the immediate rather than using the decoder to do that) until we implement 4-in 1-out instructions.

How does that sound?
Comment 9 Luke Kenneth Casson Leighton 2021-11-10 11:15:39 GMT
(In reply to Jacob Lifshay from comment #8)
> maybe we should have:
> input A: RA
> input B: RB
> input C: RT
> input D: imm
> output: RT
> 
> since that way we won't need to shift RA and RB over into inputs B and C
> saving a few gates.
> 
> input D can just be implicit in the CSV (the ALU extracts the immediate
> rather than using the decoder to do that) until we implement 4-in 1-out
> instructions.
> 
> How does that sound?

Satellite decoders can pick up immediate operands on their own, or
even decode the fields inside the instruction, let me
find an example...  ok here's a simple one (a single bit):

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/logical/main_stage.py;h=253664032fc16a1401665416a70e33688c6caa65;hb=04c4144f0d962bb181c69de4ce5a998077314e24#l97

  97             with m.Case(MicrOp.OP_CNTZ):
  98                 XO = self.fields.FormX.XO[0:-1]
  99                 count_right = Signal(reset_less=True)
 100                 comb += count_right.eq(XO[-1])

ah, and this one, actually does further decode of the *register* fields:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/cr/main_stage.py;h=5f1edc7adb6fed02a2fc08b2a5ff0891905eabe9;hb=04c4144f0d962bb181c69de4ce5a998077314e24#l74

  45         xl_fields = self.fields.FormXL
  ...
  ...
  74                 # Get the bit selector fields from the
  75                 # instruction. This operation takes in the little CR
  76                 # bitfields, so these fields need to get truncated to
  77                 # the least significant 2 bits
  78                 BT = xl_fields.BT
  79                 BA = xl_fields.BA
  80                 BB = xl_fields.BB
  81                 bt = Signal(2, reset_less=True)
  82                 ba = Signal(2, reset_less=True)
  83                 bb = Signal(2, reset_less=True)
  84 
  85                 # Stupid bit ordering stuff.  Because POWER.
  86                 comb += bt.eq(3-BT[0:2])
  87                 comb += ba.eq(3-BA[0:2])
  88                 comb += bb.eq(3-BB[0:2])

also, there is even some cases where the opcode is further decoded
inside an FU, rather than do it in the CSV files and give it a
special (separate) OP_XXXX

in this way it will be possible to pick up the immediate *without*
needing to go to the lengths (a lot of trouble) of adding a 4th
Operand Decoder.

ah - found it:

https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/cr/main_stage.py;h=5f1edc7adb6fed02a2fc08b2a5ff0891905eabe9;hb=04c4144f0d962bb181c69de4ce5a998077314e24#l62

  62         # ##### crand, cror, crnor etc. #####
  63         with m.Case(MicrOp.OP_CROP):
  64             # crand/cror and friends get decoded to the same opcode, but
  65             # one of the fields inside the instruction is a 4 bit lookup
  66             # table. This lookup table gets indexed by bits a and b from
  67             # the CR to determine what the resulting bit should be.
  68 
  69             # Grab the lookup table for cr_op type instructions
  70             lut = Signal(4, reset_less=True)
  71             # There's no field, just have to grab it directly from the insn
  72             comb += lut.eq(op.insn[6:10])

so instead of lut = Signal(4) it would be lut = Signal(8)
and the assignment would either come from self.fields.FormTTI.TTI
or just directly accessing the instruction (bearing in mind
Power ISA spec madness)
Comment 10 Luke Kenneth Casson Leighton 2021-11-10 14:50:34 GMT
(In reply to Jacob Lifshay from comment #8)
> maybe we should have:
> input A: RA
> input B: RB
> input C: RT

these are the options in power_enums.py:

@unique
class In1Sel(Enum):
    RA = 1
    RS = 4  # for some ALU/Logical operations

@unique
class In2Sel(Enum):
    RB = 1
    RS = 13  # for shiftrot (M-Form)

@unique
class In3Sel(Enum):
    RS = 1
    RB = 2  # for shiftrot (M-Form)
    RC = 5  # for SVP64 bit-reverse LD/ST

therefore:
* input 1: RA
* input 2: RB
* input 3: has to be done as a special-coded case (RT) similar to
           how LD/ST-with-update is done (RA)

like this:

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/power_decoder2.py;h=edf2893b3dec4749822db7d926efb4eaa0eea9b2;hb=HEAD#l469

 469         if hasattr(op, "upd"):
 470             # update mode LD/ST uses read-reg A also as an output
 471             with m.If(op.upd == LDSTMode.update):
 472                 comb += self.reg_out.data.eq(self.dec.RA)
 473                 comb += self.reg_out.ok.eq(1)

therefore in class DecodeC, it would be:

   with m.If(op.internal_op == MicrOp.OP_TERNARY):
     comb += self.reg.data.eq(self.dec.RT)
     comb += self.reg.ok.eq(1)

or... yyeah, add RT as one of the sources for DecodeC.
i'll do that now.

(edit)

index 6db5c61..0a054c4 100644
--- a/src/openpower/decoder/power_decoder2.py
+++ b/src/openpower/decoder/power_decoder2.py
@@ -362,6 +362,10 @@ class DecodeC(Elaboratable):
             with m.Case(In3Sel.RC):
                 comb += reg.data.eq(self.dec.RC)
                 comb += reg.ok.eq(1)
+            with m.Case(In3Sel.RT):
+                # for TII-form ternary
+                comb += reg.data.eq(self.dec.RT)
+                comb += reg.ok.eq(1)
 
         return m
 
diff --git a/src/openpower/decoder/power_enums.py b/src/openpower/decoder/power_enums.py
index ec9ed77..5757fb6 100644
--- a/src/openpower/decoder/power_enums.py
+++ b/src/openpower/decoder/power_enums.py
@@ -478,7 +478,8 @@ class In3Sel(Enum):
     FRS = 3
     FRC = 4
     RC = 5  # for SVP64 bit-reverse LD/ST
-    CONST_TII = 6  # for ternaryi
+    CONST_TII = 6  # for ternaryi - XXX TODO: REMOVE THIS (from CSV, first)
+    RT = 7 # for ternary

commit 95fcf66fa63689d6b3a5a8a5e01fd6741928def0 (HEAD -> master, origin/master)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Wed Nov 10 14:53:04 2021 +0000

    add RT as an option for ternary instruction as 3rd register input

so, the CSV can have RT now as in3 (replacing CONST_TII which should
be removed from In3Sel)
Comment 11 Jacob Lifshay 2021-11-12 01:43:32 GMT
(In reply to Luke Kenneth Casson Leighton from comment #9)
> so instead of lut = Signal(4) it would be lut = Signal(8)
> and the assignment would either come from self.fields.FormTTI.TTI
> or just directly accessing the instruction (bearing in mind
> Power ISA spec madness)

Note that the form I added is called TI (for TernaryI) and the immediate field is called TII (for TernaryI Immediate).

(In reply to Luke Kenneth Casson Leighton from comment #3)
> OP_TERNARY not OP_TERNARYI

I decided to leave the internal op set to OP_TERNARYI since the ALU will need to distinguish ternary/ternaryi to correctly read either RC or the TII field -- also they will have a different number of registers, which might cause problems with the instruction issue logic if they are assigned the same internal op.

(In reply to Luke Kenneth Casson Leighton from comment #10)
> so, the CSV can have RT now as in3 (replacing CONST_TII which should
> be removed from In3Sel)

Done:
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=782d622d51388374db95c98b8a7ee7cfa8c3e30b

I also fixed sv_analysis to properly handle it
Comment 12 Luke Kenneth Casson Leighton 2021-11-12 04:56:08 GMT
(In reply to Jacob Lifshay from comment #11)
> (In reply to Luke Kenneth Casson Leighton from comment #9)
> > so instead of lut = Signal(4) it would be lut = Signal(8)
> > and the assignment would either come from self.fields.FormTTI.TTI
> > or just directly accessing the instruction (bearing in mind
> > Power ISA spec madness)
> 
> Note that the form I added is called TI (for TernaryI) and the immediate
> field is called TII (for TernaryI Immediate).

the convention, created by the microwatt team, and followed in all 75
operations, is to use the same microcode op for both immediate and nonimmediate.
this is achieved by placing the immediate into the data lanes of the registers
whereupon the Function Unit knows nothing at all about immediates and 
consequently a designation and naming convention based on immediates
is completely inappropriate and misleading.


> (In reply to Luke Kenneth Casson Leighton from comment #3)
> > OP_TERNARY not OP_TERNARYI
> 
> I decided to leave the internal op set to OP_TERNARYI since the ALU will
> need to distinguish ternary/ternaryi to correctly read either RC or the TII
> field

this is normally handled by the Muxers built in to MultiCompUnit
which is specifically designed to recognise a Record field "imm_data"
which places the contents of that data into the 2nd src operand
(nominally designated RB).

i explained the logic in earlier commwnts:
in this case 4 operands is too much therefore we are not
going to add it (or use imm_data, because the RB CompUnit lane is needed
for one of the 3 register inputs)

breaking the convention however is unwise so please rename it so
that a 2-operand non-immediate variant can be done later

> (In reply to Luke Kenneth Casson Leighton from comment #10)
> > so, the CSV can have RT now as in3 (replacing CONST_TII which should
> > be removed from In3Sel)
> 
> Done:
> https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;
> h=782d622d51388374db95c98b8a7ee7cfa8c3e30b
> 
> I also fixed sv_analysis to properly handle it

brilliant. that should be enough to move on to creating the pipeline
and its data structures, regspecs, and pipe_data.py

there aren't any 3-input pipelines (no FMAC yet) but the closest
to copy as a starting point is likely to be shiftrot.

there is no need to add a ti_imm field to the record because as already
explained the TI field can be extracted manually.  this saves wires.
Comment 13 Jacob Lifshay 2021-11-12 05:13:24 GMT
(In reply to Luke Kenneth Casson Leighton from comment #12)
> (In reply to Jacob Lifshay from comment #11)
> > Note that the form I added is called TI (for TernaryI) and the immediate
> > field is called TII (for TernaryI Immediate).
> 
> the convention, created by the microwatt team, and followed in all 75
> operations, is to use the same microcode op for both immediate and
> nonimmediate.
> this is achieved by placing the immediate into the data lanes of the
> registers
> whereupon the Function Unit knows nothing at all about immediates and 
> consequently a designation and naming convention based on immediates
> is completely inappropriate and misleading.

here, the name is based on immediate (or not) because the TI form is *only* ever used with ternaryi...ternary will instead be a separate form MR4 (Modified 4-register -- exact name TBD) with RT, RA, RB, RC, and *no* immediate. This is independent of whatever OP_TERNARY/OP_TERNARYI name we decide to use.

we can switch to merging both ternary and ternaryi into OP_TERNARY once support for 4-in operations are added to the decoder...until then, OP_TERNARY should be separate from OP_TERNARYI imho.
Comment 14 Luke Kenneth Casson Leighton 2021-11-12 05:32:08 GMT
(In reply to Jacob Lifshay from comment #13)
> (In reply to Luke Kenneth Casson Leighton from comment #12)
> > (In reply to Jacob Lifshay from comment #11)
> > > Note that the form I added is called TI (for TernaryI) and the immediate
> > > field is called TII (for TernaryI Immediate).
> > 
> > the convention, created by the microwatt team, and followed in all 75
> > operations, is to use the same microcode op for both immediate and
> > nonimmediate.
> > this is achieved by placing the immediate into the data lanes of the
> > registers
> > whereupon the Function Unit knows nothing at all about immediates and 
> > consequently a designation and naming convention based on immediates
> > is completely inappropriate and misleading.
> 
> here, the name is based on immediate (or not) because the TI form is *only*
> ever used with ternaryi...

please re-read what i wrote, the answer is exactly the same.
because you have not been directly involved in this level
of detail before, you do not understand it (and are not
listening. again).

for the third and final time, i will not say it again:
rename the operation.


regarding pipe_data.py, the 3 src registers need to be called
ra rb and rc because of this code in core.py

 457             # argh.  an experiment to merge RA and RB in the INT regfile
 458             # (we have too many read/write ports)
 459             if self.regreduce_en:
 460                 if regfile == 'INT':
 461                     fuspecs['rabc'] = [fuspecs.pop('rb')]
 462                     fuspecs['rabc'].append(fuspecs.pop('rc'))
 463                     fuspecs['rabc'].append(fuspecs.pop('ra'))

these names are local so have nothing to do with the PowerDecoder2
names, it is a little confusing.
Comment 15 Luke Kenneth Casson Leighton 2021-11-12 13:58:11 GMT
so, again, to reiterate:

1) there are conventions that were created long ago and they are not
   up for debate or discussion.

2) not following those conventions will be confusing and cause massive
   disruption.

3) when i say "please do something in this specific and requested way"
   it is short-hand for:

   a) this was already decided a long time ago, it is not up for debate
   b) i haven't got time and neither have any of us to go over that again
   c) if there was a lengthy discussion it would conclude exactly what i asked
   d) therefore please don't argue, don't jeapordise timescales, just do it

now, i did in fact provide you with the basics of the explanations for
the decisions that had already been made.  these were in, at least:

* comment #5
* comment #7
* comment #12

you ignored those comments.  please do not ignore things, forcing me to
waste time repeating them again and again.  i have made it clear that it
is very disrespectful to ignore the Project Leader, and that it jeapordises
timescales.  we hav e been through the consequences of this type of
behaviour already.

please therefore re-read the comments and take them on board.

i have raised bug #745 as a sub-bug for OP_TERNARY: there are a large
number of sub-tasks (14 if a separate pipeline is used, or only around
7 if sharing with similar-profile shift_rot)

7 sub-tasks is still more than enough to justify a separate bugreport,
can we please therefore keep this one to specifically the individual
functions that get added (more high-level) so as to keep the number
of comments down