ALU implementation for bit manipulation instructions: https://libre-soc.org/openpower/sv/bitmanip/ https://lists.libre-soc.org/pipermail/libre-soc-dev/2021-November/004078.html ------ * TODO: estimate budget needed for bug #782 * TODO: are there other tasks that should go under this task? * DONE: grev initial work (grevlut left for later task) bug #755 * DONE: ternlogi bug #745
added ternaryi to the decoder, I probably missed some stuff when I added the new forms, fields, etc. https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=a17a252e474d5d5bf34026c25a19682e3f2015c3
no looks good, remember ternary is 4in RT RA RB RC ternaryi 3in RA RB RT so yes, separate pipeline (all pipelines are created/grouped based on register profile) usual trick following how RB is shared with immediate field is to do the same, drop TII into DecoderBImm class in powerdecoder2.py then add TII to InOp2 enum and set RB field in CSV to TII. also to save having to create subdecoders of op5 suggest using 11101----- ie use dashes for ignored bits of field selector
(In reply to Luke Kenneth Casson Leighton from comment #2) > no looks good, remember ternary is 4in RT RA RB RC ternaryi 3in RA RB RT which means ternaryi should be RT RA TII RC because TII drops into RB immediate in DecodeBImm https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/power_decoder2.py;h=6db5c61511a79a22b6f0eccc570c9018a434d32e;hb=a17a252e474d5d5bf34026c25a19682e3f2015c3#l268 In2Sel.CONST_TII ternary instr is then "for free" OP_TERNARY not OP_TERNARYI
hm no 4 in support in decoder, would need Decode4thReg scratch CONST_TTI idea. more tomorrow needs thought.
ok, morning now: can do a more in-depth response / walkthrough / insights * ternary is an overwrite 4-in 1-out instruction. this is like nothing that Power ISA has seen before. (overwrite: the destination is also one of the sources) * ternaryi is an overwrite 3-in 1-out large-immediate instruction. other Power ISA 3-in 1-out instructions include FMAC and MAC. those other Power ISA 3-in/1-out instructions do not include Rc=1 because again it is a mad amount of registers * ternarycr is an overwrite 3-in 1-out large-immediate instruction where SVP64 is expected to extend the 3-bit-only fields to 5-bit. it is reasonably similar to crops (extends the 2-bit selection to 3-bit). its primary purpose is for manipulating predicates, hence the 4-bit "mask" option to be able to select one or multiple of EQ/LT/GT/SO. the problematic one is ternary. given that no other Function Unit has 4-input 64-bit pathways it is a lot of work and will be slow or resource-hungry. because of the time-pressure i do not recommend it be included: it will require the addition of a 4th register decoder (In4Sel), the addition of a 4th input register column in all CSV files, the addition of OR use one single 64-bit input register and split it into 4 parts (see "swizzle and vec4" variant) * part 1: input 1 * part 2: input 2 * part 3: input 3 * part 4: 8-bit selector SVP64 will be capable of dropping down to: * 64-bit - 16-bit for parts 1-3 * 32-bit - 8-bit for parts 1-3 * 16-bit - 2-bit for parts 1-3 * 8-bit undefined the advantage of not doing a 4-in 1-out variant is that the pressure on the opcode space is greatly reduced: 0.5 6.10 11.15 16.20 21.25 26...30 31 NN RT RA RB RC mode 001 Rc 001 would be "freed", potentially for use for svp64 ops (have to examine that).
just spotted this: @@ -475,6 +478,7 @@ class In3Sel(Enum): FRS = 3 FRC = 4 RC = 5 # for SVP64 bit-reverse LD/ST + CONST_TII = 6 # for ternaryi that won't work. there are 3 register inputs: all 3 are required as actual registers * In1Sel = RA * In2Sel = RB * In3Sel = RC if there was an In4Sel *that* could be set to CONST_TII (or, better, In2Sel could be set to CONST_TII and In4Sel allowed to be RC) however, an In4Sel as explained in comment #5 is a hell of a lot of work for a lot of pain / pressure (time, resources, regfile port pressure) three input registers are required. the pseudocode is: for i in range(64): idx = RT[i] << 2 | RA[i] << 1 | RB[i] RT[i] = (imm & (1<<idx)) != 0 and what that means is: Insel3 needs to have *RT* as an input option: @@ -475,6 +478,7 @@ class In3Sel(Enum): FRS = 3 FRC = 4 RC = 5 # for SVP64 bit-reverse LD/ST + RT = 6 # for ternaryi to be able to use RT as an overwrite-input i'll sort that now.
hmmm no. instead of adding extra options (requiring changes to power_decoder2.py) i noticed that * In3Sel can select RC * In2Sel can (as always) select RB * In1Sel can select RS (for logical ops) but - drat: this still leaves RT as being only an output. drat, power_decoder2.py will need a special-case for OP_TERNARY to select RT as an *input* - | 0.5|6.10|11.15|16.20| 21..25| 26..30 |31| | -- | -- | --- | --- | ----- | -------- |--| | NN | RT | RA | RB | im0-4 | im5-7 00 |Rc| ^ | input into InSel3 for i in range(64): idx = RT[i] << 2 | RA[i] << 1 | RB[i] RT[i] = (imm & (1<<idx)) != 0
maybe we should have: input A: RA input B: RB input C: RT input D: imm output: RT since that way we won't need to shift RA and RB over into inputs B and C saving a few gates. input D can just be implicit in the CSV (the ALU extracts the immediate rather than using the decoder to do that) until we implement 4-in 1-out instructions. How does that sound?
(In reply to Jacob Lifshay from comment #8) > maybe we should have: > input A: RA > input B: RB > input C: RT > input D: imm > output: RT > > since that way we won't need to shift RA and RB over into inputs B and C > saving a few gates. > > input D can just be implicit in the CSV (the ALU extracts the immediate > rather than using the decoder to do that) until we implement 4-in 1-out > instructions. > > How does that sound? Satellite decoders can pick up immediate operands on their own, or even decode the fields inside the instruction, let me find an example... ok here's a simple one (a single bit): https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/logical/main_stage.py;h=253664032fc16a1401665416a70e33688c6caa65;hb=04c4144f0d962bb181c69de4ce5a998077314e24#l97 97 with m.Case(MicrOp.OP_CNTZ): 98 XO = self.fields.FormX.XO[0:-1] 99 count_right = Signal(reset_less=True) 100 comb += count_right.eq(XO[-1]) ah, and this one, actually does further decode of the *register* fields: https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/cr/main_stage.py;h=5f1edc7adb6fed02a2fc08b2a5ff0891905eabe9;hb=04c4144f0d962bb181c69de4ce5a998077314e24#l74 45 xl_fields = self.fields.FormXL ... ... 74 # Get the bit selector fields from the 75 # instruction. This operation takes in the little CR 76 # bitfields, so these fields need to get truncated to 77 # the least significant 2 bits 78 BT = xl_fields.BT 79 BA = xl_fields.BA 80 BB = xl_fields.BB 81 bt = Signal(2, reset_less=True) 82 ba = Signal(2, reset_less=True) 83 bb = Signal(2, reset_less=True) 84 85 # Stupid bit ordering stuff. Because POWER. 86 comb += bt.eq(3-BT[0:2]) 87 comb += ba.eq(3-BA[0:2]) 88 comb += bb.eq(3-BB[0:2]) also, there is even some cases where the opcode is further decoded inside an FU, rather than do it in the CSV files and give it a special (separate) OP_XXXX in this way it will be possible to pick up the immediate *without* needing to go to the lengths (a lot of trouble) of adding a 4th Operand Decoder. ah - found it: https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/cr/main_stage.py;h=5f1edc7adb6fed02a2fc08b2a5ff0891905eabe9;hb=04c4144f0d962bb181c69de4ce5a998077314e24#l62 62 # ##### crand, cror, crnor etc. ##### 63 with m.Case(MicrOp.OP_CROP): 64 # crand/cror and friends get decoded to the same opcode, but 65 # one of the fields inside the instruction is a 4 bit lookup 66 # table. This lookup table gets indexed by bits a and b from 67 # the CR to determine what the resulting bit should be. 68 69 # Grab the lookup table for cr_op type instructions 70 lut = Signal(4, reset_less=True) 71 # There's no field, just have to grab it directly from the insn 72 comb += lut.eq(op.insn[6:10]) so instead of lut = Signal(4) it would be lut = Signal(8) and the assignment would either come from self.fields.FormTTI.TTI or just directly accessing the instruction (bearing in mind Power ISA spec madness)
(In reply to Jacob Lifshay from comment #8) > maybe we should have: > input A: RA > input B: RB > input C: RT these are the options in power_enums.py: @unique class In1Sel(Enum): RA = 1 RS = 4 # for some ALU/Logical operations @unique class In2Sel(Enum): RB = 1 RS = 13 # for shiftrot (M-Form) @unique class In3Sel(Enum): RS = 1 RB = 2 # for shiftrot (M-Form) RC = 5 # for SVP64 bit-reverse LD/ST therefore: * input 1: RA * input 2: RB * input 3: has to be done as a special-coded case (RT) similar to how LD/ST-with-update is done (RA) like this: https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/power_decoder2.py;h=edf2893b3dec4749822db7d926efb4eaa0eea9b2;hb=HEAD#l469 469 if hasattr(op, "upd"): 470 # update mode LD/ST uses read-reg A also as an output 471 with m.If(op.upd == LDSTMode.update): 472 comb += self.reg_out.data.eq(self.dec.RA) 473 comb += self.reg_out.ok.eq(1) therefore in class DecodeC, it would be: with m.If(op.internal_op == MicrOp.OP_TERNARY): comb += self.reg.data.eq(self.dec.RT) comb += self.reg.ok.eq(1) or... yyeah, add RT as one of the sources for DecodeC. i'll do that now. (edit) index 6db5c61..0a054c4 100644 --- a/src/openpower/decoder/power_decoder2.py +++ b/src/openpower/decoder/power_decoder2.py @@ -362,6 +362,10 @@ class DecodeC(Elaboratable): with m.Case(In3Sel.RC): comb += reg.data.eq(self.dec.RC) comb += reg.ok.eq(1) + with m.Case(In3Sel.RT): + # for TII-form ternary + comb += reg.data.eq(self.dec.RT) + comb += reg.ok.eq(1) return m diff --git a/src/openpower/decoder/power_enums.py b/src/openpower/decoder/power_enums.py index ec9ed77..5757fb6 100644 --- a/src/openpower/decoder/power_enums.py +++ b/src/openpower/decoder/power_enums.py @@ -478,7 +478,8 @@ class In3Sel(Enum): FRS = 3 FRC = 4 RC = 5 # for SVP64 bit-reverse LD/ST - CONST_TII = 6 # for ternaryi + CONST_TII = 6 # for ternaryi - XXX TODO: REMOVE THIS (from CSV, first) + RT = 7 # for ternary commit 95fcf66fa63689d6b3a5a8a5e01fd6741928def0 (HEAD -> master, origin/master) Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net> Date: Wed Nov 10 14:53:04 2021 +0000 add RT as an option for ternary instruction as 3rd register input so, the CSV can have RT now as in3 (replacing CONST_TII which should be removed from In3Sel)
(In reply to Luke Kenneth Casson Leighton from comment #9) > so instead of lut = Signal(4) it would be lut = Signal(8) > and the assignment would either come from self.fields.FormTTI.TTI > or just directly accessing the instruction (bearing in mind > Power ISA spec madness) Note that the form I added is called TI (for TernaryI) and the immediate field is called TII (for TernaryI Immediate). (In reply to Luke Kenneth Casson Leighton from comment #3) > OP_TERNARY not OP_TERNARYI I decided to leave the internal op set to OP_TERNARYI since the ALU will need to distinguish ternary/ternaryi to correctly read either RC or the TII field -- also they will have a different number of registers, which might cause problems with the instruction issue logic if they are assigned the same internal op. (In reply to Luke Kenneth Casson Leighton from comment #10) > so, the CSV can have RT now as in3 (replacing CONST_TII which should > be removed from In3Sel) Done: https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=782d622d51388374db95c98b8a7ee7cfa8c3e30b I also fixed sv_analysis to properly handle it
(In reply to Jacob Lifshay from comment #11) > (In reply to Luke Kenneth Casson Leighton from comment #9) > > so instead of lut = Signal(4) it would be lut = Signal(8) > > and the assignment would either come from self.fields.FormTTI.TTI > > or just directly accessing the instruction (bearing in mind > > Power ISA spec madness) > > Note that the form I added is called TI (for TernaryI) and the immediate > field is called TII (for TernaryI Immediate). the convention, created by the microwatt team, and followed in all 75 operations, is to use the same microcode op for both immediate and nonimmediate. this is achieved by placing the immediate into the data lanes of the registers whereupon the Function Unit knows nothing at all about immediates and consequently a designation and naming convention based on immediates is completely inappropriate and misleading. > (In reply to Luke Kenneth Casson Leighton from comment #3) > > OP_TERNARY not OP_TERNARYI > > I decided to leave the internal op set to OP_TERNARYI since the ALU will > need to distinguish ternary/ternaryi to correctly read either RC or the TII > field this is normally handled by the Muxers built in to MultiCompUnit which is specifically designed to recognise a Record field "imm_data" which places the contents of that data into the 2nd src operand (nominally designated RB). i explained the logic in earlier commwnts: in this case 4 operands is too much therefore we are not going to add it (or use imm_data, because the RB CompUnit lane is needed for one of the 3 register inputs) breaking the convention however is unwise so please rename it so that a 2-operand non-immediate variant can be done later > (In reply to Luke Kenneth Casson Leighton from comment #10) > > so, the CSV can have RT now as in3 (replacing CONST_TII which should > > be removed from In3Sel) > > Done: > https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff; > h=782d622d51388374db95c98b8a7ee7cfa8c3e30b > > I also fixed sv_analysis to properly handle it brilliant. that should be enough to move on to creating the pipeline and its data structures, regspecs, and pipe_data.py there aren't any 3-input pipelines (no FMAC yet) but the closest to copy as a starting point is likely to be shiftrot. there is no need to add a ti_imm field to the record because as already explained the TI field can be extracted manually. this saves wires.
(In reply to Luke Kenneth Casson Leighton from comment #12) > (In reply to Jacob Lifshay from comment #11) > > Note that the form I added is called TI (for TernaryI) and the immediate > > field is called TII (for TernaryI Immediate). > > the convention, created by the microwatt team, and followed in all 75 > operations, is to use the same microcode op for both immediate and > nonimmediate. > this is achieved by placing the immediate into the data lanes of the > registers > whereupon the Function Unit knows nothing at all about immediates and > consequently a designation and naming convention based on immediates > is completely inappropriate and misleading. here, the name is based on immediate (or not) because the TI form is *only* ever used with ternaryi...ternary will instead be a separate form MR4 (Modified 4-register -- exact name TBD) with RT, RA, RB, RC, and *no* immediate. This is independent of whatever OP_TERNARY/OP_TERNARYI name we decide to use. we can switch to merging both ternary and ternaryi into OP_TERNARY once support for 4-in operations are added to the decoder...until then, OP_TERNARY should be separate from OP_TERNARYI imho.
(In reply to Jacob Lifshay from comment #13) > (In reply to Luke Kenneth Casson Leighton from comment #12) > > (In reply to Jacob Lifshay from comment #11) > > > Note that the form I added is called TI (for TernaryI) and the immediate > > > field is called TII (for TernaryI Immediate). > > > > the convention, created by the microwatt team, and followed in all 75 > > operations, is to use the same microcode op for both immediate and > > nonimmediate. > > this is achieved by placing the immediate into the data lanes of the > > registers > > whereupon the Function Unit knows nothing at all about immediates and > > consequently a designation and naming convention based on immediates > > is completely inappropriate and misleading. > > here, the name is based on immediate (or not) because the TI form is *only* > ever used with ternaryi... please re-read what i wrote, the answer is exactly the same. because you have not been directly involved in this level of detail before, you do not understand it (and are not listening. again). for the third and final time, i will not say it again: rename the operation. regarding pipe_data.py, the 3 src registers need to be called ra rb and rc because of this code in core.py 457 # argh. an experiment to merge RA and RB in the INT regfile 458 # (we have too many read/write ports) 459 if self.regreduce_en: 460 if regfile == 'INT': 461 fuspecs['rabc'] = [fuspecs.pop('rb')] 462 fuspecs['rabc'].append(fuspecs.pop('rc')) 463 fuspecs['rabc'].append(fuspecs.pop('ra')) these names are local so have nothing to do with the PowerDecoder2 names, it is a little confusing.
so, again, to reiterate: 1) there are conventions that were created long ago and they are not up for debate or discussion. 2) not following those conventions will be confusing and cause massive disruption. 3) when i say "please do something in this specific and requested way" it is short-hand for: a) this was already decided a long time ago, it is not up for debate b) i haven't got time and neither have any of us to go over that again c) if there was a lengthy discussion it would conclude exactly what i asked d) therefore please don't argue, don't jeapordise timescales, just do it now, i did in fact provide you with the basics of the explanations for the decisions that had already been made. these were in, at least: * comment #5 * comment #7 * comment #12 you ignored those comments. please do not ignore things, forcing me to waste time repeating them again and again. i have made it clear that it is very disrespectful to ignore the Project Leader, and that it jeapordises timescales. we hav e been through the consequences of this type of behaviour already. please therefore re-read the comments and take them on board. i have raised bug #745 as a sub-bug for OP_TERNARY: there are a large number of sub-tasks (14 if a separate pipeline is used, or only around 7 if sharing with similar-profile shift_rot) 7 sub-tasks is still more than enough to justify a separate bugreport, can we please therefore keep this one to specifically the individual functions that get added (more high-level) so as to keep the number of comments down
just EUR 1000 in this bug for bug #745 for now, will assign more after estimating other subtasks
(In reply to Jacob Lifshay from comment #16) > just EUR 1000 in this bug for bug #745 for now, will assign more after > estimating other subtasks adding in #755