865 – implement vector bitmanip opcodes

Summary: implement vector bitmanip opcodes

Status:	RESOLVED FIXED

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	Source Code
Version:	unspecified
Hardware:	PC Linux

Importance:	--- enhancement
Assignee:	Andrey Miroshnikov

URL:	https://libre-soc.org/openpower/sv/ve...

Depends on:
Blocks:

Reported:	2022-06-22 11:07 BST by Luke Kenneth Casson Leighton
Modified:	2022-08-10 17:27 BST
CC List:	2 users

See Also:	866
NLnet milestone:	NLNet.2019.10.031.Video
total budget (EUR) for completion of task and all subtasks:	3500
budget (EUR) for this task, excluding subtasks' budget:	3500
parent task for budget allocation:	234
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
red = { amount = 2000, submitted = 2022-06-25, paid = 2022-08-09 }
andrey = { amount = 1000, submitted = 2022-06-26, paid = 2022-07-21 }
[jacob]
amount = 500
submitted = 2022-07-06
paid = 2022-07-21

Description Luke Kenneth Casson Leighton 2022-06-22 11:07:49 BST

the vector opcodes need adding to the CSV files, unit tests,
etc.
https://libre-soc.org/openpower/sv/vector_ops/

Comment 1 Luke Kenneth Casson Leighton 2022-06-22 11:29:51 BST

andrey can you do carry-prop (cprop) first, as i took a look last
night at the page jacob found
https://en.m.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set#TBM_(Trailing_Bit_Manipulation)

and noticed there are patterns

* pattern 1: x / ~x
* pattern 2: x+1 / x-1 / ~(x+1) / -x
* pattern 3: | / & / ^

and from that it becomes possible to create a suite of instructions
covering every possible combination of those 3 patterns (5 bits)

so i will need time to sort that.

carry-prop, however, is clear and is dead-easy as well: one line
of pseudo-code (ok, 3):

    P = (RA)
    G = (RB)
    RT = ((P|G)+G)^P 

the relevant line from the table on the bitmanip page is this:
https://libre-soc.org/openpower/sv/bitmanip/

NN	RT	RA	RB	0	11	0001 110	Rc	vec cprop	X-Form

from which you can construct the appropriate XO-Field to go into
minor_22.csv
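
The three-line cprop pseudo-code above can be tried out directly. This is a minimal sketch, not the project's implementation: the function name, the `width` parameter and the reading of P and G as per-element "propagate" and "generate" bits are assumptions made for illustration.

```python
def cprop(ra, rb, width=64):
    """RT = ((P|G)+G)^P, truncated to `width` bits (width is illustrative)."""
    mask = (1 << width) - 1
    p = ra & mask  # assumed meaning: one "propagate" bit per element
    g = rb & mask  # assumed meaning: one "generate" bit per element
    return (((p | g) + g) ^ p) & mask

# element 0 generates a carry; elements 1 and 2 propagate it upwards
print(bin(cprop(0b0110, 0b0001)))  # 0b1110
```

In this example the set bits of the result line up with the elements that receive a carry-in (elements 1, 2 and 3), which is consistent with the "carry propagate" name.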

Comment 2 Luke Kenneth Casson Leighton 2022-06-22 11:49:14 BST

andrey if you cookie-cut say maxs from here and replace the pseudocode
(and description) you'll have everything you need

    https://libre-soc.org/openpower/isa/av/

Comment 3 Jacob Lifshay 2022-06-22 11:52:27 BST

we'll also want shifting by 1 bit to cover finding up to and including/excluding lowest set bit.

x ^ (x - 1) => set up to lowest set bit inclusive
(x ^ (x - 1)) >> 1 => set up to lowest set bit exclusive

we'll also want the option to bit-reverse both input and output so we can do first set msb rather than first set lsb.
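
A small sketch of the two expressions above, plus one possible way of doing the bit-reversed variant. The helper names and the explicit `width` argument are assumptions for illustration only.

```python
def mask_incl(x, width=64):
    """x ^ (x - 1): bits set up to and including the lowest set bit."""
    return (x ^ (x - 1)) & ((1 << width) - 1)

def mask_excl(x, width=64):
    """(x ^ (x - 1)) >> 1: bits set below the lowest set bit only."""
    return mask_incl(x, width) >> 1

def bitrev(x, width):
    """Reverse the low `width` bits of x."""
    return int(format(x, f"0{width}b")[::-1], 2)

def mask_incl_msb(x, width):
    """Bit-reverse input and output so the search starts from the msb."""
    return bitrev(mask_incl(bitrev(x, width), width), width)

print(bin(mask_incl(0b10100, 8)))         # 0b111
print(bin(mask_excl(0b10100, 8)))         # 0b11
print(bin(mask_incl_msb(0b00010100, 8)))  # 0b11110000
```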

Comment 4 Andrey Miroshnikov 2022-06-22 13:12:34 BST

The nomenclature for pseudo-code is in the PowerISA spec, sections 1.3.2 onwards.

These are the instructions Luke gave at the end of the call yesterday:
pywriter 
add av.mdwn 
pywriter noall av

I added the changes to av.mdwn and minor_22.csv (don't have write permission to openpower-isa, will push once given).

Now on to some questions:

Does cprop stand for Carry Propagate? What does it actually do? Does it take bits lower down, and shift them up?
I tried calculating the pseudo-code with two 4-bit numbers (RA:1011, RB:0110, result: 1111) on paper, didn't understand the significance of the result.

Also is cprop a bitmanip instruction?
If so, does it need to go into bitmanip.mdwn?

In the minor_22.csv, the entries are:
opcode,unit,internal op,in1,in2,in3,out,CR in,CR out,inv A,inv out,cry in,cry ou

From the pseudo-code alone I can't tell if carry in/out are being used. It looks like there are only two inputs: RA, RB; one output RT. After looking at other instructions, Rc seems to determine something (1-bit bitfield).

Comment 5 Luke Kenneth Casson Leighton 2022-06-22 13:38:20 BST

(In reply to Jacob Lifshay from comment #3)
> we'll also want shifting by 1 bit to cover finding up to and
> including/excluding lowest set bit.

that's 6 mode bits


> x ^ (x - 1) => set up to lowest set bit inclusive
> (x ^ (x - 1)) >> 1 => set up to lowest set bit exclusive
> 
> we'll also want the option to bit-reverse both input and output so we can do
> first set msb rather than first set lsb.

that's 8 mode bits.

this needs 5 bits:

+def bmask(mode, RA, RB=None, zero=False):
+    RT = RA if RB is not None and not zero else 0
+    mask = RB if RB is not None else 0xffffffffffffffff
+    a1 = RA if mode&1 else ~RA
+    mode2 = (mode >> 1) & 0b11
+    if mode2 == 0:
+        a2 = -RA
+    if mode2 == 1:
+        a2 = RA-1
+    if mode2 == 2:
+        a2 = RA+1
+    if mode2 == 3:
+        a2 = ~(RA+1)
+    a1 = a1 & mask
+    a2 = a2 & mask
+    mode3 = (mode >> 3) & 0b11
+    if mode3 == 0:
+         RT = a1 | a2
+    if mode3 == 1:
+         RT = a1 & a2
+    if mode3 == 2:
+         RT = a1 ^ a2
+    return RT & mask

* 10-bits XO is the "norm" for X-Form
* 5-bits XO is the "norm" for high-cost (VA-Form for example), leaving
  5-bits for Mode

however with a budget of only 10-bits for XO:

* 6-bits mode leaves only 4 bits for XO
* 8-bits mode leaves only 2 bits for XO

the table on the bitmanip page has room - barely - for more opcodes
unless grevlogw is removed

    https://libre-soc.org/openpower/sv/bitmanip/

and even then, it would be without an Rc=1 option.

also i was planning to add a "merge" option L=1 (zero=True/False
in the pseudocode above) if practical which leaves only 1 bit
and that's an entire Major Opcode for the entire
instruction.

the only other alternative is to start absorbing some of the 5-XO-bit
portions of Major 19, Major 31 etc. which if we propose too many of
those the ISA WG is going to get pissed.

Comment 6 Luke Kenneth Casson Leighton 2022-06-22 13:42:41 BST

(In reply to Andrey Miroshnikov from comment #4)
> The nomenclature for pseudo-code is in the PowerISA spec, sections 1.3.2
> onwards.
> 
> These are the instructions Luke gave at the end of the call yesterday:
> pywriter 
> add av.mdwn 
> pywriter noall av
> 
> I added the changes to av.mdwn and minor_22.csv (don't have write permission
> to openpower-isa, will push once given).
> 
> Now on to some questions:
> 
> Does cprop stand for Carry Propagate? 

yes.

> What does it actually do?

computes the carry bit(s) needed for big-integer math in a single
instruction.

> Does it take
> bits lower down, and shift them up?
> I tried calculating the pseudo-code with two 4-bit numbers (RA:1011,
> RB:0110, result: 1111) on paper, didn't understand the significance of the
> result.
> 
> Also is cprop a bitmanip instruction?

yes.

> If so, does it need to go into bitmanip.mdwn?

doesn't matter for now
 
> In the minor_22.csv, the entries are:
> opcode,unit,internal op,in1,in2,in3,out,CR in,CR out,inv A,inv out,cry
> in,cry ou
> 
> From the pseudo-code alone I can't tell if carry in/out are being used.

look at fixedarith.mdwn.

>  It
> looks like there are only two inputs: RA, RB; one output RT.

and a co-result, CR0 (which comes from the Rc=1 option) hence
why the page needs two entries "cprop RT,RA,RB" *and* "cprop. RT,RA,RB"

> After looking
> at other instructions, Rc seems to determine something (1-bit bitfield).

yes.  remember i said "just cookie-cut maxs literally", that includes
its entry in minor_22.csv (sorry forgot to emphasise that)

just cut/paste that line, update column 1 (---NNNNNN), update column 2
(s/maxs/cprop), and the rest is good.

Comment 7 Luke Kenneth Casson Leighton 2022-06-22 14:03:37 BST

https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/bmask.py;hb=HEAD

this is currently producing the right answer for the first
example params when mode=0b001110 but not the others, when
mask is non-zero

  30      m  = 0b11000011
  31      v3 = 0b10010100 # vmsbf.m v2, v3
  32      v2 = 0b01000011 # v2

it's important to replicate the full functionality of sof/sif/sbf
and that includes having a "predicate mask" (aka, a GPR which might
happen to be r3, r10 or r31)

i'm currently brute-force experimenting with bmask.py to find
something vaguely resembling the output of sbf.py

https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/sbf.py;hb=HEAD

Comment 8 Luke Kenneth Casson Leighton 2022-06-22 15:10:45 BST

ha!  found them!

https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=7b6eb743caafb5bc6846d2d47c3040025a961460

--- a/openpower/sv/bmask.py
+++ b/openpower/sv/bmask.py
@@ -1,6 +1,7 @@
 def bmask(mode, RA, RB=None, zero=False):
     RT = RA if RB is not None and not zero else 0
     mask = RB if RB is not None else 0xffffffffffffffff
+    RA = RA & mask
     a1 = RA if mode&1 else ~RA
     mode2 = (mode >> 1) & 0b11
     if mode2 == 0:
@@ -22,7 +23,9 @@ def bmask(mode, RA, RB=None, zero=False):
          RT = a1 ^ a2
     return RT & mask
 
-SBF = 0b001110
+SBF = 0b01010
+SOF = 0b01001
+SIF = 0b10000 # 10011 also works no idea why yet

i'm so happy and full of great joy. w00t. etc.
that's masking working properly *and* covering the
entirety of the x86 BMI1 and TBM bitmanip set,
*in a way that can be used for Vector Masks*.
frickin cool.
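
For reference, here is the patched function written out in full, combining the body from comment #5 with the `RA = RA & mask` line from the diff above. It is a sketch for experimenting, not a copy of bmask.py: the `elif` chains and the `MASK64` constant are editorial choices, and the call convention (mode first, value second, mask third) is simply taken from the signature.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def bmask(mode, RA, RB=None, zero=False):
    RT = RA if RB is not None and not zero else 0
    mask = RB if RB is not None else MASK64
    RA = RA & mask                      # the fix from the diff above
    a1 = RA if mode & 1 else ~RA        # pattern 1: x / ~x
    mode2 = (mode >> 1) & 0b11          # pattern 2: -x / x-1 / x+1 / ~(x+1)
    if mode2 == 0:
        a2 = -RA
    elif mode2 == 1:
        a2 = RA - 1
    elif mode2 == 2:
        a2 = RA + 1
    else:
        a2 = ~(RA + 1)
    a1 &= mask
    a2 &= mask
    mode3 = (mode >> 3) & 0b11          # pattern 3: | / & / ^
    if mode3 == 0:
        RT = a1 | a2
    elif mode3 == 1:
        RT = a1 & a2
    elif mode3 == 2:
        RT = a1 ^ a2
    return RT & mask

SBF = 0b01010  # ~x & (x-1)
SOF = 0b01001  # x & -x
SIF = 0b10000  # ~x ^ -x

m = 0b11000011
v3 = 0b10010100
print(bin(bmask(SBF, v3, m)))  # 0b1000011, the v2 value from comment #7
print(bin(bmask(SOF, v3, m)))  # 0b10000000
print(bin(bmask(SIF, v3, m)))  # 0b11000011
```

A likely reason 0b10011 also works for SIF: it selects x ^ (x-1), and since -x equals ~(x-1), the expression ~x ^ -x reduces to the same value.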

now, there's one set of mode-bits (0b11000 -> 0b11111)
which in *theory* could be used for another mode,
like you suggest in comment #3 (shift-down-by-one)

although, to be honest, if it's *really* just "shift-down by one"
or "invert" input or invert output i'm inclined to suggest just
using an extra 32-bit instruction for that.  sradi.

it depends on whether mask interacts with things and makes
life more complex than just "do a shift afterwards"

Comment 9 Luke Kenneth Casson Leighton 2022-06-22 19:10:19 BST

https://libre-soc.org/openpower/sv/vector_ops/discussion/

bmask pseudocode draft at the top, BM2-Form has been added,
all pieces in place to add this in.

Comment 10 Jacob Lifshay 2022-06-22 19:17:05 BST

me on irc:
> lkcl, imho sv.adde is sufficient for biginteger add, cprop is rendered
> redundant because you can just do the trick of having your 256-bit
> simd unit do a 256-bit add and forward co from the previous clock cycle
> to ci in the current cycle to get full-speed bigint add

> so imho we should remove cprop

Comment 11 Jacob Lifshay 2022-06-22 19:39:46 BST

(In reply to Jacob Lifshay from comment #10)
> me on irc:
> > lkcl, imho sv.adde is sufficient for biginteger add, cprop is rendered
> > redundant because you can just do the trick of having your 256-bit
> > simd unit do a 256-bit add and forward co from the previous clock cycle
> > to ci in the current cycle to get full-speed bigint add
> 
> > so imho we should remove cprop

lkcl:
> programmerjake, i was kinda thinking either well beyond 256, 512 or
> 1024, and also of other circumstances involving carry
> and, also, for other vector mask purposes, problem being it was 20
> years ago i worked with the Aspex ASP

me:
> beyond 1024 bits? just use the CA register to hold carry between
> one vector add and the next. also, scalar adde can be used as a
> carry propagate instruction like cprop, but with the inputs encoded
> differently.
> for adde RT, RA, RB: set the bit in RA when the element add
> produces >= 0xFFFF...FFFF, set the bit in RB when the element add overflows.
> the same sv.adde 256-bit and carry forwarding tricks work for sv.subfe
> so, imho cprop is still rendered unnecessary
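
One way to read the encoding described above, as a hedged sketch: if P marks elements whose sum is exactly all-ones and G marks elements that overflowed, then "sum >= 0xFFFF...FFFF" is P|G, so adde is fed RA = P|G and RB = G. The `adde` model below is a plain 64-bit add-with-carry, and the final XOR with P is an assumption added here to make the result comparable with the cprop pseudo-code from comment #1; it is not something stated in the thread.

```python
MASK64 = (1 << 64) - 1

def adde(ra, rb, ca):
    """Simplified model of scalar adde: returns (RT, CA out)."""
    total = (ra & MASK64) + (rb & MASK64) + (ca & 1)
    return total & MASK64, total >> 64

def cprop(p, g):
    """cprop pseudo-code from comment #1: RT = ((P|G)+G)^P."""
    return (((p | g) + g) ^ p) & MASK64

def cprop_via_adde(p, g):
    """Assumed encoding: RA = P|G, RB = G, then XOR the sum with P."""
    rt, ca_out = adde(p | g, g, 0)
    return rt ^ p, ca_out

print(bin(cprop(0b0110, 0b0001)))              # 0b1110
print(bin(cprop_via_adde(0b0110, 0b0001)[0]))  # 0b1110
```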

Comment 12 Andrey Miroshnikov 2022-06-22 21:27:06 BST

(In reply to Luke Kenneth Casson Leighton from comment #9)
> https://libre-soc.org/openpower/sv/vector_ops/discussion/
> 
> bmask pseudocode draft at the top, BM2-Form has been added,
> all pieces in place to add this in.

I tried to make a minor_22.csv entry for bmask based on the info in
https://libre-soc.org/openpower/sv/bitmanip/

but I don't really understand this well enough as the instruction bitfields are different:
10001,L,mode,ALU,OP_BMASK,RA,RB,NONE,RT,NONE,NONE,0,0,ZERO,0,NONE,0,0,0,0,0,0,1,0,0,bmask,X,,1,unofficial until submitted and approved/renumbered by the opf isa wg

Comment 13 Luke Kenneth Casson Leighton 2022-06-22 22:15:23 BST

(In reply to Andrey Miroshnikov from comment #12)

> I tried to make a minor_22.csv entry for bmask based on the info in
> https://libre-soc.org/openpower/sv/bitmanip/
> 
> but I don't really understand this well enough as the instruction bitfields
> are different:
> 10001,L,mode,ALU,OP_BMASK,RA,RB,NONE,RT,NONE,NONE,0,0,ZERO,0,NONE,0,0,0,0,0,
> 0,1,0,0,bmask,X,,1,unofficial until submitted and approved/renumbered by the
> opf isa wg

rright, ok, look at the CSV headings
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isatables/minor_22.csv;hb=HEAD

opcode,unit,internal op,in1,in2,in3,out,CR in,CR out,
10001,L,mode,ALU,OP_BMASK,RA,RB,NONE

    opcode=10001
    unit=L??
    internal op = mode??
    in1 = ALU??

that can't be right, can it?
how about this:

opcode,unit,internal op,in1,in2,in3,out,CR in,CR out,
10001,ALU,OP_BMASK,RA,RB,NONE

    opcode=10001 good
    unit=ALU     ah ha! that's making sense
    internal op=OP_BMASK ok that's better
    in1=RA               looking more like it

so that must be close.  what about the rest (at the end)?

sgn,rc,lk,sgl pipe,comment,form,CONDITIONS,unofficial,comment2
> 0,1,0,0,bmask,X,,1,unofficial un

     sgn=0  # ok
     rc=1   # wrong, it's not an Rc=1.
     lk=0   # ok
     sgl pipe=0     # ok
     comment=bmask  # correct
     form=X         # wrong, it's listed as BM2-Form

so that last bit should be:

    0,NONE,0,0,bmask,BM2,,1,unofficial un...


now, there's *one* more thing, which is slightly complicated.  look closely
at the OP_SETVL and e.g. OP_MINMAX entries:

-----11011-,VL,OP_SETVL,
-----011001,VL,OP_SVSHAPE,
-----111001,VL,OP_SVREMAP,
-----10011-,VL,OP_SVSTEP,
0111001110-,ALU,OP_MINMAX,
0011001110-,ALU,OP_MINMAX,
...

now let's look at the corresponding bitmanip table:

https://libre-soc.org/openpower/sv/bitmanip/

setvl:

   0.5         26....30  31  name   Form
   NN          11 011    Rc  setvl  SVL-Form

av max:

   0.5  21..25 26....30  31  name   Form
   NN   01110  01110     Rc  avmax  X-Form

can you see how in the bit-positions "21..25" for setvl, there is "-----"?
this says to the PowerDecoder "don't try to match against those bits".
so we need to do the same thing for bmask, ***BUT***, look again at the
table:

bmask:

   0.5         26....30  31  name   Form
   NN          L 1000    1   bmask  BM2-Form

so that's going to be:

* five "-"s in bitpositions 21..25
* one  "-" in bitposition 26 (for the "L")
* four bits "1000" in 27..30
* one "1" in bit 31

to give:

 ------10001

so where you had this:


    opcode,unit,internal op,in1,in2,in3,out,CR in,CR out,
    10001,ALU,OP_BMASK,RA,RB,NONE....

it should in fact be this:

    opcode,unit,internal op,in1,in2,in3,out,CR in,CR out,
    ------10001,ALU,OP_BMASK,RA,RB,NONE....

that says "match ONLY bits 27..31 against 10001 but IGNORE 21..26"
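
The don't-care matching described above can be illustrated with a few lines of Python. This is only a sketch of the idea, not the PowerDecoder code: `matches_minor22` is a hypothetical helper, and it assumes the 11 pattern characters correspond, left to right, to MSB0 bits 21..31 of a 32-bit instruction (the low 11 bits of the word).

```python
def matches_minor22(pattern, insn):
    """Compare an 11-character pattern against MSB0 bits 21..31 of insn.

    A "-" in the pattern means "do not match against this bit".
    """
    assert len(pattern) == 11
    bits = format(insn & 0x7FF, "011b")  # low 11 bits, most significant first
    return all(p == "-" or p == b for p, b in zip(pattern, bits))

MAJOR_22 = 22 << 26
# bits 21..26 hold arbitrary field values, bits 27..31 hold 10001
print(matches_minor22("------10001", MAJOR_22 | (0b101010 << 5) | 0b10001))  # True
print(matches_minor22("------10001", MAJOR_22 | 0b10011))                    # False
print(matches_minor22("-----11011-", MAJOR_22 | 0b00000110110))              # True
```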


a *second* job of PowerDecoder is to look up the av.mdwn file,
and get the "Form" (BM2) and the line "bmask RT,RA,RB,mode,L", then
the job of power_fields.py is to decode fields.txt, look at the
BM2 and find the bit-positions of L and mode (oh, and RT, RA and RB)

Comment 14 Luke Kenneth Casson Leighton 2022-06-22 22:20:04 BST

(In reply to Jacob Lifshay from comment #11)

> for adde RT, RA, RB: set the bit in RA when the element add

as a Vectorised instruction to produce a vector of carry-propagation
bits that's ultra-expensive, triggering an astounding number of
register hazards and causing huge numbers of 64-bit registers
to be utilised for the sole purpose of storing binary single-digit
values.

cprop is one single 32-bit scalar instruction that produces up to
64 bits of carry-propagation results.

Comment 15 Luke Kenneth Casson Leighton 2022-06-22 22:24:52 BST

follow-on for context:

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/power_decoder.py;h=2b4799c6dcf53a43879a167250d52389bd83ee7a;hb=8d1e13117cc677247b93542cec6adcf6fc7fd841#l739

 739         Subdecoder(pattern=22, opcodes=get_csv("minor_22.csv"),
 740                    opint=False, bitsel=(0, 11), suffix=None,

that says that the pattern-matcher is looking for a string
(opint=False --> "---NN-NN---")

and that it's looking for a pattern of length 11 in MSB0 bit-positions
21..31 (python range 0,11).   yes. i know.  because MSB0 because LSB0
because python range-numbering the end is +1, sigh.

so that's why the minor_22.csv has opcodes involving "-" don't cares,
and why it has to be exactly 11 long
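
The MSB0-to-python-range conversion mentioned above is easy to get wrong, so here is a short sketch of the arithmetic. Both helper names are made up for illustration, and a 32-bit instruction width is assumed.

```python
def msb0_to_pyrange(first, last, width=32):
    """Convert inclusive MSB0 bit positions into an LSB0 python-style range."""
    return (width - 1 - last, width - first)

def extract_field(insn, first, last, width=32):
    """Extract MSB0 bits first..last (inclusive) from a width-bit word."""
    lo, hi = msb0_to_pyrange(first, last, width)
    return (insn >> lo) & ((1 << (hi - lo)) - 1)

print(msb0_to_pyrange(21, 31))          # (0, 11), as in bitsel=(0, 11)
print(extract_field(22 << 26, 0, 5))    # 22, the major opcode
```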

Comment 16 Andrey Miroshnikov 2022-06-22 23:39:03 BST

I made the pseudo-code, but PyWriter doesn't like it:
    if (RB) = 0 then RT <- 0
    else                  RT <- (RA)

    if (RB) = 0 then mask <- (RB)
    else                  mask <- 0xffffffffffffffff

    RA <- RA & mask
    if (mode&1) = 1 then a1 <- (RA)
    else                  a1 <- (~RA)

    mode2 <- (mode >> 1) & 0b11
    if mode2 = 0 then a2 <- -(RA)
    if mode2 = 1 then a2 <- (RA)-1
    if mode2 = 2 then a2 <- (RA)+1
    if mode2 = 3 then a2 <- ~((RA)+1)

    a1 <- a1 & mask
    a2 <- a2 & mask
    mode3 <- (mode >> 3) & 0b11
    if   mode3 = 0 then RT <- a1 | a2
    if mode3 = 1 then RT <- a1 & a2
    if mode3 = 2 then RT <- a1 ^ a2

The PowerISA doc said switch statements are supported, but I haven't checked if PyWriter supports them.
I'll continue on this tomorrow.

Comment 17 Jacob Lifshay 2022-06-22 23:49:19 BST

(In reply to Luke Kenneth Casson Leighton from comment #14)
> (In reply to Jacob Lifshay from comment #11)
> 
> > for adde RT, RA, RB: set the bit in RA when the element add
> 
> as a Vectorised instruction

I'm referring to *scalar* adde, all those references to vector elements are to illustrate how to set the bits in the input registers to make adde do what you want.

> 
> cprop is one single 32-bit scalar instruction that produces up to
> 64 bits of carry-propagation results.

adde is one single 32-bit scalar instruction that produces 65 bits of carry-propagation results (64 in RT, 1 in CA)

Comment 18 Luke Kenneth Casson Leighton 2022-06-22 23:55:42 BST

(In reply to Andrey Miroshnikov from comment #16)
> I made the pseudo-code, but PyWriter doesn't like it:


(In reply to Luke Kenneth Casson Leighton from comment #9)

  vvvvvvvvvvvvvvvvv
> https://libre-soc.org/openpower/sv/vector_ops/discussion/
  ^^^^^^^^^^^^^^^^^
  vvvvvvvvvvvvvvvvvvvvvv
> bmask pseudocode draft at the top, 
  ^^^^^^^^^^^^^^^^^^^^^^

Comment 19 Jacob Lifshay 2022-06-24 23:04:25 BST

I thought I had replied before, but apparently I forgot to click the submit button.

(In reply to Luke Kenneth Casson Leighton from comment #5)
> (In reply to Jacob Lifshay from comment #3)
> > we'll also want shifting by 1 bit to cover finding up to and
> > including/excluding lowest set bit.
> 
> that's 6 mode bits
> 
> 
> > x ^ (x - 1) => set up to lowest set bit inclusive
> > (x ^ (x - 1)) >> 1 => set up to lowest set bit exclusive
> > 
> > we'll also want the option to bit-reverse both input and output so we can do
> > first set msb rather than first set lsb.
> 
> that's 8 mode bits.

it's actually 7, bit-reverse only happens on both or neither of the input and output.

> 
> this needs 5 bits:
> 
> +def bmask(mode, RA, RB=None, zero=False):
> +    RT = RA if RB is not None and not zero else 0
> +    mask = RB if RB is not None else 0xffffffffffffffff
> +    a1 = RA if mode&1 else ~RA
> +    mode2 = (mode >> 1) & 0b11
> +    if mode2 == 0:
> +        a2 = -RA
> +    if mode2 == 1:
> +        a2 = RA-1
> +    if mode2 == 2:
> +        a2 = RA+1

this is redundant since RA + 1 == -(~RA)

> +    if mode2 == 3:
> +        a2 = ~(RA+1)

this is redundant since ~(RA + 1) = (~RA) - 1

removing both of those saves 1 more bit, making it 6 bits with all of my proposed additions.
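
Both identities hold in 64-bit two's complement arithmetic and can be checked quickly. The helper names are illustrative, and the random seed is arbitrary.

```python
import random

MASK64 = (1 << 64) - 1

def inv(x):
    return ~x & MASK64

def neg(x):
    return -x & MASK64

def inc(x):
    return (x + 1) & MASK64

def dec(x):
    return (x - 1) & MASK64

rng = random.Random(865)
samples = [0, 1, MASK64, 1 << 63] + [rng.getrandbits(64) for _ in range(1000)]
for ra in samples:
    assert inc(ra) == neg(inv(ra))       # RA + 1 == -(~RA)
    assert inv(inc(ra)) == dec(inv(ra))  # ~(RA + 1) == (~RA) - 1
print("both identities hold for", len(samples), "samples")
```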

Comment 20 Jacob Lifshay 2022-06-24 23:12:32 BST

(In reply to Jacob Lifshay from comment #19)
> (In reply to Luke Kenneth Casson Leighton from comment #5)
> > +def bmask(mode, RA, RB=None, zero=False):
> > +    RT = RA if RB is not None and not zero else 0
> > +    mask = RB if RB is not None else 0xffffffffffffffff
> > +    a1 = RA if mode&1 else ~RA
> > +    mode2 = (mode >> 1) & 0b11
> > +    if mode2 == 0:
> > +        a2 = -RA
> > +    if mode2 == 1:
> > +        a2 = RA-1
> > +    if mode2 == 2:
> > +        a2 = RA+1
> 
> this is redundant since RA + 1 == -(~RA)
> 
> > +    if mode2 == 3:
> > +        a2 = ~(RA+1)
> 
> this is redundant since ~(RA + 1) = (~RA) - 1
> 
> removing both of those saves 1 more bit, making it 6 bits with all of my
> proposed additions.

thinking about it a bit more, imho the mode2 options should be `RA - 1` and `RA + 1` since that saves gates, not requiring xor gates on the output of the add.

Comment 21 Luke Kenneth Casson Leighton 2022-06-25 02:41:43 BST

(In reply to Jacob Lifshay from comment #20)

> thinking about it a bit more, imho the mode2 options should be `RA - 1` and `RA
> + 1` since that saves gates, not requiring xor gates on the output of the
> add.

so i am sort-of getting it, but only because OP_ADD, copied from
microwatt, is already subdivided down into

* select a or neg-input-a
* select add 1/0/CA
* select output or neg-output

and the end result is to create an amazing number of arithmetic ops
with the exact same add hardware.

here is the a / neg-a selection anyway:

+    a1 = RA if mode&1 else ~RA

if i understand correctly, what you are saying is
that the mode-bits can be "morphed" to do the same
thing?

saving one bit to add one bit, doesn't totally make sense:
if they are totally equivalent there's not much point

BUT

if things can be morphed such that it fits *directly*
with the existing OP_ADD (ok except the OR, AND and XOR)
that's worth pursuing because it saves gates.

Comment 22 Jacob Lifshay 2022-06-25 03:19:36 BST

(In reply to Luke Kenneth Casson Leighton from comment #21)
> (In reply to Jacob Lifshay from comment #20)
> here is the a / neg-a selection anyway:
> 
> +    a1 = RA if mode&1 else ~RA

that's bitwise-not, not neg -- I get your point anyway...
> 
> if i understand correctly, what you are saying is
> that the mode-bits can be "morphed" to do the same
> thing?

sorta...I'm saying you only need 1 mode2 bit.

The idea is that, currently add/subf/etc. are basically:

a = ~RA if subtracting else RA
carry_in = 0
if subtracting:
    carry_in = 1
RT = a + RB + carry_in

bmask would (ignoring mask and bit-reverse and shifting) do:

a = ~RA if imm & 0b1 else RA
b = 1 if imm & 0b10 else -1 # mode2
carry_in = 0
y = a + b + carry_in
v00 = 0
v01 = v10 = bool(imm & 0b100)
v11 = bool(imm & 0b1000)
if v00 == v01 == v10 == v11 == 0:
    raise IllegalInstruction("other instructions can use the spare space")

# 64x 4-in muxes -- basically a binlog operation:
# probably saves gates over muxing over and, or, and xor
table = [v00, v01, v10, v11]
RT = 0
for i in range(64):
    ra_bit = bool(RA & (1 << i))
    y_bit = bool(y & (1 << i))
    RT |= table[(ra_bit << 1) | y_bit] << i
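
The 4-entry table at the end of the code above can be isolated to see which imm bits select which logic operation. A sketch under the same bit assignments (bit 2 feeds v01 and v10, bit 3 feeds v11); the function name `lut2` and the `width` parameter are assumptions.

```python
def lut2(imm, ra, y, width=64):
    """Per-bit 2-input lookup: index the table with (ra_bit, y_bit)."""
    v00 = 0
    v01 = v10 = (imm >> 2) & 1
    v11 = (imm >> 3) & 1
    table = [v00, v01, v10, v11]
    rt = 0
    for i in range(width):
        ra_bit = (ra >> i) & 1
        y_bit = (y >> i) & 1
        rt |= table[(ra_bit << 1) | y_bit] << i
    return rt

ra, y = 0b1100, 0b1010
print(bin(lut2(0b1000, ra, y)))  # 0b1000  (and)
print(bin(lut2(0b1100, ra, y)))  # 0b1110  (or)
print(bin(lut2(0b0100, ra, y)))  # 0b110   (xor)
```

With v00 fixed at 0 the remaining non-zero combinations give exactly and, or and xor, which is why the all-zero case is flagged as spare encoding space.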

Comment 23 Luke Kenneth Casson Leighton 2022-06-25 09:37:33 BST

(In reply to Jacob Lifshay from comment #22)

> > +    a1 = RA if mode&1 else ~RA
> 
> that's bitwise-not, not neg -- 

yes.  that's directly from the pseudocode expressions, which took me
a while to spot, they're so similar in small fonts.

https://en.m.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set#TBM_(Trailing_Bit_Manipulation)

  XOP.LZ.09 01 /1	BLCFILL	Fill from lowest clear bit	x & (x + 1)
  XOP.LZ.09 02 /6	BLCI	Isolate lowest clear bit	x | ~(x + 1)
  XOP.LZ.09 01 /5	BLCIC	Isolate lowest clear bit and complement	~x & (x + 1)
  XOP.LZ.09 02 /1	BLCMSK	Mask from lowest clear bit	x ^ (x + 1)
  XOP.LZ.09 01 /3	BLCS	Set lowest clear bit	x | (x + 1)
  XOP.LZ.09 01 /2	BLSFILL	Fill from lowest set bit	x | (x - 1)
  XOP.LZ.09 01 /6	BLSIC	Isolate lowest set bit n compl.	~x | (x - 1)
  XOP.LZ.09 01 /7	T1MSKC	Inverse mask from trailing ones	~x | (x + 1)
  XOP.LZ.09 01 /4	TZMSK	Mask from trailing zeros	~x & (x - 1)

and, further up, BMI1

  VEX.LZ.0F38 F3 /3	BLSI	Extract lowest set isolated bit	x & -x
  VEX.LZ.0F38 F3 /2	BLSMSK	Get mask up to lowest set bit	x ^ (x - 1)
  VEX.LZ.0F38 F3 /1	BLSR	Reset lowest set bit	x & (x - 1)

so this separates out 3 expression groups:

    1. x / ~x                    - this is a1
    2. & / ^ / |                 - this is mode3
    3. -x / x-1 / x+1 / ~(x+1)   - this is a2

however, on top of that, to get the same set-before-first, set-only-first
and set-including-first effect, an *additional* mask is added.

> I get your point anyway...

so relieved you can interpret fuzzy-logic :)

> The idea is that, currently add/subf/etc. are basically:
> 
> a = ~RA if subtracting else RA
> carry_in = 0
> if subtracting:
>     carry_in = 1
> RT = a + RB + carry_in

(and an output-invert)

if inverted_out:
    RT = ~RT

> bmask would (ignoring mask and bit-reverse and shifting) do:

mask is quite important (critical to include), and also i found
it... difficult to work out (sotto voce, i had to guess, and
eventually found it)

> a = ~RA if imm & 0b1 else RA
> b = 1 if imm & 0b10 else -1 # mode2
> carry_in = 0
> y = a + b + carry_in

ok so this calculates expression (3) is that correct? (with some
of the equivalence-conversions (~RA)+1 i believe it is) 

> v00 = 0
> v01 = v10 = bool(imm & 0b100)
> v11 = bool(imm & 0b1000)

ahh, a LUT2... it looks like... it's doing and/or/xor. so that's expression (2)

> # 64x 4-in muxes -- basically a binlog operation:
> # probably saves gates over muxing over and, or, and xor
> table = [v00, v01, v10, v11]
> RT = 0
> for i in range(64):
>     ra_bit = bool(RA & (1 << i))
>     y_bit = bool(y & (1 << i))
>     RT |= table[(ra_bit << 1) | y_bit] << i

and the ra input here is not expression (1) which is where the equivalence
chain falls over for me.

i *suspect* that if an extra bit for output-inversion is included then
that might work

as above:

    v00 = 0
    v01 = v10 = bool(imm & 0b100)
    v11 = bool(imm & 0b1000)

(out-inversion built-in to LUT2?)

    v00 ^= bool(imm & 0b10000)
    v01 ^= bool(imm & 0b10000)
    v10 ^= bool(imm & 0b10000)
    v11 ^= bool(imm & 0b10000)

Comment 24 Luke Kenneth Casson Leighton 2022-06-25 13:24:27 BST

https://libre-soc.org/irclog/%23libre-soc.2022-06-24.log.html#t2022-06-24T23:07:05

Andrey, i'm answering here:

> just made another test case for bmask, but as I was about to run it,
> noticed the generated python function in av.py is bad

ah no it isn't.

> it generates these statements "if eq(_RB, 0)"

that's correct.  it's from: "if _RB = 0"

> but now noticed "mode" is not defined.

it's called "bm" and i renamed it in the demo-code bmask.py
to help there

https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=c7e7ea66564c9c2ba1a1b1f931b6d37f0269b72a
 
> It's part of the opcode, but does it need to be
> one of the input arguments?

rright, ok, so this is... a complex part of python programming.
the pseudo-code is "global" in nature (it was never intended to
be used as an *actual* programming language: we were literally
the first people in the world to *make* it a strictly-defined
deterministic programming language)

the key word there is "global"

python is different: there are in fact *local* and global variables,
and there are strict rules about how those are separated.  in
particular, in class functions.

now, so as not to make an absolute pig's ear out of the pseudocode
compiler, i made a deliberate decision to bypass some of the rules
set by python.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/caller.py;h=941a89f0bb2eedfc323b3a51c52464bd8cbd03ef;hb=a7f3fa7ab2c87d75d0c562eb12d73e01d19095f1#l1950

that's based on a stackoverflow question "how do i explicitly inject
variables into a function namespace"

and as jacob later says, it basically allows the X-Form / BM2-Form /
whatever-Form opcode fields to be "injected" into the Simulator
function.  look here, in av.py:

from openpower.decoder.isa.caller import inject

    vvvvvv
 -> @inject() <-
    ^^^^^
    def op_bmask(self, RB, RA):
        if eq(_RB, 0):
            mask = concat(1, repeat=self.XLEN)
        else:
            mask = RB
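
To make the "inject variables into a function namespace" idea concrete, here is a generic sketch of the technique. It is not the `inject` from openpower.decoder.isa.caller (which takes no arguments and pulls the fields from the Simulator): `inject_fields` and its keyword arguments are hypothetical, and the approach shown (temporarily placing names in the function's globals) is only one way to do it.

```python
import functools

_MISSING = object()

def inject_fields(**fields):
    """Hypothetical decorator: expose `fields` as globals while the function runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            namespace = func.__globals__
            saved = {name: namespace.get(name, _MISSING) for name in fields}
            namespace.update(fields)
            try:
                return func(*args, **kwargs)
            finally:
                # restore whatever was there before the call
                for name, old in saved.items():
                    if old is _MISSING:
                        del namespace[name]
                    else:
                        namespace[name] = old
        return wrapper
    return decorator

@inject_fields(bm=0b00001)
def select_a1(ra):
    # `bm` is neither an argument nor a local: it comes from the injected namespace
    return ra if bm & 1 else ~ra

print(select_a1(5))  # 5
```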

> Also in your pseudo-code

in *the* pseudocode.

>  you're skipping bit 0 of "mode" (starting with mode[1])

(bear in mind i renamed "mode" to "bm")
no, i haven't.  you've misinterpreted this:

    a1 = ra if bm&1 else ~ra

that's an *INTEGER* (bm) and the expression "bm&1" is testing BIT ZERO
it helps to view it as this:

    a1 = ra if bm&0b00001 else ~ra

or this:

    a1 = ra if bm&(1<<0) else ~ra

the "0" there refers to "bit 0".

if you were correct (which you're not), then it would be:

    a1 = ra if bm&2 else ~ra

*that* would be testing bit 1.

BUT

butbutbut

PLEASE REMEMBER that whilst the python code bmask.py is in "normal" order
(LSB0), the av.mdwn is in ***MSB0 ORDER***

thus, these two pseudocode lines from av.mdwn here:

    a1 <- ra
    if bm[4] = 0 then a1 <- ¬ra

*ARE* repeat *ARE* repeat *ARE* directly and EXACTLY equivalent to this
from bmask.py:

    a1 = ra if bm&1 else ~ra

why? because bm is 5 bits long, and therefore bm[4] refers to the LEAST
significant bit **NOT** repeat **NOT** to the most significant bit.
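
A quick check of the MSB0 versus LSB0 point, as a sketch: `msb0_bit` is a hypothetical helper that reads bit `index` of a `width`-bit field using MSB0 numbering.

```python
def msb0_bit(value, index, width):
    """Bit `index` of a `width`-bit value, counting from the most significant bit."""
    return (value >> (width - 1 - index)) & 1

# for a 5-bit bm, MSB0 bit 4 is the least significant bit, i.e. bm & 1
for bm in range(32):
    assert msb0_bit(bm, 4, 5) == bm & 1

print(msb0_bit(0b00001, 4, 5))  # 1
print(msb0_bit(0b10000, 0, 5))  # 1
```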

bottom line *do not* modify the pseudocode, it is correct.  if you think
you have to modify it, please stop and think "*why* is this correct, what
have i not understood about the utter mind-melting weirdness that takes
everyone who ever sees MSB0 ordering for the first time 6 months to get
used to".

Comment 25 Luke Kenneth Casson Leighton 2022-06-25 18:29:11 BST

unit test added for bmask, pseudocode confirmed functional.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=c642f6cbdba0c5eeb2e327735f3f58f145c6363a

Comment 26 Luke Kenneth Casson Leighton 2022-06-25 21:10:06 BST

unit tests all good. closing.
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=e2ced8a9c0db4853e216a19a96e40569241823b3