933 – prefix-code (like huffman code) decode/encode instructions

Bug 933 - prefix-code (like huffman code) decode/encode instructions

Summary: prefix-code (like huffman code) decode/encode instructions

Status:	IN_PROGRESS

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	Specification (show other bugs)
Version:	unspecified
Hardware:	PC Linux

Importance:	--- enhancement
Assignee:	Jacob Lifshay

URL:	https://libre-soc.org/openpower/prefi...

Depends on:	937 935
Blocks:	218 222 223
	Show dependency tree / graph

Reported:	2022-09-21 21:53 BST by Jacob Lifshay
Modified:	2022-10-22 13:39 BST (History)
CC List:	2 users (show)

See Also:	936
NLnet milestone:	NLnet.2021.02A.052.CryptoRouter
total budget (EUR) for completion of task and all subtasks:	0
budget (EUR) for this task, excluding subtasks' budget:	0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Jacob Lifshay 2022-09-21 21:53:48 BST

While working on the JPEG decode/encode, I realized that we'll want instructions to accelerate encoding/decoding huffman codes for jpeg, deflate (gzip, zip, png, etc.), mp3, etc.

I am writing instructions to handle all prefix codes (more than just huffman codes), decoding codes that are <=6-bit, longer codes are assumed to be less common so can take a slow path (decode instruction clears a CR bit to signal a >5-bit code).

WIP initial draft here:
https://libre-soc.org/openpower/prefix_codes/

so far I have the pseudo-code for an decode instruction. the huffman-style tree, along with a 2-bit mode is encoded in an input 64-bit register.

Task list:
* ...
* DONE: rename to `pcdec.`
* DONE: initial implementation and tests
* TODO: explain why _RB=0 test is needed (avoids need for separate pcdecend.)
* TODO: optimize definition for data-dependent fail-first svp64
* TODO: decide which exact set of 4 modes are needed, they are currently kinda arbitrary and definitely not fully thought out.
* TODO: build demo decoder for JPEG
* TODO: build demo decoder for DEFLATE
* TODO: build demo decoder for MP3

Maybe:
* TODO: create encoding instruction if needed?

Comment 1 Jacob Lifshay 2022-09-21 21:55:40 BST

this should be implementable using a 6-deep binary tree network kinda like plru so it can decode 1 code per clock cycle (taking <=8 cycles), or more if we can cram multiple binary tree lookups in a single cycle.

Comment 2 Luke Kenneth Casson Leighton 2022-09-21 21:59:14 BST

ooo, i like it, and it's exactly the kind of fascinating thing
that "emerges" like this.

1. practical detail: 4-operand is barely enough room to fit, takes up
   26 bits on its own, is a 5/6-bit XO. if the imm is 1 bit that's 2of
   5-bit XOs. costly! has to be *really* worth it which...

2. ... are there any other uses? other common compression algorithms?

Comment 3 Jacob Lifshay 2022-09-22 00:15:32 BST

I filled out the prefix-code decode instruction description:
https://libre-soc.org/openpower/prefix_codes/

(In reply to Luke Kenneth Casson Leighton from comment #2)
> ooo, i like it, and it's exactly the kind of fascinating thing
> that "emerges" like this.

:)

> 1. practical detail: 4-operand is barely enough room to fit, takes up
>    26 bits on its own, is a 5/6-bit XO. if the imm is 1 bit that's 2of
>    5-bit XOs. costly! has to be *really* worth it which...

imm is 1 bit. that's *1* 5-bit XO, since imm fits where Rc would be. If Rc=1 is useful (probably, but I didn't work that out yet), then imho we just have Rc always be set, so it doesn't need to be encoded.
> 
> 2. ... are there any other uses? other common compression algorithms?

added some more:
https://libre-soc.org/openpower/prefix_codes/

Comment 4 Luke Kenneth Casson Leighton 2022-09-22 00:42:10 BST

love it. spotted a couple of varnames.

are there any uses for Rc=1?  setting CR0.LE bit if no index was
found? perhaps using CR.SO for overflow instead of XER.OV/SO?
3-in 2-out, expensive but worth it.

Comment 5 Luke Kenneth Casson Leighton 2022-09-22 00:43:28 BST

(In reply to Jacob Lifshay from comment #3)

> imm is 1 bit. that's *1* 5-bit XO, since imm fits where Rc would be. If Rc=1
> is useful (probably, but I didn't work that out yet), then imho we just have
> Rc always be set, so it doesn't need to be encoded.

yep,agreed

Comment 6 Luke Kenneth Casson Leighton 2022-09-22 00:45:54 BST

also adding MPEG bug #223 as this is related.

Comment 7 Jacob Lifshay 2022-09-22 01:03:24 BST

(In reply to Luke Kenneth Casson Leighton from comment #4)
> love it. spotted a couple of varnames.
> 
> are there any uses for Rc=1?  setting CR0.LE bit if no index was
> found? perhaps using CR.SO for overflow instead of XER.OV/SO?
> 3-in 2-out, expensive but worth it.

CR0.EQ is 1 iff no values were decoded, because RT=0 in that case.

Comment 8 Jacob Lifshay 2022-09-22 03:18:22 BST

rather than having RS and RC be input bit position, it might be better to have them be whatever input bits were left over from decoding, encoded like so:

0b000...0001iiiii...iii where i is an input bit. input is taken from the lsb and the register is shifted down. one of the CR0 bits (lt?) can be set to if the instruction used RB or not. doing this allows easier decoding of input streams since it doesn't need extra instructions to shift and merge the next bytes of the input stream. if RB is r0, then it just won't take any input bits from RB -- useful for decoding the last few bits of input.

before:
# input in memory at 8(r3), tree in r4, output in memory at 8(r5)
ldu r8, 8(r3)
li r7, 0
loop:
pcdec r6, r4, r8, r7
bso large_code  # encountered code larger than 6 bits
beq fill
stdu r6, 8(r5)
b loop
fill:
# whole bunch of complex stuff to take leftover bits in r6, merge new bits in, update r3 and r7
# i don't want to write it out, it'd be like 8-10 instructions
b loop
# ^ main loop -- large and complex

large_code:
beq skip
stdu r6, 8(r5)
skip:
srd r6, r6, r7
# more complex stuff...


after -- much simpler:
# input in memory at 8(r3), tree in r4, output in memory at 8(r5)
li r7, 1  # 00001 -- no input bits in r7
fill:
ldu r8, 8(r3)
loop:
pcdec r6, r4, r8, r7
bso large_code  # encountered code larger than 6 bits
stdu r6, 8(r5)
blt fill
b loop
# ^ main loop -- really simple

large_code:
beq skip_store  # anything already decoded?
stdu r6, 8(r5)
skip_store:
blt skip_fill
ldu r8, 8(r3)
skip_fill:
# lets assume the longest possible code word is 16 bits like jpeg
cntlzd r9, r7
li r10, 1
sldi r10, r10, 63  # r10 = 1 << 63
srad r10, r10, r9
andc r7, r7, r10
and r10, r8, r10
or r7, r7, r10
# i'm feeling lazy, won't write out all of skip_fill right now, it's the slow cold path anyway

Comment 9 Luke Kenneth Casson Leighton 2022-09-22 12:10:25 BST

(In reply to Jacob Lifshay from comment #7)

> CR0.EQ is 1 iff no values were decoded, because RT=0 in that case.

i mean, are there any *non-standard* uses that would be useful?
such as setting some bits of CR0 from RT (CR0.EQ if RT=0) 
but then setting *other* bits if RS is changed, for example.
and setting CR.SO instead of XER

that way it becomes possible (if practical?) to do a vectorised
call and have a parallel-analysis (parallel batch of CRfs)

l.

Comment 10 Luke Kenneth Casson Leighton 2022-09-22 18:50:47 BST

does this work at all if defined in terms of 16-bit groupings (4 hwords
per 64bit reg) which elwidth overrides can drop down to 8bit or 4bit
based on src and dest elwidth.

only relevant if greater than 8-bit prefix code is common.

Comment 11 Jacob Lifshay 2022-09-22 20:30:03 BST

(In reply to Luke Kenneth Casson Leighton from comment #10)
> does this work at all if defined in terms of 16-bit groupings (4 hwords
> per 64bit reg) which elwidth overrides can drop down to 8bit or 4bit
> based on src and dest elwidth.
> 
> only relevant if greater than 8-bit prefix code is common.

that doesn't really work, >4-bit codes are pretty common, the code size is limited to 6-bits since the tree is a 64-bit value and 6 = log2(64).

afaict this is an instruction where elwid overrides don't really work.

Comment 12 Luke Kenneth Casson Leighton 2022-09-22 22:02:06 BST

(In reply to Jacob Lifshay from comment #11)

> that doesn't really work, >4-bit codes are pretty common, the code size is
> limited to 6-bits since the tree is a 64-bit value and 6 = log2(64).

can't go to register-pairs it'd be too much. in theory could go to
a vec2/3/4 though, hmmm....

Comment 13 Jacob Lifshay 2022-09-23 04:43:01 BST

I added pcdec to openpower-isa.git, I'm running into the issue that ISACaller doesn't know how to handle instructions that write to both RT/RS, so I marked the test I added as unittest.expectedFailure. Luke, can you fix that? (I'll open a bug)

https://git.libre-soc.org/?p=openpower-isa.git;a=commit;h=5b122788a7f96555bbf5ae8ed3d18703f175588f

I also added minor_4 to the decoder:

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=4b00a4c153cd64efd9e3b7b2e4a2cdf9bb9faba9

and fixed a bug in maddld's pseudo-code:

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=d835e6024d47027d71b8f924f9d90be2f7261065

Comment 14 Luke Kenneth Casson Leighton 2022-09-23 10:06:34 BST

+    | PO    |  RT  |   RA  |   RB  | RC    | XO |once|

please keep to short names, i have told you this many times.

+56,ALU,OP_PCDEC,RA,RB,RC,RT,NONE,CR0,0
+        elif regs == ['RA', 'RB', 'RC', 'RT', '', 'CR0']:  # pcdec

you cannot use the same bit for Rc as for the immediate.

you also cannot use a different-width XO in the same Form
so a different Form will be needed.

Comment 15 Luke Kenneth Casson Leighton 2022-09-23 11:33:43 BST

(In reply to Luke Kenneth Casson Leighton from comment #14)

> +56,ALU,OP_PCDEC,RA,RB,RC,RT,NONE,CR0,0
> +        elif regs == ['RA', 'RB', 'RC', 'RT', '', 'CR0']:  # pcdec

(CRout=CR0 is exclusively reserved for instructions that
 have an Rc=1 form.  stbcx/stwcx perform explicit CR0
 writing. the instruction should be *named* "pcdec." if
 permanently writing to CR0. yes it's... awkward.)

Comment 16 Luke Kenneth Casson Leighton 2022-09-23 12:35:04 BST

    CR0 <- final_rb_used || 0b0 || (output = 0) || so_bit

deep breath, there is no register CR0 and it will not
be picked up (back in ISACaller namespace).... yet.

then also it is not going to work for SV, either, because
writing continuously to CR0 is not the desired behaviour.

writing directly to the CR Field is not the desired behaviour
*either* although with BF and BFA (etc) those can be over-ridden
(extended/walked) forward just like RA/RB/RC/RT/RS for elements,
such that CR[32+4*bf..] etc will actually write to the correct
CR Field.

(keenly aware that CRfield numbers CR8 and above will over-run
 CR, that's for another time)

if this instruction was:

    pcdec RT,RA,RB,RC,BF,imm

then it would be fine - BF could be transparently-incremented.
in theory an "implicit" BF (set to zero) could also be added (sigh)

as a stop-gap measure that would work.

Comment 17 Jacob Lifshay 2022-09-23 14:16:23 BST

(In reply to Luke Kenneth Casson Leighton from comment #14)
> +    | PO    |  RT  |   RA  |   RB  | RC    | XO |once|
> 
> please keep to short names, i have told you this many times.

once is short compared to BHRBE which is a field.

> 
> +56,ALU,OP_PCDEC,RA,RB,RC,RT,NONE,CR0,0
> +        elif regs == ['RA', 'RB', 'RC', 'RT', '', 'CR0']:  # pcdec
> 
> you cannot use the same bit for Rc as for the immediate.

PowerISA already does that in many places -- one example is the EX immediate bit is in the same place as Rc.

The key is that pcdec always writes CR, it is not optional based on a Rc immediate, therefore Rc is unused and can be used for something else.
> 
> you also cannot use a different-width XO in the same Form
> so a different Form will be needed.

It uses the exact same XO width as every other VA2 form.

That said, PowerISA has many instances of different XO widths for the same form...

XX2-form is probably the worst example (look in the v3.1B pdf, not fields.txt):
it splits XO into two fields:
XX2-form (1.6.1.21):
| ... |21 |25 |26 |29 |30|31|
| ... | XO            |BX|TX|
| ... |XO |dc |XO |dm |BX|TX|

Comment 18 Jacob Lifshay 2022-09-23 14:20:22 BST

(In reply to Luke Kenneth Casson Leighton from comment #15)
> (In reply to Luke Kenneth Casson Leighton from comment #14)
> 
> > +56,ALU,OP_PCDEC,RA,RB,RC,RT,NONE,CR0,0
> > +        elif regs == ['RA', 'RB', 'RC', 'RT', '', 'CR0']:  # pcdec
> 
> (CRout=CR0 is exclusively reserved for instructions that
>  have an Rc=1 form.  stbcx/stwcx perform explicit CR0
>  writing. the instruction should be *named* "pcdec." if
>  permanently writing to CR0. yes it's... awkward.)

i was following the phrase in v3.1B section 1.3.2 notation:
A period (.) as the last character of an instruction mnemonic means that the instruction records status information in certain fields of the Condition Register as a side effect of execution.

Somehow I got it in my head that that only meant comparing the output to zero, like how Rc usually works.

I'll rename it.

Comment 19 Jacob Lifshay 2022-09-23 14:42:59 BST

(In reply to Luke Kenneth Casson Leighton from comment #16)
>     CR0 <- final_rb_used || 0b0 || (output = 0) || so_bit
> 
> deep breath, there is no register CR0 and it will not
> be picked up (back in ISACaller namespace).... yet.

it needs to be added, writing to CR0 also occurs in stbcx.'s pseudo-code.

> 
> then also it is not going to work for SV, either, because
> writing continuously to CR0 is not the desired behaviour.

I'm imagining that SV will simply take the CR0 output and write it to the correct CR field for the current element -- that's what it should do for stbcx. anyway (stbcx. could be used in VF mode imho, there would just need to be more than 1 reservation).
> 
> writing directly to the CR Field is not the desired behaviour
> *either* although with BF and BFA (etc) those can be over-ridden
> (extended/walked) forward just like RA/RB/RC/RT/RS for elements,
> such that CR[32+4*bf..] etc will actually write to the correct
> CR Field.
> 
> (keenly aware that CRfield numbers CR8 and above will over-run
>  CR, that's for another time)

iirc the CR register isn't a SPR (mfspr can't read it), so we could just make it be a 512-bit register named CRX and CR[s:e] would be an alias for CRX[s-32:e-32] to avoid needing to change all the pseudocode and mfcr and friends would only read the msb 32-bits without being sv-prefixed.

CR[32:35] = CRX[0:3] = CR0
CR[36:39] = CRX[4:7] = CR1
...
CR[540:543] = CRX[508:511] = CR127

Comment 20 Luke Kenneth Casson Leighton 2022-09-23 15:09:54 BST

(In reply to Jacob Lifshay from comment #19)
> (In reply to Luke Kenneth Casson Leighton from comment #16)
> >     CR0 <- final_rb_used || 0b0 || (output = 0) || so_bit
> > 
> > deep breath, there is no register CR0 and it will not
> > be picked up (back in ISACaller namespace).... yet.
> 
> it needs to be added, writing to CR0 also occurs in stbcx.'s pseudo-code.

there *is* no stbcx pseudocode.

> > 
> > then also it is not going to work for SV, either, because
> > writing continuously to CR0 is not the desired behaviour.
> 
> I'm imagining that SV will simply take the CR0 output and write it to the
> correct CR field for the current element

naah. the precedent has to be explained to the OPF ISA WG.

>  -- that's what it should do for
> stbcx. anyway 

it's written out in english iirc or it is hard-set into CR, can't
recall which.


> CR[32:35] = CRX[0:3] = CR0
> CR[36:39] = CRX[4:7] = CR1
> ...
> CR[540:543] = CRX[508:511] = CR127

again, that changes the scalar spec, which will be unacceptable.

*hiding* the existence of CRX as a massive hack behind the scenes
in ISACaller, such that the OPF ISA WG never have to know it exists,
i have no problem with at all.

if however the ISA WG come forward with a proposal to change
the scalar spec to replace CR with CRX in all locations?
that i have no problem with either.

but us proposing it? that just gives them the chance to say "no"
and they will stick to it.

Comment 21 Luke Kenneth Casson Leighton 2022-09-23 15:13:46 BST

https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=openpower/isa;hb=HEAD

no "stbcx."

in csv, in fields.txt, in TestIssuer, just not pseudocode, not ISACaller.

Comment 22 Jacob Lifshay 2022-09-23 15:40:15 BST

(In reply to Luke Kenneth Casson Leighton from comment #20)
> (In reply to Jacob Lifshay from comment #19)
> > (In reply to Luke Kenneth Casson Leighton from comment #16)
> > >     CR0 <- final_rb_used || 0b0 || (output = 0) || so_bit
> > > 
> > > deep breath, there is no register CR0 and it will not
> > > be picked up (back in ISACaller namespace).... yet.
> > 
> > it needs to be added, writing to CR0 also occurs in stbcx.'s pseudo-code.
> 
> there *is* no stbcx pseudocode.

well, there is pseudocode for stbcx. in the v3.1B pdf (page 1089 (1115)) that writes to a variable named CR0
> 
> > > 
> > > then also it is not going to work for SV, either, because
> > > writing continuously to CR0 is not the desired behaviour.
> > 
> > I'm imagining that SV will simply take the CR0 output and write it to the
> > correct CR field for the current element
> 
> naah. the precedent has to be explained to the OPF ISA WG.

well, it makes sense to me, it's also exactly what Rc=1 effectively does.
> 
> >  -- that's what it should do for
> > stbcx. anyway 
> 
> it's written out in english iirc or it is hard-set into CR, can't
> recall which.
> 
> 
> > CR[32:35] = CRX[0:3] = CR0
> > CR[36:39] = CRX[4:7] = CR1
> > ...
> > CR[540:543] = CRX[508:511] = CR127
> 
> again, that changes the scalar spec, which will be unacceptable.
> 
> *hiding* the existence of CRX as a massive hack behind the scenes
> in ISACaller, such that the OPF ISA WG never have to know it exists,
> i have no problem with at all.

I thought having CRX would be less weird than saying CR is a 512-bit register but the MSB is bit 32 and the LSB is bit 543, bits 0-31 are constant zeros that aren't included in the 512 bits.

Comment 23 Luke Kenneth Casson Leighton 2022-09-23 16:36:24 BST

CR0 <- 0b00 || u2 || XERSO

perfect.  good enough for me. precedent already set.
getting the damn thing out of the compiled code, i'd
suggest adding it to the list of "write_regs" in the
parser.

that will have the side-effect of adding it to the
list of output regs returned from the function call
in ISACaller

    outputs = self.instr[ins_name](*inputs)

so watch out for that.  it's not a dict, you *must*
get the order right, matching up with the write_regs

1709         # write out any regs for this instruction
1710         if info.write_regs:
1711             for name, output in zip(output_names, results):
1712                 yield from self.check_write(info, name, output, carry_en)


do you want to try adding that or shall i do it?

Comment 24 Jacob Lifshay 2022-09-23 16:43:52 BST

(In reply to Luke Kenneth Casson Leighton from comment #23)
> do you want to try adding that or shall i do it?

you can do it if you like, along with fixing RS

Comment 25 Luke Kenneth Casson Leighton 2022-09-23 18:35:20 BST

56,ALU,OP_PCDEC,0,....1,0,NONE,0,0,pcdec,
57,ALU,OP_PCDEC,0,....1,0,NONE,0,0,pcdec,

those columns, NONE corresponding to "rc", should i think be "1"

i'll try it out.

Comment 26 Luke Kenneth Casson Leighton 2022-09-23 18:43:17 BST

(In reply to Luke Kenneth Casson Leighton from comment #25)
> 56,ALU,OP_PCDEC,0,....1,0,NONE,0,0,pcdec,
> 57,ALU,OP_PCDEC,0,....1,0,NONE,0,0,pcdec,
> 
> those columns, NONE corresponding to "rc", should i think be "1"
> 
> i'll try it out.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=38802b1df42b147ebaa413e65c872510279b1429

yep all good.

 56,ALU,OP_PCDEC,0,....1,0,ONE,0,0,pcdec,
 57,ALU,OP_PCDEC,0,....1,0,ONE,0,0,pcdec,

that's the "trick" for not needing Rc=1.  we're good to go with
CR0 as CRout, then.  meaning that sv_analysis.py can remain as this:

     elif regs == ['RA', 'RB', 'RC', 'RT', '', 'CR0']:  # pcdec

what a hatchet-job.  it works though.

the results aren't what is expected, i'll leave that to you to
sort out, but RS is written to, CR0 is written to, it's all there.

Comment 27 Luke Kenneth Casson Leighton 2022-09-23 18:48:36 BST

btw one thing to consider: using overwrite-variants (RT=src+dest).
this _has_ been done before (for ternlog[i])

@unique
class In3Sel(Enum):
    NONE = 0
    RS = 1
    RB = 2  # for shiftrot (M-Form)
    FRS = 3
    FRC = 4
    RC = 5  # for SVP64 bit-reverse LD/ST
    RT = 6  # for ternlog[i]

the VA-Form EXT004 locations are extremely precious: there had
better be a really good reason for using them, such as "chaining"
of "sv.pcdec/mr".

Comment 28 Jacob Lifshay 2022-09-24 01:26:58 BST

got all tests to pass!

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=a3d71d6e00a14a6cadff2f65662758a123600daf

Added more tasks to top comment:
https://bugs.libre-soc.org/show_bug.cgi?id=933#c0

Comment 29 Jacob Lifshay 2022-09-24 02:31:01 BST

Additional Notes:

it turns out that with a 64-bit tree the code length limit is 5-bits:
2 possible 1-bit codes
4 possible 2-bit codes
8 possible 3-bit codes
16 possible 4-bit codes
32 possible 5-bit codes
sum: 62 which just barely fits in a 64-bit register


a prefix-code encode instruction is possibly not as necessary, since it's possible to encode in parallel by looking up the symbol's bits and lengths in a table, and shifting them into position in parallel and or-ing them all together.
This doesn't have a serial dependency like prefix-code decoding does, so is much easier to vectorize.

Comment 30 Luke Kenneth Casson Leighton 2022-09-24 09:49:52 BST

(In reply to Jacob Lifshay from comment #29)

> a prefix-code encode instruction is possibly not as necessary, since it's
> possible to encode in parallel by looking up the symbol's bits and lengths
> in a table, and shifting them into position in parallel and or-ing them all
> together.

mapreduce-mode on a scalar-targetted sv.or would do the job there.
sv.or/mr

some time ago i was looking at a bitmanip instruction to combine the
shift-and-or into one instruction.  bmset group? v. expensive operations,
3-in 1-out, has to be done as an RT-overwrite

Comment 31 Luke Kenneth Casson Leighton 2022-09-24 16:20:25 BST

commit db6d324297e415eb2358df6c31d193dae59dc256 (HEAD -> master, origin/master, origin/HEAD)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Sat Sep 24 17:17:55 2022 +0100

    add RC1 support to ISACaller.

"sv.pcdec./ff=RC1/VLI" should now work

Comment 32 Luke Kenneth Casson Leighton 2022-09-26 12:06:09 BST

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=c6fa0d418b53b2e94eea2e0bf378217daa66d668

adding (RB|0) cannot be done lightly (neither can RC|0)
as it means actually passing flags down to ALUs in HDL
(which know nothing about register numbers).

swapped RA with RB

Comment 33 Jacob Lifshay 2022-09-26 16:11:29 BST

I have an idea for better fallback decoding: use pcdec. once=1 multiple times, looking up the `tree` value in a table, this allows reading up to 5 bits each time, which is small enough to be a reasonable in-memory table size.

basically (taking into account RA/RB now swapped):
fallback_table = [...]
tree = 0xFFFF_FFFF_0000_0000
bits = 1  # u64
while True:
    pcdec. addr, next_input, tree, leftover_bits, 1
    leftover_bits = RS
    if used_RA:
        next_input = *input++
    if addr = 0:
        error("invalid code")
    bit_count = floor_log2(bits)  # 63 - count_leading_zeros(bits)
    bits &= ~(1 << bit_count)
    bits |= addr << bit_count
    if addr < 32 or fallback_table is None:
        break  # done
    # now 32 <= addr < 64, so fallback_table needs 32 entries
    tree, fallback_table = fallback_table[addr - 32]
# decoded code is in bits

Comment 34 Jacob Lifshay 2022-09-27 03:15:15 BST

well, unfortunately, it turns out that every codec I looked at (JPEG, DEFLATE, tried looking at MP3 but gave up after reading a bunch of LAME's source) interleaves prefix-codes with plain binary codes, making the best they can do be to use pcdec. once=1. If we don't need to decode multiple prefix-codes in sequence, pcdec. looses a bunch of its advantage over just using an in-memory look-up table, though it still may be worth it.

I'm currently working on a demo jpeg partial decoder.

Comment 35 Jacob Lifshay 2022-09-30 04:41:00 BST

After writing code to generate the lookup tables needed for JPEG decoding from the huffman tables in the JPEG, I realized that those tables are huge and mostly zeros, so I'm altering pcdec.'s definition to optionally compress the output index (explained below).

the least-significant 2 bits of the `tree` input are unused, so I crammed a 2-bit mode in there instead of having a `once` immediate (should be fine because `tree` values are only rarely created), allowing selecting a few different modes that output compressed/uncompressed indexes and optionally read 6-bits into the output when nothing matches in the `tree` input (allowing processing 6-bits at a time of input instead of 5 for long prefix codes).

compressed output indexes:
basically if `tree` is:
 0b1100000001000100
(codes for "0", "10", "110", and "111")
then:
decoded input    uncompressed index    compressed index
"0"              0b1_0                 0
"10"             0b1_10                1
"110"            0b1_110               2
"111"            0b1_111               3

the compressed index can be calculated by calculating the number of 1-bits that are less-significant than the bit 1<<uncompressed_index, ignoring the two LSB `tree` bits (since they're used for `mode` instead)

this compressed index encoding actually largely matches how JPEG stores huffman codes' corresponding values.

Comment 36 Jacob Lifshay 2022-10-02 09:29:23 BST

lkcl, what do you think of declaring this bug and the jpeg/mp3 bugs to be completed, since there is a pcdec. instruction with working unittests and the nlnet deadline is very close?

If that sounds good, i can submit RFPs sunday night or monday, I'll be busy most of sunday.

Comment 37 Luke Kenneth Casson Leighton 2022-10-02 11:18:39 BST

(In reply to Jacob Lifshay from comment #36)
> lkcl, what do you think of declaring this bug and the jpeg/mp3 bugs to be
> completed, since there is a pcdec. instruction with working unittests and
> the nlnet deadline is very close?

yep go for it, can you show (somehow) the relationship to mp3?
(a function in some source code that a "pcdec." loop replaces?)

Comment 38 Jacob Lifshay 2022-10-02 17:47:33 BST

(In reply to Luke Kenneth Casson Leighton from comment #37)
> (In reply to Jacob Lifshay from comment #36)
> > lkcl, what do you think of declaring this bug and the jpeg/mp3 bugs to be
> > completed, since there is a pcdec. instruction with working unittests and
> > the nlnet deadline is very close?
> 
> yep go for it,

ok, will do later

> can you show (somehow) the relationship to mp3?
> (a function in some source code that a "pcdec." loop replaces?)

found what is huffman decoding afaict (libmpg123 is used by latest LAME trunk for mp3 decoding, it used to be vendored into LAME's source tree):
https://www.mpg123.de/cgi-bin/scm/mpg123/trunk/src/libmpg123/layer3.c?revision=4971&view=markup#l622
(reindented to fit)
619 const short *val = h->table;
620 REFRESH_MASK;
621 #ifdef USE_NEW_HUFFTABLE
622 while((y=val[(MASK_UTYPE)mask>>(BITSHIFT+4)])<0)
623 {
624     val -= y;
625     num -= 4;
626     mask <<= 4;
627 }
628 num -= (y >> 8);
629 mask <<= (y >> 8);
630 x = (y >> 4) & 0xf;
631 y &= 0xf;

that while loop would be replaced with a loop on a pcdec. instruction and would need to loop 4/6 as often since pcdec. processes 6 bits at a time and that loop processes 4 bits at a time.

Comment 39 Jacob Lifshay 2022-10-02 17:49:25 BST

(In reply to Jacob Lifshay from comment #38)
> found what is huffman decoding afaict (libmpg123 is used by latest LAME
> trunk for mp3 decoding, it used to be vendored into LAME's source tree):
> https://www.mpg123.de/cgi-bin/scm/mpg123/trunk/src/libmpg123/layer3.c?revision=4971&view=markup#l622
> that while loop would be replaced with a loop on a pcdec. instruction and
> would need to loop 4/6 as often since pcdec. processes 6 bits at a time and
> that loop processes 4 bits at a time.

i'm sure that isn't the only loop that can be sped up with pcdec.

Comment 40 Luke Kenneth Casson Leighton 2022-10-02 18:09:13 BST

(In reply to Jacob Lifshay from comment #39)
> (In reply to Jacob Lifshay from comment #38)
> > https://www.mpg123.de/cgi-bin/scm/mpg123/trunk/src/libmpg123/layer3.c?revision=4971&view=markup#l622
> > that while loop would be replaced with a loop on a pcdec. instruction and
> > would need to loop 4/6 as often since pcdec. processes 6 bits at a time and
> > that loop processes 4 bits at a time.
> 
> i'm sure that isn't the only loop that can be sped up with pcdec.

the bit i'd really like to see is, does "sv.pcdec./ff=SO"
do what's expected (and useful): truncate VL at a point which
can be easily detected, does "sv.bc SO" help in detecting that
there was an overflow in the vector of ConditionCodes, and so on.

Comment 41 Jacob Lifshay 2022-10-02 18:30:13 BST

(In reply to Luke Kenneth Casson Leighton from comment #40)
> (In reply to Jacob Lifshay from comment #39)
> > (In reply to Jacob Lifshay from comment #38)
> > > https://www.mpg123.de/cgi-bin/scm/mpg123/trunk/src/libmpg123/layer3.c?revision=4971&view=markup#l622
> > > that while loop would be replaced with a loop on a pcdec. instruction and
> > > would need to loop 4/6 as often since pcdec. processes 6 bits at a time and
> > > that loop processes 4 bits at a time.
> > 
> > i'm sure that isn't the only loop that can be sped up with pcdec.
> 
> the bit i'd really like to see is, does "sv.pcdec./ff=SO"
> do what's expected (and useful): truncate VL at a point which
> can be easily detected, does "sv.bc SO" help in detecting that
> there was an overflow in the vector of ConditionCodes, and so on.

it turns out that because prefix-codes are almost always (at least in both JPEG and DEFLATE) interleaved with plain binary bits or prefix-codes using different `tree` values, attempting to vectorize prefix-code decoding doesn't really work. even if it did, it wouldn't be much faster than a scalar pcdec. loop because of the scalar dependency chain between each pcdec. and the next/prev pcdec.

that said, sv.pcdec./ff=ne should do what you're imagining afaict.

Comment 42 Luke Kenneth Casson Leighton 2022-10-02 18:57:03 BST

(In reply to Jacob Lifshay from comment #41)

> it turns out that because prefix-codes are almost always (at least in both
> JPEG and DEFLATE) interleaved with plain binary bits or prefix-codes using
> different `tree` values, attempting to vectorize prefix-code decoding
> doesn't really work. even if it did, it wouldn't be much faster than a
> scalar pcdec. loop because of the scalar dependency chain between each
> pcdec. and the next/prev pcdec.

getting the instructions *into* a multi-issue pipeline is one priority,
the other is allowing high-performance implementations to perform
macro-op back-end ALUs, similar to the bigint wide ALUs

> that said, sv.pcdec./ff=ne should do what you're imagining afaict.

brilliant.