While working on the JPEG decode/encode, I realized that we'll want instructions to accelerate encoding/decoding huffman codes for jpeg, deflate (gzip, zip, png, etc.), mp3, etc. I am writing instructions to handle all prefix codes (more than just huffman codes), decoding codes that are <=6-bit, longer codes are assumed to be less common so can take a slow path (decode instruction clears a CR bit to signal a >5-bit code). WIP initial draft here: https://libre-soc.org/openpower/prefix_codes/ so far I have the pseudo-code for an decode instruction. the huffman-style tree, along with a 2-bit mode is encoded in an input 64-bit register. Task list: * ... * DONE: rename to `pcdec.` * DONE: initial implementation and tests * TODO: explain why _RB=0 test is needed (avoids need for separate pcdecend.) * TODO: optimize definition for data-dependent fail-first svp64 * TODO: decide which exact set of 4 modes are needed, they are currently kinda arbitrary and definitely not fully thought out. * TODO: build demo decoder for JPEG * TODO: build demo decoder for DEFLATE * TODO: build demo decoder for MP3 Maybe: * TODO: create encoding instruction if needed?
this should be implementable using a 6-deep binary tree network kinda like plru so it can decode 1 code per clock cycle (taking <=8 cycles), or more if we can cram multiple binary tree lookups in a single cycle.
ooo, i like it, and it's exactly the kind of fascinating thing that "emerges" like this. 1. practical detail: 4-operand is barely enough room to fit, takes up 26 bits on its own, is a 5/6-bit XO. if the imm is 1 bit that's 2of 5-bit XOs. costly! has to be *really* worth it which... 2. ... are there any other uses? other common compression algorithms?
I filled out the prefix-code decode instruction description: https://libre-soc.org/openpower/prefix_codes/ (In reply to Luke Kenneth Casson Leighton from comment #2) > ooo, i like it, and it's exactly the kind of fascinating thing > that "emerges" like this. :) > 1. practical detail: 4-operand is barely enough room to fit, takes up > 26 bits on its own, is a 5/6-bit XO. if the imm is 1 bit that's 2of > 5-bit XOs. costly! has to be *really* worth it which... imm is 1 bit. that's *1* 5-bit XO, since imm fits where Rc would be. If Rc=1 is useful (probably, but I didn't work that out yet), then imho we just have Rc always be set, so it doesn't need to be encoded. > > 2. ... are there any other uses? other common compression algorithms? added some more: https://libre-soc.org/openpower/prefix_codes/
love it. spotted a couple of varnames. are there any uses for Rc=1? setting CR0.LE bit if no index was found? perhaps using CR.SO for overflow instead of XER.OV/SO? 3-in 2-out, expensive but worth it.
(In reply to Jacob Lifshay from comment #3) > imm is 1 bit. that's *1* 5-bit XO, since imm fits where Rc would be. If Rc=1 > is useful (probably, but I didn't work that out yet), then imho we just have > Rc always be set, so it doesn't need to be encoded. yep,agreed
also adding MPEG bug #223 as this is related.
(In reply to Luke Kenneth Casson Leighton from comment #4) > love it. spotted a couple of varnames. > > are there any uses for Rc=1? setting CR0.LE bit if no index was > found? perhaps using CR.SO for overflow instead of XER.OV/SO? > 3-in 2-out, expensive but worth it. CR0.EQ is 1 iff no values were decoded, because RT=0 in that case.
rather than having RS and RC be input bit position, it might be better to have them be whatever input bits were left over from decoding, encoded like so: 0b000...0001iiiii...iii where i is an input bit. input is taken from the lsb and the register is shifted down. one of the CR0 bits (lt?) can be set to if the instruction used RB or not. doing this allows easier decoding of input streams since it doesn't need extra instructions to shift and merge the next bytes of the input stream. if RB is r0, then it just won't take any input bits from RB -- useful for decoding the last few bits of input. before: # input in memory at 8(r3), tree in r4, output in memory at 8(r5) ldu r8, 8(r3) li r7, 0 loop: pcdec r6, r4, r8, r7 bso large_code # encountered code larger than 6 bits beq fill stdu r6, 8(r5) b loop fill: # whole bunch of complex stuff to take leftover bits in r6, merge new bits in, update r3 and r7 # i don't want to write it out, it'd be like 8-10 instructions b loop # ^ main loop -- large and complex large_code: beq skip stdu r6, 8(r5) skip: srd r6, r6, r7 # more complex stuff... after -- much simpler: # input in memory at 8(r3), tree in r4, output in memory at 8(r5) li r7, 1 # 00001 -- no input bits in r7 fill: ldu r8, 8(r3) loop: pcdec r6, r4, r8, r7 bso large_code # encountered code larger than 6 bits stdu r6, 8(r5) blt fill b loop # ^ main loop -- really simple large_code: beq skip_store # anything already decoded? stdu r6, 8(r5) skip_store: blt skip_fill ldu r8, 8(r3) skip_fill: # lets assume the longest possible code word is 16 bits like jpeg cntlzd r9, r7 li r10, 1 sldi r10, r10, 63 # r10 = 1 << 63 srad r10, r10, r9 andc r7, r7, r10 and r10, r8, r10 or r7, r7, r10 # i'm feeling lazy, won't write out all of skip_fill right now, it's the slow cold path anyway
(In reply to Jacob Lifshay from comment #7) > CR0.EQ is 1 iff no values were decoded, because RT=0 in that case. i mean, are there any *non-standard* uses that would be useful? such as setting some bits of CR0 from RT (CR0.EQ if RT=0) but then setting *other* bits if RS is changed, for example. and setting CR.SO instead of XER that way it becomes possible (if practical?) to do a vectorised call and have a parallel-analysis (parallel batch of CRfs) l.
does this work at all if defined in terms of 16-bit groupings (4 hwords per 64bit reg) which elwidth overrides can drop down to 8bit or 4bit based on src and dest elwidth. only relevant if greater than 8-bit prefix code is common.
(In reply to Luke Kenneth Casson Leighton from comment #10) > does this work at all if defined in terms of 16-bit groupings (4 hwords > per 64bit reg) which elwidth overrides can drop down to 8bit or 4bit > based on src and dest elwidth. > > only relevant if greater than 8-bit prefix code is common. that doesn't really work, >4-bit codes are pretty common, the code size is limited to 6-bits since the tree is a 64-bit value and 6 = log2(64). afaict this is an instruction where elwid overrides don't really work.
(In reply to Jacob Lifshay from comment #11) > that doesn't really work, >4-bit codes are pretty common, the code size is > limited to 6-bits since the tree is a 64-bit value and 6 = log2(64). can't go to register-pairs it'd be too much. in theory could go to a vec2/3/4 though, hmmm....
I added pcdec to openpower-isa.git, I'm running into the issue that ISACaller doesn't know how to handle instructions that write to both RT/RS, so I marked the test I added as unittest.expectedFailure. Luke, can you fix that? (I'll open a bug) https://git.libre-soc.org/?p=openpower-isa.git;a=commit;h=5b122788a7f96555bbf5ae8ed3d18703f175588f I also added minor_4 to the decoder: https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=4b00a4c153cd64efd9e3b7b2e4a2cdf9bb9faba9 and fixed a bug in maddld's pseudo-code: https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=d835e6024d47027d71b8f924f9d90be2f7261065
+ | PO | RT | RA | RB | RC | XO |once| please keep to short names, i have told you this many times. +56,ALU,OP_PCDEC,RA,RB,RC,RT,NONE,CR0,0 + elif regs == ['RA', 'RB', 'RC', 'RT', '', 'CR0']: # pcdec you cannot use the same bit for Rc as for the immediate. you also cannot use a different-width XO in the same Form so a different Form will be needed.
(In reply to Luke Kenneth Casson Leighton from comment #14) > +56,ALU,OP_PCDEC,RA,RB,RC,RT,NONE,CR0,0 > + elif regs == ['RA', 'RB', 'RC', 'RT', '', 'CR0']: # pcdec (CRout=CR0 is exclusively reserved for instructions that have an Rc=1 form. stbcx/stwcx perform explicit CR0 writing. the instruction should be *named* "pcdec." if permanently writing to CR0. yes it's... awkward.)
CR0 <- final_rb_used || 0b0 || (output = 0) || so_bit deep breath, there is no register CR0 and it will not be picked up (back in ISACaller namespace).... yet. then also it is not going to work for SV, either, because writing continuously to CR0 is not the desired behaviour. writing directly to the CR Field is not the desired behaviour *either* although with BF and BFA (etc) those can be over-ridden (extended/walked) forward just like RA/RB/RC/RT/RS for elements, such that CR[32+4*bf..] etc will actually write to the correct CR Field. (keenly aware that CRfield numbers CR8 and above will over-run CR, that's for another time) if this instruction was: pcdec RT,RA,RB,RC,BF,imm then it would be fine - BF could be transparently-incremented. in theory an "implicit" BF (set to zero) could also be added (sigh) as a stop-gap measure that would work.
(In reply to Luke Kenneth Casson Leighton from comment #14) > + | PO | RT | RA | RB | RC | XO |once| > > please keep to short names, i have told you this many times. once is short compared to BHRBE which is a field. > > +56,ALU,OP_PCDEC,RA,RB,RC,RT,NONE,CR0,0 > + elif regs == ['RA', 'RB', 'RC', 'RT', '', 'CR0']: # pcdec > > you cannot use the same bit for Rc as for the immediate. PowerISA already does that in many places -- one example is the EX immediate bit is in the same place as Rc. The key is that pcdec always writes CR, it is not optional based on a Rc immediate, therefore Rc is unused and can be used for something else. > > you also cannot use a different-width XO in the same Form > so a different Form will be needed. It uses the exact same XO width as every other VA2 form. That said, PowerISA has many instances of different XO widths for the same form... XX2-form is probably the worst example (look in the v3.1B pdf, not fields.txt): it splits XO into two fields: XX2-form (1.6.1.21): | ... |21 |25 |26 |29 |30|31| | ... | XO |BX|TX| | ... |XO |dc |XO |dm |BX|TX|
(In reply to Luke Kenneth Casson Leighton from comment #15) > (In reply to Luke Kenneth Casson Leighton from comment #14) > > > +56,ALU,OP_PCDEC,RA,RB,RC,RT,NONE,CR0,0 > > + elif regs == ['RA', 'RB', 'RC', 'RT', '', 'CR0']: # pcdec > > (CRout=CR0 is exclusively reserved for instructions that > have an Rc=1 form. stbcx/stwcx perform explicit CR0 > writing. the instruction should be *named* "pcdec." if > permanently writing to CR0. yes it's... awkward.) i was following the phrase in v3.1B section 1.3.2 notation: A period (.) as the last character of an instruction mnemonic means that the instruction records status information in certain fields of the Condition Register as a side effect of execution. Somehow I got it in my head that that only meant comparing the output to zero, like how Rc usually works. I'll rename it.
(In reply to Luke Kenneth Casson Leighton from comment #16) > CR0 <- final_rb_used || 0b0 || (output = 0) || so_bit > > deep breath, there is no register CR0 and it will not > be picked up (back in ISACaller namespace).... yet. it needs to be added, writing to CR0 also occurs in stbcx.'s pseudo-code. > > then also it is not going to work for SV, either, because > writing continuously to CR0 is not the desired behaviour. I'm imagining that SV will simply take the CR0 output and write it to the correct CR field for the current element -- that's what it should do for stbcx. anyway (stbcx. could be used in VF mode imho, there would just need to be more than 1 reservation). > > writing directly to the CR Field is not the desired behaviour > *either* although with BF and BFA (etc) those can be over-ridden > (extended/walked) forward just like RA/RB/RC/RT/RS for elements, > such that CR[32+4*bf..] etc will actually write to the correct > CR Field. > > (keenly aware that CRfield numbers CR8 and above will over-run > CR, that's for another time) iirc the CR register isn't a SPR (mfspr can't read it), so we could just make it be a 512-bit register named CRX and CR[s:e] would be an alias for CRX[s-32:e-32] to avoid needing to change all the pseudocode and mfcr and friends would only read the msb 32-bits without being sv-prefixed. CR[32:35] = CRX[0:3] = CR0 CR[36:39] = CRX[4:7] = CR1 ... CR[540:543] = CRX[508:511] = CR127
(In reply to Jacob Lifshay from comment #19) > (In reply to Luke Kenneth Casson Leighton from comment #16) > > CR0 <- final_rb_used || 0b0 || (output = 0) || so_bit > > > > deep breath, there is no register CR0 and it will not > > be picked up (back in ISACaller namespace).... yet. > > it needs to be added, writing to CR0 also occurs in stbcx.'s pseudo-code. there *is* no stbcx pseudocode. > > > > then also it is not going to work for SV, either, because > > writing continuously to CR0 is not the desired behaviour. > > I'm imagining that SV will simply take the CR0 output and write it to the > correct CR field for the current element naah. the precedent has to be explained to the OPF ISA WG. > -- that's what it should do for > stbcx. anyway it's written out in english iirc or it is hard-set into CR, can't recall which. > CR[32:35] = CRX[0:3] = CR0 > CR[36:39] = CRX[4:7] = CR1 > ... > CR[540:543] = CRX[508:511] = CR127 again, that changes the scalar spec, which will be unacceptable. *hiding* the existence of CRX as a massive hack behind the scenes in ISACaller, such that the OPF ISA WG never have to know it exists, i have no problem with at all. if however the ISA WG come forward with a proposal to change the scalar spec to replace CR with CRX in all locations? that i have no problem with either. but us proposing it? that just gives them the chance to say "no" and they will stick to it.
https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=openpower/isa;hb=HEAD no "stbcx." in csv, in fields.txt, in TestIssuer, just not pseudocode, not ISACaller.
(In reply to Luke Kenneth Casson Leighton from comment #20) > (In reply to Jacob Lifshay from comment #19) > > (In reply to Luke Kenneth Casson Leighton from comment #16) > > > CR0 <- final_rb_used || 0b0 || (output = 0) || so_bit > > > > > > deep breath, there is no register CR0 and it will not > > > be picked up (back in ISACaller namespace).... yet. > > > > it needs to be added, writing to CR0 also occurs in stbcx.'s pseudo-code. > > there *is* no stbcx pseudocode. well, there is pseudocode for stbcx. in the v3.1B pdf (page 1089 (1115)) that writes to a variable named CR0 > > > > > > > then also it is not going to work for SV, either, because > > > writing continuously to CR0 is not the desired behaviour. > > > > I'm imagining that SV will simply take the CR0 output and write it to the > > correct CR field for the current element > > naah. the precedent has to be explained to the OPF ISA WG. well, it makes sense to me, it's also exactly what Rc=1 effectively does. > > > -- that's what it should do for > > stbcx. anyway > > it's written out in english iirc or it is hard-set into CR, can't > recall which. > > > > CR[32:35] = CRX[0:3] = CR0 > > CR[36:39] = CRX[4:7] = CR1 > > ... > > CR[540:543] = CRX[508:511] = CR127 > > again, that changes the scalar spec, which will be unacceptable. > > *hiding* the existence of CRX as a massive hack behind the scenes > in ISACaller, such that the OPF ISA WG never have to know it exists, > i have no problem with at all. I thought having CRX would be less weird than saying CR is a 512-bit register but the MSB is bit 32 and the LSB is bit 543, bits 0-31 are constant zeros that aren't included in the 512 bits.
CR0 <- 0b00 || u2 || XERSO perfect. good enough for me. precedent already set. getting the damn thing out of the compiled code, i'd suggest adding it to the list of "write_regs" in the parser. that will have the side-effect of adding it to the list of output regs returned from the function call in ISACaller outputs = self.instr[ins_name](*inputs) so watch out for that. it's not a dict, you *must* get the order right, matching up with the write_regs 1709 # write out any regs for this instruction 1710 if info.write_regs: 1711 for name, output in zip(output_names, results): 1712 yield from self.check_write(info, name, output, carry_en) do you want to try adding that or shall i do it?
(In reply to Luke Kenneth Casson Leighton from comment #23) > do you want to try adding that or shall i do it? you can do it if you like, along with fixing RS
56,ALU,OP_PCDEC,0,....1,0,NONE,0,0,pcdec, 57,ALU,OP_PCDEC,0,....1,0,NONE,0,0,pcdec, those columns, NONE corresponding to "rc", should i think be "1" i'll try it out.
(In reply to Luke Kenneth Casson Leighton from comment #25) > 56,ALU,OP_PCDEC,0,....1,0,NONE,0,0,pcdec, > 57,ALU,OP_PCDEC,0,....1,0,NONE,0,0,pcdec, > > those columns, NONE corresponding to "rc", should i think be "1" > > i'll try it out. https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=38802b1df42b147ebaa413e65c872510279b1429 yep all good. 56,ALU,OP_PCDEC,0,....1,0,ONE,0,0,pcdec, 57,ALU,OP_PCDEC,0,....1,0,ONE,0,0,pcdec, that's the "trick" for not needing Rc=1. we're good to go with CR0 as CRout, then. meaning that sv_analysis.py can remain as this: elif regs == ['RA', 'RB', 'RC', 'RT', '', 'CR0']: # pcdec what a hatchet-job. it works though. the results aren't what is expected, i'll leave that to you to sort out, but RS is written to, CR0 is written to, it's all there.
btw one thing to consider: using overwrite-variants (RT=src+dest). this _has_ been done before (for ternlog[i]) @unique class In3Sel(Enum): NONE = 0 RS = 1 RB = 2 # for shiftrot (M-Form) FRS = 3 FRC = 4 RC = 5 # for SVP64 bit-reverse LD/ST RT = 6 # for ternlog[i] the VA-Form EXT004 locations are extremely precious: there had better be a really good reason for using them, such as "chaining" of "sv.pcdec/mr".
got all tests to pass! https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=a3d71d6e00a14a6cadff2f65662758a123600daf Added more tasks to top comment: https://bugs.libre-soc.org/show_bug.cgi?id=933#c0
Additional Notes: it turns out that with a 64-bit tree the code length limit is 5-bits: 2 possible 1-bit codes 4 possible 2-bit codes 8 possible 3-bit codes 16 possible 4-bit codes 32 possible 5-bit codes sum: 62 which just barely fits in a 64-bit register a prefix-code encode instruction is possibly not as necessary, since it's possible to encode in parallel by looking up the symbol's bits and lengths in a table, and shifting them into position in parallel and or-ing them all together. This doesn't have a serial dependency like prefix-code decoding does, so is much easier to vectorize.
(In reply to Jacob Lifshay from comment #29) > a prefix-code encode instruction is possibly not as necessary, since it's > possible to encode in parallel by looking up the symbol's bits and lengths > in a table, and shifting them into position in parallel and or-ing them all > together. mapreduce-mode on a scalar-targetted sv.or would do the job there. sv.or/mr some time ago i was looking at a bitmanip instruction to combine the shift-and-or into one instruction. bmset group? v. expensive operations, 3-in 1-out, has to be done as an RT-overwrite
commit db6d324297e415eb2358df6c31d193dae59dc256 (HEAD -> master, origin/master, origin/HEAD) Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net> Date: Sat Sep 24 17:17:55 2022 +0100 add RC1 support to ISACaller. "sv.pcdec./ff=RC1/VLI" should now work
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=c6fa0d418b53b2e94eea2e0bf378217daa66d668 adding (RB|0) cannot be done lightly (neither can RC|0) as it means actually passing flags down to ALUs in HDL (which know nothing about register numbers). swapped RA with RB
I have an idea for better fallback decoding: use pcdec. once=1 multiple times, looking up the `tree` value in a table, this allows reading up to 5 bits each time, which is small enough to be a reasonable in-memory table size. basically (taking into account RA/RB now swapped): fallback_table = [...] tree = 0xFFFF_FFFF_0000_0000 bits = 1 # u64 while True: pcdec. addr, next_input, tree, leftover_bits, 1 leftover_bits = RS if used_RA: next_input = *input++ if addr = 0: error("invalid code") bit_count = floor_log2(bits) # 63 - count_leading_zeros(bits) bits &= ~(1 << bit_count) bits |= addr << bit_count if addr < 32 or fallback_table is None: break # done # now 32 <= addr < 64, so fallback_table needs 32 entries tree, fallback_table = fallback_table[addr - 32] # decoded code is in bits
well, unfortunately, it turns out that every codec I looked at (JPEG, DEFLATE, tried looking at MP3 but gave up after reading a bunch of LAME's source) interleaves prefix-codes with plain binary codes, making the best they can do be to use pcdec. once=1. If we don't need to decode multiple prefix-codes in sequence, pcdec. looses a bunch of its advantage over just using an in-memory look-up table, though it still may be worth it. I'm currently working on a demo jpeg partial decoder.
After writing code to generate the lookup tables needed for JPEG decoding from the huffman tables in the JPEG, I realized that those tables are huge and mostly zeros, so I'm altering pcdec.'s definition to optionally compress the output index (explained below). the least-significant 2 bits of the `tree` input are unused, so I crammed a 2-bit mode in there instead of having a `once` immediate (should be fine because `tree` values are only rarely created), allowing selecting a few different modes that output compressed/uncompressed indexes and optionally read 6-bits into the output when nothing matches in the `tree` input (allowing processing 6-bits at a time of input instead of 5 for long prefix codes). compressed output indexes: basically if `tree` is: 0b1100000001000100 (codes for "0", "10", "110", and "111") then: decoded input uncompressed index compressed index "0" 0b1_0 0 "10" 0b1_10 1 "110" 0b1_110 2 "111" 0b1_111 3 the compressed index can be calculated by calculating the number of 1-bits that are less-significant than the bit 1<<uncompressed_index, ignoring the two LSB `tree` bits (since they're used for `mode` instead) this compressed index encoding actually largely matches how JPEG stores huffman codes' corresponding values.
lkcl, what do you think of declaring this bug and the jpeg/mp3 bugs to be completed, since there is a pcdec. instruction with working unittests and the nlnet deadline is very close? If that sounds good, i can submit RFPs sunday night or monday, I'll be busy most of sunday.
(In reply to Jacob Lifshay from comment #36) > lkcl, what do you think of declaring this bug and the jpeg/mp3 bugs to be > completed, since there is a pcdec. instruction with working unittests and > the nlnet deadline is very close? yep go for it, can you show (somehow) the relationship to mp3? (a function in some source code that a "pcdec." loop replaces?)
(In reply to Luke Kenneth Casson Leighton from comment #37) > (In reply to Jacob Lifshay from comment #36) > > lkcl, what do you think of declaring this bug and the jpeg/mp3 bugs to be > > completed, since there is a pcdec. instruction with working unittests and > > the nlnet deadline is very close? > > yep go for it, ok, will do later > can you show (somehow) the relationship to mp3? > (a function in some source code that a "pcdec." loop replaces?) found what is huffman decoding afaict (libmpg123 is used by latest LAME trunk for mp3 decoding, it used to be vendored into LAME's source tree): https://www.mpg123.de/cgi-bin/scm/mpg123/trunk/src/libmpg123/layer3.c?revision=4971&view=markup#l622 (reindented to fit) 619 const short *val = h->table; 620 REFRESH_MASK; 621 #ifdef USE_NEW_HUFFTABLE 622 while((y=val[(MASK_UTYPE)mask>>(BITSHIFT+4)])<0) 623 { 624 val -= y; 625 num -= 4; 626 mask <<= 4; 627 } 628 num -= (y >> 8); 629 mask <<= (y >> 8); 630 x = (y >> 4) & 0xf; 631 y &= 0xf; that while loop would be replaced with a loop on a pcdec. instruction and would need to loop 4/6 as often since pcdec. processes 6 bits at a time and that loop processes 4 bits at a time.
(In reply to Jacob Lifshay from comment #38) > found what is huffman decoding afaict (libmpg123 is used by latest LAME > trunk for mp3 decoding, it used to be vendored into LAME's source tree): > https://www.mpg123.de/cgi-bin/scm/mpg123/trunk/src/libmpg123/layer3.c?revision=4971&view=markup#l622 > that while loop would be replaced with a loop on a pcdec. instruction and > would need to loop 4/6 as often since pcdec. processes 6 bits at a time and > that loop processes 4 bits at a time. i'm sure that isn't the only loop that can be sped up with pcdec.
(In reply to Jacob Lifshay from comment #39) > (In reply to Jacob Lifshay from comment #38) > > https://www.mpg123.de/cgi-bin/scm/mpg123/trunk/src/libmpg123/layer3.c?revision=4971&view=markup#l622 > > that while loop would be replaced with a loop on a pcdec. instruction and > > would need to loop 4/6 as often since pcdec. processes 6 bits at a time and > > that loop processes 4 bits at a time. > > i'm sure that isn't the only loop that can be sped up with pcdec. the bit i'd really like to see is, does "sv.pcdec./ff=SO" do what's expected (and useful): truncate VL at a point which can be easily detected, does "sv.bc SO" help in detecting that there was an overflow in the vector of ConditionCodes, and so on.
(In reply to Luke Kenneth Casson Leighton from comment #40) > (In reply to Jacob Lifshay from comment #39) > > (In reply to Jacob Lifshay from comment #38) > > > https://www.mpg123.de/cgi-bin/scm/mpg123/trunk/src/libmpg123/layer3.c?revision=4971&view=markup#l622 > > > that while loop would be replaced with a loop on a pcdec. instruction and > > > would need to loop 4/6 as often since pcdec. processes 6 bits at a time and > > > that loop processes 4 bits at a time. > > > > i'm sure that isn't the only loop that can be sped up with pcdec. > > the bit i'd really like to see is, does "sv.pcdec./ff=SO" > do what's expected (and useful): truncate VL at a point which > can be easily detected, does "sv.bc SO" help in detecting that > there was an overflow in the vector of ConditionCodes, and so on. it turns out that because prefix-codes are almost always (at least in both JPEG and DEFLATE) interleaved with plain binary bits or prefix-codes using different `tree` values, attempting to vectorize prefix-code decoding doesn't really work. even if it did, it wouldn't be much faster than a scalar pcdec. loop because of the scalar dependency chain between each pcdec. and the next/prev pcdec. that said, sv.pcdec./ff=ne should do what you're imagining afaict.
(In reply to Jacob Lifshay from comment #41) > it turns out that because prefix-codes are almost always (at least in both > JPEG and DEFLATE) interleaved with plain binary bits or prefix-codes using > different `tree` values, attempting to vectorize prefix-code decoding > doesn't really work. even if it did, it wouldn't be much faster than a > scalar pcdec. loop because of the scalar dependency chain between each > pcdec. and the next/prev pcdec. getting the instructions *into* a multi-issue pipeline is one priority, the other is allowing high-performance implementations to perform macro-op back-end ALUs, similar to the bigint wide ALUs > that said, sv.pcdec./ff=ne should do what you're imagining afaict. brilliant.