Sunday, 2021-12-05

tplatenin openpower/test/ I fix the broken names so that all needed signals show up in gtkwave15:19
tplatenI start with wishbone_runner15:24
*** A_Dragon is now known as AAAAA_DRAGON15:53
lkcltplaten, that's really helpful.16:13
ghostmansdIn assert src_zero == 0, "dest-zero not allowed in failfirst mode"17:12
ghostmansdOn the first glance it seems there should be dst_zero, amirite?17:13
ghostmansdAll stuff below deals with dst_zero as well17:14
ghostmansdOTOH... It seems that dst_zero should be non-zero, e.g. `(dst_zero << SVP64MODE.DZ) # predicate dst-zeroing`17:15
ghostmansd(I'm currently teaching binutils to parse ff/pr sections)17:15
*** tplaten <tplaten!> has left #libre-soc18:37
*** kylel1 is now known as kylel19:12
lkclghostmansd-pc, sometimes there's just not enough bits to fit things19:48
lkclbut there's nothing to stop the assembly syntax from trying to specify stuff that just doesn't have space19:49
octaviusGood evening. While sorting my bookmark collection, discovered this interesting question about sorted/unsorted array summation time-difference:
lkcleevnin octavius21:54
lkclc++ is now illegible thanks to massive overuse of "standard" templates.21:59
octaviusWhat does this do? Arrange the elements to improve branch prediction?22:00
octaviusThe name makes me think of memory pages, or cache boundaries22:01
octaviusAh wait, I got it22:01
lkclgreat, you can explain it to me. i don't get it22:02
* lkcl going slightly dizzy dealing with and icache.py22:03
octaviusSo in the specific example of the stackoverflow, the partition() function would re-arrange the list such that all elements greater than 128 come first and the ones less than 128 come second.22:03
octaviusThis would speed up execution, because the branch predictor will have a nice easy pattern to follow22:04
lkclurrr and then... yuk22:04
octaviusAs the code adds the element to the sum if it's greater than 12822:04
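[Editor's sketch] The reordering octavius describes can be illustrated roughly as follows. This is Python for brevity and will not itself show the branch-prediction speed-up, which applies to compiled code. The helper names `partition_by` and `conditional_sum` and the sample data are made up for illustration, and the threshold test is assumed to be `>= 128`, matching the comparison lkcl writes out in this conversation.

```python
def partition_by(values, keep):
    """Return values reordered so items satisfying keep() come first.

    Relative order inside each group is preserved (a stable partition).
    """
    front = [v for v in values if keep(v)]
    back = [v for v in values if not keep(v)]
    return front + back


def conditional_sum(values, threshold=128):
    """Add up only the elements at or above the threshold."""
    total = 0
    for v in values:
        if v >= threshold:  # the data-dependent branch
            total += v
    return total


# Hypothetical sample data, chosen only for illustration.
data = [200, 3, 129, 128, 50, 255, 127, 140]
grouped = partition_by(data, lambda v: v >= 128)

# After partitioning, the branch goes one way for a run of elements and
# then the other way for the rest: a single change of direction.
outcomes = [v >= 128 for v in grouped]
```

The sum is the same for `data` and `grouped`; only the sequence of branch outcomes changes, which is what a branch predictor would benefit from.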
lkclif predication was used (available) that would be false22:05
octaviuswhat's predication?22:06
lkcleven scalar predication (ARM).  funnily enough that's exactly what's on the wikipedia page about predicates / condition-codes22:06
lkcl"making an operation optional based on a bit"22:06
octaviusAh, but you'd need to make a comparison between more than 1 bit, right?22:07
lkclinstruction 1: predicate = data[c] >= 12822:07
lkclinstruction 2: optional(predicate){sum += data[c]}22:07
octaviusBut surely the predicate will still take time to process?22:09
lkclcount the number of instructions compared to having a branch.22:10
octaviusYou need to check if an element is greater than 128, so you need to have the element and 128 loaded somewhere22:10
lkclcount the number of instructions compared to having a branch.22:10
octaviustwo instructions22:11
lkclno: three.22:11
octaviuswhat's the 3rd one?22:11
lkclinstruction 1: predicate (CR0) = data[c] >= 12822:11
lkclinstruction 2: bc CR0, instruction 422:11
lkclinstruction 3: sum += data[c]22:12
lkclinstruction 4: loop....22:12
lkclwhen predicates used: ==> 2 instructions22:13
lkclwhen branches used: ==> 3 instructions22:13
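[Editor's sketch] The two loop bodies lkcl counts can be modelled roughly like this. The tuples simply transcribe the pseudo-instructions from the chat (they are not real Power ISA assembly), and multiplying by a 0/1 value is only a software stand-in for hardware predication, which would suppress the instruction rather than multiply. The function names are made up for illustration.

```python
# Loop bodies as listed in the chat, one string per pseudo-instruction.
# The shared "loop" instruction is left out of both.
BRANCH_BODY = (
    "predicate (CR0) = data[c] >= 128",
    "bc CR0, instruction 4",
    "sum += data[c]",
)
PREDICATED_BODY = (
    "predicate = data[c] >= 128",
    "optional(predicate) { sum += data[c] }",
)


def sum_with_branch(data):
    """Compare, then branch around the add when the test fails."""
    total = 0
    for x in data:
        if x >= 128:
            total += x
    return total


def sum_with_predicate(data):
    """Compare into a single bit, then let that bit gate the add."""
    total = 0
    for x in data:
        predicate = int(x >= 128)  # 1 if the compare is true, else 0
        total += x * predicate     # stands in for "add only if bit set"
    return total
```

Both functions return the same sum for any input; the difference lkcl points at is the length of the loop body, three pseudo-instructions against two.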
octaviusOk, but I don't really understand the mechanism behind the predicate. It sounds like a complex instruction (combination of several operations)22:16
lkclnope.  just 1 bit.22:16
lkclif bit set, instruction executed.22:16
lkclif bit clear, instruction not executed22:16
octaviusWhat's the bit?22:16
lkclin a register22:16
octaviusto determine if a number is bigger than 128, you need to check more than one bit though?22:17
octavius129 for example22:17
lkcland the result of that comparison is stored iiiiin.... a register of size "1 bit"22:17
lkclif result of comparison is true, bit equals 122:18
lkclif result of comparison is false, bit equals 022:18
octaviusHowever before the comparison is made, the 128 constant has to be loaded somewhere? (which takes an instruction)22:18
lkclnot if the ISA has compare-against-immediate22:18
octaviusAh ok, now it makes sense22:19
lkcland if it doesn't then the number of instructions in predicate goes up by one22:19
octaviusJust an ISA feature22:19
lkcland the number of instructions in branch goes up by one22:19
lkclnow you are comparing22:19
lkclwhen predicates used ==> 3 instructions22:19
octaviusSo you still gain regardless22:19
lkclwhen branches used ==> 4 instructions22:19
lkclof course22:19
lkclit's real simple22:19
lkcland ARM and many other ISAs have had predication for decades22:20
lkclVector ISAs have had predication for over 50 years22:20
lkclthere, they have multiple bits because it is multiple elements22:21
lkclone bit per element in the operation.22:21
lkclalso dead simple.22:21
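[Editor's sketch] "One bit per element" can be pictured as a mask applied element by element. The following is a plain-Python model under that assumption, not any particular vector ISA; `make_mask` and `masked_add` are hypothetical names.

```python
def make_mask(elements, threshold=128):
    """One predicate bit per element: 1 where the compare is true."""
    return [int(e >= threshold) for e in elements]


def masked_add(acc, elements, mask):
    """Element-wise acc[i] + elements[i], only where mask[i] is 1.

    Lanes whose mask bit is 0 keep their old accumulator value.
    """
    return [a + e if m else a for a, e, m in zip(acc, elements, mask)]


vector = [200, 3, 129, 50]
mask = make_mask(vector)  # [1, 0, 1, 0]
acc = masked_add([0, 0, 0, 0], vector, mask)
```

Each lane is handled exactly like the scalar case: its own bit decides whether its own add takes effect.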
lkclout-of-order speculative engines can jam/smash a ton of predicate-calculations-plus-predicated-operations into their Reservation Stations22:22
lkclthen once the predicate bit is calculated (by the compare)22:22
lkclsimply cancel the instruction if that predicate bit is zero22:22
lkclwhich then frees up the Function Unit for re-use by other incoming instructions22:23
lkclwhere branch speculation does pretty much exactly the same thing...22:24
octaviusThat's beautiful22:24
lkcl... but by jamming 50% more instructions into the Reservation Stations22:24
lkclsome CPUs will actually put *both* the paths into the RSes!22:24
lkcland cancel one of them22:25
octaviusWould that be considered a "high-performance" mode, where more power is consumed for a faster outcome?22:25
lkclthe penalty: you have *twice* as many instructions in-flight in the RSes compared to using predication, if you put both branch-success and branch-fail into RSes22:26
lkclyes, but it's insane.22:26
octaviusInsane because it's a waste most of the time?22:26
lkclwell, branch prediction is supposed to "help"22:27
lkclbut in the case of randomised data, it's never going to help, is it?22:27
octaviusYeah, worse than a coin flip22:27
octaviusI'll be off, will leave you with one more fun blog I'm sure you'll relate (software shipping with O(n^2) algos :) ) Good night22:47
