Sunday, 2021-12-05

tplaten	in openpower/test/runner.py I fix the broken names so that all needed signals show up in gtkwave	15:19
tplaten	I start with wishbone_runner	15:24
*** A_Dragon is now known as AAAAA_DRAGON		15:53
lkcl	tplaten, that's really helpful.	16:13
ghostmansd	In svp64.py: assert src_zero == 0, "dest-zero not allowed in failfirst mode"	17:12
ghostmansd	On the first glance it seems there should be dst_zero, amirite?	17:13
ghostmansd	All stuff below deals with dst_zero as well	17:14
ghostmansd	OTOH... It seems that dst_zero should be non-zero, e.g. `(dst_zero << SVP64MODE.DZ) # predicate dst-zeroing`	17:15
ghostmansd	(I'm currently teaching binutils to parse ff/pr sections)	17:15
tplaten	done https://git.libre-soc.org/?p=openpower-isa.git;a=summary	17:44
*** tplaten <tplaten!~isengaara@55d46f66.access.ecotel.net> has left #libre-soc		18:37
*** kylel1 is now known as kylel		19:12
lkcl	ghostmansd-pc, sometimes there's just not enough bits to fit things	19:48
lkcl	but there's nothing to stop the assembly syntax from trying to specify stuff that just doesn't have space	19:49
octavius	Good evening. While sorting my bookmark collection, discovered this interesting question about sorted/unsorted array summation time-difference: https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array	20:43
lkcl	eevnin octavius	21:54
lkcl	whyyy.... https://en.cppreference.com/w/cpp/algorithm/partition	21:59
lkcl	c++ is now illegible thanks to massive overuse of "standard" templates.	21:59
octavius	XD	22:00
octavius	What does this do? Arrange the elements to improve branch prediction?	22:00
octavius	The name makes me think of memory pages, or cache boundaries	22:01
octavius	Ah wait, I got it	22:01
lkcl	great, you can explain it to me. i don't get it	22:02
* lkcl going slightly dizzy dealing with dcache.py and icache.py		22:03
octavius	So in the specific example of the stackoverflow, the partition() function would re-arrange the list, such that all elements greater than 128 would come first, the one's less than 128 would come second.	22:03
octavius	This would speed up execution, because the branch predictor will have a nice easy pattern to follow	22:04
lkcl	urrr and then... yuk	22:04
octavius	As the code adds the element to the sum if it's greater than 128	22:04
lkcl	if predication was used (available) that would be false	22:05
octavius	what's predication?	22:06
lkcl	even scalar predication (ARM). funnily enough that's exactly what's on the wikipedia page about predicates / condition-codes	22:06
lkcl	"making an operation optional based on a bit"	22:06
octavius	Ah, but you'd need to make a comparison between more than 1 bit, right?	22:07
lkcl	instruction 1: predicate = data[c] >= 128	22:07
lkcl	instruction 2: optional(predicate){sum += data[c])	22:07
octavius	But surely the predicate will still take time to process?	22:09
lkcl	count the number of instructions compared to having a branch.	22:10
octavius	You need to check if a element is greater than 128, so you need to have the element and 128 loaded somewher	22:10
lkcl	count the number of instructions compared to having a branch.	22:10
octavius	Sure	22:10
octavius	two instructions	22:11
lkcl	no: three.	22:11
octavius	what's the 3rd one?	22:11
lkcl	instruction 1: predicate (CR0) = data[c] >= 128	22:11
lkcl	instruction 2: bc CR0, instruction 4	22:11
lkcl	instruction 3: sum += data[c]	22:12
lkcl	instruction 4: loop....	22:12
lkcl	when predicates used: ==> 2 instructions	22:13
lkcl	when branches used: ==> 3 instructions	22:13
octavius	Ok, but I don't really understand the mechanism behind the predicate. It sounds like a complex instruction (combination of several operations)	22:16
lkcl	nope. just 1 bit.	22:16
lkcl	if bit set, instruction executed.	22:16
lkcl	if bit clear, instruction not executed	22:16
octavius	What's the bit?	22:16
lkcl	dead	22:16
lkcl	simple	22:16
lkcl	in a register	22:16
octavius	to determine if a number is bigger than 128, you need to check more than one bit though?	22:17
octavius	129 for example	22:17
lkcl	and the result of that comparison is stored iiiiin.... a register of size "1 bit"	22:17
lkcl	if result of comparison is true, bit equals 1	22:18
lkcl	if result of comparison is false, bit equals 0	22:18
lkcl	also	22:18
lkcl	dead	22:18
lkcl	simpe	22:18
lkcl	ple	22:18
octavius	However before the comparison is made, the 128 constant has to be loaded somewhere? (which takes an instruction)	22:18
lkcl	not if the ISA has compare-against-immediate	22:18
octavius	Ah ok, now it makes sense	22:19
lkcl	and if it doesn't then the number of instructions in predicate goes up by one	22:19
octavius	Just an ISA feature	22:19
lkcl	and the number of instructions in branch goes up by one	22:19
lkcl	now you are comparing	22:19
lkcl	when predicates used ==> 3 instructions	22:19
octavius	So you still gain regardless	22:19
lkcl	when branches used ==> 4 instructions	22:19
lkcl	of course	22:19
lkcl	it's real simple	22:19
lkcl	and ARM and many other ISAs have had predication for decades	22:20
lkcl	Vector ISAs have had predication for over 50 years	22:20
lkcl	there, they have multiple bits because it is multiple elements	22:21
lkcl	one bit per element in the operation.	22:21
lkcl	also dead simple.	22:21
lkcl	out-of-order speculative engines can jam/smash a ton of predicate-calculations-plus-predicated-operations into their Reservation Stations	22:22
lkcl	then once the predicate bit is calculated (by the compare)	22:22
lkcl	simply cancel the instruction if that predicate bit is zero	22:22
lkcl	which then frees up the Function Unit for re-use by other incoming instructions	22:23
lkcl	where branch speculation does pretty much exactly the same thing...	22:24
octavius	That's beautiful	22:24
lkcl	... but by jamming 50% more instructions into the Reservation Stations	22:24
lkcl	some CPUs will actually put both the paths into the RSes!	22:24
lkcl	and cancel one of them	22:25
octavius	Would that be considered a "high-performance" mode, where more power is consumed for a faster outcome?	22:25
lkcl	the penalty: you have twice as many instructions in-flight in the RSes compared to using predication, if you put both branch-success and branch-fail into RSes	22:26
lkcl	yes, but it's insane.	22:26
octavius	Insane because it's a waste most of the time?	22:26
lkcl	well, branch prediction is supposed to "help"	22:27
lkcl	but in the case of randomised data, it's never going to help, is it?	22:27
octavius	Yeah, worse than a coin flip	22:27
octavius	I'll be off, will leave you with one more fun blog I'm sure you'll relate (software shipping with O(n^2) algos :) ) Good night	22:47
lkcl	:)	22:47
octavius	https://randomascii.wordpress.com/2021/02/16/arranging-invisible-icons-in-quadratic-time/	22:47

Generated by irclog2html.py 2.17.1 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!