c# - Cache-friendly optimization: Object oriented matrix multiplication and in-function tiled matrix multiplication -
after writing matrix class represents whole matrix in 2 1d-buffers using this implementation , i've reached matrix maltiplication part in project , inclined cache-friendly-optimizations now. stumbled upon 2 options(question in lower part of page):
1)selecting blocked/tiled sub-matrices in multiplication time.
- done in c++ dll function, no function overhead.
- since code more complex, additional optimizations harder apply.
2)building matrix class sub matrix classes(smaller patches) multiplication done on sub-matrix classes.
- object oriented approach leaves space additional optimizations sub-matrices.
- object headers , padding behavior of c# can overcome critical strides?
- function overhead can problem after calling many times instead of few times.
example matrix multiplication: c=a.b
1 2 3 4 used 1 2 3 4 4 3 4 2 4 3 4 2 1 3 1 2 1 1 1 2 1 3 1 2 1 1 1 2 b 1 1 1 1 ---> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 multiplication: 1 2 * 1 1 + 3 4 * 1 1 ==> upper-left tile of result 4 3 1 1 4 2 1 1 same upper-right of result 1 3 * 1 1 + 1 2 * 1 1 ==> lower left tile of result 1 1 1 1 1 2 1 1 same lower-right tile of result multiplication o(n³) summation o(n²).
question: has tried both(functional , object oriented) , made performance comparisons? right now, naive multiplication without of these cache targeted optimizations, takes:
matrix size single threaded time multithreaded time * 128x128 : 5 ms 1ms-5ms(time sample error bigger) * 256x256 : 25 ms 7 ms * 512x512 : 140 ms 35 ms * 1024x1024 : 1.3 s 260 ms * 2048x2048 : 11.3 s 2700 ms * 4096x4096 : 88.1 s 24 s * 8192x8192 : 710 s 177 s giga-multiplications of variables per second single threaded multithreaded multi/single ratio * 128x128 : 0.42 2.0 - 0.4 ? * 256x256 : 0.67 2.39 3.67x * 512x512 : 0.96 3.84 4.00x * 1024x1024 : 0.83 3.47 4.18x * 2048x2048 : 0.76 3.18 4.18x * 4096x4096 : 0.78 2.86 3.67x * 8192x8192 : 0.77 3.09 4.01x
(average results 1.4ghz fx8150 avx-optimized code using 32-bit floats)(c++ avx-intrinsics in dll functions within parallel.for() of visual studio c#)
which size of matrices above suffered cache misses, critical strides , other bad things? know how can performance counters of using intrinsics?
thans time.
edit: inlining optimization within dll:
matrix size single threaded time multithreaded time multi/single radio * 128x128 : 1 ms(%400) 390us avrage in 10k iterations(6g mult /s) * 256x256 : 12 ms(%108 faster) 2 ms (%250 faster) 6.0x * 512x512 : 73 ms(%92 faster) 15 ms (%133 faster) 4.9x * 1024x1024 : 1060 ms(%22 faster) 176 ms (%48 faster) 6.0x * 2048x2048 : 10070 ms(%12 faster) 2350 ms (%15 faster) 4.3x * 4096x4096 : 82.0 s(%7 faster) 22 s (%9 faster) 3.7x * 8192x8192 : 676 s(%5 faster) 174 s (%2 faster) 4.1x
after inlining, shadowed performance of smaller multiplications become visible. there still dll-function-c# overhead. 1024x1024 case seems starting point of cache-misses. while work increased 7 times, execution time increased fifteen times.
edit:: going try strassen's algorithm 3-layers deep object oriented approach week. main matrix composed of 4 sub matrices. composed of 4 sub-subs each. composed of 4 sub-sub-subs each. should give (8/7)(8/7)(8/7)= +%50 speedup. if works, convert dll-function patch-optimized 1 use more cache.
applying strassen's algorithm 1 layer(such 4 of 256x256 512x512) object-oriented approach(the super class strassen
, submatrices matrix
classes):
matrix size single threaded time multithreaded time multi/single radio * 128x128 : **%50 slowdown** **slowdown** * 256x256 : **%30 slowdown** **slowdown** * 512x512 : **%10 slowdown** **slowdown** * 1024x1024 : 540 ms(%96 faster) 130 ms (%35 faster) 4.15 * 2048x2048 : 7500 ms(%34 faster) 1310 ms (%79 faster) 5.72 * 4096x4096 : 70.2 s(%17 faster) 17 s (%29 faster) 4.13 * 6144x6144 : x 68 s * 8192x8192 : outofmemoryexception outofmemoryexception
the overhead between dll-function , c# still in effect small matrices not faster. when there speedup, more 8/7(%14) because using smaller chunks better cache usage.
will write benchmark class repeatedly tests different leaf sizes of stressen's algorithm versus naive 1 find critical size. (for system, 512x512).
superclass recursively build sub-matrix tree until reaches 512x512 size , use naive algorithm 512x512 nodes. in dll-function, patched/blocking algorithm(will add next week) make more faster. dont know how select proper size of patch because dont know how cache-line size of cpu. search after recursive strassen done.
my implementation of strassen's algorithm needs 5 times more memory(working on it).
edit: of recursivity done, updating table resuts come.
matrix size single threaded time multithreaded time * 2048x2048 : x 872 ms average(double layer) * 2560x2560 : x 1527 ms average(double layer)
parallelized tree parsing, decreased memory footprint , introduced full recursivity:
matrix size single threaded time multithreaded time m/s * 1024x1024 : 547 ms 123 ms average(single layer) 4.45x * 2048x2048 : 3790 ms 790 ms average(double layer) 4.79x * 4096x4096 : 26.4 s 5440 ms average(triple layer) 4.85x * 8192x8192 : 185 s 38 s average(quad layer) 4.87x * 8192x8192(4ghz): x 15 s average(quad layer) 4.87x
multiplications per second (x10^9):
matrix size single threaded multithreaded * 1024x1024 : 1.71 7.64 (single layer) * 2048x2048 : 1.73 8.31 (double layer) * 4096x4096 : 1.74 8.45 (triple layer) * 8192x8192 : 1.74 8.48 (quad layer) * 8192x8192(4ghz): x 21.49 (quad layer) strassen's cpu flops multiplied 7/8 each layer.
just found out using priced gpu can 8kx8k under 1 second using opencl.
Comments
Post a Comment