c# - Cache-friendly optimization: Object-oriented matrix multiplication and in-function tiled matrix multiplication


After writing a matrix class that represents the whole matrix in two 1D buffers using this implementation, I've reached the matrix multiplication part of the project and am now inclined toward cache-friendly optimizations. I stumbled upon two options (the question is in the lower part of the page):

1) Selecting blocked/tiled sub-matrices at multiplication time.

  • Done in a C++ DLL function, so there is no function overhead.
  • Since the code is more complex, additional optimizations are harder to apply.

2) Building the matrix class out of sub-matrix classes (smaller patches), so multiplication is done on the sub-matrix classes.

  • The object-oriented approach leaves room for additional optimizations on the sub-matrices.
  • Can the object headers and padding behavior of C# overcome critical strides?
  • Function overhead can become a problem when calling many times instead of a few times.

Example matrix multiplication: C = A·B

    A = 1 2 3 4    tiled as   [1 2] [3 4]
        4 3 4 2               [4 3] [4 2]
        1 3 1 2               [1 3] [1 2]
        1 1 1 2               [1 1] [1 2]

    B = 1 1 1 1    tiled as   [1 1] [1 1]
        1 1 1 1               [1 1] [1 1]
        1 1 1 1               [1 1] [1 1]
        1 1 1 1               [1 1] [1 1]

    Multiplication:

    [1 2] * [1 1]  +  [3 4] * [1 1]  ==> upper-left tile of result
    [4 3]   [1 1]     [4 2]   [1 1]      (same for upper-right)

    [1 3] * [1 1]  +  [1 2] * [1 1]  ==> lower-left tile of result
    [1 1]   [1 1]     [1 2]   [1 1]      (same for lower-right)

    Multiplication is O(n³); summation is O(n²).
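The tiled scheme above can be sketched as a blocked multiply, roughly what option 1 would do inside the C++ DLL (a minimal scalar sketch assuming row-major square matrices in one 1D buffer; `multiplyTiled` and the tile size `T` are illustrative names, not the actual DLL code):

```cpp
#include <vector>
#include <cstddef>

// Blocked (tiled) multiply: C += A * B, all n x n, row-major in 1D buffers.
// Each (i0, k0, j0) triple multiplies one T x T tile of A with one T x T
// tile of B and accumulates into a T x T tile of C, so the working set
// stays small enough to remain in cache.
void multiplyTiled(const std::vector<float>& A,
                   const std::vector<float>& B,
                   std::vector<float>& C,
                   std::size_t n, std::size_t T)
{
    for (std::size_t i0 = 0; i0 < n; i0 += T)
        for (std::size_t k0 = 0; k0 < n; k0 += T)
            for (std::size_t j0 = 0; j0 < n; j0 += T)
                for (std::size_t i = i0; i < i0 + T && i < n; ++i)
                    for (std::size_t k = k0; k < k0 + T && k < n; ++k) {
                        const float a = A[i * n + k];
                        for (std::size_t j = j0; j < j0 + T && j < n; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

With T chosen so that three T x T float tiles fit in cache, each loaded tile is reused T times before being evicted.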

Question: has anyone tried both (functional and object-oriented) approaches and made performance comparisons? Right now, naive multiplication without any of these cache-targeted optimizations takes:

    Matrix size    Single-threaded time    Multithreaded time
    128x128    :   5    ms                 1 ms - 5 ms (time-sample error is bigger)
    256x256    :   25   ms                 7    ms
    512x512    :   140  ms                 35   ms
    1024x1024  :   1.3  s                  260  ms
    2048x2048  :   11.3 s                  2700 ms
    4096x4096  :   88.1 s                  24   s
    8192x8192  :   710  s                  177  s

Giga-multiplications of variables per second:

    Matrix size    Single-threaded    Multithreaded    Multi/single ratio
    128x128    :   0.42               2.0 - 0.4        ?
    256x256    :   0.67               2.39             3.67x
    512x512    :   0.96               3.84             4.00x
    1024x1024  :   0.83               3.47             4.18x
    2048x2048  :   0.76               3.18             4.18x
    4096x4096  :   0.78               2.86             3.67x
    8192x8192  :   0.77               3.09             4.01x

(Average results on a 1.4 GHz FX-8150 with AVX-optimized code using 32-bit floats; C++ AVX intrinsics in DLL functions, called within Parallel.For() of Visual Studio C#.)
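For reference, the naive multiplication behind these timings is essentially the textbook triple loop. A plain scalar C++ sketch (the real DLL code uses AVX intrinsics and runs under Parallel.For, which this omits; the loop order shown is just one cache-reasonable choice):

```cpp
#include <vector>
#include <cstddef>

// Naive O(n^3) multiply: C = A * B, row-major n x n in 1D buffers.
// The (i, k, j) loop order keeps both the B reads and the C writes
// sequential in memory; the textbook (i, j, k) order would stride
// through B column-wise with a stride of n floats.
void multiplyNaive(const std::vector<float>& A,
                   const std::vector<float>& B,
                   std::vector<float>& C,
                   std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const float a = A[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += a * B[k * n + j];
        }
}
```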

Which of the matrix sizes above suffered from cache misses, critical strides, and other bad things? Also, does anyone know how I can use performance counters together with intrinsics?

Thanks for your time.

Edit: inlining optimization within the DLL:

    Matrix size    Single-threaded time      Multithreaded time                              Multi/single ratio
    128x128    :   1     ms (%400)           390 us average in 10k iterations (6 G mult/s)
    256x256    :   12    ms (%108 faster)    2     ms (%250 faster)                         6.0x
    512x512    :   73    ms (%92 faster)     15    ms (%133 faster)                         4.9x
    1024x1024  :   1060  ms (%22 faster)     176   ms (%48 faster)                          6.0x
    2048x2048  :   10070 ms (%12 faster)     2350  ms (%15 faster)                          4.3x
    4096x4096  :   82.0  s (%7 faster)       22    s  (%9 faster)                           3.7x
    8192x8192  :   676   s (%5 faster)       174   s  (%2 faster)                           4.1x

After inlining, the previously shadowed performance of the smaller multiplications became visible. There is still DLL-function-to-C# overhead. The 1024x1024 case seems to be the starting point of cache misses: while the work increased 8 times, the execution time increased about fifteen times.

Edit: I am going to try Strassen's algorithm with a 3-layers-deep object-oriented approach this week. The main matrix will be composed of 4 sub-matrices, each composed of 4 sub-subs, each composed of 4 sub-sub-subs. This should give (8/7)·(8/7)·(8/7) = about a +50% speedup. If it works, I will convert the DLL function into a patch-optimized one that uses the cache better.
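One Strassen layer replaces the 8 quadrant multiplications of the tiled scheme with 7, which is where the 8/7-per-layer factor comes from. A minimal single-layer sketch over row-major quadrant buffers (all names here are mine; `mul` stands in for the naive kernel):

```cpp
#include <vector>
#include <cstddef>

using Mat = std::vector<float>; // dense row-major h x h quadrant buffer

static Mat add(const Mat& x, const Mat& y) {
    Mat r(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) r[i] = x[i] + y[i];
    return r;
}
static Mat sub(const Mat& x, const Mat& y) {
    Mat r(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) r[i] = x[i] - y[i];
    return r;
}
static Mat mul(const Mat& x, const Mat& y, std::size_t h) {
    Mat r(h * h, 0.0f);
    for (std::size_t i = 0; i < h; ++i)
        for (std::size_t k = 0; k < h; ++k)
            for (std::size_t j = 0; j < h; ++j)
                r[i * h + j] += x[i * h + k] * y[k * h + j];
    return r;
}

// One Strassen layer: takes the four h x h quadrants of A and B,
// produces the four quadrants of C using 7 multiplications instead of 8.
void strassenOneLayer(const Mat& a11, const Mat& a12, const Mat& a21, const Mat& a22,
                      const Mat& b11, const Mat& b12, const Mat& b21, const Mat& b22,
                      Mat& c11, Mat& c12, Mat& c21, Mat& c22, std::size_t h)
{
    Mat m1 = mul(add(a11, a22), add(b11, b22), h);
    Mat m2 = mul(add(a21, a22), b11, h);
    Mat m3 = mul(a11, sub(b12, b22), h);
    Mat m4 = mul(a22, sub(b21, b11), h);
    Mat m5 = mul(add(a11, a12), b22, h);
    Mat m6 = mul(sub(a21, a11), add(b11, b12), h);
    Mat m7 = mul(sub(a12, a22), add(b21, b22), h);
    c11 = add(sub(add(m1, m4), m5), m7);
    c12 = add(m3, m5);
    c21 = add(m2, m4);
    c22 = add(sub(add(m1, m3), m2), m6);
}
```

The extra additions and subtractions are only O(n²) work, amortized against the one saved O(n³) multiplication.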

Applying Strassen's algorithm with 1 layer (such as 4 sub-matrices of 256x256 for a 512x512 matrix) in the object-oriented approach (the superclass is a Strassen class, the sub-matrices are matrix classes):

    Matrix size    Single-threaded time     Multithreaded time          Multi/single ratio
    128x128    :   %50 slowdown             slowdown
    256x256    :   %30 slowdown             slowdown
    512x512    :   %10 slowdown             slowdown
    1024x1024  :   540   ms (%96 faster)    130   ms (%35 faster)       4.15
    2048x2048  :   7500  ms (%34 faster)    1310  ms (%79 faster)       5.72
    4096x4096  :   70.2  s (%17 faster)     17    s  (%29 faster)       4.13
    6144x6144  :   x                        68    s
    8192x8192  :   OutOfMemoryException     OutOfMemoryException

The overhead between the DLL function and C# is still in effect, so the small matrices are not faster. Where there is a speedup, it is more than 8/7 (%14) because the smaller chunks give better cache usage.

I will write a benchmark class that repeatedly tests different leaf sizes of Strassen's algorithm against the naive one to find the critical size (for my system, it is 512x512).

The superclass will recursively build the sub-matrix tree until it reaches the 512x512 size, then use the naive algorithm for the 512x512 nodes. In the DLL function, a patched/blocking algorithm (which I will add next week) will make it even faster. I don't know how to select the proper patch size because I don't know the cache-line size of the CPU; I will search for that after the recursive Strassen is done.
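The recursion described here can be sketched as follows (assuming power-of-two sizes; `leaf` would be 512 on this system, and all helper names are mine):

```cpp
#include <vector>
#include <cstddef>

using Mat = std::vector<float>; // dense row-major n x n buffer

namespace {
Mat add(const Mat& x, const Mat& y) {
    Mat r(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) r[i] = x[i] + y[i];
    return r;
}
Mat sub(const Mat& x, const Mat& y) {
    Mat r(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) r[i] = x[i] - y[i];
    return r;
}
// Extract the h x h quadrant (qi, qj) of an n x n matrix.
Mat quad(const Mat& m, std::size_t n, std::size_t qi, std::size_t qj) {
    std::size_t h = n / 2;
    Mat r(h * h);
    for (std::size_t i = 0; i < h; ++i)
        for (std::size_t j = 0; j < h; ++j)
            r[i * h + j] = m[(qi * h + i) * n + (qj * h + j)];
    return r;
}
// Write an h x h block back into quadrant (qi, qj) of an n x n matrix.
void putQuad(Mat& m, std::size_t n, std::size_t qi, std::size_t qj, const Mat& b) {
    std::size_t h = n / 2;
    for (std::size_t i = 0; i < h; ++i)
        for (std::size_t j = 0; j < h; ++j)
            m[(qi * h + i) * n + (qj * h + j)] = b[i * h + j];
}
Mat naiveMul(const Mat& A, const Mat& B, std::size_t n) {
    Mat C(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
    return C;
}
} // namespace

// Recursive Strassen with a leaf cutoff: recurse until the quadrants
// reach `leaf`, then fall back to the naive (or DLL/blocked) kernel.
Mat strassen(const Mat& A, const Mat& B, std::size_t n, std::size_t leaf) {
    if (n <= leaf) return naiveMul(A, B, n);
    std::size_t h = n / 2;
    Mat a11 = quad(A,n,0,0), a12 = quad(A,n,0,1), a21 = quad(A,n,1,0), a22 = quad(A,n,1,1);
    Mat b11 = quad(B,n,0,0), b12 = quad(B,n,0,1), b21 = quad(B,n,1,0), b22 = quad(B,n,1,1);
    Mat m1 = strassen(add(a11,a22), add(b11,b22), h, leaf);
    Mat m2 = strassen(add(a21,a22), b11,          h, leaf);
    Mat m3 = strassen(a11,          sub(b12,b22), h, leaf);
    Mat m4 = strassen(a22,          sub(b21,b11), h, leaf);
    Mat m5 = strassen(add(a11,a12), b22,          h, leaf);
    Mat m6 = strassen(sub(a21,a11), add(b11,b12), h, leaf);
    Mat m7 = strassen(sub(a12,a22), add(b21,b22), h, leaf);
    Mat C(n * n);
    putQuad(C, n, 0, 0, add(sub(add(m1, m4), m5), m7));
    putQuad(C, n, 0, 1, add(m3, m5));
    putQuad(C, n, 1, 0, add(m2, m4));
    putQuad(C, n, 1, 1, add(sub(add(m1, m3), m2), m6));
    return C;
}
```

The seven recursive calls per node are independent, which is what makes the tree parsing parallelizable.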

My implementation of Strassen's algorithm needs 5 times more memory (working on it).

Edit: all of the recursivity is done; updating the table as results come in.

    Matrix size    Single-threaded time    Multithreaded time
    2048x2048  :   x                       872   ms average (double layer)
    2560x2560  :   x                       1527  ms average (double layer)

Parallelized the tree parsing, decreased the memory footprint, and introduced full recursivity:

    Matrix size         Single-threaded time    Multithreaded time                   M/S
    1024x1024        :  547   ms                123   ms average (single layer)      4.45x
    2048x2048        :  3790  ms                790   ms average (double layer)      4.79x
    4096x4096        :  26.4  s                 5440  ms average (triple layer)      4.85x
    8192x8192        :  185   s                 38    s  average (quad layer)        4.87x
    8192x8192 (4GHz) :  x                       15    s  average (quad layer)        4.87x

Multiplications per second (x10^9):

    Matrix size         Single-threaded    Multithreaded
    1024x1024        :  1.71               7.64   (single layer)
    2048x2048        :  1.73               8.31   (double layer)
    4096x4096        :  1.74               8.45   (triple layer)
    8192x8192        :  1.74               8.48   (quad layer)
    8192x8192 (4GHz) :  x                  21.49  (quad layer)

Strassen's CPU flops are multiplied by 7/8 with each layer.
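A quick way to check these figures: the reported rate is the naive n³ multiply count scaled by (7/8) per Strassen layer, divided by the measured time (a sketch; the function name is mine):

```cpp
#include <cmath>

// Effective multiplications per second: the raw n^3 multiply count is
// scaled by (7/8) for every Strassen layer, because each layer trades
// 8 half-size multiplications for 7.
double effectiveMulsPerSec(double n, int layers, double seconds)
{
    double strassenMuls = n * n * n * std::pow(7.0 / 8.0, layers);
    return strassenMuls / seconds;
}
```

For the quad-layer 8192x8192 run at 38 s this gives roughly 8.48x10^9, matching the table; at 15 s (4 GHz) it gives roughly 21.5x10^9.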

Just found out that with a comparably priced GPU, an 8k x 8k multiplication can be done in under 1 second using OpenCL.

