c++ - CL_MEM_ALLOC_HOST_PTR slower than CL_MEM_USE_HOST_PTR
So I've been playing around with OpenCL for a bit now, testing the speed of memory transfers between host and device. I was using the Intel OpenCL SDK, running on an Intel i5 processor with integrated graphics. I then discovered that clEnqueueMapBuffer instead of clEnqueueWriteBuffer turned out to be around 10 times faster when using pinned memory, like so:
    int amt = 16*1024*1024;
    ...
    k_a = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(int)*amt, a, NULL);
    k_b = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(int)*amt, b, NULL);
    k_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, sizeof(int)*amt, ret, NULL);

    int* map_a = (int*) clEnqueueMapBuffer(c_q, k_a, CL_TRUE, CL_MAP_READ, 0, sizeof(int)*amt, 0, NULL, NULL, &error);
    int* map_b = (int*) clEnqueueMapBuffer(c_q, k_b, CL_TRUE, CL_MAP_READ, 0, sizeof(int)*amt, 0, NULL, NULL, &error);
    int* map_c = (int*) clEnqueueMapBuffer(c_q, k_c, CL_TRUE, CL_MAP_WRITE, 0, sizeof(int)*amt, 0, NULL, NULL, &error);
    clFinish(c_q);
where a, b, and ret are 128-bit aligned int arrays. The time came out to 22.026186 ms, compared to 198.604528 ms using clEnqueueWriteBuffer.
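The post does not show how a, b, and ret were allocated. One way to get 128-bit (16-byte) aligned int arrays in C++17 is std::aligned_alloc; the sketch below assumes that approach, and the helper names alloc_aligned_ints and is_aligned are made up for illustration, not taken from the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper (not from the original post): allocate `count` ints
// whose start address is a multiple of `alignment` bytes.
// Release the result with std::free.
int* alloc_aligned_ints(std::size_t count, std::size_t alignment) {
    // The alignment must be a non-zero power of two.
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // std::aligned_alloc requires the size to be a multiple of the alignment,
    // so round the byte count up to the next multiple.
    std::size_t bytes = count * sizeof(int);
    std::size_t padded = (bytes + alignment - 1) / alignment * alignment;
    return static_cast<int*>(std::aligned_alloc(alignment, padded));
}

// Hypothetical helper: true if `p` sits on an `alignment`-byte boundary.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

For example, `int* a = alloc_aligned_ints(amt, 16);` yields a pointer that could be handed to clCreateBuffer as host_ptr together with CL_MEM_USE_HOST_PTR, and `std::free(a)` releases it once the buffer itself has been released. std::aligned_alloc is not available on MSVC, where _aligned_malloc is the usual substitute.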
However, when I changed my code to
    k_a = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, sizeof(int)*amt, NULL, NULL);
    k_b = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, sizeof(int)*amt, NULL, NULL);
    k_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, sizeof(int)*amt, NULL, NULL);

    int* map_a = (int*) clEnqueueMapBuffer(c_q, k_a, CL_TRUE, CL_MAP_READ, 0, sizeof(int)*amt, 0, NULL, NULL, &error);
    int* map_b = (int*) clEnqueueMapBuffer(c_q, k_b, CL_TRUE, CL_MAP_READ, 0, sizeof(int)*amt, 0, NULL, NULL, &error);
    int* map_c = (int*) clEnqueueMapBuffer(c_q, k_c, CL_TRUE, CL_MAP_WRITE, 0, sizeof(int)*amt, 0, NULL, NULL, &error);

    /** initiate map_a and map_b **/
the time increases to 91.350065 ms.
What could be the problem? Or is it a problem at all?
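To put the three timings side by side, the ratios can be worked out with a tiny helper. This is only arithmetic on the numbers quoted above; the function name speedup is made up for illustration.

```cpp
#include <cassert>

// Hypothetical helper: how many times faster a run taking `fast_ms`
// is than a run taking `slow_ms`.
double speedup(double slow_ms, double fast_ms) {
    assert(fast_ms > 0.0);
    return slow_ms / fast_ms;
}
```

With the figures from the question, `speedup(198.604528, 22.026186)` is a little over 9, `speedup(91.350065, 22.026186)` is a little over 4, and `speedup(198.604528, 91.350065)` is a little over 2. So the CL_MEM_ALLOC_HOST_PTR version is roughly 4 times slower than the CL_MEM_USE_HOST_PTR version, but still about twice as fast as clEnqueueWriteBuffer.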
Edit: This is how I initialize the arrays in the second code:

    for (int i = 0; i < amt; i++) {
        map_a[i] = i;
        map_b[i] = i;
    }
And now that I check, map_a and map_b do contain the right elements at the end of the program, but map_c contains all 0's. I did this:

    clEnqueueUnmapMemObject(c_q, k_a, map_a, 0, NULL, NULL);
    clEnqueueUnmapMemObject(c_q, k_b, map_b, 0, NULL, NULL);
    clEnqueueUnmapMemObject(c_q, k_c, map_c, 0, NULL, NULL);
and my kernel is just

    __kernel void test(__global int* a, __global int* b, __global int* c)
    {
        int i = get_global_id(0);
        c[i] = a[i] + b[i];
    }
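To tell whether the output buffer really holds the kernel's result, a host-side reference check along these lines may help. It is a sketch only: count_mismatches is a made-up name, and it assumes the c pointer comes from mapping k_c for reading after the kernel has finished, rather than from a mapping taken before the kernel ran.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical host-side check (not from the original post): count the
// elements where c[i] differs from the expected a[i] + b[i].
std::size_t count_mismatches(const int* a, const int* b, const int* c,
                             std::size_t n) {
    assert(a != nullptr && b != nullptr && c != nullptr);
    std::size_t bad = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (c[i] != a[i] + b[i]) {
            ++bad;
        }
    }
    return bad;
}
```

A return value of 0 means every element matches. Note that with the initialization used in the question (a[i] = b[i] = i), an all-zero output still matches at index 0, so an untouched buffer of n elements would report n - 1 mismatches.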
My understanding is that CL_MEM_ALLOC_HOST_PTR allocates but doesn't copy. Does the 2nd block of code actually get any data onto the device?
Also, clCreateBuffer, when used with CL_MEM_USE_HOST_PTR or CL_MEM_COPY_HOST_PTR, shouldn't require clEnqueueWriteBuffer, since the buffer is created from the memory pointed to by void *host_ptr.
Using "pinned" memory in OpenCL should be a process like:
    int amt = 16*1024*1024;
    int *array = new int[amt];
    cl_int error = 0;

    //note, since we are using NULL for the data pointer, we have to use CL_MEM_ALLOC_HOST_PTR
    //this allocates the buffer's memory for us
    cl_mem b1 = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(int)*amt, NULL, &error);

    //map the device memory to host memory, aka pinning it
    int *host_ptr = (int*) clEnqueueMapBuffer(queue, b1, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE, 0, sizeof(int)*amt, 0, NULL, NULL, &error);

    //copy from host memory to the pinned host memory; the runtime takes care of getting it to the card
    memcpy(host_ptr, array, sizeof(int)*amt);

    //call your kernel and whatever else, then memcpy back from the pinned host memory when you are done
Edit: One final thing you can do to speed up the program is to not make the memory reads/writes blocking, by using CL_FALSE instead of CL_TRUE. Just make sure to call clFinish() before the data gets copied back to the host, so that the command queue is emptied and all commands have been processed.
Source: OpenCL in Action