C++ OpenMP working really slow on matrix-vector product


So, I'm making matrix-vector product using openMP, but I've noticed it's working reallllly slow. After some times trying to figure out whats wrong I just deleted all code in parallel section and its still SLOW. What can be problem here? (n = 1000)

Here is time results for 1, 2 and 4 cores.


seq_method time = 0.001047194215062

parrallel_method (1) time = 0.001050273191140 seq - par = -0.000003078976079 seq/par = 0.997068404578433

parrallel_method (2) time = 0.001961992426004 seq - par = -0.000914798210943 seq/par = 0.533740192460558

parrallel_method (4) time = 0.004448095121916 seq - par = -0.003400900906854 seq/par = 0.235425319459132


Even when I delete code from parallel section - it doesnt change much.

void parallel_method(float A[n][n], float B[n], float C[n], int thr_num) { double t1, t2; float tmp = 0; int i, j; t1 = omp_get_wtime(); omp_set_dynamic(0); omp_set_num_threads(thr_num); #pragma omp parallel for private(tmp, j, i) for (i = 0; i < n; i++) { tmp = 0; for (j = 0; j < n; j++) { tmp += A[i][j] * B[j]; } #pragma omp atomic C[i] += tmp; } ////// t2 = omp_get_wtime(); if (show_c) print_vector(C); par = t2 - t1; printf("\nparrallel_method (%d) time = %.15f", thr_num, par); printf("\nseq - par = %.15f", seq - par); printf("\nseq/par = %.15f\n", seq / par); }

Code: <a href="https://pastebin.com/Q20t5DLk" rel="nofollow">https://pastebin.com/Q20t5DLk</a>


I tried to reproduce your problem and was not able to do that. I have a completely coherent behavior.

n=100 sequential_method (0) time = 0.000023339001928 parallel_method (1) time = 0.000023508997401 parallel_method (2) time = 0.000013864002540 parallel_method (4) time = 0.000008979986887 n=1000 sequential_method (0) time = 0.001439775005565 parallel_method (1) time = 0.001437967992388 parallel_method (2) time = 0.000701391996699 parallel_method (4) time = 0.000372130998080 n=10000 sequential_method (0) time = 0.140988592000213 parallel_method (1) time = 0.133375317003811 parallel_method (2) time = 0.077803490007180 parallel_method (4) time = 0.044142695999355

Except for small size, where thread overhead is significant, the results are more or less what is expected.

What I did:


all measures are done in the same run

</li> <li>

I run all functions once without timing to warm-up the caches

</li> </ul>

In real code estimations, I would have also


time several successive executions of the same function, especially if time is short in order to reduce small variations

</li> <li>

run several experiments and keep the smallest one to suppress outliers. (I prefer minimum, but you can also compute the average).

</li> </ul>

You should have posted all you code, and I do not know what is your methodology. But I think that your estimations where done in different runs and without warming up the caches. For this code, cache impact is very important and cores have to store the same information (B). And the problem is not large enough to benefit from the larger L1/L2 caches. These multiples loads may explain the worse performances of parallel code.

On last remark on your code. Every thread will have their own values of i. Hence C[i] can be accessed by only one thread and the atomic pragma is useless.


  • Why can't I get any performance improvements by running multiple threads in C++11?
  • Reset to Initial State with React Hooks
  • How to skip generate-sources in Maven
  • Is it possible to disable authentication Filter on one action in an MVC 5 controller?
  • Adjacent svg:polygon edges do not meet
  • Visual Studio 2012 C++ compile error with Boost Signal2
  • How to pause using DownloadManager?
  • How to use custom PHP templated system
  • How to clone a single file with jGit?
  • Back propagation algorithm: error computation
  • this.refs is undefined for shallow tests using enzyme on React native
  • Multiple custom authentication with spring security
  • Re-order an array's items based on several criteria
  • Swift + Autolayout: Rotate UILabel, keep it next to UIView
  • Unable to generate cucumber reports in jenkins. net.masterthought.cucumber.ValidationException: None
  • Storing the Cursor for App Engine Pagination
  • App crashes when switching to landscape layout
  • Android NDK refer to external libraries in JNI
  • How to copy memory
  • Using Java runtime to add registry key, cause process reg.exe to run forever
  • Complex multiple if statements
  • Jenkins - could not read Username for 'https://github.com': No such device or address
  • Custom progress dialog not working
  • SpringBoot don't replacen System variable {user.home} in Spring Tool Suite Version: 3.8.4.RELEA
  • Yii Bootstrap not loading JS files
  • How to get a time and Date Separately?
  • WiX ManagedBootstrapper SetDownloadSource confusion
  • Android studio import problems. (Apktool)
  • Windows biometric framework sample umdf driver: This device cannot start. (Code 10)
  • What does “T extends Junk” mean in a generic class in Java?
  • “Cannot open log for source” - Windows 7 - .NET web service - event log
  • Year over Year Stats from a Crossfilter Dataset
  • PHPMailer return to AJAX
  • Error handeling in antlr 3.0
  • reshape/remould data frame to create normalized bar chart and pie chart
  • Google App Engine Datastore: Dealing with eventual consistency
  • Accessing Arguments, Workflow Variables from custom activities
  • ReferenceError: TextEncoder is not defined