[PPL-devel] Thread Safety (continued)

Enea Zaffanella zaffanella at cs.unipr.it
Sat Oct 29 10:08:41 CEST 2016


Hello John.

On 10/26/2016 01:30 AM, John Paulson wrote:
> Hi Enea,
>
> I have installed the development version of the PPL and configured it
> with thread safety enabled. It seems to work just as you say it will,
> but I am having trouble getting the expected speedups. To demonstrate
> the issue, I have included a sample program below. This program creates
> a user-specified number of threads, and in each thread it intersects
> two NNC_Polyhedron a user-specified number of times. For timing
> comparisons, I also added a code path to the test program that does not
> call the PPL but instead computes logarithms.

[...]

> I tested this on a new machine with 44 cores and hyperthreading 
> (thread::hardware_concurrency() = 88), run with RepCount = 10,000 and 
> TestPPL = true. Here are the timings:
> #thread,real time (from time)
> 1,0m0.925s
> 5,0m1.820s
> 10,0m3.041s
> 20,0m3.758s
> 40,0m6.775s

First of all, let me warn you that I am NOT an expert in parallel
programming, so everything I say below should be taken with a grain of
salt.

I have no easy access to a machine with hardware coming close to yours,
so I just tried your code on my laptop: it has only 4 cores on a single
socket, i.e., a hardware_concurrency of 8. I applied a few trivial
corrections in a couple of places to make it compile, and I added -O3
to the compilation command, just to be sure.
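
Since the full program is elided above, here is a minimal sketch of the
kind of test I have in mind (my own reconstruction, not your exact code:
the polyhedra, their constraints and the command line argument order are
all assumptions on my part):

// Hypothetical reconstruction of the test harness described above.
// Assumed usage: ./paulson <num_threads> <rep_count> <test_ppl>
#include <ppl.hh>
#include <cmath>
#include <cstdlib>
#include <thread>
#include <vector>

namespace PPL = Parma_Polyhedra_Library;

// Each worker repeatedly intersects two 2-dimensional NNC polyhedra.
// The constraints below are made up; the point is only to exercise
// intersection_assign() in a loop.  Depending on how the thread-safe
// build is configured, each thread may also need its own PPL
// (de)initialization, which I am glossing over here.
void ppl_worker(long rep_count) {
  PPL::Variable x(0);
  PPL::Variable y(1);

  PPL::Constraint_System cs1;
  cs1.insert(x >= 0);
  cs1.insert(y >= 0);
  cs1.insert(x + y < 10);

  PPL::Constraint_System cs2;
  cs2.insert(x < 7);
  cs2.insert(y < 7);
  cs2.insert(x - y > -5);

  for (long i = 0; i < rep_count; ++i) {
    PPL::NNC_Polyhedron p(cs1);
    PPL::NNC_Polyhedron q(cs2);
    p.intersection_assign(q);
  }
}

// The non-PPL code path: a purely CPU-bound loop of logarithms.
void log_worker(long rep_count) {
  volatile double sink = 0.0;
  for (long i = 0; i < rep_count; ++i)
    sink = sink + std::log(static_cast<double>(i + 2));
}

int main(int argc, char** argv) {
  const int  num_threads = (argc > 1) ? std::atoi(argv[1]) : 1;
  const long rep_count   = (argc > 2) ? std::atol(argv[2]) : 10000;
  const bool test_ppl    = (argc > 3) ? (std::atoi(argv[3]) != 0) : true;

  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    if (test_ppl)
      workers.emplace_back(ppl_worker, rep_count);
    else
      workers.emplace_back(log_worker, rep_count);
  }
  for (std::thread& w : workers)
    w.join();
  return 0;
}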

Here is my "baseline", when using a single worker thread:

# time LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4" ./paulson 1 10000 1
real    0m1.174s
user    0m1.169s

Here is the "best" time obtained when using 4 workers:

# time LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4" ./paulson 4 10000 1
real    0m1.487s
user    0m5.841s

In this best case, I am obtaining an "efficiency" of ~79% (speedup
~3.16: since each worker performs the same amount of work, speedup =
4 * 1.174 / 1.487 and efficiency = speedup / 4).

Note however that in order to get that timing I had to run the test
some 4 or 5 times: on my laptop it is quite easy to obtain timings in
the 2.0-2.5 secs range (i.e., efficiency < 60%, speedup < 2.35).
However, as far as I recall, this is how such speedups are usually
computed: people report their best results and blame the operating
system when the program runs slower than that.

So, in my best case, there is an overhead, but not as big as the one
you are reporting (question: were you reporting your best timings?).

I know that testing with 4 threads is quite limited, but I can't go
beyond that without introducing bias: for instance, if I use 8 threads
(i.e., the hardware concurrency), each of my 4 cores will have at least
two threads competing for the shared L1 and L2 caches (the L3 cache is
shared among all cores anyway). Of course, you already know this, since
you were doing your tests with fewer than 44 threads.

Now, the other part of your question:

>
> By way of comparison, here are the timings for RepCount = 50,000,000 
> and TestPPL = false:
> #thread,real time (from time)
> 1,0m1.767s
> 5,0m1.854s
> 10,0m2.012s
> 20,0m2.139s
> 40,0m2.206s
>

In this case, I am obtaining 2.09 secs for 1 worker and 2.26 secs for
4 workers (speedup ~3.7). Again, this is the best case: worst cases
reach ~2.6 secs, i.e., speedup ~3.2. The variation between best and
worst case is smaller than in the PPL case. Looking at the code, the
big difference between the two tests is that the PPL test is probably
performing many allocations from the heap, whereas the non-PPL test
seems to be a purely CPU-intensive task.
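
If one wanted to check the allocation hypothesis directly, a crude way
(just a sketch on my part, not something I have run on your program)
would be to count calls to the global operator new with a thread-local
counter, e.g.:

#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <new>

// Per-thread count of C++ heap allocations.
thread_local std::size_t allocation_count = 0;

// Replace the global operator new/delete pair so that every allocation
// is counted.  The memory still comes from malloc, so an LD_PRELOAD'ed
// allocator such as tcmalloc is still the one doing the real work.
void* operator new(std::size_t size) {
  ++allocation_count;
  if (void* p = std::malloc(size))
    return p;
  throw std::bad_alloc();
}

void operator delete(void* p) noexcept {
  std::free(p);
}

// At the end of each worker one could then print the counter:
//   std::printf("this thread performed %zu allocations\n",
//               allocation_count);

A large per-thread count for the PPL path and a near-zero count for the
logarithm path would support the guess above.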

Of course, the above is just guessing on my part, so I tried to get
some more information using the perf tool. I started with "perf stat",
comparing the single-worker case with the 4-worker case.

For the PPL test I was obtaining the following:
=================================================
# LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4" perf stat ./paulson 1 10000 1

  Performance counter stats for './paulson 1 10000 1':

        1167.033565 task-clock (msec)         #    0.998 CPUs utilized
                107 context-switches          #    0.092 K/sec
                  3 cpu-migrations            #    0.003 K/sec
                671 page-faults               #    0.575 K/sec
      3,666,223,020 cycles                    #    3.141 GHz
        983,894,620 stalled-cycles-frontend   #   26.84% frontend cycles idle
    <not supported> stalled-cycles-backend
      7,803,455,611 instructions              #    2.13  insns per cycle
                                              #    0.13  stalled cycles per insn
      1,553,152,858 branches                  # 1330.855 M/sec
          7,428,588 branch-misses             #    0.48% of all branches

        1.169603522 seconds time elapsed
=================================================

# LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4" perf stat ./paulson 4 10000 1

  Performance counter stats for './paulson 4 10000 1':

        5851.363612 task-clock (msec)         #    3.944 CPUs utilized
                922 context-switches          #    0.158 K/sec
                 27 cpu-migrations            #    0.005 K/sec
                993 page-faults               #    0.170 K/sec
     16,867,156,201 cycles                    #    2.883 GHz
      6,529,193,696 stalled-cycles-frontend   #   38.71% frontend cycles idle
    <not supported> stalled-cycles-backend
     31,184,317,457 instructions              #    1.85  insns per cycle
                                              #    0.21  stalled cycles per insn
      6,204,550,206 branches                  # 1060.360 M/sec
         29,955,117 branch-misses             #    0.48% of all branches

        1.483624741 seconds time elapsed
=================================================

We can see a significant increase (more than 4x) in the number of
context switches and CPU migrations, as well as an increase in the
percentage of stalled cycles. Another thing to monitor is the number of
cache misses:

================================================
### LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4" perf stat -e L1-dcache-load-misses ./paulson 1 10000 1

  Performance counter stats for './paulson 1 10000 1':

         14,692,004 L1-dcache-load-misses

        1.179272479 seconds time elapsed
================================================
### LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4" perf stat -e L1-dcache-load-misses ./paulson 4 10000 1

  Performance counter stats for './paulson 4 10000 1':

        192,535,509 L1-dcache-load-misses

        1.502333756 seconds time elapsed
================================================

The increase factor here is ~13, much higher than the 4x we would
expect from simply performing 4 times the work.

In contrast, the test not using the PPL, being CPU-intensive rather
than memory-intensive, scores as follows:
1 worker: 171,187 L1-dcache-load-misses
4 workers: 395,932 L1-dcache-load-misses

As I said at the beginning, I am not an expert, so I may have missed
something trivial that makes everything above "just junk". Anyway, this
investigation was interesting. Maybe you will want to perform a similar
analysis on your machine; in that case, your feedback is welcome.

Cheers,
Enea.



