[PPL-devel] Thread Safety (continued)
Enea Zaffanella
zaffanella at cs.unipr.it
Sat Oct 29 10:08:41 CEST 2016
Hello John.
On 10/26/2016 01:30 AM, John Paulson wrote:
> Hi Enea,
>
> I have installed the developmental version of ppl and configured it
> with thread-safety on. It seems to work just as you say it will, but I
> am having issues getting the expected speedups. To demonstrate the
> speedup issue, I have included a sample program below. This program
> creates a user-inputted number of threads, and in each thread it
> intersects two NNC_Polyhedron a user-inputted number of times. For
> timing comparisons, I also made a code path in the test program that
> does not call PPL but rather computes logarithms.
[...]
> I tested this on a new machine with 44 cores and hyperthreading
> (thread::hardware_concurrency() = 88), run with RepCount = 10,000 and
> TestPPL = true. Here are the timings:
> #thread,real time (from time)
> 1,0m0.925s
> 5,0m1.820s
> 10,0m3.041s
> 20,0m3.758s
> 40,0m6.775s
First of all, let me warn you that I am NOT an expert in parallel
programming, so everything I say should be taken with a grain of salt.
I have no easy access to a machine with hardware coming close to yours,
so I just tried your code on my laptop: it has only 4 cores on a single
socket, i.e., a hardware_concurrency of 8. I applied a few trivial
corrections in a couple of places to make it compile; I also added -O3
to the compilation command, just to be sure.
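For readers of the archive who do not have the original program, the test
is roughly of the following shape. This is only a minimal sketch I have
reconstructed from the description above: the specific polyhedra,
constraints, argument handling and the compilation command are placeholders
of mine, not necessarily what John's program actually does.
=================================================
// Minimal sketch of the benchmark (placeholders, not John's actual code).
// Build roughly as: g++ -std=c++11 -O3 paulson.cc -lppl -lgmpxx -lgmp -pthread
// Usage: ./paulson <num_threads> <rep_count> <test_ppl>
#include <ppl.hh>
#include <cmath>
#include <cstdlib>
#include <thread>
#include <vector>

using namespace Parma_Polyhedra_Library;

// Each worker repeatedly intersects two fixed NNC polyhedra.
// (Depending on the thread-safe configuration, some per-thread
// initialization of the library may be needed here; omitted in this sketch.)
void ppl_worker(unsigned long rep_count) {
  Variable x(0);
  Variable y(1);
  NNC_Polyhedron p(2);
  p.add_constraint(x >= 0);
  p.add_constraint(y >= 0);
  p.add_constraint(x + y < 10);
  NNC_Polyhedron q(2);
  q.add_constraint(x < 5);
  q.add_constraint(y < 5);
  for (unsigned long i = 0; i < rep_count; ++i) {
    NNC_Polyhedron r(p);
    r.intersection_assign(q);
  }
}

// The non-PPL code path: a purely CPU-bound loop computing logarithms.
void log_worker(unsigned long rep_count) {
  volatile double acc = 0.0;
  for (unsigned long i = 0; i < rep_count; ++i)
    acc += std::log(static_cast<double>(i + 2));
}

int main(int argc, char** argv) {
  if (argc != 4)
    return 1;
  const unsigned long num_threads = std::strtoul(argv[1], nullptr, 10);
  const unsigned long rep_count = std::strtoul(argv[2], nullptr, 10);
  const bool test_ppl = (std::strtoul(argv[3], nullptr, 10) != 0);

  std::vector<std::thread> workers;
  for (unsigned long i = 0; i < num_threads; ++i)
    workers.emplace_back(test_ppl ? ppl_worker : log_worker, rep_count);
  for (std::thread& t : workers)
    t.join();
  return 0;
}
=================================================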
Here is my "baseline", when using a single worker thread:
# time LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4" ./paulson 1 10000 1
real 0m1.174s
user 0m1.169s
Here is the "best" time obtained when using 4 workers:
# time LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4" ./paulson 4 10000 1
real 0m1.487s
user 0m5.841s
In this best case, I am obtaining an "efficiency" of ~ 79% (speedup ~ 3.16).
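(For the record, these numbers assume that each worker performs the full
RepCount workload, so 4 workers do 4 times the work:
speedup = 4 * 1.174 / 1.487 ~ 3.16, efficiency = 3.16 / 4 ~ 0.79.)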
Note, however, that in order to get that timing I had to run the test
some 4 or 5 times: on my laptop it is quite easy to obtain timings in the
2.0-2.5 second range (i.e., efficiency < 60%, speedup < 2.35).
However, as far as I recall, this is the way these speedups are usually
computed: people always report their best results, blaming the operating
system when the program runs slower than that.
So, in my best case, there is an overhead, but not as big as the one you
are reporting (question: were you reporting your best timings?).
I know that testing with 4 threads is quite limited ... but I can't go
beyond that without bias: for instance, if I use 8 threads (i.e., the
hardware concurrency), on each of my 4 cores I will have at least two
threads competing for the shared L1 and L2 caches (the L3 cache is shared
among all cores anyway). Of course, you already know this, since you
were doing your tests with fewer than 44 threads.
Now, the other part of your question:
>
> By way of comparison, here are the timings for RepCount = 50,000,000
> and TestPPL = false:
> #thread,real time (from time)
> 1,0m1.767s
> 5,0m1.854s
> 10,0m2.012s
> 20,0m2.139s
> 40,0m2.206s
>
In this case, I obtain 2.09 seconds for 1 worker and 2.26 seconds for 4
workers (speedup ~ 3.7, again computing 4 * 2.09 / 2.26).
Again, this is the best case: worst cases reach ~ 2.6 seconds, i.e., a
speedup of ~ 3.2. The variation between best and worst case is smaller
than in the PPL case.
Looking at the code, the big difference between the two tests is that
the PPL test is probably performing many allocations from the heap,
whereas the non-PPL test seems to be a purely CPU-intensive task.
Of course, this is just wild guessing. I tried to get some more
information using the perf tool.
I started with "perf stat", comparing the single-worker case with the
4-worker case. For the PPL test I obtained the following:
=================================================
# LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4" perf stat ./paulson 1 10000 1

 Performance counter stats for './paulson 1 10000 1':

       1167.033565      task-clock (msec)         #    0.998 CPUs utilized
               107      context-switches          #    0.092 K/sec
                 3      cpu-migrations            #    0.003 K/sec
               671      page-faults               #    0.575 K/sec
     3,666,223,020      cycles                    #    3.141 GHz
       983,894,620      stalled-cycles-frontend   #   26.84% frontend cycles idle
   <not supported>      stalled-cycles-backend
     7,803,455,611      instructions              #    2.13  insns per cycle
                                                  #    0.13  stalled cycles per insn
     1,553,152,858      branches                  # 1330.855 M/sec
         7,428,588      branch-misses             #    0.48% of all branches

       1.169603522 seconds time elapsed
=================================================
# LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4" perf stat ./paulson 4 10000 1

 Performance counter stats for './paulson 4 10000 1':

       5851.363612      task-clock (msec)         #    3.944 CPUs utilized
               922      context-switches          #    0.158 K/sec
                27      cpu-migrations            #    0.005 K/sec
               993      page-faults               #    0.170 K/sec
    16,867,156,201      cycles                    #    2.883 GHz
     6,529,193,696      stalled-cycles-frontend   #   38.71% frontend cycles idle
   <not supported>      stalled-cycles-backend
    31,184,317,457      instructions              #    1.85  insns per cycle
                                                  #    0.21  stalled cycles per insn
     6,204,550,206      branches                  # 1060.360 M/sec
        29,955,117      branch-misses             #    0.48% of all branches

       1.483624741 seconds time elapsed
=================================================
We can see a significant increase, well beyond the 4x increase in work,
in the number of context switches (107 -> 922) and CPU migrations (3 -> 27),
as well as an increase in the percentage of stalled frontend cycles
(26.84% -> 38.71%). Another thing to monitor is the number of cache misses:
================================================
# LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4" perf stat -e L1-dcache-load-misses ./paulson 1 10000 1

 Performance counter stats for './paulson 1 10000 1':

        14,692,004      L1-dcache-load-misses

       1.179272479 seconds time elapsed
================================================
# LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4" perf stat -e L1-dcache-load-misses ./paulson 4 10000 1

 Performance counter stats for './paulson 4 10000 1':

       192,535,509      L1-dcache-load-misses

       1.502333756 seconds time elapsed
================================================
The factor here is ~ 13 (192,535,509 / 14,692,004), much higher than the
4x increase in work.
In contrast, the test not using the PPL, being CPU-intensive rather than
memory-intensive, scores as follows:
1 worker: 171,187 L1-dcache-load-misses
4 workers: 395,932 L1-dcache-load-misses
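Here the factor is only ~ 2.3 (395,932 / 171,187), even though the 4
workers are again doing 4 times the work.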
As I said at the beginning, I am not an expert, so I may have missed
something trivial, making everything I said above "just junk".
Anyway, this investigation was interesting. Maybe you will want to
perform a similar analysis on your machine; in that case, your feedback
is welcome.
Cheers,
Enea.