Sketcher performance - where do those CPU cycles go.

About the development of the Part Design module/workbench. PLEASE DO NOT POST HELP REQUESTS HERE!
Forum rules
Be nice to others! Respect the FreeCAD code of conduct!
acolomitchi
Posts: 34
Joined: Tue Dec 06, 2022 1:41 am
Location: Melbourne, Australia
Contact:

Sketcher performance - where do those CPU cycles go.

Post by acolomitchi »

That's the continuation of Sketcher performance - why is Python the culprit?

I ended understanding how to use the VStudio profiler well enough to draw some conclusions, and Python is not the culprit (I'll put the FreeCAD version info and details on the procedure at the end. I'll say it here though that the profiling data has been obtained with a Windows build, on a Win10 Home hosted by a Lenovo laptop, AMD Ryzen 5 3500U 2.10 GHz, with 16.0 GB of RAM).

In short, for sketches with (too) many elements and constraints:
  1. about 58% of the CPU goes into SketchObject::solve(bool), with about 36% into Sketch::solve(void) and 21.5% into Sketch::setupSketch(...) (see Figure 1 below). Most of the CPU goes on eigen's account, but there are some small bits that I'll come back to .
    In any case, Python adds on overhead of about 3% to this branch of the call tree.
  2. about 38% of the CPU goes into some async execution with an anonymous (not assigned to any module) lambda - see Figure 2
    I have some troubles identifying where this comes from. Initially I thought that it may be some callbacks to pass the results back into Python-land once the "C++ solving" is complete, but I find way more plausible to be the

    Code: Select all

                auto fut = std::async(&System::identifyDependentParametersSparseQR,this,J,jacobianconstraintmap, pdiagnoselist, /*silent=*/true);
    
                makeSparseQRDecomposition( J, jacobianconstraintmap, SqrJT, rank, R, /*transposed=*/true, /*silent=*/false);
    
                int paramsNum = SqrJT.rows();
                int constrNum = SqrJT.cols();
    
                fut.wait(); // wait for the execution of identifyDependentParametersSparseQR to finish
     
    towards the end of the int System::diagnose(Algorithm alg) method in the GSC.cpp file (because, yeah, eigen again)
    I will appreciate though an explicit confirmation coming from the ones in-the-know that the first hypothesis (passing back some data into the Python-land by using async callbacks) is rubbish and I can forget about it.
Figure 1 - the SketchObject::solve(bool) CPU budget
sketcher-solve-cpu.png
sketcher-solve-cpu.png (64.85 KiB) Viewed 2914 times
Figure 2 the async call into some lambda
sketcher-async-cpu.png
sketcher-async-cpu.png (83.18 KiB) Viewed 2914 times
The FreeCAD version details

Code: Select all

OS: Windows 10 Version 2009
Word size of FreeCAD: 64-bit
Version: 0.21.0.31291 (Git)
Build type: Release
Branch: fallen-fruits
Hash: d763f5a64aef76c94653e7f43e854d3c962210d3
Python 3.8.10, Qt 5.15.2, Coin 4.0.1, Vtk 8.2.0, OCC 7.6.3
Locale: English/Australia (en_AU)
Installed mods: 
  * CurvedShapes 1.0.4
  * freecad.gears 1.0.0
  * lattice2 1.0.0
  * sheetmetal 0.2.59
The profiling procedure
  • Build the Release configuration following the instructions
  • download the
    solverLoad-2oo.py
    (4.13 KiB) Downloaded 76 times
    file and place it in some directory
  • open the solution in VisualStudio and set the FreeCADMainCmd as the solution's 'Startup project'
  • edit the FreeCADMainCmd project's properties and, in the Debugging section, set the full path to the solverLoad-2oo.py as the value for the "Command Arguments" property
  • in the VisualStudio Debug menu, choose the "Performance Profiler (Alt-F2)"
  • select the "CPU Usage" checkbox then press the "Start" button at the bottom of the page
  • wait for the execution to finish (in my config, took about 1m40secs) and the report page to finish analyzing the data
  • click "Open details" link (under the plot) and set the "current view" to "Module"
  • Right click on the list of modules and, on the popup menu, pick "Load all symbols". Wait for the view to settle
  • switch the current view to "Call tree"
  • explore at you leisure the contributions of various functions to CPU consumption
abdullah
Veteran
Posts: 4935
Joined: Sun May 04, 2014 3:16 pm
Contact:

Re: Sketcher performance - where do those CPU cycles go.

Post by abdullah »

acolomitchi wrote: Tue Dec 20, 2022 10:25 am about 58% of the CPU goes into SketchObject::solve(bool), with about 36% into Sketch::solve(void) and 21.5% into Sketch::setupSketch(...) (see Figure 1 below). Most of the CPU goes on eigen's account, but there are some small bits that I'll come back to .
That really sounds like it.
acolomitchi wrote: Tue Dec 20, 2022 10:25 am about 38% of the CPU goes into some async execution with an anonymous (not assigned to any module) lambda - see Figure 2
I have some troubles identifying where this comes from. Initially I thought that it may be some callbacks to pass the results back into Python-land once the "C++ solving" is complete, but I find way more plausible to be the

Code: Select all

            auto fut = std::async(&System::identifyDependentParametersSparseQR,this,J,jacobianconstraintmap, pdiagnoselist, /*silent=*/true);

            makeSparseQRDecomposition( J, jacobianconstraintmap, SqrJT, rank, R, /*transposed=*/true, /*silent=*/false);

            int paramsNum = SqrJT.rows();
            int constrNum = SqrJT.cols();

            fut.wait(); // wait for the execution of identifyDependentParametersSparseQR to finish
 
towards the end of the int System::diagnose(Algorithm alg) method in the GSC.cpp file (because, yeah, eigen again)
I will appreciate though an explicit confirmation coming from the ones in-the-know that the first hypothesis (passing back some data into the Python-land by using async callbacks) is rubbish and I can forget about it.
Yup. That lambda adds concurrency for the identification of parameters (showing which curves and parts of curves are constrained and which are not) via a QR decomposition. The nice part of it is that it runs totally in parallel with another QR decomposition of the same size for identifying rank and redundant/conflicting constraints (if you have a CPU monitor you will see another core at 100% during QR computation).

I am very well aware that QR decomposition is costly. When we switched from Eigen's dense QR decomposition to the sparse version, we interacted with Gaël from Eigen to solve some bugs in Eigen. He said that QR is specially costly and that, if we could get away with other type of decomposition, such as LU, which is also rank revealing, we would get quite some improvement. The thing is that nobody has managed to obtain information about which constraints are conflicting/redundant, or which parameters are constrained and which ones are not, from an LU decomposition.

All in all, your observations sound right to me: Python cost is negligible. The lion's share of the cost is QR decomposition using Eigen (the share will probably get higher and higher as the sketch is bigger in terms of elements and constraints).

Thanks for looking into it. It is reassuring to see that nothing seems off. If you still want to look deeper into other bits, please post your findings.
acolomitchi
Posts: 34
Joined: Tue Dec 06, 2022 1:41 am
Location: Melbourne, Australia
Contact:

Re: Sketcher performance - where do those CPU cycles go.

Post by acolomitchi »

abdullah wrote: Tue Dec 20, 2022 5:38 pm
Yup. That lambda adds concurrency for the identification of parameters (showing which curves and parts of curves are constrained and which are not) via a QR decomposition. The nice part of it is that it runs totally in parallel with another QR decomposition of the same size for identifying rank and redundant/conflicting constraints (if you have a CPU monitor you will see another core at 100% during QR computation).
I saw it, on Windows. Does it happen on Linux/Mac too? (I don't have Linux/Mac machines at the moment to give it a try)
Not so long ago, it didn't; I saw heaps of thread-pool projects to address the short-fall, but I also saw signs that their use is not justifiable by absence of thread-pooling in std::async any more.
E.g. V1 of this thread pool library (May 2021) bears in its "Motivation" :
... currently only the MSVC implementation[6] of std::async uses a thread pool, while GCC and Clang do not.
but that is entirely gone for v3
abdullah wrote: Tue Dec 20, 2022 5:38 pm I am very well aware that QR decomposition is costly. When we switched from Eigen's dense QR decomposition to the sparse version, we interacted with Gaël from Eigen to solve some bugs in Eigen. He said that QR is specially costly and that, if we could get away with other type of decomposition, such as LU, which is also rank revealing, we would get quite some improvement. The thing is that nobody has managed to obtain information about which constraints are conflicting/redundant, or which parameters are constrained and which ones are not, from an LU decomposition.
Oh, I'd have to dig deep for this one, has been... what?... 30 years since my courses in linear algebra. I'm afraid I won't have time for this chunk of research in the near future.
abdullah wrote: Tue Dec 20, 2022 5:38 pm The lion's share of the cost is QR decomposition using Eigen (the share will probably get higher and higher as the sketch is bigger in terms of elements and constraints).
O(N^3) in the number of constraints, if my guts feel it right.

abdullah wrote: Tue Dec 20, 2022 5:38 pm Thanks for looking into it. It is reassuring to see that nothing seems off. If you still want to look deeper into other bits, please post your findings.
I don't think looking deeper into the current state will bring in notable improvements (>15% reduction in CPU).
Anyway, I'll post another comment with the things that can be done on the current state of source code

Adrian

PS Oh, God, what have I gotten into? I swear all this started with buying my first 3D printer a wee less than 3mo ago
acolomitchi
Posts: 34
Joined: Tue Dec 06, 2022 1:41 am
Location: Melbourne, Australia
Contact:

Re: Sketcher performance - where do those CPU cycles go.

Post by acolomitchi »

abdullah wrote: Tue Dec 20, 2022 5:38 pm If you still want to look deeper into other bits, please post your findings.
Low hanging fruits (hanging so low they're more like fallen fruits):
  • the Constraint::redirectParams(MAP_pD_pD redirectionmap) takes the input argument by value, the operation costs 4.2% of the total CPU consumption. Passing the argument by const MAP_pD_pD& will likely eliminate the cost
    PerfReport-FC-deshuffle.png
    PerfReport-FC-deshuffle.png (25.39 KiB) Viewed 2748 times
  • various Constraint classes use deeply inefficient expression (and unnecessarily sacrifice performance for source code readability?)
    E.g. use of pow(x, 2) instead of an 'inline double square(double x) { return x*x; }' (in ConstraintAngleViaPoint, ConstraintPointOnEllipse, ConstraintPointOnHyperbola), computing the length of a segment only to use it by its squared value (why did one do a sqrt before squaring it back? - e.g. ConstraintAngleViaPoint::grad(double *param)).
    Optimizing them might gain back at lest half of the 3% CPU they currently cost.
    PerfReport-FC-grad.png
    PerfReport-FC-grad.png (25.32 KiB) Viewed 2748 times
One major thing that may contribute to performance: bringing in OpenMP into the picture (at least with the Windows port, my GPU gathers spiderwebs while I'm running FreeCAD).
`eigen` can theoretically make use of it if present (IDK if it actually uses it in the algos FC is using).
OpenCascade should be able to use it more effectively (boolean operation between parts, maybe?)

Has anyone looked into using openMP inside FC? Are there some relevant threads I can bring myself up-to-date on the matter?
(I'm annoyed of my GPU's lack of activity, the lazy btard).
acolomitchi
Posts: 34
Joined: Tue Dec 06, 2022 1:41 am
Location: Melbourne, Australia
Contact:

Re: Sketcher performance - where do those CPU cycles go.

Post by acolomitchi »

acolomitchi wrote: Wed Dec 21, 2022 12:16 am Low hanging fruits (hanging so low they're more like fallen fruits):
  • the Constraint::redirectParams(MAP_pD_pD redirectionmap) takes the input argument by value, the operation costs 4.2% of the total CPU consumption. Passing the argument by const MAP_pD_pD& will likely eliminate the cost
Passing the map by const & drops the execution time by 11% (from 1m38s to 1m28s - based on a statistical set of one lonely sample :large grin: And in the condition of executing it under Profiler with a script designed to put unreasonable load in the Sketcher. And with uncontrolled external factors, like YT music playing or not playing in a browser).

Objectively, tho', the Subsytem::redirectParams() is no longer noticeable against CPU allocated ticks

Do you want me to send a pull/merge request or can someone just change the signature of Constraint::redirectParams(MAP_pD_pD redirectionmap) to take a (const &) and be done with it? (I'd prefer the latter, less work for me :grinning:).
PerfReport-FC-deshuffle-fixed.png
PerfReport-FC-deshuffle-fixed.png (35.98 KiB) Viewed 2723 times
abdullah
Veteran
Posts: 4935
Joined: Sun May 04, 2014 3:16 pm
Contact:

Re: Sketcher performance - where do those CPU cycles go.

Post by abdullah »

acolomitchi wrote: Wed Dec 21, 2022 2:11 am Do you want me to send a pull/merge request or can someone just change the signature of Constraint::redirectParams(MAP_pD_pD redirectionmap) to take a (const &) and be done with it? (I'd prefer the latter, less work for me :grinning:).

PerfReport-FC-deshuffle-fixed.png
Passing CI:
https://github.com/FreeCAD/FreeCAD/pull/8070

Thanks!
abdullah
Veteran
Posts: 4935
Joined: Sun May 04, 2014 3:16 pm
Contact:

Re: Sketcher performance - where do those CPU cycles go.

Post by abdullah »

acolomitchi wrote: Wed Dec 21, 2022 12:16 am classes use deeply inefficient expression (and unnecessarily sacrifice performance for source code readability?)
Many times it is possible to reconcile the two.

About pow(x, 2) I wonder if this is not really optimized in the library. I would have to check the assembly code produced. If it is not optimised, an inline function as you indicate can be used without loss of readability.
acolomitchi wrote: Wed Dec 21, 2022 12:16 am ConstraintAngleViaPoint, ConstraintPointOnEllipse, ConstraintPointOnHyperbola), computing the length of a segment only to use it by its squared value (why did one do a sqrt before squaring it back? - e.g. ConstraintAngleViaPoint::grad(double *param)).
Optimizing them might gain back at lest half of the 3% CPU they currently cost.
If it is just a function ".length()", a function squaredlength() could be created and use it. That would not be any less readable.
acolomitchi wrote: Wed Dec 21, 2022 12:16 am One major thing that may contribute to performance: bringing in OpenMP into the picture (at least with the Windows port, my GPU gathers spiderwebs while I'm running FreeCAD).
`eigen` can theoretically make use of it if present (IDK if it actually uses it in the algos FC is using).
I think tests were done in the past with disappointing results. But this might have changed. Checking again would be good.
acolomitchi wrote: Wed Dec 21, 2022 12:16 am Has anyone looked into using openMP inside FC? Are there some relevant threads I can bring myself up-to-date on the matter?
(I'm annoyed of my GPU's lack of activity, the lazy btard).
I cannot recall (and I am in a rush to search for them now).

I appreciate you being annoyed by your GPU's lack of activity. It may get you creative in tackling it :mrgreen:

Feel free to investigate further or maybe you get into providing some PRs ;)
acolomitchi
Posts: 34
Joined: Tue Dec 06, 2022 1:41 am
Location: Melbourne, Australia
Contact:

Re: Sketcher performance - where do those CPU cycles go.

Post by acolomitchi »

abdullah wrote: Wed Dec 21, 2022 3:35 pm About pow(x, 2) I wonder if this is not really optimized in the library.
At least on my setup (Windows, VC++), it is not - the implementation is sticking to the C++ reference and promotes to the highest precision floating-point type of the two args (form 7)
abdullah wrote: Wed Dec 21, 2022 3:35 pm
acolomitchi wrote: Wed Dec 21, 2022 12:16 am Has anyone looked into using openMP inside FC? Are there some relevant threads I can bring myself up-to-date on the matter?
(I'm annoyed of my GPU's lack of activity, the lazy btard).
I cannot recall (and I am in a rush to search for them now).
I searched, but the only relevant mentions of openmp in the first 3 pages of the results popped up in re FEM wb.
abdullah wrote: Wed Dec 21, 2022 3:35 pm Feel free to investigate further or maybe you get into providing some PRs ;)
I might get the constraints' grads into the "aesthetically pleasing for perfectionists" state next week, just to have a go in playing with PRs in the FC proj.

Then, who knows? Depends on what annoys me the most - the do-nothing GPU or the lack of openmp libs in the VS/VC++ land - so, yeah, it may happen.
acolomitchi
Posts: 34
Joined: Tue Dec 06, 2022 1:41 am
Location: Melbourne, Australia
Contact:

Re: Sketcher performance - where do those CPU cycles go.

Post by acolomitchi »

acolomitchi wrote: Wed Dec 21, 2022 8:27 pm
abdullah wrote: Wed Dec 21, 2022 3:35 pm About pow(x, 2) I wonder if this is not really optimized in the library.
...
abdullah wrote: Wed Dec 21, 2022 3:35 pm Feel free to investigate further or maybe you get into providing some PRs ;)
I might get the constraints' grads into the "aesthetically pleasing for perfectionists" state next week, just to have a go in playing with PRs in the FC proj.
Actually, I might try a bit more than just optimizing the expressions (@DeepSOIC, if he's still around, may want to chime in)
  1. I'll try to switch the older code to DeriVector2 (yeah, I'd probably prefer a more formal automatic differentiation library being plugged in, but let's not jump that far away... for now :mrgreen: )
  2. I'll try to define the "constraint errors" and their gradients into "square distance errors" wherever they are defined in terms of "distance errors". Two reasons:
    1. less CPU intensive to compute
    2. the sqrt function has infinite derivative at 0 (zero). In an iterative approach to solving an eq system based on gradients, this is likely to throw the corrections steps outta whack (the DoF/dimension closer to its constrained position shouting louder "me, me, adjust me first, I have the highest gradient" - when, if fact, the other DoF-s farther would actually need more attention).

      Look, I can do nothing about the complexity of a QR decomposition, but if the problem can be formulated in terms that improve the stability of the algo that uses QR, less such expensive steps will be necessary. (think of chi-squared minimization methods, they don't try to minimize the "sqrt(chi-squared)", maybe they have some reasons :mrgreen:)
  3. I'll try to express "angle errors" in their specific constraints in terms of "square distance errors" too. More precisely, redefine the errors in term of difference between the expected vs actual of dot(v0, v1) and cross(v0, v1) (where v1, v2 are vectors derived from the specific geometry of the constraint - oh, boy, swampy ground here, I hope all the angle constraints will have vector elements that one can use) Rationale:
    1. eliminate the "impedance mismatch" between large linear dimensions vs angles in the (-pi, pi) range, leading to fudge factors that don't quite work
    2. angles (and their errors) have a bad habit of being periodic - which will make the (hyper-)surface of the restrictions show local minima/maxima in which an iterative algo can slide into and be blocked there. By contrast, dimensions (or their squares thereof) are better defined
    3. less CPU expensive to compute - just multiplications and additions (and, maaaaybe a single sin/cos evaluation in the beginning if the constrained is actually defined as an angle, but a good number of times they are "tangent to" and "normal to", which have trivial sin/cos-es)
User avatar
DeepSOIC
Veteran
Posts: 7896
Joined: Fri Aug 29, 2014 12:45 am
Location: used to be Saint-Petersburg, Russia

Re: Sketcher performance - where do those CPU cycles go.

Post by DeepSOIC »

acolomitchi wrote: Wed Dec 21, 2022 11:35 pm I'll try to define the "constraint errors" and their gradients into "square distance errors" wherever they are defined in terms of "distance errors".
I'm not 100% sure what you are up to, but be careful with it.
1) if the gradient of error function zeroes out where error zeroes out - that is bad. It causes trouble with convergence, solver precision and dof counting. A "classic" case of such situation, which can be created with just the UI, is constraining a point to an intersection of tangent things (point on circle + point on line + tangent between line and circle).
2) scales of error functions should be comparable. At the moment they are not (some error funcrtions are effectively millimeters, while others are radians). This causes troubles with sketches of large geometric size (a few meters and up). It'd be great if you can fix it (i began fixing it, that led me to try and rewrite solver architecture completely, which i did half-way and then abandoned).


acolomitchi wrote: Tue Dec 20, 2022 11:30 pm PS Oh, God, what have I gotten into? I swear all this started with buying my first 3D printer a wee less than 3mo ago
;) you are not alone ;) ;) though, it took me far more than three months to plough my dirty hands into freecad code
Post Reply