Parallelization of recompute

_Nemo
Posts: 117
Joined: Sat Feb 06, 2021 8:25 am

Parallelization of recompute

Post by _Nemo »

Hello, everybody!
I've recently been trying to use multithreading to speed up time-consuming recomputes.
I use QtConcurrent's run() function to execute the key method _recomputeFeature().

Code: Select all

int Document::recompute(const std::vector<App::DocumentObject*> &objs, bool force, bool *hasError, int options)
{
    if (d->undoing || d->rollback) {
        if (FC_LOG_INSTANCE.isEnabled(FC_LOGLEVEL_LOG))
            FC_WARN("Ignore document recompute on undo/redo");
        return 0;
    }

    int objectCount = 0;
    if (testStatus(Document::PartialDoc)) {
        if(mustExecute())
            FC_WARN("Please reload partial document '" << Label.getValue() << "' for recomputation.");
        return 0;
    }
    if (testStatus(Document::Recomputing)) {
        // this is clearly a bug in the calling instance
        FC_ERR("Recursive calling of recompute for document " << getName());
        return 0;
    }
    // The 'SkipRecompute' flag can be (tmp.) set to avoid too many
    // time expensive recomputes
    if(!force && testStatus(Document::SkipRecompute)) {
        signalSkipRecompute(*this,objs);
        return 0;
    }

    // delete recompute log
    d->clearRecomputeLog();

    FC_TIME_INIT(t);

    Base::ObjectStatusLocker<Document::Status, Document> exe(Document::Recomputing, this);
    signalBeforeRecompute(*this);

#if 0
    //////////////////////////////////////////////////////////////////////////
    // FIXME Comment by Realthunder:
    // the topologicalSort() below cannot handle partial recompute; haven't had
    // time to figure out the code yet, so simply fall back to boost::topological_sort
    // for now, that is, rely on getDependencyList() to do the sorting. The
    // downside is that it doesn't take advantage of the ready-built InList, nor
    // will it report cyclic dependencies.
    //////////////////////////////////////////////////////////////////////////

    // get the sorted vector of all dependent objects and go through it from the end
    auto depObjs = getDependencyList(objs.empty()?d->objectArray:objs);
    vector<DocumentObject*> topoSortedObjects = topologicalSort(depObjs);
    if (topoSortedObjects.size() != depObjs.size()){
        cerr << "App::Document::recompute(): cyclic dependency detected" << endl;
        topoSortedObjects = d->partialTopologicalSort(depObjs);
    }
    std::reverse(topoSortedObjects.begin(),topoSortedObjects.end());
#else
    auto topoSortedObjects = getDependencyList(objs.empty()?d->objectArray:objs,DepSort|options);
#endif
    for(auto obj : topoSortedObjects)
        obj->setStatus(ObjectStatus::PendingRecompute,true);

    ParameterGrp::handle hGrp = GetApplication().GetParameterGroupByPath(
            "User parameter:BaseApp/Preferences/Document");
    bool canAbort = hGrp->GetBool("CanAbortRecompute",true);

    std::set<App::DocumentObject *> filter;
    size_t idx = 0;

    FC_TIME_INIT(t2);

    try {
        // maximum two passes to allow some form of dependency inversion
        for(int passes=0; passes<2 && idx<topoSortedObjects.size(); ++passes) {
            std::unique_ptr<Base::SequencerLauncher> seq;
            if(canAbort)
                seq.reset(new Base::SequencerLauncher("Recompute...", topoSortedObjects.size()));
            FC_LOG("Recompute pass " << passes);
            for (; idx < topoSortedObjects.size(); ++idx) {
                auto obj = topoSortedObjects[idx];
                if(!obj->getNameInDocument() || filter.find(obj)!=filter.end())
                    continue;
                // ask the object if it should be recomputed
                bool doRecompute = false;
                if (obj->mustRecompute()) {
                    doRecompute = true;
                    ++objectCount;
                    
                    QFuture<int> ans = QtConcurrent::run([&]() -> int {
                        return _recomputeFeature(obj);
                    });
                    // result() blocks until the job has finished, so a
                    // separate waitForFinished() is not needed
                    int res = ans.result();

                    // original serial call:
                    // int res = _recomputeFeature(obj);
                    
                    if(res) {
                        if(hasError)
                            *hasError = true;
                        if(res < 0) {
                            passes = 2;
                            break;
                        }
                        // if something happened filter all object in its
                        // inListRecursive from the queue then proceed
                        obj->getInListEx(filter,true);
                        filter.insert(obj);
                        continue;
                    }
                }
                if(obj->isTouched() || doRecompute) {
                    signalRecomputedObject(*obj);
                    obj->purgeTouched();
                    // set all dependent object touched to force recompute
                    for (auto inObjIt : obj->getInList())
                        inObjIt->enforceRecompute();
                }
                if (seq)
                    seq->next(true);
            }
            // check if all objects are recomputed but still touched
            for (size_t i=0;i<topoSortedObjects.size();++i) {
                auto obj = topoSortedObjects[i];
                obj->setStatus(ObjectStatus::Recompute2,false);
                if(!filter.count(obj) && obj->isTouched()) {
                    if(passes>0)
                        FC_ERR(obj->getFullName() << " still touched after recompute");
                    else{
                        FC_LOG(obj->getFullName() << " still touched after recompute");
                        if(idx>=topoSortedObjects.size()) {
                            // let's start the next pass on the first touched object
                            idx = i;
                        }
                        obj->setStatus(ObjectStatus::Recompute2,true);
                    }
                }
            }
        }
    }catch(Base::Exception &e) {
        e.ReportException();
    }

    FC_TIME_LOG(t2, "Recompute");

    for(auto obj : topoSortedObjects) {
        if(!obj->getNameInDocument())
            continue;
        obj->setStatus(ObjectStatus::PendingRecompute,false);
        obj->setStatus(ObjectStatus::Recompute2,false);
        if(obj->testStatus(ObjectStatus::PendingRemove))
            obj->getDocument()->removeObject(obj->getNameInDocument());
    }

    signalRecomputed(*this,topoSortedObjects);

    FC_TIME_LOG(t,"Recompute total");

    if (d->_RecomputeLog.size())
        Base::Console().Log("Recompute failed! Please check report view.\n");

    return objectCount;
}
But when debugging, the program seems to have encountered a deadlock.
The parallel stack is shown in the attachment.
Could someone please give me some simple suggestions?
Attachments
interrupt.png
acpopescu
Posts: 16
Joined: Thu Jan 19, 2023 11:49 pm
Contact:

Re: Parallelization of recompute

Post by acpopescu »

Hey, it's python :)

The global interpreter lock will throw a spanner in everything that is trying to call into it.

You can try to use a fiber to serialize all calls into Python, which would allow the C++ code to run in parallel. There _may_ be some wins, if the C++ code is where the perf issue lies.

The idea would be to suspend the fiber when it tries to enter a Python call and put it into a "Python GIL lock" queue that does serial processing under the GIL. That processing thread, when done with the Python job, can then enqueue the fiber back onto the regular parallel thread pool, suspend it, and go on to the next fiber job. A much simplified sketch of this serialization idea (without fibers) follows below.
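This is not FreeCAD code, just a minimal sketch of the serialization idea under stated assumptions: the class name PythonCallQueue is invented, real fibers are replaced by plain blocking calls, and the GIL handling on the dedicated thread (PyGILState_Ensure/Release) is only indicated by a comment. All Python work is funnelled through one thread, while the ordinary C++ workers keep running in parallel and block only when they need a Python result.

Code: Select all

#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <type_traits>

// All Python work is funnelled through one worker thread; C++ jobs stay
// free to run in parallel on the regular thread pool.
class PythonCallQueue {
public:
    PythonCallQueue() : worker([this] { run(); }) {}

    ~PythonCallQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex);
            stop = true;
        }
        cond.notify_one();
        worker.join();
    }

    // Called from any C++ worker thread: hand the Python job to the
    // dedicated thread and block until it has been executed there.
    // (A fiber-based version would suspend here instead of blocking.)
    template <class F>
    auto call(F&& f) {
        using R = std::invoke_result_t<F>;
        auto task = std::make_shared<std::packaged_task<R()>>(std::forward<F>(f));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex);
            jobs.emplace([task] { (*task)(); });
        }
        cond.notify_one();
        return result.get();
    }

private:
    void run() {
        // In a real integration this thread would hold the GIL while it
        // drains the queue (PyGILState_Ensure / PyGILState_Release).
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex);
                cond.wait(lock, [this] { return stop || !jobs.empty(); });
                if (stop && jobs.empty())
                    return;
                job = std::move(jobs.front());
                jobs.pop();
            }
            job();
        }
    }

    std::mutex mutex;
    std::condition_variable cond;
    std::queue<std::function<void()>> jobs;
    bool stop = false;
    std::thread worker;
};

A worker would then wrap every call into the interpreter as something like pythonQueue.call([&] { /* touch Python here */ return 0; }) instead of calling it directly.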

If your performance profiling (Diagnostic tools -> perf profile) shows that Python is where most of the problems are, you're possibly really out of luck.
_Nemo
Posts: 117
Joined: Sat Feb 06, 2021 8:25 am

Re: Parallelization of recompute

Post by _Nemo »

acpopescu wrote: Mon Jan 30, 2023 7:35 pm If your performance profiling (Diagnostic tools -> perf profile) shows that Python is where most of the problems are, you're possibly really out of luck.
Thank you very much for your advice!
I tried to use VS's performance analysis tool.
The attachment is the call tree summary I exported.
I don't know how to judge whether the bottleneck is in C++ or in Python.
Could you help me? :oops:
Attachments
性能测试.png (performance profile)
_Nemo
Posts: 117
Joined: Sat Feb 06, 2021 8:25 am

Re: Parallelization of recompute

Post by _Nemo »

acpopescu wrote: Mon Jan 30, 2023 7:35 pm The global interpreter lock will throw a spanner in everything that is trying to call into it.
I debugged a part as an example, and its dependency diagram is shown in the attachment.
依赖图.png (dependency graph)
After topological sorting, the features in the graph are recomputed one by one.
Deadlock occurs when Sketch002 is recomputed. So I guess the features contained in the "Body" box can only be processed serially at present. (I also output the thread situation at runtime as shown in the attachment.)
输出.png (thread output)
I don't understand the mechanism of the global interpreter lock. Can you point me to any relevant documentation, or tell me where in the code I should look?
acpopescu
Posts: 16
Joined: Thu Jan 19, 2023 11:49 pm
Contact:

Re: Parallelization of recompute

Post by acpopescu »

Any code that calls into Python, or is called from Python, is subject to a global lock that prevents concurrent modification of Python structures. It is a big issue for multithreading with Python.

See https://wiki.python.org/moin/GlobalInterpreterLock

As soon as you start a new thread, it will try to take the lock. Since the calling thread (_recompute) is already holding it (the call is coming from Python), what most likely happens is:
  • on UI thread
  • User presses recompute
  • UI calls into python
  • Python locks GIL
  • Python executes scripts
  • _recompute gets called from python code
  • you launch 30 jobs
    • each job does some work
    • each job calls into python
    • python locks the GIL, job is now waiting for the GIL to be released
  • you wait for the jobs to complete (BUT you are UNDER GIL lock) <--- DEADLOCK
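To make that ordering concrete, here is a minimal stand-alone sketch (not FreeCAD code) of the failure mode and of one way out of it: the thread that waits temporarily gives up the GIL around the blocking wait. It uses the plain CPython macros Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS; FreeCAD's PyGILStateLocker is a wrapper around the corresponding PyGILState_* calls on the acquiring side. Note this only removes the deadlock; the per-feature Python work still serializes on the GIL.

Code: Select all

#include <Python.h>
#include <QFuture>
#include <QtConcurrent/QtConcurrent>

// Assumes the caller holds the GIL because the recompute was triggered
// from Python (e.g. Gui command -> runString).
int runJobWhileHoldingGil()
{
    QFuture<int> job = QtConcurrent::run([]() -> int {
        // The worker needs the GIL as soon as it touches Python
        // (FeaturePython execute(), expressions, ...).
        PyGILState_STATE state = PyGILState_Ensure();
        // ... run the Python part of the feature here ...
        PyGILState_Release(state);
        return 0;
    });

    // job.result() alone would deadlock: this thread still owns the GIL,
    // so the worker can never get past PyGILState_Ensure().
    int res = 0;
    Py_BEGIN_ALLOW_THREADS      // give up the GIL while we wait
    res = job.result();         // now the worker can acquire it and finish
    Py_END_ALLOW_THREADS        // take the GIL back before returning to Python
    return res;
}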

My perf trace (the time is actually spent in OPENCASCADE processing my recompute command, the tk* calls):

That means it could be parallelized, but it needs isolation from Python and the other dependencies.
calltree.jpg

Possible solutions
  • Recompute immediately returns a future that, when finished, says the document is updated (see the sketch after this list)
    • create a "recompute job" that launches all the other 30 jobs needed and waits for them.
    • this will eventually end up waiting for the GIL, but the trick is to release the GIL on the Python side
    • return a future to that job in the Python code.
    • exit the Python code back to the UI. This SHOULD release the GIL, allowing the background job to continue
    • in the APP, when the future is done, call into the Python continuation/callback on the correct thread (I don't know FreeCAD well enough to tell you the details). You may need to create a "pending done future notification" queue, where the Python continuations/callbacks are serialized on the correct thread.
  • Recompute immediately waits for everything else to be done
    • recompute queues 30 fibers (!)
    • Then, for each queued fiber, resume it serially on the current thread (the current thread holds the GIL lock)
    • Each fiber does this every time "GIL lock state is needed"
      • DO I need the GIL lock?
        • NO - AM I on the GIL lock thread ?
          • YES - Enqueue myself on the parallel job threads and return control to parent/next job in current queue
          • NO - nothing to do
        • YES - AM I on the GIL lock thread?
          • YES - nothing to do
          • NO - Enqueue myself on the GIL job thread (original recompute launch) and return control to parent/next job in current queue
    • Recompute waits for all jobs to be done on all queues
    • Note that this approach assumes there ARE things that can be run in parallel with GIL lock. It allows you to choose which can be run in parallel, but you still need to be careful!
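Here is a rough sketch of the shape of the first option, under heavy assumptions: startRecomputeJob() and watchRecompute() are invented names, none of this is existing FreeCAD API, and it glosses over the thread-affinity problems of the document signals. The point is only that the GUI waits through a QFutureWatcher after the Python command has returned (and the GIL has been released), instead of blocking inside the call.

Code: Select all

#include <QFutureWatcher>
#include <QObject>
#include <QtConcurrent/QtConcurrent>
#include <vector>

#include <App/Document.h>
#include <App/DocumentObject.h>

// Kick off the whole recompute as a single background "recompute job".
// It must not be waited on from a thread that still holds the GIL.
QFuture<int> startRecomputeJob(App::Document* doc,
                               const std::vector<App::DocumentObject*>& objs)
{
    return QtConcurrent::run([doc, objs]() -> int {
        // internally this can launch and wait for the per-feature jobs
        return doc->recompute(objs, /*force=*/true);
    });
}

// GUI side: get notified on the main thread once the document is updated.
void watchRecompute(const QFuture<int>& job)
{
    auto* watcher = new QFutureWatcher<int>();
    QObject::connect(watcher, &QFutureWatcher<int>::finished, [watcher]() {
        int recomputedCount = watcher->result();
        // run the Python continuation / refresh the 3D view here
        (void)recomputedCount;
        watcher->deleteLater();
    });
    watcher->setFuture(job);
}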
_Nemo
Posts: 117
Joined: Sat Feb 06, 2021 8:25 am

Re: Parallelization of recompute

Post by _Nemo »

acpopescu wrote: Wed Feb 01, 2023 4:17 pm Any code that executes into python, or is executed from python is subject to a global lock, to prevent concurrent modifications of python structures. It is a big issue with multiprocessing on python.

See https://wiki.python.org/moin/GlobalInterpreterLock

...
Thank you very much for your patient reply! This helps me a lot!
It seems that I still need to learn a lot of new things.
Hope to communicate with you continuously!
_Nemo
Posts: 117
Joined: Sat Feb 06, 2021 8:25 am

Re: Parallelization of recompute

Post by _Nemo »

acpopescu wrote: Wed Feb 01, 2023 4:17 pm
...

Possible solutions
  • Recompute immediately returns a future that when finished will say the document is updated
    • create a "recompute job" that launches all the other 30 jobs needed and waits for them.
    • this will end up eventually waiting for the GIL, but the trick is to release the GIL in the python
    • return a future to that job in the python code.
    • exit the python code back to UI. This SHOULD release the GIL allowing the background job to continue
    • in the APP, when the future is done call into python continuation/callback on the correct thread ( I don't know FreeCAD well enough to tell you the details). You may need to create a "pending done future notification" queue, where the python continuation/callbacks are serialized on the correct thread.
...
I want to try this idea first, but how do I get into the python code?
Where is the code from runString to recompute?
QQ图片20230203162614.png
acpopescu
Posts: 16
Joined: Thu Jan 19, 2023 11:49 pm
Contact:

Re: Parallelization of recompute

Post by acpopescu »

Where is the code from runString to recompute?
That is python interpreter code - in itself not important. The important part is what is being executed in "runString", i.e. the interpreted python code itself.
What is it doing? Does it need the result of the recompute immediately? How does it get from runString into the recompute call in the interpreted code (i.e. the .py files)? Is it expecting the recompute to work?

What you can try, to make progress, is to write some ugly proto code to see whether the idea works and unblocks you:
  • queue all the QFuture<int> somewhere in a static vector, no need to wait on any job (see the producer-side sketch at the end of this post)
  • comment out the hasError handling, you can plug it in later in a different place. The key goal is to get past the deadlock.
  • return control to Python as before
  • in GUI_Command, after runString, drain the queue
  • Pseudocode

    Code: Select all

    while (!queue.empty())
    {
        QFuture<int> future = queue.back();
        queue.pop_back();
        future.waitForFinished();

        if (future.result() != 0)
        {
            // handle the error here (what hasError used to report)
        }
    }
    
Once you've confirmed that the proof of concept works and you're happy with the perf improvement, you can then move on and polish. You can remove some code that would get in the way of trying things out; the key thing is that it needs to go back in later.
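And on the producer side, inside the per-object loop of Document::recompute() from the first post, the corresponding change might look roughly like this. pendingRecomputes is an invented name, the static vector is only touched from the recompute thread, the error handling through res is dropped (to be re-added later, as said above), and the dependencies between the queued features are ignored entirely; this is only meant to get past the deadlock and measure.

Code: Select all

// somewhere accessible to both Document::recompute() and the GUI drain loop
static std::vector<QFuture<int>> pendingRecomputes;

// inside the per-object loop of Document::recompute():
if (obj->mustRecompute()) {
    doRecompute = true;
    ++objectCount;
    // queue the job instead of waiting for it; the futures are drained
    // in the GUI after runString() has returned and released the GIL
    pendingRecomputes.push_back(
        QtConcurrent::run([this, obj]() -> int {
            return _recomputeFeature(obj);
        }));
}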
_Nemo
Posts: 117
Joined: Sat Feb 06, 2021 8:25 am

Re: Parallelization of recompute

Post by _Nemo »

acpopescu wrote: Tue Feb 07, 2023 2:31 pm
The important part is what is being executed in "runString", i.e. the interpreted python code itself.
...
Thank you for your reply!
As you guessed before, what I did in the UI was to select all the objects, mark them for recompute, and then click Recompute.
At this time, only one line of code appears in the python console:

Code: Select all

App.activeDocument().recompute(None,True,True)
I'm not sure if this is the only code actually passed into the runString function.
And I found the lock in the runString function:

Code: Select all

std::string InterpreterSingleton::runString(const char *sCmd)
{
    PyObject *module, *dict, *presult;          /* "exec code in d, d" */

    PyGILStateLocker locker;                    /* acquires the GIL; it is held until runString() returns */
    module = PP_Load_Module("__main__");         /* get module, init python */
    ...
}
In addition, I don't quite understand why a queue is used to store the QFuture<int> objects. I know that a queue is FIFO, but each QFuture<int> should hold the result state of recomputing one object, as returned by _recomputeFeature.
I think the program will already be deadlocked before the loop over the objects has finished (i.e. before the queue has even been built).