Issue:

https://github.com/waku-org/nwaku/issues/2921#issuecomment-2239277713

First iteration

TL;DR:

While analyzing Discv5 and the time it takes nodes to find other peers, we observed in some of the simulations that some messages were being delivered twice (more info: https://www.notion.so/Message-hash-duplication-d59f6133a2e341398064562d7a4c74f2).
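As a rough illustration of how such duplicates can be spotted once per-node lists of received message hashes have been extracted from the logs, a small sketch (the hash values and the extraction step are assumptions, not nwaku's actual log format):

```python
from collections import Counter

def find_duplicates(delivered_hashes):
    """Return the message hashes a node received more than once,
    mapped to how many times each was seen."""
    counts = Counter(delivered_hashes)
    return {h: n for h, n in counts.items() if n > 1}

# Hypothetical list of message hashes one node reported as received
received = ["0xaa", "0xbb", "0xaa", "0xcc"]
print(find_duplicates(received))  # {'0xaa': 2}
```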

In order to investigate this further, we raised the node log level from INFO to TRACE. We then started seeing another issue: this time, some nodes were losing messages.

We were able to reproduce this issue in nWaku v0.30 and v0.31. Analyzing it further, we ended up discovering that the message-loss issue only happens when the log level is set to TRACE.
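A minimal way to check which nodes lost messages, assuming we know the full set of published message hashes and, per node, the hashes it actually received (all names and inputs here are hypothetical):

```python
def missing_messages(published, received_by_node):
    """Map each node to the set of published hashes it never received.
    Nodes that received everything are omitted."""
    published = set(published)
    return {node: published - set(seen)
            for node, seen in received_by_node.items()
            if published - set(seen)}

published = ["0x01", "0x02", "0x03"]
received = {"node-1": ["0x01", "0x02", "0x03"],
            "node-2": ["0x01", "0x03"]}  # node-2 lost 0x02
print(missing_messages(published, received))  # {'node-2': {'0x02'}}
```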

While investigating this issue, we discovered a leak in Yamux: Yamux issue (solved). We are still not sure whether it is related: the message loss also happens with mplex, and as far as we investigated, the leak is not present in mplex.


Lab Structure: 9 physical machines with 8 virtual machines each

vash:      ruby-k8s-w01 to ruby-k8s-w08
vaxis:     ruby-k8s-w09 to ruby-k8s-w16
nia:       ruby-k8s-w17 to ruby-k8s-w24
juhani:    ruby-k8s-w25 to ruby-k8s-w32
atris:     ruby-k8s-w33 to ruby-k8s-w40
ambellina: ruby-k8s-w41 to ruby-k8s-w48
al:        ruby-k8s-w49 to ruby-k8s-w56
inferno:   ruby-k8s-w57 to ruby-k8s-w64
monstar:   ruby-k8s-w65 to ruby-k8s-w72
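Given the layout above (workers w01 to w08 on vash, w09 to w16 on vaxis, and so on, 8 per machine), the physical host of any worker can be derived from its index; a small sketch:

```python
MACHINES = ["vash", "vaxis", "nia", "juhani", "atris",
            "ambellina", "al", "inferno", "monstar"]

def host_of(worker):
    """Map a worker name like 'ruby-k8s-w17' to its physical machine."""
    n = int(worker.rsplit("w", 1)[1])   # worker index, 1..72
    return MACHINES[(n - 1) // 8]       # 8 workers per physical machine

print(host_of("ruby-k8s-w17"))  # nia
```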

Question 1: Is the issue related to the number of waku nodes per virtual machine?

In order to answer this, we will run 10 simulations with 100 waku nodes each. Each time the issue happens, we will check whether the affected waku node is sharing its virtual machine with another waku node.

To help us visualize this, we will use the following table:

Machines (columns):
          vash  vaxis  nia  juhani  atris  ambellina  al  inferno  monstar
Workers (rows):
    w01
    w02
    w03
    w04
    w05
    w06
    w07
    w08

In each simulation, every time a node fails to receive all messages because it has been in a blocked state, we will record it in this table. The idea is to check whether malfunctioning nodes share a virtual machine (row: Worker) and/or a physical host (column: Machine).
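To fill the table, one can tally failures per (machine, worker-slot) cell across the 10 runs; a sketch, where the list of failed worker names is a hypothetical input gathered from the simulations:

```python
from collections import defaultdict

MACHINES = ["vash", "vaxis", "nia", "juhani", "atris",
            "ambellina", "al", "inferno", "monstar"]

def tally(failures):
    """Count blocked nodes per table cell.

    `failures` is a list of worker names such as 'ruby-k8s-w17',
    one entry per failed node per simulation (hypothetical input).
    Returns a dict keyed by (machine, worker_slot), where
    machine is the column (physical host) and worker_slot is the
    row (VM slot 1..8).
    """
    grid = defaultdict(int)
    for worker in failures:
        n = int(worker.rsplit("w", 1)[1])  # worker index, 1..72
        machine = MACHINES[(n - 1) // 8]   # column: physical host
        slot = (n - 1) % 8 + 1             # row: VM slot on that host
        grid[(machine, slot)] += 1
    return dict(grid)

print(tally(["ruby-k8s-w17", "ruby-k8s-w17", "ruby-k8s-w24"]))
# {('nia', 1): 2, ('nia', 8): 1}
```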