Issue:
https://github.com/waku-org/nwaku/issues/2921#issuecomment-2239277713
While analyzing Discv5 and the time it takes to find other peers, we observed in some of the simulations that some messages were being delivered twice (more info: https://www.notion.so/Message-hash-duplication-d59f6133a2e341398064562d7a4c74f2).
To investigate this further, we changed the node log level from INFO to TRACE. We then started seeing another issue: this time, some nodes were losing messages.
We were able to reproduce this issue in nWaku v0.30 and v0.31. Analyzing it further, we discovered that the issue of messages not being delivered only happens when the log level is set to TRACE.
While investigating this issue, we discovered a leak in Yamux: Yamux issue (Solved). We are still not sure whether it is related. However, the message loss also happens with mplex, and as far as we have investigated, the leak does not affect mplex.
vash | vaxis | nia | juhani | atris | ambellina | al | inferno | monstar |
---|---|---|---|---|---|---|---|---|
ruby-k8s-w01 | ruby-k8s-w09 | ruby-k8s-w17 | ruby-k8s-w25 | ruby-k8s-w33 | ruby-k8s-w41 | ruby-k8s-w49 | ruby-k8s-w57 | ruby-k8s-w65 |
… | … | … | … | … | … | … | … | … |
ruby-k8s-w08 | ruby-k8s-w16 | ruby-k8s-w24 | ruby-k8s-w32 | ruby-k8s-w40 | ruby-k8s-w48 | ruby-k8s-w56 | ruby-k8s-w64 | ruby-k8s-w72 |
To answer this, we will run 10 simulations with 100 Waku nodes each. Each time the issue happens, we will check whether the affected Waku node is sharing its virtual machine with another Waku node.
To help us visualize this, we will use the following table:
Workers \ Machines | vash | vaxis | nia | juhani | atris | ambellina | al | inferno | monstar |
---|---|---|---|---|---|---|---|---|---|
w01 | | | | | | | | | |
w02 | | | | | | | | | |
w03 | | | | | | | | | |
w04 | | | | | | | | | |
w05 | | | | | | | | | |
w06 | | | | | | | | | |
w07 | | | | | | | | | |
w08 | | | | | | | | | |
In each simulation, every time a node fails to receive all messages because it was in a blocked state, we will record it in this table. The idea is to check whether the malfunctioning nodes share a virtual machine (row, Worker) and/or a physical host (column, Machine).
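As a rough illustration of how this bookkeeping could be automated, the sketch below maps a worker name to its cell in the table and counts failed nodes per worker and per machine. It assumes that each machine hosts 8 consecutive workers in the order shown in the machine table (`ruby-k8s-w01` to `ruby-k8s-w72`). The node names and the `failed` input are made up for the example; they are not results from our simulations.

```python
from collections import Counter

# Physical hosts in the order of the machine table above.
# Assumption: each host runs 8 consecutive workers (w01-w08, w09-w16, ...).
HOSTS = ["vash", "vaxis", "nia", "juhani", "atris",
         "ambellina", "al", "inferno", "monstar"]
WORKERS_PER_HOST = 8


def locate(worker):
    """Map a worker name such as 'ruby-k8s-w17' to (machine column, row label)."""
    number = int(worker.rsplit("-w", 1)[1])
    if not 1 <= number <= len(HOSTS) * WORKERS_PER_HOST:
        raise ValueError(f"unknown worker: {worker}")
    host = HOSTS[(number - 1) // WORKERS_PER_HOST]
    row = f"w{(number - 1) % WORKERS_PER_HOST + 1:02d}"
    return host, row


def tally(failed):
    """Count failed nodes per worker and per machine.

    `failed` maps a node name to the worker it was scheduled on.
    """
    by_worker = Counter(failed.values())
    by_host = Counter(locate(worker)[0] for worker in failed.values())
    return by_worker, by_host


if __name__ == "__main__":
    # Hypothetical input: node names and placements are invented for illustration.
    failed = {
        "nodes-12": "ruby-k8s-w17",
        "nodes-40": "ruby-k8s-w17",
        "nodes-77": "ruby-k8s-w20",
    }
    by_worker, by_host = tally(failed)
    print("Workers with more than one failed node:",
          [w for w, n in by_worker.items() if n > 1])
    print("Machines with more than one failed node:",
          [h for h, n in by_host.items() if n > 1])
```

A worker count above one means that failed nodes shared a virtual machine, and a machine count above one means that they shared a physical host.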