Issue:

https://github.com/waku-org/nwaku/issues/2921#issuecomment-2239277713

First iteration

TL;DR:

While analyzing Discv5 and the time it takes nodes to find other peers, we observed in some of the simulations that some messages were being delivered twice (more info: https://www.notion.so/Message-hash-duplication-d59f6133a2e341398064562d7a4c74f2).
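As a rough illustration of how such duplicates can be spotted once per-node lists of received message hashes have been extracted from the logs, a small sketch (the hash values and the extraction step are assumptions, not nwaku's actual log format):

```python
from collections import Counter

def find_duplicates(delivered_hashes):
    """Return the message hashes a node received more than once,
    mapped to how many times each was seen."""
    counts = Counter(delivered_hashes)
    return {h: n for h, n in counts.items() if n > 1}

# Hypothetical list of message hashes one node reported as received
received = ["0xaa", "0xbb", "0xaa", "0xcc"]
print(find_duplicates(received))  # {'0xaa': 2}
```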

In order to investigate this further, we raised the node log level from INFO to TRACE. We then started seeing another issue: this time, some nodes were losing messages.

We were able to reproduce this issue in nWaku v0.30 and v0.31. Analyzing it further, we ended up discovering that the message-loss issue only happens when the log level is set to TRACE.
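A minimal way to check which nodes lost messages, assuming we know the full set of published message hashes and, per node, the hashes it actually received (all names and inputs here are hypothetical):

```python
def missing_messages(published, received_by_node):
    """Map each node to the set of published hashes it never received.
    Nodes that received everything are omitted."""
    published = set(published)
    return {node: published - set(seen)
            for node, seen in received_by_node.items()
            if published - set(seen)}

published = ["0x01", "0x02", "0x03"]
received = {"node-1": ["0x01", "0x02", "0x03"],
            "node-2": ["0x01", "0x03"]}  # node-2 lost 0x02
print(missing_messages(published, received))  # {'node-2': {'0x02'}}
```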

While investigating this issue, we discovered a leak in Yamux: Yamux issue (solved). We are still not sure whether it is related: the message loss also happens with mplex, and as far as we investigated, the leak is not present in mplex.


Lab Structure: 9 physical machines with 8 virtual machines each

vash:      ruby-k8s-w01 to ruby-k8s-w08
vaxis:     ruby-k8s-w09 to ruby-k8s-w16
nia:       ruby-k8s-w17 to ruby-k8s-w24
juhani:    ruby-k8s-w25 to ruby-k8s-w32
atris:     ruby-k8s-w33 to ruby-k8s-w40
ambellina: ruby-k8s-w41 to ruby-k8s-w48
al:        ruby-k8s-w49 to ruby-k8s-w56
inferno:   ruby-k8s-w57 to ruby-k8s-w64
monstar:   ruby-k8s-w65 to ruby-k8s-w72
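Given the layout above (workers w01 to w08 on vash, w09 to w16 on vaxis, and so on, 8 per machine), the physical host of any worker can be derived from its index; a small sketch:

```python
MACHINES = ["vash", "vaxis", "nia", "juhani", "atris",
            "ambellina", "al", "inferno", "monstar"]

def host_of(worker):
    """Map a worker name like 'ruby-k8s-w17' to its physical machine."""
    n = int(worker.rsplit("w", 1)[1])   # worker index, 1..72
    return MACHINES[(n - 1) // 8]       # 8 workers per physical machine

print(host_of("ruby-k8s-w17"))  # nia
```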

Question 1: Is the issue related to the number of waku nodes per virtual machine?

In order to answer this, we will run 10 simulations with 100 waku nodes each. Each time the issue happens, we will check whether the affected waku node is sharing its virtual machine with another waku node.

To help us visualize this, we will use the following table:

Machines (columns):
          vash  vaxis  nia  juhani  atris  ambellina  al  inferno  monstar
Workers (rows):
    w01
    w02
    w03
    w04
    w05
    w06
    w07
    w08

In each simulation, every time a node fails to receive all messages because it has been in a blocked state, we will record it in this table. The idea is to check whether malfunctioning nodes share a virtual machine (row: Worker) and/or a physical host (column: Machine).
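To fill the table, one can tally failures per (machine, worker-slot) cell across the 10 runs; a sketch, where the list of failed worker names is a hypothetical input gathered from the simulations:

```python
from collections import defaultdict

MACHINES = ["vash", "vaxis", "nia", "juhani", "atris",
            "ambellina", "al", "inferno", "monstar"]

def tally(failures):
    """Count blocked nodes per table cell.

    `failures` is a list of worker names such as 'ruby-k8s-w17',
    one entry per failed node per simulation (hypothetical input).
    Returns a dict keyed by (machine, worker_slot), where
    machine is the column (physical host) and worker_slot is the
    row (VM slot 1..8).
    """
    grid = defaultdict(int)
    for worker in failures:
        n = int(worker.rsplit("w", 1)[1])  # worker index, 1..72
        machine = MACHINES[(n - 1) // 8]   # column: physical host
        slot = (n - 1) % 8 + 1             # row: VM slot on that host
        grid[(machine, slot)] += 1
    return dict(grid)

print(tally(["ruby-k8s-w17", "ruby-k8s-w17", "ruby-k8s-w24"]))
# {('nia', 1): 2, ('nia', 8): 1}
```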