OpenSplice DDS Forum
luca.gherardi

Configuration for wireless network


We have a system where multiple nodes are connected over wireless. The wireless coverage is not perfect, and once in a while a node can drop its connection or switch from one access point to another.

I've noticed that when node B drops its connection, the packets sent by node B are not received by node A even after node B reconnects to the network.

In those cases the reader receives an invalid sample (i.e. the valid_data flag set to false, with sample state 2, view state 2, instance state 2).

 

How can we configure OpenSplice to keep messages buffered until node B reconnects to the network, so that they can still be delivered to node A?

For those messages we use a topic with the following settings: DDS::RELIABLE_RELIABILITY_QOS and DDS::KEEP_ALL_HISTORY_QOS.
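In code, the relevant topic setup looks roughly like this (a sketch using the classic OpenSplice C++ API; the participant, type name, and topic name are illustrative):

```cpp
// Sketch of the topic QoS described above (classic OpenSplice C++ API).
// "participant", "typeName" and the topic name are illustrative.
DDS::TopicQos topicQoS;
participant->get_default_topic_qos(topicQoS);
topicQoS.reliability.kind = DDS::RELIABLE_RELIABILITY_QOS;
topicQoS.history.kind     = DDS::KEEP_ALL_HISTORY_QOS;

DDS::Topic_var topic = participant->create_topic(
    "NodeMessages", typeName, topicQoS,
    NULL, DDS::STATUS_MASK_NONE);
```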

 

Thanks in advance for your answer and let me know if you need more information.


I think you need to distinguish between reliability and durability.

Reliability is about the guarantee that the writer-history will 'eventually' be replicated (that is, 'delivered') to the reader-history. Note that in non-steady-state, old samples in the writer-history might already be overwritten by new ones, depending on whether you use a KEEP_ALL history-policy at the writer-side; and delivered samples can of course push out samples from the reader's history, depending on the history-policy of that reader. For short disconnections/reconnections, the reliability-protocol should recover from message-loss, i.e. retransmit the messages that got lost during the disconnect. Yet I suspect here we're talking about multi-second disconnections, which likely implies that the reader needs to be re-discovered after the connection is re-established (and similarly on 'the other side', i.e. the reader re-discovering the writer) ..

It could be that deploying a TRANSIENT_LOCAL durability-QoS is helpful in these cases: upon reconnection and re-discovery, the reader would be considered a 'late-joiner' and would therefore be provided with the historical data kept at the writer-side (you can configure the amount, i.e. 'depth', of that history data using the durability-service QoS-settings on the topic-level).
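For illustration, that change might look like this in the classic C++ API (a sketch; the depth value is an assumption, pick whatever your application needs):

```cpp
// Sketch: make the topic TRANSIENT_LOCAL so late-joiners get history.
DDS::TopicQos topicQoS;
participant->get_default_topic_qos(topicQoS);
topicQoS.durability.kind = DDS::TRANSIENT_LOCAL_DURABILITY_QOS;

// The amount of history kept for late-joiners is configured via the
// durability-service settings on the topic level:
topicQoS.durability_service.history_kind  = DDS::KEEP_LAST_HISTORY_QOS;
topicQoS.durability_service.history_depth = 10;  // illustrative value
```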


Thanks a lot, Hans,

I will look into the durability settings. I have a couple of follow-up questions:

  • Is there a limit on how many messages a late-joining node will receive? Let's say I have a topic with RELIABLE reliability and KEEP_ALL history. Would a late-joining node receive all the messages published before? Those could be a lot.
  • The reference manual says that for TRANSIENT durability, messages are stored in the data distribution service and not in the writer. What does this mean when using the single-process (or standalone) configuration? In that case, is the data stored in the distribution service living on the sender side?
  • Looking into the mailing list I found this post (https://developer.opensplice.narkive.com/lO0KMMdt/ospl-dev-problems-with-reliable-communication-via-wan). Some of the ospl.xml settings suggested there sound relevant. Would you suggest getting started just with the durability settings, and turning to those settings only if the problem cannot be addressed with durability?

At the moment our priority is to receive the messages sent when the publisher (or subscriber) was not connected to the network.

 

Thanks again!

Luca


In steady state (i.e. when a writer isn't writing samples), a late-joiner will receive no more 'durable' (i.e. non-volatile) samples (of instances) than what is defined in the durability-service QoS-settings as configured via the topic-QoS policy. Those settings are max-samples (for all instances), max-samples-per-instance and/or max-instances.

W.r.t. where these samples are 'stored': that depends on the QoS. When using TRANSIENT_LOCAL durability, the samples are stored 'at the writer' (so they are gone when the writer terminates). If the durability QoS is set to TRANSIENT (or PERSISTENT), the samples are maintained in 'durability-services'. How many durability-services there are, and where they reside, depends on the deployment mode, which is either 'standalone' (the only option for the community-edition) or 'federated' (only available in the commercially supported version). In the federated case, each federation will typically have a durability-service configured, and these align themselves to ensure there are multiple copies available to provide late-joiners with historical data.

W.r.t. the reliability-over-WAN, it makes sense to NOT use multicast for the data-flows when exploiting Wi-Fi (for discovery it's fine). This can be accomplished by changing the xml-configuration-file: change the default setting of 'AllowMulticast' from 'true' to 'spdp', which means that multicast is ONLY used for the discovery-phase and not for the actual data-flows. When there are multiple recipients, each one will then be served by a unicast-stream, which often works better than multicast over Wi-Fi.

            <AllowMulticast>spdp</AllowMulticast>
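For context, in a typical ospl.xml this setting lives under the DDSI2 service's General section (a fragment; other settings elided):

```xml
<DDSI2Service name="ddsi2">
    <General>
        <!-- use multicast only for discovery (SPDP); unicast for data -->
        <AllowMulticast>spdp</AllowMulticast>
    </General>
</DDSI2Service>
```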


Hi Hans,

Thanks a lot for your feedback! I'll test the suggestions you proposed and get back in case they don't help (it might take a bit).

Out of curiosity, why would disabling multicast help?

Thanks again,

Luca


Wi-Fi is notoriously unreliable when it comes to multicast. If you have an excellent connection, that's not an issue, but typically the advantages of using multicast (send-once efficiency) are outweighed by the retransmissions required due to massive data-loss when using multicast over Wi-Fi.

I'm not sure however if that would impact your disconnect/reconnect issues .. but at least it's good to know I guess :)


Hi Hans,

 

Do I understand correctly that for TRANSIENT_LOCAL I have to apply the following settings?

  • Topic:
    • topicQoS.durability.kind = DDS::TRANSIENT_LOCAL_DURABILITY_QOS
    • topicQoS.durability_service.service_cleanup_delay = 0
  • Data reader:
    • inherit from topic
  • Data writer:
    • inherit from topic
    • writerQoS.writer_data_lifecycle.autodispose_unregistered_instances = true

My understanding is that with the default values, how many samples are stored will depend on the topic history QoS, which in my case is inherited by writers and readers:

  • KEEP_LAST_HISTORY_QOS: based on depth
  • KEEP_ALL_HISTORY_QOS: unlimited

 

Is that correct?

Thanks again,

Luca


I don't think you have to set the service-cleanup-delay.

W.r.t. the history: when using KEEP_ALL (for the durability-service QoS) you should also set the resource-limits, as otherwise it's likely that you'll run out of memory.
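A sketch of what that looks like (illustrative values; without such limits a KEEP_ALL durability-service history can grow without bound):

```cpp
// Sketch: KEEP_ALL durability-service history bounded by resource-limits.
topicQoS.durability_service.history_kind = DDS::KEEP_ALL_HISTORY_QOS;

// Bound the durability-service history so it can't exhaust memory
// (values are illustrative, tune them for your data sizes):
topicQoS.durability_service.max_samples              = 1000; // all instances
topicQoS.durability_service.max_instances            = 100;
topicQoS.durability_service.max_samples_per_instance = 10;
```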


Thanks a lot Hans,

From an initial test on a sample application I noticed that if I use those settings and destroy the data writer before creating the data reader, the message is still received by the data reader. Is that due to the fact that I'm creating writer and reader in the same process? I don't see the same behavior when running reader and writer in different processes.

Regarding history, I guess the topic history settings should be consistent with the durability history settings?

I'll set the limits as suggested. When set to KEEP_ALL, it seems to stop receiving data pretty soon when sending 1 MB messages. Are there memory settings (e.g. in ospl.xml) that I could change in case I need more memory?

Thanks a lot for the great support!

Luca


Hi Luca,

Receiving transient-local data from a destroyed data-writer would be a true miracle :) (as that data is solely maintained at that writer).

Are you sure that there are no other writers alive in the system whose data you're receiving?

The only other possibility would be if your data was TRANSIENT instead of TRANSIENT_LOCAL and there were other apps alive that have an 'embedded' durability-service (the community-edition doesn't support federated-deployment, where such a durability-service would be part of a federation, which doesn't necessarily need to include any applications).

W.r.t. the 'normal' versus topic-level durability-service history settings: for TRANSIENT and TRANSIENT_LOCAL, the topic-level durability-service history/resource-limits settings drive how much historical data is preserved for late-joiners. The 'normal' history-settings aren't about durability but determine the behavior of writer- and reader-caches. For a writer, a KEEP_LAST history implies that when writing data faster than the system (typically the network) can handle, old data will be overwritten with fresh data even before it's transmitted. For a reader, a KEEP_LAST history means that when the reader can't keep up with the flow of arriving data, the data in its history-cache will be overwritten with fresh data so that 'at least' the most recent data is available for consumption. Note that this overwrite-behavior happens for each instance individually (i.e. the history-depth applies to the history-size of each instance).

Using a KEEP_ALL policy (at writer and reader) implies flow-control and will (eventually, i.e. after queueing resources are used up) cause end-to-end flow-control, where a slow reader determines the speed at which a writer can publish samples. It should therefore be handled with care, i.e. used only for 'event-kind' data where it's important that all samples are delivered and consumed in order. This is different from typical telemetry or state-kind data, where the most recent value is what is typically required, and where this downsampling is thus allowed and can even be considered a feature, as it maintains the decoupling between autonomous applications.
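As a small illustration of the flow-control behavior: with KEEP_ALL plus RELIABLE, a write() call can block when resources are exhausted, instead of silently overwriting old samples (a sketch; values are illustrative):

```cpp
// Sketch: KEEP_ALL writer with bounded blocking on back-pressure.
DDS::DataWriterQos writerQoS;
publisher->get_default_datawriter_qos(writerQoS);
writerQoS.history.kind     = DDS::KEEP_ALL_HISTORY_QOS;
writerQoS.reliability.kind = DDS::RELIABLE_RELIABILITY_QOS;

// When the writer's queueing resources fill up because a reader can't
// keep up, write() blocks for at most this long and then returns
// RETCODE_TIMEOUT (illustrative value):
writerQoS.reliability.max_blocking_time.sec     = 1;
writerQoS.reliability.max_blocking_time.nanosec = 0;
```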

Hope this helps a little

-Hans


Thanks Hans,

There was actually an error on my side. The data writer object was destroyed but I did not destroy it on the domain participant side, so I assume it was still alive.

Thanks also for the clarification on durability and history.


Hi Hans,

We've deployed the solution you proposed and we are experiencing a couple of problems:

  • Our data writer is always alive, while the reader is created when needed. Therefore, when we create the reader, we receive the last N messages sent by the writer (where N is the length of the queue). I expect this to be normal. However, in a few circumstances I've seen the messages being received twice. Is that due to some misconfiguration?
  • On some of the nodes connected via Wi-Fi we had a segmentation fault of the application. Unfortunately we couldn't look into the core dumps, but looking at dmesg (see below) we can see that the library /usr/lib/libdurability.so seems to be involved. In those cases I can also see in the ospl-info log the kinds of warnings reported below (for the thread warnings, the log is pretty much spammed with them). Any idea of what could cause this, or where to look for possible issues?
    • thread tev failed to make progress
    • thread xmit.user failed to make progress
    • writer 409049990:125:1:2050 topic d_nameSpacesRequest waiting on high watermark due to reader 1484033314:125:1:3847
    • Already tried to resend d_nameSpaceRequest message '10' times

Thanks a lot!


Luca

[ 1273.988742] conflictResolve[1023]: unhandled level 1 translation fault (11) at 0x00000008, esr 0x92000005
[ 1273.998329] pgd = ffffffc07b086000
[ 1274.001737] [00000008] *pgd=0000000000000000, *pud=0000000000000000
[ 1274.008034] 
[ 1274.009532] CPU: 3 PID: 1023 Comm: conflictResolve Not tainted 4.4.38-rt49+ #4
[ 1274.016759] Hardware name: quill (DT)
[ 1274.020464] task: ffffffc07a435100 ti: ffffffc1e5904000 task.ti: ffffffc1e5904000
[ 1274.027954] PC is at 0x7f76590e8c
[ 1274.031278] LR is at 0x7f76590e7c
[ 1274.034602] pc : [<0000007f76590e8c>] lr : [<0000007f76590e7c>] pstate: 00000000
[ 1274.042001] sp : 0000007f75efe7e0
[ 1274.045326] x29: 0000007f75efe7f0 x28: 0000000000000000 
[ 1274.050685] x27: 0000007f75efe900 x26: 0000000000000000 
[ 1274.056030] x25: 0000007f4005bd40 x24: 0000007f40000cb0 
[ 1274.061364] x23: 0000007f40000cb0 x22: 000000555d5d1040 
[ 1274.066695] x21: 0000007ee8004db0 x20: 0000007f40002900 
[ 1274.072024] x19: 0000007f0c000e10 x18: 000000000000007f 
[ 1274.077355] x17: 0000007f765965c0 x16: 0000007f765d13f0 
[ 1274.082687] x15: 001dcd6500000000 x14: 000f94c758000000 
[ 1274.088016] x13: ffffffffa127f6eb x12: 0000000000000017 
[ 1274.093348] x11: 0000000000000018 x10: 0101010101010101 
[ 1274.098678] x9 : 000000000026dfb6 x8 : 7f7f7f7f7f7f7f7f 
[ 1274.104007] x7 : fefeff7dff646b6e x6 : 000000000000005d 
[ 1274.109339] x5 : 0000000100000000 x4 : 000000000000005d 
[ 1274.114682] x3 : 0000000000000008 x2 : 0000007f765b9468 
[ 1274.120023] x1 : 000000004e614d65 x0 : 0000007ee8001430 
[ 1274.125366] 
[ 1274.126869] Library at 0x7f76590e8c: 0x7f7653a000 /usr/lib/libdurability.so
[ 1274.133837] Library at 0x7f76590e7c: 0x7f7653a000 /usr/lib/libdurability.so
[ 1274.140800] vdso base = 0x7f8d5f3000
[ 1274.144436] audit: type=1701 audit(1591217678.620:2): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=1023 comm="conflictResolve" exe="/opt/verity/bin/vs_process_executor" sig=11
[ 1274.161183] BUG: scheduling while atomic: dcpsHeartbeatLi/1033/0x00000002
[ 1274.161186] BUG: scheduling while atomic: vs_process_miss/651/0x00000002
[ 1274.161187] Modules linked in:
[ 1274.161188] Modules linked in:
[ 1274.161189]  uvcvideo
[ 1274.161190]  uvcvideo
[ 1274.161190]  videobuf2_vmalloc
[ 1274.161191]  videobuf2_vmalloc
[ 1274.161192]  mttcan
[ 1274.161192]  mttcan
[ 1274.161193]  can_dev
[ 1274.161193]  can_dev
[ 1274.161194]  xhci_tegra
[ 1274.161195]  xhci_tegra
[ 1274.161195]  bcmdhd
[ 1274.161196]  bcmdhd
[ 1274.161196]  xhci_hcd
[ 1274.161197]  xhci_hcd
[ 1274.161198]  bluedroid_pm
[ 1274.161198]  bluedroid_pm
[ 1274.161199]  spidev
[ 1274.161199]  spidev
[ 1274.161200]  pci_tegra
[ 1274.161200]  pci_tegra
[ 1274.161200] 
[ 1274.161201] 
[ 1274.161202] Preemption disabled at:
[ 1274.161210] Preemption disabled at:
[ 1274.161211] [<ffffffc0000b4780>] exit_signals+0x98/0x24c
[ 1274.161214] [<ffffffc0000b4780>] exit_signals+0x98/0x24c
[ 1274.161216] BUG: scheduling while atomic: d_nameSpaces/1026/0x00000002
[ 1274.161217] 
[ 1274.161217] 
[ 1274.161221] Modules linked in: uvcvideo
[ 1274.161222] CPU: 3 PID: 1033 Comm: dcpsHeartbeatLi Not tainted 4.4.38-rt49+ #4
[ 1274.161223]  videobuf2_vmalloc
[ 1274.161223] Hardware name: quill (DT)
[ 1274.161225]  mttcan can_dev
[ 1274.161225] Call trace:
[ 1274.161230]  xhci_tegra
[ 1274.161230] [<ffffffc0000898fc>] dump_backtrace+0x0/0x100
[ 1274.161234]  bcmdhd
[ 1274.161234] [<ffffffc000089ac4>] show_stack+0x14/0x1c
[ 1274.161240]  xhci_hcd
[ 1274.161240] [<ffffffc000345c14>] dump_stack+0x94/0xc0
[ 1274.161243]  bluedroid_pm
[ 1274.161243] [<ffffffc000176920>] __schedule_bug+0x8c/0xa0
[ 1274.161248]  spidev
[ 1274.161248] [<ffffffc000b544b0>] __schedule+0x390/0x4fc
[ 1274.161250]  pci_tegra
[ 1274.161251] [<ffffffc000b54664>] schedule+0x48/0xdc
[ 1274.161251] 
[ 1274.161254] [<ffffffc000b55d30>] rt_spin_lock_slowlock+0x1a0/0x2e0
[ 1274.161257] Preemption disabled at:
[ 1274.161257] [<ffffffc000b5732c>] rt_spin_lock+0x58/0x5c
[ 1274.161260] [<ffffffc0000b4780>] exit_signals+0x98/0x24c
[ 1274.161263] [<ffffffc0000eb1cc>] __wake_up+0x20/0x4c
[ 1274.161263] 
[ 1274.161265] [<ffffffc0000ee5d4>] __percpu_up_read+0x48/0x54
[ 1274.161267] [<ffffffc0000b4888>] exit_signals+0x1a0/0x24c
[ 1274.161269] [<ffffffc0000a82b0>] do_exit+0x78/0x9bc
[ 1274.161271] [<ffffffc0000a8c64>] do_group_exit+0x40/0xa8
[ 1274.161273] [<ffffffc0000b4298>] get_signal+0x21c/0x66c
[ 1274.161274] [<ffffffc0000890c0>] do_signal+0x70/0x3a0
[ 1274.161276] [<ffffffc0000895fc>] do_notify_resume+0x60/0x74
[ 1274.161279] [<ffffffc000084eec>] work_pending+0x20/0x24
[ 1274.161281] CPU: 5 PID: 651 Comm: vs_process_miss Not tainted 4.4.38-rt49+ #4
[ 1274.161282] Hardware name: quill (DT)
[ 1274.161283] Call trace:
[ 1274.161286] [<ffffffc0000898fc>] dump_backtrace+0x0/0x100
[ 1274.161288] [<ffffffc000089ac4>] show_stack+0x14/0x1c
[ 1274.161290] [<ffffffc000345c14>] dump_stack+0x94/0xc0
[ 1274.161292] [<ffffffc000176920>] __schedule_bug+0x8c/0xa0
[ 1274.161294] [<ffffffc000b544b0>] __schedule+0x390/0x4fc
[ 1274.161296] [<ffffffc000b54664>] schedule+0x48/0xdc
[ 1274.161298] [<ffffffc000b55d30>] rt_spin_lock_slowlock+0x1a0/0x2e0
[ 1274.161300] [<ffffffc000b5732c>] rt_spin_lock+0x58/0x5c
[ 1274.161301] [<ffffffc0000eb1cc>] __wake_up+0x20/0x4c
[ 1274.161303] [<ffffffc0000ee5d4>] __percpu_up_read+0x48/0x54
[ 1274.161305] [<ffffffc0000b4888>] exit_signals+0x1a0/0x24c
[ 1274.161306] [<ffffffc0000a82b0>] do_exit+0x78/0x9bc
[ 1274.161308] [<ffffffc0000a8c64>] do_group_exit+0x40/0xa8
[ 1274.161309] [<ffffffc0000b4298>] get_signal+0x21c/0x66c
[ 1274.161311] [<ffffffc0000890c0>] do_signal+0x70/0x3a0
[ 1274.161313] [<ffffffc0000895fc>] do_notify_resume+0x60/0x74
[ 1274.161314] [<ffffffc000084eec>] work_pending+0x20/0x24
[ 1274.161317] CPU: 0 PID: 1026 Comm: d_nameSpaces Tainted: G        W       4.4.38-rt49+ #4
[ 1274.161318] Hardware name: quill (DT)
[ 1274.161318] Call trace:
[ 1274.161321] [<ffffffc0000898fc>] dump_backtrace+0x0/0x100
[ 1274.161323] [<ffffffc000089ac4>] show_stack+0x14/0x1c
[ 1274.161325] [<ffffffc000345c14>] dump_stack+0x94/0xc0
[ 1274.161327] [<ffffffc000176920>] __schedule_bug+0x8c/0xa0
[ 1274.161329] [<ffffffc000b544b0>] __schedule+0x390/0x4fc
[ 1274.161331] [<ffffffc000b54664>] schedule+0x48/0xdc
[ 1274.161333] [<ffffffc000b55d30>] rt_spin_lock_slowlock+0x1a0/0x2e0
[ 1274.161335] [<ffffffc000b5732c>] rt_spin_lock+0x58/0x5c
[ 1274.161336] [<ffffffc0000eb1cc>] __wake_up+0x20/0x4c
[ 1274.161338] [<ffffffc0000ee5d4>] __percpu_up_read+0x48/0x54
[ 1274.161339] [<ffffffc0000b4888>] exit_signals+0x1a0/0x24c
[ 1274.161341] [<ffffffc0000a82b0>] do_exit+0x78/0x9bc
[ 1274.161343] [<ffffffc0000a8c64>] do_group_exit+0x40/0xa8
[ 1274.161344] [<ffffffc0000b4298>] get_signal+0x21c/0x66c
[ 1274.161346] [<ffffffc0000890c0>] do_signal+0x70/0x3a0
[ 1274.161348] [<ffffffc0000895fc>] do_notify_resume+0x60/0x74
[ 1274.161349] [<ffffffc000084eec>] work_pending+0x20/0x24
[ 1274.161415] BUG: scheduling while atomic: OSPL Garbage Co/963/0x00000002
[ 1274.161419] Modules linked in: uvcvideo
[ 1274.161420] DEBUG_LOCKS_WARN_ON(val > preempt_count())
[ 1274.161428]  videobuf2_vmalloc mttcan can_dev xhci_tegra bcmdhd xhci_hcd bluedroid_pm spidev pci_tegra
[ 1274.161431] Preemption disabled at:[<ffffffc0000b4780>] exit_signals+0x98/0x24c
[ 1274.161431] 
[ 1274.161433] CPU: 0 PID: 963 Comm: OSPL Garbage Co Tainted: G        W       4.4.38-rt49+ #4
[ 1274.161434] Hardware name: quill (DT)
[ 1274.161434] Call trace:
[ 1274.161437] [<ffffffc0000898fc>] dump_backtrace+0x0/0x100
[ 1274.161439] [<ffffffc000089ac4>] show_stack+0x14/0x1c
[ 1274.161441] [<ffffffc000345c14>] dump_stack+0x94/0xc0
[ 1274.161442] [<ffffffc000176920>] __schedule_bug+0x8c/0xa0
[ 1274.161445] [<ffffffc000b544b0>] __schedule+0x390/0x4fc
[ 1274.161446] [<ffffffc000b54664>] schedule+0x48/0xdc
[ 1274.161449] [<ffffffc000b55d30>] rt_spin_lock_slowlock+0x1a0/0x2e0
[ 1274.161450] [<ffffffc000b5732c>] rt_spin_lock+0x58/0x5c
[ 1274.161452] [<ffffffc0000eb1cc>] __wake_up+0x20/0x4c
[ 1274.161454] [<ffffffc0000ee5d4>] __percpu_up_read+0x48/0x54
[ 1274.161455] [<ffffffc0000b4888>] exit_signals+0x1a0/0x24c
[ 1274.161457] [<ffffffc0000a82b0>] do_exit+0x78/0x9bc
[ 1274.161458] [<ffffffc0000a8c64>] do_group_exit+0x40/0xa8
[ 1274.161460] [<ffffffc0000b4298>] get_signal+0x21c/0x66c
[ 1274.161461] [<ffffffc0000890c0>] do_signal+0x70/0x3a0
[ 1274.161463] [<ffffffc0000895fc>] do_notify_resume+0x60/0x74
[ 1274.161465] [<ffffffc000084eec>] work_pending+0x20/0x24
[ 1274.740528] ------------[ cut here ]------------
[ 1274.740531] WARNING: at ffffffc0000cba34 [verbose debug info unavailable]
[ 1274.740543] Modules linked in: uvcvideo videobuf2_vmalloc mttcan can_dev xhci_tegra bcmdhd xhci_hcd bluedroid_pm spidev pci_tegra
[ 1274.740544] 
[ 1274.740548] CPU: 3 PID: 1033 Comm: dcpsHeartbeatLi Tainted: G        W       4.4.38-rt49+ #4
[ 1274.740550] Hardware name: quill (DT)
[ 1274.740552] task: ffffffc1e5bd2880 ti: ffffffc1e5be0000 task.ti: ffffffc1e5be0000
[ 1274.740560] PC is at preempt_count_sub+0xb0/0xb8
[ 1274.740562] LR is at preempt_count_sub+0xb0/0xb8
[ 1274.740564] pc : [<ffffffc0000cba34>] lr : [<ffffffc0000cba34>] pstate: 80000045
[ 1274.740565] sp : ffffffc1e5be3c10
[ 1274.740568] x29: ffffffc1e5be3c10 x28: 0000000000000009 
[ 1274.740570] x27: ffffffc000b5e000 x26: ffffffc1e5bd2880 
[ 1274.740571] x25: ffffffc07b21d488 x24: ffffffc07b21cc80 
[ 1274.740573] x23: 0000000000000008 x22: ffffffc07b21cd80 
[ 1274.740575] x21: ffffffc000f4e000 x20: ffffffc001465a20 
[ 1274.740577] x19: ffffffc1e5bd2880 x18: 000000000000007f 
[ 1274.740579] x17: ffffffc000b62a68 x16: ffffffc000b62a68 
[ 1274.740580] x15: ffffffc000b62a68 x14: 5720202020202020 
[ 1274.740582] x13: 2047203a6465746e x12: 696154206f432065 
[ 1274.740584] x11: 0000000000000002 x10: 0000000000000000 
[ 1274.740586] x9 : ffffffc1e5be3a00 x8 : 00000000000004a5 
[ 1274.740587] x7 : ffffffc0012a2680 x6 : 0000000000000001 
[ 1274.740589] x5 : ffffffc1e5be39e0 x4 : 0000000000000001 
[ 1274.740591] x3 : ffffffc1e5be0000 x2 : 0000000000000001 
[ 1274.740592] x1 : 0000000000000208 x0 : 000000000000002a 
[ 1274.740593] 
[ 1274.740871] ---[ end trace 0000000000000002 ]---
[ 1274.740872] Call trace:
[ 1274.740877] [<ffffffc0000cba34>] preempt_count_sub+0xb0/0xb8
[ 1274.740881] [<ffffffc0000b47b0>] exit_signals+0xc8/0x24c
[ 1274.740884] [<ffffffc0000a82b0>] do_exit+0x78/0x9bc
[ 1274.740886] [<ffffffc0000a8c64>] do_group_exit+0x40/0xa8
[ 1274.740887] [<ffffffc0000b4298>] get_signal+0x21c/0x66c
[ 1274.740893] [<ffffffc0000890c0>] do_signal+0x70/0x3a0
[ 1274.740895] [<ffffffc0000895fc>] do_notify_resume+0x60/0x74
[ 1274.740898] [<ffffffc000084eec>] work_pending+0x20/0x24

 


Couple of notes:

  • when threads are reported to make no progress, that's often caused by an overloaded system
  • when watermarks are reported to be reached, that's often an indication that data couldn't be delivered
  • when d_nameSpacesRequest issues are reported, there's an issue with durability (probably in combination with the above)

So I have a few questions:

  • are you using TRANSIENT and/or PERSISTENT topics? If so, please note that those imply running durability-services, which are typically part of federations
  • are you using the community-edition? That edition does not support federations (with configured durability-services)
  • if you're using the community-edition: by default each application is configured with a durability-service-thread, but as apps come and go, you can't be sure of transient/persistent data remaining available when apps are gone
  • so when using the community-edition, I'd strongly suggest using TRANSIENT_LOCAL topics for non-volatile data, as that is a simpler mechanism where the data is retained at the writer
  • and finally, if you're using the commercially supported version, you can ask such questions and/or file bug-reports directly via our helpdesk and support-portal

 


Thanks Hans,

We are using the community edition.

We have just one topic that does not use volatile durability.

These are its settings:

  • topicQoS.reliability.kind = RELIABLE_RELIABILITY_QOS;
  • topicQoS.history.kind  = DDS::KEEP_LAST_HISTORY_QOS;
  • topicQoS.history.depth = 5;
  • topicQoS.durability.kind = TRANSIENT_LOCAL_DURABILITY_QOS;
  • topicQoS.durability_service.history_kind  = KEEP_LAST_HISTORY_QOS;
  • topicQoS.durability_service.history_depth = 5;

The data writer for this topic has the following setting enabled (all the other settings for the writer and reader QoS are the defaults loaded from the topic QoS):

  • dataWriterQoS.writer_data_lifecycle.autodispose_unregistered_instances = true;

 

I just managed to retrieve a core dump and when looking into it with GDB I get the following backtrace:

#0 0x0000007f7612fe8c in d_conflictResolverRun () from /lib/libdurability.so
#1 0x0000007f8c7fdfc8 in ut_threadWrapper () from /lib/libddskernel.so
#2 0x0000007f8c7e5e3c in os_startRoutineWrapper () from /lib/libddskernel.so
#3 0x0000007f8c1317e4 in start_thread (arg=0x7f760d757f) at pthread_create.c:486
#4 0x0000007f8bde5b9c in ?? () from /lib/libc.so.6

Anything else that I can tell you to point you in the right direction?

Thanks a lot!

Luca


Hi Hans, 

I can add one more thing.

The ospl.xml configurations of the Wi-Fi nodes and the Ethernet node are different. This was not intentional. Could this be a problem?

I report below the differences. If you could let us know which one of the two should be used that would be helpful.

On the nodes connected via Wi-Fi we have the following entry in ospl.xml (while we do not have it on the node connected via Ethernet):

    <DurabilityService name="durability">
        <Network>
            <Alignment>
                <TimeAlignment>false</TimeAlignment>
                <RequestCombinePeriod>
                    <Initial>2.5</Initial>
                    <Operational>0.1</Operational>
                </RequestCombinePeriod>
            </Alignment>
            <WaitForAttachment maxWaitCount="100">
                <ServiceName>ddsi2</ServiceName>
            </WaitForAttachment>
        </Network>
        <NameSpaces>
            <NameSpace name="defaultNamespace">
                <Partition>*</Partition>
            </NameSpace>
            <Policy alignee="Initial" aligner="true" durability="Durable" nameSpace="defaultNamespace"/>
        </NameSpaces>
    </DurabilityService>

In addition, the configuration on the Wi-Fi nodes has the following entry:

    <Domain>
        <Name>ospl_sp_ddsi</Name>
        <Id>0</Id>
        <SingleProcess>true</SingleProcess>
        <Description>Stand-alone 'single-process' deployment and standard DDSI networking.</Description>
        <Service name="ddsi2">
            <Command>ddsi2</Command>
        </Service>
        <Service name="durability">
            <Command>durability</Command>
        </Service>
        <Service enabled="false" name="cmsoap">
            <Command>cmsoap</Command>
        </Service>
    </Domain>

while the configuration on the Ethernet node has the following (note the differences in terms of durability):

    <Domain>
        <Name>ospl_sp_ddsi</Name>
        <Id>0</Id>
        <SingleProcess>true</SingleProcess>
        <Description>Stand-alone 'single-process' deployment and standard DDSI networking.</Description>
        <Service name="ddsi2">
            <Command>ddsi2</Command>
        </Service>
        <Service enabled="false" name="cmsoap">
            <Command>cmsoap</Command>
        </Service>
        <DurablePolicies>
            <Policy obtain="*.*"/>
        </DurablePolicies>
    </Domain>

 


Dear Luca,

As you have mentioned, you are using TRANSIENT_LOCAL_DURABILITY_QOS, so there is no need to run the durability service in your configuration (ospl.xml). The durability service is only needed when you use TRANSIENT/PERSISTENT QoS.

Now, coming to your query about "Already tried to resend d_nameSpaceRequest message '10' times":

The message 'Already tried to resend ...' occurs when durability is trying to publish a d_nameSpacesRequest but the write operation times out (it uses a max_blocking_time of one second, so the timeout occurs after 1 second). From the message it looks like the nameSpacesRequest is not being delivered after 10 attempts. This is an indicator that the message cannot be delivered; a possible cause is network congestion. We have recently seen similar issues when fixing things for other customers, fixes which went into the commercial version.

Such issues can lead to stalling alignment if an asymmetric disconnect occurs while discovering a fellow.

In cases of network overload, too aggressive DDSI NACK generation leads to resending data, which adds to the overload and makes the situation worse, leading to asymmetric disconnections, which then have to be resolved by realignment again.

To narrow the problem down, we would advise you to test the scenario with a commercially supported version, at no cost.

With best regards,

Vivek Pandey

 



Dear Vivek,

Thanks a lot for your answer.

We will remove the durability service from the ospl.xml configuration. Should we keep the DurablePolicies?

Do you have any idea on what could cause the segmentation fault? Could that be the network congestion effect mentioned in your answer?

Thanks in advance,

Luca


DurablePolicies is not required, because you are using TRANSIENT_LOCAL_DURABILITY_QOS and the ddsi service. If ddsi is used, then durability has NOTHING to do with transient-local data delivery, because ddsi is responsible for the alignment of builtin topics. In fact, you don't need a durability service at all to get transient-local behavior when ddsi is used. DurablePolicies is only required when you don't run a durability service locally but want to request data from a durability service on a remote federation, using the client-durability feature.

I am not sure about the cause of your segmentation fault. 

You can try this scenario with our commercial OpenSplice DDS, in which you get all the features and services enabled. For evaluation it is free of cost.

 

With best regards,

Vivek Pandey

 

 

 


Dear Vivek,

Thanks for your answer, we will try to disable the durability service and policies and let you know how it goes.

Unfortunately, the problem is happening only on a deployed system, and it's not easy for us to use the commercial version there. If the proposed changes do not help, we will try to get it deployed.

One more question. Do you know what could be causing the reader to receive the same message twice after being created (see point below)?

Quote
  • Our data writer is always alive, while the reader is created when needed. Therefore when we create the reader we received the last N messages sent by the writer (where N is the length of the queue). I expect this to be normal. However, in few circumstances I've seen the messages being received twice. Is that due to some misconfiguration?

Thanks a lot,

Luca


Dear Luca,

The problem is because of network disconnection and re-connection (for a moment) between the two nodes. As a result of the disconnection between data writer and data reader, the instances go to the NOT_ALIVE_NO_WRITERS/NOT_ALIVE_DISPOSED state. I suppose you are using the take call: take may remove the instance from the reader administration when the instance becomes empty. When the network connection is restored, either the durability service will realign the data, or the writer (in case of transient-local) may resend its data again. Because the instance was removed as a result of the take, all knowledge of that instance is removed, and realigned data may then be read again (that is expected behavior).
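As an illustration of the difference (a sketch; the FooDataReader type and the state masks are illustrative): take() removes samples, and eventually the instance, from the reader cache, whereas read() leaves them in place:

```cpp
// Sketch: take() vs read() on a hypothetical generated FooDataReader.
FooSeq samples;
DDS::SampleInfoSeq infos;

// take(): removes the samples from the reader administration; once an
// instance becomes empty it may be dropped, so realigned data for that
// instance can later be presented as 'new' again.
reader->take(samples, infos, DDS::LENGTH_UNLIMITED,
             DDS::NOT_READ_SAMPLE_STATE,
             DDS::ANY_VIEW_STATE, DDS::ANY_INSTANCE_STATE);
// ... process samples ...
reader->return_loan(samples, infos);

// read() with the same masks would deliver each sample once (NOT_READ)
// but keep the samples and instance state in the reader cache.
```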

Note that in the commercial release the instance is not directly removed after a take when there are no alive writers of that instance. In that case the instance is maintained for some time before being removed.

 

With best regards,

Vivek Pandey

