OpenSplice DDS Forum

luca.gherardi

Members
  • Content Count

    29
  • Joined

  • Last visited

About luca.gherardi

  • Rank
    Member

Profile Information

  • Company
    Verity Studios

Recent Profile Visitors

573 profile views
  1. Thanks a lot for the explanation Vivek, that makes sense to me. Best, Luca
  2. Dear Vivek, Thanks for your answer; we will try to disable the durability service and policies and let you know how it goes. Unfortunately, the problem is happening only on a deployed system and it's not easy for us to use the commercial version there. If the proposed changes do not help, we will try to get the commercial version deployed. One more question: do you know what could be causing the reader to receive the same message twice after being created (see point below)? Thanks a lot, Luca
  3. Dear Vivek, Thanks a lot for your answer. We will remove the durability service from the ospl.xml configuration. Should we keep the DurablePolicies? Do you have any idea what could cause the segmentation fault? Could that be the network congestion effect mentioned in your answer? Thanks in advance, Luca
  4. Hi Hans, I can add one more thing. The ospl.xml configurations of the Wi-Fi nodes and of the Ethernet node are different. This was not intentional. Could this be a problem? I report the differences below. If you could let us know which of the two should be used, that would be helpful.

     On the nodes connected via Wi-Fi we have the following entry in ospl.xml (while we do not have it on the node connected via Ethernet):

       <DurabilityService name="durability">
         <Network>
           <Alignment>
             <TimeAlignment>false</TimeAlignment>
             <RequestCombinePeriod>
               <Initial>2.5</Initial>
               <Operational>0.1</Operational>
             </RequestCombinePeriod>
           </Alignment>
           <WaitForAttachment maxWaitCount="100">
             <ServiceName>ddsi2</ServiceName>
           </WaitForAttachment>
         </Network>
         <NameSpaces>
           <NameSpace name="defaultNamespace">
             <Partition>*</Partition>
           </NameSpace>
           <Policy alignee="Initial" aligner="true" durability="Durable" nameSpace="defaultNamespace"/>
         </NameSpaces>
       </DurabilityService>

     In addition, the configuration on the Wi-Fi nodes has the following entry:

       <Domain>
         <Name>ospl_sp_ddsi</Name>
         <Id>0</Id>
         <SingleProcess>true</SingleProcess>
         <Description>Stand-alone 'single-process' deployment and standard DDSI networking.</Description>
         <Service name="ddsi2">
           <Command>ddsi2</Command>
         </Service>
         <Service name="durability">
           <Command>durability</Command>
         </Service>
         <Service enabled="false" name="cmsoap">
           <Command>cmsoap</Command>
         </Service>
       </Domain>

     while the one on the Ethernet node has the following (note the differences in terms of durability):

       <Domain>
         <Name>ospl_sp_ddsi</Name>
         <Id>0</Id>
         <SingleProcess>true</SingleProcess>
         <Description>Stand-alone 'single-process' deployment and standard DDSI networking.</Description>
         <Service name="ddsi2">
           <Command>ddsi2</Command>
         </Service>
         <Service enabled="false" name="cmsoap">
           <Command>cmsoap</Command>
         </Service>
         <DurablePolicies>
           <Policy obtain="*.*"/>
         </DurablePolicies>
       </Domain>
  5. Maybe one more detail: as suggested, we changed the AllowMulticast option to "spdp".
  6. Thanks Hans, We are using the community edition. We have just one topic that does not use volatile durability. These are its settings:

       topicQoS.reliability.kind = RELIABLE_RELIABILITY_QOS;
       topicQoS.history.kind = DDS::KEEP_LAST_HISTORY_QOS;
       topicQoS.history.depth = 5;
       topicQoS.durability.kind = TRANSIENT_LOCAL_DURABILITY_QOS;
       topicQoS.durability_service.history_kind = KEEP_LAST_HISTORY_QOS;
       topicQoS.durability_service.history_depth = 5;

     The data writer for this topic has the following setting enabled (all the other settings of the writer and reader QoS are the defaults loaded from the topic QoS):

       dataWriterQoS.writer_data_lifecycle.autodispose_unregistered_instances = true;

     I just managed to retrieve a core dump, and when looking into it with GDB I get the following backtrace:

       #0 0x0000007f7612fe8c in d_conflictResolverRun () from /lib/libdurability.so
       #1 0x0000007f8c7fdfc8 in ut_threadWrapper () from /lib/libddskernel.so
       #2 0x0000007f8c7e5e3c in os_startRoutineWrapper () from /lib/libddskernel.so
       #3 0x0000007f8c1317e4 in start_thread (arg=0x7f760d757f) at pthread_create.c:486
       #4 0x0000007f8bde5b9c in ?? () from /lib/libc.so.6

     Anything else that I can tell you to point you in the right direction? Thanks a lot! Luca
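     For context, a minimal sketch of how the QoS values quoted above would typically be applied before creating the topic with the classic DCPS C++ API. The participant variable, topic name and type name are placeholders for illustration, not taken from the post:

       // Placeholders: "participant" is an existing DDS::DomainParticipant_var,
       // "typeName" the registered type name, "MyTopic" an example topic name.
       DDS::TopicQos topicQoS;
       participant->get_default_topic_qos(topicQoS);
       topicQoS.reliability.kind = DDS::RELIABLE_RELIABILITY_QOS;
       topicQoS.history.kind = DDS::KEEP_LAST_HISTORY_QOS;
       topicQoS.history.depth = 5;
       topicQoS.durability.kind = DDS::TRANSIENT_LOCAL_DURABILITY_QOS;
       // Keep the durability-service history consistent with the topic history.
       topicQoS.durability_service.history_kind = DDS::KEEP_LAST_HISTORY_QOS;
       topicQoS.durability_service.history_depth = 5;
       DDS::Topic_var topic = participant->create_topic(
           "MyTopic", typeName, topicQoS, NULL, DDS::STATUS_MASK_NONE);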
  7. Hi Hans, We've deployed the solution you proposed and we are experiencing a couple of problems:

     Our data writer is always alive, while the reader is created when needed. Therefore, when we create the reader we receive the last N messages sent by the writer (where N is the length of the queue). I expect this to be normal. However, in a few circumstances I've seen the messages being received twice. Is that due to some misconfiguration?

     On some of the nodes connected via Wi-Fi we had a segmentation fault of the application. Unfortunately we couldn't look into the core dumps, but looking at dmesg (see below) we can see that the library /usr/lib/libdurability.so seems to be involved. In those cases I can also see in the ospl-info log the kinds of warnings reported below (for the thread warning, the log is pretty much spammed with them). Any idea what could cause this or where to look for possible issues?

       thread tev failed to make progress
       thread xmit.user failed to make progress
       writer 409049990:125:1:2050 topic d_nameSpacesRequest waiting on high watermark due to reader 1484033314:125:1:3847
       Already tried to resend d_nameSpaceRequest message '10' times

     Thanks a lot! Luca

     [ 1273.988742] conflictResolve[1023]: unhandled level 1 translation fault (11) at 0x00000008, esr 0x92000005 [ 1273.998329] pgd = ffffffc07b086000 [ 1274.001737] [00000008] *pgd=0000000000000000, *pud=0000000000000000 [ 1274.008034] [ 1274.009532] CPU: 3 PID: 1023 Comm: conflictResolve Not tainted 4.4.38-rt49+ #4 [ 1274.016759] Hardware name: quill (DT) [ 1274.020464] task: ffffffc07a435100 ti: ffffffc1e5904000 task.ti: ffffffc1e5904000 [ 1274.027954] PC is at 0x7f76590e8c [ 1274.031278] LR is at 0x7f76590e7c [ 1274.034602] pc : [<0000007f76590e8c>] lr : [<0000007f76590e7c>] pstate: 00000000 [ 1274.042001] sp : 0000007f75efe7e0 [ 1274.045326] x29: 0000007f75efe7f0 x28: 0000000000000000 [ 1274.050685] x27: 0000007f75efe900 x26: 0000000000000000 [ 1274.056030] x25: 0000007f4005bd40 x24: 0000007f40000cb0 [ 1274.061364] x23: 0000007f40000cb0 x22: 000000555d5d1040 [ 1274.066695] x21: 0000007ee8004db0 x20: 0000007f40002900 [ 1274.072024] x19: 0000007f0c000e10 x18: 000000000000007f [ 1274.077355] x17: 0000007f765965c0 x16: 0000007f765d13f0 [ 1274.082687] x15: 001dcd6500000000 x14: 000f94c758000000 [ 1274.088016] x13: ffffffffa127f6eb x12: 0000000000000017 [ 1274.093348] x11: 0000000000000018 x10: 0101010101010101 [ 1274.098678] x9 : 000000000026dfb6 x8 : 7f7f7f7f7f7f7f7f [ 1274.104007] x7 : fefeff7dff646b6e x6 : 000000000000005d [ 1274.109339] x5 : 0000000100000000 x4 : 000000000000005d [ 1274.114682] x3 : 0000000000000008 x2 : 0000007f765b9468 [ 1274.120023] x1 : 000000004e614d65 x0 : 0000007ee8001430 [ 1274.125366] [ 1274.126869] Library at 0x7f76590e8c: 0x7f7653a000 /usr/lib/libdurability.so [ 1274.133837] Library at 0x7f76590e7c: 0x7f7653a000 /usr/lib/libdurability.so [ 1274.140800] vdso base = 0x7f8d5f3000 [ 1274.144436] audit: type=1701 audit(1591217678.620:2): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=1023 comm="conflictResolve" exe="/opt/verity/bin/vs_process_executor" sig=11 [ 1274.161183] BUG: scheduling while atomic: dcpsHeartbeatLi/1033/0x00000002 [ 1274.161186] BUG: scheduling while atomic: vs_process_miss/651/0x00000002 [ 1274.161187] Modules linked in: [ 1274.161188] Modules linked in: [ 1274.161189] uvcvideo [ 1274.161190] uvcvideo [ 1274.161190] videobuf2_vmalloc [ 1274.161191] videobuf2_vmalloc [ 1274.161192] mttcan [ 1274.161192] mttcan [ 1274.161193] can_dev [ 1274.161193] can_dev [ 1274.161194] xhci_tegra [ 
1274.161195] xhci_tegra [ 1274.161195] bcmdhd [ 1274.161196] bcmdhd [ 1274.161196] xhci_hcd [ 1274.161197] xhci_hcd [ 1274.161198] bluedroid_pm [ 1274.161198] bluedroid_pm [ 1274.161199] spidev [ 1274.161199] spidev [ 1274.161200] pci_tegra [ 1274.161200] pci_tegra [ 1274.161200] [ 1274.161201] [ 1274.161202] Preemption disabled at: [ 1274.161210] Preemption disabled at: [ 1274.161211] [<ffffffc0000b4780>] exit_signals+0x98/0x24c [ 1274.161214] [<ffffffc0000b4780>] exit_signals+0x98/0x24c [ 1274.161216] BUG: scheduling while atomic: d_nameSpaces/1026/0x00000002 [ 1274.161217] [ 1274.161217] [ 1274.161221] Modules linked in: uvcvideo [ 1274.161222] CPU: 3 PID: 1033 Comm: dcpsHeartbeatLi Not tainted 4.4.38-rt49+ #4 [ 1274.161223] videobuf2_vmalloc [ 1274.161223] Hardware name: quill (DT) [ 1274.161225] mttcan can_dev [ 1274.161225] Call trace: [ 1274.161230] xhci_tegra [ 1274.161230] [<ffffffc0000898fc>] dump_backtrace+0x0/0x100 [ 1274.161234] bcmdhd [ 1274.161234] [<ffffffc000089ac4>] show_stack+0x14/0x1c [ 1274.161240] xhci_hcd [ 1274.161240] [<ffffffc000345c14>] dump_stack+0x94/0xc0 [ 1274.161243] bluedroid_pm [ 1274.161243] [<ffffffc000176920>] __schedule_bug+0x8c/0xa0 [ 1274.161248] spidev [ 1274.161248] [<ffffffc000b544b0>] __schedule+0x390/0x4fc [ 1274.161250] pci_tegra [ 1274.161251] [<ffffffc000b54664>] schedule+0x48/0xdc [ 1274.161251] [ 1274.161254] [<ffffffc000b55d30>] rt_spin_lock_slowlock+0x1a0/0x2e0 [ 1274.161257] Preemption disabled at: [ 1274.161257] [<ffffffc000b5732c>] rt_spin_lock+0x58/0x5c [ 1274.161260] [<ffffffc0000b4780>] exit_signals+0x98/0x24c [ 1274.161263] [<ffffffc0000eb1cc>] __wake_up+0x20/0x4c [ 1274.161263] [ 1274.161265] [<ffffffc0000ee5d4>] __percpu_up_read+0x48/0x54 [ 1274.161267] [<ffffffc0000b4888>] exit_signals+0x1a0/0x24c [ 1274.161269] [<ffffffc0000a82b0>] do_exit+0x78/0x9bc [ 1274.161271] [<ffffffc0000a8c64>] do_group_exit+0x40/0xa8 [ 1274.161273] [<ffffffc0000b4298>] get_signal+0x21c/0x66c [ 1274.161274] [<ffffffc0000890c0>] do_signal+0x70/0x3a0 [ 1274.161276] [<ffffffc0000895fc>] do_notify_resume+0x60/0x74 [ 1274.161279] [<ffffffc000084eec>] work_pending+0x20/0x24 [ 1274.161281] CPU: 5 PID: 651 Comm: vs_process_miss Not tainted 4.4.38-rt49+ #4 [ 1274.161282] Hardware name: quill (DT) [ 1274.161283] Call trace: [ 1274.161286] [<ffffffc0000898fc>] dump_backtrace+0x0/0x100 [ 1274.161288] [<ffffffc000089ac4>] show_stack+0x14/0x1c [ 1274.161290] [<ffffffc000345c14>] dump_stack+0x94/0xc0 [ 1274.161292] [<ffffffc000176920>] __schedule_bug+0x8c/0xa0 [ 1274.161294] [<ffffffc000b544b0>] __schedule+0x390/0x4fc [ 1274.161296] [<ffffffc000b54664>] schedule+0x48/0xdc [ 1274.161298] [<ffffffc000b55d30>] rt_spin_lock_slowlock+0x1a0/0x2e0 [ 1274.161300] [<ffffffc000b5732c>] rt_spin_lock+0x58/0x5c [ 1274.161301] [<ffffffc0000eb1cc>] __wake_up+0x20/0x4c [ 1274.161303] [<ffffffc0000ee5d4>] __percpu_up_read+0x48/0x54 [ 1274.161305] [<ffffffc0000b4888>] exit_signals+0x1a0/0x24c [ 1274.161306] [<ffffffc0000a82b0>] do_exit+0x78/0x9bc [ 1274.161308] [<ffffffc0000a8c64>] do_group_exit+0x40/0xa8 [ 1274.161309] [<ffffffc0000b4298>] get_signal+0x21c/0x66c [ 1274.161311] [<ffffffc0000890c0>] do_signal+0x70/0x3a0 [ 1274.161313] [<ffffffc0000895fc>] do_notify_resume+0x60/0x74 [ 1274.161314] [<ffffffc000084eec>] work_pending+0x20/0x24 [ 1274.161317] CPU: 0 PID: 1026 Comm: d_nameSpaces Tainted: G W 4.4.38-rt49+ #4 [ 1274.161318] Hardware name: quill (DT) [ 1274.161318] Call trace: [ 1274.161321] [<ffffffc0000898fc>] dump_backtrace+0x0/0x100 [ 1274.161323] [<ffffffc000089ac4>] 
show_stack+0x14/0x1c [ 1274.161325] [<ffffffc000345c14>] dump_stack+0x94/0xc0 [ 1274.161327] [<ffffffc000176920>] __schedule_bug+0x8c/0xa0 [ 1274.161329] [<ffffffc000b544b0>] __schedule+0x390/0x4fc [ 1274.161331] [<ffffffc000b54664>] schedule+0x48/0xdc [ 1274.161333] [<ffffffc000b55d30>] rt_spin_lock_slowlock+0x1a0/0x2e0 [ 1274.161335] [<ffffffc000b5732c>] rt_spin_lock+0x58/0x5c [ 1274.161336] [<ffffffc0000eb1cc>] __wake_up+0x20/0x4c [ 1274.161338] [<ffffffc0000ee5d4>] __percpu_up_read+0x48/0x54 [ 1274.161339] [<ffffffc0000b4888>] exit_signals+0x1a0/0x24c [ 1274.161341] [<ffffffc0000a82b0>] do_exit+0x78/0x9bc [ 1274.161343] [<ffffffc0000a8c64>] do_group_exit+0x40/0xa8 [ 1274.161344] [<ffffffc0000b4298>] get_signal+0x21c/0x66c [ 1274.161346] [<ffffffc0000890c0>] do_signal+0x70/0x3a0 [ 1274.161348] [<ffffffc0000895fc>] do_notify_resume+0x60/0x74 [ 1274.161349] [<ffffffc000084eec>] work_pending+0x20/0x24 [ 1274.161415] BUG: scheduling while atomic: OSPL Garbage Co/963/0x00000002 [ 1274.161419] Modules linked in: uvcvideo [ 1274.161420] DEBUG_LOCKS_WARN_ON(val > preempt_count()) [ 1274.161428] videobuf2_vmalloc mttcan can_dev xhci_tegra bcmdhd xhci_hcd bluedroid_pm spidev pci_tegra [ 1274.161431] Preemption disabled at:[<ffffffc0000b4780>] exit_signals+0x98/0x24c [ 1274.161431] [ 1274.161433] CPU: 0 PID: 963 Comm: OSPL Garbage Co Tainted: G W 4.4.38-rt49+ #4 [ 1274.161434] Hardware name: quill (DT) [ 1274.161434] Call trace: [ 1274.161437] [<ffffffc0000898fc>] dump_backtrace+0x0/0x100 [ 1274.161439] [<ffffffc000089ac4>] show_stack+0x14/0x1c [ 1274.161441] [<ffffffc000345c14>] dump_stack+0x94/0xc0 [ 1274.161442] [<ffffffc000176920>] __schedule_bug+0x8c/0xa0 [ 1274.161445] [<ffffffc000b544b0>] __schedule+0x390/0x4fc [ 1274.161446] [<ffffffc000b54664>] schedule+0x48/0xdc [ 1274.161449] [<ffffffc000b55d30>] rt_spin_lock_slowlock+0x1a0/0x2e0 [ 1274.161450] [<ffffffc000b5732c>] rt_spin_lock+0x58/0x5c [ 1274.161452] [<ffffffc0000eb1cc>] __wake_up+0x20/0x4c [ 1274.161454] [<ffffffc0000ee5d4>] __percpu_up_read+0x48/0x54 [ 1274.161455] [<ffffffc0000b4888>] exit_signals+0x1a0/0x24c [ 1274.161457] [<ffffffc0000a82b0>] do_exit+0x78/0x9bc [ 1274.161458] [<ffffffc0000a8c64>] do_group_exit+0x40/0xa8 [ 1274.161460] [<ffffffc0000b4298>] get_signal+0x21c/0x66c [ 1274.161461] [<ffffffc0000890c0>] do_signal+0x70/0x3a0 [ 1274.161463] [<ffffffc0000895fc>] do_notify_resume+0x60/0x74 [ 1274.161465] [<ffffffc000084eec>] work_pending+0x20/0x24 [ 1274.740528] ------------[ cut here ]------------ [ 1274.740531] WARNING: at ffffffc0000cba34 [verbose debug info unavailable] [ 1274.740543] Modules linked in: uvcvideo videobuf2_vmalloc mttcan can_dev xhci_tegra bcmdhd xhci_hcd bluedroid_pm spidev pci_tegra [ 1274.740544] [ 1274.740548] CPU: 3 PID: 1033 Comm: dcpsHeartbeatLi Tainted: G W 4.4.38-rt49+ #4 [ 1274.740550] Hardware name: quill (DT) [ 1274.740552] task: ffffffc1e5bd2880 ti: ffffffc1e5be0000 task.ti: ffffffc1e5be0000 [ 1274.740560] PC is at preempt_count_sub+0xb0/0xb8 [ 1274.740562] LR is at preempt_count_sub+0xb0/0xb8 [ 1274.740564] pc : [<ffffffc0000cba34>] lr : [<ffffffc0000cba34>] pstate: 80000045 [ 1274.740565] sp : ffffffc1e5be3c10 [ 1274.740568] x29: ffffffc1e5be3c10 x28: 0000000000000009 [ 1274.740570] x27: ffffffc000b5e000 x26: ffffffc1e5bd2880 [ 1274.740571] x25: ffffffc07b21d488 x24: ffffffc07b21cc80 [ 1274.740573] x23: 0000000000000008 x22: ffffffc07b21cd80 [ 1274.740575] x21: ffffffc000f4e000 x20: ffffffc001465a20 [ 1274.740577] x19: ffffffc1e5bd2880 x18: 000000000000007f [ 1274.740579] x17: 
ffffffc000b62a68 x16: ffffffc000b62a68 [ 1274.740580] x15: ffffffc000b62a68 x14: 5720202020202020 [ 1274.740582] x13: 2047203a6465746e x12: 696154206f432065 [ 1274.740584] x11: 0000000000000002 x10: 0000000000000000 [ 1274.740586] x9 : ffffffc1e5be3a00 x8 : 00000000000004a5 [ 1274.740587] x7 : ffffffc0012a2680 x6 : 0000000000000001 [ 1274.740589] x5 : ffffffc1e5be39e0 x4 : 0000000000000001 [ 1274.740591] x3 : ffffffc1e5be0000 x2 : 0000000000000001 [ 1274.740592] x1 : 0000000000000208 x0 : 000000000000002a [ 1274.740593] [ 1274.740871] ---[ end trace 0000000000000002 ]--- [ 1274.740872] Call trace: [ 1274.740877] [<ffffffc0000cba34>] preempt_count_sub+0xb0/0xb8 [ 1274.740881] [<ffffffc0000b47b0>] exit_signals+0xc8/0x24c [ 1274.740884] [<ffffffc0000a82b0>] do_exit+0x78/0x9bc [ 1274.740886] [<ffffffc0000a8c64>] do_group_exit+0x40/0xa8 [ 1274.740887] [<ffffffc0000b4298>] get_signal+0x21c/0x66c [ 1274.740893] [<ffffffc0000890c0>] do_signal+0x70/0x3a0 [ 1274.740895] [<ffffffc0000895fc>] do_notify_resume+0x60/0x74 [ 1274.740898] [<ffffffc000084eec>] work_pending+0x20/0x24
  8. Thanks Hans, There was actually an error on my side. The data writer object was destroyed, but I did not delete the corresponding entity on the domain participant side, so I assume it was still alive. Thanks also for the clarification on durability and history.
  9. Thanks a lot Hans, From an initial test on a sample application I noticed that if I use those settings and destroy the data writer before creating the data reader, the message is still received by the data reader. Is that due to the fact that I'm creating the writer and the reader in the same process? I don't see the same behavior when running the reader and the writer in different processes. Regarding history, I guess the topic history settings should be consistent with the durability history settings? I'll set the limits as suggested. If it is set to all, it seems to stop receiving data pretty soon when sending 1 MB messages. Are there memory settings (e.g. in ospl.xml) that I could change in case I need more memory? Thanks a lot for the great support! Luca
  10. Hi Hans, Do I understand correctly that for TRANSIENT_LOCAL I have to apply the following settings?

      Topic:
        topicQoS.durability.kind = DDS::TRANSIENT_LOCAL_DURABILITY_QOS
        topicQoS.durability_service.service_cleanup_delay = 0
      Data reader: inherit from topic
      Data writer: inherit from topic, plus
        writerQoS.writer_data_lifecycle.autodispose_unregistered_instances = true

      My understanding is that with the default values, how many samples are stored will depend on the topic history QoS, which in my case is inherited by writers and readers:

        KEEP_LAST_HISTORY_QOS: based on depth
        KEEP_ALL_HISTORY_QOS: unlimited

      Is that correct? Thanks again, Luca
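      As a rough sketch of the "inherit from topic" part with the classic DCPS C++ API (the publisher, subscriber, topic and topicQoS variables are placeholders, not taken from this thread), the writer and reader QoS can be derived from the topic QoS with copy_from_topic_qos, with the autodispose override applied only on the writer:

        // Writer QoS: start from the defaults, merge in the topic QoS, then override autodispose.
        DDS::DataWriterQos writerQoS;
        publisher->get_default_datawriter_qos(writerQoS);
        publisher->copy_from_topic_qos(writerQoS, topicQoS);
        writerQoS.writer_data_lifecycle.autodispose_unregistered_instances = true;
        DDS::DataWriter_var writer = publisher->create_datawriter(
            topic, writerQoS, NULL, DDS::STATUS_MASK_NONE);

        // Reader QoS: same pattern, without any override.
        DDS::DataReaderQos readerQoS;
        subscriber->get_default_datareader_qos(readerQoS);
        subscriber->copy_from_topic_qos(readerQoS, topicQoS);
        DDS::DataReader_var reader = subscriber->create_datareader(
            topic, readerQoS, NULL, DDS::STATUS_MASK_NONE);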
  11. Hi Hans, Thanks a lot for your feedback! I'll test the suggestions you proposed and get back to you in case they don't help (it might take a bit). Out of curiosity, why could disabling multicast help? Thanks again, Luca
  12. Thanks a lot Hans, I will look into the durability settings. I have a couple of follow-up questions:

      Is there a limit on how many messages a late-joining node will receive? Let's say I have a topic with RELIABLE reliability and KEEP_ALL history. Would a late-joining node receive all the messages published before? Those could be a lot.

      The reference manual says that for TRANSIENT durability messages are stored in the data distribution service and not in the writer. What does this mean when using the single-process (or standalone) configuration? In that case, is the data stored in the distribution service running on the sender side?

      Looking into the mailing list I found this post (https://developer.opensplice.narkive.com/lO0KMMdt/ospl-dev-problems-with-reliable-communication-via-wan). Some of the ospl.xml settings suggested there sound relevant. Would you suggest starting with just the durability settings and only turning to the others if the problem cannot be addressed with durability alone? At the moment our priority is to receive the messages sent while the publisher (or subscriber) was not connected to the network. Thanks again! Luca
  13. We have a system where multiple nodes are connected over wireless. The wireless coverage is not perfect, and once in a while a node can drop its connection or switch from one access point to another. I've noticed that when node B drops its connection, the packets sent by node B are not received by node A once node B reconnects to the network. In those cases the reader receives an invalid sample (i.e. the valid_data flag is set to false, with sample state 2, view state 2, instance state 2). How can we configure OpenSplice to keep messages buffered until node B reconnects to the network, so that they can be delivered to node A? For those messages we use a topic with the following settings: DDS::RELIABLE_RELIABILITY_QOS and DDS::KEEP_ALL_HISTORY_QOS. Thanks in advance for your answer, and let me know if you need more information.
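      As a side note on the invalid sample mentioned above, a small illustrative sketch (MsgSeq and reader stand in for whatever generated type and typed reader are actually used) of how valid_data is normally checked when taking samples, so that instance-state updates are not mistaken for real data:

        MsgSeq samples;
        DDS::SampleInfoSeq infos;
        reader->take(samples, infos, DDS::LENGTH_UNLIMITED,
                     DDS::ANY_SAMPLE_STATE, DDS::ANY_VIEW_STATE, DDS::ANY_INSTANCE_STATE);
        for (DDS::ULong i = 0; i < samples.length(); ++i) {
            if (infos[i].valid_data) {
                // A real sample written by the remote writer.
            } else {
                // An instance-lifecycle update (e.g. the instance was disposed or its
                // writer was lost); the data fields of samples[i] must not be read.
            }
        }
        reader->return_loan(samples, infos);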
  14. Thanks a lot Erik!
  15. Hi Everyone, I'm having the same problem, and I realized I'm also linking both libraries. Our ospl.xml is configured to run in single-process mode, and we use the header files from include/dcps/C++/SACPP. I assume that therefore the right library to use is libdcpssacpp.so. Could someone please confirm this? Is the library libdcpsisocpp.so meant to be linked when using header files from include/dcps/C++/isocpp(2)? Thanks in advance, Luca
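      For what it's worth, a minimal sketch of that assumption (the header/library pairing below reflects the question above rather than an official confirmation):

        // Classic standalone C++ (SACPP) API: headers come from include/dcps/C++/SACPP,
        // and the matching library would be libdcpssacpp.so; libdcpsisocpp.so would pair
        // with the include/dcps/C++/isocpp(2) headers instead (assumption, see question above).
        #include "ccpp_dds_dcps.h"

        int main()
        {
            DDS::DomainParticipantFactory_var factory =
                DDS::DomainParticipantFactory::get_instance();
            DDS::DomainParticipantQos participantQos;
            factory->get_default_participant_qos(participantQos);
            DDS::DomainParticipant_var participant = factory->create_participant(
                0, participantQos, NULL, DDS::STATUS_MASK_NONE);
            // ... create topics, publishers and subscribers here ...
            factory->delete_participant(participant);
            return 0;
        }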