OpenSplice DDS Forum



About erik


  1. Hi Chris, I don't usually keep an eye on these forums, so I guess you're lucky I did this time.

     Firstly, are you sure it is blocked during the take? If you have allocated that CPU exclusively to this process, it should be pretty straightforward to determine whether "take" takes 20ms or whether it sleeps 20ms. Both are "less than ideal" of course, but being certain which case it is would definitely help with diagnosing.

     That said, if it is blocked, it should be blocked on some mutex somewhere and I would expect it to be a victim of priority inversion, though I am not certain. There are a number of cases I can think of that might do this (in no particular order, and noting there may be more):
     - update of data received from the network or from a local writer
     - a GC step checking for old instances to be freed
     - a badly timed network disconnection
     - the memory allocator releasing large numbers of objects in a short period of time and hitting contention
     - possibly clearing trigger events higher up in the entity hierarchy that are used for blocking on waitsets and triggering listeners

     None of these would lead me to expect delays in ms unless there are huge numbers of instances (or, in some cases, samples), but if it is indeed priority inversion then the scenarios can get pretty hairy pretty quickly. If it is this, then mitigation on Linux (which I think you're running) could be as simple as enabling priority inheritance on the mutexes; that has an option in the configuration file: Domain/PriorityInheritance, set attribute "enabled" to true.

     If you have a way of making it take 20ms reasonably often, then it should be possible to catch it in flagrante delicto without too much trouble if you have SystemTap or dtrace at hand. I've never actually done that, but once upon a time I did play with dtrace and I am certain it is possible to use it to profile only during a take operation. Then you discard the profile if it took mere microseconds, and something interesting might well show up.

     Finally, while I don't think it is the case, it could be driven by interrupts on Linux. I believe it is possible to assign interrupts to CPUs, and hence to not handle them on this particular CPU, but I could be wrong there. Best regards, Erik
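For reference, a minimal sketch of the priority-inheritance setting described above (element path as given in the post; verify the exact placement against your ospl.xml schema):

```xml
<OpenSplice>
  <Domain>
    <!-- enable priority inheritance on the mutexes to mitigate priority inversion -->
    <PriorityInheritance enabled="true"/>
  </Domain>
</OpenSplice>
```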
  2. Hi Bill, The rule that we always try to follow is to never generate invalid messages, and in this case, that means once you reach 2^31-1, you cannot continue while remaining compliant with the specification. After all, it states that this particular sequence number is of type “Count_t”, described as a “[t]ype used to encapsulate a count that is incremented monotonically, used to identify message duplicates.” And quite obviously, you can’t increment a signed 32-bit (two’s complement) number past 2^31-1.

     So what does one do in a case like this? Clearly the correct answer is not to crash but to let it roll over anyway, but perhaps out of frustration with some blatant errors in the specification that wasn’t the initial implementation. Needless to say, this should have been addressed before releasing, but somehow it slipped through the cracks. If it is any consolation, you are the first ever to report running into this.

     It has been fixed long since; that the release notes don’t show it is an oversight. If you upgrade to the current version you will not encounter it anymore, and moreover you will benefit from the many other improvements made since the 6.3 release, including some fixes that address an issue where the data path can stall when just the right packets get lost while sending fragmented data. (If you're on the community edition, then you can also fix this by deleting the problematic two lines; then please also take out the two analogous cases in q_xevent.c, just search for DDSI_COUNT_MAX.) Best regards, Erik
  3. Hi Jeremy, Yes, it will work. Best regards, Erik
  4. Hi Jeremy, There is no issue using it on Ubuntu 16.04 LTS. Backwards compatibility is excellent on Linux. Best regards, Erik
  5. I suspect the use of -flto (which turns on link-time optimizations) is the cause in your case, too, but I can't easily test it on my machine. I would suggest modifying bin/checkconf, commenting out the "set_var CFLAGS_LTO=-flto" (line 262), and doing a full rebuild.
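That edit can be done with a one-liner; shown here on a demo copy so it runs anywhere, but on a real tree you would apply the same sed to bin/checkconf (after checking the line still reads exactly like this):

```shell
# Demo copy standing in for bin/checkconf
printf 'set_var CFLAGS_LTO=-flto\n' > checkconf.demo
# Comment out the LTO flag (a .bak backup of the file is kept)
sed -i.bak 's/^set_var CFLAGS_LTO=-flto$/# &/' checkconf.demo
cat checkconf.demo
```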
  6. Hi Bud, As is well-known, unicorns do exist. The problem is finding fully grown ones; it is only the baby ones that are quite common. In other words:
     - does it really have to be C++ or is C good enough?
     - what are your performance requirements?
     - how much effort are you willing to put into it?
     - what licensing schemes are acceptable?

     There is a C library named "corto" on github that does all this. The generic type handling in https://github.com/prismtech/opensplice-tools can convert between something resembling C99 designated initializers and the in-memory representation of the IDL-to-C mapping, and so all the tricks needed to do this are in there, even if it does require hooking up the right parsers. Then there is my proof-of-concept Haskell binding (https://github.com/prismtech/haskell-dds), if you're really looking for a proper unicorn.

     The second and third are definitely limited to the C representation; I'm not sure about the first. One way to deal with that is to have a multi-language program that does the conversion between C and C++ representations via DDS. Best regards, Erik
  7. Hi Loay, In a shared-memory deployment OpenSplice uses shared memory to communicate between the OpenSplice applications that are attached to that shared memory, but for everything else it relies on the networking service, i.e., DDSI2. With a bit of trickery you can even have two independent shared memory domains running inside a single machine, and then connect them via DDSI2. So no, OpenSplice's shared memory is not in any way relevant here.

     Both OpenSplice and OpenDDS are multicasting, but there is no indication either of them receives anything; from the ddsi2.log file you sent earlier and from the traffic that it generates, I am certain OpenSplice did not receive anything from OpenDDS. I know I have at times had problems with multicasting in a VM, especially if the VM was in a NAT configuration, and I suspect this may be the case for you as well.

     That means as a next step I think you should try enabling unicast discovery (and probably disable multicasting altogether). In OpenSplice that is pretty straightforward:
     - set General/AllowMulticast to false
     - add a <Discovery><Peers><Peer address="localhost"/></Peers></Discovery>

     Obviously this is not a desirable configuration, but if it turns out that it is a virtual networking problem, then I don't think there are many alternatives short of configuring your VM differently. In any case, it is a sensible step to gain some further understanding of the problem.
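Put together, those two changes would look roughly like this inside the DDSI2Service section (a sketch using the element names from the post; verify against your ospl.xml):

```xml
<DDSI2Service name="ddsi2">
  <General>
    <!-- disable multicast entirely -->
    <AllowMulticast>false</AllowMulticast>
  </General>
  <Discovery>
    <Peers>
      <!-- unicast discovery peer; add one Peer element per host -->
      <Peer address="localhost"/>
    </Peers>
  </Discovery>
</DDSI2Service>
```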
  8. Hi Loay, Can you do a Wireshark capture of all RTPS traffic and post it? Best regards, Erik
  9. Hi Loay, It is obvious that OpenDDS and OpenSplice are not talking, as OpenSplice doesn't receive a single packet from OpenDDS. Each packet contains a "vendor code"; PrismTech's is 1.2, and there are no packets with a vendor code other than 1.2. I think OpenDDS uses 1.3, but it is just as easy to check for anything else. Try, e.g., "grep -E 'vendor 1\.([013-9]|2[0-9])'".

     Absolutely nothing happens unless both sides receive participant discovery data (SPDP) from each other. So if you see a hint of OpenDDS responding to OpenSplice, but nothing actually working, then the first thing to check is where OpenDDS is sending its SPDP data and why that isn't received by OpenSplice. In Wireshark, the SPDP data is shown in the summary as DATA(p), so that's easy to spot.

     About the "proxy" thing: the world in DDSI is divided into two sides, the entities and the proxy entities. The first are local; the proxy entities are where it stores information on the remote entities, such as the locators, last sequence number received, what sequence numbers have been acknowledged so far, &c., &c. The DDSI endpoint discovery (SEDP) distributes the information on the readers and writers, so that every party in the network is aware of who is out there, and what data needs to be sent to whom. The "match_writer_with_proxy_readers" therefore is about matching local writers with remote readers, which determines the destination IP addresses to use and from whom to expect acknowledgements.

     Also note that the "plist" keyword messes up the trace but helps debug issues with the encoding/interpretation of the QoS and various other things that are transmitted as part of the discovery. If you leave it out, then you can easily search for new participants using the regular expression "SPDP.*NEW". As the case is now, it is a bit harder because it is split over multiple lines. Still, "bes .*NEW" works. Best regards, Erik
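The regular expression above matches every vendor code except 1.2; a quick demonstration on made-up trace lines (the vendor strings are illustrative, not from a real log):

```shell
# Three fake trace lines: only the non-PrismTech vendor codes should match
printf 'vendor 1.2\nvendor 1.3\nvendor 1.21\n' |
  grep -E 'vendor 1\.([013-9]|2[0-9])'
# prints:
#   vendor 1.3
#   vendor 1.21
```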
  10. Hi Loay, Perhaps the problem is with the selection of the network interface to use if you have multiple network interfaces. I don't know about OpenDDS, but the DDSI2 service is somewhat picky in that it really wants to use a single interface. It could well be that the two simply don't receive each other's multicasts. You can specify the interface (by name or by IP address) in the "General/NetworkInterfaceAddress" parameter.

      When all else fails ... try enabling DDSI2's tracing by adding: <Tracing> <EnableCategory>trace,plist</EnableCategory> </Tracing> to the DDSI2Service section in the ospl.xml file. The trace consists of a dump of the configuration, then stuff about network interfaces, addresses, port numbers, &c., and finally you get all the traffic and discovery. This may help in finding out which network interface to use, but it usually also gives valuable information for more complicated problems. It would be unreasonable to expect you to understand everything that is in that trace file, so feel free to post fragments of it if you need further help. Best regards, Erik
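Combining the two suggestions, the DDSI2Service section would look roughly like this (a sketch; the interface name eth0 is just an example):

```xml
<DDSI2Service name="ddsi2">
  <General>
    <!-- pin DDSI2 to one interface, by name or by IP address -->
    <NetworkInterfaceAddress>eth0</NetworkInterfaceAddress>
  </General>
  <Tracing>
    <!-- "trace" enables everything; "plist" additionally decodes discovery data -->
    <EnableCategory>trace,plist</EnableCategory>
  </Tracing>
</DDSI2Service>
```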
  11. Hi, There is no "ipv6" boolean attribute in the network interface selection: the correct way is to add <UseIPv6>true</UseIPv6> to the "General" element of the DDSI configuration. Best regards, Erik
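In context, that setting sits in the General element (a sketch of the surrounding configuration):

```xml
<DDSI2Service name="ddsi2">
  <General>
    <!-- select IPv6 instead of IPv4 -->
    <UseIPv6>true</UseIPv6>
  </General>
</DDSI2Service>
```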
  12. Hi Peter, RTI are now sending both a UDPv4 and a type 16777216 locator, and then there is a "transport info" list that correlates with it and gives some additional information, so it clearly must be a vendor-specific extension occupying part of the OMG-reserved namespace, one that evidently should be ignored for things to work ... I'm pretty sure they added this recently, by the way, or we would've run into it ourselves in the most recent interoperability plugfest. Thanks for helping us discover it. For a quick fix, since you are using the open source version, just modify OpenSplice's DDSI implementation to ignore it (see my previous comment, just return 0 instead of ERR_INVALID). I'll make sure a fix goes into the OpenSplice sources. Best regards, Erik
  13. I wonder ... 16777216 could be an RTI-specific locator type that gets rejected (whether or not it should be rejected is debatable; the language of the specification is open to multiple interpretations), or it could be a byte-swapped version of a UDPv4 locator ... A Wireshark capture will likely give a hint: if this sample includes locators with kind 1 as well as locators with kind 16777216, then it almost certainly is an RTI-specific locator, but if there are none with kind 1, it likely is an endianness issue. In the former case, ignoring it is pretty simple (see https://github.com/PrismTech/opensplice/blob/master/src/services/ddsi2/code/q_plist.c#L1018). Would you be able to do a packet capture?
  14. Hi Peter, Firstly, your only chance of interoperability is with “StandardsConformance” set to “lax”. Only in that mode will OpenSplice accept some of the non-conforming messages sent by the other implementations (and even send a few itself that are needed). I suspect the “invalid qos” and “malformed packet […] parse:acknack” messages occurred in some mode other than “lax”; if not, it would be useful to have a Wireshark capture and perhaps a DDSI2 trace.

      Secondly: there are some subtle, at least formally incompatible, changes from the 2.1 to the 2.2 version of the specification, so we believe that our DDSI implementation should restrict itself to version 2.1 until it has been qualified for version 2.2. However, that is not a valid reason for flagging version 2.2 messages as invalid. Chances are that it will work, though, and you can change the check easily enough (see https://github.com/PrismTech/opensplice/blob/master/src/services/ddsi2/code/q_receive.c#L2798).

      Thirdly, the messages in the second log file (as well as all analogous cases) are correct warnings: these messages are indeed not valid DDSI 2.1 messages. Why RTI sends them, I don’t know. Best regards, Erik
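For reference, the lax setting lives under the Compatibility element of the DDSI2 configuration; roughly (a sketch, verify against your ospl.xml schema):

```xml
<DDSI2Service name="ddsi2">
  <Compatibility>
    <!-- accept (and emit) some formally non-conforming messages for interop -->
    <StandardsConformance>lax</StandardsConformance>
  </Compatibility>
</DDSI2Service>
```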
  15. Hi, What "goes wrong" when you raise the limit is that the unicast discovery will start blasting even larger numbers of packets into the network, and it has to do so periodically (the SPDPInterval). For each peer address, it sends a unicast packet to all N port numbers, so before you know it, the burst will be huge. There are some obvious ways of mitigating that, but those break support for asymmetrical discovery.

      Obviously, it is a bit silly that it is hard-coded at 10. It is a historical artefact (as is the call to "exit") that simply never is an issue in federated (shared memory) deployments, because the limit is on the number of DDSI2 instances, not participants, and also not in environments that support multicast. How it came to be that this found its way into the product I am unfortunately not at liberty to tell, but I suspect you would understand if you knew ... Anyway, it never got its priority raised because it never became a real issue ... such is life.

      Please feel free to raise the limit and recompile, that has by far the shortest turn-around time. A periodic burst of packets presumably is better than a non-working system. In that case, please raise the limit in two (...) places: the one you found, and at https://github.com/PrismTech/opensplice/blob/master/src/services/ddsi2/code/q_addrset.c#L58. I will start the process of making it configurable, eliminating the call to exit(), and considering mitigations for the resulting packet bursts, but please be aware that whatever we do internally may take a while to reach github. I have very little influence on that.

      The decisions about what is freely available in the community edition and what is not are what they are, and you're welcome to use the community edition. It just happens that sometimes the commercial edition appears to be a better proposition on technical grounds, and from your description, I think yours is one of those cases. Since I don't know why you are using the community edition, for all I know, you may be in a position to consider switching to the commercial package. If you are, you might want to look at the traffic overhead caused by having 10 hosts with 20 autonomous processes each, compared to having 10 hosts each containing a shared-memory deployment with 20 attached processes.

      The specification is freely downloadable, but I can give you the short summary: the mandated discovery protocol is quadratic in terms of the number of "participants" (scare quotes because in OpenSplice it is the number of DDSI2 instances, not application participants), and if there are multiple participants on a single node all subscribed to the same data, many copies will have to be sent. Both disappear in shared-memory deployments. (If you really want to scale up, you enter the territory of our Cloud package, with proper scalable discovery.)

      If you can't go commercial and bump into issues of scale, the best I can advise is to look at the "poor man's" shared memory mode: multiple threads in a single application. A bit of trickery with the run-time linker can go a long way ... Best regards, Erik