Problem with service discovery (2012-10-15) #2
Jens Kristian Søgaard
Hi,
Continuing from the blog post comments at:
http://irq5.wordpress.com/2011/04/10/publishing-services-over-mdns-in-c/
I have fetched the latest commits from 12th of October 2012, but I still experience the same problem.
Scenario:
I have two servers running Linux and tinysvcmdns. They call mdnsd_set_hostname() to register hostname "Server1" and "Server2" respectively with two different IP addresses. Afterwards they call mdns_register_svc() to register a service with name "Server1 Service" and "Server2 Service" respectively, with the same protocol and port number.
My client computer is running Windows and Apple's Bonjour service.
If I don’t do any service discovery from the client PC for a little while (a few minutes) and then do a discovery, it will only find one of the two servers.
If I do a discovery again a second later, it will find both servers – and it will do that again and again without any problems… until I wait for a few minutes, then the first time I try it will only find 1 server again.
geekman (2012-10-15) repo owner
Hi Jens, I just pushed another fix to address the issue. Could you try it and let me know if it works for you now?
Jens Kristian Søgaard (2012-10-15) reporter
Thanks a lot!
I have tested for a few minutes, but everything seems to be working perfectly now!
I'll do some more testing over the rest of the day!
Sorry about that... but the issue popped up again after some more testing :-(
geekman (2012-10-16) repo owner
is the bug reproducible? i'd like to see if i can reproduce the problem on my end.
Jens Kristian Søgaard (2012-10-16) reporter
Well, it has happened 20-30 times today for me, so reproducible for me.
I haven't tried to set up any alternative client systems, so I don't know if this is only something that happens with Windows clients or Apple Bonjour clients... but both Windows and Apple Bonjour are in use in a lot of places, so the system would not be useful for me if it did not support those.
Let me know what, if any, specifics you need from me?
geekman (2012-10-16) repo owner
Firstly, how do you perform service discovery on Windows? Secondly, when does the bug occur? Does it consistently happen at a certain point, for example on the 5th discovery, or after 10 minutes? This is what I mean by reproducible, along with the conditions under which the bug can be reproduced.
Other than the issue I have fixed, I can't think of anything that would cause tinysvcmdns to not respond. I have also verified that it is responding correctly so far, to keep the service records in Bonjour "alive". I have both Windows and Apple clients running Bonjour as well.
Unless I can reproduce the bug that you are facing, it will not be possible for me to fix it.
Jens Kristian Søgaard (2012-10-16) reporter
Service discovery on Windows is performed using the Apple Bonjour SDK. I call the function DNSServiceBrowse in domain "local." for protocol "_myprotocol._tcp", which is the same protocol as advertised using tinysvcmdns.
The number of discoveries does not matter - I can do 100 discoveries right after each other and they will all succeed.
The bug occurs when I haven't done discoveries for a period of time, and then the first discovery after the pause will only find one of the servers. I have tried to do tests here to determine what the period of time specifically is. My tests indicate that it is somewhere between 90 seconds and 120 seconds.
I looked through the tinysvcmdns source code after I found the time to be approx. two minutes.
I stumbled over this line in mdns.c:
#define DEFAULT_TTL 120
I tried changing that to 1200 seconds instead of 120 sec.
Now I discover both servers every time even when I wait 5 minutes between discoveries.
I don't know if this helps you in any way?
geekman (2012-10-16) repo owner
Nope, that TTL does not affect anything. If a client is interested in a particular service, it will keep track of the TTL and broadcast a query again as the 120s is about to expire. This is when tinysvcmdns will respond to extend the client's TTL by another 120s.
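To make the re-query timing concrete, here is a minimal sketch in C. It assumes the refresh schedule recommended in RFC 6762, section 5.2 (re-query at 80% of the TTL, then at 85%, 90% and 95% if no answer arrives) and ignores the small random jitter the RFC adds. The names refresh_offset() and record_expired() are hypothetical and are not part of tinysvcmdns or Bonjour; what a given client actually does may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Percentages of the TTL at which a querier re-issues its query
 * (schedule taken from RFC 6762, section 5.2; jitter is ignored here). */
static const unsigned REFRESH_PERCENT[] = { 80, 85, 90, 95 };
enum { REFRESH_ATTEMPTS = 4 };

/* Hypothetical helper: seconds after a record was received at which
 * refresh attempt `attempt` (0-based) is due. */
uint32_t refresh_offset(uint32_t ttl, unsigned attempt)
{
    assert(attempt < REFRESH_ATTEMPTS);
    return (uint32_t)((uint64_t)ttl * REFRESH_PERCENT[attempt] / 100);
}

/* Hypothetical helper: returns 1 if a record received at `received_at`
 * with lifetime `ttl` is no longer valid at time `now` (all in seconds). */
int record_expired(uint32_t received_at, uint32_t ttl, uint32_t now)
{
    return now - received_at >= ttl;
}
```

With a TTL of 120 the attempts fall at 96, 102, 108 and 114 seconds, and with 1200 at 960, 1020, 1080 and 1140 seconds. If every refresh goes unanswered, the record is dropped once the TTL runs out, which would match a service disappearing after roughly two minutes.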
Jens Kristian Søgaard (2012-10-16) reporter
It does seem to have an effect here. Could something cause that re-query not to happen, or to fail?
I haven't learned how mDNS works in detail, so I'm a bit in the blind (for now).
geekman (2012-10-16) repo owner
I'm guessing the effect is that you don't see the problem happening so soon. Previously there were bugs that caused tinysvcmdns to not respond, hence sometimes the services will disappear as soon as the TTL expired, but this has since been fixed.
Until you can see a pattern of when the bug is occurring, there's really nothing much I can do.
Jens Kristian Søgaard (2012-10-16) reporter
What kind of pattern description do you need exactly? - I'm willing to do tests, if you can give me some direction.
The pattern I see right now is that I do a discovery and find both servers. Then I wait 120 seconds and do a discovery again - and now I find only one server.
Which one I will detect as the "lone server" is seemingly random, as it changes from test to test.
geekman (2012-10-16) repo owner
Do you mind if you do a network packet capture of the mDNS packets? You can use Wireshark for that. This way I can see what tinysvcmdns is broadcasting, and how your Windows client is doing service discovery. I'd like to see what is happening when only one of the servers responded.
You need only to capture mDNS traffic. The capture filter looks like this:
geekman (2012-10-18) repo owner
The bug, reported and reproduced, was described as follows:
This problem occurs because when one of the servers responds to the query _http._tcp with server1._http._tcp, the other server checks the known-answer list and thinks that its own _http._tcp was already added and therefore does not respond. This causes server2 to not be found.
This was fixed in rev ea6495c by calling rr_entry_match(), which specifically checks the target of the PTR as well, ensuring that server2 does not incorrectly recognize someone else's record as its own.
In testing, both server1 and server2 now respond to the queries that they are supposed to.
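The difference between the two comparisons can be sketched as follows. This is an illustration only, not the actual tinysvcmdns code: struct ptr_record and all three functions are hypothetical simplifications, and the real rr_entry_match() is not reproduced here. It also compares names with strcmp(), whereas real DNS names are case-insensitive.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified PTR record: the service type it is filed
 * under and the service instance it points to. */
struct ptr_record {
    const char *name;    /* e.g. "_http._tcp.local" */
    const char *target;  /* e.g. "server1._http._tcp.local" */
};

/* Buggy comparison: only the owner name is checked, so any PTR for the
 * same service type looks like "my" record. */
int match_name_only(const struct ptr_record *a, const struct ptr_record *b)
{
    return strcmp(a->name, b->name) == 0;
}

/* Fixed comparison: the PTR target must be equal as well. */
int match_name_and_target(const struct ptr_record *a, const struct ptr_record *b)
{
    return strcmp(a->name, b->name) == 0 &&
           strcmp(a->target, b->target) == 0;
}

/* Returns 1 if `mine` still needs to be sent, i.e. none of the answers
 * already seen matches it according to `match`. */
int should_respond(const struct ptr_record *mine,
                   const struct ptr_record *seen, size_t n_seen,
                   int (*match)(const struct ptr_record *,
                                const struct ptr_record *))
{
    for (size_t i = 0; i < n_seen; i++) {
        if (match(mine, &seen[i]))
            return 0;  /* treated as already answered: stay silent */
    }
    return 1;
}
```

When the answers already seen contain server1's PTR for _http._tcp, should_respond() with match_name_only suppresses server2's own PTR, while match_name_and_target lets it through and still suppresses a true duplicate.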
changed status to resolved