
Update - October 12, 2002

Uwe Kubosch has written to me to say that as of the JDK 1.4, the disconnect() method finally closes connections properly. He writes that he has "tested it on Windows NT with the Sun JDK in Websphere Studio Application Developer and on SPARC/Solaris with the Sun JDK where the system is in production." I have not tested this in the JDK 1.4 myself, but if anybody can verify that the problem has been fixed on other platforms as well, please send the details to me so that I can post them on this page.

Update - October 31, 2002

Eric Koperda wrote to me to say that the disconnect() method in fact still does not work in the J2SDK 1.4.1 on Linux. He also points out that Sun has recently introduced some system properties that can be used for a hack-ish workaround. For more details, see the sun.net.client.defaultConnectTimeout and sun.net.client.defaultReadTimeout entries in Sun's J2SE Networking Properties documentation for the J2SE 1.4.
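For anyone who wants to try that route, the idea is simply to set both properties (the values are in milliseconds) before the first connection is opened, either programmatically or on the command line. Something along these lines should do it; the class name and the 30 second values below are only placeholders:

    // Illustrative only - the class name and the 30 second values are arbitrary.
    public class CrawlerMain {
        public static void main(String[] args) throws Exception {
            // Sun-specific, JVM-wide HTTP timeouts (milliseconds).  They must be
            // set before the first connection is opened and they apply to every
            // connection in the JVM, which is why they cannot provide
            // per-connection timeouts.
            System.setProperty("sun.net.client.defaultConnectTimeout", "30000");
            System.setProperty("sun.net.client.defaultReadTimeout", "30000");
            // ... open java.net.URLConnections as usual from here on ...
        }
    }

    // Or, equivalently, on the command line:
    //   java -Dsun.net.client.defaultConnectTimeout=30000 \
    //        -Dsun.net.client.defaultReadTimeout=30000 CrawlerMain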

I would tend to agree with his evaluation of this solution's elegance: "While these system properties aren't as elegant as a new specification for HttpUrlConnection, or a configurable sockets-factory API in Java, I would rank them somewhere above a JNI-hack for anyone who doesn't require multiple connections with varying timeouts. These new system properties are Good Enough to solve my problem (passing SOAP messages across the Internet in a distributed app) for now." Unfortunately, the crawler I am writing does require customizable timeouts for simultaneous connections, and I am targeting Linux, so I'm still stuck with the JNI hack myself.

Sun's URLConnection Cannot Be Reliably Timed Out

The Problem

From the beginning of Java(TM), using Sun's java.net.URLConnection class as the basis for a robust crawler has been impeded by the difficulty of protecting against unruly/unresponsive servers. In the JDK 1.1, Sun made the HttpURLConnection class public and thus introduced the disconnect() method for the purpose of terminating an unresponsive connection. The consensus then was that a URLConnection could now be timed out through the relatively minor inconvenience of having a separate thread call disconnect() after a certain period of inactivity. Unfortunately, this was not enough.

The HttpURLConnection.disconnect() method does not cause blocking calls to methods such as getInputStream() to abort if the server accepts the socket connection but does not send any data. This is a real problem if you wish to build a Java-based crawler to monitor a substantial number of web pages/sites, as this bug leads to an accumulation of hung threads and open sockets which will eventually kill performance if the same Java process runs long enough.

To illustrate the problem, I have created the following classes: TimeoutTest and TimeoutTestServer.
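For reference, here is roughly what the two classes do (this is a simplified sketch rather than the exact sources; the class names match the instructions below):

    // TimeoutTest.java (sketch) - opens the given URL, arranges for a separate
    // thread to call disconnect() after ten seconds, and then blocks in
    // getInputStream()/read().
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class TimeoutTest {
        public static void main(String[] args) throws Exception {
            URL url = new URL(args[0]);
            final HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            Thread watchdog = new Thread(new Runnable() {
                public void run() {
                    try { Thread.sleep(10000); } catch (InterruptedException e) { }
                    System.out.println("Timeout expired - calling disconnect()");
                    conn.disconnect();
                }
            });
            watchdog.setDaemon(true);   // don't keep the JVM alive once main() finishes
            watchdog.start();
            System.out.println("Calling getInputStream()...");
            InputStream in = conn.getInputStream();  // hangs against an unresponsive server
            while (in.read() != -1) { }              // discard the page contents
            System.out.println("Done - the connection completed normally");
        }
    }

    // TimeoutTestServer.java (sketch) - accepts connections and reads the
    // request, but never sends a single byte in response.
    import java.io.InputStream;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class TimeoutTestServer {
        public static void main(String[] args) throws Exception {
            ServerSocket server = new ServerSocket(Integer.parseInt(args[0]));
            System.out.println("Listening on port " + args[0]);
            while (true) {
                Socket socket = server.accept();
                InputStream in = socket.getInputStream();
                while (in.read() != -1) { }  // swallow the request; never reply
            }
        }
    }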

I have tested these examples on the JDK 1.1, JDK 1.2, and JDK 1.3 on Linux. They may also work in later versions of the JDK and on other platforms (then again, they may not).

In order to see the problem illustrated, compile the above classes. First, run TimeoutTest on a URL that won't hang so that you can see that in most cases the URLConnection can be used effectively. One possible test might be:

java TimeoutTest http://www.yahoo.com/

Next, start the TimeoutTestServer which will accept connections and read all input, but never send any output. You can start it on port 8080 using:

java TimeoutTestServer 8080

While the TimeoutTestServer is still running, run the TimeoutTest again, but this time use a URL that will cause it to connect to the TimeoutTestServer. The following will probably work:

java TimeoutTest http://127.0.0.1:8080/
Notice that even though disconnect() is called, the main thread never returns from its call to getInputStream(). In the JDK 1.1 and JDK 1.2, I have determined that this is because the HttpURLConnection internally attempts to reconnect after a failed attempt. Personally, I think that automatically attempting to reconnect after disconnect() has been called is broken functionality and probably a bug, but that doesn't change the fact that it needs to be worked around until the problem is fixed (if it ever is fixed - it has been a problem for a long time). I'm not sure why disconnect() does not properly release the caller of getInputStream() in the JDK 1.3, as the server does not register a reconnection attempt - I will look into this more if I decide to support the JDK 1.3 with the crawler I am building.


 

Potential Solutions

  • Native method solution - This is the solution I have chosen for use within my crawler. Please see below for more information.
  • Write your own version of URLConnection - This is probably the cleanest approach, but it would involve a great deal of work, duplicate existing effort, and not benefit from future enhancements to Sun's URLConnection.
  • Write your own socket factory - This is a bit of work, but it allows you to take advantage of future enhancements to URLConnection. Unfortunately, it is incompatible with things like the JSSE, which install their own socket factories.
  • Other - There are some other suggestions in the Java bug database under bug #4143518.


 

Preferred Solution

The preferred solution that I have devised is to use a Java native method to bypass the security restrictions of the JVM and poke around at the internals of the HttpURLConnection. Here's a short list of the pros and cons of this solution:

Pros:
  • Requires very little code.
  • Allows continued use of URLConnection, which has the advantage of providing widely tested code and automatic access to future enhancements.
  • Compatible with JSSE and other code which needs to set the default socket factory.

Cons:
  • Must be compiled for each platform.
  • Not necessarily portable across different JVMs and versions of the JDK.
  • Directly manipulating the internals of an Object is not elegant.

This is my preferred solution for my crawler because all of the "cons" are either negligible or unavoidable. The target JVMs and platforms for my crawler change very infrequently and the first two "cons" are therefore an acceptable trade-off. The third con of inelegance is applicable to all the other solutions as well (apart from re-writing Sun's URLConnection), so it is not a determining factor.
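In outline, the Java half of the hack is just a class that loads a small native library and declares a native method which reaches into the connection's private internals and closes the underlying socket. The sketch below is only illustrative - the class, method, and library names are placeholders, not necessarily those used in the code offered here:

    // HttpTimeoutForcer.java (illustrative sketch, not the actual fix)
    import java.net.HttpURLConnection;

    public class HttpTimeoutForcer {
        static {
            // The native library does the real work: via JNI it digs the private
            // socket/stream fields out of the HttpURLConnection implementation
            // class and closes the socket at the OS level, which unblocks any
            // thread stuck in getInputStream().
            System.loadLibrary("httptimeout");   // hypothetical library name
        }

        /** Forcibly close the socket buried inside the given connection. */
        public static native void forceClose(HttpURLConnection conn);
    }

A watchdog thread can then call forceClose() on the connection once its timeout expires, instead of relying on disconnect().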

I am making my fix available for use under the terms of the GNU LGPL version 2.1, or (at your option) any later version. I have tested it with Sun's JDK 1.1 and JDK 1.2 on Linux, and it will most likely work on other platforms as well with these same JDK versions from Sun. It does not work with the JDK 1.3 on Linux, probably because the internal structure/operation of the classes that are manipulated has changed. It should not be too hard to update this fix to work with the JDK 1.3 and higher - if you do so, please send your changes back to me so that I can incorporate them into the code that I offer here. Otherwise, it will be upgraded to work with the JDK 1.3 either when I have a need for it or when somebody hires me to do so.

Here's what you need to do to use my fix:

Please send comments, patches, and inquiries to Tim Macinta <twm@alum.mit.edu>. I am building an extremely customizable, flexible, and robust crawler in Java for data mining and would seriously consider licensing offers.


Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries.

All Pages, Images, and Other Content Copyright © 1997 - 2024 Timothy W Macinta, except where noted. All Rights Reserved.