Once we were intermittently getting “java.sql.SQLException: Io exception: Operation timed out” error on production environment.
After few trial and error, we came to the conclusion that it was due to the proxy between application server and new database server. The proxy was terminating the idle connection after some time.
Hibernate connection pool configuration was changed to test the connection validity every few minutes using “idle_test_period” and “prefferedTestQuery”
Connection pool parameter were changed from :
<property name="hibernate.c3p0.max_size">30</property> <property name="hibernate.c3p0.min_size">10</property> <property name="hibernate.c3p0.timeout">5000</property> <property name="hibernate.c3p0.max_statements">0</property> <property name="hibernate.c3p0.idle_test_period">3000</property> <property name="hibernate.c3p0.acquire_increment">2</property> <property name="hibernate.c3p0.validate">false</property>
<property name="hibernate.c3p0.max_size">30</property> <property name="hibernate.c3p0.min_size">0</property> <property name="hibernate.c3p0.timeout">1200</property> <property name="hibernate.c3p0.max_statements">0</property> <property name="hibernate.c3p0.idle_test_period">300</property> <property name="hibernate.c3p0.acquire_increment">2</property> <property name="hibernate.c3p0.validate">false</property> <property name="c3p0.preferredTestQuery">SELECT 1 from dual</property>
And after this small change, we never received the Operation timed out error again. 🙂 🙂
Detailed troubleshooting of the Issue (Short Story Long 😉 )
In my experience, I have found that intermittent production issue like this are very hard to debug. In the end issue and solution both might seem very trivial, but person who actually works on live issue might not be able to narrow down the root cause so easily. So I will share my experience on how the issue was debugged and narrowed down. It might help other in troubleshooting similar issues.
Once upon a time we were receiving annoying error “java.sql.SQLException: Io exception: Operation timed out” intermittently on our production server. And the only thing which had changed was that production application was moved to new faster server and production database was moved to new database server, the network speed was even faster than the previous environment. :O network speed was faster then it rules out the possibility of intermittent issues in network (network might be congested or something like that) . So I knew that operation timed out error is not due to network problem.
On new environment, there was a proxy between production application and production database and on previous environment proxy was not there between production application and production database.But how come this has caused problem? If proxy allows to establish connection between two machine once, it will allow all the time, It does not intermittently deny connections.
One possibility was that query was taking more time to execute and due to which operation was being timed out. I searched for setting of hibernate that increases query timeout to some greater value then the default and thus could buy some more time before operation times out. But then I remembered my earlier findings that query which was failing, used to execute within a second and operation time out error was coming after wait of 80 seconds. If database is heavily loaded then also it will not take 80 seconds to execute a query that used to be executed within seconds. Even if proxy was also heavily loaded, then also it was very unlikely that request would not go through till 80 seconds.
I explained our DBA about the problem and ask for his opinion and his first question was : Are you receiving any error starting with ORA. And I was like NOPE. DBA explained to me that, if there is any error on oracle database side then it will always throw an error which will have ORA number. At that time I realized that ohh the error is not coming from the database. Then I looked at the error log again. The error “Io error : Operation timed” out was clearly saying that it was coming while jdbc was trying to read/write any data on the connection stream. connection stream??? Jdbc first creates connection with the database and it writes and reads data from that connection as we read/write data from any network stream. So now I knew that exception was generic java operation timed out exception and it was not coming from database.
If we think for java opeartion timed out error then it comes in the case when somebody is waiting on some operation(operation like getting resource, getting connection, etc ) and it does not completes in time. For e.g. If java program tries to connect to some url and that url cannot be connected in time then Operation timeout exception will be thrown. So in this case it should be trying to connect to the database and it somehow it should not be able to connect to database. But if I rethink about my previous statement (if it connects once it should connect always), I will straight away discard that possibility, because our another production application works perfectly on same server connecting to same database server. The only difference I see in other application is that it connects to database using direct jdbc connection and not through hibernate. Ok ok, so by now I know that there is a jdbc operation time out exception and it should not be coming while establishing connections.
After these findings, I again checked hibernate configurations file to look for any suspicion and I came across c3po connection pool configuration. Connection pool creates a connection in advance and when application wants to create a database connection it simply returns connection if it’s available in it’s pool and if it’s not available in the pool it creates a new connection and returns it. I looked at connection pool settings and found that connection pool was configured in such a way that at a time it should have minimum 10 connection in its pool and it number of connection should not exceed 30. Connection time out was set to 5000 seconds which is around 83 mins, :O 83 minutes, its quite high value. and time period to test if connection is idle was set to 3000 seconds, 50 mins, this interval for testing connection is idle is quite high.
If we consider at two important configuration then we can find out that timeout was set to 83 mins and idle test period was set to around 50 mins. It meant that after interval of 50 mins c3p0 will check for all connections that if connection are still valid or not by executing some validation query on the connection. and if the connection was idle till 83 mins then it will automatically time out.
So in our case c3p0 was checking connection was valid or not at the interval of 50 minutes. But proxy might have been configured to terminate idle connection before 50 mins.
Now at application startup c3p0 would have been creating and pooling min 10 connection. And they stays there in its pool. If application asks for connection it returns connection from its pool.If a application ask for connection before 50 mins say at 45 mins then c3p0 would not have checked that connection is still valid or not and it simply gives one connection from its pool assuming that they are still valid. But meanwhile if proxy is configured to disconnect connections that are idle for 30 mins then proxy should have dropped the connection to the database. So connection is not valid anymore but since c3p0 valid connection check interval is 50 mins it assumes that before 50 mins all connections are still valid and returns that connection to application.
Now application assumes that any connection returned from c3p0 is valid and it starts executing a query and that query does not executes since connection to the database is not valid and eventually this query execution operation times out giving error Io : Operation timeout error. and in this why we were receiving operation timed out error. Error must have been coming due to higher connection validation check timeout.
Solution for the problem is reduce connection timeout time and idle test period time so that c3p0 will execute a test query on connection before it is dropped by proxy. As c3p0 executes validation query every few minutes, connection no longer remains idle and proxy do not drops the connection and hence we do not get Operation time-out error anymore and which solves our problem. In addition to this we reduces min_size to 0 so that c3p0 will not have any connection in its pool.
So we changed connection timeout to 20 minutes, idle test period to 5 minutes and minimum connection to 0 and added preferredTestQuery. This query will be used to check validity of connection when c3p0 checks for validity of connection after idle test period.
<property name="hibernate.c3p0.min_size">0</property> <property name="hibernate.c3p0.timeout">1200</property> <property name="hibernate.c3p0.idle_test_period">300</property> <property name="c3p0.preferredTestQuery">SELECT 1 from dual</property>