Problem Description
Recently a project ran into a problem: at a certain point the application produced a large number of ES access timeouts. The main exception in the logs was:
java.io.IOException: Connection reset by peer
at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:793) ~[elasticsearch-rest-client-7.4.0.jar!/:7.4.0]
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:218) ~[elasticsearch-rest-client-7.4.0.jar!/:7.4.0]
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:205) ~[elasticsearch-rest-client-7.4.0.jar!/:7.4.0]
at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1454) ~[elasticsearch-rest-high-level-client-7.4.0.jar!/:7.4.0]
at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1439) ~[elasticsearch-rest-high-level-client-7.4.0.jar!/:7.4.0]
at org.elasticsearch.client.IndicesClient.exists(IndicesClient.java:785) ~[elasticsearch-rest-high-level-client-7.4.0.jar!/:7.4.0]
at com.free4inno.scheduler.adapter.service.elasticsearch.GenericEsService.isIndexExists(GenericEsService.java:60) ~[classes!/:0.0.1-SNAPSHOT]
at com.free4inno.scheduler.adapter.service.elasticsearch.MetricsToEsService.getTodayIndex(MetricsToEsService.java:56) ~[classes!/:0.0.1-SNAPSHOT]
at com.free4inno.scheduler.adapter.service.elasticsearch.MetricsToEsService.insert(MetricsToEsService.java:49) ~[classes!/:0.0.1-SNAPSHOT]
at com.free4inno.scheduler.adapter.service.Scheduler.sendMetrics(Scheduler.java:64) [classes!/:0.0.1-SNAPSHOT]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_261]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_261]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_261]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_261]
at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84) [spring-context-5.2.8.RELEASE.jar!/:5.2.8.RELEASE]
at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) [spring-context-5.2.8.RELEASE.jar!/:5.2.8.RELEASE]
at org.springframework.scheduling.concurrent.ReschedulingRunnable.run(ReschedulingRunnable.java:93) [spring-context-5.2.8.RELEASE.jar!/:5.2.8.RELEASE]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_261]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_261]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_261]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_261]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_261]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_261]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_261]
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_261]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_261]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_261]
at sun.nio.ch.IOUtil.read(IOUtil.java:197) ~[?:1.8.0_261]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:378) ~[?:1.8.0_261]
at org.apache.http.impl.nio.reactor.SessionInputBufferImpl.fill(SessionInputBufferImpl.java:231) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
at org.apache.http.impl.nio.codecs.AbstractMessageParser.fillBuffer(AbstractMessageParser.java:136) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:241) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81) ~[httpasyncclient-4.1.4.jar!/:4.1.4]
at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39) ~[httpasyncclient-4.1.4.jar!/:4.1.4]
at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
... 1 more
The symptoms were:
① The first request times out, while subsequent requests succeed; the timeout duration equals the configured socket timeout.
② The exceptions are thrown at irregular, unpredictable times.
Locating the Problem
This problem had already appeared earlier, while we were developing a multi-cluster management platform. Because condition ③ (client and server on different versions) was present at the time and the platform had not yet gone into production, we did not look closely into this class of problem and assumed it was probably a version-compatibility issue. Only when the project team reported it did we investigate seriously.
The application connects to ES directly through the ES High Level Rest Client, and the client and server versions are identical, so case ③ can be ruled out. The initial suspicion was that heavy query or write traffic was overwhelming the backend and causing the exceptions. Checking the monitoring data and logs, however, showed that all metrics were normal at the times the exceptions occurred, with no sudden spikes in traffic; combined with symptom ②, server-side pressure can also be ruled out.
That narrowed the source of the exceptions down to the client. The ES client talks to the server over persistent (long-lived) connections. Reading the source code shows that the client creates a connection pool, that each client holds an HTTP connection, and that HTTP keep-alive is enabled so connections are reused. It is exactly this reuse-and-keep-alive behavior that triggers the exception; the details are analyzed below.
Problem Analysis
Before the fix, the RestClient was created roughly as follows (illustrative code):
final RestClientBuilder restClientBuilder = RestClient.builder(esHosts) // esHosts: HttpHost... of the ES nodes
        .setRequestConfigCallback(requestConfigBuilder -> {
            requestConfigBuilder.setConnectTimeout(5000);          // connect timeout: 5 s
            requestConfigBuilder.setSocketTimeout(60000);          // socket timeout: 60 s
            requestConfigBuilder.setConnectionRequestTimeout(500); // pool lease timeout: 500 ms
            return requestConfigBuilder;
        })
        .setHttpClientConfigCallback(httpAsyncClientBuilder -> {
            httpAsyncClientBuilder.disableAuthCaching();
            return httpAsyncClientBuilder;
        });
final RestHighLevelClient restHighLevelClient = new RestHighLevelClient(restClientBuilder);
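Note that this configuration only sets timeouts: it specifies neither a keep-alive strategy nor the SO_KEEPALIVE socket option, so the Apache HttpClient defaults apply and, as analyzed below, pooled connections are kept indefinitely.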
[Figure: diagram of the client-server connections — the server closes idle connection 1; the client later opens new connection 2]
The ES high-level client implements its long-lived connections by setting the connection's keep-alive duration to -1, which means the client never expires a pooled connection. The server, on the other hand, monitors its connections to keep resource usage down: if a connection sits idle for too long, the server actively closes it. The client does not notice this and still regards the connection as established, leaving the connection effectively half-open. When traffic later arrives on that connection, the client finds it can no longer reach the server, the request times out, and the client finally drops the connection (connection 1 in the figure above), after which it talks to the server over new connection 2. The root cause is therefore that the client fails to detect dead connections and close them in time, so the client must be configured to actively probe its TCP connections and keep them alive.
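For reference, the following minimal sketch shows roughly how Apache HttpClient's default keep-alive strategy arrives at that -1; this is an illustration of the documented default behavior, not the client's actual source, assuming ES sends no Keep-Alive response header:
import org.apache.http.HeaderElement;
import org.apache.http.HeaderElementIterator;
import org.apache.http.conn.ConnectionKeepAliveStrategy;
import org.apache.http.message.BasicHeaderElementIterator;
import org.apache.http.protocol.HTTP;

ConnectionKeepAliveStrategy defaultLikeStrategy = (response, context) -> {
    // Walk the "Keep-Alive" response header, if the server sent one.
    HeaderElementIterator it = new BasicHeaderElementIterator(
            response.headerIterator(HTTP.CONN_KEEP_ALIVE));
    while (it.hasNext()) {
        HeaderElement he = it.nextElement();
        if ("timeout".equalsIgnoreCase(he.getName()) && he.getValue() != null) {
            try {
                return Long.parseLong(he.getValue()) * 1000L; // seconds -> millis
            } catch (NumberFormatException ignored) {
                // malformed hint from the server, fall through
            }
        }
    }
    return -1; // no hint: the connection pool keeps the connection forever
};
Because ES sends no such header, every pooled connection ends up with a -1 lifetime on the client side while the server enforces its own idle timeout, which is exactly the mismatch described above.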
Solutions
1. Explicitly set a keep-alive strategy on the client
httpAsyncClientBuilder.setKeepAliveStrategy((httpResponse, httpContext) -> TimeUnit.MINUTES.toMillis(3))
2. Explicitly enable TCP keepalive
httpAsyncClientBuilder.setDefaultIOReactorConfig(IOReactorConfig.custom().setSoKeepAlive(true).build())
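Putting the two client-side fixes together, client creation looks roughly like the sketch below; the host and the 3-minute cap are illustrative values, and the cap should be kept below the server side's idle-connection timeout so the client always gives up a connection before the server silently closes it:
import java.util.concurrent.TimeUnit;
import org.apache.http.HttpHost;
import org.apache.http.impl.nio.reactor.IOReactorConfig;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;
import org.elasticsearch.client.RestHighLevelClient;

RestClientBuilder builder = RestClient.builder(new HttpHost("es-node-1", 9200, "http"))
        .setHttpClientConfigCallback(httpAsyncClientBuilder -> httpAsyncClientBuilder
                // Fix 1: expire pooled connections after 3 minutes instead of never (-1).
                .setKeepAliveStrategy((response, context) -> TimeUnit.MINUTES.toMillis(3))
                // Fix 2: enable SO_KEEPALIVE so the OS probes idle TCP connections.
                .setDefaultIOReactorConfig(IOReactorConfig.custom()
                        .setSoKeepAlive(true)
                        .build()));
RestHighLevelClient client = new RestHighLevelClient(builder);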
3. Set the TCP keepalive probe time at the operating-system level
net.ipv4.tcp_keepalive_time = 60
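tcp_keepalive_time is the number of seconds a connection may sit idle before the kernel sends the first keepalive probe. Two related parameters, shown below with illustrative values, control the interval between probes and how many unanswered probes mark a connection dead; all three are typically set in /etc/sysctl.conf and applied with sysctl -p, and they only affect sockets that enable SO_KEEPALIVE (solution 2 above):
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 5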
Source:
https://www.modb.pro/db/388569