Dubbo + Nacos: Use It This Way and You Lose High Availability
Our go-to microservice stack is the Spring Cloud family, and within it there are plenty of options for remote service invocation and for the registry. Common choices for remote invocation include Feign and Dubbo; common registries include Nacos, Zookeeper, Consul, and Eureka. Our project's choices were: Dubbo for RPC calls, and Nacos as both registry and configuration center.
I. How the Incident Started
The project had been running smoothly for years. One day we noticed that the Nacos server cluster's memory usage was getting high, so we decided to upgrade the machine specs and restart. We got right to it: from the 3-node Nacos-Server cluster in the test environment, we picked an arbitrary node and shut it down; let's call it Nacos-Server-1. That's where the trouble began.
The moment it was stopped, many service interfaces in the test environment became unreachable, and after a long wait the failure still had not recovered. So we hurried to bring Nacos-Server-1 back up. We had to find the root cause; otherwise we could never dare restart a Nacos-Server in production.
My long-held approach to hard problems: first read the exception messages, then form a hypothesis, then verify it in practice, and finally confirm it in the source code. Don't dive straight into the source; that hurts your head worse than a Moutai-flavored latte.
II. The Exception Messages
When Nacos-Server-1 went down, the first exceptions appeared on the Nacos-Client side (i.e., in one of the microservice applications). There were two main kinds:
- heartbeat failures between nacos-client and nacos-server
- Dubbo microservice invocation failures
(1) Heartbeat failure between nacos-client and nacos-server:
2023-09-06 08:10:09|ERROR|com.alibaba.nacos.client.naming.net.NamingProxy:reqApi|548|com.alibaba.nacos.naming.beat.sender|"request: /nacos/v1/ns/instance/beat failed, servers: [10.20.1.13:8848, 10.20.1.14:8848, 10.20.1.15:8848], code: 500, msg: java.net.SocketTimeoutException: Read timed out"|""
2023-09-06 08:10:09|ERROR|com.alibaba.nacos.client.naming.beat.BeatReactor$BeatTask:run|198|com.alibaba.nacos.naming.beat.sender|"[CLIENT-BEAT] failed to send beat: {"port":0,"ip":"10.21.230.14","weight":1.0,"serviceName":"DEFAULT_GROUP@@consumers:com.cloud.usercenter.api.PartyCompanyMemberApi:1.0:","metadata":{"owner":"ehome-cloud-owner","init":"false","side":"consumer","application.version":"1.0","methods":"queryGroupMemberCount,queryWithValid,query,queryOne,update,insert,queryCount,queryPage,delete,queryList","release":"2.7.8","dubbo":"2.0.2","pid":"6","check":"false","interface":"com.bm001.ehome.cloud.usercenter.api.PartyCompanyMemberApi","version":"1.0","qos.enable":"false","timeout":"20000","revision":"1.2.38-SNAPSHOT","retries":"0","path":"com.bm001.ehome.cloud.usercenter.api.PartyCompanyMemberApi","protocol":"consumer","metadata-type":"remote","application":"xxxx-cloud","sticky":"false","category":"consumers","timestamp":"1693917779436"},"scheduled":false,"period":5000,"stopped":false}, code: 500, msg: failed to req API:/nacos/v1/ns/instance/beat after all servers([10.20.1.13:8848, 10.20.1.14:8848, 10.20.1.15:8848])"|""
2023-09-06 08:10:10|ERROR|com.alibaba.nacos.client.naming.net.NamingProxy:callServer|613|com.alibaba.nacos.naming.beat.sender|"[NA] failed to request"|"com.alibaba.nacos.api.exception.NacosException: java.net.ConnectException: 拒絕連接 (Connection refused)
at com.alibaba.nacos.client.naming.net.NamingProxy.callServer(NamingProxy.java:611)
at com.alibaba.nacos.client.naming.net.NamingProxy.reqApi(NamingProxy.java:524)
at com.alibaba.nacos.client.naming.net.NamingProxy.reqApi(NamingProxy.java:491)
at com.alibaba.nacos.client.naming.net.NamingProxy.sendBeat(NamingProxy.java:426)
at com.alibaba.nacos.client.naming.beat.BeatReactor$BeatTask.run(BeatReactor.java:167)
Caused by: java.io.IOException: Server returned HTTP response code: 502 for URL: http://10.20.1.14:8848/nacos/v1/ns/instance/beat?app=unknown&serviceName=DEFAULT_GROUP%40%40providers%3AChannelOrderExpressApi%3A1.0%3A&namespaceId=dev&port=20880&ip=10.20.0.200
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1914)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1512)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at com.alibaba.nacos.common.http.client.response.JdkHttpClientResponse.getStatusCode(JdkHttpClientResponse.java:75)
at com.alibaba.nacos.common.http.client.handler.AbstractResponseHandler.handle(AbstractResponseHandler.java:43)
"
(2) Dubbo microservice invocation failure:
2023-09-06 08:09:38|ERROR|runtimeExceptionHandler|135|http-nio-8080-exec-5|"發(fā)生系統(tǒng)異常"|"org.apache.dubbo.rpc.RpcException: No provider available from registry 10.20.1.13:8848,10.20.1.14:8848,10.20.1.15:8848 for service ClueAuntMatchApi:1.0 on consumer 10.21.230.14 use dubbo version 2.7.8, please check status of providers(disabled, not registered or in blacklist).
at org.apache.dubbo.registry.integration.RegistryDirectory.doList(RegistryDirectory.java:599)
at org.apache.dubbo.rpc.cluster.directory.AbstractDirectory.list(AbstractDirectory.java:74)
at org.apache.dubbo.rpc.cluster.support.AbstractClusterInvoker.list(AbstractClusterInvoker.java:292)
at org.apache.dubbo.rpc.cluster.support.AbstractClusterInvoker.invoke(AbstractClusterInvoker.java:257)
at org.apache.dubbo.rpc.cluster.interceptor.ClusterInterceptor.intercept(ClusterInterceptor.java:47)
III. Guessing from the Exceptions
Anyone familiar with Dubbo will recognize the error please check status of providers(disabled, not registered or in blacklist). It basically means the provider went offline, or the consumer could not find the provider.
From past experience with Dubbo + Zookeeper, the client should pull the providers' information from the registry and keep a local copy, so that even if the registry goes down, calls to other services should still go through. It should never get to the point of finding no provider information at all.
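To make that expectation concrete, here is a minimal, hypothetical sketch of the local-cache idea (the class and method names are illustrative, not Dubbo's actual API): the consumer keeps the last provider list the registry pushed, so a registry outage alone should not make a lookup fail.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch of a consumer-side provider cache: the last list pushed
// by the registry is kept locally, so lookups keep working while the registry
// itself is unreachable. Not Dubbo's real implementation.
public class LocalProviderCache {
    private final ConcurrentHashMap<String, List<String>> cache = new ConcurrentHashMap<>();

    // Called whenever the registry pushes an updated provider list for a service.
    public void onRegistryNotify(String service, List<String> providers) {
        cache.put(service, new CopyOnWriteArrayList<>(providers));
    }

    // Called at invocation time; serves the cached snapshot even if the
    // registry is currently down.
    public List<String> lookup(String service) {
        List<String> providers = cache.get(service);
        if (providers == null || providers.isEmpty()) {
            throw new IllegalStateException("No provider available for " + service);
        }
        return providers;
    }
}
```

Under this model, losing one registry node should at worst delay cache updates, never wipe out the consumer's view of existing providers — which is exactly why the observed behavior was surprising.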
When reasoning stalls, all you can do is guess at the cause from the exceptions. So let's start guessing from the two exceptions above.
1. Guess 1
Since nacos-server-1 went down, the heartbeat between nacos-client and the server failed, which caused the locally cached provider metadata to be wiped. With this guess in hand, I went straight to the nacos-client source code and found the part that handles the client-server heartbeat.
Following the call chain down leads to this core piece of code:
public String reqApi(String api, Map<String, String> params, Map<String, String> body, List<String> servers,
        String method) throws NacosException {
    params.put(CommonParams.NAMESPACE_ID, getNamespaceId());
    if (CollectionUtils.isEmpty(servers) && StringUtils.isEmpty(nacosDomain)) {
        throw new NacosException(NacosException.INVALID_PARAM, "no server available");
    }
    NacosException exception = new NacosException();
    if (servers != null && !servers.isEmpty()) {
        Random random = new Random(System.currentTimeMillis());
        int index = random.nextInt(servers.size());
        for (int i = 0; i < servers.size(); i++) {
            String server = servers.get(index);
            try {
                return callServer(api, params, body, server, method);
            } catch (NacosException e) {
                exception = e;
                if (NAMING_LOGGER.isDebugEnabled()) {
                    NAMING_LOGGER.debug("request {} failed.", server, e);
                }
            }
            index = (index + 1) % servers.size();
        }
    }
    if (StringUtils.isNotBlank(nacosDomain)) {
        for (int i = 0; i < UtilAndComs.REQUEST_DOMAIN_RETRY_COUNT; i++) {
            try {
                return callServer(api, params, body, nacosDomain, method);
            } catch (NacosException e) {
                exception = e;
                if (NAMING_LOGGER.isDebugEnabled()) {
                    NAMING_LOGGER.debug("request {} failed.", nacosDomain, e);
                }
            }
        }
    }
    NAMING_LOGGER.error("request: {} failed, servers: {}, code: {}, msg: {}", api, servers, exception.getErrCode(),
            exception.getErrMsg());
    throw new NacosException(exception.getErrCode(),
            "failed to req API:" + api + " after all servers(" + servers + ") tried: " + exception.getMessage());
}
Note this part of the code in particular:
for (int i = 0; i < servers.size(); i++) {
    String server = servers.get(index);
    try {
        return callServer(api, params, body, server, method);
    } catch (NacosException e) {
        exception = e;
        if (NAMING_LOGGER.isDebugEnabled()) {
            NAMING_LOGGER.debug("request {} failed.", server, e);
        }
    }
    index = (index + 1) % servers.size();
}
This code shows that nacos-client talks to a random node in the nacos-server cluster. If you keep reading the source, you will find that as long as one heartbeat attempt succeeds, the heartbeat as a whole is considered healthy. We had stopped only one nacos-server, and the client could still heartbeat with the other two, so although errors were logged, the heartbeat was in fact still working. So this guess was abandoned; on to the next one.
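The random-start, round-robin failover above boils down to the following sketch (FailoverCaller and callWithFailover are illustrative names, not Nacos code):

```java
import java.util.List;
import java.util.Random;
import java.util.function.Function;

// Minimal sketch of the failover pattern in NamingProxy.reqApi: start at a
// random server, walk the list round-robin, and return on the first success.
// Only if every server fails does the call as a whole fail.
public class FailoverCaller {
    public static String callWithFailover(List<String> servers, Function<String, String> call) {
        int index = new Random().nextInt(servers.size());
        RuntimeException last = null;
        for (int i = 0; i < servers.size(); i++) {
            String server = servers.get(index);
            try {
                return call.apply(server); // first healthy server wins
            } catch (RuntimeException e) {
                last = e;                  // remember the failure, try the next node
            }
            index = (index + 1) % servers.size();
        }
        throw new RuntimeException("all servers failed", last);
    }
}
```

With one of three servers down, the loop pays at most one failed attempt before landing on a healthy node, which matches what we saw: errors in the log, but a heartbeat that still succeeds overall.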
2. Guess 2
Since Dubbo communicates with Zookeeper over a long-lived socket connection, perhaps Dubbo also holds a long-lived socket connection to nacos-server; and perhaps, because nacos-server lacks Zookeeper's leader-election mechanism, it does not automatically switch to another available nacos-server when one node dies.
Or maybe it is a leader-election problem within the nacos-server cluster: after an election, the result is not propagated to the consumer in time, or the consumer-to-Nacos communication itself is broken. In short, for some reason the client fails to switch over to an available nacos-server, so it cannot fetch the provider metadata, and therefore cannot make the call.
With this hypothesis in hand, time to verify it:
Reading further through the Nacos source, I found that Nacos does guarantee data consistency across cluster nodes, using the Raft protocol (a consensus protocol with leader election, briefly introduced at the end).
Given that there is a leader-election protocol, why did communication still fail? Back to the nacos-server exception logs: when nacos-server-1 was stopped, several kinds of exceptions appeared under nacos-server's logs directory.
In naming-raft.log, this exception:
java.lang.NullPointerException: null
at com.alibaba.nacos.naming.consistency.persistent.raft.RaftCore.signalDelete(RaftCore.java:275)
at com.alibaba.nacos.naming.consistency.persistent.raft.RaftConsistencyServiceImpl.remove(RaftConsistencyServiceImpl.java:72)
at com.alibaba.nacos.naming.consistency.DelegateConsistencyServiceImpl.remove(DelegateConsistencyServiceImpl.java:53)
at com.alibaba.nacos.naming.core.ServiceManager.easyRemoveService(ServiceManager.java:434)
at com.alibaba.nacos.naming.core.ServiceManager$EmptyServiceAutoClean.lambda$null$1(ServiceManager.java:902)
at java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1769)
at com.alibaba.nacos.naming.core.ServiceManager$EmptyServiceAutoClean.lambda$null$2(ServiceManager.java:891)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.concurrent.ConcurrentHashMap$EntrySpliterator.forEachRemaining(ConcurrentHashMap.java:3606)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290)
at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool.helpComplete(ForkJoinPool.java:1870)
at java.util.concurrent.ForkJoinPool.externalHelpComplete(ForkJoinPool.java:2467)
at java.util.concurrent.ForkJoinTask.externalAwaitDone(ForkJoinTask.java:324)
at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:405)
at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
at com.alibaba.nacos.naming.core.ServiceManager$EmptyServiceAutoClean.lambda$run$3(ServiceManager.java:891)
at java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1597)
at com.alibaba.nacos.naming.core.ServiceManager$EmptyServiceAutoClean.run(ServiceManager.java:881)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2023-09-07 08:19:25,262 ERROR Raft remove failed.
In naming-push.log, this exception:
java.lang.IllegalStateException: unable to find ackEntry for key: 10.21.140.23,43998,31247629183519634, ack json: {"type": "push-ack", "lastRefTime":"31247629183519634", "data":""}
at com.alibaba.nacos.naming.push.PushService$Receiver.run(PushService.java:677)
at java.lang.Thread.run(Thread.java:748)
2023-09-07 08:17:38,533 ERROR [NACOS-PUSH] error while receiving ack data
In naming-distro.log, this exception:
2023-09-07 08:19:39,904 ERROR receive responsible key timestamp of com.alibaba.nacos.naming.iplist.ephemeral.dev-jzj##DEFAULT_GROUP@@providers:com.bm001.league.ordercenter.api.AdCluePoolApi:1.0: from 10.20.1.13:8848
Putting these exceptions together, we can infer that when nacos-server-1 went down, only 2 nodes remained in the nacos-server cluster, and their Raft leader election went wrong. As a result the consumer could not find the leader node and could not establish proper communication, so it could not fetch the provider metadata.
Let's keep verifying this inference!
This time we stopped nacos-server-1 and nacos-server-2 at the same time, leaving only 1 nacos-server, and calls between microservices immediately recovered. With a single node, the election is trivial, and the consumer quickly re-established communication with the nacos-server. After starting all 3 nodes again, everything was normal as well. This confirms that a 2-node nacos-server cluster does indeed have a leader-election problem.
Problem solved; back to work with peace of mind!
3. The Raft Protocol
A quick word on Raft. Raft is mainly used to satisfy the CP side of the CAP theorem, guaranteeing data consistency in a cluster. Raft defines three states for each cluster node, similar to Zookeeper's ZAB protocol:
Follower: every node in the cluster starts as a Follower.
Candidate: when a node initiates a vote to elect a Leader, it first votes for itself and transitions from Follower to Candidate.
Leader: when a node wins the votes of a majority of the cluster (more than half), it becomes the Leader.
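The "majority" rule can be made concrete with a little arithmetic (a sketch, not Nacos code). A quorum is strictly more than half the cluster, which is also why odd cluster sizes are recommended: a 4-node cluster tolerates no more failures than a 3-node one.

```java
// Sketch of the Raft majority rule: a candidate needs strictly more than
// half of the cluster's votes to become Leader.
public class RaftQuorum {
    public static int quorum(int clusterSize) {
        return clusterSize / 2 + 1; // smallest strict majority
    }

    public static boolean canElectLeader(int clusterSize, int aliveNodes) {
        return aliveNodes >= quorum(clusterSize);
    }
}
```

By this textbook math, a 3-node cluster with 2 survivors still has a quorum and should elect a leader, so the failure observed in our incident points at the behavior of that particular Nacos version rather than at the quorum rule itself; a 2-node cluster, however, genuinely cannot survive the loss of either node.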
IV. Summary
Three takeaways from this whole process:
- The heartbeat between nacos-client and nacos-server only tells the server that this client's service is alive, while nacos-server nodes asynchronously replicate service information among themselves. The actual invocation depends on Dubbo, which pulls the provider metadata from nacos-server over a separate channel.
- Restart nacos-server only late at night, outside normal traffic hours. And to keep the cluster continuously available, keep an odd number of nodes; with an even number, leader election can fail, clients can no longer talk to the servers properly, microservice calls cannot be made, and the whole point of running a Nacos cluster is lost.
- When a hard problem appears: first read the exception messages, then form a hypothesis, then verify it in practice, and finally confirm it in the source code.