Nacos的单机模式启动失败问题及解决
作者:搬砖的小程序猿
背景
项目上使用Nacos做服务注册与发现,在开发环境中我们使用的容器化的Naocs部署,并且使用的单机节点以及内置数据库derby,但是最近在部署Nacos的服务的使用发现启动失败,报错日志如下(日志很长,已经截取了一些):
2021-09-18 22:40:31,822 INFO Tomcat initialized with port(s): 8848 (http)
2021-09-18 22:40:32,148 INFO Root WebApplicationContext: initialization completed in 11864 ms
2021-09-18 22:40:44,722 ERROR Error starting Tomcat context. Exception: org.springframework.beans.factory.BeanCreationException. Message: Error creating bean with name 'authFilterRegistration' defined in class path resource [com/alibaba/nacos/core/auth/AuthConfigs.class]: Bean instantiation via factory method failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [org.springframework.boot.web.servlet.FilterRegistrationBean]: Factory method 'authFilterRegistration' threw exception; nested exception is org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'authFilter': Unsatisfied dependency expressed through field 'authManager'; nested exception is org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'nacosAuthManager': Unsatisfied dependency expressed through field 'authenticationManager'; nested exception is org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'nacosAuthConfig': Unsatisfied dependency expressed through field 'userDetailsService'; nested exception is org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'nacosUserDetailsServiceImpl': Unsatisfied dependency expressed through field 'userPersistService'; nested exception is org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'embeddedUserPersistServiceImpl': Unsatisfied dependency expressed through field 'databaseOperate'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'standaloneDatabaseOperateImpl': Invocation of init method failed; nested exception is java.lang.RuntimeException: com.alibaba.nacos.api.exception.runtime.NacosRuntimeException: errCode: 500, errMsg: load schema.sql error.
2021-09-18 22:40:44,827 WARN Exception encountered during context initialization - cancelling refresh attempt: org.springframework.context.ApplicationContextException: Unable to start web server; nested exception is org.springframework.boot.web.server.WebServerException: Unable to start embedded Tomcat
2021-09-18 22:40:44,837 INFO Nacos Log files: /home/nacos/logs
2021-09-18 22:40:44,849 INFO Nacos Log files: /home/nacos/conf
2021-09-18 22:40:44,853 INFO Nacos Log files: /home/nacos/data
2021-09-18 22:40:44,856 ERROR Startup errors : {}
org.springframework.context.ApplicationContextException: Unable to start web server; nested exception is org.springframework.boot.web.server.WebServerException: Unable to start embedded Tomcat
at org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext.onRefresh(ServletWebServerApplicationContext.java:157)
at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:540)
...............
Caused by: com.alibaba.nacos.api.exception.runtime.NacosRuntimeException: errCode: 500, errMsg: load schema.sql error.
at com.alibaba.nacos.config.server.service.datasource.LocalDataSourceServiceImpl.reload(LocalDataSourceServiceImpl.java:101)
at com.alibaba.nacos.config.server.service.datasource.LocalDataSourceServiceImpl.initialize(LocalDataSourceServiceImpl.java:170)
at com.alibaba.nacos.config.server.service.datasource.LocalDataSourceServiceImpl.init(LocalDataSourceServiceImpl.java:83)
at com.alibaba.nacos.config.server.service.datasource.DynamicDataSource.getDataSource(DynamicDataSource.java:47)
... 166 common frames omitted
Caused by: java.sql.SQLTimeoutException: Login timeout exceeded.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
at org.apache.derby.jdbc.InternalDriver.timeLogin(Unknown Source)
at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
at org.apache.derby.jdbc.EmbeddedDriver.connect(Unknown Source)
at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:138)
at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:354)
at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:202)
at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:473)
at com.zaxxer.hikari.pool.HikariPool.checkFailFast(HikariPool.java:554)
at com.zaxxer.hikari.pool.HikariPool.<init>(HikariPool.java:115)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:112)
at com.alibaba.nacos.config.server.service.datasource.LocalDataSourceServiceImpl.reload(LocalDataSourceServiceImpl.java:96)
... 169 common frames omitted
Caused by: org.apache.derby.iapi.error.StandardException: Login timeout exceeded.
at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source)
... 185 common frames omitted
根据错误日志分析
Nacos在加载schema.sql文件时,derby数据库报了一个Login Timeout exceeded的错误,看着像是超时,此时根据异常堆栈查看源码如下:
private EmbedConnection timeLogin(String var1, Properties var2, int var3) throws SQLException { try { InternalDriver.LoginCallable var4 = new InternalDriver.LoginCallable(this, var1, var2); Future var5 = _executorPool.submit(var4); long var6 = System.currentTimeMillis(); long var8 = var6 + (long)var3 * 1000L; while(var6 < var8) { try { EmbedConnection var10 = (EmbedConnection)var5.get(var8 - var6, TimeUnit.MILLISECONDS); return var10; } catch (InterruptedException var16) { InterruptStatus.setInterrupted(); var6 = System.currentTimeMillis(); } catch (ExecutionException var17) { throw this.processException(var17); } catch (TimeoutException var18) { throw Util.generateCsSQLException("XBDA0.C.1", new Object[0]); } } throw Util.generateCsSQLException("XBDA0.C.1", new Object[0]); } finally { InterruptStatus.restoreIntrFlagIfSeen(); } }
代码逻辑很简单,封装了获取EmbedConnection这个数据库链接的过程为异步获取,通过Future来控制超时时长,那么此时可以判断出就是在获取EmbedConnection这个数据库链接的时候超时了。
但是获取EmbedConnection为什么超时,里面到底做了什么,不得而知(由于源码的class文件中缺乏了本地变量表,导致代码阅读困难),此时注意到:
2021-09-18 22:40:31,822 INFO Tomcat initialized with port(s): 8848 (http)
2021-09-18 22:40:44,722 ERROR。。。。
这两行日志为SpringBoot的启动日志,而这两行日志中,日志打印的时间差了10s左右,那么猜测可能就是这10s中获取数据库连接超时,并且一定会有一个线程一直阻塞中。
此时我们想到可以使用 jstack命令去查看java进程的线程堆栈日志,如下:
"derby.rawStoreDaemon" #38 daemon prio=5 os_prio=0 tid=0x00007ff090057800 nid=0x6337 in Object.wait() [0x00007ff097ffe000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000ee89fd70> (a org.apache.derby.impl.services.daemon.BasicDaemon)
at org.apache.derby.impl.services.daemon.BasicDaemon.rest(Unknown Source)
- locked <0x00000000ee89fd70> (a org.apache.derby.impl.services.daemon.BasicDaemon)
at org.apache.derby.impl.services.daemon.BasicDaemon.run(Unknown Source)
at java.lang.Thread.run(Thread.java:748)"Thread-15" #37 daemon prio=5 os_prio=0 tid=0x00007ff12928b800 nid=0x6336 runnable [0x00007ff0ec1be000]
java.lang.Thread.State: RUNNABLE
at java.io.FileDescriptor.sync(Native Method)
at org.apache.derby.impl.io.DirRandomAccessFile.sync(Unknown Source)
at org.apache.derby.impl.store.raw.data.RAFContainer.writeRAFHeader(Unknown Source)
at org.apache.derby.impl.store.raw.data.RAFContainer.clean(Unknown Source)
- locked <0x00000000ef759b28> (a org.apache.derby.impl.store.raw.data.RAFContainer4)
at org.apache.derby.impl.services.cache.ConcurrentCache.cleanAndUnkeepEntry(Unknown Source)
at org.apache.derby.impl.services.cache.ConcurrentCache.cleanCache(Unknown Source)
at org.apache.derby.impl.services.cache.ConcurrentCache.cleanAll(Unknown Source)
at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.checkpoint(Unknown Source)
at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.createFinished(Unknown Source)
at org.apache.derby.impl.store.raw.RawStore.createFinished(Unknown Source)
at org.apache.derby.impl.store.access.RAMAccessManager.createFinished(Unknown Source)
at org.apache.derby.impl.db.BasicDatabase.createFinished(Unknown Source)
at org.apache.derby.impl.db.BasicDatabase.boot(Unknown Source)
at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
at org.apache.derby.impl.services.monitor.BaseMonitor.bootService(Unknown Source)
at org.apache.derby.impl.services.monitor.BaseMonitor.createPersistentService(Unknown Source)
at org.apache.derby.impl.services.monitor.FileMonitor.createPersistentService(Unknown Source)
at org.apache.derby.iapi.services.monitor.Monitor.createPersistentService(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection$5.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.derby.impl.jdbc.EmbedConnection.createPersistentService(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.createDatabase(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source)
at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.derby.jdbc.InternalDriver.getNewEmbedConnection(Unknown Source)
at org.apache.derby.jdbc.InternalDriver$LoginCallable.call(Unknown Source)
at org.apache.derby.jdbc.InternalDriver$LoginCallable.call(Unknown Source)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
......."main" #1 prio=5 os_prio=0 tid=0x00007ff12804c000 nid=0x61d1 waiting on condition [0x00007ff130ee5000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000c01f6de8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
at com.alibaba.nacos.common.utils.ThreadUtils.shutdownThreadPool(ThreadUtils.java:121)
at com.alibaba.nacos.common.utils.ThreadUtils.shutdownThreadPool(ThreadUtils.java:106)
at com.alibaba.nacos.common.executor.ThreadPoolManager.destroy(ThreadPoolManager.java:156)
- locked <0x00000000c03b65d8> (a java.lang.Object)
at com.alibaba.nacos.common.executor.ThreadPoolManager.shutdown(ThreadPoolManager.java:197)
at com.alibaba.nacos.core.code.StartingSpringApplicationRunListener.failed(StartingSpringApplicationRunListener.java:147)
at org.springframework.boot.SpringApplicationRunListeners.callFailedListener(SpringApplicationRunListeners.java:91)
at org.springframework.boot.SpringApplicationRunListeners.failed(SpringApplicationRunListeners.java:84)
at org.springframework.boot.SpringApplication.handleRunFailure(SpringApplication.java:828)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:327)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1260)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1248)
at com.alibaba.nacos.Nacos.main(Nacos.java:35)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:49)
at org.springframework.boot.loader.Launcher.launch(Launcher.java:108)
at org.springframework.boot.loader.Launcher.launch(Launcher.java:58)
at org.springframework.boot.loader.PropertiesLauncher.main(PropertiesLauncher.java:467)"VM Thread" os_prio=0 tid=0x00007ff12819d000 nid=0x61df runnable
"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007ff12805e800 nid=0x61d8 runnable
"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007ff128060800 nid=0x61d9 runnable
"GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007ff128062800 nid=0x61da runnable
"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007ff128064000 nid=0x61db runnable
"VM Periodic Task Thread" os_prio=0 tid=0x00007ff1281f0000 nid=0x61e7 waiting on condition
JNI global references: 991
经过多次的jstack命令查看,发现Thread-15这个线程一直阻塞在
java.io.FileDescriptor.sync(Native Method)
查看下jdk中对此方法的说明:
- 在涉及IO的操作中,linux系统为了平衡物理介质和内存的速度差异,引入了PageCache的概念,在我们调用outputstram的write的方法时,并不是将内容写入到物理介质中而是写入到PageCache中并且标记此块内存为dirty。
- linux的后台会启动一个线程定时将标记为dirty的内存块刷新到物理介质中,而如果此时虚机断电的话,那么没有刷新到物理介质中的内容就有可能会丢失,所以linxu会提供fsync的命令强制刷新,而不是等定时任务刷新。
- 而对应jdk中封装就是FileDescriptor.sync的方法
这个问题我的第一反应就是会不会是PageCache过大,导致刷盘过慢通过free -g命令查看
free -g total used free shared buff/cache available Mem: 8 1 0 0 6 6 Swap: 9 0 9
发现总共内存为8g,而PageCache占用了高达6g,立马将PageCache强制刷盘,并重新启动Nacos,可惜问题依旧存在。
此时我的想法是会不会是在执行某一个fsync的命令的时候,执行时间过久呢,我希望能够知道每个fsync的执行耗时,修改启动命令为:
#nohup java ${JAVA_OPT} >${BASE_DIR}/logs/start.out 2>&1 </dev/null strace -T -ttt -ff -xx -yy -o strace.log java ${JAVA_OPT} >${BASE_DIR}/logs/start.out 2>&1 </dev/null
增加了strace命令去追踪每个系统调用的耗时,并且以线程ID的形式分成不同的文件,strace.log后面跟着的数字就是线程ID。
[root@localhost nacos]# ls -l total 77832 drwxr-xr-x 2 root root 4096 Sep 18 22:37 bin drwxr-xr-x 2 root root 4096 Sep 18 01:52 conf drwxr-xr-x 3 root root 4096 Sep 18 22:40 data -rw-r--r-- 1 root root 696 Sep 18 22:40 derby.log drwxr-xr-x 2 root root 4096 Sep 18 22:40 logs -rw-r--r-- 1 root root 44298 Sep 18 22:40 strace.log.27777 -rw-r--r-- 1 root root 58640394 Sep 18 22:40 strace.log.27778 -rw-r--r-- 1 root root 16262 Sep 18 22:40 strace.log.27779 -rw-r--r-- 1 root root 16882 Sep 18 22:40 strace.log.25398 ...... drwxr-xr-x 2 root root 4096 Sep 18 22:38 target drwxr-xr-x 3 root root 4096 Sep 18 22:40 work
此时我们回到之前打印的线程堆栈日志,找到阻塞的线程
"Thread-15" #37 daemon prio=5 os_prio=0 tid=0x00007ff12928b800 nid=0x6336 runnable [0x00007ff0ec1be000]
线程Id: nid=0x6336,通过进制换算将16进制换算成10进制为25398,找到strace.log.25398,下载并通过notepad++ 打开,搜索fsync命令:
// 时间戳 1631906136.913442 fsync(44<\x2f\x68\x6f\x6d\x65\x2f\x6e\x61\x63\x6f\x73\x2f\x64\x61\x74\x61\x2f\x64\x65\x72\x62\x79\x2d\x64\x61\x74\x61\x2f\x64\x62\x2e\x6c\x63\x6b>) = 0 // fsync命令耗时 <0.184688>
找到所有的fsync命令,通过打印的时间戳计算总耗时大约为11s左右。
此时只需要确认前面获取链接的超时时间,确认默认的超时时长为多少,查看Nacos的代码,找到初始化数据库连接的地方,通过代码跟踪找到:
private synchronized void initialize(String jdbcUrl) { HikariDataSource ds = new HikariDataSource(); ds.setDriverClassName(jdbcDriverName); ds.setJdbcUrl(jdbcUrl); ds.setUsername(userName); ds.setPassword(password); ds.setIdleTimeout(30_000L); ds.setMaximumPoolSize(80); ds.setConnectionTimeout(10000L); DataSourceTransactionManager tm = new DataSourceTransactionManager(); tm.setDataSource(ds); if (jdbcTemplateInit) { jt.setDataSource(ds); tjt.setTransactionManager(tm); } else { jt = new JdbcTemplate(); jt.setMaxRows(50000); jt.setQueryTimeout(5000); jt.setDataSource(ds); tjt = new TransactionTemplate(tm); tjt.setTimeout(5000); jdbcTemplateInit = true; } reload(); }
ds.setConnectionTimeout(10000L);这里就是设置数据库连接超时的地方,代码中写死为10s,和strace.log中计算的时长大致能对的上,所以至此可以确认,是调用linux的fsync的方法耗时过久导致业务侧超时,但是为什么会耗时这么久?毕竟阿里给的默认10s应该是经过验证的。带着这个疑问我寻思应该去查看磁盘的性能了。
iostat -x 1 1000 avg-cpu: %user %nice %system %iowait %steal %idle 3.49 0.00 2.00 22.94 0.00 71.57 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 43.00 0.00 52.00 0.00 488.00 14.92 0.96 17.56 0.00 17.56 18.42 95.80 dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-2 0.00 0.00 0.00 87.00 0.00 488.00 8.92 0.99 10.92 0.00 10.92 11.01 95.80 avg-cpu: %user %nice %system %iowait %steal %idle 3.77 0.00 1.26 23.37 0.00 71.61 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 47.00 0.00 59.00 0.00 548.00 18.58 0.99 16.47 0.00 16.47 16.24 95.80 dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-2 0.00 0.00 0.00 98.00 0.00 552.00 11.27 1.03 10.40 0.00 10.40 9.79 95.90 avg-cpu: %user %nice %system %iowait %steal %idle 3.99 0.00 1.75 22.94 0.00 71.32 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 42.00 0.00 55.00 0.00 464.00 13.24 0.98 19.11 0.00 19.11 17.22 94.70 dm-0 0.00 0.00 0.00 2.00 0.00 36.00 36.00 0.03 15.50 0.00 15.50 8.00 1.60 dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-2 0.00 0.00 0.00 86.00 0.00 424.00 7.53 0.98 12.30 0.00 12.30 11.01 94.70
观测磁盘使用率,发现在Nacos的启动过程中, %util指标(磁盘的使用率)一直在95%左右,磁盘在被高度占用中,磁盘的写入速率wkB/s平均在500kb/s左右。
至此可以确认磁盘的性能不足导致derby数据库初始化的时候,刷盘耗时过久,超过了默认配置的10s的超时时长,导致Nacos启动失败。
修改方法很简单,修改默认超时时长或者将默认的超时时长变更为配置项放在配置文件中。
总结
以上为个人经验,希望能给大家一个参考,也希望大家多多支持脚本之家。