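RocketMQ Broker: Implementing a Highly Available, High-Concurrency Message Relay Service
Author: Acqierement
RocketMQ Broker
The broker's main job is to store messages, so this article focuses on how it handles them. I will pose a few questions first and answer them by reading the code.
- How does the broker register with the NameServer at startup?
- How are messages sent by a Producer stored?
- How does a Consumer pull data from the broker?
- How is high availability achieved? If a broker goes down, the data must have a backup somewhere.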
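Registration
Registration happens at startup: the broker registers its information with every NameServer. The NameServer addresses can be configured at startup. The code is in org.apache.rocketmq.broker.out.BrokerOuterAPI#registerBrokerAll. I have omitted the unrelated parts here.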
public List<RegisterBrokerResult> registerBrokerAll(
    final String clusterName,
    final String brokerAddr,
    final String brokerName,
    final long brokerId,
    final String haServerAddr,
    final TopicConfigSerializeWrapper topicConfigWrapper,
    final List<String> filterServerList,
    final boolean oneway,
    final int timeoutMills,
    final boolean enableActingMaster,
    final boolean compressed,
    final Long heartbeatTimeoutMillis,
    final BrokerIdentity brokerIdentity) {

    final List<RegisterBrokerResult> registerBrokerResultList = new CopyOnWriteArrayList<>();
    List<String> nameServerAddressList = this.remotingClient.getAvailableNameSrvList();
    if (nameServerAddressList != null && nameServerAddressList.size() > 0) {
        // (construction of requestHeader and body omitted)
        final CountDownLatch countDownLatch = new CountDownLatch(nameServerAddressList.size());
        for (final String namesrvAddr : nameServerAddressList) {
            brokerOuterExecutor.execute(new AbstractBrokerRunnable(brokerIdentity) {
                @Override
                public void run2() {
                    try {
                        RegisterBrokerResult result = registerBroker(namesrvAddr, oneway, timeoutMills, requestHeader, body);
                        if (result != null) {
                            registerBrokerResultList.add(result);
                        }
                        LOGGER.info("Registering current broker to name server completed. TargetHost={}", namesrvAddr);
                    } catch (Exception e) {
                        LOGGER.error("Failed to register current broker to name server. TargetHost={}", namesrvAddr, e);
                    } finally {
                        countDownLatch.countDown();
                    }
                }
            });
        }

        try {
            if (!countDownLatch.await(timeoutMills, TimeUnit.MILLISECONDS)) {
                LOGGER.warn("Registration to one or more name servers does NOT complete within deadline. Timeout threshold: {}ms", timeoutMills);
            }
        } catch (InterruptedException ignore) {
        }
    }
    return registerBrokerResultList;
}
A CountDownLatch is used to check whether registration with all NameServers finishes within the timeout; if it does not, a warning is logged.
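The fan-out-and-wait pattern above can be isolated in a small sketch. This is not the broker's code: `registerOne` is a hypothetical stand-in for the real RPC, and the thread pool size is an arbitrary choice for illustration.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class RegisterAllSketch {
    // Hypothetical stand-in for the real registration RPC.
    static String registerOne(String namesrvAddr) {
        return "registered@" + namesrvAddr;
    }

    static List<String> registerAll(List<String> addrs, long timeoutMillis) {
        List<String> results = new CopyOnWriteArrayList<>();
        CountDownLatch latch = new CountDownLatch(addrs.size());
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            for (String addr : addrs) {
                pool.execute(() -> {
                    try {
                        results.add(registerOne(addr));
                    } finally {
                        latch.countDown(); // count down even if registration fails
                    }
                });
            }
            // Wait for all registrations, but never longer than the timeout.
            if (!latch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                System.err.println("registration did not finish within " + timeoutMillis + " ms");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return results;
    }
}
```

Because the latch is counted down in `finally`, a failing NameServer cannot keep the caller waiting past the timeout.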
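Message storage
For details, see the design documentation on the official site. I quote part of it here.
The message storage architecture consists mainly of the following three kinds of files.
(1) CommitLog: the main store for message bodies and metadata. It holds the message content written by Producers, and messages are variable-length. Each file is 1 GB by default. The file name is 20 digits long, left-padded with zeros, with the remaining digits giving the starting offset. For example, 00000000000000000000 is the first file, with starting offset 0 and size 1 GB = 1073741824 bytes; when it fills up, the second file is 00000000001073741824, with starting offset 1073741824, and so on. Messages are appended sequentially to the log file, and when a file is full, writing moves on to the next file.
(2) ConsumeQueue: the message consumption index, introduced mainly to improve consumption performance. As the index for consuming messages, it stores, for each queue under a given Topic, the starting physical offset of the message in the CommitLog, the message size, and the hash code of the message Tag. A consumequeue file can be seen as a topic-based index into the commitlog, so the consumequeue folder is organized in three levels: topic/queue/file.
(3) IndexFile: the IndexFile (index file) provides a way to query messages by key or by time range. Index files are stored at $HOME/store/index/{fileName}, where fileName is the timestamp at creation. A single IndexFile has a fixed size of about 400 MB and can hold 20 million index entries. Its underlying design implements a HashMap structure in the file system, so RocketMQ's index file is a hash index.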
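The numbers quoted above can be checked with a little arithmetic. The sketch below assumes `String.format` for the zero padding (my choice, not necessarily what RocketMQ uses) and, for the IndexFile, a layout of a 40-byte header, 5,000,000 four-byte hash slots, and 20-byte entries. The header size and slot count are assumptions about the default configuration; the 20 million entries come from the text.

```java
final class StoreLayoutSketch {
    static final long COMMIT_LOG_FILE_SIZE = 1024L * 1024 * 1024; // 1 GB default, as stated above

    // File name = starting offset, zero-padded to 20 digits.
    static String commitLogFileName(long startOffset) {
        return String.format("%020d", startOffset);
    }

    // Starting offset of the file that holds the given physical offset.
    static long fileStartOffset(long physicalOffset) {
        return physicalOffset - (physicalOffset % COMMIT_LOG_FILE_SIZE);
    }

    // Assumed IndexFile layout: 40-byte header, 4-byte slots, 20-byte entries.
    static long indexFileBytes(int slotCount, int entryCount) {
        return 40L + 4L * slotCount + 20L * entryCount;
    }

    public static void main(String[] args) {
        System.out.println(commitLogFileName(COMMIT_LOG_FILE_SIZE)); // 00000000001073741824
        System.out.println(indexFileBytes(5_000_000, 20_000_000));   // 420000040
    }
}
```

Under these assumptions an IndexFile comes to 420,000,040 bytes, which matches the "about 400 MB" figure.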
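Requests are handled through Netty.
NettyRemotingAbstract#processRequestCommand looks up the concrete processor by request code.
Among them:
- SendMessageProcessor handles requests from Producers to send messages;
- PullMessageProcessor handles requests from Consumers to consume messages;
- QueryMessageProcessor handles requests to query messages by message key and similar criteria.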
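The lookup by request code is essentially a table from code to handler. A minimal sketch follows; the request codes, the `String` request body, and the handler bodies are all made up for illustration and are not RocketMQ's actual values or types.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class ProcessorTableSketch {
    // Hypothetical request codes, not the real RocketMQ constants.
    static final int SEND = 1;
    static final int PULL = 2;
    static final int QUERY = 3;

    private final Map<Integer, Function<String, String>> table = new HashMap<>();
    private final Function<String, String> defaultProcessor = body -> "unsupported";

    ProcessorTableSketch() {
        table.put(SEND, body -> "stored:" + body);
        table.put(PULL, body -> "pulled:" + body);
        table.put(QUERY, body -> "queried:" + body);
    }

    // Pick the processor registered for the code, or fall back to a default.
    String process(int code, String body) {
        return table.getOrDefault(code, defaultProcessor).apply(body);
    }
}
```

For example, `new ProcessorTableSketch().process(ProcessorTableSketch.SEND, "hello")` is routed to the send handler, while an unknown code falls through to the default.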
Writing data starts in DefaultMessageStore#asyncPutMessage; the core logic shown below lives in CommitLog#asyncPutMessage.
public CompletableFuture<PutMessageResult> asyncPutMessage(final MessageExtBrokerInner msg) {
    ......
    topicQueueLock.lock(topicQueueKey);
    try {
        boolean needAssignOffset = true;
        if (defaultMessageStore.getMessageStoreConfig().isDuplicationEnable()
            && defaultMessageStore.getMessageStoreConfig().getBrokerRole() != BrokerRole.SLAVE) {
            needAssignOffset = false;
        }
        if (needAssignOffset) {
            defaultMessageStore.assignOffset(msg, getMessageNum(msg));
        }

        PutMessageResult encodeResult = putMessageThreadLocal.getEncoder().encode(msg);
        if (encodeResult != null) {
            return CompletableFuture.completedFuture(encodeResult);
        }
        msg.setEncodedBuff(putMessageThreadLocal.getEncoder().getEncoderBuffer());
        PutMessageContext putMessageContext = new PutMessageContext(topicQueueKey);

        putMessageLock.lock(); //spin or ReentrantLock ,depending on store config
        try {
            long beginLockTimestamp = this.defaultMessageStore.getSystemClock().now();
            this.beginTimeInLock = beginLockTimestamp;

            // Here settings are stored timestamp, in order to ensure an orderly
            // global
            if (!defaultMessageStore.getMessageStoreConfig().isDuplicationEnable()) {
                msg.setStoreTimestamp(beginLockTimestamp);
            }

            if (null == mappedFile || mappedFile.isFull()) {
                // First obtain the mappedFile
                mappedFile = this.mappedFileQueue.getLastMappedFile(0); // Mark: NewFile may be cause noise
            }
            if (null == mappedFile) {
                log.error("create mapped file1 error, topic: " + msg.getTopic() + " clientAddr: " + msg.getBornHostString());
                beginTimeInLock = 0;
                return CompletableFuture.completedFuture(new PutMessageResult(PutMessageStatus.CREATE_MAPPED_FILE_FAILED, null));
            }

            // Append the message
            result = mappedFile.appendMessage(msg, this.appendMessageCallback, putMessageContext);
            switch (result.getStatus()) {
                case PUT_OK:
                    onCommitLogAppend(msg, result, mappedFile);
                    break;
                case END_OF_FILE:
                    onCommitLogAppend(msg, result, mappedFile);
                    unlockMappedFile = mappedFile;
                    // Create a new file, re-write the message
                    mappedFile = this.mappedFileQueue.getLastMappedFile(0);
                    if (null == mappedFile) {
                        // XXX: warn and notify me
                        log.error("create mapped file2 error, topic: " + msg.getTopic() + " clientAddr: " + msg.getBornHostString());
                        beginTimeInLock = 0;
                        return CompletableFuture.completedFuture(new PutMessageResult(PutMessageStatus.CREATE_MAPPED_FILE_FAILED, result));
                    }
                    result = mappedFile.appendMessage(msg, this.appendMessageCallback, putMessageContext);
                    if (AppendMessageStatus.PUT_OK.equals(result.getStatus())) {
                        onCommitLogAppend(msg, result, mappedFile);
                    }
                    break;
                case MESSAGE_SIZE_EXCEEDED:
                case PROPERTIES_SIZE_EXCEEDED:
                    beginTimeInLock = 0;
                    return CompletableFuture.completedFuture(new PutMessageResult(PutMessageStatus.MESSAGE_ILLEGAL, result));
                case UNKNOWN_ERROR:
                    beginTimeInLock = 0;
                    return CompletableFuture.completedFuture(new PutMessageResult(PutMessageStatus.UNKNOWN_ERROR, result));
                default:
                    beginTimeInLock = 0;
                    return CompletableFuture.completedFuture(new PutMessageResult(PutMessageStatus.UNKNOWN_ERROR, result));
            }

            elapsedTimeInLock = this.defaultMessageStore.getSystemClock().now() - beginLockTimestamp;
            beginTimeInLock = 0;
        } finally {
            putMessageLock.unlock();
        }
    } finally {
        topicQueueLock.unlock(topicQueueKey);
    }

    if (elapsedTimeInLock > 500) {
        log.warn("[NOTIFYME]putMessage in lock cost time(ms)={}, bodyLength={} AppendMessageResult={}", elapsedTimeInLock, msg.getBody().length, result);
    }

    if (null != unlockMappedFile && this.defaultMessageStore.getMessageStoreConfig().isWarmMapedFileEnable()) {
        this.defaultMessageStore.unlockMappedFile(unlockMappedFile);
    }

    PutMessageResult putMessageResult = new PutMessageResult(PutMessageStatus.PUT_OK, result);

    // Statistics
    storeStatsService.getSinglePutMessageTopicTimesTotal(msg.getTopic()).add(result.getMsgNum());
    storeStatsService.getSinglePutMessageTopicSizeTotal(topic).add(result.getWroteBytes());

    // Flush strategy and HA
    return handleDiskFlushAndHA(putMessageResult, msg, needAckNums, needHandleHA);
}
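First the mappedFile is obtained; think of it as a memory mapping of one CommitLog file. When a mappedFile has to be created, two files are requested at once (the one needed now and the one after it), so the next rollover does not have to wait for file creation.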
org.apache.rocketmq.store.AllocateMappedFileService#mmapOperation
private boolean mmapOperation() {
    boolean isSuccess = false;
    AllocateRequest req = null;
    try {
        req = this.requestQueue.take();
        AllocateRequest expectedRequest = this.requestTable.get(req.getFilePath());
        if (null == expectedRequest) {
            log.warn("this mmap request expired, maybe cause timeout " + req.getFilePath() + " " + req.getFileSize());
            return true;
        }
        if (expectedRequest != req) {
            log.warn("never expected here, maybe cause timeout " + req.getFilePath() + " " + req.getFileSize()
                + ", req:" + req + ", expectedRequest:" + expectedRequest);
            return true;
        }

        if (req.getMappedFile() == null) {
            long beginTime = System.currentTimeMillis();

            MappedFile mappedFile;
            if (messageStore.getMessageStoreConfig().isTransientStorePoolEnable()) {
                try {
                    mappedFile = ServiceLoader.load(MappedFile.class).iterator().next();
                    mappedFile.init(req.getFilePath(), req.getFileSize(), messageStore.getTransientStorePool());
                } catch (RuntimeException e) {
                    log.warn("Use default implementation.");
                    mappedFile = new DefaultMappedFile(req.getFilePath(), req.getFileSize(), messageStore.getTransientStorePool());
                }
            } else {
                mappedFile = new DefaultMappedFile(req.getFilePath(), req.getFileSize());
            }

            long elapsedTime = UtilAll.computeElapsedTimeMilliseconds(beginTime);
            if (elapsedTime > 10) {
                int queueSize = this.requestQueue.size();
                log.warn("create mappedFile spent time(ms) " + elapsedTime + " queue size " + queueSize
                    + " " + req.getFilePath() + " " + req.getFileSize());
            }

            // pre write mappedFile
            if (mappedFile.getFileSize() >= this.messageStore.getMessageStoreConfig().getMappedFileSizeCommitLog()
                && this.messageStore.getMessageStoreConfig().isWarmMapedFileEnable()) {
                mappedFile.warmMappedFile(this.messageStore.getMessageStoreConfig().getFlushDiskType(),
                    this.messageStore.getMessageStoreConfig().getFlushLeastPagesWhenWarmMapedFile());
            }

            req.setMappedFile(mappedFile);
            this.hasException = false;
            isSuccess = true;
        }
    } catch (InterruptedException e) {
        log.warn(this.getServiceName() + " interrupted, possibly by shutdown.");
        this.hasException = true;
        return false;
    } catch (IOException e) {
        log.warn(this.getServiceName() + " service has exception. ", e);
        this.hasException = true;
        if (null != req) {
            requestQueue.offer(req);
            try {
                Thread.sleep(1);
            } catch (InterruptedException ignored) {
            }
        }
    } finally {
        if (req != null && isSuccess)
            req.getCountDownLatch().countDown();
    }
    return true;
}
This is where the mappedFile gets initialized.
org.apache.rocketmq.store.logfile.DefaultMappedFile#init
private void init(final String fileName, final int fileSize) throws IOException {
    ......
    try {
        this.fileChannel = new RandomAccessFile(this.file, "rw").getChannel();
        this.mappedByteBuffer = this.fileChannel.map(MapMode.READ_WRITE, 0, fileSize);
        TOTAL_MAPPED_VIRTUAL_MEMORY.addAndGet(fileSize);
        TOTAL_MAPPED_FILES.incrementAndGet();
        ok = true;
    } catch (FileNotFoundException e) {
        log.error("Failed to create file " + this.fileName, e);
        throw e;
    } catch (IOException e) {
        log.error("Failed to map file " + this.fileName, e);
        throw e;
    } finally {
        if (!ok && this.fileChannel != null) {
            this.fileChannel.close();
        }
    }
}
This simply memory-maps the file using Java's FileChannel.map.
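As a standalone illustration of the same JDK calls, here is a minimal sketch that maps a temporary file, writes an int through the mapping, and reads it back. The temp file name and the 4096-byte size are arbitrary choices for the example.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

final class MmapSketch {
    static final int MAPPED_SIZE = 4096; // arbitrary size for the example

    static int writeThenRead(int value) {
        try {
            File file = File.createTempFile("mmap-sketch", ".bin");
            file.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
                 FileChannel channel = raf.getChannel()) {
                // Mapping in READ_WRITE mode grows the file to the mapped size if needed.
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, MAPPED_SIZE);
                buffer.putInt(0, value); // write goes to the page cache
                buffer.force();          // ask the OS to flush dirty pages to the storage device
                return buffer.getInt(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Writes through the mapping land in the page cache first; `force()` plays the role of the explicit flush to disk.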
If the off-heap buffer pool (TransientStorePool) is enabled, data is written through writeBuffer, while reads still go through mappedByteBuffer.
@Override
public void init(final String fileName, final int fileSize,
    final TransientStorePool transientStorePool) throws IOException {
    init(fileName, fileSize);
    this.writeBuffer = transientStorePool.borrowBuffer();
    this.transientStorePool = transientStorePool;
}
After the mappedFile is created, there is also a warm-up step.
public void warmMappedFile(FlushDiskType type, int pages) {
    this.mappedByteBufferAccessCountSinceLastSwap++;

    long beginTime = System.currentTimeMillis();
    ByteBuffer byteBuffer = this.mappedByteBuffer.slice();
    int flush = 0;
    long time = System.currentTimeMillis();
    // Write a zero byte into every page of the 1 GB file so the OS actually allocates
    // physical memory; pages that are never touched get no physical memory.
    // This avoids page faults later when messages are written.
    for (int i = 0, j = 0; i < this.fileSize; i += DefaultMappedFile.OS_PAGE_SIZE, j++) {
        byteBuffer.put(i, (byte) 0);
        // force flush when flush disk type is sync
        if (type == FlushDiskType.SYNC_FLUSH) {
            if ((i / OS_PAGE_SIZE) - (flush / OS_PAGE_SIZE) >= pages) {
                flush = i;
                mappedByteBuffer.force();
            }
        }

        // Sleep briefly every so often so that other threads, including the GC threads,
        // get a chance to run in the middle of the loop instead of GC being held off
        // for a long time.
        // prevent gc
        if (j % 1000 == 0) {
            log.info("j={}, costTime={}", j, System.currentTimeMillis() - time);
            time = System.currentTimeMillis();
            try {
                Thread.sleep(0);
            } catch (InterruptedException e) {
                log.error("Interrupted", e);
            }
        }
    }

    // force flush when prepare load finished
    if (type == FlushDiskType.SYNC_FLUSH) {
        log.info("mapped file warm-up done, force to disk, mappedFile={}, costTime={}",
            this.getFileName(), System.currentTimeMillis() - beginTime);
        mappedByteBuffer.force();
    }
    log.info("mapped file warm-up done. mappedFile={}, costTime={}", this.getFileName(),
        System.currentTimeMillis() - beginTime);

    this.mlock();
}
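Mapping a file with mmap only sets up the mapping between the process's virtual address space and the file; it does not load the file's pages into memory. If a read or write touches a page that is not yet in the Page Cache, a page fault occurs and the data has to be loaded from disk, which hurts read and write performance. RocketMQ therefore warms up the file to bring the corresponding Page Cache pages into memory ahead of time, and then locks them so that the operating system does not move them out to swap space.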
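The loop also sleeps periodically so that GC gets a chance to run. Here is ChatGPT's answer, copied as is:
"The if (j % 1000 == 0) statement in this code is there to prevent frequent GC. On each iteration where j is a multiple of 1000, Thread.sleep(0) is executed, which pauses the current thread for a short moment so that the JVM has a chance to reclaim objects that are no longer in use. The goal is to reduce the frequency of GC and thereby improve performance."
That answer is not quite accurate: Thread.sleep(0) neither frees objects nor lowers GC frequency. A more likely reason, in line with the comment in the code above, is that a long counted loop compiled by the JIT may contain no safepoint check, so a GC that needs every thread to reach a safepoint would have to wait for the whole loop to finish; calling Thread.sleep(0) periodically gives the thread a point at which it can pause for GC.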
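Stripped of flushing and logging, the warm-up loop comes down to touching one byte per page. The sketch below works on any ByteBuffer and assumes a 4096-byte page size; it only illustrates the access pattern and does not reproduce the mlock or flush behavior.

```java
import java.nio.ByteBuffer;

final class WarmupSketch {
    static final int PAGE_SIZE = 4096; // assumed OS page size

    // Touch the first byte of every page so each page is faulted in once, up front.
    static int touchPages(ByteBuffer buffer) {
        int touched = 0;
        for (int i = 0; i < buffer.capacity(); i += PAGE_SIZE) {
            buffer.put(i, (byte) 0);
            touched++;
        }
        return touched;
    }
}
```

For a 1 GB file with 4 KB pages this is 262,144 writes, which is why the original loop logs progress and yields periodically.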
Finally, the memory is locked.
public void mlock() {
    final long beginTime = System.currentTimeMillis();
    final long address = ((DirectBuffer) (this.mappedByteBuffer)).address();
    Pointer pointer = new Pointer(address);
    {
        // Lock the file's Page Cache with the mlock system call so it cannot be swapped out
        int ret = LibC.INSTANCE.mlock(pointer, new NativeLong(this.fileSize));
        log.info("mlock {} {} {} ret = {} time consuming = {}", address, this.fileName, this.fileSize, ret, System.currentTimeMillis() - beginTime);
    }

    {
        // Advise the OS through the madvise system call that this file will be accessed soon
        int ret = LibC.INSTANCE.madvise(pointer, new NativeLong(this.fileSize), LibC.MADV_WILLNEED);
        log.info("madvise {} {} {} ret = {} time consuming = {}", address, this.fileName, this.fileSize, ret, System.currentTimeMillis() - beginTime);
    }
}
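Then the message is appended to the mappedFile, which amounts to writing the encoded data into the buffer.
Next come the flush strategy and high availability.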
org.apache.rocketmq.store.CommitLog#handleDiskFlushAndHA
private CompletableFuture<PutMessageResult> handleDiskFlushAndHA(PutMessageResult putMessageResult,
    MessageExt messageExt, int needAckNums, boolean needHandleHA) {
    // Handle the flush
    CompletableFuture<PutMessageStatus> flushResultFuture = handleDiskFlush(putMessageResult.getAppendMessageResult(), messageExt);
    CompletableFuture<PutMessageStatus> replicaResultFuture;
    if (!needHandleHA) {
        replicaResultFuture = CompletableFuture.completedFuture(PutMessageStatus.PUT_OK);
    } else {
        // Handle HA
        replicaResultFuture = handleHA(putMessageResult.getAppendMessageResult(), putMessageResult, needAckNums);
    }

    return flushResultFuture.thenCombine(replicaResultFuture, (flushStatus, replicaStatus) -> {
        if (flushStatus != PutMessageStatus.PUT_OK) {
            putMessageResult.setPutMessageStatus(flushStatus);
        }
        if (replicaStatus != PutMessageStatus.PUT_OK) {
            putMessageResult.setPutMessageStatus(replicaStatus);
        }
        return putMessageResult;
    });
}
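The way the two futures are merged can be tried out in isolation. In this sketch the `Status` enum is a local stand-in for PutMessageStatus (the constant names are chosen for the example); the combining logic mirrors the method above, where a replica failure overrides a flush failure.

```java
import java.util.concurrent.CompletableFuture;

final class FlushAndReplicaSketch {
    // Local stand-in for PutMessageStatus.
    enum Status { PUT_OK, FLUSH_DISK_TIMEOUT, FLUSH_SLAVE_TIMEOUT }

    static CompletableFuture<Status> combine(CompletableFuture<Status> flush,
                                             CompletableFuture<Status> replica) {
        // thenCombine completes only after both inputs have completed.
        return flush.thenCombine(replica, (flushStatus, replicaStatus) -> {
            Status result = Status.PUT_OK;
            if (flushStatus != Status.PUT_OK) {
                result = flushStatus;
            }
            if (replicaStatus != Status.PUT_OK) {
                result = replicaStatus; // checked last, so it wins when both fail
            }
            return result;
        });
    }
}
```

The caller therefore sees PUT_OK only when both the local flush and the replication succeeded.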
Handling the flush
org.apache.rocketmq.store.CommitLog.DefaultFlushManager#handleDiskFlush
@Override
public CompletableFuture<PutMessageStatus> handleDiskFlush(AppendMessageResult result, MessageExt messageExt) {
    // Synchronization flush
    if (FlushDiskType.SYNC_FLUSH == CommitLog.this.defaultMessageStore.getMessageStoreConfig().getFlushDiskType()) {
        final GroupCommitService service = (GroupCommitService) this.flushCommitLogService;
        if (messageExt.isWaitStoreMsgOK()) {
            GroupCommitRequest request = new GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes(),
                CommitLog.this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout());
            flushDiskWatcher.add(request);
            service.putRequest(request);
            return request.future();
        } else {
            service.wakeup();
            return CompletableFuture.completedFuture(PutMessageStatus.PUT_OK);
        }
    }
    // Asynchronous flush
    else {
        if (!CommitLog.this.defaultMessageStore.getMessageStoreConfig().isTransientStorePoolEnable()) {
            flushCommitLogService.wakeup();
        } else {
            commitLogService.wakeup();
        }
        return CompletableFuture.completedFuture(PutMessageStatus.PUT_OK);
    }
}
The concrete flush strategy depends on whether synchronous or asynchronous flushing is configured.
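In the synchronous branch the caller receives a future that completes only once the flushed position has passed the end of its message. A minimal sketch of that idea follows; the class and method names are hypothetical, and timeouts and the flusher thread itself are left out.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class GroupCommitSketch {
    private static final class Request {
        final long nextOffset; // offset just past the end of the written message
        final CompletableFuture<Boolean> future = new CompletableFuture<>();

        Request(long nextOffset) {
            this.nextOffset = nextOffset;
        }
    }

    private final List<Request> pending = new ArrayList<>();

    // Called by the writer: register interest in "everything up to nextOffset is on disk".
    synchronized CompletableFuture<Boolean> submit(long nextOffset) {
        Request request = new Request(nextOffset);
        pending.add(request);
        return request.future;
    }

    // Called by a (hypothetical) flusher after a flush: one flush can satisfy many requests.
    synchronized void onFlushed(long flushedOffset) {
        Iterator<Request> it = pending.iterator();
        while (it.hasNext()) {
            Request request = it.next();
            if (request.nextOffset <= flushedOffset) {
                request.future.complete(true);
                it.remove();
            }
        }
    }
}
```

In the asynchronous branch no such request is created: the method wakes up the background service and returns an already completed future.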
Handling high availability
org.apache.rocketmq.store.CommitLog#handleHA
private CompletableFuture<PutMessageStatus> handleHA(AppendMessageResult result,
    PutMessageResult putMessageResult, int needAckNums) {
    if (needAckNums >= 0 && needAckNums <= 1) {
        return CompletableFuture.completedFuture(PutMessageStatus.PUT_OK);
    }

    HAService haService = this.defaultMessageStore.getHaService();

    long nextOffset = result.getWroteOffset() + result.getWroteBytes();

    // Wait enough acks from different slaves
    GroupCommitRequest request = new GroupCommitRequest(nextOffset,
        this.defaultMessageStore.getMessageStoreConfig().getSlaveTimeout(), needAckNums);
    haService.putRequest(request);
    haService.getWaitNotifyObject().wakeupAll();
    return request.future();
}
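Behind the scenes, a background thread is always running to handle replication. Comparing the CommitLog offsets of the master and a slave is enough to tell how much data the slave is missing, so the master only has to send over the data the slave does not have yet. I will cover this in detail in a separate article.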
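The offset comparison can be written down as two small functions. This is a sketch under assumptions: the names are hypothetical, and counting the master itself as one ack is my reading of the `needAckNums <= 1` shortcut in handleHA above, not something the article states.

```java
final class HaAckSketch {
    // Bytes the slave still has to receive to catch up with the master.
    static long lag(long masterMaxOffset, long slaveAckOffset) {
        return Math.max(0L, masterMaxOffset - slaveAckOffset);
    }

    // A write ending at nextOffset is replicated enough once the master plus the
    // slaves that have acknowledged at least nextOffset reach needAckNums.
    static boolean enoughAcks(long nextOffset, long[] slaveAckOffsets, int needAckNums) {
        int acks = 1; // assumption: the master counts as one ack
        for (long ackOffset : slaveAckOffsets) {
            if (ackOffset >= nextOffset) {
                acks++;
            }
        }
        return acks >= needAckNums;
    }
}
```

With a master at offset 1000 and a slave that has acknowledged 400, the slave is 600 bytes behind and does not yet count toward a write ending at offset 500.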
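One more question: how are the ConsumeQueue and the IndexFile maintained?
ReputMessageService reads data from the CommitLog and writes it to the ConsumeQueue and the IndexFile.
Each dispatcher handles one of the two files. I won't go into detail here.
The ConsumeQueue is handled here: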
org.apache.rocketmq.store.DefaultMessageStore.CommitLogDispatcherBuildConsumeQueue#dispatch
The file path is essentially topic/queueId. The data written is:
this.byteBufferIndex.flip();
this.byteBufferIndex.limit(CQ_STORE_UNIT_SIZE);
this.byteBufferIndex.putLong(offset);
this.byteBufferIndex.putInt(size);
this.byteBufferIndex.putLong(tagsCode);
It is essentially an offset into the CommitLog; with this value the actual message can be fetched.
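Since every entry has the same fixed size (8 + 4 + 8 = 20 bytes, going by the three fields written above), the n-th entry of a queue sits at byte n * 20. A small sketch of encoding one entry and locating it; the class and method names are illustrative only.

```java
import java.nio.ByteBuffer;

final class ConsumeQueueUnitSketch {
    static final int CQ_STORE_UNIT_SIZE = 20; // 8-byte offset + 4-byte size + 8-byte tags code

    static ByteBuffer encode(long commitLogOffset, int size, long tagsCode) {
        ByteBuffer unit = ByteBuffer.allocate(CQ_STORE_UNIT_SIZE);
        unit.putLong(commitLogOffset);
        unit.putInt(size);
        unit.putLong(tagsCode);
        unit.flip(); // switch from writing to reading
        return unit;
    }

    // Byte position of the entry with the given logical queue offset.
    static long bytePosition(long queueOffset) {
        return queueOffset * CQ_STORE_UNIT_SIZE;
    }
}
```

Fixed-size entries are what make the ConsumeQueue cheap to seek: a logical offset turns into a file position with one multiplication.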
org.apache.rocketmq.store.DefaultMessageStore.CommitLogDispatcherBuildIndex
The IndexFile is written with the following data:
this.mappedByteBuffer.putInt(absIndexPos, keyHash);
this.mappedByteBuffer.putLong(absIndexPos + 4, phyOffset);
this.mappedByteBuffer.putInt(absIndexPos + 4 + 8, (int) timeDiff);
this.mappedByteBuffer.putInt(absIndexPos + 4 + 8 + 4, slotValue);

this.mappedByteBuffer.putInt(absSlotPos, this.indexHeader.getIndexCount());
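This includes the hash of the key, the physical offset, the time difference, and the slot's previous value, which links entries that fall into the same slot. Each IndexFile is named after its creation timestamp (in milliseconds), so the files are naturally ordered by time. For lookups by key, the slot an entry is written to is determined by the key's hash, so the position can be located immediately.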
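The positions used in the writes above can be derived with simple arithmetic. The sketch assumes a 40-byte file header and 4-byte slots (not stated in the article); the 20-byte entry size follows from the four fields written (4 + 8 + 4 + 4). Method names are illustrative.

```java
final class IndexPositionSketch {
    static final int HEADER_SIZE = 40; // assumed header size
    static final int SLOT_SIZE = 4;    // assumed: each slot holds one int
    static final int ENTRY_SIZE = 20;  // keyHash(4) + phyOffset(8) + timeDiff(4) + slotValue(4)

    // Which hash slot a key falls into.
    static int slotPos(String key, int slotCount) {
        int hash = key.hashCode() & Integer.MAX_VALUE; // clear the sign bit
        return hash % slotCount;
    }

    // Byte position of a slot: slots follow the header.
    static long absSlotPos(int slotPos) {
        return HEADER_SIZE + (long) slotPos * SLOT_SIZE;
    }

    // Byte position of the entry with the given sequence number: entries follow all slots.
    static long absIndexPos(int slotCount, int indexCount) {
        return HEADER_SIZE + (long) slotCount * SLOT_SIZE + (long) indexCount * ENTRY_SIZE;
    }
}
```

Entries are appended in arrival order, while the slot only stores the sequence number of the newest entry for that hash bucket.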
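That about covers data storage. Now let's look at how messages are read.
Reading messages
Consumer pulling messages
Pulling messages has its own processor: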
org.apache.rocketmq.broker.processor.PullMessageProcessor#processRequest
It contains a lot of extra logic; the core is in the following method:
org.apache.rocketmq.store.DefaultMessageStore#getMessage
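Reading is straightforward. Using the topic and queueId, the broker reads from the ConsumeQueue. The consumer knows where its last pull ended, so the broker reads directly at that ConsumeQueue offset. Each ConsumeQueue entry stores the CommitLog offset and the size, and with these two values the message is fetched from the CommitLog and returned. The next offset is then updated and returned to the consumer.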
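The two-step lookup can be modelled in memory. In this sketch a ByteBuffer plays the CommitLog and a list of {offset, size} pairs plays one ConsumeQueue; all names are hypothetical, and tags, filtering, and long polling are left out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class PullSketch {
    private final ByteBuffer commitLog = ByteBuffer.allocate(1024);
    private final List<long[]> consumeQueue = new ArrayList<>(); // each entry: {commitLogOffset, size}

    // Append to the "CommitLog" and record where the message landed.
    void put(String body) {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        long offset = commitLog.position();
        commitLog.put(bytes);
        consumeQueue.add(new long[] {offset, bytes.length});
    }

    // Read up to maxMsgNums messages starting at queueOffset.
    // The consumer's next offset is queueOffset + result.size().
    List<String> pull(int queueOffset, int maxMsgNums) {
        List<String> messages = new ArrayList<>();
        for (int i = queueOffset; i < consumeQueue.size() && messages.size() < maxMsgNums; i++) {
            long[] entry = consumeQueue.get(i);
            byte[] bytes = new byte[(int) entry[1]];
            ByteBuffer view = commitLog.duplicate(); // independent position, shared content
            view.position((int) entry[0]);
            view.get(bytes);
            messages.add(new String(bytes, StandardCharsets.UTF_8));
        }
        return messages;
    }
}
```

After putting "a", "bb", and "ccc", a call to `pull(1, 2)` returns the second and third messages, and the consumer would continue from offset 3.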
Querying by key
org.apache.rocketmq.store.DefaultMessageStore#queryMessage
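The query mainly goes through the IndexFile. As mentioned earlier, IndexFiles are created in time order, so the files matching the time range are selected first. The key's hash then locates the slot in the file. Because hashes can collide, the entries are traversed one by one to find all those whose hash is equal. Finally, the offsets recorded in the IndexFile are used to fetch the messages from the CommitLog.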
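The slot-plus-chain structure can be sketched with plain arrays. This is an in-memory illustration, not the file-backed implementation: names are hypothetical, time filtering is omitted, and index 0 is reserved to mean "end of chain".

```java
import java.util.ArrayList;
import java.util.List;

final class HashChainIndexSketch {
    private final int[] slots;       // slot -> index of the newest entry in that slot, 0 = empty
    private final int[] keyHashes;
    private final long[] phyOffsets;
    private final int[] prevIndex;   // previous entry in the same slot (the "slotValue")
    private int count = 0;

    HashChainIndexSketch(int slotCount, int capacity) {
        slots = new int[slotCount];
        keyHashes = new int[capacity + 1];
        phyOffsets = new long[capacity + 1];
        prevIndex = new int[capacity + 1];
    }

    void put(String key, long phyOffset) {
        if (count + 1 >= keyHashes.length) {
            throw new IllegalStateException("index is full");
        }
        int hash = key.hashCode() & Integer.MAX_VALUE;
        int slot = hash % slots.length;
        count++;
        keyHashes[count] = hash;
        phyOffsets[count] = phyOffset;
        prevIndex[count] = slots[slot]; // link to the entry that was newest before
        slots[slot] = count;            // the slot now points at this entry
    }

    // Walk the chain of the key's slot and collect offsets whose hash matches, newest first.
    List<Long> query(String key) {
        int hash = key.hashCode() & Integer.MAX_VALUE;
        List<Long> offsets = new ArrayList<>();
        for (int i = slots[hash % slots.length]; i != 0; i = prevIndex[i]) {
            if (keyHashes[i] == hash) {
                offsets.add(phyOffsets[i]);
            }
        }
        return offsets;
    }
}
```

Two different keys can share a hash, so the offsets found this way are candidates; the actual key still has to be compared after the message is read from the CommitLog.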