Golden Observation | Review of the Arbitrum Sequencer Bug Incident

Author: Golden Finance Jason


Golden Finance, June 10 — A bug in Arbitrum's sequencer-related code this week briefly interrupted the network's ability to submit transactions in batches, leaving transactions unconfirmed on the main chain. The bug has since been fixed and batch submission has been restored. On June 10, the Arbitrum Foundation released its post-mortem report on the sequencer bug. Let's walk through the timeline and see why this incident did not put user funds at risk.

Arbitrum Sequencer Bug Incident Timeline

  1. At 06:04:53 on June 7, 2023, the batch poster failed to update its view of L1 state due to a temporary issue with the Arbitrum sequencer's L1 node. Because of the root-cause bug, the batch poster kept querying state at the block number of its previous L1 view. This means that even after the temporary L1 node issue resolved itself, the batch poster kept querying the state of that old L1 block number, which the L1 node no longer had because it is not an archive node.

  2. At 09:38:28 on June 7, 2023, Arbitrum's batch poster stopped publishing transactions because it reached the configured maximum queued-transaction limit (256), which matches the mempool limit. Had this limit not been reached, batch posting would have continued as usual.

  3. At 11:09 AM on June 7, 2023, the absence of published batches triggered an alert that checks the Sequencer Inbox smart contract for new batches, and a warning was sent to the Slack channel.

  4. At 11:10 AM, the lack of recent batch posts triggered a log-based alert, and a critical-level alert was sent to the Slack channel.

  5. At 11:13 AM, a member of the community team paged a member of the SRE team via PagerDuty, who promptly acknowledged the incident and began responding.

  6. At 11:19:02 AM, the SRE team restarted the batch poster, but the previously mentioned maximum queued-transaction limit still prevented it from publishing transactions. The SRE team noticed this and began switching to a third-party L1 RPC provider in an attempt to mitigate the issue.

  7. At 11:24:16 AM, five minutes after the batch poster started, it updated its L1 state view and published its first batch of transactions.

  8. At 11:25:09 AM, the batch poster's configuration was changed to use a third-party L1 RPC provider and it was restarted, because the SRE team had already begun making this change and did not notice that batching had resumed. After the restart, batch posting continued.

  9. At 11:30:21 AM, five minutes after the batch poster started, it updated its L1 state view, and this update left the L1 state out of sync, which was the root cause of the problem. The L1 state was updated to the finalized block number 17428199, but it used the latest nonce, 178078, which corresponded to the latest block at the time rather than to the finalized block stored in its state. As a result, all queued transactions in Redis were wiped out, because they were treated as already confirmed (see the sketch after the timeline).

  10. At 11:30:26 AM, the batch poster posted a new batch. Publishing relies on the L1 state view to determine what to post, but at this point the Redis queue was empty and, as explained above, the L1 state was incorrect: the batch was published with the nonce 178078 from that state, while the batch to publish was determined from the unrelated block number 17428199. The result was that an old batch with sequence number 229209 was published again; it had in fact already been published at 11:24:16, before the batch poster was restarted. Because batch 229209 had already been published, the L1 transaction that submitted it was reverted.

  11. At 11:36:35 AM, the batch poster address ran out of ether and stopped publishing, because the gas fees for the reverted batches were not refunded. This is an intentional mechanism that prevents a faulty batch poster from draining all of the funds set aside to reimburse batch posting costs.

  12. At 11:46 AM, a member of the Nitro team was called in to resolve the software issue and help recover batch posting.

  13. Around 11:58 AM, Arbitrum began receiving reports from users of problems with the sequencer feed (which broadcasts newly sequenced transactions to RPC nodes), because more and more sequenced transactions were accumulating in the feed instead of being posted to the chain due to the batch poster problem. The issue primarily affected feed clients with poor internet connections or insufficient memory allocation, since they are more likely to drop the connection and run into reconnection issues. Arbitrum recommends that users running multiple RPC nodes run a local feed relay to reduce the external bandwidth required.

  14. At 12:03 PM, Arbitrum removed Cloudflare's feed rate limiting to alleviate the problem of clients hitting the rate limit while trying to reconnect after being disconnected by poor internet connections.

  15. At 12:05 PM, Arbitrum removed all Cloudflare rate limiting to allow increased public RPC usage by users whose nodes were having trouble maintaining connections to the feed.

  16. At 12:12:09 PM, the faulty batch poster was shut down and the Redis queue storage was cleared to remove the bad state.

  17. At 12:12:40 PM, the batch poster was started on the old version, v2.0.14, which is not affected by the root-cause bug.

  18. At 12:21:56 PM, the newly started batch poster successfully posted its first batch, and it has been running continuously since then.
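To make the root cause easier to follow, here is a minimal, hypothetical sketch (in Go, since Nitro is written in Go; none of these names come from the actual Nitro code) of how pairing the account nonce of the latest block with an unrelated finalized block number can make every queued transaction look already confirmed and get wiped from the queue, as described in steps 9 and 10 above.

```go
package main

import "fmt"

// Illustrative types only; the real Nitro batch poster state is more complex.
type l1View struct {
	finalizedBlock uint64 // L1 block number the poster believes is finalized
	accountNonce   uint64 // nonce of the batch poster account in that view
}

type queuedTx struct {
	nonce uint64 // nonce the queued batch transaction was signed with
}

// pruneConfirmed drops every queued transaction whose nonce is below the
// nonce recorded in the L1 view, assuming those transactions have already
// been confirmed on L1.
func pruneConfirmed(queue []queuedTx, view l1View) []queuedTx {
	var remaining []queuedTx
	for _, tx := range queue {
		if tx.nonce >= view.accountNonce {
			remaining = append(remaining, tx)
		}
	}
	return remaining
}

func main() {
	// Batches waiting in the queue, signed with nonces 178070..178077.
	var queue []queuedTx
	for n := uint64(178070); n < 178078; n++ {
		queue = append(queue, queuedTx{nonce: n})
	}

	// The bug scenario: the view pairs the *latest* account nonce (178078)
	// with a *finalized* block number (17428199) that does not match it.
	inconsistent := l1View{finalizedBlock: 17428199, accountNonce: 178078}

	queue = pruneConfirmed(queue, inconsistent)
	// Every queued transaction now looks "confirmed" and is wiped out,
	// even though none of them was ever included on L1.
	fmt.Println("transactions left in queue:", len(queue)) // prints 0
}
```

With a consistent view (nonce and block number taken from the same finalized block), only genuinely confirmed transactions would be pruned; the mismatch is what emptied the queue and led to the old batch being reposted.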

Arbitrum Sequencer Bug Incident Lessons Learned

This failure was caused by a bug in the batch poster. The sequencer itself was not affected or interrupted and continued to process transactions throughout. Reports that the sequencer ran out of funds are incorrect. Arbitrum's funding mechanism consists of two wallets: the "sequencer" wallet and the "gas-refunder" wallet. The sequencer is reimbursed only when it successfully posts a batch, and during this failure the network did not refund any funds to the sequencer. The issue was not caused by the sequencer running out of funds, and no user funds were at risk.
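As a rough illustration of why this design protects funds, here is a minimal sketch, again with hypothetical names rather than the actual contract or Nitro logic, of a refund policy that reimburses gas only for batches that were successfully posted: a misbehaving batch poster whose transactions revert cannot drain the refunder and simply stops once its own wallet is empty, which is what happened in this incident.

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative only: a gas refunder that reimburses the batch poster
// solely for batches that actually landed on L1.
type gasRefunder struct {
	balance uint64 // funds reserved for reimbursing batch posting costs
}

var errInsufficientFunds = errors.New("refunder balance too low")

// refund pays back the gas spent on a batch, but only if the posting
// transaction succeeded. Reverted batches are never reimbursed, which
// caps how much a faulty batch poster can burn.
func (r *gasRefunder) refund(gasSpent uint64, batchSucceeded bool) error {
	if !batchSucceeded {
		return nil // intentional: no refund for reverted batches
	}
	if gasSpent > r.balance {
		return errInsufficientFunds
	}
	r.balance -= gasSpent
	return nil
}

func main() {
	refunder := &gasRefunder{balance: 1_000_000}

	// A reverted batch, as in this incident, is not reimbursed: the batch
	// poster wallet eventually runs dry and simply stops posting.
	_ = refunder.refund(50_000, false)
	fmt.Println("balance after reverted batch:", refunder.balance)

	// A successfully posted batch is reimbursed as usual.
	_ = refunder.refund(50_000, true)
	fmt.Println("balance after successful batch:", refunder.balance)
}
```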

Arbitrum will clean up the configuration options added as part of the temporary mitigation and later plans to re-evaluate sequencer client and server timeouts to improve network reliability when there is a transaction backlog. The fix has been released in the new "v2.1.0-beta.2" beta version. In addition, Arbitrum will create a public network status page to reduce confusion when the service has problems.

This article is based on the report published on the Arbitrum Foundation's official website.
