Skip to content

[Merged by Bors] - Handle Geth pre-EIP-155 block sync error condition #2304

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 12 commits into from
Closed

[Merged by Bors] - Handle Geth pre-EIP-155 block sync error condition #2304

wants to merge 12 commits into from

Conversation

jmcph4
Copy link
Member

@jmcph4 jmcph4 commented Apr 7, 2021

Issue Addressed

#2293

Proposed Changes

  • Modify the handler for the eth_chainId RPC (i.e., get_chain_id) to explicitly match against the Geth error string returned for pre-EIP-155 synced Geth nodes
  • Add a new helper function, rpc_error_msg, to aid in the above point
  • Refactor response_result into response_result_or_error and patch reliant RPC handlers accordingly (thanks to @pawanjay176)

Additional Info

Geth, as of Pangaea Expanse (v1.10.0), returns an explicit error when it is not synced past the EIP-155 block (2675000). Previously, Geth simply returned a chain ID of 0 (which was obviously much easier to handle on Lighthouse's part).

Geth, as of Pangaea Expanse (v1.10.0), returns an explicit error when
it is not synced past the EIP-155 block (2675000). Previously, Geth
simply returned a chain ID of 0 (which was obviously much easier to
handle on Lighthouse's part).

This commit explicitly handles this error condition by modifying the
`get_chain_id` function in `beacon_node/eth1/src/http.rs` to `match`
against the error string returned by Geth.
@CLAassistant
Copy link

CLAassistant commented Apr 7, 2021

CLA assistant check
All committers have signed the CLA.

@jmcph4
Copy link
Member Author

jmcph4 commented Apr 7, 2021

Marking this as draft currently for a few reasons:

  • Confirming CI is happy
  • Likely prettier Rust that could be used
  • First time contributing to Lighthouse

@jmcph4
Copy link
Member Author

jmcph4 commented Apr 8, 2021

I've been considering whether or not to tidy this up a bit but I feel it's overengineering at this point. Likely out of scope for this PR, but should we add types for handling JSON-RPC 2.0 errors?

jmcph4 added 2 commits April 8, 2021 20:40
This commit adds a helper function, `rpc_error_msg`, which extracts the
human-readable error message from a JSON-RPC error message. This is
done to aid the handling of EIP-155-related errors from Geth inside the
`get_chain_id` RPC handler.
This commit fixes Clippy complaints due to usage of `skip(1).next()`
instead of the much simpler `nth(1)`.
@jmcph4 jmcph4 marked this pull request as ready for review April 8, 2021 19:30
Copy link
Member

@pawanjay176 pawanjay176 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great work! I had a suggestion to reduce manual string operations on the json string. Let me know what you think :)

This commit refactors how RPC error responses are handled by making the
following changes:

 - The helper `response_result` is now `response_result_or_error` and
   handles RPC error responses directly (thanks to @pawanjay176)
 - Each RPC handler relying on `response_result` has now been patched
   to support the new API (mostly by inserting an `ok()` call
   immediately after each `response_result_or_error` call in order to
   preserve the `Option`-based expectations)
 - The previously-added helper, `rpc_error_msg` has been removed as its
   now redundant
@jmcph4 jmcph4 requested a review from pawanjay176 April 11, 2021 09:54
Copy link
Member

@pawanjay176 pawanjay176 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Just a minor nitpick :)

@pawanjay176 pawanjay176 requested a review from paulhauner April 11, 2021 12:37
jmcph4 added 2 commits April 13, 2021 06:03
This commit simplifies how errors are handled in the `get_chain_id`
handler by using `?` on the initial call to `response_result_or_error`.

Once again, thanks to @pawanjay176!
This commit fixes a missed formatting issue.
Copy link
Member

@paulhauner paulhauner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for getting involved @jmcph4. Glad to have your contributions 🚀

I've added a couple of nitpicks and a suggestion about the response_result_or_error. I know it conflicts with @pawanjay176, so both of you please let me know if I'm missing the mark :)

@paulhauner paulhauner added the waiting-on-author The reviewer has suggested changes and awaits thier implementation. label Apr 14, 2021
This commit fixes the styling on comments left in `http.rs` in order to
better comply with the style guide. Now, C-style comments have been
converted to their more C++-like equivalents and some minor punctuation
fixes have been applied.

Thanks to @paulhauner for pointing these issues out - I certainly
missed them!
@jmcph4 jmcph4 requested a review from paulhauner April 14, 2021 22:27
@paulhauner
Copy link
Member

I've added a couple of nitpicks and a suggestion about the response_result_or_error

It seems like my comment regarding this was lost :( I'll rewrite it in the coming day(s).

Copy link
Member

@paulhauner paulhauner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the speedy response to my other comments. Here's that comment that got lost along the way, apologies for the delay.

/// Accepts an entire HTTP body (as a string) and returns the `result` field, as a serde `Value`.
fn response_result(response: &str) -> Result<Option<Value>, String> {
/// Accepts an entire HTTP body (as a string) and returns either the `result` field or the `error['message']` field, as a serde `Value`.
fn response_result_or_error(response: &str) -> Result<Result<Value, Value>, String> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this Result<Result<Value, Value>, String>, it feels like an anti-pattern and I think the present implementation will result in some less-than-ideal errors.

For example, if we get an "error" field on fn get_deposit_logs_in_range we'll log an error saying "No result field was returned for deposit logs", when it would be ideal to indicate that we got an error and the value of it.

I'd suggest something like this:

#[derive(Debug)]
enum RpcError {
    NoResultField,
    Error(String)
}

fn response_result_or_error(response: &str) -> Result<Value, RpcError> {
    ...
}

And then in fn get_deposit_logs_in_range we can do:

    response_result_or_error(&response_body)
        .map_err(|e| format!("Unable to obtain deposit logs: {:?}", e)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making an explicit error type would be better, agreed.

While it might be out of scope for this PR specifically, I feel like this gets back to my comment about supporting the entirety of the set of JSON-RPC 2.0 error conditions? After all, it seems like that's what Geth is sending to us anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this gets back to my comment about supporting the entirety of the set of JSON-RPC 2.0 error conditions? After all, it seems like that's what Geth is sending to us anyway.

I don't feel so strongly either way about parsing the error struct instead of flattening it to a String. AFAIK, ultimately we just flatten it and present it to the user as a String anyway, so I'm not sure there's a whole lot to gain.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, ultimately we just flatten it and present it to the user as a String anyway, so I'm not sure there's a whole lot to gain.

I think we err on the side of simplicity and just pass the string on, so I'm inclined to agree
here.

I'll work on your proposed changes now.

jmcph4 added 2 commits April 24, 2021 17:04
This commit simplifies how RPC messages are handled by defining a new
type, `RpcError`, and making `response_result_or_result` return a
single `Result` with `RpcError` as its error type.
This commit makes the `get_deposit_logs_in_range` function pass on
error strings derive from `RpcError` types it receives from
`response_result_or_error`.
@jmcph4
Copy link
Member Author

jmcph4 commented Apr 24, 2021

@paulhauner Sorry this latest iteration took so long, hopefully the current changes are what you had in mind.

It appears that CI jobs aren't running for this anymore as I'm a first-time contributor? Strange given they've run fine every other time, but it might be me misunderstanding whatever the GitHub UI is trying to tell me.

@paulhauner
Copy link
Member

Hi @jmcph4, sorry for the delay here. We've got a big PR backlog at the moment with Altair and Rayonism running in parallel. We're making good progress though and are coming back to PR reviews.

Would you mind fixing the CI conflict here please? I approved your CI run :)

This commit applies the changes from `cargo update` to the lockfile
in order to mitigate two vulnerabilities in upstream dependencies,
namely:

 - [RUSTSEC-2021-0056](https://rustsec.org/advisories/RUSTSEC-2021-0056)
 - [RUSTSEC-2021-0055](https://rustsec.org/advisories/RUSTSEC-2021-0055)
@jmcph4
Copy link
Member Author

jmcph4 commented May 14, 2021

@paulhauner dbe15af fixes my make audit locally, but it appears that CI is only running the CLA check again.

As for the linting failure, make lint works for me locally, but I'll continue investigating.

@paulhauner
Copy link
Member

As for the linting failure, make lint works for me locally, but I'll continue investigating.

This is usually fixed by a rustup update, might be worth a shot :)

@jmcph4
Copy link
Member Author

jmcph4 commented May 27, 2021

@paulhauner CI should be clear now and provided you're happy with the actual changeset I think this might be good to go? Let me know if I've missed anything (sorry this slid out for so long).

@paulhauner paulhauner removed the waiting-on-author The reviewer has suggested changes and awaits thier implementation. label May 31, 2021
@paulhauner paulhauner added ready-for-review The code is ready for review v1.5.0 For inclusion in v1.5.0 release labels May 31, 2021
Copy link
Member

@paulhauner paulhauner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for persisting with this and apologies for the slow turn-around times.

There's a few changes I'd like to request. In the interests of giving feedback, I left a comment for each change request. In the interests of saving your time (and allowing me to understand it better) I created a commit which contains my recommendations:

paulhauner@384b354

Feel free to just cherry-pick this commit onto your branch, if you agree with them.

#[derive(Debug, Serialize, Deserialize)]
pub enum RpcError {
NoResultField,
InvalidJson,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've lost the ability to see the exact JSON error here. Sometimes serde gives useful information that can narrow down the error. Making this InvalidJson(String) would ensure we retain the error.

@@ -282,7 +324,7 @@ async fn call(
]);

let response_body = send_rpc_request(endpoint, "eth_call", params, timeout).await?;
match response_result(&response_body)? {
match response_result_or_error(&response_body).ok() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're suppressing errors here.

Previously, if the server returned an error in the response we'd exit early with Err. Now, we return Ok(None), indicating the server successfully returned with no data.

impl From<RpcError> for String {
fn from(value: RpcError) -> Self {
match value {
RpcError::NoResultField => "No JSON error returned".into(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would probably be more helpful as somthing like: "No result field in JSON response"

response_result(&response_body)?
let hash: Vec<u8> = hex_to_bytes(
response_result_or_error(&response_body)
.ok()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we use ok() here we're still flattening all the different error types into "No result field was returned for block".

So, if there's an RpcError::Error(s) we're still presenting it to the user as if there was no result field.


let timestamp = hex_to_u64_be(
response_result(&response_body)?
response_result_or_error(&response_body)
.ok()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above regarding flattening errors.

@@ -165,7 +206,8 @@ pub async fn get_block(
)?;

let number = hex_to_u64_be(
response_result(&response_body)?
response_result_or_error(&response_body)
.ok()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above regarding flattening errors.

@@ -322,7 +364,8 @@ pub async fn get_deposit_logs_in_range(
}]);

let response_body = send_rpc_request(endpoint, "eth_getLogs", params, timeout).await?;
response_result(&response_body)?
Ok(response_result_or_error(&response_body)
.ok()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above regarding flattening errors.

@paulhauner paulhauner added waiting-on-author The reviewer has suggested changes and awaits thier implementation. and removed ready-for-review The code is ready for review labels Jun 2, 2021
@jmcph4
Copy link
Member Author

jmcph4 commented Jun 3, 2021

Thanks for persisting with this and apologies for the slow turn-around times.

There's a few changes I'd like to request. In the interests of giving feedback, I left a comment for each change request. In the interests of saving your time (and allowing me to understand it better) I created a commit which contains my recommendations:

paulhauner@384b354

Feel free to just cherry-pick this commit onto your branch, if you agree with them.

@paulhauner Thanks for the feedback, it's much appreciated. I've reviewed your commit and cherry-picked it into this branch as you suggested. Apologies that this wasn't as smooth as it could've been.

@paulhauner
Copy link
Member

Thanks for the feedback, it's much appreciated. I've reviewed your commit and cherry-picked it into this branch as you suggested. Apologies that this wasn't as smooth as it could've been.

Thanks for your contribution, it's much appreciated. It's great to see people digging in and fixing annoyances ❤️

We're waiting to clear the v1.4.0 release. Once that's done, we can merge this and include it in v1.5.0 :)

Copy link
Member

@paulhauner paulhauner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get this merged! Thanks again @jmcph4.

bors r+

@bors
Copy link

bors bot commented Jun 17, 2021

🕐 Waiting for PR status (Github check) to be set, probably by CI. Bors will automatically try to run when all required PR statuses are set.

@paulhauner
Copy link
Member

bors r-

@paulhauner
Copy link
Member

bors r+

@bors
Copy link

bors bot commented Jun 17, 2021

Stopped waiting for PR status (Github check) without running due to duplicate requests to run. You may check Bors to see that this PR is included in a batch by one of the other requests.

bors bot pushed a commit that referenced this pull request Jun 17, 2021
## Issue Addressed

#2293 

## Proposed Changes

 - Modify the handler for the `eth_chainId` RPC (i.e., `get_chain_id`) to explicitly match against the Geth error string returned for pre-EIP-155 synced Geth nodes
 - ~~Add a new helper function, `rpc_error_msg`, to aid in the above point~~
 - Refactor `response_result` into `response_result_or_error` and patch reliant RPC handlers accordingly (thanks to @pawanjay176)

## Additional Info

Geth, as of Pangaea Expanse (v1.10.0), returns an explicit error when it is not synced past the EIP-155 block (2675000). Previously, Geth simply returned a chain ID of 0 (which was obviously much easier to handle on Lighthouse's part).


Co-authored-by: Paul Hauner <[email protected]>
@paulhauner paulhauner added ready-for-merge This PR is ready to merge. and removed waiting-on-author The reviewer has suggested changes and awaits thier implementation. labels Jun 17, 2021
@bors
Copy link

bors bot commented Jun 17, 2021

@bors bors bot changed the title Handle Geth pre-EIP-155 block sync error condition [Merged by Bors] - Handle Geth pre-EIP-155 block sync error condition Jun 17, 2021
@bors bors bot closed this Jun 17, 2021
paulhauner added a commit to paulhauner/lighthouse that referenced this pull request Jun 21, 2021
commit ea6838b
Author: Paul Hauner <[email protected]>
Date:   Mon Jun 21 17:45:55 2021 +1000

    Flip bool

commit 89afc26
Author: Paul Hauner <[email protected]>
Date:   Mon Jun 21 15:22:48 2021 +1000

    Avoid taking the write-lock on val monitor

commit b84ff9f
Author: realbigsean <[email protected]>
Date:   Fri Jun 18 05:58:01 2021 +0000

    rust 1.53.0 updates (sigp#2411)

    ## Issue Addressed

    `make lint` failing on rust 1.53.0.

    ## Proposed Changes

    1.53.0 updates

    ## Additional Info

    I haven't figure out why yet, we were now hitting the recursion limit in a few crates. So I had to add `#![recursion_limit = "256"]` in a few places

    Co-authored-by: realbigsean <[email protected]>
    Co-authored-by: Michael Sproul <[email protected]>

commit 3dc1eb5
Author: Michael Sproul <[email protected]>
Date:   Thu Jun 17 02:10:48 2021 +0000

    Ignore inactive validators in validator monitor (sigp#2396)

    ## Proposed Changes

    A user on Discord (`@ChewsMacRibs`) reported that the validator monitor was logging `WARN Attested to an incorrect head` for their validator while it was awaiting activation.

    This PR modifies the monitor so that it ignores inactive validators, by the logic that they are either awaiting activation, or have already exited. Either way, there's no way for an inactive validator to have their attestations included on chain, so no need for the monitor to report on them.

    ## Additional Info

    To reproduce the bug requires registering validator keys manually with `--validator-monitor-pubkeys`. I don't think the bug will present itself with `--validator-monitor-auto`.

commit 98ab00c
Author: Jack <[email protected]>
Date:   Thu Jun 17 02:10:47 2021 +0000

    Handle Geth pre-EIP-155 block sync error condition (sigp#2304)

    ## Issue Addressed

    sigp#2293

    ## Proposed Changes

     - Modify the handler for the `eth_chainId` RPC (i.e., `get_chain_id`) to explicitly match against the Geth error string returned for pre-EIP-155 synced Geth nodes
     - ~~Add a new helper function, `rpc_error_msg`, to aid in the above point~~
     - Refactor `response_result` into `response_result_or_error` and patch reliant RPC handlers accordingly (thanks to @pawanjay176)

    ## Additional Info

    Geth, as of Pangaea Expanse (v1.10.0), returns an explicit error when it is not synced past the EIP-155 block (2675000). Previously, Geth simply returned a chain ID of 0 (which was obviously much easier to handle on Lighthouse's part).

    Co-authored-by: Paul Hauner <[email protected]>

commit b1657a6
Author: realbigsean <[email protected]>
Date:   Thu Jun 17 02:10:46 2021 +0000

    Reorg events (sigp#2090)

    ## Issue Addressed

    Resolves sigp#2088

    ## Proposed Changes

    Add the `chain_reorg` SSE event topic

    ## Additional Info

    Co-authored-by: realbigsean <[email protected]>
    Co-authored-by: Paul Hauner <[email protected]>

commit 3261eff
Author: divma <[email protected]>
Date:   Thu Jun 17 00:40:16 2021 +0000

    split outbound and inbound codecs encoded types (sigp#2410)

    Splits the inbound and outbound requests, for maintainability.

commit a526145
Author: Clifton King <[email protected]>
Date:   Wed Jun 16 10:42:55 2021 +0000

    Fix remote signer test (sigp#2400)

    ## Proposed Changes

    Unescape text for json comparison in:

    https://github.com/sigp/lighthouse/blob/3a24ca5f14c6e6d6612fd43eca82aa0c1e6aba16/remote_signer/tests/sign.rs#L282-L285

    Which causes this error:

    ```
    ---- sign::invalid_field_fork stdout ----
    thread 'sign::invalid_field_fork' panicked at 'assertion failed: `(left == right)`
      left: `"Unable to parse body message from JSON: Error(\"invalid hex (InvalidHexCharacter { c: 'I', index: 0 })\", line: 1, column: 237097)"`,
     right: `"Unable to parse body message from JSON: Error(\"invalid hex (InvalidHexCharacter { c: \\'I\\', index: 0 })\", line: 1, column: 237097)"`', testing/remote_signer_test/src/consumer.rs:144:5
    ```

    This is my first contribution and happy to receive feedback if you have any. Thanks

commit dffe31c
Author: Pawan Dhananjay <[email protected]>
Date:   Wed Jun 16 09:16:51 2021 +0000

    Add an account command to enable/disable validators (sigp#2386)

    ## Issue Addressed

    Resolves sigp#2322

    ## Proposed Changes

    Adds a `modify` command to `lighthouse account validator` with subcommands to enable and disable specific or all pubkeys.

commit 3b600ac
Author: Paul Hauner <[email protected]>
Date:   Thu Jun 10 01:44:49 2021 +0000

    v1.4.0 (sigp#2402)

    ## Issue Addressed

    NA

    ## Proposed Changes

    - Bump versions and update `Cargo.lock`

    ## Additional Info

    NA

    ## TODO

    - [x] Ensure sigp#2398 gets merged succesfully

commit b383836
Author: Paul Hauner <[email protected]>
Date:   Wed Jun 9 02:30:06 2021 +0000

    Modify Malloc Tuning (sigp#2398)

    ## Issue Addressed

    NA

    ## Proposed Changes

    I've noticed some of the SigP Prater nodes struggling on v1.4.0-rc.0. I suspect this is due to the changes in sigp#2296. Specifically, the trade-off which lowered the memory footprint whilst increasing runtime on some functions.

    Presently, this PR is documenting my testing on Prater.

    ## Additional Info

    NA

commit 4a6f2fa
Author: Paul Hauner <[email protected]>
Date:   Mon Jun 7 02:34:10 2021 +0000

    Only perform malloc tuning for beacon node (sigp#2397)

    ## Issue Addressed

    NA

    ## Proposed Changes

    Only run `configure_memory_alllocator` for the BN process.

    I noticed that VC memory usage increases significantly with the new malloc tuning parameters. This was also raised by a user on [r/ethstaker](https://www.reddit.com/r/ethstaker/comments/nr8998/lighthouse_prerelease_v140rc0/h0fnt9l?utm_source=share&utm_medium=web2x&context=3).

    There wasn't any issue with memory usage by the VC before we implemented sigp#2296, so I think we were a bit overzealous when we allowed these changes to affect it. This PR allows things that weren't broken to remain unfixed.

    ## Additional Info

    NA

commit 93100f2
Author: Paul Hauner <[email protected]>
Date:   Mon Jun 7 02:34:09 2021 +0000

    Make less logs for attn with unknown head (sigp#2395)

    ## Issue Addressed

    NA

    ## Proposed Changes

    I am starting to see a lot of slog-async overflows (i.e., too many logs) on Prater whenever we see attestations for an unknown block. Since these logs are identical (except for peer id) and we expose volume/count of these errors via `metrics::GOSSIP_ATTESTATION_ERRORS_PER_TYPE`, I took the following actions to remove them from `DEBUG` logs:

    - Push the "Attestation for unknown block" log to trace.
    - Add a debug log in `search_for_block`. In effect, this should serve as a de-duped version of the previous, downgraded log.

    ## Additional Info

    TBC

commit 502402c
Author: Pawan Dhananjay <[email protected]>
Date:   Fri Jun 4 00:10:59 2021 +0000

    Fix options for `--eth1-endpoints` flag (sigp#2392)

    ## Issue Addressed

    N/A

    ## Proposed Changes

    Set `config.sync_eth1_chain` to true when using just the  `--eth1-endpoints` flag (without `--eth1`).

commit f6280aa
Author: Paul Hauner <[email protected]>
Date:   Thu Jun 3 00:13:02 2021 +0000

    v1.4.0-rc.0 (sigp#2379)

    ## Issue Addressed

    NA

    ## Proposed Changes

    Bump versions.

    ## Additional Info

    This is not exactly the v1.4.0 release described in [Lighthouse Update sigp#36](https://lighthouse.sigmaprime.io/update-36.html).

    Whilst it contains:

    - Beta Windows support
    - A reduction in Eth1 queries
    - A reduction in memory footprint

    It does not contain:

    - Altair
    - Doppelganger Protection
    - The remote signer

    We have decided to release some features early. This is primarily due to the desire to allow users to benefit from the memory saving improvements as soon as possible.

    ## TODO

    - [x] Wait for sigp#2340, sigp#2356 and sigp#2376 to merge and then rebase on `unstable`.
    - [x] Ensure discovery issues are fixed (see sigp#2388)
    - [x] Ensure sigp#2382 is merged/removed.
    - [x] Ensure sigp#2383 is merged/removed.
    - [x] Ensure sigp#2384 is merged/removed.
    - [ ] Double-check eth1 cache is carried between boots

commit 90ea075
Author: Paul Hauner <[email protected]>
Date:   Wed Jun 2 01:07:28 2021 +0000

    Revert "Network protocol upgrades (sigp#2345)" (sigp#2388)

    ## Issue Addressed

    NA

    ## Proposed Changes

    Reverts sigp#2345 in the interests of getting v1.4.0 out this week. Once we have released that, we can go back to testing this again.

    ## Additional Info

    NA

commit d34f922
Author: Paul Hauner <[email protected]>
Date:   Wed Jun 2 01:07:27 2021 +0000

    Add early check for RPC block relevancy (sigp#2289)

    ## Issue Addressed

    NA

    ## Proposed Changes

    When observing `jemallocator` heap profiles and Grafana, it became clear that Lighthouse is spending significant RAM/CPU on processing blocks from the RPC. On investigation, it seems that we are loading the parent of the block *before* we check to see if the block is already known. This is a big waste of resources.

    This PR adds an additional `check_block_relevancy` call as the first thing we do when we try to process a `SignedBeaconBlock` via the RPC (or other similar methods). Ultimately, `check_block_relevancy` will be called again later in the block processing flow. It's a very light function and I don't think trying to optimize it out is worth the risk of a bad block slipping through.

    Also adds a `New RPC block received` info log when we process a new RPC block. This seems like interesting and infrequent info.

    ## Additional Info

    NA

commit bf4e02e
Author: Paul Hauner <[email protected]>
Date:   Tue Jun 1 06:59:43 2021 +0000

    Return a specific error for frozen attn states (sigp#2384)

    ## Issue Addressed

    NA

    ## Proposed Changes

    Return a very specific error when at attestation reads shuffling from a frozen `BeaconState`. Previously, this was returning `MissingBeaconState` which indicates a much more serious issue.

    ## Additional Info

    Since `get_inconsistent_state_for_attestation_verification_only` is only called once in `BeaconChain::with_committee_cache`, it is quite easy to reason about the impact of this change.

commit ba9c4c5
Author: Paul Hauner <[email protected]>
Date:   Tue Jun 1 06:59:41 2021 +0000

    Return more detail in Eth1 HTTP errors (sigp#2383)

    ## Issue Addressed

    NA

    ## Proposed Changes

    Whilst investigating sigp#2372, I [learned](sigp#2372 (comment)) that the error message returned from some failed Eth1 requests are always `NotReachable`. This makes debugging quite painful.

    This PR adds more detail to these errors. For example:

    - Bad infura key: `ERRO Failed to update eth1 cache             error: Failed to update Eth1 service: "All fallback errored: https://mainnet.infura.io/ => EndpointError(RequestFailed(\"Response HTTP status was not 200 OK:  401 Unauthorized.\"))", retry_millis: 60000, service: eth1_rpc`
    - Unreachable server: `ERRO Failed to update eth1 cache             error: Failed to update Eth1 service: "All fallback errored: http://127.0.0.1:8545/ => EndpointError(RequestFailed(\"Request failed: reqwest::Error { kind: Request, url: Url { scheme: \\\"http\\\", cannot_be_a_base: false, username: \\\"\\\", password: None, host: Some(Ipv4(127.0.0.1)), port: Some(8545), path: \\\"/\\\", query: None, fragment: None }, source: hyper::Error(Connect, ConnectError(\\\"tcp connect error\\\", Os { code: 111, kind: ConnectionRefused, message: \\\"Connection refused\\\" })) }\"))", retry_millis: 60000, service: eth1_rpc`
    - Bad server: `ERRO Failed to update eth1 cache             error: Failed to update Eth1 service: "All fallback errored: http://127.0.0.1:8545/ => EndpointError(RequestFailed(\"Response HTTP status was not 200 OK:  501 Not Implemented.\"))", retry_millis: 60000, service: eth1_rpc`

    ## Additional Info

    NA

commit 4c7bb49
Author: Paul Hauner <[email protected]>
Date:   Mon May 31 04:18:20 2021 +0000

    Use the forwards iterator more often (sigp#2376)

    ## Issue Addressed

    NA

    ## Primary Change

    When investigating memory usage, I noticed that retrieving a block from an early slot (e.g., slot 900) would cause a sharp increase in the memory footprint (from 400mb to 800mb+) which seemed to be ever-lasting.

    After some investigation, I found that the reverse iteration from the head back to that slot was the likely culprit. To counter this, I've switched the `BeaconChain::block_root_at_slot` to use the forwards iterator, instead of the reverse one.

    I also noticed that the networking stack is using `BeaconChain::root_at_slot` to check if a peer is relevant (`check_peer_relevance`). Perhaps the steep, seemingly-random-but-consistent increases in memory usage are caused by the use of this function.

    Using the forwards iterator with the HTTP API alleviated the sharp increases in memory usage. It also made the response much faster (before it felt like to took 1-2s, now it feels instant).

    ## Additional Changes

    In the process I also noticed that we have two functions for getting block roots:

    - `BeaconChain::block_root_at_slot`: returns `None` for a skip slot.
    - `BeaconChain::root_at_slot`: returns the previous root for a skip slot.

    I unified these two functions into `block_root_at_slot` and added the `WhenSlotSkipped` enum. Now, the caller must be explicit about the skip-slot behaviour when requesting a root.

    Additionally, I replaced `vec![]` with `Vec::with_capacity` in `store::chunked_vector::range_query`. I stumbled across this whilst debugging and made this modification to see what effect it would have (not much). It seems like a decent change to keep around, but I'm not concerned either way.

    Also, `BeaconChain::get_ancestor_block_root` is unused, so I got rid of it :wastebasket:.

    ## Additional Info

    I haven't also done the same for state roots here. Whilst it's possible and a good idea, it's more work since the fwds iterators are presently block-roots-specific.

    Whilst there's a few places a reverse iteration of state roots could be triggered (e.g., attestation production, HTTP API), they're no where near as common as the `check_peer_relevance` call. As such, I think we should get this PR merged first, then come back for the state root iters. I made an issue here sigp#2377.

commit 320a683
Author: Kevin Lu <[email protected]>
Date:   Mon May 31 04:18:19 2021 +0000

    Minimum Outbound-Only Peers Requirement (sigp#2356)

    ## Issue Addressed

    sigp#2325

    ## Proposed Changes

    This pull request changes the behavior of the Peer Manager by including a minimum outbound-only peers requirement. The peer manager will continue querying for peers if this outbound-only target number hasn't been met. Additionally, when peers are being removed, an outbound-only peer will not be disconnected if doing so brings us below the minimum.

    ## Additional Info

    Unit test for heartbeat function tests that disconnection behavior is correct. Continual querying for peers if outbound-only hasn't been met is not directly tested, but indirectly through unit testing of the helper function that counts the number of outbound-only peers.

    EDIT: Am concerned about the behavior of ```update_peer_scores```. If we have connected to a peer with a score below the disconnection threshold (-20), then its connection status will remain connected, while its score state will change to disconnected.

    ```rust
    let previous_state = info.score_state();
    // Update scores
    info.score_update();
    Self::handle_score_transitions(
                   previous_state,
                    peer_id,
                    info,
                   &mut to_ban_peers,
                   &mut to_unban_peers,
                   &mut self.events,
                   &self.log,
    );
    ```

    ```previous_state``` will be set to Disconnected, and then because ```handle_score_transitions``` only changes connection status for a peer if the state changed, the peer remains connected. Then in the heartbeat code, because we only disconnect healthy peers if we have too many peers, these peers don't get disconnected. I'm not sure realistically how often this scenario would occur, but it might be better to adjust the logic to account for scenarios where the score state implies a connection status different from the current connection status.

    Co-authored-by: Kevin Lu <[email protected]>

commit 0847986
Author: Mac L <[email protected]>
Date:   Mon May 31 04:18:18 2021 +0000

    Reduce outbound requests to eth1 endpoints (sigp#2340)

    ## Issue Addressed

    sigp#2282

    ## Proposed Changes

    Reduce the outbound requests made to eth1 endpoints by caching the results from `eth_chainId` and `net_version`.
    Further reduce the overall request count by increasing `auto_update_interval_millis` from `7_000` (7 seconds) to `60_000` (1 minute).
    This will result in a reduction from ~2000 requests per hour to 360 requests per hour (during normal operation). A reduction of 82%.

    ## Additional Info

    If an endpoint fails, its state is dropped from the cache and the `eth_chainId` and `net_version` calls will be made for that endpoint again during the regular update cycle (once per minute) until it is back online.

    Co-authored-by: Paul Hauner <[email protected]>

commit ec5cceb
Author: Age Manning <[email protected]>
Date:   Sat May 29 07:25:06 2021 +0000

    Correct issue with dialing peers (sigp#2375)

    The ordering of adding new peers to the peerdb and deciding when to dial them was not considered in a previous update.

    This adds the condition that if a peer is not in the peer-db then it is an acceptable peer to dial.

    This makes sigp#2374 obsolete.

commit d12e746
Author: Age Manning <[email protected]>
Date:   Fri May 28 22:02:10 2021 +0000

    Network protocol upgrades (sigp#2345)

    This provides a number of upgrades to gossipsub and discovery.

    The updates are extensive and this needs thorough testing.

commit 456b313
Author: Paul Hauner <[email protected]>
Date:   Fri May 28 05:59:45 2021 +0000

    Tune GNU malloc (sigp#2299)

    ## Issue Addressed

    NA

    ## Proposed Changes

    Modify the configuration of [GNU malloc](https://www.gnu.org/software/libc/manual/html_node/The-GNU-Allocator.html) to reduce memory footprint.

    - Set `M_ARENA_MAX` to 4.
        - This reduces memory fragmentation at the cost of contention between threads.
    - Set `M_MMAP_THRESHOLD` to 2mb
        - This means that any allocation >= 2mb is allocated via an anonymous mmap, instead of on the heap/arena. This reduces memory fragmentation since we don't need to keep growing the heap to find big contiguous slabs of free memory.
    - ~~Run `malloc_trim` every 60 seconds.~~
        - ~~This shaves unused memory from the top of the heap, preventing the heap from constantly growing.~~
        - Removed, see: sigp#2299 (comment)

    *Note: this only provides memory savings on the Linux (glibc) platform.*

    ## Additional Info

    I'm going to close sigp#2288 in favor of this for the following reasons:

    - I've managed to get the memory footprint *smaller* here than with jemalloc.
    - This PR seems to be less of a dramatic change than bringing in the jemalloc dep.
    - The changes in this PR are strictly runtime changes, so we can create CLI flags which disable them completely. Since this change is wide-reaching and complex, it's nice to have an easy "escape hatch" if there are undesired consequences.

    ## TODO

    - [x] Allow configuration via CLI flags
    - [x] Test on Mac
    - [x] Test on RasPi.
    - [x] Determine if GNU malloc is present?
        - I'm not quite sure how to detect for glibc.. This issue suggests we can't really: rust-lang/rust#33244
    - [x] Make a clear argument regarding the affect of this on CPU utilization.
    - [x] Test with higher `M_ARENA_MAX` values.
    - [x] Test with longer trim intervals
    - [x] Add some stats about memory savings
    - [x] Remove `malloc_trim` calls & code

commit fdaeec6
Author: Pawan Dhananjay <[email protected]>
Date:   Wed May 26 05:58:41 2021 +0000

    Monitoring service api (sigp#2251)

    ## Issue Addressed

    N/A

    ## Proposed Changes

    Adds a client side api for collecting system and process metrics and pushing it to a monitoring service.

commit 55aada0
Author: Age Manning <[email protected]>
Date:   Wed May 26 14:21:44 2021 +1000

    More stringent dialing (sigp#2363)

    * More stringent dialing

    * Cover cached enr dialing

commit 5d9a1bc
Author: Michael Sproul <[email protected]>
Date:   Thu May 20 00:23:08 2021 +0000

    Add Windows to Bors config (sigp#2358)

    We accidentally omitted the new Windows tests (sigp#2333) from the Bors config, meaning that PRs will merge before the tests pass. This PR corrects that.

commit ba55e14
Author: ethDreamer <[email protected]>
Date:   Wed May 19 23:05:16 2021 +0000

    Enable Compatibility with Windows (sigp#2333)

    ## Issue Addressed

    Windows incompatibility.

    ## Proposed Changes

    On windows, lighthouse needs to default to STDIN as tty doesn't exist. Also Windows uses ACLs for file permissions. So to mirror chmod 600, we will remove every entry in a file's ACL and add only a single SID that is an alias for the file owner.

    Beyond that, there were several changes made to different unit tests because windows has slightly different error messages as well as frustrating nuances around killing a process :/

    ## Additional Info

    Tested on my Windows VM and it appears to work, also compiled & tested on Linux with these changes. Permissions look correct on both platforms now. Just waiting for my validator to activate on Prater so I can test running full validator client on windows.

    Co-authored-by: ethDreamer <[email protected]>
    Co-authored-by: Michael Sproul <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-merge This PR is ready to merge. v1.5.0 For inclusion in v1.5.0 release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants