Skip to content
This repository was archived by the owner on Nov 17, 2020. It is now read-only.

Cluster node won't start after reboot when setup federated exchange queue #111

Closed
rerime opened this issue Aug 5, 2020 · 9 comments
Closed

Comments

@rerime
Copy link

rerime commented Aug 5, 2020

Setup federation between two clusters

# On cluster A
# federation user
  rabbitmqctl add_vhost vhost
  rabbitmqctl add_user fed_user fed_password
  rabbitmqctl set_permissions --vhost vhost fed_user '^fed' '^fed' '^fed'    
# upstream fed_a_b
  rabbitmqctl set_parameter --vhost=vhost federation-upstream fed_a_b "{\"uri\":\"amqp://fed_user:fed_password@host_a/vhost\"}"
# a -> b policies
  rabbitmqctl set_policy --vhost=vhost fed_a_b_exchange_policy 'fed\.exchange\.a_b' '{"federation-upstream":"fed_a_b"}'
  rabbitmqctl set_policy --vhost=vhost fed_b_a_exchange_policy 'fed\.exchange\.b_a' '{"federation-upstream":"fed_a_b"}'
  rabbitmqctl set_policy --vhost=vhost fed_a_b_events_policy   '^fed\.queue\.events\.a_b' '{"federation-upstream":"fed_a_b"}'


# On cluster B
# federation user
  rabbitmqctl add_vhost vhost
  rabbitmqctl add_user fed_user fed_password
  rabbitmqctl set_permissions --vhost vhost fed_user '^fed' '^fed' '^fed'
# upstream fed_b_a
  rabbitmqctl set_parameter --vhost=vhost federation-upstream fed_b_a "{\"uri\":\"amqp://fed_user:fed_password@host_b/vhost\"}"
# b -> a policies
  rabbitmqctl set_policy --vhost=vhost fed_b_a_exchange_policy 'fed\.exchange\.b_a' '{"federation-upstream":"fed_b_a"}'
  rabbitmqctl set_policy --vhost=vhost fed_a_b_exchange_policy 'fed\.exchange\.a_b' '{"federation-upstream":"fed_b_a"}'
  rabbitmqctl set_policy --vhost=vhost fed_b_a_events_policy   '^fed\.queue\.events\.b_a' '{"federation-upstream":"fed_b_a"}'

Add queue on hq-0 cluster A

Screenshot 2020-08-04 at 18 34 25

Reboot node ha-0 on cluster A where federated queue is located, node won't start

Log

  Starting broker...2020-08-04 14:56:21.327 [info] <0.269.0> 
 node           : [email protected]
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/advanced.config
                : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : NrIYU1tF9/JIWz5dzdEPqQ==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/[email protected]
2020-08-04 14:56:21.339 [info] <0.269.0> Running boot step pre_boot defined by app rabbit
2020-08-04 14:56:21.339 [info] <0.269.0> Running boot step rabbit_core_metrics defined by app rabbit
2020-08-04 14:56:21.339 [info] <0.269.0> Running boot step rabbit_alarm defined by app rabbit
2020-08-04 14:56:21.348 [info] <0.405.0> Memory high watermark set to 244 MiB (256000000 bytes) of 5961 MiB (6250688512 bytes) total
2020-08-04 14:56:21.359 [info] <0.407.0> Enabling free disk space monitoring
2020-08-04 14:56:21.359 [info] <0.407.0> Disk free limit set to 50MB
2020-08-04 14:56:21.363 [info] <0.269.0> Running boot step code_server_cache defined by app rabbit
2020-08-04 14:56:21.364 [info] <0.269.0> Running boot step file_handle_cache defined by app rabbit
2020-08-04 14:56:21.365 [info] <0.414.0> Limiting to approx 1048479 file handles (943629 sockets)
2020-08-04 14:56:21.365 [info] <0.415.0> FHC read buffering:  OFF
2020-08-04 14:56:21.365 [info] <0.415.0> FHC write buffering: ON
2020-08-04 14:56:21.366 [info] <0.269.0> Running boot step worker_pool defined by app rabbit
2020-08-04 14:56:21.366 [info] <0.400.0> Will use 2 processes for default worker pool
2020-08-04 14:56:21.366 [info] <0.400.0> Starting worker pool 'worker_pool' with 2 processes in it
2020-08-04 14:56:21.367 [info] <0.269.0> Running boot step database defined by app rabbit
2020-08-04 14:56:21.696 [info] <0.269.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-08-04 14:56:21.830 [info] <0.269.0> Successfully synced tables from a peer
2020-08-04 14:56:21.834 [info] <0.269.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-08-04 14:56:21.834 [info] <0.269.0> Successfully synced tables from a peer
2020-08-04 14:56:21.835 [info] <0.269.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-08-04 14:56:21.835 [info] <0.269.0> Successfully synced tables from a peer
2020-08-04 14:56:21.865 [info] <0.269.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-08-04 14:56:21.865 [info] <0.269.0> Successfully synced tables from a peer
2020-08-04 14:56:21.878 [info] <0.269.0> Will register with peer discovery backend rabbit_peer_discovery_k8s
2020-08-04 14:56:21.951 [info] <0.269.0> Running boot step database_sync defined by app rabbit
2020-08-04 14:56:21.951 [info] <0.269.0> Running boot step feature_flags defined by app rabbit
2020-08-04 14:56:21.951 [info] <0.269.0> Running boot step codec_correctness_check defined by app rabbit
2020-08-04 14:56:21.952 [info] <0.269.0> Running boot step external_infrastructure defined by app rabbit
2020-08-04 14:56:21.952 [info] <0.269.0> Running boot step rabbit_registry defined by app rabbit
2020-08-04 14:56:21.952 [info] <0.269.0> Running boot step rabbit_auth_mechanism_cr_demo defined by app rabbit
2020-08-04 14:56:21.952 [info] <0.269.0> Running boot step rabbit_queue_location_random defined by app rabbit
2020-08-04 14:56:21.952 [info] <0.269.0> Running boot step rabbit_event defined by app rabbit
2020-08-04 14:56:21.953 [info] <0.269.0> Running boot step rabbit_auth_mechanism_amqplain defined by app rabbit
2020-08-04 14:56:21.953 [info] <0.269.0> Running boot step rabbit_auth_mechanism_plain defined by app rabbit
2020-08-04 14:56:21.953 [info] <0.269.0> Running boot step rabbit_exchange_type_direct defined by app rabbit
2020-08-04 14:56:21.953 [info] <0.269.0> Running boot step rabbit_exchange_type_fanout defined by app rabbit
2020-08-04 14:56:21.954 [info] <0.269.0> Running boot step rabbit_exchange_type_headers defined by app rabbit
2020-08-04 14:56:21.954 [info] <0.269.0> Running boot step rabbit_exchange_type_topic defined by app rabbit
2020-08-04 14:56:21.954 [info] <0.269.0> Running boot step rabbit_mirror_queue_mode_all defined by app rabbit
2020-08-04 14:56:21.954 [info] <0.269.0> Running boot step rabbit_mirror_queue_mode_exactly defined by app rabbit
2020-08-04 14:56:21.955 [info] <0.269.0> Running boot step rabbit_mirror_queue_mode_nodes defined by app rabbit
2020-08-04 14:56:21.955 [info] <0.269.0> Running boot step rabbit_priority_queue defined by app rabbit
2020-08-04 14:56:21.955 [info] <0.269.0> Priority queues enabled, real BQ is rabbit_variable_queue
2020-08-04 14:56:21.955 [info] <0.269.0> Running boot step rabbit_queue_location_client_local defined by app rabbit
2020-08-04 14:56:21.955 [info] <0.269.0> Running boot step rabbit_queue_location_min_masters defined by app rabbit
2020-08-04 14:56:21.956 [info] <0.269.0> Running boot step kernel_ready defined by app rabbit
2020-08-04 14:56:21.956 [info] <0.269.0> Running boot step rabbit_sysmon_minder defined by app rabbit
2020-08-04 14:56:21.956 [info] <0.269.0> Running boot step rabbit_epmd_monitor defined by app rabbit
2020-08-04 14:56:21.957 [info] <0.517.0> epmd monitor knows us, inter-node communication (distribution) port: 25672
2020-08-04 14:56:21.958 [info] <0.269.0> Running boot step guid_generator defined by app rabbit
2020-08-04 14:56:21.959 [info] <0.269.0> Running boot step rabbit_node_monitor defined by app rabbit
2020-08-04 14:56:22.079 [info] <0.521.0> Starting rabbit_node_monitor
2020-08-04 14:56:22.079 [info] <0.269.0> Running boot step delegate_sup defined by app rabbit
2020-08-04 14:56:22.080 [info] <0.269.0> Running boot step rabbit_memory_monitor defined by app rabbit
2020-08-04 14:56:22.082 [info] <0.269.0> Running boot step core_initialized defined by app rabbit
2020-08-04 14:56:22.082 [info] <0.269.0> Running boot step upgrade_queues defined by app rabbit
2020-08-04 14:56:22.160 [info] <0.269.0> Running boot step rabbit_connection_tracking defined by app rabbit
2020-08-04 14:56:22.160 [info] <0.269.0> Running boot step rabbit_connection_tracking_handler defined by app rabbit
2020-08-04 14:56:22.160 [info] <0.269.0> Running boot step rabbit_exchange_parameters defined by app rabbit
2020-08-04 14:56:22.161 [info] <0.269.0> Running boot step rabbit_mirror_queue_misc defined by app rabbit
2020-08-04 14:56:22.162 [info] <0.269.0> Running boot step rabbit_policies defined by app rabbit
2020-08-04 14:56:22.163 [info] <0.269.0> Running boot step rabbit_policy defined by app rabbit
2020-08-04 14:56:22.164 [info] <0.269.0> Running boot step rabbit_queue_location_validator defined by app rabbit
2020-08-04 14:56:22.164 [info] <0.269.0> Running boot step rabbit_quorum_memory_manager defined by app rabbit
2020-08-04 14:56:22.164 [info] <0.269.0> Running boot step rabbit_vhost_limit defined by app rabbit
2020-08-04 14:56:22.164 [info] <0.269.0> Running boot step recovery defined by app rabbit
2020-08-04 14:56:22.299 [info] <0.555.0> Making sure data directory '/var/lib/rabbitmq/mnesia/[email protected]/msg_stores/vhosts/5JIAVEDLEJHIGKZFVWKV87LER' for vhost 'vhost' exists
2020-08-04 14:56:22.303 [info] <0.555.0> Starting message stores for vhost 'vhost'
2020-08-04 14:56:22.303 [info] <0.559.0> Message store "5JIAVEDLEJHIGKZFVWKV87LER/msg_store_transient": using rabbit_msg_store_ets_index to provide index
2020-08-04 14:56:22.308 [info] <0.555.0> Started message store of type transient for vhost 'vhost'
2020-08-04 14:56:22.308 [info] <0.563.0> Message store "5JIAVEDLEJHIGKZFVWKV87LER/msg_store_persistent": using rabbit_msg_store_ets_index to provide index
2020-08-04 14:56:22.310 [warning] <0.563.0> Message store "5JIAVEDLEJHIGKZFVWKV87LER/msg_store_persistent": rebuilding indices from scratch
2020-08-04 14:56:22.312 [info] <0.555.0> Started message store of type persistent for vhost 'vhost'
2020-08-04 14:56:22.474 [error] <0.418.0> ** Generic server <0.418.0> terminating
** Last message in was {'$gen_cast',{submit_async,#Fun<rabbit_binding.3.74481353>,<0.417.0>}}
** When Server state == undefined
** Reason for termination == 
** {{case_clause,undefined},[{rabbit_federation_util,pgname,1,[{file,"src/rabbit_federation_util.erl"},{line,74}]},{rabbit_federation_exchange_link,x,1,[{file,"src/rabbit_federation_exchange_link.erl"},{line,265}]},{rabbit_federation_exchange_link,cast,2,[{file,"src/rabbit_federation_exchange_link.erl"},{line,254}]},{rabbit_federation_exchange,add_binding,3,[{file,"src/rabbit_federation_exchange.erl"},{line,62}]},{rabbit_exchange,'-callback/4-lc$^1/1-0-',5,[{file,"src/rabbit_exchange.erl"},{line,71}]},{rabbit_exchange,callback,4,[{file,"src/rabbit_exchange.erl"},{line,72}]},{rabbit_binding,x_callback,4,[{file,"src/rabbit_binding.erl"},{line,700}]},{rabbit_binding,'-recover_semi_durable_route/3-fun-0-',3,[{file,"src/rabbit_binding.erl"},{line,124}]}]}
2020-08-04 14:56:22.474 [error] <0.418.0> CRASH REPORT Process <0.418.0> with 0 neighbours exited with reason: no case clause matching undefined in rabbit_federation_util:pgname/1 line 74 in gen_server2:terminate/3 line 1183
2020-08-04 14:56:22.474 [error] <0.416.0> Supervisor worker_pool_sup had child 1 started with worker_pool_worker:start_link(worker_pool) at <0.418.0> exit with reason no case clause matching undefined in rabbit_federation_util:pgname/1 line 74 in context child_terminated

System

OS: Linux
Uptime (seconds): 72270
RabbitMQ version: 3.8.5
Node name: rabbit@rmqrb-rabbitmq-ha-0
Erlang configuration: Erlang/OTP 23 [erts-11.0.3] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:64]
Erlang processes: 981 used, 1048576 limit
Scheduler run queue: 2
Cluster heartbeat timeout (net_ticktime): 60

Plugins

Enabled plugin file: /etc/rabbitmq/enabled_plugins
Enabled plugins:

 * rabbitmq_prometheus
 * rabbitmq_peer_discovery_k8s
 * rabbitmq_federation_management
 * rabbitmq_peer_discovery_common
 * rabbitmq_shovel_management
 * rabbitmq_shovel
 * amqp10_client
 * rabbitmq_federation
 * amqp10_common
 * rabbitmq_auth_backend_ldap
 * prometheus
 * rabbitmq_consistent_hash_exchange
 * rabbitmq_management
 * amqp_client
 * rabbitmq_web_dispatch
 * cowboy
 * cowlib
 * rabbitmq_management_agent

@michaelklishin
Copy link
Contributor

Duplicate of #109.

@michaelklishin
Copy link
Contributor

michaelklishin commented Aug 5, 2020

Something in your environment results in plugin's defaults not being loaded. You can explicitly set the undefined value of rabbitmq_federation.pgroup_name_cluster_id in advanced.config to false:

[
  {rabbitmq_federation, [
    {pgroup_name_cluster_id, false}
  ]}
].

or upgrade to 3.8.6 where all values other than true would be treated as false.

@michaelklishin
Copy link
Contributor

Looking at the log I have a good guess as to what may be going on here. This seems to be a yet another "definition import before plugins are activated" problem that 3.8.6 addressed for different reasons. See rabbitmq/rabbitmq-server#2384.

@rerime
Copy link
Author

rerime commented Aug 6, 2020

Same result with 3.8.6

@michaelklishin
Copy link
Contributor

@rerime the exception in the provided log snippet states that rabbit_federation_util:pgname/1 runs into an exception where a case statement does not handle the effective configuration value, undefined. As of https://github.com/rabbitmq/rabbitmq-federation/pull/109/files and a6308bc, the plugin will treat any value other than true loaded from its config file as a false.

So I find it hard to believe that this scenario is realistic with 3.8.6. I will try to reproduce using the provided steps above.

@rerime
Copy link
Author

rerime commented Aug 7, 2020

@michaelklishin
Sorry for inconvenience. Check again, seems it was fixed.

@michaelklishin
Copy link
Contributor

I cannot reproduce with 3.8.6 with the following setup:

  • Two independent nodes, A and B
  • B has A configured as an upstream
  • Two policies set up exchange and queue federation which I can confirm to be functional using management UI and a few client connections
  • I restarted node B and it started successfully
  • I then restarted node A and it started successfully

@rerime
Copy link
Author

rerime commented Aug 7, 2020

@michaelklishin
Thx for your support. We've fixed another bug with
management.load_definitions = /path/to/definitions/file.json instead of
load_definitions = /path/to/definitions/file.json
We've missed in very useful docs https://www.rabbitmq.com/definitions.html
/etc/definitions/definitions.json

{
  "global_parameters": [
    
  ],
  "users": [
    {
      "name": "management",
      "password": "xxx",
      "tags": "management"
    },
    {
      "name": "guest",
      "password": "xxx",
      "tags": "administrator"
    }
  ],
  "vhosts": [
    {
      "name": "/"
    }
  ],
  "permissions": [
    {
      "user": "guest",
      "vhost": "/",
      "configure": ".*",
      "read": ".*",
      "write": ".*"
    }
  ],
  "topic_permissions": [
    
  ],
  "parameters": [
    {"value":{"ack-mode":"on-confirm","trust-user-id":false,"uri":"amqp://localhost/"},
        "vhost":"/","component":"federation-upstream","name":"test"}
  ],
  "policies": [
    
  ],
  "queues": [
    
  ],
  "exchanges": [
    
  ],
  "bindings": [
    
  ]
}

Log

2020-08-07 09:37:19.903 [error] <0.522.0> BOOT FAILED
BOOT FAILED
2020-08-07 09:37:19.904 [error] <0.522.0> ===========
===========
2020-08-07 09:37:19.904 [error] <0.522.0> Exception during startup:
Exception during startup:
2020-08-07 09:37:19.905 [error] <0.522.0> 

2020-08-07 09:37:19.906 [error] <0.522.0>     rabbit:do_run_postlaunch_phase/0 line 905
    rabbit:do_run_postlaunch_phase/0 line 905
    lists:foreach/2 line 1338
    rabbit:'-do_run_postlaunch_phase/0-fun-12-'/1 line 907
2020-08-07 09:37:19.906 [error] <0.522.0>     lists:foreach/2 line 1338
2020-08-07 09:37:19.907 [error] <0.522.0>     rabbit:'-do_run_postlaunch_phase/0-fun-12-'/1 line 907
    rabbit_boot_steps:run_boot_steps/1 line 20
2020-08-07 09:37:19.907 [error] <0.522.0>     rabbit_boot_steps:run_boot_steps/1 line 20
2020-08-07 09:37:19.907 [error] <0.522.0>     rabbit_boot_steps:'-run_boot_steps/1-lc$^0/1-0-'/1 line 19
2020-08-07 09:37:19.907 [error] <0.522.0>     rabbit_boot_steps:run_step/2 line 46
2020-08-07 09:37:19.908 [error] <0.522.0>     rabbit_boot_steps:'-run_step/2-lc$^0/1-0-'/2 line 43
    rabbit_boot_steps:'-run_boot_steps/1-lc$^0/1-0-'/1 line 19
    rabbit_boot_steps:run_step/2 line 46
    rabbit_boot_steps:'-run_step/2-lc$^0/1-0-'/2 line 43
exit:{error,<<"Validation failed\n\ncomponent federation-upstream not found\n (//federation-upstream/test)">>}
2020-08-07 09:37:19.908 [error] <0.522.0> exit:{error,<<"Validation failed\n\ncomponent federation-upstream not found\n (//federation-upstream/test)">>}

2020-08-07 09:37:19.909 [error] <0.522.0> 
2020-08-07 09:37:21.124 [info] <0.522.0> Stopping RabbitMQ applications and their dependencies in the following order:
    rabbitmq_federation_management
    rabbitmq_shovel_management
    rabbitmq_management
    rabbitmq_shovel
    rabbitmq_federation
    amqp_client
    rabbitmq_prometheus
    rabbitmq_web_dispatch
    cowboy
    cowlib
    rabbitmq_management_agent
    rabbitmq_peer_discovery_k8s
    rabbitmq_peer_discovery_common
    rabbitmq_auth_backend_ldap
    rabbitmq_consistent_hash_exchange
    rabbit
    rabbitmq_prelaunch
    rabbit_common
    sysmon_handler
    os_mon
    ra
    mnesia

@michaelklishin
Copy link
Contributor

@rerime which is rabbitmq/rabbitmq-server#2384, so you want to upgrade to 3.8.6 either way.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants