Resolving a GlusterFS Server Failure
Rebuilding a failed server and removing/re-adding bricks from/into a GlusterFS volume
Note: This is an old article and may contain content which is out of date.
Background
I set up four servers that have been built for running automated browser tests using Selenium. Each server runs a Selenium hub and multiple Xvnc desktops / Selenium nodes. The tests are run in Firefox, and the Firefox profile utilised by Selenium imports the Firebug extension. The profile also enables HTTP Archive logging. As there are four different servers in this Selenium cluster, in the event of a Selenium test failing, checking four different SMB shares for the correct HTTP Archive (HAR) file could be tedious. Thus, each server uses a distributed file system (GlusterFS) for storing the HAR files. Each server is configured as a ‘brick’ in a replicated GlusterFS volume. In fact there are two volumes: the second volume is used for storing files that some browser tests can upload if needed.
One of the nodes in the Selenium/GlusterFS cluster had suffered a hard disk failure. Thankfully, using four servers in the cluster gives plenty of redundancy.
I replaced the disk in the failed server and then reprovisioned it by temporarily withdrawing one of the other good servers from the cluster and cloning the disk. This is not recommended in a production environment. Ideally we would have the configuration for these servers written in Ansible, Chef or Puppet and stored in version control.
Before the clone server can be added back to the cluster, I needed to remove its existing GlusterFS configuration and then re-add it to the two GlusterFS volumes as a pair of new bricks. I couldn’t just change the hostname because GlusterFS uses UUIDs to identify each brick. The clone server would have had the same UUIDs for its bricks as the server it was cloned from, and thus any attempt to re-add its bricks to our two replicated GlusterFS volumes would fail.
This article describes the process I used in re-adding the clone server to the distributed GlusterFS cluster.
Removing the failed node and its bricks from the replicated GlusterFS volumes
Cloning the source node and restoring it to the repaired failed node will take some time. Whilst those operations are running, you can remove the dead GlusterFS bricks from the two GlusterFS volumes that we use.
-
On a third GlusterFS node, use the gluster command to retrieve some information about our gluster volumes.
$ sudo gluster volume info all

Volume Name: glusvol1
Type: Replicate
Volume ID: 09c0da39-d1b5-41ea-965b-0212ee316568
Status: Started
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: glusnode1.biscuit.ninja:/media/gluster-volume-1
Brick2: glusnode2.biscuit.ninja:/media/gluster-volume-1
Brick3: glusnode3.biscuit.ninja:/media/gluster-volume-1
Brick4: glusnode4.biscuit.ninja:/media/gluster-volume-1
Options Reconfigured:
server.allow-insecure: on
auth.allow: *

Volume Name: glusvol2
Type: Replicate
Volume ID: 3553fcf7-cf6f-49ee-8c15-e7e02a9309b7
Status: Started
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: glusnode1.biscuit.ninja:/media/gluster-volume-2
Brick2: glusnode2.biscuit.ninja:/media/gluster-volume-2
Brick3: glusnode3.biscuit.ninja:/media/gluster-volume-2
Brick4: glusnode4.biscuit.ninja:/media/gluster-volume-2
Options Reconfigured:
auth.allow: *
server.allow-insecure: on
-
In this example, glusnode1 is the failed node. We can confirm this:
$ sudo gluster peer status
Number of Peers: 3

Hostname: 192.168.5.51
Uuid: fe29cf69-45f5-476a-a542-686e136cf3fc
State: Peer in Cluster (Disconnected)

Hostname: glusnode3.biscuit.ninja
Uuid: 7aeb75c3-6d54-4a1d-b8f4-623598f8da4a
State: Peer in Cluster (Connected)

Hostname: glusnode2.biscuit.ninja
Uuid: 3ba486b1-86e5-4d8d-899d-b9f969aa9079
State: Peer in Cluster (Connected)
Note our first peer, 192.168.5.51 (glusnode1) is disconnected. Of course, the node we are using to perform these checks does not itself show in the list of peers, hence there are just three peers returned from the gluster peer command.
-
When we check the status of our volumes, the dead node is omitted from the results.
$ sudo gluster volume status all
Status of volume: glusvol1
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick glusnode2.biscuit.ninja:/media/gluster-volume-1   49153   Y       1391
Brick glusnode3.biscuit.ninja:/media/gluster-volume-1   49153   Y       1345
Brick glusnode4.biscuit.ninja:/media/gluster-volume-1   49153   Y       1326
NFS Server on localhost                                 2049    Y       1340
Self-heal Daemon on localhost                           N/A     Y       1345
NFS Server on glusnode3.biscuit.ninja                   2049    Y       2862
Self-heal Daemon on glusnode3.biscuit.ninja             N/A     Y       2880
NFS Server on glusnode2.biscuit.ninja                   2049    Y       1400
Self-heal Daemon on glusnode2.biscuit.ninja             N/A     Y       1405

There are no active volume tasks

Status of volume: glusvol2
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick glusnode2.biscuit.ninja:/media/gluster-volume-2   49152   Y       1386
Brick glusnode3.biscuit.ninja:/media/gluster-volume-2   49152   Y       2852
Brick glusnode4.biscuit.ninja:/media/gluster-volume-2   49152   Y       1331
NFS Server on localhost                                 2049    Y       1340
Self-heal Daemon on localhost                           N/A     Y       1345
NFS Server on glusnode3.biscuit.ninja                   2049    Y       2862
Self-heal Daemon on glusnode3.biscuit.ninja             N/A     Y       2880
NFS Server on glusnode2.biscuit.ninja                   2049    Y       1400
Self-heal Daemon on glusnode2.biscuit.ninja             N/A     Y       1405
-
Now we’ve established the state of play, go ahead and remove the failed brick from each GlusterFS volume using the Gluster command:
$ sudo gluster volume remove-brick glusvol1 replica 3 glusnode1.biscuit.ninja:/media/gluster-volume-1
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit force: success

$ sudo gluster volume remove-brick glusvol2 replica 3 glusnode1.biscuit.ninja:/media/gluster-volume-2
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit force: success
-
Detach the failed node:
$ sudo gluster peer detach glusnode1
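Before moving on, it is worth a quick sanity check that both volumes now report three bricks. A simple way (just filtering the standard volume info output; the brick count should now read 1 x 3 = 3 for each volume) is:
$ sudo gluster volume info all | grep -E "Volume Name|Number of Bricks"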
You can now close this SSH session as we turn our attention to the rebuilt failed node.
Change the hostname and update the hosts file on our clone server
Assuming that you have now successfully repaired, cloned and rebuilt the failed node, we need to make some configuration changes before connecting it to the network. Boot it up and, using the console, execute the following:
-
Change the hostname
$ sudo vi /etc/hostname
Assuming we’ve cloned from glusnode2, change
glusnode2
to
glusnode1
-
Update the hosts file
$ sudo vi /etc/hosts
Change
127.0.1.1 glusnode2
to
127.0.1.1 glusnode1
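If you prefer a one-liner to editing both files by hand, a sed substitution along these lines should achieve the same result (assuming, as above, that the clone was taken from glusnode2):
$ sudo sed -i 's/glusnode2/glusnode1/g' /etc/hostname /etc/hosts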
Removing the existing GlusterFS bricks from our clone server
Before we remove the old bricks from the clone server, we need to stop the GlusterFS volumes from being mounted automatically. You will also see from the fstab entries below that our GlusterFS bricks are stored within the root partition.
-
Remove the GlusterFS volumes from fstab by commenting out any lines that begin with localhost
$ sudo vi /etc/fstab
Change
localhost:/glusvol1 /var/vol1mnt glusterfs defaults,nobootwait,_netdev,fetch-attempts=10 0 2
localhost:/glusvol2 /var/log/vol2mnt glusterfs defaults,nobootwait,_netdev,fetch-attempts=10 0 2
To
#localhost:/glusvol1 /var/vol1mnt glusterfs defaults,nobootwait,_netdev,fetch-attempts=10 0 2
#localhost:/glusvol2 /var/log/vol2mnt glusterfs defaults,nobootwait,_netdev,fetch-attempts=10 0 2
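Alternatively, rather than editing the file by hand, something like this sed command should comment out both lines in one go:
$ sudo sed -i 's|^localhost:/glusvol|#localhost:/glusvol|' /etc/fstab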
-
Delete the old physical GlusterFS volumes (I’m not interested in saving any data):
$ sudo rm -rf /media/gluster*
-
Optionally, clear out the logs folder. Obviously these logs really belong to the source node which we cloned to rebuild the failed node, and leaving them behind could confuse the issue. You may want to go a step further and remove the contents of /var/log. If you have logrotate configured for all of your log files, you could run "logrotate --force" and remove any files suffixed .1, .2.gz and so on.
$ sudo rm -rf /var/log/glusterfs/*
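If you do want to take the logrotate route, a rough sketch would be to force a rotation and then delete the rotated files; adjust the filename patterns to suit your own logrotate configuration:
$ sudo logrotate --force /etc/logrotate.conf
$ sudo find /var/log -type f \( -name "*.1" -o -name "*.gz" \) -delete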
-
Delete existing GlusterFS peer/volume/brick metadata. GlusterFS will re-initialise this folder structure when it is restarted.
$ sudo rm -rf /var/lib/glusterd/*
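Incidentally, the peer UUID mentioned earlier lives in /var/lib/glusterd/glusterd.info. If you inspect that file on the clone before wiping this folder, you should see the same UUID as the node it was cloned from (glusnode2 in this example), which is exactly why the metadata has to go:
$ sudo cat /var/lib/glusterd/glusterd.info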
-
Recreate the folders used as glusterfs volumes:
$ sudo mkdir /media/gluster-volume-1
$ sudo mkdir /media/gluster-volume-2
$ sudo chmod 777 /media/gluster-volume-*
It isn’t absolutely necessary to delete and recreate these folders. The alternative is removing the file attributes glusterfs cares about. It’s arguably slightly more typing. If you want to retain the folders rather than deleting them and recreating them, you can try:
$ sudo setfattr -x trusted.glusterfs.volume-id /media/gluster-volume-1
$ sudo setfattr -x trusted.gfid /media/gluster-volume-1
$ sudo rm -rf /media/gluster-volume-1/.glusterfs
$ sudo setfattr -x trusted.glusterfs.volume-id /media/gluster-volume-2
$ sudo setfattr -x trusted.gfid /media/gluster-volume-2
$ sudo rm -rf /media/gluster-volume-2/.glusterfs
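To double-check that the extended attributes really are gone (or to see what was set in the first place), getfattr can dump them. This is purely an optional check:
$ sudo getfattr -d -m . -e hex /media/gluster-volume-1
$ sudo getfattr -d -m . -e hex /media/gluster-volume-2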
-
That’s our node cleaned up. Now, assuming that the node relies on DHCP, we can shut it down and reconnect it to the network:
$ sudo shutdown now
If the node relies on a static IP configuration, then you will need to update "/etc/network/interfaces" with the correct IP address, otherwise the node will cause an IP address conflict with the node from which it was cloned.
If the machine fails to connect to the network, it’s likely that the ethernet interface has a different logical name, for example it may now be called eth1 instead of eth0. You can get the logical name of the network adapter with:
$ lshw -class network | grep "logical name"
You can check the returned result against /etc/network/interfaces. If you see references to a different logical interface, then you can amend as appropriate.
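For reference, a static stanza in /etc/network/interfaces looks something like the following. The address shown is this article's example address for glusnode1; the netmask and gateway are placeholders to substitute with your own values:
auto eth0
iface eth0 inet static
    address 192.168.5.51
    netmask 255.255.255.0
    gateway 192.168.5.1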
Add the clone server back into the GlusterFS cluster
-
We’re now in a position to reconfigure GlusterFS on our clone server. Start a new SSH session and confirm the GlusterFS service is running:
$ sudo service glusterfs-server status
glusterfs-server start/running, process 5721
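Note that the output above comes from an upstart-based Ubuntu release. On a newer, systemd-based system the equivalent check would be along the lines of (the unit may be named glusterd or glusterfs-server, depending on the package):
$ sudo systemctl status glusterd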
-
Start a new SSH session on an existing good node and execute:
$ sudo gluster peer probe <ip address>
Substitute the network address of the rebuilt node, e.g.
$ sudo gluster peer probe 192.168.5.51
peer probe: success
Running “gluster peer status” confirms the node has been re-added:
$ sudo gluster peer status
Number of Peers: 3
Hostname: glusnode3.biscuit.ninja
Uuid: 7aeb75c3-6d54-4a1d-b8f4-623598f8da4a
State: Peer in Cluster (Connected)
Hostname: glusnode2.biscuit.ninja
Uuid: 3ba486b1-86e5-4d8d-899d-b9f969aa9079
State: Peer in Cluster (Connected)
Hostname: 192.168.5.51
Port: 24007
Uuid: 8fccf14e-4f84-44e8-9eeb-6d2d2b23e932
State: Peer in Cluster (Connected)
-
Now we can re-add the bricks for both GlusterFS volumes served from our rebuilt node:
$ sudo gluster volume add-brick glusvol1 replica 4 192.168.5.51:/media/gluster-volume-1 force
volume add-brick: success
$ sudo gluster volume add-brick glusvol2 replica 4 192.168.5.51:/media/gluster-volume-2 force
volume add-brick: success
The force option is necessary because we have created our GlusterFS bricks within the root file system, which isn’t recommended. These servers are not in a production environment; they are built from recycled components with a high degree of redundancy, which makes this compromise acceptable.
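Depending on your GlusterFS version, replication onto the new bricks may only happen lazily as files are accessed or healed in the background. If you want to proactively push the existing data onto the new bricks, you can try triggering a full self-heal (exact behaviour varies between versions):
$ sudo gluster volume heal glusvol1 full
$ sudo gluster volume heal glusvol2 full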
-
Check the status of the new bricks:
$ sudo gluster volume info all
Volume Name: glusvol1
Type: Replicate
Volume ID: 09c0da39-d1b5-41ea-965b-0212ee316568
Status: Started
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: glusnode2.biscuit.ninja:/media/gluster-volume-1
Brick2: glusnode3.biscuit.ninja:/media/gluster-volume-1
Brick3: glusnode4.biscuit.ninja:/media/gluster-volume-1
Brick4: 192.168.5.51:/media/gluster-volume-1
Options Reconfigured:
auth.allow: *
server.allow-insecure: on

Volume Name: glusvol2
Type: Replicate
Volume ID: 3553fcf7-cf6f-49ee-8c15-e7e02a9309b7
Status: Started
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: glusnode2.biscuit.ninja:/media/gluster-volume-2
Brick2: glusnode3.biscuit.ninja:/media/gluster-volume-2
Brick3: glusnode4.biscuit.ninja:/media/gluster-volume-2
Brick4: 192.168.5.51:/media/gluster-volume-2
Options Reconfigured:
server.allow-insecure: on
auth.allow: *

$ sudo gluster peer status
Number of Peers: 3

Hostname: glusnode4.biscuit.ninja
Port: 24007
Uuid: aaa72f7a-ea87-4bc1-beda-e95f7aff4398
State: Peer in Cluster (Connected)

Hostname: glusnode3.biscuit.ninja
Uuid: 7aeb75c3-6d54-4a1d-b8f4-623598f8da4a
State: Peer in Cluster (Connected)

Hostname: glusnode2.biscuit.ninja
Uuid: 3ba486b1-86e5-4d8d-899d-b9f969aa9079
State: Peer in Cluster (Connected)

$ sudo gluster volume status all
Status of volume: glusvol1
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick glusnode2.biscuit.ninja:/media/gluster-volume-1   49153   Y       1391
Brick glusnode3.biscuit.ninja:/media/gluster-volume-1   49153   Y       1345
Brick glusnode4.biscuit.ninja:/media/gluster-volume-1   49153   Y       1326
Brick 192.168.5.51:/media/gluster-volume-1              49152   Y       10332
NFS Server on localhost                                 2049    Y       10537
Self-heal Daemon on localhost                           N/A     Y       10544
NFS Server on glusnode4.biscuit.ninja                   2049    Y       15518
Self-heal Daemon on glusnode4.biscuit.ninja             N/A     Y       15525
NFS Server on glusnode3.biscuit.ninja                   2049    Y       17375
Self-heal Daemon on glusnode3.biscuit.ninja             N/A     Y       17400
NFS Server on glusnode2.biscuit.ninja                   2049    Y       19522
Self-heal Daemon on glusnode2.biscuit.ninja             N/A     Y       19535

There are no active volume tasks

Status of volume: glusvol2
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick glusnode2.biscuit.ninja:/media/gluster-volume-2   49152   Y       1386
Brick glusnode3.biscuit.ninja:/media/gluster-volume-2   49152   Y       2852
Brick glusnode4.biscuit.ninja:/media/gluster-volume-2   49152   Y       1331
Brick 192.168.5.51:/media/gluster-volume-2              49153   Y       10518
NFS Server on localhost                                 2049    Y       10537
Self-heal Daemon on localhost                           N/A     Y       10544
NFS Server on glusnode4.biscuit.ninja                   2049    Y       15518
Self-heal Daemon on glusnode4.biscuit.ninja             N/A     Y       15525
NFS Server on glusnode3.biscuit.ninja                   2049    Y       17375
Self-heal Daemon on glusnode3.biscuit.ninja             N/A     Y       17400
NFS Server on glusnode2.biscuit.ninja                   2049    Y       19522
Self-heal Daemon on glusnode2.biscuit.ninja             N/A     Y       19535

There are no active volume tasks
I’ve run these checks from our clone server, hence the slight variation when compared to the checks run earlier. All is looking healthy and listing the contents of /media/gluster-volume-2 shows data is getting synchronised into our new brick:
$ ls /media/gluster-volume-2/
2015-09-28
-
We can now remount the GlusterFS volumes locally. Edit /etc/fstab and uncomment the two lines beginning “#localhost:/”
$ sudo vi /etc/fstab
Change
#localhost:/glusvol2 /var/log/vol2mnt glusterfs defaults,nobootwait,_netdev,fetch-attempts=10 0 2
#localhost:/glusvol1 /var/vol1mnt glusterfs defaults,nobootwait,_netdev,fetch-attempts=10 0 2
to
localhost:/glusvol2 /var/log/vol2mnt glusterfs defaults,nobootwait,_netdev,fetch-attempts=10 0 2
localhost:/glusvol1 /var/vol1mnt glusterfs defaults,nobootwait,_netdev,fetch-attempts=10 0 2
-
Mount the GlusterFS volumes:
$ sudo mount -a
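To confirm that both volumes mounted cleanly, a quick check along these lines will do:
$ mount | grep glusterfs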
The clone server is now ready to be re-added to the Selenium cluster. If there’s a lot of data to be replicated onto the clone server, you may want to wait for the synchronisation to complete prior to re-adding it.
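One way to keep an eye on the synchronisation is to ask the self-heal daemon which entries are still pending; when both volumes report no outstanding entries, the new bricks are up to date. Again, the exact output varies between GlusterFS versions:
$ sudo gluster volume heal glusvol1 info
$ sudo gluster volume heal glusvol2 info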