blog.rulez.org: September 2009

After having a basic look on Virtual IP failover in a cluster environment with Pacemaker, now let's see how can we set up a shared device with ext3 filesystem with the use of drbd. In this example, we will use a Master/Slave cluster settings with Primary/Secondary drbd. On the master cluster node we'll have the /var/www ext3fs filesystem, on the slave node, we won't have anything until the fs resource will migrate to there.

Now let's see how can we achieve it with Pacemaker!

At first, I built the drbd package from the Linbit's git repository because it supports primary/primary connection (what we doesn't use now), and it has debian directory to build deb package from: git://git.drbd.org/drbd-8.3.git . After installing the packages, in this example I'll use a bare image file as block device with the help of losetup for mapping device file to this image.

Let's create the image file on every cluster node (c01, c02):

root@c01:~# dd if=/dev/zero of=/drbd_block.img bs=1M count=2048
root@c01:~# dd if=/dev/zero of=/drbd_block.img bs=1M count=2048

And this way we created a 2G images. Now, let's assign these image files to a device on every cluster node. Since we need to do that at every single boot time, I recommend to put it there:

/etc/rc.local

losetup /dev/loop0 /drbd_block.img

for now, you can simply run this command, then you can check the results:

root@c01:~# losetup -a
/dev/loop0: [fe00]:24971 (/drbd_block.img)
root@c02:~# losetup -a
/dev/loop0: [fe00]:24849 (/drbd_block.img)

It is successfully assigned on both nodes. This step is needed since the drbd requires block device to work with, so let's configure the drbd on every nodes:

root@c02:~# cat /etc/drbd.conf |grep -v ^$
global {
usage-count no;
}
common {
#protocol C;
}
resource r0 {
device /dev/drbd0;
disk /dev/loop0;
meta-disk internal;
protocol C;
on c01 {
address 192.168.0.1:7789;
#flexible-meta-disk internal;
}
on c02 {
address 192.168.0.2:7789;
#meta-disk internal;
}
net {
allow-two-primaries;
#for GFS2 or OCFS2:
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;
}

startup {
become-primary-on both;
}
syncer {
verify-alg crc32c;
rate 40M;
}
handlers {
split-brain "/usr/lib/drbd/notify-split-brain.sh root";
#pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
#pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
#local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
#outdate-peer "/usr/sbin/drbd-peer-outdater";
fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}
disk {
# on-io-error detach;
fencing resource-only;
}
}

The explanation: we'll create the r0 assigned to /dev/loop0 which is already assigned to /drbd_block.img file. We'll use Protocol C which requires a full acknowledgement from the peer that they all have all changes written to the device. We tuned to 40M the bandwith, note, that the default is so low, so on recent net we need much more. We'll use crc32c which is probably the fastest hash algorythm available in drbd. And however this configuration is prepared to primary/primary with cluster enabled file system, we won't need it - but it's OK for us for now.

New let's check wether the drbd kernel module is loaded successfully. If not, please:

root@c01:~# modprobe drbd
root@c01:~# cat /proc/drbd
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by root@c01, 2009-09-09 11:13:45
0: cs:Unconfigured

Let's create the device meta-data:

root@c01:~# drbdadm create-md r0
You want me to create a v08 style flexible-size internal meta data block.
There apears to be a v08 flexible-size internal meta data block
already in place on /dev/loop0 at byte offset 2147479552
Do you really want to overwrite the existing v08 meta-data?
[need to type 'yes' to confirm] yes

Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.

Note, I typed yes, because I wanted to overwrite a pre-existing meta data in this resource r0. I recommed to do that on only the wannabe primary node.

Now let's attach the device and set up the net for it:

root@c01:~# drbdadm attach r0
root@c01:~# drbdadm connect r0
root@c01:~# cat /proc/drbd
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by root@c01, 2009-09-09 11:13:45
0: cs:WFConnection ro:Secondary/Unknown ds:Inconsistent/DUnknown C r----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:2097052

or w/o attach and connect, you can simply use drbdadm up r0. Now, you can see, it is Wait For Connection (WFConnection) state. Note, that the data storage is inconsistent and we don't know anything about peer's. Let's set up the peer:

root@c02:~# drbdadm up r0
root@c02:~# cat /proc/drbd
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by root@c02, 2009-09-09 11:16:15
0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:2097052

Now you can see, they're connected, but inconsistent. So let's force the full syncronization from the primary node (c01):

root@c01:~# drbdadm -- --overwrite-data-of-peer primary r0
root@c01:~# cat /proc/drbd
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by root@c01, 2009-09-09 11:13:45
0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
ns:12324 nr:0 dw:0 dr:20688 al:0 bm:0 lo:0 pe:1 ua:2024 ap:0 ep:1 wo:b oos:2084728
[>....................] sync'ed: 0.8% (2084728/2097052)K
finish: 0:02:48 speed: 12,324 (12,324) K/sec

You can check the progress bar periodically (on any host). When it's done:

root@c02:~# cat /proc/drbd
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by root@c02, 2009-09-09 11:16:15
0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----
ns:0 nr:2097052 dw:2097052 dr:0 al:0 bm:128 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

You can see here that both local and remote host is UpToDate.

Let's format this newly created drbd device on the primary node (c01):

root@c01:~# mkfs.ext3 /dev/drbd/by-res/r0
mke2fs 1.40.8 (13-Mar-2008)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
131072 inodes, 524263 blocks
26213 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=536870912
16 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912

Writing inode tables: done
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 33 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.

And now we could mount it by mount -t ext3 /dev/drbd/by-res/r0 /mnt but now we won't.

Now we have to make sure that the Pacemaker crm will only handle this resource so remove or start with exit 0 in drbd init script:

root@c01:~# head -2 /etc/init.d/drbd
#!/bin/bash
exit 0

Make sure that we release the r0 on every nodes:

root@c01:~# drbdadm down r0
root@c02:~# drbdadm down r0

And see the crm config!

primitive drbd ocf:linbit:drbd \
params drbd_resource="r0" \
op monitor interval="9s" role="Master" timeout="30s" \
op monitor interval="11s" role="Slave" timeout="30s" \
meta target-role="Stopped"

This will instruct the /usr/lib/ocf/resource.d/linbit/drbd script to turn the master/slave aka. primary/secondary roles. I shave set up the monitor intervals differently to see the transitions better.

Let's setup a master/slave resource for it:

ms ms-drbd drbd \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Stopped"

Where we declare that on every node and in the whole cluster environment there can be only one master and the clone slave aka. secondary can be running on other node only.

Now, the file system primitive:

primitive fs ocf:heartbeat:Filesystem \
params device="/dev/drbd/by-res/r0" directory="/var/www" fstype="ext3" \
meta target-role="Stopped"

Note, we will mount this drbd device with ext3 fs under /var/www.

Make sure that the file system will be mounted on the master (aka. primary drbd) when the drbd device is ready:

order ms-drbd-before-fs inf: ms-drbd:promote fs:start

And making sure that the filesystem lives where the drbd master is up, so if we migrate one of the to another cluster node, the fs will move with it:

colocation coloc-fs-drbd inf: fs ms-drbd:Master

In this example, as well, since we're using two nodes, the quorum will fail, so there are two options to avoid this pitfall:

crm(live)configure# property no-quorum-policy="ignore"

or

crm(live)configure# property expected-quorum-votes="1"

Which means that if we have only one node online, it will be decision capable ie. the Domain Co-ordinator.

Now let's start up the resources:

root@c01:~# crm
crm(live)# status

============
Last updated: Sat Sep 26 21:48:05 2009
Stack: openais
Current DC: c01 - partition with quorum
Version: 1.0.4-2609e060ce0c516c95ae31f44a10fed0202abfb6
2 Nodes configured, 1 expected votes
4 Resources configured.
============

Online: [ c01 c02 ]

vip (ocf::heartbeat:IPaddr): Started c01
vip2 (ocf::heartbeat:IPaddr2): Started c01

crm(live)# resource start drbd
crm(live)# status
[..]
Master/Slave Set: ms-drbd
Slaves: [ c02 ]
Stopped: [ drbd:0 ]
crm(live)# status
[..]
Master/Slave Set: ms-drbd
Slaves: [ c01 c02 ]
crm(live)# status
[..]
Master/Slave Set: ms-drbd
Masters: [ c01 ]
Slaves: [ c02 ]

Note the transitions by hitting status command before the drbd resource up&working!

Then starting the filesystem ontop the running master drbd:

crm(live)# resource start fs
crm(live)# status

============
Last updated: Sat Sep 26 21:52:39 2009
Stack: openais
Current DC: c01 - partition with quorum
Version: 1.0.4-2609e060ce0c516c95ae31f44a10fed0202abfb6
2 Nodes configured, 1 expected votes
4 Resources configured.
============

Online: [ c01 c02 ]

Master/Slave Set: ms-drbd
Masters: [ c01 ]
Slaves: [ c02 ]
fs (ocf::heartbeat:Filesystem): Started c01
vip (ocf::heartbeat:IPaddr): Started c01
vip2 (ocf::heartbeat:IPaddr2): Started c01

Now let's see how to migrate by hand from c01 to c02 the fs:
We're creating a test file on c01:

root@c01:~# ls -la /var/www/
total 24
drwxr-xr-x 3 root root 4096 2009-09-26 21:24 .
drwxr-xr-x 14 root root 4096 2009-09-25 06:26 ..
drwx------ 2 root root 16384 2009-09-26 21:24 lost+found
root@c01:~# echo "Big bada boom" > /var/www/moo
root@c01:~# ls -la /var/www/moo
-rw-r--r-- 1 root root 14 2009-09-26 21:55 /var/www/moo

Then migrating the fs by:

crm(live)# resource migrate fs c02

[a few seconds later:]

crm(live)# status

============
Last updated: Sat Sep 26 21:56:00 2009
Stack: openais
Current DC: c01 - partition with quorum
Version: 1.0.4-2609e060ce0c516c95ae31f44a10fed0202abfb6
2 Nodes configured, 1 expected votes
4 Resources configured.
============

Online: [ c01 c02 ]

Master/Slave Set: ms-drbd
Masters: [ c02 ]
Slaves: [ c01 ]
fs (ocf::heartbeat:Filesystem): Started c02
vip (ocf::heartbeat:IPaddr): Started c01
vip2 (ocf::heartbeat:IPaddr2): Started c01

Here we go. Checking this on c02:

root@c02:~# ls -la /var/www/
total 28
drwxr-xr-x 3 root root 4096 2009-09-26 21:55 .
drwxr-xr-x 14 root root 4096 2009-09-25 06:30 ..
drwx------ 2 root root 16384 2009-09-26 21:24 lost+found
-rw-r--r-- 1 root root 14 2009-09-26 21:55 moo
root@c02:~# cat /var/www/moo
Big bada boom

Now we can simply test the cluster if we push a power button to turn off the machine (or poweroff in Virtual Machine as I did by VBoxManage controlvm c02 poweroff) and you'll see that the other node will take over the resources if there is no location contraint defined to not to do that.

Now the recap of my current testing crm config, where I defined the resource-stickiness to pretty high, to avoid any further resource takeover if the crashed/poweroff'ed node comes up again, because this behaviour can cause more downtime.

root@c02:~# crm configure show
node c01 \
attributes standby="off"
node c02 \
attributes standby="off"
primitive drbd ocf:linbit:drbd \
params drbd_resource="r0" \
op monitor interval="9s" role="Master" timeout="30s" \
op monitor interval="11s" role="Slave" timeout="30s" \
meta target-role="Started"
primitive fs ocf:heartbeat:Filesystem \
params device="/dev/drbd/by-res/r0" directory="/var/www" fstype="ext3" \
meta target-role="Started"
primitive vip ocf:heartbeat:IPaddr \
params ip="10.30.49.254" \
op monitor interval="10s" \
meta target-role="Started"
primitive vip2 ocf:heartbeat:IPaddr2 \
params ip="10.30.49.253" nic="eth0" cidr_netmask="16" \
meta target-role="Started" \
op monitor interval="10s"
ms ms-drbd drbd \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Stopped"
location cli-prefer-fs fs \
rule $id="cli-prefer-rule-fs" inf: #uname eq c02
location cli-prefer-vip vip \
rule $id="cli-prefer-rule-vip" inf: #uname eq c02
location cli-prefer-vip2 vip2 \
rule $id="cli-prefer-rule-vip2" inf: #uname eq c01
colocation coloc-fs-drbd inf: fs ms-drbd:Master
colocation vip-with-vip2 inf: vip vip2
order ms-drbd-before-fs inf: ms-drbd:promote fs:start
property $id="cib-bootstrap-options" \
dc-version="1.0.4-2609e060ce0c516c95ae31f44a10fed0202abfb6" \
cluster-infrastructure="openais" \
expected-quorum-votes="1" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
start-failure-is-fatal="false" \
stonith-action="reboot" \
last-lrm-refresh="1254001497"
rsc_defaults $id="rsc-options" \
resource-stickiness="100000"

I hope you find these short tutorials useful!

I played with Linux High Availability solutions - on Debian/Ubuntu.

Including Red Hat Cluster Suite, Heartbeat, and the relatively newcomer player, probably the successor, the Pacemaker/Corosync/OpenAIS with heartbeat commons (for scripting resources).

I had the best impressions about Pacemaker. Its configuration and manageability is far the bes in my humble opinion, and therefore should provide the upmost uptime counters.

I started to play with the official ubuntu ha repository on two ubuntu hardy virtualbox virtual machine named by c01 and c02 with will be the part of the cluster service:

root@c01:~# cat /etc/apt/sources.list.d/ubuntu-ha.list
deb http://ppa.launchpad.net/ubuntu-ha/ppa/ubuntu hardy main
deb-src http://ppa.launchpad.net/ubuntu-ha/ppa/ubuntu hardy main

I have put it to the file above to separate it from the other debian sources.

Then I installed:

root@c01:~# apt-get install pacemaker-openais
root@c02:~# apt-get install pacemaker-openais

As you can see, I installed it on both cluster nodes.

Since openais and corosync packages has been split up their main configuration

root@c01:~# ls -la /etc/corosync/authkey /etc/corosync/corosync.conf
lrwxrwxrwx 1 root root 16 2009-09-21 09:16 /etc/corosync/authkey -> /etc/ais/authkey
lrwxrwxrwx 1 root root 21 2009-09-21 09:16 /etc/corosync/corosync.conf -> /etc/ais/openais.conf

And note, that the authkey has been created by corosync-keygen after I linked the /dev/random to point to /dev/urandom unless keygen won't work if you have no /dev/random.

In this scenario I have set up two ethernet links between the two (virtual) boxes since the HA cluster service stringly recommends two independent media for communication, in this example, 192.168.0.0/24 and 192.168.1.0/24.

So my corosync/openais config file /etc/corosync/corosync.conf file looks like this:

totem {
version: 2
# How long before declaring a token lost (ms)
token: 10000
# How many token retransmits before forming a new configuration
token_retransmits_before_loss_const: 20
# How long to wait for join messages in the membership protocol (ms)
join: 60
# How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
consensus: 4800
# Turn off the virtual synchrony filter
vsftype: none
# Number of messages that may be sent by one processor on receipt of the token
max_messages: 20
# Disable encryption
secauth: off
# How many threads to use for encryption/decryption
threads: 0

# Limit generated nodeids to 31-bits (positive signed integers)
clear_node_high_bit: yes

# Optionally assign a fixed node id (integer)
# nodeid: 1234
rrp_mode: passive
interface {
ringnumber: 0
bindnetaddr: 192.168.0.0
#mcastaddr: 226.94.1.1
mcastaddr: 225.0.0.1
mcastport: 5405
}
interface {
ringnumber: 1
bindnetaddr: 192.168.1.0
mcastaddr: 225.0.1.1
mcastport: 5405
}
}

logging {
fileline: off
to_syslog: yes
to_stderr: no
syslog_facility: daemon
debug: on
timestamp: on
}

amf {
mode: disabled
}
service {
# Load the Pacemaker Cluster Resource Manager
name: pacemaker
ver: 0
use_mgmtd: yes
}
aisexec {
user: root
group: root
}

Note that the service is configured to use pacemaker as CRM (Cluster Resource Manager) and I'll be using the mcast settings for multicast, and I have two interface subconfig since as I mentioned earlier, in this scenario, I have two lan connected boxes.

Then let's stop/start the services on both nodes by:

/etc/init.d/corosync stop
/etc/init.d/corosync start

And check wether we can use the crm command to communicate with the cluster. If you see similar, then we're good:

root@c01:~# crm status

============
Last updated: Sat Sep 26 18:47:41 2009
Stack: openais
Current DC: c01 - partition with quorum
Version: 1.0.4-2609e060ce0c516c95ae31f44a10fed0202abfb6
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ c01 c02 ]

Watch out, we see two online machines.

Now let's set up a Virtual IPs (abbreviated as VIPs) to the cluster:

root@c01:~# crm
crm(live)# configure
crm(live)configure#
primitive vip ocf:heartbeat:IPaddr \
params ip="10.30.49.254" \
op monitor interval="10s" \
meta target-role="Started"
primitive vip2 ocf:heartbeat:IPaddr2 \
params ip="10.30.49.253" nic="eth0" cidr_netmask="16" \
meta target-role="Started" \
op monitor interval="10s"
crm(live)configure# commit

Where at the first example, named as vip we configure to use the heartbeat kind of Open Cluster Format scripts, located at the physical file system: /usr/lib/ocf/resource.d/heartbeat/IPaddr with will be monitored in every 10th seconds and will start up immediately after we commit the config changes. For the vip2 which is a slightly more advanced script (located at /usr/lib/ocf/resource.d/heartbeat/IPaddr2 ) will use a well defined netmask (cidr) and interface name, where this IP will be located (at the first unallocated one eg. eth0:1).

In this example, as well, since we're using two nodes, the quorum will fail, so there are two options to avoid this pitfall:

crm(live)configure# property no-quorum-policy="ignore"

or

crm(live)configure# property expected-quorum-votes="1"

Let's see how it will migrate (by manual failover) from one cluster node to another one:

crm(live)configure# cd
crm(live)# status

============
Last updated: Sat Sep 26 18:58:48 2009
Stack: openais
Current DC: c01 - partition with quorum
Version: 1.0.4-2609e060ce0c516c95ae31f44a10fed0202abfb6
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ c01 c02 ]

vip (ocf::heartbeat:IPaddr): Started c01
vip2 (ocf::heartbeat:IPaddr2): Started c01

This shows, that the resource vip and vip2 started successfully, you can check the interface list:

root@c01:~# ip addr
[..]
5: eth0: mtu 1500 qdisc pfifo_fast qlen 1000
link/ether 08:00:27:63:61:9a brd ff:ff:ff:ff:ff:ff
inet 10.30.49.1/16 brd 10.30.255.255 scope global eth0
inet 10.30.49.253/16 brd 10.30.255.255 scope global secondary eth0
inet 10.30.49.254/16 brd 10.30.255.255 scope global secondary eth0:0

Now let's see with the manual failover!

crm(live)# resource migrate vip c02
crm(live)# status
[..]
Online: [ c01 c02 ]

vip (ocf::heartbeat:IPaddr): Started c02
vip2 (ocf::heartbeat:IPaddr2): Started c01

Voilà, it works, you can see that vip went to c02. You can check it on its own interface list.

Now, let's see how to bond together resources, in this example the VIPs:

crm(live)configure# colocation vip-with-vip2 inf: vip vip2
crm(live)configure# commit
crm(live)configure# cd
crm(live)# status
[..]
vip (ocf::heartbeat:IPaddr): Started c01
vip2 (ocf::heartbeat:IPaddr2): Started c01

From now, if you migrate one of them to another place, the other vip will go with it as well.

blog.rulez.org

Tuesday, September 29, 2009

High Availability #2

Monday, September 28, 2009

High Availability #1

Intense Debate Comments

Blog Archive