Tuesday, September 29, 2009

High Availability #2

After having had a basic look at Virtual IP failover in a Pacemaker cluster, let's now see how we can set up a shared device with an ext3 filesystem using DRBD. In this example, we will use a Master/Slave cluster setup with a Primary/Secondary DRBD device. On the master cluster node we'll have the ext3 filesystem mounted on /var/www; the slave node won't have anything until the fs resource migrates over to it.

Now let's see how we can achieve this with Pacemaker!

First, I built the drbd packages from Linbit's git repository, because that version supports primary/primary connections (which we won't use now) and the tree contains a debian directory to build deb packages from: git://git.drbd.org/drbd-8.3.git . After installing the packages, in this example I'll use a bare image file as the block device, with losetup mapping a loop device to this image.
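For reference, the build went roughly along these lines - a sketch, the exact dpkg-buildpackage flags may differ depending on the state of the tree:

root@c01:~# git clone git://git.drbd.org/drbd-8.3.git
root@c01:~# cd drbd-8.3
root@c01:~/drbd-8.3# dpkg-buildpackage -rfakeroot -b -uc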

Let's create the image file on every cluster node (c01, c02):

root@c01:~# dd if=/dev/zero of=/drbd_block.img bs=1M count=2048
root@c02:~# dd if=/dev/zero of=/drbd_block.img bs=1M count=2048
This way we created a 2G image on each node. Now let's assign these image files to a loop device on every cluster node. Since this has to happen at every boot, I recommend putting it into:

/etc/rc.local
losetup /dev/loop0 /drbd_block.img
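If you prefer rc.local to be safe to run more than once, here is a slightly more defensive variant - a sketch that relies on losetup exiting non-zero when queried about an unconfigured device:

# attach the image only if /dev/loop0 is not already set up
losetup /dev/loop0 >/dev/null 2>&1 || losetup /dev/loop0 /drbd_block.img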
For now, you can simply run the losetup command by hand on both nodes, then check the results:

root@c01:~# losetup -a
/dev/loop0: [fe00]:24971 (/drbd_block.img)
root@c02:~# losetup -a
/dev/loop0: [fe00]:24849 (/drbd_block.img)
It is successfully assigned on both nodes. This step is needed since DRBD requires a block device to work with. Now let's configure DRBD on every node:

root@c02:~# cat /etc/drbd.conf |grep -v ^$
global {
    usage-count no;
}
common {
    #protocol C;
}
resource r0 {
    device    /dev/drbd0;
    disk      /dev/loop0;
    meta-disk internal;
    protocol  C;
    on c01 {
        address 192.168.0.1:7789;
        #flexible-meta-disk internal;
    }
    on c02 {
        address 192.168.0.2:7789;
        #meta-disk internal;
    }
    net {
        allow-two-primaries;
        # for GFS2 or OCFS2:
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    startup {
        become-primary-on both;
    }
    syncer {
        verify-alg crc32c;
        rate 40M;
    }
    handlers {
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        #pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
        #pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
        #local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
        #outdate-peer "/usr/sbin/drbd-peer-outdater";
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    disk {
        # on-io-error detach;
        fencing resource-only;
    }
}
The explanation: we create the resource r0 backed by /dev/loop0, which is already assigned to the /drbd_block.img file. We use protocol C, which requires a full acknowledgement from the peer that all changes have been written to its device. We tuned the bandwidth up to 40M; note that the default is quite low, so on a recent network we need much more. We use crc32c, which is probably the fastest hash algorithm available in DRBD. And although this configuration is prepared for primary/primary with a cluster-enabled filesystem, we won't need that here - but it does no harm for now.
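Before going further, it's worth letting drbdadm parse the configuration and dump back its view of it; if it prints the resource without complaints, the syntax is fine:

root@c01:~# drbdadm dump r0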

Now let's check whether the drbd kernel module is loaded successfully. If not, load it:

root@c01:~# modprobe drbd
root@c01:~# cat /proc/drbd
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by root@c01, 2009-09-09 11:13:45
0: cs:Unconfigured

Let's create the device meta-data:

root@c01:~# drbdadm create-md r0
You want me to create a v08 style flexible-size internal meta data block.
There appears to be a v08 flexible-size internal meta data block
already in place on /dev/loop0 at byte offset 2147479552
Do you really want to overwrite the existing v08 meta-data?
[need to type 'yes' to confirm] yes

Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
Note that I typed yes because I wanted to overwrite the pre-existing meta data of this resource r0. I recommend doing that only on the node that is about to become primary.

Now let's attach the device and set up the net for it:

root@c01:~# drbdadm attach r0
root@c01:~# drbdadm connect r0
root@c01:~# cat /proc/drbd
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by root@c01, 2009-09-09 11:13:45
0: cs:WFConnection ro:Secondary/Unknown ds:Inconsistent/DUnknown C r----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:2097052

Or, instead of separate attach and connect steps, you can simply use drbdadm up r0. As you can see, the resource is in Wait For Connection (WFConnection) state. Note that the local data is still inconsistent and we don't know anything about the peer's. Let's bring up the peer:

root@c02:~# drbdadm up r0
root@c02:~# cat /proc/drbd
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by root@c02, 2009-09-09 11:16:15
0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:2097052
Now you can see they're connected, but both inconsistent. So let's force a full synchronization from the node that is to become primary (c01):

root@c01:~# drbdadm -- --overwrite-data-of-peer primary r0
root@c01:~# cat /proc/drbd
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by root@c01, 2009-09-09 11:13:45
0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
ns:12324 nr:0 dw:0 dr:20688 al:0 bm:0 lo:0 pe:1 ua:2024 ap:0 ep:1 wo:b oos:2084728
[>....................] sync'ed: 0.8% (2084728/2097052)K
finish: 0:02:48 speed: 12,324 (12,324) K/sec
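Rather than re-running cat by hand, the standard watch utility gives a self-refreshing view of the sync progress:

root@c01:~# watch -n1 cat /proc/drbd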
Check the progress periodically on either host. When the synchronization is done:

root@c02:~# cat /proc/drbd
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by root@c02, 2009-09-09 11:16:15
0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----
ns:0 nr:2097052 dw:2097052 dr:0 al:0 bm:128 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

You can see here that both the local and the remote copy are UpToDate.
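Since we set verify-alg crc32c in the syncer section, we can also start an online verification at any later point to compare the two replicas block by block (the progress shows up in /proc/drbd with the VerifyS/VerifyT connection states):

root@c01:~# drbdadm verify r0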

Let's format this newly created drbd device on the primary node (c01):

root@c01:~# mkfs.ext3 /dev/drbd/by-res/r0
mke2fs 1.40.8 (13-Mar-2008)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
131072 inodes, 524263 blocks
26213 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=536870912
16 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912

Writing inode tables: done
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 33 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
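As the mkfs output suggests, the periodic forced checks can be disabled with tune2fs; on a cluster-managed volume a surprise fsck at failover time is usually unwanted, so this optional step may be worth it:

root@c01:~# tune2fs -c 0 -i 0 /dev/drbd/by-res/r0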
And now we could mount it with mount -t ext3 /dev/drbd/by-res/r0 /mnt - but we won't, since Pacemaker will do the mounting for us.

Now we have to make sure that only the Pacemaker crm handles this resource, so either remove the drbd init script or make it start with exit 0:

root@c01:~# head -2 /etc/init.d/drbd
#!/bin/bash
exit 0
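On Debian, an alternative to editing the script is to simply drop it from all runlevels with the standard update-rc.d tool (on both nodes):

root@c01:~# update-rc.d -f drbd remove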
Make sure that we release r0 on every node:

root@c01:~# drbdadm down r0
root@c02:~# drbdadm down r0

And now the crm config!
primitive drbd ocf:linbit:drbd \
    params drbd_resource="r0" \
    op monitor interval="9s" role="Master" timeout="30s" \
    op monitor interval="11s" role="Slave" timeout="30s" \
    meta target-role="Stopped"
This instructs the /usr/lib/ocf/resource.d/linbit/drbd script to switch the master/slave (aka primary/secondary) roles. I have set the monitor intervals differently to see the transitions better.

Let's setup a master/slave resource for it:

ms ms-drbd drbd \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Stopped"
Here we declare that in the whole cluster environment there can be only one master (and at most one per node), and the slave (aka secondary) clone may only run on the other node.
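If you're unsure which parameters and operations the linbit agent supports, the crm shell can print the resource agent's metadata:

root@c01:~# crm ra info ocf:linbit:drbd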

Now, the file system primitive:

primitive fs ocf:heartbeat:Filesystem \
    params device="/dev/drbd/by-res/r0" directory="/var/www" fstype="ext3" \
    meta target-role="Stopped"
Note that we will mount this drbd device with an ext3 fs under /var/www.
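One caveat: I wouldn't count on the Filesystem agent creating the mount point for us, so make sure /var/www exists on both nodes:

root@c01:~# mkdir -p /var/www
root@c02:~# mkdir -p /var/www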

Make sure that the filesystem gets mounted on the master (aka the primary drbd) only once the drbd device is ready:

order ms-drbd-before-fs inf: ms-drbd:promote fs:start
And make sure that the filesystem lives where the drbd master is, so if we migrate one of them to another cluster node, the fs moves with it:

colocation coloc-fs-drbd inf: fs ms-drbd:Master
In this example, since we're using only two nodes, quorum would be lost whenever a node fails, so there are two options to avoid this pitfall:

crm(live)configure# property no-quorum-policy="ignore"

or

crm(live)configure# property expected-quorum-votes="1"

This means that even with only one node online, the cluster remains capable of decisions, i.e. that node can act as the Designated Co-ordinator (DC).
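If you've been collecting the statements above, one way to load them is through the interactive configure mode of the crm shell, checking with verify before committing (both are built-in configure subcommands):

root@c01:~# crm configure
crm(live)configure# [... paste the primitive/ms/order/colocation statements from above ...]
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# exit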

Now let's start up the resources:

root@c01:~# crm
crm(live)# status


============
Last updated: Sat Sep 26 21:48:05 2009
Stack: openais
Current DC: c01 - partition with quorum
Version: 1.0.4-2609e060ce0c516c95ae31f44a10fed0202abfb6
2 Nodes configured, 1 expected votes
4 Resources configured.
============

Online: [ c01 c02 ]

vip (ocf::heartbeat:IPaddr): Started c01
vip2 (ocf::heartbeat:IPaddr2): Started c01


crm(live)# resource start drbd
crm(live)# status
[..]
Master/Slave Set: ms-drbd
Slaves: [ c02 ]
Stopped: [ drbd:0 ]
crm(live)# status
[..]
Master/Slave Set: ms-drbd
Slaves: [ c01 c02 ]
crm(live)# status
[..]
Master/Slave Set: ms-drbd
Masters: [ c01 ]
Slaves: [ c02 ]

Note the transitions by hitting the status command repeatedly until the drbd resource is up and working!
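A convenient alternative to polling with status is crm_mon, which ships with Pacemaker and keeps refreshing the same cluster overview until you press Ctrl-C:

root@c01:~# crm_mon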

Then start the filesystem on top of the running drbd master:

crm(live)# resource start fs
crm(live)# status


============
Last updated: Sat Sep 26 21:52:39 2009
Stack: openais
Current DC: c01 - partition with quorum
Version: 1.0.4-2609e060ce0c516c95ae31f44a10fed0202abfb6
2 Nodes configured, 1 expected votes
4 Resources configured.
============

Online: [ c01 c02 ]

Master/Slave Set: ms-drbd
Masters: [ c01 ]
Slaves: [ c02 ]
fs (ocf::heartbeat:Filesystem): Started c01
vip (ocf::heartbeat:IPaddr): Started c01
vip2 (ocf::heartbeat:IPaddr2): Started c01
Now let's see how to migrate the fs by hand from c01 to c02. First, we create a test file on c01:

root@c01:~# ls -la /var/www/
total 24
drwxr-xr-x 3 root root 4096 2009-09-26 21:24 .
drwxr-xr-x 14 root root 4096 2009-09-25 06:26 ..
drwx------ 2 root root 16384 2009-09-26 21:24 lost+found
root@c01:~# echo "Big bada boom" > /var/www/moo
root@c01:~# ls -la /var/www/moo
-rw-r--r-- 1 root root 14 2009-09-26 21:55 /var/www/moo
Then migrate the fs:

crm(live)# resource migrate fs c02

[a few seconds later:]

crm(live)# status


============
Last updated: Sat Sep 26 21:56:00 2009
Stack: openais
Current DC: c01 - partition with quorum
Version: 1.0.4-2609e060ce0c516c95ae31f44a10fed0202abfb6
2 Nodes configured, 1 expected votes
4 Resources configured.
============

Online: [ c01 c02 ]

Master/Slave Set: ms-drbd
Masters: [ c02 ]
Slaves: [ c01 ]
fs (ocf::heartbeat:Filesystem): Started c02
vip (ocf::heartbeat:IPaddr): Started c01
vip2 (ocf::heartbeat:IPaddr2): Started c01
Here we go. Checking this on c02:

root@c02:~# ls -la /var/www/
total 28
drwxr-xr-x 3 root root 4096 2009-09-26 21:55 .
drwxr-xr-x 14 root root 4096 2009-09-25 06:30 ..
drwx------ 2 root root 16384 2009-09-26 21:24 lost+found
-rw-r--r-- 1 root root 14 2009-09-26 21:55 moo
root@c02:~# cat /var/www/moo
Big bada boom
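One thing to keep in mind: resource migrate works by inserting a cli-prefer location constraint into the configuration (you can spot three of them in the recap at the end of this post). Once you're done testing, you can remove it again with:

crm(live)# resource unmigrate fs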
Now we can simply test the cluster by pushing the power button to turn off a machine (or powering off a virtual machine, as I did with VBoxManage controlvm c02 poweroff), and you'll see that the other node takes over the resources, provided there is no location constraint defined that forbids it.

Now for a recap of my current testing crm config, where I set the resource-stickiness pretty high to avoid a further resource takeover when the crashed/powered-off node comes back up, because that failback would cause additional downtime.

root@c02:~# crm configure show
node c01 \
    attributes standby="off"
node c02 \
    attributes standby="off"
primitive drbd ocf:linbit:drbd \
    params drbd_resource="r0" \
    op monitor interval="9s" role="Master" timeout="30s" \
    op monitor interval="11s" role="Slave" timeout="30s" \
    meta target-role="Started"
primitive fs ocf:heartbeat:Filesystem \
    params device="/dev/drbd/by-res/r0" directory="/var/www" fstype="ext3" \
    meta target-role="Started"
primitive vip ocf:heartbeat:IPaddr \
    params ip="10.30.49.254" \
    op monitor interval="10s" \
    meta target-role="Started"
primitive vip2 ocf:heartbeat:IPaddr2 \
    params ip="10.30.49.253" nic="eth0" cidr_netmask="16" \
    meta target-role="Started" \
    op monitor interval="10s"
ms ms-drbd drbd \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Stopped"
location cli-prefer-fs fs \
    rule $id="cli-prefer-rule-fs" inf: #uname eq c02
location cli-prefer-vip vip \
    rule $id="cli-prefer-rule-vip" inf: #uname eq c02
location cli-prefer-vip2 vip2 \
    rule $id="cli-prefer-rule-vip2" inf: #uname eq c01
colocation coloc-fs-drbd inf: fs ms-drbd:Master
colocation vip-with-vip2 inf: vip vip2
order ms-drbd-before-fs inf: ms-drbd:promote fs:start
property $id="cib-bootstrap-options" \
    dc-version="1.0.4-2609e060ce0c516c95ae31f44a10fed0202abfb6" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="1" \
    stonith-enabled="false" \
    no-quorum-policy="ignore" \
    start-failure-is-fatal="false" \
    stonith-action="reboot" \
    last-lrm-refresh="1254001497"
rsc_defaults $id="rsc-options" \
    resource-stickiness="100000"
I hope you find these short tutorials useful!
