November 17

How to identify ECC memory modules

This short article describes how to check whether your Linux workstation or server has ECC memory modules.

As a side note, ECC memory matters a great deal. Even checksumming filesystems such as ZFS cannot protect against bits that flip in memory, for example due to cosmic rays. According to published studies, a DIMM has an 8% chance per year of getting a correctable error. Multiply that by the number of DIMMs in your system (4 or more?), and data corruption within a year becomes quite likely on a machine without ECC.
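Strictly speaking, the per-DIMM risk compounds rather than multiplies. Taking the 8% figure above and assuming that DIMMs fail independently of each other (an assumption made for this example, not a claim from the studies), the chance that at least one of n DIMMs sees an error in a year is 1 - (1 - 0.08)^n. A small sketch, with p_any_error as a made-up helper name:

```shell
# p_any_error P_PER_DIMM N_DIMMS
# Prints the probability (in percent) that at least one of N_DIMMS
# sees an error, assuming each fails independently with probability P_PER_DIMM.
p_any_error() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.1f%%\n", (1 - (1 - p) ^ n) * 100 }'
}

p_any_error 0.08 4    # prints 28.4%
```

So with four DIMMs the yearly risk is a little over one in four, slightly less than the 32% that plain multiplication suggests.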

To display what type of memory module you have, we make use of the following DMI type:

16   Physical Memory Array


# dmidecode --type 16


# dmidecode 2.11
SMBIOS 2.7 present.

Handle 0x0007, DMI type 16, 23 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Single-bit ECC
        Maximum Capacity: 32 GB
        Error Information Handle: 0x0010
        Number Of Devices: 4

On both Debian/Ubuntu and Red Hat based distributions, this tool is provided by the dmidecode package.
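If you only want the one relevant line, for example in a script that audits several machines, the output can be filtered. A minimal sketch; ecc_type is a helper name invented here, and it relies only on the "Error Correction Type:" field shown in the output above:

```shell
# Read `dmidecode --type 16` output on stdin and print the value of
# every "Error Correction Type:" field (one per memory array).
ecc_type() {
    awk -F': ' '/Error Correction Type:/ { print $2 }'
}
```

Typical use, as root, is `dmidecode --type 16 | ecc_type`. For the machine above this prints Single-bit ECC; a board without ECC support would be expected to report None instead.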

Category: ECC, MEMORY
November 17

How to configure network bonding (LACP) on Debian Wheezy

This process essentially consists of two steps. I will only detail the steps relevant to the Linux host.

  • Configuring the switch for LACP bonding.
  • Configuring the Linux host for LACP bonding.


On the Linux host, the work breaks down as follows:

  • Install ifenslave.
  • Shut down the network after installing ifenslave.
  • Start the network once the configuration changes are in place.


Install ifenslave. This is a virtual package and will in reality install ifenslave-2.6.

# aptitude install ifenslave

Stop the network. Make sure you’re not connected via SSH while doing this.

# /etc/init.d/networking stop

Debian Kernel Module Configuration

File: /etc/modprobe.d/bonding.conf

alias bond0 bonding
 options bonding mode=4 miimon=100 lacp_rate=1

File: /etc/modules

echo "bonding" >> /etc/modules
echo "mii" >> /etc/modules


Debian Network Configuration

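File: /etc/network/interfaces

auto eth0
iface eth0 inet manual
    bond-master bond0

auto eth1
iface eth1 inet manual
    bond-master bond0

auto bond0
iface bond0 inet static
    # address, netmask and gateway for bond0 go here
    bond-mode 802.3ad
    bond-miimon 100
    bond-downdelay 200
    bond-updelay 200
    bond-lacp-rate 1
    bond-slaves none

Start the network again:

# /etc/init.d/networking start
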
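Once the network is back up, the kernel reports the state of the bond in /proc/net/bonding/bond0. A minimal sketch for checking which mode is actually in effect; bond_mode is a helper name made up for this example, and it assumes only the "Bonding Mode:" line that the bonding driver prints in that file:

```shell
# Read /proc/net/bonding/<bond> on stdin and print the active bonding mode.
bond_mode() {
    sed -n 's/^Bonding Mode: //p'
}
```

Run it as `bond_mode < /proc/net/bonding/bond0`. For an LACP bond the reported mode should mention 802.3ad; if it shows something else, the options were most likely not picked up, and the switch side is worth checking too.
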
Category: BONDING, NETWORKING
November 17

Configure Network Bonding [ Teaming / Aggregating NIC ]

NIC teaming means combining or aggregating multiple network connections in parallel. This is done to increase throughput and to provide redundancy in case one of the links or Ethernet cards fails. The Linux kernel comes with the bonding driver for aggregating multiple network interfaces into a single logical interface, here called bond0. In this tutorial, I will explain how to set up bonding on a Debian Linux server to aggregate multiple Ethernet devices into a single link, to get higher data rates and link failover.

The instructions were tested using the following setup:

  • 2 x PCI-e Gig NIC with jumbo frames.
  • RAID 6 w/ 5 enterprise grade 15k SAS hard disks.
  • Debian Linux 6.0.2 amd64

Please note that the following instructions should also work on Ubuntu Linux server.

Required Software

You need to install the following tool:

  • ifenslave command: It is used to attach and detach slave network devices to a bonding device. A bonding device will act like a normal Ethernet network device to the kernel, but will send out the packets via the slave devices using a simple round-robin scheduler. This allows for simple load-balancing, identical to “channel bonding” or “trunking” techniques used in network switches.

Our Sample Setup

 |         (eth0)
ISP Router/Firewall (eth1)
     \                             +------ Server 1 (Debian file server w/ eth0 & eth1)
      +------------------+         |
      | Gigabit Ethernet |---------+------ Server 2 (MySQL)
      | with Jumbo Frame |         |
      +------------------+         +------ Server 3 (Apache)
                                   +-----  Server 4 (Proxy/SMTP/DHCP etc)
                                   +-----  Desktop PCs / Other network devices (etc)

Install ifenslave

Use the apt-get command to install ifenslave, enter:
# apt-get install ifenslave-2.6
Sample outputs:

Reading package lists... Done
Building dependency tree
Reading state information... Done
Note, selecting 'ifenslave-2.6' instead of 'ifenslave'
The following NEW packages will be installed:
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 18.4 kB of archives.
After this operation, 143 kB of additional disk space will be used.
Get:1 squeeze/main ifenslave-2.6 amd64 1.1.0-17 [18.4 kB]
Fetched 18.4 kB in 1s (10.9 kB/s)
Selecting previously deselected package ifenslave-2.6.
(Reading database ... 24191 files and directories currently installed.)
Unpacking ifenslave-2.6 (from .../ifenslave-2.6_1.1.0-17_amd64.deb) ...
Processing triggers for man-db ...
Setting up ifenslave-2.6 (1.1.0-17) ...
update-alternatives: using /sbin/ifenslave-2.6 to provide /sbin/ifenslave (ifenslave) in auto mode.

Linux Bonding Driver Configuration

Create a file called /etc/modprobe.d/bonding.conf, enter:
# vi /etc/modprobe.d/bonding.conf
Append the following

alias bond0 bonding
  options bonding mode=0 arp_interval=100 arp_ip_target=,

Save and close the file. This configuration file is used by the Linux kernel driver called bonding. The options are important here:

  1. mode=0 : Set the bonding policy to balance-rr (round robin). This is the default. This mode provides load balancing and fault tolerance.
  2. arp_interval=100 : Set the ARP link monitoring frequency to 100 milliseconds. Without this option you will get various warnings when starting bond0 via /etc/network/interfaces.
  3. arp_ip_target=, : Use the given IP addresses (for example, the router IP) as ARP monitoring peers when arp_interval is > 0. They are used to determine the health of the link to the targets. Multiple IP addresses must be separated by a comma. At least one IP address must be given (usually I set it to the router IP) for ARP monitoring to function. The maximum number of targets that can be specified is 16.

How Do I Load the Driver?

Type the following command
# modprobe -v bonding mode=0 arp_interval=100 arp_ip_target=,
# tail -f /var/log/messages
# ifconfig bond0

Interface Bonding (Teaming) Configuration

First, stop eth0 and eth1 (do not type this over an ssh session), enter:
# /etc/init.d/networking stop
You need to modify /etc/network/interfaces file, enter:
# cp /etc/network/interfaces /etc/network/interfaces.bak
# vi /etc/network/interfaces

Remove eth0 and eth1 static IP configuration and update the file as follows:

############ WARNING ####################
# You do not need an "iface eth0" nor an "iface eth1" stanza.
# Setup IP address / netmask / gateway as per your requirements.
auto lo
iface lo inet loopback

# The primary network interface
auto bond0
iface bond0 inet static
    slaves eth0 eth1
    # jumbo frame support
    mtu 9000
    # Load balancing and fault tolerance
    bond-mode balance-rr
    bond-miimon 100
    bond-downdelay 200
    bond-updelay 200

Save and close the file. Where,

  • address : Dotted quad ip address for bond0.
  • netmask : Dotted quad netmask for bond0.
  • network : Dotted quad network address for bond0.
  • gateway : Default gateway for bond0.
  • slaves eth0 eth1 : Setup a bonding device and enslave two real Ethernet devices (eth0 and eth1) to it.
  • mtu 9000 : Set MTU size to 9000. See Linux JumboFrames configuration for more information.
  • bond-mode balance-rr : Set the bonding mode to balance-rr, which provides load balancing and fault tolerance. See below for more information.
  • bond-miimon 100 : Set the MII link monitoring frequency to 100 milliseconds. This determines how often the link state of each slave is inspected for link failures.
  • bond-downdelay 200 : Set the time, to 200 milliseconds, to wait before disabling a slave after a link failure has been detected. This option is only valid together with bond-miimon.
  • bond-updelay 200 : Set the time, to 200 milliseconds, to wait before enabling a slave after a link recovery has been detected. This option is only valid together with bond-miimon.
  • dns-nameservers : Use the given address as the DNS server.
  • dns-search : Use the given domain for default host-name lookups (optional).

A Note About Various Bonding Policies

In the above example the bonding policy (mode) is set to 0 or balance-rr. Other possible values are as follows:

The Linux bonding driver aggregating policies

Bonding policy (mode) : Description

  • balance-rr or 0 : Round-robin policy to transmit packets in sequential order from the first available slave through the last. This mode provides load balancing and fault tolerance.
  • active-backup or 1 : Active-backup policy. Only one slave in the bond is active. A different slave becomes active if, and only if, the active slave fails. This mode provides fault tolerance.
  • balance-xor or 2 : Transmit based on the selected transmit hash policy. The default policy is a simple [(source MAC address XOR’d with destination MAC address) modulo slave count]. This mode provides load balancing and fault tolerance.
  • broadcast or 3 : Transmits everything on all slave interfaces. This mode provides fault tolerance.
  • 802.3ad or 4 : Creates aggregation groups that share the same speed and duplex settings. Utilizes all slaves in the active aggregator according to the 802.3ad specification. Most network switches will require some type of configuration to enable 802.3ad mode.
  • balance-tlb or 5 : Adaptive transmit load balancing: channel bonding that does not require any special switch support. The outgoing traffic is distributed according to the current load (computed relative to the speed) on each slave. Incoming traffic is received by the current slave. If the receiving slave fails, another slave takes over the MAC address of the failed receiving slave.
  • balance-alb or 6 : Adaptive load balancing: includes balance-tlb plus receive load balancing (rlb) for IPv4 traffic, and does not require any special switch support. The receive load balancing is achieved by ARP negotiation.

Source: See Documentation/networking/bonding.txt for more information.
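Since the modprobe options use the numeric mode while /etc/network/interfaces uses the name, a small lookup can help when reading other people's configurations. This is only a sketch of the mapping listed above; bond_mode_name is a helper name invented for this example.

```shell
# bond_mode_name N
# Translate a numeric bonding mode (0-6) into its policy name.
bond_mode_name() {
    case "$1" in
        0) echo balance-rr ;;
        1) echo active-backup ;;
        2) echo balance-xor ;;
        3) echo broadcast ;;
        4) echo 802.3ad ;;
        5) echo balance-tlb ;;
        6) echo balance-alb ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
}
```

For example, `bond_mode_name 4` prints 802.3ad, the mode used for LACP.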

Start bond0 Interface

Now that all configuration files have been modified, the networking service must be started or restarted. Enter one of the following:
# /etc/init.d/networking start
# /etc/init.d/networking stop && /etc/init.d/networking start

Verify New Settings

Type the following commands:
# /sbin/ifconfig
Sample outputs:

bond0     Link encap:Ethernet  HWaddr 00:xx:yy:zz:tt:31
          inet addr:  Bcast:  Mask:
          inet6 addr: fe80::208:9bff:fec4:3031/64 Scope:Link
          RX packets:2414 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1559 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:206515 (201.6 KiB)  TX bytes:480259 (469.0 KiB)
eth0      Link encap:Ethernet  HWaddr 00:xx:yy:zz:tt:31
          RX packets:1214 errors:0 dropped:0 overruns:0 frame:0
          TX packets:782 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:103318 (100.8 KiB)  TX bytes:251419 (245.5 KiB)
eth1      Link encap:Ethernet  HWaddr 00:xx:yy:zz:tt:31
          RX packets:1200 errors:0 dropped:0 overruns:0 frame:0
          TX packets:777 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:103197 (100.7 KiB)  TX bytes:228840 (223.4 KiB)
lo        Link encap:Local Loopback
          inet addr:  Mask:
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:8 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:560 (560.0 B)  TX bytes:560 (560.0 B)

How Do I Verify Current Link Status?

Use the cat command to see the current status of the bonding driver and NIC links:
# cat /proc/net/bonding/bond0
Sample outputs:

Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)
Bonding Mode: load balancing (round-robin)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 200
Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:xx:yy:zz:tt:31
Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:xx:yy:zz:tt:30

Example: Link Failure

The contents of /proc/net/bonding/bond0 after the link failure:
# cat /proc/net/bonding/bond0
Sample outputs:

Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)
Bonding Mode: load balancing (round-robin)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 200
Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:xx:yy:zz:tt:31
Slave Interface: eth1
MII Status: down
Link Failure Count: 1
Permanent HW addr: 00:xx:yy:zz:tt:30

You will also see the following information in your /var/log/messages file:

Sep  5 04:16:15 nas01 kernel: [ 6271.468218] e1000e: eth1 NIC Link is Down
Sep  5 04:16:15 nas01 kernel: [ 6271.548027] bonding: bond0: link status down for interface eth1, disabling it in 200 ms.
Sep  5 04:16:15 nas01 kernel: [ 6271.748018] bonding: bond0: link status definitely down for interface eth1, disabling it

However, your nas01 server should work without any problem as eth0 link is still up and running. Next, replace the faulty network card, connect the cable, and you will see the following message in your /var/log/messages file:

Sep  5 04:20:21 nas01 kernel: [ 6517.492974] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep  5 04:20:21 nas01 kernel: [ 6517.548029] bonding: bond0: link status up for interface eth1, enabling it in 200 ms.
Sep  5 04:20:21 nas01 kernel: [ 6517.748016] bonding: bond0: link status definitely up for interface eth1.
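
Rather than reading the status file by eye, the failed links can be picked out with a short filter. A minimal sketch, assuming the "Slave Interface:" and "MII Status:" layout shown in the sample outputs above; down_slaves is a made-up helper name:

```shell
# Read /proc/net/bonding/<bond> on stdin and print every slave whose
# MII status is not "up". The first "MII Status:" line belongs to the
# bond itself and is skipped because no slave has been seen yet.
down_slaves() {
    awk '
        /^Slave Interface:/ { slave = $3 }
        /^MII Status:/ && slave != "" && $3 != "up" { print slave }
    '
}
```

Run it as `down_slaves < /proc/net/bonding/bond0`. With the link-failure output above it prints eth1, and it prints nothing when all links are healthy, which makes it easy to hook into a cron job or monitoring check.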


Category: BONDING, NETWORKING
November 17

FCoE but i can talk about iSCSI

I can't talk about FCoE but I can talk about iSCSI.

first up, what are you talking to array wise? it shouldn't make a huge difference to the info to follow, just interested.

We use Dell Equallogic arrays at work, and the following config works remarkably well.

First up, get rid of any smell of LACP/PortChannel/LinkAgg between servers and switches and storage and switches. the only place you should be using link agg is switch to switch. iSCSI is a path aware protocol; in a nutshell, the reason you’re only getting 1Gbit/sec is that the IO stream is being balanced down one cable, because it is all part of the same session.

Couple of terms of reference:
Initiator – the host
Target Portal – the storage
Session – a logical connection between an initiator and a target

Now, there is basically a 1:1 relationship between initiator and target, a 1:1 relationship between physical interface and initiator address and a 1:1 relationship between session and physical network port. This pretty much comes back to the FC world where you have one session between the HBA and the storage array; you don't run multiple sessions over a FC cable.

so given you have two cables plugged in, you will want to create two sessions back to your storage, the multipathing daemon (multipathd) on the linux box will sort out the MPIO stuff for you.

here are a bunch of config things we do on our servers to make the magic happen

first up, /etc/sysctl.conf

# Equallogic Configuration settings
# ARP Flux and Return Path Filtering are two behaviours that Linux
# employs for multiple NICs on the same subnet.  They need to be disabled
# to ensure proper MPIO operation with Equallogic storage.
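
The comments above name the two behaviours but not the settings themselves. The sketch below prints one commonly used set of per-interface lines for several NICs on the same subnet. The values (arp_ignore = 1, arp_announce = 2, rp_filter = 2) are an assumption chosen for illustration, not taken from the post's own file, and iscsi_sysctl_lines is a made-up helper name; check your array vendor's guidance before using them.

```shell
# iscsi_sysctl_lines NIC...
# Print sysctl.conf lines for each iSCSI NIC given as an argument.
# ASSUMED values, not from the original post:
#   arp_ignore = 1    only answer ARP for addresses on the receiving NIC
#   arp_announce = 2  always use the best local address in ARP requests
#   rp_filter = 2     loose reverse-path filtering
iscsi_sysctl_lines() {
    for nic in "$@"; do
        printf 'net.ipv4.conf.%s.arp_ignore = 1\n' "$nic"
        printf 'net.ipv4.conf.%s.arp_announce = 2\n' "$nic"
        printf 'net.ipv4.conf.%s.rp_filter = 2\n' "$nic"
    done
}
```

For example, `iscsi_sysctl_lines eth2 eth3` prints six lines that can be reviewed and then appended to /etc/sysctl.conf, followed by `sysctl -p` to load them.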


then /etc/iscsi/iscsid.conf

node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 10
node.session.timeo.replacement_timeout = 15

we put this into /etc/rc.local to disable some of the tcp stack offloading

/sbin/ethtool --offload eth2 gro off
/sbin/ethtool --offload eth3 gro off

For sanity reasons we modify the default autogenerated iSCSI IQN name to something more meaningful; it's located in /etc/iscsi/initiatorname.iscsi


Then we turn on jumbo frames on the iSCSI interfaces

Modify /etc/sysconfig/network-scripts/ifcfg-ethX (where X is the interfaces talking to the iSCSI Array)

Add the following


now we get stuck into configuring the iscsi daemon itself
first up we create some logical interfaces within the iscsi daemon to direct traffic down

# iscsiadm --mode iface --op=new --interface iscsi0
# iscsiadm --mode iface --op=new --interface iscsi1

then we link the physical interfaces to the logical interfaces

# iscsiadm --mode iface --op=update --interface iscsi0 --name=iface.net_ifacename --value=eth2
# iscsiadm --mode iface --op=update --interface iscsi1 --name=iface.net_ifacename --value=eth3
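
if you have more than two NICs the same pair of commands just repeats. a rough sketch that only prints the commands for a list of NICs (a dry run, nothing is executed); iscsi_iface_cmds is a name made up here, and the iscsiN naming simply follows the example above:

```shell
# iscsi_iface_cmds NIC...
# Print (do not run) the iscsiadm commands that create one logical
# iSCSI interface per NIC and bind it to that NIC.
iscsi_iface_cmds() {
    i=0
    for nic in "$@"; do
        printf 'iscsiadm --mode iface --op=new --interface iscsi%d\n' "$i"
        printf 'iscsiadm --mode iface --op=update --interface iscsi%d --name=iface.net_ifacename --value=%s\n' "$i" "$nic"
        i=$((i + 1))
    done
}
```

`iscsi_iface_cmds eth2 eth3` prints the four commands shown above (grouped per interface); review the output and pipe it to sh as root once it looks right.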

Now we discover the targets we can talk to on our storage array

# iscsiadm -m discovery -t st -p <ARRAY IP> -I iscsi0 -I iscsi1

and finally we login to the targets we have discovered

# iscsiadm -m node --login

Right now, with all this done, you should see two sd[a-z] devices for each LUN you have presented. basically one of the devices goes via one ethernet cable, the other device via the other ethernet cable, protecting you against cable/switch failure, assuming correct connectivity and config of your switching fabric.

now here is a sample config of the multipath system in linux

all this goes into /etc/multipath.conf

## Use user friendly names, instead of using WWIDs as names.
defaults {
     user_friendly_names yes
}

## Blacklist local devices (no need to have them accidentally "multipathed")
blacklist {
     devnode "^sd[a]$"
     devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
     devnode "^hd[a-z]"
     devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
}

## For EqualLogic PS series RAID arrays
devices {
     device {
          vendor                  "EQLOGIC"
          product                 "100E-00"
          path_grouping_policy    multibus
          getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
          features                "1 queue_if_no_path"
          path_checker            readsector0
          path_selector           "round-robin 0"
          failback                immediate
          rr_min_io               10
          rr_weight               priorities
     }
}

## Assign human-readable name  (aliases) to the World-Wide ID's (names) of the devices we're wanting to manage
## Identify the WWID by issuing # scsi_id -gus /block/sdX

multipaths {
     multipath {
          wwid                    36090a01850e08bd883c4a45bfa0330ba
          alias                     devicename2
     }
     multipath {
          wwid                    36090a01850e04bd683c4745bfa03b07c
          alias                     devicename1
     }
}

there will be other examples in the config file around the device stanza. you’ll need to modify the vendor and product values to suit what you’re running; when the disk arrived on the machine via iscsi, dmesg would have recorded some details that would help here.

the multipaths stuff isn’t required, it just makes life a lot easier

then start all the services needed

# service multipathd start
# chkconfig multipathd on

then use the following command to see if things have worked

# multipath -v2
# multipath -ll

if everything is good you should get something like this as the output from the second command

devicename1 (36090a01850e04bd683c4745bfa03b07c) dm-3 EQLOGIC,100E-00
[size=5.0T][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=4][active]
 \_ 13:0:0:0 sdd 8:48  [active][ready]
 \_ 14:0:0:0 sdb 8:16  [active][ready]
 \_ 12:0:0:0 sdc 8:32  [active][ready]
 \_ 11:0:0:0 sde 8:64  [active][ready]

the device to mount will be /dev/mapper/devicename1

Category: iSCSI, STORAGE
November 17

iSCSI MPIO with Nimble

With the implementation of a new Nimble Storage Array, HCL is changing the storage strategy away from fiber channel to iSCSI. If you have not looked at a Nimble array, you really should. Fantastic!

The Nimble allows for four ethernet ports for iSCSI traffic. To have the highest amount of bandwidth and redundancy, MPIO needs to be configured on the system to communicate with the Nimble.

Target (SAN)

  • Nimble Storage Array CS220-X2
  • Discovery IP:
  • Data IP’s:,, 172.16.13,

Initiator (Client)

  • Ubuntu 12.04 LTS
  • Data IP:
  • iSCSI IP:

Software Prerequisite

# sudo apt-get install open-iscsi open-iscsi-utils multipath-tools


iSCSI uses an IQN to refer to targets and initiators. Once you install the open-iscsi package, an IQN will be created for you. This can be found in the /etc/iscsi/initiatorname.iscsi file.

# cat /etc/iscsi/initiatorname.iscsi
## If you remove this file, the iSCSI daemon will not start.
## If you change the InitiatorName, existing access control lists
## may reject this initiator.  The InitiatorName must be unique
## for each iSCSI initiator.  Do NOT duplicate iSCSI InitiatorNames.

Use this initiator IQN to configure your volume on the Nimble array and to create your initiator group. As a practice, we have decided to build our initiator groups based on the IQN rather than the IP address of the initiator systems.

Set iSCSI startup to automatic

# sudo ./chconfig /etc/iscsi/iscsid.conf node.startup automatic

chconfig is a small bash script to execute a sed command to change the value of configuration property to a specific value. It is useful in cases where configuration files are written in the form of property=value. It is available on github.
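
The chconfig script itself is not reproduced here. As a rough sketch of the idea only, assuming GNU sed and a file made of "key = value" lines, something like the following would do the same job; set_property is a hypothetical stand-in and not the script from GitHub.

```shell
# set_property FILE KEY VALUE
# Rewrite the first-column, uncommented "KEY = ..." line(s) in FILE so
# that KEY is set to VALUE. Commented-out lines are left untouched.
# Assumes GNU sed (-i) and a VALUE without "|" or "&" characters.
set_property() {
    file=$1 key=$2 value=$3
    # Escape regex metacharacters in the key (node.startup contains a dot).
    key_re=$(printf '%s\n' "$key" | sed 's/[][\.*^$]/\\&/g')
    sed -i "s|^[[:space:]]*${key_re}[[:space:]]*=.*|${key} = ${value}|" "$file"
}
```

With that in place, `sudo`-running a script containing `set_property /etc/iscsi/iscsid.conf node.startup automatic` has the same effect as the chconfig call above.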

Discover the Target

# sudo iscsiadm -m discovery -t sendtargets -p,2460,2460,2460,2460

If everything is running correctly up to this point, you will see all four paths to the Nimble in the output along with the IQNs of the volumes that you have created. In my case, the volume name is ubuntutest.

Configure Multipath

This step is important to do prior to logging into each of the storage paths.

The first step is to log into one of the data targets.

# sudo iscsiadm -m node --targetname "" --portal "" --login

Once you are logged in, you will be able to get the wwid of the drive. You will need this for the /etc/multipath.conf. This file configures all of your multipath preferences. To get the wwid…

# sudo multipath -ll
202e7bcc950e534c26c9ce900a0588a97 dm-2 Nimble,Server
size=5.0G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  `- 3:0:0:0 sdb 8:16 active ready running

In my case, the wwid is 202e7bcc950e534c26c9ce900a0588a97. Now, open /etc/multipath.conf in your favorite editor and edit the file so it matches something like this…

defaults {
    udev_dir /dev
    polling_interval 10
    prio_callout /bin/true
    path_checker readsector0
    prio const
    failback immediate
    user_friendly_names yes
}

devices {
    device {
            vendor "Nimble*"
            product "*"

            path_grouping_policy multibus

            path_selector "round-robin 0"
            # path_selector "queue-length 0"
            # path_selector "service-time 0"
    }
}

multipaths {
    multipath {
            wwid 202e7bcc950e534c26c9ce900a0588a97
            alias data
    }
}

Now would be a good point to reload the multipath service.

# sudo service multipath-tools reload

Continue to log into the iSCSI targets

# sudo iscsiadm -m node --targetname "" --portal "" --login

# sudo iscsiadm -m node --targetname "" --portal "" --login

# sudo iscsiadm -m node --targetname "" --portal "" --login

Once you have finished logging into each target, you can verify your multipath configuration.

# sudo multipath -ll
data (202e7bcc950e534c26c9ce900a0588a97) dm-2 Nimble,Server
size=5.0G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 14:0:0:0 sdb 8:16 active ready  running
  |- 12:0:0:0 sdc 8:32 active ready  running
  |- 13:0:0:0 sdd 8:48 active ready  running
  `- 11:0:0:0 sde 8:64 active ready  running

The drive will be available at /dev/mapper/data.
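
To confirm that all four paths are present without counting lines by hand, the path entries can be counted. A minimal sketch, assuming path lines of the form "H:C:T:L sdX ..." as in the output above; count_paths is a helper name made up for this example:

```shell
# Read `multipath -ll` output on stdin and print the number of path
# lines, i.e. lines containing a SCSI address (host:channel:target:lun)
# followed by an sdX device name.
count_paths() {
    grep -Ec '[0-9]+:[0-9]+:[0-9]+:[0-9]+ +sd[a-z]+ '
}
```

Run it as `sudo multipath -ll data | count_paths`. With the output above it prints 4; anything lower means a login or a link is missing.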

Next up will be creating an LVM volume and formatting it with OCFS2 for shared storage in a cluster.

Category: MPIO, STORAGE
November 17

Configuring DRBD

At this point, a diagram might be in order – you’ll have to excuse my “Dia” skills, which are somewhat lacking!


Following on from part one, where we covered the basic architecture and got DRBD installed, we’ll proceed to configuring and then initialising the shared storage across both nodes. The configuration file for DRBD (/etc/drbd.conf) is very simple, and is the same on both hosts. The full configuration file is below – you can copy and paste this in; I’ll go through each line afterwards and explain what it all means. Many of these sections and commands can be fine tuned – see the man pages on drbd.conf and drbdsetup for more details.


global {
}

resource r0 {
  protocol C;
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

  startup {
    wfc-timeout  0;
  }
  disk {
    on-io-error   detach;
  }
  net {
    on-disconnect reconnect;
  }
  syncer {
    rate 30M;
  }
  on weasel {
    device     /dev/drbd0;
    disk       /dev/md3;
    # address  10.0.0.x:7788;  (fill in this node's IP)
    meta-disk  internal;
  }
  on otter {
    device    /dev/drbd0;
    disk      /dev/md3;
    # address 10.0.0.x:7788;  (fill in this node's IP)
    meta-disk internal;
  }
}

The structure of this file should be pretty obvious – sections are surrounded by curly braces, and there are two main sections – a global one, in which nothing is defined, and a resource section, where a shared resource named “r0” is defined.

The global section only has a few options available to it – see the DRBD website for more information; though it’s pretty safe to say you can ignore this part of the configuration file when you’re getting started.

The next section is the one where we define our shared resource, r0. You can obviously define more than one resource within the configuration file, and you don’t have to call them “r0”, “r1” etc. They can have any arbitrary name – but the name MUST only contain alphanumeric characters. No whitespace, no punctuation, and no characters with accents.

So, r0 is our shared storage resource. The next line says that we’ll be using protocol “C” – and this is what you should be using in pretty much any scenario. DRBD can operate in one of three modes, which provide various levels of replication integrity. Consider what happens in the event of some data being written to disk on the primary:

  • Protocol A considers writes on the primary to be complete if the write request has reached the local disk, and the local TCP network buffer. Obviously, this provides no guarantee whatsoever it will reach the secondary node: any hardware or software failure could result in you losing data. This option should only be chosen if you don’t care about the integrity of your data, or are replicating over a high-latency link.
  • Protocol B goes one step further, and considers the write to be complete if it reaches the secondary node. This is a little safer, but you could still lose data in the event of power or equipment failure as there is no guarantee the data will have reached the disk on the secondary.
  • Protocol C is the safest of the three, as it will only consider the write request completed once the secondary node has safely written it to disk. This is therefore the most common mode of operation, but it does mean that you will have to take into account network latency when considering the performance characteristics of your storage. If you have the fastest drives in the world, it won’t help you if you have to wait for a slow secondary node to complete its write over a high-latency connection.

The next somewhat cryptic-looking line (“incon-degr-cmd…”) in the configuration simply defines what to do in the event of the primary node starting up and discovering its copy of the data is inconsistent. We obviously wouldn’t want this propagated over to the secondary, so this statement tells it to display an error message, wait 60 seconds and then halt.

We then come to the startup section – here we simply state that this node should wait forever for the other node to connect.

In the disk section, we tell DRBD that if there is an error or a problem with the underlying storage, it should detach from it. The net section states that if we experience a communication problem with the other node, instead of going into standalone mode we should try to reconnect.

The syncer section controls various aspects of the actual replication – the main parameter to tune is the rate at which we’ll replicate data between the nodes. While it may be tempting at first glance to enter a rate of 100M for a Gigabit network, it is advisable to reserve bandwidth. The documentation says that a good rule of thumb is to use about 30% of the available replication bandwidth, taking into account local IO systems, and not just the network.
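
As a worked example of that rule of thumb: take whichever is slower, the network or the local disks, and use 30% of it. The sketch below does just that; sync_rate is a made-up helper name, and the throughput figures in the usage line (roughly 110 MB/s for Gigabit Ethernet, 200 MB/s for the disks) are assumptions for illustration rather than measurements.

```shell
# sync_rate NET_MB_PER_S DISK_MB_PER_S
# Print a suggested syncer rate: 30% of the slower of the two paths,
# rounded down to a whole number of megabytes per second.
sync_rate() {
    awk -v net="$1" -v disk="$2" 'BEGIN {
        net += 0; disk += 0                 # force numeric comparison
        bw = (net < disk) ? net : disk      # the bottleneck
        printf "%dM\n", int(bw * 30 / 100)
    }'
}

sync_rate 110 200    # prints 33M
```

That lands close to the 30M used in the configuration above. If the disks were the bottleneck instead, say at 90 MB/s, the same rule would give 27M.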

We now come to the “meat” of this example – we define the two storage nodes – “otter” and “weasel”. Refer back to part one and the diagram there, if you need a quick reminder on the network layout and host names. For each node, we define what the device name is (/dev/drbd0); the underlying storage (/dev/md3); the address and port that DRBD will be listening on (10.0.0.x:7788); and what to use to store the metadata for the device. “meta-disk internal” just stores the data in the last 128MB of the underlying storage – you could use another disk for this, but it makes sense to use the same device unless it’s causing performance issues.

Initialising DRBD

Now we’ve got DRBD configured, we need to start it up – but first, we need to let the two nodes know which is going to be the primary. If you start DRBD on both nodes :

/etc/init.d/drbd start

You will notice that the command appears to hang for a while as DRBD initialises the volumes, but if you check the output of “dmesg” on another terminal, you’ll see something similar to :

drbd0: Creating state block
drbd0: resync bitmap: bits=33155488 words=1036110
drbd0: size = 126 GB (132621952 KB)
drbd0: 132621952 KB now marked out-of-sync by on disk bit-map.
drbd0: Assuming that all blocks are out of sync (aka FullSync)
drbd0: 132621952 KB now marked out-of-sync by on disk bit-map.
drbd0: drbdsetup [4087]: cstate Unconfigured --> StandAlone
drbd0: drbdsetup [4113]: cstate StandAlone --> Unconnected
drbd0: drbd0_receiver [4114]: cstate Unconnected --> WFConnection
drbd0: drbd0_receiver [4114]: cstate WFConnection --> WFReportParams
drbd0: Handshake successful: DRBD Network Protocol version 74
drbd0: Connection established.
drbd0: I am(S): 0:00000001:00000001:00000001:00000001:00
drbd0: Peer(S): 0:00000001:00000001:00000001:00000001:00
drbd0: drbd0_receiver [4114]: cstate WFReportParams --> Connected
drbd0: I am inconsistent, but there is no sync? BOTH nodes inconsistent!
drbd0: Secondary/Unknown --> Secondary/Secondary

And if you also check the contents of /proc/drbd on both nodes, you’ll see something similar to :

version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@weasel, 2008-02-19 10:10:14
 0: cs:Connected st:Secondary/Secondary ld:Inconsistent
    ns:0 nr:0 dw:0 dr:0 al:0 bm:24523 lo:0 pe:0 ua:0 ap:0

So you can see, both nodes have started up and have connected to each other – but have reached deadlock, as neither knows which one should be the primary. Pick one of the nodes to be the primary (in my case, I chose “otter”) and issue the following command on it:

drbdadm -- --do-what-I-say primary all

And then recheck /proc/drbd (or use “/etc/init.d/drbd status”). On the primary you’ll see it change to something like this :

version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@otter, 2008-02-19 10:07:52
 0: cs:SyncSource st:Primary/Secondary ld:Consistent
    ns:169984 nr:0 dw:0 dr:169984 al:0 bm:16200 lo:0 pe:0 ua:0 ap:0
        [>...................] sync'ed:  0.2% (129347/129513)M
        finish: 3:15:56 speed: 11,260 (10,624) K/sec

Whilst the secondary will show :

version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@weasel, 2008-02-19 10:10:14
 0: cs:SyncTarget st:Secondary/Primary ld:Inconsistent
    ns:0 nr:476216 dw:476216 dr:0 al:0 bm:24552 lo:0 pe:0 ua:0 ap:0
        [>...................] sync'ed:  0.4% (129048/129513)M
        finish: 3:32:40 speed: 10,352 (9,716) K/sec

If you have a large underlying pool of storage, you’ll find that this will take a fair amount of time to complete. Once it’s done, we can try manually failing over between the two nodes.
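The wait for the initial sync can be scripted rather than watched by hand. What follows is only a sketch: the function names drbd_cstate and wait_for_sync are made up for this example, and it assumes the DRBD 0.7 /proc/drbd layout shown above, where each device line carries a "cs:" field.

```shell
# Print "<minor> <connection state>" for every device in a /proc/drbd-style file.
# The file defaults to /proc/drbd; another path can be passed for testing.
drbd_cstate() {
    sed -n 's/^ *\([0-9][0-9]*\): cs:\([A-Za-z]*\) .*/\1 \2/p' "${1:-/proc/drbd}"
}

# Block until no device reports a state containing "Sync"
# (SyncSource, SyncTarget and so on). The 60-second poll is an arbitrary choice.
wait_for_sync() {
    while drbd_cstate "$@" | grep -q 'Sync'; do
        sleep 60
    done
}
```

Running wait_for_sync on either node returns once the resync has finished, at which point it is safe to carry on with the failover tests.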

Manual failover and testing

On the primary node, create a filesystem on the DRBD volume, mount it and create some files on it for testing :

mke2fs -j /dev/drbd0
mount /dev/drbd0 /mnt

Your replicated storage should now be mounted under /mnt. Copy some data into it, make some directories, unpack a few tarballs – and check to see how /proc/drbd updates on both nodes. Once you’re done, unmount the volume, and then demote the primary node to secondary :

umount /mnt
drbdadm secondary r0

Now, both nodes are marked as being “secondary”. If we then log into “weasel”, we can promote that node :

drbdadm primary r0

Check /proc/drbd to see that the nodes have in fact swapped over (“otter” should be the secondary now, and “weasel” should be the new primary). You can then mount the volume on weasel and see the data there from earlier.

That’s it for DRBD – you have a functioning replicated pool of storage shared between both nodes. At this point, I’d recommend you take a breather, experiment with the two systems, read the manpages, and see what happens when you reboot each one. It’s also an interesting experiment to see what happens when you try to mount the volume on a non-primary node – you should see that DRBD does the “right thing” in all of these cases, and prevents you from corrupting your data.


As we’ll be using LVM for greater flexibility in defining, managing and backing up our replicated storage, a few words may be needed at this point regarding managing LVM on top of DRBD, as it introduces a few extra complexities. I assume you’re familiar with LVM – if not, the LVM HOWTO is a very good place to start.

To begin with, we’ll set up the DRBD volume as “physical volume” for use with LVM, so install the LVM tools :

apt-get install lvm2

We now need to tell the LVM subsystem to ignore the underlying device (/dev/md3) – otherwise on the primary node, LVM would see two identical physical volumes – one on the underlying disk, and the DRBD volume itself. Edit /etc/lvm/lvm.conf, and inside the “devices” section, add a couple of lines that read :

# Exclude DRBD underlying device
filter = [ "r|/dev/md3|" ]

Now, create a LVM “physical volume” on the DRBD device:

pvcreate /dev/drbd0

Then the volume group, called “storage” :

vgcreate storage /dev/drbd0

And finally, a 10 GB test volume called “test”, which we’ll then format and mount :

lvcreate -L10G -ntest storage
mke2fs -j /dev/storage/test
mount /dev/storage/test /mnt

And again, test the volume by creating a few files under /mnt. Now, when you want to failover, you will need to add a couple of steps. First, on the primary node, unmount your logical volume, and change the volume group state to unavailable :

umount /mnt
vgchange -a n storage
  0 logical volume(s) in volume group "storage" now active

Now, demote the primary node to secondary :

drbdadm secondary r0

Now, you can promote the other node to primary, and you should be able to activate the volume group and logical volume :

drbdadm primary r0
vgchange -a y storage

And you should then be able to mount /dev/storage/test on the new primary node. Obviously, this becomes a major hassle each time you want to failover between nodes, and doesn’t provide any automatic monitoring or failover in the event of hardware failure on one node. We’ll cover that next in Part 3, where we’ll add the Heartbeat package from the Linux-HA project to provide automatic node failover and redundancy.
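Until Heartbeat takes over this job, the manual failover steps can at least be collected into two small helpers. This is a sketch only: the function names and the DRY_RUN switch are inventions for this example, while the resource (r0), volume group (storage), logical volume (test) and mount point (/mnt) are the ones used above.

```shell
# Echo the command instead of running it when DRY_RUN is set (useful for rehearsal).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run on the current primary: release the storage in the order described above.
release_storage() {
    run umount /mnt &&
    run vgchange -a n storage &&
    run drbdadm secondary r0
}

# Run on the node taking over: promote DRBD, then activate and mount the volume.
acquire_storage() {
    run drbdadm primary r0 &&
    run vgchange -a y storage &&
    run mount /dev/storage/test /mnt
}
```

Each step is chained with &&, so a failed unmount stops the sequence before the volume group is deactivated or the node is demoted.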

Category: DRBD, STORAGE | Comments are disabled on Configuring DRBD
November 17




Simply install from apt-get on both nodes :

# apt-get install heartbeat

This will give a warning at the end (“Heartbeat not configured”), which you can ignore. You now need to set up authentication for both nodes. This is very simple, and just uses a shared secret key.

Create /etc/ha.d/authkeys on both systems with the following content:

auth 1
1 sha1 secret

In this sample file, the auth 1 directive says to use key number 1 for signing outgoing packets. The 1 sha1… line describes how to sign the packets. Replace the word “secret” with the passphrase of your choice.

As this is stored in plaintext, make sure that it is owned by root and has a restrictive set of permissions on it :

# chown root:root /etc/ha.d/authkeys
# chmod 600 /etc/ha.d/authkeys

Make sure that copies of this file are identical across both nodes, and don’t have any blank lines etc. in them.
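If you would rather not invent a passphrase, the file can be generated. The helpers below are a sketch with made-up names (make_authkeys, random_secret); they assume sha1sum from coreutils is available and produce the same two-line format shown above.

```shell
# Print an authkeys file body for the given shared secret (key number 1, sha1).
make_authkeys() {
    printf 'auth 1\n1 sha1 %s\n' "$1"
}

# Print a random 40-character hex string to use as the secret.
random_secret() {
    head -c 64 /dev/urandom | sha1sum | cut -d ' ' -f 1
}
```

One way to use them is ( umask 077; make_authkeys "$(random_secret)" > /etc/ha.d/authkeys ) on one node, then copy the resulting file to the other node so that both copies are identical.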

Now, we need to set up the global cluster configuration file. Create the /etc/ha.d/ file on both nodes as follows :

# Interval in seconds between heartbeat packets
keepalive 1
# How long to wait in seconds before deciding node is dead
deadtime 10
# How long to wait in seconds before warning node is dead
warntime 5
# How long to wait in seconds before deciding node is dead
# When heartbeat is first started
initdead 60
# If using serial port for heartbeat
baud 9600
serial /dev/ttyS0
# If using network for heartbeat
udpport 694
# eth1 is our dedicated cluster link (see diagram in part 1)
bcast eth1
# Don't want to auto failback, let admin check and do it manually if needed
auto_failback off
# Nodes in our cluster
node otter
node weasel


We now need to tell Heartbeat about what resources we want it to manage. This is configured in the /etc/ha.d/haresources file. The format for this is again very simple – it just takes the form :

<hostname> resource[::arg1:arg2:arg3:........:argN]

Resources can either be one of the supplied scripts in /etc/ha.d/resource.d :

# ls /etc/ha.d/resource.d
AudibleAlarm  db2  Delay  drbddisk  Filesystem  ICP  IPaddr  IPaddr2  IPsrcaddr  
IPv6addr  LinuxSCSI  LVM  LVSSyncDaemonSwap  MailTo  OCF  portblock  SendArp  ServeRAID  
WAS  WinPopup  Xinetd

Or, they can be one of the init scripts in /etc/init.d, and Heartbeat will search those locations in that order.

To start with, we’ll want to move the DRBD resource we configured in part 2 between the two nodes. This can be accomplished via the “drbddisk” script, provided by the drbd0.7-utils package. The /etc/ha.d/haresources configuration file should therefore look like the following :

weasel drbddisk::r0

This says that the node “weasel” should be the preferred node for this service. The resource script is “drbddisk”, which can be found under /etc/ha.d/resource.d, and we’re passing it the argument “r0”, which is our DRBD resource configured in part 2.

To test this out, make the DRBD resource secondary by running the following on both nodes :

# drbdadm secondary r0

And then start the cluster on both nodes :

# /etc/init.d/heartbeat start
Starting High-Availability services: 

Once they’ve started up, check the cluster status using the cl_status tool. First, let’s check which nodes Heartbeat thinks are in the cluster :

# cl_status listnodes

Now, check both nodes are up :

# cl_status nodestatus weasel
# cl_status nodestatus otter

We can also use the cl_status tool to see which cluster links are available (which should be eth1 and /dev/ttyS0) :

# cl_status listhblinks otter
# cl_status hblinkstatus otter eth1
# cl_status hblinkstatus otter /dev/ttyS0

And we can also use it to check which resources each node has :

[root@otter] # cl_status rscstatus
[root@weasel] # cl_status rscstatus
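The node and link checks above can be wrapped in a couple of loops so that nothing needs to be typed per node. This is a sketch: the function names are made up here, and only the cl_status subcommands already shown (listnodes, nodestatus, listhblinks, hblinkstatus) are used.

```shell
# Print "<node>: <status>" for every node Heartbeat knows about.
cluster_node_status() {
    for node in $(cl_status listnodes); do
        printf '%s: %s\n' "$node" "$(cl_status nodestatus "$node")"
    done
}

# Print "<node> <link>: <status>" for every heartbeat link to the given node.
cluster_link_status() {
    for link in $(cl_status listhblinks "$1"); do
        printf '%s %s: %s\n' "$1" "$link" "$(cl_status hblinkstatus "$1" "$link")"
    done
}
```

For example, run cluster_node_status on either node, and cluster_link_status otter on weasel.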

You should be able to check the output of /proc/drbd on both systems and see that r0 has been made the master on weasel. To failover to otter, simply restart the Heartbeat services on weasel :

# /etc/init.d/heartbeat restart
Stopping High-Availability services:
Waiting to allow resource takeover to complete:
Starting High-Availability services:

Now, check /proc/drbd and you should see that it is now the master on otter. You can confirm this with cl_status :

[root@otter] # cl_status rscstatus
[root@weasel] # cl_status rscstatus

If you want to try a more dramatic approach, try yanking the power out of otter. You should see output similar to the following appear in /var/log/ha-log on weasel :

heartbeat: 2009/02/03_15:06:29 info: Resources being acquired from otter.
heartbeat: 2009/02/03_15:06:29 info: acquire all HA resources (standby).
heartbeat: 2009/02/03_15:06:29 info: Acquiring resource group: weasel drbddisk::r0
heartbeat: 2009/02/03_15:06:29 info: Local Resource acquisition completed.
heartbeat: 2009/02/03_15:06:29 info: all HA resource acquisition completed (standby).
heartbeat: 2009/02/03_15:06:29 info: Standby resource acquisition done [all].
heartbeat: 2009/02/03_15:06:29 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2009/02/03_15:06:29 info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
heartbeat: 2009/02/03_15:06:39 WARN: node otter: is dead
heartbeat: 2009/02/03_15:06:39 info: Dead node otter gave up resources.

Play around with this a few times, and make sure you’re familiar with your resource moving between systems. Once you’re happy with this, we’ll add our LVM volume group into the configuration. Edit the /etc/ha.d/haresources file, and modify it so that it looks like the following :

weasel drbddisk::r0 \

The backslash (\) character just tells Heartbeat that this should all be treated as one resource group – the same as a backslash indicates a line continuation in a shell script. Things can be on just one line, but I find it easier to read when it’s split up like this.

Restart Heartbeat on each node in turn, and you should then be able to see the DRBD resource and the LVM volume group move between systems. The next step will cover setting up an iSCSI target, and adding that into the cluster configuration along with a group of managed IP addresses.

Category: HEARTBEAT, NETWORKING | Comments are disabled on Heartbeat
November 17

Building a redundant iSCSI and NFS cluster with Debian

In this part of the series, we’ll configure an iSCSI client (“initiator”), connect it to the storage servers and set up multipathing. Note: as Debian Lenny has been released since this series of articles started, that’s the version we’ll use for the client.

If you refer back to part one to refresh your memory of the network layout, you can see that the storage client (“badger” in that diagram) should have 3 network interfaces :

  • eth0 : 172.16.7.x for the management interface, this is what you’ll use to SSH into it.

And two storage interfaces. As the storage servers (“targets”) are using 192.168.x.1 and 2, I’ve given this client the following addresses :

  • eth1:
  • eth2:

Starting at .10 on each range keeps things clear – I’ve found it can help to have a policy of servers being in a range of, say, 1 to 10, and clients being above this. Before we continue, make sure that these interfaces are configured, and you can ping the storage server over both interfaces, e.g. try pinging and

Assuming the underlying networking is configured and working, the first thing we need to do is install open-iscsi (which is the “initiator” – the iSCSI client). This is done by a simple :

# aptitude install open-iscsi

You should see the package get installed, and the service started :

Setting up open-iscsi (2.0.870~rc3-0.4) ...
Starting iSCSI initiator service: iscsid.
Setting up iSCSI targets:
iscsiadm: No records found!

At this point, we have all we need to start setting up some connections.

There are two ways we can “discover” targets on a server (well, three actually, if you include iSNS, but that’s beyond the scope of this article).

  • We can use “send targets” – this logs into an iSCSI target server, and asks it to send the initiator a list of all the available targets.
  • We can use manual discovery, where we tell the initiator explicitly what targets to connect to.

For this exercise, I’ll first show how “send targets” works, then we’ll delete the records so we can add them back manually later. Sendtargets can be useful if you’re not sure what targets your storage server offers, but you can end up with a lot of stale or unused records if you don’t trim down the ones you’re not using.

So, to get things rolling, we’ll query the targets available on one of the interfaces we’re going to use ( – we’ll set up multipathing later. Run the following as root :

iscsiadm -m discovery -t st -p

And you should see the following output returned :,1

This shows that your initiator has successfully queried the storage server, and has returned a list of targets – which, if you haven’t changed anything since the last article, should just be the one “” target. You can always see which nodes are available to your initiator at any time by simply running :

iscsiadm -m node

A few things have happened behind the scenes that it’s worth checking out at this point. After discovering an available target, the initiator will have created a node record for it under /etc/iscsi/nodes. If you take a look in that directory, you’ll see the following file :


Which is a file that contains specific configuration details for that iSCSI node. Some of these settings are influenced by the contents of /etc/iscsi/iscsid.conf, which governs the overall behaviour of the iSCSI initiator (e.g. settings in iscsid.conf apply to all nodes). We’ll investigate a few of these settings later.

For now though, all your initiator has done is discover a set of available targets, we can’t actually make use of them without “logging in”. So, now run the following as root :

iscsiadm -m node -p -T -l

The arguments to this command are largely self-explanatory – we’re performing an operation on a node (“-m node”), are using the portal we queried earlier (“-p”), are running the operation on a specific target (“-T”) and are logging in to it (“-l”).

You can use the longer form of these arguments if you want. For instance, you could use “--login” instead of “-l” if you feel it makes things clearer (see the man page for iscsiadm for more details). Anyway, you should see the following output after running that command :

Logging in to [iface: default, target:, portal:,3260]
Login to [iface: default, target:, portal:,3260]: successful

If you now check the output from “dmesg”, you’ll see output similar to the following in your logs :

[3688756.079470] scsi0 : iSCSI Initiator over TCP/IP
[3688756.463218] scsi 0:0:0:0: Direct-Access     IET      VIRTUAL-DISK     0    PQ: 0 ANSI: 4
[3688756.580379]  sda: unknown partition table
[3688756.581606] sd 0:0:0:0: [sda] Attached SCSI disk

The last line is important – it tells us the device node that the iSCSI node has been created under. You can also query this information by running :

iscsiadm -m session -P3

Which will display a lot of information about your iSCSI session, including the device it has created for you.

If you go back to your storage server now, you can see your client has connected and logged in to the target :

# cat /proc/net/iet/session
                cid:0 ip: state:active hd:none dd:none

You now have a device on your iSCSI client that you can partition and format, just like it was a locally attached disk. Give it a try: fire up fdisk on it, create some partitions, format and mount them. You should find it behaves just the same as a local disk, although the speed will be limited by the capacity of your link to the storage server.

Once you’ve finished, make sure any filesystem you have created on the volume is unmounted, and we’ll then log out of the node and delete its record :

# iscsiadm -m node -p -T --logout
Logging out of session [sid: 1, target:, portal:,3260]
Logout of [sid: 1, target:, portal:,3260]: successful
# iscsiadm -m node -p -T -o delete

You should now find that the record for it has been removed from /etc/iscsi/nodes.


We’ll now manually log into the target on both paths to our storage server, and combine the two devices into one multipathed, fault-tolerant device that can handle the failure of one path.

Before we start, you’ll want to change a few of the default settings in /etc/iscsi/iscsid.conf – if you want any nodes you’ve added to the server to automatically be added back when the server reboots, you’ll want to change

node.startup = manual

to

node.startup = automatic

The default timeouts are also far too high when we’re using multipathing – you’ll want to set the following values :

node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 10
node.session.timeo.replacement_timeout = 15

Make sure you restart open-iscsi so these changes get picked up. We can then manually log into both paths to the storage server :

iscsiadm -m node -p -T -o new
iscsiadm -m node -p -T -l
iscsiadm -m node -p -T -o new
iscsiadm -m node -p -T -l

Note the use of “-o new” to manually specify and add the node, instead of using sendtargets discovery. After this, you should find that you have two devices created – in my case, these were /dev/sda and /dev/sdb. We now need to combine these using multipathing.
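With more than two paths, the add-then-login pair gets repetitive. A small loop does the same job; this is a sketch, the function name iscsi_add_paths is made up, and the target name and portal addresses in the usage note are placeholders from the documentation ranges, not the values used in this series.

```shell
# Add a node record for one target on each portal and log in to it.
# $1 = target name, remaining arguments = portal addresses.
iscsi_add_paths() {
    target=$1
    shift
    for portal in "$@"; do
        iscsiadm -m node -p "$portal" -T "$target" -o new &&
        iscsiadm -m node -p "$portal" -T "$target" -l || return 1
    done
}
```

Usage would look like iscsi_add_paths iqn.2001-04.com.example:storage 192.0.2.1 198.51.100.1, substituting your own target name and one portal address per storage network.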

First, install “multipath-tools” :

aptitude install multipath-tools

And then create a default configuration file under /etc/multipath.conf with the following contents :

defaults {
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    multibus
        getuid_callout          "/lib/udev/scsi_id -g -u -s /block/%n"
        prio_callout            /bin/true
        path_checker            readsector0
        rr_min_io               100
        rr_weight               priorities
        failback                immediate
        no_path_retry           fail
        user_friendly_names     no
}
blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z][[0-9]*]"
}

The first section sets some defaults for the multipath daemon, including how it should identify devices. The blacklist section lists devices that should not be multipathed so the daemon can ignore them. You can see it’s using regular expressions to exclude a number of entries under /dev, including anything starting with “hd”. This will exclude internal IDE devices, for instance. You may need to tune this to your needs, but it should work OK for this example.

Restart the daemon with

/etc/init.d/multipath-tools restart

And check what it can see with the command “multipath -ll”:

# multipath -ll
149455400000000000000000001000000c332000011000000 dm-0 IET     ,VIRTUAL-DISK
\_ round-robin 0 [prio=1][active]
 \_ 1:0:0:0 sda  8:0    [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 2:0:0:0 sdb  8:16   [active][ready]

That long number on the first line of output is the WWID of the multipathed device, which is similar to a MAC address in networking. It’s a unique identifier for this device, and you can see the components below it. You’ll also have a new device created under /dev/mapper :


Which is the multipathed device. You can access this the same as you would the individual devices, but I always find that long WWID a little too cumbersome. Fortunately, you can assign short names to multipathed devices. Just edit /etc/multipath.conf, and add the following section (replacing the WWID with your value) :

multipaths {
        multipath {
                wwid 149455400000000000000000001000000c332000011000000
                alias mpio
        }
}

And restart multipath-tools. When you next run “multipath -ll”, you should see the following :

mpio (149455400000000000000000001000000c332000011000000) dm-0 IET     ,VIRTUAL-DISK

And you can now access your volume through /dev/mapper/mpio.

Failing a path

To see what happens when a path fails, try creating a filesystem on your multipathed device (you may wish to partition it first, or you can use the whole device) and then mounting it. For example:

mke2fs -j /dev/mapper/mpio
mount /dev/mapper/mpio /mnt

While the volume is mounted, try unplugging one of the storage switches – in this case, I tried pulling the power supply from the switch on the 192.168.2.x network. I then ran “multipath -ll”, which paused for a short time (the timeout values set above), and then I saw the following :

sdb: checker msg is "directio checker reports path is down"
mpio (149455400000000000000000001000000c332000011000000) dm-0 IET     ,VIRTUAL-DISK
\_ round-robin 0 [prio=1][active]
 \_ 3:0:0:0 sda  8:0    [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 4:0:0:0 sdb  8:16   [active][faulty]

So, one path to our storage is unavailable – you can see it marked above as faulty. However, as the 192.168.1.x network path is still available, IO can continue to the remaining “sda” component of the device. The volume was still mounted, and I could carry on copying data to and from it. I then plugged the switch back in, and after a short pause, multipathd shows both paths as active again :

# multipath -ll
mpio (149455400000000000000000001000000c332000011000000) dm-0 IET     ,VIRTUAL-DISK
\_ round-robin 0 [prio=1][active]
 \_ 3:0:0:0 sda  8:0    [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 4:0:0:0 sdb  8:16   [active][ready]
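Rather than eyeballing the listing after every test, the faulty markers can be counted. This is a sketch: the function name is made up, and it assumes the older multipath-tools output format shown above, where a failed path carries the literal text [faulty].

```shell
# Count paths marked [faulty] in "multipath -ll" output read from stdin.
# grep exits non-zero when the count is 0, so that status is deliberately ignored.
count_faulty_paths() {
    grep -F -c '[faulty]' || true
}
```

Running multipath -ll | count_faulty_paths from cron or a monitoring check then gives a single number to alert on.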

You now have a resilient, fault-tolerant iSCSI SAN!

That’s it for this part – in the next part, I’ll add an NFS server to the mix, tie off a few loose ends, and discuss some performance tuning issues, as well as post some scripts I’ve written to automate some of this.

Category: MPIO, STORAGE | Comments are disabled on Building a redundant iSCSI and NFS cluster with Debian
November 17

Linux SAN Multipathing

There are a lot of SAN multipathing solutions on Linux at the moment. Two of them are discussed in this post. The first is device mapper multipathing, a failover and load-balancing solution with a lot of configuration options. The second (mdadm multipathing) is a failover-only solution in which a failed path has to be re-enabled manually. The advantage of mdadm multipathing is that it is very easy to configure.

Before using a multipathing solution in a production environment on Linux, it is also important to determine whether the chosen solution is supported with the hardware in use. For example, HP doesn’t support the Device Mapper Multipathing solution on their servers yet.

Device Mapper Multipathing

Procedure for configuring the system with DM-Multipath:

  1. Install device-mapper-multipath rpm
  2. Edit the multipath.conf configuration file:
    • comment out the default blacklist
    • change any of the existing defaults as needed
  3. Start the multipath daemons
  4. Create the multipath device with the multipath command

Install Device Mapper Multipath

# rpm -ivh device-mapper-multipath-0.4.7-8.el5.i386.rpm
warning: device-mapper-multipath-0.4.7-8.el5.i386.rpm: Header V3 DSA signature:
Preparing...                ########################################### [100%]
1:device-mapper-multipath########################################### [100%]

Initial Configuration

Set user_friendly_names. The devices will be created as /dev/mapper/mpath[n]. Comment out the default blacklist.

# vim /etc/multipath.conf

#blacklist {
#        devnode "*"
#}

defaults {
user_friendly_names yes
path_grouping_policy multibus
}


Load the required module and start the service.

# modprobe dm-multipath
# /etc/init.d/multipathd start
# chkconfig multipathd on

Print out the multipathed device.

# multipath -v2
# multipath -v3


Configure device type in config file.

# cat /sys/block/sda/device/vendor

# cat /sys/block/sda/device/model

# vim /etc/multipath.conf
devices {

device {
vendor                  "HP"
product                 "HSV200"
path_grouping_policy    multibus
no_path_retry           "5"
}

}

Configure multipath device in config file.

# cat /var/lib/multipath/bindings

# Format:
# alias wwid
mpath0 3600508b400070aac0000900000080000

# vim /etc/multipath.conf

multipaths {

multipath {
wwid                    3600508b400070aac0000900000080000
alias                   mpath0
path_grouping_policy    multibus
path_checker            readsector0
path_selector           "round-robin 0"
failback                "5"
rr_weight               priorities
no_path_retry           "5"
}

}
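The WWID for the multipaths section does not have to be copied by hand; it can be looked up from the bindings file. This is a sketch with a made-up function name, assuming the two-column "alias wwid" format shown in the bindings listing above.

```shell
# Print the WWID bound to an alias in a multipath bindings file.
# $1 = alias, $2 = bindings file (defaults to /var/lib/multipath/bindings).
wwid_for_alias() {
    awk -v name="$1" '$1 == name { print $2 }' "${2:-/var/lib/multipath/bindings}"
}
```

For example, wwid_for_alias mpath0 prints the identifier to paste after the wwid keyword.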

Blacklist the devices that should not be multipathed (e.g. local RAID devices, volume groups).

# vim /etc/multipath.conf

devnode_blacklist {

devnode "^cciss!c[0-9]d[0-9]*"
devnode "^vg*"

}

Show Configured Multipaths.

# dmsetup ls --target=multipath
mpath0  (253, 1)

# multipath -ll

mpath0 (3600508b400070aac0000900000080000) dm-1 HP,HSV200
[size=10G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=4][active]
\_ 0:0:0:1 sda 8:0   [active][ready]
\_ 0:0:1:1 sdb 8:16  [active][ready]
\_ 1:0:0:1 sdc 8:32  [active][ready]
\_ 1:0:1:1 sdd 8:48  [active][ready]

Format and mount Device

Fdisk cannot be used with /dev/mapper/[dev_name] devices. Use fdisk on the underlying disks and execute the following command when device-mapper multipath maps the device to create a /dev/mapper/mpath[n] device for the partition.

# fdisk /dev/sda

# kpartx -a /dev/mapper/mpath0

# ls /dev/mapper/*
mpath0  mpath0p1

# mkfs.ext3 /dev/mapper/mpath0p1

# mount /dev/mapper/mpath0p1 /mnt/san

After that /dev/mapper/mpath0p1 is the first partition on the multipathed device.

Multipathing with mdadm on Linux

The md multipathing solution is a failover-only solution, which means that only one path is used at a time and no load balancing is done.

Start the MD Multipathing Service

# chkconfig mdmpd on

# /etc/init.d/mdmpd start

On the first node (if it is a shared device)

Make Label on Disk

# fdisk /dev/sdt
Disk /dev/sdt: 42.9 GB, 42949672960 bytes
64 heads, 32 sectors/track, 40960 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes

Device Boot      Start         End      Blocks   Id  System
/dev/sdt1               1       40960    41943024   fd  Linux raid autodetect

# partprobe

Bind multiple paths together

# mdadm --create /dev/md4 --level=multipath --raid-devices=4 /dev/sdq1 /dev/sdr1 /dev/sds1 /dev/sdt1


# mdadm --detail /dev/md4
UUID : b13031b5:64c5868f:1e68b273:cb36724e

Set md configuration in config file

# vim /etc/mdadm.conf

# Multiple Paths to RAC SAN
DEVICE /dev/sd[qrst]1
ARRAY /dev/md4 uuid=b13031b5:64c5868f:1e68b273:cb36724e

# cat /proc/mdstat

On the second Node (Copy the /etc/mdadm.conf from the first node)

# mdadm -As

# cat /proc/mdstat

Restore a failed path

# mdadm /dev/md4 -f /dev/sdt1 -r /dev/sdt1 -a /dev/sdt1
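
To find out which path needs restoring, the failed members can be read from /proc/mdstat. This is a sketch: the function name is made up, and it assumes the usual mdstat layout in which a failed member is listed as name[index](F).

```shell
# Print the member devices flagged (F) in /proc/mdstat-style input.
# $1 = file to read (defaults to /proc/mdstat).
failed_md_members() {
    grep -o '[a-z0-9]*\[[0-9]*](F)' "${1:-/proc/mdstat}" | sed 's/\[.*//'
}
```

Each name printed can then be passed, prefixed with /dev/, to the mdadm command above.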
Category: MULTIPATH, STORAGE | Comments are disabled on Linux SAN Multipathing
November 17

Enabling and Disabling Multipathing in the Linux Operating System

This section describes how to enable and disable multipathing in supported versions of the Linux operating system. The following subsections are included:

About Multipathing

After cabling your server for multipath, you will see two copies of each disk from the OS since you are using two separate array paths (SAS A and B). If you want to have multiple hosts accessing disks in the array, you must first set up zoning per host as described in Chapter 3, Adding and Zoning Array Storage Using CAM.

For example, if you have created a zone in each of the array’s SAS domains that includes three disks, when entering the lsscsi command before MPxIO is installed, you will see two of each multipathed disk.


[1:0:0:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sda
[1:0:1:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sdb
[1:0:2:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sdc
[1:0:3:0] enclosu SUN Storage J4500 3R21 -
[2:0:0:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sdd
[2:0:1:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sde
[2:0:2:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sdf
[2:0:3:0] enclosu SUN Storage J4500 3R21 -

Once the multipath daemon is started on the host, you can see multipath details using the multipath command.

multipath -ll

35000c5000357625b dm-2 SEAGATE,ST340008SSUN0.4
    \_ round-robin 0 [prio=2][active]
    \_ 1:0:1:0  sdb 8:0    [active][ready]
    \_ 2:0:1:0  sde 8:192  [active][ready]

To correlate the dm-2 dual path disks with what’s displayed in CAM (in the Host details page, see Two Hosts With Zoned Disks in CAM), use the lsscsi command with the -g option.

lsscsi -g

[1:0:0:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sda /dev/sg0
[1:0:1:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sdb /dev/sg1
[1:0:2:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sdc /dev/sg2
[1:0:3:0] enclosu SUN Storage J4500 3R21 - /dev/sg3
[2:0:0:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sdd /dev/sg4
[2:0:1:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sde /dev/sg5
[2:0:2:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sdf /dev/sg6
[2:0:3:0] enclosu SUN Storage J4500 3R21 - /dev/sg7

For each disk, CAM will report the device names from the last column in the /dev/sgN format.
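To build the mapping between block devices and the /dev/sgN names in one go, the lsscsi -g listing can be filtered. This is a sketch with a made-up function name, assuming the column layout shown above, where the last two fields of a disk line are the block device and the sg device.

```shell
# Print "<block device> <sg device>" for each disk line of "lsscsi -g" output on stdin.
# Enclosure lines are skipped because their second field is not "disk".
sd_to_sg() {
    awk '$2 == "disk" { print $(NF-1), $NF }'
}
```

It would be used as lsscsi -g | sd_to_sg.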

To Enable Multipathing in Linux

  1. Attach a J4500 array to a server with a supported version of Linux installed.
  2. On the server, edit or create the /etc/multipath.conf file.
  3. Reboot the server.
  4. After the reboot, make sure that the OS discovers all the disks in the J4500 array by using either of the Linux commands fdisk or lsscsi.
  5. Partition any disks you want to the desired sizes.
  6. Use the Linux command modprobe to add the loadable kernel modules dm-multipath and dm-round-robin.

    modprobe dm-multipath

    modprobe dm-round-robin

  7. Start the multipathd daemon.

    For Linux SUSE 9, use the following command:

    multipathd -v0

    For other supported Linux versions, use the following command:

    service multipathd start

  8. Start the multipathing device mapper target autoconfig.

    multipath -v2

  9. List the multipath devices that have been created.

    multipath -ll

    The output should list the same number of devices as there are disks in the J4500 array. The following is an example of output:

    35000c5000357625b dm-2 SEAGATE,ST340008SSUN0.4
        \_ round-robin 0 [prio=2][active]
        \_ 0:0:0:0  sda 8:0    [active][ready]
        \_ 1:0:0:0  sdm 8:192  [active][ready]

To Disable Multipathing in Linux

  1. If a RAID volume, LVM volume, or volume mount has been placed over the device node of the multipathed disk, quiesce the volume.
  2. Use the multipath -f command to disable multipathing to a specific device.

    multipath -f mpath1

  3. Use the multipath -F command to disable multipathing on all multipathed devices.

    multipath -F

    Note – If the message map in use appears for a device when you attempt to disable multipathing, the device is still in use. You must unmount or otherwise quiesce the device before you can disable multipathing. If you cannot quiesce the device, edit the /etc/multipath.conf file to exclude the device and then reboot the server.

Category: MULTIPATH, STORAGE | Comments are disabled on Enabling and Disabling Multipathing in the Linux Operating System