September 26

Linux File System Read Write Performance Test

The Simplest Performance Test Using dd command

The simplest read/write performance test in Linux can be done with the dd command, which can write to or read from any block device in Linux. You can do a lot of things with it, and its main plus point is that it is readily available out of the box in almost all distributions and is pretty easy to use.

With dd we will only be testing sequential read and sequential write. I will test the speed of my partition /dev/sda1, which is mounted on “/” (the only partition I have on my system), so I can write the test data anywhere in my filesystem.

[root@slashroot2 ~]# dd if=/dev/zero of=speetest bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.0897865 seconds, 1.2 GB/s
[root@slashroot2 ~]

In the above command you will be amazed to see that you got 1.2 GB/s. But don't celebrate yet; that figure is misleading. The speed dd reported is the speed at which the data was cached in RAM, not the speed at which it was written to the disk. So we need to ask dd to report the speed only after the data is synced with the disk. For that we need to run the command below.

[root@slashroot2 ~]# dd if=/dev/zero of=speetest bs=1M count=100 conv=fdatasync
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 2.05887 seconds, 50.9 MB/s

As you can clearly see, with conv=fdatasync the dd command reports the transfer rate only after the data is completely written to the disk, so now we have the actual sequential write speed. Let's move on to an amount of data that is larger than the RAM: 200 MB of data in 64 KB blocks.

[root@slashroot2 ~]# dd if=/dev/zero of=speedtest bs=64k count=3200 conv=fdatasync
3200+0 records in
3200+0 records out
209715200 bytes (210 MB) copied, 3.51895 seconds, 59.6 MB/s

 

As you can clearly see, the speed came down to 59 MB/s. Note that if you do not specify a block size, ext3 gets formatted with a block size determined by programs like mke2fs. You can verify yours with the following commands.

tune2fs -l /dev/sda1

dumpe2fs /dev/sda1
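
If you only want the block size itself, a quick way to pull it out (a minimal sketch; the exact label may differ slightly between e2fsprogs versions) is to grep the tune2fs output:

tune2fs -l /dev/sda1 | grep -i 'block size'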

For testing the sequential read speed with dd, run the command below.

[root@myvm1 sarath]# dd if=speedtest of=/dev/null bs=64k count=24000
5200+0 records in
5200+0 records out
340787200 bytes (341 MB) copied, 3.42937 seconds, 99.4 MB/s
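
Keep in mind that if the file you are reading was just written, much of it may still sit in the page cache, which inflates the read figure. A hedged way to get a cold-cache number (as root) is to drop the caches before re-running the read test:

sync
echo 3 > /proc/sys/vm/drop_caches
dd if=speedtest of=/dev/null bs=64k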

Performance Test using HDPARM

Now let's use a tool other than dd for our tests. We will start with the hdparm command. hdparm is also available out of the box in most Linux distributions.

[root@myvm1 ~]# hdparm -tT /dev/sda1
 
/dev/sda1:
 Timing cached reads:   5808 MB in  2.00 seconds = 2908.32 MB/sec
 Timing buffered disk reads:   10 MB in  3.12 seconds =   3.21 MB/sec

 

There are a couple of things to understand in the above hdparm results. The -T option shows the speed of reading from the buffer cache (that is why it is much, much higher).

The -t option shows the speed of buffered disk reads without previously cached data (which, from the above output, is a low 3.21 MB/sec).

The hdparm output shows you the cached reads and the disk reads separately. As mentioned before, hard disk seek time also matters a lot for your speed; seek time is the time the disk needs to reach the sector where the data is stored. You can check yours with the seeker tool, using the simple command below.

[root@slashroot2 ~]# seeker /dev/sda1
Seeker v3.0, 2009-06-17, http://www.linuxinsight.com/how_fast_is_your_disk.html
Benchmarking /dev/sda1 [81915372 blocks, 41940670464 bytes, 39 GB, 39997 MB, 41 GiB, 41940 MiB]
[512 logical sector size, 512 physical sector size]
[1 threads]
Wait 30 seconds..............................
Results: 87 seeks/second, 11.424 ms random access time (26606211 < offsets < 41937280284)
[root@slashroot2 ~]#

The output clearly shows that my disk did 87 seeks for sectors containing data per second. That is OK for a desktop Linux machine, but for servers it is not OK at all.

Read Write Benchmark Test using IOZONE:

Now there is one tool out there in Linux that will do all of these tests in one shot: IOZONE. We will run some benchmark tests against my /dev/sda1 with the help of iozone. Computers and servers are always purchased with some purpose in mind: some need to be high end performance-wise, some need to be fast at sequential reads, and others are ordered with random reads in mind. IOZONE is very helpful for carrying out a large number of performance benchmark tests against drives, and the output it produces is quite detailed.

The -a command line option runs iozone in full automatic mode, in which it tests block sizes ranging from 4k to 16M and file sizes ranging from 64k to 512M. Let's do a test using this -a option and see what happens.
[root@myvm1 ~]# iozone -a /dev/sda1
             Auto Mode
        Command line used: iozone -a /dev/sda1
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  random    bkwd   record   stride
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
              64       4  172945  581241  1186518  1563640  877647  374157  484928   240642   985893   633901   652867 1017433  1450619
              64       8   25549  345725   516034  2199541 1229452  338782  415666   470666  1393409   799055   753110 1335973  2071017
              64      16   68231  810152  1887586  2559717 1562320  791144 1309119   222313  1421031   790115   538032  694760  2462048
              64      32  338417  799198  1884189  2898148 1733988  864568 1421505   771741  1734912  1085107  1332240 1644921  2788472
              64      64   31775  811096  1999576  3202752 1832347  385702 1421148   771134  1733146   864224   942626 2006627  3057595
             128       4  269540  699126  1318194  1525916  390257  407760  790259   154585   649980   680625   684461 1254971  1487281
             128       8  284495  837250  1941107  2289303 1420662  779975  825344   558859  1505947   815392   618235  969958  2130559
             128      16  277078  482933  1112790  2559604 1505182  630556 1560617   624143  1880886   954878   962868 1682473  2464581
             128      32  254925  646594  1999671  2845290 2100561  554291 1581773   723415  2095628  1057335  1049712 2061550  2850336
             128      64  182344  871319  2412939   609440 2249929  941090 1827150  1007712  2249754  1113206  1578345 2132336  3052578
             128     128  301873  595485  2788953  2555042 2131042  963078  762218   494164  1937294   564075  1016490 2067590  2559306
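
A full automatic run can take quite a while on larger disks. As a rough sketch (the file path here is just an example, and option letters may vary slightly between iozone versions, so check iozone -h), you can restrict the run to only the sequential write and read tests with a fixed file and record size:

iozone -i 0 -i 1 -s 512m -r 64k -f /root/iozone.tmp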

           

Category: STORAGE
July 3

Configure NFS Server to share directories on the local network.

root@nas01:~# aptitude -y install nfs-kernel-server
root@nas01:~# vi /etc/idmapd.conf

# line 6: uncomment and change to your domain name
Domain = server.world
root@nas01:~# vi /etc/exports

# write like below *note

/home 10.0.0.0/24(rw,sync,fsid=0,no_root_squash,no_subtree_check)
# *note
/home ⇒ shared directory
10.0.0.0/24 ⇒ range of networks NFS permits access from
rw ⇒ allow read and write
sync ⇒ reply to requests only after changes have been committed to stable storage
fsid=0 ⇒ mark this export as the NFSv4 pseudo-filesystem root
no_root_squash ⇒ allow root on the client to act as root on the share
no_subtree_check ⇒ disable subtree checking

root@nas01:~# /etc/init.d/nfs-kernel-server restart
Stopping NFS kernel daemon: mountd nfsd.
Unexporting directories for NFS kernel daemon….
Exporting directories for NFS kernel daemon….
Starting NFS kernel daemon: nfsd mountd.
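
Before moving on to the client, you can optionally confirm that the export is active (a quick sanity check using the standard NFS utilities):

root@nas01:~# exportfs -v
root@nas01:~# showmount -e localhost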

Configure NFS Client to be able to mount shared directory from NFS server.

root@client:~# aptitude -y install nfs-common

root@client:~# vi /etc/idmapd.conf

# line 6: uncomment and change to your domain name
Domain = server.world

root@client:~# /etc/init.d/nfs-common restart
Stopping NFS common utilities: idmapd statd.
Starting NFS common utilities: statd idmapd.

root@client:~# mount -t nfs nas01.server.world:/home /home

root@client:~# df -h

Filesystem                Size  Used Avail Use% Mounted on
rootfs                     19G  745M   17G   5% /
udev                       10M     0   10M   0% /dev
tmpfs                     202M  196K  202M   1% /run
/dev/mapper/www-root       19G  745M   17G   5% /
tmpfs                     5.0M     0  5.0M   0% /run/lock
tmpfs                     403M     0  403M   0% /run/shm
/dev/vda1                 228M   18M  199M   9% /boot
nas01.server.world:/home   19G  917M   17G   6% /home
# home directory on NFS is mounted

root@client:~# vi /etc/fstab
# add at the end: mount the home directory from the NFS server instead of the local one

nas01.server.world:/home /home nfs defaults 0 0
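
To make sure the fstab entry itself works (and not just the earlier manual mount), you can remount /home through it; a simple check:

root@client:~# umount /home
root@client:~# mount /home
root@client:~# df -h /home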

 

 

 

 

Category: NFS, STORAGE
December 10

Setting up cluster synchronization with csync2

Having to synchronize some data across a Debian Linux cluster, I settled on using csync2 for the job.
Here’s a short guide to set it up.

We are assuming two machines here, 01.cluster and 02.cluster. The 01.cluster is gonna be our “master” in this setup.

First on both machines install csync2 by executing:

apt-get install csync2

On each node we now need to generate a certificate for csync2 to communicate. We do it with the following commands; when asked for a challenge password, leave it empty, and leave the common name empty as well:

openssl genrsa -out /etc/csync2_ssl_key.pem 1024
openssl req -new -key /etc/csync2_ssl_key.pem -out /etc/csync2_ssl_cert.csr
openssl x509 -req -days 600 -in /etc/csync2_ssl_cert.csr -signkey /etc/csync2_ssl_key.pem -out /etc/csync2_ssl_cert.pem

On the master we need to generate a preshared key for the nodes to communicate with:

csync2 -k /etc/csync2_ssl_cert.key

Note: You might experience the command hanging for a while; this is because the /dev/random pool isn't filling up fast enough. To remedy this, open a second connection to the server and generate some activity: look through files, download a big file, whatever makes the /dev/random entropy pool fill up.

Now we need to set up the configuration file on the master, which is in /etc/csync2.cfg.

# Csync2 configuration example
group cluster
{
	host 01.cluster;
	host (02.cluster); # Slave host

	key /etc/csync2_ssl_cert.key;

	include /var/www;
	exclude /var/www/sessions;

	auto none;
}

Note: the hostname of all the machines needs to match the output of the hostname command.
The parentheses around 02.cluster are there to make the synchronization work in one direction only. Now we need to copy csync2.cfg and csync2_ssl_cert.key to the slave server(s).
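
A minimal way to push both files across (assuming root SSH access to 02.cluster; adjust the host name to your setup):

scp /etc/csync2.cfg /etc/csync2_ssl_cert.key root@02.cluster:/etc/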

After all this, do a /etc/init.d/openbsd-inetd restart on all machines,
and run csync2 -x on the master to synchronize data to the slaves. Note that data on the slave(s) WILL be overwritten/deleted.

A logical next step would be to run csync2 -x from within a cron job, which I will leave for a later post.

Troubleshooting

If lsyncd doesn't want to start, check the log file; in our case the log file is /var/log/lsyncd.log.
If you encounter a last line like:
Tue Oct 30 18:59:40 2012 Error: Terminating since out of inotify watches.
Tue Oct 30 18:59:40 2012 Error: Consider increasing /proc/sys/fs/inotify/max_user_watches

Then do one of the following:

To change immediately the limit, run:
# echo 32768 > /proc/sys/fs/inotify/max_user_watches

To make the change permanent, edit the file /etc/sysctl.conf and add this line to the end of the file:
fs.inotify.max_user_watches=32768
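
To apply the change from /etc/sysctl.conf without rebooting, reload it:

sysctl -p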

If lsyncd crashed during startup, you should delete the lock file as follows:
rm /var/lock/lsyncd

To check if it runs, run the command:
root@ado:~# ps ax | grep lsyncd
12635 ? Ss 0:00 /usr/local/bin/lsyncd /etc/lsyncd.conf
12647 pts/2 S+ 0:00 grep lsyncd

First Synchronize with Csync2 Manually

Before setting up a cron job to run Csync2 periodically, let's try it manually on dev6c1:

csync2 -xv

If you get any error messages, you can also try the following commands:

csync2 -xvvv
csync2 -TI

Otherwise it should already work as expected.

Periodically Synchronize with Csync2

Now set up a cron job with “crontab -e” as below:

*/1 * * * * csync2 -x >/dev/null 2>&1

Once saved, Csync2 will run once per minute and will automatically check, synchronize, and restart your services if required.

Category: STORAGE, SYNC
November 17

FCoE, but I can talk about iSCSI

I can't talk about FCoE, but I can talk about iSCSI.

First up, what array are you talking to? It shouldn't make a huge difference to the info that follows, just interested.

We use Dell Equallogic arrays at work, and the following config works remarkably well.

First up, get rid of any smell of LACP/PortChannel/LinkAgg between servers and switches and between storage and switches; the only place you should be using link aggregation is switch to switch. iSCSI is a path-aware protocol: in a nutshell, the reason you're only getting 1Gbit/sec is that the whole IO stream is being balanced down one cable, because it is all part of the same session.

A couple of terms of reference:
Initiator – the host
Target Portal – the storage
Session – a logical connection between an initiator and a target

Now, there is basically a 1:1 relationship between initiator and target, a 1:1 relationship between physical interface and initiator address, and a 1:1 relationship between session and physical network port. This pretty much comes back to the FC world, where you have one session between the HBA and the storage array; you don't run multiple sessions over an FC cable.

So given you have two cables plugged in, you will want to create two sessions back to your storage; the multipathing daemon (multipathd) on the Linux box will sort out the MPIO stuff for you.

here are a bunch of config things we do on our servers to make the magic happen

first up, /etc/sysctl.conf

Code:
# Equallogic Configuration settings
# ARP Flux and Return Path Filtering are two behaviours the Linux
# employs for multiple NICs on the same subnet.  They need to be disabled
# to ensure proper MPIO operation with Equallogic storage.

net.ipv4.conf.all.arp_ignore=1
net.ipv4.conf.all.arp_announce=2
net.ipv4.conf.all.rp_filter=0
net.ipv4.conf.default.rp_filter=0

then /etc/iscsi/iscsid.conf

Code:
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 10
node.session.timeo.replacement_timeout = 15

we put this into /etc/rc.local to disable some of the tcp stack offloading

Code:
/sbin/ethtool --offload eth2 gro off
/sbin/ethtool --offload eth3 gro off

For sanity reasons we modify the default autogenerated iSCSI IQN name to something more meaningful; it's located in /etc/iscsi/initiatorname.iscsi.

Code:
iqn.1994-05.com.redhat:<HOSTNAME>-bf81bfe3f7a

Then we turn on jumbo frames on the iSCSI interfaces

Modify /etc/sysconfig/network-scripts/ifcfg-ethX (where X is the interfaces talking to the iSCSI Array)

Add the following

Code:
MTU=9000
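
To confirm that jumbo frames actually make it end to end (an optional sanity check; 8972 bytes is 9000 minus the IP/ICMP headers), you can ping the array with a large packet that is not allowed to fragment:

Code:
ping -M do -s 8972 -c 3 <ARRAY IP>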

now we get stuck into configuring the iscsi daemon itself
first up we create some logical interfaces within the iscsi daemon to direct traffic down

Code:
# iscsiadm --mode iface --op=new --interface iscsi0
# iscsiadm --mode iface --op=new --interface iscsi1

then we link the physical interfaces to the logical interfaces

Code:
# iscsiadm --mode iface --op=update --interface iscsi0 --name=iface.net_ifacename --value=eth2
# iscsiadm --mode iface --op=update --interface iscsi1 --name=iface.net_ifacename --value=eth3

Now we discover the targets we can talk to on our storage array

Code:
# iscsiadm -m discovery -t st -p <ARRAY IP> -I iscsi0 -I iscsi1

and finally we login to the targets we have discovered

Code:
# iscsiadm -m node --login

Right now, with all this done, you should see two sd[a-z] devices for each LUN you have presented. Basically, one of the devices goes via one ethernet cable and the other device via the other ethernet cable, protecting you against cable/switch failure, assuming correct connectivity and config of your switching fabric.
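
To double-check that both sessions are actually up before moving on to multipathing (a quick sketch using standard tools):

Code:
# iscsiadm -m session
# cat /proc/partitions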

now here is a sample config of the multipath system in linux

all this goes into /etc/multipath.conf

Code:
## Use user friendly names, instead of using WWIDs as names.
defaults {
     user_friendly_names yes
}

## Blacklist local devices (no need to have them accidentally "multipathed")
blacklist {
     devnode "^sd[a]$"
     devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
     devnode "^hd[a-z]"
     devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
}

## For EqualLogic PS series RAID arrays
devices {
     device {
          vendor                  "EQLOGIC"
          product                 "100E-00"
          path_grouping_policy    multibus
          getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
          features                "1 queue_if_no_path"
          path_checker            readsector0
          path_selector           "round-robin 0"
          failback                immediate
          rr_min_io               10
          rr_weight               priorities
     }
}

## Assign human-readable name  (aliases) to the World-Wide ID's (names) of the devices we're wanting to manage
## Identify the WWID by issuing # scsi_id -gus /block/sdX

multipaths {
     multipath {
          wwid                    36090a01850e08bd883c4a45bfa0330ba
          alias                     devicename2
     }
     multipath {
          wwid                    36090a01850e04bd683c4745bfa03b07c
          alias                     devicename1
     }
}

There will be other examples in the config file around the device stanza; you'll need to modify the vendor and product values to suit what you're running. When the disk arrived on the machine via iSCSI, dmesg would have recorded some details that help here.

The multipaths stuff isn't required, it just makes life a lot easier.

then start all the services needed

Code:
# service multipathd start
# chkconfig multipathd on

then use the following commands to see if things have worked

Code:
# multipath -v2
# multipath -ll

if everything is good you should get something like this as the output from the second command

Code:
devicename1 (36090a01850e04bd683c4745bfa03b07c) dm-3 EQLOGIC,100E-00
[size=5.0T][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=4][active]
 \_ 13:0:0:0 sdd 8:48  [active][ready]
 \_ 14:0:0:0 sdb 8:16  [active][ready]
 \_ 12:0:0:0 sdc 8:32  [active][ready]
 \_ 11:0:0:0 sde 8:64  [active][ready]

the device to mount will be /dev/mapper/devicename1

Category: iSCSI, STORAGE
November 17

iSCSI MPIO with Nimble

With the implementation of a new Nimble Storage Array, HCL is changing the storage strategy away from fiber channel to iSCSI. If you have not looked at a Nimble array, you really should. Fantastic!

The Nimble allows for four ethernet ports for iSCSI traffic. To have the highest amount of bandwidth and redundancy, MPIO needs to be configured on the system to communicate with the Nimble.

Target (SAN)

  • Nimble Storage Array CS220-X2
  • Discovery IP: 172.16.2.10
  • Data IPs: 172.16.2.11, 172.16.2.12, 172.16.2.13, 172.16.2.14

Initiator (Client)

  • Ubuntu 12.04 LTS
  • Data IP: 10.2.10.46
  • iSCSI IP: 172.16.2.50

Software Prerequisite

# sudo apt-get install open-iscsi open-iscsi-utils multipath-tools

IQN

iSCSI uses an IQN to refer to targets and initiators. Once you install the open-iscsi package, an IQN will be created for you. This can be found in the /etc/iscsi/initiatorname.iscsi file.

# cat /etc/iscsi/initiatorname.iscsi
## DO NOT EDIT OR REMOVE THIS FILE!
## If you remove this file, the iSCSI daemon will not start.
## If you change the InitiatorName, existing access control lists
## may reject this initiator.  The InitiatorName must be unique
## for each iSCSI initiator.  Do NOT duplicate iSCSI InitiatorNames.
InitiatorName=iqn.1993-08.org.debian:01:48a7e07cd57c

Use this initiator IQN to configure your volume on the Nimble array and to create your initiator group. As a practice, we have decided to build our initiator groups based on IQN vs. the IP address of the initiator systems.

Set iSCSI startup to automatic

# sudo ./chconfig /etc/iscsi/iscsid.conf node.startup automatic

chconfig is a small bash script that executes a sed command to change the value of a configuration property to a specific value. It is useful for configuration files written in the form property=value. It is available on github.
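
If you would rather not pull in that helper, an equivalent one-liner (a sketch assuming the stock iscsid.conf line reads “node.startup = manual”) is:

# sudo sed -i 's/^node.startup = manual/node.startup = automatic/' /etc/iscsi/iscsid.conf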

Discover the Target

# sudo iscsiadm -m discovery -t sendtargets -p 172.16.2.10
172.16.2.12:3260,2460 iqn.2007-11.com.nimblestorage:ubuntutest-v681ac6f7ff909e57.0000000a.978a58a0
172.16.2.11:3260,2460 iqn.2007-11.com.nimblestorage:ubuntutest-v681ac6f7ff909e57.0000000a.978a58a0
172.16.2.13:3260,2460 iqn.2007-11.com.nimblestorage:ubuntutest-v681ac6f7ff909e57.0000000a.978a58a0
172.16.2.14:3260,2460 iqn.2007-11.com.nimblestorage:ubuntutest-v681ac6f7ff909e57.0000000a.978a58a0

If everything is running correctly up to this point, you will see all four paths to the Nimble in the output along with the IQNs of the volumes that you have created. In my case, the volume name is ubuntutest.

Configure Multipath

This step is important to do prior to logging into each of the storage paths.

The first step is to log into one of the data targets.

# sudo iscsiadm -m node --targetname "iqn.2007-11.com.nimblestorage:ubuntutest-v681ac6f7ff909e57.0000000a.978a58a0" --portal "172.16.2.11:3260" --login

Once you are logged in, you will be able to get the wwid of the drive. You will need this for the /etc/multipath.conf. This file configures all of your multipath preferences. To get the wwid…

# sudo multipath -ll
202e7bcc950e534c26c9ce900a0588a97 dm-2 Nimble,Server
size=5.0G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  `- 3:0:0:0 sdb 8:16 active ready running

In my case, the wwid is 202e7bcc950e534c26c9ce900a0588a97. Now, open /etc/multipath.conf in your favorite editor and edit the file so it matches something like this…

defaults {
    udev_dir /dev
    polling_interval 10
    prio_callout /bin/true
    path_checker readsector0
    prio const
    failback immediate
    user_friendly_names yes
}

devices {
    device {
            vendor "Nimble*"
            product "*"

            path_grouping_policy multibus

            path_selector "round-robin 0"
            # path_selector "queue-length 0"
            # path_selector "service-time 0"
    }
}

multipaths {
    multipath {
            wwid 202e7bcc950e534c26c9ce900a0588a97
            alias data
    }
}

Now would be a good point to reload the multipath service.

# sudo service multipath-tools reload

Continue to log into the iSCSI targets

# sudo iscsiadm -m node --targetname "iqn.2007-11.com.nimblestorage:ubuntutest-v681ac6f7ff909e57.0000000a.978a58a0" --portal "172.16.2.12:3260" --login

# sudo iscsiadm -m node --targetname "iqn.2007-11.com.nimblestorage:ubuntutest-v681ac6f7ff909e57.0000000a.978a58a0" --portal "172.16.2.13:3260" --login

# sudo iscsiadm -m node --targetname "iqn.2007-11.com.nimblestorage:ubuntutest-v681ac6f7ff909e57.0000000a.978a58a0" --portal "172.16.2.14:3260" --login

Once you have completed logging into each target, you can verify your multipath configuration.

# sudo multipath -ll
data (202e7bcc950e534c26c9ce900a0588a97) dm-2 Nimble,Server
size=5.0G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 14:0:0:0 sdb 8:16 active ready  running
  |- 12:0:0:0 sdc 8:32 active ready  running
  |- 13:0:0:0 sdd 8:48 active ready  running
  `- 11:0:0:0 sde 8:64 active ready  running

The drive will be available at /dev/mapper/data.

Next up will be creating an LVM volume and formatting it with OCFS2 for shared storage in a cluster.

Category: MPIO, STORAGE
November 17

Configuring DRBD

At this point, a diagram might be in order – you’ll have to excuse my “Dia” skills, which are somewhat lacking!

 

Following on from part one, where we covered the basic architecture and got DRBD installed, we’ll proceed to configuring and then initialising the shared storage across both nodes. The configuration file for DRBD (/etc/drbd.conf) is very simple, and is the same on both hosts. The full configuration file is below – you can copy and paste this in; I’ll go through each line afterwards and explain what it all means. Many of these sections and commands can be fine tuned – see the man pages on drbd.conf and drbdsetup for more details.

 

global {
}
resource r0 {
  protocol C;
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

  startup {
    wfc-timeout  0;
  }
  disk {
    on-io-error   detach;
  }
  net {
    on-disconnect reconnect;
  }
  syncer {
    rate 30M;
  }
  on weasel {
    device     /dev/drbd0;
    disk       /dev/md3;
    address    10.0.0.2:7788;
    meta-disk  internal;
  }
  on otter {
    device    /dev/drbd0;
    disk      /dev/md3;
    address   10.0.0.1:7788;
    meta-disk internal;
  }
}

The structure of this file should be pretty obvious – sections are surrounded by curly braces, and there are two main sections – a global one, in which nothing is defined, and a resource section, where a shared resource named “r0” is defined.

The global section only has a few options available to it – see the DRBD website for more information; though it’s pretty safe to say you can ignore this part of the configuration file when you’re getting started.

The next section is the one where we define our shared resource, r0. You can obviously define more than one resource within the configuration file, and you don't have to call them “r0”, “r1” etc. They can have any arbitrary name – but the name MUST only contain alphanumeric characters. No whitespace, no punctuation, and no characters with accents. So, r0 is our shared storage resource. The next line says that we'll be using protocol “C” – and this is what you should be using in pretty much any scenario. DRBD can operate in one of three modes, which provide various levels of replication integrity. Consider what happens in the event of some data being written to disk on the primary :

  • Protocol A considers writes on the primary to be complete if the write request has reached the local disk and the local TCP network buffer. Obviously, this provides no guarantee whatsoever that it will reach the secondary node: any hardware or software failure could result in you losing data. This option should only be chosen if you don't care about the integrity of your data, or are replicating over a high-latency link.
  • Protocol B goes one step further, and considers the write to be complete if it reaches the secondary node. This is a little safer, but you could still lose data in the event of power or equipment failure, as there is no guarantee the data will have reached the disk on the secondary.
  • Protocol C is the safest of the three, as it will only consider the write request completed once the secondary node has safely written it to disk. This is therefore the most common mode of operation, but it does mean that you will have to take into account network latency when considering the performance characteristics of your storage. If you have the fastest drives in the world, it won't help you if you have to wait for a slow secondary node to complete its write over a high-latency connection.

The next somewhat cryptic-looking line (“incon-degr-cmd…”) in the configuration simply defines what to do in the event of the primary node starting up and discovering its copy of the data is inconsistent. We obviously wouldn't want this propagated over to the secondary, so this statement tells it to display an error message, wait 60 seconds and then shut down.

We then come to the startup section – here we simply state that this node should wait forever for the other node to connect.

In the disk section, we tell DRBD that if there is an error or a problem with the underlying storage, it should detach it. The network section states that if we experience a communication problem with the other node, instead of going into standalone mode we should try to reconnect.

The syncer controls various aspects of the actual replication – the main parameter to tune is the rate at which we'll replicate data between the nodes. While it may be tempting at first glance to enter a rate of 100M for a Gigabit network, it is advisable to reserve bandwidth. The documentation (http://www.drbd.org/users-guide/s-configure-syncer-rate.html) says that a good rule of thumb is to use about 30% of the available replication bandwidth, taking into account local IO systems, and not just the network.

We now come to the “meat” of this example – we define the two storage nodes – “otter” and “weasel”. Refer back to part one and the diagram there, if you need a quick reminder on the network layout and host names. For each node, we define what the device name is (/dev/drbd0); the underlying storage (/dev/md3); the address and port that DRBD will be listening on (10.0.0.x:7788); and what to use to store the metadata for the device. “meta-disk internal” just stores the data in the last 128MB of the underlying storage – you could use another disk for this, but it makes sense to use the same device unless it’s causing performance issues.

Initialising DRBD

Now we’ve got DRBD configured, we need to start it up – but first, we need to let the two nodes know which is going to be the primary. If you start DRBD on both nodes :

/etc/init.d/drbd start

You will notice that the command appears to hang for a while as DRBD initialises the volumes, but if you check the output of “dmesg” on another terminal, you’ll see something similar to :

drbd0: Creating state block
drbd0: resync bitmap: bits=33155488 words=1036110
drbd0: size = 126 GB (132621952 KB)
drbd0: 132621952 KB now marked out-of-sync by on disk bit-map.
drbd0: Assuming that all blocks are out of sync (aka FullSync)
drbd0: 132621952 KB now marked out-of-sync by on disk bit-map.
drbd0: drbdsetup [4087]: cstate Unconfigured --> StandAlone
drbd0: drbdsetup [4113]: cstate StandAlone --> Unconnected
drbd0: drbd0_receiver [4114]: cstate Unconnected --> WFConnection
drbd0: drbd0_receiver [4114]: cstate WFConnection --> WFReportParams
drbd0: Handshake successful: DRBD Network Protocol version 74
drbd0: Connection established.
drbd0: I am(S): 0:00000001:00000001:00000001:00000001:00
drbd0: Peer(S): 0:00000001:00000001:00000001:00000001:00
drbd0: drbd0_receiver [4114]: cstate WFReportParams --> Connected
drbd0: I am inconsistent, but there is no sync? BOTH nodes inconsistent!
drbd0: Secondary/Unknown --> Secondary/Secondary

And if you also check the contents of /proc/drbd on both nodes, you’ll see something similar to :

version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@weasel, 2008-02-19 10:10:14
 0: cs:Connected st:Secondary/Secondary ld:Inconsistent
    ns:0 nr:0 dw:0 dr:0 al:0 bm:24523 lo:0 pe:0 ua:0 ap:0

So you can see, both nodes have started up and have connected to each other – but have reached deadlock, as neither knows which one should be the primary. Pick one of the nodes to be the primary (in my case, I chose “otter”) and issue the following command on it:

drbdadm -- --do-what-I-say primary all

And then recheck /proc/drbd (or use “/etc/init.d/drbd status”). On the primary you’ll see it change to something like this :

version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@otter, 2008-02-19 10:07:52
 0: cs:SyncSource st:Primary/Secondary ld:Consistent
    ns:169984 nr:0 dw:0 dr:169984 al:0 bm:16200 lo:0 pe:0 ua:0 ap:0
        [>...................] sync'ed:  0.2% (129347/129513)M
        finish: 3:15:56 speed: 11,260 (10,624) K/sec

Whilst the secondary will show :

version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@weasel, 2008-02-19 10:10:14
 0: cs:SyncTarget st:Secondary/Primary ld:Inconsistent
    ns:0 nr:476216 dw:476216 dr:0 al:0 bm:24552 lo:0 pe:0 ua:0 ap:0
        [>...................] sync'ed:  0.4% (129048/129513)M
        finish: 3:32:40 speed: 10,352 (9,716) K/sec

If you have a large underlying pool of storage, you’ll find that this will take a fair amount of time to complete. Once it’s done, we can try manually failing over between the two nodes.
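
A convenient way to keep an eye on the initial sync (just standard tooling, nothing DRBD-specific):

watch -n 5 cat /proc/drbd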

Manual failover and testing

On the primary node, create a filesystem on the DRBD volume, mount it and create some files on it for testing :

mke2fs -j /dev/drbd0
mount /dev/drbd0 /mnt

Your replicated storage should now be mounted under /mnt. Copy some data into it, make some directories, unpack a few tarballs – and check to see how /proc/drbd updates on both nodes. Once you’re done, unmount the volume, and then demote the primary node to secondary :

umount /mnt
drbdadm secondary r0

Now, both nodes are marked as being “secondary”. If we then log into “weasel”, we can promote that node :

 drbdadm primary r0

Check /proc/drbd to see that the nodes have in fact swapped over (“otter” should be the secondary now, and “weasel” should be the new primary). You can then mount the volume on weasel and see the data there from earlier.

That’s it for DRBD – you have a functioning replicated pool of storage shared between both nodes. At this point, I’d recommend you take a breather, experiment with the two systems, read the manpages, and see what happens when you reboot each one. It’s also an interesting experiment to see what happens when you try to mount the volume on a non-primary node – you should see that DRBD does the “right thing” in all of these cases, and prevents you from corrupting your data.

LVM

As we’ll be using LVM for greater flexibility in defining, managing and backing up our replicated storage, a few words may be needed at this point regarding managing LVM on top of DRBD, as it introduces a few extra complexities. I assume you’re familiar with LVM – if not, the LVM HOWTO is a very good place to start.

To begin with, we’ll set up the DRBD volume as “physical volume” for use with LVM, so install the LVM tools :

apt-get install lvm2

We now need to tell the LVM subsystem to ignore the underlying device (/dev/md3) – otherwise on the primary node, LVM would see two identical physical volumes – one on the underlying disk, and the DRBD volume itself. Edit /etc/lvm/lvm.conf, and inside the “devices” section, add a couple of lines that read :

# Exclude DRBD underlying device
filter = [ "r|/dev/md3|" ]

Now, create a LVM “physical volume” on the DRBD device:

pvcreate /dev/drbd0

Then the volume group, called “storage” :

vgcreate storage /dev/drbd0

And then finally, a 10Gb test volume called “test”, which we’ll then format and mount :

lvcreate -L10G -ntest storage
mke2fs -j /dev/storage/test
mount /dev/storage/test /mnt

And again, test the volume by creating a few files under /mnt. Now, when you want to failover, you will need to add a couple of steps. First, on the primary node, unmount your logical volume, and change the volume group state to unavailable :

umount /mnt
vgchange -a n storage
  0 logical volume(s) in volume group "storage" now active

Now, demote the primary node to secondary :

drbdadm secondary r0

Now, you can promote the other node to primary, and you should be able to activate the volume group and logical volume :

drbdadm primary r0
vgchange -a y storage

And you should then be able to mount /dev/storage/test on the new primary node. Obviously, this becomes a major hassle each time you want to failover between nodes, and doesn’t provide any automatic monitoring or failover in the event of hardware failure on one node. We’ll cover that next in Part 3, where we’ll add the Heartbeat package from the Linux-HA project to provide automatic node failover and redundancy.

Category: DRBD, STORAGE
November 17

Building a redundant iSCSI and NFS cluster with Debian

In this part of the series, we’ll configure an iSCSI client (“initiator”), connect it to the storage servers and set up multipathing. Note : Since Debian Lenny has been released since this series of articles started, that’s the version we’ll use for the client.

If you refer back to part one to refresh your memory of the network layout, you can see that the storage client (“badger” in that diagram) should have 3 network interfaces :

  • eth0 : 172.16.7.x for the management interface, this is what you’ll use to SSH into it.

And two storage interfaces. As the storage servers (“targets”) are using 192.168.x.1 and 2, I’ve given this client the following addresses :

  • eth1: 192.168.1.10
  • eth2: 192.168.2.10

Starting at .10 on each range keeps things clear – I’ve found it can help to have a policy of servers being in a range of, say, 1 to 10, and clients being above this. Before we continue, make sure that these interfaces are configured, and you can ping the storage server over both interfaces, e.g. try pinging 192.168.1.1 and 192.168.2.1.

Assuming the underlying networking is configured and working, the first thing we need to do is install open-iscsi (which is the “initiator” – the iSCSI client). This is done by a simple :

# aptitude install open-iscsi

You should see the package get installed, and the service started :

Setting up open-iscsi (2.0.870~rc3-0.4) ...
Starting iSCSI initiator service: iscsid.
Setting up iSCSI targets:
iscsiadm: No records found!

At this point, we have all we need to start setting up some connections.

There are two ways we can “discover” targets on a server (well, three actually, if you include iSNS, but that’s beyond the scope of this article).

  • We can use “send targets” – this logs into a iSCSI target server, and asks it to send the initiator a list of all the available targets.
  • We can use manual discovery, where we tell the initiator explicitly what targets to connect to.

For this exercise, I’ll first show how “send targets” works, then we’ll delete the records so we can add them back manually later. Sendtargets can be useful if you’re not sure what targets your storage server offers, but you can end up with a lot of stale or unused records if you don’t trim down the ones you’re not using.

So, to get things rolling, we’ll query the targets available on one of the interfaces we’re going to use (192.168.1.1) – we’ll set up multipathing later. Run the following as root :

iscsiadm -m discovery -t st -p 192.168.1.1

And you should see the following output returned :

192.168.1.1:3260,1 iqn.2009-02.com.example:test

This shows that your initiator has successfully queried the storage server, and has returned a list of targets – which, if you haven’t changed anything since the last article, should just be the one “iqn.2009-02.com.example:test” target. You can always see which nodes are available to your initiator at any time by simply running :

iscsiadm -m node

A few things have happened behind the scenes that it’s worth checking out at this point. After discovering an available target, the initiator will have created a node record for it under /etc/iscsi/nodes. If you take a look in that directory, you’ll see the following file :

/etc/iscsi/nodes/iqn.2009-02.com.example:test/192.168.1.1,3260,1/default

Which is a file that contains specific configuration details for that iSCSI node. Some of these settings are influenced by the contents of /etc/iscsi/iscsid.conf, which governs the overall behaviour of the iSCSI initiator (e.g. settings in iscsid.conf apply to all nodes). We’ll investigate a few of these settings later.

For now though, all your initiator has done is discover a set of available targets, we can’t actually make use of them without “logging in”. So, now run the following as root :

iscsiadm -m node -p 192.168.1.1 -T iqn.2009-02.com.example:test -l

The arguments to this command are largely self-explanatory – we’re performing an operation on a node (“-m node”), are using the portal we queried earlier (“-p 192.168.1.1”), are running the operation on a specific target (“-T iqn.2009-02.com.example:test”) and are logging in to it (“-l”).

You can use the longer form of these arguments if you want – for instance, you could use “–login” instead of “-l” if you feel it makes things clearer (see the man page for iscsiadm for more details). Anyway, you should see the following output after running that command :

Logging in to [iface: default, target: iqn.2009-02.com.example:test, portal: 192.168.1.1,3260]
Login to [iface: default, target: iqn.2009-02.com.example:test, portal: 192.168.1.1,3260]: successful

If you now check the output from “dmesg”, you’ll see output similar to the following in your logs :

[3688756.079470] scsi0 : iSCSI Initiator over TCP/IP
[3688756.463218] scsi 0:0:0:0: Direct-Access     IET      VIRTUAL-DISK     0    PQ: 0 ANSI: 4
[3688756.580379]  sda: unknown partition table
[3688756.581606] sd 0:0:0:0: [sda] Attached SCSI disk

The last line is important – it tells us the device node that the iSCSI node has been created under. You can also query this information by running :

iscsiadm -m session -P3

Which will display a lot of information about your iSCSI session, including the device it has created for you.

If you go back to your storage server now, you can see your client has connected and logged in to the target :

# cat /proc/net/iet/session
tid:1 name:iqn.2009-02.com.example:test
        sid:562949974196736 initiator:iqn.1993-08.org.debian:01:16ace3ba949f
                cid:0 ip:192.168.1.10 state:active hd:none dd:none

You now have a device on your iSCSI client that you can partition and format, just like it was a locally attached disk. Give it a try: fire up fdisk on it, create some partitions, format and mount them. You should find it behaves just the same as a local disk, although the speed will be limited by the capacity of your link to the storage server.
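
For example (a rough sketch assuming the new disk showed up as /dev/sda, as in the dmesg output above; double-check the device name before doing anything destructive):

fdisk /dev/sda          # create a single partition, /dev/sda1
mke2fs -j /dev/sda1
mount /dev/sda1 /mnt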

Once you've finished, make sure any filesystem you have created on the volume is unmounted, and we'll then log out of the node and delete its record :

# iscsiadm -m node -p 192.168.1.1 -T iqn.2009-02.com.example:test --logout
Logging out of session [sid: 1, target: iqn.2009-02.com.example:test, portal: 192.168.1.1,3260]
Logout of [sid: 1, target: iqn.2009-02.com.example:test, portal: 192.168.1.1,3260]: successful
# iscsiadm -m node -p 192.168.1.1 -T iqn.2009-02.com.example:test -o delete

You should now find that the record for it has been removed from /etc/iscsi/nodes.

Multipathing

We’ll now manually log into the target on both paths to our storage server, and combine the two devices into one multipathed, fault-tolerant device that can handle the failure of one path.

Before we start, you'll want to change a few of the default settings in /etc/iscsi/iscsid.conf – if you want any nodes you've added to be logged back into automatically when the machine reboots, you'll want to change

node.startup = manual

to

node.startup = automatic

The default timeouts are also far too high when we’re using multipathing – you’ll want to set the following values :

node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 10
node.session.timeo.replacement_timeout = 15

Make sure you restart open-iscsi so these changes get picked up. We can then manually log into both paths to the storage server :

iscsiadm -m node -p 192.168.1.1 -T iqn.2009-02.com.example:test -o new
iscsiadm -m node -p 192.168.1.1 -T iqn.2009-02.com.example:test -l
iscsiadm -m node -p 192.168.2.1 -T iqn.2009-02.com.example:test -o new
iscsiadm -m node -p 192.168.2.1 -T iqn.2009-02.com.example:test -l

Note the use of “-o new” to manually specify and add the node, instead of using sendtargets discovery. After this, you should find that you have two devices created – in my case, these were /dev/sda and /dev/sdb. We now need to combine these using multipathing.

First, install “multipath-tools” :

aptitude install multipath-tools

And then create a default configuration file under /etc/multipath.conf with the following contents :

defaults {
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    multibus
        getuid_callout          "/lib/udev/scsi_id -g -u -s /block/%n"
        prio_callout            /bin/true
        path_checker            readsector0
        rr_min_io               100
        rr_weight               priorities
        failback                immediate
        no_path_retry           fail
        user_friendly_names     no
}
blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z][[0-9]*]"
}

The first section sets some defaults for the multipath daemon, including how it should identify devices. The blacklist section lists devices that should not be multipathed so the daemon can ignore them – you can see it's using regular expressions to exclude a number of entries under /dev, including anything starting with “hd”. This will exclude internal IDE devices, for instance. You may need to tune this to your needs, but it should work OK for this example.

Restart the daemon with

/etc/init.d/multipath-tools restart

And check what it can see with the command “multipath -ll”:

# multipath -ll
149455400000000000000000001000000c332000011000000dm-0 IET     ,VIRTUAL-DISK
[size=1.0G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][active]
 \_ 1:0:0:0 sda  8:0    [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 2:0:0:0 sdb  8:16   [active][ready]

That long number on the first line of output is the WWID of the multipathed device, which is similar to a MAC address in networking. It’s a unique identifier for this device, and you can see the components below it. You’ll also have a new device created under /dev/mapper :

/dev/mapper/149455400000000000000000001000000c332000011000000

Which is the multipathed device. You can access this the same as you would the individual devices, but I always find that long WWID a little too cumbersome. Fortunately, you can assign short names to multipathed devices. Just edit /etc/multipath.conf, and add the following section (replacing the WWID with your value) :

multipaths {
        multipath {
                wwid 149455400000000000000000001000000c332000011000000
                alias mpio
        }
}

And restart multipath-tools. When you next run “multipath -ll”, you should see the following :

mpio (149455400000000000000000001000000c332000011000000) dm-0 IET     ,VIRTUAL-DISK

And you can now access your volume through /dev/mapper/mpio.

Failing a path

To see what happens when a path fails, try creating a filesystem on your multipathed device (you may wish to partition it first, or you can use the whole device) and then mounting it. E.G.

mke2fs -j /dev/mapper/mpio
mount /dev/mapper/mpio /mnt

While the volume is mounted, try unplugging one of the storage switches – in this case, I tried pulling the power supply from the switch on the 192.168.2.x network. I then ran “multipath -ll”, which paused for a short time (the timeout values set above), and then I saw the following :

sdb: checker msg is "directio checker reports path is down"
mpio (149455400000000000000000001000000c332000011000000) dm-0 IET     ,VIRTUAL-DISK
[size=1.0G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][active]
 \_ 3:0:0:0 sda  8:0    [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 4:0:0:0 sdb  8:16   [active][faulty]

So, one path to our storage is unavailable – you can see it marked above as faulty. However, as the 192.168.1.x network path is still available, IO can continue to the remaining “sda” component of the device. The volume was still mounted, and I could carry on copying data to and from it. I then plugged the switch back in, and after a short pause, multipathd shows both paths as active again :

# multipath -ll
mpio (149455400000000000000000001000000c332000011000000) dm-0 IET     ,VIRTUAL-DISK
[size=1.0G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][active]
 \_ 3:0:0:0 sda  8:0    [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 4:0:0:0 sdb  8:16   [active][ready]

You now have a resilient, fault-tolerant iSCSI SAN!

That’s it for this part – in the next part, I’ll add an NFS server to the mix, tie off a few loose ends, and discuss some performance tuning issues, as well as post some scripts I’ve written to automate some of this.

Category: MPIO, STORAGE
November 17

Linux SAN Multipathing

There are a lot of SAN multipathing solutions on Linux at the moment. Two of them are discussed in this post. The first one is device mapper multipathing, a failover and load balancing solution with a lot of configuration options. The second one (mdadm multipathing) is just a failover solution that requires manual re-enabling of a failed path. The advantage of mdadm multipathing is that it is very easy to configure.

Before using a multipathing solution in a production environment on Linux, it is also important to determine whether the chosen solution is supported with the hardware in use. For example, HP doesn't support the device mapper multipathing solution on their servers yet.

Device Mapper Multipathing

Procedure for configuring the system with DM-Multipath:

  1. Install device-mapper-multipath rpm
  2. Edit the multipath.conf configuration file:
    • comment out the default blacklist
    • change any of the existing defaults as needed
  3. Start the multipath daemons
  4. Create the multipath device with the multipath command

Install Device Mapper Multipath

# rpm -ivh device-mapper-multipath-0.4.7-8.el5.i386.rpm
warning: device-mapper-multipath-0.4.7-8.el5.i386.rpm: Header V3 DSA signature:
Preparing...                ########################################### [100%]
1:device-mapper-multipath########################################### [100%]

Initial Configuration

Set user_friendly_names. The devices will be created as /dev/mapper/mpath[n]. Comment out the default blacklist (which blacklists every device).

# vim /etc/multipath.conf

#blacklist {
#        devnode "*"
#}

defaults {
user_friendly_names yes
path_grouping_policy multibus

}

Load the needed module and enable the startup service.

# modprobe dm-multipath
# /etc/init.d/multipathd start
# chkconfig multipathd on

Print out the multipathed device.

# multipath -v2
or
# multipath -v3

Configuration

Configure device type in config file.

# cat /sys/block/sda/device/vendor
HP

# cat /sys/block/sda/device/model
HSV200

# vim /etc/multipath.conf
devices {

device {
vendor                  "HP"
product                 "HSV200"
path_grouping_policy    multibus
no_path_retry           "5"
}
}

Configure multipath device in config file.

# cat /var/lib/multipath/bindings

# Format:
# alias wwid
#
mpath0 3600508b400070aac0000900000080000

# vim /etc/multipath.conf

multipaths {

multipath {
wwid                    3600508b400070aac0000900000080000
alias                   mpath0
path_grouping_policy    multibus
path_checker            readsector0
path_selector           "round-robin 0"
failback                "5"
rr_weight               priorities
no_path_retry           "5"
}
}

Put devices that should not be multipathed on the blacklist (e.g. local RAID devices, volume groups).

# vim /etc/multipath.conf

devnode_blacklist {

devnode "^cciss!c[0-9]d[0-9]*"
devnode "^vg*"
}

Show Configured Multipaths.

# dmsetup ls --target=multipath
mpath0  (253, 1)

# multipath -ll

mpath0 (3600508b400070aac0000900000080000) dm-1 HP,HSV200
[size=10G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=4][active]
\_ 0:0:0:1 sda 8:0   [active][ready]
\_ 0:0:1:1 sdb 8:16  [active][ready]
\_ 1:0:0:1 sdc 8:32  [active][ready]
\_ 1:0:1:1 sdd 8:48  [active][ready]

Format and mount Device

fdisk cannot be used on /dev/mapper/[dev_name] devices directly. Instead, use fdisk on the underlying disk, then run kpartx so that device-mapper multipath creates a /dev/mapper/mpath[n] device for the partition:

# fdisk /dev/sda

# kpartx -a /dev/mapper/mpath0

# ls /dev/mapper/*
mpath0  mpath0p1

# mkfs.ext3 /dev/mapper/mpath0p1

# mount /dev/mapper/mpath0p1 /mnt/san

After that /dev/mapper/mpath0p1 is the first partition on the multipathed device.

Multipathing with mdadm on Linux

The md multipathing solution is only a failover solution, which means that only one path is used at a time and no load balancing is done.

Start the MD Multipathing Service

# chkconfig mdmpd on

# /etc/init.d/mdmpd start

On the first Node (if it is a shared device)
Make Label on Disk

# fdisk /dev/sdt
Disk /dev/sdt: 42.9 GB, 42949672960 bytes
64 heads, 32 sectors/track, 40960 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes

Device Boot      Start         End      Blocks   Id  System
/dev/sdt1               1       40960    41943024   fd  Linux raid autodetect

# partprobe

Bind multiple paths together

# mdadm --create /dev/md4 --level=multipath --raid-devices=4 /dev/sdq1 /dev/sdr1 /dev/sds1 /dev/sdt1

Get UUID

# mdadm --detail /dev/md4
UUID : b13031b5:64c5868f:1e68b273:cb36724e

Set md configuration in config file

# vim /etc/mdadm.conf

# Multiple Paths to RAC SAN
DEVICE /dev/sd[qrst]1
ARRAY /dev/md4 uuid=b13031b5:64c5868f:1e68b273:cb36724e

# cat /proc/mdstat

On the second Node (Copy the /etc/mdadm.conf from the first node)

# mdadm -As

# cat /proc/mdstat

Restore a failed path

# mdadm /dev/md4 -f /dev/sdt1 -r /dev/sdt1 -a /dev/sdt1
Category: MULTIPATH, STORAGE
November 17

Enabling and Disabling Multipathing in the Linux Operating System

This section describes how to enable and disable multipathing in supported versions of the Linux operating system.

About Multipathing

After cabling your server for multipath, you will see two copies of each disk from the OS since you are using two separate array paths (SAS A and B). If you want to have multiple hosts accessing disks in the array, you must first set up zoning per host as described in Chapter 3, Adding and Zoning Array Storage Using CAM.

For example, if you have created a zone in each of the array’s SAS domains that includes three disks, entering the lsscsi command before multipathing is enabled will show two of each multipathed disk.

lsscsi

[1:0:0:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sda
[1:0:1:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sdb
[1:0:2:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sdc
[1:0:3:0] enclosu SUN Storage J4500 3R21 -
[2:0:0:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sdd
[2:0:1:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sde
[2:0:2:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sdf
[2:0:3:0] enclosu SUN Storage J4500 3R21 -

Once the multipath daemon is started on the host, you can see multipath details using the multipath command.

multipath -ll

35000c5000357625b dm-2 SEAGATE,ST340008SSUN0.4
[size=373G][features=0][hwhandler=0]
    \_ round-robin 0 [prio=2][active]
    \_ 1:0:1:0  sdb 8:0    [active][ready]
    \_ 2:0:1:0  sde 8:192  [active][ready]

To correlate the dm-2 dual path disks with what’s displayed in CAM (in the Host details page, see Two Hosts With Zoned Disks in CAM), use the lsscsi command with the -g option.

lsscsi -g

[1:0:0:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sda /dev/sg0
[1:0:1:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sdb /dev/sg1
[1:0:2:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sdc /dev/sg2
[1:0:3:0] enclosu SUN Storage J4500 3R21 - /dev/sg3
[2:0:0:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sdd /dev/sg4
[2:0:1:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sde /dev/sg5
[2:0:2:0] disk SEAGATE ST330055SSUN300G 0B92 /dev/sdf /dev/sg6
[2:0:3:0] enclosu SUN Storage J4500 3R21 - /dev/sg7

For each disk, CAM will report the device names from the last column in the /dev/sgN format.

To Enable Multipathing in Linux

  1. Attach a J4500 array to a server with a supported version of Linux installed.
  2. On the server, edit or create the /etc/multipath.conf file (a minimal example is sketched after this list).
  3. Reboot the server.
  4. After the reboot, make sure that the OS discovers all the disks in the J4500 array either by using the Linux commands, fdisk or lsscsi.
  5. Partition any disks you want to the desired sizes.
  6. Use the Linux command modprobe to add the loadable kernel modules dm-multipath and dm-round-robin.

    modprobe dm-multipath
    modprobe dm-round-robin

  7. Start the multipathd daemon. For Linux SUSE 9, use the following command:

    multipathd -v0

    For other supported Linux versions, use the following command:

    service multipathd start

  8. Start the multipathing device mapper target autoconfig:

    multipath -v2
  9. List the multipath devices that have been created:

    multipath -ll

    The output should list the same number of devices as there are disks in the J4500 array. The following is an example of output:

    35000c5000357625b dm-2 SEAGATE,ST340008SSUN0.4
    [size=373G][features=0][hwhandler=0]
        \_ round-robin 0 [prio=2][active]
        \_ 0:0:0:0  sda 8:0    [active][ready]
        \_ 1:0:0:0  sdm 8:192  [active][ready]
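
For step 2, a minimal /etc/multipath.conf can be as simple as the sketch below: the important points are that no catch-all blacklist is left active and that user-friendly names are enabled. This is only a starting point; adjust it from the template shipped with your distribution.

    # Minimal /etc/multipath.conf sketch
    # Make sure no catch-all blacklist such as this one is left uncommented:
    # blacklist {
    #         devnode "*"
    # }
    defaults {
            user_friendly_names yes
    }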

To Disable Multipathing in Linux

  1. If a RAID volume, LVM volume, or volume mount has been placed over the device node of the multipathed disk, quiesce the volume.
  2. Use the multipath -f command to disable multipathing to a specific device:

    multipath -f mpath1
  3. Use the multipath -F command to disable multipathing on all multipathed devices (a combined example follows this list):

    multipath -F

    Note – If the message "map in use" appears for a device when you attempt to disable multipathing, the device is still in use. You must unmount or otherwise quiesce the device before you can disable multipathing. If you cannot quiesce the device, edit the /etc/multipath.conf file to exclude the device and then reboot the server.

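Putting the disable steps together, a short sketch (the mount point /mnt/san_data and the map name mpath1 are only examples):

    # 1. Quiesce whatever sits on top of the multipathed device, e.g. unmount it
    umount /mnt/san_data
    # 2. Flush one specific multipath map ...
    multipath -f mpath1
    # 3. ... or flush all unused maps at once
    multipath -F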
Category: MULTIPATH, STORAGE
November 17

Multipath: Configuring Multiple Paths for External Disk Access

1 Introduction

We will cover two topics here:

  • Device mapper
  • Multipathing

We need to understand how the device mapper works before tackling multipathing, which is why both are explained in this document.

In the Linux kernel, the device mapper serves as a generic framework for mapping one block device onto another. It is the foundation of LVM2 and EVMS, of software RAID, and of disk encryption, and it offers additional features such as file-system snapshots.
The device mapper works by processing the data handed to it by a virtual block device (which it provides itself) and passing the resulting data on to another block device.

Multipathing provides several paths to access the same data. The goal is to increase data-access throughput when the storage equipment allows it (active/active) and to provide redundancy in case a piece of equipment, such as a controller, fails. Here is what a multipathed architecture looks like:

Multipath.png

This also works perfectly well with a single SAN.

2 Device Mapper

Device-mapper devices are very rarely handled manually. They are generally used by higher layers such as LVM. Nevertheless, let's see how to use them directly.

To add a partition to the device mapper:

Command dmsetup
dmsetup create <device> <map_table>

 

  • device: the name of the device to create
  • map_table: a file containing the mapping rules, for example (a usage sketch follows):
Configuration File map_table
0 409600 linear /dev/sda1 0
409600 2048000 linear /dev/sda2 0
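
With such a file in place, the device can be created by feeding the table to dmsetup on standard input; a minimal sketch, where mynewdm is just an example name:

Command dmsetup
dmsetup create mynewdm < map_table
dmsetup table mynewdm

The second command prints the table that was loaded, which is a quick way to verify the mapping.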

 

If I want to create a device-mapper device, I can also do it with a single command line, without any file:

Command dmsetup
echo "0 `blockdev --getsize /dev/sda1` linear /dev/sda1 0" | dmsetup create mynewdm

 

There are several types of mapping targets (a small sketch follows the list):

  • linear: contiguous allocation
  • striped: allocation striped across all the underlying devices
  • error: generates I/O errors (useful for development and testing)
  • snapshot: copy-on-write device
  • snapshot-origin: mapping to the original volume
  • zero: sparse block device (reads return zeros and writes are discarded, similar to /dev/null)
  • multipath: multiple routes for connecting to a single device
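
As a quick illustration, the zero target makes it easy to create a throw-away virtual device of any size; a minimal sketch (2097152 sectors of 512 bytes, i.e. 1 GiB, and zerodev is just an example name):

Command dmsetup
echo "0 2097152 zero" | dmsetup create zerodev
dmsetup table zerodev
dmsetup remove zerodev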

To list all the available device-mapper devices:

Command dmsetup
> dmsetup table
myvg-rootvol: 0 10092544 linear 8:3 2048

 

To remove a device-mapper device:

Command dmsetup
dmsetup remove <disk>

 

To list all device-mapper devices as a tree:

Command dmsetup
dmsetup ls --tree

Dm-tree.png

 

3 Multipathing

3.1 Installation

Multipath is not installed by default, so we need to install a package:

Command yum
yum install device-mapper-multipath

 

Then we load the kernel modules and make the service persistent across reboots:

Command
modprobe dm_multipath
modprobe dm-round-robin
chkconfig multipathd on
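
To check that the modules are actually loaded and that the service will start at boot, a quick sketch:

Command
lsmod | grep dm_multipath
chkconfig --list multipathd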

 

3.2 Configuration

If you do not have a configuration file, take the one shipped with the documentation:

Command cp
cp /usr/share/doc/device-mapper-multipath-0.4.9/multipath.conf /etc/

 

Multipath uses a notion of path groups, numbered from 0 to 1024 (from highest to lowest priority). Only one group is active at a time, and a group can contain several paths.

Let's now configure our multipathing service (only the essential lines are shown):

Configuration File /etc/multipath.conf
...
# Blacklist all devices by default. Remove this to enable multipathing
# on the default devices.
#blacklist {
#        devnode "*"
#}
## Use user friendly names, instead of using WWIDs as names.
defaults {
        user_friendly_names yes
}
##
## Here is an example of how to configure some standard options.
##
#
defaults {
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    multibus
        getuid_callout          "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
        prio                    alua
        path_checker            readsector0
        rr_min_io               100
        max_fds                 8192
        rr_weight               priorities
        failback                immediate
        no_path_retry           fail
        user_friendly_names     yes
}
blacklist {
       wwid 26353900f02796769
       devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
       devnode "^hd[a-z]"
}
...

 

I strongly advise you to check the man page when choosing the options above.

We can now start the service:

Command service
service multipathd start
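
Once the daemon is running, it is worth checking that it built the multipath maps as expected; a short sketch:

Command
service multipathd status
multipath -ll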

 

3.3 Usage

Here are the tools to use for disk detection, and the order in which to use them; this section is very important:

  1. Devices: partprobe /dev/<device> (e.g. sda)
  2. Device mappers: kpartx -a /dev/<device-mapper> (e.g. dm-1)
  3. Multipath: partprobe /dev/mapper/<multipath> (e.g. mpath0)

If we want to see the active paths:

Command multipath
multipath -l

 

To create partitions on a multipathed device-mapper device, you absolutely must do it on the underlying disk (e.g. /dev/sda) and not on the multipathed device-mapper device itself! So the procedure is to create the partition, with fdisk for example, and then have the new partition detected:

Command
partprobe /dev/sda
partprobe /dev/sdb
kpartx -a /dev/mapper/mpath0

 

4 FAQ

4.1 I still cannot see my new LUNs, how do I rescan?

Newly created LUNs or partitions may require a new scan before they are detected. We will need this package:

Command yum
yum install sg3_utils

 

Then run the scan:

Command rescan-scsi-bus.sh
rescan-scsi-bus.sh

 

Otherwise, we can do it directly through /sys (a loop covering every SCSI host is sketched below).

  • For a SCSI-type platform:
Command echo
echo "- - -" > /sys/class/scsi_host/<HBA>/scan

 

  • For a Fibre Channel array:
Command echo
echo "1" > /sys/class/fc_host/<HBA>/issue_lip
echo "- - -" > /sys/class/scsi_host/<HBA>/scan

 

4.2 I cannot see my new partition properly, what should I do?

Here is how to proceed when you run into a problem while creating a partition on a multipathed device. Let's take this example:

  • I do not see mpath0p2 on one machine, whereas I can see it on the other machines:
Command ls
> ls /dev/mpath/
mpath0  mpath0p1

 

  • I check that my partition is indeed visible on both paths (sda2 and sdb2):
Command fdisk
> fdisk -l

Disk /dev/hda: 8589 MB, 8589934592 bytes
255 heads, 63 sectors/track, 1044 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hda1   *           1          13      104391   83  Linux
/dev/hda2              14         274     2096482+  8e  Linux LVM
/dev/hda3             275         339      522112+  82  Linux swap / Solaris

Disk /dev/sda: 5368 MB, 5368709120 bytes
166 heads, 62 sectors/track, 1018 cylinders
Units = cylinders of 10292 * 512 = 5269504 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1          10       51429   83  Linux
/dev/sda2              11         770     3910960   8e  Linux LVM

Disk /dev/sdb: 5368 MB, 5368709120 bytes
166 heads, 62 sectors/track, 1018 cylinders
Units = cylinders of 10292 * 512 = 5269504 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1          10       51429   83  Linux
/dev/sdb2              11         770     3910960   8e  Linux LVM

Disk /dev/dm-1: 5368 MB, 5368709120 bytes
166 heads, 62 sectors/track, 1018 cylinders
Units = cylinders of 10292 * 512 = 5269504 bytes

     Device Boot      Start         End      Blocks   Id  System
/dev/dm-1p1               1          10       51429   83  Linux

Disk /dev/dm-2: 52 MB, 52663296 bytes
255 heads, 63 sectors/track, 6 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/dm-2 doesn't contain a valid partition table

 

Make sure that there are no blacklist options in the configuration that would prevent the devices from being seen correctly. To do this, we comment out all the blacklist sections:

Configuration File /etc/multipath.conf
#blacklist {
#       wwid 26353900f02796769
#       devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
#       devnode "^hd[a-z]"
#}

 

  • We can see that /dev/dm-1p1 is present, but not the second partition (or at least that there is an error there). I check again that my dm-2 is nowhere to be seen:
Command ls
> ls -l /dev/dm-1
brw-rw---- 1 root root 253, 1 Mar  1 13:42 /dev/dm-1

 

Command ls
> ls -l /dev/mapper/
total 0
crw------- 1 root root  10, 62 Mar  1 13:41 control
brw-rw---- 1 root disk 253,  1 Mar  1 13:42 mpath0
brw-rw---- 1 root root 253,  2 Mar  1 14:06 mpath0p1
brw------- 1 root root 253,  0 Mar  1 13:41 myvg0-mylv0

 

  • We can see that “253, 1” corresponds to /dev/dm-1 (see the dmsetup sketch at the end of this section). We run kpartx and partprobe on both to refresh the paths:
Command
kpartx -a /dev/dm-1
kpartx -a /dev/dm-2
partprobe /dev/mapper/mpath0

 

Even if you get errors such as:

device-mapper: create ioctl failed: Device or resource busy

That is not a problem; it still refreshes the list of device-mapper devices.

  • And now it works:
Command ls
> ls -l /dev/mapper/
total 0
crw------- 1 root root  10, 62 Mar  1 13:41 control
brw-rw---- 1 root disk 253,  1 Mar  1 13:42 mpath0
brw-rw---- 1 root root 253,  2 Mar  1 14:06 mpath0p1
brw-rw---- 1 root disk 253,  3 Mar  1 14:19 mpath0p2
brw------- 1 root root 253,  0 Mar  1 13:41 myvg0-mylv0
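
As a follow-up to the major/minor correlation above, dmsetup can print each device-mapper name together with its major:minor pair and its detailed state, which avoids decoding ls -l output by hand; a sketch (the exact output layout varies between versions):

Command dmsetup
dmsetup ls
dmsetup info mpath0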

 

Category: MULTIPATH, STORAGE