Disk-Space Management - Why is it so hard?
- In the beginning, there were small disks.
- Then they grew...
- ...and multiplied...
- ...and divided...
- ...and had to be backed up...
- ...and the data loss came, and was great...
- ...and the tape was unreadable and silent.
- Someone said "virtual" to a Unix kernel hacker
- and the hacker said "let all disks be virtual, and thus managed by
software instead of hardware"....
- ..."this way we can grow them and divide them and multiply them without
direct relation to hardware".
- "and we can also name names, and find them more easily"
- Then someone else said "If they are virtual, we could also make virtual
copies, thus making data restores easier".
- Thus LVM was conceived.
- In the world at large, LVM (Logical Volume Management) has been in use on
various operating systems for many years.
- In Linux, there was an initial implementation of LVM (now known as LVM1),
dating back to 1997.
- When work on kernel 2.5 began, there were two competing implementations -
EVMS (made by IBM) and LVM2.
- Eventually, LVM2 made it into the 2.6 kernel, and EVMS remained
an out-of-tree project.
- EVMS, however, is much more than LVM. IBM's developers decided to get rid
of all the kernel parts, and turn EVMS into a front-end to LVM2, residing
entirely in user-space. Thus, EVMS made its way into distributions.
What Can LVM Do?
- You can create logical volumes, possibly spanning more than one physical
disk, and give them any name.
- You can resize logical volumes.
- You can delete logical volumes.
- You can export a volume group, in order to move its disks to a different
machine without changing the device names.
- You can create snapshots (a frozen version of a logical volume) in order
to undo many changes in a single operation.
Basic Terms
- Physical Volume (PV) - A disk (or any block device).
- UUID - A unique ID assigned to a physical volume by the LVM sub-system.
The UUID is stored on the physical volume itself, as part of its on-disk
meta-data.
- Volume Group (VG) - A set of physical volumes.
- Physical Extent (PE) - The minimal chunk on a physical volume that is
managed by LVM.
- Logical Volume (LV) - A "virtual" block device, created by LVM on some
volume group. The logical volume is mapped to a list of physical extents,
which may reside on any of the physical volumes in a single volume group.
- Snapshot - A (logical) freeze of a logical volume at some point in time.
It enables us to revert changes made to the logical volume since the
snapshot was taken.
The LVM Workflow
- Before we start working with LVM2, we need to choose some physical volumes
on which we will work.
- We take these physical volumes, and place them in volume groups.
- This operation will cause LVM2 to assign a unique UUID to each
physical volume.
- After this, we can start creating logical volumes.
- Then we use those logical volumes like normal disks - we can create a
file system and mount it, we can use them as raw devices...
- We may add more physical volumes to existing volume groups, in order
to add more capacity.
- We may increase the size of the logical volumes, if they become too full.
If there is a file system on such a logical volume, we will need to resize
it as well.
Exercising With Loop Block Devices
- During our demonstration, we will use loop (block) devices, which will
serve as our physical volumes.
- To create such a loop device:
- Create a large file:
dd if=/dev/zero of=vdisk1 bs=1M count=1000
- Create a loop device on top of this file:
losetup -f vdisk1
- Check that this worked:
root@simey:~/vdisks# losetup /dev/loop0
/dev/loop0: :8388744 (vdisk1)
root@simey:~/vdisks# dd if=/dev/loop0 of=/dev/null bs=512 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.016822 seconds, 30.4 kB/s
Creating A Volume Group
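- A possible sequence (VG_YOYO is the volume-group name used in the examples
below; /dev/loop1 is assumed to be a second loop device, created the same
way as vdisk1 above):
- First, mark the devices as physical volumes:
lvm pvcreate /dev/loop0 /dev/loop1
- Then, group them into a volume group:
lvm vgcreate VG_YOYO /dev/loop0 /dev/loop1
- Check that this worked:
lvm vgdisplay VG_YOYO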
Creating A Logical Volume
- Next, we want to create a few logical volumes.
- Here, we create logical volumes "lv_kosmo" and "lv_kramer":
root@simey:~/vdisks# lvm lvcreate -L 200M -n lv_kosmo VG_YOYO
Logical volume "lv_kosmo" created
root@simey:~/vdisks# lvm lvcreate -L 200M -n lv_kramer VG_YOYO
Logical volume "lv_kramer" created
- Let's list the existing logical volumes:
root@simey:~/vdisks# lvm lvscan
ACTIVE '/dev/VG_YOYO/lv_kosmo' [200.00 MB] inherit
ACTIVE '/dev/VG_YOYO/lv_kramer' [200.00 MB] inherit
- "ACTIVE" means we can start using it.
A File-System On A Logical Volume
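- A possible sequence, using the lv_kosmo volume from above (the choice of
ext3 and the /mnt/kosmo mount point are arbitrary):
mkfs.ext3 /dev/VG_YOYO/lv_kosmo
mkdir -p /mnt/kosmo
mount /dev/VG_YOYO/lv_kosmo /mnt/kosmo
- The logical volume is just a block device, so any file system (and any
mkfs) will do.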
Extending The Capacity Of A Logical Volume
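- A possible sequence, assuming the ext3 file system from the previous step
(recent 2.6 kernels can grow a mounted ext3 via resize2fs; when in doubt,
unmount first):
lvm lvextend -L +100M /dev/VG_YOYO/lv_kosmo
resize2fs /dev/VG_YOYO/lv_kosmo
- Note the order: grow the logical volume first, then the file system on it.
When shrinking, the order is reversed.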
What Is A Snapshot?
- A snapshot is an operation in which we "freeze" the data on a logical
volume, while still allowing new data to be written.
- This, of course, sounds like a contradiction.
- It is solved by splitting the data into old (written before taking the
snapshot) and new (written after taking the snapshot).
- The new data keeps being written to the original logical volume.
- The old data is copied aside, to separate disk space, just before it is
overwritten.
- When an application reads from the snapshot device, the underlying kernel
code finds where the relevant version of the data lies, and returns it to
the application.
- Meanwhile, we may mount the snapshot (the frozen content) on a different
directory, and access it in read-only mode.
- E.g. to back up the data.
Creating A Snapshot Of A Logical Volume
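- A possible command, following the earlier examples (the lv_kosmo_snap name
is ours to choose; -s asks lvcreate for a snapshot, and the 100M is space
for copied-aside chunks, not for a full copy):
lvm lvcreate -s -L 100M -n lv_kosmo_snap /dev/VG_YOYO/lv_kosmo
- lvm lvscan should now list /dev/VG_YOYO/lv_kosmo_snap next to its origin.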
What-What?? What Is Going On With This Snapshot?!?
- We can continue working with the original device, without changing the
data in the snapshot.
- COW (Copy-On-Write) - Every time data is written to the original device,
the Linux kernel copies the old data to the snapshot's device, and only
then updates the original device.
- Thus, the first write to a chunk of the origin is translated into a read
(from the origin), a write (to the snapshot) and another write (to the
origin).
- The kernel keeps an on-disk hash-table of the chunks that were copied
to the snapshot.
- This is used both for reads, and to avoid redundant copies.
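- A small demonstration, assuming the snapshot and the file system from the
previous examples (if the origin was mounted when the snapshot was taken,
mounting the frozen file system may require journal recovery):
mkdir -p /mnt/snap
mount -o ro /dev/VG_YOYO/lv_kosmo_snap /mnt/snap
echo "hello" > /mnt/kosmo/new-file
ls /mnt/snap
- The write lands on the origin only; the snapshot keeps showing the frozen
content.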
Properties Of Snapshots
- The first property of a snapshot is the amount of disk space it was given.
If we run out of this disk space (too much of the original volume changed),
the kernel will invalidate the snapshot.
- The size of a snapshot may be increased while it is mounted. No need for
file-system resizing tricks here.
- A snapshot may be read-only or read-write. If it is read-write, then
the snapshot's disk space is also used for writing on top of the snapshot.
- When creating a snapshot, we may define the chunk size (i.e. how much
data to COW when there is a change in the original volume). A larger
chunk means better throughput, but possibly higher latency and more
wasted disk space (see the example after this list).
- We may stack snapshots together (everything in LVM2 is stackable). The
performance of reads might degrade.
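- A sketch of these knobs in use (-c sets the snapshot chunk size, and
lvextend grows a snapshot's disk space; the names and sizes here are
arbitrary):
lvm lvcreate -s -c 64k -L 100M -n lv_kosmo_snap2 /dev/VG_YOYO/lv_kosmo
lvm lvextend -L +50M /dev/VG_YOYO/lv_kosmo_snap2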
LVM2 Internals
- Internally, LVM2 is split in two.
- In the kernel, there is the device mapper, which implements everything.
- In user-space, there are the LVM2 tools, to manage everything.
- What we once knew as "software RAID" is now only a part of the device
mapper. It just grew out of proportion ;)
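- We can peek at the kernel side with the dmsetup utility, e.g. to list the
mapping targets supported by the running kernel, and the currently active
mapped devices:
dmsetup targets
dmsetup ls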
LVM2 Labels And UUID-s
- In order to identify that a physical device is managed by LVM, an LVM2
label is written on one of its first 4 sectors.
- This label is used, during LVM scans, to find its devices.
- In order to identify physical volumes belonging to volume groups, a UUID
is assigned to each of them, and stored on them.
- The UUID of a physical device is stored as part of the volume-group
meta-data area on the device: at the start (beginning with the 5th sector),
at the end, or on both sides (for redundancy).
- By default, the UUID is generated randomly.
- We can force a specific UUID when marking a physical volume, in case we
intend to restore an LVM setup from backup.
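- To inspect the UUIDs from user-space (pvs is LVM2's reporting command;
-o selects the output columns):
lvm pvs -o pv_name,pv_uuid
- For the restore-from-backup case above, pvcreate accepts a --uuid option.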
The Layout Of A Logical Volume
- As we said, a logical volume is stored on a volume group.
- The logical volume is made of a list of physical extents (not necessarily
contiguous) on the volume group.
- In addition, we might want to make the logical volume striped - that is,
its data chunks will be stored alternately on different physical devices
of the volume group. This may increase throughput.
- Thus, when we send I/O to a logical volume, it may get mapped twice.
- This mapping is done in memory. If the device is very large - the map
might be very large too...
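- A sketch of creating a striped logical volume (-i sets the number of
stripes, -I the stripe size in KB; this requires at least two physical
volumes in the group, and the name lv_striped is arbitrary):
lvm lvcreate -i 2 -I 64 -L 200M -n lv_striped VG_YOYO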
The Structure Of A Snapshot
- A snapshot is generally made of two parts - an exceptions table, and
the copied chunks.
- The exceptions table is a hash table of all chunks, on the origin volume,
that were modified since the creation of the snapshot.
- The copied chunks contain the original data from the origin volume,
before it was changed.
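- We can watch a snapshot filling up (the exact output columns vary between
LVM2 versions):
lvm lvs VG_YOYO
dmsetup status VG_YOYO-lv_kosmo_snap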
A Write To The Origin Of A Snapshot
Sending I/O to a device for which we have a snapshot:
- If the start sector of the I/O belongs to a chunk that is found in the
exceptions table, just perform a normal write to the origin volume.
- Otherwise, if there's a pending copy of this chunk, place the I/O on
the chunk's queue.
- Otherwise (this is the first write for this chunk), look for a free
entry in the exceptions table.
- If there are no free entries, mark this snapshot as invalid (since it is
full), and return an I/O error.
- If there is a free entry, assign it to our chunk, and send a copy
request to the kcopyd kernel-daemon.
- When the copy completes, update the entry in the exceptions table, then
perform the original I/O and any other I/Os that were queued for this
chunk.
Device-Mapper Internals
- The basis for all of the abstractions of the device mapper is the dm-table.
- This table contains a list of devices that comprise a mapping, and a btree
(sorted by virtual sector number) showing where each slice of the device
is mapped to.
- Each component of the device-mapper (dm-stripe, dm-snap, etc.) uses
a dm-table internally, and configures it based on its requirements.
- When a device-mapper component gets an I/O request, it consults its
internal dm-table to see which dm-device to map the request to. This
process might repeat recursively until we get to the underlying block
devices.
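- We can inspect these tables from user-space. The sample line below
illustrates the linear target's "start length target device offset" format;
it is not captured output:
dmsetup table VG_YOYO-lv_kosmo
(a linear line looks like: 0 409600 linear 7:0 384)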
- The 'BIO' is a structure containing an I/O request (read or write) for
a block device.
- During traversal of the device-mapper tree, a single I/O might have to
be split (e.g. for a stripe, or on the boundary of a physical extent),
cloned (e.g. for mirroring), etc.
- The device-mapper has an internal I/O request struct, which manages the
mapping of the BIOs in this manner. It has one original BIO, and one or
more internally-generated BIOs.
- The original BIO is only completed when all the BIOs generated for
it have completed.
- Note: to avoid copying the data fields, a BIO points to memory
pages - several BIO-s may point to the same (or different) pages. Thus,
we only need to copy the page lists, not the actual data of the pages.
Companions Of LVM2
- Some related block-device drivers are not used by LVM2, but may still
be interesting for the user.
- We will mention those most relevant for users in an enterprise,
where "high availability" is key to success.
Mirroring With The DRBD Driver
- The DRBD driver supports synchronous mirroring of data between two
machines.
- You load the driver on two machines, and define remote mirrors
for "local" disks ("local" can also be on a SAN or a NAS).
- When the local machine gets an I/O write request, it will perform it
locally, and also send a copy to the remote machine (via the LAN).
- Only when the I/O request has completed on both the local and remote
devices is the result returned to the calling application (or file-system).
- DRBD is useful under the control of a high-availability cluster software,
such as heartbeat.
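- A minimal sketch of a DRBD resource definition (drbd.conf syntax; the host
names, disks and addresses below are made-up placeholders):
resource r0 {
    protocol C;        # synchronous - ack only after the remote write
    on node1 {
        device    /dev/drbd0;
        disk      /dev/sda7;
        address   10.0.0.1:7788;
        meta-disk internal;
    }
    on node2 {
        device    /dev/drbd0;
        disk      /dev/sda7;
        address   10.0.0.2:7788;
        meta-disk internal;
    }
}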
The Multi-Pathing Driver
- In a high-availability environment, administrators make sure there is
more than one path to access a physical disk.
- For example, in a SAN, the host might be connected to the RAID via
two separate switches.
- On the block device level, each path will look like a separate disk.
- The purpose of the multi-pathing driver is to identify that several
block devices actually map to the same disk, to combine them into a single
multipath device, and to allow I/O to this device.
- When I/O gets to a multipath device, it may be sent via a single path,
or using round-robin on several paths, etc.
- The multi-path driver may be controlled using the 'multipath' utility,
and the /etc/multipath.conf configuration file, to support different
types of multi-path devices and RAIDs.
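- For example, to list the detected multipath topology:
multipath -ll
- And a minimal /etc/multipath.conf fragment (the multibus policy
round-robins I/O over all available paths):
defaults {
    path_grouping_policy multibus
}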
Originally written by
- LVM HowTo -
- Wikipedia - Logical Volume Management -
- LVM2 Webcast by Heinz Mauelshagen -
- The /etc/lvm/lvm.conf configuration file (which usually comes heavily
commented).