Disk-Space Management - Why is it so hard?
- In the beginning, there were small disks.
- Then they grew...
- ...and multiplied...
- ...and divided...
- ...and had to be backed up...
- ...and the data loss came, and was great...
- ...and the tape was unreadable and silent.
- Someone said "virtual" to a Unix kernel hacker
- and the hacker said "let all disks be virtual, and thus managed by
software instead of hardware"....
- ..."this way we can grow them and divide them and multiply them without
direct relation to hardware".
- "and we can also name names, and find them more easily"
- Then someone else said "If they are virtual, we could also make virtual
copies, thus making data restores easier".
- Thus LVM was conceived.
- In the world at large, LVM (Logical Volume Management) has been in use on
various operating systems for many years.
- In Linux, there was an initial implementation of LVM (now known as LVM1),
dating back to 1997.
- When work on kernel 2.5 began, there were two competing implementations -
EVMS (made by IBM) and LVM2.
- Eventually, LVM2 made it into the 2.6 kernel, and EVMS remained
an out-of-tree project.
- EVMS, however, is much more than LVM. IBM's developers decided to get rid
of all the kernel parts, and turn EVMS into a front-end to LVM2, residing
entirely in user-space. Thus, EVMS made its way into distributions.
What Can LVM Do?
- You can create logical volumes, possibly spanning more than one physical
disk, and give them any name.
- You can resize logical volumes.
- You can delete logical volumes.
- You can export a volume group, in order to move its disks to a different
machine without changing the device names.
- You can create snapshots (a frozen version of a logical volume) in order
to undo many changes in a single operation.
Basic Terms
- Physical Volume (PV) - A disk (or any block device).
- UUID - A unique ID assigned to a physical volume by the LVM sub-system.
The UUID is stored on the physical volume itself, as part of its on-disk
meta-data.
- Volume Group (VG) - A set of physical volumes.
- Physical Extent (PE) - The minimal chunk on a physical volume that is
managed by LVM.
- Logical Volume (LV) - A "virtual" block device, created by LVM on some
volume group. The logical volume is mapped to a list of physical extents,
which may reside on any of the physical volumes in a single volume group.
- Snapshot - A (logical) freeze of a logical volume at some point in time.
It enables us to revert changes made to the logical volume since the
snapshot was taken.
The LVM Workflow
- Before we start working with LVM2, we need to choose some physical volumes
on which we will work.
- We take these physical volumes, and place them in volume groups.
- This operation will cause LVM2 to assign a unique UUID to each
physical volume.
- After this, we can start creating logical volumes.
- Then we use those logical volumes like normal disks - we can create a
file system and mount it, we can use them as raw devices...
- We may add more physical volumes to existing volume groups, in order
to add more capacity.
- We may increase the size of the logical volumes, if they become too full.
If there is a file system on such a logical volume, we will need to resize
it as well.
Exercising With Loop Block Devices
- During our demonstration, we will use loop (block) devices, which will
serve as our physical volumes.
- To create such a loop device:
- Create a large file:
dd if=/dev/zero of=vdisk1 bs=1M count=1000
- Create a loop device on top of this file:
losetup -f vdisk1
- Check that this worked:
root@simey:~/vdisks# losetup /dev/loop0
/dev/loop0: :8388744 (vdisk1)
root@simey:~/vdisks# dd if=/dev/loop0 of=/dev/null bs=512 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.016822 seconds, 30.4 kB/s
Creating A Volume Group
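- A possible sequence (VG_YOYO is the volume-group name used in the examples
below; /dev/loop1 is assumed to be a second loop device, created the same
way as vdisk1 above):
- First, mark the devices as physical volumes:
lvm pvcreate /dev/loop0 /dev/loop1
- Then, group them into a volume group:
lvm vgcreate VG_YOYO /dev/loop0 /dev/loop1
- Check that this worked:
lvm vgdisplay VG_YOYO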
Creating A Logical Volume
- Next, we want to create a few logical volumes.
- Here, we create logical volumes "lv_kosmo" and "lv_kramer":
root@simey:~/vdisks# lvm lvcreate -L 200M -n lv_kosmo VG_YOYO
Logical volume "lv_kosmo" created
root@simey:~/vdisks# lvm lvcreate -L 200M -n lv_kramer VG_YOYO
Logical volume "lv_kramer" created
- Let's list the existing logical volumes:
root@simey:~/vdisks# lvm lvscan
ACTIVE '/dev/VG_YOYO/lv_kosmo' [200.00 MB] inherit
ACTIVE '/dev/VG_YOYO/lv_kramer' [200.00 MB] inherit
- "ACTIVE" means we can start using it.
A File-System On A Logical Volume
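- A possible sequence, using the lv_kosmo volume from above (the choice of
ext3 and the /mnt/kosmo mount point are arbitrary):
mkfs.ext3 /dev/VG_YOYO/lv_kosmo
mkdir -p /mnt/kosmo
mount /dev/VG_YOYO/lv_kosmo /mnt/kosmo
- The logical volume is just a block device, so any file system (and any
mkfs) will do.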
Extending The Capacity Of A Logical Volume
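- A possible sequence, assuming the ext3 file system from the previous step
(recent 2.6 kernels can grow a mounted ext3 via resize2fs; when in doubt,
unmount first):
lvm lvextend -L +100M /dev/VG_YOYO/lv_kosmo
resize2fs /dev/VG_YOYO/lv_kosmo
- Note the order: grow the logical volume first, then the file system on it.
When shrinking, the order is reversed.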
What Is A Snapshot?
- A snapshot is an operation in which we "freeze" the data on a logical
volume, while still allowing new data to be written.
- This, of course, sounds like a contradiction.
- It is solved by splitting the data into old (written before taking the
snapshot) and new (written after taking the snapshot).
- The new data keeps being written to the original logical volume.
- The old data is copied aside, to separate disk space, just before it is
overwritten.
- When an application reads from the snapshot device, the underlying kernel
code finds where the relevant version of the data lies, and returns it to
the application.
- Meanwhile, we may mount the snapshot (the frozen content) on a different
directory, and access it in read-only mode.
- E.g. to back up the data.
Creating A Snapshot Of A Logical Volume
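- A possible command, following the earlier examples (the lv_kosmo_snap name
is ours to choose; -s asks lvcreate for a snapshot, and the 100M is space
for copied-aside chunks, not for a full copy):
lvm lvcreate -s -L 100M -n lv_kosmo_snap /dev/VG_YOYO/lv_kosmo
- lvm lvscan should now list /dev/VG_YOYO/lv_kosmo_snap next to its origin.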
What-What?? What Is Going On With This Snapshot?!?
- We can continue working with the original device, without changing the
data in the snapshot.
- COW (Copy-On-Write) - Every time data is written to the original device,
the Linux kernel copies the old data to the snapshot's device, and only
then updates the original device.
- Thus, the first write to a chunk of the origin is translated into a read
(from the origin), a write (to the snapshot) and another write (to the
origin).
- The kernel keeps an on-disk hash-table of the chunks that were copied
to the snapshot.
- This is used both for reads, and to avoid redundant copies.
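- A small demonstration, assuming the snapshot and the file system from the
previous examples (if the origin was mounted when the snapshot was taken,
mounting the frozen file system may require journal recovery):
mkdir -p /mnt/snap
mount -o ro /dev/VG_YOYO/lv_kosmo_snap /mnt/snap
echo "hello" > /mnt/kosmo/new-file
ls /mnt/snap
- The write lands on the origin only; the snapshot keeps showing the frozen
content.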
Properties Of Snapshots
- The first property of a snapshot is the amount of disk space it was given.
If we run out of this disk space (too much of the original volume changed),
the kernel will invalidate the snapshot.
- The size of a snapshot may be increased while it is mounted. No need for
file-system resizing tricks here.
- A snapshot may be read-only or read-write. If it is read-write, then
the snapshot's disk space is also used for writing on top of the snapshot.
- When creating a snapshot, we may define the chunk size (i.e. how much
data to COW when there is a change in the original volume). A larger
chunk means better throughput, but possibly higher latency and more
wasted disk space (see the example after this list).
- We may stack snapshots together (everything in LVM2 is stackable). The
performance of reads might degrade.
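- A sketch of these knobs in use (-c sets the snapshot chunk size, and
lvextend grows a snapshot's disk space; the names and sizes here are
arbitrary):
lvm lvcreate -s -c 64k -L 100M -n lv_kosmo_snap2 /dev/VG_YOYO/lv_kosmo
lvm lvextend -L +50M /dev/VG_YOYO/lv_kosmo_snap2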
LVM2 Internals
- Internally, LVM2 is split in two.
- In the kernel, there is the device mapper, which implements everything.
- In user-space, there are the LVM2 tools, to manage everything.
- What we once knew as "software RAID" is now only a part of the device
mapper. It just grew out of proportion ;)
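- We can peek at the kernel side with the dmsetup utility, e.g. to list the
mapping targets supported by the running kernel, and the currently active
mapped devices:
dmsetup targets
dmsetup ls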
LVM2 Labels And UUID-s
- In order to identify that a physical device is managed by LVM, an LVM2
label is written on one of its first 4 sectors.
- This label is used, during LVM scans, to find its devices.
- In order to identify physical volumes belonging to volume groups, a UUID
is assigned to each of them, and stored on them.
- The UUID of a physical device is stored as part of the volume-group
meta-data area on the device: at the start (beginning with the 5th sector),
at the end, or on both sides (for redundancy).
- By default, the UUID is generated randomly.
- We can force a specific UUID when marking a physical volume, in case we
intend to restore an LVM setup from backup.
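- To inspect the UUIDs from user-space (pvs is LVM2's reporting command;
-o selects the output columns):
lvm pvs -o pv_name,pv_uuid
- For the restore-from-backup case above, pvcreate accepts a --uuid option.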
The Layout Of A Logical Volume
- As we said, a logical volume is stored on a volume group.
- The logical volume is made of a list of physical extents (not necessarily
contiguous) on the volume group.
- In addition, we might want to make the logical volume striped - that is,
its data chunks will be stored alternately on different physical devices
of the volume group. This may increase throughput.
- Thus, when we send I/O to a logical volume, it may get mapped twice.
- This mapping is done in memory. If the device is very large - the map
might be very large too...
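- A sketch of creating a striped logical volume (-i sets the number of
stripes, -I the stripe size in KB; this requires at least two physical
volumes in the group, and the name lv_striped is arbitrary):
lvm lvcreate -i 2 -I 64 -L 200M -n lv_striped VG_YOYO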
The Structure Of A Snapshot
- A snapshot is generally made of two parts - an exceptions table, and
the copied chunks.
- The exceptions table is a hash table of all chunks, on the origin volume,
that were modified since the creation of the snapshot.
- The copied chunks contain the original data from the origin volume,
before it was changed.
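- We can watch a snapshot filling up (the exact output columns vary between
LVM2 versions):
lvm lvs VG_YOYO
dmsetup status VG_YOYO-lv_kosmo_snap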
A Write To The Origin Of A Snapshot
Sending I/O to a device for which we have a snapshot:
- If the start sector of the I/O belongs to a chunk that is found in the
exceptions table, just perform a normal write to the origin volume.
- Otherwise, if there's a pending copy of this chunk, place the I/O on
the chunk's queue.
- Otherwise (this is the first write for this chunk), look for a free
entry in the exceptions table.
- If there are no free entries, mark this snapshot as invalid (since it is
full), and return an I/O error.
- If there is a free entry, assign it to our chunk, and send a copy
request to the kcopyd kernel-daemon.
- When the copy completes, update the entry in the exceptions table, then
perform the original I/O and any other I/Os that were queued for this
chunk.
Device-Mapper Internals
- The basis for all of the abstractions of the device mapper is the dm-table.
- This table contains a list of devices that comprise a mapping, and a btree
(sorted by virtual sector number) showing where each slice of the device
is mapped to.
- Each component of the device-mapper (dm-stripe, dm-snap, etc.) uses
a dm-table internally, and configures it based on its requirements.
- When a device-mapper component gets an I/O request, it consults its
internal dm-table to see which dm-device to map the request to. This
process might repeat recursively until we get to the underlying block
devices.
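- We can inspect these tables from user-space. The sample line below
illustrates the linear target's "start length target device offset" format;
it is not captured output:
dmsetup table VG_YOYO-lv_kosmo
(a linear line looks like: 0 409600 linear 7:0 384)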
- The 'BIO' is a structure containing an I/O request (read or write) for
a block device.
- During traversal of the device-mapper tree, a single I/O might have to
be split (e.g. for a stripe, or on the boundary of a physical extent),
cloned (e.g. for mirroring), etc.
- The device-mapper has an internal I/O request struct, which manages the
mapping of the BIOs in this manner. It has one original BIO, and one or
more internally-generated BIOs.
- The original BIO is only completed when all the BIOs generated for
it have completed.
- Note: to avoid copying the data fields, a BIO points to memory
pages - several BIO-s may point to the same (or different) pages. Thus,
we only need to copy the page lists, not the actual data of the pages.
Companions Of LVM2
- Some related block-device drivers are not used by LVM2, but may still
be interesting for the user.
- We will mention those most relevant for users in an enterprise,
where "high availability" is key to success.
Mirroring With The DRBD Driver
- The DRBD driver supports synchronous mirroring of data between two
machines.
- You load the driver on two machines, and define remote mirrors
for "local" disks ("local" can also be on a SAN or a NAS).
- When the local machine gets an I/O write request, it will perform it
locally, and also send a copy to the remote machine (via the LAN).
- Only when the I/O request has completed on both the local and remote
devices is the result returned to the calling application (or file-system).
- DRBD is useful under the control of a high-availability cluster software,
such as heartbeat.
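- A minimal sketch of a DRBD resource definition (drbd.conf syntax; the host
names, disks and addresses below are made-up placeholders):
resource r0 {
    protocol C;        # synchronous - ack only after the remote write
    on node1 {
        device    /dev/drbd0;
        disk      /dev/sda7;
        address   10.0.0.1:7788;
        meta-disk internal;
    }
    on node2 {
        device    /dev/drbd0;
        disk      /dev/sda7;
        address   10.0.0.2:7788;
        meta-disk internal;
    }
}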
The Multi-Pathing Driver
- In a high-availability environment, administrators make sure there is
more than one path to access a physical disk.
- For example, in a SAN, the host might be connected to the RAID via
two separate switches.
- On the block device level, each path will look like a separate disk.
- The purpose of the multi-pathing driver is to identify that several
block devices actually map to the same disk, to combine them into a single
multipath device, and to allow I/O to this device.
- When I/O gets to a multipath device, it may be sent via a single path,
or using round-robin on several paths, etc.
- The multi-path driver may be controlled using the 'multipath' utility,
and the /etc/multipath.conf configuration file, to support different
types of multi-path devices and RAIDs.
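- For example, to list the detected multipath topology:
multipath -ll
- And a minimal /etc/multipath.conf fragment (the multibus policy
round-robins I/O over all available paths):
defaults {
    path_grouping_policy multibus
}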
Originally written by
- LVM HowTo -
- Wikipedia - Logical Volume Management -
- LVM2 Webcast by Heinz Mauelshagen -
- The /etc/lvm/lvm.conf configuration file (which usually comes heavily
commented).