
4. Considerations

The starting point in this will be to consider where you are and what you want to do. The typical home system starts out with existing hardware and the newly converted Linux user will want to get the most out of existing hardware. Someone setting up a new system for a specific purpose (such as an Internet provider) will instead have to consider what the goal is and buy accordingly. Being ambitious I will try to cover the entire range.

Different purposes also have different requirements regarding file system placement on the drives. A large multiuser machine, for example, would probably be best off with the /home directory on a separate disk.

In general, for performance it is advantageous to split most things over as many disks as possible but there is a limited number of devices that can live on a SCSI bus and cost is naturally also a factor. Equally important, file system maintenance becomes more complicated as the number of partitions and physical drives increases.

4.1 File system features

The various parts of FSSTND have different requirements regarding speed, reliability and size. For instance, losing root is a pain but can easily be recovered, whereas losing /var/spool/mail is a rather different issue. Here is a quick summary of some essential parts and their properties and requirements. Note that this is just a guide: there can be binaries in etc and lib directories, libraries in bin directories and so on.

Swap

Speed

Maximum! Though if you rely too much on swap you should consider buying some more RAM. Note, however, that on many PC motherboards the cache will not work on RAM above 128 MB.

Size

Similar to RAM. Quick and dirty algorithm, just as for tea: 16 MB for the machine and 2 MB for each user. The smallest kernel runs in 1 MB but is tight; use 4 MB for general work and light applications, 8 MB for X11 or GCC, or 16 MB to be comfortable. (The author is known to brew a rather powerful cuppa tea...)

Some suggest that swap space should be 1-2 times the size of the RAM, pointing out that the locality of the programs determines how effective your added swap space is. Note that using the same algorithm as for 4BSD is slightly incorrect as Linux does not allocate space for pages in core.
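The two rules of thumb above can be put side by side in a few lines. This is only a sketch of the arithmetic: the function names are made up for illustration, and the figures are the ones quoted in the text, not tuned values.

```python
def swap_by_users(n_users, base_mb=16, per_user_mb=2):
    """Tea rule from the text: 16 MB for the machine plus 2 MB per user."""
    return base_mb + per_user_mb * n_users


def swap_by_ram(ram_mb, factor_low=1.0, factor_high=2.0):
    """Alternative rule: swap somewhere between 1 and 2 times the RAM size."""
    return (ram_mb * factor_low, ram_mb * factor_high)


if __name__ == "__main__":
    # A hypothetical machine with 10 users and 32 MB of RAM.
    print("per-user rule:", swap_by_users(10), "MB")
    print("RAM-multiple rule:", swap_by_ram(32), "MB (low, high)")
```

Neither rule knows anything about the working sets of your programs, so treat the results as a starting point only.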

Also remember to take into account the type of programs you use. Some programs that have large working sets, such as finite element modeling (FEM), have huge data structures loaded in RAM rather than working explicitly on disk files. Data and computing intensive programs like this will cause excessive swapping if you have less RAM than they require.

Other types of programs can lock their pages into RAM. This can be for security reasons, preventing copies of data reaching a swap device, or for performance reasons such as in a real time module. Either way, locking pages reduces the remaining amount of swappable memory and can cause the system to swap earlier than otherwise expected.

In man 8 mkswap it is explained that each swap partition can be a maximum of just under 128 MB in size.
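Given a per-partition limit, the number of swap partitions needed for a target amount of swap is a simple ceiling division. The sketch below assumes 127 MB as a stand-in for "just under 128 MB"; that figure and the function name are assumptions for illustration, so check the man page for the exact limit on your system.

```python
import math

# Assumption: "just under 128 MB" is approximated as 127 MB here.
MAX_SWAP_PARTITION_MB = 127


def swap_partitions_needed(total_swap_mb, limit_mb=MAX_SWAP_PARTITION_MB):
    """Return how many swap partitions are needed to hold total_swap_mb."""
    if total_swap_mb <= 0:
        return 0
    return math.ceil(total_swap_mb / limit_mb)


if __name__ == "__main__":
    for wanted in (64, 127, 256):
        print(wanted, "MB of swap ->", swap_partitions_needed(wanted), "partition(s)")
```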

Reliability

Medium. When it fails you know it pretty quickly and failure will cost you some lost work. You save often, don't you?

Note 1

Linux offers the possibility of interleaved swapping across multiple devices, a feature that can gain you much. Check out "man 8 swapon" for more details. However, using software RAID for swap across multiple devices adds more overhead than it gains you.

Thus the /etc/fstab file might look like this:

/dev/sda1       swap            swap    pri=1           0       0
/dev/sdc1       swap            swap    pri=1           0       0

Remember that the fstab file is very sensitive to the formatting used. Read the man page carefully and do not just cut and paste the lines above.
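What makes the swap areas interleave is that they share the same priority. A minimal sketch that builds such lines for a list of devices follows; the helper name is hypothetical, the device names are simply the ones from the example above, and the output should still be checked against the fstab man page before use.

```python
def swap_fstab_lines(devices, priority=1):
    """Build one fstab line per swap device, all with the same priority."""
    template = "{dev}\tswap\tswap\tpri={pri}\t0\t0"
    return [template.format(dev=dev, pri=priority) for dev in devices]


if __name__ == "__main__":
    # Device names taken from the example in the text; adjust to your system.
    for line in swap_fstab_lines(["/dev/sda1", "/dev/sdc1"]):
        print(line)
```

Fields are separated by tabs here, which is one of the separators fstab accepts; giving the devices different priorities instead would fill them one after the other rather than interleave them.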

Note 2

Some people use a RAM disk for swapping or some other file systems. However, unless you have some very unusual requirements or setups you are unlikely to gain much from this as this cuts into the memory available for caching and buffering.

Temporary storage (/tmp and /var/tmp)

Speed

Very high. On a separate disk/partition this will reduce fragmentation generally, though ext2fs handles fragmentation rather well.

Size

Hard to tell. Small systems are easy to run with just a few MB, but these directories are notorious hiding places for stashing files away from prying eyes and quota enforcement, and can grow without control on larger machines. Suggested: small home machine: 8 MB, large home machine: 32 MB, small server: 128 MB, and large machines up to 500 MB. (The machine used by the author at work has 1100 users and a 300 MB /tmp directory.) Keep an eye on these directories, not only for hidden files but also for old files. Also be prepared for these partitions to be the first reason you have to resize your partitions.

Reliability

Low. Often programs will warn or fail gracefully when these areas fail or are filled up. Random file errors will of course be more serious, no matter what file area this is.

Files

Mostly short files but there can be a huge number of them. Normally programs delete their old tmp files, but if an interruption occurs they can survive. Many distributions have a policy regarding cleaning out tmp files at boot time; you might want to check what your setup is.
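To keep an eye on surviving temporary files, something along the following lines can list regular files that have not been modified for a while. It is a sketch only: the function name and the 7 day threshold are arbitrary choices, and it only reports files, it deletes nothing.

```python
import os
import stat
import time


def stale_files(top, max_age_days=7, now=None):
    """Yield (path, age_in_days) for regular files under top that have not
    been modified within max_age_days."""
    now = time.time() if now is None else now
    cutoff = max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # ignore symlinks, sockets, pipes and devices
            age = now - st.st_mtime
            if age > cutoff:
                yield path, age / 86400
```

Calling list(stale_files("/tmp")) as an ordinary user will only see what that user may read, so on a multiuser machine it would normally be run by the administrator.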

Note

In FSSTND there is a note about putting /tmp on a RAM disk. This, however, is not recommended for the same reasons as stated for swap. Also, as noted earlier, do not use flash RAM drives for these directories. One should also keep in mind that some systems are set to automatically clean tmp areas on rebooting.

(* That was 50 lines, I am home and dry! *)

Spool areas (/var/spool/news and /var/spool/mail)

Speed

High, especially on large news servers. News transfer and expiring are disk intensive and will benefit from fast drives. Print spools: low. Consider RAID0 for news.

Size

For news/mail servers: whatever you can afford. For single user systems a few MB will be sufficient if you read continuously. Joining a list server and taking a holiday is, on the other hand, not a good idea. (Again the machine I use at work has 100 MB reserved for the entire /var/spool)

Reliability

Mail: very high, news: medium, print spool: low. If your mail is very important (isn't it always?) consider RAID for reliability.

Files

Usually a huge number of files that are around a few KB in size. Files in the print spool can on the other hand be few but quite sizable.

Note

Some of the news documentation suggests putting all the .overview files on a drive separate from the news files. Check out the news FAQs for more information.

Home directories (/home)

Speed

Medium. Although many programs use /tmp for temporary storage, others such as some news readers frequently update files in the home directory which can be noticeable on large multiuser systems. For small systems this is not a critical issue.

Size

Tricky! On some systems people pay for storage, so this is usually a question of finance. Large systems such as nyx.net (which is a free Internet service with mail, news and WWW services) run successfully with a suggested limit of 100 KB per user and 300 KB as enforced maximum. Commercial ISPs typically offer about 5 MB in their standard subscription packages.

If however you are writing books or are doing design work the requirements balloon quickly.

Reliability

Variable. Losing /home on a single user machine is annoying but when 2000 users call you to tell you their home directories are gone it is more than just annoying. For some their livelihood relies on what is here. You do regular backups of course?

Files

Equally tricky. The minimum setup for a single user tends to be a dozen files, 0.5 - 5 KB in size. Project related files can be huge though.

Note1

You might consider RAID for either speed or reliability. If you want extremely high speed and reliability you might be looking at other operating system and hardware platforms anyway. (Fault tolerance etc.)

Note2

Web browsers often use a local cache to speed up browsing, and this cache can take up a substantial amount of space and cause much disk activity. There are many ways of avoiding this kind of performance hit; for more information see the sections on Home Directories and WWW.

Note3

Users often tend to use up all available space on the /home partition. The Linux Quota subsystem is capable of limiting the number of blocks and the number of inodes a single user ID can allocate on a per-filesystem basis. See the Linux Quota mini-HOWTO by Albert M.C. Tam for details on setup.
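To get a feeling for the two quantities that quotas limit, the sketch below walks a directory tree and tallies blocks and inodes per user ID. It is not the quota subsystem and does not enforce anything; the function name is made up for illustration, and blocks are counted in the 512-byte units reported by st_blocks.

```python
import os
from collections import defaultdict


def usage_by_uid(top):
    """Return {uid: [blocks, inodes]} for everything below top.

    Blocks are 512-byte units as reported by st_blocks. Hard-linked
    files are counted once per (device, inode) pair.
    """
    usage = defaultdict(lambda: [0, 0])
    seen = set()
    for dirpath, dirnames, filenames in os.walk(top):
        for name in dirnames + filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue
            seen.add(key)
            usage[st.st_uid][0] += st.st_blocks
            usage[st.st_uid][1] += 1
    return dict(usage)
```

Running usage_by_uid("/home") before choosing limits shows who would hit a given block or inode limit straight away.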

Main binaries (/usr/bin and /usr/local/bin)

Speed

Low. Often data is bigger than the programs which are demand loaded anyway so this is not speed critical. Witness the successes of live file systems on CD ROM.

Size

The sky is the limit but 200 MB should give you most of what you want for a comprehensive system. A big system, for software development or as a multi purpose server, should perhaps reserve 500 MB both for installation and for growth.

Reliability

Low. This is usually mounted under root where all the essentials are collected. Nevertheless losing all the binaries is a pain...

Files

Variable but usually of the order of 10 - 100 kB.

Libraries (/usr/lib and /usr/local/lib)

Speed

Medium. These are large chunks of data loaded often, ranging from object files to fonts, all susceptible to bloating. Often these are also loaded in their entirety and speed is of some use here.

Size

Variable. This is for instance where word processors store their immense font files. The few that have given me feedback on this report about 70 MB in their various lib directories. A rather complete Debian 1.2 installation can take as much as 250 MB, which can be taken as a realistic upper limit. The following are some of the largest disk space consumers: GCC, Emacs, TeX/LaTeX, X11 and perl.

Reliability

Low. See point Main binaries.

Files

Usually large with many of the order of 100 kB in size.

Note

For historical reasons some programs keep executables in the lib areas. One example is GCC, which has some huge binaries in the /usr/lib/gcc/lib hierarchy.

Root

Speed

Quite low: only the bare minimum is here, much of which is only run at startup time.

Size

Relatively small. However it is a good idea to keep some essential rescue files and utilities on the root partition and some keep several kernel versions. Feedback suggests about 20 MB would be sufficient.

Reliability

High. A failure here will possibly cause a fair bit of grief and you might end up spending some time rescuing your boot partition. With some practice you can of course do this in an hour or so, but I would think if you have some practice doing this you are also doing something wrong.

Naturally you do have a rescue disk? Of course this is updated since you did your initial installation? There are many ready made rescue disks as well as rescue disk creation tools you might find valuable. Presumably investing some time in this saves you from becoming a root rescue expert.

Note 1

If you have plenty of drives you might consider putting a spare emergency boot partition on a separate physical drive. It will cost you a little bit of space but if your setup is huge the time saved, should something fail, will be well worth the extra space.

Note 2

For simplicity and also in case of emergencies it is not advisable to put the root partition on a RAID level 0 system. Also if you use RAID for your boot partition you have to remember to have the md option turned on for your emergency kernel.

DOS etc.

At the danger of sounding heretical I have included this little section about something many reading this document have strong feelings about. Unfortunately many hardware items come with setup and maintenance tools based around those systems, so here goes.

Speed

Very low. The systems in question are not famed for speed so there is little point in using prime quality drives. Multitasking or multi-threading are not available so the command queueing facility found in SCSI drives will not be taken advantage of. If you have an old IDE drive it should be good enough. The exception is to some degree Win95 and more notably NT which have multi-threading support which should theoretically be able to take advantage of the more advanced features offered by SCSI devices.

Size

The company behind these operating systems is not famed for writing tight code so you have to be prepared to spend a few tens of MB depending on what version you install of the OS or Windows. With an old version of DOS or Windows you might fit it all in on 50 MB.

Reliability

Ha-ha. As the chain is no stronger than the weakest link you can use any old drive. Since the OS is more likely to scramble itself than the drive is likely to self destruct you will soon learn the importance of keeping backups here.

Put another way: "Your mission, should you choose to accept it, is to keep this partition working. The warranty will self destruct in 10 seconds..."

Recently I was asked to justify my claims here. First of all I am not calling DOS and Windows sorry excuses for operating systems. Secondly there are various legal issues to be taken into account. Saying there is a connection between the last two sentences is merely the ravings of the paranoid. Surely. Instead I shall offer the esteemed reader a few key words: DOS 4.0, DOS 6.x and various drive compression tools that shall remain nameless.

4.2 Explanation of terms

Naturally the faster the better but often the happy installer of Linux has several disks of varying speed and reliability so even though this document describes performance as 'fast' and 'slow' it is just a rough guide since no finer granularity is feasible. Even so there are a few details that should be kept in mind:

Speed

This is really a rather woolly mix of several terms: CPU load, transfer setup overhead, disk seek time and transfer rate. It is in the very nature of tuning that there is no fixed optimum, and in most cases price is the dictating factor. CPU load is only significant for IDE systems where the CPU does the transfer itself but is generally low for SCSI; see SCSI documentation for actual numbers. Disk seek time is also small, usually in the millisecond range. This however is not a problem if you use command queueing on SCSI where you then overlap commands, keeping the bus busy all the time. News spools are a special case consisting of a huge number of normally small files, so in this case seek time can become more significant.

There are two main parameters that are of interest here:

Seek

is usually specified as the average time taken for the read/write head to seek from one track to another. This parameter is important when dealing with a large number of small files such as found in spool files. There is also the extra delay before the desired sector rotates into position under the head. This delay is dependent on the angular velocity of the drive, which is why this parameter quite often is quoted for a drive. Common values are 4500, 5400 and 7200 rpm (rotations per minute). Higher rpm reduces this delay but at a substantial cost. Also, drives working at 7200 rpm have been known to be noisy and to generate a lot of heat, a factor that should be kept in mind if you are building a large array or "disk farm". Very recently drives working at 10000 rpm have entered the market, and here the cooling requirements are even stricter and minimum figures for air flow are given.
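The rotational part of the delay is easy to estimate: on average the desired sector is half a revolution away. A small sketch of that arithmetic for the rotation speeds mentioned above (the function name is just for illustration):

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational delay in milliseconds: half of one revolution."""
    ms_per_revolution = 60000.0 / rpm
    return ms_per_revolution / 2.0


if __name__ == "__main__":
    for rpm in (4500, 5400, 7200, 10000):
        print(rpm, "rpm:", round(avg_rotational_latency_ms(rpm), 2), "ms")
```

This comes in addition to the head movement time quoted by the manufacturer.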

Transfer

is usually specified in megabytes per second. This parameter is important when handling large files that have to be transferred. Library files, dictionaries and image files are examples of this. Drives featuring a high rotation speed also normally have fast transfers as transfer speed is proportional to angular velocity for the same sector density.

It is therefore important to read the specifications for the drives very carefully, and note that the maximum transfer speed quite often is quoted for transfers out of the on board cache (burst speed) and not directly from the platter (sustained speed). See also section on Power and Heating.

Reliability

Naturally no-one would want low reliability disks but one might be better off regarding old disks as unreliable. Also for RAID purposes (See the relevant information) it is suggested to use a mixed set of disks so that simultaneous disk crashes become less likely.

So far I have had only one report of total file system failure but here unstable hardware seemed to be the cause of the problems.

Files

The average file size is important in order to decide the most suitable drive parameters. A large number of small files makes the average seek time important whereas for big files the transfer speed is more important. The command queueing in SCSI devices is very handy for handling large numbers of small files, but for transfer EIDE is not too far behind SCSI and normally much cheaper than SCSI.
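The trade-off between seek time and transfer speed can be made concrete with a very simple model: time to read one contiguous file is roughly seek plus rotational delay plus size divided by transfer rate. The drive figures used below (10 ms seek, 5400 rpm, 5 MB/s) are invented for illustration and are not taken from any data sheet.

```python
def access_time_ms(file_kb, seek_ms, rpm, transfer_mb_per_s):
    """Rough time to read one contiguous file: seek + rotation + transfer."""
    rotational_ms = 60000.0 / rpm / 2.0
    transfer_ms = file_kb / 1024.0 / transfer_mb_per_s * 1000.0
    return seek_ms + rotational_ms + transfer_ms


def transfer_share(file_kb, seek_ms, rpm, transfer_mb_per_s):
    """Fraction of the total access time spent actually moving data."""
    total = access_time_ms(file_kb, seek_ms, rpm, transfer_mb_per_s)
    positioning = seek_ms + 60000.0 / rpm / 2.0
    return (total - positioning) / total


if __name__ == "__main__":
    # Hypothetical drive: 10 ms average seek, 5400 rpm, 5 MB/s sustained.
    for size_kb in (4, 100, 10240):
        share = transfer_share(size_kb, 10.0, 5400, 5.0)
        print(size_kb, "KB file: transfer is", round(100 * share), "% of the time")
```

For a spool-sized file of a few KB nearly all the time goes to positioning the head, while for a file of several MB nearly all of it is transfer, which is the distinction made above.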

4.3 Technologies

In order to decide how to get the most of your devices you need to know what technologies are available and their implications. As always there can be some tradeoffs with respect to speed, reliability, power, flexibility, ease of use and complexity.

RAID

This is a method of increasing reliability, speed or both by using multiple disks in parallel thereby decreasing access time and increasing transfer speed. A checksum or mirroring system can be used to increase reliability. Large servers can take advantage of such a setup but it might be overkill for a single user system unless you already have a large number of disks available. See other documents and FAQs for more information.

For Linux one can set up a RAID system using either software (the md module in the kernel), a Linux compatible controller card (PCI-to-SCSI) or a SCSI-to-SCSI controller. Check the documentation for what controllers can be used. A hardware solution is usually faster, and perhaps also safer, but comes at a significant cost.

SCSI-to-SCSI controllers are usually implemented as complete cabinets with drives and a controller that connects to the computer with a second SCSI bus. This makes the entire cabinet of drives look like a single large, fast SCSI drive and requires no special RAID driver. The disadvantage is that the SCSI bus connecting the cabinet to the computer becomes a bottleneck.

PCI-to-SCSI controllers are, as the name suggests, connected to the high speed PCI bus and therefore do not suffer from the same bottleneck as the SCSI-to-SCSI controllers. These controllers require special drivers but you also get the means of controlling the RAID configuration over the network, which simplifies management.

Currently the only supported SCSI RAID controller cards are the SmartCache I/III/IV and SmartRAID I/III/IV controller families from DPT. These controllers are supported by the EATA-DMA driver in the standard kernel. This company also has an informative home page which also describes various general aspects of RAID and SCSI in addition to the product related information.

More information from the author of the DPT controller drivers (EATA* drivers) can be found at his pages on SCSI and DPT.

SCSI-to-SCSI controllers are small computers themselves, often with a substantial amount of cache RAM. To the host system they masquerade as a gigantic, fast and reliable SCSI disk, whereas to their disks they look like the computer's SCSI host adapter. Some of these controllers have the option to talk to multiple hosts simultaneously. Since these controllers look to the host like a normal, albeit large, SCSI drive they need no special support from the host system. Usually they are configured via the front panel or with a vt100 terminal emulator connected to their on-board serial interface.

Very recently I have heard that Syred also makes SCSI-to-SCSI controllers that are supported under Linux. I have no more information about this yet but will come back with more information soon. In the meantime check out their home pages for more information.

RAID comes in many levels and flavours, of which I will give a brief overview here. Much has been written about it and the interested reader is recommended to read more about this in the RAID FAQ.

There are also hybrids available based on RAID 1 and one other level. Many combinations are possible but I have only seen a few referred to. These are more complex than the above mentioned RAID levels.

RAID 0/1 combines striping with duplication which gives very high transfers combined with fast seeks as well as redundancy. The disadvantage is high disk consumption as well as the above mentioned complexity.

RAID 1/5 combines the speed and redundancy benefits of RAID 5 with the fast seek of RAID 1. Redundancy is improved compared to RAID 0/1 but disk consumption is still substantial. Implementing such a system would typically involve more than 6 drives, perhaps even several controllers or SCSI channels.

AFS, Veritas and Other Volume Management Systems

Although multiple partitions and disks have the advantage of making for more space and higher speed and reliability there is a significant snag: if for instance the /tmp partition is full you are in trouble even if the news spool is empty, as it is not easy to retransfer quotas across partitions. Volume management is a system that does just this, and AFS and Veritas are two of the best known examples. Some also offer other file systems like log file systems and others optimised for reliability or speed. Note that Veritas is not available (yet) for Linux and it is not certain they can sell kernel modules without providing source for their proprietary code; this is just mentioned for information on what is out there. Still, you can check their home page to see how such systems function.

Derek Atkins, of MIT, ported AFS to Linux and has also set up the Linux AFS mailing List for this which is open to the public. Requests to join the list should go to Request and finally bug reports should be directed to Bug Reports.

Important: as AFS uses encryption it is restricted software and cannot easily be exported from the US. AFS is now sold by Transarc and they have set up a www site. The directory structure there has been reorganized recently so I cannot give a more accurate URL than just the Transarc Home Page which lands you in the root of the web site. There you can also find much general information as well as a FAQ.

There is now also development based on the last free sources of AFS.

Volume management is for the time being an area where Linux is lacking. Someone has recently started a virtual partition system project that will reimplement many of the volume management functions found in IBM's AIX system.

Linux md Kernel Patch

There is however one kernel project that attempts to do some of this, md, which has been part of the kernel distributions since 1.3.69. Currently providing spanning and RAID it is still in early development and people are reporting varying degrees of success as well as total wipe out. Use with caution.

Currently it offers linear mode and RAID levels 0,1,4,5; all in various stages of development and reliability with linear mode and RAID levels 0 and 1 being the most stable. It is also possible to stack some levels, for instance mirroring (RAID 1) two pairs of drives, each pair set up as striped disks (RAID 0), which offers the speed of RAID 0 combined with the reliability of RAID 1.
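The price of redundancy in disk space can be sketched with simple arithmetic. The function below is a simplified model for equal-sized drives, not a description of what md itself reports; the level names, the function name and the minimum drive counts are conventions chosen for this illustration.

```python
def usable_capacity(level, n_drives, drive_mb):
    """Usable space for n equal drives at a given level (simplified model)."""
    if level in ("linear", "0"):
        return n_drives * drive_mb            # no redundancy at all
    if level == "1":
        return drive_mb                       # every drive holds a full copy
    if level in ("4", "5"):
        if n_drives < 3:
            raise ValueError("this sketch assumes at least 3 drives for RAID 4/5")
        return (n_drives - 1) * drive_mb      # one drive's worth of parity
    if level == "0+1":
        if n_drives < 4 or n_drives % 2:
            raise ValueError("stacked 0+1 needs an even number of drives, at least 4")
        return (n_drives // 2) * drive_mb     # two mirrored stripe sets
    raise ValueError("unknown level: %r" % (level,))


if __name__ == "__main__":
    # Four hypothetical 1000 MB drives.
    for level in ("linear", "0", "1", "5", "0+1"):
        print("level", level, "->", usable_capacity(level, 4, 1000), "MB usable")
```

With four drives the stacked setup described above leaves half of the raw space usable, whereas RAID 5 leaves three quarters.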

Think very carefully what drives you combine so you can operate all drives in parallel, which gives you better performance and less wear. Read more about this in the documentation that comes with md.

General File System Consideration

In the Linux world ext2fs is well established as a general purpose system. Still for some purposes others can be a better choice. News spools lend themselves to a log file based system whereas high reliability data might need other formats. This is a hotly debated topic and there are currently few choices available but work is underway. Log file systems also have the advantage of very fast file checking. Mail servers in the 100 GB class can suffer file checks taking several days before becoming operational after rebooting.

The Minix file system is the oldest one, used in some rescue disk systems but otherwise very little used these days. At one time the Xiafs was a strong contender to the standard for Linux but seems to have fallen behind these days.

Adam Richter from Yggdrasil posted recently that they have been working on a compressed log file based system but that this project is currently on hold. Nevertheless a non-working version is available on their FTP server. Check out the Yggdrasil ftp server where special patched versions of the kernel can be found. Hopefully this will be rolled into the mainstream kernel in the near future.

As of July 23rd, 1997 Hans Reiser has put up the source to his tree based reiserfs on the web. While his filesystem has some very interesting features and is much faster than ext2fs, it is still very experimental and difficult to integrate with the standard kernel. Expect some interesting developments in the future - this is different from your "average log based file system for Linux" project, because Hans already has working code.

There is room for access control lists (ACL) and other unimplemented features in the existing ext2fs, stay tuned for future updates.

There is also an encrypted file system available but again as this is under export control from the US, make sure you get it from a legal place.

File systems are an active field of academic and industrial research and development, the results of which are quite often freely available. Linux has in many cases been a development tool in such activities so you can expect a lot of continuous work in this field. Stay tuned for the latest developments.

CD-ROM File Systems

There have been a number of file systems available for use on CD-ROM systems, and one of the earliest was the High Sierra format, supposedly named after the hotel where the final agreement took place. This was the precursor to the ISO 9660 format which is supported by Linux. Later there were the Rock Ridge extensions which added file system features such as long filenames, permissions and more.

The Linux iso9660 file system supports both the High Sierra format and the Rock Ridge extensions.

However, once again Microsoft decided it should create another standard and their latest effort here is called Joliet and offers some internationalisation features. This is at the time of writing not yet available in the standard kernel releases but exists in beta versions. Hopefully this should soon work its way into the standard kernel.

In a recent Usenet News posting hpa (at) transmeta.com (H. Peter Anvin) writes the following interesting piece of trivia:

Actually, Joliet is a city outside Chicago; best known for being the
site of the prison where Elwood was locked up in the movie "Blues
Brothers."  Rock Ridge (the UNIX extensions to ISO 9660) is named
after the (fictional) town in the movie "Blazing Saddles."

Compression

Disk versus file compression is a hotly debated topic, especially regarding the added danger of file corruption. Nevertheless there are several options available for adventurous administrators. These take on many forms, from kernel modules and patches to extra libraries, but note that most suffer various forms of limitations such as being read-only. As development takes place at breakneck speed the specs have undoubtedly changed by the time you read this. As always: check the latest updates yourself. Here only a few references are given.

Other filesystems

There is also the user file system (userfs), which allows FTP based file systems and some compression (arcfs) plus fast prototyping and many other features. The docfs is based on this filesystem.

Recent kernels feature the loop or loopback device which can be used to put a complete file system within a file. There are some possibilities for using this for making new file systems with compression, tarring, encryption etc.
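The first step in using the loop device is simply to have a file of the right size to hold the file system. A minimal sketch of that step follows; the helper name is hypothetical, and making a file system in the file and attaching it to a loop device is left to the usual system tools, which are not shown here.

```python
import os


def create_image(path, size_mb):
    """Create a file of size_mb megabytes that can later hold a file system.

    The file is extended with truncate(), so on file systems that support
    sparse files it takes up almost no real space until it is written to.
    """
    with open(path, "wb") as f:
        f.truncate(size_mb * 1024 * 1024)
    return os.path.getsize(path)
```

Calling create_image("disk.img", 16), for instance, would leave a 16 MB file in the current directory ready for the next steps.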

Note that this device is unrelated to the network loopback device.

There are a number of other ongoing file system projects, but these are in the experimental stage and fall outside the scope of this HOWTO.

Physical Track Positioning

This trick used to be very important when drives were slow and small, and some file systems used to take the varying characteristics into account when placing files. Higher overall speed, on-board drive and controller caches, and drive intelligence have since reduced the effect of this.

Nevertheless there is still a little to be gained even today. As we know, "world dominance" is soon within reach but to achieve this "fast" we need to employ all the tricks we can use.

To understand the strategy we need to recall this near ancient piece of knowledge and the properties of the various track locations. This is based on the fact that transfer speeds generally increase for tracks further away from the spindle, as well as the fact that it is faster to seek to or from the central tracks than to or from the inner or outer tracks.

Most drives use disks running at constant angular velocity but use (fairly) constant data density across all tracks. This means that you will get much higher transfer rates on the outer tracks than on the inner tracks, a characteristic which fits the requirements for large libraries well.
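The reasoning can be written down directly: at constant angular velocity and constant linear data density, the amount of data passing under the head per revolution grows with the circumference of the track, and so with its radius. A sketch of that proportionality, using made-up radii and an idealised drive:

```python
def relative_transfer_rate(radius, inner_radius):
    """Transfer rate at a track relative to the innermost track.

    Assumes constant angular velocity and constant linear data density,
    so the rate scales with the track radius.
    """
    if radius <= 0 or inner_radius <= 0:
        raise ValueError("radii must be positive")
    return radius / inner_radius


if __name__ == "__main__":
    # Hypothetical platter: innermost track at 20 mm, outermost at 45 mm.
    print("outer/inner rate ratio:", relative_transfer_rate(45.0, 20.0))
```

Real drives group tracks into zones, so the actual curve is stepped rather than smooth, which is why the density above is described as only fairly constant.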

Newer disks use a logical geometry mapping which differs from the actual physical mapping which is transparently mapped by the drive itself. This makes the estimation of the "middle" tracks a little harder.

In most cases track 0 is at the outermost track and this is the general assumption most people use. Still, it should be kept in mind that there are no guarantees this is so.

Inner

tracks are usually slow in transfer, and lying at one end of the seek range they are also slow to seek to.

This is more suitable to the low end directories such as DOS, root and print spools.

Middle

tracks are on average faster with respect to transfers than inner tracks and being in the middle also on average faster to seek to.

These characteristics are ideal for the most demanding parts such as swap, /tmp and /var/tmp.

Outer

tracks have on average even faster transfer characteristics but, like the inner tracks, are at the end of the seek range so statistically they are as slow to seek to as the inner tracks.

Large files such as libraries would benefit from a place here.

Hence seek time reduction can be achieved by positioning frequently accessed tracks in the middle so that the average seek distance and therefore the seek time is short. This can be done either by using fdisk or cfdisk to make a partition on the middle tracks or by first making a file (using dd) equal to half the size of the entire disk before creating the files that are frequently accessed, after which the dummy file can be deleted. Both cases assume starting from an empty disk.
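Why the middle is best can be shown with a small calculation. If the head is assumed to start at a uniformly random track (a simplification), the expected travel to a track at relative position p is (p^2 + (1 - p)^2) / 2, which is smallest at p = 0.5. The sketch below also computes where a partition would have to start to sit centred on the disk; both function names are made up, and the cylinder numbers are only meaningful if the logical geometry matches the physical one.

```python
def mean_seek_distance(p):
    """Expected head travel, as a fraction of the full stroke, to a track at
    relative position p (0 = one edge, 1 = the other edge), assuming the head
    starts at a uniformly random track."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return (p * p + (1.0 - p) * (1.0 - p)) / 2.0


def centred_start(total_cylinders, partition_cylinders):
    """First cylinder of a partition centred on the middle of the disk."""
    if partition_cylinders > total_cylinders:
        raise ValueError("partition larger than disk")
    return (total_cylinders - partition_cylinders) // 2


if __name__ == "__main__":
    for p in (0.0, 0.25, 0.5, 1.0):
        print("track at", p, "-> mean seek distance", mean_seek_distance(p))
    # Hypothetical 1000 cylinder disk with a 200 cylinder partition.
    print("centred partition starts at cylinder", centred_start(1000, 200))
```

Under this model a track in the middle needs on average half the head travel of a track at either edge.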

The latter trick is suitable for news spools where the empty directory structure can be placed in the middle before putting in the data files. This also helps reduce fragmentation a little.

This little trick can be used both on ordinary drives and on RAID systems. In the latter case the calculation for centring the tracks will be different, if it is possible at all. Consult the latest RAID manual.

