mapleson@gamers.org Tel: (+44) (0)1772 893297 Fax: (+44) (0)1772 892913 WWW: http://www.futuretech.vuurwerk.nl/
Running a UNIX system always involves monitoring how a system is behaving on a daily basis. Admins must keep an eye on such things as:
This section explains the various system monitoring tools, commands and techniques which an admin can use to monitor the areas listed above. Typical example administration tasks are discussed in a later lecture. The focus here is on available tools and what they offer, not on how to use them as part of an admin strategy.
Disk Space Usage.
The df command reports current disk space usage. Run on its own, the output is expressed in terms of numbers of blocks used/free, eg.:
yoda # df Filesystem Type blocks use avail %use Mounted on /dev/root xfs 8615368 6116384 2498984 71 / /dev/dsk/dks4d5s7 xfs 8874746 4435093 4439653 50 /home milamber:/mapleson nfs 4225568 3906624 318944 93 /mapleson
Figure 38. Using df without options.
A block is 512 bytes. But most people tend to think in terms of kilobytes, megabytes and gigabytes, not multiples of 512 bytes. Thus, the -k option can be used to show the output in K:
yoda # df -k Filesystem Type kbytes use avail %use Mounted on /dev/root xfs 4307684 3058192 1249492 71 / /dev/dsk/dks4d5s7 xfs 4437373 2217547 2219826 50 /home milamber:/mapleson nfs 2112784 1953312 159472 93 /mapleson
Figure 39. The -k option with df to show data in K.
The df command can be forced to report data only for the file system housing the current directory by adding a period:
yoda # cd /home && df -k . Filesystem Type kbytes use avail %use Mounted on /dev/dsk/dks4d5s7 xfs 4437373 2217547 2219826 50 /home
Figure 40. Using df to report usage for the file system holding the current directory.
The du command can be used to show the amount of space used by a particular directory or file, or series of directories and files. The -k option can be used to show usage in K instead of 512byte blocks just as with df. du's default behaviour is to report a usage amount recursively for every sub-directory, giving a total at the end, eg.:
yoda # du -k /usr/share/data/models 436 /usr/share/data/models/sgi 160 /usr/share/data/models/food 340 /usr/share/data/models/toys 336 /usr/share/data/models/buildings 412 /usr/share/data/models/household 864 /usr/share/data/models/scenes 132 /usr/share/data/models/chess 1044 /usr/share/data/models/geography 352 /usr/share/data/models/CyberHeads 256 /usr/share/data/models/machines 1532 /usr/share/data/models/vehicles 88 /usr/share/data/models/simple 428 /usr/share/data/models/furniture 688 /usr/share/data/models/robots 7760 /usr/share/data/models
The -s option can be used to restrict the output to just an overall total for the specified directory:
yoda # du -k -s /usr/share/data/models 7760 /usr/share/data/models
By default, du does not follow symbolic links, though the -L option can be used to force links to be followed if desired.
However, du does examine NFS-mounted file systems by default. The -l and -m options can be used to restrict this behaviour, eg.:
ASH # cd / ASH # du -k -s -l 0 CDROM 0 bin 0 debug 68 dev 0 disk2 2 diskcopy 0 dumpster 299 etc 0 home 2421 lib 2579 lib32 0 opt 0 proc 1 root.home 4391 sbin 565 stand 65 tmp 3927 unix 397570 usr 6346 var
The output in Fig 43 shows that the /home directory has been ignored.
Another example: a user can find out how much disk space their account currently uses by entering: du -k -s ~/
Swap space (ie. virtual memory on disk) can be monitored using the swap command with the -l option.
For full details on these commands, see the relevant man pages.
Commands relating to file system quotas are dealt with in a later lecture.
System Performance.
This includes processor loading, disk loading, etc.
The most common command used by admins/users to observe CPU usage is ps, which displays a list of currently running processes along with associated information, including the percentage of CPU time currently being consumed by each process, eg.:
ASH 6# ps -ef UID PID PPID C STIME TTY TIME CMD root 0 0 0 08:00:41 ? 0:01 sched root 1 0 0 08:00:41 ? 0:01 /etc/init root 2 0 0 08:00:41 ? 0:00 vhand root 3 0 0 08:00:41 ? 0:03 bdflush root 4 0 0 08:00:41 ? 0:00 munldd root 5 0 0 08:00:41 ? 0:02 vfs_sync root 900 895 0 08:03:27 ? 1:25 /usr/bin/X11/Xsgi -bs [etc] root 7 0 0 08:00:41 ? 0:00 shaked root 8 0 0 08:00:41 ? 0:00 xfsd root 9 0 0 08:00:41 ? 0:00 xfsd root 10 0 0 08:00:41 ? 0:00 xfsd root 11 0 0 08:00:41 ? 0:00 pdflush root 909 892 0 08:03:31 ? 0:02 /usr/etc/videod root 1512 1509 0 15:37:17 ? 0:00 sh -c /var/X11/xdm/Xlogin root 158 1 0 08:01:01 ? 0:01 /usr/etc/ypbind -ypsetme root 70 1 0 08:00:50 ? 0:00 /usr/etc/syslogd root 1536 211 0 16:06:04 pts/0 0:00 rlogind root 148 1 0 08:01:00 ? 0:01 /usr/etc/routed -h -[etc] root 146 1 0 08:01:00 ? 0:00 /usr/etc/portmap root 173 172 0 08:01:03 ? 0:01 /usr/etc/nfsd 4 root 172 1 0 08:01:03 ? 0:01 /usr/etc/nfsd 4 root 174 172 0 08:01:03 ? 0:01 /usr/etc/nfsd 4 root 175 172 0 08:01:03 ? 0:01 /usr/etc/nfsd 4 root 178 1 0 08:01:03 ? 0:00 /usr/etc/biod 4 root 179 1 0 08:01:03 ? 0:00 /usr/etc/biod 4 root 180 1 0 08:01:03 ? 0:00 /usr/etc/biod 4 root 181 1 0 08:01:03 ? 0:00 /usr/etc/biod 4 root 189 1 0 08:01:04 ? 0:00 bio3d root 190 1 0 08:01:04 ? 0:00 bio3d root 191 1 0 08:01:04 ? 0:00 bio3d root 202 1 0 08:01:05 ? 0:00 /usr/etc/rpc.statd root 192 1 0 08:01:04 ? 0:00 bio3d root 188 1 0 08:01:03 ? 0:00 bio3d root 311 1 0 08:01:08 ? 0:00 /usr/etc/timed -M -F yoda root 211 1 0 08:01:05 ? 0:02 /usr/etc/inetd root 823 1 0 08:01:33 ? 0:13 /usr/lib/sendmail -bd -q15m root 1557 1537 9 16:10:58 pts/0 0:00 ps -ef root 892 1 0 08:03:25 ? 0:00 /usr/etc/videod root 1513 1512 0 15:37:17 ? 0:07 /usr/Cadmin/bin/clogin -f root 1546 872 0 16:07:55 ? 0:00 /usr/Cadmin/bin/directoryserver root 1537 1536 1 16:06:04 pts/0 0:01 -tcsh root 903 1 0 08:03:27 tablet 0:00 /sbin/getty ttyd1 co_9600 lp 460 1 0 08:01:17 ? 0:00 /usr/lib/lpsched root 1509 895 0 15:37:13 ? 0:00 /usr/bin/X11/xdm root 488 1 0 08:01:19 ? 0:01 /sbin/cron root 1556 1537 28 16:10:56 pts/0 0:01 find /usr -name *.txt -print root 895 1 0 08:03:27 ? 0:00 /usr/bin/X11/xdm root 872 1 0 08:02:32 ? 0:06 /usr/Cadmin/bin/directoryserver
Figure 44. Typical output from the ps command.
Before obtaining the output shown in Fig 44, I ran a find command in the background. The output shows that the find command was utilising 28% of available CPU resources; tasks such as find are often limited by the speed and bandwidth capacity of the disk, not the speed of the main CPU.
The ps command has a variety of options to show or not show various information. Most of the time though, 'ps -ef' is adequate to display the kind of information required. Note that other UNIX variants use slightly different options, eg. the equivalent command on SunOS would be 'ps -aux'.
One can use grep to only report data for a particular process, eg.:
ASH 5# ps -ef | grep lplp 460 1 0 08:01:17 ? 0:00 /usr/lib/lpsched
This only reports data for the lp printer scheduler.
However, ps only gives a snapshot of the current system state. Often
of more interest is a system's dynamic behaviour. A more suitable
command for monitoring system performance over time is 'top', a
typical output of which looks like this:
IRIX ASH 6.2 03131015 IP22 Load[0.22,0.12,0.01] 16:17:47 166 procs user pid pgrp %cpu proc pri size rss time command root 1576 1576 24.44 * 20 386 84 0:02 find root 1577 1577 0.98 0 65 432 100 0:00 top root 1513 1509 0.18 * 60 4322 1756 0:07 clogin root 900 900 0.12 * 60 2858 884 1:25 Xsgi root 146 146 0.05 * 60 351 77 0:00 portmap root 158 0 0.05 * 60 350 81 0:00 ypbind root 1567 1567 0.02 * 60 349 49 0:00 rlogind root 3 0 0.01 * +39 0 0 0:03 bdflush root 172 0 0.00 * 61 0 0 0:00 nfsd root 173 0 0.00 * 61 0 0 0:00 nfsd root 174 0 0.00 * 61 0 0 0:00 nfsd root 175 0 0.00 * 61 0 0 0:00 nfsd
Figure 46. top shows a continuously updated output.
From the man page for top:
The display is constantly updated at regular intervals, the duration of which can be altered with the -i option (default duration is 5 seconds). top shows the following data for each process:
Just as with the ps command, top shows the ID number for each process. These IDs can be used with the kill command (and others) to control running processes, eg. shut them down, suspend them, etc.
There is a GUI version of top called gr_top.
Note that IRIX 6.5 contains a newer version of top which gives even more information, eg.:
IRIX WOLFEN 6.5 IP22 load averages: 0.06 0.01 0.00 17:29:44 58 processes: 56 sleeping, 1 zombie, 1 running CPU: 93.5% idle, 0.5% usr, 5.6% ker, 0.0% wait, 0.0% xbrk, 0.5% intr Memory: 128M max, 116M avail, 88M free, 128M swap, 128M free swap PID PGRP USERNAME PRI SIZE RES STATE TIME WCPU% CPU% COMMAND 1372 1372 root 20 2204K 1008K run/0 0:00 0.2 3.22 top 153 153 root 20 2516K 1516K sleep 0:05 0.1 1.42 nsd 1364 1364 root 20 1740K 580K sleep 0:00 0.0 0.24 rlogind
Figure 47. The IRIX 6.5 version of top, giving extra information.
A program which offers much greater detail than top is osview. Like top, osview constantly updates a whole range of system performance statistics. Unlike top though, so much information is available from osview that it offers several different 'pages' of data. The number keys are used to switch between pages. Here is a typical output for each of the five pages:
Page 1 (system information):
Osview 2.1 : One Second Average WOLFEN 17:32:13 04/21/99 #5 int=5s Load Average fs ctl 2.0M 1 Min 0.000 fs data 7.7M 5 Min 0.000 delwri 0 15 Min 0.000 free 87.5M CPU Usage data 26.0M %user 0.20 empty 61.4M %sys 0.00 userdata 20.7M %intr 0.00 reserved 0 %gfxc 0.00 pgallocs 2 %gfxf 0.00 Scheduler %sxbrk 0.00 runq 0 %idle 99.80 swapq 0 System Activity switch 4 syscall 19 kswitch 95 read 1 preempt 1 write 0 Wait Ratio fork 0 %IO 1.2 exec 0 %Swap 0.0 readch 19 %Physio 0.0 writech 38 iget 0 System Memory Phys 128.0M kernel 10.1M heap 3.9M mbufs 96.0K stream 40.0K ptbl 1.2M
Figure 48. System information from osview.
Page 2 (CPU information):
Osview 2.1 : One Second Average WOLFEN 17:36:27 04/21/99 #1 int=5s CPU Usage %user 0.00 %sys 100.00 %intr 0.00 %gfxc 0.00 %gfxf 0.00 %sxbrk 0.00 %idle 0.00
Figure 49. CPU information from osview.
Page 3 (memory information):
Osview 2.1 : One Second Average WOLFEN 17:36:56 04/21/99 #1 int=5s System Memory iclean 0 Phys 128.0M *Swap kernel 10.5M *System VM heap 4.2M *Heap mbufs 100.0K *TLB Actions stream 48.0K *Large page stats ptbl 1.3M fs ctl 1.5M fs data 8.2M delwri 0 free 77.1M data 28.8M empty 48.3M userdata 30.7M reserved 0 pgallocs 450 Memory Faults vfault 1.7K protection 225 demand 375 cw 25 steal 375 onswap 0 oncache 1.4K onfile 0 freed 0 unmodswap 0 unmodfile 0
Figure 50. Memory information from osview.
Page 4 (network information):
Osview 2.1 : One Second Average WOLFEN 17:38:15 04/21/99 #1 int=5s TCP acc. conns 0 sndtotal 33 rcvtotal 0 sndbyte 366 rexmtbyte 0 rcvbyte 0 UDP ipackets 0 opackets 0 dropped 0 errors 0 IP ipackets 0 opackets 33 forward 0 dropped 0 errors 0 NetIF[ec0] Ipackets 0 Opackets 33 Ierrors 0 Oerrors 0 collisions 0 NetIF[lo0]
Figure 51. Network information from osview.
Page 5 (miscellaneous):
Osview 2.1 : One Second Average WOLFEN 17:38:43 04/21/99 #1 int=5s Block Devices lread 37.5K bread 0 %rcache 100.0 lwrite 0 bwrite 0 wcancel 0 %wcache 0.0 phread 0 phwrite 0 Graphics griioctl 0 gintr 75 swapbuf 0 switch 0 fifowait 0 fifonwait 0 Video vidioctl 0 vidintr 0 drop_add 0 *Interrupts *PathName Cache *EfsAct *XfsAct *Getblk *Vnodes
Figure 51. Miscellaneous information from osview.
osview clearly offers a vast amount of information for monitoring system and network activity.
There is a GUI version of osview called gr_osview. Various options exist to determine which parameters are displayed with gr_osview, the most commonly used being -a to display as much data as possible.
Programs such as top and osview may be SGI-specific (I'm not sure). If they are, other versions of UNIX are bound to have equivalent programs to these.
Example use: although I do virtually all the administration of the server remotely using the office Indy (either by command line or GUI tools), there is also a VT-style terminal in my office connected to the server's serial port via a lengthy cable (the Challenge S server itself is in a small ante room). The VT display offers a simple text-only interface to the server; thus, most of the time, I leave osview running on the VT display so that I can observe system activity whenever I need to. The VT also offers an extra communications link for remote administration should the network go down, ie. if the network links fail (eg. broken hub) the admin Indy cannot be used to communicate with the server, but the VT still can.
Another tool for monitoring memory usage is gmemusage, a GUI program which displays a graphical split-bar chart view of current memory consumption. gmemusage can also display a breakdown of the regions within a program's memory space, eg. text, data, shared memory, etc.
Much lower-level tools exist too, such as sar (system activity reporter). In fact, osview works by using sar. Experienced admins may use tools like sar, but most admins will prefer to use higher-level tools such as top, osview and gmemusage. However, since sar gives a text output, one can use it in script files for automated system information gathering, eg. a system activity report produced by a script, executed every hour by the cron job-scheduling system (sar-based information gathering scripts are included in the cron job schedule as standard). sar can be given options to report only on selected items, eg. the number of processes in memory waiting for CPU resource time. sar can be told to monitor some system feature for a certain period, saving the data gathered during that period to a file. sar is a very flexible program.
Network Performance and Statistics.
osview can be used to monitor certain network statistics, but another useful program is ttcp. The online book, "IRIX Admin: Networking and Mail", says:
To run a test with ttcp, enter the following on one system, eg. sevrin:
ttcp -r -s
Then enter the following on another system, eg. akira:
ttcp -t -s sevrin
After a delay of roughly 20 seconds for a 10Mbit network, results are reported by both systems, which will look something like this:
SEVRIN # ttcp -r -s ttcp-r: buflen=8192, nbuf=2048, align=16384/0, port=5001 tcp ttcp-r: socket ttcp-r: accept from 193.61.252.2 ttcp-r: 16777216 bytes in 18.84 real seconds = 869.70 KB/sec +++ ttcp-r: 3191 I/O calls, msec/call = 6.05, calls/sec = 169.39 ttcp-r: 0.1user 3.0sys 0:18real 16% 118maxrss 0+0pf 1170+1csw AKIRA # ttcp -t -s sevrin ttcp-t: buflen=8192, nbuf=2048, align=16384/0, port=5001 tcp -> sevrin ttcp-t: socket ttcp-t: connect ttcp-t: 16777216 bytes in 18.74 real seconds = 874.19 KB/sec +++ ttcp-t: 2048 I/O calls, msec/call = 9.37, calls/sec = 109.27 ttcp-t: 0.0user 2.3sys 0:18real 12% 408maxrss 0+0pf 426+4csw
Figure 52. Results from ttcp between two hosts on a 10Mbit network.
Full details of the output are in the ttcp man page, but one can immediately see that the observed network throughput (around 870KB/sec) is at a healthy level.
Another program for gathering network performance information is
netstat. The online book, "IRIX Admin: Networking and Mail", says:
netstat is commonly used with the -i option to list basic local network information, eg.:
yoda # netstat -i Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll ec0 1500 193.61.252 yoda.comp.uclan 3906956 3 2945002 0 553847 ec3 1500 193.61.250 gate-yoda.comp. 560206 2 329366 0 16460 lo0 8304 loopback localhost 476884 0 476884 0 0
Figure 53. The output from netstat.
Here, the packet collision rate has averaged at 18.8%. This is within acceptable limits [1].
Another useful command is 'ping'. This program sends packets of data to a remote system requesting an acknowledgement response for each packet sent. Options can be used to send a specific number of packets, or send as many packets as fast as they are returned, send a packet every so often (user-definable duration), etc.
For example:
MILAMBER # ping yoda PING yoda.comp.uclan.ac.uk (193.61.252.1): 56 data bytes 64 bytes from 193.61.252.1: icmp_seq=0 ttl=255 time=1 ms 64 bytes from 193.61.252.1: icmp_seq=1 ttl=255 time=1 ms 64 bytes from 193.61.252.1: icmp_seq=2 ttl=255 time=1 ms 64 bytes from 193.61.252.1: icmp_seq=3 ttl=255 time=1 ms 64 bytes from 193.61.252.1: icmp_seq=4 ttl=255 time=1 ms 64 bytes from 193.61.252.1: icmp_seq=5 ttl=255 time=1 ms 64 bytes from 193.61.252.1: icmp_seq=6 ttl=255 time=1 ms ----yoda.comp.uclan.ac.uk PING Statistics---- 7 packets transmitted, 7 packets received, 0% packet loss round-trip min/avg/max = 1/1/1 ms
I pressed CTRL-C after the 7th packet was sent. ping is a quick and easy way to see if a host is active and if so how responsive the connection is.
If a ping test produces significant packet loss on a local network, then it is highly likely there exists a problem of some kind. Normally, one would rarely see a non-zero packet loss on a local network from a direct machine-to-machine ping test.
A fascinating use of ping I once observed was at The Moving Picture Company (MPC) [2]. The admin had written a script which made every host on the network send a ping test to every other host. The results were displayed as a table with host names shown down the left hand side as well as along the top. By looking for horizontal or diagonal lines of unusually large ping times, the admin could immediately see if there was a problem with a single host, or with a larger part of the network. Because of the need for a high system availability rate, the script allows the admin to spot problems almost as soon as they occur, eg. by running the script once every ten seconds.
When the admin showed me the script in use, one column had rather high ping times (around 20ms). Logging into the host with rlogin, ps showed everything was ok: a complex process was merely consuming alot of CPU time, giving a slower network response.
System Status and User Status.
The rup command offers an immediate overview of current system states, eg.:
yoda # rup yoda.comp.uclan.ac.u up 6 days, 8:25, load average: 0.33, 0.36, 0.35 gate-yoda.comp.uclan up 6 days, 8:25, load average: 0.33, 0.36, 0.35 wolfen.comp.uclan.ac up 11:28, load average: 0.00, 0.00, 0.00 conan.comp.uclan.ac. up 11:28, load average: 0.06, 0.01, 0.00 akira.comp.uclan.ac. up 11:28, load average: 0.01, 0.00, 0.00 nikita.comp.uclan.ac up 11:28, load average: 0.03, 0.00, 0.00 gibson.comp.uclan.ac up 11:28, load average: 0.00, 0.00, 0.00 woo.comp.uclan.ac.uk up 11:28, load average: 0.01, 0.00, 0.00 solo.comp.uclan.ac.u up 11:28, load average: 0.00, 0.00, 0.00 cameron.comp.uclan.a up 11:28, load average: 0.02, 0.00, 0.00 sevrin.comp.uclan.ac up 11:28, load average: 0.69, 0.46, 0.50 ash.comp.uclan.ac.uk up 11:28, load average: 0.00, 0.00, 0.00 ridley.comp.uclan.ac up 11:28, load average: 0.00, 0.00, 0.00 leon.comp.uclan.ac.u up 11:28, load average: 0.00, 0.00, 0.00 warlock.comp.uclan.a up 1:57, load average: 0.08, 0.13, 0.11 milamber.comp.uclan. up 9:52, load average: 0.11, 0.07, 0.00 merlin.comp.uclan.ac up 11:28, load average: 0.01, 0.00, 0.00 indiana.comp.uclan.a up 11:28, load average: 0.00, 0.00, 0.02 stanley.comp.uclan.a up 1:56, load average: 0.00, 0.00, 0.00
Figure 55. The output from rup.
The load averages for a single machine can be ascertained by running 'uptime' on that machine, eg.:
MILAMBER 84# uptime 8:05pm up 10:28, 6 users, load average: 0.07, 0.06, 0.25 MILAMBER 85# rsh yoda uptime 8:05pm up 6 days, 9:02, 2 users, load average: 0.47, 0.49, 0.42
Figure 56. The output from uptime.
The w command displays current system activity, including what each user is doing. The man page says, "The heading line shows the current time of day, how long the system has been up, the number of users logged into the system, and the load averages." For example:
yoda # w 8:10pm up 6 days, 9:07, 2 users, load average: 0.51, 0.50, 0.41 User tty from login@ idle JCPU PCPU what root q0 milamber.comp. 7:02pm 8 w cmprj ftp UNKNOWN@ns5ip. 7:29pm -
Figure 57. The output from w showing current user activity.
With the -W option, w shows the 'from' information on a separate line, allowing one to see the full domain address of ftp connections, etc.:
yoda # w -W 8:11pm up 6 days, 9:08, 2 users, load average: 0.43, 0.48, 0.40 User tty login@ idle JCPU PCPU what root ttyq0 7:02pm 8 w -W milamber.comp.uclan.ac.uk cmprj ftp22918 7:29pm - UNKNOWN@ns5ip.uclan.ac.uk
Figure 58. Obtaining full domain addresses from w with the -W option.
The rusers command broadcasts to all machines on the local network, gathering data about who is logged on and where, eg.:
yoda # rusers yoda.comp.uclan.ac.uk root wolfen.comp.uclan.ac.uk guest guest gate-yoda.comp.uclan.ac.uk root milamber.comp.uclan.ac.uk root root root root mapleson mapleson warlock.comp.uclan.ac.uk sensjv sensjv
Figure 59. The output from rusers, showing who is logged on where.
The multiple entries for certain users indicate that more than one shell is active for that user. As usual, my login data shows I'm doing several things at once.
rusers can be modified with options to:
The most obvious way to check if a service is available for use by users is to try and use the service, eg. ftp or telnet to a test location, run up a Netscape sessions and enter a familiar URL, send an email to a local or remote account, etc. The ps command can be used to make sure the relevant background process is running for a service too, eg. 'nfsd' for the NFS system. However, if a service is experiencing problems, simply attempting to use the service will not reveal what may be wrong.
For example, if one cannot ftp, it could be because of anything from a loose cable connection to some remote server that's gone down.
The ping command is useful for an immediate check of network-related services such as ftp, telnet, WWW, etc. One pings each host in the communication chain to see if the hosts respond. If a host somewhere in the chain does not respond, then that host may be preventing any data from getting through (eg. a remote proxy server is down).
A useful command one can use to aid in such detective work is traceroute. This command sends test packets in a similar way to ping, but it also reports how the test packets reached the target site at each stage of the communication chain, showing response times in milliseconds for each step, eg.:
yoda # traceroute www.cee.hw.ac.uk traceroute to osiris.cee.hw.ac.uk (137.195.52.12), 30 hops max, 40 byte packets 1 193.61.250.33 (193.61.250.33) 6 ms (ttl=30!) 3 ms (ttl=30!) 4 ms (ttl=30!) 2 193.61.250.65 (193.61.250.65) 5 ms (ttl=29!) 5 ms (ttl=29!) 5 ms (ttl=29!) 3 gw-mcc.netnw.net.uk (194.66.24.1) 9 ms (ttl=28!) 8 ms (ttl=28!) 10 ms (ttl=28!) 4 manchester-core.ja.net (146.97.253.133) 12 ms 11 ms 9 ms 5 scot-pop.ja.net (146.97.253.42) 15 ms 13 ms 14 ms 6 146.97.253.34 (146.97.253.34) 20 ms 15 ms 17 ms 7 gw1.hw.eastman.net.uk (194.81.56.110) 20 ms (ttl=248!) 18 ms 14 ms 8 cee-gw.hw.ac.uk (137.195.166.101) 17 ms (ttl=23!) 31 ms (ttl=23!) 18 ms (ttl=23!) 9 osiris.cee.hw.ac.uk (137.195.52.12) 14 ms (ttl=56!) 26 ms (ttl=56!) 30 ms (ttl=56!)
At a lower level, system services often depend on background system processes, or daemons. If these daemons are not running, or have shut down for some reason, then the service will not be available.
On the SGI Indys, one example is the GUI service which handles the use of on-screen icons. The daemon responsible is called objectserver. Older versions of this particular daemon can occasionally shut down if an illegal iconic operation is performed, or if the file manager daemon experiences an error. With no objectserver running, the on-screen icons disappear.
Thus, a typical task might be to periodically check to make sure the objectserver daemon is running on all relevant machines. If it isn't, then the command sequence:
/etc/init.d/cadmin stop /etc/init.d/cadmin start
restarts the objectserver. Once running, the on-screen icons return.
A common cause of objectserver shutting down is when a user's desktop layout configuration files (contained in .desktop- directories) become corrupted in some way, eg. edited by hand in an incorrect manner, or mangled by some other operation (eg. a faulty Java script from a home made web page). One solution is to erase the user's desktop layout configuration directory, then login as the user and create a fresh .desktop- directory.
objectserver is another example of UNIX GUI evolution. In 1996 SGI decided to replace the objectserver system entirely in IRIX 6.3 (and later) with a new service that was much more reliable, less likely to be affected by errors made in other applications, and fully capable of supporting new 'webified' iconic services such as on-screen icons that are direct links to ftp, telnet or WWW sites.
In general, checking the availability of a service requires one to check that the relevant daemons are running, that the appropriate configuration files are in place, accessible and have the correct settings, that the relevant daemon is aware of any changes which may have been made (perhaps the service needs to be stopped and restarted?) and to investigate via online information what may have caused services to fail as and when incidents occur. For every service one can use, the online information explains how to setup, admin and troubleshoot the service. The key is to know where to find that information when it is needed.
A useful source of constantly updated status information is the /var/adm/SYSLOG file. This file is where any important system events are logged. One can configure all the various services and daemons to log different degrees of detailed information in the SYSLOG file. Note: logging too much detail can cause the log file to grow very quickly, in which case one would also have to ensure that it did not consume valuable disk space. The SYSLOG file records user logins, connections via ftp, telnet, etc., messages logged at system bootup/shutdown time, and many other things.
Vendor Information Updates.
Most UNIX vendors send out periodic information booklets containing indepth articles on various system administration issues. SGI's bulletin is called Pipeline. Such information guides are usually supplied as part of a software support contract, though the vendor will often choose to include copies on the company web site. An admin should read any relevant articles from these guides - they can often be unexpectedly enlightening.
System Hardware Failures.
When problems occur on a system, what might at first appear to be a software problem may in fact be a hardware fault. Has a disk failed? The fx program can be used to check disk status (block read tests, disk label checks, etc.)
Has a network cable failed? Are all the cable connections firmly in place in the hub? Has a plug come loose?
In late 1998, the Ve24 network stopped operating quite unexpectedly one morning. The errors made it appear that there was a problem with the NFS service or perhaps the main user files disk connected to the server; in fact, the fault lay with the Ve24 hub.
The online guides have a great deal of advice on how to spot possible hardware failures. My advice is to check basic things first and move onto the more complex possibilities later. In the above example, I wasted a great deal of time investigating whether the NFS service was responsible, or the external user files disk, when in fact I should have checked the hub connections first. As it happens, the loose connection was such that the hub indicator light was on even though the connection was not fully working (thus, a visual check would not have revealed the problem) - perhaps the fault was caused by a single loose wire out of the 8 running through the cable, or even an internal fault in the hub (more likely). Either way, the hub was eventually replaced.
Other things that can go wrong include memory faults. Most memory errors are not hardware errors though, eg. applications with bugs can cause errors by trying to access some protected area of memory.
Hardware memory errors will show up in the system log file /var/adm/SYSLOG as messages saying something like 'Hardware ECC Memory Error in SIMM slot 4'. By swapping the memory SIMMs around between the slots, one can identify which SIMM is definitely at fault (assuming there is only one causing the problem).
The most common hardware component to go wrong on a system, even a non-PC system, is the disk drive. When configuring systems, or carrying out upgrades/expansions, it is wise to stick with models recommended by the source vendor concerned, eg. SGI always uses high-quality Seagate, IBM or Quantum disk drives for their systems; thus, using (for example) a Seagate drive is a good way to ensure a high degree of reliability and compatibility with the system concerned.
Sometimes an admin can be the cause of the problem. For example, when swapping disks around or performing disk tasks such as disk cloning, it is possible to incorrectly set the SCSI ID of the disk. SGI systems expect the system disk to be on SCSI ID 1 (though this is a configurable setting); if the internal disk is on the wrong SCSI ID, then under certain circumstances it can appear to the system as if there are multiple disks present, one on each possible ID. If hardware errors are observed on bootup (the system diagnostics checks), then the first thing to do is to reboot and enter the low-level 'Command Monitor' (an equivalent access method will exist for all UNIX systems): the Command Monitor has a small set of commands available, some of which can be used to perform system status checks, eg. the hinv command. For the problem described above, hinv would show multiple instances of the same disk on all SCSI IDs from 1 to 7 - the solution is to power down and check the SCSI jumpers carefully.
Other problems can occasionally be internal, eg. a buildup of dust blocking air vents (leading to overheating), or a fan failure, followed by overheating and eventually an automatic system shutdown (most UNIX systems' power supplies include circuitry to monitor system temperature, automatically shutting down if the system gets too hot). This leads on to questions of system maintenance which will be dealt with on Day 3.
After disk failures, the other most common failure is the power supply. It can sometimes be difficult to spot because a failure overnight or when one isn't around can mean the system shuts down, cools off and is thus rebootable again the next morning. All the admin sees is a system that's off for no readily apparent reason the next morning. The solution is to, for example, move the system somewhere close at hand so that it can be monitored, or write a script which tests whether the system is active every few seconds, logging the time of each successful test - if the system goes down, the admin is notified in some way (eg. audio sound file played) and the admin can then quickly check the machine - if the power supply area feels overly hot, then that is the likely suspect, especially if an off/on mains switch toggle doesn't turn the system back on (power supplies often have circuitry which will not allow power-on if the unit is still too hot). If the admin wasn't available at the time, then the logged results can show when the system failed.
All SGIs (and UNIX systems in general) include a suite of hardware and software diagnostics tests as part of the OS. IRIX contains a set of tests for checking the mouse, keyboard, monitor, audio ports, digital camera and other basic hardware features.
Thankfully, for just about any hardware failure, hardware support contracts cover repairs and/or replacements very effectively for UNIX systems. It's worth noting that although the Computing Department has a 5-day support contract with SGI, all problems I've encountered so far have been dealt either on the same day or early next morning by a visiting support engineer (ie. they arrived much earlier than they legally had to). Since November 1995 when I took charge of the Ve24 network, the hardware problems I've encountered have been:
It is likely that dirty air (dust, humidity, corrosive gases) was largely responsible for the disk, power supply and memory failures - perhaps some of the others too. A build up of dust can combine with airborne moisture to produce corrosive chemicals which can short-circuit delicate components.
To put the above list another way: 14 out of the 18 Indys have been running non-stop for 3.5 years without a single hardware failure of any kind, despite being housed in an area without filtered air or temperature control. This is very impressive and is quite typical of UNIX hardware platforms.
Installing systems with proper air filtering and temperature control can be costly, but the benefit may be a much reduced chance of hardware failure - this could be important for sites with many more systems and a greater level of overall system usage (eg. 9 to 5 for most machines).
Some companies go to great lengths to minimise the possibility of hardware failure. For example, MPC [2] has an Origin200 render farm for rendering movie animation frames. The render farm consists of 50 Origin200 servers, each with 2 R10000 CPUs, ie. 100 CPUs in total. The system is housed in a dedicated room with properly filtered air and temperature control. Almost certainly as a result of this high-quality setup, MPC has never had a single hardware failure of any kind in nearly 3 years of operation. Further, MPC has not experienced a single OS failure over the same period either, even though the system operates 24hours/day.
This kind of setup is common amongst companies which have time-critical tasks to perform, eg. oil companies with computational models that can take six months to complete - such organisations cannot afford to have failures (the problem would likely have to be restarted from scratch, or at least delayed), so it's worth spending money on air filters, etc.
If one does not have filtered air, then the very least one should do is keep the systems clean inside and out, performing system cleaning on a regular basis.
At present, my current policy is to thoroughly clean the Indys twice a year: every machine is stripped right down to the bare chassis; every component is individually cleaned with appropriate cleaning solutions, cloths, air-dusters, etc. (this includes removing every single key from all the keyboards and mass-cleaning them with a bucket of hot water and detergent! And cleaning the keyboard bases inside and out too). Aside from these major bi-annual cleanings, simple regular cleaning is performed on a weekly or monthly basis: removing dirt from the mice (inside especially), screen, chassis/monitor surface, cables and so on; cleaning the desks; opening each system and blowing away internal dust using a can of compressed filtered air, etc.
Without a doubt, this process greatly lengthens the life-span of the systems' hardware components, and users benefit too from a cleaner working environment - many new students each autumn often think the machines must be new because they look so clean.
Hardware failures do and will occur on any system whether it's a UNIX platform or not. An admin can use information from online sources, combined with a knowledge of relevant system test tools such as fx and ping, to determine the nature of hardware failures and take corrective action (contacting vendor support if necessary); such a strategy may include setting up automated hardware tests using regularly-executed scripts.
Another obvious source of extensive information about any UNIX platform is the Internet. Hundreds of existing users, including company employees, write web pages [3] or USENET posts describing their admin experiences and how to deal with typical problems.
Suspicious/Illegal Activity.
Users inevitably get up to mischief on occasion, or external agencies may attempt to hack the system. Types of activity could include:
As an admin who is given the task of supposedly preventing and/or detecting illegal activities, the first thing which comes to mind is the use of various file-searching methods to locate suspect material, eg. searching every user's netscape bookmarks file for particular keywords. However, this approach can pose legal problems.
Some countries have data protection and/or privacy laws [4] which may prohibit one from arbitrarily searching users' files. Searches of this type are the equivalent of a police force tapping all the phones in an entire street and recording every single conversation just on the off-chance that they might record something interesting; such methods are sometimes referred to as 'fishing' and could be against the law. So, for example, the following command might be illegal:
find /home/students -name "*" -print > list grep sex list > suspected grep warez list >> suspected grep xxx list >> suspected grep pics list >> suspected grep mpg list >> suspected grep jpg list >> suspected grep gif list >> suspected grep sites list >> suspected
As a means of finding possible policy violations, the above script would be very effective, but it's definitely a form of fishing (even the very first line).
Now consider the following:
find /home/students -name "bookmarks.html" -print -exec grep playboy {} \;
This command will effectively locate any Netscape bookmarks file which contains a possible link to the PlayBoy web site. Such a command is clearly looking for fairly specific content in a very specific file in each user's .netscape directory; further, it is probably accessing a user's account space without her or his permission (this opens the debate on whether 'root' even needs a user's permission since root actually owns all files anyway - more on this below).
The whole topic of computer file access is a grey area. For example, might the following command also be illegal?
find . -name "*.jpg" -print > results && grep sex results
A user's lawyer could argue that it's clearly looking for any JPEG image file that is likely to be of an explicit nature. On the other hand, an admin's lawyer could claim the search was actually looking for any images relating to tourism in Sussex county, or musical sextets, or adverts for local unisex hair salons, and just accidentally happened to be in a directory above /home/students when the command was executed (the find would eventually reach /home/students). Obviously a setting for a messy court-room battle.
But even ignoring actions taken by an admin using commands like find, what about data backups? An extremely common practice on any kind of computer system is to backup user data to a media such as DAT on a regular basis - but isn't this accessing user files without permission? But hang on, on UNIX systems, the root user is effectively the absolute owner of any file, eg. suppose a file called 'database' in /tmp, owned by an ordinary user, contained some confidential data; if the the admin (logged in as root) then did this:
cat /tmp/database
the contents of the database file would indeed be displayed.
Thus, since root basically owns all files anyway by default, surely a backup procedure is just the root user archiving files it already owns? If so, does one instead have to create some abstract concept of ownership in order to offer users a concrete concept of what data privacy actually is? Who decides? Nations which run their legal systems using case-law will find these issues very difficult to clarify, eg. the UK's Data Protection Act is known to be 'weak'.
Until such arguments are settled and better laws created, it is best for an admin to err on the side of caution. For example, if an admin wishes to have some kind of regular search conducted, the existence of the search should be included as part of stated company policy, and enshrined into any legal documents which users must sign before they begin using the system, ie. if a user signs the policy document, then the user has agreed to the actions described in that document. Even then, such clauses may not be legally binding. An admin could also setup some form of login script which would require users to agree to a sytsem usage policy before they were fully logged-in.
However, these problems won't go away, partly because of the specifics of how some modern Internet services such as the web are implemented. For example, a user could access a site which automatically forces the pop-up of a Netscape window which is directed to access a prohibited site; inline images from the new site will then be present in the user's Netscape cache directory in their home account area even though they haven't specifically tried to download anything. Are they legally liable? Do such files even count as personal data? And if the site has its own proxy server, then the images will also be in the server's proxy cache - are those responsible for the server also liable? Nobody knows. Legal arguments on the nature of cache directories and other file system details have not yet been resolved. Clearly, there is a limit to how far one can go in terms of prevention simply because of the way computing technologies work.
Thus, the best thing to do is to focus efforts on information that does not reside inside user accounts. The most obvious place is the system log file, /var/adm/SYSLOG. This file will show all the ftp and telnet sites which users have been accessing; if one of these sites is a prohibited place, then that is sufficient evidence to take action.
The next most useful data resource to keep an eye on is the web server log(s). The web logs record every single access by all users to the WWW. Users have their own record of their accesses in the form of a history file, hidden in their home directory inside the .netscape directory (or other browser); but the web logs are outside their accounts and so can be probably be freely examined, searched, processed, etc. by an admin without having to worry about legal issues. Even here though, there may be legal issues, eg. log data often includes user IDs which can be used to identify specific individuals and their actions - does a user have a legal right to have such data kept private? Only a professional lawyer in the field would know the correct answer.
Note: the amount of detail placed into a log file can be changed to suit the type of logging required. If a service offers different levels of logging, then the appropriate online documentation will explain how to alter the settings.
Blocking Sites.
If an admin does not want users to be able to access a particular site, then that site can be added to a list of 'blocked' sites by using the appropriate option in the web server software concerned, eg. Netscape Enterprise Server, CERN web server, Apache web server, etc. Even this may pose legal problems if a country has any form of freedom-of-speech legislation though (non-existent in the UK at present, so blocking sites should be legally OK in the UK).
However, blocking sites can become somewhat cumbersome because there are thousands of web sites which an admin could theoretically have to deal with - once the list becomes quite large, web server performance decreases as every access has to have its target URL checked against the list of banned sites. So, if an admin does choose to use such a policy, it is best to only add sites when necessary, and to construct some kind of checking system so that if no attempt is made to access a blocked site after a duration of, say, two weeks (whatever), then that site is removed from the list of blocked sites. In the long term, such a policy should help to keep the list to a reasonably manageable size. Even so, just the act of checking the web logs and adding sites to the list could become a costly time-consuming process (time = money = wages).
One can also use packet filtering systems such as hardware routers or software daemons like ipfilterd which can accept, reject, or reject-and-log incoming packets based on source/destination IP address, host name, network interface, port number, or any combination of these. Note that daemons such as ipfilterd may require the presence of a fast CPU if the overhead from a busy site is to be properly supported. The ipfilterd system is discussed in detail on Day 3.
System Temporary Directories.
An admin should keep a regular eye on the contents of temporary directories on all systems, ie. /tmp and /var/tmp. Users may download material and leave the material lying around for anyone to see. Thus, a suspicious file can theoretically be traced to its owner via the user ID and group ID of the file. I say theoretically because, as explained elsewhere, it is possible for a user X to download a file (eg. by ftp so as to avoid the web logs, or by telnet using a shell on a remote system) and then 'hand over' ownership of the file to someone else (say user Y) using the chgrp and chown commands, making it look as though a different user is responsible for the file. In that sense, files found outside a user's home directory could not normally be used as evidence, though they would at least alert the admin to the fact that suspect activities may be occurring, permitting a refocusing of monitoring efforts, etc.
However, one way in which it could be possible to reinforce such evidence is by being able to show that user Y was not logged onto the system at the time when the file in question was created (this information can be gleaned from the system's own local /var/adm/SYSLOG file, and the file's creation time and date).
Unfortunately, both users could have been logged onto the same system at the time of the file's creation. Thus, though a possibility, the extra information may not help. Except in one case: video evidence. If one can show by security camera recordings that user X did indeed login 'on console' (ie. at the actual physical keyboard) then that can be tied in with SYSLOG data plus file creation times, irrespective of what user Y was doing at the time.
Certainly, if someone wished to frame a user, it would not be difficult to cause a considerable amount of trouble for that user with just a little thought on how to access files, where to put them, changing ownership, etc.
In reality, many admins probably just do what they like in terms of searching for files, examining users' areas, etc. This is because there is no way to prove someone has attempted to search a particular part of the file system - UNIX doesn't keep any permanent record of executed commands.
Ironically, the IRIX GUI environment does keep a record of any file-related actions taken with the GUI system (icons, file manager windows, directory views, etc.) but the log file with this information is kept inside the user's .desktop- directory and thus may be legally out of bounds.
File Access Permissions.
Recall the concept of file access permissions for files. If a user has a directory or file with its permissions set so that another ordinary user can read it (ie. not just root, who can access anything by default anyway), does the fact that the file is globally readable mean the user has by default given permission for anyone else to read the file? If one says no, then that would mean it is illegal to read any user's own public_html web area! If one says yes, and a legal body confirmed this for the admin, then that would at least enable the admin to examine any directory or file that had the groups and others permissions set to a minimum of read-only (read and executable for directories).
The find command has an option called -perm which allows one to search for files with permissions matching a given mode. If nothing else, such an ability would catch out careless users since most users are not aware that their account has hidden directories such as .netscape. An admin ought to make users aware of security issues beforehand though.
Backup Media.
Can an admin search the data residing on backup media? (DAT, CDR, ZIP, DLT, etc.) After all, the data is no longer inside the normal home account area. In my opinion yes, since root owns all files anyway (though I've never done such a search), but others might disagree. For that matter, consider the tar commands commonly used to perform backups: a full backup accesses every file on the file system by default (ie. including all users' files, whatever the permissions may be), so are backups a problem area?
Yet again, one can easily see how legal grey areas emerge concerning the use of computing technologies.
Conclusion.
Until the law is made clearer and brought up-to-date (unlikely) the best an admin can do is consult any available internal legal team, deciding policy based on any advice given.
References:
http://support.sgi.com/surfzone/content/pipeline/html/19980404EthernetCollisions.html
The SGI Tech/Advice Centre: Holland #1: http://www.futuretech.vuurwerk.nl/ Worldwide Mirror Sites: Holland #2: http://sgi.webguide.nl/ Holland #3: http://sgi.webcity.nl/ USA: http://sgi.cartsys.net/ Canada: http://sgi-tech.unixology.com/
(Have you found this page via a web search? Examine the index pages to see all the available items!) [Future Technology Research Index] [SGI Tech/Advice Index] [Nintendo64 Tech Info Index]