In this paper we present a methodology for handling software that can be used to outsource intranet services. We call this software 'outsourceware'. We present our general outsourceware philosophy and show how it is used in our intranet server implementation.
Within Origin, Technical Infrastructure Services (TIS) is the part of the company doing day-to-day operations of all services, ranging from Network Services (TCP/IP and SNA) to Mainframe Processing Services. Within TIS, the Intranet Services (INS) division (currently approximately 100 people) delivers Internet and Intranet related services on a world-wide scale: SMTP, DNS, mail gateways, IP/SNA gateways, FTP/WWW hosting, Usenet News, firewall/security services, dialin services, and various types of proxy services. Our largest customer is still Philips (for historical reasons of course) but we have started providing Intranet services for other customers as well.
The servers in use at that time were all located in Eindhoven and built as prototypes without a global roll-out and without customers other than Philips in mind; some of these servers were legacy systems inherited by Origin. We needed to standardize and redesign our services in order to accommodate our customers' generic demands and to realize the scalability needed.
From a customer point of view we need services that:
With these strong requirements in mind we started designing our own implementation, which started under the initial project name 'DIBBS': Distributed Intranet Black Box Server. The reason for calling it a black box was that the customer should consider it a black box, although it should be a crystal clear box to ourselves. Later on DIBBS was changed to the commercial name IntraConnect.
In each of the following 5 chapters we first present general outsourceware remarks and requirements, and then explain the IntraConnect implementation.
The second method can deal quickly with changes, but experience shows that services will diverge at all service locations because of different administration methods.
A third way of dealing with global services is to implement what we call 'split administration'. In this scheme we keep local things local and central things central. Because the central things are normally done from a remote location we started using the term remote administration for this and we will use this term in the remainder of this paper.
Fig. 1, Local administrator web start page. Users can find help and personal statistics here too and can change their password. This image is from our Asia Pacific (ASP) IntraConnect.
Additionally, the local administrator performs those tasks that require physical access to the server. This includes replacing hardware in case of failures, and tape management for backup purposes.
The remote administrators (normally a group of people) do the remote service management on a global scale. The idea is to automate the remote administration tasks as far as possible, and to store information just once. Standard tasks include enabling and disabling a service for a specific server, reconfiguring a service, upgrading the service software, etc.
IntraConnect implements these tasks such that they do not require intervention from a remote administrator on the server itself, as they are automated via a central distribution server. If needed, however, these tasks can be performed via an extended version of the GUI that the local administrator uses, or by logging on to the server. Unlike local administrators, remote administrators can log on to the servers, for example to troubleshoot difficult problems which cannot be diagnosed (or not diagnosed fast enough) via the web-based GUI.
Currently, the group of remote administrators consists more or less of the group of people doing the IntraConnect development. This will certainly change when more and more servers are rolled out.
The API must contain certain generic functions like 'start', 'stop', 'check', 'reconfigure', 'install', 'backout' and 'remove'. These must be implemented for each module, possibly as a 'stub' operation. The software packages' own methods of starting, etc. should no longer be used directly.
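To illustrate the contract, a stub implementation in a per-module /bin/sh script could look like the following minimal sketch (the operation names are the API functions above; the bodies are placeholders only):

    #!/bin/sh
    # Illustrative stub of the generic module API; a real module fills in
    # the operations that make sense for it and leaves the rest as stubs.
    case "$1" in
        start)        echo "starting service"        ;;   # e.g. launch the daemon
        stop)         echo "stopping service"        ;;
        check)        echo "checking service health" ;;
        reconfigure)  echo "regenerating configuration files" ;;
        install|backout|remove)
                      : ;;                                 # stub operations for this module
        *)            echo "usage: $0 {start|stop|check|reconfigure|install|backout|remove}" >&2
                      exit 1 ;;
    esac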
Not all modules will be on the same level as some will have a general or support function for the other modules. This will result in some kind of module hierarchy.
The main advantages of modularization are:
A disadvantage of this approach is that we moved all software packages away from their usual file locations on a UNIX system. In return we get a service-oriented directory structure; underneath that structure, the standard UNIX directory layout is preserved. We found that people new to our project understood this setup quite quickly.
The real configuration files that the service software uses are generated from three sources:
The binary module contains all binaries, sh/perl scripts, CGI scripts, online documentation (HTML) and those parts of the configuration files which can or must be the same for all servers irrespective of location and customer. Thus, the binary module is the same for all servers (given one OS). A typical binary module is a 100 KB gzipped tar file.
The configuration module contains all parts of the centrally managed configuration specific to one server machine. Examples are IP address, netmask, DNS name servers, etc. A typical configuration module is a 1 KB gzipped tar file.
Modules are stored as gzip compressed tar files. To uniquely identify a module we have added a version identification in YYYYMMDDxx format to the module names. Examples:
    general-b.1998071400.tar.gz   (binary module)
    general-c.1998071003.tar.gz   (configuration module)

This naming convention with version numbers is used by the automated distribution and upgrade mechanism to decide whether a new release of a module is available. This will be discussed later.
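Because the version identifications have a fixed width, deciding whether a newer release exists can be as simple as comparing the two strings numerically. A minimal sketch (variable names are illustrative, not taken from the actual upgrade scripts):

    # Sketch: compare the installed version with the newest one offered by
    # the distribution server; fixed-width YYYYMMDDxx strings compare the
    # same way numerically and lexically.
    INSTALLED=1998071400     # from the local VERSION file (example value)
    AVAILABLE=1998071501     # newest general-b.*.tar.gz on the server (example value)

    if [ "$AVAILABLE" -gt "$INSTALLED" ]; then
        echo "newer module general-b.$AVAILABLE.tar.gz available; scheduling upgrade"
    fi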
The operating system module is a stripped-down version of the OS containing only those programs that are really necessary. In the case of IntraConnect on BSD/OS the OS has been stripped down to only 6.2 Mbytes (gzipped). The reason for using a stripped-down version is that it makes automated OS installs over small-bandwidth network connections possible; furthermore, it is easier to make secure. The OS module consists of a binary part only; all configuration files in /etc are generated by the general module, as described later.

Fig. 2, Module hierarchy.

Upgrading to a new OS release involves upgrading the development server and generating a new OS module, of course taking care that binaries from other modules built for the older OS release can still be run by installing backward-compatible shared libraries. When a major OS release is involved, extra actions may be necessary in the OS installation scripts.
The most important configuration file is the master.root file. It contains all generic parameters like IP addresses, netmasks, extra DNS servers, trusted hosts for the web-based GUI and telnet/ssh, etc. It contains variable/value pairs and may only be read via two include files, one for /bin/sh scripts and one for perl scripts. These include files can also be used for all kinds of backward-compatibility tricks; this will be shown in the section about multi-machine IntraConnect later on.
Another important aspect of the master.root file is that items like IP numbers are stored only once. Other configuration files needing these parameters must retrieve them from master.root via scripts. In the past, moving a machine to another IP address caused lots of trouble because of the large number of different configuration files containing one and the same IP address.
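A minimal sketch of this mechanism (file locations, the include-file name and all variable names except DIBBS_IPADDRESS are assumed):

    # Contents of master.root: plain variable/value pairs, stored exactly once.
    DIBBS_IPADDRESS=192.168.2.10
    DIBBS_NETMASK=255.255.255.0
    DIBBS_DNSSERVERS="192.168.2.1 192.168.3.1"

    # A /bin/sh script never reads master.root directly; it sources the sh
    # include file (hypothetical path), which in turn reads master.root.
    . /usr/dibbs/general/lib/master.sh
    echo "configuring interface: $DIBBS_IPADDRESS netmask $DIBBS_NETMASK"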
Fig. 3, Administrator example of the services overview. The three frames shown come from the general module.
Fig. 4, Flexible group definition example. Note the various 'Hosts' entries. The Services list is automatically expanded when new services become available.
The admin module on IntraConnect deals with fine-grained access control, up to services per individual (see Fig. 5 and Fig. 6). The reason for this fine-grained access control is that not every company allows all of its employees to access all public Internet services. The database of the admin module is used by other services to generate access files for those particular services.
Fig. 5, Flexible user service permissions example.
Fig. 6, Flexible group service permissions example.
The admin module also provides detailed accounting and reporting information which is very important for an outsourcing company like Origin. In the case of Philips, Origin deals with lots of different product divisions and business units, each of which gets its own bill based on, for example, megabytes transferred.
The ntp service is implemented using the xntpd distribution. The remote configuration consists only of a list of servers and peers. Fig. 7 shows an example of the NTP service status frame.
Fig. 7, NTP daemon server status example.
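As an illustrative sketch, the generated ntp.conf could be produced from such a list roughly as follows (the config/servers and config/peers file names are assumptions):

    # Sketch: generate ntp.conf from the remotely managed server/peer lists,
    # run from the ntp module's directory.
    {
        echo "driftfile /etc/ntp.drift"
        while read -r host; do echo "server $host"; done < config/servers
        while read -r host; do echo "peer $host";   done < config/peers
    } > etc/ntp.conf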
The dns service is implemented using the bind distribution (version 8). The configuration part consists of two files, primaries and secondaries, in very simple formats, and DNS zone files (without the SOA part) in the case of primary zones. From these files a named.conf file and complete zone files, each with a correct SOA record containing an automatically maintained serial number, are generated.
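The serial number could, for instance, follow the common YYYYMMDDnn convention; the fragment below is only a sketch of such logic (the '; serial' marker and the zone file path are assumptions):

    # Sketch: pick the next SOA serial for a regenerated zone file.
    zone=/usr/dibbs/dns/etc/db.example.com          # illustrative path
    today=$(date +%Y%m%d)
    old=$(awk '/; serial/ {print $1}' "$zone" 2>/dev/null)
    case "$old" in
        ${today}*) new=$(( old + 1 )) ;;            # regenerated again today: bump nn
        *)         new=${today}00     ;;            # first generation of the day
    esac
    echo "new SOA serial: $new"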
The smtp service is currently implemented using sendmail 8.8.8. A script generates an M4 file from master.root information and some additional configuration files. Then, using the normal sendmail M4 tools, the real sendmail.cf is generated. If users are defined in the admin module, the script will generate the appropriate aliases, virtusertable and userdb files and their .db versions. The smtp service can handle multiple mail domains per server machine.
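As an illustration of the M4 step only (the feature selection and paths are examples of the standard sendmail procedure, not the actual generated file):

    # Sketch: write a minimal .mc file and build sendmail.cf with the
    # standard sendmail M4 tools.
    CFDIR=/usr/src/sendmail-8.8.8/cf                      # illustrative path
    printf '%s\n' \
        'OSTYPE(bsd4.4)dnl' \
        'FEATURE(virtusertable)dnl' \
        'MAILER(local)dnl' \
        'MAILER(smtp)dnl' > intraconnect.mc
    m4 "$CFDIR/m4/cf.m4" intraconnect.mc > sendmail.cf
    makemap hash virtusertable < virtusertable            # build the .db version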
All these proxies have been socksified to work with the socks-based firewalls managed by Origin. At Origin we use separate machines for network perimeter security and for proxy servers, to keep the bastion hosts of the firewall as simple as possible: no users, no accounting, etc.
Access files are generated using the user/group database from the admin module. Because the admin module allows almost every possible DNS domain and network slice notation, some entries like 192.168.2.32/27 get expanded into multiple IP addresses for the telnet, ftp and nntp proxies, which do not support these notations.
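A sketch of such an expansion, limited to prefixes of /24 and longer (illustrative only, not the actual admin module code):

    # Sketch: expand network/prefix notation into individual addresses for
    # proxies that only accept plain IP addresses.
    expand_cidr() {
        net=${1%/*}; bits=${1#*/}
        prefix=${net%.*}                    # first three octets, e.g. 192.168.2
        first=${net##*.}                    # last octet of the network, e.g. 32
        count=$(( 1 << (32 - bits) ))       # number of addresses, e.g. 32 for a /27
        i=0
        while [ "$i" -lt "$count" ]; do
            echo "$prefix.$(( first + i ))"
            i=$(( i + 1 ))
        done
    }
    expand_cidr 192.168.2.32/27             # prints 192.168.2.32 up to 192.168.2.63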
During development we move scripts or routines from a normal service module to the general module as soon as that script or routine is necessary in other modules too. This way a lot of duplicate coding is avoided when adding new services. This also matches our 'store everything once' approach for configuration items.
Software packages are not installed in the usual locations like /usr/local/bin/, /usr/local/etc/, etc., but each module has its own directory tree under /usr/dibbs/, for example /usr/dibbs/general/. Below such a tree we have the standard bin/, etc/ and lib/ directories; furthermore we added config/, htdocs/ and cgi-bin/. The reasons to do this are easy upgrades (just move a directory and unpack a new module) and backouts of upgrades (move the old directory back). It also avoids the need for a detailed file list per module and makes it possible to have scripts with the same names for each module.
For a binary module we defined some entry points; the most important ones are listed below. Filenames are relative to the root of the module, for example /usr/dibbs/general/.

VERSION (required for every module)

bin/installer (required)
A new module is unpacked in /tmp, creating a /tmp/general directory. Then 'cd /tmp/general; bin/installer install' is run. This will stop the currently running service, create a backout copy of the currently installed module, install the new module in the appropriate directory, optionally import local configuration files and finally reconfigure and start itself. All these actions are done in a non-interactive fashion so that they can be done from cron.
Although all services have their own installer script, most of the installer routines are very generic and are defined in a library in the general module which is used by the installer scripts.
bin/service (required)
A simple way to implement the reconfigure part of the service script is to call the stop routine, regenerate all configuration files and call the start routine. However, for many of the services the reconfigure is done by just regenerating the configuration files and giving the service daemon a signal (normally HUP) to make it reread the configuration files. This leads to less, or no, service interruption.
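A sketch of this second approach (the helper names and pid file are illustrative):

    # Reconfigure without a full restart: regenerate the configuration files
    # and ask the running daemon to reread them via SIGHUP.
    reconfigure() {
        generate_config_files                       # hypothetical helper: rebuild files
                                                    # from master.root plus remote/local config
        if [ -f /var/run/named.pid ]; then
            kill -HUP "$(cat /var/run/named.pid)"   # daemon rereads its files, no restart
        else
            start                                   # daemon not running: fall back to a start
        fi
    }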
The service script of the general module has two extra options, startall and stopall to start and stop all other services.
bin/{hourly,daily,weekly,monthly} (optional)

bin/rotate (optional)

cgi-bin/menu.cgi (optional)

config/filter.conf
A first-time installation procedure should be a special upgrade procedure: 'upgrade from nothing'.
For the local administrator, a service backout option (restore the old working situation) should be available as a quick fix for any possible undesirable effects experienced by end-users after an automated change which is solely under remote control.
Upgrades must only be performed in the customers' change window, but in case of problems the local and global service delivery organizations must be informed so they can take the actions needed to solve the problem. A normal change should cause only a few minutes of service interruption.
Platforms are identified by the output of GNU's config.guess script (for example, i386-pc-bsdi3.0). In this way a distribution server can support multiple platforms.
As proof of concept, the distribution server has been implemented as a normal FTP server. Minimal security is implemented through a tcp-wrapper. This will be improved and extended in the future as we require more security and a push variant for non-intranet servers; normally the servers can pull the modules from the distribution server, but for servers outside the firewall this might not be allowed for security reasons.
During an automated upgrade, the OS module performs all actions on the other boot disk. This provides an easy backout possibility; the advantage of being able to back out an upgrade greatly outweighs the cost of an extra disk. At the end of the installation of the new OS on the new disk, /usr/dibbs/ and /usr/config/ are copied to the new boot disk and a fresh general-{b,c} module is saved on the new disk. Only at the end of the OS upgrade are all services stopped by calling 'service stopall', and after changing either the server's boot prom (Suns) or the boot.default configuration file (BSD/OS) to configure the new boot disk, the system is rebooted.

When the system comes up from the new boot disk it detects the first boot after an upgrade, installs general-b and general-c (which configure everything needed in /etc, such as its IP numbers and hostname), and boots further, starting every service with 'service startall'.

The /var, /news and /cache partitions are located on separate data disks and thus need not be copied during an upgrade. The whole OS upgrade can cause up to 4 minutes of service disruption; this is highly dependent on the amount of RAM in the machine, which has to be counted during boot (the HP Netservers we use have slow memory tests).
The service check should be more than just verifying that a given daemon process exists. Proper functioning of a service should also be checked. Useful methods include protocol level questions with known answers, or the typical welcome-to-this-protocol messages.
When the server discovers that something is not functioning properly, it should notify a central monitoring location. Failure in the self-checking functionality of the server will result in loss of the notification. Therefore, the server should send so-called heartbeats to the central monitoring location using the same mechanism. If the self-checking mechanism fails, the loss of heartbeats will be detected by the central monitoring location.
Heartbeat messages are generated from the general module every 15 minutes and are logged via syslog. The heartbeat message contains the current UTC date and time in ISO format.
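A minimal sketch of such a heartbeat (the syslog tag and the exact cron line are assumptions; the 15-minute interval and ISO UTC timestamp are as described above):

    # Run from the general module's cron every 15 minutes, e.g.:
    #   0,15,30,45 * * * *   /usr/dibbs/general/bin/heartbeat
    logger -t dibbs-heartbeat "HEARTBEAT $(date -u '+%Y-%m-%dT%H:%M:%SZ')"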
The syslog file (/var/log/messages) is tracked in real time (by a program called follow) and filtered based on regular expressions (by a program called filter) into service-specific log files in /var/log/<modulename>/. Every service has at least a 'messages' log file for just informative, non-priority messages, an 'acct' log file for accounting data, and 'daily', 'low', 'high' and 'alert' log files for messages of different priorities. Any message not filtered explicitly by a regular expression is directed to the high priority log file. This forces us to classify any unknown messages into an appropriate category (daily/low/high/alert). The contents of these log files are sent to the central monitoring station every 1440 (1 day), 60, 15 and 1 minutes (if there is new data in them, of course).
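Conceptually, the follow/filter combination behaves like the sketch below; this plain-sh approximation is illustrative only (the real programs and the filter.conf syntax differ), but it shows how anything unmatched ends up in a high priority file:

    # Approximate behaviour of follow + filter: tail the syslog file and sort
    # each line into a per-service priority file; patterns are illustrative.
    tail -f /var/log/messages | while read -r line; do
        case "$line" in
            *named*"zone transfer"*)  echo "$line" >> /var/log/dns/low ;;
            *named*)                  echo "$line" >> /var/log/dns/messages ;;
            *sendmail*stat=Sent*)     echo "$line" >> /var/log/smtp/acct ;;
            *)                        echo "$line" >> /var/log/general/high ;;
        esac
    done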
The heartbeat messages follow exactly the same path through the system as the output of the check scripts and the logged events from the services: start from cron, log via syslog, then filtered via follow and filter and put into specific log files and then processed by a script run from cron. We can thus be sure that all critical system processes are operational when we receive heartbeats. If a heartbeat is not received within 30 minutes, the central monitoring station generates an alert message itself. Absence of the heartbeat means that either the self-check mechanism is not working, or that there is a network outage between the IntraConnect server and the central monitoring location.
All heartbeats and logging messages are sent using SMTP to, preferably, the IP address of the central monitoring station. We use IP addresses to prevent dependency on other SMTP hubs. All we need is a properly functioning network between server and monitoring station. We did not want to rely on UDP-based services like SNMP traps because of the unreliability of UDP and because of systems outside proxy-based firewalls. Despite the complexity of two SMTP setups, this system has proven to be stable and reliable during the four years that we have used it.
Our experiences show that in general we, as a service provider, can react quickly to service problems and that we are sometimes even able to solve problems before the customer has noticed that there was a problem. Furthermore, the concept can also be used to generate lower priority messages whenever a service interruption is likely to occur in the near future. As an example, we generate messages whenever DNS zone transfers are failing. In itself this is non-critical, but if the problem persists it might lead to a dysfunctional DNS service.
Reporting includes all usage statistics not needed for accounting. Reporting needs to be much more detailed per billing unit than accounting. It is sometimes used solely for trend analysis. Most software is capable of generating (huge) logfiles, but this is just the first phase.
Because our IntraConnect solution should be usable for multiple customers who may have different kinds of accounting and reporting requirements, we had to implement accounting and reporting in a very generic way which fits the majority of the requests we get.
This very generic approach had implications for the various software packages we use because some of them did not log the necessary details and needed to be modified. Furthermore, the accounting file formats have to be open in order to combine them with those of other modules and to process them automatically.
Each night at 00:00 local time the logfiles are rotated and archived in /var/archive/<modulename>/. This is done by the rotate script. The reason for rotating at 00:00 is to keep logfiles per calendar day. Then we gather the various accounting logfiles, preprocess and send them to the central accounting servers every day, where they are processed further. This way we achieve an enormous data reduction and only send the necessary accounting information over the expensive, and often slow, WAN links.
IntraConnect also generates local statistics from the accounting logfiles for group and user accounts, and usage overviews for all services per group and user. This is done by matching the information in the logfiles with the information in the user/group database. The statistics are made available via the web-based GUI, an example can be seen in Fig. 8.
Fig. 8, Traffic statistics example from our local LAN on a very quiet day. Only administrators can request this information, whereas individual users can only request their own information.
The CVS layout is as follows:
    htdocs/                                  docs
    os/                                      operating system
    os/i386-pc-bsdi3.0/                      BSD/OS 3.0 OS module
    os/sparc-sun-solaris2.6/                 Solaris 2.6 OS module
    ...
    usrconfig/com/philips/cp/mpn/general/    remote configuration, general
    usrconfig/com/philips/cp/mpn/dns/        dns
    ...
    usrdibbs/general/                        the general module
    usrdibbs/admin/                          the admin module
    ...
    usrdibbs/dns/bin/                        dns module, scripts
    usrdibbs/dns/cgi-bin/                    CGI scripts for web GUI
    usrdibbs/dns/config/                     config templates
    usrdibbs/dns/dist/                       original bind 8 distribution
    usrdibbs/dns/htdocs/                     online docs (HTML)
    ...
    usrlocal/
The CVS files for making the binary modules can be found below usrdibbs/, and the CVS files for making the remote configuration modules can be found below usrconfig/, using reverse-FQDN-based names like com/philips/cp/mpn/ for each server. This makes managing a lot of configuration files easy because they are stored in a separate subtree and in a hierarchical fashion. For example, the DNS root.cache file, which is the same for philips.com servers, is stored just once in usrconfig/com/philips/dns/ instead of multiple times (for each server). A local root.cache file can still be used and will override any root.cache files from positions higher in the hierarchy.
Below usrdibbs/<modulename>/ we use bin/ for our own scripts, config/ for the configuration template files, lib/ for additional include files, libraries and support programs (like the mail2news gateway for INN), cgi-bin/ and htdocs/ for this module's web interface, and finally dist/ for the original CVS-imported source code.

On an IntraConnect server the scripts from the CVS bin/ directory together with the compiled binaries from the CVS dist/ directory are installed into /usr/dibbs/<modulename>/bin/. Files from the CVS config/ directory are used on the IntraConnect server together with local and remote configuration files to create the real configuration files in /usr/dibbs/<modulename>/etc/.
For Solaris we use the same CVS repository on a BSD/OS system and the remote repository mechanism of CVS. Our setup isolates every major OS dependency in separate directories named after the operating system using GNU's config.guess script. Small OS dependencies were solved using conditional statements in our makefiles, also using the config.guess program.
The test target compiles a binary module, installs it in a local temporary directory and creates the module from it. It then installs the module on the distribution server (currently the same machine as the development server) for all test machines. As explained before, the module is stored only once and symlinks are created for the individual machines.
The dist target does the same, but releases the module for all servers including the test servers. The dist target is also used to release configuration modules; because configuration modules are for one server only there is no test target here.
Every module has its own VERSION file which gets automatically updated and committed to the CVS repository by the 'make dist' and 'make test' commands.
Binary module upgrades are done almost weekly for, mostly small, bugfixes and new functionality. Urgent bugfixes are still tested in the standard way but released immediately. Configuration module upgrades are rarer: usually only just after a server has been installed, when some configuration fine-tuning is often necessary. Two automated and unattended OS upgrades have been done so far; they failed for only one server, which was waiting for its keyboard (the well-known error 'Keyboard error: press F1 to continue' when no keyboard is attached). We operate the IntraConnect machines with only a power cable, one (optionally two) network connections and, if possible, a serial console connection to a terminal server. No monitor, no keyboard, no mouse.
The use of the reverse DNS domain path (like /com/philips/cp/mpn) for the configuration modules and the private directory on the distribution server proved very useful. A module inherits everything available earlier in the reverse DNS domain path. This way we do not need to duplicate information and can store it exactly once.
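A sketch of this lookup (illustrative only): to find the most specific copy of a file, walk the reverse-FQDN path from the deepest directory upwards and take the first match.

    # Sketch: locate e.g. dns/root.cache for the server com/philips/cp/mpn;
    # a more specific copy overrides copies higher up in the hierarchy.
    find_config() {
        file=$1                                # e.g. dns/root.cache
        path=com/philips/cp/mpn               # reverse-FQDN path of this server
        while [ -n "$path" ]; do
            if [ -f "usrconfig/$path/$file" ]; then
                echo "usrconfig/$path/$file"
                return 0
            fi
            case "$path" in
                */*) path=${path%/*} ;;        # strip the most specific component
                *)   path= ;;
            esac
        done
        return 1
    }
    find_config dns/root.cache                 # prints usrconfig/com/philips/dns/root.cache
                                               # when no more specific copy exists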
Because the logfiles are tracked and prioritized in real-time and any messages not filtered explicitly become 'high prio', we are able to create a good filter configuration file per service in a short time. Any new unknown message is evaluated and prioritized depending on the possible impact on the service. The new filter rules are then distributed to all servers with a service upgrade which is just one 'make dist' command on the development/distribution server.
Customers are very pleased with the fact that the user/group administration and authorization for services is done from one GUI. This GUI is set up in such a way that new services can be included very simply. Sample extensions requested by customers included a logfile search facility, so the local administrator can perform some first-line support actions (although understanding logfiles can require in-depth knowledge of the service).
Also at the request of our customers, we built an accounting/reporting tool for users and groups. Using their usercode and password they can request detailed accounting information up to and including the previous day.
We now operate (multiple) IntraConnects in New York, Sao Paulo, Sydney, Hong Kong, Barcelona, Paris, Brussels, Vienna, Hamburg, London and Eindhoven. Soon we will add Arlington (TX), Singapore, Copenhagen, Milan, Zurich, Taipei and others.
A team of 5 people spent 1.5 man years on technical development and 1.5 man years on management issues. Technical central operation costs 0.5 FTE (Full Time Equivalent). Furthermore we are spending 1 FTE on ongoing development (importing new software package releases, extensions to multi-machine servers, etc.). Local administration averages 0.1 FTE per IntraConnect, which includes first-line support.
We extended the master.root file with numbered variable names (for example, DIBBS_IPADDRESS became DIBBS_IPADDRESS_1 and DIBBS_IPADDRESS_2) and added the Ethernet MAC address of the primary interface card as the unique machine identifier. Because the master.root file may only be read via the two (/bin/sh and perl) include files, we were able to hide the fact that multiple machines are involved in those two scripts: on machine number 1 the value of DIBBS_IPADDRESS_1 is copied to DIBBS_IPADDRESS, on machine number 2 the value of DIBBS_IPADDRESS_2 is copied to DIBBS_IPADDRESS, and so on. Most of the other scripts could be used unaltered in the multi-machine situation. We were pleasantly surprised that our approach worked out so well.
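A minimal sketch of this trick in the /bin/sh include file (the machine-number detection and the example value are assumptions):

    # After reading master.root, map the numbered variable for this machine
    # onto the plain name that all existing scripts expect.
    MACHINE_NR=2                                     # e.g. derived from the MAC address
    eval "DIBBS_IPADDRESS=\$DIBBS_IPADDRESS_${MACHINE_NR}"
    export DIBBS_IPADDRESS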
The only other changes were in the GUI, user/group administration and accounting/reporting. The GUI has been extended so it can find out on which servers a service runs: a hyperlink can now send you to another machine without the user seeing it. The user administration database is replicated from the first 'master' machine to the 'slave' machines by the slave machines when needed, that is, during a service reconfigure. This change was also implemented in one include file, instead of modifying all scripts for all services. Preprocessed logfiles are sent from all servers to the master server which generates the overall accounting and reporting files.
During the implementation of this multi-machine feature we already had a number of servers running. These servers in fact migrated from the old setup to a multi-machine setup with just one server: the one-machine situation can now be considered the minimal multi-machine situation.
Side note: If we find and fix bugs, the fixes are always communicated back to the authors. In the future, our improvements and new features for the freeware packages can be made publicly available too with an Origin disclaimer attached. Note that our IntraConnect framework consisting of Makefiles, sh/perl libraries, installer/service scripts etc. is not publicly available.