WFCAM data processing strategy

Proposed Cambridge and Summit pipeline layouts

Summit Pipeline

The main roles of the summit pipeline are to produce real time data quality control information, a first pass science product and to provide verified export raw FITS files for shipping to Cambridge.

The overall schema for the proposed summit data flow is shown in the following diagram and is based on commodity rack-mounted PCs.

The data aquisition system internally applies non-linearity corrections and reset-corrects each integration, stacking multiple exposures within an integration as necessary.
Individual data aquisition PCs, one for each detector, feed 32-bit integer SDF files to the data processing PCs. Each rack-mounted PC unit also has a mini IDE-based RAID disk and an Ultrium LTO-1 tape drive. The raw SDF files are written to tape (one tape per detector) and archived at JAC.
The first stage of the summit pipeline, running under ORAC-DR, converts the SDF files to verified single extension FITS using the lossless Rice tile compression in CFITSIO to reduce the data container size by a factor of roughly 4.
Raw FITS files are written to tape (one per detector) and approximately weekly shipped to CASU for further processing.
Data quality control information is ingested into the survey progress database and the first pass science products at the summit are available locally if required. CASU will mirror the summit survey progress database in Cambridge to aid in ingestion of ancilliary observing information.

Cambridge Pipelines

Pipeline processing in the UK is built on a modular system and involves three separate pipeline philosophies: the standard pipeline provides instrumental signature removal, object catalogue generation and astrometric and photometric calibration and operates on a nightly basis; likewise the nightly further processing pipeline will be used to generate spatially mapped oversampled PSFs for use in PSF fitting for stars and as the basis for Sersic profile fitting of galaxies; finally an advanced database-driven pipeline will run asynchronously via the science archive at WFAU to perform tasks such as deep stacking, mosaicing of contiguous regions, catalogue merging, list-driven photometry and photometric calibration refinement using overlaps of adjacent fields.

The Cambridge pipelines will be built around commodity PC clusters and will host an online raw data products archive and a data products server as a staging system for transfer of pipeline output to WFAU.

Initial data processing stages consist of:

Ingest and verify Ultrium tapes and check if all files are present by comparing with mirrored summit database
Add in summit database extras such as autoguider statistics and observer comments, and reformat the single extension FITS files to Multi-Extension FITS (MEFs).
Organise data and store in raw data archive and ingest header information to Sybase archive server. Raw Ultrium tapes will be archived offline to provide a backup system.
Run standard pipeline and update pipeline progress database. If reprocessing for bug fixes/updates are necessary they will be excecuted from the on-line raw data archive

For the standard pipeline we are intending to use 6 rack-mounted twin processor PC units running GNU/Linux, with each enclosure containing twin redundant power supplies, fans, dual Athlon processors (~3 GHz/s), with a ~1 Tbyte 3ware Escalade-based RAID5 IDE disk system. Two of these systems, the data archive server and the data products server will be interfaced to rack-mounted standalone Ultrium LTO-2 tape drive units (SCSI interface) and will be mainly used as data servers. The other four will be mainly used for data processing for the standard pipeline products.

With the details depending on the stability of the master calibration frames, we currently envisage using these data processing PCs in either a cyclical nightly hierarchy or for processing data split by filter(s) for each night.

External multi-terabyte RAID arrays based on the Infotrend controller (or similar) will provide the on-line storage for the raw data database server and for the data product server systems.

Operating system and processing software will be served from the two servers (one for redundancy) to the other data processing PCs to allow for software and operating system updates (eg. security patches) and further expansion of the cluster with minimal system manager intervention. Local disks will be used for any temporary storage.

The alternative approach of maintaining a separate operating system installation on each node is also under consideration, with the possibility of writing additional tools to automate the distribution of software updates. The use of an appropriate distribution of GNU/Linux, for example Debian or Redhat with the up2date tool, would facilitate this setup since operating system updates can be more easily automated. This would be based on a common "system image" taken from the same original GNU/Linux installation to allow for any future node replacements or additions and to maintain an identical setup on each node.

The whole PC cluster will be connected using 1 Gbit ports on a rack-mounted 3C17701-ME SuperStack III 4924 switch, which has 24 autosensing 10/100/1000BASE-T(RJ45) ports.

Pipeline Status Monitoring

The current status of the pipeline is monitered via a www page generated by a perl script. The page shows the overall status of each dataset currently being processed, and allows the CASU staff to keep an eye on data flow progress.

Following the link for the dataset shows the status of all the files within the dataset. The script currently extracts the status information from the fits headers; we are alos investigating the alternative of the pipeline writing the status to a noticeboard file instead - this would reduce i/o and speed up the monitoring script with the caveat of losing independance of the test.

Processed Data DQC

Before the processed data can be placed in the release area, they must be verified. On pipeline completion, a perl script generates a www page of various diagnostic plots to help with overall data quality control monitoring. An example page based on INT WFC data is shown here.

This will be expanded to include diagnostics for all of the following parameters (per detector where appropriate):

astrometric fit residuals
requested versus actual pointing
photometric transmission throughput from 2MASS
photometric transmission diagnostic from A/G
photometric solutions from standard stars
site diagnostics - Tau, Humidity, Windspeed/direction
number of detected stars
average measured seeing
aperture correction for PSF
average stellar ellipticity
sky brightness
sky noise
5-sigma detection threshold
dark current level
fringe levels

Last modified: Fri Jun 27 07:25:27 2003