
CHAPTER 4
Data Acquisition, Visualization, and Analysis
Stephen E. Reichenbach

Contents

1. Introduction  77
2. Data Acquisition  79
   2.1 Modulation and sampling  79
   2.2 Digitization and coding  80
   2.3 File formats  81
3. Visualization  82
   3.1 Image visualizations  82
   3.2 Other visualizations  85
4. Data Processing  89
   4.1 Phase correction  89
   4.2 Baseline correction  90
   4.3 Peak detection  92
5. Chemical Identification  95
   5.1 Chemical identification by retention time  95
   5.2 Multivariate methods for chemical identification  97
   5.3 Smart Templates  99
6. Quantification and Multi-Dataset Analyses  100
   6.1 Quantification  100
   6.2 Sample comparison, classification, and recognition  102
   6.3 Databases and information systems  104
7. Conclusion  104
Acknowledgment  105
References  105

1. INTRODUCTION
An introduction to informatics for comprehensive two-dimensional gas chromatography (GC×GC) should begin with the strikingly beautiful and complex pictures of data visualization. Whether viewed as a pseudocolorized two-dimensional image, as in Figure 1, or as a projection of a three-dimensional surface, as in Figure 2, GC×GC visualizations impress even observers lacking chromatographic expertise with their colorful and multitudinous features. Chromatographers recognize, within these pictures, complex patterns embedding a wealth of multidimensional chemical information. The richness of GC×GC data is immediately apparent, but the size and complexity of GC×GC data pose significant challenges for chemical analysis.

Figure 1 GC×GC data from a gasoline analysis visualized as a digital image. Only a portion of the data is shown. (This and other figures were generated with GC Images software [1]. Data supplied by Zoex Corporation.)

Figure 2 GC×GC data visualized as a three-dimensional surface. A subregion of the data from Figure 1 is shown.

Comprehensive Analytical Chemistry, Volume 55, ISSN: 0166-526X, DOI 10.1016/S0166-526X(09)05504-4
© 2009 Elsevier B.V. All rights reserved.
This chapter examines methods and information technologies for GC×GC data acquisition, visualization, and analysis. The quantity and complexity of GC×GC data make human analyses difficult and time-consuming and motivate the need for computer-assisted and automated processing. GC×GC transforms chemical samples into raw data; information technologies are required to transform GC×GC data into chemical information.
The typical data flow is a sequence of acquiring and storing raw data, processing data to correct artifacts, detecting and identifying chemical peaks, and analyzing datasets to produce higher-level information (including quantification) and reports. In applications for which the analysis is fairly well understood and routine, information technologies may fully automate this process.


However, because GC×GC is so powerful, it frequently is used for analyses that are not well understood or are not routine. In such cases, information technologies must support semi-automated processing, visual interpretation, and interactive analysis.
This chapter addresses the following fundamental tasks in transforming GC×GC data into chemical information:
• Acquiring and formatting data for storage, access, and interchange.
• Visualizing multidimensional data.
• Processing data to remove acquisition artifacts and detect peaks.
• Identifying chemical constituents.
• Analyzing datasets for higher-level information and reporting.

2. DATA ACQUISITION
Although GC×GC is a true two-dimensional separation, the process serializes the data — producing data values in a sequence. In GC×GC, the first column progressively separates and presents eluates to the modulator, which iteratively collects and introduces them into the second column, which then progressively separates and presents eluates to the detector. As explained in detail in Chapter 2, in the detector, the analog-to-digital (A/D) converter samples the chromatographic signal at a specified frequency. In concept, this operation is similar to how some optical systems create an image with as few as one detector by progressively scanning the detector(s) across the two spatial dimensions, but, in GC×GC, the two dimensions are the two retention times. Then, the digitized data and relevant metadata (information about the data) are stored in a file with a defined format for subsequent access.
2.1 Modulation and sampling
The modulation frequency and the detector sampling frequency typically are under user control. Setting these frequencies (subject to the limitations of the hardware) involves trade-offs between resolution and other constraints. The desire for high resolution suggests that the modulation and sampling rates should be as rapid as possible. A Gaussian peak is not band-limited, so truly sufficient sampling is not possible. Therefore, higher modulation and sampling rates provide greater information capacity and increased resolution for detecting co-eluted peaks. However, the modulation frequency must allow adequate intervals for separations in the second column, and the sampling frequency involves a trade-off in data size (i.e., higher sampling frequencies generate more data) and diminishing returns in selectivity and precision. Full consideration of these and other issues (such as duty cycle and noise) in setting the modulation and sampling frequencies involves instrumental and application-specific concerns that are beyond the scope of this chapter, but consideration of the data suggests general guidelines.


Experimental and theoretical studies [2] suggest that the modulation rate should be at least one cycle per two times the primary peak standard deviation σ1 (i.e., the standard deviation of the peak from the first-column separation), which translates to at least four modulation cycles over 8σ1 (the effective width of peaks from the first-column separation). The considerations for GC×GC detector frequencies are similar to those for traditional one-dimensional chromatography, for which a rate of at least one sample per peak standard deviation is recommended [3,4], that is, eight samples over 8σ2 (the effective width of peaks from the second-column separation). With these considerations, Murphy et al. [5] recommend that method development begin with determining the shortest time for adequate chromatographic separation in the second column and then a first-dimension method be used that provides peak widths of at least four times the modulation interval. With the wide variety of chemical mixtures and analytical goals for GC×GC, a broad range of modulation and sampling frequencies are used. Modulation cycles from 2 to 20 seconds (s) and sampling frequencies from 25 to 200 hertz (Hz) are not unusual. Again, however, the application should be considered: slow modulation and sampling rates relative to peak width may be sufficient for applications that require only quantification of well-separated peaks, and fast modulation and sampling rates relative to peak width may be required for applications that involve compounds that are difficult to separate.
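As a back-of-the-envelope aid, these guidelines can be expressed in code. The following Python sketch (the function name and return keys are illustrative, not from any instrument software) computes the longest advisable modulation period and the slowest advisable detector rate from the two peak standard deviations:

```python
# Rule-of-thumb calculator for the guidelines above: at least one
# modulation cycle per 2*sigma1 of the first-column peaks, and at least
# one detector sample per sigma2 of the second-column peaks.

def acquisition_guidelines(sigma1_s, sigma2_s):
    """sigma1_s, sigma2_s: peak standard deviations (seconds) for the
    first- and second-column separations, respectively."""
    return {
        # a period of 2*sigma1 gives at least 4 modulation cycles per 8*sigma1
        "max_modulation_period_s": 2.0 * sigma1_s,
        # one sample per sigma2 gives at least 8 samples per 8*sigma2
        "min_detector_rate_hz": 1.0 / sigma2_s,
    }

# sigma1 = 2.5 s and sigma2 = 0.05 s suggest a modulation period of at
# most 5 s and a detector sampling rate of at least 20 Hz
g = acquisition_guidelines(2.5, 0.05)
```

Such a calculation only bounds the settings; the instrumental and application-specific concerns noted above still govern the final choice.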
A common problem in GC×GC data processing is inadequate sampling of the first-column output; that is, the modulation period is too long with respect to the first-column peak widths, or, put another way, the first-column chromatography produces peaks too narrow for the modulation period. Of course, if the modulation period is constrained by the time required for second-column separations, then broadening the peak widths from the first column may require longer runs (thereby increasing cost). Inadequate sampling of the second-column output is less commonly problematic because most detectors used for GC×GC are fast and most laboratories typically use detector sampling rates that exceed what is required for the analysis (and so generate more data than may be necessary). However, as explained in Chapter 2, some types of detectors — for example, quadrupole mass spectrometer (qMS), atomic emission detector (AED), and electron capture detector (ECD) — may be challenged by the acquisition speeds required for GC×GC.
2.2 Digitization and coding
GC×GC systems use an A/D converter to map the intensity of the chromatographic signal to a digital number (DN). Among the many types of detectors used with GC×GC, the major distinction is between detectors that produce a single number at each time sample of the chromatogram, such as a flame-ionization detector (FID) and a sulfur chemiluminescence detector (SCD), and multichannel detectors that produce multiple values (typically, over a spectral range) for each time sample, such as a mass spectrometer (MS). In either case, each DN is represented with a limited number of bits indicating a value in a limited range with limited precision.


Because GC×GC can produce large datasets, GC×GC systems often employ data compression in their file formats. Sampling at 200 Hz, a detector for single values with a 48-bit dynamic range (as supported by Agilent's IQ data file format [6]) produces data at the rate of 4.3 megabytes/hour (MB/h). Most programming languages must perform arithmetic on 48-bit values with 64-bit long integers or 64-bit double-precision floating-point numbers. Mass spectrometers can produce raw data at GHz rates (e.g., one 8-bit spectral intensity per nanosecond), a data rate of about 1 gigabyte/second (GB/s). In order to store data more efficiently, GC×GC systems may compress the data. For example, because data values are correlated with neighboring values in the sequence, Agilent's IQ data file format implements a second-order backward differential coding that compresses values from a 48-bit range to 2 bytes. Even more aggressive compression commonly is used for MS data. For example, ORTEC's FastFlight-2™ [7] can accumulate successive spectra in hardware and output only the summed spectra for a much smaller data rate. In an MS with GHz raw speed, summing 100 transient spectra of 100 K channels each generates 100 spectra per second (compared to 10,000 raw spectra per second). The FastFlight-2 also offers a lossless compression mode that uses fewer bytes to represent smaller values and a lossy compression mode that detects and encodes only the spectral peaks in the MS data — a process sometimes called centroiding because each spectral peak is represented by a single centroid indicating the center, intensity, and sometimes the peak width.
2.3 File formats
Most GC×GC systems use a proprietary data file format, which affords vendors a high degree of control (e.g., to implement data compression) but which poses a barrier and inconvenience for sharing or processing data across systems. Currently, there is no standard format for GC×GC data, but GC×GC data can be shared using nonstandard text files or existing standards for gas chromatography (GC) data. GC×GC data can be converted to text, for example, ASCII-format comma-separated values (CSV), but the resulting files are nonstandard and are larger than binary or compressed data files. The ASTM has issued Analytical Data Interchange (ANDI) standards for chromatography [8] and MS [9]. These standards lack some requirements for GC×GC metadata (e.g., a metadata element for the modulation cycle) but can be used to communicate raw data and other chromatographic metadata. These standards were developed primarily for data interchange and lack some desirable features for more routine use. Another limitation of the ANDI standards is that the network Common Data Form (netCDF) [10], upon which the standards are built, was defined for 32-bit computing systems, limiting their usability for data larger than 2 GB. The ASTM has sanctioned an effort to develop a new format standard for analytical chemistry data, the Analytical Information Markup Language (AnIML) [11,12], utilizing the eXtensible Markup Language (XML) [13]. Standard formats for analytical chemistry data facilitate data portability and interchange, but despite such considerations proprietary GC formats have continued to dominate the market.
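As a minimal illustration of the nonstandard-text option, the following Python sketch writes a raster (stored as a list of pixel columns) to CSV text; the layout, with one row per second-column sample and one column per modulation cycle, is an arbitrary choice, not a standard:

```python
# Sketch of exporting GCxGC data to nonstandard CSV text.

import csv
import io

def raster_to_csv(raster):
    """raster: list of pixel columns, one per modulation cycle."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for r in range(len(raster[0])):           # second-column sample index
        writer.writerow([col[r] for col in raster])
    return buf.getvalue()

raster = [[0, 1, 2],
          [3, 4, 5]]                          # 2 modulation cycles x 3 samples
text = raster_to_csv(raster)                  # first row pairs the cycles: "0,3"
```

As the text notes, such files are easy to exchange but larger than binary or compressed formats, and they carry no metadata.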


3. VISUALIZATION
Visualization is a powerful tool for qualitative analysis of GC×GC data (e.g., to troubleshoot the chromatography). Various types of visualizations are useful: two-dimensional images provide a comprehensive overview, three-dimensional visualizations effectively illustrate quantitative relationships over a large dynamic range, one-dimensional graphs are useful for overlaying multivariate data, tabular views reveal the numeric values in the data, and graphical and text annotations communicate additional information. This section explores some of the methods and considerations in the various types of visualizations.
3.1 Image visualizations
3.1.1 Rasterization
A fundamental visualization of GC×GC data is as a two-dimensional image. GC×GC data, which are acquired sequentially, can be reorganized as a raster — a two-dimensional array, matrix, or grid of picture elements called pixels — in which each pixel value is the intensity of the detector signal. As a two-dimensional array of intensities, GC×GC data have many similarities with other types of digital images, and so many methods and techniques from the field of digital image processing can be applied or adapted for GC×GC data visualization and processing.
The standard approach for rasterization is to arrange the data values acquired during a single modulation cycle as a column of pixels, so that the ordinate (Y-axis, bottom-to-top) is the elapsed time for the second-column separation, and then to arrange these pixel columns so that the abscissa (X-axis, left-to-right) is the elapsed time for the first-column separation. This ordering presents the data in the commonly used right-handed Cartesian coordinate system, with the first-column retention time as the first index into the array. Other orderings are possible but less commonly used. The problems of correctly synchronizing the columns of data with the modulation cycle and of modulation cycles that are not evenly divisible by the detector sampling interval are examined in Section 4.1.
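The standard rasterization can be sketched in a few lines of Python. This sketch assumes the stream starts exactly on a modulation boundary and that the cycle is an integer number of samples, that is, it ignores the synchronization problems deferred to Section 4.1:

```python
# Sketch of rasterization: fold the serial detector stream into pixel
# columns, one column per modulation cycle.

def rasterize(stream, samples_per_cycle):
    """Return a list of pixel columns; each column holds the samples of
    one second-column separation."""
    assert len(stream) % samples_per_cycle == 0
    return [stream[i:i + samples_per_cycle]
            for i in range(0, len(stream), samples_per_cycle)]

# 6 samples at 3 samples per modulation cycle -> 2 pixel columns
cols = rasterize([10, 20, 30, 40, 50, 60], 3)   # [[10, 20, 30], [40, 50, 60]]
```

Indexing `cols[i][j]` then follows the convention above: first-column retention index first, second-column retention index second.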
3.1.2 Colorization
For presentation as an image, the pixels are colorized; that is, the GC×GC values are mapped to colors of the display device. Scalar values, such as single-valued GC×GC data, can be colorized simply on an achromatic grayscale, familiar from so-called black-and-white images. Scalar values can be extracted from multispectral data in various ways, for example, by adding all intensities in each spectrum to compute the total intensity count (TIC) of the data point or by taking the value in a selected "channel" of the spectrum. A grayscale mapping typically is defined by setting a lower bound, below which values are mapped to black; an upper bound, above which values are mapped to white; and a function to map values between the bounds to shades of gray, with brightness increasing with value. Linear, logarithmic, and exponential mapping functions are useful for different effects: linear mapping treats gradations at all intensity levels similarly; logarithmic mapping emphasizes gradations nearer the lower bound; and exponential mapping emphasizes gradations nearer the upper bound. Although grayscale colorization provides a straightforward ordering of values from small to large that is intuitively meaningful, humans may be able to distinguish fewer than 100 distinct grayscale gradations [14]. Therefore, grayscale images cannot effectively communicate many differences among values over a large dynamic range such as is common for GC×GC data.
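A grayscale mapping of this kind can be sketched as follows; the function and its parameterization (here a fixed 100:1 contrast constant for the logarithmic and exponential modes) are illustrative choices:

```python
# Sketch of grayscale value mapping with lower and upper bounds and
# linear, logarithmic, or exponential transfer functions.

import math

def to_gray(value, lo, hi, mode="linear"):
    """Map a data value to a gray level in [0.0, 1.0] (black to white)."""
    if value <= lo:
        return 0.0              # at or below the lower bound: black
    if value >= hi:
        return 1.0              # at or above the upper bound: white
    t = (value - lo) / (hi - lo)
    if mode == "log":           # emphasize gradations near the lower bound
        return math.log1p(99.0 * t) / math.log1p(99.0)
    if mode == "exp":           # emphasize gradations near the upper bound
        return math.expm1(t * math.log(100.0)) / 99.0
    return t                    # linear: treat all levels alike
```

For a value midway between the bounds, the linear mode returns 0.5, the logarithmic mode returns a brighter gray, and the exponential mode a darker one, matching the emphasis described above.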
Pseudocolorization takes advantage of the differing sensitivities in human vision for different frequencies of light [14]. These differing sensitivities enable "color" perception, with greater selectivity than for grayscale. Because humans have trichromatic vision based on three types of color receptors (cones), a trichromatic color model is sufficient for image colorization. Various trichromatic color models have been developed. RGB (with values for red, green, and blue) and HSV (with values for hue, saturation, and brightness value) are widely used color models for digital imaging.
Pseudocolorization maps data values with three independent functions for the three color components. The mapping functions for the color components typically are not monotonically nondecreasing (as grayscale mapping functions typically are), so discerning relative values in a pseudocolor image is not as straightforward as with grayscale (for which brighter means larger). However, a good pseudocolor scale can communicate a clear ordering of values. For example, topographic and temperature images commonly use a pseudocolor scale sometimes called cold-to-hot, which has a mapping from small to large that progresses through blue, cyan, green, yellow, and red, with intermediate colors. In Figure 1, the color scale has the smaller values of the background colorized dark blue and the larger values of the peaks colorized with the cold-to-hot scale to show increasing values. This mapping is easily interpreted because it is familiar. Pseudocolor images can present many distinguishable colors, but there is a trade-off between having a pseudocolor scale with an ordinal progression that is simple to understand and the number of gradations that can be discerned: an easily understood scale visually differentiates a smaller number of gradations, and a scale that visually differentiates a larger number of gradations makes the value ordering more difficult to understand.
Pseudocolorization offers better visualization than grayscale for gradations across a wide dynamic range of values, but to be effective the mapping still must allocate color variations to the value range according to the presence of gradations. Specifying pseudocolorization interactively can be tedious and difficult, so automated determination of pseudocolor mapping is useful. Gradient-Based Value Mapping (GBVM) [15] is an automated method for mapping GC×GC data values onto a color scale, for example, the cold-to-hot scale. For a given dataset, GBVM builds a value-mapping function that emphasizes gradations in the data while maintaining ordinal relationships of the values. The first step computes the gradient (local difference) at each pixel. Then, the pixels (with computed gradients) are sorted by value, and the relative cumulative gradient magnitude is computed for the sorted array. The GBVM function is the mapping from pixel value to the relative cumulative gradient magnitude of the sorted array. GBVM is effective at showing local differences across a large dynamic range.
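The GBVM steps can be sketched in Python. For brevity, this sketch works on a one-dimensional list of pixel values with a simple forward-difference gradient; a faithful implementation would compute two-dimensional gradients on the raster:

```python
# Sketch of Gradient-Based Value Mapping (GBVM): map each pixel value to
# the relative cumulative gradient magnitude of the value-sorted pixels,
# so value ranges with many gradations receive more of the color scale.

def gbvm_map(values):
    """Return a dict from pixel value to a mapped value in [0, 1]."""
    # step 1: gradient (local difference) magnitude at each pixel
    grads = [abs(values[i + 1] - values[i]) for i in range(len(values) - 1)]
    grads.append(grads[-1] if grads else 0.0)   # pad to match length
    # step 2: sort the (value, gradient) pairs by value
    pairs = sorted(zip(values, grads))
    # step 3: relative cumulative gradient magnitude over the sorted array
    total = sum(g for _, g in pairs) or 1.0
    mapping, cum = {}, 0.0
    for v, g in pairs:
        cum += g
        mapping[v] = cum / total
    return mapping

m = gbvm_map([0, 0, 10, 1000, 10, 0])
# the mapping is monotone in value: m[0] < m[10] < m[1000] == 1.0
```

Because the cumulative sum never decreases, the mapping preserves the ordinal relationships of the values while stretching contrast where gradients concentrate.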
Each resolved chemical compound in a sample increases the value in a small cluster of pixels, which, if the colorization effectively shows local differences, are seen as a localized spot with different colors than the surrounding background. If the colorization is not effective over the full dynamic range, spots with small values may not be visible or spots with large values may not show significant relative differences.
3.1.3 Navigation
Standard operations for navigating digital images include panning, scrolling, and rescaling. Rescaling requires resampling the data — creating a displayed image with more pixels to zoom in or a displayed image with fewer pixels to zoom out. (Visualization does not change the underlying data used for later processing.) Enlarging an image by rescaling entails reconstruction, which is the task of rebuilding the signal at resampling points between the data values. Popular methods for digital image reconstruction include nearest-neighbor interpolation, bilinear interpolation, and various methods using cubic polynomial functions for interpolation or approximation [14]. Bilinear interpolation provides a good compromise between quality and computational overhead. It is important to remember that reconstruction estimates signal values and that large zoom factors entail numerous estimates. Therefore, although nearest-neighbor interpolation creates blocky images with less accurate reconstruction, the result makes clear the modulation and sampling rates of the data. Similarly, nearest-neighbor interpolation will show changes in the aspect ratio imposed during rescaling (e.g., to compensate for different sampling rates in the two dimensions, such as undersampling the first-column separation and oversampling the second-column separation). Figure 3 compares bilinear and nearest-neighbor interpolation. Bilinear interpolation shows a spot that more closely represents the continuous peak produced by chromatography. Nearest-neighbor interpolation shows rectangular pixels that make clear the discrete nature of the digitized signal.
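Bilinear interpolation itself is compact enough to sketch directly; this version (with illustrative names, operating on a raster stored as a list of rows) clamps at the image border:

```python
# Sketch of bilinear interpolation: interpolate along one axis between
# the two nearest samples in each bracketing row, then along the other
# axis between those two results.

def bilinear(raster, x, y):
    """Sample a raster (list of rows) at fractional column x and row y."""
    x0, y0 = int(x), int(y)
    x1 = min(x0 + 1, len(raster[0]) - 1)   # clamp at the right border
    y1 = min(y0 + 1, len(raster) - 1)      # clamp at the bottom border
    fx, fy = x - x0, y - y0
    top = raster[y0][x0] * (1 - fx) + raster[y0][x1] * fx
    bot = raster[y1][x0] * (1 - fx) + raster[y1][x1] * fx
    return top * (1 - fy) + bot * fy

grid = [[0, 10],
        [20, 30]]
center = bilinear(grid, 0.5, 0.5)   # averages all four neighbors: 15.0
```

Nearest-neighbor interpolation, by contrast, would simply return the value of the closest pixel, producing the blocky appearance described above.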
Figure 3 A single GC×GC peak enlarged by bilinear interpolation (left) and nearest-neighbor interpolation (right). Bilinear interpolation yields a truer (i.e., higher-fidelity), more pleasing spot, but nearest-neighbor interpolation more clearly shows the individual data points.

3.1.4 Qualitative analysis
Visualization can quickly and clearly show important characteristics of GC×GC data, including problems related to the chromatography. Three such examples are considered briefly here. First, if the retention time of a compound in any second-column separation exceeds the length of the modulation cycle, the associated compound will elute during a subsequent modulation cycle and the peak will appear as a spot that is wrapped around into a subsequent column of pixels in the image. If the retention time is only slightly too long, the spot will appear in the otherwise blank region at the bottom of the image corresponding to the void time of the next second-column separation. This problem can be recognized upon visual inspection, and the chromatographer can change the acquisition settings, for example, lengthening the modulation cycle time or accelerating the second-column separations with a temperature program or a shorter column. A second problem sometimes is seen in crescent-shaped trails that, from left to right, slope downward quickly at first and then level out. These artifacts indicate a continuous presentation of eluates from the first column into the second column, perhaps caused by incomplete bake-out (an unclean first column) or by incomplete modulation (i.e., a thermal modulator that is not heated sufficiently to fully release the collected eluates). A third problem seen in visualizations is peak tailing in the second-column separations, which can be caused by various chromatographic issues. Figure 1 illustrates small artifacts of crescent-shaped "bleed" and peak tailing. Data visualization enables quick inspection of the data for these and other qualitative issues.
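The wrap-around arithmetic is simple to state in code. This sketch (names are illustrative) gives the number of modulation cycles by which a wrapped spot is delayed and its apparent second-column retention time:

```python
# Sketch of wrap-around: a second-column retention time longer than the
# modulation period appears in a later pixel column, at the retention
# time modulo the period.

def apparent_position(t2_s, modulation_period_s):
    """Return (modulation cycles of delay, apparent second-column time)."""
    wraps = int(t2_s // modulation_period_s)
    return wraps, t2_s - wraps * modulation_period_s

# a 7.5 s second-column retention time with a 6.0 s modulation period
# wraps once: the spot appears one pixel column late, at 1.5 s
wraps, t2_apparent = apparent_position(7.5, 6.0)
```

If the apparent time falls within the void time of the next separation, the spot lands in the otherwise blank region at the bottom of the image, which is the telltale sign described above.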
3.2 Other visualizations
3.2.1 Three-dimensional visualizations
Three-dimensional visualizations use many of the same techniques as two-dimensional image visualizations, including rasterization, colorization, navigation, and reconstruction. A three-dimensional visualization is based on a surface, with the surface elevation relative to the base plane given by each pixel’s value. The elevation scale can utilize a mapping function (e.g., linear, logarithmic, or exponential functions). Constructing and viewing an artificial surface utilizes many of the techniques of computer graphics. The surface can be rendered in various ways, for example, pseudocolorized at each pixel, colorized with a solid color and illuminated to provide shading, or built as a wire frame. Then, the surface is projected onto a two-dimensional viewing plane for display. A common projection is the perspective view from a single viewpoint. Additional navigation operations enable the user to rotate the surface in space, in order to view the surface from different perspectives. Figure 2 illustrates a three-dimensional perspective view of a portion of the GC×GC data shown in Figure 1, with values shown as the third dimension (i.e., elevation) and with log scaling.
With the added dimension of height, three-dimensional visualizations are better able to show quantitative relationships over a large dynamic range. However, in three-dimensional visualizations, points on the surface can be obscured, and there is no correspondence between the dimensions of the data and the axes of the display, so interactive operations such as point-and-click indexing are more difficult and problematic than with a two-dimensional image. In that sense, different visualizations are complementary, each with its own utilities.
3.2.2 One-dimensional visualizations
One-dimensional graphs are useful for various purposes, including showing slices or integrations of GC×GC data in a graphical format that is familiar to traditional chromatographers. For example, the values in different secondary chromatograms (or rows along the first-column separation) can be rendered as a graph and overlaid to show whether the profiles change over time and/or the results of peak detection in one dimension. Similarly, values in different spectral "channels" of a pixel column (or row) can be graphed and overlaid to show if the multispectral profiles reveal the presence of co-eluted peaks, as illustrated in Figure 4.
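Extracting such slices from the raster is straightforward; this sketch assumes the raster is stored as a list of pixel columns, one per modulation cycle, as in the standard rasterization described earlier:

```python
# Sketch of one-dimensional slices through a GCxGC raster for overlaid
# graphs: a pixel column is one secondary chromatogram, and a row
# follows a single second-column time across the first-column axis.

def column_slice(raster, cycle_index):
    """One secondary chromatogram: the pixel column of one modulation cycle."""
    return raster[cycle_index]

def row_slice(raster, row_index):
    """Values at one second-column retention time across all cycles."""
    return [col[row_index] for col in raster]

raster = [[1, 2, 3],
          [4, 5, 6]]              # 2 modulation cycles x 3 samples each
profile = row_slice(raster, 0)    # -> [1, 4]
```

Overlaying several such slices, or several spectral channels of one slice, produces graphs of the kind shown in Figure 4.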
3.2.3 Text and tabular visualizations
Some information is best communicated in a text format. For example, the values of the two-dimensional data array can be shown directly as a table, in which each cell displays a numeric pixel value. Visualization features available in spreadsheets are useful for tabular text visualizations. For example, colorization of the text or textboxes can be useful for highlighting different features of the data, such as peak

Figure 4 A one-dimensional visualization graphing values in selected-ion channels (M/Z = 165, 180, and 182) along a slice through co-eluted peaks; the graph plots detector response against second-column retention time (seconds).
