Article PCA

2.1. Data sources

2.1.1 WWTP data

Input data used for this study is retrieved from our own API (created ad-hoc). It is available at https://apicovid.icradev.cat/n1. When accessed, the API reads our SQL database (where the labs submit the PCR results every week) and outputs the data in CSV format. The data includes information about the concentration of N1 gene copies at each WWTP and influent flows (in m³/day).

SARSAIGUA covers different plant sizes: from 1,698 to 1,484,787 inhabitants, and from 363 to 334,712 m³/day of average influent flows. In addition, the plants cover zones with different degrees of industrialization as well as with a different touristic pressure.

Manual data quality control on the lab results is executed weekly. This manual quality control involves evaluating: i) RT-qPCR standard curve parameters; ii) recovery values from process control and iii) correlation between RT-qPCR target genes. Whenever large deviations from previous accumulated data in parameters like recovery percentage of a process control are identified, the labs are requested to repeat the sample.

For the analysis, 51 WWTPs (see wwtps) were included which correspond to the ones sampled weekly and biweekly; the seasonal WWTPs (only sampled during winter or summer periods) were excluded from the analysis. The data used as input for the procedure applied in this study (PCA and 2 statistical tests) is the (N1 gene copies)/day/person; for each observation at each WWTP the concentration of N1 (GC/L) is multiplied by flow (L/day) and divided by the number of inhabitants serviced by that WWTP.

WWTP code	Population (inhabitants)	Average inflow (m³/day)	Sampling frequency
DBSS	1,484,787	334,712	weekly
DPDL	1,154,006	204,328	weekly
DMIR	241,142	49,968	weekly
DTRS	227,825	39,389	weekly
DGVC	220,302	40,082	weekly
DSFL	206,708	48,307	weekly
DMAT	185,114	23,465	weekly
DGIR	146,609	41,619	weekly
DLLE	145,211	51,106	weekly
DSRS	135,228	22,943	weekly
DLLL	133,252	34,485	biweekly
DTAR	132,381	25,266	weekly
DRUS	108,581	15,655	weekly
DRUB	97,940	21,904	weekly
DGRA	97,393	21,138	weekly
DMAS	92,243	20,646	weekly
DVLG	80,910	15,554	weekly
DMDV	76,690	30,441	biweekly
DIGU	65,757	17,459	weekly
DABR	65,160	16,187	weekly
DVDP	62,122	14,006	weekly
DVIC	55,679	22,118	weekly
DPAM	55,339	14,984	weekly
DFIG	52,048	13,319	weekly
DVEN	49,850	11,398	weekly
DVLC	49,663	35,768	weekly (summer)
DOLO	40,127	15,273	weekly
DBLN	39,028	8,673	biweekly
DLDM	38,373	9,768	biweekly
DCPA	38,008	13,921	biweekly (summer)
DTOT	36,005	5,722	weekly
DMRT	35,801	7,731	weekly
DBAY	27,959	10,695	weekly
DMAN	27,898	5,307	weekly
DVAL	22,410	9,304	biweekly
DFON	22,005	22,208	biweekly
DAMP	20,738	3,964	weekly
DRSS	19,550	9,191	biweekly (summer)
DBAL	17,162	7,217	weekly
DTRG	17,098	3,284	biweekly
DBER	16,494	4,616	weekly
DMOF	15,185	3,376	weekly
DPUI	15,120	6,790	weekly
DRIP	13,695	7,696	weekly
DSLL	12,356	2,490	weekly
DSOL	9,509	2,367	weekly
DCER	9,180	1,469	biweekly
DMLN	8,777	1,616	biweekly
DMOB	7,364	1,109	biweekly
DBBL	6,087	1,774	biweekly
DTRP	5,845	2,145	biweekly
DVIE	4,892	6,799	biweekly
DGAN	3,064	525	biweekly
DFAL	2,834	363	weekly (winter)
DPSU	2,220	1,026	biweekly
DSOR	1,698	462	biweekly
total	5,958,422	1,363,128


              wwtps
              WWTPs of the SARSAIGUA surveillance network sorted by amount of
              serviced population. Also, average influent flows are displayed.

2.1.2 Clinical cases

A second input dataset has been prepared: daily reported clinical cases for all municipalites served by each WWTP, weighted by the percentage of inhabitants connected to each WWTP. This data was extracted from the official API of the Information Systems of the Department of Health and the Catalan Health Service (https://analisi.transparenciacatalunya.cat/resource/jj6z-iyrp.json).

The percentage of each municipality's inhabitants connected to each WWTP was requested directly to WWTPs. This information is available here: https://covidvigilancia.icra.cat/formularis/municipis_assistits.json

2.2. Gap filling

An input matrix without gaps is required when running PCA. Linear interpolation has been applied to the biweekly-sampled WWTPs (N1 concentration and flows) and missing values due to incidences in sampling (e.g., missing flow data).

2.3. Principal Component Analysis

Principal Component Analysis (PCA) is a multivariant statistical method for exploratory data analysis (unsupervised learning). It can be used to reduce the number of variables in a dataset while retaining as much information as possible.

PCA works by transforming the original variables into a new set of variables, called principal components, which are uncorrelated and ranked in order of importance. These new variables are linear combinations of the original variables: the first principal component is the linear combination that captures the greatest amount of variation in the data. The second principal component is the linear combination that captures the next greatest amount of variation in the data, subject to the constraint that it is orthogonal (i.e., perpendicular) to the first principal component, and so on for the remaining principal components.

PCA can be used for a variety of applications, such as identifying patterns, reducing the dimensionality of data for visualization, and making predictions using machine learning algorithms.

Our goal in this study is to use PCA as a basis for an automated alarm sytem procedure, based on fault detection techniques: (1) the error of reprojection (Squared Prediction Error, SPE or Q), which we've identified as useful for early warning, and (2), the Hotelling's T², which has the ability to detect shifts in the mean vector of a process; in our case identifies abrupt increases in the general circulation of the virus.

2.3.1. Pretreatment

Let X be a complete matrix, containing n observations (rows) and m variables (columns).

	obs nº	WWTP[1]	WWTP[2]	...	WWTP[m]
X=	1	...	...	...	...
	2	...	...	...	...
	...	...	...	...	...
	n	...	...	...	...

Scaling is performed to each column, to ensure each WWTP is given equal weight in the monitoring process. Involves subtracting each variable by its sample mean (to capture the variation from this mean) and then, divide by its standard deviation. This gives equal importance to all columns (no matter the size) and also removes variabilities caused by different sampling and analytical methods (e.g. analytical differences between labs).

2.3.2. Covariance and correlation matrix

Let S be the covariance matrix of X. Since X is mean centered, it can be calculated as:

\[ S = \frac{1}{n-1} X^T X \]

S is also the correlation matrix because X is scaled to unit variance. Size of S is mxm.

2.3.3. Eigenvalues and eigenvectors

A Single Value Decomposition (SVD) algorithm is applied to S: \[ S = V \Lambda V^T \]

V is an orthogonal matrix (V^TV=I). Its columns are called the "loading" vectors (or eigenvectors). Size is mxm.
Λ is a diagonal matrix (size mxm) that contains the non-negative real eigenvalues of decreasing magnitude along its main diagonal, and zeros in the offdiagonal elements: \[ \lambda_1 \ge \lambda_2 \ge ... \ge \lambda_m \ge 0 \]

2.3.4. Loadings and scores

Let x be any row of X (an observation vector, size m). Let t be the projection of x into the matrix V: \[ t = xV \]

Projecting into x into V decouples the observation space into a new set of uncorrelated variables (the elements of t):

The i^th column of V is the loading vector p_i that transforms an observation x into the "score" t.

To distinguish between the transformed variables and the transformed observations, the transformed variables are called "principal components" and the individual transformed observations are the "scores".

2.3.5. Reducing dimensionality

Let a be an integer value between 1 and m. Let P be the matrix resulting from taking the first a columns of V, corresponding to the a largest eigenvalues of S. P size is mxa. \[ P = [V_1,...,V_a] \]

Choosing the value of a is choosing the number of principal components. This choice can be done in multiple ways. In this study we've chosen the number of principal components corresponding to the eigenvalues of S larger than one, which is the same as keeping only the principal components with variance above the average.

Let T be the projection of X into P, also known as the "scores" matrix. T size is nxa. \[ T = XP \]

2.3.6. Properties

Let T_i be the i^th column of T. The following properties can be shown:

Ordered variance: Var(T₁) ≥ Var(T₂) ≥ ... ≥ Var(T_a).
Mean of T_i is 0 for all i.
All components are orthogonal (T_i·T_j=0 for all i≠j).
There exists no other orthogonal expansion that captures more variation of the data (using a components).

2.3.7. Residual matrix

Let X̂ be the matrix resulting of projecting T back into the original m-dimensional space. Size of X̂ is nxm: \[ \hat{X} = TP^T \]

Let E be the residual matrix, computed from the difference between the original X and the back-projection of T. Size of E is nxm: \[ E = X - \hat{X} \]

2.4. Fault detection

Dimensional reduction provided by PCA (from m to a variables) represents nearly the same information in a lower dimension space. This new space can be used for multivariate process monitoring.

The Hotelling's T² statistic uses the a-dimension space (columns of the matrix T) to detect misbehaviours based on threshold trespassing.

The Q statistic uses the residual matrix (E) to monitor the portion of observation corresponding to the smallest eigenvalues of S (from number a to m).

2.4.1. Hotelling's T²

For every observation in the reduced space (t) we can associate a T² value, computed as:

\[ T^2 = \sum_{i=1}^{a}{ t_i \lambda_i^{-1} t_i^T } \]

The value of T² can be interpreted as the distance of this observation to the center of the model (mean).

Scores are scaled inversely proportional to the standard deviation along the eigenvectors (columns of P). This allows defining a scalar threshold to characterise the variability in the a-dimensional space.

Given a level of significance (α, chosen as 0.05 in this study), appropriate threshold values for the T² statistic can be determined. When the mean and variance are estimated from a sample of observations, the threshold is calculated as:

\[ T_\alpha^2 = \frac{a(n+1)(n-1)}{n(n-a)} F_\alpha(a,n-a) \] Where:

\[ F_\alpha(v_1,v_2) = \frac{\chi^2(v_1)/v_1}{\chi^2(v_2)/v_2} \]
v₁=a and v₂=n-a: are the degrees of freedom.
F_α(v₁,v₂) is the inverse F distribution function (Fisher-Snedecor).
α is also the ratio of false alarms.

2.4.2. Prediction error (Q or SPE)

Let r be the residual vector of an observation (x), computed from the back projection of its scores in the lower dimensional space. \[ r = x-\hat{x} \] \[ r = x-tP^T \] \[ r = x-xPP^T \] \[ r = x(I-PP^T) \] The Squared Prediction error (SPE) or Q statistic of an observation can be computed as: \[ Q = rr^T \] A threshold (Q_α) can be defined as: \[ Q_\alpha = \theta_1 ( \frac{ c_{1-\alpha} h_0 \sqrt{2\theta_2} }{\theta_1} + 1 + \frac{ \theta_2 h_0 (h_0-1) }{ \theta_1^2 } )^{\frac{1}{h_0}} \] Where:

c_1-α = boundary value that determines the area 1-α with mean=0 and standard deviation=1 in the normal distribution.
For example: c_0.95 = 1.644854 (in R can be computed as qnorm(0.95,0,1)).
\[ \theta_i = \sum_{j=a+1}^{m}{ \lambda_j^i } \]
\[ h_0 = 1 -2\theta_1 \frac{\theta_3}{3\theta_2^2} \]

2.4.3. Contribution analysis

The T² chart is useful to detect observations out of normal operation conditions. It is a quality measure of the data according to the model (the first a principal components). It is measured in the direction of the model. If T² increases, the observation preserves the model structure but with larger values ("far from the mean").

The Q chart detects structural changes in the process. It is a measure of the projection error. It is measured in the error direction (orthogonal to PCs). If Q increases, the correlation structure gathered by the PCA model (the first a components) is not preserved in that observation.

Contribution analysis answers which variable/s in the original space are responsible of the detected fault (threshold trespassing), which we refer to as "alarms" in our study.

2.4.3.1. Contribution analysis for T²

For one observation under fault, the contribution of one particular variable from the original space to the out of control status is: \[ T^2_{contribution_j} = \sum_{k=0}^{a} \frac{t_k}{\lambda_k}p_{kj} x_j \] Where:

j = number of the variable from the original (m-dimensional) space.
k = number of the variable from the transformed (a-dimensional) space.
λ_j = k^th largest eigenvalue of S.
x_j = j^th element of the obseravation vector x.
t_k = k^th element of the scores vector t.
p_kj = k^th, j^th element of P.

2.4.3.2. Contribution analysis for Q

For one observation vector x under fault, the components of the residual vector r are directly the contributions to Q. \[ Q_{contribution_j} = r_j \]

2.5. Implementation in a web interface

All the equations above have been implemented in Javascript. This language choice allows doing all the computation client-side, since it is native in all web browsers (e.g. Google Chrome, Mozilla Firefox, Microsoft Edge, etc.), regardless of the operating system, and type of device.

The starting point was creating a matrix library from scratch, to perform basic matrix operations (e.g., sum, multiplication, transposition, determinant, inverse...). Then, the SVD algorithm and PCA procedures were implemented. Then the Hotelling T² and Q statistics, along with the calculation of the contributions.

This matrix library is freely available at https://github.com/icra/matrix-library-js, but at the time of this writing is still not fully documented.

To validate the Javascript implementation, numeric tests were conducted using R (https://www.r-project.org/) since it is a well-known and stablished numeric platform with PCA functions builtin.

The matrix library was then loaded into the web interface, also created ad-hoc, coded in HTML and CSS, and using VueJS (https://vuejs.org/) as the frontend Javascript framework. This approach was selected to facilitate the integration of these developments with the SARSAIGUA platform as a superior layer of analysis, beyond current visualization options.

3.1. Raw time series (input data)

The evolution of the COVID-19 pandemic in Catalonia can be observed in raw and raw_cases. Between March 2020 and December 2022, seven waves have hit the Catalan population. The first wave started in March 2020, but it is out of the scope of the SARSAIGUA program. The evolution in terms of waves is both explained by reported cases and N1 loads detected in wastewater; the Spearman correlation between both signals is of 0.69, using data until the 6th wave. In the April 2022 the government stopped progressively reporting positive COVID-19 cases and launched a sentinel system based on screening of certain patients in key primary care health centres.

The most dominant variant of SARS-CoV-2 have been: Delta (until January 2022), Omicron BA.1 (until mid-March 2022), Omicron BA.2 (until mid-June 2022), Omicron BA.4 and BA.5 (until November 2022), and Omicron BQ.1 (current at the time of this writing).

raw Time series: N1 (CG/inhabitant) at each WWTP, from July 2020 to December 2022. Each WWTP is mean centered and scaled to unit variance.

raw_cases Time series: reported clinical cases per 1000 inhabitants at the municipalities served by each WWTP, from July 2020 to April 2022. Each WWTP is mean centered and scaled to unit variance.

3.2. PCA results

PCA of N1/inhabitant was run with an input matrix X of 125 observations (rows) and 51 WWTPs (columns). Taking 11 PCs (with associated eigenvalue>1), 82% of the variance was retained (see eigenvalues).

loadings_and_scores (left) shows the loadings from PC1 vs PC2. All WWTPs contribute positively to PC1; no clear clustering exists. But PC2 shows some clusters: (1) WWTPs with positive contribution to PC2 and (2) WWTPs with negative contribution to PC2.

PC1 explains 32% of the variance. This signal is equivalent to the mean of each observation (after scaling it with the standard deviation of the first principal component).

PC2 explains 18% of the variance and distributes WWTPs along the vertical axis. PC2 captures differences amongst WWTP groups. In particular, PC2 clusters WWTPs that did not detect 2nd, 3rd and 7th wave very well, but had a very strong detection of 4th, 5th and 6th wave, and the other cluster is the opposite (see clusters_PC2).

eigenvalues Eigenvalues, percent variance explained per PC and accumulated variance per PC.


          loadings_and_scores
          Loadings and Scores (PC1 vs PC2).

scores Scores (time series) of the first 11 principal components.

clusters_PC2 Time series of PC2's scores (purple), with raw data of WWTPs with positive contribution to PC2 (red), and raw data of WWTPs with negative contribution to PC2 (blue). Blue lines have been "mirrored" respect the horizontal (x) axis (multiplied by -1), so they can be seen without overlaping with the red lines.

3.3. Detection of outbreaks

3.3.1. Control charts

TO DO


          
          
          control_charts
          Control chart (top: N1/inhabitant, bottom: clinical cases)
          with Q and T² and respective thresholds.
          Note that reporting of cases stopped on 2022-08-01.

3.3.2. Contributions of the different WWTPs to Q and T² alarms

One might wonder if the same subset of WWTPs is consistently triggering alarms; the response is not. contributions shows the contribution of each WWTP to each alarm. The triggering of alarms is generated by different subsets of WWTPs. In the case of T², the fact that some WWTPs contribute little to trigger alarms means that these follow the general pattern of N1 concentrations (WWTPs pushing towards the center of the 11-dimensional space); in general these are either the larger WWTPs (DPDL or DBSS) or the ones closer to the Barcelona region (mobility in Catalonia is dominated by trips from and to the Barcelona region). With regards to the Q-statistic, we cannot see a clear subgroup of WWTPs prone to earlier detect outbreaks (higher sensitivity in detecting changes in N1 concentrations).

contributions Contributions to alarms by each WWTP, sorted by serviced population in descending order. Intensity of color is with respect to the highest contribution among all alarms. Red cells correspond to alarms for Q. Blue cells correspond to alarms for T².

The fact that the Q alarms when executed on the clinical reported cases and on wastewater match well, indicates that the ranking in the Q-statistic is more related to the spread of each wave across the territory rather than the sensitivity of the WWTPs. It is also possible that sampling or analytical errors lead to unexpectedly higher or smaller N1 concentrations; such poor data quality values would influence the result of the PCA and trigger false alarms. The developed system allows detecting such abnormal behaviour and tracing back the origin of the alarm to a single WWTP (in that case). Hence, a manual check is needed for each alarm generated, and if generated by a single WWTP or by WWTPs analyzed by a single laboratory, a verification of data quality needs to be conducted before triggering an alarm and communicating to health authorities.

Such a situation happened in one observation, where an unexpected increase was observed (observation at 2022-08-08 for the WWTP "DPDL", see outlier). All analytical parameters were within the quality control ranges; the only explanation was the potential discharge of sewage from cruise ships which discharge to the WWTP of DPDL. Yet, the system allowed identifying such abnormal behaviour and go back to revise data quality from that observation. Overall, the message is that the more WWTPs included in a surveillance network the better. The method proposed in this paper allows capturing relevant information from all WWTPs included in a surveillance network to anticipate outbreaks.


            outlier
            Values of N1/inhabitant at WWTP "DPDL". It spikes at 2022-08-08. The
            alarm system triggers an alarm for Q. Below, the contribution
            decomposition by WWTP for the alarm triggered at that day.  It
            reveals that the largest contribution by far was almost entirely by
            that WWTP.

Watching the guards: Developing a Web-Based Alarm System for Early Detection of SARS-CoV-2 in Sewage Using Principal Component Analysis and Threshold Trespassing

1. Introduction

2. Methods

2.1. Data sources

2.1.1 WWTP data

2.1.2 Clinical cases

2.2. Gap filling

2.3. Principal Component Analysis

2.3.1. Pretreatment

2.3.2. Covariance and correlation matrix

2.3.3. Eigenvalues and eigenvectors

2.3.4. Loadings and scores

2.3.5. Reducing dimensionality

2.3.6. Properties

2.3.7. Residual matrix

2.4. Fault detection

2.4.1. Hotelling's T²

2.4.2. Prediction error (Q or SPE)

2.4.3. Contribution analysis

2.4.3.1. Contribution analysis for T²

2.4.3.2. Contribution analysis for Q

2.5. Implementation in a web interface

3. Results and Discussion

3.1. Raw time series (input data)

3.2. PCA results

3.3. Detection of outbreaks

3.3.1. Control charts

3.3.2. Contributions of the different WWTPs to Q and T² alarms

4. Conclusions

5. References

6. Acknowledgements and others

7. Appendix

Watching the guards: Developing a Web-Based Alarm System for Early Detection of SARS-CoV-2 in Sewage Using Principal Component Analysis and Threshold Trespassing

1. Introduction

2. Methods

2.1. Data sources

2.1.1 WWTP data

2.1.2 Clinical cases

2.2. Gap filling

2.3. Principal Component Analysis

2.3.1. Pretreatment

2.3.2. Covariance and correlation matrix

2.3.3. Eigenvalues and eigenvectors

2.3.4. Loadings and scores

2.3.5. Reducing dimensionality

2.3.6. Properties

2.3.7. Residual matrix

2.4. Fault detection

2.4.1. Hotelling's T2

2.4.2. Prediction error (Q or SPE)

2.4.3. Contribution analysis

2.4.3.1. Contribution analysis for T2

2.4.3.2. Contribution analysis for Q

2.5. Implementation in a web interface

3. Results and Discussion

3.1. Raw time series (input data)

3.2. PCA results

3.3. Detection of outbreaks

3.3.1. Control charts

3.3.2. Contributions of the different WWTPs to Q and T2 alarms

4. Conclusions

5. References

6. Acknowledgements and others

7. Appendix

2.4.1. Hotelling's T²

2.4.3.1. Contribution analysis for T²

3.3.2. Contributions of the different WWTPs to Q and T² alarms