A.3.3 Split-sample testing schemes
One special type of proficiency testing design that is often used by participants' customers and some regulatory bodies is the “split-sample” design.
NOTE This design is not to be confused with a split-level design, which is discussed in A.3.2.
Typically, split-sample proficiency testing involves comparisons of the data produced by small groups of participants (often only two). In these proficiency testing schemes, samples of a product or a material are divided into two or more parts, with each participant testing one part of the sample (see Figure A.1, model 5). Uses for this type of scheme include identifying poor accuracy, describing consistent bias and verifying the effectiveness of corrective actions. This design may be used to evaluate one or both participants as suppliers of testing services, or in cases where there are too few participants for appropriate evaluation of results.
Under such schemes, one of the participants may be considered to operate at a higher metrological level (i.e. lower measurement uncertainty), due to the use of reference methodology and more advanced equipment, etc., or through confirmation of its own performance through satisfactory participation in a recognized interlaboratory comparison scheme. Its results are considered to be the assigned values in such comparisons and it may act as an advisory or mentor laboratory to the other participants that compare split-sample data with it.
A.3.4 Partial-process schemes
Special types of proficiency testing involve the evaluation of participants' abilities to perform parts of the overall testing or measurement process. For example, some existing proficiency testing schemes evaluate participants' abilities to transform and report a given set of data (rather than conduct the actual test or measurement), to make interpretations based on a given set of data or proficiency testing items, such as stained blood films for diagnosis, or to take and prepare samples or specimens in accordance with a specification.
A.4 External quality assessment (EQA) programmes
EQA programmes (such as those provided for laboratory medicine examinations) offer a variety of interlaboratory comparison schemes based on this traditional proficiency testing model, but with broader application of the schemes described in A.2 and A.3 and illustrated in Figure A.1. Many EQA programmes are designed to provide insight into the complete path of workflow of the laboratory, and not just the testing processes. Most EQA programmes are continuous schemes that include long term follow-up of laboratory performance. A typical feature of EQA programmes is to provide education to participants and promote quality improvement. Advisory and educational comments comprise part of the report returned to participants to achieve this aim.
Some EQA programmes assess performance of pre-analytical and post-analytical phases of testing, as well as the analytical phase. In such EQA programmes, the nature of the proficiency test item may differ significantly from that used in traditional proficiency testing schemes. The “proficiency test item” may be a questionnaire or case study (see Figure A.1, model 3) circulated by the EQA provider to each participant for return of specific answers. Alternatively, pre-analytical information may accompany the proficiency test item, requiring the participant to select an appropriate approach to testing or interpretation of results, and not just to
perform the test. In “sample review” schemes, participants may be required to provide the “proficiency test items” to the EQA provider (see Figure A.1, model 4). This may take the form of a processed specimen or sample (e.g. stained slide or fixed tissue), laboratory data (e.g. test results, laboratory reports or quality assurance/control records) or documentation (e.g. procedures or method verification criteria).
a Depending on how the assigned value is derived, it will be determined either prior to the distribution of the proficiency test items or after the return of participant results.
Figure A.1 — Examples of common types of proficiency testing schemes
Annex B
(informative)
Statistical methods for proficiency testing
B.1 General
Proficiency test results can appear in many forms, spanning a wide range of data types and underlying statistical distributions. The statistical methods used to analyse the results need to be appropriate for each situation, and so are too varied to be specified in this International Standard. ISO 13528 describes preferred specific methods for each of the situations discussed below, but also states that other methods may be used as long as they are statistically valid and are fully described to participants. Some of the methods in ISO 13528, especially for homogeneity and stability testing, are modified slightly in the IUPAC2) Technical Report “The International Harmonized Protocol for the proficiency testing of analytical chemistry laboratories”[18]. These documents also present guidance on design and visual data analysis. Other references may be consulted for specific types of proficiency testing schemes, e.g. measurement comparison schemes for calibration.
The methods discussed in this annex and in the referenced documents cover the fundamental steps common to nearly all proficiency testing schemes, i.e.
determination of the assigned value,
calculation of performance statistics,
evaluation of performance, and
preliminary determination of proficiency test item homogeneity and stability.
With new proficiency testing schemes, initial agreement between results is often poor, due to new questions, new forms, artificial test items, poor agreement of test or measurement methods, or variable measurement procedures. Coordinators may have to use robust indicators of relative performance (such as percentiles) until agreement improves. Statistical methods may need to be refined once participant agreement has improved and proficiency testing is well established.
This annex does not consider statistical methods for analytical studies other than for treatment of proficiency test data. Different methods may be needed to implement the other uses of interlaboratory comparison data listed in the Introduction.
B.2 Determination of the assigned value and its uncertainty
There are various procedures available for the establishment of assigned values. The most common procedures are listed below in an order that, in most cases, will result in increasing uncertainty for the assigned value. These procedures involve the use of:
known values - with results determined by specific proficiency test item formulation (e.g. manufacture or dilution);
certified reference values - as determined by definitive test or measurement methods (for quantitative tests);
reference values - as determined by analysis, measurement or comparison of the proficiency test item alongside a reference material or standard, traceable to a national or international standard;
consensus values from expert participants - experts (which may, in some situations, be reference laboratories) should have demonstrable competence in the determination of the measurand(s) under test, using validated methods known to be highly accurate and comparable to methods in general use;
consensus values from participants - using statistical methods described in ISO 13528 and the IUPAC International Harmonized Protocol, and with consideration of the effects of outliers.
Assigned values should be determined to evaluate participants fairly, yet to encourage agreement among test or measurement methods. This is accomplished through selection of common comparison groups and the use of common assigned values, wherever possible.
Procedures for determining the uncertainty of assigned values are discussed in detail in ISO 13528 and the IUPAC International Harmonized Protocol, for each common statistic used (as mentioned above). Additional information on uncertainty is also provided in ISO/IEC Guide 98-3.
Statistical methods for determining the assigned value for qualitative data (also called “categorical” or “nominal” values), or semi-quantitative values (also called “ordinal” values) are not discussed in ISO 13528 or the IUPAC International Harmonized Protocol. In general, these assigned values need to be determined by expert judgement or manufacture. In some cases, a proficiency testing provider may use a consensus value, as defined by agreement of a predetermined majority percentage of responses (e.g. 80% or more). However, the percentage used should be determined based on objectives for the proficiency testing scheme and the level of competence and experience of the participants.
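As an informal illustration (not part of this International Standard), the sketch below determines a qualitative consensus value only when a predetermined majority percentage of responses agrees; the 80 % threshold and the data are hypothetical.

```python
from collections import Counter

def consensus_value(responses, threshold=0.8):
    """Return the consensus response if a predetermined majority
    (80 % by default, a scheme-specific choice) agrees; else None."""
    counts = Counter(responses)
    value, n = counts.most_common(1)[0]   # most frequent response and its count
    if n / len(responses) >= threshold:
        return value
    return None

# 9 of 10 participants report "positive": consensus is reached.
results = ["positive"] * 9 + ["negative"]
assigned = consensus_value(results)
```

If no response reaches the threshold, the function returns no assigned value, reflecting the need for expert judgement in such cases.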
Outliers are statistically treated as described below.
Obvious blunders, such as those with incorrect units, decimal point errors, and results for a different proficiency test item should be removed from the data set and treated separately. These results should not be subject to outlier tests or robust statistical methods.
When participants' results are used to determine assigned values, statistical methods should be in place to minimize the influence of outliers. This can be accomplished with robust statistical methods or by removing outliers prior to calculation. In larger or routine proficiency testing schemes, it may be possible to have automated outlier screens, if justified by objective evidence of effectiveness.
If results are removed as outliers, they should be removed only for calculation of summary statistics. These results should still be evaluated within the proficiency testing scheme and be given the appropriate performance evaluation.
NOTE ISO 13528 describes a specific robust method for determination of the consensus mean and standard deviation, without the need for outlier removal.
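To illustrate the idea of outlier-resistant consensus statistics, the sketch below uses the median and the scaled median absolute deviation (MAD). This is a simplified stand-in, not the specific robust method described in ISO 13528; the data are hypothetical.

```python
import statistics

def robust_location_scale(values):
    """Simple robust consensus estimates: the median for location and
    1.483 * MAD for scale (the factor makes MAD consistent with the
    standard deviation for normally distributed data)."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return med, 1.483 * mad

# One gross outlier (99.0) barely affects the robust estimates.
data = [10.1, 9.8, 10.0, 10.2, 9.9, 99.0]
x_pt, s_rob = robust_location_scale(data)
```

The ordinary mean and standard deviation of the same data would be dominated by the single outlier, whereas the robust estimates remain close to the bulk of the results.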
Other considerations are outlined below.
Ideally, if assigned values are determined by participant consensus, the proficiency testing provider should have a procedure to establish the trueness of the assigned values and for reviewing the distribution of the data.
The proficiency testing provider should have criteria for the acceptability of an assigned value in terms of its uncertainty. In ISO 13528 and in the IUPAC International Harmonized Protocol, criteria are provided that are based on a goal to limit the effect that uncertainty in the assigned value has on the evaluation, i.e. the criteria limit the probability of having a participant receive an unacceptable evaluation because of uncertainty in the assigned value.
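For illustration, a criterion of the form used in ISO 13528 (the standard uncertainty of the assigned value should not exceed 0,3 times the standard deviation for proficiency assessment, so that it has a negligible effect on evaluations) can be checked as follows; the numerical values are hypothetical.

```python
def assigned_value_uncertainty_ok(u_x, sigma_pt, factor=0.3):
    """Check whether the standard uncertainty of the assigned value (u_x)
    is small enough relative to the standard deviation for proficiency
    assessment (sigma_pt) to be neglected in evaluations."""
    return u_x <= factor * sigma_pt

ok = assigned_value_uncertainty_ok(u_x=0.05, sigma_pt=0.25)        # 0.05 <= 0.075
too_large = assigned_value_uncertainty_ok(u_x=0.12, sigma_pt=0.25) # 0.12 > 0.075
```

When the criterion is not met, the provider should either reduce the uncertainty of the assigned value or take it into account in the evaluation.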
B.3 Calculation of performance statistics
B.3.1 Performance for quantitative results
Proficiency test results often need to be transformed into a performance statistic, in order to aid interpretation and to allow comparison with defined objectives. The purpose is to measure the deviation from the assigned value in a manner that allows comparison with performance criteria. Statistical methods may range from no processing required to complex statistical transformations.
Performance statistics should be meaningful to participants. Therefore, statistics should be appropriate for the relevant tests and be well understood or traditional within a particular field.
Commonly used statistics for quantitative results are listed below, in order of increasing degree of transformation of participants' results.
The difference, D, is calculated using Equation (B.1):
D = x - X (B.1)
where
x is the participant's result;
X is the assigned value.
The percent difference, D%, is calculated using Equation (B.2):
D% = [(x - X)/X] × 100 (B.2)
The z scores are calculated using Equation (B.3):
z = (x - X)/σ (B.3)
where σ is the standard deviation for proficiency assessment.
As described in ISO 13528, σ can be calculated from the following:
a fitness for purpose goal for performance, as determined by expert judgement or regulatory mandate (prescribed value);
an estimate from previous rounds of proficiency testing or expectations based on experience (by perception);
an estimate from a statistical model (general model);
the results of a precision experiment; or
participant results, i.e. a traditional or robust standard deviation based on participant results.
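The three statistics above can be sketched directly from their equations; the following Python fragment is illustrative only, with hypothetical values for the participant result, assigned value and standard deviation for proficiency assessment.

```python
def difference(x, X):
    """Equation (B.1): simple difference from the assigned value."""
    return x - X

def percent_difference(x, X):
    """Equation (B.2): difference as a percentage of the assigned value."""
    return (x - X) / X * 100

def z_score(x, X, sigma):
    """Equation (B.3): difference scaled by the standard deviation
    for proficiency assessment."""
    return (x - X) / sigma

# Participant reports 10.6; assigned value 10.0; sigma for proficiency
# assessment 0.5 (all hypothetical).
D = difference(10.6, 10.0)              # approx. 0.6
D_pct = percent_difference(10.6, 10.0)  # approx. 6.0 %
z = z_score(10.6, 10.0, 0.5)            # approx. 1.2
```

Here |z| ≤ 2,0, so the result would conventionally be evaluated as satisfactory.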
The zeta score, ζ, is calculated using Equation (B.4), where the calculation is very similar to the E_n number [see Equation (B.5) below], except that standard uncertainties are used rather than expanded uncertainties. This allows the same interpretation as for traditional z scores.
ζ = (x - X)/√(u_lab² + u_av²) (B.4)
where
u_lab is the combined standard uncertainty of a participant's result;
u_av is the standard uncertainty of the assigned value.
E_n numbers are calculated using Equation (B.5):
E_n = (x - X)/√(U_lab² + U_ref²) (B.5)
where
U_lab is the expanded uncertainty of a participant's result;
U_ref is the expanded uncertainty of the reference laboratory's assigned value.
NOTE 1 The formulae in Equations (B.4) and (B.5) are correct only if x and X are independent.
NOTE 2 For additional statistical approaches, see ISO 13528 and the IUPAC International Harmonized Protocol.
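Equations (B.4) and (B.5) can be sketched as follows; this is an illustration only, and all numerical values are hypothetical.

```python
import math

def zeta_score(x, X, u_lab, u_av):
    """Equation (B.4): deviation scaled by the standard uncertainties
    of the participant result and the assigned value, combined in quadrature."""
    return (x - X) / math.sqrt(u_lab**2 + u_av**2)

def en_number(x, X, U_lab, U_ref):
    """Equation (B.5): the same form, but using expanded uncertainties."""
    return (x - X) / math.sqrt(U_lab**2 + U_ref**2)

# Hypothetical example: the same deviation evaluated both ways.
zeta = zeta_score(10.6, 10.0, u_lab=0.2, u_av=0.1)
En = en_number(10.6, 10.0, U_lab=0.4, U_ref=0.2)
```

Because ζ uses standard uncertainties, it is interpreted with the same limits as z scores, while |E_n| is compared with 1,0.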
The aspects below should be taken into consideration.
The simple difference between the participant's result and the assigned value may be adequate to determine performance, and is most easily understood by participants. The quantity (x - X) is called the “estimate of laboratory bias” in ISO 5725-4 and ISO 13528.
The percent difference is independent of the magnitude of the assigned value, and is well understood by participants.
Percentiles or ranks are useful for highly dispersed or skewed results, ordinal responses, or when there is a limited number of different responses. This method should be used with caution.
Transformed results may be preferred, or necessary, depending on the nature of the test. For example, dilution-based results are a form of geometric scale, transformable by logarithms.
If consensus is used to determine σ, the estimates of variability should be reliable, i.e. based on enough observations to reduce the influence of outliers and achieve sufficiently low uncertainty.
If scores consider the participants' reported estimates of measurement uncertainty (e.g. with E_{n} scores or zeta scores), these will only be meaningful if the uncertainty estimates are determined in a consistent manner by all participants, such as in accordance with the principles in ISO/IEC Guide 98-3.
B.3.2 Performance for qualitative and semi-quantitative results
B.3.2.1 For qualitative or semi-quantitative results, if statistical methods are used, they must be appropriate for the nature of the responses. For qualitative data (also called “categorical” data), the appropriate technique is to compare a participant's result with the assigned value. If they are identical, then performance is acceptable. If they are not identical, then expert judgement is needed to determine if the result is fit for its intended use. In some situations, the proficiency testing provider may review the results from participants and determine that a proficiency test item was not suitable for evaluation, or that the assigned value was not correct. These determinations should be part of the plan for the scheme and understood by the participants in advance of the operation of the scheme.
B.3.2.2 For semi-quantitative results (also called “ordinal” results), the techniques used for qualitative data (B.3.2.1) are appropriate. Ordinal results include, for example, responses such as grades or rankings, sensory evaluations, or the strength of a chemical reaction (e.g. 1+, 2+, 3+, etc.). Sometimes these responses are given as numbers, e.g. 1 = Poor, 2 = Unsatisfactory, 3 = Satisfactory, 4 = Good, 5 = Very Good.
B.3.2.3 It is not appropriate to calculate usual summary statistics for ordinal data, even if the results are numerical. This is because the numbers are not on an interval scale, i.e. the difference between 1 and 2, in some objective sense, may not be the same as the difference between 3 and 4, so averages and standard deviations cannot be interpreted. Therefore, it is not appropriate to use evaluation statistics such as z scores for semi-quantitative results. Specific statistics, such as rank or order statistics, designed for ordinal data, should be used.
B.3.2.4 It is appropriate to list the distribution of results from all participants (or produce a graph), along with the number or percentage of results in each category, and to provide summary measures, such as the modes (most common responses) and range (lowest and highest response). It may also be appropriate to evaluate results as acceptable based on closeness to the assigned value, e.g. results within plus or minus one response from the assigned value may be fit for the purpose of the measurement. In some situations, it may be appropriate to evaluate performance based on percentiles, e.g. the 5 % of results farthest from the mode or farthest from the assigned value may be determined to be unacceptable. This should be based on the proficiency testing scheme plan (i.e. fitness for purpose) and understood by participants in advance.
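The ordinal summary measures described above can be sketched as follows; the response data, the 1 to 5 scale and the plus-or-minus-one acceptance rule are hypothetical, scheme-specific choices.

```python
from collections import Counter

# Ordinal responses on a hypothetical 1-5 scale; assigned value 3.
responses = [3, 3, 4, 2, 3, 5, 3, 4, 1, 3]
assigned = 3

counts = Counter(responses)
mode = counts.most_common(1)[0][0]          # most common response
low, high = min(responses), max(responses)  # range of responses

# Evaluate each result as acceptable if within +/- 1 of the assigned
# value (a fitness-for-purpose choice set in the scheme plan).
acceptable = [abs(r - assigned) <= 1 for r in responses]
pct_acceptable = 100 * sum(acceptable) / len(responses)
```

Note that no mean or standard deviation is calculated, consistent with the caution above about interval scales.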
B.3.3 Combined performance scores
Performance may be evaluated on the basis of more than one result in a single proficiency testing round. This occurs when there is more than one proficiency test item for a particular measurand, or a family of related measurands. This would be done to provide a more comprehensive evaluation of performance.
Graphical methods, such as the Youden plot or a plot showing Mandel's h-statistics, are effective tools for interpreting performance (see ISO 13528).
In general, averaged performance scores are discouraged because they can mask poor performance on one or more proficiency test items that should be investigated. The most commonly used combined performance score is simply the number (or percentage) of results determined to be acceptable.
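The preferred combined score described above can be sketched as follows; the z scores and the |z| ≤ 2,0 acceptance limit are hypothetical illustrations.

```python
def percent_acceptable(z_scores, limit=2.0):
    """Combined performance score: the percentage of results in a round
    evaluated as acceptable (here |z| <= 2.0), rather than an averaged
    score that could mask an individual poor result."""
    ok = sum(1 for z in z_scores if abs(z) <= limit)
    return 100 * ok / len(z_scores)

scores = [0.4, -1.1, 2.5, 0.2]    # one result generates a warning signal
pct = percent_acceptable(scores)
```

An average of these four z scores would be small and unremarkable, whereas the percentage-acceptable score makes the single questionable result visible.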
B.4 Evaluation of performance
B.4.1 Initial performance
Criteria for performance evaluation should be established after taking into account whether the performance measure involves the following features:
expert consensus, where the advisory group, or other qualified experts, directly determine whether reported results are fit for their intended purpose; agreement of experts is the typical way to assess results for qualitative tests;
fitness for purpose, predetermined criteria that consider, for example, method performance specifications and participants' recognized level of operation;
statistical determination for scores, i.e. where criteria should be appropriate for each score; common examples of application of scores are as follows:
1) for z scores and zeta scores (for simplicity, only “z” is indicated in the examples below, but “ζ” may be substituted for “z” in each case):
|z| ≤ 2,0 indicates “satisfactory” performance and generates no signal;
2,0 < |z| < 3,0 indicates “questionable” performance and generates a warning signal;
|z| ≥ 3,0 indicates “unsatisfactory” performance and generates an action signal;
2) for E_n numbers:
|E_n| ≤ 1,0 indicates “satisfactory” performance and generates no signal;
|E_n| > 1,0 indicates “unsatisfactory” performance and generates an action signal.
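The conventional evaluation limits above can be sketched as a small classification function; this is an illustration, not a normative implementation.

```python
def evaluate_z(z):
    """Classify a z score (or zeta score) using the conventional limits."""
    if abs(z) <= 2.0:
        return "satisfactory"       # no signal
    if abs(z) < 3.0:
        return "questionable"       # warning signal
    return "unsatisfactory"         # action signal

def evaluate_en(en):
    """Classify an E_n number: |E_n| <= 1.0 is satisfactory."""
    return "satisfactory" if abs(en) <= 1.0 else "unsatisfactory"
```

For example, a z score of -2,4 would generate a warning signal, while an E_n number of -1,2 would generate an action signal.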
For split-sample designs, an objective may be to identify in results inadequate calibration or large random fluctuation, or both. In these cases, evaluations should be based on an adequate number of results and across a wide range of concentrations. Graphical presentations are useful for identifying and describing these problems, and are described in ISO 13528. These graphs should use differences between results on the vertical axis, rather than plots of results from one participant versus another, because of problems of scale. One key consideration is whether results from one of the participants have, or can be expected to have, lower measurement uncertainty. In this case, those results are the best estimate of the actual level of measurand. If both participants have approximately the same measurement uncertainty, the average of the pair of results is the preferred estimate of actual level.
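The recommended presentation for split-sample data (differences on the vertical axis, plotted against the best estimate of level) can be prepared as follows; the paired results are hypothetical, and the pair means are used as the level estimate on the assumption that both participants have similar measurement uncertainty.

```python
# Paired results from two participants (A, B) on split samples across
# a range of concentrations (hypothetical data).
pairs = [(2.1, 2.0), (5.3, 4.9), (10.2, 9.6), (20.5, 19.4)]

# Best estimate of the actual level: the mean of each pair, assuming
# both participants have approximately the same uncertainty.
levels = [(a + b) / 2 for a, b in pairs]

# Vertical-axis values: differences between the paired results.
diffs = [a - b for a, b in pairs]

# A difference that grows with level suggests a relative calibration bias.
mean_diff = sum(diffs) / len(diffs)
```

Plotting diffs against levels (rather than one participant's results against the other's) avoids the scale problems noted above and makes a level-dependent bias easy to see.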
Graphs should be used whenever possible to show performance (e.g. histograms, error bar charts, ordered z score charts), as described in ISO 13528 and the IUPAC International Harmonized Protocol. These charts can be used to show:
distributions of participant values;
relationship between results on multiple proficiency test items;
comparative distributions for different methods.
B.4.2 Monitoring performance over time
A proficiency testing scheme can include procedures to monitor performance over time. The procedures should allow participants to see the variability in their performance, whether there are general trends or inconsistencies, and whether the performance varies randomly.
Graphical methods should be used to facilitate interpretation by a wider variety of readers. Traditional “Shewhart” control charts are useful, particularly for self-improvement purposes. Data listings and summary statistics allow more detailed review. Standardized performance scores used to evaluate performance, such as the z score, should be used for these graphs and tables. ISO 13528 presents additional examples and graphical tools.
Where a consensus standard deviation is used as the standard deviation for proficiency testing, caution should be taken when monitoring performance over time, as the participant group can change, and can have unknown effects on the scores. It is also common for the interlaboratory standard deviation to decrease over time, as participants become familiar with the proficiency testing scheme or as methodology improves. This could cause an apparent increase in z scores, when a participant's individual performance has not changed.
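The caution above can be made concrete with a small numerical illustration (all values hypothetical): a participant with a constant bias sees its z scores grow over successive rounds solely because the consensus standard deviation shrinks.

```python
# A participant with a constant bias of +0.6 against the assigned value.
bias = 0.6

# Consensus standard deviation shrinking over successive rounds as the
# participant group improves (hypothetical values).
sigma_by_round = [0.6, 0.5, 0.4, 0.3]

# The z score grows even though the participant's own performance is
# unchanged: approx. 1.0, 1.2, 1.5, 2.0.
z_by_round = [bias / s for s in sigma_by_round]
```

The apparent deterioration here is entirely an artefact of the changing comparison group, which is why scores monitored over time should be interpreted with care when σ is consensus-based.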
B.5 Demonstration of proficiency test item homogeneity and stability
The requirements of this International Standard call for a demonstration of “sufficient homogeneity” with valid statistical methods, including a statistically random selection of a representative number of samples. Procedures for this are detailed in ISO 13528 and the IUPAC International Harmonized Protocol. These documents define “sufficient homogeneity” relative to the evaluation interval for the proficiency testing scheme, and so the recommendations are based on allowances for uncertainty due to inhomogeneity relative to the evaluation interval. While ISO 13528 places a strict limit on inhomogeneity and instability to limit the effect on uncertainty and therefore the effect it has on the evaluation, the IUPAC International Harmonized Protocol expands the criteria to allow a statistical test of the estimate of inhomogeneity and instability, relative to the same criterion recommended in ISO 13528.
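A typical homogeneity check of this kind can be sketched as a one-way analysis of variance on replicate measurements of randomly selected proficiency test items; the duplicate data below and the comparison of the between-sample standard deviation with 0,3 σ_pt (a criterion of the form used in ISO 13528) are illustrative only.

```python
import statistics

def between_sample_sd(samples):
    """Estimate the between-sample standard deviation s_s by one-way
    ANOVA, for an equal number of replicate measurements per sample."""
    m = len(samples[0])                                   # replicates per sample
    means = [statistics.mean(s) for s in samples]
    msb = m * statistics.variance(means)                  # between-sample mean square
    msw = statistics.mean(statistics.variance(s) for s in samples)  # within
    return max(0.0, (msb - msw) / m) ** 0.5

# Duplicate measurements on four randomly selected proficiency test items
# (hypothetical data).
samples = [(10.0, 10.1), (10.2, 10.1), (9.9, 10.0), (10.1, 10.2)]
s_s = between_sample_sd(samples)

# Sufficiently homogeneous if s_s does not exceed 0.3 * sigma_pt.
sigma_pt = 0.5
sufficiently_homogeneous = s_s <= 0.3 * sigma_pt
```

The `max(0.0, ...)` guard handles the common case where the within-sample mean square exceeds the between-sample mean square, giving an estimate of zero rather than a negative variance.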
The requirements in ISO Guide 34 and ISO Guide 35, which concern determining reference values for certified reference materials (including their uncertainties), serve different needs. ISO Guide 35 uses statistical analysis of variance to estimate the “bottle-to-bottle” variability and “within-bottle” variability (as appropriate), and subsequently uses those variances as components of the uncertainty of the assigned value. Given the need to estimate components accurately for certified reference materials, the number of randomly selected samples may exceed what is needed for proficiency testing, where the main objective is to check for unexpected inconsistencies in batches of manufactured proficiency test items.