LECTURE 4.
Probabilistic downscaling: verification methods and metrics. Methods for probabilistic forecasts
A probabilistic forecast gives a probability of an event occurring, with a value between 0 and 1 (or 0 and 100%). In general, it is difficult to verify a single probabilistic forecast. Instead, a set of probabilistic forecasts, p_i, is verified against observations of whether each event occurred (o_i = 1) or did not occur (o_i = 0).
An accurate probability forecast system has:
reliability - agreement between forecast probability and mean observed frequency
sharpness - tendency to forecast probabilities near 0 or 1, as opposed to values clustered around the mean
resolution - ability of the forecast to resolve the set of sample events into subsets with characteristically different outcomes.
1. Reliability (attributes) diagram
Reliability diagram - called an "attributes diagram" when the no-resolution and no-skill-w.r.t.-climatology lines are included.
The reliability diagram plots the observed frequency against the forecast probability, where the range of forecast probabilities is divided into K bins (for example, 0-5%, 5-15%, 15-25%, etc.).
The sample size in each bin is often included as a histogram or values beside the data points.
Answers the question: How well do the predicted probabilities of an event correspond to their observed frequencies?
Characteristics: Reliability is indicated by the proximity of the plotted curve to the diagonal. The deviation from the diagonal gives the conditional bias. If the curve lies below the line, this indicates overforecasting (probabilities too high); points above the line indicate underforecasting (probabilities too low). The flatter the curve in the reliability diagram, the less resolution it has. A forecast of climatology does not discriminate at all between events and non-events, and thus has no resolution. Points between the "no skill" line and the diagonal contribute positively to the Brier skill score. The frequency of forecasts in each probability bin (shown in the histogram) shows the sharpness of the forecast.
The reliability diagram is conditioned on the forecasts (i.e., given that X was predicted, what was the outcome?), and can be expected to give information on the real meaning of the forecast. It is a good partner to the ROC, which is conditioned on the observations.
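As a minimal Python sketch of the binning procedure described above (the function name and the use of equal-width bins are illustrative assumptions; the lecture's example uses unequal bins):

    import numpy as np

    def reliability_curve(p, o, n_bins=10):
        # p: forecast probabilities in [0, 1]; o: binary outcomes (1 = event occurred)
        p, o = np.asarray(p, float), np.asarray(o, float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        # Assign each forecast to a probability bin (p = 1 falls in the last bin)
        idx = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
        mean_p = np.array([p[idx == k].mean() if (idx == k).any() else np.nan
                           for k in range(n_bins)])
        # Observed frequency per bin, plotted against mean_p on the diagram
        obs_freq = np.array([o[idx == k].mean() if (idx == k).any() else np.nan
                             for k in range(n_bins)])
        # Sample size per bin: the histogram that reveals sharpness
        counts = np.bincount(idx, minlength=n_bins)
        return mean_p, obs_freq, counts

Plotting obs_freq against mean_p and comparing the curve with the diagonal gives the reliability diagram; the counts histogram shows the sharpness of the forecasts.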
2. Relative operating characteristic
Relative operating characteristic – Plot hit rate (POD) vs false alarm rate (POFD), using a set of increasing probability thresholds (for example, 0.05, 0.15, 0.25, etc.) to make the yes/no decision. The area under the ROC curve is frequently used as a score.
Answers the question: What is the ability of the forecast to discriminate between events and non-events?
ROC: Perfect: Curve travels from bottom left to top left of diagram, then across to top right of diagram. Diagonal line indicates no skill.
ROC area: Range: 0 to 1, 0.5 indicates no skill. Perfect score: 1
Characteristics: ROC measures the ability of the forecast to discriminate between two alternative outcomes, thus measuring resolution. It is not sensitive to bias in the forecast, so says nothing about reliability. A biased forecast may still have good resolution and produce a good ROC curve, which means that it may be possible to improve the forecast through calibration. The ROC can thus be considered as a measure of potential usefulness. The ROC is conditioned on the observations (i.e., given that Y occurred, what was the corresponding forecast?). It is therefore a good companion to the reliability diagram, which is conditioned on the forecasts.
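A sketch of the thresholding procedure in Python (illustrative, assuming probabilities p and binary observations o as above):

    import numpy as np

    def roc_points(p, o, thresholds=np.arange(0.05, 1.0, 0.1)):
        p, o = np.asarray(p, float), np.asarray(o, int)
        pod, pofd = [1.0], [1.0]              # threshold 0: always forecast "yes"
        for t in thresholds:
            yes = p >= t                      # probability threshold -> yes/no decision
            hits = np.sum(yes & (o == 1))
            misses = np.sum(~yes & (o == 1))
            fa = np.sum(yes & (o == 0))       # false alarms
            cn = np.sum(~yes & (o == 0))      # correct negatives
            pod.append(hits / max(hits + misses, 1))
            pofd.append(fa / max(fa + cn, 1))
        pod.append(0.0)
        pofd.append(0.0)                      # threshold 1: always forecast "no"
        # Trapezoid-rule area under the curve (points run from (1,1) down to (0,0))
        area = sum(0.5 * (pod[i] + pod[i - 1]) * (pofd[i - 1] - pofd[i])
                   for i in range(1, len(pod)))
        return np.array(pofd), np.array(pod), area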
More information on ROC can be found in Mason 1982, Jolliffe and Stephenson 2003 (ch.3), and the WISE site.
3. Brier score and Brier skill score
Brier score - BS = (1/N) Σ_{i=1}^{N} (p_i - o_i)^2, where N is the number of forecasts, p_i is the forecast probability, and o_i is the binary observation (1 if the event occurred, 0 otherwise).
Answers the question: What is the magnitude of the probability forecast errors?
Measures the mean squared probability error. Murphy (1973) showed that it could be partitioned into three terms: (1) reliability, (2) resolution, and (3) uncertainty.
Range: 0 to 1. Perfect score: 0.
Characteristics: Sensitive to climatological frequency of the event: the more rare an event, the easier it is to get a good BS without having any real skill. Negative orientation (smaller score better) - can "fix" by subtracting BS from 1.
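A minimal Python sketch of the score and the Murphy (1973) partition (illustrative; the partition BS = reliability - resolution + uncertainty is exact only when all forecasts within a bin share the same probability, so with binning it is approximate):

    import numpy as np

    def brier_score(p, o):
        # BS = (1/N) sum (p_i - o_i)^2
        p, o = np.asarray(p, float), np.asarray(o, float)
        return np.mean((p - o) ** 2)

    def brier_partition(p, o, n_bins=10):
        # Murphy (1973): BS = reliability - resolution + uncertainty
        p, o = np.asarray(p, float), np.asarray(o, float)
        N, obar = len(o), o.mean()
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        idx = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
        rel = res = 0.0
        for k in range(n_bins):
            sel = idx == k
            if sel.any():
                rel += sel.sum() * (p[sel].mean() - o[sel].mean()) ** 2
                res += sel.sum() * (o[sel].mean() - obar) ** 2
        return rel / N, res / N, obar * (1.0 - obar)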
Brier skill score - BSS = 1 - BS/BS_ref, where BS_ref is the Brier score of the reference forecast (usually climatology).
Answers the question: What is the relative skill of the probabilistic forecast over that of climatology, in terms of predicting whether or not an event occurred?
Range: minus infinity to 1, 0 indicates no skill when compared to the reference forecast.
Perfect score: 1.
Characteristics: Measures the improvement of the probabilistic forecast relative to a reference forecast (usually the long-term or sample climatology), thus taking climatological frequency into account. Not strictly proper. Unstable when applied to small data sets; the rarer the event, the larger the number of samples needed.
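Continuing the sketch above (reusing brier_score; taking the sample climatology as the reference is one common choice):

    import numpy as np

    def brier_skill_score(p, o):
        # Reference: constant forecast of the observed base rate (sample climatology)
        o = np.asarray(o, float)
        bs_ref = brier_score(np.full_like(o, o.mean()), o)
        return 1.0 - brier_score(p, o) / bs_ref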
4. Ranked probability score and skill score
Ranked probability score - RPS = (1/(M-1)) Σ_{m=1}^{M} [ (Σ_{k=1}^{m} p_k) - (Σ_{k=1}^{m} o_k) ]^2
where M is the number of forecast categories, p_k is the predicted probability in forecast category k, and o_k is an indicator (0 = no, 1 = yes) for the observation in category k.
Answers the question: How well did the probability forecast predict the category that the observation fell into?
Range: 0 to 1. Perfect score: 0.
Characteristics: Measures the sum of squared differences in cumulative probability space for a multi-category probabilistic forecast. Penalizes forecasts more severely when their probabilities are further from the actual outcome. Negative orientation - can "fix" by subtracting RPS from 1. For two forecast categories the RPS is the same as the Brier Score.
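A sketch for a single multi-category forecast (illustrative; the 1/(M-1) normalization keeps the score in the stated 0 to 1 range, though some texts omit it):

    import numpy as np

    def rps(p, o):
        # p: length-M forecast probabilities; o: length-M indicator (one 1, rest 0)
        P = np.cumsum(p)                     # cumulative forecast probability
        O = np.cumsum(o)                     # cumulative observation (step function)
        return np.sum((P - O) ** 2) / (len(p) - 1)

    # Example: three categories, middle category observed
    print(rps(np.array([0.2, 0.5, 0.3]), np.array([0, 1, 0])))   # -> 0.065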
Continuous version (continuous ranked probability score, CRPS) - CRPS = ∫ [P_fcst(x) - P_obs(x)]^2 dx, where P_fcst is the cumulative distribution of the forecast and P_obs is a step function that jumps from 0 to 1 at the observed value.
Ranked probability skill score - RPSS = 1 - RPS/RPS_ref, where RPS_ref is the ranked probability score of the reference forecast.
Answers the question: What is the relative improvement of the probability forecast over climatology in predicting the category that the observations fell into?
Range: minus infinity to 1, 0 indicates no skill when compared to the reference forecast.
Perfect score: 1.
Characteristics: Measures the improvement of the multi-category probabilistic forecast relative to a reference forecast (usually the long-term or sample climatology). Like the Brier skill score, it is not strictly proper, although the underlying RPS is. Takes climatological frequency into account. Unstable when applied to small data sets.
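A sketch of the skill score over a set of forecasts (reusing rps from above; p_clim holds assumed climatological category probabilities used as the reference):

    import numpy as np

    def rpss(p, o, p_clim):
        # p, o: arrays of shape (N, M); p_clim: length-M climatological probabilities
        rps_fcst = np.mean([rps(pi, oi) for pi, oi in zip(p, o)])
        rps_ref = np.mean([rps(p_clim, oi) for oi in o])
        return 1.0 - rps_fcst / rps_ref      # 1 = perfect, 0 = no skill vs climatology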
5. Ignorance score
The logarithmic scoring rule was suggested by Good in the 1950s [Good, 1952]. It can be defined as follows: if there are n (mutually exclusive) possible outcomes and f_i (i = 1,...,n) is the predicted probability of the i-th outcome occurring, then, if the j-th outcome is the one which actually occurs, the score for this particular forecast-realization pair is given by IGN = -log2 f_j.
As defined above, with a negative sign, the logarithmic score cannot be negative, and smaller values of the score are better. The minimum value of the score (zero) is obtained if a probability of 100% is assigned to the actual outcome. If a probability of zero is assigned to the actual outcome, the logarithmic score is infinite.
The logarithmic scoring rule is strictly proper, which means that if a forecaster believes the probabilities of each outcome occurring are g_i (i = 1,...,n), then that forecaster will minimize their expected logarithmic score by issuing the forecast f_i = g_i. The Brier score is also strictly proper. Unlike the Brier score, however, the logarithmic score is local, in that it depends only upon the probability assigned to the outcome which occurs and not on any of the probabilities assigned to the other outcomes (Roulston, M.S. and Smith, L.A., 2002: Evaluating probabilistic forecasts using information theory. Monthly Weather Review, 130, 1653-1660).
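As a worked sketch of the definition above:

    import numpy as np

    def ignorance(f, j):
        # IGN = -log2(f_j): f is the probability vector, j the observed outcome's index
        return -np.log2(f[j])

    # Assigning probability 0.5 to the outcome that occurs costs exactly 1 bit;
    # probability 0 would give an infinite score, as noted above.
    print(ignorance(np.array([0.5, 0.3, 0.2]), 0))   # -> 1.0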
6. Sharpness
An attribute of the marginal distribution of the forecasts that aims to quantify the ability of the forecasts to "stick their necks out": that is, how much the forecasts deviate from the mean climatological value/category for deterministic forecasts, or from the climatological mean probabilities for probabilistic forecasts. Unvarying climatological forecasts take no great risks and so have zero sharpness; perfect forecasts are as sharp as the time-varying observations.
For deterministic forecasts of discrete or continuous variables, sharpness is most simply estimated by the variance of the forecasts.
For perfectly calibrated forecasts, where the mean observation conditioned on each given forecast equals the forecast value itself, the sharpness becomes identical to the resolution of the forecasts. For probabilistic forecasts, although sharpness can also be defined by the variance of the forecast probabilities, it is more frequently defined in terms of the information content (negative entropy) of the forecasts. High-risk forecasts in which the forecast probability p is either 0 or 1 have maximum information content and are said to be perfectly sharp. Perfectly calibrated, perfectly sharp forecasts correctly predict all events. By interpreting deterministic forecasts as probabilistic forecasts with zero prediction uncertainty in the predictand, deterministic forecasts may be considered to be perfectly sharp probabilistic forecasts. However, it is perhaps more realistic to consider deterministic forecasts to be ones in which the prediction uncertainty in the predictand is not supplied as part of the forecast, rather than ones in which the prediction uncertainty is exactly equal to zero. Hence, a deterministic forecast can be considered to be a deterministic forecast with unspecified spread/sharpness, yet at the same time can also be considered to be a probability forecast with perfect sharpness. The word refinement is also sometimes used to denote sharpness.
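Two illustrative Python sketches of the definitions above for binary probability forecasts (the shift of the negative entropy to a 0 to 1 range is an assumed normalization convention):

    import numpy as np

    def sharpness_variance(p):
        # Variance of the forecasts: zero for an unvarying climatological forecast
        return np.var(np.asarray(p, float))

    def information_content(p):
        # Mean negative entropy (in bits), shifted so that p = 0.5 scores 0 and
        # "perfectly sharp" forecasts (p = 0 or 1) attain the maximum of 1
        p = np.clip(np.asarray(p, float), 1e-12, 1 - 1e-12)   # avoid log(0)
        return np.mean(1.0 + p * np.log2(p) + (1 - p) * np.log2(1 - p))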
7. Skill score (general definition)
Relative measure of the quality of the forecasting system compared to some (usually "low-skill") benchmark forecast. Commonly used reference forecasts include mean climatology, persistence (random walk forecast), or output from an earlier version of the forecasting system. There are as many skill scores as there are possible scores, and they are usually based on the expression
SS = (S - S_0) / (S_1 - S_0)
where S is the forecast score, S_0 is the score for the benchmark forecast, and S_1 is the best possible score. Skill scores generally lie in the range 0 to 1 but can in practice be negative when good benchmark forecasts are used (e.g. previous versions of the forecasting system). Compared to raw scores, skill scores have the advantage that they help take account of non-stationarities in the system being forecast. For example, improved forecast scores often occur during periods when the atmosphere is in a more persistent state.
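The expression as a one-line sketch, with an assumed numerical example:

    def skill_score(s, s0, s1):
        # SS = (S - S0) / (S1 - S0): fraction of the possible improvement over the benchmark
        return (s - s0) / (s1 - s0)

    # Example: forecast RMSE 2.0 vs climatology RMSE 3.0; a perfect RMSE is 0
    print(skill_score(2.0, 3.0, 0.0))   # -> 0.333...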