Monday, December 16, 2013

TC model accuracy v operational value

 Assessing the Performance of Dynamical Tropical Cyclone Model Predictions

Forecast Accuracy v Forecast Utility

Mike Fiorino, NOAA ESRL, Boulder CO
18 December 2013
07 March 2014

Abstract

Tropical Cyclones (TCs) are often part of a dynamical numerical weather prediction (NWP) model solution.  The verification of model TCs, however, has two somewhat disparate aspects: 1) accuracy of the model itself; and 2) utility of the solution in making operational forecasts by humans.  While the same verification method can be used for both aspects, the utility assessment process differs.

This blog evaluates the accuracy and forecast utility of three GFS-based NOAA models and the ECMWF HRES model.  The main result is that the post-processing of model TC track output, necessary for consistency with forecast operations, results in a 10-20% degradation in accuracy relative to raw model output.  Thus, any comparison with the official forecasts of National Hurricane Center (NHC) and Joint Typhoon Warning Center (JTWC) requires appropriate post-processing.

1.0 Introduction

The goal of the NOAA Hurricane Forecast Improvement Project (HFIP) is to improve human TC forecasts through advanced dynamical modeling.  This advanced modeling consists of models with TC-appropriate physics and data assimilation of TC observations so as to improve the analysis of the TC vortex.

Assessing the performance of these advanced modeling systems, however, must be done along two somewhat orthogonal directions:  1) model accuracy versus 2) utility/value to actual TC forecasting.  Standard verification methods can be applied to the model accuracy issue, but not directly to the question of model utility, i.e., model accuracy != model value.

We first review the verification process in some detail and then the timeline of operational TC forecasting.  Because dynamical model guidance will always be 'late' relative to operations, the model output must be synchronized with the human forecast through a post-processing procedure.  Verification of the post-processed output is used to assess/define model utility.

1.1 TC forecast verification

The conventional verification of NWP model TC forecasts involves a relatively simple calculation of the difference between a model track and an observed or 'best' track (BT).  The track consists of a series of 'posits' with:
  1. valid date-time-group (YYYYMMDDHH)
  2. latitude/longitude of the center (position)
  3. maximum surface (10 m; 2-min average) wind speed (intensity)
  4. TC state code:
    • LO - a cyclone LOw
    • DB - DisturBance
    • WV - tropical WaVe
    • TD - Tropical Depression (max wind < 35 kt)
    • TS - Tropical Storm (max wind >= 35 kt)
    • HU or TY - HUrricane/TYphoon (max wind >= 65 kt)
    • STY - Super-TYphoon (max wind >= 130 kt)
    • SD - Subtropical Depression (hybrid cyclone)
    • SS - Subtropical Storm (hybrid cyclone, max wind speed >= 50 kt)
    • XT/ET/PT - EXtra-, Post-tropical cyclone (mid-latitude cyclone)
  5. (optionally) minimum/central sea-level pressure
  6. (optionally) radius of 34, 50 and 64 kt surface winds
  7. ...other variables of TC structure, radius of max winds, eye diameter, depth...
Posits are typically given every 6 h.

The two primary verification variables are 'forecast (position) error' (FE = great circle distance between model and best track position) and 'intensity error' (IE = model - best track max surface winds).  While calculating the forecast and intensity error is simple, selecting which posits to verify, and the handling of missing and/or incomplete model tracks, is quite a bit more complicated.  
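The two quantities are easy to sketch in Python.  This is only an illustration and not the verification code used here: the haversine formula, the mean Earth radius, nautical miles as the distance unit and the function names are all my assumptions.

```python
import math

# Assumed constant: mean Earth radius (6371 km) converted to nautical miles.
EARTH_RADIUS_NM = 6371.0 / 1.852


def forecast_error_nm(model_lat, model_lon, bt_lat, bt_lon):
    """Great-circle distance (n mi) between the model and best-track positions.

    Uses the haversine formula; inputs are in decimal degrees.
    """
    phi1, phi2 = math.radians(model_lat), math.radians(bt_lat)
    dphi = phi2 - phi1
    dlam = math.radians(bt_lon - model_lon)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # min() guards against round-off pushing sqrt(a) slightly above 1
    return 2 * EARTH_RADIUS_NM * math.asin(min(1.0, math.sqrt(a)))


def intensity_error_kt(model_vmax, bt_vmax):
    """Intensity error: model minus best-track max surface wind (kt)."""
    return model_vmax - bt_vmax
```

A positive IE means the model storm is stronger than the best track; FE is always non-negative.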

In this study we use the verification rule of both NHC and JTWC:
"If the posit is a TC initially and at the forecast time or tau, and is a TC in the best track - verify"
A TC is defined as having a state code = TD | TS | HU | TY | STY | SD | SS ; all other posits are considered nonTC or NT.
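The rule can be written as a small predicate.  This is a sketch of my reading of the rule, namely that the best-track state code must be a TC both at the initial time and at the valid time of the forecast; the function names are hypothetical.

```python
# State codes treated as a TC, taken from the list of codes above.
TC_STATES = {"TD", "TS", "HU", "TY", "STY", "SD", "SS"}


def is_tc(state_code):
    """True if the best-track state code denotes a tropical cyclone."""
    return state_code.upper() in TC_STATES


def should_verify(bt_state_initial, bt_state_valid):
    """Verify only if the BT posit is a TC initially and at the forecast tau."""
    return is_tc(bt_state_initial) and is_tc(bt_state_valid)
```

For example, should_verify("TD", "TS") is True, whereas a forecast that starts from a DB posit or verifies against an XT posit is skipped.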

Availability of the TC state code in the best track is a somewhat recent addition:  1) JTWC since 2000; and 2) NHC since 2001. TC state for prior years is based on max wind speed.  

One of the trickier verification complications is handling storms that change state during their lifetimes, e.g., TD -> WV/DB/LO -> TD -> TS -> SS -> TS -> XT.  Only the TC posits in the BT will be verified.

Other verification variables include 'cross-track' and 'along-track' error, which measure 'track' (cross-track error, or right/left of the storm track) and 'speed' (along-track error, or ahead/behind the storm), respectively.

At the end of the day, however, the most important verification variable is forecast error and we will concentrate mainly on FE.

2.0 Model Performance

Model TC performance is often boiled down to an inter-model comparison of error statistics of the raw model tracks.  For FE the statistic is the mean FE, for IE it's the mean absolute IE (the mean IE is the intensity bias).  One crude measure of the model TC analysis error is the model initial position and initial intensity error. These initial errors, however, are not strongly correlated with FE (e.g., Fiorino and Elsberry 1989).  
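As a concrete illustration of these statistics, the sketch below computes the mean FE, the mean absolute IE and the intensity bias from lists of per-case errors.  The function names and the returned dictionary layout are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / float(len(values))


def summarize(fe_cases, ie_cases):
    """Summary statistics for one model at one forecast tau.

    fe_cases: forecast (position) errors, one per verified case
    ie_cases: intensity errors (model - best track), one per verified case
    """
    return {
        "mean_fe": mean(fe_cases),
        "mean_abs_ie": mean([abs(ie) for ie in ie_cases]),
        "ie_bias": mean(ie_cases),  # the signed mean IE is the intensity bias
    }
```

With intensity errors of -10 and +20 kt, the mean absolute IE is 15 kt but the bias is only +5 kt, which is why the two are reported separately.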

We consider three GFS-based NOAA models used in the HFIP 2013 summer demo and the ECMWF HRES model:
  • GFS - NOAA/NCEP Global Forecast System; dx~27 km
  • HWRF - NOAA/NCEP Hurricane WRF model; dx=27/9/3 km
  • FIM9 - NOAA/ESRL FIM global model; dx=15 km
  • ECMWF HRES - high-resolution deterministic run of the ECMWF model; dx~16 km 
Fig. 1 below gives the mean FE for the 2013 Western north PACific season for the four models where we follow the verification rule of 'if it's a TC in the BT - verify'.  

Figure 1. WPAC 2013 season (storms 01W-32W) mean FE for three NOAA GFS-based models (HWRF, FIM9, GFS) v ECMWF HRES; verify all TC points in best track.
First note that the BT for 2013 is the 'working' BT and not the final, post-season BT.  The working best track can contain posits for forecasting operations where the storm's TC characteristics have less consideration.  

For example, typhoon 11W (UTOR) made landfall over China and was kept in the working BT for over 60 h in case it moved back over water and regenerated into a TC.  Because these over-land points were coded as a TC in the working BT, they will (should) be verified.  

Another case was 30W, which formed in WPAC, moved into the Bay of Bengal, intensified into a TS, crossed the southern tip of India and finally dissipated in the Arabian Sea off the coast of Somalia.  Of the 89 posits in the BT for 30W, JTWC issued 17 warnings, but because the storm was coded as a TD/TS for 79 of the posits, if a forecast was made by the model, it was verified. These 'extras' (a cricket term btw) can distort the statistics by including cases where a limited-area TC model would not be expected to perform well (weak TCs and over-land TCs).

The verification code was enhanced to filter out: 1) over-land points < 24 h; or 2) points where JTWC did not issue a warning.  The warning-only filter eliminates posits with no operational value, i.e., the TC was not classified as a threat to land and/or sea assets.

Fig. 2 gives the mean FE for posits in the BT considered significant -- the warning-point only filter.

Figure 2. as in Fig. 1 except verify only BT posits of operational significance

Both the means and the relative ranking of the models have changed significantly.  The number of cases at 72/96/120 h has gone from 125/86/57 to 87/51/31!  Most of the excluded cases come from two storms: 11W and 30W.  Note that the FIM9 model has gone from having the lowest errors at 72/96/120 h and HWRF the highest, to HWRF with comparable or lowest errors and FIM9 with mean FE similar to GFS and ECMWF.

The main point is that case selection can have a profound effect on mean FE, especially when using the working BT.

3.0 The Operational Forecast Timeline

The NHC advisories and the JTWC warnings that contain the forecast of track, intensity and radius of 34, 50 and 64 kt winds are issued every 6 h for the synoptic times of 00/06/12/18 UTC.  The timeline for the 06 UTC package is given below:
  • 06:30-07:00 UTC - the 06UTC position, intensity and wind radii, used to initiate tracking in the models and to prepare consensus aids, are submitted to the NWP centers (NCEP & FNMOC).  This posit has many names but is fairly universally known as the 'TCvitals'
  • 07:00-07:30 UTC - prepare the forecast
  • 07:30-08:00 UTC - prepare ancillary products including warnings (NHC only), TCD (tropical cyclone discussion, NHC) or Prog Reasoning (JTWC) and submit the 'package' for dissemination to the public.
  • 08:30 UTC - 'drop-dead' time -- the time when the package must be communicated, otherwise the forecast is considered late.  This happens very rarely...
The 06UTC model run will never be complete by 07:30 UTC.  Thus, only the earlier 00UTC run can be used for the 06UTC package and this earlier forecast must be adjusted to be consistent with the time of the advisory/warning.  

We express the timing in terms of + hours after the model run.  For the timeline above, the model forecast is preferred to be ready by +6:45, but if later than +7:30 the run cannot be used for that package.  The required adjustment for operations makes it impossible to do a direct comparison between the model and the NHC/JTWC forecast and consensus aids based on the 'late' models.  That is, forecast accuracy, as verified using the raw model output, does not equal forecast utility.
 

4.0 Model Forecast Post-processing - the N-h interp scheme

There are two elements of the required post-processing we will call the N-h interp:  1) a bias correction of the raw tracker output using initial position and intensity 'offsets'; and 2) interpolation/extrapolation to relabel or recenter a forecast time to the initial time.  The most common recentering is to set the 6-h forecast (and subsequent taus) as the 0-h forecast or the start of the tracker.  In the case of a 6-h forward interpolation, the model is being penalized for being 'late' by calling the 6-h forecast the 0-h initial posit.  Thus, forecast utility depends on the error growth of the model and how far forward the time interpolation needs to be done.  If a model is available from +6:30 to +7:30, then only a 6-h interp is needed, but if later than +7:30, then a 12-h interp is required.
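The dependence of N on model availability can be sketched as a one-line rule.  The +7:30 boundary comes from the text; generalizing it to "the run must arrive within 1:30 of the synoptic time of the package it feeds" (and hence to 18-h and longer interps, or 0 h for a run that arrives almost immediately) is my assumption, as is the function name.

```python
import math

CYCLE_H = 6.0   # advisories/warnings are issued every 6 h
CUTOFF_H = 1.5  # assumed: guidance must arrive by +1:30 of the package time


def required_interp_hours(available_h):
    """Interp length N (h) for a run available `available_h` hours after run time.

    Returns the smallest multiple of 6 h whose package cutoff the run can meet.
    """
    if available_h <= CUTOFF_H:
        return 0
    return int(CYCLE_H * math.ceil((available_h - CUTOFF_H) / CYCLE_H))
```

A run available at +7:10 gives required_interp_hours(7 + 10 / 60.0) = 6, while one arriving at +8:40 gives 12.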

The ECMWF HRES and EPS trackers are good examples of how the interpolation works in operations.  ECMWF transmits a BUFR message on the GTS with their tracker after the HRES run and then again after the EPS completes.  The HRES or high-resolution deterministic run is available around +7:10, but the EPS comes in later at around +8:40.  Thus, the HRES run needs a 6-h interp, but the EPS a 12-h interp.  Even if the EPS made a better forecast it would have to have lower error growth to be useful to NHC/JTWC forecasters.

4.1 Interpolate track (dt=6h) to a finer time increment (dt=3h)

The first post-processing step is to interpolate the raw model track with a typical increment of 6 h, to a smaller time increment, typically 3 h in the standard scheme.  My processing differs in that I use rhumb lines to interpolate between posits vice a separate linear interpolation of latitude/longitude.  The difference is small but does become more significant for fast-moving storms and for more poleward posits.  
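A minimal sketch of rhumb-line interpolation follows.  It is not the code used here, only one standard way to do it: along a rhumb line the latitude varies linearly with distance and the longitude varies linearly in the Mercator coordinate.  The function names and the (tau, lat, lon) tuple layout are hypothetical, and the output longitude is not re-normalized to a fixed range.

```python
import math


def _mercator_y(lat_deg):
    """Mercator 'stretched latitude' on a sphere (valid away from the poles)."""
    phi = math.radians(lat_deg)
    return math.log(math.tan(math.pi / 4 + phi / 2))


def rhumb_interp(lat1, lon1, lat2, lon2, frac):
    """Point a fraction `frac` of the way along the rhumb line from 1 to 2."""
    # Take the shorter way around in longitude (handles the date line).
    dlon = (lon2 - lon1 + 180.0) % 360.0 - 180.0
    lat = lat1 + frac * (lat2 - lat1)
    if abs(lat2 - lat1) < 1e-9:
        # Pure east-west motion: longitude is linear in distance.
        lon = lon1 + frac * dlon
    else:
        y1, y2 = _mercator_y(lat1), _mercator_y(lat2)
        lon = lon1 + dlon * (_mercator_y(lat) - y1) / (y2 - y1)
    return lat, lon


def interp_track_3h(track):
    """Insert a midpoint between consecutive posits of a list of (tau, lat, lon)."""
    out = []
    for (t1, la1, lo1), (t2, la2, lo2) in zip(track, track[1:]):
        out.append((t1, la1, lo1))
        mid_lat, mid_lon = rhumb_interp(la1, lo1, la2, lo2, 0.5)
        out.append(((t1 + t2) / 2.0, mid_lat, mid_lon))
    out.append(track[-1])
    return out
```

For a posit pair (20N, 130E) to (40N, 140E) the rhumb-line midpoint is at 30N but slightly west of 135E, which is the kind of small difference from separate linear interpolation that grows for fast-moving, poleward storms.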

4.2 Smoothing of the dt=3h track

The last model tracker posit is then extrapolated forward using the motion between the last two posits in the track and the finer interpolation time increment (3-h) for an extra end point in the smoothing.
  
The lat/lon of the 3-h interpolated track is (optionally) smoothed (in time) with a 1-2-1 filter.  Experimentation with the number of passes showed 10 to be optimal in that it minimized the FE in the final post-processed track.  No smoothing is done for intensity in my scheme but apparently is at NHC/JTWC.
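The smoothing step can be sketched as follows.  Several details are my assumptions and not taken from the actual scheme: the end points are held fixed on every pass, the extra end point is a simple linear extrapolation in lat/lon, and it is dropped again after smoothing.  The function names are hypothetical.

```python
def smooth_121(values, passes=10):
    """Apply a 1-2-1 (0.25, 0.5, 0.25) filter `passes` times; end points fixed."""
    vals = list(values)
    for _ in range(passes):
        prev = vals[:]  # filter from the previous pass, not in place
        for i in range(1, len(prev) - 1):
            vals[i] = 0.25 * prev[i - 1] + 0.5 * prev[i] + 0.25 * prev[i + 1]
    return vals


def smooth_track(lats, lons, passes=10):
    """Smooth a 3-h track (needs >= 2 posits) with one extrapolated end point."""
    # Extra end point from the motion between the last two posits.
    ext_lats = list(lats) + [2 * lats[-1] - lats[-2]]
    ext_lons = list(lons) + [2 * lons[-1] - lons[-2]]
    # Drop the extrapolated point again after smoothing.
    return smooth_121(ext_lats, passes)[:-1], smooth_121(ext_lons, passes)[:-1]
```

A straight, constant-speed track passes through unchanged; only wiggles in the track are damped, and the extra end point lets the last real posit be smoothed like an interior point.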

4.3 Bias correction

Before relabeling to form the final model forecast track or 'aid' used in the operational forecast process, the track is bias corrected by removing an 'offset.'  This offset is simply the difference between the initial raw model posit and that from the corresponding TCvitals and is expressed as a delta lat/lon and max sfc wind.

For lat/lon (position), the offset is applied to all forecast posits in the interpolated track, whereas for intensity the full offset is applied to tau 0 (initial posit) and is linearly decreased to 0 at some fixed forecast tau.  Whereas the current standard at NHC/JTWC is to apply the full intensity offset to all forecast taus (except for the GFDL & HWRF models), my bias correction sets the intensity offset to 0 at tau 24 h for limited-area models such as HWRF, and tau 72 h for global models.  As with smoothing,  the 0 intensity offset at 72 h for global models was found through experimentation to give the lowest intensity error (mean absolute max wind difference).
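A sketch of the bias correction as described: the position offset is removed at every tau, and the intensity offset is weighted linearly from 1 at tau 0 to 0 at a chosen tau.  The dictionary keys, the function name and the `zero_tau_h` parameter are hypothetical, and taking the offset from the first posit of the track is my literal reading of the text.

```python
def bias_correct(track, vitals, zero_tau_h=72.0):
    """Remove the initial offset between the model track and the TCvitals.

    track:  list of dicts with keys 'tau' (h), 'lat', 'lon', 'vmax' (kt);
            the first entry is the initial (tau 0) posit
    vitals: dict with keys 'lat', 'lon', 'vmax' from the TCvitals
    zero_tau_h: tau at which the intensity offset reaches 0
                (24 h for limited-area models, 72 h for global models)
    """
    first = track[0]
    dlat = first["lat"] - vitals["lat"]
    dlon = first["lon"] - vitals["lon"]
    dvmax = first["vmax"] - vitals["vmax"]
    corrected = []
    for posit in track:
        # Weight is 1 at tau 0, decreases linearly, and stays 0 past zero_tau_h.
        weight = max(0.0, 1.0 - posit["tau"] / float(zero_tau_h))
        corrected.append({
            "tau": posit["tau"],
            "lat": posit["lat"] - dlat,             # full offset at all taus
            "lon": posit["lon"] - dlon,
            "vmax": posit["vmax"] - weight * dvmax,  # tapered intensity offset
        })
    return corrected
```

With a model initial intensity 10 kt below the TCvitals and zero_tau_h=72, the correction adds 10 kt at tau 0, 5 kt at tau 36 and nothing from tau 72 onward.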

4.4 Relabeling/Recentering

The final step in generating the forecast aid that will be used by the human forecaster is to relabel the bias-corrected, interpolated track to the time of the advisory/warning.  For example, to make the 6-h interp track, the 6-h posit is set to 0-h, the 12-h to 6-h and so on.  In the 12-h interp, the 12-h posit becomes the 0-h, 18-h the 6-h and so on.  In ATCF nomenclature, the 6-h interp are the 'I' trackers and 12-h the '2' trackers, e.g., AVNO (raw) -> AVNI (6-h interp) and AVN2 (12-h) and occasionally AVN3 (18-h interp).
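Relabeling itself is a simple shift of the forecast taus.  A minimal sketch, assuming a track held as a list of (tau, lat, lon, vmax) tuples (a hypothetical layout):

```python
def relabel(track, interp_h=6):
    """Shift taus back by `interp_h` hours and drop posits before the new tau 0.

    track: list of (tau_h, lat, lon, vmax) tuples from the bias-corrected track
    """
    return [(tau - interp_h, lat, lon, vmax)
            for (tau, lat, lon, vmax) in track
            if tau >= interp_h]
```

Note that an N-h interp also shortens the usable forecast by N hours: a 120-h raw track yields a 6-h interp aid that ends at tau 114 h.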

4.5 Error as a function of N-h interp time

Relabeling a forecast to an analysis will obviously degrade the accuracy (increase error) as defined by the verification metrics.  The more relevant question in determining forecast utility is how the degradation depends on both the intrinsic error growth in the model, and on the basin/synoptic situation.

Three models are considered:
  • AVNO - deterministic run of the GFS global model, requires a 6-h interp
  • EDET - ECMWF HRES deterministic run, requires a 6-h interp
  • EEMN - ECMWF EPS ensemble-mean, requires a 12-h interp
over a four-year period 2010-2013 in three basins: WPAC, EPAC and the atLANTic.  The change in accuracy is expressed as a % change or improvement in mean FE over the raw model tracker.  If the mean error of the N-h interp is larger than the raw model, then the % improvement is negative.
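The % improvement is assumed here to be the usual relative difference with the raw tracker as the reference; the function name is hypothetical.

```python
def pct_improvement(raw_mean_fe, interp_mean_fe):
    """% improvement of the N-h interp mean FE over the raw model mean FE.

    Negative values mean the interp track has larger error (a degradation).
    """
    return 100.0 * (raw_mean_fe - interp_mean_fe) / raw_mean_fe
```

For example, a raw mean FE of 100 and an interp mean FE of 115 (same units) gives -15%, i.e., a 15% degradation.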


Figure 3.  % improvement of mean FE for 12-h, 6-h, 0-h interp over raw model tracker for EDET, AVNO, EEMN in the WPAC, EPAC, LANT basins for 2010-2013 and taus 12, 24, 36, 48, 72, 96, 120 h (x-axis).  A negative improvement is a degradation.  The tau 72 h 12- and 6-h degradations are indicated in the callout.
The most striking variation in the degradation curves is between basins.  WPAC shows the most degradation and EPAC the least.  Also note how the degradation decreases with tau in WPAC, whereas in EPAC it's positive or smaller for the early taus (12-24 h) and maximizes at 72 h.  The ECMWF deterministic forecast suffers less degradation from tau 12-48 h in the LANT compared to the GFS (AVNO), but has greater change from 72-120 h.  The curve for the mean ensemble tracker in the ECMWF EPS (EEMN) is similar to the higher-resolution deterministic trackers, but is generally lower.  The big problem with EEMN for operational forecasting is that it is not available until > +7:30 and thus requires a 12-h interp.  Even if the ensemble forecast was superior to the deterministic run, its lateness would make it uncompetitive as the 12-h degradation is about 2X that for the 6-h interp.

The effect of the smoothing and bias correction without relabeling (0-h interp) is generally small and sometimes even positive, except for the ECMWF trackers in WPAC and in EPAC for the medium-range taus.  Uniform application of the position offset for all taus is apparently less appropriate for the ECMWF model.   

The biggest difference between the data assimilation (DA) of the GFS v ECMWF around TCs is that the background TC vortex is relocated to the TCvitals position in the GFS DA whereas no adjustment is made in the ECMWF DA.  Consequently, the ECMWF analysis has much larger initial position errors, but despite the implied error in the analyzed TC vortex, the 12-h FE is comparable or even lower than in the GFS.  Clearly these analysis errors do not affect FE, suggesting that the analysis errors are on a scale smaller than those that dominate motion (Fiorino and Elsberry 1989).

Finally, some of the smaller degradation for the GFS is explained by a large mean FE for the raw model.

5.0 Forecast accuracy v utility

Returning to the four models considered in Figs. 1 and 2, the runs are available < +7:30 so that we can apply a 6-h interp for comparison to the JTWC warnings.  Fig. 4 below is the same as Fig. 2 except now we can compare to JTWC and the best consensus aid prepared using 6- and 12-h interp tracks - CONW.
Figure 4.  as in Fig. 2 except for comparing 6-h interp trackers to the best consensus aid and JTWC

The notable differences compared to the raw model tracker mean errors in Fig. 2 are that the errors are generally higher, the number of cases is reduced (especially at 120 h), and the initial position error is much lower.  However, the (interp) models have higher error than CONW and JTWC, implying that forecast utility is (much) reduced compared to the utility implied by the raw model trackers.

Appendix A - a python implementation of the post-processing

I have developed a standalone version of the post-processor used in this paper that can be applied to any standard ATCF adeck, see:

http://sourceforge.net/p/wxmap2/svn/HEAD/tree/branches/TCinterp/README

for details.  The only requirement is python V2.?

Contact me at michael.fiorino@noaa.gov with any questions/comments.