Monday, December 16, 2013

TC model accuracy v operational value

 Assessing the Performance of Dynamical Tropical Cyclone Model Predictions

Forecast Accuracy v Forecast Utility

Mike Fiorino, NOAA ESRL, Boulder CO
18 December 2013
07 March 2014

Abstract

Tropical Cyclones (TCs) are often part of a dynamical numerical weather prediction (NWP) model solution.  The verification of model TCs, however, has two somewhat disparate aspects: 1) accuracy of the model itself; and 2) utility of the solution to humans making operational forecasts.  While the same verification method can be used for both aspects, the utility assessment process differs.

This blog evaluates the accuracy and forecast utility of three GFS-based NOAA models and the ECMWF HRES model.  The main result is that the post-processing of model TC track output, necessary for consistency with forecast operations, results in a 10-20% degradation in accuracy relative to raw model output.  Thus, any comparison with the official forecasts of National Hurricane Center (NHC) and Joint Typhoon Warning Center (JTWC) requires appropriate post-processing.

1.0 Introduction

The goal of the NOAA Hurricane Forecast Improvement Project (HFIP) is to improve human TC forecasts through advanced dynamical modeling.  This advanced modeling consists of models with TC-appropriate physics and data assimilation of  TC observations so as to improve the analysis of the TC vortex.  

Assessing the performance of these advanced modeling systems, however, must be done along two somewhat orthogonal directions:  1) model accuracy versus 2) utility/value to actual TC forecasting.  Standard verification methods can be applied to the model accuracy issue, but not directly to the question of model utility, i.e., model accuracy != model value.

We first review the verification process in some detail and then the timeline of operational TC forecasting.  Because dynamical model guidance will always be 'late' relative to operations, the model output must be synchronized with the human forecast through a post-processing procedure.  Verification of the post-processed output is used to assess/define model utility.

1.1 TC forecast verification

The conventional verification of NWP model TC forecasts involves a relatively simple calculation of the difference between a model track and an observed or 'best' track (BT).  The track consists of a series of 'posits' with:
  1. valid date-time-group (YYYYMMDDHH)
  2. latitude/longitude of the center (position)
  3. maximum surface (10 m; 2-min average) wind speed (intensity)
  4. TC state code:
    • LO - a cyclone LOw
    • DB - DisturBance
    • WV - tropical WaVe
    • TD - Tropical Depression (max wind < 35 kt)
    • TS - Tropical Storm (max wind >= 35 kt)
    • HU or TY - HUrricane/TYphoon (max wind >= 65 kt)
    • STY - Super-TYphoon (max wind >= 130 kt)
    • SD - Subtropical Depression (hybrid cyclone)
    • SS - Subtropical Storm (hybrid cyclone, max wind speed >= 50 kt)
    • XT/ET/PT - EXtra-, Post-tropical cyclone (mid-latitude cyclone)
  5. (optionally) minimum/central sea-level pressure
  6. (optionally) radius of 34, 50 and 64 kt surface winds
  7. ...other variables of TC structure, radius of max winds, eye diameter, depth...
Posits are typically given every 6 h.
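As an illustration, a minimal in-memory representation of a posit in Python might look like the sketch below; the field names are my own shorthand for the items listed above, not an official ATCF convention.

from collections import namedtuple

# A minimal posit record; field names are illustrative, not an ATCF standard.
Posit = namedtuple('Posit', ['dtg',    # valid date-time-group, 'YYYYMMDDHH'
                             'lat',    # center latitude [deg N]
                             'lon',    # center longitude [deg E]
                             'vmax',   # max 10-m wind [kt]
                             'state',  # TC state code: TD, TS, HU/TY, ...
                             'mslp'])  # (optional) central pressure [hPa]

# A track is simply a list of posits, typically every 6 h, e.g.:
track = [Posit('2013080800', 14.2, 128.9, 45, 'TS', 996),
         Posit('2013080806', 14.6, 127.8, 55, 'TS', 990)]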

The two primary verification variables are 'forecast (position) error' (FE = great circle distance between model and best track position) and 'intensity error' (IE = model minus best track max surface wind).  While calculating the forecast and intensity errors is simple, selecting which posits to verify, and handling missing and/or incomplete model tracks, are quite a bit more complicated.
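For reference, here is a minimal Python sketch of the two calculations (a haversine great-circle distance in nautical miles for FE and a signed wind difference for IE); the function names and the Earth radius value are my choices for illustration.

import math

R_NMI = 3440.065  # mean Earth radius in nautical miles

def forecast_error(lat_f, lon_f, lat_b, lon_b):
    # FE: great-circle distance [nmi] between forecast and best-track positions
    phi1, phi2 = math.radians(lat_f), math.radians(lat_b)
    dphi = phi2 - phi1
    dlam = math.radians(lon_b - lon_f)
    a = math.sin(dphi / 2)**2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2)**2
    return 2.0 * R_NMI * math.asin(math.sqrt(a))

def intensity_error(vmax_f, vmax_b):
    # IE: model minus best-track max surface wind [kt]; signed, so the mean gives the bias
    return vmax_f - vmax_b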

In this study we use the verification rule of both NHC and JTWC:
"If the posit is a TC initially and at the forecast time or tau, and is a TC in the best track - verify"
A TC is defined as having a state code = TD | TS | HU | TY | STY | SD | SS; all other posits are considered nonTC or NT.
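A minimal Python sketch of this rule, assuming the best-track state code is available at both the initial time and the verifying tau (the helper names are mine):

TC_STATES = {'TD', 'TS', 'HU', 'TY', 'STY', 'SD', 'SS'}

def is_tc(state):
    # True if the state code is a TC; anything else is nonTC (NT)
    return state in TC_STATES

def verify_this_posit(bt_state_t0, bt_state_tau):
    # NHC/JTWC rule: verify only if the best track is a TC both initially
    # and at the forecast time (tau)
    return is_tc(bt_state_t0) and is_tc(bt_state_tau)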

Availability of the TC state code in the best track is a somewhat recent addition:  1) JTWC since 2000; and 2) NHC since 2001. TC state for prior years is based on max wind speed.  

One of the trickier verification complications is handling a storm that changes state during its lifetime, e.g., TD -> WV/DB/LO -> TD -> TS -> SS -> TS -> XT.  Only the TC posits in the BT will be verified.

Other verification variables include 'cross-track' and 'along-track' error, which measure 'track' (cross-track error, or right/left of the storm track) and 'speed' (along-track error, or ahead/behind the storm).

At the end of the day, however, the most important verification variable is forecast error and we will concentrate mainly on FE.

2.0 Model Performance

Model TC performance is often boiled down to an inter-model comparison of error statistics of the raw model tracks.  For FE the statistic is the mean FE, for IE it's the mean absolute IE (the mean IE is the intensity bias).  One crude measure of the model TC analysis error is the model initial position and initial intensity error. These initial errors, however, are not strongly correlated with FE (e.g., Fiorino and Elsberry 1989).  

We consider three GFS-based NOAA models used in the HFIP 2013 summer demo and the ECMWF HRES model:
  • GFS - NOAA/NCEP Global Forecast System; dx~27 km
  • HWRF - NOAA/NCEP Hurricane WRF model; dx=27/9/3 km
  • FIM9 - NOAA/ESRL FIM global model; dx=15 km
  • ECMWF HRES - high-resolution deterministic run of the ECMWF model; dx~16 km 
Fig. 1 below gives the mean FE for the 2013 Western north PACific season for the four models where we follow the verification rule of 'if it's a TC in the BT - verify'.  

Figure 1. WPAC 2013 season (storms 01W-32W) mean FE for three NOAA GFS-based models (HWRF, FIM9, GFS) v ECMWF HRES; verify all TC points in best track.

First note that the BT for 2013 is the 'working' BT and not the final, post-season BT.  The working best track can contain posits kept for forecasting operations, where the storm's TC characteristics are given less consideration.

For example, typhoon 11W (UTOR) made landfall over China and was kept in the working BT for over 60 h in case it moved back over water and regenerated into a TC.  Because these over-land points were coded as a TC in the working BT, they will (should) be verified.  

Another case was 30W, which formed in WPAC, moved into the Bay of Bengal, intensified into a TS, crossed the southern tip of India and finally dissipated in the Arabian Sea off the coast of Somalia.  Of the 89 posits in the BT for 30W, JTWC issued 17 warnings, but because the storm was coded as a TD/TS for 79 of the posits, if a forecast was made by the model, it was verified.  These 'extras' (a cricket term btw) can distort the statistics by including cases where a limited-area TC model would not be expected to perform well (weak TCs and over-land TCs).

The verification code was enhanced to filter out: 1) over-land points < 24 h; or 2) points where JTWC did not issue a warning.  The warning-only filter eliminates posits with no operational value, i.e., the TC was not classified as a threat to land and/or sea assets.

Fig. 2 gives the mean FE for posits in the BT considered significant -- the warning-point-only filter.

Figure 2. As in Fig. 1, except verify only BT posits of operational significance.

Both the mean errors and the relative ranking of the models have changed significantly.  The number of cases at 72/96/120 h has gone from 125/86/57 to 87/51/31!  Most of the excluded cases come from two storms: 11W and 30W.  Note that the FIM9 model has gone from having the lowest errors at 72/96/120 h and HWRF the highest, to HWRF with comparable or lowest errors and FIM9 with mean FE similar to GFS and ECMWF.

The main point is that case selection can have a profound effect on mean FE errors, especially when using the working BT.

3.0 The Operational Forecast Timeline

The NHC advisories and the JTWC warnings that contain the forecast of track, intensity and radius of 34, 50 and 64 kt winds are issued every 6 h for the synoptic times of 00/06/12/18 UTC.  The timeline for the 06 UTC cycle is given below:
  • 06:30-07:00 UTC - the 06 UTC position, intensity and wind radii are submitted to the NWP centers (NCEP & FNMOC) to initiate tracking in the models and to prepare consensus aids.  This posit has many names but is fairly universally known as the 'TCvitals'
  • 07:00-07:30 UTC - prepare the forecast
  • 07:30-08:00 UTC - prepare ancillary products including warnings (NHC only), TCD (tropical cyclone discussion, NHC) or Prog Reasoning (JTWC) and submit the 'package' for dissemination to the public.
  • 08:30 UTC - 'drop-dead' time -- the time when the package must be communicated, otherwise the forecast is considered late.  This happens very rarely...
The 06UTC model run will never be complete by 07:30 UTC.  Thus, only the earlier 00UTC run can be used for the 06UTC package and this earlier forecast must be adjusted to be consistent with the time of the advisory/warning.  

We express the timing in terms of + hours after the model run.  For the timeline above, the model forecast is preferred to be ready by +6:45, but if later than +7:30 the run cannot be used.  The required adjustment for operations makes it impossible to directly compare raw model output with the NHC/JTWC forecasts and the consensus aids based on the 'late' models.  That is, forecast accuracy, as verified using the raw model output, does not equal forecast utility.
 

4.0 Model Forecast Post-processing - the N-h interp scheme

There are two elements of the required post-processing, which we will call the N-h interp: 1) a bias correction of the raw tracker output using initial position and intensity 'offsets'; and 2) interpolation/extrapolation to relabel or recenter a forecast time to the initial time.  The most common recentering is to set the 6-h forecast (and subsequent taus) as the 0-h forecast or the start of the tracker.  In the case of a 6-h forward interpolation, the model is being penalized for being 'late' by calling the 6-h forecast the 0-h initial posit.  Thus, forecast utility depends on the error growth of the model and how far forward the time interpolation needs to be done.  If a model is available from +6:30 to +7:30, then only a 6-h interp is needed, but if later than +7:30, then a 12-h interp is required.
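As a sketch of that timing logic (the +7:30 cutoff comes from the timeline in section 3.0; extending the same deadline to later cycles for the 12- and 18-h cases is my assumption):

def interp_hours(available_at):
    # available_at: hours after the model's initial time when the tracker arrives,
    # e.g. 7.2 for +7:12.  Returns the forward interpolation needed.
    if available_at <= 7.5:     # ready by +7:30 -> usable with a 6-h interp
        return 6
    elif available_at <= 13.5:  # misses this cycle -> 12-h interp for the next one
        return 12
    return 18                   # very late runs would need an 18-h interp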

The ECMWF HRES and EPS trackers are good examples of how the interpolation works in operations.  ECMWF transmits a BUFR message on the GTS with their tracker output after the HRES run and then again after the EPS completes.  The HRES or hi-resolution deterministic run is available around +7:10, but the EPS comes in later at around +8:40.  Thus, the HRES run needs a 6-h interp, but the EPS a 12-h interp.  Even if the EPS made a better forecast, it would have to have lower error growth to be useful to the NHC/JTWC forecaster.

4.1 Interpolate track (dt=6h) to a finer time increment (dt=3h)

The first post-processing step is to interpolate the raw model track, which has a typical increment of 6 h, to a smaller time increment, typically 3 h in the standard scheme.  My processing differs in that I use rhumb lines to interpolate between posits vice a separate linear interpolation of latitude/longitude.  The difference is small but does become more significant for fast-moving storms and for more poleward posits.
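Here is a minimal sketch of rhumb-line interpolation between two posits (a rhumb line is a straight line in Mercator coordinates, so we interpolate linearly in longitude and Mercator y); dateline crossings and the poles are ignored for brevity.

import math

def rhumb_interp(lat1, lon1, lat2, lon2, frac):
    # Position a fraction 'frac' (0-1) of the way from posit 1 to posit 2
    # along the connecting rhumb line.
    y1 = math.log(math.tan(math.pi / 4 + math.radians(lat1) / 2))
    y2 = math.log(math.tan(math.pi / 4 + math.radians(lat2) / 2))
    y = y1 + frac * (y2 - y1)
    lat = math.degrees(2 * math.atan(math.exp(y)) - math.pi / 2)
    lon = lon1 + frac * (lon2 - lon1)
    return lat, lon

# e.g., the 3-h point between two 6-h posits:
lat3, lon3 = rhumb_interp(20.0, 130.0, 21.5, 128.0, 0.5)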

4.2 Smoothing of the dt=3h track

The last model tracker posit is then extrapolated forward, using the motion between the last two posits in the track and the finer interpolation time increment (3 h), to provide an extra end point for the smoothing.
  
The lat/lon of the 3-h interpolated track is (optionally) smoothed (in time) with a 1-2-1 filter.  Experimentation with the number of passes showed 10 to be optimal in that it minimized the FE in the final post-processed track.  No smoothing is done for intensity in my scheme, but apparently it is at NHC/JTWC.
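A sketch of the 1-2-1 (binomial) smoother applied in time to the 3-h latitude and longitude series; in this sketch the end points are left unchanged and the default of 10 passes follows the experimentation described above.

def smooth_121(values, npass=10):
    # Apply a 1-2-1 filter 'npass' times to a time series (e.g. the 3-h lats or lons);
    # end points are not modified in this sketch.
    v = list(values)
    for _ in range(npass):
        s = v[:]
        for i in range(1, len(v) - 1):
            s[i] = 0.25 * v[i - 1] + 0.5 * v[i] + 0.25 * v[i + 1]
        v = s
    return v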

4.3 Bias correction

Before relabeling to form the final model forecast track or 'aid' used in the operational forecast process, the track is bias corrected by removing an 'offset.'  This offset is simply the difference between the initial raw model posit and that from the corresponding TCvitals and is expressed as a delta lat/lon and max sfc wind.

For lat/lon (position), the offset is applied to all forecast posits in the interpolated track, whereas for intensity the full offset is applied to tau 0 (initial posit) and is linearly decreased to 0 at some fixed forecast tau.  Whereas the current standard at NHC/JTWC is to apply the full intensity offset to all forecast taus (except for the GFDL & HWRF models), my bias correction sets the intensity offset to 0 at tau 24 h for limited-area models such as HWRF, and tau 72 h for global models.  As with smoothing,  the 0 intensity offset at 72 h for global models was found through experimentation to give the lowest intensity error (mean absolute max wind difference).
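A sketch of the offset correction on an illustrative track structure (a list of (tau, lat, lon, vmax) tuples; the structure and names are mine, not those of my actual code):

def bias_correct(track, vit_lat, vit_lon, vit_vmax, zero_tau=72.0):
    # Offset = TCvitals minus the raw model at tau 0.  The position offset is
    # applied at all taus; the intensity offset is ramped linearly from 100%
    # at tau 0 to 0% at zero_tau (24 h for limited-area, 72 h for global models).
    tau0, lat0, lon0, vmax0 = track[0]
    dlat, dlon, dv = vit_lat - lat0, vit_lon - lon0, vit_vmax - vmax0
    corrected = []
    for tau, lat, lon, vmax in track:
        w = max(0.0, 1.0 - tau / zero_tau)  # linear taper of the intensity offset
        corrected.append((tau, lat + dlat, lon + dlon, vmax + w * dv))
    return corrected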

4.4 Relabeling/Recentering

The final step in generating the forecast aid that will be used by the human forecaster is to relabel the bias-corrected, interpolated track to the time of the advisory/warning.  For example, to make the 6-h interp track, the 6-h posit is set to 0-h, the 12-h to 6-h and so on.  In the 12-h interp, the 12-h posit becomes the 0-h, the 18-h the 6-h and so on.  In ATCF nomenclature, the 6-h interps are the 'I' trackers and the 12-h the '2' trackers, e.g., AVNO (raw) -> AVNI (6-h interp) and AVN2 (12-h), and occasionally AVN3 (18-h interp).
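The relabeling step, on the same illustrative track structure, is then a simple shift of the taus:

def relabel(track, shift=6):
    # Recenter by 'shift' hours: the 6-h posit becomes the new 0-h, the 12-h
    # becomes the 6-h, and so on (shift=12 for the '2' trackers).
    return [(tau - shift, lat, lon, vmax)
            for tau, lat, lon, vmax in track if tau >= shift]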

4.5 Error as a function of N-h interp time

Relabeling a forecast to an analysis will obviously degrade the accuracy (increase error) as defined by the verification metrics.  The more relevant question in determining forecast utility is how the degradation depends on both the intrinsic error growth in the model and on the basin/synoptic situation.

Three models are considered:
  • AVNO - deterministic run of the GFS global model, requires a 6-h interp
  • EDET - ECMWF HRES deterministic run, requires a 6-h interp
  • EEMN - ECMWF EPS ensemble-mean, requires a 12-h interp
over a four-year period 2010-2013 in three basins: WPAC, EPAC and the atLANTic.  The change in accuracy is expressed as a % change or improvement in mean FE over the raw model tracker.  If the mean error of the N-h interp is larger than that of the raw model, then the % improvement is negative.
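For clarity, the improvement plotted below is computed per tau as (a sketch):

def pct_improvement(mean_fe_raw, mean_fe_interp):
    # % improvement of the N-h interp over the raw tracker; negative = degradation
    return 100.0 * (mean_fe_raw - mean_fe_interp) / mean_fe_raw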


Figure 3.  % improvement of mean FE for 12-h, 6-h, 0-h interp over raw model tracker for EDET, AVNO, EEMN in the WPAC, EPAC, LANT basins for 2010-2013 and taus 12, 24, 36, 48, 72, 96, 120 h (x-axis).  A negative improvement is a degradation.  The tau 72 h 12- and 6-h degradations are indicated in the callout.
The most striking variation in the degradation curves is between basins.  WPAC shows the most degradation and EPAC the least.  Also note how the degradation decreases with tau in WPAC, whereas in EPAC it is positive or smaller for the early taus (12-24 h) and maximizes at 72 h.  The ECMWF deterministic forecast suffers less degradation from tau 12-48 h in the LANT compared to the GFS (AVNO), but has a greater change from 72-120 h.  The curve for the mean ensemble tracker in the ECMWF EPS (EEMN) is similar to the higher-resolution deterministic trackers, but is generally lower.  The big problem with EEMN for operational forecasting is that it is not available until after +7:30 and thus requires a 12-h interp.  Even if the ensemble forecast were superior to the deterministic run, its lateness would make it uncompetitive, as the 12-h degradation is about 2X that for the 6-h interp.

The effect of the smoothing and bias correction without relabeling (0-h interp) is generally small and sometimes even positive, except for the ECMWF trackers in WPAC and in EPAC for the medium-range taus.  Uniform application of the position offset for all taus is apparently less appropriate for the ECMWF model.   

The biggest difference between the data assimilation (DA) of the GFS v ECMWF around TCs is that the background TC vortex is relocated to the TCvitals position in the GFS DA, whereas no adjustment is made in the ECMWF DA.  Consequently, the ECMWF analysis has much larger initial position errors, but despite the implied error in the analyzed TC vortex, the 12-h FE is comparable to or even lower than in the GFS.  Clearly these analysis errors do not affect FE, suggesting the analysis errors are on scales smaller than those that dominate motion (Fiorino and Elsberry 1989).

Finally, some of the smaller degradation for the GFS is explained by a large mean FE for the raw model.

5.0 Forecast accuracy v utility

Returning to the four models considered in Figs. 1 and 2, the runs are available < +7:30 so that we can apply a 6-h interp for comparison to the JTWC warnings.  Fig. 4 below is the same as Fig. 2 except now we can compare to JTWC and the best consensus aid prepared using 6- and 12-h interp tracks - CONW.

Figure 4. As in Fig. 2, except comparing 6-h interp trackers to the best consensus aid and JTWC.

The notable differences compared to the raw model tracker mean errors in Fig. 2 are that the errors are generally higher, the number of cases is reduced (especially at 120 h), and the initial position error is much lower.  However, the (interp) models have higher errors than CONW and JTWC, implying that forecast utility is (much) reduced compared to the utility implied by the raw model trackers.

Appendix A - a python implementation of the post-processing

I have developed a standalone version of the post-processor used in this paper that can be applied to any standard ATCF a-deck; see:

http://sourceforge.net/p/wxmap2/svn/HEAD/tree/branches/TCinterp/README

for details.  The only requirement is Python 2.x.

Contact me at michael.fiorino@noaa.gov with any questions/comments.




Tuesday, November 5, 2013

WPAC 2013 -- GFS-based Models v ECMWF

 TC Performance of GFS-based US Models v ECMWF 

2013 WPAC season

Mike Fiorino, NOAA ESRL, Boulder CO 
05 November 2013





Now that the 2013 season in the western North Pacific (WPAC) basin is nearly over, I've prepared preliminary error statistics for the tropical cyclone (TC) forecasts of three GFS-based US models v the ECMWF HRES (the 10-d, High-RESolution deterministic run of the IFS).

These results are preliminary for two reasons:
  1. as of the time of this writing two systems are active: 30W and 31W (Haiyan).  31W is the 7th typhoon to undergo 'ED' - explosive deepening, which is defined as a 50 kt change in 24 h.  When you add in RI - rapid intensification (30 kt / 24 h) - storms, 11 of 31 typhoons have made large intensity changes (a sketch of such a check on best-track intensities is given after this list).  Further, with the recent promotion of 31W, there have been 5 'supertyphoons' -- max winds >= 130 kt
  2. I use the 'working' best track for verification vice the 'final' best track.  The final, post-season reanalysis of the fix data that makes the true or final best track is typically not available until around Feb/Mar for NHC whereas JTWC finalizes the best track during the season.
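Here is the sketch of an RI/ED check referred to above; it simply scans best-track intensities for 30 kt (RI) or 50 kt (ED) rises within a 24-h window, and the data structure is illustrative.

def intensity_change_events(posits, window_h=24):
    # posits: time-ordered list of (t_hours, vmax_kt) pairs from a best track.
    # Returns (t_start, t_end, delta_v, label) for each RI/ED episode found.
    events = []
    for t0, v0 in posits:
        for t1, v1 in posits:
            if 0 < t1 - t0 <= window_h:
                dv = v1 - v0
                if dv >= 50:
                    events.append((t0, t1, dv, 'ED'))
                elif dv >= 30:
                    events.append((t0, t1, dv, 'RI'))
    return events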
The models considered are (ATCF 4-char name in caps):
  • HWRF - a three-grid, high-resolution limited-area model that uses the GFS for initial and lateral boundary conditions.  The grids have a horizontal resolution (dx) of 27:9:3 km
  • AVNO - the GFS global model, dx~21 km (a spectral model formerly known as the 'aviation' run)
  • FIM9 - the ESRL global model dx=15 km (finite-volume, flow-following icosahedral dynamical core)
  • EDET - ECMWF HRES, dx~15 km (a spectral model with 129 layers)


First consider the mean forecast (track) error, defined as the great-circle distance between the best track and forecast position, but before looking at the stats, the verification system needs to be reviewed.   

There were cases of tracker failures (inability to accurately locate the model TC) that had to be eliminated to avoid skewing the statistics.  These failures are obvious when plotting the tracks, as is done in operations, and are few in number in WPAC.  Details are given in appendix A.

Figure 1.  WPAC 2013 mean forecast error [nmi]




There are no clear winners in that no single model has lower numbers at all forecast times or 'taus'.  For the longer taus of 96 and 120 h (days 4 and 5, D4 & D5) there is greater separation because fewer storms go into the mean -- one bad storm can ruin the mean.  We'll dig into the details later...

A second measure of forecast performance is storm 'intensity' or a comparison of the model max surface wind speed (Vmax) to 'observed' from the (working) best track.  The standard metric for intensity is the mean absolute error (no directionality), but from a modeling perspective the mean error itself, i.e., the bias, may be more important.  Although bias is not commonly displayed, I do below:


Figure 2.  WPAC 2013 intensity or Vmax error: lines are the mean absolute error and the bars the mean error or bias.
The bars show bias and the lines the mean absolute error.  In some ways the numbers are typical, but in others not:
  • HWRF has the highest spatial resolution (3 km) and takes great care in initializing the model vortex.  Consequently, the model has near-zero initial bias and shows only a slight over-intensification bias out to 120 h (5 kt).  More remarkable is the mean abs error of 15 kt at 72 h.  One of the best statistical-dynamical intensity aids is LGEM (Logistic Growth Equation Model) and this aid serves as a baseline for performance.  In a head-to-head comparison at 72 h: HWRF: 16|1 kt v LGEM: 19|-12 kt [MM|BB where MM is the mean abs error and BB is the bias].  A 15 kt mean abs error is simply excellent and these statistics are among the best I have ever seen for any model, numerical or statistical, in WPAC.
  • The initial ECMWF mean|bias is 20|-18 kt.  These statistics imply a very poor analysis of the cyclone intensity.  The bias does decrease in time and is only -3 kt at tau 120 h.  This drop in bias is typical for the ECMWF model and in the past there were cases where the bias went positive at 120 h.  The basic conclusion is that the ECMWF data assimilation system does a (very) poor job of analyzing the TC vortex.
  • The two GFS-based global models, AVNO and FIM9, both under-analyze and under-predict TC intensity.  The mean abs error is about 5 kt higher than HWRF at all taus and most of the error comes from a negative (weak-storm) bias.
  • Large intensity errors do not imply large track errors.  The key result of my PhD work at the Naval Postgraduate School (Fiorino and Elsberry 1989) was that the TC inner core, where the intensity change occurs, does not affect the dynamics of vortex motion, i.e., the scales that dominate the motion process are much larger than the TC inner core (~50 km).  Nonetheless, the large intensity error in the ECMWF model may reflect vortex structure errors on larger scales.  We will explore this relationship in a separate blog...
  • FIM9 with higher horizontal resolution than the GFS (AVNO) has less bias than AVNO despite the tracker using the same grid (0.5 deg global).  The higher model resolution does permit more intense cyclones and if we tracked at the native resolution, the bias would probably be less...
One way to compensate for large intensity errors in the numerical models when forecasting is to apply statistical post-processing.  The scheme used at both JTWC and NHC is to calculate an initial 'offset' and then add a portion of the initial offset to the forecast.  The offset is defined as the intensity analyzed in operations minus the model initial intensity.  This operational intensity is generally not the same as the working best track, but since all intensities are analyzed to the nearest 5 kt, this does not make much of a practical difference, though it does result in some initial intensity error.  For global models, I apply a 100% offset at tau 0 and 0% at tau 72 h, with a linear variation between 0-72 h.  The limited-area models set the 0% at tau 24 h.

Here are the bias-corrected statistics:

Figure 3. WPAC 2013 intensity or Vmax error with bias correction: lines are the mean absolute error and the bars the mean error or bias.
The big initial weak bias for ECMWF has been eliminated, and the correction has greatly reduced both the size of the bias from tau 0-48 h and the mean abs errors.  Still, the global model errors are higher than HWRF (and LGEM, not shown).














Appendix A - details of the stats




  
Figure 1 plots these numbers

                      000    012    024    036    048    072    096    120  
           HWRF      10.1   31.0   43.7   58.0   74.9  106.2  165.4  276.9
           AVNO      13.5   27.0   39.5   54.8   76.2  113.5  172.6  237.5
           FIM9      14.7   25.5   40.6   56.8   74.9  110.9  165.1  212.3
           EDET      22.0   27.2   39.2   55.5   68.4  108.5  178.2  247.3
           CONW       7.5   28.3   45.0   63.5   78.8  118.6  155.9  220.2
         #CASES       206    189    172    151    132    95     60     36    
 #Tossed( HWRF)       2      1      1      1     
 #Tossed( AVNO)       1      1      1      1     
 #Tossed( FIM9)       1      1     
 #Tossed( EDET)       1      2      2     

#Tossed is the number of cases removed at that forecast tau

model runs that were filtered:

BE filter Cases for: HWRF
stmid     dtg         tau     BE[nmi]
02W.2013  2013022018   12     221
14W.2013  2013083100   12     253

BE filter Cases for: AVNO
stmid     dtg         tau     BE[nmi]
14W.2013  2013082912   12     203

BE filter Cases for: FIM9
stmid     dtg         tau     BE[nmi]
02W.2013  2013022100    0     173
02W.2013  2013022100   12     335

BE filter Cases for: EDET
stmid     dtg         tau     BE[nmi]
08W.2013  2013071700    0      190
08W.2013  2013071612   12     217


Friday, November 1, 2013

FIM9 2013 LANT/EPAC

FIM9 Track Performance during 2013 LANT/EPAC seasons

Mike Fiorino, NOAA ESRL, Boulder CO 




Craig Mattocks, the new TECHDEV lead at NHC (I was in this position from 2006-2009), made an interesting comment in an email of 25 Oct 2013:

... "I have heard good things from the Hurricane Specialists this year about the FIM(9), especially the track accuracy."

To see if the 'good things' can be seen in the standard metric of track skill, I looked at the errors in the atLANTic (LANT) and Eastern north PACific (EPAC) using the working best tracks.

2013 atLANTic mean forecast ('track') error for HWRF, GFS, FIM9

For forecast times (as they say in .mil 'taus') 0-72 h, FIM9 did have slightly higher errors than the GFS.  Note the considerably poorer performance of HWRF which is telling because all three models used the exact same initial (and lateral boundary for HWRF) conditions.  The errors at tau 120h only represent the performance for one storm (09L - HU Humberto). 

At tau 72 h there are more storms:
2013 LANT tau 72 storm-by-storm mean forecast error [nmi]


Most of the GFS improvement over the FIM9 at tau 72 h comes from storm 11L - TS Jerry (7 cases), whereas the FIM9 was better for storms 04L, 05L and 10L with 1, 3 and 4 cases, respectively.

The statistics don't show a clear advantage of FIM9 over the GFS in the LANT, yet the hurricane specialists were impressed(?).  Do we need a better way to measure 'good things'?

The results in the EPAC are different:

2013 EPAC mean forecast error [nmi] for HWRF, GFS, FIM9


Here we find the FIM9 has lower errors than both HWRF and GFS at all taus.  Also note how the GFS errors grow faster than in FIM9.  One speculation is that while both GFS and FIM9 share a common set of physics, their dynamical cores are substantially different.  The precipitation fields on my TCgen site http://ruc.noaa.gov/hfip/tcgen look different, with GFS generally having more intense and smaller-scale precipitation events.

It's also impressive that HWRF has lower errors at the longer taus (72-120 h) compared to its host global model, the GFS.  This error growth is counterintuitive in that the lateral boundaries in HWRF should cause greater errors.  Further investigation is needed, but given the better errors for FIM9, maybe it is in EPAC that the specialists are impressed...

For completeness here are the mean tau 72 errors by storm:
2013 EPAC storm-by-storm tau 72 mean forecast errors [nmi]



Storms 03E (5 cases) and 05E (3 cases) are noteworthy in that FIM9 has the lowest errors for 03E but higher errors for 05E.

Let's combine the basins for stats in EPAC & LANT:

EPAC & LANT 2013 mean forecast errors [nmi] HWRF v GFS v FIM9

No comparison is complete without including the 'gold standard' of TC track prediction - ECMWF HRES (this is how ECMWF refers to their hi-resolution deterministic run).  There were 8 cases of serious errors in the ECMWF trackers - failure to track the observed storm initially and large track changes in the first 24 h (with an implied speed of motion > 100 kt).  The main problem storm was 02L, in which the tracker jumped into the EPAC from the Bay of Campeche.  These errors will be the subject of an upcoming post... but my verification code had to be improved to toss out these cases.  Fortunately no cases > 48 h were removed and most of the bad tracks were for taus 0-24 h.
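As an illustration, a sanity check of this kind might look like the Python sketch below; the thresholds are illustrative (the 100 kt implied speed comes from the discussion above, the initial-error cutoff is a guess), not the values in my verification code.

import math

def gc_nmi(lat1, lon1, lat2, lon2):
    # haversine great-circle distance [nmi]
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dl = math.radians(lon2 - lon1)
    a = math.sin((p2 - p1) / 2)**2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2)**2
    return 2.0 * 3440.065 * math.asin(math.sqrt(a))

def tracker_failure(init_error_nmi, posits, max_init_error=150.0, max_speed_kt=100.0):
    # Flag a model track whose initial position error is unreasonably large or
    # whose implied motion between consecutive posits exceeds max_speed_kt.
    # posits: list of (tau_h, lat, lon) tuples for the model track.
    if init_error_nmi > max_init_error:
        return True
    for (t0, la0, lo0), (t1, la1, lo1) in zip(posits, posits[1:]):
        if gc_nmi(la0, lo0, la1, lo1) / float(t1 - t0) > max_speed_kt:
            return True
    return False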

For EPAC/LANT combined:
EPAC & LANT 2013 mean forecast error HWRF v GFS v FIM v ECMWF


ECMWF is clearly the gold standard at taus 48-120 h.

The results in the LANT are in some ways more dramatic:

LANT 2013 mean forecast error [nmi] HWRF v GFS v FIM9 v ECMWF
The large initial position errors and the relatively poorer performance for taus 0-24 h are explained by deficiencies in the ECMWF tracker and no special treatment of TCs in their data assimilation system.  However, the stats at taus 48 and 72 h are hugely better than the US models.  A mean forecast error of 90 nmi may be a record low.  When I started in this business in the late 1970s, the typical 72-h mean error was around 350 nmi!!!!

In EPAC we find a similar pattern - ECMWF is better at the medium and long-range taus (72-120 h):
EPAC 2013 mean forecast error [nmi] HWRF v GFS v FIM9 v ECMWF