visit my main pagehome mail


published in: Meteorologicke Zpravy, vol 47 (1994), No.4, p.103-112
language: English + Czech abstract; includes 7 figures and 5 tables with English+Czech captions


Probabilistic prediction of thunderstorm occurrence


Martin Dubrovsky
Institute of Atmospheric Physics, Hradec Kralove‚ Czech Republic


Contents

  • Introduction
  • Data
  • Predictive power of individual predictors
  • Multivariate regression techniques
  • Results
  • Conclusion
  • References
  • Figures
  • Tables

    I am sorry, the original text contains some non-standard symbols, which were not transformed during conversion from Word Perfect. If you are interested in this topic, write me for a copy


  • 1 Introduction

    Forecasting thunderstorms (TS) for time projections longer than the characteristic lifetime of TS cells often relies on semiempirical methods relating TS activity with predictors derived from the latest aerological data and/or output of the numerical weather prediction model. A prominent role among thunderstorm predictors belongs to various stability indices and related thermodynamical variables derived from a single station aerological sounding. The indices may be used either autonomously or together with information from other sources. Prognostic equations relating the value(s) of the predictor(s) with thunderstorm activity at given location and within given period of the day and season of the year are typically derived from the historical data sets. Let x = (x1, x2,.., xr) is the predictor vector, x IS_IN Xr, and y IS_IN Y is a predictand. Given actual observation of x, and the task to forecast y, we may either estimate its value by a single number (deterministic forecast) which is in some way optimal, e.g. minimizes expected mean square error or maximizes likelihood, or we may estimate the conditional probability P(y|x) for each y IS_IN Y. The latter approach (probabilistic forecasting) is seemingly more informative: it may be simply converted to the deterministic forecast (postprocessing) but it simultaneously includes information on the uncertainty involved in the forecast. In case y is a dichotomous variable, y IS_IN Y={0,1}, its conditional distribution may be fully expressed by conditional probability of y taking value 1 which is identical with a value of a regression function:

    P(y=1 | x) = E(y|x) ( = r(x))................(eq.1)

    where r(x) denotes a regression function. Having the learning sample, S = {[xi,yi]; i = 1,..,n}, comprising previously observed values of predictors and predictand, the probabilistic prognostic function estimating the conditional probability of event occurrence,

    f(x) = (y=1 | x) ...................... (eq.2)

    may be determined by regression analysis of the learning sample. Quality of the prognostic function may be expressed by reduction of variance (which is equivalent with a coefficient of determination defined as a square of a multiple correlation coefficient):

    RV(f) = 1 - MSE(f)/MSEref......................(eq.3)

    where MSE(f) = E(y - f(x))2 and MSEref = E(y - y-))2. The value of RV is maximized (MSE is minimized) if f(*) coincidates with r(*) and RV=1 indicates 100% successful forecasts.
    The present paper brings the most interesting results found by analysis of 9y series of summer months observations related to Prague. The main aims of the analysis are (1) to assess the diagnostic and prognostic potential of 38 thermodynamical predictors derived from a TEMP-A data, (2) to select the most optimal combination of predictors with use of three variable screening techniques, and (3) to compare efficiency of four regression techniques to establish relation between the multidimensional predictors and thunderstorm activity. The set of candidate predictors is supplemented by persistence and information on front passages across Prague.


    2 Data

    The data archive available for the present study consists of observations from warm months (May-August) of 1981-1989 period and include: TS observations in Prague, frontal passages across Prague and aerological soundings launched in Prague.

    2.1 Predictand: definition and sample climatology

    The dichotomous predictand, TS, takes value 1 if (and only if) the TS (either nearby or distant) is observed at least at one of the four Prague synoptic stations (Karlov, Libus, Kbely, Ruzyne) within target forecast interval, , where t1 and t2 are times of the day (GMT). The times preceded by plus sign indicate the time of the tomorrow (e.g., interval <12,+12> represents 24h interval starting at 12 GMT today and ending 12 GMT tomorrow. The set of target intervals consists of four 6h, two 12h and one 24h intervals following each of the two sounding terms (00 and 12 GMT).
    The sample climatology of TS occurrence is displayed in Fig. 1 and Fig. 2. It is seen (Fig. 1) that the mean annual thunderstorm frequency both in Prague and its 100 km surroundings slightly increases during the learning period. The increasing trend is statistically significant at level ALPHA=0.018 for TS<12,18> and at level ALPHA=0.002 for the mean (per day) number of stations reporting thunderstorm (triangles in Fig. 1). The mean annual frequency of days with TS occurrence in Prague (TS<0,24>=1) displayed in Fig. 1 is higher than the average number of stormy days for Praha-Karlov during 1946-1955 (28.6 stormy days for whole year, 24.1 stormy days for May-August; [18]). This may be caused partly by different frequency of thunderstorm occurrence within the two periods and partly by different definition of thunderstorm occurrence in the two studies ([18] vs. present study). Since the variability of the series displayed in Fig. 1 mostly well corresponds with the standard error of the single year frequency of TS days we may assume that none of the years of the learning period was anomalous from the point of view of TS activity.
    The mean annual course of TS activity (Fig. 2) displays several maxima. The location of the best marked maxima (May 20, June 20, July 15, August 10-14) resembles the near-1-month periodicity of TS occurrence in South Bohemia (based on records from 1956-1986 [17] and 1921-1970 [19]) and even in East Slovakian Lowlands ([7], 25y series). This suggests that the peaks need not be a product of randomness. The discussion on an origin of the peaks is beyond the scope of present paper but may be partly found in above referred papers.

    2.2 Predictors

    Three groups of predictors are considered:
    1) Thermodynamical predictors are derived from the single station TEMP-A data (Tab. I) and include various stability indices, measures of parcel energy, heights of selected levels important from the point of view of cloud formation and several characteristics derived from a 1D steady state convective cloud model [16]. The predictors are derived from both midnight and midday soundings.
    2) Persistence predictors are defined and labelled analogically with the predictand except that they relate periods prior the target forecast period. The boundary times of the intervals relating to the previous day are preceded by minus sign: e.g. TS<-18,6> represents TS occurrence in the 12h period starting at 18 GMT on the preceding day. This group includes variable PERS giving a number of stations within 100 km from Prague having reported TS occurrence during 24h interval starting at 6 GMT of the previous day. When developing prognostic equations, we consider only those persistence predictors which cover periods ending at least 6h prior the start of the target forecast period (i.e., e.g., PERS may be used for target periods starting at 12 GMT but not for periods starting at 6 GMT).
    3) Frontal predictors. Since an occurrence of fronts may reduce prognostic potential of predictors derived from the pre-frontal soundings, and since non-negligible portion of thunderstorms is closely associated with frontal systems, passages of the fronts treated as binary variables (0/1) were included into the set of predictors: Predictor F<t1,t2> takes value 1 if at least one front (cold or occluded) passes across Prague within time interval <t1,t2>. The set of frontal predictors covers 6-h, 12-h and 24-h intervals of the day.
    The date-conditioned climatological probability of TS occurrence was also tested as a predictor. However, as the predictive power of the climatological predictors was found to be totally insignificant, they were no longer considered in subsequent experimentations.


    3 Predictive power of individual predictors

    Individual predictive power of the most effective predictors is given in Tab. II and Tab. III in terms of RV score related to the dichotomous prognostic function:

    fi(xi) = P^(y=1|xi<=x) = n((y=1) AND (xi<=x))/n(xi<=x)   for xi<=x
    (eq.4) fi(xi) = P^(y=1|xi>x) = n((y=1) AND (xi>x))/n(xi>x)   for xi>x

    where xi is the variable being tested as a predictor, x is its threshold value which maximizes the sample RV score and n(*) is the number of elements of the learning sample satisfying conditions given in the parentheses. The standard error of the individual predictive power, RVi, is about +/-0.03 for RVi being close to 0.2 and +/-0.01 for RVi being close to zero, as it was determined by Monte Carlo simulations. Although the dichotomous prognostic function is rather rough approximation of the regression function, the obtained skill scores provide a good basis for comparison of predictive power of the predictors:
    (a) effect of the length of time projection
    The effect of the length of time projection upon predictive power of thermodynamical predictors may be observed from two viewpoints:
    (i) Having predictors derived from soundings obtained at t0, we examine how the predictive power depends on the delimitation of the target prognostic period, <t1,t2>.
    (ii) Having target prognostic period <t1,t2>, we examine how the predictive power depends on the time of acquiring the source sounding data.
    Generally it holds that the predictive power decreases with increasing length of time projection. In case of the latter viewpoint (ii), it is seen from Tab. II and Tab. III that elongation of the time projection leads to the decrease of the RV score, which is best pronounced in case of the afternoon target period: the midnight predictors (Tab. II) gain much lower scores compared to the midday predictors (Tab. III). In case of the former viewpoint (i), one must consider several restraints: Firstly, occurrence of certain weather phenomenon within two distinct periods implies two distinct phenomena (night-time thunderstorms are generally of different origin than the afternoon thunderstorms). Further, comparison of the prediction skill is complicated by different mean climatological probability of event occurrence within the two distinct target periods. In addition to lower efficiency of regression techniques when applied on rarer phenomenon, we deal with the effect of the mean frequency of event occurrence on the RV score. To demonstrate this effect, one may consider the following model: Let r(x) = P(y=1|x) is given regression function which relates binary predictand, y, and continuous predictor (scalar), x, and px(*) is a probability distribution of the predictor which varies in a position parameter: px(x) = PHI(x-m). Then it may be found that RV is maximized for such mi for which P0 = P(y=1) ( = INTEGRALr(x)PHI(x-m)dx) = 1/2 and RV tends to 0 proportionally with (P0(1 - P0))0.5. It means: having fixed relation between predictor and predictand represented by r(x), attainable RV score depends on the apriori distribution of x.
    Bearing above restraints in mind we leave attempts to compare the prediction skill related to TS occurrence within distinct periods of the day. We just note, that the significantly lower RV scores related to the night-time and morning thunderstorms need not necessarily imply lower predictability.
    (b) Individual predictive power of thermodynamical predictors.
    The most powerful predictors derived from the noon soundings are the stability indices of Faust, Showalter and Adedokun. It may be noted that the only two indices not considering humidity conditions (Simila and VT) are not among fifteen best predictors derived from the noon soundings (Tab. II), but they are among the best if derived from the midnight soundings (Tab. III). Assimilation of surface conditions into the definition of the predictors may increase their predictive power, as it follows from comparison of KMOD with K, TTMOD with TT, ADED2 with ADED1. The former versions of the predictors in each of the three pairs better account for the surface conditions than the latter versions, and also gain higher values of RV score. Predictive power of the midnight predictors seems less dependent on the air humidity conditions as indicated by dominance of the Simil„ index. Inclusion of the surface conditions does not improve predictive power of the midnight predictors. On the contrary, KMOD, TTMOD and ADED2 are even worse than K, TT and ADED1 if derived from the midnight soundings. The relation between two of the strongest predictors with relatively low correlation is displayed in Fig. 3.
    (c) The individual predictive power of the persistence and frontal predictors is much lower compared to the thermodynamical predictors. The maximal values of the RV score were acquired by frontal predictors which cover front passages within both target periods and 6h interval preceding the target period. The maximum value of RV score (0.07) is attained by F<6,18> when applied to <12,18> and <12,24> target periods. Of the persistence predictors (the persistence predictors closely preceding the target period are not considered), PERS attains the highest RV score (0.05), also for <12,18> and <12,24> target periods. Similarly with the thermodynamical predictors, the prediction skill of frontal and persistence predictors is considerably lower when applied on target periods with lower probability of thunderstorm occurrence.


    4 Multivariate regression techniques

    Having the binary predictand, set of candidate predictors, learning sample of given size, and the task to develop the probabilistic prognostic function relating the predictand with the predictors, we stand two problems: (1) selection of optimal set of predictors and (2) determination of prognostic function best estimating the conditional probability of event occurrence. The first task aims to comprise the most of the information on the predictand in the lowest number of predictors. The set of selected predictors should be large enough to bear sufficient information on the predictand, but not too large to avoid overfitting data and instability of the regression estimate. The second step involves choice of the model (type of the function approximating the regression function) and finding parameters of the model. The choice of the model depends on the structure of the data, types of the predictors and size of the learning sample.
    Three techniques are used in present work to select the predictors: stepwise linear regression, stepwise logistic regression and tree growing algorithm run in an SP mode. The probabilistic prognostic function (2) was afterwards estimated by REEP technique, logistic regression, k Nearest Neighbors and Binary decision tree. The former two techniques come from a family of the parametric techniques which search for the regression function within the set of functions of given type, the latter two techniques come from the family of the robust techniques which do not suppose any specific type of the predictor distribution.
    REEP (Regression Estimation of Event Probabilities, [11]) approximates prognostic function by linear regression function which is truncated to the <0,1> range to avoid probabilities below 0 and above 1. The REEP was originally designed by Miller [13] for multiple category predictands. Applications of REEP to probabilistic forecasting of various weather elements are given in [12].
    Logistic regression approximates the prognostic function by logistic function: f(x) = {1 + exp[-(ALPHA+BETA'.x)]}-1 where ALPHA and BETA'=(BETA1,BETA2,...,BETAr) are to be determined from the learning data (the prime denotes the row vector here). The logistic regression converges to the regression function, r(*), in case two conditional distributions p(x|y=0) and p(x|y=1) have normal distributions with the same covariance matrices. The logistic regression was found superior to the linear model in case the binary predictand was considered [1,4].
    Method of k nearest neighbors (kNN) estimates the value of the regression function by weighted average of k observations selected from the learning sample which are `closest' to the observation x, for which the prediction is to be issued:

    f(x) = SUMj=1,..,k qix(j) yix(j) / SUMj=1,..,k qix(j)

    where ix(j) is the index of the j-th closest neighbor selected from S such that D(x,xix(j-1)) <= D(x,xix(j)) <= D(x,xix(j+1)) where D(*,*) is the distance of two vectors in the predictor space; qix(j) is the weight assigned to the ix(j)-th element of the learning sample. The regression estimate is finetuned by choice of weight q, distance D(*,*) and number of neighbors k. Based on Monte Carlo simulations, an optimal value of k was found to be about 2.(n)0.5 where n is a number of elements in the learning sample. The tests were made with various q-weights which smoothly diminish to zero as the distance between the elements increases. The best results were, however, obtained for the q-weight being independent of the distance between the neighbors. Two types of distance D will be alternatively used:
    (a) Weighted Mahalanobis distance:

    DMAH(x,xi) =SUMk=1,..,rSUMl=1,..,r(xk-xik)ckCkl-1cl(xl-xil).................(eq.6)

    where Ckl are elements of the covariance matrix of x, c are weights proportional to the individual predictive power (expressed by RVi) of the respective component of x, xi* are components of vector xi.
    (b) Distance of projections of the two predictor vectors upon the `best linear discriminant function':

    Dbldf(x,xi) = |w^.(x - xi)|...................(eq.7)

    where w^ is the best linear discriminant vector. It might be noted here that the latter version of the D-distance is less robust. On the contrary, greater robustness may be achieved by transformation of components of x: x'j = F-1[Rj(x)/(n+1)], where x'j (j=1,..,r) are the components of the new predictor vector, Rj(x) is the rank of x in the learning sample sorted according to the value of j-th component of x and F is the distribution function of normal distribution, N(0,1).
    Construction of the binary decision tree (BDT) consists in recursive partition of the predictor space aiming to maximal discrimination of elements with event occurrence from elements with event non-occurrence. The tree growing algorithm used in the present study follows the one described by Rez cov  and Motl [16]. It was further enhanced, tested and described in detail in [9]. Two versions of the growing algorithm are alternatively employed for present tests:
    (a) The algorithm run in an SP-mode allows only single predictor based splits - the predictor space is partitioned into the r-dimensional `rectangles'. An advantage of the SP mode is a greater vividness of the structure of the resultant tree and ability to reduce the dimensionality of the predictor vector.
    (b) The algorithm run in a BLDF-mode allows to partition the predictor space by hyperplanes perpendicular to the best linear discriminant functions calculated separately in each node of the tree. In result, the predictor space may be effectively partitioned in less number of splits. The disadvantage of the BLDF-mode is its lower robustness.
    For completeness, it should be noted here, that the decision trees are often used to graphically visualize human decision process and are therefore referred to as simple expert systems. The structure of such trees is often designed by human expert (e.g. [2,5,6,10]) which contrasts with a present work where a computer algorithm is employed to develop the tree from the learning data. More general tree growing algorithm allowing for the categorical variables may be found in [3].


    5 Results

    Results of the variable selection analysis (Tab. IV) show that the sets of predictors selected by both screening techniques do not differ significantly. Number of selected predictors generally decreases with decreasing correlation between the predictors and predictand. In case only thermodynamical predictors are considered, not more then three predictors are selected. If persistence and frontal predictors are included the set of selected predictors is typically expanded by one persistence and one frontal predictor. The profit due to the persistence and frontal predictors is better pronounced in case they supplement midnight soundings data which are poorer correlated with the afternoon TS occurrence. The first selected predictor in most cases corresponds with the most powerful predictor for given target period (cf. Tab. II and Tab. III). The skill scores related to the standalone first-selected predictors are higher than those given in Tab. II and Tab. III, due to better approximation of the regression function by linear and logistic functions compared to the dichotomous prognostic function used in the previous tables. Further increase of the skill score is due to increasing dimension of the predictor vector.
    The following tests were designed to compare the performance of the four above described regression techniques. The tests will deal with a prediction for the afternoon period with use of a Perfect Prog approach (thermodynamic predictors are derived from the noon soundings). Additionally to the sets of predictors selected by the stepwise linear regression and stepwise logistic regression (Tab. IV), the third set of predictors was derived from a tree grown in an SP-mode (Fig. 7; see [9] for details):

    X(lin) = (FI, POSsfc, F<12,18>, PERS)
    X(log) = (FI, HEATsfc, ADED2, F<12,18>, PERS)
    X(bdt) = (SICP, POSsfc, PERS, FI, F<12,18>)

    Predictors in X(bdt) are arranged according to their contribution to the total skill score. We may immediately notice the good concordance between the three sets: the first-selected predictor is in all cases the stability index (correlation coefficient between FI and SICP is -0.95). The second-selected predictors are the parcel energies also coming from the group of the thermodynamical predictors. All three combinations include both persistence predictor (PERS) and frontal predictor (F<12,18>). On each of the three predictor sets, regression techniques described in paragraph 4 were applied.
    To get a notion on the effect of errors following from the limitness of the learning data sample, the multiple cross-validation tests were designed: The set of 1052 elements was randomly split in two halves. The first half served as a learning sample to develop a prognostic function, the second half served as a testing sample to evaluate an unbiased estimate of prediction skill, RV^. The procedure was run ten times and the average and sample standard deviation of RV^ were calculated. The variability of RV^ results from two sources:
    (1) Stability of an estimate of the regression function - the prognostic function f(*) and its prediction skill, RV(f), vary due to the variability of the learning sample. In case of the proper choice of the parametric model or in case we use robust methods, f(*) converges to r(*) and RV(f) converges from below to RV(r). (2) Accuracy of sample skill scores - RV^(f) varies about RV(f) due to the limitness of the testing sample. Based on the author's experience following from experimentations with randomly generated data the second source contributes to the total variability of RV^ about twice larger than the first source - on the assumption of equal size of both learning and testing samples.
    The results displayed in Fig. 4 to Fig. 6 indicate that the prediction skill in average increases with increasing dimension of the predictor for all groups of predictors and all tested techniques. Irregularities may be due to relatively great dispersion of RV^ which also disables to select reliably the most optimal statistical technique. Anyway, the logistic regression seems to provide the most consistent results - not only when the predictors were selected by stepwise logistic regression which could produce some positive bias involved in RV^. The good performance of the logistic regression is believed to be the consequence of the fact that the most of the predictors is quite well behaved. Although some of the predictors are far from being normal (e.g. POSsfc in Fig. 3), the quicker convergence of the logistic regression seems to play greater role than the robustness of kNN and BDT techniques. The REEP attains poorest skill. Although the tree growing algorithm run in an SP-mode may cope with highly dimensional predictor vector and selfishly select the best combination of the variables it is seen from the three figures that unnecessary high dimension of the predictor vector may reduce quality of the tree.
    Of the three sets of predictors, the one selected by stepwise logistic regression gains the highest skill. Prediction skill related to the predictors selected by tree growing algorithm is about the same as for the predictors selected by stepwise linear regression.
    An explicit formulations of the logistic prognostic function for the afternoon target period developed in a Perfect Prog approach are given in Tab. V.


    6 Conclusion

    Of the predictors being tested, the stability indices of Faust, Showalter and Adedokun are the most efficient thunderstorm predictors. An optimal combination of predictors selected by variable selection procedures contains no more than five predictors: thermodynamical predictors dominate, but the set of selected predictors mostly include one persistence and one frontal predictor.
    Of the regression techniques being tested to determine the probabilistic prognostic function, the logistic regression seems to provide best results. The preliminary tests have verified, that the logistic function used as a predictor is superior to all single indices even for 1989-1991 period (to be published). Since the differences between the performance of the regression techniques (except of the REEP) are only slight with respect to the variability of the sample RV scores, one might also consider some secondary features of the regression techniques, such as vividness and requirements on computer resources: The binary decision tree provides a vivid view into the structure of the data. The kNN method is the most time and memory consuming technique since it requires permanently accessible data archive and may spend relatively long time in search for the neighbors. On the other hand kNN provides a simple mechanism of self-learning: just enhancing the archive of the historical data improves quality of the prognosis. Logistic regression is the only one which extrapolates the value of the probability prognostic function for the values of predictors beyond the range comprised in the learning sample. Thus it may better respond to the extreme values of predictors, surely only in case the tails of the logistic prognostic function sufficiently well approximate the regression function. Further important aspect of the regression techniques is their requirements on the size of the learning sample: The typical property of the robust techniques (here represented by binary decision tree and k Nearest Neighbors) is their lower rate of convergence to the regression function [8]. They require more data to attain sufficient stability. On the other hand, parametric models (logistic regression, REEP) converge quicker and may consequently provide better results than the robust techniques when applied on learning sample of lower size, even if the parametric model does not precisely suit the structure of data.
    The results of the present work indicate very poor performance of the midnight predictors for afternoon thunderstorms. On the other hand, skill scores related to the midday predictors are not entirely realistic - assessment of the true prognostic potential of predictors derived from a midday soundings would require sufficiently long archive of the NWP model output, which would allow to verify the Perfect Prog based prognostic equations or possibly to design prognostic equations in a MOS approach.

    Acknowledgements. I wish to thank RNDr. Daniela Rezacova, CSc. for providing the source computer code of the convective cloud model. I am also indebted to her for valuable suggestions while finalizing the manuscript of the paper.


    References

    [1] BOCCHIERI J.R., GLAHN H.R., 1976: Verification and further development of an operational model for forecasting the probability of frozen precipitation. Mon.Wea.Rev., 104, 691-701.
    [2] BOTHWELL P.D., 1988: Forecasting convection with the AFOS data analysis programs (ADAP-Version 2.0). NOAA Tech. Memo NWS SR-122, NWS Southern Region Scientific Services Division, Ft. Worth, TX, 91 pp.
    [3] BREIMAN L., FRIEDMAN J.H., OLSHEN R.A. and STONE C.J., 1984: Classification and regression trees. Wadsworth International Group, Belmont, California. 358 pp.
    [4] BRELSFORD W.M., JONES R.H., 1967: Estimating probabilities. Mon.Wea.Rev., 95, 370-376.
    [5] BROWN J.M., 1986: A decision tree for forecasting downslope windstorms in Colorado, Eleventh Conf on Weather Forecasting and Analysis, June 17-30, 1986, Amer.Meteor.Soc., Boston, 83-88.
    [6] COLQUHOUN J.R., 1987: A decision-tree method of forecasting thunderstorms, severe thunderstorms and tornadoes. Weather and Forecasting 2, 337-345.
    [7] CERMAK C., 1981: Information on statistical evaluation of dangerous weather phenomena in the East Slovakian Lowlands (in Slovak). Meteorologick‚ Zpr vy 34, 163-165.
    [8] DUBROVSKY M, 1993: Extraction of the probabilistic prognostic function from the finite size training data sample. Annales Geophysicae, supplement II to vol. 11 (Book of Abstracts of XVIII General Assembly of EGS), C195.
    [9] DUBROVSKY (1995): The Binary Decision Tree: The Growing Algorithm And Application To Thunderstorm Forecasting. Studia geoph. et geod. 39, 84-100.
    [10] ELLROD G., 1989: A decision tree approach to Clear Air Turbulence analysis using satellite and upper air data. NOAA Technical Memorandum NESDIS 23, Washington, D.C.
    [11] GLAHN H.R., MURPHY A.H., WILSON L.J. and JESENIUS J.S., Jr., 1991: Lectures presented at the WMO training workshop on the interpretation of NWP products in terms of local weather phenomena and their verification. PSMP Report Series No.34, WMO.
    [12] KLEIN W.H. and GLAHN H.R., 1974: Forecasting local weather by means of model output statistics. Bull.Amer.Meteor.Soc., 55, 1217-1228.
    [13] MILLER, R.G., 1964: Regression Estimation of Event Probabilities. Technical Report No.1, Travelers Research Center, Hartford, Conn., 153 pp.
    [14] PEPPLER R.A., 1988: A review of static stability indices and related thermodynamical parameters. Illinois State Water Survey Misc. Publ. 104, 87 pp.
    [15] PEPPLER R.A. and LAMB P.J., 1989: Tropospheric static stability and Central North American growing season rainfall. Mon.Wea.Rev. 117, 1156-1180.
    [16] REZACOVA D. and MOTL V., 1990: The use of the simple 1D steady-state convective cloud model in the decision tree for determining the probability of thunderstorm occurrence. Studia geoph. et geod. 34, 147-166.
    [17] STAROSTOVA M., VAVRUSKA F., 1991: Strong storms in South Bohemia (in Czech). In: Sbornik praci CHMU, sv.40, 8-28.
    [18] STUCHLIK F., POPOLANSKY F., TREFNA E., 1961: The frequency of thunderstorms and their duration on the territory of the Czechoslovak Socialist Republic (in Czech). Meteorologicke‚ Zpravy 14, 8-13
    [19] VAVRUSKA F., 1974: Storm conditions in Ceske‚ Budejovice (in Czech). Meteorologick‚ Zpr vy 27, 87-90.



    Tables

    Tab I The list of predictors derived from the aerological sounding. (references for Table I: [a] = PEPPLER [14] or PEPPLER and LAMB [15]; [b] = REZACOVA and MOTL [16])

    a) indices of stability
    SIMILA
    Simila index [b]
    FI
    Faust index [b]
    SI
    Showalter index[a,b]
    SICP
    modified Showalter index [a]
    SIHH
    modified Showalter index (850 mb level is used instead of 800 mb) [a]
    K
    K-index (Whiting) [a,b]
    Kmod
    modified K-index [the 850 hPa values are replaced by (surface - 850 mb) averages] [a]
    RACK
    Rackliff index; (850 mb level is used instead of 900 mb in present work) [a]
    JEFF
    Jefferson index [a]
    ADED1, ADED2
    Adedokun indices [a]
    VT
    vertical totals index [a]
    CT
    cross totals index [a]
    TT
    total totals index ( CT + VT ) [a]
    TTmod
    modified total totals index [a]
    TEI
    total energy index [a]
    PWBI
    potential wet-bulb index [a]
    SWEAT
    severe weather threat index [a]
    CIIR
    convective instability of Reap [a]
    CIIB
    convective instability index of Barber [a]

    b) Heights of selected levels
    LCLsfc
    lifting condensation level (initial parcel has surface characteristics) [a]
    CCLsfc
    Convective condensation level (initial parcel has surface characteristics) [a]
    LFCsfc
    level of free convection (initial parcel has surface characteristics)
    ELsfc
    equilibrium level (initial parcel has surface characteristics)
    ELccl
    equilibrium level (parcel lifted from convective condensation level with CCL characteristics

    c) Measures of parcel energy
    NEGsfc
    energy that must be supplied to parcel to make it rise from the surface level to its LFC [a]
    POSsfc
    energy released by surface parcel during buoyant rise beyond level of free convection [a]
    NETsfc
    = NEGSFC + POSSFC [a]
    HEATsfc
    Energy which must be supplied to surface parcel by low-level heating to induce convective overturning [a]
    EELccl
    energy released by parcel during its ascend from CCL (to EL)

    d) Predictors derived from 1D steady state convective cloud model [b] The four characteristics were calculated in two variations: the cloud base being the lifting condensation level (appendix -lcl) and the convective condensation level (appendix -ccl).
    Tlcl, Tccl:
    temperature of the cloud base
    VZlcl, VZccl:
    maximal vertical velocity of the updraft
    Hlcl, Hccl:
    heights of the top of the updraft
    DHlcl, DHccl:
    thickness of the layer boundered by the top of the updraft and level of the maximal vertical velocity

    Tab II The individual predictive power of 15 most powerful thermodynamical predictors derived from the midnight TEMP-A data (t0 = 00 GMT). Prediction skill is given in terms of the sample reduction of variance related to the dichotomous prognostic function (4). Size of the learning sample: N = 993. <t1,t2> is the target period for which the thunderstorm occurrence is forecast.

               |               <t1,t2> [GMT]           
               |
     Predictor |<00,06>     <00,12>     <18,24>     <00,24>
               |      <06,12>     <00,18>     <12,24>
    -------------------------------------------------------
        SIMILA |  0.04  0.02  0.04  0.09  0.04  0.09  0.11
         ADED1 |  0.03  0.02  0.04  0.09  0.03  0.09  0.11
            FI |  0.05  0.02  0.05  0.08  0.02  0.08  0.10
            SI |  0.03  0.02  0.04  0.09  0.03  0.09  0.10
          SICP |  0.03  0.02  0.04  0.08  0.03  0.09  0.10
          CIIB |  0.02  0.01  0.03  0.09  0.04  0.09  0.10
        EELccl |  0.04  0.01  0.03  0.07  0.04  0.08  0.10
            VT |  0.03  0.02  0.04  0.09  0.02  0.08  0.09
           TEI |  0.02  0.01  0.03  0.09  0.04  0.09  0.09
         VZccl |  0.03  0.01  0.03  0.07  0.04  0.08  0.10
         SWEAT |  0.02  0.01  0.04  0.07  0.02  0.08  0.09
          SIHH |  0.02  0.01  0.03  0.07  0.02  0.08  0.09
          PWBI |  0.01  0.01  0.02  0.07  0.04  0.08  0.08
          Hccl |  0.02  0.02  0.03  0.06  0.03  0.07  0.08
             K |  0.03  0.03  0.05  0.05  0.01  0.05  0.07
    

    Tab III The same as in but for predictors being derived from the midday TEMP-A data (t0 = 12 GMT; N=1052). The times preceded by plus sign relate to the subsequent day.

               |               <t1,t2> GMT           
               |
     Predictor |<12,18>     <12,24>     <+6,+12>    <12,+12>
               |       <18,24>    <24,+6>     <24,+12>
    --------------------------------------------------------
            FI |  0.23  0.08  0.26  0.02  0.02  0.02  0.23
         ADED2 |  0.22  0.06  0.26  0.01  0.02  0.02  0.23
          SICP |  0.24  0.05  0.23  0.00  0.02  0.02  0.21
         SWEAT |  0.21  0.06  0.20  0.01  0.01  0.01  0.16
          Kmod |  0.19  0.07  0.20  0.01  0.01  0.02  0.16
         VZlcl |  0.18  0.05  0.18  0.03  0.01  0.03  0.18
         TTmod |  0.19  0.05  0.19  0.00  0.02  0.02  0.18
        POSsfc |  0.19  0.06  0.19  0.01  0.01  0.02  0.16
          JEFF |  0.21  0.04  0.20  0.00  0.02  0.01  0.15
            SI |  0.18  0.05  0.19  0.01  0.01  0.01  0.17
         ADED1 |  0.18  0.05  0.19  0.00  0.01  0.01  0.17
             K |  0.16  0.07  0.19  0.01  0.01  0.01  0.16
          CIIB |  0.17  0.05  0.16  0.01  0.01  0.02  0.16
            TT |  0.18  0.04  0.17  0.00  0.02  0.01  0.15
         VZccl |  0.16  0.04  0.16  0.02  0.01  0.03  0.15
    

    Tab IV Predictors selected by stepwise linear regression and stepwise logistic regression. The predictors are screened out of (a) only thermodynamical predictors; (b) all predictors (thermodynamical, persistence, frontal). The numbers in the parenthesis are RV-scores related to the predictor vector consisting of all predictors ahead of the number. The value is based on the developmental sample, so that it may little bit overestimate the true quality. The stepwise logistic regression was run only for 12-h long target periods regarding the demands on computer-time resources.

    
    ================================================================================
     t1,t2 |sel |     candidate predictors:     |       candidate predictors:       
           |(*1)|   thermodynamical variables   |    thermodynamical variables,     
           |    |                               |    persistence, front passages    
    ================================================================================
                                                                                    
              Thermodynamical predictors are derived from 00 GMT soundings          
    --------------------------------------------------------------------------------
     00,06 |lin |VZccl (0.044), FI (0.053)      |VZccl (0.044),                     
           |    |                               |F<-12,12> (0.055)                  
    --------------------------------------------------------------------------------
     06,12 |lin |SICP (0.023)                   |TS<-12,00> (0.022),                
           |    |                               |VT (0.036), F<6,12> (0.046)        
    --------------------------------------------------------------------------------
     00,12 |lin |FI (0.050), LFCsfc (0.059),    |FI (0.050), F<-12,12> (0.059),     
           |    |ELsfc (0.069)                  |TS<-12,-18> (0.068)                
           |    |                               |                                   
           |log |SICP (0.053), LFCsfc (0.061),  |SICP (0.053),                      
           |    |Hccl (0.070)                   |F<-12,12> (0.058),                 
           |    |                               |TS<-12,-18> (0.067),               
           |    |                               |Hccl (0.076)                       
    --------------------------------------------------------------------------------
     12,18 |lin |SIHH (0.112), VT (0.132)       |SIHH (0.111),                      
           |    |                               |F<6,18> (0.159),                   
           |    |                               |PERS (0.175), VT (0.188)           
    --------------------------------------------------------------------------------
     18,24 |lin |SIMILA (0.051), Tlcl (0.060)   |SIMILA (0.051),                    
           |    |                               |F<6,18> (0.075),                   
           |    |                               |F<18,24> (0.092), Tlcl (0.102)     
    --------------------------------------------------------------------------------
     12,24 |lin |SIMILA (0.126), SIHH           |SIMILA (0.126),                    
           |    |(0.146)                        |F<6,18> (0.169),                   
           |    |                               |SIHH (0.189),                      
           |    |                               |TS<-18,6> (0.201)                  
           |    |                               |                                   
           |log |SIMILA (0.134), SIHH (0.157),  |SIMILA (0.134),                    
           |    |LCLsfc (0.167)                 |F<6,18> (0.190),                   
           |    |                               |SIHH (0.214), PERS (0.228)         
    --------------------------------------------------------------------------------
     00,24 |lin |SIMILA (0.142), SIHH           |SIMILA (0.142),                    
           |    |(0.162)                        |F<6,18> (0.179),                   
           |    |                               |SIHH (0.199),                      
           |    |                               |TS<-12,-18> (0.208)                
    --------------------------------------------------------------------------------
    --------------------------------------------------------------------------------
    
              Thermodynamical predictors are derived from 12 GMT soundings          
    --------------------------------------------------------------------------------
     12,18 |lin |FI (0.228), POSsfc (0.284)     |FI (0.230), POSsfc (0.283),        
           |    |                               |F<12,18> (0.301), PERS             
           |    |                               |(0.313)                            
           |    |                               |                                   
           |log |FI (0.279), HEATsfc (0.338),   |FI(0.279), HEATsfc (0.338),        
           |    |ADED2 (0.357)                  |ADED2 (0.357),                     
           |    |                               |F<12,18> (0.378), PERS(0.387)      
    --------------------------------------------------------------------------------
     18,24 |lin |FI (0.095), POSsfc (0.107)     |FI (0.095), F<0,24> (0.112),       
           |    |                               |POSsfc (0.121)                     
    --------------------------------------------------------------------------------
     12,24 |lin |FI (0.260), POSsfc (0.304)     |FI (0.260), POSsfc (0.303),        
           |    |                               |F<12,18> (0.322), PERS             
           |    |                               |(0.333)                            
           |    |                               |                                   
           |log |FI (0.312), ADED2 (0.353),     |FI(0.312), ADED2(0.353),           
           |    |CCLsfc (0.370)                 |F<0,24>(0.375), PERS(0.388),       
           |    |                               |CCLsfc(0.397)                      
    --------------------------------------------------------------------------------
     24,+6 |lin |EELccl (0.040)                 |EELccl (0.040), F<12,36>           
           |    |                               |(0.055)                            
    --------------------------------------------------------------------------------
     +6,+12|lin |POSsfc (0.028)                 |POSsfc (0.028)                     
           |    |                               |                                   
    --------------------------------------------------------------------------------
     24,+12|lin |VZccl (0.040)                  |VZccl (0.040), F<12,36>            
           |    |                               |(0.055)                            
           |    |                               |                                   
           |log |ADED2 (0.044), CCLsfc          |ADED2 (0.044), F<12,36>            
           |    |(0.054)                        |(0.055)                            
    --------------------------------------------------------------------------------
     12,+12|lin |FI (0.240), POSsfc (0.282)     |FI (0.241), POSsfc (0.281),        
           |    |                               |F<0,24> (0.299)                    
    ================================================================================
    

    (*1)variable selection technique:

    • lin - stepwise linear regression;
    • log - stepwise

    Tab V Prognostic logistic functions for prediction of afternoon thunderstorms based on predictors derived from the noon aerological soundings.

    Prob(TS<12,18> | TEMP-A) =
    {1 + exp[0.21 - 0.32FI + 0.0040HEATsfc - 0.32ADED2 ]}-1

    Prob(TS<12,18> | TEMP-A, Persistence, Fronts) =
    {1 + exp[0.69 - 0.30FI + 0.0038HEATsfc - 0.33ADED2 - 0.044PERS - 1.1F<12-18>]}-1

    Figures

    Fig 1. The annual (May-August) frequency of thunderstorm occurrence in Prague and its 100 km surroundings. Triangles: daily average number of stations within 100 km reporting TS. Upper heavy solid line: number of days with at least one station reporting TS. The other 5 series relate to Prague and represent number of days with TS occurrence within given part of the day. The rectangles on the left hand side of the graph demarcate intervals for the respective series (except for the triangle series) calculated under assumption that the elements of given time series come from the same population.


    Fig.2 Sample climatological probability of thunderstorm occurrence related to given day of the year and period of the day. The curves represent 11-days running averages of observed frequency of TS occurrence during 1981-1989. An uppermost curve relates to thunderstorm occurrence at least in 1 station within 100 km surroundings of Prague and 24-h interval (starting at 06 GMT). The other curves relate to TS occurrence in Prague within time interval given in the brackets to the left of the respective time series and `error' rectangle. The height of the rectangles corresponds to interval, where average is based on the complete sample (1107 elements) and standard deviation is calculated under assumption that the probability of TS occurrence within given period of the day is equal for each day of the year.


    Fig 3 Scatter plot of FI vs. POSsfc derived from the midday soundings. Thunderstorm occurrence in Prague within 12h interval following the sounding is expressed by the cross. Sample conditional distributions of FI and POSsfc are represented by curves adjacent to the respective axes: dashed lines are for TS occurrence within the 12h interval, solid lines are for non-occurrence of TS.


    Fig 4. Comparison of performance of six regression techniques. The vertical bars demarcate dispersion of RV^ calculated on the testing samples (n = 526) randomly selected from the complete data set in ten runs. Applied regression techniques: (A) logistic regression; (B) tree growing algorithm run in an SP mode, (C) tree growing algorithm run in a BLDF mode; (D) k nearest neighbors with distance defined by (7); (E) k nearest neighbors with weighted Mahalanobis distance (6), (F) REEP. Predictand = TS<12,18>. The predictor vector is formed by first r variables taken from set X(lin) = (FI, POSsfc, F<12,18>, PERS) which was selected by stepwise multiple linear regression. The complete set of predictors (r = 52) consisting of 38 thermodynamical, 7 persistent, and 7 frontal predictors was processed only by tree growing algorithm run in an SP mode.


    Fig.5 The same as in but with predictors being selected by stepwise logistic regression: X(log) = (FI, HEATsfc, ADED2, F<12,18>, PERS).


    Fig.6 The same as in but with predictors being selected by binary decision tree: X(bdt) = (SICP, POSsfc, PERS, FI, F<12,18>).


    Fig.7 The binary decision tree developed in a Perfect Prog approach designed for prediction of thunderstorm occurrence in Prague in the afternoon (Predictand = TS<12,18>). The figures in terminal nodes (ellipses) give the total number of elements of the learning sample falling into the terminal node (denominator = Nterminal) and number of those with event occurrence (numerator = N(1)terminal). The plus/minus signs below the decision nodes (rectangles) show where to proceed if the respective decision rule is satisfied (+) or not satisfied (-).