Wikipedia talk:WikiProject Statistics/Manual of Style

Discussion

There are several things that could be discussed:

Random variables

Is it OK to discourage the use of capital letters to denote random variables? It certainly seems that statisticians don't use that convention anymore. When we have a sample x₁, …, x_n, each of these quantities is itself a random variable, an iid copy from some common distribution F.

quantities? Maybe that usage should be deprecated.

Oppose. A sequence of random variables should be X1, X2, ...; a sequence of quantities (variates?) x1, x2, ... --P64 (talk) 15:06, 14 May 2010 (UTC)

What is a variate? This is a rather strange concept... When we write the linear regression model y_i = x′_iβ + ε_i, then (y_i, x_i, ε_i) are all random variables here. Do you want to write them in uppercase then? // stpasha » 19:23, 15 May 2010 (UTC)

Really? I guess it depends what area you're in, but capitalisation still seems pretty prominent. Also, they are capitalised in both the Notation in probability and statistics and random variable articles. —3mta3 (talk) 13:06, 16 May 2010 (UTC)

And see random variate, although there is not much notation there. The distinction between a random variable and an outcome of the random variable (i.e. a random variate) is commonly made in stats books and has been commonly observed in most stats articles here. The distinction is much the same as, and has been around for as long as, the use of corresponding Greek and Latin characters for population and sample statistics. Melcombe (talk) 09:45, 17 May 2010 (UTC)

And the distinction between random variables and outcomes really does need to be maintained, otherwise you end up with nonsense such as conditioning on x=x. Melcombe (talk) 10:04, 17 May 2010 (UTC)
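To make the last point concrete, here is a minimal sketch of the two conventions side by side. The symbols Y and y are illustrative assumptions, not taken from the comments above, and \operatorname assumes the amsmath package.

```latex
% Uppercase = random variable, lowercase = realized value (assumes amsmath).
\begin{align*}
  &\Pr(X \le x)                   && \text{random variable $X$, fixed value $x$} \\
  &\operatorname{E}[Y \mid X = x] && \text{conditioning on the event $\{X = x\}$} \\
  &\operatorname{E}[y \mid x = x] && \text{the same statement with the two cases merged}
\end{align*}
```

In the third line the condition x = x no longer says which symbol is random and which is the observed value.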

Parentheses / brackets

It seems to be common practice to use square brackets with the expectation: E[x]. For variance usage is split about evenly between Var[x] and Var(x), while for covariance parentheses already dominate: Cov(x, y). So what should our recommendation be?

In contexts where rules exist for the ordering of different types of brackets, as for some journals, the result is often that the type of bracket immediately following an E would be determined by that rule, for example by using ( for a simple formula inside the brackets and by [ if there were a formula which itself had two internal layers of brackets. Thus quite often there is no specific determination for what should follow an operator such as E, other than this rule derived from the complexity of the formula. Melcombe (talk) 09:58, 17 May 2010 (UTC)
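As an illustration only (not a proposed rule), the bracket choices under discussion could be typeset as below. The nested line uses the standard identity Var(x) = E[(x − E[x])²] to show how inner brackets affect the choice of outer delimiter; \operatorname and \bigl/\bigr assume the amsmath package.

```latex
% Bracket styles mentioned in this thread (assumes amsmath).
\begin{align*}
  &\operatorname{E}[x], \quad \operatorname{Var}[x], \quad
   \operatorname{Var}(x), \quad \operatorname{Cov}(x, y) \\
  % A nested case: the outer delimiter must differ from the inner ones.
  &\operatorname{Var}(x)
     = \operatorname{E}\bigl[(x - \operatorname{E}[x])^2\bigr]
\end{align*}
```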

Small / capital correlation

The symbols for expectation, variance and covariance are all traditionally uppercase: E, Var, Cov; whereas correlation is almost always seen in lowercase: corr. Should we convert it to uppercase as well, or leave it be?

Transposition

The common symbol in statistics to denote transposition is the prime ′, as in x′. This contradicts the MOS:MATH recommendation, which is to use \top or x^T.
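For comparison, a sketch of the three candidate notations applied to the regression model quoted earlier on this page (y_i = x′_iβ + ε_i). The upright-T variant via \mathrm is one possible rendering of x^T and is an assumption about how it would be typeset.

```latex
% Three ways to typeset the transpose in y_i = x_i' beta + eps_i (assumes amsmath).
\begin{align*}
  y_i &= x_i'\beta + \varepsilon_i              && \text{prime} \\
  y_i &= x_i^{\top}\beta + \varepsilon_i        && \text{\texttt{\textbackslash top}} \\
  y_i &= x_i^{\mathrm{T}}\beta + \varepsilon_i  && \text{upright T}
\end{align*}
```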

Italic / straight distributions

Some common distributions such as normal N(μ, σ²), t-distribution t_k, F-distribution F_{k,ℓ}, chi-squared χ²_k, uniform U(a, b) are all traditionally written in italic. For other, not-so-common distributions, there is no tradition. Should we require them to be in italic as well (e.g. Poisson, Exponential, Binomial)? If so, should they be abbreviated where possible (e.g. Poi, Exp, B)?

Abbreviations recommended for general use should not be shorter than Uni, Poi, Expo (not Exp), Bin, Gam, Beta. (I prefer Unif, Pois, Gamma.) In other words the classical N, t, F, and chi should be exceptions.

Offhand I would deprecate italic face too. Use italics only for one-letter abbreviations.

What about the use of t, F, and chi for statistics, which may sometimes be interpreted as realizations of t, F, and chi random variables? --P64 (talk) 15:17, 14 May 2010 (UTC)

One-letter B should be a Brownian random variable, if anything. Brownian B_t, unlike binomial Bin(n,p), may be considered an extension of the classical family of one-letter exceptions {N, t, F, chi}. --P64 (talk) 15:59, 14 May 2010 (UTC)

The one-letter abbreviation for uniform is very common: U(a, b). As for the exponential, I don’t see why you’d oppose the more natural Exp(λ)? It is unlikely to be mistaken for the exponential function, since the function is always written in lowercase. And generally the manner in which this notation is used will also indicate that this is a random variable: X ~ Exp(λ). For the gamma and beta distributions the most natural notations would of course be Γ and Β, but those letters can be too easily mistaken for the gamma and beta functions. // stpasha » 19:13, 15 May 2010 (UTC)
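To show what the italic versus upright choice looks like in practice, here is a hedged sketch. It sets the one-letter classical symbols in default math italic and the multi-letter abbreviations both ways; the particular abbreviations are the ones floated in this thread, not an agreed standard, and \operatorname assumes amsmath.

```latex
% Italic vs. upright distribution names (assumes amsmath).
\begin{align*}
  % One-letter classical symbols, default math italic:
  &X \sim N(\mu, \sigma^2), \quad X \sim t_k, \quad X \sim F_{k,\ell},
   \quad X \sim \chi^2_k, \quad X \sim U(a, b) \\
  % Multi-letter abbreviations, italic:
  &X \sim \mathit{Exp}(\lambda), \quad X \sim \mathit{Poi}(\lambda),
   \quad X \sim \mathit{Bin}(n, p) \\
  % Multi-letter abbreviations, upright:
  &X \sim \operatorname{Exp}(\lambda), \quad X \sim \operatorname{Poi}(\lambda),
   \quad X \sim \operatorname{Bin}(n, p)
\end{align*}
```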

Distributions: parameters vs. degrees of freedom

I think there is a critical distinction between a parameter, such as λ of the exponential distribution, and the degrees of freedom, such as ν of the t-distribution. In practice the λ of the exponential is rarely known, so it has to be estimated. This is why it is a parameter, and it is meaningful, for example, to ask what the Hessian of the log-likelihood of the distribution with respect to this parameter is. On the other hand, the degrees of freedom “parameter” is not a true parameter, since it is always known beforehand in applications and never estimated. In particular the Fisher information with respect to this ν does not exist (although technically it could probably be calculated). Notationally, the distinction between these “estimable” and “non-estimable” parameters is that the former are given in parentheses, like N(μ, σ²), while the latter are given as a subscript: t_k. If we make this into a rule, then the notation for some distributions will have to be changed; for example, the binomial would become B_n(p).

Do not encourage stringing the symbols for rarely estimated parameters together as subscripts that precede parentheses, such as B_n(p). If some notational distinction is valuable, why not use a separator other than the comma inside the parentheses? For example, if the semicolon is adopted: Bin(n;p) or perhaps Bin(p;n) for the binomial family of distributions. --P64 (talk) 15:51, 14 May 2010 (UTC)
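The alternatives raised in this thread, typeset side by side for comparison. This is only a sketch of the options as stated above, with no preference implied; \operatorname assumes amsmath.

```latex
% Candidate notations for the binomial with known n and estimated p (assumes amsmath).
\begin{align*}
  &X \sim B_n(p)                   && \text{known $n$ as a subscript} \\
  &X \sim \operatorname{Bin}(n; p) && \text{semicolon separator, $n$ first} \\
  &X \sim \operatorname{Bin}(p; n) && \text{semicolon separator, $p$ first} \\
  &X \sim \operatorname{Bin}(n, p) && \text{plain comma, no distinction}
\end{align*}
```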

Sample size

Both n and T are viable symbols to denote the sample size. T is more frequent in time series models, whereas n is more frequent in iid settings. However, it should be forbidden to use these symbols to denote anything other than the sample size (as, for example, the Numerical methods for linear least squares article does), otherwise it would cause too much confusion.

So you would not only deprecate but forbid the use of T for a random variable such as stopping time or hitting time or return time? Or is that the random size of a kind of "sample"?

Why not I for the index set, and thus commonly the sample size, where i is the index variable; J where j is the index variable; K where k is the index variable; T where t is the index variable? This is consistent with permission for I, J, K to be random sample sizes aka index sets where appropriate.

In company with I for the index set, indicator variables should use some fontface rendition of 1 (one) rather than some fontface rendition of I. --P64 (talk) 15:41, 14 May 2010 (UTC)

Samples from a finite population are often not iid, but n is still typically used for the sample size. We also have N for the population size, and many more associated conventions, e.g. m for cluster size, M for the number of clusters, h for a stratum index, and so on. It's probably unnecessary to incorporate much of this explicitly here. In most cases, I think the notation used in an article should simply follow sources in the relevant field. --Avenue (talk) 13:16, 15 May 2010 (UTC)

Mathematical formulas

The content of the Mathematical Formulas section, while sensible advice, is not really specific to statistics (or probability). I would propose adding it to MOS:MATH. Alternatively, move it to the bottom under a General Advice section, to make the statistics-specific guidance more prominent. —3mta3 (talk) 12:51, 16 May 2010 (UTC)