Welcome to the Solutions page for Practical Statistics for Astronomers, First edition (2003).
The exercises in the book span a wide range of difficulty. Some are relatively simple algebraic exercises and others are mini research projects. Many of them are open-ended and involve constructing the right question as well as deducing the right answer. In this respect they are much like the process which goes on in real-life research. The solutions given here, correspondingly, are often not stage-by-stage answers, but rather a set of hints and suggestions, with numbers provided in cases where we have real numerical data available. This is useful for the simulations in particular, as then you can use the same set of random numbers as we did.
We welcome comments and errata, both for the solutions and for the book.
A note on data files: some are in fixed format and some in free format. The latter do not contain embedded comments, so their contents are described in the relevant solution.
There are some typographical errors and omissions in the book; please refer to the
errata list for more information on these.
You may contact the authors via jvw@phas.ubc.ca
1.1 At first sight, the discovery of a new phenomenon may not read as an
experiment as described in Section 1.1, How is science done. But it is. Describe the discovery
of pulsars (Hewish, A. et al., 1968, Nature, 217, 709) in terms of the five experimental stages.
1.2 The significance of a certain conclusion depends very strongly on
whether the most luminous known quasar is included in the
dataset. The object is legitimately in the dataset in terms of
pre-stated selection criteria. Is the conclusion robust?
Believable?
2.1 A Warm-Up on Coin-Tossing. This is not an astronomical problem but
does provide a warm-up exercise on probability and random numbers. Every
computer has a way of producing a random number between zero and one. Use
this to simulate a simple coin-tossing game where player A gets a point
for heads, player B a point for tails. Guess how often in a game of
N tosses the lead will change; if A is in the lead at toss
N, when was the previous change of lead most likely to be? And by how
much is a player typically in the lead? Try to back these guesses up with
calculations, and then simulate the game. For many more game-based
illustrations of probability, see Haigh, J., 1999, Taking Chances,
Oxford.
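For the simulation part, a minimal Python sketch along these lines may help get started (the choice of 1000 tosses and 5000 games is arbitrary, not from the book):

    import random

    def play(n_tosses):
        # score > 0 means A leads, score < 0 means B leads
        lead_changes, score, last_sign = 0, 0, 0
        for _ in range(n_tosses):
            score += 1 if random.random() < 0.5 else -1
            sign = (score > 0) - (score < 0)
            if sign != 0 and last_sign != 0 and sign != last_sign:
                lead_changes += 1
            if sign != 0:
                last_sign = sign
        return lead_changes, abs(score)

    games = [play(1000) for _ in range(5000)]
    print(sum(c for c, _ in games) / len(games))   # average number of lead changes
    print(sum(m for _, m in games) / len(games))   # average final margin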
2.2 Efficient choosing. Imagine you are on a ten-night observing run
with a colleague, in settled weather. You have an agreement that one of the
nights, of your choosing, will be for your exclusive use. Show that, if you
wait for five nights and then choose the first night that is better than any
of the five, you have a 25 per cent chance of getting the best night of the
ten. For a somewhat harder challenge, show that the optimum length of the
"training sample" is a fraction 1/e of the total.
2.3 Bayesian inference. Consider the proverbial bad penny, for which
prior information has indicated that there is a probability of 0.99 that it is
unbiased ("ok"); or a probability of 0.01 that it is double-headed ("dh").
What is the (Bayesian) posterior probability, given this information, of
obtaining 7 heads in a row? In such a circumstance, how might we consider the
fairness of the coin? Or of the experimenter who provided us with the prior
information? What are the odds on the penny being fair?
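The posterior calculation is a one-line application of Bayes' theorem; a minimal sketch using the stated prior:

    p_ok, p_dh = 0.99, 0.01            # prior: coin unbiased vs double-headed
    like_ok, like_dh = 0.5 ** 7, 1.0   # probability of seven heads under each hypothesis
    post_ok = p_ok * like_ok / (p_ok * like_ok + p_dh * like_dh)
    print(post_ok, post_ok / (1 - post_ok))   # posterior probability, and odds, that the penny is fair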
2.4 Laplace's Rule and priors. Laplace's rule (Section 2.3)
<ρ>=(N+1)/(N+M+2) depends on our prior for ρ. If we have
one success and no failures, consider what the rule implies, and discuss why
this is odd. How is the rule changed for alternative priors, for example
Haldane's?
2.5 Bayesian reasoning in an everyday situation. The probability of a
certain medical test being positive is 90 per cent, if the patient has disease
D. If your doctor tells you the test is positive, what are your
chances of having the disease? If your doctor also tells you that 1 per cent
of the population have the disease, and that the test will record a false
positive 10 per cent of the time, use Bayes' theorem to calculate the chance
of having D if the test is positive.
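A possible worked version of the Bayes' theorem arithmetic, using only the numbers given in the exercise:

    p_d = 0.01                 # prior: 1 per cent of the population has D
    p_pos_given_d = 0.90       # probability of a positive test given the disease
    p_pos_given_not_d = 0.10   # false-positive rate
    p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)
    print(p_pos_given_d * p_d / p_pos)   # P(D | positive), roughly 0.08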
2.6 "Inverse" χ2 statistic. For a Gaussian of known mean
(say zero), show that the posterior distribution for the variance is an inverse
χ2. Use the "Jeffreys Prior" for the variance:
prob(σ)=1/σ. Comment on the differences between this result,
and the one obtained by using a uniform prior on σ.
2.7 Maximum likelihood and the Poisson distribution. Suppose we have
data which obey a Poisson distribution with parameter μ, and in
successive identical intervals we observe n1, n2,
... events. Form the likelihood function by taking the product of the
distributions for each ni, and differentiate to find the
maximum likelihood estimate of μ. Is it what you expect?
2.8 Maximum likelihood and the exponential distribution. Suppose we
have data X1, X2, ... from the distribution (1/2a) e^(-x/a). Compute the
posterior distribution of a for a uniform prior, and Jeffreys's prior
prob(a) ∝ 1/a. Do the differences seem reasonable? Which prior would you
choose? If a were known, but the location μ was to be found, what would
be the maximum likelihood estimate?
2.9 Birth control. Imagine a society in which boys and girls are
(biologically) equally likely to be born, but families cease producing
children after the birth of the first boy. Are there more males than females
in the population? Attack the problem in three ways: pure thought, by a
simulation, and by an analytic calculation.
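For the simulation attack, something along the following lines should suffice (100,000 families is an arbitrary choice):

    import random

    boys = girls = 0
    for _ in range(100000):             # one pass per family
        while random.random() >= 0.5:   # a girl is born; the family keeps going
            girls += 1
        boys += 1                       # a boy is born; the family stops
    print(boys, girls)                  # the two totals come out statistically equal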
3.1 Means and variances. Find the mean and variance of a Poisson
distribution and of a power law; find the variance (= ∞) of a Cauchy
distribution.
3.2 Simple error analysis. Derive the well-known results for error
combining, for the product, and for the sum and difference, of two quantities,
from the Taylor expansion of Section 3.3.2.
3.3 Combining Gaussian variables. Use the result of Section 3.3.2 for
errors on z when z = x + y to find the distribution of the sum
of two Gaussian variables.
3.4 Average of Cauchy variables. Show that the average value of
Cauchy-distributed variables has the same distribution as the original data.
Use characteristic functions and the convolution theorem.
3.5 Poisson statistics. Draw random numbers from Poisson distributions
(Section 6.5.1) with μ = 10 and μ=100. Take 10 or 100 samples, find the average and the
rms scatter. How close is the scatter to √μ?
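A possible sketch of the numerical experiment, using numpy's Poisson generator rather than the Section 6.5.1 recipe:

    import numpy as np

    rng = np.random.default_rng()
    for mu in (10, 100):
        for n in (10, 100):
            x = rng.poisson(mu, n)
            print(mu, n, x.mean(), x.std(ddof=1), np.sqrt(mu))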
3.6 Robust statistics. Make a Gaussian with outliers by combining two
Gaussians, one of unit variance, one three times wider. Leave the relative
weight of the wide Gaussian as a parameter. Compare the mean deviation with
the rms, for various relative weights. How sensitive are the two measures of
scatter to outliers? Repeat the exercise, with a width derived from order
statistics.
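One way to set up the contaminated Gaussian and compare the two scatter measures (the contamination weights chosen here are arbitrary):

    import numpy as np

    rng = np.random.default_rng()
    n = 100000
    for weight in (0.0, 0.01, 0.1, 0.5):           # relative weight of the wide Gaussian
        wide = rng.random(n) < weight
        x = np.where(wide, rng.normal(0.0, 3.0, n), rng.normal(0.0, 1.0, n))
        mean_dev = np.mean(np.abs(x - x.mean()))
        print(weight, mean_dev, x.std(ddof=1))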
3.7 Change of Variable. Suppose that φ is uniformly
distributed between zero and 2π. Find the distribution of sin
φ. How could you find the distribution of a sum of sines of
independent random angles?
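A quick numerical check of the sin φ part; the comparison density quoted in the comment is the standard arcsine result for the sine of a uniformly distributed angle:

    import numpy as np

    rng = np.random.default_rng()
    s = np.sin(rng.uniform(0.0, 2.0 * np.pi, 200000))
    hist, edges = np.histogram(s, bins=40, range=(-0.99, 0.99), density=True)
    centres = 0.5 * (edges[:-1] + edges[1:])
    expected = 1.0 / (np.pi * np.sqrt(1.0 - centres ** 2))   # arcsine density
    print(np.round(hist / expected, 2))                      # ratios should hover around 1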
3.8 Order statistics. We record a burst of N neutrinos from a
supernova, and the probability of recording a neutrino at time t is, in
suitable units, exp[-(t - t0)] for t ≥ t0, where
t0 is the time of emission. The maximum likelihood estimate
of t0 is just T1, the time of arrival of
the first neutrino. Use order statistics (Section 3.4) to show that the
average value of T1 is just t0 + 1/N. Is
this MLE biased, but consistent (i.e. the correct answer as N →
∞)?
4.1 Correlation testing (D). Consider the Hubble plot of Figure 4.3.
What is (a) the most likely value for ρ via the Jeffreys test; (b) the
significance of the correlation via the standard Fisher test and (c) the
significance via the Spearman rank test? Estimate distributions for these
statistics with a bootstrap (Section 6.6); and compare the results with the
standard tests.
4.2 Permutation tests (D). (a) Take a small set of uncorrelated pairs
(X,Y), preferably non-Gaussian. By permutation methods on the
computer, derive distributions of Fisher r, Spearman's and Kendall's
statistics. (b) Try the same numerical experiment with correlated data, using
the bootstrap and the jackknife to estimate distributions (Section 6.6).
(Correlated non-Gaussian data are provided for the multivariate
t-distribution, which is Cauchy-like for one degree of freedom and
becomes more Gaussian for larger degrees of freedom.) How robust are the
conclusions against outliers?
4.3 Principal-Component Analysis (D). Carry through a PCA on the data
of the quasar sample given in Francis et al. 1999 (Francis, P.J. and Wills,
B.J. 1999, in ASP Conf. Ser. 162: Quasars and Cosmology, p363). Compute
errors with a bootstrap analysis or jackknife (Section 6.6).
4.4 Lurking third variables. Consider the following correlations, and
speculate on how a third variable might be involved. (a) During the Second
World War J.W. Tukey discovered a strong positive correlation between accuracy
of bombing and the presence of enemy fighter planes. (b) There is a well-known
correlation between stock market indices and the sunspot cycle. (c) The
apparent angular size of radio sources shows a strong inverse correlation with
radio luminosity.
5.1 Kolmogorov-Smirnov (D). Use the data provided, two datasets, one
with a total of m = 290 observations, the other with 386 measurements. The
former is of flux densities measured at random positions in the sky; the
latter of flux densities at the positions of a specified set of galaxies.
Using the Kolmogorov-Smirnov two-sample test, examine the hypothesis that
there is excess flux density at the non-random positions.
5.2 Wilcoxon-Mann-Whitney (D). Repeat the test with the
Wilcoxon-Mann-Whitney statistic. Is the significance level different? How
would you combine the results from these two tests, plus the chi-square test
in the text?
5.3 t-test and outliers (D). Create two datasets, one
drawn from a Gaussian of unit variance, the other drawn from a variable
combination of two Gaussians, the dominant one of unit variance and the other
three times wider. All Gaussians are of zero mean. Perform a t-test
on sets of 10 observations and investigate what happens as contamination from
the wide Gaussian is increased. Compare the effect on the posterior
distribution of the difference of the means. Now shift the narrow Gaussian by
half a unit, and repeat the experiment. What effect do the outliers have on
our ability to refute the null hypothesis? How does the Bayesian approach
compare?
5.4 F-test (D). Create some random data, as in the first part of
Exercise 5.3. Investigate the sensitivity of the standard F-test to a small
level of contamination by outliers.
5.5 Non-parametric alternatives (D). Repeat the analysis of the
last two exercises, using a non-parametric test; the Wilcoxon-Mann-Whitney
test for the location test, and the Kolmogorov-Smirnov test for the variance
test. How do the results compare with the parametric tests? Can you detect
genuine differences in variance, apart from the outliers?
5.6 Several datasets, one test. Suppose you have N
independent datasets, and with a certain test you obtain a significance level
of pi for each one. A useful overall significance is given
by the W statistic (Peacock, J.A., 1985, Mon. Not. R. astr. Soc., 217,
601), which is W = ∏i pi, the product of the N individual significance
levels. Find the distribution of log W and describe how it could be used.
Note this contrasts with the case discussed in the text, where we might
perform several different tests on the same dataset. (Under the null
hypothesis each pi is uniformly distributed between zero and one, so each
-log pi follows an exponential distribution of unit mean; -log W is the sum
of these, and tends to a Gaussian of mean N and variance N.)
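A quick simulation check of the limiting behaviour described in the note (N = 5 and the number of trials are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng()
    N = 5
    p = rng.random((200000, N))               # N independent p-values per trial, uniform under the null
    neg_log_w = -np.log(p).sum(axis=1)        # -log W = -sum log p_i
    print(neg_log_w.mean(), neg_log_w.var())  # both should be close to N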
5.7 Gram-Charlier (D). Take some data drawn from a Gaussian and
investigate the posterior likelihood if just one term (the quadratic) is used
in a Gram-Charlier expansion as an assumption for the "true" distribution.
Take the location as known. Find the distribution of the variance,
marginalizing out the Gram-Charlier parameter. Also, find the odds on
including the parameter in the model. What does this tell you about assuming
a Gaussian distribution when the amounts of data are limited?
5.8 Odds versus classical tests. Use the small dataset from the
Example in Section 5.2.3. Perform a classical analysis, using t- and
F-tests. Compare and contrast to the odds calculated in the text.
Does the Behrens-Fisher distribution give a better answer than either or
both? See Jaynes' comments on confidence intervals (Jaynes, E.T., 1983,
Papers on Probability, Statistics and Statistical Physics, ed.
Rosenkrantz, R.D., Reidel, Dordrecht).
6.1 Covariance matrix. Consider N data Xi,
drawn from a Gaussian of mean μ and standard deviation
σ. Use maximum likelihood to find estimators of both μ
and σ, and find the covariance matrix of these estimates.
6.2 Weighting data. Show that the optimum weight for an
observation of standard deviation σ is just 1/σ2. This weight turns up
naturally in modelling using minimum-χ2.
6.3 MLE and power laws. In the example of Section 6.1 we fit a
power law truncated at the faint end, and assume we know where to cut it off.
What happens if you try to infer the faint-end cutoff by ML as well?
Formulate this problem at least.
6.4 Univariate random numbers.
Work out the inverses of the integral functions required to generate (a) f(x) = 2x^3, (b) a power law,
representative of luminosity functions, f(x) = x^(-γ). Use these results to produce random
experiments following these probabilities by drawing 1000 random samples uniformly distributed between 0 and 1;
verify by comparison with the given functions.
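For part (b), a sketch of the inverse-transform step, assuming the power law is truncated to a range [a, b] with index γ ≠ 1 so that it can be normalized (the particular numbers are illustrative only; part (a) proceeds in the same way):

    import numpy as np

    rng = np.random.default_rng()
    gamma, a, b = 2.5, 1.0, 100.0            # illustrative index and range, not from the book
    u = rng.random(1000)
    x = (a ** (1 - gamma) + u * (b ** (1 - gamma) - a ** (1 - gamma))) ** (1.0 / (1 - gamma))
    # compare a normalized histogram of x with a curve proportional to x**(-gamma)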
6.5 Multivariate random numbers. (a) Give the justification for why
the prescription (Section 6.5) for generating (X,Y) pairs following a
Gaussian of given variance and correlation coefficient is correct. (b) Using a
Gaussian Monte-Carlo generator, find 1000 (X,Y) pairs following a given
prescription, i.e. σx2,
σy2 and ρ. Plot these on contours
of the bivariate probability distribution, as in Figure 6.2, to check roughly
that the prescription works. (c) Find the error matrix for the (X,Y)
pairs to verify that the prescription works.
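A sketch of one common recipe, mixing two independent unit Gaussians; check against the Section 6.5 prescription that this is indeed the construction intended:

    import numpy as np

    rng = np.random.default_rng()
    sx, sy, rho = 1.0, 2.0, 0.7                    # illustrative values
    g1, g2 = rng.normal(size=(2, 1000))            # two independent unit Gaussians
    x = sx * g1
    y = sy * (rho * g1 + np.sqrt(1 - rho ** 2) * g2)
    print(np.cov(x, y))                            # compare with [[sx^2, rho*sx*sy], [rho*sx*sy, sy^2]]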
6.6 Monte Carlo integration. The Gaussian or Normal distribution
function (1/σ√(2π)) exp[-x2 / 2σ2] does not have an analytic integral form. Use Monte
Carlo integration to find erf, the so-called error function of Table A2.1.
Show that (a) approximately 68 per cent of its area lies between ±
σ, and (b) that the total area under the curve is unity.
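A minimal Monte Carlo sketch, integrating over a finite range wide enough that the neglected tails are negligible:

    import numpy as np

    rng = np.random.default_rng()
    n = 1000000
    x = rng.uniform(-5.0, 5.0, n)                   # unit-variance Gaussian; ±5σ captures essentially all the area
    f = np.exp(-x ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
    total = 10.0 * f.mean()                         # Monte Carlo estimate of the integral over (-5, 5)
    within = 10.0 * (f * (np.abs(x) < 1.0)).mean()  # contribution from |x| < 1 sigma
    print(total, within / total)                    # roughly 1 and 0.68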
6.7 Maximum Likelihood estimates. Find an estimator of μ
when the distribution is (a) prob(x) = exp[-|x-μ|] and (b)
the Poisson prob(n) = μ^n e^(-μ) / n!.
6.8 Least Squares linear fits. Derive the "minimum distance" OLS
for errors in both x and y, assuming Gaussian errors.
6.9 Marginalization. Using the data supplied, use maximum
likelihood to find the distribution of the parameters of a fitted Gaussian
plus a baseline. Test to see how the estimates are affected by marginalizing
out the baseline parameters.
6.10 The Jackknife method. Using the MLE for a power-law index
(Section 6.1), work out confidence intervals with the jackknife and the
bootstrap, and compare them with the analytic result from that section. Check
how the results depend on sample size.
7.1 Source counts and luminosity function. Derive the relationship
between the number count and the luminosity function for a general luminosity
function; show that the result takes a simple form for a power-law luminosity
function.
7.2 Noise and source-count slope. Generate data from a power-law
source count and add noise; by a maximum-likelihood fit, investigate the
effect of the noise level on the inferred source-count slope. Use the results
from Exercise 7.1 to show the effect of the noise on the luminosity function.
7.3 Survey completeness and noise. Make a 1-D Gaussian signal plus
noise plus baseline, fit a profile, verify completeness versus signal-to-noise
ratio. Do the same for an empty field.
7.4 Reliability and completeness. Calculate the relationship
between reliability and completeness for an exponential noise distribution.
This shows the effect of wide wings on the noise distribution. Compare with
the result for a Gaussian.
7.5 Vmax method (D). Simulate a flux-limited
sample of galaxies by populating a large volume of space with galaxies drawn
from a Schechter distribution. (The cumulative form of the Schechter
distribution is rather complicated so you may prefer to use a power law.)
Apply the Vmax method and see if you can recover the input
distribution. Check the simple √N error bars against repeated runs
of the simulation.
7.6 Error estimates (D). Adapt the simulation of Exercise 7.5 to
produce bootstrap error estimates. Compare these with √N and
Monte Carlo estimates, especially for the case of few objects per bin.
(The data here are the same as for 7.5)
7.7 Luminosity-distance `correlation' (D). Adapt the simulation of
Exercise 7.5 to the case for which each galaxy has two independent
luminosities assigned to it (at different wavelengths, say). Check that these
luminosities show a bogus correlation unless upper limits are included in the
analysis. Adapt the simulation to produce intrinsically-correlated
luminosities and show that the Kendall test can detect these correlations.
(The data here are the same as for 7.5)
7.8 Parameter error estimates. Use the X-ray and radio data from
Avni et al. (Avni, Y., Soltan, A., Tananbaum, H. and Zamorani, G., 1980,
Astrophys. J., 238, 800) as given in the example in the text, to work out the
mean spectral index in their survey. Using their likelihood function as a
starting point, work out error bounds on the mean, using a likelihood ratio.
You will need to use a Lagrange multiplier in the maximization of the
likelihood.
7.9 Source counts from confusion (D). In a confusion-limited survey
where there are potentially several sources per beam, the apparent source
count can be very different from the true one. On the assumption that sources
can lie anywhere in the beam and are not clustered, derive the result for the
source count
as given in Section 7.6.
8.1 Fourier transform and FFT. Use a direct numerical integration to do
a numerical Fourier transform of an oscillatory function, say a sine wave or a
Bessel function. Compare the timings with an off-the-shelf FFT routine,
checking how many oscillations you can fit in your region of integration
before the FFT accelerates away from the direct method.
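A sketch of the timing comparison, using a direct O(N^2) matrix evaluation of the discrete transform in place of explicit numerical integration (the record length and number of oscillations are arbitrary choices):

    import time
    import numpy as np

    n = 1024
    t = np.arange(n)
    x = np.sin(2 * np.pi * 50 * t / n)              # 50 oscillations across the record

    start = time.perf_counter()
    k = np.arange(n)
    direct = np.exp(-2j * np.pi * np.outer(k, t) / n) @ x   # direct O(n^2) transform
    t_direct = time.perf_counter() - start

    start = time.perf_counter()
    fast = np.fft.fft(x)
    t_fft = time.perf_counter() - start

    print(t_direct, t_fft, np.abs(direct - fast).max())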
8.2 Wiener filtering and 1/f noise (D). Make some synthetic
data along the lines of the example in Figure 8.1, and filter it with a
Wiener filter appropriate for uncorrelated Gaussian noise. Now generate some 1/f
noise. Add this in to the input spectrum, and perform the filtering again,
without taking account of the extra low-frequency noise in the form of the
Wiener filter. Does the 1/f noise affect (a) the line profile
parameters (b) the baseline parameters?
8.3 Periodogram (D). Consider the Lomb-Scargle periodogram method
as formulated by Press et al. (Press, W.H., Teukolsky, S.A., Vetterling, W.T.
and Flannery, B.P., 1992, Numerical Recipes (Chapter 13.8), CUP); use
the Numerical Recipes routines to test the following issues. (a) If we can
sample at frequencies much above the pseudo-Nyquist rate, by how much? Where does this run
out? Why, in practice, can we not realize the sampling at these high frequencies
provided by scattered time measurements? (b) The lines of probability in
Figure 8.5 are roughly correct for the random uniform coverage of the left set
of data. For the data on the right, uniformity has been assumed and the
probabilities in the diagram are incorrect. Use the Numerical Recipes routine
and the Monte-Carlo technique outlined to determine how they should be
adjusted.
(Solution not yet available.)
8.4 Properties of the power spectrum of periodic data. From the
max/min statistics analysis of Section 3.4: a) Find the probability
density function equivalent to equation 7.11 for minimum values. b)
Show that the peak of the distribution function of the max in the power
spectrum of data N long is ln(N).
8.5 Power spectrum of signal + noise. For data
containing a deterministic signal S and Gaussian noise x, show
that the noise distribution in each component of the power spectrum is in
general a non-trivial combination of χ2 and Gaussian
noise.
8.6 1/f noise. Harmonic analysis (sampling, Fourier
transforming) of Beethoven's symphonies indicates that their power spectra
follow the 1/f law to a good approximation. Consider why this should be
so. See Press 1978 (Press, W.H., 1978, Comments Astrophys., 7, 103) for a few
hints.
8.7 Filtering and mean values. Take your favourite implementation
of the FFT, and form the power spectrum of a scan consisting entirely of
uncorrelated Gaussian noise. Integrate the power spectrum; is the answer the
variance of the input data? If not, why not? Now convolve the data with your
favourite (normalized) filter. From the zero frequency of the power spectrum,
what is the variance in the mean? Does it change if you change the width of
the filter? Explain.
8.8 Baselines (D). Fit a Fourier baseline interactively to a
spectrum containing a moderately obvious but contaminated line. Now,
separately, fit a Gaussian to the line and give your best estimate of the
uncertainty in the total flux in the line. Compare this with a complete
Bayesian analysis, fitting the same number of harmonics plus Gaussian ab
initio, and then marginalizing out the baseline parameters.
9.1 Rayleigh's test. Why should the test statistic for Rayleigh's test
be asymptotically χ2? Compute the statistic for small
numbers, say <10; see Section 3.3.3.
9.2 Variance of estimators for w(θ) (D). Generate 20,000
data points randomly in the region 0° < α < 5°,
0° < δ < 5°. Estimate
w(θ) using the Natural estimator w1, the Peebles
estimator w2, the Landy-Szalay estimator
w3 and the Hamilton estimator w4. (Average
DR and RR over say 10 comparison sets each of 20,000 random
points.) Plot the results as a function of θ, showing Poisson error
bars 1/√DD. Comment on the results - which estimator is best?
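A much-reduced sketch of the machinery, showing only the Landy-Szalay estimator with a single random comparison set and flat-sky separations (the exercise asks for 20,000 points, all four estimators, and averaging over 10 random sets):

    import numpy as np

    rng = np.random.default_rng()

    def pair_counts(a, b, bins):
        # flat-sky separations in degrees over a small patch
        d = np.hypot(a[:, 0, None] - b[None, :, 0], a[:, 1, None] - b[None, :, 1])
        return np.histogram(d, bins=bins)[0]

    n = 2000                                        # kept small here; the exercise asks for 20,000
    data = rng.uniform(0.0, 5.0, (n, 2))
    rand = rng.uniform(0.0, 5.0, (n, 2))
    bins = np.linspace(0.05, 1.0, 11)
    dd = pair_counts(data, data, bins) / (n * (n - 1))
    rr = pair_counts(rand, rand, bins) / (n * (n - 1))
    dr = pair_counts(data, rand, bins) / (n * n)
    print((dd - 2 * dr + rr) / rr)                  # Landy-Szalay estimate; ~0 for unclustered points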
9.3 Integral constraint on w(θ). (a) Show that the
factor C in equation 9.27 is 1/(1+W) where W = ∫
w(θ) dGp. (b) Derive an approximate expression for
W, assuming a power-law form for w(θ) = (θ/θ0)^(-b).
9.4 The effect of surface density changes on w(θ) (D).
(a) Estimate the magnitude of the offset in w(θ) taking a simple
model in which a survey is divided into two equal areas between which there is
a fractional surface density shift ε (Equation 9.29). Find the
expected step in w(θ) as a function of ε; verify that a
step of 20 per cent results in Δw = 0.01. (b) Confirm this
prediction with a toy-model simulation, putting say 100,000 random points in
the region 00 < α < 600, -200 <
δ < +200 with a 20 per cent step at δ =
00. Then calculate the w(θ) using say Landy-Szalay
(w4) estimator over the small-angular-scale range θ <
10, checking that w(θ) agrees within errors with the
prediction from equation 9.29.
9.5 The effect of surface-density changes on c-in-c (D). (a) Use
the 100,000 random points generated in the region 0° < α <
60°, -20° < δ < +20° for
Exercise 9.4. Generate a set of 10 grid patterns for circular non-overlapping
cells over the area, with diameter 0.03° to 3°, evenly
spaced in logθ. Compile P(N) for each of these and show
that the means and variances are as expected for Poissonian distributions.
Calculate and plot the variance statistic y(L) (equation 9.31) as a
function of cell size; verify that there is no significant offset from zero.
(b) Put in a 20 per cent step in surface density, dividing the field in half
at δ = 0°. Recalculate the y(L) and verify that
the apparent offset in y(L) is of the expected magnitude Δy =
0.01 (equation 9.41).
9.6 w(θ) and the angular power spectrum (D). Simulate a
square piece of sky 10° x 10°, using 10,000
points placed at random. (a) Verify that there is no significant signal either
in w(θ) or in the angular power spectrum. (b) Build a hierarchy of
galaxy clusters and clusters of clusters using perhaps another 10000 points,
adopting Gaussian shapes for clusters, cluster-clusters, etc. (c) Show that
with a few adjustments to the parameters (see Figure 9.1), it is possible to
produce an approximate power law of slope ~-1 for a single
hierarchy of clustering. Relate the resultant form of the angular power
spectrum and its information content to this w(θ).
Select one of the following links for solutions to the relevant chapter's exercises.
For the exercises denoted by (D), datasets are provided; alternatively, create your own.
Chapter 1
Chapter 2
Chapter 3
Chapter 4
Chapter 5
Chapter 6
Chapter 7
Chapter 8
Chapter 9