Mathematical Statistics: Exercises and Solutions

Jun Shao

This book consists of solutions to 400 exercises, over 95% of which come from the author's Mathematical Statistics. That textbook covers the topics in statistical theory essential for graduate students working toward a Ph.D. in statistics.

On the other hand, this is a stand-alone book, since the exercises and solutions are comprehensible independently of their source. Many of the solutions involve standard exercises that also appear in other textbooks listed in the references. To help readers who are not using this book together with Mathematical Statistics, lists of notation, terminology, and some probability distributions are given in the front of the book.

Readers are assumed to have a good knowledge of advanced calculus. A course in real analysis or measure theory is highly recommended. If this book is used with a statistics textbook that does not include probability theory, then knowledge of measure-theoretic probability theory is required. The exercises are grouped into seven chapters with titles matching those in Mathematical Statistics.

Mathematical Statistics: Exercises and Solutions

Jun Shao
Department of Statistics
University of Wisconsin
Madison, WI 53706, USA
shao@stat.wisc.edu

Library of Congress Control Number: 2005923578
ISBN-10: 0-387-24970-2
ISBN-13: 978-0387-24970-4

Printed on acid-free paper.

© 2005 Springer Science+Business Media, Inc. All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, Inc., 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed in the United States of America.

springeronline.com

To My Parents

Preface

Since the publication of my book Mathematical Statistics (Shao, 2003), I have been asked many times for a solution manual to the exercises in my book. Without doubt, exercises form an important part of a textbook on mathematical statistics, not only in training students for their research ability in mathematical statistics but also in presenting many additional results as complementary material to the main text. Written solutions to these exercises are important for students who initially do not have the skills to solve these exercises completely, and they are very helpful for instructors of a mathematical statistics course (whether or not my book Mathematical Statistics is used as the textbook) in providing answers to students as well as in finding additional examples for the main text. Motivated by this and encouraged by some of my colleagues and Springer-Verlag editor John Kimmel, I have completed this book, Mathematical Statistics: Exercises and Solutions.

This book consists of solutions to 400 exercises, over 95% of which are in my book Mathematical Statistics. Many of them are standard exercises that also appear in other textbooks listed in the references. It is only a partial solution manual to Mathematical Statistics (which contains over 900 exercises). The types of exercises in Mathematical Statistics not selected for the current book are (1) exercises that are routine (each exercise selected in this book has a certain degree of difficulty), (2) exercises similar to one or several exercises selected in the current book, and (3) exercises on advanced material that is often not included in a mathematical statistics course for first-year Ph.D. students in statistics (e.g., Edgeworth expansions and second-order accuracy of confidence sets, empirical likelihoods, statistical functionals, generalized linear models, nonparametric tests, and theory for the bootstrap and jackknife). On the other hand, this is a stand-alone book, since exercises and solutions are comprehensible independently of their source for likely readers. To help readers not using this book together with Mathematical Statistics, lists of notation, terminology, and some probability distributions are given in the front of the book.
All notational conventions are the same as or very similar to those in Mathematical Statistics, and so is the mathematical level of this book. Readers are assumed to have a good knowledge of advanced calculus. A course in real analysis or measure theory is highly recommended. If this book is used with a statistics textbook that does not include probability theory, then knowledge of measure-theoretic probability theory is required. The exercises are grouped into seven chapters with titles matching those in Mathematical Statistics. A few errors in the exercises from Mathematical Statistics were detected during the preparation of their solutions, and the corrected versions are given in this book. Although exercises are numbered independently of their source, the corresponding number in Mathematical Statistics accompanies each exercise number for the convenience of instructors and readers who also use Mathematical Statistics as the main text. For example, Exercise 8 (#2.19) means that Exercise 8 in the current book is also Exercise 19 in Chapter 2 of Mathematical Statistics.

A note to students/readers who need exercises accompanied by solutions is that they should not be completely driven by the solutions. Students/readers are encouraged to try each exercise first without reading its solution. If an exercise is solved with the help of a solution, they are encouraged to provide solutions to similar exercises as well as to think about whether there is an alternative solution to the one given in this book. A few exercises in this book are accompanied by two solutions and/or notes of brief discussion.

I would like to thank my teaching assistants, Dr. Hansheng Wang, Dr. Bin Cheng, and Mr. Fang Fang, who provided valuable help in preparing some solutions. Any errors are my own responsibility, and corrections of them can be found on my web page http://www.stat.wisc.edu/~shao.

Madison, Wisconsin
April 2005
Jun Shao

Contents

Preface ... vii
Notation ... xi
Terminology ... xv
Some Distributions ... xxiii
Chapter 1. Probability Theory ... 1
Chapter 2. Fundamentals of Statistics ... 51
Chapter 3. Unbiased Estimation ... 95
Chapter 4. Estimation in Parametric Models ... 141
Chapter 5. Estimation in Nonparametric Models ... 209
Chapter 6. Hypothesis Tests ... 251
Chapter 7. Confidence Sets ... 309
References ... 351
Index ... 353

Notation

R: The real line.
R^k: The k-dimensional Euclidean space.
c = (c_1, ..., c_k): A vector (element) in R^k with jth component c_j ∈ R; c is considered a k × 1 matrix (column vector) when matrix algebra is involved.
c^τ: The transpose of a vector c ∈ R^k, considered a 1 × k matrix (row vector) when matrix algebra is involved.
‖c‖: The Euclidean norm of a vector c ∈ R^k, ‖c‖^2 = c^τ c.
|c|: The absolute value of c ∈ R.
A^τ: The transpose of a matrix A.
Det(A) or |A|: The determinant of a matrix A.
tr(A): The trace of a matrix A.
‖A‖: The norm of a matrix A, defined by ‖A‖^2 = tr(A^τ A).
A^{-1}: The inverse of a matrix A.
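The matrix norm just defined, ‖A‖^2 = tr(A^τ A), is the Frobenius norm. As a small numerical illustration added here (not part of the book; Python with NumPy is an assumed environment), the identity can be checked directly:

```python
import numpy as np

# Check the identity ||A||^2 = tr(A^T A) for an arbitrary matrix A.
# Illustrative sketch only; NumPy is assumed and is not used by the book itself.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))

norm_sq_trace = np.trace(A.T @ A)                   # tr(A^T A)
norm_sq_frobenius = np.linalg.norm(A, "fro") ** 2   # squared Frobenius norm

assert np.isclose(norm_sq_trace, norm_sq_frobenius)
print(norm_sq_trace, norm_sq_frobenius)
```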
A− : The generalized inverse of a matrix A. A1/2 : The square root of a nonnegative definite matrix A defined by A1/2 A1/2 = A. −1/2 : The inverse of A1/2 . A R(A): The linear space generated by rows of a matrix A. Ik : The k × k identity matrix. Jk : The k-dimensional vector of 1's. ∅: The empty set. (a, b): The open interval from a to b. [a, b]: The closed interval from a to b. (a, b]: The interval from a to b including b but not a. [a, b): The interval from a to b including a but not b. {a, b, c}: The set consisting of the elements a, b, and c. A1 × · · · × Ak : The Cartesian product of sets A1 , ..., Ak , A1 × · · · × Ak = {(a1 , ..., ak ) : a1 ∈ A1 , ..., ak ∈ Ak }. xi  xii  Notation  σ(C): The smallest σ-field that contains C. σ(X): The smallest σ-field with respect to which X is measurable. ν1 × · · · × νk : The product measure of ν1 ,...,νk on σ(F1 × · · · × Fk ), where νi is a measure on Fi , i = 1, ..., k. B: The Borel σ-field on R. B k : The Borel σ-field on Rk . Ac : The complement of a set A. A ∪ B: The union of sets A and B. ∪Ai : The union of sets A1 , A2 , .... A ∩ B: The intersection of sets A and B. ∩Ai : The intersection of sets A1 , A2 , .... IA : The indicator function of a set A. P (A): The probability of a set A.  f dν: The integral of a Borel function f with respect to a measure ν.  f dν: The integral of f on the set A. A f (x)dF (x): The integral of f with respect to the probability measure corresponding to the cumulative distribution function F . λ  ν: The measure λ is dominated by the measure ν, i.e., ν(A) = 0 always implies λ(A) = 0. dλ : The Radon-Nikodym derivative of λ with respect to ν. dν P: A collection of populations (distributions). a.e.: Almost everywhere. a.s.: Almost surely. a.s. P: A statement holds except on the event A with P (A) = 0 for all P ∈ P. δx : The point mass at x ∈ Rk or the distribution degenerated at x ∈ Rk . {an }: A sequence of elements a1 , a2 , .... an → a or limn an = a: {an } converges to a as n increases to ∞. lim supn an : The largest limit point of {an }, lim supn an = inf n supk≥n ak . lim inf n an : The smallest limit point of {an }, lim inf n an = supn inf k≥n ak . →p : Convergence in probability. →d : Convergence in distribution. g  : The derivative of a function g on R. g  : The second-order derivative of a function g on R. g (k) : The kth-order derivative of a function g on R. g(x+): The right limit of a function g at x ∈ R. g(x−): The left limit of a function g at x ∈ R. g+ (x): The positive part of a function g, g+ (x) = max{g(x), 0}.  Notation  xiii  g− (x): The negative part of a function g, g− (x) = max{−g(x), 0}. ∂g/∂x: The partial derivative of a function g on Rk . ∂ 2 g/∂x∂xτ : The second-order partial derivative of a function g on Rk . exp{x}: The exponential function ex . log x or log(x): The inverse of ex , log(ex ) = x. ∞ Γ(t): The gamma function defined as Γ(t) = 0 xt−1 e−x dx, t > 0. F −1 (p): The pth quantile of a cumulative distribution function F on R, F −1 (t) = inf{x : F (x) ≥ t}. E(X) or EX: The expectation of a random variable (vector or matrix) X. Var(X): The variance of a random variable X or the covariance matrix of a random vector X. Cov(X, Y ): The covariance between random variables X and Y . E(X|A): The conditional expectation of X given a σ-field A. E(X|Y ): The conditional expectation of X given Y . P (A|A): The conditional probability of A given a σ-field A. P (A|Y ): The conditional probability of A given Y . X(i) : The ith order statistic of X1 , ..., Xn . 
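Two of the notations above can be connected by a short computation: for the empirical distribution of a sample of size n, the pth quantile F^{-1}(p) = inf{x : F(x) ≥ p} equals the order statistic X_(⌈np⌉) for p ∈ (0, 1]. The sketch below is an illustration added here, not part of the book; Python with NumPy is an assumed environment, and the function name is hypothetical.

```python
import numpy as np

def empirical_quantile(sample, p):
    """F_n^{-1}(p) = inf{x : F_n(x) >= p} for the empirical cdf F_n, with p in (0, 1]."""
    x = np.sort(np.asarray(sample, dtype=float))  # order statistics X_(1) <= ... <= X_(n)
    k = int(np.ceil(x.size * p))                  # smallest k with k/n >= p
    return x[k - 1]                               # X_(k)

sample = [2.0, 5.0, 1.0, 3.0]
print(empirical_quantile(sample, 0.5))  # 2.0: F_n(2) = 0.5 >= 0.5, while F_n(x) < 0.5 for x < 2
print(empirical_quantile(sample, 0.9))  # 5.0
```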
n X̄ or X̄· : The sample mean of X1 , ..., Xn , X̄ = n−1 i=1 Xi . n X̄·j : The average of Xij 's over the index i, X̄·j = n−1 i=1 Xij . n S 2 : The sample variance of X1 , ..., Xn , S 2 = (n − 1)−1 i=1 (Xi − X̄)2 . n Fn : The empirical distribution of X1 , ..., Xn , Fn (t) = n−1 i=1 δXi (t). (θ): The likelihood function. H0 : The null hypothesis in a testing problem. H1 : The alternative hypothesis in a testing problem. L(P, a) or L(θ, a): The loss function in a decision problem. RT (P ) or RT (θ): The risk function of a decision rule T . rT : The Bayes risk of a decision rule T . N (µ, σ 2 ): The one-dimensional normal distribution with mean µ and variance σ 2 . Nk (µ, Σ): The k-dimensional normal distribution with mean vector µ and covariance matrix Σ. Φ(x): The cumulative distribution function of N (0, 1). zα : The (1 − α)th quantile of N (0, 1). χ2r : The chi-square distribution with degrees of freedom r. χ2r,α : The (1 − α)th quantile of the chi-square distribution χ2r . χ2r (δ): The noncentral chi-square distribution with degrees of freedom r and noncentrality parameter δ.  xiv  Notation  tr : The t-distribution with degrees of freedom r. tr,α : The (1 − α)th quantile of the t-distribution tr . tr (δ): The noncentral t-distribution with degrees of freedom r and noncentrality parameter δ. Fa,b : The F-distribution with degrees of freedom a and b. Fa,b,α : The (1 − α)th quantile of the F-distribution Fa,b . Fa,b (δ): The noncentral F-distribution with degrees of freedom a and b and noncentrality parameter δ. : The end of a solution.  Terminology σ-field: A collection F of subsets of a set Ω is a σ-field on Ω if (i) the empty set ∅ ∈ F; (ii) if A ∈ F, then the complement Ac ∈ F; and (iii) if Ai ∈ F, i = 1, 2, ..., then their union ∪Ai ∈ F. σ-finite measure: A measure ν on a σ-field F on Ω is σ-finite if there are A1 , A2 , ... in F such that ∪Ai = Ω and ν(Ai ) < ∞ for all i. Action or decision: Let X be a sample from a population P . An action or decision is a conclusion we make about P based on the observed X. Action space: The set of all possible actions. Admissibility: A decision rule T is admissible under the loss function L(P, ·), where P is the unknown population, if there is no other decision rule T1 that is better than T in the sense that E[L(P, T1 )] ≤ E[L(P, T )] for all P and E[L(P, T1 )] < E[L(P, T )] for some P . Ancillary statistic: A statistic is ancillary if and only if its distribution does not depend on any unknown quantity. Asymptotic bias: Let Tn be an estimator of θ for every n satisfying an (Tn −θ) →d Y with E|Y | < ∞, where {an } is a sequence of positive numbers satisfying limn an = ∞ or limn an = a > 0. An asymptotic bias of Tn is defined to be EY /an . Asymptotic level α test: Let X be a sample of size n from P and T (X) be a test for H0 : P ∈ P0 versus H1 : P ∈ P1 . If limn E[T (X)] ≤ α for any P ∈ P0 , then T (X) has asymptotic level α. Asymptotic mean squared error and variance: Let Tn be an estimator of θ for every n satisfying an (Tn − θ) →d Y with 0 < EY 2 < ∞, where {an } is a sequence of positive numbers satisfying limn an = ∞. The asymptotic mean squared error of Tn is defined to be EY 2 /a2n and the asymptotic variance of Tn is defined to be Var(Y )/a2n . Asymptotic relative efficiency: Let Tn and Tn be estimators of θ. The asymptotic relative efficiency of Tn with respect to Tn is defined to be the asymptotic mean squared error of Tn divided by the asymptotic mean squared error of Tn . 
xv  xvi  Terminology  Asymptotically correct confidence set: Let X be a sample of size n from P and C(X) be a confidence set for θ. If limn P (θ ∈ C(X)) = 1 − α, then C(X) is 1 − α asymptotically correct. Bayes action: Let X be a sample from a population indexed by θ ∈ Θ ⊂ Rk . A Bayes action in a decision problem with action space A and loss function L(θ, a) is the action that minimizes the posterior expected loss E[L(θ, a)] over a ∈ A, where E is the expectation with respect to the posterior distribution of θ given X. Bayes risk: Let X be a sample from a population indexed by θ ∈ Θ ⊂ Rk . The Bayes risk of a decision rule T is the expected risk of T with respect to a prior distribution on Θ. Bayes rule or Bayes estimator: A Bayes rule has the smallest Bayes risk over all decision rules. A Bayes estimator is a Bayes rule in an estimation problem. Borel σ-field B k : The smallest σ-field containing all open subsets of Rk . Borel function: A function f from Ω to Rk is Borel with respect to a σ-field F on Ω if and only if f −1 (B) ∈ F for any B ∈ Bk . Characteristic The characteristic function of a distribution F on  √function: τ Rk is e −1t x dF (x), t ∈ Rk . Complete (or bounded complete) statistic: Let X be a sample from a population P . A statistic T (X) is complete (or bounded complete) for P if and only if, for any Borel (or bounded Borel) f , E[f (T )] = 0 for all P implies f = 0 except for a set A with P (X ∈ A) = 0 for all P. Conditional expectation E(X|A): Let X be an integrable random variable on a probability space (Ω, F, P ) and A be a σ-field contained in F. The conditional expectation of X given A, denoted by E(X|A), is defined to be the a.s.-unique random  variable satisfying  (a) E(X|A) is Borel with respect to A and (b) A E(X|A)dP = A XdP for any A ∈ A. Conditional expectation E(X|Y ): The conditional expectation of X given Y , denoted by E(X|Y ), is defined as E(X|Y ) = E(X|σ(Y )). Confidence coefficient and confidence set: Let X be a sample from a population P and θ ∈ Rk be an unknown parameter that is a function of P . A confidence set C(X) for θ is a Borel set on Rk depending on X. The confidence coefficient of a confidence set C(X) is inf P P (θ ∈ C(X)). A confidence set is said to be a 1 − α confidence set for θ if its confidence coefficient is 1 − α. Confidence interval: A confidence interval is a confidence set that is an interval.  Terminology  xvii  Consistent estimator: Let X be a sample of size n from P . An estimator T (X) of θ is consistent if and only if T (X) →p θ for any P as n → ∞. T (X) is strongly consistent if and only if limn T (X) = θ a.s. for any P . T (X) is consistent in mean squared error if and only if limn E[T (X) − θ]2 = 0 for any P . Consistent test: Let X be a sample of size n from P . A test T (X) for testing H0 : P ∈ P0 versus H1 : P ∈ P1 is consistent if and only if limn E[T (X)] = 1 for any P ∈ P1 . Decision rule (nonrandomized): Let X be a sample from a population P . A (nonrandomized) decision rule is a measurable function from the range of X to the action space. Discrete probability density: A probability density with respect to the counting measure on the set of nonnegative integers. Distribution and cumulative distribution function: The probability measure corresponding to a random vector is called its distribution (or law). The cumulative distribution function of a distribution or probability measure P on B k is F (x1 , ..., xk ) = P ((−∞, x1 ]×· · ·×(−∞, xk ]), xi ∈ R. 
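As a hedged numerical illustration of the cumulative distribution function just defined (added here, not from the book), the probability F(x_1, x_2) = P((−∞, x_1] × (−∞, x_2]) can be approximated by an empirical frequency. The choice of independent N(0, 1) coordinates, for which F(x_1, x_2) = Φ(x_1)Φ(x_2), is an assumption made only for this example; NumPy and SciPy are assumed dependencies.

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo check of F(x1, x2) = P(X1 <= x1, X2 <= x2) for a random vector
# with independent N(0, 1) coordinates, where F(x1, x2) = Phi(x1) * Phi(x2).
rng = np.random.default_rng(0)
x1, x2 = 0.3, -0.5
X = rng.standard_normal(size=(200_000, 2))

empirical = np.mean((X[:, 0] <= x1) & (X[:, 1] <= x2))
exact = norm.cdf(x1) * norm.cdf(x2)  # product form holds only under independence
print(empirical, exact)              # the two values agree to roughly two decimals
```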
Empirical Bayes rule: An empirical Bayes rule is a Bayes rule with parameters in the prior estimated using data. Empirical distribution: The empirical distribution based on a random sample (X1 , ..., Xn ) is the distribution putting mass n−1 at each Xi , i = 1, ..., n. Estimability: A parameter θ is estimable if and only if there exists an unbiased estimator of θ. Estimator: Let X be a sample from a population P and θ ∈ Rk be a function of P . An estimator of θ is a measurable function of X. Exponential family: A family of probability densities {fθ : θ ∈ Θ} (with respect to a common σ-finite measure ν), Rk , is an expo Θ ⊂ τ nential family if and only if fθ (x) = exp [η(θ)] T (x) − ξ(θ) h(x), where T is a random p-vector with a fixed positive integer p, η is a function from Θ to Rp , h is a nonnegative Borel function, and  ξ(θ) = log exp{[η(θ)]τ T (x)}h(x)dν . Generalized Bayes rule: A generalized Bayes rule is a Bayes rule when the prior distribution is improper. Improper or proper prior: A prior is improper if it is a measure but not a probability measure. A prior is proper if it is a probability measure. Independence: Let (Ω, F, P ) be a probability space. Events in C ⊂ F are independent if and only if for any positive integer n and distinct events A1 ,...,An in C, P (A1 ∩ A2 ∩ · · · ∩ An ) = P (A1 )P (A2 ) · · · P (An ). Collections Ci ⊂ F, i ∈ I (an index set that can be uncountable),  xviii  Terminology  are independent if and only if events in any collection of the form {Ai ∈ Ci : i ∈ I} are independent. Random elements Xi , i ∈ I, are independent if and only if σ(Xi ), i ∈ I, are independent. Integration or integral: Let ν be a measure on a σ-field F on a set Ω. The integral of a nonnegative simple function (i.e., a function of k the form ϕ(ω) = i=1 ai IAi (ω), where ω ∈ Ω, k is a positive integer, A1 , ..., Ak are in F, and a1 , ..., ak are nonnegative numbers)  k is defined as ϕdν = i=1ai ν(Ai ). The integral of a nonnegative  Borel function is defined as f dν = supϕ∈Sf ϕdν, where Sf is the collection of all nonnegative simple functions that are bounded by f . For a Borel function f , its integral exists if and only if at least one of max{f, 0}dν and  max{−f, 0}dν is finite, in which case   f dν = max{f, 0}dν − max{−f, 0}dν. f is integrable if and   only if both max{f, 0}dν and max{−f, 0}dν are finite. When ν is a probability measure corresponding   to the cumulative distribution k function F on R , we write f dν = f (x)dF (x). For any event A,   f dν is defined as I f dν. A A Invariant decision rule: Let X be a sample from P ∈ P and G be a group of one-to-one transformations of X (gi ∈ G implies g1◦ g2 ∈ G and gi−1 ∈ G). P is invariant under G if and only if ḡ(PX ) = Pg(X) is a one-to-one transformation from P onto P for each g ∈ G. A decision problem is invariant if and only if P is invariant under G and the loss L(P, a) is invariant in the sense that, for every g ∈ G and every a ∈ A (the collection of all possible  actions), there exists a unique ḡ(a) ∈ A such that L(PX , a) = L Pg(X) , ḡ(a) . A decision rule T (x) in an invariant decision problem is invariant if and only if, for every g ∈ G and every x in the range of X, T (g(x)) = ḡ(T (x)). Invariant estimator: An invariant estimator is an invariant decision rule in an estimation problem. LR (Likelihood ratio) test: Let (θ) be the likelihood function based on a sample X whose distribution is Pθ , θ ∈ Θ ⊂ Rp for some positive integer p. 
For testing H0 : θ ∈ Θ0 ⊂ Θ versus H1 : θ ∈ Θ0 , an LR test is any test that rejects H0 if and only if λ(X) < c, where c ∈ [0, 1] and λ(X) = supθ∈Θ0 (θ)/ supθ∈Θ (θ) is the likelihood ratio. LSE: The least squares estimator. Level α test: A test is of level α if its size is at most α. Level 1 − α confidence set or interval: A confidence set or interval is said to be of level 1 − α if its confidence coefficient is at least 1 − α. Likelihood function and likelihood equation: Let X be a sample from a population P indexed by an unknown parameter vector θ ∈ Rk . The joint probability density of X treated as a function of θ is called the likelihood function and denoted by (θ). The likelihood equation is ∂ log (θ)/∂θ = 0.  Terminology  xix  Location family: A family of Lebesgue densities on R, {fµ : µ ∈ R}, is a location family with location parameter µ if and only if fµ (x) = f (x − µ), where f is a known Lebesgue density. Location invariant estimator. Let (X1 , ..., Xn ) be a random sample from a population in a location family. An estimator T (X1 , ..., Xn ) of the location parameter is location invariant if and only if T (X1 + c, ..., Xn + c) = T (X1 , ..., Xn ) + c for any Xi 's and c ∈ R. Location-scale family: A family of Lebesgue densities on R, {fµ,σ : µ ∈ R, σ > 0}, is a location-scale family with location µ and  parameter  scale parameter σ if and only if fµ,σ (x) = σ1 f x−µ , where f is a σ known Lebesgue density. Location-scale invariant estimator. Let (X1 , ..., Xn ) be a random sample from a population in a location-scale family with location parameter µ and scale parameter σ. An estimator T (X1 , ..., Xn ) of the location parameter µ is location-scale invariant if and only if T (rX1 + c, ..., rXn + c) = rT (X1 , ..., Xn ) + c for any Xi 's, c ∈ R, and r > 0. An estimator S(X1 , ..., Xn ) of σ h with a fixed h = 0 is locationscale invariant if and only if S(rX1 + c, ..., rXn + c) = rh T (X1 , ..., Xn ) for any Xi 's and r > 0. Loss function: Let X be a sample from a population P ∈ P and A be the set of all possible actions we may take after we observe X. A loss function L(P, a) is a nonnegative Borel function on P × A such that if a is our action and P is the true population, our loss is L(P, a). MRIE (minimum risk invariant estimator): The MRIE of an unknown parameter θ is the estimator has the minimum risk within the class of invariant estimators. MLE (maximum likelihood estimator): Let X be a sample from a population P indexed by an unknown parameter vector θ ∈ Θ ⊂ Rk and (θ) be the likelihood function. A θ̂ ∈ Θ satisfying (θ̂) = maxθ∈Θ (θ) is called an MLE of θ (Θ may be replaced by its closure in the above definition). Measure: A set function ν defined on a σ-field F on Ω is a measure if (i) 0 ≤ ν(A) ≤ ∞ for any A ∈ F; (ii) ν(∅) = 0; and (iii) ν (∪∞ i=1 Ai ) =  ∞ i=1 ν(Ai ) for disjoint Ai ∈ F, i = 1, 2, .... Measurable function: a function from a set Ω to a set Λ (with a given σfield G) is measurable with respect to a σ-field F on Ω if f −1 (B) ∈ F for any B ∈ G. Minimax rule: Let X be a sample from a population P and RT (P ) be the risk of a decision rule T . A minimax rule is the rule minimizes supP RT (P ) over all possible T . Moment generating function:  τ The moment generating function of a distribution F on Rk is et x dF (x), t ∈ Rk , if it is finite.  
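The moment generating function defined above can be checked numerically in a simple case. The sketch below (an illustration added here, not part of the book; NumPy is assumed) compares a Monte Carlo estimate of E e^{tX} for X ~ N(0, 1) with the closed form e^{t^2/2} listed later under "Some Distributions".

```python
import numpy as np

# Monte Carlo estimate of the moment generating function E[exp(t X)] for
# X ~ N(0, 1), compared with the closed form exp(t^2 / 2).
rng = np.random.default_rng(0)
X = rng.standard_normal(500_000)

for t in (0.5, 1.0):
    mc_estimate = np.mean(np.exp(t * X))
    closed_form = np.exp(t ** 2 / 2)
    print(t, mc_estimate, closed_form)  # estimates match exp(t^2/2) to a few decimals
```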
xx  Terminology  Monotone likelihood ratio: The family of densities {fθ : θ ∈ Θ} with Θ ⊂ R is said to have monotone likelihood ratio in Y (x) if, for any θ1 < θ2 , θi ∈ Θ, fθ2 (x)/fθ1 (x) is a nondecreasing function of Y (x) for values x at which at least one of fθ1 (x) and fθ2 (x) is positive. Optimal rule: An optimal rule (within a class of rules) is the rule has the smallest risk over all possible populations. Pivotal quantity: A known Borel function R of (X, θ) is called a pivotal quantity if and only if the distribution of R(X, θ) does not depend on any unknown quantity. Population: The distribution (or probability measure) of an observation from a random experiment is called the population. Power of a test: The power of a test T is the expected value of T with respect to the true population. Prior and posterior distribution: Let X be a sample from a population indexed by θ ∈ Θ ⊂ Rk . A distribution defined on Θ that does not depend on X is called a prior. When the population of X is considered as the conditional distribution of X given θ and the prior is considered as the distribution of θ, the conditional distribution of θ given X is called the posterior distribution of θ. Probability and probability space: A measure P defined on a σ-field F on a set Ω is called a probability if and only if P (Ω) = 1. The triple (Ω, F, P ) is called a probability space. Probability density: Let (Ω, F, P ) be a probability space and ν be a σfinite measure on F. If P  ν, then the Radon-Nikodym derivative of P with respect to ν is the probability density with respect to ν (and is called Lebesgue density if ν is the Lebesgue measure on Rk ). Random sample: A sample X = (X1 , ..., Xn ), where each Xj is a random d-vector with a fixed positive integer d, is called a random sample of size n from a population or distribution P if X1 , ..., Xn are independent and identically distributed as P . Randomized decision rule: Let X be a sample with range X , A be the action space, and FA be a σ-field on A. A randomized decision rule is a function δ(x, C) on X × FA such that, for every C ∈ FA , δ(X, C) is a Borel function and, for every X ∈ X , δ(X, C) is a probability measure on FA . A nonrandomized decision rule T can be viewed as a degenerate randomized decision rule δ, i.e., δ(X, {a}) = I{a} (T (X)) for any a ∈ A and X ∈ X . Risk: The risk of a decision rule is the expectation (with respect to the true population) of the loss of the decision rule. Sample: The observation from a population treated as a random element is called a sample.  Terminology  xxi  Scale family: A family of Lebesgue densities on R, {fσ : σ > 0}, is a scale family with scale parameter σ if and only if fσ (x) = σ1 f (x/σ), where f is a known Lebesgue density. Scale invariant estimator. Let (X1 , ..., Xn ) be a random sample from a population in a scale family with scale parameter σ. An estimator S(X1 , ..., Xn ) of σ h with a fixed h = 0 is scale invariant if and only if S(rX1 , ..., rXn ) = rh T (X1 , ..., Xn ) for any Xi 's and r > 0. Simultaneous confidence intervals: Let θt ∈ R, t ∈ T . Confidence intervals Ct (X), t ∈ T , are 1−α simultaneous confidence intervals for θt , t ∈ T , if P (θt ∈ Ct (X), t ∈ T ) = 1 − α. Statistic: Let X be a sample from a population P . A known Borel function of X is called a statistic. Sufficiency and minimal sufficiency: Let X be a sample from a population P . A statistic T (X) is sufficient for P if and only if the conditional distribution of X given T does not depend on P . 
A sufficient statistic T is minimal sufficient if and only if, for any other statistic S sufficient for P , there is a measurable function ψ such that T = ψ(S) except for a set A with P (X ∈ A) = 0 for all P . Test and its size: Let X be a sample from a population P ∈ P and Pi i = 0, 1, be subsets of P satisfying P0 ∪ P1 = P and P0 ∩ P1 = ∅. A randomized test for hypotheses H0 : P ∈ P0 versus H1 : P ∈ P1 is a Borel function T (X) ∈ [0, 1] such that after X is observed, we reject H0 (conclude P ∈ P1 ) with probability T (X). If T (X) ∈ {0, 1}, then T is nonrandomized. The size of a test T is supP ∈P0 E[T (X)], where E is the expectation with respect to P . UMA (uniformly most accurate) confidence set: Let θ ∈ Θ be an unknown parameter and Θ be a subset of Θ that does not contain the true value of θ. A confidence set C(X) for θ with confidence coefficient 1 − α is Θ -UMA if and only if for any other confidence set C1 (X)   with significance level 1 − α, P θ ∈ C(X) ≤ P θ ∈ C1 (X) for all θ ∈ Θ . UMAU (uniformly most accurate unbiased) confidence set: Let θ ∈ Θ be an unknown parameter and Θ be a subset of Θ that does not contain the true value of θ. A confidence set C(X) for θ with confidence coefficient 1 − α is Θ -UMAU if and only if C(X) is unbiased and for any  other unbiased  confidence set  C1 (X) with significance level 1 − α, P θ ∈ C(X) ≤ P θ ∈ C1 (X) for all θ ∈ Θ . UMP (uniformly most powerful) test: A test of size α is UMP for testing H0 : P ∈ P0 versus H1 : P ∈ P1 if and only if, at each P ∈ P1 , the power of T is no smaller than the power of any other level α test. UMPU (uniformly most powerful unbiased) test: An unbiased test of size α is UMPU for testing H0 : P ∈ P0 versus H1 : P ∈ P1 if and only  xxii  Terminology  if, at each P ∈ P1 , the power of T is no larger than the power of any other level α unbiased test. UMVUE (uniformly minimum variance estimator): An estimator is a UMVUE if it has the minimum variance within the class of unbiased estimators. Unbiased confidence set: A level 1 − α confidence set C(X) is said to be unbiased if and only if P (θ ∈ C(X)) ≤ 1 − α for any P and all θ = θ. Unbiased estimator: Let X be a sample from a population P and θ ∈ Rk be a function of P . If an estimator T (X) of θ satisfies E[T (X)] = θ for any P , where E is the expectation with respect to P , then T (X) is an unbiased estimator of θ. Unbiased test: A test for hypotheses H0 : P ∈ P0 versus H1 : P ∈ P1 is unbiased if its size is no larger than its power at any P ∈ P1 .  Some Distributions 1. Discrete uniform distribution on the set {a1 , ..., am }: The probability density (with respect to the counting measure) of this distribution is  −1 x = ai , i = 1, ..., m m f (x) = 0 otherwise, where ai ∈ R, i = 1, ..., m, and m is a positive integer. The expecm tation of this distribution is ā = j=1 aj /m and the variance of this m distribution is j=1 (aj − ā)2 /m. The moment generating function of m this distribution is j=1 eaj t /m, t ∈ R. 2. The binomial distribution with size n and probability p: The probability density (with respect to the counting measure) of this distribution is   n x n−x x = 0, 1, ..., n x p (1 − p) f (x) = 0 otherwise, where n is a positive integer and p ∈ [0, 1]. The expectation and variance of this distributions are np and np(1 − p), respectively. The moment generating function of this distribution is (pet + 1 − p)n , t ∈ R. 3. 
The Poisson distribution with mean θ: The probability density (with respect to the counting measure) of this distribution is  x −θ θ e x = 0, 1, 2, ... x! f (x) 0 otherwise, where θ > 0 is the expectation of this distribution. The variance of this distribution is θ. The moment generating function of this t distribution is eθ(e −1) , t ∈ R. 4. The geometric with mean p−1 : The probability density (with respect to the counting measure) of this distribution is  x = 1, 2, ... (1 − p)x−1 p f (x) = 0 otherwise, xxiii  xxiv  Some Distributions where p ∈ [0, 1]. The expectation and variance of this distribution are p−1 and (1 − p)/p2 , respectively. The moment generating function of this distribution is pet /[1 − (1 − p)et ], t < − log(1 − p).  5. Hypergeometric distribution: The probability density (with respect to the counting measure) of this distribution is ⎧ n m ⎨ (x)(r−x) x = 0, 1, ..., min{r, n}, r − x ≤ m (Nr ) f (x) = ⎩ 0 otherwise, where r, n, and m are positive integers, and N = n + m. The expectation and variance of this distribution are equal to rn/N and rnm(N − r)/[N 2 (N − 1)], respectively. 6. Negative binomial with size r and probability p: The probability density (with respect to the counting measure) of this distribution is    x−1 r x−r x = r, r + 1, ... r−1 p (1 − p) f (x) = 0 otherwise, where p ∈ [0, 1] and r is a positive integer. The expectation and variance of this distribution are r/p and r(1−p)/p2 , respectively. The moment generating function of this distribution is equal to pr ert /[1 − (1 − p)et ]r , t < − log(1 − p). 7. Log-distribution with probability p: The probability density (with respect to the counting measure) of this distribution is  f (x) =  −(log p)−1 x−1 (1 − p)x 0  x = 1, 2, ... otherwise,  where p ∈ (0, 1). The expectation and variance of this distribution are −(1−p)/(p log p) and −(1−p)[1+(1−p)/ log p]/(p2 log p), respectively. The moment generating function of this distribution is equal to log[1 − (1 − p)et ]/ log p, t ∈ R. 8. Uniform distribution on the interval (a, b): The Lebesgue density of this distribution is 1 f (x) = I(a,b) (x), b−a where a and b are real numbers with a < b. The expectation and variance of this distribution are (a + b)/2 and (b − a)2 /12, respectively. The moment generating function of this distribution is equal to (ebt − eat )/[(b − a)t], t ∈ R.  Some Distributions  xxv  9. Normal distribution N (µ, σ 2 ): The Lebesgue density of this distribution is 2 2 1 f (x) = √ e−(x−µ) /2σ , 2πσ where µ ∈ R and σ 2 > 0. The expectation and variance of N (µ, σ 2 ) are µ and σ 2 , respectively. The moment generating function of this 2 2 distribution is eµt+σ t /2 , t ∈ R. 10. Exponential distribution on the interval (a, ∞) with scale parameter θ: The Lebesgue density of this distribution is f (x) =  1 −(x−a)/θ e I(a,∞) (x), θ  where a ∈ R and θ > 0. The expectation and variance of this distribution are θ+a and θ2 , respectively. The moment generating function of this distribution is eat (1 − θt)−1 , t < θ−1 . 11. Gamma distribution with shape parameter α and scale parameter γ: The Lebesgue density of this distribution is f (x) =  1 xα−1 e−x/γ I(0,∞) (x), Γ(α)γ α  where α > 0 and γ > 0. The expectation and variance of this distribution are αγ and αγ 2 , respectively. The moment generating function of this distribution is (1 − γt)−α , t < γ −1 . 12. Beta distribution with parameter (α, β): The Lebesgue density of this distribution is f (x) =  Γ(α + β) α−1 (1 − x)β−1 I(0,1) (x), x Γ(α)Γ(β)  where α > 0 and β > 0. 
The expectation and variance of this distribution are α/(α + β) and αβ/[(α + β + 1)(α + β)2 ], respectively. 13. Cauchy distribution with location parameter µ and scale parameter σ: The Lebesgue density of this distribution is f (x) =  σ , π[σ 2 + (x − µ)2 ]  where µ ∈ R and σ > 0. The expectation and variance of this distribution do not exist. The characteristic function of this distribution √ is e −1µt−σ|t| , t ∈ R.  xxvi  Some Distributions  14. Log-normal distribution with parameter (µ, σ 2 ): The Lebesgue density of this distribution is f (x) = √  2 2 1 e−(log x−µ) /2σ I(0,∞) (x), 2πσx  where µ ∈ R and σ 2 > 0. The expectation and variance of this 2 2 2 distribution are eµ+σ /2 and e2µ+σ (eσ − 1), respectively. 15. Weibull distribution with shape parameter α and scale parameter θ: The Lebesgue density of this distribution is f (x) =  α α−1 −xα /θ x e I(0,∞) (x), θ  where α > 0 and θ > 0. The expectation and variance of this distribution are θ1/α Γ(α−1 + 1) and θ2/α {Γ(2α−1 + 1) − [Γ(α−1 + 1)]2 }, respectively. 16. Double exponential distribution with location parameter µ and scale parameter θ: The Lebesgue density of this distribution is f (x) =  1 −|x−µ|/θ e , 2θ  where µ ∈ R and θ > 0. The expectation and variance of this distribution are µ and 2θ2 , respectively. The moment generating function of this distribution is eµt /(1 − θ2 t2 ), |t| < θ−1 . 17. Pareto distribution: The Lebesgue density of this distribution is f (x) = θaθ x−(θ+1) I(a,∞) (x), where a > 0 and θ > 0. The expectation this distribution is θa/(θ−1) when θ > 1 and does not exist when θ ≤ 1. The variance of this distribution is θa2 /[(θ − 1)2 (θ − 2)] when θ > 2 and does not exist when θ ≤ 2. 18. Logistic distribution with location parameter µ and scale parameter σ: The Lebesgue density of this distribution is f (x) =  e−(x−µ)/σ , σ[1 + e−(x−µ)/σ ]2  where µ ∈ R and σ > 0. The expectation and variance of this distribution are µ and σ 2 π 2 /3, respectively. The moment generating function of this distribution is eµt Γ(1 + σt)Γ(1 − σt), |t| < σ −1 .  Some Distributions  xxvii  19. Chi-square distribution χ2k : The Lebesgue density of this distribution is 1 xk/2−1 e−x/2 I(0,∞) (x), f (x) = Γ(k/2)2k/2 where k is a positive integer. The expectation and variance of this distribution are k and 2k, respectively. The moment generating function of this distribution is (1 − 2t)−k/2 , t < 1/2. 20. Noncentral chi-square distribution χ2k (δ): This distribution is defined as the distribution of X12 +· · ·+Xk2 , where X1 , ..., Xk are independent and identically distributed as N (µi , 1), k is a positive integer, and δ = µ21 + · · · + µ2k ≥ 0. δ is called the noncentrality parameter. The Lebesgue density of this distribution is f (x) = e−δ/2  ∞ j=0  (δ/2)j f2j+n (x), j!  where fk (x) is the Lebesgue density of the chi-square distribution χ2k . The expectation and variance of this distribution are k + δ and 2k + 4δ, √ respectively.√ The characteristic function of this distribution √ is (1 − 2 −1t)−k/2 e −1δt/(1−2 −1t) . 21. t-distribution tn : The Lebesgue density of this distribution is Γ( n+1 ) f (x) = √ 2 n nπΓ( 2 )  x2 1+ n  −(n+1)/2 ,  where n is a positive integer. The expectation of tn is 0 when n > 1 and does not exist when n = 1. The variance of tn is n/(n − 2) when n > 2 and does not exist when n ≤ 2. 22. 
Noncentral t-distribution tn (δ): This distribution is defined as the  distribution of X/ Y /n, where X is distributed as N (δ, 1), Y is distributed as χ2n , X and Y are independent, n is a positive integer, and δ ∈ R is called the noncentrality parameter. The Lebesgue density of this distribution is  ∞ √ 2 1 √ f (x) = (n+1)/2 n y (n−1)/2 e−[(x y/n−δ) +y]/2 dy. 2 Γ( 2 ) πn 0  n The expectation of tn (δ) is δΓ( n−1 2 ) n/2/Γ( 2 ) when n > 1 and does not exist when n = 1. The variance of tn (δ) is [n(1 + δ 2 )/(n − 2)] − n 2 2 [Γ( n−1 2 )/Γ( 2 )] δ n/2 when n > 2 and does not exist when n ≤ 2.  xxviii  Some Distributions  23. F-distribution Fn,m : The Lebesgue density of this distribution is f (x) =  n/2−1 nn/2 mm/2 Γ( n+m 2 )x I(0,∞) (x), n m (n+m)/2 Γ( 2 )Γ( 2 )(m + nx)  where n and m are positive integers. The expectation of Fn,m is m/(m − 2) when m > 2 and does not exist when m ≤ 2. The variance of Fn,m is 2m2 (n + m − 2)/[n(m − 2)2 (m − 4)] when m > 4 and does not exist when m ≤ 4. 24. Noncentral F-distribution Fn,m (δ): This distribution is defined as the distribution of (X/n)/(Y /m), where X is distributed as χ2n (δ), Y is distributed as χ2m , X and Y are independent, n and m are positive integers, and δ ≥ 0 is called the noncentrality parameter. The Lebesgue density of this distribution is  ∞ n1 x n1 (δ/2)j −δ/2 f2j+n1 ,n2 , f (x) = e j!(2j + n1 ) 2j + n1 j=0 where fk1 ,k2 (x) is the Lebesgue density of Fk1 ,k2 . The expectation of Fn,m (δ) is m(n + δ)/[n(m − 2)] when m > 2 and does not exist when m ≤ 2. The variance of Fn,m (δ) is 2m2 [(n + δ)2 + (m − 2)(n + 2δ)]/[n2 (m − 2)2 (m − 4)] when m > 4 and does not exist when m ≤ 4. 25. Multinomial distribution with size n and probability vector (p1 ,...,pk ): The probability density (with respect to the counting measure on Rk ) is n! f (x1 , ..., xk ) = px1 · · · pxkk IB (x1 , ..., xk ), x1 ! · · · xk ! 1 k where B = {(x1 , ..., xk ) : xi 's are nonnegative integers, i=1 xi = n}, k n is a positive integer, pi ∈ [0, 1], i = 1, ..., k, and i=1 pi = 1. The mean-vector (expectation) of this distribution is (np1 , ..., npk ). The variance-covariance matrix of this distribution is the k × k matrix whose ith diagonal element is npi and (i, j)th off-diagonal element is −npi pj . 26. Multivariate normal distribution Nk (µ, Σ): The Lebesgue density of this distribution is τ −1 1 e−(x−µ) Σ (x−µ)/2 , x ∈ Rk , f (x) = (2π)k/2 [Det(Σ)]1/2 where µ ∈ Rk and Σ is a positive definite k × k matrix. The meanvector (expectation) of this distribution is µ. The variance-covariance matrix of this distribution is Σ. The moment generating function of τ τ Nk (µ, Σ) is et µ+t Σt/2 , t ∈ Rk .  Chapter 1  Probability Theory Exercise 1. Let Ω be a set, F be σ-field on Ω, and C ∈ F. Show that FC = {C ∩ A : A ∈ F} is a σ-field on C. Solution. This exercise, similar to many other problems, can be solved by directly verifying the three properties in the definition of a σ-field. (i) The empty subset of C is C ∩ ∅. Since F is a σ-field, ∅ ∈ F. Then, C ∩ ∅ ∈ FC . (ii) If B ∈ FC , then B = C ∩ A for some A ∈ F. Since F is a σ-field, Ac ∈ F. Then the complement of B in C is C ∩ Ac ∈ FC . (iii) If Bi ∈ FC , i = 1, 2, ..., then Bi = C ∪ Ai for some Ai ∈ F, i = 1, 2, .... Since F is a σ-field, ∪Ai ∈ F. Therefore, ∪Bi = ∪(C ∩ Ai ) = C ∩ (∪Ai ) ∈ FC . Exercise 2 (#1.12)† . Let ν and λ be two measures on a σ-field F on Ω such that ν(A) = λ(A) for any A ∈ C, where C ⊂ F is a collection having the property that if A and B are in C, then so is A ∩ B. 
Assume that there are Ai ∈ C, i = 1, 2, ..., such that ∪Ai = Ω and ν(Ai ) < ∞ for all i. Show that ν(A) = λ(A) for any A ∈ σ(C), where σ(C) is the smallest σ-field containing C. Note. Solving this problem requires knowing properties of measures (Shao, 2003, §1.1.1). The technique used in solving this exercise is called the "good sets principle". All sets in C have property A and we want to show that all sets in σ(C) also have property A. Let G be the collection of all sets having property A (good sets). Then, all we need to show is that G is a σ-field. Solution. Define G = {A ∈ F : ν(A) = λ(A)}. Since C ⊂ G, σ(C) ⊂ G if G is a σ-field. Hence, the result follows if we can show that G is a σ-field. (i) Since both ν and λ are measures, 0 = ν(∅) = λ(∅) and, thus, the empty set ∅ ∈ G. † The  number in parentheses is the exercise number in Mathematical Statistics (Shao, 2003). The first digit is the chapter number.  1  2  Chapter 1. Probability Theory  (ii) For any B ∈ F, by the inclusion and exclusion formula,  n   ν Ai ∩ B = ν(Ai ∩ B) − ν(Ai ∩ Aj ∩ B) + · · · i=1  1≤i≤n  1≤i<j≤n  for any positive integer n, where Ai 's are the sets given in the description of this exercise. The same result also holds for λ. Since Aj 's are in C, Ai ∩ Aj ∩ · · · ∩ Ak ∈ C and, if B ∈ G, ν(Ai ∩ Aj ∩ · · · ∩ Ak ∩ B) = λ(Ai ∩ Aj ∩ · · · ∩ Ak ∩ B) < ∞. Consequently, ν(Ai ∩ Aj ∩ · · · ∩ Ak ∩ B c ) = λ(Ai ∩ Aj ∩ · · · ∩ Ak ∩ B c ) < ∞. By the inclusion and exclusion formula again, we obtain that   n   n   c c =λ Ai ∩ B Ai ∩ B ν i=1  i=1  for any n. From the continuity property of measures (Proposition 1.1(iii) in Shao, 2003), we conclude that ν(B c ) = λ(B c ) by letting n → ∞ in the previous expression. Thus, B c ∈ G whenever B ∈ G. (iii) Suppose that Bi ∈ G, i = 1, 2, .... Note that ν(B1 ∪ B2 ) = ν(B1 ) + ν(B1c ∩ B2 ) = λ(B1 ) + λ(B1c ∩ B2 ) = λ(B1 ∪ B2 ), since B1c ∩ B2 ∈ G. Thus, B1 ∪ B2 ∈ G. This shows that for any n, ∪ni=1 Bi ∈ G. By the continuity property of measures, ∞    ∞   n  n     ν Bi = lim ν Bi = lim λ Bi = λ Bi . i=1  n→∞  i=1  n→∞  i=1  i=1  Hence, ∪Bi ∈ G. Exercise 3 (#1.14). Show that a real-valued function f on a set Ω is Borel with respect to a σ-field F on Ω if and only if f −1 (a, ∞) ∈ F for all a ∈ R. Note. Good sets principle is used in this solution. Solution. The only if part follows directly from the definition of a Borel function. Suppose that f −1 (a, ∞) ∈ F for all a ∈ R. Let G = {C ⊂ R : f −1 (C) ∈ F}. Note that (i) ∅ ∈ G; (ii) if C ∈ G, then f −1 (C c ) = (f −1 (C))c ∈ F, i.e., C c ∈ G; and (iii) if Ci ∈ G, i = 1, 2, ..., then f −1 (∪Ci ) = ∪f −1 (Ci ) ∈ F,  Chapter 1. Probability Theory  3  i.e., ∪Ci ∈ G. This shows that G is a σ-field. Thus B ⊂ G, i.e., f −1 (B) ∈ F for any B ∈ B and, hence, f is Borel. Exercise 4 (#1.14). Let f and g be real-valued functions on Ω. Show that if f and g are Borel with respect to a σ-field F on Ω, then so are f g, f /g (when g = 0), and af + bg, where a and b are real numbers. Solution. Suppose that f and g are Borel. Consider af + bg with a > 0 and b > 0. Let Q be the set of all rational numbers on R. For any c ∈ R,  {af + bg > c} = {f > (c − t)/a} ∩ {g > t/b}. t∈Q  Since f and g are Borel, {af + bg > c} ∈ F. By Exercise 3, af + bg is Borel. Similar results can be obtained for the case of a > 0 and b < 0, a < 0 and b > 0, or a < 0 and b < 0. From the above result, f + g and f − g are Borel if f and g are Borel. Note that for any c > 0, √ √ {(f + g)2 > c} = {f + g > c} ∪ {f + g < − c}. Hence, (f + g)2 is Borel. Similarly, (f − g)2 is Borel. 
Then f g = [(f + g)2 − (f − g)2 ]/4 is Borel. Since any constant function is Borel, this shows that af is Borel if f is Borel and a is a constant. Thus, af + bg is Borel even when one of a and b is 0. Assume g = 0. For any c, ⎧ c>0 ⎨ {0 < g < 1/c} {1/g > c} = {g > 0} c=0 ⎩ {g > 0} ∪ {1/c < g < 0} c < 0. Hence 1/g is Borel if g is Borel and g = 0. Then f /g is Borel if both f and g are Borel and g = 0. Exercise 5 (#1.14). Let fi , i = 1, 2, ..., be Borel functions on Ω with respect to a σ-field F. Show that supn fn , inf n fn , lim supn fn , and lim inf n fn are Borel with respect to F. Also, show that the set   A = ω ∈ Ω : lim fn (ω) exists n  is in F and the function h(ω) =    limn fn (ω) f1 (ω)  ω∈A ω∈A  4  Chapter 1. Probability Theory  is Borel with respect to F. Solution. For any c ∈ R, {supn fn > c} = ∪n {fn > c}. By Exercise 3, supn fn is Borel. By Exercise 4, inf n fn = − supn (−fn ) is Borel. Then lim supn fn = inf n supk≥n fk is Borel and lim inf n fn = − lim supn (−fn ) is Borel. Consequently, A = {lim supn fn − lim inf n fn = 0} ∈ F. The function h is equal to IA lim supn fn + IAc f1 , where IA is the indicator function of the set A. Since A ∈ F, IA is Borel. Thus, h is Borel. Exercise 6. Let f be a Borel function on R2 . Define a function g from R to R as g(x) = f (x, y0 ), where y0 is a fixed point in R. Show that g is Borel. Is it true that f is Borel from R2 to R if f (x, y) with any fixed y or fixed x is Borel from R to R? Solution. For a fixed y0 , define G = {C ⊂ R2 : {x : (x, y0 ) ∈ C} ∈ B}. Then, (i) ∅ ∈ G; (ii) if C ∈ G, {x : (x, y0 ) ∈ C c } = {x : (x, y0 ) ∈ C}c ∈ B, i.e., C c ∈ G; (iii) if Ci ∈ G, i = 1, 2, ..., then {x : (x, y0 ) ∈ ∪Ci } = ∪{x : (x, y0 ) ∈ Ci } ∈ B, i.e., ∪Ci ∈ G. Thus, G is a σ-field. Since any open rectangle (a, b) × (c, d) ∈ G, G is a σ-field containing all open rectangles and, thus, G contains B 2 , the Borel σ-field on R2 . Let B ∈ B. Since f is Borel, A = f −1 (B) ∈ B2 . Then A ∈ G and, thus, g −1 (B) = {x : f (x, y0 ) ∈ B} = {x : (x, y0 ) ∈ A} ∈ B. This proves that g is Borel. If f (x, y) with any fixed y or fixed x is Borel from R to R, f is not necessarily to be a Borel function from R2 to R. The following is a counterexample. Let A be a non-Borel subset of R and  f (x, y) =  1 0  x=y∈A otherwise  Then for any fixed y0 , f (x, y0 ) = 0 if y0 ∈ A and f (x, y0 ) = I{y0 } (x) (the indicator function of the set {y0 }) if y0 ∈ A. Hence f (x, y0 ) is Borel. Similarly, f (x0 , y) is Borel for any fixed x0 . We now show that f (x, y) is not Borel. Suppose that it is Borel. Then B = {(x, y) : f (x, y) = 1} ∈ B2 . Define G = {C ⊂ R2 : {x : (x, x) ∈ C} ∈ B}. Using the same argument in the proof of the first part, we can show that G is a σ-field containing B 2 . Hence {x : (x, x) ∈ B} ∈ B. However, by definition {x : (x, x) ∈ B} = A ∈ B. This contradiction proves that f (x, y) is not Borel. Exercise 7 (#1.21). Let Ω = {ωi : i = 1, 2, ...} be a countable set, F be all subsets of Ω, and ν be the counting measure on Ω (i.e., ν(A) = the number of elements in A for any A ⊂ Ω). For any Borel function f , the  Chapter 1. Probability Theory  5  integral of f w.r.t. ν (if it exists) is   ∞  f dν =  f (ωi ). i=1  Note. The definition of integration and properties of integration can be found in Shao (2003, §1.2). This type of exercise is much easier to solve if we first consider nonnegative functions (or simple nonnegative functions) and then general functions by using f+ and f− . See also the next exercise for another example. ∞ Solution. 
First, consider nonnegative f . Then f = i=1 ai I{ωi } , where n ai = f (ωi ) ≥ 0. Since fn = i=1 ai I{ωi } is a nonnegative simple function (a function is simple if it is a linear combination of finitely many indicator functions of sets in F) and fn ≤ f , by definition     n  ai ≤  fn dν =  f dν.  i=1  Letting n → ∞ we obtain that   ∞  f dν ≥  ai . i=1  k Let s = i=1 bi I{ωi } be a nonnegative simple function satisfying s ≤ f . Then 0 ≤ bi ≤ ai and   ∞  k  bi ≤  sdν = i=1  ai . i=1  Hence      sdν : s is simple, 0 ≤ s ≤ f  f dν = sup  ∞  ≤  ai i=1  and, thus,    ∞  f dν =  ai i=1  for nonnegative f . For general f , let f+ = max{f, 0} and f− = max{−f, 0}. Then     ∞  f+ dν =  f+ (ωi ) i=1  and  ∞  f− dν =  f− (ωi ). i=1  6  Chapter 1. Probability Theory  Then the result follows from    f dν = f+ dν − f− dν if at least one of    f+ dν and    f− dν is finite.  Exercise 8 (#1.22). Let ν be a measure on a σ-field F on Ω and f and g be Borel functions with respect to F. Show that   (i) if f dν exists and a ∈ R, then (af)dν exists  and is equal to a f dν; (ii) if both f dν and gdν existand f dν + gdν is well defined, then (f + g)dν exists and is equal to f dν + gdν.   Note. For integrals   in calculus, properties such as (af )dν = a f dν and (f + g)dν f dν + gdν are obvious. However, the proof of them are complicated for integrals defined on general measure spaces. As shown in this exercise, the proof often has to be broken into several steps: simple functions, nonnegative functions, and then   general functions.  Solution. (i) If a = 0, then (af )dν = 0dν = 0 = a f dν. Suppose that a > 0 and f ≥ 0. By definition, there exists of   a sequence nonnegative simple functions s such that s ≤ f and lim dν = f dν. s n n n n    Then asn ≤ af and limn asn dν = a limn sn dν = a f dν. This shows  −1 −1 (af )dν ≥ a f dν. Let  b = a and considerthe function h = b f . From −1 what we have shown, f dν = (bh)dν ≥ b hdν = a (af )dν. Hence  (af )dν = a f dν. For a > 0 and general f , the result follows by considering af = af+ − af− . For a < 0, the result follows by considering af = |a|f− − |a|f+ . (ii) Consider the case where f ≥ 0 and g ≥ 0. If both f and g are simple functions, the result is obvious. Let  sn , tn , and rn be simple   functions such that 0 ≤ sn ≤ f , limn sndν = f dν, 0 ≤ t ≤ g, lim tn dν = gdν, n n  0 ≤ rn ≤ f + g, and limn rn dν = (f + g)dν. Then sn + tn is simple, 0 ≤ sn + tn ≤ f + g, and     f dν + gdν = lim sn dν + lim tn dν n n  = lim (sn + tn )dν, n  which implies     f dν +   gdν ≤  (f + g)dν.  If any of f dν and gdν is infinite, then so is (f + g)dν. Hence, we only need to consider the case where both f and g are integrable. Suppose that g is simple. Then rn − g is simple and     lim rn dν − gdν = lim (rn − g)dν ≤ f dν,     n  n  Chapter 1. Probability Theory  7  since rn − g ≤ f . Hence     (f + g)dν = lim rn dν ≤ f dν + gdν n  and, thus, the result follows if g is simple. For a general g, by the proved result,    lim rn dν − gdν = lim (rn − g)dν. n      n    Hence (f + g)dν = limn rn dν ≤ f dν + Consider general f and g. Note that    gdν and the result follows.  (f + g)+ − (f + g)− = f + g = f+ − f− + g+ − g− , which leads to (f + g)+ + f− + g− = (f + g)− + f+ + g+ . From the proved result for nonnegative functions,     [(f + g)+ + f− + g− ]dν = (f + g)+ dν + f− dν + g− dν  = [(f + g)− + f+ + g+ ]dν    = (f + g)− dν + f+ dν + g+ dν. If both f and g are integrable, then       (f + g)+ dν − (f + g)− dν = f+ dν − f− dν + g+ dν − g− dν, i.e.,     (f + g)dν =   f dν +  gdν.     
Suppose f− dν = ∞. Then f+ dν < ∞  now that   since f dν exists. Since f dν + gdν is well defined, we must have g+ dν < ∞. Since  (f + g) ≤ f + g , (f + g) dν < ∞. Thus, (f + g)−dν = ∞ and + + + +   (f + g)dν = −∞. On the other hand, we also have f dν  + gdν = −∞. Similarly, we can prove the case where f+ dν = ∞ and f− dν < ∞. Exercise 9 (#1.30). Let F be a cumulative distribution function on the real line R and a ∈ R. Show that  [F (x + a) − F (x)]dx = a.  8  Chapter 1. Probability Theory  Solution. For a ≥ 0,    [F (x + a) − F (x)]dx = I(x,x+a] (y)dF (y)dx. Since I(x,x+a] (y) ≥ 0, by Fubini's theorem, the above integral is equal to    I(y−a,y] (x)dxdF (y) = adF (y) = a. The proof for the case of a < 0 is similar. Exercise 10 (#1.31). Let F and G be two cumulative distribution functions on the real line. Show that if F and G have no common points of discontinuity in the interval [a, b], then   G(x)dF (x) = F (b)G(b) − F (a)G(a) − F (x)dG(x). (a,b]  (a,b]  Solution. Let PF and PG be the probability measures corresponding to F and G, respectively, and let P = PF × PG be the product measure (see Shao, 2003, §1.1.1). Consider the following three Borel sets in R2 : A = {(x, y) : x ≤ y, a < y ≤ b}, B = {(x, y) : y ≤ x, a < x ≤ b}, and C = {(x, y) : a < x ≤ b, x = y}. Since F and G have no common points of discontinuity, P (C) = 0. Then,     F (b)G(b) − F (a)G(a) = P (−∞, b]×(−∞, b] − P (−∞, a]×(−∞, a] = P (A) + P (B) − P (C) = P (A) + P (B)   = dP + dP A B     = dPF dPG + dPG dPF (a,b] (−∞,y] (a,b] (−∞,x]   = F (y)dPG + G(x)dPF (a,b] (a,b]   = F (y)dG(y) + G(x)dF (x) (a,b] (a,b]   = F (x)dG(x) + G(x)dF (x), (a,b]  (a,b]  where the fifth equality follows from Fubini's theorem. Exercise 11. Let Y be a random variable and m be a median of Y , i.e., P (Y ≤ m) ≥ 1/2 and P (Y ≥ m) ≥ 1/2. Show that, for any real numbers  Chapter 1. Probability Theory  9  a and b such that m ≤ a ≤ b or m ≥ a ≥ b, E|Y − a| ≤ E|Y − b|. Solution. We can assume E|Y | < ∞, otherwise ∞ = E|Y −a| ≤ E|Y −b| = ∞. Assume m ≤ a ≤ b. Then E|Y − b| − E|Y − a| = E[(b − Y )I{Y ≤b} ] + E[(Y − b)I{Y >b} ] − E[(a − Y )I{Y ≤a} ] − E[(Y − a)I{Y >a} ] = 2E[(b − Y )I{a<Y ≤b} ] + (a − b)[E(I{Y >a} ) − E(I{Y ≤a} )] ≥ (a − b)[1 − 2P (Y ≤ a)] ≥ 0, since P (Y ≤ a) ≥ P (Y ≤ m) ≥ 1/2. If m ≥ a ≥ b, then −m ≤ −a ≤ −b and −m is a median of −Y . From the proved result, E|(−Y ) − (−b)| ≥ E|(−Y ) − (−a)|, i.e., E|Y − a| ≤ E|Y − b|. Exercise 12. Let X and Y be independent random variables satisfying E|X + Y |a < ∞ for some a > 0. Show that E|X|a < ∞. Solution. Let c ∈ R such that P (Y > c) > 0 and P (Y ≤ c) > 0. Note that E|X + Y |a ≥ E(|X + Y |a I{Y >c,X+c>0} ) + E(|X + Y |a I{Y ≤c,X+c≤0} ) ≥ E(|X + c|a I{Y >c,X+c>0} ) + E(|X + c|a I{Y ≤c,X+c≤0} ) = P (Y > c)E(|X + c|a I{X+c>0} ) + P (Y ≤ c)E(|X + c|a I{X+c≤0} ), where the last inequality follows from the independence of X and Y . Since E|X + Y |a < ∞, both E(|X + c|a I{X+c>0} ) and E(|X + c|a I{X+c≤0} ) are finite and E|X + c|a = E(|X + c|a I{X+c>0} ) + E(|X + c|a I{X+c≤0} ) < ∞. Then, E|X|a ≤ 2a (E|X + c|a + |c|a ) < ∞. Exercise 13 (#1.34). Let ν be a σ-finite measure on a σ-field F on Ω, λ be another measure with λ  ν, and f be a nonnegative Borel function on Ω. Show that   dλ f dλ = f dν, dν where dλ dν is the Radon-Nikodym derivative. Note. Two measures λ and ν satisfying λ  ν if ν(A) = 0 always implies  10  Chapter 1. Probability Theory  λ(A) = 0, which ensures the existence of the Radon-Nikodym derivative dλ dν when ν is σ-finite (see Shao, 2003, §1.1.2). Solution. 
By the definition of the Radon-Nikodym derivative and the linearity of integration, the result follows if f is a simple function. For a general nonnegative f , there is a sequence {sn } of nonnegative simple functions such that sn ≤ sn+1 , n = 1, 2, ..., and limn sn = f . Then dλ dλ dλ 0 ≤ sn dλ dν ≤ sn+1 dν and limn sn dν = f dν . By the monotone convergence theorem (e.g., Theorem 1.1 in Shao, 2003),     dλ dλ f dλ = lim sn dλ = lim sn dν = f dν. n n dν dν Exercise 14 (#1.34). Let Fi be a σ-field on Ωi , νi be a σ-finite measure on Fi , and λi be a measure on Fi with λi  νi , i = 1, 2. Show that λ1 × λ2  ν1 × ν2 and d(λ1 × λ2 ) dλ1 dλ2 a.e. ν1 × ν2 , = d(ν1 × ν2 ) dν1 dν2 where ν1 × ν2 (or λ1 × λ2 ) denotes the product measure of ν1 and ν2 (or λ1 and λ2 ). Solution. Suppose that A ∈ σ(F1 × F2 ) and ν1 × ν2 (A) = 0. By Fubini's theorem,     0 = ν1 × ν2 (A) = IA d(ν1 × ν2 ) = IA dν1 dν2 . Since IA ≥ 0, this implies that there is a B ∈ F2 such that ν2 (B c ) = 0 and on the set B, IA dν1 = 0. Since λ1  ν1 , on the set B   dλ1 IA dλ1 = IA dν1 = 0. dν1 Since λ2  ν2 , λ2 (B c ) = 0. Then   λ1 × λ2 (A) = IA d(λ1 × λ2 ) = B      dλ1 dλ2 = 0. A  Hence λ1 × λ2  ν1 × ν2 . For the second assertion, it suffices to show that for any A ∈ σ(F1 ×F2 ), λ(A) = ν(A), where  d(λ1 × λ2 ) d(ν1 × ν2 ) λ(A) = A d(ν1 × ν2 )   and ν(A) =  A  dλ1 dλ2 d(ν1 × ν2 ). dν1 dν2  Chapter 1. Probability Theory  11  Let C = F1 × F2 . Then C satisfies the conditions specified in Exercise 2. For A1 × A2 ∈ F1 × F2 ,  d(λ1 × λ2 ) d(ν1 × ν2 ) λ(A) = d(ν1 × ν2 ) A1 ×A2 = d(λ1 × λ2 ) A1 ×A2  = λ1 (A1 )λ2 (A2 ) and, by Fubini's theorem,    dλ1 dλ2 d(ν1 × ν2 ) A ×A dν1 dν2   1 2 dλ1 dλ2 dν1 dν2 = dν 1 A1 A2 dν2 = λ1 (A1 )λ2 (A2 ).  ν(A) =  Hence λ(A) = ν(A) for any A ∈ C and the second assertion of this exercise follows from the result in Exercise 2. Exercise 15. Let P and Q be two probability measures on a σ-field F. dQ Assume that f = dP dν and g = dν exists for a measure ν on F. Show that  |f − g|dν = 2 sup{|P (C) − Q(C)| : C ∈ F}. Solution. Let A = {f ≥ g} and B = {f < g}. Then A ∈ F, B ∈ F, and    |f − g|dν = (f − g)dν + (g − f )dν A  B  = P (A) − Q(A) + Q(B) − P (B) ≤ |P (A) − Q(A)| + |P (B) − Q(B)| ≤ 2 sup{|P (C) − Q(C)| : C ∈ F}. For any C ∈ F,    P (C) − Q(C) =  (f − g)dν C   (f − g)dν +  =   C∩A  ≤  (f − g)dν C∩B  (f − g)dν. A  Since     (f − g)dν +  C  Cc   (f − g)dν =  (f − g)dν = 1 − 1 = 0,  12  Chapter 1. Probability Theory  we have   P (C) − Q(C) =  (g − f )dν  C  c  C  c ∩A  = ≤   (g − f )dν +  C c ∩B  (g − f )dν  (g − f )dν. B  Hence     (f − g)dν +  2[P (C) − Q(C)] ≤ A   (g − f )dν =  |f − g|dν.  B    Similarly, 2[Q(C)−P  (C)] ≤ |f −g|dν. Thus, 2|P (C)−Q(C)| ≤ |f −g|dν and, consequently, |f − g|dν ≥ 2 sup{|P (C) − Q(C)| : C ∈ F}. Exercise 16 (#1.36). Let Fi be a cumulative distribution function on the real line having a Lebesgue density fi , i = 1, 2. Assume that there is a real number c such that F1 (c) < F2 (c). Define  −∞ < x < c F1 (x) F (x) = F2 (x) c ≤ x < ∞. Show that the probability measure P corresponding to F satisfies P  m + δc , where m is the Lebesgue measure and δc is the point mass at c, and find the probability density of F with respect to m + δc . Solution. For any A ∈ B,    P (A) = f1 (x)dm + a dδc + f2 (x)dm, (−∞,c)∩A  {c}∩A  (c,∞)∩A     where a = F2 (c) − F1 (c). Note that (−∞,c)∩A dδc = 0, (c,∞)∩A dδc = 0,  and {c}∩A dm = 0. Hence,   P (A) = f1 (x)d(m + δc ) + a d(m + δc ) (−∞,c)∩A {c}∩A  + f2 (x)d(m + δc ) (c,∞)∩A  = [I(−∞,c) (x)f1 (x) + aI{c} (x) + I(c,∞) f2 (x)]d(m + δc ). 
A  This shows that P  m + δc and dP = I(−∞,c) (x)f1 (x) + aI{c} (x) + I(c,∞) f2 (x). d(m + δc )  Chapter 1. Probability Theory  13  Exercise 17 (#1.46). Let X1 and X2 be independent random variables having the standard normal distribution. Obtain the joint Lebesgue density X12 + X22 and Y2 = X1 /X2 . Are Y1 and Y2 of (Y1 , Y2 ), where Y1 = independent? Note. For this type of problem, we may apply the following result. Let X be a random k-vector with a Lebesgue density fX and let Y = g(X), where g is a Borel function from (Rk , B k ) to (Rk , B k ). Let A1 , ..., Am be disjoint sets in B k such that Rk − (A1 ∪ · · · ∪ Am ) has Lebesgue measure 0 and g on Aj is one-to-one with a nonvanishing Jacobian, i.e., the determinant Det(∂g(x)/∂x) = 0 on Aj , j = 1, ..., m. Then Y has the following Lebesgue density: m   Det (∂hj (x)/∂x)  fX (hj (x)) , fY (x) = j=1  where hj is the inverse function of g on Aj , j = 1, ..., m. Solution. Let A1 = {(x1 , x2 ): x1 > 0, x2 > 0}, A2 = {(x1 , x2 ): x1 > 0, x2 < 0}, A3 = {(x1 , x2 ): x1 < 0, x2 > 0}, and A4 = {(x1 , x2 ): x1 < 0, x2 < 0}. Then 2 the Lebesgue measure  of R − (A1 ∪ A2 ∪ A3 ∪ A4 ) is 0. On each Ai , the function (y1 , y2 ) = ( x21 + x22 , x1 /x2 ) is one-to-one with   y1 y22  y1   √ y2 √ − 2   3/2 2 ∂(x1 , x2 ) (1+y2 ) 1+y22  = y1 . 2 =  1+y Det y y  1 + y2 1 1 2 √ ∂(y1 , y2 ) −   2 2 (1+y 2 )3/2 1+y2  2  Since the joint Lebesgue density of (X1 , X2 ) is 1 −(x21 +x22 )/2 e 2π and x21 + x22 = y12 , the joint Lebesgue density of (Y1 , Y2 ) is   4 2 2 ∂(x1 , x2 )  1 −(x21 +x22 )/2  y1 = e−y1 Det . e   2π ∂(y , y ) π 1 + y22 1 2 i=1 Since the joint Lebesgue density of (Y1 , Y2 ) is a product of two functions that are functions of one variable, Y1 and Y2 are independent. Exercise 18 (#1.45). Let Xi , i = 1, 2, 3, be independent random variables having the same Lebesgue density f (x) = e−x I(0,∞) (x). Obtain the joint Lebesgue density of (Y1 , Y2 , Y3 ), where Y1 = X1 + X2 + X3 , Y2 = X1 /(X1 + X2 ), and Y3 = (X1 + X2 )/(X1 + X2 + X3 ). Are Yi 's independent? Solution: Let x1 = y1 y2 y3 , x2 = y1 y3 − y1 y2 y3 , and x3 = y1 − y1 y3 . Then,  ∂(x1 , x2 , x3 ) = y12 y3 . Det ∂(y1 , y2 , y3 )  14  Chapter 1. Probability Theory  Using the same argument as that in the previous exercise, we obtain the joint Lebesgue density of (Y1 , Y2 , Y3 ) as e−y1 y12 I(0,∞) (y1 )I(0,1) (y2 )y3 I(0,1) (y3 ). Because this function is a product of three functions, e−y1 y12 I(0,∞) (y1 ), I(0,1) (y2 ), and y3 I(0,1) (y3 ), Y1 , Y2 , and Y3 are independent. Exercise 19 (#1.47). Let X and Y be independent random variables with cumulative distribution functions FX and FY , respectively. Show that (i) the cumulative distribution function of X + Y is  FX+Y (t) = FY (t − x)dFX (x); (ii) FX+Y is continuous if one of FX and FY is continuous; (iii) X +Y has a Lebesgue density if one of X and Y has a Lebesgue density. Solution. (i) Note that  FX+Y (t) = dFX (x)dFY (y) x+y≤t    = dFY (y) dFX (x) y≤t−x  = FY (t − x)dFX (x), where the second equality follows from Fubini's theorem. (ii) Without loss of generality, we assume that FY is continuous. Since FY is bounded, by the dominated convergence theorem (e.g., Theorem 1.1 in Shao, 2003),  lim FX+Y (t + ∆t) = lim FY (t + ∆t − x)dFX (x) ∆t→0 ∆t→0  = lim FY (t + ∆t − x)dFX (x) ∆t→0  = FY (t − x)dFX (x) = FX+Y (t). (iii) Without loss of generality, we assume that Y has a Lebesgue density fY . Then  FX+Y (t) = FY (t − x)dFX (x)    t−x fY (s)ds dFX (x) = −∞  Chapter 1. 
Probability Theory   15   t  =   −∞ t  = −∞     fY (y − x)dy dFX (x)  fY (y − x)dFX (x) dy,  where the last equality follows from  Fubini's theorem. Hence, X + Y has the Lebesgue density fX+Y (t) = fY (t − x)dFX (x). Exercise 20 (#1.94). Show that a random variable X is independent of itself if and only if X is constant a.s. Can X and f (X) be independent, where f is a Borel function? Solution. Suppose that X = c a.s. for a constant c ∈ R. For any A ∈ B and B ∈ B, P (X ∈ A, X ∈ B) = IA (c)IB (c) = P (X ∈ A)P (X ∈ B). Hence X and X are independent. Suppose now that X is independent of itself. Then, for any t ∈ R, P (X ≤ t) = P (X ≤ t, X ≤ t) = [P (X ≤ t)]2 . This means that P (X ≤ t) can only be 0 or 1. Since limt→∞ P (X ≤ t) = 1 and limt→−∞ P (X ≤ t) = 0, there must be a c ∈ R such that P (X ≤ c) = 1 and P (X < c) = 0. This shows that X = c a.s. If X and f (X) are independent, then so are f (X) and f (X). From the previous result, this occurs if and only if f (X) is constant a.s. Exercise 21 (#1.38). Let (X, Y, Z) be a random 3-vector with the following Lebesgue density:  1−sin x sin y sin z 0 ≤ x, y, z, ≤ 2π 8π 3 f (x, y, z) = 0 otherwise Show that X, Y, Z are pairwise independent, but not independent. Solution. The Lebesgue density for (X, Y ) is  2π  2π 1 − sin x sin y sin z 1 f (x, y, z)dz = dz = , 3 2 8π 4π 0 0 0 ≤ x, y, ≤ 2π. The Lebesgue density for X or Y is  2π  2π  2π 1 1 , f (x, y, z)dydz = dy = 2 4π 2π 0 0 0 0 ≤ x ≤ 2π. Hence X and Y are independent. Similarly, X and Z are independent and Y and Z are independent. Note that  π 1 1 P (X ≤ π) = P (Y ≤ π) = P (Z ≤ π) = dx = . 2π 2 0  16  Chapter 1. Probability Theory  Hence P (X ≤ π)P (Y ≤ π)P (Z ≤ π) = 1/8. On the other hand,  π π π 1 − sin x sin y sin z P (X ≤ π, Y ≤ π, Z ≤ π) = dxdydz 8π 3 0 0 0 3  π 1 1 = − 3 sin xdx 8 8π 0 1 1 = − 3. 8 π Hence X, Y , and Z are not independent. Exercise 22 (#1.51, #1.53). Let X be a random n-vector having the multivariate normal distribution Nn (µ, In ). (i) Apply Cochran's theorem to show that if A2 = A, then X τ AX has the noncentral chi-square distribution χ2r (δ), where A is an n × n symmetric matrix, r = rank of A, and δ = µτ Aµ. (ii) Let Ai be an n × n symmetric matrix satisfying A2i = Ai , i = 1, 2. Show that a necessary and sufficient condition that X τ A1 X and X τ A2 X are independent is A1 A2 = 0. Note. If X1 , ..., Xk are independent and Xi has the normal distribution N (µi , σ 2 ), i = 1, ..., k, then the distribution of (X12 + · · · + Xk2 )/σ 2 is called the noncentral chi-square distribution χ2k (δ), where δ = (µ21 + · · · + µ2k )/σ 2 . When δ = 0, χ2k is called the central chi-square distribution. Solution. (i) Since A2 = A, i.e., A is a projection matrix, (In − A)2 = In − A − A + A2 = In − A. Hence, In −A is a projection matrix with rank tr(In −A) = tr(In )−tr(A) = n−r. The result then follows by applying Cochran's theorem (e.g., Theorem 1.5 in Shao, 2003) to X τ X = X τ AX + X τ (In − A)X. (ii) Suppose that A1 A2 = 0. Then (In − A1 − A2 )2 = In − A1 − A2 − A1 + A21 + A2 A1 − A2 + A1 A2 + A22 = In − A1 − A2 , i.e., In − A1 − A2 is a projection matrix with rank = tr(In − A1 − A2 ) = n − r1 − r2 , where ri = tr(Ai ) is the rank of Ai , i = 1, 2. By Cochran's theorem and X τ X = X τ A1 X + X τ A2 X + X τ (In − A1 − A2 )X, X τ A1 X and X τ A2 X are independent.  Chapter 1. Probability Theory  17  Assume that X τ A1 X and X τ A2 X are independent. 
Since X τ Ai X has the noncentral chi-square distribution χ2ri (δi ), where ri is the rank of Ai and δi = µτ Ai µ, X τ (A1 + A2 )X has the noncentral chi-square distribution χ2r1 +r2 (δ1 + δ2 ). Consequently, A1 + A2 is a projection matrix, i.e., (A1 + A2 )2 = A1 + A2 , which implies A1 A2 + A2 A1 = 0. Since A21 = A1 , we obtain that 0 = A1 (A1 A2 + A2 A1 ) = A1 A2 + A1 A2 A1 and 0 = A1 (A1 A2 + A2 A1 )A1 = 2A1 A2 A1 , which imply A1 A2 = 0. Exercise 23 (#1.55). Let X be a random variable having a cumulative distribution function F . Show that if EX exists, then  0  ∞ [1 − F (x)]dx − F (x)dx. EX = 0  −∞  Solution. By Fubini's theorem,  ∞  [1 − F (x)]dx = 0  0    ∞   dF (y)dx (x,∞)  ∞  =  dxdF (y) 0   =  (0,y) ∞  ydF (y). 0  Similarly,     0  0      F (x)dx = −∞  −∞  (−∞,x]  If EX exists, then at least one of     ∞  EX =  ydF (y) = −∞  0  ∞ 0 ∞  dF (y)dx = − ydF (y) and  0  ydF (y). −∞  0 −∞   [1 − F (x)]dx −  ydF (y) is finite and  0  F (x)dx. −∞  Exercise 24 (#1.58(c)). Let X and Y be random variables having the bivariate normal distribution with EX = EY = 0, Var(X) = Var(Y ) = 1,  18  Chapter 1. Probability Theory  and Cov(X, Y ) = ρ. Show that E(max{X, Y }) = Solution. Note that    (1 − ρ)/π.  |X − Y | = max{X, Y } − min{X, Y } = max{X, Y } + max{−X, −Y }. Since the joint distribution of (X, Y ) is symmetric about 0, the distribution of max{X, Y } and max{−X, −Y } are the same. Hence, E|X − Y | = 2E(max{X, Y }). From the property of the normal distribution, X − Y is normally distributed with mean 0 and variance Var(X − Y ) = Var(X) + Var(Y ) − 2Cov(X, Y ) = 2 − 2ρ. Then,    E(max{X, Y }) = 2−1 E|X − Y | = 2−1 2/π 2 − 2ρ = (1 − ρ)/π. Exercise 25 (#1.60). Let X be a random variable with EX 2 < ∞ and let Y = |X|. Suppose that X has a Lebesgue density symmetric about 0. Show that X and Y are uncorrelated, but they are not independent. Solution. Let f be the Lebesgue density of X. Then f (x) = f (−x). Since X and XY = X|X| are odd functions of X, EX = 0 and E(X|X|) = 0. Hence, Cov(X, Y ) = E(XY ) − EXEY = E(X|X|) − EXE|X| = 0. Let t be a positive constant such that p = P (0 < X < t) > 0. Then P (0 < X < t, Y < t) = P (0 < X < t, −t < X < t) = P (0 < X < t) =p and P (0 < X < t)P (Y < t) = P (0 < X < t)P (−t < X < t) = 2P (0 < X < t)P (0 < X < t) = 2p2 , i.e., P (0 < X < t, Y < t) = P (0 < X < t)P (Y < t). Hence X and Y are not independent. Exercise 26 (#1.61). Let (X, Y ) be a random 2-vector with the following Lebesgue density:  −1 π x2 + y 2 ≤ 1 f (x, y) = 0 x2 + y 2 > 1. Show that X and Y are uncorrelated, but they are not independent. Solution. Since X and Y are uniformly distributed on the Borel set  Chapter 1. Probability Theory  19  {(x, y) : x2 +y 2 ≤ 1}, EX = EY = 0 and E(XY ) = 0. Hence Cov(X, Y ) = 0. A direct calculation shows that √ √ 1 P (0 < X < 1/ 2, 0 < Y < 1/ 2) = 2π and  √ √ 1 1 P (0 < X < 1/ 2) = P (0 < Y < 1/ 2) = + . 4 2π  Hence, √ √ √ √ P (0 < X < 1/ 2, 0 < Y < 1/ 2) = P (0 < X < 1/ 2)P (0 < Y < 1/ 2) and X and Y are not independent. Exercise 27 (#1.48, #1.70). Let Y be a random variable having the noncentral chi-square distribution χ2k (δ), where k is a positive integer. Show that (i) the Lebesgue density of Y is gδ,k (t) = e−δ/2  ∞ j=0  (δ/2)j f2j+k (t), j!  where fj (t) = [Γ(j/2)2j/2 ]−1 tj/2−1 e−t/2 I(0,∞) (t) is the Lebesgue density of the central chi-square distribution χ2j , j = 1, 2, ...; √ √ √ (ii) the characteristic function of Y is (1 − 2 −1t)−k/2 e −1t/(1−2 −1t) ; (iii) E(Y ) = k + δ and Var(Y ) = 2k + 4δ. Solution A. (i) Consider first k = 1. 
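As a quick aside before the analytic derivation that follows, both the series form of the density in (i) and the moments in (iii) can be checked numerically. The sketch below is only an illustration under arbitrary choices of $k$, $\delta$, and $t$; it assumes NumPy and SciPy are available, and SciPy's ncx2 is used purely as an independent reference for the noncentral chi-square density.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, delta, t = 5, 3.0, 6.0

# (iii): Y is a sum of k squared normals with unit variances and total
#        noncentrality delta, so EY = k + delta and Var(Y) = 2k + 4*delta.
x = rng.normal(size=(500_000, k))
x[:, 0] += np.sqrt(delta)            # put all of the noncentrality in one coordinate
y = (x ** 2).sum(axis=1)
print(y.mean(), k + delta)           # both approximately 8.0
print(y.var(), 2 * k + 4 * delta)    # both approximately 22.0

# (i): the Poisson-weighted mixture of central chi-square densities should
#      match the noncentral chi-square density at t.
j = np.arange(80)
series = np.sum(stats.poisson.pmf(j, delta / 2) * stats.chi2.pdf(t, 2 * j + k))
print(series, stats.ncx2.pdf(t, k, delta))   # the two values should agree
```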
By the definition of the noncentral chi-square distribution (e.g., Shao, 2003, p. 26), the distribution of Y is the √ same as that of X 2 , where X has the normal distribution with mean δ and variance 1. Since √ √ P (Y ≤ t) = P (X ≤ t) − P (X ≤ − t) for t > 0, the Lebesgue density of Y is √ √ 1 fY (t) = √ [fX ( t) + fX (− t)]I(0,∞) (t), 2 t where fX is the Lebesgue density of X. Using the fact that X has a normal distribution, we obtain that, for t > 0, √ √ 2  1  −(√t−√δ)2 /2 + e−(− t− δ) /2 e fY (t) = √ 2 2πt  √ e−δ/2 e−t/2  √δt √ = + e −δt e 2 2πt  20  Chapter 1. Probability Theory ⎞ ⎛ √ ∞ √ ∞ e−δ/2 e−t/2 ⎝ ( δt)j (− δt)j ⎠ √ = + j! j! 2 2πt j=0 j=0 =  e−δ/2 e−t/2 √ 2πt  ∞ j=0  (δt)j . (2j)!  On the other hand, for k = 1 and t > 0, −δ/2  ∞  gδ,1 (t) = e  (δ/2)j j!  j=0 −δ/2 −t/2 ∞  =  e  √  e 2t  j=0  1 tj−1/2 e−t/2 Γ(j + 1/2)2j+1/2    (δt)j . j!Γ(j + 1/2)22j  √ Since j!22j Γ(j + 1/2) = π(2j)!, fY (t) = gδ,1 (t) holds. We then use induction. By definition, Y = X1 + X2 , where X1 has the noncentral chi-square distribution χ2k−1 (δ), X2 has the central chi-square distribution χ21 , and X1 and X2 are independent. By the induction assumption, the Lebesgue density of X1 is gδ,k−1 . Note that the Lebesgue density of X2 is f1 . Using the convolution formula (e.g., Example 1.15 in Shao, 2003), the Lebesgue density of Y is  fY (t) = gδ,k−1 (u)f1 (t − u)du −δ/2  ∞  =e  j=0  (δ/2)j j!   f2j+k−1 (u)f1 (t − u)du   for t > 0. By the convolution formula again, f2j+k−1 (u)f1 (t − u)du is the Lebesgue density of Z+X2 , where Z has density f2j+k−1 and is independent of X2 . By definition, Z + X2 has the central chi-square distribution χ22j+k , i.e.,  f2j+k−1 (u)f1 (t − u)du = f2j+k (t). Hence, fY = gδ,k . (ii) Note that the moment generating function of the central chi-square distribution χ2k is, for t < 1/2,  ∞  1 tu e fk (u)du = uk/2−1 e−(1−2t)u/2 du Γ(k/2)2k/2 0  ∞ 1 sk/2−1 e−s/2 ds = Γ(k/2)2k/2 (1 − 2t)k/2 0 1 = , (1 − 2t)k/2  Chapter 1. Probability Theory  21  where the second equality follows from the following change of variable in the integration: s = (1 − 2t)u. By the result in (i), the moment generating function for Y is   ∞ (δ/2)j etx gδ,k (x)dx = e−δ/2 etx f2j+k (x)dx j! j=0 = e−δ/2  ∞ j=0  (δ/2)j j!(1 − 2t)(j+k/2) ∞  −δ/2  =  e (1 − 2t)k/2  j=0  {δ/[2(1 − 2t)]}j j!  e−δ/2+δ/[2(1−2t)] = (1 − 2t)k/2 =  eδt/(1−2t) . (1 − 2t)k/2  √ function of Y √ , we obtain Substituting t by −1t in the moment generating √ √ the characteristic function of Y as (1 − 2 −1t)−k/2 e −1δt/(1−2 −1t) . (iii) Let ψY (t) be the moment generating function of Y . By the result in (ii),  δ k 2δt + ψ  (t) = ψ(t) + 1 − 2t (1 − 2t)2 1 − 2t and  2δt δ k + ψ (t) = ψ (t) + 1 − 2t (1 − 2t)2 1 − 2t  4δ 2δt 2k + ψ(t) . + + (1 − 2t)2 (1 − 2t)3 (1 − 2t)2     Hence, EY = ψ  (0) = δ + k, EY 2 = ψ  (0) = (δ + k)2 + 4δ + 2k, and Var(Y ) = EY 2 − (EY )2 = 4δ + 2k. Solution B. (i) We first derive result (ii). Let X be a random variable having the standard normal distribution and µ be a real number. The moment generating function of (X + µ)2 is  2 2 1 √ e−x /2 et(x+µ) dx ψµ (t) = 2π  2 2 eµ t/(1−2t) √ = e−(1−2t)[x−2µt/(1−2t)] /2 dx 2π 2  eµ t/(1−2t) = √ . 1 − 2t  22  Chapter 1. Probability Theory  √ 2 By definition, Y has the same distribution as X12 +· · ·+Xk−1 +(Xk + δ)2 , where Xi 's are independent and have the standard normal distribution. From the obtained result, the moment generating function of Y is 2  √ 2 δ) ]  2  Eet[X1 +···+Xk−1 +(Xk +  2  = [ψ0 (t)]k−1 ψ√δ (t) =  eµ t/(1−2t) . 
(1 − 2t)k/2  (ii) We now use the result in (ii) to prove the result in (i). From part (ii) of 2 Solution A, the moment generating function of gδ,k is eµ t/(1−2t) (1−2t)−k/2 , which is the same as the moment generating function of Y derived in part (i) of this solution. By the uniqueness theorem (e.g., Theorem 1.6 in Shao, 2003), we conclude that gδ,k is the Lebesgue density of Y . (iii) Let Xi 's be as defined in (i). Then, √ 2 EY = EX12 + · · · + EXk−1 + E(Xk + δ)2 √ = k − 1 + EXk2 + δ + E(2 δXk ) = k+δ and √   2 ) + Var (Xk + δ)2 Var(Y ) = Var(X12 ) + · · · + Var(Xk−1 √ = 2(k − 1) + Var(Xk2 + 2 δXk ) √ √ = 2(k − 1) + Var(Xk2 ) + Var(2 δXk ) + 2Cov(Xk2 , 2 δXk ) = 2k + 4δ, since Var(Xi2 ) = 2 and Cov(Xk2 , Xk ) = EXk3 − EXk2 EXk = 0. Exercise 28 (#1.57). Let U1 and U2 be independent random variables having the χ2n1 (δ) and χ2n2 distributions, respectively, and let F = (U1 /n1 )/(U2 /n2 ). Show that (n1 +δ) when n2 > 2; (i) E(F ) = nn12 (n 2 −2) 2n2 [(n +δ)2 +(n −2)(n +2δ)]  (ii) Var(F ) = 2 n12 (n2 −2)22(n2 −4)1 when n2 > 4. 1 Note. The distribution of F is called the noncentral F-distribution and denoted by Fn1 ,n2 (δ). Solution. From the previous exercise, EU1 = n1 +δ and EU12 = Var(U1 )+ (EU1 )2 = 2n1 + 4δ + (n1 + δ)2 . Also,  ∞ 1 −1 xn2 /2−2 e−x/2 dx EU2 = Γ(n2 /2)2n2 /2 0 Γ(n2 /2 − 1)2n2 /2−1 Γ(n2 /2)2n2 /2 1 = n2 − 2 =  Chapter 1. Probability Theory  23  for n2 > 2 and EU2−2 =  1 Γ(n2 /2)2n2 /2    ∞  xn2 /2−3 e−x/2 dx  0  Γ(n2 /2 − 2)2n2 /2−2 = Γ(n2 /2)2n2 /2 1 = (n2 − 2)(n2 − 4) for n2 > 4. Then, E(F ) = E  U1 /n1 n2 n2 (n1 + δ) = EU1 EU2−1 = U2 /n2 n1 n1 (n2 − 2)  when n2 > 2 and Var(F ) = E  U12 /n21 − [E(F )]2 U22 /n22  2 n22 n2 (n1 + δ) −2 2 EU EU − 1 2 n21 n1 (n2 − 2)  2 (n1 + δ)2 2n1 + 4δ + (N − 1 + δ)2 n − = 22 n1 (n2 − 2)(n2 − 4) (n2 − 2)2 2n22 [(n1 + δ)2 + (n2 − 2)(n1 + 2δ)] = n21 (n2 − 2)2 (n2 − 4) =  when n2 > 4. Exercise 29 (#1.74). Let φn be the characteristic function of a probabil{an } be a sequence of nonnegative numbers ity measure ∞ Pn , n = 1, 2, .... Let ∞ with n=1 an = 1. Show that n=1 an φn is a characteristic function and find its corresponding probability measure. Solution A. For any event A, define ∞  P (A) =  an Pn (A). n=1  Then P is a probability measure and Pn  P for any n. Denote the RadonNikodym derivative of Pn with respect to P as fn , n = 1, 2, .... By Fubini's theorem, for any event A,   ∞ ∞ an fn dP = an fn dP A n=1  A  n=1 ∞  an Pn (A)  = n=1  = P (A).  24  Hence,  Chapter 1. Probability Theory   an fn = 1 a.s. P . Then, ∞    ∞  an φn (t) = n=1  √ −1tx  e  an n=1 ∞    √ −1tx  e  an  =  dPn (x) fn (x)dP  n=1   =  ∞ √ −1tx  e   =  an fn (x)dP  n=1 √ −1tx  e  dP.  ∞ Hence, n=1 an φn is the characteristic function of P . Solution B. Let X be a discrete random variable satisfying P (X = n) = an and Y be a random variable such that given X = n, the conditional distribution of Y is Pn , n = 1, 2, .... The characteristic function of Y is √ −1tY  E(e  √ −1tY  ) = E[E(e ∞  =  |X)]  √ −1tY  an E(e n=1 ∞   an  =  √ −1ty  e  |X = n) dPn (y)  n=1 ∞  an φn (t).  = n=1  ∞ This shows that n=1 an φn is the characteristic function of the marginal distribution of Y . Exercise 30 (#1.79). Find an example of two random variables X and Y such that X and Y are not independent but their characteristic functions φX and φY satisfy φX (t)φY (t) = φX+Y (t) for all t ∈ R. Solution. Let X = Y be a random variable having the Cauchy distribution with φX (t) = φY (t) = e−|t| . Then X and Y are not independent (see Exercise 20). 
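As an aside, the equality of characteristic functions about to be verified analytically can also be seen in simulation. The sketch below, which assumes NumPy is available, compares the empirical characteristic function of $X + Y = 2X$, with $X$ standard Cauchy, against $\phi_X(t)\phi_Y(t) = e^{-2|t|}$; the sample size and the grid of $t$ values are arbitrary choices.

```python
import numpy as np

# X = Y standard Cauchy, so X + Y = 2X.  The empirical characteristic function
# of 2X should match exp(-|t|)**2 even though X and Y are (trivially) dependent.
rng = np.random.default_rng(1)
x = rng.standard_cauchy(1_000_000)
for t in (0.5, 1.0, 2.0):
    emp = np.exp(1j * t * 2 * x).mean().real   # imaginary part vanishes by symmetry
    print(t, emp, np.exp(-abs(t)) ** 2)
```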
The characteristic function of X + Y = 2X is √ −1t(2X)  φX+Y (t) = E(e  ) = φX (2t) = e−|2t| = e−|t| e−|t| = φX (t)φY (t).  Exercise 31 (#1.75). Let X be a random variable whose√characteristic   function φ satisfies |φ(t)|dt < ∞. Show that (2π)−1 e− −1xt φ(t)dt is the Lebesgue density of X.  Chapter 1. Probability Theory  25  √ √ √ Solution. Define g(t, x) = (e− −1ta − e− −1tx )/( −1t) for a fixed real number a. For any x, |g(t, x)| ≤ |x−a|. Under the condition |φ(t)|dt < ∞,  ∞ 1 φ(t)g(t, x)dt F (x) − F (a) = 2π −∞  (e.g., Theorem 1.6 in Shao, 2003), where F is the cumulative distribution function of X. Since   √  ∂g(t, x)   = |e− −1tx | = 1,   ∂x  by the dominated convergence theorem (Theorem 1.1 and Example 1.8 in Shao, 2003),   ∞ 1 d F  (x) = φ(t)g(t, x)dt dx 2π −∞  ∞ ∂g(t, x) 1 φ(t) = dt 2π −∞ ∂x  ∞ √ 1 = φ(t)e− −1tx dt. 2π −∞ Exercise 32 (#1.73(g)). Let φ be a characteristic function and G be a cumulative distribution function on the real line. Show that φ(ut)dG(u) is a characteristic function on the real line. Solution. Let F be the cumulative distribution function corresponding to φ and let X and U be independent random variables having distributions F and G, respectively. The characteristic function of U X is   √ √ −1tU X Ee = e −1tux dF (x)dG(u)  = φ(ut)dG(u).  Exercise 33. Let X and Y be independent random variables. Show that if X and X − Y are independent, then X must be degenerate. Solution. We denote the characteristic function of any random variable Z by φZ . Since X and Y are independent, so are −X and Y . Hence, φY −X (t) = φY (t)φ−X (t) = φY (t)φX (−t), t ∈ R. If X and X −Y are independent, then X and Y −X are independent. Then φY (t) = φX+(Y −X) (t) = φX (t)φY −X (t) = φX (t)φX (−t)φY (t), t ∈ R.  26  Chapter 1. Probability Theory  Since φY (0) = 1 and φY is continuous, φY (t) = 0 for a neighborhood of 0. Hence φX (t)φX (−t) = |φX (t)|2 = 1 on this neighborhood of 0. Thus, X is degenerate. Exercise 34 (#1.98). Let PY be a discrete distribution on {0, 1, 2, ...} and given Y = y, the conditional distribution of X be the binomial distribution with size y and probability p. Show that (i) if Y has the Poisson distribution with mean θ, then the marginal distribution of X is the Poisson distribution with mean pθ; (ii) if Y + r has the negative binomial distribution with size r and probability π, then the marginal distribution of X + r is the negative binomial distribution with size r and probability π/[1 − (1 − p)(1 − π)]. Solution. (i) The moment generating function of X is E(etX ) = E[E(etX |Y )] = E[(pet + 1 − p)Y ] = eθp(e  t  −1)  ,  which is the moment generating function of the Poisson distribution with mean pθ. (ii) The moment generating function of X + r is E(et(X+r) ) = etr E[E(etX |Y )] = etr E[(pet + 1 − p)Y ] etr = E[(pet + 1 − p)Y +r ] (pet + 1 − p)r π r (pet + 1 − p)r etr = (pet + 1 − p)r [1 − (1 − π)(pet + 1 − p)]r π r ert = . [1 − (1 − π)(pet + 1 − p)]r Then the result follows from the fact that   1 − (1 − π)(pet + 1 − p) π et . =1− 1− 1 − (1 − p)(1 − π) 1 − (1 − p)(1 − π) Exercise 35 (#1.85). Let X and Y be integrable random variables on the probability space (Ω, F, P ) and A be a sub-σ-field of F. Show that (i) if X ≤ Y a.s., then E(X|A) ≤ E(Y |A) a.s.; (ii) if a and b are constants, then E(aX + bY |A) = aE(X|A) + bE(X|A) a.s. Solution. (i) Suppose that X ≤ Y a.s. By the definition of the conditional expectation and the property of integration,     E(X|A)dP = XdP ≤ Y dP = E(Y |A)dP, A  A  A  A  Chapter 1. 
Probability Theory  27  where A = {E(X|A) > E(Y |A)} ∈ A. Hence P (A) = 0, i.e., E(X|A) ≤ E(Y |A) a.s. (ii) Note that aE(X|A) + bE(Y |A) is measurable from (Ω, A) to (R, B). For any A ∈ A, by the linearity of integration,    (aX + bY )dP = a XdP + b Y dP A A A  =a E(X|A)dP + b E(Y |A)dP A  A = [aE(X|A) + bE(Y |A)]dP. A  By the a.s.-uniqueness of the conditional expectation, E(aX + bY |A) = aE(X|A) + bE(X|A) a.s. Exercise 36 (#1.85). Let X be an integrable random variable on the probability space (Ω, F, P ) and A and A0 be σ-fields satisfying A0 ⊂ A ⊂ F. Show that E[E(X|A)|A0 ] = E(X|A0 ) = E[E(X|A0 )|A] a.s. Solution. Note that E(X|A0 ) is measurable from (Ω, A0 ) to (R, B) and A0 ⊂ A. Hence E(X|A0 ) is measurable from (Ω, A) to (R, B) and, thus, E(X|A0 ) = E[E(X|A0 )|A] a.s. Since E[E(X|A)|A0 ] is measurable from (Ω, A0 ) to (R, B) and for any A ∈ A0 ⊂ A,    E[E(X|A)|A0 ]dP = E(X|A)dP = XdP, A  A  A  we conclude that E[E(X|A)|A0 ] = E(X|A0 ) a.s. Exercise 37 (#1.85). Let X be an integrable random variable on the probability space (Ω, F, P ), A be a sub-σ-field of F, and Y be another random variable satisfying σ(Y ) ⊂ A and E|XY | < ∞. Show that E(XY |A) = Y E(X|A) a.s. Solution. Since σ(Y ) ⊂ A, Y E(X|A) is measurable from (Ω, A) to (R, B). The result follows if we can show that for any A ∈ A,   Y E(X|A)dP = XY dP. A  A  (1) If Y = aIB , where a ∈ R and B ∈ A, then A ∩ B ∈ A and     XY dP = a XdP = a E(X|A)dP = Y E(X|A)dP. A  A∩B  A∩B  A  28  Chapter 1. Probability Theory  (2) If Y =  k    i=1   ai XIBi dP =  k  XY dP = A  ai IBi , where Bi ∈ A, then  i=1  A    ai IBi E(X|A)dP = Y E(X|A)dP.  k  A  i=1  A  (3) Suppose that X ≥ 0 and Y ≥ 0. There exists a sequence of increasing simple functions Yn such that σ(Yn ) ⊂ A, Yn ≤ Y and limn Yn = Y . Then limn XYn = XY and limn Yn E(X|A) = Y E(X|A). By the monotone convergence theorem and the result in (2),     XY dP = lim XYn dP = lim Yn E(X|A)dP = Y E(X|A)dP. A  n  n  A  A  A  (4) For general X and Y , consider X+ , X− , Y+ , and Y− . Since σ(Y ) ⊂ A, so are σ(Y+ ) and σ(Y− ). Then, by the result in (3),    XY dP = X+ Y+ dP − X+ Y− dP A A A   − X− Y+ dP + X− Y− dP A  A  = Y+ E(X+ |A)dP − Y− E(X+ |A)dP A A   − Y+ E(X− |A)dP + Y− E(X− |A)dP A A   = Y E(X+ |A)dP − Y E(X− |A)dP A A = Y E(X|A)dP, A  where the last equality follows from the result in Exercise 35. Exercise 38 (#1.85). Let X1 , X2 , ... and X be integrable random variables on the probability space (Ω, F, P ). Assume that 0 ≤ X1 ≤ X2 ≤ · · · ≤ X and limn Xn = X a.s. Show that for any σ-field A ⊂ F, E(X|A) = lim E(Xn |A) a.s. n  Solution. Since each E(Xn |A) is measurable from (Ω, A) to (R, B), so is the limit limn E(Xn |A). We need to show that   lim E(Xn |A)dP = XdP A n  A  for any A ∈ A. By Exercise 35, 0 ≤ E(X1 |A) ≤ E(X2 |A) ≤ · · · ≤ E(X|A) a.s. By the monotone convergence theorem (e.g., Theorem 1.1 in Shao,  Chapter 1. Probability Theory  29  2003), for any A ∈ A,   lim E(Xn |A)dP = lim E(Xn |A)dP n A n A  = lim Xn dP n A  = lim Xn dP n A = XdP. A  Exercise 39 (#1.85). Let X1 , X2 , ... be integrable random variables on the probability space (Ω, F, P ). Show that for any σ-field A ⊂ F, (i) E(lim inf n Xn |A) ≤ lim inf n E(Xn |A) a.s. if Xn ≥ 0 for any n; (ii) limn E(Xn |A) = E(X|A) a.s. if limn Xn = X a.s. and |Xn | ≤ Y for any n and an integrable random variable Y . Solution. (i) For any m ≥ n, by Exercise 35, E(inf m≥n Xm |A) ≤ E(Xm |A) a.s. Hence, E(inf m≥n Xm |A) ≤ inf m≥n E(Xm |A) a.s. Let Yn = inf m≥n Xm . Then 0 ≤ Y1 ≤ Y2 ≤ · · · ≤ limn Yn and Yn 's are integrable. 
Hence, E(lim inf Xn |A) = E(lim Yn |A) n  n  = lim E(Yn |A) n  = lim E(inf m≥n Xm |A) n  ≤ lim inf E(Xm |A) n m≥n  = lim inf E(Xn |A) n  a.s., where the second equality follows from the result in the previous exercise and the first and the last equalities follow from the fact that lim inf n fn = limn inf m≥n fm for any sequence of functions {fn }. (ii) Note that Y + Xn ≥ 0 for any n. Applying the result in (i) to Y + Xn , lim inf E(Y + Xn |A) ≤ E(lim inf (Y + Xn )|A) = E(Y + X|A) a.s. n  n  Since Y is integrable, so is X and, consequently, E(Y + X|A) = E(Y |A) + E(X|A) a.s. and lim inf n E(Y + Xn |A) = E(Y |A) + lim inf n E(Xn |A) a.s. Hence, lim inf E(Xn |A) ≤ E(X|A) a.s. n  Applying the same argument to Y − Xn , we obtain that lim inf E(−Xn |A) ≤ E(−X|A) a.s. n  30  Chapter 1. Probability Theory  Since lim inf n E(−Xn |A) = − lim supn E(Xn |A), we obtain that lim sup E(Xn |A) ≥ E(X|A) a.s. n  Combining the results, we obtain that lim sup E(Xn |A) = lim inf E(Xn |A) = lim E(Xn |A) = E(X|A) a.s. n  n  n  Exercise 40 (#1.86). Let X and Y be integrable random variables on the probability space (Ω, F, P ) and A ⊂ F be a σ-field. Show that E[Y E(X|A)] = E[XE(Y |A)], assuming that both integrals exist. Solution. (1) The problem is much easier if we assume that Y is bounded. When Y is bounded, both Y E(X|A) and XE(Y |A) are integrable. Using the result in Exercise 37 and the fact that E[E(X|A)] = EX, we obtain that E[Y E(X|A)] = E{E[Y E(X|A)|A]} = E[E(X|A)E(Y |A)] = E{E[XE(Y |A)|A]} = E[XE(Y |A)]. (2) Assume that Y ≥ 0. Let Z be another nonnegative integrable random variable. We now show that if σ(Z) ⊂ A, then E(Y Z) = E[ZE(Y |A)]. (Note that this is a special case of the result in Exercise 37 if E(Y Z) < ∞.) Let Yn = max{Y, n}, n = 1, 2, .... Then 0 ≤ Y1 ≤ Y2 ≤ · · · ≤ Y and limn Yn = Y . By the results in Exercises 35 and 39, 0 ≤ E(Y1 |A) ≤ E(Y2 |A) ≤ · · · a.s. and limn E(Yn |A) = E(Y |A) a.s. Since Yn is bounded, Yn Z is integrable. By the result in Exercise 37, E[ZE(Yn |A)] = E(Yn Z),  n = 1, 2, ....  By the monotone convergence theorem, E(Y Z) = lim E(Yn Z) = lim E[ZE(Yn |A)] = E[ZE(Y |A)]. n  n  Consequently, if X ≥ 0, then the result follows by taking Z = E(X|A). (3) We now consider general X and Y . Let f+ and f− denote the positive and negative parts of a function f . Note that E{[XE(Y |A)]+ } = E{X+ [E(Y |A)]+ } + E{X− [E(Y |A)]− } and E{[XE(Y |A)]− } = E{X+ [E(Y |A)]− } + E{X− [E(Y |A)]+ }.  Chapter 1. Probability Theory  31  Since E[XE(Y |A)] exists, without loss of generality we assume that E{[XE(Y |A)]+ } = E{X+ [E(Y |A)]+ } + E{X− [E(Y |A)]− } < ∞. Then, both E[X+ E(Y |A)] = E{X+ [E(Y |A)]+ } − E{X+ [E(Y |A)]− } and E[X− E(Y |A)] = E{X− [E(Y |A)]+ } − E{X− [E(Y |A)]− } are well defined and their difference is also well defined. Applying the result established in (2), we obtain that E[X+ E(Y |A)] = E{E(X+ |A)[E(Y |A)]+ } − E{E(X+ |A)[E(Y |A)]− } = E[E(X+ |A)E(Y |A)], where the last equality follows from the result in Exercise 8. Similarly, E[X− E(Y |A)] = E{E(X− |A)[E(Y |A)]+ } − E{E(X− |A)[E(Y |A)]− } = E[E(X− |A)E(Y |A)]. By Exercise 8 again, E[XE(Y |A)] = E[X+ E(Y |A)] − E[X− E(Y |A)] = E[E(X+ |A)E(Y |A)] − E[E(X− |A)E(Y |A)] = E[E(X|A)E(Y |A)]}. Switching X and Y , we also conclude that E[Y E(X|A)] = E[E(X|A)E(Y |A)]. Hence, E[XE(Y |A)] = E[Y E(X|A)]. Exercise 41 (#1.87). Let X, X1 , X2 , ... be a sequence of integrable random variables on the probability space (Ω, F, P ) and A ⊂ F be a σ-field. 
Suppose that limn E(Xn Y ) = E(XY ) for every integrable (or bounded) random variable Y . Show that limn E[E(Xn |A)Y ] = E[E(X|A)Y ] for every integrable (or bounded) random variable Y . Solution. Assume that Y is integrable. Then E(Y |A) is integrable. By the condition, E[Xn E(Y |A)] → E[XE(Y |A)]. By the result of the previous exercise, E[Xn E(Y |A)] = E[E(Xn |A)Y ] and E[XE(Y |A)] = E[E(X|A)Y ]. Hence, E[E(Xn |A)Y ] → E[E(X|A)Y ] for every integrable Y . The same result holds if "integrable" is changed to "bounded".  32  Chapter 1. Probability Theory  Exercise 42 (#1.88). Let X be a nonnegative integrable random variable on the probability space (Ω, F, P ) and A ⊂ F be a σ-field. Show that   ∞  E(X|A) =    P X > t|A dt.  0  Note. For any B ∈ F, P (B|A) is defined to be E(IB |A). Solution. From the theory of conditional distribution (e.g., Theorem 1.7 in Shao, 2003), there exists P̃ (B, ω) defined on F × Ω such that (i) for any ω ∈ Ω, P̃ (·, ω) is a probability measure on (Ω, F) and (ii) for any B ∈ F, P̃ (B, ω) = P (B|A) a.s. From Exercise 23,     ∞  XdP̃ (·, ω) =  P̃ ({X > t}, ω)dt 0 ∞  =  P (X > t|A)dt a.s. 0  Hence, the result follows if  E(X|A)(ω) =  XdP̃ (, ω) a.s.  This is certainly true if X = IB for a B ∈ F. By the linearity of the integration and conditional expectation, this equality also holds when X is a nonnegative simple function. For general nonnegative X, there exists a sequence of simple functions X1 , X2 , ..., such that 0 ≤ X1 ≤ X2 ≤ · · · ≤ X and limn Xn = X a.s. From Exercise 38, E(X|A) = lim E(Xn |A) n  = lim Xn dP̃ (·, ω) n  = XdP̃ (·, ω) a.s.  Exercise 43 (#1.97). Let X and Y be independent integrable random variables on a probability space and f be a nonnegative convex function. Show that E[f (X + Y )] ≥ E[f (X + EY )]. Note. We need to apply the following Jensen's inequality for conditional expectations. Let f be a convex function and X be an integrable random variable satisfying E|f (X)| < ∞. Then f (E(X|A)) ≤ E(f (X)|A) a.s. (e.g., Theorem 9.1.4 in Chung, 1974). Solution. If E[f (X + Y )] = ∞, then the inequality holds. Hence, we may assume that f (X + Y ) is integrable. Using Jensen's inequality and some  Chapter 1. Probability Theory  33  properties of conditional expectations, we obtain that E[f (X + Y )] = ≥ = =  E{E[f (X + Y )|X]} E{f (E(X + Y |X))} E{f (X + E(Y |X))} E[f (X + EY )],  where the last equality follows from E(Y |X) = EY since X and Y are independent. Exercise 44 (#1.83). Let X be an integrable random variable with a Lebesgue density f and let Y = g(X), where g is a function with positive derivative on (0, ∞) and g(x) = g(−x). Find an expression for E(X|Y ) and verify that it is indeed the conditional expectation. Solution. Let h be the inverse function of g on (0, ∞) and ψ(y) = h(y)  f (h(y)) − f (−h(y)) . f (h(y)) + f (−h(y))  We now show that E(X|Y ) = ψ(Y ) a.s. It is clear that ψ(y) is a Borel function. Also, the σ-field generated by Y is generated by the sets of the form Aa = {y : g(0) ≤ y ≤ a}, a > g(0). Hence, it suffices to show that for any a > g(0),   XdP = Aa  ψ(Y )dP. Aa  Note that    XdP =  Aa  xf (x)dx g(0)≤g(x)≤a    h(a)  =  xf (x)dx −h(a)      0  =  h(a)  xf (x)dx + −h(a)      0  =  xf (x)dx 0  xf (−x)dx + h(a)  x[f (x) − f (−x)]dx  = 0    xf (x)dx 0  h(a)    h(a)  a  =  h(y)[f (h(y)) − f (−h(y))]h (y)dy.  g(0)  On the other hand, h (y)[f (h(y)) + f (−h(y))]I(g(0),∞) (y) is the Lebesgue  34  Chapter 1. Probability Theory  density of Y (see the note in Exercise 17). 
Hence,   a ψ(Y )dP = ψ(y)h (y)[f (h(y)) + f (−h(y))]dy g(0) a  Aa   =  h(y)[f (h(y)) − f (−h(y))]h (y)dy  g(0)  by the definition of ψ(y). Exercise 45 (#1.91). Let X, Y , and Z be random variables on a probability space. Suppose that E|X| < ∞ and Y = h(Z) with a Borel h. Show that (i) E(XZ|Y ) = E(X)E(Z|Y ) a.s. if X and Z are independent and E|Z| < ∞; (ii) if E[f (X)|Z] = f (Y ) for all bounded continuous functions f on R, then X = Y a.s.; (iii) if E[f (X)|Z] ≥ f (Y ) for all bounded, continuous, nondecreasing functions f on R, then X ≥ Y a.s. Solution. (i) It suffices to show   XZdP = E(X) ZdP Y −1 (B)  Y −1 (B)  for any Borel set B. Since Y = h(Z), Y −1 (B) = Z −1 (h−1 (B)). Then    XZdP = XZIh−1 (B) (Z)dP = E(X) ZIh−1 (B) (Z)dP, Y −1 (B)  since X and Z are independent. On the other hand,    ZdP = ZdP = ZIh−1 (B) (Z)dP. Y −1 (B)  h−1 (B)  (ii) Let f (t) = et /(1+et ). Then both f and f 2 are bounded and continuous. Note that E[f (X) − f (Y )]2 = EE{[f (X) − f (Y )]2 |Z} = E{E[f 2 (X)|Z] + E[f 2 (Y )|Z] − 2E[f (X)f (Y )|Z]} = E{E[f 2 (X)|Z] + f 2 (Y ) − 2f (Y )E[f (X)|Z]} = E{f 2 (Y ) + f 2 (Y ) − 2f (Y )f (Y )} = 0, where the third equality follows from the result in Exercise 37 and the fourth equality follows from the condition. Hence f (X) = f (Y ) a.s. Since  Chapter 1. Probability Theory  f is (iii) and real  35  strictly increasing, X = Y a.s. For any real number c, there exists a sequence of bounded, continuous nondecreasing functions {fn } such that limn fn (t) = I(c,∞) (t) for any number t. Then, P (X > c, Y > c) = E{E(I{X>c} I{Y >c} |Z)} = E{I{Y >c} E(I{X>c} |Z)} = E{I{Y >c} E[lim fn (X)|Z]} n  = E{I{Y >c} lim E[fn (X)|Z]} n  ≥ E{I{Y >c} lim fn (Y )} n  = E{I{Y >c} I{Y >c} } = P (Y > c), where the fourth and fifth equalities follow from Exercise 39 (since fn is bounded) and the inequality follows from the condition. This implies that P (X ≤ c, Y > c) = P (Y > c) − P (X > c, Y > c) = 0. For any integer k and positive integer n, let ak,i = k + i/n, i = 1, ..., n. Then ∞  n−1  P (X ≤ ak,i , ak,i < Y ≤ ak,i+1 ) = 0.  P (X < Y ) = lim n  k=−∞ i=0  Hence, X ≥ Y a.s. Exercise 46 (#1.115). Let X1 , X2 , ... be a sequence of identically distributed random variables with E|X1 | < ∞ and let Yn = n−1 max1≤i≤n |Xi |. Show that limn E(Yn ) = 0 and limn Yn = 0 a.s. Solution. (i) Let gn (t) = n−1 P (max1≤i≤n |Xi | > t). Then limn gn (t) = 0 for any t and 1 0 ≤ gn ≤ n  n  P (|Xi | > t) = P (|X1 | > t). i=1  ∞ Since E|X1 | < ∞, 0 P (|X1 | > t)dt < ∞ (Exercise 23). By the dominated convergence theorem,  ∞  ∞ lim E(Yn ) = lim gn (t)dt = lim gn (t)dt = 0. n  n  0  0  n  (ii) Since E|X1 | < ∞, ∞  ∞  P (|Xn |/n > ) = n=1  P (|X1 | > n) < ∞, n=1  36  Chapter 1. Probability Theory  which implies that limn |Xn |/n = 0 a.s. (see, e.g., Theorem 1.8(v) in Shao, 2003). Let Ω0 = {ω : limn |Xn (ω)|/n = 0}. Then P (Ω0 ) = 1. Let ω ∈ Ω0 . For any  > 0, there exists an N ,ω such that |Xn (ω)| < n whenever n > N ,ω . Also, there exists an M ,ω > N ,ω such that max1≤i≤Nω |Xi (ω)| ≤ n whenever n > M ,ω . Then, whenever n > M ,ω , max1≤i≤n |Xi (ω)| n max1≤i≤Nω |Xi (ω)| maxNω <i≤n |Xi (ω)| + ≤ n n |Xi (ω)| ≤  + max Nω <i≤n i ≤ 2,  Yn (ω) =  i.e., limn Yn (ω) = 0. Hence, limn Yn = 0 a.s., since P (Ω0 ) = 1. Exercise 47 (#1.116). Let X, X1 , X2 , ... be random variables. 
Find an example for each of the following cases: (i) Xn →p X, but {Xn } does not converge to X a.s.; (ii) Xn →p X, but E|Xn − X|p does not converge for any p > 0; (iii) Xn →d X, but {Xn } does not converge to X in probability; (iv) Xn →p X, but g(Xn ) does not converge to g(X) in probability for some function g; (v) limn E|Xn | = 0, but |Xn | cannot be bounded by any integrable function. Solution: Consider the probability space ([0, 1], B[0,1] , P ), where B[0,1] is the Borel σ-field and P is the Lebesgue measure on [0, 1]. (i) Let X = 0. For any positive integer n, there exist integers m and k such that n = 2m − 2 + k and 0 ≤ k < 2m+1 . Define  1 k/2m ≤ ω ≤ k + 1/2m Xn (ω) = 0 otherwise for any ω ∈ [0, 1]. Note that P (|Xn − X| > ) ≤ P ({ω : k/2m ≤ ω ≤ (k + 1)/2m }) =  1 →0 2m  as n → ∞ for any  > 0. Thus Xn →p X. However, for any fixed ω ∈ [0, 1] and m, there exists k with 1 ≤ k ≤ 2m such that (k − 1)/2m ≤ ω ≤ k/2m . Let nm = 2m − 2 + k. Then Xnm (ω) = 1. Since m is arbitrarily selected, we can find an infinite sequence {nm } such that Xnm (ω) = 1. This implies Xn (ω) does not converge to X(ω) = 0. Since ω is arbitrary, Xn does not converge to X a.s. (ii) Let X = 0 and  0 1/n < ω ≤ 1 Xn (ω) = en 0 ≤ ω ≤ 1/n.  Chapter 1. Probability Theory  37  For any  ∈ (0, 1) P (|Xn − X| > ) = P (|Xn | = 0) =  1 →0 n  as n → ∞, i.e., Xn →p X. On the other hand, for any p > 0, E|Xn − X|p = E|Xn |p = enp /n → ∞. (iii) Define   X(ω) =  and   Xn (ω) =  1 0 ≤ ω ≤ 1/2 0 1/2 < ω ≤ 1 0 1  0 ≤ ω ≤ 1/2 1/2 < ω ≤ 1.  For any t, ⎧ ⎨ 1 P (X ≤ t) = P (Xn ≤ t) = 1/2 ⎩ 0  t≥1 0≤t<1 t < 0,  Therefore, Xn →d X. However, |Xn −X| = 1 and thus P (|Xn −X| > ) = 1 for any  ∈ (0, 1). (iv) let g(t) = 1 − I{0} (t), X = 0, and Xn = 1/n. Then, Xn →p X, but g(Xn ) = 1 and g(X) = 0. (v) Define  √ m n m−1 n <ω ≤ n Xn,m (ω) = m = 1, ..., n, n = 1, 2, .... 0 otherwise, Then,   E|Xn,m | =  m/n  (m−1)/n  √  1 ndx = √ → 0 n  as n → ∞. Hence, the sequence {Xn,m : m = 1, ..., n, n = 1, 2, ...} satisfies the requirement. If there is a function f such that |Xn,m | ≤ f , then f (ω) = ∞ for any ω ∈ [0, 1]. Hence, f cannot be integrable. Exercise 48. Let Xn be a random variable and mn be a median of Xn , n = 1, 2, .... Show that if Xn →d X for a random variable X, then any limit point of mn is a median of X. Solution. Without loss of generality, assume that limn mn = m. For  > 0 such that m +  and m −  are continuity points of the distribution of X, m −  < mn < m +  for sufficiently large n and 1 ≤ P (Xn ≤ mn ) ≤ P (Xn ≤ m + ) 2  38  Chapter 1. Probability Theory  and 1 ≤ P (Xn ≥ mn ) ≤ P (Xn ≥ m − ). 2 Letting n → ∞, we obtain that 12 ≤ P (X ≤ m + ) and 12 ≤ P (X ≥ m − ). Letting  → 0, we obtain that 12 ≤ P (X ≤ m) and 21 ≤ P (X ≥ m). Hence m is a median of X. Exercise 49 (#1.126). Show that if Xn →d X and X = c a.s. for a real number c, then Xn →p X. Solution. Note that the cumulative distribution function of X has only one discontinuity point c. For any  > 0, P (|Xn − X| > ) = P (|Xn − c| > ) ≤ P (Xn > c + ) + P (Xn ≥ c − ) → P (X > c + ) + P (X ≤ c − ) = 0 as n → ∞. Thus, Xn →p X. Exercise 50 (#1.117(b), #1.118). Let X1 , X2 , ... be random variables. Show that {|Xn |} is uniformly integrable if one of the following condition holds: (i) supn E|Xn |1+δ < ∞ for a δ > 0; (ii) P (|Xn | ≥ c) ≤ P (|X| ≥ c) for all n and c > 0, where X is an integrable random variable. Note. A sequence of random variables {Xn } is uniformly integrable if limt→∞ supn E(|Xn |I{|Xn |>t} ) = 0. Solution. (i) Denote p = 1 + δ and q = 1 + δ −1 . 
Then E(|Xn |I{|Xn |>t} ) ≤ (E|Xn |p )1/p [E(I{|Xn |>t} )q ]1/q = (E|Xn |p )1/p [P (|Xn | > t)]1/q ≤ (E|Xn |p )1/p (E|Xn |p )1/q t−p/q = E|Xn |1+δ t−δ , where the first inequality follows from Hölder's inequality (e.g., Shao, 2003, p. 29) and the second inequality follows from P (|Xn | > t) ≤ t−p E|Xn |p . Hence lim sup E(|Xn |I{|Xn |>t} ) ≤ sup E|Xn |1+δ lim t−δ = 0.  t→∞ n  n  t→∞  Chapter 1. Probability Theory  39  (ii) By Exercise 23,  sup E(|Xn |I{|Xn |>t} ) = sup n  n  ∞  0 ∞  P (|Xn |I{|Xn |>t} > s)ds  P (|Xn | > s, |Xn | > t)ds   ∞ = sup tP (|Xn | > t) + P (|Xn | > s)ds n t  ∞ ≤ tP (|X| > t) + P (|X| > s)ds = sup n  0  t  →0 as t → ∞ when E|X| < ∞.  Exercise 51. Let {Xn } and {Yn } be sequences of random variables such that Xn diverges to ∞ in probability and Yn is bounded in probability. Show that Xn + Yn diverges to ∞ in probability. Solution. By the definition of bounded in probability, for any  > 0, there is C > 0 such that supn P (|Yn | > C ) < /2. By the definition of divergence to ∞ in probability, for any M > 0 and  > 0, there is n > 0 such that P (|Xn | ≤ M + C ) < /2 whenever n > n . Then, for n > n , P (|Xn + Yn | ≤ M ) ≤ P (|Xn | ≤ M + |Yn |) = P (|Xn | ≤ M + |Yn |, |Yn | ≤ C ) + P (|Xn | ≤ M + |Yn |, |Yn | > C ) ≤ P (|Xn | ≤ M + C ) + P (|Yn | > C ) ≤ /2 + /2 = . This means that Xn + Yn diverges to ∞ in probability. Exercise 52. Let X, X1 , X2 , ... be random variables. Show that if limn Xn = X a.s., then supm≥n |Xm | is bounded in probability. Solution. Since supm≥n |Xm | ≤ supm≥1 |Xm | for any n, it suffices to show that for any  > 0, there is a C > 0 such that P (supn≥1 |Xn | > C) ≤ . Note that limn Xn = X implies that, for any  > 0 and any fixed c1 > 0, there exists a sufficiently large N such that P (∪∞ n=N {|Xn − X| > c1 }) < /3 (e.g., Lemma 1.4 in Shao, 2003). For this fixed N , there exist constants c2 > 0 and c3 > 0 such that N  P (|Xn | > c2 ) < n=1   3  and P (|X| > c3 ) <   . 3  40  Chapter 1. Probability Theory  Let C = max{c1 , c2 } + c3 . Then the result follows from ∞    P sup |Xn | > C = P {|Xn | > C} n≥1  n=1    N  ≤  P (|Xn | > C) + P n=1  ∞    {|Xn | > C}  n=N  ∞      ≤ + P (|X| > c3 ) + P {|Xn | > C, |X| ≤ c3 } 3 n=N  ∞     ≤ + +P {|Xn − X| > c1 } 3 3 n=N    ≤ + + = . 3 3 3  Exercise 53 (#1.128). Let {Xn } and {Yn } be two sequences of random variables such that Xn is bounded in probability and, for any real number t and  > 0, limn [P (Xn ≤ t, Yn ≥ t + ) + P (Xn ≥ t + , Yn ≤ t)] = 0. Show that Xn − Yn →p 0. Solution. For any  > 0, there exists an M > 0 such that P (|Xn | ≥ M ) ≤  for any n, since Xn is bounded in probability. For this fixed M , there exists an N such that 2M/N < /2. Let ti = −M +2M i/N , i = 0, 1, ..., N . Then, P (|Xn − Yn | ≥ ) ≤ P (|Xn | ≥ M ) + P (|Xn | < M, |Xn − Yn | ≥ ) N  ≤ +  P (ti−1 ≤ Xn ≤ ti , |Xn − Yn | ≥ ) i=1 N  ≤ +  P (Yn ≤ ti−1 − /2, ti−1 ≤ Xn ) i=1 N  P (Yn ≥ ti + /2, Xn ≤ ti ).  + i=1  This, together with the given condition, implies that lim sup P (|Xn − Yn | ≥ ) ≤ . n  Since  is arbitrary, we conclude that Xn − Yn →p 0. Exercise 54 (#1.133). Let Fn , n = 0, 1, 2, ..., be cumulative distribution functions such that Fn → F0 for every continuity point of F0 . Let U be a random variable having the uniform distribution on the interval [0, 1] and let  Chapter 1. Probability Theory  41  Gn (U ) = sup{x : Fn (x) ≤ U }, n = 0, 1, 2, .... Show that Gn (U ) →p G0 (U ). Solution. For any n and real number t, Gn (U ) ≤ t if and only if Fn (t) ≥ U a.s. Similarly, Gn (U ) ≥ t if and only if Fn (t) ≤ U a.s. 
Hence, for any n, t and  > 0, P (Gn (U ) ≤ t, G0 (U ) ≥ t + ) = P (Fn (t) ≥ U, F0 (t + ) ≤ U ) = max{0, Fn (t) − F0 (t + )} and P (Gn (U ) ≥ t + , G0 (U ) ≤ t) = P (Fn (t + ) ≤ U, F0 (t) ≤ U ) = max{0, F0 (t) − Fn (t + )}. If both t and t +  are continuity points of F0 , then limn [Fn (t) − F0 (t + )] = F0 (t) − F0 (t + ) ≤ 0 and limn [F0 (t) − Fn (t + )] = F0 (t) − F0 (t + ) ≤ 0. Hence, lim [P (Gn (U ) ≤ t, G0 (U ) ≥ t + ) + P (Gn (U ) ≥ t + , G0 (U ) ≤ t)] = 0 n  when both t and t +  are continuity points of F0 . Since the set of discontinuity points of F0 is countable, Gn (U )−G0 (U ) →p 0 follows from the result in the previous exercise, since G0 (U ) is obviously bounded in probability. Exercise 55. Let {Xn } be a sequence of independent and identically distributed random variables. Show  th


Source: https://3lib.net/book/448545/9f46e3

Posted by: greentheopect.blogspot.com
