Licensed under a Creative Commons BY-NC-SA 4.0 License.

1 The exponential random variable

1.1 Review of descriptive statistics

We next review the basic tools that can be used to describe a random variable, for the case of the exponential random variable. We generate 1000 samples from an \(\mathrm{Exp}(5)\) (exponential with mean 5). Note that rexp takes as input parameter the “rate”, i.e., the \(\lambda\), not the mean:

x <- rexp(1000, rate=1/5) # rate = lambda = 1/mean
plot(x)

summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0014  1.6718  3.7766  5.3897  7.2926 42.3117
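For an exponential random variable the mean and the standard deviation coincide (both equal \(1/\lambda\)), so a quick sanity check on the sample is to compare both statistics (a sketch; the seed is arbitrary):

```r
set.seed(0) # for reproducibility
x <- rexp(1000, rate = 1/5)
# For Exp(lambda), mean = sd = 1/lambda = 5
mean(x)
sd(x)
```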

Estimation of the density function \(f(x)\):

plot(density(x), ylim=c(0,0.5))

# Theoretical f(x)
curve(dexp(x, 1/5), add=TRUE, lty=2, col="red")

Cumulative Distribution Function (CDF):

# Fx is a function representing the empirical CDF of x
Fx <- ecdf(x)
# Returns the empirical CDF at zero (should be 0 because we have samples from an exponential R.V.)
Fx(0)
## [1] 0
plot(x, Fx(x))

# Theoretical CDF
curve(pexp(x, 1/5), add=TRUE, lty=2, col="red")

For the case of the exponential random variable, the CCDF (Complementary CDF, also known as the Survival Function) is particularly convenient. This is because \(F^C(x)\) is given by

\[F^C(x) = S(x) = 1 - F(x) = 1 - \left(1-e^{-\lambda x}\right) = e^{-\lambda x}\]

and therefore, on a logarithmic scale, it becomes a linear function whose slope is the negative of the exponential rate:

\[\log F^C(x) = -\lambda x\]

The survival function or CCDF:

plot(x, 1-Fx(x))

Or using a logarithmic scale:

plot(x, log(1-Fx(x)))

Now, we can estimate the rate \(\lambda\) (the negative of the slope) using a linear regression.

logCCDF <- log(1 - Fx(x))
# Given that the last value of the CDF is 1, the last value of the CCDF
# will be 0, which after the logarithm becomes -Inf.
# Therefore we remove all these values
x <- x[is.finite(logCCDF)]
logCCDF <- logCCDF[is.finite(logCCDF)]

# Linear regression to estimate the slope and summary of its values
fit <- lm(logCCDF ~ x)
summary(fit)
## 
## Call:
## lm(formula = logCCDF ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.49409 -0.01349  0.00589  0.01265  0.60599 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  0.0073921  0.0016227    4.555 5.87e-06 ***
## x           -0.1875699  0.0002165 -866.486  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0359 on 997 degrees of freedom
## Multiple R-squared:  0.9987, Adjusted R-squared:  0.9987 
## F-statistic: 7.508e+05 on 1 and 997 DF,  p-value: < 2.2e-16
# Plot of the resulting line
plot(x, logCCDF)
abline(fit$coefficients, lty=2, col="red")

The use of summary provides many numerical results. Among them, the most important are the parameters of the regression line (\(a + bx\)): try to identify them. Each parameter also comes with statistical information regarding the goodness of the fit, i.e., the significance of the parameter, which is measured with the p-value (the smaller the p-value, the more significant the parameter).
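The fitted parameters can also be extracted programmatically. A minimal sketch (regenerating the data, so the numbers will differ slightly from the output above):

```r
set.seed(1)
x <- rexp(1000, rate = 1/5)
Fx <- ecdf(x)
logCCDF <- log(1 - Fx(x))
keep <- is.finite(logCCDF)  # drop the -Inf at the sample maximum
fit <- lm(logCCDF[keep] ~ x[keep])
coef(fit)        # intercept (a) and slope (b)
-coef(fit)[[2]]  # estimate of lambda, should be close to 1/5
```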

1.2 Memorylessness property

A random variable \(X\) has no memory if the following holds:

\[\Pr(X > t+s \mid X>s) = \Pr(X > t)\]
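For the exponential, this identity can be checked numerically with pexp (an exact check, up to floating point; the values of \(s\) and \(t\) are arbitrary):

```r
lambda <- 1/5; s <- 2; t <- 3
# Pr(X > t+s | X > s) = Pr(X > t+s) / Pr(X > s)
lhs <- (1 - pexp(t + s, lambda)) / (1 - pexp(s, lambda))
rhs <- 1 - pexp(t, lambda)
lhs; rhs  # both equal exp(-lambda*t)
```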

It is relatively easy to confirm that the uniform random variable does have memory. To start with, we generate 1000 samples from a \(\mathrm{U}(0,10)\) and depict its survival function (i.e., the complementary cumulative distribution function):

x <- runif(1000, 0, 10)
Fu <- ecdf(x)
plot(sort(x), 1-Fu(sort(x)))

If we now take the subset of samples larger than \(s=2\) and subtract 2 from them, we can confirm that the resulting survival function is different:

x2 <- x[x>2]-2
Fu2 <- ecdf(x2)
plot(sort(x), 1-Fu(sort(x)))
points(sort(x2), 1-Fu2(sort(x2)), col="red")

The same procedure can be repeated for the case of an exponential random variable, resulting in very similar survival functions:

x <- rexp(1000, 1/5)
x2 <- x[x>5] - 5
Fu <- ecdf(x)
plot(sort(x), 1-Fu(sort(x)))
Fu2 <- ecdf(x2)
points(sort(x2), 1-Fu2(sort(x2)), col="red")

This can be further confirmed using a qqplot:

qqplot(x, x2)
abline(0, 1, lty=2, col="red")
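As an informal quantitative check, a two-sample Kolmogorov-Smirnov test should report a small statistic (note the two samples are not independent, since x2 is derived from x, so treat the result loosely):

```r
set.seed(123)
x <- rexp(1000, 1/5)
x2 <- x[x > 5] - 5
ks <- ks.test(x, x2)
ks$statistic  # small: both samples look like Exp(1/5)
```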

1.3 Minimum of two exponentials

We generate two sets of samples, \(X \sim \mathrm{Exp}(5)\), \(Y \sim \mathrm{Exp}(10)\). To generate the new random variable \(Z = \min\{X, Y\}\), we make use of the function pmin:

x <- rexp(1000, 1/5)
y <- rexp(1000, 1/10)
z <- pmin(x, y)

We can confirm that not only \(X\) and \(Y\) but also \(Z\) has a density function similar to that of an exponential random variable.

plot(density(x), ylim=c(0,0.3))
lines(density(y), col='blue')
lines(density(z), col='red')

and the average of \(Z\) is very close to the expected value:

mean(z)
## [1] 3.252753
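The expected mean follows from the minimum property of exponentials: \(E[Z] = 1/(\lambda_X + \lambda_Y) = 1/(1/5 + 1/10) = 10/3 \approx 3.33\). A quick sketch with a larger sample:

```r
lambda_z <- 1/5 + 1/10
1/lambda_z  # theoretical mean, 3.333...
set.seed(2)
z <- pmin(rexp(1e5, 1/5), rexp(1e5, 1/10))
mean(z)  # close to 10/3
```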

We can also repeat the procedure above, i.e., compute the logarithm of the survival function, which should follow a line, and then estimate \(Z\)’s \(\lambda\) parameter via a linear regression:

Fx <- ecdf(z)
logCCDF <- log(1 - Fx(z))
z <- z[is.finite(logCCDF)]
logCCDF <- logCCDF[is.finite(logCCDF)]

fit <- lm(logCCDF ~ z)
plot(z, logCCDF)
abline(fit$coefficients, lty=2, col="red")

whose slope approximately matches the theoretical value: the new rate is the sum of the rates, \(\lambda_z =\sum \lambda_i = \frac{1}{5}+\frac{1}{10}=0.3\).

-fit$coefficients["z"]
##         z 
## 0.3150273

2 Poisson processes

2.1 Generating Poisson arrival process

We next verify that both the first and the second definitions of a Poisson process can be used to generate it. To this aim, we will first generate an arrival process following the second definition and confirm, via the first definition, that it is a Poisson process; we will then proceed in the other direction: generate a process following the first definition, and confirm that it is a Poisson process via the second definition.

2.1.1 From 2nd def. to 1st def.

According to the exponential definition (the second definition), a Poisson arrival process is composed of inter-arrival times that follow an exponential random variable with average \(1/\lambda\). We will verify that, given exponentially distributed inter-arrivals at rate \(\lambda\), the number of arrivals in an interval of length \(t\) follows a Poisson distribution with parameter \(\lambda t\).

Below, we generate 1000 exponentially distributed samples (i.e., inter-arrival times). The cumulative sum of that sample corresponds to the absolute arrival times:

inter <- rexp(1000, 2)
llegadas <- cumsum(inter)

This step completes the generation of the process. It can be verified that, since 1000 inter-arrival times of average 1/2 have been generated, the last arrival (the last position of the vector) occurs near instant 500.

llegadas[length(llegadas)] # another option: max(llegadas)
## [1] 524.2192

Next, we check by means of the first definition that this vector contains the arrival times of a Poisson arrival process. According to the first definition, the number of events \(n\) that occur in an interval of length \(t\) should follow a discrete Poisson variable of mean \(\lambda t\).

To obtain a sequence with different realizations of the number of events \(n\) in an interval of a given size, three steps are performed:

  1. We divide the total time into intervals of a given length (e.g., 1) with the command seq
  2. Each arrival is assigned to one of these intervals with the command cut
  3. The number of “assignments” per interval is counted with tabulate

intervalos <- seq(0, max(llegadas), 1) # Intervals of length 1
asignacion <- cut(llegadas, intervalos)
n_llegadas <- tabulate(asignacion)

The resulting histogram can be compared with the expected theoretical value using the following commands

hist(n_llegadas,probability=TRUE, breaks=seq(-0.5, 10.5, 1))
lines(0:10, dpois(0:10, 2), type = "h", lwd = 2, col="red")

You could also compare the resulting vector with one generated by the rpois function using a qqplot (although the use of qqplot with discrete variables is not that common):

qqplot(n_llegadas, rpois(1000, 2 * 1))
abline(0, 1, lty=2, col="red")

Repeat the above (generation of arrival times and comparison with a Poisson process) for different values of \(\lambda\) and \(t\).

2.1.2 From 1st def. to the 2nd def.

According to the first definition, given a time interval of length \(t\) and an arrival rate \(\lambda\), a Poisson variable of mean \(\lambda t\) can be used to obtain a random number of arrivals in that interval.

N <- 1000 # Intervals
lambda <- 2
t <- 1
n <- rpois(N, lambda * t)

To obtain the specific instants at which these arrivals occur, they are placed uniformly at random within each interval (following the conditional distribution of the instant of an arrival), and sorted from lowest to highest before being appended to the overall arrivals vector:

llegadas <- c()
for (i in seq(1,N)) { 
  llegadas_intervalo <- runif(n[i], (i-1)*t, i*t)
  llegadas <- c(llegadas, sort(llegadas_intervalo))
}

Next, we check by means of the second definition that the vector contains arrival times of a Poisson process. To this aim, the times between arrivals are computed and it is verified that they follow an exponential:

inter <- diff(llegadas)
mean(inter)
## [1] 0.486135
plot(density(inter), ylim=c(0,lambda))
curve(dexp(x, lambda), add=TRUE, lty=2, col="red")

Repeat the above for different values of \(\lambda\) and \(t\).

2.2 Aggregating and sampling Poisson processes

As seen in the previous section, when working with Poisson processes it is usually more convenient to operate with the times between arrivals (that is, exponential random variables). Next, the aggregation and decomposition properties of Poisson processes, and the Palm-Khintchine theorem, are illustrated.

2.2.1 Aggregation

Two Poisson processes are generated, each one with a different rate

N <- 2000
lambda1 <- 1
inter1 <- rexp(N, lambda1)
times1 <- cumsum(inter1)

lambda2 <- 3
inter2 <- rexp(N, lambda2)
times2 <- cumsum(inter2)

Since the same number of samples has been generated for both processes, one vector spans a longer time than the other, so both are truncated to the shorter duration:

max(times1); max(times2)
## [1] 2005.472
## [1] 668.2591
tmax <- min(max(times1), max(times2))

times1 <- times1[times1 < tmax]
times2 <- times2[times2 < tmax]

To add two processes, just join the vectors and order them:

llegadas <- sort(c(times1, times2))

It can be verified that the times between arrivals follow an exponential with rate equal to the sum of the rates

inter <- diff(llegadas)
mean(inter)
## [1] 0.2498687
plot(density(inter), ylim=c(0,lambda1+lambda2))
curve(dexp(x, lambda1+lambda2), add=TRUE, lty=2, col="red")

And for extra rigor, the resulting distribution can be compared with the theoretical one, using a qqplot:

qqplot(inter, rexp(length(inter), lambda1 + lambda2))
abline(0, 1, lty=2, col="red")

2.2.2 Sampling

Let us start with a Poisson process at rate 4:

N <- 2000
lambda <- 4
inter <- rexp(N, lambda)
llegadas <- cumsum(inter)

One approach to sample e.g. 1/4 of the arrivals at random is to first generate a vector of logical indices, where each value is TRUE with probability 1/4, and then use that vector to select the samples:

prob <- 1/4
indices <- runif(length(llegadas)) < prob
llegadas_subset <- llegadas[indices]

It can be verified that the time between arrivals follows an exponential with the expected average, for example with a qqplot:

interarrivals <- diff(llegadas_subset)
qqplot(interarrivals, rexp(length(interarrivals), lambda*prob))
abline(0, 1, lty=2, col="red")

2.2.3 Palm-Khintchine theorem

The time between arrivals in a process does not, in general, have to be exponential. Take, for example, a voice over IP (VoIP) encoder that generates a frame every 10 ms, but with some randomness, such that the time between frames follows a uniform random variable between 8 ms and 12 ms.

interu <- runif(1000, 8, 12)
timesu <- cumsum(interu)

Obviously, the density function of the time between arrivals is far from exponential:

plot(density(interu))
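This can be quantified with a Kolmogorov-Smirnov test against an exponential with the same mean (i.e., rate 1/10); the test rejects decisively:

```r
set.seed(3)
interu <- runif(1000, 8, 12)
p <- ks.test(interu, "pexp", rate = 1/10)$p.value
p  # essentially zero
```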

And the number of arrivals in time windows of a fixed length (for example, 15) does not follow a Poisson variable:

intervalos <- seq(0, max(timesu), 15) # Intervals of length 15
asignacion <- cut(timesu, intervalos)
n_llegadas <- tabulate(asignacion)
hist(n_llegadas,probability=TRUE, breaks=seq(-0.5, 4.5, 1))

Try changing the length of the interval and see the impact on the resulting histogram

However, the aggregation of a sufficient number of arrival processes tends to behave like a Poisson process, so the time between arrivals will resemble an exponential. Let us consider the aggregation of four processes like the previous one:

llegadasu1 <- cumsum(runif(1000, 8, 12))
llegadasu2 <- cumsum(runif(1000, 8, 12))
llegadasu3 <- cumsum(runif(1000, 8, 12))
llegadasu4 <- cumsum(runif(1000, 8, 12))

llegadas <- sort(c(llegadasu1, llegadasu2, llegadasu3, llegadasu4))

In this case, the time between arrivals begins to resemble that of an exponential random variable, with the density biased towards smaller values:

interarrivals <- diff(llegadas)
plot(density(interarrivals), ylim=c(0,0.4))
curve(dexp(x, 0.4), add=TRUE, lty=2, col="red")

However, the qqplot shows that four processes are too few to achieve a close similarity with an exponential random variable:

qqplot(interarrivals, rexp(length(interarrivals), 0.4))
abline(0, 1, lty=2, col="red")

You can try adding more processes and checking the resemblance to the corresponding exponential:

Ntotal <- 10
RateUno <- 1/((8+12)/2) # Arrival rate of one process
llegadas_total <- c()
for (i in seq(1,Ntotal)) { 
  llegadas <- cumsum(runif(1000, 8, 12))
  llegadas_total <- sort(c(llegadas_total, llegadas))
}


interarrivals <- diff(llegadas_total)
qqplot(interarrivals, rexp(length(interarrivals), Ntotal*RateUno))
abline(0, 1, lty=2, col="red")

Try changing the number of aggregated processes and/or the random variable of the time between arrivals (e.g., the limits of the uniform v.a.), to see the impact on the distribution of the times between arrivals

3 Solving problems with R

3.1 Problem 2.7

Assume that the network delay can be modeled with an exponential random variable with an average of 10 ms. Compute the average delay of the frames with a delay bigger than 50 ms.

We just need to generate the random variable and compute the mean of the subset of values greater than 50 ms:

N <- 1000
lambda <- 1/10
x <- rexp(N, lambda)
x <- x[x>50]
mean(x)
## [1] 61.69479
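The result can be checked analytically: by memorylessness, \(E[X \mid X > 50] = 50 + E[X] = 60\) ms. A sketch with a larger sample, so that enough values exceed the threshold:

```r
mean_delay <- 10
threshold <- 50
set.seed(4)
x <- rexp(1e6, 1/mean_delay)
mean(x[x > threshold])  # simulated conditional mean
threshold + mean_delay  # theoretical value: 60 ms
```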

3.2 Problem 2.9

The lifetime of a hard disk (HD) can be modeled with an exponential random variable of mean \(1/\lambda\). You want to store 2 TB of information. There are two configurations for the hard drives:

* In the RAID 0 (Data Striping) configuration there is no redundancy and the total capacity is the sum of the capacities: if any disk fails, the information is lost.
* In the RAID 1 (Mirroring) configuration the same information is copied to all disks and the total capacity matches that of one disk: only when no disk works is the information lost.

You need to compute the probability that the 2 TB of information is available after one year of use for the following options:

  1. 1 HD of 2 TB, average lifetime per disk \(1/\lambda=\) 2 years

For this case, it is enough to estimate the fraction of times that an exponential of average 2 years exceeds one year:

required_lifetime <- 1
lambda <- 1/2
N <- 10000
hd_1 <- rexp(N, lambda)
survived <- hd_1[hd_1 > required_lifetime]

length(survived)/N
## [1] 0.6081
  2. 2 HDs of 1 TB each, RAID 0, average lifetime per disk \(1/\lambda=\) 1 year

In this case we need both hard drives to survive at least one year:

N <- 10000
required_lifetime <- 1
lambda <- 1
hd_1 <- rexp(N, lambda)
hd_2 <- rexp(N, lambda)

sobreviven <- (hd_1 > required_lifetime) & (hd_2 > required_lifetime)
survived <- sum(sobreviven, na.rm = TRUE)
survived/N
## [1] 0.1312
  3. 2 HDs of 2 TB each, RAID 1, average lifetime per disk \(1/\lambda=\) 1 year

N <- 10000
required_lifetime <- 1
lambda <- 1
hd_1 <- rexp(N, lambda)
hd_2 <- rexp(N, lambda)

sobreviven <- (hd_1 > required_lifetime) | (hd_2 > required_lifetime)
survived <- sum(sobreviven, na.rm = TRUE)
survived/N
## [1] 0.6001
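The three simulated probabilities can be compared with the analytical values, since the survival function of an exponential is \(e^{-\lambda t}\):

```r
# 1) Single disk, mean lifetime 2 years: P(X > 1) = exp(-1/2)
exp(-1/2)            # ~0.607
# 2) RAID 0: both 1-year-mean disks must survive: exp(-1)^2
exp(-1)^2            # ~0.135
# 3) RAID 1: at least one disk survives: 1 - (1 - exp(-1))^2
1 - (1 - exp(-1))^2  # ~0.600
```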

3.3 Problem 3.8

A given application generates frames following a Poisson process with rate \(\lambda=2\) frames/s. To maintain synchronization, the wireless interface generates a frame if it does not detect any application traffic during a timeout of \(Timeout=1\) s. Compute the resulting frame transmission rate, and discuss whether or not the process is Poissonian.

One way to simulate the frame generation process is to start from samples of an exponential random variable of mean 1/2. These samples will be the times between arrivals if they are less than \(Timeout=1\); otherwise, one or more keepalive frames must be added. First, the application arrival process is generated:

N <- 2000
lambda <- 2
inter <- rexp(N, lambda)
llegadas <- cumsum(inter)

Next, we go through this process and, whenever there is more than \(Timeout\) between one arrival and the next, we insert as many frames as required:

Timeout <- 1
salidas_interfaz <- c()
t_now <- 0
for (t_llegada in llegadas) {
  if (t_llegada - t_now < Timeout) { # Arrives before the timeout
    salidas_interfaz <- c(salidas_interfaz, t_llegada)
  } else { # Arrives late; see how many keepalives are needed
    n_thresholds <- floor((t_llegada-t_now)/Timeout)
    t_generados <- t_now + seq(1,n_thresholds)*Timeout
    salidas_interfaz <- c(salidas_interfaz, t_generados, t_llegada)
  }
  t_now <- t_llegada
}

It can be verified that the mean time between frames is smaller than in the original application process, and hence the frame generation rate is higher:

iarrival_salida <- diff(salidas_interfaz)
mean(iarrival_salida)
## [1] 0.4235571
1/mean(iarrival_salida)
## [1] 2.360957
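A hedged analytical check (under the assumption, matching the simulation above, that each keepalive also resets the timer): each application inter-arrival \(X \sim \mathrm{Exp}(\lambda)\) produces one application frame plus \(\lfloor X/Timeout \rfloor\) keepalives, and \(E[\lfloor X/T \rfloor] = e^{-\lambda T}/(1-e^{-\lambda T})\), since the count is geometric. The resulting rate is then \(\lambda\,(1 + e^{-\lambda T}/(1-e^{-\lambda T}))\):

```r
lambda <- 2
Timeout <- 1
q <- exp(-lambda * Timeout)
extra <- q / (1 - q)             # expected keepalives per inter-arrival
rate_out <- lambda * (1 + extra)
rate_out  # ~2.31 frames/s, roughly matching the simulated value
```

The small gap with the simulated 2.36 frames/s is consistent with the sampling noise of 2000 inter-arrivals.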

Finally, we can confirm that the time between arrivals is far from following an exponential random variable, since there is an accumulation of values at \(Timeout\):

plot(density(iarrival_salida), ylim=c(0, 1/mean(iarrival_salida)))
curve(dexp(x, 1/mean(iarrival_salida)), add=TRUE, lty=2, col="red")

Using a qqplot the difference is also evident:

qqplot(iarrival_salida, rexp(length(iarrival_salida), 1/mean(iarrival_salida)))
abline(0, 1, lty=2, col="red")