Licensed under a Creative Commons BY-NC-SA 4.0 License.

1 The exponential random variable

1.1 Review of descriptive statistics

We next review the basic tools that can be used to describe a random variable for the case of the exponential random variable. We generate 1000 samples from an \(\mathrm{Exp}(5)\) (exponential with mean = 5). Note that rexp takes as input parameter the “rate”, i.e., the “lambda”:

x <- rexp(1000, rate=1/5) # rate = lambda = 1/mean
plot(x)

summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0014  1.6718  3.7766  5.3897  7.2926 42.3117

Estimation of the density function \(f(x)\):

plot(density(x))

# Theoretical f(x)
curve(dexp(x, 1/5), add=TRUE, lty=2, col="red")

Cummulative Distribution Function (CDF):

# Fx is a function representing the empirical CDF of x
Fx <- ecdf(x)
# Returns the empirical CDF at zero (should be 0 because we have samples from an exponential R.V.)
Fx(0)
## [1] 0
plot(x, Fx(x))

# Theoretical CDF
curve(pexp(x, 1/5), add=TRUE, lty=2, col="red")

For the case of the exponential random variable, the CCDF (Complementary CDF, also known as Survival Function) results very convenient. This is because \(F^C(x)\) is given by

\[F^C(x) = S(x) = 1 - F(x) = 1 - \left(1-e^{-\lambda x}\right) = e^{-\lambda x}\]

and therefore, using a logarithmic scale, it results in a linear function proportional to the exponential rate:

\[\log F^C(x) = -\lambda x\]

The survival function or CCDF:

plot(x, 1-Fx(x))

Or using a logarithmic scale:

plot(x, log(1-Fx(x)))

Now, we can estimate the slope (\(\lambda\)) using a linear regression. (What is a linear regression? Some bits of theory and an instructional video).

logCCDF <- log(1 - Fx(x))
# Given that the last value of the CDF is 1, the last value of the CCDF
# will be 0, that after the logarithm will result in a -\infinity.
# Therefore we remove all these values
x <- x[is.finite(logCCDF)]
logCCDF <- logCCDF[is.finite(logCCDF)]

# Linear regression to estimate the slope and summary of its values
fit <- lm(logCCDF ~ x)
summary(fit)
## 
## Call:
## lm(formula = logCCDF ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.49409 -0.01349  0.00589  0.01265  0.60599 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  0.0073921  0.0016227    4.555 5.87e-06 ***
## x           -0.1875699  0.0002165 -866.486  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0359 on 997 degrees of freedom
## Multiple R-squared:  0.9987, Adjusted R-squared:  0.9987 
## F-statistic: 7.508e+05 on 1 and 997 DF,  p-value: < 2.2e-16
# Plot of the resulting line
plot(x, logCCDF)
abline(fit$coefficients, lty=2, col="red")

The use of summary provides a lot of numerical results. Out of these, some of the most important are the parameters of the regression line (a+bx) –try to identify them. Also, some of the numberical parametes are provided with statistical information, regarding the goodness of the fit, i.e., the significance of the parameter, which is measured with the p-value (the smaller the p-value, the better the fit).

1.2 Memorylessness property

To start with, we generate 1000 samples from a \(\mathrm{U}(0,10)\). Next, we take those samples larger than 2, and substract 2 from them:

x <- runif(1000, 0, 10)
x2 <- x[x>2] - 2

The new random variable x2 differs from the original as expected, because the uniform distribution has memory:

qqplot(x, x2)
abline(0, 1, lty=2, col="red")

However, it can be demonstrated that the exponential distribution has no memory:

x <- rexp(1000, 1/5)
x2 <- x[x>2] - 2
qqplot(x, x2)
abline(0, 1, lty=2, col="red")

1.3 Minimum of two exponentials

We generate two set of samples, \(X \sim \mathrm{Exp}(5), Y \sim \mathrm{Exp}(10)\). To generate the new random variable \(Z \sim \min\{X, Y\}\), we make use of the function pmin:

x <- rexp(1000, 1/5)
y <- rexp(1000, 1/10)
z <- pmin(x, y)

If the preious variables conforms a unique component (for example a router), to estimate \(Z\)’s \(\lambda\) parameter, we fit a linear regression over the survival function:

Fx <- ecdf(z)
logCCDF <- log(1 - Fx(z))
z <- z[is.finite(logCCDF)]
logCCDF <- logCCDF[is.finite(logCCDF)]

fit <- lm(logCCDF ~ z)
plot(z, logCCDF)
abline(fit$coefficients, lty=2, col="red")

whose slope approximately matches the theoretical value: the new rate is the sum of the rates, \(\lambda_z =\sum \lambda_i = \frac{1}{5}+\frac{1}{10}=0.3\).

-fit$coefficients["z"]
##         z 
## 0.3150273

2 Poisson processes

2.1 Generation

According to the exponential definition (second definition), a Poisson arrival process is composed of inter-arrival times that follow an exponentially distributed random variable with average \(1/\lambda\). That is, given exponentially distributed inter-arrivals at a rate \(\lambda\), the number of arrivals in an interval of length \(t\) follows a Poisson distribution with parameter \(\lambda t\).

Below, we generate 1000 exponentially distributed samples (i.e., inter-arrival times). The cumulative sum of that sample corresponds to the absolute arrival times:

inter <- rexp(1000, 2)
times <- cumsum(inter)

We next define a function that counts the number of arrivals over periods of time of length \(t\). Then, we compare the counting process with a \(\mathrm{Pois}(2)\) distribution:

# Split the total time in intervals of length "t" and count the arrivals
count_arrivals <- function(times, t=1) {
  # number of intervals of length "t" (vector)
  breaks <- seq(0, max(times), t)
  # cut times vector into intervals and codes the values
  # in times according to which interval they fall
  # Tabulate counts the number of times each interval has received an arrival
  tabulate(cut(times, breaks))
}

qqplot(count_arrivals(times, t=1), rpois(1000, 2 * 1))
abline(0, 1, lty=2, col="red")

If we double the \(t\), then the result is a \(\mathrm{Pois}(4)\):

qqplot(count_arrivals(times, t=2), rpois(1000, 2 * 2))
abline(0, 1, lty=2, col="red")

We can also follow the path in the opposite direction. Following the first definition, given an interval of length \(t\), we can use a Poisson distribution to draw the number of arrivals to place there. Then, following the conditional probability of an arrival, we just have to randomly place those arrivals over the interval, and the resulting inter-arrival times follow an exponential distribution.

lambda <- 2
t <- 1
x <- rpois(1000, lambda * t)

# Randomly place x[i] arrivals within each interval i
place <- function(i) {
  # arrivals generation
  arrivals <- runif(x[i], 0, t)
  # we place them within the corresponding interval by adding i * size of the interval
  sort(arrivals) + (i-1)*t
}
# times format: 
#  - [[interval]] [1] arrival1 arrival2
times <- sapply(1:length(x), place)

# Extracts arrivals from all intervals in a vector and replace 
# the absolute time of the arrival with the difference between arrivals (inter-arrivals)
inter <- diff(unlist(times))

# The result is exponentially distributed
qqplot(inter, rexp(length(inter), lambda))
abline(0, 1, lty=2, col="red")

2.2 Aggregating and sampling Poisson processes

In general, it is more convenient to work with inter-arrival times (the underlying exponential variable) when dealing with Poisson processes. Below, we generate the aggregation of two Poisson arrival processes. To construct the aggregated vector, note that because of the difference in rates, one of them reaches (way) further than the other; for a proper merge, we need to stop when there are no more arrivals from the other vector. Then, aggregating arrivals is as simple as concatenating and sorting:

lambda1 <- 1
inter1 <- rexp(1000, lambda1)
times1 <- cumsum(inter1)

lambda2 <- 3
inter2 <- rexp(1000, lambda2)
times2 <- cumsum(inter2)

max(times1); max(times2)
## [1] 1002.422
## [1] 345.4732
times1 <- times1[times1 < max(times2)]
times2 <- times2[times2 < max(times1)]
times_agg <- sort(c(times1, times2))
inter_agg <- diff(times_agg)

Now we analyze the resulting inter-arrival times to confirm that they follow an exponentially distributed random variable with a rate equal to the sum of the rates:

qqplot(inter_agg, rexp(length(inter_agg), lambda1 + lambda2))
abline(0, 1, lty=2, col="red")

We next sample a Poisson arrival process, by sampling each arrival with a probability 1/2. This new process should follow an exponential distribution with half of the original rate:

prob <- 1/2

# Sampling using a for loop
# times_dec = c()
# for (i in 1:length(times_agg)){
#   random_val <- runif(1)
#   if (random_val < prob){
#      times_dec <- c(times_dec, times_agg[i])
#   }
# }

# Sampling directly working with vectors
times_dec <- subset(times_agg, runif(length(times_agg)) < prob)

inter_dec <- diff(times_dec)
qqplot(inter_dec, rexp(length(inter_dec), (lambda1 + lambda2) * prob))
abline(0, 1, lty=2, col="red")

2.3 Palm-Khintchine theorem

Let us suppose that a switch connects 64 Voice over IP (VoIP) phones. The time between frames generated by a VoIP phone can be modeled with a random variable uniformly distributed between 8 and 12 ms. The arrival process for each phone cannot be considered as Poissonian, as times do not follow an exponential random variable of the same mean:

inter <- runif(1000, 8, 12)
qqplot(inter, rexp(length(inter), 1 / ((8 + 12) / 2)))
abline(0, 1, lty=2, col="red")

However, as the Palm-Khintchine theorem states, the aggregation of these 64 traffic sources closely reassembles an exponential Process:

f <- function(i){
  samples <- runif(1000, 8, 12)
  return(cumsum(samples))
}
# Matrix of 64 columns and 1000 rows
traffic_data <- sapply(1:64, f)
vectorized_traffic_data <- unlist(traffic_data)

times_agg <- sort(vectorized_traffic_data)
inter_agg <- diff(times_agg)

qqplot(inter_agg, rexp(length(inter_agg), 64 / ((8 + 12) / 2)))
abline(0, 1, lty=2, col="red")

3 Solving problems with R

3.1 Problem 1

Assume that the network delay can be modeled with an exponential random variable with an average of 10ms. Compute the average delay of the dataframes with a delay bigger than 50ms.

lambda <- 1/10
f <- function(x){
  x <- rexp(1000, lambda)
  return(mean(x[x>50]))
}
avg <- sapply(1:1000, f)
mean(avg)
## [1] 60.02601

3.2 Problem 2

Suppose that we want a hardisk configuration in order to store 2 \(TB\) of data. The lifetime of a hardisk can be modeled with an exponential random variable of average \(1/\lambda\). Compute the probability of the information to be available after \(1 Year\) existing 3 different options:

  1. 1 HD 500€, 2TB, \(1/\lambda = 2 years\)
expected_lifetime <- 1

f <- function(x){
  hd_1 <- rexp(10000, 1/2)
  greater <- hd_1[hd_1 > expected_lifetime]
  return(length(greater)/length(hd_1))
}

r <- sapply(1:1000, f)
print(mean(r))
## [1] 0.6065801
  1. 2 HD 100€, 1TB each, \(1/\lambda = 1 year\), Raid 0
expected_lifetime <- 1

f <- function(x){
  hd_1 <- rexp(1000, 1)
  hd_2 <- rexp(1000, 1)
  greater <- 0
  for (i in 1:length(hd_1)){
    if((hd_1[i] > expected_lifetime) && (hd_2[i] > expected_lifetime)){
      greater <- greater + 1
    }
  }
  return(greater/length(hd_1))
}
r <- sapply(1:1000, f)
print(mean(r))
## [1] 0.135007
  1. 2 HD 200€, 2TB each, \(1/\lambda = 1 year\), Raid 1
expected_lifetime <- 1

f <- function(x){
  hd_1 <- rexp(1000, 1)
  hd_2 <- rexp(1000, 1)
  greater <- 0
  for (i in 1:length(hd_1)){
    if((hd_1[i] > expected_lifetime) || (hd_2[i] > expected_lifetime)){
      greater <- greater + 1
    }
  }
  return(greater/length(hd_1))
}
r <- sapply(1:1000, f)
print(mean(r))
## [1] 0.600491

3.3 Problem 3

The time between two consecutive Youtube requests can be modelled using a uniformly distributed random variable between 0 and 40 minutes, and 60% of the requests go to USA servers. Now, consider a mobile base station serving 100 users:

  1. Characterize the request generation process to USA servers, justify your assumptions.
  2. Find the probability that less than two requests to USA servers are generated within a period of one minute.

This exercise is intended to use the P-K theorem, to assume that the resulting process is Poissonian, with exponential interarrivals of rate \(0.6\cdot 100 \cdot 2 / 40 = 3\) requests per minute. But how similar it is to a Poisson process? To answer this question, we simulate the process with the given information:

n <- 100000
users <- 100
t.min <- 0
t.max <- 40
prob <- 0.6
aggr_times <- function(x) {
    u <- cumsum(runif(n/users/prob, t.min, t.max))
    return(u[runif(n/users/prob) < prob])
}
inter_times <- function(x) {
  aggr_times_data <- sapply(1:users, aggr_times)
  return(diff(sort(unlist(aggr_times_data))))
}

data <- inter_times()

This arrival process seems to fit an exponential distribution:

qqplot(data, rexp(length(data), 3))
abline(0, 1, lty=2, col="red")

Hence, we can say thar EEUU requests are samples from the process, being then another Poisson process with rate:

out <- sapply(1:1000, function(i) {
  cell <- inter_times()
  return(1/mean(cell))
})
t.test(out)
## 
##  One Sample t-test
## 
## data:  out
## t = 5128.2, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  2.897266 2.899484
## sample estimates:
## mean of x 
##  2.898375

Therefore, the probability of having less than two requests in one minute is:

out <- sapply(1:1000, function(i) {
  # count_arrivals was defined at Poisson processes section
  counts <- count_arrivals(cumsum(inter_times()), t=1)
  sum(counts < 2) / length(counts)
})
t.test(out)
## 
##  One Sample t-test
## 
## data:  out
## t = 1518.3, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.2226627 0.2232390
## sample estimates:
## mean of x 
## 0.2229508

3.4 Problem 4

A given application generates frames following a Poisson process with rate \(\lambda\)~frames/s. In order to save energy, the wireless interface generates a frame if if does not detect any traffic during \(T=1/\lambda\)~s. Find the frame transmission rate, and discuss whether or not the process is Poisson.

First of all, we implement a function to generate this process. To that aim, we first generate exponentially distributed samples, which represent the traffic generated by the app. Then, we have to fill the holes of length \(T\ge1/\lambda\) with frames coming from the control module. The which function finds those holes. e1 is converted to a list for us to be able to expand those holes identified in gap. In the for loop, we compute how many control frames should be generated (depending on how long is the gap), and they are inserted. Note that the original interarrival time, which was idenfitied as \(T\ge1/\lambda\), must be modified to consider the frames inserted.

rgen <- function(n, lambda) {
  # samples generation
  e1 <- rexp(n, lambda)
  # Finds gaps of length T>=1/lambda
  gap <- which(e1 >= 1/lambda)
  e1 <- as.list(e1)
  # calculates how many control frams should be generated
  for (j in gap) {
    N <- as.integer(e1[[j]] * lambda)
    reps <- rep(1/lambda, N)
    e1[[j]] <- c(reps, e1[[j]] - sum(reps))
  }
  unlist(e1)[1:n]
}

Finally, we proceed as in previous exercises to compute the rate for multiple simulations. We assume, for example, \(\lambda=2\):

out <- sapply(1:1000, function(i) {
  1/mean(rgen(1000, 2))
})
t.test(out)
## 
##  One Sample t-test
## 
## data:  out
## t = 1763.7, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  3.161935 3.168979
## sample estimates:
## mean of x 
##  3.165457

It is clear that the resulting process is not Poissonian. The density function evidences this fact, because there is a peak superimposed in the tail which accounts for the traffic from the control module:

e1 <- rgen(1000, 2)
plot(density(e1))
curve(dexp(x, 2), add=TRUE, lty=2, col="red")

And the qqplot agains an exponential distribution tells us the same story:

qqplot(e1, rexp(length(e1), 2))
abline(0, 1, lty=2, col="red")