\documentclass[12pt]{article} \usepackage{wide,me} \parindent0pt

### Balls in Bins

• $n$ users want to choose among $n$ resources---e.g., web servers.
• They could coordinate, but communication is expensive
• Or, each can simply choose a random resource

$n$ balls in $n$ bins

• Expected balls per bin: 1 (not very interesting)
• What is max balls we expect to see in a bin?
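A quick simulation makes the question concrete (a minimal sketch; `max_load` is an illustrative name):

```python
import math
import random

def max_load(n, seed=0):
    """Throw n balls into n bins uniformly at random; return the maximum bin load."""
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        bins[rng.randrange(n)] += 1
    return max(bins)

n = 100_000
# The analysis below shows the max is Theta(log n / log log n); compare:
print(max_load(n), math.log(n) / math.log(math.log(n)))
```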

Markov:

• $E[X]=1$ for the load $X$ of any fixed bin, so $\Pr[X>t] < 1/t$
• so any one bin is unlikely to be big
• But there are $n$ bins, so the union bound only tells us the max is less than $n$
• Example? Distribution that puts all balls in random bin
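The distribution in the last bullet is easy to check directly (a toy sketch; `all_in_one_bin` is an illustrative name):

```python
import random

def all_in_one_bin(n, seed=0):
    """Degenerate distribution: all n balls land in one uniformly random bin.
    Each bin's expected load is still n * (1/n) = 1, yet the max load is always n."""
    rng = random.Random(seed)
    bins = [0] * n
    bins[rng.randrange(n)] = n
    return bins

loads = all_in_one_bin(1000)
print(sum(loads) / len(loads), max(loads))  # average load is 1, max load is n
```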

Chebyshev a little better:

• variance of one ball's indicator for one bin is $(1/n)(1-1/n) < 1/n$
• so the variance of a bin's load over $n$ independent balls is less than $1$
• Chebyshev: $\Pr[|X-1|>t] < 1/t^2$
• Taking $t=\sqrt{n}$: one bin has $O(\sqrt{n})$ balls with probability $1-1/n$
• So now can apply union bound to all $n$ bins
• conclude max is probably $O(\sqrt{n})$
• but still hard to bound expected max
• true for any pairwise independent distribution (cf 2-universal hashing)
• exists pairwise independent distributions where this is tight

Chernoff:

• expected number of balls in any one bin is $1$
• Chernoff: $\Pr[X>t] < 2^{-t}$ (for $t > 2e-1$)
• So $O(\log n)$ for one bin with high probability.
• now apply union bound
• Now can bound expected max:
• With probability $1-1/n$, max is $O(\log n)$.
• With probability $1/n$, max is bigger, but at most $n$
• So, expected max $O(\log n)$

Exact analysis

• Start by bounding the probability of many balls in one bin:
\begin{align*}
\Pr[k \mbox{ balls in bin 1}] &= \binom{n}{k}(1/n)^k(1-1/n)^{n-k}\\
& \le \binom{n}{k}(1/n)^k\\
& \le \left(\frac{ne}{k}\right)^k (1/n)^k\\
& = \left(\frac{e}{k}\right)^k
\end{align*}
• So prob of at least $k$ balls is $\sum_{j \ge k} (e/j)^j=O((e/k)^k)$ (dominated by a geometric series)
• this is $\le 1/n^2$ when $k > (e\ln n)/\ln\ln n$
• What is the probability any bin exceeds $k$? At most $1/n$, by a union bound over the $n$ bins.
• Now can bound expected max:
• With probability $1-1/n$, max is $O(\ln n/\ln\ln n)$.
• With probability $1/n$, max is bigger, but at most $n$
• So, expected max $O(\ln n/\ln\ln n)$
• Typical approach: small expectation as small “common case” plus large “rare case”
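The threshold from the tail bound above can be computed numerically and compared against the $(e\ln n)/\ln\ln n$ formula (a sketch; `threshold` is an illustrative name):

```python
import math

def tail_bound(k):
    """Pr[>= k balls in a fixed bin] = O((e/k)^k), from the binomial bound above."""
    return (math.e / k) ** k

def threshold(n):
    """Smallest k with (e/k)^k <= 1/n^2, so a union bound over n bins leaves prob <= 1/n."""
    k = 1
    while tail_bound(k) > 1.0 / n**2:
        k += 1
    return k

for n in (10**3, 10**6, 10**9):
    print(n, threshold(n), math.e * math.log(n) / math.log(math.log(n)))
```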

Purely random has clusters

• Saw $\log n/\log\log n$ balls in one bin
• Similarly, $n$ coin flips contain runs of length $\Omega(\log n)$
• Gives a good party trick---deciding whether a sequence was generated randomly or written down by a human
• How can we do better?
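The run-length observation behind the party trick can be checked with a few lines (a sketch; `longest_run` is an illustrative name):

```python
import math
import random

def longest_run(flips):
    """Length of the longest run of identical consecutive outcomes."""
    best = cur = 1
    for a, b in zip(flips, flips[1:]):
        cur = cur + 1 if a == b else 1
        best = max(best, cur)
    return best

rng = random.Random(0)
n = 1 << 16
flips = [rng.randrange(2) for _ in range(n)]
# A truly random sequence of n flips has a run of length about log2(n);
# human-invented "random" sequences rarely do -- hence the party trick.
print(longest_run(flips), math.log2(n))
```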

## Two Choices


• Azar, Broder, Karlin, Upfal
• Each ball picks two random bins and goes into the less loaded of the two
• Intuitive analysis:
• at most $n/4$ bins hold $\ge 4$ balls (there are only $n$ balls total)
• a ball reaches height $5$ only if both its random bins already hold $\ge 4$: probability $(1/4)(1/4)=1/16$
• So, at most $n/16$ bins hold $\ge 5$
• so at most $n/256$ hold $\ge 6$
• In general, at most $n/2^{2^{k-3}}$ hold $\ge k$
• So no bin reaches height $\log\log n + O(1)$.
• Problem: analysis bounds one layer based on assumption about previous. must cope with this conditioning.
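The intuition above is easy to test empirically (a sketch; `max_load_d` is an illustrative name):

```python
import random

def max_load_d(n, d, seed=0):
    """Each of n balls picks d uniformly random bins and joins the least loaded one."""
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        # min over the d candidate bins by current load
        best = min((rng.randrange(n) for _ in range(d)), key=lambda b: bins[b])
        bins[best] += 1
    return max(bins)

n = 100_000
print(max_load_d(n, 1), max_load_d(n, 2))  # one random choice vs. two choices
```

With one choice the max load is around $\log n/\log\log n$; with two it drops to around $\log\log n$, a dramatic improvement for such a small change.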

Formalize

• First some lemmas.
• Chernoff: $\Pr[B(n,p)>2np]< 2^{-np/3}$.
• Dependence: Let $Y_i$ be a function of $X_1,\ldots,X_i$. If $\Pr[Y_i=1 \mid X_1,\ldots,X_{i-1}]\le p$ for all values of the $X$'s, then $\sum Y_i$ is stochastically dominated by $B(n,p)$,
• i.e., if the $Z_i$ are i.i.d. Bernoulli$(p)$, then $\Pr[\sum Y_i \ge k] \le \Pr[\sum Z_i \ge k]$.
• so Chernoff holds.
• Even if $Y_i$ dependent.
• proof philosophically subtle, by “coupling”:
• imagine generating a sequence $Z_i$.
• do it while generating $Y_i$
• first, compute $p_i=\Pr[Y_i=1 \mid X_1,\ldots,X_{i-1}]$, which may be a messy function of the $X$ sequence
• Now generate $Y_i$. Set $Z_i=1$ if $Y_i=1$; otherwise set $Z_i=1$ with probability $(p-p_i)/(1-p_i)$, so that $\Pr[Z_i=1]=p_i+(1-p_i)\cdot\frac{p-p_i}{1-p_i}=p$.
• So each $Z_i$ is 1 with probability $p$ independent of all other $Z_i$.
• but (guaranteed) $Z_i \ge Y_i$.
• conclude $Z_i$ stochastically dominates $Y_i$
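The coupling can be sketched in code, with a toy dependent sequence standing in for the messy $p_i$ (illustrative names throughout; the dependence rule is invented for the demo):

```python
import random

def coupled_run(n, p, seed=0):
    """Generate dependent Y_i with Pr[Y_i=1 | history] = p_i <= p, together with
    coupled Bernoulli(p) variables Z_i satisfying Z_i >= Y_i pointwise."""
    rng = random.Random(seed)
    ys, zs = [], []
    for _ in range(n):
        # toy dependence: the conditional probability shrinks after a success
        p_i = p / 2 if (ys and ys[-1] == 1) else p
        y = 1 if rng.random() < p_i else 0
        if y == 1:
            z = 1  # Z_i = 1 whenever Y_i = 1
        else:
            # boost so that Pr[Z_i=1] = p_i + (1-p_i) * (p-p_i)/(1-p_i) = p exactly
            z = 1 if rng.random() < (p - p_i) / (1 - p_i) else 0
        ys.append(y)
        zs.append(z)
    return ys, zs

ys, zs = coupled_run(10_000, 0.3)
assert all(z >= y for y, z in zip(ys, zs))  # pointwise domination, guaranteed
print(sum(ys), sum(zs))  # sum(zs) behaves like Binomial(10000, 0.3)
```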

Main analysis:

• $v_i(t)=$ number of bins of height at least $i$ after inserting $t$ balls
• $h_t=$ height of $t^{th}$ ball inserted
• bounds: $\beta_4=1/4$, $\beta_{i+1} = 2\beta_i^2$
• so $2\beta_{i+1}=(2\beta_i)^2$, giving $2\beta_i=(1/2)^{2^{i-4}}$ (since $2\beta_4=1/2$)
• so $\beta_i=O(1/n^2)$ once $i=O(\log\log n)$
• prove $v_i \le \beta_i n$ (event $Q_i$) w.h.p.
• base case clear
• Induction
• Let $Y_t=1$ if $h_t \ge i+1$ and $v_i(t-1)\le \beta_i n$
• second condition odd but simplifies analysis
• only decreases probability $Y_t=1$
• So $\Pr[Y_t=1] \le \beta^2_i$ (relies on having second condition)
• So $\sum Y_t$ is stochastically dominated by $B(n,\beta^2_i)$
• So Chernoff says $\Pr[>2n\beta^2_i=\beta_{i+1}n] < 2^{-\beta_{i+1}n/6}$
• which is $O(1/n^2)$ (“high probability”) as long as $\beta_{i+1} n = \Omega(\log n)$
• When $Q_i$ holds, $\sum Y_t$ counts the balls that land at height $\ge i+1$
• and the number of bins of height $\ge i+1$ is at most the number of such balls.
• So we've shown: $$\begin{eqnarray*} \Pr[\neg Q_{i+1} \mid Q_i] &\le &\Pr[\sum Y_t > \beta_{i+1}n \mid Q_i]\\ &\le & \Pr[\sum Y_t > \beta_{i+1}n]/\Pr[Q_i]\\ &\le &1/(n^2\Pr[Q_i]) \end{eqnarray*}$$
• Deal with conditioning $$\begin{eqnarray*} \Pr[\neg Q_{i+1}] &=& \Pr[\neg Q_{i+1} \mid Q_i]\Pr[Q_i] + \Pr[\neg Q_{i+1} \mid \neg Q_i]\Pr[\neg Q_i]\\ &\le& \Pr[\neg Q_{i+1}\mid Q_i]\Pr[Q_i] + \Pr[\neg Q_i]\\ &\le& O(1/n^2)+\Pr[\neg Q_i] \end{eqnarray*}$$
• so by induction, whp all $Q_i$ hold
• What of last few stages?
• At the last such $i$, the number of tall bins is $\beta_i n = O(\log n)$ whp
• At this point, the prob. that a given ball is tall (both its choices land in tall bins) is $O((\log n)^2/n^2)$
• So the prob. of 2 distinct tall balls is $O(n^2 \cdot ((\log n)^2/n^2)^2) = O((\log n)^4/n^2)$, which is negligible (union bound over the $n^2$ pairs of balls)
• if only one ball is tall, done.
• because it would need to stand on other tall balls to be really tall

More choices? Only $O(\log_d\log n)$

• slight variation (Vöcking's $d$-left scheme): split the bins into $d$ groups, pick one random bin per group, and break ties toward the leftmost group
• improves to $O((\log\log n)/d)$
• intuition: a slight amount of imbalance helps cope with clusters all attacking same spots
• ensures some spots will be smaller
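The tie-to-the-left variant is a small change to the simulation (a sketch; `max_load_left` is an illustrative name, and the grouping follows the scheme described above):

```python
import random

def max_load_left(n, d, seed=0):
    """d-left scheme: bins are split into d contiguous groups; each ball takes one
    random bin from each group and joins the least loaded, breaking ties toward
    the leftmost group."""
    rng = random.Random(seed)
    bins = [0] * n
    g = n // d
    for _ in range(n):
        picks = [j * g + rng.randrange(g) for j in range(d)]
        # Python's min is stable: on ties it keeps the earliest (leftmost) pick
        best = min(picks, key=lambda b: bins[b])
        bins[best] += 1
    return max(bins)

print(max_load_left(100_000, 2))
```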