Professor: Arne Storjohann | Term: Fall 2024

Module 1: Introduction & Asymptotic Analysis

Much of Computer Science is problem solving: Write a program that converts the given input to the expected output.

When first learning to program, we emphasize correctness. With this course, we will also be concerned with efficiency. We will study efficient methods of storing, accessing, and organizing large collections of data.

Typical operations include: Inserting new data items, deleting data items, searching for specific data items, sorting.

We consider various abstract data types (ADTs), and how to realize them efficiently using appropriate data structures. There is a strong emphasis on mathematical analysis in this course.

Useful logarithm facts:

$y = \log_{b} (x)$ means $b^{y} = x$
$\log (x)$ in this course means $\log_{2} (x)$
$\log (x \cdot y) = \log (x) + \log (y)$ , $\log (x^{y}) = y \log (x)$ , $\log (x) \leq x$
$\log_{b} (a) = \frac{\log_{c} a}{\log_{c} b} = \frac{1}{\log_{a} (b)}$ , $a^{\log_{b} c} = c^{\log_{b} a}$
$\log (n!) = \log n + \log (n - 1) + \dots + \log 2 + \log 1 \in Θ (n \log n)$

Useful Sums:

$\sum_{i = 0}^{n - 1} i = \frac{( n - 1 ) n}{2}$
$\sum_{i = 0}^{n - 1} (a + d i) = na + \frac{d n ( n - 1 )}{2} \in Θ (n^{2})$ if $d \neq 0$ .

Geometric Sequences:

$$\sum_{i = 0}^{n - 1} 2^{i} = 2^{n} - 1$$

$$\sum_{i = 0}^{n - 1} a r^{i} = \begin{cases} a \frac{r^{n} - 1}{r - 1} \in Θ (r^{n - 1}) & \text{if } r > 1 \\ na \in Θ (n) & \text{if } r = 1 \\ a \frac{1 - r^{n}}{1 - r} \in Θ (1) & \text{if } 0 < r < 1 \end{cases}$$

Harmonic Sequence:

$H_{n} := \sum_{i = 1}^{n} \frac{1}{i} = \ln n + γ + o (1) \in Θ (\log n)$

Algorithms and Problems

Problem: Description of possible input and desired output.

Problem Instance: One possible input for the specified problem.

Algorithm: Step-by-step process for carrying out a series of computations, given an arbitrary instance $I$ .

Solving a problem: An algorithm $A$ solves a problem $Π$ if, for every instance $I$ of $Π$ , $A$ computes a valid output for the instance $I$ in finite time.

Program: A program is an implementation of an algorithm using a specified computer language.

Pseudocode: A method of communicating an algorithm to another person.

insertion-sort(A, n)
A: array of size n
for (i <- 1; i < n; i++) do
	for (j <- i; j > 0 and A[j-1] > A[j]; j--) do
		swap A[j] and A[j-1]

Pseudocode omits obvious details, but should be precise about exit-conditions.
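As a sketch of how the pseudocode might translate into a program, here is one possible Python version. The function name `insertion_sort` and the choice to return the list are illustrative assumptions, not part of the course material.

```python
def insertion_sort(a):
    """Sort the list a in place using the two nested loops of insertion-sort."""
    n = len(a)
    for i in range(1, n):
        j = i
        # Move a[i] to the left until its left neighbour is no larger.
        while j > 0 and a[j - 1] > a[j]:
            a[j - 1], a[j] = a[j], a[j - 1]
            j -= 1
    return a
```

The exit condition of the inner loop (`j > 0 and a[j - 1] > a[j]`) is exactly the part that pseudocode should state precisely.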

From a problem $Π$ to a program that solves it:

Design an algorithm $A$ that solves $Π$ $\to$ Algorithm Design.
Assess correctness and efficiency of each $A$ $\to$ Algorithm Analysis. Correctness $\to$ CS245.
If acceptable, implement algorithms.
If multiple acceptable algorithms, run experiments to determine the best solution.

This course will focus on the first two points.

Question

What do we mean by efficiency?

In this course, we are primarily concerned with the amount of time a program takes to run (run time).

We also may be interested in the amount of additional memory the program requires (auxiliary space).

The amount of time and/or memory required by a program will usually depend on the given problem instance.

RAM model:

Each memory cell stores one datum, typically a number, character, or reference.
Any access to a memory location takes constant time.
Any primitive operation takes constant time.

We compare algorithms by considering the growth rate: What is the behaviour of algorithms as size $n$ gets large?

We use order notation (big- $O$ and friends). Ignore constants and lower order terms.

We study relationships between functions.

Definition

$O$ -notation: $f (x) \in O (g (x))$ ( $f$ is asymptotically upper-bounded by $g$ ) if there exist constants $c > 0$ and $n_{0} \geq 0$ such that $∣ f (x) ∣ \leq c ∣ g (x) ∣$ for all $x \geq n_{0}$ .

In order to prove that $2 n^{2} + 3 n + 11 \in O (n^{2})$ from first principles, we need to find $c$ and $n_{0}$ such that the following condition is satisfied:

$$2 n^{2} + 3 n + 11 \leq c n^{2} \quad \forall n \geq n_{0}$$

Many, but not all, choices of $c$ and $n_{0}$ will work.
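One valid choice is $c = 16$ and $n_{0} = 1$ , since $3 n \leq 3 n^{2}$ and $11 \leq 11 n^{2}$ for $n \geq 1$ . The following sketch spot-checks candidate constants numerically; the helper names `f` and `witnesses_hold` are hypothetical, and a finite check like this is only a sanity test, not a proof.

```python
def f(n):
    """The function 2n^2 + 3n + 11 from the example."""
    return 2 * n * n + 3 * n + 11


def witnesses_hold(c, n0, upto=10_000):
    """Check f(n) <= c * n^2 for all n0 <= n <= upto (a finite spot-check only)."""
    return all(f(n) <= c * n * n for n in range(n0, upto + 1))
```

For instance, `witnesses_hold(16, 1)` and `witnesses_hold(3, 6)` succeed, while `witnesses_hold(3, 1)` fails because the inequality is violated for small $n$ .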

We also have that $2 n^{2} + 3 n + 11 \in O (n^{10})$ . We want a tight asymptotic bound.

Definition

$Ω$ -notation: $f (x) \in Ω (g (x))$ ( $f$ is asymptotically lower-bounded by $g$ ) if there exist constants $c > 0$ and $n_{0} \geq 0$ such that $c ∣ g (x) ∣ \leq ∣ f (x) ∣$ for all $x \geq n_{0}$ .

Definition

$Θ$ -notation: $f (x) \in Θ (g (x))$ ( $f$ is asymptotically tightly-bounded by $g$ ) if there exist constants $c_{1}, c_{2} > 0$ and $n_{0} \geq 0$ such that $c_{1} ∣ g (x) ∣ \leq ∣ f (x) ∣ \leq c_{2} ∣ g (x) ∣$ for all $x \geq n_{0}$ .

Equivalently:

$$f (n) \in Θ (g (n)) ⟺ f (n) \in O (g (n)) \land f (n) \in Ω (g (n))$$

We also say that the growth rates of $f$ and $g$ are the same.

How growth rates affect running time. It is interesting to see how the running time is affected when the size of the problem instance doubles.

Constant complexity: $T (n) = c$ $⇝$ $T (2 n) = c$
Logarithmic complexity: $T (n) = c \log n$ $⇝$ $T (2 n) = T (n) + c$
Linear complexity: $T (n) = c n$ $⇝$ $T (2 n) = 2 T (n)$
Linearithmic complexity: $T (n) = c n \log n$ $⇝$ $T (2 n) = 2 T (n) + 2 c n$
Quadratic complexity: $T (n) = c n^{2}$ $⇝$ $T (2 n) = 4 T (n)$
Cubic complexity: $T (n) = c n^{3}$ $⇝$ $T (2 n) = 8 T (n)$
Exponential complexity: $T (n) = c 2^{n}$ $⇝$ $T (2 n) = (T (n))^{2} / c$

We have $f (n) = n \in Θ (n)$ .

Question

How do we express that $f (n)$ grows slower than $n^{2}$ ?

Definition

$o$ -notation: $f (x) \in o (g (x))$ ( $f$ is asymptotically strictly smaller than $g$ ) if for all constants $c > 0$ , there exists a constant $n_{0} \geq 0$ such that $∣ f (x) ∣ \leq c ∣ g (x) ∣\forall x \geq n_{0}$ .

Definition

$ω$ -notation: $f (x) \in ω (g (x))$ ( $f$ is asymptotically strictly larger than $g$ ) if for all constants $c > 0$ , there exists a constant $n_{0} \geq 0$ such that $∣ f (x) ∣ \geq c ∣ g (x) ∣\forall x \geq n_{0}$ .

Suppose that $f (n) > 0$ and $g (n) > 0$ for all $n \geq n_{0}$ . Suppose that

$$L = \lim_{n \to \infty} \frac{f (n)}{g (n)}$$

Then

$$f (n) \in \begin{cases} o (g (n)) & \text{if } L = 0 \\ Θ (g (n)) & \text{if } 0 < L < \infty \end{cases}$$

If the fraction goes towards $\infty$ , then $f (n) \in ω (g (n))$ .

The limit can often be computed using l'Hôpital's rule. This result gives sufficient but not necessary conditions for the stated conclusion to hold.
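As a small worked example of the limit rule (chosen here for illustration), take $f (n) = \log n$ and $g (n) = n$ . Treating $n$ as a real variable and differentiating numerator and denominator gives:

```latex
\lim_{n \to \infty} \frac{\log n}{n}
  = \lim_{n \to \infty} \frac{\ln n}{n \ln 2}
  = \lim_{n \to \infty} \frac{1/n}{\ln 2}
  = 0
\quad\Longrightarrow\quad \log n \in o(n)
```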

Many rules are easily proved from first principles.

Identity rule: $f (n) \in Θ (f (n))$ .

Transitivity:

If $f (n) \in O (g (n))$ and $g (n) \in O (h (n))$ then $f (n) \in O (h (n))$ .
If $f (n) \in Ω (g (n))$ and $g (n) \in Ω (h (n))$ then $f (n) \in Ω (h (n))$ .
If $f (n) \in O (g (n))$ and $g (n) \in o (h (n))$ then $f (n) \in o (h (n))$ .

Suppose that $f (n) > 0$ and $g (n) > 0$ for all $n \geq n_{0}$ . Then:

$f (n) + g (n) \in O (\max \{f (n), g (n)\})$
$f (n) + g (n) \in Ω (\max \{f (n), g (n)\})$

Relationships between order notations

$f (n) \in Θ (g (n)) ⟺ g (n) \in Θ (f (n))$
$f (n) \in O (g (n)) ⟺ g (n) \in Ω (f (n))$
$f (n) \in o (g (n)) ⟺ g (n) \in ω (f (n))$
$f (n) \in Θ (g (n)) ⟺ f (n) \in O (g (n)) \land f (n) \in Ω (g (n))$
$f (n) \in o (g (n)) ⟹ f (n) \in O (g (n))$
$f (n) \in o (g (n)) ⟹ f (n) \notin Ω (g (n))$
$f (n) \in ω (g (n)) ⟹ f (n) \in Ω (g (n))$
$f (n) \in ω (g (n)) ⟹ f (n) \notin O (g (n))$

Avoid doing arithmetic with asymptotic notations. Do not write $O (n) + O (n) = O (n)$ .

Techniques for run-time analysis:

Identify primitive operations that require $Θ (1)$ time
The complexity of a loop is expressed as the sum of the complexities of each iteration of the loop.
Nested loops: start with the innermost loop and proceed outwards. This gives nested summations.

Let $T_{A} (I)$ denote the running time of an algorithm $A$ on instance $I$ .

Worst-case (best-case) complexity of an algorithm: The worst-case (best-case) running time of an algorithm $A$ is a function $T : Z^{+} \to R$ mapping $n$ to the longest (shortest) running time for any input instance of size $n$ :

$$T_{A}^{worst} (n) = \max_{I \in I_{n}} \{T_{A} (I)\}$$

$$T_{A}^{best} (n) = \min_{I \in I_{n}} \{T_{A} (I)\}$$

Average-case complexity of an algorithm: The average-case running time of an algorithm $A$ is a function $T : Z^{+} \to R$ mapping $n$ to the average running time of $A$ over all instances of size $n$ :

$$T_{A}^{avg} (n) = \sum_{I \in I_{n}} T_{A} (I) \cdot (\text{relative frequency of } I)$$

Question

Suppose algorithm $A_{1}$ has worst-case run-time $O (n^{3})$ and algorithm $A_{2}$ has worst-case run-time $O (n^{2})$ , and both solve the same problem. Is $A_{2}$ more efficient?

No! $O$ -notation is an upper-bound. $A_{1}$ may well have worst-case run-time $O (n)$ . We should always give $Θ$ -bounds.

Question

Suppose the run-times are $Θ (n^{3})$ and $Θ (n^{2})$ , respectively. We consider $A_{2}$ to be better. But is it always more efficient?

No! The worst-case run-time of $A_{1}$ may only be achieved on some instances. Possibly $A_{1}$ is better on most instances.

Merge-sort

Step 1: Describe the overall idea

Input: Array $A$ of $n$ integers.

We split $A$ into subarrays $A_{L}$ and $A_{R}$ that are roughly half as big.
Recursively sort $A_{L}$ and $A_{R}$
After $A_{L}$ and $A_{R}$ have been sorted, use a function merge to merge them into a single sorted array.

Step 2: Give pseudo-code or detailed description

merge-sort(A,n)
A: array of size n
if (n <= 1) then return
else
	m = floor((n-1)/2)
	merge-sort(A[0..m], m+1)
	merge-sort(A[m+1..n-1], n-m-1)
	merge(A[0..m], A[m+1..n-1])

Do not pass array $A$ by value; instead indicate the range of the array that needs to be sorted. merge needs an auxiliary array $S$ . Allocate this only once.

merge(A,l,m,r,S)
A[0..n-1] is an array, A[l..m] is sorted, A[m+1..r] is sorted
S[0..n-1] is an array
copy A[l..r] into S[l..r]
(i_L, i_R) <- (l, m+1)
for (k <- l; k <= r; k++) do
	if (i_L > m) A[k] <- S[i_R ++]
	else if (i_R > r) A[k] <- S[i_L ++]
	else if (S[i_L] <= S[i_R]) A[k] <- S[i_L ++]
	else A[k] <- S[i_R ++]
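The two routines can be sketched in Python as follows. This version passes index ranges instead of sub-arrays and allocates the auxiliary list once; the default-argument handling (`r=None`, `s=None`) is an implementation convenience assumed here, not something prescribed by the notes.

```python
def merge(a, l, m, r, s):
    """Merge the sorted ranges a[l..m] and a[m+1..r], using s as scratch space."""
    s[l:r + 1] = a[l:r + 1]
    i_l, i_r = l, m + 1
    for k in range(l, r + 1):
        if i_l > m:                      # left part exhausted
            a[k] = s[i_r]
            i_r += 1
        elif i_r > r:                    # right part exhausted
            a[k] = s[i_l]
            i_l += 1
        elif s[i_l] <= s[i_r]:           # ties go left, so the sort is stable
            a[k] = s[i_l]
            i_l += 1
        else:
            a[k] = s[i_r]
            i_r += 1


def merge_sort(a, l=0, r=None, s=None):
    """Sort a[l..r] in place; the auxiliary list s is allocated only once."""
    if r is None:
        r = len(a) - 1
    if s is None:
        s = [None] * len(a)
    if r <= l:
        return a
    m = (l + r) // 2
    merge_sort(a, l, m, s)
    merge_sort(a, m + 1, r, s)
    merge(a, l, m, r, s)
    return a
```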

Step 3: Argue correctness. Typically state loop-invariants. Sometimes obvious enough from idea-description and comments.

Step 4: Analyze the run-time

First analyze work done outside recursion
If applicable, analyze subroutines separately. If there are recursions: how big are the subproblems? The run-time then becomes a recursive function.

The recurrence relation for $T (n)$ is as follows

$$T (n) = \begin{cases} T (⌈ \frac{n}{2} ⌉) + T (⌊ \frac{n}{2} ⌋) + c n & \text{if } n > 1 \\ c & \text{if } n = 1 \end{cases}$$

The following is the corresponding sloppy recurrence

$$T (n) = \begin{cases} 2 T (\frac{n}{2}) + c n & \text{if } n > 1 \\ c & \text{if } n = 1 \end{cases}$$

This resolves to $T (n) \in O (n \log n)$ .
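To see where the bound comes from, the sloppy recurrence can be unrolled. The sketch below assumes that $n$ is a power of two, $n = 2^{k}$ , so that every division by 2 is exact:

```latex
\begin{aligned}
T(n) &= 2\,T(n/2) + cn \\
     &= 4\,T(n/4) + 2cn \\
     &= 8\,T(n/8) + 3cn \\
     &\;\;\vdots \\
     &= 2^{k}\,T(n/2^{k}) + k\,cn \\
     &= n\,T(1) + cn \log n \qquad (k = \log n) \\
     &= cn + cn \log n
\end{aligned}
```

Each of the $\log n$ levels of the recursion contributes $cn$ work, plus $cn$ for the $n$ base cases.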

Module 2: Priority Queues

Definition

Abstract Data Type (ADT): A description of information and a collection of operations on that information. (Covered in CS135).

The information is accessed only through the operations.

We can have various realizations of an ADT, which specify how the information is stored and how the operations are performed.

Recall:

Stack: An ADT consisting of a collection of items with operations:

push: Add an item to the stack.
pop: Remove and return the most recently added item.

Items are removed in LIFO order. We can have extra operations size, is-empty, and top.

Queue: An ADT consisting of a collection of items with operations:

enqueue: Add an item to the queue
dequeue: Remove and return the least recently inserted item.

Items are removed in FIFO order.

We can have extra operations size, is-empty, and peek/front.

Priority Queue generalizes both ADT Stack and ADT Queue.

It is a collection of items (each with a priority or key) with operations:

insert: Inserting an item tagged with a priority
delete-max: Removing and returning an item of highest priority.

We can have extra operations: size, is-empty, get-max.

This is a maximum-oriented priority queue. A minimum-oriented priority queue replaces operation delete-max by delete-min.

Using a priority queue to sort:

PQ-Sort(A[0..n-1])
for i <- 0 to n-1 do
	PQ.insert (an item with priority A[i])
for i <- n-1 down to 0 do
	A[i] <- priority of PQ.delete-max()

Run-time: $O (\text{initialization} + n \cdot \text{insert} + n \cdot \text{delete-max})$
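A minimal sketch of PQ-sort in Python, using the standard-library `heapq` module as the priority queue. Note the assumption: `heapq` is minimum-oriented, so this version fills the array from the front with delete-min instead of from the back with delete-max.

```python
import heapq


def pq_sort(a):
    """Sort a in place by inserting everything into a PQ and extracting in order."""
    pq = []
    for x in a:
        heapq.heappush(pq, x)        # insert
    for i in range(len(a)):
        a[i] = heapq.heappop(pq)     # delete-min (min-oriented priority queue)
    return a
```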

Realizations of Priority Queues

Realization 1: Unsorted arrays

Run-time of operations:

insert: $Θ (1)$
delete-max: $Θ (n)$

We assume dynamic arrays, expand by doubling as needed. Amortized over all insertions this takes $Θ (1)$ extra time.

PQ-sort with this realization yields selection-sort.

Realization 2: Sorted arrays

Run-time of operations:

insert: $Θ (n)$
delete-max: $Θ (1)$

PQ-sort with this realization yields insertion-sort. insert can be implemented to have $Θ (1)$ best-case run-time. insertion-sort then has $Θ (n)$ best-case run-time.

Realization 3: Heaps

A binary heap is a certain type of binary tree. A binary tree is either empty, or consists of a node and two binary trees (left subtree and right subtree).

Level $ℓ =$ all nodes at distance $ℓ$ from the root. The root is on level $0$ .

Height $h =$ maximum number for which level $h$ contains nodes. Single-node tree has height $0$ , empty tree has height $- 1$ .

Any binary tree with height $h$ has $n \leq 2^{h + 1} - 1$ nodes. So height $h \geq \log (n + 1) - 1 \in Ω (\log n)$ .

A heap is a binary tree with the following two properties:

Structural property: All the levels of a heap are completely filled, except possibly for the last level. The filled items in the last level are left-justified.
Heap-order property: For any node $i$ , the key of the parent of $i$ is larger than or equal to key of $i$ .

The full name for this is max-oriented binary heap.

Lemma: The height of a heap with $n$ nodes is $Θ (\log n)$ .

Heaps should not be stored as binary trees. Let $H$ be a heap of $n$ items and let $A$ be an array of size $n$ . Store the root in $A [0]$ and continue with the elements level-by-level from top to bottom, in each level left-to-right.

It is easy to navigate using this array representation:

The root node is at index 0
The last node is at index $n - 1$
The left child of node $i$ is node $2 i + 1$
The right child of node $i$ is node $2 i + 2$
The parent of node $i$ is node $⌊ \frac{i - 1}{2} ⌋$
These nodes exist if the index falls in the range $\{0, \dots, n - 1\}$

We hide implementation details using helper-functions.

insert in heaps: Place the new key at the first free leaf. Use fix-up to restore heap order.

insert(x)
l <- last()+ 1
A[l] <- x
increase size
fix-up(A, l)

fix-up(A,i)
while parent(i) exists and A[parent(i)].key < A[i].key do
	swap A[i] and A[parent(i)]
	i <- parent(i)

Time: $O (\text{height of heap}) = O (\log n)$ (and this is tight).
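The insert operation can be sketched in Python on a plain list that stores the heap level by level. Here the list stores bare keys rather than items with a `.key` field, which is a simplification assumed for brevity.

```python
def parent(i):
    """Index of the parent of node i in the array representation."""
    return (i - 1) // 2


def fix_up(a, i):
    """Swap a[i] upwards until its parent's key is at least as large."""
    while i > 0 and a[parent(i)] < a[i]:
        p = parent(i)
        a[i], a[p] = a[p], a[i]
        i = p


def insert(a, x):
    """Place x at the first free leaf, then restore heap order with fix_up."""
    a.append(x)
    fix_up(a, len(a) - 1)
```

The loop follows one root-to-leaf path in reverse, which is why the cost is bounded by the height of the heap.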

delete-max in heaps

The maximum item of a heap is just the root node. We replace the root by the last leaf (the last leaf is taken out). The heap-order property might now be violated, so we perform a fix-down.

fix-down(A,i,n)
A: an array that stores a heap of size n in A[0..n-1]
while i is not a leaf do
	j <- left child of i
	if (i has right child and A[right child of i].key > A[j].key)
		j <- right child of i
	if A[i].key >= A[j].key break
	swap A[j] and A[i]
	i <- j

Time: $O (\text{height of heap}) = O (\log n)$ (and this is tight).

delete-max()
l <- last()
swap A[root()] and A[l]
decrease size
fix-down(A, root(), size)
return A[l]

Time: $O (\text{height of heap}) = O (\log n)$ (and this is tight).
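A possible Python rendering of fix-down and delete-max, again assuming the list stores bare keys. The explicit size parameter `n` lets the same helper work on a prefix of the list.

```python
def fix_down(a, i, n):
    """Sink a[i] within the heap a[0..n-1] until heap order holds again."""
    while 2 * i + 1 < n:                     # i has a left child, so it is not a leaf
        j = 2 * i + 1
        if j + 1 < n and a[j + 1] > a[j]:    # pick the larger child
            j += 1
        if a[i] >= a[j]:
            break
        a[i], a[j] = a[j], a[i]
        i = j


def delete_max(a):
    """Remove and return the maximum of a non-empty max-heap stored in a."""
    last = len(a) - 1
    a[0], a[last] = a[last], a[0]
    top = a.pop()                            # the old root, now at the end
    fix_down(a, 0, len(a))
    return top
```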

Binary heaps are a realization of priority queues where the operations insert and delete-max take $Θ (\log n)$ time.

Any priority queue can be used to sort in time $O (\text{initialization} + n \cdot \text{insert} + n \cdot \text{delete-max})$ .

Using the binary-heaps implementation of PQs, we obtain:

PQ-sort-with-heaps(A)
for i <- 0 to n-1 do
	H.insert(A[i])
for i <- n-1 down to 0 do
	A[i] <- H.delete-max()

Both operations run in $O (\log n)$ time for heaps $⇝$ PQ-sort using heaps takes $O (n \log n)$ time.

Can improve this with two simple tricks $\to$ heap-sort.

Can use the same array for input and heap $⇝ O (1)$ auxiliary space.
Heaps can be built faster if we know all input in advance.

Problem: Given $n$ items all at once (in $A [0 \dots n - 1]$ ) build a heap containing all of them.

Solution 1: Start with an empty heap and insert items one at a time.

simple-heap-building(A)
for i <- 0 to A.size() - 1 do
	H.insert(A[i])

This corresponds to doing fix-up’s. Worst-case running time: $O (n \log n)$

We can use fix-down’s instead.

heapify(A)
n <- A.size()
for i <- parent(last()) downto root() do
	fix-down(A, i, n)

This yields a worst-case complexity of $Θ (n)$ . A heap can be built in linear time.

Efficient sorting with heaps: Idea is to use PQ-sort with heaps.

heap-sort(A)
n <- A.size()
for i <- parent(last()) downto 0 do
	fix-down(A,i,n)
 
while n > 1
	swap items at A[root()] and A[last()]
	decrease n
	fix-down(A, root(), n)

The for-loop takes $Θ (n)$ time and the while-loop takes $Θ (n \log n)$ time.
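The two phases of heap-sort can be sketched in Python as below. The helper `fix_down` is repeated so that the block is self-contained, and the list is assumed to store bare keys.

```python
def fix_down(a, i, n):
    """Sink a[i] within the max-heap a[0..n-1]."""
    while 2 * i + 1 < n:
        j = 2 * i + 1
        if j + 1 < n and a[j + 1] > a[j]:
            j += 1
        if a[i] >= a[j]:
            break
        a[i], a[j] = a[j], a[i]
        i = j


def heap_sort(a):
    """Sort a in place: heapify with fix-downs, then repeatedly move the max to the end."""
    n = len(a)
    for i in range((n - 2) // 2, -1, -1):   # parent(last()) down to the root
        fix_down(a, i, n)
    while n > 1:
        a[0], a[n - 1] = a[n - 1], a[0]     # current maximum goes to its final place
        n -= 1
        fix_down(a, 0, n)
    return a
```

Because the heap lives in the input array itself, only a constant amount of auxiliary space is used.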

Problem: Find the $k$ th smallest item in an array $A$ of $n$ numbers.

Solution 1: Make $k + 1$ passes through the array, deleting the minimum number each time. Complexity $Θ (kn)$ .

Solution 2: Sort $A$ , then return $A [k]$ . Complexity $Θ (n \log n)$ .

Solution 3: Create a min-heap with heapify(A). Call delete-min(A) $k + 1$ times. Complexity: $Θ (n + k \log n)$ .
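Solution 3 can be sketched with Python's standard-library `heapq` module, whose `heapify` builds a min-heap in linear time. The function name `kth_smallest` and the decision to work on a copy of the input are assumptions made for this illustration; $k$ is 0-based, as in the text.

```python
import heapq


def kth_smallest(a, k):
    """Return the item that would be at index k of sorted(a), for 0 <= k < len(a)."""
    h = list(a)              # work on a copy so the input is left unchanged
    heapq.heapify(h)         # build a min-heap in linear time
    for _ in range(k):       # discard the k smallest items
        heapq.heappop(h)
    return heapq.heappop(h)  # the (k+1)-th delete-min yields the answer
```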

Question

We can achieve $Θ (n \log n)$ worst-case time easily, but can we do better?

Yes.

Module 3: Sorting, Average-case and Randomization

We introduce (and solve) a new problem, and then analyze the average-case run-time of our algorithm.

Recall the definition of average-case run-time:

$$T_{A}^{avg} (n) = \sum_{\text{instance } I \text{ of size } n} T_{A} (I) \cdot (\text{relative frequency of } I)$$

Assume that the set $I_{n}$ of size- $n$ instances is finite. Assume that all instances can occur equally frequently. Then we can use the following simplified formula

$$T^{avg} (n) = \frac{\sum_{I : \text{size} (I) = n} T (I)}{\# \text{instances of size } n} = \frac{1}{∣ I_{n} ∣} \sum_{I \in I_{n}} T (I)$$

To learn how to analyze this, we will do simpler examples first.

silly-test(π, n)
π: a permutation of {0, ..., n-1}, stored as an array
if π[0] = 0 then for j = 1 to n do print 'bad case'
else print 'good case'

$$T^{avg} (n) = \frac{1}{n!} \sum_{π \in Π_{n}} T (π) = \frac{1}{n!} \left( \sum_{π \in Π_{n} \text{ in bad case}} T (π) + \sum_{π \in Π_{n} \text{ in good case}} T (π) \right)$$

$(n - 1)!$ permutations have $π [0] = 0 ⟹$ run-time $c \cdot n$ . The remaining $n! - (n - 1)!$ permutations have run-time $c$ .

$$\begin{aligned} T^{avg} (n) &= \frac{1}{n!} \left( \# \{π \in Π_{n} \text{ in bad case}\} \cdot c n + \# \{π \in Π_{n} \text{ in good case}\} \cdot c \right) \\ &= \frac{1}{n!} \left( (n - 1)! \cdot c n + (n! - (n - 1)!) \cdot c \right) \\ &\leq \frac{1}{n} c n + c = 2 c \in O (1) \end{aligned}$$

Another not-so-contrived example

all-0-test(w,n)
// test whether all entries of bitstring w[0..n-1] are 0
if (n=0) return true
if (w[n-1] = 1) return false
return all-0-test(w,n-1)

Define $T (w) = \#$ bit-comparisons on input $w$ . This is asymptotically the same as the run-time.

Worst-case run-time: Always go into the recursion until $n = 0$ . $T (n) = 1 + T (n - 1) = 1 + 1 + \dots + T (0) = n \in Θ (n)$ .

Best-case run-time: Return immediately. $T (n) = 1 \in Θ (1)$ .

Average-case run-time?

Recall $T^{avg} (n) = \frac{1}{∣ B_{n} ∣} \sum_{w \in B_{n}} T (w)$ , where $B_{n} = \{\text{bitstrings of length } n\}$ .

Recursive formula for one non-empty bitstring $w$ :

$$T (w) = \begin{cases} 1 & \text{if } w [n - 1] = 1 \\ 1 + T (\underbrace{w [0.. n - 2]}_{\text{length } n - 1}) & \text{otherwise} \end{cases}$$

Natural guess for the recursive formula for $T^{avg} (n)$ :

$$T^{avg} (n) = \underbrace{\frac{1}{2}}_{\text{half have } w [n - 1] = 1} \cdot 1 + \underbrace{\frac{1}{2}}_{\text{half have } w [n - 1] = 0} (1 + T^{???} (n - 1))$$

With ??? = worst, this holds with $\leq$ but gives a useless bound. With ??? = avg, it is not obvious that it holds.

$$\begin{aligned} T^{avg} (n) &= \frac{1}{∣ B_{n} ∣} \sum_{w \in B_{n}} T (w) \\ &= \frac{1}{∣ B_{n} ∣} \sum_{\substack{w \in B_{n} \\ w [n - 1] = 1}} 1 + \frac{1}{∣ B_{n} ∣} \sum_{\substack{w \in B_{n} \\ w [n - 1] = 0}} (1 + T (w [0.. n - 2])) \\ &= \frac{1}{2} + \frac{1}{2} + \frac{1}{∣ B_{n} ∣} \sum_{w^{'} \in B_{n - 1}} \sum_{\substack{w \in B_{n} \\ w [n - 1] = 0 \\ w [0.. n - 2] = w^{'}}} T (w^{'}) \end{aligned}$$

Observe: $w$ is uniquely determined by $w^{'}$ and $w [n - 1] = 0$ .

$$\begin{aligned} &= 1 + \frac{1}{∣ B_{n} ∣} \sum_{w^{'} \in B_{n - 1}} T (w^{'}) \\ &= 1 + \frac{∣ B_{n - 1} ∣}{∣ B_{n} ∣} \cdot \frac{1}{∣ B_{n - 1} ∣} \sum_{w^{'} \in B_{n - 1}} T (w^{'}) \\ &= 1 + \frac{1}{2} T^{avg} (n - 1) \end{aligned}$$

This recursion resolves to $T^{avg} (n) \in O (1)$ .
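The recursion can be sanity-checked by brute force for small $n$ . The sketch below enumerates all bitstrings of length $n$ and averages the number of bit-comparisons made by all-0-test; the helper names `comparisons` and `t_avg` are hypothetical, and the enumeration is exponential in $n$ , so it is only meant for small values.

```python
from itertools import product


def comparisons(w):
    """Number of bit-comparisons all-0-test makes on the bitstring w."""
    count = 0
    n = len(w)
    while n > 0:
        count += 1               # one comparison of w[n-1] with 1
        if w[n - 1] == 1:
            return count
        n -= 1
    return count


def t_avg(n):
    """Exact average of comparisons(w) over all 2^n bitstrings of length n."""
    words = list(product((0, 1), repeat=n))
    return sum(comparisons(w) for w in words) / len(words)
```

For small $n$ the computed averages satisfy `t_avg(n) == 1 + t_avg(n - 1) / 2` and stay below 2.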

Question

Why can’t we always write ‘avg’ for ’???’ in $T^{avg} (n) = 1 + \frac{1}{2} T^{???} (n - 1)$ ?

Consider

silly-all-0-test(w,n)
if (n = 0) then return true
if (w[n-1] = 1) then return false
if (n>1) then w[n-2] <- 0
return silly-all-0-test(w,n-1)

Only one more line of code in each recursion, so it is tempting to assume that the same formula applies.

But observe that now $T (w) = \begin{cases} 1 & \text{if } w [n - 1] = 1 \\ n & \text{if } w [n - 1] = 0 \end{cases}$

So $T^{avg} (n) = \frac{1}{2} \cdot 1 + \frac{1}{2} \cdot n = \frac{1 + n}{2} \in Θ (n)$ . The ‘obvious’ recursion did not hold. Average-case analysis is highly non-trivial for recursive algorithms.

We saw SELECTION: Given an array $A$ of $n$ numbers, and $0 \leq k < n$ , find the element that would be at position $k$ of the sorted array.

SELECTION can be done with heaps in time $Θ (n + k \log n)$ .

Special case: MEDIANFINDING = SELECTION with $k = ⌊ \frac{n}{2} ⌋$ . With previous approaches, this takes time $Θ (n \log n)$ , no better than sorting.

Question

Can we do selection in linear time?

Yes, the algorithm quick-select.

quick-select and the related quick-sort rely on two subroutines:

choose-pivot(A): Return an index $p$ in $A$ . We will use the pivot-value $v \leftarrow A [p]$ to rearrange the array.
partition(A,p): Rearrange $A$ and return pivot-index $i$ so that
- the pivot-value $v$ is in $A [i]$ ,
- all items in $A [0, \dots, i - 1]$ are $\leq v$ , and
- all items in $A [i + 1, \dots, n - 1]$ are $\geq v$ .

$p =$ index of pivot-value before partition (we choose it), $i =$ index of pivot-value after partition (no control).

Partition Algorithm

partition(A,p)
v <- A[p]
Create empty lists smaller, equal and larger
for each element x in A do
	if x < v then smaller.append(x)
	else if x > v then larger.append(x)
	else equal.append(x)
i <- smaller.size
j <- equal.size
Overwrite A[0...i-1] by elements in smaller
Overwrite A[i...i+j-1] by elements in equal
Overwrite A[i+j...n-1] by elements in larger
return i

Efficient in-place partition (Hoare)

partition(A,p)
swap(A[n-1], A[p])
i <- -1
j <- n-1
v <- A[n-1]
loop
	do i <- i+1 while A[i] < v
	do j <- j-1 while j > i and A[j] > v
	if i >= j then break 
	else swap(A[i], A[j])
swap(A[n-1], A[i])
return i

Running time $Θ (n)$ .
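A Python sketch of the in-place partition follows. The do-while loops of the pseudocode are emulated by performing the first step before each `while` loop, which is an implementation detail chosen here.

```python
def partition(a, p):
    """Rearrange a around the pivot-value a[p] and return the final pivot index."""
    n = len(a)
    a[n - 1], a[p] = a[p], a[n - 1]      # park the pivot at the end
    i, j, v = -1, n - 1, a[n - 1]
    while True:
        i += 1
        while a[i] < v:                  # stops at index n-1 at the latest
            i += 1
        j -= 1
        while j > i and a[j] > v:
            j -= 1
        if i >= j:
            break
        a[i], a[j] = a[j], a[i]
    a[n - 1], a[i] = a[i], a[n - 1]      # move the pivot to its final position
    return i
```

Each index is visited a constant number of times by `i` or `j`, which matches the linear running time.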

quick-select(A,k)
p <- choose-pivot(A)
i <- partition(A,p)
if i = k then return A[i]
else if i > k then return quick-select(A[0...i-1], k)
else if i < k then return quick-select(A[i+1...n-1], k-(i+1))

Let $T (n, k)$ be the number of key-comparisons in a size- $n$ array with parameter $k$ . partition uses $n$ key-comparisons.

Worst-case run-time: Sub-array always gets smaller, so $\leq n$ recursions. Each takes $\leq n$ comparisons $⟹ O (n^{2})$ time. This is tight: If the pivot-value is always the maximum and $k = 0$ , then $T^{worst} (n, 0) \geq n + (n - 1) + (n - 2) + \dots + 1 \in Ω (n^{2})$ .

Best-case run-time: First chosen pivot could be the $k$ th element. No recursive calls; $T (n, k) = n \in Θ (n)$ .

Average case analysis is complicated. We detour into a randomized version.

A randomized algorithm is one which relies on some random numbers in addition to the input. Randomization is a good idea if an algorithm has bad worst-case time but seems to perform much better on most instances.

The run-time will depend on the input and the random numbers used. We assume that there exists a pseudo-random number generator.

Goal: Shift the dependency of run-time from what we can’t control (input) to what we can control (random numbers).

randomized-all-0-test(w,n)
if n = 0 return true
if (random(2)=0) then
	w[n-1] <- 1 - w[n-1]
if w[n-1] = 1 return false
return randomized-all-0-test(w,n-1)

This is all-0-test, except that we flip the last bit based on a coin toss. random(n) returns an integer chosen uniformly from $\{0, 1, 2, \dots, n - 1\}$ .

The run-time of the algorithm now depends on the random numbers. Define $T_{A} (I, R)$ to be the run-time of a randomized algorithm $A$ for an instance $I$ and the sequence $R$ of outcomes of random trials.

The expected run-time $T^{exp} (I)$ for instance $I$ is the expected value:

$$T^{exp} (I) = E [T (I, R)] = \sum_{R} T (I, R) \cdot \Pr (R)$$

Now take the maximum over all instances of size $n$ to define the expected run-time (formally: worst-instance expected-luck run-time) of $A$ .

$$T^{exp} (n) := \max_{I \in I_{n}} T^{exp} (I)$$

Expected run-time of randomized-all-0-test.

Define $T (w, R) := \#$ bit-comparisons used on input $w$ if the random outcomes are $R$ .

The random outcomes $R$ consist of two parts $R = ⟨ x, R^{'} ⟩$ .

$x :$ outcome of first coin toss
$R^{'} :$ random outcomes (if any) for the recursions

We have $\Pr (R) = \Pr (x) \cdot \Pr (R^{'})$ .

Recursive formula for one instance:

$$T (w, R) = T (w, ⟨ x, R^{'} ⟩) = \begin{cases} 1 & \text{if } x = w [n - 1] \\ 1 + T (w [0 \dots n - 2], R^{'}) & \text{otherwise} \end{cases}$$

Natural guess for the recursive formula for $T^{exp} (n) :$

$$T^{exp} (n) = \underbrace{\frac{1}{2}}_{\Pr (x = w [n - 1])} \cdot 1 + \underbrace{\frac{1}{2}}_{\Pr (x \neq w [n - 1])} (1 + T^{exp} (n - 1)) = 1 + \frac{1}{2} T^{exp} (n - 1)$$

In contrast to average-case analysis, the natural guess usually is correct for the expected run-time.

Proof for randomized-all-0-test.

$$\begin{aligned} T^{exp} (w) &= \sum_{R} \Pr (R) \, T (w, R) = \sum_{x} \sum_{R^{'}} \Pr (x) \Pr (R^{'}) \, T (w, ⟨ x, R^{'} ⟩) \\ &= \Pr (x = w [n - 1]) \sum_{R^{'}} \Pr (R^{'}) \cdot 1 + \Pr (x \neq w [n - 1]) \sum_{R^{'}} \Pr (R^{'}) (1 + T (w [0 \dots n - 2], R^{'})) \\ &= \frac{1}{2} + \frac{1}{2} + \frac{1}{2} \underbrace{\sum_{R^{'}} \Pr (R^{'}) \cdot T (w [0 \dots n - 2], R^{'})}_{T^{exp} (\text{some instance of size } n - 1)} \\ &\leq 1 + \frac{1}{2} \max_{w^{'} \in B_{n - 1}} T^{exp} (w^{'}) = \underbrace{1 + \frac{1}{2} T^{exp} (n - 1)}_{\text{holds for all } w} \end{aligned}$$

Therefore $T^{exp} (n) = \max_{w \in B_{n}} T^{exp} (w) \leq 1 + \frac{1}{2} T^{exp} (n - 1)$ .

Question

Does the expected time of a randomized version always have something to do with the average-case time?

Not always. It does if the randomization is a shuffle (i.e. it effectively chooses the instance randomly).

Average-case run-time and expected run-time are not the same. Average-case is the average over instances usually applied to a deterministic algorithm. Expected is the weighted average over random outcomes, applied only to a randomized algorithm.

For analyzing the average-case run-time of quick-select, we make two assumptions:

All input-items are distinct.
All possible orders of the input-items occur equally often.

Goal is to create a randomized version of quick-select.

First idea: Shuffle the input, then do quick-select.

shuffled-quick-select(A,k)
for (j <- 1 to n-1) do swap(A[j], A[random(j+1)])
quick-select(A,k)
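The shuffling loop above can be written in Python as follows, with `random.randrange(j + 1)` playing the role of random(j+1). The function name `shuffle_in_place` is an illustrative assumption.

```python
import random


def shuffle_in_place(a):
    """Shuffle a in place: swap a[j] with a uniformly chosen a[0..j] for j = 1..n-1."""
    for j in range(1, len(a)):
        r = random.randrange(j + 1)      # uniform in {0, ..., j}
        a[j], a[r] = a[r], a[j]
    return a
```

The loop only permutes the items, so the multiset of values is unchanged.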

We know $T_{\text{quick-select}}^{avg} (n) = T_{\text{shuffled-quick-select}}^{exp} (n)$ .

Second idea: Do the shuffling inside the recursion.

randomized-quick-select(A,k)
swap A[n-1] with A[random(n)]
i <- partition(A,n-1)
if i = k then return A[i]
else if i > k then
	return randomized-quick-select(A[0 ... i-1], k)
else if i < k then
	return randomized-quick-select(A[i+1 ... n-1], k-(i+1))
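A Python sketch of randomized-quick-select is given below. Two deviations from the pseudocode are assumed for convenience: the recursion is replaced by a loop over an index range `[l, r]`, and `k` stays an absolute index, so no `k-(i+1)` adjustment is needed.

```python
import random


def partition(a, l, r, p):
    """Hoare-style partition of a[l..r] around a[p]; returns the pivot's final index."""
    a[r], a[p] = a[p], a[r]
    v = a[r]
    i, j = l - 1, r
    while True:
        i += 1
        while a[i] < v:
            i += 1
        j -= 1
        while j > i and a[j] > v:
            j -= 1
        if i >= j:
            break
        a[i], a[j] = a[j], a[i]
    a[r], a[i] = a[i], a[r]
    return i


def randomized_quick_select(a, k):
    """Return the item at index k of sorted(a), for 0 <= k < len(a); rearranges a."""
    l, r = 0, len(a) - 1
    while True:
        p = random.randint(l, r)         # uniform pivot index in [l, r]
        i = partition(a, l, r, p)
        if i == k:
            return a[i]
        if i > k:
            r = i - 1                    # continue in the left part
        else:
            l = i + 1                    # continue in the right part
```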

Let $T (A, k, R) = \#$ key-comparisons of randomized-quick-select on input $⟨ A, k ⟩$ if the random outcomes are $R$ .

Write random outcomes $R$ as $R = ⟨ i, R^{'} ⟩$ . Observe: $\Pr (\text{pivot-index is } i) = \frac{1}{n}$ . We recurse in an array of size $i$ or $n - i - 1$ .

$$T (A, k, ⟨ i, R^{'} ⟩) = n + \begin{cases} T (\text{size-} i \text{ array}, k, R^{'}) & \text{if } i > k \\ T (\text{size-} (n - i - 1) \text{ array}, k - i - 1, R^{'}) & \text{if } i < k \\ 0 & \text{otherwise} \end{cases}$$

As for randomized-all-0-test: over all $R$ , the recursions can use $T^{exp} (\text{array-size})$ .

$$\begin{aligned} T^{exp} (A, k) &= \sum_{R} P (R) \cdot T (⟨ A, k ⟩, R) = \sum_{i = 0}^{n - 1} \sum_{R^{'}} P (i) \cdot P (R^{'}) \cdot T (⟨ A, k ⟩, ⟨ i, R^{'} ⟩) \\ &= \frac{1}{n} \sum_{i = 0}^{k - 1} \sum_{R^{'}} P (R^{'}) (n + T (⟨ A [i + 1 \dots n - 1], k - i - 1 ⟩, R^{'})) + \frac{1}{n} \cdot n + \frac{1}{n} \sum_{i = k + 1}^{n - 1} \sum_{R^{'}} P (R^{'}) (n + T (⟨ A [0 \dots i - 1], k ⟩, R^{'})) \\ &= n + \frac{1}{n} \sum_{i = 0}^{k - 1} \sum_{R^{'}} P (R^{'}) \, T (⟨ A [i + 1 \dots n - 1], k - i - 1 ⟩, R^{'}) + \frac{1}{n} \sum_{i = k + 1}^{n - 1} \sum_{R^{'}} P (R^{'}) \, T (⟨ A [0 \dots i - 1], k ⟩, R^{'}) \\ &= n + \frac{1}{n} \sum_{i = 0}^{k - 1} \underbrace{T^{exp} (⟨ A [i + 1 \dots n - 1], k - i - 1 ⟩)}_{\leq T^{exp} (n - i - 1)} + \frac{1}{n} \sum_{i = k + 1}^{n - 1} \underbrace{T^{exp} (⟨ A [0 \dots i - 1], k ⟩)}_{\leq T^{exp} (i)} \\ &\leq \underbrace{n + \frac{1}{n} \sum_{i = 0}^{n - 1} \max \{T^{exp} (i), T^{exp} (n - i - 1)\}}_{\text{independent of } A, k} \end{aligned}$$

Tedious but straightforward. This recursion resolves to $O (n)$ .

randomized-quick-select is generally the fastest solution to the selection problem.

Hoare developed partition and quick-select. He then used them to sort based on partitioning.

quick-sort(A)
if n <= 1 then return
p <- choose-pivot(A)
i <- partition(A,p)
quick-sort(A[0, 1, ..., i-1])
quick-sort(A[i+1, ..., n-1])
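For illustration, here is a short Python sketch of quick-sort. To keep it brief it uses the simple three-list partition (smaller, equal, larger) rather than the in-place partition, returns a new list instead of sorting in place, and picks the pivot uniformly at random; all three choices are assumptions of this sketch.

```python
import random


def quick_sort(a):
    """Return a sorted copy of a, partitioning around a randomly chosen pivot-value."""
    if len(a) <= 1:
        return list(a)
    v = a[random.randrange(len(a))]          # pivot-value
    smaller = [x for x in a if x < v]
    equal = [x for x in a if x == v]
    larger = [x for x in a if x > v]
    return quick_sort(smaller) + equal + quick_sort(larger)
```

Since `equal` always contains at least the pivot, both recursive calls receive strictly shorter lists, so the recursion terminates.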

Worst-case run-time: $Θ (n^{2})$

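Best-case run-time: $Θ (n \log n)$ (the pivot always splits the array into two halves, giving the same recurrence as merge-sort)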

Average-case run-time we prove via randomization.

$T^{exp} (n) \leq n + \frac{1}{n} \sum_{i = 0}^{n - 1} (T^{exp} (i) + T^{exp} (n - i - 1)) = n + \frac{2}{n} \sum_{i = 1}^{n - 1} T^{exp} (i)$ .

$T^{exp} (n) \in O (n \log n)$

We have seen many sorting algorithms.

Question

Can one do better than $Θ (n \log n)$ running time?

Yes and no, depends on what we allow.

No: Comparison-based sorting lower bound is $Ω (n \log n)$

Yes: Non-comparison-based sorting can achieve $O (n)$ (under restrictions).

All algorithms so far are comparison-based. Data is accessed by comparing two elements, and moving elements around.

Theorem

Any comparison-based sorting algorithm requires in the worst case $Ω (n \log n)$ comparisons to sort $n$ distinct items.

Easy to understand via a decision tree.
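A sketch of the standard counting argument: a decision tree that sorts $n$ distinct items needs at least $n!$ leaves (one per possible input order), and a binary tree of height $h$ has at most $2^{h}$ leaves. Hence the height $h$ , which is the worst-case number of comparisons, satisfies:

```latex
h \;\geq\; \log(n!) \;=\; \sum_{i=1}^{n} \log i
  \;\geq\; \sum_{i=\lceil n/2 \rceil}^{n} \log i
  \;\geq\; \frac{n}{2}\,\log\frac{n}{2}
  \;\in\; \Omega(n \log n)
```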

Non-Comparison-Based Sorting

Assume keys are numbers in base $R$ ( $R$ : radix). All digits are in $\{0, \dots, R - 1\}$ .

Assume all keys have the same number $m$ of digits. This can be achieved by padding with leading $0$ s.

Can sort based on individual digits.

bucket-sort(A, n, sort-key())
sort-key(): maps items of A to {0,...,R-1}
Initialize an array B[0..R-1] of empty queues (buckets)
for i <- 0 to n-1 do
	Append A[i] at the end of B[sort-key(A[i])]
i <- 0
for j <- 0 to R-1 do
	while B[j] is non-empty do
		move front element of B[j] to A[i++]

bucket-sort is stable: items with equal sort-key stay in their original order.

Run-time is $Θ (n + R)$ , auxiliary space $Θ (n + R)$ . It is possible to replace the lists by arrays $⇝$ count-sort.
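A Python sketch of bucket-sort, using `collections.deque` objects as the queues. The parameter order and the explicit radix argument `R` are assumptions made for this illustration.

```python
from collections import deque


def bucket_sort(a, R, sort_key):
    """Stable sort of a in place by sort_key(item), which must lie in {0, ..., R-1}."""
    buckets = [deque() for _ in range(R)]
    for item in a:
        buckets[sort_key(item)].append(item)    # append at the end keeps input order
    i = 0
    for b in buckets:
        while b:
            a[i] = b.popleft()                  # take items from the front
            i += 1
    return a
```

For example, sorting `[21, 13, 11, 23, 12]` by the tens digit yields `[13, 11, 12, 21, 23]`: items with the same digit keep their original relative order.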

Most-significant-digit(MSD)-radix-sort

Sort array of $m$ -digit radix- $R$ numbers recursively: sort by 1st digit, then each group by 2nd digit etc.

MSD-radix-sort(A,n,d<-1)
A: array of size n, contains m-digit radix-R numbers
if (d <= m and (n > 1))
	bucket-sort(A, n, return dth digit of A[i])
	l <- 0
	for j <- 0 to R-1
		Let r >= l - 1 be maximal s.t. A[l..r] have dth digit j
		MSD-radix-sort(A[l..r], r - l + 1, d+1)
		l <- r + 1

$Θ (m)$ levels of recursion in worst-case
$Θ (n)$ subproblems on most levels in worst-case
$Θ (R + (\text{size of sub-array}))$ time for each bucket-sort call.

Run-time $Θ (m n R)$ , which is slow: there are many recursions and many allocated arrays.

LSD-radix-sort(A,n)
A: array of size n, contains m-digit radix-R numbers
for d <- least significant to most significant do
	bucket-sort(A,n,return dth digit of A[i])

Time cost $Θ (m (n + R))$ with auxiliary space $Θ (n + R)$ .
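LSD-radix-sort can be sketched in Python as below, assuming the keys are non-negative integers with at most `m` digits in base `R`. The digit extraction via integer division and the use of plain lists as buckets are choices of this sketch.

```python
def lsd_radix_sort(a, m, R=10):
    """Sort non-negative integers with at most m base-R digits, in place."""
    for d in range(m):                       # d = 0 is the least significant digit
        buckets = [[] for _ in range(R)]
        for x in a:
            buckets[(x // R ** d) % R].append(x)   # stable: input order kept per bucket
        a[:] = [x for b in buckets for x in b]
    return a
```

Correctness relies on each pass being stable: after processing digit `d`, the keys are sorted with respect to their `d + 1` least significant digits.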

Module 4: Dictionaries

Definition

Dictionary: An ADT consisting of a collection of items, each of which contains a key and some data. Such an item is called a key-value pair (KVP). Keys can be compared and are typically unique.

Common assumptions:

Dictionary has $n$ KVPs, each KVP uses constant space, keys can be compared in constant time.

In an unordered array or linked list, search is $Θ (n)$ , insertion is $Θ (1)$ , delete is $Θ (n)$ .

In an ordered array, search is $Θ (\log n)$ , insertion is $O (n)$ , delete is $Θ (n)$ .

Recall binary search only applies to a sorted array.

Binary search trees: All nodes have two (possibly empty) subtrees. Every node stores a KVP, empty subtrees aren’t shown.

Every key $k$ in $T . left$ is less than the root key, and every key $k$ in $T . right$ is greater than the root key.

BST as a realization of ADT dictionary.

BST::search(k): Start at root, compare $k$ to current node’s key. Stop if found or subtree empty, else recurse at subtree.

Deletion in a BST: First search for the node $x$ that contains the key. If $x$ is a leaf, delete it. If $x$ has one non-empty subtree, move the child up. Otherwise, swap the key at $x$ with the key at the successor node and then delete that node. The successor is the node with the smallest key larger than the key at $x$ , i.e. the leftmost node in the right subtree of $x$ .

BST::search, BST::insert, BST::delete all have cost $Θ (h)$ , where $h =$ height of tree $=$ max path length from root to leaf.

Question

If $n$ items are inserted one-at-a-time, how big is $h$ ?

Worst-case: $n - 1 = Θ (n)$

Best-case: $Θ (lo g n)$ . Any binary tree with $n$ nodes has height $h \geq lo g (n + 1) - 1$ . Layer $i$ has at most $2^{i}$ nodes. So $n \leq \sum_{i = 0}^{h} 2^{i} = 2^{h + 1} - 1$ .

Goal is to create a subclass of BSTs where the height is always good. We need to impose a structural property, argue that the property implies logarithmic height, and discuss how to maintain the property during operations.

AVL Trees: An AVL Tree is a BST with an additional height-balance property at every node: The heights of left and right subtree differ by at most $1$ .

If node $v$ has left subtree $L$ and right subtree $R$ , then

$balance (v) : = height (R) - height (L)$ must be in $\{- 1, 0, 1\}$ .
$balance (v) = - 1$ means $v$ is left-heavy.
$balance (v) = + 1$ means $v$ is right-heavy.

Need to store at each node $v$ the height of the subtree rooted at it.

Theorem

An AVL tree on $n$ nodes has $Θ (lo g n)$ height. Hence search, insert, and delete all cost $Θ (lo g n)$ in the worst case.

AVL Insertion (AVL::insert(k,v))

First, insert $(k, v)$ with the usual BST insertion. We assume that this returns the new leaf $z$ where the key was stored. Then, move up the tree from $z$ , updating the height at each node (we can do this in constant time). If the height difference becomes $\pm 2$ at a node $z$ , then $z$ is unbalanced and we must re-structure the tree to rebalance.

There are many different BSTs with the same keys.

The goal is to change the structure locally without changing the order. The long-term goal is to restructure so that the subtree becomes balanced.

Right rotation on node $z$

flowchart TD
z((z)) --> c((c))
z --> D[/D\]
c --> g
c --> C[/C\]
g((g)) --> A[/A\]
g --> B[/B\]

Becomes

flowchart TD
c((c)) --> g((g))
c --> z((z))
g --> A[/A\]
g --> B[/B\]
z --> C[/C\]
z --> D[/D\]

rotate-right(z)
c <- z.left, z.left <- c.right, c.right <- z
setHeightFromSubtrees(z), setHeightFromSubtrees(c)
return c
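A small Python sketch of rotate-right, assuming each node stores its height and that an empty subtree has height -1 (so a single node has height 0). The class `Node` and the helpers `height` and `inorder` are illustrative names, not part of the notes.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right
        self.height = 0
        set_height_from_subtrees(self)

def height(v):
    # Assumed convention: an empty subtree has height -1.
    return -1 if v is None else v.height

def set_height_from_subtrees(v):
    v.height = 1 + max(height(v.left), height(v.right))

def rotate_right(z):
    c = z.left
    z.left = c.right        # subtree C moves from c to z
    c.right = z
    set_height_from_subtrees(z)   # z is now below c, so update z first
    set_height_from_subtrees(c)
    return c                # c is the new root of this subtree

def inorder(v):
    return [] if v is None else inorder(v.left) + [v.key] + inorder(v.right)
```

The in-order sequence of keys is the same before and after the rotation; only the shape and the stored heights change.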

Left rotation is used to fix a right-right imbalance.

A double right rotation on node $z$ starts with a left rotation at $c$ , then a right rotation at $z$ .

flowchart TD
z((z)) --> c((c))
z --> D[/D\]
c --> A[/A\]
c --> g((g))
g --> B[/B\]
g --> C[/C\]

Becomes

flowchart TD
z((z)) --> g((g))
z --> D[/D\]
g --> c
g --> C[/C\]
c((c)) --> A[/A\]
c --> B[/B\]

Which then becomes

flowchart TD
g((g)) --> c((c))
g --> z((z))
c --> A[/A\]
c --> B[/B\]
z --> C[/C\]
z --> D[/D\]

Symmetrically, there is a double left rotation.

AVL insertion revisited. When there is an imbalance at $z$ do (single or double) rotation. Choose $c$ as child where subtree has bigger height.

AVL::insert(k,v)
z <- BST::insert(k,v)
while(z is not NULL)
	if (|z.left.height - z.right.height| > 1) then
		Let c be taller child of z
		Let g be taller child of c
		restructure(g,c,z)
		break
	setHeightFromSubtrees(z)
	z <- z.parent

We can argue that for insertion, one rotation restores all heights of subtrees.

restructure(g,c,z)
node g is a child of c which is a child of z
p <- z.parent
u <- root of the subtree after applying the appropriate (single or double) rotation at z
make u the appropriate child of p and return u

The middle key of $g, c, z$ becomes the new root.
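A hedged Python sketch of AVL insertion. It is a recursive variant rather than the parent-pointer loop used in the pseudocode above: rebalancing happens as the recursion returns towards the root. All names are illustrative, and empty subtrees are assumed to have height -1.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 0                      # a single node has height 0

def height(v):
    return -1 if v is None else v.height     # assumed convention for empty subtrees

def set_height_from_subtrees(v):
    v.height = 1 + max(height(v.left), height(v.right))

def balance(v):
    return height(v.right) - height(v.left)

def rotate_right(z):
    c = z.left
    z.left, c.right = c.right, z
    set_height_from_subtrees(z)
    set_height_from_subtrees(c)
    return c

def rotate_left(z):
    c = z.right
    z.right, c.left = c.left, z
    set_height_from_subtrees(z)
    set_height_from_subtrees(c)
    return c

def avl_insert(root, key):
    """Insert key into the subtree at root and return the new subtree root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = avl_insert(root.left, key)
    else:
        root.right = avl_insert(root.right, key)
    set_height_from_subtrees(root)
    if balance(root) < -1:                   # left-heavy: taller child c is root.left
        if balance(root.left) > 0:           # taller grandchild is c.right: double rotation
            root.left = rotate_left(root.left)
        return rotate_right(root)
    if balance(root) > 1:                    # right-heavy: symmetric case
        if balance(root.right) < 0:
            root.right = rotate_right(root.right)
        return rotate_left(root)
    return root
```

Usage: start with `root = None` and repeatedly assign `root = avl_insert(root, k)`.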

AVL Deletion (AVL::delete(k))

Remove the key $k$ with BST::delete. Find node where structure change happened. Go back up to the root, update heights, and rotate if needed.

AVL::delete(k)
z <- BST::delete(k)
Assume z is the parent of the BST node that was removed
while (z is not NULL)
	if (|z.left.height - z.right.height| > 1) then
		Let c be taller child of z
		Let g be taller child of c
		z <- restructure(g,c,z)
	setHeightFromSubtrees(z)
	z <- z.parent

If both children of $c$ have the same height, the tie must be broken so that a single rotation is used rather than a double rotation.

AVL Tree Summary

Search: $Θ (height)$
Insert: $Θ (height)$ . Restructure will be called at most once
Delete: $Θ (height)$ . Restructure may be called $Θ (height)$ times.

Worst-case cost for all operations is $Θ (height) = Θ (lo g n)$ .

Module 5: Other Dictionary Implementations

So far we’ve seen unordered arrays, ordered arrays, binary search trees, and balanced binary search trees.

If the KVPs were inserted in random order, then the expected height of the binary search tree would be $O (lo g n)$ .

We did not consider an ordered list as a realization of ADT dictionary. insert and delete take $Θ (1)$ time in an ordered list, once we know where to place them. The bottleneck is in search.

In an ordered array, we can do binary search to achieve $O (lo g n)$ . In an ordered list, we cannot skip to the middle, and so cannot do binary search. Search therefore takes $Θ (n)$ time in an ordered list, which is slow.

To speed up search in an ordered list, add links to help us move forward quicker. Choose randomly when to add links.

Skip Lists

A hierarchy of ordered linked lists (levels) $L_{0}, L_{1}, \dots, L_{h}$ .

Each list $L_{i}$ contains the special keys $- \infty$ and $+ \infty$ (sentinels). List $L_{0}$ contains the KVPs of $S$ in non-decreasing order. Each list is a subsequence of the previous one, i.e., $L_{0} \supseteq L_{1} \supseteq \dots \supseteq L_{h}$ . List $L_{h}$ contains only the sentinels.

A node is an entry in one list, a KVP is one non-sentinel entry in $L_{0}$ .

There are usually more nodes than KVPs. Each node $p$ has references $p . after$ and $p . below$ . Each key $k$ belongs to a tower of nodes. Height of tower of $k$ : maximal index $i$ such that $k \in L_{i}$ . Height of skip list: maximal index $h$ such that $L_{h}$ exists.

Search in skip list. For each list, find predecessor.

get-predecessors(k)
p <- root
P <- stack of nodes, initially containing p
while p.below != NULL do
	p <- p.below
	while p.after.key < k do p <- p.after
	P.push(p)
return P

skipList::search(k)
P <- get-predecessors(k)
p_0 <- P.top()
if p_0.after.key = k return KVP at p_0.after
else return "not found, but would be after p_0"

It is easy to remove a key since we can find all predecessors. We eliminate lists if there are multiple ones with only sentinels.

skipList::delete(k)
P <- get-predecessors(k)
while P is non-empty
	p <- P.pop()
	if p.after.key = k
		p.after <- p.after.after
	else break
 
p <- left sentinel of the root-list
while p.below.after is the infinity-sentinel
	// the two top lists are both only sentinels, remove one
	p.below <- p.below.below
	p.after.below <- p.after.below.below

Inserting in skip lists skipList::insert(k,v). There is no choice as to where to put the tower of $k$ . Only choice, how tall should we make the tower of $k$ ? We choose randomly.

Repeatedly toss a coin until you get tails. Let $i$ be the number of times the coin came up heads. We want key $k$ to be in lists $L_{0}, \dots, L_{i}$ , so $i$ is the height of the tower of $k$ .

$P (\text{tower of key } k \text{ has height} \geq i) = (\frac{1}{2})^{i}$

Before we can insert, we must check that these lists exist. Then we do the actual insertion. Use get-predecessors(k) to get stack $P$ . The top $i + 1$ items of $P$ are the predecessors $p_{0}, p_{1}, \dots, p_{i}$ of where $k$ should be in each list $L_{0}, L_{1}, \dots, L_{i}$ . Insert $(k, v)$ after $p_{0}$ in $L_{0}$ , and $k$ after $p_{j}$ in $L_{j}$ for $1 \leq j \leq i$ .

skipList::insert(k,v)
for (i <- 0; random(2) = 1; i++) {}
for (h <- 0, p <- root.below; p != NULL; p <- p.below, h++) {}
while i >= h
	create new sentinel-only list; link it in below topmost list
	h++
P <- get-predecessors(k)
p <- P.pop()
zbelow <- new node with (k,v)
zbelow.after <- p.after; p.after <- zbelow
while i > 0
	p <- P.pop()
	z <- new node with k
	z.after <- p.after; p.after <- z; z.below <- zbelow; zbelow <- z
	i <- i - 1
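A minimal Python sketch of a skip list with get-predecessors, search, and insert. It is an illustration under assumptions: the class names, the injected random generator `rng`, and the choice to add new sentinel-only lists on top (rather than linking them in below the topmost list, which yields the same structure because the topmost list holds only sentinels) are not from the notes.

```python
import random

NEG_INF, POS_INF = float("-inf"), float("inf")

class SkipNode:
    def __init__(self, key, value=None):
        self.key, self.value = key, value
        self.after = None
        self.below = None

class SkipList:
    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        left, right = SkipNode(NEG_INF), SkipNode(POS_INF)
        left.after = right
        self.root = left            # this is L_0 for now
        self._add_top_list()        # topmost list must contain only sentinels

    def _add_top_list(self):
        left, right = SkipNode(NEG_INF), SkipNode(POS_INF)
        left.after = right
        left.below, right.below = self.root, self.root.after
        self.root = left

    def _height(self):
        h, p = 0, self.root.below
        while p is not None:
            h, p = h + 1, p.below
        return h

    def get_predecessors(self, k):
        p = self.root
        P = [p]                     # Python list used as a stack
        while p.below is not None:
            p = p.below
            while p.after.key < k:
                p = p.after
            P.append(p)
        return P

    def search(self, k):
        p0 = self.get_predecessors(k)[-1]
        return p0.after.value if p0.after.key == k else None

    def insert(self, k, v):
        i = 0
        while self.rng.random() < 0.5:   # count heads until the first tails
            i += 1
        while i >= self._height():       # make sure lists L_0..L_i exist below the top
            self._add_top_list()
        P = self.get_predecessors(k)
        p = P.pop()
        z_below = SkipNode(k, v)
        z_below.after, p.after = p.after, z_below
        while i > 0:                     # build the tower in L_1, ..., L_i
            p = P.pop()
            z = SkipNode(k)
            z.after, p.after = p.after, z
            z.below, z_below = z_below, z
            i -= 1
```

Passing a seeded generator such as `random.Random(1)` makes the tower heights reproducible, which is convenient for experiments.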

Analysis of skip lists.

Expected space usage: $O (n)$ . Expected height: $O (lo g n)$ .

skipList::get-predecessors: $O (lo g n)$ expected time.

Search, insert, delete: $O (lo g n)$ expected time.

Recall unsorted array realization:

Search: $Θ (n)$
Insert: $Θ (1)$
Delete: $Θ (1)$ (after a search)

Question

Can we do something to make search more effective in practice?

No, not if items are accessed with equal likelihood: we can show that the average-case cost for search is then $Θ (n)$ .

We can make search more effective if search requests are biased (some items are accessed much more frequently than others). Intuition is that frequently accessed items should be in the front. Two scenarios - do we know the access distribution beforehand or not?

Optimal Static Ordering

Scenario: We know access distribution, and want the best order of a list.

Key	A	B	C	D	E
Frequency of access	2	8	1	10	5

Recall:

$T^{avg} (n) = \sum_{I \in I_{n}} T (I) \cdot (\text{relative frequency of } I)$
$=$ expected run-time on randomly chosen input
$= \sum_{I \in I_{n}} T (I) \cdot Pr (\text{randomly chosen instance is } I)$

Count cost $i$ if the search-key ( $=$ instance $I$ ) is at the $i$ th position ( $i \geq 1$ ).

$T^{avg} (n) =$ expected access cost $= \sum_{i \geq 1} i \cdot Pr (\text{search for key at position } i)$ , where $Pr (\text{search for key at position } i)$ is the access-probability of that key.

Order $A \to B \to C \to D \to E$ has expected access cost of $\approx 3.31$ .

Order $D \to B \to E \to A \to C$ is better (access cost of $\frac{10 \cdot 1 + 8 \cdot 2 + 5 \cdot 3 + 2 \cdot 4 + 1 \cdot 5}{26} = \frac{54}{26} \approx 2.08$ ).

Claim: Over all possible static orderings, the one that sorts items by non-increasing access-probability minimizes the expected access cost.
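The expected access cost of a static ordering can be checked numerically. The sketch below assumes access probabilities are frequencies divided by their total; the function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

def expected_access_cost(order, freq):
    """Sum over positions i >= 1 of i * (access probability of the key at position i)."""
    total = sum(freq.values())
    return sum(Fraction(position * freq[key], total)
               for position, key in enumerate(order, start=1))

def best_static_order(freq):
    """Order keys by non-increasing access frequency."""
    return sorted(freq, key=lambda key: freq[key], reverse=True)

freq = {"A": 2, "B": 8, "C": 1, "D": 10, "E": 5}
order = best_static_order(freq)

# Brute force over all 5! orderings, to compare against the claim.
brute_force_min = min(expected_access_cost(p, freq) for p in permutations(freq))
```

For the table above, `best_static_order(freq)` yields D, B, E, A, C, and its cost can be compared with `brute_force_min`.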

Dynamic Ordering: MTF

Scenario: We do not know the access probabilities ahead of time.

Idea: Modify the order dynamically, i.e., while we are accessing. Rule of thumb (temporal locality), a recently accessed item is likely to be used again.

Move-To-Front heuristic (MTF): Upon a successful search, move the accessed item to the front of the list.
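A short Python sketch of the MTF heuristic on an unsorted list of key-value pairs. The class name `MTFList` is illustrative. A Python list is used for brevity, so moving to the front costs linear time here; a linked list would make the move itself constant time.

```python
class MTFList:
    def __init__(self):
        self.items = []                      # index 0 is the front of the list

    def insert(self, key, value):
        self.items.insert(0, (key, value))   # new items go to the front

    def search(self, key):
        for i, (k, v) in enumerate(self.items):
            if k == key:
                # Successful search: move the accessed item to the front.
                self.items.insert(0, self.items.pop(i))
                return v
        return None                          # unsuccessful search changes nothing

    def keys(self):
        return [k for k, _ in self.items]
```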

We can also do MTF on an array, but should then insert and search from the back so that we have room to grow.

There are other heuristics we can also use. Transpose heuristic: upon a successful search, swap the accessed item with the item immediately preceding it. Frequency-count heuristic: keep counters of how often items were accessed, and keep the list sorted by non-increasing count.

Module 6: Dictionaries for Special Keys

So far we’ve seen balanced binary search trees and skip lists. Various other realizations are sometimes faster on insert, but search always takes $Ω (lo g n)$ time.

Interpolation Search

Scenario: Numbers in sorted array [40, 50, 70, 90, 100, 110, 120].

binary-search(A,n,k)
l <- 0, r <- n-1
while (l <= r)
	m <- (l + r) / 2
	if (A[m] equals k) then return "found at A[m]"
	else if (A[m] < k) then l <- m + 1
	else r <- m -1 
return "not found"

Compare at index $⌊ \frac{ℓ + r}{2} ⌋ = ℓ + ⌊ \frac{1}{2} (r - ℓ) ⌋$ . If keys are numbers, where would you expect key $k = 100$ ?

Interpolation search is similar to binary search, but compares at index

$ℓ + ⌊ \frac{k - A [ℓ]}{A [r] - A [ℓ]} \cdot (r - ℓ) ⌋$

Here $k - A [ℓ]$ is the distance from the left key, $A [r] - A [ℓ]$ is the distance between the left and right keys, and $r - ℓ$ is the number of indices in the range.

interpolation-search(A,n,k)
l <- 0, r <- n-1
while (l <= r)
	if (k < A[l] or k > A[r]) return "not found"
	if (k = A[r]) then return "found at A[r]"
	m <- l + floor((k - A[l]) / (A[r] - A[l]) * (r - l))
	if (A[m] equals k) then return "found at A[m]"
	else if (A[m] < k) then l <- m + 1
	else r <- m-1

In the worst case, this can be very slow: $Θ (n)$ . But it works well on average. We can show $T^{avg} (n) \leq T^{avg} (\sqrt{n}) + Θ (1)$ . This resolves to $T^{avg} (n) \in O (lo g lo g n)$ .
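A runnable Python sketch of interpolation search, assuming a sorted list of integer keys. It returns an index instead of a message; the function name is illustrative.

```python
def interpolation_search(A, k):
    """Return an index m with A[m] == k, or None if k is not in the sorted list A."""
    l, r = 0, len(A) - 1
    while l <= r:
        if k < A[l] or k > A[r]:
            return None
        if k == A[r]:
            return r
        # Here A[l] <= k < A[r], so the denominator is positive and l <= m < r.
        m = l + (k - A[l]) * (r - l) // (A[r] - A[l])
        if A[m] == k:
            return m
        elif A[m] < k:
            l = m + 1
        else:
            r = m - 1
    return None
```

On the array [40, 50, 70, 90, 100, 110, 120] with k = 100, the first probe is at index 0 + ⌊(60 / 80) · 6⌋ = 4, which already holds the key.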

Scenario: Keys in dictionary are words. Need brief review.

Words ( $=$ strings): sequences of characters over an alphabet $Σ$ , for example be, bear, beer.

Convention: Words have an end-sentinel $\$$ . w.size $= ∣ w ∣ = \#$ non-sentinel characters, e.g. $|\text{be}\$| = 2$ .

Trie (a radix tree): A dictionary for bit-strings. Comes from retrieval. A tree based on bitwise comparisons, edge labelled with corresponding bit. Similar to radix sort: use individual bits, not the whole key. Due to end-sentinels, all key-value pairs are at leaves.

Tries: Search

Follow the links that correspond to the current bits in $w$ . Repeat until there is no such link or $w$ is found at a leaf. As for skip lists, we find the search-path separately first.

Trie::get-path-to(w)
P <- empty stack; z <- root; d <- 0; P.push(z)
while d <= |w|
	if z has a child-link labelled with w[d]
		z <- child at this link; d++; P.push(z)
	else break
return P

Trie::search(w)
P <- get-path-to(w), z <- P.top()
if (z is not a leaf) then
	return "not found, would be in sub-trie of z"
return key-value pair at z
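A minimal Python sketch of a trie with get-path-to, search, and insert. It is an illustration under assumptions: children are stored in a dict, the end-sentinel is the character "$" (so words must not contain it), and all names are hypothetical. It works for any alphabet, not just bits.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.value = None      # only set at leaves

class Trie:
    SENTINEL = "$"             # assumed end-sentinel; words must not contain it

    def __init__(self):
        self.root = TrieNode()

    def get_path_to(self, w):
        s = w + self.SENTINEL
        z, d = self.root, 0
        P = [z]                # Python list used as a stack
        while d < len(s):
            if s[d] in z.children:
                z = z.children[s[d]]
                d += 1
                P.append(z)
            else:
                break
        return P

    def search(self, w):
        P = self.get_path_to(w)
        # The path reaches a leaf for w only if every character and the sentinel matched.
        if len(P) != len(w) + 2:
            return None
        return P[-1].value

    def insert(self, w, value):
        P = self.get_path_to(w)
        z = P[-1]
        s = w + self.SENTINEL
        for ch in s[len(P) - 1:]:          # characters not yet on the path
            z.children[ch] = TrieNode()
            z = z.children[ch]
        z.value = value
```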

For later applications of tries, we want another search-operation. prefix-search(w): find word $w^{'}$ in trie for which $w$ is a prefix. Testing whether $w^{'}$ exists is easy.

To find $w^{'}$ quickly, we need leaf-references. Every node $z$ stores reference $z . leaf$ to a leaf in subtree. Convention: store leaf with the longest word.

Trie::prefix-search(w)
P <- get-path-to(w), p <- P.top()
if number of nodes on P is w.size or less
	return "no extension of w found"
return p.leaf

Trie::insert(w). $P \leftarrow$ get-path-to(w) gives ancestors that exist already. Expand the trie from $P . t o p ()$ by adding necessary nodes that correspond to extra bits of $w$ . Update leaf-references.

search(w), prefix-search(w), insert(w), delete(w) all take time $Θ (∣ w ∣)$ . Search time is independent of the number $n$ of words stored in the trie! Search time is small for short words.

The trie for a given set of words is unique (except for order of children and ties among leaf-references).

Tries can be wasteful with respect to space. Worst-case space is $Θ (n \cdot (maximum length of a word))$ .

Question

What can we do to save space?

Pruned Tries: Stop adding nodes to trie as soon as the key is unique.

A node has a child only if it has at least two descendants. Saves space if there are only a few bit-strings that are long. This is a more efficient version of tries, but the operations get a bit more complicated.

We have (implicitly) seen pruned tries before: For equal-length bit-strings, pruned trie equals recursion tree of MSD radix-sort.

Compressed Tries: Compress paths of nodes with only one child. Each node stores an index, corresponding to the level of the node in the uncompressed trie. These are known as Patricia-Tries (Practical Algorithm to Retrieve Information Coded in Alphanumeric).

As for tries, follow links that correspond to current bits in $w$ . Main difference: stored indices say which bits to compare.

CompressedTrie::get-path-to(w)
P <- empty stack; z <- root; P.push(z)
while z is not a leaf and ((d <- z.index) <= w.size) do
	if (z has a child-link labelled with w[d]) then
		z <- child at this link; P.push(z)
	else break
return P

CompressedTrie::search(w)
P <- get-path-to(w), z <- P.top()
if (z is not a leaf or word stored at z is not w) then
	return "not found"
return key-value pair at z

search(w) and prefix-search(w) are easy. insert(w) and delete(w) are conceptually simple. Search for path $P$ to word, uncompress this path, insert/delete $w$ as in an uncompressed trie, compress path from root to where change happened.

All operations take $O (∣ w ∣)$ time for a word $w$ . Compressed tries use $O (n)$ space. We have $n$ leaves, every internal node has two or more children. Can show that there are more leaves than internal nodes.

To represent strings over any fixed alphabet $Σ$ , any node will have at most $∣Σ∣ + 1$ children.

Variation: Compressed multi-way tries: compress paths as before.

Operations search(w), prefix-search(w), insert(w), and delete(w) are exactly as for tries for bit-strings. Run-time $O (∣ w ∣ \cdot (time to find the appropriate child))$ . Each node now has up to $∣Σ∣ + 1$ children.

Question

How should they be stored?

Time/space tradeoffs. Arrays are fast, lists are space-efficient. Dictionary best in theory, not worth it in practice unless $∣Σ∣$ is huge. In practice, we use hashing.

Module 7: Dictionaries via Hashing

Dictionaries via Hashing

Special situation: For a known $M \in N$ , every key $k$ is an integer with $0 \leq k < M$ . We can then implement a dictionary easily: Use an array $A$ of size $M$ that stores $(k, v)$ via $A [k] \leftarrow v$ .

search(k): Check whether $A [k]$ is NULL.
insert(k,v): $A [k] \leftarrow v$
delete(k): $A [k] \leftarrow$ NULL

Each operation is $Θ (1)$ . Total space is $Θ (M)$ .

Two disadvantages of direct addressing: It cannot be used if the keys are not integers. It wastes space if $M$ is unknown or $n \ll M$ .

Hashing idea: Map (arbitrary) keys to integers in range $\{0, \dots, M - 1\}$ (for an integer $M$ of our choice), then use direct addressing.

We know that all keys come from some universe $U$ . We pick a table-size $M$ . We pick a hash function $h : U \to \{0, 1, \dots, M - 1\}$ . Store dictionary in a hash table. An item with key $k$ wants to be stored in slot $h (k)$ .

Generally hash function $h$ is not injective, so many keys can map to the same integer. We get collisions: we want to insert $(k, v)$ into the table, but $T [h (k)]$ is already occupied. There are many strategies to resolve collisions.

Hashing with Chaining

Simplest collision-resolution strategy: Each slot stores a bucket containing $0$ or more KVPs. A bucket could be implemented by any dictionary realization. The simplest approach is to use unsorted lists with MTF for buckets. This is called collision resolution by chaining.

insert(k, v): Add $(k, v)$ to the front of the list at $T [h (k)]$
search(k): Look for key $k$ in the list at $T [h (k)]$ . Apply MTF-heuristic.
delete(k): Perform a search, then delete from the linked list.

insert takes time $O (1)$ . search and delete have run-time $O (1 + \text{length of the list at } T [h (k)])$ .

Complexity of chaining: insert takes time $Θ (1)$ . search and delete have run-time $Θ (1 + size of bucket T [h (k)])$ .
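A small Python sketch of hashing with chaining, with unsorted buckets and the MTF heuristic on search. Assumptions: Python's built-in `hash` reduced modulo M stands in for the hash function h, keys are not inserted twice, and the class name is illustrative.

```python
class ChainedHashTable:
    def __init__(self, M=8):
        self.M = M
        self.n = 0
        self.T = [[] for _ in range(M)]     # one bucket (unsorted list) per slot

    def _h(self, k):
        return hash(k) % self.M             # assumed hash function

    def insert(self, k, v):
        self.T[self._h(k)].insert(0, (k, v))   # add to the front of the bucket
        self.n += 1

    def search(self, k):
        bucket = self.T[self._h(k)]
        for i, (key, v) in enumerate(bucket):
            if key == k:
                bucket.insert(0, bucket.pop(i))   # MTF heuristic
                return v
        return None

    def delete(self, k):
        bucket = self.T[self._h(k)]
        for i, (key, _) in enumerate(bucket):
            if key == k:
                del bucket[i]
                self.n -= 1
                return True
        return False

    def load_factor(self):
        return self.n / self.M
```

The cost of search and delete is driven by the length of the single bucket that is scanned, which is why the load factor matters.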

The average bucket size is $\frac{n}{M} = : α$ ( $α$ is also called the load factor).

However, this does not imply that the average-case cost of search and delete is $Θ (1 + α)$ . Consider the case where all keys hash to the same slot. The average bucket-size is still $α$ . But the operations take $Θ (n)$ time on average.

To get meaningful average-case bounds, we need assumptions on hash-functions and keys.

We can switch to randomized hashing. Assume that the hash-function is chosen randomly. We assume the following:

Uniform Hashing Assumption: Any possible hash-function is equally likely to be chosen as the hash-function.

UHA implies that the distribution of keys is unimportant. Claim: Hash-values are uniform. Formally: $P (h (k) = i) = \frac{1}{M}$ for any key $k$ and slot $i$ .

Proof: Let $H_{j}$ (for $j = 0, \dots, M - 1$ ) be the set of hash-functions $h$ with $h (k) = j$ . For any $i \neq j$ , we can map each function in $H_{i}$ to one in $H_{j}$ and vice versa, so all sets $H_{j}$ have the same size. So $P (h (k) = i) = P (h \in H_{i}) = \frac{1}{M}$ .

Claim: Hash-values of any two keys are independent of each other.

Back to complexity of chaining. Each bucket has expected length $\frac{n}{M} = α$ , because each of the $n$ keys is in a given slot with probability $\frac{1}{M}$ . Each key in the dictionary is expected to collide with $\frac{n - 1}{M}$ other keys, because each of the $n - 1$ other keys is in the same slot with probability $\frac{1}{M}$ . The expected cost of search and delete is hence $Θ (1 + α)$ .

For hashing with chaining, the run-time bound depends on $α$ . We keep the load factor small by rehashing when needed.

For Hashing with Chaining: Rehash so that $α \in Θ (1)$ throughout. Rehashing costs $Θ (M + n)$ time. Rehashing happens rarely enough that we can ignore this term when amortizing over all operations. We re-hash when $α$ gets too large (and possibly also when it gets too small, to save space).

The amortized expected cost for hashing with chaining is $O (1)$ and the space is $Θ (n)$ .

Probe Sequences

Main Idea: Avoid the links needed for chaining by permitting only one item per slot, but allowing a key $k$ to be stored in one of several possible slots.

search and insert follow a probe sequence of possible locations for key $k$ : $⟨ h (k, 0), h (k, 1), h (k, 2), \dots, h (k, M - 1)⟩$ until an empty spot is found.

Simplest method for open addressing: linear probing.

$h (k, j) = (h (k) + j) (mod M)$ , for some hash function $h$ .

delete becomes problematic. Cannot leave an empty spot behind; the next search might otherwise not go far enough. We can try to move later items in probe sequence forward. Better idea is lazy deletion. Mark the spot as deleted, search continues past deleted spots, insertion reuses deleted spots.

Keep track of how many items are ‘deleted’ and re-hash if there are too many.

probe-sequence::insert(T,(k,v))
for (j=0; j<M; j++)
	if T[h(k,j)] is NULL or "deleted"
		T[h(k,j)] = (k,v)
		return "success"
return "failure to insert"

probe-sequence::search(T,k)
for (j=0; j<M; j++)
	if T[h(k,j)] is NULL return "item not found"
	if T[h(k,j)] has key k return T[h(k,j)]
return "item not found"
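A Python sketch of linear probing with lazy deletion. Assumptions: Python's built-in `hash` stands in for h, a unique marker object `DELETED` marks lazily deleted slots, keys are not inserted twice, and re-hashing is omitted.

```python
DELETED = object()     # marker for a lazily deleted slot

class LinearProbingTable:
    def __init__(self, M=11):
        self.M = M
        self.T = [None] * M

    def _probe(self, k, j):
        return (hash(k) + j) % self.M          # h(k, j) = (h(k) + j) mod M

    def insert(self, k, v):
        for j in range(self.M):
            i = self._probe(k, j)
            if self.T[i] is None or self.T[i] is DELETED:
                self.T[i] = (k, v)             # insertion reuses deleted spots
                return True
        return False                           # table is full

    def search(self, k):
        for j in range(self.M):
            slot = self.T[self._probe(k, j)]
            if slot is None:
                return None                    # truly empty: k cannot be further along
            if slot is not DELETED and slot[0] == k:
                return slot[1]
        return None

    def delete(self, k):
        for j in range(self.M):
            i = self._probe(k, j)
            slot = self.T[i]
            if slot is None:
                return False
            if slot is not DELETED and slot[0] == k:
                self.T[i] = DELETED            # mark instead of emptying the slot
                return True
        return False
```

Because search continues past `DELETED` markers, deleting an item in the middle of a probe sequence does not hide the items stored after it.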

Some hashing methods require two hash functions $h_{0}, h_{1}$ . These hash functions should be independent in the sense that the events $h_{0} (k) = i$ and $h_{1} (k) = j$ are independent. Using two modular hash-functions often leads to dependencies.

Better idea: Use multiplication method for second hash function:

$h (k) = ⌊ M (k A - ⌊ k A ⌋) ⌋$

$A$ is some floating-point number with $0 < A < 1$ . $k A - ⌊ k A ⌋$ computes the fractional part of $k A$ , which is in $[0, 1)$ . Multiply by $M$ to get a floating-point number in $[0, M)$ . Round down to get an integer in $\{0, \dots, M - 1\}$ .
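A short Python sketch of the multiplication method for non-negative integer keys. The default A = (√5 − 1)/2 ≈ 0.618 is a commonly suggested choice and is an assumption here, since the notes only require 0 < A < 1.

```python
import math

def multiplication_hash(k, M, A=(math.sqrt(5) - 1) / 2):
    """h(k) = floor(M * (k*A - floor(k*A))) for a non-negative integer key k."""
    fractional_part = (k * A) % 1.0        # in [0, 1)
    return math.floor(M * fractional_part) # in {0, ..., M-1}
```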

Assume there are two independent hash functions $h_{0}, h_{1}$ .

Assume further that $h_{1} (k) \neq 0$ and that $h_{1} (k)$ is relatively prime with the table-size $M$ for all keys $k$ . Choose $M$ prime. Modify standard hash-functions to ensure $h_{1} (k) \neq 0$ .

Double-hashing: open addressing with probe sequence

$h (k, j) = (h_{0} (k) + j \cdot h_{1} (k)) \bmod M$
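The requirement that $h_{1} (k)$ be relatively prime with $M$ can be illustrated by listing the probe sequence directly. This sketch takes the two hash values as plain integers; the function name is illustrative.

```python
def probe_sequence(h0_of_k, h1_of_k, M):
    """Slots visited by double hashing for j = 0, ..., M-1."""
    return [(h0_of_k + j * h1_of_k) % M for j in range(M)]

# M = 7 is prime, so any step 1 <= h1(k) < 7 reaches every slot.
all_slots = probe_sequence(3, 4, 7)

# M = 8 and step 4 share the factor 4, so the sequence cycles through few slots.
few_slots = probe_sequence(3, 4, 8)
```

With a step that shares a factor with M, an insert can fail even though the table still has empty slots.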

search, insert, delete work just like for linear probing, with this different probe sequence.

Cuckoo Hashing

We use two independent hash functions $h_{0}, h_{1}$ , and two tables $T_{0}, T_{1}$ .

Main idea: An item with key $k$ can only be at $T_{0} [h_{0} (k)]$ or $T_{1} [h_{1} (k)]$ . search and delete always take constant time.

insert always initially puts the new item into $T_{0} [h_{0} (k)]$ , evicting any item that is already there. An evicted item is inserted at its alternate position, which may in turn evict another item. This may lead to a loop of evictions. Can show that if insertion is possible, then there are at most $2 n$ evictions.

Can show: expected number of evictions during insert is $O (1)$ . So in practice, stop evictions much earlier than $2 n$ rounds. This crucially requires load factor $α < \frac{1}{2}$ . Here $α = n / (size of T_{0} + size of T_{1})$ .
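A hedged Python sketch of cuckoo hashing. Assumptions: the two hash functions are passed in by the caller, keys are not inserted twice, re-hashing is left to the caller, and the eviction limit of 2(n + 1) rounds is an illustrative cut-off based on the 2n bound above.

```python
class CuckooTable:
    def __init__(self, M, h0, h1):
        self.h = (h0, h1)                      # h[i] maps a key to a slot of T[i]
        self.T = ([None] * M, [None] * M)      # the two tables T_0 and T_1
        self.n = 0

    def search(self, k):
        for i in (0, 1):                       # only two possible locations
            slot = self.T[i][self.h[i](k)]
            if slot is not None and slot[0] == k:
                return slot[1]
        return None

    def delete(self, k):
        for i in (0, 1):
            pos = self.h[i](k)
            slot = self.T[i][pos]
            if slot is not None and slot[0] == k:
                self.T[i][pos] = None
                self.n -= 1
                return True
        return False

    def insert(self, k, v):
        """Return None on success, or the item left without a slot (re-hash needed)."""
        item, i = (k, v), 0
        for _ in range(2 * (self.n + 1)):
            pos = self.h[i](item[0])
            # Place the current item and pick up whatever was stored there.
            item, self.T[i][pos] = self.T[i][pos], item
            if item is None:
                self.n += 1
                return None
            i = 1 - i                          # evicted item goes to its other table
        return item
```

When `insert` returns an item, a full implementation would pick new hash functions (and possibly larger tables) and re-insert everything, including that item.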

Cuckoo hashing is wasteful on space. In fact, space is $ω (n)$ if insert forces lots of re-hashing. Can show the expected space is $O (n)$ .

For any open addressing scheme, we must have $α \leq 1$ . For the analysis, we require $0 < α < 1$ . Cuckoo hashing requires $0 < α < 1/2$ .

All strategies have $O (1)$ expected time for search, insert, delete. Cuckoo Hashing has $O (1)$ worst-case time for search, delete. Probe sequences use $O (n)$ worst-case space, Cuckoo Hashing uses $O (n)$ expected space.

For any hash-functions the worst-case run-time is $Θ (n)$ for insert.

Hash Function Strategies

Recall UHA: Hash function is randomly chosen among all possible hash-functions.

Satisfying this is impossible: There are too many hash functions; we would not know how to look up $h (k)$ .

We have to compromise, choose a hash-function that is easy to compute.

Two basic methods for integer keys (modular method and multiplication method). Every hash function must do badly for some sequences of inputs: If the universe contains at least $M \cdot n$ keys, then there are $n$ keys that all hash to the same value.

Carter-Wegman’s universal hashing: Choose hash-function randomly.

Requires: all keys are in $\{0, \dots, p - 1\}$ for some big prime $p$ . At initialization, and whenever we re-hash: Choose $M < p$ arbitrarily; a power of $2$ is okay. Choose (and store) two random numbers $a, b$ with $b =$ random(p) and $a = 1 +$ random(p-1), so that $a \neq 0$ . Use the hash-function $h (k) = ((ak + b) \bmod p) \bmod M$ , which can be computed quickly.

Analysis of these Carter-Wegman hash functions. Choosing $h$ in this way does not satisfy uniform hashing assumptions. But can show: two keys collide with probability at most $\frac{1}{M}$ . This suffices to prove the run-time bounds for hashing with chaining.
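A small Python sketch of choosing a hash function in the Carter-Wegman style. It assumes the standard form h(k) = ((a·k + b) mod p) mod M with a drawn from {1, ..., p−1} and b from {0, ..., p−1}; the function names are illustrative. For a tiny prime, the collision probability of two fixed keys can be computed exactly by enumerating all choices of (a, b).

```python
import random

def cw_hash(k, a, b, p, M):
    return ((a * k + b) % p) % M

def random_cw_hash(p, M, rng=random):
    a = 1 + rng.randrange(p - 1)    # a in {1, ..., p-1}
    b = rng.randrange(p)            # b in {0, ..., p-1}
    return lambda k: cw_hash(k, a, b, p, M)

def collision_count(x, y, p, M):
    """Number of pairs (a, b) for which the distinct keys x and y collide."""
    return sum(cw_hash(x, a, b, p, M) == cw_hash(y, a, b, p, M)
               for a in range(1, p) for b in range(p))

# Exact collision probability for keys 3 and 10 with p = 17, M = 4.
p, M = 17, 4
probability = collision_count(3, 10, p, M) / ((p - 1) * p)
```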

Question

What if the keys are multi-dimensional, such as strings?

Standard approach is to flatten string $w$ to integer $f (w) \in N$ .
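One common way to flatten a string, shown here only as an assumption since the notes do not fix a particular $f$ : interpret the character codes as digits of a number in some radix R. The radix 255 and the function names below are illustrative choices. Reducing modulo M inside the loop (Horner's rule) avoids building the huge intermediate integer.

```python
def flatten(w, R=255):
    """f(w) = sum of ord(w[i]) * R^(len(w)-1-i), computed by Horner's rule."""
    value = 0
    for ch in w:
        value = value * R + ord(ch)
    return value

def string_hash(w, M, R=255):
    """f(w) mod M, reducing after every step so intermediate values stay below M * R + max code."""
    h = 0
    for ch in w:
        h = (h * R + ord(ch)) % M
    return h
```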

Module 8: Range-Searching in Dictionaries for Points

So far, search(k) looks for one specific item. New operation range-search: look for all items that fall within a given range. Input: A range, i.e., an interval $Q = (x, x^{'})$ . It may be open or closed at the ends. We want to report all KVPs in the dictionary whose key $k$ satisfies $k \in Q$ .

As usual $n$ denotes the number of input-items. Let $s$ be the output-size (the number of items in the range). We need $Ω (s)$ time simply to report the items. Typical run-time: $O (lo g n + s)$ .

Unsorted list/array/hash table: Range search requires $Ω (n)$ time: We have to check for each item explicitly whether it is in the range.

Sorted array: Range search in $A$ can be done in $O (lo g n + s)$ time.

BST: Range searches can similarly be done in $O (height + s)$ time.
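For the sorted-array case, a minimal Python sketch using two binary searches from the standard `bisect` module. It assumes a closed query interval [x1, x2]; the function name is illustrative.

```python
from bisect import bisect_left, bisect_right

def range_search_sorted(A, x1, x2):
    """Return all keys k in the sorted list A with x1 <= k <= x2."""
    lo = bisect_left(A, x1)      # first index with A[lo] >= x1
    hi = bisect_right(A, x2)     # first index with A[hi] > x2
    return A[lo:hi]              # two binary searches plus O(s) to copy the output
```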

Range searches are of special interest for multi-dimensional data.

Each item has $d$ aspects (coordinates): $(x_{0}, x_{1}, \dots, x_{d - 1})$ so corresponds to a point in $d$ -dimensional space.

We concentrate on $d = 2$ (points in Euclidean plane).

(Orthogonal) $d$ -dimensional range search: Given a query rectangle $Q = [x_{1}, x_{1}^{'}] \times \dots \times [x_{d}, x_{d}^{'}]$ , find all points that lie within $Q$ .

The time for range searches depends on how the points are stored. Could store a 1-dimensional dictionary (key is the combination of aspects). However, range search on one aspect is not straightforward.

Better idea: Design new data structures specifically for points (quadtrees, kd-trees, range-trees). Assumption: Points are in general position: No two points on a horizontal line, no two points on a vertical line.

Quadtrees

We have $n$ points $P = \{(x_{0}, y_{0}), (x_{1}, y_{1}), \dots, (x_{n - 1}, y_{n - 1})\}$ in the plane. Find a bounding box $R = [0, 2^{k}) \times [0, 2^{k})$ : a square containing all points. Assume that all coordinates are non-negative. Find max-coordinate in $P$ , use the smallest $k$ such that it is $< 2^{k}$ .

Structure (how to build the quadtree that stores $P$ ):

Root $r$ of the quadtree is associated with region $R$
If $R$ contains 0 or 1 points, then root $r$ is a leaf that stores point.
Else, split: Partition $R$ into four equal subsquares (quadrants) $R_{NE}, R_{N W}, R_{S W}, R_{SE}$ .
Partition $P$ into sets $P_{NE}, P_{N W}, P_{S W}, P_{SE}$ of points in these regions. Points on split lines belong to right/top side.
Recursively build tree $T_{i}$ for points $P_{i}$ in region $R_{i}$ and make them children of the root.

graph TD

    A["[0,16] × [0,16]"]

    B["[0,8] × [8,16]"]
    C["[0,8] × [0,8]"]

    P4((p4))
    P5((p5))

    A --> B
    A --> C
    A --> P4
    A --> P5

    B --> D((∅))
    B --> E((∅))
    B --> F["[0,4] × [8,12]"]
    B --> P8((p8))

    F --> P9((p9))
    F --> P3((p3))
    F --> G((∅))
    F --> P1((p1))


    C --> P6((p6))
    C --> P0((p0))
    C --> P2((p2))
    C --> P7((p7))

search: Analogous to binary search trees and tries

insert: Search for the point, split the leaf while there are two points in one region.

delete: Search for the point, remove the point, if the parent has only one point left: delete parent.

QTree::range-search(r <- root, Q)
r: The root of a quadtree, Q: Query-rectangle
R <- region associated with node r
if (R \subseteq Q) then
	report all points below r and return
else if (R \cap Q is empty) then return
 
if (r is a leaf) then
	p <- point stored at r
	if p is not NULL and in Q then report it and return
	else return
for each child v of r do QTree::range-search(v,Q)
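A Python sketch of building a quadtree and running range-search on it. Assumptions: points are distinct tuples (x, y) inside the bounding box, regions are half-open squares, the query rectangle is closed and given as (x1, x2, y1, y2), and all names are illustrative.

```python
class QNode:
    def __init__(self, x, y, size, points):
        # Region R = [x, x+size) x [y, y+size). Points must be distinct.
        self.x, self.y, self.size = x, y, size
        self.point = None
        self.children = []
        if len(points) <= 1:
            self.point = points[0] if points else None
            return
        half = size / 2
        # Quadrants in the order NE, NW, SW, SE; split lines belong to the right/top side.
        for cx, cy in ((x + half, y + half), (x, y + half), (x, y), (x + half, y)):
            sub = [p for p in points
                   if cx <= p[0] < cx + half and cy <= p[1] < cy + half]
            self.children.append(QNode(cx, cy, half, sub))

def report_all(r, out):
    if r.point is not None:
        out.append(r.point)
    for v in r.children:
        report_all(v, out)

def range_search(r, Q, out):
    qx1, qx2, qy1, qy2 = Q
    x1, y1, x2, y2 = r.x, r.y, r.x + r.size, r.y + r.size
    if qx1 <= x1 and x2 <= qx2 and qy1 <= y1 and y2 <= qy2:   # R is inside Q
        report_all(r, out)
        return
    if x2 <= qx1 or qx2 < x1 or y2 <= qy1 or qy2 < y1:        # R and Q are disjoint
        return
    if not r.children:                                        # leaf
        p = r.point
        if p is not None and qx1 <= p[0] <= qx2 and qy1 <= p[1] <= qy2:
            out.append(p)
        return
    for v in r.children:
        range_search(v, Q, out)
```

Usage: build with `root = QNode(0, 0, 16, points)` and collect results with `out = []; range_search(root, (2, 10, 0, 9), out)`.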

Question

What is the height of a quadtree?

Can have very large height for bad distributions of points. Even with $n = 3$ points, the height could be arbitrarily large.

There exists a (weaker) bound that depends on the spread factor of points $P$ :

$\frac{\text{sidelength of } R}{\text{minimum distance between points in } P}$

Can show: height $h$ of quadtree is in $Θ (lo g (spread factor))$ .

Complexity to build initial tree: $Θ (nh)$ worst-case.

Complexity of range search: $Θ (nh)$ worst-case even if the answer is $\emptyset$ .

Quadtrees easily generalize to higher dimensions, but are rarely used beyond dimension 3.

Quadtrees are very easy to compute: no complicated arithmetic is needed. Space can potentially be wasteful if points are not well-distributed. We can use quadtrees to store pixelated images.

kd-trees

We have $n$ points $P = \{(x_{0}, y_{0}), (x_{1}, y_{1}), \dots, (x_{n - 1}, y_{n - 1})\}$ . Quadtrees split a square into quadrants regardless of where the points are. kd-tree idea: split the region at the upper median of coordinates. Each node of the kd-tree keeps track of a splitting line in one dimension.

Convention: Points on split lines belong to right/top side. Continue splitting, switching between vertical and horizontal lines, until every point is in a separate region.

Build kd-tree with initial split by $x$ on points $P$ :

If $∣ P ∣ \leq 1$ create a leaf and return
Else $X : =$ randomized-quick-select(P, n/2) (select by $x$ -coordinate)
Partition $P$ by $x$ -coordinate into $P_{x < X}$ and $P_{x \geq X}$ . $⌊ \frac{n}{2} ⌋$ points on one side and $⌈ \frac{n}{2} ⌉$ points on the other.
Create left subtree recursively (splitting by $y$ ) for points $P_{x < X}$ .
Create right subtree recursively (splitting by $y$ ) for points $P_{x \geq X}$ .

Run-time: Find $X$ and partition $P$ in $Θ (n)$ expected time using randomized-quick-select. Both subtrees have $\approx n /2$ points.

$T^{exp} (n) = 2 T^{exp} (n /2) + O (n)$

This resolves to $Θ (n lo g n)$ expected time. This can be improved to $Θ (n lo g n)$ worst-case time by pre-sorting.

Height: $h (1) = 0, h (n) \leq h (⌈ n /2 ⌉) + 1$ . This resolves to $O (lo g n)$ (tight).

Space: All interior nodes have exactly two children. Therefore have $n - 1$ interior nodes. Space is $Θ (n)$ .

search: as in binary search tree using indicated coordinate.

insert: search, insert as new leaf.

delete: search, remove leaf.

Problem: After insert or delete, the split might no longer be at exact median and the height is no longer guaranteed to be $⌈ lo g_{2} n ⌉$ . We can maintain $O (lo g n)$ height by occasionally re-building entire subtrees. But range-search will be slower.

kd-trees do not handle insertion/deletion well.

Range search is exactly as for quad-trees, except that there are only two children and leaves always store points.

We assume that each node stores its associated region. To save space, we could instead pass the region as a parameter and compute the region for each child using the splitting line.
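A Python sketch of building a kd-tree and running range search with the region passed as a parameter. Assumptions: points are 2-tuples in general position, sorting stands in for randomized-quick-select (so this build is not linear per level), regions are half-open intervals per axis, and the query Q = ((x1, x2), (y1, y2)) is closed. All names are illustrative.

```python
INF = float("inf")

class KDNode:
    def __init__(self, point=None, split=None, left=None, right=None):
        self.point = point      # set at leaves only
        self.split = split      # coordinate of the splitting line (interior nodes)
        self.left = left
        self.right = right

def build_kd(points, axis=0):
    if len(points) <= 1:
        return KDNode(point=points[0] if points else None)
    pts = sorted(points, key=lambda p: p[axis])   # stand-in for quick-select
    mid = len(pts) // 2                           # upper median
    return KDNode(split=pts[mid][axis],
                  left=build_kd(pts[:mid], 1 - axis),    # coordinate < split
                  right=build_kd(pts[mid:], 1 - axis))   # coordinate >= split

def report_all(node, out):
    if node.split is None:
        if node.point is not None:
            out.append(node.point)
    else:
        report_all(node.left, out)
        report_all(node.right, out)

def kd_range_search(node, Q, axis=0, region=((-INF, INF), (-INF, INF)), out=None):
    out = [] if out is None else out
    if all(Q[a][0] <= region[a][0] and region[a][1] <= Q[a][1] for a in (0, 1)):
        report_all(node, out)                     # region inside Q
    elif any(region[a][1] <= Q[a][0] or Q[a][1] < region[a][0] for a in (0, 1)):
        pass                                      # region and Q are disjoint
    elif node.split is None:                      # leaf: test its point
        p = node.point
        if p is not None and all(Q[a][0] <= p[a] <= Q[a][1] for a in (0, 1)):
            out.append(p)
    else:                                         # compute child regions from the split
        lo, hi = region[axis]
        for child, interval in ((node.left, (lo, node.split)),
                                (node.right, (node.split, hi))):
            sub = (interval, region[1]) if axis == 0 else (region[0], interval)
            kd_range_search(child, Q, 1 - axis, sub, out)
    return out
```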

We spend $O (1)$ time at each visited node.

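Observe: apart from nodes visited while reporting points, the number of visited nodes is $O (β (n))$ , where $β (n)$ is the number of “boundary” nodes: nodes whose region $R$ satisfies neither $R \subseteq Q$ nor $R \cap Q = \emptyset$ .

Can show: $β (n)$ satisfies the following recurrence relation:

$β (n) \leq 2 β (n /4) + O (1)$

The recursion has depth $\log_{4} n$ and doubles the number of subproblems at each level, so $β (n) \in O (2^{\log_{4} n}) = O (\sqrt{n})$ .

Therefore, the complexity of range search in kd-trees is $O (s + \sqrt{n})$ .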

Storage is $O (n)$ , height is $O (lo g n)$ , construction time is $O (n lo g n)$ , range search time is $O (s + n^{1 - 1/ d})$ . This assumes that $d$ is constant (where $d$ is for $d$ -dimensional space).

Both quadtrees and kd-trees are intuitive and simple. But both may be very slow for range searches. Quadtrees are also potentially wasteful in space.

New idea: Range trees

Tree of trees (a multi-level data structure). So far, nodes in our trees stored a key-value pair and references to children and (maybe) the parent. But we can store much more in a node! Each node stores another binary search tree. Range trees are wasteful in space, but permit much faster range search.

2-dimensional range trees.

Primary structure: Balanced binary search tree $T$ that stores $P$ and uses $x$ -coordinates as keys.

Every node $z$ of $T$ stores an associate structure $T_{ass} (z)$ :

Let $P (z)$ be all points in subtree of $z$ in $T$ (including point at $z$ )
$T_{ass} (z)$ stores $P (z)$ in a balanced binary search tree, using the $y$ -coordinates as key. Note: Point of $z$ is not necessarily in the root of $T_{ass} (z)$ .

Primary tree $T$ uses $O (n)$ space.

Question

How many nodes do all associate trees together have?

$b$ is a child of $a$ , $c$ is a child of $b$ . Point of $a$ is only in associate tree $T_{ass} (a)$ . Point of $b$ is in associate trees $T_{ass} (a)$ , $T_{ass} (b)$ . Point of $c$ is in associate trees $T_{ass} (a)$ , $T_{ass} (b)$ , $T_{ass} (c)$ .

Key insight: Point of $z$ is in associate tree $T_{ass} (u)$ if and only if $u$ is an ancestor of $z$ in $T$ .

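A range tree with $n$ points uses $O (n \log n)$ space: $T$ is balanced, so every point has $O (\log n)$ ancestors and is therefore stored in $O (\log n)$ associate trees. This is tight for some primary trees.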

search: search by $x$ -coordinate in $T$ .

insert: First, insert point by $x$ -coordinate into $T$ . Then, walk back up to the root and insert the point by $y$ -coordinate in all associate trees $T_{ass} (z)$ of nodes $z$ on path to the root.

delete: Analogous to insertion.

Problem: We want the binary search trees to be balanced. This makes insert/delete very slow if we use AVL-trees. Solution: Completely rebuild highly unbalanced subtrees. Run-time for insert/delete becomes $O (lo g^{2} n)$ amortized.

BST::range-search-recursive(r <- root, x_1, x_2)
r: root of a binary search tree, x_1, x_2: search keys
Returns keys in subtree at r that are in range [x_1, x_2]
if r = NULL then return \emptyset
if x_1 <= r.key <= x_2 then
	L <- BST::range-search-recursive(r.left, x_1, x_2)
	R <- BST::range-search-recursive(r.right, x_1, x_2)
	return L \cup {r.key} \cup R
if r.key < x_1 then
	return BST::range-search-recursive(r.right, x_1, x_2)
if r.key > x_2 then
	return BST::range-search-recursive(r.left, x_1, x_2)

Keys are reported in in-order.
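The recursive range search above can be sketched in Python as follows. This is a minimal illustration, not the course's reference implementation: the `Node` class and the plain (unbalanced) `insert` helper are assumptions added only so that the example is self-contained.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], key: int) -> Node:
    """Plain BST insertion (no balancing); only used to build an example tree."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root


def range_search(r: Optional[Node], x1: int, x2: int) -> List[int]:
    """Return all keys in the subtree at r that lie in [x1, x2], in sorted order."""
    if r is None:
        return []
    if r.key < x1:                      # everything in the left subtree is too small
        return range_search(r.right, x1, x2)
    if r.key > x2:                      # everything in the right subtree is too large
        return range_search(r.left, x1, x2)
    # r.key is in range: both subtrees may contain further keys in range
    return range_search(r.left, x1, x2) + [r.key] + range_search(r.right, x1, x2)
```

Because the left result, the node's key, and the right result are concatenated in that order, the output is sorted.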

BST Range Search analysis. Assume that the binary search tree is balanced. Search for path $P_{1}$ : $O (lo g n)$ . Search for path $P_{2}$ : $O (lo g n)$ .

$O (lo g n)$ boundary nodes. We spend $O (1)$ time on each. We spend $O (1)$ time per topmost inside node $v$ . They are children of boundary nodes, so this takes $O (lo g n)$ time. For 1d range search, also report the descendants of $v$ .

We have $\sum_{z \text{ topmost inside}} \# \{\text{descendants of } z\} \leq s$ since subtrees of topmost inside nodes are disjoint. So this takes time $O (s)$ overall.

Run-time for 1d range search: $O (lo g n + s)$ . This is no faster overall, but topmost inside nodes will be important for 2d range search.

Range search for $Q = [x_{1}, x_{2}] \times [y_{1}, y_{2}]$ is a two stage process: Perform a range search (on the $x$ -coordinates) for the interval $[x_{1}, x_{2}]$ in primary tree $T$ (BST::range-search(T, x_1, x_2)). Get boundary and topmost inside nodes as before. For every boundary node, test to see if the corresponding point is within the region $Q$ .

For every topmost inside node $v$ :

Let $P (v)$ be the points in the subtree of $v$ in $T$ .
We know that all $x$ -coordinates of points in $P (v)$ are within range.
Recall: $P (v)$ is stored in $T_{ass} (v)$ .
To find points in $P (v)$ where the $y$ -coordinates are within range as well, perform a range search in $T_{ass} (v)$ : BST::range-search(T_ass(v), y_1, y_2).

Range Trees: Range Search Run-Time

$O (lo g n)$ time to find boundary and topmost inside nodes in primary tree. There are $O (lo g n)$ such nodes. $O (lo g n + s_{v})$ time for each topmost inside node $v$ , where $s_{v}$ is the number of points in $T_{ass} (v)$ that are reported. Two topmost inside nodes have no common point in their trees $⟹$ every point is reported in at most one associate structure $⟹$ $\sum_{v topmost inside} s_{v} \leq s$ .

Time for range search in range-tree is proportional to

$\sum_{v \text{ topmost inside}} (\log n + s_{v}) \in O (\log^{2} n + s)$
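The two-stage query can be sketched in Python as below. This is a simplified static version under several assumptions that are not in the notes: the primary tree is built once from the sorted points, each node additionally stores the smallest and largest $x$ -coordinate in its subtree (`min_x`, `max_x`) so that inside and outside nodes can be recognised directly, and a sorted list with binary search stands in for the balanced associate tree. All class and function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


class RangeTreeNode:
    def __init__(self, pts):
        # pts: non-empty list of (x, y) points, sorted by x
        mid = len(pts) // 2
        self.point = pts[mid]
        self.min_x, self.max_x = pts[0][0], pts[-1][0]
        self.left = RangeTreeNode(pts[:mid]) if mid > 0 else None
        self.right = RangeTreeNode(pts[mid + 1:]) if mid + 1 < len(pts) else None
        # Associate structure: all points P(z) of this subtree, sorted by y.
        self.assoc = sorted((y, x) for x, y in pts)


def build(points):
    return RangeTreeNode(sorted(points)) if points else None


def range_query(z, x1, x2, y1, y2):
    """Report all points (x, y) with x1 <= x <= x2 and y1 <= y <= y2."""
    if z is None or z.max_x < x1 or z.min_x > x2:
        return []                                   # outside node
    if x1 <= z.min_x and z.max_x <= x2:             # topmost inside node
        lo = bisect_left(z.assoc, (y1, float("-inf")))
        hi = bisect_right(z.assoc, (y2, float("inf")))
        return [(x, y) for y, x in z.assoc[lo:hi]]  # 1d search on y only
    # boundary node: test its own point and recurse into both children
    out = range_query(z.left, x1, x2, y1, y2)
    x, y = z.point
    if x1 <= x <= x2 and y1 <= y <= y2:
        out.append(z.point)
    out += range_query(z.right, x1, x2, y1, y2)
    return out
```

Every point appears in the `assoc` list of each of its ancestors, which matches the $O (n \log n)$ space bound. The construction here re-sorts at every node and is therefore slower than necessary; it is chosen for brevity.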

Range trees can be generalized to $d$ -dimensional space.

Space is $O (n (lo g n)^{d - 1})$ , construction time $O (n (lo g n)^{d})$ , range search time is $O (s + (lo g n)^{d})$ .

Space/time trade-off compared to kd-trees.

Module 9: String Matching

Pattern Matching

Search for a string (pattern) in a large body of text.

$T [0 \dots n - 1]$ , the text (or haystack) being searched within.

$P [0 \dots m - 1]$ , the pattern (or needle) being searched for.

Strings over alphabet $Σ$ . Return the first $i$ such that

$P [j] = T [i + j]$ for $0 \leq j \leq m - 1$

This is the first occurrence of $P$ in $T$ . If $P$ does not occur in $T$ , return FAIL.

Definition

Substring $T [i \dots j]$ for $0 \leq i \leq j \leq n - 1$ : a string of length $j - i + 1$ which consists of characters $T [i], \dots, T [j]$ in order.

Definition

A prefix of $T$ : A substring $T [0 \dots i]$ of $T$ for some $0 \leq i < n$ .

Definition

A suffix of $T$ : A substring $T [i \dots n - 1]$ of $T$ for some $0 \leq i \leq n - 1$ .

Pattern matching algorithms consist of guesses and checks. A guess is a position $i$ such that $P$ might start at $T [i]$ . Valid guesses (initially) are $0 \leq i \leq n - m$ . A check of a guess is a single position $j$ with $0 \leq j < m$ where we compare $T [i + j]$ to $P [j]$ . We must perform $m$ checks of a single correct guess, but may make (many) fewer checks of an incorrect guess.

We will diagram a single run of any pattern matching algorithm by a matrix of checks, where each row represents a single guess.

Brute-force Algorithm. Idea: Check every possible guess.

BruteforcePM(T[0..n-1], P[0..m-1])
T: String of length n (text), P: String of length m (pattern)
for i <- 0 to n-m do
	match <- true
	j <- 0
	while j < m and match do
		if T[i+j] = P[j] then
			j <- j + 1
		else 
			match <- false
	if match then
		return i
return FAIL

Example: $T = abbbababbab$ , $P = abba$ . Each row is one guess; italic characters mark failed checks.

a	b	b	b	a	b	a	b	b	a	b
a	b	b	$a$
	$a$
		$a$
			$a$
				a	b	$b$
					$a$
						a	b	b	a

Question

What is the worst possible input?

$P = a^{m - 1} b, T = a^{n}$ .

Worst case performance $Θ ((n - m + 1) m)$ . If $m \leq n /2$ , this is $Θ (mn)$ .
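To make the check counts concrete, here is a small Python sketch of the brute-force algorithm that also counts the number of character comparisons. The function name and the returned pair are assumptions made for illustration; `None` plays the role of FAIL.

```python
def brute_force_pm(T: str, P: str):
    """Return (index of first occurrence of P in T or None, number of checks)."""
    n, m = len(T), len(P)
    checks = 0
    for i in range(n - m + 1):          # every valid guess
        j = 0
        while j < m:
            checks += 1                 # one check: compare T[i+j] with P[j]
            if T[i + j] != P[j]:
                break
            j += 1
        if j == m:                      # all m checks of this guess succeeded
            return i, checks
    return None, checks
```

On the example above this performs 4, 1, 1, 1, 3, 1, 4 checks for guesses 0 to 6. On the worst-case input $P = a^{m - 1} b$ , $T = a^{n}$ every one of the $n - m + 1$ guesses needs $m$ checks.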

More sophisticated algorithms (KMP and Boyer-Moore). Do extra preprocessing on the pattern $P$ . We eliminate guesses based on completed matches and mismatches.

KMP Algorithm

Knuth-Morris-Pratt algorithm.

Compares the pattern to the text from left to right. Shifts the pattern more intelligently than the brute-force algorithm. When a mismatch occurs, what is the most we can shift the pattern?

T	a	b	c	d	c	a	b	c	?	?	?
	a	b	c	d	c	a	b	a
						a	b	c	d	c	a

KMP answer: the largest prefix of $P [0 \dots j]$ that is a suffix of $P [1 \dots j]$ .

KMP Failure Array: Preprocess the pattern to find matches of prefixes of the pattern with the pattern itself.

The failure array $F$ of size $m$ : $F [j]$ is defined as the length of the largest prefix of $P [0 \dots j]$ that is also a suffix of $P [1 \dots j]$ . $F [0] = 0$ .

If a mismatch occurs at $P [j] \neq T [i]$ with $j > 0$ , we set $j \leftarrow F [j - 1]$ .

Consider $P = abacaba$ .

$j$	$P [1 \dots j]$	$P$	$F [j]$
0	-	abacaba	0
1	b	abacaba	0
2	b $a$	$a$ bacaba	1
3	bac	abacaba	0
4	bac $a$	$a$ bacaba	1
5	bac $ab$	$ab$ acaba	2
6	bac $aba$	$aba$ caba	3

KMP(T, P)
F <- failureArray(P)
i <- 0
j <- 0
while i < n do
	if T[i] = P[j] then
		if j = m-1 then
			return i-j
		else
			i <- i + 1
			j <- j + 1
	else
		if j > 0 then
			j <- F[j-1]
		else 
			i <- i+1
return FAIL

$P = abacaba$ , $T = abaxyabacabbaababacaba$

a	b	a	x	y	a	b	a	c	a	b	b
a	b	a	c
		(a)	b
			a
				a
					a	b	a	c	a	b	a
									(a)	(b)	a

failureArray(P)
F[0] <- 0
i <- 1
j <- 0
while i < m do
	if P[i] = P[j] then
		F[i] <- j + 1
		i <- i + 1
		j <- j + 1
	else if j > 0 then
		j <- F[j-1]
	else 
		F[i] <- 0
		i <- i + 1

For failureArray, at each iteration of the while loop, either $i$ increases by one, or the guess index $i - j$ increases by at least one (since $F [j - 1] < j$ ). Both $i$ and $i - j$ are at most $m$ , so there are no more than $2 m$ iterations of the while loop. Running time: $Θ (m)$ .
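Both pseudocode routines translate almost line by line into Python. The sketch below follows them directly; the function names, the use of `None` for FAIL, and the handling of an empty pattern are assumptions added for the example.

```python
def failure_array(P: str):
    """F[j] = length of the longest prefix of P[0..j] that is a suffix of P[1..j]."""
    m = len(P)
    F = [0] * m
    i, j = 1, 0
    while i < m:
        if P[i] == P[j]:
            F[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]        # fall back to the next shorter candidate prefix
        else:
            F[i] = 0
            i += 1
    return F


def kmp(T: str, P: str):
    """Return the index of the first occurrence of P in T, or None."""
    n, m = len(T), len(P)
    if m == 0:
        return 0
    F = failure_array(P)
    i = j = 0
    while i < n:
        if T[i] == P[j]:
            if j == m - 1:
                return i - j    # the whole pattern matched
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]        # shift the pattern, keep the matched prefix
        else:
            i += 1
    return None
```

For $P = abacaba$ the computed array is `[0, 0, 1, 0, 1, 2, 3]`, which agrees with the table above.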

Boyer-Moore Algorithm

Based on three key ideas.

Reverse-order searching: Compare $P$ with a substring of $T$ moving backwards, from the last character of $P$ to the first.

Bad character jumps: Suppose a mismatch occurs at $T [i] = c$ . If $P$ contains $c$ , we can shift $P$ to align the last occurrence of $c$ in $P$ with $T [i]$ . Otherwise, we can shift $P$ to align $P [0]$ with $T [i + 1]$ .

Good suffix jumps: If we have already matched a suffix of $P$ , then get a mismatch, we can shift $P$ forward to align with the previous occurrence of that suffix. Similar to failure array in KMP.

When a mismatch occurs, Boyer-Moore chooses whichever of bad character or good suffix shifts the pattern further to the right. Can skip large parts of $T$ .

$P = aldo$ , $T = whereiswaldo$

w	h	e	r	e	i	s	w	a	l	d	o
			o
							o
								a	l	d	o

Last-occurrence function: Preprocess the pattern $P$ and the alphabet $Σ$ . Build the last-occurrence function $L$ mapping $Σ$ to integers. $L (c)$ is defined as the largest index $i$ such that $P [i] = c$ or $- 1$ if no such index exists.

Example

$Σ = \{a, b, c, d\}$ , $P = abacab$ .

$c$	a	b	c	d
$L (c)$	4	5	3	-1

The last-occurrence function can be computed in time $O (m + ∣Σ∣)$ . In practice, $L$ is stored in a size- $∣Σ∣$ array.

Suffix skip array: Preprocess $P$ to build a table. Suffix skip array $S$ of size $m$ : For $0 \leq i < m$ , $S [i]$ is the largest index $j$ such that $P [i + 1 \dots m - 1] = P [j + 1 \dots j + m - 1 - i]$ and $P [j] \neq P [i]$ . Any negative indices are considered to make the given condition true.

Similar to the KMP failure array, with an extra condition.

Example

$P = bonobobo$

$i$	0	1	2	3	4	5	6	7
$P [i]$	b	o	n	o	b	o	b	o
$S [i]$	-6	-5	-4	-3	2	-1	2	6

Computed similarly to KMP failure array in $Θ (m)$ time.

boyer-moore(T,P)
L <- last occurrence array computed from P
S <- suffix skip array computed from P
i <- m-1, j <- m-1
while i < n and j >= 0 do
	if T[i] = P[j] then
		i <- i-1
		j <- j-1
	else
		i <- i + m - 1 - min(L[T[i]], S[j])
		j <- m - 1
if j = -1 return i + 1
else return FAIL

Worst-case running-time $\in O (n + ∣Σ∣)$ . Faster than KMP in practice on English text.
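The following Python sketch shows a simplified variant of the search loop that uses only reverse-order searching and bad character jumps. This is an assumption made to keep the example short: the suffix skip array $S$ is left out, and `S[j]` in the shift formula is replaced by `j - 1`, which is the usual choice when only the last-occurrence function is available. The function names are hypothetical and `None` stands for FAIL.

```python
def last_occurrence(P: str):
    """Map each character of P to the largest index where it occurs in P."""
    return {c: i for i, c in enumerate(P)}


def boyer_moore_bad_char(T: str, P: str):
    """Boyer-Moore with the bad-character heuristic only (no good-suffix jumps)."""
    n, m = len(T), len(P)
    if m == 0:
        return 0
    L = last_occurrence(P)
    i = j = m - 1                        # start comparing at the end of P
    while i < n and j >= 0:
        if T[i] == P[j]:
            i -= 1
            j -= 1
        else:
            # characters that do not occur in P have last occurrence -1
            i += m - 1 - min(L.get(T[i], -1), j - 1)
            j = m - 1
    return i + 1 if j == -1 else None
```

On the example $P = aldo$ , $T = whereiswaldo$ this makes exactly the six checks shown in the diagram. Without the good suffix jumps, the worst-case bound stated above no longer applies to this variant.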

Question

What if we want to search for many patterns $P$ within the same fixed text $T$ ?

Idea: Preprocess the text $T$ rather than the pattern $P$ .

Observation: $P$ is a substring of $T$ if and only if $P$ is a prefix of some suffix of $T$ . So we want to store all suffixes of $T$ in a trie.

To save space use a compressed trie, store suffixes implicitly via indices into $T$ . This is called a suffix tree.

Text $T$ has $n$ characters and $n + 1$ suffixes. We can build the suffix tree by inserting each suffix of $T$ into a compressed trie. This takes time $Θ (n^{2})$ . There is a way to build a suffix tree of $T$ in $Θ (n)$ time.

Suffix trees: String matching

Assume we have a suffix tree of text $T$ .

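To search for pattern $P$ : We assume that $P$ does not contain the final end-sentinel $\$$ . $P$ occurs in $T$ if and only if $P$ is the prefix of some suffix of $T$ . In the uncompressed trie, searching for $P$ would be easy: $P$ exists in $T$ if and only if the search for $P$ reaches a node in the trie. In the suffix tree, search for $P$ until one of the following occurs: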

If search fails due to “no such child” then $P$ is not in $T$ .
If we reach end of $P$ , say at node $v$ , then jump to leaf $ℓ$ in subtree of $v$ .
Else we reach a leaf $ℓ = v$ while characters of $P$ left.

Either way, left index at $ℓ$ gives the shift that we should check. This takes $O (∣ P ∣)$ time.
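As an illustration of the uncompressed case only, the sketch below stores all suffixes of $T$ in a trie built from nested dictionaries. It is not a suffix tree: nothing is compressed, so space and construction time are $Θ (n^{2})$ in the worst case. The node layout (a `children` dictionary plus the start index of one suffix passing through the node) and the function names are assumptions for this example.

```python
def build_suffix_trie(T: str):
    """Insert every suffix of T$ into an uncompressed trie."""
    T = T + "$"
    root = {"children": {}, "start": 0}
    for i in range(len(T)):
        node = root
        for c in T[i:]:
            # a new node remembers the first suffix that passes through it
            node = node["children"].setdefault(c, {"children": {}, "start": i})
    return root


def find(root, P: str):
    """Return the index of an occurrence of P in T, or None if there is none."""
    node = root
    for c in P:
        node = node["children"].get(c)
        if node is None:          # "no such child": P is not in T
            return None
    return node["start"]          # P is a prefix of the suffix starting here
```

Since suffixes are inserted from left to right, the stored index is that of the first occurrence. The search itself visits one node per character of $P$ .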

	Brute-Force	KMP	Boyer-Moore	Suffix trees
Preprocessing:	-	$O (m)$	$O (m + ∣Σ∣)$	$O (n^{2})$
Search time:	$O (nm)$	$O (n)$	$O (n)$	$O (m)$
Extra space:	-	$O (m)$	$O (m + ∣Σ∣)$	$O (n)$

Module 10: Data Compression

Question

The problem: How to store and transmit data efficiently?

Source text: The original data, string $S$ of characters from the source alphabet $Σ_{S}$ .

Coded text: The encoded data, string $C$ of characters from the code alphabet $Σ_{C}$ .

Encoding: An algorithm mapping source texts to coded texts.

Decoding: An algorithm mapping coded texts back to their original source text.

$S \xrightarrow{\text{encode}} C \xrightarrow{\text{transmit}} C \xrightarrow{\text{decode}} S$

Source “text” can be any sort of data. The code alphabet $Σ_{C}$ is usually $\{0, 1\}$ . We consider here only lossless compression: Can recover $S$ from $C$ without error.

Main objective: For data compression, we want to minimize the size of the coded text. We will measure the compression ratio:

$\frac{∣ C ∣ \cdot \log ∣ Σ_{C} ∣}{∣ S ∣ \cdot \log ∣ Σ_{S} ∣}$

We consider the efficiency of the encoding/decoding algorithms. We always need time $Ω (∣ S ∣ + ∣ C ∣)$ , but sometimes need more.

Observation: No lossless encoding scheme can have compression ratio $< 1$ for all input strings.

Definition

A character encoding maps each character in the source alphabet to a string in code alphabet
$E : Σ_{S} \to Σ_{C}^{*}$

ASCII is a fixed-length code. Each codeword $E (c)$ has the same length. Encoding/decoding is easy, just concatenate/decode the next 7 bits.

Better idea: Variable-length codes.

Motivation: Some letters in $Σ_{S}$ occur more often than others. For example, consider the frequency of letters in typical English text. We can use shorter codes for more frequent characters.

So as before, map source alphabet to codewords $E : Σ_{S} \to Σ_{C}^{*}$ . Not all codewords have the same length, this ought to make the code text shorter. Morse code is a variable-length code.

Assume that we have some character encoding $E : Σ_{S} \to Σ_{C}^{*}$ . Note that $E$ is a dictionary with keys in $Σ_{S}$ .

singleChar::encoding(E,S,C)
while S is non-empty:
	w <- E.search(S.pop())
	append each bit of w to C

The decoding algorithm must map $Σ_{C}^{*}$ to $Σ_{S}^{*}$ . The code must be lossless (i.e. uniquely decodable). This is false for Morse code.

From now on only consider prefix-free codes $E$ : no codeword is a prefix of another.

This corresponds to a trie with characters of $Σ_{S}$ only at the leaves.

Any prefix-free code is uniquely decodable.

prefixFree::decoding(C,S,T)
while C is non-empty:
	z <- T.root
	while z is not a leaf:
		if C is empty or z has no child labelled C.top()
			return "invalid encoding"
		z <- child of z that is labelled with C.pop()
	S.append(character stored at z)

Runtime is $O (∣ C ∣)$ .

prefixFree::encoding(S,C,T)
E <- array of trie-nodes indexed by \Sigma_S
for all leaves l in T do E[character at l] <- l
while S is non-empty
	w <- empty bitstring
	v <- E[S.pop()]
	while v is not the root
		w.prepend(character from v to its parent)
		v <- parent of v
	append each bit of w to C

Runtime is $O (∣ T ∣ + ∣ C ∣)$ . We assume that all interior nodes of $T$ have two children, otherwise encoding scheme can be improved. Therefore $∣ T ∣ \leq 2∣ Σ_{S} ∣ - 1$ and run-time is $O (∣ Σ_{S} ∣ + ∣ C ∣)$ .

Huffman’s Algorithm

Question

How do we determine the best trie for a given source text $S$ ?

Idea: Frequent characters should have short codewords. Infrequent characters should be ‘far down’ the trie.

Greedy Algorithm: Always pair up most infrequent characters.

1. We store a set of encoding-tries. Initially, each $c \in Σ_{S}$ contributes the trie “$c$” (a height-0 trie holding $c$ ).
2. Each trie has a weight: the sum of the frequencies of all letters in the trie.
3. Find the two tries with the minimum weight and merge them.
4. Repeat Step 3 until there is only one trie left.

We can store these tries via a min-ordered heap.

Huffman::encoding(S,C)
S: Input-stream with characters in \Sigma_S
C: Output-stream
f: array indexed by \Sigma_S, initially all-0
while S is non-empty do increase f[S.pop()] by 1
Q <- min-oriented priority queue that stores tries
for all c in \Sigma_S with f[c] > 0 do Q.insert([c], f[c])
while Q.size() > 1 do
	(T_1, f_1) <- Q.delete-min()
	(T_2, f_2) <- Q.delete-min()
	Q.insert(trie with T_1, T_2 as subtries, f_1 + f_2)
T <- Q.delete-min()
C.append(encoding trie T)
Re-set input-stream S
prefixFree::encoding(S,C,T)

The constructed trie is optimal in the sense that $C$ is shortest. The constructed trie is not unique. Decoder does not know the trie (either send the decoding trie along, or send character-frequencies and tie-breaking rules). Encoding must pass through text twice. Cannot use a stream unless it can be re-set.

Run-time:

Encoding: $O (∣ Σ_{S} ∣ lo g ∣ Σ_{S} ∣ + ∣ C ∣)$
Decoding: $O (∣ C ∣)$
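A compact Python sketch of Huffman's algorithm, using the standard-library `heapq` module as the min-oriented priority queue. The representation is an assumption chosen for brevity: a leaf is a single character, an interior node is a pair `(left, right)`, and code bits are kept as a string of `'0'` and `'1'`. The sketch assumes that $S$ contains at least two distinct characters, and the function names are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_trie(S: str):
    """Build the encoding trie by repeatedly merging the two lightest tries."""
    freq = Counter(S)
    tie = count()                       # tie-breaker: the heap never compares tries
    heap = [(f, next(tie), c) for c, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (t1, t2)))
    return heap[0][2]


def codewords(trie, prefix=""):
    """Read the dictionary E off the trie: left edge = 0, right edge = 1."""
    if isinstance(trie, str):
        return {trie: prefix}
    left, right = trie
    return {**codewords(left, prefix + "0"), **codewords(right, prefix + "1")}


def encode(S: str, E) -> str:
    return "".join(E[c] for c in S)


def decode(C: str, trie) -> str:
    out, z = [], trie
    for bit in C:
        z = z[int(bit)]                 # follow the edge labelled with this bit
        if isinstance(z, str):          # reached a leaf: emit and restart at the root
            out.append(z)
            z = trie
    return "".join(out)
```

The trie depends on how ties are broken, but the length of the coded text does not. For `"LOSSLESS"` the codeword lengths are 1 for S, 2 for L and 3 for O and E, giving 14 bits in total.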

Run-Length Encoding

Example of multi-character encoding: multiple source-text characters receive one code-word.

Requires: Input must be a bitstring ( $Σ_{S} = \{0, 1\}$ ). Decoding dictionary is uniquely defined and not explicitly stored.

Good to use when $S$ has long runs.

Send the leftmost bit of $S$ . Then send a sequence of integers indicating run lengths. We don’t have to give the bit for runs since they alternate.

Various ideas for encoding one integer $k$ :

Base-2 encoding: $E_{b 2} (k) :=$ the string $w$ such that $(w)_{2} = k$
Bijective base-2 encoding: $E_{bb 2} (k) := E_{b 2} (k + 1)$ with the leading 1 removed

$k$	1	2	3	4	5	6	7	8
$E_{b 2} (k)$	1	10	11	100	101	110	111	1000
$E_{bb 2} (k)$	0	1	00	01	10	11	000	001
	A	B	AA	AB	BA	BB	AAA	AAB

For a sequence of integers, we need a separator-symbol

Question

Can we develop codes that do not need a separator-symbol?

Elias gamma code (for an integer $k$ ). To encode $k$ , take binary representation $E_{b 2} (k)$ of $k$ . Add leading zeroes so that initial 1 is in the middle. This means adding $⌊ lo g k ⌋$ leading zeroes.

$k$	1	2	3	4	5	6	7	8
$E_{b 2} (k)$	1	10	11	100	101	110	111	1000
$⌊ lo g k ⌋$	0	1	1	2	2	2	2	3
$E_{γ} (k)$	1	010	011	00100	00101	00110	00111	0001000

Definition

To do run-length encoding: Write initial bit to output, determine the run-lengths $k_{1}, k_{2}, \dots, k_{d}$ , write $E_{γ} (k_{1}), E_{γ} (k_{2}), \dots E_{γ} (k_{d})$ .

RLE::encoding(S,C)
S:input-stream of bits, C:output-stream
b <- S.top(); C.append(b)
while S is non-empty do
	k <- 0
	while (S is non-empty and S.top() = b) do k++; S.pop()
	w <- empty string
	while k > 1
		C.append(0)
		w.prepend(k mod 2)
		k <- floor(k/2)
	w.prepend(1)
	append each bit of w to C
	b <- 1 - b

Claim: A sequence of Elias gamma codes can be decoded uniquely. Each Elias gamma code has form $0^{ℓ} 1 *^{ℓ}$ . Can determine $ℓ$ by scanning until we encounter a 1. Convert this 1 plus the next $ℓ$ bits into the integer.

RLE::decoding(C,S)
C:input-stream of bits, S: output-stream
b <- C.pop()
while C is non-empty
	l <- 0
	while C.pop() = 0 do l++
	k <- 1
	for (j <- 1 to l) do k <- k * 2 + C.pop()
	for (j <- 1 to k) do S.append(b)
	b <- 1 - b

If C.pop() is called when there are no bits left, then $C$ was not valid input.

An all-0 string of length $n$ would be compressed to $0 E_{γ} (n)$ , which has $2 ⌊ lo g n ⌋ + 2 \in o (n)$ bits.
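The encoder and decoder can be sketched in Python on strings of `'0'` and `'1'` characters. Working on strings rather than bit streams is an assumption made for readability, and the function names are hypothetical. The input to `rle_encode` is assumed to be non-empty.

```python
def elias_gamma(k: int) -> str:
    """Binary representation of k >= 1, preceded by floor(log k) zeroes."""
    b = bin(k)[2:]
    return "0" * (len(b) - 1) + b


def rle_encode(S: str) -> str:
    """First bit of S, followed by the Elias gamma codes of the run lengths."""
    out = [S[0]]
    i = 0
    while i < len(S):
        j = i
        while j < len(S) and S[j] == S[i]:
            j += 1                      # S[i..j-1] is one maximal run
        out.append(elias_gamma(j - i))
        i = j
    return "".join(out)


def rle_decode(C: str) -> str:
    b, pos, out = C[0], 1, []
    while pos < len(C):
        l = 0
        while C[pos] == "0":            # count leading zeroes
            l += 1
            pos += 1
        k = int(C[pos:pos + l + 1], 2)  # the 1 plus the next l bits
        pos += l + 1
        out.append(b * k)
        b = "1" if b == "0" else "0"    # runs alternate
    return "".join(out)
```

An invalid coded text typically surfaces here as an `IndexError` or `ValueError`, which corresponds to calling `C.pop()` when no bits are left.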

Huffman and RLE take advantage of frequent/repeated single characters.

Observation: Certain substrings are much more frequent than others.

Ingredient 1 for Lempel-Ziv-Welch compression: take advantage of such substrings without needing to know beforehand what they are.

ASCII, UTF-8, and RLE use fixed dictionaries.

In Huffman, the dictionary is not fixed, but it is static: the dictionary is the same for the entire encoding/decoding.

Ingredient 2 for LZW: adaptive encoding:

There is a fixed initial dictionary $D_{0}$ .
For $i \geq 0$ , $D_{i}$ is used to determine the $i$ th character to output,
After writing the $i$ th character to output, encoder updates $D_{i}$ to $D_{i + 1}$

Challenge: Decoder must know how encoder changed the dictionary.

Convert $S$ into a list of code-numbers. Start with a dictionary $D_{0}$ that stores ASCII. Every step adds to dictionary a multi-character string, using code-numbers 128, 129, $\dots$

Encoding alternates two steps:

Find longest string $w$ that is already in $D_{i}$ . So all of $w$ can be encoded with one number.
Add the substring that would have been useful to dictionary: add $w c$ where $c$ is the character that follows $w$ in $S$ .

Need to match characters $\to$ use a trie. To find $w$ : parse in trie until we reach node $z$ with ‘no such child’ for next character $c$ . To add $w c$ to $D_{i}$ : add child of $z$ with link labelled $c$ .

LZW::encoding(S,C)
D <- trie that maps each ASCII character to its code-number in {0, ..., 127}
idx <- 128
while S is non-empty do
	z <- root of trie D
	while (S is non-empty and z has a child c labelled S.top())
		z <- c; S.pop()
	C.append(code-number stored at z)
	if S is non-empty
		create child of z labelled S.top() with code-number idx
		idx++

Run-time: $O (∣ S ∣)$ , assuming we can look up child in $O (1)$ time.

LZW::decoding(C,S)
D ← dictionary that maps {0, . . . , 127} to ASCII
idx ← 128
k ← C.pop(); s ← D.search(k); S.append(s)
while there are more codes in C do
	sprev ← s; k ← C.pop()
	if k < idx do s ← D.search(k)
	else if k = idx do s ← sprev + sprev[0] // special situation
	else return FAIL
	append each character of s to S
	D.insert(idx, sprev + s[0])
	idx++

Dictionary $D$ maps consecutive integers to words. Use an array.

Decoding run-time is $O (∣ S ∣)$ . $Θ (∣ s ∣)$ each round to append $s$ to output.

Dictionary wastes space, words may get long. To save space, store strings as code of prefix $+$ one character.

Encoding and decoding take $O (∣ S ∣)$ . Encoding and decoding need to go through the string only once $⟹$ can do compression while streaming the text.
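A short Python sketch of both directions. As a simplifying assumption, the encoder keeps the dictionary as a Python `dict` of strings instead of a trie, so a single lookup is not constant time and the encoder does not meet the $O (∣ S ∣)$ bound; the decoder uses a list indexed by code-number as suggested above. The source text is assumed to be non-empty ASCII, and the function names are hypothetical.

```python
def lzw_encode(S: str):
    D = {chr(i): i for i in range(128)}     # initial dictionary D_0: ASCII
    idx, codes, i = 128, [], 0
    while i < len(S):
        w = S[i]
        j = i + 1
        while j < len(S) and w + S[j] in D:  # extend w while it is still in D
            w += S[j]
            j += 1
        codes.append(D[w])
        if j < len(S):
            D[w + S[j]] = idx               # add w followed by the next character
            idx += 1
        i = j
    return codes


def lzw_decode(codes):
    D = [chr(i) for i in range(128)]
    s = D[codes[0]]
    out = [s]
    for k in codes[1:]:
        s_prev = s
        if k < len(D):
            s = D[k]
        elif k == len(D):                   # special situation: code not yet known
            s = s_prev + s_prev[0]
        else:
            raise ValueError("invalid encoding")
        out.append(s)
        D.append(s_prev + s[0])             # mirror the encoder's dictionary update
    return "".join(out)
```

The text `"ABABABA"` is encoded as `[65, 66, 128, 130]`. The last code, 130, is used by the encoder before the decoder has added it, so decoding it exercises the special situation.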

bzip2

Combine multiple compression schemes. Key ingredient is to use text transforms. Change input into a different text that compresses better.

$T_{0} \xrightarrow{\text{Burrows-Wheeler transform}} T_{1} \xrightarrow{\text{Move-to-front transform}} T_{2} \xrightarrow{\text{Modified RLE}} T_{3} \xrightarrow{\text{Huffman encoding}} T_{4}$

$T_{1}$ is a permutation of $T_{0}$ . If $T_{0}$ has repeated substrings, then $T_{1}$ has long runs of characters.
$T_{2}$ uses ASCII-numbers. If $T_{1}$ has long runs of characters, then $T_{2}$ has long runs of zeroes and skewed frequencies.
If $T_{2}$ has long runs of zeroes, then $T_{3}$ is shorter. Skewed frequencies remain.
Huffman encoding compresses $T_{3}$ well since frequencies are skewed.

Move-to-front transform: Dictionary $D : \{0, \dots, 127\} \to$ ASCII is an unordered array maintained with the move-to-front heuristic. Character $c$ gets mapped to index $i$ with $D [i] = c$ , and $c$ is then moved to the front of $D$ . A character in $S$ repeats $k$ times $⟺ C$ has a run of $k - 1$ zeroes. We would expect lots of small numbers in the output.
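The move-to-front transform is short enough to sketch directly. The code below assumes ASCII input and uses a plain Python list as the dictionary; the function names are hypothetical.

```python
def mtf_encode(S: str):
    D = [chr(i) for i in range(128)]   # initial dictionary: ASCII order
    out = []
    for c in S:
        i = D.index(c)                 # current position of c
        out.append(i)
        D.insert(0, D.pop(i))          # move c to the front
    return out


def mtf_decode(C):
    D = [chr(i) for i in range(128)]
    out = []
    for i in C:
        c = D.pop(i)                   # the decoder performs the same moves
        out.append(c)
        D.insert(0, c)
    return "".join(out)
```

For example, `"aaab"` is mapped to `[97, 0, 0, 98]`: the run of three `a` characters yields a run of two zeroes.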

Modified run-length encoding: Input is a list of ‘characters’ in $\{0, \dots, 127\}$ . Encode only the runs of zeroes. Encode run-length $k$ with bijective base-2 encoding, using two new characters $A, B$ .

Huffman encoding is the same as before.

Burrows-Wheeler Transform

Given: Source text $S$ as an array with end-sentinel.

Step 1: Write down cyclic shifts. $i$ th cyclic shift - move $i$ characters from front to back. We treat end-sentinel $\$$ like any other character.

Observe: Every column contains a permutation of $S$ .

Step 2: Sort lexicographically. Use MSD-radix-sort. $Θ (n)$ strings of length $Θ (n) ⟹ Θ (n^{2})$ worst-case time. But usually much faster.

Observation: Every column continues to contain a permutation of $S$ .

Step 3: Extract rightmost column.

The Burrows-Wheeler transform consists of the last characters of the cyclic shifts of $S$ after sorting them lexicographically.

Same sorting permutation for cyclic shifts and suffixes. That’s the suffix array $A^{s}$ . Can compute this in $O (n lo g n)$ time. Read BWT encoding from it: $C[i] = \begin{cases} S[A^s[i] -1] & \text{if } A^s[i] > 0 \\ \$ & \text{otherwise} \end{cases}$.

Given string $C$ obtained from BWT encoding. We can reconstruct the first and last column of the matrix of cyclic shifts.

BWT::decoding(C,S)
for all indices i of C
	A[i] <- (C[i], i)
stably sort A by character
for all indices j of C
	if C[j] = $ break
do
	S.append(character stored in A[j])
	j <- index stored in A[j]
while appended character is not $

BWT encoding cost is $O (n lo g n)$ . BWT decoding cost is $O (n + ∣ Σ_{S} ∣)$ . Encoding and decoding both use $O (n)$ space.
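Both directions of the transform can be sketched in a few lines of Python. The encoder below sorts the cyclic shifts naively, which is simpler but slower than going through the suffix array; this is an assumption made for brevity. The sketch also assumes that the text ends with the sentinel `$` and that all other characters compare greater than `$` (true for letters and digits in ASCII, but not for spaces). The function names are hypothetical.

```python
def bwt_encode(S: str) -> str:
    """S must end with '$'. Returns the last column of the sorted cyclic shifts."""
    n = len(S)
    order = sorted(range(n), key=lambda i: S[i:] + S[:i])   # sort the cyclic shifts
    return "".join(S[i - 1] for i in order)                 # S[-1] is '$' when i == 0


def bwt_decode(C: str) -> str:
    # A[r] = (character in the first column of row r,
    #         row in which that same occurrence sits in the last column)
    A = sorted((c, i) for i, c in enumerate(C))   # sorting pairs = stable sort by character
    j = C.index("$")                              # the row that ends in $ is S itself
    out = []
    while True:
        c, j = A[j]
        out.append(c)
        if c == "$":
            break
    return "".join(out)
```

For `"banana$"` the encoder produces `"annb$aa"`, and decoding that string recovers `"banana$"`.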

They need all of the text. BWT is a block compression method that compresses one block at a time.

bzip2 encoding cost is $O (n (lo g n + ∣Σ∣))$ with a big constant. bzip2 decoding cost is $O (n ∣Σ∣)$ with a big constant.

Module 11: External Memory

Recall the RAM model of a computer. Any access to a memory location takes the same constant time. This is not realistic.

Typical current computer architecture includes registers, cache L1, L2, main memory, disk or cloud.

Question

How to adapt our algorithms to take the memory hierarchy into account, avoiding transfers as much as possible?

Define a new computer model that models one such ‘gap’ across which we must transfer.

Assumption: During a transfer, we automatically load a whole block. This is realistic.

New objective: Revisit all algorithms with the objective of minimizing block transfers.

Stream based algorithms (with $O (1)$ resets) use $Θ (\frac{n}{B})$ block transfers.

Recall: Dictionaries store $n$ KVPs and support search, insert, and delete. AVL-trees were optimal in time and space in RAM model. $Θ (lo g n)$ run time $⟹ O (lo g n)$ block transfers per operation. Inserts happen at varying locations of the tree. Nearby nodes are unlikely to be on the same block, typically $Θ (lo g n)$ transfers per operation.

We would like to have fewer block transfers. The goal is $O (lo g_{B} n)$ transfers.

Design a tree-structure that guarantees that many nodes on search-paths are within one block.

Idea: Store complete subtrees with $\log b$ levels in one block of memory. Each block/subtree then covers height $\log b$ . Search-path hits $\frac{\log n}{\log b}$ blocks $⟹ \log_{b} n$ block-transfers. Since $b \in Θ (B)$ , we have $\log_{b} n \in Θ (\log_{B} n)$ .

View the entire content of a block as one node.

Define multiway-tree: A node can store multiple keys.

Definition

A $d$ -node stores $d$ keys, has $d + 1$ subtrees, and stored keys are between the keys in the subtrees.

We always have one more subtree than keys.

To allow insert/delete, we permit varying numbers of keys in nodes. We also rigidly restrict where empty subtrees may be. This gives much smaller height than for AVL-trees $⟹$ fewer block transfers.

Definition

An $a$ - $b$ -tree (for some $b \geq 3$ and $2 \leq a \leq ⌈ \frac{b}{2} ⌉$ ) satisfies

Every non-root is a $d$ -node for some $a - 1 \leq d \leq b - 1$ .

Between $a$ and $b$ subtrees, between $a - 1$ and $b - 1$ keys.

The root is a $d$ -node for $1 \leq d \leq b - 1$ .

Between 2 and $b$ subtrees, between $1$ and $b - 1$ keys.

All empty subtrees are at the same level.

For $2$ - $4$ -trees, every node has between $1$ and $3$ keys.

Typically, we specify the order $b$ and then set $a = ⌈ \frac{b}{2} ⌉$ . With small height we can store many keys.

Theorem

An $a$ - $b$ -tree with $n$ keys has $O (lo g_{a} (n))$ height.

How many keys must an $a$ - $b$ -tree of height $h$ have?

$\# \text{non-root nodes} \geq \sum_{i = 1}^{h} 2 a^{i - 1} = 2 \sum_{j = 0}^{h - 1} a^{j} = 2 \frac{a^{h} - 1}{a - 1}$

$n = \# \text{KVPs} \geq \underbrace{1}_{\text{root}} + \underbrace{(a - 1)}_{\geq a - 1 \text{ KVPs at non-root}} \cdot 2 \frac{a^{h} - 1}{a - 1} = 2 a^{h} - 1$

Therefore $h \leq lo g_{a} (\frac{n + 1}{2})$ .

Search is similar to BST: Compare search-key to keys at node. If not found, continue in appropriate subtree until empty.

abTree::search(k)
z <- root, p <- NULL
while z is not NULL
	let < T_0, k_1, ..., k_d, T_d > be key-subtree list at z
	if k >= k_1 
		i <- maximal index such that k_i <= k
		if k_i = k then return KVP at k_i
		else p <- z, z <- root of T_i
	else p <- z, z <- root of T_0
return "not found, would be in p"

Number of visited nodes: $O (\log_{a} n)$ . Finding $i$ is not constant time.

For insert, do abTree::search and add key and empty subtree at leaf. If the leaf had room then we are done. Else overflow: More keys/subtrees than permitted. Resolve overflow by node splitting.

abTree::insert(k)
z <- abTree::search(k)
Add k and an empty subtree in key-subtree-list of z
while z has b keys (overflow -> node split)
	Let < T_0, k_1, ..., k_b, T_b > be the key-subtree list at z
	if (z has no parent) create a parent of z without KVPs
	move upper median k_m of keys to parent p of z
	z' <- new node with < T_0, k_1, ..., k_m-1, T_m-1 > 
	z'' <- new node with < T_m, k_m+1, ..., k_b, T_b > 
	Replace <z> by <z', k_m, z''> in key-subtree-list of p
	z <- p

An $a$ - $b$ tree has height $O (lo g_{a} n)$ . If $a \approx b /2$ , then this height-bound is tight. Level $i$ contains at most $b^{i}$ nodes. Each node contains at most $b - 1$ KVPs. So $n \leq b^{h + 1} - 1$ and $h \in Ω (lo g_{b} n)$ .

search and insert visit $O (lo g_{a} n)$ nodes. delete can also be implemented with $O (lo g_{a} n)$ node-visits. But usually use lazy deletion, space is cheap in external memory.

Definition

A red-black tree is a binary search tree such that:

Every node has a colour (red or black),

Every red node has a black parent (in particular the root is black),

Any empty subtree $T$ has the same black-depth (number of black nodes on path from root to $T$ ).

Rather than proving properties or describing operations directly, we convert back to $2$ - $4$ -trees.

Lemma: Any red-black tree $T$ can be converted into a $2$ - $4$ -tree $T^{'}$ .

Black node with $0 \leq d \leq 2$ red children becomes a $(d + 1)$ -node. This covers all nodes. Empty subtrees on same level due to the same black-depth.

insert can be done in $O (lo g n)$ worst-case time. delete can also be done in $O (lo g n)$ time.

A $B$ -tree is an $a$ - $b$ -tree tailored to the external memory model. Every node is one block of memory (of size $B$ ). The order $b$ is chosen maximally such that a $(b - 1)$ -node fits into a block of memory. Typically $b \in Θ (B)$ . $a$ is set to $⌈ b /2 ⌉$ as before.

search, insert, and delete each require visiting $Θ (height)$ nodes. Work within a node is done in internal memory $⟹$ no block-transfer. The height is $Θ (\log_{a} n) = Θ (\log_{B} n)$ (since $a = ⌈ b /2 ⌉ \in Θ (B)$ ).

So all operations require $Θ (lo g_{B} n)$ block transfers. This is asymptotically optimal.

Hash functions don’t adapt well to external memory. We must occasionally re-hash to keep load factor $α$ small. And re-hashing must load all $n / B$ blocks. This is unacceptably slow.

Goal: Data structure for hash-values that typically uses $O (1)$ block transfers, and never needs to load all blocks.

Idea: Keys $⇝$ hash-values $=$ integers $⇝$ fixed-length bit-strings. Store trie of bit-strings whose leaves are blocks of memory.

Assumption: We store fixed-length bit-strings.

Build trie $D$ (the directory) of bit-strings in internal memory. Stop splitting in $D$ when remaining items fit in one block. Each leaf of $D$ refers to a block of external memory. The blocks store KVPs in no particular order.

search(k): Search for $k$ in $D$ until we reach leaf $ℓ$ . Load block at $ℓ$ . Search for $k$ in block. 1 block transfer.

delete(k): search(k) loads block, delete $k$ from block, transfer updated block back. 2 block transfers.

insert(k): Search for $k$ in $D$ until we reach leaf $ℓ$ . Load block $P$ at $ℓ$ . If $P$ is at capacity, leaf $ℓ$ gets two new children, create two new blocks, split items in $P$ by next bit. Insert $k$ into appropriate block. Transfer updated block back. Typically 2-3 block transfers.

If all items in $P$ have the same next bit, then split repeatedly. For big $B$ this is extremely unlikely.

Hashing collisions mean duplicate bit-strings, so all colliding items are in the same block. We do not care how collisions are resolved within the block.

If more than $B$ items have the same hash-value $\to$ if all bit-strings in block are the same, we cannot split. This means either the load factor is too big or the hash-function is bad. We extend the hash function.
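The directory idea can be sketched in Python with everything held in internal memory. In this sketch a Python `dict` stands in for one block of external memory, Python's built-in `hash` truncated to a fixed number of bits stands in for the hash function, and the classes `Leaf`, `Interior` and `HashTrie` are hypothetical names. Deletion is omitted.

```python
class Leaf:
    def __init__(self):
        self.block = {}                 # stands in for one block holding at most B KVPs


class Interior:
    def __init__(self, zero, one):
        self.child = [zero, one]


class HashTrie:
    def __init__(self, B=4, bits=16):
        self.B, self.bits = B, bits
        self.root = Leaf()

    def _bit(self, k, depth):
        h = hash(k) & ((1 << self.bits) - 1)          # fixed-length bit-string
        return (h >> (self.bits - 1 - depth)) & 1     # bit number `depth` from the left

    def _find_leaf(self, k):
        parent, z, depth = None, self.root, 0
        while isinstance(z, Interior):
            parent, z = z, z.child[self._bit(k, depth)]
            depth += 1
        return parent, z, depth

    def search(self, k):
        _, leaf, _ = self._find_leaf(k)               # directory walk, then one block
        return leaf.block.get(k)

    def insert(self, k, v):
        parent, leaf, depth = self._find_leaf(k)
        while len(leaf.block) >= self.B and k not in leaf.block:
            if depth == self.bits:
                raise OverflowError("more than B items share one hash value")
            kids = [Leaf(), Leaf()]                   # split the full block by the next bit
            for key, val in leaf.block.items():
                kids[self._bit(key, depth)].block[key] = val
            node = Interior(*kids)
            if parent is None:
                self.root = node
            else:
                parent.child[parent.child.index(leaf)] = node
            parent, leaf, depth = node, kids[self._bit(k, depth)], depth + 1
        leaf.block[k] = v
```

The `while` loop covers the case where all items in a full block share the same next bit, so the split has to be repeated. When the bits run out, the sketch raises an error instead of extending the hash function.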
