Professor: Arne Storjohann | Term: Fall 2024

Module 1: Introduction & Asymptotic Analysis

Much of Computer Science is problem solving: Write a program that converts the given input to the expected output.

When first learning to program, we emphasize correctness. With this course, we will also be concerned with efficiency. We will study efficient methods of storing, accessing, and organizing large collections of data.

Typical operations include: Inserting new data items, deleting data items, searching for specific data items, sorting.

We consider various abstract data types (ADTs), and how to realize them efficiently using appropriate data structures. There is a strong emphasis on mathematical analysis in this course.

Useful Logarithms facts:

  • means
  • in this course means
  • , ,
  • ,

Useful Sums:

  • if .

Geometric Sequences:

Harmonic Sequence:

Algorithms and Problems

Problem: Description of possible input and desired output.

Problem Instance: One possible input for the specified problem.

Algorithm: Step-by-step process for carrying out a series of computations, given an arbitrary instance .

Solving a problem: An algorithm solves a problem if, for every instance of , computes a valid output for the instance in finite time.

Program: A program is an implementation of an algorithm using a specified computer language.

Pseudocode: Communicate an algorithm to another person.

insertion-sort(A, n)
A: array of size n
for (i <- i; i < n; i++) do
	for (j <- i; j > 0 and A[j-1] > A[j]; j--) do
		swap A[j] and A[j-1]

Omits obvious details, and should be precise about exit-conditions.

For problem to program that solves it:

  • Design an algorithm that solves Algorithm Design.
  • Assess correctness and efficiency of each Algorithm Analysis. Correctness CS245.
  • If acceptable, implement algorithms.
  • If multiple acceptable algorithms, run experiments to determine the best solution.

This course will focus on the first two points.

Question

What do we mean by efficiency?

In this course, we are primarily concerned with the amount of time a program takes to run (run time).

We also may be interested in the amount of additional memory the program requires (auxiliary space).

The amount of time and or memory required by a program will usually depend on the given problem instance.

RAM model:

  • Each memory cell stores one datum, typically a number, character, or reference.
  • Any access to a memory location takes constant time.
  • Any primitive operation takes constant time.

We compare algorithms by considering the growth rate: What is the behaviour of algorithms as size gets large?

We use order notation (big- and friends). Ignore constants and lower order terms.

We study relationships between functions.

Definition

-notation: ( is asymptotically upper-bounded by ) if there exists constants and such that for all .

In order to prove that from first principles, we need to find and such that the following condition is satisfied:

Many, but not all, choices of and will work.

We also have that . We want a tight asymptotic bound.

Definition

-notation: ( is asymptotically lower-bounded by ) if there exists constants and such that for all .

Definition

-notation: ( is asymptotically tightly-bounded by ) if there exists constants , and such that .

Equivalently:

We also say that the growth rates of and are the same.

How growth rates affect running time. It is interesting to see how the running time is affected when the size of the problem instance doubles.

We have .

Question

How to express that grows slower than .

Definition

-notation: ( is asymptotically strictly smaller than ) if for all constants , there exists a constant such that .

Definition

-notation: ( is asymptotically strictly larger than ) if for all constants , there exists a constant such that .

Suppose that and for all . Suppose that

Then

If the fraction goes towards , then .

This can be computed using l’Hôspital’s rule. This result gives sufficient but not necessary conditions for the stated conclusion to hold.

Many rules are easily proved from first principle.

Identity rule: .

Transitivity:

  • If and then .
  • If and then .
  • If and then .

Suppose that and for all . Then:

Relationships between order notations

Avoid doing arithmetic with asymptotic notations. Do not write .

Techniques for run-time analysis:

  • Identify primitive operations that require time
  • The complexity of a loop is expressed as the sum of the complexities of each iteration of the loop.
  • Nested loops: start with the innermost loop and proceed outwards. This gives nested summations.

Let denote the running time of an algorithm on instance .

Worst-case (best-case) complexity of an algorithm: The worst-case (best-case) running time of an algorithm is a function mapping to the longest (shortest) running time for any input instance of size :

Average-case complexity of an algorithm: The average-case running time of an algorithm is a function mapping to the average running time of over all instances of size :

Question

Suppose algorithm has worst-case run-time and algorithm has worst-case run-time , and both the same problem. Is more efficient?

No! -notation is an upper-bound. may well have worst-case run-time . We should always give -bounds.

Question

Suppose the run-times are and , respectively. We consider to be better. But is it always more efficient?

No! The worst-case run-time of may only be achieved on some instances. Possibly is better on most instances.

Merge-sort

Step 1: Describe the overall idea

Input: Array of integers.

  1. We split into subarrays and that are roughly half as big.
  2. Recursively sort and
  3. After and have been sorted, use a function merge to merge them into a single sorted array.

Step 2: Give pseudo-code or detailed description

merge-sort(A,n)
if (n <= 1) then return
else
	m = (n-1)/2
	merge-sort(A[0..m], m+1)
	merge-sort(A[m+1..n-1], r)
	merge(A[0..m], A[m +1...n-1])

Do not pass array by value, instead indicate the range of the array that needs to be sorted. merge an auxiliary array . Allocate this only once.

merge(A,l,m,r,S)
A[0..n-1] is an array, A[l..m] is sorted, A[m+1..r] is sorted
S[0..n-1] is an array
copy A[l..r] into S[l..r]
(i_L, r_R) <- (l, m+1)
for (k <- l, k <= r: k++) do
	if (i_L > m) A[k] <- S[i_R ++]
	else if (i_R > r) A[k] <- S[i_L ++]
	else if (S[i_L] <= S[i_R]) A[k] <- S[i_L ++]
	else A[k] <- S[i_R ++]

Step 3: Argue correctness. Typically state loop-invariants. Sometimes obvious enough from idea-description and comments.

Step 3: Analyze the run-time

  • First analyze work done outside recursion
  • If applicable, analyze subroutines separately. If there are recursions: how big are the subproblems? The run-time then becomes a recursive function.

The recurrence relation for is as follows

The following is the corresponding sloppy recurrence

This resolves to .

Module 2: Priority Queues

Definition

Abstract Data Type (ADT): A description of information and a collection of operations on that information. (Covered in CS135).

The information is accessed only through the operations.

We can have various realizations of an ADT, which specify how the information is stored and how the operations are performed.

Recall:

Stack: An ADT consisting of a collection of items with operations:

  • push: Add an item to the stack.
  • pop: Remove and return the most recently added item.

Items are removed in LIFO order. We can have extra operations size, is-empty, and top.

Queue: An ADT consisting of a collection of items with operations:

  • enqueue: Add an item to the queue
  • dequeue: Remove and return the least recently inserted item.

Items are removed in FIFO order.

We can have extra operations size, is-empty, and peek/format.

Priority Queue generalizes both ADT Stack and ADT Queue.

It is a collection of items (each with a priority or key) with operations:

  • insert: Inserting an item tagged with a priority
  • delete-max: Removing and returning an item of highest priority.

We can have extra operations: size, is-empty, get-max.

This is a maximum-oriented priority queue. A minimum-oriented priority queue replaces operation delete-max by delete-min.

Using a priority queue to sort:

PQ-Sort(A[0..n-1])
for i <- 0 to n-1 do
	PQ.insert (an item with priority A[i])
for i <- n-1 down to 0 do
	A[i] <- priority of PQ.delete-max()

Written as: (initialization insert + delete-max

Realizations of Priority Queues

Realization 1: Unsorted arrays

Run-time of operations:

  • insert:
  • delete-max:

We assume dynamic arrays, expand by doubling as needed. Amortized over all insertions this takes extra time.

PQ-sort with this realization yields selection-sort.

Realization 2: Sorted arrays

Run-time of operations:

  • insert:
  • delete-max:

PQ-sort with this realization yields insertion-sort. insert can be implemented to have best-case run-time. insertion-sort then has best-case run-time.

Realization 3: Heaps

A binary heap is a certain type of binary tree. A binary tree is either empty, or consists of a node and two binary trees (left subtree and right subtree).

Level all nodes with distance from the root. Root is on level .

Height maximum number for which level contains nodes. Single-node tree has height , empty tree has height .

Any binary tree with height has nodes. So height .

A heap is a binary tree with the following two properties:

  1. Structural property: All the levels of a heap are completely filled, except possibly for the last level. The filled items in the last level are left-justified.
  2. Heap-order property: For any node , the key of the parent of is larger than or equal to key of .

The full name for this is max-oriented binary heap.

Lemma: The height of a heap with nodes is .

Heaps should not be stored as binary trees. Let be a heap of items and let be an array of size . Store root in and continue with elements level-by-level from top to bottom, in each level-to-right.

It is easy to navigate using this array representation:

  • The root node is at index 0
  • The last node is
  • The left child is of node is node
  • The right child of node is node
  • The parent of node is node
  • These nodes exist if the index falls in the range

We hide implementation details using helper-functions.

insert in heaps: Place the new key at the first free leaf. Use fix-up to restore heap order.

insert(x)
l <- last()+ 1
A[l] <- x
increase size
fix-up(A, l)
fix-up(A,i)
while parent(i) exists and A[parent(i)].key < A[i].key do
	swap A[i] and A[parent(i)]
	i <- parent(i)

Time: height of heap (and this is tight).

delete-max in heaps

The maximum item of a heap is just the root node. We replace the root by the last leaf (last leaf is taken out). The heap-order property might be violated, perform a fix-down.

fix-down(A,i)
while i is not a leaf do
	j <- left child of i
	if (i has right child and A[right child of i].key > A[j].key)
		j <- right child of i
	if A[i].key >= A[j].key break
	swap A[j] and A[i]
	i <- j

Time: height of heap (and this is tight).

delete-max()
l <- last()
swap A[root()] and A[l]
decrease size
fix-down(A, root(), size)
return A[l]

Time: height of heap (and this is tight).

Binary heap are a realization of priority queues where the operations insert and delete-max take time.

Any priority queue can be used to sort in time

Using the binary-heaps implementation of PQs, we obtain:

PQ-sort-with-heaps(A)
for i <- 0 to n-1 do
	H.insert(A[i])
for <- n-1 down to 0 do
	A[i] <- H.delete-max()

Both operations run in time for heaps PQ-sort using heaps takes time.

Can improve this with two simple tricks heap-sort.

  1. Can use the same array for input and heap auxiliary space.
  2. Heaps can be built faster if we know all input in advance.

Problem: Given items all at once (in ) build a heap containing all of them.

Solution 1: Start with an empty heap and insert items one at a time.

simple-heap-building(A)
for i <- 0 to A.size() - 1 do
	H.insert(A[i])

This corresponds to doing fix-up’s. Worst-case running time:

We can use fix-down’s instead.

heapify(A)
n <- A.size()
for i <- parent(last()) downto root() do
	fix-down(A, i, n)

This yields a worst-case complexity of . A heap can be built in linear time.

Efficient sorting with heaps: Idea is to use PQ-sort with heaps.

heap-sort()A
n <- A.size()
for i <- parent(last()) downto 0 do
	fix-down(A,i,n)
 
while n > 1
	swap items at A[root()] and A[last()]
	decrease n
	fix-down(A, root(), n)

The for-loop takes time and the while-loop takes time.

Problem: Find the th smallest item in an array of numbers.

Solution 1: Make passes through the array, deleting the minimum number each time. Complexity .

Solution 2: Sort , then return . Complexity .

Solution 3: Create a min-heap with heapify(A). Call delete-min(A) times. Complexity: .

Question

We can achieve worst-case time easily, but can we do better?

Yes.

Module 3: Sorting, Average-case and Randomization

We introduce (and solve) a new problem, and then analyze the average-case run-time of our algorithm.

Recall the definition of average-case run-time:

Assume that the set of size- instances is finite. Assume that all instances can occur equally frequently. Then we can use the following simplified formula

To learn how to analyze this, we will do simpler examples first.

silly-test(π, n)
π: a permutation of {0, ..., n-1}, stored as an array
if π[0] = 0 then for j = 1 to n do print 'bad case'
else print 'good case'

permutations have run-time . The remaining permutations have run-time .

Another not-so-contrived example

all-0-test(w,n)
// test whether all entries of bitstring w[0..n-1] are 0
if (n=0) return true
if (w[n-1] = 1) return false
all-0-test(w,n-1)

Define bit-comparisons on input . This is asymptotically the same as the run-time.

Worst-case run-time: Always go into the recursion until . .

Best-case run-time: Return immediately. .

Average-case run-time?

Recall \quad .

Recursive formula for one non-empty bitstring :

Natural guess for the recursive formula for :

This holds with but is useless if ??? is worst. This is not obvious if ??? is avg.

Observe: is uniquely determined by and .

This recursion resolves .

Question

Why can’t we always write ‘avg’ for ’???’ in ?

Consider

silly-all-0-test(w,n)
if (n = 0) then return true
if (w[n-1] = 1) then return false
if (n>1) then w[n-2] <- 0
silly-all-0-test(w,n-1)

Only one more line of code in each recursion, so same formula applies.

But observe that now

So . The ‘obvious’ recursion did not hold. Average-case analysis is highly non-trivial for recursive algorithms.

We saw SELECTION: Given an array of numbers, and , find the element that would be at position of the sorted array.

SELECTION can be done with heaps in time .

Special case: MEDIANFINDING = SELECTION with . With previous approaches, this takes time , no better than sorting.

Question

Can we do selection in linear time?

Yes, the algorithm quick-select.

quick-select and the related quick-sort rely on two subroutines:

  • choose-pivot(A): Return an index in . We will use the pivot-value to rearrange the array.
  • partition(A,p): Rearrange and return pivot-index so that
    • the pivot-value is in ,
    • all items in are , and
    • all items in are .

index of pivot-value before partition (we choose it), index of pivot-value after partition (no control).

Partition Algorithm

partition(A,p)
v <- A[p]
for each element x in A do
	if x < v then smaller.append(x)
	else if x > v then larger.append(x)
	else equal.append(x)
i <- smaller.size
j <- equal.size
Overwrite A[0...i-1] by elements in smaller
Overwrite A[i...i+j-1] by elements in equal
Overwrite A[i+j...n-1] by elements in larger
return i

Efficient in-place partition (Hoare)

partition(A,p)
swap(A[n-1], A[p])
i <- -1
j <- n-1
v <- A[n-1]
loop
	do i <- i+1 while A[i] < v
	do j <- j-1 while j > i and A[j] > v
	if i >= j then break 
	else swap(A[i], A[j])
swap(A[n-1], A[i])
return i

Running time .

quick-select(A,k)
p <- choose-pivot(A)
i <- partition(A,p)
if i = k then return A[i]
else if i > k then return quick-select(A[0...i-1], k)
else if i < k then return quick-select(A[i+1...n-1], k-(i+1))

Let be the number of key-comparisons in a size- array with parameter . partition uses key-comparisons.

Worst-case run-time: Sub-array always gets smaller, so recursions. Each takes comparisons time. This is tight: If pivot-value is always the maximum and .

Best-case run-time: First chosen pivot could be the th element. No recursive calls; .

Average case analysis is complicated. We detour into a randomized version.

A randomized algorithm is one which relies on some random numbers in addition to the input. Randomization is a good idea if an algorithm has bad worst-case time but seems to perform much better on most instances.

The run-time will depend on the input and the random numbers used. We assume that there exists a pseudo-random number generator.

Goal: Shift the dependency of run-time from what we can’t control (input) to what we can control (random numbers).

randomized-all-0-test(w,n)
if n = 0 return true
if (random(2)=0) then
	w[n-1] = 1 - w[n-1]
if w[n-1] = 1 return false
randomized-all-0-test(w,n-1)

This is all-0-test, except that we flip last bit based on a coin toss. random(n) returns an integer uniformly from .

The run-time of the algorithm now depends on the random numbers. Define to be the run-time of a randomized algorithm for an instance and the sequence of outcomes of random trials.

The expected run-time for instance is the expected value:

Now take the maximum over all instances of size to define the expected run-time (formally: worst-instance expected-luck run-time) of .

Expected run-time of randomized-all-0-test.

Define bit-comparisons used on input if the random outcomes are .

The random outcomes consist of two parts .

  • outcome of first coin toss
  • random outcomes (if any) for the recursions

We have .

Recursive formula for one instance:

Natural guess for the recursive formula for

In contrast to average-case analysis, the natural guess usually is correct for the expected run-time.

Proof for randomized-all-0-test.

Therefore .

Question

Does the expected time of a randomized version always have something to do with the average-case time?

Yes if the randomization is a shuffle (choose instance randomly).

Average-case run-time and expected run-time are not the same. Average-case is the average over instances usually applied to a deterministic algorithm. Expected is the weighted average over random outcomes, applied only to a randomized algorithm.

For analyzing the average-case run-time of quick-select, we make two assumptions:

  • All input-items are distinct.
  • All possible orders of the input-items occur equally often.

Goal is to create a randomized version of quick-select.

First idea: Shuffle the input, then do quick-select.

shuffled-quick-select(A,k)
for (j <- 1 to n-1) do swap(A[j], A[random(j+1)])
quick-select(A,k)

We know .

Second idea: Do the shuffling inside the recursion.

randomized-quick-select(A,k)
swap A[n-1] witih A[random(n)]
i <- partition(A,n-1)
if i = k then return A[i]
else if i > k then
	return randomized-quick-select(A[0 ... i-1], k)
else if i < k then
	return randomized-quick-select(A[i+1 ... n-1], k-(i+1))

Let key-comparisons of randomized-quick-select on input if the random outcomes are .

Write random outcomes as . Observe: . We recurse in an array of size or .

As for rand-all-0-test: over all , the recursions can use .

Tedious but straightforward. This recursion resolves to .

randomized-quick-select is generally the fastest solution to the selection problem.

Hoare developed partition and quick-select. He then used them to sort based on partitioning.

quick-sort(A)
if n <= 1 then return
p <- choose-pivot(A)
i <- partition(A,p)
quick-sort(A[0, 1, ..., i-1])
quick-sort(A[i+1, ..., n-1])

Worst-case run-time:

Best-case run-time:

Average-case run-time we prove via randomization.

.

We have seen many sorting algorithms.

Question

Can one do better than running time?

Yes and no, depends on what we allow.

No: Comparison-based sorting lower bound is

Yes: Non-comparison-based sorting can achieve (under restrictions).

All algorithms so far are comparison-based. Data is accessed by comparing two elements, and moving elements around.

Theorem

Any comparison-based sorting algorithm requires in the worst case comparisons to sort distinct items.

Easy to understand via a decision tree.

Non-Comparison-Based Sorting

Assume keys are numbers in base (: radix). All digits are in .

Assume all keys have the same number of digits. Can achieve after padding with leading s.

Can sort based on individual digits.

bucket-sort(A, n, sort-key())
sort-key(): maps items of A to {0,...,R-1}
Initialize an array B[0..R-1] of empty queues (buckets)
for i <- 0 to n-1 do
	Append A[0] at the end of B[sort-key(A[i])]
i <- 0
for j <- 0 to R-1 do
	while B[j] is non-empty do
		move front element of B[j] to A[i++]

bucket-sort is stable, equal items stay in original order.

Run-time is , auxiliary space . It is possible to replace the lists by arrays count-sort.

Most-significant-digit(MSD)-radix-sort

Sort array of -digit radix- numbers recursively: sort by 1st digit, then each group by 2nd digit etc.

MSD-radix-sort(A,n,d<-1)
A: array of size n, contains m-digit radix-R numbers
if (d <= m and (n > 1))
	bucket-sort(A, n, return dth digit of A[i])
	l <- 0
	for j <- 0 to R-1
		Let r >= l - 1 be maximal s.t. A[l..r] have dth digit j
		MSD-radix-sort(A[l..r], r - l + 1, d+1)
		l <- r + 1
  • levels of recursion in worst-case
  • subproblems on most levels in worst-case
  • time for each bucket-sort call.

Run-time - slow. Many recursions and allocated arrays.

LSD-radix-sort(A,n)
A: array of size n, contains m-digit radix-R numbers
for d <- least significant to most significant do
	bucket-sort(A,n,return dth digit of A[i])

Time cost with auxiliary space .

Module 4: Dictionaries

Definition

Dictionary: An ADT consisting of a collection of items, each of which contains a key and some data. This is called a key value pair. Keys can be compared and are typically unique.

Common assumptions:

Dictionary has KVPs, each KVP uses constant space, keys can be compared in constant time.

In an unordered array or linked list, search is , insertion is , delete is .

In an ordered array, search is , insertion is , delete is .

Recall binary search only applies to a sorted array.

Binary search trees: All nodes have two (possibly empty) subtrees. Every node stores a KVP, empty subtrees aren’t shown.

Every in is less than the root key, and every key in is greater than the root key.

BST as a realization of ADT dictionary.

BST::search(k): Start at root, compare to current node’s key. Stop if found or subtree empty, else recurse at subtree.

Deletion in a BST: First search for the node that contains the key. If is a leaf, delete it. If has one non-empty subtree, move child up. Otherwise, swap key at with key at successor node and then delete that node. The successor is the next-smallest among all keys in the dictionary.

BST::search, BST::insert, BST::delete all have cost , where height of tree max path length from root to leaf.

Question

If items are inserted one-at-a-time, how big is ?

Worst-case:

Best-case: . Any binary tree with nodes has height . Layer has at most nodes. So .

Goal is to create a subclass of BSTs where the height is always good. We need to impose a structural property, argue that the property implies logarithmic height, and discuss how to maintain the property during operations.

AVL Trees: An AVL Tree is a BST with an additional height-balance property at every node: The heights of left and right subtree differ by at most .

If node has left subtree and right subtree , then

Need to store at each node the height of the subtree rooted at it.

Theorem

An AVL tree on nodes has height. search, BST::insert, BST::delete all cost in the worst case.

AVL Insertion (AVL::insert(k,v))

First, insert with the usual BST insertion. We assume that this returns the new leaf where the key was stored. Then, move up the tree from . Update height (we can do this in constant time). If the height difference becomes at nodes , then is unbalanced. Must re-structure the tree to rebalance.

There are many different BSTs with the same keys.

The goal is to change the structure locally without changing the order. Longterm goal is to restructure so the subtree becomes balanced.

Right rotation on node

flowchart TD
z((z)) --> c((c))
z --> D[/D\]
c --> g
c --> C[/C\]
g((g)) --> A[/A\]
g --> B[/B\]

Becomes

flowchart TD
c((c)) --> g((g))
c --> z((z))
g --> A[/A\]
g --> B[/B\]
z --> C[/C\]
z --> D[/D\]
rotate-right(z)
c <- z.left, z.left <- c.right, c.right <- z
setHeightFromSubtrees(z), setHeightFromSubtrees(c)
return c

Left rotation is used to fix a right-right imbalance.

A double right rotation on node starts with a left rotation at , then a right rotation at .

flowchart TD
z((z)) --> c((c))
z --> D[/D\]
c --> A[/A\]
c --> g((g))
g --> B[/B\]
g --> C[/C\]

Becomes

flowchart TD
z((z)) --> c((c))
z --> D[/D\]
c --> g
c --> C[/C\]
g((g)) --> A[/A\]
g --> B[/B\]

Which then becomes

flowchart TD
g((g)) --> c((c))
g --> z((z))
c --> A[/A\]
c --> B[/B\]
z --> C[/C\]
z --> D[/D\]

Symmetrically, there is a double left rotation.

AVL insertion revisited. When there is an imbalance at do (single or double) rotation. Choose as child where subtree has bigger height.

AVL::insert(k,v)
z <- BST::insert(k,v)
while(z is not NULL)
	if (|z.left.heigth - z.right.height| > 1) then
		Let c be taller child of z
		Let g be taller child of c
		restructure(g,c,z)
		break
	setHeightFromSubtrees(z)
	z <- z.parent

We can argue that for insertion, one rotation restores all heights of subtrees.

restructure(g,c,z)
node g is a child of c which is a child of z
p <- z.parent
(make the right rotation)
make u the appropriate child of p and return u

The middle key of becomes the new root.

AVL Deletion (AVL::delete(k))

Remove the key with BST::delete. Find node where structure change happened. Go back up to the root, update heights, and rotate if needed.

AVL::delete(k)
z <- BST::delete(k)
Assume z is the parent of the BST node that was removed
while (z is not NULL)
	if (|z.left.height - z.right.height| > 1) then
		Let c be taller child of z
		Let g be taller child of c
		z <- restructure(g,c,z)
	setHeightFromSubtrees(z)
	z <- z.parent

Ties must be broken to avoid double rotation.

AVL Tree Summary

  • Search:
  • Insert: . Restructure will be called at most once
  • Delete: . Restructure may be called times.

Worst-case cost for all operations is .

Module 5: Other Dictionary Implementations

So far we’ve seen unordered arrays, ordered arrays, binary search trees, and balanced binary search trees.

If the KVPs were inserted in random order, then the expected height of the binary search tree would be .

We did not consider an ordered list as a realization of ADT dictionary. insert and delete take time in an ordered list, once we know where to place them. The bottleneck is in search.

In an ordered array, we can do binary search to achieve . In an ordered list, we cannot skip to the middle, and so cannot do binary search. Search takes time in an ordered list, this is slow.

To speed up search in an ordered list, add links to help us move forward quicker. Choose randomly when to add links.

Skip Lists

A hierarchy of ordered linked lists (levels) .

Each list contains the special keys and (sentinels). List contains the KVPs of in non-decreasing order. Each list is a subsequence of the previous one, i.e., . List contains only the sentinels.

A node is an entry in one list, a KVP is one non-sentinel entry in .

There are usually mode nodes than KVPs. Each node has references and . Each key belongs to a tower of nodes. Height of tower of : maximal index such that . Height of skip list: maximal index such that exists.

Search in skip list. For each list, find predecessor.

get-predecessors(k)
p <- root
P <- stack of nodes, initially containing p
while p.below != NULL do
	p <- p.below
	while p.after.key < k do p <- p.after
	P.push(p)
return P
skipList::search(k)
P <- get-predecessors(k)
p_0 <- P.top()
if p_0.after.key = k return KVP at p_0.after
else return "not found, but would be after p_0"

It is easy to remove a key since we can find all predecessors. We eliminate lists if there are multiple ones with only sentinels.

skipList::delete(k)
P <- get-predecessors(k)
while P is non-empty
	p <- P.pop()
	if p.after.key = k
		p.after <- p.after.after
	else break
 
p <- left sentinel of the root-list
while p.below after is the infinity-sentinel
	// the two top lists are both only sentinels, remove one
	p.below <- p.below.below
	p.after.below <- p.after.below.below

Inserting in skip lists skipList::insert(k,v). There is no choice as to where to put the tower of . Only choice, how tall should we make the tower of ? We choose randomly.

Repeatedly toss a coin until you get tails. Let be the number of times the coin came up heads. We want key to be in lists , so height of tower of .

Before we can insert, must check that these lists exist. Then we do the actual insertion. Use get-predecessors(k) to get stack . The top items of are the predecessors of where should be in each list . Insert after in , and after in for .

skipList::insert(k,v)
for (i <- 0; random(2) = 1; i++) {}
for (h <- 0, p <- root.below, p != NULL; p <- p.below, h++) {}
while i >= h
	create new sentinel-only list; link it in below topmost list
	h++
P <- get-predecessors(k)
p <- P.pop()
zbelow <- new node with (k,v)
zbelow.after <- p.after; p.after <- zbelow
while i > 0
	p <- P.pop()
	z <- new node with k
	z.after <- p.after; p.after <- z; z.below <- zbelow; zbelow <- z

Analysis of skip lists.

Expected space usage: . Expected height: .

skipList::get-predecessors: expected time.

Search, insert, delete: expected time.

Recall unsorted array realization:

  • Search:
  • Insert:
  • Delete:

Question

Can we do something to make search more effective in practice?

No. If items are accessed with equal likelihoods. We can show that the average-case cost for search is then .

We can make search more effective if search requests are biased (some items are accessed much more frequently than others). Intuition is that frequently accessed items should be in the front. Two scenarios - do we know the access distribution beforehand or not?

Optimal Static Ordering

Scenario: We know access distribution, and want the best order of a list.

KeyABCDE
Frequency of access281105

Recall:

Count of cost if search-key ( instance ) is at th position ).

expected access cost

Order has expected access cost of .

Order is better (access cost of ).

Claim: Over all possible static orderings, the one that sorts items by non-increasing access-probability minimizes the expected access cost.

Dynamic Ordering: MTF

Scenario: We do not know the access probabilities ahead of time.

Idea: Modify the order dynamically, i.e., while we are accessing. Rule of thumb (temporal locality), a recently accessed item is likely to be used again.

Move-To-Front heuristic (MTF): Upon a successful search, move the accessed item to the front of the list.

We can also do MTF on an array, but should then insert and search from the back so that we have room to grow.

There are other heuristics we can also use. Upon a successful search, swap the accessed item with the item immediately preceding it (transpose heuristic). Keep counters how often items were accessed, and sort in non-decreasing order (frequency-count heuristic).

Module 6: Dictionaries for Special Keys

So far we’ve seen balanced binary search trees and skip lists. Various other realizations sometimes faster on insert, but search always takes time.

Interpolation Search

Scenario: Numbers in sorted array [40, 50, 70, 90, 100, 110, 120].

binary-search(A,n,k)
l <- 0, r <- n-1
while (l <= r)
	m <- (l + r) / 2
	if (A[m] equals k) then return "found at A[m]"
	else if (A[m] < k) then l <- m + 1
	else r <- m -1 
return "not found"

Compare at index . If keys are numbers, where would you expect key ?

Interpolation search is similar to binary search, but compare at index.

interpolation-search(A,k)
while (l <= r)
	if (k < A[l] or k > A[r]) return "not found"
	if (k = A[r]) then return "found at A[r]"
	m = expression above
	if (A[m] equals k) then return "found at A[m]"
	else if (A[m] < k) then l <- m + 1
	else r <- m-1

In the worst case, this can be very slow . But it works well on average. We can show . This resolves to .

Scenario: Keys in dictionary are words. Need brief review.

Words ( strings): sequences of characters over alphabet .

Convention: Words have end-sentinel $. w.size = non-sentinel characters: |\text{be}\| = 2$.

Trie (a radix tree): A dictionary for bit-strings. Comes from retrieval. A tree based on bitwise comparisons, edge labelled with corresponding bit. Similar to radix sort: use individual bits, not the whole key. Due to end-sentinels, all key-value pairs are at leaves.

Tries: Search

Follow links that corresponds to current bits in . Repeat until no such link or found at a leaf. Similar as for skip lists, we find search-path separately first.

Trie::get-path-to(w)
P <- empty stack; z <- root; d <- 0; P.push(z)
while d <= |w|
	if z has a child-link labelled with w[d]
		z <- child at this link; d++; P.push(z)
	else break
return P
Trie::search(w)
P <- get-path-to(w), z <- P.top
if (z is not a leaf) then
	return "not found, would be in sub-trie of z"
return key-value pair at z

For later applications of tries, we want another search-operation. prefix-search(w): find word in trie for which is a prefix. Testing whether exists is easy.

To find quickly, we need leaf-references. Every node stores reference to a leaf in subtree. Convention: store leaf with the longest word.

Trie::prefix-search(w)
P <- get-path-to(w), p <- P.top()
if number of nodes on P is w.size or less
	return "no extension of w found"
return p.leaf

Trie::insert(w). get-path-to(w) gives ancestors that exist already. Expand the trie from by adding necessary nodes that correspond to extra bits of . Update leaf-references.

search(w), prefix-search(w), insert(w), delete(w) all take time . Search time is independent of number of words stored in the trie! Search-time is small for short words.

The trie for a given set of words is unique (expend for order of children and ties among leaf-references).

Tries can be wasteful with respect to space. Worst-case space is .

Question

What can we do to save space?

Pruned Tries: Stop adding nodes to trie as soon as the key is unique.

A node has a child only if it has at least two descendants. Saves spaces if there are only few bit-strings that are long. This is a more efficient version of tires, but the operations get a bit more complicated.

We have (implicitly) seen pruned tries before: For equal-length bit-strings, pruned trie equals recursion tree of MSD radix-sort.

Compressed Tries: Compress paths of nodes with only one child. Each node stores an index, corresponding to the level of the node in the uncompressed trie. These are known as Patricia-Tries (Practical Algorithm to Retrieve Information Coded in Alphanumeric)

As for tries, follow links that correspond to current bits . Main difference: stored indices say which bits to compare.

CompressedTrie::get-path-to(w)
P <- empty stack; z <- root: P.push(z)
while z is not a leaf and (d <- z.index <= w.size) do
	if (z has a child-link labelled with w[d]) then
		z <- child at this link; P.push(z)
	else break
return P
CompressedTrie::search(w)
P <- get-path-to(w), z <- P.top
if (z is not a leaf or word stored at z is not w) then
	return "not found"
return key-value pair at z

search(w) and prefix-search(w) are easy. insert(w) and delete(w) are conceptually simple. Search for path to word, uncompress this path, insert/delete as in an uncompressed trie, compress path from root to where change happened.

All operations take time for a word . Compressed tries use space. We have leaves, every internal node has two or more children. Can show that there are more leaves than internal nodes.

To represent strings over any fixed alphabet , any node will have at most children.

Variation: Compressed multi-way tries: compress paths as before.

Operations search(w), prefix-search(w), insert(w), and delete(w) are exactly as for tries for bit-strings. Run-time . Each node now has up to children.

Question

How should they be stored?

Time/space tradeoffs. Arrays are fast, lists are space-efficient. Dictionary best in theory, not worth it in practice unless is huge. In practice, we use hashing.

Module 7: Dictionaries via Hashing

Dictionaries via Hashing

Special situation: For a known , every key is an integer with . We can then implement a dictionary easily: Use an array of size that stores via .

  • search(k): Check whether is NULL.
  • insert(k,v):
  • delete(k): NULL

Each operation is . Total space is .

Two disadvantages of direct addressing. It cannot be used if the keys are not integers. It wastes space if is unknown or .

Hashing idea: Map (arbitrary) keys to integers in range (for an integer of our choice), then use direct addressing.

We know that all keys come from some universe . We pick a table-size . We pick a hash function . Store dictionary in a hash table. An item with key wants to be stored in slot .

Generally hash function is not injective, so many keys can map to the same integer. We get collisions: we want to insert into the table, but is already occupied. There are many strategies to resolve collisions.

Hashing with Chaining

Simplest collision-resolution strategy: Each slot stores a bucket containing or more KVPs. A bucket could be implemented by any dictionary realization. The simplest approach is to use unsorted lists with MTF for buckets. This is called collision resolution by chaining.

  • insert(k, v): Add to the front of the list at
  • search(k): Look for key in the list at . Apply MTF-heuristic.
  • delete(k): Perform a search, then delete from the linked list.

insert takes time . search and delete have run-time .

Complexity of chaining: insert takes time . search and delete have run-time .

The average bucket size is ( is also called the load factor).

However, this does not imply that the average-case cost of search and delete is . Consider the case where all keys hash to the same slot. The average bucket-size is still . But the operations take time on average.

To get meaningful average-case bounds, we need assumptions on hash-functions and keys.

We can switch to randomized hashing. Assume that the hash-function is chosen randomly. We assume the following:

Uniform Hashing Assumption: Any possible hash-function is equally likely chosen has hash-function.

UHA implies that the distribution of keys is unimportant. Claim: Hash-values are uniform. Formally: for any key and slot .

Proof: Let (for be hash-functions with . For any , can map to and vice versa. So

Claim: Hash-values of any two keys are independent of each other.

Back to complexity of chaining. Each bucket has expected length . other keys are in this slot with probability . Each key in dictionary is expected to collide with other keys. other keys are in same slot with probability . Expected search and delete is hence .

For hashing with chaining, the run-time bound depends on . We keep the load factor small by rehashing when needed.

For Hashing with Chaining: Rehash so that throughout. Rehashing costs time. Rehashing happens rarely enough that we can ignore this term when amortizing over all operations. We re-hash when gets too small.

The amortized expected cost for hashing with chaining is and the space is .

Probe Sequences

Main Idea: Avoid the links needed for chaining by permitting only one item per slot, but allowing a key to be in multiple slots.

search and insert follow a probe sequence of possible locations for key : until an empty spot is found.

Simplest method for open addressing: linear probing.

, for some hash function .

delete becomes problematic. Cannot leave an empty spot behind; the next search might otherwise not go far enough. We can try to move later items in probe sequence forward. Better idea is lazy deletion. Mark the spot as deleted, search continues past deleted spots, insertion reuses deleted spots.

Keep track of how many items are ‘deleted’ and re-hash if there are too many.

probe-sequencce::insert(T,(k,v))
for (j=0; j<m; j++) 
	if T[h(k,j)] is NULL or "deleted"
		T[h(k,j)] = (k,v)
		return "success"
return "failure to insert"
probe-sequence-search(T,k)
for (j=0; j < m; j++) 
	if T[h(k,j)] is NULL return "item not found"
	if T[h(k,j)] has key k return T[h(k,j)]
return "item not found"

Some hashing methods require two hash functions . These hash functions should be independent in the sense that the random variables and are independent. Using two modular hash-functions often leads to dependencies.

Better idea: Use multiplication method for second hash function:

is some floating-point number with . computes fractional part , which is in . Multiple with to get floating-point number in . Round down to get integer in .

Assume there are two independent hash functions .

Assume further that and that is relative prime with the table-size for all keys . Choose prime. Modify standard hash-functions to ensure .

Double-hashing: open-addressing with probe seqeunce

search, insert, delete work just like for linear probing, with this different probe sequence.

Cuckoo Hashing

We use two independent hash functions , and two tables .

Main idea: An item with key can only be at or . search and delete always take constant time.

insert always initially puts the new item into . Evict item that may have been there already. If so, evicted item inserted at alternate position. This may lead to a loop of evictions. Can show that if insertion is possible, then there are at most evictions.

Can show: expected number of evictions during insert is . So in practice, stop evictions much earlier than rounds. This crucially requires load factor . Here .

Cuckoo hashing is wasteful on space. In fact, space is if insert forces lots of re-hashing. Can show the expected space is .

For any open addressing scheme, we must have . For the analysis, we require . Cuckoo hashing requires .

All strategies have expected time for search, insert, delete. Cuckoo Hashing has worst-case time for search, delete. Probe sequences use worst-case space, Cuckoo Hashing uses expected space.

For any hash-functions the worst-case run-time is for insert.

Hash Function Strategies

Recall UHA: Hash function is randomly chosen among all possible hash-functions.

Satisfying this is impossible: There are too many hash functions; we would not know how to look up .

We have to compromise, choose a hash-function that is easy to compute.

Two basic methods for integer keys (module method and multiplication method). Every hash function must do badly for some sequences of inputs: If the universe contains at least keys, then there are keys that all hash to the same value.

Carter-Wegman’s universal hashing: Choose hash-function randomly.

Requires: all keys are in for some big prime . At initialization, and whenever we re-hash: Choose arbitrarily, power of is okay. Choose (and store) two random numbers random(p). random(p-1). can be computed quickly.

Analysis of these Carter-Wegman hash functions. Choosing in this way does not satisfy uniform hashing assumptions. But can show: two keys collide with probability at most . This suffices to prove the run-time bounds for hashing with chaining.

Question

What if the keys are multi-dimensional, such as strings?

Standard approach is to flatten string to integer .

Module 8: Range-Searching in Dictionaries for Points

So far, search(k) looks for one specific item. New operation range-search: look for all items that fall within a given range. Input: A range, i.e., an interval . It may be open or closed at the ends. We want to report all KVPs in the dictionary whose key satisfies .

As usual denotes the number of input-items. Let be the output-size (the number if items in the range). We need time simply to report the items. Typical run-time: .

Unsorted list/array/hash table: Range search requires time: We have to check for each item explicitly whether it is in the range.

Sorted array: Range search in can be done in time.

BST: Range searches can similarly be done in time time.

Range searches are of special interest for multi-dimensional data.

Each item has aspects (coordinates): so corresponds to a point in -dimensional space.

We concentrate on (points in Euclidean plane).

(Orthogonal) -dimensional range search: Given a query rectangle , find all points that lie within .

The time for range searches depends on how the points are stored. Could store a 1-dimensional dictionary (key is the combination of aspects). However, range search on one aspect is not straightforward.

Better idea: Design new data structures specifically for points (quadtrees, kd-trees, range-trees). Assumption: Points are in general position: No two points on a horizontal line, no two points on a vertical line.

Quadtrees

We have points in the plane. Find a bounding box : a square containing all points. Assume that all coordinates are non-negative. Find max-coordinate in , use the smallest such that it is .

Structure (how to build the quadtree that stores ):

  • Root of the quadtree is associated with region
  • If contains 0 or 1 points, then root is a leaf that stores point.
  • Else, split: Partition into four equal subsquares (quadrants) .
  • Partition into sets of points in these regions. Points on split lines belong to right/top side.
  • Recursively build tree for points in region and make them children of the root.
graph TD

    A["[0,16] × [0,16]"]

    B["[0,8] × [8,16]"]
    C["[0,8] × [0,8]"]

    P4((p4))
    P5((p5))

    A --> B
    A --> C
    A --> P4
    A --> P5

    B --> D((∅))
    B --> E((∅))
    B --> F["[0,4] × [8,12]"]
    B --> P8((p8))

    F --> P9((p9))
    F --> P3((p3))
    F --> G((∅))
    F --> P1((p1))


    C --> P6((p6))
    C --> P0((p0))
    C --> P2((p2))
    C --> P7((p7))

search: Analogous to binary search trees and tries

insert: Search for the point, split the leaf while there are two points in one region.

delete: Search for the point, remove the point, if the parent has only one point left: delete parent.

QTree::range-search(r <- root, Q)
r: The root of a quadtree, Q: Query-rectangle
R <- region associated with node r
if (R \subseteq Q) then
	report all points below r and return
else if (R \cap Q is empty) then return
 
if (r is a leaf) then
	p <- point stored at r
	if p is not NULL and in Q then report it and return
	else return
for each child v of r do QTree::range-search(v,Q)

Question

What is the height of a quadtree?

Can have very large height for bad distributions of points. Even with points, the height could be arbitrarily large.

There exists a (weaker) bound that depends on the spread factor of points :

Can show: height of quadtree is in .

Complexity to build initial tree: worst-case.

Complexity of range search: worst-case even if the answer is .

Quadtrees easily generalize to higher dimensions, but are rarely used beyond dimension 3.

Quadtrees are very easy to compute. No complicated arithmetic, space can potentially be wasteful if points are not well-distributed. We can use quadtrees to store pixelated images.

kd-trees

We have points . Quadtrees split square into quadrants regardless of where points are. kd-tree idea: split the region at upper median of coordinates. Each node of the kd-tree keeps track of a splitting line in one dimension.

Convention: Points on split lines belong to right/top side. Continue splitting, switching between vertical and horizontal lines, until every point is in a separate region.

Build kd-tree with initial split by on points :

  • If create a leaf and return
  • Else randomized-quick-select(P, n/2) (select by -coordinate
  • Partition by -coordinate into and . points on one side and points on the other.
  • Create left subtree recursively (splitting by ) for points .
  • Create right subtree recursively (splitting by ) for points .

Run-time: Find and partition in expected time using randomized-quick-select. Both subtrees have points.

This resolve to expected time. This can be reduced to worst-case time by pre-sorting.

Height: . This resolves to (tight).

Space: All interior nodes have exactly two children. Therefore have interior nodes. Space is .

search: as in binary search tree using indicated coordinate.

insert: search, insert as new leaf.

delete: search, remove leaf.

Problem: After insert or delete, the split might no longer be at exact median and the height is no longer guaranteed to be . We can maintain height by occasionally re-building entire subtrees. But range-search will be slower.

kd-trees do not handle insertion/deletion well.

Range search is exactly as for quad-trees, except that there are only two children and leaves always store points.

We assume that each node stores its associated region. To save space, we could instead pass the region as a parameter and compute the region for each child using the splitting line.

We spend time at each visited node.

Observe: visited nodes is where is the number of “boundary” nodes. Neither nor .

Can show: satisfies the following recurrence relation:

This implies .

Therefore, the complexity of range search in kd-trees is .

Storage is , height is , construction time is , range search time is . This assumes that is constant (where is for -dimensional space).

Both quadtrees and kd-trees are intuitive and simple. But both may be very slow for range searches. Quadtrees are also potentially wasteful in space.

New idea: Range trees

Tree of trees (a multi-level data structure). So fare, nodes in our trees stored a key-value pair and references to children and (maybe) the parent. But we can store much more in a node! Each node stores in another binary search tree. They are wasteful in space, but permit much faster range search.

2-dimensional range trees.

Primary structure: Balanced binary search tree that stores and uses -coordinates as keys.

Every node of stores an associate structure :

  • Let be all points in subtree of in (including point at )
  • stores in a balanced binary search tree, using the -coordinates as key. Note: Point of is not necessarily in the root of .

Primary tree uses space.

Question

How many nodes do all associate trees together have?

is a child of , is a child of . Point of is only in associate tree . Point of is in associate trees , . Point of is in associate trees , , .

Key insight: Point of is in associate tree if and only if is an ancestor of in .

A range tree with points uses space. This is tight for some primary trees.

search: search by -coordinate in .

insert: First, insert point by -coordinate into . Then, walk back up to the root and insert the point by -coordinate in all associate trees of nodes on path to the root.

delete: Analogous to insertion.

Problem: We want the binary search trees to be balanced. This makes insert/delete very slow if we use AVL-trees. Solution: Completely rebuild highly unbalanced subtrees. Run-time for insert/delete becomes amortized.

BST::range-search-recursive(r <- root, x_1, x_2)
r: root of a binary search tree, x_1, x_2: search keys
Returns keys in subtree at r that are in range [x_1, x_2]
if r = NULL then return
if x_1 <= r.key <= x_2 then
	L <- BST::range-search-recursive(r.left, x_1, x_2)
	R <- BST::range-search-recursive(r.right, x_1, x_2)
	return L \cup r.{key} \cup R
if r.key < x_1 then
	return BST::range-search-recursive(r.right, x_1, x_2)
if r.key > x_2 then
	return BST::range-search-recursive(r.left, x_1, x_2)

Keys are reported in in-order.

BST Range Search analysis. Assume that the binary search tree is balanced. Search for path : . Search for path : .

boundary nodes. We spend time on each. We spend time per topmost inside node . They are children of boundary nodes, so this takes time. For 1d range search, also report the descendants of .

We have since subtrees of topmost inside nodes are disjoint. So this takes time overall.

Run-time for 1d range search: . This is no faster overall, but topmost inside nodes will be important for 2d range search.

Range search for is a two stage process: Perform a range search (on the -coordinates) for the interval in primary tree (BST::range-search(T, x_1, x_2)). Get boundary and topmost inside nodes as before. For every boundary node, test to see if the corresponding point is within the region .

For every topmost inside node :

  • Let be the points in the subtree of in .
  • We know that all -coordinates of points in are within range.
  • Recall: is stored in .
  • To find points in where the -coordinates are within range as well, perform a range search in : BST::range-search(T_ass(z), y_1, y_2).

Range Trees: Range Search Run-Time

time to find boundary and topmost inside nodes in primary tree. There are such nodes. time for each topmost inside node , where is the number of points in that are reported. Two topmost inside nodes have no common point in their trees every point is reported in at most one associate structure .

Time for range search in range-tree is proportional to

Range trees can be generalized to -dimensional space.

Space is , construction time , range search time is .

Space/time trade-off compared to kd-trees.

Module 9: String Matching

Pattern Matching

Search for a string (pattern) in a large body of text.

, the text (or haystack) being searched within.

, the pattern (or needle) being searched for.

Strings over alphabet . Return the first such that

This is the first occurrence of in . If does not occur in , return FAIL.

Definition

Substring : a string of length which consists of characters in order.

Definition

A prefix of : A substring of for some .

Definition

A suffix of : A substring of for some .

Pattern matching algorithms consist of guesses and checks. A guess is a position such that might start at . Valid guesses (initially) are . A check of a guess is a single position with where we compare to . We must perform checks of a single correct guess, but may make (many) fewer checks of an incorrect guess.

We will diagram a single run of any pattern matching algorithm by a matrix of checks, where each row represents a single guess.

Brute-force Algorithm. Idea: Check every possible guess.

BruteforcePM(T[0..n-1], P[0..m-1])
T: String of length n (text), P: String of length m (pattern)
for i <- 0 to n-m do
	match <- true
	j <- 0
	while j < m and match do
		if T[i+j] = P[j] then
			J <- j + 1
		else 
			match <- false
	if match then
		return i
return FAIL

Example: , .

abbbababbab
abb
ab
abba

Question

What is the worst possible input?

.

Worst case performance . .

More sophisticated algorithms (KMP and Boyer-Moore). Do extra preprocessing on the pattern . We eliminate guesses based on completed matches and mismatches.

KMP Algorithm

Knuth-Morris-Pratt algorithm.

Compares the pattern to the text in left-to-right. Shifts the pattern more intelligently than the brute-force algorithm. When a mismatch occurs, what is the most we can shift the pattern.

Tabcdcabc???
abcdcaba
abcdca

KMP answer: the largest prefix of that is a suffix of .

KMP Failure Array: Preprocess the pattern to find matches of prefixes of the pattern with the pattern itself.

The failure array of size : is defined as the length of the largest prefix of that is also a suffix of . .

If a mismatch occurs at we set .

Consider .

0-abacaba0
1babacaba0
2bbacaba1
3bacabacaba0
4bacbacaba1
5bacacaba2
6baccaba3
KMP(T, P)
F <- failureArrayy(P)
i <- 0
j <- 0
while i < n do
	if T[i] = P[j] then
		if j = m-1 then
			return i-j
		else
			i <- i + 1
			j <- j + 1
	else
		if j > 0 then
			j <- F[j-1]
		else 
			i <- i+1
return -1

abaxyabacabb
abac
(a)b
a
a
abacaba
(a)(b)a
failureArray(P)
F[0] <- 0
i <- 1
j <- 0
while i < m do
	if P[i] = P[j] then
		F[i] <- j + 1
		i <- i + 1
		j <- j + 1
	else if j > 0 then
		j <- F[j-1]
	else 
		F[i] <- 0
		i <- i + 1

For failureArray, at each iteration of the while loop, either increases by one, or the guess index increases by at least one ). There are no more than iterations of the while loop. Running time: .

Bayer-Moore Algorithm

Based on three key ideas.

Reverse-order searching: Compare with a subsequence of moving backwards.

Bad character jumps: When a mismatch occurs at . If contains , we can shift to align the last occurrence of in with . Otherwise, we can shift to align with .

Good suffix jumps: If we have already matched a suffix of , then get a mismatch, we can shift forward to align with the previous occurrence of that suffix. Similar to failure array in KMP.

When a mismatch occurs, Boyer-Moore chooses whichever of bad character or good suffix shifts the pattern further to the right. Can skip large parts of .

whereiswaldo
o
o
aldo

Last-occurrence function: Preprocess the pattern and the alphabet . Build the last-occurrence function mapping to integers. is defined as the largest index such that or if no such index exists.

Example

Example: .

cabcd
453-1

The last-occurrence function can be computed in time . In practice, is stored in a size- array.

Suffix skip array: Preprocess to build a table. Suffix skip array of size : For , is the largest index such that and . Any negative indices are considered to make the given condition true.

Similar to the KMP failure array, with an extra condition.

Example

01234567
bonobobo
-6-5-4-32-126

Computed similarly to KMP failure array in time.

boyer-moore(T,P)
L <- last occurrance array computed from P
S <- suffix skip array computed from P
i <- m-1, j <- m-1
while i < n and j >= 0 do
	if T[i] = P[j] then
		i <- i-1
		j <- j-1
	else
		i <- i + m - 1 - min(L[T[i]], S[j])
		j <- m - 1
if j = -1 return i + 1
else return FAIL

Worst-case running-time . Faster than KMP in practice on English text.

Question

What if we want to search for many patterns within the same fixed text ?

Idea: Preprocess the text rather than the pattern .

Observation: is a substring of if and only if is a prefix of some suffix of . So we want to store all suffixes of in a trie.

To save space use a compressed trie, store suffixes implicitly via indices into . This is called a suffix tree.

Text has characters and suffixes. We can build the suffix tree by inserting each suffix of into a compressed trie. This takes time . There is a way to build a suffix tree of in time.

Suffix trees: String matching

Assume we have a suffix tree of text .

To search for pattern does not have the final $. is the prefix of some suffix of . In the uncompressed trie, searching for would be easy: exists in if and only if the search for until one of the following occurs:

  1. If search fails due to “no such child” then is not in .
  2. If we reach end of , say at node , then jump to leaf in subtree of .
  3. Else we reach a leaf while characters of left.

Either way, left index at gives the shift that we should check. This takes time.

Brute-ForceKMPBoyer-MooreSuffix trees
Preprocessing:-
Search time:
Extra space:-

Module 10: Data Compression

Question

The problem: How to store and transmit data efficiently?

Source text: The original data, string of characters from the source alphabet .

Coded text: The encoded data, string of characters from the code alphabet .

Encoding: An algorithm mapping source texts to coded texts.

Decoding: An algorithm mapping coded texts back to their original source text.

Source “text” can be any sort of data. The code alphabet is usually . We consider here only lossless compression: Can recover from without error.

Main objective: For data compression, we want to minimize the size of the coded text. We will measure the compression ratio:

We consider the efficiency of the encoding/decoding algorithms. We always need time , but sometimes need more.

Observation: No lossless encoding scheme can have compression ratio for all input strings.

Definition

A character encoding maps each character in the source alphabet to a string in code alphabet

ASCII is a fixed-length code. Each codeword has the same length. Encoding/decoding is easy, just concatenate/decode the next 7 bits.

Better idea: Variable-length codes.

Motivation: Some letters in occur more often than others. For example, consider the frequency of letters in typical English text. We can use shorter codes for more frequent characters.

So as before, map source alphabet to codewords . Not all codewords have the same length, this ought to make the code text shorter. Morse code is a variable-length code.

Assume that we have some character encoding . Note that is a dictionary with keys in .

singleChar::encoding(E,S,C)
while S is non-empty:
	w <- E.search(S.pop())
	append each bit of w to C

The decoding algorithm must map to . The code must be lossless (i.e. uniquely decodable). This is false for Morse code.

From now on only consider prefix-free codes : no codeword is a prefix of another.

This corresponds to a trie with characters of only at the leaves.

Any prefix-free code is uniquely decodable.

prefixFree::decoding(C,S,T)
while C is non-empty:
	z <- T.root
	while z is not a leaf:
		if C is empty or z has no child labelled C.top()
			return "invalid encoding"
		z <- child of z that is labelled with C.pop()
	S.append(character stored at z)

Runtime is .

prefixFree::encoding(S,C,T)
E <- array of trie-nodes indexed by \Sigma_S
for all leaves l in T do E[character at l] <- l
while S is non-empty
	w <- empty bitstring
	v <- E[S.pop()]
	while v is not the root
		w.prepend(character from v to its parent)
	append each bit of w to C

Runtime is . We assume that all interior nodes of have two children, otherwise encoding scheme can be improved. Therefore and run-time is .

Huffman’s Algorithm

Question

How do we determine the best trie for a given source text ?

Idea: Frequent characters should have short codewords. Infrequent characters should be ‘far down’ the trie.

Greedy Algorithm: Always pair up most infrequent characters.

  1. We store a set of encoding-tries. Initially each adds “C” (height-0 trie holding ).
  2. Our tries have weight: Sum of frequencies of all letters in trie.
  3. Find the two tries with the minimum weight and merge them.
  4. Repeat Step 3 until there is only one trie left.

We can store these tries via a min-ordered heap.

Huffman::encoding(S,C)
S: Input-stream with characters in alphabet
C: Output-stream
f: array indexed by S alpahabet, initially all-0
while S is non-empty do increase f[S.pop()] by 1
Q <- min-oriented priority queue that stores tries
for all c in S alphabet with f[c] > 0 do Q.insert([c], f[c])
while Q.size() > 1 do
	(T_1, f_1) <- Q.delete-min()
	(T_2, f_2) <- Q.delete-min()
	Q.insert(trie with T_1, T_2 as subtries, f_1 + f_2)
T <- Q.delete-min()
C.append(encoding trie T)
Re-set input-stream S
prefixFree::encoding(S,C,T)

The constructed trie is optimal in the sense that is shortest. The constructed trie is not unique. Decoder does not know the trie (either send the decoding trie along, or send character-frequencies and tie-breaking rules). Encoding must pass through text twice. Cannot use a stream unless it can be re-set.

Run-time:

  • Encoding:
  • Decoding:

Run-Length Encoding

Example of multi-character encoding: multiple source-text characters receive one code-word.

Requires: Input must be a bitstring . Decoding dictionary is uniquely defined and not explicitly stored.

Good to use when has long runs.

Send the leftmost bit of . Then send a sequence of integers indicating run lengths. We don’t have to give the bit for runs since they alternate.

Various ideas for encoding one integer :

  • Base-2 encoding: the string such that =
  • Bijective base-2 encoding: with the leading 1 removed
12345678
110111001011101111000
0100011011000001
ABAAABBABBAAAAAB

For a sequence of integers, we need a separator-symbol

Question

Can we develop codes that do not need a separator-symbol?

Elias gamma code (for an integer ). To encode , take binary representation of . Add leading zeroes so that initial 1 is in the middle. This means adding leading zeroes.

12345678
110111001011101111000
01122223
1010011001000010100110001110001000

Definition

To do run-length encoding: Write initial bit to output, determine the run-lengths , write .

RLE::encoding(S,C)
S:input-stream of bits, C:output-stream
b <- S.top(); C.append(b)
while S is non-empty do
	k <- 1
	while (S is non-empty and S.top() = b) do k++; S.pop()
	w <- empty string
	while k > 1
		C.append(0)
		w.prepend(k mod 2)
		k <- k/2
	w.prepend(1)
	append each bit of w to C
	b <- 1 - b

Claim: A sequence of Elias gamma codes can be decoded uniquely. Each Elias gamma code has form . Can determine by scanning until we encounter a 1. Convert this 1 plus the next bits into the integer.

RLE::decoding(C,S)
C:input-stream of bits, S: output-stream
b <- C.pop()
while C is non-empty
	l <- 0
	while C.pop() = 0 do l++
	k <- 1
	for (j <- 1 to l) do k <- k * 2 + C.pop()
	for (j <- 1 to k) do S.append(b)
	b <- 1 - b

If C.pop() is called when there are no bits left, then was not valid input.

An all-0 string of length would be compressed to , which has bits.

Huffman and RLE take advantage of frequent/repeated single characters.

Observation: Certain substrings are much more frequent than others.

Ingredient 1 for Lempel-Ziv-Welch compression: take advantage of such substrings without needing to know beforehand what they are.

ASCII, UTF-8, and RLE use fixed dictionaries.

In Huffman, the dictionary is not fixed, but it is static: the dictionary is the same for the entire encoding/decoding.

Ingredient 2 for LZW: adaptive encoding:

  • There is a fixed initial dictionary .
  • For , is used to determine the th character to output,
  • After writing the th character to output, encoder updates to

Challenge: Decoder must know how encoder changed the dictionary.

Convert into a list of code-numbers. Start with a dictionary that stores ASCII. Every step adds to dictionary a multi-character string, using code-numbers 128, 129,

Encoding alternates two steps:

  • Find longest string that is already in . So all of can be encoded with one number.
  • Add the substring that would have been useful to dictionary: add where is the character that follows in .

Need to match characters use a trie. To find : parse in trie until we reach node with ‘no such child’ for next character . To add : add child of with link labelled .

LZW::encoding(S,C)
idx <- 128
while S is non-empty do
	z <- root of trie D
	while (S is non-empty and z has a child c labelled S.top())
		z <- c; S.pop()
	C.append(code-number stored at z)
	if S is non-empty
		create child of z labelled S.top() with code-number idx
		idx++

Run-time: , assuming we can look up child in time.

LZW::decoding(C,S)
D ← dictionary that maps {0, . . . , 127} to ASCII
idx ← 128
k ← C.pop(); s ← D.search(k); S.append(s)
while there are more codes in C do
	sprev ← s; k ← C.pop()
	if k < idx do s ← D.search(k)
	else if k = idx do s ← sprev sprev [0] // special situation
	else return FAIL
	append each character of s to S
	D.insert(idx,sprev s[0])
	idx++

Dictionary maps consecutive integers to words. Use an array.

Decoding run-time is . each round to append to output.

Dictionary wastes space, words may get long. To save space, store strings as code of prefix one character.

Encoding and decoding take . Encoding and decoding need to go through the string only once can do compression while streaming the text.

bzip2

Combine multiple compression schemes. Key ingredient is to use text transforms. Change input into a different text that compresses better.

is a permutation. If has repeated substrings, then has long runs of characters. uses ASCII-numbers. If has long runs of characters, then has long runs of 0 and skewed frequencies. If has long runs of zeroes, then is shorter. Skewed frequencies remain. Compresses well since frequencies are skewed.

Move-to-front transform: Dictionary ASCII is unordered array with MTF. Character gets mapped to index with . A character in repeats times has run zeroes. We would expect lots of small numbers in the output.

Modified run-length encoding: Input is a list of ‘characters’ in . Encode only the runs of s. Encode run-length with bijective base-2 encoding, using two new characters .

Huffman encoding is the same as before.

Burrows-Wheeler Transform

Given: Source text as an array with end-sentinel.

Step 1: Write down cyclic shifts. th cyclic shift - move characters from front to back. We treat end-sentinel $ like any other character.

Observe: Every column contains a permutation of .

Step 2: Sort lexicographically. Use MSD-radix-sort. strings of length worst-case time. But usually much faster.

Observation: Every column continues to contain a permutation of .

Step 3: Extract rightmost column.

The Burrows-Wheeler transform consists of the last characters of the cyclic shifts of after sorting them lexicographically.

Same sorting permutation for cyclic shifts and suffixes. That’s the suffix array . Can compute this in time. Read BWT encoding from it: C[i] = \begin{cases} S[A^s[i] -1] & \text{if } A^s[i] > 0 \\ \ & \text{otherwise} \end{cases}$.

Given string obtained from BWT encoding. We can reconstruct the first and last column of the matrix of cyclic shifts.

BWT::encoding(C,S)
for all indices i of C
	A[i] <- (C[i], i)
for all indices j of C
	if C[j] = $ break
do
	S.append(character stored in A[j])
	j <- index stored in A[j]
while appended character is not $

BWT encoding cost is . BWT decoding cost is . Encoding and decoding both use space.

They need all of the text. BWT is a block compression method that compresses one block at a time.

bzip2 encoding cost is with a big constant. bzip2 decoding cost is with a big constant.

Module 11: External Memory

Recall the RAM model of a computer. Any access to a memory location takes the same constant time. This is not realistic.

Typical current computer architecture includes registers, cache L1, L2, main memory, disk or cloud.

Question

How to adapt our algorithms to take the memory hierarchy into account, avoiding transfers as much as possible?

Define a new computer model that models one such ‘gap’ across which we must transfer.

Assumption: During a transfer, we automatically load a whole block. This is realistic.

New objective: Revisit all algorithms with the objective of minimizing block transfers.

Stream based algorithms (with resets) use block transfers.

Recall: Dictionaries store KVPs and support search, insert, and delete. AVL-trees were optimal in time and space in RAM model. run time block transfers per operation. Inserts happen at varying locations of the tree. Nearby nodes are unlikely to be on the same block, typically transfers per operation.

We would like to have fewer block transfers. The goal is transfers.

Design a tree-structure that guarantees that many nodes on search-paths are within one block.

Idea: Store complete subtrees with levels in one block of memory. Each block/subtree then covers height . Search-path hits blocks block-transfers. Since , we have .

View the entire content of a block as one node.

Define multiway-tree: A node can store multiple keys.

Definition

A -node stores keys, has subtrees, and stored keys are between the keys in the subtrees.

We always have one more subtree than keys.

To allow insert/delete, we permit a varying numbers of keys in nodes. We also rigidly restrict where empty subtrees may be. This gives much smaller height than for AVL-trees fewer block transfers.

Definition

An --tree (for some and ) satisfies

  1. Every non-root is a -node for some .
  2. Between and subtrees, between and keys.
  3. The root is a -node for .
  4. Between 2 and subtrees, between and keys.
  5. All empty subtrees are at the same level.

For --trees, every node has between and keys.

Typically, we specify the order and then set . With small height we can store many keys.

Theorem

An --tree with keys has height.

How many keys must an --tree of height have?

Therefore .

Search is similar to BST: Compare search-key to keys at node. If not found, continue in appropriate subtree until empty.

abTree::search(k)
z <- root, p <- NULL
while z is not NULL
	let < T_0, k_1, ..., k_d, T_d > be key-subtree list at z
	if k >= k_1 
		i <- maximal index such that k_i <= k
		if k_i = k then return KVP at k_i
		else p <- z, z <- root of T_i
	else p <- z, z <- root of T_0
return "not found, would be in p"

visited nodes: . Finding is not constant time.

For insert, do abTree::search and add key and empty subtree at leaf. If the leaf had room then we are done. Else overflow: More keys/subtrees than permitted. Resolve overflow by node splitting.

abTree::insert(k)
z <- abTree::search(k)
Add k and an empty subtree in key-subtree-list of z
while z has b keys (overflow -> node split)
	Let < L_0, k_1, ..., k_b, T_b > be a key-subtree list at v
	if (z has no parent) create a parent of z without KVPs
	move upper median k_m of keys to parent p of z
	z' <- new node with < T_0, k_1, ..., k_m-1, T_m-1 > 
	z'' <- new node with < T_m, k_m+1, ..., k_b, T_b > 
	Replace <z> by <z', k_m, z''> in key-subtree-list of p
	z <- p

An - tree has height . If , then this height-bound is tight. Level contains at most nodes. Each node contains at most KVPs. So and .

search and insert visit nodes. delete can also be implemented with node-visits. But usually use lazy deletion, space is cheap in external memory.

Definition

A red-black tree is a binary search tree such that:

  • Every node has a colour (red or black),
  • Every red node has a black parent (in particular the root is black),
  • Any empty subtree has the same black-depth (number of black nodes on path from root to ).

Rather than proving properties or describing operations directly, we convert back to --trees.

Lemma: Any red-black tree can be converted into a --tree .

Black node with red children becomes a -node. This covers all nodes. Empty subtrees on same level due to the same black-depth.

insert can be done in worst-case time. delete can also be done in time.

A -tree is an --tree tailored to the external memory model. Every node is one block of memory (of size ). The order is chosen maximally such that -node fits into a block of memory. Typically . is set to as before.

search, insert, and delete each requires visiting nodes. Work within a node is done in internal memory no block-transfer. The height is since .

So all operations require block transfers. This is asymptotically optimal.

Hash functions don’t adapt well to external memory. We must occasionally re-hash to keep load factor small. And re-hashing must load all blocks. This is unacceptably slow.

Goal: Data structure for hash-values that typically uses block transfers, and never needs to load all blocks.

Idea: Keys hash-values integers fixed-length bit-strings. Store trie of bit-strings whose leaves are blocks of memory.

Assumption: We store fixed-length bit-strings.

Build trie (the directory) of bit-strings in internal memory. Stop splitting in when remaining items fit in one block. Each leaf of refers to a block of external memory. The blocks store KVPs in no particular order.

search(k): Search for in until we reach leaf . Load block at . Search for in block. 1 block transfer.

delete(k): search(k) loads block, delete from block, transfer updated block back. 2 block transfers.

insert(k): Search for in until we reach leaf . Load block at . If is at capacity, leaf gets two new children, create two new blocks, split items in by next bit. Insert into appropriate block. Transfer updated block back. Typically 2-3 block transfers.

If all items in have the same next bit, then split repeatedly. For big this is extremely unlikely.

Hashing collisions mean duplicate bit-strings, so all colliding items are in the same block. We do not care how collisions are resolved within the block.

If more than items have the same hash-value if all bit-strings in block are the same, we cannot split. This means either the load factor is too big or the hash-function is bad. We extend the hash function.