The complexity of add/find (collision detection) depends on how union is implemented.
If you are using a hashtable-based data structure, then your collision check will indeed be constant time, assuming a good hash function.
Otherwise, add will probably be O(log N) for a sorted list/tree data structure.
(Answer from Akusete on Stack Overflow)
First answer: If you are dealing with sets of numbers, you could implement a set as a sorted vector of distinct elements. Then you could implement union(S1, S2) simply as a merge operation (checking for duplicates), which takes O(n) time, where n = sum of cardinalities.
Now, my first answer is a bit naive. And Akusete is right: You can, and you should, implement a set as a hash table (a set should be a generic container, and not all objects can be sorted!). Then, both search and insertion are O(1) and, as you guessed, the union takes O(n) time.
(Looking at your Python code) Python sets are implemented with hash tables. Read through this interesting thread. See also this implementation which uses sorted vectors instead.
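The merge-based union of sorted vectors from the first answer can be sketched as follows (`union_sorted` is a name chosen for illustration, not from the original code):

```python
def union_sorted(s1, s2):
    """Union of two sets represented as sorted lists of distinct
    elements, done as a single merge pass: O(n1 + n2) time."""
    out = []
    i = j = 0
    while i < len(s1) and j < len(s2):
        if s1[i] < s2[j]:
            out.append(s1[i])
            i += 1
        elif s1[i] > s2[j]:
            out.append(s2[j])
            j += 1
        else:  # element present in both sets: keep one copy
            out.append(s1[i])
            i += 1
            j += 1
    out.extend(s1[i:])  # at most one of these two tails is non-empty
    out.extend(s2[j:])
    return out
```

Because both inputs are sorted, one linear pass suffices; this is the same merge step used in merge sort, with an extra branch to drop duplicates.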
The answers below come from related questions:
- algorithm - Why is the time complexity of performing n union find (union by size) operations O(n log n)? - Stack Overflow
- data structures - Why time complexity of union-find is $O(\lg N)$ with only "Union by Rank"? - Computer Science Stack Exchange
- time complexity - Union of multiple overlapping sets efficiently? - Computer Science Stack Exchange
- c++ - What is the time complexity of this code Union Of two Arrays using set_union? - Stack Overflow
Let's assume for the moment that each tree of height h contains at least 2^h nodes. What happens if you join two such trees?
If they are of different heights, the height of the combined tree is the height of the taller one, so the new tree still has at least 2^h nodes (same height, but more nodes).
Now if they are the same height, the resulting tree will increase its height by one, and will contain at least 2^h + 2^h = 2^(h+1) nodes. So the condition will still hold.
The most basic trees (1 node, height 0) also fulfill the condition. It follows, that all trees that can be constructed by joining two trees together fulfill it as well.
Now the height is just the maximal number of steps to follow during a find. If a tree has n nodes and height h, then n >= 2^h, which immediately gives h <= log2(n), so a find takes at most log2(n) steps.
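The structure analyzed above can be sketched as a minimal union-by-size implementation (class and method names are illustrative, not from the original question); attaching the smaller tree under the larger one is exactly what keeps the height at most log2(n):

```python
class DisjointSet:
    """Union by size, no path compression: a tree of height h always
    has at least 2**h nodes, so find walks at most log2(n) steps."""

    def __init__(self, n):
        self.parent = list(range(n))  # each node starts as its own root
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:  # walk up at most log2(n) steps
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:  # attach smaller under larger
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
```

Note that the height only grows when two trees of equal height are joined, matching the 2^h + 2^h = 2^(h+1) step of the argument above.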
You can do n union-find operations (union by rank or size) with complexity O(n lg* n), where lg* n is the iterated logarithm, using the path compression optimization. (The tight bound is O(n α(n)), where α(n) is the inverse Ackermann function, which grows even more slowly than lg* n.)
Note that O(n lg* n) is better than O(n log n)
In the question Why is the Ackermann function related to the amortized complexity of union-find algorithm used for disjoint sets? you can find details about this relation.
Without path compression: when we use the linked-list representation of disjoint sets and the weighted-union heuristic, a sequence of $m$ MAKE-SET, UNION, and FIND-SET operations, $n$ of which are MAKE-SET operations, takes $O(m + n\log n)$ time.
With only path compression: the running time is $\Theta\big(n + f\cdot(1 + \log_{2+f/n} n)\big)$, where $f$ is the number of FIND-SET operations and $n$ the number of MAKE-SET operations.
With both union by rank and path compression: $O(m\,\alpha(n))$, where the inverse Ackermann function $\alpha(n)$ is at most 4 for any practical input size.
The time complexity of both union and find would be linear if you use neither ranks nor path compression, because in the worst case it would be necessary to traverse the entire tree on every query.
If you use only union by rank, without path compression, the complexity is logarithmic.
The detailed proof is quite involved, but the key idea is that you never traverse the entire tree, because the depth of the tree only increases when the ranks of the two sets are equal. So each query is O(log n).
If you also use the path compression optimization, the complexity is even lower, because it "flattens" the tree, reducing later traversals. Its amortized time per operation is even faster than O(log n), as you can read here.
Does path compression eventually make operations O(1)? And if a find starts at O(log n), is that the overall time complexity? If we have to walk over every edge on the find path, shouldn't that be accounted for?
Seidel and Sharir proved in 2005 [1] that path compression with arbitrary linking on $m$ operations has a complexity of roughly $O((m+n)\log(n))$.
See [1], Section 3 (Arbitrary Linking): Let $f(m,n)$ denote the runtime of union-find with $m$ operations and $n$ elements. They proved the following:
Claim 3.1. For any integer $k>1$ we have $f(m, n)\leq (m+(k−1)n)\lceil \log_k(n) \rceil$.
According to [1], setting $k = \lceil m/n \rceil + 1$ gives $$f(m, n)\leq (2m+n) \log_{\lceil m/n\rceil +1}n.$$
A similar bound was given using a more complex method by Tarjan and van Leeuwen in [2], Section 3:
Lemma 7 of [2]. Suppose $m \geq n$. In any sequence of set operations implemented using any form of compaction and naive linking, the total number of nodes on find paths is at most $(4m + n) \lceil \log_{\lfloor 1 + m/n \rfloor}n \rceil$. With halving and naive linking, the total number of nodes on find paths is at most $(8m+2n)\lceil \log_{\lfloor 1 + m/n \rfloor} (n) \rceil$.
Lemma 9 of [2]. Suppose $m < n$. In any sequence of set operations implemented using compression and naive linking, the total number of nodes on find paths is at most $ n + 2m \lceil \log n\rceil + m$.
[1]: R. Seidel and M. Sharir. Top-Down Analysis of Path Compression. Siam J. Computing, 2005, Vol. 34, No. 3, pp. 515-525.
[2]: R. Tarjan and J. van Leeuwen. Worst-case Analysis of Set Union Algorithms. J. ACM, Vol. 31, No. 2, April 1984, pp. 245-281.
I don't know what the amortized running time is, but I can cite one possible reason why in some situations you might want to use both rather than just path compression: the worst-case running time per operation is $\Theta(n)$ if you use just path compression, which is much larger than if you use both union by rank and path compression.
Consider a sequence of $n$ Union operations maliciously chosen to yield a tree of depth $n-1$ (it is just a sequential path of nodes, where each node is the child of the previous node). Then performing a single Find operation on the deepest node takes $\Theta(n)$ time. Thus, the worst-case running time per operation is $\Theta(n)$.
In contrast, with the union-by-rank optimization, the worst-case running time per operation is $O(\log n)$: no single operation can ever take longer than $O(\log n)$. For many applications this won't matter: only the total running time of all operations (i.e., the amortized running time) matters, not the worst-case time for a single operation. However, in some cases the worst-case time per operation does matter. For instance, an interactive application may need a guarantee that no single operation can cause the application to freeze for a long time, and a real-time application may need to always meet its real-time deadlines; reducing the worst-case time per operation to $O(\log n)$ is useful in both.
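The adversarial sequence described above can be reproduced in a few lines (helper names are illustrative): with naive linking and no compression, n-1 unions build a path of depth n-1, so a single find costs Θ(n).

```python
def root(parent, x):
    """Return (root of x, number of parent links walked)."""
    steps = 0
    while parent[x] != x:
        x = parent[x]
        steps += 1
    return x, steps

n = 1024
parent = list(range(n))
# Adversarial naive linking: repeatedly hang the growing chain's root
# under each new singleton, yielding a path of depth n-1.
for i in range(1, n):
    r, _ = root(parent, 0)
    parent[r] = i
_, cost = root(parent, 0)
print(cost)  # 1023: a single find on the deepest node walks n-1 links
```

With union by rank, no such sequence exists: the tree depth is bounded by $O(\log n)$ regardless of the union order.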