From the Myhill-Nerode Theorem(sai16vicky.wordpress.com/2014/09/24/myhill-nerode-theorem-a-beautiful-alternative-to-pumping-lemma), its can be concluded that Regular Languages have finite pairwise distinctive(PD) sets. This leads to a very interesting problem – Given a Regular Language in the form of a Regular Expression, is it possible to come up with bounds on the number of PD sets ?
Let’s say we are given a Regular Expression R of length K. We convert R to its lowest terms (R’). K now becomes length(R’).
If R = (0* + 1*).(2* + 3*), we expand it to make R’= 0*2* + 0*3* + 1*2* + 1*3*. K = length(R’) = 8.
The immediate idea would be to visualize R’ as a string S with length K(considering only the alphabets allowed in the language). If we generate all subsets of this string and match every pair of them, then we will be getting some PD sets(say Val). Can we conclude that Val is the total number of PD sets? It looks tempting to say so, but we need a solid proof.
Whenever there is something to be proved and we don’t know how to go about it, Induction comes in handy :).
For a string of length 1, we get a 2*2 matrix. The number of PD sets can at most be 1 and hence, this matrix is sufficient to conclude.
For a string of length k, let’s assume that there exists a (2**k)*(2**k) matrix for which this property is satisfied.
For a string of length k + 1,
There already exists a string of length k which satisfies this property. We are adding an additional element c’ now. This element is added to the string, and has to be added to every row and column of the subset matrix. Hence, we need an additional row and additional column into the matrix which makes it a ( 2**k’)*(2**k’) matrix, where k’ = k + 1.
We ended up proving our intuition correctly. Once we fix the dimensions of the matrix, we can also conclude that every PD set will be of length at most 2*K, where K is the length of the Regex in its lowest terms.
Moving forward –
Though we got a solution by reducing the Regex to its lowest terms, it is not very efficient. Is it possible to solve this problem without this reduction? Any Suggestions ?
The bounds we found are invalid if the Regex is not reduced to its lowest terms. Example – (0* + 1*).(2* + 3*).(4* + 5*). Here if K = 6, a 6*6 matrix does not cover all the subsets of this string.