Thursday, December 17, 2015

A numeric sequence finder based on genetic algorithms

I've been interested in numeric sequences since university; I've never been too good with them, so I've decided to develop a toy project for finding the sequence definition given a set of numbers; since I've finished university from more than 15 years, I admit it's a bit too late. :-)
For this project, I've used a Genetic Algorithm approach. The main idea is to treat a random expression as a phenotype. So, we have a set of expression created randomly and the simulation compute how much they "fit" to the searched sequence and let them evolve; the fitness is defined as the difference between the searched values and the values evaluated by the random expression.
A brief example will explain it better: if we want our genetic algorithm to find the numeric sequence of the squares of the integer, we can give to the algorithm the squares as the input (note that I wrote a set of five numbers, but it works as well with more or a bit less values):
1, 4, 9, 16, 25
which are the values of the squares of the first five integers. Now, for every expression randomly created, the expression is evaluated for the first five integers; if the random expression we want to evaluate is:
`pi^sin(e*n)/e`

the result of the expression is computed for the values of `n` as the first integers and compare them to the expected results (the squares of the first integers):
 `n`  `n^2` `pi^sin(e*n)/e`delta
110.5887424955090.411257504491
240.1560691847493.84393081525
391.097801550117.90219844989
4160.11810762599615.881892374
5250.97867053056724.0213294694

The delta column is the difference between the expression evaluation and the expected result; as we can see the sample expression is not very near to the values of the searched sequence. But this difference is measurable, and this is the reason why this delta was chosen as the value of the fitness function for the finder.
Every expression is formed by elements which are in the recursive form of element, followed by an operator and finally another element, all surrounded by parenthesis:
`(2 + n)`
An element can be a number, a math constant, a trigonometric function or - as we've seen - another element. A more complex element looks like this:
`(2 + (pi * sin(n))`
As in standard GA, the population of phenotypes (the expressions) evolve during time using selection, crossover and mutation. Let's see them in detail:

Selection
On a population of 300 expressions, only the best 75 elements (the ones that have the best values of fitness) of the round remain in the arena; all the other are newly created (with random values).

Crossover
The crossover operation takes a random element of an expression and changes it with a random element of another expression. So these two expression:
`(pi*(cos(e*pi)+n))`
`((n + (pi * (tan(4*pi))))+10)`

can become:

`(pi*(cos(pi * (tan(4*pi)))+n))`
`((n + (e*pi))+10)`
where the two inner elements were swapped between the expressions.

Mutation
For the mutation we take a random element of the expression and we change it with a newly created random one.

Conclusions
The web version of the sequence finder is running at http://pythonplayground-andreaiacono.rhcloud.com/finder. You can run the default sequence of the numbers multiplied by two, and you'll see that most of the times, the finder will give the correct solution; when it can't find anything correct in 30 secs, it returns the best expression (in terms of fitness) it has found till then.

The sources are available on github: https://github.com/andreaiacono/python-playground/tree/master/sequence_finder.