Friday, January 10, 2014

Fun with Evolution

Darwin's theory of evolution suggests that random variation can lead to non-random change in species, where individual organisms that are marginally better suited to their environment have a better chance of surviving and passing their winning genes on to the next generation. From this extremely simple idea, repeated countless times, we get the massive diversity and complexity of life.

While it's pretty cool and poetic and stuff, what's particularly fascinating is how it's been adopted into computing science. An entire branch of problem-solving algorithms exist that attempt to recreate evolution, an when reproduced on a large scale can actually be effective at solving incredibly complex problems.

Evolutionary algorithms are particularly well suited to problems where the connections between variables aren't well understood, and computing shortcuts can't be taken, but where programmers know in general what they're looking for. They are also prone to several issues: they're slow, and they tend to latch onto solutions that are pretty good, but not the best (also known as local maximums).

In the simplest terms, computer scientists will create a "gene pool" full of randomly-determined organisms, with the random genes corresponding to variables to be optimized. If the problem is well-defined, each organism can be evaluated and ranked based on how good of a solution they are, and then (similar to real life), the fittest will get together and have baby organisms, with genes from both parents. The process can be repeated as long as desired until a suitable solution is found.

Like real evolution, the non-random selection of the fittest, with random variation in their genes and starting condition, is expected to eventually produce organisms that are well-suited to their environment, but in this case being well-suited means they are an optimized solution to a problem.

In the real world, designers have used evolutionary algorithms to optimize extremely complex designs like car engines or wind turbine blades, but for fun I decided to do a fairly simple test on Excel as a proof of concept to myself. Instead of doing anything useful or fancy, I decided to try to get Excel to draw for me. Specifically, I wanted it to say "HI" to me (hey, I yell at it enough, figured it deserved a chance to yell back).

In order to do this, I politely asked Excel to create 100 random organisms, with each 'organism' defined as a 24-'gene' series of random values. This resulted in 100 random assortments of 6 rectangles. These 'organisms' could look something like this:


And I wanted to give them the goal of overlapping to trace this:


In order to do that, each of my original 100 'organisms' were given a fitness value based on whether they covered the target areas, but lost points based on how much they covered non-target areas. Then they all got a chance to breed like rabbits, but the fittest ones (with the highest scores) had a better chance of breeding than the others.

The actual breeding process is indistinguishable from zookeepers breeding uncooperative endangered animals - I stuck two randomly (but not equitably) chosen organisms in a room and made them watch videos of other bits of data 'doing it' until something popped out. In Excel terms, though, each of the 24 genes were examined in turn, and each child gene had an even chance of coming from either parent. In order to freshen up the gene pool, each gene also had a 5% chance of mutating.

Breeding continued until I got 100 new little babies, at which point I killed off the old stock and started over. In each generation, the fittest have a disproportionate chance of passing on their genes to the next generation, with the idea being that hopefully each generation is stronger than the previous until the goal is met.

These are the results of the first test, after 1,000 generations:


To use technical terminology, this is extremely unimpressive. Two of the rectangles form the I on the right, the lower right leg of the H is there, but a bunch of dead space in the H is taken up on the top, and one rectangle has wandered off completely (there were supposed to be 6...). Don't even get me started on the green rectangle, I think he's shy.

What was encouraging about this, though, was that the average scores for the population did tend to grow each generation, though they ended up plateau-ing after about 500 generations:


This is what I meant before by 'local maximum' solutions. In order for the score to improve, the purple rectangle would have to change by an amount that's more than what I've allowed mutations alone to cover. Also, during the required changes, it's likely that the scores would decrease (as the purple rectangle abandons those upper H legs), which is resisted from an evolutionary point of view. This actually parallels evolution quite well in that once animals are adapted to a situation they stop changing rapidly (even though they may not be perfectly optimized), until external pressures force them to need to adapt again.

The best way to improve solutions from evolutionary algorithms is to increase the sample size (get more genes in there!) and number of generations, but since those both involved forcing my computer to make funny noises, I decided instead to try a brand new set of 100 random organisms. After 1000 generations, I got:


That's... sort of better, actually. At least all six rectangles made it onto the screen, and all three vertical lines are definitely there. Again, though, in order to get this one to perfectly spell the word, the little purple rectangle would have required a tremendously lucky mutation, which was discouraging enough that instead I decided to try one last new group of 100. Here's their final result:


Actually, that's not bad. I have no idea what the little purple guy (why's it always purple?) is doing, but he's pretty much out of the way, and all the major parts of the word "HI" are definitely covered. Not bad for random number on Excel, eh?

In case you really wanted to play around with this spreadsheet, I've put a version of it here. It's already been seeded with random numbers, but if you open it as a macro-enabled spreadsheet, every time you hit "Crtl+q", a new generation will form (Crtl+w for 10 new generations, Crtl+e for 100, but that'll take a bit of time...). Enjoy!

No comments: