Head to Head testing

Dilbert cartoon on agile programming

Wally’s comment is insightful. Giving a thing a name makes it possible discuss that thing. Witness the enormous discussion of Test Driven Development on the internet.

The purpose of this post is to give a name to a testing practice I have found useful, and to explain why it is useful. I call the practice Head to Head testing, or HTH. I propose HTH as a solution to a problem that arises in test driven development, and will introduce HTH by first explaining the problem. My explanation will use functional programming notation, but the underlying issues are the same in object oriented or any other form of programming. This is because the issues are mathematical in nature.

Suppose my task is to write a function to compute logarithms base 10. TDD demands that I begin by constructing some tests. OK; here they are.

(assert (= (log 1) 0))
(assert (= (log 10) 1))
(assert (= (log 100) 2))
(assert (= (log 1000) 3))

What do these tests accomplish? A function that fails these tests is certainly an inadequate log function, but passing this suite proves little. Infinitely many functions satisfy these tests. Let's step back and consider the logic of testing.

Suppose we are testing the implementation of a function F defined on a finite domain D. An ideal test suite would have one test case for each element of D, and would exhaustively check whether the implementation of F was correct at each point in D. An implementation that passed this ideal test suite would be demonstrably correct.

The problem is that exhaustive testing is impossible in practice. Many functions have infinite domains, or domains so large that exhaustive testing is not physically feasible. Since we cannot test our implementation at each point in its domain, we are forced to settle for testing on a subset the domain. But which subset? I contend that we have now entered the realm of statistical sampling.

My test suite for the log function proposed to test at the points {1, 10, 100, 1000}. Looked at through the lens of statistical sampling, this test suite suffers from two defects. The lesser defect is that we have no idea whether a test of size N=4 is sufficient. The fatal defect is that the points chosen for testing were not chosen randomly. This makes the test suite statistically worthless.

Imagine by way of analogy that you engaged me to conduct a survey. I offer to use my friends as the sample population. You ought to conclude either that I am pulling your leg or that I am incompetent statistician. But in the programming world, this approach to sampling is commonly accepted. Test suites are hand written by programmers who write whatever test cases they deem appropriate. This flies in the face of statistical practice. A statistically valid test requires randomly sampled data, not data hand-chosen by a human being.

It could be argued that as software testers we are not in search of statistically valid tests, but of errors. And that is true as far as it goes, but the search for errors demands randomized data for two reasons. The computer can generate untold millions of random test cases for each one a tester writes write by hand. But the deeper purpose of randomization is to avoid blind spots or bias in the selection of test data. If I design the tests, presumably I choose the data based on what I think might go wrong in the software. What about all the things that never occur to me to test? In the long run, randomized tests go everywhere. They have no long term biases or blind spots.

Suppose I decide to do the right thing by testing the log function on randomly sampled data. How could I do that? It's not hard. Here's how to conduct a single trial. Randomly pick two positive real numbers x and y. The corresponding test is this:

(assert (= (log (* x y)
           (+ (log x) (log y)))))

The test exploits the fact that log(x * y) must equal log(x) + log(y). It would be a simple matter automate this test by constructing a stream of randomly generated pairs (x y). Doing this would make it possible to test the log function at hundreds of millions of points rather the relative few a developer or tester could test using hand written tests.

That's all well and good, but my example is exceptionally well suited to randomized testing. In my example I could exploit a mathematical property of the log function to construct tests with known expected outcomes. It is not obvious how to do this while testing non-mathematical software. That is where Head to Head testing enters the picture.

The HTH approach is as follows. Suppose we are implementing a function F. We build not one but two independent implementations of F; call the implementations g and h. We also create a suitable of set randomized data D, and test at each point d in D whether g(d) = h(d). If g and h disagree on d, then at least one of them is in error. Notice that we do not have to trust that either g or h is correct, nor do we need to know correct value of F at d. If g and h disagree at d, at least one of them must be in error. Both must be investigated. This is a feature, not a bug. Examining the differences in the two implementations will lead us to the flaw much more quickly and easily than would a the study of a single implementation.

A nice side effect of this approach is that a failed HTH test does not result in a mere assertion of failure. We are handed a test case known to trigger the failure, a very useful thing for debugging. And it is rare to find a lone bug. Very likely running more tests will result in a growing collection data points that trigger failures.

Return for a moment to the question of sample size. How many test cases are appropriate? As software testers we have a tremendous advantage over statisticians. In general it is very difficult to determine how large a sample size N is required to create a statistical test of known power. But we are not doing mathematical statistics, we are testing software. Just make N huge. Better still, make N infinite. Run HTH tests continuously. Many advocate continuous builds; why not continuous testing? By continuous testing I don't mean a test suite that runs as part of the build, with successful completion acting a gate on code check-in. I mean tests that run 24/7 on dedicate hardware, never completing, just searching tirelessly for bugs. Machines are cheap and bugs are expensive.

Is it practical to build two independent implementations of a function F as part of testing? Sometimes it's not that hard to do. Other times you don't have to, because someone else has already written the other implementation for you. Imagine you are writing statistical function. Chances are you are not the first to write that function. Test your version HTH against a commercial product, or an open source package. Let your competitor help you test. Maybe your implementation of F is multi-threaded. Run F head to head against itself on different boxes, with different thread pool sizes. Run the latest version of F against a previous version for regression testing, but do it against randomly generated data instead of a few dozen hand written tests.

But is it really practical to build two independent implementations of a function F in support of testing? I think it is. Go look at the history of your code in version control. How many times have you built and rebuilt the thing? How many times was another round of changes triggered by a bug? You are going to build more than one version. You can do reactively in response to bugs, or you can do it in advance, by design. Your choice.

HTH is a conceptual approach. You can find ways to use it if you try. And its not an either-or thing. By all means, keeping writing test cases by hand. Just don't stop there.

These thoughts were set into motion by the recent Clojure bowling problem and the ensuing discussion of the role of TDD in functional programming.

Clojuratica announced

Life is sweet. Yesterday a friend came by. He had purchased a Macbook Pro, and wanted to me his Levovo T60 that he no longer needed. Today Garth Sheldon-Coulson announced Clojuratica, an interface between Clojure and Mathematica. This is the stuff that software dreams are made of.

Clojure bowling problem

ObjectMentor’s Uncle Bob posted about learning Clojure via a bowling challenge. The challenge is to write a program to compute bowling scores. I decided to give it a go.

I’m not a bowler, so my first step was to try to understand how bowling scores are computed. Once I did that, it struck me that the reason the bowling challenge is challenging is the peculiar way scores are recorded. Sometimes there are two numbers recorded per frame, sometimes only one. Also, a game consists of a variable number of frames, anywhere from ten to twelve. My approach to the problem centered on normalizing the data so to minimize special case processing.

Here’s the core of my approach.

(defn score [pin-counts]
  (reduce + (map score-tuple (tuples (normalize pin-counts)))))

The input data are pin-counts; the first step is to normalize the pin counts. Normalization consists of two things: following each strike, insert a zero. This insures that each frame contains exactly two pin counts. Next, pad the tail of the pin counts with zeroes as needed to insure that every game consists of exactly twelve frames.

(defn normalize
  "Insert 0's after strikes in pin counts, then pad with 0's to length of 24."
  ([pin-counts]
     (letfn [(pad [x] (take 24 (concat x (repeat 0))))]
       ( pad (normalize pin-counts []))))

  ([pin-counts normed-counts]
     (cond
       (empty? pin-counts) normed-counts
       (= 10 (first pin-counts))  (recur (next pin-counts)
                                         (conj normed-counts 10 0))
       :else (recur (nnext pin-counts)
                    (conj normed-counts
                          (first pin-counts)
                          (second pin-counts))))))


The rules of bowling are such that scoring frame N may depend on frames N+1 and N+2. The tuple function splits the normalized data into ten tuples of five numbers each. A tuple represents three frames.

(defn tuples [pin-counts]
  (partition 5 2 pin-counts))

The score-tuple function is mapped across the tuples, and the results summed. Done.

;; Score the frame corresponding to counts a and aa
(defn score-tuple [tuple]
  (let [ [ a aa b bb c ] tuple
         strike  10
         spare?  #(= 10 (+ %1 %2)) ]
    (cond
      (= strike a b)   (+ a b c)
      (= strike a)     (+ a b bb)
      (spare? a aa)    (+ a aa b)
      :else            (+ a aa))))

Over on reddit, Uncle Bob’s post spawned a donnybrook about TDD vs type checking. I’m not religious about TDD, tending to agree with Michael Feather’s take on it:

One very common theory about unit testing is that quality comes from removing the errors that your tests catch. Superficially, this makes sense. Tests can pass or fail and when they fail we learn that we have a problem and we correct it. If you subscribe to this theory, you expect to find fewer integration errors when you do integration testing and fewer “unit” errors when you do unit testing. It’s a nice theory, but it’s wrong. The best way to see this is to compare unit testing to another way of improving quality – one that has a very dramatic measurable effect.

Back in the 1980s, there was a movement to use something called Clean Room Software Development. The notion behind Clean Room was that you could increase quality by increasing the rigor of development. In Clean Room, you had to write a logical predicate for every little piece of your code and you had to demonstrate, during a review, that your code did no more or less than the predicate described. It was a very serious approach, and it was a bit more radical than what I just described: another tenet of Clean Room was that there was to be no unit testing. None. Zilch. When you wrote your code it was assumed correct after it was reviewed. The only testing that was done was stochastic testing at the functional level.

Amazingly, Clean Room worked. Clean Room teams demonstrated very high quality numbers. …

In the software industry, we’ve been chasing quality for years. The interesting thing is there are a number of things that work. Design by Contract works. Test Driven Development works. So do Clean Room, code inspections and the use of higher-level languages.

All of these techniques have been shown to increase quality. And, if we look closely we can see why: all of them force us to reflect on our code.

That’s the magic, and it’s why unit testing works also.

During my IBM days I took a two week class on Harlan Mill’s version of Clean Room. But I have yet to see CR practiced in the wild. On that point I have to give the nod to the TDD folks. But stipulating that both code and tests are in fact written, I can’t see that it much matters which is written first.

I do have one testing axe to grind. Humans are slow, and computers are fast. Why should I write a dozen or so tests when I can have the computer write several million randomized tests? Presumably because testing depends on knowing the expected outcomes of the tests, and it might be difficult to determine the expected outcome of randomly generated tests. Often that is not a problem.

Once I had my solution to the bowling problem ready, I wrote code to generate random sets of pin counts. I then downloaded a copy of Stuart Halloway’s solution to the bowling problem. My idea was to feed the same random pin counts to both our programs, and watch for discrepancies. If the programs disagree, then at least one of them is in error. If you want to see the details, you can grab the code here. As it turns out, I found no discrepancies after testing over 20 million cases.

What if you don’t have a convenient other implementation to test against? All is not lost. Randomized testing is great for regression testing. Test the latest version of your code against the previous version.

Sometimes it even makes sense to go ahead and build a second implementation just to help test a first. For instance, sometimes one can build a simple but slow version of a fast but complicated thing (say an array vs a fancy tree structure), and test them against each other. Or test a single threaded version against a multi-threaded version. I have gotten tremendous mileage out of randomized tests using exactly that approach.

If you happen to program in Haskell, QuickCheck is a randomized testing godsend. Once I get a better handle the language, I’d like to take a look at building a QuickCheck for Clojure.

My road to Clojure

I’ve been studying Clojure, using my standard approach of solving Project Euler problems. I am very much impressed. Here’s why.

Functional Programming
Short of speech itself, mathematics is humanity’s most powerful and influential symbolic technology. Mathematics makes possible the science that shapes our world-view, and the technology that shapes our world. The central abstraction in modern mathematics is the function. As a mathematician, thinking in terms of functions is second nature to me. Clojure has excellent functional programming support, and that support meshes beautifully with Clojure’s sequence libraries and persistent data structures.

The computer languages I loved best have been, to a greater or lesser extent, functional. They are Mathematica, Erlang, Scheme, and Haskell. Mathematica is proprietary and expensive. Erlang and Haskell are experiencing some interest. I hope both will thrive, but I see them as influential niche players for the foreseeable future. Scheme is a beautiful dinosaur, going nowhere in terms of adoption. Why?

Benevolent Dictators
Python, Ruby, and Perl have “benevolent dictators” and reference implementations. Scheme and Common Lisp have official specifications, multiple slightly incompatible implementations, and fractured communities. Both the lisps are living fossils, evolving too slowly to adapt to the pace of technological change. I believe a computer language benefits from a benevolent dictator. A language needs someone to make critical decisions on time scales much shorter than those in which standards bodies operate.

Rich Hickey
Rich Hickey is the creator of Clojure. I’ve watched several of his video presentations, read a bit of his writing, and enjoyed the benefit of his work. The man is crazy smart, and has thought long and hard about computer languages. Above all, he has fantastic taste in language design. Design is the art of trade-off, and Rich consistently makes excellent choices. Mr. Hickey is close to the Platonic ideal of the benevolent dictator.

Concurrency
Mutable state + threads + locks is a formula for insanity and pain. Clojure and Erlang offer alternatives to the madness. I prefer Clojure syntax.

Syntax
I am a sucker for elegant syntax. I believe s-expressions are the pinnacle of syntactic elegance. De gustibus non disputandum est. And yes, I am aware of LFE.

Emacs integration
I have used emacs for many years. It’s home. Back in the dark ages (circa 2001) I did Java development in emacs using jde-mode, a crazy and masochistic thing to do. That’s devotion. Later I switched to Eclipse, and then again to IntelliJ IDEA. That’s sanity. Emacs, while great, is not equally great at all things. Java development is better served by other tools.

Getting back to the matter at hand, emacs + clojure-mode + slime + clojure is a little slice of heaven.

Java
Stuart Sierra has already said what needs saying:

But all the arguments about functional programming, software transactional memory, and reader macros miss what was, for me, the biggest reason to switch to Clojure. It’s about the libraries, stupid. Building on the JVM and providing direct access to Java classes/methods was the best decision in Clojure’s design. ‘Cause if it’s ever been done, anywhere, by anyone, someone’s done it in Java. Twice.

Stuart exaggerates; Fortran’s math libraries are untouchable. But his larger point is spot on.

I think Clojure’s seamless Java integration will enable it to slip in through the back door at Java development shops. Maybe just a few at first. Then a few more. Then a lot more. I predict Clojure will thrive, both on its own merits and because of its symbiotic relationship with Java. I’m willing to back that prediction with the most precious of currencies: my time, and my attention.

The wait is over

Daniel Weinreb is a man with impeccable Common Lisp credentials. Wikipedia cites him as one of the authors of the original language spec, Common Lisp: The Language, First Edition, and as a cofounder of Symbolics, Inc., makers of legendary lisp machines. Here is what Weinreb had to say in a comment on Stuart Sierra’s blog:

Speaking as one of the original “Gang of Five” authors of “Common Lisp: The Language”, I am at this point convinced that Clojure is the future of Lisp …

Strong words.

I have been studying Clojure for a couple of weeks now, in what time I can spare. I have been waiting for many years for an acceptable lisp. The wait is over. I’ll say more in my upcoming “Road to Clojure” post.

Rebinding Mathematica keys

Out of the box, Mathematica expects the user to press the SHIFT-RETURN to trigger the evaluation of an expression. Because of my hard-wired emacs reflexes, I frequently type CONTROL-RETURN instead. Today I finally did something about that. I rebound the SHIFT-RETURN sequence.

Mathematica does not make that easy to do. I found nothing in the documentation that even suggested rebinding keys is possible. Much digging around in newsgroups led me to suspect that the answer lay in the file /usr/local/Wolfram/Mathematica/7.0/SystemFiles/FrontEnd/TextResources/X/KeyEventTranslations.tr. I made the following changes:


(* Item[KeyEvent["Return", Modifiers -> {Shift}], "HandleShiftReturn"], *)
    Item[KeyEvent["Return", Modifiers -> {Control}], "HandleShiftReturn"],

(* Item[KeyEvent["Return", Modifiers -> {Control}], "NewRow"], *)
    Item[KeyEvent["Return", Modifiers -> {Shift}], "NewRow"],

No joy. A bit more digging and I discovered that I had to modify two files, the second being /usr/local/Wolfram/Mathematica/7.0/SystemFiles/FrontEnd/TextResources/X/MenuSetup.tr where I made the following change:


(* MenuItem["&Evaluate Cells", "HandleShiftReturn", MenuKey["Return", Modifiers->{"Shift"}]], *)
   MenuItem["&Evaluate Cells", "HandleShiftReturn", MenuKey["Return", Modifiers->{"Control"}]],

Ah, sweet relief.

Selecting random observations from SAS datasets

Here are some simple ways to extract various types of random samples from SAS datasets, using only base SAS.

Sampling without replacement.
A simple way to select N random elements from a dataset without replacement is to first randomly permute the dataset, and then take the first N elements of the permuted dataset. The process of permuting a dataset is surprisingly useful, and worthy of a macro:

%macro permute(dsn_in=, dsn_out=, tag_name=);
	&local dsn_temp = temp_work_space;

	data &dsn_temp;
		set &dsn_in;
		&tag_name = uniform(-1); 

	proc sort data=&dsn_temp;
		by &tag_name;

	data &dsn_out (drop=&tag_name);
		set &dsn_temp;
%mend permute;


The parameters dsn_in and dsn_out are the names of the dataset to be permuted and the name of the resultant dataset, respectively. The tag_name parm requires explanation. The macro permutes a dataset S by creating a temporary dataset T which consists of S with an additional field. That field contains uniformly distributed random numbers in (0,1) and goes by the name &tag_name. The value of that name should be chosen to avoid collisions with variable names already in S.

A simple example of usage:

data counts;
do count = 1 to 20;
	output;
end;
run;

* Choose the tag_name to avoid collisions with existing variables in dsn_in ;
%permute(dsn_in=counts, dsn_out=permuted_counts, tag_name=foobar);
run;

Sampling with replacement
A simple way to choose with replacement is to use the point keyword when reading the data. The macro randomly chooses N elements from &dns_in and writes them to &dsn_out

%macro choose(dsn_in=, dsn_out=, N=);
  DATA &dsn_out;
  do k = 1 to &N;
        choice = ceil( uniform(-1) * FILE_SIZE );   * choice is a random integer in 1 .. FILE_SIZE ;
        set &dsn_in point=choice nobs=FILE_SIZE;
        output;
  end;
  stop;
%mend choose;

* here's the usage ;
%choose(dsn_in=counts, dsn_out=chosen, N=5);
run;

Bernoulli sampling
In the methods above, the user specifies in advance the number of records to be chosen. In some cases, one might want the number of elements selected to vary randomly. Essentially one considers each element and decides whether to select it by tossing a coin. The process is so simple it’s not worth bothering with a macro.

DATA bernoulli_choice;
	set counts;  * use our old friend counts as input;
	if (uniform(-1) < .5) then output;
run;

The choice of .5 as the selection probability guarantees that any subset of counts (including the empty set and all of counts) is equally likely to be selected.

A problem I ran across

I was chasing links concerning the lack of tail recursion optimization in ruby and ran across a math problem:

Find the smallest positive integer n such that n % x = x-1 for x from 2 to 18
i.e. The remainder is one less than the divisor for all integral divisors from 2 to 18.

The author, Sriram Narayan, offered java, ruby, and erlang solutions. His erlang solution had a bit of a java accent. Sriram states that he is new to erlang and requested advice and commentary, so I thought I would offer a slightly more idiomatic version of his solution.

The problem Sriram poses is easy enough to solve by hand if one makes a simple observation: n mod x = x-1 if and only if n = k*x + x-1 for some integer k. In particular, n+1 = (k+1)*x, i.e., x divides n+1. So the problem amounts to finding the least positive integer divisible by 18, 17, …, 2. Clearly 18! has the necessary divisors, so a solution exists, and is bounded above by 18!.

Let’s just do a brute force search, in erlang. This solution is tail recursive, as requested.

-module(f18).
-export([find_it/0]).
%% N is suitable if divisible by each of 18,17, ... ,2
suitable(N) when is_integer(N), N > 0 ->
    suitable(N, 18).

suitable(_N, 1) ->
    true;
suitable(N, K) when N rem K == 0 ->
    suitable(N, K-1);
suitable(_N, _K) ->
    false.

find_it() ->
    find_it(19).

%% This will halt since since 18! satisfies suitable/1.
%% 1st suitable N found will be the smallest possible.
find_it(N) ->
    case suitable(N) of
        true ->
            N;
        false ->
            find_it(N+1)
    end.

Running f18:find_it() returns 12252240, so the solution to Sriram’s original problem is 12252239.

In the multilingual spirit of the original post, here is the ruby verification that 12252239 satisfies the original problem.

irb(main):001:0> $y=12252239
=> 12252239
irb(main):002:0> (2..18).map {|x| $y.modulo(x)}
=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]

I'll leave it as an exercise for the reader to determine how to solve the problem by hand. Hint: consider the prime factorizations of each of the numbers 18, 17, ... , 2.

The two envelope paradox

I’m a sucker for probability puzzles. Here’s an interesting and unusual one I found on the site of Amos Storkey:

You are taking part in a game show. The host introduces you to two envelopes. He explains carefully that you will get to choose one of the envelopes, and keep the money that it contains. He makes sure you understand that each envelope contains a cheque for a different sum of money, and that in fact, one contains twice as much as the other. The only problem is that you don’t know which is which.

The host offers both envelopes to you, and you may choose which one you want. There is no way of knowing which has the larger sum in, and so you pick an envelope at random (equiprobably). The host asks you to open the envelope. Nervously you reveal the contents to contain a cheque for 40,000 pounds.

The host then says you have a chance to change your mind. You may choose the other envelope if you would rather. You are an astute person, and so do a quick sum. There are two envelopes, and either could contain the larger amount. As you chose the envelope entirely at random, there is a probability of 0.5 that the larger check is the one you opened. Hence there is a probability 0.5 that the other is larger. Aha, you say. You need to calculate the expected gain due to swapping. Well the other envelope contains either 20,000 pounds or 80,000 pounds equiprobably. Hence the expected gain is 0.5×20000+0.5×80000-40000, ie the expected amount in the other envelope minus what you already have. The expected gain is therefore 10,000 pounds. So you swap.

Does that seem reasonable? Well maybe it does. If so consider this. It doesn’t matter what the money is, the outcome is the same if you follow the same line of reasoning. Suppose you opened the envelope and found N pounds in the envelope, then you would calculate your expected gain from swapping to be 0.5(N/2)+0.5(2N)-N = N/4, and as this is greater than zero, you would swap.

But if it doesn’t matter what N actually is, then you don’t actually need to open the envelope at all. Whatever is in the envelope you would choose to swap. But if you don’t open the envelope then it is no different from choosing the other envelope in the first place. Having swapped envelopes you can do the same calculation again and again, swapping envelopes back and forward ad-infinitum. And that is absurd.

That is the paradox. A simple mathematical puzzle. The question is: What is wrong? Where does the fallacy lie, and what is the problem?

Storkey proposes a somewhat involved Bayesian explanation of what is wrong. I think a much simpler explanation is possible.

Let’s go back and examine the player’s reasoning. He opened the first envelope and found 40,000 pounds. Then as Storkey puts it “You need to calculate the expected gain due to swapping. Well the other envelope contains either 20,000 pounds or 80,000 pounds equiprobably.”

No. The player is in no position to calculate expected values because he does not know what the sample space is. He does not know that 20,000 and 80,000 pounds are both possible, much less equiprobable.

To make this concrete, imagine we are the hosts, and we prepared the envelopes. We put 20,000 in one, 40,000 in the other. The player observes 40,000 in the first envelope. If from that he infers that there is a fifty percent chance of 80,000 in the other envelope he is just mistaken. There is no chance of it.

The player cannot compute expectations without knowing the sample space. We, the hosts, know the sample space and can easily compute expected values. The player has two pure strategies available, call them Hold and Swap. A player using the Hold strategy opens the envelope and then does not swap. The expected value of this strategy is 30,000. A player using the Swap strategy opens the first envelope and then swaps. The expected value of this strategy is 30,000. Any mixed strategy will be a weighted sum of Hold and Swap, and hence will have expected value of 30,000. Observing the first value does not provide the player any useful information. He cannot perform conditional probability computations of the form P(Y|X=40,000) because he does not know what values of Y are admissible.

Too lazy to flip coins

I was helping my daughter with some math homework; one of the problems was to flip three coins and tabulate the number of heads over the course of eighty trials. This struck me as excessively tedious, so I suggested we write a computer simulation instead.

Here it is, in its full glory:

coin := Random[Integer, {0, 1}]
flips := coin + coin + coin
Length /@ Split @ Sort @ Table[flips, {80}]

The code is written in the Mathematica language. The first two lines should be self explanatory. The last line is the workhorse; I'll illustrate how it works, but with only twenty trials so that results fit the screen better.

The code should be read from right to left. Table[flips, {20}] produces a list of twenty values of flip, something like this:

{2, 2, 0, 0, 3, 1, 0, 2, 1, 2, 2, 1, 2, 0, 3, 0, 0, 1, 1, 1}

This is then sorted: Sort @ Table[flips, {80}] resulting in

{0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3}

The Split operator breaks the above into runs of identical elements

{{0, 0, 0, 0, 0, 0}, {1, 1, 1, 1, 1, 1}, {2, 2, 2, 2, 2, 2}, {3, 3}}

I then map the Length operator over the list of runs, resulting in tabulated counts:

{6, 6, 6, 2}

The use of @ and /@ for function application and mapping, respectively, is syntactic genius. It doesn't get much sweeter than that.

Wolfram recently started licensing Mathematica for home use at $300. It is a bargain at that price. The academic version costs even less, somewhere around $150. The academic and the home versions are functionally identical to the much more expensive commercial version. Grab a copy and you are armed to go do some Project Euler problems.