[ Content | Sidebar ]

Rebinding Mathematica keys

June 30th, 2009

Out of the box, Mathematica expects the user to press the SHIFT-RETURN to trigger an evaluation. Due to my hard-wired emacs reflexes, I frequently type CONTROL-RETURN instead. Today I finally got fed up enough to do something about it.

A bit of digging around in newsgroups led me to suspect that the answer lay in the file /usr/local/Wolfram/Mathematica/7.0/SystemFiles/FrontEnd/TextResources/X/KeyEventTranslations.tr. I made the following changes:


(* Item[KeyEvent["Return", Modifiers -> {Shift}], "HandleShiftReturn"], *)
    Item[KeyEvent["Return", Modifiers -> {Control}], "HandleShiftReturn"],

(* Item[KeyEvent["Return", Modifiers -> {Control}], "NewRow"], *)
    Item[KeyEvent["Return", Modifiers -> {Shift}], "NewRow"],

No joy. A bit more digging and I discovered that I had to modify two files, the second being /usr/local/Wolfram/Mathematica/7.0/SystemFiles/FrontEnd/TextResources/X/MenuSetup.tr where I made the following change:


(* MenuItem["&Evaluate Cells", "HandleShiftReturn", MenuKey["Return", Modifiers->{"Shift"}]], *)
   MenuItem["&Evaluate Cells", "HandleShiftReturn", MenuKey["Return", Modifiers->{"Control"}]],

Ah, sweet relief.

Selecting random observations from SAS datasets

June 12th, 2009

Here are some simple ways to extract various types of random samples from SAS datasets, using only base SAS.

Sampling without replacement.
A simple way to select N random elements from a dataset without replacement is to first randomly permute the dataset, and then take the first N elements of the permuted dataset. The process of permuting a dataset is surprisingly useful, and worthy of a macro:

%macro permute(dsn_in=, dsn_out=, tag_name=);
	&local dsn_temp = temp_work_space;

	data &dsn_temp;
		set &dsn_in;
		&tag_name = uniform(-1); 

	proc sort data=&dsn_temp;
		by &tag_name;

	data &dsn_out (drop=&tag_name);
		set &dsn_temp;
%mend permute;


The parameters dsn_in and dsn_out are the names of the dataset to be permuted and the name of the resultant dataset, respectively. The tag_name parm requires explanation. The macro permutes a dataset S by creating a temporary dataset T which consists of S with an additional field. That field contains uniformly distributed random numbers in (0,1) and goes by the name &tag_name. The value of that name should be chosen to avoid collisions with variable names already in S.

A simple example of usage:

data counts;
do count = 1 to 20;
	output;
end;
run;

* Choose the tag_name to avoid collisions with existing variables in dsn_in ;
%permute(dsn_in=counts, dsn_out=permuted_counts, tag_name=foobar);
run;

Sampling with replacement
A simple way to choose with replacement is to use the point keyword when reading the data. The macro randomly chooses N elements from &dns_in and writes them to &dsn_out

%macro choose(dsn_in=, dsn_out=, N=);
  DATA &dsn_out;
  do k = 1 to &N;
        choice = ceil( uniform(-1) * FILE_SIZE );   * choice is a random integer in 1 .. FILE_SIZE ;
        set &dsn_in point=choice nobs=FILE_SIZE;
        output;
  end;
  stop;
%mend choose;

* here's the usage ;
%choose(dsn_in=counts, dsn_out=chosen, N=5);
run;

Bernoulli sampling
In the methods above, the user specifies in advance the number of records to be chosen. In some cases, one might want the number of elements selected to vary randomly. Essentially one considers each element and decides whether to select it by tossing a coin. The process is so simple it’s not worth bothering with a macro.

DATA bernoulli_choice;
	set counts;  * use our old friend counts as input;
	if (uniform(-1) < .5) then output;
run;

The choice of .5 as the selection probability guarantees that any subset of counts (including the empty set and all of counts) is equally likely to be selected.

A problem I ran across

May 10th, 2009

I was chasing links concerning tail recursion in ruby (or rather, lack thereof) and ran across a math problem:

Find the smallest positive integer n such that n % x = x-1 for x from 2 to 18
i.e. The remainder is one less than the divisor for all integral divisors from 2 to 18.

The author, Sriram Narayan, offered java, ruby, and erlang solutions. His erlang solution had a bit of a java accent. Sriram states that he is new to erlang and requested advice and commentary, so I thought I would offer a slightly more idiomatic version of his solution.

The problem Sriram poses is easy enough to solve by hand if one makes a simple observation: n mod x = x-1 if and only if n = k*x + x-1 for some integer k. In particular, n+1 = (k+1)*x, i.e., x divides n+1. So the problem amounts to finding the least positive integer divisible by 18, 17, …, 2. Clearly 18! has the necessary divisors, so a solution exists, and is bounded above by 18!.

Let’s just do a brute force search, in erlang. This solution is tail recursive, as requested.

-module(f18).
-export([find_it/0]).
%% N is suitable if divisible by each of 18,17, ... ,2
suitable(N) when is_integer(N), N > 0 ->
    suitable(N, 18).

suitable(_N, 1) ->
    true;
suitable(N, K) when N rem K == 0 ->
    suitable(N, K-1);
suitable(_N, _K) ->
    false.

find_it() ->
    find_it(19).

%% This will halt since since 18! satisfies suitable/1.
%% 1st suitable N found will be the smallest possible.
find_it(N) ->
    case suitable(N) of
        true ->
            N;
        false ->
            find_it(N+1)
    end.

Running f18:find_it() returns 12252240, so the solution to Sriram’s original problem is 12252239.

In the multilingual spirit of the original post, here is the ruby verification that 12252239 satisfies the original problem.

irb(main):001:0> $y=12252239
=> 12252239
irb(main):002:0> (2..18).map {|x| $y.modulo(x)}
=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]

I'll leave it as an exercise for the reader to determine how to solve the problem by hand. Hint: consider the prime factorizations of each of the numbers 18, 17, ... , 2.

The two envelope paradox

March 11th, 2009

I’m a sucker for probability puzzles. Here’s an interesting and unusual one I found on the site of Amos Storkey:

You are taking part in a game show. The host introduces you to two envelopes. He explains carefully that you will get to choose one of the envelopes, and keep the money that it contains. He makes sure you understand that each envelope contains a cheque for a different sum of money, and that in fact, one contains twice as much as the other. The only problem is that you don’t know which is which.

The host offers both envelopes to you, and you may choose which one you want. There is no way of knowing which has the larger sum in, and so you pick an envelope at random (equiprobably). The host asks you to open the envelope. Nervously you reveal the contents to contain a cheque for 40,000 pounds.

The host then says you have a chance to change your mind. You may choose the other envelope if you would rather. You are an astute person, and so do a quick sum. There are two envelopes, and either could contain the larger amount. As you chose the envelope entirely at random, there is a probability of 0.5 that the larger check is the one you opened. Hence there is a probability 0.5 that the other is larger. Aha, you say. You need to calculate the expected gain due to swapping. Well the other envelope contains either 20,000 pounds or 80,000 pounds equiprobably. Hence the expected gain is 0.5×20000+0.5×80000-40000, ie the expected amount in the other envelope minus what you already have. The expected gain is therefore 10,000 pounds. So you swap.

Does that seem reasonable? Well maybe it does. If so consider this. It doesn’t matter what the money is, the outcome is the same if you follow the same line of reasoning. Suppose you opened the envelope and found N pounds in the envelope, then you would calculate your expected gain from swapping to be 0.5(N/2)+0.5(2N)-N = N/4, and as this is greater than zero, you would swap.

But if it doesn’t matter what N actually is, then you don’t actually need to open the envelope at all. Whatever is in the envelope you would choose to swap. But if you don’t open the envelope then it is no different from choosing the other envelope in the first place. Having swapped envelopes you can do the same calculation again and again, swapping envelopes back and forward ad-infinitum. And that is absurd.

That is the paradox. A simple mathematical puzzle. The question is: What is wrong? Where does the fallacy lie, and what is the problem?

Storkey proposes a somewhat involved Bayesian explanation of what is wrong. I think a much simpler explanation is possible.

Let’s go back and examine the player’s reasoning. He opened the first envelope and found 40,000 pounds. Then as Storkey puts it “You need to calculate the expected gain due to swapping. Well the other envelope contains either 20,000 pounds or 80,000 pounds equiprobably.”

No. The player is in no position to calculate expected values because he does not know what the sample space is. He does not know that 20,000 and 80,000 pounds are both possible, much less equiprobable.

To make this concrete, imagine we are the hosts, and we prepared the envelopes. We put 20,000 in one, 40,000 in the other. The player observes 40,000 in the first envelope. If from that he infers that there is a fifty percent chance of 80,000 in the other envelope he is just mistaken. There is no chance of it.

The player cannot compute expectations without knowing the sample space. We, the hosts, know the sample space and can easily compute expected values. The player has two pure strategies available, call them Hold and Swap. A player using the Hold strategy opens the envelope and then does not swap. The expected value of this strategy is 30,000. A player using the Swap strategy opens the first envelope and then swaps. The expected value of this strategy is 30,000. Any mixed strategy will be a weighted sum of Hold and Swap, and hence will have expected value of 30,000. Observing the first value does not provide the player any useful information. He cannot perform conditional probability computations of the form P(Y|X=40,000) because he does not know what values of Y are admissible.

Too lazy to flip coins

February 24th, 2009

I was helping my daughter with some math homework; one of the problems was to flip three coins and tabulate the number of heads over the course of eighty trials. This struck me as excessively tedious, so I suggested we write a computer simulation instead.

Here it is, in its full glory:

coin := Random[Integer, {0, 1}]
flips := coin + coin + coin
Length /@ Split @ Sort @ Table[flips, {80}]

The code is written in the Mathematica language. The first two lines should be self explanatory. The last line is the workhorse; I'll illustrate how it works, but with fewer trials. Table[flips, {20}] produces a list of twenty values of flip, something like this:

{2, 2, 0, 0, 3, 1, 0, 2, 1, 2, 2, 1, 2, 0, 3, 0, 0, 1, 1, 1}

This is then sorted, resulting in

{0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3}

The Split operator breaks the above into runs of identical elements

{{0, 0, 0, 0, 0, 0}, {1, 1, 1, 1, 1, 1}, {2, 2, 2, 2, 2, 2}, {3, 3}}

I then map the Length operator over the list of runs, resulting in tabulated counts:

{6, 6, 6, 2}

The use of @ and /@ for function application and mapping, respectively, is syntactic genius. It doesn't get much sweeter than that.

Wolfram recently started licensing Mathematica for home use at $300. It is a bargain at that price. The academic version costs even less, somewhere around $150. The academic and the home versions are functionally identical to the much more expensive commercial version. Grab a copy and you are armed to go do some Project Euler problems.

Erlang surprises me.

February 18th, 2009

I recently wanted to calculate some binomial coefficients in Erlang, and so needed an implementation of the factorial function. No problem. The factorial function is the canonical illustration of how to define a recursive function in Erlang. It goes something like this:

fac(0) -> 1;
fac(N) -> N * fac(N-1).

I have to confess that definition has always bothered me because it is not tail recursive. And it only takes one more line to write a tail recursive version, just the way God intended:

fac1(N) -> fac1(N,1).
fac1(0,R) -> R;
fac1(N,R) -> fac1(N-1, N*R).

The advantage of fac1 over fac is that fac1 operates in constant space, while fac requires space linear in N. For large N one has to wonder whether fac's chain of deferred multiplications will overflow the stack. But is this really a problem? Just how large a value of N is required to overflow the stack? I decided to try to find out.

I ran some test cases: fac(1000), fac(10000), fac(50000). No stack overflow yet, but I could see it might get old waiting for the enormous answers to be computed and displayed. I decided to switch gears. To avoid suffering through monster integers, I modified the test. Rather than testing with fac, I settled on using the structurally similar function sum:

sum(0) -> 0;
sum(N) -> N + sum(N-1).

Just for grins, I decided to try the same thing in a couple of other languages. Here is sum in Clojure:

(defn sum [n]
   (if (= n 0)
       0
       (+ n (sum (- n 1)))))


And here it is in Haskell, where I changed the spelling to summ to avoid a name collision with a Prelude function called sum:

summ 0 = 0
summ n = n + summ (n-1)

I was surprised by the results of my tests.

Just to build up the suspense, I'll pause to describe the test machine. It's a Dell T105, quad-core AMD Opteron, 8GB of ram, running Ubuntu Intrepid. Clojure is revision 1173, running on top of Java 1.6.0_10. Haskell is ghci 6.10.1, and Erlang is R12B-5.

On to the results. First up is Clojure:

user=> (sum 10000)
50005000
user=> (sum 20000)
200010000
user=> (sum 30000)
450015000
user=> (sum 40000)
java.lang.StackOverflowError (NO_SOURCE_FILE:0)

The stack overflow is just the sort of thing that makes me prefer fac1 over fac. Let's see what happens in Haskell.

*Main> summ 10000
50005000
*Main> summ 30000
450015000
*Main> summ 40000
800020000
*Main> summ 50000
1250025000
*Main> summ 75000
2812537500
*Main> summ 100000
5000050000
*Main> summ 150000
11250075000
*Main> summ 175000
*** Exception: stack overflow

Haskell, or rather ghci, did roughly four times better. It's probably worth pointing out that this is not really a comparison of the languages so much as a comparison of how the language implementations handle one very specific detail.

Time to see how the Erlang virtual machine fares. I stuck the sum function in a module called fac. No matter.


Eshell V5.6.5  (abort with ^G)
1> fac:sum(1000).
500500
2> fac:sum(10000).
50005000
3> fac:sum(50000).
1250025000
4> fac:sum(100000).
5000050000
5> fac:sum(200000).
20000100000
6> fac:sum(500000).
125000250000
7> fac:sum(1000000).
500000500000
8> fac:sum(5000000).
12500002500000
9> fac:sum(10000000).
50000005000000
10> fac:sum(50000000).
1250000025000000
11> fac:sum(100000000).
5000000050000000
12> fac:sum(200000000).
20000000100000000


At that point I went for broke, and tried to compute sum(1000000000). Maybe it would have worked eventually, but I killed it. My machine had gone nearly non-responsive, and I could hear my hard drive thrashing. Those disk sounds incline me to guess that when Erlang ran out of stack space, it moved the computation into virtual memory and soldiered on. Whatever the actual means, N=200,000,000 is pretty impressive. That's three orders of magnitude better than ghci or the jvm. Kudos to the authors of the Erlang virtual machine.

One final thought. In spite of what I saw here, I'll still go with the tail recursive version of the factorial function. It better satisfies my sense of aesthetics. But it's nice to know I don't have to.

A fix for the ipod recovery mode loop

December 30th, 2008

My daugter’s ipod shuffle fell into the dreaded recovery mode loop. Briefly, itunes claims that the ipod is in recovery mode, and must be restored. Restoration and reboot is followed by itunes claiming that the ipod is in recovery mode, and must be restored. The linked discussion on an Apple discussion forum offers various suggestions, most of which either refer to Windows specific issues (no Windows here) or to diagnostic mode (shuffles have no such mode).

I came up with a solution that worked for me. I make no claim it will work for you, but here it is. Fire up the Apple disk utility ( /Applications/Utilities/Disk Utility ). Select the volume corresponding to your ipod, and then select the Erase tab. Yes, the procedure is to erase all data on the ipod. But not just any erase, notice the Security Options button. Click it, and check the Zero Out Data box. The goal is to actually overwrite the data with zeros. In my early tries, simple erase did nothing, but this version did the trick. YMMV. Do keep in mind your music will not survive this process. Don’t try it unless you have a backup or are willing to lose the tunes.

After erasing the data, one more round of itunes restore did the trick. If you try this, leave a comment and let me know whether it worked for you. Good luck.

installing gem on ubuntu ibex

November 20th, 2008

Installing ruby and gem on Ubuntu (Intrepid Ibex) is somewhat problematic because of differences of opinion between the apt folk and the gem folk. If you want a working version of gem on Ubuntu Ibex, the RailsOnUbuntu site has the instructions. Worked for me.

An excellent domain registrar

November 11th, 2008

I’ve used a number of domain name registars, including GoDaddy, Network Solutions, Yahoo, and the dreadful and defunct Registerfly. After kissing too many toads, I’ve finally found a princess, gandi.net.

Their prices are fair, their site is well designed and easy to use, and they don’t try to cross sell, up sell, or otherwise try to get you to buy services you don’t want. My impression is that the folks at gandi believe technical competence combined with ethical business practices will win and keep customers. Truly a breath of fresh air.

Mirroring an SVN repository

October 27th, 2008

One advantage of a distributed version control system such as Mercurial is that any clone of a project is complete. The clone contains the entire history of the project. That makes backing up a repository trivial.

Not so for a centralized vcs such as Subversion. If you check out a copy of a project, you have a local copy sans the history of the project. Backing up the entire history of the project takes a bit of doing. This post concerns a method for creating a live mirror of a Subversion repository.

Here is the problem we are going to solve. You have a working Subversion (svn) repository. It contains valuable code, so naturally you want to back it up, both on a local machine, and maybe an offsite machine as insurance against disaster striking your site.

You could try to back up your repository by backing up the physical media containing the files which constitute your repository, say using rsynch-backup or some such. I don’t like that approach because if you ever have to use your backup, your first step will be to try to turn a bunch of files into a working svn repository. No thanks, I’d rather have a running svn instance all set to go.

I’ll show you how to back up your svn repository using three very simple bash scripts, and later give a link to to where you can download the scripts. Here’s the first one, make-svnrepo.

#!/bin/bash
## The user should edit this script to insert appropriate repository paths.
## The goal is to mirror ${REMOTESVN}/${REPO} in ${LOCALSVN}/${REPO}, which is created here.
## The population of ${LOCALSVN}/${REPO} is done via another script, sync-svnrepo

REPO=${1}                             # name of repository passed in by user
REMOTESVN='10.0.0.11'                 # omit final "/" since script provides it later
LOCALSVN='/home/drc/repo/saffronsvn'  # ditto
LOCALREPO=${LOCALSVN}/${REPO}         # the file path to local svn repository about to be created
PROTOCOL=svn                          # the transfer protocol to be used : one of {svn,  svn+ssh, file}

## Sanity check: uncomment the svn info command below, and make sure it works.
## If it does not, you are not talking to the remote repo, and may not have correctly specified some component above.
## svn info ${PROTOCOL}://${REMOTESVN}/${REPO}

## do the magic
svnadmin create ${LOCALREPO}
echo "#!/bin/bash" > ${LOCALREPO}/hooks/pre-revprop-change
chmod +x ${LOCALREPO}/hooks/pre-revprop-change
svnsync init file://${LOCALREPO} ${PROTOCOL}://${REMOTESVN}/${REPO}

How would I use this? Well, I have an existing svn repository running on my LAN, on a host with internal ip address 10.0.0.11. Suppose it hosts a project called MyProject. I want to mirror that entire project on my local machine, in the existing directory /home/drc/repo/svn. I have already installed Subversion on my local machine, and make-svnrepo is marked executable and is in my path. What I have to do is run the following in the console:

make-svnrepo My-Project

This will create an empty project. To populate it, use the next script, sync-svnrepo:

#!/bin/bash
## You will want to run this periodically, say via a cron job. I run it 4x daily.
LOCALSVN='/home/drc/repo/saffronsvn'  # should agree with LOCALSVN in make-svnrepo
REPO=${1}
REPOSITORY=${LOCALSVN}/${REPO}
svnsync sync file://${REPOSITORY}

As you might expect, we would populate the mirror created in the first step by running the following in a console:

sync-svnrepo My-Project

This has to be done on a regular basis to keep up with changes; cron does the trick for me. I imagine it would be possible to use hooks within the original repo to trigger backups whenever code is committed, but I prefer not to touch the original repository.

What I’ve done at work is to mirror our working repository in three separate mirrors. Two are onsite, one is offsite. I use the following script, query-svnrepo, to check up on the status of the mirrors.

#!/bin/bash
NAME=${1}
echo "---> " ${NAME}

## The idea here is that we have mirrored the repo named by NAME is in several places
## We want to check that revision numbers on the mirrors more or less agree with the original repo's numbers

ORIGINAL=10.0.0.11                       ## on our lan
MIRROR1="/home/drc/repo/svn"             ## file on localhost
MIRROR2=10.0.0.66/svn                    ## on our LAN
MIRROR3="foo.bar.org/home/drc/repo/svn"  ## offsite, requires ssh access
COMMAND='svn info '

## adjust the protocol as required for each mirror
REPOSITORY=${ORIGINAL}
PROTOCOL='svn://'
RESULT=`${COMMAND} ${PROTOCOL}${REPOSITORY}/${NAME} | grep "Revision:"`
MSG="original:  ${RESULT}"
echo $MSG

REPOSITORY=${MIRROR1}
PROTOCOL='file:///'   ## the third / is needed as an escape because the path begins with /
RESULT=`${COMMAND} ${PROTOCOL}${REPOSITORY}/${NAME} | grep "Revision:"`
MSG=" mirror1:  ${RESULT}"
echo $MSG

REPOSITORY=${MIRROR2}
PROTOCOL='svn://'
RESULT=`${COMMAND} ${PROTOCOL}${REPOSITORY}/${NAME} | grep "Revision:"`
MSG=" mirror2:  ${RESULT}"
echo $MSG

REPOSITORY=${MIRROR3}
PROTOCOL='svn+ssh://'
RESULT=`${COMMAND} ${PROTOCOL}${REPOSITORY}/${NAME} | grep "Revision:"`
MSG=" mirror3:  ${RESULT}"
echo $MSG

This system has worked like a champ for me. I have never had to use the backups, but it’s nice to know they are there. And it is real nice to know that they work. This can easily be tested by checking out the project from one or all of the mirrors.

Screen scraping code can be a pain because blogging software sometimes does weird things to certain characters. You can get clean copies of the scripts from my repository on BitBucket.

If you actually use any of these scripts, let me know how it goes. I’d be happy to hear any suggested improvements.