w: top-k sampling

Let's see what happens when we take a "top-k" slice from a triple-u ranking, to give our intuition something to gnaw on besides itself.

By doing so, we're trying to reach a middle ground between deterministic top-k slicing (as seen in the form of best-seller lists, front page stories, and top-k radio hits) and uniformly random sampling.

Instead, let's try to be eclectic, as the upper crust gets stale and blockbusters get boring once you've seen enough of them. On the other hand, we can still do better than random sampling, can't we?

If deterministic top-k is total order and uniform randomness is total chaos, then, let's choose both and neither at the same time, and see what comes to life.

...

First, we sample n=100 points from a Pareto distribution, using R:


# http://en.wikipedia.org/wiki/Pareto_distribution

n <- 100; lambda <- 10; alpha <- 8 
samples <- rexp(n,rgamma(n,shape=alpha,rate=lambda))
Z <- sort(samples, decreasing = TRUE) * 32

write(Z, "pareto.csv", ncolumns = 1)

Then, run a (descending) triple-u sort in Ruby.

Slice out the "top" ten (distributionally) ranked entries.


#!/usr/bin/env ruby

def triple_u_sort(arr) 
  arr.sort{ |x,y| (x <=> y) * (rand <=> [x,y].max / (x+y).to_f) }
end

arr = IO.read("pareto.csv").split.map { |x| x.to_f }
puts triple_u_sort(arr).slice(0..9)

Finally, prettify in NeoOffice. The thin purple line represents a (sorted) sample of 100 points from a Pareto distribution. There are ten blue bars. Each bar corresponds to a single "distributional" sample.

Let me point out some interesting properties of this sampling technique.

Sure, it may just be von Neumann's rejection sampling on randomized combinatorial steroids, and I have yet to prove anything serious, but I believe it's still worth a bit of thought.

As I mentioned before, we're trying to reach a kind of middle ground behind total order (best-of only lists) and total chaos (no filtering at all), both of which can be pretty boring.

The non-uniformly random sampling we've done, on the other hand, respects the head and the tail(s) of any distribution. It's just more likely that, when sampling a Pareto distribution, an arbitrary "head" element will tend to see more action than any single "tail" element.

In other words, we're sampling in a way that respects the area under the curve, whether or not there's more area in the head or the tail of your distribution.

From a socioeconomic point of view, this has the nice property that you're not focusing on the head at the expense of the tail or the tail at the head's expense.

(Also, I would imagine that wealth distributed according to this sampling tends to keep the distribution somewhere between total chaos and total order, but I have yet to run that experiment.)

I've been trying to argue why it makes sense to think, sample, and take action distributionally, and perhaps for business people, the simplest argument is as follows.

Whenever you see a distribution, imagine yourself taking as many dollar coins as possible and fitting them "under the curve". You're not allowed to stack coins, of course, you can just try to pack them coins as tightly as possible.

This distribution corresponds to your future income: a compact Pareto chart depiction of your customers and where they put their monies.

Thus, it's in your best interest to sample appropriately from this distribution, without ignoring either head or tail. Focus only on the head and you miss out on the up and coming stars. You'll also be incredibly boring, and lose those of us who have interests outside of the mainstream, as many of us do. Focus only on the tail and you may not do much volume.

Business-ese aside, I think diversity itself is an argument for thinking this way. If there is such a thing as the American dream, it's that everyone has a chance, even if the odds are stacked against you.

I suppose I would like to think, then, that this is one way to teach our systems to love and know diversity, in a stable (?) way that scales...

w:

Tuesday, August 08, 2006

top-k sampling

0 Comments:

About Me

Advertising

Previous Posts

Search