Sunday, October 15, 2006

Dingbat

Chris is studying for the GRE today and we're testing each other's vocabulary. This definition for dingbat is by far my favorite today:
1. An eccentric or silly person; 2. An ornimental piece of type for borders, seperators, etc; 3. An object (as a brick) serving as a missile.

Friday, October 13, 2006

Finding a job sucks. Or: on exposing your trackers to the world

Finding a job sucks. One thing that drives me crazy is when you go to a career fair for your university and you physically hand a recruiter your resume. They look it over and say "Wow! You look really qualified, we'd love to interview you!" You nod in satisfaction, having spent days polishing your resume. "But," says the recruiter, "we'd like you to go online to your campus's Career Services website and submit your resume through them to us; and then, would you go ahead and visit our corporate website and use our lame-ass resume tracker too?" Honestly, what's the point of giving the recruiters your resume if they're just going to shred it and ask for two electronic copies?

I wouldn't mind visiting corporate website X Y and Z to submit my resume if that's all their resume trackers wanted me to do. But that isn't what a resume tracker is for. Resume trackers provide a company three things: a way to keep track of applicants, a bar to entry, and a way to avoid paying something with a brain to weed through the pile of resumes, discarding the obviously incompetent. What I truly resent is having to rewrite my resume for every single company I apply for. Every company has its own unique take on what a resume should be; every company just knows the One True Resume Format. And it's your responsibility as an applicant to take the work of art that is your resume and compromise its integrity and eye-catching uniqueness in order to shoehorn it into the bland, uniform, generic Corporate Resume Form.

Okay, I get it: you're a big company and you've got thousands of hungry developers scratching at your door and you'd like a little bar to entry. Well, here's a hint. Think about exactly who you're barring. You're not barring the desperate underqualified morons with Gud Tiping Skills. They have nothing else to do but rewrite their resumes over and over, targeting with mindnumbing persistence every company they can think of. You're definitely discouraging creative programmers who find repetitive tasks disgusting. Your Generic Corporate Resume form doesn't provide the creative programmer a way to express all the neat projects he's worked on. And if you think letting your own applicants tell you how experienced they are in the limited number of skill fields you allow them to enter will produce accurate profiles of characteristically hard to qualify skills, you're just fooling yourself. Yes, resume trackers are a bar to entry, but it's the wrong bar to entry.

Okay, I get it: you're a big company and your HR department of 40 needs a way to track the state of 40,000 applications. I'm no stranger to that problem. My honors fraternity usually has two people in charge of handling all the requests for new membership. They aren't paid, they're in school, and they handle collectively about 150 new contacts each semester. These new contacts are the brightest crayons in the box, the top kids in their class, and by far the most discerning applicants. We don't want to drop the ball with them. I built our fraternity a recruiting tracker to keep track of who applied, when they applied, their statistics, which events they've attended, and any number of other things we would otherwise have to keep track of with paper. HR departments have these trackers as well.

When my honors fraternity wants new members, we ask our campus's registrar to send a note to students with a certain GPA who have a certain number of credit hours. Companies do this too: they send messages through my university's career services website, through my department chair, and through advertisements on campus. The result of this kind of direct advertising is a landslide response. Our honors fraternity recruitment begins with an info session, followed by two weeks of events open to the public before Pinning, the official start of the initiation period---a six-week long introduction to the fraternity before Induction, when the initiates become members. Companies do this too: you attend an info session, you interview, and you get hired.

The big difference is how we handle our recruitment trackers. When I wrote the recruitment tracker for my fraternity, I explicitly did NOT allow anyone besides the two officers who handle new applicants to use the tracker. Everyone else needs to ask them for information if they want it. Applicants never ever see the recruitment tracker. We don't want them to! The president and recruitment advisor, the two officers responsible for handling applicants, respond to every single application by email and make the changes to the recruitment tracker themselves.

"Inefficient!" you say, "Why not have a link in your advertising and have your applicants do it themselves?" Two very good reasons: integrity and respect. We don't want our applicants interacting with the recruitment tracker because we want to make sure the tracker has high data integrity. This internal tool is a great time saver for us, but it's worthless if it contains crap for data. The tool is inherently limited in what it accepts for input and what functions it can perform. If an outstanding applicant doesn't fit the mould, I fix the tracker or we flag the applicant and make annotations to the tracker entry. If an applicant hasn't provided enough information, the president or the rush advisor personally gets in touch with the applicant to get the required information. "Inefficient!" you say, "Why not use form validation and get all the information in one fell swoop?" Because it's not just about data integrity, it's also about respect for our applicants. We want them to know we care about them. Our fraternity isn't going to be just another name on their resume, it's going to be a big part of their life. Every interaction we have with them tells us more about them and tells them more about us.

Companies that don't interact with their applicants are missing worlds of information about their prospective employees! Not only do they avoid personal contact with these applicants, who may eventually become their co-workers, supervisors, or underlings; not only do they risk polluting their precious tracking databases with crappy data; but they also turn off the creative people they want to reach in the first place. Your HR people are extremely intelligent and quite adept at discerning the bullshit from the real. Much more so than any computerized filter could ever be.

The best of the best, the worst of the worst

All ranting aside, I wanted to share my application experience for the hopeful benefit of humankind.
  • Google. You don't even need to wonder. I've never seen Google's recruitment tracker, but I have no doubt it's brilliant. Every interaction I've had with them has been through email with their extremely competent and respectful recruiters. These guys are the best of the best.
  • Sun. Sure, I had to deal with their tracker, but it let me upload my own resume in PDF format. It didn't ask me to enter my experience as a form field. They didn't ask a bunch of questions that a non-moron would answer with their resume. If you're going to have a tracker, do it like Sun.
  • Micron. Aside from having a nice looking website anyway, their tracker uses AJAX for a much smoother experience. Kudos to Taleo. If you're going to use an off-the-shelf product, use these guys. The first page on the tracker asks you to upload your resume, transcript, any other helpful documentation. That's the way to do it. They ask you some form questions about your name, your institute of higher education (a pretty safe field to add to a form---there's pretty much only one way to report this, unlike your employment history and your skillset). They ask broad questions about where you'd like to work. That's it---end of story. Very much one of the best trackers I've seen.
  • Agilent. This may be one of the poorest trackers I've seen. You can't upload your resume, but you CAN paste a text copy of it. For those of us who don't use word processors but prefer to generate our resumes from XML or LaTeX, this is a big pain in the ass. Oh, but you can use the resume generator, which asks you a bunch of questions and produces a generic resume in text format for you. At least they don't force you to enter your work history in a series of forms. It's not really Agilent's fault, they just chose crappy tracking software. Agilent, go get Taleo.
These are just a few of the many places I'm applying to. Many are worse, some are better. None are better than Google, though.

Common Resume Interchange

In a world constantly spouting mumbo-jumbo about service connectivity and integration, you'd think the hiring process would be seeing some of this. But it hasn't. Most employers ask the same basic questions: who are you, where can we reach you, and why should we care. I call on the leaders of HR companies to produce an XML specification for resumes. If you don't want to excite your applying developers' gag reflex even further, your product had better allow for the following:
  • Be open and not restrictive about the format and content of employment history and skills. There's no objective way to quantify your skills, and your employment history (especially for the self-employed or you built your own company) is going to be very difficult to fit into someone else's idea of what your employment history should be.
  • Understand that the CRI is about metadata, not a replacement for a resume. It's fine if you want me to extract important metadata from my resume so you can get to it faster. I really don't mind. But the CRI will never replace a resume.
  • Build a plugin for Word. I don't use Word and I'd prefer never to have to. But if you want CRI to be useful, you should give away a Word plugin that can annotate your word documents with CRI data.
  • For God's sake, don't overengineer it! Don't build another XSL spec, don't build another GEDCOM spec. It doesn't need to be that complicated. And make sure you document the spec well enough.
  • If your client company can't be bothered to look at resumes individually, then your product should use natural language processing techniques to attempt to extract useful information from freeform resumes translated into text format. Yeah, it's a hard problem; your clients would be better off just paying a brain to look through the resumes. But don't succumb to the urge to put this problem in your applicant's lap, or you'll find that your tracker product reduces the quality of your clients' new hires. Talk about solving the wrong problem.

Sunday, October 01, 2006

Carbonleaf and Matt Nathanson at the Aggie

I discovered I love Carbonleaf! Perhaps despite the Aggie Theatre, which seems to be completely incapable of decent sound balance. Matt Nathanson's band sounded muddy; the accoustic guitar and the vocals were barely audible over the bass! The Fox Theater did a much job when I saw Steven Kellog and the 6ers last Thursday, and when I last saw Matt Nathanson two years ago. Carbonleaf must be really awesome when the sound technician is decent.

Tuesday, September 12, 2006

Busy Beaver hunter up and running!

A hastily-written python program hums quietly away on four different machines, patiently guessing solutions to the 5-node busy beaver problem. Hooray!! If I'd been thinking properly, I'd have written it in scheme and used gambit to compile it... the algorithm I'm using uses constant memory, but is quite greedy in terms of computation time, so every little bit of speed counts.

After about half an hour of running, one of the processes has found a machine that produces 12 marks after 61 cycles!

It would help prune down the search space enormously if I could prove there were equivalent turing machines out there that had non-isomorphic state graphs; especially if the result allowed us to prune down the possible labels for the edges of the state graph! That's where the really terrible performance comes from...

The potency of configurations

As a fun question to try to answer related to Turing machines... let Cn represent the set of all Turing templates of n states, as defined in the last post. Let P(c) represent the potency of configuration c∈Cn, defined to be the mean productivity of all the Turing machines t∈c.
The question: What's the distribution of P(c)?

Why it matters...

If P(c) has a Gaussian distribution with a fairly tight standard deviation, then the potency function is a useful heuristic that could be used to drive an A* search through Cn. A few samples of t∈c could establish whether c is worth examining.

It's not likely to be the case though, given that two machines in c could behave completely differently. This suggests looking for a different similarity metric between machines.

Monday, September 11, 2006

Hunting beavers: exhaustive generation without isomorphism

In the previous post, I decided to narrow my brute-force search for the 5-node Busy Beaver down to a smaller number of 5-node machines that at least stood a fair chance of halting and used all 5 states. I haven't proved this simplification doesn't exclude the Busy Beaver machine, but I have a strong intuition it doesn't.

The graph-theoretic reformulation

To make the search tractable, we're going to represent the busy beaver Turing machine with a graph. There are two symbols in the alphabet of the Turing machine, so each node in the graph representing a Turing machine "state" should have two outgoing edges. To guarantee that the machine at least contains a HALT instruction, add another node "E" representing the HALT instruction, and stipulate that there must be a path between the start node "S" and the HALT state "E". The situation: Suppose we have a directed graph of n nodes that allows loops and self-loops. Two nodes are distinguished: S and E. There must be a path from every node but E to E. All the nodes but E have two outgoing edges; E must have none. The question: How many non-isomorphic graphs of this form are there? The answer: 65 = 7,776 graphs. Few enough to be tractable!

Ugly details

Where did the magic 65 come from? A chain from S to E must exist, so start there. We can additionally ensure that every node has a path to E by having the chain from S to E run through every node:
The "backbone" of the isomorphism guarantees at least one HALT instruction and gives the machine a chance at reaching it
S 2 3 n E
? ? ? ? ?
That takes care of half the edges. The other n edges each choose one of the n+1 nodes independently, for a total of (n+1)n possibilities.

Applying the results

For each distinct graph found above, a number of machines can be generated by annotating the edges. For example, an annotation "1:1L" means "if I read a one, write a one, move the head left, and transition the Turing machine to the state at the end of this line." There are 2×4×4 = 32 possible annotation sets for each node, putting the solution space at 32n(n+1)n; a whopping 65325 (that's a 26-digit number in decimal!). Still too big to brute force, but still much smaller than the unrestricted solution space of (2n)8n (1040). The restrictions tighten the solution space by roughly 14 orders of magnitude for the n=5 case. Given that the constraints enforce the presence of at least one HALT, and the use of the entire 5 states of the machine, it's easy to see why the stochastic method over the unrestricted state space fails to find answers. The chances of hitting on one of the machines that fits the constraints is roughly 1 to 1014 against; given that the constraint machines aren't guaranteed to halt makes this statistic even more abysmal. Note that this approach still overgenerates. It's easy to see if both outgoing edges of a node lead to the same node that there are only 16 possible labels for the node's edges. The graph approach looks promising. Possibly there are further sensible constraints that would reduce the number of potential non-isomorphic graphs

Wednesday, September 06, 2006

Busy Beaver

Note: This article is in progress. The assignment is due Wednesday the 13th, so you can count on it being done sometime around then!

This problem comes to us by way of Tibor Radó's 1962 paper on the Busy Beaver function. A brief formulation goes like this: Suppose you define the productivity of a Turing machine with n states as the number of 1's written to the Turing machine's tape after it successfully halts. Represent the productivity as a function p(n). Call the machine with n states that maximizes p(n) the Busy Beaver Machine. Radó proved that no Turing machine can compute the value of p(n)1.

Our primary interest in the problem is to find the Busy Beaver Machine for n=5 in order to win a contest for class: who can write the 5-state Turing machine that halts and writes the most ones to an initially blank tape?

For an intuition about the magnitudes involved in this problem, previous research turns up the current leader in the n=5 case at Pascal Michel's page. This particular machine (found in 1989 by Heiner Marxen and Jürgen Buntrock) produces 4,098 marks before halting after 47,176,870 steps.

Formalizing the problem a bit

The Busy Beaver problem, being a subset of Turing literature, requires a bit further definition. Our Turing machine has the following form:

0 1
1 Snew D Inew
n
The instructions arranged into rows corresponding to the states of the Turing machine, and into columns based on the contents of the cell at the tape's current location. The instruction cell consists of three components: what to write at the current location, which way to move the tape, and what state to transition to next.

First approach: the stochastic method

The stochastic method builds random Turing machines and tests them. Because the Halting Problem is also not solvable, we run the Turing machine for an arbitrary amount of time and stop the machine if it hasn't stopped itself by then. The stochastic method is a rather naive method. Since it's not an exhaustive search of the state space, it's not possible to determine whether p(5) has been maximized or not. The best we can do is run the stochastic machine builder, keeping track of the most performant candidates, until we run out of time for the contest. It's important to understand the size of the search space. Suppose we examine each cell in the 2×n table individually and assign it an instruction. We choose Snew, D, and G independently for each instruction, giving 2×2×5 = 20 different operating instructions. Add to that the halt instruction and the total is 21 instructions, for a total of 1021 combinations! If the stochastic method finds any solution at all, it's by sheer luck. My python process has been running several hours, and the best it's come up with is 5-state Turing machine that produces 1 mark on the tape and halts. Obviously, even a human can do better than this!

Intuitions about behavior may improve our chances

Intuition 1: Avoiding Isomorphisms

The intuition here is that much of the state space is populated by clones—that is, equivalent Turing machines. Any machine M' which is equivalent to the halting machine M also halts and additionally produces the same number of marks on the tape. If a brute-force method could avoid generating all the isomorphisms of a halting machine, the state space to be searched might be significantly reduced. This approach, done without care, may produce a crippling effect on memory usage as each halting machine is stored so the algorithm can avoid producing its isomorphisms. Additionally, if the generator randomly generates new machines and then attempts to prove to itself whether it's an isomorphism of a known machine is likely to be more costly than simply running the machines themselves. What's needed is an algorithm for enumerating the non-isomorphic machines with n states that doesn't involve storing history.

Intuition 2: Graph Theory

Enter graph theory. By reformulating the question in terms of graph theory, we can take advantage of the powerful methods and proofs already developed in this domain. To simplify the machine representation, let's only examine the potential instruction sequences instead of the entire machine. Let's build a directed graph of the states. We'll represent the special HALT instruction as an extra node, giving n + 1 nodes to the state graph. We'll represent state transitions with colored edges to indicate that the state transitions are not fungible. First, some obvious observations on the necessary nature of this graph. The state graph should be connected, with no islands. By this I mean from any node it should be possible to reach the halt node. Otherwise the machine has effectively fewer than n states, and it's been proven that p(n) > p(n') where n > n'. This constraint guarantees that it is at least plausible that the machine will eventually reach the special halt node. Additionally, this helps the machine avoid some obvious infinite loops—for example, a state where both outgoing edges lead back to itself. The graph may be cyclic without necessarily causing the machine to run forever—the state graph doesn't encompass enough information to determine this (and in fact this is a good thing---we're hoping by simplifying the problem we can make it tractable. We already know the problem is not directly soluble, we're just trying to find a practical approximation). Fixing the state graph is a big win. For a given n-state graph, there are now only O(n2) machines (in reality, this number is slightly smaller—at least one of the instructions is the halt instruction, which has no second component). The next question in combinatorial analysis is: how many non-isomorphic, connected graphs of n+1 nodes, where n of the nodes have two distinct outgoing edges, and the special halt node has no outgoing edges? Finally, the last question is: how do we generate these graphs?

Intuition n: The optimal number of HALTs

The question of how many HALT instructions to include in the machine is intuitively related to the bridge-crossing question. Look at it this way: we must start at state 0, and to get the best of the machine we should try to use every state possible. In some cases, this is an even number of crossings, in some cases, an odd number. This affects the number of HALT instructions that need to be available (or does it? After all, we're not restricted to crossing a bridge only once—we're restricted to only crossing bridges that exist. That is, we can't modify the machine after it starts running). This most likely will turn out to be a false intuition.
1. http://plato.stanford.edu/entries/turing-machine/

Thursday, August 31, 2006

My Christine

ὠ ἡ χρίστη, χριστίν, ἡ καρδίᾱς ἐμῆς φύλαξ· ἐμή φιλίᾱ σοί.