17.10.10

A delightfully parallel problem

I am currently recruiting, and part of this process involves analyzing lots of CVs. This can be quite time-consuming, so I have decided to expedite the process by developing a small text mining application that analyzes the resumes and produces a signature for each document. The most common words in the CV will then be read out by the computer, and if I like the sound of the digest I will shortlist the candidate.

The essential parts of text mining are tokenizing the document and filtering out noise words. Thanks to PLINQ, I can then count how often each remaining word occurs:

PLINQ clusters words in doc
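    // Requires: using System.Collections.Generic; using System.Linq;
    private static IEnumerable<IGrouping<string, string>> CalculateWordFrequency(string[] content)
    {
        // AsParallel() must come first so that PLINQ runs the operators after it in parallel.
        var groupedWords =
            content.AsParallel()
                   .Select(word => word.ToLowerInvariant())
                   .GroupBy(word => word)
                   // Keep only words that occur more than twice.
                   .Where(group => group.Count() > 2);
        return groupedWords;
    }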

The document tokenizer approach is fairly naive and could do with more work.
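For illustration, here is a minimal sketch of what the tokenizing and noise-filtering steps might look like. The method names TokenizeContent and FilterContent are the ones called in the ReciteResume listing, but their bodies are not shown in this post, so everything inside them is an assumption, including the ResumeText class name, the tiny noise-word list, and the decision to keep '#' and '+' inside tokens.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class ResumeText
{
    // Hypothetical noise-word list; a real one would be much longer.
    private static readonly HashSet<string> NoiseWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "a", "an", "and", "the", "of", "to", "in", "with", "for", "i"
        };

    // Split on anything that is not a letter, a digit, '#' or '+',
    // so that terms such as "C#" and "C++" survive as single tokens.
    public static string[] TokenizeContent(string document)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        foreach (char c in document)
        {
            if (char.IsLetterOrDigit(c) || c == '#' || c == '+')
            {
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                tokens.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
        }
        return tokens.ToArray();
    }

    // Drop every token that appears in the noise-word list (case-insensitive).
    public static string[] FilterContent(string[] tokens)
    {
        return tokens.Where(token => !NoiseWords.Contains(token)).ToArray();
    }
}
```

With this sketch, ResumeText.FilterContent(ResumeText.TokenizeContent("I worked with C# and the .NET Framework")) yields the tokens "worked", "C#", "NET" and "Framework". Note that ".NET" loses its leading dot, which is one of the ways such a simple tokenizer is still naive.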

After we have grouped the words in each resume, we use the Microsoft Speech API, found in the System.Speech.Synthesis namespace, to recite the most common terms.

Reciting the most common terms
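    // Requires: using System.Speech.Synthesis; (reference System.Speech.dll)
    public static void ReciteResume()
    {
        using (var speechSynthesizer = new SpeechSynthesizer())
        {
            // Select the audio output once, before speaking.
            speechSynthesizer.SetOutputToDefaultAudioDevice();
            foreach (var item in GetGroupedTermsInResume(FilterContent(TokenizeContent(ReadDocument()))))
            {
                // Speak() blocks until the term has been read out.
                speechSynthesizer.Speak(item.Key);
            }
        }
    }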
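The grouped words come back in no particular order, so if only the very top terms should be recited, they need to be ranked first. The following is a rough sketch of one way to do that with ordinary LINQ. The ResumeDigest class and the TopTerms method are hypothetical names introduced here for illustration and are not part of the application above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ResumeDigest
{
    // Returns the `count` most frequent terms, most frequent first.
    // Ties are broken alphabetically so the result is deterministic.
    public static IEnumerable<string> TopTerms(
        IEnumerable<IGrouping<string, string>> groupedWords, int count)
    {
        return groupedWords
            .OrderByDescending(group => group.Count())
            .ThenBy(group => group.Key, StringComparer.Ordinal)
            .Take(count)
            .Select(group => group.Key);
    }
}
```

The recital loop could then iterate over ResumeDigest.TopTerms(groupedWords, 10) and pass each term straight to Speak, instead of speaking every group's Key.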

I have already hired a dozen plus developers using the traditional filtering approach, so it will be interesting to see how the results of the automated CV selection process compare.

An alternative way of representing the CV digest is to generate a logarithmic plot, or “signature file”, as shown below. I wonder what signature represents an ideal candidate; I would probably need to analyze a large amount of data to arrive at an empirically valid conclusion.

[image: signature file plot]

Incidentally, this candidate was not hired, as their CV contained a lot of buzzwords and they could not explain how they had used the technologies.

And this is what it sounds like [audio file].
