Insights from Big Data

Tuesday, February 11, 2014

A word cloud of Analytics majors

  OK, let's start with something simple. I created a Wordle of analytics majors to see which analytics keywords are the most prominent in their resumes.

  Because of privacy restrictions at McCombs School of Business, I couldn't use the resumes of the students in our own program. But I found that another analytics program, at Northwestern, has published the resumes of its alumni, so I decided to use theirs. After all, I think our keywords would be pretty similar.

  First of all, I copied the biographies, downloaded all the alumni resumes and converted them to plain text. I did it manually. Yeah, I know there must be automatic ways to do it, but I'm not yet proficient in any scripting language, so I did a little dirty work this time. It was manageable because there weren't too many students: only 29, with about 4,000 unique words in total. Then I set up and configured a single-node Hadoop installation so that I could quickly perform simple operations using Hadoop MapReduce and the Hadoop Distributed File System (HDFS). I compiled the WordCount.java program and ran it on my Mac from the command line. There are text mining packages in R, but I wanted to be a nerd and try Hadoop this time.
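  For readers who have not seen it, the logic of a word count job can be sketched in a few lines of Python. This is not the WordCount.java I ran, only a minimal in-memory illustration of the map and reduce steps; the function names and the sample lines are made up for the example.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    return [(token, 1) for token in line.split()]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    """Run the map step on every line, then reduce all emitted pairs."""
    return reduce_counts(chain.from_iterable(map_words(line) for line in lines))


# Hypothetical sample lines standing in for the converted resumes.
sample_lines = ["Data analytics with SQL", "data and Analytics"]
counts = word_count(sample_lines)
```

  Note that tokens are compared exactly as they appear, so "Data" and "data" end up as two separate entries.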

  The output was a little dirty too. First, the program distinguishes upper-case from lower-case letters, which leads to duplicates in the output file, so I imported the output into Excel and used a PivotTable to combine the duplicates. Second, punctuation and other symbols remained in it, and I deleted them. Third, I had to remove the prepositions, auxiliaries, conjunctions and other common but meaningless words. Finally, I sorted the words and kept only those that appeared more than 10 times overall.
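  I did this cleanup by hand in Excel, but the same four steps could be scripted. Here is a rough Python sketch, assuming the raw counts are already in a dictionary; the stop-word list is a tiny illustrative one and `clean_counts` is a name I made up.

```python
import string
from collections import Counter

# Illustrative stop words only; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "with", "on", "at"}


def clean_counts(raw_counts, threshold=10, stop_words=STOP_WORDS):
    """Merge case duplicates, strip punctuation, drop stop words, filter and sort."""
    merged = Counter()
    for word, count in raw_counts.items():
        # Lower-case the word and strip ASCII punctuation from both ends.
        normalized = word.lower().strip(string.punctuation)
        if normalized and normalized not in stop_words:
            merged[normalized] += count
    # Keep only words that appear more than `threshold` times overall.
    kept = {word: count for word, count in merged.items() if count > threshold}
    # Sort by descending count, then alphabetically for ties.
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical raw counts, not the real output of my job.
raw = {"Data": 8, "data": 5, "data,": 1, "of": 50, "SQL": 11, "sql.": 2, "Hadoop": 3}
cleaned = clean_counts(raw)
# cleaned == [("data", 14), ("sql", 13)]
```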

  I created the word cloud using the online word cloud generator wordle.net. On Wordle's Advanced page, I pasted the weighted words into the box and then BAM!


  The most frequent words are, obviously, data and analytics, and you can also find the analytics skill set: SQL, SPSS, R, SAS and even Excel. You can also see that the background of these analytics students is a mixture of science, management, mathematics, business, economics and statistics, which is reasonable because people are coming to this field of study from everywhere due to the hotness of Big Data.

  This visualization is useful for recruiters. Remember when you were having a hard time writing a job description? This word cloud will help you understand what the job is about. Furthermore, recruiters can go deeper: build a word vector from the words that appear in employees' resumes or LinkedIn profiles and compare it with each applicant's word vector. The resume with the highest similarity to the "insiders" would be considered an indication of a good match between the job opening and the applicant.
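  One common way to compare two word vectors is cosine similarity. I have not built this, so the following is only a small Python sketch of the idea, assuming plain word counts as the vectors; the sample texts and function names are invented for illustration.

```python
import math
from collections import Counter


def word_vector(text):
    """Turn a text into a bag-of-words vector of lower-cased token counts."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    shared = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[word] * vec_b[word] for word in shared)
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical texts: one pooled "insider" profile and two applicants.
employees = word_vector("data analytics sql r statistics data")
applicant_a = word_vector("data analytics sql")
applicant_b = word_vector("marketing sales retail")

score_a = cosine_similarity(employees, applicant_a)
score_b = cosine_similarity(employees, applicant_b)
```

  Here applicant A shares several words with the insiders and gets the higher score, while applicant B shares none and scores 0. Raw counts are the simplest choice; a weighting such as TF-IDF would be a natural refinement.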

  I think this is pretty easy but cool. I'm also looking forward to analyzing our own resumes, those of the Master of Science in Business Analytics students!

2 comments:

  1. You don't have to use Hadoop to count the words first. Just put them all into Wordle, and it's done.

    1. Yeah, you're about right. Wordle is pretty smart, but there are still some tricks. I used Hadoop because 1) I wanted the word counts; 2) I wanted to remove some location names and some common features shared only by the students of that specific program, like "Northwestern University"; and 3) resumes follow a standard format, so I wanted to avoid section words like "experience" and "education".
