What I was doing in the coursework recently was a case study of the insurance industry. I used statistical and visualization tools to cluster a variety of insurance companies. Technically it was not difficult, but the findings were pretty interesting.
First of all, let's have a glimpse of the dataset. The dataset consists of 11 attributes and 689 instances, each of which is an insurance company. The 11 attributes are code, name, total assets, total liabilities, total premiums, return on capital, RBC ratio, the proportion of life insurance to total premium, the proportion of annuity to total premium, the proportion of health insurance to total premium and the proportion of reinsurance to total premium. I took the natural log of the total assets, total liabilities and total premiums so that the numbers wouldn't vary that much, which you will see in a second.
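To see why the log helps, here is a minimal sketch in Python. The four asset figures are made-up values chosen only to span several orders of magnitude; they are not rows from the dataset.

```python
import math

# Hypothetical total-assets figures (in dollars) for four companies.
# These are made-up values, not rows from the dataset.
total_assets = [2.5e6, 4.0e7, 9.0e8, 3.2e11]

# Natural log of each value, as described for assets, liabilities and premiums.
log_assets = [math.log(v) for v in total_assets]

# Compare how far apart the smallest and largest company are on each scale.
raw_ratio = max(total_assets) / min(total_assets)
log_spread = max(log_assets) - min(log_assets)

print(f"largest / smallest on the raw scale: {raw_ratio:,.0f}")
print(f"largest - smallest on the log scale: {log_spread:.2f}")
```

On the log scale the gap between the smallest and largest company is just the log of their ratio, so a few giants no longer dominate every distance calculation.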
Now let's start with some dumb analysis: how about clicking cluster analysis without any thought? Using log_assets, log_liabilities, log_premiums, return on capital, RBC ratio, the proportion of life insurance to total premium, the proportion of annuity to total premium, the proportion of health insurance to total premium and the proportion of reinsurance to total premium as the 9 dimensions, I simply moved my fingers on the mouse and came up with something cool but not really insightful. Here is the dendrogram and parallel plot showing the results.
The proportion of health insurance was high in cluster #1. The proportion of reinsurance was high in cluster #2. The proportion of life insurance was high in cluster #3. The proportion of annuity was high in cluster #4. And the fifth cluster is similar to cluster #3 but with high assets, liabilities and premiums. It seems that the five clusters are mainly separated by the structure of their premiums.
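For anyone who would rather script this step than click through it, here is a rough sketch of a five-cluster hierarchical clustering in Python with SciPy. Several things here are assumptions: the post does not say which linkage method or scaling was used, so Ward linkage and column standardization are my choices, `cluster_companies` is a hypothetical helper, and the data is a synthetic stand-in for the real 689 x 9 matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_companies(features, n_clusters=5):
    """Standardize the columns, then cut a Ward dendrogram into n_clusters groups."""
    X = np.asarray(features, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    Z = (X - X.mean(axis=0)) / std
    tree = linkage(Z, method="ward")  # linkage matrix behind the dendrogram
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return tree, labels

# Synthetic stand-in for the feature matrix: five well-separated blobs
# of 20 "companies" each, in 9 dimensions. Not the real data.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(5, 9))
features = np.vstack([c + rng.normal(scale=0.5, size=(20, 9)) for c in centers])

tree, labels = cluster_companies(features)
print(np.bincount(labels)[1:])  # number of companies in each cluster
```

The `tree` array can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the dendrogram, and averaging each attribute by label gives the kind of per-cluster profile described above.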
My question is, though, could I do better?
Yes. I think a principal component analysis would help me.
The graph above shows the output of the principal component analysis. According to the loading matrix, which shows the correlation between attributes and principal components, I chose Prin #1-#4 as the covariates with which I would continue my analysis. Prin1 stands for the size of the company. Prin2 stands for the imbalance between a company's health insurance and life insurance. Prin3 mostly stands for the reinsurance ratio, and Prin4 stands for the return on capital.
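Here is a minimal sketch of how a loading matrix like that can be computed with NumPy, assuming the PCA is run on the correlation matrix (the post does not say whether correlations or covariances were used). `pca_loadings` is a hypothetical helper, and the four columns are synthetic: three are driven by one hidden "size" factor, mimicking how assets, liabilities and premiums move together.

```python
import numpy as np

def pca_loadings(features):
    """PCA on the correlation matrix: returns eigenvalues, loadings and scores."""
    X = np.asarray(features, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize columns
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)           # ascending order
    order = np.argsort(eigvals)[::-1]                 # largest variance first
    eigvals = np.clip(eigvals[order], 0.0, None)      # guard against tiny negatives
    eigvecs = eigvecs[:, order]
    loadings = eigvecs * np.sqrt(eigvals)             # corr(attribute, component)
    scores = Z @ eigvecs                              # component scores per company
    return eigvals, loadings, scores

# Synthetic stand-in data, not the real dataset.
rng = np.random.default_rng(1)
size = rng.normal(size=200)
features = np.column_stack([
    size + 0.1 * rng.normal(size=200),  # stand-in for log_assets
    size + 0.1 * rng.normal(size=200),  # stand-in for log_liabilities
    size + 0.1 * rng.normal(size=200),  # stand-in for log_premiums
    rng.normal(size=200),               # stand-in for return on capital
])

eigvals, loadings, scores = pca_loadings(features)
print(np.round(eigvals / eigvals.sum(), 3))  # share of variance per component
print(np.round(loadings[:, 0], 2))           # loadings on the first component
```

In this toy data the three size-like columns load heavily on the first component, which is the same pattern that lets Prin1 be read as "size of the company". The sign of a component is arbitrary, so only the magnitudes matter when reading the loadings.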
Wait, where does the bunch of letters above come from? They are my new clusters using PCA. I assigned companies to clusters A-E with different colors so that I could see them more clearly.
By getting multiple views of the geography of these clusters, I confirmed one thing that had been running through my mind: insurance companies are in channels.
The scatter plots show that although the scope of insurance companies varies a lot, what really differentiates them is the channel they are involved with, aka their insurance product.
Obviously, this analysis is far from enough. People working in this industry must have known this already. Moreover, I believe analysts use statistics to inform their judgement. Though this is a case study for outsiders like me to get a better understanding of this industry, I am going to play with the data further to find more insights next week.
Stay tuned.
One spoon of machine learning, one spoon of statistics, 1/2 cup full of business, add some photography, and run the whole thing in classical music
Insights from Big Data
Saturday, February 15, 2014
Tuesday, February 11, 2014
A word cloud of Analytics majors
OK, let's start with something simple. I created a wordle of analytics majors to see which analytics keywords are the most prominent in their resumes.
Because we have some privacy issues at McCombs School of Business, I didn't get a chance to use the resumes of the students in our program. But I found that another analytics program, at Northwestern, has published the resumes of its alumni. So I decided to use theirs. After all, I think our keywords would be pretty similar.
First of all, I copied the biographies, downloaded all the resumes of their alumni and converted them to plain text. I did it manually. Yeah, I know there must be some automatic way to do it, but I am not yet proficient in a scripting language. So I went through a little dirty work this time. But I did it anyway because there weren't too many students there: only 29 students, with about 4000 unique words. Then I set up and configured a single-node Hadoop installation so that I could quickly perform simple operations using Hadoop MapReduce and the Hadoop Distributed File System (HDFS). I compiled the WordCount.java program and ran it on my Mac from the command line. There are text mining packages in R, but I am gonna be a nerd and try Hadoop this time.
The output was a little dirty too. First, the program distinguishes upper-case and lower-case letters, leading to duplicates in the output file. So I imported the output file into Excel and used a Pivot Table to combine these duplicates. Second, there remained punctuation and other symbols in it. I deleted them. Third, I had to remove the prepositions, auxiliaries, conjunctions and other common but meaningless words included in the output file as well. At last, I sorted the words and kept only those that appeared more than 10 times overall.
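The same cleanup can be sketched in a few lines of Python. This is not what I ran (that was Hadoop plus Excel), just an illustration of the steps: lower-casing, stripping punctuation, dropping stop words and applying a minimum count. `word_frequencies`, the short stop-word list and the sample sentence are all hypothetical, and the regular expression is a simplification that would mangle tokens such as "C++".

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real run would need a much longer one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "for",
              "with", "is", "was", "at", "on"}

def word_frequencies(text, min_count=1):
    """Count words case-insensitively, ignoring punctuation and stop words."""
    words = re.findall(r"[a-z]+", text.lower())  # letters only, lower-cased
    counts = Counter(w for w in words if w not in STOP_WORDS)
    # Keep only words seen at least min_count times, most frequent first.
    return {w: n for w, n in counts.most_common() if n >= min_count}

# Made-up sample text standing in for the converted resumes.
sample = "Data analytics with SQL and R. Analytics of data, data in SAS!"
freqs = word_frequencies(sample, min_count=2)
for word, n in freqs.items():
    print(f"{word}: {n}")
```

For the threshold described above ("more than 10 times") the call would use `min_count=11`.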
I created the word cloud using the online word cloud generator wordle.net. Using the Wordle advanced page, I pasted the weighted words into the box and then BAM!
The most frequent words are, obviously, data and analytics, while you can also find the analytics skill set, such as SQL, SPSS, R, SAS and even Excel. You can also see that the background of these analytics students is a mixture of science, management, mathematics, business, economics and statistics, which is reasonable because people are coming to this field of study from everywhere due to the hotness of Big Data.
This visualization is useful for recruiters. Remember when you were having a hard time writing a job description? This word cloud will help you understand what the job is about. Furthermore, recruiters can go deeper: calculating a word vector from the words that appear in employees' resumes or LinkedIn profiles and comparing it with each applicant's word vector. The resume with the highest similarity to the "insiders" could be considered an indication of a good match between the job opening and the applicant.
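One simple way to compare such word vectors is cosine similarity on raw word counts. The sketch below assumes that choice (weighting schemes such as TF-IDF would be an alternative), and the helper names and three short texts are made up for illustration.

```python
import math
import re
from collections import Counter

def word_vector(text):
    """Bag-of-words vector: lower-cased word -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up texts standing in for the employees' and the applicants' resumes.
insiders = word_vector("data analytics sql statistics modeling data")
applicant_1 = word_vector("analytics data sql reporting")
applicant_2 = word_vector("marketing sales events")

sim_1 = cosine_similarity(insiders, applicant_1)
sim_2 = cosine_similarity(insiders, applicant_2)
print(f"applicant 1: {sim_1:.2f}")
print(f"applicant 2: {sim_2:.2f}")
```

An applicant who shares no words with the insiders scores 0, and the score rises toward 1 as the mix of words gets closer to theirs. It is a crude signal, so it is better used for ranking than as a verdict.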
I think this is pretty easy but cool. Moreover, I am looking forward to analyzing our own resumes, those of the Master of Science in Business Analytics!