Insights from Big Data

Monday, March 3, 2014

A Case Study for Insurance Industry (continued)

    Sorry for the delay. I have been preparing for the IBM Watson Case Competition and, guess what, we won first place, which makes me really excited! Now it is time to finish the work I promised to do weeks ago.

    Last time we used clustering to discover the structure of the insurance industry data, and we found that there are channels within the industry. This time I ran a Fast Clustering procedure in SAS, which is basically k-means clustering, with the maximum number of clusters set to 50. Then I plotted the clusters I got from it.

    This plot actually shows two metrics. We have 50 Gs and 50 Rs here. The G points are plotted as frequency against Gap, which is the distance to the nearest cluster. The R points are plotted as frequency against Radius, which is the distance from a cluster centroid to the most distant case in that cluster. This procedure can help us identify outliers in the dataset. Usually a cluster with a small number of instances is suspected of being an outlier, especially when it is far away from the other clusters. For example, the upper left G cluster is a potential outlier in the sense that there are hardly any points in it and it is very far away from the others.
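
    For anyone who wants to try the same diagnostics outside SAS, here is a minimal Python sketch of the three numbers behind the plot: frequency, Gap and Radius. It assumes the clustering is already done and that Gap is measured from centroid to centroid. The function names, thresholds and toy data are made up for illustration, not taken from the SAS output.

```python
import math
from collections import Counter


def cluster_diagnostics(points, labels, centroids):
    """Return {cluster_id: (frequency, gap, radius)} for a finished clustering.

    Needs at least two clusters, otherwise there is no "nearest other" centroid.
    """
    frequency = Counter(labels)
    stats = {}
    for cid, center in centroids.items():
        # Gap: distance from this centroid to the nearest other centroid.
        gap = min(math.dist(center, other)
                  for other_id, other in centroids.items() if other_id != cid)
        # Radius: distance from the centroid to its most distant member.
        radius = max((math.dist(center, p)
                      for p, label in zip(points, labels) if label == cid),
                     default=0.0)
        stats[cid] = (frequency[cid], gap, radius)
    return stats


def outlier_candidates(stats, max_size, min_gap):
    """Flag clusters that are both small and far from every other cluster."""
    return [cid for cid, (freq, gap, _) in stats.items()
            if freq <= max_size and gap >= min_gap]


# Toy data (made up): two nearby clusters and one isolated single point.
points = [(0.0, 0.0), (0.0, 2.0), (3.0, 0.0), (3.0, 2.0), (50.0, 50.0)]
labels = ["a", "a", "b", "b", "c"]
centroids = {"a": (0.0, 1.0), "b": (3.0, 1.0), "c": (50.0, 50.0)}

stats = cluster_diagnostics(points, labels, centroids)
suspects = outlier_candidates(stats, max_size=1, min_gap=10.0)
print(stats)
print(suspects)
```

    In the toy data, cluster "c" plays the role of the upper left G point: one member and a large Gap. The cut-offs are arbitrary and would have to be chosen by looking at the real plot.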

    So I started to think about cutting some outlier-like instances. I eliminated the clusters that had fewer than 5 instances. Then I restricted the maximum number of clusters to five and reran the fast clustering procedure. I compared the mean attribute values of each cluster and summarized what the companies in each cluster had in common. Here is what I got:
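
    The pruning and the per-cluster comparison can be sketched in a few lines of Python. This is only an illustration with made-up data and hypothetical function names. It does not rerun the clustering itself, which I did in SAS.

```python
from collections import Counter


def drop_small_clusters(rows, labels, min_size=5):
    """Keep only the rows whose cluster has at least min_size members."""
    sizes = Counter(labels)
    kept = [(row, label) for row, label in zip(rows, labels)
            if sizes[label] >= min_size]
    return [row for row, _ in kept], [label for _, label in kept]


def cluster_means(rows, labels):
    """Return {cluster: [mean of each attribute]}."""
    counts = Counter(labels)
    sums = {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}


# Toy data (made up): cluster "x" has 5 members, cluster "y" only 1.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
        [4.0, 40.0], [5.0, 50.0], [99.0, 0.0]]
labels = ["x", "x", "x", "x", "x", "y"]

kept_rows, kept_labels = drop_small_clusters(rows, labels)
means = cluster_means(kept_rows, kept_labels)
print(means)
```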

Cluster  Return         Product      Size      Capital Adequacy
1        high           health       moderate  high
2        moderate/high  life         large     moderate
3        medium         annuity      huge      moderate
4        negative       reinsurance  moderate  moderate
5        break-even     life         small     least



    Knowing that insurance companies are divided by channels is not a breakthrough. But what can I learn about these channels? Here are some interesting findings:

1. The size of insurance companies varies. The biggest ones were those that deal in annuities. My guess is that this is simply because an annuity is something one pays into for security over the rest of one's life.

2. Reinsurance companies had negative returns. They didn't do well in 2011, which is consistent with reality.

3. Health insurance companies had the highest capital adequacy, suggesting they tended to hold short-term investments. This is understandable because they need to keep their capital turning over so that they can pay claims as needed.

4. Life insurance companies had the least capital adequacy. I think the reason is that a life insurance claim is paid only when the insured dies, so life insurers tended to hold long-term investments. Furthermore, some of them (cluster 5) put their investments into reinsurance. Given the situation the reinsurers faced, it's not surprising that this group of life insurers just broke even. By contrast, the life insurers that chose other investment alternatives (cluster 2) seem to have achieved moderate to high returns, which is much more satisfying.

5. Why did life insurance companies choose different investment targets? My answer would be the difference in size. Small firms didn't have enough investment-savvy professionals, so they had to hand their capital to companies better able to identify investment opportunities, namely the reinsurance companies, and so they shared in the reinsurers' difficulties. Big life insurers, on the other hand, were able to invest independently and could achieve profitability.

I think this is a happy ending, although there is more that could be done before I fully understand this industry. But my larger point here is that I found genuinely intriguing insights in the data. Now I am confident in saying that analytics is going to take us from an information age to an insight era.



 

Saturday, February 15, 2014

A Case Study for Insurance Industry (to be continued)

  What I have been doing in my coursework recently is a case study of the insurance industry. I used statistical and visualization tools to cluster a variety of insurance companies. Technically it was not difficult, but the findings were pretty interesting.

  First of all, let's have a glimpse of the dataset. It consists of 11 attributes and 689 instances, each of which is an insurance company. The 11 attributes are code, name, total assets, total liabilities, total premiums, return on capital, RBC ratio, the proportion of life insurance to total premium, the proportion of annuity to total premium, the proportion of health insurance to total premium and the proportion of reinsurance to total premium. I took the natural log of total assets, total liabilities and total premiums so that the values wouldn't vary so much, which you will see in a second.
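
  To show what the log transform does to the size columns, here is a minimal Python sketch. The dictionary keys (total_assets and so on) and the two companies are made up. Only the log_assets, log_liabilities and log_premiums names match the variables I actually used.

```python
import math

SIZE_FIELDS = ("assets", "liabilities", "premiums")


def add_log_columns(company):
    """Return a copy of the record with log_<field> added for each size field.

    math.log raises ValueError for zero or negative values, so those records
    would need separate handling.
    """
    out = dict(company)
    for field in SIZE_FIELDS:
        out["log_" + field] = math.log(company["total_" + field])
    return out


# Two made-up companies whose assets differ by a factor of 100,000.
small = {"name": "Small Co", "total_assets": 2.0e6,
         "total_liabilities": 1.5e6, "total_premiums": 4.0e5}
large = {"name": "Large Co", "total_assets": 2.0e11,
         "total_liabilities": 1.8e11, "total_premiums": 3.0e10}

small_log = add_log_columns(small)
large_log = add_log_columns(large)
spread = large_log["log_assets"] - small_log["log_assets"]
print(spread)
```

  A factor of 100,000 in raw assets becomes a difference of ln(100000), roughly 11.5, on the log scale, so a few giant companies no longer dominate the distance calculations.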


  Now let's start with some dumb analysis: how about clicking through a cluster analysis without any thought? Using log_assets, log_liabilities, log_premiums, return on capital, RBC ratio, the proportion of life insurance to total premium, the proportion of annuity to total premium, the proportion of health insurance to total premium and the proportion of reinsurance to total premium as the 9 dimensions, I simply moved my fingers on the mouse and came up with something cool but not really insightful. Here are the dendrogram and parallel plot showing the results.


  The proportion of health insurance was high in cluster #1. The proportion of reinsurance was high in cluster #2. The proportion of life insurance was high in cluster #3. The proportion of annuity was high in cluster #4. And the fifth cluster is similar to cluster #3 but with high assets, liabilities and premiums. It seems that the five clusters are mainly distinguished by the structure of their premiums.

  My question is, though, could I do better?

  Yes. I think a principal component analysis would help me.

  The graph above shows the output of the principal component analysis. Based on the loading matrix, which shows the correlation between the attributes and the principal components, I chose Prin1 through Prin4 as the variables with which to continue my analysis. Prin1 stands for the size of the company. Prin2 stands for the imbalance between a company's health insurance and life insurance. Prin3 mostly stands for the reinsurance ratio, and Prin4 stands for the return on capital.
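
  To make the idea of a loading concrete, here is a small pure-Python sketch that finds only the first principal component of a correlation matrix, using power iteration. This is not the procedure my statistics package ran, just a stand-in that needs no extra libraries, and the data are made up. For a PCA on standardized data, the loading of an attribute is its eigenvector weight times the square root of the eigenvalue, which equals its correlation with the component.

```python
import math


def standardize(rows):
    """Center each column and scale it to unit sample standard deviation.

    A constant column has zero standard deviation and would divide by zero.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def correlation_matrix(z):
    """Correlation matrix of already standardized rows."""
    n, d = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(matrix, iterations=500):
    """Power iteration: return (largest eigenvalue, unit eigenvector)."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * mv[i] for i in range(d))
    return eigenvalue, v


# Made-up data: columns 0 and 1 move together (think two size measures),
# column 2 is only weakly related to them.
rows = [[1.0, 2.0, 5.0], [2.0, 4.1, 3.0], [3.0, 6.2, 6.0],
        [4.0, 7.9, 2.0], [5.0, 10.1, 4.0]]

corr = correlation_matrix(standardize(rows))
eigenvalue, vector = first_component(corr)
loadings = [weight * math.sqrt(eigenvalue) for weight in vector]
print(eigenvalue, loadings)
```

  In this toy example the first component loads almost equally on the two correlated columns, which is the same pattern that made me read Prin1 as company size.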

  Wait, where does the bunch of letters above come from? They are my new clusters based on the principal components. I assigned companies to clusters A-E, each with a different color, so that I could see them more clearly.

  By getting multiple views of the geography of these clusters, I confirmed one thing that had been running through my mind: insurance companies are in channels.




  The scatter plots show that although the scope of insurance companies varies a lot, what really differentiates them is the channel they are involved in, that is, their insurance product.

  Obviously, this analysis is far from enough. People working in this industry must already know all this. Moreover, I believe analysts use statistics to inform their judgement. Though this is a case study for outsiders like me to get a better understanding of the industry, I am going to play with the data further to find more insights next week.

  Stay tuned.

Tuesday, February 11, 2014

A word cloud of Analytics majors

  Ok, let's start with something simple. I created a Wordle of analytics majors to see which analytics keywords are the most prominent in their resumes.

  Because of privacy concerns at McCombs School of Business, I didn't get a chance to use the resumes of the students in our program. But I found that another analytics program, at Northwestern, has made its alumni's resumes public, so I decided to use theirs. After all, I think our keywords would be pretty similar.

  First of all, I copied the biographies, downloaded all the alumni resumes and converted them to plain text. I did it manually. Yeah, I know there must be automatic ways to do it, but I am not yet proficient in a scripting language, so I went through a little dirty work this time. It was manageable because there weren't too many students: only 29, with about 4000 unique words. Then I set up and configured a single-node Hadoop installation so that I could quickly perform simple operations using Hadoop MapReduce and the Hadoop Distributed File System (HDFS). I compiled the WordCount.java program and ran it on my Mac from the command line. There are text mining packages in R, but I am gonna be a nerd and try Hadoop this time.
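
  For readers who have never seen a word count job, the idea can be imitated in plain Python. This is a toy, single-process sketch of the map, shuffle and reduce steps, not the Hadoop program itself. The function names and the sample lines are made up. It splits on whitespace and is case sensitive.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (word, 1) for every whitespace-separated token in a line."""
    for token in line.split():
        yield token, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


# Made-up input lines standing in for the resume text.
lines = ["Data analytics with SQL", "data and Analytics"]
counts = word_count(lines)
print(counts)
```

  Note that "Data" and "data" come out as separate keys, and so do "analytics" and "Analytics".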

  The output was a little dirty too. First, the program distinguishes between upper case and lower case letters, leading to duplicates in the output file, so I imported the output file into Excel and used a Pivot Table to combine the duplicates. Second, punctuation and other symbols remained in it, and I deleted them. Third, I had to remove the prepositions, auxiliaries, conjunctions and other common but meaningless words from the output file as well. Finally, I sorted the words and kept only those that appeared more than 10 times overall.
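
  The same clean-up could be scripted instead of done by hand in Excel. Here is a minimal Python sketch under a few assumptions: the stop word list is a tiny made-up sample, punctuation is stripped only from the ends of each token, and the raw counts are invented.

```python
import string
from collections import Counter

# A tiny illustrative stop word list; a real one would be much longer.
STOPWORDS = {"the", "and", "of", "to", "in", "a", "for", "with", "on", "at"}


def clean_counts(raw_counts, min_count=10, stopwords=STOPWORDS):
    """Merge case variants, strip punctuation, drop stop words and rare words.

    Returns (word, count) pairs sorted by count, highest first, keeping only
    words that appear more than min_count times.
    """
    merged = Counter()
    for token, count in raw_counts.items():
        word = token.lower().strip(string.punctuation)
        if word and word not in stopwords:
            merged[word] += count
    return [(word, count) for word, count in merged.most_common()
            if count > min_count]


# Made-up raw output in the style of a case-sensitive word count.
raw = {"Data": 30, "data": 25, "data,": 5, "the": 200,
       "SQL": 8, "sql": 4, "R": 3, "--": 7}
weighted_words = clean_counts(raw)
print(weighted_words)
```

  One caveat: stripping punctuation from the ends of tokens would turn something like "C++" into "c", so skill names with symbols need special care.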

  I created the word cloud using the online word cloud generator wordle.net. Using Wordle's advanced option, I pasted the weighted words into the box and then BAM!


  The most frequent words are, obviously, data and analytics, and you can also find the analytics skill set: SQL, SPSS, R, SAS and even Excel. You can also see that the background of these analytics students is a mixture of science, management, mathematics, business, economics and statistics, which is reasonable because people are coming to this field of study from everywhere due to the hotness of Big Data.

  This visualization is useful for recruiters. Remember when you were having a hard time writing a job description? This word cloud will help you understand what the job is about. Furthermore, recruiters can go deeper: build a word vector from the words that appear in current employees' resumes or LinkedIn profiles and compare it with each applicant's word vector. The resume with the highest similarity to the "insiders" could be taken as an indication of a good match between the job opening and the applicant.
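
  One common way to compare two word vectors is cosine similarity. I haven't tried this on real resumes, so treat the following Python as a sketch: the word lists are made up, and a real comparison would first clean the text the way I cleaned the word count output.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (Counters)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Made-up word lists standing in for cleaned resume text.
insiders = Counter("data analytics sql sas data".split())
applicant_a = Counter("data analytics sql".split())
applicant_b = Counter("sales marketing data".split())

sim_a = cosine_similarity(insiders, applicant_a)
sim_b = cosine_similarity(insiders, applicant_b)
print(sim_a, sim_b)
```

  Here applicant A scores higher than applicant B because A shares more of the insiders' vocabulary. Raw counts favor very common words, so some weighting scheme would probably be needed in practice.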

  I think this is pretty easy but cool. Moreover, I am looking forward to analyzing our own resumes, those of the Master of Science in Business Analytics students!