My recent coursework included a case study of the insurance industry. I used statistical and visualization tools to cluster a variety of insurance companies. Technically it was not difficult, but the findings were pretty interesting.
First of all, let's have a glimpse of the dataset. It consists of 11 attributes and 689 instances, each of which is an insurance company. The 11 attributes are code, name, total assets, total liabilities, total premiums, return on capital, RBC ratio, and the proportions of life insurance, annuity, health insurance and reinsurance to total premium. I took the natural log of total assets, total liabilities and total premiums so that the numbers wouldn't vary so much, which you will see in a second.
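To make the log step concrete, here is a minimal sketch in Python. The company names and figures are made up for illustration (they are not from the real dataset), and `add_log_column` is a hypothetical helper, not part of any tool used in this post.

```python
import math

# Hypothetical companies with made-up asset figures (not the real data).
companies = [
    {"name": "Small Co", "total_assets": 2.5e6},
    {"name": "Mid Co", "total_assets": 4.0e8},
    {"name": "Large Co", "total_assets": 9.0e10},
]


def add_log_column(rows, source, target):
    """Return copies of the rows with the natural log of `source` stored in `target`."""
    out = []
    for row in rows:
        value = row[source]
        if value <= 0:
            # The natural log is undefined for zero or negative amounts.
            raise ValueError(f"{source} must be positive, got {value}")
        new_row = dict(row)  # copy, so the original rows stay untouched
        new_row[target] = math.log(value)
        out.append(new_row)
    return out


logged = add_log_column(companies, "total_assets", "log_assets")

raw = [row["total_assets"] for row in companies]
logs = [row["log_assets"] for row in logged]
print("largest / smallest raw value:", max(raw) / min(raw))
print("largest - smallest log value:", max(logs) - min(logs))
```

The raw figures differ by several orders of magnitude, while the logged ones sit within a few units of each other, which keeps the biggest companies from dominating any distance-based method.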
Now let's start with some naive analysis: how about clicking cluster analysis without any thought? Using log_assets, log_liabilities, log_premiums, return on capital, RBC ratio, and the proportions of life insurance, annuity, health insurance and reinsurance to total premium as the 9 dimensions, I simply clicked a few buttons and came up with something cool but not really insightful. Here is the dendrogram and parallel plot showing the results.
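For anyone curious what the clustering button does under the hood, below is a rough sketch of hierarchical (agglomerative) clustering in plain Python. It is only an illustration under stated assumptions: the post does not record which linkage or distance was used, so average linkage on Euclidean distance over z-scored attributes is my assumption here, and the six rows of premium shares are invented.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column so that no single attribute dominates the distances."""
    columns = list(zip(*rows))
    centers = [mean(col) for col in columns]
    spreads = [pstdev(col) or 1.0 for col in columns]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, centers, spreads)] for row in rows]


def agglomerate(rows, n_clusters):
    """Average-linkage agglomerative clustering; returns one cluster label per row."""
    clusters = [[i] for i in range(len(rows))]  # start with every row on its own

    def average_distance(a, b):
        return mean(math.dist(rows[i], rows[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the two closest clusters and merge them.
        _, i, j = min(
            (average_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [0] * len(rows)
    for label, members in enumerate(clusters):
        for index in members:
            labels[index] = label
    return labels


# Invented premium shares: [life, annuity, health] for six hypothetical companies.
shares = [
    [0.05, 0.05, 0.90],
    [0.10, 0.00, 0.85],
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.00],
    [0.05, 0.90, 0.05],
    [0.10, 0.85, 0.05],
]
labels = agglomerate(standardize(shares), n_clusters=3)
print(labels)
```

This brute-force version recomputes every pairwise distance on each merge, so it is only meant for reading. For a dataset with hundreds of companies, a statistics package or library routine is the practical choice.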
The proportion of health insurance was high in cluster #1. The proportion of reinsurance was high in cluster #2. The proportion of life insurance was high in cluster #3. The proportion of annuity was high in cluster #4. The fifth cluster is similar to cluster #3 but with high assets, liabilities and premiums. It seems that the five clusters are mainly separated by the structure of their premiums.
My question is, though, could I do better?
Yes. I think a principal component analysis would help me.
The graph above shows the output of the principal component analysis. Based on the loading matrix, which shows the correlation between the attributes and the principal components, I chose Prin1 through Prin4 as the variables with which to continue my analysis. Prin1 stands for the size of the company. Prin2 stands for the imbalance between a company's health insurance and life insurance. Prin3 mostly stands for the reinsurance ratio, and Prin4 stands for the return on capital.
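Here is a hedged sketch of how a loading matrix like that can be computed with NumPy. It assumes PCA on the correlation matrix (that is, on standardized attributes), which I have not verified against the tool's settings. The data is synthetic: one hidden "size" factor drives three columns and one hidden "mix" factor drives two opposing columns, purely to mimic the pattern described above. `pca_loadings` is a name I made up.

```python
import numpy as np


def pca_loadings(X):
    """PCA on the correlation matrix.

    Returns (explained, loadings, scores). loadings[i, k] is the correlation
    between attribute i and principal component k.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize every attribute
    corr = (Z.T @ Z) / len(Z)                  # correlation matrix
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    order = np.argsort(eigenvalues)[::-1]      # largest variance first
    eigenvalues = np.clip(eigenvalues[order], 0.0, None)
    eigenvectors = eigenvectors[:, order]
    loadings = eigenvectors * np.sqrt(eigenvalues)
    scores = Z @ eigenvectors                  # component scores per company
    explained = eigenvalues / eigenvalues.sum()
    return explained, loadings, scores


# Synthetic stand-in data, NOT the real insurance dataset.
rng = np.random.default_rng(0)
n = 200
size = rng.normal(size=n)   # hidden "company size" factor
mix = rng.normal(size=n)    # hidden "life vs. health" factor
X = np.column_stack([
    size + 0.1 * rng.normal(size=n),   # stand-in for log_assets
    size + 0.1 * rng.normal(size=n),   # stand-in for log_liabilities
    size + 0.2 * rng.normal(size=n),   # stand-in for log_premiums
    mix + 0.1 * rng.normal(size=n),    # stand-in for the life share
    -mix + 0.1 * rng.normal(size=n),   # stand-in for the health share
])

explained, loadings, scores = pca_loadings(X)
print(np.round(explained, 3))
print(np.round(loadings[:, :2], 2))
```

Reading the loading matrix column by column is how a component gets a name: a component that correlates strongly with all the size-related attributes is a "size" component, and one with opposite signs on two shares describes the trade-off between them. The leading columns of `scores` are then what gets carried into the next round of clustering.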
Wait, where does the bunch of letters above come from? They are my new clusters, built on the principal components. I assigned companies to clusters A-E, each with a different color, so that I could see them more clearly.
By getting multiple views of the geography of these clusters, I confirmed one thing that had been running through my mind: insurance companies are in channels.
The scatter plots show that although the scale of insurance companies varies a lot, what really differentiates them is the channel they are involved in, that is, their insurance product.
Obviously, this analysis is far from enough. People working in this industry must have known this already. Moreover, I believe analysts use statistics to inform their judgement. Still, this is a case study for outsiders like me to get a better understanding of the industry, and I am going to play with the data further next week to find more insights.
Stay tuned.