Wednesday, June 03, 2020

The KNN PCB post !

[ I was told that I got the name of the algorithm wrong. The algorithm is supposed to be called K-Means clustering and not K-nearest neighbour. Sadly, reverting to the proper name of the algorithm meant that I could not use the eye-catching title, which is quite KNN in the end. ]

I've been making more interesting progress in my machine learning this Post Circuit Breaker (hence PCB) season.

For rookies who are delving into the darker recesses of machine learning, a low-hanging fruit to improve your AI programming skills is the K-nearest neighbour algorithm (KNN). So this article is a discussion of my attempts to use the K-nearest neighbour algorithm during the Post Circuit Breaker era.

(Hence KNN PCB !)

One question I have always asked myself is how we can categorise blue-chips into clusters of stocks so that we can think about them more easily. Intuitively, I tell my students that you can get a firmer grasp of the STI if you think about Banks, REITs, Developers and Jardine counters. But the weakness of this approach is that the four categories I mentioned cover only 16 stocks. What about the remaining 14?

KNN (strictly speaking, K-Means clustering) is a method to train a computer to look at fundamental data and then divide the group of 30 stocks into different categories based on how closely their financial data fit each other. Unlike a human being, who can only review 1 or 2 numbers at a time, a computer can look at the entire suite of financial figures.

Here is what the input data for the 30 STI components looks like:

The data cannot just be fed in from a financial database. It needs some pre-processing work to remove blank entries, and it has to be scaled mathematically so that no financial metric dominates the others during this exercise. Details of dimensionality reduction and scaling are too boring and technical for this blog.
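For readers who do want a taste of the boring part, here is a minimal sketch of one way the cleaning and scaling could be done. It is not the code I actually ran. It assumes that "removing blank entries" means dropping any metric that is blank for at least one stock, and that "scaling" means z-scoring each metric. The function name `clean_and_scale` and the tickers and numbers are made-up placeholders.

```python
from statistics import mean, pstdev

def clean_and_scale(table):
    """Drop metrics with blanks, then z-score each remaining metric.

    `table` maps a ticker to a list of metric values, with None for a blank.
    """
    tickers = list(table)
    columns = list(zip(*table.values()))  # one tuple per financial metric
    # Assumption: keep only metrics that are filled in for every stock.
    complete = [col for col in columns if all(x is not None for x in col)]
    scaled_columns = []
    for col in complete:
        centre = mean(col)
        spread = pstdev(col) or 1.0  # avoid dividing by zero on a constant metric
        scaled_columns.append([(x - centre) / spread for x in col])
    # Reassemble one row of scaled metrics per ticker.
    return {t: [col[i] for col in scaled_columns] for i, t in enumerate(tickers)}

# Made-up placeholder numbers, not real STI data.
raw = {
    "AAA": [1.0, 10.0, None],
    "BBB": [2.0, 20.0, 0.5],
    "CCC": [3.0, 60.0, 0.7],
}
scaled = clean_and_scale(raw)
```

After scaling, every surviving metric has a mean of zero and a spread of one, so a metric quoted in billions cannot drown out one quoted in percentages.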

Once the data is cleaned and ready, I have to decide arbitrarily how many clusters there are in the SGX. I figured that 7 is a nice round number because the human brain is designed to deal with about 5-7 objects at the same time.

I ran the KNN algorithm I wrote on the STI data, and my program works!
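For the curious, the core of a K-Means program (the proper name for what this post calls KNN) can be sketched in a few lines. This is an illustrative version of the textbook procedure, not my actual program: pick k starting centroids at random, assign every point to its nearest centroid, move each centroid to the average of its members, and repeat until nothing changes. The names `kmeans` and `squared_distance` and the sample points are made up for illustration.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    """Cluster `points` into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Random start: use k distinct data points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Made-up scaled metrics for six hypothetical stocks, not real data.
points = [[0.1, 0.2], [0.0, 0.4], [0.3, 0.1], [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
labels, centroids = kmeans(points, k=2, seed=42)
```

For the STI exercise, `points` would hold one row of scaled metrics per stock and `k` would be 7. The `seed` argument is where the random starting point comes in.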

Like Frankenstein's monster, my program can group stocks into various categories. The print-out is not that inspiring though:

Here are some insights from my program:

a) A major victory is that all three banks, OCBC, UOB and DBS, are categorised within their own cluster, which confirms that my algorithm works.
b) Another major victory is that the AI was able to cluster all the REITs within the same group. The downside is that UOL and Hongkong Land were unfortunately also clumped together with the REITs.
c) All three Jardine counters were mysteriously packed into the same cluster. I was a tad disappointed that the AI did not clump Hongkong Land and Dairy Farm into the mix.

At this point, I had exhausted my experience and intuition as a retail investor.

Everything from here on is a weird AI insight that showed me how blind I was:

a) For some strange reason, Thai Beverage is always an island in its own cluster. This cautions us to take care when analysing Thai Beverage. It is truly the outsider in the group of 30 STI components.
b) Singapore Exchange and Dairy Farm seem to be strange bedfellows forming a cluster of 2.
c) SATS, Venture, ComfortDelGro, ST Engineering and YZJ form a cluster of engineering firms. This is likely what's missing from what I've been teaching my students!
d) I suspect the AI dumped the rest into a mega-cluster: SIA, Wilmar, SPH, CapitaLand, Singtel, Genting, Keppel, Sembcorp, City Developments.

Perhaps a more logical clustering will occur once I raise the number of clusters by 3 or 4.

There are random elements in the KNN algorithm, so running it multiple times produces different results. I will have to level up my machine learning skills considerably before I have a program that can automatically sort stocks into categories just by reviewing financial data.
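One way to put a number on how much two runs disagree is the Rand index. This is a standard measure I am borrowing for illustration, not something my program does. It looks at every pair of stocks and counts how often the two runs agree on whether that pair belongs together. The cluster numbers themselves do not matter, only who sits with whom. The function name `rand_index` and the label lists are made up.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree (1.0 = identical)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Two made-up runs over four hypothetical stocks.
run_1 = [0, 0, 1, 1]
run_2 = [1, 1, 0, 0]  # same grouping, cluster numbers swapped
same_grouping = rand_index(run_1, run_2)
```

A score near 1.0 across many runs would suggest the clusters are stable. A low score would suggest the grouping depends mostly on the random starting point.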

To level up my machine learning skills, I have to revise the Maths I picked up during my JC and NUS Engineering days, including matrices and linear programming. Just this week, I finally figured out what an eigenvector is and how to apply it to a practical money problem!

I will also have to review this outdated notion that my law degree may be the last one that I will pursue in my lifetime... There will come a point in time when I will need academic instruction to create a better artificial brain to help me with my investment decisions.

( Also, we need to bring Computing and F Maths back into popular A-level subject combinations. You can't play in this new world without them! )


ML said...

What you describe is K-Means clustering for unlabeled data
KNN is a classification technique for labeled data. They are not the same.

Christopher Ng Wai Chung said...

Thank you, I have amended my post.