Wednesday, May 27, 2020

That other NLP - Natural Language Processing

I'll blog again tomorrow but I've just finished one community webinar where my students fought off two trolls that stayed on for my entire presentation!

I can't seem to shake off my excitement at my latest Python abomination.

Natural Language Processing is an advanced part of data science that allows us to write programs that can read text, such as news articles, and gauge the sentiment around a stock.

I took some code fragments and spent two days figuring out how to do the following:

a) Parse an XML file, because Yahoo Finance data feeds are in XML format and each entry in the feed provides a link to an article.
b) Parse an HTML file and strip out all the JavaScript code and HTML tags so that the text can be read by a computer program.
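
For readers who want to see roughly what these two steps look like, here is a minimal sketch using only the Python standard library. It is not the textbook code I adapted. The sample feed, the sample page and the names `article_links`, `TextExtractor` and `html_to_text` are made up for illustration, and I am assuming the feed follows the usual RSS layout of `<item>` elements that each contain a `<link>`.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical stand-ins for a real feed and a real article page.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Sample feed</title>
  <item><title>Story one</title><link>https://example.com/one</link></item>
  <item><title>Story two</title><link>https://example.com/two</link></item>
</channel></rss>"""

SAMPLE_PAGE = (
    "<html><head><script>var x = 1;</script>"
    "<style>p {color: red}</style></head>"
    "<body><h1>Profit rises</h1><p>Sales were <b>strong</b>.</p></body></html>"
)


def article_links(rss_xml):
    """Return the <link> text of every <item> in an RSS document."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("link") for item in root.iter("item")]


class TextExtractor(HTMLParser):
    """Collect visible text, skipping whatever sits inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while we are inside a script or style block

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Strip tags and JavaScript, returning the remaining text as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.parts)


if __name__ == "__main__":
    print(article_links(SAMPLE_FEED))
    print(html_to_text(SAMPLE_PAGE))
```

In a real run, each link from the feed would be downloaded and its HTML passed to `html_to_text`. I have left the download step out so the sketch runs without a network connection.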

My code came from an old data science textbook and I had to rewrite the code for it to run.

After this, it gets really exciting.

a) My program has to read an investment article and determine whether it is "good" or "bad".
b) To do this, we must find a way to strip out all the 'useless' words in the English language like 'the' and 'an'.
c) Interestingly, linguists at Princeton University have a library of words that are synonyms of words like "good" or "increasing", so coding this was much easier than I thought.
d) We can compare the frequency of good words versus bad words to determine whether the article is bullish or bearish on a counter.
e) Combing through the Internet, we can measure the number of favourable articles versus the unfavourable ones and use it as a measure of the sentiment towards a stock.
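
The steps above can be sketched in a few lines of Python. This is only a toy version under stated assumptions: the stop words, good words and bad words are tiny hand-typed sets standing in for the Princeton synonym library (probably WordNet, though that is my guess), and the names `tokenize`, `classify` and `tally` are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative word lists; a real program would use far larger ones.
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "of", "and", "to", "in", "on", "with"}
GOOD_WORDS = {"good", "increasing", "growth", "strong", "profit", "gain"}
BAD_WORDS = {"bad", "decreasing", "fraud", "weak", "loss", "decline"}


def tokenize(text):
    """Lower-case the text, split it into words and drop the stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def classify(text):
    """Label one article by comparing its count of good words and bad words."""
    counts = Counter(tokenize(text))
    good = sum(counts[w] for w in GOOD_WORDS)
    bad = sum(counts[w] for w in BAD_WORDS)
    if good > bad:
        return "bullish"
    if bad > good:
        return "bearish"
    return "neutral"


def tally(articles):
    """Count how many articles fall under each label."""
    return Counter(classify(article) for article in articles)


if __name__ == "__main__":
    articles = [
        "The profit is strong and growth is increasing",
        "A weak quarter with a loss and fraud",
    ]
    print(tally(articles))
```

The ratio of bullish to bearish articles in the tally is the crude sentiment measure. Note that this is a plain bag-of-words count, so word order and context are thrown away.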

Sadly what we think might work in theory does not work in practice.

I wanted to try sentiment analysis on a truly shitty stock so I chose Luckin Coffee.

But my program flagged 13 favourable articles against 3 unfavourable articles.

Well, we can't win all the time. I have to adjust the program's use of vocabulary. Obviously, unlike a human being, it can't detect irony or the context of the situation.
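
Here is a small sketch of how word counting can misfire. The headline is made up by me and is not taken from any real article, and the word sets and the name `naive_score` are hypothetical. It only illustrates the kind of error I suspect; it does not diagnose what actually happened with my program.

```python
import re

# Hypothetical word lists for illustration only.
GOOD = {"good", "great", "growth", "strong"}
BAD = {"bad", "fraud", "weak", "loss"}


def naive_score(text):
    """Good-word count minus bad-word count, ignoring word order entirely."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(w in GOOD for w in words) - sum(w in BAD for w in words)


# A made-up headline that a human would read as negative.
headline = "So much for the great growth story: sales were not strong"

if __name__ == "__main__":
    print(naive_score(headline))  # positive score despite the negative meaning
```

The counter sees "great", "growth" and "strong" and never notices the sarcasm or the "not", so the headline scores as favourable.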

Still, I am so happy I am writing a blog article few would understand after a busy evening conducting a webinar. I imagine levelling up in NLP may equip me with the ability to summarise legal judgments in the future.

I promise an article in simple English tomorrow!


  1. Think you'd need some machine learning & recursive self-learning for the program to really parse thru the nuances of sentences & paragraphs. Else it just becomes a relatively hard coded lookup table of words & phrases that you'll have to constantly tweak.

    As for sentiment, there is no lack of oscillators as well as various composite indicators & survey indicators. is probably the sentiment guru for US stocks. If only there were a similar level of sentiment analysis & research for local markets.

  2. That's the problem with being an NLP noob: all I know is bag of words so far. I've got a long way to go with machine learning and deep learning, and there is lower-hanging fruit for me at the moment.

    I think for sentiment analysis for local stocks, the data source must be available first. The Yahoo Finance RSS feeds don't really give us much when I input local stock counters. That's a problem we need to solve first.

    Deep learning is... well... deep. I am giving myself a beginner's lesson in the foundations and am shocked at how much matrix manipulation and linear algebra I need just to get to the first level.