During the Crypto-craze of late 2017 and early 2018, I teamed up with a friend of mine from Techstars to predict cryptocurrency price movements based on Twitter sentiment.
This was my first exposure to natural language processing (NLP), handling a large amount of streaming data, and dealing with Twitter spam.
Our goal: create an automated system that would alert us when Twitter sentiment signaled price movements in the top 100 cryptocurrencies.
We used Twitter’s streaming API to capture a live stream of tweets and filtered them based on relevance to the top 100 cryptocurrencies we were monitoring. The raw tweets were saved in AWS S3 buckets as JSON.
We were hoping to catch tweets like this:
$LTC with an interesting setup, similar to $BTC: o Three touches, failed breakout below trendline, then strong reversal o H4 bull reversal bar is already triggered o Price successfully retested EMA 200 in Daily, now looking up
This kind of tweet would influence the crowd to buy more LTC!
We used AWS Lambda functions to fetch the minutely, hourly, daily, and weekly prices of the currencies from CryptoCompare and populate our PostgresDB for training and testing of the prediction algorithms.
One should never look directly at the sun or Twitter’s unfiltered stream.
The hardest part of the project was extracting the most meaningful tweets from the Twitter stream. Most of the crypto tweets were spam, ads, or irrelevant.
We first used a basic filter to keep the tweets we wanted to analyze:
- Does the tweet contain directly relevant symbols or hashtags? (#BTC, $ETH, etc)
- Does the tweet contain tangentially relevant symbols or hashtags? (#crypto)
- Is this a retweet of an existing tweet? If so, save the original tweet.
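The three checks above can be sketched roughly as follows. This is a minimal illustration, not the code we ran: the watch lists and the `basic_filter` and `extract_tags` names are hypothetical, and it assumes tweets arrive as dicts shaped like Twitter's v1.1 JSON, with `entities.hashtags`, `entities.symbols`, and `retweeted_status` fields.

```python
# Hypothetical watch lists; the real project tracked the top 100 coins.
DIRECT_TAGS = {"BTC", "ETH", "LTC"}
TANGENTIAL_TAGS = {"CRYPTO", "CRYPTOCURRENCY", "BLOCKCHAIN"}


def extract_tags(tweet):
    """Collect upper-cased hashtags and cashtags from a tweet's entities."""
    entities = tweet.get("entities", {})
    tags = entities.get("hashtags", []) + entities.get("symbols", [])
    return {tag["text"].upper() for tag in tags}


def basic_filter(tweet):
    """Return the tweet to store, or None if it is not relevant.

    Retweets are unwrapped so that only the original tweet is kept.
    """
    original = tweet.get("retweeted_status", tweet)
    tags = extract_tags(original)
    if tags & DIRECT_TAGS or tags & TANGENTIAL_TAGS:
        return original
    return None
```

Deduplicating stored originals by tweet ID would be a natural next step, since many retweets unwrap to the same tweet.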
After filling up our S3 buckets with enormous amounts of junk data, I found that we needed more filters. The next two filters we needed were for hashtag duplication and crypto spam.
Hashtags on Twitter are assigned by the author and there’s no enforcement of hashtag relevance based on the content.
For example, I could tweet:
I'm so happy I got to catch up with my friends tonight. #BTC $ETH #cryptocurrency #tothemoon
Any sentiment classifier would say this is very good news for BTC and ETH prices although it has no relevance to any crypto markets (unless your friends include Vitalik Buterin - creator of Ethereum).
Another issue we found was shared hashtags between cryptocurrencies and other topics. For example: #BTS could be a famous boy band or the cryptocurrency BitShares. And OmiseGo’s symbol (OMG) shares its abbreviation with a popular expression.
To filter these out, I divided the sentence into parts of speech with SpaCy and determined if the subjects and objects of the sentence relate to cryptocurrencies.
    import spacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp(u'Bitcoin is being adopted by the world\'s top banks')
    for token in doc:
        print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
              token.shape_, token.is_alpha, token.is_stop)
Here’s the output:
    Bitcoin  bitcoin  PROPN  NNP  nsubjpass  Xxxxx  True   False
    is       be       VERB   VBZ  aux        xx     True   True
    being    be       VERB   VBG  auxpass    xxxx   True   True
    adopted  adopt    VERB   VBN  ROOT       xxxx   True   False
    by       by       ADP    IN   agent      xx     True   True
    the      the      DET    DT   det        xxx    True   True
    world    world    NOUN   NN   poss       xxxx   True   False
    's       's       PART   POS  case       'x     False  False
    top      top      ADJ    JJ   amod       xxx    True   True
    banks    bank     NOUN   NNS  pobj       xxxx   True   False
Since words like Bitcoin and BTC are specific to the domain I was working in, I could use the SpaCy training APIs to improve entity detection on crypto words. There wasn’t any labeled dataset for this, so I pulled out sample tweets, ran them through the generic SpaCy entity detection algorithm, and corrected the mislabelled ones for training.
The next issue was crypto spam. Where there’s money to be made, I knew there would be spam, but I didn’t know just how much there would be. I shudder at the thought of how much of the world’s compute and storage resources are wasted on spam. I’ll document my journey classifying crypto spam in a follow-up post.
The most straightforward approach to determining if a sentence is positive or negative is to break the sentence into words and count how often certain words appear (like counting words pulled from a bag). This works well enough for short sentences that are clearly positive or negative such as:
Bummer! That movie sucked or
Wow! I love that dress
But it isn’t as good for more nuanced text:
Wow! That movie sucked or
Although the plot was terrible, I liked the movie
Due to its limited understanding of context and word order, bag-of-words is usually just the first step in processing complex text.
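To make the limitation concrete, here is a minimal bag-of-words scorer run on the four example sentences above. The word lists and the `bag_of_words_score` name are made up for illustration; a real lexicon would be much larger.

```python
import re
from collections import Counter

# Tiny hand-made word lists, purely illustrative.
POSITIVE = {"wow", "love", "liked", "great"}
NEGATIVE = {"bummer", "sucked", "terrible", "hate"}


def bag_of_words_score(text):
    """Count positive minus negative words, ignoring order and context."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    positive = sum(counts[word] for word in POSITIVE)
    negative = sum(counts[word] for word in NEGATIVE)
    return positive - negative


for sentence in [
    "Bummer! That movie sucked",
    "Wow! I love that dress",
    "Wow! That movie sucked",
    "Although the plot was terrible, I liked the movie",
]:
    print(bag_of_words_score(sentence), sentence)
```

The first two sentences come out clearly negative and positive, while the last two cancel out to zero even though a human reader would not call either of them neutral.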
VADER - Lexical Approach
Valence Aware Dictionary and sEntiment Reasoner (VADER) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.
Packaged with NLTK (Natural Language ToolKit) for Python, VADER provided an excellent baseline for how good sentiment analysis can be. At its core is the vader_lexicon.txt, which contains common words and their sentiment ratings determined by a committee of human annotators. Based on these manual ratings, it will tokenize the sentence and then provide a rating of how positive or negative a phrase is.
VADER includes heuristics such as punctuation, degree modifiers, shifts, negation, and capitalization to deal with the nuances of speech that a simple lexical approach would miss. It also includes emoticons, acronyms (such as LOL and WTF) and slang in its dictionary, making it a great choice for analyzing social media comments.
To improve VADER’s performance with crypto-specific texts, I manually added some new acronyms to its lexicon such as:
HODL (Hold On for Dear Life - semi-positive) and
BTD (Buy the Dip - positive).
My modified VADER provided a great baseline and helped us quickly label our data for training our neural network.
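The idea of extending the lexicon can be sketched with a toy stand-in. Nothing here is VADER itself: the `lexicon` dict, the valence numbers for `hodl` and `btd`, and the `toy_compound` function are assumptions for illustration, and none of VADER's heuristics (negation, capitalization, punctuation) are included. The one piece borrowed from VADER is the normalization that squashes a summed valence into the -1 to 1 range.

```python
import math

# Stand-in for the VADER lexicon: word -> mean valence on a -4..4 scale.
# These entries and values are illustrative, not copied from vader_lexicon.txt.
lexicon = {"love": 3.0, "scam": -2.5}

# Hypothetical crypto additions; the valences are guesses for illustration.
# Keys are lower-case because lookups lower-case each word first.
crypto_terms = {"hodl": 1.0, "btd": 2.0}
lexicon.update(crypto_terms)


def normalize(score, alpha=15):
    """Squash an unbounded valence sum into (-1, 1)."""
    return score / math.sqrt(score * score + alpha)


def toy_compound(text):
    """Sum the valence of known words and normalize; no VADER heuristics."""
    total = sum(lexicon.get(word, 0.0) for word in text.lower().split())
    return normalize(total)


print(toy_compound("time to BTD"))
print(toy_compound("this coin is a scam"))
```

With NLTK installed, the analyzer keeps its lexicon in a plain dict attribute, so to my knowledge the same kind of `update` call on `SentimentIntensityAnalyzer().lexicon` achieves the equivalent effect.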
Recurrent Neural Networks - Machine Learning Approach
SpaCy has amazing sample code for building your own neural net for sentiment classification with Keras. Check it out!
I’ll go over some basic concepts I learned while building this neural net so that the SpaCy sample code makes more sense.
In a typical neural net, all of the input (words of the sentence) would be passed in at once and the neural net would return a prediction. For example: feed in an image of pixel values and the neural net would return if there’s a cat in there or not. This works well for images but not so well for natural languages because of how human languages are designed to be consumed.
We typically don’t read an entire sentence or document by taking in all the words at once; it would be too overwhelming. Perhaps this is why people say a picture is worth a thousand words: humans can process an image much faster than a thousand words. So we’ve designed natural languages to be read sequentially, each word building on the previous one until the end of the sentence, where the full meaning can be understood.
Given the sequential form of many languages, scientists came up with a new type of neural network that can better process sequential data: the recurrent neural network (RNN). RNNs are well equipped to deal with sequential data because they take in not only the current word as input but also the results of what they have perceived from previous words.
However, RNNs soon ran into problems with longer texts because they weighted what was seen recently more heavily than what was seen long ago. At every step, the effects of the words at the beginning of the sentence waned.
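A scalar toy version shows both the recurrence and the fading effect of early inputs. This is only a sketch: real RNNs use vectors and learned weight matrices, while the weights `w_x` and `w_h` below are fixed values picked for illustration, and `run_rnn` and `first_input_effect` are hypothetical names.

```python
import math


def run_rnn(inputs, w_x=1.0, w_h=0.5):
    """Scalar RNN: each step mixes the current input with the previous state."""
    hidden = 0.0
    for x in inputs:
        hidden = math.tanh(w_x * x + w_h * hidden)
    return hidden


def first_input_effect(length):
    """How much the final state changes when only the first input flips."""
    rest = [0.0] * (length - 1)
    return abs(run_rnn([1.0] + rest) - run_rnn([-1.0] + rest))


for length in (2, 5, 20):
    print(length, first_input_effect(length))
```

The printed effect shrinks as the sequence gets longer: by the time the network reaches the end of a long input, flipping the first value barely changes the final state.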
Long Short-Term Memory (LSTM) units aimed to solve this problem. LSTM introduced gates that selectively decide which information to retain as the network moves through the sentence. (I recommend reading this intro with pretty pictures on LSTM).
To feed word data into the RNN, we can assign each word a numerical index and then feed it in as a number to the neural net. However, this method has a shortcoming: similarities between words aren’t captured, so the neural net would have trouble generalizing its predictions.
Instead of just assigning a word to an index, we can assign it a rating or value, just like in the VADER lexicon. However, a one-dimensional rating might only capture how positive or negative a word is, not how two words are similar in other ways. The more dimensions we rate words in, the more granular the similarities we can calculate. These word representations are called word embeddings, and they are what we feed into the LSTM network (instead of actual words).
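The difference between an index and an embedding can be shown with a few hand-written vectors. The three-dimensional values below are invented for illustration; real embeddings are learned from data and usually have hundreds of dimensions.

```python
import math

# Index encoding: the numbers are arbitrary labels, so the gap between
# 0 and 1 says nothing about how related the two words are.
word_to_index = {"bitcoin": 0, "ethereum": 1, "banana": 2}

# Toy 3-dimensional embeddings (made-up values).
embeddings = {
    "bitcoin": [0.9, 0.8, 0.1],
    "ethereum": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity(embeddings["bitcoin"], embeddings["ethereum"]))
print(cosine_similarity(embeddings["bitcoin"], embeddings["banana"]))
```

With these vectors, the two coins score close to 1 while a coin and a fruit score much lower, which is the kind of relationship an index alone cannot express.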
With these basic concepts, the sample code for SpaCy should make a lot more sense.
Without a team of labelers, I spent a lot of my time looking at VADER classification of the crypto tweets and relabelling them as necessary to feed into the neural network.
VADER with my customized crypto lexicon performed quite well for classifying sentiment. In my test sets of 100 randomly selected tweets, VADER correctly classified roughly 76% to 80% of the tweets.
Unfortunately, I was not able to get a good enough result from the LSTM network with the limited data I had labeled. As the project progressed, I moved on to tackling the bigger goal: predicting price movements with the sentiment classifications we had.
Predicting Price Movements
After the sentiment classification pipeline churned through the tweets, it saved the sentiment classification and intensity in a PostgresDB.
Here’s a visualization of the volume of positive vs negative tweets for ETH (Ethereum) by hour for Feb 3rd, 2018 to Feb 9th, 2018 UTC.
- VADER assigns sentiment values in a range from -1 (extremely negative) to 1 (extremely positive)
- VADER (with my lexicon) skews towards classifying tweets as positive. So I’ve set the range of -0.2 to 0.5 as neutral. I’ve filtered out all the neutrals in this graph.
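The neutral band described above amounts to a small thresholding step. In this sketch the `label_sentiment` name is hypothetical, and treating the band edges (-0.2 and 0.5) as neutral is an assumption on my part.

```python
from collections import Counter


def label_sentiment(compound, neutral_low=-0.2, neutral_high=0.5):
    """Map a compound score in [-1, 1] to a coarse label.

    Scores inside [neutral_low, neutral_high] count as neutral.
    """
    if compound > neutral_high:
        return "positive"
    if compound < neutral_low:
        return "negative"
    return "neutral"


# Made-up scores standing in for one hour of classified tweets.
scores = [0.8, 0.3, -0.6, 0.55, -0.1, 0.0]
volume = Counter(label_sentiment(score) for score in scores)
print(volume["positive"], volume["negative"])
```

Only the positive and negative counts feed the volume charts; the neutral tweets are dropped.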
To set up the alert system, we wanted to know when there were spikes in volume for both positive and negative tweets. So we looked at the 2-hr and 6-hr moving averages of sentiment volume as guides for when to send out alerts.
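A trailing moving average over hourly counts can be computed in a few lines. The hourly numbers below are invented, and `trailing_average` is a hypothetical helper; early positions simply average over however many hours are available.

```python
from statistics import mean


def trailing_average(values, window):
    """Mean of the last `window` values at each position (shorter at the start)."""
    return [
        mean(values[max(0, i - window + 1): i + 1])
        for i in range(len(values))
    ]


# Made-up hourly counts of positive tweets, with a spike in the last two hours.
hourly_positive = [40, 42, 38, 41, 39, 43, 120, 150]

ma_2h = trailing_average(hourly_positive, 2)
ma_6h = trailing_average(hourly_positive, 6)
print(ma_2h[-1], ma_6h[-1])
```

The 2-hr average reacts to the spike almost immediately while the 6-hr average lags behind, which is why the gap between the two is useful as a signal.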
Visualization of the 2-hr (red line) and 6-hr (blue line) moving averages for positive tweets:
- The shaded region is the standard deviation of the 6-hr moving average
Visualization of the 2-hr (red line) and 6-hr (blue line) moving averages for negative tweets:
Visualization of the price/volume of ETH during the same time period:
Immediately we can spot some simple triggers for our alert system based on when the pos/neg ratio for the 2-hr moving averages deviates from the 6-hr moving average.
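One way such a trigger could look is sketched below. This is an assumption-laden illustration rather than the rule we settled on: the `alert_hours` function, the one-standard-deviation band (`k=1.0`), and the sample counts are all hypothetical.

```python
from statistics import mean, pstdev


def window(values, end, size):
    """Last `size` values up to and including index `end`."""
    return values[max(0, end - size + 1): end + 1]


def alert_hours(positive, negative, short=2, long=6, k=1.0):
    """Indices where the short-term pos/neg ratio leaves the long-term band."""
    # max(n, 1) avoids dividing by zero in hours with no negative tweets.
    ratios = [p / max(n, 1) for p, n in zip(positive, negative)]
    alerts = []
    for i in range(long - 1, len(ratios)):
        short_avg = mean(window(ratios, i, short))
        long_values = window(ratios, i, long)
        band = k * pstdev(long_values)
        if abs(short_avg - mean(long_values)) > band:
            alerts.append(i)
    return alerts


# Made-up hourly counts: positive volume spikes in the last two hours.
positive = [40, 42, 38, 41, 39, 43, 120, 150]
negative = [20, 21, 19, 20, 20, 21, 20, 22]
print(alert_hours(positive, negative))
```

Because the spike itself widens the standard deviation of the 6-hr window, a band like this tends to fire only once a move persists for more than one hour, which trades some speed for fewer false alarms.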
To confirm, I took a look at the sentiment volumes for other coins in the same period and in different periods in Feb. They had a similar pattern: changes in mood = price movements.
Our feeble attempts at predicting market movements didn’t give us any edge in trading cryptocurrencies. Furthermore, the crypto markets spiraled downward month after month following our experiment, as forces much larger than the Twitter-verse drove prices lower.
However, I learned a lot about NLP, processing the Twitter stream at a rapid pace, and the nature of human speculators (myself included).
If I do pick up an NLP project to assist in securities / asset trading again, I’d love to:
- Improve the LSTM neural network for sentiment classification
- Use a sequence-to-sequence model to summarize the content of the tweets instead of just their sentiment
- Use the summarization to predict the direction of price movement, instead of just alerting us that the price might change
- Use fundamentals of a security (non-existent in crypto world) as features for a price prediction algorithm