NLP and Sentiment Analysis for Trading

Overview

Natural Language Processing (NLP) extracts trading signals from text data: news articles, social media, earnings calls, SEC filings, and analyst reports. Sentiment can be a useful source of alpha, but it requires careful processing to avoid noise and bias.

Difficulty: Advanced

Data Sources

Source	Frequency	Signal Horizon	Reliability
News wires (Reuters, Bloomberg)	Real-time	Minutes to hours	High
Twitter/X	Real-time	Minutes	Medium
Reddit (r/wallstreetbets)	Real-time	Hours to days	Low-Medium
SEC Filings (10-K, 10-Q)	Quarterly	Days to weeks	High
Earnings Call Transcripts	Quarterly	Days to weeks	High
Analyst Reports	Daily	Days	High
Central Bank Communications	Event-driven	Hours to days	High

Sentiment Analysis Approaches

1. Lexicon-Based (Loughran-McDonald)

A widely used baseline for financial sentiment is the Loughran-McDonald dictionary, a domain-specific word list that classifies terms as positive, negative, uncertainty, litigious, constraining, or modal. It was built for 10-K and earnings text, where general-purpose lexicons (e.g., Harvard IV-4) misclassify words like "liability" or "tax" as negative. Score = (positive count − negative count) / total words, then z-score across firms. It is typically used in the feature-engineering phase as a cheap, interpretable baseline before any transformer model.
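
The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the actual Loughran-McDonald dictionary: the `POSITIVE` and `NEGATIVE` sets are tiny hypothetical stand-ins, the tokenizer is a plain regular expression, and the firm tickers and filing snippets are invented for the example.

```python
import re
from statistics import mean, pstdev

# Tiny illustrative word sets (assumption): NOT the real Loughran-McDonald lists.
POSITIVE = {"improve", "improved", "gain", "gains", "strong", "profitable"}
NEGATIVE = {"loss", "losses", "impairment", "decline", "weak", "litigation"}


def lexicon_score(text: str) -> float:
    """Return (positive count - negative count) / total words; 0.0 if no words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)


def cross_sectional_zscores(scores: dict[str, float]) -> dict[str, float]:
    """Z-score each firm's raw score against the cross-section of firms."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # All firms have the same score, so no firm stands out.
        return {firm: 0.0 for firm in scores}
    return {firm: (s - mu) / sigma for firm, s in scores.items()}


# Hypothetical filing snippets keyed by made-up tickers.
filings = {
    "AAA": "Margins improved and the segment remains profitable",
    "BBB": "We recorded an impairment loss amid weak demand",
    "CCC": "Tax liability was unchanged from the prior year",
}
raw = {firm: lexicon_score(text) for firm, text in filings.items()}
z = cross_sectional_zscores(raw)
```

Note that "tax" and "liability" contribute nothing to the score for `CCC`, which is the behavior a finance-specific lexicon is meant to provide. In practice the word sets would be loaded from the published dictionary file, and the z-score would be computed per date across the investable universe.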

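Risk and Pitfalls

1. Noise and Spam

Social media contains significant noise and coordinated manipulation, such as bot activity and pump-and-dump campaigns. Filter or down-weight duplicate and low-credibility posts before aggregating sentiment.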

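2. Reverse Causality

Sentiment may follow price moves rather than cause them: a rally can produce positive coverage after the fact. A contemporaneous correlation between sentiment and returns is therefore not evidence of predictive power; the test that matters is whether sentiment leads returns.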
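
One simple diagnostic is to compare the correlation of sentiment with future returns against its correlation with past returns. The sketch below uses only the standard library; the function names and the synthetic series are assumptions made for illustration, and a real study would use a proper regression or Granger-style test on actual data.

```python
from statistics import mean


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def lead_lag_corr(sentiment: list[float], returns: list[float], lag: int) -> float:
    """Correlate sentiment[t] with returns[t + lag].

    lag > 0: does sentiment lead returns?
    lag < 0: do returns lead sentiment (reverse causality)?
    """
    if lag > 0:
        s, r = sentiment[:-lag], returns[lag:]
    elif lag < 0:
        s, r = sentiment[-lag:], returns[:lag]
    else:
        s, r = sentiment, returns
    return pearson(s, r)


# Synthetic data (assumption): sentiment simply echoes the previous day's return.
returns = [0.01, -0.02, 0.015, -0.005, 0.02, -0.01, 0.005, -0.015]
sentiment = [0.0] + returns[:-1]

returns_lead = lead_lag_corr(sentiment, returns, lag=-1)   # sentiment vs. past return
sentiment_leads = lead_lag_corr(sentiment, returns, lag=1)  # sentiment vs. next return
```

In this constructed case the correlation with past returns is essentially perfect while the correlation with next-day returns is not, which is the pattern to be suspicious of in real data.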

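3. Sarcasm and Irony

NLP models struggle with financial sarcasm and irony, which are common on social media. A post that ironically praises a stock after a crash can be scored as positive.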

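4. Data Snooping

Backtested sentiment signals may be overfitted, especially when many lexicons, models, thresholds, and holding periods are tried and only the best result is reported.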
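
The checklist later in this page mentions the deflated Sharpe ratio as a test for this bias. The sketch below follows the formula published by Bailey and López de Prado (2014) as best understood here; treat it as an illustration rather than a reference implementation. It assumes a per-period (not annualized) Sharpe ratio, non-excess kurtosis (3 for a normal distribution), and that the variance of Sharpe ratios across trials is supplied by the caller. All input values in the example are made up.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(sr_variance: float, n_trials: int) -> float:
    """Approximate expected maximum Sharpe ratio across n_trials skill-less trials."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    nd = NormalDist()
    return math.sqrt(sr_variance) * (
        (1 - EULER_GAMMA) * nd.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * nd.inv_cdf(1 - 1 / (n_trials * math.e))
    )


def deflated_sharpe(
    sr: float,           # observed per-period Sharpe ratio of the chosen strategy
    n_obs: int,          # number of return observations
    skew: float,         # skewness of returns
    kurt: float,         # kurtosis of returns (normal = 3)
    sr_variance: float,  # variance of Sharpe ratios across all trials
    n_trials: int,       # number of strategy variants tried
) -> float:
    """Probability that the observed Sharpe exceeds the expected best-of-N by luck."""
    sr0 = expected_max_sharpe(sr_variance, n_trials)
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return NormalDist().cdf((sr - sr0) * math.sqrt(n_obs - 1) / denom)


# Same backtest result, judged against 10 versus 1,000 tried variants (made-up inputs).
dsr_few = deflated_sharpe(0.1, 1250, -0.5, 5.0, 0.01, n_trials=10)
dsr_many = deflated_sharpe(0.1, 1250, -0.5, 5.0, 0.01, n_trials=1000)
```

The more variants were tried, the higher the bar the observed Sharpe ratio has to clear, so `dsr_many` comes out lower than `dsr_few`. This only works if the number of trials is recorded honestly during research.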

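5. Structural Breaks

Sentiment-price relationships change over time as sources, platforms, and market participants change. A signal calibrated on one period may decay or reverse in another.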
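
A common partial mitigation, and the "rolling normalization" item in the checklist, is to standardize each sentiment reading against a trailing window so the signal adapts as the baseline level and dispersion drift. The sketch below is one possible version; the function name, window length, and sample series are assumptions. It deliberately excludes the current observation from its own window to avoid look-ahead.

```python
from statistics import mean, pstdev
from typing import Optional


def rolling_zscore(series: list[float], window: int) -> list[Optional[float]]:
    """Z-score each value against the `window` observations strictly before it.

    Returns None until a full window of history is available.
    """
    out: list[Optional[float]] = []
    for i, x in enumerate(series):
        history = series[max(0, i - window):i]  # past values only, no look-ahead
        if len(history) < window:
            out.append(None)
            continue
        mu, sigma = mean(history), pstdev(history)
        out.append(0.0 if sigma == 0 else (x - mu) / sigma)
    return out


# Made-up daily sentiment readings with a jump on the last day.
daily_sentiment = [1.0, 2.0, 3.0, 2.0, 10.0]
signal = rolling_zscore(daily_sentiment, window=3)
```

The first three entries are `None` because there is not yet a full window, and the final reading produces a large positive z-score relative to the three days before it. Rolling normalization does not detect a structural break by itself; it only limits how long a stale baseline contaminates the signal.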

Checklist

[ ] Text preprocessing appropriate for financial domain
[ ] Financial lexicon used (not general sentiment)
[ ] Model trained on financial text (FinBERT, not general BERT)
[ ] Signal generation includes rolling normalization
[ ] Out-of-sample testing on unseen time periods
[ ] Transaction costs included in backtest
[ ] Latency requirements considered (real-time vs. batch)
[ ] News source reliability assessed
[ ] Sarcasm/irony handling considered
[ ] Data snooping bias tested (deflated Sharpe ratio)

References

Loughran, T. & McDonald, B. (2011). "When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks." Journal of Finance, 66(1), 35-65.
Araci, D. (2019). "FinBERT: Financial Sentiment Analysis with Pre-trained Language Models." arXiv:1908.10063.
Tetlock, P.C. (2007). "Giving Content to Investor Sentiment: The Role of Media in the Stock Market." Journal of Finance, 62(3), 1139-1168.