A key strength of natural language processing (NLP) is its ability to process large amounts of text and then summarise it to extract meaningful insights.
In this example, a selection of economic bulletins in PDF format from 2018 and 2019, sourced from the European Central Bank website, is analysed in order to gauge economic sentiment. tf-idf is used to rank words by importance, and a word cloud is then used as a visual to highlight key words in the text.
As a disclaimer, the examples below are intended solely to illustrate the use of natural language processing techniques for educational purposes. This is not intended as a formal economic summary in any business context.
Converting PDF to text
Firstly, pdf2txt.py (shipped with the pdfminer library) is used to convert the PDF files into text format from a Linux shell.
pdf2txt.py -o eb201804.txt eb201804.en.pdf
pdf2txt.py -o eb201806.txt eb201806.en.pdf
pdf2txt.py -o eb201807.txt eb201807.en.pdf
pdf2txt.py -o eb201808.txt eb201808.en.pdf
pdf2txt.py -o eb201902_a070c3a338.txt eb201902_a070c3a338.en.pdf
pdf2txt.py -o eb201903.txt eb201903.en.pdf
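As an aside, the same conversion can be done directly in Python via pdfminer.six's high-level API if a shell is not convenient. A minimal sketch covering some of the files above:

# Convert each bulletin PDF to a plain-text file using pdfminer.six
from pdfminer.high_level import extract_text

for name in ["eb201804", "eb201806", "eb201807", "eb201808"]:
    with open(name + ".txt", "w") as f:
        f.write(extract_text(name + ".en.pdf"))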
Text editing and tokenization
With the text files now generated, the analysis can be conducted on one text file at a time.
Firstly, the text is imported and special characters are removed:
# Import WordCloud and set file path
from wordcloud import WordCloud
import os

path = "projects/3 text mining"
os.chdir(path)
os.getcwd()
# Read the whole text.
text = open('eb201903.txt').read()
import re
text2=re.sub('[^A-Za-z]+', ' ', text)
text2
Here is what a sample of the relevant text looks like:
'Economic Bulletin Issue Contents Update on economic and monetary developments Summary External environment Financial developments Economic activity Prices and costs Money and credit Boxes What the maturing tech cycle signals for the global economy Emerging market currencies the role of global risk the US dollar and domestic forces Exploring the factors behind the widening in euro area corporate bond spreads The predictive power of real M for real economic activity in the euro area Articles...'
The next step is word tokenization, where the text is split up into separate words:
# WORD TOKENIZATION: splitting a large sample of text into words
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# The stopwords and punkt resources need to be downloaded on the first run
# nltk.download('stopwords')
# nltk.download('punkt')
print(word_tokenize(text2))
text3=word_tokenize(text2)
text3
Here is some of the output:
['Economic', 'Bulletin', 'Issue', 'Contents', 'Update', 'on', 'economic', 'and', 'monetary', 'developments', 'Summary', 'External', 'environment', 'Financial', 'developments', 'Economic', 'activity', 'Prices', 'and', 'costs', 'Money', 'and', 'credit', 'Boxes', 'What', 'the', 'maturing', 'tech', 'cycle', 'signals', 'for', 'the', 'global', 'economy', 'Emerging', 'market', 'currencies', 'the', 'role', 'of', 'global', 'risk', 'the', 'US', 'dollar', 'and', 'domestic', 'forces', 'Exploring', 'the', 'factors', 'behind', 'the', 'widening', 'in', 'euro', 'area', 'corporate', 'bond', 'spreads', 'The', 'predictive', 'power', 'of', 'real', 'M', 'for', 'real', 'economic', 'activity'...]
Removing stop words and TF-IDF Vectorizer
Now, stop words can be removed, i.e. filler words such as “the”, “and”, and “but” that provide no insight into the topic of the text.
# Display and remove stop words
stop_words = stopwords.words('english')
stop_words[:5]

text4 = [word for word in text3 if word not in stop_words]
text4
The tokenized output is generated, but this time the stop words have been removed:
['Economic', 'Bulletin', 'Issue', 'Contents', 'Update', 'economic', 'monetary', 'developments', 'Summary', 'External', 'environment', 'Financial', 'developments'...]
Now, a numerical matrix can be formed with the remaining words:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text4).todense()
matrix
And here is the matrix. Since each token is passed to the vectorizer as its own mini-document, each row corresponds to a single token and each column to a vocabulary term, which is why most entries are zero:
matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])
Now, this matrix must be converted into dataframe format, and the total score for each word across the text obtained:
import pandas as pd

# Transform the matrix to a pandas dataframe
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
# sum over each document (axis=0)
top_words = matrix.sum(axis=0).sort_values(ascending=True)
top_words_df = pd.DataFrame(top_words)
top_words_df
Here is a snippet of what was generated:
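The next snippet reads the words back in from a CSV file. The export step is not shown above, but presumably looked something like the following sketch (file path as used below):

# Hypothetical export step: write the summed scores to CSV,
# with words in the index and scores in a single column
top_words_df.to_csv("projects/text mining/words_dataframe.csv", header=False)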
For further processing, column headers are assigned to the reloaded dataframe:
# Load the words and label the columns
newdf = pd.read_csv("projects/text mining/words_dataframe.csv", header=None)
newdf.columns = ["word", "frequency"]
newdf
The purpose of tf-idf is to avoid assigning too much weight to commonly occurring words. In this particular context, I decided to include words with a frequency between 10 and 30. These thresholds were determined primarily by trial and error, but I found that the word clouds eventually generated showed insightful terms at these frequencies. (A toy illustration of the tf-idf weighting follows the filtering step below.)
# Top words: frequency between 10 and 30
top_words_dfrev = newdf[(newdf['frequency'] >= 10) & (newdf['frequency'] <= 30)]
top_words_dfrev
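To make the weighting intuition concrete, here is a tiny, self-contained illustration (toy sentences, not taken from the bulletins) of how TfidfVectorizer assigns lower idf weights to words that appear in many documents:

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, purely for illustration
docs = ["growth in the euro area",
        "growth slowed amid the downturn",
        "the downturn hit manufacturing"]

v = TfidfVectorizer()
v.fit(docs)

# 'the' occurs in every document, so it receives the lowest idf weight;
# 'manufacturing' occurs in only one, so it receives the highest
for word in ("the", "growth", "manufacturing"):
    print(word, v.idf_[v.vocabulary_[word]])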
Now, the top words can be collected into a list and then converted into string format:
# Collect the top words into a list
top = top_words_dfrev['word'].tolist()
top
# Convert list to string
topstr=str(top)
topstr
Here is a snippet of the string output:
"['what', 'provided', 'downturn', 'far', 'facturing', 'momentum', 'borrowing', 'rather', 'yields', 'observed', 'despite', 'suggest', 'institutions', 'prepared', 'standard', 'targets', 'retaliatory', 'cross', 'competitiveness', 'respectively', 'europa', 'imposed', 'established', 'headline', 'manu', 'eurosystem', 'reflect', 'input', 'blue', 'brazil', 'construction', 'widening', 'machinery', 'social', 'environment', 'included', 'past', 'content', 'experience', 'much', 'system', 'profit', 'june', 'literature', 'september', 'pressures', 'back'...]
Word Clouds
Now, a word cloud can be generated:
# Generate word cloud and display
import matplotlib.pyplot as plt

wordcloud = WordCloud().generate(topstr)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
wordcloud = WordCloud(max_font_size=50).generate(topstr)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
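If the word cloud needs to be saved rather than just displayed, the wordcloud package also provides a to_file method (the file name here is illustrative):

# Save the generated word cloud as a PNG image
wordcloud.to_file("eb201903_wordcloud.png")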
The idea behind generating word clouds in this manner is that they make it easy to discern key terms in the text that could be significant in determining its overall context.
The above word cloud is based on the Issue 3, 2019 economic bulletin by the ECB. Specifically, the report itself mentions the word “downturn” in some of the following contexts:
– “Particular vulnerabilities in the manufacturing and tradable goods sectors reflect a downturn in external demand which, combined with some country and sector-specific factors, suggests a continued weak growth momentum in the first quarter of 2019.”
– “Generally, if euro area countries build up buffers to avoid fiscal tightening in a downturn, national budgets can fulfil their function as stabilisation tools.”
– “The EU’s fiscal framework still needs to be rendered more effective in reducing high national government debt burdens. This would make the countries in question less vulnerable to economic downturns and the euro area as a whole more resilient.”
In this example, an economic analyst seeking to quickly digest the main points of a bulletin can identify “downturn” as a key word, and then refer to the report for greater detail on its context.
Let’s take two other examples, from the July and August 2018 bulletins.
Word Cloud for July 2018 Bulletin
Let’s refer to the report to see the context for the word “surplus”.
– “Newly available data on the geographical breakdown of the euro area current account balance reveal that the largest share of the euro area’s external surplus of 3.5% of GDP in the year to the end of the second quarter of 2018 was accounted for by the United Kingdom and the United States, which contributed 1.4% and 1.0% of euro area GDP, respectively, followed by Switzerland (0.4% of euro area GDP).”
– “The bulk of the increase in the euro area’s current account surplus of about 1.2 percentage points of GDP since 2013 was accounted for by improvements vis-à-vis the euro area’s three largest trading partners (see Chart B).”
– “According to the euro area sectoral accounts for the second quarter of 2018, business margins (measured as the ratio of net operating surplus to net value added) have remained broadly unchanged since the end of 2015 and continue to be close to long-term averages.”
Word Cloud for August 2018 Bulletin
In this word cloud, let’s investigate the context of the word “comparison” further.
– “Emerging market vulnerabilities – a comparison with previous crises”
The above heading appears numerous times in the text, suggesting that vulnerabilities in emerging market economies are of particular concern to the ECB and could pose economic risk for the eurozone more broadly.
For instance, the report states:
“The potential risks for EMEs are important for the global outlook. Compared to two decades ago, EMEs play, in aggregate, a significantly more prominent role in the global economy. They account for more than half of global GDP (at purchasing power parity) and gross capital flows. Developments in these economies could therefore have a sizeable impact on other countries through a variety of channels, including trade, financial and confidence channels.”
Conclusion
The above is an example of how tf-idf can be used in conjunction with word clouds for text summarization purposes. Of course, this analysis could be expanded further by taking sentence-level as well as word-level context into account (a minimal sketch follows), but it serves as a good starting point for identifying the main points of a text with ease.
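For instance, nltk's sentence tokenizer could be used to retrieve the full sentences in which a key word appears, pairing each top word with its surrounding context. A minimal sketch, assuming the raw text loaded earlier and the punkt resource:

from nltk.tokenize import sent_tokenize

# Pull out the sentences that mention a key word, e.g. 'downturn'
sentences = [s for s in sent_tokenize(text) if 'downturn' in s.lower()]
for s in sentences[:3]:
    print(s, '\n')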