Using unsupervised learning to cluster news articles based on content¶
Several years ago, I first stumbled on the claim that AI poses “an existential threat” to humanity. I didn’t know at the time that it meant a threat to human existence. For some reason, I found that claim satisfyingly bold, and I wanted to know more. Falling into a clickhole typical of my obsessive nature, I subscribed to some news sites on AI and started reading way more tech news.
In my reading of tech news, I struggled to identify topics or themes that interested me and were worth my time.
While tags can help give a sense of an article beyond its title, many of them reference vague or poorly defined terms. Rather than relying on these tags, I wanted to see whether unsupervised learning could help me identify themes, or groups of themes, to guide my reading.
Dataset¶
I used Scrapy and Splash to scrape article title, content, date, author, and url from the tech sections of the following news sites:
- Vox
- Vice
- New York Times
- Wired
- The Atlantic
- BuzzFeed
- The Gradient
Project Goals¶
I aimed to find clusters in tech news articles based on the full content of the article and understand themes or constructs that can summarize the clusters.
Summary of findings¶
I identified 15 clusters and built an app to predict the cluster membership of new article content.
Full project¶
Data collection: News Scrapers¶
For this project, I wanted to scrape news articles from a few of the news sites I spend time on. Because I'm especially interested in the "tech" and "future" categories, I wanted to scrape articles directly from those section pages.
I like using Scrapy because of how easy it is to make modular scraping projects, structure scraped data, save to CSVs, and call spiders from different scripts.
I ran into problems with news sites that use JavaScript to dynamically render the list of articles in a section (rather than a 'next' button that leads to a new URL). The paginated case has some really good tutorials, but I had a harder time finding the best way to scrape JavaScript-rendered pages.
I decided to use Splash, a lightweight browser, to render the JavaScript on these sites before scraping them. After pulling the Splash Docker image and running the container locally, I was able to render HTML versions of the JavaScript pages and parse them with Scrapy spiders.
Splash is scripted with Lua, so I wrote a Lua script that clicks a "load more" button and returns the rendered HTML; the spider then follows all of the loaded article links on the page, each of which can be fetched with a normal scrapy.Request.
All of this happens in a normal scrapy.Spider object, so it can be crawled as usual, as long as the Splash Docker container is running locally.
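The project's settings.py isn't reproduced here, but a typical scrapy-splash setup looks roughly like the sketch below, assuming the Splash container is listening on the default port 8050 (started with docker run -p 8050:8050 scrapinghub/splash):
# Sketch of scrapy-splash settings (based on the scrapy-splash README), assuming
# Splash is running locally on the default port
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'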
Below is an example of the scraping process¶
The code to scrape the articles is located in the scrape_news
directory.
The settings file contains the settings for Scrapy and Splash, and determines which of the above publications to scrape.
from settings import *
Here is an example of one spider that was used to scrape articles from Vox
class VoxSpider(scrapy.Spider):
name = "vox_spider"
# Start URLs
start_urls = ["https://www.vox.com/future-perfect"]
npages = 5
# Getting multiple pages of articles
for i in range(2, npages):
start_urls.append("https://www.vox.com/future-perfect/archives/"+str(i)+"")
def parse(self, response):
for href in response.xpath("//div[contains(@class, 'c-compact-river__entry')]//a[contains(@class, 'c-entry-box--compact__image-wrapper')]//@href"):
url = href.extract()
yield scrapy.Request(url, callback=self.parse_dir_contents)
def parse_dir_contents(self, response):
item = ScrapeNewsItem()
# Getting title
item['title'] = response.xpath("//div[contains(@class, 'c-entry-hero c-entry-hero--default ')]//h1/descendant::text()").extract()[0]
# Getting date
item['date'] = response.xpath("//div[contains(@class, 'c-entry-hero c-entry-hero--default ')]//div[contains(@class, 'c-byline')]//time/descendant::text()").extract()[0].strip()
# Getting summary
item['summary'] = response.xpath("//div[contains(@class, 'c-entry-hero c-entry-hero--default ')]//p/descendant::text()").extract()[0]
# Getting author
item['author'] = response.xpath("//div[contains(@class, 'c-entry-hero c-entry-hero--default ')]//div[contains(@class, 'c-byline')]//span[contains(@class, 'c-byline__author-name')]/descendant::text()").extract()[0]
# Get URL
item['url'] = response.xpath("//meta[@property='og:url']/@content").extract()
# Get publication
item['publication'] = 'Vox'
# Get content
content_list = response.xpath("//div[contains(@class, 'c-entry-content')]//descendant::text()").extract()
content_list = [content_fragment.strip() for content_fragment in content_list]
item['content'] = ' '.join(content_list).strip()
yield item
Here is an example of a spider that uses SplashRequest and a Lua script
class NYTimesSpider(scrapy.Spider):
name = "nytimes_spider"
# Start URLs
start_urls = ["https://www.nytimes.com/section/technology"]
# Script to click on 'show more' button
script = """
function main(splash)
assert(splash:go(splash.args.url))
splash:wait(0.5)
local element = splash:select('div.css-1stvaey button')
local bounds = element:bounds()
element:mouse_click{x=bounds.width/2, y=bounds.height/2}
splash:wait(2)
local elementt = splash:select('div.css-1stvaey button')
local boundss = elementt:bounds()
elementt:mouse_click{x=boundss.width/2, y=boundss.height/2}
splash:wait(2)
return splash:html()
end
"""
def start_requests(self):
for url in self.start_urls:
yield SplashRequest(url=url, callback=self.parse, endpoint= 'execute', args={'lua_source': self.script})
def parse(self, response):
for href in response.xpath("//li[contains(@class, 'css-ye6x8s')]//a/@href"):
url = "https://www.nytimes.com/" + href.extract()
yield scrapy.Request(url, callback=self.parse_dir_contents)
def parse_dir_contents(self, response):
item = ScrapeNewsItem()
# Getting title
item['title'] = response.xpath("//h1[contains(@itemprop, 'headline')]//descendant::text()").extract()[0]
# Getting date
item['date'] = response.xpath("//time/descendant::text()").extract()[0]
# Getting summary
item['summary'] = 'NA'
# Getting author
item['author'] = response.xpath("//p[contains(@itemprop, 'author')]//span[contains(@itemprop, 'name')]/descendant::text()").extract()[0]
# Get URL
item['url'] = response.xpath("//meta[@property='og:url']/@content").extract()
# Get publication
item['publication'] = 'New York Times'
# Get content
content_list = response.xpath("//div[contains(@class, 'StoryBody') or contains(@class, 'story-body')]//p/descendant::text()").extract()
content_list = [content_fragment.strip() for content_fragment in content_list]
item['content'] = ' '.join(content_list).strip()
yield item
Running the scrapers¶
This code imports and runs the scrapy spiders, saving the data as .csv
in the data
folder (which is created by the settings file)
from scrape_news.scrape_news.spiders.spiders import *
from scrape_news.scrape_news.spiders.js_spiders import *
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess
import os
class Scraper:
def __init__(self):
# set scrapy settings variable to my settings file
os.environ['SCRAPY_SETTINGS_MODULE'] = 'settings'
settings_module_path = os.environ['SCRAPY_SETTINGS_MODULE']
settings = get_project_settings()
# Initialize crawler with these settings
self.process = CrawlerProcess(settings)
def run_spiders(self):
# Crawl a spider for each publication set in settings
for pub in PUBLICATIONS:
spider = eval(pub + "Spider")
self.process.crawl(spider)
# Start the process
self.process.start()
# Initialize and run the spiders
scraper = Scraper()
scraper.run_spiders()
Loading the data¶
From the data
folder
import os
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
def load_news_data():
"""
A function to load scraped news data from data folder
"""
# List of files
files = [f for f in os.listdir(DATA_PATH) if f.endswith(".csv")]
# List of data frames
file_list = []
# Append each data frame in files to the file_list
for filename in files:
df = pd.read_csv(os.path.join(DATA_PATH, filename))
file_list.append(df)
# Concatenate all the news data frames
df_full = pd.concat(file_list, join='outer').drop_duplicates().reset_index().drop(columns='index')
return df_full
# Drop NAs in content column
news_df = load_news_data().dropna(subset=['content'])
news_df.shape
Text processing¶
The first thing to do is clean and process the text in the content
column. This process involves:
- tokenizing the content (splitting it into its constituent words)
- removing stopwords and pronouns
- reducing words to their lemmas (their root forms)
I used SpaCy for this. SpaCy comes with pre-built models that tokenize text and, for each token, determine whether it is alphabetic or a stop word, identify its part of speech and lemma, and more.
import spacy
# Initialize spacy with the english model
sp = spacy.load('en_core_web_sm')
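As a quick illustration of the token attributes used below (exact tags and lemmas depend on the model version), each token SpaCy produces knows whether it is alphabetic, whether it is a stop word, and what its lemma is:
# A toy sentence, just to show the per-token attributes used in clean_string() below
doc = sp("The robots are learning quickly")
for token in doc:
    print(token.text, token.is_alpha, token.is_stop, token.pos_, token.lemma_)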
The function below tokenizes a text string and determines whether each token is a pronoun, an alphabetic token, or a stop word. Then, if it is NOT a pronoun, IS alphabetic, and is NOT a stop word, it appends the LEMMA of the word to a new, cleaned string.
from functools import lru_cache
@lru_cache(maxsize = None)
def clean_string(text_string):
'''
A function to clean a string using SpaCy, removing stop-words, non-alphanumeric characters, and pronouns
Argument: a text string
Output: a cleaned string
'''
# Parse the text string using the english model initialized earlier
doc = sp(text_string)
# Initialize an empty list of cleaned tokens
clean = []
# Add each token to the list if it is not a stop word, is alphanumeric, and if it's not a pronoun
for token in doc:
if token.is_alpha == False or token.is_stop == True or token.lemma_ == '-PRON-':
pass
else:
clean.append(token.lemma_)
# Join the list into a string
clean = " ".join(clean)
return clean
Here's an example of how the text cleaning works:¶
example = news_df.loc[2,'content']
example_clean = clean_string(example)
print("Raw example: \n" + example[:500]+ "\n \n")
print("Clean exmaple: \n" + example_clean[:500])
Processing all the article content¶
Next I cleaned and appended the cleaned content to the news data frame
def clean_content(df):
'''
A function to clean all of the content strings in a corpus
Argument: a dataframe with the column 'content'
Output: the same dataframe with a new 'clean_content' column
'''
# Initialize list of cleaned content strings
clean_content= []
# Call clean_string() for each row in the data frame and append to clean_content list
for row in df.content:
clean_content.append(clean_string(row))
# Append clean_content list to the data frame
df['clean_content'] = clean_content
return df
news_df = clean_content(news_df)
Vectorization of text data¶
Now it's time to create the feature matrix that will be used in the KMeans model, which involves vectorizing the text data.
Sometimes people use a bag-of-words representation, which means creating a vector for each document in which each element is the count, in that document, of one of the words that appears anywhere in the corpus.
Instead of using raw counts, you can also scale the frequency of each word in a document by how rare it is across the whole corpus. If a word appears a lot in some documents but not in most, it is likely to be informative; if it appears often in most documents, it likely isn't. This is called tf-idf (term frequency-inverse document frequency) vectorization. The TfidfVectorizer object multiplies each document's term frequencies by the idf (inverse document frequency) vector, which gives lower scores to terms that appear frequently across all documents.
The vectorizer also applies L2 normalization to each document's vector of tf-idf values, so that the length of a document does not change its representation.
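As a toy illustration of the idf weighting described above (not the project corpus), a word that appears in every document ends up with a lower idf than words that distinguish documents:
from sklearn.feature_extraction.text import TfidfVectorizer
# Tiny example corpus: 'robots' appears in every document
toy_corpus = ["robots build cars",
              "robots write news",
              "robots play chess"]
toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_corpus)
# 'robots' gets a lower idf than document-specific words like 'cars' or 'chess'
for term, idf in zip(toy_vectorizer.get_feature_names(), toy_vectorizer.idf_):
    print(term, round(idf, 2))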
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(analyzer = 'word', min_df = 5, ngram_range = (1,3), max_df = 0.15)
I initialized the vectorizer to keep only words that appear in at least 5 documents, in order to cut down on the number of tokens. Additionally, for unsupervised learning, scikit-learn recommends removing very common words, so I dropped words that appear in more than 15% of documents.
N-grams
The TfidfVectorizer can calculate a tf-idf value for individual words (1-grams), or it can also tokenize combinations of words (n-grams) rather than just single words. I chose 1-, 2-, and 3-grams because going further can cause the number of features to explode. A toy example of what this produces is shown below.
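To make the n-gram setting concrete, here's what a vectorizer with ngram_range = (1,3) extracts from a single three-word string (a toy example, not the project vocabulary):
from sklearn.feature_extraction.text import TfidfVectorizer
demo = TfidfVectorizer(ngram_range = (1,3))
demo.fit(["machine learning model"])
print(demo.get_feature_names())
# ['learning', 'learning model', 'machine', 'machine learning', 'machine learning model', 'model']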
Below, the number of rows is the number of documents in the corpus, and the number of columns is the number of terms (1-, 2-, and 3-grams) extracted from it.
X = vectorizer.fit_transform(news_df['clean_content'])
X.shape
Inspecting the text features¶
Word clouds for visualizing terms with highest and lowest tf-idf scores
I inspected which terms tf-idf scored as most and least important for distinguishing specific documents.
import numpy as np
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline
# Get the list of terms in the vocabulary
terms = vectorizer.get_feature_names()
# Make the sparse tf-idf matrix dense
tfidf = X.todense()
# Get the maximum tf-idf weight of each term across documents and flatten to a 1D array
tfidf_max = np.array(tfidf.max(axis = 0)).ravel()
def plot_tfidf_word_cloud(frequencies = tfidf_max, terms = terms, most_frequent = True):
    '''
    Plots a word cloud of the terms with the highest or lowest maximum tf-idf scores
    Arguments:
    frequencies: a (n_terms,) sized array of maximum tf-idf scores
    terms: a list or array of terms
    most_frequent: if True, plots the highest-scoring terms; if False, the lowest-scoring terms
    Output: a word cloud
    '''
    # Sort the score array in ascending order and return an array of indices
    sorted_frequencies = frequencies.argsort()
    # Create a word cloud from the 75 highest- or lowest-scoring terms, joined into a string
    if most_frequent:
        cloud = WordCloud(background_color = 'white').generate(text=' '.join(np.array(terms)[sorted_frequencies[-75:]]))
        title = "Features with highest tf-idf"
    else:
        cloud = WordCloud(background_color = 'white').generate(text=' '.join(np.array(terms)[sorted_frequencies[:75]]))
        title = "Features with lowest tf-idf"
    plt.figure(figsize = [10,15])
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(title, size = 20)
    plt.show();
plot_tfidf_word_cloud()
plot_tfidf_word_cloud(most_frequent = False)
Lower maximum tf-idf scores indicate less informative terms; higher scores indicate more informative ones.
We can see that the words with higher tf-idf scores distinguish individual topics, such as specific people (Bezos, Hitler, Olsen), specific machine learning or coding topics (brain, hack, processing), or current events (malaria, suicide, wind).
The words with lower tf-idf scores are clearly more generic and don't distinguish specific topics. Interestingly, AI is in this group, which suggests that the term "AI" is used so widely and indiscriminately in this corpus that it doesn't distinguish specific documents. That may say something about how the term is used, or about the general themes of the corpus.
Time for KMeans clustering!¶
Next I ran this unsupervised clustering algorithm to discover groups in the text data
Choosing the best number of clusters¶
Without ground-truth labels with which to evaluate KMeans, determining parameters like the number of clusters can be difficult.
One way to evaluate KMeans without ground truth is the within-cluster sum of squared errors, also called distortion or inertia. Lower scores are generally better because they mean tighter clusters.
Another way is the silhouette score, which compares each point's mean distance to its own cluster with its mean distance to the nearest neighboring cluster. It ranges from -1 to 1: scores close to 1 indicate that points sit well inside their own cluster, 0 means clusters overlap, and -1 means points are closer to a neighboring cluster than to their own.
The best number of clusters will put a bend (an "elbow") in the SSE curve and also maximize the silhouette score.
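For reference, this is roughly the per-sample quantity that the silhouette score averages (the scikit-learn functions used below compute it for every sample):
def silhouette_value(a, b):
    '''
    a: mean distance from a sample to the other points in its own cluster
    b: mean distance from the sample to the points in the nearest other cluster
    '''
    # Close to 1 when b >> a (well separated), 0 when a == b, negative when a > b
    return (b - a) / max(a, b)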
from sklearn.externals import joblib
news_df = pd.read_csv('data/clean_content.txt')
vectorizer = joblib.load('models/tfidf_vectorizer.sav')
from sklearn.cluster import MiniBatchKMeans, KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.ticker as ticker
import seaborn as sns
import matplotlib.pyplot as plt
def choose_k(max_k, X):
'''
A function to display plots of the average silhouette score and the SSE (inertia) for various numbers of clusters
Arguments:
max_k = the maximum number of clusters to test
X = the document-term matrix to train KMeans
Output: line plots of both silhouette score and SSE
'''
# Initialize empty lists
distortions = []
sil_score = []
# For each value of k, initialize and fit a MiniBatchKMeans and append the silhouette score and SSE to the lists
for k in range(2, max_k):
kmeans = MiniBatchKMeans(n_clusters = k, init = 'k-means++', max_iter = 1000, random_state=42)
kmeans.fit(X)
sil_score.append(silhouette_score(X, kmeans.labels_))
distortions.append(kmeans.inertia_)
# Plot each score for each number of clusters
sns.set(style="whitegrid")
plt.figure(figsize=[15,5])
distortions_plot = sns.lineplot(x= range(2,max_k),y= distortions)
plt.ylabel("Sum of Squared Errors (distortion)")
plt.xlabel("Number of clusters")
distortions_plot.xaxis.set_major_locator(ticker.MultipleLocator(1))
plt.show()
plt.close()
plt.figure(figsize=[15,5])
silhouette_plot = sns.lineplot(x= range(2,max_k),y= sil_score)
plt.ylabel("Silhouette score")
plt.xlabel("Number of clusters")
silhouette_plot.xaxis.set_major_locator(ticker.MultipleLocator(1))
plt.show()
choose_k(25, X)
I chose 14 clusters because, around 14 to 15 clusters, there is a large drop in SSE and an increase in the average silhouette score.
k = 14
Training the KMeans model¶
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = k, init = 'k-means++', random_state= 42)
cluster_labels = kmeans.fit_predict(X)
import matplotlib.cm as cm
def plot_silhouette_scores(X, k, cluster_labels):
'''
A function to plot the silhouette scores in each cluster of the kmeans model
'''
# An array of the silhouette score of each sample (article)
sample_sil_values = silhouette_samples(X, cluster_labels)
fig, (ax1) = plt.subplots(figsize = [10,6])
# For each cluster, plot the silhouette scores for each sample
y_lower = 10
for i in range(k):
# Values for each cluster
ith_cluster_sil_values = sample_sil_values[cluster_labels == i]
ith_cluster_sil_values.sort()
# How many samples are in that cluster
size_cluster_i = ith_cluster_sil_values.shape[0]
# The upper limit of that cluster group is the lower limit plus the size of the cluster
y_upper = y_lower + size_cluster_i
color = cm.nipy_spectral(float(i) / k)
# Fill the length of the silhouette score on the x axis, on the y axis between the upper and lower limits of the cluster group
ax1.fill_betweenx(np.arange(y_lower, y_upper),
0, ith_cluster_sil_values,
facecolor=color, edgecolor=color, alpha=0.7)
# Label the silhouette plots with their cluster numbers at the middle
ax1.text(-0.08, y_lower + 0.5 * size_cluster_i, str(i))
# Compute the new y_lower for next plot
y_lower = y_upper + 10 # 10 for the 0 samples
ax1.set_yticks([])
ax1.set_xlim([-0.1, 0.4])
ax1.grid(False)
ax1.set_ylabel("Cluster Label")
ax1.set_xlabel("Silhouette scores")
ax1.set_title("Silhouette Plot for Various Clusters")
plot_silhouette_scores(X, k, cluster_labels)
Higher silhouette scores mean more distinct clusters. Negative scores mean that a sample is closer to the center of another cluster than to its own. We can see that cluster 12 has many negative scores: it's not very distinct.
Top words by cluster¶
Next, I printed the 1-gram terms that contribute most to the centroid of each cluster. The centroid of each cluster is a coordinate with as many dimensions as there are terms (features), and the terms with the highest coordinate values are the most important features for that cluster.
def print_top_words_by_cluster(k):
'''
A function to plot the top terms in each cluster of the kmeans model
'''
# The high-dimensional coordinates of the center of each cluster (shape is k by the number of features)
# Within each cluster (row), sort the feature coordinates in descending order and return an array of indices
sorted_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
for i in range(k):
print("Cluster %d:" % i, end='')
for ind in sorted_centroids[i, :12]:
if terms[ind].find(' ') == -1:
print(' %s' % terms[ind], end='')
print()
print("Top words in each cluster: \n" )
print_top_words_by_cluster(k)
KMeans clustering, because it's unsupervised, can sometimes cluster on constructs that we aren't interested in. However, looking at the top words in each cluster, it appears that each cluster represents a reasonable topic within the realm of the articles I was reading.
Dimensionality reduction using SVD¶
However, some of the included words make it clear that the semantics aren't always accurately represented; for instance, in Cluster 2 the concepts of organ donation and philanthropy are conflated. Tf-idf vectorization should convey some semantic information because it weights terms by their importance within documents, which presumably cover different topics. But how can we do better?
I decided to try doing a dimensionality reduction technique called SVD, which is often used for latent semantic analysis and works on tf-idf matrices. This way, latent patterns in the term frequencies can be identified, which may convey more semantic information that is then used in the KMeans clustering.
Here, I reduce the number of features to 100
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.externals import joblib
import pandas as pd
news_df = pd.read_csv('data/clean_content.txt')
vectorizer = joblib.load('models/tfidf_vectorizer.sav')
X = vectorizer.transform(news_df['clean_content'])
svd = TruncatedSVD(n_components=100, random_state=42)
normalizer = Normalizer(copy=False)
svd_pipe = make_pipeline(svd, normalizer)
X_svd = svd_pipe.fit_transform(X)
X_svd.shape
Next, I checked what the ideal number of clusters is for the data set with reduced dimensions, using the same method as before.
choose_k(25, X_svd)
It looks like 11 clusters is a good number.
k_svd = 11
kmeans_svd = KMeans(n_clusters = k_svd, init = 'k-means++', random_state= 42)
cluster_labels_svd = kmeans_svd.fit_predict(X_svd)
plot_silhouette_scores(X_svd, k_svd, cluster_labels_svd)
Because the cluster centers now correspond to SVD components rather than to terms, I printed the most important words in each cluster using the highest tf-idf values within each cluster instead.
terms = vectorizer.get_feature_names()
tfidf = X.todense()
def print_max_tfidf_per_term_per_cluster(cluster_labels):
    '''
    A function to print the terms with the highest maximum tf-idf values within each cluster
    '''
    tfidf_df = pd.DataFrame(tfidf, columns = terms)
    tfidf_df['cluster_label'] = cluster_labels
    max_tfidf_per_term_per_cluster = {}
    # For each cluster, take the maximum tf-idf value of every term across that cluster's documents
    for label in tfidf_df.cluster_label.unique():
        max_tfidf_per_term_per_cluster[label] = np.array(
            tfidf_df[tfidf_df.cluster_label == label].drop(columns='cluster_label').max(axis = 0)).ravel()
    max_tfidf_per_term_per_cluster = pd.DataFrame(max_tfidf_per_term_per_cluster)
    for i in range(k_svd):
        # Indices of the 15 terms with the highest maximum tf-idf in this cluster
        sorted_tfidf = np.array(max_tfidf_per_term_per_cluster[i].argsort())[::-1][:15]
        print("Cluster %d:" % i, end='')
        for ind in sorted_tfidf:
            # Only print single-word (1-gram) terms
            if terms[ind].find(' ') == -1:
                print(' %s' % terms[ind], end='')
        print()
print_max_tfidf_per_term_per_cluster(cluster_labels_svd)
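An alternative I didn't pursue here would be to map the SVD-space cluster centers back into term space with the fitted SVD; a rough sketch, using the svd and kmeans_svd objects defined above:
# Approximate term-space coordinates of each SVD-space cluster center
centers_in_term_space = svd.inverse_transform(kmeans_svd.cluster_centers_)
# For each cluster, print the terms with the largest reconstructed weights
for i, center in enumerate(centers_in_term_space):
    top_terms = np.array(terms)[center.argsort()[::-1][:10]]
    print("Cluster %d:" % i, ' '.join(top_terms))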
Interestingly, I think the clusters learned from the high-dimensional dataset are more cohesive than these. I think I need to consider more advanced techniques for learning conceptual clusters from text documents.
Plotting clusters using t-SNE¶
Next, I wanted to visualize the clusters learned from the high-dimensional data set in two dimensions. t-SNE is a dimensionality reduction technique that maps the features down to two dimensions and is mainly used for visualization.
from functools import lru_cache
from sklearn.manifold import TSNE
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
@lru_cache(maxsize = None)
def tsne_plot():
tsne=TSNE(random_state = 42)
# Array with 2 dimensions for each sample
tsne_features=tsne.fit_transform(tfidf)
# Data frame of cluster labels and t-SNE features for plotting
tsne_df = pd.DataFrame({'clusters':cluster_labels, 'tsne0':tsne_features[:,0], 'tsne1':tsne_features[:,1]})
tsne_df['clusters'] = tsne_df['clusters'].astype('category', ordered = True)
# Palette for plotting
palette = ['#363737', # dark grey
'#7e1e9c', # purple
'#380282', # indigo
'#0343df', # blue
'#95d0fc', # light blue
'#06c2ac', # turquoise
'#15b01a', # green
'#a7ffb5', # light green
'#c1f80a', # chartreuse
'#fff39a', # yellow
'#fac205', # gold
'#f97306', # orange
'#ac4f06', # brown
'#e50000', # red
]
sns.set(style="whitegrid")
# Scatterplot colored by cluster label
plt.figure(figsize = [15,15])
sns.scatterplot(data = tsne_df, x="tsne0", y="tsne1", hue="clusters", palette= palette)
plt.legend(loc='upper center', bbox_to_anchor=(1.1, 1), ncol=1, fontsize =15)
plt.xlabel("t-SNE Label 0", size = 15)
plt.ylabel("t-SNE Label 1", size = 15)
plt.show()
warnings.filterwarnings('ignore')
tsne_plot()
Just as we saw in the silhouette plot, cluster 12 is very spread out. In general, it seems like this cluster isn't that informative.
Word cloud for each cluster¶
Visualization of the printed terms above
sorted_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
def print_word_cloud_per_cluster(j=2):
if j > k:
return
else:
f, axs = plt.subplots(1,2, figsize = [15, 5])
# for each of the two axes, create a word cloud for that cluster's most important terms, and display it
for ax, i in zip(axs, range(j-2, j)):
cloud = WordCloud(background_color='white').generate(text=(' '.join(np.array(terms)[sorted_centroids[i, :75]])))
title = "Important Words in Cluster {} \n".format(i)
ax.imshow(cloud, interpolation="bilinear")
ax.set_title(title, size = 20)
ax.grid(False)
ax.axis("off")
plt.show()
# Recurse to plot the next pair of clusters
print_word_cloud_per_cluster(j=j+2)
print_word_cloud_per_cluster()
Prediction with new document content¶
import os
from sklearn.externals import joblib
import spacy
sp = spacy.load('en_core_web_sm')
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# vectorizer = joblib.load('models/tfidf_vectorizer.sav')
# kmeans = joblib.load('models/kmeans.sav')
def predict(input_string = None, filename = None):
    '''
    A function to clean, vectorize, and predict the kmeans cluster of a new document
    Arguments: a string containing the content of an article OR the 'filename.txt' of a text file
    Output: a cluster prediction and a word cloud for the document
    '''
    # If a filename is given, read its contents; otherwise use the string that was passed in
    if filename is not None:
        with open(os.path.join(PROJ_ROOT_DIR, "content_to_predict", filename), 'r') as file:
            input_string = file.read()
        print('Processing file')
    else:
        print('Processing text string')
    # Clean the string
    clean = clean_string(input_string)
    # Vectorize the string
    Y = vectorizer.transform([clean])
    # Predict the cluster label
    prediction = kmeans.predict(Y)
    # Generate the word cloud
    cloud = WordCloud(background_color = 'white').generate(clean)
    # Add the prediction to the title of the word cloud
    title = "Cluster Prediction for Article {}".format(prediction)
    # Show the word cloud
    plt.figure(figsize = [10,15])
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(title, size = 20)
    plt.show();
predict(input_string = '''Are we living in a computer simulation? The question seems absurd. Yet there are
plenty of smart people who are convinced that this is not only possible but perhaps likely. In an influential
paper that laid out the theory, the Oxford philosopher Nick Bostrom showed that at least one of three
possibilities is true: 1) All human-like civilizations in the universe go extinct before they develop the
technological capacity to create simulated realities; 2) if any civilizations do reach this phase of
technological maturity, none of them will bother to run simulations; or 3) advanced civilizations would have
the ability to create many, many simulations, and that means there are far more simulated worlds than
non-simulated ones.''')