Using unsupervized learning to cluster news articles based on content¶

Several years ago, I first stumbled on the phrase that AI poses “an existential threat” to humanity. I didn’t know at the time that that meant a threat to human existence. For some reason, I found that claim to be satisfyingly bold, and I wanted to know more. Embarking into a clickhole typical of my obsessive nature, I subscribed to some news sites on AI and started reading way more tech news.

In my reading of tech news, I struggled to identify topics or themes that interested me and were worth my time.

While tags can be helpful to get a sense of an article outside of the title, many of them seem to reference vague, poorly defined or refined terms. Rather than relying on these tags, I wanted to see if unsupervised learning could help me identify themes or groups of themes to assist in my reading.

Dataset¶

I used Scrapy and Splash to scrape article title, content, date, author, and url from the tech sections of the following news sites:

Vox
Vice
New York Times
Wired
The Atlantic
BuzzFeed
The Gradient

Project Goals¶

I aimed to find clusters in tech news articles based on the full content of the article and understand themes or constructs that can summarize the clusters.

Summary of findings¶

I identfied 15 clusters and built an app to predict cluster membership of new article content.

Full project¶

Data collection: News Scrapers¶

For this project, I wanted to scrape news articles from a few of the news sites I spend time on. Because I'm interested especially in categories in "tech" or "future", I wanted to scrape these articles directly from these pages.

I like using Scrapy because of how easy it is to make modular scraping projects, structure scraped data, save to CSVs, and call spiders from different scripts.

I ran into some problems with some news sites using JavaScript to dynamically render a list of news articles in a section (rather than pressing a 'next' button and getting a URL). While this latter case has some really good tutorials to deal with this situation, I had a harder time discovering the best way to scrape the JavaScript rendered pages.

I decided to use Splash, a lightweight browser, to render the JavaScript of these websites before scraping them. After pulling Splash as a Docker image and running the container locally, I was able to render html versions of JavaScript pages to parse with Scrapy spiders.

Splash uses Lua scripts to interact with JavaScript elements, so I wrote a Lua script that clicks on a "load more" button, returns the html to parse, and follows all the loaded article links on the page, which can then be accessed with a normal scrapy.Request.

All of this happens in a normal scrapy.Spider object, so it can be crawled normally, as long as the Splash Docker image is running locally.

Below is an example of the scraping process¶

The code to scrape the articles is located in the scrape_news directory.

The settings file contains the settings for scrapy and splash, as well as detemines which of the above publications to scrape from.

In []:

from settings import *

Here is an example of one spider that was used to scrape articles from Vox

In []:

class VoxSpider(scrapy.Spider):
    name = "vox_spider"

    # Start URLs
    start_urls = ["https://www.vox.com/future-perfect"]

    npages = 5

    # Getting multiple pages of articles
    for i in range(2, npages):
        start_urls.append("https://www.vox.com/future-perfect/archives/"+str(i)+"")

    def parse(self, response):
        for href in response.xpath("//div[contains(@class, 'c-compact-river__entry')]//a[contains(@class, 'c-entry-box--compact__image-wrapper')]//@href"):
            url = href.extract()
            yield scrapy.Request(url, callback=self.parse_dir_contents)
        
    def parse_dir_contents(self, response):
        item = ScrapeNewsItem()

        # Getting title
        item['title'] = response.xpath("//div[contains(@class, 'c-entry-hero c-entry-hero--default ')]//h1/descendant::text()").extract()[0]

        # Gettings date
        item['date'] = response.xpath("//div[contains(@class, 'c-entry-hero c-entry-hero--default ')]//div[contains(@class, 'c-byline')]//time/descendant::text()").extract()[0].strip()

        # Getting summary
        item['summary'] = response.xpath("//div[contains(@class, 'c-entry-hero c-entry-hero--default ')]//p/descendant::text()").extract()[0]

        # Getting author
        item['author'] = response.xpath("//div[contains(@class, 'c-entry-hero c-entry-hero--default ')]//div[contains(@class, 'c-byline')]//span[contains(@class, 'c-byline__author-name')]/descendant::text()").extract()[0]

        # Get URL
        item['url'] = response.xpath("//meta[@property='og:url']/@content").extract()

        # Get publication
        item['publication'] = 'Vox'

        # Get content
        content_list = response.xpath("//div[contains(@class, 'c-entry-content')]//descendant::text()").extract()
        content_list = [content_fragment.strip() for content_fragment in content_list]
        item['content'] = ' '.join(content_list).strip()

        yield item

Here is an example of a spider that uses SplashRequest and a Lua script

In []:

class NYTimesSpider(scrapy.Spider):
    name = "nytimes_spider"


    # Start URLs
    start_urls = ["https://www.nytimes.com/section/technology"]
    

    # Script to click on 'show more' button
    script = """
        function main(splash)
            assert(splash:go(splash.args.url))
            splash:wait(0.5)
            local element = splash:select('div.css-1stvaey button')
            local bounds = element:bounds()
            element:mouse_click{x=bounds.width/2, y=bounds.height/2}
            splash:wait(2)
            local elementt = splash:select('div.css-1stvaey button')
            local boundss = elementt.bounds()
            elementt:mouse_click{x=boundss.width/2, y=boundss.height/2}
            splash:wait(2)
            return splash:html()
        end
    """


    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint= 'execute', args={'lua_source': self.script})

    
    def parse(self, response):
        for href in response.xpath("//li[contains(@class, 'css-ye6x8s')]//a/@href"):
            url = "https://www.nytimes.com/" + href.extract()
            yield scrapy.Request(url, callback=self.parse_dir_contents)
        
    def parse_dir_contents(self, response):
        item = ScrapeNewsItem()

        # Getting title
        item['title'] = response.xpath("//h1[contains(@itemprop, 'headline')]//descendant::text()").extract()[0]

        # Gettings date
        item['date'] = response.xpath("//time/descendant::text()").extract()[0]

        # Getting summary
        item['summary'] = 'NA'

        # Getting author
        item['author'] = response.xpath("//p[contains(@itemprop, 'author')]//span[contains(@itemprop, 'name')]/descendant::text()").extract()[0]

        # Get URL
        item['url'] = response.xpath("//meta[@property='og:url']/@content").extract()

        # Get publication
        item['publication'] = 'New York Times'

        # Get content
        content_list = response.xpath("//div[contains(@class, 'StoryBody') or contains(@class, 'story-body')]//p/descendant::text()").extract()
        content_list = [content_fragment.strip() for content_fragment in content_list]
        item['content'] = ' '.join(content_list).strip()

        yield item

Running the scrapers¶

This code imports and runs the scrapy spiders, saving the data as .csv in the data folder (which is created by the settings file)

In []:

from scrape_news.scrape_news.spiders.spiders import *
from scrape_news.scrape_news.spiders.js_spiders import *
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

class Scraper:
    def __init__(self):
        # set scrapy settings variable to my settings file
        os.environ['SCRAPY_SETTINGS_MODULE'] = 'settings'
        settings_module_path = os.environ['SCRAPY_SETTINGS_MODULE']
        settings = get_project_settings()
        # Initialize crawler with these settings
        self.process = CrawlerProcess(settings)
    def run_spiders(self):
        # Crawl a spider for each publication set in settings
        for pub in PUBLICATIONS:
            spider = eval(pub + "Spider")
            self.process.crawl(spider)
        # Start the process
        self.process.start()

# Initialize and run the spiders
scraper = Scraper()     
scraper.run_spiders()

Loading the data¶

From the data folder

In []:

import pandas as pd
import warnings
warnings.filterwarnings('ignore')

def load_news_data():
    """
    A function to load scraped news data from data folder
    """
    # List of files
    files = [f for f in os.listdir(DATA_PATH) if f.endswith(".csv")]
    
    # List of data frames
    file_list = []
    
    # Append each data frame in files to the file_list
    for filename in files:
        df = pd.read_csv(os.path.join(DATA_PATH, filename))
        file_list.append(df)
        
    # Concatenate all the news data frames
    df_full = pd.concat(file_list, join='outer').drop_duplicates().reset_index().drop(columns='index')
    
    return df_full

# Drop NAs in content column
news_df = load_news_data().dropna(subset=['content'])
news_df.shape

Out[]:

(1464, 7)

Text processing¶

First thing to do is clean and process the text in the content column. This process involves:

tokenized the content (splitting it into its constituent words)
removed stopwords and proper nouns
reduce words to their lemmas (the root word)

I used SpaCy for this. SpaCy comes with a bunch of pre-built models that tokenize words and also determine if they are alphanumeric, stop words, find their part of speech, lemma, etc.

In []:

import spacy

# Initialize spacy with the english model
sp = spacy.load('en_core_web_sm')

The function below tokenizes a text string and determines if each token is a proper noun, alphanumeric character, or stop word. Then if it is NOT a proper noun, IS alphanumeric, and is NOT a stop word, it appends the LEMMA of the word to a new, cleaned string.

In []:

from functools import lru_cache

@lru_cache(maxsize = None)
def clean_string(text_string):
    '''
    A function to clean a string using SpaCy, removing stop-words, non-alphanumeric characters, and pronouns

    Argument: a text string
    Output: a cleaned string

    '''

    # Parse the text string using the english model initialized earlier
    doc = sp(text_string)
    
    # Initialize empty string
    clean = []

    # Add each token to the list if it is not a stop word, is alphanumeric, and if it's not a pronoun
    for token in doc:
        
        if token.is_alpha == False or token.is_stop == True or token.lemma_ == '-PRON-':
            pass
        else:
            clean.append(token.lemma_)

    # Join the list into a string
    clean = " ".join(clean)

    return clean

Here's an example of how the text cleaning works:¶

In []:

example = news_df.loc[2,'content']
example_clean = clean_string(example)
print("Raw example: \n" + example[:500]+ "\n \n")
print("Clean exmaple: \n" + example_clean[:500])

Out[]:

Raw example: 
Progress in the field of Natural Language Processing (NLP) depends on the existence of language resources: digitized collections of written, spoken or signed language, often with gold standard labels or annotations reflecting the intended output of the NLP system for the task at hand (e.g. the gold standard text for a speech recognition system or gold standard user intent labels in a dialogue system such as Siri, Alexa or Google Home). Unsupervised, weakly supervised, semi-supervised, or distant
 

Clean exmaple: 
progress field natural language processing nlp depend existence language resource digitize collection write speak sign language gold standard label annotation reflect intend output nlp system task hand gold standard text speech recognition system gold standard user intent label dialogue system siri alexa google home unsupervised weakly supervised semi supervised distantly supervise machine learning technique reduce overall dependence label datum approach need sufficient label datum evaluate syst

Processing all the article content¶

Next I cleaned and appended the cleaned content to the news data frame

In []:

def clean_content(df):
    '''
    A function to clean all the strings in a whole of a corpus

    Argument: a dataframe with the column 'content'
    Ouput: same dataframe with a new 'cleaned_content' column
    '''

    # Initialize list of cleaned content strings
    clean_content= []

    # Call clean_string() for each row in the data frame and append to clean_content list
    for row in df.content:

        clean_content.append(clean_string(row))

    # Append clean_content list to the data frame
    df['clean_content'] = clean_content

    return df 

news_df = clean_content(news_df)

Vectorization of text data¶

Now it's time to create the feature matrix that will be used in the KMeans model, which involves vectorizing the text data.

Sometimes people use a bag-of-words representation, which means creating a vector for each document in which each element represents the frequency, in that document, of every word that appears in the corpus.

Instead of doing a pure count, you can also scale the frequency of each word in a document by it's relative frequency in all documents. If a word appears a lot in some documents but not in most, it is likely to be informative; if it appears often in most documents, it likely isn't. This is called tf-idf (term-frequency inverse-document-frequency) vectorization. The TfidfVectorizer object multiplies the idf (inverse document frequency) vector by the document term frequency, which gives a lower score to terms that appear more frequently across all documents.

This function also applies L2 regularization to the vector of tf-idf values, which makes it so that the length of the document does not change the representation.

In []:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(analyzer = 'word', min_df = 5, ngram_range = (1,3), max_df = 0.15)

I initialized the vectorizer with settings to use only words that appear in more than 5 documents, in order to cut back on tokens. Additionally, for unsuperized learning, sklearn recommends taking out common words, so I took out words that appear in more than 15% of documents

N-grams

The Tfidf Vectorizer will calculate a tf-idf value for individual words (1-grams), or it can tokenize on combinations of words (n-grams) instead of just single words. I chose to do 1-, 2-, and 3-grams because doing more can cause the number of features to explode.

Below the rows is the number of documents in the corpus, and the columns are the number of terms (1-, 2-, and 3-grams) in the corpus.

In []:

X = vectorizer.fit_transform(news_df['clean_content'])
X.shape

Out[]:

(1464, 26942)

Inspecting the text features¶

Word clouds for visualizing terms with highest and lowest tf-idf scores

I inspected which words tf-idf found most important and least important terms in distinguishing specific documents.

In []:

# Get the list of terms in the vocabulary
terms = vectorizer.get_feature_names() 
# Make the matrix dense 
tfidf = X.todense() 
# Get the max tfidf weight of each term and flatten to a 1D array
tfidf_max = np.array(tfidf.max(axis = 0)).ravel()

In []:

import numpy as np
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline
def plot_tfidf_word_cloud(frequencies = tfidf_max, terms = terms, most_frequent = True):

    '''
    Plots a word cloud for the most important or least frequent words in a list
    
    Arguments: 
    frequencies: a (terms,) sized array of frequencies
    terms: a list or array of terms
    most_frequent: if True, plots most frequent, if False, plots least frequent 
    
    Ouput: a word cloud
    
    '''
    # Sort frequency array in ascending and returns an array of indices
    sorted_frequencies = frequencies.argsort()

    # Create word cloud with highest or lowest 75 terms, joined into a string
    if most_frequent == True:
        cloud = WordCloud(background_color = 'white').generate(text=(' '.join(np.array(terms)[sorted_frequencies[-75:]])))
        title = "Features with highest tf-idf"
    else:
        cloud = WordCloud(background_color = 'white').generate(text = (' '.join(np.array(terms)[sorted_frequencies[75:]])))
        title = "Features with lowest tf-idf"
    
    plt.figure(figsize = [10,15])
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(title, size =20)
    plt.show();


plot_tfidf_word_cloud()
plot_tfidf_word_cloud(most_frequent = False)

Out[]:

Lower scores mean less informative, higher mean more

We can see that the words with higher tf-idf scores distinguish individual topics, such as individual people (Bezos, Hitler, Olsen), specific machine learning or coding (brain, hack, processing), or current events (malaria, suicide, wind).

The words with lower tf-idf scores are clearly more generic and don't distinguish specific topics. Iterestingly, AI is in this group, which indicates that the term "AI" is used indiscriminately and widely and does not distinguish specific documents in this corpus, which may indicate something about the use of the words or about the general theme of topics in this corpus.

Time for KMeans clustering!¶

Next I ran this unsupervised clustering algorithm to discover groups in the text data

Choosing the best number of clusters¶

Without ground truth labels with which to evaluate kmeans, detemining parameters, like number of clusters, can be difficult.

One way to evaluate kmeans without ground truth is the within cluster sum of squared errors (distortion). Lower scores are generally better because they mean tighter clusters.

Another way is through silhouette scores, which compares the mean distance of points to thier own cluster to points in the neighboring cluster. It is a measure between [-1,1], where scores closers to 1 indicated points are closer to their own cluster, 0 means the clusters are overlapping, and -1 means points are closer to their neighboring cluster than thier own cluster.

The best number of clusters will cause the SSE curve to bend, and also maximize the silhouette score.

In []:

from sklearn.externals import joblib
news_df = pd.read_csv('data/clean_content.txt')
vectorizer = joblib.load('models/tfidf_vectorizer.sav')

from sklearn.cluster import MiniBatchKMeans, KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.ticker as ticker
import seaborn as sns

import matplotlib.pyplot as plt

def choose_k(max_k, X):
    '''
    A function to display plots of average silhouette score and average SSE (inertia) for various numbers of clusters

    Arguments: 
    max_k = the maximum number of clusters to test 
    X = the document-term matrix to train KMeans

    Ouput: line plots for both silhouette scores and SSE
    '''

    # Initialize empty lists
    distortions = []
    sil_score = []

    # For each value of k, initialize and fit a MiniBatchKMeans and append the sillhouette score and SSE to the lists
    for k in range(2, max_k):
        kmeans = MiniBatchKMeans(n_clusters = k, init = 'k-means++', max_iter = 1000, random_state=42)
        kmeans.fit(X)
        sil_score.append(silhouette_score(X, kmeans.labels_))
        distortions.append(kmeans.inertia_)

    # Plot each score for each number of clusters
    sns.set(style="whitegrid")
    
    plt.figure(figsize=[15,5])
    distortions_plot = sns.lineplot(x= range(2,max_k),y= distortions)
    plt.ylabel("Sum of Squared Errors (distortion)")
    plt.xlabel("Number of clusters")
    distortions_plot.xaxis.set_major_locator(ticker.MultipleLocator(1))
    plt.show()
    plt.close()

    plt.figure(figsize=[15,5])
    silhouette_plot = sns.lineplot(x= range(2,max_k),y= sil_score)
    plt.ylabel("Silhouette score")
    plt.xlabel("Number of clusters")
    silhouette_plot.xaxis.set_major_locator(ticker.MultipleLocator(1))
    plt.show()

In []:

choose_k(25, X)

Out[]:

I choose 14 clusters becuase going from at 15 clusters there is a big drop in SSE and increase in average silhouette score

In []:

k = 14

Training the KMeans model¶

In []:

from sklearn.cluster import KMeans
kmeans =  KMeans(n_clusters = k, init = 'k-means++', random_state= 42)

cluster_labels = kmeans.fit_predict(X)

Inspecting the clusters¶

Silhouette scores¶

First, I inspected the distinctness of the clusters by plotting the silhouette values of each document.

In []:

import matplotlib.cm as cm

  
def plot_silhouette_scores(X, k, cluster_labels):
    '''
    A function to plot the silhouette scores in each cluster of the kmeans model
    '''

    # An array of the silhouette score of each sample (article) 
    sample_sil_values = silhouette_samples(X, cluster_labels)

    fig, (ax1) = plt.subplots(figsize = [10,6])

    # For each cluster, plot the silhouette scores for each sample
    y_lower = 10
    for i in range(k):

        # Values for each cluster
        ith_cluster_sil_values = sample_sil_values[cluster_labels == i]

        ith_cluster_sil_values.sort()

        # How many samples are in that cluster
        size_cluster_i = ith_cluster_sil_values.shape[0]

        # The upper limit of that cluster group is the lower limit plus the size of the cluster
        y_upper = y_lower + size_cluster_i
        
        color = cm.nipy_spectral(float(i) / k)

        # Fill the length of the silhouette score on the x axis, on the y axis between the upper and lower limits of the cluster group 
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_sil_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.08, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_yticks([])
    ax1.set_xlim([-0.1, 0.4])
    ax1.grid(False)
    ax1.set_ylabel("Cluster Label")
    ax1.set_xlabel("Silhouette scores")
    ax1.set_title("Silhouette Plot for Various Clusters")

In []:

plot_silhouette_scores(X, k, cluster_labels)

Out[]:

Higher silhouette scores mean more distinct clusters. Negative scores mean that that sample is closer to the center of another cluster than it's own center. We can see that cluster 12 has lots of negative scores: it's not very distinct.

Top words by cluster¶

Next, I printed several 1-gram terms that are the most important in determining the centroid of each cluster. The centroid of each cluster is a coordinate that has as many dimensions as there are terms (features). The terms with the highest coordinate values represent the features with the most import for that cluster.

In []:

def print_top_words_by_cluster(k):

    '''
    A function to plot the top terms in each cluster of the kmeans model
    '''
    # The high dimensional coordinates of the center of each cluster (shape is 15 by the number of features)
    # Within each cluster (row) sort the coordinate of each features them in descending order and return array of indices
    sorted_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]

    for i in range(k):
        print("Cluster %d:" % i, end='')
        for ind in sorted_centroids[i, :12]:
            if terms[ind].find(' ') == -1:
                print(' %s' % terms[ind], end='')
        print()

print("Top words in each cluster: \n" )
print_top_words_by_cluster(k)

Out[]:

Top words in each cluster: 

Cluster 0: facial recognition police surveillance camera ban
Cluster 1: china chinese huawei hong kong tiktok xinjiang camp apple russia uighurs
Cluster 2: instagram influencer follower meme brand youtube creator hashtag star dm
Cluster 3: amazon uber bezos delivery driver alexa seller assistant lyft contractor contract customer
Cluster 4: programming python java web javascript programmer command variable script file application
Cluster 5: ad election apple employee india ban store court whatsapp egg
Cluster 6: climate carbon energy emission opioid fire air gas pollution renewable warming
Cluster 7: learning wired deepfake
Cluster 8: meat burger plant beyond impossible foods vegan
Cluster 9: foundation billionaire philanthropy gates tax wealth giving charity democratic pledge matthews
Cluster 10: income poverty cash tax charity poor vaccine ebola credit kidney eitc
Cluster 11: youtube camera tiktok police fan meme tumblr device creator star ring iphone
Cluster 12: robot learning deepfake art bias brain openai train nuclear car task
Cluster 13: musk tesla tweet unsworth whistleblower ukraine ukrainian retweet crowdstrike giuliani impeachment reply

KMeans clustering, because it's unsupervized, can sometimes cluster based on constructs that we aren't interested in. However, looking at the top words in each cluster, it appears that each cluster represents a reasonal topic in the realm of the articles I was looking at.

Dimensionality reduction using SVD¶

However, there are some words included that make it clear that the semantics aren't always being accurately represented, for instance, in Cluster 2, concepts of organ donation and philanthropy are being conflated. Tf-idf vectorization should convey some semantic information due to it's process of weighting them by importance in a document, which presumably have various topics. But how can we do better?

I decided to try doing a dimensionality reduction technique called SVD, which is often used for latent semantic analysis and works on tf-idf matrices. This way, latent patterns in the term frequencies can be identified, which may convey more semantic information that is then used in the KMeans clustering.

Here, I reduce the number of features to 100

In []:

from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.externals import joblib
import pandas as pd

news_df = pd.read_csv('data/clean_content.txt')
vectorizer = joblib.load('models/tfidf_vectorizer.sav')

X = vectorizer.transform(news_df['clean_content'])

svd = TruncatedSVD(n_components=100, random_state=42)
normalizer = Normalizer(copy=False)

svd_pipe = make_pipeline(svd, normalizer)

X_svd = svd_pipe.fit_transform(X)

X_svd.shape

Out[]:

(1464, 100)

Next, I checked to see what is the ideal number of cluster are for the data set with the reduced dimensions, using the same method as before.

In []:

choose_k(25, X_svd)

Out[]:

It looks like 11 clusters are are good number

In []:

k_svd = 11

kmeans_svd =  KMeans(n_clusters = k_svd, init = 'k-means++', random_state= 42)

cluster_labels_svd = kmeans_svd.fit_predict(X_svd)

In []:

plot_silhouette_scores(X_svd, k_svd, cluster_labels_svd)

Out[]:

We have to print the most important words according to the highest tf-idf words in each cluster, because the cluster centers correspond to components of the terms rather than the terms.

In []:

terms = vectorizer.get_feature_names() 
tfidf = X.todense() 

def print_max_tfidf_per_term_per_cluster(cluster_labels):
    tfidf_df = pd.DataFrame(tfidf, columns = terms)
    tfidf_df['cluster_label'] = cluster_labels

    max_tfidf_per_term_per_cluster={}
    for label in tfidf_df.cluster_label:
        max_tfidf_per_term_per_cluster[label]=(np.array(tfidf_df[tfidf_df.cluster_label == label].max(axis = 0)).ravel())
    max_tfidf_per_term_per_cluster = pd.DataFrame(max_tfidf_per_term_per_cluster)

    for i in range(k_svd):

        sorted_tfidf = np.array(max_tfidf_per_term_per_cluster[i].argsort())[::-1][:][:15]
        print("Cluster %d:" % i, end='')
        for ind in sorted_tfidf[:15]:
            if terms[ind-1].find(' ') == -1:
                print(' %s' % terms[ind], end='')
        print()


print_max_tfidf_per_term_per_cluster(cluster_labels_svd)

Out[]:

Cluster 0: kidney wage alcohol microfinance medicaid recession debt alaska disaster udacity bank venezuela
Cluster 1: twin bathroom airdrop asian vine clout discord whistle meme venmo ticket wedding
Cluster 2: woody ring cbp megvii emotion axon surveillance israeli verification body camera ton facial
Cluster 3: planning bert robot cube deepfake nt legend bias
Cluster 4: null php art git java sculpture api sql hip hop
Cluster 5: asteroid enlightenment insect border hog fish mars population gang prediction nuclear
Cluster 6: ad bosworth amazon extension vaccine buffalo tv chrome ryan
Cluster 7: bear condom vaccine cancer mosquito lithium zika monkey suicide malaria cat
Cluster 8: egg meat vegan cotton fish tax chick chicken ag burger turkey
Cluster 9: bitcoin turbine fire bus carbon coal pollution geoengineering climate land grid
Cluster 10: scooter epstein gabbard yelp portal drunk drone quantum tag tumblr wework solitary

Interestingly, I think the clusters learned from the high-dimensional dataset are more cohesive than these. I think I need to consider more advances techniques for learning conceptual clusters from text documents.

Plotting clusters using t-SNE¶

Next, I wanted to visualize the clusters learned from the high dimensional data set in two dimensions. t-SNE is a feature reduction technique that reduces features to 2 dimensions, mainly used just for plotting purposes.

In []:

from functools import lru_cache
from sklearn.manifold import TSNE
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm

@lru_cache(maxsize = None)
def tsne_plot():
    tsne=TSNE(random_state = 42)

    # Array with 2 dimensions for each sample
    tsne_features=tsne.fit_transform(tfidf)

    # Data frame of cluster labels and t-SNE features for plotting
    tsne_df = pd.DataFrame({'clusters':cluster_labels, 'tsne0':tsne_features[:,0], 'tsne1':tsne_features[:,1]})
    tsne_df['clusters'] = tsne_df['clusters'].astype('category', ordered = True)

    # Palette for plotting
    palette = ['#363737', # dark grey
               '#7e1e9c', # purple
               '#380282', # indigo
               '#0343df', # blue
               '#95d0fc', # light blue
               '#06c2ac', # turquoise
               '#15b01a', # green
               '#a7ffb5', # light green
               '#c1f80a', # chartreuse
               '#fff39a', # yellow
               '#fac205', # gold
                '#f97306', # orange
               '#ac4f06', # brown
               '#e50000', # red 
                ]
    sns.set(style="whitegrid")

    # Scatterplot colored by cluster label
    plt.figure(figsize = [15,15])
    sns.scatterplot(data = tsne_df, x="tsne0", y="tsne1", hue="clusters", palette= palette)
    plt.legend(loc='upper center', bbox_to_anchor=(1.1, 1), ncol=1, fontsize =15)
    plt.xlabel("t-SNE Label 0", size = 15)
    plt.ylabel("t-SNE Label 1", size = 15)

    plt.show()
    
warnings.filterwarnings('ignore')
tsne_plot()

Out[]:

Just as we saw in the silhouette plot, cluster 12 is very widespread. In general, it seems like this cluster isn't that informative.

Word cloud for each cluster¶

Visualization of the printed terms above

In []:

sorted_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]

def print_word_cloud_per_cluster(j=2):
    
    if j > k:
        return 
    
    else:
        f, axs = plt.subplots(1,2, figsize = [15, 5])
        
        # for each of the two axes, create a word cloud for that cluster's most important terms, and display it
        for ax, i in zip(axs, range(j-2, j)):
            cloud = WordCloud(background_color='white').generate(text=(' '.join(np.array(terms)[sorted_centroids[i, :75]])))
            title = "Important Words in Cluster {} \n".format(i)

            ax.imshow(cloud, interpolation="bilinear")
            ax.set_title(title, size = 20)
            ax.grid(False)
            ax.axis("off")
        plt.show()
            
        # update j for next loop
        print_word_cloud_per_cluster(j=j+2)

print_word_cloud_per_cluster()

Out[]:

Prediction with new document content¶

In []:

import os
from sklearn.externals import joblib
import spacy
sp = spacy.load('en_core_web_sm')
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# vectorizer = joblib.load('models/tfidf_vectorizer.sav')

# kmeans = joblib.load('models/kmeans.sav')

def predict(input_string = None, filename = None):
    '''
    A function to clean, vectorize, and predict the kmeans cluster of a new string
    
    Arguments: a string, which is the content of an article OR the 'filename.txt' of a text file
    
    Output: a prediction and wordcloud for the document
    '''
    if filename != None:
        with open(os.path.join(PROJ_ROOT_DIR, "content_to_predict", filename), 'r') as file:
            input_string = file.read()
        print('Processing file')
        # Clean the string
        clean = clean_string(input_string)
    
    if input_string != None:
        print('Processing text string')
        clean = clean_string(input_string)


    # Vectorize the string
    Y = vectorizer.transform([clean])

    # Predict the cluster label
    prediction = kmeans.predict(Y)
    
    # Generate the wordcloud
    cloud = WordCloud(background_color = 'white').generate(clean)
    
    # Add the prediction to the title of the word cloud
    title = "Cluster Prediction for Article {}".format(prediction)
    
    # Show the word cloud
    plt.figure(figsize = [10,15])
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(title, size =20)
    plt.show();

In []:

predict(input_string = '''Are we living in a computer simulation? The question seems absurd. Yet there are 
plenty of smart people who are convinced that this is not only possible but perhaps likely.In an influential 
paper that laid out the theory, the Oxford philosopher Nick Bostrom showed that at least one of three 
possibilities is true: 1) All human-like civilizations in the universe go extinct before they develop the 
technological capacity to create simulated realities; 2) if any civilizations do reach this phase of 
technological maturity, none of them will bother to run simulations; or 3) advanced civilizations would have 
the ability to create many, many simulations, and that means there are far more simulated worlds than 
non-simulated ones.''')

Out[]:

Processing text string

Out[]:

Finding Topic Clusters in Tech News

Using unsupervized learning to cluster news articles based on content¶

Dataset¶

Project Goals¶

Summary of findings¶

Full project¶

Data collection: News Scrapers¶

Below is an example of the scraping process¶

Running the scrapers¶

Loading the data¶

Text processing¶

Here's an example of how the text cleaning works:¶

Processing all the article content¶

Vectorization of text data¶

Inspecting the text features¶

Time for KMeans clustering!¶

Choosing the best number of clusters¶

Training the KMeans model¶

Inspecting the clusters¶

Silhouette scores¶

Top words by cluster¶

Dimensionality reduction using SVD¶

Plotting clusters using t-SNE¶

Word cloud for each cluster¶

Prediction with new document content¶

Return to the top