
CLUSTERING
OVERVIEW:
Clustering is a technique used to group similar records based on their underlying features without relying on any predefined labels or categories. In the context of this project, clustering was applied to a collection of news articles related to the regulation of misinformation. The primary objective was to discover whether the language used in different articles reveals naturally occurring patterns that align with significant viewpoints or themes in the ongoing debate.
​
The use of clustering was intended to uncover hidden groupings that might not be immediately visible through basic observation. These groupings could represent differences in tone, stance, or subject focus—such as articles emphasizing government regulation, platform accountability, free speech concerns, or neutral reporting. The approach allowed for the exploration of similarities and differences in language across articles from various sources, time periods, and writing styles.
​
This process played a foundational role in supporting the project's broader goals. By using unsupervised learning, meaningful distinctions were identified without the influence of manual labeling or bias. Clustering revealed structure in the data that aligned with social, political, and thematic divisions in public discourse around misinformation. The visual and numerical results from this step informed later analyses, including topic modeling and classification, and demonstrated how textual data can reflect larger societal conversations.
​
DATA PREP:
Clustering methods require numeric data without labels. A term frequency-inverse document frequency (TF-IDF) matrix was used, where each article was converted into a vector based on the importance of the words used in it.
​
Before clustering, the data was normalized to prepare it for cosine similarity calculations, especially for hierarchical clustering. Any labeled columns, such as sentiment or category information, were removed to ensure that the data remained unlabeled.
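As a rough illustration of this preparation step, the following Python sketch builds the TF-IDF matrix and normalizes it; the file name articles.csv and the column names text, sentiment, and category are hypothetical placeholders, not the project's actual identifiers.

# Build a TF-IDF matrix from raw article text and L2-normalize it for cosine-based clustering
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

df = pd.read_csv("articles.csv")                                   # hypothetical input file
texts = df["text"].astype(str)                                     # raw article text
df = df.drop(columns=["sentiment", "category"], errors="ignore")   # drop any label columns

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vectorizer.fit_transform(texts)                            # documents x terms sparse matrix
tfidf_norm = normalize(tfidf, norm="l2")                           # unit-length rows for cosine similarity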
​
Before Transformation:

After Transformation:

CODE:
Two clustering methods were applied, along with PCA to support visualization:
- K-Means clustering was implemented in Python. The number of clusters (k) was varied from 2 to 7, and the best value was chosen by comparing silhouette scores, which measure how well data points fit within their clusters (see the sketch after this list).
- Principal Component Analysis (PCA) was applied to reduce the high-dimensional data to two and three dimensions, which were then used to plot the clusters visually.
- Hierarchical clustering was performed in R, using cosine similarity as the distance measure and average linkage to form the dendrogram.
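The k sweep and silhouette comparison could look like the following sketch, assuming tfidf_norm is the normalized TF-IDF matrix from the data preparation step.

# Fit K-Means for k = 2..7 on the normalized TF-IDF matrix and compare silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(tfidf_norm)
    scores[k] = silhouette_score(tfidf_norm, labels)
    print(f"k = {k}: silhouette = {scores[k]:.3f}")

best_k = max(scores, key=scores.get)   # k with the highest silhouette score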
​
Link to the K-Means and hierarchical clustering code: https://github.com/saketh-saridena/TextMining_Project
RESULTS:
K-Means Clustering Results:
Clustering was performed on the normalized TF-IDF matrix using K-Means for different values of k (from 2 to 7). Silhouette scores were computed for each clustering result to evaluate cluster cohesion and separation.

The silhouette scores were generally low, indicating limited natural clustering in the data. However, the best silhouette score was observed at k = 7 (score ≈ 0.021). The progression of scores is as follows:

PCA was applied to reduce the dimensionality of the normalized data to three dimensions for visualization. The 3D scatter plot displayed below shows how K-Means grouped the documents into clusters. Though some overlap was visible, certain separation patterns emerged, suggesting latent topic-based groupings among the documents.
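A sketch of how such a 3D plot can be produced is shown below; it assumes tfidf_norm and the cluster labels from the K-Means step, and is illustrative rather than the project's exact plotting code.

# Project the normalized TF-IDF matrix to three principal components and color points by cluster
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
coords = pca.fit_transform(tfidf_norm.toarray())   # PCA requires a dense array

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2], c=labels, cmap="tab10", s=15)
ax.set_xlabel("PC1"); ax.set_ylabel("PC2"); ax.set_zlabel("PC3")
ax.set_title("K-Means clusters in 3D PCA space")
plt.show()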

Hierarchical Clustering Results:
Hierarchical clustering was performed on the same normalized TF-IDF dataset using cosine distance and the average linkage method. The dendrogram provided a visual representation of document similarity and cluster structure.
​
Only the first 200 documents were visualized to improve clarity. The resulting dendrogram showed a gradual merging of clusters without sharp boundaries. No clear "elbows" or dramatic merges were observed, indicating a lack of strong hierarchical structure in the data.

The full dendrogram appeared visually cluttered and difficult to interpret due to the large number of articles. To address this, a subset of the data (top 50–200 articles) was used to generate a cleaner and more readable visualization as follows.

A horizontal cut in the dendrogram was tested at several levels. A cut at k = 5 to 7 clusters appeared most reasonable, visually aligning with the patterns seen in K-Means. This reinforced the earlier finding that around 7 clusters may be optimal for this dataset.
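Since the hierarchical clustering itself was implemented in R, the sketch below is only a Python equivalent of the workflow described above (cosine distances, average linkage, a dendrogram over the first 200 documents, and a cut into roughly 7 clusters), assuming tfidf_norm from the earlier steps.

# Average-linkage hierarchical clustering on cosine distances, with a dendrogram and a flat cut
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

subset = tfidf_norm[:200].toarray()          # first 200 documents for a readable plot
dist = pdist(subset, metric="cosine")        # condensed pairwise cosine distances
Z = linkage(dist, method="average")          # average-linkage merge tree

plt.figure(figsize=(12, 5))
dendrogram(Z, no_labels=True)
plt.title("Average-linkage dendrogram (cosine distance, first 200 articles)")
plt.show()

clusters = fcluster(Z, t=7, criterion="maxclust")   # horizontal cut into 7 flat clusters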
CONCLUSION:
When looking at news stories about misinformation and regulation, patterns begin to emerge even without knowing anything about who wrote them or what side they are on. Some stories naturally seem to group together—talking about similar ideas, raising similar concerns, or supporting similar actions. This suggests that within all the noise, there are underlying themes that repeat and connect.
One group of articles, for example, might focus on how platforms should be more responsible and how new rules could help stop the spread of false information. Another group might raise worries about censorship or argue that too much control could limit free speech. Even though the boundaries between these groups aren't always sharp, the fact that they appear at all says something important: people are having very different kinds of conversations about the same issue.
What this shows is that misinformation isn’t just a single-topic issue. It touches politics, technology, media, and individual rights in different ways. By finding these clusters of conversation, it becomes easier to understand the bigger picture—how public dialogue is forming, where the disagreements lie, and what kinds of stories are shaping opinions. These patterns set the stage for exploring deeper ideas and debates in the next steps of the project.
Github Repo (Code and Data): https://github.com/saketh-saridena/TextMining_Project