
LATENT DIRICHLET ALLOCATION (LDA)

OVERVIEW:

Topic modeling is an unsupervised machine learning method used to identify abstract topics or themes present across a large collection of text data. One of the most widely used topic modeling algorithms is Latent Dirichlet Allocation (LDA). LDA assumes that each document is made up of a mixture of topics, and each topic is defined by a distribution over words. By analyzing patterns of word co-occurrence across documents, LDA uncovers dominant themes that are not explicitly labeled.


In this project, topic modeling was applied to a collection of news articles related to misinformation regulation. The goal was to identify the main areas of focus within this discourse—such as public policy, platform responsibility, elections, freedom of speech, or enforcement efforts—by analyzing the language used in the articles.

 

Using topic modeling provided a way to uncover latent structure in the data without needing predefined categories. This technique was especially helpful in understanding how different aspects of misinformation and its regulation are discussed across media sources, and how common vocabulary or themes emerge across varied viewpoints. The resulting topics helped organize the articles into semantically meaningful groups, aiding in deeper analysis of the public narrative surrounding misinformation.

DATA PREP:

Latent Dirichlet Allocation (LDA) requires a specific type of input: unlabeled, preprocessed text converted into a document-term matrix. Labels, categories, or sentiments must not be included in this step, as LDA is an unsupervised method intended to discover hidden themes without prior classification.

 

For this project, a preprocessed dataset named final_data_misinformation.csv was used. The text for each article was prepared by combining the title and description fields. This combined text was then cleaned by removing punctuation, converting to lowercase, and eliminating short or irrelevant words. The cleaning process ensured that the input to the model was consistent and free of noise.
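The cleaning step described above can be sketched as follows. The column names ("title", "description") and the length cutoff for "short words" are assumptions, and a small inline sample stands in for final_data_misinformation.csv:

```python
import re
import pandas as pd

# Illustrative stand-in rows; the real project loads final_data_misinformation.csv.
df = pd.DataFrame({
    "title": ["EU weighs new misinformation rules!"],
    "description": ["Platforms face fines over fake news in 2024."],
})

def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # strip punctuation, symbols, numbers
    # Drop short words (cutoff of 3 characters is an assumption)
    return " ".join(w for w in text.split() if len(w) > 3)

# Combine title and description, then clean
df["text"] = (df["title"].fillna("") + " " + df["description"].fillna("")).apply(clean_text)
print(df["text"].iloc[0])
```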


A CountVectorizer was applied to generate the document-term matrix, which transforms each article into a vector based on word frequency. Parameters such as max_df=0.9, min_df=2, and stop_words='english' were used to filter out overly common or rare terms, improving the quality of the topics generated.


The resulting matrix had 1000 unique word features across all articles, providing a strong foundation for extracting topic patterns using LDA.


Link to the Data: https://github.com/saketh-saridena/TextMining_Project

 

Before Transformation:

[image: sample article data before transformation]

After Transformation:

[image: document-term matrix after transformation]

CODE:

The topic modeling process was implemented in Python using scikit-learn's LatentDirichletAllocation module. The following steps were followed to prepare the data and run the model:

  1. Data Loading and Text Preparation
    The dataset was loaded from final_data_misinformation.csv. Each article's title and description were combined into a single text field. Basic text cleaning was performed using regular expressions to remove punctuation, symbols, numbers, and short words.

  2. Vectorization (Document-Term Matrix)
    A CountVectorizer was used to convert the cleaned text into a numerical format. This step created a sparse document-term matrix with a vocabulary of the top 1000 terms. Parameters such as max_df=0.9 and min_df=2 ensured that overly frequent or rare terms were excluded.

  3. LDA Model Training
    An LDA model was configured with n_components=5 to extract five distinct topics. The model was trained using the online learning method with 100 iterations. After training, the document-topic distribution was computed.

  4. Topic Interpretation and Visualization
    To interpret the model results:

    • A multi-panel plot showed the top 15 terms in each topic.

    • A bar chart visualized the top 30 terms from a single topic.

    • A PCA-based scatter plot displayed the relationship and distance between topics in 2D space.


Link to the code of LDA: https://github.com/saketh-saridena/TextMining_Project

RESULTS:

The Latent Dirichlet Allocation (LDA) model was trained on preprocessed text data extracted from news articles focused on misinformation and its regulation. A total of 5 distinct topics were extracted using CountVectorizer and LatentDirichletAllocation. The results are visualized and interpreted using three complementary plots.


1. Topic-Term Overview (Multi-topic Visualization)

A horizontal stacked visualization displays the top 15 most relevant words from each topic. These terms help interpret the thematic structure of the corpus:

  • Topic 0: Focused on politics and current events, featuring words like “trump,” “covid,” “election,” “war,” and “security.”

  • Topic 1: Represents digital content governance, with terms like “content,” “moderation,” “fake,” “social,” “law,” and “data.”

  • Topic 2: Centers on policy and democracy, using terms like “misinformation,” “press,” “tech,” “democracy,” “health,” and “political.”

  • Topic 3: Emphasizes freedom of speech and regulation, including “free,” “regulation,” “law,” “university,” and “public.”

  • Topic 4: Touches on research, platforms, and future outlook, with terms like “review,” “meta,” “update,” “europe,” and “community.”

[image: top 15 terms per topic]

2. Top 30 Terms for Topic 1 (Bar Plot)

A focused bar chart was generated to highlight the top 30 most important terms in Topic 1, which emerged as the most informative for the project's theme. Key words included:

  • “digital,” “content,” “control,” “moderation,” “fake,” “law,” “platform,” and “regulation.”

This topic clearly reflects the regulatory discourse surrounding misinformation, especially how governments and platforms moderate harmful content.

[image: top 30 terms in Topic 1]

3. Inter-Topic Distance Map (PCA Visualization)

To understand how the topics are distributed in the feature space, a PCA-based 2D scatter plot was generated, projecting each topic's word distribution onto the first two principal components:

  • Topic 1 is isolated, indicating strong thematic uniqueness around regulation and content control.

  • Topics 0, 2, and 4 cluster more closely, showing overlap in political, health, and security narratives.

  • Topic 3 is distant vertically, suggesting a unique focus on law, freedom, and speech.

[image: PCA inter-topic distance map]

CONCLUSION:

The topic modeling step helped bring out the main themes and ideas found across many news stories about misinformation and regulation. By carefully looking at how words are grouped together in articles, several major topics clearly stood out, reflecting the real conversations people and media are having on this issue.


Some of the topics focused on political matters like elections and public leaders, while others revolved around digital platforms, rules for controlling online content, or larger social concerns like public safety and trust. One group of words, in particular—centered around “moderation,” “fake,” “content,” and “regulation”—stood out as the heart of the conversation, showing how important the role of online platforms is in this ongoing debate.


The visualizations made it even easier to understand these patterns. One chart highlighted the most important words for each topic, while another showed how different topics are close or far apart based on their focus. For example, discussions around politics and public safety were close together, suggesting they are often part of the same conversation. On the other hand, some themes stood on their own, showing that not every article looks at misinformation in the same way.


Altogether, this step showed that the news stories are not just repeating the same ideas. Instead, they reflect many different points of view—some focused on law and order, others on freedom, safety, or technology. This adds depth to the bigger question of whether social media companies should be responsible for dealing with misinformation, and what areas people seem to care about most in that discussion.
