
DATA TAB: UNDERSTANDING THE DATA USED IN THE PROJECT
This project focuses on analyzing misinformation regulation by collecting and processing data from the following news sources (a minimal collection sketch follows the list):
Data Sources:
- NewsAPI – A real-time news aggregation API that collects articles from major news outlets like BBC, CNN, and The Guardian.
- Mediastack API – A comprehensive API that retrieves news from smaller news agencies and blogs, covering diverse perspectives.
- Google News RSS Feeds – Extracts articles based on search queries related to misinformation regulation, collecting structured news data.
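For reference, the sketch below shows one minimal way to pull articles from these three sources in Python. The API keys are placeholders, and the query string, request parameters, and column choices are assumptions rather than the project's exact scripts, which live in the code repository linked below.

import requests
import feedparser
import pandas as pd

QUERY = "misinformation regulation"

# NewsAPI: /v2/everything returns an "articles" list (title, description, publishedAt, url)
newsapi_resp = requests.get(
    "https://newsapi.org/v2/everything",
    params={"q": QUERY, "language": "en", "pageSize": 100, "apiKey": "YOUR_NEWSAPI_KEY"},
).json()
newsapi_df = pd.DataFrame(newsapi_resp.get("articles", []))

# Mediastack: /v1/news returns a "data" list (title, description, published_at, url)
mediastack_resp = requests.get(
    "http://api.mediastack.com/v1/news",
    params={"access_key": "YOUR_MEDIASTACK_KEY", "keywords": QUERY,
            "languages": "en", "limit": 100},
).json()
mediastack_df = pd.DataFrame(mediastack_resp.get("data", []))

# Google News RSS: search feed parsed with feedparser (title, summary, published, link)
feed = feedparser.parse("https://news.google.com/rss/search?q=" + QUERY.replace(" ", "+"))
rss_df = pd.DataFrame(
    [{"title": e.title, "description": e.get("summary", ""),
      "publishedAt": e.get("published", ""), "url": e.link}
     for e in feed.entries]
)

newsapi_df.to_csv("newsapi_raw.csv", index=False)
mediastack_df.to_csv("mediastack_raw.csv", index=False)
rss_df.to_csv("google_rss_raw.csv", index=False)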
The collected data has been cleaned, processed, and labeled for further analysis. Below are links to the datasets and the code used:
- Raw Data – Unprocessed news articles collected from APIs and RSS Feeds.
- Cleaned Data – Text processed using tokenization, stemming, lemmatization, and vectorization.
- Labeled Data – News articles labeled into "Pro-Regulation," "Anti-Regulation," or "Neutral" categories.
- Code Repository – Python scripts for data collection and preprocessing.
Before and After Views of Data
The dataset initially contained raw text from NewsAPI, Mediastack API, and Google News RSS.
NewsAPI:
- NewsAPI Data Before Cleaning:

- NewsAPI Data After Cleaning:

Mediastack API:
- Mediastack API Data Before Cleaning:

- Mediastack API Data After Cleaning:

Google News RSS:
- Google News RSS Webscraped Data Before Cleaning:

- Google News RSS Webscraped Data After Cleaning:

Final Merged Dataset (After Combining All Sources):
After collecting data from all three sources (NewsAPI, Mediastack, Google News RSS), the datasets were merged, duplicate articles were removed, and inconsistent formats were standardized.
At this stage, the dataset had consistent column names and formatted publication dates, making it ready for text preprocessing.
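A minimal sketch of this merge step, assuming the raw exports were saved as CSV files (the file names, shared column set, and deduplication keys are assumptions; the exact merge logic is in the repository):

import pandas as pd

# Load the raw exports from the three collectors (file names are assumptions)
newsapi_df = pd.read_csv("newsapi_raw.csv")
mediastack_df = pd.read_csv("mediastack_raw.csv")
rss_df = pd.read_csv("google_rss_raw.csv")

# Align the sources on one shared schema
mediastack_df = mediastack_df.rename(columns={"published_at": "publishedAt"})
common_cols = ["title", "description", "url", "publishedAt"]
merged = pd.concat(
    [df.reindex(columns=common_cols) for df in (newsapi_df, mediastack_df, rss_df)],
    ignore_index=True,
)

# Remove duplicate articles and standardize the publication dates
merged = merged.drop_duplicates(subset=["url"]).drop_duplicates(subset=["title"])
merged["publishedAt"] = pd.to_datetime(merged["publishedAt"], errors="coerce", utc=True)
merged = merged.dropna(subset=["title"]).reset_index(drop=True)
merged.to_csv("merged_articles.csv", index=False)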

Labeled Data:
Using BERT-based NLP sentiment classification, each article was labeled as one of the following (a rough labeling sketch follows the list):
- Pro-Regulation – Supports government or platform-based misinformation laws.
- Anti-Regulation – Opposes misinformation laws, often citing free speech concerns.
- Neutral – No clear stance on misinformation policies.
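The repository contains the actual BERT-based labeling code; as a rough, publicly runnable stand-in, the sketch below uses a zero-shot classifier with the same three stance labels. The facebook/bart-large-mnli checkpoint and the title-plus-description input are assumptions, not the project's model.

from transformers import pipeline

# Zero-shot stand-in for the project's BERT-based stance classifier
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
STANCES = ["Pro-Regulation", "Anti-Regulation", "Neutral"]

def label_article(title: str, description: str) -> str:
    """Return the highest-scoring stance label for one article."""
    result = classifier(f"{title}. {description}", candidate_labels=STANCES)
    return result["labels"][0]

print(label_article(
    "Lawmakers push platforms to remove false claims",
    "A new bill would require social media companies to take down flagged misinformation.",
))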

Data Cleaning and Preprocessing
1. Tokenization
Each article’s title and description were broken down into individual words (tokens) to facilitate further text processing.
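A minimal tokenization sketch, assuming NLTK's word_tokenize (the example sentence is made up; the actual scripts operate on the merged dataset's title and description text):

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)       # tokenizer models
nltk.download("punkt_tab", quiet=True)   # needed by newer NLTK releases

text = "EU proposes stricter rules to curb online misinformation."
tokens = word_tokenize(text.lower())
print(tokens)
# ['eu', 'proposes', 'stricter', 'rules', 'to', 'curb', 'online', 'misinformation', '.']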
2. Stemming
Stemming reduced words to their root form (e.g., "regulating" → "regul"), simplifying the vocabulary for modeling.
Stemmed Dataset Sample:

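A minimal stemming sketch, assuming NLTK's PorterStemmer (the sample tokens are made up for illustration):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
tokens = ["regulating", "regulations", "misinformation", "platforms"]
print([stemmer.stem(t) for t in tokens])
# ['regul', 'regul', 'misinform', 'platform']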
3. Lemmatization
Lemmatization converted words into their base dictionary form (e.g., "running" → "run") while preserving contextual meaning.
Lemmatized Dataset Sample:

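A minimal lemmatization sketch, assuming NLTK's WordNetLemmatizer (the sample words are made up; note that the part-of-speech argument changes the result):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("regulations"))       # 'regulation' (default POS is noun)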
4. Vectorization (CountVectorizer & TF-IDF)
To make textual data usable for machine learning models, we converted the news articles into numerical features using two different methods:
CountVectorizer Transformation:

TF-IDF Transformation:

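A minimal sketch of both transformations with scikit-learn (the two example documents are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "new law targets online misinformation",
    "critics say misinformation law threatens free speech",
]

# CountVectorizer: raw term counts per document
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)
print(count_vec.get_feature_names_out())
print(counts.toarray())

# TF-IDF: down-weights terms that appear in many documents
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)
print(tfidf.toarray().round(2))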
Summary of Preprocessed Datasets:

Each dataset serves a different purpose, providing multiple ways to analyze and model the data effectively.
GitHub Repo (Code and Data): https://github.com/saketh-saridena/TextMining_Project