
ASSOCIATION RULE MINING (ARM)

OVERVIEW:

Association Rule Mining (ARM) is an unsupervised learning technique that uncovers interesting relationships or associations between variables in large datasets. Originally used in market basket analysis, where it identifies products frequently bought together, ARM is now widely applied in text mining to discover frequent co-occurrences of words or concepts within documents.


In this project, ARM was used to explore co-occurrence patterns in news articles related to misinformation and regulation. Instead of structured records, each article was treated as a "transaction" consisting of a list of important, cleaned keywords. The goal was to find which terms tend to appear together across various articles—revealing subtle or strong associations that may not be obvious through other methods like clustering or topic modeling.


By generating association rules based on support, confidence, and lift, the analysis surfaces frequent conceptual pairings such as “misinformation” with “medium” or “moderation” with “content.” These associations highlight the semantic structure of how misinformation is discussed in the media and offer insight into how terms like “freedom,” “policy,” or “fake” often appear in regulated or politicized contexts.


DATA PREP:

Association Rule Mining (ARM) requires input data in the form of transactions, where each transaction is a set of items. In text mining, each "item" is typically a word or term, and each "transaction" represents a document—in this case, a cleaned news article.

 

For this project, the transaction data was prepared from the lemmatized text produced in earlier preprocessing steps. The steps were:

  1. Text Selection: Only the lemmatized_text column was used.

  2. Token Cleaning: All punctuation, numbers, stopwords, and words with fewer than 3 characters were removed.

  3. Formatting: Each cleaned article was stored as a row of space-separated words in a file called transactions.csv. This created a basket-like format required by ARM libraries.

An example row from the transaction file looks like: misinformation regulation social media platform fake freedom speech moderation content


This format makes it possible to discover frequent itemsets and association rules that describe word co-occurrence patterns in misinformation-related news articles.
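A minimal sketch of these preparation steps in base R is shown below. The lemmatized_text column matches the one named in step 1; the input file name and the deliberately short stopword list are illustrative placeholders, not the project's exact code.

# Sketch of the transaction-preparation steps described above
articles <- read.csv("articles_lemmatized.csv", stringsAsFactors = FALSE)  # illustrative file name

stopword_list <- c("the", "and", "for", "that", "with", "this", "from", "have")  # illustrative; a fuller list was likely used

clean_article <- function(text) {
  text   <- tolower(text)
  text   <- gsub("[^a-z ]", " ", text)           # strip punctuation and numbers
  tokens <- unlist(strsplit(text, "\\s+"))
  tokens <- tokens[nchar(tokens) >= 3]           # drop words with fewer than 3 characters
  tokens <- tokens[!tokens %in% stopword_list]   # drop stopwords
  paste(tokens, collapse = " ")
}

baskets <- vapply(articles$lemmatized_text, clean_article, character(1))

# One space-separated basket of terms per line, in the format ARM libraries expect
writeLines(baskets, "transactions.csv")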


Before Transformation:

[image: lemmatized article data before conversion]

After Transformation:

[image: space-separated transaction data]

CODE:

Association Rule Mining (ARM) was implemented in R using the popular arules and arulesViz packages. The process included loading the transactional data, generating rules, sorting by different metrics, and visualizing the most significant associations.


Link to the ARM code: https://github.com/saketh-saridena/TextMining_Project
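A minimal sketch of that workflow with arules and arulesViz follows; the support and confidence thresholds are illustrative placeholders rather than the project's exact settings.

library(arules)
library(arulesViz)

# Load the basket-format transactions (one article's terms per line)
trans <- read.transactions("transactions.csv", format = "basket", sep = " ")

# Mine association rules with the Apriori algorithm (thresholds illustrative)
rules <- apriori(trans, parameter = list(supp = 0.05, conf = 0.5, minlen = 2))

# Top 15 rules by each interest measure
inspect(head(sort(rules, by = "support"),    15))
inspect(head(sort(rules, by = "confidence"), 15))
inspect(head(sort(rules, by = "lift"),       15))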

RESULTS:

Association Rule Mining was conducted on the transaction-style dataset containing cleaned terms from news articles about misinformation and regulation. The apriori algorithm generated a large number of rules, from which the top 15 rules were selected based on support, confidence, and lift for detailed analysis.
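For reference, the three measures reduce to simple co-occurrence counts. The sketch below uses made-up counts purely to illustrate the arithmetic; the real values are computed by the mining library.

# Hypothetical counts, for illustration only
n    <- 1000  # total articles (transactions)
n_A  <- 120   # articles containing the antecedent, e.g. "moderation"
n_B  <- 150   # articles containing the consequent, e.g. "content"
n_AB <- 90    # articles containing both

support    <- n_AB / n                # 0.09: share of all articles containing the pairing
confidence <- n_AB / n_A              # 0.75: chance of seeing the consequent given the antecedent
lift       <- confidence / (n_B / n)  # 5.0: how much more often the pair occurs than under independence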


Top 15 Rules by Support:

Support measures how frequently a set of terms appears together across all transactions. Rules with the highest support therefore capture the pairings that recur in the largest share of articles, showing that regulation and misinformation are dominant terms that regularly occur together and that discussions of platform content are often linked with regulation and fake-information narratives.

Terms like “social,” “medium,” “fake,” “news,” “content,” and “regulation” appear repeatedly and together across articles. These results confirm that fake news, content moderation, and social platforms are the dominant themes in the misinformation narrative, and the consistency of these pairings validates the relevance of these topics in media discourse on misinformation.


Top 15 Rules by Confidence:

Confidence measures the likelihood that a consequent appears when the antecedent is present. High-confidence rules indicate strong directional associations, such as how platform-related discussions often lead to mentions of regulation. Similarly, the co-occurrence of "freedom" and "speech" strongly predicts the appearance of "regulation", which supports the idea that freedom of speech debates are tied closely to regulatory policies in misinformation discourse.

[images: top 15 rules sorted by confidence]

The confidence-based rules demonstrate how reliably one term leads to another. Many rules had a confidence of 1.0, meaning the consequent appeared in every article that contained the antecedent, as with “supreme” and “court,” “fake” and “news,” or “moderation” and “content.” These relationships show strong predictive directionality and reveal that certain concepts almost always appear together in the misinformation context. The presence of triplet rules involving “tech,” “policy,” and “press” further shows that policy discussions are highly structured and predictable when certain terms co-occur.
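As a small illustration, such perfect-confidence rules can be pulled out of the mined rule set directly (continuing from the rules object in the sketch above):

# Rules whose consequent appeared in every article containing the antecedent
perfect_rules <- subset(rules, subset = confidence == 1)
inspect(sort(perfect_rules, by = "lift"))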


Top 15 Rules by Lift:

Lift evaluates how much more often the antecedent and consequent appear together than would be expected if they were statistically independent. High-lift rules highlight unexpectedly strong associations. These rules reveal that certain term pairings (e.g., "moderation" and "content") have a significant reinforcing effect on mentions of regulation. Such findings emphasize how regulatory and enforcement concepts are often tightly coupled in misinformation articles.

[image: top 15 rules sorted by lift]

Lift-focused rules highlight unexpectedly strong associations, revealing which term combinations co-occur far more often than random chance. The extremely high lift values (e.g., over 40 for some rules) show dense conceptual ties—especially in areas like regulatory language (moderation, policy, content), community-based correction systems (community notes), and media coverage (tech and press). These rules are particularly valuable in identifying thematic clusters that may not be immediately obvious, indicating deep-seated relationships within misinformation discourse.


Two key visualizations were created to better understand and present the associations:

1. Network Graph of Rules
A graph-based visualization of the top 50 rules shows nodes as terms and edges as rules. Densely connected clusters around words like "regulation", "platform", and "misinformation" make it easy to identify semantic hubs.

[image: network graph of the top 50 rules by lift]

This graph visually maps the top 50 association rules by lift, displaying the connectedness of frequently associated terms.

  • Clear clusters emerged around themes such as:

    • Policy/Moderation/Tech: A dense grouping reflects discussions on regulating digital platforms.

    • Fake/News/Social/Medium/Law: This cluster highlights the key actors and mediums through which misinformation is shared and combatted.

    • Musk/Elon: This separate node suggests a focused conversation on individuals impacting regulation or platform policies.

    • Free/Speech and Supreme/Court: These form smaller, high-lift associations indicating narrower but impactful debates.

  • Edges represent rules, and the proximity and connections visualize how tightly terms are linked in narrative contexts.
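A graph like the one above can be drawn with arulesViz's "graph" plotting method; a minimal sketch, continuing from the rules object mined earlier:

# Top 50 rules by lift, drawn as a network of terms (nodes) and rules (edges)
top_rules <- head(sort(rules, by = "lift"), 50)
plot(top_rules, method = "graph")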

2. Grouped Matrix Plot

This plot organizes rules into clusters based on shared items, allowing a quick overview of grouped co-occurrence patterns. The clusters revealed how sets of terms such as “fake,” “content,” and “platform” are frequently bundled within the same semantic space.

[image: grouped matrix plot of the association rules]

The higher the lift, the stronger and more meaningful the relationship between antecedents and consequents.

  • Rules such as {moderation, press, tech} → policy and {supreme} → court exhibit exceptionally high lift values (e.g., 58.5), revealing rare but extremely strong co-occurrences.

  • Clusters of terms such as moderation, policy, and press tend to co-occur, suggesting a strong policy-based narrative.

  • Associations involving fake → news, medium → social, and social → medium reflect the discourse surrounding misinformation and the channels through which it spreads.
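The grouped matrix view above corresponds to arulesViz's "grouped" plotting method; a minimal sketch, again using the rules object from earlier:

# Grouped matrix plot: antecedent groups on one axis, consequents on the other,
# with each bubble summarizing the support and lift of the grouped rules
plot(rules, method = "grouped")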

CONCLUSION:

Looking at how certain words and ideas appear together in news stories can reveal a lot about how people talk about misinformation and regulation. Some words seem to naturally go hand in hand. For example, terms like "fake" and "news" often appear together, just like "social" and "medium," or "moderation" and "content." These pairings suggest that people see these ideas as closely linked when they’re discussing issues around false information.


Other pairings point to deeper themes in the public conversation. Words like "press" and "policy," or "supreme" and "court," show that the conversation is not just about technology or social media, but also about the law, governance, and how society should respond. When three or more words appear together regularly—like "tech," "policy," and "moderation"—it paints a picture of people talking about how to manage misinformation with rules and systems.


The visual graphs helped bring these patterns to life. One chart showed which words have the strongest connections, and another showed groups of words that often appear in the same types of stories. These visuals made it easier to understand how ideas are connected across articles.


Overall, this step helped uncover how different parts of the conversation around misinformation are connected. Instead of just counting words, this approach showed the deeper relationships in how stories are told. These patterns can help explain how public opinions are formed, how issues are framed in the media, and where decision-makers might need to focus their attention.
