SentiWordNet

SentiWordNet#

A comprehensive dictionary that assigns graded sentiment scores to synsets in WordNet 📚.#

SentiWordNet Summary

Composition:

Approximately 117k n-gram synsets
Synset-level polarity, with positive and negative scores ranging [0,1] (Note that the sentiment scores appear to be ordinal representations rather than truly continuous values.)
Note that the processed dictionary provides approximately 23k terms with continuous scoring metrics ranging [-1,1].

Creation Methodology:

SentiWordNet 1.0 (Esuli and Sebastiani, 2006): Employed a committee of eight ternary classifiers, each trained on different subsets derived from positive and negative “seed terms”. Ratings were assigned to WordNet synsets based on the classifiers’ decisions.
SentiWordNet 3.0 (Baccianella, Esuli and Sebastiani, 2010): Departed from the committee approach, adopting a “bag-of-synsets” representation and introducing a graph-based random walk procedure for sentiment scoring.

Evaluation: Esuli and Sebastiani (2006) provided an initial validation of SentiWordNet 1.0 by comparing it against the General Inquirer lexicon, demonstrating its potential utility but also noting challenges in directly evaluating accuracy due to the absence of benchmark with manual word-level sentiment annotations.

Baccianella, Esuli and Sebastiani (2010) conducted a more rigorous evaluation of SentiWordNet 3.0 using Micro-WN(Op)-3.0, an automatically mapped version of the Micro-WN(Op) dataset originally compiled by Cerini et al. (2007). Micro-WN(Op) contains 1,105 WordNet synsets which were manually annotated for degrees of positivity, negativity and neutrality by five human coders. To evaluate SentiWordNet 3.0, Baccianella, Esuli and Sebastiani (2010) tested how well it could predict the polarity ratings (positivity and negativity values) of synsets in Micro-WN(Op)-3.0. They computed the ranking correlation between the gold standard Micro-WN(Op)-3.0 rankings and the SentiWordNet 3.0 predicted rankings using p-normalised Kendall’s tau. In comparison to SentiWordNet 1.0, version 3.0 demonstrated substantial improvements in correlation (19.48% relative gain for positivity and 21.96% for negativity).

Usage Guidance: A comprehensive dictionary offering synset-level sentiment scores. Ideal as a semantic foundation for contextual sentiment analysis, acknowledging the multifaceted nature of sentiment. Access processed dictionaries via sentibank.archive.load().dict("SentiWordNet_v2010_simple") for dictionary that only includes strictly positive and negative terms, or sentibank.archive.load().dict("SentiWordNet_v2010_logtransform") for dictionary that contains ambiguous terms.

📋 Introduction#

SentiWordNet (Esuli and Sebastiani, 2006; Baccianella, Esuli and Sebastiani, 2010) is a lexicon that annotates English words from WordNet with ‘graded’ sentiment scores indicating how objective, positive, and negative they are. Note that the term ‘graded’ aligns with ‘valence’ in modern sentiment analysis research. The lexicon, evolving from the original 2006 version (SentiWordNet 1.0) to the improved SentiWordNet 3.0, recognises that terms can possess both positive and negative polarities to varying degrees. In this overview, we trace the evolution of SentiWordNet, emphasising the key methodological differences in scoring word senses between the two versions.

📚 Original Dictionary#

ver.2006#

SentiWordNet (ver.1.0) assigned three sentiment scores that range [0,1] to each WordNet (ver.2.0) synset: (i) Objective score (Obj) describing how objective the terms in the synset are; (ii) Positive score (Pos) describing how positive the terms are; and (iii) Negative score (Neg) describing how negative the terms are. These scores were derived by combining the results produced by a ‘committee’ of 8 ‘ternary classifiers’ (Esuli and Sebastiani, 2006, p.418): That is, in cases where all classifiers unanimously assign the same label to a synset, that label receives the maximum score; Otherwise, each label’s score is proportionate to the number of classifiers that have assigned it.

Classifiers, distinguished by the training set it used to train (k = 0, 2, 4, 6) and in the machine learning algorithm (SVM versus Rocchio), followed these steps: 1. Identification of Positive (L_Pos) and Negative Seeds (L_Neg); 2. Expanding Seed with WordNet relations to create Training datasets (Tr_Obj, Tr_Pos, Tr_Neg); and 3. Training machine-learning models.

Identification of Positive and Negative Seeds

Two subsets, L_Pos and L_Neg were first obtained from the seed terms proposed in Turney and Littman (2003). 47 Positive and 58 Negative synsets were obtained after removing irrelevant WordNet synsets (i.e for the term “nice”, the authors removed the synset relative to the French city of Nice).

Expanding Seed to produce Training datasets

L_Pos and L_Neg were iteratively expanded for k iterations, generating four training datasets (Tr^k=0, Tr^k=2, Tr^k=4, and Tr^k=6). Each Tr^k (for k = 0, 2, 4, 6) comprised Positive (Tr^k_Pos), Negative (Tr^k_Neg), and Objective (Tr^k_Obj) subsets. At each iteration, the seed sets were expanded using WordNet lexical relations that preserved affective meaning, mirroring the approach taken by WordNet-Affect in expanding their affective core dictionary (Strapparava and Valitutti, 2004; Valitutti, Strapparava and Stock, 2004). For instance, all the synsets of L_Pos with WordNet relations such as ‘also-see’ were added to Tr^k_Pos and those with WordNet relations such as ‘direct-antonymy’ were added to Tr^k_Neg.

Note that Tr^k_Obj was consistent across all four datasets, heuristically defined as synsets not belonging to Tr^k_Pos or Tr^k_Neg, containing terms not marked as Positive or Negative in the Harvard General Inquirer lexicon (p.419). The resulting Tr^k_Obj comprised 17,530 synsets.

Train machine-learning algorithms

Each term was given vector representations based on their ‘glosses’, which are textual definitions in WordNet. A textual representation was generated by collating all the glosses of a term in WordNet. This means that if a term has multiple senses (associated with multiple synsets), each sense contributes to the representation. The collation was converted into vectorial form by cosine-normalised TF-IDF.

The ‘ternary classifier’ distinguishes terms into Positive, Negative, or Objective based on two binary classifiers. The first classifier discerned Positive and not Positive. It was trained with the dataset Tr^k_Pos for Positive instances and the combination of datasets Tr^k_Neg and Tr^k_Obj for instances labelled as not Positive. The second classifier discriminated between Negative and not Negative. It was trained with the dataset Tr^k_Neg for Negative instances and the combination of datasets Tr^k_Pos and Tr^k_Obj for instances labelled as not Negative. The final classification was determined based on the outcomes of both classifiers, represented in a table below.

Classifier 1	Classifier 2	Final Classification
`Positive`	`not Negative`	`Positive`
`not Positive`	`Negative`	`Negative`
`not Positive`	`not Negative`	`Objective`
`Positive`	`Negative`	`Objective`

Each of the three scores Pos, Neg and Obj for a term ranges in [0,1] based on the results of 8 ternary classifiers.

ver.2010#

SentiWordNet 3.0 takes a departure from its predecessor with two notable methods: 1. Representing a term using “bag-of-synsets” instead of “bag-of-words”; and 2. Calculating the sentiment score using graph-based random walk models on WordNet, departing from the classifier committee used previously.

In SentiWordNet 3.0, Baccianella, Esuli and Sebastiani (2010) leverage manually disambiguated glosses obtained from the Princeton WordNet Gloss Corpus. Unlike the previous version (1.0) that utilised a “bag-of-words” model, SentiWordNet 3.0 represents glosses as a sequence of WordNet synsets. The term ‘manually disambiguated’ signifies the effort to resolve ambiguity in gloss interpretation, particularly when a term has multiple senses.

For clarity, consider the transformation of a term representation: instead of a bag-of-words like [“word 1”, “word 2”, …, “word N”], SentiWordNet 3.0 adopts a bag-of-synsets like [“synset 1”, “synset 2”, …, “synset N”]. This shift to a sequence of WordNet synsets allows SentiWordNet 3.0 to capture nuanced meanings associated with different senses of a word in its gloss, providing a more sophisticated and contextually rich representation compared to the simpler bag-of-words model used in SentiWordNet 1.0.

The “bag-of-synsets” representation facilitated modelling WordNet as a graph, enabling a new sentiment scoring approach. SentiWordNet 3.0 introduced a graph-based random walk procedure, by revising the PageRank algorithm (for detailed discussion, see Esuli and Sebastiani, 2007). It views WordNet as a directed graph, with synsets serving as nodes and edges connecting synsets based on their appearance in the textual definitions (glosses) of each other. A graph-based random walk procedure is then employed, allowing sentiment to dynamically “flow” through the WordNet graph.

The random walk iteratively propagates scores through the WordNet graph until convergence, leveraging its inherent structure. This contrasts the earlier iterative expansion method. Propagating scores in a context-aware manner enhanced accuracy compared to the previous approach.

The dataset includes 67,176 nouns, 14,004 adjectives, 7,440 verbs, and 3,050 adverbs, each with Pos and Neg scores in the range of [0, 1]. Notably, there are 3,047 nouns, 1,947 adjectives, 1,381 verbs, and 225 adverbs with duplicates.

from sentibank import archive

load = archive.load()
SentiWordNet = load.origin("SentiWordNet_v2010")

init_notebook_modetrusted

Loading ITables v2.4.4 from the init_notebook_mode cell... (need help?)

🧹 Processed Dictionary#

First-Pass Processing#

While the sentiment scores fall within the [0,1] range, the values appear to be more akin to a ordinal representation with distinct classes, rather than a continuous variable. Each value in the scale corresponds to a specific level of positive and negative sentiment, and the use of these specific values implies a more granular representation than a truly continuous scale. Pos scores were represented with 19 distinct data points between [0.111, 1], and Neg scores by 19 points between [0.125, 1]. The table below represents the Pos and Neg scores along with their corresponding counts.

`Pos`	Count (`Pos`)	`Neg`	Count (`Neg`)
`0.0`	100644	`0.0`	99631
`0.111`	1	`0.125`	5567
`0.125`	6684	`0.222`	2
`0.222`	17	`0.25`	4210
`0.25`	4093	`0.3`	1
`0.3`	3	`0.333`	2
`0.333`	19	`0.375`	2555
`0.364`	1	`0.444`	8
`0.375`	2611	`0.5`	2359
`0.4`	1	`0.556`	13
`0.444`	13	`0.6`	1
`0.5`	2053	`0.625`	1848
`0.556`	8	`0.636`	1
`0.625`	972	`0.667`	19
`0.667`	2	`0.7`	3
`0.7`	1	`0.75`	1118
`0.75`	387	`0.778`	17
`0.778`	2	`0.875`	287
`0.875`	128	`0.889`	1
`1.0`	19	`1.0`	16

Of the original 117,659 synsets, 7,031 terms were duplicates across nouns, verbs, adjectives and adverbs. For example, “last” appeared 7 times across 1 noun, 1 adverb, 1 verb and 4 adjective synsets. The Pos and Neg scores for duplicates were averaged. This resulted in 8,636 positive, 62,594 neutral, 9,353 negative, and 5,972 ambiguous terms after averaging. Ambiguous terms had non-zero averaged pos and neg scores, indicating both positive and negative connotations.

One distinguishing feature of SentiWordNet, setting it apart from other sentiment dictionaries, is its recognition that terms can encompass both positive and negative polarities to varying degrees. For instance, “idle” had an average negative score of 0.375 and positive score of 0.031, implying a likely negativity of 0.375 when used in an opinion-related text. This nuanced approach enables SentiWordNet to capture the multifaceted nature of sentiment, acknowledging that words may convey both positive and negative connotations depending on the context in which they are used.

The optimal approach involves embracing polysemy and algorithmically determining positive and negative scores based on the specific context. Take the term “idle,” which in a work-related context may be negatively connotated, suggesting inactivity, while in a machinery context, it could be positively perceived, indicating efficient operation. To capture intended meanings, one could use the Lesk algorithm for word sense disambiguation to assign WordNet synsets to words in context based. This would fully leverage SentiWordNet3.0 in its full potential.

However, the purpose of sentibank is to allow researchers to rapidly utilise the processed dictionary. Thus, two different versions of processed dictionaries were created:

SentiWordNet_v2010_simple, a dictionary that removed ambiguous terms, keeping only strictly positive and negative terms regardless of context; and
SentiWordNet_v2010_logtransform, a dictionary that retains ambiguous terms using logarithmic transformation for overall scores

SentiWordNet_v2010_simple created by removing all 62,594 neutral and 5,972 ambiguous terms, leaving 17,989 unique terms.

SentiWordNet_v2010_logtransform involved a logarithmic transformation, specifically defined as log(pos+1) - log(neg+1). This addressed two issues when simply subtracting the scores. Firstly, the transformation serves to mitigate the impact of extreme values in the sentiment scores. Without the logarithmic adjustment, the influence of exceptionally low or high values might overshadow the overall sentiment calculation. The logarithmic transformation helps balance the significance of different score magnitudes. Secondly, the logarithmic transformation is adept at preserving the relative differences between positive and negative scores. This ensures that the proportional relationships between scores are maintained, irrespective of their absolute magnitudes. For instance, in a term with a positive score of 0.7 and a negative score of 0.2, the resulting overall sentiment score is 0.546. Conversely, a term with a positive score of 0.5 and a negative score of 0 yields an overall sentiment score of 0.405, demonstrating the preservation of relative differences in the transformed scores.

The transformation resulted in 10,773 positive, 12,130 negative and 63,652 neutral terms. All neutral terms were removed, leaving 22,903 terms. It is important to note that SentiWordNet_v2010_logtransform increases the coverage of the sentiment dictionary, at the cost of potentially misleading values.