**4. Candidate tag generation**

Tag recommendation methods can be divided into two steps: (1) the generation of a set of candidate tags and (2) the ranking of the candidate tags produced in step (1). In this section, we introduce the main techniques to tackle the first step, while in Section 5, we discuss methods to perform the second step.

The candidate tag generation depends on the data sources available in the target application. As summarized by [9], previous tag recommendation strategies have exploited as data sources: (1) the folksonomy (history of tag assignments); (2) textual features (other than tags), such as title, description, and user comments; (3) rich media content, that is, image, audio, or video; and (4) social features, such as friendship links in social networks and other interactions among users as illustrated in Section 2.

Based on these data sources, we can name three main groups of techniques to extract or generate candidate tags: (1) extraction of terms from the textual features associated with the target object, (2) tag co-occurrences with terms in these textual features (possibly including previously assigned tags) or other features (e.g., visual features for rich media content), and (3) tags extracted from neighbors, that is, objects that are similar to the target object or users that are similar to the target user. These three groups of techniques will be the subject of Sections 4.1–4.3, respectively.

#### **4.1 Keyword extraction from texts**

The simplest strategy to extract candidate tags from a given text is to consider each (whitespace) separated word as a candidate, after removing punctuation and other special characters. After this, a basic post-processing step is to remove *stop words* (i.e., words such as articles, prepositions, and conjunctions, which carry little semantics and thus are not adequate as keywords) from the list of generated candidates. Finally, corpus-oriented statistics of these individual words are evaluated to select the most promising candidates. These statistics are also exploited to rank candidate tags, and thus they will be discussed in Section 5.
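
A minimal sketch of this baseline in Python (the stop word list below is a tiny illustrative sample, not the full list a real system would use):

```python
import re

# Tiny illustrative stop word list; production systems use larger,
# language-specific lists.
STOP_WORDS = {"the", "a", "an", "of", "from", "can", "and", "or", "to", "lot"}

def single_word_candidates(text):
    """Whitespace/punctuation split followed by stop word removal."""
    words = re.findall(r"[a-z0-9-]+", text.lower())  # strips punctuation
    return [w for w in words if w not in STOP_WORDS]

single_word_candidates(
    "The tagging process can benefit a lot from a tag recommendation service."
)
# -> ['tagging', 'process', 'benefit', 'tag', 'recommendation', 'service']
```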

However, this simple strategy is only capable of generating single words as tags, although it is common to use expressions containing two or more words (e.g., "information systems," "digital image processing") as tags. Thus, alternative keyword extraction techniques first generate all word *n*-grams obtained from a sliding window through the text, for *n* ranging from one to, let us say, three or four words. For example, for the following sentence:

"The tagging process can benefit a lot from a tag recommendation service."

A sliding window of size *n* = 3 would produce the following initial terms as keywords:

The tagging process – tagging process can – process can benefit – can benefit a – benefit a lot – a lot from – lot from a – from a tag – a tag recommendation – tag recommendation service.
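
The sliding-window generation can be sketched in a few lines of Python; with *n* up to 3, the trigram subset reproduces the list above:

```python
def word_ngrams(text, max_n=3):
    """All word n-grams, 1 <= n <= max_n, via a sliding window."""
    words = text.rstrip(".").split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

sentence = "The tagging process can benefit a lot from a tag recommendation service."
trigrams = [g for g in word_ngrams(sentence) if g.count(" ") == 2]
# trigrams == ['The tagging process', 'tagging process can', ...,
#              'tag recommendation service']  (10 trigrams in total)
```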

To filter out meaningless or uninformative candidate tags such as "benefit a lot" or "from a tag," some authors, such as [7, 16], exploit a selection approach based on part-of-speech (PoS) labels, which captures the idea that keywords tend to follow certain syntactic patterns. This approach is based on empirical evidence obtained from training data. First, the most frequent PoS patterns of keywords that occur in a given training dataset are identified. For example, the three most frequent PoS patterns for keywords found in [15] are:

• ADJECTIVE + NOUN (plural)

• ADJECTIVE + NOUN (singular or plural)

• NOUN + NOUN (both singular or plural)

Thus, only sequences of words that match the top-*x* (let us say, *x* = 50) most frequent patterns are selected as candidate tags. For the aforementioned example, and considering *n* = 2 and *x* = 3, the selected candidate tags would be "tagging process," "tag recommendation," and "recommendation service," all three of them matching one of these patterns.
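
The filtering step can be sketched as follows; the hardcoded PoS lexicon and the two retained patterns are illustrative assumptions (a real system would run a trained PoS tagger and learn the top-*x* patterns from training data):

```python
# Illustrative PoS lexicon for this sentence only (an assumption, not a real tagger).
POS = {"the": "DET", "tagging": "NOUN", "process": "NOUN", "can": "VERB",
       "benefit": "VERB", "a": "DET", "lot": "NOUN", "from": "ADP",
       "tag": "NOUN", "recommendation": "NOUN", "service": "NOUN"}

# Assumed top bigram patterns (x = 2 here, for brevity).
TOP_PATTERNS = {("ADJ", "NOUN"), ("NOUN", "NOUN")}

def pos_filter(bigrams):
    """Keep only bigrams whose PoS sequence matches a frequent keyword pattern."""
    return [g for g in bigrams
            if tuple(POS.get(w) for w in g.split()) in TOP_PATTERNS]

words = "the tagging process can benefit a lot from a tag recommendation service".split()
bigrams = [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]
pos_filter(bigrams)
# -> ['tagging process', 'tag recommendation', 'recommendation service']
```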

Unlike the PoS-based approach, which is a supervised, language-dependent approach that processes a training dataset, the *Rapid Automatic Keyword Extraction* (RAKE) [17] relies only on the target text to generate keywords, being known as a "document-oriented" approach, as opposed to the "corpus-oriented" methods. RAKE is based on the observation that keywords frequently contain multiple words but rarely contain standard punctuation or stop words. Instead of using an arbitrarily sized sliding window, RAKE splits the text using stop words and punctuation as delimiters. In our sentence example, "**The** tagging process **can** benefit **a lot from a** tag recommendation service," the stop words (in bold) would be discarded, generating the following candidate tags:

tagging process – benefit – tag recommendation service

After extracting candidate keywords, RAKE builds a graph of word co-occurrences, in which there is an edge between two words if they appear in the same keyword. The score of each word *w* is calculated as *deg(w)/freq(w)*, where *deg(w)* is the degree of *w* in the co-occurrence graph and *freq(w)* is the number of occurrences of *w* in the text. The score of a given candidate keyword is defined as the sum of the scores of its containing words. Finally, in order to consider keywords that contain stop words (e.g., "set *of* natural numbers"), pairs of candidate keywords that appear in consecutive positions of the text at least twice are adjoined.
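
RAKE's phrase splitting and degree/frequency scoring can be sketched as below (the stop word list is again a toy sample; following the original formulation, *deg(w)* counts co-occurrences of *w* within its phrases, including *w* itself):

```python
from collections import Counter

STOP_WORDS = {"the", "can", "a", "lot", "from"}  # toy list for this example

def rake_phrases(text):
    """Split the text into candidate phrases, using stop words as delimiters."""
    phrases, current = [], []
    for w in text.lower().rstrip(".").split():
        if w in STOP_WORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    return phrases

def rake_scores(phrases):
    """Score each phrase by summing deg(w)/freq(w) over its words."""
    freq, deg = Counter(), Counter()
    for p in phrases:
        for w in p:
            freq[w] += 1
            deg[w] += len(p)  # w co-occurs with every word of the phrase
    return {" ".join(p): sum(deg[w] / freq[w] for w in p) for p in phrases}

scores = rake_scores(rake_phrases(
    "The tagging process can benefit a lot from a tag recommendation service."
))
# -> {'tagging process': 4.0, 'benefit': 1.0, 'tag recommendation service': 9.0}
```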

#### **4.2 Tag co-occurrences**

Another strong source of candidate tags is the history of tag assignments of the application (the folksonomy). Tags that the target user frequently used in previous tagging events are good candidates to recommend for this user, especially in a personalized recommendation task. Still more interesting, we can exploit tag co-occurrences in these previous posts, recommending to an object *o*, with an initial set of tags *Io*, tags that frequently co-occur with the tags in *Io* in a training folksonomy dataset *D*, as performed by [2, 13, 14].

Tag co-occurrences are usually computed by exploiting association rules, which are employed in general to describe frequently co-occurring item sets. For tag recommendation, association rules assume the form *X* ➔ *y*, where *X* (the antecedent) is a set of tags and *y* (the consequent) is a candidate tag for recommendation. The main metrics that estimate the strength of an association rule are the *support*,
defined as the number of co-occurrences of *X* and *y* in the training set, and the confidence, calculated as the conditional probability that *y* is assigned as a tag to an object given that all tags in *X* are also associated with it. Considering that the number of rules extracted from the training set can be very large and some of them may not be useful for recommendation, minimum support and confidence thresholds are used as lower bounds to select only the most important and/or reliable rules. This selection can improve both effectiveness and efficiency of the recommender.

To recommend tags for an object *o*, we select rules *X* ➔ *y* in which *X* is a subset of *Io*, the set of initial tags in *o*. For each term *c* appearing as consequent of any of the selected rules, we usually estimate its relevance as a tag for the object (and for the user in the personalized case), given the initial tag set *Io*, as the sum of the confidences of all rules containing *c*. In the absence of an initial tag set, words occurring in other textual features of the target object, such as title and description, can be used as *Io,* as performed by [18]. Another alternative is to compute co-occurrences between tags and visual features extracted from images or other rich media content associated with the target object [19].
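
A minimal sketch of this rule-based pipeline in Python (toy folksonomy, single-tag antecedents; the support and confidence thresholds are illustrative):

```python
from collections import Counter
from itertools import combinations

# Toy training folksonomy: the tag set of each annotated object.
posts = [{"music", "rock", "guitar"},
         {"music", "rock", "concert"},
         {"music", "jazz"},
         {"rock", "guitar"}]

def mine_rules(posts, min_support=2, min_confidence=0.5):
    """Mine rules x -> y (single-tag antecedents) with support/confidence filters."""
    tag_count, pair_count = Counter(), Counter()
    for tags in posts:
        tag_count.update(tags)
        for x, y in combinations(sorted(tags), 2):
            pair_count[(x, y)] += 1
            pair_count[(y, x)] += 1
    return {(x, y): support / tag_count[x]            # confidence = P(y | x)
            for (x, y), support in pair_count.items()
            if support >= min_support and support / tag_count[x] >= min_confidence}

def recommend(initial_tags, rules):
    """Score each candidate by the summed confidence of the rules it fires."""
    scores = Counter()
    for (x, y), conf in rules.items():
        if x in initial_tags and y not in initial_tags:
            scores[y] += conf
    return dict(scores)

recommend({"rock"}, mine_rules(posts))
# both 'music' and 'guitar' are recommended, each with confidence sum 2/3
```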

#### **4.3 Tags from neighbors**

Another form of obtaining candidate tags that are external to the target object, besides exploiting co-occurrences, is extracting tags from the neighborhood of the target object *o*, that is, the set of objects most similar to *o*. Similarly, we can generate candidate tags for a target user *u* from similar users or from users that have some kind of connection to *u* in the application (e.g., explicit friendship links, endorsement links, etc.). The rationale is that similar objects or users are usually associated with similar tags.

Thus, the neighborhood-based tag generation approaches exploit a graph in which the nodes correspond to objects or users, and there is an edge between two objects (or two users) if they are similar (e.g., share tags or other words in common). Alternatively, visual features extracted from image and video objects can be used to estimate content similarity [19, 20], although they may face scalability issues and a larger semantic gap [20].

To identify similar objects or users, each object (or user) is usually modeled as a bag of terms (extracted from the textual features of the object or from the vocabulary of the users). These terms receive a TFIDF weight, and a similarity measure such as the cosine of these term vector representations is exploited to estimate the similarity between objects or users [21].
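
A compact sketch of this neighborhood computation with TFIDF-weighted cosine similarity (toy corpus; real systems add stemming, normalization, and indexing):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: {object_id: list of terms} -> {object_id: {term: tf * idf}}."""
    n = len(docs)
    df = Counter(t for terms in docs.values() for t in set(terms))
    return {oid: {t: tf * math.log(n / df[t]) for t, tf in Counter(terms).items()}
            for oid, terms in docs.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def nearest_neighbors(target, vecs, k=2):
    """The k objects most similar to the target, by cosine of TFIDF vectors."""
    sims = [(oid, cosine(vecs[target], v)) for oid, v in vecs.items() if oid != target]
    return sorted(sims, key=lambda p: p[1], reverse=True)[:k]

docs = {"o1": "rock guitar solo".split(),
        "o2": "rock concert live".split(),
        "o3": "cooking pasta recipe".split()}
nearest_neighbors("o1", tfidf_vectors(docs))
# 'o2' (shares 'rock') ranks above the unrelated 'o3'
```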

**5. Candidate tag ranking**

After generating a set of candidate tags, it is necessary to rank them, showing the most relevant tags first, in order to provide effective tag recommendations. Some candidate tag generation strategies already provide a measure to estimate candidate tag relevance, such as the degree/frequency ratio in RAKE, as defined in Section 4.1. In this section, we first discuss various tag quality attributes that can be used to estimate tag relevance, either in isolation or combined with other attributes (Section 5.1). Then, we discuss methods that can automatically combine various attributes by exploiting a learning-to-rank approach (Section 5.2).

#### **5.1 Tag quality attributes**

Tag quality attributes can be grouped into the following categories, based on the aspect they try to capture regarding the tag recommendation task [13]:

**143**

*Tagging and Tag Recommendation*

*DOI: http://dx.doi.org/10.5772/intechopen.82242*

features of the target object.

*5.1.1 Tag co-occurrence attributes*

of all rules that point to *c*, i.e.:

to performance issues.

*Sum*(*c*,*Io*) = ∑

*5.1.2 Descriptive power attributes*

for that object) of *o* that contain *c* [3].

guish the target object from others.

interest of a target user in certain tags.

*X*⊆*Io*

related tags associated with the target object.

• *Tag co-occurrence attributes*: estimate how relevant a candidate tag *c* is given a set

• *Descriptive power attributes*: estimate how accurately a candidate tag describes the object's content based on statistics of the occurrence of the tag in the textual

• *Discriminative power attributes*: estimate the capability of a candidate to distin-

• *Term predictability*: indicates the likelihood that a word can be predicted as a tag.

• *User interest attributes*: used for personalization, these attributes estimate the

As mentioned in Section 4.2, tag recommenders select association rules in which antecedents are included in *Io*, the set of tags already available in the target object, or terms that can be used as proxy for these initial tags. For each tag *c* appearing as consequent of any of the selected rules, the relevance of *c* as a tag for the object *o*, given the initial tag set *Io*, can be estimated by sum, which sums up the confidences

where *R* is a set of association rules generated from the training set and *l* is the size limit for the association rules' antecedents, usually limited to 1 or 2 words, due

*Sum* was proposed by [2], which also proposed several other attributes related to tag co-occurrences. For example, *Vote (c, Io)* can be defined as the number of association rules whose antecedents are tags in *Io* and whose consequent is the candidate tag *c*. In other words, it is the number of "votes" a candidate tag has received from

Descriptive power attributes usually estimate the descriptive capacity of candidate tags based on statistics of their occurrence in the textual features of the target object. We [13] proposed the use of four of these attributes for tag recommendation. We start by defining the *Term Spread* of a candidate *c* in an object *o*, *TS(c, o)*, as the number of textual features (except tags if we desire to recommend only "new" tags

The rationale behind *TS(c, o)* is that the larger the number of textual features of *o* containing *c*, the more related *c* is to *o*'s content. For example, if the term "X-men" appears in all features of a video, there is a high chance that the video is related to the famous comics. Our results in [3] indicate that, in isolation, TS provides better

TF or *term frequency*, in turn, is the total number of occurrences of the candidate tag *c* in all textual features of the target object *o* and thus considers these textual features as a single bag of words. In contrast, TS takes into account the multiple

tag recommendations than the traditional TF in most datasets.

textual blocks that compound the structure of the target object.

*confidence*(*X* → *c*), (*X* → *c*) ∈ *R*,|*X*| ≤ *l*, (1)

of input tags that often co-occur with *c* in the data collection.

*Cyberspace*

associated with the target object [19].

issues and a larger semantic gap [20].

similarity between objects or users [21].

exploiting a learning-to-rank approach (Section 5.2).

**5. Candidate tag ranking**

**5.1 Tag quality attributes**

**4.3 Tags from neighbors**

with similar tags.

defined as the number of co-occurrences of *X* and *y* in the training set, and the confidence, calculated as the conditional probability that *y* is assigned as a tag to an object given that all tags in *X* are also associated with it. Considering that the number of rules extracted from the training set can be very large and some of them may not be useful for recommendation, minimum support and confidence thresholds are used as lower bounds to select only the most important and/or reliable rules. This selection can improve both effectiveness and efficiency of the recommender. To recommend tags for an object *o*, we select rules *X* ➔ *y* in which *X* is a subset of *Io*, the set of initial tags in *o*. For each term *c* appearing as consequent of any of the selected rules, we usually estimate its relevance as a tag for the object (and for the user in the personalized case), given the initial tag set *Io*, as the sum of the confidences of all rules containing *c*. In the absence of an initial tag set, words occurring in other textual features of the target object, such as title and description, can be used as *Io,* as performed by [18]. Another alternative is to compute co-occurrences between tags and visual features extracted from images or other rich media content

Another form of obtaining candidate tags that are external to the target object, besides exploiting co-occurrences, is extracting tags from the neighborhood of the target object *o*, that is, the set of most similar objects with relation to *o*. Similarly, we can generate candidate tags for a target user *u* from similar users or users that have some kind of connection in the application (e.g., explicit friendship links, endorsement links, etc.). The rationale is that similar objects or users are usually associated

Thus, the neighborhood-based tag generation approaches exploit a graph in which the nodes correspond to objects or users, and there is an edge between two objects (or two users) if they are similar (e.g., share tags or other words in common). Alternatively, visual features extracted from image and video objects can be used to estimate content similarity [19, 20], although they may face scalability

To identify similar objects or users, each object (or user) is usually modeled as a bag of terms (extracted from the textual features of the object or from the vocabulary of the users). These terms receive a TFIDF weight, and a similarity measure such as the cosine of these term vector representations is exploited to estimate the

After generating a set of candidate tags, it is necessary to rank them, showing the most relevant tags first, in order to provide effective tag recommendations. Some tag candidate generation strategies already provide a measure to estimate the candidate tag relevance, such as the degree/frequency ratio in RAKE, as defined in Section 4.1. In this section, we first discuss various tag quality attributes that can be used to estimate tag relevance (Section 5.1), isolated or combined with other attributes. Then, we discuss methods that can automatically combine various attributes

Tag quality attributes can be grouped into the following categories, based on the

aspect they try to capture regarding the tag recommendation task [13]:

**142**


#### *5.1.1 Tag co-occurrence attributes*

As mentioned in Section 4.2, tag recommenders select association rules in which antecedents are included in *Io*, the set of tags already available in the target object, or terms that can be used as proxy for these initial tags. For each tag *c* appearing as consequent of any of the selected rules, the relevance of *c* as a tag for the object *o*, given the initial tag set *Io*, can be estimated by sum, which sums up the confidences of all rules that point to *c*, i.e.:

$$\text{Sum}(c, I_o) = \sum_{X \subseteq I_o} \text{confidence}(X \to c), \quad (X \to c) \in R, \; |X| \le l, \tag{1}$$

where *R* is a set of association rules generated from the training set and *l* is the size limit for the association rules' antecedents, usually limited to 1 or 2 words, due to performance issues.

*Sum* was proposed by the authors of [2], who also proposed several other attributes related to tag co-occurrences. For example, *Vote(c, Io)* can be defined as the number of association rules whose antecedents are tags in *Io* and whose consequent is the candidate tag *c*. In other words, it is the number of "votes" a candidate tag has received from related tags associated with the target object.
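
The contrast between *Sum* and *Vote* in a nutshell (the rule set below is hypothetical and assumed to be pre-filtered so that every antecedent belongs to *Io*):

```python
# Hypothetical selected rules: (antecedent tag, candidate tag) -> confidence.
selected_rules = {("rock", "guitar"): 0.75,
                  ("music", "guitar"): 0.25,
                  ("music", "concert"): 0.9}

def sum_attr(c, rules):
    """Sum(c, Io): summed confidence of the selected rules pointing to c."""
    return sum(conf for (_, y), conf in rules.items() if y == c)

def vote_attr(c, rules):
    """Vote(c, Io): number of selected rules ("votes") pointing to c."""
    return sum(1 for (_, y) in rules if y == c)

# guitar: Sum = 1.0, Vote = 2; concert: Sum = 0.9, Vote = 1
```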

#### *5.1.2 Descriptive power attributes*

Descriptive power attributes usually estimate the descriptive capacity of candidate tags based on statistics of their occurrence in the textual features of the target object. We [13] proposed the use of four of these attributes for tag recommendation. We start by defining the *Term Spread* of a candidate *c* in an object *o*, *TS(c, o)*, as the number of textual features (except tags if we desire to recommend only "new" tags for that object) of *o* that contain *c* [3].

The rationale behind *TS(c, o)* is that the larger the number of textual features of *o* containing *c*, the more related *c* is to *o*'s content. For example, if the term "X-men" appears in all features of a video, there is a high chance that the video is related to the famous comics. Our results in [3] indicate that, in isolation, TS provides better tag recommendations than the traditional TF in most datasets.

TF or *term frequency*, in turn, is the total number of occurrences of the candidate tag *c* in all textual features of the target object *o* and thus treats these textual features as a single bag of words. In contrast, TS takes into account the multiple textual blocks that make up the structure of the target object.
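
The difference between TS and TF can be sketched as follows (the target object below, with three textual features, is hypothetical):

```python
# Hypothetical target object: three textual features, tags excluded.
features = {"title": ["x-men", "trailer"],
            "description": ["official", "x-men", "movie", "trailer", "x-men"],
            "comments": ["awesome", "x-men"]}

def term_spread(c, features):
    """TS(c, o): how many textual features of o contain c."""
    return sum(1 for terms in features.values() if c in terms)

def term_frequency(c, features):
    """TF(c, o): total occurrences of c, treating all features as one bag of words."""
    return sum(terms.count(c) for terms in features.values())

# 'x-men': TS = 3 (appears in every feature), TF = 4
# 'trailer': TS = 2, TF = 2
```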


However, neither TS nor TF consider that some textual features may describe the content of the target object more accurately than others. For example, the title is usually the most representative textual feature of the object's content [3]. Thus, we proposed in [13] two other attributes, which extend TF and TS, weighting a candidate tag based on the average descriptive powers of the textual features in which it appears.

To define these new attributes, we need first to automatically estimate the descriptive power of a textual feature Fi using the *average feature spread* (AFS) metric [3]. Let the *feature instance spread* of a feature *Fi,o* associated with an object *o*, *FIS*(*Fi,o*), be the average TS over all terms in *Fi,o*. We define *AFS(Fi)* as the average *FIS(Fi,o)* over all instances of *Fi* associated with objects in the training set *D*. Thus, we define weighted TS (wTS) and weighted TF (wTF) as

$$\begin{aligned} wTS(c, o) &= \sum_{F_{i,o} \in o} I(c, F_{i,o}) \times AFS(F_i), \quad \text{where } I(c, F_{i,o}) = \begin{cases} 1, & \text{if } c \in F_{i,o} \\ 0, & \text{otherwise} \end{cases} \\ wTF(c, o) &= \sum_{F_{i,o} \in o} tf(c, F_{i,o}) \times AFS(F_i), \end{aligned} \tag{2}$$

where *tf(c, Fi,o)* is the number of occurrences of the candidate tag *c* in textual feature *Fi,o* of the target object *o.*
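
Following these definitions, AFS, wTS, and wTF can be sketched as below (the two-object training set and the target object are toy examples; TS is redefined in-block to keep the sketch self-contained):

```python
def term_spread(c, features):
    """TS(c, o): number of textual features of o containing c."""
    return sum(1 for terms in features.values() if c in terms)

def fis(name, features):
    """FIS(Fi,o): average TS over the terms of feature instance Fi,o."""
    terms = features[name]
    return sum(term_spread(t, features) for t in terms) / len(terms)

def afs(name, training_objects):
    """AFS(Fi): average FIS(Fi,o) over the objects of the training set."""
    return sum(fis(name, obj) for obj in training_objects) / len(training_objects)

def w_ts(c, features, weights):
    """wTS: AFS-weighted count of the features that contain c."""
    return sum(weights[f] for f, terms in features.items() if c in terms)

def w_tf(c, features, weights):
    """wTF: term frequencies of c weighted by each feature's AFS."""
    return sum(terms.count(c) * weights[f] for f, terms in features.items())

# Toy training set with two textual features per object.
train = [{"title": ["dog"], "description": ["dog", "cat"]},
         {"title": ["cat"], "description": ["cat"]}]
weights = {f: afs(f, train) for f in ("title", "description")}
# weights == {'title': 2.0, 'description': 1.75}

target = {"title": ["cat"], "description": ["cat", "cat", "dog"]}
# w_ts('cat', target, weights) == 3.75; w_tf('cat', target, weights) == 5.5
```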

#### *5.1.3 Discriminative power attributes*

Discriminative power attributes promote more infrequent terms as tags, since they may better *discriminate* objects into different categories, topics, or levels of relevance, particularly considering that several services (e.g., classification, searching) often perform IR on multimedia content by using the associated tags as data sources. This aspect is captured by the *inverse feature frequency* (IFF) attribute [3], directly derived from the traditional *inverse document frequency* (IDF), considering, however, the term frequency in a specific textual feature (tags, in this case), instead of the full set of terms associated with the objects in the training dataset *D*. Given the number of elements in the training set, *N =* |*D*|, the IFF of a candidate tag *c* in a textual feature *i* (tags in this case) is defined as *IFF(c, i) = log((N + 1)/(fi(c) + 1))*, where *fi(c)* is the number of objects in the training set in which *c* appears in the textual feature *i*. In our case, *fi(c)* is the number of training objects that are tagged with *c*.

We note that the value 1 is added to both numerator and denominator, without harming the tag specificity estimation, to deal with the value 0 in the denominator, which occurs for new terms that do not appear as tags in the training data.
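
A direct transcription of the IFF formula (the counts passed in are illustrative):

```python
import math

def iff(n_objects, f_c):
    """IFF(c, tags) = log((N + 1) / (f(c) + 1)), where f_c is the number of
    training objects tagged with c. The +1 terms avoid division by zero for
    terms that never appear as tags in the training data."""
    return math.log((n_objects + 1) / (f_c + 1))

# A rare tag is more discriminative than a ubiquitous one:
# iff(999, 0) > iff(999, 99) > iff(999, 999) == 0.0
```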

IFF may privilege terms from other textual features that do not appear as tags in the training data, or noisy terms such as typos. Nevertheless, this attribute can be combined with the other attributes into a single function, using, for example, learning-to-rank algorithms. Thus, its relative weight can be adjusted in order to avoid negative impacts on tag recommendation effectiveness.

Considering that both too general and too specific or noisy terms may not be ideal tag recommendations, the authors of [2] propose the *stability* attribute, which promotes terms with intermediate frequency values.

#### *5.1.4 Term predictability*

Another important aspect for tag recommendation is term predictability. Heymann et al. [22] measure this characteristic through the term's *entropy*.

If a term occurs consistently with certain tags, it is more predictable, thus having lower entropy. Terms that occur indiscriminately with many other tags are less predictable, thus having higher entropy. Term entropy can be useful particularly for breaking ties, as it is better to recommend more "consistent" or less "confusing" terms. Another predictability attribute, called *Pred* [13, 18], measures the probability that a term is used as a tag in an object given that it was used in another textual feature of the same object.

#### *5.1.5 User interest attribute*

The user frequency (UF) attribute was used in [13, 18] in order to estimate the relevance of a candidate tag for a target user and thus provide personalized recommendations. *UF(c, u)* is simply the frequency at which the target user *u* assigns a candidate tag *c* to objects in a training collection. The idea is that the more frequently a user *u* assigns a candidate tag *c* to other objects in the application, the more relevant *c* is for *u*.
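
Term entropy and UF can both be sketched in a few lines (the tagging history below is hypothetical):

```python
import math

def entropy(cooccurrence_counts):
    """Shannon entropy of a term's co-occurrence distribution over tags:
    low entropy means the term consistently co-occurs with few tags."""
    total = sum(cooccurrence_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in cooccurrence_counts if c)

def user_frequency(c, u, history):
    """UF(c, u): how many times user u assigned tag c in the training data."""
    return sum(1 for user, _, tag in history if user == u and tag == c)

# Hypothetical tagging history: (user, object, tag) assignments.
history = [("alice", "o1", "rock"), ("alice", "o2", "rock"),
           ("alice", "o2", "guitar"), ("bob", "o3", "jazz")]

# entropy([4]) == 0.0 (fully predictable); entropy([1, 1, 1, 1]) == 2.0
# user_frequency('rock', 'alice', history) == 2
```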

It is also common to exploit the temporal dynamics of tagging, particularly in user frequency-based tag attributes. From the observation that the temporal decay of the users' word choices follows a power-law function, the authors in [23] integrate a time

Observing that recommendation is usually modeled as a ranking problem (i.e., we want to recommend the most relevant items first), learning-to-rank (L2R) techniques constitute an appropriate approach to tackle it. L2R-based methods are supervised approaches that automatically "learn" a ranking function from "previously seen" data known as training instances. Such training examples usually consist of candidate tags, their tag quality attribute values, and their relevance labels, which indicates their relevance levels. These labels can be assigned either manually or by exploiting previous tag assignments as ground truth. The objective of L2R approaches is to generate a model (function) that maps the tag quality attributes into a relevance score or rank. More formally, for each candidate tag *c* for each object *o* (or pair object-user <*o*, *u*>

(or *Xc*,*o*,*<sup>u</sup>* <sup>∈</sup> <sup>ℝ</sup><sup>m</sup>

),

component that gives more weight to tags that have been used more recently.

for personalized recommendation), we associate a vector *Xc*,*<sup>o</sup>* <sup>∈</sup> <sup>ℝ</sup><sup>m</sup>

learned in the training step is applied in order to predict these values.

be the best performing strategies for the tag recommendation problem.

Various L2R-based algorithms have been proposed for tag recommendation in the literature, including RankSVM, RankBoost, Genetic Programming, Random Forest (RF), Multiple Additive Regression Trees (MART), Lambda-MART, AdaRank, ListNet, Ranknet, and Coordinate Ascent. In [24] we can find a brief description of each of these algorithms and experimental results of the comparison of these methods using the RankLib tool (https://sourceforge.net/p/lemur/wiki/ RankLib/). According to our results, RF, MART, and Lambda-MART are found to

In [25], the author reviewed existing L2R algorithms in the context of document ranking, categorizing them into three approaches: pointwise, pairwise, and listwise. The *pointwise* approach associates a numerical score to each query-document pair and thus approximates the ranking problem by a regression problem. *Pairwise* approaches, in turn, transform the ranking problem into binary classification: given a pair of

where *m* is the number of considered tag quality attributes (e.g., each metric defined in Section 4). For training instances, we also assign a relevance label *yc,o* (or *yc,o,u*), indicating the relevance level of candidate tag *c* to the object *o* (and user *u*). For example, we can define two relevance levels: 1 for relevant tags, and 0 for nonrelevant tags. In the offline training step, this data is exploited to generate the recommendation model. In the online recommendation step, in which we have new objects or users as input, the *yc,o* (or *yc,o,u*) values are unknown, and the model

**5.2 Learn-to-rank-based tag recommendation**

#### *Tagging and Tag Recommendation DOI: http://dx.doi.org/10.5772/intechopen.82242*

However, neither TS nor TF considers that some textual features may describe the content of the target object more accurately than others. For example, the title is usually the most representative textual feature of the object's content [3]. Thus, we proposed in [13] two other attributes, which extend TF and TS, weighting a candidate tag based on the average descriptive powers of the textual features in which it appears.

To define these new attributes, we first need to automatically estimate the descriptive power of a textual feature *Fi* using the *average feature spread* (AFS) metric [3]. Let the *feature instance spread* of a feature *Fi,o* associated with an object *o*, *FIS(Fi,o)*, be the average TS over all terms in *Fi,o*. We define *AFS(Fi)* as the average *FIS(Fi,o)* over all instances of *Fi* associated with objects in the training set *D*. Thus, we define weighted TS (wTS) and weighted TF (wTF) as

$$wTS(c, o) = \sum_{F_{i,o} \in o} I(c, F_{i,o}) \times AFS(F_i), \quad \text{where } I(c, F_{i,o}) = \begin{cases} 1, & \text{if } c \in F_{i,o} \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

$$wTF(c, o) = \sum_{F_{i,o} \in o} tf(c, F_{i,o}) \times AFS(F_i) \tag{2}$$

where *tf(c, Fi,o)* is the number of occurrences of the candidate tag *c* in textual feature *Fi,o* of the target object *o*.

### *5.1.3 Discriminative power attributes*

Discriminative power attributes promote more infrequent terms as tags, since they may better *discriminate* objects into different categories, topics, or levels of relevance, particularly considering that several services (e.g., classification, searching) often perform IR on multimedia content by using the associated tags as data sources. This aspect is captured by the *inverse feature frequency* (IFF) attribute [3], directly derived from the traditional *inverse document frequency* (IDF), considering, however, the term frequency in a specific textual feature (tags, in this case), instead of the full set of terms associated with the objects in the training dataset *D*. Given the number of elements in the training set *N =* |*D*|, the IFF of a candidate tag *c* in a textual feature *i* (tags in this case) is defined as *IFF(c, i) = log((N + 1)/(fi(c) + 1))*, where *fi(c)* is the number of objects in the training set in which *c* appears in the textual feature *i*. In our case, *fi(c)* is the number of training objects that are tagged with *c*. We note that the value 1 is added to both numerator and denominator, without harming the tag specificity estimation, to deal with the value 0 in the denominator, which occurs for new terms that do not appear as tags in the training data.

IFF may privilege terms from other textual features that do not appear as tags in the training data, or noisy terms such as typos. Nevertheless, this attribute can be combined with the other attributes into a function, using, for example, learning-to-rank algorithms. Thus, its relative weight can be adjusted in order to avoid negative impacts on tag recommendation effectiveness.

Considering that both too general and too specific or noisy terms may not be ideal tag recommendations, [2] propose the *stability* attribute, which promotes terms with intermediate frequency values.

### *5.1.4 Term predictability*

Another important aspect for tag recommendation is term predictability. Heymann et al. [22] measure this characteristic through the term's *entropy*. If a term occurs consistently with certain tags, it is more predictable, thus having lower entropy. Terms that occur indiscriminately with many other tags are less predictable, thus having higher entropy. Term entropy can be useful particularly for breaking ties, as it is better to recommend more "consistent" or less "confusing" terms.

Another predictability attribute, called *Pred* [13, 18], measures the probability that a term is used as a tag in an object given that it was used in another textual feature of the same object.

### *5.1.5 User interest attribute*

The user frequency (UF) attribute was used in [13, 18] in order to estimate the relevance of a candidate tag for a target user and thus provides personalized recommendations. *UF(c, u)* is simply the frequency at which the target user *u* assigns a candidate tag *c* to objects in a training collection. The idea is that the more frequently a user *u* assigns a candidate tag *c* to other objects in the application, the more relevant *c* is for *u*.
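As a minimal sketch (the data layout and names here are illustrative, not from [13, 18]), UF reduces to a simple count over a user's past tag assignments:

```python
from collections import Counter

def user_frequency(assignments):
    """Count how often each user assigned each tag in the training
    collection. `assignments` is a list of (user, object, tag) triples."""
    uf = Counter()
    for user, obj, tag in assignments:
        uf[(user, tag)] += 1
    return uf

# Toy training collection: the more often user u1 used "beach",
# the more relevant "beach" is assumed to be for u1.
history = [("u1", "o1", "beach"), ("u1", "o2", "beach"), ("u1", "o2", "sea"),
           ("u2", "o3", "city")]
uf = user_frequency(history)
print(uf[("u1", "beach")])  # 2
print(uf[("u1", "sea")])    # 1
```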

It is also common to exploit the temporal dynamics of tagging, particularly in user frequency-based tag attributes. From the observation that the temporal decay of the users' word choices follows a power-law function, the authors in [23] integrate a time component that gives more weight to tags that have been used more recently.
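One hedged way to realize such a time component (the decay shape and exponent below are illustrative; [23] fits the actual power-law function from data) is to weight each of the user's past uses of a tag by its age raised to a negative exponent:

```python
def time_weighted_uf(assignments, user, tag, now, alpha=1.0):
    """Sum power-law-decayed weights over the user's past uses of `tag`.
    `assignments` holds (user, tag, timestamp) triples; more recent
    uses contribute more to the score."""
    score = 0.0
    for u, t, ts in assignments:
        if u == user and t == tag:
            age = max(now - ts, 1.0)      # avoid division by zero
            score += age ** (-alpha)      # power-law decay
    return score

history = [("u1", "beach", 90.0), ("u1", "beach", 99.0), ("u1", "sea", 10.0)]
# Recent uses of "beach" outweigh an old use of "sea".
print(time_weighted_uf(history, "u1", "beach", now=100.0) >
      time_weighted_uf(history, "u1", "sea", now=100.0))  # True
```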

### **5.2 Learning-to-rank-based tag recommendation**

Observing that recommendation is usually modeled as a ranking problem (i.e., we want to recommend the most relevant items first), learning-to-rank (L2R) techniques constitute an appropriate approach to tackle it. L2R-based methods are supervised approaches that automatically "learn" a ranking function from "previously seen" data known as training instances. Such training instances usually consist of candidate tags, their tag quality attribute values, and their relevance labels. These labels can be assigned either manually or by exploiting previous tag assignments as ground truth. The objective of L2R approaches is to generate a model (function) that maps the tag quality attributes into a relevance score or rank.

More formally, for each candidate tag *c* for each object *o* (or object-user pair <*o*, *u*> for personalized recommendation), we associate a vector *X<sub>c,o</sub>* ∈ ℝ<sup>m</sup> (or *X<sub>c,o,u</sub>* ∈ ℝ<sup>m</sup>), where *m* is the number of considered tag quality attributes (e.g., each metric defined in Section 4). For training instances, we also assign a relevance label *y<sub>c,o</sub>* (or *y<sub>c,o,u</sub>*), indicating the relevance level of candidate tag *c* to the object *o* (and user *u*). For example, we can define two relevance levels: 1 for relevant tags, and 0 for nonrelevant tags. In the offline training step, this data is exploited to generate the recommendation model. In the online recommendation step, in which we have new objects or users as input, the *y<sub>c,o</sub>* (or *y<sub>c,o,u</sub>*) values are unknown, and the model learned in the training step is applied in order to predict these values.
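The setup above can be sketched as follows; the attribute values and the linear scoring function are placeholders for whichever attributes and L2R algorithm are actually used:

```python
# Each training instance: an attribute vector X for candidate tag c and
# object o, plus a relevance label y (1 = relevant, 0 = nonrelevant).
def make_instance(attr_values, label=None):
    return {"X": attr_values, "y": label}

train = [
    make_instance([0.9, 2.0, 1.4], 1),   # m = 3 tag quality attributes
    make_instance([0.1, 0.0, 0.2], 0),
]

# A trivial stand-in for the learned model: a linear scoring function
# mapping the attribute vector into a relevance score.
weights = [0.5, 0.3, 0.2]

def score(instance):
    return sum(w * x for w, x in zip(weights, instance["X"]))

# Online step: y is unknown for new candidates; the model predicts a score.
new_candidate = make_instance([0.8, 1.5, 1.0])
print(round(score(new_candidate), 2))  # 1.05
```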

Various L2R-based algorithms have been proposed for tag recommendation in the literature, including RankSVM, RankBoost, Genetic Programming, Random Forest (RF), Multiple Additive Regression Trees (MART), Lambda-MART, AdaRank, ListNet, RankNet, and Coordinate Ascent. In [24] we can find a brief description of each of these algorithms and experimental results comparing these methods using the RankLib tool (https://sourceforge.net/p/lemur/wiki/RankLib/). According to our results, RF, MART, and Lambda-MART are found to be the best performing strategies for the tag recommendation problem.

In [25], the author reviewed existing L2R algorithms in the context of document ranking, categorizing them into three approaches: pointwise, pairwise, and listwise. The *pointwise* approach associates a numerical score to each query-document pair and thus approximates the ranking problem by a regression problem. *Pairwise* approaches, in turn, transform the ranking problem into binary classification: given a pair of documents (or tags, in our case), we need to predict which one is the most relevant. Finally, *listwise* approaches try to directly optimize a given evaluation measure.
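For instance, the pairwise transformation can be sketched as generating one binary classification instance per pair of candidate tags with different relevance labels (the data layout is illustrative):

```python
from itertools import combinations

def pairwise_instances(candidates):
    """Turn (tag, attrs, relevance) candidates for one object into binary
    classification pairs: label 1 if the first tag should rank above the
    second, 0 otherwise. Pairs with equal relevance carry no signal."""
    pairs = []
    for (t1, x1, r1), (t2, x2, r2) in combinations(candidates, 2):
        if r1 != r2:
            pairs.append(((t1, t2), 1 if r1 > r2 else 0))
    return pairs

cands = [("beach", [0.9], 1), ("sea", [0.7], 1), ("img_001", [0.1], 0)]
print(pairwise_instances(cands))
# [(('beach', 'img_001'), 1), (('sea', 'img_001'), 1)]
```

Note that the equally relevant pair ("beach", "sea") is discarded, since it provides no preference information to learn from.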

Finally, it is also worth mentioning that, instead of adopting an attribute engineering approach, exploiting various handcrafted attributes like those described in Section 4, some recent works focus on investigating techniques that can learn attribute interactions from raw data, such as deep learning and factorization machines (FM) [26, 27]. The most representative method of this group is pairwise interaction tensor factorization (PITF). In this method, the tensor (i.e., a "tridimensional matrix") that models the pairwise interactions among users, items, and tags (i.e., the ranking preferences of the tags for each user-object pair, obtained from the folksonomy relation data) is factorized into lower-dimensional matrices to reduce noise [27]. The PITF model is learned with an adaptation of the Bayesian personalized ranking (BPR) criterion. More recently, [26] exploit not only the folksonomy but also visual features of images, such as the objects appearing in the image, colors, shapes, or other visual aspects, in factorization machine models.
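A minimal sketch of the PITF scoring function follows; the embedding dimensionality and factor values are illustrative toys, whereas the real model learns them with the BPR-style criterion:

```python
def pitf_score(user_emb, item_emb, tag_user_emb, tag_item_emb):
    """PITF keeps only pairwise interactions: the score of tag t for a
    (user, item) pair is <u, t_u> + <i, t_i>, i.e., the sum of a
    user-tag and an item-tag dot product over low-dimensional factors."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return dot(user_emb, tag_user_emb) + dot(item_emb, tag_item_emb)

# Toy 2-dimensional factors for one user, one item, and two tags.
u, i = [1.0, 0.0], [0.0, 1.0]
tags = {"beach": ([0.8, 0.1], [0.2, 0.9]),  # (tag-user, tag-item) factors
        "city":  ([0.1, 0.5], [0.3, 0.1])}
scores = {t: pitf_score(u, i, tu, ti) for t, (tu, ti) in tags.items()}
print(max(scores, key=scores.get))  # beach
```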
