**Abstract**

*Escherichia coli* (*E. coli*) has the distinction of being the most extensively studied organism, as shown by the thousands of articles published since its discovery by T. Escherich in 1885. Nevertheless, very little is known about the intellectual landscape of *E. coli* research. For example, it is unclear how the trend of publications on *E. coli* has evolved over time and which scientific topics have been the focus of researchers' interest. In this chapter, we present the results of a large-scale scientometric analysis of about 100,000 bibliographic records from PubMed over the period 1981–2021. To examine the evolution of research topics over time, we divided the dataset into four intervals of equal width. We created co-occurrence networks from keywords indexed in the Medical Subject Headings vocabulary and systematically examined the structure and evolution of scientific knowledge about *E. coli*. The extracted research topics were visualized in strategic diagrams and qualitatively characterized in terms of their maturity and cohesion.

**Keywords:** *Escherichia coli*, scientometric analysis, knowledge mapping, keyword analysis, co-word analysis

### **1. Introduction**

*Escherichia coli* (*E. coli*) is widely known as one of the most studied microorganisms in the life sciences. Since its discovery in 1885 by Theodor Escherich [1], *E. coli* has been the subject of intense research. *E. coli* is believed to be one of, if not the, most important organisms in studies aimed at discovering fundamental biological principles and mechanisms, as well as field-specific research methods and techniques in biology. However, very little is known about the knowledge landscape in *E. coli* research. In particular, it is unclear how published empirical findings on *E. coli* have evolved over time and which scientific questions have been the focus of researchers' interest. Answering these questions motivates the present work.

Scientific achievements are traditionally published in the form of a journal article, a paper in a conference volume, or a book chapter. To illustrate the effect of the accumulation of research results, we show in **Figure 1** the annual number of publications in the period 1940–2021 with *E. coli* as the main research topic. Although it is questionable whether the number of publications is directly related to the amount of knowledge in a particular scientific field, we can at least use it as a proxy indicator of research activity in a particular area of interest.

#### **Figure 1.**

*The annual distribution of PubMed publications on* E. coli. *The different colored lines represent the PubMed field tag used to retrieve* E. coli *publications: MeSH heading, which is a major topic of an article (red), MeSH term in general (blue), and all words in the title, abstract, and other relevant fields (green). The gray bars of the histogram, scaled on the second y-axis, show the total number of publications in PubMed in a given year.*

However, the body of scientific literature is growing at a rapid pace [2, 3]. For example, PubMed, a leading bibliographic database in the life sciences, has indexed an average of 3800 new articles daily over the past 5 years [4]. Manual review of such a large body of new literature, even in a very specialized field of research, is therefore not only time-consuming, but virtually impossible. Fortunately, we can draw on a rich toolkit of automated methods and techniques offered by a modern scientometric approach to deepen our understanding of science itself [5, 6].

The main objective of this work is to examine the *E. coli* literature from an evolutionary point of view using a data-driven approach. Specifically, the aims are twofold: (i) to provide insights into research topics based on a quantitative text-mining analysis of a large number of *E. coli* papers from 1981 to 2021 indexed in PubMed, and (ii) to highlight the evolution of scientific knowledge in the field from a domain expert's perspective.

### **2. Background and related work**

#### **2.1 Science mapping**

If we may paraphrase Ebbinghaus' famous statement, we could say that the field of scientometrics has a long past but a short history [7]. The study of scientific knowledge itself has been the subject of many famous works with great impact on the research community, including contributions by Lotka [8], Zipf [9], Price [10], Merton [11], Garfield [12], and later by Börner [13], Uzzi [14], Wang [15], Clauset [5], and Milojević [16].

*Exploring the Knowledge Landscape of* Escherichia coli *Research: A Scientometric Overview DOI: http://dx.doi.org/10.5772/intechopen.109207*

Formally, scientometrics is an umbrella term for a set of approaches that aim to describe and understand the (relational) structure between researchers, their institutions, and scientific knowledge (operationalized through ideas, concepts, citations, and keywords) in order to identify and track the driving mechanisms of science [6]. One of these approaches, commonly used in the literature, is the science map. A science map is a spatial and/or temporal representation of individual authors, their research groups, or the knowledge concepts they have written about [17].

The seminal studies that addressed the organization of scientific knowledge were driven by the study of citation networks, a type of analysis in which we seek to understand common patterns of citation links among articles in a collection of scientific literature [18]. The authors discovered several important structural features, including the famous small-world phenomenon [19, 20], the rich-get-richer mechanism [21], and the hierarchical organization of scientific knowledge [22]. We refer the reader interested in further details to the recently published monograph by Wang and Barabási [23].

Other authors argue that scientific knowledge could be represented more realistically with keywords as basic knowledge elements [24]. Keywords and key phrases are typically extracted from the title and/or abstract of each article using natural language processing tools, or parsed from a list of descriptors already provided by the authors. To overcome the challenges of normalizing keywords, many authors use controlled vocabularies such as Medical Subject Headings (MeSH) in the life sciences [25] or Mathematics Subject Classification in mathematics [26].

#### **2.2 Co-word analysis**

Co-word analysis is an improved version of pure keyword-based co-occurrence analysis. By combining various theoretical concepts from graph theory, co-word analysis allows a simple but efficient reduction of a massive network of co-occurring keywords to a higher-level network of clustered keywords. First described by Callon et al. [27] in the 1980s, co-word analysis is a powerful method for mapping the detailed intellectual structure of unstructured text data [17]. The method has been used in a variety of scientific fields, including microbiology [28]. However, to our knowledge, it has not yet been applied to elucidate the intellectual structure of knowledge about *E. coli*.

Technically, the input for co-word analysis is a network of keywords, as described in Section 2.1. In the next step, we use a type of cluster analysis, often referred to as community detection in the language of complex networks, to partition nodes into a smaller number of communities based on the similarity of their wiring patterns. The clustering algorithm is optimized with the objective of maximizing both homogeneity within communities and heterogeneity between communities. Finally, we create a strategic diagram to uncover and explore interesting patterns within the detected community structure based on a set of predefined heuristic rules [29].

### **3. Methods**

In this study, we used a scientometric methodology to capture the structural and dynamic features of the knowledge landscape in *E. coli* research. In Section 3.1, we explain the details of compiling the dataset from the PubMed database. Then, in Sections 3.2 and 3.3, we explain the procedure for extracting keywords and creating co-word networks. Finally, in Section 3.4, we describe a method for identifying broad research topics and interpreting them.

#### **3.1 Data collection**

The literature collection was created from the PubMed database using an automated procedure. We retrieved all PubMed records indexed with the major MeSH descriptor "*Escherichia coli*" and restricted to the English language. Full bibliographic records were downloaded via the PubMed API and stored locally in XML format. To restrict a PubMed search result to the specified date range, we set the "datetype" parameter in an API call to "pdat". The last query update was performed on October 1, 2022.
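As an illustration of the retrieval step, the following minimal sketch builds a search URL for the NCBI E-utilities ESearch endpoint. The query string, the helper name `build_esearch_url`, and the `retmax` value are assumptions made for this example; the text above confirms only the major MeSH descriptor, the English-language restriction, and the use of `datetype=pdat`. The sketch builds the URL without sending a request.

```python
from urllib.parse import urlencode

# Public ESearch endpoint of the NCBI E-utilities.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(start_year, end_year, retmax=100):
    """Build an ESearch URL for E. coli major-topic records in a date range.

    Hypothetical helper: the exact query used by the authors is not given,
    so the search term below is an assumed phrasing.
    """
    params = {
        "db": "pubmed",
        "term": '"Escherichia coli"[MeSH Major Topic] AND English[Language]',
        "datetype": "pdat",        # filter on publication date
        "mindate": str(start_year),
        "maxdate": str(end_year),
        "retmax": str(retmax),     # number of identifiers returned per call
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_esearch_url(1981, 2021)
print(url)
```

The identifiers returned by such a query would then be passed to a record-fetching call to obtain the full XML records; that step is omitted here.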

#### **3.2 Keyword extraction**

The co-word analysis presented here is based on MeSH terms to overcome problems with the normalization of plain keywords, as described previously in Section 2.1.

Each PubMed record is manually annotated with MeSH terms by human indexers at the National Library of Medicine. MeSH is a controlled vocabulary consisting of biomedical terms at different levels of granularity. There are several types of MeSH terms, two of which are important for further understanding of the present work: main MeSH headings (or descriptors) and MeSH subheadings (or qualifiers). Descriptors are the main elements of the thesaurus and denote the main topic of the paper. For example, the MeSH descriptors for a paper dealing with adherent-invasive *E. coli* pathovar strains in the context of Crohn's disease might be "Bacterial Adhesion", "Crohn Disease", and "*Escherichia coli*". Qualifiers are optionally assigned to the descriptors to express a particular aspect of the knowledge concept.

For further processing, we extracted all pairs of MeSH heading/subheading terms along with the publication date of each bibliographic record.
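The extraction step can be sketched as follows, assuming the records are stored in the standard PubMed XML layout (`MeshHeading` elements containing one `DescriptorName` and zero or more `QualifierName` children). The sample record is invented for illustration, and keeping a descriptor without qualifiers as a bare term is an assumption rather than a documented choice of this study.

```python
import xml.etree.ElementTree as ET

# Invented, heavily shortened record in PubMed XML layout (illustration only).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <Journal><JournalIssue><PubDate><Year>2005</Year></PubDate></JournalIssue></Journal>
      </Article>
      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N">Bacterial Adhesion</DescriptorName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N">Crohn Disease</DescriptorName>
          <QualifierName MajorTopicYN="Y">microbiology</QualifierName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N">Escherichia coli</DescriptorName>
          <QualifierName MajorTopicYN="Y">pathogenicity</QualifierName>
          <QualifierName MajorTopicYN="N">isolation &amp; purification</QualifierName>
        </MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_terms(xml_text):
    """Return one (year, [heading/subheading terms]) tuple per article."""
    records = []
    root = ET.fromstring(xml_text)
    for article in root.iter("PubmedArticle"):
        year = article.findtext(".//PubDate/Year")
        terms = []
        for heading in article.iter("MeshHeading"):
            descriptor = heading.findtext("DescriptorName")
            qualifiers = [q.text for q in heading.findall("QualifierName")]
            if qualifiers:
                # One term per descriptor/qualifier pair.
                terms.extend(f"{descriptor}/{q}" for q in qualifiers)
            else:
                # Assumption: a descriptor without qualifiers is kept as is.
                terms.append(descriptor)
        records.append((year, terms))
    return records


records = extract_terms(SAMPLE_XML)
print(records)
```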

#### **3.3 Co-word network**

In Section 2.2, we introduced the notion of co-word analysis, which aims to detect communities of keywords that frequently occur in conceptually similar articles. Formally, we first created a co-occurrence network based on the MeSH term lists from all retrieved documents. A node in the co-occurrence network refers to a particular MeSH heading/MeSH subheading pair, and a relationship between two nodes is established when both occur together in a particular document. In the following paragraphs, the phrase "MeSH heading/MeSH subheading" is referred to as "term" or simply "heading".

In the next step, the co-occurrence network was weighted according to the number of observed pairs of MeSH headings. For example, if MeSH heading *i* and MeSH heading *j* appear together in 100 papers, the weight of their co-occurrence was set to 100. Finally, the raw edge weights were normalized to account for the unbalanced number of MeSH headings in the papers. For normalization, we used an association measure defined as

$$
e_{ij} = \frac{c_{ij}^2}{c_i c_j},
\tag{1}
$$

where *cij* is the number of co-occurrences of headings *i* and *j* [29]. Also, *ci* and *cj* are the numbers of occurrences of MeSH headings *i* and *j*, respectively. The normalized value is zero if the two MeSH headings never occur together, and is equal to one if they occur only together, i.e., every paper containing one of them also contains the other.
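The weighting and normalization described above can be sketched in a few lines. The function name `equivalence_index` and the toy documents are assumptions for illustration; the formula is the squared co-occurrence count divided by the product of the two occurrence counts, as in Eq. (1).

```python
from collections import Counter
from itertools import combinations


def equivalence_index(documents):
    """Map each co-occurring pair (i, j) to c_ij**2 / (c_i * c_j).

    `documents` is a list of term lists, one list per paper.
    """
    occurrences = Counter()      # c_i: number of papers containing term i
    cooccurrences = Counter()    # c_ij: number of papers containing i and j
    for terms in documents:
        unique = sorted(set(terms))  # count each term once per paper
        occurrences.update(unique)
        cooccurrences.update(combinations(unique, 2))
    return {
        (i, j): c_ij ** 2 / (occurrences[i] * occurrences[j])
        for (i, j), c_ij in cooccurrences.items()
    }


# Toy corpus: A and B each occur in three papers and share two of them.
docs = [["A", "B"], ["A", "B"], ["A", "C"], ["B", "C"]]
weights = equivalence_index(docs)
print(weights)
```

In this toy corpus, the pair (A, B) receives the weight 2² / (3 × 3) = 4/9, while two headings that appear only together would receive the maximum weight of one.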

#### **3.4 Identification of research topics**

We ran the Louvain community detection algorithm on the prepared co-occurrence network to identify clusters of homogeneous MeSH headings [30]. Each of the detected clusters groups together several contextually similar MeSH headings and plays the role of a research topic.

The interpretation of the research topics followed the procedure described by Callon et al. [27]. We calculated two measures, centrality and density, to represent a particular research topic in a two-dimensional plot called a strategic diagram. Centrality represents the relatedness of an observed research topic to other topics in a strategic diagram. The stronger this relatedness is, the more central the topic is in the observed network. In practice, we interpret centrality as the strength of a research topic in the entire scientific domain. Formally, the centrality of a topic is defined as

$$c = 10 \times \sum e_{kh},\tag{2}$$

where *k* is a MeSH heading from the observed topic, *h* is a MeSH heading belonging to other topics, and *ekh* is the normalized co-occurrence frequency of the pair of MeSH headings *k* and *h* according to Eq. (1).

Density, on the other hand, represents internal cohesion, i.e., how strongly an observed research topic is conceptually developed. Density is formally defined as

$$d = 100 \times \left(\frac{\sum e_{ij}}{w}\right),\tag{3}$$

where *i* and *j* are MeSH headings associated with a cluster, and *eij* is the normalized frequency of co-occurrence of the two MeSH headings. The *w* in the denominator represents the total number of MeSH headings in a given research topic.
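To make Eqs. (2) and (3) concrete, the sketch below computes both measures from a dictionary of normalized edge weights and a heading-to-topic assignment. The function name, the input layout, and the toy values are assumptions for illustration; in particular, the sketch assumes that each unordered pair of headings contributes once to the sums.

```python
from collections import Counter, defaultdict


def centrality_and_density(weights, membership):
    """Return {topic: (centrality, density)}.

    weights:    {(i, j): e_ij}, each unordered pair listed once
    membership: {heading: topic}, e.g. from a community detection step
    """
    external = defaultdict(float)  # sum of e_kh over links leaving a topic
    internal = defaultdict(float)  # sum of e_ij over links inside a topic
    sizes = Counter(membership.values())  # w: headings per topic
    for (i, j), e in weights.items():
        topic_i, topic_j = membership[i], membership[j]
        if topic_i == topic_j:
            internal[topic_i] += e
        else:
            # A link between two topics counts toward both centralities.
            external[topic_i] += e
            external[topic_j] += e
    return {
        topic: (10 * external[topic], 100 * internal[topic] / size)
        for topic, size in sizes.items()
    }


# Toy network: two topics of two headings each, joined by one weak link.
membership = {"A": "T1", "B": "T1", "C": "T2", "D": "T2"}
weights = {("A", "B"): 0.5, ("C", "D"): 0.25, ("B", "C"): 0.1}
measures = centrality_and_density(weights, membership)
print(measures)
```

For topic T1 in this toy network, the single external link gives a centrality of 10 × 0.1 = 1, and the single internal link gives a density of 100 × 0.5 / 2 = 25.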

Finally, considering centrality and density, we created a strategic diagram to represent the structural landscape of knowledge. The diagram is centered on the medians of the two axis values, which divide the plot area into four quadrants characterized by different types of research topics [29]. A particular topic can be assigned a unique qualitative description based on its position in the diagram as follows:


Although such topics are important to a particular research community, they are not well-developed.
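The median-centering of the strategic diagram can be sketched as below. The function name and the toy values are assumptions for illustration, as is the convention that a value equal to the median counts as "high"; the sketch reports only whether a topic lies above or below each median and does not assign qualitative labels to the quadrants.

```python
from statistics import median


def assign_quadrants(measures):
    """Place each topic relative to the median centrality and density.

    measures: {topic: (centrality, density)}
    Returns {topic: (centrality_level, density_level)} with levels
    "high" or "low". Assumption: ties with the median count as "high".
    """
    median_centrality = median(c for c, _ in measures.values())
    median_density = median(d for _, d in measures.values())
    return {
        topic: (
            "high" if c >= median_centrality else "low",
            "high" if d >= median_density else "low",
        )
        for topic, (c, d) in measures.items()
    }


# Toy values: median centrality is 1.5 and median density is 18.75.
topics = {
    "T1": (1.0, 25.0),
    "T2": (3.0, 12.5),
    "T3": (2.0, 40.0),
    "T4": (0.5, 5.0),
}
quadrants = assign_quadrants(topics)
print(quadrants)
```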
