
## **Chapter 5**

## Using Trend Extraction and Spatial Trends to Improve Flood Modeling and Control

*Jacob Hale, Suzanna Long, Vinayaka Gude and Steven Corns*

## **Abstract**

Effective management of flood events depends on a thorough understanding of regional geospatial characteristics, yet data visualization is rarely effectively integrated into the planning tools used by decision makers. This chapter considers publicly available data sets and data visualization techniques that can be adapted for use by all community planners and decision makers. A long short-term memory (LSTM) network is created to develop a univariate time series value for river stage prediction that improves the temporal resolution and accuracy of forecasts. This prediction is then tied to a corresponding spatial flood inundation profile in a geographic information system (GIS) setting. The intersection of flood profile and affected road segments can be easily visualized and extracted. Traffic decision makers can use these findings to proactively deploy re-routing measures and warnings to motorists to decrease travel-miles and risks such as loss of property or life.

**Keywords:** trend extraction, spatial and temporal trends, images

## **1. Introduction**

Floods are the most frequently occurring natural disaster. A flood event occurs when stream flows exceed the natural or artificial confines at any point along a stream [1]. This is often due to heavy rainfall, ocean waves coming on shore, rapid snow melting, or failure of manmade structures such as dams or levees [2]. From 1998 to 2017, flood events affected more than two billion people globally [3]. Disasters of this frequency and magnitude are typified by extreme costs to governments. In 2019, historic flooding across Missouri, Arkansas, and the Mississippi River basin resulted in an estimated cost of 20 billion dollars [4]. These estimates typically do not reflect indirect costs such as added travel-miles and the subsequent loss of time. Further, floods are among the deadliest natural disasters. From 2010 to 2020, floods resulted in the deaths of 1089 people in the United States [5], a majority of whom were motorists. Therefore, urban planners such as traffic decision makers are tasked with proactively deploying resources that minimize motorist risk exposure. At present, traffic decision makers rely on static flash flood inundation profiles related to discrete rainfall events. These profiles are often created through multiagency cooperation efforts such as [6]. Some studies have begun to generate dynamic flood inundation data visualizations based on these profiles [7].

Additionally, integrated approaches that use machine learning and geographic information systems (GIS) to track changes in critical infrastructure over time are emerging as powerful decision support tools [8]. However, there is limited use of state-of-the-art time series prediction models to generate dynamic data visualizations in a GIS setting for improved flood management. This book chapter explores the integration of publicly available data and machine learning models to address this gap in the literature.

Precise determination of when and where to deploy re-routing measures is a complex task. One approach that improves planning effectiveness is to integrate time series characteristics of river behavior and corresponding spatial flood profile. In this chapter, a univariate time series prediction of river stage is conducted that improves the temporal resolution and accuracy of publicly available forecasts. This prediction is then tied to a corresponding spatial flood inundation profile in a GIS setting. The resulting geospatial deep learning model provides a data visualization tool that traffic decision makers can use to proactively manage road closures in the event that a flood is likely to occur. The first section provides an overview of relevant river behavior that causes flooding. State-of-the-art trend extraction and prediction techniques are then presented and tied to geospatial use cases. The methodology section presents the data used, time series prediction model selected, and geoprocessing procedures required for data visualization using GIS software. Next, an illustrative example is provided for a frequently flooded intersection in Missouri. A discussion section is provided that positions the findings in the context of improving traffic management in the event of a flood. Lastly, a conclusion is given that summarizes the key findings and outlines model limitations and future work.

## **2. A geospatial deep learning approach**

Two key characteristics of streams that relate to flood events are stream stage and streamflow. Stream stage refers to the height (ft) of the stream, and streamflow corresponds to discharge (ft<sup>3</sup>/s) or, alternatively, volumetric flowrate. Typically, governmental organizations such as the United States Geological Survey maintain a network of sensors that monitor these characteristics over time for various stream segments. The National Weather Service classifies flood categories into four groups based on stream stage: Action Stage, Flood Stage, Moderate Flood Stage, and Major Flood Stage [9]. These values vary for a given segment of stream based on analysis of previous floods, local topography, and underlying geological properties.

Given that stage is monitored over time, the use of time series forecasting methods to predict stage values is appropriate. There are two modeling approaches that are useful in this context: statistical and computational intelligence. Statistical models use historical data to identify underlying patterns to predict future values [10]. Some commonly used techniques for flood forecasting include simple exponential smoothing [11], autoregressive moving average [12], and autoregressive integrated moving average [13]. However, one shortcoming of these approaches is a lack of scalability as the quantity and complexity of data increases [14]. An alternative approach that addresses these issues is computational intelligence. A key feature of computational intelligence approaches is the capacity to manage complexity and non-linearity without needing to understand underlying processes [15]. In summary, statistical methods rely on precise underlying relationships and exhibit decreased performance as the number of variables increases, whereas computational intelligence approaches identify patterns using large amounts of training data to establish a model capable of accurate predictions [16]. Some commonly used flood forecasting computational intelligence models include support vector machines [17], artificial neural networks [18], and deep learning [19]. Further, these models have demonstrated superior performance when compared to conventional statistical modeling approaches in flood prediction studies. LSTM models in particular have shown promising results in time series contexts. Therefore, LSTM models provide a state-of-the-art trend extraction and prediction technique for stream stage values.

Stream stage values are categorized based on resulting flood severity. The physical reality of these categories is the spatial extent of the flooding event often referred to as a flood inundation map [20]. These maps provide decision makers with a useful visual reference to determine what specifically has been affected by a flood event. An area of research, data visualization, and practical application that has not been fully investigated is the integration of computational intelligence stream stage predictions with geospatial flood inundation maps. The methodology provided in the following section addresses this gap.

## **3. Methodology**

This section consists of three parts: LSTM prediction of stream stage, data required, and geoprocessing procedures. First, a brief overview of LSTM will be given. This will include explanatory figures and relevant mathematical formulas. Second, data required to conduct the LSTM prediction of stream stage will be procured. Flood inundation imagery and road network data will also be obtained. Lastly, data will be uploaded to a GIS software and processed for end use by traffic decision makers. An illustrative example is presented in the next section.

### **3.1 LSTM prediction of stream stage**

Stream stage prediction is a time series forecasting procedure that is dependent on previous data to predict future values. As the quantity and quality of data continues to increase, more powerful computational approaches can be applied to prediction problems. The results of the literature review demonstrated that deep learning approaches, namely LSTM networks, are increasingly being applied to these problems.

Deep learning is an extension of the conventional neural network by adding additional layers and layer types. **Figure 1** provides a visual comparison of the two approaches [21]. The simple neural network (left) consists of a single input layer, hidden layer, and output layer. Alternatively, the deep learning neural network (right) has one input layer followed by three successive hidden layers that ultimately feed into a final output layer. This configuration has generated superior performance in capturing complex relationships.

**Figure 1.** *Simple neural network vs. deep learning neural network.*

However, neither approach retains previous time step information. Recurrent neural networks (RNNs) were introduced to address this limitation. LSTM networks are the deep learning variant of RNNs. All figures and mathematical formulations are borrowed from [15]. The primary benefit of LSTM networks is the capacity to retain longer term information. This is accomplished by removing and adding information as determined by a series of 'gates' and vector operations. **Figure 2** provides a visual representation of an LSTM cell. The first gate, illustrated in yellow, generates a value between 0 and 1 using the current input ($x_t$) and the output from the previous step ($y_{t-1}$) that determines how much information is passed on (forget gate). A zero corresponds to no information transfer whereas a one represents a complete transfer.

The result of this procedure ($f_t$) is presented mathematically in Eq. (1) as a sigmoid neural network layer where U (weights) and W (recurrent connections) are matrices.

$$f_t = \sigma\left(x_t U^f + y_{t-1} W^f\right) \tag{1}$$

Next, a decision must be made regarding what information needs to be stored. This is accomplished by applying an additional sigmoid layer (red, $i_t$). New values are then added to the cell state ($\hat{C}_t$) by using a tanh layer (green). Eqs. (2) and (3) present these procedures mathematically.

$$i_t = \sigma\left(x_t U^i + y_{t-1} W^i\right) \tag{2}$$

$$\hat{C}_t = \tanh\left(x_t U^c + y_{t-1} W^c\right) \tag{3}$$

The line at the top of the cell is known as the cell state ($C_t$) and has interactions with all components. Information has the opportunity of being forgotten when the old state ($C_{t-1}$) is multiplied by the result of the first forget gate ($f_t$). The product of the second (red) and third (green) gates is then added, which results in new information being provided to the cell state and is represented by Eq. (4).

$$C_t = f_t C_{t-1} + i_t \hat{C}_t \tag{4}$$

**Figure 2.** *LSTM network cell.*


Lastly, the output layer of the LSTM cell determines the forecast for the current time step. A sigmoid layer (blue) and a tanh layer are multiplied to generate the output ($y_t$). This final step is represented by Eqs. (5) and (6).

$$o_t = \sigma\left(x_t U^o + y_{t-1} W^o\right) \tag{5}$$

$$y_t = \tanh(C_t) \times o_t \tag{6}$$
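To make the gate operations concrete, the following is a minimal NumPy sketch of one LSTM cell step following Eqs. (1)–(6). The weight shapes, random initialization, and example input are illustrative assumptions, and bias terms are omitted as in the equations above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, y_prev, C_prev, p):
    """One LSTM cell step following Eqs. (1)-(6); biases omitted as in the text."""
    f_t   = sigmoid(x_t @ p["Uf"] + y_prev @ p["Wf"])    # Eq. (1): forget gate
    i_t   = sigmoid(x_t @ p["Ui"] + y_prev @ p["Wi"])    # Eq. (2): input gate
    C_hat = np.tanh(x_t @ p["Uc"] + y_prev @ p["Wc"])    # Eq. (3): candidate values
    C_t   = f_t * C_prev + i_t * C_hat                   # Eq. (4): new cell state
    o_t   = sigmoid(x_t @ p["Uo"] + y_prev @ p["Wo"])    # Eq. (5): output gate
    y_t   = np.tanh(C_t) * o_t                           # Eq. (6): cell output
    return y_t, C_t

# Illustrative dimensions: one input feature (gauge height) and eight hidden units.
rng = np.random.default_rng(0)
n_in, n_hidden = 1, 8
p = {k: rng.normal(scale=0.1, size=((n_in if k[0] == "U" else n_hidden), n_hidden))
     for k in ("Uf", "Wf", "Ui", "Wi", "Uc", "Wc", "Uo", "Wo")}
y, C = np.zeros(n_hidden), np.zeros(n_hidden)
y, C = lstm_cell_step(np.array([6.2]), y, C, p)   # one step with a 6.2 ft reading
```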

The result of this computational procedure is a time series forecast of future values. However, a large amount of data must be gathered to use as a model input. This data is presented in the next section.
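Although the chapter does not prescribe a specific implementation, a univariate next-step forecaster of this kind could be sketched with the Keras library roughly as follows; the look-back window, layer size, training settings, and the placeholder series are illustrative assumptions only.

```python
import numpy as np
from tensorflow import keras

def make_windows(series, look_back=16):
    """Slice a 1-D series into (samples, look_back, 1) inputs and next-step targets."""
    X, y = [], []
    for i in range(len(series) - look_back):
        X.append(series[i:i + look_back])
        y.append(series[i + look_back])
    return np.asarray(X)[..., np.newaxis], np.asarray(y)

# Placeholder series; in practice this would be the 15-minute gauge height record
# described in Section 3.2.
gauge_height = 5.0 + np.sin(np.linspace(0.0, 60.0, 2000))
X, y = make_windows(gauge_height)
split = int(0.8 * len(X))                       # chronological train/test split

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(X.shape[1], 1)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X[:split], y[:split], epochs=5, batch_size=64,
          validation_data=(X[split:], y[split:]), verbose=0)
next_step = model.predict(X[split:], verbose=0)  # predicted gauge heights
```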

### **3.2 Data required**

Historic stream stage height for the location further explained in Section 4 must first be gathered. 113,994 data points were procured that correspond to 15-minute intervals from May 19, 2016 (5 PM) – September 1, 2019 (4 PM). Stage height is herein referred to as 'gauge height' to account for the source of the data. This data is represented graphically in **Figure 3** [22].
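Gauge records of this kind can also be pulled programmatically. The sketch below queries the public USGS Instantaneous Values web service for the Valley Park gauge (site 07019130, reference [22]); the request parameters, the parameter code 00065 for gauge height, the availability of the full historical range through this endpoint, and the JSON layout are assumptions about that service rather than details given in the chapter.

```python
import pandas as pd
import requests

# Query the USGS Instantaneous Values service for 15-minute gauge height (ft).
url = "https://waterservices.usgs.gov/nwis/iv/"
params = {
    "sites": "07019130",                      # Meramec River at Valley Park, MO
    "parameterCd": "00065",                   # gauge height, feet
    "startDT": "2016-05-19T17:00-05:00",
    "endDT": "2019-09-01T16:00-05:00",
    "format": "json",
}
resp = requests.get(url, params=params, timeout=60)
resp.raise_for_status()

values = resp.json()["value"]["timeSeries"][0]["values"][0]["value"]
df = pd.DataFrame(values)
df["dateTime"] = pd.to_datetime(df["dateTime"])
df["value"] = df["value"].astype(float)       # gauge height (ft)
print(len(df))                                # roughly 113,994 rows at 15-minute spacing
```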

Using USGS' flood inundation mapper (FIM), these gauge heights can be tied to a specific flood inundation profile [23]. The FIM is a publicly available tool that provides resulting flood inundation maps for one-foot gauge height increments in image format (.tif). A sliding bar that accomplishes this is available on the online user interface and is presented in **Figure 4**.

**Figure 3.** *Stream stage height for example locations.*

**Figure 4.** *FIM sliding gauge height tool.*

An example of a flash flood inundation profile loaded into GIS software is provided in **Figure 5**. Purple lines correspond to road network data derived from the National Transportation Dataset [24]. Blue raster (grids of pixels) imagery denotes the depth of water at discrete locations, where darker blue reflects deeper water. Useful geoprocessing techniques that generate actionable decision support tools are presented in the next section.

### **3.3 Geoprocessing procedures**

Traffic decision makers are tasked with identifying flood affected road segments. In **Figure 5**, it can be observed that the flood inundation profile does overlap certain road segments. Relying on visual inspection alone is time consuming and prone to inaccuracies due to human error. A solution to this issue is the application of a set of straightforward geoprocessing tools that are built into most GIS software packages: conversion and intersection.

Some tools do not allow raster and vector data layer interoperability. Therefore, it is necessary to convert one of the data layers to establish a consistent data type. One approach is to convert the raster layer into a vector layer using the conversion tool within ArcGIS. **Figure 6** illustrates the result of this operation. The flood inundation profile has been converted into several points at 1-m increments. This spatial resolution can be modified by the user. The road network has been changed from its previous color to improve readability.

Once the raster layer has been converted into vector format, it is eligible for use as an input layer for the intersection tool. The intersection tool generates a point at every location where there is an intersection between the input layers. In the next section, an illustrative example is provided to demonstrate the effectiveness of the methodology presented.

**Figure 5.** *Flood inundation profile example.*

**Figure 6.** *Raster layer conversion example.*
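For reference, the two geoprocessing steps described above could be scripted with ArcPy roughly as follows. The workspace path, dataset names, and tool options are hypothetical placeholders, and the sketch assumes an ArcGIS Pro Python environment; the chapter itself only describes the interactive use of these tools.

```python
import arcpy

# Hypothetical workspace and layer names; substitute the actual FIM depth raster
# (.tif) and the National Transportation Dataset road layer for the study area.
arcpy.env.workspace = r"C:\data\valley_park.gdb"

depth_raster = "fim_stage_45ft"    # flood inundation depth raster for a given stage
roads = "ntd_roads"                # road network (vector)

# Step 1: convert the depth raster to points so it can interact with vector layers.
flood_points = arcpy.conversion.RasterToPoint(depth_raster, "flood_points", "VALUE")

# Step 2: intersect the flood points with the road network; the output contains a
# point wherever flood water overlaps a road segment.
affected = arcpy.analysis.Intersect([flood_points, roads], "flood_affected_roads")

print(arcpy.management.GetCount(affected))   # number of flooded road points
```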

## **4. Illustrative example**

Valley Park, Missouri is located at the intersection of I-44 and State Route 141. This location is the setting for the example figures presented previously. The Meramec River winds through this area and has regularly flooded in recent years. In 2017, the river exceeded its banks and caused significant damage to the surrounding area as seen in **Figure 7**. This location provides a suitable candidate to test the methodology presented given the extent of the flood event and data availability.

First, data is gathered from a nearby stream gauge. **Figure 8** provides a geographical point of reference for the gauge, denoted by a green square, with respect to I-44 and State Route 141. The stage data presented in **Figure 3** is then procured and used as an input for the LSTM network. **Figure 9** presents the prediction results of the LSTM model superimposed on the actual data for May 19, 2016 – September 1, 2019.

The actual data (blue) deviates only slightly from the prediction results for the training (orange) and testing (green) portions of the LSTM network. This lack of discrepancy between the actual data and predictions demonstrates the model's effectiveness. Further, it is useful to determine how the prediction compares with publicly available forecasts for the same location. USGS provides a forecast every six hours. Alternatively, the LSTM network provides 24 predictions in the same period. **Figure 10** provides a comparison of the prediction provided by USGS and the LSTM model for September 1, 2019 (6 PM) – September 3, 2019 (6 AM).

**Figure 7.** *Meramec River flood in 2017 [25].*

**Figure 8.** *Gauge location [9].*

**Figure 9.** *LSTM training and testing results.*

**Figure 10.** *USGS and LSTM prediction comparison.*

In **Figure 10**, the red line represents the original data. Gauge height is initially observed at just above six feet. From there, it trends downward until it reaches the end of the dataset at less than 3.5 feet. The green line corresponds to the USGS prediction. This prediction initially overshoots the original data before briefly correcting and then diverging significantly from the observed trend. Lastly, the blue line represents the LSTM prediction. At first, this prediction captures the downward trend missed by the USGS prediction. Ultimately, the prediction flattens out and diverges from the original observations, but to a lesser extent than the USGS prediction. Root Mean Squared Error (RMSE) values for each of the predictions are provided to further demonstrate the difference in model performance. The RMSE value of 0.453 reported by the LSTM model represents superior accuracy compared to the 1.065 value reported by the USGS prediction. Therefore, the LSTM model presented here improves on the accuracy of publicly available forecasts and can be used as an input for the flood inundation tool.
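The RMSE figures quoted above follow the standard definition; the snippet below shows the computation with purely illustrative numbers, not the chapter's actual series.

```python
import numpy as np

def rmse(observed, predicted):
    """Root Mean Squared Error between observed and predicted gauge heights."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))

# Illustrative values only; the chapter reports RMSE = 0.453 (LSTM) vs. 1.065 (USGS).
obs       = [6.1, 5.6, 5.0, 4.4, 3.9, 3.5]
lstm_pred = [6.0, 5.4, 4.9, 4.5, 4.1, 3.9]
print(round(rmse(obs, lstm_pred), 3))
```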

Valley Park has 43 flood inundation profiles available in one-foot increments from 11–54 feet. The highest stage value recorded at this location is 44.11 feet on December 31, 2015. **Figure 11** provides the flood inundation profile for 45 feet to approximate this event. Note that 45 feet is used instead of 44 because of the one-foot incremental limitation of the flood inundation profiles and a decision to round up, which provides a more conservative risk assessment. The inundation profile is then converted to point format and intersected with the road network as illustrated by **Figure 12**.

**Figure 11.** *Flood inundation profile for 45 ft. stage value for Valley Park, Missouri.*

**Figure 12.** *Flood affected road segments for flood inundation profile corresponding to 45 ft. stage value for Valley Park, Missouri.*

## **5. Discussion**

At present, urban planners such as traffic decision makers rely on static flood inundation maps and post hoc planning to reroute traffic in the event that a flood occurs. This approach exposes motorists already in transit to rapidly changing road conditions. To address these risks, a field of research has emerged to provide decision makers with real-time decision-making tools. However, using time series prediction models that capture river characteristics and integrating them with flood inundation profiles has received limited attention. The methodology provided here addresses this gap.

Traffic decision makers can use the data visualization presented in **Figure 12** as a powerful decision support tool. The flood affected road segments can be easily identified (orange) and rerouting measures can be promptly dispatched. With the improved temporal resolution and accuracy of the LSTM prediction of stage height, traffic decision makers can deploy resources proactively to avoid unnecessary risk to motorists and improve traffic flow. Concluding remarks, limitations, and future work are presented in the next section.

## **6. Conclusion**

Flash floods are a frequent and devastating natural disaster. The responsibility for managing these events belongs to local decision makers who work in a resource-constrained environment. To improve their decision-making effectiveness, a framework was presented that integrates machine learning and geospatial data to extract spatial and temporal trends using publicly available data. An illustrative example was provided to demonstrate the effectiveness of the framework. Valley Park, Missouri is located near the intersection of I-44 and State Route 141. These roads carry major traffic volumes, and persistent flooding of the Meramec River has jeopardized the safety of motorists and the flow of commercial goods. Using 113,994 river stage observations procured from a nearby sensor, an LSTM network was developed to improve the accuracy of publicly available forecasts. The result was an improvement in both the frequency and accuracy of the forecasts provided.

Once the stage value is predicted it can be tied to a spatial flood inundation profile using the publicly available FIM. Using the flood inundation profile for 45 feet observed at Valley Park as a proxy for the historic crest at this location, data visualization of flood affected road segments was generated in a GIS setting. The key benefit of this output is the ease with which traffic decision makers can use the results presented to inform urban planning and decision making. Traffic decision makers can use the resulting data visualization presented here to guide real-time decision making in the event that a river stage value is predicted to reach a flood event stage for a specified river segment. Despite the usefulness of the findings, there remain a number of model limitations that represent areas of future work.

Model limitations can be divided into two categories: data gathering and model extension. Deep learning models are dependent on large amounts of data. Therefore, sensors that collect data need to be installed and active for an extended period. The cost to install and maintain an enlarged sensor network might be prohibitive for some locations. Due to this fact, model implementation is limited to river locations where sensors are already installed. Additionally, FIM coverage is confined to a small number of locations nationwide. Similarly to sensor coverage, if flood inundation maps are not already available, then the model cannot be applied to those locations. Model extension includes options to improve the model in a material way. One recommendation would be to determine the best locations for road signage that will provide optimal re-routing to motorists given a finite amount of signage. Another approach would involve working with local decision makers to determine re-routing effectiveness based on how quickly resources are deployed given model predictions. Areas of future work not related to model extensions include alternative prediction approaches in river networks with no sensors and refinement of the model to account for flash floods. Each of these components represents a considerable opportunity for model enrichment that further improves decision-making effectiveness for traffic management professionals.

The results presented here demonstrate the utility of using machine learning models and geospatial data to generate data visualization tools that key stakeholders can use to improve planning effectiveness. As data becomes increasingly available, use of comparably sophisticated methods can be applied to a suite of natural disaster phenomena. The outcome of such an undertaking will be the widespread use of data visualization tools that will reduce the risk motorists are exposed to and mitigate the accompanying economic fallout.

## **Acknowledgements**

This work was partially funded by the Missouri Department of Transportation, Award Number TR201912 and the Mid-America Transportation Center, Award Number 25-1121-0005-130.

## **Conflict of interest**

The authors declare no conflict of interest.


## **Author details**

Jacob Hale<sup>1</sup>, Suzanna Long<sup>1</sup>\*, Vinayaka Gude<sup>2</sup> and Steven Corns<sup>1</sup>

1 Department of Engineering Management and Systems Engineering, Missouri University of Science and Technology, Rolla, MO, United States

2 Department of Arts and Media, Louisiana State University Shreveport, Shreveport, LA, United States

\*Address all correspondence to: longsuz@mst.edu

© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## **References**

[1] United States Geological Survey. Floods: Things to Know [internet]. 2019. Available from: https://www.usgs.gov/ special-topic/water-science-school/ science/floods-things-know?qt-science\_ center\_objects=0#qt-science\_center\_ objects [accessed 2019-03-15]

[2] National Severe Storms Laboratory. Flood Basics [internet]. 2019.Available from: https://www.nssl.noaa.gov/ education/svrwx101/floods/ [accessed 2019-03-16]

[3] World Health Organization. Floods [internet]Available from: https://www. who.int/health-topics/floods#tab=tab\_1 [accessed 2019-05-20]

[4] National Oceanic and Atmospheric Administration. 2010-2019: A landmark decade of U.S. billion-dollar weather and climate disasters [internet]. 2020. Available from: https://www. climate.gov/news-features/blogs/ beyond-data/2010-2019-landmarkdecade-us-billion-dollar-weather-andclimate [accessed 2020-10-15]

[5] National Weather Service. NWS Preliminary US Flood Fatality Statistics[internet]. 2020. Available from: https://www.weather.gov/arx/ usflood [accessed 2020-10-15]

[6] Dietsch, B.J., and Strauch, K.R., 2019, Flood-inundation maps of the Meramec River from Eureka to Arnold, Missouri, 2018: U.S. Geological Survey Scientific Investigations Report 2019-5004, 12 p. Available from: https://doi.org/10.3133/ sir20195004.

[7] Jha MK, Afreen S. Flooding urban landscapes: Analysis using combined hydrodynamic and hydrologic modeling approaches. Water (Switzerland). 2020;12(7).

[8] Shirowzhan S, Sepasgozar SME, Li H, Trinder J, Tang P. Comparative analysis of machine learning and point-based algorithms for detecting 3D changes in buildings over time using bi-temporal lidar data. Automation in Construction. 2019Sep;105. Available from: https://doi.org/10.1016/j. autcon.2019.102841

[9] Advanced Hydrological Prediction Service. Meramec River at Valley Park [internet]. 2019.Available from: https:// water.weather.gov/ahps2/hydrograph. php?wfo=lsx&gage=vllm7[accessed 2019-06-02]

[10] Lago J, De Ridder F, De Schutter B. Forecasting spot electricity prices: Deep learning approaches and empirical comparison of traditional algorithms. Appl Energy [Internet]. 2018;221(February):386-405. Available from: https://doi.org/10.1016/j. apenergy.2018.02.069

[11] Papacharalampous G, Tyralis H. Hydrological time series forecasting using simple combinations: Big data testing and investigations on one-year ahead river flow predictability. J Hydrol [Internet]. 2020;590(May):125205. Available from: https://doi.org/10.1016/j. jhydrol.2020.125205

[12] Liu J, Wang J, Pan S, Tang K, Li C, Han D. A real-time flood forecasting system with dual updating of the NWP rainfall and the river flow. Nat Hazards. 2015;77(2):1161-82.

[13] Adnan RM, Yuan X, Kisi O, et al. Application of soft computing models in streamflow forecasting. Proceedings of the Institution of Civil Engineers – Water Management. 2019;172(3):123-124. Available from: https://doi.org/10.1680/jwama.16.00075

[14] Mosavi A, Ozturk P, Chau KW. Flood prediction using machine learning models: Literature review. Water (Switzerland). 2018;10(11):1-40.

[15] Gude V, Corns S, Long S. Flood Prediction and Uncertainty Estimation Using Deep Learning. Water (Switzerland). 2020;12(3).

[16] Bzdok D, Altman N, Krzywinski M. Points of Significance: Statistics versus machine learning. Nat Methods [Internet]. 2018;15(4):233-4. Available from: http://dx.doi.org/10.1038/ nmeth.4642

[17] Tehrany MS, Pradhan B, Jebur MN. Flood susceptibility mapping using a novel ensemble weights-of-evidence and support vector machine models in GIS. J Hydrol [Internet]. 2014;512:332- 43. Available from: http://dx.doi. org/10.1016/j.jhydrol.2014.03.008

[18] Shafizadeh-Moghadam H, Valavi R, Shahabi H, Chapi K, Shirzadi A. Novel forecasting approaches using combination of machine learning and statistical models for flood susceptibility mapping. J Environ Manage. 2018;217:1-11.

[19] Tien Bui D, Hoang ND, Martínez-Álvarez F, Ngo PTT, Hoa PV, Pham TD, et al. A novel deep learning neural network approach for predicting flash flood susceptibility: A case study at a high frequency tropical storm area. Sci Total Environ [Internet]. 2020;701:134413. Available from: https://doi.org/10.1016/j. scitotenv.2019.134413

[20] Papaioannou G, Vasiliades L, Loukas A. Multi-Criteria Analysis Framework for Potential Flood Prone Areas Mapping. Water Resour Manag. 2015;29(2):399-418.

[21] Publication-ready NN-architecture schematics [internet]. 2021. Available from: alexlenail.me/NN-SVG/index. html [accessed 2021-1-27]

[22] United States Geological Survey. USGS 07019130 Meramec River at Valley Park, MO [internet]. 2019. Available from: https://waterdata.usgs.gov/nwis/uv?site\_no=07019130 [accessed 2019-10-16]

[23] United States Geological Survey. Flood Inundation Mapper [internet]. 2019.Available from: https://fim.wim. usgs.gov/fim/[accessed 2019-10-18]

[24] United States Geological Survey. USGS National Transportation Dataset (NTD) Downloadable Data Collection [internet]. 2019. Available from: https://catalog.data.gov/ dataset/usgs-national-transportationdataset-ntd-downloadable-datacollectionde7d2[accessed 2019-10-22]

[25] KMOV4. Photos: Before & After Meramec River flooding [internet]. 2017. Available from: https://www. kmov.com/news/photos-before-aftermeramec-river-flooding/article\_ fc16115e-12e2-54e6-bf84-ba1732a4dcbd. html [accessed 2019-10-25]

## **Chapter 6**

## Visual Data Science

*Johanna Schmidt*

## **Abstract**

Organizations are collecting an increasing amount of data every day. To make use of this rich source of information, more and more employees have to deal with data analysis and data science. Exploring data, understanding its structure, and finding new insights can be greatly supported by data visualization. Therefore, the increasing interest in data science and data analytics also leads to a growing interest in data visualization and exploratory data analysis. We will outline how existing data visualization techniques are already successfully employed in different data science workflow stages. In some stages, visualization is already beneficial, while other stages will still require future research. The vast amount of libraries and applications available for data visualization has fostered its usage in data science. We will highlight the differences among the libraries and applications currently available. Unfortunately, there is still a clear gap between visualization research developments over the past decades and the features provided by commonly used tools and data science applications. Although basic charting options are commonly available, more advanced visualization techniques have hardly been integrated as new features yet.

**Keywords:** visual data science, data visualization, visual analysis, data visualization libraries, data visualization systems

## **1. Introduction**

Within the last years, data science has been established as its own important emergent scientific field. Data science is defined as a "concept to unify statistics, data analysis, machine learning, and their related methods" to "understand and analyze actual phenomena with data" [1]. As such, data science comprises more than pure statistical data analytics, but the interdisciplinary integration of techniques from mathematics, statistics, computer science, and information science [2]. Data science also involves the consideration of domain knowledge for the analysis and the interpretation of the data and the results [3].

Data visualization research is largely driven by current use cases that users have to face when working with data. The problems and tasks that need to be solved by data scientists are, naturally, a precious source for further developments in data visualization research. On the other hand, data scientists already use data visualization on a daily basis. It is, therefore, worthwhile to think about how the well-established methods for visual analysis fit into the existing workflows of data scientists [4]. According to recent findings from interviews with people working with data [5], data scientists' tasks usually follow a very similar workflow path, and along this path, different stages can be identified. Every stage poses different challenges for data handling. For example, at the beginning of the workflow, data wrangling is considered to be an essential and tedious part of the workflow. Data wrangling comprises, among others, data parsing, cleaning, and merging. Data visualization techniques can help to quickly identify data flaws like missing data, anomalies like duplicates or outliers, and other inconsistencies in this stage. As a next step, data scientists have to understand the data at hand and evaluate its usefulness for modeling. Here, data visualization can help understand the structure of the data, detect correlations and clusters, and select data parts suitable for modeling.

The rise of data science currently strongly fuels the use of data visualization techniques by users from very diverse domains. This has led to many new data visualization tools and libraries being developed. Many of these libraries are open source and are embedded into programming environments like Python, R, and JavaScript. Prominent examples of such libraries are *Matplotlib* (Python), *ggplot2* (R), and *D3* (JavaScript). Open source technologies are a great advantage since data scientists can rely on a large community that can provide them with advice and support, and access to a wide range of libraries and plugins. Especially for Python, there are libraries for high-performance computing, numerical calculations, regression modeling, and visualization, which are regularly extended and maintained. On the other hand, feature-rich, standalone visual analysis applications have been increasingly established within the last years. These applications, such as *Tableau*, *Microsoft Power BI* and *Qlik*, provide easy access to data visualization and visual data exploration for users unfamiliar with programming and scripting, data wrangling, and/or data visualization design. Standalone applications are usually commercial, since a lot of maintenance and continuous development has to happen in the background. As many of these applications are available, data visualization and visual analysis are now widely known and applied in many different domains by many users and domain experts.

This chapter aims to provide a concise overview of existing data visualization techniques for data science and how they fit into the different stages of the data science workflow. Several studies that focused on categorizing and evaluating the different libraries and applications for data visualization currently used in data science will be outlined to create a better picture of which libraries should be used for which type of tasks. Unfortunately, there is still a gap between current research in data visualization and the features and techniques actually provided by libraries and applications. We would, therefore, like to foster the usage of data visualization in data science to bring both communities closer together.

### **2. Visualization supporting data science**

Data science is an interdisciplinary approach that combines input from other domains like mathematics, statistics, computer science, or graphics. Given this vast range of tasks and skills, several studies have been conducted to better understand and start to categorize the tasks and requirements of data scientists. Kim et al. [6] highlighted the diversity of skills, tasks, and toolsets used by data scientists in software development teams. As an important conclusion, they highlighted that this heterogeneity and diversity make it hard to reuse work. Kandel et al. [5] conducted an interview study with several data scientists and categorized them into the three archetypes of *Hacker*, *Scripter*, and *Application User*. Based on the archetype, data scientists use very diverse tools to solve their tasks. The survey by Harris et al. [7] among data workers (as they call people working with data) from different disciplines provided a very comprehensive overview of the different tasks data scientists need to solve. As a result of their study, they were then able to categorize data scientists into one of four major categories based on their skills (e.g., business orientation vs. programming skills). In general, these studies concluded that data scientists prefer to use hands-on scripts and program their own algorithms rather than use fully-featured applications.

As a basis for better explaining how data visualization fits into the data science workflow, we would like to use the categorization introduced by Kandel et al. [5]. They proposed to divide the data science workflow into five stages: *Discover*, *Wrangle*, *Profile*, *Model*, and *Report*.

In the *Discover* stage, data scientists need to identify the data relevant to their current project. This involves searching for internal but also external data sources. The main challenges in this stage are restricted data access and missing documentation of data attributes. This stage is, in general, not supported by data visualization applications. There are approaches in data visualization to, for example, improve the visualization of search engine results [8], but the general problem of data being difficult to find or access is not addressed. We therefore do not concentrate on this stage in this chapter.

### **2.1 Data wrangling**

Data wrangling in the *Wrangle* stage requires, on the one hand, dealing with data flaws like duplicates and inconsistencies (e.g., in naming), and, on the other hand, profiling and transforming datasets. Data wrangling's central goal is to make the data usable in the subsequent steps.

Initially, data wrangling was not considered by data visualization itself, which only started to operate once the data was available in the desired format. Since data wrangling has become an essential and tedious task in data science, which consumes a large share of the time in the whole workflow (up to 50–80% [9]), data visualization researchers have started to think about techniques to support this task. *Wrangler* [10] is the most prominent application to mention here, an interactive system for creating data transformations. Changes in the data are visualized, and data scientists can explore the space of possible operations. The *Wrangler* system infers further actions from what the user has done manually so far and in this way greatly speeds up the wrangling process. The idea was picked up by the company Trifacta, which incorporated it into their product for building data pipelines.

In general, data wrangling itself constitutes a very interesting use case which, hopefully, will receive more attention from data visualization research in the future. At the moment, data scientists mostly have to rely on manual tasks and scripting tools to get the data into the right format.

### **2.2 Data profiling**

The most demanding stage in terms of visualization design is the *Profile* stage, where data scientists need to explore the data to understand its structure. This process is very circular and undirected, without a specific goal in mind. Basic information about the related problem domain is required. The goal is to understand the patterns found in the data. This includes, but is not limited to, the distribution of values, correlations, outliers, and clusters.

Datasets usually contain several quality issues, such as missing values, outliers or extreme values, and inconsistencies. Missing data might be due to observations completely missing in a dataset, which can in many cases be identified by empty cells or *null* values. There might also be cases where numbers (e.g., 0 or 1) or characters (e.g., "N/A") encode missing data, something that needs to be considered during the analysis. Inconsistencies and heterogeneous information are often erroneously created by humans, especially in names and terms, and because certain cells have been overloaded with information. Data scientists need to be aware of these flaws when working with a particular dataset. Checking the quality of a dataset has already been addressed by several approaches in visualization. *Profiler* [11] was intended to support the quality assessment of datasets visually; *Visplause* [12] provided the same for time series data. More generally, Bertini et al. [13] developed quality metrics for multi-dimensional datasets. Quality checks are nowadays also provided as features in standalone visualization applications. In Python, the package *pydqc* provides automatic quality checks.
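As a small illustration of such checks outside a dedicated tool, the following pandas sketch screens a toy table for the flaws described above; the column names, sentinel values, and the plausible stage range of 0–60 ft are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy dataset with typical flaws: empty cells, sentinel strings such as "N/A",
# a zero standing in for a missing reading, duplicate rows, and an outlier.
df = pd.DataFrame({
    "station":  ["A",  "A",  "B",   "B",  "B",            "C"],
    "stage_ft": [6.1,  6.1,  0.0,   5.8,  250.0,          np.nan],
    "comment":  ["ok", "ok", "N/A", "",   "sensor spike", None],
})

df = df.replace({"N/A": np.nan, "": np.nan})   # normalize sentinel strings to NaN
print(df.isna().sum())                          # missing values per column
print(df.duplicated().sum())                    # number of exact duplicate rows

# Flag physically implausible stage values (here assumed to lie between 0 and 60 ft).
print(df[(df["stage_ft"] < 0) | (df["stage_ft"] > 60)])
```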

In data visualization, the process of looking at data from different directions and studying different aspects to understand the data structure [14] is called *exploratory data analysis (EDA)*. From a data visualization perspective, EDA requires a high degree of interactivity and interconnectivity between different visualizations. EDA has been studied quite extensively in visualization research within the last years. Several paradigms about interaction design [15] and system design [16] have been established. Exploration of data usually happens by using different views and different visualizations. EDA contrasts with the more traditional statistical approach to data analysis that starts with hypothesis testing. In EDA, data scientists usually do not have a clear goal in mind; instead, the analysis should support the hypothesis-building process. The typical EDA tasks [17] include examining the distribution of values, detecting outliers and extreme values, identifying correlations, and finding clusters.

These tasks are, in general, supported by all visualization applications. In case scripting languages are used, data scientists tend to create several data representations to check various aspects. Programming environments like *Jupyter* Notebooks [19] enable scientists to combine both data analysis scripts and visualization. An example showing Python *Plotly* visualizations in *Jupyter* can be seen in **Figure 1**. Narrative or literate programming tools [20] such as notebooks help data scientists record their steps and decisions in a data analysis workflow. They allow scientists to save whole workflows and, in this way, make decisions and results reproducible. This has also been recognized by data visualization research [21], where researchers increasingly think about new solutions for more advanced data visualization in notebook environments and literate programming.

**Figure 1.**
*Jupyter Notebook and Plotly. Literate programming tools such as notebooks allow scientists to combine both data analysis scripts and visualization. Figure by [18].*
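A minimal sketch of this pattern is shown below: inside a notebook cell, a *Plotly Express* chart renders inline next to the wrangling code. The synthetic data frame is purely illustrative.

```python
import numpy as np
import pandas as pd
import plotly.express as px

# Small synthetic table standing in for an intermediate analysis result.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "x": rng.normal(size=200),
    "y": rng.normal(size=200),
    "group": ["a", "b"] * 100,
})

# In a Jupyter notebook the returned figure renders interactively below the cell,
# so scripts and visual checks live in the same literate document.
fig = px.scatter(df, x="x", y="y", color="group", title="Exploratory scatter plot")
fig.show()
```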

### **2.3 Modeling**

Data scientists make assumptions to find out which types of transformations they need to use for modeling. This also includes understanding which of the data fields are most relevant to a given analysis task. In the *Model* stage, the data is used as input data for building models of the underlying phenomenon. When models have been built, it is important to evaluate them against suitable real-world data.

Building and evaluating simple models like regression models is already supported by some visualization applications [22]. More advanced techniques, often summarized under the term "explainable AI" [23], try to find new, often visual, ways for humans to explain the decision structure of AI (artificial intelligence) models and to verify their decisions against their own ground-truth knowledge. One problem that is often mentioned by data scientists is the scalability of model testing to large data. Currently, models are often evaluated using EDA techniques very similar to the ones described in the *Profile* stage. In the future, data visualization research will concentrate on advanced methods for the verification of model outputs, especially in relation to the input and output data.

### **2.4 Reporting**

In the *Report* stage, mostly simple and easy-to-understand visualizations are needed since here the results of the data analysis stage have to be presented to a broader audience. The use cases in this stage can be mostly covered by employing basic charts, which are already well supported by current data science tools.

In many cases, dashboards with more or less interactivity are used to present the results. Many data science tools already support building dashboards. This was also recognized by the data visualization community recently. Sarikaya et al. [24] pointed out that dashboards are actually much more than just a collection of different graphs and that they need to be treated as separate research objects in data visualization. In their work, they categorized existing dashboards into seven categories, mostly based on the intended task (e.g., information and education vs. decision-making). Three examples are shown in **Figure 2**. Such approaches point out the necessity for dashboard designers to be clear about the intended user group and always have a clear story when presenting data to external people.

**Figure 2.**
*Dashboard types by [24]. Dashboards for reporting data findings may differ according to the intended user group and task. In this figure, dashboards for operational decision-making, strategic decision-making, communication, and studying your own data (quantified self) are shown. The dashboards in the first column (operational and organizational) target a narrow group of users with particular tasks in mind. The second column's dashboards (communication and quantified self) are intended to be viewed by a larger audience. Images by [24].*

One important aspect to consider is to choose the right visualizations for the right type of data. This is especially important if the exact structures in the data are unknown to the viewers and if the viewers' experience with data visualization is unclear. Based on research on human perception and on the possibilities of data visualization, researchers have started to create guidelines for data- or task-driven suggestions for data visualizations. The *Draco* system by Moritz et al. [25] uses predefined rules to suggest several visualizations based on the data and on what should be shown in the data. On their website *From Data to Viz*, Holtz and Healy [26] outline several paths showing how, starting from a specific data type, certain patterns in the data can be visualized. The *Data Visualisation Catalogue* [27] summarizes different visualization techniques and explains how they can be employed to encode information. Altogether, these approaches show the need for further research on guidelines in data visualization.

## **3. Data visualization toolboxes**

As more and more people have started working in data science, more and more software applications for data analysis, many of which are open source, have evolved within the last years [28]. All steps in the data science workflow contain circular processes where data scientists have to rethink actions they have taken and restart analysis processes from scratch. Because the workflow is highly interactive and undirected [29], there are no applications yet that can cover the entire data science workflow. Data scientists must, therefore, always combine different tools, scripts, and applications to achieve their goals [30]. These tools are often focused on specific tasks, such as efficient data storage and access (e.g., for Big Data applications), data wrangling (i.e., mapping data to another format), or automated analysis (e.g., machine learning). They are based on different programming languages (e.g., Python, R, JavaScript) or are built as fully-featured, standalone applications. In this chapter we specifically concentrate on libraries and tools for data visualization.

When talking about libraries and applications for data visualization, we use the definition by Rost [31], who conducted a study about the features of visualization libraries and applications by creating one specific chart with different tools. The study differentiates between *charting libraries* (i.e., programming toolkits) and *apps* (i.e., fully-featured applications). As also noted by Kandel et al. [5], different types of data scientists tend to use different types of tools. A data scientist identified as the *hacker* archetype would not be happy having to use a standalone application, because they would not be able to access the latest libraries in a scripting environment and would therefore not be able to customize their individual workflow. We, therefore, stick to this differentiation in this chapter.

### **3.1 Charting libraries**

Charting libraries are considered to be all kinds of visualization libraries that need some programming environment to work. In many cases, this is a scripting environment, so many libraries nowadays are based on Python or R. The popularity of data visualization libraries changes from year to year since many of these libraries are open source and therefore undergo continuous adaptations and improvements. Open-source technologies are a great advantage since data scientists can rely on a large community that can provide them with advice and support, and access to a wide range of libraries and plugins. There are some libraries which are repeatedly mentioned in high score lists [32], which are, among others: *ggplot2* (R), *Matplotlib* (Python), *Seaborn* (Python), *Bokeh* (Python), *D3* (JavaScript), *Chart.js* (JavaScript), *Lattice* (R), *Vegas* (Scala), *Breeze-viz* (Scala), *Rgl* (R). The differences between these libraries are, on the one hand, given by the different programming environments they live in. On the other hand, the libraries also offer different features and assets for data visualization. Especially for Python, there are libraries for high-performance computing, numerical calculations, regression modeling, and visualization, which are regularly extended and maintained. This is very similar in the case of R.

The study by Rost [31] reveals fascinating differences between some of the charting libraries. The libraries tested in this study have been classified according to whether they are more suited for analysis tasks or presentation tasks. The results can be seen in **Figure 3**. The analysis shows that charting libraries for both analysis (*Wrangle*, *Profile*, *Model*) and presentation (*Report*) purposes can be found. Interestingly, the charting libraries more suited for presentation are based on JavaScript (highlighted by underline). This also shows that web-based visualization methods are currently placed rather in the presentation or reporting phase of a data science workflow. This makes sense when considering the client–server environment of web-based visualizations: visualization designers have to carefully think about which type of data to show in this setting, since large datasets probably cannot be transferred over the network and could lead to processing or rendering problems on the client side (e.g., smartphones). Such a careful design can usually only be done after the analysis (*Profile*, *Model*) is already finished.

In the study by Schmidt [33], different charting libraries were compared according to how many different visualization techniques they support. This study revealed big differences between the libraries and identified two leaders in the field, which currently offer the largest range of different visualization techniques. The first leader is *D3* (short for Data-Driven Documents), which is based on JavaScript and uses SVG (Scalable Vector Graphics) elements to display data in the web browser. It was released in 2011 [34] as a successor to the earlier *Protovis* framework to provide a more expressive framework that, at the same time, focuses on web standards and provides improved performance. The second leader is *Plotly*, a collaborative browser-based plotting and analytics platform based on Python [35]. *Plotly* developers especially take care to allow data scientists to share visualizations and information within a large community.

**Figure 3.**
*Comparison of charting libraries. The chart shows the charting libraries used in the study by Rost [31] ranked by whether they are rather suited for analysis or presentation. Charting libraries based on JavaScript (which are, therefore, web-based) are marked by underline. Figure adapted from [31].*

All-in-all, the field of charting libraries is constantly changing, and many more advances are expected in the future. When deciding on a charting library, other factors like the task to be solved and the required programming skills have to be considered.

### **3.2 Apps**

Apps are considered to be fully-featured, standalone applications. They do not require any programming environment to be installed on a system to run them. Data visualizations can be created by using the user interface tools provided by the application. Apps are more targeted towards users without programming skills who are not familiar with manual data processing, analytics, and visualization. In almost all cases, apps are commercial products. This is because a lot of maintenance and continuous development is needed in the background to keep the apps up-to-date. According to Gartner's Magic Quadrants, a study that is done every year in different areas, the leaders in the field of business intelligence platforms [36] are considered to be *Tableau*, *Microsoft Power BI*, and *Qlik*.

In its yearly study, Gartner compares business intelligence applications that are considered most significant in the marketplace. The applications are evaluated and placed in one of four quadrants, rating them as either challengers, leaders, visionaries, or niche players. Many apps are currently available on the market. These applications differ in terms of targeted user groups and also the visualization features that they offer. Since Gartner's Magic Quadrants are published every year, interesting patterns can be detected by looking at the yearly changes, as shown in **Figure 4**. The three leaders identified previously have already shown an excellent performance throughout the last six years. Interestingly, the group of leaders has been limited to these three main players over the last years. It can also be seen that other tools appeared or vanished over the years, which shows the dynamics of the market for business intelligence tools.

**Figure 4.**
*Gartner's Magic Quadrants of the last six years. The quadrants are divided into the four fields of Leaders, Visionaries, Niche Players, and Challengers. It can be seen that the group of leaders only slightly changed over the last four years. Especially Qlik stayed quite constant. The group of other apps (indicated by a gray circle) shows the field's dynamic movements. New apps have been developed (e.g., Infor, 2021) and others disappeared (e.g., GoodData, 2019). Figure adapted from [37].*

The differences between commercial tools have also been highlighted in other studies. Zhang et al. [38] concentrated on specific visualization techniques and evaluated their usage in commercial business analytics tools. They ranked tools and applications based on classifications according to feature richness, flexibility, learning curve, and tasks (e.g., for analysis or presentation). Behrisch et al. [39] conducted an exhaustive survey on commercial visual analytics tools, evaluating them according to the degree to which they feature data handling, visualization, and automated analysis. Their findings classified the applications according to whether they are more suited for presentation or exploratory analysis. The results show that basically all applications feature data presentation, which is mainly supported by creating dashboards. Some of the applications like *Tableau* or *Qlik* also provide the ability to publish web-based dashboards. Interestingly, only about 50% of the applications were identified to be suited for exploratory analysis (like *Tableau*, *TIBCO Spotfire*, or *Microsoft Power BI*). The authors also identified the applications as useful for different types of users, mainly upper management, reporting managers, or data analysts.

The vast number of available libraries and tools has inspired researchers to conduct studies that quantify, evaluate, and rank the tools and applications data scientists use. Gartner's Magic Quadrant and several studies on apps for data visualization in data science show that no single tool covers all tasks and needs. The selection of an app therefore depends mainly on the tasks that need to be solved (e.g., analysis vs. presentation) and on the scope in which the app will be used.

## **4. Integration of visualization**

Visualization researchers have been very successful over the last decades, generating many novel techniques for the visual representation of data. These techniques range from approaches for the efficient representation of data (e.g., parallel coordinates) to interaction and user guidance workflows (e.g., overview first, details on demand). Current surveys show a large variety of visualization techniques: a survey of survey papers in information visualization by McNabb and Laramee [40] already classified over 80 survey papers describing relevant state-of-the-art techniques, and a more recent survey of books in information visualization revealed a similar quantity and variety [41]. Unfortunately, there is only minimal overlap between recent developments in visualization research and the data visualization features offered by charting libraries and apps. Most tools and applications feature basic charts and plots (e.g., scatter plots, bar charts, bubble charts, radar charts), but more advanced visualization techniques (e.g., chord diagrams, horizon graphs) can hardly be found.
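To make this contrast concrete, the following minimal Python sketch assumes the open source *Plotly* library and its bundled Gapminder sample data set are available; the data set and column names are illustrative assumptions, not material from the studies cited here. It shows how little code a basic chart requires in a typical charting library, whereas an advanced technique such as a chord diagram or horizon graph has no comparable built-in call and would have to be assembled manually.

```python
import plotly.express as px

# Basic charts are typically a one-liner in charting libraries:
df = px.data.gapminder().query("year == 2007")
fig = px.scatter(df, x="gdpPercap", y="lifeExp",
                 size="pop", color="continent", log_x=True,
                 title="Basic bubble chart: one call, sensible defaults")
fig.show()

# More advanced techniques (e.g., chord diagrams, horizon graphs) have no
# equivalent built-in call here and would need to be constructed manually
# from low-level primitives such as shapes and filled areas.
```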

This limited overlap was confirmed by several studies on the integration of visualization techniques in common libraries and applications. Harger and Crossno [42] evaluated the feature richness of open source toolkits for visual analytics. They assessed the toolkits based on which basic chart types (e.g., bar charts, line charts), which types of graph visualization (e.g., circular or force-directed layouts), and which types of geo-spatial visualization techniques (e.g., choropleth maps, cartograms) they feature. They concluded that some toolkits are more targeted towards analytics and others more towards visualization. In a similar manner, Schmidt [33] surveyed commonly used tools and applications and evaluated the visualization techniques they feature. That study focused on visualization techniques rather than on derived attributes (e.g., feature richness) and included more recent advances in visualization research, considering both open source tools and commercial applications to produce a complete picture of visualization technique usage. It also concentrated on 2D information visualization techniques, as these are more relevant for data science and data analytics, and disregarded spatial techniques like 3D volume rendering.

In all studies conducted so far, not surprisingly, basic chart types like scatter plots and bar charts are highly supported by all evaluated tools and applications. Among the more advanced visualization techniques, multi-dimensional techniques like parallel coordinates and radar charts are already widely known and are therefore included in many of the tools. The same applies to scatter plot matrices and heatmaps. Techniques for hierarchical data are also well supported, especially by open source tools. Visualization techniques for temporal data, in contrast, are not available in the majority of the tools and applications. This is probably because temporal data (e.g., time-series data) is a specialized data type used only for specific tasks, for which users typically rely on their own dedicated tools. Therefore, temporal data techniques have not yet been included in common tools and applications, which usually try to address a broader range of data scientists and data analysts. Some visualization techniques, like time nets, data vases, or people garden, have not been integrated into any tool or application yet.
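As a hedged illustration of the multi-dimensional techniques mentioned above, the short Python sketch below assumes *Plotly* and its bundled iris sample data set; the column names come from that sample data and are not taken from any of the surveyed tools. It produces a parallel coordinates view and a scatter plot matrix with single calls, showing that such techniques are exposed as ready-made chart types in feature-rich libraries.

```python
import plotly.express as px

df = px.data.iris()

# Parallel coordinates: one axis per dimension, one polyline per record.
par_coords = px.parallel_coordinates(
    df, color="species_id",
    dimensions=["sepal_width", "sepal_length", "petal_width", "petal_length"])

# Scatter plot matrix: pairwise scatter plots of the selected dimensions.
splom = px.scatter_matrix(
    df, dimensions=["sepal_width", "sepal_length", "petal_width", "petal_length"],
    color="species")

par_coords.show()
splom.show()
```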

From a tools and applications point of view, *Plotly* and *D3* notably provide the most features among all the tested open source tools. Other tools target specialized functionality, like *dygraphs* for scientific plots, and feature only a minimal range of visualization techniques. Libraries intended for use in web-based applications (e.g., *Chart.js* or *Google Charts*) feature only those visualization techniques that are most likely to be needed in a web-based context. Open source tools, especially *ggplot2*, benefit a lot from the community's input, since many advanced visualization techniques are only available via extensions. Among the commercial tools, *Tableau*, *Microsoft Power BI*, and *Highcharts* feature most of the visualization techniques evaluated here.

Data scientists could be supported in all stages of their workflow by visual tools. Interestingly, visualization techniques are currently mostly applied in the *Report* stage, at the end of the data science workflow. This stands in contrast to the interactive data exploration workflows strongly promoted by visualization research. Even worse, the support for more advanced visualization techniques, especially for interactive data exploration, is still minimal. This has been identified as the "Interactive Visualization Gap" by Batch and Elmqvist [43]. Further exchange with data science is considered a valuable and important goal for the visualization community. Previous research efforts in data science revealed that the gap between new developments in visualization research and their application "in the wild" still exists and will hopefully be further mitigated in the future.
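To sketch what interactive data exploration, as opposed to a static report chart, might look like in practice, the following hedged Python example assumes a Jupyter notebook environment with *Plotly* and *ipywidgets* installed; the data set and the callback name are illustrative assumptions only. Brushing a selection in the scatter plot triggers a small analysis step instead of merely rendering a figure for a report.

```python
import plotly.graph_objects as go
import plotly.express as px

# A minimal exploration loop in a notebook: lassoing points in the scatter
# plot prints summary statistics for the selection, so the visualization
# drives the next analysis question rather than only documenting a result.
df = px.data.iris()
fig = go.FigureWidget(
    data=[go.Scatter(x=df["sepal_width"], y=df["sepal_length"],
                     mode="markers", text=df["species"])])
fig.update_layout(dragmode="lasso", title="Brush points to inspect them")

def summarize_selection(trace, points, selector):
    # Called whenever the user lassos a subset of points.
    selected = df.iloc[points.point_inds]
    print(selected["species"].value_counts())

fig.data[0].on_selection(summarize_selection)
fig  # displaying the widget in a notebook cell enables the interaction
```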

## **5. Conclusions**

Data visualization can provide substantial support for users working with data. Data visualization techniques have proven to be useful for different steps in the data science workflow. The techniques differ in the interactivity and complexity of their representations. Many visualization techniques have been successfully integrated into libraries and applications for data visualization. Especially in the open source sector, many new directions have opened up within the last years. Due to the success of programming languages like Python, R, and Scala, libraries targeted towards these programming environments are becoming especially popular.

Among them are *Plotly* for Python and *ggplot2* for R. Web-based applications are also gaining importance, which is why JavaScript-based libraries like *D3* and *Chart.js* can likewise be found among the most popular data visualization libraries. The market of business intelligence tools is very dynamic as well, but it shows three clear leaders in the field, namely *Tableau*, *Microsoft Power BI*, and *Qlik*. Different types of data scientists require different libraries or applications; applications are therefore increasingly targeted towards a specific goal and designed to solve specific types of tasks. However, when looking at the data visualization techniques offered by the most prominent libraries and applications, the "Interactive Visualization Gap" for exploratory data analysis still exists. Many recent developments and implementations in data visualization research do not find their way into existing libraries and applications. Therefore, further exchange between data science and data visualization is highly recommended, as both parties can learn a lot from each other and, together, further foster the usage of data visualization in data analytics.

## **Acknowledgements**

VRVis is funded by BMK, BMDW, Styria, SFG, Tyrol and Vienna Business Agency in the scope of COMET - Competence Centers for Excellent Technologies (879730) which is managed by FFG.

## **Author details**

Johanna Schmidt VRVis Zentrum für Virtual Reality und Visualisierung Forschungs-GmbH, Vienna, Austria

\*Address all correspondence to: johanna.schmidt@vrvis.at

© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## **References**

[1] Chikio Hayashi (1998) *What is Data Science? Fundamental Concepts and a Heuristic Example*. In Data Science, Classification, and Related Methods, pp. 40—51. Springer Japan.

[2] Mark A. Parsons, Øystein Godøy, Ellsworth LeDrew, Taco F. de Bruin, Bruno Danis, Scott Tomlinson, and David Carlson (2011) *A conceptual framework for managing very diverse data for complex, interdisciplinary science*. Journal of Information Science, 37(6): 555—569.

[3] David M. Blei and Padhraic Smyth (2017) *Science and Data Science*. Proceedings of the National Academy of Sciences, 114(33):8689—8692.

[4] Natalia Andrienko, Gennady Andrienko, Georg Fuchs, Aidan Slingsby, Cagatay Turkay, and Stefan Wrobel (2020) *Visual Analytics for Data Scientists*. Springer International Publishing.

[5] Sean Kandel, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer (2012) *Enterprise Data Analysis and Visualization: An Interview Study*. IEEE Transactions on Visualization and Computer Graphics, 18(12):2917–2926.

[6] Miryung Kim, Thomas Zimmermann, Robert DeLine, and Andrew Begel (2018) *Data Scientists in Software Teams: State of the Art and Challenges*. IEEE Transactions on Software Engineering, 44(11):1024–1038.

[7] Harlan D. Harris, Sean P. Murphy and Marck Vaisman (2013) *Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work*. O'Reilly Media, Inc.

[8] Edward Clarkson, Krishna Desai, and James Foley (2009) *ResultMaps: Visualization for Search Interfaces*. IEEE Transactions on Visualization and Computer Graphics 15(6):1057–1064.

[9] Tye Rattenbury, Joseph M. Hellerstein, Jeffrey Heer, Sean Kandel, and Connor Carreras (2017) *Principles of Data Wrangling: Practical Techniques for Data Preparation*. O'Reilly Media, Inc.

[10] Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer (2011) *Wrangler: Interactive visual specification of data transformation scripts*. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '11, May 7–12, Vancouver, Canada, pp. 3363–3372.

[11] Sean Kandel, Ravi Parikh, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer (2012) *Profiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment*. Proceedings of the International Working Conference on Advanced Visual Interfaces, AVI '12, May 22–25, Capri, Italy, pp. 547—554.

[12] Clemens Arbesser, Florian Spechtenhauser, Thomas Mühlbacher, and Harald Piringer (2016) *Visplause: Visual data quality assessment of many time series using plausibility checks*. IEEE Transactions on Visualization and Computer Graphics 23(1):641–650.

[13] Enrico Bertini, Andrada Tatu, and Daniel Keim (2011) *Quality Metrics in High-Dimensional Data Visualization: An Overview and Systematization*. IEEE Transactions on Visualization and Computer Graphics 17(12):2203–2212.

[14] Peter Filzmoser, Karel Hron, and Matthias Templ (2018) *Exploratory Data Analysis and Visualization*. Applied Compositional Data Analysis: With Worked Examples in R, pp. 69–83, Springer International Publishing.

[15] Arvind Satyanarayan, Kanit Wongsuphasawat, and Jeffrey Heer (2014) *Declarative Interaction Design for Data Visualization*. Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology, UIST '14, Honolulu, Hawaii, Oct 5–8, USA, pp. 669—678.

[16] Tamara Munzner (2014) *Visualization Analysis and Design*. Taylor & Francis, Inc.

[17] NIST/SEMATECH e-Handbook of Statistical Methods (2021) http://www.itl.nist.gov/div898/handbook/ [Accessed 2021-03-01].

[18] Project Jupyter (2021) *Interactive data visualizations*. https://jupyterbook.org/interactive/interactive.html [Accessed 2021-03-02].

[19] Project Jupyter (2021) https://jupyter.org/ [Accessed 2021-03-02].

[20] Mary Beth Kery, Marissa Radensky, Mahima Arya, Bonnie E. John, and Brad A. Myers (2018) *The Story in the Notebook: Exploratory Data Science Using a Literate Programming Tool*. In Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI '18, Apr. 21– 26, Montreal QC, Canada.

[21] Yifan Wu, Joseph M. Hellerstein, and Arvind Satyanarayan (2020) *B2: Bridging Code and Interactive Visualization in Computational Notebooks*. In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, UIST '20, Oct 20–23, Virtual Event, pp. 152–165.

[22] Thomas Mühlbacher and Harald Piringer (2013) *A partition-based framework for building and validating regression models*. IEEE Transactions on Visualization and Computer Graphics 19 (12):1962–1971.

[23] Wojciech Samek, Grégoire Montavon, Andrea Vedaldi, Lars Kai Hansen, and Klaus-Robert Müller (2019) *Explainable AI: Interpreting, Explaining and Visualizing Deep Learning*. Springer International Publishing.

[24] Alper Sarikaya, Michael Correll, Lyn Bartram, Melanie Tory, and Danyel Fisher (2019) *What Do We Talk About When We Talk About Dashboards?*. IEEE Transactions on Visualization and Computer Graphics 25(1):682–692.

[25] Dominik Moritz, Chenglong Wang, Gregory Nelson, Halden Lin, Adam M. Smith, Bill Howe, Jeffrey Heer (2019) *Formalizing Visualization Design Knowledge as Constraints: Actionable and Extensible Models in Draco*. IEEE Transactions on Visualization and Computer Graphics 25(1):438–448.

[26] Yan Holtz and Conor Healy (2017) *From Data to Viz*. https://www.data-to-viz.com/ [Accessed 2021-02-20].

[27] Severino Ribecca (2021) *The Data Visualisation Catalogue*. https://datavizcatalogue.com/ [Accessed 2021-03-02].

[28] Panagiotis Barlas, Ivor Lanning, and Cathal Heavey (2020) *A survey of open source data science tools*. International Journal of Intelligent Computing and Cybernetics 8(3):232–261.

[29] Jiali Liu, Nadia Boukhelifa, and James R. Eagan (2019) *Understanding the Role of Alternatives in Data Analysis Practices*. IEEE Transactions on Visualization and Computer Graphics 26(1):66—76.

[30] Sara Alspaugh, Nava Zokaei, Andrea Liu, Cindy Jin, and Marti A. Hearst (2019) *Futzing and Moseying: Interviews with Professional Data Analysts on Exploration Practices*. IEEE Transactions on Visualization and Computer Graphics 25(1):22—31.

[31] Lisa Charlotte Rost (2016) *What I Learned Recreating One Chart Using 24 Tools*. https://source.opennews.org/articles/what-i-learned-recreating-one-chart-using-24-tools/ [Accessed 2021-03-05].

[32] Bob Hayes (2019) *Business Broadway: Programming Languages Most Used and Recommended by Data Scientists*. https://businessoverbroadway.com/2019/01/13/programming-languages-most-used-and-recommended-by-data-scientists/ [Accessed 2021-02-21].

[33] Johanna Schmidt (2020) *Usage of Visualization Techniques in Data Science Workflows*. In Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP '20, Valletta, Malta, Feb 27–29, pp. 309–316.

[34] Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer (2011) *D3: Data-Driven Documents*. IEEE Transactions on Visualization and Computer Graphics 17(12):2301—2309.

[35] Plotly Technologies Inc (2015) *Collaborative data science*. Montréal, QC.

[36] Gartner (2021) *Magic Quadrant for Analytics and Business Intelligence Platforms*. https://www.gartner.com/reviews/market/analytics-business-intelligence-platforms [Accessed 2021-03-01].

[37] Gartner (2021) *Gartner Magic Quadrant*. https://www.gartner.com/en/research/methodologies/magic-quadrants-research [Accessed 2021-02-05].

[38] Leishi Zhang, Andreas Stoffel, Michael Behrisch, Sebastian Mittelstadt, Tobias Schreck, René Pompl, Stefan Hagen Weber, Holger Last, and Daniel Keim (2012) *Visual analytics for the big data era – A comparative review of state-of-the-art commercial systems*. In Proceedings of the IEEE Conference on Visual Analytics Science and Technology, VAST '12, Oct. 14–19, Seattle, WA, USA, pp. 173–182.

[39] Michael Behrisch, Dirk Streeb, Florian Stoffel, Daniel Seebacher, Brian Matejek, Stefan Hagen Weber, Sebastian Mittelstaedt, Hanspeter Pfister, and Daniel Keim (2018) *Commercial Visual Analytics Systems-Advances in the Big Data Analytics Field*. IEEE Transactions on Visualization and Computer Graphics 25(1):3011–3031.

[40] Liam McNabb and Robert S. Laramee (2017) *Survey of Surveys (SoS) - Mapping The Landscape of Survey Papers in Information Visualization*. Computer Graphics Forum 36:589—617.

[41] Dylan Rees and Robert S. Laramee (2019) *A Survey of Information Visualization Books*. Computer Graphics Forum, 38:610—646.

[42] John R. Harger and Patricia J. Crossno (2012) *Comparison of Open Source Visual Analytics Toolkits*. Proceedings of SPIE - The International Society for Optical Engineering, 8294.

[43] Andrea Batch and Niklas Elmqvist (2018) *The Interactive Visualization Gap in Initial Exploratory Data Analysis*. IEEE Transactions on Visualization and Computer Graphics, 24(1):278—287.
