*4.4.2 Beyond the state of the art*

We propose to develop and use an integrated data visualization environment based on formal concept analysis, temporal concept analysis, temporal relational semantic systems, and self-organizing maps to identify suspicious tweets.

Formal concept analysis (FCA) is a mathematical technique introduced in 1982 by Rudolf Wille [94]; it has its roots in earlier work by Birkhoff [95] and in early applications of lattice-theoretical ideas in information science, such as that of Barbut et al. [96]. FCA has been used in several security text mining projects. The goal in each of these projects was to present an overload of information in an intuitive visual format that may speed up and improve decisions by police investigators on where and when to act. In the first case study, with the Amsterdam-Amstelland police (RPAA), which started in 2007, FCA was used to analyze statements made by victims to the police. The concept of domestic violence was iteratively enriched and refined, resulting in an improved definition and highly accurate automated labeling of new incoming cases [97]. Later, the authors shifted to millions of very short observational police reports, from which persons involved in human trafficking and terrorism were extracted. Concept lattices allowed for the detection of several suspects involved in human trafficking or showing radicalizing behavior [98, 99].
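To make the mechanics concrete, the sketch below enumerates the formal concepts (extent/intent pairs) of a small binary context in Python. It is a deliberately naive illustration only; dedicated FCA toolkits scale far better, and the toy report/indication context is invented for the example.

```python
from itertools import combinations

def close(context, objects, attributes, obj_subset):
    """Closure operator: objects -> shared attributes -> objects having them all."""
    common = set(attributes)
    for o in obj_subset:
        common &= context[o]          # attributes shared by the subset
    extent = {o for o in objects if common <= context[o]}
    return frozenset(extent), frozenset(common)

def formal_concepts(context):
    """Naively enumerate all formal concepts of a small binary context."""
    objects = list(context)
    attributes = set().union(*context.values())
    concepts = set()
    for r in range(len(objects) + 1):
        for subset in combinations(objects, r):
            concepts.add(close(context, objects, attributes, subset))
    return concepts

# Toy context: police reports (objects) x observed indications (attributes).
reports = {
    "report1": {"violence", "domestic"},
    "report2": {"violence"},
    "report3": {"domestic", "threat"},
}
for extent, intent in sorted(formal_concepts(reports), key=lambda c: -len(c[0])):
    print(sorted(extent), "<->", sorted(intent))
```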

Temporal concept analysis (TCA) was introduced by Wolff [100] and offers a framework for representing and analyzing data with a temporal dimension. In the security applications discussed above, suspects were mentioned in multiple reports; a detailed profile of a suspect (and of persons in his social network), depicted as a lattice with the timestamps of the observations as objects and indications as attributes, helped to gain insight into the threat they posed to society [101]. Recently, TCA and its relational counterpart, temporal relational semantic systems (TRSS, [100]), were successfully applied to the analysis of chat conversations [102].
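TCA reuses the same machinery with time as the object dimension. The toy context below, with invented timestamps and indications for a single suspect, shows how the profile lattice described above could be derived; it reuses the `formal_concepts()` sketch from the previous example.

```python
# Toy temporal context for one suspect: observation timestamps as objects,
# indications observed at that time as attributes (per the TCA usage above).
# Requires formal_concepts() from the preceding FCA sketch.
observations = {
    "2020-01-05": {"seen_at_location_X"},
    "2020-02-11": {"seen_at_location_X", "contact_with_suspect_B"},
    "2020-03-02": {"contact_with_suspect_B", "violent_incident"},
}
for extent, intent in formal_concepts(observations):
    if extent:  # each concept groups timestamps sharing the same indications
        print(sorted(extent), "->", sorted(intent))
```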

Self-organizing maps [103] have been used in many applications where high-dimensional, unlabeled data spaces had to be visualized in a two-dimensional plane to make the data accessible to human experts. For example, Ramadas et al. [104] used self-organizing maps to identify suspicious network activity. In a previous security case study, a special type named emergent self-organizing maps was used to identify domestic violence in police reports [105, 106]; they were found to be more suitable than multidimensional scaling for text mining. Claster et al. [107] used self-organizing maps to mine over 80 million Twitter micro logs, exploring whether these data could be used to identify sentiment about tourism and Thailand amid the unrest in that country in early 2010, and whether analysis of tweets could reveal the effect of that unrest on Phuket's tourism environment.
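As an illustration of the preprocessing role a SOM plays here, the following is a minimal NumPy sketch of classical SOM training. The random data and untuned hyperparameters are placeholders; a production system would use a tuned implementation (e.g., emergent SOMs as in the case study above).

```python
import numpy as np

def train_som(data, grid=(10, 10), epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """Train a small self-organizing map on high-dimensional vectors.

    Returns a (rows, cols, dim) weight grid; each input is later mapped to
    its best-matching unit, giving a 2-D view of the data.
    """
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.random((rows, cols, data.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)  # unit positions
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            # Decay learning rate and neighborhood radius over time
            frac = step / n_steps
            lr = lr0 * (1 - frac)
            sigma = sigma0 * (1 - frac) + 1e-3
            # Best-matching unit: closest weight vector to the input
            dists = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(np.argmin(dists), (rows, cols))
            # Gaussian neighborhood pulls nearby units toward the input
            grid_d2 = ((coords - np.array(bmu)) ** 2).sum(axis=2)
            h = np.exp(-grid_d2 / (2 * sigma ** 2))[..., None]
            weights += lr * h * (x - weights)
            step += 1
    return weights

# Usage: map 50-dimensional document vectors onto a 10x10 plane.
docs = np.random.default_rng(1).random((200, 50))
som = train_som(docs)
bmus = [np.unravel_index(np.argmin(np.linalg.norm(som - d, axis=2)), (10, 10))
        for d in docs]
```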

Nevertheless, there are several differences between analyzing Twitter feeds and traditional police reports. Whereas individual tweets may not be particularly interesting, a lot of information can be distilled from conversations consisting of many tweets exchanged between different users on a certain topic. Such feeds do not contain a summary of facts; rather, several topics emerge between two or more persons. We must judge the interestingness of a feed from a law enforcement perspective and distinguish between several types of Twitter users in a relevant conversation: for example, did a person contribute only marginally, or did he or she actually contribute to or promote criminal behavior? Ebner et al. [108] used FCA to categorize Twitter users who write tweets about the same topics in the context of a conference event. Cuvelier et al. [109] used FCA, in combination with tag clouds, as an e-reputation monitoring system.

The natural language processing (NLP) of tweets is also a challenging task, since Twitter is characterized by so-called noncanonical language. It is widely acknowledged that NLP systems suffer a drop in accuracy when applied to text of this kind, which negatively affects all levels of analysis, from linguistic annotation to information extraction. Accordingly, the analysis of noncanonical language has been one of the main topics of recent NLP venues, for example, the First Workshop on Syntactic Analysis of Noncanonical Language (SANCL-2012) (https://sites.google.com/site/sancl2012/), the workshop series on Scritture brevi (lit.: short writings) organized by the University of Rome Tor Vergata (https://sites.google.com/site/scritturebrevi/atti-dei-workshop), and the First Shared Task on Dependency Parsing of Legal Texts at SPLET-2012 (https://sites.google.com/site/splet2012workshop/sharedtask). The main challenge in analyzing noncanonical language, such as the language of tweets, is that its linguistic characteristics differ from those of the data on which the tools are trained, typically newswire texts: among other things, punctuation and capitalization are often inconsistent; slang and technical jargon are widely used; and noncanonical syntactic structures occur frequently [110–112]. Accordingly, several domain adaptation methods and analysis strategies have been investigated to improve the accuracy of NLP tools, among the most recent being the self-training method of Le Roux et al. [113], the active-learning method of Attardi et al. [114], and the term-extraction method proposed by Bonin et al. [88].
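As a concrete, hypothetical example of the normalization such noncanonical text needs before linguistic annotation, the sketch below maps mentions, URLs, and hashtags to placeholder tokens and squeezes character floods. The placeholder conventions are assumptions for illustration, not part of the cited methods.

```python
import re

# Tools trained on newswire degrade on raw tweets, so Twitter-specific
# tokens are mapped to placeholders before annotation.
MENTION = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#(\w+)")
REPEAT = re.compile(r"(.)\1{2,}")  # character floods: "soooo" -> "soo"

def normalize_tweet(text):
    text = URL.sub("<url>", text)
    text = MENTION.sub("<user>", text)
    text = HASHTAG.sub(r"\1", text)   # keep the hashtag word, drop '#'
    text = REPEAT.sub(r"\1\1", text)  # squeeze repeated characters
    return text.strip()

print(normalize_tweet("@alice soooo angry!! see https://t.co/x #riot"))
# -> "<user> soo angry!! see <url> riot"
```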

Event detection in Twitter has recently been an active area of research and has been successfully applied to detecting earthquakes [115] and sport events [116]. For events of interest to law enforcement, one can use generic features such as emerging common terms, location, date, and potentially the participants of the event. Hence, we extract date/time information and time-event phrases learnt from tweets and encode their presence as features. Participant information is captured via the presence of the '@' character followed by a username within tweets. For events of legal interest specifically, the overall sentiment of the tweets can also serve as a feature. According to recent research by Leetaru [117] at the University of Illinois, strong negative emotion in news coverage can signal an upcoming significant event: a sentiment analysis of news over a long period revealed significantly negative textual sentiment before the revolutions in Libya and Egypt, comparable in strength to the signals in news from 1991, right before the United States entered Kuwait, and from 2003, when the United States–Iraq war was about to start.
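A minimal sketch of the feature extraction just described might look as follows; the time-phrase list and the sentiment threshold are invented placeholders (in the project, time-event phrases would be learnt from tweets rather than hand-listed).

```python
import re

MENTION = re.compile(r"@(\w+)")
# Illustrative time-event phrases; the project would learn these from data.
TIME_PHRASES = ("tonight", "tomorrow", "right now", "this evening")
DATE = re.compile(r"\b\d{1,2}[/-]\d{1,2}(?:[/-]\d{2,4})?\b")

def event_features(tweet, sentiment_score):
    """Feature-vector sketch: date/time cues, participants, and the
    (externally computed) sentiment strength."""
    text = tweet.lower()
    return {
        "has_date": bool(DATE.search(text)),
        "has_time_phrase": any(p in text for p in TIME_PHRASES),
        "participants": MENTION.findall(tweet),   # '@' + username
        "strong_negative": sentiment_score < -0.5,  # assumed threshold
    }

print(event_features("Meet @bob at the square tonight, 12/05", -0.8))
```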

While current approaches, such as Ref. [117], have been shown to work on static data and static models, more research is needed to extend these methods to the dynamic case. Moreover, news text is highly structured and formal, while Twitter consists of informal short texts. Building on our prior work on classifying short tweets [118] and on sentiment analysis of large-scale data [119], we will categorize tweets for event detection and identify tweets with strong sentiment. Our initial hypothesis is that strong sentiment increases the probability of an event being of interest to law enforcement. Recently, distributional semantic models (DSMs) have been applied to affective text analysis with good results across languages [120]; in this WP, we will also apply DSMs to sentiment analysis of multilingual tweets. The more interesting problem is forecasting, where events are predicted beforehand; this would be of high value for preventive law enforcement. Besides prediction, one can also use this approach to obtain feedback from the crowd on actions taken by law officers. Such approaches have already been deployed in finance and marketing applications to understand the mood of financial markets and consumer opinions [92, 121, 122], and similar concepts can be adapted for forensic applications. In fact, the FBI and the Pentagon have already started to utilize these methods to predict criminal and terrorist activities and to monitor persons and regions of high interest [AP Exclusive]. Related techniques have also been used to detect centrally controlled Twitter accounts that create the appearance of widespread support for a candidate or opinion and support the dissemination of political misinformation.
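To illustrate the strong-sentiment filter, the sketch below uses the off-the-shelf VADER scorer as a stand-in for the project's multilingual DSM-based sentiment models; the 0.6 magnitude threshold is an assumption, not a tuned value.

```python
# pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def candidate_events(tweets, threshold=0.6):
    """Keep tweets whose compound sentiment magnitude exceeds the threshold."""
    flagged = []
    for t in tweets:
        score = analyzer.polarity_scores(t)["compound"]  # in [-1, 1]
        if abs(score) >= threshold:
            flagged.append((score, t))
    return sorted(flagged)  # most negative first

for score, t in candidate_events(["All quiet downtown.",
                                  "Absolutely furious, this is a disaster!!"]):
    print(f"{score:+.2f}  {t}")
```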


The innovativeness of the tool in this area lies in the fact that the combination of the discussed methods has never been proposed for visualizing and clustering data, nor integrated in a software system. It will be the first integrated human-centered data discovery environment that combines statistical methods from machine learning with order-theoretic methods such as concept lattices. The human-centered discovery process starts with the self-organizing map, which can handle high-dimensional data spaces and is therefore an ideal tool for initial preprocessing. FCA can then be used to explore dependencies and information links in a smaller subset. TCA and TRSS are used for in-depth profiling of identified individuals and communities. In particular, we focus on the niche of Twitter user and feed mining within the broader text-mining field. State-of-the-art domain adaptation methods will be tested to improve the accuracy of the linguistic annotation tools on Twitter data, and customized term-extraction methods will be devised to reliably extract relevant keywords from tweets. Needless to say, the proposed system can easily be extended to other text mining applications.
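A rough sketch of how these stages could be glued together is given below; train_som() and formal_concepts() are the earlier sketches, while vectorize() and to_context() are assumed helper functions, not project APIs.

```python
# Hypothetical pipeline glue: SOM for coarse 2-D clustering, FCA on one
# selected cluster, with TCA-style contexts reserved for in-depth profiling.
import numpy as np

def explore(tweets, vectorize, to_context, grid=(10, 10)):
    vecs = np.array([vectorize(t) for t in tweets])
    som = train_som(vecs, grid=grid)
    # Map each tweet to its best-matching unit on the 2-D plane
    bmu = [np.unravel_index(np.argmin(np.linalg.norm(som - v, axis=2)), grid)
           for v in vecs]
    # An analyst would pick a map cell; here we take the densest unit, and
    # FCA then runs on that small subset only.
    cell = max(set(bmu), key=bmu.count)
    subset = [t for t, b in zip(tweets, bmu) if b == cell]
    return formal_concepts(to_context(subset))
```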

A web crawler will be designed to collect feeds from the Twitter website. This is a technically challenging yet well-understood task in the scientific community (see, e.g., [107]). The data collection can be done by a police employee who has received a type P screening. The data consist of text fragments. Concerning languages, we will first focus on Dutch tweets; this may later be extended to Hungarian and Bulgarian, since most organized crime in areas such as human trafficking in Amsterdam is committed by these nationalities. Since a tweet contains, among other things, a user name, a Twitter ID, and the posted text, and potentially the IDs and names of other users, we will first replace this user-identifiable information with numeric values using regular expressions. In a second step, we will use available named entity recognition methods to remove person names from the tweets themselves.
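A minimal sketch of this two-step anonymization follows, assuming stable numeric pseudonyms for handles and an NER pass (e.g., with an off-the-shelf model such as spaCy) as the second step, which is stubbed out here.

```python
import re
from itertools import count

# Step 1: replace @handles and numeric user IDs with stable pseudonyms via
# regular expressions. Step 2 would pass the remaining text through an NER
# model to drop person names; it is only indicated by a comment below.
_ids = count(1)
_pseudonyms = {}

def _pseudonymize(match):
    handle = match.group(0).lower()
    if handle not in _pseudonyms:               # same user -> same pseudonym
        _pseudonyms[handle] = f"<user_{next(_ids)}>"
    return _pseudonyms[handle]

def anonymize(tweet):
    text = re.sub(r"@\w+", _pseudonymize, tweet)              # handles
    text = re.sub(r"\buser[_ ]?id[:= ]\s*\d+", "<id>", text,  # numeric IDs
                  flags=re.I)
    # Step 2 (assumed): remove PERSON entities found by an NER model here.
    return text

print(anonymize("RT @Alice meet @bob_92 (user_id: 12345) at the square"))
# -> "RT <user_1> meet <user_2> (<id>) at the square"
```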
