*Linked Open Data - Applications, Trends and Future Developments*

**3.2 Research related to the conversion of table and list to other formats**

There are several research works that involve converting tables and lists into other formats.

Yang and Luk [34, 35] discuss a thorough method for converting Web-based tables into key-value pair data and provide solutions to the problem of extracting data from tables in various cases.

The research of Pivk, Cimiano, and Sure [36] proposes a method to convert data from Web-based tables into F-logic (frame logic), a frame representation that can be applied to the Semantic Web.

Table Analysis for Generating Ontologies (TANGO) is the research of Embley [37] and Tijerino et al. [38, 39]. Its goal is to transform table data into an ontology. Table Extraction by Global Record Alignment (TEGRA) by Chu et al. [40] discusses the challenges of extracting structured data from Web-based tables, where in some cases a "table" that appears on a Webpage is not in HTML table format but may be an HTML list or some other arrangement.

DeExcelerator is the research of Eberius et al. [41]. It is a framework for extracting structured data from HTML tables and spreadsheets.

Venetis et al. [42] solve the problem of dealing with semantics and ontologies by manually adding classes to the column headers of a table without having to do schema matching. However, the user must have the skill to add this information.

WebTables [43, 44] is a project of Google Research to extract structured data from HTML tables on Webpages. They searched 14.1 billion HTML tables and found that only 154 million tables had sufficient quality to allow extraction of structured data [45]. Most HTML tables on the web are used to define the layout of the webpage rather than to present data in actual table format [46]. WebTables uses a classifier tuned to favor recall over precision in order to filter as many tables from Webpages as possible. It then selects only tables with a single-line header and ignores other, more complicated tables. Later, this project was developed into a system called Octopus [47] to help support the search engine more efficiently.

At Google, Elmeleegy et al. [48] use WebTables to support a system called ListExtract to extract 100,000 lists from the web and then transform them into relational databases. Wong et al. [49] use 1000 machines to extract 10.1 billion tuples from 1 billion Webpages with parallel algorithms in less than 6 hours.

Fusion Tables [50] is a Google Research project designed to allow users to upload table data to the web for data analysis with various tools. It is currently available on Google Docs.

The Web Data Commons (WDC) [51] is a project to extract structured data from Common Crawl, the largest publicly available webpage archive. A part of the WDC called Web Table Corpora extracts structured data only from the HTML tables in the Common Crawl Web archive. Currently, Web Table Corpora is available for download in two sets. The first set is the 2012 Corpus, which extracted 147 million tables from 3.5 billion Webpages in the 2012 Common Crawl. The second set is the 2015 Corpus, which extracted 233 million tables from 1.78 billion webpages in the July 2015 Common Crawl. The second set contains metadata about the extracted tables, while this information is not preserved in the first set.

The WDC Web Table Corpus has been used in many research works. For example, it is used to measure the performance of schema matching approaches at various levels of table elements (such as table-to-class, row-to-instance, and attribute-to-property).

<sup>4</sup> https://www.w3.org/2013/csvw/wiki/Main_Page CSV on the Web Working Group Wiki

<sup>5</sup> https://www.iana.org/assignments/media-types/text/tab-separated-values

**4. TULIP: table/list interchangeable, unified, pivotal vocabulary**

The main idea of TULIP is to transform semi-structured data in the form of tables and lists, regardless of the source, into structured data in the form of five-star open data as a set of RDF triples. Each triple contains only a subject, a predicate, and an object. Triples connect to other triples and form a directed graph called the RDF graph. This is the principle of Linked Data, allowing Semantic Web applications to consume TULIP's five-star open data in the same way as any other Linked Data.
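As a minimal sketch of this idea in plain Python (no RDF library; the `ex:`/`tlp:`/`rdfs:` names mirror this chapter's examples and are illustrative, not a fixed API), a table flattened into subject-predicate-object triples can be traversed and filtered like any other directed graph:

```python
# A 3x3 table flattened into subject-predicate-object triples.
# The set of triples is a directed graph: subjects and objects are
# nodes, predicates are edge labels.
triples = set()

triples.add(("ex:TableExample", "tlp:member", "_:Table1"))
for c in range(1, 4):                      # three columns
    col = f"_:Col{c}"
    triples.add(("_:Table1", "tlp:member", col))
    for r in range(1, 4):                  # three rows
        cell = f"_:Cell{c}{r}"
        triples.add((col, "tlp:member", cell))
        triples.add((cell, "rdfs:label", f"Cell Content {c},{r}"))

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard,
    much like a variable in a SPARQL basic graph pattern."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# All members of column 2 (i.e., the cells of that column):
cells = [o for _, _, o in match(s="_:Col2", p="tlp:member")]

# The label of one cell, reached by following edges of the graph:
label = match(s="_:Cell21", p="rdfs:label")[0][2]
```

The `match` function stands in for a one-pattern SPARQL query: a Linked Data consumer does not need to know the table's original layout, only the graph's vocabulary.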

*TULIP: A Five-Star Table and List - From Machine-Readable to Machine-Understandable…*

TULIP can store data in both column-oriented (columnar) and row-oriented (row-based) form. This allows us to query specific content of all sizes and dimensions in a single query. Moreover, with this model, we can apply the principles of data warehousing and online analytical processing (OLAP) operations, such as rolling up all the data in the same group, drilling down to any layer, slicing to cut only some axes of the multidimensional content, and dicing, i.e., rotating to change the perspective. This means that we can filter and pivot the view of data in TULIP format any way we want.

One of the key concepts of TULIP is the use of an RDF feature called the RDF collection, i.e., the RDF list. We apply it as a one-dimensional array that stores the subscripts of each level of a multidimensional array, by placing all the subscripts as corresponding members of the RDF collection, and then attach these collections to each node of TULIP. Each element of TULIP can then be accessed with a SPARQL query for the RDF collection items that match the corresponding subscripts.

**4.2 Creating five-star open data table and list with TULIP schema**

Now, we will demonstrate how to create RDF triples using the TULIP schema to represent a simple table. We use the following small example table of three columns by three rows:

Cell Content 1,1 | Cell Content 2,1 | Cell Content 3,1
Cell Content 1,2 | Cell Content 2,2 | Cell Content 3,2
Cell Content 1,3 | Cell Content 2,3 | Cell Content 3,3

Because the TULIP schema can represent a table in either column-oriented (column-major) or row-oriented (row-major) form, in this case we represent the table in column-major format. The sample data in each table cell is labeled with the corresponding column number, followed by the row number.

Excerpts of the RDF triples used to represent the three-column by three-row table above using the TULIP schema are shown in **Figure 1**.

```
ex:TableExample
    tlp:member _:Table1 .
_:Table1 rdf:type tlp:Table ;
    tlp:index 1 ;
    tlp:member _:Col1, _:Col2, _:Col3 .
_:Col1 rdf:type tlp:Column ;
    tlp:index 1 ;
    tlp:member _:Cell11, _:Cell12, _:Cell13 .
_:Cell11 rdf:type tlp:Cell ;
    tlp:index 1 ;
    rdfs:label "Cell Content 1,1" .
_:Cell12 rdf:type tlp:Cell ;
    tlp:index 2 ;
    rdfs:label "Cell Content 1,2" .
...
_:Cell33 rdf:type tlp:Cell ;
    tlp:index 3 ;
    rdfs:label "Cell Content 3,3" .
```

**Figure 1.**
*RDF triples of the example table represented by the TULIP schema.*
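The subscript mechanism can be sketched in plain Python (no RDF library; node names follow the `_:CellCR` convention of the example table, and the Python list stands in for the RDF collection attached to each node — an assumption for illustration, not the actual SPARQL machinery):

```python
# Each TULIP node carries a collection of subscripts; looking up an
# element means finding the node whose collection matches the
# requested subscripts, as a SPARQL query over the RDF collection
# items would.

# Column-major 3x3 table: node -> (subscript collection, label).
nodes = {}
for c in range(1, 4):          # column subscript
    for r in range(1, 4):      # row subscript
        nodes[f"_:Cell{c}{r}"] = ([c, r], f"Cell Content {c},{r}")

def element(*subscripts):
    """Return (node, label) for the node whose subscript collection
    matches exactly, or None if no node has those subscripts."""
    wanted = list(subscripts)
    for node, (subs, label) in nodes.items():
        if subs == wanted:
            return node, label
    return None

# Column 3, row 2 of the example table:
node, label = element(3, 2)
```

Because every level of nesting simply appends one more subscript to the collection, the same lookup generalizes to arrays of any dimension without changing the schema.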

*DOI: http://dx.doi.org/10.5772/intechopen.91406*
