**3.2 Research related to the conversion of table and list to other formats**

There are several research works that involve converting tables and lists into other formats.

Yang and Luk [34, 35] discuss a thorough method for converting Web-based tables into key-value pair data and provide solutions to the problem of extracting data from tables in various cases.

The research of Pivk, Cimiano, and Sure [36] proposes a method to convert data from Web-based tables into F-logic (frame logic), a frame representation that can be applied to the Semantic Web.

Table Analysis for Generating Ontologies (TANGO) is the research of Embley [37] and Tijerino et al. [38, 39]. Its goal is to transform table data into an ontology.

Table Extraction by Global Record Alignment (TEGRA) by Chu et al. [40] discusses the challenges of extracting structured data from Web-based tables; in some cases, a "table" that appears on a Webpage is not in HTML table format but in an HTML list or another arrangement.

DeExcelerator, the research of Eberius et al. [41], is a framework for extracting structured data from HTML tables and spreadsheets.

Venetis et al. [42] address the problem of dealing with semantics and ontologies by manually adding classes to the column headers of a table without having to do schema matching. However, the user must have the skill to add this information.

WebTables [43, 44] is a project of Google Research to extract structured data from HTML tables on Webpages. They searched 14.1 billion HTML tables and found that only 154 million tables had sufficient quality to allow extraction of structured data [45]. Most HTML tables on the Web are used to define the layout of a Webpage rather than to present data in an actual table format [46]. WebTables uses a classifier tuned to favor recall over precision in order to gather as many candidate tables from Webpages as possible. It then selects only tables with a single-line header and ignores other, more complicated tables. Later, this project was developed into a system called Octopus [47] to support search engines more efficiently.

At Google, Elmeleegy et al. [48] use WebTables to support a system called ListExtract, which extracts 100,000 lists from the Web and transforms them into relational databases. Wong et al. [49] use 1000 machines to extract 10.1 billion tuples from 1 billion Webpages with parallel algorithms in less than 6 hours.

Fusion Tables [50] is a Google Research project designed to allow users to upload table data to the Web for data analysis with various tools. It is currently available in Google Docs.

The Web Data Commons (WDC) [51] is a project to extract structured data from Common Crawl, the largest publicly available Webpage archive. A part of WDC called Web Table Corpora extracts structured data only from the HTML tables in the Common Crawl Web archive. Web Table Corpora is currently available for download in two sets. The first set is the 2012 Corpus, which extracted 147 million tables from 3.5 billion Webpages in the 2012 Common Crawl. The second set is the 2015 Corpus, which extracted 233 million tables from 1.78 billion Webpages in the July 2015 Common Crawl. The second set contains metadata about the extracted tables, while this information is not preserved in the first set.

The WDC Web Table Corpus has been used in many research works. For example, it is used to measure the performance of schema-matching approaches at various levels of table elements (such as table-to-class, row-to-instance, and attribute-to-property matching), which previously used different datasets, making them difficult to compare [52].

The most similar work to our proposal is WikiTables [53], a tool to extract information from the tables in Wikipedia and discover new hidden facts. The result of this research is a set of 15 million tuples extracted from the Wikipedia tables.

*TULIP: A Five-Star Table and List - From Machine-Readable to Machine-Understandable… DOI: http://dx.doi.org/10.5772/intechopen.91406*

**3.3 Current "standard" representation of table and list**

There are many ways to represent tables and lists in the standard data formats issued by many standard bodies such as:

• International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) standards

• Internet Engineering Task Force (IETF) Request for Comments (RFC)

• World Wide Web Consortium (W3C) Recommendations (REC)

• Internet Assigned Numbers Authority (IANA) Multipurpose Internet Mail Extensions (MIME) media types

We mention only the most common standards and formats that are capable of table and list representation, such as:

• Comma-separated values (CSV), i.e., RFC 4180 [54], and other delimiter-separated values (DSV) such as tab-separated values (TSV)<sup>5</sup>

• Markup languages such as the HTML table, i.e., RFC 1942 [55] (developed from the US DOD SGML Table Model), and the HTML list in HTML 2.0, RFC 1866 [56].

• Lightweight markup languages such as the Wikitext table and list and other markdown languages. After many attempts to standardize various of them, they ended up with RFC 7763 for the original syntax and RFC 7764 for other variants.

• Office spreadsheet and word processor table/list, e.g., OASIS OpenDocument Format (ODF), ISO/IEC 26300, and Microsoft Office Open XML (OOXML), ISO/IEC 29500. ISO/IEC also issues a comparison of both formats and guidelines for translation between them in ISO/IEC TR 29166.

<sup>4</sup> https://www.w3.org/2013/csvw/wiki/Main_Page CSV on the Web Working Group Wiki

<sup>5</sup> https://www.iana.org/assignments/media-types/text/tab-separated-values

**4. TULIP: table/list interchangeable, unified, pivotal vocabulary**

The main idea of TULIP is to transform semi-structured data in the form of tables and lists, regardless of the source, into structured data in the form of five-star open data as a set of RDF triples. Each triple contains only a subject-predicate-object. The triples are connected to other triples and form a directed graph called the RDF graph. That is the principle of Linked Data, allowing Semantic Web
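As a concrete sketch of the kind of table-to-triples transformation discussed here, the fragment below parses a small RFC 4180-style CSV table with a single-line header and decomposes each row into subject-predicate-object triples. The `ex:` prefix, the column names, and the data are all invented for illustration; they are not taken from any of the systems or vocabularies above.

```python
import csv
import io

# Hypothetical CSV table (RFC 4180 style, single-line header) -- the kind
# of "three-star" machine-readable data to be lifted into triples.
CSV_TEXT = (
    "name,capital,population\r\n"
    "Japan,Tokyo,125700000\r\n"
    "France,Paris,68000000\r\n"
)

def csv_to_triples(text, subject_column=0, prefix="ex:"):
    """Decompose each data row into subject-predicate-object triples,
    using the header cells as predicate names (all names hypothetical)."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    triples = []
    for row in data:
        subject = prefix + row[subject_column]
        for predicate, value in zip(header, row):
            triples.append((subject, prefix + predicate, value))
    return triples

for s, p, o in csv_to_triples(CSV_TEXT):
    print(s, p, o)
```

A real pipeline would map headers onto a controlled vocabulary and emit proper URIs rather than raw strings; this sketch only shows the shape of the one-row-to-many-triples decomposition.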
