Tarql: SPARQL for Tables

Tarql is a command-line tool for converting CSV files to RDF using SPARQL 1.1 syntax. It’s written in Java and based on Apache ARQ.

Introduction

In Tarql, the following SPARQL query:

CONSTRUCT { ... }
  FROM <file:table.csv>
  WHERE {
  ...
}

is equivalent to executing the following over an empty graph:

CONSTRUCT { ... }
  WHERE {
  VALUES (...) { ... }
  ...
}

In other words, the CSV file’s contents are input into the query as a table of bindings. This allows manipulation of CSV data using the full power of SPARQL 1.1 syntax, and in particular the generation of RDF using CONSTRUCT queries. See below for more examples.

Command line

For Unix, the executable script is bin/tarql. For Windows, bin\tarql.bat. Example invocations:

Full options:

tarql [options] query.sparql [table.csv [...]]
  Main arguments
      query.sparql           File containing a SPARQL query to be applied to a CSV file
      table.csv              CSV file to be processed; can be omitted if specified in FROM clause
  Options
      --test                 Show CONSTRUCT template and first rows only (for query debugging)
      -d   --delimiter       Delimiting character of the CSV file
      -t   --tabs            Specifies that the input is tab-separagted (TSV), overriding -d
      --quotechar            Quote character used in the CSV file
      -p   --escapechar      Character used to escape quotes in the CSV file
      -e   --encoding        Override CSV file encoding (e.g., utf-8 or latin-1)
      -H   --no-header-row   CSV file has no header row; use variable names ?a, ?b, ...
      --header-row           CSV file's first row is a header with variable names (default)
      --ntriples             Write N-Triples instead of Turtle
  General
      -v   --verbose         Verbose
      -q   --quiet           Run with minimal output
      --debug                Output information for debugging
      --help
      --version              Version information

Details

Input CSV file(s) can be specified using FROM or on the command line. Use FROM <file:filename.csv> to load a file from the current directory.

Variables: By default, Tarql assumes that the CSV file has column names in the first row, and it will use those column names as variable names. If the file has no header row (indicated via -H on the command line, or by appending #header=absent to the CSV file’s URL), then ?a will contain values from the first column, ?b from the second, and so on.

All-empty rows are skipped automatically.

Syntactically, a Tarql mapping is one of the following:

In Tarql mappings with multiple CONSTRUCT queries, the triples generated by previous CONSTRUCT clauses can be queries in subsequent WHERE clauses to retrieve additional data. When using this capability, note that the order of OPTIONAL and BIND is significant in a SPARQL query.

Tarql supports a magic ?ROWNUM variable for accessing the number of the row within the input CSV file. Numbering starts at 1, and skips empty rows.

Header row, delimiters, quotes and character encoding in CSV/TSV files

The following syntax options can be adjusted:

These options can be specified in two ways:

  1. On the command line (--no-header-row, --delimiter, --tabs, --quotechar, --escapechar, --encoding)
  2. By appending a pseudo-fragment to the CSV file’s URL in the FROM clause or on the command line (e.g., file.csv#header=absent, file.csv#delimiter=tab, file.csv#delimiter=%3B, file.csv#quotechar=singlequote;escapechar=backslash, file.csv#encoding=latin1)

Design patterns and examples

List the content of the file, projecting the first 100 rows and only the ?id and ?name bindings

SELECT DISTINCT ?id ?name
  FROM <file:filename.csv>
  WHERE {}
  LIMIT 100

Skip bad rows

SELECT ...
  WHERE { FILTER (BOUND(?d)) }

Compute additional columns

SELECT ...
  WHERE {
    BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
    BIND (STRLANG(?a, 'en') AS ?with_language_tag)
  }

CONSTRUCT an RDF graph

CONSTRUCT {
    ?URI a ex:Organization;
	ex:name ?NameWithLang;
	ex:CIK ?CIK;
	ex:LEI ?LEI;
	ex:ticker ?Stock_ticker;
  }
  FROM <file:companies.csv>
  WHERE {
    BIND (URI(CONCAT('companies/', ?Stock_ticker)) AS ?URI)
    BIND (STRLANG(?Name, "en") AS ?NameWithLang)
  }

Generate URIs based on the row number

...
  WHERE {
    BIND (URI(CONCAT('companies/', STR(?ROWNUM))) AS ?URI)
  }
  ...

Count the number of triples from a csv file

SELECT (COUNT(*) AS ?count)
  FROM <file.csv>
  WHERE {}

Provide CSV file encoding and header information in the URL

CONSTRUCT { ... }
  FROM <file.csv#encoding=utf-8;header=absent>

This is equivalent to using <file.csv> in the FROM clause and specifying --no-header-row and --encoding on the command line.

Legacy convention for header row

Earlier versions of Tarql had --no-header-row/#header=absent as the default, and required the use of a convention to enable the header row:

SELECT ?First_name ?Last_name ?Phone_number
  WHERE { ... }
  OFFSET 1

Here, the OFFSET 1 is a convention that indicates that the first row is to be used to provide variable names, and not as data. This convention is still supported, but will only be recognized if none of the header-specifying command line options or URL fragment arguments are used.

Last Update: January 13, 2015