Tarql: SPARQL for Tables
Tarql is a command-line tool for converting CSV files to RDF using SPARQL 1.1 syntax. It’s written in Java and based on Apache ARQ.
Introduction
In Tarql, the following SPARQL query:
  CONSTRUCT { ... }
  FROM <file:table.csv>
  WHERE {
    ...
  }

is equivalent to executing the following over an empty graph:

  CONSTRUCT { ... }
  WHERE {
    VALUES (...) { ... }
    ...
  }

In other words, the CSV file’s contents are input into the query as a table of bindings. This allows manipulation of CSV data using the full power of SPARQL 1.1 syntax, and in particular the generation of RDF using CONSTRUCT queries. See below for more examples.
Command line
For Unix, the executable script is bin/tarql. For Windows, bin\tarql.bat. Example invocations:
- tarql --help to show full command line help
- tarql my_mapping.sparql input1.csv input2.csv to translate two CSV files using the same mapping
- tarql --no-header-row my_mapping.sparql input1.csv if the CSV file doesn’t have column headers; variables will be called ?a, ?b, etc.
- tarql my_mapping.sparql to translate a CSV file defined in the SPARQL FROM clause
- tarql --test my_mapping.sparql to show only the CONSTRUCT template, variable names, and a few input rows (useful for CONSTRUCT query development)
- tarql --tabs my_mapping.sparql input.tsv to read a tab-separated (TSV) file
Full options:
tarql [options] query.sparql [table.csv [...]]
  Main arguments
      query.sparql           File containing a SPARQL query to be applied to an input file
      table.csv              CSV/TSV file to be processed; can be omitted if specified in FROM clause
  Input options
      --stdin                Read input from STDIN instead of file
      -d   --delimiter       Delimiting character of the input file
      -t   --tabs            Specifies that the input is tab-separated (TSV)
      --quotechar            Quote character used in the input file, or "none"
      -p   --escapechar      Character used to escape quotes in the input file, or "none"
      -e   --encoding        Override input file encoding (e.g., utf-8 or latin-1)
      -H   --no-header-row   Input file has no header row; use variable names ?a, ?b, ...
      --header-row           Input file's first row is a header with variable names (default)
      --base                 Base IRI for resolving relative IRIs
  Output options
      --test                 Show CONSTRUCT template and first rows only (for query debugging)
      --write-base           Write @base if output is Turtle
      --ntriples             Write N-Triples instead of Turtle
      --dedup                Window size in which to remove duplicate triples
  General
      -v   --verbose         Verbose
      -q   --quiet           Run with minimal output
      --debug                Output information for debugging
      --help
      --version              Version information

Details
Input CSV file(s) can be specified using FROM or on the command line. Use FROM <file:filename.csv> to load a file from the current directory.
Variables: By default, Tarql assumes that the CSV file has column names in the first row, and it will use those column names as variable names. If the file has no header row (indicated via -H on the command line, or by appending #header=absent to the CSV file’s URL), then ?a will contain values from the first column, ?b from the second, and so on.
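For example, a minimal mapping sketch over a headerless two-column file might look like this (the file name, namespace, and column roles are hypothetical):

  PREFIX ex: <http://example.com/ns#>
  CONSTRUCT { ?uri ex:name ?b }
  FROM <file:names.csv#header=absent>
  WHERE {
    BIND (URI(CONCAT('http://example.com/', ?a)) AS ?uri)
  }

Here ?a holds values from the first column (used to build the subject IRI) and ?b values from the second.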
All-empty rows are skipped automatically.
Syntactically, a Tarql mapping is one of the following:
- A SPARQL 1.1 SELECT query
- A SPARQL 1.1 ASK query
- One or more consecutive SPARQL 1.1 CONSTRUCT queries, where PREFIX and BASE apply to any subsequent query
In Tarql mappings with multiple CONSTRUCT queries, the triples generated by previous CONSTRUCT clauses can be queried in subsequent WHERE clauses to retrieve additional data. When using this capability, note that the order of OPTIONAL and BIND is significant in a SPARQL query.
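As a sketch of this pattern, the second query below matches triples generated by the first (the file, column, and class names are hypothetical):

  PREFIX ex: <http://example.com/ns#>

  CONSTRUCT { ?company ex:ticker ?Ticker }
  FROM <file:companies.csv>
  WHERE {
    BIND (URI(CONCAT('http://example.com/company/', ?Ticker)) AS ?company)
  }

  CONSTRUCT { ?company a ex:ListedCompany }
  WHERE {
    ?company ex:ticker ?anyTicker
  }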
Tarql supports a magic ?ROWNUM variable that holds the row number within the input CSV file. Numbering starts at 1 and skips empty rows.
Special SPARQL functions: Tarql defines two non-standard SPARQL functions: tarql:expandPrefix(?prefix) expands any prefixes declared in the Tarql query to their namespace form (as a string). tarql:expandPrefixedName(?qname) expands a prefixed name such as dc:title to a full IRI, using any prefixes declared in the Tarql query.
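For instance, if a column ?Property contains prefixed names such as dc:title, a sketch like the following could turn them into predicates (the file and column names are hypothetical, and the tarql: prefix declaration is an assumption — verify the function namespace for your Tarql version):

  PREFIX tarql: <http://tarql.github.io/tarql#>
  PREFIX dc: <http://purl.org/dc/elements/1.1/>
  CONSTRUCT { ?uri ?predicate ?Value }
  FROM <file:props.csv>
  WHERE {
    BIND (tarql:expandPrefixedName(?Property) AS ?predicate)
    BIND (URI(CONCAT('http://example.com/item/', STR(?ROWNUM))) AS ?uri)
  }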
Header row, delimiters, quotes and character encoding in CSV/TSV files
CSV and TSV files vary in details such as the characters used for quoting and escaping, or the character encoding. Information about the input file to guide Tarql’s parsing can be provided in two ways:
- As command line options (e.g., --no-header-row, --tabs, --encoding)
- As a pseudo-fragment appended to the file’s URL in the FROM clause or on the command line (e.g., file.csv#header=absent;delimiter=tab)
The following syntax options can be adjusted:
Header row with column names
Presence or absence of a header row with column names to be used as variable names.
- Default: header present
- Command line options: --no-header-row, --header-row
- URL: header=absent, header=present
Delimiter character
The character used to separate fields in a row; typically comma (default), tab, or semicolon.
- Default: tab for file extension .tsv; comma otherwise
- Command line options: --tabs, --delimiter x
- URL: delimiter=x
- Recognized values: comma, tab, semicolon, or any single character
Quote character
The character used to enclose field values where the delimiter or a line break occurs within the value; typically, double or single quote. Note that the quote character can be escaped as two consecutive quotes.
- Default: double quote (CSV) or none (TSV)
- Command line option: --quotechar x
- URL: quotechar=x
- Recognized values: none, singlequote, doublequote, or any single character
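For example, with default CSV settings, a field value containing the delimiter or a double quote would be written like this (illustrative data):

  name,remark
  "Acme, Inc.","She said ""hello"""

The embedded comma is protected by the surrounding quotes, and the inner double quotes are escaped by doubling them.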
Escape character
The character used to escape quotes occurring within quoted values; typically none or backslash.
- Default: none
- Command line option: --escapechar x
- URL: escapechar=x
- Recognized values: none, backslash, or any single character
Character encoding
The character encoding of the input file, e.g., utf-8 or latin1.
- Default: attempt to auto-detect
- Command line option: --encoding xxx
- URL: encoding=xxx
Design patterns and examples
List the content of the file, projecting the first 100 rows and only the ?id and ?name bindings
  SELECT DISTINCT ?id ?name
  FROM <file:filename.csv>
  WHERE {}
  LIMIT 100

Skip bad rows
  SELECT ...
  WHERE { FILTER (BOUND(?d)) }

Skip bad values for a field
  ...
  WHERE { BIND (IF(?a != 'bad_value', ?a, ?nothing) AS ?a_fixed) }
  ...

Compute additional columns
  SELECT ...
  WHERE {
    BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
    BIND (STRLANG(?a, 'en') AS ?with_language_tag)
  }

CONSTRUCT an RDF graph
  CONSTRUCT {
    ?URI a ex:Organization;
    ex:name ?NameWithLang;
    ex:CIK ?CIK;
    ex:LEI ?LEI;
    ex:ticker ?Stock_ticker;
  }
  FROM <file:companies.csv>
  WHERE {
    BIND (URI(CONCAT('companies/', ?Stock_ticker)) AS ?URI)
    BIND (STRLANG(?Name, "en") AS ?NameWithLang)
  }

Generate URIs based on the row number
  ...
  WHERE {
    BIND (URI(CONCAT('companies/', STR(?ROWNUM))) AS ?URI)
  }
  ...

Count the number of rows in a CSV file
  SELECT (COUNT(*) AS ?count)
  FROM <file.csv>
  WHERE {}

Split a field value into multiple values
To split the values of column ?Colors on spaces into multiple ?Color values, use Jena’s apf:strSplit property function. The apf: prefix is automatically pre-defined.
  ...
  WHERE {
    ...
    ?Color apf:strSplit (?Colors " ")
  }

Provide CSV file encoding and header information in the URL
  CONSTRUCT { ... }
  FROM <file.csv#encoding=utf-8;header=absent>

This is equivalent to using <file.csv> in the FROM clause and specifying --no-header-row and --encoding on the command line.
Legacy convention for header row
Earlier versions of Tarql had --no-header-row/#header=absent as the default, and required the use of a convention to enable the header row:
  SELECT ?First_name ?Last_name ?Phone_number
  WHERE { ... }
  OFFSET 1

Here, the OFFSET 1 is a convention that indicates that the first row is to be used to provide variable names, and not as data. This convention is still supported, but will only be recognized if none of the header-specifying command line options or URL fragment arguments are used.
Last Update: May 03, 2019
