Configuration files: usage and composition

Loading configuration files

When DataPrep.Connector.connect() is called, the system reads from a configuration folder, which describes how to build a connection to a specific API.

You can load the configuration files in the following two ways.

1. Loading existing files from our GitHub repo

We maintain a GitHub repo that contains configuration files for more than 20 websites.

As an example, with the following code, the system downloads the dblp configuration folder from our repo, loads it, and builds the connection. The downloaded folder is stored in the system's temporary directory.

from dataprep.connector import connect
conn = connect("dblp")

connect() provides a parameter called update, which forces a fresh download of the configuration files when set to True.
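A minimal sketch of forcing a refresh, assuming update is passed to connect() as a keyword argument. The helper name connect_with_fresh_config is hypothetical and exists only for illustration; calling it needs dataprep installed and network access.

```python
def connect_with_fresh_config(name):
    """Build a connection after re-downloading the config folder for `name`.

    Hypothetical helper: it only wraps connect() with update=True.
    """
    # Imported lazily so that defining this helper does not require dataprep.
    from dataprep.connector import connect

    # update=True asks connect() to download fresh configuration files
    # instead of reusing the copy already stored in the temporary directory.
    return connect(name, update=True)


# Usage (not executed here, since it downloads files from the repo):
# conn = connect_with_fresh_config("dblp")
```

Forcing a refresh is mainly useful after the configuration files in the repo have changed; otherwise the default call is enough.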

2. Loading from a local directory

You can also load from a local directory that contains your own configuration files. For example, with the following code, connect() loads from the given folder and builds the connection. This assumes that a folder called dblp, containing the configuration files, sits next to your code file.

from dataprep.connector import connect
conn = connect("./dblp")

When the website API that you want to access is not supported, you will need to write your own configuration files. If you want to modify existing configuration files, first download them to your local computer, change them as needed, and then load them from the local directory.

See below for how to create your own configuration folder and files.

Composing the configuration files

What is in a configuration folder?

connect() accesses the configuration for an API through a folder.

A configuration folder contains two parts: a _meta.json file and the configuration files. The following is the content of the _meta.json file in the dblp folder. It lists the tables available for this website. In DataPrep.Connector, we model the data behind each endpoint as a table, just like a table in a DBMS.

{
    "tables": [
        "publication"
    ]
}

The _meta.json file determines which configuration files are expected: one per listed table. The dblp folder is therefore expected to contain a configuration file called publication.json. Configuration files are described in detail in the next subsection.
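The relationship between _meta.json and the per-table files can be checked with a few lines of standard-library Python. This is a sketch, not part of DataPrep: the function missing_table_configs is a hypothetical helper, and the publication.json body written here is a placeholder rather than a working configuration.

```python
import json
import tempfile
from pathlib import Path


def missing_table_configs(folder):
    """Return tables listed in _meta.json that have no <table>.json file."""
    folder = Path(folder)
    meta = json.loads((folder / "_meta.json").read_text(encoding="utf-8"))
    return [t for t in meta["tables"] if not (folder / f"{t}.json").is_file()]


with tempfile.TemporaryDirectory() as tmp:
    # Scaffold a local folder laid out like the dblp folder.
    folder = Path(tmp) / "dblp"
    folder.mkdir()
    (folder / "_meta.json").write_text(
        json.dumps({"tables": ["publication"]}), encoding="utf-8"
    )
    before = missing_table_configs(folder)  # publication.json not written yet

    # Placeholder body only; a real file needs request and response sections.
    (folder / "publication.json").write_text(
        json.dumps({"version": 1}), encoding="utf-8"
    )
    after = missing_table_configs(folder)

print(before, after)
```

Running a check like this before calling connect() on a local folder catches a table that is listed in _meta.json but has no file yet.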

In our repo, the folder also contains test files for the configuration files. They check whether the configuration files can be processed by connect() without errors.

Configuration file

A configuration file is what makes an API's data available through simple function calls. Configuration files are reusable.

Configuration files describe the settings of an API, such as:

  • What is the endpoint of the API?

  • What authorization scheme does the API use? (see authorization scheme section)

  • What pagination scheme does the API use? (see auto-pagination section)

  • What parameters does the query support?

  • What is the schema of the returned results?
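Most of the questions above map onto keys of a configuration file. The sketch below reads a trimmed copy of the dblp publication configuration shown later in this section and collects the answers. The summarize function is a hypothetical helper. Treating a bare true/false parameter value as required/optional is an assumption based on the explicit "required" key used by the templated parameters.

```python
# Trimmed copy of the dblp publication configuration from this section.
config = {
    "request": {
        "url": "https://dblp.org/search/publ/api?format=json",
        "method": "GET",
        "params": {"q": True, "h": False, "f": False},
        "pagination": {"type": "offset", "offsetKey": "f",
                       "limitKey": "h", "maxCount": 1000},
    },
    "response": {
        "tablePath": "$.result.hits.hit[*].info",
        "schema": {
            "title": {"target": "$.title", "type": "string"},
            "year": {"target": "$.year", "type": "string"},
        },
    },
}


def is_required(spec):
    # Assumption: a bare boolean means required (True) or optional (False);
    # an object-valued parameter carries an explicit "required" key.
    return spec if isinstance(spec, bool) else spec.get("required", False)


def summarize(cfg):
    """Collect endpoint, parameters, pagination type, and columns."""
    request, response = cfg["request"], cfg["response"]
    params = request.get("params", {})
    return {
        "endpoint": request["url"],
        "method": request["method"],
        "required_params": sorted(k for k, v in params.items() if is_required(v)),
        "optional_params": sorted(k for k, v in params.items() if not is_required(v)),
        "pagination": request.get("pagination", {}).get("type"),
        "columns": list(response["schema"]),
    }


summary = summarize(config)
print(summary)
```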

A separate tutorial explains how to write a configuration file.

The configuration file of the publication API is shown below.

{
    "version": 1,
    "request": {
        "url": "https://dblp.org/search/publ/api?format=json",
        "method": "GET",
        "params": {
            "q": true,
            "h": false,
            "f": false,
            "author": {
                "template": "author:{{author | replace(\" \", \"_\")}}:",
                "required": false,
                "removeIfEmpty": true,
                "fromKey": "author",
                "toKey": "q"
            },
            "name_parts": {
                "template": "author:{{first_name}}_{{last_name}}:",
                "required": false,
                "removeIfEmpty": true,
                "fromKey": [
                    "first_name",
                    "last_name"
                ],
                "toKey": "q"
            }
        },
        "pagination": {
            "type": "offset",
            "offsetKey": "f",
            "limitKey": "h",
            "maxCount": 1000
        },
        "search": {
            "key": "q"
        }
    },
    "examples": {
        "q": "'lee'"
    },
    "response": {
        "ctype": "application/json",
        "tablePath": "$.result.hits.hit[*].info",
        "schema": {
            "title": {
                "target": "$.title",
                "type": "string"
            },
            "venue": {
                "target": "$.venue",
                "type": "object"
            },
            "publisher": {
                "target": "$.publisher",
                "type": "string"
            },
            "year": {
                "target": "$.year",
                "type": "string"
            },
            "type": {
                "target": "$.type",
                "type": "string"
            },
            "key": {
                "target": "$.key",
                "type": "string"
            },
            "ee": {
                "target": "$.ee",
                "type": "string"
            },
            "url": {
                "target": "$.url",
                "type": "string"
            },
            "authors": {
                "target": "$.authors.author[*].text",
                "type": "object"
            },
            "pages": {
                "target": "$.pages",
                "type": "string"
            },
            "doi": {
                "target": "$.doi",
                "type": "string"
            }
        },
        "orient": "records"
    }
}
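To illustrate how the response section turns a JSON reply into table rows, the sketch below applies tablePath and a few schema targets by hand. It is not DataPrep's implementation: find supports only the small JSONPath subset used in this file (.name and [*]), the rule that "string" columns take the first match while other types keep all matches is an assumption, and the sample reply is made-up placeholder data in the shape the paths imply.

```python
import re


def find(path, node):
    """Evaluate a tiny JSONPath subset: `$`, `.name`, and `[*]`."""
    nodes = [node]
    for name, star in re.findall(r"\.(\w+)(\[\*\])?", path):
        # Step into the named key of every current node.
        nodes = [n[name] for n in nodes if isinstance(n, dict) and name in n]
        if star:
            # `[*]` flattens each list into its items.
            nodes = [item for n in nodes if isinstance(n, list) for item in n]
    return nodes


def extract_table(reply, response_cfg):
    """Build one row per record found at tablePath."""
    rows = []
    for record in find(response_cfg["tablePath"], reply):
        row = {}
        for column, spec in response_cfg["schema"].items():
            matches = find(spec["target"], record)
            if spec["type"] == "string":
                row[column] = matches[0] if matches else None
            else:
                row[column] = matches  # assumption: keep every match
        rows.append(row)
    return rows


# Subset of the response section of the configuration above.
response_cfg = {
    "tablePath": "$.result.hits.hit[*].info",
    "schema": {
        "title": {"target": "$.title", "type": "string"},
        "year": {"target": "$.year", "type": "string"},
        "authors": {"target": "$.authors.author[*].text", "type": "object"},
    },
}

# Placeholder reply, not real dblp data.
sample = {"result": {"hits": {"hit": [
    {"info": {"title": "Paper A", "year": "2020",
              "authors": {"author": [{"text": "Ann Lee"}, {"text": "Bo Chen"}]}}},
    {"info": {"title": "Paper B", "year": "2021",
              "authors": {"author": [{"text": "Cy Diaz"}]}}},
]}}}

rows = extract_table(sample, response_cfg)
print(rows)
```

Each schema target is evaluated relative to one record selected by tablePath, which is why the targets start at $ again instead of repeating the full path.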