Connector is an intuitive, open-source API wrapper that speeds up development by standardizing calls to multiple APIs as a simple workflow.
Connector provides a simple wrapper to collect structured data from different Web APIs (e.g., Twitter, Spotify), making web data collection easy and efficient, without requiring advanced programming skills.
With Connector, you can collect data in two steps: connect to a website and query the data.
We currently support tens of websites: https://github.com/sfu-db/APIConnectors/tree/develop/api-connectors
You can also author your own configuration files to support new websites. Contributions are welcome, since a shared configuration file makes that website available to other users as well.
DBLP (https://dblp.org/) is a computer science bibliography website. We will use it as an example to illustrate how to collect data easily using DataPrep.Connector.
API: https://dblp.org/faq/13501473.html
More examples are available here: https://github.com/sfu-db/dataprep/tree/develop/examples.
If you have not installed DataPrep yet, you can do so with the single command below.
!pip install dataprep
Once the library is installed, you can connect to a website that we support, or load a local configuration file to set up the connection. The detailed usage and parameters of connect() are covered in the next section. Here, we connect to the DBLP API through the existing configuration file available here: https://github.com/sfu-db/DataConnectorConfigs/tree/develop/api-connectors/dblp
[1]:
from dataprep.connector import connect
conn = connect("dblp")
The info() function shows what is available from the website. Here, the output shows that there is one table, called “publication”. To fetch data from it, we have to specify a value for the “q” (query keyword) parameter. The schema block shows the schema of the results.
[2]:
conn.info()
Parameters: q (required), h, f
Example:
dc = connect('dblp')
df = await dc.query('publication', q='lee', _count=20)
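The parameters q, h, and f reported by info() appear to mirror the query parameters of DBLP's public search API (see the API link above), where q is the search keyword, h is the maximum number of hits, and f is the index of the first hit. As a minimal sketch of the kind of request Connector wraps, the snippet below only builds such a URL and sends nothing over the network. The endpoint URL, the format=json parameter, and the helper name build_dblp_url are assumptions made for illustration; they are not taken from the Connector configuration file.

```python
from urllib.parse import urlencode

# Assumed endpoint, based on DBLP's public search API documentation.
DBLP_SEARCH_URL = "https://dblp.org/search/publ/api"


def build_dblp_url(q, h=30, f=0):
    """Hypothetical helper: build a search URL for keyword `q`,
    returning at most `h` hits starting at hit index `f`."""
    params = {"q": q, "h": h, "f": f, "format": "json"}
    # urlencode escapes the values (a space becomes "+") and keeps dict order.
    return f"{DBLP_SEARCH_URL}?{urlencode(params)}"


print(build_dblp_url("lee", h=20))
# prints https://dblp.org/search/publ/api?q=lee&h=20&f=0&format=json
```

With Connector, none of this URL handling is written by hand: the configuration file describes the endpoint and its parameters, and query() takes care of the request.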
Once you know how to use the API and the connection is built, you can issue a query through the query() function. The first parameter specifies which API endpoint (table) you want to query. Detailed parameter explanations can be found in later sections. In this example, we request up to 2000 CVPR 2020 papers by setting _count=2000.
[3]:
await conn.query("publication", q="CVPR 2020", _count=2000)
1997 rows × 11 columns
And you have the data ready. It is so simple :)
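The cells above use await at the top level, which works in a notebook but not in a plain Python script. A minimal sketch of the same query run from a script is shown below, assuming query() returns an awaitable as the cells above suggest. The names fetch_publications and main are hypothetical and introduced only for this example.

```python
import asyncio


async def fetch_publications(conn, keyword, count):
    """Hypothetical helper: await the connector query and return its result
    (a DataFrame in the notebook cells above)."""
    return await conn.query("publication", q=keyword, _count=count)


def main():
    # Imported here so the helper above can be reused without dataprep installed.
    from dataprep.connector import connect

    conn = connect("dblp")
    # asyncio.run starts an event loop, runs the coroutine, and closes the loop.
    df = asyncio.run(fetch_publications(conn, "CVPR 2020", 2000))
    print(df.shape)
```

Calling main() from a script would run the same query as cell [3]; the number of rows returned depends on what DBLP reports at the time of the request.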