Process overview via DBLP

What is DataPrep.Connector?

Connector is an intuitive, open-source API wrapper that speeds up development by standardizing calls to multiple APIs as a simple workflow.

Connector provides a simple wrapper to collect structured data from different Web APIs (e.g., Twitter, Spotify), making web data collection easy and efficient, without requiring advanced programming skills.

With Connector, you can collect data in two steps: connect to a website and query the data.

We currently support tens of websites: https://github.com/sfu-db/APIConnectors/tree/develop/api-connectors

You can also author your own configuration files to support new websites. Contributions are welcome, since a new configuration file benefits other users as well.

Collecting data from DBLP

DBLP Website

DBLP (https://dblp.org/) is a computer science bibliography website. We will use it as an example to illustrate how to collect data easily using DataPrep.Connector.

API: https://dblp.org/faq/13501473.html

More examples are available here: https://github.com/sfu-db/dataprep/tree/develop/examples.

Step 1: Installation

If you have not installed DataPrep yet, you can do so with the single command below.

!pip install dataprep

Step 2: Connecting to the API

Once the library is installed, you can connect to a website that we support, or load a local configuration file to set up the connection. The detailed usage and parameters of connect() are described in the next section. Here, we connect to the DBLP API through the existing configuration file available here: https://github.com/sfu-db/DataConnectorConfigs/tree/develop/api-connectors/dblp

[1]:
from dataprep.connector import connect
conn = connect("dblp")

Step 3: Understand how to use the API

The info() function helps you understand what is available from the website. Here, the output shows that one table, called “publication”, is available. To fetch data from it, we have to specify the value of the “q” (query keyword) parameter. The schema block shows the schema of the results.

[2]:
conn.info()
DataPrep.Connector Info

Parameters

q (required)

h

f

Example

dc = connect('dblp')
df = await dc.query('publication', q='lee', _count=20)

Schema

column_name data_type
0 title string
1 venue object
2 publisher string
3 year string
4 type string
5 key string
6 ee string
7 url string
8 authors object
9 pages string
10 doi string
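
The three parameters listed by info() appear to correspond to query-string parameters of the DBLP search API described in the FAQ linked earlier: q is the query, h caps the number of hits returned, and f is the index of the first hit (useful for pagination). As a rough illustration of what they mean, the sketch below builds such a request URL with the standard library only. The helper name build_search_url and its default values are hypothetical, and this is not a description of how Connector issues requests internally.

```python
from urllib.parse import urlencode

# Public DBLP publication search endpoint.
DBLP_SEARCH_URL = "https://dblp.org/search/publ/api"


def build_search_url(q, h=30, f=0):
    """Return the request URL for one page of DBLP search results.

    q: query keywords
    h: maximum number of hits to return in this response
    f: 0-based index of the first hit, used together with h for pagination
    """
    params = {"q": q, "h": h, "f": f, "format": "json"}
    # urlencode escapes spaces and other special characters in the values.
    return f"{DBLP_SEARCH_URL}?{urlencode(params)}"


print(build_search_url("lee", h=20))
# https://dblp.org/search/publ/api?q=lee&h=20&f=0&format=json
```

With Connector you do not need to build this URL yourself: q, h and f are passed as keyword arguments to query().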

Step 4: Customize the query to collect data

Once the connection is built and you know how to use the API, you can issue a query through the query() function. The first parameter specifies which table (API endpoint) you want to query. The detailed parameter explanations can be found in later sections. In this example, we request up to 2000 publications matching the keywords “CVPR 2020”.

[3]:
await conn.query("publication", q="CVPR 2020", _count=2000)
[3]:
title venue publisher year type key ee url authors pages doi
0 2020 IEEE/CVF Conference on Computer Vision an... [CVPR] Computer Vision Foundation / IEEE 2020 Editorship conf/cvpr/2020 https://openaccess.thecvf.com/CVPR2020 https://dblp.org/rec/conf/cvpr/2020 None None None
1 2020 IEEE/CVF Conference on Computer Vision an... [CVPR Workshops] Computer Vision Foundation / IEEE 2020 Editorship conf/cvpr/2020w https://openaccess.thecvf.com/CVPR2020_workshops https://dblp.org/rec/conf/cvpr/2020w None None None
2 NTIRE 2020 Challenge on Real Image Denoising -... [CVPR Workshops] None 2020 Conference and Workshop Papers conf/cvpr/AbdelhamedATBCZ20 https://openaccess.thecvf.com/content_CVPRW_20... https://dblp.org/rec/conf/cvpr/AbdelhamedATBCZ20 [Abdelrahman Abdelhamed, Mahmoud Afifi, Radu T... 2077-2088 10.1109/CVPRW50498.2020.00256
3 NTIRE 2020 Challenge on NonHomogeneous Dehazing. [CVPR Workshops] None 2020 Conference and Workshop Papers conf/cvpr/AncutiAVTLWXQMH20 https://openaccess.thecvf.com/content_CVPRW_20... https://dblp.org/rec/conf/cvpr/AncutiAVTLWXQMH20 [Codruta O. Ancuti, Cosmin Ancuti, Florin-Alex... 2029-2044 10.1109/CVPRW50498.2020.00253
4 NTIRE 2020 Challenge on Spectral Reconstructio... [CVPR Workshops] None 2020 Conference and Workshop Papers conf/cvpr/AradTBLFGLW0LLL20 https://openaccess.thecvf.com/content_CVPRW_20... https://dblp.org/rec/conf/cvpr/AradTBLFGLW0LLL20 [Boaz Arad, Radu Timofte, Ohad Ben-Shahar, Yi-... 1806-1822 10.1109/CVPRW50498.2020.00231
... ... ... ... ... ... ... ... ... ... ... ...
1992 CVPR 2019 WAD Challenge on Trajectory Predicti... [CoRR] None 2020 Informal Publications journals/corr/abs-2004-05966 https://arxiv.org/abs/2004.05966 https://dblp.org/rec/journals/corr/abs-2004-05966 [Sibo Zhang, Yuexin Ma, Ruigang Yang, Xin Li, ... None None
1993 Priming Neural Networks. [CVPR Workshops] None 2018 Conference and Workshop Papers conf/cvpr/RosenfeldBT18 http://openaccess.thecvf.com/content_cvpr_2018... https://dblp.org/rec/conf/cvpr/RosenfeldBT18 [Amir Rosenfeld, Mahdi Biparva, John K. Tsotsos] 2011-2020 10.1109/CVPRW.2018.00270
1994 Superpixel meshes for fast edge-preserving sur... [CVPR] None 2015 Conference and Workshop Papers conf/cvpr/Bodis-SzomoruRG15 https://doi.org/10.1109/CVPR.2015.7298812 https://dblp.org/rec/conf/cvpr/Bodis-SzomoruRG15 [András Bódis-Szomorú, Hayko Riemenschneider, ... 2011-2020 10.1109/CVPR.2015.7298812
1995 Multiphase geometric couplings for the segment... [CVPR] None 2009 Conference and Workshop Papers conf/cvpr/ReinaMP09 https://doi.org/10.1109/CVPR.2009.5206524 https://dblp.org/rec/conf/cvpr/ReinaMP09 [Amelio Vázquez Reina, Eric L. Miller 0001, Ha... 2020-2027 10.1109/CVPR.2009.5206524
1996 Integrating Shape from Shading and Range Data ... [CVPR] None 1999 Conference and Workshop Papers conf/cvpr/MostafaYF99 https://doi.org/10.1109/CVPR.1999.784602 https://dblp.org/rec/conf/cvpr/MostafaYF99 [Mostafa Gadal-Haqq M. Mostafa, Sameh M. Yaman... 2015-2020 10.1109/CVPR.1999.784602

1997 rows × 11 columns
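
The search is keyword based, so the output above also contains proceedings entries (type “Editorship”), arXiv preprints, and papers from other years. If you only want papers published at CVPR 2020 or its workshops, a small post-filter is enough. The sketch below uses plain Python on a few rows copied from the output above. It assumes that venue holds a list of strings, as the “[CVPR]” display suggests; the function name is hypothetical. For a real result, df.to_dict("records") turns the DataFrame into a list of dicts of this shape.

```python
# A few rows copied from the output above, reduced to the columns used here.
# Assumption: "venue" is a list of strings (it may also be None).
rows = [
    {"key": "conf/cvpr/2020", "venue": ["CVPR"],
     "year": "2020", "type": "Editorship"},
    {"key": "conf/cvpr/AbdelhamedATBCZ20", "venue": ["CVPR Workshops"],
     "year": "2020", "type": "Conference and Workshop Papers"},
    {"key": "journals/corr/abs-2004-05966", "venue": ["CoRR"],
     "year": "2020", "type": "Informal Publications"},
    {"key": "conf/cvpr/Bodis-SzomoruRG15", "venue": ["CVPR"],
     "year": "2015", "type": "Conference and Workshop Papers"},
]


def is_cvpr_2020_paper(row):
    """Keep 2020 papers whose venue is CVPR or CVPR Workshops."""
    in_cvpr = any(v.startswith("CVPR") for v in (row["venue"] or []))
    return (
        in_cvpr
        and row["year"] == "2020"
        and row["type"] == "Conference and Workshop Papers"
    )


papers = [row for row in rows if is_cvpr_2020_paper(row)]
print([row["key"] for row in papers])
# ['conf/cvpr/AbdelhamedATBCZ20']
```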

And you have the data ready. It is so simple :)
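
The cells above use a top-level await, which works in Jupyter and IPython but not in a plain Python script. A minimal sketch for scripts follows, assuming conn.query is a coroutine as the await in cell [3] suggests. The helper names fetch and fetch_blocking are hypothetical, and the wrapper has not been tested against DataPrep here.

```python
import asyncio


async def fetch(conn, table, **params):
    """Await the asynchronous query and return its result."""
    return await conn.query(table, **params)


def fetch_blocking(conn, table, **params):
    """Run the query to completion from synchronous code.

    Use this in a plain script only: inside a notebook an event loop is
    already running, so asyncio.run() would raise a RuntimeError there.
    """
    return asyncio.run(fetch(conn, table, **params))


# Intended use in a script (requires dataprep and network access):
#   from dataprep.connector import connect
#   df = fetch_blocking(connect("dblp"), "publication", q="CVPR 2020", _count=2000)
```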