Process overview via DBLP

What is DataPrep.Connector?

Connector is an intuitive, open-source API wrapper that speeds up development by standardizing calls to multiple APIs as a simple workflow.

Connector provides a simple wrapper to collect structured data from different Web APIs (e.g., Twitter, Spotify), making web data collection easy and efficient, without requiring advanced programming skills.

With Connector, you can collect data in two steps: connect to a website and query the data.

We currently support tens of websites.

You can also author your own configuration files to support new websites. We welcome such contributions, since they make Connector more useful for other users as well.

Collecting data from DBLP

DBLP Website

DBLP is a computer science bibliography website. We will use it as an example to illustrate how to collect data easily using DataPrep.Connector.


And more examples are available here:

Step 1: Installation

If you have not installed DataPrep yet, you can install it with the single command below.

!pip install dataprep
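
The leading "!" in the command above is notebook syntax; in a terminal, the same command is run without it. If you are unsure whether the package is already present, a minimal check can be sketched with the standard library alone. The helper name is_installed is hypothetical and used only for illustration:

```python
import importlib.util


def is_installed(package_name):
    # find_spec returns None when a top-level package cannot be found.
    return importlib.util.find_spec(package_name) is not None


# "dataprep" is the package name used in the pip command above.
if is_installed("dataprep"):
    print("dataprep is already installed")
else:
    print("dataprep is missing; run: pip install dataprep")
```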

Step 2: Connecting to the API

Once the library is installed, you can connect to a website that we support, or load a local configuration file to connect to a new one. The detailed usage and parameters of connect() can be found in the next section. Here, we connect to the DBLP API through the existing configuration file available here:

from dataprep.connector import connect
conn = connect("dblp")

Step 3: Understand how to use the API

The info() function, called as conn.info(), shows what is available from the website. Here, the output shows that one table, “publication”, is available. To fetch data from it, we must specify the value of the “q” (query keyword) parameter. The schema block lists the columns of the result and their data types.

DataPrep.Connector Info

Table: publication

Parameters: q (required)

Example:

dc = connect('dblp')
df = await dc.query('publication', q='lee', _count=20)

Schema:

column_name data_type
0 title string
1 venue object
2 publisher string
3 year string
4 type string
5 key string
6 ee string
7 url string
8 authors object
9 pages string
10 doi string
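
The example in the info output uses a top-level await, which works in a Jupyter notebook but not in a plain Python script. The following is a minimal sketch of how the same call could be wrapped with asyncio.run for use in a script. FakeConnector, fetch_publications and fetch_publications_sync are hypothetical names introduced here so that the sketch runs without network access; they are not part of DataPrep.

```python
import asyncio


class FakeConnector:
    """Hypothetical stand-in for the object returned by connect("dblp").

    It only mimics the call shape conn.query(table, **params) so that the
    sketch runs offline; it is not part of DataPrep.
    """

    async def query(self, table, **params):
        await asyncio.sleep(0)  # pretend to wait for the web API
        return {"table": table, "params": params}


async def fetch_publications(conn, keyword, count):
    # Same call shape as in the tutorial: table name, then q and _count.
    return await conn.query("publication", q=keyword, _count=count)


def fetch_publications_sync(conn, keyword, count):
    # asyncio.run starts an event loop, runs the coroutine, then closes the loop.
    return asyncio.run(fetch_publications(conn, keyword, count))


result = fetch_publications_sync(FakeConnector(), "lee", 20)
print(result)
```

With the real library, the assumption is that FakeConnector() would be replaced by connect("dblp"), in which case the function would return the query result instead of the placeholder dictionary.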

Step 4: Customize the query to collect data

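Once the connection is built and you know how to use the API, you can issue a query with the query() function. The first parameter specifies which table (API endpoint) to query; the remaining keyword arguments are the query parameters. A detailed explanation of the parameters can be found in later sections. In this example, we request up to 2000 publications matching the keyword “CVPR 2020”. Because q is a keyword search, the result (1997 rows here) also includes some publications from other years and venues that match the keyword.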

await conn.query("publication", q="CVPR 2020", _count=2000)
title venue publisher year type key ee url authors pages doi
0 2020 IEEE/CVF Conference on Computer Vision an... [CVPR] Computer Vision Foundation / IEEE 2020 Editorship conf/cvpr/2020 None None None
1 2020 IEEE/CVF Conference on Computer Vision an... [CVPR Workshops] Computer Vision Foundation / IEEE 2020 Editorship conf/cvpr/2020w None None None
2 NTIRE 2020 Challenge on Real Image Denoising -... [CVPR Workshops] None 2020 Conference and Workshop Papers conf/cvpr/AbdelhamedATBCZ20 [Abdelrahman Abdelhamed, Mahmoud Afifi, Radu T... 2077-2088 10.1109/CVPRW50498.2020.00256
3 NTIRE 2020 Challenge on NonHomogeneous Dehazing. [CVPR Workshops] None 2020 Conference and Workshop Papers conf/cvpr/AncutiAVTLWXQMH20 [Codruta O. Ancuti, Cosmin Ancuti, Florin-Alex... 2029-2044 10.1109/CVPRW50498.2020.00253
4 NTIRE 2020 Challenge on Spectral Reconstructio... [CVPR Workshops] None 2020 Conference and Workshop Papers conf/cvpr/AradTBLFGLW0LLL20 [Boaz Arad, Radu Timofte, Ohad Ben-Shahar, Yi-... 1806-1822 10.1109/CVPRW50498.2020.00231
... ... ... ... ... ... ... ... ... ... ... ...
1992 CVPR 2019 WAD Challenge on Trajectory Predicti... [CoRR] None 2020 Informal Publications journals/corr/abs-2004-05966 [Sibo Zhang, Yuexin Ma, Ruigang Yang, Xin Li, ... None None
1993 Priming Neural Networks. [CVPR Workshops] None 2018 Conference and Workshop Papers conf/cvpr/RosenfeldBT18 [Amir Rosenfeld, Mahdi Biparva, John K. Tsotsos] 2011-2020 10.1109/CVPRW.2018.00270
1994 Superpixel meshes for fast edge-preserving sur... [CVPR] None 2015 Conference and Workshop Papers conf/cvpr/Bodis-SzomoruRG15 [András Bódis-Szomorú, Hayko Riemenschneider, ... 2011-2020 10.1109/CVPR.2015.7298812
1995 Multiphase geometric couplings for the segment... [CVPR] None 2009 Conference and Workshop Papers conf/cvpr/ReinaMP09 [Amelio Vázquez Reina, Eric L. Miller 0001, Ha... 2020-2027 10.1109/CVPR.2009.5206524
1996 Integrating Shape from Shading and Range Data ... [CVPR] None 1999 Conference and Workshop Papers conf/cvpr/MostafaYF99 [Mostafa Gadal-Haqq M. Mostafa, Sameh M. Yaman... 2015-2020 10.1109/CVPR.1999.784602

1997 rows × 11 columns
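
As the output above shows, the keyword search also returns entries such as a 2018 workshop paper and a CoRR preprint. A minimal sketch of narrowing the result down to CVPR 2020 papers follows. It works on plain dictionaries, using a few rows copied from the output above (titles are truncated exactly as displayed), and it assumes that each venue cell is a list of strings and that year is a string, as the schema suggests. The helper name is_cvpr_2020_paper is hypothetical.

```python
# A few rows copied from the output above, keeping only the columns we need.
records = [
    {"title": "2020 IEEE/CVF Conference on Computer Vision an...",
     "venue": ["CVPR"], "year": "2020", "type": "Editorship"},
    {"title": "NTIRE 2020 Challenge on NonHomogeneous Dehazing.",
     "venue": ["CVPR Workshops"], "year": "2020",
     "type": "Conference and Workshop Papers"},
    {"title": "CVPR 2019 WAD Challenge on Trajectory Predicti...",
     "venue": ["CoRR"], "year": "2020", "type": "Informal Publications"},
    {"title": "Priming Neural Networks.",
     "venue": ["CVPR Workshops"], "year": "2018",
     "type": "Conference and Workshop Papers"},
    {"title": "Multiphase geometric couplings for the segment...",
     "venue": ["CVPR"], "year": "2009",
     "type": "Conference and Workshop Papers"},
]


def is_cvpr_2020_paper(record, venues=("CVPR", "CVPR Workshops")):
    # Assumption: "venue" holds a list of venue names, "year" a string.
    in_venue = any(v in venues for v in record["venue"])
    is_paper = record["type"] == "Conference and Workshop Papers"
    return in_venue and is_paper and record["year"] == "2020"


cvpr_2020 = [r for r in records if is_cvpr_2020_paper(r)]
for r in cvpr_2020:
    print(r["title"])
```

If the query result is a pandas DataFrame df, the same filter could be applied to df.to_dict("records").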

And you have the data ready. It is so simple :)