Process overview via DBLP¶

What is DataPrep.Connector?¶

Connector is an intuitive, open-source API wrapper that speeds up development by standardizing calls to multiple APIs as a simple workflow.

Connector provides a simple wrapper to collect structured data from different Web APIs (e.g., Twitter, Spotify), making web data collection easy and efficient, without requiring advanced programming skills.

With Connector, you can collect data in two steps: connect to a website and query the data.

We currently support tens of websites: https://github.com/sfu-db/APIConnectors/tree/develop/api-connectors

You can also author your own configuration files to support new websites. Contributions are welcome, since every new configuration file benefits other users as well.

Collecting data from DBLP¶

DBLP Website¶

DBLP (https://dblp.org/) is a computer science bibliography website. We will use it as an example to illustrate how to collect data easily using DataPrep.Connector.

API: https://dblp.org/faq/13501473.html

More examples are available here: https://github.com/sfu-db/dataprep/tree/develop/examples.

Step 1: Installation¶

If you have not installed DataPrep yet, you can do so with the single command below.

!pip install dataprep

Step 2: Connecting to the API¶

Once the library is installed, you can connect to a website that we support, or load a local configuration file to connect. The detailed usage and parameters of connect() are covered in the next section. Here, we connect to the DBLP API through the existing configuration file available here: https://github.com/sfu-db/DataConnectorConfigs/tree/develop/api-connectors/dblp

[1]:

from dataprep.connector import connect
conn = connect("dblp")

Step 3: Understand how to use the API¶

The info() function helps you understand what is available from the website. Here, the output shows that one table, “publication”, is available. To fetch data, we must specify the value of the “q” (query keyword) parameter; “h” and “f” are optional. The Schema block shows the columns and data types of the results.

[2]:

conn.info()

DataPrep.Connector Info

publication

Parameters

q (required)

h

f

Example

dc = connect('dblp')
df = await dc.query('publication', q='lee', _count=20)

Schema

	column_name	data_type
0	title	string
1	venue	object
2	publisher	string
3	year	string
4	type	string
5	key	string
6	ee	string
7	url	string
8	authors	object
9	pages	string
10	doi	string
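The parameter names q, h, and f match the query parameters documented in the DBLP API FAQ linked above (query string, maximum number of hits, and index of the first hit). The sketch below shows what a raw request URL with those parameters could look like. It assumes that the connector forwards the parameters unchanged, which this page does not state; build_dblp_url and its default values are hypothetical names chosen for illustration and are not part of DataPrep.

```python
from urllib.parse import urlencode

# Public DBLP publication search endpoint (see the API FAQ linked above).
DBLP_SEARCH_URL = "https://dblp.org/search/publ/api"


def build_dblp_url(q, h=30, f=0):
    """Build a search URL for keyword(s) q, at most h hits, starting at hit f.

    Hypothetical helper: the defaults are illustrative, and the assumption
    that Connector sends exactly these parameters is not confirmed here.
    """
    params = {"q": q, "h": h, "f": f, "format": "json"}
    return f"{DBLP_SEARCH_URL}?{urlencode(params)}"


print(build_dblp_url("lee", h=20))
# https://dblp.org/search/publ/api?q=lee&h=20&f=0&format=json
```

Connector hides this URL construction behind conn.query(), so the helper is only useful for seeing which request a given set of parameters corresponds to.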

Step 4: Customize the query to collect data¶

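Once the connection is established and you know how to use the API, you can issue a query through the query() function. The first argument specifies which table (API endpoint) to query, and the remaining keyword arguments are the parameters listed by info(). The detailed parameter explanation can be found in later sections. In this example, we request up to 2000 records matching the keywords “CVPR 2020”. Because q is a keyword search, the result can contain fewer rows than requested (1997 here) and can include records from other venues and years whose metadata matches the keywords, for example a page range such as 2011-2020.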

[3]:

await conn.query("publication", q="CVPR 2020", _count=2000)

[3]:

	title	venue	publisher	year	type	key	ee	url	authors	pages	doi
0	2020 IEEE/CVF Conference on Computer Vision an...	[CVPR]	Computer Vision Foundation / IEEE	2020	Editorship	conf/cvpr/2020	https://openaccess.thecvf.com/CVPR2020	https://dblp.org/rec/conf/cvpr/2020	None	None	None
1	2020 IEEE/CVF Conference on Computer Vision an...	[CVPR Workshops]	Computer Vision Foundation / IEEE	2020	Editorship	conf/cvpr/2020w	https://openaccess.thecvf.com/CVPR2020_workshops	https://dblp.org/rec/conf/cvpr/2020w	None	None	None
2	NTIRE 2020 Challenge on Real Image Denoising -...	[CVPR Workshops]	None	2020	Conference and Workshop Papers	conf/cvpr/AbdelhamedATBCZ20	https://openaccess.thecvf.com/content_CVPRW_20...	https://dblp.org/rec/conf/cvpr/AbdelhamedATBCZ20	[Abdelrahman Abdelhamed, Mahmoud Afifi, Radu T...	2077-2088	10.1109/CVPRW50498.2020.00256
3	NTIRE 2020 Challenge on NonHomogeneous Dehazing.	[CVPR Workshops]	None	2020	Conference and Workshop Papers	conf/cvpr/AncutiAVTLWXQMH20	https://openaccess.thecvf.com/content_CVPRW_20...	https://dblp.org/rec/conf/cvpr/AncutiAVTLWXQMH20	[Codruta O. Ancuti, Cosmin Ancuti, Florin-Alex...	2029-2044	10.1109/CVPRW50498.2020.00253
4	NTIRE 2020 Challenge on Spectral Reconstructio...	[CVPR Workshops]	None	2020	Conference and Workshop Papers	conf/cvpr/AradTBLFGLW0LLL20	https://openaccess.thecvf.com/content_CVPRW_20...	https://dblp.org/rec/conf/cvpr/AradTBLFGLW0LLL20	[Boaz Arad, Radu Timofte, Ohad Ben-Shahar, Yi-...	1806-1822	10.1109/CVPRW50498.2020.00231
...	...	...	...	...	...	...	...	...	...	...	...
1992	CVPR 2019 WAD Challenge on Trajectory Predicti...	[CoRR]	None	2020	Informal Publications	journals/corr/abs-2004-05966	https://arxiv.org/abs/2004.05966	https://dblp.org/rec/journals/corr/abs-2004-05966	[Sibo Zhang, Yuexin Ma, Ruigang Yang, Xin Li, ...	None	None
1993	Priming Neural Networks.	[CVPR Workshops]	None	2018	Conference and Workshop Papers	conf/cvpr/RosenfeldBT18	http://openaccess.thecvf.com/content_cvpr_2018...	https://dblp.org/rec/conf/cvpr/RosenfeldBT18	[Amir Rosenfeld, Mahdi Biparva, John K. Tsotsos]	2011-2020	10.1109/CVPRW.2018.00270
1994	Superpixel meshes for fast edge-preserving sur...	[CVPR]	None	2015	Conference and Workshop Papers	conf/cvpr/Bodis-SzomoruRG15	https://doi.org/10.1109/CVPR.2015.7298812	https://dblp.org/rec/conf/cvpr/Bodis-SzomoruRG15	[András Bódis-Szomorú, Hayko Riemenschneider, ...	2011-2020	10.1109/CVPR.2015.7298812
1995	Multiphase geometric couplings for the segment...	[CVPR]	None	2009	Conference and Workshop Papers	conf/cvpr/ReinaMP09	https://doi.org/10.1109/CVPR.2009.5206524	https://dblp.org/rec/conf/cvpr/ReinaMP09	[Amelio Vázquez Reina, Eric L. Miller 0001, Ha...	2020-2027	10.1109/CVPR.2009.5206524
1996	Integrating Shape from Shading and Range Data ...	[CVPR]	None	1999	Conference and Workshop Papers	conf/cvpr/MostafaYF99	https://doi.org/10.1109/CVPR.1999.784602	https://dblp.org/rec/conf/cvpr/MostafaYF99	[Mostafa Gadal-Haqq M. Mostafa, Sameh M. Yaman...	2015-2020	10.1109/CVPR.1999.784602

1997 rows × 11 columns
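Since the keyword search also returns records from other years and venues, a post-filter may be useful. The following is a minimal sketch in plain Python, assuming each row is available as a dict in which year is a string (as the schema shows) and venue is a list of strings (as the output suggests). The helper name is_cvpr_2020 is hypothetical, and the sample records are abridged from the output above.

```python
def is_cvpr_2020(record):
    """Return True if the record is from 2020 and lists a CVPR venue.

    Assumes 'year' is a string and 'venue' is a list of strings or None.
    """
    venues = record.get("venue") or []
    return record.get("year") == "2020" and any(v.startswith("CVPR") for v in venues)


# Three records abridged from the output above (only the fields used here).
records = [
    {"key": "conf/cvpr/AbdelhamedATBCZ20", "venue": ["CVPR Workshops"], "year": "2020"},
    {"key": "journals/corr/abs-2004-05966", "venue": ["CoRR"], "year": "2020"},
    {"key": "conf/cvpr/RosenfeldBT18", "venue": ["CVPR Workshops"], "year": "2018"},
]

kept = [r["key"] for r in records if is_cvpr_2020(r)]
print(kept)  # ['conf/cvpr/AbdelhamedATBCZ20']
```

The prefix test keeps “CVPR Workshops” entries as well; compare against "CVPR" exactly if only main-conference papers are wanted. If the query result is a pandas DataFrame df, df.to_dict("records") produces dicts of this shape.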

And you have the data ready. It is so simple :)
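The cells above use top-level await, which works in Jupyter and IPython but not in a plain Python script. One way to run the same query from a script is to wrap the coroutine in asyncio.run, as sketched below. run_query and FakeConnection are hypothetical names introduced for this illustration; FakeConnection only stands in for the object returned by connect("dblp") so that the sketch runs without network access.

```python
import asyncio


def run_query(conn, table, **params):
    """Run conn.query(...) to completion from synchronous code."""
    return asyncio.run(conn.query(table, **params))


class FakeConnection:
    """Hypothetical stand-in for a real connection; echoes its arguments."""

    async def query(self, table, **params):
        return {"table": table, "params": params}


result = run_query(FakeConnection(), "publication", q="CVPR 2020", _count=2000)
print(result["table"])  # publication
```

In a real script you would pass the connection created by connect("dblp") instead of FakeConnection(). Note that asyncio.run raises an error when an event loop is already running, so inside a notebook the plain await form shown earlier remains the right choice.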