read_sql()
The read_sql function issues a SQL query to a database and returns the result as a Python DataFrame. In this section, we show how to use this feature.
Connector wraps the read_sql function of connectorx. To enable read_sql, first install connectorx by running:
pip install connectorx
read_sql(conn: str, query: Union[List[str], str], *, return_type: str = "pandas", protocol: str = "binary", partition_on: Optional[str] = None, partition_range: Optional[Tuple[int, int]] = None, partition_num: Optional[int] = None)
Run the SQL query and download the data from the database into a dataframe (a Pandas DataFrame by default).
conn: str: Connection string URI.
Supported URI scheme: (postgres|postgresql|mysql|mssql)://username:password@addr:port/dbname.
For now, SQLite only supports absolute paths, for example: sqlite:///home/user/path/test.db (see the SQLite example after this parameter list).
query: Union[str, List[str]]: SQL query or list of SQL queries for fetching data.
return_type: str = "pandas": The return type of this function. It can be arrow, pandas, modin, dask or polars (see the example after this parameter list).
protocol: str = "binary": The protocol used to fetch data from the source; the default is binary. See the connectorx documentation for details on the available protocols.
partition_on: Optional[str]: The column to partition the result.
partition_range: Optional[Tuple[int, int]]: The value range of the partition column (an example with an explicit range is shown below).
partition_num: Optional[int]: The number of partitions to generate.
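For instance, the SQLite path shown above can be passed directly as the connection string. A minimal sketch, assuming a lineitem table exists in that placeholder database file:

from dataprep.connector import read_sql

# SQLite currently requires an absolute path in the URI (note the three slashes).
sqlite_url = "sqlite:///home/user/path/test.db"  # placeholder path from the note above
read_sql(sqlite_url, "SELECT * FROM lineitem")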
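Similarly, here is a minimal sketch of the return_type and protocol parameters, assuming the placeholder Postgres connection string used in the examples below; it requests the result as an Apache Arrow table and spells out the default binary protocol explicitly:

from dataprep.connector import read_sql

postgres_url = "postgresql://username:password@server:port/database"  # placeholder credentials
query = "SELECT * FROM lineitem"

# Return the result as an Apache Arrow table instead of a pandas DataFrame,
# fetching over the (default) binary protocol.
table = read_sql(postgres_url, query, return_type="arrow", protocol="binary")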
Read a DataFrame from a SQL query using a single thread
from dataprep.connector import read_sql

postgres_url = "postgresql://username:password@server:port/database"
query = "SELECT * FROM lineitem"

read_sql(postgres_url, query)
Read a DataFrame in parallel using 10 threads by automatically partitioning the provided SQL query on the partition column (partition_range will be queried automatically if not given)
from dataprep.connector import read_sql

postgres_url = "postgresql://username:password@server:port/database"
query = "SELECT * FROM lineitem"

read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=10)
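If the value range of the partition column is already known, it can be passed via partition_range so that read_sql does not need to query it first. A sketch with assumed bounds (they are illustrative placeholders, not real table statistics):

from dataprep.connector import read_sql

postgres_url = "postgresql://username:password@server:port/database"
query = "SELECT * FROM lineitem"

# Supply the (min, max) of l_orderkey up front; the range below is an assumed example.
read_sql(
    postgres_url,
    query,
    partition_on="l_orderkey",
    partition_range=(1, 60000000),
    partition_num=10,
)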
Read a DataFrame in parallel using 2 threads by manually providing two partition SQL queries (the schemas of all the query results must be the same)
from dataprep.connector import read_sql

postgres_url = "postgresql://username:password@server:port/database"
queries = [
    "SELECT * FROM lineitem WHERE l_orderkey <= 30000000",
    "SELECT * FROM lineitem WHERE l_orderkey > 30000000",
]

read_sql(postgres_url, queries)
Read a DataFrame in parallel using 4 threads from a more complex query
from dataprep.connector import read_sql

postgres_url = "postgresql://username:password@server:port/database"
query = f"""
SELECT l_orderkey,
       SUM(l_extendedprice * (1 - l_discount)) AS revenue,
       o_orderdate,
       o_shippriority
FROM customer,
     orders,
     lineitem
WHERE c_mktsegment = 'BUILDING'
  AND c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND o_orderdate < DATE '1995-03-15'
  AND l_shipdate > DATE '1995-03-15'
GROUP BY l_orderkey,
         o_orderdate,
         o_shippriority
"""

read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=4)