dataprep.eda.distribution

plot

dataprep.eda.distribution.plot(df, col1=None, col2=None, col3=None, *, config=None, display=None, dtype=None, progress=True)[source]

Generates plots for exploratory data analysis.

If no columns are specified, the distribution of each column is plotted: a histogram if the column contains numerical values, a bar chart if it contains categorical values, and a line chart if it is of type datetime.

If one column (x) is specified, the distribution of x is plotted in various ways. If x contains categorical values, a bar chart and pie chart are plotted. If x contains numerical values, a histogram, kernel density estimate plot, box plot, and qq plot are plotted. If x contains datetime values, a line chart is plotted.

If two columns (x and y) are specified, plots depicting the relationship between the variables are displayed. If x and y both contain numerical values, a scatter plot, hexbin plot, and binned box plot are plotted. If one of x and y contains categorical values and the other contains numerical values, a box plot and multiline histogram are plotted. If x and y both contain categorical values, a nested bar chart, stacked bar chart, and heat map are plotted. If one of x and y contains datetime values and the other contains numerical values, a line chart and a box plot are shown. If one of x and y contains datetime values and the other contains categorical values, a multiline chart and a stacked box plot are shown.

If x, y, and z are specified, they must be one each of type datetime, numerical, and categorical. A multiline chart containing an aggregate on the numerical column grouped by the categorical column over time is plotted.
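
The rules above can be summarized as a lookup from column kinds to the plots described. The sketch below is illustrative only: the tables and the `expected_plots` helper are hypothetical and not part of dataprep, the kind names ("numerical", "categorical", "datetime") are labels chosen for this example, and the no-column overview case is left out.

```python
from typing import List

# Plot names copied from the description above. These tables are a reading aid,
# not dataprep's internal representation.
UNIVARIATE = {
    "categorical": ["bar chart", "pie chart"],
    "numerical": ["histogram", "kernel density estimate plot", "box plot", "qq plot"],
    "datetime": ["line chart"],
}

# Keys are unordered sets of kinds, so (x, y) and (y, x) map to the same entry.
BIVARIATE = {
    frozenset(["numerical"]): ["scatter plot", "hexbin plot", "binned box plot"],
    frozenset(["categorical", "numerical"]): ["box plot", "multiline histogram"],
    frozenset(["categorical"]): ["nested bar chart", "stacked bar chart", "heat map"],
    frozenset(["datetime", "numerical"]): ["line chart", "box plot"],
    frozenset(["datetime", "categorical"]): ["multiline chart", "stacked box plot"],
}


def expected_plots(*kinds: str) -> List[str]:
    """Return the plots the description lists for one, two, or three column kinds."""
    if len(kinds) == 1 and kinds[0] in UNIVARIATE:
        return UNIVARIATE[kinds[0]]
    if len(kinds) == 2 and frozenset(kinds) in BIVARIATE:
        return BIVARIATE[frozenset(kinds)]
    if len(kinds) == 3 and sorted(kinds) == ["categorical", "datetime", "numerical"]:
        return ["multiline chart"]
    # Any other combination is not covered by the description above.
    raise ValueError(f"combination not described: {kinds!r}")
```

For instance, `expected_plots("numerical", "categorical")` and `expected_plots("categorical", "numerical")` both return the box plot and multiline histogram pair.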

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – DataFrame from which visualizations are generated

  • col1 (Optional[str], default None) – A valid column name from the dataframe

  • col2 (Optional[str], default None) – A valid column name from the dataframe

  • col3 (Optional[str], default None) – A valid column name from the dataframe

  • config (Optional[Dict[str, Any]]) – A dictionary for configuring the visualizations, e.g. config={"hist.bins": 20}

  • display (Optional[List[str]]) – A list containing the names of the visualizations to display, e.g. display=["Histogram"]

  • dtype (str or DType or dict of str or dict of DType, default None) – Data types for designated columns or for all columns, e.g. dtype = {"a": Continuous, "b": "Nominal"} or dtype = {"a": Continuous(), "b": "nominal"} or dtype = Continuous() or dtype = "Continuous".

  • progress (bool) – Enable the progress bar.

Examples

>>> import pandas as pd
>>> from dataprep.eda import *
>>> iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
>>> plot(iris)
>>> plot(iris, "petal_length")
>>> plot(iris, "petal_width", "species")
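
The `config`, `display`, and `progress` parameters can be combined in a single call. The sketch below reuses the example values from the parameter descriptions; the `plot_kwargs` dictionary and the `plot_petal_length` helper are hypothetical names introduced for illustration, and the function assumes dataprep is installed and that `df` has a `petal_length` column, as in the iris data above.

```python
# Keyword arguments built from the example values in the parameter descriptions.
plot_kwargs = {
    "config": {"hist.bins": 20},  # example configuration key from the docs
    "display": ["Histogram"],     # restrict the output to the named visualization
    "progress": False,            # disable the progress bar
}


def plot_petal_length(df):
    """Plot the petal_length column using the options above (requires dataprep)."""
    # Imported lazily so the options can be inspected without dataprep installed.
    from dataprep.eda import plot

    return plot(df, "petal_length", **plot_kwargs)
```

Calling `plot_petal_length(iris)` is then equivalent to `plot(iris, "petal_length", config={"hist.bins": 20}, display=["Histogram"], progress=False)`.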

Return type

Container

compute

Computations for plot(df, …)

dataprep.eda.distribution.compute.compute(df, col1=None, col2=None, col3=None, *, cfg=None, display=None, dtype=None)[source]

All in one compute function.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – DataFrame from which visualizations are generated

  • cfg (Union[Config, Dict[str, Any], None], default None) – When plot() is called, the Config object it creates is passed to compute(). When compute() is called directly, cfg may be a dictionary that customizes the output. If cfg is None, default values are used for all parameters.

  • display (Optional[List[str]], default None) – A list containing the names of the visualizations to display. Only relevant when compute() is called directly and the output needs to be customized.

  • col1 (Optional[str], default None) – A valid column name from the dataframe

  • col2 (Optional[str], default None) – A valid column name from the dataframe

  • col3 (Optional[str], default None) – A valid column name from the dataframe

  • dtype (str or DType or dict of str or dict of DType, default None) – Data types for designated columns or for all columns, e.g. dtype = {"a": Continuous, "b": "Nominal"} or dtype = {"a": Continuous(), "b": "nominal"} or dtype = Continuous() or dtype = "Continuous".

Return type

Intermediate

render

This module implements the visualization for the plot(df) function.

dataprep.eda.distribution.render.render(itmdt, cfg)[source]

Render a basic plot from the intermediate results produced by compute().

Parameters
  • itmdt (Intermediate) – The Intermediate containing results from the compute function.

  • cfg (Config) – Config instance

Return type

Union[LayoutDOM, Dict[str, Any]]
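
Since compute() returns an Intermediate and render() consumes one together with a Config, the two stages can in principle be run separately. The following is a minimal sketch under stated assumptions, not the library's own implementation of plot(): the helper name `plot_in_two_steps` is hypothetical, and the `dataprep.eda.configs.Config` import path and its `from_dict(display, config)` constructor are assumptions that should be checked against the installed version.

```python
def plot_in_two_steps(df, col1=None, *, config=None, display=None):
    """Run compute() and render() as separate stages (requires dataprep)."""
    # Imports are local so that defining this helper does not require dataprep.
    from dataprep.eda.configs import Config  # assumed module path
    from dataprep.eda.distribution.compute import compute
    from dataprep.eda.distribution.render import render

    # Assumption: Config.from_dict builds a Config from the display list and
    # the configuration dictionary.
    cfg = Config.from_dict(display, config)

    # Stage 1: compute the intermediate results for the selected column.
    itmdt = compute(df, col1, cfg=cfg, dtype=None)

    # Stage 2: turn the intermediate results into a renderable layout.
    return render(itmdt, cfg)
```

Keeping the stages apart makes it possible to inspect the Intermediate before rendering; for ordinary use, plot() remains the simpler entry point.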