In this module lives the type tree.
dataprep.eda.dtypes
Categorical
Bases: dataprep.eda.dtypes.DType
Type Categorical
Continuous
Bases: dataprep.eda.dtypes.Numerical
Type Continuous, Subtype of Numerical
DType
Bases: object
Root of Type Tree
DateTime
Type DateTime, Subtype of Numerical
Discrete
Type Discrete, Subtype of Numerical
GeoGraphy
Bases: dataprep.eda.dtypes.Categorical
Type GeoGraphy, Subtype of Categorical
GeoPoint
Type GeoPoint
LatLong
Bases: dataprep.eda.dtypes.GeoPoint
Type LatLong, Tuple
Nominal
Type Nominal, Subtype of Categorical
Numerical
Type Numerical
Ordinal
Type Ordinal, Subtype of Categorical
Text
Bases: dataprep.eda.dtypes.Nominal
Type Text, Subtype of Nominal
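The subtype relationships above are plain Python inheritance, so they can be checked with isinstance. A minimal sketch, assuming the classes are importable from dataprep.eda.dtypes as listed:

from dataprep.eda.dtypes import (
    DType,
    Numerical,
    Continuous,
    Categorical,
    GeoGraphy,
    Nominal,
    Text,
)

# Subtype relationships mirror the class hierarchy documented above.
assert isinstance(Continuous(), Numerical)   # Continuous is a Numerical
assert isinstance(GeoGraphy(), Categorical)  # GeoGraphy is a Categorical
assert isinstance(Text(), Nominal)           # Text is a Nominal
assert isinstance(Nominal(), DType)          # everything roots at DType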
detect_dtype
Given a column, detect its type or transform its type according to users’ specification
col (dask.dataframe.Series) – A dataframe column
known_dtype (Optional[Union[Dict[str, Union[DType, str]], DType]], default None) – A dictionary or a single DType given by the user to specify the types for designated columns or all columns. E.g. known_dtype = {“a”: Continuous, “b”: “Nominal”}, known_dtype = {“a”: Continuous(), “b”: “nominal”}, known_dtype = Continuous(), or known_dtype = “Continuous”.
detect_small_distinct (bool, default True) – Whether to detect numerical columns with small distinct values as categorical columns.
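A hedged usage sketch of detect_dtype based on the parameters above; the sample data and column name are illustrative:

import pandas as pd
import dask.dataframe as dd
from dataprep.eda.dtypes import detect_dtype, Continuous

# Build a one-column dask Series, as the col parameter expects.
ser = dd.from_pandas(pd.Series([1.0, 2.5, 3.7], name="a"), npartitions=1)

inferred = detect_dtype(ser)                                  # let the detector infer
forced = detect_dtype(ser, known_dtype={"a": Continuous()})   # user-specified override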
detect_without_known
This function detects the dtype of a column when the user didn’t specify one.
drop_null
Drop the null values (specified in NULL_VALUES) from a series or DataFrame
Union[Series, Series, DataFrame, DataFrame]
get_dtype_cnts_and_num_cols
Get the count of each dtype in a dataframe
Tuple[Dict[str, int], List[str]]
is_continuous
Given a type, return whether that type is a continuous type
bool
is_datetime
Given a type, return whether that type is a datetime type
is_dtype
This function detects whether dtype2 is dtype1.
is_geography
Given a column, return whether its type is a geography type
is_geopoint
Given a column, return whether its type is a geopoint type
is_nominal
Given a type, return whether that type is a nominal type
is_pandas_categorical
Detect if a dtype is categorical and from pandas.
map_dtype
Currently, we want to keep our Type System flattened. We will map Categorical() to Nominal() and Numerical() to Continuous().
normalize_dtype
This function normalizes a dtype repr.
Intermediate class
dataprep.eda.intermediate
ColumnMetadata
Container for storing a single column’s metadata. This is immutable.
metadata
ColumnsMetadata
Container for storing each column’s metadata.
Intermediate
Bases: Dict[str, Any]
Any
This class contains intermediate results.
save
Save intermediate to current working directory.
filename (Optional[str], default 'intermediate') – The filename used for saving the intermediate, without the extension.
to (Optional[str], default Path.cwd()) – The path to where the intermediate will be saved.
None
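A minimal sketch of save; the variable name itmdt is illustrative, assuming an Intermediate instance produced elsewhere:

# Writes /tmp/my_intermediate plus the extension chosen by save.
itmdt.save(filename="my_intermediate", to="/tmp")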
visual_type
This file defines palettes used for EDA.
This module implements the Container class.
dataprep.eda.container
Container
This class creates a customized Container object for the plot* functions.
save function
show
Render the report. This is useful when calling plot in a for loop.
show_browser
Open the plot in the browser. This is useful when plotting from the terminal or when the figure is very large in a notebook.
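An illustrative loop per the note above; df and the column names are assumptions:

from dataprep.eda import plot

for col in ["price", "city"]:  # hypothetical columns
    plot(df, col).show()       # render each figure inside the loop
# plot(df).show_browser()      # or open a large figure in the browser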
Context
Define the Context class that stores all the parameters needed by the template engine. The instance is read-only.
Since we use the same template to render different components without strict evaluation, when the engine tries to read an attribute from a Context object it gets None if the attribute doesn’t exist, so rendering keeps going instead of being interrupted.
Here we override __getitem__() and __getattr__() to do the trick; this also gives the object key-value access that behaves the same as attribute access.
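A minimal re-implementation of the trick described above, for illustration only (this is not the library’s actual code):

class ContextSketch:
    def __init__(self, **params):
        self._params = dict(params)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails; a missing
        # attribute yields None so template rendering keeps going.
        return self._params.get(name)

    def __getitem__(self, key):
        # Key access behaves the same as attribute access.
        return self._params.get(key)

ctx = ContextSketch(title="EDA Report")
assert ctx.title == ctx["title"] == "EDA Report"
assert ctx.nonexistent is None  # no AttributeError is raised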
Miscellaneous functions
dataprep.eda.utils
cut_long_name
If the name is longer than max_len, cut it to max_len characters and append “…”.
fuse_missing_perc
Append (x.y%) to the name if perc is not 0.
preprocess_dataframe
Make a dask dataframe with only used_columns. This function will do the following:
1. keep only used_columns.
2. transform column names to string (to avoid object column names) and rename duplicate column names in the form {col}_{id}.
3. reset the index.
4. transform object columns to string columns (note that an object column can contain cells of different types).
5. transform to a dask dataframe if the input is a pandas dataframe.
org_df (dataframe) – the original dataframe
used_columns (optional list[str], default None) – used columns in org_df
excluded_columns (optional list[str], default None) – excluded columns from used_columns, mainly used for geo point data processing.
detect_small_distinct (bool, default True) – whether to detect numerical columns with small distinct values as categorical column.
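A hedged call sketch based on the parameters above; df and the column names are assumptions:

from dataprep.eda.utils import preprocess_dataframe

ddf = preprocess_dataframe(
    df,                              # pandas or dask dataframe
    used_columns=["price", "city"],  # hypothetical columns to keep
    detect_small_distinct=True,
)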
relocate_legend
Relocate legend(s) from center to loc.
Figure
sample_n
Sample n values uniformly from the range of arr, not from the distribution of arr’s elements.
ndarray
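An illustrative re-implementation of the idea, uniform over the index range rather than the value distribution; this is not the library’s code:

import numpy as np

def sample_n_sketch(arr: np.ndarray, n: int) -> np.ndarray:
    # Pick n evenly spaced positions across the array.
    idx = np.linspace(0, len(arr) - 1, n).astype(int)
    return arr[idx]

sample_n_sketch(np.arange(100), 5)  # array([ 0, 24, 49, 74, 99])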
to_dask
Convert a dataframe to a dask dataframe.
tweak_figure
Set some common attributes for a figure
Parameter configurations
This file contains configurations for stats, auto-insights and plots. There are mainly two settings, “display” and “config”. Display is a list of tab names that controls which tabs to show. Config is a dictionary that contains the customizable parameters and their values. There are two types of parameters, global and local. Local parameters are plot-specific and their names are separated by “.”: the portion before the first “.” is the plot name and the portion after it is the parameter name, e.g. “hist.bins”. The “.” is also used when the parameter name contains more than one word, e.g. “insight.duplicates.threshold”. However, in the codebase the “.” is replaced with “__” for parameters with long names, e.g. “insight.duplicates__threshold”. A global parameter is a single word and applies to all the plots that have that parameter, e.g. “bins”: 50 applies to “hist.bins”, “line.bins”, “kde.bins”, “wordlen.bins” and “box.bins”. When a user enters both a global and a local parameter in config, the local parameter overrides the global one for that specific plot.
dataprep.eda.configs
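A hedged sketch of passing display and config to plot(); df, the column name, and the tab names are assumptions:

from dataprep.eda import plot

plot(
    df,
    "age",                                    # hypothetical column
    display=["Stats", "Histogram"],           # tabs to show (assumed names)
    config={
        "bins": 50,                           # global: applies to hist.bins, kde.bins, ...
        "hist.bins": 30,                      # local: overrides the global for hist
        "insight.duplicates.threshold": 0.01, # multi-word local parameter
    },
)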
Bar
Bases: pydantic.main.BaseModel
Whether to create this element
Maximum number of bars to display
Whether to sort the bars in descending order
Y-axis scale (“linear” or “log”)
Color of the bar chart
Height of the plot
Width of the plot
bars
color
enable
grid_how_to_guide
how-to guide for plot(df)
List[Tuple[str, str]]
height
how_to_guide
how-to guide for plot(df, x)
missing_how_to_guide
how-to guide for plot_missing(df, x, [y])
sort_descending
width
yscale
Box
Maximum number of groups to display
Number of bins
Defines the time unit to group values over for a datetime column. It can be “year”, “quarter”, “month”, “week”, “day”, “hour”, “minute”, “second”. With default value “auto”, it will use the time unit such that the resulting number of groups is closest to 15
Whether to sort the boxes in descending order of frequency
Color of the box_plot
bins
ngroups
nom_cont_how_to_guide
how-to guide for plot(df, nominal, continuous)
two_cont_how_to_guide
how-to guide for plot(df, continuous, continuous)
unit
univar_how_to_guide
CDF
Number of evenly spaced samples between the minimum and maximum values to compute the cdf at
how-to guide
sample_size
Config
Configuration class
bar
box
cdf
correlations
dendro
diff
from_dict
Converts a dictionary into a Config instance (see the sketch after this member list)
heatmap
hexbin
hist
insight
interactions
kde
kendall
line
missingvalues
nested
overview
pdf
pearson
pie
plot
qqnorm
scatter
spearman
spectrum
stacked
stats
value_table
variables
wordcloud
wordfreq
wordlen
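A sketch of from_dict, referenced in the member list above; the exact keyword names are assumptions:

from dataprep.eda.configs import Config

cfg = Config.from_dict(
    display=["Stats", "Histogram"],           # assumed keyword
    config={"hist.bins": 30, "height": 400},  # assumed keyword
)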
Correlations
If the correlation value is out of the range, don’t show it.
Choose the top-k elements
how-to guide for plot_correlation(df, x)
k
value_range
Dendrogram
Diff
Define the parameters for plot_diff
baseline
density
label
Heatmap
Maximum number of most frequent values from the first column to display
Maximum number of most frequent values from the second column to display (computed on the filtered data consisting of the most frequent values from the first column)
how-to guide for plot(df, nominal, nominal)
how-to guide for plot_missing(df)
nsubgroups
Hexbin
The size of the tile in the hexbin plot. Measured from the middle of a hexagon to its left or right corner.
tile_size
Hist
Number of bins in the histogram
Color of the histogram
Insight
Warn if the percent of duplicated values is above this threshold
The significance level for the Kolmogorov–Smirnov test
The p-value threshold for the chi-square test
Warn if the percent of missing values is above this threshold
The p-value for scipy.stats.skewtest, which tests whether the skew is different from that of a normal distribution
Warn if the percent of infinites is above this threshold
Warn if the percent of zeros is above this threshold
Warn if the percent of negatives is above this threshold
The p-value threshold for the normality test; it is based on D’Agostino and Pearson’s test, which combines skew and kurtosis to produce an omnibus test of normality
The threshold for the unique value count; a count larger than the threshold yields a high cardinality insight
The threshold for the unique value count; a count equal to the threshold yields a constant value insight
The threshold for outstanding no1 insight, measures the ratio of the largest category count to the second-largest category count
The threshold for the attribution insight, measures the percentage of the top 2 categories
The threshold for the high word cardinality insight, which measures the number of words in that category
The threshold for the outstanding no1 word insight, which measures the ratio of the most frequent word count to the second most frequent word count
The threshold for the outlier count in the box plot
attribution__threshold
constant__threshold
duplicates__threshold
high_cardinality__threshold
high_word_cardinality__threshold
infinity__threshold
missing__threshold
negatives__threshold
normal__threshold
outlier__threshold
outstanding_no1__threshold
outstanding_no1_word__threshold
similar_distribution__threshold
skewed__threshold
uniform__threshold
zeros__threshold
Interactions
KDE
Color of the density histogram
Color of the density line
hist_color
line_color
KendallTau
Line
Whether to sort the groups in descending order of frequency
The scale to show on the y axis. Can be “linear” or “log”.
Specify the aggregate to use when aggregating over a numeric column
agg
MissingValues
Nested
Overview
PDF
Number of evenly spaced samples between the minimum and maximum values to compute the pdf at
Pearson
Pie
Maximum number of pie slices to display
Whether to sort the slices in descending order of frequency
List of colors
colors
slices
Plot
Class containing global parameters for the plots
report
QQNorm
point_color
Scatter
Number of points to randomly sample per partition. Cannot be used with sample_rate.
Sample rate per partition. Cannot be used with sample_size. Set it to 1.0 for no sampling.
sample_rate
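A hedged example of the mutually exclusive sampling parameters, using the “.” naming scheme described earlier; df and the column names are assumptions:

# Either sample a fixed number of points per partition ...
plot(df, "x", "y", config={"scatter.sample_size": 1000})
# ... or sample by rate; 1.0 disables sampling. Never set both.
plot(df, "x", "y", config={"scatter.sample_rate": 1.0})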
Spearman
Spectrum
Stacked
Stats
Whether to display the stats section
ValueTable
Number of values to show in the table
Variables
WordCloud
Maximum number of most frequent words to display
Whether to remove stopwords
Whether to lemmatize the words
Whether to apply Porter stemming to the words
lemmatize
stem
stopword
top_words
WordFrequency
WordLength