In this module lives the type tree.
dataprep.eda.dtypes
Categorical
Bases: dataprep.eda.dtypes.DType
Type Categorical
Continuous
Bases: dataprep.eda.dtypes.Numerical
Type Continuous, Subtype of Numerical
DType
Bases: object
Root of Type Tree
DateTime
Type DateTime, Subtype of Numerical
Discrete
Type Discrete, Subtype of Numerical
GeoGraphy
Bases: dataprep.eda.dtypes.Categorical
Type GeoGraphy, Subtype of Categorical
GeoPoint
Type GeoPoint
LatLong
Bases: dataprep.eda.dtypes.GeoPoint
Type LatLong, Tuple
Nominal
Type Nominal, Subtype of Categorical
Numerical
Type Numerical
Ordinal
Type Ordinal, Subtype of Categorical
Text
Bases: dataprep.eda.dtypes.Nominal
Type Text, Subtype of Nominal
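The subtype relationships above are plain Python inheritance, so they can be checked with isinstance. A minimal sketch, assuming the classes are importable from dataprep.eda.dtypes as listed:

from dataprep.eda.dtypes import (
    DType,
    Numerical,
    Continuous,
    Categorical,
    GeoGraphy,
    Nominal,
    Text,
)

# Subtype relationships mirror the class hierarchy documented above.
assert isinstance(Continuous(), Numerical)   # Continuous is a Numerical
assert isinstance(GeoGraphy(), Categorical)  # GeoGraphy is a Categorical
assert isinstance(Text(), Nominal)           # Text is a Nominal
assert isinstance(Nominal(), DType)          # everything roots at DType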
detect_dtype
Given a column, detect its type or transform its type according to users’ specification
col (dask.dataframe.Series) – A dataframe column
known_dtype (Optional[Union[Dict[str, Union[DType, str]], DType]], default None) – A dictionary or a single DType given by the user to specify the types for designated columns or all columns. E.g. known_dtype = {“a”: Continuous, “b”: “Nominal”}, known_dtype = {“a”: Continuous(), “b”: “nominal”}, known_dtype = Continuous(), or known_dtype = “Continuous”.
detect_small_distinct (bool, default True) – Whether to detect numerical columns with small distinct values as categorical columns.
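A hedged usage sketch of detect_dtype based on the parameters above; the sample data and column name are illustrative:

import pandas as pd
import dask.dataframe as dd
from dataprep.eda.dtypes import detect_dtype, Continuous

# Build a one-column dask Series, as the col parameter expects.
ser = dd.from_pandas(pd.Series([1.0, 2.5, 3.7], name="a"), npartitions=1)

inferred = detect_dtype(ser)                                  # let the detector infer
forced = detect_dtype(ser, known_dtype={"a": Continuous()})   # user-specified override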
detect_without_known
This function detects the dtype of a column when the user didn’t specify one.
drop_null
Drop the null values (specified in NULL_VALUES) from a series or DataFrame
Union[Series, Series, DataFrame, DataFrame]
get_dtype_cnts_and_num_cols
Get the count of each dtype in a dataframe
Tuple[Dict[str, int], List[str]]
is_continuous
Given a type, return whether that type is a continuous type
bool
is_datetime
Given a type, return whether that type is a datetime type
is_dtype
This function detects whether dtype2 is dtype1.
is_geography
Given a column, return whether its type is a geography type
is_geopoint
Given a column, return whether its type is a geopoint type
is_nominal
Given a type, return whether that type is a nominal type
is_pandas_categorical
Detect if a dtype is categorical and from pandas.
map_dtype
Currently, we want to keep our Type System flattened. We will map Categorical() to Nominal() and Numerical() to Continuous().
normalize_dtype
This function normalizes a dtype repr.
Intermediate class
dataprep.eda.intermediate
ColumnMetadata
Container for storing a single column’s metadata. This is immutable.
metadata
ColumnsMetadata
Container for storing each column’s metadata.
Intermediate
Bases: Dict[str, Any]
Any
This class contains intermediate results.
save
Save intermediate to current working directory.
filename (Optional[str], default 'intermediate') – The filename used for saving the intermediate, without the extension.
to (Optional[str], default Path.cwd()) – The path to where the intermediate will be saved.
None
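A minimal sketch of save; the variable name itmdt is illustrative, assuming an Intermediate instance produced elsewhere:

# Writes /tmp/my_intermediate plus the extension chosen by save.
itmdt.save(filename="my_intermediate", to="/tmp")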
visual_type
This file defines palettes used for EDA.
This module implements the Container class.
dataprep.eda.container
Container
This class creates a customized Container object for the plot* functions.
save function
show
Render the report. This is useful when calling plot in a for loop.
show_browser
Open the plot in the browser. This is useful when plotting from the terminal or when the figure is very large in a notebook.
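An illustrative loop per the note above; df and the column names are assumptions:

from dataprep.eda import plot

for col in ["price", "city"]:  # hypothetical columns
    plot(df, col).show()       # render each figure inside the loop
# plot(df).show_browser()      # or open a large figure in the browser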
Context
Define the Context class that stores all the parameters needed by the template engine. The instance is read-only.
Since we use the same template to render different components without strict evaluation, when the engine tries to read an attribute from a Context object it gets None if the attribute doesn’t exist, so rendering keeps going instead of being interrupted.
Here we override __getitem__() and __getattr__() to do the trick; this also gives the object key-value access that behaves the same as attribute access.
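A minimal re-implementation of the trick described above, for illustration only (this is not the library’s actual code):

class ContextSketch:
    def __init__(self, **params):
        self._params = dict(params)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails; a missing
        # attribute yields None so template rendering keeps going.
        return self._params.get(name)

    def __getitem__(self, key):
        # Key access behaves the same as attribute access.
        return self._params.get(key)

ctx = ContextSketch(title="EDA Report")
assert ctx.title == ctx["title"] == "EDA Report"
assert ctx.nonexistent is None  # no AttributeError is raised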
Miscellaneous functions
dataprep.eda.utils
cut_long_name
If the name is longer than max_len, cut it to max_len characters and append “…”.
fuse_missing_perc
Append (x.y%) to the name if perc is not 0.
preprocess_dataframe
Make a dask dataframe with only used_columns. This function will do the following:
1. keep only used_columns.
2. transform column names to string (to avoid object column names) and rename duplicate column names in the form {col}_{id}.
3. reset the index.
4. transform object columns to string columns (note that an object column can contain cells of different types).
5. transform to a dask dataframe if the input is a pandas dataframe.
org_df (dataframe) – the original dataframe
used_columns (optional list[str], default None) – used columns in org_df
excluded_columns (optional list[str], default None) – excluded columns from used_columns, mainly used for geo point data processing.
detect_small_distinct (bool, default True) – whether to detect numerical columns with small distinct values as categorical column.
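A hedged call sketch based on the parameters above; df and the column names are assumptions:

from dataprep.eda.utils import preprocess_dataframe

ddf = preprocess_dataframe(
    df,                              # pandas or dask dataframe
    used_columns=["price", "city"],  # hypothetical columns to keep
    detect_small_distinct=True,
)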
relocate_legend
Relocate legend(s) from center to loc.
Figure
sample_n
Sample n values uniformly from the range of arr, not from the distribution of arr’s elements.
ndarray
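An illustrative re-implementation of the idea, uniform over the index range rather than the value distribution; this is not the library’s code:

import numpy as np

def sample_n_sketch(arr: np.ndarray, n: int) -> np.ndarray:
    # Pick n evenly spaced positions across the array.
    idx = np.linspace(0, len(arr) - 1, n).astype(int)
    return arr[idx]

sample_n_sketch(np.arange(100), 5)  # array([ 0, 24, 49, 74, 99])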
to_dask
Convert a dataframe to a dask dataframe.
tweak_figure
Set some common attributes for a figure
Parameter configurations
This file contains configurations for stats, auto-insights and plots. There are mainly two settings, “display” and “config”. Display is a list of tab names that controls which tabs to show. Config is a dictionary that contains the customizable parameters and their values. There are two types of parameters, global and local. Local parameters are plot-specific and their names are separated by “.”: the portion before the first “.” is the plot name and the portion after it is the parameter name, e.g. “hist.bins”. The “.” is also used when the parameter name contains more than one word, e.g. “insight.duplicates.threshold”. However, in the codebase the “.” is replaced with “__” for parameters with long names, e.g. “insight.duplicates__threshold”. A global parameter is a single word and applies to all the plots that have that parameter, e.g. “bins”: 50 applies to “hist.bins”, “line.bins”, “kde.bins”, “wordlen.bins” and “box.bins”. When a user enters both a global and a local parameter in config, the local parameter overrides the global one for that specific plot.
dataprep.eda.configs
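A hedged sketch of passing display and config to plot(); df, the column name, and the tab names are assumptions:

from dataprep.eda import plot

plot(
    df,
    "age",                                    # hypothetical column
    display=["Stats", "Histogram"],           # tabs to show (assumed names)
    config={
        "bins": 50,                           # global: applies to hist.bins, kde.bins, ...
        "hist.bins": 30,                      # local: overrides the global for hist
        "insight.duplicates.threshold": 0.01, # multi-word local parameter
    },
)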
Bar
Bases: pydantic.main.BaseModel
Whether to create this element
Maximum number of bars to display
Whether to sort the bars in descending order
Y-axis scale (“linear” or “log”)
Color of the bar chart
Height of the plot
Width of the plot
bars
color
enable
grid_how_to_guide
how-to guide for plot(df)
List[Tuple[str, str]]
height
how_to_guide
how-to guide for plot(df, x)
missing_how_to_guide
how-to guide for plot_missing(df, x, [y])
sort_descending
width
yscale
Box
Maximum number of groups to display
Number of bins
Defines the time unit to group values over for a datetime column. It can be “year”, “quarter”, “month”, “week”, “day”, “hour”, “minute”, “second”. With default value “auto”, it will use the time unit such that the resulting number of groups is closest to 15
Whether to sort the boxes in descending order of frequency
Color of the box_plot
bins
ngroups
nom_cont_how_to_guide
how-to guide for plot(df, nominal, continuous)
two_cont_how_to_guide
how-to guide for plot(df, continuous, continuous)
unit
univar_how_to_guide
CDF
Number of evenly spaced samples between the minimum and maximum values to compute the cdf at
how-to guide
sample_size
Config
Configuration class
bar
box
cdf
correlations
dendro
diff
from_dict
Converts a dictionary into a Config instance (see the sketch after this member list)
heatmap
hexbin
hist
insight
interactions
kde
kendall
line
missingvalues
nested
overview
pdf
pearson
pie
plot
qqnorm
scatter
spearman
spectrum
stacked
stats
value_table
variables
wordcloud
wordfreq
wordlen
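A sketch of from_dict, referenced in the member list above; the exact keyword names are assumptions:

from dataprep.eda.configs import Config

cfg = Config.from_dict(
    display=["Stats", "Histogram"],           # assumed keyword
    config={"hist.bins": 30, "height": 400},  # assumed keyword
)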
Correlations
If the correlation value is out of the range, don’t show it.
Choose the top-k elements
how-to guide for plot_correlation(df, x)
k
value_range
Dendrogram
Diff
Define the parameters for plot_diff
baseline
density
label
Heatmap
Maximum number of most frequent values from the first column to display
Maximum number of most frequent values from the second column to display (computed on the filtered data consisting of the most frequent values from the first column)
how-to guide for plot(df, nominal, nominal)
how-to guide for plot_missing(df)
nsubgroups
Hexbin
The size of the tile in the hexbin plot. Measured from the middle of a hexagon to its left or right corner.
tile_size
Hist
Number of bins in the histogram
Color of the histogram
Insight
Warn if the percent of duplicated values is above this threshold
The significance level for the Kolmogorov–Smirnov test
The p-value threshold for the chi-square test
Warn if the percent of missing values is above this threshold
The p-value for scipy.stats.skewtest, which tests whether the skew is different from that of a normal distribution
Warn if the percent of infinites is above this threshold
Warn if the percent of zeros is above this threshold
Warn if the percent of negatives is above this threshold
The p-value threshold for the normality test; it is based on D’Agostino and Pearson’s test, which combines skew and kurtosis to produce an omnibus test of normality
The threshold for the unique value count; a count larger than the threshold yields a high cardinality insight
The threshold for the unique value count; a count equal to the threshold yields a constant value insight
The threshold for outstanding no1 insight, measures the ratio of the largest category count to the second-largest category count
The threshold for the attribution insight, measures the percentage of the top 2 categories
The threshold for the high word cardinality insight, which measures the number of words in that category
The threshold for the outstanding no1 word insight, which measures the ratio of the most frequent word count to the second most frequent word count
The threshold for the outlier count in the box plot
attribution__threshold
constant__threshold
duplicates__threshold
high_cardinality__threshold
high_word_cardinality__threshold
infinity__threshold
missing__threshold
negatives__threshold
normal__threshold
outlier__threshold
outstanding_no1__threshold
outstanding_no1_word__threshold
similar_distribution__threshold
skewed__threshold
uniform__threshold
zeros__threshold
Interactions
KDE
Color of the density histogram
Color of the density line
hist_color
line_color
KendallTau
Line
Whether to sort the groups in descending order of frequency
The scale to show on the y axis. Can be “linear” or “log”.
Specify the aggregate to use when aggregating over a numeric column
agg
MissingValues
Nested
Overview
PDF
Number of evenly spaced samples between the minimum and maximum values to compute the pdf at
Pearson
Pie
Maximum number of pie slices to display
Whether to sort the slices in descending order of frequency
List of colors
colors
slices
Plot
Class containing global parameters for the plots
report
QQNorm
point_color
Scatter
Number of points to randomly sample per partition. Cannot be used with sample_rate.
Sample rate per partition. Cannot be used with sample_size. Set it to 1.0 for no sampling.
sample_rate
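A hedged example of the mutually exclusive sampling parameters, using the “.” naming scheme described earlier; df and the column names are assumptions:

# Either sample a fixed number of points per partition ...
plot(df, "x", "y", config={"scatter.sample_size": 1000})
# ... or sample by rate; 1.0 disables sampling. Never set both.
plot(df, "x", "y", config={"scatter.sample_rate": 1.0})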
Spearman
Spectrum
Stacked
Stats
Whether to display the stats section
ValueTable
Number of values to show in the table
Variables
WordCloud
Maximum number of most frequent words to display
Whether to remove stopwords
Whether to lemmatize the words
Whether to apply Porter stemming to the words
lemmatize
stem
stopword
top_words
WordFrequency
WordLength