Core APIs

Analyzers

Analyzers file for all the different analyzer classes in Deequ

class pydeequ.analyzers.AnalysisRunBuilder(spark_session: pyspark.sql.session.SparkSession, df: pyspark.sql.dataframe.DataFrame)

Bases: object

Low-level class for running the analyzers module. It is meant to be constructed via AnalysisRunner.

Parameters
  • spark_session (SparkSession) – SparkSession

  • df (DataFrame) – DataFrame to run the Analysis on.

addAnalyzer(analyzer: pydeequ.analyzers._AnalyzerObject)

Adds a single analyzer to the current Analyzer run.

Parameters

analyzer – Adds an analyzer strategy to the run.

Returns

self: for further chained method calls.

run()

Run the Analysis.

Returns

The result of the run, which can be passed to AnalyzerContext.successMetricsAsDataFrame or successMetricsAsJson.

saveOrAppendResult(resultKey: pydeequ.repository.ResultKey)

A shortcut to save the results of the run or append them to existing results in the metrics repository.

Parameters

resultKey (ResultKey) – The result key to identify the current run

Returns

self

useRepository(repository: pydeequ.repository.MetricsRepository)

Set a metrics repository associated with the current data to enable features like reusing previously computed results and storing the results of the current run.

Parameters

repository (MetricsRepository) – A metrics repository to store and load results associated with the run

Returns

self

class pydeequ.analyzers.AnalysisRunner(spark_session: pyspark.sql.session.SparkSession)

Bases: object

Runs a set of analyzers on the data at hand and optimizes the resulting computations to minimize the number of scans over the data. Additionally, the internal states of the computation can be stored and aggregated with existing states to enable incremental computations.

Parameters

spark_session (SparkSession) – SparkSession

onData(df)

Starting point to construct an AnalysisRun.

Parameters

df (DataFrame) – Tabular data on which the analyzers should be run.

Returns

A new AnalysisRunBuilder object.

class pydeequ.analyzers.AnalyzerContext

Bases: object

The result returned from AnalysisRunner and Analysis.

classmethod successMetricsAsDataFrame(spark_session: pyspark.sql.session.SparkSession, analyzerContext, forAnalyzers: Optional[list] = None, pandas: bool = False)

Get the Analysis Run as a DataFrame.

Parameters
  • spark_session (SparkSession) – SparkSession

  • analyzerContext (AnalyzerContext) – Analysis Run

  • forAnalyzers (list) – Subset of Analyzers from the Analysis Run

Returns

DataFrame of the Analysis Run

classmethod successMetricsAsJson(spark_session: pyspark.sql.session.SparkSession, analyzerContext, forAnalyzers: Optional[list] = None)

Get the Analysis Run as a JSON.

Parameters
  • spark_session (SparkSession) – SparkSession

  • analyzerContext (AnalyzerContext) – Analysis Run

  • forAnalyzers (list) – Subset of Analyzers from the Analysis Run

Returns

JSON output of the Analysis Run
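
For illustration, a minimal sketch of an analysis run is shown below. It assumes an existing SparkSession named spark and a DataFrame df; the column names review_id and customer_id are placeholders.

from pydeequ.analyzers import (AnalysisRunner, AnalyzerContext, Size,
                               Completeness, ApproxCountDistinct)

# Run several analyzers in a single pass over the data
analysisResult = (AnalysisRunner(spark)
    .onData(df)
    .addAnalyzer(Size())
    .addAnalyzer(Completeness("review_id"))           # hypothetical column
    .addAnalyzer(ApproxCountDistinct("customer_id"))  # hypothetical column
    .run())

# Convert the run result into a DataFrame of metrics
AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult).show()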

class pydeequ.analyzers.ApproxCountDistinct(column: str, where: Optional[str] = None)

Bases: pydeequ.analyzers._AnalyzerObject

Computes the approximate count distinctness of a column with HyperLogLogPlusPlus.

Parameters
  • column (str) – Column to compute this aggregation on.

  • where (str) – Additional filter to apply before the analyzer is run.

class pydeequ.analyzers.ApproxQuantile(column: str, quantile: float, relativeError: float = 0.01, where=None)

Bases: pydeequ.analyzers._AnalyzerObject

Computes the Approximate Quantile of a column. The allowed relative error compared to the exact quantile can be configured with the relativeError parameter.

Parameters
  • column (str) – The column in the DataFrame for which the approximate quantile is analyzed.

  • quantile (float [0,1]) – The computed quantile. It must be within the interval [0, 1], where 0.5 would be the median.

  • relativeError (float [0,1]) – Relative target precision to achieve in the quantile computation. A relativeError = 0.0 would yield the exact quantile while increasing the computational load.

  • where (str) – Additional filter to apply before the analyzer is run.

class pydeequ.analyzers.ApproxQuantiles(column, quantiles, relativeError=0.01)

Bases: pydeequ.analyzers._AnalyzerObject

Computes the approximate quantiles of a column. The allowed relative error compared to the exact quantile can be configured with relativeError parameter.

Parameters
  • column (str) – Column in DataFrame for which the approximate quantile is analyzed.

  • quantiles (List[float[0,1]])) – Computed Quantiles. Must be in the interval [0, 1], where 0.5 would be the median.

  • relativeError (float [0,1]) – Relative target precision to achieve in the quantile computation. A relativeError = 0.0 would yield the exact quantile while increasing the computational load.

class pydeequ.analyzers.Completeness(column, where=None)

Bases: pydeequ.analyzers._AnalyzerObject

Completeness is the fraction of non-null values in a column.

Parameters
  • column (str) – Column in DataFrame for which Completeness is analyzed.

  • where (str) – additional filter to apply before the analyzer is run.

class pydeequ.analyzers.Compliance(instance, predicate, where=None)

Bases: pydeequ.analyzers._AnalyzerObject

Compliance measures the fraction of rows that comply with the given column constraint. E.g. if the constraint is “att1 > 3” and the DataFrame has 5 rows with att1 greater than 3 and 10 rows with att1 at most 3, a DoubleMetric with value 0.33 would be returned (see the sketch after the parameter list).

Parameters
  • instance (str) – Unlike other column analyzers (e.g. Completeness), this analyzer cannot infer the metric instance name from a column name. Also, the constraint given here may refer to multiple columns, so a metric instance name describing what is being analyzed must be provided.

  • predicate (str) – SQL-predicate to apply per row

  • where (str) – additional filter to apply before the analyzer is run.
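
A sketch of the example above (the column att1 and the variables spark and df are assumptions):

from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Compliance

# The metric instance name is given explicitly because the predicate
# may refer to several columns.
result = (AnalysisRunner(spark)
    .onData(df)
    .addAnalyzer(Compliance("att1 above 3", "att1 > 3"))
    .run())

AnalyzerContext.successMetricsAsDataFrame(spark, result).show()

The resulting DoubleMetric holds the fraction of rows for which the predicate evaluates to true.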

class pydeequ.analyzers.Correlation(column1, column2, where=None)

Bases: pydeequ.analyzers._AnalyzerObject

Computes the Pearson correlation coefficient between the two given columns.

Parameters
  • column1 (str) – First column in the DataFrame for which the Correlation is analyzed.

  • column2 (str) – Second column in the DataFrame for which the Correlation is analyzed.

  • where (str) – additional filter to apply before the analyzer is run.

class pydeequ.analyzers.CountDistinct(columns)

Bases: pydeequ.analyzers._AnalyzerObject

Counts the distinct elements in the column(s).

Parameters

columns (List[str]) – Column(s) in the DataFrame for which distinctness is analyzed.

class pydeequ.analyzers.DataType(column, where=None)

Bases: pydeequ.analyzers._AnalyzerObject

Data Type Analyzer. Returns the data types of a column.

Parameters
  • column (str) – Column in the DataFrame for which data type is analyzed.

  • where (str) – additional filter to apply before the analyzer is run.

class pydeequ.analyzers.DataTypeInstances(value)

Bases: enum.Enum

An enum class that maps columns to Scala data types

Boolean = 'Boolean'
Fractional = 'Fractional'
Integral = 'Integral'
String = 'String'
Unknown = 'Unknown'
class pydeequ.analyzers.Distinctness(columns, where: Optional[str] = None)

Bases: pydeequ.analyzers._AnalyzerObject

Counts the distinctness of elements in column(s). Distinctness is the fraction of distinct values in a column or combination of columns.

Parameters
  • columns (str OR list[str]) – Column(s) in the DataFrame for which distinctness is analyzed. Pass a str for a single column or list[str] for multiple columns.

  • where (str) – additional filter to apply before the analyzer is run.

class pydeequ.analyzers.Entropy(column, where: Optional[str] = None)

Bases: pydeequ.analyzers._AnalyzerObject

Entropy is a measure of the level of information contained in a message. Given the probability distribution over values in a column, it describes how many bits are required to identify a value.

Parameters
  • column (str) – Column in DataFrame for which entropy is calculated.

  • where (str) – additional filter to apply before the analyzer is run.

class pydeequ.analyzers.Histogram(column, binningUdf=None, maxDetailBins: Optional[int] = None, where: Optional[str] = None)

Bases: pydeequ.analyzers._AnalyzerObject

Histogram is the summary of values in a column of a DataFrame. It groups the column’s values then calculates the number of rows with that specific value and the fraction of the value.

Parameters
  • column (str) – Column in DataFrame to do histogram analysis.

  • binningUdf (lambda expr) – Optional binning function to run before grouping to re-categorize the column values; for example, a binning function might be used to turn a numerical value into a categorical one.

  • maxDetailBins (int) – Histogram details are only provided for the N column values with the highest counts; maxDetailBins sets the N. This limit does not affect the reported number of bins, which is always the distinct value count.

  • where (str) – additional filter to apply before the analyzer is run.

class pydeequ.analyzers.KLLParameters(spark_session: pyspark.sql.session.SparkSession, sketchSize: int, shrinkingFactor: float, numberOfBuckets: int)

Bases: object

Parameter definition for KLL Sketches.

Parameters
  • sketchSize (int) – size of kll sketch.

  • shrinkingFactor (float) – shrinking factor of kll sketch.

  • numberOfBuckets (int) – number of buckets.

class pydeequ.analyzers.KLLSketch(column: str, kllParameters: pydeequ.analyzers.KLLParameters)

Bases: pydeequ.analyzers._AnalyzerObject

The KLL Sketch analyzer.

Parameters
  • column (str) – Column in DataFrame to do histogram analysis.

  • kllParameters (KLLParameters) – parameters of KLL Sketch
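
A possible sketch of running a KLL sketch analysis with explicit parameters. The spark and df variables, the price column, and the parameter values are assumptions chosen for illustration.

from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, KLLParameters, KLLSketch

# Example sketch settings: 2048 elements, shrinking factor 0.64, 10 buckets
params = KLLParameters(spark, sketchSize=2048, shrinkingFactor=0.64, numberOfBuckets=10)

result = (AnalysisRunner(spark)
    .onData(df)
    .addAnalyzer(KLLSketch("price", params))   # hypothetical numeric column
    .run())

AnalyzerContext.successMetricsAsDataFrame(spark, result).show(truncate=False)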

class pydeequ.analyzers.MaxLength(column, where: Optional[str] = None)

Bases: pydeequ.analyzers._AnalyzerObject

MaxLength Analyzer. Gets the maximum length of a str type column.

Parameters
  • column (str) – column in DataFrame to find the maximum length. Column is expected to be a str type.

  • where (str) – additional filter to apply before the analyzer is run.

class pydeequ.analyzers.Maximum(column, where: Optional[str] = None)

Bases: pydeequ.analyzers._AnalyzerObject

Get the maximum of a numeric column.

Parameters
  • column (str) – Column in DataFrame to find the maximum value.

  • where (str) – additional filter to apply before the analyzer is run.

class pydeequ.analyzers.Mean(column, where: Optional[str] = None)

Bases: pydeequ.analyzers._AnalyzerObject

Mean Analyzer. Gets the mean of a column.

Parameters
  • column (str) – Column in DataFrame to find the mean.

  • where (str) – additional filter to apply before the analyzer is run.

class pydeequ.analyzers.MinLength(column, where: Optional[str] = None)

Bases: pydeequ.analyzers._AnalyzerObject

Get the minimum length of a column

Parameters
  • column (str) – Column in DataFrame to find the minimum length. Column is expected to be a str type.

  • where (str) – additional filter to apply before the analyzer is run.

class pydeequ.analyzers.Minimum(column, where: Optional[str] = None)

Bases: pydeequ.analyzers._AnalyzerObject

Get the minimum of a numeric column.

Parameters
  • column (str) – Column in DataFrame to find the minimum value.

  • where (str) – additional filter to apply before the analyzer is run.

class pydeequ.analyzers.MutualInformation(columns, where: Optional[str] = None)

Bases: pydeequ.analyzers._AnalyzerObject

Describes how much information about one column can be inferred from another column.

Parameters
  • columns (list[str]) – Columns in DataFrame for mutual information analysis.

  • where (str) – additional filter to apply before the analyzer is run.

class pydeequ.analyzers.PatternMatch(column, pattern_regex: str, *pattern_groupNames, where: Optional[str] = None)

Bases: pydeequ.analyzers._AnalyzerObject

PatternMatch is a measure of the fraction of rows that comply with a given column regex constraint.

E.g. if a sample DataFrame column has five rows that contain a credit card number and 10 rows that do not, running PatternMatch with the Patterns.CREDITCARD regex returns a DoubleMetric with value 0.33.

Parameters
  • column (str) – Column in DataFrame to check pattern.

  • pattern_regex (str) – pattern regex

  • pattern_groupNames (str) – groupNames for pattern regex

  • where (str) – additional filter to apply before the analyzer is run.
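
A minimal sketch (the phone column, the regex, and the spark/df variables are assumptions):

from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, PatternMatch

# Fraction of rows whose "phone" value is exactly ten digits
result = (AnalysisRunner(spark)
    .onData(df)
    .addAnalyzer(PatternMatch("phone", r"^[0-9]{10}$"))
    .run())

AnalyzerContext.successMetricsAsDataFrame(spark, result).show()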

class pydeequ.analyzers.Size(where: Optional[str] = None)

Bases: pydeequ.analyzers._AnalyzerObject

Size is the number of rows in a DataFrame.

Parameters

where (str) – additional filter to apply before the analyzer is run.

class pydeequ.analyzers.StandardDeviation(column, where: Optional[str] = None)

Bases: pydeequ.analyzers._AnalyzerObject

Calculates the standard deviation of a column.

Parameters
  • column (str) – Column in DataFrame for standard deviation calculation.

  • where (str) – additional filter to apply before the analyzer is run.

class pydeequ.analyzers.Sum(column, where: Optional[str] = None)

Bases: pydeequ.analyzers._AnalyzerObject

Calculates the sum of a column

Parameters
  • column (str) – Column in DataFrame to calculate the sum.

  • where (str) – additional filter to apply before the analyzer is run.

class pydeequ.analyzers.UniqueValueRatio(columns, where: Optional[str] = None)

Bases: pydeequ.analyzers._AnalyzerObject

Calculates the unique value ratio: the fraction of distinct values that occur exactly once.

Parameters
  • columns (list[str]) – Columns in DataFrame to find unique value ratio.

  • where (str) – additional filter to apply before the analyzer is run.

class pydeequ.analyzers.Uniqueness(columns, where: Optional[str] = None)

Bases: pydeequ.analyzers._AnalyzerObject

Uniqueness is the fraction of values in a column (or column combination) that occur exactly once.

Parameters
  • columns (list[str]) – Columns in DataFrame to find uniqueness.

  • where (str) – additional filter to apply before the analyzer is run.
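
Analyzers that operate on one or more key columns take a list of column names. A short sketch (column names and the spark/df variables are assumptions):

from pydeequ.analyzers import (AnalysisRunner, AnalyzerContext,
                               Distinctness, Uniqueness, UniqueValueRatio)

result = (AnalysisRunner(spark)
    .onData(df)
    .addAnalyzer(Uniqueness(["order_id"]))                       # single key column
    .addAnalyzer(UniqueValueRatio(["customer_id", "order_id"]))  # combined key
    .addAnalyzer(Distinctness(["customer_id"]))
    .run())

AnalyzerContext.successMetricsAsDataFrame(spark, result).show()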

Anomaly Detection

class pydeequ.anomaly_detection.AbsoluteChangeStrategy(maxRateDecrease=None, maxRateIncrease=None, order=1)

Bases: pydeequ.anomaly_detection._AnomalyObject

Detects anomalies based on the values’ absolute change. AbsoluteChangeStrategy(-10.0, 10.0, 1) for example calculates the first discrete difference and if some point’s value changes by more than 10.0 in one timestep, it flags it as an anomaly.

Parameters
  • maxRateDecrease (float) – Upper bound of accepted decrease (lower bound of increase).

  • maxRateIncrease (float) – Upper bound of accepted growth.

  • order (int) – Order of the calculated difference. Defaulted to 1, it calculates the difference between two consecutive values. The order of the difference can be set manually. If it is set to 0, this strategy acts like the SimpleThresholdStrategy class.

class pydeequ.anomaly_detection.BatchNormalStrategy(lowerDeviationFactor=3.0, upperDeviationFactor=3.0, includeInterval=False)

Bases: pydeequ.anomaly_detection._AnomalyObject

Detects anomalies based on the mean and standard deviation of all available values. The strategy assumes that the data is normally distributed.

Parameters
  • lowerDeviationFactor (float) – Detect anomalies if they are smaller than mean - lowerDeviationFactor * stdDev

  • upperDeviationFactor (float) – Detect anomalies if they are bigger than mean + upperDeviationFactor * stdDev

  • includeInterval (boolean) – Discerns whether or not the values inside the detection interval should be included in the calculation of the mean / stdDev.

class pydeequ.anomaly_detection.HoltWinters(metricsInterval: pydeequ.anomaly_detection.MetricInterval, seasonality: pydeequ.anomaly_detection.SeriesSeasonality)

Bases: pydeequ.anomaly_detection._AnomalyObject

Detects anomalies based on the additive Holt-Winters model. For example, if a metric is produced daily and repeats itself every Monday, the model should be created with a Daily metric interval and a Weekly seasonality parameter. To fit two cycles of data, a minimum of 15 entries must be given for (SeriesSeasonality.Weekly, MetricInterval.Daily) and 25 for (SeriesSeasonality.Yearly, MetricInterval.Monthly).

Parameters
  • metricsInterval (MetricInterval) – How often the metric of interest is computed (e.g. daily).

  • seasonality (SeriesSeasonality) – The expected seasonality, i.e. the longest cycle in the series (e.g. weekly).

class pydeequ.anomaly_detection.MetricInterval(value)

Bases: enum.Enum

Metric Interval is how often the metric of interest is computed (e.g. daily).

Daily = 'Daily'
Monthly = 'Monthly'
class pydeequ.anomaly_detection.OnlineNormalStrategy(lowerDeviationFactor=3.0, upperDeviationFactor=3.0, ignoreStartPercentage=0.1, ignoreAnomalies=True)

Bases: pydeequ.anomaly_detection._AnomalyObject

Detects anomalies based on the running mean and standard deviation. Anomalies can be excluded from the computation so that they do not affect the calculated mean / standard deviation. This strategy assumes that the data is normally distributed.

Parameters
  • lowerDeviationFactor (float) – Detect anomalies if they are smaller than mean - lowerDeviationFactor * stdDev

  • upperDeviationFactor (float) – Detect anomalies if they are bigger than mean + upperDeviationFactor * stdDev

  • ignoreStartPercentage (float) – Percentage of data points after start in which no anomalies should be detected (mean and stdDev are probably not representative before).

  • ignoreAnomalies (boolean) – If set to true, ignores anomalous points in mean and variance calculation

class pydeequ.anomaly_detection.RateOfChangeStrategy(maxRateDecrease=None, maxRateIncrease=None, order=1)

Bases: pydeequ.anomaly_detection._AnomalyObject

@Deprecated The old RateOfChangeStrategy actually detects absolute changes so it has been migrated to use the AbsoluteChangeStrategy class. Use RelativeRateOfChangeStrategy if you want to detect relative changes to the previous values.

class pydeequ.anomaly_detection.RelativeRateOfChangeStrategy(maxRateDecrease=None, maxRateIncrease=None, order=1)

Bases: pydeequ.anomaly_detection._AnomalyObject

Detects anomalies based on the values’ rate of change. For example RelativeRateOfChangeStrategy(.9, 1.1, 1) calculates the first discrete difference, and if some point’s value changes by more than 10.0 percent in one timestep it flags it as an anomaly.

Parameters
  • maxRateDecrease (float) – Lower bound of accepted relative change (as new value / old value).

  • maxRateIncrease (float) – Upper bound of accepted relative change (as new value / old value).

  • order (int) – Order of the calculated difference. Defaulted to 1, it calculates the difference between two consecutive values. The order of the difference can be set manually. If it is set to 0, this strategy acts like the SimpleThresholdStrategy class.

class pydeequ.anomaly_detection.SeriesSeasonality(value)

Bases: enum.Enum

SeriesSeasonality is the expected metric seasonality which defines the longest cycle in series. This is also referred to as periodicity.

Weekly = 'Weekly'
Yearly = 'Yearly'
class pydeequ.anomaly_detection.SimpleThresholdStrategy(lowerBound, upperBound)

Bases: pydeequ.anomaly_detection._AnomalyObject

A simple anomaly detection strategy that checks if values are in a specified range.

Parameters
  • lowerBound (float) – Lower bound of accepted range of values

  • upperBound (float) – Upper bound of accepted range of values
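
As a sketch of how a strategy plugs into a run (the classes used here are documented in the Checks, Repository and Verification sections below; spark and df are assumed to already exist):

from pydeequ.analyzers import Size
from pydeequ.anomaly_detection import RelativeRateOfChangeStrategy
from pydeequ.repository import InMemoryMetricsRepository, ResultKey
from pydeequ.verification import VerificationSuite, VerificationResult

repository = InMemoryMetricsRepository(spark)
key = ResultKey(spark, ResultKey.current_milli_time())

# Flag the run if the row count more than doubles compared to the last stored run
result = (VerificationSuite(spark)
    .onData(df)
    .useRepository(repository)
    .saveOrAppendResult(key)
    .addAnomalyCheck(RelativeRateOfChangeStrategy(maxRateIncrease=2.0), Size())
    .run())

VerificationResult.checkResultsAsDataFrame(spark, result).show()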

Checks

class pydeequ.checks.Check(spark_session: pyspark.sql.session.SparkSession, level: pydeequ.checks.CheckLevel, description: str, constraints: Optional[list] = None)

Bases: object

A class representing a list of constraints that can be applied to a given [[org.apache.spark.sql.DataFrame]]. In order to run the checks, use the run method. You can also use VerificationSuite.run to run your checks along with other Checks and Analysis objects. When run with VerificationSuite, analyzers required by multiple checks/analysis blocks are optimized to run only once.
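
A typical usage sketch, assuming an existing SparkSession spark and DataFrame df with the placeholder columns a and b:

from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

check = Check(spark, CheckLevel.Warning, "Review Check")

result = (VerificationSuite(spark)
    .onData(df)
    .addCheck(check
        .hasSize(lambda size: size >= 3)
        .isComplete("b")
        .isUnique("a")
        .isContainedIn("a", ["foo", "bar", "baz"])
        .isNonNegative("b"))
    .run())

VerificationResult.checkResultsAsDataFrame(spark, result).show()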

addConstraint(constraint)

Returns a new Check object with the given constraint added to the constraints list.

Parameters

constraint (Constraint) – New constraint to be added.

Returns

A new Check object.

addConstraints(constraints: list)
addFilterableContstraint(creationFunc)

Adds a constraint that can subsequently be replaced with a filtered version.

areAnyComplete(columns, hint=None)

Creates a constraint that asserts any completion in the combined set of columns.

Parameters
  • columns (list[str]) – Columns in Data Frame to run the assertion on.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

areAnyComplete self: A Check.scala object that asserts completion in the columns.

areComplete(columns, hint=None)

Creates a constraint that asserts completion in a combined set of columns.

Parameters
  • columns (list[str]) – Columns in Data Frame to run the assertion on.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

areComplete self: A Check.scala object that asserts completion in the columns.

containsCreditCardNumber(column, assertion=None, hint=None)

Check to run against the compliance of a column against a Credit Card pattern.

Parameters
  • column (str) – Column in DataFrame to be checked. The column is expected to be a string type.

  • assertion (lambda) – A function with an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

containsCreditCardNumber self: A Check object that runs the compliance on the column.

containsEmail(column, assertion=None, hint=None)

Check to run against the compliance of a column against an e-mail pattern.

Parameters
  • column (str) – The Column in DataFrame to be checked. The column is expected to be a string datatype.

  • assertion (lambda) – A function with an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

containsEmail self: A Check object that runs the compliance on the column.

containsSocialSecurityNumber(column, assertion=None, hint=None)

Check to run against the compliance of a column against the Social security number pattern for the US.

Parameters
  • column (str) – The Column in DataFrame to be checked. The column is expected to be a string datatype.

  • assertion (lambda) – A function with an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

containsSocialSecurityNumber self: A Check object that runs the compliance on the column.

containsURL(column, assertion=None, hint=None)

Check to run against the compliance of a column against a URL pattern.

Parameters
  • column (str) – The Column in DataFrame to be checked. The column is expected to be a string datatype.

  • assertion (lambda) – A function with an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

containsURL self: A Check object that runs the compliance on the column.
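
A short sketch of the built-in pattern checks; the column names and thresholds are assumptions, and spark is an existing SparkSession:

from pydeequ.checks import Check, CheckLevel

check = (Check(spark, CheckLevel.Error, "PII checks")
    .containsEmail("email", assertion=lambda fraction: fraction >= 0.95)               # hypothetical column
    .containsCreditCardNumber("cc_number", assertion=lambda fraction: fraction == 0.0)  # hypothetical column
    .containsURL("homepage"))                                                           # hypothetical column

The resulting Check can then be added to a VerificationSuite run via addCheck.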

evaluate(context)

Evaluate this check on computed metrics.

Parameters

context – result of the metrics computation

Returns

evaluate self: A Check object that evaluates the check.

hasApproxCountDistinct(column, assertion, hint=None)

Creates a constraint that asserts on the approximate count distinct of the given column

Parameters
  • column (str) – Column in DataFrame to run the assertion on.

  • assertion (lambda) – A function with an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

hasApproxCountDistinct self: A Check object that asserts the count distinct of the column.

hasApproxQuantile(column, quantile, assertion, hint=None)

Creates a constraint that asserts on an approximated quantile

Parameters
  • column (str) – Column in Data Frame to run the assertion on

  • quantile (float) – Quantile to run the assertion on.

  • assertion (lambda) – A function that accepts the computed quantile as an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

hasApproxQuantile self: A Check object that asserts the approximated quantile in the column.

hasCompleteness(column, assertion, hint=None)

Creates a constraint that asserts column completion. Uses the given history selection strategy to retrieve historical completeness values on this column from the history provider.

Parameters
  • column (str) – Column in Data Frame to run the assertion on.

  • assertion (lambda) – A function that accepts an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

hasCompleteness self: A Check object that implements the assertion.

hasCorrelation(columnA, columnB, assertion, hint=None)

Creates a constraint that asserts on the pearson correlation between two columns.

Parameters
  • columnA (str) – First column in Data Frame which calculates the correlation.

  • columnB (str) – Second column in Data Frame which calculates the correlation.

  • assertion (lambda) – A function that accepts an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

hasCorrelation self: A Check object that asserts the correlation calculation in the columns.

hasDataType(column, datatype: pydeequ.checks.ConstrainableDataTypes, assertion=None, hint=None)

Check to run against the fraction of rows that conform to the given data type.

Parameters
  • column (str) – The Column in DataFrame to be checked.

  • datatype (ConstrainableDataTypes) – Data type that the columns should be compared against

  • assertion (lambda) – A function with an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

hasDataType self: A Check object that runs the compliance on the column.

hasDistinctness(columns, assertion, hint=None)

Creates a constraint on the distinctness in a single or combined set of key columns. Distinctness is the fraction of distinct values of a column(s).

Parameters
  • columns (list[str]) – Column(s) in Data Frame to run the assertion on.

  • assertion (lambda) – A function that accepts an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

hasDistinctness self: A Check object that asserts distinctness in the columns.

hasEntropy(column, assertion, hint=None)

Creates a constraint that asserts on a column entropy. Entropy is a measure of the level of information contained in a message.

Parameters
  • column (str) – Column in Data Frame to run the assertion on.

  • assertion (lambda) – A function that accepts an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

hasEntropy self: A Check object that asserts the entropy in the column.

hasHistogramValues(column, assertion, binningUdf, maxBins, hint=None)

Creates a constraint that asserts on column’s value distribution.

Parameters
  • column (str) – Column in Data Frame to run the assertion on.

  • assertion (lambda) – A function that accepts an int or float parameter as a distribution input parameter.

  • binningUDF (str) – An optional binning function.

  • maxBins (int) – Histogram details are only provided for N column values with top counts. maxBins sets the N.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

hasHistogramValues self: A Check object that asserts the column’s value distribution in the column.

hasMax(column, assertion, hint=None)

Creates a constraint that asserts on the maximum of the column. The column contains either a long, int or float datatype.

Parameters
  • column (str) – Column in Data Frame to run the assertion on. The column is expected to be an int, long or float type.

  • assertion (lambda) – A function which accepts an int or float parameter that discerns the maximum value of the column.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

hasMax self: A Check object that asserts maximum of the column.

hasMaxLength(column, assertion, hint=None)

Creates a constraint that asserts on the maximum length of a string datatype column

Parameters
  • column (str) – Column in Data Frame to run the assertion on. The column is expected to be a string type.

  • assertion (lambda) – A function which accepts an int or float parameter that discerns the maximum length of the string.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

hasMaxLength self : A Check object that asserts maxLength of the column.

hasMean(column, assertion, hint=None)

Creates a constraint that asserts on the mean of the column

Parameters
  • column (str) – Column in Data Frame to run the assertion on.

  • assertion (lambda) – A function with an int or float parameter. The parameter is the mean.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

hasMean self: A Check object that asserts the mean of the column.

hasMin(column, assertion, hint=None)

Creates a constraint that asserts on the minimum of a column. The column contains either a long, int or float datatype.

Parameters
  • column (str) – Column in Data Frame to run the assertion on. The column is expected to be an int, long or float type.

  • assertion (lambda) – A function which accepts an int or float parameter that discerns the minimum value of the column.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

hasMin self: A Check object that asserts the minimum of the column.

hasMinLength(column, assertion, hint=None)

Creates a constraint that asserts on the minimum length of a string datatype column.

Parameters
  • column (str) – Column in Data Frame to run the assertion on. The column is expected to be a string type.

  • assertion (lambda) – A function which accepts the int or float parameter that discerns the minimum length of the string.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

hasMinLength self: A Check object that asserts minLength of the column.

hasMutualInformation(columnA, columnB, assertion, hint=None)

Creates a constraint that asserts on a mutual information between two columns. Mutual Information describes how much information about one column can be inferred from another.

Parameters
  • columnA (str) – First column in Data Frame which calculates the mutual information.

  • columnB (str) – Second column in Data Frame which calculates the mutual information.

  • assertion (lambda) – A function that accepts an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

hasMutualInformation self: A Check object that asserts the mutual information in the columns.

hasNumberOfDistinctValues(column, assertion, binningUdf, maxBins, hint=None)

Creates a constraint that asserts on the number of distinct values a column has.

Parameters
  • column (str) – Column in Data Frame to run the assertion on.

  • assertion (lambda) – A function that accepts an int or float parameter.

  • binningUDF (lambda) – An optional binning function.

  • maxBins (int) – Histogram details are only provided for N column values with top counts. MaxBins sets the N.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

hasNumberOfDistinctValues self: A Check object that asserts distinctness in the column.

hasPattern(column, pattern, assertion=None, name=None, hint=None)

Checks for pattern compliance. Given a column name and a regular expression, defines a Check on the average compliance of the column’s values to the regular expression.

Parameters
  • column (str) – Column in DataFrame to be checked

  • pattern (Regex) – The regular expression that the column’s values are checked against.

  • assertion (lambda) – A function with an int or float parameter.

  • name (str) – A name for the pattern constraint.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

hasPattern self: A Check object that runs the condition on the column.
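
A sketch of hasPattern with a custom regular expression; the column name, regex and threshold are assumptions:

from pydeequ.checks import Check, CheckLevel

# Expect at least 90% of product codes to look like "AB-1234"
check = (Check(spark, CheckLevel.Warning, "Pattern check")
    .hasPattern("product_code", r"[A-Z]{2}-[0-9]{4}",
                assertion=lambda fraction: fraction >= 0.9))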

hasSize(assertion, hint=None)

Creates a constraint that calculates the data frame size and runs the assertion on it.

Parameters
  • assertion (lambda) – Refers to a data frame size. The given function can include comparisons and conjunction or disjunction statements.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

hasSize self: A Check.scala object that implements the assertion on the column.

hasStandardDeviation(column, assertion, hint=None)

Creates a constraint that asserts on the standard deviation of the column

Parameters
  • column (str) – Column in Data Frame to run the assertion on.

  • assertion (lambda) – A function with an int or float parameter. The parameter is the standard deviation.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

hasStandardDeviation self: A Check object that asserts the standard deviation of the column.

hasSum(column, assertion, hint=None)

Creates a constraint that asserts on the sum of the column

Parameters
  • column (str) – Column in Data Frame to run the assertion on.

  • assertion (lambda) – A function with an int or float parameter. The parameter is the sum.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

hasSum self: A Check object that asserts the sum of the column.

hasUniqueValueRatio(columns, assertion, hint=None)

Creates a constraint on the unique value ratio in a single or combined set of key columns.

Parameters
  • columns (list[str]) – Column(s) in Data Frame to run the assertion on.

  • assertion (lambda) – A function that accepts an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

hasUniqueValueRatio self: A Check object that asserts the unique value ratio in the columns.

hasUniqueness(columns, assertion, hint=None)

Creates a constraint that asserts any uniqueness in a single or combined set of key columns. Uniqueness is the fraction of values in a column (or column combination) that occur exactly once.

Parameters
  • columns (list[str]) – Column(s) in Data Frame to run the assertion on.

  • assertion (lambda) – A function that accepts an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

hasUniqueness self: A Check object that asserts uniqueness in the columns.

haveAnyCompleteness(columns, assertion, hint=None)

Creates a constraint that asserts on any completion in the combined set of columns.

Parameters
  • columns (list[str]) – Columns in Data Frame to run the assertion on.

  • assertion (lambda) – A function that accepts an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

haveAnyCompleteness self: A Check.scala object that asserts completion in the columns.

haveCompleteness(columns, assertion, hint=None)

Creates a constraint that asserts on completed rows in a combined set of columns.

Parameters
  • columns (list[str]) – Columns in Data Frame to run the assertion on.

  • assertion (lambda) – A function that accepts an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

haveCompleteness self: A Check.scala object that implements the assertion on the columns.

isComplete(column, hint=None)

Creates a constraint that asserts on a column completion.

Parameters
  • column (str) – Column in Data Frame to run the assertion on.

  • hint (str) – A hint that discerns why a constraint could have failed.

Returns

isComplete self: A Check.scala object that asserts on a column completion.

isContainedIn(column, allowed_values)

Asserts that every non-null value in a column is contained in a set of predefined values

Parameters
  • column (str) – Column in DataFrame to run the assertion on.

  • allowed_values (list[str]) – The list of allowed values for the column.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

isContainedIn self: A Check object that runs the assertion on the columns.

isGreaterThan(columnA, columnB, assertion=None, hint=None)

Asserts that, in each row, the value of columnA is greater than the value of columnB

Parameters
  • columnA (str) – Column in DataFrame to run the assertion on.

  • columnB (str) – Column in DataFrame to run the assertion on.

  • assertion (lambda) – A function that accepts an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

isGreaterThan self: A Check object that runs the assertion on the columns.

isGreaterThanOrEqualTo(columnA, columnB, assertion=None, hint=None)

Asserts that, in each row, the value of columnA is greater than or equal to the value of columnB

Parameters
  • columnA (str) – Column in DataFrame to run the assertion on.

  • columnB (str) – Column in DataFrame to run the assertion on.

  • assertion (lambda) – A function that accepts an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

isGreaterThanOrEqualTo self: A Check object that runs the assertion on the columns.

isLessThan(columnA, columnB, assertion=None, hint=None)

Asserts that, in each row, the value of columnA is less than the value of columnB

Parameters
  • columnA (str) – Column in DataFrame to run the assertion on.

  • columnB (str) – Column in DataFrame to run the assertion on.

  • assertion (lambda) – A function that accepts an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

isLessThan self : A Check object that checks the assertion on the columns.

isLessThanOrEqualTo(columnA, columnB, assertion=None, hint=None)

Asserts that, in each row, the value of columnA is less than or equal to the value of columnB.

Parameters
  • columnA (str) – Column in DataFrame to run the assertion on.

  • columnB (str) – Column in DataFrame to run the assertion on.

  • assertion (lambda) – A function that accepts an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

isLessThanOrEqualTo self (isLessThanOrEqualTo): A Check object that checks the assertion on the columns.

isNonNegative(column, assertion=None, hint=None)

Creates a constraint which asserts that a column contains no negative values.

Parameters
  • column (str) – The Column in DataFrame to run the assertion on.

  • assertion (lambda) – A function with an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

self (isNonNegative): A Check object that runs the compliance on the column.

isPositive(column, assertion=None, hint=None)

Creates a constraint which asserts that a column contains only positive values (greater than 0).

Parameters
  • column (str) – The Column in DataFrame to run the assertion on.

  • assertion (lambda) – A function with an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

isPositive self: A Check object that runs the assertion on the column.

isPrimaryKey(column, *columns, hint=None)

Creates a constraint that asserts on a column(s) primary key characteristics. Currently only checks uniqueness, but reserved for primary key checks if there is another assertion to run on primary key columns.

Parameters
  • column (str) – Column in Data Frame to run the assertion on.

  • columns (list[str]) – Additional columns to run the assertion on.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

isPrimaryKey self: A Check.scala object that asserts the primary key characteristics (currently uniqueness) of the columns.

isUnique(column, hint=None)

Creates a constraint that asserts on a column’s uniqueness.

Parameters
  • column (str) – Column in Data Frame to run the assertion on.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

isUnique self: A Check.scala object that asserts uniqueness in the column.

kllSketchSatisfies(column, assertion, kllParameters=None, hint=None)

Creates a constraint that asserts on column’s sketch size.

Parameters
  • column (str) – Column in Data Frame to run the assertion on.

  • assertion (Lambda(BucketDistribution)) – A function that accepts an int or float parameter as a distribution input parameter.

  • kllParameters (KLLParameters) – Parameters of KLL sketch

  • hint (str) – A hint that states why a constraint could have failed.

Returns

kllSketchSatisfies self: A Check object that asserts the column’s sketch size in the column.

requiredAnalyzers()
satisfies(columnCondition, constraintName, assertion=None, hint=None)

Creates a constraint that runs the given condition on the data frame.

Parameters
  • columnCondition (str) – Data frame column which is a combination of expression and the column name. It has to comply with Spark SQL syntax and can be written exactly the same way as a condition inside a WHERE clause.

  • constraintName (str) – A name that summarizes the check being made. This name is being used to name the metrics for the analysis being done.

  • assertion (lambda) – A function with an int or float parameter.

  • hint (str) – A hint that states why a constraint could have failed.

Returns

satisfies self: A Check object that runs the condition on the data frame.
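
A sketch of satisfies; the condition, the constraint name and the column b are assumptions:

from pydeequ.checks import Check, CheckLevel

# The condition uses Spark SQL syntax, exactly as it would appear in a WHERE clause
check = (Check(spark, CheckLevel.Error, "Business rule")
    .satisfies("b >= 2.0 AND b < 10.0", "b in expected range",
               assertion=lambda fraction: fraction == 1.0))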

class pydeequ.checks.CheckLevel(value)

Bases: enum.Enum

An enumeration.

Error = 'Error'
Warning = 'Warning'
class pydeequ.checks.CheckResult

Bases: object

class pydeequ.checks.CheckStatus(value)

Bases: enum.Enum

An enumeration.

Error = 'Error'
Success = 'Success'
Warning = 'Warning'
class pydeequ.checks.ConstrainableDataTypes(value)

Bases: enum.Enum

An enumeration.

Boolean = 'Boolean'
Fractional = 'Fractional'
Integral = 'Integral'
Null = 'Null'
Numeric = 'Numeric'
String = 'String'

Profiles

Profiles file for all the Profiles classes in Deequ

class pydeequ.profiles.ColumnProfile(spark_session: pyspark.sql.session.SparkSession, column, java_column_profile)

Bases: object

Factory class for Standard and Numeric Column Profiles. This is the class for getting the Standard and Numeric Column profiles of the data.

Parameters
  • spark_session (SparkSession) – sparkSession

  • column – designated column to run a profile on

  • java_column_profile – The profile mapped as a Java map

property approximateNumDistinctValues

Getter that returns the approximate number of distinct values in the column

Returns

gets the approximate number of distinct values in the column

property column

Getter for the current column name in ColumnProfile

Returns

gets the column name in the Column Profile

property completeness

Getter that returns the completeness of data in the column

Returns

gets the calculated completeness of data in the column

property dataType

Getter that returns the datatype of the column

Returns

gets the datatype of the column

property histogram

A getter for the full value distribution of the column

Returns

gets the histogram of a column

property isDataTypeInferred

Getter that returns a boolean of whether the Data Type of the column was inferred

Returns

gets the isDataTypeInferred of the column

property typeCounts

A getter for the number of values for each datatype in the column

Returns

gets the number of values for each datatype

class pydeequ.profiles.ColumnProfilerRunBuilder(spark_session: pyspark.sql.session.SparkSession, data: pyspark.sql.dataframe.DataFrame)

Bases: object

Low level class for running profiling module

Parameters
  • spark_session (SparkSession) – sparkSession

  • data – Tabular data which will be used to construct a profile.

cacheInputs(cache_inputs: bool)

Cache the inputs

Parameters

cache_inputs (bool) – Whether to cache the inputs

Returns

Cache inputs

printStatusUpdates(print_status_updates: bool)

Print status updates between passes

Parameters

print_status_updates (bool) – Whether to print status updates

Returns

Printed status

restrictToColumns(restrict_to_columns: list)

Can be used to specify a subset of columns to look at

Parameters

restrict_to_columns (list) – Specified columns

Returns

A subset of columns to look at

run()

A method that runs a profile check on the data to obtain a ColumnProfiles class

Returns

A ColumnProfiles result

saveOrAppendResult(resultKey)

A shortcut to save the results of the run or append them to existing results in the metrics repository

Parameters

resultKey – The result key to identify the current run

Returns

A saved results of the run in the metrics repository

setKLLParameters(kllParameters: pydeequ.analyzers.KLLParameters)

Set kllParameters

Parameters

kllParameters (KLLParameters) – kllParameters(sketchSize, shrinkingFactor, numberOfBuckets)

setPredefinedTypes(dataTypesDict: dict)

Set predefined data types for each column (e.g. baseline)

Parameters

dict{"columnName" – DataTypeInstance} dataTypes: dataType map for baseline columns.

Returns

Baseline for each column, i.e. maps the dataType label to the desired DataTypeInstance

useRepository(repository)

Set a metrics repository associated with the current data to enable features like reusing previously computed results and storing the results of the current run.

Parameters

repository – A metrics repository to store and load results associated with the run

Returns

Sets a metrics repository associated with the current data

useSparkSession(sparkSession)

Use a sparkSession to conveniently create output files

Parameters

sparkSession (SparkSession) – sparkSession

Returns

A SparkSession to create output files

withKLLProfiling()

Enable KLL Sketches profiling on Numerical columns, disabled by default.

Returns

Enable KLL Sketches profiling on Numerical columns, disabled by default.

withLowCardinalityHistogramThreshold(low_cardinality_histogram_threshold: int)

Set the threshold on the number of distinct values up to which histograms are calculated (computing histograms for high-cardinality columns is expensive)

Parameters

low_cardinality_histogram_threshold (int) – The designated threshold

Returns

a set threshold

class pydeequ.profiles.ColumnProfilerRunner(spark_session: pyspark.sql.session.SparkSession)

Bases: object

Primary class for interacting with the profiles module.

Parameters

spark_session (SparkSession) – sparkSession

onData(df)

Starting point to construct a profile

Parameters

df – Tabular data that the profiles module will use

Returns

The starting point to construct a profile
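
A minimal profiling sketch, assuming an existing SparkSession spark and DataFrame df:

from pydeequ.profiles import ColumnProfilerRunner

result = (ColumnProfilerRunner(spark)
    .onData(df)
    .run())

# result.profiles maps column names to ColumnProfile / NumericColumnProfile objects
for col_name, profile in result.profiles.items():
    print(col_name, profile.completeness, profile.approximateNumDistinctValues)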

run(data, restrictToColumns, lowCardinalityHistogramThreshold, printStatusUpdates, cacheInputs, fileOutputOptions, metricsRepositoryOptions, kllParameters, predefinedTypes)
Parameters
  • data

  • restrictToColumns

  • lowCardinalityHistogramThreshold

  • printStatusUpdates

  • cacheInputs

  • fileOutputOptions

  • metricsRepositoryOptions

  • kllParameters

  • predefinedTypes

Returns

class pydeequ.profiles.ColumnProfilesBuilder(spark_session: pyspark.sql.session.SparkSession)

Bases: object

property profiles

A getter for profiles

Returns

the column profiles, keyed by column name

class pydeequ.profiles.NumericColumnProfile(spark_session: pyspark.sql.session.SparkSession, column, java_column_profile)

Bases: pydeequ.profiles.ColumnProfile

Numeric Column Profile class

Parameters
  • spark_session (SparkSession) – sparkSession

  • column – the designated column of which the profile is run on

  • java_column_profile – The profile mapped as a Java map

property approxPercentiles

A getter for the approximate percentiles of the numeric column

Returns

gets the approximate percentiles of the column

property kll

A getter for the kll value of a numeric column

Returns

gets the kll value of a numeric column

property maximum

A getter for the maximum value of the numeric column

Returns

gets the maximum value of the column

property mean

A getter for the calculated mean of the numeric column

Returns

gets the mean of the column

property minimum

A getter for the minimum value of the numeric column

Returns

gets the minimum value of the column

property stdDev

A getter for the standard deviation of the numeric column

Returns

gets the standard deviation of the column

property sum

A getter for the sum of the numeric column

Returns

gets the sum value of the column

class pydeequ.profiles.StandardColumnProfile(spark_session: pyspark.sql.session.SparkSession, column, java_column_profile)

Bases: pydeequ.profiles.ColumnProfile

Standard Column Profile class

Parameters
  • spark_session (SparkSession) – sparkSession

  • column – the designated column of which the profile is run on

  • java_column_profile – The profile mapped as a Java map

Repository

Repository file for all the different metrics repository classes in Deequ

Author: Calvin Wang

class pydeequ.repository.FileSystemMetricsRepository(spark_session: pyspark.sql.session.SparkSession, path: Optional[str] = None)

Bases: pydeequ.repository.MetricsRepository

High level FileSystemMetricsRepository Interface

Parameters
  • spark_session – SparkSession

  • path – The location of the file metrics repository.

class pydeequ.repository.InMemoryMetricsRepository(spark_session: pyspark.sql.session.SparkSession)

Bases: pydeequ.repository.MetricsRepository

High level InMemoryMetricsRepository Interface

class pydeequ.repository.MetricsRepository

Bases: object

Base class for Metrics Repository

after(dateTime: int)

Only look at AnalysisResults with a result key with a greater value

Parameters

dateTime – The minimum dateTime of AnalysisResults to look at

before(dateTime: int)

Only look at AnalysisResults with a result key with a smaller value

Parameters

dateTime – The maximum dateTime of AnalysisResults to look at

forAnalyzers(analyzers: list)

Choose all metrics that you want to load

Parameters

analyzers – List of analyzers whose resulting metrics you want to load

getSuccessMetricsAsDataFrame(withTags: Optional[list] = None, pandas: bool = False)

Get the AnalysisResult as DataFrame

Parameters

withTags – List of tags to filter previous Metrics Repository runs with

getSuccessMetricsAsJson(withTags: Optional[list] = None)

Get the AnalysisResult as JSON

Parameters

withTags – List of tags to filter previous Metrics Repository runs with

classmethod helper_metrics_file(spark_session: pyspark.sql.session.SparkSession, filename: str = 'metrics.json')

Helper method to create the metrics file for storage

load()

Get a builder class to construct a loading query to get AnalysisResults

withTagValues(tagValues: dict)

Filter out results that don’t have specific values for specific tags

Parameters

tagValues – Dict with tag names and the corresponding values to filter for

class pydeequ.repository.ResultKey(spark_session: pyspark.sql.session.SparkSession, dataSetDate: Optional[int] = None, tags: Optional[dict] = None)

Bases: object

Information that uniquely identifies an AnalysisResult

Parameters
  • spark_session – SparkSession

  • dataSetDate – Date of the result key

  • tags – A map with additional annotations

static current_milli_time()

Get the current time in milliseconds
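
A sketch of storing and re-loading metrics with a file-based repository; spark, df, the column b, the file name and the tag are assumptions:

from pydeequ.analyzers import AnalysisRunner, Size, Completeness
from pydeequ.repository import FileSystemMetricsRepository, ResultKey

metrics_file = FileSystemMetricsRepository.helper_metrics_file(spark, 'metrics.json')
repository = FileSystemMetricsRepository(spark, metrics_file)
key_tags = {'tag': 'daily_run'}
result_key = ResultKey(spark, ResultKey.current_milli_time(), key_tags)

(AnalysisRunner(spark)
    .onData(df)
    .addAnalyzer(Size())
    .addAnalyzer(Completeness("b"))
    .useRepository(repository)
    .saveOrAppendResult(result_key)
    .run())

# Load everything stored in the repository up to now
repository.load() \
    .before(ResultKey.current_milli_time()) \
    .getSuccessMetricsAsDataFrame() \
    .show()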

Scala Utilities

A collection of utility functions and classes for manipulating Scala objects and classes through py4j

class pydeequ.scala_utils.PythonCallback(gateway)

Bases: object

class pydeequ.scala_utils.ScalaFunction1(gateway, lambda_function)

Bases: pydeequ.scala_utils.PythonCallback

Implements scala.Function1 interface so we can pass lambda functions to Check https://www.scala-lang.org/api/current/scala/Function1.html

class Java

Bases: object

scala.Function1: a function that takes one argument

implements = ['scala.Function1']
apply(arg)

Implements the apply function

class pydeequ.scala_utils.ScalaFunction2(gateway, lambda_function)

Bases: pydeequ.scala_utils.PythonCallback

Implements scala.Function2 interface https://www.scala-lang.org/api/current/scala/Function2.html

class Java

Bases: object

scala.Function2: a function that takes two arguments

implements = ['scala.Function2']
apply(t1, t2)

Implements the apply function

pydeequ.scala_utils.get_or_else_none(scala_option)
pydeequ.scala_utils.java_list_to_python_list(java_list: str, datatype)
pydeequ.scala_utils.scala_map_to_dict(jvm, scala_map)
pydeequ.scala_utils.scala_map_to_java_map(jvm, scala_map)
pydeequ.scala_utils.to_scala_map(spark_session, d)

Convert a dict into a JVM Map.

Parameters
  • spark_session – Spark session

  • d – Python dictionary

Returns

Scala map

pydeequ.scala_utils.to_scala_seq(jvm, iterable)

Helper method to take an iterable and turn it into a Scala sequence

Parameters
  • jvm – Spark session’s JVM

  • iterable – your iterable

Returns

Scala sequence

Suggestions

Suggestions file for all the Constraint Suggestion classes in Deequ

Author: Calvin Wang

class pydeequ.suggestions.CategoricalRangeRule

Bases: pydeequ.suggestions._RulesObject

If we see a categorical range for a column, we suggest an IS IN (…) constraint

property rule_jvm
class pydeequ.suggestions.CompleteIfCompleteRule

Bases: pydeequ.suggestions._RulesObject

If a column is complete in the sample, we suggest a NOT NULL constraint

property rule_jvm
class pydeequ.suggestions.ConstraintSuggestion

Bases: object

class pydeequ.suggestions.ConstraintSuggestionResult

Bases: object

The result returned from the ConstraintSuggestionSuite

class pydeequ.suggestions.ConstraintSuggestionRunBuilder(spark_session: pyspark.sql.session.SparkSession, df: pyspark.sql.dataframe.DataFrame)

Bases: object

Low level class for running suggestions module

Parameters
  • spark_session (SparkSession) – sparkSession

  • df – Tabular data that the suggestions module will use

addConstraintRule(constraintRule)

Add a single rule for suggesting constraints based on ColumnProfiles to the run.

Parameters

constraintRule (ConstraintRule) – A rule that the dataset will be evaluated against throughout the run. To run all the rules on the dataset, use .addConstraintRule(DEFAULT())

Returns

self for further method calls.

run()

A method that runs the desired ConstraintSuggestionRunBuilder functions on the data to obtain a constraint suggestion result. The result is then translated to Python.

Returns

A constraint suggestion result

class pydeequ.suggestions.ConstraintSuggestionRunner(spark_session: pyspark.sql.session.SparkSession)

Bases: object

Primary class for interacting with suggestions module

Parameters

spark_session (SparkSession) – sparkSession

onData(df)

Starting point to construct suggestions

Parameters

df – Tabular data that the suggestions module will use

Returns

The starting point to construct a suggestions module
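
A minimal sketch of the suggestions workflow, assuming an existing SparkSession spark and DataFrame df (the DEFAULT rule set is described below):

import json
from pydeequ.suggestions import ConstraintSuggestionRunner, DEFAULT

suggestion_result = (ConstraintSuggestionRunner(spark)
    .onData(df)
    .addConstraintRule(DEFAULT())
    .run())

# run() translates the result to Python, so it can be inspected directly
print(json.dumps(suggestion_result, indent=2))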

class pydeequ.suggestions.DEFAULT

Bases: pydeequ.suggestions._RulesObject

DEFAULT runs all the rules on the dataset.

property rule_jvm

class pydeequ.suggestions.FractionalCategoricalRangeRule(targetDataCoverageFraction: float = 0.9)

Bases: pydeequ.suggestions._RulesObject

If we see a categorical range for most values in a column, we suggest an IS IN (…) constraint that should hold for most values

Parameters

targetDataCoverageFraction (float) – The fraction of values that the suggested categorical range should cover

property rule_jvm
class pydeequ.suggestions.NonNegativeNumbersRule

Bases: pydeequ.suggestions._RulesObject

If we see only non-negative numbers in a column, we suggest a corresponding constraint

property rule_jvm
class pydeequ.suggestions.RetainCompletenessRule

Bases: pydeequ.suggestions._RulesObject

If a column is incomplete in the sample, we model its completeness as a binomial variable, estimate a confidence interval and use this to define a lower bound for the completeness

property rule_jvm
class pydeequ.suggestions.RetainTypeRule

Bases: pydeequ.suggestions._RulesObject

If we detect a non-string type, we suggest a type constraint

property rule_jvm
class pydeequ.suggestions.UniqueIfApproximatelyUniqueRule

Bases: pydeequ.suggestions._RulesObject

If the ratio of approximate num distinct values in a column is close to the number of records (within error of HLL sketch), we suggest a UNIQUE constraint

property rule_jvm

Verification

class pydeequ.verification.AnomalyCheckConfig

Bases: object

class pydeequ.verification.VerificationResult(spark_session: pyspark.sql.session.SparkSession, verificationRun)

Bases: object

The results returned from the VerificationSuite

Parameters

verificationRun – The verification result of run()

property checkResults

Getter for the checks and checkResults of the verification run()

Returns

JSON with the checks and the checkResults of the verification run()

classmethod checkResultsAsDataFrame(spark_session: pyspark.sql.session.SparkSession, verificationResult, forChecks=None, pandas: bool = False)

Returns the verification results as a Data Frame

Parameters
  • spark_session (SparkSession) – SparkSession

  • verificationResult – The results of the verification run

  • forChecks – Subset of Checks

Returns

returns a Data Frame of the results

classmethod checkResultsAsJson(spark_session: pyspark.sql.session.SparkSession, verificationResult, forChecks=None)

Returns the check verification Results as a JSON

Parameters
  • spark_session (SparkSession) – SparkSession

  • verificationResult – The results of the verification run

  • forChecks – Subset of Checks

Returns

returns a JSON

property metrics

Getter for the analyzers and their metric results of the verification run()

Returns

gets the Analyzers and their metric results of the verification run()

property status

Getter for the overall status of the verification run()

Returns

gets the overall status of the verification run()

classmethod successMetricsAsDataFrame(spark_session: pyspark.sql.session.SparkSession, verificationResult, forAnalyzers: Optional[list] = None, pandas: bool = False)

The results returned in a Data Frame

Parameters
  • spark_session – SparkSession

  • verificationResult – Result of the verification Run()

  • forAnalyzers – Subset of Analyzers from the Analysis Run

Returns

Data frame of the verification Run()

classmethod successMetricsAsJson(spark_session: pyspark.sql.session.SparkSession, verificationResult, forAnalyzers: Optional[list] = None)

The results returned in a JSON

Parameters
  • spark_session (SparkSession) – SparkSession

  • verificationResult – Result of the verification Run()

  • forAnalyzers – Subset of Analyzers from the Analysis Run

Returns

JSON of the verification Run()

class pydeequ.verification.VerificationRunBuilder(spark_session: pyspark.sql.session.SparkSession, data: pyspark.sql.dataframe.DataFrame)

Bases: object

A class to build a Verification Run

addAnomalyCheck(anomaly, analyzer: pydeequ.analyzers._AnalyzerObject, anomalyCheckConfig=None)

Add a check using anomaly_detection methods. The Anomaly Detection Strategy only checks if the new value is an Anomaly.

Parameters
  • anomaly – The anomaly detection strategy

  • analyzer (_AnalyzerObject) – The analyzer for the metric to run anomaly detection on

  • anomalyCheckConfig – Some configuration settings for the Check

Returns

Adds an anomaly strategy to the run

addCheck(check: pydeequ.checks.Check)

Adds checks to the run using the checks method.

Parameters

check (Check) – A check object to be executed during the run

Returns

Adds checks to the run

run()

A method that runs the desired VerificationRunBuilder functions on the data to obtain a Verification Result.

Returns

a VerificationResult object

saveOrAppendResult(resultKey)

A shortcut to save the results of the run or append them to the existing results in the metrics repository.

Parameters

resultKey – The result key to identify the current run

Returns

A VerificationRunBuilder.scala object that saves or appends a result

useRepository(repository)

This method reassigns our AnalysisRunBuilder because useRepository returns a different class: AnalysisRunBuilderWithRepository.

Sets a metrics repository associated with the current data to enable features like reusing previously computed results and storing the results of the current run.

Parameters

repository – a metrics repository to store and load results associated with the run

class pydeequ.verification.VerificationRunBuilderWithSparkSession(spark_session: pyspark.sql.session.SparkSession, data: pyspark.sql.dataframe.DataFrame)

Bases: pydeequ.verification.VerificationRunBuilder

class pydeequ.verification.VerificationSuite(spark_session)

Bases: object

Responsible for running checks and the required analysis, and returning the results

onData(df)

Starting point to construct a VerificationRun.

Parameters

df – Tabular data on which the checks should be verified

Returns

The starting point to construct a VerificationRun