The dagster_pandas library provides utilities for using pandas with Dagster and for implementing validation on pandas DataFrames. A good place to start with dagster_pandas is the validation guide.
dagster_pandas.create_dagster_pandas_dataframe_type(name, description=None, columns=None, event_metadata_fn=None, dataframe_constraints=None, loader=None, materializer=None)
Constructs a custom pandas dataframe dagster type.
name (str) – Name of the dagster pandas type.
description (Optional[str]) – A markdown-formatted string, displayed in tooling.
columns (Optional[List[PandasColumn]]) – A list of PandasColumn objects which express dataframe column schemas and constraints.
event_metadata_fn (Optional[Callable[[], Union[Dict[str, Union[str, float, int, Dict, EventMetadata]], List[EventMetadataEntry]]]]) – A callable which takes your dataframe and returns a dict with string label keys and EventMetadata values. Can optionally return a List[EventMetadataEntry].
dataframe_constraints (Optional[List[DataFrameConstraint]]) – A list of objects that inherit from DataFrameConstraint. This allows you to express dataframe-level constraints.
loader (Optional[DagsterTypeLoader]) – An instance of a class that inherits from DagsterTypeLoader. If None, we will default to using dataframe_loader.
materializer (Optional[DagsterTypeMaterializer]) – An instance of a class that inherits from DagsterTypeMaterializer. If None, we will default to using dataframe_materializer.
dagster_pandas.RowCountConstraint(num_allowed_rows, error_tolerance=0)
A dataframe constraint that validates the expected count of rows.
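The text above does not spell out how error_tolerance is applied. The sketch below assumes it is an absolute, inclusive band around num_allowed_rows; the helper row_count_within_tolerance is hypothetical and is not part of dagster_pandas.

```python
import pandas as pd


def row_count_within_tolerance(df, num_allowed_rows, error_tolerance=0):
    """Return True when len(df) is within error_tolerance rows of the target.

    Assumption: the tolerance is an absolute number of rows, applied
    symmetrically and inclusively.
    """
    lower = num_allowed_rows - error_tolerance
    upper = num_allowed_rows + error_tolerance
    return lower <= len(df) <= upper


df = pd.DataFrame({"x": range(9)})  # 9 rows

exact = row_count_within_tolerance(df, 10)
tolerant = row_count_within_tolerance(df, 10, error_tolerance=2)
print(exact, tolerant)
```

Under this reading, a 9-row frame fails an exact expectation of 10 rows but passes once a tolerance of 2 rows is allowed.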
dagster_pandas.StrictColumnsConstraint(strict_column_list, enforce_ordering=False)
A dataframe constraint that validates column existence and ordering.
dagster_pandas.PandasColumn(name, constraints=None, is_required=None)
The main API for expressing column level schemas and constraints for your custom dataframe types.
name (str) – Name of the column. This must match up with the column name in the dataframe you expect to receive.
is_required (Optional[bool]) – Flag indicating the optional/required presence of the column. If the column exists, the validate function will validate the column. Defaults to True.
constraints (Optional[List[Constraint]]) – List of constraint objects that indicate the validation rules for the pandas column.
boolean_column(name, non_nullable=False, unique=False, ignore_missing_vals=False, is_required=None)
Simple constructor for PandasColumns that expresses boolean constraints on boolean dtypes.
name (str) – Name of the column. This must match up with the column name in the dataframe you expect to receive.
non_nullable (Optional[bool]) – If true, this column will enforce a constraint that all values in the column are non-null.
unique (Optional[bool]) – If true, this column will enforce a uniqueness constraint on the column values.
ignore_missing_vals (Optional[bool]) – A flag that is passed into most constraints. If true, the constraint will only evaluate non-null data. ignore_missing_vals and non_nullable cannot both be True.
is_required (Optional[bool]) – Flag indicating the optional/required presence of the column. If the column exists, the validate function will validate the column. Defaults to True.
categorical_column(name, categories, of_types='object', non_nullable=False, unique=False, ignore_missing_vals=False, is_required=None)
Simple constructor for PandasColumns that expresses categorical constraints on specified dtypes.
name (str) – Name of the column. This must match up with the column name in the dataframe you expect to receive.
categories (List[Any]) – The valid set of buckets that all values in the column must match.
of_types (Optional[Union[str, Set[str]]]) – The expected dtype[s] that your categories and values must abide by.
non_nullable (Optional[bool]) – If true, this column will enforce a constraint that all values in the column are non-null.
unique (Optional[bool]) – If true, this column will enforce a uniqueness constraint on the column values.
ignore_missing_vals (Optional[bool]) – A flag that is passed into most constraints. If true, the constraint will only evaluate non-null data. ignore_missing_vals and non_nullable cannot both be True.
is_required (Optional[bool]) – Flag indicating the optional/required presence of the column. If the column exists, the validate function will validate the column. Defaults to True.
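To make the categories and of_types parameters concrete, here is a sketch in plain pandas of the two checks they describe: the column's dtype must be one of the expected dtypes, and every value must fall in the allowed set. The helper check_categorical and the sample data are hypothetical; this is an illustration of the idea, not the library's implementation.

```python
import pandas as pd


def check_categorical(series, categories, of_types="object"):
    """Return (dtype_ok, unexpected_values) for a column.

    Assumption: of_types is compared against the string form of the dtype.
    """
    allowed_dtypes = {of_types} if isinstance(of_types, str) else set(of_types)
    dtype_ok = str(series.dtype) in allowed_dtypes
    # Values outside the allowed buckets are the ones that would be rejected.
    unexpected = series[~series.isin(categories)]
    return dtype_ok, unexpected.tolist()


status = pd.Series(["open", "closed", "pending"], dtype="object")
dtype_ok, unexpected = check_categorical(status, ["open", "closed"])
print(dtype_ok, unexpected)
```

Here the dtype matches the default of_types value, while "pending" is reported because it is not among the allowed categories.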
datetime_column(name, min_datetime=Timestamp('1677-09-21 00:12:43.145225'), max_datetime=Timestamp('2262-04-11 23:47:16.854775807'), non_nullable=False, unique=False, ignore_missing_vals=False, is_required=None, tz=None)
Simple constructor for PandasColumns that expresses datetime constraints on 'datetime64[ns]' dtypes.
name (str) – Name of the column. This must match up with the column name in the dataframe you expect to receive.
min_datetime (Optional[pandas.Timestamp]) – The lower bound for values you expect in this column. Defaults to pandas.Timestamp.min.
max_datetime (Optional[pandas.Timestamp]) – The upper bound for values you expect in this column. Defaults to pandas.Timestamp.max.
non_nullable (Optional[bool]) – If true, this column will enforce a constraint that all values in the column are non-null.
unique (Optional[bool]) – If true, this column will enforce a uniqueness constraint on the column values.
ignore_missing_vals (Optional[bool]) – A flag that is passed into most constraints. If true, the constraint will only evaluate non-null data. ignore_missing_vals and non_nullable cannot both be True.
is_required (Optional[bool]) – Flag indicating the optional/required presence of the column. If the column exists, the validate function will validate the column. Defaults to True.
tz (Optional[str]) – Required timezone for values, e.g. tz='UTC', tz='Europe/Dublin', tz='US/Eastern'. Defaults to None, meaning naive datetime values.
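The distinction between naive and timezone-aware values that tz refers to can be seen directly in pandas. The sketch below uses made-up timestamps and only standard pandas calls; how datetime_column inspects the column internally is not shown here.

```python
import pandas as pd

# Naive datetimes: no timezone attached (corresponds to tz=None).
naive = pd.Series(pd.to_datetime(["2021-01-01 09:00", "2021-06-01 17:30"]))

# Timezone-aware datetimes: what a column declared with tz='UTC' would hold.
aware = naive.dt.tz_localize("UTC")

naive_tz = naive.dt.tz       # None for naive data
aware_tz = str(aware.dt.tz)  # name of the attached timezone
print(naive_tz, aware_tz)

# Bounds for a datetime column are timestamps, so a range check compares
# each value against pandas.Timestamp objects.
lower = pd.Timestamp("2021-01-01")
upper = pd.Timestamp("2021-12-31")
in_range = naive.between(lower, upper).all()
print(in_range)
```

A frame whose column carries a timezone other than the declared one, or none at all, would be the kind of mismatch the tz argument is meant to catch.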
exists(name, non_nullable=False, unique=False, ignore_missing_vals=False, is_required=None)
Simple constructor for PandasColumns that expresses existence constraints.
name (str) – Name of the column. This must match up with the column name in the dataframe you expect to receive.
non_nullable (Optional[bool]) – If true, this column will enforce a constraint that all values in the column are non-null.
unique (Optional[bool]) – If true, this column will enforce a uniqueness constraint on the column values.
ignore_missing_vals (Optional[bool]) – A flag that is passed into most constraints. If true, the constraint will only evaluate non-null data. ignore_missing_vals and non_nullable cannot both be True.
is_required (Optional[bool]) – Flag indicating the optional/required presence of the column. If the column exists, the validate function will validate the column. Defaults to True.
float_column(name, min_value=-inf, max_value=inf, non_nullable=False, unique=False, ignore_missing_vals=False, is_required=None)
Simple constructor for PandasColumns that expresses numeric constraints on float dtypes.
name (str) – Name of the column. This must match up with the column name in the dataframe you expect to receive.
min_value (Optional[Union[int,float]]) – The lower bound for values you expect in this column. Defaults to -float('inf').
max_value (Optional[Union[int,float]]) – The upper bound for values you expect in this column. Defaults to float('inf').
non_nullable (Optional[bool]) – If true, this column will enforce a constraint that all values in the column are non-null.
unique (Optional[bool]) – If true, this column will enforce a uniqueness constraint on the column values.
ignore_missing_vals (Optional[bool]) – A flag that is passed into most constraints. If true, the constraint will only evaluate non-null data. ignore_missing_vals and non_nullable cannot both be True.
is_required (Optional[bool]) – Flag indicating the optional/required presence of the column. If the column exists, the validate function will validate the column. Defaults to True.
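The interaction between a range bound and ignore_missing_vals is easier to see on a small column. The following is a sketch in plain pandas of one way such a check can behave, assuming inclusive bounds; the helper out_of_range_mask and the data are hypothetical and not taken from dagster_pandas.

```python
import pandas as pd


def out_of_range_mask(series, min_value=-float("inf"), max_value=float("inf"),
                      ignore_missing_vals=False):
    """Return a boolean mask marking rows that violate the range."""
    # between() is inclusive on both ends and evaluates to False for NaN,
    # so missing values count as violations unless they are masked out.
    invalid = ~series.between(min_value, max_value)
    if ignore_missing_vals:
        # Only evaluate non-null data: drop missing rows from the violations.
        invalid = invalid & series.notna()
    return invalid


prices = pd.Series([1.5, float("nan"), 250.0])

strict = out_of_range_mask(prices, min_value=0, max_value=100)
lenient = out_of_range_mask(prices, min_value=0, max_value=100,
                            ignore_missing_vals=True)
print(strict.tolist())   # [False, True, True]
print(lenient.tolist())  # [False, False, True]
```

In this sketch the missing value is flagged by the strict check and skipped by the lenient one, while 250.0 is out of range in both cases.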
integer_column(name, min_value=-inf, max_value=inf, non_nullable=False, unique=False, ignore_missing_vals=False, is_required=None)
Simple constructor for PandasColumns that expresses numeric constraints on integer dtypes.
name (str) – Name of the column. This must match up with the column name in the dataframe you expect to receive.
min_value (Optional[Union[int,float]]) – The lower bound for values you expect in this column. Defaults to -float('inf').
max_value (Optional[Union[int,float]]) – The upper bound for values you expect in this column. Defaults to float('inf').
non_nullable (Optional[bool]) – If true, this column will enforce a constraint that all values in the column are non-null.
unique (Optional[bool]) – If true, this column will enforce a uniqueness constraint on the column values.
ignore_missing_vals (Optional[bool]) – A flag that is passed into most constraints. If true, the constraint will only evaluate non-null data. ignore_missing_vals and non_nullable cannot both be True.
is_required (Optional[bool]) – Flag indicating the optional/required presence of the column. If the column exists, the validate function will validate the column. Defaults to True.
numeric_column(name, min_value=-inf, max_value=inf, non_nullable=False, unique=False, ignore_missing_vals=False, is_required=None)
Simple constructor for PandasColumns that expresses numeric constraints on numeric dtypes.
name (str) – Name of the column. This must match up with the column name in the dataframe you expect to receive.
min_value (Optional[Union[int,float]]) – The lower bound for values you expect in this column. Defaults to -float('inf').
max_value (Optional[Union[int,float]]) – The upper bound for values you expect in this column. Defaults to float('inf').
non_nullable (Optional[bool]) – If true, this column will enforce a constraint that all values in the column are non-null.
unique (Optional[bool]) – If true, this column will enforce a uniqueness constraint on the column values.
ignore_missing_vals (Optional[bool]) – A flag that is passed into most constraints. If true, the constraint will only evaluate non-null data. ignore_missing_vals and non_nullable cannot both be True.
is_required (Optional[bool]) – Flag indicating the optional/required presence of the column. If the column exists, the validate function will validate the column. Defaults to True.
string_column(name, non_nullable=False, unique=False, ignore_missing_vals=False, is_required=None)
Simple constructor for PandasColumns that expresses constraints on string dtypes.
name (str) – Name of the column. This must match up with the column name in the dataframe you expect to receive.
non_nullable (Optional[bool]) – If true, this column will enforce a constraint that all values in the column are non-null.
unique (Optional[bool]) – If true, this column will enforce a uniqueness constraint on the column values.
ignore_missing_vals (Optional[bool]) – A flag that is passed into most constraints. If true, the constraint will only evaluate non-null data. ignore_missing_vals and non_nullable cannot both be True.
is_required (Optional[bool]) – Flag indicating the optional/required presence of the column. If the column exists, the validate function will validate the column. Defaults to True.
dagster_pandas.DataFrame = <dagster.core.types.dagster_type.DagsterType object>
Define a type in dagster. These can be used in the inputs and outputs of solids.
type_check_fn (Callable[[TypeCheckContext, Any], Union[bool, TypeCheck]]) – The function that defines the type check. It takes the value flowing through the input or output of the solid. If it passes, return either True or a TypeCheck with success set to True. If it fails, return either False or a TypeCheck with success set to False. The first argument must be named context (or, if unused, _, _context, or context_). Use required_resource_keys for access to resources.
key (Optional[str]) – The unique key to identify types programmatically. The key property always has a value. If you omit the key argument to the init function, it instead receives the value of name. If neither key nor name is provided, a CheckError is thrown. In the case of a generic type such as List or Optional, this is generated programmatically based on the type parameters. For most use cases, name should be set and the key argument should not be specified.
name (Optional[str]) – A unique name given by a user. If key is None, key becomes this value. Name is not given in a case where the user does not specify a unique name for this type, such as a generic class.
description (Optional[str]) – A markdown-formatted string, displayed in tooling.
loader (Optional[DagsterTypeLoader]) – An instance of a class that inherits from DagsterTypeLoader and can map config data to a value of this type. Specify this argument if you will need to shim values of this type using the config machinery. As a rule, you should use the @dagster_type_loader decorator to construct these arguments.
materializer (Optional[DagsterTypeMaterializer]) – An instance of a class that inherits from DagsterTypeMaterializer and can persist values of this type. As a rule, you should use the @dagster_type_materializer decorator to construct these arguments.
serialization_strategy (Optional[SerializationStrategy]) – An instance of a class that inherits from SerializationStrategy. The default strategy for serializing this value when automatically persisting it between execution steps. You should set this value if the ordinary serialization machinery (e.g., pickle) will not be adequate for this type.
auto_plugins (Optional[List[Type[TypeStoragePlugin]]]) – If types must be serialized differently depending on the storage being used for intermediates, they should specify this argument. In these cases the serialization_strategy argument is not sufficient because serialization requires specialized API calls, e.g. to call an S3 API directly instead of using a generic file object. See dagster_pyspark.DataFrame for an example.
required_resource_keys (Optional[Set[str]]) – Resource keys required by the type_check_fn.
is_builtin (bool) – Defaults to False. This is used by tools to display or filter built-in types (such as String, Int) to visually distinguish them from user-defined types. Meant for internal use.
kind (DagsterTypeKind) – Defaults to None. This is used to determine the kind of runtime type for InputDefinition and OutputDefinition type checking.
typing_type – Defaults to None. A valid python typing type (e.g. Optional[List[int]]) for the value contained within the DagsterType. Meant for internal use.