Dataclasses with Inheritance?
Dataclasses and non-dataclasses inheritance
When should I implement __post_init__?
Modifying a dataclass's __init__() method in a sub-dataclass
I have a class Animal and Dog which inherits Animal. Why is it that I get an error if I try to give my Dog class a breed field?
TypeError: non-default argument 'breed' follows default argument
This is my code
from dataclasses import dataclass

@dataclass
class Animal:
    species: str
    arms: int
    legs: int

@dataclass
class Dog(Animal):
    breed: str
    species: str = "Dog"
    arms: int = 0
    legs: int = 4

if __name__ == '__main__':
    jake = Dog(breed="Bulldog")
    print(jake)
I did find that if I add a breed field to Animal I wouldn't get the error.
I'm having a hard time wrapping my head around when to use __post_init__ in general. I'm building some stuff using the @dataclass decorator, but I don't really see the point of __post_init__ if the init argument is already True by default. At that point, what would __post_init__ be doing that __init__ hasn't already done? The dataclass is going to do its own thing and also define its own repr, so I guess the same question applies to defining a __repr__ for a dataclass.
Maybe it's just for customization purposes that both of those are optional. But at that point, what would be the point of a dataclass over a regular class? Assume I do something like this:
@dataclass(init=False, repr=False)
class Thing:
    def __init__(self):
        ...

    def __repr__(self):
        ...

# what else is @dataclass doing if both of these I have to implement
# ik there are more magic / dunder methods to each class,
# is it making this type 'Thing' more operable with others that share those features?

I guess what I'm getting at is: What would the dataclass be doing for me that a regular class wouldn't?
Idk maybe that didn't make sense. I'm confused haha, maybe I just don't know. Maybe I'm using it wrong, that probably is the case lol. HALP!!! lol
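As a rough illustration of where __post_init__ can earn its keep, here is a minimal sketch. The Order class and its fields are made up for this example. The generated __init__ only assigns the arguments, so validation and derived values are the kind of work left over for __post_init__, while @dataclass still supplies __repr__ and __eq__.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    quantity: int
    unit_price: float
    # Derived value: not an __init__ parameter, filled in by __post_init__.
    total: float = field(init=False)

    def __post_init__(self):
        # Runs at the end of the generated __init__, once the fields are set.
        if self.quantity < 0:
            raise ValueError("quantity must not be negative")
        self.total = self.quantity * self.unit_price


a = Order(3, 2.5)
b = Order(3, 2.5)
print(a)       # the repr is generated by @dataclass
print(a == b)  # so is __eq__, which compares the fields
```

Passing init=False and repr=False and writing both methods by hand gives up most of that, which is probably why it feels pointless in that setup.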
In the Python standard library's documentation for Dataclasses, there is a section called post init processing. In that section is this bit of code:
@dataclass
class Rectangle:
    height: float
    width: float

@dataclass
class Square(Rectangle):
    side: float

    def __post_init__(self):
        super().__init__(self.side, self.side)

The accompanying text says: "The __init__() method generated by dataclass() does not call base class __init__() methods. If the base class has an __init__() method that has to be called, it is common to call this method in a __post_init__() method"
That's all fine and dandy but in my experience this has not worked as I would expect. This is because attempting to instantiate Square() causes its __init__() method to ask for 3 arguments: height, width, and side. I understand this is because of how Dataclass inheritance is advertised to work. What doesn't make sense is why the above example is given as an example of post-processing when the caller of Square() would be supplying redundant values.
It seems silly to require the caller of Square() to supply the height, width, and side if they are all going to eventually be the same value.
In [2]: @dataclass
   ...: class Rectangle:
   ...:     height: float
   ...:     width: float
   ...:
   ...: @dataclass
   ...: class Square(Rectangle):
   ...:     side: float
   ...:
   ...:     def __post_init__(self):
   ...:         super().__init__(self.side, self.side)
   ...:

In [3]: s = Square(10)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 s = Square(10)

TypeError: Square.__init__() missing 2 required positional arguments: 'width' and 'side'
The only way I have found to overcome this is to define a custom __init__() in Square, like so:
In [4]: @dataclass
   ...: class Rectangle:
   ...:     height: float
   ...:     width: float
   ...:
   ...: @dataclass
   ...: class Square(Rectangle):
   ...:     def __init__(self, side):
   ...:         Rectangle.__init__(self, side, side)
   ...:

In [5]: s = Square(10)

In [6]: s
Out[6]: Square(height=10, width=10)
Is there a suggested way to make this kind of method overriding work in dataclasses? It seems like in the documentation and conference presentations that __post_init__() is the preferred option, but that does not seem to work as advertised.
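Two possible ways around this, offered as a sketch rather than as the documented recipe. The quoted sentence talks about a base class whose __init__() has to be called, which fits a base class that is not itself a dataclass; in that case Square has only the one field, side. If the base has to stay a dataclass, redeclaring height and width with field(init=False) keeps them out of the generated __init__(). The names RectangleDC and SquareDC are made up for this example.

```python
from dataclasses import dataclass, field


class Rectangle:
    # Plain class: it contributes no dataclass fields to its subclasses.
    def __init__(self, height, width):
        self.height = height
        self.width = width


@dataclass
class Square(Rectangle):
    side: float

    def __post_init__(self):
        # The generated __init__ only takes `side`; pass it on to the base.
        super().__init__(self.side, self.side)


@dataclass
class RectangleDC:
    height: float
    width: float


@dataclass
class SquareDC(RectangleDC):
    # Redeclare the inherited fields so they are left out of __init__.
    height: float = field(init=False)
    width: float = field(init=False)
    side: float

    def __post_init__(self):
        self.height = self.width = self.side


s = Square(10)
print(s, s.height, s.width)
t = SquareDC(10)
print(t)
```

The second variant keeps height and width in the repr and in comparisons, at the cost of storing the same number three times.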
The way dataclasses combines attributes prevents you from being able to use attributes with defaults in a base class and then use attributes without a default (positional attributes) in a subclass.
That's because the attributes are combined by starting from the bottom of the MRO, and building up an ordered list of the attributes in first-seen order; overrides are kept in their original location. So Parent starts out with ['name', 'age', 'ugly'], where ugly has a default, and then Child adds ['school'] to the end of that list (with ugly already in the list). This means you end up with ['name', 'age', 'ugly', 'school'] and because school doesn't have a default, this results in an invalid argument listing for __init__.
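The ordering can be checked with dataclasses.fields(). The sketch below assumes a Parent with the fields described above; ChildWithDefaults is a made-up name used to show that an overridden field stays in its original slot.

```python
from dataclasses import dataclass, fields


@dataclass
class Parent:
    name: str
    age: int
    ugly: bool = False


parent_order = [f.name for f in fields(Parent)]
print(parent_order)

try:
    @dataclass
    class Child(Parent):
        school: str  # no default, but it lands after `ugly`
except TypeError as exc:
    error = str(exc)
    print(error)


@dataclass
class ChildWithDefaults(Parent):
    ugly: bool = True        # override: keeps the third position
    school: str = "unknown"  # new field: appended at the end


child_order = [f.name for f in fields(ChildWithDefaults)]
print(child_order)
```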
This is documented in PEP-557 Dataclasses, under inheritance:
When the Data Class is being created by the @dataclass decorator, it looks through all of the class's base classes in reverse MRO (that is, starting at object) and, for each Data Class that it finds, adds the fields from that base class to an ordered mapping of fields. After all of the base class fields are added, it adds its own fields to the ordered mapping. All of the generated methods will use this combined, calculated ordered mapping of fields. Because the fields are in insertion order, derived classes override base classes.
and under Specification:
TypeError will be raised if a field without a default value follows a field with a default value. This is true either when this occurs in a single class, or as a result of class inheritance.
You do have a few options here to avoid this issue.
The first option is to use separate base classes to force fields with defaults into a later position in the MRO order. At all costs, avoid setting fields directly on classes that are to be used as base classes, such as Parent.
The following class hierarchy works:
# base classes with fields; fields without defaults separate from fields with.
@dataclass
class _ParentBase:
    name: str
    age: int

@dataclass
class _ParentDefaultsBase:
    ugly: bool = False

@dataclass
class _ChildBase(_ParentBase):
    school: str

@dataclass
class _ChildDefaultsBase(_ParentDefaultsBase):
    ugly: bool = True

# public classes, deriving from base-with, base-without field classes
# subclasses of public classes should put the public base class up front.
@dataclass
class Parent(_ParentDefaultsBase, _ParentBase):
    def print_name(self):
        print(self.name)

    def print_age(self):
        print(self.age)

    def print_id(self):
        print(f"The Name is {self.name} and {self.name} is {self.age} year old")

@dataclass
class Child(_ChildDefaultsBase, Parent, _ChildBase):
    pass
By pulling out fields into separate base classes with fields without defaults and fields with defaults, and a carefully selected inheritance order, you can produce an MRO that puts all fields without defaults before those with defaults. The reversed MRO (ignoring object) for Child is:
_ParentBase
_ChildBase
_ParentDefaultsBase
Parent
_ChildDefaultsBase
Note that while Parent doesn't set any new fields, it does inherit the fields from _ParentDefaultsBase and should not end up 'last' in the field listing order; the above order puts _ChildDefaultsBase last so its fields 'win'. The dataclass rules are also satisfied; the classes with fields without defaults (_ParentBase and _ChildBase) precede the classes with fields with defaults (_ParentDefaultsBase and _ChildDefaultsBase).
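To see that order for yourself, the reversed MRO and the merged field list can be printed directly. This sketch repeats the hierarchy with the methods left out so that it runs on its own.

```python
from dataclasses import dataclass, fields


@dataclass
class _ParentBase:
    name: str
    age: int


@dataclass
class _ParentDefaultsBase:
    ugly: bool = False


@dataclass
class _ChildBase(_ParentBase):
    school: str


@dataclass
class _ChildDefaultsBase(_ParentDefaultsBase):
    ugly: bool = True


@dataclass
class Parent(_ParentDefaultsBase, _ParentBase):
    pass


@dataclass
class Child(_ChildDefaultsBase, Parent, _ChildBase):
    pass


# Drop Child itself (first entry) and object (last entry), then reverse.
reversed_mro = [c.__name__ for c in reversed(Child.__mro__[1:-1])]
field_order = [f.name for f in fields(Child)]
print(reversed_mro)
print(field_order)
```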
The result is Parent and Child classes with a sane field order, while Child is still a subclass of Parent:
>>> from inspect import signature
>>> signature(Parent)
<Signature (name: str, age: int, ugly: bool = False) -> None>
>>> signature(Child)
<Signature (name: str, age: int, school: str, ugly: bool = True) -> None>
>>> issubclass(Child, Parent)
True
and so you can create instances of both classes:
>>> jack = Parent('jack snr', 32, ugly=True)
>>> jack_son = Child('jack jnr', 12, school='havard', ugly=True)
>>> jack
Parent(name='jack snr', age=32, ugly=True)
>>> jack_son
Child(name='jack jnr', age=12, school='havard', ugly=True)
Another option is to only use fields with defaults; you can still make it an error to not supply a school value, by raising one in __post_init__:
_no_default = object()

@dataclass
class Child(Parent):
    school: str = _no_default
    ugly: bool = True

    def __post_init__(self):
        if self.school is _no_default:
            raise TypeError("__init__ missing 1 required argument: 'school'")
but this does alter the field order; school ends up after ugly:
<Signature (name: str, age: int, ugly: bool = True, school: str = <object object at 0x1101d1210>) -> None>
and a type hint checker will complain about _no_default not being a string.
You can also use the attrs project, which was the project that inspired dataclasses. It uses a different inheritance merging strategy; it pulls overridden fields in a subclass to the end of the fields list, so ['name', 'age', 'ugly'] in the Parent class becomes ['name', 'age', 'school', 'ugly'] in the Child class; by overriding the field with a default, attrs allows the override without needing to do a MRO dance.
attrs supports defining fields without type hints, but let's stick to the supported type hinting mode by setting auto_attribs=True:
import attr

@attr.s(auto_attribs=True)
class Parent:
    name: str
    age: int
    ugly: bool = False

    def print_name(self):
        print(self.name)

    def print_age(self):
        print(self.age)

    def print_id(self):
        print(f"The Name is {self.name} and {self.name} is {self.age} year old")

@attr.s(auto_attribs=True)
class Child(Parent):
    school: str
    ugly: bool = True
Note that with Python 3.10, it is now possible to do it natively with dataclasses.
Python 3.10 added the kw_only parameter to dataclasses (similar to attrs).
It allows you to specify which fields are keyword-only. Those fields are placed at the end of the generated __init__, so they no longer cause the inheritance problem.
Taking directly from Eric Smith's blog post on the subject:
There are two reasons people [were asking for] this feature:
- When a dataclass has many fields, specifying them by position can become unreadable. It also requires that for backward compatibility, all new fields are added to the end of the dataclass. This isn't always desirable.
- When a dataclass inherits from another dataclass, and the base class has fields with default values, then all of the fields in the derived class must also have defaults.
What follows is the simplest way to do it with this new argument, but there are multiple ways you can use it to use inheritance with default values in the parent class:
from dataclasses import dataclass

@dataclass(kw_only=True)
class Parent:
    name: str
    age: int
    ugly: bool = False

@dataclass(kw_only=True)
class Child(Parent):
    school: str

ch = Child(name="Kevin", age=17, school="42")
print(ch.ugly)
Take a look at the blogpost linked above for a more thorough explanation of kw_only.
Cheers !
PS: As it is fairly new, note that your IDE might still raise a possible error, but it works at runtime
Actually there is one method which is called before __init__: it is __new__. So you can do such a trick: call Base.__init__ in Child.__new__. I can't say whether it is a good solution, but if you're interested, here is a working example:
from dataclasses import dataclass

class Base:
    def __init__(self, a=1):
        self.a = a

@dataclass
class Child(Base):
    a: int

    def __new__(cls, *args, **kwargs):
        obj = object.__new__(cls)
        # Run the base initializer before the generated Child.__init__ runs
        Base.__init__(obj, *args, **kwargs)
        return obj

c = Child(a=3)
print(c.a)  # 3, not 1, because Child.__init__ overrides a
In best practice [...], when we do inheritance, the initialization should be called first.
This is a reasonable best practice to follow, but in the particular case of dataclasses, it doesn't make any sense.
There are two reasons for calling a parent's constructor, 1) to instantiate arguments that are to be handled by the parent's constructor, and 2) to run any logic in the parent constructor that needs to happen before instantiation.
Dataclasses already handles the first one for us:
@dataclass
class A:
    var_1: str

@dataclass
class B(A):
    var_2: str

print(B(var_1='a', var_2='b'))  # prints: B(var_1='a', var_2='b')
# 'var_1' got handled without us needing to do anything
And the second one does not apply to dataclasses. Other classes might do all kinds of strange things in their constructor, but dataclasses do exactly one thing: They assign the input arguments to their attributes. If they need to do anything else (that can't be handled by a __post_init__), you might be writing a class that shouldn't be a dataclass.
This is easiest to explain with the actual example. I'm making dataclasses for a baseball game.
There is a parent class called player. It incorporates all the base running information (pitchers are runners too).
There are two child classes, hitter and pitcher. These incorporate the hitting and defense information for the hitter and the pitching information for the pitcher.
Then there is the rare two-way player (Shohei Ohtani), whose class is a child of both hitter and pitcher.
All of these classes have a __post_init__ method. For the hitters and pitchers I want to run the player __post_init__ method as well as the specific hitter or pitcher one. For the two-way player I want to be able to run the player, hitter and pitcher __post_init__ methods.
I think for the single class inheritance I could use super(), but I don't know how to use super() to specify which method, or both methods, to call, or whether there is another way I could call all those __post_init__ methods.
If I can call a parent class's __post_init__ method from within the __post_init__ of the child, that would work; I just don't know how I would do it.
If not using actual code is a problem, please let me know and I will edit my question so that there is actual code.
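One common pattern for this is cooperative super(): every __post_init__ calls super().__post_init__() and Python's MRO takes care of visiting each class once, including in the two-way case. Below is a minimal sketch. The field names and the calls list are made up for this example, and the real classes will have their own fields.

```python
from dataclasses import dataclass, field


@dataclass
class Player:
    name: str
    # Records which __post_init__ methods ran, purely for demonstration.
    calls: list = field(default_factory=list, init=False, repr=False)

    def __post_init__(self):
        self.calls.append("Player")


@dataclass
class Hitter(Player):
    contact: int = 50

    def __post_init__(self):
        super().__post_init__()  # next class in the MRO
        self.calls.append("Hitter")


@dataclass
class Pitcher(Player):
    velocity: int = 90

    def __post_init__(self):
        super().__post_init__()
        self.calls.append("Pitcher")


@dataclass
class TwoWayPlayer(Hitter, Pitcher):
    def __post_init__(self):
        # MRO: TwoWayPlayer, Hitter, Pitcher, Player, object
        super().__post_init__()
        self.calls.append("TwoWayPlayer")


ohtani = TwoWayPlayer("Shohei Ohtani")
print(ohtani.calls)
print(Hitter("someone").calls)
```

In the two-way case, Hitter's super() call lands on Pitcher rather than on Player, so Player's __post_init__ still runs only once. Giving every subclass field a default, as here, also sidesteps the field ordering problem.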