Really good question and interesting to see the responses!
I head a data collection and annotation company, and we’ve worked on a lot of projects over the last couple of years, so I can give some very practical insight into this part of AI projects, even though the conversation is much wider (e.g. project design, model implementation, etc.).
Bias is especially dangerous when humans are involved (e.g. applications that use biometric data, or detect age/gender/“race”, etc.) because then it can actually affect people’s lives. In computer vision applications meant for industrial use, for example (e.g. detecting nails and bolts), bias is still undesirable, but the consequences are more limited.
In terms of data collection, some companies don’t have access to real-life data before the model is deployed, so they use some kind of proxy (e.g. I want to detect medical masks in public on security cameras, so I collect a dataset of people wearing masks from online sources). Bias comes in when that dataset isn’t representative of the actual data the model will run on: most of the photos are selfies while the model is supposed to run on CCTV, most of the people who appear are Asian while the model will be applied in Africa, or you didn’t realize that 70% of the people in the dataset are men.
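A quick sanity check like the one above doesn’t need anything fancy. Here’s a minimal sketch, assuming you have per-sample attribute metadata (the attribute names, values, and the 60% threshold are all hypothetical, not from any real project):

```python
from collections import Counter

def audit_distribution(labels, max_share=0.6):
    """Flag attribute values that dominate a dataset.

    labels: one attribute value per sample (e.g. perceived gender per image).
    max_share: share above which a value counts as over-represented
               (0.6 is an arbitrary illustrative threshold).
    """
    counts = Counter(labels)
    total = len(labels)
    # Share of the dataset taken up by each attribute value
    report = {value: count / total for value, count in counts.items()}
    flagged = [value for value, share in report.items() if share > max_share]
    return report, flagged

# Hypothetical metadata: 70 men, 25 women, 5 other — the 70% case above
genders = ["male"] * 70 + ["female"] * 25 + ["other"] * 5
report, flagged = audit_distribution(genders)
print(report)   # {'male': 0.7, 'female': 0.25, 'other': 0.05}
print(flagged)  # ['male']
```

Running the same audit per capture device (selfie vs. CCTV) or per region catches the other two failure modes in the same way.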
And then, in terms of annotation, a very important step is class definition: say you want to detect gender (quite a problematic endeavor in itself) and you set “male” and “female” as the two annotation classes. Should you include a “non-binary”, “other”, or “difficult to say” class as well? This is obvious for constructs like gender and race, but it’s valid for all types of projects (e.g. in a drink detection app: are you detecting Cola, Sprite, and Fanta separately, are you grouping them as “soft drinks”, and what happens to Schweppes, which doesn’t appear anywhere in your class list?).
We’re actually publishing a whitepaper called “How to avoid bias in AI through better dataset collection and annotation” soon, so stay tuned!