Quality of a dataset
What makes a good dataset? An ideal dataset should be:
Accurate - The recorded values are close to the true values. This is usually the case in chemistry and materials science.
Reliable - Values for the same material/molecule, under the same conditions, are the same or within a satisfactory error margin.
Relevant - The available variables and properties are well defined for the target problem.
Large - A sufficiently large amount of data is available. This can be an issue in scientific fields, as experimental data can be difficult or time-consuming to obtain.
Highly variable - The data covers a wide range of molecules/materials. The number of possible molecules/materials and their combinations is so large that the available data often covers only a limited part of it.
Unbiased - Scientists tend to focus on the good results, but ML models need the bad results too.
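Some of these criteria can be checked directly in code. The following is a minimal sketch, not a definitive procedure, of two such checks: reliability (do repeated measurements agree?) and variability (how many distinct entries are covered?). The record layout, the names `find_unreliable` and `count_distinct`, the tolerance, and all values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical records: (material, condition, measured value).
# All names and numbers are made up for illustration.
records = [
    ("mol_A", "300 K", 2.16),
    ("mol_A", "300 K", 2.17),
    ("mol_B", "300 K", 1.98),
    ("mol_B", "300 K", 2.30),
    ("mol_C", "300 K", 3.58),
]


def find_unreliable(records, tolerance):
    """Return (material, condition) keys whose repeated values
    differ by more than `tolerance`."""
    groups = defaultdict(list)
    for material, condition, value in records:
        groups[(material, condition)].append(value)
    return sorted(
        key
        for key, values in groups.items()
        if max(values) - min(values) > tolerance
    )


def count_distinct(records):
    """Count distinct (material, condition) pairs,
    a rough proxy for how varied the dataset is."""
    return len({(material, condition) for material, condition, _ in records})


# The tolerance is an assumed, problem-specific error margin.
unreliable = find_unreliable(records, tolerance=0.05)
n_distinct = count_distinct(records)

print("Entries with inconsistent values:", unreliable)
print("Distinct material/condition pairs:", n_distinct)
```

Here only `mol_B` is flagged, because its two values differ by more than the chosen margin, and the five records cover just three distinct entries. What counts as an acceptable margin depends on the property being measured.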