Quality of a dataset

What makes a good dataset? An ideal dataset should be:

  • Accurate - Values are close to the true quantity being measured or computed. This is usually the case in chemistry and materials science.

  • Reliable - Values for the same material/molecule, obtained under the same conditions, are the same or agree within a satisfactory error margin.

  • Relevant - The available variables and properties are well defined and pertinent to the target problem.

  • Large - A sufficiently large amount of data is available. This can be an issue in scientific fields, as experimental data can be difficult or time-consuming to obtain.

  • Highly variable - The data covers a wide range of molecules/materials. The number of possible molecules/materials and their combinations is so large that the available data often covers only a small part of it.

  • Unbiased - Scientists tend to focus on the good results, but ML models need the bad results too.
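
Some of the criteria above can be checked with a few lines of code before any model is trained. The following is a minimal sketch, not a standard procedure: the record layout, the sample values, the tolerance, and the function names (`unreliable_groups`, `good_result_fraction`, `distinct_materials`) are all hypothetical and chosen only for illustration. It flags repeated measurements that disagree (reliability), reports the share of "good" results (bias), and counts distinct materials (a crude proxy for variability).

```python
from collections import defaultdict

# Hypothetical records: (material, condition, measured_value, is_good_result).
# The values are made up purely to illustrate the checks.
records = [
    ("A", "300K", 1.02, True),
    ("A", "300K", 0.98, True),
    ("B", "300K", 2.50, True),
    ("B", "300K", 3.10, False),
    ("C", "350K", 0.40, True),
]


def unreliable_groups(rows, tolerance):
    """Return (material, condition) keys whose repeated values span more than `tolerance`."""
    groups = defaultdict(list)
    for material, condition, value, _ in rows:
        groups[(material, condition)].append(value)
    return sorted(
        key for key, values in groups.items()
        if max(values) - min(values) > tolerance
    )


def good_result_fraction(rows):
    """Fraction of rows labelled as good results; a value near 1.0 hints at reporting bias."""
    return sum(1 for row in rows if row[3]) / len(rows)


def distinct_materials(rows):
    """Number of distinct materials in the dataset."""
    return len({row[0] for row in rows})


print("Unreliable groups:", unreliable_groups(records, tolerance=0.1))
print("Fraction of good results:", good_result_fraction(records))
print("Distinct materials:", distinct_materials(records))
```

What counts as an acceptable tolerance, or as too few materials, depends on the property being measured and on the target problem, so thresholds like the `0.1` used here should be treated as placeholders.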