Inclusive and Responsible Dataset Usage

Guidance on getting started with working with data in an inclusive and responsible manner.

Key Considerations

If you only have 5 minutes… 

  1. Check that you have the latest version of this dataset and that it hasn’t been deprecated, discredited, or removed.

  2. Ask who is included in the dataset, who is excluded, and how they are represented by the data and your use of the data. How will potential feature transformations and imputation of missing values alter representation in the data?

  3. Consider how your research will account for its impact on any stakeholders. For example, what’s the procedure if someone asks for their information to be removed from a dataset you’re using? 

  4. Make a plan for maintenance of your research. What changes will you make if the dataset you’re using gets updated or removed?

  5. Consider whether this is the best dataset for your research question. What similar datasets have you considered? What different types of datasets might you consider? 
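The version check in step 1 can be made concrete with a checksum comparison. Below is a minimal sketch that assumes the dataset maintainers publish a SHA-256 digest alongside each release; the file path and the `PUBLISHED_SHA256` name are invented for illustration:

```python
import hashlib

def dataset_checksum(path, chunk_size=8192):
    """Compute the SHA-256 digest of a local dataset file, reading in chunks
    so large files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical usage: compare your local copy against the checksum the
# maintainers publish for the current release (names here are invented).
# if dataset_checksum("data/train.csv") != PUBLISHED_SHA256:
#     raise RuntimeError("Local copy does not match the current release.")
```

A matching checksum only confirms that your file is byte-identical to a given release; checking whether that release has since been deprecated or retracted still requires consulting the dataset's documentation or maintainers.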

Ideally you will spend much more than 5 minutes on each of these questions; they are meant to be revisited throughout the process of your research.
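Step 2's concern about missing values can be illustrated with a small sketch. The records, groups, and values below are invented, but they show how a common cleaning step (dropping rows with missing values) can silently shift who is represented in the data:

```python
# Hypothetical records: a group label and an 'income' feature with gaps
# that occur more often for group B (values are invented).
records = [
    {"group": "A", "income": 50},
    {"group": "A", "income": 60},
    {"group": "A", "income": 55},
    {"group": "B", "income": None},
    {"group": "B", "income": 45},
    {"group": "B", "income": None},
]

def group_share(rows, group):
    """Fraction of rows belonging to `group`."""
    return sum(r["group"] == group for r in rows) / len(rows)

# Before cleaning, groups A and B are equally represented.
before = group_share(records, "B")   # 0.5

# Complete-case analysis: drop any row with a missing value.
complete = [r for r in records if r["income"] is not None]
after = group_share(complete, "B")   # 0.25

print(f"Group B share: {before:.2f} -> {after:.2f}")
```

Imputing the missing values instead of dropping the rows preserves the row counts, but introduces its own distortion: filling gaps with an overall mean pulls the under-recorded group's values toward the better-recorded group's. Either way, the choice changes how groups are represented and deserves to be documented.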

More to Keep in Mind...

Consider Who Datasets Are From, Who Datasets Are For

Organizations that advocate for Indigenous rights have developed data labels that indicate the origin, access, usage rights, transparency, and integrity of data related to their cultural heritage (see localcontexts.org). From this work we can take a broader lesson: not all data is appropriate for every context, purpose, or audience, so choose with impact in mind when selecting or creating datasets.

Not a New Problem

Any knowledge organization system will contain biases, presumptions, and potential for confusion for its users. The same issues that faced analog systems before the digital era now face datasets for machine learning. For example, library cataloging systems like those used by the Library of Congress "can make materials hard to find for other users, stigmatize certain groups of people with inaccurate or demeaning labels, and create the impression that certain points of view are normal and others unusual” (Knowlton 2005). Similar classification logics are at work in datasets and machine learning systems.

How do we as researchers address this through our work? Rather than striving for an all-inclusive "neutrality" or trying to mitigate bias with technical quick fixes, address these concerns through awareness: know that this potential exists and acknowledge it at each stage of your research.

Algorithmic Ecosystems

Datasets are part of a larger ecosystem of algorithm-informed research. From the models created using training datasets, to the analysis and results they produce (new datasets!), each part of the ecosystem needs to be approached with the same critical perspectives. The critical questions and approaches discussed here can be applied at every stage of creating, using, and relying on algorithmic systems.