
Inclusive and Responsible Dataset Usage

Information to help you get started working with data in an inclusive and responsible manner.

Concepts

Algorithms are lists of instructions for accomplishing a task or solving a problem step by step, whether for machine learning or other activities.

Annotation is the process of labeling a dataset with metadata and additional information relevant to its subsequent use in analysis. For example, this can include adding descriptive captions to images, or evaluating and sorting each item according to a rubric you develop in your research.

API (Application Programming Interface) is an access point for acquiring data using code. Companies, institutions, archives, and platforms use APIs to offer instructions on how to write code that queries for the particular data you want, such as requesting data from a newspaper collection.
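
For instance, a minimal sketch of querying an API with Python's requests library; the endpoint URL and query fields here are hypothetical, so consult the provider's API documentation for the real URL, parameters, and any authentication requirements:

```python
import requests

response = requests.get(
    "https://api.example.org/v1/articles",   # hypothetical endpoint
    params={"q": "climate", "year": 1925},   # hypothetical query fields
    timeout=30,
)
response.raise_for_status()   # stop early if the request failed
articles = response.json()    # many APIs return JSON-formatted data
print(len(articles), "records retrieved")
```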

Artificial intelligence (AI) is an overall term describing a set of different kinds of techniques to make computers behave in some kind of [seemingly] intelligent fashion. There is no agreed definition of AI, but in general the ability to perform tasks without supervision and to learn so as to improve performance are key parts of AI (Ethics of AI). Even AI researchers have no exact definition; the field is constantly being redefined as some topics are classified as non-AI and new topics emerge (Elements of AI). Colloquially, AI refers to various systems that look for patterns in provided data. AI systems can be made up of multiple components of machine learning tasks and similar techniques. No matter their context or complexity, AI tools are always socio-technical systems, meaning they are designed, operated, and influenced by humans, rather than entirely autonomous, neutral systems. (Ciston 2021)

Classification refers to specific machine learning tasks that label and sort items in a dataset by discrete categories. For example, deciding whether an image shows a dog or a cat is a classification task. These are distinguished from Regression tasks, which show the relationship between features in a dataset, for example predicting a sale price from information about a house such as number of rooms and square footage. (Ciston 2023)
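
A minimal sketch of the contrast using scikit-learn; the toy housing numbers below are invented for illustration:

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Features: [number of rooms, square footage]
houses = [[3, 1200], [4, 2000], [2, 800], [5, 2600]]
prices = [250_000, 400_000, 150_000, 520_000]   # continuous target -> regression
has_garage = [0, 1, 0, 1]                       # discrete category -> classification

regressor = LinearRegression().fit(houses, prices)
classifier = LogisticRegression().fit(houses, has_garage)

print(regressor.predict([[4, 1800]]))   # predicts a price (a continuous value)
print(classifier.predict([[4, 1800]]))  # predicts a category (0 or 1)
```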

Cleaning see Preprocessing.

Continuous data are numerical data that, like sizes or temperatures, have no gaps if you were to chart all possible values. For example, the temperature can be 72 degrees, 73 degrees, or any value in between. Compare this to discrete data, which deals with whole quantities, like counting the number of people in a group. 

Data are values that can be assigned to a thing and can take a variety of forms (Responsible Data Handbook 2016). How you think about the information is what makes it data. They do not just exist but have to be generated, through collection by sensors or human effort. Sensing, observing, and collecting are all acts of interpretation that have contexts, which shape the data (Ciston 2023). Data that reveal identities, activities or affiliations are the most obvious areas for responsible data practices, but they should be applied in all cases. (Responsible Data Handbook 2016)

Datasets can be any kind of collected, curated, interrelated data. Often, datasets refer to large collections of data used in computation, and especially in machine learning. Information collections are transformed into datasets through a lifecycle of processes (collection/selection, cleaning and analyzing, sharing and deprecating), which shape how that information is understood. They always reflect the circumstances of their making. (Ciston 2023) Keep in mind that “sometimes standalone data deemed safe becomes harmful when combined with other data sets, or data that you thought was anonymized becomes easily discernible once combined with other data, using triangulation techniques.” (Responsible Data Handbook 85)

Data science is an umbrella term (with several subdisciplines) that includes machine learning and statistics, as well as certain aspects of computer science such as algorithms, data storage, and web application development. Data science is also a practical discipline that requires understanding of the domain in which it is applied, for example business or science. (Elements of AI)

Datasheets are documents describing each dataset’s characteristics and composition, motivation and collection processes, recommended usage and ethical considerations, and any other information to help people choose the best dataset for their task. Datasheets were proposed by diversity advocate and computer scientist Timnit Gebru, et al., as a field-wide practice to “encourage reflection on the process of creating, distributing, and maintaining a dataset, including any underlying assumptions, potential risks or harms, and implications for use” (Gebru 2020). Datasheets are also resources to help people select and adapt datasets for new contexts. (Ciston 2023)

Data subjects are the people and other beings whose data are gathered into a dataset. Even if identifying information has been removed, datasets are still connected to the subjects they claim to represent. A related term, data subjectees, specifically describes people impacted directly or indirectly by datasets, as distinct from data subjects. Data subjectees include anyone affected by predictions made with machine learning models, for example someone forced to use a facial detection system to board a flight or eye-tracking software to take a test at school. (Ciston 2023)

Datafiable, machine actionable, or machine readable formats are works that lend themselves to use by computer programs in their form, format, provenance and representativeness, access method, and rights (Padilla 2021). For example, CSV and Excel files are usually considered machine actionable, whereas PDF documents are not machine readable in their existing form; once their text is extracted into a TXT file, however, it becomes machine actionable. (School of Data)
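
A minimal sketch of this difference in practice, assuming hypothetical file names: a CSV is directly usable with Python's standard library, whereas PDF text must first be extracted (here with the third-party pypdf package, one of several possible tools):

```python
import csv

# A CSV file is already machine actionable: each row becomes a labeled dictionary.
with open("records.csv", newline="", encoding="utf-8") as f:  # hypothetical file
    rows = list(csv.DictReader(f))

# A PDF needs its text extracted before it can be processed further.
from pypdf import PdfReader  # assumes pypdf is installed
text = "".join((page.extract_text() or "") for page in PdfReader("scan.pdf").pages)
with open("scan.txt", "w", encoding="utf-8") as f:
    f.write(text)  # the extracted TXT file is now machine actionable
```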

Deep learning see Machine learning.

Discrete data see Continuous data.

Features are the attributes being analyzed, considered, or explored across the dataset, often viewed as a column in a table. Features can be any machine-readable (i.e. numeric) form of an instance: images converted into a sequence of pixels, for example. Note: Researchers often select and “extract” the features most relevant for their purpose. Features are not given by default. They are the results of decisions made by datasets’ creators and users. (Ciston 2023)
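
A minimal sketch of features as numeric representations: a tiny invented grayscale "image" becomes a flat sequence of pixel values that a model can treat as one row of features:

```python
import numpy as np

image = np.array([[0, 255],
                  [128, 64]])         # invented 2x2 grayscale image
features = image.flatten()            # -> [0, 255, 128, 64], one feature per pixel
print(features)
```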

Feature extraction and feature engineering are techniques used to focus on the specific information in a dataset that is relevant to your research goals. You may need to create features (e.g., add columns to your table) to show data from new perspectives. This can impact how the dataset can be analyzed going forward, how the model can be designed, and how the data subjects and subjectees might be affected. (Ciston 2023)
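
A minimal feature-engineering sketch with pandas, using invented housing data: a new column is derived from existing ones to expose a perspective (price per square foot) that the raw columns don't show directly:

```python
import pandas as pd

df = pd.DataFrame({"price": [250_000, 400_000], "sqft": [1200, 2000]})
df["price_per_sqft"] = df["price"] / df["sqft"]  # engineered feature, a new column
print(df)
```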

GAN stands for generative adversarial network, a popular kind of machine learning used to generate new data. It requires two parts: one part is trained on existing data in order to check the second part's work, while the second part tries to generate new data that can fool the first (hence "adversarial").
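
A minimal GAN training-loop sketch in PyTorch, learning to generate samples from an invented 1-D Gaussian; real GANs use far larger networks and datasets, but the two adversarial parts are the same:

```python
import torch
from torch import nn

generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(),
                              nn.Linear(16, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

for step in range(1000):
    real = torch.randn(32, 1) * 2 + 5      # "existing data": Gaussian(mean 5, std 2)
    fake = generator(torch.randn(32, 8))   # the generator's attempt at new data

    # Part one: learn to tell real data from the generator's fakes.
    d_loss = (loss_fn(discriminator(real), torch.ones(32, 1)) +
              loss_fn(discriminator(fake.detach()), torch.zeros(32, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Part two: learn to fool part one (the adversarial step).
    g_loss = loss_fn(discriminator(fake), torch.ones(32, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```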

Intersectionality, as first named by Kimberlé Crenshaw (1989), "center[s] interlocking systems of oppression and in doing so make[s] visible the normative value systems that facilitate erasure" (Gipson, Correy, and Noble 2021, 306).

JSON is a popular file type for working with labeled data. JSON has a nested hierarchical structure and does not require every record to contain the same fields. In this way it is semi-structured, unlike CSV (comma separated values) and other spreadsheet-like types. For more on types of data, see "A Critical Field Guide for Working with Machine Learning Datasets: Types of Datasets" (Ciston 2023).
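
A minimal sketch of JSON's nested, semi-structured form: the two invented records below are valid together even though their fields don't match, which a CSV table would not allow without empty cells:

```python
import json

records = json.loads("""
[
  {"title": "Photo 001", "tags": ["harbor", "1920s"]},
  {"title": "Photo 002", "creator": {"name": "Unknown", "role": "studio"}}
]
""")
print(records[1]["creator"]["name"])  # navigate the nested structure
```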

Labels can refer both to the results or outputs assigned by a machine learning model and to descriptors included in a training dataset meant for the model to practice on as it is built, or in a testing or benchmark dataset used for evaluation or verification. (Ciston 2023)

Metadata is data about other data, supplementary information that describes a file or accompanies other content, e.g. an image from your camera comes with the date and location it was shot, lens aperture, and shutter speed. Metadata can describe attributes of content and can also include who created it, with what tools, when, and how. During dataset annotation processes, additional metadata can be added by the dataset creators, gig workers, or other researchers. (Ciston 2023)

Machine learning is a set of tools used by computer programmers to find a formula that best describes (or models) a dataset. Whereas in other kinds of software the programmer writes explicit instructions for every part of a task, in machine learning, programmers instruct the software to adjust its code based on the data it processes, thus "learning" from new information. This learning is unlike human understanding, and the term is used metaphorically. Some formulas are "deeper" than others, so called because they contain many more variables, and deep learning refers to the use of many complex layers in a machine learning model. Due to their increasing complexity, the outputs of machine learning models are not reliable for making decisions about people, especially in highly consequential cases. When working with datasets, include machine learning as one suite of options in a broader toolkit, rather than as a generalizable multi-tool for every task. (Ciston 2023)

Machine readable see Datafiable. 

Models are the result of a machine learning algorithm once it has been revised to take into account the data it was exposed to during training. A model is the saved output of the training process, ready to make predictions about new data. One way to think of a model is as a very complex mathematical formula containing millions or billions of variables (values that can change). These variables, also called model parameters, are designed to transform a numerical input into the desired outputs. Training a model entails adjusting the variables that make up the formula until its output matches the desired output. Much focus is put on machine learning models, but models depend directly on datasets for their predictions. (Ciston 2023)
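
A minimal sketch of what "adjusting the variables" means, fitting a two-parameter formula y = w*x + b to invented points by gradient descent; real models do the same thing with vastly more parameters:

```python
w, b = 0.0, 0.0                      # model parameters, initially arbitrary
data = [(1, 3), (2, 5), (3, 7)]      # invented (x, y) pairs; the true rule is y = 2x + 1

for _ in range(2000):
    for x, y in data:
        error = (w * x + b) - y      # how far the formula's output is from the target
        w -= 0.01 * error * x        # nudge each parameter to shrink the error
        b -= 0.01 * error

print(f"learned formula: y = {w:.2f}*x + {b:.2f}")  # approaches y = 2.00*x + 1.00
```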

Neural networks describe some of the ways to structure machine learning models, including making large language models. Named for the inspiration they take from brain neurons (very simplified), they move information through a series of nodes (steps) organized in layers or sets. Each node receives the output of the previous layers’ nodes, combines them using a mathematical formula, then passes the output to the next layer of nodes. (Ciston 2023)
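
A minimal sketch of a single node's computation, with invented weights: combine the previous layer's outputs using a weighted formula, then pass the result onward:

```python
import numpy as np

previous_layer_outputs = np.array([0.2, 0.7, 0.1])  # outputs from the prior layer's nodes
weights = np.array([0.5, -1.0, 2.0])                # invented weights for this node
bias = 0.1

combined = np.dot(weights, previous_layer_outputs) + bias  # the node's formula
output = max(0.0, combined)   # a common activation (ReLU) applied before passing it on
print(output)
```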

Open source means the dataset or the source code for a piece of software is available and can thus be viewed, changed, and used (free of charge) by the public. In most cases, licenses must be observed that describe how it should and should not be used. (Training the Archive)

Preprocessing means checking and modifying data before analyzing it or using it to train a machine learning system. No data arrives ready to go. Preprocessing includes many adjustments that can affect the outcome, including selecting a subset of data (sampling), standardizing and scaling it in relation to a baseline (normalization), deciding how to handle missing data and outliers, as well as feature creation and extraction. The transformation of real-world information into data is never a neutral process but relies heavily on the conditions and goals of the research in context. (Ciston 2021) For more on preprocessing data, see "A Critical Field Guide to Working with Machine Learning Datasets: Transforming Datasets" (Ciston 2023).
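
A minimal preprocessing sketch with pandas, using invented temperature readings; note that each step is a decision that shapes later results:

```python
import pandas as pd

df = pd.DataFrame({"temp": [72.0, None, 65.0, 90.0, 71.0]})  # invented data

sample = df.sample(n=3, random_state=0)            # sampling: which rows count?
df["temp"] = df["temp"].fillna(df["temp"].mean())  # missing data: fill, drop, or flag?
df["temp_scaled"] = (df["temp"] - df["temp"].mean()) / df["temp"].std()  # normalization
print(df)
```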

Qualitative data include descriptions or categories that evaluate or label. They tell you something about qualities: e.g. descriptions, colors, etc. Interviews count as qualitative data. Quantitative data include discrete counts or continuous measurements that number or represent values. They tell you something about a measure or quantity, such as how many things you have or how large they are (if measured). (School of Data) Neither type of data is inherently more accurate or more biased, because each depends on the ways it was gathered, organized, processed, and used as part of the dataset.

Regression see Classification.

Repository describes a storage space for digital objects, whether a digital archive or software source code. Often, repositories also contain version histories of trackable changes made to the objects.

Samples are selections from the total dataset, whether chosen at random or using a particular feature or property; samples can be used to analyze a dataset, perform testing, or train a model. (Ciston 2023)

Scraping is the process of extracting data in machine-readable formats from PDFs, websites, or other unstructured sources to make the desired content available for further use. (School of Data, Training the Archive)
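
A minimal scraping sketch with the third-party requests and BeautifulSoup packages; the URL and the "headline" class are hypothetical, and real scraping should respect a site's terms of service and robots.txt:

```python
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.org/archive", timeout=30).text  # hypothetical page
soup = BeautifulSoup(html, "html.parser")

# Pull the text out of every <h2 class="headline"> element on the page.
headlines = [h.get_text(strip=True) for h in soup.find_all("h2", class_="headline")]
print(headlines)
```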

Structured data can be, for example, tabular data with labeled columns, or other forms of labeled or annotated information. Unstructured data can be plain text files or unannotated images. Annotating or coding a dataset prepares it for analysis and raises important questions about labor, classification, and power.

Supervised machine learning relies on training data that has already been labeled in order to "learn." For example, a dataset for object recognition would contain images as well as a table describing the manually located object(s) they contain. It might have columns for the object name or label, coordinates for the object's position or outline, and the corresponding image's file name or index number. Unsupervised machine learning looks for patterns that are not yet labeled in the dataset. It uses different kinds of machine learning algorithms, such as clustering groups of data together using features they share. However, it would be a misnomer to think that conclusions drawn from unsupervised machine learning are somehow more pure or rational. Much human judgment goes into developing an unsupervised machine learning model, from adjusting weights and parameters to comparing models' performance. Often supervised and unsupervised approaches are used in combination to ask different kinds of questions about the dataset. Other kinds of machine learning approaches (like reinforcement learning) don't fall neatly into these high-level categories. (Ciston 2023)
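
A minimal sketch of unsupervised clustering with scikit-learn, using invented points: no labels are provided, yet human judgment still enters, for example in choosing n_clusters:

```python
from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [10, 2], [10, 4], [10, 0], [1, 0]]  # invented, unlabeled data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)  # which cluster each point was assigned to
```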

Testing data see Training data.

Training data is the portion of the full dataset used to create a machine learning model, kept separate from later testing phases. If you imagine a model as a student studying for exams, the training data is like the study guide they use to practice the material. For example, in supervised machine learning, training data includes results like those the model will be asked to generate, e.g. labeled images. Training data is contrasted with validation data, which is used to optimize the model once it is created, and testing data, which is only used when the model is complete in order to assess how well it functions. (Ciston 2023)
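
A minimal sketch of splitting one dataset into training, validation, and testing portions with scikit-learn; the 60/20/20 ratio here is one common choice among many:

```python
from sklearn.model_selection import train_test_split

X = list(range(100))  # stand-in for 100 examples
train, rest = train_test_split(X, test_size=0.4, random_state=0)
validation, test = train_test_split(rest, test_size=0.5, random_state=0)
print(len(train), len(validation), len(test))  # 60 20 20
```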

Transfer learning updates fully trained models to apply them to new, similar problems, creating a new model that relies on the understandings of both contexts.
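
A minimal transfer-learning sketch with PyTorch, assuming a recent torchvision is installed: reuse an image model pretrained on ImageNet, freeze its learned weights, and replace only the final layer for a new, similar task (here, an assumed 5-class problem):

```python
import torch
from torchvision import models

model = models.resnet18(weights="DEFAULT")  # weights learned on ImageNet
for param in model.parameters():
    param.requires_grad = False             # keep the original understanding fixed

model.fc = torch.nn.Linear(model.fc.in_features, 5)  # new trainable final layer
# Training now only adjusts model.fc, adapting the old model to the new context.
```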

Unstructured data see Structured data.

Validation data see Training data.


Sources

Ciston S. (2023). “A Critical Field Guide for Working with Machine Learning Datasets.” Crawford K and Ananny M, Eds., Knowing Machines project.

Ciston S. (2021). “Intersectional AI Toolkit,” Intersectional AI Toolkit. https://intersectionalai.com/

Gebru T, et al. (2020). "Datasheets for Datasets." arXiv:1803.09010 [cs], Mar. 2020, http://arxiv.org/abs/1803.09010

Engine Room. (n.d.). Responsible Data Handbook. https://the-engine-room.github.io/responsible-data-handbook/

Padilla, T. (2021, October 13). Responsible Operations: Data Science, Machine Learning, and AI in Libraries. OCLC. https://www.oclc.org/research/publications/2019/oclcresearch-responsible-operations-data-science-machine-learning-ai.html

School of Data. (n.d.). "Glossary." School of Data. https://schoolofdata.org/handbook/appendix/glossary/

Training the Archive. (n.d.). "Glossary." Training the Archive. https://trainingthearchive.ludwigforum.de/en/glossary/

University of Helsinki, Minna Learn. (n.d.) "Elements of AI." https://course.elementsofai.com/

University of Helsinki, Minna Learn. (n.d.) "Ethics of AI." https://ethics-of-ai.mooc.fi/