Model Training

Feeding the Machine: How AI Learns

Once the hardware has been manufactured, AI systems must be trained before they can perform any task--from image recognition to language translation to predictive analytics. Model training is the process by which an AI system works through vast amounts of data, adjusting its internal parameters over thousands or even millions of iterations to “learn” from that information.
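
To make the idea of iterative parameter adjustment concrete, the sketch below shows a toy training loop in Python: a one-variable regression whose two parameters are repeatedly nudged to reduce prediction error. It is a deliberately simplified illustration of gradient descent, not a depiction of any particular production system, and all numbers in it are arbitrary.

    import random

    # toy dataset: y is roughly 2x, with a little noise added
    data = [(i / 100, 2 * (i / 100) + random.gauss(0, 0.05)) for i in range(100)]

    w, b = 0.0, 0.0      # the model's internal parameters
    lr = 0.1             # learning rate: how far each adjustment moves

    for epoch in range(200):            # many passes over the data
        for x, y in data:
            error = (w * x + b) - y     # how wrong the current parameters are
            w -= lr * 2 * error * x     # gradient descent: adjust parameters in the
            b -= lr * 2 * error         # direction that reduces the squared error

    print(f"learned parameters: w={w:.2f}, b={b:.2f}")   # w approaches 2, b approaches 0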

It is during this stage that AI systems demand substantial computing power, electricity, and human labor, especially when training large-scale models. The training of foundation models like GPT, PaLM, or Gemini relies on specialized hardware (e.g., NVIDIA A100s), massive datasets, and optimized infrastructure hosted in hyperscale data centers. Model training is often celebrated as the site of cutting-edge innovation, but its impacts on energy systems, labor conditions, and data justice are frequently overlooked.

  • Data Collection and Preparation
    Training begins with the aggregation of large-scale datasets--ranging from publicly available websites and academic texts to proprietary medical records or satellite imagery. This process involves data scraping, cleaning, deduplication, and formatting (a small deduplication example is sketched after these steps). It may also include annotation and labeling, often invisibly performed by so-called “data workers”.

    Compute Infrastructure and Training Runs
    Once the dataset is ready, the training phase begins on GPU or TPU clusters housed in high-performance data centers. Models are trained across multiple GPUs using distributed computing architectures (a data-parallel training sketch also follows these steps). This phase can take days or weeks and consumes vast amounts of energy and water for computation and cooling. Companies like OpenAI, Google DeepMind, Meta, and Anthropic rely on GPU infrastructure largely built by NVIDIA and typically run on the cloud platforms of Amazon, Microsoft, or Google.

    Evaluation and Alignment
    After initial training, models go through a series of fine-tuning and evaluation steps--sometimes using reinforcement learning from human feedback (RLHF) or adversarial testing to correct undesired behaviors (a reward-model sketch follows these steps). These tasks require manual review and additional labeling labor, typically outsourced to annotators around the world.
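
    As a concrete illustration of the data-preparation stage above, the following minimal Python sketch performs one common cleaning step: exact deduplication of scraped text by hashing a normalized form of each record. Real pipelines are far larger and also detect near-duplicates; the names and sample strings here are purely illustrative.

        import hashlib

        def normalize(text: str) -> str:
            # collapse whitespace and lowercase before hashing
            return " ".join(text.lower().split())

        def deduplicate(records):
            seen, unique = set(), []
            for text in records:
                digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
                if digest not in seen:        # keep only the first copy of each text
                    seen.add(digest)
                    unique.append(text)
            return unique

        corpus = ["Hello  world", "hello world", "Another scraped page of text"]
        print(deduplicate(corpus))   # the two near-identical entries collapse to one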
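
    The distributed training runs described above can be sketched, assuming PyTorch's DistributedDataParallel, as one worker process per GPU whose gradients are synchronized after every step. The model, batch, and objective below are placeholders, and the script is a schematic, not any lab's actual configuration.

        # launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
        import os
        import torch
        import torch.distributed as dist
        from torch.nn.parallel import DistributedDataParallel as DDP

        def main():
            dist.init_process_group(backend="nccl")        # one process per GPU
            local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
            torch.cuda.set_device(local_rank)

            model = torch.nn.Linear(1024, 1024).cuda(local_rank)
            model = DDP(model, device_ids=[local_rank])    # gradients sync across GPUs
            optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

            for step in range(1000):
                x = torch.randn(32, 1024, device=f"cuda:{local_rank}")  # placeholder batch
                loss = model(x).pow(2).mean()                           # placeholder objective
                optimizer.zero_grad()
                loss.backward()                # gradients averaged across processes
                optimizer.step()

            dist.destroy_process_group()

        if __name__ == "__main__":
            main()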
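
    Finally, the human feedback gathered during alignment is often used to fit a reward model on pairwise preferences. The sketch below, again assuming PyTorch, uses random placeholder embeddings in place of real model representations and annotator judgments; it shows only the core pairwise objective, not a full RLHF pipeline.

        import torch
        import torch.nn.functional as F

        reward_model = torch.nn.Linear(768, 1)        # toy reward head over embeddings
        optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

        # each pair: (embedding of the response an annotator preferred,
        #             embedding of the response they rejected)
        preference_pairs = [(torch.randn(768), torch.randn(768)) for _ in range(256)]

        for chosen, rejected in preference_pairs:
            r_chosen = reward_model(chosen)
            r_rejected = reward_model(rejected)
            # pairwise loss: push the preferred response toward a higher reward
            loss = -F.logsigmoid(r_chosen - r_rejected).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()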

  • Estimates suggest that training large-scale AI models can be highly energy-intensive, with one widely cited study by Emma Strubell, Ananya Ganesh, and Andrew McCallum estimating that training a single large model can emit more than 284 tons of CO₂--roughly five times the lifetime emissions of an average car. However, these figures vary significantly depending on the efficiency of the model architecture, the source of electricity used, and the location of the training infrastructure (an illustrative calculation appears below). As models scale from millions to hundreds of billions of parameters, energy demands typically grow, though not always linearly.

    In addition to emissions, water use for cooling in data centers has become a growing concern, particularly as training clusters expand into regions already facing water stress. In areas like Iowa, Arizona, and parts of Southern Europe, AI infrastructure is increasingly entangled with local resource pressures, especially when water withdrawals for evaporative cooling overlap with municipal or agricultural needs. These impacts are often invisible to users of AI applications, but deeply felt in the communities that host the physical infrastructure.

    Beyond energy and water, model training raises deeper questions about how data is sourced, valued, and made actionable. Datasets scraped from the open web often include personal information, copyrighted works, and culturally sensitive content, along with the everyday digital traces people leave across forum posts, blog entries, social media conversations, and collaborative knowledge platforms. While this content may be publicly accessible, its use in training AI systems frequently occurs without informed consent, transparency, or accountability--raising concerns about privacy, data sovereignty, misrepresentation, and the unacknowledged commodification of everyday online expression and identity. In many cases, culturally specific language patterns, Indigenous knowledge, or collective narratives are absorbed into models that commercialize or distort them. Scholars have described these dynamics as data colonialism--a process through which digital infrastructures continue long-standing patterns of extractive power, converting human communication into material for technological and economic gain.

    Labor conditions in model training are also frequently invisible. While training is framed as computationally automated, it relies heavily on the work of data annotators--often underpaid and overworked--who classify images, correct outputs, or rate model responses. Without adequate protections, these workers face emotional tolls and precarious working conditions, even as their labor is foundational to the model’s success.
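
    To illustrate why the emissions estimates cited earlier in this section vary so widely, the sketch below applies the basic relationship emissions = energy consumed × grid carbon intensity to a single hypothetical training run under several assumed, round-number grid intensities. None of these figures describe a specific model or provider.

        # illustrative only: assumed energy use and assumed grid intensities
        training_energy_mwh = 1_000                  # hypothetical energy for one training run

        grid_intensity_kg_per_kwh = {
            "coal-heavy grid": 0.80,
            "mixed grid": 0.40,
            "mostly renewable grid": 0.05,
        }

        for grid, intensity in grid_intensity_kg_per_kwh.items():
            tonnes_co2e = training_energy_mwh * 1_000 * intensity / 1_000   # kWh * kg/kWh -> tonnes
            print(f"{grid}: ~{tonnes_co2e:,.0f} tonnes CO2e")               # 800 vs 400 vs 50 tonnes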

    Climate and ecological breakdown → The environmental effects of this phase depend on multiple factors: the energy efficiency of the hardware, the carbon intensity of local electricity grids, and the cooling systems used by the data centers. In regions where fossil fuels dominate the energy mix or where data centers draw heavily on groundwater for cooling--such as parts of the U.S. Midwest or semi-arid areas in Europe--model training can contribute to localized environmental stress. Yet few AI companies disclose the carbon footprint of their training processes. This opacity in reporting makes it difficult for communities, regulators, or even researchers to evaluate the broader climate impact of model proliferation. One widely circulated estimate suggests that training GPT-3 consumed about 1,287 MWh of electricity, with associated emissions comparable to those of 123 gasoline-powered passenger vehicles driven for one year (a back-of-envelope version of this comparison appears below). These estimates highlight not just the energy intensity of state-of-the-art models, but the growing urgency of more transparent and accountable infrastructure reporting in the AI sector.

    Labor injustices → The training of AI models depends heavily on a global workforce of annotators and feedback personnel, many based in countries such as Kenya, India, Venezuela, and the Philippines. These workers are tasked with labeling data and evaluating model outputs for major AI laboratories. Their responsibilities often include identifying hate speech, graphic violence, or political misinformation--content that can be psychologically taxing. Despite the critical nature of their work, many of these individuals operate under precarious conditions, lacking long-term contracts or adequate psychological support. Investigations have revealed instances where workers received pay as low as $1.46 per hour while enduring significant emotional distress and minimal recognition for their contributions. This exploitation underscores the urgent need for equitable labor practices and mental health provisions within the AI industry.

    When inclusion becomes extraction → As discussed earlier, the large-scale scraping of online content for model training reflects a broader logic of data colonialism. In this framework, being “represented” in AI models often does not imply recognition or inclusion on fair terms. Instead, it may mean being exploited: absorbed into systems that commodify language, behavior, and identity while offering no say over how that data is used or to what ends. Those who pay the highest price are often the very communities whose digital presence is most vulnerable to misappropriation--whether through linguistic marginalization, lack of data protection, or the absence of meaningful political representation in AI governance. For these individuals and groups, participation in the data economy is not a choice but a condition--shaped by asymmetrical infrastructures that turn presence into profit and exposure into control. The harm lies not in exclusion alone, but in being included on extractive terms, where representation becomes another site of dispossession.
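
    As a back-of-envelope check of the GPT-3 comparison cited earlier in this section, the short calculation below converts the 1,287 MWh estimate into emissions and vehicle-equivalents. The grid carbon intensity and per-vehicle annual emissions are assumed round numbers, so the result is indicative only.

        energy_kwh = 1_287_000                   # 1,287 MWh expressed in kWh (cited estimate)
        grid_intensity = 0.43                    # assumed average kg CO2e per kWh
        vehicle_tonnes_per_year = 4.6            # assumed annual emissions of one passenger car

        emissions_tonnes = energy_kwh * grid_intensity / 1_000
        print(round(emissions_tonnes))                              # ~553 tonnes CO2e
        print(round(emissions_tonnes / vehicle_tonnes_per_year))    # ~120 vehicle-years, close to the cited 123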

  • The capacity to train state-of-the-art AI models is heavily concentrated in a handful of companies and countries. U.S.-based firms such as OpenAI, Google DeepMind, Meta, and Anthropic currently lead in foundation model training, supported by partnerships with cloud providers like Microsoft Azure and Amazon Web Services. NVIDIA remains a critical player through its dominance in AI chip supply.

    Most model training infrastructure is hosted in North America, Europe, or China, where access to capital, compute resources, and technical talent is concentrated. This creates a geopolitical landscape in which only a few actors have the capacity to shape the future of AI. Countries in the Majority World often lack the computational sovereignty to train their own models or audit those deployed in their territories.

    Recent attempts to “democratize” AI training--through open-source models, public datasets, or regional cloud infrastructure--have met with mixed success. Projects like BigScience (France) and LAION (Germany) are developing alternatives, but they remain under-resourced compared to commercial labs. Without global investment in decentralized infrastructure, the ability to participate in model training remains limited to a few powerful players.

  • A planetary justice lens invites us to rethink model training not simply as a technical challenge of scaling computation or optimizing model performance, but as a socio-ecological process--one that implicates extractive infrastructures, asymmetrical labor relations, and contested notions of consent and accountability. Training AI models is not neutral; it reflects specific choices about what is prioritized, who is involved, and what is rendered invisible in the process.

    A justice-oriented approach in model training thus begins with accountability--not only for the outputs of AI systems, but for the conditions under which they are produced. Transparency around energy use, emissions, and water withdrawal is essential, as is disclosure of data sources and annotation practices. But transparency alone is not enough. Meaningful participation--particularly of those whose knowledge, labor, or land is implicated in AI development--is necessary to ensure that training does not simply reproduce extractive dynamics under the banner of innovation.

    Where data reflects cultural, linguistic, or community-based knowledge, especially from historically marginalized or structurally excluded groups, participatory and consent-based practices should be explored. While questions around open access, legal licensing, and platform-based data availability remain complex, governance models that respect contextual integrity, community agency, and collective data stewardship offer pathways for more equitable inclusion.

    Further, training infrastructure should be designed with ecological costs and trade-offs in mind. Locating data centers near renewable energy sources, investing in water recycling, and reducing the frequency of unnecessary training runs are all part of a more responsible approach. Yet these interventions must be coupled with redistributive measures: ensuring that the benefits of AI development do not remain concentrated among a few firms or regions, but flow to those whose labor, data, and environments make these systems possible. That includes fairer compensation for data workers, meaningful inclusion of affected communities in AI policy discussions, and structural support for alternative, decentralized, or public-interest AI projects.

    More fundamentally, a planetary justice approach to AI model training requires us to ask not only how models are trained, but why, for whom, and to what end. The framing of problems that justify large-scale model training is rarely neutral--it reflects institutional priorities, market incentives, and particular visions of what constitutes value or progress. Often, the decision about which tasks are worth automating, which datasets are worth compiling, or which efficiencies are worth pursuing is made by actors far removed from the communities affected by those choices. In many cases, the outcomes may be ambiguous or uneven: the promised benefits--such as improved productivity, personalization, or insight--may be marginal, speculative, or disproportionately distributed.

    At the same time, what appears beneficial from one perspective may have harmful consequences from another. Tools used for automated decision-making in employment, credit, or migration, for example, may streamline processes for institutions while reinforcing bias or precarity for individuals. Systems designed for risk management or surveillance may offer perceived efficiencies for states or firms, but increase social control or reduce privacy for the public. Even in cases where access to AI tools is broadened, the conditions of that access--such as data dependency, language dominance, or infrastructural lock-in--can reinforce existing asymmetries in power, knowledge, and agency.

    Ultimately, a planetary justice approach does not take for granted that training ever-larger AI models is inherently necessary or beneficial. Instead, it calls for sustained reflection on whether, how, and why such systems are developed in the first place--recognizing that in some contexts, restraint, refusal, or redirection may be more aligned with the values of ecological sustainability, social equity, and collective autonomy than continued optimization alone. At the same time, a planetary justice approach invites us to imagine and invest in alternative pathways: approaches to AI that are purposefully limited in scope, co-designed with affected communities, environmentally grounded, and accountable to a broader set of social and ecological priorities.

Active Projects

Epistemic Justice in AI Datasets

This project, situated within the Model Training research area of the AI + Planetary Justice Alliance, investigates how AI training datasets in agriculture embed epistemic hierarchies that privilege Global North ways of knowing. Focusing on India, it critically examines how datasets used to train agricultural models often reflect standardized, techno-scientific approaches—while sidelining local, traditional, and Indigenous knowledge systems.