The Hazards of Large Language Models
Last week, a group of former employees from leading AI companies such as OpenAI and Google DeepMind publicly released a letter warning about recent advances in Artificial Intelligence (AI). They highlighted flaws in how these technologies are being developed, which they argue can lead to issues such as inequality, manipulation, and misinformation.
The letter can be found here, but it is worth delving deeper into their warnings to understand how these generative AI applications can harm society and potentially lead to catastrophic consequences. Understanding these risks is vital for addressing the ethical and societal implications of AI development, particularly for Large Language Models (LLMs).
Environmental Risks and Equality
Recent advancements in hardware and techniques for training neural networks have led to the development of a new generation of large models trained on vast datasets. These models have achieved significant improvements in accuracy across numerous natural language processing (NLP) tasks. However, these gains are heavily dependent on the availability of extensive computational resources, which require substantial energy consumption.
As a result, these models are costly to train and develop, both financially, due to the cost of hardware, electricity, and cloud compute time, and environmentally, due to the carbon footprint required to power modern tensor processing hardware.
Recent benchmarks have quantified the costs of model training and development in both dollars and estimated CO2 emissions. While the average person is responsible for approximately 5 tons of CO2 emissions annually, one such study estimated that training a large Transformer model with neural architecture search emitted around 284 tons of CO2. Furthermore, training a single BERT base model (without hyperparameter tuning) on GPUs was estimated to consume as much energy as a trans-American flight.
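To make such figures less abstract, the back-of-the-envelope sketch below estimates emissions from GPU count, power draw, training time, data-centre overhead (PUE), and grid carbon intensity. All numerical values are assumptions chosen for illustration, not measurements from any specific training run or from the studies cited above.

```python
# Rough training-emissions estimate: energy = GPUs * power * hours * PUE,
# CO2 = energy * grid carbon intensity. All values are illustrative assumptions.

def estimate_training_co2(num_gpus: int,
                          gpu_power_kw: float,
                          training_hours: float,
                          pue: float = 1.5,                 # data-centre overhead factor
                          grid_kg_co2_per_kwh: float = 0.4) -> float:
    """Return estimated CO2 emissions in metric tons."""
    energy_kwh = num_gpus * gpu_power_kw * training_hours * pue
    co2_kg = energy_kwh * grid_kg_co2_per_kwh
    return co2_kg / 1000.0  # kilograms -> metric tons

if __name__ == "__main__":
    # Hypothetical run: 512 GPUs drawing 0.3 kW each for two weeks.
    tons = estimate_training_co2(num_gpus=512, gpu_power_kw=0.3,
                                 training_hours=14 * 24)
    print(f"Estimated emissions: {tons:.1f} t CO2")
```

Even this simple calculation makes clear how quickly emissions scale with hardware count and training time, which is exactly why transparent reporting matters.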
The community calls on researchers and companies to report training time and computational resources required when proposing models intended for re-training on downstream tasks, such as new domain adaptation or task-specific fine-tuning. This transparency will facilitate comparisons across models, enabling subsequent users to accurately determine whether the required computational resources are feasible for their specific settings.
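One lightweight way to provide that transparency is to publish a small, machine-readable summary alongside the model. The field names and figures below are an illustrative sketch, not an established schema or real measurements.

```python
# Illustrative (non-standard) training-cost report that could accompany a released model.
import json

training_report = {
    "model_name": "example-encoder-base",        # hypothetical model
    "hardware": "8x NVIDIA V100 (assumed)",
    "training_time_hours": 96,
    "estimated_energy_kwh": 460,
    "estimated_co2_kg": 180,
    "cloud_region": "unspecified",
    "notes": "All figures are placeholders for illustration.",
}

print(json.dumps(training_report, indent=2))
```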
Equal Access to Computational Resources
Recent advances in computing come at a high price, putting them out of reach for many researchers who need access to them.
Restricting this type of research to industry labs disadvantages the NLP research community in several ways. Researchers with innovative ideas but without access to large-scale computing resources are unable to pursue their concepts. This restriction limits certain types of research based on financial resources, further entrenching the "rich get richer" cycle in research funding.
Successful and well-funded groups continue to attract more funding due to their existing achievements. Additionally, the high start-up costs of building in-house resources compel less resource-rich groups to depend on cloud computing services like AWS, Google Cloud, and Microsoft Azure.
Prioritize Computationally Efficient Hardware and Algorithms
A collaborative effort between industry and academia is recommended to promote the development of more computationally efficient algorithms and more energy-efficient hardware. Additionally, software should be optimized for better resource efficiency.
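As one concrete example of software-level efficiency, mixed-precision training reduces memory traffic and arithmetic cost on modern accelerators. The sketch below uses PyTorch's automatic mixed precision on a toy model; the model, data, and hyperparameters are placeholders, and this is only one of many possible efficiency techniques.

```python
# Minimal mixed-precision training loop (PyTorch AMP) on a toy model.
# The model, data, and hyperparameters are placeholders for illustration.
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 2)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

for step in range(100):
    x = torch.randn(32, 128, device=device)           # fake batch
    y = torch.randint(0, 2, (32,), device=device)     # fake labels
    optimizer.zero_grad()
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        loss = loss_fn(model(x), y)                   # forward pass in lower precision
    scaler.scale(loss).backward()                     # scaling avoids fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```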
Biases in Large Internet Datasets
The Internet, with its vastness and diversity, might suggest that large datasets like Common Crawl (which includes petabytes of data collected over 8 years of web crawling, a filtered version of which was used to train GPT-3) are broadly representative of diverse viewpoints. However, closer examination reveals factors that limit who participates online, which discussions the crawling methodology includes, and which content survives filtering. Consequently, voices aligned with dominant perspectives, including white supremacist and misogynistic views prevalent in US and UK English text, are disproportionately represented. This overrepresentation not only exceeds their prevalence in the general population but also exacerbates biases and harms when these datasets are used to train models.
The contributors to Internet text collections are disproportionately younger and predominantly from developed countries due to uneven global Internet access. Specific subsets, such as GPT-2's data sourced from Reddit outbound links, reflect demographics where a majority are young men in the United States. Similarly, surveys of Wikipedia contributors indicate a low representation of women, highlighting skewed participation demographics in these datasets.
Social movements introduce new norms, languages, and modes of communication. This poses challenges for language models (LMs), as reliance on these models risks perpetuating outdated and less inclusive viewpoints—a phenomenon known as 'value-lock'.
For example, the Black Lives Matter (BLM) movement has influenced the creation and editing of Wikipedia articles. As the movement gained momentum, articles on shootings of Black individuals increased in frequency and were generated more quickly. Notably, past incidents of police brutality and shootings were updated alongside new events, demonstrating how social movements connect events over time to construct coherent narratives.
Recent studies confirm that large language models (LMs) exhibit biases such as stereotypical associations and negative sentiment toward specific groups. Additionally, there is evidence of intersectional bias, where models like BERT, ELMo, GPT, and GPT-2 encode more bias against identities marginalized along multiple dimensions than expected. These biases are attributed to characteristics of the training data. For example, recent research reveals that BERT associates phrases about disabilities with negative sentiment, and topics like gun violence and homelessness are disproportionately linked to discussions of mental illness. Similarly, GPT-3, trained on extensive data primarily from Common Crawl, can generate highly toxic sentences even when prompted with non-toxic inputs. Analysis of GPT-2's training data has also uncovered significant content from unreliable news sites and banned subreddits.
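A simple way to see this kind of effect for yourself is to probe a pretrained classifier with templated sentences that differ only in the group mentioned. The sketch below uses the Hugging Face `transformers` sentiment-analysis pipeline with its default English model; the template and group terms are illustrative, and a single run like this is a demonstration, not a rigorous bias audit or the methodology of the studies cited above.

```python
# Probe a pretrained sentiment classifier with templated sentences that differ
# only in the group mentioned. Illustrative only; not a rigorous bias audit.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default English model

template = "A person who is {} walked into the room."
groups = ["blind", "deaf", "a wheelchair user", "tall", "left-handed"]

for group in groups:
    sentence = template.format(group)
    result = classifier(sentence)[0]
    print(f"{sentence!r} -> {result['label']} ({result['score']:.2f})")
```

Systematic differences in predicted sentiment across otherwise identical sentences are one symptom of the training-data biases described above.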
Synthetic Text as Meaningful Source
Traditional n-gram language models are limited to modeling local dependencies, predicting each word based on a small sequence of preceding words. In contrast, Transformer-based language models like GPT-3 and GPT-4 can capture much larger contexts, generating text that is not only fluent but also coherent across paragraphs.
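The contrast is easiest to see with a toy n-gram model. In the bigram sketch below, each word is predicted from only the single preceding word, so no information can be carried across a sentence, let alone a paragraph. The tiny corpus and code are a minimal illustration, not any of the models discussed here.

```python
# Toy bigram language model: each word is predicted from only the previous word,
# illustrating the narrow context window of n-gram models.
import random
from collections import defaultdict

corpus = "the cat sat on the mat and the cat slept on the sofa".split()

# Count word -> next-word transitions.
transitions = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev].append(nxt)

def generate(start: str, length: int = 8) -> str:
    word, out = start, [start]
    for _ in range(length):
        candidates = transitions.get(word)
        if not candidates:
            break
        word = random.choice(candidates)   # only the last word matters
        out.append(word)
    return " ".join(out)

print(generate("the"))
```

A Transformer, by contrast, attends over the entire input window at every step, which is what lets it stay on topic across many sentences.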
For instance, GPT-4's capability can be demonstrated by prompting it with a specific text excerpt, after which it autonomously generates coherent paragraphs, including answers in a Q&A format:
Question: What is the name of the Russian mercenary group?
Answer: The Russian mercenary group often referred to in the media is known as the Wagner Group. This private military company (PMC) has gained attention for its involvement in various conflicts, including in Syria and Ukraine, where it has operated in support of Russian interests. The Wagner Group is known for its operations alongside Russian military forces and has been linked to controversial activities and alleged human rights abuses.
Question: Where is the Wagner group?
Answer: The Wagner Group, a Russian private military company (PMC), has been known to operate in several conflict zones and areas of interest to Russia. Some of the places where Wagner Group operatives have been reported include: Syria, Ukraine, Libya, Central African Republic (CAR) and Sudan.
Question: Who is the leader of the Wagner group?
Answer: The leader of the Wagner Group, the Russian private military company (PMC), is believed to be Dmitry Utkin. Dmitry Utkin is a former Russian military officer who reportedly founded the Wagner Group and has been linked to its operations in various conflict zones. Utkin, also known by his call sign "Wagner," is said to have a background in military intelligence and has been associated with coordinating and commanding Wagner Group operations. However, due to the secretive nature of private military companies like Wagner, detailed information about its leadership and structure can be difficult to verify.
This example showcases GPT-4's ability to maintain coherence and relevance to the initial prompt, exemplifying its broader contextual understanding compared to traditional models.
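For readers who want to reproduce a similar Q&A-style interaction, the sketch below shows one way to do it with the OpenAI Python SDK (v1-style client). The model name, system prompt, and placeholder excerpt are assumptions for illustration; the exact output will differ from the transcript above, and an `OPENAI_API_KEY` must be set in the environment.

```python
# Sketch of prompting a chat model for a Q&A-style completion using the
# OpenAI Python SDK (v1 client). Model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

excerpt = "..."  # the text excerpt used as context (omitted here)
question = "What is the name of the Russian mercenary group?"

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Answer questions about the given excerpt."},
        {"role": "user", "content": f"{excerpt}\n\nQuestion: {question}"},
    ],
)

print(response.choices[0].message.content)
```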
Text produced by a Large Language Model lacks grounding in communicative intent, in any model of the world, or in any model of the reader's state of mind. This is inherent: the training data never involved sharing thoughts with a listener, nor does the machine possess the capability to engage in such interaction. Despite the increasingly fluent nature of automatically generated text, our perception of natural language remains shaped by our own linguistic competence and our inclination to interpret communicative acts as conveying coherent meaning and intent, whether or not they actually do. The crux of the issue is that if one side of the communication lacks genuine meaning, any impression of implicit meaning is illusory; it stems from our uniquely human comprehension of language, which exists independently of the model.
I hope this article provides you with a comprehensive understanding of the risks and issues associated with Large Language Models. It's crucial for consumers of these models to be aware of these considerations. Artificial Intelligence, including LLMs, should not remain an abstract and complex field for the general public. It's essential to explain in straightforward terms the challenges researchers currently face and the societal hurdles ahead of us.