Cracking the Code: How Computational Methods Are Deciphering the World’s Last Undecoded Scripts

For most of recorded history, the decipherment of ancient scripts has been a fundamentally human endeavour — part intuition, part obsessive pattern recognition, part luck. Michael Ventris spent years working on Linear B before his breakthrough in 1952. Jean-François Champollion needed the Rosetta Stone and a deep command of Coptic to crack Egyptian hieroglyphs. But a handful of writing systems have resisted every attempt at human decipherment for decades or even centuries. Now, a new generation of researchers is asking whether computational methods — from statistical modelling to deep neural networks — can succeed where traditional scholarship has stalled.

The Last Great Puzzles of Human Writing

Three undeciphered scripts dominate the field. Linear A, used by the Minoan civilisation on Crete between roughly 1800 and 1450 BC, survives in approximately 1,427 inscriptions totalling around 7,400 signs — a corpus that falls short of the estimated 8,100-sign minimum critical mass needed for reliable decipherment. The Indus Valley script, used by the Harappan civilisation from around 2600 to 1900 BC, appears on over 4,000 inscribed objects but with an average text length of just five signs per inscription, offering agonisingly little data per sample. Proto-Elamite, dating to approximately 3100–2900 BC and originating in what is now southwestern Iran, is represented by over 1,600 tablets but written in roughly 1,200 non-numerical signs that no one has convincingly linked to any known language.

Each script presents its own particular challenge. But they share a common set of obstacles: no bilingual texts exist for any of them, the underlying languages remain unknown or disputed, and the surviving corpora are small enough that traditional philological methods have run out of material to work with. It is precisely this impasse that has attracted computer scientists.

Linear A and the Shadow of Minoan Crete

Linear A is the older sibling of Linear B, the script that Ventris famously deciphered as an early form of Greek. The two systems share a number of signs, and researchers have long attempted to apply Linear B’s known phonetic values to Linear A. The resulting words are largely unintelligible, which shows at least that the Minoan language behind Linear A is not Greek and suggests that it may belong to no known language family. This is the central problem. Without a related language to anchor the decipherment, reading the signs phonetically produces strings of syllables that match nothing in the linguistic record.

The corpus compounds the difficulty. Linear A inscriptions have been found across Crete — at Hagia Triada, Khania, Zakros, Phaistos — and as far afield as the Aegean islands, mainland Greece, western Asia Minor, and the Levant. But many are brief administrative lists: a toponym or personal name, a logogram, a numeral. They contain few spelled-out words and limited evidence of grammatical structure. Researchers working with such sparse data are trying to reconstruct a language from its receipts.

Recent computational work has nonetheless pushed forward. In 2024, Nepal and Perono Cacciafoco published a cryptanalytic approach in the journal Information that treats Linear A as an unintended crypto-system. They derived candidate phonetic values from feature-based comparisons with the Carian alphabet and the Cypriot syllabary, then systematically tested them against candidate languages including Ancient Egyptian, Luwian, and Hittite. Meanwhile, researchers at the University of Melbourne have begun applying deep neural network models that are pre-trained on potentially related languages and then fine-tuned on Linear A, following the methodology that proved successful for Ugaritic and Linear B at MIT.

The Indus Valley Script: 4,000 Inscriptions and No Rosetta Stone

The Indus Valley civilisation was one of the great urban cultures of the ancient world, contemporary with Egypt and Mesopotamia, stretching across what is now Pakistan and northwestern India. Its script appears on stamp seals, pottery, copper tablets, and other objects recovered from sites including Mohenjo-daro and Harappa. Iravatham Mahadevan’s concordance identifies 417 distinct signs, though estimates range from Asko Parpola’s 425 to Bryan K. Wells’s 676, with some scholars arguing the inventory reduces to as few as 39 elementary signs with the rest being scribal variants and compounds.

The fundamental barrier is brevity. The average inscription is just five signs long. The longest single-line inscription contains only 14 signs; the longest known inscription overall runs to approximately 26 characters. These are not texts in any conventional sense — they are closer to labels or stamps. The absence of any bilingual inscription and the lack of consensus on the underlying language (candidates include Dravidian, Indo-Aryan, and Munda) have left the field at an impasse for over a century.

In 2009, Rajesh P. N. Rao and colleagues at the University of Washington published a landmark paper in Science titled “Entropic Evidence for Linguistic Structure in the Indus Script.” Using a dataset of 1,548 lines of text containing 7,000 sign occurrences drawn from Mahadevan’s concordance, they measured the conditional entropy of the script — the degree of randomness in choosing the next symbol given the preceding one. They found that the Indus script’s conditional entropy fell squarely within the range of natural languages, between systems with little sequential structure and those with rigid sequential order. The similarity to Old Tamil, a Dravidian language, was noted as particularly interesting given the longstanding proto-Dravidian hypothesis.
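To make the measurement concrete, here is a minimal sketch of a conditional entropy estimate over adjacent sign pairs. It is a naive plug-in estimate run on invented toy data with abstract sign labels, not a reproduction of the published analysis, and it ignores the small-sample bias that any serious study of a corpus this size would need to handle. The function name and the toy texts are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def conditional_entropy(texts):
    """Plug-in estimate of H(next sign | previous sign), in bits."""
    pair_counts = Counter()   # counts of (previous, next) sign pairs
    prev_counts = Counter()   # counts of each sign in the "previous" slot
    for text in texts:
        for prev, nxt in zip(text, text[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    total_pairs = sum(pair_counts.values())
    entropy = 0.0
    for (prev, nxt), count in pair_counts.items():
        p_pair = count / total_pairs
        p_next_given_prev = count / prev_counts[prev]
        entropy -= p_pair * math.log2(p_next_given_prev)
    return entropy


# Invented toy data: four abstract sign labels, not real Indus signs.
signs = ["s1", "s2", "s3", "s4"]
rigid_texts = [signs] * 50                       # always the same order
rng = random.Random(0)
random_texts = [rng.choices(signs, k=6) for _ in range(200)]

h_rigid = conditional_entropy(rigid_texts)       # fully predictable: 0 bits
h_random = conditional_entropy(random_texts)     # cannot exceed log2(4) = 2 bits
print(h_rigid, h_random)
```

The two toy corpora mark the extremes: a rigidly ordered system scores zero, while signs drawn at random approach the maximum for the inventory size. A system with linguistic structure would be expected to land somewhere between them.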

In subsequent work published in the Proceedings of the National Academy of Sciences, Rao’s team applied Markov models and found that unigrams follow a Zipf-Mandelbrot distribution, that text-beginning and text-ending sign distributions are unequal (providing internal evidence for syntax), and that a quadrigram Markov chain saturates information-theoretic measures against a held-out corpus, meaning that modelling sequences longer than four signs adds little further predictive power. Practically, the model could restore doubtfully read texts with approximately 75 percent accuracy and predict deliberately removed single signs with 63 percent accuracy. A further study in PLOS ONE found that Indus sign sequences from West Asia are ordered differently from those in the Indus Valley itself, supporting the theory that the script was versatile enough to represent different types of content in different regions.
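The restoration task can be sketched with a much simpler model than the one described above. The example below uses a bigram (not quadrigram) Markov model with add-one smoothing, which is a simplification chosen here for brevity. The corpus, the sign labels, and the function names are all invented for illustration.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"   # boundary markers for text-initial and text-final positions


def train_bigrams(texts):
    """Count which sign follows which, including the text boundaries."""
    follows = defaultdict(Counter)
    for text in texts:
        padded = [START, *text, END]
        for prev, nxt in zip(padded, padded[1:]):
            follows[prev][nxt] += 1
    return follows


def bigram_prob(follows, prev, nxt, n_outcomes, alpha=1.0):
    """Smoothed P(nxt | prev); alpha=1.0 is add-one smoothing (an assumption)."""
    counts = follows.get(prev, Counter())
    return (counts[nxt] + alpha) / (sum(counts.values()) + alpha * n_outcomes)


def restore_sign(follows, vocab, text, gap):
    """Pick the sign that best fits between the gap's left and right neighbours."""
    left = text[gap - 1] if gap > 0 else START
    right = text[gap + 1] if gap + 1 < len(text) else END
    n_outcomes = len(vocab) + 1   # every sign plus the END marker

    def score(candidate):
        return (bigram_prob(follows, left, candidate, n_outcomes)
                * bigram_prob(follows, candidate, right, n_outcomes))

    return max(sorted(vocab), key=score)


# Invented toy corpus with abstract sign labels.
corpus = [
    ["s1", "s2", "s3"],
    ["s1", "s2", "s4"],
    ["s5", "s2", "s3"],
    ["s1", "s6", "s4"],
    ["s5", "s2", "s3"],
]
vocab = {sign for text in corpus for sign in text}
model = train_bigrams(corpus)
restored = restore_sign(model, vocab, ["s1", None, "s3"], gap=1)
print(restored)   # s2
```

Padding each text with boundary markers lets the same counts capture which signs tend to open or close an inscription, the kind of positional asymmetry the paragraph above cites as evidence for syntax.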

These findings remain contested. Critics have pointed to the inscriptions’ extreme brevity, low internal repetition, and high type-to-token ratio as evidence that the script may not encode language in the way Rao’s analysis implies. The debate is unresolved, but the statistical tools have shifted the terms of the argument from subjective interpretation to quantifiable measurement.

Proto-Elamite: The World’s Oldest Undeciphered Script

Proto-Elamite holds the distinction of being the oldest writing system that remains undeciphered. Dating to approximately 3100–2900 BC, contemporary with the Late Uruk and Jemdet Nasr periods in Mesopotamia, it was used in the ancient region of Elam, with the majority of its roughly 1,600 surviving tablets found at Susa. Smaller collections have come from Tall-e Malyan, Tepe Sialk, Tepe Yahya, and Shahr-i Sokhta. The tablets are now housed primarily at the Louvre and the National Museum of Iran.

Jacob Dahl, Professor of Assyriology at the University of Oxford, has led the most sustained modern effort to make Proto-Elamite accessible to computational analysis. In 2012, Dahl launched a digitisation project that eventually made nearly all known Proto-Elamite tablets available through the Cuneiform Digital Library Initiative, with high-quality images, transliterations, and a working sign list. Through grapho-tactical analysis, the initial estimate of approximately 1,900 non-numerical signs was refined downward to about 1,200. A partnership with the National Museum of Iran, formalised in 2017, led to the scanning and digitisation of Tehran’s holdings beginning in January 2018, with Oxford staff member Parsa Daneshmand carrying out the initial captures. Dahl’s 2019 publication, Tablettes et Fragments Proto-Élamites, represented a major milestone in making the corpus systematically available.

Proto-Elamite’s decipherment is complicated by several factors beyond the unknown language. Unlike Mesopotamian cuneiform, where scribal training tablets and sign lists provide a window into how the system was taught and standardised, no such pedagogical materials exist for Proto-Elamite. The signs appear poorly standardised, with evidence of individual invention and inconsistent borrowing from proto-cuneiform. Jacob Dahl himself has described it as a system showing signs of scribal incompetence. The numerical portion is partly understood, since many number signs are direct loans from Mesopotamian proto-cuneiform, but the non-numerical signs remain opaque.

In 2020, François Desset announced a decipherment of Linear Elamite, a later script from the same region, and has proposed recasting Proto-Elamite as “Early Proto-Iranian” writing, claims that remain controversial. Dahl has offered an alternative interpretation, suggesting that late third-millennium scribes may have created a new script through “schismogenesis,” using recovered Proto-Elamite tablets as inspiration rather than inheriting the system directly.

Machine Learning Approaches: From Statistical Models to Transformers

The most systematic application of machine learning to ancient script decipherment has come from MIT’s Computer Science and Artificial Intelligence Laboratory. In 2019, Regina Barzilay and Jiaming Luo published “Neural Decipherment via Minimum-Cost Flow” at the Annual Meeting of the Association for Computational Linguistics, demonstrating a sequence-to-sequence neural model that could perform unsupervised decipherment by formalising training as a minimum-cost flow problem. Applied to the already-deciphered scripts of Ugaritic and Linear B, the system improved on prior state-of-the-art by 5.5 percentage points for Ugaritic and achieved 67.3 percent accuracy on Linear B cognates — the first time any automated system had produced meaningful results on Linear B.
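The flow formulation is easier to picture through a special case. If each lost-language word may be matched to at most one known-language word and vice versa, a minimum-cost flow over the two word lists reduces to an assignment problem. The sketch below illustrates only that special case and is not the paper’s model: it substitutes plain edit distance for the learned neural character mapping, solves the assignment by brute force, and uses invented word lists. All names are hypothetical.

```python
from itertools import permutations


def edit_distance(a, b):
    """Levenshtein distance, standing in for a learned matching cost."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def match_cognates(lost, known):
    """Minimum-total-cost one-to-one matching (requires len(known) >= len(lost)).

    Brute force over permutations is only feasible for tiny lists; a real
    system would use a proper flow or assignment solver.
    """
    cost = [[edit_distance(l, k) for k in known] for l in lost]
    best = min(
        permutations(range(len(known)), len(lost)),
        key=lambda perm: sum(cost[i][j] for i, j in enumerate(perm)),
    )
    return {lost[i]: known[j] for i, j in enumerate(best)}


# Invented word lists, not real vocabulary from any language.
lost_words = ["tamo", "kiru", "sela"]
known_words = ["kiro", "selas", "damo"]
matches = match_cognates(lost_words, known_words)
print(matches)   # {'tamo': 'damo', 'kiru': 'kiro', 'sela': 'selas'}
```

Solving the matching globally, instead of pairing each word greedily with its nearest neighbour, keeps two lost words from claiming the same candidate.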

The critical limitation of that work was that it required knowing the related language in advance: Hebrew for Ugaritic, Greek for Linear B. In a 2021 follow-up published in the Transactions of the Association for Computational Linguistics, Luo, Barzilay, and collaborators addressed this constraint directly. Their new model used character embeddings based on the International Phonetic Alphabet to capture phonological geometry, the structural relationships between speech sounds that reflect consistent patterns in historical sound change. The system jointly modelled word segmentation and cognate alignment and, crucially, could score how close a lost language is to each candidate language instead of being told the related language in advance. Applied to Iberian, whose script can be read but whose language and affiliations scholars have debated for decades, the algorithm found that Basque and Latin were the closest candidates but too distant to be considered related, a finding that aligns with the current scholarly consensus that Iberian is a language isolate.
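The intuition behind phonological geometry can be shown with a toy feature table. The published model learns its embeddings, so the sketch below is only an illustration of why describing sounds by articulatory features places related sounds close together. The hand-written table, the three features chosen, and the function names are assumptions.

```python
# Hypothetical, hand-written feature table for a few consonants.
FEATURES = {
    "p": {"place": "labial",   "manner": "stop",  "voiced": False},
    "b": {"place": "labial",   "manner": "stop",  "voiced": True},
    "t": {"place": "alveolar", "manner": "stop",  "voiced": False},
    "d": {"place": "alveolar", "manner": "stop",  "voiced": True},
    "k": {"place": "velar",    "manner": "stop",  "voiced": False},
    "g": {"place": "velar",    "manner": "stop",  "voiced": True},
    "m": {"place": "labial",   "manner": "nasal", "voiced": True},
    "n": {"place": "alveolar", "manner": "nasal", "voiced": True},
}


def feature_distance(a, b):
    """Number of features on which two sounds differ."""
    fa, fb = FEATURES[a], FEATURES[b]
    return sum(fa[name] != fb[name] for name in fa)


def nearest_sounds(sound):
    """Other sounds ordered from most to least similar (ties broken alphabetically)."""
    others = [s for s in FEATURES if s != sound]
    return sorted(others, key=lambda s: (feature_distance(sound, s), s))


print(feature_distance("p", "b"))   # 1: the two differ only in voicing
print(feature_distance("p", "n"))   # 3: place, manner, and voicing all differ
print(nearest_sounds("p"))          # ['b', 'k', 't', 'd', 'g', 'm', 'n']
```

Under a representation like this, a correspondence between p and b across two languages costs less than one between p and n, which matches the direction in which regular sound changes tend to run.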

DeepMind’s Ithaca system, published as the cover article of Nature in March 2022, took a different approach. Rather than attempting full decipherment of unknown scripts, Ithaca applied deep neural networks to ancient Greek inscriptions — a known language with extensive surviving texts — to perform three tasks: restoring damaged text, attributing geographic origin, and dating inscriptions. Trained on 78,608 inscriptions from the Packard Humanities Institute dataset, Ithaca achieved 62 percent accuracy on text restoration when used alone. When historians used it as an assistive tool, their accuracy jumped from 25 to 72 percent. For chronological attribution, Ithaca dated texts to within 30 years of their actual date, compared to an average error of 144 years for human experts working without computational assistance. DeepMind has announced plans to adapt the model for Akkadian, Demotic Egyptian, Mayan, and ancient Hebrew.

The significance of Ithaca lies less in its specific results than in its demonstration of what transformer-based architectures can achieve with epigraphic data. If similar models can be trained on the far smaller corpora of undeciphered scripts — potentially through transfer learning from related writing systems — they may extract patterns that have eluded human analysis. The challenge remains that Linear A, Proto-Elamite, and the Indus script each lack the tens of thousands of training examples that make Ithaca’s approach viable.

What Decipherment Would Mean

The stakes of decipherment extend far beyond academic curiosity. The Indus Valley civilisation was one of the largest and most urbanised societies of the Bronze Age, with a population estimated at five million people, sophisticated urban planning, and long-distance trade networks. Yet because its script is unread, it remains the only major ancient civilisation for which we have no direct textual evidence of what its people thought, believed, or legislated. Deciphering the Indus script would open a window onto the social organisation, religious practices, and economic systems of a civilisation that was contemporary with — and in contact with — Mesopotamia and Egypt.

Linear A’s decipherment would reveal the language of the Minoans, the civilisation that built the palaces of Knossos and Phaistos and whose cultural influence shaped the Aegean world for centuries. Proto-Elamite would illuminate the earliest stages of complex administration in the Iranian plateau, at a moment when writing itself was just emerging as a technology.

Computational methods will not replace traditional philology. The deep knowledge of ancient languages, archaeological context, and historical linguistics that scholars bring to decipherment remains essential. As Ithaca’s results showed, the most powerful configuration is human expertise augmented by machine analysis, not replaced by it. But the tools now available represent a genuine expansion of what is possible. Statistical models can quantify structural properties of scripts that human intuition can only approximate. Neural networks can detect patterns across corpora too large or too fragmentary for any individual scholar to hold in mind. And as these methods improve, the three great undeciphered scripts of the ancient world may finally begin to yield their secrets, not to a lone genius working in isolation, but to a collaboration between human scholarship and computational power that achieves what neither could alone.
