Concept for Algorithmic Identification of Primary Spoken Language of Second-Language Speaker Using Meta-Analysis of Deviations from Proper Usage Both for Algorithmically-Translated Text and Human-Translated Text
Human beings and machines, alike, use algorithms to parse and process information concerning language. Because simple translation algorithms still lack the richness of capacity of a human translator, it is fairly easy to determine when a sample of text is the byproduct of an algorithmic translator, provided the sample size is large enough. In the world of online propaganda, nation-states use a combination of algorithmically-translated text and “expertly” translated texts prepared by humans.
Simply having a copy of the Google Translate translation matrix would enable a programmer to create an algorithm that indicates what the native language of a speaker might be when text is algorithmically translated. (It is not clear whether even this level of analysis has been achieved with an algorithm, since nothing came up in a cursory search of Google Patents.) When language is translated by a human, however, one must build one's own system for analyzing the text. That system must take into consideration common errors made even by skilled translators, including artifacts of language that do not technically constitute errors and that would be perceived as irregular only by a native speaker with great verbal aptitude.
There are dozens if not hundreds of individual parameters this algorithm could take into consideration. Of greatest value are those “errors” that do not technically constitute errors. Most useful toward this end, in this author’s judgment, would be the close analysis of linguistic collocation.
To achieve this, an extremely detailed matrix of word collocations would be required, established by consensus of expert speakers of English (for the purposes of our prototype). Our algorithm could certainly look at other elements of language as well, such as noun-verb sequence (e.g. ‘He was running’ vs. ‘Running was he’) and gender-specific language. Spanish and even Russian mark grammatical gender on elements of the language; English and some other languages largely lack such gender-specific forms. The accidental incorporation of gender-specific elements in text written by someone masquerading as an English speaker would, of course, serve as a strong red flag.
In the case of algorithmic translations, one interesting red flag to watch for is an entire body of text that translates perfectly to, and then back from, a specific language. One tactic used by adversarial propagandists is to check their translation against the translation matrix itself to see whether it remains the same when translated back into their native language. They tend to use only translations that remain consistent when translated back. This does not mean the translation is correct; in their view, however, it decreases the likelihood of error in the absence of a senior English language “expert.”
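The round-trip red flag described above can be sketched in a few lines. This is only an illustration under stated assumptions: `translate` is a hypothetical callable of the form `(text, source_lang, target_lang) -> text` that the reader would have to supply (no real translation API is assumed), and the names `round_trip_stable` and `stable_pivots` are made up for this example.

```python
from typing import Callable, Iterable, List

# Hypothetical translator signature: (text, source_lang, target_lang) -> text.
Translator = Callable[[str, str, str], str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def round_trip_stable(text: str, pivot: str, translate: Translator,
                      home: str = "en") -> bool:
    """Return True if text survives home -> pivot -> home unchanged."""
    there = translate(text, home, pivot)
    back = translate(there, pivot, home)
    return normalize(back) == normalize(text)


def stable_pivots(text: str, candidates: Iterable[str],
                  translate: Translator) -> List[str]:
    """List the candidate languages through which the text round-trips perfectly."""
    return [lang for lang in candidates
            if round_trip_stable(text, lang, translate)]
```

A long text that round-trips perfectly through exactly one candidate language would be the suspicious case; short texts will round-trip through many languages by chance, so the signal presumably only means something for large samples.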
Daniel Phillips
Understanding which dialect of the English language an adversary seems to be speaking can betray useful information concerning their nation of origin. Whereas Russian propagandists and spies study American English, Israeli, Chinese, and Indian spies, just to name a few, study British English. It is important to note, however, that the differences between American and British English are limited enough in number that a skilled linguist could, with relative ease, convince a reader in America that they are British or a reader in England that they are American, especially in written language, where we cannot see with whom we’re communicating. Algorithmic translations carry significant risk, including the risk that an unknown individual working at Google, for example, could be asked by a governmental entity to deliberately incorporate errors specific to particular languages into translated text in order to help that entity identify algorithmically translated text. It is equally plausible that China, the United States, or both are already employing this gimmick through official or unofficial requests to Google employees who have sufficient access.
While generic, run-of-the-mill propaganda can be readily identified through simple observation of the ostensible agenda being promoted and packet analysis of the associated traffic is certainly revelatory in most cases, there are forms of propaganda that are more carefully crafted and not deployed carelessly like buckshot into online forums. Some of these forms of propaganda can make their way into movie and television scripts, news teleprompter scripts, and especially YouTube and other online videos where analysis is not possible without first transcribing the video, which is time-intensive.
Anthony Hall
In these cases, accents can be faked with great skill and precision. However, the narrators are reading off of a script with instructions not to deviate from that script. These are the cases where it is most critical to know the source of the potential propaganda and where this sort of automated analysis can be best employed.
While I cannot practically list all of the possible collocations of the English language, I will provide some examples of how collocation analysis can be useful for achieving the task of identifying the native language of an individual behind English text:
Word-word collocation:
Whole dictionaries are available that provide common examples of word-word collocations. Absent the use of malapropisms, a non-native speaker will miss obvious opportunities to collocate words on paper that should already be mentally collocated. Mental collocation is the byproduct of being exposed to language for decades. Mental collocations tend to lead to physical collocations of the words in text. We take them for granted as native speakers, but careful meta-analysis of text can reveal the failure to collocate words. A non-native speaker may not have enough experience with their second language to use the most apt collocations in each possible situation. For a native speaker, there is generally a 100% overlap between their mental collocations and the collocations of words in their writing.
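As a rough sketch of what checking for missed collocations might look like, the snippet below scans adjacent word pairs against a tiny hand-made table. The table `COLLOCATIONS`, its entries, and the function name `missed_collocations` are all assumptions for illustration; a real system would need the expert-built collocation matrix described earlier.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical table: head word -> (preferred partner, near-synonyms that a
# writer without the mental collocation might reach for instead).
COLLOCATIONS: Dict[str, Tuple[str, Set[str]]] = {
    "rain": ("heavy", {"strong", "big"}),
    "tea": ("strong", {"powerful", "heavy"}),
    "food": ("fast", {"quick", "rapid"}),
}


def tokenize(text: str) -> List[str]:
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def missed_collocations(text: str,
                        table: Dict[str, Tuple[str, Set[str]]] = COLLOCATIONS
                        ) -> List[Tuple[str, str]]:
    """Return (found pair, expected pair) for each near-synonym substitution."""
    tokens = tokenize(text)
    misses = []
    for prev, word in zip(tokens, tokens[1:]):
        entry = table.get(word)
        if entry and prev in entry[1]:
            misses.append((f"{prev} {word}", f"{entry[0]} {word}"))
    return misses
```

For example, `missed_collocations("We had strong rain and powerful tea.")` flags both pairs, while "heavy rain" passes silently. Which substitutions a writer makes could then be tallied per candidate native language.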
Samuel Brooks
Run-On Sentences and Pithy Sentences:
Non-native speakers will often, even in the case of human-translated text, keep their written sentences short because they are aware of the potential pitfall of run-on sentences. They will tend to err on the side of writing shorter sentences, even at the risk of coming off as “dumb.” Spies often find it useful to be perceived in this way. Native speakers with a poor grasp of the English language will tend toward run-on sentences when writing, whereas non-natives will tend to “tie off” sentences as early as possible lest they make a mistake (something human translators share with users of automated translators) or lest they confuse the algorithm.
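The sentence-length signal is easy to measure. The sketch below is a minimal version; the thresholds `short_max` and `long_min` are arbitrary assumptions, not values from any study, and the sentence splitter is deliberately naive (it will mis-split abbreviations such as "e.g.").

```python
import re
from typing import Dict, List


def sentence_lengths(text: str) -> List[int]:
    """Word count of each sentence, splitting naively on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def length_profile(text: str, short_max: int = 8,
                   long_min: int = 30) -> Dict[str, float]:
    """Mean sentence length plus the share of very short and very long sentences."""
    lengths = sentence_lengths(text)
    if not lengths:
        return {"mean": 0.0, "short_share": 0.0, "long_share": 0.0}
    n = len(lengths)
    return {
        "mean": sum(lengths) / n,
        "short_share": sum(1 for x in lengths if x <= short_max) / n,
        "long_share": sum(1 for x in lengths if x >= long_min) / n,
    }
```

Under the hypothesis in the text, a high `short_share` would lean toward a cautious non-native writer and a high `long_share` toward a careless native one, though neither is conclusive by itself.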
Spanish and Russian Overuse of Adjectives:
Native Spanish and Russian speakers are renowned for being self-conscious about the possibility of imprecision in language, and so they will pepper their sentences with many near-synonyms, usually “-ing” words, to make sure they are conveying their point clearly, seemingly unsure of how what they’re saying will be perceived. While a native speaker engaged in persuasive writing will often use two synonyms consecutively and will unconsciously err toward alliteration in synonyms, non-native speakers will cluster three or more synonyms in the same sentence but consciously avoid alliterative synonyms, since they understand that not all words that are alliterative or share common roots actually mean the same thing. Where the native speaker tries to convince you they are “smart,” the non-native speaker tries to convince you they are native. This manifests itself as a marked difference in patterns of language usage.
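The synonym-clustering pattern could be checked roughly as follows. Everything here is a sketch under assumptions: `SYNONYM_GROUPS` is a toy, hand-made table standing in for a real thesaurus, and the labels returned by `classify_cluster` simply restate the hypothesis above rather than any validated rule.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Hypothetical synonym table: word -> group id. A real system would need a
# proper thesaurus; these entries are only for illustration.
SYNONYM_GROUPS: Dict[str, int] = {
    "clear": 0, "plain": 0, "obvious": 0, "evident": 0, "apparent": 0,
    "big": 1, "large": 1, "huge": 1, "enormous": 1,
    "safe": 2, "secure": 2, "sound": 2,
}


def synonym_clusters(sentence: str,
                     groups: Dict[str, int] = SYNONYM_GROUPS) -> List[List[str]]:
    """Return each set of two or more near-synonyms used in one sentence."""
    found = defaultdict(list)
    for word in re.findall(r"[a-z]+", sentence.lower()):
        if word in groups and word not in found[groups[word]]:
            found[groups[word]].append(word)
    return [words for words in found.values() if len(words) >= 2]


def classify_cluster(words: List[str]) -> str:
    """Label a cluster according to the hypothesis in the text (not a proven rule)."""
    # "Alliterative" here means at least two words share a first letter.
    alliterative = len({w[0] for w in words}) < len(words)
    if len(words) >= 3 and not alliterative:
        return "non-native-like"
    if len(words) == 2 and alliterative:
        return "native-like"
    return "inconclusive"
```

First-letter matching is a crude stand-in for alliteration, which is really about sounds rather than spelling; a phonetic comparison would be a natural refinement.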
Adam Bell
Chinese Failure to Use Conjunctions:
Much as with Latin, Mandarin tends to use relatively few conjunctions, and so when a Mandarin speaker writes or speaks in English, they often omit conjunctions entirely because, in their mind, people can still “get the gist” of what they’re saying. More important in that system of language is the sequence in which words or characters are used. If a native English speaker tried to learn Mandarin, they would find themselves trying to insert conjunctions where none are needed and would occasionally err in this way.
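Conjunction omission can be approximated by a simple rate. In this sketch the word list `CONJUNCTIONS` is an assumed, incomplete selection (ambiguous words such as "for" are left out on purpose), and what counts as an unusually low rate would have to be calibrated against a corpus of native writing.

```python
import re

# Assumed, non-exhaustive list of common English conjunctions.
CONJUNCTIONS = frozenset({
    "and", "but", "or", "nor", "so", "yet",
    "because", "although", "while", "if", "unless", "since",
})


def conjunction_rate(text: str) -> float:
    """Fraction of word tokens in the text that are conjunctions."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in CONJUNCTIONS) / len(tokens)
```

For instance, "I go store buy milk" scores 0.0, while "I went home and slept because I was tired" scores 2/9.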
Native-typical Malapropism vs. Non-Native Malapropism:
Non-native speakers take courses in common malapropisms of the English language, e.g. “It’s a doggy dog world,” “for all intensive purposes,” and “irregardless,” and will actively avoid misuse of these phrases and words. In most cases, they will not even attempt correct usage because they’re taught it’s hazardous territory. (If I had said ‘dangerous territory,’ that would have been an example of non-native collocation. ‘Hazardous’ and ‘dangerous’ may mean the same thing, but in the context of verbal faux pas, ‘hazardous’ is the more frequently used pairing. Some English speakers, when they speak of collocative frequency, use the expression, “It just sounds better this way.” They do not understand that collocative frequency is the essence of their perception that it sounds better.)
Aiden Morris
Those are just a few simple examples of the sorts of things such an algorithm could search for to achieve this goal. Where algorithms are relatively poor at understanding the concept of “uncanniness” when it comes to the fabrication of images (as in GANs), algorithms have an opportunity to shine when it comes to characterizing dynamics of language, since we unwittingly use word collocation almost exclusively to judge the “canniness” or “uncanniness” of someone’s language, i.e. whether they are a native speaker or not.
I believe that specific patterns found in second-language writing samples can conclusively and accurately betray the native language of the speaker, something that has real potential to allow our own counter-propaganda machinations to function more efficiently given the limited number of human language experts available to review material. Such a system would reduce the likelihood of propaganda evading detection regardless of the care taken in its preparation. Given that misattribution has been a recurring issue with enemies and allies alike, I believe such a system would also prove useful for preventing successful misattribution attacks where written taunts are involved. Software capable of achieving this end should be easy to market to the relevant agencies interested in detecting and characterizing foreign-source propaganda.
The Future Is Made in America 03Feb2022
Jayden Peterson
you know what would be funny? claim that you invented a program to determine someone's first language and then declare anybody you don't like to be a foreign spy
Brayden Perry
If China and the United States weren't already working on something like this... they are now.
absolutely and i agree to almost every part. too rigid a framework is made out of language analysis play. it is much more important than is credited here because language is a tool and it is crafted and applied like one to extraordinary evolutionary advantage.
Jacob Anderson
If only every country had a world-class workforce like we do...
This looks a little different from the official patch, if I am not mistaken, much like the NROL-67 it includes a circular object that is not part of the official mission patch. Why buy one soliton emitter when you can buy two at twice the price?
imagine not just every word, but every pause, every breath, and stutter and blink and facial tic, every motion of the hands, every hair touch, is a language of one person about their poker hand. you can align your nerves along those nerves and understand from where the deviations from self are. these are lessons in a voice difference by which acts of translation acquire meaning in terms beyond algorithmically structuring words. the reason why this is true is because people who don't share a language, or even a language root, when encountering each other historically, managed to share hospitality and trade without common structure/sign/play in linguistic terms.
how? universal meanings? how could that be. and if so then why not with dolphins and monkeys, or dogs?
Charles Stewart
The choice to paint a different design on the side of the payload capsule than the "official" patch draws a LOT of attention. Why?
The same person who had the bright idea to paint a representation of a soliton emitter on the payload capsule obviously was allowed to continue working there this entire time.
What a relief, they're honoring the NRO's request not to show the 2nd stage deployment
Good to know they're so security-conscious
John Rogers
N V I D I A Ground Control Satoshi san Ground Control Satoshi san Open your long and put all your bitcoin on Where are you Satoshi san 10 Where are you Satoshi san 9 Too high leverage coins are gone 8 pump nvidia or all your coins are gone 7 6 5 4 3 2 1 This is NVIDIA to Satoshi san Why's Bitcoin goin down?? And All of DIA hopes to know When these dead CIA are going to re-GLOW?
This is Satoshi san to DIA I'm probably pretty dead And my spirit still lives in these threads You should be grateful for what OP is delivering TODAY.
For here OP is cutting open new tech Far above your view.
Planet Earth is Blue And Americans love youuu. (WAIT WHAT?)
[smugly dancin on dead tyrants]
Although my leverage has surpassed 100X!, I want o gamble MOAAAR. And my mother has no more f-35's left to SELL. And I'm probably going to burn in Hell.
NVIDIA to Satoshi san Bitcoin is dead There's something wrong. Can you hear me Satoshi san? Can you hear me Satoshi san? PLeaase hear me Satoshi san.
Why's Bitcoin Down?
For here, I am watching too much pornhub. (((fucking zoomer))) I'm so alone. Hope is all I do. That my Bitcoins are going to moon. (LoL)
because we have common perceptions of the physical world even if we do not have common spoken languages. AKA there is enough overlap to communicate without spoken words, but not enough shared experience to communicate with other animals. However if you have a dog you CAN pick up on each other's communications due to shared experience.
Christopher Taylor
user... one is NROL-67 the other is NROL-87
Kevin Martinez
it really is just smaller part of brain for speaking things. you should hear the wild ones sing music when they kill things.
Adam Harris
Shill... Find another job.
Hudson Butler
im a fucking line cook. I would very much like another job. but a shill? What am i shilling?