Language
General
See also Being#Language, Mind#Semiotics, Documents, Typography
to sort
- Singing to babies is vital to help them learn language, say scientists | Language | The Guardian [1]
- Online Etymology Dictionary - Origin, history and meaning of English words
Blogs
Types
- https://en.wikipedia.org/wiki/Metalanguage - a language used to describe another language, often called the object language. Expressions in a metalanguage are often distinguished from those in the object language by the use of italics, quotation marks, or writing on a separate line. The structure of sentences and phrases in a metalanguage can be described by a metasyntax. For example, to say that the word "noun" can be used as a noun in a sentence, one could write: "noun" is a noun, where the quoted "noun" mentions the object-language word and the unquoted noun is used with its ordinary meaning.
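A rough sketch of that object-language/metalanguage split (the grammar, the sample sentence, and the use of NLTK are our own illustration, not from the entry above): the BNF-like grammar string is metalanguage written in a metasyntax, while the sentence being parsed belongs to the object language.

```python
# A minimal sketch, assuming the nltk package is installed (pip install nltk).
# The grammar string is metalanguage: statements *about* a fragment of English
# (the object language), written in a BNF-like metasyntax.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'noun' | 'word'
V -> 'is' | 'describes'
""")

parser = nltk.ChartParser(grammar)
sentence = "the word is a noun".split()
for tree in parser.parse(sentence):
    print(tree)  # e.g. (S (NP (Det the) (N word)) (VP (V is) (NP (Det a) (N noun))))
```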
Languages
- http://www.economist.com/news/science-and-technology/21707183-researchers-uncover-ancient-links-between-majority-worlds [5]
- Glottolog - Comprehensive reference information for the world's languages, especially the lesser known languages.
- CLLD - Cross-Linguistic Linked Data - Helping collect the world's language diversity heritage.
- https://github.com/clld/clld
- Documentation - The goal of the Cross-Linguistic Linked Data project (CLLD) is to help record the world’s language diversity heritage. This is to be facilitated by developing, providing and maintaining interoperable data publication structures.
- https://en.wikipedia.org/wiki/Controlled_natural_language - controlled natural languages (CNLs) are subsets of natural languages that are obtained by restricting the grammar and vocabulary in order to reduce or eliminate ambiguity and complexity. Traditionally, controlled languages fall into two major types: those that improve readability for human readers (e.g. non-native speakers), and those that enable reliable automatic semantic analysis of the language.
The second type of language has a formal syntax and formal semantics, and can be mapped to an existing formal language, such as first-order logic. Such languages can thus be used as knowledge representation languages, and writing in them is supported by fully automatic consistency and redundancy checks, query answering, etc.
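A rough illustration of that mapping (the sentence and the predicate names man/mortal are our own toy example; NLTK is assumed to be installed): the controlled sentence "Every man is mortal" corresponds to a closed first-order formula.

```python
# A minimal sketch, assuming nltk is installed. The CNL sentence and the
# predicate names (man, mortal) are illustrative, not from the entry above.
from nltk.sem import Expression

# "Every man is mortal" maps to a universally quantified implication:
formula = Expression.fromstring('all x.(man(x) -> mortal(x))')

print(formula)         # all x.(man(x) -> mortal(x))
print(formula.free())  # set() - the formula is closed, so it can be checked
                       # against a knowledge base, e.g. with nltk's tableau
                       # prover or an external prover such as Prover9.
```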
Proto-Indo-European
- https://en.wikipedia.org/wiki/Proto-Indo-European_language - the linguistic reconstruction of the hypothetical common ancestor of the Indo-European languages, the most widely spoken language family in the world.
Far more work has gone into reconstructing PIE than any other proto-language, and it is by far the best understood of all proto-languages of its age. The vast majority of linguistic work during the 19th century was devoted to the reconstruction of PIE or its daughter proto-languages (such as Proto-Germanic), and most of the modern techniques of linguistic reconstruction such as the comparative method were developed as a result. These methods supply all current knowledge concerning PIE since there is no written record of the language.
PIE is estimated to have been spoken as a single language from 4500 BC to 2500 BC during the Late Neolithic to Early Bronze Age, though estimates vary by more than a thousand years. According to the prevailing Kurgan hypothesis, the original homeland of the Proto-Indo-Europeans may have been in the Pontic-Caspian steppe of eastern Europe. The linguistic reconstruction of PIE has also provided insight into the culture and religion of its speakers.
As Proto-Indo-Europeans became isolated from each other through the Indo-European migrations, the regional dialects spoken by the various groups underwent the divergence of the Indo-European sound laws, and along with shifts in morphology, these dialects slowly transformed into the known ancient Indo-European languages. From there, further linguistic divergence led to the evolution of their current descendants, the modern Indo-European languages. Today, the descendant languages, or daughter languages, of PIE with the most speakers are Spanish, English, Hindustani (Hindi and Urdu), Portuguese, Bengali, Russian, Punjabi, German, Persian, French, Italian and Marathi. Hundreds of other living descendants of PIE range from languages as diverse as Albanian (gjuha shqipe), Kurdish (کوردی), Nepali (खस भाषा), Tsakonian (τσακώνικα), Ukrainian (українська мова), and Welsh (Cymraeg).
- https://en.wikipedia.org/wiki/Grimm%27s_law - (also known as the First Germanic Sound Shift or Rask's rule) is a set of statements named after Jacob Grimm and Rasmus Rask describing the inherited Proto-Indo-European (PIE) stop consonants as they developed in Proto-Germanic (the common ancestor of the Germanic branch of the Indo-European family) in the 1st millennium BC. It establishes a set of regular correspondences between early Germanic stops and fricatives and the stop consonants of certain other centum Indo-European languages (Grimm used mostly Latin and Greek for illustration). A small illustration follows this list.
- https://en.wikipedia.org/wiki/Indo-European_languages - a language family of several hundred related languages and dialects. There are about 445 living Indo-European languages, according to the estimate by Ethnologue, with over two thirds (313) of them belonging to the Indo-Iranian branch.
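To make Grimm's correspondences above concrete, here is a small sketch; the cognate pairs are standard textbook examples rather than anything from the entries above.

```python
# A small sketch of three of Grimm's law correspondences (PIE voiceless stops
# becoming Proto-Germanic fricatives), with standard textbook cognate pairs.
# Latin stands in here for the conservative centum consonantism.
GRIMM = {
    "p -> f":  [("pater (Latin)", "father"), ("pes/pedis (Latin)", "foot")],
    "t -> th": [("tres (Latin)", "three"), ("tu (Latin)", "thou")],
    "k -> h":  [("cornu (Latin)", "horn"), ("canis (Latin)", "hound")],
}

for shift, pairs in GRIMM.items():
    for conservative, germanic in pairs:
        print(f"{shift}: {conservative} ~ English {germanic}")
```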
Pali
English
- http://www.tfd.com/
- http://www.etymonline.com/
- http://www.kokogiak.com/logolepsy/
- http://splasho.com/upgoer5/
- http://www.wordsforthat.com/
- Flipped Form - A flipped form uses everyday language, not legalese, for legal terms like contract obligations or policy requirements. With a regular legal form, the lawyers are the experts. People using the form rely on the lawyers to interpret their needs. With a flipped form, users are the experts. Lawyers help users express their needs in their own language, helping with notes on the law and what the form should cover. If part of a flipped form doesn’t make sense to you, that’s the form’s fault, not yours. The form should change so you can understand and feel confident. You’re the expert on what you need. Only your feedback can make a flipped form a good form.
- http://www.theguardian.com/commentisfree/2014/mar/11/pronunciation-errors-english-language
- http://public.wsu.edu/~brians/errors/
Slang
- http://home.earthlink.net/~dlarkins/slang-pg.htm
- http://local.aaca.org/bntc/slang/slang.htm
- http://www.odps.org/glossword/index.php
- arXiv:1907.03920 - Hahahahaha, Duuuuude, Yeeessss!: A two-parameter characterization of stretchable words and the dynamics of mistypings and misspellings
Scots
French
- https://github.com/soulaklabs/bitoduc.fr - A website about French words for computer concepts.
Japanese
- Japanese Complete (ジャパニーズ・コンプリート) - With 777 of the most frequent kanji, one has 90.0% coverage of kanji in the wild!
Acquisition
Translation
- https://github.com/soimort/translate-shell - a command-line translator powered by Google Translate (default), Bing Translator, Yandex.Translate, and Apertium. It gives you easy access to one of these translation engines in your terminal. (A small wrapper sketch follows this list.)
- BabelFish.org is a fish that translates speech from one language to another.
- EUdict is a collection of online dictionaries for the languages spoken mostly in the European Community. These dictionaries are the result of the work of many authors who worked very hard and finally offered their product free of charge on the internet, thus making it easier for all of us to communicate with each other.
- dict.cc is not only an online dictionary. It's an attempt to create a platform where users from all over the world can share their knowledge in the field of translations. Every visitor can suggest new translations and correct or confirm other users' suggestions.
- Linguee - Dictionary and search engine for 100 million translations.
- Dictionarist - provides translations in English, Spanish, Portuguese, German, French, Italian, Russian, Turkish, Dutch, Greek, Chinese, Japanese, Korean, Arabic, Hindi, Indonesian, Polish, Romanian, Ukrainian and Vietnamese.
- http://rut.org/cgi-bin/j-e/dict - Japanese
- Pootle - Community localization server. Get your community translating your software into their languages.
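The wrapper sketch mentioned above: a thin Python convenience layer around translate-shell's `trans` binary (`-b` is translate-shell's brief-output flag; the wrapper function itself is our own illustration).

```python
# A minimal sketch wrapping the translate-shell CLI listed above; assumes the
# `trans` binary is on PATH. `-b` requests brief output, and `:fr` selects the
# target language, per translate-shell's usage.
import subprocess

def translate(text: str, target: str = "fr") -> str:
    """Translate text into the target language code via translate-shell."""
    result = subprocess.run(
        ["trans", "-b", f":{target}", text],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    print(translate("Hello, world", "fr"))  # e.g. "Bonjour le monde"
```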
Other
- FrathWiki - information for the conlanging and linguistics community
- Unker - Non-Linear Writing System
- https://github.com/ghoomy/silili - a small, logical, work-in-progress language, with a minimalist grammar and an efficient vocabulary.
to sort
- Material of language - Language is more than just words and meanings: it’s paper and ink, pixels and screens, fingertips on keyboards, voices speaking out loud. Language is, in a word, material. In this course, students will gain an understanding of how the material of language is represented digitally, and learn computational techniques for manipulating this material in order to create speculative technologies that challenge conventional reading and writing practices. Topics include asemic writing, concrete poetry, markup languages, keyboard layouts, interactive and generative typography, printing technologies and bots (alongside other forms of radical publishing). Students will complete a series of weekly readings and production-oriented assignments leading up to a final project. In addition to critique, sessions will feature lectures, class discussions and technical tutorials. Prerequisites: Introduction to Computational Media or equivalent programming experience.
Ethos, practice, programming: "Let us have no more of those successive, incessant, back and forth motions of our eyes, tracking from one line to the next and beginning all over again—otherwise we will miss that ecstasy in which we have become immortal for a brief hour, free of all reality, and raise our obsessions to the level of creation." — Stéphane Mallarmé, “The Book: A Spiritual Instrument,” Selected Poetry and Prose, edited by Mary Ann Caws (New York: New Directions, 1982), p. 82.
This class concerns what happens when language becomes manifest in the world, with a particular focus on forms of language or forms of manifestation that foreground computation and/or interactive media. Our methodology for approaching these themes is free-form, drawing from critical making, speculative design, creative writing, and the humanities. In particular, the class asserts that making things is one of the most effective ways of learning how to think critically.
Numbers
- https://en.wikipedia.org/wiki/English_numerals
- https://en.wikipedia.org/wiki/Ordinal_indicator - st, nd, rd, th, etc. (A short sketch follows this list.)
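The English ordinal rule is easy to get wrong around the teens; here is a small sketch of the usual algorithm (the function name is our own).

```python
# A small sketch of the usual English ordinal-indicator rule: 1st, 2nd, 3rd,
# 4th, ... but 11th, 12th, 13th (the teens always take "th").
def ordinal(n: int) -> str:
    if n % 100 in (11, 12, 13):
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

print([ordinal(n) for n in (1, 2, 3, 4, 11, 12, 13, 21, 101, 111)])
# ['1st', '2nd', '3rd', '4th', '11th', '12th', '13th', '21st', '101st', '111th']
```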
Runes
- https://en.wikipedia.org/wiki/Anglo-Saxon_runes - (Old English: rūna ᚱᚢᚾᚪ) are runes used by the early Anglo-Saxons as an alphabet in their writing system. The characters are known collectively as the futhorc (ᚠᚢᚦᚩᚱᚳ fuþorc) from the Old English sound values of the first six runes. The futhorc was a development from the 24-character Elder Futhark. Since the futhorc runes are thought to have first been used in Frisia before the Anglo-Saxon settlement of Britain, they have also been called Anglo-Frisian runes. They were likely to have been used from the 5th century onward, recording Old English and Old Frisian.
They were gradually supplanted in Anglo-Saxon England by the Old English Latin alphabet introduced by missionaries. Futhorc runes were no longer in common use by the eleventh century, but The Byrhtferth Manuscript (MS Oxford St John's College 17) indicates that fairly accurate understanding of them persisted into at least the twelfth century.
Anglish
- YouTube: The Sound of the Anglish / Pure English language (UDHR, Numbers, Words, Story & Sample Text)
- The Anglish Times - News written in Anglish, a kind of English that does not use words borrowed from other languages.
- YouTube: The Anglish Times
to sort from Being
- https://en.wikipedia.org/wiki/Rhetorical_modes - also known as modes of discourse, describe the variety, conventions, and purposes of the major kinds of language-based communication, particularly writing and speaking. Four of the most common rhetorical modes and their purpose are narration, description, exposition, and argumentation.
- https://en.wikipedia.org/wiki/Description - the act of description may be related to that of definition. Description is also the fiction-writing mode for transmitting a mental image of the particulars of a story. Definition: the pattern of development that presents a word picture of a thing, a person, a situation, or a series of events.
- https://en.wikipedia.org/wiki/Narrative - or story is any report of connected events, real or imaginary, presented in a sequence of written or spoken words, and/or still or moving images. Narrative can be organized in a number of thematic and/or formal categories: non-fiction (such as definitively including creative non-fiction, biography, journalism, transcript poetry, and historiography); fictionalization of historical events (such as anecdote, myth, legend, and historical fiction); and fiction proper (such as literature in prose and sometimes poetry, such as short stories, novels, and narrative poems and songs, and imaginary narratives as portrayed in other textual forms, games, or live or recorded performances). Narrative is found in all forms of human creativity, art, and entertainment, including speech, literature, theatre, music and song, comics, journalism, film, television and video, radio, gameplay, unstructured recreation, and performance in general, as well as some painting, sculpture, drawing, photography, and other visual arts (though several modern art movements refuse the narrative in favor of the abstract and conceptual), as long as a sequence of events is presented. The word derives from the Latin verb narrare, "to tell", which is derived from the adjective gnarus, "knowing" or "skilled".
Oral storytelling is perhaps the earliest method for sharing narratives. During most people's childhoods, narratives are used to guide them on proper behavior, cultural history, formation of a communal identity, and values, as especially studied in anthropology today among traditional indigenous peoples. Narratives may also be nested within other narratives, such as narratives told by an unreliable narrator (a character) typically found in the noir fiction genre. An important part of narration is the narrative mode, the set of methods used to communicate the narrative through a process of narration. Along with exposition, argumentation, and description, narration, broadly defined, is one of four rhetorical modes of discourse. More narrowly defined, it is the fiction-writing mode in which the narrator communicates directly to the reader.
- https://en.wikipedia.org/wiki/Exposition_(narrative) - the insertion of important background information within a story; for example, information about the setting, characters' backstories, prior plot events, historical context, etc. In a specifically literary context, exposition appears in the form of expository writing embedded within the narrative.
Linguistics
- https://en.wikipedia.org/wiki/Linguistics - the scientific study of language, specifically language form, language meaning, and language in context. The earliest activities in the description of language have been attributed to the 4th century BCE Indian grammarian Pāṇini, who was an early student of linguistics and wrote a formal description of the Sanskrit language in his Aṣṭādhyāyī.
Linguistics analyzes human language as a system for relating sounds (or signs in signed languages) and meaning. Phonetics studies acoustic and articulatory properties of the production and perception of speech sounds and non-speech sounds. The study of language meaning, on the other hand, deals with how languages encode relations between entities, properties, and other aspects of the world to convey, process, and assign meaning, as well as to manage and resolve ambiguity. While the study of semantics typically concerns itself with truth conditions, pragmatics deals with how context influences meanings.
Grammar is a system of rules which govern the form of the utterances in a given language. It encompasses both sound and meaning, and includes phonology (how sounds or gestures function together), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences from words).
In the early 20th century, Ferdinand de Saussure distinguished between the notions of langue and parole in his formulation of structural linguistics. According to him, parole is the specific utterance of speech, whereas langue refers to an abstract phenomenon that theoretically defines the principles and system of rules that govern a language. This distinction resembles the one made by Noam Chomsky between competence and performance, where competence is an individual's ideal knowledge of a language, while performance is the specific way in which it is used.
The formal study of language has also led to the growth of fields like psycholinguistics, which explores the representation and function of language in the mind; neurolinguistics, which studies language processing in the brain; and language acquisition, which investigates how children and adults acquire a particular language.
Linguistics also includes non-formal approaches to the study of other aspects of human language, such as social, cultural, historical and political factors. The study of cultural discourses and dialects is the domain of sociolinguistics, which looks at the relation between linguistic variation and social structures, as well as that of discourse analysis, which examines the structure of texts and conversations. Research on language through historical and evolutionary linguistics focuses on how languages change, and on the origin and growth of languages, particularly over an extended period of time.
Corpus linguistics takes naturally occurring texts and studies the variation of grammatical and other features based on such corpora. Stylistics involves the study of patterns of style: within written, signed, or spoken discourse. Language documentation combines anthropological inquiry with linguistic inquiry to describe languages and their grammars. Lexicography covers the study and construction of dictionaries. Computational linguistics applies computer technology to address questions in theoretical linguistics, as well as to create applications for use in parsing, data retrieval, machine translation, and other areas. People can apply actual knowledge of a language in translation and interpreting, as well as in language education – the teaching of a second or foreign language. Policy makers work with governments to implement new plans in education and teaching which are based on linguistic research.
- https://en.wikipedia.org/wiki/Linguistic_turn - a major development in Western philosophy during the 20th century, the most important characteristic of which is the focusing of philosophy and the other humanities primarily on the relationship between philosophy and language.
- Max Planck Neuroscience on Nautilus: Brainwaves Encode the Grammar of Human Language - The relative timing of brainwaves encodes the structure of a sentence.
- https://en.wikipedia.org/wiki/Genetic_relationship_(linguistics) - Two languages have a genetic relationship, and belong to the same language family, if both are descended from a common ancestor through the process of language change, or one is descended from the other. The term and the process of language evolution are independent of, and not reliant on, the terminology, understanding, and theories related to genetics in the biological sense, so, to avoid confusion, some linguists prefer the term genealogical relationship.
- https://en.wikipedia.org/wiki/Comparative_method - a technique for studying the development of languages by performing a feature-by-feature comparison of two or more languages with common descent from a shared ancestor and then extrapolating backwards to infer the properties of that ancestor.
- https://en.wikipedia.org/wiki/Internal_reconstruction - a method of reconstructing an earlier state in a language's history using only language-internal evidence of the language in question.
- https://en.wikipedia.org/wiki/Linguistic_typology - (or language typology) is a field of linguistics that studies and classifies languages according to their structural features to allow their comparison. Its aim is to describe and explain the structural diversity and the common properties of the world's languages. Its subdisciplines include, but are not limited to, phonological typology, which deals with sound features; syntactic typology, which deals with word order and form; lexical typology, which deals with language vocabulary; and theoretical typology, which aims to explain the universal tendencies.
Linguistic typology is contrasted with genealogical linguistics on the grounds that typology groups languages or their grammatical features based on formal similarities rather than historical descent. The issue of genealogical relation is however relevant to typology because modern data sets aim to be representative and unbiased. Samples are collected evenly from different language families, emphasizing the importance of exotic languages in gaining insight into human language.
- https://en.wikipedia.org/wiki/Applied_linguistics - an interdisciplinary field of linguistics that identifies, investigates, and offers solutions to language-related real-life problems. Some of the academic fields related to applied linguistics are education, psychology, computer science, communication research, anthropology, and sociology.
- https://en.wikipedia.org/wiki/Corpus_linguistics - the study of a language as that language is expressed in its text corpus (plural corpora), its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field, the natural context ("realia") of that language, with minimal experimental interference. The large collections of text allow linguists to run quantitative analyses on linguistic concepts that are otherwise harder to quantify. The text-corpus method uses the body of texts written in any natural language to derive the set of abstract rules which govern that language. Those results can be used to explore the relationships between that subject language and other languages which have undergone a similar analysis. The first such corpora were manually derived from source texts, but now that work is automated.
- https://en.wikipedia.org/wiki/Text_corpus - In linguistics and natural language processing, a corpus (plural corpora), or text corpus, is a dataset consisting of natively digital and older, digitized language resources, either annotated or unannotated. Annotated, they have been used in corpus linguistics for statistical hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory. In search technology, a corpus is the collection of documents which is being searched.
A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). In order to make the corpora more useful for doing linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.
Some corpora have further structured levels of analysis applied. In particular, smaller corpora may be fully parsed. Such corpora are usually called Treebanks or Parsed Corpora. The difficulty of ensuring that the entire corpus is completely and consistently annotated means that these corpora are usually smaller, containing around one to three million words. Other levels of linguistic structured analysis are possible, including annotations for morphology, semantics and pragmatics.
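A quick sketch of the part-of-speech annotation described above, using NLTK (assuming nltk plus its punkt tokenizer and averaged perceptron tagger models have been downloaded; the sentence is our own).

```python
# A minimal sketch of POS-tagging, the kind of corpus annotation described
# above. Assumes nltk with its "punkt" and "averaged_perceptron_tagger"
# resources installed via nltk.download().
import nltk

tokens = nltk.word_tokenize("The cat sat on the mat.")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'),
#  ('the', 'DT'), ('mat', 'NN'), ('.', '.')]
```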
- https://en.wikipedia.org/wiki/Speech_corpus - or spoken corpus, is a database of speech audio files and text transcriptions. In speech technology, speech corpora are used, among other things, to create acoustic models (which can then be used with a speech recognition or speaker identification engine). In linguistics, spoken corpora are used to do research into phonetics, conversation analysis, dialectology and other fields.
- https://en.wikipedia.org/wiki/Non-native_speech_database - a speech database of non-native pronunciations of English. Such databases are used in the development of: multilingual automatic speech recognition systems, text to speech systems, pronunciation trainers, and second language learning systems.
- https://en.wikipedia.org/wiki/Computational_linguistics - an interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, computational linguistics draws upon linguistics, computer science, artificial intelligence, mathematics, logic, philosophy, cognitive science, cognitive psychology, psycholinguistics, anthropology and neuroscience, among others. Since the 2020s, computational linguistics has become a near-synonym of either natural language processing or language technology, with deep learning approaches, such as large language models, outperforming the specific approaches previously used in the field.
The field overlapped with artificial intelligence since the efforts in the United States in the 1950s to use computers to automatically translate texts from foreign languages, particularly Russian scientific journals, into English. Since rule-based approaches were able to make arithmetic (systematic) calculations much faster and more accurately than humans, it was expected that lexicon, morphology, syntax and semantics could be learned using explicit rules as well. After the failure of rule-based approaches, David Hays coined the term in order to distinguish the field from AI and co-founded both the Association for Computational Linguistics (ACL) and the International Committee on Computational Linguistics (ICCL) in the 1970s and 1980s. What started as an effort to translate between languages evolved into a much wider field of natural language processing.
- https://en.wikipedia.org/wiki/Etymology - is the history of words, their origins, and how their form and meaning have changed over time. By extension, the term "the etymology of [a word]" means the origin of the particular word.
- https://en.wikipedia.org/wiki/Historical_linguistics - also termed diachronic linguistics and formerly glottology, is the scientific study of language change over time. Principal concerns of historical linguistics include: to describe and account for observed changes in particular languages; to reconstruct the pre-history of languages and to determine their relatedness, grouping them into language families (comparative linguistics); to develop general theories about how and why language changes; to describe the history of speech communities; and to study the history of words, i.e. etymology.
Becker's Criterion: "Any theory (or partial theory) of the English Language that is expounded in the English Language must account for (or at least apply to) the text of its own exposition."
Becker's Razor: his final riposte to theoretical linguists: "Elegance and truth are inversely related", after which he finishes with, "Put that in your phrasal lexicon and invoke it!"
- YouTube: Chomsky on Zizek and Lacan
- https://en.wikipedia.org/wiki/Transformational_grammar - or transformational-generative grammar (TGG) is part of the theory of generative grammar, especially of natural languages. It considers grammar to be a system of rules that generate exactly those combinations of words that form grammatical sentences in a given language and involves the use of defined operations (called transformations) to produce new sentences from existing ones.
The method is commonly associated with the American linguist Noam Chomsky's biologically oriented concept of language. But in logical syntax, Rudolf Carnap introduced the term "transformation" in his application of Alfred North Whitehead's and Bertrand Russell's Principia Mathematica. In such a context, the addition of the values of one and two, for example, transform into the value of three; many types of transformation are possible.
Generative algebra was first introduced to general linguistics by the structural linguist Louis Hjelmslev, although the method was described before him by Albert Sechehaye in 1908. Chomsky adopted the concept of transformation from his teacher Zellig Harris, who followed the American descriptivist separation of semantics from syntax. Hjelmslev's structuralist conception including semantics and pragmatics is incorporated into functional grammar.
- https://en.wikipedia.org/wiki/Linguistic_typology - a subfield of linguistics that studies and classifies languages according to their structural and functional features. Its aim is to describe and explain the common properties and the structural diversity of the world's languages. It includes three subdisciplines: qualitative typology, which deals with the issue of comparing languages and within-language variance; quantitative typology, which deals with the distribution of structural patterns in the world’s languages; and theoretical typology, which explains these distributions.
- https://en.wikipedia.org/wiki/Anaphora_(linguistics) - the use of an expression whose interpretation depends upon another expression in context (its antecedent or postcedent)
- https://en.wikipedia.org/wiki/Deixis - the use of general words and phrases to refer to a specific time, place, or person in context, e.g., the words tomorrow, there, and they. Words are deictic if their semantic meaning is fixed but their denoted meaning varies depending on time and/or place. Words or phrases that require contextual information to be fully understood—for example, English pronouns—are deictic. Deixis is closely related to anaphora. Although this article deals primarily with deixis in spoken language, the concept is sometimes applied to written language, gestures, and communication media as well. In linguistic anthropology, deixis is treated as a particular subclass of the more general semiotic phenomenon of indexicality, a sign "pointing to" some aspect of its context of occurrence.
- https://en.wikipedia.org/wiki/Literary_technique
- https://en.wikipedia.org/wiki/Literary_element
- https://en.wikipedia.org/wiki/Stylistic_device
- Field Linguist's Toolbox - a data management and analysis tool for field linguists. It is especially useful for maintaining lexical data, and for parsing and interlinearizing text, but it can be used to manage virtually any kind of data. Although Toolbox is very powerful, it is designed to be easy to learn. The user can start with a simple standard setup and gradually add the use of more powerful features as desired. The Toolbox downloads include a training package that is usable for self-paced individual learning as well as for classroom teaching of Toolbox. [13]
- FieldWorks - consists of software tools that help you manage linguistic and cultural data. FieldWorks supports tasks ranging from the initial entry of collected data through to the preparation of data for publication, including dictionary development, interlinearization of texts, morphological analysis, and other publications. Furthermore, FieldWorks BTE contains a specialized drafting and editing environment for Bible Translators, which provides interaction with the language data stored in Language Explorer.
- AGGREGATION - Implemented grammars can contribute to endangered language documentation in several ways. In the first instance, the grammars themselves provide a very rich addition to prose descriptive grammars, allowing linguists to explore analyses at a level of precision not usually achieved in prose descriptions. Furthermore, implemented grammars can be used to create treebanks, that is, collections of utterances (from running text or elicited examples) associated with syntactic and semantic structures. The process of creating the treebank can provide important feedback to the field linguist about aspects of the linguistic data not covered by current analyses. The resulting treebanks can be used to create further computational tools and are also a rich source of comparable data for qualitative and quantitative work in typology, grounding higher level linguistic abstractions in actual utterances in a computationally tractable fashion. Despite these advantages, grammar engineering for language documentation has gone largely unexplored. In this project, we investigate how to automate the construction of grammar fragments, building on interlinear glossed text (IGT) and the LinGO Grammar Matrix, a typologically motivated cross-linguistic computational resource.
- Home - DELPH-IN - Computational linguists from research sites world-wide have joined forces in a collaborative effort aimed at ‘deep’ linguistic processing of human language. The goal is the combination of linguistic and statistical processing methods for getting at the meaning of texts and utterances. The partners have adopted Head-Driven Phrase Structure Grammar (HPSG) and Minimal Recursion Semantics (MRS), two advanced models of formal linguistic analysis. They have also committed themselves to a shared format for grammatical representation and to a rigid scheme of evaluation, as well as to the general use of open-source licensing and transparency.
- SFST - A toolbox for the implementation of morphological analysers
- Foma - A Finite State Compiler and Library - a compiler, programming language, and C library for constructing finite-state automata and transducers for various uses. It has specific support for many natural language processing applications such as producing morphological analyzers. Although NLP applications are probably the main use of foma, it is sufficiently generic to use for a large number of purposes.The library contains efficient implementations of all classical automata/transducer algorithms: determinization, minimization, epsilon-removal, composition, boolean operations. Also, more advanced construction methods are available: context restriction, quotients, first-order regular logic, transducers from replacement rules, etc.
- Helsinki Finite-State Technology - Project Web Hosting - Open Source Software - intended for processing natural language morphologies. The toolkit is demonstrated by wide-coverage implementations of a number of languages of varying morphological complexity.
- Tatoeba - a collection of sentences and translations. It's collaborative, open, free and even addictive.
- https://en.wikipedia.org/wiki/Natural_language_processing - NLP, is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate speech. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. statistical and, most recently, neural network-based) machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation.
- https://en.wikipedia.org/wiki/History_of_natural_language_processing - describes the advances of natural language processing (Outline of natural language processing). There is some overlap with the history of machine translation, the history of speech recognition, and the history of artificial intelligence.
- https://en.wikipedia.org/wiki/Natural-language_understanding - NLU, or natural-language interpretation (NLI) is a subtopic of natural-language processing in artificial intelligence that deals with machine reading comprehension. Natural-language understanding is considered an AI-hard problem. There is considerable commercial interest in the field because of its application to automated reasoning, machine translation, question answering, news-gathering, text categorization, voice-activation, archiving, and large-scale content analysis.
- https://en.wikipedia.org/wiki/Question_answering - a computer science discipline within the fields of information retrieval and natural language processing (NLP) that is concerned with building systems that automatically answer questions that are posed by humans in a natural language. A question-answering implementation, usually a computer program, may construct its answers by querying a structured database of knowledge or information, usually a knowledge base. More commonly, question-answering systems can pull answers from an unstructured collection of natural language documents.
- Using Natural Language Processing to Extract Relations: A Systematic Mapping of Open Information Extraction Approaches - Context: for thousands of years, humans have translated their knowledge into natural language and registered it so that others can access it. Natural Language Processing (NLP) is a subarea of Artificial Intelligence (AI) that studies linguistic phenomena and uses computational methods to process natural language written texts. More specific areas such as Open Information Extraction (Open IE) were created to perform information extraction in textual databases, such as relationship triples, without prior information about their context or structure. Recently, research was conducted grouping studies related to Open IE initiatives. However, some information about this domain can still be explored.
Objective: this work aims to identify in literature the main characteristics that involve the Open IE approaches.
Method: in order to achieve the proposed objective, first we conducted the update of a mapping study, and then, we performed backward snowballing and manual search to find publications of researchers and research groups that accomplished these studies. In addition, we also considered a specialized electronic database in NLP.
- Results: the study resulted in a set of 159 studies proposing Open IE approaches. Data analysis showed a migration from the use of supervised techniques to neural techniques. The study also showed that the most commonly used data sets are journalistic news. Moreover, the preferred metrics for evaluating approaches are precision and recall.
- Conclusion: many Open IE approaches have been published and community interest is growing in this topic. The advance of the areas of AI and neural networks allowed these techniques to be used to extract relevant information from texts that can be used later by other areas.
- https://github.com/rodolfo-brandao/systematic-mapping - Mining Depression Signs on Social Media: A Systematic Mapping - a systematic mapping of the literature on academic trends in Computer Science related to extracting information from social networks in order to identify possible cases of depression among their users.
Semiotics
See also Maths#Logic
- https://en.wikipedia.org/wiki/Semiotics - (also called semiotic studies; not to be confused with the Saussurean tradition called semiology, which is a part of semiotics) is the study of meaning-making, the study of sign processes and meaningful communication. This includes the study of signs and sign processes (semiosis), indication, designation, likeness, analogy, allegory, metonymy, metaphor, symbolism, signification, and communication.
Semiotics is closely related to the field of linguistics, which, for its part, studies the structure and meaning of language more specifically. The semiotic tradition explores the study of signs and symbols as a significant part of communications. As different from linguistics, however, semiotics also studies non-linguistic sign systems.
Semiotics is frequently seen as having important anthropological dimensions; for example, the late Italian semiotician and novelist Umberto Eco proposed that every cultural phenomenon may be studied as communication. Some semioticians focus on the logical dimensions of the science, however. They examine areas belonging also to the life sciences—such as how organisms make predictions about, and adapt to, their semiotic niche in the world (see semiosis). In general, semiotic theories take signs or sign systems as their object of study: the communication of information in living organisms is covered in biosemiotics (including zoosemiotics).
- Semiotics for Beginners - Daniel Chandler
- https://en.wikipedia.org/wiki/Syntagma_(linguistics) - an elementary constituent segment within a text. Such a segment can be a phoneme, a word, a grammatical phrase, a sentence, or an event within a larger narrative structure, depending on the level of analysis. Syntagmatic analysis involves the study of relationships (rules of combination) among syntagmas.
At the lexical level, syntagmatic structure in a language is the combination of words according to the rules of syntax for that language. For example, English uses determiner + adjective + noun, e.g. the big house. Another language might use determiner + noun + adjective (Spanish la casa grande) and therefore have a different syntagmatic structure.
At a higher level, narrative structures feature a realistic temporal flow guided by tension and relaxation; thus, for example, events or rhetorical figures may be treated as syntagmas of epic structures.
Syntagmatic structure is often contrasted with paradigmatic structure. In semiotics, "syntagmatic analysis" is analysis of syntax or surface structure (syntagmatic structure), rather than paradigms as in paradigmatic analysis. Analysis is often achieved through commutation tests.
- https://en.wikipedia.org/wiki/Syntagmatic_analysis - is analysis of syntax or surface structure (syntagmatic structure) as opposed to paradigms (paradigmatic analysis). This is often achieved using commutation tests.
- https://en.wikipedia.org/wiki/Computational_semiotics
- https://en.wikipedia.org/wiki/Cognitive_semiotics
- Ontology is Overrated: Categories, Links, and Tags - "The Only Group That Can Categorize Everything Is Everybody"
- https://en.wikipedia.org/wiki/Process_philosophy - ontology of becoming, Whitehead
Pragmatics
- https://en.wikipedia.org/wiki/Pragmatics - a subfield of linguistics and semiotics that studies the ways in which context contributes to meaning. Pragmatics encompasses speech act theory, conversational implicature, talk in interaction and other approaches to language behavior in philosophy, sociology, linguistics and anthropology.
Unlike semantics, which examines meaning that is conventional or "coded" in a given language, pragmatics studies how the transmission of meaning depends not only on structural and linguistic knowledge (e.g., grammar, lexicon, etc.) of the speaker and listener, but also on the context of the utterance, any pre-existing knowledge about those involved, the inferred intent of the speaker, and other factors. In this respect, pragmatics explains how language users are able to overcome apparent ambiguity, since meaning relies on the manner, place, time etc. of an utterance.
The ability to understand another speaker's intended meaning is called pragmatic competence.
Phonetics
- https://en.wikipedia.org/wiki/Phonetics - a branch of linguistics that comprises the study of the sounds of human speech, or—in the case of sign languages—the equivalent aspects of sign. It is concerned with the physical properties of speech sounds or signs (phones): their physiological production, acoustic properties, auditory perception, and neurophysiological status. Phonology, on the other hand, is concerned with the abstract, grammatical characterization of systems of sounds or signs.
The field of phonetics is a multilayered subject of linguistics that focuses on speech. In the case of oral languages there are three basic areas of study inter-connected through the common mechanism of sound, such as wavelength (pitch), amplitude, and harmonics:
- https://en.wikipedia.org/wiki/Articulatory_phonetics - the study of the production of speech sounds by the articulatory and vocal tract by the speaker.
- https://en.wikipedia.org/wiki/Acoustic_phonetics - the study of the physical transmission of speech sounds from the speaker to the listener.
- https://en.wikipedia.org/wiki/Auditory_phonetics - the study of the reception and perception of speech sounds by the listener.
- https://en.wikipedia.org/wiki/Manner_of_articulation
- https://en.wikipedia.org/wiki/Place_of_articulation
Phonemics
- https://en.wikipedia.org/wiki/Phonics - a method for teaching people how to read and write an alphabetic language (such as English or Russian). It is done by demonstrating the relationship between the sounds of the spoken language (phonemes), and the letters or groups of letters (graphemes) or syllables of the written language. In English, this is also known as the alphabetic principle or the alphabetic code.
- https://en.wikipedia.org/wiki/Phoneme - one of the units of sound that distinguish one word from another in a particular language. The difference in meaning between the English words kill and kiss is a result of the exchange of the phoneme /l/ for the phoneme /s/. Two words that differ in meaning through a contrast of a single phoneme form a minimal pair. In linguistics, phonemes (established by the use of minimal pairs, such as kill vs kiss or pat vs bat) are written between slashes like this: /p/, whereas when it is desired to show the more exact pronunciation of any sound, linguists use square brackets, for example [pʰ] (indicating an aspirated p).
Within linguistics there are differing views as to exactly what phonemes are and how a given language should be analyzed in phonemic (or phonematic) terms. However, a phoneme is generally regarded as an abstraction of a set (or equivalence class) of speech sounds (phones) which are perceived as equivalent to each other in a given language. For example, in English, the "k" sounds in the words kit and skill are not identical (as described below), but they are distributional variants of a single phoneme /k/. Different speech sounds that are realizations of the same phoneme are known as allophones. Allophonic variation may be conditioned, in which case a certain phoneme is realized as a certain allophone in particular phonological environments, or it may be free, in which case it may vary randomly. In this way, phonemes are often considered to constitute an abstract underlying representation for segments of words, while speech sounds make up the corresponding phonetic realization, or surface form.
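The minimal-pair test above is mechanical enough to sketch in code: two transcriptions of equal length that differ in exactly one segment form a minimal pair. The hand-written broad transcriptions below are for illustration only.

```python
# A small sketch of the minimal-pair test described above: two words whose
# phonemic transcriptions differ in exactly one segment.
def is_minimal_pair(a: list, b: list) -> bool:
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

kill = ["k", "ɪ", "l"]
kiss = ["k", "ɪ", "s"]
pat  = ["p", "æ", "t"]
bat  = ["b", "æ", "t"]

print(is_minimal_pair(kill, kiss))  # True  -> /l/ and /s/ contrast
print(is_minimal_pair(pat, bat))    # True  -> /p/ and /b/ contrast
print(is_minimal_pair(kill, pat))   # False -> differ in more than one segment
```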
- https://en.wikipedia.org/wiki/Phonemic_awareness - a part of phonological awareness in which listeners are able to hear, identify and manipulate phonemes, the smallest mental units of sound that help to differentiate units of meaning (morphemes). Separating the spoken word "cat" into three distinct phonemes, /k/, /æ/, and /t/, requires phonemic awareness. The National Reading Panel has found that phonemic awareness improves children's word reading and reading comprehension and helps children learn to spell. Phonemic awareness is the basis for learning phonics. Phonemic awareness and phonological awareness are often confused since they are interdependent. Phonemic awareness is the ability to hear and manipulate individual phonemes. Phonological awareness includes this ability, but it also includes the ability to hear and manipulate larger units of sound, such as onsets and rimes and syllables.
Phonetics
- https://en.wikipedia.org/wiki/ARPABET - also spelled ARPAbet, is a set of phonetic transcription codes developed by the Advanced Research Projects Agency (ARPA) as a part of their Speech Understanding Research project in the 1970s. It represents phonemes and allophones of General American English with distinct sequences of ASCII characters. Two systems, one representing each segment with one character (alternating upper- and lower-case letters) and the other with two or more (case-insensitive), were devised, the latter being far more widely adopted. ARPABET has been used in several speech synthesizers, including Computalker for the S-100 system, SAM for the Commodore 64, SAY for the Amiga, TextAssist for the PC and Speakeasy from Intelligent Artefacts, which used the Votrax SC-01 speech synthesiser IC. It is also used in the CMU Pronouncing Dictionary. A revised version of ARPABET is used in the TIMIT corpus.
- https://en.wikipedia.org/wiki/Phonetic_algorithm - an algorithm for indexing of words by their pronunciation. Most phonetic algorithms were developed for use with the English language; consequently, applying the rules to words in other languages might not give a meaningful result. They are necessarily complex algorithms with many rules and exceptions, because English spelling and pronunciation is complicated by historical changes in pronunciation and words borrowed from many languages.
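Soundex is the classic example of such a phonetic algorithm; below is a compact sketch of the standard American variant (our own implementation, not from the entry above), which keeps the first letter and encodes the remaining consonants as digits.

```python
# A compact sketch of American Soundex, the classic phonetic indexing
# algorithm: similar-sounding names map to the same 4-character code.
def soundex(word: str) -> str:
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    first = word[0].upper()
    digits = []
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":            # h and w do not reset the previous code
            continue
        code = codes.get(ch, "")  # vowels get "" and reset the previous code
        if code and code != prev:
            digits.append(code)
        prev = code
    return (first + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))      # R163 R163
print(soundex("Ashcraft"), soundex("Ashcroft"))  # A261 A261
print(soundex("Tymczak"))                        # T522
```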
Phonology
- https://en.wikipedia.org/wiki/Phonology - a branch of linguistics concerned with the systematic organization of sounds in languages. It has traditionally focused largely on the study of the systems of phonemes in particular languages (and therefore used to be also called phonemics, or phonematics), but it may also cover any linguistic analysis either at a level beneath the word (including syllable, onset and rime, articulatory gestures, articulatory features, mora, etc.) or at all levels of language where sound is considered to be structured for conveying linguistic meaning. Phonology also includes the study of equivalent organizational systems in sign languages.
- https://en.wikipedia.org/wiki/Phonological_awareness - an individual's awareness of the phonological structure, or sound structure, of words. Phonological awareness is an important and reliable predictor of later reading ability and has, therefore, been the focus of much research.
Morphology
- https://en.wikipedia.org/wiki/Morpheme - is the smallest grammatical unit in a language. The field of study dedicated to morphemes is called morphology. A morpheme is not identical to a word, and the principal difference between the two is that a morpheme may or may not stand alone, whereas a word, by definition, is freestanding. When it stands by itself, it is considered a root because it has a meaning of its own (e.g. the morpheme cat) and when it depends on another morpheme to express an idea, it is an affix because it has a grammatical function (e.g. the –s in cats to specify that it is plural). Every word comprises one or more morphemes. The more combinations a morpheme is found in, the more productive it is said to be.
- https://en.wikipedia.org/wiki/Morphology_(linguistics) - the identification, analysis and description of the structure of a given language's morphemes and other linguistic units, such as root words, affixes, parts of speech, intonations and stresses, or implied context. In contrast, morphological typology is the classification of languages according to their use of morphemes, while lexicology is the study of those words forming a language's wordstock. The discipline that deals specifically with the sound changes occurring within morphemes is morphophonology.
While words, along with clitics, are generally accepted as being the smallest units of syntax, in most languages, if not all, many words can be related to other words by rules that collectively describe the grammar for that language. For example, English speakers recognize that the words dog and dogs are closely related, differentiated only by the plurality morpheme "-s", only found bound to nouns. Speakers of English, a fusional language, recognize these relations from their tacit knowledge of English's rules of word formation. They infer intuitively that dog is to dogs as cat is to cats; and, in similar fashion, dog is to dog catcher as dish is to dishwasher. By contrast, Classical Chinese has very little morphology, using almost exclusively unbound morphemes ("free" morphemes) and depending on word order to convey meaning. (Most words in modern Standard Chinese ("Mandarin"), however, are compounds and most roots are bound.) These are understood as grammars that represent the morphology of the language. The rules understood by a speaker reflect specific patterns or regularities in the way words are formed from smaller units in the language they are using and how those smaller units interact in speech. In this way, morphology is the branch of linguistics that studies patterns of word formation within and across languages and attempts to formulate rules that model the knowledge of the speakers of those languages.
Polysynthetic languages, such as Chukchi, have words composed of many morphemes. The Chukchi word "təmeyŋəlevtpəγtərkən", for example, meaning "I have a fierce headache", is composed of eight morphemes t-ə-meyŋ-ə-levt-pəγt-ə-rkən that may be glossed. The morphology of such languages allows for each consonant and vowel to be understood as morphemes, while the grammar of the language indicates the usage and understanding of each morpheme.
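The dog/dogs regularity above is the kind of rule even a toy analyzer can capture; here is a deliberately naive sketch (the suffix inventory and glosses are our own illustration; real analyzers, such as the finite-state tools listed earlier, handle allomorphy and irregular forms).

```python
# A deliberately naive sketch of suffix-stripping morphological analysis for
# English, in the spirit of the dog/dogs example above.
SUFFIXES = {"s": "PLURAL or 3SG", "ed": "PAST", "ing": "PROGRESSIVE"}

def segment(word):
    """Split a word into (root, gloss-of-suffix) with a longest-match rule."""
    for suffix, gloss in sorted(SUFFIXES.items(), key=lambda kv: -len(kv[0])):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)], gloss
    return word, None

for w in ["dogs", "walked", "running", "dog"]:
    print(w, "->", segment(w))
# dogs -> ('dog', 'PLURAL or 3SG'); walked -> ('walk', 'PAST')
# running -> ('runn', 'PROGRESSIVE'): the wrong root "runn" shows exactly why
# real morphological analyzers must model allomorphy (gemination) and more.
```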
Lexicology
- https://en.wikipedia.org/wiki/Lexicology - the part of linguistics which studies words. This may include their nature and function as symbols, their meaning, the relationship of their meaning to epistemology in general, and the rules of their composition from smaller elements (morphemes such as the English -ed marker for past or un- for negation; and phonemes as basic sound units). Lexicology also involves relations between words, which may involve semantics (for example, love vs. affection), derivation (for example, fathom vs. unfathomably), usage and sociolinguistic distinctions (for example, flesh vs. meat), and any other issues involved in analyzing the whole lexicon of a language(s).
- https://en.wikipedia.org/wiki/Lexical_definition - of a term, also known as the dictionary definition, is the definition closely matching the meaning of the term in common usage. As its other name implies, this is the sort of definition one is likely to find in the dictionary. A lexical definition is usually the type expected from a request for definition, and it is generally expected that such a definition will be stated as simply as possible in order to convey information to the widest audience.
Note that a lexical definition is descriptive, reporting actual usage within speakers of a language, and changes with changing usage of the term, rather than prescriptive, which would be to stick with a version regarded as "correct", regardless of drift in accepted meaning. They tend to be inclusive, attempting to capture everything the term is used to refer to, and as such are often too vague for many purposes.
- https://en.wikipedia.org/wiki/Lexeme - a unit of lexical meaning that exists regardless of the number of inflectional endings it may have or the number of words it may contain. It is a basic unit of meaning, and the headwords of a dictionary are all lexemes. Put more technically, a lexeme is an abstract unit of morphological analysis in linguistics, that roughly corresponds to a set of forms taken by a single word. For example, in the English language, run, runs, ran and running are forms of the same lexeme, conventionally written as run. A related concept is the lemma (or citation form), which is a particular form of a lexeme that is chosen by convention to represent a canonical form of a lexeme. Lemmas are used in dictionaries as the headwords, and other forms of a lexeme are often listed later in the entry if they are not common conjugations of that word.
A lexeme belongs to a particular syntactic category, has a certain meaning (semantic value), and in inflecting languages, has a corresponding inflectional paradigm; that is, a lexeme in many languages will have many different forms. For example, the lexeme run has a present third person singular form runs, a present non-third-person singular form run (which also functions as the past participle and non-finite form), a past form ran, and a present participle running. (It does not include runner, runners, runnable, etc.) The use of the forms of a lexeme is governed by rules of grammar; in the case of English verbs such as run, these include subject-verb agreement and compound tense rules, which determine which form of a verb can be used in a given sentence.
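The run/runs/ran/running example maps directly onto lemmatization, the recovery of the citation form; a minimal sketch with NLTK's WordNet lemmatizer (assuming nltk and its wordnet corpus have been downloaded).

```python
# A minimal sketch of recovering the lemma (citation form) of each inflected
# form of the lexeme RUN, assuming nltk with its "wordnet" corpus installed.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for form in ["run", "runs", "ran", "running"]:
    print(form, "->", lemmatizer.lemmatize(form, pos="v"))
# run -> run; runs -> run; ran -> run; running -> run
```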
- https://en.wikipedia.org/wiki/Lexicography - the study of lexicons and the art of compiling dictionaries, divided into two related disciplines:
Practical lexicography is the art or craft of compiling, writing and editing dictionaries.
Theoretical lexicography is the scholarly study of semantic, orthographic, syntagmatic and paradigmatic features of lexemes of the lexicon (vocabulary) of a language, developing theories of dictionary components and structures linking the data in dictionaries, the needs for information by users in specific types of situations, and how users may best access the data incorporated in printed and electronic dictionaries. This is sometimes referred to as "metalexicography".
There is some disagreement on the definition of lexicology, as distinct from lexicography. Some use "lexicology" as a synonym for theoretical lexicography; others use it to mean a branch of linguistics pertaining to the inventory of words in a particular language.
- https://en.wikipedia.org/wiki/Lexical_item - a single word, a part of a word, or a chain of words (catena) that forms the basic elements of a language's lexicon (≈ vocabulary). Examples are cat, traffic light, take care of, by the way, and it's raining cats and dogs. Lexical items can be generally understood to convey a single meaning, much as a lexeme, but are not limited to single words. Lexical items are like semes in that they are "natural units" translating between languages, or in learning a new language. In this last sense, it is sometimes said that language consists of grammaticalized lexis, and not lexicalized grammar. The entire store of lexical items in a language is called its lexis.
Lexical items composed of more than one word are also sometimes called lexical chunks, gambits, lexical phrases, lexicalized stems, or speech formulae. The term polyword listemes is also sometimes used.
- https://en.wikipedia.org/wiki/Word - smallest element that may be uttered in isolation with semantic or pragmatic content.
- https://en.wikipedia.org/wiki/Open_class_(linguistics) - a word class may be either an open class or a closed class. Open classes accept the addition of new morphemes (words), through such processes as compounding, derivation, inflection, coining, and borrowing; closed classes generally do not.
- https://en.wikipedia.org/wiki/Back-formation - the process of creating a new lexeme, usually by removing actual or supposed affixes. The resulting neologism is called a back-formation, a term coined by James Murray in 1889. (OED online first definition of 'back formation' is from the definition of to burgle, which was first published in 1889.) Back-formation is different from clipping – back-formation may change the part of speech or the word's meaning, whereas clipping creates shortened words from longer words, but does not change the part of speech or the meaning of the word.
- https://en.wikipedia.org/wiki/Computational_lexicology - that branch of computational linguistics, which is concerned with the use of computers in the study of lexicon. It has been more narrowly described by some scholars (Amsler, 1980) as the use of computers in the study of machine-readable dictionaries. It is distinguished from computational lexicography, which more properly would be the use of computers in the construction of dictionaries, though some researchers have used computational lexicography as synonymous.
Grammar
- https://en.wikipedia.org/wiki/Grammar - the set of structural rules governing the composition of clauses, phrases, and words in any given natural language. The term refers also to the study of such rules, and this field includes morphology, syntax, and phonology, often complemented by phonetics, semantics, and pragmatics.
- https://en.wikipedia.org/wiki/Grammatical_category - or grammatical feature is a property of items within the grammar of a language. Within each category there are two or more possible values (sometimes called grammemes), which are normally mutually exclusive. Frequently encountered grammatical categories include:
- Tense, the placing of a verb in a time frame, which can take values such as present and past
- Number, with values such as singular, plural, and sometimes dual, trial, paucal, uncountable or partitive, inclusive or exclusive
- Gender, with values such as masculine, feminine and neuter
- Noun classes, which are more general than gender alone, and include additional classes such as animate, human, plants, animals, things, and immaterial (for concepts and verbal nouns/actions), and sometimes shapes
- Locative relations, which some languages would represent using grammatical cases or tenses, or by adding a possibly agglutinated lexeme such as a preposition, adjective, or particle.
Although the use of terms varies from author to author, a distinction should be made between grammatical categories and lexical categories. Lexical categories (considered syntactic categories) largely correspond to the parts of speech of traditional grammar, and refer to nouns, adjectives, etc. A phonological manifestation of a category value (for example, a word ending that marks "number" on a noun) is sometimes called an exponent. Grammatical relations define relationships between words and phrases with certain parts of speech, depending on their position in the syntactic tree. Traditional relations include subject, object, and indirect object.
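One way to make the category/grammeme distinction concrete is to model each category as an enumeration of mutually exclusive values. A minimal sketch (the category and value names below are illustrative, not a standard inventory):

```python
from enum import Enum
from dataclasses import dataclass

# Each grammatical category is a closed set of mutually exclusive grammemes.
class Number(Enum):
    SINGULAR = "sg"
    PLURAL = "pl"

class Tense(Enum):
    PRESENT = "pres"
    PAST = "past"

@dataclass
class VerbForm:
    lemma: str
    tense: Tense
    number: Number

# "runs" is the exponent of the values {present, singular} on the lexeme RUN.
runs = VerbForm(lemma="run", tense=Tense.PRESENT, number=Number.SINGULAR)
print(runs)
```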
Part of speech
- https://en.wikipedia.org/wiki/Part_of_speech - also known as a word class, lexical class, or lexical category; a linguistic category of words (lexical items) defined by the item's syntactic or morphological behaviour. Common linguistic categories include noun and verb, among others.
Three little words you often see
Are ARTICLES: a, an, and the.
A NOUN's the name of anything,
As: school or garden, toy, or swing.
ADJECTIVES tell the kind of noun,
As: great, small, pretty, white, or brown.
VERBS tell of something being done:
To read, write, count, sing, jump, or run.
How things are done the ADVERBS tell,
As: slowly, quickly, badly, well.
CONJUNCTIONS join the words together,
As: men and women, wind or weather.
The PREPOSITION stands before
A noun as: in or through a door.
The INTERJECTION shows surprise
As: Oh, how pretty! Ah! how wise!
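In NLP terms, assigning parts of speech is POS tagging. A minimal sketch using NLTK's off-the-shelf tagger (assumes the nltk package plus its tokenizer and tagger models are installed):

```python
import nltk

# One-time downloads of the tokenizer and tagger models.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
#  ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
```

The tags (DT, JJ, NN, ...) are Penn Treebank word classes, a finer-grained inventory than the eight traditional parts of speech in the rhyme above.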
Syntax
- https://en.wikipedia.org/wiki/Syntactic_category - a syntactic unit that theories of syntax assume. Word classes, largely corresponding to traditional parts of speech (e.g. noun, verb, preposition, etc.), are syntactic categories. In phrase structure grammars, the phrasal categories (e.g. noun phrase, verb phrase, prepositional phrase, etc.) are also syntactic categories. Dependency grammars, however, do not acknowledge phrasal categories (at least not in the traditional sense).
Word classes considered as syntactic categories may be called lexical categories, as distinct from phrasal categories. The terminology is somewhat inconsistent between the theoretical models of different linguists. However, many grammars also draw a distinction between lexical categories (which tend to consist of content words, or phrases headed by them) and functional categories (which tend to consist of function words or abstract functional elements, or phrases headed by them). The term lexical category therefore has two distinct meanings. Moreover, syntactic categories should not be confused with grammatical categories (also known as grammatical features), which are properties such as tense, gender, etc.
- https://en.wikipedia.org/wiki/Cartographic_syntax - a branch of Generative syntax. The basic assumption of Cartographic syntax is that syntactic structures are built according to the same patterns in all languages of the world. It is assumed that all languages exhibit a richly articulated structure of hierarchical projections with specific meanings. Cartography belongs to the tradition of generative grammar and is regarded as a theory belonging to the Principles and Parameters theory. The founders of Cartography are the Italian linguists Luigi Rizzi and Guglielmo Cinque.
- https://en.wikipedia.org/wiki/Parse_tree - or parsing tree or derivation tree or concrete syntax tree is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. The term parse tree itself is used primarily in computational linguistics; in theoretical syntax, the term syntax tree is more common. Concrete syntax trees reflect the syntax of the input language, making them distinct from the abstract syntax trees used in computer programming. Unlike Reed-Kellogg sentence diagrams used for teaching grammar, parse trees do not use distinct symbol shapes for different types of constituents.
Parse trees are usually constructed based on either the constituency relation of constituency grammars (phrase structure grammars) or the dependency relation of dependency grammars. Parse trees may be generated for sentences in natural languages (see natural language processing), as well as during processing of computer languages, such as programming languages. A related concept is that of phrase marker or P-marker, as used in transformational generative grammar. A phrase marker is a linguistic expression marked as to its phrase structure. This may be presented in the form of a tree, or as a bracketed expression. Phrase markers are generated by applying phrase structure rules, and themselves are subject to further transformational rules. A set of possible parse trees for a syntactically ambiguous sentence is called a "parse forest."
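A parse tree can be produced mechanically from a context-free grammar. A minimal constituency example with NLTK's chart parser (a toy grammar; real grammars are far larger):

```python
import nltk

# Tiny phrase-structure grammar in the constituency tradition.
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the'
    N  -> 'dog' | 'cat'
    V  -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the cat".split()):
    tree.pretty_print()  # draws the constituency parse tree as ASCII art
```

An ambiguous sentence would make the loop yield several trees, which is exactly the "parse forest" mentioned above.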
Semantics
- https://en.wikipedia.org/wiki/Lexical_semantics - also known as lexicosemantics, a subfield of linguistic semantics that studies word meanings. It includes the study of how words structure their meaning, how they act in grammar and compositionality, and the relationships between the distinct senses and uses of a word. The units of analysis in lexical semantics are lexical units, which include not only words but also sub-words or sub-units such as affixes, and even compound words and phrases. Lexical units make up the catalogue of words in a language, the lexicon. Lexical semantics looks at how the meaning of the lexical units correlates with the structure of the language or syntax. This is referred to as the syntax-semantics interface.
The study of lexical semantics concerns:
- the classification and decomposition of lexical items
- the differences and similarities in lexical semantic structure cross-linguistically
- the relationship of lexical meaning to sentence meaning and syntax.
Lexical units, also referred to as syntactic atoms, can be independent, as in the case of root words or parts of compound words, or they can require association with other units, as prefixes and suffixes do. The former are termed free morphemes and the latter bound morphemes. They fall into a narrow range of meanings (semantic fields) and can combine with each other to generate new denotations. Cognitive semantics is the linguistic paradigm/framework that since the 1980s has generated the most studies in lexical semantics, introducing innovations like prototype theory, conceptual metaphors, and frame semantics.
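The free/bound split is easiest to see on a hand-segmented word. A data-only sketch (the segmentation and glosses are done by hand here; automatic morphological analysis is a research problem in its own right):

```python
# "unfathomably" decomposed into one free morpheme and three bound morphemes.
UNFATHOMABLY = [
    ("un-",    "bound", "negation prefix"),
    ("fathom", "free",  "root; can stand alone as a word"),
    ("-able",  "bound", "adjectivizing suffix"),
    ("-ly",    "bound", "adverbializing suffix"),
]

for morpheme, status, gloss in UNFATHOMABLY:
    print(f"{morpheme:8} {status:6} {gloss}")
```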
- https://en.wikipedia.org/wiki/Semantic_role_labeling - also called shallow semantic parsing or slot-filling, is the process that assigns labels to words or phrases in a sentence that indicates their semantic role in the sentence, such as that of an agent, goal, or result.
It serves to find the meaning of the sentence. To do this, it detects the arguments associated with the predicate or verb of a sentence and classifies them into their specific roles. A common example is the sentence "Mary sold the book to John": the agent is "Mary", the predicate is "sold" (or rather, "to sell"), the theme is "the book", and the recipient is "John" (see the sketch after this entry). Another example: "the book belongs to me" would need two labels such as "possessed" and "possessor", while "the book was sold to John" would need the labels theme and recipient, even though the arguments in both clauses occupy similar "subject" and "object" positions.
In 1968, the first idea for semantic role labeling was proposed by Charles J. Fillmore. His proposal led to the FrameNet project, which produced the first major computational lexicon that systematically described many predicates and their corresponding roles. Daniel Gildea (currently at the University of Rochester, previously University of California, Berkeley / International Computer Science Institute) and Daniel Jurafsky (currently teaching at Stanford University, but previously working at the University of Colorado and UC Berkeley) developed the first automatic semantic role labeling system based on FrameNet. The PropBank corpus added manually created semantic role annotations to the Penn Treebank corpus of Wall Street Journal texts. Many automatic semantic role labeling systems have used PropBank as a training dataset to learn how to annotate new sentences automatically.
Semantic role labeling is mostly used for machines to understand the roles of words within sentences. This benefits applications similar to Natural Language Processing programs that need to understand not just the words of languages, but how they can be used in varying sentences. A better understanding of semantic role labeling could lead to advancements in question answering, information extraction, automatic text summarization, text data mining, and speech recognition.
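The "Mary sold the book to John" example above can be written out as the kind of structure an SRL system emits. A data-only sketch using PropBank-style labels (ARG0/ARG1/ARG2 conventions; the exact label set varies by lexicon, and the frameset id is illustrative):

```python
# PropBank-style output for "Mary sold the book to John."
# ARG0 = agent/seller, ARG1 = thing sold, ARG2 = buyer/recipient.
srl_frame = {
    "predicate": "sold",
    "sense": "sell.01",  # PropBank frameset id for the "sell" predicate
    "arguments": {
        "ARG0": "Mary",
        "ARG1": "the book",
        "ARG2": "John",
    },
}

for role, span in srl_frame["arguments"].items():
    print(f"{role}: {span}")
```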
- https://en.wikipedia.org/wiki/Semantic_parsing - the task of converting a natural language utterance to a logical form: a machine-understandable representation of its meaning. Semantic parsing can thus be understood as extracting the precise meaning of an utterance. Applications of semantic parsing include machine translation, question answering, ontology induction, automated reasoning, and code generation. The phrase was first used in the 1970s by Yorick Wilks as the basis for machine translation programs working with only semantic representations. In computer vision, semantic parsing is a process of segmentation for 3D objects.
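To make "utterance → logical form" concrete, here is a toy rule-based semantic parser covering a single sentence pattern (the pattern and predicate names are made up for illustration; real semantic parsers learn such mappings from data):

```python
import re

# Toy semantic parser: map one question pattern to a first-order-style logical form.
PATTERN = re.compile(r"which (\w+)s border (\w+)\?", re.IGNORECASE)

def parse_to_logical_form(utterance: str) -> str:
    m = PATTERN.match(utterance)
    if not m:
        raise ValueError("utterance not covered by this toy grammar")
    category, entity = m.group(1).lower(), m.group(2).lower()
    return f"answer(x, {category}(x) & borders(x, {entity}))"

print(parse_to_logical_form("Which states border Texas?"))
# answer(x, state(x) & borders(x, texas))
```

The resulting logical form is machine-understandable: it can be executed against a database or knowledge base to answer the question, which is the question-answering application mentioned above.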
- https://en.wikipedia.org/wiki/Frame_semantics_(linguistics) - a theory of linguistic meaning developed by Charles J. Fillmore that extends his earlier case grammar. It relates linguistic semantics to encyclopedic knowledge. The basic idea is that one cannot understand the meaning of a single word without access to all the essential knowledge that relates to that word. For example, one would not be able to understand the word "sell" without knowing anything about the situation of commercial transfer, which also involves, among other things, a seller, a buyer, goods, money, the relation between the money and the goods, the relations between the seller and the goods and the money, the relation between the buyer and the goods and the money and so on. Thus, a word activates, or evokes, a frame of semantic knowledge relating to the specific concept to which it refers (or highlights, in frame semantic terminology).
The idea of the encyclopedic organisation of knowledge itself is old and was discussed by Age of Enlightenment philosophers such as Denis Diderot and Giambattista Vico. Fillmore and other evolutionary and cognitive linguists like John Haiman and Adele Goldberg, however, make an argument against generative grammar and truth-conditional semantics. As is elementary for Lakoffian–Langackerian Cognitive Linguistics, it is claimed that knowledge of language is no different from other types of knowledge; therefore there is no grammar in the traditional sense, and language is not an independent cognitive function. Instead, the spreading and survival of linguistic units is directly comparable to that of other types of units of cultural evolution, like in memetics and other cultural replicator theories.
- semanticsarchive.net - For exchanging papers of interest to natural language semanticists and philosophers of language
to sort
- https://en.wikipedia.org/wiki/Phrase_structure_grammar
- https://en.wikipedia.org/wiki/Dependency_grammar
- https://en.wikipedia.org/wiki/Context-free_grammar
- https://en.wikipedia.org/wiki/Context-free_language
- https://en.wikipedia.org/wiki/Syntactic_hierarchy - concerned with the way sentences are constructed from smaller parts, such as words and phrases.
- https://en.wikipedia.org/wiki/Transparency_(linguistic)
- https://en.wikipedia.org/wiki/Opaque_context
- https://en.wikipedia.org/wiki/Clause - the smallest grammatical unit that can express a complete proposition. A clause typically consists of a subject and a predicate, where the predicate is typically a verb phrase – a verb together with any objects and other modifiers.
- https://en.wikipedia.org/wiki/Anaphora_(linguistics) - use of an expression the interpretation of which depends upon another expression in context (its antecedent or postcedent). In a narrower sense, anaphora is the use of an expression which depends specifically upon an antecedent expression, and thus is contrasted with cataphora, which is the use of an expression which depends upon a postcedent expression. The anaphoric (referring) term is called an anaphor. For example, in the sentence Sally arrived, but nobody saw her, the pronoun her is an anaphor, referring back to the antecedent Sally. In the sentence Before her arrival, nobody saw Sally, the pronoun her refers forward to the postcedent Sally, so her is now a cataphor (and an anaphor in the broader, but not the narrower, sense). Usually, an anaphoric expression is a proform or some other kind of deictic (contextually-dependent) expression. Both anaphora and cataphora are species of endophora, referring to something mentioned elsewhere in a dialog or text.
- https://en.wikipedia.org/wiki/Formulaic_language - previously known as automatic speech or embolalia, is a linguistic term for verbal expressions that are fixed in form, often non-literal in meaning with attitudinal nuances, and closely related to communicative-pragmatic context. Along with idioms, expletives and proverbs, formulaic language includes pause fillers (e.g., “Like,” “Er” or “Uhm”) and conversational speech formulas (e.g., “You’ve got to be kidding,” “Excuse me?” or “Hang on a minute”).
- https://en.wikipedia.org/wiki/Fixed_expression - a standard form of expression that has taken on a more specific meaning than the expression itself. It is different from a proverb in that it is used as a part of a sentence, and is the standard way of expressing a concept or idea.
- https://en.wikipedia.org/wiki/Idiom - (Latin: idioma, "special property", from Greek: ἰδίωμα – idíōma, "special feature, special phrasing, a peculiarity", from Greek: ἴδιος – ídios, "one’s own") is a phrase or a fixed expression that has a figurative, or sometimes literal, meaning. An idiom's figurative meaning is different from the literal meaning. There are thousands of idioms, and they occur frequently in all languages. It is estimated that there are at least twenty-five thousand idiomatic expressions in the English language. Idioms fall into the category of formulaic language.
- https://en.wikipedia.org/wiki/Category:Euphemisms
- https://en.wikipedia.org/wiki/Category:Dysphemisms
- http://www.antipope.org/charlie/blog-static/2014/06/we-need-a-pony-and-the-moon-on.html [17] - on detecting sarcasm
- https://en.wikipedia.org/wiki/Archi-writing - a term used by French philosopher Jacques Derrida in his attempt to re-orient the relationship between speech and writing. Derrida argued that as far back as Plato, speech had been always given priority over writing. In the West, phonetic writing was considered as a secondary imitation of speech, a poor copy of the immediate living act of speech. Derrida argued that in later centuries philosopher Jean-Jacques Rousseau and linguist Ferdinand de Saussure both gave writing a secondary or parasitic role. In Derrida's essay Plato's Pharmacy, he sought to question this prioritising by firstly complicating the two terms speech and writing.
Formal language
- https://en.wikipedia.org/wiki/Chomsky_hierarchy - a containment hierarchy of classes of formal grammars, from regular grammars at the bottom to unrestricted grammars at the top, each matched by a machine model of corresponding power. It gives a systematic way to match a linguistic or computational problem to the weakest grammar class that can express it.
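The practical force of the hierarchy is that each level needs a strictly stronger recognizer. A minimal sketch contrasting a regular language (regex-recognizable) with the classic context-free language aⁿbⁿ, which no regular expression can capture:

```python
import re

# Type-3 (regular): a finite automaton / regex recognizes a*b*,
# but it cannot count, so it also accepts "aab" (2 a's, 1 b).
regular = re.compile(r"^a*b*$")

# Type-2 (context-free): { a^n b^n } needs recursion (S -> 'a' S 'b' | ε);
# here the membership check is written directly instead of via a grammar.
def a_n_b_n(s: str) -> bool:
    half, rem = divmod(len(s), 2)
    return rem == 0 and s == "a" * half + "b" * half

print(bool(regular.match("aab")))  # True -- regular grammar can't enforce equal counts
print(a_n_b_n("aaabbb"))           # True
print(a_n_b_n("aab"))              # False
```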
Natural language
See also Computing#NLP
Dialect
- https://en.wikipedia.org/wiki/Dialectology - the scientific study of linguistic dialect, a sub-field of sociolinguistics. It studies variations in language based primarily on geographic distribution and their associated features. Dialectology deals with such topics as divergence of two local dialects from a common ancestor and synchronic variation.
- https://en.wikipedia.org/wiki/Idiolect - an individual's distinctive and unique use of language, including speech. This unique usage encompasses vocabulary, grammar, and pronunciation. Idiolect is the variety of language unique to an individual. This differs from a dialect, a common set of linguistic characteristics shared among some group of people. The term idiolect refers to the language of an individual. It is etymologically related to the Greek prefix idio- (meaning "own, personal, private, peculiar, separate, distinct") and a back-formation of dialect.
- https://en.wikipedia.org/wiki/Linguistic_map - a thematic map showing the geographic distribution of the speakers of a language, or isoglosses of a dialect continuum of the same language, or language family. A collection of such maps is a linguistic atlas.
- Language Mapping Worldwide: Methods and Traditions | SpringerLink - The chapter provides an overview of methods and traditions of linguistic cartography in the past and present. Mapping language and mapping language-related data are of increasing interest not only in disciplines such as dialectology and language typology, which are the classical domains of linguistic cartography, but also in sociolinguistics and theoretical linguistics. The chapter is structured in three main parts. First, the purposes of language mapping are introduced, ranging from visualization of the position of linguistic features in geographic space (the basic purpose of language mapping) through issues in language classification to correlations between linguistic and nonlinguistic features. Second, a formal typology of language maps based on their symbolization is given, distinguishing point-related maps, line-related maps, area-related maps, and surface maps. Third, major language mapping traditions worldwide are sketched in as much detail as possible in a short overview. The descriptions consider examples from all areas of the world (including reprints of maps and map details). A section on the effects of computerization on language mapping concludes the chapter.
- Mapmaking for Language Documentation and Description - CORE Reader - This paper introduces readers to mapmaking as part of language documentation. We discuss some of the benefits and ethical challenges in producing good maps, drawing on linguistic geography and GIS literature. We then describe current tools and practices that are useful when creating maps of linguistic data, particularly using locations of field sites to identify language areas/boundaries. We demonstrate a basic workflow that uses CartoDB, before demonstrating a more complex workflow involving Google Maps and TileMill. We also discuss presentation and archiving of mapping products. The majority of the tools identified and used are open source or free to use.
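The point-based workflow the paper describes (plot field sites, export a shareable map) can be approximated with the open-source folium library instead of CartoDB or TileMill; the site names and coordinates below are illustrative:

```python
import folium

# Hypothetical field sites: (language name, latitude, longitude).
sites = [
    ("Language A", 61.5, 105.3),
    ("Language B", 62.1, 107.0),
]

# Center the map roughly between the sites and drop one marker per site.
m = folium.Map(location=[61.8, 106.0], zoom_start=6)
for name, lat, lon in sites:
    folium.Marker([lat, lon], popup=name).add_to(m)

m.save("language_map.html")  # self-contained HTML map for sharing or archiving
```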
Written language
- https://en.wikipedia.org/wiki/Writing_system - any conventional method of visually representing verbal communication. While both writing and speech are useful in conveying messages, writing differs in also being a reliable form of information storage and transfer. The processes of encoding and decoding writing systems involve shared understanding between writers and readers of the meaning behind the sets of characters that make up a script. Writing is usually recorded onto a durable medium, such as paper or electronic storage, although non-durable methods may also be used, such as writing on a computer display, on a blackboard, in sand, or by skywriting. The general attributes of writing systems can be placed into broad categories such as alphabets, syllabaries, or logographies. Any particular system can have attributes of more than one category. In the alphabetic category, there is a standard set of letters (basic written symbols or graphemes) of consonants and vowels that encode based on the general principle that the letters (or letter pairs/groups) represent speech sounds. In a syllabary, each symbol correlates to a syllable or mora. In a logography, each character represents a word, morpheme, or other semantic unit. Other categories include abjads, which differ from alphabets in that vowels are not indicated, and abugidas or alphasyllabaries, with each character representing a consonant-vowel pairing. Alphabets typically use a set of 20 to 35 symbols to fully express a language, whereas syllabaries can have 80 to 100, and logographies can have several hundred symbols.
- https://en.wikipedia.org/wiki/Syllabogram - signs used to write the syllables (or morae) of words. This term is most often used in the context of a writing system otherwise organized on different principles—an alphabet where most symbols represent phonemes, or a logographic script where most symbols represent morphemes—but a system based mostly on syllabograms is a syllabary.
- https://en.wikipedia.org/wiki/Syllabary - a set of written symbols that represent the syllables or (more frequently) moras which make up words.
- https://en.wikipedia.org/wiki/Grapheme - the smallest functional unit of a writing system
See also Maths
Controlled language
- https://en.wikipedia.org/wiki/Controlled_natural_language - (CNLs) are subsets of natural languages that are obtained by restricting the grammar and vocabulary in order to reduce or eliminate ambiguity and complexity. Traditionally, controlled languages fall into two major types: those that improve readability for human readers (e.g. non-native speakers), and those that enable reliable automatic semantic analysis of the language. The first type of languages (often called "simplified" or "technical" languages), for example ASD Simplified Technical English, Caterpillar Technical English, IBM's Easy English, are used in the industry to increase the quality of technical documentation, and possibly simplify the (semi-)automatic translation of the documentation. These languages restrict the writer by general rules such as "Keep sentences short", "Avoid the use of pronouns", "Only use dictionary-approved words", and "Use only the active voice". The second type of languages have a formal logical basis, i.e. they have a formal syntax and semantics, and can be mapped to an existing formal language, such as first-order logic. Thus, those languages can be used as knowledge representation languages, and writing of those languages is supported by fully automatic consistency and redundancy checks, query answering, etc.
- https://en.wikipedia.org/wiki/Attempto_Controlled_English - a controlled natural language, i.e. a subset of standard English with a restricted syntax and restricted semantics described by a small set of construction and interpretation rules. It has been under development at the University of Zurich since 1995. In 2013, ACE version 6.7 was announced. [19]
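The defining property of the second (formal) type of CNL is a deterministic mapping into logic. A toy sketch of that idea for one ACE-like sentence pattern (the translation rule is handwritten here; Attempto's actual pipeline goes through discourse representation structures, not a regex):

```python
import re

# Toy CNL-to-logic rule: "Every X is a Y." => all x. (X(x) -> Y(x))
EVERY = re.compile(r"Every (\w+) is an? (\w+)\.")

def cnl_to_fol(sentence: str) -> str:
    m = EVERY.fullmatch(sentence)
    if not m:
        raise ValueError("sentence outside this toy controlled fragment")
    x, y = m.group(1).lower(), m.group(2).lower()
    return f"all x. ({x}(x) -> {y}(x))"

print(cnl_to_fol("Every man is a human."))
# all x. (man(x) -> human(x))
```

Because the fragment is restricted, every accepted sentence has exactly one reading, which is what makes the automatic consistency and redundancy checks mentioned above possible.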
Technology
- https://en.wikipedia.org/wiki/Linguistic_Linked_Open_Data - In natural language processing, linguistics, and neighboring fields, Linguistic Linked Open Data (LLOD) describes a method and an interdisciplinary community concerned with creating, sharing, and (re-)using language resources in accordance with Linked Data principles. The Linguistic Linked Open Data Cloud was conceived and is maintained by the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation, and has since been a focal point of activity for several W3C community groups, research projects, and infrastructure efforts.
- https://github.com/diogocabral/sherlock - A modification of the Sherlock plagiarism detector.
- https://github.com/TheBerkin/rant - an all-purpose procedural text engine that is most simply described as the opposite of Regex. It has been refined to include a dizzying array of features for handling everything from the most basic string generation tasks to advanced dialogue generation, code templating, automatic formatting, and more. The goal of the project is to enable developers of all kinds to automate repetitive writing tasks with a high degree of creative freedom.
WebAnno
- WebAnno - a general purpose web-based annotation tool for a wide range of linguistic annotations, including various layers of morphological, syntactic, and semantic annotation. Additionally, custom annotation layers can be defined, allowing WebAnno to be used for non-linguistic annotation tasks as well. WebAnno is a multi-user tool supporting different roles such as annotator, curator, and project manager. The progress and quality of annotation projects can be monitored and measured in terms of inter-annotator agreement. Multiple annotation projects can be conducted in parallel.
- https://github.com/webanno/webanno - The official WebAnno repository has reached the end of the line. To migrate, export your annotation projects from WebAnno, then import them into INCEpTION and continue working there.
INCEpTION
- INCEpTION - A semantic annotation platform offering intelligent assistance and knowledge management. The annotation of specific semantic phenomena often requires compiling task-specific corpora and creating or extending task-specific knowledge bases. Presently, researchers require a broad range of skills and tools to address such semantic annotation tasks. In the recently funded INCEpTION project, UKP Lab at TU Darmstadt aims to build an annotation platform that incorporates all the related tasks into a joint web-based platform.
Spelling
- Hunspell - the spell checker of LibreOffice, OpenOffice.org, Mozilla Firefox 3 & Thunderbird, and Google Chrome; it is also used by proprietary software such as Mac OS X, InDesign, memoQ, Opera and SDL Trados.
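Hunspell can also be driven from scripts, not just the applications listed. A minimal sketch using the pyhunspell bindings (an assumption: pyhunspell and the en_US dictionaries are installed; the paths below are typical Linux locations and vary by system):

```python
import hunspell  # pyhunspell bindings to the Hunspell library

# Dictionary paths are illustrative; adjust to wherever en_US.dic/.aff live.
h = hunspell.HunSpell("/usr/share/hunspell/en_US.dic",
                      "/usr/share/hunspell/en_US.aff")

print(h.spell("color"))   # True  -- word is in the dictionary
print(h.spell("colr"))    # False -- misspelling
print(h.suggest("colr"))  # e.g. ['color', 'col', ...]
```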
- https://github.com/wolfgarbe/SymSpell - 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm, C#
- https://github.com/vvvar/symspell - Rust implementation of brilliant SymSpell originally written in C# by @wolfgarbe.
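The symmetric delete trick behind SymSpell is simple enough to sketch: precompute every single-character delete of every dictionary word, then look up the input term together with its own deletes; matches within edit distance 1 (deletion, insertion, substitution) meet in the index. A minimal Python version (the real SymSpell generalizes to larger distances and ranks candidates by frequency):

```python
from collections import defaultdict

def deletes1(word: str) -> set[str]:
    """All strings reachable by deleting exactly one character."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}

def build_index(dictionary: list[str]) -> dict[str, set[str]]:
    """Map each dictionary word AND each of its deletes back to the word."""
    index = defaultdict(set)
    for word in dictionary:
        index[word].add(word)
        for d in deletes1(word):
            index[d].add(word)
    return index

def lookup(term: str, index: dict[str, set[str]]) -> set[str]:
    """Candidates within edit distance 1: deletes from both sides meet in the index."""
    candidates = set()
    for key in {term} | deletes1(term):
        candidates |= index.get(key, set())
    return candidates

index = build_index(["hello", "help", "world"])
print(lookup("helo", index))  # {'hello', 'help'}
```

The speedup comes from doing only hash lookups at query time: no edit-distance computation is needed against the whole dictionary, which is what the "1 million times faster" tagline refers to.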
Other
- Encyclopedia Dramatica - "In lulz we trust."
- SCP Foundation - Secure, Contain, Protect
- GF - Grammatical Framework - A programming language for multilingual grammar applications
- https://news.ycombinator.com/item?id=17749217
Fiction
- http://nautil.us/issue/15/turbulence/an-astrobiologist-asks-a-sci_fi-novelist-how-to-survive-the-anthropocene [29]
Interactive
- https://en.wikipedia.org/wiki/Ludonarrative - a compound of ludology and narrative, refers to the intersection in a video game of ludic elements – or gameplay – and narrative elements. It is commonly used in the term ludonarrative dissonance, which refers to conflicts between a video game's narrative and its gameplay. The term was coined by Clint Hocking, a former creative director at LucasArts (then at Ubisoft), on his blog in October 2007. Hocking coined the term in response to the game BioShock, which according to him promotes the theme of self-interest through its gameplay while promoting the opposing theme of selflessness through its narrative, creating a violation of aesthetic distance that often pulls the player out of the game. Video game theorist Tom Bissell, in his book Extra Lives: Why Video Games Matter (2010), notes the example of Call of Duty 4: Modern Warfare, where a player can all but kill their digital partner during gameplay without upsetting the built-in narrative of the game.