Data



General

See also Open data, Maths, Computing Visualisation, Language, Documents


data, noun

  • facts and statistics collected together for reference or analysis: there is very little data available
    • the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
    • Philosophy: things known or assumed as facts, making the basis of reasoning or calculation.


  • https://en.wikipedia.org/wiki/Unstructured_data - information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs, as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents. In 1998, Merrill Lynch cited a rule of thumb that somewhere around 80-90% of all potentially usable business information may originate in unstructured form. This rule of thumb is not based on primary or any quantitative research, but is nonetheless accepted by some. IDC and EMC project that data will grow to 40 zettabytes by 2020, resulting in a 50-fold growth from the beginning of 2010. Computer World magazine states that unstructured information might account for more than 70%–80% of all data in organizations.



  • https://en.wikipedia.org/wiki/Data_model - or datamodel, is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities. For instance, a data model may specify that the data element representing a car be composed of a number of other elements which, in turn, represent the color and size of the car and define its owner.

The term data model is used in two distinct but closely related senses. Sometimes it refers to an abstract formalization of the objects and relationships found in a particular application domain, for example the customers, products, and orders found in a manufacturing organization. At other times it refers to a set of concepts used in defining such formalizations: for example concepts such as entities, attributes, relations, or tables. So the "data model" of a banking application may be defined using the entity-relationship "data model". This article uses the term in both senses.


Overview of data modeling context: a data model is based on data, data relationships, data semantics and data constraints. A data model provides the details of the information to be stored, and is of primary use when the final product is the generation of computer software code for an application or the preparation of a functional specification to aid a make-or-buy decision for computer software. A data model explicitly determines the structure of data. Data models are specified in a data modeling notation, which is often graphical in form. A data model can sometimes be referred to as a data structure, especially in the context of programming languages. Data models are often complemented by function models, especially in the context of enterprise models.




  • https://en.wikipedia.org/wiki/Semi-structured_data - a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as a self-describing structure. In semi-structured data, entities belonging to the same class may have different attributes even though they are grouped together, and the attributes' order is not important. Semi-structured data have become increasingly common since the advent of the Internet, where full-text documents and databases are no longer the only forms of data and different applications need a medium for exchanging information. In object-oriented databases, one often finds semi-structured data.





Articles


Learning


Management

See also Database, Visualisation, Maths#Software





Science

  • A Taxonomy of Data Science - Both within the academy and within tech startups, we’ve been hearing some similar questions lately: Where can I find a good data scientist? What do I need to learn to become a data scientist? Or more succinctly: What is data science?


  • School of Data - works to empower civil society organizations, journalists and citizens with the skills they need to use data effectively in their efforts to create more equitable and effective societies.


  • Kaggle - Service - From Big Data to Big Analytics.




Encoding

  • https://en.wikipedia.org/wiki/Encoder_(digital) - or simply an encoder in digital electronics is a one-hot to binary converter. That is, if there are 2^n input lines, and at most only one of them will ever be high, the binary code of this 'hot' line is produced on the n-bit output lines. For example, a 4-to-2 simple encoder takes 4 input bits and produces 2 output bits. A gate-level implementation of the simple encoder covers only the truth-table rows with exactly one high input; all other input combinations (i.e., inputs containing 0, 2, 3, or 4 high bits) are treated as don't-cares.


  • https://en.wikipedia.org/wiki/Binary_decoder - a combinational logic circuit that converts binary information from the n coded inputs to a maximum of 2^n unique outputs. They are used in a wide variety of applications, including data multiplexing and data demultiplexing, seven-segment displays, and memory address decoding. There are several types of binary decoders, but in all cases a decoder is an electronic circuit with multiple input and multiple output signals, which converts every unique combination of input states to a specific combination of output states. In addition to integer data inputs, some decoders also have one or more enable inputs.
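
A minimal software model of these two circuits, as a rough sketch in Python (function names are illustrative, not from any library):

  def encode_one_hot(bits):
      # 4-to-2 simple encoder: exactly one input line is high, and the
      # index of that line becomes the 2-bit output. Inputs with 0 or >1
      # high bits are "don't cares" in the truth table, so reject them.
      if sum(bits) != 1:
          raise ValueError("simple encoder expects exactly one high line")
      return bits.index(1)

  def decode_binary(n, width=4):
      # 2-to-4 decoder: n selects which one of the output lines goes high
      return tuple(1 if i == n else 0 for i in range(width))

  assert encode_one_hot((0, 0, 1, 0)) == 2
  assert decode_binary(2) == (0, 0, 1, 0)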

Numbers

Binary



  • Kaitai Struct - a declarative language used to describe various binary data structures, laid out in files or in memory: i.e. binary file formats, network stream packet formats, etc. The main idea is that a particular format is described in the Kaitai Struct language (.ksy file) and then can be compiled with ksc into source files in one of the supported programming languages. These modules will include generated code for a parser that can read the described data structure from a file / stream and give access to it in a nice, easy-to-comprehend API.

Hexadecimal


  • https://github.com/sharkdp/hexyl - a simple hex viewer for the terminal. It uses a colored output to distinguish different categories of bytes (NULL bytes, printable ASCII characters, ASCII whitespace characters, other ASCII characters and non-ASCII).


Gray code

  • https://en.wikipedia.org/wiki/Gray_code - after Frank Gray, or reflected binary code (RBC), also known just as reflected binary (RB), is an ordering of the binary numeral system such that two successive values differ in only one bit (binary digit). The reflected binary code was originally designed to prevent spurious output from electromechanical switches. Today, Gray codes are widely used to facilitate error correction in digital communications such as digital terrestrial television and some cable TV systems.
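
The standard binary-to-Gray mapping and its inverse are tiny; a sketch in Python:

  def to_gray(n):
      # adjacent integers differ in exactly one bit after this mapping
      return n ^ (n >> 1)

  def from_gray(g):
      # undo the mapping by folding the bits back down
      n = 0
      while g:
          n ^= g
          g >>= 1
      return n

  assert [to_gray(i) for i in range(4)] == [0b00, 0b01, 0b11, 0b10]
  assert all(from_gray(to_gray(i)) == i for i in range(256))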

Character encoding





  • https://en.wikipedia.org/wiki/Od_(Unix) - a program for displaying ("dumping") data in various human-readable output formats. The name is an acronym for "octal dump" since it defaults to printing in the octal data format. It can also display output in a variety of other formats, including hexadecimal, decimal, and ASCII. It is useful for visualizing data that is not in a human-readable format, like the executable code of a program.
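
For illustration, an od-style octal dump of a few bytes can be reproduced in a line of Python (the formatting only approximates od -b output):

  data = b"hi\n"
  # each byte printed as a three-digit octal number
  print(" ".join(format(b, "03o") for b in data))  # prints: 150 151 012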


Telegraph

Morse


Baudot

  • https://en.wikipedia.org/wiki/Baudot_code - a character set predating EBCDIC and ASCII. It was the predecessor to the International Telegraph Alphabet No. 2 (ITA2), the teleprinter code in use until the advent of ASCII. Each character in the alphabet is represented by a series of bits, sent over a communication channel such as a telegraph wire or a radio signal. The symbol rate measurement is known as baud, and is derived from the same name.

BCD

EBCDIC

  • https://en.wikipedia.org/wiki/EBCDIC - an eight-bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems. EBCDIC descended from the code used with punched cards and the corresponding six-bit binary-coded decimal code used with most of IBM's computer peripherals of the late 1950s and early 1960s. It is also supported on various non-IBM platforms such as Fujitsu-Siemens' BS2000/OSD, OS-IV, MSP, and MSP-EX, the SDS Sigma series, and Unisys VS/9 and MCP.

ASCII / ANSI

to move/merge with Typography

  • https://en.wikipedia.org/wiki/ASCII - abbreviated from American Standard Code for Information Interchange, is a character-encoding scheme. Originally based on the English alphabet, it encodes 128 specified characters into 7-bit binary integers. The characters encoded are the digits 0 to 9, lowercase letters a to z, uppercase letters A to Z, basic punctuation symbols, control codes that originated with Teletype machines, and a space. For example, lowercase j is encoded as binary 1101010 (decimal 106).
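
The 'j' example can be checked directly in Python:

  assert ord("j") == 106
  assert format(ord("j"), "07b") == "1101010"
  assert "j".encode("ascii") == b"\x6a"  # 0x6a == decimal 106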


  • https://en.wikipedia.org/wiki/Extended_ASCII - eight-bit or larger character encodings that include the standard seven-bit ASCII characters as well as others. The use of the term is sometimes criticized, because it can be mistakenly interpreted that the ASCII standard has been updated to include more than 128 characters or that the term unambiguously identifies a single encoding, both of which are untrue.



  • https://en.wikipedia.org/wiki/PETSCII - also known as CBM ASCII, is the character set used in Commodore Business Machines (CBM)'s 8-bit home computers, starting with the PET from 1977 and including the C16, C64, C116, C128, CBM-II, Plus/4, and VIC-20.







Art


  • jp2a - a small utility that converts JPG images to ASCII. It's written in C and released under the GPL.



Fonts

CJKV

Unicode

  • https://en.wikipedia.org/wiki/Unicode - a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard is maintained by the Unicode Consortium, and as of June 2018 the most recent version, Unicode 11.0, contains a repertoire of 137,439 characters covering 146 modern and historic scripts, as well as multiple symbol sets and emoji. The character repertoire of the Unicode Standard is synchronized with ISO/IEC 10646, and both are code-for-code identical. The Unicode Standard consists of a set of code charts for visual reference, an encoding method and set of standard character encodings, a set of reference data files, and a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering, and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic and Hebrew, and left-to-right scripts).

Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including modern operating systems, XML, Java (and other programming languages), and the .NET Framework. Unicode can be implemented by different character encodings. The Unicode standard defines UTF-8, UTF-16, and UTF-32, and several other encodings are in use. The most commonly used encodings are UTF-8, UTF-16 and UCS-2, a precursor of UTF-16.


  • Unicode Consortium - enables people around the world to use computers in any language. Our freely-available specifications and data form the foundation for software internationalization in all major operating systems, search engines, applications, and the World Wide Web. An essential part of our mission is to educate and engage academic and scientific communities, and the general public.


  • ICU - International Components for Unicode - a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software. ICU is released under a nonrestrictive open source license that is suitable for use with both commercial software and with other open source or free software.




  • UAX #15: Unicode Normalization Forms - This annex describes normalization forms for Unicode text. When implementations keep strings in a normalized form, they can be assured that equivalent strings have a unique binary representation. This annex also provides examples, additional specifications regarding normalization of Unicode text, and information about conformance testing for Unicode normalization forms.
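
A quick illustration of why normalization matters, using Python's standard unicodedata module: the precomposed and decomposed spellings of 'é' compare unequal until normalized to a common form.

  import unicodedata

  precomposed = "\u00e9"   # 'é' as a single code point
  decomposed = "e\u0301"   # 'e' plus a combining acute accent

  assert precomposed != decomposed  # distinct code point sequences
  assert unicodedata.normalize("NFC", decomposed) == precomposed
  assert unicodedata.normalize("NFD", precomposed) == decomposed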




mirroring char in brackets: (‮‮test ( 








  • https://en.wikipedia.org/wiki/Mojibake - the garbled text that is the result of text being decoded using an unintended character encoding. The result is a systematic replacement of symbols with completely unrelated ones, often from a different writing system.
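
The classic case reproduced in Python: UTF-8 bytes decoded as Latin-1.

  text = "café"
  garbled = text.encode("utf-8").decode("latin-1")
  assert garbled == "cafÃ©"  # mojibake
  # reversing the mistaken decode recovers the original text
  assert garbled.encode("latin-1").decode("utf-8") == text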

Files

Magic number

FourCC

  • https://en.wikipedia.org/wiki/FourCC - (literally, four-character code) is a sequence of four bytes used to uniquely identify data formats. The concept originated in the OSType scheme used in the Macintosh system software and was adopted for the Amiga/Electronic Arts Interchange File Format and derivatives. The idea was later reused to identify compressed data types in QuickTime and DirectShow.
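
Since a FourCC is just four bytes, it can be read either as text or as a 32-bit tag; a small Python illustration (byte order is format-specific):

  code = b"WAVE"  # the format tag inside a RIFF file header
  assert len(code) == 4
  as_int = int.from_bytes(code, "big")   # same code viewed as an integer
  assert as_int.to_bytes(4, "big") == code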


Serialization and markup

See also JavaScript#JSON, Documents, HTML/CSS#Markup, Semantic web, Database




  • https://en.wikipedia.org/wiki/Marshalling_(computer_science) - or marshaling is the process of transforming the memory representation of an object to a data format suitable for storage or transmission, and it is typically used when data must be moved between different parts of a computer program or from one program to another. Marshalling is similar to serialization and is used to communicate to remote objects with an object, in this case a serialized object. It simplifies complex communication, using composite objects in order to communicate instead of primitives. The inverse of marshalling is called unmarshalling (or demarshalling, similar to deserialization).
  • https://en.wikipedia.org/wiki/Unmarshalling - Comparison with deserialization: An object that is serialized is in the form of a byte stream and it can eventually be converted back to a copy of the original object. Deserialization is the process of converting the byte stream data back to its original object type. An object that is marshalled, however, records the state of the original object and it contains the codebase (codebase here refers to a list of URLs where the object code can be loaded from, and not source code). Hence, in order to convert the object state and codebase(s), unmarshalling must be done.
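
A serialization round trip in Python's pickle shows the byte-stream half of this distinction; note that pickle records no codebase, so this is serialization rather than full marshalling:

  import pickle

  obj = {"x": 1, "y": [2, 3]}
  data = pickle.dumps(obj)           # object -> byte stream
  assert pickle.loads(data) == obj   # byte stream -> equivalent copy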


  • https://en.wikipedia.org/wiki/Delimiter - a sequence of one or more characters used to specify the boundary between separate, independent regions in plain text or other data streams. An example of a delimiter is the comma character, which acts as a field delimiter in a sequence of comma-separated values. Another example of a delimiter is the time gap used to separate letters and words in the transmission of Morse code. Delimiters represent one of various means to specify boundaries in a data stream.
  • https://en.wikipedia.org/wiki/Delimiter-separated_values - store two-dimensional arrays of data by separating the values in each row with specific delimiter characters. Most database and spreadsheet programs are able to read or save data in a delimited format. A delimited text file is a text file used to store data, in which each line represents a single record (a book, company, or other entity), and each line has fields separated by the delimiter. Compared to the kind of flat file that uses spaces to force every field to the same width, a delimited file has the advantage of allowing field values of any length.
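
A short Python example with the standard csv module, which handles both comma- and tab-delimited variants:

  import csv, io

  data = "id,title\n1,Moby-Dick\n2,Dune\n"
  rows = list(csv.reader(io.StringIO(data)))
  assert rows == [["id", "title"], ["1", "Moby-Dick"], ["2", "Dune"]]

  # the same data as TSV: just a different delimiter character
  tsv = data.replace(",", "\t")
  assert list(csv.reader(io.StringIO(tsv), delimiter="\t")) == rows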



  • VisiData - a free, open-source tool that lets you quickly open, explore, summarize, and analyze datasets in your computer’s terminal. VisiData works with CSV files, Excel spreadsheets, SQL databases, and many other data sources.

S-expression

  • https://en.wikipedia.org/wiki/S-expression - sexprs or sexps (for "symbolic expression") are a notation for nested list (tree-structured) data, invented for and popularized by the programming language Lisp, which uses them for source code as well as data. In the usual parenthesized syntax of Lisp, an s-expression is classically defined as "an atom", or "an expression of the form (x . y) where x and y are s-expressions." The second, recursive part of the definition represents an ordered pair, which means that s-expressions are binary trees.
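
The recursive definition translates almost directly into a toy reader; a sketch in Python (lists stand in for the underlying pairs):

  def tokenize(src):
      # pad parens with spaces so split() yields clean tokens
      return src.replace("(", " ( ").replace(")", " ) ").split()

  def parse_sexpr(tokens):
      tok = tokens.pop(0)
      if tok == "(":
          lst = []
          while tokens[0] != ")":
              lst.append(parse_sexpr(tokens))
          tokens.pop(0)  # drop the closing ")"
          return lst
      return tok  # an atom

  assert parse_sexpr(tokenize("(a (b c) d)")) == ["a", ["b", "c"], "d"]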

M-Expression


GML

  • https://en.wikipedia.org/wiki/IBM_Generalized_Markup_Language - GML (1969) is a set of macros that implement intent-based (procedural) markup tags for the IBM text formatter, SCRIPT. SCRIPT/VS is the main component of IBM's Document Composition Facility (DCF). A starter set of GML tags is provided with the DCF product.


CBCL

  • https://en.wikipedia.org/wiki/Common_Business_Communication_Language - (CBCL) is a communications language proposed by John McCarthy that foreshadowed much of XML. The language consists of a basic framework of hierarchical markup derived from S-expressions, coupled with some general principles about use and extensibility. Although written in 1975, the proposal was not published until 1982, and to this day remains relatively obscure.

Recfile

  • GNU Recutils - a set of tools and libraries to access human-editable, plain text databases called recfiles. The data is stored as a sequence of records, each record containing an arbitrary number of named fields.
  • recfile - Recfile is the file format used by GNU Recutils. It can be seen as a "vertical" counterpart to CSV.
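
A toy reader for the basic layout, assuming only blank-line-separated records of single-line "Name: value" fields (real recfiles also allow comments, multi-line values, and more):

  def parse_recfile(text):
      records = []
      for block in text.strip().split("\n\n"):   # blank line ends a record
          rec = {}
          for line in block.splitlines():
              name, _, value = line.partition(": ")
              rec[name] = value
          records.append(rec)
      return records

  sample = "Name: recutils\nLicense: GPLv3\n\nName: wget\nLicense: GPLv3"
  assert parse_recfile(sample)[1]["Name"] == "wget"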




CSV



TSV






ASN.1

  • https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One - ASN.1 is an interface description language for defining data structures that can be serialized and deserialized in a standard, cross-platform way. It's broadly used in telecommunications and computer networking, and especially in cryptography. Protocol developers define data structures in ASN.1 modules, which are generally a section of a broader standards document written in the ASN.1 language. Because the language is both human-readable and machine-readable, modules can be automatically turned into libraries that process their data structures, using an ASN.1 compiler. ASN.1 is similar in purpose and use to protocol buffers and Apache Thrift, which are also interface description languages for cross-platform data serialization. Like those languages, it has a schema (in ASN.1, called a "module"), and a set of encodings, typically type-length-value encodings. However, ASN.1, defined in 1984, predates them by many years. It also includes a wider variety of basic data types, some of which are obsolete, and has more options for extensibility. A single ASN.1 message can include data from multiple modules defined in multiple standards, even standards defined years apart.

X.690

  • https://en.wikipedia.org/wiki/X.690 - an ITU-T standard specifying several ASN.1 encoding formats:
    • Basic Encoding Rules (BER)
    • Canonical Encoding Rules (CER)
    • Distinguished Encoding Rules (DER)
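
As a sketch of the type-length-value shape these rules share, here is the DER encoding of an ASN.1 INTEGER in Python (short-form lengths only, for brevity):

  def der_integer(n):
      # tag 0x02 (INTEGER), then length, then big-endian content octets;
      # bit_length // 8 + 1 bytes keeps the sign bit clear for n >= 0
      assert n >= 0
      body = n.to_bytes(n.bit_length() // 8 + 1, "big")
      assert len(body) < 0x80  # short-form length fits in one byte
      return bytes([0x02, len(body)]) + body

  assert der_integer(65537) == b"\x02\x03\x01\x00\x01"  # common RSA exponent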


JSON

  • JSON - JavaScript Object Notation, is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.
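
The round trip in Python's standard json module:

  import json

  record = {"name": "Ada", "tags": ["math", "computing"], "born": 1815}
  text = json.dumps(record)          # serialize to a JSON string
  assert json.loads(text) == record  # parse back; round trip intact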







  • JSON Web Token (JWT) is a compact URL-safe means of representing claims to be transferred between two parties. The claims in a JWT are encoded as a JSON object that is digitally signed using JSON Web Signature (JWS). - IETF
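
Structurally a JWT is three base64url segments, header.payload.signature. A sketch of peeking at the (unverified!) claims in Python; the token here is constructed on the spot, and real code must verify the JWS signature before trusting anything:

  import base64, json

  def b64url(data):
      return base64.urlsafe_b64encode(data).rstrip(b"=")

  def b64url_decode(seg):
      return base64.urlsafe_b64decode(seg + "=" * (-len(seg) % 4))

  token = b".".join([
      b64url(json.dumps({"alg": "none"}).encode()),   # header
      b64url(json.dumps({"sub": "alice"}).encode()),  # payload (claims)
      b"",                                            # empty signature
  ])

  claims = json.loads(b64url_decode(token.split(b".")[1].decode()))
  assert claims == {"sub": "alice"}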




Variations



  • JSON-P, or "JSON with padding", is a communication technique used in JavaScript programs which run in Web browsers. It provides a method to request data from a server in a different domain, something otherwise prohibited by typical web browsers' same-origin policy; the technique predates CORS.


  • JsonML (JSON Markup Language) is an application of the JSON (JavaScript Object Notation) format. The purpose of JsonML is to provide a compact format for transporting XML-based markup as JSON which allows it to be losslessly converted back to its original form. Native XML/XHTML doesn't sit well embedded in JavaScript. When XHTML is stored in script it must be properly encoded as an opaque string. JsonML allows easy manipulation of the markup in script before completely rehydrating back to the original form.


  • JSON-LD (JavaScript Object Notation for Linking Data) is a lightweight Linked Data format that gives your data context. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on the already successful JSON format and provides a way to help JSON data interoperate at Web-scale. If you are already familiar with JSON, writing JSON-LD is very easy. These properties make JSON-LD an ideal Linked Data interchange language for JavaScript environments, Web services, and unstructured databases such as CouchDB and MongoDB.


  • json-stat.org is an attempt to define a JSON schema for statistical dissemination or at least some guidelines and good practices when dealing with stats in JSON.



  • JSON API is a JSON-based read/write hypermedia type designed to support a smart client that wishes to build a data store of information.




  • Javascript Object Signing and Encryption - JavaScript Object Notation (JSON) is a text format for the serialization of structured data described in RFC 4627. The JSON format is often used for serializing and transmitting structured data over a network connection. With the increased usage of JSON in protocols in the IETF and elsewhere, there is now a desire to offer security services (encryption, digital signatures, and message authentication code (MAC) algorithms) that carry their data in JSON format.
  • JSON Web Key (JWK) is a JSON data structure that represents a set of public keys.



Learning

  • Getting Started with JSON - You send data in a JSON format between different parts of your system. API results are often returned in JSON format, for example. JSON is a lightweight format which makes for easy reading if you're even the least bit familiar with JavaScript.


Tools


  • Pjson - Like python -mjson.tool but with moar colors (and less conf)


  • jq is like sed for JSON data – you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.



  • https://github.com/tomnomnom/gron - Make JSON greppable! gron transforms JSON into discrete assignments to make it easier to grep for what you want and see the absolute 'path' to it. It eases the exploration of APIs that return large blobs of JSON but have terrible documentation.
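
The core trick is easy to approximate in a few lines of Python (the output format loosely follows gron's, not byte-for-byte):

  def flatten(value, path="json"):
      # yield one "path = value;" assignment per JSON leaf
      if isinstance(value, dict):
          for k, v in value.items():
              yield from flatten(v, f"{path}.{k}")
      elif isinstance(value, list):
          for i, v in enumerate(value):
              yield from flatten(v, f"{path}[{i}]")
      else:
          yield f"{path} = {value!r};"

  doc = {"user": {"name": "ada", "roles": ["admin", "dev"]}}
  print("\n".join(flatten(doc)))
  # json.user.name = 'ada';
  # json.user.roles[0] = 'admin';
  # json.user.roles[1] = 'dev';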




  • Jansson - a C library for encoding, decoding and manipulating JSON data.
  • Jshon - parses, reads and creates JSON. It is designed to be as usable as possible from within the shell and replaces fragile ad hoc parsers made from grep/sed/awk as well as heavyweight one-line parsers made from perl/python.




  • Data Protocols - the Open Knowledge Labs home of simple protocols and formats for working with open data. Our mission is both to make it easier to develop tools and services for working with data, and to ensure greater interoperability between new and existing tools and services.


Web


Checking

to sort

YAML


  • StrictYAML - a type-safe YAML parser that parses a restricted subset of the YAML specification.




TOML

  • https://github.com/toml-lang/toml - TOML aims to be a minimal configuration file format that's easy to read due to obvious semantics. TOML is designed to map unambiguously to a hash table. TOML should be easy to parse into data structures in a wide variety of languages.
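
Python 3.11+ ships a TOML parser in the standard library, so a document like this parses with no third-party dependency:

  import tomllib

  doc = tomllib.loads("""
  title = "example"

  [server]
  host = "127.0.0.1"
  ports = [8001, 8002]
  """)
  assert doc["server"]["ports"] == [8001, 8002]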


HCL

  • https://github.com/hashicorp/hcl - a configuration language built by HashiCorp. The goal of HCL is to build a structured configuration language that is both human and machine friendly for use with command-line tools, but specifically targeted towards DevOps tools, servers, etc. HCL is also fully JSON compatible. That is, JSON can be used as completely valid input to a system expecting HCL. This helps make systems interoperable with other systems.



CSON



STON

Hjson

  • Hjson - a syntax extension to JSON. It's NOT a proposal to replace JSON or to incorporate it into the JSON spec itself. It's intended to be used like a user interface for humans, to read and edit before passing the JSON data to the machine.


XDR

  • https://en.wikipedia.org/wiki/External_Data_Representation - XDR, is a standard data serialization format, for uses such as computer network protocols. It allows data to be transferred between different kinds of computer systems. Converting from the local representation to XDR is called encoding. Converting from XDR to the local representation is called decoding. XDR is implemented as a software library of functions which is portable between different operating systems and is also independent of the transport layer. XDR uses a base unit of 4 bytes, serialized in big-endian order; smaller data types still occupy four bytes each after encoding. Variable-length types such as string and opaque are padded to a total divisible by four bytes. Floating-point numbers are represented in IEEE 754 format.
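
A sketch of the two rules just described, using Python's struct module:

  import struct

  def xdr_string(s):
      # 4-byte big-endian length, then the bytes, padded out to a
      # multiple of four (XDR's base unit)
      data = s.encode()
      return struct.pack(">I", len(data)) + data + b"\x00" * (-len(data) % 4)

  assert struct.pack(">i", 1) == b"\x00\x00\x00\x01"  # ints occupy 4 bytes
  assert xdr_string("hello") == b"\x00\x00\x00\x05hello\x00\x00\x00"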


DSPL

  • DSPL - stands for Dataset Publishing Language. It is a representation format for both the metadata (information about the dataset, such as its name and provider, as well as the concepts it contains and displays) and actual data of datasets. The metadata is specified in XML, whereas the data are provided in CSV format.

CUE

  • CUE - an open source language, with a rich set of APIs and tooling, for defining, generating, and validating all kinds of data: configuration, APIs, database schemas, code, … you name it.


HAL

  • HAL - a format you can use in your API that gives you a simple way of linking. It has two variants, one in JSON and one in XML.

Dhall

HStore


Protocol Buffers


  • Buf - Introduction

Cap'n Proto

  • Cap'n Proto - an insanely fast data interchange format and capability-based RPC system. Think JSON, except binary. Or think Protocol Buffers, except faster. In fact, in benchmarks, Cap’n Proto is INFINITY TIMES faster than Protocol Buffers. This benchmark is, of course, unfair. It is only measuring the time to encode and decode a message in memory. Cap’n Proto gets a perfect score because there is no encoding/decoding step. The Cap’n Proto encoding is appropriate both as a data interchange format and an in-memory representation, so once your structure is built, you can simply write the bytes straight out to disk!


BSON

  • BSON - short for Binary JSON, is a binary-encoded serialization of JSON-like documents. Like JSON, BSON supports the embedding of documents and arrays within other documents and arrays. BSON also contains extensions that allow representation of data types that are not part of the JSON spec. For example, BSON has a Date type and a BinData type.

MessagePack

  • MessagePack - an efficient binary serialization format. It lets you exchange data among multiple languages like JSON. But it's faster and smaller. Small integers are encoded into a single byte, and typical short strings require only one extra byte in addition to the strings themselves.
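
The size claims are easy to verify by hand-encoding two values per the MessagePack spec (the positive fixint and fixstr families):

  def pack_small_int(n):
      # "positive fixint": values 0..127 are themselves the encoded byte
      assert 0 <= n <= 0x7f
      return bytes([n])

  def pack_short_str(s):
      # "fixstr": one byte 0xa0|length for strings shorter than 32 bytes
      data = s.encode("utf-8")
      assert len(data) < 32
      return bytes([0xa0 | len(data)]) + data

  assert pack_small_int(42) == b"\x2a"      # one byte total
  assert pack_short_str("hi") == b"\xa2hi"  # one byte of overhead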


CBOR

  • CBOR - RFC 7049 “The Concise Binary Object Representation (CBOR) is a data format whose design goals include the possibility of extremely small code size, fairly small message size, and extensibility without the need for version negotiation.”


Amazon Ion

  • Amazon Ion - a richly-typed, self-describing, hierarchical data serialization format offering interchangeable binary and text representations. The text format (a superset of JSON) is easy to read and author, supporting rapid prototyping. The binary representation is efficient to store, transmit, and skip-scan parse. The rich type system provides unambiguous semantics for long-term preservation of business data which can survive multiple generations of software evolution. Ion was built to solve the rapid development, decoupling, and efficiency challenges faced every day while engineering large-scale, service-oriented architectures. Ion has been addressing these challenges within Amazon for nearly a decade, and we believe others will benefit as well.

Apache Pulsar


der-ascii

  • https://github.com/google/der-ascii - a small human-editable language to emit DER (Distinguished Encoding Rules) or BER (Basic Encoding Rules) encodings of ASN.1 structures and malformed variants of them.

MQTT

  • MQTT - a machine-to-machine (M2M)/"Internet of Things" connectivity protocol. It was designed as an extremely lightweight publish/subscribe messaging transport. It is useful for connections with remote locations where a small code footprint is required and/or network bandwidth is at a premium. For example, it has been used in sensors communicating to a broker via satellite link, over occasional dial-up connections with healthcare providers, and in a range of home automation and small device scenarios. It is also ideal for mobile applications because of its small size, low power usage, minimised data packets, and efficient distribution of information to one or many receivers.
    • https://en.wikipedia.org/wiki/MQTT - (MQ Telemetry Transport or Message Queuing Telemetry Transport) is an ISO standard (ISO/IEC PRF 20922) publish-subscribe-based messaging protocol. It works on top of the TCP/IP protocol. It is designed for connections with remote locations where a "small code footprint" is required or the network bandwidth is limited. The publish-subscribe messaging pattern requires a message broker.
    • MQTT Version 5.0 - a Client Server publish/subscribe messaging transport protocol. It is light weight, open, simple, and designed to be easy to implement. These characteristics make it ideal for use in many situations, including constrained environments such as for communication in Machine to Machine (M2M) and Internet of Things (IoT) contexts where a small code footprint is required and/or network bandwidth is at a premium.

recordio

riegeli

  • https://github.com/google/riegeli - a file format for storing a sequence of string records, typically serialized protocol buffers. It supports dense compression, fast decoding, seeking, detection and optional skipping of data corruption, filtering of proto message fields for even faster decoding, and parallel encoding.

gRPC

  • gRPC - a modern open source high performance RPC framework that can run in any environment. It can efficiently connect services in and across data centers with pluggable support for load balancing, tracing, health checking and authentication. It is also applicable in the last mile of distributed computing to connect devices, mobile applications and browsers to backend services.

smf


FlatBuffers

  • FlatBuffers - an efficient cross-platform serialization library for C++, C#, C, Go, Java, JavaScript, Lobster, Lua, TypeScript, PHP, Python, and Rust. It was originally created at Google for game development and other performance-critical applications. It is available as Open Source on GitHub under the Apache license, v2.


eno

  • eno - a modern plaintext data format and notation language with libraries, designed from the ground up for file-based content; simple, powerful and elegant.

Transit

  • https://github.com/cognitect/transit-format - a format and set of libraries for conveying values between applications written in different programming languages. This spec describes Transit in order to facilitate its implementation in a wide range of languages.

Scuttlebot

cereal

  • https://github.com/USCiLab/cereal - a header-only C++11 serialization library. cereal takes arbitrary data types and reversibly turns them into different representations, such as compact binary encodings, XML, or JSON. cereal was designed to be fast, light-weight, and easy to extend - it has no external dependencies and can be easily bundled with other code or used standalone.

Citations

See also Organising#Reference management

BibTeX

  • BibTeX - a tool and a file format which are used to describe and process lists of references, mostly in conjunction with LaTeX documents.
  • https://en.wikipedia.org/wiki/BibTeX - reference management software for formatting lists of references. The BibTeX tool is typically used together with the LaTeX document preparation system. Within the typesetting system, its name is styled as BibTeX. The name is a portmanteau of the word bibliography and the name of the TeX typesetting software. The purpose of BibTeX is to make it easy to cite sources in a consistent manner, by separating bibliographic information from the presentation of this information, similarly to the separation of content and presentation/style supported by LaTeX itself.

RIS

  • https://en.wikipedia.org/wiki/RIS_(file_format) - a standardized tag format developed by Research Information Systems, Incorporated (the format name refers to the company) to enable citation programs to exchange data. It is supported by a number of reference managers. Many digital libraries, like IEEE Xplore, Scopus, the ACM Portal, Scopemed, ScienceDirect, SpringerLink, Rayyan QCRI, Ejmanager and online library catalogs can export citations in this format. Major reference/citation manager applications, like Zotero, Citavi, Mendeley, and EndNote can export and import citations in this format.


CiteProc

  • https://en.wikipedia.org/wiki/CiteProc - the generic name for programs that produce formatted bibliographies and citations based on the metadata of the cited objects and the formatting instructions provided by Citation Style Language (CSL) styles. The first CiteProc implementation used XSLT 2.0, but implementations have been written for other programming languages, including JavaScript, Java, Haskell, PHP, Python, and Ruby. CiteProc, CSL, and Cite Schema make up the Citation Style Language project, a Creative Commons Attribution Share-Alike licensed effort "to provide a common framework for formatting bibliographies and citations across markup languages and document standards. In an ideal world, one could use the same CSL files to format DocBook, TEI, OpenOffice, WordML ... or even LaTeX documents." Different implementations of CiteProc are able to use different bibliographic databases; many can use MODS XML.




Maths

Mining


Scraping

See also HTTP#Scraping, Network#Saving

  • Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
  • kimono - Turn websites into structured APIs from your browser in seconds




Tools






  • DataLad - Providing a data portal and a versioning system for everyone, DataLad lets you have your data and control it too.


  • Kaitai Struct - a declarative binary format parsing language; see the description under Binary above.


  • Datasette - a tool for exploring and publishing data. It helps people take data of any shape or size and publish that as an interactive, explorable website and accompanying API. Datasette is aimed at data journalists, museum curators, archivists, local governments and anyone else who has data that they wish to share with the world. It is part of a wider ecosystem of tools dedicated to making working with structured data as productive as possible.
  • https://github.com/simonw/datasette

Services