Data

From Things and Stuff Wiki


General

See also Open data, Maths, Computing, Visualisation, Language, Documents


  • https://en.wikipedia.org/wiki/Data_acquisition - the process of sampling signals that measure real world physical conditions and converting the resulting samples into digital numeric values that can be manipulated by a computer. Data acquisition systems, abbreviated by the initialisms DAS or DAQ, typically convert analog waveforms into digital values for processing. The components of data acquisition systems include: Sensors, to convert physical parameters to electrical signals. Signal conditioning circuitry, to convert sensor signals into a form that can be converted to digital values. Analog-to-digital converters, to convert conditioned sensor signals to digital values.


  • https://en.wikipedia.org/wiki/Unstructured_data - information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents. In 1998, Merrill Lynch cited a rule of thumb that somewhere around 80-90% of all potentially usable business information may originate in unstructured form. This rule of thumb is not based on primary or any quantitative research, but nonetheless is accepted by some. IDC and EMC project that data will grow to 40 zettabytes by 2020, resulting in a 50-fold growth from the beginning of 2010. The Computer World magazine states that unstructured information might account for more than 70%–80% of all data in organizations.



  • https://en.wikipedia.org/wiki/Data_model - or datamodel, is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities. For instance, a data model may specify that the data element representing a car be composed of a number of other elements which, in turn, represent the color and size of the car and define its owner.

The term data model is used in two distinct but closely related senses. Sometimes it refers to an abstract formalization of the objects and relationships found in a particular application domain, for example the customers, products, and orders found in a manufacturing organization. At other times it refers to a set of concepts used in defining such formalizations: for example concepts such as entities, attributes, relations, or tables. So the "data model" of a banking application may be defined using the entity-relationship "data model". This article uses the term in both senses.


Overview of data modeling context: Data model is based on Data, Data relationship, Data semantic and Data constraint. A data model provides the details of information to be stored, and is of primary use when the final product is the generation of computer software code for an application or the preparation of a functional specification to aid a computer software make-or-buy decision. A data model explicitly determines the structure of data. Data models are specified in a data modeling notation, which is often graphical in form. A data model can sometimes be referred to as a data structure, especially in the context of programming languages. Data models are often complemented by function models, especially in the context of enterprise models.




  • https://en.wikipedia.org/wiki/Semi-structured_data - form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as self-describing structure. In semi-structured data, the entities belonging to the same class may have different attributes even though they are grouped together, and the attributes' order is not important. Semi-structured data are increasingly occurring since the advent of the Internet where full-text documents and databases are not the only forms of data anymore, and different applications need a medium for exchanging information. In object-oriented databases, one often finds semi-structured data.






Articles




  • Your configs suck? Try a real programming language. - In this post, I'll try to explain why I find most config formats frustrating to use and suggest that using a real programming language (i.e. general purpose one, like Python) is often a feasible and more pleasant alternative for writing configs. [1]

Learning


  • Data Carpentry - develops and teaches workshops on the fundamental data skills needed to conduct research. Our mission is to provide researchers high-quality, domain-specific training covering the full lifecycle of data-driven research.

Management

See also Database, Visualisation, Maths#Software









Annotation



  • https://github.com/doccano/doccano - an open-source text annotation tool for humans. It provides annotation features for text classification, sequence labeling, and sequence to sequence tasks. You can create labeled data for sentiment analysis, named entity recognition, text summarization, and so on. Just create a project, upload data, and start annotating. You can build a dataset in hours.



    • https://github.com/smistad/annotationweb - a web-based annotation system made primarily for easy annotation of image sequences such as ultrasound and camera recordings. It uses mainly Django/Python for the backend and JavaScript/jQuery and HTML canvas for the interactive annotation frontend. Annotation Web is developed by SINTEF Medical Technology and the Norwegian University of Science and Technology (NTNU), and is released under a permissive MIT license.


  • Annotator - an open-source JavaScript library to easily add annotation functionality to any webpage. Annotations can have comments, tags, links, users, and more. Annotator is designed for easy extensibility so it's a cinch to add a new feature or behaviour. Annotator also fosters an active developer community with contributors from four continents, building 3rd party plugins allowing the annotation of PDFs, EPUBs, videos, images, sound, and more.



Science


  • A Taxonomy of Data Science - Both within the academy and within tech startups, we’ve been hearing some similar questions lately: Where can I find a good data scientist? What do I need to learn to become a data scientist? Or more succinctly: What is data science?


  • School of Data - works to empower civil society organizations, journalists and citizens with the skills they need to use data effectively in their efforts to create more equitable and effective societies.


  • Kaggle - Service - From Big Data to Big Analytics.




Encoding

  • https://en.wikipedia.org/wiki/Encoder_(digital) - or simply an encoder in digital electronics is a one-hot to binary converter. That is, if there are 2^n input lines, and at most only one of them will ever be high, the binary code of this 'hot' line is produced on the n-bit output lines. For example, a 4-to-2 simple encoder takes 4 input bits and produces 2 output bits. For all the input combinations the truth table leaves undefined (i.e., inputs containing 0, 2, 3, or 4 high bits), the outputs are treated as don't cares.


  • https://en.wikipedia.org/wiki/Binary_decoder - a combinational logic circuit that converts binary information from the n coded inputs to a maximum of 2^n unique outputs. They are used in a wide variety of applications, including data multiplexing and data demultiplexing, seven segment displays, and memory address decoding. There are several types of binary decoders, but in all cases a decoder is an electronic circuit with multiple input and multiple output signals, which converts every unique combination of input states to a specific combination of output states. In addition to integer data inputs, some decoders also have one or more enable inputs.
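
A minimal Python model of the two circuits described in the entries above (a sketch only: the gate-level detail is elided and invalid inputs are simply rejected):

    def encode_one_hot(lines):
        # 4-to-2 encoder: output the binary index of the single high line
        assert lines.count(1) == 1, "exactly one input line may be high"
        return lines.index(1)

    def decode_binary(value, n=2):
        # 2-to-4 decoder: drive exactly one of 2^n output lines high
        return [1 if i == value else 0 for i in range(2 ** n)]

    print(encode_one_hot([0, 0, 1, 0]))  # 2 (binary 10)
    print(decode_binary(2))              # [0, 0, 1, 0]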

Numbers

Binary



  • Kaitai Struct - a declarative language used to describe various binary data structures, laid out in files or in memory: i.e. binary file formats, network stream packet formats, etc. The main idea is that a particular format is described in Kaitai Struct language (.ksy file) and then can be compiled with ksc into source files in one of the supported programming languages. These modules will include generated code for a parser that can read the described data structure from a file / stream and give access to it in a nice, easy-to-comprehend API.

Hexadecimal


  • https://github.com/sharkdp/hexyl - a simple hex viewer for the terminal. It uses a colored output to distinguish different categories of bytes (NULL bytes, printable ASCII characters, ASCII whitespace characters, other ASCII characters and non-ASCII).


Gray code

  • https://en.wikipedia.org/wiki/Gray_code - named after Frank Gray, or reflected binary code (RBC), also known just as reflected binary (RB), is an ordering of the binary numeral system such that two successive values differ in only one bit (binary digit). The reflected binary code was originally designed to prevent spurious output from electromechanical switches. Today, Gray codes are widely used to facilitate error correction in digital communications such as digital terrestrial television and some cable TV systems.
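
The conversion in both directions is a small bit-twiddling exercise; a Python sketch:

    def to_gray(n):
        # binary -> reflected Gray code
        return n ^ (n >> 1)

    def from_gray(g):
        # Gray code -> binary, by folding the bits back down
        n = 0
        while g:
            n ^= g
            g >>= 1
        return n

    print([format(to_gray(i), "03b") for i in range(8)])
    # ['000', '001', '011', '010', '110', '111', '101', '100']
    # successive values differ in exactly one bit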

Character encoding





  • https://en.wikipedia.org/wiki/Od_(Unix) - a program for displaying ("dumping") data in various human-readable output formats. The name is an acronym for "octal dump" since it defaults to printing in the octal data format. It can also display output in a variety of other formats, including hexadecimal, decimal, and ASCII. It is useful for visualizing data that is not in a human-readable format, like the executable code of a program.


Telegraph

Morse


Baudot

  • https://en.wikipedia.org/wiki/Baudot_code - a character set predating EBCDIC and ASCII. It was the predecessor to the International Telegraph Alphabet No. 2 (ITA2), the teleprinter code in use until the advent of ASCII. Each character in the alphabet is represented by a series of bits, sent over a communication channel such as a telegraph wire or a radio signal. The symbol rate measurement is known as baud, and is derived from the same name.

BCD

EBCDIC

  • https://en.wikipedia.org/wiki/EBCDIC - an eight-bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems. EBCDIC descended from the code used with punched cards and the corresponding six bit binary-coded decimal code used with most of IBM's computer peripherals of the late 1950s and early 1960s. It is also supported on various non-IBM platforms such as Fujitsu-Siemens' BS2000/OSD, OS-IV, MSP, and MSP-EX, the SDS Sigma series, and Unisys VS/9 and MCP.

ASCII / ANSI

to move/merge with Typography

  • https://en.wikipedia.org/wiki/ASCII - abbreviated from American Standard Code for Information Interchange, is a character-encoding scheme. Originally based on the English alphabet, it encodes 128 specified characters into 7-bit binary integers. The characters encoded are numbers 0 to 9, lowercase letters a to z, uppercase letters A to Z, basic punctuation symbols, control codes that originated with Teletype machines, and a space. For example, lowercase j would become binary 1101010 and decimal 106.
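
The example encoding can be checked directly in Python:

    print(ord("j"))                 # 106
    print(format(ord("j"), "07b"))  # 1101010
    print(chr(0b1101010))           # j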


  • https://en.wikipedia.org/wiki/Extended_ASCII - eight-bit or larger character encodings that include the standard seven-bit ASCII characters as well as others. The use of the term is sometimes criticized, because it can be mistakenly interpreted that the ASCII standard has been updated to include more than 128 characters or that the term unambiguously identifies a single encoding, both of which are untrue.



  • https://en.wikipedia.org/wiki/PETSCII - also known as CBM ASCII, is the character set used in Commodore Business Machines (CBM)'s 8-bit home computers, starting with the PET from 1977 and including the C16, C64, C116, C128, CBM-II, Plus/4, and VIC-20.







Art


  • jp2a - a small utility that converts JPG images to ASCII. It's written in C and released under the GPL.



  • https://github.com/jtdaugherty/tart - a program that provides an image-editor-like interface to creating ASCII art - in the terminal, with your mouse! This program is written using my purely-functional terminal user interface toolkit, Brick.




Fonts

CJKV

Unicode

  • https://en.wikipedia.org/wiki/Unicode - a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard is maintained by the Unicode Consortium, and as of June 2018 the most recent version, Unicode 11.0, contains a repertoire of 137,439 characters covering 146 modern and historic scripts, as well as multiple symbol sets and emoji. The character repertoire of the Unicode Standard is synchronized with ISO/IEC 10646, and both are code-for-code identical. The Unicode Standard consists of a set of code charts for visual reference, an encoding method and set of standard character encodings, a set of reference data files, and a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering, and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic and Hebrew, and left-to-right scripts).

Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including modern operating systems, XML, Java (and other programming languages), and the .NET Framework. Unicode can be implemented by different character encodings. The Unicode standard defines UTF-8, UTF-16, and UTF-32, and several other encodings are in use. The most commonly used encodings are UTF-8, UTF-16 and UCS-2, a precursor of UTF-16.


  • Unicode Consortium - enables people around the world to use computers in any language. Our freely-available specifications and data form the foundation for software internationalization in all major operating systems, search engines, applications, and the World Wide Web. An essential part of our mission is to educate and engage academic and scientific communities, and the general public.



  • ICU - International Components for Unicode - a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software. ICU is released under a nonrestrictive open source license that is suitable for use with both commercial software and with other open source or free software.




  • UAX #15: Unicode Normalization Forms - This annex describes normalization forms for Unicode text. When implementations keep strings in a normalized form, they can be assured that equivalent strings have a unique binary representation. This annex also provides examples, additional specifications regarding normalization of Unicode text, and information about conformance testing for Unicode normalization forms.
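
A quick Python illustration of why normalization matters: the same text can arrive precomposed or decomposed, and the two only compare equal after normalizing.

    import unicodedata

    composed = "\u00e9"      # é as one code point
    decomposed = "e\u0301"   # e + combining acute accent

    print(composed == decomposed)                                # False
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True
    print(unicodedata.normalize("NFD", composed) == decomposed)  # True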




mirroring char in brackets: (‮‮test ( 








  • https://en.wikipedia.org/wiki/Mojibake - the garbled text that is the result of text being decoded using an unintended character encoding. The result is a systematic replacement of symbols with completely unrelated ones, often from a different writing system.
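
Mojibake is easy to reproduce by decoding bytes with the wrong codec; a Python sketch:

    text = "naïve café"
    garbled = text.encode("utf-8").decode("latin-1")
    print(garbled)  # naÃ¯ve cafÃ©
    # if no information was lost, the damage can sometimes be undone:
    print(garbled.encode("latin-1").decode("utf-8"))  # naïve café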

Checksum

  • https://en.wikipedia.org/wiki/Luhn_algorithm - also known as the "modulus 10" or "mod 10" algorithm, named after its creator, IBM scientist Hans Peter Luhn, is a simple checksum formula used to validate a variety of identification numbers, such as credit card numbers, IMEI numbers, National Provider Identifier numbers in the United States, Canadian Social Insurance Numbers, Israeli ID Numbers, South African ID Numbers, Greek Social Security Numbers (ΑΜΚΑ), and survey codes appearing on McDonald's, Taco Bell, and Tractor Supply Co. receipts. It is described in U.S. Patent No. 2,950,048, filed on January 6, 1954, and granted on August 23, 1960. The algorithm is in the public domain and is in wide use today. It is specified in ISO/IEC 7812-1. It is not intended to be a cryptographically secure hash function; it was designed to protect against accidental errors, not malicious attacks. Most credit cards and many government identification numbers use the algorithm as a simple method of distinguishing valid numbers from mistyped or otherwise incorrect numbers.
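
The formula fits in a few lines of Python (a sketch; real validators also strip spaces and check lengths):

    def luhn_valid(number: str) -> bool:
        digits = [int(d) for d in number]
        # double every second digit from the right; subtract 9 if it exceeds 9
        for i in range(len(digits) - 2, -1, -2):
            digits[i] *= 2
            if digits[i] > 9:
                digits[i] -= 9
        return sum(digits) % 10 == 0

    print(luhn_valid("79927398713"))  # True  (a common test number)
    print(luhn_valid("79927398714"))  # False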

Files

Magic number

FourCC

  • https://en.wikipedia.org/wiki/FourCC - literally, four-character code, is a sequence of four bytes used to uniquely identify data formats. The concept originated in the OSType scheme used in the Macintosh system software and was adopted for the Amiga/Electronic Arts Interchange File Format and derivatives. The idea was later reused to identify compressed data types in QuickTime and DirectShow.
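
Reading a FourCC is just reading four bytes; a Python sketch using the start of a minimal RIFF/WAVE header (the bytes are written inline here rather than read from a file):

    import struct

    header = b"RIFF$\x00\x00\x00WAVEfmt "
    fourcc = header[:4].decode("ascii")
    size, = struct.unpack("<I", header[4:8])  # RIFF chunk sizes are little-endian
    print(fourcc, size, header[8:12])         # RIFF 36 b'WAVE'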


Serialization and markup

to integrate with Documents#Markup


See also JavaScript#JSON, Documents, HTML/CSS#Markup, Semantic web, Database




  • https://en.wikipedia.org/wiki/Marshalling_(computer_science) - or marshaling is the process of transforming the memory representation of an object to a data format suitable for storage or transmission, and it is typically used when data must be moved between different parts of a computer program or from one program to another. Marshalling is similar to serialization and is used to communicate with remote objects using a serialized object. It simplifies complex communication, using composite objects in order to communicate instead of primitives. The inverse of marshalling is called unmarshalling (or demarshalling, similar to deserialization).
  • https://en.wikipedia.org/wiki/Unmarshalling - Comparison with deserialization: An object that is serialized is in the form of a byte stream and it can eventually be converted back to a copy of the original object. Deserialization is the process of converting the byte stream data back to its original object type. An object that is marshalled, however, records the state of the original object and it contains the codebase (codebase here refers to a list of URLs where the object code can be loaded from, and not source code). Hence, in order to convert the object state and codebase(s), unmarshalling must be done.


  • https://en.wikipedia.org/wiki/Delimiter - a sequence of one or more characters used to specify the boundary between separate, independent regions in plain text or other data streams. An example of a delimiter is the comma character, which acts as a field delimiter in a sequence of comma-separated values. Another example of a delimiter is the time gap used to separate letters and words in the transmission of Morse code. Delimiters represent one of various means to specify boundaries in a data stream.
  • https://en.wikipedia.org/wiki/Delimiter-separated_values - store two-dimensional arrays of data by separating the values in each row with specific delimiter characters. Most database and spreadsheet programs are able to read or save data in a delimited format. A delimited text file is a text file used to store data, in which each line represents a single book, company, or other thing, and each line has fields separated by the delimiter. Compared to the kind of flat file that uses spaces to force every field to the same width, a delimited file has the advantage of allowing field values of any length.
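
Python's standard csv module handles the quoting and escaping corner cases of delimiter-separated values; a minimal example:

    import csv, io

    data = "title,author,year\nGödel Escher Bach,Hofstadter,1979\n"
    for row in csv.DictReader(io.StringIO(data)):
        print(row["title"], row["year"])  # Gödel Escher Bach 1979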





  • VisiData - a free, open-source tool that lets you quickly open, explore, summarize, and analyze datasets in your computer’s terminal. VisiData works with CSV files, Excel spreadsheets, SQL databases, and many other data sources.

S-expression

  • https://en.wikipedia.org/wiki/S-expression - sexprs or sexps (for "symbolic expression") are a notation for nested list (tree-structured) data, invented for and popularized by the programming language Lisp, which uses them for source code as well as data. In the usual parenthesized syntax of Lisp, an s-expression is classically defined as "an atom", or "an expression of the form (x . y) where x and y are s-expressions." The second, recursive part of the definition represents an ordered pair, which means that s-expressions are binary trees.
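
The recursive definition translates directly into a recursive parser; a minimal Python sketch that handles only atoms and parenthesized lists (no dotted pairs, strings, or quoting):

    def parse_sexpr(src):
        tokens = src.replace("(", " ( ").replace(")", " ) ").split()

        def read(pos):
            if tokens[pos] == "(":
                lst, pos = [], pos + 1
                while tokens[pos] != ")":
                    item, pos = read(pos)
                    lst.append(item)
                return lst, pos + 1       # skip the closing paren
            return tokens[pos], pos + 1   # an atom

        expr, _ = read(0)
        return expr

    print(parse_sexpr("(+ 1 (* 2 3))"))  # ['+', '1', ['*', '2', '3']]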

M-Expression


GML

  • https://en.wikipedia.org/wiki/IBM_Generalized_Markup_Language - GML, 1969, is a set of macros that implement intent-based (procedural) markup tags for the IBM text formatter, SCRIPT. SCRIPT/VS is the main component of IBM's Document Composition Facility (DCF). A starter set of tags in GML is provided with the DCF product.


CBCL

  • https://en.wikipedia.org/wiki/Common_Business_Communication_Language - (CBCL) is a communications language proposed by John McCarthy that foreshadowed much of XML. The language consists of a basic framework of hierarchical markup derived from S-expressions, coupled with some general principles about use and extensibility. Although written in 1975, the proposal was not published until 1982, and to this day remains relatively obscure.

Recfile

  • GNU Recutils - a set of tools and libraries to access human-editable, plain text databases called recfiles. The data is stored as a sequence of records, each record containing an arbitrary number of named fields.
  • recfile - Recfile is the file format used by GNU Recutils. It can be seen as a "vertical" counterpart to CSV.
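
The basic shape of the format is simple enough to sketch by hand; a hypothetical two-record recfile parsed with a few lines of Python (real recfiles also support comments, multi-line and multi-valued fields, and typed record descriptors):

    lines = [
        "Name: recutils",
        "Homepage: https://www.gnu.org/software/recutils/",
        "",
        "Name: hexyl",
        "Homepage: https://github.com/sharkdp/hexyl",
    ]
    sample = "\n".join(lines)

    # records are blank-line-separated blocks of "Field: value" lines
    records = [
        dict(line.split(": ", 1) for line in block.splitlines())
        for block in sample.split("\n\n")
    ]
    print(records[1]["Name"])  # hexyl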




CSV



TSV



See also HTML/CSS#Markup



ASN.1

  • https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One - ASN.1 is an interface description language for defining data structures that can be serialized and deserialized in a standard, cross-platform way. It's broadly used in telecommunications and computer networking, and especially in cryptography. Protocol developers define data structures in ASN.1 modules, which are generally a section of a broader standards document written in the ASN.1 language. Because the language is both human-readable and machine-readable, modules can be automatically turned into libraries that process their data structures, using an ASN.1 compiler. ASN.1 is similar in purpose and use to protocol buffers and Apache Thrift, which are also interface description languages for cross-platform data serialization. Like those languages, it has a schema (in ASN.1, called a "module"), and a set of encodings, typically type-length-value encodings. However, ASN.1, defined in 1984, predates them by many years. It also includes a wider variety of basic data types, some of which are obsolete, and has more options for extensibility. A single ASN.1 message can include data from multiple modules defined in multiple standards, even standards defined years apart.

X.690

  • https://en.wikipedia.org/wiki/X.690 - an ITU-T standard specifying several ASN.1 encoding formats:
    • Basic Encoding Rules (BER)
    • Canonical Encoding Rules (CER)
    • Distinguished Encoding Rules (DER)
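
All of these are type-length-value encodings. As a small illustration (not a full implementation), DER-encoding an ASN.1 INTEGER by hand in Python:

    def der_integer(n: int) -> bytes:
        # tag 0x02 = INTEGER; short-form length only (bodies under 128 bytes)
        # the extra bit of headroom keeps a leading 0x00 byte when needed,
        # so the value is not read back as negative
        body = n.to_bytes((n.bit_length() + 8) // 8, "big")
        return bytes([0x02, len(body)]) + body

    print(der_integer(5).hex())    # 020105
    print(der_integer(128).hex())  # 02020080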


JSON

  • JSON - JavaScript Object Notation, is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.
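
A round trip in Python's standard library:

    import json

    order = {"id": 42, "items": ["tea", "milk"], "paid": False}
    text = json.dumps(order)
    print(text)                          # {"id": 42, "items": ["tea", "milk"], "paid": false}
    print(json.loads(text)["items"][0])  # tea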







  • JSON Web Token (JWT) is a compact URL-safe means of representing claims to be transferred between two parties. The claims in a JWT are encoded as a JSON object that is digitally signed using JSON Web Signature (JWS). - IETF. [26]
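
The two JSON parts of a JWT are just base64url-encoded, so they can be inspected (not verified!) with the standard library; the token below is a made-up example with a placeholder signature:

    import base64, json

    token = ("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9."
             "eyJzdWIiOiIxMjM0NTY3ODkwIn0."
             "signature-goes-here")

    def b64url_decode(part):
        # restore the padding that base64url strips
        return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))

    header, payload, _sig = token.split(".")
    print(json.loads(b64url_decode(header)))   # {'alg': 'HS256', 'typ': 'JWT'}
    print(json.loads(b64url_decode(payload)))  # {'sub': '1234567890'}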




Extensions


  • JSON Schema - a vocabulary that allows you to annotate and validate JSON documents.
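
A sketch using the third-party jsonschema Python package (an assumption here; any conforming validator works the same way):

    from jsonschema import validate, ValidationError

    schema = {
        "type": "object",
        "properties": {"name": {"type": "string"}, "port": {"type": "integer"}},
        "required": ["name"],
    }

    validate({"name": "api", "port": 8080}, schema)  # passes silently
    try:
        validate({"name": "api", "port": "8080"}, schema)
    except ValidationError as e:
        print(e.message)  # '8080' is not of type 'integer'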


  • JSON-P or "JSON with padding" is a communication technique used in JavaScript programs which run in Web browsers. It provides a method to request data from a server in a different domain, something prohibited by typical web browsers because of the same-origin policy (pre-CORS).


  • JsonML (JSON Markup Language) is an application of the JSON (JavaScript Object Notation) format. The purpose of JsonML is to provide a compact format for transporting XML-based markup as JSON which allows it to be losslessly converted back to its original form. Native XML/XHTML doesn't sit well embedded in JavaScript. When XHTML is stored in script it must be properly encoded as an opaque string. JsonML allows easy manipulation of the markup in script before completely rehydrating back to the original form.


  • JSON-LD (JavaScript Object Notation for Linking Data) is a lightweight Linked Data format that gives your data context. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on the already successful JSON format and provides a way to help JSON data interoperate at Web-scale. If you are already familiar with JSON, writing JSON-LD is very easy. These properties make JSON-LD an ideal Linked Data interchange language for JavaScript environments, Web services, and unstructured databases such as CouchDB and MongoDB.
  • https://github.com/linkedobjects/linkedobjects - The Linked Objects Notation (LION) is a simple subset of JSON-LD. It aims to avoid most of the complexity, and enables getting started quickly, using a familiar notation. LION is compatible with JSON-LD and offers a full upgrade path.


  • json-stat.org is an attempt to define a JSON schema for statistical dissemination or at least some guidelines and good practices when dealing with stats in JSON.



  • JSON API is a JSON-based read/write hypermedia-type designed to support a smart client who wishes to build a data-store of information.




  • Javascript Object Signing and Encryption - JavaScript Object Notation (JSON) is a text format for the serialization of structured data described in RFC 4627. The JSON format is often used for serializing and transmitting structured data over a network connection. With the increased usage of JSON in protocols in the IETF and elsewhere, there is now a desire to offer security services (encryption, digital signatures, and message authentication code (MAC) algorithms) that carry their data in JSON format.
  • JSON Web Key (JWK) is a JSON data structure that represents a set of public keys.



  • https://github.com/saulpw/jdot - Remove all the extraneous symbols from JSON, and it becomes a lot easier to read and write. Add comments and macros and it's almost pleasant. Some little ergonomics go a long way.

Learning

  • Getting Started with JSON - You send data in a JSON format between different parts of your system. API results are often returned in JSON format, for example. JSON is a lightweight format which makes for easy reading if you're even the least bit familiar with JavaScript.


Tools


  • Pjson - Like python -mjson.tool but with moar colors (and less conf)


  • jq is like sed for JSON data – you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.




  • https://github.com/tomnomnom/gron - Make JSON greppable! gron transforms JSON into discrete assignments to make it easier to grep for what you want and see the absolute 'path' to it. It eases the exploration of APIs that return large blobs of JSON but have terrible documentation. [31]
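
The idea is easy to mimic; a tiny Python flattener in the same spirit (real gron emits JavaScript-syntax assignments and can also reverse the transformation):

    def gronify(value, path="json"):
        if isinstance(value, dict):
            for k, v in value.items():
                yield from gronify(v, f"{path}.{k}")
        elif isinstance(value, list):
            for i, v in enumerate(value):
                yield from gronify(v, f"{path}[{i}]")
        else:
            yield f"{path} = {value!r};"

    for line in gronify({"user": {"name": "ada", "tags": ["a", "b"]}}):
        print(line)
    # json.user.name = 'ada';
    # json.user.tags[0] = 'a';
    # json.user.tags[1] = 'b';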





  • Jshon - parses, reads and creates JSON. It is designed to be as usable as possible from within the shell and replaces fragile adhoc parsers made from grep/sed/awk as well as heavyweight one-line parsers made from perl/python.




  • Data Protocols - the Open Knowledge Labs home of simple protocols and formats for working with open data. Our mission is both to make it easier to develop tools and services for working with data, and to ensure greater interoperability between new and existing tools and services.

Web


Checking

to sort

YAML



TOML

  • https://github.com/toml-lang/toml - TOML aims to be a minimal configuration file format that's easy to read due to obvious semantics. TOML is designed to map unambiguously to a hash table. TOML should be easy to parse into data structures in a wide variety of languages. [39]
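
Python ships a read-only TOML parser as tomllib since 3.11:

    import tomllib

    doc = tomllib.loads(
        'title = "example"\n'
        "\n"
        "[server]\n"
        'host = "127.0.0.1"\n'
        "ports = [8001, 8002]\n"
    )
    print(doc["server"]["ports"])  # [8001, 8002]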


HCL

  • https://github.com/hashicorp/hcl - a configuration language built by HashiCorp. The goal of HCL is to build a structured configuration language that is both human and machine friendly for use with command-line tools, but specifically targeted towards DevOps tools, servers, etc. HCL is also fully JSON compatible. That is, JSON can be used as completely valid input to a system expecting HCL. This helps make systems interoperable with other systems.



CSON



STON

Hjson

  • Hjson - a syntax extension to JSON. It's NOT a proposal to replace JSON or to incorporate it into the JSON spec itself. It's intended to be used like a user interface for humans, to read and edit before passing the JSON data to the machine. [41]

Mark

  • https://github.com/henry-luo/mark - a new unified notation for both object and markup data. The notation is a superset of what can be represented by JSON, HTML and XML, but overcomes many limitations of these popular data formats, while still having a very clean syntax and simple data model. It has clean syntax with a fully-typed data model (like JSON or even better). It is generic and extensible (like XML or even better). It has built-in mixed content support (like HTML5 or even better). It supports high-order composition (like S-expressions or even better).

XDR

  • https://en.wikipedia.org/wiki/External_Data_Representation - XDR, is a standard data serialization format, for uses such as computer network protocols. It allows data to be transferred between different kinds of computer systems. Converting from the local representation to XDR is called encoding. Converting from XDR to the local representation is called decoding. XDR is implemented as a software library of functions which is portable between different operating systems and is also independent of the transport layer. XDR uses a base unit of 4 bytes, serialized in big-endian order; smaller data types still occupy four bytes each after encoding. Variable-length types such as string and opaque are padded to a total divisible by four bytes. Floating-point numbers are represented in IEEE 754 format.
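
The 4-byte alignment rule is easy to see by hand-packing an XDR string with Python's struct module:

    import struct

    def xdr_string(data: bytes) -> bytes:
        # 4-byte big-endian length, then the bytes, zero-padded to a multiple of 4
        pad = (-len(data)) % 4
        return struct.pack(">I", len(data)) + data + b"\x00" * pad

    print(xdr_string(b"hello").hex())
    # 0000000568656c6c6f000000 -> length 5, "hello", 3 padding bytes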


DSPL

  • DSPL - stands for Dataset Publishing Language. It is a representation format for both the metadata (information about the dataset, such as its name and provider, as well as the concepts it contains and displays) and actual data of datasets. The metadata is specified in XML, whereas the data are provided in CSV format.

CUE

  • CUE - open source language, with a rich set of APIs and tooling, for defining, generating, and validating all kinds of data: configuration, APIs, database schemas, code, … you name it.


RON

  • RON - a format for distributed live data. RON’s primary mission is continuous data synchronization. A RON object may naturally have any number of replicas, which may synchronize in real-time or intermittently. JSON, protobuf, and many other formats implicitly assume serialization of separate state snapshots. RON has versioning and addressing metadata, so state and updates can be always pieced together. RON handles state and updates all the same: state is change and change is state.

HAL

  • HAL - a format you can use in your API that gives you a simple way of linking. It has two variants, one in JSON and one in XML.

Dhall

HStore


Protocol Buffers


  • Buf - Introduction

Cap'n Proto

  • Cap'n Proto - an insanely fast data interchange format and capability-based RPC system. Think JSON, except binary. Or think Protocol Buffers, except faster. In fact, in benchmarks, Cap’n Proto is INFINITY TIMES faster than Protocol Buffers. This benchmark is, of course, unfair. It is only measuring the time to encode and decode a message in memory. Cap’n Proto gets a perfect score because there is no encoding/decoding step. The Cap’n Proto encoding is appropriate both as a data interchange format and an in-memory representation, so once your structure is built, you can simply write the bytes straight out to disk!


BSON

  • BSON - short for Binary JSON, is a binary-encoded serialization of JSON-like documents. Like JSON, BSON supports the embedding of documents and arrays within other documents and arrays. BSON also contains extensions that allow representation of data types that are not part of the JSON spec. For example, BSON has a Date type and a BinData type.

MessagePack

  • MessagePack - an efficient binary serialization format. It lets you exchange data among multiple languages like JSON. But it's faster and smaller. Small integers are encoded into a single byte, and typical short strings require only one extra byte in addition to the strings themselves. [44]
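
Those size claims are visible in the wire format itself; hand-encoding two of the simplest cases per the spec (a sketch, not a real encoder — use the msgpack libraries for actual work):

    def pack_small(value):
        if isinstance(value, int) and 0 <= value <= 127:
            return bytes([value])              # positive fixint: the byte itself
        if isinstance(value, str):
            data = value.encode("utf-8")
            if len(data) < 32:
                return bytes([0xA0 | len(data)]) + data  # fixstr: 1 byte overhead
        raise NotImplementedError("only small ints and short strings here")

    print(pack_small(7).hex())     # 07
    print(pack_small("hi").hex())  # a26869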

CBOR

  • CBOR - RFC 7049 “The Concise Binary Object Representation (CBOR) is a data format whose design goals include the possibility of extremely small code size, fairly small message size, and extensibility without the need for version negotiation.”


Amazon Ion

  • Amazon Ion - a richly-typed, self-describing, hierarchical data serialization format offering interchangeable binary and text representations. The text format (a superset of JSON) is easy to read and author, supporting rapid prototyping. The binary representation is efficient to store, transmit, and skip-scan parse. The rich type system provides unambiguous semantics for long-term preservation of business data which can survive multiple generations of software evolution. Ion was built to solve the rapid development, decoupling, and efficiency challenges faced every day while engineering large-scale, service-oriented architectures. Ion has been addressing these challenges within Amazon for nearly a decade, and we believe others will benefit as well. [45]

Apache Pulsar


der-ascii

  • https://github.com/google/der-ascii - a small human-editable language to emit DER (Distinguished Encoding Rules) or BER (Basic Encoding Rules) encodings of ASN.1 structures and malformed variants of them.

MQTT

  • MQTT - a machine-to-machine (M2M)/"Internet of Things" connectivity protocol. It was designed as an extremely lightweight publish/subscribe messaging transport. It is useful for connections with remote locations where a small code footprint is required and/or network bandwidth is at a premium. For example, it has been used in sensors communicating to a broker via satellite link, over occasional dial-up connections with healthcare providers, and in a range of home automation and small device scenarios. It is also ideal for mobile applications because of its small size, low power usage, minimised data packets, and efficient distribution of information to one or many receivers.
    • https://en.wikipedia.org/wiki/MQTT - (MQ Telemetry Transport or Message Queuing Telemetry Transport) is an ISO standard (ISO/IEC PRF 20922) publish-subscribe-based messaging protocol. It works on top of the TCP/IP protocol. It is designed for connections with remote locations where a "small code footprint" is required or the network bandwidth is limited. The publish-subscribe messaging pattern requires a message broker.
    • MQTT Version 5.0 - a Client Server publish/subscribe messaging transport protocol. It is light weight, open, simple, and designed to be easy to implement. These characteristics make it ideal for use in many situations, including constrained environments such as for communication in Machine to Machine (M2M) and Internet of Things (IoT) contexts where a small code footprint is required and/or network bandwidth is at a premium.

recordio

riegeli

  • https://github.com/google/riegeli - a file format for storing a sequence of string records, typically serialized protocol buffers. It supports dense compression, fast decoding, seeking, detection and optional skipping of data corruption, filtering of proto message fields for even faster decoding, and parallel encoding.

gRPC

  • gRPC - a modern open source high performance RPC framework that can run in any environment. It can efficiently connect services in and across data centers with pluggable support for load balancing, tracing, health checking and authentication. It is also applicable in the last mile of distributed computing to connect devices, mobile applications and browsers to backend services. [48]

smf

FlatBuffers

  • FlatBuffers - an efficient cross platform serialization library for C++, C#, C, Go, Java, JavaScript, Lobster, Lua, TypeScript, PHP, Python, and Rust. It was originally created at Google for game development and other performance-critical applications. It is available as Open Source on GitHub under the Apache license, v2.

FlexBuffers

  • FlexBuffers - allows for a very compact encoding, combining automatic pooling of strings with automatic sizing of containers to their smallest possible representation (8/16/32/64 bits). Many values and offsets can be encoded in just 8 bits. While a schema-less representation is usually more bulky because of the need to be self-descriptive, FlexBuffers generates smaller binaries for many cases than regular FlatBuffers. FlexBuffers is still slower than regular FlatBuffers though, so we recommend to only use it if you need it. [49]

eno

  • eno - A modern plaintext data format and notation language with libraries, designed from the ground up for file-based content - simple, powerful and elegant [50]

Transit

  • https://github.com/cognitect/transit-format - a format and set of libraries for conveying values between applications written in different programming languages. This spec describes Transit in order to facilitate its implementation in a wide range of languages.

Scuttlebot

cereal

  • https://github.com/USCiLab/cereal - a header-only C++11 serialization library. cereal takes arbitrary data types and reversibly turns them into different representations, such as compact binary encodings, XML, or JSON. cereal was designed to be fast, light-weight, and easy to extend - it has no external dependencies and can be easily bundled with other code or used standalone.

Arrow

  • https://github.com/apache/arrow - a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication…

YAS

  • https://github.com/niXman/yas - created as a replacement for boost.serialization because of its insufficient serialization speed (benchmark 1, benchmark 2). YAS is a header-only library. YAS does not depend on third-party libraries or boost. YAS requires C++11 support. YAS binary archives are endian-independent.

Citations

See Organising#Reference.2Fcitation_management

Maths

Mining


Protocols

  • https://en.wikipedia.org/wiki/Remote_procedure_call - when a computer program causes a procedure (subroutine) to execute in a different address space (commonly on another computer on a shared network), which is coded as if it were a normal (local) procedure call, without the programmer explicitly coding the details for the remote interaction. That is, the programmer writes essentially the same code whether the subroutine is local to the executing program, or remote. This is a form of client–server interaction (caller is client, executor is server), typically implemented via a request–response message-passing system. In the object-oriented programming paradigm, RPC calls are represented by remote method invocation (RMI). The RPC model implies a level of location transparency, namely that calling procedures are largely the same whether they are local or remote, but usually they are not identical, so local calls can be distinguished from remote calls. Remote calls are usually orders of magnitude slower and less reliable than local calls, so distinguishing them is important. RPCs are a form of inter-process communication (IPC), in that different processes have different address spaces: if on the same host machine, they have distinct virtual address spaces, even though the physical address space is the same; while if they are on different hosts, the physical address space is different. Many different (often incompatible) technologies have been used to implement the concept.


  • https://en.wikipedia.org/wiki/Request–response - or request–reply, is one of the basic methods computers use to communicate with each other, in which the first computer sends a request for some data and the second responds to the request. Usually, there is a series of such interchanges until the complete message is sent; browsing a web page is an example of request–response communication. Request–response can be seen as a telephone call, in which someone is called and they answer the call. Request–response is a message exchange pattern in which a requestor sends a request message to a replier system which receives and processes the request, ultimately returning a message in response. This is a simple, but powerful messaging pattern which allows two applications to have a two-way conversation with one another over a channel. This pattern is especially common in client–server architectures. For simplicity, this pattern is typically implemented in a purely synchronous fashion, as in web service calls over HTTP, which holds a connection open and waits until the response is delivered or the timeout period expires. However, request–response may also be implemented asynchronously, with a response being returned at some unknown later time. When a synchronous system communicates with an asynchronous system, it is referred to as "sync over async" or "sync/async". This is common in enterprise application integration (EAI) implementations where slow aggregations, time-intensive functions, or human workflow must be performed before a response can be constructed and delivered.
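
A minimal request-response RPC round trip using Python's standard-library XML-RPC, just to make the "looks like a local call" point concrete:

    import threading
    import xmlrpc.client
    from xmlrpc.server import SimpleXMLRPCServer

    # the "server" side: register a procedure and serve in the background
    server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
    server.register_function(lambda a, b: a + b, "add")
    threading.Thread(target=server.serve_forever, daemon=True).start()

    # the "client" side: the remote call is written like a local one
    proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
    print(proxy.add(2, 3))  # 5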


Scraping

See also Scraping

Tools


  • https://github.com/sharkdp/binocle - a graphical tool to visualize binary data. It colorizes bytes according to different rules and renders them as pixels in a rectangular grid. This allows users to identify interesting parts in large files and to reveal image-like regions.
















  • DataLad - Providing a data portal and a versioning system for everyone, DataLad lets you have your data and control it too.


  • Kaitai Struct: declarative binary format parsing language - a declarative language used to describe various binary data structures, laid out in files or in memory: i.e. binary file formats, network stream packet formats, etc. The main idea is that a particular format is described in Kaitai Struct language (.ksy file) and then can be compiled with ksc into source files in one of the supported programming languages. These modules will include generated code for a parser that can read the described data structure from a file / stream and give access to it in a nice, easy-to-comprehend API.


  • Datasette - a tool for exploring and publishing data. It helps people take data of any shape or size and publish that as an interactive, explorable website and accompanying API. Datasette is aimed at data journalists, museum curators, archivists, local governments and anyone else who has data that they wish to share with the world. It is part of a wider ecosystem of tools dedicated to making working with structured data as productive as possible.
  • https://github.com/simonw/datasette


  • Arroyo - a distributed stream processing engine written in Rust, designed to efficiently perform stateful computations on streams of data. Unlike traditional batch processing, streaming engines can operate on both bounded and unbounded sources, emitting results as soon as they are available. In short: Arroyo lets you ask complex questions of high-volume real-time data with subsecond results. [58]


GNU poke

  • GNU poke - an interactive, extensible editor for binary data. Not limited to editing basic entities such as bits and bytes, it provides a full-fledged procedural, interactive programming language designed to describe data structures and to operate on them. [59]

to sort






Machine learning

Services


Barcode

  • https://en.wikipedia.org/wiki/Barcode - or bar code is a method of representing data in a visual, machine-readable form. Initially, barcodes represented data by varying the widths and spacings of parallel lines. These barcodes, now commonly referred to as linear or one-dimensional (1D), can be scanned by special optical scanners, called barcode readers. Later, two-dimensional (2D) variants were developed, using rectangles, dots, hexagons and other geometric patterns, called matrix codes or 2D barcodes, although they do not use bars as such. 2D barcodes can be read or deconstructed using application software on mobile devices with inbuilt cameras, such as smartphones.


QR Codes