Data



General

See also Open data, Maths, Computing Visualisation, Language, Documents


data, noun

  • facts and statistics collected together for reference or analysis: there is very little data available
    • the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
    • Philosophy: things known or assumed as facts, making the basis of reasoning or calculation.


  • https://en.wikipedia.org/wiki/Unstructured_data - information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs, as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents. In 1998, Merrill Lynch cited a rule of thumb that somewhere around 80-90% of all potentially usable business information may originate in unstructured form. This rule of thumb is not based on primary or any quantitative research, but is nonetheless accepted by some. IDC and EMC project that data will grow to 40 zettabytes by 2020, resulting in a 50-fold growth from the beginning of 2010. Computer World magazine states that unstructured information might account for more than 70%–80% of all data in organizations.



  • https://en.wikipedia.org/wiki/Data_model - or datamodel, is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities. For instance, a data model may specify that the data element representing a car be composed of a number of other elements which, in turn, represent the color and size of the car and define its owner.

The term data model is used in two distinct but closely related senses. Sometimes it refers to an abstract formalization of the objects and relationships found in a particular application domain, for example the customers, products, and orders found in a manufacturing organization. At other times it refers to a set of concepts used in defining such formalizations: for example concepts such as entities, attributes, relations, or tables. So the "data model" of a banking application may be defined using the entity-relationship "data model". This article uses the term in both senses.


Overview of data modeling context: a data model is based on data, data relationships, data semantics and data constraints. A data model provides the details of the information to be stored, and is of primary use when the final product is the generation of computer software code for an application or the preparation of a functional specification to aid a make-or-buy decision for computer software. A data model explicitly determines the structure of data. Data models are specified in a data modeling notation, which is often graphical in form. A data model can sometimes be referred to as a data structure, especially in the context of programming languages. Data models are often complemented by function models, especially in the context of enterprise models.




  • https://en.wikipedia.org/wiki/Semi-structured_data - a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as a self-describing structure. In semi-structured data, entities belonging to the same class may have different attributes even though they are grouped together, and the attributes' order is not important. Semi-structured data have become increasingly common since the advent of the Internet, where full-text documents and databases are no longer the only forms of data and different applications need a medium for exchanging information. In object-oriented databases, one often finds semi-structured data.





Articles


Learning


Management

See also Database, Visualisation, Maths#Software





Science

  • A Taxonomy of Data Science - Both within the academy and within tech startups, we’ve been hearing some similar questions lately: Where can I find a good data scientist? What do I need to learn to become a data scientist? Or more succinctly: What is data science?


  • School of Data - works to empower civil society organizations, journalists and citizens with the skills they need to use data effectively in their efforts to create more equitable and effective societies.


  • Kaggle - Service - From Big Data to Big Analytics.




Encoding

  • https://en.wikipedia.org/wiki/Encoder_(digital) - or simply an encoder in digital electronics is a one-hot to binary converter. That is, if there are 2^n input lines, and at most only one of them will ever be high, the binary code of this 'hot' line is produced on the n-bit output lines. For example, a 4-to-2 simple encoder takes 4 input bits and produces 2 output bits. A gate-level implementation of the simple encoder covers only the truth-table rows with exactly one high input; all other input combinations (i.e., inputs containing 0, 2, 3, or 4 high bits) are treated as don't-cares.


  • https://en.wikipedia.org/wiki/Binary_decoder - a combinational logic circuit that converts binary information from the n coded inputs to a maximum of 2^n unique outputs. They are used in a wide variety of applications, including data multiplexing and data demultiplexing, seven-segment displays, and memory address decoding. There are several types of binary decoders, but in all cases a decoder is an electronic circuit with multiple input and multiple output signals, which converts every unique combination of input states to a specific combination of output states. In addition to integer data inputs, some decoders also have one or more enable inputs.
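
A minimal software model of these two circuits, as a rough sketch in Python (function names are illustrative, not from any library):

  def encode_one_hot(bits):
      # 4-to-2 simple encoder: exactly one input line is high, and the
      # index of that line becomes the 2-bit output. Inputs with 0 or >1
      # high bits are "don't cares" in the truth table, so reject them.
      if sum(bits) != 1:
          raise ValueError("simple encoder expects exactly one high line")
      return bits.index(1)

  def decode_binary(n, width=4):
      # 2-to-4 decoder: n selects which one of the output lines goes high
      return tuple(1 if i == n else 0 for i in range(width))

  assert encode_one_hot((0, 0, 1, 0)) == 2
  assert decode_binary(2) == (0, 0, 1, 0)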

Numbers

Binary



  • Kaitai Struct - a declarative language used to describe various binary data structures, laid out in files or in memory: i.e. binary file formats, network stream packet formats, etc. The main idea is that a particular format is described in the Kaitai Struct language (.ksy file) and then can be compiled with ksc into source files in one of the supported programming languages. These modules will include generated code for a parser that can read the described data structure from a file / stream and give access to it in a nice, easy-to-comprehend API.

Hexadecimal


  • https://github.com/sharkdp/hexyl - a simple hex viewer for the terminal. It uses a colored output to distinguish different categories of bytes (NULL bytes, printable ASCII characters, ASCII whitespace characters, other ASCII characters and non-ASCII).


Gray code

  • https://en.wikipedia.org/wiki/Gray_code - after Frank Gray, or reflected binary code (RBC), also known just as reflected binary (RB), is an ordering of the binary numeral system such that two successive values differ in only one bit (binary digit). The reflected binary code was originally designed to prevent spurious output from electromechanical switches. Today, Gray codes are widely used to facilitate error correction in digital communications such as digital terrestrial television and some cable TV systems.
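
The standard binary-to-Gray mapping and its inverse are tiny; a sketch in Python:

  def to_gray(n):
      # adjacent integers differ in exactly one bit after this mapping
      return n ^ (n >> 1)

  def from_gray(g):
      # undo the mapping by folding the bits back down
      n = 0
      while g:
          n ^= g
          g >>= 1
      return n

  assert [to_gray(i) for i in range(4)] == [0b00, 0b01, 0b11, 0b10]
  assert all(from_gray(to_gray(i)) == i for i in range(256))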

Character encoding





  • https://en.wikipedia.org/wiki/Od_(Unix) - a program for displaying ("dumping") data in various human-readable output formats. The name is an acronym for "octal dump" since it defaults to printing in the octal data format. It can also display output in a variety of other formats, including hexadecimal, decimal, and ASCII. It is useful for visualizing data that is not in a human-readable format, like the executable code of a program.
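
For illustration, an od-style octal dump of a few bytes can be reproduced in a line of Python (the formatting only approximates od -b output):

  data = b"hi\n"
  # each byte printed as a three-digit octal number
  print(" ".join(format(b, "03o") for b in data))  # prints: 150 151 012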


Telegraph

Morse


Baudot

  • https://en.wikipedia.org/wiki/Baudot_code - a character set predating EBCDIC and ASCII. It was the predecessor to the International Telegraph Alphabet No. 2 (ITA2), the teleprinter code in use until the advent of ASCII. Each character in the alphabet is represented by a series of bits, sent over a communication channel such as a telegraph wire or a radio signal. The symbol rate measurement is known as baud, and is derived from the same name.

BCD

EBCDIC

  • https://en.wikipedia.org/wiki/EBCDIC - an eight-bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems. EBCDIC descended from the code used with punched cards and the corresponding six-bit binary-coded decimal code used with most of IBM's computer peripherals of the late 1950s and early 1960s. It is also supported on various non-IBM platforms such as Fujitsu-Siemens' BS2000/OSD, OS-IV, MSP, and MSP-EX, the SDS Sigma series, and Unisys VS/9 and MCP.

ASCII / ANSI

to move/merge with Typography

  • https://en.wikipedia.org/wiki/ASCII - abbreviated from American Standard Code for Information Interchange, is a character-encoding scheme. Originally based on the English alphabet, it encodes 128 specified characters into 7-bit binary integers. The characters encoded are the digits 0 to 9, lowercase letters a to z, uppercase letters A to Z, basic punctuation symbols, control codes that originated with Teletype machines, and a space. For example, lowercase j is encoded as binary 1101010 (decimal 106).
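
The 'j' example can be checked directly in Python:

  assert ord("j") == 106
  assert format(ord("j"), "07b") == "1101010"
  assert "j".encode("ascii") == b"\x6a"  # 0x6a == decimal 106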


  • https://en.wikipedia.org/wiki/Extended_ASCII - eight-bit or larger character encodings that include the standard seven-bit ASCII characters as well as others. The use of the term is sometimes criticized, because it can be mistakenly interpreted that the ASCII standard has been updated to include more than 128 characters or that the term unambiguously identifies a single encoding, both of which are untrue.



  • https://en.wikipedia.org/wiki/PETSCII - also known as CBM ASCII, is the character set used in Commodore Business Machines (CBM)'s 8-bit home computers, starting with the PET from 1977 and including the C16, C64, C116, C128, CBM-II, Plus/4, and VIC-20.







Art


  • jp2a - a small utility that converts JPG images to ASCII. It's written in C and released under the GPL.



Fonts

CJKV

Unicode

  • https://en.wikipedia.org/wiki/Unicode - a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard is maintained by the Unicode Consortium, and as of June 2018 the most recent version, Unicode 11.0, contains a repertoire of 137,439 characters covering 146 modern and historic scripts, as well as multiple symbol sets and emoji. The character repertoire of the Unicode Standard is synchronized with ISO/IEC 10646, and both are code-for-code identical. The Unicode Standard consists of a set of code charts for visual reference, an encoding method and set of standard character encodings, a set of reference data files, and a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering, and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic and Hebrew, and left-to-right scripts).

Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including modern operating systems, XML, Java (and other programming languages), and the .NET Framework. Unicode can be implemented by different character encodings. The Unicode standard defines UTF-8, UTF-16, and UTF-32, and several other encodings are in use. The most commonly used encodings are UTF-8, UTF-16 and UCS-2, a precursor of UTF-16.


  • Unicode Consortium - enables people around the world to use computers in any language. Our freely-available specifications and data form the foundation for software internationalization in all major operating systems, search engines, applications, and the World Wide Web. An essential part of our mission is to educate and engage academic and scientific communities, and the general public.


  • ICU - International Components for Unicode - a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software. ICU is released under a nonrestrictive open source license that is suitable for use with both commercial software and with other open source or free software.




  • UAX #15: Unicode Normalization Forms - This annex describes normalization forms for Unicode text. When implementations keep strings in a normalized form, they can be assured that equivalent strings have a unique binary representation. This annex also provides examples, additional specifications regarding normalization of Unicode text, and information about conformance testing for Unicode normalization forms.
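
A quick illustration of why normalization matters, using Python's standard unicodedata module: the precomposed and decomposed spellings of 'é' compare unequal until normalized to a common form.

  import unicodedata

  precomposed = "\u00e9"   # 'é' as a single code point
  decomposed = "e\u0301"   # 'e' plus a combining acute accent

  assert precomposed != decomposed  # distinct code point sequences
  assert unicodedata.normalize("NFC", decomposed) == precomposed
  assert unicodedata.normalize("NFD", precomposed) == decomposed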




mirroring char in brackets: (‮‮test ( 








  • https://en.wikipedia.org/wiki/Mojibake - the garbled text that is the result of text being decoded using an unintended character encoding. The result is a systematic replacement of symbols with completely unrelated ones, often from a different writing system.
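
The classic case reproduced in Python: UTF-8 bytes decoded as Latin-1.

  text = "café"
  garbled = text.encode("utf-8").decode("latin-1")
  assert garbled == "cafÃ©"  # mojibake
  # reversing the mistaken decode recovers the original text
  assert garbled.encode("latin-1").decode("utf-8") == text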

Files

Magic number

FourCC

  • https://en.wikipedia.org/wiki/FourCC - (literally, four-character code) is a sequence of four bytes used to uniquely identify data formats. The concept originated in the OSType scheme used in the Macintosh system software and was adopted for the Amiga/Electronic Arts Interchange File Format and derivatives. The idea was later reused to identify compressed data types in QuickTime and DirectShow.
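
Since a FourCC is just four bytes, it can be read either as text or as a 32-bit tag; a small Python illustration (byte order is format-specific):

  code = b"WAVE"  # the format tag inside a RIFF file header
  assert len(code) == 4
  as_int = int.from_bytes(code, "big")   # same code viewed as an integer
  assert as_int.to_bytes(4, "big") == code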


Serialization and markup

See also JavaScript#JSON, Documents, HTML/CSS#Markup, Semantic web, Database




  • https://en.wikipedia.org/wiki/Marshalling_(computer_science) - or marshaling is the process of transforming the memory representation of an object to a data format suitable for storage or transmission, and it is typically used when data must be moved between different parts of a computer program or from one program to another. Marshalling is similar to serialization and is used to communicate to remote objects with an object, in this case a serialized object. It simplifies complex communication, using composite objects in order to communicate instead of primitives. The inverse of marshalling is called unmarshalling (or demarshalling, similar to deserialization).
  • https://en.wikipedia.org/wiki/Unmarshalling - Comparison with deserialization: An object that is serialized is in the form of a byte stream and it can eventually be converted back to a copy of the original object. Deserialization is the process of converting the byte stream data back to its original object type. An object that is marshalled, however, records the state of the original object and it contains the codebase (codebase here refers to a list of URLs where the object code can be loaded from, and not source code). Hence, in order to convert the object state and codebase(s), unmarshalling must be done.
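
A serialization round trip in Python's pickle shows the byte-stream half of this distinction; note that pickle records no codebase, so this is serialization rather than full marshalling:

  import pickle

  obj = {"x": 1, "y": [2, 3]}
  data = pickle.dumps(obj)           # object -> byte stream
  assert pickle.loads(data) == obj   # byte stream -> equivalent copy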


  • https://en.wikipedia.org/wiki/Delimiter - a sequence of one or more characters used to specify the boundary between separate, independent regions in plain text or other data streams. An example of a delimiter is the comma character, which acts as a field delimiter in a sequence of comma-separated values. Another example of a delimiter is the time gap used to separate letters and words in the transmission of Morse code. Delimiters represent one of various means to specify boundaries in a data stream.
  • https://en.wikipedia.org/wiki/Delimiter-separated_values - store two-dimensional arrays of data by separating the values in each row with specific delimiter characters. Most database and spreadsheet programs are able to read or save data in a delimited format. A delimited text file is a text file used to store data, in which each line represents a single record (a book, company, or other entity), and each line has fields separated by the delimiter. Compared to the kind of flat file that uses spaces to force every field to the same width, a delimited file has the advantage of allowing field values of any length.
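
A short Python example with the standard csv module, which handles both comma- and tab-delimited variants:

  import csv, io

  data = "id,title\n1,Moby-Dick\n2,Dune\n"
  rows = list(csv.reader(io.StringIO(data)))
  assert rows == [["id", "title"], ["1", "Moby-Dick"], ["2", "Dune"]]

  # the same data as TSV: just a different delimiter character
  tsv = data.replace(",", "\t")
  assert list(csv.reader(io.StringIO(tsv), delimiter="\t")) == rows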



  • VisiData - a free, open-source tool that lets you quickly open, explore, summarize, and analyze datasets in your computer’s terminal. VisiData works with CSV files, Excel spreadsheets, SQL databases, and many other data sources.

S-expression

  • https://en.wikipedia.org/wiki/S-expression - sexprs or sexps (for "symbolic expression") are a notation for nested list (tree-structured) data, invented for and popularized by the programming language Lisp, which uses them for source code as well as data. In the usual parenthesized syntax of Lisp, an s-expression is classically defined as "an atom", or "an expression of the form (x . y) where x and y are s-expressions." The second, recursive part of the definition represents an ordered pair, which means that s-expressions are binary trees.
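
The recursive definition translates almost directly into a toy reader; a sketch in Python (lists stand in for the underlying pairs):

  def tokenize(src):
      # pad parens with spaces so split() yields clean tokens
      return src.replace("(", " ( ").replace(")", " ) ").split()

  def parse_sexpr(tokens):
      tok = tokens.pop(0)
      if tok == "(":
          lst = []
          while tokens[0] != ")":
              lst.append(parse_sexpr(tokens))
          tokens.pop(0)  # drop the closing ")"
          return lst
      return tok  # an atom

  assert parse_sexpr(tokenize("(a (b c) d)")) == ["a", ["b", "c"], "d"]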

M-Expression


GML

  • https://en.wikipedia.org/wiki/IBM_Generalized_Markup_Language - GML (1969) is a set of macros that implement intent-based (procedural) markup tags for the IBM text formatter, SCRIPT. SCRIPT/VS is the main component of IBM's Document Composition Facility (DCF). A starter set of GML tags is provided with the DCF product.


CBCL

  • https://en.wikipedia.org/wiki/Common_Business_Communication_Language - (CBCL) is a communications language proposed by John McCarthy that foreshadowed much of XML. The language consists of a basic framework of hierarchical markup derived from S-expressions, coupled with some general principles about use and extensibility. Although written in 1975, the proposal was not published until 1982, and to this day remains relatively obscure.

Recfile

  • GNU Recutils - a set of tools and libraries to access human-editable, plain text databases called recfiles. The data is stored as a sequence of records, each record containing an arbitrary number of named fields.
  • recfile - Recfile is the file format used by GNU Recutils. It can be seen as a "vertical" counterpart to CSV.
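
A toy reader for the basic layout, assuming only blank-line-separated records of single-line "Name: value" fields (real recfiles also allow comments, multi-line values, and more):

  def parse_recfile(text):
      records = []
      for block in text.strip().split("\n\n"):   # blank line ends a record
          rec = {}
          for line in block.splitlines():
              name, _, value = line.partition(": ")
              rec[name] = value
          records.append(rec)
      return records

  sample = "Name: recutils\nLicense: GPLv3\n\nName: wget\nLicense: GPLv3"
  assert parse_recfile(sample)[1]["Name"] == "wget"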




CSV



TSV






ASN.1

  • https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One - ASN.1 is an interface description language for defining data structures that can be serialized and deserialized in a standard, cross-platform way. It's broadly used in telecommunications and computer networking, and especially in cryptography. Protocol developers define data structures in ASN.1 modules, which are generally a section of a broader standards document written in the ASN.1 language. Because the language is both human-readable and machine-readable, modules can be automatically turned into libraries that process their data structures, using an ASN.1 compiler. ASN.1 is similar in purpose and use to protocol buffers and Apache Thrift, which are also interface description languages for cross-platform data serialization. Like those languages, it has a schema (in ASN.1, called a "module"), and a set of encodings, typically type-length-value encodings. However, ASN.1, defined in 1984, predates them by many years. It also includes a wider variety of basic data types, some of which are obsolete, and has more options for extensibility. A single ASN.1 message can include data from multiple modules defined in multiple standards, even standards defined years apart.

X.690

  • https://en.wikipedia.org/wiki/X.690 - an ITU-T standard specifying several ASN.1 encoding formats:
    • Basic Encoding Rules (BER)
    • Canonical Encoding Rules (CER)
    • Distinguished Encoding Rules (DER)
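
As a sketch of the type-length-value shape these rules share, here is the DER encoding of an ASN.1 INTEGER in Python (short-form lengths only, for brevity):

  def der_integer(n):
      # tag 0x02 (INTEGER), then length, then big-endian content octets;
      # bit_length // 8 + 1 bytes keeps the sign bit clear for n >= 0
      assert n >= 0
      body = n.to_bytes(n.bit_length() // 8 + 1, "big")
      assert len(body) < 0x80  # short-form length fits in one byte
      return bytes([0x02, len(body)]) + body

  assert der_integer(65537) == b"\x02\x03\x01\x00\x01"  # common RSA exponent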


JSON

  • JSON - JavaScript Object Notation, is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.
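
The round trip in Python's standard json module:

  import json

  record = {"name": "Ada", "tags": ["math", "computing"], "born": 1815}
  text = json.dumps(record)          # serialize to a JSON string
  assert json.loads(text) == record  # parse back; round trip intact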







  • JSON Web Token (JWT) is a compact URL-safe means of representing claims to be transferred between two parties. The claims in a JWT are encoded as a JSON object that is digitally signed using JSON Web Signature (JWS). - IETF
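
Structurally a JWT is three base64url segments, header.payload.signature. A sketch of peeking at the (unverified!) claims in Python; the token here is constructed on the spot, and real code must verify the JWS signature before trusting anything:

  import base64, json

  def b64url(data):
      return base64.urlsafe_b64encode(data).rstrip(b"=")

  def b64url_decode(seg):
      return base64.urlsafe_b64decode(seg + "=" * (-len(seg) % 4))

  token = b".".join([
      b64url(json.dumps({"alg": "none"}).encode()),   # header
      b64url(json.dumps({"sub": "alice"}).encode()),  # payload (claims)
      b"",                                            # empty signature
  ])

  claims = json.loads(b64url_decode(token.split(b".")[1].decode()))
  assert claims == {"sub": "alice"}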




Variations



  • JSON-P, or "JSON with padding", is a communication technique used in JavaScript programs which run in Web browsers. It provides a method to request data from a server in a different domain, something otherwise prohibited by typical web browsers' same-origin policy; the technique predates CORS.


  • JsonML (JSON Markup Language) is an application of the JSON (JavaScript Object Notation) format. The purpose of JsonML is to provide a compact format for transporting XML-based markup as JSON which allows it to be losslessly converted back to its original form. Native XML/XHTML doesn't sit well embedded in JavaScript. When XHTML is stored in script it must be properly encoded as an opaque string. JsonML allows easy manipulation of the markup in script before completely rehydrating back to the original form.


  • JSON-LD (JavaScript Object Notation for Linking Data) is a lightweight Linked Data format that gives your data context. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on the already successful JSON format and provides a way to help JSON data interoperate at Web-scale. If you are already familiar with JSON, writing JSON-LD is very easy. These properties make JSON-LD an ideal Linked Data interchange language for JavaScript environments, Web services, and unstructured databases such as CouchDB and MongoDB.


  • json-stat.org is an attempt to define a JSON schema for statistical dissemination or at least some guidelines and good practices when dealing with stats in JSON.



  • JSON API is a JSON-based read/write hypermedia type designed to support a smart client that wishes to build a data store of information.




  • Javascript Object Signing and Encryption - JavaScript Object Notation (JSON) is a text format for the serialization of structured data described in RFC 4627. The JSON format is often used for serializing and transmitting structured data over a network connection. With the increased usage of JSON in protocols in the IETF and elsewhere, there is now a desire to offer security services (encryption, digital signatures, and message authentication code (MAC) algorithms) that carry their data in JSON format.
  • JSON Web Key (JWK) is a JSON data structure that represents a set of public keys.



Learning

  • Getting Started with JSON - You send data in a JSON format between different parts of your system. API results are often returned in JSON format, for example. JSON is a lightweight format which makes for easy reading if you're even the least bit familiar with JavaScript.


Tools


  • Pjson - Like python -mjson.tool but with moar colors (and less conf)


  • jq is like sed for JSON data – you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.



  • https://github.com/tomnomnom/gron - Make JSON greppable! gron transforms JSON into discrete assignments to make it easier to grep for what you want and see the absolute 'path' to it. It eases the exploration of APIs that return large blobs of JSON but have terrible documentation.
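
The core trick is easy to approximate in a few lines of Python (the output format loosely follows gron's, not byte-for-byte):

  def flatten(value, path="json"):
      # yield one "path = value;" assignment per JSON leaf
      if isinstance(value, dict):
          for k, v in value.items():
              yield from flatten(v, f"{path}.{k}")
      elif isinstance(value, list):
          for i, v in enumerate(value):
              yield from flatten(v, f"{path}[{i}]")
      else:
          yield f"{path} = {value!r};"

  doc = {"user": {"name": "ada", "roles": ["admin", "dev"]}}
  print("\n".join(flatten(doc)))
  # json.user.name = 'ada';
  # json.user.roles[0] = 'admin';
  # json.user.roles[1] = 'dev';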




  • Jansson - a C library for encoding, decoding and manipulating JSON data.
  • Jshon - parses, reads and creates JSON. It is designed to be as usable as possible from within the shell and replaces fragile ad hoc parsers made from grep/sed/awk as well as heavyweight one-line parsers made from perl/python.




  • Data Protocols - the Open Knowledge Labs home of simple protocols and formats for working with open data. Our mission is both to make it easier to develop tools and services for working with data, and to ensure greater interoperability between new and existing tools and services.


Web


Checking

to sort

YAML


  • StrictYAML - a type-safe YAML parser that parses a restricted subset of the YAML specification.




TOML

  • https://github.com/toml-lang/toml - TOML aims to be a minimal configuration file format that's easy to read due to obvious semantics. TOML is designed to map unambiguously to a hash table. TOML should be easy to parse into data structures in a wide variety of languages.
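
Python 3.11+ ships a TOML parser in the standard library, so a document like this parses with no third-party dependency:

  import tomllib

  doc = tomllib.loads("""
  title = "example"

  [server]
  host = "127.0.0.1"
  ports = [8001, 8002]
  """)
  assert doc["server"]["ports"] == [8001, 8002]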


HCL

  • https://github.com/hashicorp/hcl - a configuration language built by HashiCorp. The goal of HCL is to build a structured configuration language that is both human and machine friendly for use with command-line tools, but specifically targeted towards DevOps tools, servers, etc. HCL is also fully JSON compatible. That is, JSON can be used as completely valid input to a system expecting HCL. This helps make systems interoperable with other systems.



CSON



STON

Hjson

  • Hjson - a syntax extension to JSON. It's NOT a proposal to replace JSON or to incorporate it into the JSON spec itself. It's intended to be used like a user interface for humans, to read and edit before passing the JSON data to the machine.


XDR

  • https://en.wikipedia.org/wiki/External_Data_Representation - XDR, is a standard data serialization format, for uses such as computer network protocols. It allows data to be transferred between different kinds of computer systems. Converting from the local representation to XDR is called encoding. Converting from XDR to the local representation is called decoding. XDR is implemented as a software library of functions which is portable between different operating systems and is also independent of the transport layer. XDR uses a base unit of 4 bytes, serialized in big-endian order; smaller data types still occupy four bytes each after encoding. Variable-length types such as string and opaque are padded to a total divisible by four bytes. Floating-point numbers are represented in IEEE 754 format.
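
A sketch of the two rules just described, using Python's struct module:

  import struct

  def xdr_string(s):
      # 4-byte big-endian length, then the bytes, padded out to a
      # multiple of four (XDR's base unit)
      data = s.encode()
      return struct.pack(">I", len(data)) + data + b"\x00" * (-len(data) % 4)

  assert struct.pack(">i", 1) == b"\x00\x00\x00\x01"  # ints occupy 4 bytes
  assert xdr_string("hello") == b"\x00\x00\x00\x05hello\x00\x00\x00"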


DSPL

  • DSPL - stands for Dataset Publishing Language. It is a representation format for both the metadata (information about the dataset, such as its name and provider, as well as the concepts it contains and displays) and actual data of datasets. The metadata is specified in XML, whereas the data are provided in CSV format.

CUE

  • CUE - an open source language, with a rich set of APIs and tooling, for defining, generating, and validating all kinds of data: configuration, APIs, database schemas, code, … you name it.


HAL

  • HAL - a format you can use in your API that gives you a simple way of linking. It has two variants, one in JSON and one in XML.

Dhall

HStore


Protocol Buffers


  • Buf - Introduction

Cap'n Proto

  • Cap'n Proto - an insanely fast data interchange format and capability-based RPC system. Think JSON, except binary. Or think Protocol Buffers, except faster. In fact, in benchmarks, Cap’n Proto is INFINITY TIMES faster than Protocol Buffers. This benchmark is, of course, unfair. It is only measuring the time to encode and decode a message in memory. Cap’n Proto gets a perfect score because there is no encoding/decoding step. The Cap’n Proto encoding is appropriate both as a data interchange format and an in-memory representation, so once your structure is built, you can simply write the bytes straight out to disk!


BSON

  • BSON - short for Binary JSON, is a binary-encoded serialization of JSON-like documents. Like JSON, BSON supports the embedding of documents and arrays within other documents and arrays. BSON also contains extensions that allow representation of data types that are not part of the JSON spec. For example, BSON has a Date type and a BinData type.

MessagePack

  • MessagePack - an efficient binary serialization format. It lets you exchange data among multiple languages like JSON. But it's faster and smaller. Small integers are encoded into a single byte, and typical short strings require only one extra byte in addition to the strings themselves.
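
The size claims are easy to verify by hand-encoding two values per the MessagePack spec (the positive fixint and fixstr families):

  def pack_small_int(n):
      # "positive fixint": values 0..127 are themselves the encoded byte
      assert 0 <= n <= 0x7f
      return bytes([n])

  def pack_short_str(s):
      # "fixstr": one byte 0xa0|length for strings shorter than 32 bytes
      data = s.encode("utf-8")
      assert len(data) < 32
      return bytes([0xa0 | len(data)]) + data

  assert pack_small_int(42) == b"\x2a"      # one byte total
  assert pack_short_str("hi") == b"\xa2hi"  # one byte of overhead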


CBOR

  • CBOR - RFC 7049 “The Concise Binary Object Representation (CBOR) is a data format whose design goals include the possibility of extremely small code size, fairly small message size, and extensibility without the need for version negotiation.”


Amazon Ion

  • Amazon Ion - a richly-typed, self-describing, hierarchical data serialization format offering interchangeable binary and text representations. The text format (a superset of JSON) is easy to read and author, supporting rapid prototyping. The binary representation is efficient to store, transmit, and skip-scan parse. The rich type system provides unambiguous semantics for long-term preservation of business data which can survive multiple generations of software evolution. Ion was built to solve the rapid development, decoupling, and efficiency challenges faced every day while engineering large-scale, service-oriented architectures. Ion has been addressing these challenges within Amazon for nearly a decade, and we believe others will benefit as well.

Apache Pulsar


der-ascii

  • https://github.com/google/der-ascii - a small human-editable language to emit DER (Distinguished Encoding Rules) or BER (Basic Encoding Rules) encodings of ASN.1 structures and malformed variants of them.

MQTT

  • MQTT - a machine-to-machine (M2M)/"Internet of Things" connectivity protocol. It was designed as an extremely lightweight publish/subscribe messaging transport. It is useful for connections with remote locations where a small code footprint is required and/or network bandwidth is at a premium. For example, it has been used in sensors communicating to a broker via satellite link, over occasional dial-up connections with healthcare providers, and in a range of home automation and small device scenarios. It is also ideal for mobile applications because of its small size, low power usage, minimised data packets, and efficient distribution of information to one or many receivers.
    • https://en.wikipedia.org/wiki/MQTT - (MQ Telemetry Transport or Message Queuing Telemetry Transport) is an ISO standard (ISO/IEC PRF 20922) publish-subscribe-based messaging protocol. It works on top of the TCP/IP protocol. It is designed for connections with remote locations where a "small code footprint" is required or the network bandwidth is limited. The publish-subscribe messaging pattern requires a message broker.
    • MQTT Version 5.0 - a Client Server publish/subscribe messaging transport protocol. It is light weight, open, simple, and designed to be easy to implement. These characteristics make it ideal for use in many situations, including constrained environments such as for communication in Machine to Machine (M2M) and Internet of Things (IoT) contexts where a small code footprint is required and/or network bandwidth is at a premium.

recordio

riegeli

  • https://github.com/google/riegeli - a file format for storing a sequence of string records, typically serialized protocol buffers. It supports dense compression, fast decoding, seeking, detection and optional skipping of data corruption, filtering of proto message fields for even faster decoding, and parallel encoding.

gRPC

  • gRPC - a modern open source high performance RPC framework that can run in any environment. It can efficiently connect services in and across data centers with pluggable support for load balancing, tracing, health checking and authentication. It is also applicable in the last mile of distributed computing to connect devices, mobile applications and browsers to backend services.

smf


FlatBuffers

  • FlatBuffers - an efficient cross-platform serialization library for C++, C#, C, Go, Java, JavaScript, Lobster, Lua, TypeScript, PHP, Python, and Rust. It was originally created at Google for game development and other performance-critical applications. It is available as Open Source on GitHub under the Apache license, v2.


eno

  • eno - a modern plaintext data format and notation language with libraries, designed from the ground up for file-based content; simple, powerful and elegant.

Transit

  • https://github.com/cognitect/transit-format - a format and set of libraries for conveying values between applications written in different programming languages. This spec describes Transit in order to facilitate its implementation in a wide range of languages.

Scuttlebot

cereal

  • https://github.com/USCiLab/cereal - a header-only C++11 serialization library. cereal takes arbitrary data types and reversibly turns them into different representations, such as compact binary encodings, XML, or JSON. cereal was designed to be fast, light-weight, and easy to extend - it has no external dependencies and can be easily bundled with other code or used standalone.

Citations

See also Organising#Reference management

BibTeX

  • BibTeX - a tool and a file format which are used to describe and process lists of references, mostly in conjunction with LaTeX documents.
  • https://en.wikipedia.org/wiki/BibTeX - reference management software for formatting lists of references. The BibTeX tool is typically used together with the LaTeX document preparation system. Within the typesetting system, its name is styled as BibTeX. The name is a portmanteau of the word bibliography and the name of the TeX typesetting software. The purpose of BibTeX is to make it easy to cite sources in a consistent manner, by separating bibliographic information from the presentation of this information, similarly to the separation of content and presentation/style supported by LaTeX itself.

RIS

  • https://en.wikipedia.org/wiki/RIS_(file_format) - a standardized tag format developed by Research Information Systems, Incorporated (the format name refers to the company) to enable citation programs to exchange data. It is supported by a number of reference managers. Many digital libraries, like IEEE Xplore, Scopus, the ACM Portal, Scopemed, ScienceDirect, SpringerLink, Rayyan QCRI, Ejmanager and online library catalogs can export citations in this format. Major reference/citation manager applications, like Zotero, Citavi, Mendeley, and EndNote can export and import citations in this format.


CiteProc

  • https://en.wikipedia.org/wiki/CiteProc - the generic name for programs that produce formatted bibliographies and citations based on the metadata of the cited objects and the formatting instructions provided by Citation Style Language (CSL) styles. The first CiteProc implementation used XSLT 2.0, but implementations have been written for other programming languages, including JavaScript, Java, Haskell, PHP, Python, and Ruby. CiteProc, CSL, and Cite Schema make up the Citation Style Language project, a Creative Commons Attribution Share-Alike licensed effort "to provide a common framework for formatting bibliographies and citations across markup languages and document standards. In an ideal world, one could use the same CSL files to format DocBook, TEI, OpenOffice, WordML ... or even LaTeX documents." Different implementations of CiteProc are able to use different bibliographic databases; many can use MODS XML.




Maths

Mining


Scraping

See also HTTP#Scraping, Network#Saving

  • Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
  • kimono - Turn websites into structured APIs from your browser in seconds




Tools






  • DataLad - Providing a data portal and a versioning system for everyone, DataLad lets you have your data and control it too.


  • Kaitai Struct - a declarative binary format parsing language; see the description under Binary above.


  • Datasette - a tool for exploring and publishing data. It helps people take data of any shape or size and publish that as an interactive, explorable website and accompanying API. Datasette is aimed at data journalists, museum curators, archivists, local governments and anyone else who has data that they wish to share with the world. It is part of a wider ecosystem of tools dedicated to making working with structured data as productive as possible.
  • https://github.com/simonw/datasette

Services