Data



General

See also Database, Open data, Semantic, Documents, Organising, Maths, Computing, Visualisation, Language


  • https://en.wikipedia.org/wiki/Data_(computer_science) - any sequence of one or more symbols; datum is a single symbol of data. Data requires interpretation to become information. Digital data is data that is represented using the binary number system of ones (1) and zeros (0), instead of analog representation. In modern (post-1960) computer systems, all data is digital. Data exists in three states: data at rest, data in transit and data in use. Data within a computer, in most cases, moves as parallel data. Data moving to or from a computer, in most cases, moves as serial data. Data sourced from an analog device, such as a temperature sensor, may be converted to digital using an analog-to-digital converter. Data representing quantities, characters, or symbols on which operations are performed by a computer are stored and recorded on magnetic, optical, electronic, or mechanical recording media, and transmitted in the form of digital electrical or optical signals. Data pass in and out of computers via peripheral devices.

Physical computer memory elements consist of an address and a byte/word of data storage. Digital data are often stored in relational databases, like tables or SQL databases, and can generally be represented as abstract key/value pairs. Data can be organized in many different types of data structures, including arrays, graphs, and objects. Data structures can store data of many different types, including numbers, strings and even other data structures.


  • https://en.wikipedia.org/wiki/Data_literacy - the ability to read, understand, create, and communicate data as information. Much like literacy as a general concept, data literacy focuses on the competencies involved in working with data. It is, however, not similar to the ability to read text since it requires certain skills involving reading and understanding data.



  • https://en.wikipedia.org/wiki/Raw_data - also known as primary data, are data (e.g., numbers, instrument readings, figures) collected from a source. In the context of examinations, the raw data might be described as a raw score (after test scores).

If a scientist sets up a computerized thermometer which records the temperature of a chemical mixture in a test tube every minute, the list of temperature readings for every minute, as printed out on a spreadsheet or viewed on a computer screen are "raw data". Raw data have not been subjected to processing, "cleaning" by researchers to remove outliers, obvious instrument reading errors or data entry errors, or any analysis (e.g., determining central tendency aspects such as the average or median result). As well, raw data have not been subject to any other manipulation by a software program or a human researcher, analyst or technician. They are also referred to as primary data. Raw data is a relative term (see data), because even once raw data have been "cleaned" and processed by one team of researchers, another team may consider these processed data to be "raw data" for another stage of research. Raw data can be inputted to a computer program or used in manual procedures such as analyzing statistics from a survey. The term "raw data" can refer to the binary data on electronic storage devices, such as hard disk drives (also referred to as "low-level data").


  • https://en.wikipedia.org/wiki/Secondary_data - refers to data that is collected by someone other than the primary user. Common sources of secondary data for social science include censuses, information collected by government departments, organizational records and data that was originally collected for other research purposes. Primary data, by contrast, are collected by the investigator conducting the research.

Secondary data analysis can save time that would otherwise be spent collecting data and, particularly in the case of quantitative data, can provide larger and higher-quality databases that would be unfeasible for any individual researcher to collect on their own. In addition, analysts of social and economic change consider secondary data essential, since it is impossible to conduct a new survey that can adequately capture past change and/or developments. However, secondary data analysis can be less useful in marketing research, as data may be outdated or inaccurate.



  • https://en.wikipedia.org/wiki/Data_system - a term used to refer to an organized collection of symbols and processes that may be used to operate on such symbols. Any organised collection of symbols and symbol-manipulating operations can be considered a data system. Hence, human-speech analysed at the level of phonemes can be considered a data system as can the Incan artefact of the khipu and an image stored as pixels. A data system is defined in terms of some data model and bears a resemblance to the idea of a physical symbol system. Symbols within some data systems may be persistent or not. Hence, the sounds of human speech are non-persistent symbols because they decay rapidly in air. In contrast, pixels stored on some peripheral storage device are persistent symbols.


  • https://en.wikipedia.org/wiki/Data_acquisition - the process of sampling signals that measure real world physical conditions and converting the resulting samples into digital numeric values that can be manipulated by a computer. Data acquisition systems, abbreviated by the initialisms DAS or DAQ, typically convert analog waveforms into digital values for processing. The components of data acquisition systems include: Sensors, to convert physical parameters to electrical signals. Signal conditioning circuitry, to convert sensor signals into a form that can be converted to digital values. Analog-to-digital converters, to convert conditioned sensor signals to digital values.
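
To make the final analog-to-digital step concrete, here is a minimal sketch (not from the article) that quantizes a sensor voltage into an n-bit code; the reference voltage and bit depth are assumed, illustrative values.

  # Minimal sketch of the quantization step in data acquisition.
  # Assumes a unipolar ADC with a 0..v_ref input range (illustrative values).

  def adc_code(voltage: float, v_ref: float = 5.0, bits: int = 12) -> int:
      """Convert an analog voltage into an n-bit digital code."""
      levels = 2 ** bits - 1                    # highest representable code
      clamped = min(max(voltage, 0.0), v_ref)   # keep the input inside the ADC range
      return round(clamped / v_ref * levels)

  for v in (0.0, 1.25, 2.5, 5.0):
      print(f"{v:.2f} V -> code {adc_code(v)}")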


  • https://en.wikipedia.org/wiki/Automatic_identification_and_data_capture - refers to the methods of automatically identifying objects, collecting data about them, and entering them directly into computer systems, without human involvement. Technologies typically considered as part of AIDC include QR codes, bar codes, radio frequency identification (RFID), biometrics (like iris and facial recognition system), magnetic stripes, optical character recognition (OCR), smart cards, and voice recognition. AIDC is also commonly referred to as "Automatic Identification", "Auto-ID" and "Automatic Data Capture". AIDC is the process or means of obtaining external data, particularly through the analysis of images, sounds, or videos. To capture data, a transducer is employed which converts the actual image or a sound into a digital file. The file is then stored and at a later time, it can be analyzed by a computer, or compared with other files in a database to verify identity or to provide authorization to enter a secured system. Capturing data can be done in various ways; the best method depends on application.


  • https://en.wikipedia.org/wiki/Unstructured_data - information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents. In 1998, Merrill Lynch cited a rule of thumb that somewhere around 80-90% of all potentially usable business information may originate in unstructured form. This rule of thumb is not based on primary or any quantitative research, but nonetheless is accepted by some. IDC and EMC project that data will grow to 40 zettabytes by 2020, resulting in a 50-fold growth from the beginning of 2010. The Computer World magazine states that unstructured information might account for more than 70%–80% of all data in organizations.


  • https://en.wikipedia.org/wiki/Machine-readable_medium_and_data - or computer-readable medium, is a medium capable of storing data in a format easily readable by a digital computer or a sensor. It contrasts with human-readable medium and data. The result is called machine-readable data or computer-readable data, and the data itself can be described as having machine-readability.




Articles




  • Your configs suck? Try a real programming language. - In this post, I'll try to explain why I find most config formats frustrating to use and suggest that using a real programming language (i.e. general purpose one, like Python) is often a feasible and more pleasant alternative for writing configs. [1]
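
A minimal sketch of the article's suggestion: a config written as an ordinary Python module can derive values, reuse them and carry comments, which static formats make awkward. The module name and settings below are invented for illustration.

  # settings.py -- hypothetical configuration expressed as plain Python.
  from pathlib import Path

  BASE_DIR = Path.home() / "myapp"     # derived once...
  CACHE_DIR = BASE_DIR / "cache"       # ...and reused here
  TIMEOUT_SECONDS = 2 * 60             # arithmetic instead of a magic "120"
  WORKERS = 4

  # The application simply imports what it needs:
  #   from settings import CACHE_DIR, TIMEOUT_SECONDS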

Learning

See Learning



  • Data Carpentry - develops and teaches workshops on the fundamental data skills needed to conduct research. Our mission is to provide researchers high-quality, domain-specific training covering the full lifecycle of data-driven research.


Modelling

See also Semantic, Database, Maths#Types


  • https://en.wikipedia.org/wiki/Data_model - or datamodel, is an abstract model that organizes elements of data and standardizes how they relate to one another and to properties of the real world entities. For instance, a data model may specify that the data element representing a car be composed of a number of other elements which, in turn, represent the color and size of the car and define its owner. The term data model is used in two distinct but closely related senses. Sometimes it refers to an abstract formalization of the objects and relationships found in a particular application domain, for example the customers, products, and orders found in a manufacturing organization. At other times it refers to a set of concepts used in defining such formalizations: for example concepts such as entities, attributes, relations, or tables. So the "data model" of a banking application may be defined using the entity-relationship "data model". This article uses the term in both senses.

Overview of data modeling context: Data model is based on Data, Data relationship, Data semantic and Data constraint. A data model provides the details of information to be stored, and is of primary use when the final product is the generation of computer software code for an application or the preparation of a functional specification to aid a computer software make-or-buy decision. The figure is an example of the interaction between process and data models. A data model explicitly determines the structure of data. Data models are specified in a data modeling notation, which is often graphical in form. A data model can sometimes be referred to as a data structure, especially in the context of programming languages. Data models are often complemented by function models, especially in the context of enterprise models.


  • https://en.wikipedia.org/wiki/Data_modeling - in software engineering is the process of creating a data model for an information system by applying certain formal techniques. It may be applied as part of the broader Model-driven engineering (MDE) concept.


  • https://en.wikipedia.org/wiki/Conceptual_model - refers to any model that is formed after a conceptualization or generalization process. Conceptual models are often abstractions of things in the real world, whether physical or social. Semantic studies are relevant to various stages of concept formation. Semantics is fundamentally a study of concepts, the meaning that thinking beings give to various elements of their experience.


  • https://en.wikipedia.org/wiki/Process_of_concept_formation - the basis of the inductive thinking model. It requires presentation of examples. Concept formation is the process of sorting out given examples into meaningful classes. In the inductive thinking model, students group examples together on some basis and form as many groups as they can, each group illustrating a different concept.


  • https://en.wikipedia.org/wiki/Reference_model - in systems, enterprise, and software engineering—is an abstract framework or domain-specific ontology consisting of an interlinked set of clearly defined concepts produced by an expert or body of experts to encourage clear communication. A reference model can represent the component parts of any consistent idea, from business functions to system components, as long as it represents a complete set. This frame of reference can then be used to communicate ideas clearly among members of the same community.

Reference models are often illustrated as a set of concepts with some indication of the relationships between the concepts.



  • https://en.wikipedia.org/wiki/Associative_entity - a term used in relational and entity–relationship theory. A relational database requires the implementation of a base relation (or base table) to resolve many-to-many relationships. A base relation representing this kind of entity is called, informally, an associative table.


  • https://en.wikipedia.org/wiki/Associative_array - map, symbol table, or dictionary is an abstract data type that stores a collection of (key, value) pairs, such that each possible key appears at most once in the collection. In mathematical terms, an associative array is a function with finite domain. It supports 'lookup', 'remove', and 'insert' operations. The dictionary problem is the classic problem of designing efficient data structures that implement associative arrays.

The two major solutions to the dictionary problem are hash tables and search trees. In some cases it is also possible to solve the problem using directly addressed arrays, binary search trees, or other more specialized structures.
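
For illustration, Python's built-in dict is a hash-table-based associative array; the keys and values below are arbitrary.

  # An associative array: each key appears at most once in the collection.
  ages = {}                 # empty dictionary

  ages["ada"] = 36          # insert
  ages["grace"] = 45
  ages["ada"] = 37          # inserting an existing key overwrites its value

  print(ages.get("ada"))    # lookup -> 37
  print("alan" in ages)     # membership test -> False

  del ages["grace"]         # remove
  print(ages)               # {'ada': 37}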


  • https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model - or ER model, describes interrelated things of interest in a specific domain of knowledge. A basic ER model is composed of entity types (which classify the things of interest) and specifies relationships that can exist between entities (instances of those entity types).

In software engineering, an ER model is commonly formed to represent things a business needs to remember in order to perform business processes. Consequently, the ER model becomes an abstract data model that defines a data or information structure which can be implemented in a database, typically a relational database. Entity–relationship modeling was developed for database design by Peter Chen and published in a 1976 paper, with variants of the idea existing previously; today it is commonly used for teaching students the basics of database structure. Some ER models show super and subtype entities connected by generalization-specialization relationships, and an ER model can also be used in the specification of domain-specific ontologies.


The enhanced entity–relationship (EER) model was developed to reflect more precisely the properties and constraints that are found in more complex databases, such as in engineering design and manufacturing (CAD/CAM), telecommunications, complex software systems and geographic information systems (GIS).




  • https://en.wikipedia.org/wiki/Data_model - an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities. For instance, a data model may specify that the data element representing a car be composed of a number of other elements which, in turn, represent the color and size of the car and define its owner.

The corresponding professional activity is called generally data modeling or, more specifically, database design. Data models are typically specified by a data expert, data specialist, data scientist, data librarian, or a data scholar. A data modeling language and notation are often represented in graphical form as diagrams. A data model can sometimes be referred to as a data structure, especially in the context of programming languages. Data models are often complemented by function models, especially in the context of enterprise models. A data model explicitly determines the structure of data; conversely, structured data is data organized according to an explicit data model or data structure. Structured data is in contrast to unstructured data and semi-structured data.




  • https://en.wikipedia.org/wiki/Semi-structured_data - form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as self-describing structure. In semi-structured data, the entities belonging to the same class may have different attributes even though they are grouped together, and the attributes' order is not important. Semi-structured data are increasingly occurring since the advent of the Internet where full-text documents and databases are not the only forms of data anymore, and different applications need a medium for exchanging information. In object-oriented databases, one often finds semi-structured data.



  • https://en.wikipedia.org/wiki/Data_element_name - a name given to a data element in, for example, a data dictionary or metadata registry. In a formal data dictionary, there is often a requirement that no two data elements may have the same name, to allow the data element name to become an identifier, though some data dictionaries may provide ways to qualify the name in some way, for example by the application system or other context in which it occurs.



  • https://en.wikipedia.org/wiki/Representation_term - a word, or a combination of words, that semantically represents the data type (value domain) of a data element. A representation term is commonly referred to as a class word by those familiar with data dictionaries. ISO/IEC 11179-5:2005 defines representation term as a designation of an instance of a representation class. As used in ISO/IEC 11179, the representation term is that part of a data element name that provides a semantic pointer to the underlying data type. A representation class is a class of representations; this representation class provides a way to classify or group data elements. A representation term may be thought of as an attribute of a data element in a metadata registry that classifies the data element according to the type of data stored in the data element.


  • https://en.wikipedia.org/wiki/Golden_record_(informatics) - the valid version of a data element (record) in a single source of truth system. It may refer to a database, specific table or data field, or any unit of information used. A golden copy is a consolidated data set, and is supposed to provide a single source of truth and a "well-defined version of all the data entities in an organizational ecosystem". Other names sometimes used include master source or master version.


  • https://en.wikipedia.org/wiki/Data_definition_language - or data description language (DDL) is a syntax for creating and modifying database objects such as tables, indices, and users. DDL statements are similar to a computer programming language for defining data structures, especially database schemas. Common examples of DDL statements include CREATE, ALTER, and DROP.

History The concept of the data definition language and its name was first introduced in relation to the Codasyl database model, where the schema of the database was written in a language syntax describing the records, fields, and sets of the user data model. Later it was used to refer to a subset of Structured Query Language (SQL) for declaring tables, columns, data types and constraints. SQL-92 introduced a schema manipulation language and schema information tables to query schemas. These information tables were specified as SQL/Schemata in SQL:2003. The term DDL is also used in a generic sense to refer to any formal language for describing data or information structures.

  • https://en.wikipedia.org/wiki/Data_query_language - part of the base grouping of SQL sub-languages. DQL statements are used for performing queries on the data within schema objects. The purpose of DQL commands is to get the schema relation based on the query passed to it.
  • https://en.wikipedia.org/wiki/Data_manipulation_language - a computer programming language used for adding (inserting), deleting, and modifying (updating) data in a database. A DML is often a sublanguage of a broader database language such as SQL, with the DML comprising some of the operators in the language
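
As a small illustration of how these SQL sub-languages divide the work, the sketch below runs a DDL statement (CREATE TABLE), DML statements (INSERT, UPDATE) and a DQL statement (SELECT) against a throwaway in-memory SQLite database; the table and columns are invented.

  import sqlite3

  conn = sqlite3.connect(":memory:")   # throwaway in-memory database
  cur = conn.cursor()

  # DDL: define the structure.
  cur.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")

  # DML: add and modify data.
  cur.execute("INSERT INTO person (name) VALUES (?)", ("Ada",))
  cur.execute("UPDATE person SET name = ? WHERE id = ?", ("Ada Lovelace", 1))

  # DQL: query the data back.
  for row in cur.execute("SELECT id, name FROM person"):
      print(row)                       # (1, 'Ada Lovelace')

  conn.close()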



  • https://en.wikipedia.org/wiki/Data_set - or dataset, is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the data set in question. The data set lists values for each of the variables, such as for example height and weight of an object, for each member of the data set. Data sets can also consist of a collection of documents or files.

In the open data discipline, data set is the unit to measure the information released in a public open data repository.


  • https://en.wikipedia.org/wiki/Data_lake - a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of data including raw copies of source system data, sensor data, social data etc., and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video). A data lake can be established "on premises" (within an organization's data centers) or "in the cloud" (using cloud services from vendors such as Amazon, Microsoft, or Google).


  • https://en.wikipedia.org/wiki/Master_data - represents "data about the business entities that provide context for business transactions". The most commonly found categories of master data are parties (individuals and organisations, and their roles, such as customers, suppliers, employees), products, financial structures (such as ledgers and cost centres) and locational concepts



  • https://en.wikipedia.org/wiki/Dark_data - is data which is acquired through various computer network operations but not used in any manner to derive insights or for decision making. The ability of an organisation to collect data can exceed the throughput at which it can analyse the data. In some cases the organisation may not even be aware that the data is being collected. IBM estimate that roughly 90 percent of data generated by sensors and analog-to-digital conversions never get used.


  • https://en.wikipedia.org/wiki/Data_format_management - the application of a systematic approach to the selection and use of the data formats used to encode information for storage on a computer. In practical terms, data format management is the analysis of data formats and their associated technical, legal or economic attributes which can either enhance or detract from the ability of a digital asset or a given information system to meet specified objectives. Data format management is necessary as the amount of information and the number of people creating it grow. This is especially the case when the information users are working with is difficult to generate or store, costly to acquire, or hard to share.


  • https://en.wikipedia.org/wiki/Metadata - or metainformation, is "data that provides information about other data", but not the content of the data, such as the text of a message or the image itself. There are many distinct types of metadata


  • https://en.wikipedia.org/wiki/Metadata_modeling - type of metamodeling used in software engineering and systems engineering for the analysis and construction of models applicable to and useful for some predefined class of problems. Meta-modeling is the analysis, construction and development of the frames, rules, constraints, models and theories applicable and useful for the modeling in a predefined class of problems.
  • https://en.wikipedia.org/wiki/Metadata_standard - a requirement which is intended to establish a common understanding of the meaning or semantics of the data, to ensure correct and proper use and interpretation of the data by its owners and users


  • https://en.wikipedia.org/wiki/Metadata_repository - a database created to store metadata. Metadata is information about the structures that contain the actual data. Metadata is often said to be "data about data", but this is misleading. Data profiles are an example of actual "data about data". Metadata adds one layer of abstraction to this definition: it is data about the structures that contain data. Metadata may describe the structure of any data, of any subject, stored in any format. A well-designed metadata repository typically contains data far beyond simple definitions of the various data structures. Typical repositories store dozens to hundreds of separate pieces of information about each data structure.


  • https://en.wikipedia.org/wiki/Metadata_engine - collects, stores and analyzes information about data and metadata (data about data) in use within a knowledge domain. It virtualizes the view of data for an application by separating the data (physical) path from the metadata (logical) path so that data management can be performed independently from where the data physically resides. This expands the domain beyond a single storage device to span all devices within its namespace.


  • https://en.wikipedia.org/wiki/Metadata_discovery - also metadata harvesting, is the process of using automated tools to discover the semantics of a data element in data sets. This process usually ends with a set of mappings between the data source elements and a centralized metadata registry. Metadata discovery is also known as metadata scanning.


  • https://en.wikipedia.org/wiki/Metadata_management - involves managing metadata about other data, whereby this "other data" is generally referred to as content data. The term is used most often in relation to digital media, but older forms of metadata are catalogs, dictionaries, and taxonomies. For example, the Dewey Decimal Classification is a metadata management system developed in 1876 for libraries.


  • https://en.wikipedia.org/wiki/Preservation_metadata - item level information that describes the context and structure of a digital object. It provides background details pertaining to a digital object's provenance, authenticity, and environment. Preservation metadata, is a specific type of metadata that works to maintain a digital object's viability while ensuring continued access by providing contextual information, usage details, and rights. As an increasing portion of the world’s information output shifts from analog to digital form, preservation metadata is an essential component of most digital preservation strategies, including digital curation, data management, digital collections management and the preservation of digital information over the long-term. It is an integral part of the data lifecycle and helps to document a digital object’s authenticity while maintaining usability across formats.



  • https://en.wikipedia.org/wiki/Data_dictionary - or metadata repository, as defined in the IBM Dictionary of Computing, is a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format". Oracle defines it as a collection of tables with metadata. The term can have one of several closely related meanings pertaining to databases and database management systems (DBMS): A document describing a database or collection of databases; An integral component of a DBMS that is required to determine its structure; A piece of middleware that extends or supplants the native data dictionary of a DBMS


  • https://en.wikipedia.org/wiki/Synset - or synonym ring, is a group of data elements that are considered semantically equivalent for the purposes of information retrieval. These data elements are frequently found in different metadata registries. Although a group of terms can be considered equivalent, metadata registries store the synonyms at a central location called the preferred data element. According to WordNet, a synset or synonym set is defined as a set of one or more synonyms that are interchangeable in some context without changing the truth value of the proposition in which they are embedded.


  • https://en.wikipedia.org/wiki/Thesaurus_(information_retrieval) - a form of controlled vocabulary that seeks to dictate semantic manifestations of metadata in the indexing of content objects. A thesaurus serves to minimise semantic ambiguity by ensuring uniformity and consistency in the storage and retrieval of the manifestations of content objects.


  • https://en.wikipedia.org/wiki/Hyperdata - are data objects linked to other data objects in other places, as hypertext indicates text linked to other text in other places. Hyperdata enables formation of a web of data, evolving from the "data on the Web" that is not inter-related (or at least, not linked). In the same way that hypertext usually refers to the World Wide Web but is a broader term, hyperdata usually refers to the Semantic Web, but may also be applied more broadly to other data-linking technologies such as microformats – including XHTML Friends Network. A hypertext link indicates that a link exists between two documents or "information resources". Hyperdata links go beyond simply such a connection, and express semantics about the kind of connection being made. For instance, in a document about Hillary Clinton, a hypertext link might be made from the word senator to a document about the United States Senate. In contrast, a hyperdata link from the same word to the same document might also state that senator was one of Hillary Clinton's roles, titles, or positions (depending on the ontology being used to define this link).




  • https://en.wikipedia.org/wiki/Data_Documentation_Initiative - an international standard for describing surveys, questionnaires, statistical data files, and social sciences study-level information. This information is described as metadata by the standard. Begun in 1995, the effort brings together data professionals from around the world to develop the standard. The DDI specification, most often expressed in XML, provides a format for content, exchange, and preservation of questionnaire and data file information. DDI supports the description, storage, and distribution of social science data, creating an international specification that is machine-actionable and web-friendly.


  • https://en.wikipedia.org/wiki/ISO/IEC_11179 - an international ISO/IEC standard for representing metadata for an organization in a metadata registry. It documents the standardization and registration of metadata to make data understandable and shareable



  • https://en.wikipedia.org/wiki/Data_center - or data centre is a building, a dedicated space within a building, or a group of buildings used to house computer systems and associated components, such as telecommunications and storage systems.


  • https://en.wikipedia.org/wiki/Data_warehouse - DW or DWH, also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis and is considered a core component of business intelligence. Data warehouses are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise. This is beneficial for companies as it enables them to interrogate and draw insights from their data and make decisions. The data stored in the warehouse is uploaded from the operational systems (such as marketing or sales). The data may pass through an operational data store and may require data cleansing for additional operations to ensure data quality before it is used in the data warehouse for reporting. Extract, transform, load (ETL) and extract, load, transform (ELT) are the two main approaches used to build a data warehouse system.


  • https://en.wikipedia.org/wiki/Data_mart - a structure/access pattern specific to data warehouse environments, used to retrieve client-facing data. The data mart is a subset of the data warehouse and is usually oriented to a specific business line or team.


  • https://en.wikipedia.org/wiki/Common_warehouse_metamodel - defines a specification for modeling metadata for relational, non-relational, multi-dimensional, and most other objects found in a data warehousing environment. The specification is released and owned by the Object Management Group, which also claims a trademark in the use of "CWM".



  • https://en.wikipedia.org/wiki/Data_profiling - the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data
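
A minimal profiling sketch, assuming the data has already been parsed into a list of dictionaries (the records are invented); real profiling tools collect far richer statistics, but the idea of per-column summaries is the same.

  # Tiny column profiler: count, missing, distinct and min/max per column.
  rows = [
      {"name": "Ada",   "age": 36},
      {"name": "Grace", "age": 45},
      {"name": "Alan",  "age": None},
  ]

  for column in rows[0]:
      values = [r[column] for r in rows]
      present = [v for v in values if v is not None]
      print(column,
            "count:", len(values),
            "missing:", len(values) - len(present),
            "distinct:", len(set(present)),
            "min/max:", (min(present), max(present)) if present else None)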




  • https://en.wikipedia.org/wiki/Data_steward - an oversight or data governance role within an organization, and is responsible for ensuring the quality and fitness for purpose of the organization's data assets, including the metadata for those data assets. A data steward may share some responsibilities with a data custodian, such as the awareness, accessibility, release, appropriate use, security and management of data. A data steward would also participate in the development and implementation of data assets. A data steward may seek to improve the quality and fitness for purpose of other data assets their organization depends upon but is not responsible for.





  • https://en.wikipedia.org/wiki/Data_architecture - aims to set data standards for all its data systems as a vision or a model of the eventual interactions between those data systems. Data integration, for example, should be dependent upon data architecture standards since data integration requires data interactions between two or more data systems. A data architecture, in part, describes the data structures used by a business and its computer applications software. Data architectures address data in storage, data in use, and data in motion; descriptions of data stores, data groups, and data items; and mappings of those data artifacts to data qualities, applications, locations, etc. Essential to realizing the target state, data architecture describes how data is processed, stored, and used in an information system. It provides criteria for data processing operations to make it possible to design data flows and also control the flow of data in the system.

The data architect is typically responsible for defining the target state, aligning during development and then following up to ensure enhancements are done in the spirit of the original blueprint.


  • https://en.wikipedia.org/wiki/Data_management_platform - a software platform used for collecting and managing data. They allow businesses to identify audience segments, which can be used to target specific users and contexts in online advertising campaigns. DMPs may use big data and artificial intelligence algorithms to process and analyze large data sets about users from various sources. Some advantages of using DMPs include data organization, increased insight on audiences and markets, and effective advertisement budgeting. On the other hand, DMPs often have to deal with privacy concerns due to the integration of third-party software with private data. This technology is continuously being developed by global entities such as Nielsen and Oracle. More generally, the term data platform can refer to any software platform used for collecting and managing data: an integrated solution which, as of the 2010s, can combine the functionality of, for example, a data lake, data warehouse or data hub for business intelligence purposes. The description above concerns such platforms as used for digital marketing purposes specifically.


  • https://en.wikipedia.org/wiki/Data_curation - the organization and integration of data collected from various sources. It involves annotation, publication and presentation of the data such that the value of the data is maintained over time, and the data remains available for reuse and preservation. Data curation includes "all the processes needed for principled and controlled data creation, maintenance, and management, together with the capacity to add value to data". In science, data curation may indicate the process of extraction of important information from scientific texts, such as research articles by experts, to be converted into an electronic format, such as an entry of a biological database.

In the modern era of big data, the curation of data has become more prominent, particularly for software processing high volume and complex data systems. The term is also used in historical occasions and the humanities, where increasing cultural and scholarly data from digital humanities projects requires the expertise and analytical practices of data curation. In broad terms, curation means a range of activities and processes done to create, manage, maintain, and validate a component. Specifically, data curation is the attempt to determine what information is worth saving and for how long.


  • https://en.wikipedia.org/wiki/Data_collection - or data gathering is the process of gathering and measuring information on targeted variables in an established system, which then enables one to answer relevant questions and evaluate outcomes. Data collection is a research component in all study fields, including physical and social sciences, humanities, and business. While methods vary by discipline, the emphasis on ensuring accurate and honest collection remains the same. The goal of all data collection is to capture evidence that allows data analysis to lead to the formulation of credible answers to the questions that have been posed. Regardless of the field of study or preference for defining data (quantitative or qualitative), accurate data collection is essential to maintain research integrity. The selection of appropriate data collection instruments (existing, modified, or newly developed) and delineated instructions for their correct use reduce the likelihood of errors.


  • https://en.wikipedia.org/wiki/Data_collection_system - a computer application that facilitates the process of data collection, allowing specific, structured information to be gathered in a systematic fashion, subsequently enabling data analysis to be performed on the information. Typically a DCS displays a form that accepts data input from a user and then validates that input prior to committing the data to persistent storage such as a database.



  • https://en.wikipedia.org/wiki/Data_analysis - the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains. In today's business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.

Data mining is a particular data analysis technique that focuses on statistical modeling and knowledge discovery for predictive rather than purely descriptive purposes, while business intelligence covers data analysis that relies heavily on aggregation, focusing mainly on business information. In statistical applications, data analysis can be divided into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data while CDA focuses on confirming or falsifying existing hypotheses. Predictive analytics focuses on the application of statistical models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a species of unstructured data. All of the above are varieties of data analysis. Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination.


  • https://en.wikipedia.org/wiki/Data_mining - the process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal of extracting information (with intelligent methods) from a data set and transforming the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

The term "data mining" is a misnomer because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. It also is a buzzword and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support system, including artificial intelligence (e.g., machine learning) and business intelligence. The book Data Mining: Practical Machine Learning Tools and Techniques with Java (which covers mostly machine learning material) was originally to be named Practical Machine Learning, and the term data mining was only added for marketing reasons. Often the more general terms (large scale) data analysis and analytics—or, when referring to actual methods, artificial intelligence and machine learning—are more appropriate.

The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting is part of the data mining step, although they do belong to the overall KDD process as additional steps.
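
The entry above names cluster analysis and anomaly detection as typical mining tasks. Purely as an illustration of the "unusual records" idea, here is a deliberately tiny two-standard-deviation rule over invented sensor readings; real data mining uses far more robust methods.

  # Toy anomaly detection: flag readings far from the mean (2-sigma rule).
  from statistics import mean, stdev

  readings = [21.4, 21.6, 21.5, 21.7, 35.0, 21.5]   # invented sensor data
  mu, sigma = mean(readings), stdev(readings)

  anomalies = [x for x in readings if abs(x - mu) > 2 * sigma]
  print(anomalies)    # [35.0]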



  • https://en.wikipedia.org/wiki/Data_integrity - the maintenance of, and the assurance of, data accuracy and consistency over its entire life-cycle and is a critical aspect to the design, implementation, and usage of any system that stores, processes, or retrieves data. The term is broad in scope and may have widely different meanings depending on the specific context – even under the same general umbrella of computing. It is at times used as a proxy term for data quality, while data validation is a prerequisite for data integrity. Data integrity is the opposite of data corruption. The overall intent of any data integrity technique is the same: ensure data is recorded exactly as intended (such as a database correctly rejecting mutually exclusive possibilities). Moreover, upon later retrieval, ensure the data is the same as when it was originally recorded. In short, data integrity aims to prevent unintentional changes to information. Data integrity is not to be confused with data security, the discipline of protecting data from unauthorized parties.
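
One common low-level technique for the "same as when originally recorded" guarantee is to store a digest alongside the data and recompute it on retrieval; the sketch below does this with Python's hashlib, using made-up payload bytes.

  import hashlib

  def digest(data: bytes) -> str:
      """SHA-256 digest used here as an integrity check value."""
      return hashlib.sha256(data).hexdigest()

  payload = b"temperature=21.5;unit=C"    # data as originally recorded
  stored_checksum = digest(payload)       # kept alongside the data

  # Later, on retrieval, recompute and compare:
  retrieved = b"temperature=21.5;unit=C"
  if digest(retrieved) != stored_checksum:
      raise ValueError("integrity check failed: data changed since recording")
  print("integrity check passed")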


  • https://en.wikipedia.org/wiki/Data_quality - refers to the state of qualitative or quantitative pieces of information. There are many definitions of data quality, but data is generally considered high quality if it is "fit for [its] intended uses in operations, decision making and planning". Moreover, data is deemed of high quality if it correctly represents the real-world construct to which it refers. Furthermore, apart from these definitions, as the number of data sources increases, the question of internal data consistency becomes significant, regardless of fitness for use for any particular external purpose. People's views on data quality can often be in disagreement, even when discussing the same set of data used for the same purpose. When this is the case, data governance is used to form agreed upon definitions and standards for data quality. In such cases, data cleansing, including standardization, may be required in order to ensure data quality.
  • https://en.wikipedia.org/wiki/Data_quality_firewall - the use of software to protect a computer system from the entry of erroneous, duplicated or poor quality data. Gartner estimates that poor quality data causes failure in up to 50% of customer relationship management systems. Older technology required the tight integration of data quality software, whereas this can now be accomplished by loosely coupling technology in a service-oriented architecture.
  • https://en.wikipedia.org/wiki/Information_quality - the quality of the content of information systems. It is often pragmatically defined as: "The fitness for use of the information provided". IQ frameworks also provide a tangible approach to assess and measure DQ/IQ in a robust and rigorous manner.


  • https://en.wikipedia.org/wiki/Data_cleansing - or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. Data cleansing may be performed interactively with data wrangling tools, or as batch processing through scripting or a data quality firewall. After cleansing, a data set should be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores. Data cleaning differs from data validation in that validation almost invariably means data is rejected from the system at entry and is performed at the time of entry, rather than on batches of data.
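
A toy cleansing pass over invented records, showing the kinds of corrections described above (normalising entries, dropping duplicates and obviously invalid values); real pipelines use dedicated wrangling tools, this is only a sketch.

  # Normalise names, then drop duplicates and impossible ages.
  raw = [
      {"name": " Ada ", "age": 36},
      {"name": "Ada",   "age": 36},    # duplicate after trimming
      {"name": "Grace", "age": -5},    # obvious entry error
      {"name": "Alan",  "age": 41},
  ]

  seen, clean = set(), []
  for record in raw:
      name = record["name"].strip().title()
      age = record["age"]
      if not 0 <= age <= 120:          # crude validity rule
          continue
      if (name, age) in seen:          # remove exact duplicates
          continue
      seen.add((name, age))
      clean.append({"name": name, "age": age})

  print(clean)   # [{'name': 'Ada', 'age': 36}, {'name': 'Alan', 'age': 41}]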


  • https://en.wikipedia.org/wiki/Database_normalization - or database normalisation (see spelling differences) is the process of structuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. It was first proposed by British computer scientist Edgar F. Codd as part of his relational model.


  • https://en.wikipedia.org/wiki/Data_validation - the process of ensuring data has undergone data cleansing to confirm they have data quality, that is, that they are both correct and useful. It uses routines, often called "validation rules", "validation constraints", or "check routines", that check for correctness, meaningfulness, and security of data that are input to the system. The rules may be implemented through the automated facilities of a data dictionary, or by the inclusion of explicit application program validation logic of the computer and its application.

This is distinct from formal verification, which attempts to prove or disprove the correctness of algorithms for implementing a specification or property.
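
A hedged sketch of entry-time validation: each rule checks one aspect of an incoming record, and the record is rejected rather than silently corrected if any rule fails. The field names, country codes and rules are invented.

  # Simple validation rules applied at the point of entry.
  def validate(record: dict) -> list:
      """Return a list of rule violations; an empty list means the record is accepted."""
      errors = []
      email = record.get("email", "")
      if "@" not in email:
          errors.append("email missing or malformed")
      age = record.get("age")
      if not isinstance(age, int) or not 0 <= age <= 120:
          errors.append("age must be an integer between 0 and 120")
      if record.get("country") not in {"DE", "FR", "UK"}:   # code-list check
          errors.append("unknown country code")
      return errors

  print(validate({"email": "ada@example.org", "age": 36, "country": "UK"}))  # []
  print(validate({"email": "not-an-email", "age": 200, "country": "XX"}))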

  • https://en.wikipedia.org/wiki/Data_validation_and_reconciliation - a technology that uses process information and mathematical methods in order to automatically ensure data validation and reconciliation by correcting measurements in industrial processes. The use of PDR allows for extracting accurate and reliable information about the state of industry processes from raw measurement data and produces a single consistent set of data representing the most likely process operation.




  • https://en.wikipedia.org/wiki/Data_wrangling - sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. The goal of data wrangling is to assure quality and useful data. Data analysts typically spend the majority of their time in the process of data wrangling compared to the actual analysis of the data. The process of data wrangling may include further munging, data visualization, data aggregation, training a statistical model, as well as many other potential uses. Data wrangling typically follows a set of general steps which begin with extracting the data in a raw form from the data source, "munging" the raw data (e.g. sorting) or parsing the data into predefined data structures, and finally depositing the resulting content into a data sink for storage and future use. It is closely aligned with the ETL process.


  • https://en.wikipedia.org/wiki/Data_integration - involves combining data residing in different sources and providing users with a unified view of them. This process becomes significant in a variety of situations, which include both commercial (such as when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories, for example) domains. Data integration appears with increasing frequency as the volume (that is, big data) and the need to share existing data explodes. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved. Data integration encourages collaboration between internal as well as external users. The data being integrated must be received from a heterogeneous database system and transformed to a single coherent data store that provides synchronous data across a network of files for clients. A common use of data integration is in data mining when analyzing and extracting information from existing databases that can be useful for Business information.


  • https://en.wikipedia.org/wiki/Record_linkage - also known as data matching, data linkage, entity resolution, and many other terms, is the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, and databases). Record linkage is necessary when joining different data sets based on entities that may or may not share a common identifier (e.g., database key, URI, National identification number), which may be due to differences in record shape, storage location, or curator style or preference. A data set that has undergone RL-oriented reconciliation may be referred to as being cross-linked.
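
A toy illustration of linking two sources that lack a shared identifier: each record is reduced to a normalised comparison key (lower-cased name plus birth year) and matched on it. Real record linkage is usually probabilistic or fuzzy; the records here are invented.

  # Link records across two sources without a common identifier.
  source_a = [{"name": "Ada Lovelace", "born": 1815, "a_id": 1}]
  source_b = [{"full_name": "ADA LOVELACE", "birth_year": 1815, "b_id": 77}]

  def key_a(r): return (r["name"].strip().lower(), r["born"])
  def key_b(r): return (r["full_name"].strip().lower(), r["birth_year"])

  index = {key_a(r): r for r in source_a}
  links = [(index[key_b(r)]["a_id"], r["b_id"])
           for r in source_b if key_b(r) in index]

  print(links)    # [(1, 77)] -- both records refer to the same entity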



  • https://en.wikipedia.org/wiki/Data_archaeology - in the technical sense, refers to the art and science of recovering computer data encoded and/or encrypted in now obsolete media or formats. Data archaeology can also refer to recovering information from damaged electronic formats after natural disasters or human error.


  • https://en.wikipedia.org/wiki/Data_virtualization - an approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at source, or where it is physically located, and can provide a single customer view (or single view of any other entity) of the overall data.


  • https://en.wikipedia.org/wiki/Data_blending - a process whereby big data from multiple sources are merged into a single data warehouse or data set. It concerns not merely the merging of different file formats or disparate sources of data but also different varieties of data. Data blending allows business analysts to cope with the expansion of data that they need to make critical business decisions based on good quality business intelligence. Data blending has been described as different from data integration due to the requirements of data analysts to merge sources very quickly, too quickly for any practical intervention by data scientists.


  • https://en.wikipedia.org/wiki/Dataspaces - an abstraction in data management that aims to overcome some of the problems encountered in data integration systems. The aim is to reduce the effort required to set up a data integration system by relying on existing matching and mapping generation techniques, and to improve the system in "pay-as-you-go" fashion as it is used. Labor-intensive aspects of data integration are postponed until they are absolutely needed.

Traditionally, data integration and data exchange systems have aimed to offer many of the purported services of dataspace systems. Dataspaces can be viewed as a next step in the evolution of data integration architectures, but are distinct from current data integration systems in the following way. Data integration systems require semantic integration before any services can be provided. Hence, although there is not a single schema to which all the data conforms and the data resides in a multitude of host systems, the data integration system knows the precise relationships between the terms used in each schema. As a result, significant up-front effort is required in order to set up a data integration system.


  • https://en.wikipedia.org/wiki/Data_exchange - the process of taking data structured under a source schema and transforming it into a target schema, so that the target data is an accurate representation of the source data. Data exchange allows data to be shared between different computer programs.


  • https://en.wikipedia.org/wiki/Data_mapping - the process of creating data element mappings between two distinct data models. Data mapping is used as a first step for a wide variety of data integration tasks, including: Data transformation or data mediation between a data source and a destination; Identification of data relationships as part of data lineage analysis; Discovery of hidden sensitive data such as the last four digits of a social security number hidden in another user id as part of a data masking or de-identification project; Consolidation of multiple databases into a single database and identifying redundant columns of data for consolidation or elimination
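
A minimal field-level mapping between two invented schemas: a dictionary from source element names to target element names drives the transformation, which is roughly the artefact that mapping tools produce.

  # Map a record from a source schema to a target schema.
  FIELD_MAP = {                  # source element -> target element
      "cust_name": "customer_name",
      "dob":       "date_of_birth",
      "postcode":  "postal_code",
  }

  def map_record(source: dict) -> dict:
      return {target: source[src]
              for src, target in FIELD_MAP.items() if src in source}

  src = {"cust_name": "Ada Lovelace", "dob": "1815-12-10", "postcode": "W1"}
  print(map_record(src))
  # {'customer_name': 'Ada Lovelace', 'date_of_birth': '1815-12-10', 'postal_code': 'W1'}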


  • https://en.wikipedia.org/wiki/Schema_matching - schema matching and schema mapping are often used interchangeably for a database process, but the two can be differentiated as follows: schema matching is the process of identifying that two objects are semantically related, while mapping refers to the transformations between the objects.


  • https://en.wikipedia.org/wiki/Semantic_mapper - tool or service that aids in the transformation of data elements from one namespace into another namespace. A semantic mapper is an essential component of a semantic broker and one tool that is enabled by the Semantic Web technologies. Essentially the problems arising in semantic mapping are the same as in data mapping for data integration purposes, with the difference that here the semantic relationships are made explicit through the use of semantic nets or ontologies which play the role of data dictionaries in data mapping.


  • https://en.wikipedia.org/wiki/Ontology_alignment - or ontology matching, is the process of determining correspondences between concepts in ontologies. A set of correspondences is also called an alignment. The phrase takes on slightly different meanings in computer science, cognitive science, and philosophy.



  • https://en.wikipedia.org/wiki/Ontology-based_data_integration - involves the use of one or more ontologies to effectively combine data or information from multiple heterogeneous sources. It is one of the multiple data integration approaches and may be classified as Global-As-View (GAV). The effectiveness of ontology‑based data integration is closely tied to the consistency and expressivity of the ontology used in the integration process.
  • https://en.wikipedia.org/wiki/Semantic_equivalence - a declaration that two data elements from different vocabularies contain data that has similar meaning. There are three types of semantic equivalence statements: class or concept equivalence (a statement that two high-level concepts have similar or equivalent meaning); property or attribute equivalence (a statement that two properties, descriptors or attributes of classes have similar meaning); and instance equivalence (a statement that two instances of data are the same or refer to the same instance).



  • https://en.wikipedia.org/wiki/Data_transformation_(computing) - the process of converting data from one format or structure into another format or structure. It is a fundamental aspect of most data integration and data management tasks such as data wrangling, data warehousing, data integration and application integration.



  • https://en.wikipedia.org/wiki/Data_preservation - the act of conserving and maintaining both the safety and integrity of data. Preservation is done through formal activities that are governed by policies, regulations and strategies directed towards protecting and prolonging the existence and authenticity of data and its metadata. Data can be described as the elements or units in which knowledge and information is created, and metadata are the summarizing subsets of the elements of data; or the data about the data. The main goal of data preservation is to protect data from being lost or destroyed and to contribute to the reuse and progression of the data.



  • https://en.wikipedia.org/wiki/Clinical_data_management - a critical process in clinical research, which leads to the generation of high-quality, reliable, and statistically sound data from clinical trials. Clinical data management ensures collection, integration and availability of data at appropriate quality and cost. It also supports the conduct, management and analysis of studies across the spectrum of clinical research as defined by the National Institutes of Health (NIH). The ultimate goal of CDM is to ensure that conclusions drawn from research are well supported by the data. Achieving this goal protects public health and increases confidence in marketed therapeutics.


  • https://en.wikipedia.org/wiki/Clinical_data_management_system - or CDMS is a tool used in clinical research to manage the data of a clinical trial. The clinical trial data gathered at the investigator site in the case report form are stored in the CDMS. To reduce the possibility of errors due to human entry, the systems employ various means to verify the data. Systems for clinical data management can be self-contained or part of the functionality of a CTMS. A CTMS with clinical data management functionality can help with the validation of clinical data, as well as help the site with other important activities such as building patient registries and assisting in patient recruitment efforts.




  • https://en.wikipedia.org/wiki/Data_as_a_service - a cloud-based software tool used for working with data, such as managing data in a data warehouse or analyzing data with business intelligence. It is enabled by software as a service (SaaS). Like all "as a service" (aaS) technology, DaaS builds on the concept that its data product can be provided to the user on demand, regardless of geographic or organizational separation between provider and consumer. Service-oriented architecture (SOA) and the widespread use of APIs have rendered the platform on which the data resides as irrelevant.

As a business model, data as a service is an arrangement in which two or more organizations buy, sell, or trade machine-readable data in exchange for something of value.

Mining



Science


  • A Taxonomy of Data Science - Both within the academy and within tech startups, we’ve been hearing some similar questions lately: Where can I find a good data scientist? What do I need to learn to become a data scientist? Or more succinctly: What is data science?


  • Getting Started - Welcome to the “sexiest career of the 21st century”. This page contains some resources to help you get oriented to some of the foundations of the field.


  • School of Data - works to empower civil society organizations, journalists and citizens with the skills they need to use data effectively in their efforts to create more equitable and effective societies.



  • Environmental Computing - Getting started with R, data manipulation, graphics, and statistics for the environmental sciences. Students and researchers in the environmental sciences require a wide range of quantitative skills in analytical and data processing software, including R, geographic information systems (GIS) and the processing of remotely sensed data. There is increasingly a need to ensure transparency of data processing supported by statistical analyses to justify conclusions of scientific research and monitoring for management and policy. This site is a brief introduction to techniques for data organisation, graphics and statistical analyses.



  • Kaggle - Service - From Big Data to Big Analytics.



  • Red Hen Lab - global big data science laboratory and cooperative for research into multimodal communication.


Management

See also Database, Visualisation, Maths#Software, Learning




FAIR data

  • FAIRsFAIR - Fostering Fair Data Practices in Europe, aims to supply practical solutions for the use of the FAIR data principles throughout the research data life cycle. Emphasis is on fostering FAIR data culture and the uptake of good practices in making data FAIR. FAIRsFAIR will play a key role in the development of global standards for FAIR certification of repositories and the data within them contributing to those policies and practices that will turn the EOSC programme into a functioning infrastructure. In the end, FAIRsFAIR will provide a platform for using and implementing the FAIR principles in the day to day work of European research data providers and repositories. FAIRsFAIR will also deliver essential FAIR dimensions of the Rules of Participation (RoP) and regulatory compliance for participation in the EOSC. The EOSC governance structure will use these FAIR aligned RoPs to establish whether components of the infrastructure function in a FAIR manner.


  • https://en.wikipedia.org/wiki/FAIR_data - are data which meet principles of findability, accessibility, interoperability, and reusability (FAIR). The acronym and principles were defined in a March 2016 paper in the journal Scientific Data by a consortium of scientists and organizations. The FAIR principles emphasize machine-actionability (i.e., the capacity of computational systems to find, access, interoperate, and reuse data with none or minimal human intervention), because humans increasingly rely on computational support to deal with data as a result of the increase in volume, complexity, and creation speed of data. The abbreviation FAIR/O data is sometimes used to indicate that the dataset or database in question complies with the FAIR principles and also carries an explicit data-capable open license.



Trusted Digital Repository

  • https://github.com/peterVG/awesome-tdr - Trusted Digital Repository (TDR) is a term used to refer to a library or archive of digital information objects that has undergone a formal audit of its long-term content management and digital preservation capacity. This helps to assure users of its collections that its data is authentic, trustworthy and usable. A TDR audit takes into account a number of technical, organizational, and economic factors to determine whether a given repository can be trusted to deal with long-term threats to the preservation and accessibility of the digital information objects in its care (e.g. documents, photos, videos, datasets, etc.). Three well-known TDR auditing agencies are the Center for Research Libraries (CRL), the Research Data Alliance (RDA), and the International Organization for Standardization (ISO). Organizations may use the requirements from one of these agencies to do a self-audit as a way to gauge their internal readiness and maturity. To achieve a higher level of credibility and accountability, an organization will hire auditors certified by CRL, RDA, or ISO to perform an external audit.

TDR auditing standards

  • CRL TRAC: Trustworthy Repositories Audit & Certification: Criteria and Checklist
  • RDA CoreTrustSeal
  • ISO 16363:2012: Space data and information transfer systems — Audit and certification of trustworthy digital repositories


  • Trusted Digital Repository - On July 8, 2010 a memorandum of understanding (MoU) was signed between three groups working on standards for Trusted Digital Repositories: David Giaretta in his capacity as Chair of the CCSDS (Consultative Committee for Space Data Systems)/ISO Repository Audit and Certification Working Group (RAC), Henk Harmsen in his capacity as Chair of the Data Seal of Approval (DSA) Board, and Christian Keitel in his capacity as Chair of the DIN Working Group "Trusted Archives - Certification". The parties to this Memorandum of Understanding all lead separate groups aiming at certifying digital repositories. They wish to put in place mechanisms to ensure that the groups can collaborate in setting up an integrated framework for auditing and certifying digital repositories.

The framework will consist of a sequence of three levels, in increasing trustworthiness:

  • Basic Certification is granted to repositories which obtain DSA certification;
  • Extended Certification is granted to Basic Certification repositories which in addition perform a structured, externally reviewed and publicly available self-audit based on ISO 16363 or DIN 31644;
  • Formal Certification is granted to repositories which in addition to Basic Certification obtain full external audit and certification based on ISO 16363 or equivalent DIN 31644.


  • CoreTrustSeal - offers to any interested data repository a core level certification based on the Core Trustworthy Data Repositories Requirements. This universal catalogue of requirements reflects the core characteristics of trustworthy data repositories. The CoreTrustSeal Data Repository Application Management Tool is available to support applications. CoreTrustSeal is an international, community based, non-governmental, and non-profit organization promoting sustainable and trustworthy data infrastructures. To manage its finances, CoreTrustSeal is a legal entity under Dutch law (CoreTrustSeal Foundation Statutes and Rules of Procedure) governed by a Standards and Certification Board composed of 12 elected members representing the Assembly of Reviewers.

Annotation

  • https://en.wikipedia.org/wiki/Text_annotation - the practice and the result of adding a note or gloss to a text, which may include highlights or underlining, comments, footnotes, tags, and links. Text annotations can include notes written for a reader's private purposes, as well as shared annotations written for the purposes of collaborative writing and editing, commentary, or social reading and sharing. In some fields, text annotation is comparable to metadata insofar as it is added post hoc and provides information about a text without fundamentally altering that original text. Text annotations are sometimes referred to as marginalia, though some reserve this term specifically for hand-written notes made in the margins of books or manuscripts. Annotations have been found to be useful and help to develop knowledge of English literature.

Annotations can be both private and socially shared, including hand-written and information technology-based annotation. Annotations are different from note-taking because annotations must be physically written or added on the actual original piece. This can be writing within the page of a book or highlighting a line, or, if the piece is digital, a comment or saved highlight or underline within the document. For information on annotation of Web content, including images and other non-textual content, see also Web annotation.


  • https://en.wikipedia.org/wiki/Note_(typography) - a string of text placed at the bottom of a page in a book or document or at the end of a chapter, volume, or the whole text. The note can provide an author's comments on the main text or citations of a reference work in support of the text. Footnotes are notes at the foot of the page while endnotes are collected under a separate heading at the end of a chapter, volume, or entire work. Unlike footnotes, endnotes have the advantage of not affecting the layout of the main text, but may cause inconvenience to readers who have to move back and forth between the main text and the endnotes.


  • https://en.wikipedia.org/wiki/Drama_annotation - the process of annotating the metadata of a drama. Given a drama expressed in some medium (text, video, audio, etc.), the process of metadata annotation identifies what are the elements that characterize the drama and annotates such elements in some metadata format. For example, in the sentence "Laertes and Polonius warn Ophelia to stay away from Hamlet." from the text Hamlet, the word "Laertes", which refers to a drama element, namely a character, will be annotated as "Char", taken from some set of metadata. This article addresses the drama annotation projects, with the sets of metadata and annotations proposed in the scientific literature, based markup languages and ontologies.



Web annotation

See also Organising#Bookmarks / social bookmarking


  • https://en.wikipedia.org/wiki/Web_annotation - can refer to online annotations of web resources such as web pages or parts of them, or a set of W3C standards developed for this purpose. The term can also refer to the creations of annotations on the World Wide Web and it has been used in this sense for the annotation tool INCEpTION, formerly WebAnno. This is a general feature of several tools for annotation in natural language processing or in the philologies.




  • Annotation Studio - a suite of tools for collaborative web-based annotation, currently under development by MIT’s HyperStudio. Annotation Studio actively engages students in interpreting primary sources such as literary texts and other humanities documents. Currently supporting the multimedia annotation of texts, Annotation Studio will ultimately allow students to annotate video, image, and audio sources. With Annotation Studio, students can develop traditional humanistic skills such as close reading and textual analysis while also advancing their understanding of texts and contexts by linking and comparing original documents with sources, adaptations, or variations in different media formats. Instead of passively reading, students are discovering, annotating, comparing, sampling, illustrating, and representing – activities that John Unsworth has dubbed “scholarly primitives.”

Annotation Studio is currently being used in classes throughout the humanities and social sciences, nationally and internationally, from high schools to community colleges to Ivy League universities. The project has also received funding through a Digital Humanities Start-up grant from the National Endowment for the Humanities. Annotation Studio is rooted in a technology-supported pedagogy that has been developing in MIT literature classes over the past decade. By enabling users to tag texts using folksonomies rather than TEI, Annotation Studio allows students to practice scholarly primitives quite naturally, thereby discovering how literary texts can be opened up through exploration of sources, influences, editions, and adaptations. In other words, Annotation Studio’s tools and workspaces help students hone skills traditionally used by professional humanists.



  • W3C Web Annotation Working Group - The W3C Web Annotation Working Group is chartered to develop a set of specifications for an interoperable, sharable, distributed Web Annotation architecture.


  • Deliverables of W3C’s Web Annotation Working Group - Deliverables of W3C’s Web Annotation Working Group. Note that the Working Group is now Closed; for further discussions on the evolution of Web Annotation technologies at W3C, possible errata, etc, please join the W3C Open Annotation Community Group. In particular, if you want to submit an erratum, the best is to follow the procedure described on the errata page of the documents.


  • Open Annotation Community Group - The purpose of the Open Annotation Community Group is to work towards a common, RDF-based, specification for annotating digital resources. The effort will start by working towards a reconciliation of two proposals that have emerged over the past two years: the Annotation Ontology and the Open Annotation Model . Initially, editors of these proposals will closely collaborate to devise a common draft specification that addresses requirements and use cases that were identified in the course of their respective efforts. The goal is to make this draft available for public feedback and experimentation in the second quarter of 2012. The final deliverable of the Open Annotation Community Group will be a specification, published under an appropriate open license, that is informed by the existing proposals, the common draft specification, and the community feedback. http://code.google.com/p/annotation-ontology/ http://www.openannotation.org/spec/beta/






  • https://github.com/jankaszel/simple-annotation-server - a very simple annotation server intended for testing purposes, implementing both the Web Annotation Protocol as well as the Web Annotation Data Model with simple REST-based user management. The server is written in JavaScript, runs on Node.js, and uses an in-process LevelDB database.



  • https://en.wikipedia.org/wiki/Hypothes.is - an open-source software project that aims to collect comments about statements made in any web-accessible content, and filter and rank those comments to assess each statement's credibility. It has been summarized as "a peer review layer for the entire Internet."


  • Apache Annotator (incubating) - provides annotation enabling code for browsers, servers, and humans. Growing out of the experiences with web annotation software (Annotator.js, Hypothes.is, and others), Apache Annotator is a collaboration between developers of annotation tools to bundle our efforts. The goal is to help developers of annotation tools create their applications without having to reinvent the wheel, while applying a standards-driven approach based on the W3C’s Web Annotation data model, in order to facilitate an ecosystem of interoperable annotation tools — ideally making annotations an integral part of the web.


  • annorepo | Annotation Repository - A webservice for W3C Web Annotations, implementing the W3C Web Annotation Protocol, plus custom services (batch upload, search, etc.).




to sort

  • doccano - an open-source text annotation tool for humans. It provides annotation features for text classification, sequence labeling, and sequence to sequence tasks. You can create labeled data for sentiment analysis, named entity recognition, text summarization, and so on. Just create a project, upload data, and start annotating. You can build a dataset in hours.
  • https://github.com/doccano/doccano



  • https://github.com/smistad/annotationweb - a web-based annotation system made primarily for easy annotation of image sequences such as ultrasound and camera recordings. It uses mainly django/python for the backend and javascript/jQuery and HTML canvas for the interactive annotation frontend. Annotation Web is developed by SINTEF Medical Technology and Norwegian University of Science and Technology (NTNU), and is released under a permissive MIT license.






  • https://en.wikipedia.org/wiki/Automatic_image_annotation - also known as automatic image tagging or linguistic indexing, is the process by which a computer system automatically assigns metadata in the form of captioning or keywords to a digital image. This application of computer vision techniques is used in image retrieval systems to organize and locate images of interest from a database. This method can be regarded as a type of multi-class image classification with a very large number of classes - as large as the vocabulary size. Typically, image analysis in the form of extracted feature vectors and the training annotation words are used by machine learning techniques to attempt to automatically apply annotations to new images. The first methods learned the correlations between image features and training annotations, then techniques were developed using machine translation to try to translate the textual vocabulary with the 'visual vocabulary', or clustered regions known as blobs. Work following these efforts has included classification approaches, relevance models and so on.


Annotator

  • Annotator - an open-source JavaScript library to easily add annotation functionality to any webpage. Annotations can have comments, tags, links, users, and more. Annotator is designed for easy extensibility, so it's a cinch to add a new feature or behaviour. Annotator also fosters an active developer community with contributors from four continents, building 3rd party plugins allowing the annotation of PDFs, EPUBs, videos, images, sound, and more.

Recogito

  • Recogito - an online platform for collaborative document annotation. It is maintained by Pelagios, a Digital Humanities initiative aiming to foster better linkages between online resources documenting the past. Recogito provides a personal workspace where you can upload, collect and organize your source materials - texts, images and tabular data - and collaborate in their annotation and interpretation. Recogito helps you to make your work more visible on the Web more easily, and to expose the results of your research as Open Data. Recogito is an initiative of the Pelagios Network, developed under the leadership of the Austrian Institute of Technology, Exeter University and The Open University, with funding from the Andrew W. Mellon Foundation. Recogito is provided as Open Source software, under the terms of the Apache 2 license. It can be downloaded free of charge for self-hosting from the project's GitHub repository. Pelagios Commons offers free access to a hosted version of the software at recogito.pelagios.org in the spirit of open data and as an act of collegiality.

INCEpTION

  • INCEpTION - A semantic annotation platform offering intelligent assistance and knowledge management. The annotation of specific semantic phenomena often requires compiling task-specific corpora and creating or extending task-specific knowledge bases. Presently, researchers require a broad range of skills and tools to address such semantic annotation tasks. In the recently funded INCEpTION project, UKP Lab at TU Darmstadt aims towards building an annotation platform that incorporates all the related tasks into a joint web-based platform.


Subsetting

  • https://en.wikipedia.org/wiki/Subsetting - In research communities (for example, earth sciences, astronomy, business, and government), subsetting is the process of retrieving just the parts (a subset) of large files which are of interest for a specific purpose. This occurs usually in a client—server setting, where the extraction of the parts of interest occurs on the server before the data is sent to the client over a network. The main purpose of subsetting is to save bandwidth on the network and storage space on the client computer. Subsetting may be favorable for the following reasons: restrict or divide the time range; select cross sections of data; select particular kinds of time series; exclude particular observations


  • Jailer - a tool for database subsetting, schema and data browsing. It creates small slices from your database and lets you navigate through your database following the relationships. Ideal for creating small samples of test data or for local problem analysis with relevant production data.


Encoding

  • https://en.wikipedia.org/wiki/Code - a system of rules to convert information—such as a letter, word, sound, image, or gesture—into another form, sometimes shortened or secret, for communication through a communication channel or storage in a storage medium. An early example is the invention of language, which enabled a person, through speech, to communicate what they thought, saw, heard, or felt to others. But speech limits the range of communication to the distance a voice can carry and limits the audience to those present when the speech is uttered. The invention of writing, which converted spoken language into visual symbols, extended the range of communication across space and time. The process of encoding converts information from a source into symbols for communication or storage. Decoding is the reverse process, converting code symbols back into a form that the recipient understands, such as English or Spanish.


  • https://en.wikipedia.org/wiki/Encoder_(digital) - or simply an encoder in digital electronics is a one-hot to binary converter. That is, if there are 2^n input lines, and at most only one of them will ever be high, the binary code of this 'hot' line is produced on the n-bit output lines. For example, a 4-to-2 simple encoder takes 4 input bits and produces 2 output bits. The illustrated gate level example implements the simple encoder defined by the truth table, but it must be understood that for all the non-explicitly defined input combinations (i.e., inputs containing 0, 2, 3, or 4 high bits) the outputs are treated as don't cares.
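A minimal software model of the 4-to-2 case described above (a sketch, assuming exactly one input line is ever high):

 def encode_4to2(lines):
     # lines is a one-hot list of 4 bits, e.g. [0, 0, 1, 0]; return the 2-bit code of the hot line.
     assert sum(lines) == 1, "simple encoder: exactly one input may be high"
     idx = lines.index(1)
     return (idx >> 1) & 1, idx & 1

 print(encode_4to2([0, 0, 1, 0]))  # (1, 0) -> input line 2 in binary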


  • https://en.wikipedia.org/wiki/Text_file - sometimes spelled textfile; an old alternative name is flatfile, is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating systems such as CP/M and DOS, where the operating system does not keep track of the file size in bytes, the end of a text file is denoted by placing one or more special characters, known as an end-of-file (EOF) marker, as padding after the last line in a text file. On modern operating systems such as Microsoft Windows and Unix-like systems, text files do not contain any special EOF character, because file systems on those operating systems keep track of the file size in bytes. Most text files need to have end-of-line delimiters, which are done in a few different ways depending on operating system. Some operating systems with record-oriented file systems may not use new line delimiters and will primarily store text files with lines separated as fixed or variable length records.

"Text file" refers to a type of container, while plain text refers to a type of content. At a generic level of description, there are two kinds of computer files: text files and binary files.


  • https://en.wikipedia.org/wiki/Plain_text - a loose term for data (e.g., file contents) that represent only characters of readable material but not its graphical representation nor other objects (floating-point numbers, images, etc.). It may also include a limited number of "whitespace" characters that affect simple arrangement of text, such as spaces, line breaks, or tabulation characters. Plain text is different from formatted text, where style information is included; from structured text, where structural parts of the document such as paragraphs, sections, and the like are identified; and from binary files in which some portions must be interpreted as binary objects (encoded integers, real numbers, images, etc.).


  • https://en.wikipedia.org/wiki/Binary_file - a computer file that is not a text file. The term "binary file" is often used as a term meaning "non-text file". Many binary file formats contain parts that can be interpreted as text; for example, some computer document files containing formatted text, such as older Microsoft Word document files, contain the text of the document but also contain formatting information in binary form.


  • https://en.wikipedia.org/wiki/Binary_decoder - a combinational logic circuit that converts binary information from the n coded inputs to a maximum of 2^n unique outputs. They are used in a wide variety of applications, including data multiplexing and data demultiplexing, seven segment displays, and memory address decoding. There are several types of binary decoders, but in all cases a decoder is an electronic circuit with multiple input and multiple output signals, which converts every unique combination of input states to a specific combination of output states. In addition to integer data inputs, some decoders also have one or more enable inputs.
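A corresponding sketch of the 2-to-4 case: n = 2 inputs select exactly one of 2^n = 4 outputs.

 def decode_2to4(a1, a0):
     # Two address bits select one of 2**2 = 4 output lines (active high).
     idx = (a1 << 1) | a0
     return [1 if i == idx else 0 for i in range(4)]

 print(decode_2to4(1, 0))  # [0, 0, 1, 0]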


  • https://en.wikipedia.org/wiki/Binary-to-text_encoding - is encoding of data in plain text. More precisely, it is an encoding of binary data in a sequence of printable characters. These encodings are necessary for transmission of data when the communication channel does not allow binary data (such as email or NNTP), or is not 8-bit clean. PGP documentation (RFC 4880) uses the term "ASCII armor" for binary-to-text encoding when referring to Base64.


Numbers

Binary


Hexadecimal


  • https://github.com/sharkdp/hexyl - a simple hex viewer for the terminal. It uses a colored output to distinguish different categories of bytes (NULL bytes, printable ASCII characters, ASCII whitespace characters, other ASCII characters and non-ASCII).


Gray code

  • https://en.wikipedia.org/wiki/Gray_code - after Frank Gray, or reflected binary code (RBC), also known just as reflected binary (RB), is an ordering of the binary numeral system such that two successive values differ in only one bit (binary digit). The reflected binary code was originally designed to prevent spurious output from electromechanical switches. Today, Gray codes are widely used to facilitate error correction in digital communications such as digital terrestrial television and some cable TV systems.
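The standard conversion is a single XOR for encoding and a short loop for decoding; a small Python illustration listing the 3-bit reflected binary sequence:

 def to_gray(n: int) -> int:
     return n ^ (n >> 1)          # adjacent values differ in exactly one bit

 def from_gray(g: int) -> int:
     n = 0
     while g:
         n ^= g
         g >>= 1
     return n

 print([format(to_gray(i), "03b") for i in range(8)])
 # ['000', '001', '011', '010', '110', '111', '101', '100']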

Character encoding





  • https://en.wikipedia.org/wiki/Od_(Unix) - a program for displaying ("dumping") data in various human-readable output formats. The name is an acronym for "octal dump" since it defaults to printing in the octal data format. It can also display output in a variety of other formats, including hexadecimal, decimal, and ASCII. It is useful for visualizing data that is not in a human-readable format, like the executable code of a program.


Telegraph

Morse


Baudot

  • https://en.wikipedia.org/wiki/Baudot_code - a character set predating EBCDIC and ASCII. It was the predecessor to the International Telegraph Alphabet No. 2 (ITA2), the teleprinter code in use until the advent of ASCII. Each character in the alphabet is represented by a series of bits, sent over a communication channel such as a telegraph wire or a radio signal. The symbol rate measurement is known as baud, and is derived from the same name.

BCD

EBCDIC

  • https://en.wikipedia.org/wiki/EBCDIC - an eight-bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems. EBCDIC descended from the code used with punched cards and the corresponding six bit binary-coded decimal code used with most of IBM's computer peripherals of the late 1950s and early 1960s. It is also supported on various non-IBM platforms such as Fujitsu-Siemens' BS2000/OSD, OS-IV, MSP, and MSP-EX, the SDS Sigma series, and Unisys VS/9 and MCP.
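As a quick illustration, Python ships EBCDIC codecs such as cp037 and cp500, which makes the difference from ASCII easy to see:

 print("A".encode("cp037"))              # b'\xc1' -- 'A' is 0xC1 in EBCDIC, 0x41 in ASCII
 print(b"\xc1\xc2\xc3".decode("cp037"))  # 'ABC'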

ASCII / ANSI

to move/merge with Typography

  • https://en.wikipedia.org/wiki/ASCII - abbreviated from American Standard Code for Information Interchange, is a character-encoding scheme. Originally based on the English alphabet, it encodes 128 specified characters into 7-bit binary integers as shown by the ASCII chart on the right. The characters encoded are numbers 0 to 9, lowercase letters a to z, uppercase letters A to Z, basic punctuation symbols, control codes that originated with Teletype machines, and a space. For example, lowercase j would become binary 1101010 and decimal 106.
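The 'j' example from the definition, checked with Python's ord/chr/format:

 print(ord("j"), format(ord("j"), "07b"))  # 106 1101010
 print(chr(65))                            # 'A' (decimal 65, hex 0x41)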


  • https://en.wikipedia.org/wiki/Extended_ASCII - eight-bit or larger character encodings that include the standard seven-bit ASCII characters as well as others. The use of the term is sometimes criticized, because it can be mistakenly interpreted that the ASCII standard has been updated to include more than 128 characters or that the term unambiguously identifies a single encoding, both of which are untrue.



  • https://en.wikipedia.org/wiki/PETSCII - also known as CBM ASCII, is the character set used in Commodore Business Machines (CBM)'s 8-bit home computers, starting with the PET from 1977 and including the C16, C64, C116, C128, CBM-II, Plus/4, and VIC-20.







Art


  • jp2a - a small utility that converts JPG images to ASCII. It's written in C and released under the GPL.



  • https://github.com/jtdaugherty/tart - a program that provides an image-editor-like interface to creating ASCII art - in the terminal, with your mouse! This program is written using my purely-functional terminal user interface toolkit, Brick.





  • REXPaint - powerful and user-friendly ASCII art editor. Use a wide variety of tools to create ANSI block/line art, roguelike mockups and maps, UI layouts, and for other game development needs. Originally an in-house dev tool used by Grid Sage Games for traditional roguelike development, this software has been made available to other developers and artists free of charge. While core functionality and tons of features already exist, occasional updates are known to happen. Unlock your retro potential; join thousands of other REXPaint users today


Fonts

CJKV

Unicode

  • https://en.wikipedia.org/wiki/Unicode - a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard is maintained by the Unicode Consortium, and as of June 2018 the most recent version, Unicode 11.0, contains a repertoire of 137,439 characters covering 146 modern and historic scripts, as well as multiple symbol sets and emoji. The character repertoire of the Unicode Standard is synchronized with ISO/IEC 10646, and both are code-for-code identical. The Unicode Standard consists of a set of code charts for visual reference, an encoding method and set of standard character encodings, a set of reference data files, and a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering, and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic and Hebrew, and left-to-right scripts).

Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including modern operating systems, XML, Java (and other programming languages), and the .NET Framework. Unicode can be implemented by different character encodings. The Unicode standard defines UTF-8, UTF-16, and UTF-32, and several other encodings are in use. The most commonly used encodings are UTF-8, UTF-16 and UCS-2, a precursor of UTF-16.


  • Unicode Consortium - enables people around the world to use computers in any language. Our freely-available specifications and data form the foundation for software internationalization in all major operating systems, search engines, applications, and the World Wide Web. An essential part of our mission is to educate and engage academic and scientific communities, and the general public.



  • ICU - International Components for Unicode - a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software. ICU is released under a nonrestrictive open source license that is suitable for use with both commercial software and with other open source or free software.




  • UAX #15: Unicode Normalization Forms - This annex describes normalization forms for Unicode text. When implementations keep strings in a normalized form, they can be assured that equivalent strings have a unique binary representation. This annex also provides examples, additional specifications regarding normalization of Unicode text, and information about conformance testing for Unicode normalization forms.
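A short illustration of why normalization matters: the same visible text can be stored composed or decomposed, and only compares equal after normalizing (Python's unicodedata implements the UAX #15 forms):

 import unicodedata

 composed   = "\u00e9"        # 'é' as a single code point
 decomposed = "e\u0301"       # 'e' + combining acute accent
 print(composed == decomposed)                                         # False
 print(unicodedata.normalize("NFC", decomposed) == composed)           # True
 print([hex(ord(c)) for c in unicodedata.normalize("NFD", composed)])  # ['0x65', '0x301']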




mirroring char in brackets: (‮‮test ( 








  • https://en.wikipedia.org/wiki/Mojibake - the garbled text that is the result of text being decoded using an unintended character encoding. The result is a systematic replacement of symbols with completely unrelated ones, often from a different writing system.
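A quick way to reproduce the effect is to decode UTF-8 bytes as Latin-1 (Python sketch):

 original = "naïve café"
 garbled = original.encode("utf-8").decode("latin-1")
 print(garbled)                                    # naÃ¯ve cafÃ©
 print(garbled.encode("latin-1").decode("utf-8"))  # reverses cleanly: naïve café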

base64


  • https://en.wikipedia.org/wiki/Base64 - a group of binary-to-text encoding schemes that represent binary data (more specifically, a sequence of 8-bit bytes) in sequences of 24 bits that can be represented by four 6-bit Base64 digits.
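The shell snippet below feeds its input through base64 repeatedly (20 rounds here); note that after enough iterations the first line of output becomes identical regardless of the input.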


$ function iter {
   # Pipe stdin through CMD N times; base64 wraps its output at 76 columns,
   # and the unquoted $STATE below deliberately collapses those line breaks to spaces
   # before re-encoding.
   N="$1"
   CMD="$2"
   STATE=$(cat)
   for i in $(seq 1 "$N"); do
      STATE=$(echo -n $STATE | $CMD)
   done
   echo "$STATE"
   }
$ echo "HN" | iter 20 base64 | head -1
Vm0wd2QyUXlVWGxWV0d4V1YwZDRWMVl3WkRSV01WbDNXa1JTVjAxV2JETlhhMUpUVmpBeFYySkVU
$ echo "Hello Hacker News" | iter 20 base64 | head -1
Vm0wd2QyUXlVWGxWV0d4V1YwZDRWMVl3WkRSV01WbDNXa1JTVjAxV2JETlhhMUpUVmpBeFYySkVU
$ echo "Bonjour Hacker News" | iter 20 base64 | head -1
Vm0wd2QyUXlVWGxWV0d4V1YwZDRWMVl3WkRSV01WbDNXa1JTVjAxV2JETlhhMUpUVmpBeFYySkVU

Checksum

  • https://en.wikipedia.org/wiki/Luhn_algorithm - also known as the "modulus 10" or "mod 10" algorithm, named after its creator, IBM scientist Hans Peter Luhn, is a simple checksum formula used to validate a variety of identification numbers, such as credit card numbers, IMEI numbers, National Provider Identifier numbers in the United States, Canadian Social Insurance Numbers, Israeli ID Numbers, South African ID Numbers, Greek Social Security Numbers (ΑΜΚΑ), and survey codes appearing on McDonald's, Taco Bell, and Tractor Supply Co. receipts. It is described in U.S. Patent No. 2,950,048, filed on January 6, 1954, and granted on August 23, 1960. The algorithm is in the public domain and is in wide use today. It is specified in ISO/IEC 7812-1. It is not intended to be a cryptographically secure hash function; it was designed to protect against accidental errors, not malicious attacks. Most credit cards and many government identification numbers use the algorithm as a simple method of distinguishing valid numbers from mistyped or otherwise incorrect numbers.
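A direct implementation of the mod-10 check described above (Python sketch):

 def luhn_valid(number: str) -> bool:
     digits = [int(d) for d in number if d.isdigit()]
     total = 0
     for i, d in enumerate(reversed(digits)):
         if i % 2 == 1:       # double every second digit from the right
             d *= 2
             if d > 9:
                 d -= 9
         total += d
     return total % 10 == 0

 print(luhn_valid("79927398713"))  # True  (a commonly cited Luhn test number)
 print(luhn_valid("79927398714"))  # False (last digit changed)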

Files

Magic number

FourCC

  • https://en.wikipedia.org/wiki/FourCC - literally "four-character code", a sequence of four bytes used to uniquely identify data formats. The concept originated in the OSType scheme used in the Macintosh system software and was adopted for the Amiga/Electronic Arts Interchange File Format and derivatives. The idea was later reused to identify compressed data types in QuickTime and DirectShow.
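Since a FourCC is just four bytes, it can be read straight from a file header or packed into a 32-bit integer; a small sketch using the RIFF tag that starts WAV and AVI files (the read_fourcc helper name is made up for illustration):

 import struct

 def read_fourcc(path: str) -> str:
     # Return the first four bytes of a file as text.
     with open(path, "rb") as f:
         return f.read(4).decode("ascii", errors="replace")

 code = struct.unpack("<I", b"RIFF")[0]     # FourCC as a little-endian 32-bit integer
 print(hex(code), struct.pack("<I", code))  # 0x46464952 b'RIFF'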

Serialization

See also JavaScript#JSON, Documents#Markup, HTML/CSS#Markup, Semantic web, Database




  • https://en.wikipedia.org/wiki/Marshalling_(computer_science) - or marshaling is the process of transforming the memory representation of an object to a data format suitable for storage or transmission, and it is typically used when data must be moved between different parts of a computer program or from one program to another. Marshalling is similar to serialization and is used to communicate with remote objects using an object, in this case a serialized object. It simplifies complex communication, using composite objects in order to communicate instead of primitives. The inverse of marshalling is called unmarshalling (or demarshalling, similar to deserialization).
  • https://en.wikipedia.org/wiki/Unmarshalling - Comparison with deserialization: An object that is serialized is in the form of a byte stream and it can eventually be converted back to a copy of the original object. Deserialization is the process of converting the byte stream data back to its original object type. An object that is marshalled, however, records the state of the original object and it contains the codebase (codebase here refers to a list of URLs where the object code can be loaded from, and not source code). Hence, in order to convert the object state and codebase(s), unmarshalling must be done.


  • https://en.wikipedia.org/wiki/Delimiter - a sequence of one or more characters used to specify the boundary between separate, independent regions in plain text or other data streams. An example of a delimiter is the comma character, which acts as a field delimiter in a sequence of comma-separated values. Another example of a delimiter is the time gap used to separate letters and words in the transmission of Morse code. Delimiters represent one of various means to specify boundaries in a data stream.
  • https://en.wikipedia.org/wiki/Delimiter-separated_values - store two-dimensional arrays of data by separating the values in each row with specific delimiter characters. Most database and spreadsheet programs are able to read or save data in a delimited format. A delimited text file is a text file used to store data, in which each line represents a single book, company, or other thing, and each line has fields separated by the delimiter. Compared to the kind of flat file that uses spaces to force every field to the same width, a delimited file has the advantage of allowing field values of any length.
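Python's csv module accepts an arbitrary single-character delimiter, so the same reader handles CSV, TSV, pipe-separated files, and so on:

 import csv, io

 data = "title|author|year\nHamlet|Shakespeare|1603\n"
 rows = list(csv.reader(io.StringIO(data), delimiter="|"))
 print(rows)  # [['title', 'author', 'year'], ['Hamlet', 'Shakespeare', '1603']]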





  • VisiData - a free, open-source tool that lets you quickly open, explore, summarize, and analyze datasets in your computer’s terminal. VisiData works with CSV files, Excel spreadsheets, SQL databases, and many other data sources.

S-expression

  • https://en.wikipedia.org/wiki/S-expression - sexprs or sexps (for "symbolic expression") are a notation for nested list (tree-structured) data, invented for and popularized by the programming language Lisp, which uses them for source code as well as data. In the usual parenthesized syntax of Lisp, an s-expression is classically defined as "an atom", or "an expression of the form (x . y) where x and y are s-expressions." The second, recursive part of the definition represents an ordered pair, which means that s-expressions are binary trees.
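A minimal reader for the list part of the syntax (atoms and nested parentheses only; no strings or dotted pairs) fits in a few lines:

 def parse_sexpr(text: str):
     tokens = text.replace("(", " ( ").replace(")", " ) ").split()

     def read(pos):
         if tokens[pos] == "(":
             items, pos = [], pos + 1
             while tokens[pos] != ")":
                 item, pos = read(pos)
                 items.append(item)
             return items, pos + 1       # skip the closing ')'
         return tokens[pos], pos + 1     # an atom

     expr, _ = read(0)
     return expr

 print(parse_sexpr("(+ 1 (* 2 3))"))  # ['+', '1', ['*', '2', '3']]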

M-Expression


Recfile

  • GNU Recutils - a set of tools and libraries to access human-editable, plain text databases called recfiles. The data is stored as a sequence of records, each record containing an arbitrary number of named fields.
  • recfile - Recfile is the file format used by GNU Recutils. It can be seen as a "vertical" counterpart to CSV.




CSV






TSV


ASN.1

  • https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One - ASN.1 is an interface description language for defining data structures that can be serialized and deserialized in a standard, cross-platform way. It's broadly used in telecommunications and computer networking, and especially in cryptography. Protocol developers define data structures in ASN.1 modules, which are generally a section of a broader standards document written in the ASN.1 language. Because the language is both human-readable and machine-readable, modules can be automatically turned into libraries that process their data structures, using an ASN.1 compiler. ASN.1 is similar in purpose and use to protocol buffers and Apache Thrift, which are also interface description languages for cross-platform data serialization. Like those languages, it has a schema (in ASN.1, called a "module"), and a set of encodings, typically type-length-value encodings. However, ASN.1, defined in 1984, predates them by many years. It also includes a wider variety of basic data types, some of which are obsolete, and has more options for extensibility. A single ASN.1 message can include data from multiple modules defined in multiple standards, even standards defined years apart.

X.690

  • https://en.wikipedia.org/wiki/X.690 - an ITU-T standard specifying several ASN.1 encoding formats:
    • Basic Encoding Rules (BER)
    • Canonical Encoding Rules (CER)
    • Distinguished Encoding Rules (DER)
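All three are tag-length-value encodings. As a rough sketch of DER specifically, an INTEGER is tag 0x02, a length byte, then the big-endian value; the helper below handles only non-negative integers and short-form lengths.

 def der_integer(n: int) -> bytes:
     # Reserve an extra byte so a leading 0x00 is kept when the high bit would be set.
     body = n.to_bytes(n.bit_length() // 8 + 1, "big")
     return bytes([0x02, len(body)]) + body

 print(der_integer(5).hex())    # 020105
 print(der_integer(300).hex())  # 0202012c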


JSON

  • JSON - JavaScript Object Notation, is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.







  • JSON Web Token (JWT) is a compact URL-safe means of representing claims to be transferred between two parties. The claims in a JWT are encoded as a JSON object that is digitally signed using JSON Web Signature (JWS). - IETF. [32]
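A JWT is three base64url-encoded parts joined by dots: header, claims, signature. A minimal HS256 sketch (the secret and claims below are made up for illustration):

 import base64, hashlib, hmac, json

 def b64url(data: bytes) -> bytes:
     return base64.urlsafe_b64encode(data).rstrip(b"=")

 header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
 claims = b64url(json.dumps({"sub": "1234567890", "name": "Jane Doe"}).encode())
 signing_input = header + b"." + claims
 signature = b64url(hmac.new(b"my-secret", signing_input, hashlib.sha256).digest())
 print((signing_input + b"." + signature).decode())  # header.claims.signature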




Extensions


  • JSON Schema - a vocabulary that allows you to annotate and validate JSON documents.
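A tiny example, assuming the third-party jsonschema package is installed; the schema below is invented for illustration and requires a string name plus an optional integer year.

 from jsonschema import validate

 schema = {
     "type": "object",
     "properties": {"name": {"type": "string"}, "year": {"type": "integer"}},
     "required": ["name"],
 }
 validate(instance={"name": "Hamlet", "year": 1603}, schema=schema)  # passes silently
 # validate(instance={"year": "1603"}, schema=schema) would raise ValidationError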


  • JSON-P or "JSON with padding" is a communication technique used in JavaScript programs which run in Web browsers. It provides a method to request data from a server in a different domain, something prohibited by typical web browsers because of the same-origin policy; JSON-P predates CORS.


  • JsonML (JSON Markup Language) is an application of the JSON (JavaScript Object Notation) format. The purpose of JsonML is to provide a compact format for transporting XML-based markup as JSON which allows it to be losslessly converted back to its original form. Native XML/XHTML doesn't sit well embedded in JavaScript. When XHTML is stored in script it must be properly encoded as an opaque string. JsonML allows easy manipulation of the markup in script before completely rehydrating back to the original form.


  • JSON-LD (JavaScript Object Notation for Linking Data) is a lightweight Linked Data format that gives your data context. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on the already successful JSON format and provides a way to help JSON data interoperate at Web-scale. If you are already familiar with JSON, writing JSON-LD is very easy. These properties make JSON-LD an ideal Linked Data interchange language for JavaScript environments, Web service, and unstructured databases such as CouchDB and MongoDB.
  • https://github.com/linkedobjects/linkedobjects - The Linked Objects Notation (LION) is a simple subset of JSON-LD. It aims to avoid most of the complexity, and enables getting started quickly, using a familiar notation. LION is compatible with JSON-LD and offers a full upgrade path.


  • json-stat.org is an attempt to define a JSON schema for statistical dissemination or at least some guidelines and good practices when dealing with stats in JSON.



  • JSON API is a JSON-based read/write hypermedia type designed to support a smart client that wishes to build a data-store of information.




  • Javascript Object Signing and Encryption - JavaScript Object Notation (JSON) is a text format for the serialization of structured data described in RFC 4627. The JSON format is often used for serializing and transmitting structured data over a network connection. With the increased usage of JSON in protocols in the IETF and elsewhere, there is now a desire to offer security services such as encryption, digital signatures, and message authentication code (MAC) algorithms that carry their data in JSON format.
  • JSON Web Key (JWK) is a JSON data structure that represents a set of public keys.



  • https://github.com/saulpw/jdot - Remove all the extraneous symbols from JSON, and it becomes a lot easier to read and write. Add comments and macros and it's almost pleasant. Some little ergonomics go a long way.

Learning

  • Getting Started with JSON - You send data in a JSON format between different parts of your system. API results are often returned in JSON format, for example. JSON is a lightweight format which makes for easy reading if you're even the least bit familiar with JavaScript.


Tools


  • Pjson - Like python -mjson.tool but with moar colors (and less conf)


  • jq is like sed for JSON data – you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.




  • https://github.com/tomnomnom/gron - Make JSON greppable! gron transforms JSON into discrete assignments to make it easier to grep for what you want and see the absolute 'path' to it. It eases the exploration of APIs that return large blobs of JSON but have terrible documentation. [37]





  • Jshon - parses, reads and creates JSON. It is designed to be as usable as possible from within the shell and replaces fragile adhoc parsers made from grep/sed/awk as well as heavyweight one-line parsers made from perl/python.




  • Data Protocols - the Open Knowledge Labs home of simple protocols and formats for working with open data. Our mission is both to make it easier to develop tools and services for working with data, and, to ensure greater interoperability between new and existing tools and services.

Web


Checking

to sort

  • JSON Schema Store - The goal of this API is to include schemas for all commonly known JSON file formats. To do that we encourage contributions in terms of new schemas, modifications and test files. SchemaStore.org is owned by the community, and we have a history of accepting most pull requests. Even if you're new to JSON Schemas, please submit new schemas anyway. We have many contributors that will help turn the schemas into perfection. [38]


YAML





Dotset


TOML

  • https://github.com/toml-lang/toml - TOML aims to be a minimal configuration file format that's easy to read due to obvious semantics. TOML is designed to map unambiguously to a hash table. TOML should be easy to parse into data structures in a wide variety of languages. [46]
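Since Python 3.11 the standard library can parse TOML directly (tomllib is read-only); a small sample document:

 import tomllib  # standard library in Python 3.11+

 doc = """
 title = "Example"
 [owner]
 name = "Tom"
 dob = 1979-05-27
 """
 print(tomllib.loads(doc))
 # {'title': 'Example', 'owner': {'name': 'Tom', 'dob': datetime.date(1979, 5, 27)}}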


HCL

  • https://github.com/hashicorp/hcl - a configuration language built by HashiCorp. The goal of HCL is to build a structured configuration language that is both human and machine friendly for use with command-line tools, but specifically targeted towards DevOps tools, servers, etc. HCL is also fully JSON compatible. That is, JSON can be used as completely valid input to a system expecting HCL. This helps make systems interoperable with other systems.



CSON

or


STON

Hjson

  • Hjson - a syntax extension to JSON. It's NOT a proposal to replace JSON or to incorporate it into the JSON spec itself. It's intended to be used like a user interface for humans, to read and edit before passing the JSON data to the machine. [48]

Mark

  • https://github.com/henry-luo/mark - a new unified notation for both object and markup data. The notation is a superset of what can be represented by JSON, HTML and XML, but overcomes many limitations of these popular data formats while still having a very clean syntax and simple data model. It has clean syntax with a fully-typed data model (like JSON or even better). It is generic and extensible (like XML or even better). It has built-in mixed content support (like HTML5 or even better). It supports high-order composition (like S-expressions or even better).

XDR

  • https://en.wikipedia.org/wiki/External_Data_Representation - XDR, is a standard data serialization format, for uses such as computer network protocols. It allows data to be transferred between different kinds of computer systems. Converting from the local representation to XDR is called encoding. Converting from XDR to the local representation is called decoding. XDR is implemented as a software library of functions which is portable between different operating systems and is also independent of the transport layer. XDR uses a base unit of 4 bytes, serialized in big-endian order; smaller data types still occupy four bytes each after encoding. Variable-length types such as string and opaque are padded to a total divisible by four bytes. Floating-point numbers are represented in IEEE 754 format.
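The 4-byte, big-endian framing is easy to reproduce with struct; a sketch, not a full XDR library:

 import struct

 def xdr_int(n: int) -> bytes:
     return struct.pack(">i", n)                       # one 4-byte big-endian unit

 def xdr_string(s: str) -> bytes:
     data = s.encode()
     pad = (-len(data)) % 4                            # pad to a multiple of 4 bytes
     return struct.pack(">I", len(data)) + data + b"\x00" * pad

 print(xdr_int(7).hex())        # 00000007
 print(xdr_string("hi").hex())  # 0000000268690000  (length 2, 'hi', two padding bytes)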


DSPL

  • DSPL - stands for Dataset Publishing Language. It is a representation format for both the metadata (information about the dataset, such as its name and provider, as well as the concepts it contains and displays) and actual data of datasets. The metadata is specified in XML, whereas the data are provided in CSV format.

CUE

  • CUE - an open source language, with a rich set of APIs and tooling, for defining, generating, and validating all kinds of data: configuration, APIs, database schemas, code, … you name it.


RON

  • RON - a format for distributed live data. RON’s primary mission is continuous data synchronization. A RON object may naturally have any number of replicas, which may synchronize in real-time or intermittently. JSON, protobuf, and many other formats implicitly assume serialization of separate state snapshots. RON has versioning and addressing metadata, so state and updates can be always pieced together. RON handles state and updates all the same: state is change and change is state.

HAL

  • HAL - a format you can use in your API that gives you a simple, consistent way of linking between resources. It has two variants, one in JSON and one in XML.
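
A sketch of the JSON variant's linking convention; the "order" resource and its links below are invented, but the reserved _links/href structure follows HAL:

```python
import json

# A hypothetical "order" resource as a HAL (JSON variant) document:
# ordinary state fields sit next to a reserved "_links" object whose keys
# are link relations and whose values carry an "href".
order = {
    "_links": {
        "self":     {"href": "/orders/523"},
        "customer": {"href": "/customers/42"},
    },
    "total": 30.00,
    "status": "shipped",
}

print(json.dumps(order, indent=2))
```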

Dhall

HStore


HOCON

  • https://github.com/lightbend/config/blob/master/HOCON.md - Human-Optimized Config Object Notation. This is an informal spec, but hopefully it's clear. The primary goal is: keep the semantics (tree structure; set of types; encoding/escaping) from JSON, but make it more convenient as a human-editable config file format.
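
A small sketch of the "JSON semantics, friendlier syntax" idea, assuming the third-party pyhocon parser for Python; the keys below are invented:

```python
# pip install pyhocon  -- assumes the third-party pyhocon parser
from pyhocon import ConfigFactory

text = """
app {
  name = demo            # unquoted values, no commas required
  port = 8080
}
service.port = ${app.port}   # substitution refers back into the tree
"""

conf = ConfigFactory.parse_string(text)
print(conf.get("service.port"))   # 8080
```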


Protocol Buffers


  • Buf - Introduction

Cap'n Proto

  • Cap'n Proto - an insanely fast data interchange format and capability-based RPC system. Think JSON, except binary. Or think Protocol Buffers, except faster. In fact, in benchmarks, Cap’n Proto is INFINITY TIMES faster than Protocol Buffers. This benchmark is, of course, unfair. It is only measuring the time to encode and decode a message in memory. Cap’n Proto gets a perfect score because there is no encoding/decoding step. The Cap’n Proto encoding is appropriate both as a data interchange format and an in-memory representation, so once your structure is built, you can simply write the bytes straight out to disk!


BSON

  • BSON - short for Binary JSON, is a binary-encoded serialization of JSON-like documents. Like JSON, BSON supports the embedding of documents and arrays within other documents and arrays. BSON also contains extensions that allow representation of data types that are not part of the JSON spec. For example, BSON has a Date type and a BinData type.
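
A sketch of those extra types in practice, assuming the bson module that ships with PyMongo; the document below is invented:

```python
# pip install pymongo  -- the bson module below ships with PyMongo
import datetime
import bson

doc = {"name": "sensor-1", "seen": datetime.datetime(2024, 1, 13, 18, 5)}

data = bson.encode(doc)      # bytes; the datetime becomes a native BSON Date
print(type(data), len(data))

roundtrip = bson.decode(data)
print(roundtrip["seen"])     # the datetime survives, unlike in plain JSON
```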

MessagePack

  • MessagePack - an efficient binary serialization format. It lets you exchange data among multiple languages like JSON. But it's faster and smaller. Small integers are encoded into a single byte, and typical short strings require only one extra byte in addition to the strings themselves. [51]
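
A sketch of the size claim, assuming the msgpack package for Python; the object below is arbitrary:

```python
# pip install msgpack
import json
import msgpack

obj = {"id": 5, "tags": ["a", "b"], "ok": True}

packed = msgpack.packb(obj)
print(len(packed), len(json.dumps(obj)))   # MessagePack is noticeably smaller

# The single-byte claim in miniature: a small integer packs to one byte.
print(msgpack.packb(5))                    # b'\x05'

print(msgpack.unpackb(packed))             # back to the original structure
```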

CBOR

  • CBOR - defined in RFC 7049 (since superseded by RFC 8949): “The Concise Binary Object Representation (CBOR) is a data format whose design goals include the possibility of extremely small code size, fairly small message size, and extensibility without the need for version negotiation.”
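
A sketch of a round trip, assuming the third-party cbor2 package for Python; the record below is invented:

```python
# pip install cbor2  -- assumes the third-party cbor2 package
import cbor2

record = {"t": 21.5, "unit": "C", "ok": True}

data = cbor2.dumps(record)   # compact, self-describing binary
print(len(data), data.hex())

print(cbor2.loads(data))     # {'t': 21.5, 'unit': 'C', 'ok': True}
```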

Kaitai Struct

  • Kaitai Struct - a declarative language used to describe various binary data structures, laid out in files or in memory: i.e. binary file formats, network stream packet formats, etc. The main idea is that a particular format is described in the Kaitai Struct language (a .ksy file) and can then be compiled with ksc into source files in one of the supported programming languages. These modules include generated code for a parser that can read the described data structure from a file or stream and give access to it in a nice, easy-to-comprehend API.

UCL

  • https://github.com/vstakhov/libucl - a configuration language called UCL (universal configuration language). If you are looking for the libucl API documentation you can find it at this page. UCL is heavily influenced by the nginx configuration as an example of a convenient configuration system. However, UCL is fully compatible with the JSON format and is able to parse JSON files. [52]

Amazon Ion

  • Amazon Ion - a richly-typed, self-describing, hierarchical data serialization format offering interchangeable binary and text representations. The text format (a superset of JSON) is easy to read and author, supporting rapid prototyping. The binary representation is efficient to store, transmit, and skip-scan parse. The rich type system provides unambiguous semantics for long-term preservation of business data which can survive multiple generations of software evolution. Ion was built to solve the rapid development, decoupling, and efficiency challenges faced every day while engineering large-scale, service-oriented architectures. Ion has been addressing these challenges within Amazon for nearly a decade, and we believe others will benefit as well. [53]

Apache Pulsar


der-ascii

  • https://github.com/google/der-ascii - a small human-editable language to emit DER (Distinguished Encoding Rules) or BER (Basic Encoding Rules) encodings of ASN.1 structures and malformed variants of them.

MQTT

  • MQTT - a machine-to-machine (M2M)/"Internet of Things" connectivity protocol. It was designed as an extremely lightweight publish/subscribe messaging transport. It is useful for connections with remote locations where a small code footprint is required and/or network bandwidth is at a premium. For example, it has been used in sensors communicating to a broker via satellite link, over occasional dial-up connections with healthcare providers, and in a range of home automation and small device scenarios. It is also well suited to mobile applications because of its small size, low power usage, minimised data packets, and efficient distribution of information to one or many receivers. A minimal publish/subscribe sketch follows this list.
    • https://en.wikipedia.org/wiki/MQTT - (MQ Telemetry Transport or Message Queuing Telemetry Transport) is an ISO standard (ISO/IEC PRF 20922) publish-subscribe-based messaging protocol. It works on top of the TCP/IP protocol. It is designed for connections with remote locations where a "small code footprint" is required or the network bandwidth is limited. The publish-subscribe messaging pattern requires a message broker.
    • MQTT Version 5.0 - a Client Server publish/subscribe messaging transport protocol. It is light weight, open, simple, and designed to be easy to implement. These characteristics make it ideal for use in many situations, including constrained environments such as for communication in Machine to Machine (M2M) and Internet of Things (IoT) contexts where a small code footprint is required and/or network bandwidth is at a premium.
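
A publish/subscribe sketch against the paho-mqtt 1.x client API for Python; the broker address and topic are placeholders, not part of any spec:

```python
# pip install "paho-mqtt<2"  -- written against the paho-mqtt 1.x callback API
import paho.mqtt.client as mqtt

TOPIC = "sensors/garden/temperature"   # placeholder topic

def on_connect(client, userdata, flags, rc):
    # Subscribe (and publish a test message) once the broker has accepted us.
    client.subscribe(TOPIC)
    client.publish(TOPIC, "21.5")

def on_message(client, userdata, msg):
    print(msg.topic, msg.payload.decode())

client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("broker.example.org", 1883)   # placeholder broker, default port

client.loop_forever()   # blocks, handling network traffic and callbacks
```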

recordio

riegeli

  • https://github.com/google/riegeli - a file format for storing a sequence of string records, typically serialized protocol buffers. It supports dense compression, fast decoding, seeking, detection and optional skipping of data corruption, filtering of proto message fields for even faster decoding, and parallel encoding.

gRPC

  • gRPC [56] - a modern, open source, high-performance RPC framework that can run in any environment. It can efficiently connect services in and across data centers with pluggable support for load balancing, tracing, health checking and authentication. It is also applicable in the last mile of distributed computing to connect devices, mobile applications and browsers to backend services.

smf

FlatBuffers

  • FlatBuffers - an efficient cross-platform serialization library for C++, C#, C, Go, Java, JavaScript, Lobster, Lua, TypeScript, PHP, Python, and Rust. It was originally created at Google for game development and other performance-critical applications. It is available as open source on GitHub under the Apache license, v2.

FlexBuffers

  • FlexBuffers - allows for a very compact encoding, combining automatic pooling of strings with automatic sizing of containers to their smallest possible representation (8/16/32/64 bits). Many values and offsets can be encoded in just 8 bits. While a schema-less representation is usually more bulky because of the need to be self-descriptive, FlexBuffers generates smaller binaries than regular FlatBuffers in many cases. FlexBuffers is still slower than regular FlatBuffers, though, so we recommend using it only when you need it. [57]

eno

  • eno - a modern plaintext data format and notation language with libraries, designed from the ground up for file-based content: simple, powerful and elegant [58]

Transit

  • https://github.com/cognitect/transit-format - a format and set of libraries for conveying values between applications written in different programming languages. This spec describes Transit in order to facilitate its implementation in a wide range of languages.

Scuttlebot

cereal

  • https://github.com/USCiLab/cereal - a header-only C++11 serialization library. cereal takes arbitrary data types and reversibly turns them into different representations, such as compact binary encodings, XML, or JSON. cereal was designed to be fast, light-weight, and easy to extend - it has no external dependencies and can be easily bundled with other code or used standalone.

Arrow

  • https://github.com/apache/arrow - a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication…
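
A sketch of building a columnar table and writing it with the Arrow IPC stream format, using pyarrow; the column names and values are invented:

```python
# pip install pyarrow
import pyarrow as pa

# A columnar, language-independent in-memory table.
table = pa.table({
    "id":    [1, 2, 3],
    "score": [0.5, 0.9, 0.1],
})
print(table.schema)

# Write it out with the Arrow IPC stream format and read it back.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)

reader = pa.ipc.open_stream(sink.getvalue())
print(reader.read_all().num_rows)   # 3
```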

YAS

  • https://github.com/niXman/yas - created as a replacement for boost.serialization because of the latter's insufficient serialization speed (benchmark 1, benchmark 2). YAS is a header-only library with no dependencies on third-party libraries or Boost, requires C++11 support, and its binary archives are endianness-independent.

Citations

See Organising#Reference.2Fcitation_management

Tools


  • https://github.com/sharkdp/binocle - a graphical tool to visualize binary data. It colorizes bytes according to different rules and renders them as pixels in a rectangular grid. This allows users to identify interesting parts in large files and to reveal image-like regions.


  • Resonant - a platform consisting of tools that work in concert to provide storage, analysis, and visualization solutions for your data. All Resonant components are fully open source under the Apache v2 license.


  • Girder - a free and open source web-based data management platform developed by Kitware as part of the Resonant data and analytics ecosystem. What does that mean? Girder is both a standalone application and a platform for building new web services.


  • DataLad - Providing a data portal and a versioning system for everyone, DataLad lets you have your data and control it too.


  • Kaitai Struct: declarative binary format parsing language - a declarative language used to describe binary data structures laid out in files or in memory; see the Kaitai Struct entry above.


  • Datasette - a tool for exploring and publishing data. It helps people take data of any shape or size and publish that as an interactive, explorable website and accompanying API. Datasette is aimed at data journalists, museum curators, archivists, local governments and anyone else who has data that they wish to share with the world. It is part of a wider ecosystem of tools dedicated to making working with structured data as productive as possible (see the sketch below).
  • https://github.com/simonw/datasette
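
A minimal launch sketch, assuming Datasette is installed and that mydata.db is a hypothetical SQLite file:

```python
# Assumes Datasette is installed (pip install datasette) and that
# "mydata.db" is a hypothetical SQLite file to publish.
import subprocess

# Serves the database as a browsable website plus JSON API on a local port.
subprocess.run(["datasette", "serve", "mydata.db", "--port", "8001"], check=True)
```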


  • Arroyo - a distributed stream processing engine written in Rust, designed to efficiently perform stateful computations on streams of data. Unlike traditional batch processing, streaming engines can operate on both bounded and unbounded sources, emitting results as soon as they are available. In short: Arroyo lets you ask complex questions of high-volume real-time data with subsecond results. [66]


GNU poke

  • GNU poke - an interactive, extensible editor for binary data. Not limited to editing basic entities such as bits and bytes, it provides a full-fledged procedural, interactive programming language designed to describe data structures and to operate on them. [67]

to sort

  • Kaitai Struct - A new way to develop parsers for binary structures, a declarative binary format parsing language





Machine learning

Services


Barcodes

  • https://en.wikipedia.org/wiki/Barcode - or bar code is a method of representing data in a visual, machine-readable form. Initially, barcodes represented data by varying the widths, spacings and sizes of parallel lines. These barcodes, now commonly referred to as linear or one-dimensional (1D), can be scanned by special optical scanners, called barcode readers, of which there are several types. Later, two-dimensional (2D) variants were developed, using rectangles, dots, hexagons and other patterns, called matrix codes or 2D barcodes, although they do not use bars as such. 2D barcodes can be read using purpose-built 2D optical scanners, which exist in a few different forms. 2D barcodes can also be read by a digital camera connected to a microcomputer running software that takes a photographic image of the barcode and analyzes the image to deconstruct and decode the 2D barcode. A mobile device with a built-in camera, such as a smartphone, can function as the latter type of 2D barcode reader using specialized application software (the same sort of mobile device could also read 1D barcodes, depending on the application software).
  • Let’s Move Beyond Open Data Portals | by abhi nemani | Civic Technology | Medium - We too often think about building that one great portal, one great experience, for everyone, but what we learn from the web is that actually we should be listening more than we are building, be understanding user needs and habits more than presuming them, and be going to where they are instead of asking them to come to us. We should do lots of different, small things — not just build one, big thing — and instead think about crafting the ideal experiences for the wonderfully diverse users you have: be it the mayor, city staff, or even your boss; be it the researcher, the entrepreneur, or the regular citizen; be it the person you hadn’t thought of until you meet them randomly at the library. Be it anyone.


  • The Magic of UPC/EAN Barcodes - Lately I’ve been working on a mobile app called Ethical Barcode that scans product barcodes and determines if the related brand is worth supporting or avoiding. Because of this I’ve been getting pretty familiar with the pitfalls and issues surrounding barcodes and product identification, and I thought I’d share some of the details that were tricky to find or figure out.
  • UPC/EAN Product Databases/APIs - Converting a barcode into useful, descriptive information is harder than you would expect. Or rather, it’s difficult to do with any reliability and accuracy. There are a few good APIs that I’ve been using or looking into that I’ll cover here. Each service offers different information, so you need to nail down your requirements before you read further. For my needs I really just need barcode -> company, but I’ll try to add details that fall outside of that where I can remember them.


  • ZBar - an open source software suite for reading bar codes from various sources, such as video streams, image files and raw intensity sensors. It supports many popular symbologies (types of bar codes) including EAN-13/UPC-A, UPC-E, EAN-8, Code 128, Code 39, Interleaved 2 of 5 and QR Code. The flexible, layered implementation facilitates bar code scanning and decoding for any application: use it stand-alone with the included GUI and command line programs, easily integrate a bar code scanning widget into your Qt, GTK+ or PyGTK GUI application, leverage one of the script or programming interfaces (Python, Perl, C++) ...all the way down to a streamlined C library suitable for embedded use. ZBar is licensed under the GNU LGPL 2.1 to enable development of both open source and commercial projects.
    • https://github.com/mchehab/zbar - an open source software suite for reading bar codes from various sources, including webcams. As its development stopped in 2012, I took on the task of keeping it updated with the V4L2 API (see the sketch below).
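
A minimal command-line sketch, assuming the zbarimg program from ZBar is installed and that ticket.png is a hypothetical image containing a barcode:

```python
# Assumes the ZBar command-line tools are installed (the zbarimg program)
# and that "ticket.png" is a hypothetical image containing a barcode.
import subprocess

result = subprocess.run(
    ["zbarimg", "--quiet", "ticket.png"],
    capture_output=True, text=True,
    check=False,   # zbarimg exits non-zero when no symbol is found
)

# zbarimg prints one "SYMBOLOGY:data" line per decoded symbol.
for line in result.stdout.splitlines():
    symbology, _, data = line.partition(":")
    print(symbology, data)
```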

QR Codes


  • libqrencode - a fast and compact library for encoding data in a QR Code symbol, a 2D symbology that can be scanned by handy terminals such as a mobile phone with a CCD. The capacity of a QR Code is up to about 7,000 digits or 4,000 characters, with high robustness.
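
A minimal sketch driving libqrencode's qrencode command-line tool; the output file name and payload below are placeholders:

```python
# Assumes libqrencode's "qrencode" command-line tool is installed;
# the output file name and payload below are placeholders.
import subprocess

subprocess.run(
    ["qrencode", "-o", "link.png", "-s", "6", "https://example.org/"],
    check=True,
)
# -o: output PNG file; -s: pixel size of one module (dot) in the symbol.
```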