Things and Stuff Wiki - an organically evolving personal wiki knowledge base with a totally on-the-fly taxonomy containing topic outlines, descriptions and breadcrumbs, with links to sites, systems, software, manuals, organisations, people, articles, guides, slides, papers, books, comments, screencasts, webcasts, scratchpads and more.
- 1 General
- 2 Encoding
- 3 Serialization
- 4 Markup
- 5 Maths
- 6 Mining
- 7 Scraping
- 8 Tools
- 9 Services
- facts and statistics collected together for reference or analysis, e.g. "there is very little data available"
- the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
- (Philosophy) things known or assumed as facts, making the basis of reasoning or calculation.
- https://en.wikipedia.org/wiki/Unstructured_data - information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents. In 1998, Merrill Lynch cited a rule of thumb that somewhere around 80-90% of all potentially usable business information may originate in unstructured form. This rule of thumb is not based on primary or any quantitative research, but nonetheless is accepted by some. IDC and EMC project that data will grow to 40 zettabytes by 2020, resulting in a 50-fold growth from the beginning of 2010. The Computer World magazine states that unstructured information might account for more than 70%–80% of all data in organizations.
- https://en.wikipedia.org/wiki/Data_model - or datamodel, is an abstract model that organizes elements of data and standardizes how they relate to one another and to properties of the real world entities. For instance, a data model may specify that the data element representing a car be composed of a number of other elements which, in turn, represent the color and size of the car and define its owner.
The term data model is used in two distinct but closely related senses. Sometimes it refers to an abstract formalization of the objects and relationships found in a particular application domain, for example the customers, products, and orders found in a manufacturing organization. At other times it refers to a set of concepts used in defining such formalizations: for example concepts such as entities, attributes, relations, or tables. So the "data model" of a banking application may be defined using the entity-relationship "data model". This article uses the term in both senses.
A data model is based on data, data relationships, data semantics and data constraints. It provides the details of the information to be stored, and is of primary use when the final product is the generation of computer software code for an application or the preparation of a functional specification to aid a computer software make-or-buy decision. A data model explicitly determines the structure of data. Data models are specified in a data modeling notation, which is often graphical in form. A data model can sometimes be referred to as a data structure, especially in the context of programming languages. Data models are often complemented by function models, especially in the context of enterprise models.
- https://en.wikipedia.org/wiki/Semi-structured_data - form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as self-describing structure. In semi-structured data, the entities belonging to the same class may have different attributes even though they are grouped together, and the attributes' order is not important. Semi-structured data are increasingly occurring since the advent of the Internet where full-text documents and databases are not the only forms of data anymore, and different applications need a medium for exchanging information. In object-oriented databases, one often finds semi-structured data.
- Data, Information, Knowledge, and Wisdom - some abstractions
- A Taxonomy of Data Science - Both within the academy and within tech startups, we’ve been hearing some similar questions lately: Where can I find a good data scientist? What do I need to learn to become a data scientist? Or more succinctly: What is data science?
- School of Data works to empower civil society organizations, journalists and citizens with the skills they need to use data effectively in their efforts to create more equitable and effective societies.
- Kaggle - Service - From Big Data to Big Analytics.
- https://en.wikipedia.org/wiki/Od_(Unix) - a program for displaying ("dumping") data in various human-readable output formats. The name is an acronym for "octal dump" since it defaults to printing in the octal data format. It can also display output in a variety of other formats, including hexadecimal, decimal, and ASCII. It is useful for visualizing data that is not in a human-readable format, like the executable code of a program.
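To make the "octal dump" idea above concrete, here is a minimal Python sketch that imitates the byte-wise octal view (similar in spirit to `od -b`). The function name `octal_dump` and the 16-bytes-per-line width are choices made for this illustration; this is not how od itself is implemented.

```python
def octal_dump(data: bytes, width: int = 16) -> list[str]:
    """Return lines with an octal offset followed by each byte in octal."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        octal_bytes = " ".join(f"{byte:03o}" for byte in chunk)
        lines.append(f"{offset:07o} {octal_bytes}")
    # The final line carries only the total length, written as an octal offset.
    lines.append(f"{len(data):07o}")
    return lines


for line in octal_dump(b"hello\n"):
    print(line)
# 0000000 150 145 154 154 157 012
# 0000006
```

Each byte is printed as three octal digits, so the newline at the end of the input shows up as 012 instead of breaking the line.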
- https://en.wikipedia.org/wiki/Baudot_code - a character set predating EBCDIC and ASCII. It was the predecessor to the International Telegraph Alphabet No. 2 (ITA2), the teleprinter code in use until the advent of ASCII. Each character in the alphabet is represented by a series of bits, sent over a communication channel such as a telegraph wire or a radio signal. The unit of symbol rate, the baud, takes its name from the same source (Émile Baudot).
- https://en.wikipedia.org/wiki/EBCDIC - an eight-bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems. EBCDIC descended from the code used with punched cards and the corresponding six bit binary-coded decimal code used with most of IBM's computer peripherals of the late 1950s and early 1960s. It is also supported on various non-IBM platforms such as Fujitsu-Siemens' BS2000/OSD, OS-IV, MSP, and MSP-EX, the SDS Sigma series, and Unisys VS/9 and MCP.
ASCII / ANSI
- https://en.wikipedia.org/wiki/ASCII - abbreviated from American Standard Code for Information Interchange, is a character-encoding scheme. Originally based on the English alphabet, it encodes 128 specified characters into 7-bit binary integers. The characters encoded are the digits 0 to 9, lowercase letters a to z, uppercase letters A to Z, basic punctuation symbols, control codes that originated with Teletype machines, and a space. For example, lowercase j would become binary 1101010 and decimal 106.
- https://en.wikipedia.org/wiki/Extended_ASCII - eight-bit or larger character encodings that include the standard seven-bit ASCII characters as well as others. The use of the term is sometimes criticized, because it can be mistakenly interpreted that the ASCII standard has been updated to include more than 128 characters or that the term unambiguously identifies a single encoding, both of which are untrue.
- https://ronaldduncan.wordpress.com/2009/10/31/text-file-formats-ascii-delimited-text-not-csv-or-tab-delimited-text/ 
- jp2a - a small utility that converts JPG images to ASCII. It's written in C and released under the GPL.
- http://caca.zoy.org/wiki/toilet - like figlet but w/ colours
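The 7-bit mapping described in the ASCII entry above can be checked with a few lines of Python. This is only a sketch; `ascii_bits` is a helper name invented for this example.

```python
def ascii_bits(char: str) -> str:
    """Return the 7-bit binary form of a single ASCII character."""
    code = ord(char)
    if code > 127:
        raise ValueError(f"{char!r} is outside the 128-character ASCII range")
    return format(code, "07b")


print(ord("j"), ascii_bits("j"))   # 106 1101010
print(ord("A"), ascii_bits("A"))   # 65 1000001
```

Characters beyond code 127 are rejected here because they belong to the "extended" eight-bit or larger encodings, not to ASCII itself.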
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
- Codepoint, n. the position of a character in an encoding system.
- Charbase - A visual unicode database
mirroring char in brackets: (test (
- http://unicodepowersymbol.com/we-did-it-how-a-comment-on-hackernews-lead-to-4-%C2%BD-new-unicode-characters/ 
- https://en.wikipedia.org/wiki/Delimiter-separated_values - store two-dimensional arrays of data by separating the values in each row with specific delimiter characters. Most database and spreadsheet programs are able to read or save data in a delimited format. A delimited text file is a text file used to store data, in which each line represents a single book, company, or other thing, and each line has fields separated by the delimiter. Compared to the kind of flat file that uses spaces to force every field to the same width, a delimited file has the advantage of allowing field values of any length.
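As a rough illustration of the delimited format described above, the following Python sketch reads pipe-separated text with the standard `csv` module. The `parse_dsv` helper, the choice of `|` as delimiter, and the sample rows are assumptions made for this example only.

```python
import csv
import io


def parse_dsv(text: str, delimiter: str = "|") -> list[list[str]]:
    """Split delimiter-separated text into rows of string fields."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [row for row in reader]


# Illustrative sample: one record per line, fields of differing lengths.
sample = (
    "title|author|year\n"
    "Dune|Frank Herbert|1965\n"
    "An Example Title That Is Much Longer|Example Author|2001\n"
)

rows = parse_dsv(sample)
for row in rows:
    print(row)
```

Unlike a fixed-width flat file, the second and third records need no padding even though their title fields differ greatly in length.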
- https://en.wikipedia.org/wiki/TeX - 1978
- https://en.wikipedia.org/wiki/Scribe_(markup_language) - 1980
- https://en.wikipedia.org/wiki/Document_type_definition - a set of markup declarations that define a document type for an SGML-family markup language (SGML, XML, HTML).
A Document Type Definition (DTD) defines the legal building blocks of an XML document. It defines the document structure with a list of legal elements and attributes. A DTD can be declared inline inside an XML document, or as an external reference. XML uses a subset of SGML DTD.
As of 2009, newer XML namespace-aware schema languages (such as W3C XML Schema and ISO RELAX NG) have largely superseded DTDs. A namespace-aware version of DTDs is being developed as Part 9 of ISO DSDL. DTDs persist in applications that need special publishing characters, such as the XML and HTML Character Entity References, which derive from larger sets defined as part of the ISO SGML standard effort.
- http://www.w3.org/TR/NOTE-dcd - Document Content Description for XML
- w3c: XQuery - a query and functional programming language that queries and transforms collections of structured and unstructured data, usually in the form of XML, text and with vendor-specific extensions for other data formats (JSON, binary, etc.). The language is developed by the XML Query working group of the W3C. The work is closely coordinated with the development of XSLT by the XSL Working Group; the two groups share responsibility for XPath, which is a subset of XQuery.
- http://www.xembly.org/ Xembly is an Assembly-like imperative programming language for data manipulation in XML documents. It is a much simpler alternative to XSLT and XQuery. Read this blog post for a more detailed explanation: Xembly, an Assembly for XML.
- XML Linking Language (XLink)
- https://en.wikipedia.org/wiki/XLink - an XML markup language and W3C specification that provides methods for creating internal and external links within XML documents, and associating metadata with those links.
- http://en.wikipedia.org/wiki/XSL - Extensible Stylesheet Language (XSL) is used to refer to a family of languages used to transform and render XML documents.
- JSON Web Token (JWT) is a compact URL-safe means of representing claims to be transferred between two parties. The claims in a JWT are encoded as a JSON object that is digitally signed using JSON Web Signature (JWS). - IETF. 
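To show what "compact, URL-safe and signed" means in practice for the JWT entry above, here is a minimal standard-library sketch of a token in JWS compact form. It assumes the HS256 algorithm (HMAC-SHA256 with a shared secret), which is only one of the algorithms JWS allows; the function names, the claims and the secret are made up for illustration. It is not a substitute for a vetted JWT library: it does not check the header's `alg` value or any expiry claim.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as used in the compact form."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_jwt(claims: dict, secret: bytes) -> str:
    """Build header.payload.signature using HMAC-SHA256 (HS256)."""
    header = {"alg": "HS256", "typ": "JWT"}
    segments = [
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    ]
    signing_input = ".".join(segments).encode("ascii")
    signature = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return ".".join(segments + [b64url(signature)])


def verify_jwt(token: str, secret: bytes) -> dict:
    """Recompute the signature and return the claims if it matches."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(expected.encode("ascii"), signature_b64.encode("ascii")):
        raise ValueError("signature mismatch")
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


# Hypothetical claims and secret, for illustration only.
token = sign_jwt({"sub": "alice"}, b"example-secret")
print(token)
print(verify_jwt(token, b"example-secret"))
```

The three dot-separated segments are the encoded header, the encoded claims and the signature; changing any character of the first two makes verification fail.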
- https://github.com/letsencrypt/acme-spec - over https
- BSON, short for Binary JSON, is a binary-encoded serialization of JSON-like documents. Like JSON, BSON supports the embedding of documents and arrays within other documents and arrays. BSON also contains extensions that allow representation of data types that are not part of the JSON spec. For example, BSON has a Date type and a BinData type.
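The binary layout behind the BSON entry above can be sketched by hand. The encoder below covers only a small subset chosen for this illustration (UTF-8 strings, booleans, 32-bit integers and embedded documents); real BSON has many more types, including the Date and BinData types mentioned above, and the function names here are invented for the example.

```python
import struct


def encode_cstring(text: str) -> bytes:
    """Element names are stored as null-terminated UTF-8."""
    return text.encode("utf-8") + b"\x00"


def encode_element(name: str, value) -> bytes:
    """Encode one key/value pair as type byte + name + value bytes."""
    if isinstance(value, str):
        raw = value.encode("utf-8") + b"\x00"
        # Type 0x02: int32 byte length (terminator included), then the string.
        return b"\x02" + encode_cstring(name) + struct.pack("<i", len(raw)) + raw
    if isinstance(value, bool):  # checked before int, since bool subclasses int
        return b"\x08" + encode_cstring(name) + (b"\x01" if value else b"\x00")
    if isinstance(value, int):
        # Type 0x10: 32-bit little-endian integer.
        return b"\x10" + encode_cstring(name) + struct.pack("<i", value)
    if isinstance(value, dict):
        # Type 0x03: embedded document.
        return b"\x03" + encode_cstring(name) + encode_document(value)
    raise TypeError(f"unsupported type in this sketch: {type(value).__name__}")


def encode_document(doc: dict) -> bytes:
    """A document is: int32 total size, the elements, then a null byte."""
    body = b"".join(encode_element(key, value) for key, value in doc.items())
    return struct.pack("<i", len(body) + 5) + body + b"\x00"


print(encode_document({"hello": "world"}))
# b'\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00'
```

The leading four bytes give the total document size (22 here), which lets a reader skip over an embedded document without parsing it.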
- json-stat.org is an attempt to define a JSON schema for statistical dissemination or at least some guidelines and good practices when dealing with stats in JSON.
- JSON API is a JSON-based read/write hypermedia-type designed to support a smart client who wishes to build a data-store of information.
- Superfeedr: XMPP-FTW XMPP and JSON for the Web
- What is rss.js? - Dave Winer; "what would JSONified RSS look like?"
- JSON Web Key (JWK) is a JSON data structure that represents a set of public keys.
- YouTube: Douglas Crockford: The JSON Saga
- JSONLint - The JSON Validator
- https://github.com/benbernard/RecordStream - commandline tools for slicing and dicing JSON records
- Pjson - Like python -mjson.tool but with moar colors (and less conf)
- jq is like sed for JSON data – you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.
- YouTube: JSON: Like a Boss
- https://github.com/toml-lang/toml - TOML aims to be a minimal configuration file format that's easy to read due to obvious semantics. TOML is designed to map unambiguously to a hash table. TOML should be easy to parse into data structures in a wide variety of languages.
- Protocol Buffers are a way of encoding structured data in an efficient yet extensible format. Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats.
- HAL is a format you can use in your API that gives you a simple way of linking. It has two variants, one in JSON and one in XML.
- MQTT - a machine-to-machine (M2M)/"Internet of Things" connectivity protocol. It was designed as an extremely lightweight publish/subscribe messaging transport. It is useful for connections with remote locations where a small code footprint is required and/or network bandwidth is at a premium. For example, it has been used in sensors communicating to a broker via satellite link, over occasional dial-up connections with healthcare providers, and in a range of home automation and small device scenarios. It is also ideal for mobile applications because of its small size, low power usage, minimised data packets, and efficient distribution of information to one or many receivers.
- https://en.wikipedia.org/wiki/MQTT - (MQ Telemetry Transport or Message Queuing Telemetry Transport) is an ISO standard (ISO/IEC PRF 20922) publish-subscribe-based messaging protocol. It works on top of the TCP/IP protocol. It is designed for connections with remote locations where a "small code footprint" is required or the network bandwidth is limited. The publish-subscribe messaging pattern requires a message broker.
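The publish-subscribe pattern and the role of the broker mentioned above can be pictured with a tiny in-memory sketch. This is not MQTT: there is no network, no wire format and no quality-of-service handling, and topics are matched by exact name only. The `Broker` class and the topic name are hypothetical.

```python
from collections import defaultdict
from typing import Callable

Callback = Callable[[str, bytes], None]


class Broker:
    """Routes each published message to the callbacks subscribed to its topic."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callback]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callback) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, payload: bytes) -> int:
        """Deliver the payload and return the number of receivers."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(topic, payload)
        return len(callbacks)


received = []
broker = Broker()
broker.subscribe("sensors/temperature", lambda topic, payload: received.append((topic, payload)))
delivered = broker.publish("sensors/temperature", b"21.5")
print(delivered, received)
```

Publishers and subscribers never reference each other directly; both only know the broker and a topic name, which is what lets one message reach one or many receivers.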
- ScraperWiki - Accurately extract tables from web pages and PDFs
- Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
- Portia is a tool for visually scraping web sites using Scrapy without any programming knowledge. Just annotate web pages with a point and click editor to indicate what data you want to extract, and portia will learn how to scrape similar pages from the site.
- https://github.com/cantino/huginn/ - like Yahoo Pipes
- http://openrefine.org/ - formerly Google Refine
- https://wiki.idhypercubed.org/wiki/ProjectMustardSeed - A Framework for developing and deploying secure cloud applications to collect, compute on, and share personal data
- DataLad - Providing a data portal and a versioning system for everyone, DataLad lets you have your data and control it too.