Data
See also Computing#Data structures, Open data, Visualisation
General
data, noun
- facts and statistics collected together for reference or analysis: there is very little data available
- the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
- Philosophy: things known or assumed as facts, forming the basis of reasoning or calculation.
- https://en.wikipedia.org/wiki/Unstructured_data
- https://en.wikipedia.org/wiki/Semi-structured_data
- https://en.wikipedia.org/wiki/Data_model - structured data
- https://en.wikipedia.org/wiki/Data_structure
- Data, Information, Knowledge, and Wisdom - some abstractions
Articles
Learning
Management
See also Database, Visualisation, Maths#Software
Science
- A Taxonomy of Data Science - Both within the academy and within tech startups, we’ve been hearing some similar questions lately: Where can I find a good data scientist? What do I need to learn to become a data scientist? Or more succinctly: What is data science?
- School of Data works to empower civil society organizations, journalists and citizens with the skills they need to use data effectively in their efforts to create more equitable and effective societies.
- http://jeroenjanssens.com/2013/09/19/seven-command-line-tools-for-data-science.html
- http://cacm.acm.org/blogs/blog-cacm/155468-what-does-big-data-mean/fulltext
- http://www.evanmiller.org/statistical-formulas-for-programmers.html
- Kaggle - Service - From Big Data to Big Analytics.
Encoding
Telegraph
Morse
Baudot
- https://en.wikipedia.org/wiki/Baudot_code - a character set predating EBCDIC and ASCII. It was the predecessor to the International Telegraph Alphabet No. 2 (ITA2), the teleprinter code in use until the advent of ASCII. Each character in the alphabet is represented by a series of bits, sent over a communication channel such as a telegraph wire or a radio signal. The unit of symbol rate, the baud, is named after the same inventor, Émile Baudot.
BCD
EBCDIC
- https://en.wikipedia.org/wiki/EBCDIC - an eight-bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems. EBCDIC descended from the code used with punched cards and the corresponding six-bit binary-coded decimal code used with most of IBM's computer peripherals of the late 1950s and early 1960s. It is also supported on various non-IBM platforms such as Fujitsu-Siemens' BS2000/OSD, OS-IV, MSP, and MSP-EX, the SDS Sigma series, and Unisys VS/9 and MCP.
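As a rough illustration of how EBCDIC byte values differ from ASCII, the sketch below uses Python's built-in `cp037` codec. The choice of code page 037 is an assumption made for the example; the entry above does not name a specific EBCDIC code page, and there are many.

```python
# Compare ASCII and EBCDIC byte values for the same text.
# cp037 is only one of many EBCDIC code pages; it is picked for illustration.
text = "A0a"

ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")

for ch, a, e in zip(text, ascii_bytes, ebcdic_bytes):
    print(f"{ch!r}: ASCII 0x{a:02X}  EBCDIC 0x{e:02X}")

# Decoding with the matching codec restores the original text.
roundtrip = ebcdic_bytes.decode("cp037")
```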
ASCII / ANSI
- https://en.wikipedia.org/wiki/ASCII - abbreviated from American Standard Code for Information Interchange, is a character-encoding scheme. Originally based on the English alphabet, it encodes 128 specified characters into 7-bit binary integers. The characters encoded are digits 0 to 9, lowercase letters a to z, uppercase letters A to Z, basic punctuation symbols, control codes that originated with Teletype machines, and a space. For example, lowercase j would become binary 1101010 and decimal 106.
- https://en.wikipedia.org/wiki/Extended_ASCII - eight-bit or larger character encodings that include the standard seven-bit ASCII characters as well as others. The use of the term is sometimes criticized, because it can be mistakenly interpreted that the ASCII standard has been updated to include more than 128 characters or that the term unambiguously identifies a single encoding, both of which are untrue.
- https://en.wikipedia.org/wiki/Code_page_437
- The Evolution of Character Codes, 1874-1968 - Eric Fischer [8]
- https://ronaldduncan.wordpress.com/2009/10/31/text-file-formats-ascii-delimited-text-not-csv-or-tab-delimited-text/ [9]
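The ASCII example above (lowercase j as binary 1101010, decimal 106) and the point about "extended ASCII" not being a single encoding can be checked with a short sketch. The byte value 0xE9 and the two code pages compared are arbitrary choices for illustration.

```python
# 7-bit ASCII: each character maps to an integer in the range 0..127.
code = ord("j")                 # decimal code of lowercase j
bits = format(code, "07b")      # the same value as a 7-bit binary string

# "Extended ASCII" is not one encoding: the same byte above 127 decodes
# differently depending on the code page chosen.
byte = b"\xe9"
as_latin1 = byte.decode("latin-1")
as_cp437 = byte.decode("cp437")

print(code, bits, as_latin1, as_cp437)
```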
Art
Unicode
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
- Codepoint, n. the position of a character in an encoding system.
- Charbase - A visual unicode database
- http://en.wikipedia.org/wiki/List_of_Unicode_characters
- http://en.wikipedia.org/wiki/Unicode_control_characters
- http://www.charset.org/
- http://unicode.org/charts/
- http://sheet.shiar.nl/unicode
mirroring char in brackets: (test (
- http://unicodepowersymbol.com/we-did-it-how-a-comment-on-hackernews-lead-to-4-%C2%BD-new-unicode-characters/ [14]
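To make the codepoint definition above concrete, here is a minimal sketch that separates a character's codepoint from its byte encodings. The sample character is an arbitrary choice.

```python
import unicodedata

ch = "\u00e9"                            # e with acute accent
codepoint = ord(ch)                      # position in the Unicode code space
name = unicodedata.name(ch)              # official character name
utf8 = ch.encode("utf-8")                # byte sequence in UTF-8
utf16 = ch.encode("utf-16-be")           # byte sequence in UTF-16, big-endian

print(f"U+{codepoint:04X} {name} utf-8={utf8.hex()} utf-16={utf16.hex()}")
```

The codepoint stays the same whichever encoding is used; only the bytes on disk or on the wire change.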
Serialization
See also HTML/CSS#Markup, JavaScript#JSON
- https://en.wikipedia.org/wiki/Category:Data_serialization_formats
- http://www.drdobbs.com/web-development/after-xml-json-then-what/240151851
- https://en.wikipedia.org/wiki/Delimiter-separated_values - store two-dimensional arrays of data by separating the values in each row with specific delimiter characters. Most database and spreadsheet programs are able to read or save data in a delimited format. A delimited text file is a text file used to store data, in which each line represents a single book, company, or other thing, and each line has fields separated by the delimiter. Compared to the kind of flat file that uses spaces to force every field to the same width, a delimited file has the advantage of allowing field values of any length
- https://en.wikipedia.org/wiki/Runoff_(program)
- https://en.wikipedia.org/wiki/IBM_Generalized_Markup_Language
- https://en.wikipedia.org/wiki/TeX - 1978
- https://en.wikipedia.org/wiki/Scribe_(markup_language) - 1980
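The delimiter-separated layout described above (one record per line, fields split by a delimiter character) can be sketched with Python's standard `csv` module. The sample data and the choice of `|` and tab as delimiters are assumptions for illustration.

```python
import csv
import io

# Hypothetical sample data: one record per line, fields separated by "|".
raw = "title|author|year\nDune|Frank Herbert|1965\nEmma|Jane Austen|1815\n"

rows = list(csv.reader(io.StringIO(raw), delimiter="|"))
header, records = rows[0], rows[1:]

# Write the same records back out with a tab delimiter instead.
out = io.StringIO()
csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
tab_separated = out.getvalue()
```

Field values keep whatever length they have, which is the advantage over fixed-width flat files noted above.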
CSV
ML
- https://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language
- https://en.wikipedia.org/wiki/HTML
- https://en.wikipedia.org/wiki/Document_type_definition - a set of markup declarations that define a document type for an SGML-family markup language (SGML, XML, HTML).
A Document Type Definition (DTD) defines the legal building blocks of an XML document. It defines the document structure with a list of legal elements and attributes. A DTD can be declared inline inside an XML document, or as an external reference. XML uses a subset of SGML DTD.
As of 2009, newer XML namespace-aware schema languages (such as W3C XML Schema and ISO RELAX NG) have largely superseded DTDs. A namespace-aware version of DTDs is being developed as Part 9 of ISO DSDL. DTDs persist in applications that need special publishing characters, such as the XML and HTML Character Entity References, which derive from larger sets defined as part of the ISO SGML standard effort.
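A minimal sketch of a DTD declared inline inside an XML document, as described above. The `note` document is hypothetical. Python's standard `xml.etree.ElementTree` parser is non-validating, so it reads the declarations without checking the document against them; validation would need a separate tool.

```python
import xml.etree.ElementTree as ET

# A hypothetical document with its DTD declared inline (internal subset).
doc = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
  <!ATTLIST note lang CDATA #IMPLIED>
]>
<note lang="en">
  <to>Reader</to>
  <body>Hello</body>
</note>"""

# ElementTree parses the document but does not enforce the declarations.
root = ET.fromstring(doc)
children = [child.tag for child in root]
```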
XML
- http://www.w3.org/TR/NOTE-dcd - Document Content Description for XML
- w3c: XQuery - a query and functional programming language that queries and transforms collections of structured and unstructured data, usually in the form of XML, text and with vendor-specific extensions for other data formats (JSON, binary, etc.). The language is developed by the XML Query working group of the W3C. The work is closely coordinated with the development of XSLT by the XSL Working Group; the two groups share responsibility for XPath, which is a subset of XQuery.
- http://www.xembly.org/ Xembly is an Assembly-like imperative programming language for data manipulation in XML documents. It is a much simpler alternative to XSLT and XQuery. Read this blog post for a more detailed explanation: Xembly, an Assembly for XML.
- XML Linking Language (XLink)
- https://en.wikipedia.org/wiki/XLink - an XML markup language and W3C specification that provides methods for creating internal and external links within XML documents, and associating metadata with those links.
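XPath, mentioned above as shared between XQuery and XSLT, can be tried out in a small way with Python's standard library. This is only a sketch: `xml.etree.ElementTree` supports a limited subset of XPath and implements neither XQuery nor XSLT, and the catalogue document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue used only for this example.
doc = """<library>
  <book year="1965"><title>Dune</title></book>
  <book year="1815"><title>Emma</title></book>
  <magazine year="1965"><title>Analog</title></magazine>
</library>"""

root = ET.fromstring(doc)

# Path expression: titles of all book children of the root element.
all_book_titles = [t.text for t in root.findall("./book/title")]

# Predicate on an attribute: any descendant element published in 1965.
from_1965 = [e.tag for e in root.findall(".//*[@year='1965']")]
```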
Schema
XSL
- http://en.wikipedia.org/wiki/XSL - Extensible Stylesheet Language (XSL) is used to refer to a family of languages used to transform and render XML documents.
JSON
- JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.
- JSON Web Token (JWT) is a compact URL-safe means of representing claims to be transferred between two parties. The claims in a JWT are encoded as a JSON object that is digitally signed using JSON Web Signature (JWS). - IETF. [18]
- https://github.com/letsencrypt/acme-spec - over https
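The JWT entry above describes claims encoded as a JSON object and signed with JWS. Below is a minimal sketch of the compact form using HMAC-SHA256 (the `HS256` algorithm) and only the standard library. The function names, key, and claims are hypothetical, and the sketch skips checks a real implementation needs (header `alg` validation, expiry, audience), so it is not suitable for production use.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as used in the compact form."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring the stripped padding first."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims: dict, key: bytes) -> str:
    """Build header.payload.signature, signing the first two parts."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    mac = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(mac)}"


def verify_hs256(token: str, key: bytes) -> dict:
    """Recompute the MAC and return the claims if it matches."""
    signing_input, _, signature = token.rpartition(".")
    expected = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(signature)):
        raise ValueError("bad signature")
    payload = signing_input.split(".")[1]
    return json.loads(b64url_decode(payload))


# Hypothetical key and claims, for illustration only.
token = sign_hs256({"sub": "alice", "admin": False}, b"demo-secret")
claims = verify_hs256(token, b"demo-secret")
```

The token is URL-safe because each of its three dot-separated parts is base64url text without padding.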
Variations
- JSON-P or "JSON with padding" is a communication technique used in JavaScript programs which run in Web browsers. It provides a method to request data from a server in a different domain, something prohibited by typical web browsers because of the same origin policy. The technique predates CORS.
- JsonML (JSON Markup Language) is an application of the JSON (JavaScript Object Notation) format. The purpose of JsonML is to provide a compact format for transporting XML-based markup as JSON which allows it to be losslessly converted back to its original form. Native XML/XHTML doesn't sit well embedded in JavaScript. When XHTML is stored in script it must be properly encoded as an opaque string. JsonML allows easy manipulation of the markup in script before completely rehydrating back to the original form.
- JSON-LD (JavaScript Object Notation for Linking Data) is a lightweight Linked Data format that gives your data context. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on the already successful JSON format and provides a way to help JSON data interoperate at Web-scale. If you are already familiar with JSON, writing JSON-LD is very easy. These properties make JSON-LD an ideal Linked Data interchange language for JavaScript environments, Web service, and unstructured databases such as CouchDB and MongoDB.
- (rdf-json was shelved by the w3c)
- http://manu.sporny.org/2014/json-ld-origins-2/ [19]
- BSON, short for Binary JSON, is a binary-encoded serialization of JSON-like documents. Like JSON, BSON supports the embedding of documents and arrays within other documents and arrays. BSON also contains extensions that allow representation of data types that are not part of the JSON spec. For example, BSON has a Date type and a BinData type.
- json-stat.org is an attempt to define a JSON schema for statistical dissemination or at least some guidelines and good practices when dealing with stats in JSON.
- JSON API is a JSON-based read/write hypermedia-type designed to support a smart client who wishes to build a data-store of information.
- Superfeedr: XMPP-FTW XMPP and JSON for the Web
- What is rss.js? - Dave Winer; "what would JSONified RSS look like?"
- JavaScript Object Signing and Encryption - JavaScript Object Notation (JSON) is a text format for the serialization of structured data described in RFC 4627. The JSON format is often used for serializing and transmitting structured data over a network connection. With the increased usage of JSON in protocols in the IETF and elsewhere, there is now a desire to offer security services, which use encryption, digital signature, and message authentication code (MAC) algorithms, that carry their data in JSON format.
- JSON Web Key (JWK) is a JSON data structure that represents a set of public keys.
- json.human.js - Json Formatting for Human Beings [21]
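The JSON-P technique listed above works by having the server wrap ("pad") the JSON in a call to a function named by the client, so the response can be loaded through a script tag. A server-side sketch follows; the function name `jsonp_response` and the callback-name whitelist pattern are assumptions for this example, not part of any standard.

```python
import json
import re

# Accept only simple (possibly dotted) JavaScript identifiers as callback
# names, so that arbitrary script cannot be injected through the parameter.
_CALLBACK_RE = re.compile(r"[A-Za-z_$][\w$]*(?:\.[A-Za-z_$][\w$]*)*")


def jsonp_response(callback: str, data) -> str:
    """Wrap JSON data in a call to the client-supplied callback function."""
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError("unsafe callback name")
    return f"{callback}({json.dumps(data)});"


body = jsonp_response("handleUser", {"id": 7, "name": "Ada"})
print(body)
```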
Learning
- YouTube: Douglas Crockford: The JSON Saga
- Getting Started with JSON - You send data in a JSON format between different parts of your system. API results are often returned in JSON format, for example. JSON is a lightweight format which makes for easy reading if you're even the least bit familiar with JavaScript.
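As a small sketch of sending data in JSON format between parts of a system, the example below parses a JSON text, changes it, and serializes it again with Python's standard `json` module. The payload is a hypothetical API result.

```python
import json

# Hypothetical API result, serialized as JSON text.
payload = '{"user": "ada", "tags": ["admin", "dev"], "active": true, "score": null}'

record = json.loads(payload)          # JSON text -> Python objects
record["tags"].append("ops")

# Python objects -> JSON text; sort_keys gives a stable, diff-friendly output.
text = json.dumps(record, indent=2, sort_keys=True)
print(text)
```

JSON `true` and `null` arrive as Python `True` and `None`, and objects and arrays become dicts and lists.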
Tools
Generate
Checking
- JSONLint - The JSON Validator
Command-line
- https://github.com/benbernard/RecordStream - commandline tools for slicing and dicing JSON records
- Pjson - Like python -mjson.tool but with moar colors (and less conf)
- jq is like sed for JSON data – you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.
to sort
YAML
TOML
- https://github.com/toml-lang/toml - TOML aims to be a minimal configuration file format that's easy to read due to obvious semantics. TOML is designed to map unambiguously to a hash table. TOML should be easy to parse into data structures in a wide variety of languages.
Other
- Protocol Buffers are a way of encoding structured data in an efficient yet extensible format. Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats.
- http://kentonv.github.io/ - from a Protocol Buffers developer [26]
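One reason Protocol Buffers are compact is that integers on the wire use a variable-length "varint" encoding: seven payload bits per byte, with the high bit signalling that more bytes follow. The sketch below hand-rolls that encoding for illustration. It is not the official library API, and the helper names are hypothetical.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (low group first)."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)   # continuation bit: more bytes follow
        else:
            out.append(group)
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a single varint from the start of data."""
    result = 0
    for shift, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * shift)
        if not byte & 0x80:
            return result
    raise ValueError("truncated varint")


def encode_int_field(field_number: int, value: int) -> bytes:
    """Encode one varint field: tag (number << 3 | wire type 0), then value."""
    return encode_varint(field_number << 3) + encode_varint(value)


# Field number 1 carrying the integer 150 takes three bytes in total.
message = encode_int_field(1, 150)
```

Small values cost a single byte, and unknown field numbers can be skipped by a reader, which is where the extensibility comes from.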
Markup
- https://en.wikipedia.org/wiki/Lightweight_markup_language
- Lightweight Markup: Markdown, MediaWiki, Wikidot, LaTeX
HTML / CSS
See HTML/CSS
Markdown
See also Documents#Markdown
- Markdown is a text-to-HTML conversion tool for web writers. Markdown allows you to write using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid XHTML (or HTML).
- https://github.com/rhythmus/markdown-resources - A curated collection of Markdown resources: apps, dialects, parsers, people, …
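To give a feel for text-to-HTML conversion as described above, here is a toy converter for a very small subset of the syntax (ATX headings, paragraphs, `*emphasis*`, `**strong**`). It is a sketch under those assumptions only. It is not Gruber's implementation and handles none of the edge cases a real parser must.

```python
import html
import re


def inline(text: str) -> str:
    """Escape HTML, then convert **strong** and *emphasis* spans."""
    text = html.escape(text, quote=False)
    text = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)
    text = re.sub(r"\*(.+?)\*", r"<em>\1</em>", text)
    return text


def tiny_markdown(text: str) -> str:
    """Convert blank-line-separated blocks into headings or paragraphs."""
    blocks = []
    for block in re.split(r"\n\s*\n", text.strip()):
        heading = re.fullmatch(r"(#{1,6}) +(.+)", block)
        if heading:
            level = len(heading.group(1))
            blocks.append(f"<h{level}>{inline(heading.group(2))}</h{level}>")
        else:
            # Join wrapped lines of a paragraph into one line of text.
            blocks.append(f"<p>{inline(' '.join(block.split()))}</p>")
    return "\n".join(blocks)


page = tiny_markdown("# Title\n\nSome *plain* and **bold** text\nwith <tags>.")
print(page)
```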
Variations
- The Future of Markdown - 25 Oct 2012 [28]
- CommonMark - A strongly defined, highly compatible specification of Markdown [29]
- https://github.com/karlcow/markdown-testsuite - This project was initiated to provide a test suite for Markdown markup, and eventually create a specification from these test results. Part of the community has started a new endeavor which seems to be gaining traction as CommonMark.
- https://tools.ietf.org/html/rfc7763 - The text/markdown Media Type [30]
- https://tools.ietf.org/html/rfc7764 - Guidance on Markdown: Design Philosophies, Stability Strategies, and Select Registrations
Tools
- Markdown Here is a Google Chrome, Firefox, Safari, Opera, and Thunderbird extension that lets you write email in Markdown and render it before sending. It also supports syntax highlighting (just specify the language in a fenced code block).
- https://github.com/mwhite/resume - a simple Markdown resumé template, LaTeX header, and pre-processing script that can be used with Pandoc to generate professional-looking PDF and HTML output.
- Markx - Markdown editor for scientific writing. Batteries included.
- Markdown.css - CSS to make HTML markup look like plain-text markdown.
- PageDown is the JavaScript Markdown previewer used on Stack Overflow and the rest of the Stack Exchange network. It includes a Markdown-to-HTML converter and an in-page Markdown editor with live preview.
- http://blogs.plos.org/mfenner/2012/12/13/a-call-for-scholarly-markdown/
- http://indiewebcamp.com/2013/Citations_and_Scholarly_Markdown
- Lorem Markdownum - Inspired by the many excellent lorem ipsum generators, this simple webapp generates placeholder text. However, instead of generating plain text, this generator gives you structured text in the form of markdown. In order to do so, it uses Markov Chains and many heuristics.
- Markdown Extra is an extension to PHP Markdown implementing some features currently not available with the plain Markdown syntax. Markdown Extra is available as a separate parser class in PHP Markdown Lib.
- Markdown Extended is an extended implementation of John Gruber's original markdown syntax to write rich content from simple text files such as common .txt
- Markdeep is a technology for writing plain text documents that will look good in any web browser. It supports diagrams, common styling conventions, and equations as extensions of Markdown syntax. [34]
- Fountain is a simple markup syntax for writing, editing and sharing screenplays in plain, human-readable text. Fountain allows you to work on your screenplay anywhere, on any computer or tablet, using any software that edits text files.
- Why Markdown Is Not My Favourite Language - 30 July 2012 - recommends Creole [35]
Table of Contents
- http://doctoc.herokuapp.com/ - generate ToC
cat ~/projects/Dockerfile.vim/README.md | ./gh-md-toc -
* [Dockerfile.vim](#dockerfilevim)
* [Screenshot](#screenshot)
* [Installation](#installation)
* [OR using Pathogen:](#or-using-pathogen)
* [OR using Vundle:](#or-using-vundle)
* [License](#license)
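The kind of output shown above can be approximated in a few lines: scan for headings, derive an anchor from each title, and emit a list item. This is a sketch, not the code of either tool listed. The anchor rule (lowercase, drop punctuation, spaces to hyphens) is an assumption inferred from the sample output, and indenting by heading depth is a choice made for this example.

```python
import re

FENCE = "`" * 3  # marker that opens or closes a fenced code block


def github_anchor(title: str) -> str:
    """Approximate GitHub-style heading anchors (an assumption, not a spec)."""
    slug = title.strip().lower()
    slug = re.sub(r"[^\w\- ]", "", slug)   # drop punctuation such as "." and ":"
    return slug.replace(" ", "-")


def make_toc(markdown: str) -> str:
    """Return a nested bullet list linking to every ATX heading."""
    lines = []
    in_fence = False
    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence        # ignore "#" lines inside code fences
            continue
        match = None if in_fence else re.match(r"(#{1,6}) +(.*?)\s*#*$", line)
        if match:
            depth = len(match.group(1)) - 1
            title = match.group(2)
            lines.append(f"{'  ' * depth}* [{title}](#{github_anchor(title)})")
    return "\n".join(lines)


readme = "# Dockerfile.vim\n\n## Installation\n\n### OR using Pathogen:\n"
toc = make_toc(readme)
print(toc)
```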
WYSIWYM
Configuration
JSON
- http://beautifuldocs.com/
- https://github.com/scottstanfield/markdown-to-json
- https://github.com/sheremetyev/markdown-json
- Markdown Syntax for Object Notation (MSON) - This document is a proposal of Markdown syntax for JSON & JSON Schema.
Systems
WikiCreole
other
- Pandoc - a universal document converter. If you need to convert files from one markup format into another, pandoc is your swiss-army knife. Pandoc can convert documents in markdown, reStructuredText, textile, HTML, DocBook, LaTeX, or MediaWiki markup to: HTML formats (XHTML, HTML5, and HTML slide shows using Slidy, Slideous, S5, or DZSlides); word processor formats (Microsoft Word docx, OpenOffice/LibreOffice ODT, OpenDocument XML); ebooks (EPUB version 2 or 3, FictionBook2); documentation formats (DocBook, GNU TexInfo, Groff man pages); TeX formats (LaTeX, ConTeXt, LaTeX Beamer slides); PDF via LaTeX; and lightweight markup formats (Markdown, reStructuredText, AsciiDoc, MediaWiki markup, Emacs Org-Mode, Textile).
XMPP
Other
Mining
- http://www.dcc.ufmg.br/livros/miningalgorithms/DokuWiki/doku.php?id=contents
- http://www.slideshare.net/anilmlis/semantic-web-mining
- http://www.mops1.com/oracle/event/pasig/downloads/PASIG_2010-Simon.pdf
- http://www.public.asu.edu/~hdavulcu/CSE591_Semantic_Web_Mining.html
Scraping
See also HTTP#Scraping, Network#Saving
- ScraperWiki - Accurately extract tables from web pages and PDFs
- Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
- Portia is a tool for visually scraping web sites using Scrapy without any programming knowledge. Just annotate web pages with a point and click editor to indicate what data you want to extract, and portia will learn how to scrape similar pages from the site.
- https://github.com/cantino/huginn/ - like Yahoo Pipes
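Extracting structured data from pages, as the tools above do, comes down to walking the HTML and collecting the parts of interest. The sketch below pulls table cells out of a page fragment using only the standard library's `html.parser`. It does not use Scrapy's API, and the class name and sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])           # start a new row
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")       # start a new cell in the current row
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Hypothetical page fragment; a real scraper would fetch this over HTTP.
page = """
<table>
  <tr><th>Name</th><th>Count</th></tr>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""

extractor = TableExtractor()
extractor.feed(page)
rows = extractor.rows
```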
Tools
- http://openrefine.org/ - formerly Google Refine
- http://idcubed.org/open-platform/platform/
- https://wiki.idhypercubed.org/wiki/ProjectMustardSeed - A Framework for developing and deploying secure cloud applications to collect, compute on, and share personal data
- Recline Data Explorer and Library - A simple but powerful library for building data applications in pure Javascript and HTML.